Week 4

21020324

Phạm Hoàng

Building Resilient Streaming Analytics Systems on Google Cloud

I. Introduction to Processing Streaming Data


- This module introduces stream processing and its role in big data architecture.
- Data flows through Pub/Sub, Dataflow, and BigQuery/Bigtable in the
processing pipeline.
- Streaming enables real-time information for dashboards and situational
awareness.
- Minimizing latency at each step is crucial for timely data processing.
- Challenges in streaming include volume, velocity, and variety of data.
- Google Cloud products like Pub/Sub, Dataflow, and BigQuery address these
challenges.
- The processing pipeline involves data ingestion, Pub/Sub for distribution,
Dataflow for aggregation and enrichment, and storage or machine learning
models.
- This approach is a common practice in Google Cloud for streaming analytics.
II. Serverless Messaging with Pub/Sub
- Pub/Sub is a fully managed data distribution and delivery system.
- It is a serverless service with client libraries available in multiple languages.
- Pub/Sub offers high availability, durability, and scalability.
- It was originally used by Google for distributing search engine data globally.
- Pub/Sub is HIPAA compliant, offers encryption, and stores messages in
multiple locations.
- The key components of Pub/Sub are topics and subscriptions (see the sketch after this list).
- Topics are created by publishers, while subscriptions are created by subscribers.
- Subscriptions connect to topics and allow receiving and processing of
messages.
- A topic can have multiple subscriptions, providing a message bus-like
architecture.
- Messages can be filtered by attributes, and messages that do not match a
subscription's filter are automatically acknowledged.
- Pub/Sub is highly flexible and can be used for various scenarios, including
notifications, data distribution, and decoupling applications.
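
A minimal sketch of the topic/subscription flow described above, using the google-cloud-pubsub Python client; the project, topic, and subscription names are hypothetical and error handling is omitted.

```python
from google.cloud import pubsub_v1

project_id = "my-project"            # assumed project
topic_id = "ride-events"             # hypothetical topic
subscription_id = "ride-events-sub"  # hypothetical subscription

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Publishers create topics; subscribers create subscriptions attached to a topic.
publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(request={"name": subscription_path, "topic": topic_path})

# Publish a message with an attribute; subscriptions can filter on attributes.
future = publisher.publish(topic_path, b"ride completed", ride_status="dropoff")
print("Published message:", future.result())

# Pull one message and acknowledge it.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 1})
for received in response.received_messages:
    print("Received:", received.message.data)
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": [received.ack_id]}
    )
```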
III. Dataflow Streaming Features
- Dataflow is a serverless service for processing both batch and streaming data,
offering scalability and low-latency processing for streaming pipelines.
- Challenges associated with processing streaming data include scalability, fault
tolerance, the choice of the streaming or repeated batch model, timing and
latency issues, and data aggregation challenges.
- Dataflow automatically handles challenges related to data aggregation in
streaming scenarios by using windowing to calculate averages and other
aggregates over specific time intervals.
- Message ordering and timestamps are crucial in streaming data processing, and
Dataflow allows for modification of timestamps, especially when there's
significant latency between data capture and message sending.
- Custom message IDs can be used for message deduplication in Pub/Sub, with
Dataflow maintaining a list of seen custom IDs to identify and discard
duplicates.
- Dataflow offers three types of windows for processing streaming data: fixed,
sliding, and session windows (illustrated in the sketch after this list).
- Fixed windows are defined by consistent, non-overlapping intervals such as hourly,
daily, or monthly. Sliding windows allow overlap and are defined by a window
length plus a period that controls how often a new window starts. Session windows
are defined by a minimum gap duration and are suited to capturing bursty
communication.
- Dataflow automatically tracks watermarks, its estimate of how far event time has
progressed given the expected lag in data arrival, so a window can be flushed once
the watermark passes the end of the window.
- Late data, which arrives after the window has closed, can be handled based on
user-defined policies, such as discarding or reprocessing.
- Triggers can be used to specify when to accumulate results in windows, and
Dataflow provides various types of triggers, including event time triggers and
processing time triggers.
- Accumulation modes (accumulating or discarding previously emitted results) can
be configured based on the use case and requirements of the streaming pipeline.
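
A minimal Apache Beam (Python SDK) sketch of the window types, triggers, and accumulation modes above; the in-memory input and all names are illustrative, and a real streaming pipeline would read from Pub/Sub instead.

```python
import apache_beam as beam
from apache_beam.transforms import combiners, window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

with beam.Pipeline() as p:
    events = (
        p
        | beam.Create([("sensor-1", 10.0), ("sensor-1", 12.0), ("sensor-2", 9.5)])
        # Attach event-time timestamps (a fixed epoch value, for illustration only).
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1700000000))
    )

    # The three window types: fixed, sliding, and session.
    fixed = events | "Fixed" >> beam.WindowInto(window.FixedWindows(60))
    sliding = events | "Sliding" >> beam.WindowInto(window.SlidingWindows(size=60, period=10))
    sessions = events | "Sessions" >> beam.WindowInto(window.Sessions(gap_size=600))

    # Event-time trigger: fire at the watermark, re-fire for late elements for up
    # to 2 minutes of allowed lateness, and accumulate previously fired panes.
    averages = (
        events
        | "WindowedWithTrigger" >> beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(late=AfterCount(1)),
            allowed_lateness=120,
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | combiners.Mean.PerKey()  # per-key average within each window
    )
```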
IV. High-Throughput BigQuery and Bigtable Streaming Features
- Streaming data is inserted into BigQuery using the Streaming Inserts method,
which writes records into a table one item at a time; new tables can be created
from a template table that defines the schema (see the sketch below).
- Streaming data is available within seconds, but considerations like data
availability, consistency, and latency should be kept in mind.
- Streaming quotas include daily and concurrent rate limits; disabling best-effort
de-duplication by leaving insert IDs unpopulated results in higher streaming
ingest quotas.
- Streaming inserts should be used when immediate data availability is a
requirement; batch loading incurs no ingest charge, so it is preferred for
non-real-time scenarios.
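
A minimal sketch of a streaming insert with the google-cloud-bigquery Python client; the table ID and row contents are hypothetical. Passing row IDs enables best-effort de-duplication, while passing None for each row disables it (and qualifies for the higher ingest quota mentioned above).

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.ride_data.rides"  # assumed dataset and table

rows = [
    {"ride_id": "a1", "fare": 12.50, "event_ts": "2024-01-01T00:00:00Z"},
    {"ride_id": "a2", "fare": 7.25, "event_ts": "2024-01-01T00:00:05Z"},
]

# insert_rows_json performs a streaming insert; row_ids supplies the insertId
# values used for best-effort de-duplication.
errors = client.insert_rows_json(table_id, rows, row_ids=[r["ride_id"] for r in rows])
if errors:
    print("Insert errors:", errors)
else:
    print("Rows streamed; they become queryable within seconds.")
```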
- Data Studio can be used to visualize data in BigQuery, and data exploration can
be initiated immediately after executing a query. Reports created in Data Studio
can be shared and should consider the data source's accessibility.
- Data Studio allows the creation of charts and tables, the ability to arrange
components, define dimensions and metrics, and give reports names. The view
toggle button lets users switch between editing and viewing modes.
- BigQuery BI Engine is an in-memory analysis service integrated with BigQuery
to provide sub-second query response times for business intelligence
applications. It eliminates the need to build and manage custom BI services and
OLAP cubes.
- BigQuery is a powerful tool for querying and analyzing data, but it may not
always meet requirements for low latency and high throughput.
- In such cases, Cloud Bigtable is introduced as a high-performance solution, and
this lesson covers designing schemas, row keys, and data ingestion for Bigtable.
- Bigtable is ideal for non-structured key-value data, but not suited for highly
structured, transactional, or small-volume data with SQL-like queries or joins.
- Bigtable is often used in real-time lookup capacity for applications requiring
high throughput.
- Bigtable stores data in Colossus, and its three levels of operation (data, tablet,
metadata) enable fast rebalancing and recovery.
- The design principles of Bigtable involve simplification and speed, leading to a
NoSQL database with only one index, the row key.
- Efficient design of the row key, column families, and data organization is
critical for performance.
- Row keys that reduce sorting and searching enable common queries to be
executed as scans.
- Reversing timestamps in row keys helps keep the most recent data at the
beginning of the table (sketched in the code below).
- Periodic compaction is performed to remove deleted rows and optimize data
organization.
- A well-designed schema can evenly distribute reads and writes across the
cluster, and Bigtable can redistribute tablets to balance the workload.
- Spotify's use case demonstrates how Bigtable can serve as an intermediary
hand-off between Dataflow jobs so data can be processed and stored more efficiently.
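
A minimal sketch of a reverse-timestamp row key and a single write, assuming the google-cloud-bigtable Python client; the project, instance, table, and column family names are hypothetical.

```python
import sys
import time

from google.cloud import bigtable


def reversed_ts_key(sensor_id: str, event_ts_seconds: int) -> bytes:
    # Subtracting the timestamp from a large constant makes newer events sort
    # first, so a scan from the start of the key range returns the latest data.
    return f"{sensor_id}#{sys.maxsize - event_ts_seconds}".encode()


client = bigtable.Client(project="my-project")                 # assumed project
table = client.instance("sensor-instance").table("readings")   # assumed instance/table

row = table.direct_row(reversed_ts_key("sensor-42", int(time.time())))
row.set_cell("measurements", "temp_c", b"21.7")  # column family "measurements"
row.commit()
```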
- Optimizing Bigtable performance is essential for maintaining low latency and
high throughput.
- Correctly designing the table schema is crucial for even distribution of reads
and writes across the Bigtable cluster to prevent overloading individual nodes.
- Adequate workload and data volume are required for Bigtable to learn access
patterns and optimize performance. Small data volumes and short testing
periods may not yield accurate results.
- Increasing the number of nodes in a Bigtable cluster can linearly improve
performance, and monitoring tools can help identify overloading.
- The choice of storage disks (SSD vs. HDD) can significantly affect
performance, with SSDs offering much higher read request capacities.
- Network issues can reduce throughput, and clients running in a different zone
from the Bigtable cluster can cause performance problems.
- Experimentation with actual workloads, row and cell sizes, and other factors is
necessary for fine-tuning performance.
- Key Visualizer is a tool that generates visual reports to analyze Bigtable usage
patterns based on row keys, helping optimize performance.
- Adequate data volume (at least 300 GB) and a sufficiently long testing period
are essential for accurate performance testing.
- Replication for Bigtable enhances data availability and durability, allowing for
manual or automatic failovers and the isolation of different workloads on
separate clusters.
- Performance estimates provided in documentation may serve as baselines but
should be validated through actual testing with real data and application code.
V. Advanced BigQuery Functionality and Performance
- BigQuery provides built-in functions, including window functions, to support
advanced analysis.
- Window functions are divided into three groups: standard aggregations,
navigation functions, and ranking and numbering functions.
- Standard aggregation functions such as COUNT, SUM, and AVG compute a result
over the rows in the window frame without collapsing them into a single row.
- Navigation functions, such as LEAD, return a value from a different row in the
window frame relative to the current row.
- Ranking and numbering functions include Rank, which assigns an ordinal rank
to each row within an ordered partition.
- These functions can be used to perform advanced analysis and gain insights
from your data.
- The WITH clause is a way to define named subqueries in BigQuery, making
complex queries more manageable and isolating SQL operations (the sketch below
combines WITH with the window functions above).
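
A sketch that combines a WITH clause with the three groups of window functions above, run through the BigQuery Python client; the project, dataset, table, and columns are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
WITH daily AS (                    -- named subquery defined with WITH
  SELECT seller_id, sale_date, SUM(amount) AS total
  FROM `my-project.shop.sales`
  GROUP BY seller_id, sale_date
)
SELECT
  seller_id,
  sale_date,
  total,
  SUM(total)  OVER (PARTITION BY seller_id ORDER BY sale_date) AS running_total,   -- standard aggregation
  LEAD(total) OVER (PARTITION BY seller_id ORDER BY sale_date) AS next_day_total,  -- navigation
  RANK()      OVER (PARTITION BY seller_id ORDER BY total DESC) AS sales_rank      -- ranking and numbering
FROM daily
"""
for row in client.query(sql).result():
    print(row.seller_id, row.sale_date, row.sales_rank)
```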
- BigQuery offers built-in geographic information system (GIS) features for
spatial data analysis.
- Examples demonstrate how to use these features to perform geospatial queries (see the sketch below).
- ST_GeogPoint is used to create geospatial objects from latitude and longitude
values, allowing you to work with spatial data.
- ST_DWithin helps determine the proximity of two geospatial objects within a
specified distance.
- Other functions like ST_MakeLine and ST_MakePolygon allow overlaying
information on maps to visualize relationships in the data.
- Functions like ST_Intersects, ST_Contains, and ST_CoveredBy help analyze
the relationships between geospatial objects, such as determining intersection,
containment, and coverage.
- The BigQuery Geo Viz application is available for rendering GIS data with
minimal configuration, making it easy to visualize geospatial data.
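
A sketch of ST_GeogPoint and ST_DWithin via the BigQuery Python client; note that ST_GeogPoint takes longitude before latitude. The table, its columns, and the reference point are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  station_id,
  ST_GeogPoint(longitude, latitude) AS location,
  -- True when the station lies within 1,000 meters of the reference point.
  ST_DWithin(ST_GeogPoint(longitude, latitude),
             ST_GeogPoint(-73.9855, 40.7580),
             1000) AS near_reference
FROM `my-project.transit.stations`
"""
for row in client.query(sql).result():
    print(row.station_id, row.near_reference)
```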
- Best practices include using Dataflow for processing, creating multiple tables,
and structuring data for efficient exploration.
- Performance optimization areas include input/output, shuffling, grouping,
materialization, and CPU cost.
- A cheat sheet advises selecting only necessary data, using WHERE clauses, and
applying ORDER BY as the last operation.
- Partitioning tables helps reduce costs and improve performance (a DDL sketch appears at the end of this section).
- Clustering enhances query performance, and automatic re-clustering is now
available.
- Intermediate table materialization reduces data processing and storage costs.
- Approximate functions improve performance at a slight accuracy cost.
- Cloud Monitoring can be used to monitor BigQuery performance.
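
A sketch of partitioning and clustering expressed as BigQuery DDL, submitted through the Python client; the project, dataset, tables, and columns are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE TABLE `my-project.shop.orders_optimized`
PARTITION BY DATE(order_ts)      -- queries filtering on order_ts scan fewer bytes
CLUSTER BY customer_id, sku      -- clustering speeds up filtered and aggregated queries
AS
SELECT * FROM `my-project.shop.orders`
""").result()
```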
