21020324
Phạm Hoàng
Building Resilient Streaming Analytics Systems on Google Cloud
I. Introduction to Processing Streaming Data
- This module introduces stream processing and its role in big data architecture.
- Data flows through Pub/Sub, Dataflow, and BigQuery or Bigtable in the processing pipeline.
- Streaming enables real-time information for dashboards and situational awareness.
- Minimizing latency at each step is crucial for timely data processing.
- Challenges in streaming include the volume, velocity, and variety of data.
- Google Cloud products such as Pub/Sub, Dataflow, and BigQuery address these challenges.
- The processing pipeline involves data ingestion, Pub/Sub for distribution, Dataflow for aggregation and enrichment, and finally storage or machine learning models.
- This approach is a common pattern for streaming analytics on Google Cloud.
II. Serverless Messaging with Pub/Sub
- Pub/Sub is a fully managed data distribution and delivery system.
- It is a serverless service with client libraries available in multiple languages.
- Pub/Sub offers high availability, durability, and scalability.
- It was originally used by Google for distributing search engine data globally.
- Pub/Sub is HIPAA compliant, offers encryption, and stores messages in multiple locations.
- The key components of Pub/Sub are topics and subscriptions.
- Topics are created by publishers, while subscriptions are created by subscribers.
- Subscriptions connect to topics and allow messages to be received and processed.
- A topic can have multiple subscriptions, providing a message bus-like architecture.
- Messages can be filtered by attributes, and messages that do not match the filter are automatically acknowledged.
- Pub/Sub is highly flexible and can be used for many scenarios, including notifications, data distribution, and decoupling applications (a minimal publish/subscribe sketch follows this list).
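To make topics, subscriptions, and attribute-based filtering concrete, here is a minimal Python sketch using the google-cloud-pubsub client library. The project, topic, and subscription names are placeholders and the JSON payload is only an assumed example; attribute filters themselves are configured on the subscription and are not shown here.

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"               # placeholder project
TOPIC_ID = "sensor-events"              # placeholder topic
SUBSCRIPTION_ID = "sensor-events-sub"   # placeholder subscription

# Publisher side: publish a message with an attribute that a subscription
# filter could match on (non-matching messages are auto-acknowledged).
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
future = publisher.publish(
    topic_path,
    data=b'{"sensor_id": "s-42", "temperature": 21.7}',  # payload must be bytes
    region="us-central1",                                # message attribute
)
print("Published message ID:", future.result())

# Subscriber side: pull messages from a subscription attached to the topic.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message):
    print("Received:", message.data, dict(message.attributes))
    message.ack()                       # acknowledge so Pub/Sub stops redelivering

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=30)   # listen for 30 seconds, then stop
    except TimeoutError:
        streaming_pull.cancel()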
III. Dataflow Streaming Features
- Dataflow is a serverless service for processing both batch and streaming data, offering scalability and low-latency processing for streaming pipelines.
- Challenges in processing streaming data include scalability, fault tolerance, the choice between a streaming model and repeated batch jobs, timing and latency issues, and data aggregation.
- Dataflow handles aggregation in streaming scenarios by using windowing to calculate averages and other aggregates over specific time intervals.
- Message ordering and timestamps are crucial in streaming data processing; Dataflow allows timestamps to be modified, which is useful when there is significant latency between data capture and message publishing.
- Custom message IDs can be used for deduplication of Pub/Sub messages: Dataflow maintains a list of recently seen custom IDs and discards duplicates.
- Dataflow offers three types of windows for streaming data: fixed, sliding, and session windows.
- Fixed windows are consistent, non-overlapping intervals such as hourly, daily, or monthly. Sliding windows can overlap and are defined by a window duration plus the period at which new windows start. Session windows are defined by a minimum gap duration and are suited to capturing bursty communication.
- Dataflow automatically tracks watermarks, which represent the lag between event time and data arrival, so a window can be flushed once the watermark has passed its end.
- Late data, which arrives after the window has closed, can be handled according to user-defined policies such as discarding or reprocessing it.
- Triggers specify when results are emitted for a window, and Dataflow provides several types, including event time triggers and processing time triggers.
- Accumulation modes determine whether previously emitted results are accumulated or discarded when a trigger fires again, and are configured based on the requirements of the streaming pipeline (the pipeline sketch below puts these pieces together).
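The windowing, trigger, late-data, and accumulation concepts above map directly onto the Apache Beam SDK that Dataflow runs. The following is a hedged sketch rather than the course's reference pipeline: the topic, BigQuery table, payload format, and the one-minute window with five minutes of allowed lateness are all assumptions.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window, trigger

def parse(msg_bytes):
    # Assumed payload: {"sensor_id": "s-42", "temperature": 21.7}
    record = json.loads(msg_bytes.decode("utf-8"))
    return record["sensor_id"], float(record["temperature"])

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
              topic="projects/my-project/topics/sensor-events")
        | "Parse" >> beam.Map(parse)
        | "WindowInto" >> beam.WindowInto(
              window.FixedWindows(60),                    # 1-minute fixed windows
              trigger=trigger.AfterWatermark(             # fire when the watermark passes
                  late=trigger.AfterCount(1)),            # re-fire for each late element
              allowed_lateness=300,                       # accept data up to 5 minutes late
              accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "MeanPerSensor" >> beam.combiners.Mean.PerKey()
        | "ToRow" >> beam.Map(
              lambda kv: {"sensor_id": kv[0], "avg_temperature": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
              "my-project:sensors.temperature_by_minute",
              schema="sensor_id:STRING,avg_temperature:FLOAT",
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
              create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
    # Note: with ACCUMULATING mode, late firings append updated rows to the table;
    # a production pipeline would deduplicate downstream or use DISCARDING mode.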
- Streaming data is inserted into BigQuery using the streaming inserts method, which appends one item at a time to a table; new tables can be created from a template table that defines the schema.
- Streamed data is available for query within seconds, but data availability, consistency, and latency should be kept in mind.
- Streaming quotas have daily and concurrent rate limits; disabling best-effort de-duplication by not populating insert IDs allows higher streaming ingest quotas.
- Streaming should be used when immediate data availability is a requirement; batch loading is not charged, so it is preferred for non-real-time scenarios.
- Data Studio can be used to visualize data in BigQuery, and data exploration can begin immediately after executing a query. Reports created in Data Studio can be shared, and the accessibility of the underlying data source should be considered.
- Data Studio allows charts and tables to be created, components arranged, dimensions and metrics defined, and reports named. The view toggle button switches between editing and viewing modes.
- BigQuery BI Engine is an in-memory analysis service integrated with BigQuery that provides sub-second query response times for business intelligence applications, eliminating the need to build and manage custom BI services and OLAP cubes.
- BigQuery is a powerful tool for querying and analyzing data, but it may not always meet requirements for very low latency and high throughput.
- In such cases Cloud Bigtable is a high-performance alternative; this lesson covers designing schemas, row keys, and data ingestion for Bigtable.
- Bigtable is ideal for unstructured key-value data, but it is not suited to highly structured or transactional data, small data volumes, or workloads that need SQL-like queries and joins.
- Bigtable is often used as a real-time lookup store for applications requiring high throughput.
- Bigtable stores data in Colossus, and its three levels of operation (data, tablet, metadata) enable fast rebalancing and recovery.
- Bigtable's design principles favor simplification and speed, resulting in a NoSQL database with only one index, the row key.
- Efficient design of the row key, column families, and data organization is critical for performance.
- Row keys that reduce sorting and searching allow common queries to be executed as scans (a small write-and-scan sketch follows this list).
- Reversing timestamps in row keys keeps the most recent data at the beginning of the table.
- Periodic compaction removes deleted rows and optimizes data organization.
- A well-designed schema distributes reads and writes evenly across the cluster, and Bigtable can redistribute tablets to balance the workload.
- Spotify's use case demonstrates using Bigtable for data remediation between Dataflow jobs, processing and storing data more efficiently.
- Optimizing Bigtable performance is essential for maintaining low latency and high throughput.
- Correct table schema design is crucial for evenly distributing reads and writes across the Bigtable cluster and preventing individual nodes from being overloaded.
- Adequate workload and data volume are required for Bigtable to learn access patterns and optimize itself; small data volumes and short testing periods may not yield accurate results.
- Increasing the number of nodes in a Bigtable cluster improves performance roughly linearly, and monitoring tools can help identify overloading.
- The choice of storage (SSD vs. HDD) significantly affects performance, with SSDs supporting much higher read request rates.
- Network issues can reduce throughput, and clients in different zones can cause performance problems.
- Experimentation with actual workloads, row and cell sizes, and other factors is necessary for fine-tuning performance.
- Key Visualizer generates visual reports of Bigtable usage patterns based on row keys, helping to optimize performance.
- Adequate data volume (at least 300 GB) and a sufficiently long testing period are essential for accurate performance testing.
- Replication enhances Bigtable's availability and durability, allows manual or automatic failover, and lets different workloads be isolated on separate clusters.
- Performance estimates in the documentation serve as baselines but should be validated through testing with real data and application code.
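To make the row key discussion concrete, here is a minimal write-and-scan sketch using the google-cloud-bigtable Python client. The instance, table, column family, and the sensor-id-plus-reversed-timestamp key format are illustrative assumptions, not a prescribed schema.

import datetime
import sys
from google.cloud import bigtable

client = bigtable.Client(project="my-project")   # placeholder project
instance = client.instance("sensor-instance")    # placeholder instance
table = instance.table("sensor_readings")        # placeholder table

# Row key: entity id plus a reversed timestamp, so the newest reading for a
# sensor sorts first and the common query becomes a short prefix scan.
sensor_id = "s-42"
now_ms = int(datetime.datetime.now(datetime.timezone.utc).timestamp() * 1000)
row_key = f"{sensor_id}#{sys.maxsize - now_ms}".encode("utf-8")

# Write one cell into an assumed column family named "readings".
row = table.direct_row(row_key)
row.set_cell("readings", "temperature", b"21.7")
row.commit()

# Read the latest readings for this sensor with a prefix scan over the row key.
prefix = f"{sensor_id}#".encode("utf-8")
for result in table.read_rows(start_key=prefix, end_key=prefix + b"\xff", limit=10):
    cell = result.cells["readings"][b"temperature"][0]
    print(result.row_key, cell.value)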
IV. Advanced BigQuery Functionality and Performance
- BigQuery provides built-in functions, including window functions, to support advanced analysis.
- Window functions fall into three groups: standard aggregations, navigation functions, and ranking and numbering functions.
- Standard aggregation functions such as COUNT let you calculate results rapidly over a window.
- Navigation functions such as LEAD compute value expressions over rows other than the current row in the window frame.
- Ranking and numbering functions such as RANK assign an ordinal rank to each row within an ordered partition.
- These functions can be used to perform advanced analysis and gain insights from your data.
- The WITH clause defines named subqueries in BigQuery, making complex queries more manageable and isolating SQL operations (see the query sketch after this list).
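Here is a small example of a named subquery combined with ranking and navigation window functions, run through the google-cloud-bigquery Python client. The project, dataset, table, and column names are invented for illustration.

from google.cloud import bigquery

client = bigquery.Client()   # uses default credentials and project

sql = """
WITH latest AS (                               -- named subquery via the WITH clause
  SELECT sensor_id, avg_temperature
  FROM `my-project.sensors.temperature_by_minute`
)
SELECT
  sensor_id,
  avg_temperature,
  RANK() OVER (ORDER BY avg_temperature DESC) AS temp_rank,   -- ranking function
  LEAD(avg_temperature)
    OVER (ORDER BY avg_temperature DESC) AS next_lower        -- navigation function
FROM latest
"""

for row in client.query(sql).result():
    print(row.sensor_id, row.avg_temperature, row.temp_rank, row.next_lower)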
- BigQuery offers built-in geographic information system (GIS) features for spatial data analysis, and examples demonstrate how to use them in geospatial queries.
- ST_GeogPoint creates geospatial objects from longitude and latitude values, allowing you to work with spatial data.
- ST_DWithin determines whether two geospatial objects lie within a specified distance of each other.
- Other functions such as ST_MakeLine and ST_MakePolygon allow information to be overlaid on maps to visualize relationships in the data.
- Functions such as ST_Intersects, ST_Contains, and ST_CoveredBy analyze the relationships between geospatial objects: intersection, containment, and coverage.
- The BigQuery Geo Viz application renders GIS data with minimal configuration, making it easy to visualize geospatial query results (a proximity query sketch follows this list).
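As an illustration of these functions, here is a proximity query run through the Python client. The dataset, coordinates, and the 1,000 meter radius are assumptions made up for the example.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table with sensor_id, longitude, and latitude columns.
sql = """
SELECT
  sensor_id,
  ST_GEOGPOINT(longitude, latitude) AS location   -- note: longitude comes first
FROM `my-project.sensors.sensor_locations`
WHERE ST_DWITHIN(
  ST_GEOGPOINT(longitude, latitude),
  ST_GEOGPOINT(-122.0841, 37.4220),               -- reference point
  1000)                                           -- within 1,000 meters
"""

for row in client.query(sql).result():
    print(row.sensor_id, row.location)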
- Best practices include using Dataflow for processing, creating multiple tables, and structuring data for efficient exploration.
- Performance optimization areas include input/output, shuffling, grouping, materialization, and CPU cost.
- A cheat sheet advises selecting only the necessary data, filtering with WHERE clauses, and applying ORDER BY as the last operation.
- Partitioning tables helps reduce costs and improve performance (a DDL sketch follows this list).
- Clustering enhances query performance, and automatic re-clustering is now available.
- Materializing intermediate tables reduces data processing and storage costs.
- Approximate functions improve performance at a slight cost in accuracy.
- Cloud Monitoring can be used to monitor BigQuery performance.
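To show what partitioning, clustering, and approximate aggregation look like in practice, here is a hedged sketch using the Python client; the table name, partition column, and clustering key are assumptions for illustration.

from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned by the reading date and clustered by sensor_id
# (hypothetical schema; adjust names to your own dataset).
client.query("""
CREATE TABLE IF NOT EXISTS `my-project.sensors.readings_partitioned`
(
  sensor_id STRING,
  reading_time TIMESTAMP,
  temperature FLOAT64
)
PARTITION BY DATE(reading_time)
CLUSTER BY sensor_id
""").result()

# Approximate aggregation trades a little accuracy for speed, and filtering on
# the partition column limits the bytes scanned (and therefore the cost).
rows = client.query("""
SELECT APPROX_COUNT_DISTINCT(sensor_id) AS active_sensors
FROM `my-project.sensors.readings_partitioned`
WHERE DATE(reading_time) = CURRENT_DATE()
""").result()

for row in rows:
    print(row.active_sensors)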