21020324
Phạm Hoàng
Building Resilient Streaming Analytics Systems on Google Cloud
I. Introduction to Processing Streaming Data
- This module introduces stream processing and its role in big data architecture.
- Data flows through Pub/Sub, Dataflow, and BigQuery or Bigtable in the processing pipeline.
- Streaming enables real-time information for dashboards and situational awareness.
- Minimizing latency at each step is crucial for timely data processing.
- Challenges in streaming include the volume, velocity, and variety of data.
- Google Cloud products such as Pub/Sub, Dataflow, and BigQuery address these challenges.
- The processing pipeline involves data ingestion, Pub/Sub for distribution, Dataflow for aggregation and enrichment, and finally storage or machine learning models.
- This approach is a common pattern for streaming analytics on Google Cloud.
II. Serverless Messaging with Pub/Sub
- Pub/Sub is a fully managed data distribution and delivery system.
- It is a serverless service with client libraries available in multiple languages.
- Pub/Sub offers high availability, durability, and scalability.
- It was originally used by Google for distributing search engine data globally.
- Pub/Sub is HIPAA compliant, offers encryption, and stores messages in multiple locations.
- The key components of Pub/Sub are topics and subscriptions.
- Topics are created by publishers, while subscriptions are created by subscribers.
- Subscriptions connect to topics and allow messages to be received and processed.
- A topic can have multiple subscriptions, providing a message bus-like architecture.
- Messages can be filtered by attributes, and messages that do not match the filter are automatically acknowledged.
- Pub/Sub is highly flexible and can be used for many scenarios, including notifications, data distribution, and decoupling applications (a minimal publish/subscribe sketch follows this list).
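To make topics, subscriptions, and attribute-based filtering concrete, here is a minimal Python sketch using the google-cloud-pubsub client library. The project, topic, and subscription names are placeholders and the JSON payload is only an assumed example; attribute filters themselves are configured on the subscription and are not shown here.

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"               # placeholder project
TOPIC_ID = "sensor-events"              # placeholder topic
SUBSCRIPTION_ID = "sensor-events-sub"   # placeholder subscription

# Publisher side: publish a message with an attribute that a subscription
# filter could match on (non-matching messages are auto-acknowledged).
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
future = publisher.publish(
    topic_path,
    data=b'{"sensor_id": "s-42", "temperature": 21.7}',  # payload must be bytes
    region="us-central1",                                # message attribute
)
print("Published message ID:", future.result())

# Subscriber side: pull messages from a subscription attached to the topic.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message):
    print("Received:", message.data, dict(message.attributes))
    message.ack()                       # acknowledge so Pub/Sub stops redelivering

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=30)   # listen for 30 seconds, then stop
    except TimeoutError:
        streaming_pull.cancel()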
III. Dataflow Streaming Features
- Dataflow is a serverless service for processing both batch and streaming data, offering scalability and low-latency processing for streaming pipelines.
- Challenges in processing streaming data include scalability, fault tolerance, the choice between a streaming model and repeated batch jobs, timing and latency issues, and data aggregation.
- Dataflow handles aggregation in streaming scenarios by using windowing to calculate averages and other aggregates over specific time intervals.
- Message ordering and timestamps are crucial in streaming data processing; Dataflow allows timestamps to be modified, which is useful when there is significant latency between data capture and message publishing.
- Custom message IDs can be used for deduplication of Pub/Sub messages: Dataflow maintains a list of recently seen custom IDs and discards duplicates.
- Dataflow offers three types of windows for streaming data: fixed, sliding, and session windows.
- Fixed windows are consistent, non-overlapping intervals such as hourly, daily, or monthly. Sliding windows can overlap and are defined by a window duration plus the period at which new windows start. Session windows are defined by a minimum gap duration and are suited to capturing bursty communication.
- Dataflow automatically tracks watermarks, which represent the lag between event time and data arrival, so a window can be flushed once the watermark has passed its end.
- Late data, which arrives after the window has closed, can be handled according to user-defined policies such as discarding or reprocessing it.
- Triggers specify when results are emitted for a window, and Dataflow provides several types, including event time triggers and processing time triggers.
- Accumulation modes determine whether previously emitted results are accumulated or discarded when a trigger fires again, and are configured based on the requirements of the streaming pipeline (the pipeline sketch below puts these pieces together).
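The windowing, trigger, late-data, and accumulation concepts above map directly onto the Apache Beam SDK that Dataflow runs. The following is a hedged sketch rather than the course's reference pipeline: the topic, BigQuery table, payload format, and the one-minute window with five minutes of allowed lateness are all assumptions.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window, trigger

def parse(msg_bytes):
    # Assumed payload: {"sensor_id": "s-42", "temperature": 21.7}
    record = json.loads(msg_bytes.decode("utf-8"))
    return record["sensor_id"], float(record["temperature"])

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
              topic="projects/my-project/topics/sensor-events")
        | "Parse" >> beam.Map(parse)
        | "WindowInto" >> beam.WindowInto(
              window.FixedWindows(60),                    # 1-minute fixed windows
              trigger=trigger.AfterWatermark(             # fire when the watermark passes
                  late=trigger.AfterCount(1)),            # re-fire for each late element
              allowed_lateness=300,                       # accept data up to 5 minutes late
              accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "MeanPerSensor" >> beam.combiners.Mean.PerKey()
        | "ToRow" >> beam.Map(
              lambda kv: {"sensor_id": kv[0], "avg_temperature": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
              "my-project:sensors.temperature_by_minute",
              schema="sensor_id:STRING,avg_temperature:FLOAT",
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
              create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
    # Note: with ACCUMULATING mode, late firings append updated rows to the table;
    # a production pipeline would deduplicate downstream or use DISCARDING mode.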
- Streaming data is inserted into BigQuery using the streaming inserts method, which appends one item at a time to a table; new tables can be created from a template table that defines the schema.
- Streamed data is available for query within seconds, but data availability, consistency, and latency should be kept in mind.
- Streaming quotas have daily and concurrent rate limits; disabling best-effort de-duplication by not populating insert IDs allows higher streaming ingest quotas.
- Streaming should be used when immediate data availability is a requirement; batch loading is not charged, so it is preferred for non-real-time scenarios.
- Data Studio can be used to visualize data in BigQuery, and data exploration can begin immediately after executing a query. Reports created in Data Studio can be shared, and the accessibility of the underlying data source should be considered.
- Data Studio allows charts and tables to be created, components arranged, dimensions and metrics defined, and reports named. The view toggle button switches between editing and viewing modes.
- BigQuery BI Engine is an in-memory analysis service integrated with BigQuery that provides sub-second query response times for business intelligence applications, eliminating the need to build and manage custom BI services and OLAP cubes.
- BigQuery is a powerful tool for querying and analyzing data, but it may not always meet requirements for very low latency and high throughput.
- In such cases Cloud Bigtable is a high-performance alternative; this lesson covers designing schemas, row keys, and data ingestion for Bigtable.
- Bigtable is ideal for unstructured key-value data, but it is not suited to highly structured or transactional data, small data volumes, or workloads that need SQL-like queries and joins.
- Bigtable is often used as a real-time lookup store for applications requiring high throughput.
- Bigtable stores data in Colossus, and its three levels of operation (data, tablet, metadata) enable fast rebalancing and recovery.
- Bigtable's design principles favor simplification and speed, resulting in a NoSQL database with only one index, the row key.
- Efficient design of the row key, column families, and data organization is critical for performance.
- Row keys that reduce sorting and searching allow common queries to be executed as scans (a small write-and-scan sketch follows this list).
- Reversing timestamps in row keys keeps the most recent data at the beginning of the table.
- Periodic compaction removes deleted rows and optimizes data organization.
- A well-designed schema distributes reads and writes evenly across the cluster, and Bigtable can redistribute tablets to balance the workload.
- Spotify's use case demonstrates using Bigtable for data remediation between Dataflow jobs, processing and storing data more efficiently.
- Optimizing Bigtable performance is essential for maintaining low latency and high throughput.
- Correct table schema design is crucial for evenly distributing reads and writes across the Bigtable cluster and preventing individual nodes from being overloaded.
- Adequate workload and data volume are required for Bigtable to learn access patterns and optimize itself; small data volumes and short testing periods may not yield accurate results.
- Increasing the number of nodes in a Bigtable cluster improves performance roughly linearly, and monitoring tools can help identify overloading.
- The choice of storage (SSD vs. HDD) significantly affects performance, with SSDs supporting much higher read request rates.
- Network issues can reduce throughput, and clients in different zones can cause performance problems.
- Experimentation with actual workloads, row and cell sizes, and other factors is necessary for fine-tuning performance.
- Key Visualizer generates visual reports of Bigtable usage patterns based on row keys, helping to optimize performance.
- Adequate data volume (at least 300 GB) and a sufficiently long testing period are essential for accurate performance testing.
- Replication enhances Bigtable's availability and durability, allows manual or automatic failover, and lets different workloads be isolated on separate clusters.
- Performance estimates in the documentation serve as baselines but should be validated through testing with real data and application code.
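To make the row key discussion concrete, here is a minimal write-and-scan sketch using the google-cloud-bigtable Python client. The instance, table, column family, and the sensor-id-plus-reversed-timestamp key format are illustrative assumptions, not a prescribed schema.

import datetime
import sys
from google.cloud import bigtable

client = bigtable.Client(project="my-project")   # placeholder project
instance = client.instance("sensor-instance")    # placeholder instance
table = instance.table("sensor_readings")        # placeholder table

# Row key: entity id plus a reversed timestamp, so the newest reading for a
# sensor sorts first and the common query becomes a short prefix scan.
sensor_id = "s-42"
now_ms = int(datetime.datetime.now(datetime.timezone.utc).timestamp() * 1000)
row_key = f"{sensor_id}#{sys.maxsize - now_ms}".encode("utf-8")

# Write one cell into an assumed column family named "readings".
row = table.direct_row(row_key)
row.set_cell("readings", "temperature", b"21.7")
row.commit()

# Read the latest readings for this sensor with a prefix scan over the row key.
prefix = f"{sensor_id}#".encode("utf-8")
for result in table.read_rows(start_key=prefix, end_key=prefix + b"\xff", limit=10):
    cell = result.cells["readings"][b"temperature"][0]
    print(result.row_key, cell.value)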
IV. Advanced BigQuery Functionality and Performance
- BigQuery provides built-in functions, including window functions, to support advanced analysis.
- Window functions fall into three groups: standard aggregations, navigation functions, and ranking and numbering functions.
- Standard aggregation functions such as COUNT let you calculate results rapidly over a window.
- Navigation functions such as LEAD compute value expressions over rows other than the current row in the window frame.
- Ranking and numbering functions such as RANK assign an ordinal rank to each row within an ordered partition.
- These functions can be used to perform advanced analysis and gain insights from your data.
- The WITH clause defines named subqueries in BigQuery, making complex queries more manageable and isolating SQL operations (see the query sketch after this list).
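Here is a small example of a named subquery combined with ranking and navigation window functions, run through the google-cloud-bigquery Python client. The project, dataset, table, and column names are invented for illustration.

from google.cloud import bigquery

client = bigquery.Client()   # uses default credentials and project

sql = """
WITH latest AS (                               -- named subquery via the WITH clause
  SELECT sensor_id, avg_temperature
  FROM `my-project.sensors.temperature_by_minute`
)
SELECT
  sensor_id,
  avg_temperature,
  RANK() OVER (ORDER BY avg_temperature DESC) AS temp_rank,   -- ranking function
  LEAD(avg_temperature)
    OVER (ORDER BY avg_temperature DESC) AS next_lower        -- navigation function
FROM latest
"""

for row in client.query(sql).result():
    print(row.sensor_id, row.avg_temperature, row.temp_rank, row.next_lower)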
- BigQuery offers built-in geographic information system (GIS) features for spatial data analysis, and examples demonstrate how to use them in geospatial queries.
- ST_GeogPoint creates geospatial objects from longitude and latitude values, allowing you to work with spatial data.
- ST_DWithin determines whether two geospatial objects lie within a specified distance of each other.
- Other functions such as ST_MakeLine and ST_MakePolygon allow information to be overlaid on maps to visualize relationships in the data.
- Functions such as ST_Intersects, ST_Contains, and ST_CoveredBy analyze the relationships between geospatial objects: intersection, containment, and coverage.
- The BigQuery Geo Viz application renders GIS data with minimal configuration, making it easy to visualize geospatial query results (a proximity query sketch follows this list).
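As an illustration of these functions, here is a proximity query run through the Python client. The dataset, coordinates, and the 1,000 meter radius are assumptions made up for the example.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table with sensor_id, longitude, and latitude columns.
sql = """
SELECT
  sensor_id,
  ST_GEOGPOINT(longitude, latitude) AS location   -- note: longitude comes first
FROM `my-project.sensors.sensor_locations`
WHERE ST_DWITHIN(
  ST_GEOGPOINT(longitude, latitude),
  ST_GEOGPOINT(-122.0841, 37.4220),               -- reference point
  1000)                                           -- within 1,000 meters
"""

for row in client.query(sql).result():
    print(row.sensor_id, row.location)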
- Best practices include using Dataflow for processing, creating multiple tables, and structuring data for efficient exploration.
- Performance optimization areas include input/output, shuffling, grouping, materialization, and CPU cost.
- A cheat sheet advises selecting only the necessary data, filtering with WHERE clauses, and applying ORDER BY as the last operation.
- Partitioning tables helps reduce costs and improve performance (a DDL sketch follows this list).
- Clustering enhances query performance, and automatic re-clustering is now available.
- Materializing intermediate tables reduces data processing and storage costs.
- Approximate functions improve performance at a slight cost in accuracy.
- Cloud Monitoring can be used to monitor BigQuery performance.
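To show what partitioning, clustering, and approximate aggregation look like in practice, here is a hedged sketch using the Python client; the table name, partition column, and clustering key are assumptions for illustration.

from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned by the reading date and clustered by sensor_id
# (hypothetical schema; adjust names to your own dataset).
client.query("""
CREATE TABLE IF NOT EXISTS `my-project.sensors.readings_partitioned`
(
  sensor_id STRING,
  reading_time TIMESTAMP,
  temperature FLOAT64
)
PARTITION BY DATE(reading_time)
CLUSTER BY sensor_id
""").result()

# Approximate aggregation trades a little accuracy for speed, and filtering on
# the partition column limits the bytes scanned (and therefore the cost).
rows = client.query("""
SELECT APPROX_COUNT_DISTINCT(sensor_id) AS active_sensors
FROM `my-project.sensors.readings_partitioned`
WHERE DATE(reading_time) = CURRENT_DATE()
""").result()

for row in rows:
    print(row.active_sensors)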