Building Batch Data Pipelines On Google Cloud
EL (Extract and Load) involves importing data as-is into a system. It is suitable when
the data is already clean and correct, such as loading log files from Cloud Storage into
BigQuery.
ELT (Extract, Load, and Transform) allows loading raw data directly into the
target and transforming it when needed. It's used when transformations are uncertain,
like storing raw JSON from the Vision API and later extracting and transforming
specific data using SQL.
ETL (Extract, Transform, and Load) involves transforming data in an
intermediate service before loading it into the target. An example is transforming data
in Dataflow before loading it into BigQuery.
Use EL when data is clean and correct, ELT when transformations are uncertain
and can be expressed in SQL, and ETL for more complex transformations done in an
intermediate service.
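A minimal EL sketch in Python, assuming the BigQuery client library and placeholder bucket, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()

# Load CSV log files from Cloud Storage into BigQuery as-is (no transformation).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)
load_job = client.load_table_from_uri(
    "gs://example-bucket/logs/*.csv",             # hypothetical source files
    "example-project.example_dataset.raw_logs",   # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes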
2. Quality considerations
The discussion explores data quality transformations in BigQuery, focusing on
issues like validity, accuracy, completeness, consistency, and uniformity. It
emphasizes the impact of these issues on data analysis and business outcomes.
The talk introduces methods for detecting and resolving data quality problems in
BigQuery, highlighting examples such as using the COUNT DISTINCT function to handle
duplicate records and filtering in views to address issues like out-of-range or
invalid data without the need for additional transformation steps.
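As a sketch of the view-based approach, assuming illustrative table and column names, a view can hide problem rows so downstream queries see only clean data:

from google.cloud import bigquery

client = bigquery.Client()

# Create a view that filters out rows with quality problems, so no separate
# transformation step is needed before analysis.
client.query(
    """
    CREATE OR REPLACE VIEW `example-project.example_dataset.valid_orders` AS
    SELECT *
    FROM `example-project.example_dataset.orders`
    WHERE quantity >= 0                 -- drop out-of-range quantities
      AND order_date <= CURRENT_DATE()  -- drop dates that cannot be valid
    """
).result()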
3. How to carry out operations in BigQuery
The lesson focuses on addressing quality issues in BigQuery. Views can be used
to filter out rows with quality problems, such as removing quantities less than zero or
groups with fewer than 10 records after a GROUP BY operation. Handling nulls and blanks
is discussed, emphasizing the use of COUNTIF and IF statements for non-null value counts
and flexible computations.
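A small sketch of these checks, assuming an illustrative orders table with store_id, customer_email, and quantity columns:

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  store_id,
  COUNT(*) AS total_rows,
  COUNTIF(customer_email IS NOT NULL) AS non_null_emails,     -- non-null value count
  COUNTIF(customer_email = '') AS blank_emails,               -- blanks are not nulls
  SUM(IF(quantity > 0, quantity, 0)) AS valid_quantity_total  -- flexible computation
FROM `example-project.example_dataset.orders`
GROUP BY store_id
HAVING COUNT(*) >= 10   -- drop groups with fewer than 10 records
"""
for row in client.query(query).result():
    print(dict(row))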
Consistency problems, often due to duplicates or extra characters, can be tackled
using COUNT and COUNT DISTINCT, along with string functions to clean data. For
accuracy, data can be tested against known good values, and completeness involves
identifying and handling missing values using SQL functions like NULLIF and
COALESCE. Backfilling is introduced as a method for addressing missing data gaps.
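A sketch of these techniques, again with illustrative table and column names:

from google.cloud import bigquery

client = bigquery.Client()

# Compare COUNT with COUNT(DISTINCT ...): a gap between the two indicates duplicates.
duplicates_check = """
SELECT COUNT(order_id) AS total_ids,
       COUNT(DISTINCT order_id) AS distinct_ids
FROM `example-project.example_dataset.orders`
"""

# NULLIF turns sentinel values into NULL; COALESCE picks the first non-null value.
missing_values = """
SELECT
  order_id,
  NULLIF(unit_price, 0) AS unit_price_or_null,
  COALESCE(discount, default_discount, 0) AS effective_discount
FROM `example-project.example_dataset.orders`
"""

print(list(client.query(duplicates_check).result()))
print(list(client.query(missing_values).result()))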
The passage introduces the Hadoop ecosystem, tracing its evolution from
traditional big data processing to the emergence of Hadoop in 2006, enabling
distributed processing. The ecosystem includes tools like HDFS, MapReduce, Hive,
Pig, and Spark. It emphasizes the challenges of on-premises Hadoop clusters and
introduces Google Cloud's Dataproc as a managed solution.
Dataproc offers benefits such as managed hardware, simplified version
management, and flexible job configuration. The passage also highlights the
advantages of Spark, a powerful component of the Hadoop ecosystem, known for high
performance, in-memory processing, and versatility in handling various workloads,
including SQL and machine learning through Spark MLlib.
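A brief PySpark sketch of this DataFrame and SQL versatility, assuming a placeholder input path (Spark MLlib builds on the same DataFrames):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-overview").getOrCreate()

# Read once, cache in memory, then reuse the same data for DataFrame and SQL queries.
sales = spark.read.csv("gs://example-bucket/sales/*.csv", header=True, inferSchema=True)
sales.cache()

sales.groupBy("store_id").sum("quantity").show()

sales.createOrReplaceTempView("sales")
spark.sql("SELECT store_id, SUM(quantity) AS total FROM sales GROUP BY store_id").show()

spark.stop()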
2. Running Hadoop on Dataproc
This section discusses the benefits of using Dataproc on Google Cloud to run
Hadoop jobs in the cloud. Dataproc leverages open-source data tools,
provides automation for quick cluster creation and management, and offers cost
savings by turning off clusters when not in use. Key features include low cost, fast
cluster operations, resizable clusters, compatibility with Spark and Hadoop tools,
integration with Cloud Storage, BigQuery, and Cloud Bigtable, as well as versioning
and high availability.
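A minimal sketch of creating a short-lived cluster with the Dataproc Python client library; project, region, and machine sizes below are placeholders:

from google.cloud import dataproc_v1

project_id = "example-project"
region = "us-central1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A small cluster intended to be deleted as soon as the job finishes,
# which is where the cost savings come from.
cluster = {
    "project_id": project_id,
    "cluster_name": "ephemeral-analysis-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Created cluster: {operation.result().cluster_name}")

An equivalent cluster can also be created from the console or with the gcloud command-line tool.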
Moving to the reduce phase of a Dataflow pipeline, Combine transforms are used for aggregations.
CombineGlobally combines an entire PCollection, while CombinePerKey works like
GroupByKey but combines values using a specified function. Combining functions
should be commutative and associative. Custom combine functions can be created,
providing flexibility for complex operations.
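A short Apache Beam (Python SDK) sketch of both forms, using sum as the commutative and associative combining function:

import apache_beam as beam

with beam.Pipeline() as p:
    amounts = p | "Create" >> beam.Create([("a", 1), ("a", 2), ("b", 5)])

    # CombineGlobally reduces an entire PCollection to a single value.
    (amounts
     | "Values" >> beam.Values()
     | "GlobalSum" >> beam.CombineGlobally(sum)
     | "PrintTotal" >> beam.Map(print))

    # CombinePerKey combines the values for each key with the given function.
    (amounts
     | "SumPerKey" >> beam.CombinePerKey(sum)
     | "PrintPerKey" >> beam.Map(print))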
Flatten merges multiple PCollections, acting like a SQL UNION. Partition splits
a single PCollection into smaller collections, useful for scenarios where different
processing is needed for specific partitions. These capabilities contribute to Dataflow's
efficiency in handling data processing tasks.
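A sketch of Flatten and Partition in the Beam Python SDK, with illustrative order records:

import apache_beam as beam

def by_size(order, num_partitions):
    # Route large orders to partition 1, everything else to partition 0.
    return 1 if order["quantity"] >= 100 else 0

with beam.Pipeline() as p:
    online = p | "Online" >> beam.Create([{"id": 1, "quantity": 3}])
    in_store = p | "InStore" >> beam.Create([{"id": 2, "quantity": 250}])

    # Flatten merges PCollections of the same type, like the SQL UNION described above.
    all_orders = (online, in_store) | beam.Flatten()

    # Partition splits one PCollection into a fixed number of smaller ones.
    partitions = all_orders | beam.Partition(by_size, 2)
    partitions[1] | "PrintLarge" >> beam.Map(print)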
4. Side inputs and windows of data
In addition to the main input PCollection, Dataflow allows the provision of side
inputs to a ParDo transform. Side inputs are additional inputs that the DoFn in a
ParDo can access when processing each element of the main input PCollection. These inputs
provide additional data determined at runtime, offering flexibility without hard-coding.
Side inputs are particularly useful when injecting data during processing,
depending on the input data or a different branch of the pipeline. The example in
Python demonstrates how side inputs work, creating a view available to all worker
nodes.
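A sketch of the idea in the Beam Python SDK: a value computed by one branch of the pipeline is passed as a singleton side input to a transform on the main branch (the data here is illustrative):

import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["batch", "pipeline", "dataflow"])

    # Computed at runtime by another branch of the pipeline, not hard-coded.
    avg_len = (words
               | "Lengths" >> beam.Map(len)
               | "Mean" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()))

    # The side input is turned into a view available to all workers.
    (words
     | "LongerThanAvg" >> beam.Filter(
           lambda word, avg: len(word) > avg,
           avg=beam.pvalue.AsSingleton(avg_len))
     | "Print" >> beam.Map(print))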
Batch inputs can use time-based windows for grouping data by time. Explicit
timestamps can be assigned to elements in the pipeline for windowing. The example illustrates
aggregating batch data by time using sliding windows. In the case of sales records,
fixed windows with a 1-day duration can be created for computing daily totals in batch
processing. Streaming discussions are continued in the streaming data processing
course.
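A sketch of the daily-totals case in the Beam Python SDK, with made-up epoch-second timestamps:

import apache_beam as beam
from apache_beam import window

sales = [
    {"ts": 1_600_000_000, "amount": 20.0},
    {"ts": 1_600_090_000, "amount": 15.5},
]

with beam.Pipeline() as p:
    (p
     | beam.Create(sales)
     # Attach an explicit timestamp to each element for windowing.
     | beam.Map(lambda s: window.TimestampedValue(s["amount"], s["ts"]))
     # Fixed windows with a one-day duration, for daily totals in batch.
     | beam.WindowInto(window.FixedWindows(24 * 60 * 60))
     | beam.CombineGlobally(sum).without_defaults()
     | beam.Map(print))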
5. Creating and re-using pipeline templates
Dataflow Templates simplify the execution of Dataflow jobs by allowing users
without coding capabilities to run standard data transformation tasks. Users can
leverage pre-existing templates or create their own for team use. This separation of
development and execution workflows streamlines job execution.
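As a sketch of how a custom (classic) template can expose runtime parameters, assuming a simple text-copy pipeline; --input and --output are hypothetical parameters:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class CopyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Value providers are resolved when the template is launched, not when
        # it is built, so non-coders can reuse it with different paths.
        parser.add_value_provider_argument("--input", type=str)
        parser.add_value_provider_argument("--output", type=str)

options = PipelineOptions()
copy_options = options.view_as(CopyTemplateOptions)
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText(copy_options.input)
     | beam.io.WriteToText(copy_options.output))

Once staged, such a template can be launched from the console or command line by anyone supplying values for these parameters.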
IV. Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
1. Introduction to Cloud Data Fusion
Audience: It serves developers for data cleansing, matching, transformation, and
automation; data scientists for building and deploying pipelines; and business analysts
for operationalizing pipelines and inspecting metadata.
Benefits:
• Integration: Connects with a variety of data sources, including legacy and
modern systems, databases, file systems, cloud services, and more.
• Productivity: Consolidates data from different sources into a unified view,
enhancing productivity.
• Reduced Complexity: Provides a visual interface for code-free
transformations, reusable templates, and pipeline building.
• Flexibility: Supports on-premises and cloud environments, ensuring
interoperability with CDAP, the open-source framework Cloud Data Fusion is built on.
Capabilities:
• Graphical Interface: Allows building data pipelines visually with existing
templates, connectors, and transformations.
• Testing and Debugging: Permits testing and debugging of pipelines,
tracking data processing at each node.
• Organization and Search: Enables tagging pipelines for efficient
organization and utilizes unified search functionality.
• Lineage Tracking: Tracks the lineage of transformations on data fields.
Extensibility:
• Templatization: Supports templatizing pipelines for reusability.
• Conditional Triggers: Allows the creation of triggers based on conditions.
• Plugin Management: Offers UI widget plugins, custom provisioners,
compute profiles, and integration with the Hub.
2. Components of Cloud Data Fusion
Wrangler UI:
• Purpose: Used for visually exploring datasets and constructing pipelines
without writing code.
• Functionality: Enables users to build pipelines through a visual interface,
making data exploration and transformation intuitive.
• Key Benefit: Provides a code-free environment for constructing pipelines.
Data Pipeline UI:
• Purpose: Designed for drawing pipelines directly onto a canvas.
• Functionality: Allows users to create pipelines visually, facilitating a
seamless design process.
• Option: Users can choose from existing templates for common data
processing paths, such as moving data from Cloud Storage to BigQuery.
3. Cloud Data Fusion UI
In the Cloud Data Fusion UI, essential elements include the Control Center for
managing applications, artifacts, and datasets.
The Pipeline Section, featuring Developer Studio and a Palette, aids in pipeline
development.
The Wrangler Section offers tools for data exploration and transformation.
The Integration Metadata Section allows searches, tagging, and data lineage
exploration.
The Hub provides access to plugins and prebuilt pipelines.
Entities encompass pipeline creation and other functionalities, while
Administration includes management and configuration options.
4. Build a pipeline
In Cloud Data Fusion, a pipeline is visually represented as a Directed Acyclic
Graph (DAG), with each stage as a node. Nodes can vary, such as pulling data from
Cloud Storage, parsing CSV, or joining and splitting data.
The studio serves as the interface for pipeline creation, and the canvas allows
node arrangement. Use the mini-map for navigation and the control panel to add
objects. Save and run pipelines through the actions toolbar, employing templates and
plugins.