
Building Batch Data Pipelines on Google Cloud

I. Introduction to Building Batch Data Pipelines
   1. EL, ELT, and ETL
   2. Quality considerations
   3. How to carry out operations in BigQuery
   4. Shortcomings
   5. ETL to solve data quality issues
II. Executing Spark on Dataproc
   1. The Hadoop Ecosystem
   2. Running Hadoop on Dataproc
   3. Cloud Storage instead of HDFS
   4. Optimizing Dataproc
   5. Optimizing Dataproc storage
   6. Optimizing Dataproc templates and autoscaling
   7. Optimizing Dataproc monitoring
III. Serverless Data Processing with Dataflow
   1. Introduction to Dataflow
   2. Why customers value Dataflow
   3. Serverless Data Processing with Dataflow
   4. Side inputs and windows of data
   5. Creating and re-using pipeline templates
IV. Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
   1. Introduction to Cloud Data Fusion
   2. Components of Cloud Data Fusion
   3. Cloud Data Fusion UI
   4. Build a pipeline
   5. Explore data using Wrangler
   6. Orchestrate work between Google Cloud services with Cloud Composer
   7. Workflow scheduling
I. Introduction to Building Batch Data Pipelines
Batch pipelines process a fixed amount of data before concluding, such as a daily
pipeline balancing financial transactions and storing results in a data warehouse.
Choosing between EL, ELT, or ETL depends on transformation needs and data quality
considerations. The discussion will cover building EL and ELT pipelines in BigQuery,
scenarios where they may not be suitable, and reasons for choosing ETL.
1. EL, ELT, and ETL

EL (Extract and Load) involves importing data as is into a system, suitable when
the data is already clean and correct, such as loading log files from Cloud Storage into
BigQuery.
ELT (Extract, Load, and Transform) allows loading raw data directly into the
target and transforming it when needed. It's used when transformations are uncertain,
like storing raw JSON from the Vision API and later extracting and transforming
specific data using SQL.
ETL (Extract, Transform, and Load) involves transforming data in an
intermediate service before loading it into the target. An example is transforming data
in Dataflow before loading it into BigQuery.
Use EL when data is clean and correct, ELT when transformations are uncertain
and can be expressed in SQL, and ETL for more complex transformations done in an
intermediate service.
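
As an illustration of the EL and ELT patterns above, here is a minimal sketch using the google-cloud-bigquery Python client; the bucket, project, dataset, and table names are placeholders, not values from the course.

```python
from google.cloud import bigquery

client = bigquery.Client()

# EL: load clean, correctly formatted log files from Cloud Storage as-is.
load_job = client.load_table_from_uri(
    "gs://example-bucket/logs/*.json",                 # placeholder URI
    "example-project.example_dataset.raw_logs",        # placeholder table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to complete

# ELT: the raw rows are already loaded; transform them later with SQL.
transform_sql = """
CREATE OR REPLACE TABLE example_dataset.daily_errors AS
SELECT DATE(event_time) AS day, COUNT(*) AS error_count
FROM example_dataset.raw_logs
WHERE severity = 'ERROR'
GROUP BY day
"""
client.query(transform_sql).result()
```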
2. Quality considerations
The discussion explores data quality transformations in BigQuery, focusing on
issues like validity, accuracy, completeness, consistency, and uniformity. It
emphasizes the impact of these issues on data analysis and business outcomes.
The talk introduces methods for detecting and resolving data quality problems in
BigQuery, highlighting examples such as using COUNT DISTINCT to detect duplicate
records and filtering in views to address out-of-range or invalid data without
additional transformation steps.
3. How to carry out operations in BigQuery
The lesson focuses on addressing quality issues in BigQuery. Views can be used
to filter out rows with quality problems, such as removing quantities less than zero or
groups with fewer than 10 records after a GROUP BY operation. Handling NULLs and
blanks is discussed, emphasizing the use of COUNTIF and IF statements for non-null
value counts and flexible computations.
Consistency problems, often due to duplicates or extra characters, can be tackled
using COUNT and COUNT DISTINCT, along with string functions to clean data. For
accuracy, data can be tested against known good values, and completeness involves
identifying and handling missing values using SQL functions such as NULLIF and
COALESCE. Backfilling is introduced as a method for filling gaps in missing data.

Ensuring data uniformity is discussed by verifying file integrity during data
loading, safeguarding against unit changes with SQL CAST, and clearly documenting
units using the SQL FORMAT function. The lesson emphasizes the powerful capabilities
of BigQuery's SQL for addressing various data quality issues.
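
The techniques above can be sketched as SQL issued through the BigQuery Python client. The table and column names below are illustrative assumptions, not taken from the course.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A view that filters out invalid rows and documents units, without a separate
# transformation step.
quality_view_sql = """
CREATE OR REPLACE VIEW example_dataset.clean_orders AS
SELECT
  order_id,
  -- Treat blank strings as missing, then fall back to a default value.
  COALESCE(NULLIF(TRIM(customer_name), ''), 'UNKNOWN') AS customer_name,
  -- Guard against unit drift with CAST and document the unit with FORMAT.
  CAST(amount AS NUMERIC) AS amount_usd,
  FORMAT('%.2f USD', CAST(amount AS FLOAT64)) AS amount_label
FROM example_dataset.raw_orders
WHERE quantity >= 0            -- drop invalid negative quantities
"""
client.query(quality_view_sql).result()

# Detect duplicates and count non-null values with COUNT DISTINCT and COUNTIF.
checks_sql = """
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT order_id) AS distinct_orders,   -- duplicates exist if these differ
  COUNTIF(customer_name IS NOT NULL) AS named_customers
FROM example_dataset.raw_orders
"""
for row in client.query(checks_sql).result():
    print(row.total_rows, row.distinct_orders, row.named_customers)
```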
4. Shortcomings
The lesson highlights that SQL in ELT pipelines can handle many quality issues,
making ETL unnecessary in some cases. However, situations requiring external API
calls, complex transformations, or continuous data loading may warrant ETL. Google
Cloud's recommended architecture uses Dataflow for ETL, particularly when non-SQL
transformations or continuous loading is needed.
Other Google Cloud services like Dataproc and Data Fusion are also options for
ETL, offering graphical interfaces and Apache Hadoop-based solutions. Dataflow,
based on Apache Beam, supports both batch and streaming data processing, with Quick
Start templates for rapid deployment. These services enable data transformation for
advanced analytics in data lakes or warehouses.

5. ETL to solve data quality issues


The discussion covers using ETL to address data quality issues, recommending
Dataflow and BigQuery unless specific needs arise. Situations like low latency, high
throughput, or the need for visual pipeline building may require alternatives. Dataproc,
a managed batch processing service, is cost-effective for Hadoop workloads, while
Cloud Data Fusion provides a visual interface for building and managing data
pipelines. Crucial aspects for all ETL options include maintaining data lineage and
keeping metadata for discoverability and suitability.
Data Catalog, a fully managed service, aids in metadata management and data
discovery. It supports schematized tags, integrates with the data loss prevention API,
and empowers collaborative annotation of business metadata. Data Catalog provides a
unified user experience for discovering datasets and ensures a single point of access
for users with different levels of access to diverse datasets and tables.
II. Executing Spark on Dataproc
1. The Hadoop Ecosystem

The passage introduces the Hadoop ecosystem, tracing its evolution from
traditional big data processing to the emergence of Hadoop in 2006, enabling
distributed processing. The ecosystem includes tools like HDFS, MapReduce, Hive,
Pig, and Spark. It emphasizes the challenges of on-premises Hadoop clusters and
introduces Google Cloud's Dataproc as a managed solution.
Dataproc offers benefits such as managed hardware, simplified version
management, and flexible job configuration. The passage also highlights the
advantages of Spark, a powerful component of the Hadoop ecosystem, known for high
performance, in-memory processing, and versatility in handling various workloads,
including SQL and machine learning through Spark MLlib.
2. Running Hadoop on Dataproc
This section discusses the benefits of using Dataproc on Google Cloud for
processing Hadoop job code in the cloud. Dataproc leverages open-source data tools,
provides automation for quick cluster creation and management, and offers cost
savings by turning off clusters when not in use. Key features include low cost, fast
cluster operations, resizable clusters, compatibility with Spark and Hadoop tools,
integration with Cloud Storage, BigQuery, and Cloud Bigtable, as well as versioning
and high availability.

Dataproc's developer tools, initialization actions, and flexible configuration
options are highlighted. The passage emphasizes the ease of setting up and interacting
with clusters, along with customization options such as optional components and
initialization actions. The architecture of Dataproc clusters is briefly explained,
including primary nodes, worker nodes, and preemptible nodes.
The passage concludes with a sequence of events for using Dataproc, covering
setup, configuration, optimization, utilization, and monitoring. It details the cluster
creation process, options for customization, and considerations for cost-effectiveness.
The importance of monitoring job performance and utilizing Cloud Monitoring for
metrics and alerts is also emphasized.
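
As a rough sketch of the job-submission flow described above, the snippet below submits a PySpark job to an existing cluster with the google-cloud-dataproc client; the project, region, cluster, and script names are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "example-project"       # placeholder
region = "us-central1"               # placeholder
cluster_name = "example-cluster"     # placeholder: an existing Dataproc cluster

# The regional API endpoint must match the cluster's region.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://example-bucket/jobs/wordcount.py"},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()  # blocks until the job finishes
print("Job state:", result.status.state.name)
```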
3. Cloud Storage instead of HDFS
This section explores the transition from Hadoop's native file system (HDFS) to
Google Cloud Storage for improved efficiency. It highlights advancements in network
speed and the ability to separate storage and compute with petabit networking.
The text mentions the ease of migrating Hadoop workloads to Dataproc on
Google Cloud, emphasizing the long-term limitations of HDFS on the Cloud due to
issues with block size, data locality, and replication.

The advantages of Google's high-speed network, the Jupiter networking fabric, and
the Colossus storage layer are emphasized. Dataproc clusters leverage this infrastructure
to scale VMs for computation while using the fast Jupiter network to reach storage
products like Cloud Storage.
A historical continuum of data management is briefly outlined, emphasizing the
benefits of separating compute and storage in the Cloud. Cloud Storage is presented as
a scalable and cost-effective alternative to HDFS, with advantages such as pay-as-you-
go pricing, optimization for large parallel operations, and the elimination of
bottlenecks.
The text acknowledges some challenges with Cloud Storage, including object
renaming difficulties and the inability to append to objects. It concludes by introducing
DistCp as a key tool for data movement and discussing the preference for a push-
based model for essential data.
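
In practice, because the Cloud Storage connector is preinstalled on Dataproc clusters, moving a Spark job off HDFS is often just a change of path prefix. A minimal PySpark sketch with placeholder bucket and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-gcs").getOrCreate()

# Before (HDFS):  spark.read.csv("hdfs:///data/transactions/*.csv", header=True)
df = spark.read.csv(
    "gs://example-bucket/data/transactions/*.csv", header=True, inferSchema=True
)

daily_totals = df.groupBy("transaction_date").sum("amount")
daily_totals.write.mode("overwrite").parquet("gs://example-bucket/output/daily_totals/")
```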
4. Optimizing Dataproc
Optimizing Dataproc involves ensuring data locality by aligning the Cloud
Storage bucket with the Dataproc region. Avoid network bottlenecks and consider
consolidating data files. Adjust settings for large datasets and choose an appropriately
sized persistent disk to prevent throughput limitations. Allocate enough virtual
machines based on workload understanding and use prototypes for informed decisions.
Take advantage of the Cloud's flexibility for easy cluster resizing and consider job-
scoped clusters for efficiency.
5. Optimizing Dataproc storage
Using local HDFS is suitable for metadata-heavy operations, frequent data
modifications, and directory renaming. Cloud Storage is recommended for initial and
final data storage in a big data pipeline, offering cost savings and flexible cluster sizing.
Adjust local HDFS size based on workload needs, considering SSDs for IO-intensive
tasks.
Be mindful of data and job configurations in different regions for optimal
performance. Explore storage options like Cloud Bigtable for sparse data and BigQuery
for data warehousing. Shift to an ephemeral model for cost-effective Dataproc usage,
creating clusters as needed and releasing them after job completion. Consider job-
scoped clusters and separate environments for efficient data processing on Google
Cloud.
6. Optimizing Dataproc templates and autoscaling

A Dataproc workflow template, processed as a DAG, creates or selects a cluster,
submits jobs, and then deletes the cluster. It is accessible via gcloud and the REST API,
not the Cloud Console. A template takes effect only when instantiated into the DAG, and
it can be submitted multiple times with different parameter values. An example template
involves installing dependencies, creating a cluster, adding a job, and submitting the
workflow.
Dataproc autoscaling adjusts cluster size based on YARN memory needs, with
features like fine-grained controls, reduced scaling interval, and shared policies.
Autoscaling works best when data is stored off-cluster and the cluster processes many
jobs or a single large job; it is not designed for Spark Structured Streaming or for
scaling a cluster down to zero. Considerations include setting the initial worker count,
choosing cooldown periods, and managing how far preemptible (secondary) workers scale.
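
A minimal sketch of instantiating an existing workflow template with the Python client follows; the template itself (managed cluster plus jobs) is assumed to have been created beforehand via gcloud or the API, and the project, region, template ID, and parameter are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "example-project"           # placeholder
region = "us-central1"                   # placeholder
template_id = "daily-etl-template"       # placeholder: template created earlier

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

name = f"projects/{project_id}/regions/{region}/workflowTemplates/{template_id}"

# Each instantiation expands the template into a DAG: create (or select) the
# cluster, run the jobs, then delete the cluster when the workflow completes.
# The parameter below assumes the template declares a RUN_DATE parameter.
operation = client.instantiate_workflow_template(
    request={"name": name, "parameters": {"RUN_DATE": "2024-01-01"}}
)
operation.result()  # blocks until the entire workflow has finished
```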
7. Optimizing Dataproc monitoring
In Google Cloud, Cloud Logging and Cloud Monitoring help you view and customize
logs, monitor Spark jobs, and manage resources. To identify Spark job failures,
check the driver output and executor logs, which can be accessed through the Cloud
Console, the gcloud command, or the Dataproc cluster's Cloud Storage bucket. All logs,
including Spark container logs, are collected by YARN and available in Cloud Logging,
providing a consolidated view.
The Cloud Console's Logging page allows easy navigation of logs, with options
to filter by application ID or custom labels. Driver log levels can be adjusted
through gcloud commands or Spark context settings. Cloud Monitoring tracks CPU,
disk, network usage, and YARN resources, offering customizable dashboards for real-
time metrics. Visualizing Spark-specific metrics requires connecting to the Spark
application's web UI.
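
As a hedged example of pulling Dataproc logs programmatically rather than through the console, the sketch below queries Cloud Logging with the Python client; the filter follows the usual resource type and labels for Dataproc clusters, but should be verified against your own log entries, and the cluster name is a placeholder.

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Recent error-level entries for one Dataproc cluster.
log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    'AND resource.labels.cluster_name="example-cluster" '
    'AND severity>=ERROR'
)

for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.severity, str(entry.payload)[:200])
```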

III. Serverless Data Processing with Dataflow


1. Introduction to Dataflow
Dataflow is the preferred way for data processing on Google Cloud due to its
serverless nature, eliminating the need to manage clusters. Auto-scaling in Dataflow is
fine-grained and scales step by step. Dataflow allows using the same code for both
batch and streaming, making it versatile. For existing Hadoop pipelines, Dataproc
might be more suitable, but learning both tools is recommended for flexibility.
Dataflow is recommended for building pipelines, offering scalability for processing
more data and supporting both batch and streaming.

The unified programming and processing concepts in Dataflow, achieved
through PTransforms, PCollections, pipelines, and pipeline runners, are innovative in
data engineering. The immutable nature of PCollections simplifies distributed
processing, allowing elements to be individually accessed and processed.
Data types in PCollections are stored as serialized byte strings, enabling seamless
movement through the system without the need for serialization and deserialization
during network transfer. The Dataflow pipeline is represented as a directed graph,
accommodating branches and aggregations.
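
A minimal Apache Beam (Python) sketch of these concepts: a pipeline as a directed graph of PTransforms producing immutable PCollections. It runs locally with the default DirectRunner; pointing the pipeline options at DataflowRunner would run the same code on Dataflow.

```python
import apache_beam as beam

# The `with` block builds the pipeline graph and runs it on exit.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["alpha", "beta", "gamma"])     # yields a PCollection
        | "Upper" >> beam.Map(str.upper)                          # a PTransform -> new PCollection
        | "KeepShort" >> beam.Filter(lambda word: len(word) <= 5)
        | "Print" >> beam.Map(print)
    )
```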
2. Why customers value Dataflow
Data engineers prefer Dataflow for its efficient execution of Apache Beam, fully
managed and serverless design. It optimizes pipeline graphs, dynamically rebalances
work for fault tolerance, and deploys resources on demand.
Dataflow offers step-by-step autoscaling, eliminating manual resource scaling.
It handles late arrivals, ensures correct aggregations, and functions as a versatile glue,
connecting different Google Cloud services seamlessly.
3. Serverless Data Processing with Dataflow
The shuffle phase in Dataflow involves grouping together like keys, crucial for
operations on key-value pairs or two-element tuples. GroupByKey groups by a
common key, but it may face challenges with data skew, leading to uneven workload
distribution. CoGroupByKey is similar but works across several PCollections.

Moving to the reduce phase, Combine transforms are used for aggregations.
CombineGlobally combines an entire PCollection, while CombinePerKey works like
GroupByKey but combines values using a specified function. Combining functions
should be commutative and associative. Custom combine functions can be created,
providing flexibility for complex operations.
Flatten merges multiple PCollections, acting like a SQL UNION. Partition splits
a single PCollection into smaller collections, useful for scenarios where different
processing is needed for specific partitions. These capabilities contribute to Dataflow's
efficiency in handling data processing tasks.
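
The following Beam (Python) sketch exercises these transforms on a tiny in-memory PCollection; the store names and amounts are made up for illustration.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | "Sales" >> beam.Create(
        [("store_a", 10), ("store_b", 5), ("store_a", 7)]
    )
    returns = p | "Returns" >> beam.Create([("store_a", -2)])

    # GroupByKey: all values for a key come together (skewed keys can hot-spot).
    grouped = sales | beam.GroupByKey()            # ("store_a", [10, 7]), ...

    # CombinePerKey: GroupByKey plus an associative, commutative combine fn.
    totals = sales | beam.CombinePerKey(sum)       # ("store_a", 17), ("store_b", 5)

    # Flatten: merge several PCollections of the same type, like a SQL UNION ALL.
    merged = (sales, returns) | beam.Flatten()

    # Partition: split one PCollection into n smaller ones by an index function.
    parts = merged | beam.Partition(lambda kv, n: 0 if kv[1] < 0 else 1, 2)
    refunds, charges = parts[0], parts[1]

    totals | "PrintTotals" >> beam.Map(print)
```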
4. Side inputs and windows of data
In addition to the main input PCollection, Dataflow allows the provision of side
inputs to a ParDo transform. Side inputs are additional inputs that a DoFn in a
ParDo can access when processing each element of the input PCollection. These inputs
provide additional data determined at runtime, offering flexibility without hard-coding.

Side inputs are particularly useful when injecting data during processing,
depending on the input data or a different branch of the pipeline. The example in
Python demonstrates how side inputs work, creating a view available to all worker
nodes.
Batch inputs can use time-based windows to group data by time. Explicit
timestamps can be assigned to elements in the pipeline for windowing. The example
illustrates aggregating batch data by time using sliding windows; in the case of sales
records, fixed windows with a one-day duration can be created for computing daily totals
in batch processing. Streaming windowing is continued in the streaming data processing
course.
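
A hedged Beam (Python) sketch of a side input combined with fixed one-day windows over timestamped batch records, loosely following the sales-totals example above; the records, exchange rates, and field layout are invented for illustration.

```python
import datetime

import apache_beam as beam
from apache_beam import window

# Invented sample records: (ISO timestamp, store, amount in EUR).
sales = [
    ("2024-01-01T09:00:00", "store_a", 100.0),
    ("2024-01-01T17:30:00", "store_a", 50.0),
    ("2024-01-02T10:00:00", "store_b", 75.0),
]


def event_seconds(ts_string):
    # Convert the ISO timestamp string to Unix seconds for Beam windowing.
    return datetime.datetime.fromisoformat(ts_string).timestamp()


with beam.Pipeline() as p:
    # Side input: a small lookup, materialized as a view visible to all workers.
    rates = p | "Rates" >> beam.Create([("EUR", 1.1), ("USD", 1.0)])

    (
        p
        | "Read" >> beam.Create(sales)
        | "ToUSD" >> beam.Map(
            lambda rec, rate_map: (rec[0], rec[1], rec[2] * rate_map["EUR"]),
            rate_map=beam.pvalue.AsDict(rates),               # the side input
        )
        | "AddTimestamps" >> beam.Map(
            lambda rec: window.TimestampedValue((rec[1], rec[2]), event_seconds(rec[0]))
        )
        | "DailyWindows" >> beam.WindowInto(window.FixedWindows(24 * 60 * 60))
        | "SumPerStore" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```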
5. Creating and re-using pipeline templates
Dataflow Templates simplify the execution of Dataflow jobs by allowing users
without coding capabilities to run standard data transformation tasks. Users can
leverage pre-existing templates or create their own for team use. This separation of
development and execution workflows streamlines job execution.

Traditionally, developers create pipelines in a development environment with
dependencies on language and SDK files. Dataflow Templates eliminate the need for
users to be developers, enabling a more user-friendly execution environment.
To create custom templates, developers add value providers for user-specified
arguments, making values available at runtime. Value providers are crucial for
converting compile-time parameters into runtime parameters that users can set. They
work through the ValueProvider interface, offering flexibility in handling parameters
for IO, transforms, and functions.
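
A minimal sketch of the ValueProvider pattern in the Beam Python SDK is shown below; the option name and read step are illustrative, not the course's own example.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # add_value_provider_argument defers the value until runtime, so one
        # staged template can be launched many times with different inputs.
        parser.add_value_provider_argument(
            "--input", type=str, help="Cloud Storage path to read from"
        )


options = TemplateOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText(options.input)   # accepts a ValueProvider
        | "CountLines" >> beam.combiners.Count.Globally()
        | "Print" >> beam.Map(print)
    )
```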

IV. Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
1. Introduction to Cloud Data Fusion
Cloud Data Fusion is a fully managed, cloud-native data integration service built on
the open-source CDAP project.
Audience: It serves developers for data cleansing, matching, transformation, and
automation; data scientists for building and deploying pipelines; and business analysts
for operationalizing pipelines and inspecting metadata.
Benefits:
• Integration: Connects with a variety of data sources, including legacy and
modern systems, databases, file systems, cloud services, and more.
• Productivity: Consolidates data from different sources into a unified view,
enhancing productivity.
• Reduced Complexity: Provides a visual interface for code-free
transformations, reusable templates, and pipeline building.
• Flexibility: Supports on-premises and cloud environments, ensuring
interoperability through the open-source CDAP framework.
Capabilities:
• Graphical Interface: Allows building data pipelines visually with existing
templates, connectors, and transformations.
• Testing and Debugging: Permits testing and debugging of pipelines,
tracking data processing at each node.
• Organization and Search: Enables tagging pipelines for efficient
organization and utilizes unified search functionality.
• Lineage Tracking: Tracks the lineage of transformations on data fields.
Extensibility:
• Templatization: Supports templatizing pipelines for reusability.
• Conditional Triggers: Allows the creation of triggers based on conditions.
• Plugin Management: Offers UI widget plugins, custom provisioners,
compute profiles, and integration to hubs.
2. Components of Cloud Data Fusion
Wrangler UI:
• Purpose: Used for visually exploring datasets and constructing pipelines
without writing code.
• Functionality: Enables users to build pipelines through a visual interface,
making data exploration and transformation intuitive.
• Key Benefit: Provides a code-free environment for constructing pipelines.
Data Pipeline UI:
• Purpose: Designed for drawing pipelines directly onto a canvas.
• Functionality: Allows users to create pipelines visually, facilitating a
seamless design process.
• Option: Users can choose from existing templates for common data
processing paths, such as moving data from Cloud Storage to BigQuery.
3. Cloud Data Fusion UI
In the Cloud Data Fusion UI, essential elements include the Control Center for
managing applications, artifacts, and datasets.
The Pipeline Section, featuring Developer Studio and a Palette, aids in pipeline
development.
The Wrangler Section offers tools for data exploration and transformation.
Integration Metadata Section allows searches, tagging, and data lineage
exploration.
The Hub provides access to plugins and prebuilt pipelines.
Entities encompass pipeline creation and other functionalities, while
Administration includes management and configuration options.
4. Build a pipeline
In Cloud Data Fusion, a pipeline is visually represented as a Directed Acyclic
Graph (DAG), with each stage as a node. Nodes can vary, such as pulling data from
Cloud Storage, parsing CSV, or joining and splitting data.
The studio serves as the interface for pipeline creation, and the canvas allows
node arrangement. Use the mini-map for navigation and the control panel to add
objects. Save and run pipelines through the actions toolbar, employing templates and
plugins.

Preview mode helps ensure correctness before deployment. Post-deployment,
monitor pipeline health, throughput, and metrics. Tags aid organization, and lineage
tracking allows insight into field transformations across datasets. Cloud Data Fusion
excels in batch data pipelines, with streaming capabilities discussed in future modules.
The tool's lineage tracking facilitates understanding and tracing data transformations.
5. Explore data using Wrangler
The Wrangler UI in Cloud Data Fusion serves as an environment for visually
exploring and analyzing new data sets before building pipelines. Starting from the left,
you can connect to various data sources like Google Cloud Storage or BigQuery. Once
connected, you can browse files and tables, inspecting the data visually and viewing
sample insights.
The Wrangler UI enables the addition of calculated fields, column drops, row
filters, and other data transformations through directives. Once satisfied with the
transformations, a pipeline can be created and scheduled for regular execution. This
tool is invaluable for exploring and understanding data sets before implementing
transformations.
6. Orchestrate work between Google Cloud services with Cloud Composer

Cloud Composer is a managed environment that runs the open-source workflow
tool Apache Airflow. Apache Airflow is an orchestration engine, and at the core of any
workflow in Airflow is a Directed Acyclic Graph (DAG). Similar to Cloud Data
Fusion, you build DAGs in Apache Airflow to orchestrate tasks across multiple Google
Cloud services.
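
A minimal Airflow DAG sketch, of the kind Cloud Composer runs, is shown below; the DAG id, schedule, and task commands are placeholders, and the tasks only echo what a real extract-and-load pair would do.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",          # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",             # scheduled (pull-style) execution
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'placeholder: copy source files to gs://example-bucket'",
    )
    load = BashOperator(
        task_id="load_to_bigquery",
        bash_command="echo 'placeholder: load files into example_dataset.raw_table'",
    )

    extract >> load   # the DAG edge: extract runs before load
```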
7. Workflow scheduling
In Cloud Composer and Apache Airflow environments, scheduling workflows is
crucial for automation. There are two primary ways to execute workflows:
• Scheduled Runs: Workflows can be set to run at specified intervals, such
as daily or weekly. This is suitable for tasks that need to occur regularly,
like data processing at specific times.
• Event-Driven (Trigger-Based) Runs: Workflows can be triggered by
events, such as the arrival of new data in Cloud Storage. This approach is
beneficial when the workflow needs to respond to external events
dynamically.
In the Airflow web UI, you can find existing workflows (DAGs) under the DAGs
tab. DAGs with set schedules will run periodically, while event-driven DAGs rely on
external triggers, like Cloud Functions.
Scheduling with Cloud Composer:
• For scheduled runs, you can specify the schedule_interval in your DAG
code.
• The Airflow UI lets you view the history of all runs for a workflow, but
you cannot edit the schedule directly from there.
Event-Driven (Push) vs. Scheduled (Pull) Architectures:

• Event-driven workflows (push) respond to external events, such as new
data arriving in Cloud Storage or Pub/Sub messages. Cloud Functions can
be used to trigger workflows based on these events.
• Scheduled workflows (pull) run at specified intervals, fetching data at
predetermined times. These are suitable when data arrives on a regular
schedule.
Cloud Functions for Event-Driven Workflows:
• Cloud Functions can be used to create event-driven workflows. For
example, a function can watch a Cloud Storage bucket for new CSV files
and trigger a workflow when a file is uploaded (see the sketch after this list).
• Push architectures are effective for scenarios where data arrival is
irregular, making them suitable for machine learning workflows.
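
Below is a hedged sketch of such a push trigger: a background Cloud Function that fires when an object is finalized in a bucket and starts a DAG run through the Airflow 2 stable REST API exposed by a Composer environment. The web server URL, DAG id, and auth flow are assumptions to verify against current Composer documentation.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

AIRFLOW_WEB_SERVER = "https://example-composer-airflow-webserver"   # placeholder
DAG_ID = "daily_batch_pipeline"                                     # placeholder


def on_new_file(event, context):
    """Background Cloud Function triggered by a Cloud Storage 'finalize' event."""
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)

    # Start a DAG run, passing the uploaded object's location as the run conf.
    response = session.post(
        f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"conf": {"bucket": event["bucket"], "object": event["name"]}},
    )
    response.raise_for_status()
```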
Monitoring and Logging:
• Once workflows are automated, monitoring and logging become crucial for
ensuring that everything is working as intended.
• The next topic covers monitoring and logging, which are essential aspects
of maintaining and troubleshooting workflows.
