PDE Course Workbook
Your Professional Data Engineer Journey
Course Workbook
Certification Exam Guide Sections
1 Designing Data Processing Systems
You are migrating on-premises data to a data warehouse on Google Cloud. This data will be made available to business analysts. Local regulations require that customer information including credit card numbers, phone numbers, and email IDs be captured, but not used in analysis. You need to use a reliable, recommended solution to redact the sensitive data.

What should you do?

A. Use the Cloud Data Loss Prevention API (DLP API) to identify and redact data that matches infoTypes like credit card numbers, phone numbers, and email IDs.
B. Delete all columns with a title similar to "credit card," "phone," and "email."
C. Create a regular expression to identify and delete patterns that resemble credit card numbers, phone numbers, and email IDs.
D. Use the Cloud Data Loss Prevention API (DLP API) to perform date shifting of any entries with credit card numbers, phone numbers, and email IDs.
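For hands-on context, this is a minimal sketch of what option A could look like with the google-cloud-dlp Python client; the project ID and sample text are placeholders, not part of the question.

```python
from google.cloud import dlp_v2

def redact_sensitive_text(project_id: str, text: str) -> str:
    """Replaces credit card numbers, phone numbers, and email addresses with their infoType names."""
    client = dlp_v2.DlpServiceClient()

    inspect_config = {
        "info_types": [
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "PHONE_NUMBER"},
            {"name": "EMAIL_ADDRESS"},
        ]
    }
    # Replace each finding with its infoType name, e.g. "[EMAIL_ADDRESS]".
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }

    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value

# Example call with placeholder project and text:
print(redact_sensitive_text("my-project", "Call 555-0100 or email jane@example.com"))
```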
1.1 Diagnostic Question 04
Your data and applications reside in multiple geographies on Google Cloud. Some regional laws require you to hold your own keys outside of the cloud provider environment, whereas other laws are less restrictive and allow storing keys with the same provider who stores the data. The management of these keys has increased in complexity, and you need a solution that can centrally manage all your keys.

What should you do?

A. Enable confidential computing for all your virtual machines.
B. Store keys in Cloud Key Management Service (Cloud KMS), and reduce the number of days for automatic key rotation.
C. Store your keys in Cloud Hardware Security Module (Cloud HSM), and retrieve keys from it when required.
D. Store your keys on a supported external key management partner, and use Cloud External Key Manager (Cloud EKM) to get keys when required.
1.1 Designing for security and compliance
Courses
Modernizing Data Lakes and Data Warehouses with Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
● Building a Data Warehouse
Smart Analytics, Machine Learning, and AI on Google Cloud
● Prebuilt ML Model APIs for Unstructured Data
Serverless Data Processing with Dataflow: Foundations
● IAM, Quotas, and Permissions
● Security
BigQuery Fundamentals for Redshift Professionals
● BigQuery and Google Cloud IAM

Skill Badges
Implement Load Balancing on Compute Engine
Prepare Data for ML APIs on Google Cloud

Documentation
Import data from Google Cloud into a secured BigQuery data warehouse
IAM basic and predefined roles reference
Creating and managing Folders
Resource hierarchy
Sensitive Data Protection InfoType detector reference
Cloud External Key Manager
Hold your own key with Google Cloud External Key Manager
Evolving Cloud External Key Manager – What’s new with Cloud EKM | Google Cloud Blog
1.2 Diagnostic Question 05
Cymbal Retail has a team of business analysts who need to fix and enhance a set of large input data files. For example, duplicates need to be removed, erroneous rows should be deleted, and missing data should be added. These steps need to be performed on all the present set of files and any files received in the future in a repeatable, automated process. The business analysts are not adept at programming.

A. Load the data into Dataprep, explore the data, and edit the transformations as needed.
B. Create a Dataproc job to perform the data fixes you need.
C. Create a Dataflow pipeline with the data fixes you need.
D. Load the data into Google Sheets, explore the data, and fix the data as needed.
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a Data Warehouse
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Building Resilient Streaming Analytics Systems on Google Cloud
● Serverless Messaging with Pub/Sub
Serverless Data Processing with Dataflow: Develop Pipelines
● Best Practices
Serverless Data Processing with Dataflow: Operations
● Monitoring
● Logging and Error Reporting
● Troubleshooting and Debug
● Testing and CI/CD
● Reliability

Skill Badges
Prepare Data for ML APIs on Google Cloud
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Dataprep Basics
Dataprep Wrangle Language
Monitoring pipeline performance using Cloud Profiler | Dataflow
1.3 Diagnostic Question 07
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines
Serverless Data Processing with Dataflow: Foundations
● Beam Portability

Skill Badges
Get Started with Dataplex

Documentation
Dataproc best practices | Google Cloud Blog
HDFS vs. Cloud Storage: Pros, cons and migration tips | Google Cloud Blog
Dataplex overview
1.4 Diagnostic Question 09
Your data engineering team receives data in JSON format from external sources at the end of each day. You need to design the data pipeline.

What should you do?

A. Store the data in Cloud Storage and create an extract, transform, and load (ETL) pipeline.
B. Make your BigQuery data warehouse public and ask the external sources to insert the data.
C. Create a public API to allow external applications to add the data to your warehouse.
D. Store the data in persistent disks and create an ETL pipeline.
2.1 Diagnostic Question 02
You are processing large amounts of input data in BigQuery. You need to combine this data with a small amount of frequently changing data that is available in Cloud SQL.

What should you do?

A. Copy the data from Cloud SQL to a new BigQuery table hourly.
B. Copy the data from Cloud SQL and create a combined, normalized table hourly.
C. Use a federated query to get data from Cloud SQL.
D. Create a Dataflow pipeline to combine the BigQuery and Cloud SQL data when the Cloud SQL data changes.
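As a study aid, the sketch below shows how a federated query (option C) could be issued from the BigQuery Python client. It assumes a Cloud SQL connection resource named us.my_cloudsql_conn and illustrative table and column names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY runs the inner statement in Cloud SQL and joins the live
# result with the large table already stored in BigQuery.
sql = """
SELECT o.order_id, o.total, d.discount_pct
FROM `my_dataset.orders` AS o
JOIN EXTERNAL_QUERY(
  'us.my_cloudsql_conn',
  'SELECT order_id, discount_pct FROM live_discounts;'
) AS d
USING (order_id)
"""

for row in client.query(sql).result():
    print(row.order_id, row.total, row.discount_pct)
```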
2.1 Planning the data pipelines
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
● Building a Data Warehouse
Building Batch Data Pipelines on Google Cloud
● Executing Spark on Dataproc
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Building Resilient Streaming Analytics Systems on Google Cloud
● High-Throughput BigQuery and Bigtable Streaming Features
Serverless Data Processing with Dataflow: Develop Pipelines
● Beam Concepts Review
● Sources and Sinks
● Schemas

Skill Badges
Prepare Data for ML APIs on Google Cloud
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
What Data Pipeline Architecture should I use? | Google Cloud Blog
Bigtable overview
Cloud SQL federated queries | BigQuery
Exploring new features in BigQuery federated queries | Google Cloud Blog
2.2 Diagnostic Question 04
You manage a PySpark batch data pipeline by using Dataproc. You want to take a hands-off approach to running the workload, and you do not want to provision and manage your own cluster.

A. Configure the job to run on Dataproc Serverless.
B. Configure the job to run with Spot VMs.
C. Rewrite the job in Spark SQL.
D. Rewrite the job in Dataflow with SQL.
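For context, this is a hedged sketch of submitting an existing PySpark job as a Dataproc Serverless batch (option A); the project, region, and Cloud Storage paths are assumptions for illustration only.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Point the batch at the existing PySpark entry point; no cluster is
# provisioned or managed by you.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/pipeline.py",
    )
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",
    batch=batch,
)
print("Batch finished in state:", operation.result().state)
```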
You need to run batch jobs, which could take many days to complete. You do not want to manage the infrastructure provisioning.

A. Use Cloud Scheduler to run the jobs.
B. Use Workflows to run the jobs.
C. Run the jobs on Batch.
D. Use Cloud Run to run the jobs.
You are creating a data pipeline for streaming data on Dataflow for Cymbal Retail's point of sales data. You want to calculate the total sales per hour on a continuous basis.

A. Hopping windows (sliding windows in Apache Beam)
B. Session windows
C. Global window
D. Tumbling windows (fixed windows in Apache Beam)
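To make the windowing vocabulary concrete, here is a minimal Apache Beam (Python SDK) sketch that sums sales per hour with fixed windows. The Pub/Sub topic and message format are assumptions, not Cymbal Retail's actual feed.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadSales" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sales")
        | "ParseAmount" >> beam.Map(lambda msg: float(msg.decode("utf-8")))
        # Fixed (tumbling) one-hour windows: each element lands in exactly one window.
        | "HourlyWindow" >> beam.WindowInto(FixedWindows(60 * 60))
        | "SumPerHour" >> beam.CombineGlobally(sum).without_defaults()
        | "Print" >> beam.Map(print)
    )
```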
Courses
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines
● Executing Spark on Dataproc
● Serverless Data Processing with Dataflow
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Building Resilient Streaming Analytics Systems on Google Cloud
● Serverless Messaging with Pub/Sub
● Dataflow Streaming Features
Serverless Data Processing with Dataflow: Foundations
● Separating Compute and Storage with Dataflow
Serverless Data Processing with Dataflow: Develop Pipelines
● Windows, Watermarks, and Triggers
● States and Timers
● Dataflow SQL and DataFrames
Serverless Data Processing with Dataflow: Operations
● Performance
● Testing and CI/CD
● Flex Templates

Skill Badges
Prepare Data for ML APIs on Google Cloud

Documentation
Data Fusion overview
What is Dataproc Serverless?
Introduction to Google Batch
Get started with Batch | Google Cloud
Streaming pipelines | Cloud Dataflow
Basics of the Beam model
Streaming analytics solutions | Google Cloud
2.3 Diagnostic Question 09
Courses
Building Batch Data Pipelines on Google Cloud
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Serverless Data Processing with Dataflow: Operations
● Testing and CI/CD

Skill Badges
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
How to use Cloud Composer for data orchestration
Cloud Composer overview
Use a CI/CD pipeline for data-processing workflows | Google Cloud
Section 3:
Storing the Data
3.1 Diagnostic Question 01
Courses
Google Cloud Big Data and Machine Learning Fundamentals
● Big Data and Machine Learning on Google Cloud
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
● Building a Data Warehouse
Building Resilient Streaming Analytics Systems on Google Cloud
● High-Throughput BigQuery and Bigtable Streaming Features

Documentation
Cloud SQL for MySQL, PostgreSQL, and SQL Server
What is Cloud SQL?
Storage classes | Google Cloud
3.2 Diagnostic Question 03
You have several large tables in your transaction databases. You need to move all the data to BigQuery for the business analysts to explore and analyze the data.

How should you design the schema in BigQuery?

A. Retain the data on BigQuery with the same schema as the source.
B. Combine all the transactional database tables into a single table using outer joins.
C. Redesign the schema to normalize the data by removing all redundancies.
D. Redesign the schema to denormalize the data with nested and repeated data.
3.2 Diagnostic Question 04
You are ingesting data that is spread out over a wide range of dates into BigQuery at a fast rate. You need to partition the table to make queries performant.

What should you do?

A. Create an ingestion-time partitioned table with daily partitioning type.
B. Create an ingestion-time partitioned table with yearly partitioning type.
C. Create an integer-range partitioned table.
D. Create a time-unit column-partitioned table with yearly partitioning type.
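For reference, the sketch below declares a time-unit column-partitioned table with yearly granularity using the BigQuery Python client. The project, dataset, and column names are placeholders, and this illustrates the mechanics rather than serving as an answer key.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.my_dataset.sales_history",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
# Partition on the order_date column by year instead of by ingestion time.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.YEAR,
    field="order_date",
)
client.create_table(table)
```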
3.2 Diagnostic Question 05
Your analysts repeatedly run the same complex queries that combine and filter through a lot of data on BigQuery. The data changes frequently. You need to reduce the effort for the analysts.

What should you do?

A. Create a dataset with the data that is frequently queried.
B. Create a view of the frequently queried data.
C. Export the frequently queried data into a new table.
D. Export the frequently queried data into Cloud SQL.
3.2 Planning for using a data warehouse
You have data that is ingested daily and frequently analyzed in the first month. Thereafter, the data is retained only for audits, which happen occasionally every few years. You need to configure cost-effective storage.

What should you do?

A. Create a bucket on Cloud Storage with object versioning configured.
B. Create a bucket on Cloud Storage with Autoclass configured.
C. Configure a data retention policy on Cloud Storage.
D. Configure a lifecycle policy on Cloud Storage.
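As a hands-on aside, this is a minimal sketch of a lifecycle policy applied with the Cloud Storage Python client; the bucket name and the 30- and 365-day thresholds are assumptions chosen only to illustrate the mechanics.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("cymbal-audit-data")  # placeholder bucket name

# Move objects to colder storage classes as they age; analysis happens in the
# first month, so later reads are rare audit lookups.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
bucket.patch()
```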
3.3 Diagnostic Question 07
You have data stored in a Cloud Storage bucket. You are using both Identity and Access Management (IAM) and Access Control Lists (ACLs) to configure access control.

Which statement describes a user's access to objects in the bucket?

A. The user has no access if IAM denies the permission.
B. The user only has access if both IAM and ACLs grant a permission.
C. The user has access if either IAM or ACLs grant a permission.
D. The user has no access if either IAM or ACLs deny a permission.
3.3 Diagnostic Question 08
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a Data Lake

Documentation
Cloud Storage
Object Lifecycle Management | Cloud Storage
Overview of access control | Cloud Storage
Cloud Audit Logs with Cloud Storage | Google Cloud
3.4 Diagnostic Question 09
Cymbal Retail has accumulated a large amount of data. Analysts and leadership are finding it difficult to understand the meaning of the data, such as BigQuery columns. Users of the data don't know who owns what. You need to improve the searchability of the data.

What should you do?

A. Create tags for data entries in Data Catalog.
B. Rename BigQuery columns with more descriptive names.
C. Export the data to Cloud Storage with descriptive file names.
D. Add a description column corresponding to each data column.
3.4 Diagnostic Question 10
You have large amounts of data stored on Cloud Storage and BigQuery. Some of it is processed, but some is yet unprocessed. You have a data mesh created in Dataplex. You need to make it convenient for internal users of the data to discover and use the data.

What should you do?

A. Create a lake for Cloud Storage data and a zone for BigQuery data.
B. Create a lake for BigQuery data and a zone for Cloud Storage data.
C. Create a lake for unprocessed data and assets for processed data.
D. Create a raw zone for the unprocessed data and a curated zone for the processed data.
3.4 Designing for a data mesh
You have data in PostgreSQL that was designed to reduce redundancy. You are transferring this data to BigQuery for analytics. The source data is hierarchical and frequently queried together. You need to design a BigQuery schema that is performant.

A. Use nested and repeated fields.
B. Retain the data in normalized form always.
C. Copy the primary tables and use federated queries for secondary tables.
D. Copy the normalized data into partitions.
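To illustrate what nested and repeated fields look like in practice, here is a small sketch using the BigQuery Python client. The orders/line-items hierarchy and all names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer", "STRING"),
    # Each order embeds its child rows instead of joining a separate table.
    bigquery.SchemaField(
        "line_items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]

client.create_table(bigquery.Table("my-project.analytics.orders", schema=schema))
```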
Your data in BigQuery has some columns that are extremely sensitive. You need to enable only some users to see certain columns.

What should you do?

A. Create a new dataset with the column's data.
B. Create a new table with the column's data.
C. Use policy tags.
D. Use Identity and Access Management (IAM) permissions.
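For context, the sketch below attaches a policy tag to a single column with the BigQuery Python client. The taxonomy resource name is a placeholder that would first be created in Data Catalog, and the column names are hypothetical.

```python
from google.cloud import bigquery
from google.cloud.bigquery.schema import PolicyTagList

client = bigquery.Client()

# Placeholder policy tag created beforehand in a Data Catalog taxonomy.
sensitive_tag = PolicyTagList(
    names=["projects/my-project/locations/us/taxonomies/1234/policyTags/5678"]
)

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    # Only principals granted access on this policy tag can read the column.
    bigquery.SchemaField("national_id", "STRING", policy_tags=sensitive_tag),
]

client.create_table(bigquery.Table("my-project.analytics.customers", schema=schema))
```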
4.2 Diagnostic Question 07
Your business has collected industry-relevant data over many years. The processed data is useful for your partners and they are willing to pay for its usage. You need to ensure proper access control over the data.

A. Export the data to zip files and share it through Cloud Storage.
B. Host the data on Analytics Hub.
C. Export the data to persistent disks and share it through an FTP endpoint.
D. Host the data on Cloud SQL.
You built machine learning (ML) models based on your own data. In production, the ML models are not giving satisfactory results. When you examine the data, it appears that the existing data is not sufficiently representing the business goals. You need to create a more accurate machine learning model.

A. Train the model with more of similar data.
B. Perform L2 regularization.
C. Perform feature engineering, and use domain knowledge to enhance the column data.
D. Train the model with the same data, but use more epochs.
You used Dataplex to create lakes and zones for your business data. However, some files are not being discovered.

What could be the issue?

A. You have an exclude pattern that matches the files.
B. You have scheduled discovery to run every hour.
C. The files are in ORC format.
D. The files are in Parquet format.
4.3 Exploring and analyzing data
Courses
Google Cloud Big Data and Machine Learning Fundamentals
● Big Data with BigQuery
● The machine learning workflow with Vertex AI
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines
Smart Analytics, Machine Learning, and AI on Google Cloud
● Custom model building with SQL in BigQuery ML

Skill Badges
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Use the BigQuery ML TRANSFORM clause for feature engineering | Google Cloud
Feature preprocessing overview | BigQuery | Google Cloud
Discover data | Dataplex | Google Cloud
Section 5:
Maintaining and
Automating Data Workloads
5.1 Diagnostic Question 01
You need to design a Dataproc cluster to run multiple small jobs. Many jobs (but not all) are of high priority.

What should you do?

A. Reuse the same cluster and run each job in sequence.
B. Reuse the same cluster to run all jobs in parallel.
C. Use ephemeral clusters.
D. Use cluster autoscaling.
5.1 Optimizing resources
You need to create repeatable data processing tasks by using Cloud Composer. You need to follow best practices and recommended approaches.

What should you do?

A. Write each task to be responsible for one operation.
B. Use current time with the now() function for computation.
C. Update data with INSERT statements during the task run.
D. Combine multiple functionalities in a single task execution.
5.2 Designing automation and repeatability
Courses
Building Batch Data Pipelines on Google Cloud
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Serverless Data Processing with Dataflow: Develop Pipelines
● Best Practices

Skill Badges
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Write Airflow DAGs | Cloud Composer
DAGs — Airflow Documentation
DAG writing best practices in Apache Airflow | Astronomer Documentation
5.3 Diagnostic Question 03
You have a team of data analysts that run queries interactively on BigQuery during work hours. You also have thousands of report generation queries that run simultaneously. You often see an error: Exceeded rate limits: too many concurrent queries for this project_and_region.

A. Run all queries in interactive mode.
B. Create a yearly reservation of BigQuery slots.
C. Run the report generation queries in batch mode.
D. Create a view to run the queries.
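For reference, this sketch shows what option C looks like with the BigQuery Python client: report queries submitted with batch priority queue up instead of competing for interactive concurrency. The SQL and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# BATCH priority queries start when idle resources are available and do not
# count against the interactive concurrent-query limit.
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

job = client.query(
    "SELECT store_id, SUM(amount) AS total FROM `analytics.sales` GROUP BY store_id",
    job_config=job_config,
)
print(job.result().total_rows, "rows in the report")
```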
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Warehouse
Building Resilient Streaming Analytics Systems on Google Cloud
● Advanced BigQuery Functionality and Performance

Documentation
Scale cloud data warehouse up and down quickly
Introduction to reservations | BigQuery | Google Cloud
Introduction to BigQuery editions | Google Cloud
Run a query | BigQuery | Google Cloud
Troubleshoot quota and limit errors | BigQuery | Google Cloud
5.4 Diagnostic Question 05
A colleague at Cymbal Retail asks you about the configuration of Dataproc autoscaling for a project.

What would be the Google-recommended situation when you should enable autoscaling?

A. When you want to scale on-cluster Hadoop Distributed File System (HDFS).
B. When you want to scale out single-job clusters.
C. When you want to down-scale idle clusters to minimum size.
D. When there are different size workloads on the cluster.
5.4 Monitoring and troubleshooting processes
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
Building Batch Data Pipelines on Google Cloud
● Executing Spark on Dataproc
Building Resilient Streaming Analytics Systems on Google Cloud
● Serverless Messaging with Pub/Sub
● Advanced BigQuery Functionality and Performance
Serverless Data Processing with Dataflow: Foundations
● IAM, Quotas, and Permissions
Serverless Data Processing with Dataflow: Develop Pipelines
● State and Timers
● Best Practices
Serverless Data Processing with Dataflow: Operations
● Monitoring
● Troubleshooting and Debug
● Reliability

Skill Badges
Prepare Data for ML APIs on Google Cloud

Documentation
Use Cloud Monitoring for Dataflow pipelines
Troubleshoot Dataflow errors | Google Cloud
Troubleshoot stragglers in batch jobs | Cloud Dataflow
Autoscaling clusters | Dataproc Documentation | Google Cloud
5.5 Diagnostic Question 08
You are running a Dataflow pipeline in production. The input data for this pipeline is occasionally inconsistent. Separately from processing the valid data, you want to efficiently capture the erroneous input data for analysis.

What should you do?

A. Re-read the input data and create separate outputs for valid and erroneous data.
B. Read the data once, and split it into two pipelines, one to output valid data and another to output erroneous data.
C. Check for the erroneous data in the logs.
D. Create a side output for the erroneous data.
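To make the side-output idea concrete, here is a minimal Apache Beam (Python SDK) sketch that routes unparseable records to a tagged output. The parsing rule and tag names are illustrative, not the production pipeline's.

```python
import apache_beam as beam
from apache_beam import pvalue

class ParseRecord(beam.DoFn):
    def process(self, line: str):
        try:
            yield float(line)  # valid records go to the main output
        except ValueError:
            # erroneous input goes to a separately tagged side output
            yield pvalue.TaggedOutput("errors", line)

with beam.Pipeline() as p:
    results = (
        p
        | "Read" >> beam.Create(["1.50", "oops", "2.25"])
        | "Parse" >> beam.ParDo(ParseRecord()).with_outputs("errors", main="valid")
    )
    results.valid | "SumValid" >> beam.CombineGlobally(sum) | "PrintSum" >> beam.Map(print)
    results.errors | "PrintErrors" >> beam.Map(lambda rec: print("bad record:", rec))
```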
5.5 Maintaining awareness of failures and mitigating impact
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a Data Lake
Serverless Data Processing with Dataflow: Develop Pipelines
● State and Timers
● Best Practices
Serverless Data Processing with Dataflow: Operations
● Troubleshooting and Debug
● Reliability

Documentation
Use Dataflow snapshots | Google Cloud
About high availability | Cloud SQL for MySQL
Design Your Pipeline
When will you take the exam?
Now, consider what you’ve learned about your knowledge and skills
through the diagnostic questions in this course. You should have a
better understanding of what areas you need to focus on and what
resources are available.
Use the template that follows to plan your study goals for each week.
Consider:
● What exam guide section(s) or topic area(s) will you focus on?
● What courses (or specific modules) will help you learn more?
● What Skill Badges or labs will you work on for hands-on practice?
● What documentation links will you review?
● What additional resources will you use, such as sample questions?
● What will you do to prepare for the case studies?
You may do some or all of these study activities each week.
Example:
Courses/modules to complete: Modernizing Data Lakes and Data Warehouses with Google Cloud
● Building a Data Warehouse

Area(s) of focus:
Courses/modules to complete:
Skill Badges/labs to complete:
Documentation to review:
Additional study: