
Preparing for Your Professional Data Engineer Journey

Course Workbook
Certification Exam Guide Sections
1 Designing Data Processing Systems

2 Ingesting and Processing the Data

3 Storing the Data

4 Preparing and Using Data for Analysis

5 Maintaining and Automating Data Workloads


Section 1:
Designing Data Processing
Systems
1.1 Diagnostic Question 01

Business analysts in your team need to run analysis on data that was loaded into BigQuery. You need to follow recommended practices and grant permissions. What role should you grant the business analysts?

A. bigquery.resourceViewer and bigquery.dataViewer
B. bigquery.user and bigquery.dataViewer
C. bigquery.dataOwner
D. storage.objectViewer and bigquery.user
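For hands-on context, the snippet below is a minimal sketch of granting read access at the dataset level with the BigQuery Python client. The project ID, dataset name, and analyst group are placeholders; dataset-level "READER" access corresponds to the bigquery.dataViewer role, and any project-level roles such as bigquery.user would still be granted separately through IAM.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # placeholder project ID
dataset = client.get_dataset("my-project.sales_data")   # placeholder dataset

# Dataset-level "READER" access corresponds to the bigquery.dataViewer role.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="business-analysts@example.com",       # placeholder group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```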
1.1 Diagnostic Question 02
Cymbal Retail has acquired another company in Europe. Data access permissions and policies in this new region differ from those in Cymbal Retail’s headquarters, which is in North America. You need to define a consistent set of policies for projects in each region that follow recommended practices. What should you do?

A. Create a new organization for all projects in Europe and assign policies in each organization that comply with regional laws.
B. Implement a flat hierarchy, and assign policies to each project according to its region.
C. Create top level folders for each region, and assign policies at the folder level.
D. Implement policies at the resource level that comply with regional laws.
1.1 Diagnostic Question 03

You are migrating on-premises data to a data warehouse on Google Cloud. This data will be made available to business analysts. Local regulations require that customer information including credit card numbers, phone numbers, and email IDs be captured, but not used in analysis. You need to use a reliable, recommended solution to redact the sensitive data. What should you do?

A. Use the Cloud Data Loss Prevention API (DLP API) to identify and redact data that matches infoTypes like credit card numbers, phone numbers, and email IDs.
B. Delete all columns with a title similar to "credit card," "phone," and "email."
C. Create a regular expression to identify and delete patterns that resemble credit card numbers, phone numbers, and email IDs.
D. Use the Cloud Data Loss Prevention API (DLP API) to perform date shifting of any entries with credit card numbers, phone numbers, and email IDs.
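As background on infoType-based redaction, here is a minimal, illustrative sketch that de-identifies free text with the DLP API Python client by replacing matched infoTypes with their names. The project ID and the input string are placeholders.

```python
import google.cloud.dlp_v2 as dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # placeholder project

inspect_config = {
    "info_types": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "PHONE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
    ]
}
# Replace each finding with its infoType name, e.g. [EMAIL_ADDRESS].
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}
item = {"value": "Call 555-0100 or email jane@example.com"}  # sample input

response = client.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)
```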
1.1 Diagnostic Question 04
Your data and applications reside in multiple geographies on Google Cloud. Some regional laws require you to hold your own keys outside of the cloud provider environment, whereas other laws are less restrictive and allow storing keys with the same provider who stores the data. The management of these keys has increased in complexity, and you need a solution that can centrally manage all your keys. What should you do?

A. Enable confidential computing for all your virtual machines.
B. Store keys in Cloud Key Management Service (Cloud KMS), and reduce the number of days for automatic key rotation.
C. Store your keys in Cloud Hardware Security Module (Cloud HSM), and retrieve keys from it when required.
D. Store your keys with a supported external key management partner, and use Cloud External Key Manager (Cloud EKM) to get keys when required.
1.1 Designing for security and compliance
Courses
Modernizing Data Lakes and Data Warehouses with Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
● Building a Data Warehouse
Smart Analytics, Machine Learning, and AI on Google Cloud
● Prebuilt ML Model APIs for Unstructured Data
Serverless Data Processing with Dataflow: Foundations
● IAM, Quotas, and Permissions
● Security
BigQuery Fundamentals for Redshift Professionals
● BigQuery and Google Cloud IAM

Skill Badges
Implement Load Balancing on Compute Engine
Prepare Data for ML APIs on Google Cloud

Documentation
Import data from Google Cloud into a secured BigQuery data warehouse
IAM basic and predefined roles reference
Creating and managing Folders
Resource hierarchy
Sensitive Data Protection InfoType detector reference
Cloud External Key Manager
Hold your own key with Google Cloud External Key Manager
Evolving Cloud External Key Manager – What’s new with Cloud EKM | Google Cloud Blog
1.2 Diagnostic Question 05

Cymbal Retail has a team of business analysts who need to fix and enhance a set of large input data files. For example, duplicates need to be removed, erroneous rows should be deleted, and missing data should be added. These steps need to be performed on the present set of files and on any files received in the future, in a repeatable, automated process. The business analysts are not adept at programming. What should they do?

A. Load the data into Dataprep, explore the data, and edit the transformations as needed.
B. Create a Dataproc job to perform the data fixes you need.
C. Create a Dataflow pipeline with the data fixes you need.
D. Load the data into Google Sheets, explore the data, and fix the data as needed.


1.2 Diagnostic Question 06

You have a Dataflow pipeline that runs data processing jobs. You need to identify the parts of the pipeline code that consume the most resources. What should you do?

A. Use Cloud Monitoring
B. Use Cloud Logging
C. Use Cloud Profiler
D. Use Cloud Audit Logs
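For context, Cloud Profiler can be turned on for a Dataflow pipeline through a service option at launch time. The sketch below shows one way to pass that option from a Python Beam pipeline; the project, region, bucket, and job name are placeholders, and it assumes the documented enable_google_cloud_profiler service option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    job_name="profiled-job",
    dataflow_service_options=["enable_google_cloud_profiler"],  # turn on Cloud Profiler
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create(["a", "b", "c"])
        | "Upper" >> beam.Map(str.upper)
    )
```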
1.2 Designing for reliability and fidelity
Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a Data Warehouse
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Building Resilient Streaming Analytics Systems on Google Cloud
● Serverless Messaging with Pub/Sub
Serverless Data Processing with Dataflow: Develop Pipelines
● Best Practices
Serverless Data Processing with Dataflow: Operations
● Monitoring
● Logging and Error Reporting
● Troubleshooting and Debug
● Testing and CI/CD
● Reliability

Skill Badges
Prepare Data for ML APIs on Google Cloud
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Dataprep Basics
Dataprep Wrangle Language
Monitoring pipeline performance using Cloud Profiler | Dataflow
1.3 Diagnostic Question 07

You are using Dataproc to process a large number of CSV files. The storage option you choose needs to be flexible to serve many worker nodes in multiple clusters. These worker nodes will read the data and also write to it for intermediate storage between processing jobs. What is the recommended storage option on Google Cloud?

A. Cloud SQL
B. Zonal persistent disks
C. Local SSD
D. Cloud Storage
1.3 Diagnostic Question 08
You are managing the data for Cymbal Retail, which consists of multiple teams including retail, sales, marketing, and legal. These teams are consuming data from multiple producers including point of sale systems, industry data, orders, and more. Currently, teams that consume data have to repeatedly ask the teams that produce it to verify the most up-to-date data and to clarify other questions about the data, such as source and ownership. This process is unreliable and time-consuming and often leads to repeated escalations. You need to implement a centralized solution that gains a unified view of the organization's data and improves searchability. What should you do?

A. Implement a data mesh with Dataplex and have producers tag data when created.
B. Implement a data lake with Cloud Storage, and create buckets for each team such as retail, sales, and marketing.
C. Implement a data warehouse by using BigQuery, and create datasets for each team such as retail, sales, and marketing.
D. Implement Looker dashboards that provide views of the data that meet each team's requirements.
1.3 Designing for flexibility and portability

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines
Serverless Data Processing with Dataflow: Foundations
● Beam Portability

Skill Badges
Get Started with Dataplex

Documentation
Dataproc best practices | Google Cloud Blog
HDFS vs. Cloud Storage: Pros, cons and migration tips | Google Cloud Blog
Dataplex overview
1.4 Diagnostic Question 09

Laws in the region where you operate require that files related to all orders made each day are stored immutably for 365 days. The solution that you recommend has to be cost-effective. What should you do?

A. Store the data in a Cloud Storage bucket, and enable object versioning and delete any version older than 365 days.
B. Store the data in a Cloud Storage bucket, and specify a retention period.
C. Store the data in a Cloud Storage bucket, and set a lifecycle policy to delete the file after 365 days.
D. Store the data in a Cloud Storage bucket, enable object versioning, and delete any version greater than 365.
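For reference, a bucket retention policy can be set in a few lines with the Cloud Storage Python client. This is only a sketch with a placeholder bucket name; a retention policy can additionally be locked to make it permanent.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("orders-archive")        # placeholder bucket name

# Objects cannot be deleted or overwritten until they are 365 days old.
bucket.retention_period = 365 * 24 * 60 * 60         # seconds
bucket.patch()

# Optionally lock the policy so the retention period can no longer be removed or reduced.
# bucket.lock_retention_policy()
```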
1.4 Diagnostic Question 10

Cymbal Retail is migrating its private data centers to Google Cloud. Over many years, hundreds of terabytes of data were accumulated. You currently have a 100 Mbps line, and you need to transfer this data reliably before commencing operations on Google Cloud in 45 days. What should you do?

A. Store the data in an HTTPS endpoint, and configure Storage Transfer Service to copy the data to Cloud Storage.
B. Upload the data to Cloud Storage by using gcloud storage.
C. Zip and upload the data to Cloud Storage buckets by using the Google Cloud console.
D. Order a Transfer Appliance, export the data to it, and ship it to Google.
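A quick back-of-the-envelope estimate is useful for scenarios like this one. The sketch below computes the transfer time for an assumed 300 TB (only an example of "hundreds of terabytes") over a fully utilized 100 Mbps link.

```python
# Rough transfer-time estimate: data size in bits divided by line rate in bits per second.
data_tb = 300                          # assumed size; "hundreds of terabytes"
data_bits = data_tb * 1e12 * 8         # terabytes -> bits
line_bps = 100e6                       # 100 Mbps, assuming 100% utilization

seconds = data_bits / line_bps
days = seconds / 86400
print(f"{days:.0f} days")              # roughly 278 days, far beyond the 45-day window
```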
1.4 Designing data migrations

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a Data Lake
● Building a Data Warehouse
BigQuery Fundamentals for Redshift Professionals
● BigQuery and Google Cloud IAM

Documentation
Retention policies and retention policy locks | Cloud Storage
Migration to Google Cloud: Transferring your large datasets
Section 2:
Ingesting and
Processing the Data
2.1 Diagnostic Question 01

Your data engineering team receives data in JSON format from external sources at the end of each day. You need to design the data pipeline. What should you do?

A. Store the data in Cloud Storage and create an extract, transform, and load (ETL) pipeline.
B. Make your BigQuery data warehouse public and ask the external sources to insert the data.
C. Create a public API to allow external applications to add the data to your warehouse.
D. Store the data in persistent disks and create an ETL pipeline.
2.1 Diagnostic Question 02

The first stage of your data pipeline processes tens of terabytes of financial data and creates a sparse, time-series dataset as a key-value pair. Which of these is a suitable sink for the pipeline's first stage?

A. Cloud Storage
B. Cloud SQL
C. AlloyDB
D. Bigtable
2.1 Diagnostic Question 03

You are processing large amounts of input data in BigQuery. You need to combine this data with a small amount of frequently changing data that is available in Cloud SQL. What should you do?

A. Copy the data from Cloud SQL to a new BigQuery table hourly.
B. Copy the data from Cloud SQL and create a combined, normalized table hourly.
C. Use a federated query to get data from Cloud SQL.
D. Create a Dataflow pipeline to combine the BigQuery and Cloud SQL data when the Cloud SQL data changes.
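For context on federated queries, the sketch below runs EXTERNAL_QUERY against a Cloud SQL connection from the BigQuery Python client. The connection ID, table names, and join columns are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY runs the inner statement in Cloud SQL and joins the live
# result with BigQuery data, so no copy of the Cloud SQL table is needed.
sql = """
SELECT t.order_id, t.amount, c.segment
FROM `my-project.sales.transactions` AS t
JOIN EXTERNAL_QUERY(
  'my-project.us.crm-connection',              -- placeholder connection ID
  'SELECT customer_id, segment FROM customers'
) AS c
ON t.customer_id = c.customer_id
"""

for row in client.query(sql).result():
    print(row.order_id, row.segment)
```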
2.1 Planning the data pipelines

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Lake
● Building a Data Warehouse
Building Batch Data Pipelines on Google Cloud
● Executing Spark on Dataproc
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Building Resilient Streaming Analytics Systems on Google Cloud
● High-Throughput BigQuery and Bigtable Streaming Features
Serverless Data Processing with Dataflow: Develop Pipelines
● Beam Concepts Review
● Sources and Sinks
● Schemas

Skill Badges
Prepare Data for ML APIs on Google Cloud
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
What Data Pipeline Architecture should I use? | Google Cloud Blog
Bigtable overview
Cloud SQL federated queries | BigQuery
Exploring new features in BigQuery federated queries | Google Cloud Blog
2.2 Diagnostic Question 04

Your company has multiple data analysts but a limited data engineering team. You need to choose a tool where the analysts can build data pipelines themselves with a graphical user interface. Which of these products is the most appropriate?

A. Dataflow
B. Cloud Data Fusion
C. Dataproc
D. Cloud Composer
2.2 Diagnostic Question 05

You manage a PySpark batch data pipeline by using Dataproc. You want to take a hands-off approach to running the workload, and you do not want to provision and manage your own cluster. What should you do?

A. Configure the job to run on Dataproc Serverless.
B. Configure the job to run with Spot VMs.
C. Rewrite the job in Spark SQL.
D. Rewrite the job in Dataflow with SQL.


2.2 Diagnostic Question 06

You need to run batch jobs, which could take many days to complete. You do not want to manage the infrastructure provisioning. What should you do?

A. Use Cloud Scheduler to run the jobs.
B. Use Workflows to run the jobs.
C. Run the jobs on Batch.
D. Use Cloud Run to run the jobs.


2.2 Diagnostic Question 07

You are creating a data pipeline for streaming data on Dataflow for Cymbal Retail's point of sale data. You want to calculate the total sales per hour on a continuous basis. Which of these windowing options should you use?

A. Hopping windows (sliding windows in Apache Beam)
B. Session windows
C. Global window
D. Tumbling windows (fixed windows in Apache Beam)
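To ground the windowing terminology used in the options, here is a small, illustrative Beam sketch that applies a one-hour fixed (tumbling) window and, for comparison, a sliding (hopping) window. The element structure and timestamps are placeholders.

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    sales = (
        p
        | beam.Create([("store-1", 20.0), ("store-1", 15.0)])              # placeholder elements
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1700000000))     # placeholder event time
    )

    # Tumbling (fixed) windows: one non-overlapping total per hour.
    hourly_totals = (
        sales
        | "FixedWindow" >> beam.WindowInto(window.FixedWindows(60 * 60))
        | "SumFixed" >> beam.CombinePerKey(sum)
    )

    # Hopping (sliding) windows: an hourly total recomputed every 10 minutes.
    sliding_totals = (
        sales
        | "SlidingWindow" >> beam.WindowInto(window.SlidingWindows(60 * 60, 10 * 60))
        | "SumSliding" >> beam.CombinePerKey(sum)
    )
```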
2.2 Diagnostic Question 08

You want to build a streaming data analytics pipeline in Google Cloud. You need to choose the right products that support streaming data. Which of these would you choose?

A. Pub/Sub, Dataflow, BigQuery
B. Pub/Sub, Dataprep, BigQuery
C. Cloud Storage, Dataflow, Cloud SQL
D. Cloud Storage, Dataprep, AlloyDB
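One of the combinations above corresponds to a common streaming pattern: messages arrive on Pub/Sub, are processed by Dataflow, and land in BigQuery. Below is a minimal, illustrative Beam sketch of that pattern; the subscription, table, schema, and message format are placeholders.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)   # plus the usual Dataflow options in practice

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sales-sub")   # placeholder
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:retail.sales_events",                             # placeholder table
            schema="store_id:STRING,amount:FLOAT,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```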


2.2 Building the pipelines
Courses Skill Badges Documentation

Building Batch Data Pipelines on Google Cloud Prepare Data for ML APIs on Google Cloud Data Fusion overview
● Introduction to Building Batch Data Pipelines Cloud
● Executing Spark on Dataproc What is Dataproc Serverless?
● Serverless Data Processing with Dataflow Introduction to Google Batch
● Manage Data Pipelines with Cloud Data
Fusion and Cloud Compose Get started with Batch | Google Cloud
Building Resilient Streaming Analytics Systems on
Streaming pipelines | Cloud Dataflow
Google Cloud
● Serverless Messaging with Pub/Sub Basics of the Beam model
● Dataflow Streaming Features
Serverless Data Processing with Dataflow: Streaming analytics solutions | Google Cloud
Foundations
● Separating Compute and Storage with
Dataflow
Serverless Data Processing with Dataflow:
Develop Pipelines
● Windows, Watermarks, and Triggers
● States and Timers
● Dataflow SQL and DataFrames
Serverless Data Processing with Dataflow:
Operations
● Performance
● Testing and CI/CD
● Flex Templates
2.3 Diagnostic Question 09

You have a data pipeline that requires you to monitor a Cloud Storage bucket for a file, start a Dataflow job to process data in the file, run a shell script to validate the processed data in BigQuery, and then delete the original file. You need to orchestrate this pipeline by using recommended tools. Which product should you choose?

A. Cloud Tasks
B. Cloud Composer
C. Cloud Scheduler
D. Cloud Run


2.3 Diagnostic Question 10

You are running Dataflow jobs for data processing. When developers update the code in Cloud Source Repositories, you need to test and deploy the updated code with minimal effort. Which of these would you use to build your continuous integration and delivery (CI/CD) pipeline for data processing?

A. Terraform
B. Compute Engine
C. Cloud Code
D. Cloud Build
2.3 Deploying and operationalizing the pipelines

Courses
Building Batch Data Pipelines on Google Cloud
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Serverless Data Processing with Dataflow: Operations
● Testing and CI/CD

Skill Badges
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
How to use Cloud Composer for data orchestration
Cloud Composer overview
Use a CI/CD pipeline for data-processing workflows | Google Cloud
Section 3:
Storing the Data
3.1 Diagnostic Question 01

You need to choose a data storage solution to support a transactional system. Your customers are primarily based in one region. You want to reduce your administration tasks and focus engineering effort on building your business application. What should you do?

A. Use Spanner.
B. Use Cloud SQL.
C. Install a database of your choice on a Compute Engine VM.
D. Create a regional Cloud Storage bucket.


3.1 Diagnostic Question 02

You need to store data long term and use it to create quarterly reports. What storage class should you choose?

A. Standard
B. Nearline
C. Coldline
D. Archive
3.1 Selecting storage systems

Courses
Google Cloud Big Data and Machine Learning Fundamentals
● Big Data and Machine Learning on Google Cloud
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to data engineering
● Building a data lake
● Building a data warehouse
Building Resilient Streaming Analytics Systems on Google Cloud
● High-Throughput BigQuery and Bigtable Streaming Features

Documentation
Cloud SQL for MySQL, PostgreSQL, and SQL Server
What is Cloud SQL?
Storage classes | Google Cloud
3.2 Diagnostic Question 03

You have several large tables in your transaction databases. You need to move all the data to BigQuery for the business analysts to explore and analyze the data. How should you design the schema in BigQuery?

A. Retain the data on BigQuery with the same schema as the source.
B. Combine all the transactional database tables into a single table using outer joins.
C. Redesign the schema to normalize the data by removing all redundancies.
D. Redesign the schema to denormalize the data with nested and repeated data.
3.2 Diagnostic Question 04

You are ingesting data that is spread out over a wide range of dates into BigQuery at a fast rate. You need to partition the table to make queries performant. What should you do?

A. Create an ingestion-time partitioned table with daily partitioning type.
B. Create an ingestion-time partitioned table with yearly partitioning type.
C. Create an integer-range partitioned table.
D. Create a time-unit column-partitioned table with yearly partitioning type.
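As a concrete reference for the partitioning options, the sketch below creates an ingestion-time partitioned table with the BigQuery Python client. The table ID and schema are placeholders, and the DAY granularity is used purely as an example; HOUR, MONTH, or YEAR are also available depending on how the data is spread.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.retail.orders_partitioned",       # placeholder table ID
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("amount", "FLOAT"),
    ],
)

# Ingestion-time partitioning: rows are placed in partitions based on load time
# because no partitioning column is specified.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)

client.create_table(table)
```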
3.2 Diagnostic Question 05

Your analysts repeatedly run the same complex queries that combine and filter through a lot of data on BigQuery. The data changes frequently. You need to reduce the effort for the analysts. What should you do?

A. Create a dataset with the data that is frequently queried.
B. Create a view of the frequently queried data.
C. Export the frequently queried data into a new table.
D. Export the frequently queried data into Cloud SQL.
3.2 Planning for using a data warehouse

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a data warehouse
Building Resilient Streaming Analytics Systems on Google Cloud
● Advanced BigQuery functionality and performance

Skill Badges
Build a Data Warehouse with BigQuery

Documentation
Introduction to optimizing query performance | BigQuery | Google Cloud
Introduction to partitioned tables | BigQuery | Google Cloud
Creating partitioned tables | BigQuery | Google Cloud
Introduction to views | BigQuery | Google Cloud
3.3 Diagnostic Question 06

You have data that is ingested daily and frequently analyzed in the first month. Thereafter, the data is retained only for audits, which happen occasionally every few years. You need to configure cost-effective storage. What should you do?

A. Create a bucket on Cloud Storage with object versioning configured.
B. Create a bucket on Cloud Storage with Autoclass configured.
C. Configure a data retention policy on Cloud Storage.
D. Configure a lifecycle policy on Cloud Storage.
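For reference, lifecycle rules of the kind described in one of the options can be set with the Cloud Storage Python client. This sketch uses a placeholder bucket and example ages, transitioning objects to colder storage classes as they age.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("daily-ingest")                    # placeholder bucket

# After 30 days move objects to Coldline; after 365 days move them to Archive.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
bucket.patch()
```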
3.3 Diagnostic Question 07

You have data stored in a Cloud Storage bucket. You are using both Identity and Access Management (IAM) and Access Control Lists (ACLs) to configure access control. Which statement describes a user's access to objects in the bucket?

A. The user has no access if IAM denies the permission.
B. The user only has access if both IAM and ACLs grant a permission.
C. The user has access if either IAM or ACLs grant a permission.
D. The user has no access if either IAM or ACLs deny a permission.
3.3 Diagnostic Question 08

A manager at Cymbal Retail expresses concern about unauthorized access to objects in your Cloud Storage bucket. You need to evaluate all access on all objects in the bucket. What should you do?

A. Review the Admin Activity audit logs.
B. Enable and then review the Data Access audit logs.
C. Route the Admin Activity logs to a BigQuery sink and analyze the logs with SQL queries.
D. Change the permissions on the bucket to only trusted employees.
3.3 Using a data lake

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a data lake

Documentation
Cloud Storage
Object Lifecycle Management | Cloud Storage
Overview of access control | Cloud Storage
Cloud Audit Logs with Cloud Storage | Google Cloud
3.4 Diagnostic Question 09

Cymbal Retail has accumulated a large amount of data. Analysts and leadership are finding it difficult to understand the meaning of the data, such as BigQuery columns. Users of the data don't know who owns what. You need to improve the searchability of the data. What should you do?

A. Create tags for data entries in Data Catalog.
B. Rename BigQuery columns with more descriptive names.
C. Export the data to Cloud Storage with descriptive file names.
D. Add a description column corresponding to each data column.
3.4 Diagnostic Question 10

You have large amounts of data stored on Cloud Storage and BigQuery. Some of it is processed, but some is yet unprocessed. You have a data mesh created in Dataplex. You need to make it convenient for internal users of the data to discover and use the data. What should you do?

A. Create a lake for Cloud Storage data and a zone for BigQuery data.
B. Create a lake for BigQuery data and a zone for Cloud Storage data.
C. Create a lake for unprocessed data and assets for processed data.
D. Create a raw zone for the unprocessed data and a curated zone for the processed data.
3.4 Designing for a data mesh

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to data engineering
Building Batch Data Pipelines on Google Cloud
● Introduction to building batch data pipelines

Skill Badges
Data Catalog Fundamentals
Get Started with Dataplex

Documentation
Tags and tag templates | Data Catalog Documentation | Google Cloud
Quickstart: Tag a BigQuery table by using Data Catalog
Dataplex overview | Google Cloud
Section 4:
Preparing and Using
Data for Analysis
4.1 Diagnostic Question 01

Your company uses Google Workspace and your leadership team is familiar with its business apps and collaboration tools. They want a cost-effective solution that uses their existing knowledge to evaluate, analyze, filter, and visualize data that is stored in BigQuery. What should you do to create a solution for the leadership team?

A. Create models in Looker.
B. Configure Connected Sheets.
C. Configure Tableau.
D. Configure Looker Studio.
4.1 Diagnostic Question 02

You have data in PostgreSQL that was designed to reduce redundancy. You are transferring this data to BigQuery for analytics. The source data is hierarchical and frequently queried together. You need to design a BigQuery schema that is performant. What should you do?

A. Use nested and repeated fields.
B. Retain the data in normalized form always.
C. Copy the primary tables and use federated queries for secondary tables.
D. Copy the normalized data into partitions.
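To make the nested-and-repeated idea concrete, here is an illustrative BigQuery DDL statement run through the Python client. The dataset, table, and columns are placeholders showing an order row that embeds a repeated STRUCT of line items.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One row per order; line items are stored as a repeated STRUCT instead of a
# separate child table, so frequent "order + items" queries avoid a join.
ddl = """
CREATE TABLE `my-project.retail.orders_nested` (
  order_id STRING,
  customer_id STRING,
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, price NUMERIC>>
)
"""
client.query(ddl).result()
```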


4.1 Diagnostic Question 03

You repeatedly run the same queries by joining multiple tables. The original tables change about ten times per day. You want an optimized querying approach. Which feature should you use?

A. Views
B. Materialized views
C. Federated queries
D. Partitions
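For context on materialized views, the sketch below creates one over a placeholder base table; BigQuery refreshes it incrementally as the base table changes, so repeated queries can read precomputed results instead of rescanning the source.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative materialized view aggregating a placeholder sales table.
ddl = """
CREATE MATERIALIZED VIEW `my-project.retail.daily_sales_mv` AS
SELECT store_id, DATE(ts) AS sale_date, SUM(amount) AS total_sales
FROM `my-project.retail.sales_events`
GROUP BY store_id, sale_date
"""
client.query(ddl).result()
```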


4.1 Diagnostic Question 04

You have analytics data stored in BigQuery. You need an efficient way to compute values across a group of rows and return a single result for each row. What should you do?

A. Use an aggregate function.
B. Use a UDF (user-defined function).
C. Use BigQuery ML.
D. Use a window function with an OVER clause.
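As a worked example of the OVER clause, the query below computes each row's share of its store's total while still returning one result per input row. The table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The OVER clause computes an aggregate across a group of rows (the partition)
# but returns a value for every row, unlike GROUP BY, which collapses rows.
sql = """
SELECT
  store_id,
  sale_date,
  daily_total,
  SUM(daily_total) OVER (PARTITION BY store_id) AS store_total,
  daily_total / SUM(daily_total) OVER (PARTITION BY store_id) AS share_of_store
FROM `my-project.retail.daily_sales`
"""
for row in client.query(sql).result():
    print(row.store_id, row.share_of_store)
```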


4.1 Diagnostic Question 05

You need to optimize the performance of queries in BigQuery. Your tables are not partitioned or clustered. What optimization technique can you use?

A. Batch your updates and inserts.
B. Use the LIMIT clause to reduce the data read.
C. Filter data as late as possible.
D. Perform self-joins on data.
4.1 Preparing data for visualization

Courses
Google Cloud Big Data and Machine Learning Fundamentals
● Data Engineering for streaming data
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a data warehouse
Building Resilient Streaming Analytics Systems on Google Cloud
● Dataflow streaming features
● Advanced BigQuery functionality and performance
Serverless Data Processing with Dataflow: Develop Pipelines
● Windows, watermarks, and triggers

Skill Badges
Prepare Data for ML APIs on Google Cloud
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Introduction to analysis and business intelligence tools
Use nested and repeated fields
Introduction to materialized views
Window function calls
Optimize query computation
4.2 Diagnostic Question 06

Your data in BigQuery has some columns that are extremely sensitive. You need to enable only some users to see certain columns. What should you do?

A. Create a new dataset with the column's data.
B. Create a new table with the column's data.
C. Use policy tags.
D. Use Identity and Access Management (IAM) permissions.
4.2 Diagnostic Question 07

Your business has collected industry-relevant data over many years. The processed data is useful for your partners and they are willing to pay for its usage. You need to ensure proper access control over the data. What should you do?

A. Export the data to zip files and share it through Cloud Storage.
B. Host the data on Analytics Hub.
C. Export the data to persistent disks and share it through an FTP endpoint.
D. Host the data on Cloud SQL.


4.2 Diagnostic Question 08

You have a complex set of data that comes from multiple sources. The analysts in your team need to analyze the data, visualize it, and publish reports to internal and external stakeholders. You need to make it easier for the analysts to work with the data by abstracting the multiple data sources. What tool do you recommend?

A. Looker Studio
B. Connected Sheets
C. D3.js library
D. Looker


4.2 Sharing data

Courses
Google Cloud Big Data and Machine Learning Fundamentals
● Data Engineering for Streaming Data
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
Building Batch Data Pipelines on Google Cloud
● Introduction to Building Batch Data Pipelines

Skill Badges
Data Catalog Fundamentals

Documentation
Introduction to column-level access control
Analytics Hub | Data Exchange and Data Sharing | Google Cloud
Introduction to Analytics Hub | BigQuery
Secure data exchanges and data sharing with Analytics Hub
Looker business intelligence platform embedded analytics
4.3 Diagnostic Question 09

You built machine learning (ML) models based on your own data. In production, the ML models are not giving satisfactory results. When you examine the data, it appears that the existing data is not sufficiently representing the business goals. You need to create a more accurate machine learning model. What should you do?

A. Train the model with more of similar data.
B. Perform L2 regularization.
C. Perform feature engineering, and use domain knowledge to enhance the column data.
D. Train the model with the same data, but use more epochs.


4.3 Diagnostic Question 10

You used Dataplex to create lakes and zones for your business data. However, some files are not being discovered. What could be the issue?

A. You have an exclude pattern that matches the files.
B. You have scheduled discovery to run every hour.
C. The files are in ORC format.
D. The files are in Parquet format.
4.3 Exploring and analyzing data
Courses
Google Cloud Big Data and Machine Learning Fundamentals
● Big Data with BigQuery
● The machine learning workflow with Vertex AI
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
Building Batch Data Pipelines on Google Cloud
● Introduction to building batch data pipelines
Smart Analytics, Machine Learning, and AI on Google Cloud
● Custom model building with SQL in BigQuery ML

Skill Badges
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Use the BigQuery ML TRANSFORM clause for feature engineering | Google Cloud
Feature preprocessing overview | BigQuery | Google Cloud
Discover data | Dataplex | Google Cloud
Section 5:
Maintaining and
Automating Data Workloads
5.1 Diagnostic Question 01

You need to design a Dataproc cluster to run multiple small jobs. Many jobs (but not all) are of high priority. What should you do?

A. Reuse the same cluster and run each job in sequence.
B. Reuse the same cluster to run all jobs in parallel.
C. Use ephemeral clusters.
D. Use cluster autoscaling.
5.1 Optimizing resources

Courses
Building Batch Data Pipelines on Google Cloud
● Executing Spark on Dataproc

Documentation
Dataproc Job Optimization How-to Guide | Google Cloud Blog
5.2 Diagnostic Question 02

You need to create repeatable data processing tasks by using Cloud Composer. You need to follow best practices and recommended approaches. What should you do?

A. Write each task to be responsible for one operation.
B. Use current time with the now() function for computation.
C. Update data with INSERT statements during the task run.
D. Combine multiple functionalities in a single task execution.
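For context, here is a minimal Cloud Composer (Airflow) DAG sketch in which each task is responsible for a single operation and uses the logical date from the task context rather than now(). The DAG ID, schedule, and callables are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # One responsibility: stage the day's input file (placeholder logic).
    print("extracting for", context["ds"])   # use the logical date, not now()

def load(**context):
    # One responsibility: load the staged file into the warehouse (placeholder logic).
    print("loading for", context["ds"])

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task
```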
5.2 Designing automation and repeatability

Courses
Building Batch Data Pipelines on Google Cloud
● Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Serverless Data Processing with Dataflow: Develop Pipelines
● Best Practices

Skill Badges
Engineer Data for Predictive Modeling with BigQuery ML

Documentation
Write Airflow DAGs | Cloud Composer
DAGs — Airflow Documentation
DAG writing best practices in Apache Airflow | Astronomer Documentation
5.3 Diagnostic Question 03

Multiple analysts need to prepare reports on Monday mornings, which leads to heavy utilization of BigQuery. You want to take a cost-effective approach to managing this demand. What should you do?

A. Use on-demand pricing.
B. Use Flex Slots.
C. Use BigQuery Enterprise edition with a one-year commitment.
D. Use BigQuery Enterprise Plus edition with a three-year commitment.
5.3 Diagnostic Question 04

You have a team of data analysts that run queries interactively on BigQuery during work hours. You also have thousands of report generation queries that run simultaneously. You often see an error: Exceeded rate limits: too many concurrent queries for this project_and_region. How would you resolve this issue?

A. Run all queries in interactive mode.
B. Create a yearly reservation of BigQuery slots.
C. Run the report generation queries in batch mode.
D. Create a view to run the queries.
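For reference, a query can be submitted with batch priority through the BigQuery Python client, as in the sketch below; batch queries are queued and do not count toward the concurrent interactive query limit. The query itself is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH   # queue the query instead of running it interactively
)

job = client.query(
    "SELECT store_id, SUM(amount) FROM `my-project.retail.sales` GROUP BY store_id",
    job_config=job_config,
)
rows = job.result()   # waits until BigQuery schedules and finishes the batch query
```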


5.3 Organizing workloads based on business requirements

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
● Building a Data Warehouse
Building Resilient Streaming Analytics Systems on Google Cloud
● Advanced BigQuery Functionality and Performance

Documentation
Scale cloud data warehouse up and down quickly
Introduction to reservations | BigQuery | Google Cloud
Introduction to BigQuery editions | Google Cloud
Run a query | BigQuery | Google Cloud
Troubleshoot quota and limit errors | BigQuery | Google Cloud
5.4 Diagnostic Question 05

You have a Dataflow pipeline in production. For certain data, the system seems to be stuck longer than usual. This is causing delays in the pipeline execution. You want to reliably and proactively track and resolve such issues. What should you do?

A. Review the Dataflow logs regularly.
B. Set up alerts with Cloud Run functions code that reviews the audit logs regularly.
C. Review the Cloud Monitoring dashboard regularly.
D. Set up alerts on Cloud Monitoring based on system lag.
5.4 Diagnostic Question 06

When running Dataflow jobs, you see this error in the logs: "A hot key HOT_KEY_NAME was detected in…". You need to resolve this issue and make the workload performant. What should you do?

A. Disable Dataflow shuffle.
B. Increase the data with the hot key.
C. Ensure that your data is evenly distributed.
D. Add more compute instances for processing.


5.4 Diagnostic Question 07

A colleague at Cymbal Retail asks you about the configuration of Dataproc autoscaling for a project. What would be the Google-recommended situation when you should enable autoscaling?

A. When you want to scale on-cluster Hadoop Distributed File System (HDFS).
B. When you want to scale out single-job clusters.
C. When you want to down-scale idle clusters to minimum size.
D. When there are different size workloads on the cluster.
5.4 Monitoring and troubleshooting processes

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Introduction to Data Engineering
Building Batch Data Pipelines on Google Cloud
● Executing Spark on Dataproc
Building Resilient Streaming Analytics Systems on Google Cloud
● Serverless Messaging with Pub/Sub
● Advanced BigQuery Functionality and Performance
Serverless Data Processing with Dataflow: Foundations
● IAM, Quotas, and Permissions
Serverless Data Processing with Dataflow: Develop Pipelines
● State and Timers
● Best Practices
Serverless Data Processing with Dataflow: Operations
● Monitoring
● Troubleshooting and Debug
● Reliability

Skill Badges
Prepare Data for ML APIs on Google Cloud

Documentation
Use Cloud Monitoring for Dataflow pipelines
Troubleshoot Dataflow errors | Google Cloud
Troubleshoot stragglers in batch jobs | Cloud Dataflow
Autoscaling clusters | Dataproc Documentation | Google Cloud
5.5 Diagnostic Question 08

Cymbal Retail processes streaming data on Dataflow with Pub/Sub as a source. You need to plan for disaster recovery and protect against zonal failures. What should you do?

A. Take Dataflow snapshots periodically.
B. Create Dataflow jobs from templates.
C. Enable vertical autoscaling.
D. Enable Dataflow shuffle.


5.5 Diagnostic Question 09

You run a Cloud SQL instance for a business that requires that the database is accessible for transactions. You need to ensure minimal downtime for database transactions. What should you do?

A. Configure replication.
B. Configure high availability.
C. Configure backups.
D. Configure backups and increase the number of backups.


5.5 Diagnostic Question 10

You are running a Dataflow pipeline in production. The input data for this pipeline is occasionally inconsistent. Separately from processing the valid data, you want to efficiently capture the erroneous input data for analysis. What should you do?

A. Re-read the input data and create separate outputs for valid and erroneous data.
B. Read the data once, and split it into two pipelines, one to output valid data and another to output erroneous data.
C. Check for the erroneous data in the logs.
D. Create a side output for the erroneous data.
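To illustrate the side-output technique, here is a small Beam sketch that tags erroneous records in a single ParDo and routes them to a separate branch. The parsing logic, input, and downstream handling are placeholders.

```python
import json
import apache_beam as beam
from apache_beam import pvalue

class ParseRecord(beam.DoFn):
    def process(self, line):
        try:
            yield json.loads(line)                       # main output: valid records
        except ValueError:
            yield pvalue.TaggedOutput("errors", line)    # side output: erroneous input

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create(['{"id": 1}', "not-json"])         # placeholder input
        | beam.ParDo(ParseRecord()).with_outputs("errors", main="valid")
    )

    results.valid | "HandleValid" >> beam.Map(print)
    results.errors | "HandleErrors" >> beam.Map(lambda e: print("bad record:", e))
```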
5.5 Maintaining awareness of failures and mitigating impact

Courses
Modernizing Data Lakes and Data Warehouses on Google Cloud
● Building a Data Lake
Serverless Data Processing with Dataflow: Develop Pipelines
● State and Timers
● Best Practices
Serverless Data Processing with Dataflow: Operations
● Troubleshooting and Debug
● Reliability

Documentation
Use Dataflow snapshots | Google Cloud
About high availability | Cloud SQL for MySQL
Design Your Pipeline
When will you take the exam?

Plan time to prepare

How many weeks do you have to prepare?
How many hours will you spend preparing for the exam each week?
How many total hours will you prepare?
Weekly study plan

Now, consider what you’ve learned about your knowledge and skills
through the diagnostic questions in this course. You should have a
better understanding of what areas you need to focus on and what
resources are available.

Use the template that follows to plan your study goals for each week.
Consider:
● What exam guide section(s) or topic area(s) will you focus on?
● What courses (or specific modules) will help you learn more?
● What Skill Badges or labs will you work on for hands-on practice?
● What documentation links will you review?
● What additional resources will you use, such as sample questions?
● What will you do to prepare for the case studies?
You may do some or all of these study activities each week.

Duplicate the weekly template for the number of weeks in your individual preparation journey.
Weekly study template (example)

Area(s) of focus: Using BigQuery as a data warehouse

Courses/modules to complete: Modernizing Data Lakes and Data Warehouses with Google Cloud
● Building a data warehouse

Skill Badges/labs to complete: Build a Data Warehouse with BigQuery

Documentation to review:
Overview of BigQuery storage | Google Cloud
Overview of BigQuery analytics | Google Cloud
Introduction to BigQuery administration | Google Cloud
Organizing BigQuery resources | Google Cloud

Additional study: Sample Questions 1-5


Weekly study template

Area(s) of focus:

Courses/modules to complete:

Skill Badges/labs to complete:

Documentation to review:

Additional study:
