Copy of 2024
Cmd 1
rawDF = spark.table("raw_data")
Cmd 2
rawDF.printSchema()
Cmd 3
Cmd 4
finalDF = flattenedDF.drop("values")
Cmd 5
display(finalDF)
Cmd 6
finalDF.write.mode("append").saveAsTable("flat_data")
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the
above code block?
A. * dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
B. input_dict = input()
date = input_dict["date"]
C. date = spark.conf.get("date")
D. import sys
date = sys.argv[1]
E. date = dbutils.notebooks.getParam("date")
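For reference, a minimal sketch of the pattern in option A, combining the widget calls shown above with the parameterized read; the widget name and default value are taken directly from the options:
# Create a text widget with a default value, then read its current value (option A)
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

# The retrieved string parameterizes the path used in the question's read
df = spark.read.format("parquet").load(f"/mnt/source/{date}")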
Assuming users have been added to a workspace but not granted any
permissions, which of the following describes the minimal permissions a user
would need to start and attach to an already configured cluster?
A. * "Can Restart" privileges on the required cluster
B. Cluster creation allowed, "Can Restart" privileges on the required cluster
C. "Can Manage" privileges on the required cluster
D. Cluster creation allowed, "Can Attach To" privileges on the required cluster
E. Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges on
the required cluster
4. Item ID: IT001349
Which statement regarding Spark configuration on the Databricks platform is true?
A. * Spark configuration properties set for an interactive cluster with the Clusters UI
will impact all notebooks attached to that cluster.
B. Spark configurations set within a notebook will affect all SparkSessions attached to
the same interactive cluster.
C. Spark configuration properties can only be set for an interactive cluster by creating
a global init script.
D. The Databricks REST API can be used to modify the Spark configuration properties
for an interactive cluster without interrupting jobs currently running on the cluster.
E. When the same Spark configuration property is set for an interactive cluster and a
notebook attached to that cluster, the notebook setting will always be ignored.
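As a point of reference, runtime Spark properties can also be set from a notebook; a minimal sketch (the property shown is illustrative). Such notebook-level settings apply to the SparkSession in which they are set rather than to every notebook on the cluster, which is why option B above is incorrect:
# Set and read back a Spark SQL property for the current SparkSession
# (illustrative property; cluster-wide values come from the Clusters UI)
spark.conf.set("spark.sql.shuffle.partitions", "200")
print(spark.conf.get("spark.sql.shuffle.partitions"))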
FROM recent_sensor_recordings
GROUP BY sensor_id
The query is set to refresh each minute and always completes in less than 10
seconds. The alert is set to trigger when mean(temperature) > 120.
Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which
statement must be true?
A. * The average temperature recordings for at least one sensor exceeded 120 on
three consecutive executions of the query.
B. The source query failed to update properly for three consecutive minutes and then
restarted.
C. The total average temperature across all sensors exceeded 120 on three
consecutive executions of the query.
D. The maximum temperature recording for at least one sensor exceeded 120 on
three consecutive executions of the query.
E. The recent_sensor_recordings table was unresponsive for three consecutive runs
of the query.
9. Item ID: IT001354
A Delta Lake table was created with the below query:
AS (
SELECT *
FROM prod.sales a
ON a.store_id = b.store_id
USING DELTA
LOCATION "/mnt/prod/sales_by_store"
Realizing that the original query had a typographical error, the below code was
executed:
Which approach allows this user to share their code updates without the risk of
overwriting the work of their teammates?
A. * Use Repos to create a new branch, commit all changes, and push changes to the
remote Git repository.
B. Use Repos to create a fork of the remote repository, commit all changes, and make
a pull request on the source repository.
C. Use Repos to merge all differences and make a pull request back to the remote
repository.
D. Use Repos to checkout all changes and send the git diff log to the team.
E. Use Repos to pull changes from the remote Git repository; commit and push
changes to a branch that appeared as changes were pulled.
Which approach will allow this developer to review the current logic for this
notebook?
A. * Use Repos to pull changes from the remote Git repository and select the
dev-2.3.9 branch.
B. Merge all changes back to the main branch in the remote Git repository and clone
the repo again.
C. Use Repos to make a pull request; use the Databricks REST API to update the
current branch to dev-2.3.9.
D. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the
current branch.
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a
pull request to sync with the remote repository.
13. Item ID: IT001358
Two of the most common data locations on Databricks are the DBFS root storage
and external object storage mounted with dbutils.fs.mount().
After testing the code with all Python variables being defined with strings, they
upload the password to the secrets module and configure the correct permissions
for the currently active user. They then modify their code to the following (leaving
all other variables unchanged).
password = dbutils.secrets.get(scope="db_creds",
key="jdbc_password")
print(password)
df = (spark
.read
.format("jdbc")
.option("url", connection)
.option("dbtable", tablename)
.option("user", username)
.option("password", password)
The following code correctly imports the production model, loads the customers
table containing the customer_id key column into a DataFrame, and defines the
feature columns needed for the model.
model = mlflow.pyfunc.spark_udf(spark,
model_uri="models:/churn/prod")
df = spark.table("customers")
Which code block will output a DataFrame with the schema "customer_id
LONG, predictions DOUBLE"?
A. * df.select("customer_id",
model(*columns).alias("predictions"))
B. df.apply(model, columns).select("customer_id, predictions")
C. model.predict(df, columns)
D. df.select("customer_id", pandas_udf(model,
columns).alias("predictions"))
E. df.map(lambda x:model(x[columns])).select("customer_id,
predictions")
model = mlflow.pyfunc.spark_udf(spark,
model_uri="models:/churn/prod")
df = spark.table("customers")
preds = (df.select(
"customer_id",
model(*columns).alias("predictions"),
current_date().alias("date")
)
)
The data science team would like predictions saved to a Delta Lake table with the
ability to compare all predictions across time. Churn predictions will be made at
most once per day.
Which code block accomplishes this task while minimizing potential compute
costs?
A. * preds.write.mode("append").saveAsTable("churn_preds")
B. (preds.writeStream
.outputMode("append")
.option("checkpointPath", "/_checkpoints/churn_preds")
.table("churn_preds")
)
C. preds.write.format("delta").save("/preds/churn_preds")
D. (preds.write
.format("delta")
.mode("overwrite")
.saveAsTable("churn_preds")
)
E. (preds.writeStream
.outputMode("overwrite")
.option("checkpointPath", "/_checkpoints/churn_preds")
.start("/preds/churn_preds")
Cmd 1
%python
Cmd 2
%sql
SELECT *
FROM sales
python ./data_loader/run.py;
mv ./output /dbfs/mnt/new_data
The code executes successfully and provides the logically correct results;
however, it takes over 20 minutes to extract and load around 1 GB of data.
The review column contains the full text of the review left by the user.
Specifically, the data science team is looking to identify if any of 30 key words
exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve
query performance.
To find all the records from within the Arctic Circle, you execute a query with the
below filter:
Which statement describes how the Delta engine identifies which files to load?
A. * The Delta log is scanned for min and max statistics for the latitude column
B. The Hive metastore is scanned for min and max statistics for the latitude column
C. The Parquet file footers are scanned for min and max statistics for the latitude
column
D. All records are cached to attached storage and then the filter is applied
E. All records are cached to an operational database and then the filter is applied
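An illustrative filter of the kind described (the exact predicate is not reproduced above; the table name and the 66.5-degree latitude threshold for the Arctic Circle are assumptions). With Delta Lake, the per-file min/max statistics recorded in the Delta log allow files whose latitude range cannot satisfy the predicate to be skipped entirely:
# Illustrative only: a selective filter that benefits from Delta data skipping
arctic_df = (spark.table("geo_readings")        # hypothetical table name
             .filter("latitude > 66.5"))        # assumed Arctic Circle threshold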
The team has decided to process all deletions from the previous week as a batch
job at 1am each Sunday. The total duration of this job is less than one hour. Every
Monday at 3am, a batch job executes a series of VACUUM commands on all Delta
Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel
functionality. They are concerned that this might allow continued access to deleted
data.
A. * Because the default data retention threshold is 7 days, data files containing
deleted records will be retained until the VACUUM job is run 8 days later.
B. Because the VACUUM command permanently deletes all files containing deleted
records, deleted records may be accessible with time travel for around 24 hours.
C. Because Delta Lake's delete statements have ACID guarantees, deleted records
will be permanently purged from all storage systems as soon as a delete job
completes.
D. Because Delta Lake time travel provides full access to the entire history of a table,
deleted records can always be recreated by users with full admin privileges.
E. Because the default data retention threshold is 24 hours, data files containing
deleted records will be retained until the VACUUM job is run the following day.
22. Item ID: IT001367
In order to prevent accidental commits to production data, a senior data engineer
has instituted a policy that all development work will reference clones of Delta
Lake tables. After testing both DEEP and SHALLOW CLONE, development tables
are created using SHALLOW CLONE.
A few weeks after initial table creation, the cloned versions of several tables
implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The
transaction logs for the source tables show that VACUUM was run the day before.
Which statement describes why the cloned tables are no longer working?
A. * The metadata created by the CLONE operation is referencing data files that were
purged as invalid by the VACUUM command.
B. Running VACUUM automatically invalidates any shallow clones of a table; DEEP
CLONE should always be used when a cloned table will be repeatedly queried.
C. Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee
data consistency for cloned tables.
D. The data files compacted by VACUUM are not tracked by the cloned metadata;
running REFRESH on the cloned table will pull in recent changes.
E. Tables created with SHALLOW CLONE are automatically deleted after their default
retention threshold of 7 days.
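For reference, a hedged sketch of how such development clones are typically created (table names are illustrative). A shallow clone copies only the Delta metadata and continues to reference the source table's data files, which is why a VACUUM on the source can invalidate it:
# Shallow clone: metadata only; data files remain those of the source table
spark.sql("""
  CREATE OR REPLACE TABLE dev.sales_clone
  SHALLOW CLONE prod.sales
""")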
"existing_cluster_id": "6015-954420-peace720",
"notebook_task": {
"notebook_path": "/Prod/ingest.py"
Assuming that all configurations and referenced resources are available, which
statement describes the result of executing this workload three times?
A. * Three new jobs named "Ingest new data" will be defined in the workspace, but no
jobs will be executed.
B. The logic defined in the referenced notebook will be executed three times on the
referenced existing all purpose cluster.
C. The logic defined in the referenced notebook will be executed three times on new
clusters with the configurations of the provided cluster ID.
D. One new job named "Ingest new data" will be defined in the workspace, but it will
not be executed.
E. Three new jobs named "Ingest new data" will be defined in the workspace, and they
will each run once daily.
25. Item ID: IT001370
An upstream system is emitting change data capture (CDC) logs that are being
written to a cloud object storage directory. Each record in the log indicates the
change type (insert, update, or delete), timestamp, and the values for each field
after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record
of all values that have ever been valid in the source system. For analytical
purposes, only the most recent value for each record needs to be recorded in the
target Delta Lake table in the Lakehouse. The Databricks job to ingest these
records occurs once per hour, but each individual record may have changed
multiple times over the course of an hour.
The ingestion job is configured to append all data for the previous date to a target
table reviews_raw with an identical schema to the source system. The next step
in the pipeline is a batch write to propagate all new records inserted into
reviews_raw to a table where data is fully deduplicated, validated, and enriched.
Which solution minimizes the compute costs to propagate this batch of data?
A. * Configure a Structured Streaming read against the reviews_raw table using the
trigger available now execution mode to process new records as a batch job.
B. Filter all records in the reviews_raw table based on the review_timestamp;
batch append those records produced in the last 48 hours.
C. Perform a read on the reviews_raw table and perform an insert-only merge using
the natural composite key user_id, review_id, product_id,
review_timestamp.
D. Use Delta Lake version history to get the difference between the latest version of
reviews_raw and one version prior, then write these records to the next table.
E. Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
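A minimal sketch of option A, assuming the reviews_raw source table from the question; the checkpoint path and target table name are illustrative, and the deduplication/enrichment logic is omitted:
# Incremental batch: read only records appended since the last checkpoint,
# process everything currently available, then stop.
(spark.readStream
   .table("reviews_raw")
   .writeStream
   .option("checkpointLocation", "/checkpoints/reviews_silver")  # illustrative path
   .trigger(availableNow=True)
   .toTable("reviews_silver"))                                   # illustrative target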
New records are all ingested into a table named account_history which
maintains a full record of all data in the same schema as the source. The next
table in the system is named account_current and is implemented as a Type 1
table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records
processed hourly, which implementation can be used to efficiently update the
described account_current table as part of each hourly batch job?
A. * Filter records in account_history using the last_updated field and the most
recent hour processed, as well as the max last_login by user_id; write a
merge statement to update or insert the most recent value for each user_id.
B. Overwrite the account_current table with each batch using the results of a
query against the account_history table grouping by user_id and filtering for
the max value of last_updated.
C. Use Delta Lake version history to get the difference between the latest version of
account_history and one version prior, then write these records to
account_current.
D. Filter records in account_history using the last_updated field and the most
recent hour processed, making sure to deduplicate on username; write a merge
statement to update or insert the most recent value for each username.
E. Use Auto Loader to subscribe to new files in the account_history directory;
configure a Structured Streaming trigger available job to batch update newly
detected files into the account_current table.
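A hedged sketch of the merge described in option A, using the Delta Lake Python API; it assumes a staged_updates DataFrame that already holds one row per user_id (the filtered, deduplicated slice of account_history) with the same schema as account_current:
from delta.tables import DeltaTable

# staged_updates: assumed to contain the most recent record per user_id
# for the hour just processed, matching the account_current schema
staged_updates = spark.table("staged_updates")   # illustrative source of staged rows

(DeltaTable.forName(spark, "account_current").alias("t")
   .merge(staged_updates.alias("s"), "t.user_id = s.user_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())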
Because reporting on long-term sales trends is less volatile, analysts using the
new dashboard only require data to be refreshed once daily. Because the
dashboard will be queried interactively by many users throughout a normal
business day, it should return results quickly and reduce total compute associated
with each materialization.
Which solution meets the expectations of the end users while controlling and
limiting possible costs?
A. * Populate the dashboard by configuring a nightly batch job to save the required
values as a table overwritten with each update.
B. Define a view against the products_per_order table and define the dashboard
against this view.
C. Use Structured Streaming to configure a live dashboard against the
products_per_order table within a Databricks notebook.
D. Configure a webhook to execute an incremental read against
products_per_order each time the dashboard is refreshed.
E. Use the Delta Cache to persist the products_per_order table in memory to
quickly update the dashboard with each query.
Immediately after each update succeeds, the data engineering team would like to
determine the difference between the new version and the previous version of the
table.
Given the current implementation, which method can be used?
A. * Execute a query to calculate the difference between the new version and the
previous version using Delta Lake's built-in versioning and time travel functionality.
B. Use Delta Lake's change data feed to identify those records that have been
updated, inserted, or deleted.
C. Parse the Delta Lake transaction log to identify all newly written data files.
D. Parse the Spark event logs to identify those rows that were updated, inserted, or
deleted.
E. Execute DESCRIBE HISTORY customer_churn_params to obtain the full
operation metrics for the update, including a log of all records that have been added
or modified.
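A hedged sketch of option A, comparing the latest table version with the previous one via Delta time travel; the table name follows option E above:
# Look up the latest version number from the table history
latest = (spark.sql("DESCRIBE HISTORY customer_churn_params")
          .selectExpr("max(version) AS v")
          .first()["v"])

current = spark.sql(f"SELECT * FROM customer_churn_params VERSION AS OF {latest}")
previous = spark.sql(f"SELECT * FROM customer_churn_params VERSION AS OF {latest - 1}")

# Rows added by the update, and rows removed or changed by it
added = current.exceptAll(previous)
removed = previous.exceptAll(current)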
The churn prediction model used by the ML team is fairly stable in production. The
team is only interested in making predictions on records that have changed in the
past 24 hours.
FROM
FROM users) a
INNER JOIN
FROM orders
ON a.user_id = b.user_id
Both users and orders are Delta Lake tables. Which statement describes the
results of querying recent_orders?
A. * All logic will execute at query time and return the result of joining the valid
versions of the source tables at the time the query began.
B. All logic will execute at query time and return the result of joining the valid versions
of the source tables at the time the query finishes.
C. All logic will execute when the view is defined and store the result of joining tables
to the DBFS; this stored data will be returned when the view is queried.
D. Results will be computed and cached when the view is defined; these cached
results will incrementally update as new records are inserted into source tables.
E. The versions of each source table will be stored in the view transaction log; query
results will be saved to DBFS with each query.
FROM
FROM users) a
INNER JOIN
FROM orders
ON a.user_id = b.user_id
Both users and orders are Delta Lake tables. Which statement describes the
results of querying recent_orders?
A. * All logic will execute when the table is defined and store the result of joining tables
to the DBFS; this stored data will be returned when the table is queried.
B. All logic will execute at query time and return the result of joining the valid versions
of the source tables at the time the query began.
C. All logic will execute at query time and return the result of joining the valid versions
of the source tables at the time the query finishes.
D. Results will be computed and cached when the table is defined; these cached
results will incrementally update as new records are inserted into source tables.
E. The versions of each source table will be stored in the table transaction log; query
results will be saved to DBFS with each query.
35. Item ID: IT001380
A data ingestion task requires a one-TB JSON dataset to be written out to Parquet
with a target part-file size of 512 MB. Because Parquet is being used instead of
Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction
cannot be used.
Which strategy will yield the best performance without shuffling data?
A. * Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data,
execute the narrow transformations, and then write to parquet
B. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the
narrow transformations, and then write to parquet
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest
the data, execute the narrow transformations, coalesce to 2,048 partitions
(1TB*1024*1024/512), and then write to parquet.
D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions
(1TB*1024*1024/512), and then write to parquet.
E. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512),
ingest the data, execute the narrow transformations, optimize the data by sorting it
(which automatically repartitions the data), and then write to parquet
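A minimal sketch of option A (paths and the filter are illustrative): spark.sql.files.maxPartitionBytes caps the bytes packed into each input partition at read time, and because narrow transformations preserve partitioning, the output part-file count tracks the input partitioning without any shuffle:
# Cap read partitions at ~512 MB so part files written downstream land near the target size
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/source/json_dataset")          # illustrative path
cleaned = df.filter("value IS NOT NULL")                   # narrow transformation (illustrative)
cleaned.write.mode("overwrite").parquet("/mnt/target/parquet_dataset")  # illustrative path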
Code block:
df._____
  .groupBy(
    window("event_time", "5 minutes").alias("time"),
    "device_id")
  .agg(
    avg("temp").alias("avg_temp"),
    avg("humidity").alias("avg_humidity"))
  .writeStream
  .format("delta")
  .saveAsTable("sensor_avg")
Choose the response that correctly fills in the blank within the code block to
complete this task.
A. * withWatermark("event_time", "10 minutes")
B. delayWrite("event_time", "10 minutes")
C. awaitArrival("event_time", "10 minutes")
D. slidingWindow("event_time", "10 minutes")
E. await("event_time + ‘10 minutes’")
.groupBy(
    _____,
    "device_id")
  .agg(
    avg("temp").alias("avg_temp"),
    avg("humidity").alias("avg_humidity"))
  .writeStream
  .format("delta")
  .saveAsTable("sensor_avg")
Choose the response that correctly fills in the blank within the code block to
complete this task.
A. * window("event_time", "5 minutes").alias("time")
B. window("event_time", "10 minutes").alias("time")
C. lag(“event_time”, “5 minutes”).alias(“time”)
D. “event_time”
E. to_interval(“event_time”, “5 minutes”).alias(“time”)
df.groupBy("item")
  .agg(count("item").alias("total_count"),
       mean("sale_price").alias("avg_price"))
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/item_agg/__checkpoint")
  .start("/item_agg")
Proposed query:
df.groupBy("item")
  .agg(count("item").alias("total_count"),
       mean("sale_price").alias("avg_price"),
       count("promo_code = 'NEW_MEMBER'").alias("new_member_promo"))
  .writeStream
  .outputMode("complete")
  .option('mergeSchema', 'true')
  .option("checkpointLocation", "/item_agg/__checkpoint")
  .start("/item_agg")
Which step must also be completed to put the proposed query into production?
A. * Specify a new checkpointLocation
B. Register the data in the “/item_agg” directory to the Hive metastore
C. Increase the shuffle partitions to account for additional aggregates
D. Run REFRESH TABLE delta.`/item_agg`
E. Remove .option('mergeSchema', 'true') from the streaming write
./bronze
├── __checkpoint
├── __delta_log
├── year_week=2020_01
├── year_week=2020_02
└── …
Which statement describes whether this checkpoint directory structure is valid for
the given scenario and why?
A. * No; each of the streams needs to have its own checkpoint directory.
B. Yes; Delta Lake supports infinite concurrent writers.
C. No; only one stream can write to a Delta Lake table.
D. No; Delta Lake manages streaming checkpoints in the transaction log.
E. Yes; both of the streams can share a single checkpoint directory.
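A hedged sketch of the arrangement option A requires (source tables, paths, and format call are illustrative): each streaming write tracks its own progress, so two streams appending to the same Delta table must each use a distinct checkpoint directory:
# Two independent streams appending to the same bronze table,
# each with its own checkpoint directory
for source, checkpoint in [("events_a", "/bronze/__checkpoint/events_a"),
                           ("events_b", "/bronze/__checkpoint/events_b")]:
    (spark.readStream.table(source)
       .writeStream
       .format("delta")
       .option("checkpointLocation", checkpoint)
       .outputMode("append")
       .toTable("bronze"))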
Holding all other variables constant and assuming records need to be processed
in less than 10 minutes, which adjustment will meet the requirement?
A. * Use the trigger once option and configure a Databricks job to execute the query
every 10 minutes; this approach minimizes costs for both compute and storage.
B. Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger
interval ensures that the source is not queried too frequently.
C. Set the trigger interval to 10 minutes; each batch calls APIs in the source storage
account, so decreasing trigger frequency to maximum allowable threshold should
minimize this cost.
D. Set the trigger interval to 3 seconds; the default trigger interval is consuming too
many records per batch, resulting in spill to disk that can increase volume costs.
E. Increase the number of shuffle partitions to maximize parallelism, since the trigger
interval cannot be modified without modifying the checkpoint directory.
45. Item ID: IT001390
Which statement describes Delta Lake optimized writes?
A. * A shuffle occurs prior to writing to try to group similar data together resulting in
fewer files instead of each executor writing multiple files based on directory
partitions.
B. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified
during the most recent job.
C. Optimized writes use logical partitions instead of directory partitions; because
partition boundaries are only represented in metadata, fewer small files are written.
D. An asynchronous job runs after the write completes to detect if files could be further
compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
E. Data is queued in a messaging bus instead of committing data directly to memory;
all data is committed from the messaging bus in one batch once the job is
complete.
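For reference, optimized writes can be enabled per table or per session; a hedged sketch with an illustrative table name:
# Per-table: Spark shuffles data before the write so each partition
# directory receives fewer, larger files
spark.sql("""
  ALTER TABLE sales_bronze
  SET TBLPROPERTIES ('delta.autoOptimize.optimizedWrite' = 'true')
""")

# Or per-session
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")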
Where in the Spark UI are two of the primary indicators that a partition is spilling to
disk?
A. * Stage’s detail screen and Query’s detail screen
B. Stage’s detail screen and Executor’s log files
C. Driver’s and Executor’s log files
D. Executor’s detail screen and Executor’s log files
E. Query’s detail screen and Job’s detail screen
Given a job with at least one wide transformation, which of the following cluster
configurations will result in maximum performance?
A. *
■ Total VMs: 1
■ 400 GB per Executor
■ 160 Cores / Executor
B.
■ Total VMs: 2
■ 200 GB per Executor
■ 80 Cores / Executor
C.
■ Total VMs: 4
■ 100 GB per Executor
■ 40 Cores / Executor
D.
■ Total VMs: 8
■ 50 GB per Executor
■ 20 Cores / Executor
E.
■ Total VMs: 16
■ 25 GB per Executor
■ 10 Cores / Executor
(spark.read
.format("parquet")
.load(f"/mnt/raw_orders/{date}")
.dropDuplicates(["customer_id", "order_id"])
.write
.mode("append")
.saveAsTable("orders")
Assume that the fields customer_id and order_id serve as a composite key to
uniquely identify each order.
(spark.readStream
.format("parquet")
.load("/mnt/raw_orders/")
.writeStream
.trigger(once=True)
.table("orders")
Assume that the fields customer_id and order_id serve as a composite key to
uniquely identify each order, and that the time field indicates when the record
was queued in the source system.
USING new_events
ON events.event_id = new_events.event_id
INSERT *
The view new_events contains a batch of records with the same schema as the
events Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the
same event_id as an existing record?
A. * They are ignored.
B. They are updated.
C. They are deleted.
D. They are inserted.
E. They are merged.
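A hedged reconstruction of the kind of statement the question describes (the full query is not reproduced above): with only a WHEN NOT MATCHED ... INSERT clause, rows whose event_id already exists in events are simply ignored, which is why option A is marked correct:
# Insert-only merge: existing event_ids are left untouched;
# only event_ids not yet present in `events` are inserted
spark.sql("""
  MERGE INTO events
  USING new_events
  ON events.event_id = new_events.event_id
  WHEN NOT MATCHED THEN
    INSERT *
""")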
(spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 0)
.table("bronze")
.filter(col("_change_type").isin(["update_postimage",
"insert"]))
.write
.mode("append")
.table("bronze_history_type1")
Which statement describes the execution and results of running the above query
multiple times?
A. * Each time the job is executed, the entire available history of inserted or updated
records will be appended to the target table, resulting in many duplicate entries.
B. Each time the job is executed, only those records that have been inserted or
updated since the last execution will be appended to the target table, giving the
desired result.
C. Each time the job is executed, newly updated records will be merged into the target
table, overwriting previous values with the same primary keys.
D. Each time the job is executed, the differences between the original and current
versions are calculated; this may result in duplicate entries for some records.
E. Each time the job is executed, the target table will be overwritten using the entire
history of inserted or updated records, giving the desired result.
(spark.readStream.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 0)
.table("bronze")
.filter(col("_change_type").isin(["update_postimage",
"insert"]))
.writeStream
.option("checkpointLocation", checkpoint_path)
.outputMode("append")
.trigger(once=True)
.table("bronze_history_type1")
Which statement describes the execution and results of running the above query
multiple times?
A. * Each time the job is executed, only those records that have been inserted or
updated since the last execution will be appended to the target table, giving the
desired result.
B. Each time the job is executed, the entire available history of inserted or updated
records will be appended to the target table, resulting in many duplicate entries.
C. Each time the job is executed, newly updated records will be merged into the target
table, overwriting previous values with the same primary keys.
D. Each time the job is executed, the differences between the original and current
versions are calculated; this may result in duplicate entries for some records.
E. Each time the job is executed, the target table will be overwritten using the entire
history of inserted or updated records, giving the desired result.
A senior data engineer updates the Delta Table’s schema and ingestion logic to
include the current timestamp (as recorded by Apache Spark) as well as the Kafka
topic and partition. The team plans to use these additional metadata fields to
diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?
A. * New fields will not be computed for historic records.
B. New fields cannot be added to a production Delta table.
C. Spark cannot capture the topic and partition fields from a Kafka source.
D. Updating the table schema will invalidate the Delta transaction log metadata.
E. Updating the table schema requires a default value provided for each field added.
Which describes how Delta Lake can help to avoid data loss of this nature in the
future?
A. * Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a
permanent, replayable history of the data state.
B. Delta Lake automatically checks that all fields present in the source data are
included in the ingestion layer.
C. The Delta log and Structured Streaming checkpoints record the full history of the
Kafka producer.
D. Delta Lake schema evolution can retroactively calculate the correct value for newly
added fields, as long as the data was in the original source.
E. Data can never be permanently dropped or deleted from Delta Lake, so data loss is
not possible under any circumstance.
(spark.read
.format("parquet")
.load(f"/mnt/daily_batch/{year}/{month}/{day}")
.select("*",
time_col.alias("ingest_time"),
inpute_file_name().alias("source_file")
.write
.mode("append")
.saveAsTable("bronze")
The next step in the pipeline requires a function that returns an object that can be
used to manipulate new records that have not yet been processed to the next
table in the pipeline.
def new_records():
A. * return spark.readStream.table("bronze")
B. return (spark.read
.table("bronze")
.filter(col("ingest_time") == current_timestamp())
)
C. return (spark.read
.table("bronze")
.filter(col("source_file") ==
f"/mnt/daily_batch/{year}/{month}/{day}")
)
D. return spark.read.option("readChangeFeed",
"true").table("bronze")
E. return spark.readStream.load("bronze")
64. Item ID: IT001408
In order to facilitate near real-time workloads, a data engineer is creating a helper
function to leverage the schema detection and evolution functionality of Databricks
Auto Loader. The desired function will automatically detect the schema of the
source directory, incrementally process JSON files as they arrive in a source
directory, and automatically evolve the schema of the table when new fields are
detected.
checkpoint_path: str,
target_table_path: str):
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation",
checkpoint_path)
.load(source_path)
______
Which response correctly fills in the blank to meet the specified requirements?
A. * .writeStream
.option("checkpointLocation", checkpoint_path)
.option("mergeSchema", True)
.start(target_table_path)
B. .write
.option("mergeSchema", True)
.mode("append")
.save(target_table_path)
C. .writeStream
.option("checkpointLocation", checkpoint_path)
.option("mergeSchema", True)
.trigger(once=True)
.start(target_table_path)
D. .write
.option("checkpointLocation", checkpoint_path)
.option("mergeSchema", True)
.outputMode("append")
.save(target_table_path)
E. .writeStream
.option("mergeSchema", True)
.start(target_table_path)
The data engineer is trying to determine the best approach for dealing with
schema declaration given the highly-nested structure of the data and the
numerous fields.
Which of the following accurately presents information about Delta Lake and
Databricks that may impact their decision-making process?
A. * Because Databricks will infer schema using types that allow all observed data to
be processed, setting types manually provides greater assurance of data quality
enforcement.
B. The Tungsten encoding used by Databricks is optimized for storing string data;
newly-added native support for querying JSON strings means that string types are
always most efficient.
C. Schema inference and evolution on Databricks ensure that inferred types will
always accurately match the data types used by downstream systems.
D. Because Delta Lake uses Parquet for data storage, data types can be easily
evolved by just modifying file footer information in place.
E. Human labor in writing code is the largest cost associated with data engineering
workloads; as such, automating table declaration logic should be a priority in all
migration workloads.
The data engineer is trying to determine the best approach for dealing with these
nested fields before declaring the table schema.
Which of the following accurately presents information about Delta Lake and
Databricks that may impact their decision-making process?
A. * By default, Delta Lake collects statistics on the first 32 columns in a table; these
statistics are leveraged for data skipping when executing selective queries.
B. Because Delta Lake uses Parquet for data storage, Dremel encoding information
for nesting can be directly referenced by the Delta transaction log.
C. Databricks query latency is mostly due to time spent in query planning with the
Catalyst optimizer; alphabetically storing columns can reduce query planning time
significantly.
D. The Tungsten encoding used by Databricks is optimized for storing string data;
newly-added native support for querying JSON strings means that string types are
always most efficient.
E. Schema inference and evolution on Databricks ensure that inferred types will
always accurately match the data types used by downstream systems.
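For reference, the number of columns on which Delta collects statistics is configurable per table, which is one way to keep frequently filtered fields inside the indexed prefix when schemas are wide or deeply nested; a hedged sketch with an illustrative table name and value:
# Delta collects min/max/null-count statistics on the first N columns (32 by default);
# this table property adjusts N for a specific table
spark.sql("""
  ALTER TABLE bronze_events
  SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")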
67. Item ID: IT001411
The data engineering team maintains the following code:
import pyspark.sql.functions as F
(spark.table("silver_customer_sales")
.groupBy("customer_id")
.agg(
F.min("sale_date").alias("first_transaction_date"),
F.max("sale_date").alias("last_transaction_date"),
F.mean("sale_total").alias("average_sales"),
F.countDistinct("order_id").alias("total_orders"),
F.sum("sale_total").alias("lifetime_value")
).write
.mode("overwrite")
.table("gold_customer_lifetime_sales_summary")
)
Assuming that this code produces logically correct results and the data in the
source table has been de-duplicated and validated, which statement describes
what will occur when this code is executed?
A. * The gold_customer_lifetime_sales_summary table will be overwritten by
aggregated values calculated from all records in the silver_customer_sales
table as a batch job.
B. An incremental job will leverage running information in the state store to update
aggregate values in the gold_customer_lifetime_sales_summary table.
C. A batch job will update the gold_customer_lifetime_sales_summary table,
replacing only those rows that have different values than the current version of the
table, using customer_id as the primary key.
D. The silver_customer_sales table will be overwritten by aggregated values
calculated from all records in the gold_customer_lifetime_sales_summary
table as a batch job.
E. An incremental job will detect if new rows have been written to the
silver_customer_sales table; if new rows are detected, all aggregates will be
recalculated and used to overwrite the
gold_customer_lifetime_sales_summary table.
orderWithItemDF = (orderDF.join(
itemDF,
orderDF.itemID == itemDF.itemID)
.select(
orderDF.accountID,
orderDF.itemID,
itemDF.itemName))
finalDF = (accountDF.join(
orderWithItemDF,
accountDF.accountID == orderWithItemDF.accountID)
.select(
orderWithItemDF["*"],
accountDF.city))
(finalDF.write
.mode("overwrite")
.table("enriched_itemized_orders_by_account"))
Assuming that this code produces logically correct results and the data in the
source tables has been de-duplicated and validated, which statement describes
what will occur when this code is executed?
A. * The enriched_itemized_orders_by_account table will be overwritten using
the current valid version of data in each of the three tables referenced in the join
logic.
B. An incremental job will leverage information in the state store to identify unjoined
rows in the source tables and write these rows to the
enriched_itemized_orders_by_account table.
C. A batch job will update the enriched_itemized_orders_by_account table,
replacing only those rows that have different values than the current version of the
table, using accountID as the primary key.
D. An incremental job will detect if new rows have been written to any of the source
tables; if new rows are detected, all results will be recalculated and used to
overwrite the enriched_itemized_orders_by_account table.
E. No computation will occur until enriched_itemized_orders_by_account is
queried; upon query materialization, results will be calculated using the current valid
version of data in each of the three tables referenced in the join logic.
69. Item ID: IT001413
The data engineering team is configuring environments for development, testing,
and production before beginning migration on a new data pipeline. The team
requires extensive testing on both the code and data resulting from code
execution, and the team wants to develop and test against data as similar to
production data as possible.
A junior data engineer suggests that production data can be mounted to the
development and testing environments, allowing pre-production code to execute
against production data. Because all users have admin privileges in the
development environment, the junior data engineer has offered to configure
permissions and mount this data for the team.
The data engineering team has been made aware of new requirements from a
customer-facing application, which is the only downstream workload they manage
entirely. As a result, an aggregate table used by numerous teams across the
organization will need to have a number of fields renamed, and additional fields
will also be added.
Which of the solutions addresses the situation while minimally interrupting other
teams in the organization without increasing the number of tables that need to be
managed?
A. * Configure a new table with all the requisite fields and new names and use this as
the source for the customer-facing application; create a view that maintains the
original data schema and table name by aliasing select fields from the new table.
B. Create a new table with the required schema and new fields and use Delta Lake's
DEEP CLONE functionality to sync up changes committed to one table to the
corresponding table.
C. Replace the current table definition with a logical view defined with the query logic
currently writing the aggregate table; create a new table to power the
customer-facing application.
D. Add a table comment warning all users that the table schema and field names will
be changing on a given date; overwrite the table in place to the specifications of the
customer-facing application.
E. Send all users notice that the schema for the table will be changing; include in the
communication the logic necessary to revert the new table schema to match historic
queries.
Based on the above schema, which column is a good candidate for partitioning
the Delta Table?
A. * date
B. post_time
C. post_id
D. user_id
E. latitude
This table is partitioned by the date column. A query is run with the following filter:
Assuming that all data governance considerations are accounted for, which
statement accurately informs this decision?
A. * Cross-region reads and writes can incur significant costs and latency; whenever
possible, compute should be deployed in the same region the data is stored.
B. Databricks notebooks send all executable code from the user's browser to virtual
machines over the open internet; whenever possible, choosing a workspace region
near the end users is the most secure.
C. Databricks workspaces do not rely on any regional infrastructure; as such, the
decision should be made based upon what is most convenient for the workspace
administrator.
D. Databricks leverages user workstations as the driver during interactive
development; as such, users should always use a workspace deployed in a region
they are physically near.
E. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines
must be deployed in the region where the data is stored.
A junior engineer has written the following code to add CHECK constraints to the
Delta Lake table:
CHECK (
A senior engineer has confirmed the above logic is correct and the valid ranges for
latitude and longitude are provided, but the code fails when executed.
What explains the cause of this failure?
CHECK (
A batch job is attempting to insert new records to the table, including a record
where latitude = 45.50 and longitude = 212.67.
Which consideration will impact the decisions made by the engineer while
migrating this workload?
A. * All Delta Lake transactions are ACID compliant against a single table, and
Databricks does not enforce foreign key constraints.
B. Committing to multiple tables simultaneously requires taking out multiple table locks
and can lead to a state of deadlock.
C. Databricks only allows foreign key constraints on hashed identifiers, which avoid
collisions in highly-parallel writes.
D. Foreign keys must reference a primary key field; multi-table inserts must leverage
Delta Lake's upsert functionality.
E. Databricks supports Spark SQL and JDBC; all logic can be directly migrated from
the source system without refactoring.
USING (
FROM updates
UNION ALL
ON updates.customer_id = customers.customer_id
ON customers.customer_id = mergeKey
VALUES(staged_updates.customer_id, staged_updates.address,
true, staged_updates.effective_date, null)
What are the maximum notebook permissions that can be granted to the user
without allowing accidental changes to production code or data?
A. * Can Read
B. Can Run
C. Can Edit
D. Can Manage
E. No permissions
86. Item ID: IT001431
A junior data engineer has manually configured a series of jobs using the
Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are
listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to
the "DevOps" group, but cannot successfully accomplish this task.
FROM user_ltv
WHERE
CASE
WHEN is_member("auditing") THEN TRUE
END
An analyst who is not a member of the auditing group executes the following
query:
There are 5 unique topics being ingested. Only the "registration" topic
contains Personally Identifiable Information (PII). The company wishes to restrict
access to PII. The company also wishes to only retain records containing PII in
this table for 14 days after initial ingestion. However, for non-PII information, it
would like to retain these records indefinitely.
Which solution meets the requirements?
A. * Data should be partitioned by the topic field, allowing ACLs and delete
statements to leverage partition boundaries.
B. Because the value field is stored as binary data, this information is not considered
PII and no special precautions should be taken.
C. Data should be partitioned by the registration field, allowing ACLs and delete
statements to be set for the PII directory.
D. Separate object storage containers should be specified based on the partition
field, allowing isolation at the storage level.
E. All data should be deleted biweekly; Delta Lake's time travel functionality should be
leveraged to maintain a history of non-PII information.
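A hedged sketch of option A; the table name and the simplified schema are illustrative. Partitioning by topic lets both access controls and the 14-day PII retention delete align with partition boundaries:
# Bronze table partitioned by the Kafka topic (schema simplified/illustrative)
spark.sql("""
  CREATE TABLE IF NOT EXISTS kafka_bronze (
    key BINARY, value BINARY, topic STRING, ingest_date DATE)
  USING DELTA
  PARTITIONED BY (topic)
""")

# 14-day retention applied only to the PII-bearing partition
spark.sql("""
  DELETE FROM kafka_bronze
  WHERE topic = 'registration'
    AND ingest_date < current_date() - INTERVAL 14 DAYS
""")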
Which command allows manual confirmation that these three requirements have
been met?
A. * DESCRIBE EXTENDED dev.pii_test
B. DESCRIBE DETAIL dev.pii_test
C. DESCRIBE HISTORY dev.pii_test
D. SHOW TBLPROPERTIES dev.pii_test
E. SHOW TABLES dev
.format("delta")
.option("readChangeData", True)
.table("user_lookup")
.createOrReplaceTempView("changes"))
spark.sql("""
WHERE user_id IN (
SELECT user_id
FROM changes
WHERE _change_type='delete'
""")
Assume that user_id is a unique identifying key and that all users that have
requested deletion have been removed from the user_lookup table.
Will executing the above logic guarantee that the records to be deleted from the
user_aggregates table are no longer accessible and why?
A. * No; files containing deleted records may still be accessible with time travel until a
VACUUM command is used to remove invalidated data files.
B. No; the change data feed only tracks inserts and updates, not deleted records.
C. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command
succeeded fully and permanently purged these records.
D. No; the Delta Lake DELETE command only provides ACID guarantees when
combined with the MERGE INTO command.
E. Yes; the change data feed uses foreign keys to ensure delete consistency
throughout the Lakehouse.
WHERE user_id IN
Will successfully executing the above logic guarantee that the records to be
deleted are no longer accessible and why?
A. * No; files containing deleted records may still be accessible with time travel until a
VACUUM command is used to remove invalidated data files.
B. No; the Delta cache may return records from previous versions of the table until the
cluster is restarted.
C. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command
succeeded fully and permanently purged these records.
D. No; the Delta Lake DELETE command only provides ACID guarantees when
combined with the MERGE INTO command.
E. Yes; the Delta cache immediately updates to reflect the latest data files recorded to
disk.
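For reference, a hedged sketch of the follow-up step option A points to; the table name follows the question and 168 hours is the default retention window:
# Deleted records remain reachable via time travel until their data files
# are physically removed; VACUUM purges files older than the retention window
spark.sql("VACUUM user_aggregates RETAIN 168 HOURS")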
The following logic was executed to grant privileges for interactive queries on a
production database to the core engineering group.
Assuming these are the only privileges that have been granted to the eng group
along with permission on the catalog, and that these users are not workspace
administrators, which statement describes their privileges?
A. * Group members are able to query all tables and views in the prod database, but
cannot create or edit anything in the database.
B. Group members are able to query and modify all tables and views in the prod
database, but cannot create new tables or views.
C. Group members are able to create, query, and modify all tables and views in the
prod database, but cannot define custom functions.
D. Group members are able to list all tables in the prod database but are not able to
see the results of any queries on those tables.
E. Group members have full permissions on the prod database and can also assign
permissions to other users or groups.
The following logic was executed to create a database for the finance team:
LOCATION '/mnt/finance_eda_bucket';
SELECT *
FROM sales
WHERE state = "TX";
If all users on the finance team are members of the finance group, which
statement describes how the tx_sales table will be created?
A. * A managed table will be created in the storage container mounted to
/mnt/finance_eda_bucket.
B. An external table will be created in the storage container mounted to
/mnt/finance_eda_bucket.
C. A managed table will be created in the DBFS root storage container.
D. A logical table will persist the query plan to the Hive Metastore in the Databricks
control plane.
E. A logical table will persist the physical plan to the Hive Metastore in the Databricks
control plane.
A new login credential has been created for each group in the external database.
The Databricks Utilities Secrets module will be used to make these credentials
available to Databricks users.
Assuming that all the credentials are configured correctly on the external database
and group membership is properly configured on Databricks, which statement
describes how teams can be granted the minimum necessary access to using
these credentials?
A. * "Read" permissions should be set on a secret scope containing only those
credentials that will be used by a given team.
B. "Read" permissions should be set on a secret key mapped to those credentials that
will be used by a given team.
C. "Manage" permissions should be set on a secret scope containing only those
credentials that will be used by a given team.
D. "Manage" permissions should be set on a secret key mapped to those credentials
that will be used by a given team.
E. No additional configuration is necessary as long as all users are configured as
administrators in the workspace where secrets have been added.
Which statement describes the contents of the workspace audit logs concerning
these events?
A. * Because these events are managed separately, User A will have their identity
associated with the job creation events and User B will have their identity
associated with the job run events.
B. Because User A created the jobs, their identity will be associated with both the job
creation events and the job run events.
C. Because User B last configured the jobs, their identity will be associated with both
the job creation events and the job run events.
D. Because the REST API was used for job creation and triggering runs, user identity
will not be captured in the audit logs.
E. Because the REST API was used for job creation and triggering runs, a Service
Principal will be automatically used to identify these events.
An application has been configured to collect and parse run information returned
by the REST API. Which statement describes the value returned in the
creator_user_name field?
A. * Once User C takes "Owner" privileges, their email address will appear in this field;
prior to this, User A's email address will appear in this field.
B. Once User C takes "Owner" privileges, their email address will appear in this field;
prior to this, User B's email address will appear in this field.
C. User A's email address will always appear in this field, as they still own the
underlying notebooks.
D. User B's email address will always appear in this field, as their credentials are
always used to trigger the run.
E. User C will only ever appear in this field if they manually trigger the job, otherwise it
will indicate User B.
In which location can one review the timeline for cluster resizing events?
A. * Cluster Event Log
B. Driver’s log file
C. Executor’s log file
D. Workspace audit logs
E. Ganglia
Which of the following adjustments will get a more accurate measure of how code
is likely to perform in production, and what is the limitation of this approach?
A. * Calling display() forces a job to trigger, while many transformations will only
add to the logical query plan; because of caching, repeated execution of the same
logic does not provide meaningful results.
B. The only way to meaningfully troubleshoot code execution times in development
notebooks is to use production-sized data and production-sized clusters with Run
All execution.
C. The Jobs UI should be leveraged to occasionally run the notebook as a job and
track execution time during incremental code development because Photon can
only be enabled on clusters launched for scheduled jobs.
D. Scala is the only language that can be accurately tested using interactive
notebooks; because the best performance is achieved by using Scala code
compiled to JARs, all PySpark and Spark SQL logic should be refactored.
E. Production code development should only be done using an IDE; executing code
against a local build of open source Spark and Delta Lake will provide the most
accurate benchmarks for how code will perform in production.
When evaluating the Ganglia Metrics for this cluster, which indicator would signal
a bottleneck caused by code executing on the driver?
A. * Overall cluster CPU utilization is around 25%
B. Total Disk Space remains constant
C. The five Minute Load Average remains consistent/flat
D. Bytes Received never exceeds 80 million bytes per second
E. Network I/O never spikes
<command-3293767849433948> in <module>
----> 1 display(df.select(3*"heartrate"))
[PySpark/py4j traceback frames omitted]
'Project ['heartrateheartrateheartrate]
+- SubqueryAlias spark_catalog.database.table
   +- Relation[device_id#75L,heartrate#76,mrn#77L,time#78] parquet
Which statement describes what the number alongside this field represents?
A. * The globally unique ID of the newly triggered run.
B. The number of times the job definition has been run in this workspace.
C. The total number of jobs that have been run in the workspace.
D. The job_id is returned in this field.
E. The job_id and number of times the job has been run are concatenated and
returned.
If tasks A and B complete successfully but task C fails during a scheduled run,
which statement describes the resulting state?
A. * All logic expressed in the notebook associated with tasks A and B will have been
successfully completed; some operations in task C may have completed
successfully.
B. Unless all tasks complete successfully, no changes will be committed to the
Lakehouse; because task C failed, all commits will be rolled back automatically.
C. All logic expressed in the notebook associated with task A will have been
successfully completed; tasks B and C will not commit any changes because of
stage failure.
D. All logic expressed in the notebook associated with tasks A and B will have been
successfully completed; any changes made in task C will be rolled back due to task
failure.
E. Because all tasks are managed as a dependency graph, no changes will be
committed to the Lakehouse until all tasks have successfully been completed.
If task A fails during a scheduled run, which statement describes the results of this
run?
A. * Tasks B and C will be skipped; some logic expressed in task A may have been
committed before task failure.
B. Unless all tasks complete successfully, no changes will be committed to the
Lakehouse; because task A failed, all commits will be rolled back automatically.
C. Tasks B and C will be skipped; task A will not commit any changes because of
stage failure.
D. Tasks B and C will attempt to run as configured; any changes made in task A will be
rolled back due to task failure.
E. Because all tasks are managed as a dependency graph, no changes will be
committed to the Lakehouse until all tasks have successfully been completed.