Databricks Exam
A. Workspace
Data Management
https://docs.google.com/forms/d/e/1FAIpQLSf4eRe7rFglccAi-6vLFQ9gVjmLaHQsSQYtfrAuNm353UMzDQ/viewform 1/14
11/9/24, 11:50 AM Databricks Exam
A senior data engineer wants to create a new table from this table using
the following command:
A junior data engineer asks why the schema is not being declared for the
new table. Which of the following responses explains why declaring the
schema is not necessary?
CREATE TABLE AS SELECT statements result in tables that do not support schemas.
CREATE TABLE AS SELECT statements adopt schema details from the source table
and query.
CREATE TABLE AS SELECT statements assign all columns the type STRING.
CREATE TABLE AS SELECT statements result in tables where schemas are optional.
CREATE TABLE AS SELECT statements infer the schema by scanning the data.
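The schema-adoption behavior behind this question can be sketched with the Python standard library's sqlite3 (an illustration only; this is standard-SQL CTAS behavior, not Databricks or Delta Lake): the new table's column names come from the source table and the query, with no schema declared.

```python
import sqlite3

# Sketch of CREATE TABLE AS SELECT schema adoption using stdlib sqlite3
# (assumption: used purely to illustrate the principle, not Databricks).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (storeId INTEGER, sqft REAL, division TEXT)")
conn.execute("INSERT INTO stores VALUES (1, 12000.0, 'west')")

# No column list or types declared for the new table:
conn.execute("CREATE TABLE big_stores AS SELECT storeId, division FROM stores")

# PRAGMA table_info rows are (cid, name, type, ...); pull the names.
cols = [row[1] for row in conn.execute("PRAGMA table_info(big_stores)")]
print(cols)  # ['storeId', 'division'] -- adopted from the source and query
```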
Which of the following describes a scenario in which a data engineer will want to use a Job cluster instead of an all-purpose cluster? (2 points)
The code block shown below contains an error. The code block is intended to return a DataFrame containing only the rows from DataFrame storesDF where the value in storesDF's "sqft" column is less than or equal to 25,000. Assume DataFrame storesDF is the only defined variable. Identify the error. (2 points)
Code block:
The column name sqft needs to be quoted like storesDF.filter("sqft" <= 25000).
The sign in the logical condition inside filter() needs to be changed from <= to >=.
The sign in the logical condition inside filter() needs to be changed from <= to >.
The column name sqft needs to be quoted and wrapped in the col() function like
storesDF.filter(col("sqft") <= 25000).
The column name sqft needs to be wrapped in the col() function like
storesDF.filter(col(sqft) <= 25000)
A data engineering team needs to query a Delta table to extract rows that all meet the same condition. However, the team has noticed that the query is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the condition are sparsely located throughout each of the data files. Based on the scenario, which of the following optimization techniques could speed up the query? (2 points)
Data skipping
Bin-packing
Z-Ordering
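The intuition behind this question can be sketched with a toy model of file-level min/max data skipping (an illustrative simplification, not Delta Lake's actual statistics machinery): when matching rows are scattered across every file, no file's min/max range can rule it out, but once values are co-located (as Z-Ordering does), most files are pruned.

```python
# Toy model of min/max data skipping (assumption: simplified for
# illustration; this is not Delta Lake's real implementation).
def files_to_scan(files, lo, hi):
    # Keep a file only if its [min, max] range can overlap [lo, hi].
    return [f for f in files if not (max(f) < lo or min(f) > hi)]

rows = list(range(100))
scattered = [rows[i::4] for i in range(4)]           # each file spans ~0..99
clustered = [rows[i * 25:(i + 1) * 25] for i in range(4)]  # values co-located

print(len(files_to_scan(scattered, 10, 10)))  # 4 -> no file can be skipped
print(len(files_to_scan(clustered, 10, 10)))  # 1 -> three files skipped
```

With scattered data every file must be read for the predicate `value == 10`; after clustering, only one file overlaps the predicate's range.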
Which of the following DataFrame operations is a narrow transformation? (2 points)
DataFrame.sort()
DataFrame.distinct()
DataFrame.repartition()
DataFrame.select()
DataFrame.join()
Which of the following operations fails to return a DataFrame where every row is unique? (2 points)
DataFrame.distinct()
DataFrame.drop_duplicates(subset = None)
DataFrame.drop_duplicates()
DataFrame.dropDuplicates()
DataFrame.drop_duplicates(subset = "all")
A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data engineer suggests that this is inefficient and the table should simply be overwritten instead. Which of the following reasons to overwrite the table instead of deleting and recreating the table is incorrect? (2 points)
Overwriting a table results in a clean table history for logging and audit purposes.
Overwriting a table maintains the old version of the table for Time Travel.
Overwriting a table is an atomic operation and will not leave the table in an unfinished state.
Which of the following locations hosts the driver and worker nodes of a Databricks-managed cluster? (2 points)
A. Data plane
B. Control plane
C. Databricks Filesystem
Which of the following code blocks returns a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction? (2 points)
A sample of DataFrame storesDF is below:
storesDF.drop("sqft", "customerSatisfaction")
storesDF.select(-col(sqft), -col(customerSatisfaction))
storesDF.drop(sqft, customerSatisfaction)
storesDF.drop(col(sqft), col(customerSatisfaction))
Which of the following code blocks returns a DataFrame containing only column storeId and column division from DataFrame storesDF? (2 points)
storesDF.select("storeId").select("division")
storesDF.select(storeId, division)
storesDF.select("storeId", "division")
storesDF.select(col("storeId", "division"))
storesDF.select(storeId).select(division)
A data engineering team has created a series of tables using Parquet. (2 points)
The tables should be refreshed in the writing cluster before the next query is run
A data engineer has created a Delta table as part of a data pipeline. (2 points)
A. Repos
B. Jobs
C. Data Explorer
D. Databricks Filesystem
E. Dashboards
DataFrame.filter()
DataFrame.distinct()
DataFrame.intersect()
DataFrame.join()
DataFrame.count()
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. (2 points)
The code block used by the data engineer is below:
(
    spark.table("sales")
        .withColumn("avg_price", col("sales") / col("units"))
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("complete")
        ._____
        .table("new_sales")
)
If the data engineer only wants the query to execute a single micro-batch
to process all of the available data, which of the following lines of code
should the data engineer use to fill in the blank?
trigger(once=True)
trigger(continuous="once")
processingTime("once")
trigger(processingTime="once")
processingTime(1)
A data engineer has developed a code block to perform a streaming read. (2 points)
The data engineer asks a colleague for help to convert this query for use in
a Delta Live Tables (DLT) pipeline. The query should create the first table in
the DLT pipeline. Which of the following describes the change the
colleague needs to make to the query?
B. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of
the query.
C. They need to add a live. prefix prior to json. in the FROM line.
D. They need to add a CREATE DELTA LIVE TABLE table_name AS line at the
beginning of the query.
E. They need to add the cloud_files(...) wrapper to the JSON file path.
A data analyst has provided a data engineering team with the following Spark SQL query: (2 points)
SELECT district, avg(sales) FROM store_sales_20220101 GROUP BY district;
The data analyst would like the data engineering team to run this query every day. The date at the end of the table name (20220101) should automatically be replaced with the current date each time the query is run. Which of the following approaches could be used by the data engineering team to efficiently automate this process?
They could replace the string-formatted date in the table with a timestamp-formatted
date.
They could request that the data analyst rewrites the query to be run less frequently.
They could manually replace the date within the table name with the current day’s
date.
They could pass the table into PySpark and develop a robustly tested module on the existing query.
They could wrap the query using PySpark and use Python’s string variable system to
automatically update the table name.
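The correct approach can be sketched in plain Python (the helper name build_query is hypothetical, chosen for this illustration; on Databricks the returned string would be passed to spark.sql(), with the Spark session assumed rather than created here):

```python
from datetime import date

# Hypothetical helper: rebuild the analyst's query so the table-name
# suffix tracks the run date via Python string formatting.
def build_query(run_date: date) -> str:
    suffix = run_date.strftime("%Y%m%d")
    return (
        f"SELECT district, avg(sales) "
        f"FROM store_sales_{suffix} "
        f"GROUP BY district"
    )

# For a daily job, build_query(date.today()) would yield the current table.
print(build_query(date(2022, 1, 1)))
# SELECT district, avg(sales) FROM store_sales_20220101 GROUP BY district
```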
Which of the following is the default storage level for persist() for a non-streaming DataFrame/Dataset? (2 points)
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_SER
MEMORY_ONLY
A junior data engineer has ingested a JSON file into a table raw_table with the following schema: (2 points)
The junior data engineer would like to unnest the items column in raw_table to result in a new table with the following schema:
Which of the following commands should the junior data engineer run to complete this task?
A data architect has determined that a table of the following format is necessary: (2 points)
[Refer to image]
Which of the following code blocks uses SQL DDL commands to create an empty Delta table in the above format regardless of whether a table already exists with this name?
Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column numberOfManagers is the constant integer 1? (2 points)
storesDF.withColumn("numberOfManagers", lit("1"))
storesDF.withColumn("numberOfManagers", lit(1))
storesDF.withColumn("numberOfManagers", IntegerType(1))
storesDF.withColumn("numberOfManagers", 1)
storesDF.withColumn("numberOfManagers", col(1))
The code block shown below should extract the value for column sqft from the first row of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. (2 points)
Code block:
__1__.__2__.__3__
Which of the following describes Delta Lake? (2 points)
A. Delta Lake is an open source analytics engine used for big data workloads.
B. Delta Lake is an open format storage layer that delivers reliability, security, and
performance.
C. Delta Lake is an open source platform to help manage the complete machine
learning lifecycle.
D. Delta Lake is an open source data storage format for distributed data.