

Databricks Exam

Which one of the following is a Databricks concept? * 2 points

A. Workspace

Data Management

Authentication and authorization


A table customerLocations exists with the following schema:

id STRING, date STRING, city STRING, country STRING

A senior data engineer wants to create a new table from this table using
the following command:

CREATE TABLE customersPerCountry AS
SELECT country, COUNT(*) AS customers
FROM customerLocations
GROUP BY country;

A junior data engineer asks why the schema is not being declared for the
new table. Which of the following responses explains why declaring the
schema is not necessary? * 2 points

CREATE TABLE AS SELECT statements result in tables that do not support schemas.

CREATE TABLE AS SELECT statements adopt schema details from the source table
and query.

CREATE TABLE AS SELECT statements assign all columns the type STRING.

CREATE TABLE AS SELECT statements result in tables where schemas are optional.

CREATE TABLE AS SELECT statements infer the schema by scanning the data.
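
For reference, a minimal sketch of how a CTAS statement picks up its schema (assuming a SparkSession named spark and the tables above exist):

# CTAS derives the new table's schema from the query result,
# so no explicit column definitions are required.
spark.sql("""
    CREATE TABLE customersPerCountry AS
    SELECT country, COUNT(*) AS customers
    FROM customerLocations
    GROUP BY country
""")
spark.table("customersPerCountry").printSchema()
# country: string, customers: bigint -- adopted from the source table and query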

Which of the following describes a scenario in which a data engineer will
want to use a Job cluster instead of an all-purpose cluster? * 2 points

A. An ad-hoc analytics report needs to be developed while minimizing compute costs.

B. A data team needs to collaborate on the development of a machine learning model.

C. An automated workflow needs to be run every 30 minutes.

D. A Databricks SQL query needs to be scheduled for upward reporting.

E. A data engineer needs to manually investigate a production error.


The code block below contains an error. The code block is intended to
return a DataFrame containing only the rows from DataFrame storesDF
where the value in storesDF's "sqft" column is less than or equal to
25,000. Assume storesDF is the only defined variable. Identify the
error. * 2 points

Code block:

storesDF.filter(sqft <= 25000)

The column name sqft needs to be quoted like storesDF.filter("sqft" <= 25000).

The sign in the logical condition inside filter() needs to be changed from <= to >=.

The sign in the logical condition inside filter() needs to be changed from <= to >.

The column name sqft needs to be quoted and wrapped in the col() function like
storesDF.filter(col("sqft") <= 25000).

The column name sqft needs to be wrapped in the col() function like
storesDF.filter(col(sqft) <= 25000)
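
For reference, a minimal sketch of the corrected filter (assuming storesDF is defined and PySpark is available):

from pyspark.sql.functions import col

# sqft is not a Python variable, so it must be referenced as a Column
# object; a quoted SQL expression string also works.
storesDF.filter(col("sqft") <= 25000)
storesDF.filter("sqft <= 25000")   # equivalent string-expression form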

A data engineering team needs to query a Delta table to extract rows that
all meet the same condition. However, the team has noticed that the query
is running slowly. The team has already tuned the size of the data files.
Upon investigating, the team has concluded that the rows meeting the
condition are sparsely located throughout each of the data files. Based on
the scenario, which of the following optimization techniques could speed
up the query? * 2 points

Write as a Parquet file

Tuning the file size

Data skipping

Bin-packing


Z-Ordering
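
As an illustration of the technique, a Z-Ordering pass on a hypothetical Delta table and filter column might look like this:

# OPTIMIZE with ZORDER BY co-locates rows sharing values in the chosen
# column, so data skipping can prune far more files for that predicate.
spark.sql("OPTIMIZE events ZORDER BY (event_type)")  # table/column names are hypothetical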


Which of the following DataFrame operations is always classified as a
narrow transformation? * 2 points

DataFrame.sort()

DataFrame.distinct()

DataFrame.repartition()

DataFrame.select()

DataFrame.join()
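
For context, a quick sketch of the distinction (assuming storesDF is defined):

storesDF.select("storeId")   # narrow: each output partition depends on one input partition
storesDF.sort("storeId")     # wide: ordering requires a shuffle across partitions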

Which of the following operations fails to return a DataFrame where every
row is unique? * 2 points

DataFrame.distinct()

DataFrame.drop_duplicates(subset = None)

DataFrame.drop_duplicates()

DataFrame.dropDuplicates()

DataFrame.drop_duplicates(subset = "all")
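
For reference, a sketch of the deduplication variants (assuming storesDF is defined); note that subset expects a list of column names, not the string "all":

storesDF.distinct()                    # full-row deduplication
storesDF.dropDuplicates()              # same result as distinct()
storesDF.drop_duplicates(subset=None)  # alias; subset=None means all columns
storesDF.dropDuplicates(["storeId"])   # subset dedup: rows may still repeat in other columns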


A data engineer is overwriting data in a table by deleting the table and
recreating the table. Another data engineer suggests that this is inefficient
and the table should simply be overwritten instead. Which of the following
reasons to overwrite the table instead of deleting and recreating the table
is incorrect? * 2 points

Overwriting a table is efficient because no files need to be deleted.

Overwriting a table results in a clean table history for logging and audit purposes.

Overwriting a table maintains the old version of the table for Time Travel.

Overwriting a table is an atomic operation and will not leave the table in an
unfinished state.

Overwriting a table allows for concurrent queries to be completed while in progress.
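
For context, a minimal sketch of an overwrite that preserves table history (table names are hypothetical):

# CREATE OR REPLACE rewrites the table contents in a single atomic commit;
# earlier versions stay in the Delta transaction log for Time Travel.
spark.sql("CREATE OR REPLACE TABLE events AS SELECT * FROM staged_events")
spark.sql("DESCRIBE HISTORY events").show()  # prior versions remain visible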

Which of the following locations hosts the driver and worker nodes of a
Databricks-managed cluster? * 2 points

E. Databricks web application

C. Databricks Filesystem

A. Data plane

B. Control plane

D. JDBC data source


Which of the following code blocks returns a DataFrame containing all
columns from DataFrame storesDF except for column sqft and column
customerSatisfaction? * 2 points

A sample of DataFrame storesDF is below: [sample not reproduced]

storesDF.drop("sqft", "customerSatisfaction")

storesDF.select("storeId", "open", "openDate", "division")

storesDF.select(-col(sqft), -col(customerSatisfaction))

storesDF.drop(sqft, customerSatisfaction)

storesDF.drop(col(sqft), col(customerSatisfaction))

Which of the following code blocks returns a DataFrame containing only
column storeId and column division from DataFrame storesDF? * 2 points

storesDF.select("storeId").select("division")

storesDF.select(storeId, division)

storesDF.select("storeId", "division")

storesDF.select(col("storeId", "division"))

storesDF.select(storeId).select(division)


A data engineering team has created a series of tables using Parquet
data stored in an external system. The team is noticing that after
appending new rows to the data in the external system, their queries within
Databricks are not returning the new rows. They identify the caching of the
previous data as the cause of this issue. Which of the following
approaches will ensure that the data returned by queries is always
up-to-date? * 2 points

The tables should be altered to include metadata to not cache

The tables should be refreshed in the writing cluster before the next query is run

The tables should be stored in a cloud-based external system

The tables should be converted to the Delta format

The tables should be updated before the next query is run
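
As a sketch of the suggested fix (path is hypothetical), converting the Parquet data to Delta lets every query resolve the latest state from the transaction log:

# CONVERT TO DELTA upgrades the table in place; subsequent queries read
# the transaction log instead of stale cached file listings.
spark.sql("CONVERT TO DELTA parquet.`/mnt/external/sales`")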

A data engineer has created a Delta table as part of a data pipeline.
Downstream data analysts now need SELECT permission on the Delta
table. Assuming the data engineer is the Delta table owner, which part of
the Databricks Lakehouse Platform can the data engineer use to grant the
data analysts the appropriate access? * 2 points

A. Repos

B. Jobs

C. Data Explorer

D. Databricks Filesystem

E. Dashboards
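
For context, the equivalent permission can also be expressed in SQL (table and group names are hypothetical):

spark.sql("GRANT SELECT ON TABLE pipeline_output TO `data-analysts`")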


Which of the following operations will trigger evaluation? * 2 points

DataFrame.filter()

DataFrame.distinct()

DataFrame.intersect()

DataFrame.join()

DataFrame.count()
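
For reference, a sketch of the lazy/eager distinction (assuming storesDF is defined):

from pyspark.sql.functions import col

filtered = storesDF.filter(col("sqft") <= 25000)  # transformation: builds a plan, no work yet
n = filtered.count()                              # action: triggers evaluation of the plan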

A data engineer has configured a Structured Streaming job to read from a
table, manipulate the data, and then perform a streaming write into a new
table. The code block used by the data engineer is below:

(spark.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    ._____
    .table("new_sales")
)

If the data engineer only wants the query to execute a single micro-batch
to process all of the available data, which of the following lines of code
should the data engineer use to fill in the blank? * 2 points

trigger(once=True)

trigger(continuous="once")

processingTime("once")

trigger(processingTime="once")

processingTime(1)
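
For reference, a sketch of the completed writer (assuming col is imported and checkpointPath is defined; toTable is the PySpark method name for writing a streaming table, where the question writes .table):

from pyspark.sql.functions import col

(spark.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .trigger(once=True)        # one micro-batch over all available data, then stop
    .toTable("new_sales"))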


A data engineer has developed a code block to perform a streaming read
on a data source. The code block is below:

(spark
    .read
    .schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(dataSource)
)

The code block is returning an error. Which of the following changes
should be made to the code block to configure the block to successfully
perform a streaming read? * 2 points

A new .stream line should be added after the .load(dataSource) line.

A new .stream line should be added after the spark line.

The .read line should be replaced with .readStream.

A new .stream line should be added after the .read line.

The .format("cloudFiles") line should be replaced with .format("stream").


Which of the following describes a benefit of a data lakehouse that is
unavailable in a traditional data warehouse? * 2 points

E. A data lakehouse enables both batch and streaming analytics.

D. A data lakehouse utilizes proprietary storage formats for data.

A. A data lakehouse provides a relational system of data management.

C. A data lakehouse couples storage and compute for complete control.

B. A data lakehouse captures snapshots of data for version control purposes.


A data engineer has written the following query:

SELECT * FROM json.`/path/to/json/file.json`;

The data engineer asks a colleague for help to convert this query for use in
a Delta Live Tables (DLT) pipeline. The query should create the first table in
the DLT pipeline. Which of the following describes the change the
colleague needs to make to the query? * 2 points

A. They need to add a COMMENT line at the beginning of the query.

B. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of
the query.

C. They need to add a live. prefix prior to json. in the FROM line.

D. They need to add a CREATE DELTA LIVE TABLE table_name AS line at the
beginning of the query.

E. They need to add the cloud_files(...) wrapper to the JSON file path.
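
For context, a minimal sketch of the same first table using the DLT Python API instead of the SQL CREATE LIVE TABLE form (the table name is hypothetical; import dlt is only available inside a DLT pipeline):

import dlt

@dlt.table(name="raw_events")  # hypothetical table name
def raw_events():
    # Equivalent to: CREATE LIVE TABLE raw_events AS SELECT * FROM json.`...`
    return spark.read.json("/path/to/json/file.json")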

A data analyst has provided a data engineering team with the following
Spark SQL query:

SELECT district, avg(sales) FROM store_sales_20220101 GROUP BY district;

The data analyst would like the data engineering team to run this query
every day. The date at the end of the table name (20220101) should
automatically be replaced with the current date each time the query is run.
Which of the following approaches could be used by the data engineering
team to efficiently automate this process? * 2 points

They could replace the string-formatted date in the table with a timestamp-formatted
date.

They could request that the data analyst rewrites the query to be run less frequently.

They could manually replace the date within the table name with the current day’s
date.

They could pass the table into PySpark and develop a robustly tested module on the
existing query

They could wrap the query using PySpark and use Python’s string variable system to
automatically update the table name.
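
For reference, a minimal sketch of the PySpark wrapper approach (assuming the daily tables follow the store_sales_YYYYMMDD naming pattern):

from datetime import date

table_name = f"store_sales_{date.today():%Y%m%d}"  # e.g. store_sales_20220101
resultDF = spark.sql(
    f"SELECT district, avg(sales) FROM {table_name} GROUP BY district"
)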


Which of the following is the default storage level for persist() for a
non-streaming DataFrame/Dataset? * 2 points

MEMORY_AND_DISK

MEMORY_AND_DISK_SER

DISK_ONLY

MEMORY_ONLY_SER

MEMORY_ONLY
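
For reference, a minimal sketch (the default shown matches the Spark versions this exam targets):

from pyspark import StorageLevel

storesDF.persist()  # equivalent to persist(StorageLevel.MEMORY_AND_DISK)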

A junior data engineer has ingested a JSON file into a table raw_table with
the following schema:

cart_id STRING, items ARRAY<STRING>

The junior data engineer would like to unnest the items column in
raw_table to result in a new table with the following schema:

cart_id STRING, item_id STRING

Which of the following commands should the junior data engineer run to
complete this task? * 2 points

SELECT cart_id, filter(items) AS item_id FROM raw_table;

SELECT cart_id, flatten(items) AS item_id FROM raw_table;

SELECT cart_id, reduce(items) AS item_id FROM raw_table;

SELECT cart_id, explode(items) AS item_id FROM raw_table;

SELECT cart_id, slice(items) AS item_id FROM raw_table;
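
For reference, explode() emits one output row per array element, which matches the desired unnesting:

from pyspark.sql.functions import explode

spark.sql("SELECT cart_id, explode(items) AS item_id FROM raw_table")
# DataFrame API equivalent:
spark.table("raw_table").select("cart_id", explode("items").alias("item_id"))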


A data architect has determined that a table of the following format is
necessary:

[Refer to image]

Which of the following code blocks uses SQL DDL commands to create an
empty Delta table in the above format regardless of whether a table
already exists with this name? * 2 points

CREATE OR REPLACE TABLE table_name AS SELECT id STRING, birthDate DATE,
avgRating FLOAT USING DELTA

CREATE OR REPLACE TABLE table_name (id STRING, birthDate DATE, avgRating FLOAT)

CREATE TABLE IF NOT EXISTS table_name (id STRING, birthDate DATE, avgRating FLOAT)

CREATE TABLE table_name AS SELECT id STRING, birthDate DATE, avgRating FLOAT

CREATE OR REPLACE TABLE table_name WITH COLUMNS (id STRING, birthDate DATE,
avgRating FLOAT) USING DELTA
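
For reference, a sketch of the idempotent DDL (on Databricks, Delta is the default table format, so USING DELTA is implicit):

spark.sql("""
    CREATE OR REPLACE TABLE table_name (
        id STRING,
        birthDate DATE,
        avgRating FLOAT
    )
""")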

Which of the following code blocks returns a new DataFrame from
DataFrame storesDF where column numberOfManagers is the constant
integer 1? * 2 points

storesDF.withColumn("numberOfManagers", lit("1"))

storesDF.withColumn("numberOfManagers", lit(1))

storesDF.withColumn("numberOfManagers", IntegerType(1))

storesDF.withColumn("numberOfManagers", 1)

storesDF.withColumn("numberOfManagers", col(1))


The code block shown below should extract the value for column sqft
from the first row of DataFrame storesDF. Choose the response that
correctly fills in the numbered blanks within the code block to complete
this task. * 2 points

Code block:

__1__.__2__.__3__

1. storesDF 2. first 3. ["sqft"]

1. storesDF 2. first() 3. sqft

1. storesDF 2. first 3. col("sqft")

1. storesDF 2. first() 3. col("sqft")

1. storesDF 2. first 3. sqft
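
For reference, first() returns a Row, which can be indexed by column name:

sqft_value = storesDF.first()["sqft"]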

Which of the following statements describes Delta Lake? * 2 points

A. Delta Lake is an open source analytics engine used for big data workloads.

B. Delta Lake is an open format storage layer that delivers reliability, security, and
performance.

C. Delta Lake is an open source platform to help manage the complete machine
learning lifecycle.

D. Delta Lake is an open source data storage format for distributed data.

E. Delta Lake is an open format storage layer that processes data
