Databricks Final
Topic 1 • Exam A
Question #1 Topic 1
An upstream system has been configured to pass the date for a given batch of data to the Databricks
Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the
following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
A. date = spark.conf.get("date")
B. input_dict = input()
   date = input_dict["date"]
C. import sys
   date = sys.argv[1]
D. date = dbutils.notebooks.getParam("date")
E. dbutils.widgets.text("date", "null")
   date = dbutils.widgets.get("date") (Most Voted)
Correct Answer: E
Community vote distribution
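
For context, a minimal sketch of the widget pattern in option E, assuming a notebook task whose job run passes a "date" parameter (the widget name and path mirror the question; everything else is illustrative):

# Minimal sketch of option E: read a job parameter through a notebook widget.
# `dbutils` and `spark` are provided by the Databricks notebook runtime.
dbutils.widgets.text("date", "null")   # declares the widget; a Jobs API run overrides the default
date = dbutils.widgets.get("date")     # returns the parameter as a string, e.g. "2024-01-15"

df = spark.read.format("parquet").load(f"/mnt/source/{date}")
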
Question #2 Topic 1
The Databricks workspace administrator has configured interactive clusters for each of the data
engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each
user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the
following describes the minimal permissions a user would need to start and attach to an already
configured cluster?
B. Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges on the required
cluster
C. Cluster creation allowed, "Can Attach To" privileges on the required cluster
When scheduling Structured Streaming jobs for production, which configuration automatically recovers
from query failures and keeps costs low?
Community vote distribution
D (100%)
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a
Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the
timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:

SELECT MEAN(temperature), MAX(temperature), MIN(temperature)
FROM recent_sensor_recordings
GROUP BY sensor_id

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set
to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1
minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be
true?
A. The total average temperature across all sensors exceeded 120 on three consecutive executions
of the query
B. The recent_sensor_recordings table was unresponsive for three consecutive runs of the query
C. The source query failed to update properly for three consecutive minutes and then restarted
D. The maximum temperature recording for at least one sensor exceeded 120 on three consecutive
executions of the query
E. The average temperature recordings for at least one sensor exceeded 120 on three consecutive
executions of the query (Most Voted)
A junior developer complains that the code in their notebook isn't producing the correct results in the
development environment. A shared screenshot reveals that while they're using a notebook versioned
with Databricks Repos, they're using a personal branch that contains old logic. The desired branch
named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?
A. Use Repos to make a pull request; use the Databricks REST API to update the current branch to
dev-2.3.9
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
(Most Voted)
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
D. Merge all changes back to the main branch in the remote Git repository and clone the repo again
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to
sync with the remote repository
Question #6 Topic 1
The security team is exploring whether or not the Databricks secrets module can be leveraged for
connecting to an external database.
After testing the code with all Python variables being defined with strings, they upload the password to
the secrets module and configure the correct permissions for the currently active user. They then
modify their code to the following (leaving all other variables unchanged):
password = dbutils.secrets.get(scope="db_creds", key="jdbc_password")

print(password)

df = (spark
  .read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", tablename)
  .option("user", username)
  .option("password", password)
)
Which statement describes what will happen when the above code is executed?
A. The connection to the external table will fail; the string "REDACTED" will be printed.
B. An interactive input box will appear in the notebook; if the right password is provided, the
connection will succeed and the encoded password will be saved to DBFS.
C. An interactive input box will appear in the notebook; if the right password is provided, the
connection will succeed and the password will be printed in plain text.
D. The connection to the external table will succeed; the string value of password will be printed in
plain text.
E. The connection to the external table will succeed; the string "REDACTED" will be printed.
(Most Voted)
Correct Answer: E
Community vote distribution
E (100%)

Question #7 Topic 1
The data science team has created and logged a production model using MLflow. The following code
correctly imports and applies the production model to output the predictions as a new DataFrame
named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".
The data science team would like predictions saved to a Delta Lake table with the ability to compare all
predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?
A. preds.write.mode("append").saveAsTable("churn_preds") (Most Voted)
B. preds.write.format("delta").save("/preds/churn_preds")
C, D, E. [Options C, D, and E are Structured Streaming write variants built on writeStream with
outputMode, checkpointLocation, trigger, and start/table settings; the option code is not legible in
this capture.]
Question #8 Topic 1
An upstream source writes Parquet data as hourly batches to directories named with the current date.
A nightly batch job runs the following code to ingest all data from the previous day as indicated by the
date variable:
(spark.read
  .format("parquet")
  .load(f"/mnt/raw_orders/{date}")
  .dropDuplicates(["customer_id", "order_id"])
  .write
  .mode("append")
  .saveAsTable("orders"))
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each
order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours
apart, which statement is correct?
A. Each write to the orders table will only contain unique records, and only those records without
duplicates in the target table will be written.
B. Each write to the orders table will only contain unique records, but newly written records may
have duplicates already present in the target table. (Most Voted)
C. Each write to the orders table will only contain unique records; if existing records with the same
key are present in the target table, these records will be overwritten.
D. Each write to the orders table will only contain unique records; if existing records with the same
key are present in the target table, the operation will fail.
E. Each write to the orders table will run deduplication over the union of new and existing records,
ensuring no duplicate records are present.
Correct Answer: B
A junior member of the data engineering team is exploring the language interoperability of Databricks
notebooks. The intended outcome of the below code is to register a view of all sales that occurred in
countries on the continent of Africa that appear in the geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates the database
contains only two tables: geo_lookup and sales.
Cmd 1
%python
countries_af = [x[0] for x in
  spark.table("geo_lookup").filter("continent = 'AF'").select("country").collect()]
Cmd 2
%sql
CREATE VIEW sales_af AS
SELECT *
FROM sales
WHERE city IN countries_af
AND CONTINENT = "AF"
Which statement correctly describes the outcome of executing these command cells in order in an
interactive notebook?
A. Both commands will succeed. Executing show tables will show that countries_af and sales_af
have been registered as views.
B. Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named
countries_af; if this entity exists, Cmd 2 will succeed.
C. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a
PySpark DataFrame.
D. Both commands will fail. No new variables, tables, or views will be created.
E. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of
strings. (Most Voted)
Correct Answer: E
Community vote distribution
E (91%) 9%
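
As a side note on the failure mode in the correct answer, a hedged sketch of one way to make the Python-side list visible to SQL by registering it as a temporary view first (table and column names follow the question; the rest is illustrative):

# Sketch: expose Python-derived rows to SQL via a temp view instead of a Python list.
# Assumes the geo_lookup and sales tables from the question and a Databricks `spark` session.
countries_af_df = (spark.table("geo_lookup")
                        .filter("continent = 'AF'")
                        .select("country"))
countries_af_df.createOrReplaceTempView("countries_af")   # now addressable from SQL cells

spark.sql("""
    CREATE OR REPLACE TEMP VIEW sales_af AS
    SELECT *
    FROM sales
    WHERE city IN (SELECT country FROM countries_af)
""")
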
A Delta table of weather records is partitioned by date and has the below schema: date DATE,
device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude >
66.3
Which statement describes how the Delta engine identifies which files to load?
A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column (Most Voted)
E. The Hive metastore is scanned for min and max statistics for the latitude column
Correct Answer: D
The data engineering team has configured a job to process customer requests to be forgotten (have
their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default
table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each
Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes
a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are
concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
A. Because the VACUUM command permanently deletes all files containing deleted records,
deleted records may be accessible with time travel for around 24 hours.
B. Because the default data retention threshold is 24 hours, data files containing deleted records
will be retained until the VACUUM job is run the following day.
C. Because Delta Lake time travel provides full access to the entire history of a table, deleted
records can always be recreated by users with full admin privileges.
D. Because Delta Lake's delete statements have ACID guarantees, deleted records will be
permanently purged from all storage systems as soon as a delete job completes.
E. Because the default data retention threshold is 7 days, data files containing deleted records will
be retained until the VACUUM job is run 8 days later. (Most Voted)
Correct Answer: E
Community vote distribution
E (62%) A (38%)
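
For reference, a minimal sketch of the retention behavior the correct answer relies on (the users and delete_requests names reuse tables that appear later in this document and are only illustrative here):

# Sketch: deleted rows remain reachable through time travel until VACUUM removes the old data files.
# Delta Lake's default deleted-file retention is 7 days (168 hours).
spark.sql("DELETE FROM users WHERE user_id IN (SELECT user_id FROM delete_requests)")

# Earlier versions of the table are still queryable until the old files are vacuumed:
spark.sql("SELECT * FROM users VERSION AS OF 1")

# VACUUM only removes data files older than the retention threshold:
spark.sql("VACUUM users")                    # honors the 7-day default
spark.sql("VACUUM users RETAIN 168 HOURS")   # equivalent explicit form
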
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST
API endpoint 2.0/jobs/create.
Assuming that all configurations and referenced resources are available, which statement describes
the result of executing this workload three times?
A. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run
once daily.
B. The logic defined in the referenced notebook will be executed three times on new clusters with
the configurations of the provided cluster ID.
C. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be
executed. (Most Voted)
D. One new job named "Ingest new data" will be defined in the workspace, but it will not be
executed.
E. The logic defined in the referenced notebook will be executed three times on the referenced
existing all-purpose cluster.
Correct Answer: C
Community vote distribution
C (100%)
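
The JSON payload itself is not reproduced here; as a hedged illustration only, a hypothetical 2.0/jobs/create request of this general shape defines a new job on every POST without triggering a run (all names, paths, and values below are assumptions, not the question's payload):

# Hypothetical sketch: POSTing a job definition to the Jobs API; N identical POSTs create N jobs.
import requests

payload = {
    "name": "Ingest new data",                          # job name used in the answer choices
    "new_cluster": {                                    # hypothetical cluster spec
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Repos/prod/ingest"},   # hypothetical notebook path
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",        # daily at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/jobs/create",                # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
    json=payload,
)
print(resp.json())   # each call returns a new job_id; no run is started by jobs/create
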
An upstream system is emitting change data capture (CDC) logs that are being written to a cloud
object storage directory. Each record in the log indicates the change type (insert, update, or delete) and
the values for each field after the change. The source table has a primary key identified by the field
pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that
have ever been valid in the source system. For analytical purposes, only the most recent value for each
record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each
individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?
A. Create a separate history table for each pk_id; resolve the current state of the table by running a
union all filtering the history tables for the most recent state.
B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze
table, then propagate all changes throughout the system.
C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's
versioning ability to create an audit log.
D. Use Delta Lake's change data feed to automatically process CDC data from an external system,
propagating all changes to all dependent tables in the Lakehouse.
E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the
most recent entry for each pk_id into a silver table to recreate the current table state. (Most Voted)
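
A minimal sketch of the pattern in option E, assuming a bronze table of raw CDC logs with pk_id, change_type, and change_time columns and a silver table sharing that schema (all table and column names beyond pk_id are illustrative):

# Sketch of option E: keep full history in bronze, MERGE only the latest change per key into silver.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze = spark.table("cdc_bronze")   # hypothetical bronze table holding every CDC log record

# Keep only the most recent change per pk_id within this hourly batch.
latest = (bronze
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("pk_id").orderBy(F.col("change_time").desc())))
          .filter("rn = 1")
          .drop("rn"))

latest.createOrReplaceTempView("latest_changes")

# Recreate the current state in the silver table (assumed to share the bronze schema).
spark.sql("""
    MERGE INTO cdc_silver AS t
    USING latest_changes AS s
    ON t.pk_id = s.pk_id
    WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
""")
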
An hourly batch job is configured to ingest data files from a cloud object storage container where each
batch represents all records produced by the source system in a given hour. The batch job to process
these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The
user_id field represents a unique key for the data, which has the following schema: user_id BIGINT,
username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN,
last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all
data in the same schema as the source. The next table in the system is named account_current and is
implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly,
which implementation can be used to efficiently update the described account_current table as part of
each hourly batch job?
A. Use Auto Loader to subscribe to new files in the account_history directory; configure a
Structured Streaming trigger once job to batch update newly detected files into the
account_current table.
B. Overwrite the account_current table with each batch using the results of a query against the
account_history table grouping by user_id and filtering for the max value of last_updated.
C. Filter records in account_history using the last_updated field and the most recent hour
processed, as well as the max last_login by user_id; write a merge statement to update or insert the
most recent value for each user_id. (Most Voted)
D. Use Delta Lake version history to get the difference between the latest version of
account_history and one version prior, then write these records to account_current.
E. Filter records in account_history using the last_updated field and the most recent hour
processed, making sure to deduplicate on username; write a merge statement to update or insert
the most recent value for each username.
Correct Answer: C
Community vote distribution
C (67%) B (11%) 7%
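
A minimal sketch of option C, assuming the hourly job tracks the cutoff of the most recent processed hour (the cutoff value and exact filter are illustrative assumptions; the schema fields come from the question):

# Sketch of option C: narrow the merge source to this hour's changes, then upsert by user_id.
from pyspark.sql import functions as F

batch_cutoff = 1700000000   # hypothetical epoch marking the start of the most recent processed hour

updates = (spark.table("account_history")
           .filter(F.col("last_updated") >= batch_cutoff)          # only this hour's records
           .groupBy("user_id")
           .agg(F.max(F.struct("last_updated", "username", "user_utc", "user_region",
                               "last_login", "auto_pay")).alias("r"))  # latest row per user_id
           .select("user_id", "r.*"))

updates.createOrReplaceTempView("account_updates")

spark.sql("""
    MERGE INTO account_current AS t
    USING account_updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
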
A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine
learning team. The table contains information about customers derived from a number of upstream
sources. Currently, the data engineering team populates this table nightly by overwriting the table with
the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only
interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?
A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to
perform an upsert into the predictions table that ignores rows where predictions have not changed.
B. Convert the batch job to a Structured Streaming job using the complete output mode; configure
a Structured Streaming job to read from the customer_churn_params table and incrementally
predict against the churn model.
C. Calculate the difference between the previous model predictions and the current
customer_churn_params on a key identifying unique customers before making new predictions;
only make predictions on those customers not in the previous predictions.
E. Replace the current overwrite logic with a merge statement to modify only those records that
have changed; write logic to make predictions on the changed records identified by the change
data feed. (Most Voted)
Question #16 Topic 1
Both users and orders are Delta Lake tables. Which statement describes the results of querying
recent_orders?
A. All logic will execute at query time and return the result of joining the valid versions of the
source tables at the time the query finishes.
B. All logic will execute when the table is defined and store the result of joining tables to the DBFS;
this stored data will be returned when the table is queried. (Most Voted)
C. Results will be computed and cached when the table is defined; these cached results will
incrementally update as new records are inserted into source tables.
D. All logic will execute at query time and return the result of joining the valid versions of the
source tables at the time the query began.
E. The versions of each source table will be stored in the table transaction log; query results will be
saved to DBFS with each query.
A production workload incrementally applies updates from an external Change Data Capture feed to a
Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this
table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto
Compaction were both turned on for the streaming production job. Recent review of data files shows
that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data
and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?
A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
(Most Voted)
B. Z-order indices calculated on the table are preventing file compaction
C. Bloom filter indices calculated on the table are preventing file compaction
D. Databricks has autotuned to a smaller target file size based on the overall size of data in the
table
E. Databricks has autotuned to a smaller target file size based on the amount of data in each
partition
Correct Answer: A
Community vote distribution
A (88%) 12%
Which statement regarding stream-static joins and static Delta tables is correct?
A. Each microbatch of a stream-static join will use the most recent version of the static Delta table
as of each microbatch. (Most Voted)
B. Each microbatch of a stream-static join will use the most recent version of the static Delta table
as of the job's initialization.
C. The checkpoint directory will be used to track state information for the unique keys present in
the join.
D. Stream-static joins cannot use static Delta tables because of consistency issues.
E. The checkpoint directory will be used to track updates to the static Delta table.
Correct Answer: A
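
As a brief illustration of the behavior in option A, a hedged sketch of a stream-static join (all table names and the checkpoint path are hypothetical):

# Sketch: each microbatch of the stream re-reads the static Delta table,
# so it sees the latest committed version of dim_customers at that point in time.
stream_df = spark.readStream.table("orders_bronze")   # hypothetical streaming source table
static_df = spark.table("dim_customers")              # hypothetical static Delta dimension table

joined = stream_df.join(static_df, on="customer_id", how="left")

(joined.writeStream
       .option("checkpointLocation", "/tmp/checkpoints/orders_enriched")  # hypothetical path
       .toTable("orders_enriched"))
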
A junior data engineer has been asked to develop a streaming data pipeline with a grouped
aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average
temperature for each non-overlapping five-minute interval. Events are recorded once per minute per
device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
(df.withWatermark("event_time", "10 minutes")
  .groupBy(
    ________,
    "device_id")
  .agg(
    avg("temp").alias("avg_temp"),
    avg("humidity").alias("avg_humidity"))
  .writeStream
  .format("delta")
  .saveAsTable("temp_avg"))
Choose the response that correctly fills in the blank within the code block to complete this task.
B. window("event_time", "5 minutes").alias("time") (Most Voted)
C. "event_time"
Correct Answer: B
Community vote distribution
B (100%)
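
Putting the correct option into the blank, a hedged end-to-end sketch (the output table, checkpoint path, and the use of toTable as the streaming table writer are illustrative choices; the aggregation itself follows the question):

# Sketch: five-minute tumbling-window averages per device, with option B filling the blank.
from pyspark.sql.functions import avg, window

query = (df.withWatermark("event_time", "10 minutes")
           .groupBy(
               window("event_time", "5 minutes").alias("time"),   # option B
               "device_id")
           .agg(
               avg("temp").alias("avg_temp"),
               avg("humidity").alias("avg_humidity"))
           .writeStream
           .format("delta")
           .option("checkpointLocation", "/tmp/checkpoints/temp_avg")  # illustrative path
           .toTable("temp_avg"))
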
A data architect has designed a system in which two Structured Streaming jobs will concurrently write
to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka
source, but they will write data with the same schema. To keep the directory structure simple, a data
engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:

/bronze
  /checkpoint
  /_delta_log
  /year_week=2020_05
  /year_week=2020_02
Which statement describes whether this checkpoint directory structure is valid for the given scenario
and why?
E. No; each of the streams needs to have its own checkpoint directory. (Most Voted)
Correct Answer: C
Question #21 Topic 1
A Structured Streaming job deployed to production has been experiencing delays during peak hours of
the day. At present, during normal execution, each microbatch of data is processed in less than 3
seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent,
sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of
10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10
seconds, which adjustment will meet the requirement?
A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle
executors to begin processing the next batch while longer running tasks from previous batches
finish.
B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum
execution time observed for each batch is always best practice to ensure no records are dropped.
C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain
the current stream state, increase the number of shuffle partitions to maximize parallelism.
D. Use the trigger once option and configure a Databricks job to execute the query every 10
seconds; this ensures all backlogged records are processed with each batch.
E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent
records from backing up and large batches from causing spill. (Most Voted)
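
For reference, a hedged sketch of where the trigger interval from option E is set (the source, sink, and checkpoint names are illustrative):

# Sketch: lowering the processing-time trigger from 10 seconds to 5 seconds.
stream_df = spark.readStream.table("events_bronze")   # hypothetical streaming source

(stream_df.writeStream
          .trigger(processingTime="5 seconds")        # previously "10 seconds"
          .option("checkpointLocation", "/tmp/checkpoints/events_silver")  # illustrative path
          .toTable("events_silver"))
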
A. An asynchronous job runs after the write completes to detect if files could be further
compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
B. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most
recent job.
C. Optimized writes use logical partitions instead of directory partitions; because partition
boundaries are only represented in metadata, fewer small files are written.
D. Data is queued in a messaging bus instead of committing data directly to memory; all data is
committed from the messaging bus in one batch once the job is complete.
E. An asynchronous job runs after the write completes to detect if files could be further
compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB. (Most Voted)
Which statement characterizes the general programming model used by Spark Structured Streaming?
A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data
throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency
for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows appended to an
unbounded table. (Most Voted)
E. Structured Streaming relies on a distributed network of nodes that hold incremental state values
for cached stages.
Correct Answer: D
Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into
Spark?
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min,
Median, and Max Durations for tasks in a particular stage show the minimum and median time to
complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long
as the minimum.
Which situation is causing increased duration of the overall job?
C. Network latency due to some cluster nodes being in different regions from the source data
Correct Answer: D
Question #26 Topic 1
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total
cores and only one Executor per VM.
Given a job with at least one wide transformation, which of the following cluster configurations will
result in maximum performance?
A. • Total VMs: 1
• 400 GB per Executor
• 160 Cores / Executor
B. • Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor (Most Voted)
C. • Total VMs: 16
• 25 GB per Executor
• 10 Cores / Executor
D. • Total VMs: 4
• 100 GB per Executor
• 40 Cores / Executor
E. • Total VMs: 2
• 200 GB per Executor
• 80 Cores / Executor
A junior data engineer on your team has implemented the following code block.

MERGE INTO events
USING new_events
ON events.event_id = new_events.event_id
WHEN NOT MATCHED
  THEN INSERT *

The view new_events contains a batch of records with the same schema as the events Delta table. The
event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an
existing record?
A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1
table representing all of the values that have ever been valid for all rows in a bronze table created with
the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily
job:

from pyspark.sql.functions import col
Which statement describes the execution and results of running the above query multiple times?
A. Each time the job is executed, newly updated records will be merged into the target table,
overwriting previous values with the same primary keys.
B. Each time the job is executed, the entire available history of inserted or updated records will be
appended to the target table, resulting in many duplicate entries. (Most Voted)
C. Each time the job is executed, the target table will be overwritten using the entire history of
inserted or updated records, giving the desired result.
D. Each time the job is executed, the differences between the original and current versions are
calculated; this may result in duplicate entries for some records.
E. Each time the job is executed, only those records that have been inserted or updated since the
last execution will be appended to the target table, giving the desired result.
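
For context, a hedged sketch of a batch Change Data Feed read of this kind (the table names and starting version are illustrative); re-reading from version 0 on every run and appending is what produces the duplicate entries described in the most-voted answer:

# Sketch: batch read of a Delta table's Change Data Feed (CDF).
from pyspark.sql.functions import col

changes = (spark.read
                .format("delta")
                .option("readChangeFeed", "true")
                .option("startingVersion", 0)      # re-reading from version 0 replays all history
                .table("bronze"))                  # hypothetical CDF-enabled bronze table

inserts_updates = changes.filter(col("_change_type").isin("insert", "update_postimage"))

(inserts_updates.write
                .mode("append")                    # appending replayed history creates duplicates
                .saveAsTable("target_table"))      # hypothetical target table
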
A new data engineer notices that a critical field was omitted from an application that writes its Kafka
source to Delta Lake. This happened even though the critical field was in the Kafka source. That field
was further missing from data written to dependent, long-term storage. The retention threshold on the
Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
A. The Delta log and Structured Streaming checkpoints record the full history of the Kafka
producer.
B. Delta Lake schema evolution can retroactively calculate the correct value for newly added fields,
as long as the data was in the original source.
C. Delta Lake automatically checks that all fields present in the source data are included in the
ingestion layer.
D. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible
under any circumstance.
E. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent,
replayable history of the data state. (Most Voted)
Correct Answer: E
Community vote distribution
E (91%) 9%
A nightly job ingests data into a Delta Lake table using the following code:

from pyspark.sql.functions import current_timestamp, input_file_name, col
from pyspark.sql.column import Column
    ...
    .write
    .mode("append")
    .saveAsTable("bronze"))
The next step in the pipeline requires a function that returns an object that can be used to manipulate
new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():
E.
return spark.read
    .table("bronze")
    .filter(col("source_file") == f"/mnt/daily_batch/{year}/{month}/{day}")
Question #31 Topic 1
A junior data engineer is working to implement logic for a Lakehouse table named
silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON
structure.
The silver_device_recordings table will be used downstream to power several production monitoring
dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of
these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given
the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may
impact their decision-making process?
A. The Tungsten encoding used by Databricks is optimized for storing string data; newly-added
native support for querying JSON strings means that string types are always most efficient.
B. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just
modifying file footer information in place.
C. Human labor in writing code is the largest cost associated with data engineering workloads; as
such, automating table declaration logic should be a priority in all migration workloads.
D. Because Databricks will infer schema using types that allow all observed data to be processed,
setting types manually provides greater assurance of data quality enforcement. (Most Voted)
E. Schema inference and evolution on Databricks ensure that inferred types will always accurately
match the data types used by downstream systems.
    ...
    .write
    .mode("overwrite")
    .table("enriched_itemized_orders_by_account"))

Assuming that this code produces logically correct results and the data in the source tables has been
de-duplicated and validated, which statement describes what will occur when this code is executed?
A. A batch job will update the enriched_itemized_orders_by_account table, replacing only those
rows that have different values than the current version of the table, using account_id as the
primary key.
D. An incremental job will detect if new rows have been written to any of the source tables; if new
rows are detected, all results will be recalculated and used to overwrite the
enriched_itemized_orders_by_account table.
Question #33 Topic 1
The data engineering team is migrating an enterprise system with thousands of tables and views into
the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold
tables. Bronze tables will almost exclusively be used by production data engineering workloads, while
silver tables will be used to support both data engineering and machine learning workloads. Gold
tables will largely serve business intelligence and reporting purposes. While personal identifying
information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for
all data at the silver and gold levels.
The organization is interested in reducing security concerns while maximizing the ability to collaborate
across diverse teams.
Which statement exemplifies best practices for implementing this system?
A. Isolating tables in separate databases based on data quality tiers allows for easy permissions
management through database ACLs and allows physical separation of default storage locations
for managed tables. (Most Voted)
B. Because databases on Databricks are merely a logical construct, choices around database
organization do not impact security or discoverability in the Lakehouse.
C. Storing all production tables in a single database provides a unified view of all data assets
available throughout the Lakehouse, simplifying discoverability by granting all users view privileges
on this database.
D. Working in the default Databricks database provides the greatest security when working with
managed tables, as these will be created in the DBFS root.
E. Because all tables must live in the same storage containers used for the database they're
created in, organizations should be prepared to create between dozens and thousands of
databases depending on their data isolation requirements.
Correct Answer: A
The data architect has mandated that all tables in the Lakehouse should be configured as external
Delta Lake tables.
Which approach will ensure that this requirement is met?
A. Whenever a database is being created, make sure that the LOCATION keyword is used
B. When configuring an external data warehouse for all table storage, leverage Databricks for all
ELT
C. Whenever a table is being created, make sure that the LOCATION keyword is used. (Most Voted)
D. When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE
statement.
E. When the workspace is being configured, make sure that external cloud object storage has been
mounted.
Correct Answer: C
Community vote distribution
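
A minimal sketch of the pattern in the correct option (the table name, columns, and storage path are illustrative):

# Sketch: supplying LOCATION at creation time makes the Delta table external (unmanaged).
spark.sql("""
    CREATE TABLE sales_external (        -- hypothetical table name and schema
        order_id BIGINT,
        amount DOUBLE,
        order_date DATE
    )
    USING DELTA
    LOCATION '/mnt/external/sales'       -- external storage path (illustrative)
""")
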
To reduce storage and compute costs, the data engineering team has been tasked with curating a
series of aggregate tables leveraged by business intelligence dashboards, customer-facing
applications, production machine learning models, and ad hoc analytical queries.
The data engineering team has been made aware of new requirements from a customer-facing
application, which is the only downstream workload they manage entirely. As a result, an aggregate
table used by numerous teams across the organization will need to have a number of fields renamed,
and additional fields will also be added.
Which of the solutions addresses the situation while minimally interrupting other teams in the
organization without increasing the number of tables that need to be managed?
A. Send all users notice that the schema for the table will be changing; include in the
communication the logic necessary to revert the new table schema to match historic queries.
B. Configure a new table with all the requisite fields and new names and use this as the source for
the customer-facing application; create a view that maintains the original data schema and table
name by aliasing select fields from the new table. (Most Voted)
C. Create a new table with the required schema and new fields and use Delta Lake's deep clone
functionality to sync up changes committed to one table to the corresponding table.
D. Replace the current table definition with a logical view defined with the query logic currently
writing the aggregate table; create a new table to power the customer-facing application.
E. Add a table comment warning all users that the table schema and field names will be changing
on a given date; overwrite the table in place to the specifications of the customer-facing
application.
Correct Answer: B
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time
TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter: longitude < 20 &
longitude > -20
Which statement describes how data will be filtered?
A. Statistics in the Delta log will be used to identify partitions that might include files in the filtered
range.
B. No file skipping will occur because the optimizer does not know the relationship between the
partition column and the longitude.
C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet
the filter criteria.
D. Statistics in the Delta log will be used to identify data files that might include records in the
filtered range. (Most Voted)
E. The Delta Engine will scan the parquet file footers to identify each row that meets the filter
criteria.
Correct Answer: D
A small company based in the United States has recently contracted a consulting firm in India to
implement several new data engineering pipelines to power artificial intelligence applications. All the
company's data is stored in regional cloud storage in the United States.
The workspace administrator at the company is uncertain about where the Databricks workspace used
by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement accurately
informs this decision?
A. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be
deployed in the region where the data is stored.
B. Databricks workspaces do not rely on any regional infrastructure; as such, the decision should
be made based upon what is most convenient for the workspace administrator.
C. Cross-region reads and writes can incur significant costs and latency; whenever possible,
compute should be deployed in the same region the data is stored. (Most Voted)
D. Databricks leverages user workstations as the driver during interactive development; as such,
users should always use a workspace deployed in a region they are physically near.
E. Databricks notebooks send all executable code from the user's browser to virtual machines over
the open internet; whenever possible, choosing a workspace region near the end users is the most
secure.
Correct Answer: C
Community vote distribution
C (86%) 14%
The downstream consumers of a Delta Lake table have been complaining about data quality issues
impacting performance in their applications. Specifically, they have complained that invalid latitude
and longitude values in the activity_details table have been breaking their ability to use other
geolocation processes.
A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:
ALTER TABLE activity_details
ADD CONSTRAINT valid_coordinates
CHECK (
  latitude >= -90 AND
  latitude <= 90 AND
  longitude >= -180 AND
  longitude <= 180);
A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and
longitude are provided, but the code fails when executed.
Which statement explains the cause of this failure?
A. Because another team uses this table to support a frequently running application, two-phase
locking is preventing the operation from committing.
B. The activity_details table already exists; CHECK constraints can only be added during initial
table creation.
C. The activity_details table already contains records that violate the constraints; all existing data
must pass CHECK constraints in order to add them to an existing table. (Most Voted)
D. The activity_details table already contains records; CHECK constraints can only be added prior
to inserting values into a table.
E. The current table schema does not contain the field valid_coordinates; schema evolution will
need to be enabled before altering the table to add a constraint.
Correct Answer: C
Community vote distribution
C (100%)
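
As a practical follow-up to the correct answer, a hedged sketch of locating the violating rows that must be fixed or removed before the constraint can be added (the bounds follow the question; the query itself is illustrative):

# Sketch: rows that violate the proposed CHECK constraint must be handled
# before ALTER TABLE ... ADD CONSTRAINT can succeed on existing data.
violations = spark.sql("""
    SELECT *
    FROM activity_details
    WHERE latitude < -90 OR latitude > 90
       OR longitude < -180 OR longitude > 180
""")
print(violations.count())   # once this is 0, adding the constraint should succeed
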
A. Because Parquet compresses data row by row, strings will only be compressed when a
character is repeated multiple times.
B. Delta Lake automatically collects statistics on the first 32 columns of each table which are
leveraged in data skipping based on query filters. (Most Voted)
C. Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at
all times.
D. Primary and foreign key constraints can be leveraged to ensure duplicate values are never
entered into a dimension table.
E. Z-order can only be applied to numeric values stored in Delta Lake tables.
The view updates represents an incremental batch of all newly ingested data to be inserted or updated
in the customers table.
The following logic is used to process these records.
MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates JOIN customers
  ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address
  THEN UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)
Which statement describes this implementation?
A. The customers table is implemented as a Type 3 table; old values are maintained as a new
column alongside the current value.
B. The customers table is implemented as a Type 2 table; old values are maintained but marked as
no longer current and new values are inserted. (Most Voted)
C. The customers table is implemented as a Type 0 table; all writes are append only with no
changes to existing values.
D. The customers table is implemented as a Type 1 table; old values are overwritten by new values
and no history is maintained.
E. The customers table is implemented as a Type 2 table; old values are overwritten and new
customers are appended.
Correct Answer: B
Question #41 Topic 1
The DevOps team has configured a production workload as a collection of notebooks scheduled to run
daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested
access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing
accidental changes to production code or data?
A. Can Manage
B. Can Edit
C. No permissions
A table named user_ltv is being used to create a view that will be used by data analysts on various
teams. Users in the workspace are configured into groups, which are used for setting up data access
using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
A. Three columns will be returned, but one column will be named "REDACTED" and contain only null
values.
B. Only the email and ltv columns will be returned; the email column will contain all null values.
C. The email and ltv columns will be returned with the values in user_ltv.
D. The email, age, and ltv columns will be returned with the values in user_ltv.
E. Only the email and ltv columns will be returned; the email column will contain the string
"REDACTED" in each row. (Most Voted)
Community vote distribution
E (100%)
The data governance team has instituted a requirement that all tables containing Personal Identifiable
Information (PII) must be clearly annotated. This includes adding column comments, table comments,
and setting the custom table property "contains_pii" = true.
The following SQL DDL statement is executed to create a new table:
CREATE TABLE dev.pii_test
  (id INT, name STRING COMMENT "PII")
COMMENT "Contains PII"
TBLPROPERTIES ('contains_pii' = True)
Which command allows manual confirmation that these three requirements have been met?
Correct Answer: A
Community vote distribution
A (100%)
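
For general context, a hedged sketch of commands that surface column comments, the table comment, and custom table properties for manual inspection; this illustrates the idea and is not necessarily the exact wording of option A:

# Sketch: manually inspecting comments and properties on the table from the question.
spark.sql("DESCRIBE EXTENDED dev.pii_test").show(truncate=False)   # column comments + table details
spark.sql("SHOW TBLPROPERTIES dev.pii_test").show(truncate=False)  # custom properties such as contains_pii
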
The data governance team is reviewing code used for deleting records for compliance with GDPR. They
note the following logic is used to delete records from the Delta Lake table named users.

DELETE FROM users
WHERE user_id IN
  (SELECT user_id FROM delete_requests)

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have
requested deletion, which statement describes whether successfully executing the above logic
guarantees that the records to be deleted are no longer accessible and why?
A. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully
and permanently purged these records.
B. No; the Delta cache may return records from previous versions of the table until the cluster is
restarted.
C. Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.
D. No; the Delta Lake DELETE command only provides ACID guarantees when combined with the
MERGE INTO command.
E. No; files containing deleted records may still be accessible with time travel until a VACUUM
command is used to remove invalidated data files. (Most Voted)
An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:

CREATE DATABASE finance_eda_db
LOCATION '/mnt/finance_eda_bucket';

GRANT USAGE ON DATABASE finance_eda_db TO finance;
GRANT CREATE ON DATABASE finance_eda_db TO finance;

After the database was successfully created and permissions configured, a member of the finance
team runs the following code:

CREATE TABLE finance_eda_db.tx_sales AS
SELECT *
FROM sales
WHERE state = "TX";

If all users on the finance team are members of the finance group, which statement describes how the
tx_sales table will be created?
A. A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.
C. A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.
Correct Answer: D
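As a sketch of the scenario above (names reused from the question): a table created with CTAS inside a database defined with a LOCATION is registered in the metastore while its data files land under the mounted storage path, and DESCRIBE EXTENDED can confirm where both live.

spark.sql("""
    CREATE TABLE finance_eda_db.tx_sales AS
    SELECT * FROM sales WHERE state = 'TX'
""")

# The Location row points under /mnt/finance_eda_bucket, while the table
# definition itself is held in the metastore.
spark.sql("DESCRIBE EXTENDED finance_eda_db.tx_sales").show(truncate=False)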
Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored here and which users have access to using these secrets.
Which statement describes a limitation of Databricks Secrets?
A. Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.
B. Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.
C. Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.
D. Iterating through a stored secret and printing each character will display secret contents in plain text. (Most Voted)
E. The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.
Correct Answer: D
Community vote distribution
D (67%), E (33%)
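A minimal sketch of the limitation in option D (the scope and key names are hypothetical): the notebook redacts an exact match of the secret value, but printing it character by character bypasses that redaction.

# Hypothetical secret scope/key, for illustration of the redaction limitation only.
password = dbutils.secrets.get(scope="jdbc", key="db_password")

print(password)          # output is redacted as [REDACTED]
for ch in password:      # printing one character at a time defeats the
    print(ch)            # exact-match redaction and reveals the value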
B. It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
C. It is retained for 60 days, during which you can export notebook run results to HTML (Most Voted)
E. It is retained for 90 days or until the run-id is re-used through custom run configuration
Correct Answer: C
Community vote distribution
C (majority)
A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these events?
A. Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.
B. Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.
C. Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events. (Most Voted)
D. Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.
E. Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.
Correct Answer: C
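A hedged sketch of the two API interactions (workspace URL, tokens, notebook path, and cluster ID are placeholders): the identity tied to the personal access token used for each call is what the audit log records for that event.

import requests

WORKSPACE = "https://<workspace-url>"                 # placeholder
USER_A_TOKEN = "<user-a-personal-access-token>"       # placeholder
USER_B_TOKEN = "<user-b-personal-access-token>"       # placeholder

# User A creates the job, so the job-creation audit event carries User A's identity.
create_resp = requests.post(
    f"{WORKSPACE}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {USER_A_TOKEN}"},
    json={"name": "ingest pipeline",
          "tasks": [{"task_key": "main",
                     "notebook_task": {"notebook_path": "/Repos/example/ingest"},
                     "existing_cluster_id": "<cluster-id>"}]},
)
job_id = create_resp.json()["job_id"]

# User B triggers the runs, so the run events carry User B's identity.
requests.post(
    f"{WORKSPACE}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {USER_B_TOKEN}"},
    json={"job_id": job_id},
)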
A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
A. Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.
B. The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution. (Most Voted)
C. Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
D. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.
E. The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
Correct Answer: B
Community vote distribution
B (majority), D
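A short sketch of the point raised in option D (the source path and column names are illustrative): transformations only extend the logical plan, an action such as count() or display() triggers the job, and repeated interactive runs can hit cached data and report unrepresentative times.

from pyspark.sql import functions as F

# No Spark job runs while these transformations are declared.
df = (spark.read.format("parquet").load("/mnt/example/events")
        .filter(F.col("status") == "active")
        .withColumn("day", F.to_date("event_time")))

# Only an action triggers execution; re-running it may be served from cache and
# therefore understate the cost of a cold, production-sized run.
df.count()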
A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executors.
When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?
Correct Answer: E
Community vote distribution
E (46%), D, A
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
B. In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column
C. In the Storage Detail screen, by noting which RDDs are not stored on disk
Correct Answer: E
Community vote distribution
E (83%), B (17%)
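Beyond the Spark UI, a hedged way to check whether a predicate was pushed down is to inspect the physical plan with explain(); the path and column below are illustrative.

from pyspark.sql import functions as F

df = spark.read.format("parquet").load("/mnt/example/sensor_readings")

# In the formatted plan, a pushed-down predicate appears as PushedFilters on the
# FileScan node; a predicate that cannot be pushed shows up only as a separate
# Filter operator applied after the full scan.
df.filter(F.col("device_id") == 42).explain("formatted")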
[Code screenshot and error trace not legibly captured: a PySpark cell that filters on the heartrate column and writes with saveAsTable fails with a Py4J error trace.]
A. The code executed was PySpark but was executed in a Scala notebook.
E. There is a syntax error because the heartrate column is not correctly identified as a column.
Which distribution does Databricks support for installing custom Python code packages?
A. sbt
B. CRAN
C. npm
D. Wheels (Most Voted)
E. jars
Correct Answer: D
Community vote distribution
D (100%)
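A minimal sketch of installing a wheel in a notebook (the wheel path is a placeholder):

# Notebook cell: install a custom Python wheel for the current session.
%pip install /dbfs/FileStore/packages/my_package-0.1.0-py3-none-any.whl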
Which Python variable contains a list of directories to be searched when trying to locate required modules?
A. importlib.resource_path
B. sys.path (Most Voted)
C. os.path
D. pypi.path
E. pylib.source
Correct Answer: B
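A short illustration of sys.path (the appended directory is a hypothetical example):

import sys

# sys.path is the list of directories Python searches when resolving imports.
print(sys.path)

# Appending a directory makes modules stored there importable in this session.
sys.path.append("/dbfs/custom_modules")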
Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.
Which statement describes a main benefit that offsets this additional effort?
C. Troubleshooting is easier since all steps are isolated and tested individually (Most Voted)
D. Yields faster deployment and execution times
E. Ensures that all steps interact correctly to achieve the desired end result
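A minimal sketch of the benefit described in option C (function, column, and test names are illustrative): small, pure transformations can be tested in isolation, so a failure points directly at the step that broke.

from pyspark.sql import DataFrame, SparkSession, functions as F

def add_day_column(df: DataFrame) -> DataFrame:
    # Pure transformation: derive a calendar date from the event timestamp.
    return df.withColumn("day", F.to_date("event_time"))

def test_add_day_column():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    source = (spark.createDataFrame([("2024-01-01 10:00:00",)], ["event_time"])
                   .withColumn("event_time", F.to_timestamp("event_time")))
    result = add_day_column(source)
    assert result.select("day").first()["day"].isoformat() == "2024-01-01"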
Which of the following could be used as sources in a stream processing pipeline?
Select two responses.
A. Desired latency
B. Total cost of operation (TCO)
C. Maximum throughput
D. Cloud object storage
A. .load()
B. .print()
C. .return()
D. .merge()
A. continuous and bounded
B. continuous and unbounded
C. micro-batch and unbounded
D. micro-batch and bounded
A. Use streaming live tables for raw data and streaming tables for bronze, silver,
and gold quality data.
B. Use streaming tables for bronze quality data and streaming live tables
for silver and gold quality data.
C. Use streaming live tables for bronze quality data and streaming tables for
silver and gold quality data.
D. Use streaming tables for raw data and streaming live tables for bronze, silver,
and gold quality data.
A. Type 0
B. Type 1
C. Type 2
D. Type 1 or Type 2
A. Stream-stream join
B. Stream-static join
C. Stateful aggregation
D. Drop duplicates
A. Transactional guarantees and Delta Lake ensure that the newest version of a
dimension table will be referenced each time a query is processed for
incremental workloads.
B. Joined data cannot go unmatched because of Delta Lake’s foreign key
constraint.
C. Dimension tables contain a granular record of activities, while fact tables
contain data that is updated or modified over time.
D. Modern guidelines suggest denormalizing dimension and fact tables.
A. {"invalid_record": f"NOT({' AND '.join(rules.values())})"}
B. {"invalid_record": f"&&({' ! '.join(rules.values())})"}
C. {"invalid_record": f"NOT({' OR '.join(rules.values())})"}
D. {"invalid_record": f"IF({' NULL '.join(rules.values())})"}
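The pattern in option A above (collapsing a dictionary of named rules into a single negated quarantine expectation) can be sketched as follows; the rule definitions and table names are illustrative assumptions, and the dlt module is only available inside a Delta Live Tables pipeline.

import dlt

# Hypothetical named data-quality rules.
rules = {
    "valid_id": "id IS NOT NULL",
    "valid_timestamp": "event_time > '2020-01-01'",
}

# A record is invalid when it violates any rule: NOT(rule1 AND rule2 AND ...).
quarantine_rules = {"invalid_record": f"NOT({' AND '.join(rules.values())})"}

@dlt.table
@dlt.expect_all(quarantine_rules)
def events_quarantine_checked():
    return spark.read.table("events_raw")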
A. Tokenization
B. Pseudonymization
C. Anonymization
D. Binning
A. HIPAA
B. PCI-DSS
C. GDPR
D. CCPA
A. Hashing
B. Truncating IP addresses
C. Data suppression
D. Binning
A. Tokenization
B. Categorical generalization
C. Binning
D. Hashing
A. Version
B. Date modified
C. Timestamp
D. Size
B
A
B
D
B
C
C
A,C
Databricks Certified Data Engineer Professional Exam
Question 1 ( Exam A )
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
A. date = spark.conf.get("date")
B. input_dict = input()
date = input_dict["date"]
C. import sys
date = sys.argv[1]
D. date = dbutils.notebooks.getParam("date")
E. dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
Answer : E
Next Question
Question 2 ( Exam A )
The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?
Answer: D
Next Question
Question 3 ( Exam A )
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps
costs low?
Maximum Concurrent Runs: 1
C. Cluster: Existing All-Purpose Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1
D. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1
E. Cluster: Existing All-Purpose Cluster;
Retries: None;
Maximum Concurrent Runs: 1
Answer: D
Next Question
Question 4 ( Exam A )
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:
SELECT MEAN(temperature), MAX(temperature), MIN(temperature)
FROM recent_sensor_recordings
GROUP BY sensor_id
The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean (temperature) > 120. Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
A. The total average temperature across all sensors exceeded 120 on three consecutive executions of the query
B. The recent_sensor_recordings table was unresponsive for three consecutive runs of the query
C. The source query failed to update properly for three consecutive minutes and then restarted
D. The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query
E. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
Next Question
Question 5 ( Exam A )
A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?
A. Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
D. Merge all changes back to the main branch in the remote Git repository and clone the repo again
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository
Answer: B
Next Question
The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).
Which statement describes what will happen when the above code is executed?
A. The connection to the external table will fail; the string "REDACTED" will be printed.
B. An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.
C. An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.
D. The connection to the external table will succeed; the string value of password will be printed in plain text.
E. The connection to the external table will succeed; the string "REDACTED" will be printed.
Answer : E
Next Question
Question 7 ( Exam A )
The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".
[Code screenshot not legibly captured.]
The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?
A. preds.write.mode("append").saveAsTable("churn_preds")
B. preds.write.format("delta").save("/preds/churn_preds")
[Options C, D, and E (code snippets) were not legibly captured in the scan.]
Answer : A
Next Question
Question 8 ( Exam A )
An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
(spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")
    .dropDuplicates(["customer_id", "order_id"])
    .write
    .mode("append")
    .saveAsTable("orders")
)
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
A. Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
C. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
D. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.
E. Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
Answer: B
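As a hedged sketch (path and names follow the reconstructed code above), a merge-based variant, not shown in the question, would also prevent duplicates against records already in the target table, which dropDuplicates alone cannot do:

deduped = (spark.read.format("parquet")
           .load(f"/mnt/raw_orders/{date}")
           .dropDuplicates(["customer_id", "order_id"]))
deduped.createOrReplaceTempView("new_orders")

spark.sql("""
    MERGE INTO orders t
    USING new_orders s
    ON t.customer_id = s.customer_id AND t.order_id = s.order_id
    WHEN NOT MATCHED THEN INSERT *
""")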
Next Question
Question 9 ( Exam A )
A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.
Cmd 1
%python
countries_af = [x[0] for x in spark.table("geo_lookup").filter("continent = 'AF'").select("country").collect()]
Cmd 2
%sql
CREATE VIEW sales_af AS
SELECT *
FROM sales
WHERE city IN countries_af
AND CONTINENT = "AF"
Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?
A. Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.
B. Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.
C. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.
D. Both commands will fail. No new variables, tables, or views will be created.
E. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
Answer: E
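A hedged sketch of one way to bridge the two languages (names follow the question; the temp-view approach is illustrative, not the question's answer): registering the Python result as a temporary view makes it addressable from SQL.

from pyspark.sql import functions as F

af_countries = (spark.table("geo_lookup")
                  .filter(F.col("continent") == "AF")
                  .select("country"))
af_countries.createOrReplaceTempView("countries_af_vw")

spark.sql("""
    CREATE OR REPLACE TEMP VIEW sales_af AS
    SELECT * FROM sales WHERE city IN (SELECT country FROM countries_af_vw)
""")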
Next Question
Question 10 ( Exam A )
A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column
E. The Hive metastore is scanned for min and max statistics for the latitude column
Answer : D
Question 11 ( Exam A )
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
A. Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
B. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
C. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
D. Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
Answer : E
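A brief sketch of the retention behavior behind the answer (the table name is a placeholder): with default settings VACUUM only removes files older than 7 days, and shortening that window requires explicitly disabling the safety check.

# Default retention (7 days) applies, so files deleted in the last week survive.
spark.sql("VACUUM user_data")

# Shorter windows require disabling the retention-duration safety check first.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM user_data RETAIN 24 HOURS")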
Next Question
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
A. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
B. The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
C. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
D. One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
E. The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.
Answer : C
Next Question
Question 13 ( Exam A )
An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?
A. Create a separate history table for each pk_id; resolve the current state of the table by running a union all, filtering the history tables for the most recent state.
B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
D. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the lakehouse.
E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
Answer: E
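A hedged sketch of the bronze-to-silver pattern in the answer; pk_id follows the question, while the bronze/silver table names and the change_time and change_type columns are assumptions about how the CDC payload is stored (the silver table is assumed to carry the same columns as the payload).

from pyspark.sql import Window, functions as F

# Keep only the latest change per key from the hourly bronze batch.
latest = (spark.table("bronze_cdc_logs")
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("pk_id").orderBy(F.col("change_time").desc())))
          .filter("rn = 1")
          .drop("rn"))
latest.createOrReplaceTempView("latest_changes")

spark.sql("""
    MERGE INTO silver_current t
    USING latest_changes s
    ON t.pk_id = s.pk_id
    WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
""")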
Next Question
Question 14 ( Exam A )
An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?
A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.
B. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
D. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
Answer : C
Next Question
Question 15 ( Exam A )
A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?
A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
C. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
D. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
E. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
Answer: E
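A hedged sketch of the change-data-feed half of the answer (the table name follows the question; the starting timestamp is a placeholder for "the last 24 hours"):

# Enable the change data feed on the source table (one-time setting).
spark.sql("""
    ALTER TABLE customer_churn_params
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only rows inserted or updated since the chosen point in time.
changed = (spark.read
           .option("readChangeFeed", "true")
           .option("startingTimestamp", "2024-01-01 00:00:00")   # placeholder
           .table("customer_churn_params")
           .filter("_change_type IN ('insert', 'update_postimage')"))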
Next Question
Question 16 ( Exam A )
Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?
A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
Answer: B
Next Question
Question 17 ( Exam A )
A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?
A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
B. Z-order indices calculated on the table are preventing file compaction
C. Bloom filter indices calculated on the table are preventing file compaction
D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Answer : A
Next Question
Question 18 ( Exam A )
Which statement regarding stream-static joins and static Delta tables is correct?
A. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.
B. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
C. The checkpoint directory will be used to track state information for the unique keys present in the join.
D. Stream-static joins cannot use static Delta tables because of consistency issues.
E. The checkpoint directory will be used to track updates to the static Delta table.
Answer : A
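A minimal sketch of a stream-static join (table names and checkpoint path are illustrative): the static Delta table on the right-hand side is re-read for each microbatch, so every batch sees its latest committed version.

orders_stream = spark.readStream.table("orders_bronze")   # streaming side
customers = spark.read.table("customers_dim")             # static Delta table

enriched = orders_stream.join(customers, on="customer_id", how="left")

(enriched.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_enriched")
    .toTable("orders_enriched"))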
Next Question
Question 19 ( Exam A )
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
(df.withWatermark("event_time", "10 minutes")
    .groupBy(
        _____________,
        "device_id")
    .agg(avg("temp").alias("avg_temp"), avg("humidity").alias("avg_humidity"))
    .writeStream
    .format("delta")
    .saveAsTable("sensor_avg"))
Choose the response that correctly fills in the blank within the code block to complete this task.
A. to_interval("event_time", "5 minutes").alias("time")
B. window("event_time", "5 minutes").alias("time")
C. "event_time"
D. window("event_time", "10 minutes").alias("time")
E. lag("event_time", "10 minutes").alias("time")
Answer: B
Next Question
Question 20 ( Exam A )
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:
/bronze
  checkpoint
  _delta_log
  year_week=2020_01
  year_week=2020_02
Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?
Answer : E
Next Question
Question 21 ( Exam A )
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Answer: E
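A short sketch of the adjustment in the answer (sink table and checkpoint path are placeholders): lowering the processing-time trigger to 5 seconds keeps batches small so records stop backing up during peak load.

# df: the job's streaming DataFrame (assumed to already be defined).
(df.writeStream
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", "/mnt/checkpoints/peak_stream")
    .toTable("peak_stream_sink"))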
Next Question
Question 22 ( Exam A )
A. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
B. Before a jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
C. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
D. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB.
Answer: E
Next Question
Question 23 ( Exam A )
Which statement characterizes the general programming model used by Spark Structured Streaming?
A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
Answer: D
Next Question
Question 24 ( Exam A)
Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes
Answer: A
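A brief illustration of the setting in the answer (the value and path are illustrative; the default is 128 MB):

# Cap the bytes packed into each partition when reading files.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = spark.read.format("parquet").load("/mnt/example/large_dataset")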
Next Question
Question 25 ( Exam A )
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
Answer : D
Next Question
A. • Total VMs: 1
• 400 GB per Executor
• 160 Cores / Executor
B. • Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor
C. • Total VMs: 16
• 25 GB per Executor
• 10 Cores / Executor
D. • Total VMs: 4
• 100 GB per Executor
• 40 Cores / Executor
E. • Total VMs: 2
• 200 GB per Executor
• 80 Cores / Executor
Answer: A
Question 27 ( Exam A )
A junior data engineer on your team has implemented the following code block.
MERGE INTO events
USING new_events
ON events.event_id = new_events.event_id
WHEN NOT MATCHED
  INSERT *
The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an existing record?
B. They are ignored.
C. They are updated.
D. They are inserted.
E. They are deleted.
Answer : B
Next Question
Question 28 ( Exam A )
A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:
from pyspark.sql.functions import col
A. Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
B. Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
C. Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.
D. Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
E. Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
Answer : B
Next Question
Question 29 ( Exam A )
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
A. The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
B. Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.
C. Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
D. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.
E. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
Answer : E
Next Question
Question 30 ( Exam A )
A nightly job ingests data into a Delta Lake table using the following code:
from pyspark.sql.functions import current_timestamp, input_file_name, col
from pyspark.sql.column import Column

def ingest_daily_batch(time_col: Column, year: int, month: int, day: int):
    (spark.read
        .format("parquet")
        .load(f"/mnt/daily_batch/{year}/{month}/{day}")
        .select("*",
            time_col.alias("ingest_time"),
            input_file_name().alias("source_file"))
        .write
        .mode("append")
        .saveAsTable("bronze")
    )
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():
A. return spark.readStream.table("bronze")
B. return spark.readStream.load("bronze")
C. return (spark.read
        .table("bronze")
        .filter(col("ingest_time") > current_timestamp()))
D. return spark.read.option("readChangeFeed", "true").table("bronze")
E. return (spark.read
        .table("bronze")
        .filter(col("source_file") == f"/mnt/daily_batch/{year}/{month}/{day}"))
Answer : A
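A hedged sketch of how the function in answer A is typically consumed (checkpoint path, trigger choice, and target table are assumptions): the streaming read plus its checkpoint is what limits processing to records not yet seen.

def new_records():
    return spark.readStream.table("bronze")

(new_records()
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")
    .trigger(availableNow=True)      # incremental batch trigger on newer runtimes
    .toTable("silver"))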
Next Question
Question 1
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.
A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.
A. The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.
B. Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
C. The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.
D. Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.
In Delta Lake, a shallow clone creates a new table by copying the metadata of the source table without duplicating the data files. When the vacuum command is run on the source table, it removes old data files that are no longer needed to maintain the transaction log's integrity, potentially including files referenced by the shallow clone's metadata. If these files are purged, the shallow-cloned tables will reference non-existent data files, causing them to stop working properly. This highlights the dependency of shallow clones on the source table's data files and the impact of data management operations like vacuum on these clones. Reference: Databricks documentation on Delta Lake, particularly the sections on cloning tables (shallow and deep cloning) and data retention with the vacuum command (https://docs.databricks.com/delta/index.html).
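A minimal sketch of the dependency described above (table names are illustrative):

# The shallow clone copies only metadata and keeps pointing at the source's files.
spark.sql("CREATE TABLE dev.users_clone SHALLOW CLONE prod.users")

# Vacuuming the source can later remove data files the clone still references,
# which is what breaks the cloned tables in this scenario.
spark.sql("VACUUM prod.users")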
Next Question
Question 2
A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy.
The user attempts and fails to accomplish this by adding an expectation to the report table definition.
Which approach would allow using DLT expectations to validate all expected records are present in this table?
A. Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table.
B. Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table.
C. Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null.
D. Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table.
To validate that all records from the source are included in the derived table, creating a view that performs a left outer join between the validation_copy table and the report table is effective. The view can highlight any discrepancies, such as null values in the report table's key columns, indicating missing records. This view can then be referenced in DLT (Delta Live Tables) expectations for the report table to ensure data integrity. This approach allows for a comprehensive comparison between the source and the derived table.
Question 3
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to using these credentials?
A. "Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
B. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
C. "Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
D. "Manage" permission should be set on a secret scope containing only those credentials that will be used by a given team.
In Databricks, using the Secrets module allows for secure management of sensitive information such as database credentials. Granting "Read" permissions on a secret key that maps to database credentials for a specific team ensures that only members of that team can access these credentials. This approach aligns with the principle of least privilege, granting users the minimum level of access required to perform their jobs, thus enhancing security.
Next Question
Question 4
A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.
One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.
A. Maintain data quality rules in a Delta table outside of this pipeline's target schema, providing the schema name as a pipeline parameter.
B. Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.
C. Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.
D. Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file.
Maintaining data quality rules in a centralized Delta table allows for the reuse of these rules across multiple DLT (Delta Live Tables) pipelines. By storing these rules outside the pipeline's target schema and referencing the schema name as a pipeline parameter, the team can apply the same set of data quality checks to different tables within the pipeline. This approach ensures consistency in data quality validations and reduces redundancy in code by not having to replicate the same rules in each DLT notebook or file.
Databricks Documentation on Delta Live Tables: Delta Live Tables Guide
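A hedged sketch of option A's pattern (rule table, tag values, and the pipeline parameter name are assumptions; dlt is only available inside a Delta Live Tables pipeline):

import dlt

# Schema holding the shared rules table, passed in as a pipeline parameter.
rules_schema = spark.conf.get("rules_schema")

def get_rules(tag):
    # Assumed rules table columns: name, constraint, tag.
    rows = (spark.read.table(f"{rules_schema}.data_quality_rules")
                 .filter(f"tag = '{tag}'")
                 .collect())
    return {row["name"]: row["constraint"] for row in rows}

@dlt.table
@dlt.expect_all_or_drop(get_rules("orders"))
def orders_silver():
    return dlt.read_stream("orders_bronze")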
Next Question
Question 5
A Delta Lake table representing metadata about content from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta table?
A. date
B. post_id
C. user_id
D. post_time
Partitioning a Delta Lake table improves query performance by organizing data into partitions based on the values of a column. In the given schema, the date column is a good candidate for partitioning for several reasons:
Time-based queries: If queries frequently filter or group by date, partitioning by the date column can significantly improve performance by limiting the amount of data scanned.
Granularity: The date column likely has a granularity that leads to a reasonable number of partitions (not too many and not too few). This balance is important for optimizing both read and write performance.
Data skew: Other columns like post_id or user_id might lead to uneven partition sizes (data skew), which can negatively impact performance.
Partitioning by post_time could also be considered, but typically date is preferred due to its more manageable granularity.
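A short sketch of the partitioning choice discussed above (the table name is illustrative; BIGINT stands in for the schema's LONG):

spark.sql("""
    CREATE TABLE user_posts (
        user_id BIGINT, post_text STRING, post_id STRING,
        longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
    )
    USING DELTA
    PARTITIONED BY (date)
""")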
Next Question
Question 6
In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.
Which response correctly fills in the blank to meet the specified requirements?
A. .writeStream
    .option("mergeSchema", True)
    .start(target_table_path)
B. .writeStream
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", True)
    .trigger(once=True)
    .start(target_table_path)
C. .write
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", True)
    .outputMode("append")
    .save(target_table_path)
D. .option("mergeSchema", True)
    .save(target_table_path)
E. .writeStream
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", True)
    .start(target_table_path)
A. Option A
B. Option B
C. Option C
D. Option D
E. Option E
Answer: B
Option B correctly fills in the blank to meet the specified requirements. Option B uses the "cloudFiles.schemaLocation" option, which is required for the schema detection and evolution functionality of Databricks Auto Loader. Additionally, option B uses the "mergeSchema" option, which is required for the schema evolution functionality of Databricks Auto Loader. Finally, option B uses the "writeStream" method, which is required for the incremental processing of JSON files as they arrive in a source directory. The other options are incorrect because they either omit the required options, use the wrong method, or use the wrong format. Reference: Databricks documentation on Auto Loader and Structured Streaming writes.
Next Question
Question 7
Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?
A. The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
B. A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
C. The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
D. An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
E. An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.
This code is using the pyspark.sql.functions library to group the silver_customer_sales table by customer_id and then aggregate the data using the minimum sale date, maximum sale total, and sum of distinct order ids. The resulting aggregated data is then written to the gold_customer_lifetime_sales_summary table, overwriting any existing data in that table. This is a batch job that does not use any incremental or streaming logic, and does not perform any merge or update operations. Therefore, the code will overwrite the gold table with the aggregated values from the silver table every time it is executed. Reference:
https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html
https://docs.databricks.com/spark/latest/dataframes-datasets/transforming-data-with-dataframes.html
https://docs.databricks.com/spark/latest/dataframes-datasets/aggregating-data-with-dataframes.html
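A hedged sketch of the batch overwrite the explanation describes (aggregation column names such as sale_date, sale_total, and order_id are assumptions):

from pyspark.sql import functions as F

(spark.read.table("silver_customer_sales")
    .groupBy("customer_id")
    .agg(F.min("sale_date").alias("first_sale_date"),
         F.max("sale_total").alias("max_sale_total"),
         F.countDistinct("order_id").alias("distinct_orders"))
    .write
    .mode("overwrite")
    .saveAsTable("gold_customer_lifetime_sales_summary"))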
Next Question