Delta from a Data Engineer's Perspective

Delta by example
Palla Lentz, Assoc. Resident Solutions Architect
Jake Therianos, Customer Success Engineer

A Data Engineer’s Dream...
[Diagram: CSV, JSON, TXT… and Kinesis sources, Data Lake, AI & Reporting]
Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and
streaming

The Data Engineer’s Journey...
[Diagram, built up over successive slides: Events, Stream, Table (data gets written continuously), Batch, Table (data gets compacted every hour), Unified View, Validation, Reprocessing, Update & Merge, AI & Reporting]
1. Spark job gets exponentially slower with time due to small files.
2. Late-arriving data means processing needs to be delayed.
3. A few hours of latency doesn’t satisfy business needs.
4. Lambda architecture increases operational burden.
5. Validations and other cleanup actions need to be done twice.
6. Fixing mistakes means blowing up partitions and doing an atomic re-publish.
7. Updates & merges get complex with a data lake.
Can this be simplified?

What was missing?
1. Ability to read consistent data while data is being written
2. Ability to read incrementally from a large table with good throughput
3. Ability to roll back in case of bad writes
4. Ability to replay historical data alongside new data that arrived
5. Ability to handle late-arriving data without having to delay downstream processing
[Diagram: CSV, JSON, TXT… and Kinesis sources, Data Lake, ?, AI & Reporting]

So… What is the answer?
Structured Streaming + Delta Lake = The Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs

Delta On Disk
my_table/
  _delta_log/            Transaction Log
    00000.json           Table Versions
    00001.json
  date=2019-01-01/       (Optional) Partition Directories
    file-1.parquet       Data Files
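To make the layout concrete, here is a minimal sketch that recreates the example directory tree in a temporary folder and then separates table versions from data files. The helper names (`build_example_table`, `table_versions`, `data_files`) are hypothetical and exist only for this illustration; the files are empty placeholders, not real JSON or Parquet.

```python
import os
import tempfile


def build_example_table(root):
    """Create the example layout from the slide with empty placeholder files."""
    table = os.path.join(root, "my_table")
    os.makedirs(os.path.join(table, "_delta_log"))
    os.makedirs(os.path.join(table, "date=2019-01-01"))
    for rel in (
        "_delta_log/00000.json",
        "_delta_log/00001.json",
        "date=2019-01-01/file-1.parquet",
    ):
        open(os.path.join(table, *rel.split("/")), "w").close()
    return table


def table_versions(table):
    """Each JSON file in the transaction log corresponds to one table version."""
    log_dir = os.path.join(table, "_delta_log")
    return sorted(
        int(name[: -len(".json")])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    )


def data_files(table):
    """Data files live outside _delta_log, optionally under partition directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(table):
        # Skip the transaction log directory while walking.
        dirnames[:] = [d for d in dirnames if d != "_delta_log"]
        for name in filenames:
            if name.endswith(".parquet"):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, table))
    return sorted(found)


table = build_example_table(tempfile.mkdtemp())
print(table_versions(table))
print(data_files(table))
```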

Table = result of a set of actions
Change Metadata – name, schema, partitioning, etc
Add File – adds a file (with optional statistics)
Remove File – removes a file
Result: Current Metadata, List of Files, List of Txns, Version
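The idea that a table is the result of a set of actions can be sketched as a simple fold over the log. The action dictionaries below are a made-up shape chosen for readability, not the actual on-disk format, and the sketch tracks only metadata, files, and version (the list of transactions is left out).

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    metadata: dict = field(default_factory=dict)
    files: set = field(default_factory=set)
    version: int = -1


def replay(commits):
    """Fold an ordered list of commits (each a list of actions) into a TableState."""
    state = TableState()
    for version, actions in enumerate(commits):
        for action in actions:
            kind = action["action"]  # hypothetical field name
            if kind == "metadata":
                state.metadata = action["value"]
            elif kind == "add":
                state.files.add(action["path"])
            elif kind == "remove":
                state.files.discard(action["path"])
        state.version = version
    return state


commits = [
    [
        {"action": "metadata", "value": {"schema": "id INT"}},
        {"action": "add", "path": "1.parquet"},
        {"action": "add", "path": "2.parquet"},
    ],
    [
        {"action": "remove", "path": "1.parquet"},
        {"action": "remove", "path": "2.parquet"},
        {"action": "add", "path": "3.parquet"},
    ],
]
state = replay(commits)
print(state.version, sorted(state.files))  # 1 ['3.parquet']
```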

Implementing Atomicity
Changes to the table are stored as ordered, atomic units called commits
000000.json
  Add 1.parquet
  Add 2.parquet
000001.json
  Remove 1.parquet
  Remove 2.parquet
  Add 3.parquet
…
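One way to picture ordered, atomic commits is exclusive file creation: a writer claims version N by creating the N-th log file, and the attempt fails if that file already exists. This is a local-filesystem sketch under that assumption only; the `commit` helper and the action shape are hypothetical, and it does not protect readers from seeing a partially written file.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write one commit file; fail if that version has already been claimed."""
    path = os.path.join(log_dir, f"{version:06d}.json")
    # Mode "x" creates the file exclusively and raises FileExistsError otherwise.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return path


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "1.parquet"}, {"add": "2.parquet"}])
commit(
    log_dir,
    1,
    [{"remove": "1.parquet"}, {"remove": "2.parquet"}, {"add": "3.parquet"}],
)

# A second writer that also tries to claim version 1 loses the race.
try:
    commit(log_dir, 1, [{"add": "4.parquet"}])
    second_writer_won = True
except FileExistsError:
    second_writer_won = False

print(sorted(os.listdir(log_dir)), second_writer_won)
```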

Solving Conflicts Optimistically
1. Record start version
2. Record reads/writes
3. Attempt commit
4. If someone else wins, check if anything you read has changed.
5. Try again.
[Diagram: commits 000000.json, 000001.json, 000002.json; User 1 and User 2 each do Read: Schema and Write: Append]
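The five steps above can be sketched with an in-memory log. Everything here is an illustrative assumption: the `Log` class, the `commit_with_retry` function, and the representation of reads and writes as sets of labels such as "schema" and "append". The conflict rule used is the simple one from the slide: retry unless a winning commit wrote something this writer had read.

```python
class Log:
    """In-memory stand-in for the transaction log (hypothetical)."""

    def __init__(self):
        self.commits = []  # list index == version number

    def latest_version(self):
        return len(self.commits) - 1

    def try_commit(self, version, commit):
        # Succeeds only if nobody has taken this version yet.
        if version != len(self.commits):
            return False
        self.commits.append(commit)
        return True


def commit_with_retry(log, start_version, reads, writes, max_attempts=10):
    """Attempt a commit, re-checking any commits that landed in the meantime."""
    seen = start_version  # step 1: record start version
    for _ in range(max_attempts):
        # Step 4: if someone else won, check whether anything we read changed.
        for winner in log.commits[seen + 1:]:
            if winner["writes"] & reads:
                raise RuntimeError("conflict: something we read has changed")
        seen = log.latest_version()
        # Steps 2 and 3: record reads/writes and attempt the commit.
        if log.try_commit(seen + 1, {"reads": reads, "writes": writes}):
            return seen + 1
        # Step 5: try again.
    raise RuntimeError("too many retries")


log = Log()
log.try_commit(0, {"reads": set(), "writes": {"schema"}})  # 000000.json
start = log.latest_version()  # both users start from version 0

v1 = commit_with_retry(log, start, reads={"schema"}, writes={"append"})
v2 = commit_with_retry(log, start, reads={"schema"}, writes={"append"})
print(v1, v2)  # 1 2
```

Both appends succeed because neither changed the schema the other had read. If an intervening commit had written "schema", the same call would raise instead of retrying.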

Handling Massive Metadata
Large tables can have millions of files in them! How do we scale
the metadata? Use Spark for scaling!
[Diagram: Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet, collapsed into a Checkpoint]
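The checkpoint idea can be illustrated in a few lines: collapse a prefix of the log into a single snapshot, then replay only the commits that came after it. Plain Python stands in for Spark here, and the function names and the tuple-based action format are assumptions made for this sketch.

```python
def apply(files, actions):
    """Apply add/remove actions to a set of live files and return a new set."""
    files = set(files)
    for op, path in actions:
        if op == "add":
            files.add(path)
        elif op == "remove":
            files.discard(path)
    return files


def make_checkpoint(commits, upto):
    """Collapse commits 0..upto into one snapshot of live files."""
    files = set()
    for actions in commits[: upto + 1]:
        files = apply(files, actions)
    return {"version": upto, "files": files}


def load(commits, checkpoint=None):
    """Rebuild the live file set, starting from a checkpoint if one is given."""
    files, start = set(), 0
    if checkpoint is not None:
        files, start = checkpoint["files"], checkpoint["version"] + 1
    for actions in commits[start:]:
        files = apply(files, actions)
    return files


commits = [
    [("add", "1.parquet"), ("add", "2.parquet")],
    [("remove", "1.parquet"), ("remove", "2.parquet"), ("add", "3.parquet")],
    [("add", "4.parquet")],
]
checkpoint = make_checkpoint(commits, upto=1)
print(sorted(checkpoint["files"]))  # ['3.parquet']
print(load(commits, checkpoint) == load(commits))  # True
```

With a checkpoint at version 1, a reader only needs the checkpoint plus one later commit instead of all three.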

Connecting the dots...
[Diagram: CSV, JSON, TXT… and Kinesis sources, Data Lake, ?, AI & Reporting]
1. Ability to read consistent data while data is being written
   Snapshot isolation between writers and readers
2. Ability to read incrementally from a large table with good throughput
   Optimized file source with scalable metadata handling
3. Ability to roll back in case of bad writes
   Time travel
4. Ability to replay historical data alongside new data that arrived
   Stream the backfilled historical data through the same pipeline
5. Ability to handle late-arriving data without having to delay downstream processing
   Stream any late-arriving data as it is added to the table
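Several of these capabilities fall out of the versioned log itself. As a rough illustration, assuming the log is an ordered list of commits with tuple-shaped actions (a made-up format, as are the function names), a reader can pin a version for a consistent view, go back to an earlier version, or ask only for what was added after a given version.

```python
def snapshot(commits, version):
    """Files visible as of `version` (inclusive): a fixed, consistent view."""
    files = set()
    for actions in commits[: version + 1]:
        for op, path in actions:
            if op == "add":
                files.add(path)
            elif op == "remove":
                files.discard(path)
    return files


def added_since(commits, version):
    """Files added by commits after `version`, in commit order."""
    return [
        path
        for actions in commits[version + 1:]
        for op, path in actions
        if op == "add"
    ]


commits = [
    [("add", "1.parquet")],
    [("add", "2.parquet")],
]

reader_view = snapshot(commits, 1)        # a reader pins version 1
commits.append([("add", "bad.parquet")])  # a writer commits in the meantime

print(sorted(reader_view))           # the reader's view is unchanged
print(sorted(snapshot(commits, 1)))  # time travel back to version 1
print(added_since(commits, 1))       # incremental read of newer commits
```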

The Delta Architecture
A continuous data flow model to unify batch & streaming
[Diagram: CSV, JSON, TXT… and Kinesis sources, Data Lake, Streaming Analytics, AI & Reporting]

Up next... DEMO
Website: https://delta.io
Community (Slack/Email): https://delta.io/#community
