SlideShare a Scribd company logo
Delta by example
Palla Lentz, Assoc. Resident Solutions Architect
Jake Therianos, Customer Success Engineer
A Data Engineer’s Dream...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
Process data continuously and incrementally as new data arrive in a
cost efficient way without having to choose between batch or
streaming
Table
(Data gets written
continuously)
AI & Reporting
Events
Spark job gets exponentially slower
with time due to small files.
Stream
Stream
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Table
(Data gets compacted
every hour)
Batch Batch
Late arriving data means
processing need to be delayed
Stream
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Table
(Data gets compacted
every hour) Few hours latency doesn’t
satisfy business needs
Batch Batch
Stream
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch
Stream
Unified View
Lambda arch increases
operational burden
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Validations and other cleanup
actions need to be done twice
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Fixing mistakes means
blowing up partitions and
doing atomic re-publish
Reprocessing
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Updates & Merge get
complex with data lake
Reprocessing
Update & Merge
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Updates & Merge get
complex with data lake
Reprocessing
Update & Merge
Can this be simplified?
Stream
The Data Engineer’s Journey...
Table
(Data gets compacted
every hour)
What was missing?
1. Ability to read consistent data while data is being written
1. Ability to read incrementally from a large table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new data that arrived
1. Ability to handle late arriving data without having to delay downstream processing
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
So… What is the answer?
STRUCTURED
STREAMING
+ =
The
Delta
Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
How Delta Lake Works?
Delta On Disk
my_table/
_delta_log/
00000.json
00001.json
date=2019-01-01/
file-1.parquet
Transaction Log
Table Versions
(Optional) Partition Directories
Data Files
Table = result of a set of actions
Change Metadata – name, schema, partitioning, etc
Add File – adds a file (with optional statistics)
Remove File – removes a file
Result: Current Metadata, List of Files, List of Txns, Version
Implementing Atomicity
Changes to the table
are stored as
ordered, atomic
units called commits
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
000000.json
000001.json
…
Solving Conflicts Optimistically
1. Record start version
2. Record reads/writes
3. Attempt commit
4. If someone else wins,
check if anything you
read has changed.
5. Try again.
000000.json
000001.json
000002.json
User 1 User 2
Write: Append
Read: Schema
Write: Append
Read: Schema
Handling Massive Metadata
Large tables can have millions of files in them! How do we scale
the metadata? Use Spark for scaling!
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
Checkpoint
The Delta Architecture
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
Connecting the dots...
Snapshot isolation between writers and
readers
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Stream the backfilled historical data through
the same pipeline
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new
data that arrived
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new
data that arrived
1. Ability to handle late arriving data without
having to delay downstream processing
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Stream the backfilled historical data through
the same pipeline
Stream any late arriving data added to the
table as they get added
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new
data that arrived
1. Ability to handle late arriving data without
having to delay downstream processing
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Stream the backfilled historical data through
the same pipeline
Stream any late arriving data added to the
table as they get added
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
AI & Reporting
Streaming
Analytics
Data Lake
CSV,
JSON, TXT…
Kinesis
The Delta Architecture
A continuous data flow model to unify batch & streaming
Up next...DEMO
Website: https://delta.io
Community (Slack/Email): https://delta.io/#community
Ad

More Related Content

What's hot (20)

Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Databricks
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdf
Michael Kogan
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Databricks
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdf
Michael Kogan
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 

Similar to Delta from a Data Engineer's Perspective (20)

Delta Architecture
Delta ArchitectureDelta Architecture
Delta Architecture
Paulo Gutierrez
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
Databricks
 
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Data Con LA
 
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdfReal-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
nilanjan172nsvian
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Anyscale
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
Databricks
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
Miklos Christine
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
 
Icebergs Best Secret A Guide to Metadata Tables
Icebergs Best Secret A Guide to Metadata TablesIcebergs Best Secret A Guide to Metadata Tables
Icebergs Best Secret A Guide to Metadata Tables
Szehon Ho
 
Oracle Database Overview
Oracle Database OverviewOracle Database Overview
Oracle Database Overview
honglee71
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
Tilak Patidar
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Databricks
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
felixcss
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Tathagata Das
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
Databricks
 
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Data Con LA
 
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdfReal-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
nilanjan172nsvian
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Anyscale
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
Databricks
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
Miklos Christine
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
 
Icebergs Best Secret A Guide to Metadata Tables
Icebergs Best Secret A Guide to Metadata TablesIcebergs Best Secret A Guide to Metadata Tables
Icebergs Best Secret A Guide to Metadata Tables
Szehon Ho
 
Oracle Database Overview
Oracle Database OverviewOracle Database Overview
Oracle Database Overview
honglee71
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
Tilak Patidar
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Databricks
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
felixcss
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Tathagata Das
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
History of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptxHistory of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptx
balongcastrojo
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
ggg032019
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
shit yudh slideshare power likha point presen
shit yudh slideshare power likha point presenshit yudh slideshare power likha point presen
shit yudh slideshare power likha point presen
vishalgurjar11229
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
History of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptxHistory of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptx
balongcastrojo
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
ggg032019
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
shit yudh slideshare power likha point presen
shit yudh slideshare power likha point presenshit yudh slideshare power likha point presen
shit yudh slideshare power likha point presen
vishalgurjar11229
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 

Delta from a Data Engineer's Perspective

  • 1. Delta by example Palla Lentz, Assoc. Resident Solutions Architect Jake Therianos, Customer Success Engineer
  • 2. A Data Engineer’s Dream... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming
  • 3. Table (Data gets written continuously) AI & Reporting Events Spark job gets exponentially slower with time due to small files. Stream Stream The Data Engineer’s Journey...
  • 4. Table (Data gets written continuously) AI & Reporting Events Table (Data gets compacted every hour) Batch Batch Late arriving data means processing need to be delayed Stream The Data Engineer’s Journey...
  • 5. Table (Data gets written continuously) AI & Reporting Events Table (Data gets compacted every hour) Few hours latency doesn’t satisfy business needs Batch Batch Stream The Data Engineer’s Journey...
  • 6. Table (Data gets written continuously) AI & Reporting Events Batch Stream Unified View Lambda arch increases operational burden Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 7. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Validations and other cleanup actions need to be done twice Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 8. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Fixing mistakes means blowing up partitions and doing atomic re-publish Reprocessing Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 9. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Updates & Merge get complex with data lake Reprocessing Update & Merge Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 10. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Updates & Merge get complex with data lake Reprocessing Update & Merge Can this be simplified? Stream The Data Engineer’s Journey... Table (Data gets compacted every hour)
  • 11. What was missing? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived 1. Ability to handle late arriving data without having to delay downstream processing Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ?
  • 12. So… What is the answer? STRUCTURED STREAMING + = The Delta Architecture 1. Unify batch & streaming with a continuous data flow model 2. Infinite retention to replay/reprocess historical events as needed 3. Independent, elastic compute and storage to scale while balancing costs
  • 13. How Delta Lake Works?
  • 15. Table = result of a set of actions Change Metadata – name, schema, partitioning, etc Add File – adds a file (with optional statistics) Remove File – removes a file Result: Current Metadata, List of Files, List of Txns, Version
  • 16. Implementing Atomicity Changes to the table are stored as ordered, atomic units called commits Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet 000000.json 000001.json …
  • 17. Solving Conflicts Optimistically 1. Record start version 2. Record reads/writes 3. Attempt commit 4. If someone else wins, check if anything you read has changed. 5. Try again. 000000.json 000001.json 000002.json User 1 User 2 Write: Append Read: Schema Write: Append Read: Schema
  • 18. Handling Massive Metadata Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling! Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet Checkpoint
  • 20. Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ?
  • 21. Connecting the dots... Snapshot isolation between writers and readers Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written
  • 22. Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput
  • 23. Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes
  • 24. Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Stream the backfilled historical data through the same pipeline Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived
  • 25. 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived 1. Ability to handle late arriving data without having to delay downstream processing Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Stream the backfilled historical data through the same pipeline Stream any late arriving data added to the table as they get added Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ?
  • 26. 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived 1. Ability to handle late arriving data without having to delay downstream processing Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Stream the backfilled historical data through the same pipeline Stream any late arriving data added to the table as they get added Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting
  • 27. AI & Reporting Streaming Analytics Data Lake CSV, JSON, TXT… Kinesis The Delta Architecture A continuous data flow model to unify batch & streaming
  • 28. Up next...DEMO Website: https://delta.io Community (Slack/Email): https://delta.io/#community