Dustin Vannoy
Data Engineer
Cloud + Streaming
Azure Databricks with Delta Lake
Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
© Microsoft Azure + AI Conference All rights reserved.
Agenda
• Intro to Spark + Azure Databricks
• Delta Lake Overview
• Delta Lake in Action
• Schema Enforcement
• Time Travel
• MERGE, DELETE, OPTIMIZE
Intro to Spark & Azure Databricks
Overview and Databricks workspace walk through
Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
Benefit of horizontal scaling
[Diagram: Traditional vs. Distributed (Parallel)]
What is Spark?
• Fast, general-purpose engine for large-scale data processing
• Replaces MapReduce as Hadoop parallel programming API
• Many options:
  • YARN / Spark Cluster / Local
  • Scala / Python / Java / R
  • Spark Core / SQL / Streaming / ML / Graph
Simple code, parallel compute
Spark consists of a programming API and execution engine
[Diagram: one Master coordinating four Workers]
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the tab-separated song files and let Spark infer the column types
song_df = (
    spark.read
    .option('sep', '\t')
    .option("inferSchema", "true")
    .csv("/databricks-datasets/songs/data-001/part-0000*")
)

# Keep only the artist name and tempo columns, with readable names
tempo_df = song_df.select(
    col('_c4').alias('artist_name'),
    col('_c14').alias('tempo'),
)

# Average tempo per artist, highest first
avg_tempo_df = (
    tempo_df
    .groupBy('artist_name')
    .avg('tempo')
    .orderBy('avg(tempo)', ascending=False)
)

avg_tempo_df.show(truncate=False)
Spark’s Strengths
• Data pipelines and analytics
• Batch or streaming
• Spark SQL
• Machine learning
• Uses memory to speed up processing
• Large community, many examples and tutorials
Demo: Databricks Workspace
Delta Lake Overview
Why use it and how to start
Spark is powerful, but...
• Not ACID compliant – too easy to get corrupted data
• Schema mismatches – no validation on write
• Small files written, not efficient for reading
• Reads too much data (no indexes, only partitions)
ACID
• Atomicity – all or nothing
• Consistency – data always in a valid state
• Isolation – uncommitted operations don’t impact other reads/writes
• Durability – committed data is never lost
ACID compliance would give us the ability to update and delete!
Small File Problem
• Too much metadata
• Too many file open/close operations
• Compression not as effective
• Bad if using MapReduce to read
We fix this with scheduled file compaction jobs; the difficulty is avoiding interference with new write operations.
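As a rough illustration of what a compaction job has to decide, the sketch below groups small files into batches that each stay at or under a target size. The file names, sizes, and the 128 MB target are hypothetical, and the greedy grouping is an assumption made for illustration, not how any particular tool plans compaction.

```python
from typing import Dict, List

MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # hypothetical target size for one compacted file


def plan_compaction(file_sizes: Dict[str, int], target: int = TARGET_BYTES) -> List[List[str]]:
    """Greedily group files so each group's total size stays at or under target."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Visit the largest files first (a simple greedy heuristic)
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if current and current_size + size > target:
            # Adding this file would overshoot, so close the current group
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical small files written by earlier jobs
small_files = {
    "part-0": 100 * MB,
    "part-1": 20 * MB,
    "part-2": 60 * MB,
    "part-3": 8 * MB,
    "part-4": 40 * MB,
}
plan = plan_compaction(small_files)
```

Each inner list in `plan` would be rewritten as one larger file. The plan says nothing about writers that add files while the job runs, which is the hard part the slide points out.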
Partitions
• Typically Spark reads all data in a table/directory before applying filters
• Folder partitioning used to allow some filter pushdowns
• Limited to one fixed partition scheme to allow skipping reads
• Must use low-cardinality columns for partitioning
In relational databases, we used to just add indexes and update statistics to improve seeks.
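A small sketch of why folder partitioning lets a reader skip data: the partition values are encoded in the directory names, so a filter on a partition column can be answered from the paths alone. The paths and column names below are made up for illustration, and this models the idea only, not Spark's actual planner.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Extract key=value pairs from the directory parts of a path."""
    values: Dict[str, str] = {}
    for part in path.split("/")[:-1]:  # the last part is the file name
        if "=" in part:
            key, value = part.split("=", 1)
            values[key] = value
    return values


def prune(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Return the files that still have to be read for the given equality filters."""
    kept = []
    for path in paths:
        values = partition_values(path)
        # A file can be skipped only when a partition value contradicts the filter.
        # Columns that are not part of the path cannot rule anything out.
        if all(values.get(key, value) == value for key, value in wanted.items()):
            kept.append(path)
    return kept


# Hypothetical table layout, partitioned by year and month
files = [
    "events/year=2019/month=11/part-0000.parquet",
    "events/year=2019/month=12/part-0000.parquet",
    "events/year=2020/month=01/part-0000.parquet",
]
by_year = prune(files, {"year": "2019"})     # partition column: one file skipped
by_user = prune(files, {"user_id": "42"})    # not a partition column: nothing skipped
```

The second call shows the limitation from the slide: a filter on a column outside the one fixed partition scheme cannot skip any files.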
Delta Lake Concepts
Reference: delta.io
ACID Transactions
Atomicity, Consistency, and Isolation all improved
Reminder: ACID
• Atomicity – all or nothing
• Consistency – data always in valid state
• Isolation – uncommitted operations don’t impact other reads/writes
• Durability – committed data is never lost
ACID Transaction Support
“Serializable isolation levels ensure that readers never see inconsistent data”
- Delta Lake Documentation
Schema Enforcement
How to use schema validation and schema merge
Schema validation by default
• Delta defaults to validating schema
• Fails on mismatch
• Or, set schema merge option
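A minimal sketch of the two behaviors, assuming a Spark DataFrame `df` and a Delta table at a hypothetical `path`. The `delta` format, the `append` mode, and the `mergeSchema` write option come from the Delta Lake DataFrame writer; the function names are made up for this example.

```python
def append_strict(df, path):
    """Append to a Delta table; the write fails if df's schema does not match the table."""
    df.write.format("delta").mode("append").save(path)


def append_with_schema_merge(df, path):
    """Append and allow new columns in df to be added to the table schema."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # opt in to schema merge for this write only
        .save(path)
    )
```

Keeping strict appends as the default and opting in to schema merge per write means an unexpected column shows up as an error rather than as a silent schema change.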
Time Travel
Data version history in Delta
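For reference, reading an older version looks roughly like this, assuming an active `spark` session and a Delta table at a hypothetical `path`. `versionAsOf` and `timestampAsOf` are Delta Lake reader options; the function names are invented for the example.

```python
def read_version(spark, path, version):
    """Load the table as it was at a given commit version (0 is the first commit)."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


def read_as_of(spark, path, timestamp):
    """Load the table as it was at a timestamp string such as '2019-11-01'."""
    return spark.read.format("delta").option("timestampAsOf", timestamp).load(path)
```

Both functions return an ordinary DataFrame, so an old version can be compared with the current one using normal Spark operations.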
Delta Log
“The transaction log is the mechanism through which Delta Lake is able to offer the guarantee of atomicity.”
Reference: Databricks Blog: Unpacking the Transaction Log
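To make the quote concrete, here is a toy model of a transaction log: each commit is a numbered entry listing the data files it adds and removes, and the table's state at any version is found by replaying commits in order. The class and file names are invented for illustration, and this is a simplified in-memory sketch, not Delta Lake's implementation.

```python
from typing import List, Sequence, Set


class ToyLog:
    """In-memory stand-in for an ordered list of commit entries."""

    def __init__(self) -> None:
        self.commits: List[dict] = []

    def commit(self, add: Sequence[str] = (), remove: Sequence[str] = ()) -> int:
        """Record one commit; all of its changes become visible together."""
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1  # version number of this commit

    def files_at(self, version: int) -> Set[str]:
        """Replay commits 0..version to find the data files that make up the table."""
        files: Set[str] = set()
        for entry in self.commits[: version + 1]:
            files.update(entry["add"])
            files.difference_update(entry["remove"])
        return files


log = ToyLog()
v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
# A compaction rewrites two files into one, recorded as a single commit
v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet", "part-1.parquet"])
```

A reader sees either the two original files or the one compacted file, never a mix, because the swap is a single log entry. Replaying only up to `v0` still yields the original files, which is the idea behind time travel.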
Demo: Delta capabilities
Final thoughts
Delta Lake delivers some powerful capabilities
Delta Lake addresses
• ACID compliance
• Schema enforcement
• Compacting files
• Performance optimizations
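The MERGE, DELETE, and OPTIMIZE commands from the agenda can be sketched as thin wrappers around `spark.sql`, assuming an active `spark` session on Databricks and Delta tables registered in the metastore. The function names and any table, column, or key names passed in are hypothetical. Names are interpolated directly into the SQL text, so this sketch is only suitable for trusted input.

```python
def upsert(spark, target, source, key):
    """Update rows that match on key and insert the rest, in one transaction."""
    spark.sql(f"""
        MERGE INTO {target} AS t
        USING {source} AS s
        ON t.{key} = s.{key}
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)


def delete_where(spark, table, condition):
    """Delete the rows that satisfy a SQL condition."""
    spark.sql(f"DELETE FROM {table} WHERE {condition}")


def compact(spark, table):
    """Rewrite small files into larger ones."""
    spark.sql(f"OPTIMIZE {table}")
```

A call such as `upsert(spark, "events", "events_updates", "event_id")` would apply a batch of changes from a staging table or view.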
References
• Video - Simplify and Scale Data Engineering Pipelines with Delta Lake - Amanda Moran
• Video - Building Data Intensive Application on Top of Delta Lakes
• Video - Why do we need Delta Lake for Spark? - Learning Journal
• Databricks Blog: Unpacking the Transaction Log
• Databricks Delta Lake - James Serra
• Databricks Delta Technical Guide - Jan 2019
• Productionizing Machine Learning with Delta Lake
Please use EventsXD to fill out a session evaluation.
Thank you!
Ad

More Related Content

What's hot (20)

Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
Databricks
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentials
qureshihamid
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
Databricks
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
Databricks
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Snowflake essentials
Snowflake essentialsSnowflake essentials
Snowflake essentials
qureshihamid
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
Databricks
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 

Similar to Delta Lake with Azure Databricks (20)

Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
Dustin Vannoy
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
Cloudera, Inc.
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
How to Win When Migrating to Azure
How to Win When Migrating to AzureHow to Win When Migrating to Azure
How to Win When Migrating to Azure
Kellyn Pot'Vin-Gorman
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
Svetlin Stanchev
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
Cloudera, Inc.
 
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
Cloudera, Inc.
 
Data Engineering with Databricks Presentation
Data Engineering with Databricks PresentationData Engineering with Databricks Presentation
Data Engineering with Databricks Presentation
Knoldus Inc.
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
Modern Data Stack France
 
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
HostedbyConfluent
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Cloudera, Inc.
 
Cloud Computing & Cloud Storage
Cloud Computing & Cloud Storage Cloud Computing & Cloud Storage
Cloud Computing & Cloud Storage
Priyesh Pratap Singh
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure
Marco Parenzan
 
By Popular Demand: The Rise of Elastic SQL
By Popular Demand: The Rise of Elastic SQLBy Popular Demand: The Rise of Elastic SQL
By Popular Demand: The Rise of Elastic SQL
NuoDB
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
Mark Kromer
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
Mark Kromer
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
Kent Graziano
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
FedoRam1
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
Dustin Vannoy
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
Cloudera, Inc.
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
Svetlin Stanchev
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
Cloudera, Inc.
 
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
Cloudera, Inc.
 
Data Engineering with Databricks Presentation
Data Engineering with Databricks PresentationData Engineering with Databricks Presentation
Data Engineering with Databricks Presentation
Knoldus Inc.
 
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
HostedbyConfluent
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Cloudera, Inc.
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure
Marco Parenzan
 
By Popular Demand: The Rise of Elastic SQL
By Popular Demand: The Rise of Elastic SQLBy Popular Demand: The Rise of Elastic SQL
By Popular Demand: The Rise of Elastic SQL
NuoDB
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
Mark Kromer
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
Mark Kromer
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
Kent Graziano
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
FedoRam1
 
Ad

Recently uploaded (20)

Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Ad

Delta Lake with Azure Databricks

  • 1. Dustin Vannoy Data Engineer Cloud + Streaming Azure Databricks with Delta Lake
  • 2. Dustin Vannoy Data Engineering Consultant Co-founder Data Engineering San Diego /in/dustinvannoy @dustinvannoy [email protected] Technologies • Azure & AWS • Spark • Kafka • Python Modern Data Systems • Data Lakes • Analytics in Cloud • Streaming
  • 3. © Microsoft Azure + AI Conference All rights reserved. Agenda  Intro to Spark + Azure Databricks  Delta Lake Overview  Delta Lake in Action  Schema Enforcement  Time Travel  MERGE, DELETE, OPTIMIZE
  • 4. © Microsoft Azure + AI Conference All rights reserved. Intro to Spark & Azure Databricks Overview and Databricks workspace walk through
  • 5. Why Spark? Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than Map Reduce.
  • 6. Benefit of horizontal scaling Traditional Distributed (Parallel)
  • 7. © Microsoft Azure + AI Conference All rights reserved. What is Spark?  Fast, general purpose engine for large-scale data processing  Replaces MapReduce as Hadoop parallel programming API  Many options:  Yarn / Spark Cluster / Local  Scala / Python / Java / R  Spark Core / SQL / Streaming / ML / Graph
  • 8. Simple code, parallel compute. Spark consists of a programming API and execution engine. [Diagram: one Master node and four Worker nodes]

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col

        spark = SparkSession.builder.getOrCreate()

        # Read the tab-separated song files, letting Spark infer column types
        song_df = (
            spark.read
            .option('sep', '\t')
            .option("inferSchema", "true")
            .csv("/databricks-datasets/songs/data-001/part-0000*")
        )

        # Keep only the artist name and tempo columns, with readable names
        tempo_df = song_df.select(
            col('_c4').alias('artist_name'),
            col('_c14').alias('tempo'),
        )

        # Average tempo per artist, highest first
        avg_tempo_df = (
            tempo_df
            .groupBy('artist_name')
            .avg('tempo')
            .orderBy('avg(tempo)', ascending=False)
        )
        avg_tempo_df.show(truncate=False)
  • 9. Spark’s Strengths
      • Data pipelines and analytics
      • Batch or streaming
      • SparkSQL
      • Machine learning
      • Uses memory to speed up processing
      • Large community, many examples and tutorials
  • 11. Delta Lake Overview. Why use it and how to start
  • 12. Spark is powerful, but...
      • Not ACID compliant: too easy to get corrupted data
      • Schema mismatches: no validation on write
      • Small files written, not efficient for reading
      • Reads too much data (no indexes, only partitions)
  • 13. ACID
      • Atomicity: all or nothing
      • Consistency: data always in valid state
      • Isolation: uncommitted operations don’t impact other reads/writes
      • Durability: committed data is never lost
      ACID compliance would give us the ability to update and delete!
  • 14. Small File Problem
      • Too much metadata
      • Too many file open/close operations
      • Compression not as effective
      • Bad if using MapReduce to read
      We fix this with scheduled file compaction jobs; the difficulty is avoiding interference with new write operations.
  • 15. Partitions
      • Typically Spark reads all data in a table/directory before applying filters
      • Folder partitioning used to allow some filter pushdowns
      • Limited to one fixed partition scheme to allow skipping reads
      • Must use low-cardinality columns for partitioning
      We used to just add indexes and run statistics to improve seeks.
  • 17. ACID Transactions. Atomicity, Consistency, and Isolation all improved
  • 18. Reminder: ACID
      • Atomicity: all or nothing
      • Consistency: data always in valid state
      • Isolation: uncommitted operations don’t impact other reads/writes
      • Durability: committed data is never lost
  • 19. ACID Transaction Support. “Serializable isolation levels ensure that readers never see inconsistent data” (Delta Lake Documentation)
  • 20. Schema Enforcement. How to use schema validation and schema merge
  • 21. Schema validation by default
      • Delta defaults to validating schema
      • Fails on mismatch
      • Or, set schema merge option
  • 22. Time Travel. Data version history in Delta
  • 23. Delta Log. “The transaction log is the mechanism through which Delta Lake is able to offer the guarantee of atomicity.” Reference: Databricks Blog: Unpacking the Transaction Log
  • 25. Final thoughts. Delta Lake delivers some powerful capabilities
  • 26. Delta Lake addresses
      • ACID compliance
      • Schema enforcement
      • Compacting files
      • Performance optimizations
  • 27. References
      • Video: Simplify and Scale Data Engineering Pipelines with Delta Lake (Amanda Moran)
      • Video: Building Data Intensive Application on Top of Delta Lakes
      • Video: Why do we need Delta Lake for Spark? (Learning Journal)
      • Databricks Blog: Unpacking the Transaction Log
      • Databricks Delta Lake (James Serra)
      • Databricks Delta Technical Guide (Jan 2019)
      • Productionizing Machine Learning with Delta Lake
  • 28. Please use EventsXD to fill out a session evaluation. Thank you!
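
The schema validation behavior on slide 21 (fail on mismatch unless the schema merge option is set) can be illustrated with a small plain-Python sketch. This is not Delta Lake's implementation: the function name `resolve_write_schema`, the `merge_schema` flag, and the dict-based schemas are hypothetical stand-ins chosen for illustration. In Delta Lake itself, the corresponding write option is `mergeSchema`.

```python
def resolve_write_schema(table_schema, batch_schema, merge_schema=False):
    """Return the schema the table would have after the write, or raise.

    Hypothetical helper: schemas are plain dicts of column name -> type name.
    """
    # A column that exists in both schemas must keep the same type.
    for name, dtype in batch_schema.items():
        if name in table_schema and table_schema[name] != dtype:
            raise ValueError(
                f"type mismatch for column {name!r}: "
                f"table has {table_schema[name]}, batch has {dtype}"
            )

    # Columns the table has never seen are rejected unless merging is enabled.
    new_columns = {n: t for n, t in batch_schema.items() if n not in table_schema}
    if new_columns and not merge_schema:
        raise ValueError(f"unexpected columns: {sorted(new_columns)}")

    return {**table_schema, **new_columns}


table = {"id": "int", "name": "string"}
wider_batch = {"id": "int", "name": "string", "email": "string"}

# Default behavior: the extra column causes the write to fail.
try:
    resolve_write_schema(table, wider_batch)
    rejected = False
except ValueError:
    rejected = True

# With schema merge enabled, the new column is added to the table schema.
merged = resolve_write_schema(table, wider_batch, merge_schema=True)
```

Failing by default protects downstream readers from silently changing tables; the merge flag makes schema evolution an explicit decision by the writer.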

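The horizontal scaling idea on slide 6 (split the work, count in parallel, combine the results) can be sketched in a few lines of Python. This is only an analogy under stated assumptions: threads on one machine stand in for the workers, whereas Spark distributes work across machines, and the names `count_section` and `count_attendees` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_section(section):
    """One worker counts the people in a single section of the venue."""
    return len(section)


def count_attendees(sections, workers=4):
    """Count each section in parallel, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_section, sections))
    return sum(partial_counts)


venue = [["ana", "bo"], ["cy"], [], ["di", "ed", "flo"]]
total = count_attendees(venue)
```
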
Editor's Notes

  • #2: With the shift to data lakes that use distributed file storage as the foundation, we have been missing the reliability that relational databases provide. Databricks Delta is a data management system focused on bringing more reliability and performance into our data lakes. It sits on top of existing storage, and the API is very similar to reading and writing files from Spark already. This session will present an overview of Delta Lake, why it may be a better option than standard data lake storage, and how you can use it from Azure Databricks. We will work through demos that showcase the key benefits of Delta Lake: (1) ACID transactions, (2) schema enforcement and evolution, (3) time travel (data versioning).
  • #7: Let’s think about the benefit of parallel processing, often referred to as distributed systems. The idea is actually very easy to understand. If we had a task such as counting all the people at a concert, you could have one person who is really good at counting do it, and if the venue is small enough they will do just fine. But the job will be completed faster if you have many people counting and combining the results at the end. Sure, there is a little more organization needed, but if you need to count the attendees at a Beyoncé concert you could just hire a lot of people to do the job. And if one of them gets distracted by the music, you can send whoever finishes first in to take over counting that section. We call this capability “Horizontal Scaling” because if our data processing system is not powerful enough to do the work, we add more computers to help out rather than replacing the single server with a more powerful server. Distributed computing and parallel processing are not new concepts (few things in computing are), but what if you had an easy way to tell all the workers what to do without having to micro-manage to avoid two people counting the same section? That is where new programming models and frameworks have stepped in over the last 10 years and given us the beloved buzzword “Big Data”. Spark is not the only option here, but it has a lot of strengths and is often chosen over the traditional single-machine processing options.
  • #8: Spark is a fast and general engine for large-scale data processing that uses memory to speed up work. It often replaces MapReduce as the parallel programming API on Hadoop; the way it handles data (RDDs) provides one performance benefit, and using memory when possible provides another, larger one. It can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits; I find it’s still a useful API though. You can write Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying it in Python and Scala and picking whichever is easiest for you. There are several modules for different use cases with a similar API, so you can swap between modes relatively easily. For example, we have both streaming and batch sources of some data, and we reuse the rest of the Spark processing transformations.
  • #9: In the day to day we will talk about writing Spark code and also refer to running the code on the Spark cluster. There are actually quite a few options for how to do either of these things, but here is a quick look at Spark code that uses Spark DataFrames in Python. Whatever cluster we run it on will have a concept of a master node and worker nodes, as well as some storage that is often a hybrid of local storage on the workers plus a distributed file system like Hadoop’s HDFS, Amazon S3, or Azure Data Lake Storage. If you don’t follow all those terms, it’s OK. There is plenty of time to build up to those concepts after you start learning to write Spark code and run it in a simple Spark environment. We will cover that in other videos.
  • #10: So we sort of get what Spark is, we saw a small code sample and discussed how a cluster exists to run the code on. Let’s go back to a higher level and talk about Spark’s strengths.
  • #11: Quick overview of important Databricks workspace segments: Clusters, Tables, Notebooks. Open the create_parquet_tables notebook and run the first few commands as examples of working without Delta.
  • #14:
      • Atomicity: a typical Spark save does not use locking and is not atomic, so it could leave incomplete changes behind and corrupt data. Overwrite will remove data before loading new data, so this is typically not an issue. With append mode the default committer should have atomicity, but some of the faster committers don’t guarantee atomicity. (Source: Learning Journal, Delta Lake for Apache Spark video on YouTube)
      • Consistency: with a typical Spark overwrite there is a time when no files exist, and if a failure happens at that point you are left in an invalid state.
      • Isolation: an operation that is in progress (not committed) should not impact the results of other reads or writes; we do not want dirty reads. A typical database offers different levels of isolation, but Spark doesn’t have a specific commit option such as read committed or serializable. Task-level and job-level commits exist, but the lack of atomicity in writes leaves this not fully working.
      • Durability: typically not an issue, though lack of commit can lead to issues here as well.
  • #15: (Same notes as #14.)
  • #19: (Same notes as #14.)
  • #24: Quote and image from Databricks blog post by Burak Yavuz, Michael Armbrust and Brenner Heintz -> https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
  • #25: Demo notebook create_delta_tables
      • Show bad data when running one set of writes from one source, then run from a second source
      • Same example with a Delta destination to show failure
      • Same example but tweaked to allow schema merge
      • Show transaction log files
      • Demo of a file where data was streamed in; show by timestamp and version
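
The transaction log and version-based time travel mentioned in the notes above can be pictured with a toy model. The sketch below is an assumption-laden illustration, not Delta Lake's code: the class `ToyTableLog` and its methods are hypothetical, each commit is a zero-padded JSON file listing data files added and removed, and exclusive file creation stands in for the rule that only one writer may claim a given version.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Hypothetical, simplified transaction log kept as ordered JSON files."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, read_version, add=(), remove=()):
        """Write version read_version + 1, failing if it already exists."""
        version = read_version + 1
        actions = {"add": list(add), "remove": list(remove)}
        # Mode "x" creates the file only if it is absent, so a second writer
        # that started from the same version gets FileExistsError.
        with open(self._path(version), "x") as f:
            json.dump(actions, f)
        return version

    def snapshot(self, version=None):
        """Replay commits 0..version and return the set of live data files."""
        if version is None:
            version = self.latest_version()
        files = set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                actions = json.load(f)
            files -= set(actions["remove"])
            files |= set(actions["add"])
        return files


log = ToyTableLog(tempfile.mkdtemp())
v0 = log.commit(-1, add=["part-0.parquet"])
v1 = log.commit(v0, add=["part-1.parquet"])
# A compaction-style commit: swap two small files for one larger file.
v2 = log.commit(v1, add=["part-2.parquet"],
                remove=["part-0.parquet", "part-1.parquet"])

# A writer that still believes the table is at v1 loses the race for version 2.
try:
    log.commit(v1, add=["late.parquet"])
    conflict_rejected = False
except FileExistsError:
    conflict_rejected = True

as_of_v1 = log.snapshot(1)   # "time travel" to an earlier version
current = log.snapshot()     # latest state of the table
```

Because readers only ever see files named in committed log entries, a half-finished write is invisible, and replaying the log up to an older version gives the table as it was at that point.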