WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Optimizing data lakes for Apache Spark
Matthew Powers, Prognos
#UnifiedAnalytics #SparkAISummit
About
But what about those poor data scientists that work with gzipped CSV lakes? 😱
What you will get from this talk…
• Motivation to write Spark open source code
• Practical knowledge to build better data lakes
Agenda
• Community goals
• Spark open source
• Modern Scala libs
• Parquet lakes
• Incremental updates & small files
• Partitioned lakes
• Delta lakes
Loved by most, dreaded by some
Source: 2019 Stack Overflow Developer Survey
Community goals
• Passionate about community unification (standardization of method signatures)
• Need to find optimal scalafmt settings
• Strongly dislike UDFs
• Spark tooling?
Spark helper libraries
• spark-daria (Scala) / quinn (PySpark)
• spark-fast-tests (Scala) / chispa (PySpark)
• spark-style-guide
Modern Scala libs
• uTest
• Mill build tool
Prognos data lakes
[Architecture diagram: Data lake 1, Data lake 2, and Data lake 3 feed Apache Spark and other tech, powering the Prognos AI platform to predict disease]
TL;DR
• 1 GB files
• No nested directories
Small file problem
• Incrementally updating a lake creates lots of small files
• We can lay out the data so it’s easy to compact, as sketched below
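One hypothetical on-disk layout that keeps incremental writes separate from compacted output (the folder names are assumptions, not from the slide):

  lake/
    incremental/2019/04/25/08/part-….parquet   <- small hourly files land here
    compacted/part-….parquet                   <- periodic job rewrites them as ~1 GB files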
Suppose we have a CSV data lake
• CSV data lake is constantly being updated
• Want to convert it to a Parquet data lake
• Want incremental updates every hour
CSV => Parquet
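The conversion code on this slide is an image in the export; here is a minimal sketch of one way to do it, using Structured Streaming with Trigger.Once so an hourly job only processes files that arrived since the last run (the paths are hypothetical; the schema is taken from the filtering slides below):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

  // Streaming file sources need an explicit schema
  val schema = StructType(Seq(
    StructField("first_name", StringType),
    StructField("last_name", StringType),
    StructField("country", StringType)
  ))

  spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("/data/csv_lake")            // hypothetical source path
    .writeStream
    .trigger(Trigger.Once)            // process only the new files, then stop
    .format("parquet")
    .option("checkpointLocation", "/data/parquet_lake/_checkpoint")
    .start("/data/parquet_lake")      // hypothetical destination path
    .awaitTermination()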
Compacting small files
10,000 incremental files and 166GB of data
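A sketch of a compaction pass, assuming the 166 GB figure above and the ~1 GB file target from the TL;DR slide (paths are hypothetical):

  // Read the 10,000 small files and rewrite them as ~166 one-GB files
  val df = spark.read.parquet("/data/parquet_lake")

  df.repartition(166)
    .write
    .mode("overwrite")
    .parquet("/data/parquet_lake_compacted")

Writing to a fresh path means readers never see a half-rewritten lake; the swap to the compacted path happens afterward.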
Access data lake
Why partition data lakes?
• Data skipping
• Massively improve query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
Sample data
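The sample table on this slide is an image; here are a few hypothetical rows, where only the schema matches the physical plans below:

  import spark.implicits._

  // Hypothetical rows; only first_name / last_name / country is from the slides
  val peopleDF = Seq(
    ("Maria", "Ivanova", "Russia"),
    ("Mikhail", "Petrov", "Russia"),
    ("Ana", "Silva", "Brazil")
  ).toDF("first_name", "last_name", "country")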
Filtering unpartitioned lake

== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
      Batched: false,
      Format: CSV,
      Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
      PartitionFilters: [],
      PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string,country:string>
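The query itself isn’t in the export, but it can be reconstructed from the plan above: the path comes from the Location line and the predicates from PushedFilters (a sketch, not necessarily the slide’s exact code):

  import org.apache.spark.sql.functions.col

  val df = spark.read
    .option("header", "true")
    .csv("/Users/powers/Documents/tmp/blog_data/people.csv")

  df.filter(col("country") === "Russia" && col("first_name").startsWith("M"))
    .select("first_name", "last_name", "country")
    .explain()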
Partitioning the data lake
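The write code on this slide is an image; a minimal sketch of the partitioned write, reusing the df read above (CSV format matches the partitioned plan on the next slides):

  df.write
    .partitionBy("country")
    .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")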
Partitioned lake on disk
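The directory listing on this slide is an image; the layout partitionBy produces looks like this (hypothetical file names):

  partitioned_lake/
    country=Brazil/part-00002-….csv
    country=China/part-00033-….csv
    country=Russia/part-00059-….csv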
Filtering partitioned data lake

== Physical Plan ==
Project [first_name#74, last_name#75, country#76]
+- Filter (isnotnull(first_name#74) && StartsWith(first_name#74, M))
   +- FileScan csv [first_name#74,last_name#75,country#76]
      Batched: false,
      Format: CSV,
      Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/partitioned_lake],
      PartitionCount: 1,
      PartitionFilters: [isnotnull(country#76), (country#76 = Russia)],
      PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string>
Comparing physical plans

Unpartitioned:
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
      Batched: false, Format: CSV, Location: InMemoryFileIndex[….],
      PartitionFilters: [],
      PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string,country:string>

Partitioned:
Project [first_name#74, last_name#75, country#76]
+- Filter (isnotnull(first_name#74) && StartsWith(first_name#74, M))
   +- FileScan csv [first_name#74, last_name#75, country#76]
      Batched: false, Format: CSV, Location: InMemoryFileIndex[…],
      PartitionCount: 1,
      PartitionFilters: [isnotnull(country#76), (country#76 = Russia)],
      PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string>
Directly grabbing the partitions is faster
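A sketch of what grabbing a partition directly looks like; Spark never even lists the other country directories. Note the partition column isn’t stored in the files themselves, so it has to be re-added if the query needs it:

  import org.apache.spark.sql.functions.lit

  val russians = spark.read
    .option("header", "true")
    .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake/country=Russia")
    .withColumn("country", lit("Russia")) // partition value isn't in the files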
Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries
Creating partitioned lakes
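The code on these three slides is lost in the export; here is a hedged sketch of three common variants for controlling how many files a partitioned write produces (all paths hypothetical):

  import org.apache.spark.sql.functions.{col, floor, rand}

  // (1) Naive: every memory partition can write a file into every
  //     country directory, so file counts explode
  df.write.partitionBy("country").parquet("/data/lake1")

  // (2) Repartition on the partition key first: exactly one file per
  //     country, which can create huge files for big countries
  df.repartition(col("country"))
    .write.partitionBy("country").parquet("/data/lake2")

  // (3) Salt the key to cap the files per country (here at most 5)
  df.repartition(col("country"), floor(rand() * 5))
    .write.partitionBy("country").parquet("/data/lake3")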
Compacting Delta Lakes
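The slide’s code is an image; a sketch of the compaction pattern Delta Lake supports, rewriting one partition in place with dataChange = false so streaming readers don’t treat the rewrite as new data (on Databricks, the OPTIMIZE command automates this):

  import org.apache.spark.sql.functions.col

  spark.read
    .format("delta")
    .load("/data/delta_lake")                      // hypothetical path
    .where(col("country") === "Russia")
    .repartition(10)                               // target file count
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "country = 'Russia'")  // rewrite only this partition
    .option("dataChange", "false")                 // a rewrite, not new data
    .save("/data/delta_lake")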
Incrementally updating partitioned lakes
• The small file problem grows quickly
• Compaction is hard
• I’m not aware of any automated Parquet compaction algorithms
What talk should I give next?
• Best practices for the Spark community
• Ditching SBT for the Mill build tool
• Testing Spark code
• Running Spark Scala code in PySpark
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT