WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Optimizing data lakes for Apache Spark
Matthew Powers, Prognos
#UnifiedAnalytics #SparkAISummit
About
But what about those poor data scientists that work with gzipped CSV lakes? 😱
What you will get from this talk…
• Motivation to write Spark open source code
• Practical knowledge to build better data lakes
Agenda
• Community goals
• Spark open source
• Modern Scala libs
• Parquet lakes
• Incremental updates & small files
• Partitioned lakes
• Delta lakes
Loved by most, dreaded by some
Source: 2019 Stack Overflow Developer Survey
Community goals
• Passionate about community unification (standardization of method signatures)
• Need to find optimal scalafmt settings
• Strongly dislike UDFs
• Spark tooling?
Spark helper libraries
• spark-daria (Scala) / quinn (PySpark)
• spark-fast-tests (Scala) / chispa (PySpark)
• spark-style-guide
Modern Scala libs
• uTest
• Mill build tool
Prognos data lakes
[Architecture diagram: Data lake 1, Data lake 2, and Data lake 3 feed Apache Spark and other tech, powering the Prognos AI platform to predict disease]
TL;DR
• 1 GB files
• No nested directories
Small file problem
• Incrementally updating a lake creates lots of small files
• We can lay out the data so it’s easy to compact, as sketched below
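One hypothetical on-disk layout that keeps incremental writes separate from compacted output (the folder names are assumptions, not from the slide):

  lake/
    incremental/2019/04/25/08/part-….parquet   <- small hourly files land here
    compacted/part-….parquet                   <- periodic job rewrites them as ~1 GB files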
Suppose we have a CSV data lake
• CSV data lake is constantly being updated
• Want to convert it to a Parquet data lake
• Want incremental updates every hour
CSV => Parquet
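The conversion code on this slide is an image in the export; here is a minimal sketch of one way to do it, using Structured Streaming with Trigger.Once so an hourly job only processes files that arrived since the last run (the paths are hypothetical; the schema is taken from the filtering slides below):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

  // Streaming file sources need an explicit schema
  val schema = StructType(Seq(
    StructField("first_name", StringType),
    StructField("last_name", StringType),
    StructField("country", StringType)
  ))

  spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("/data/csv_lake")            // hypothetical source path
    .writeStream
    .trigger(Trigger.Once)            // process only the new files, then stop
    .format("parquet")
    .option("checkpointLocation", "/data/parquet_lake/_checkpoint")
    .start("/data/parquet_lake")      // hypothetical destination path
    .awaitTermination()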
Compacting small files
10,000 incremental files and 166GB of data
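A sketch of a compaction pass, assuming the 166 GB figure above and the ~1 GB file target from the TL;DR slide (paths are hypothetical):

  // Read the 10,000 small files and rewrite them as ~166 one-GB files
  val df = spark.read.parquet("/data/parquet_lake")

  df.repartition(166)
    .write
    .mode("overwrite")
    .parquet("/data/parquet_lake_compacted")

Writing to a fresh path means readers never see a half-rewritten lake; the swap to the compacted path happens afterward.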
Access data lake
Why partition data lakes?
• Data skipping
• Massively improve query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
Sample data
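The sample table on this slide is an image; here are a few hypothetical rows, where only the schema matches the physical plans below:

  import spark.implicits._

  // Hypothetical rows; only first_name / last_name / country is from the slides
  val peopleDF = Seq(
    ("Maria", "Ivanova", "Russia"),
    ("Mikhail", "Petrov", "Russia"),
    ("Ana", "Silva", "Brazil")
  ).toDF("first_name", "last_name", "country")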
Filtering unpartitioned lake

== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
      Batched: false,
      Format: CSV,
      Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
      PartitionFilters: [],
      PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string,country:string>
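The query itself isn’t in the export, but it can be reconstructed from the plan above: the path comes from the Location line and the predicates from PushedFilters (a sketch, not necessarily the slide’s exact code):

  import org.apache.spark.sql.functions.col

  val df = spark.read
    .option("header", "true")
    .csv("/Users/powers/Documents/tmp/blog_data/people.csv")

  df.filter(col("country") === "Russia" && col("first_name").startsWith("M"))
    .select("first_name", "last_name", "country")
    .explain()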
Partitioning the data lake
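The write code on this slide is an image; a minimal sketch of the partitioned write, reusing the df read above (CSV format matches the partitioned plan on the next slides):

  df.write
    .partitionBy("country")
    .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")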
Partitioned lake on disk
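The directory listing on this slide is an image; the layout partitionBy produces looks like this (hypothetical file names):

  partitioned_lake/
    country=Brazil/part-00002-….csv
    country=China/part-00033-….csv
    country=Russia/part-00059-….csv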
Filtering partitioned data lake

== Physical Plan ==
Project [first_name#74, last_name#75, country#76]
+- Filter (isnotnull(first_name#74) && StartsWith(first_name#74, M))
   +- FileScan csv [first_name#74,last_name#75,country#76]
      Batched: false,
      Format: CSV,
      Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/partitioned_lake],
      PartitionCount: 1,
      PartitionFilters: [isnotnull(country#76), (country#76 = Russia)],
      PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string>
Comparing physical plans

Unpartitioned:
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
      Batched: false, Format: CSV, Location: InMemoryFileIndex[….],
      PartitionFilters: [],
      PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string,country:string>

Partitioned:
Project [first_name#74, last_name#75, country#76]
+- Filter (isnotnull(first_name#74) && StartsWith(first_name#74, M))
   +- FileScan csv [first_name#74, last_name#75, country#76]
      Batched: false, Format: CSV, Location: InMemoryFileIndex[…],
      PartitionCount: 1,
      PartitionFilters: [isnotnull(country#76), (country#76 = Russia)],
      PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string>
Directly grabbing the partitions is faster
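A sketch of what grabbing a partition directly looks like; Spark never even lists the other country directories. Note the partition column isn’t stored in the files themselves, so it has to be re-added if the query needs it:

  import org.apache.spark.sql.functions.lit

  val russians = spark.read
    .option("header", "true")
    .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake/country=Russia")
    .withColumn("country", lit("Russia")) // partition value isn't in the files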
Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries
Creating partitioned lakes
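The code on these three slides is lost in the export; here is a hedged sketch of three common variants for controlling how many files a partitioned write produces (all paths hypothetical):

  import org.apache.spark.sql.functions.{col, floor, rand}

  // (1) Naive: every memory partition can write a file into every
  //     country directory, so file counts explode
  df.write.partitionBy("country").parquet("/data/lake1")

  // (2) Repartition on the partition key first: exactly one file per
  //     country, which can create huge files for big countries
  df.repartition(col("country"))
    .write.partitionBy("country").parquet("/data/lake2")

  // (3) Salt the key to cap the files per country (here at most 5)
  df.repartition(col("country"), floor(rand() * 5))
    .write.partitionBy("country").parquet("/data/lake3")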
Compacting Delta Lakes
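The slide’s code is an image; a sketch of the compaction pattern Delta Lake supports, rewriting one partition in place with dataChange = false so streaming readers don’t treat the rewrite as new data (on Databricks, the OPTIMIZE command automates this):

  import org.apache.spark.sql.functions.col

  spark.read
    .format("delta")
    .load("/data/delta_lake")                      // hypothetical path
    .where(col("country") === "Russia")
    .repartition(10)                               // target file count
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "country = 'Russia'")  // rewrite only this partition
    .option("dataChange", "false")                 // a rewrite, not new data
    .save("/data/delta_lake")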
Incrementally updating partitioned lakes
• The small file problem grows quickly
• Compaction is hard
• I’m not aware of any automated Parquet compaction algorithms
What talk should I give next?
• Best practices for the Spark community
• Ditching SBT for the Mill build tool
• Testing Spark code
• Running Spark Scala code in PySpark
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT