Improving Spark SQL Performance by 30%:
How We Optimize Parquet Filter Pushdown
and Parquet Reader
Ke Sun (sunke3296@gmail.com)
Senior Engineer of Data Engine Team, ByteDance
Who We Are
▪ Data Engine team of ByteDance
▪ Build a one-stop OLAP platform on which users can analyze EB-level data by writing SQL, without caring about the underlying execution engine
What We Do
▪ Manage Spark SQL / Presto / Hive workloads
▪ Offer an open API and a serverless OLAP platform
▪ Optimize the Spark SQL / Presto / Hudi / Hive engines
▪ Design the data architecture for most business lines in ByteDance
Agenda
Spark SQL at ByteDance
How Spark Reads Parquet
Optimization of Parquet Filter
Pushdown and Parquet Reader
at ByteDance
Spark SQL at ByteDance
Spark SQL at ByteDance
Timeline 2016 – 2020: small-scale experiments → ad-hoc workload → a few ETL workloads → full-production deployment & migration → main engine in the DW area
Spark SQL at ByteDance
▪ Spark SQL covers 98%+ of the ETL workload
▪ Parquet is the default file format in the data warehouse, and the vectorized reader is also enabled by default
How Spark Reads Parquet
How Spark Reads Parquet
▪ Overview of Parquet
▪ Procedure of Parquet Reading
▪ What We Can Optimize
How Spark Reads Parquet
▪ Overview of Parquet
▪ Column pruning
▪ More efficient compression
▪ Parquet can skip useless data through Spark filter pushdown, using the footer & RowGroup statistics (see the sketch below)
https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif
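A minimal sketch of the Spark settings behind this behavior; both configs already default to true in recent Spark releases, and the path and column names below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object ParquetPushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-pushdown-demo").getOrCreate()

    // Push predicates down to the Parquet reader so RowGroups can be skipped
    // using the min/max statistics stored in the footer.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    // Read the surviving columns in batches with the vectorized reader.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

    spark.read.parquet("/path/to/table")   // placeholder path
      .where("category = 'test'")          // candidate for RowGroup skipping
      .select("col1")                      // column pruning drops the other columns
      .show()

    spark.stop()
  }
}
```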
How Spark Reads Parquet
▪ Procedure of Parquet Reading
▪ VectorizedParquetRecordReader skips useless RowGroups using the pushed-down filters translated by ParquetFilters
▪ VectorizedParquetRecordReader builds a column reader for every target column, and these column readers read data together in batches (see the sketch after the call chain below)
DataSourceScanExec.inputRDD
→ ParquetFileFormat.buildReaderWithPartitionValues
→ VectorizedParquetRecordReader.nextBatch
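To see which predicates actually reached the Parquet reader, one can inspect the physical plan: the FileScan node lists them as PushedFilters. A small sketch (path and columns are placeholders; the exact plan text varies by Spark version):

```scala
import org.apache.spark.sql.SparkSession

object InspectPushedFilters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-pushed-filters").getOrCreate()

    val df = spark.read.parquet("/path/to/table")
      .where("category = 'test'")
      .select("col1")

    // The FileScan line of the physical plan reports something like:
    //   PushedFilters: [IsNotNull(category), EqualTo(category,test)]
    df.explain()

    spark.stop()
  }
}
```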
How Spark Reads Parquet
▪ Optimization – Statistics are not distinguishable
select * from table_name where date = '***' and category = 'test'
(date is a partition column and category is a predicate column)
In this example, Spark reads all three RowGroups because the statistics are not distinguishable (a sketch for inspecting these statistics follows the table below).
min of category max of category read or not
RowGroup1 a1 z1 Yes
RowGroup2 a2 z2 Yes
RowGroup3 a3 z3 Yes
Min/Max Statistics of RowGroup for a Parquet File
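To check how distinguishable the RowGroup statistics of a real file are, the footer can be inspected directly with the parquet-mr API. A sketch, assuming a single Parquet file and a column named category (both placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

object RowGroupStats {
  def main(args: Array[String]): Unit = {
    val input  = HadoopInputFile.fromPath(new Path("/path/to/file.parquet"), new Configuration())
    val reader = ParquetFileReader.open(input)
    try {
      // One BlockMetaData per RowGroup; print min/max of the "category" column chunk.
      reader.getFooter.getBlocks.asScala.zipWithIndex.foreach { case (block, i) =>
        block.getColumns.asScala
          .find(_.getPath.toDotString == "category")
          .foreach { col =>
            val s = col.getStatistics
            println(s"RowGroup${i + 1}: min=${s.genericGetMin}, max=${s.genericGetMax}")
          }
      }
    } finally {
      reader.close()
    }
  }
}
```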
How Spark Reads Parquet
▪ Optimization – Statistics are not distinguishable
Parquet filter pushdown works poorly when the predicate columns are unsorted in the Parquet files, which is expected behavior.
It is therefore valuable to sort the commonly used predicate columns in the Parquet files to reduce IO; a manual version of this idea is sketched below.
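A minimal sketch of the manual version of this idea: locally sort each output partition by the hot predicate column before writing, so that the RowGroup min/max ranges stop overlapping. Paths and the column name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object SortedWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sorted-write").getOrCreate()

    spark.read.parquet("/path/to/source")
      .sortWithinPartitions("category")   // local sort only: no extra shuffle
      .write
      .mode("overwrite")
      .parquet("/path/to/sorted_table")

    spark.stop()
  }
}
```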
How Spark Reads Parquet
▪ Optimization – Spark reads too much unnecessary data
select col1 from table_name where date = ‘***’ and col2 = ‘test’
[Diagram: a Parquet file with RowGroup1, RowGroup2 and RowGroup3, each containing col1, col2 and col3]
▪ RowGroup1 is skipped by filter pushdown
▪ col3 is skipped by column pruning
▪ col1 and col2 are read together by the vectorized reader
How Spark Reads Parquet
▪ Optimization – Spark reads too much unnecessary data
select col1 from table_name where date = ‘***’ and col2 = ‘test’
[Diagram: the same Parquet file with RowGroup1, RowGroup2 and RowGroup3, each containing col1, col2 and col3]
▪ Most of the data in col1 does not need to be read if the filter (col2 = 'test') is highly selective
▪ It is valuable to read and filter the data by the filter columns first, and only then read the data of the other columns
Optimization of Parquet Filter Pushdown and
Parquet Reader at ByteDance
Optimization of Parquet Filter Pushdown
▪ Statistics are not distinguishable
▪ Increase the discrimination of the Parquet statistics
▪ Low overhead: sorting all of the data is expensive or even impossible
▪ Automation: users do not need to update their ETL jobs
LocalSort: add a SortExec node before the InsertIntoHiveTable node
▪ Which columns should be sorted?
▪ Analyze the query history and choose the most commonly used predicate columns
▪ Store the sort columns as a table property of the Hive table; Spark SQL reads this property at write time (see the sketch after the plan diagram below)
▪ It is an automatic procedure without manual intervention
Optimization of Parquet Filter Pushdown
Plan before LocalSort: … → Project → InsertIntoHiveTable
Plan after LocalSort: … → Project → SortExec → InsertIntoHiveTable
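A hedged illustration of how the sort columns could be attached to the table. The property key sort.columns is hypothetical; the slides only say that the chosen predicate columns are stored as a Hive table property which Spark SQL reads before inserting:

```scala
import org.apache.spark.sql.SparkSession

object ConfigureSortColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("configure-sort-columns")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical property key: record which columns future INSERTs should
    // locally sort by, based on the analysis of the query history.
    spark.sql("ALTER TABLE db.table_name SET TBLPROPERTIES ('sort.columns' = 'category')")

    spark.stop()
  }
}
```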
Optimization of Parquet Filter Pushdown
▪ Spark reads less data because the statistics are more discriminative
▪ The Parquet files are much smaller because sorted data compresses more efficiently
▪ Only about 5% overhead
Spark reads only one RowGroup after sorting the data by the category column
min of category max of category read or not
RowGroup1 a1 g1 No
RowGroup2 g2 u2 Yes
RowGroup3 u3 z3 No
Min/Max Statistics of RowGroup for a Parquet File
Optimization of Parquet Reader
▪ Spark reads too much unnecessary data
▪ Filter unnecessary data as soon as possible
Prewhere: read the data of the filter columns in batches first, and skip the other
columns if no row matches (an idea borrowed from ClickHouse)
Optimization of Parquet Reader
▪ Split the Parquet reader into two readers: a FilterReader for the filter
columns and a NonFilterReader for the other columns
[Diagram: the original VectorizedParquetRecordReader reads col1, col2 and col3 of every RowGroup; the PrewhereVectorizedParquetRecordReader splits this work into a FilterReader for the filter columns and a NonFilterReader for the remaining columns]
Optimization of Parquet Reader
Prewhere read flow:
FilterReader reads the filter columns in a batch
→ apply the filter expressions to the batch
→ if no row matches: skip this batch and continue with the next one
→ if some rows match: NonFilterReader reads the remaining columns in a batch, skipping the unnecessary data
→ union the data and return it in a batch
Potential benefit: skip RowGroups, skip Pages, skip decoding
(a simplified sketch of this two-phase read follows below)
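The two-phase read can be illustrated with a minimal, self-contained sketch. The RowGroup class and readWithPrewhere function below are simplified stand-ins for illustration only, not the actual PrewhereVectorizedParquetRecordReader:

```scala
// Sketch of the prewhere-style two-phase read over in-memory "columnar" data.
object PrewhereSketch {
  // One fake RowGroup: payload column col1 and filter column col2.
  final case class RowGroup(col1: Array[String], col2: Array[String])

  def readWithPrewhere(groups: Seq[RowGroup], predicate: String => Boolean): Seq[(String, String)] =
    groups.flatMap { rg =>
      // Phase 1: read only the filter column and evaluate the predicate per row.
      val matches: Array[Int] = rg.col2.indices.filter(i => predicate(rg.col2(i))).toArray
      if (matches.isEmpty) {
        // No row matched: the non-filter columns of this RowGroup are never read.
        Seq.empty
      } else {
        // Phase 2: read the remaining columns only for the matching row positions.
        matches.toSeq.map(i => (rg.col1(i), rg.col2(i)))
      }
    }

  def main(args: Array[String]): Unit = {
    val groups = Seq(
      RowGroup(Array("a", "b", "c"), Array("x", "test", "y")),
      RowGroup(Array("d", "e"),      Array("x", "y"))        // fully skipped
    )
    readWithPrewhere(groups, _ == "test").foreach(println)   // prints (b,test)
  }
}
```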
Optimization of Parquet Reader
Supported data types of the filter column: ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, StringType
Supported filter types: >, >=, <, <=, =, In, isNull, isNotNull
Databricks simplifies data and AI
so data teams can innovate faster
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.