Improving Spark SQL at LinkedIn
Fangshi Li
Staff Software Engineer
LinkedIn
Agenda
1. Automated column pruning for Dataset
2. 2d partitioned join
3. Adaptive Execution
4. Cost-based optimizer
Spark SQL adoption at LinkedIn
60% of the jobs running on our cluster are Spark jobs
Spark jobs:
⅔ Spark SQL
⅓ RDD
Spark SQL jobs:
⅔ DataFrame/SQL API
⅓ Dataset API
Goals
- Enable computations that could not be completed before
- Make every job run faster
Spark SQL roadmap at LinkedIn: 3-level optimization
- Operator-level: Dataset ser-de, joins
- Plan-level: Adaptive Execution, CBO
- Cluster-level: Multi-query optimization
Dataset performance
val ds: Dataset[TrackingEvent] = createDataset()
val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))
Datasets have performance issues due to:
1. Excessive conversion overhead
2. No column pruning for ORC/Parquet
Solutions
Apple:
Spark + AI 2019 talk: “Bridging the Gap Between Datasets and DataFrames”
Use a bytecode analyzer to convert the user lambda functions into SQL expressions
E.g., x.id > 0 ----> isLargerThan(col(“id”), Literal(0))
LinkedIn:
Use a bytecode analyzer to find out which columns are used in the user lambdas, and prune the columns that are not needed
val ds: Dataset[TrackingEvent] = createDataset()
val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))
Big performance boost for ORC/Parquet, since the pruned column set can be pushed down to the readers
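To make the effect concrete, the sketch below does by hand what an automated analysis would infer for the lambdas above: it selects only `id` and `key` before entering the typed API, so the Parquet reader can skip the remaining columns. The extra `payload` field, the temp path, and the local SparkSession are assumptions for illustration. This is not LinkedIn's bytecode analyzer, only the manual equivalent of its result.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical event type: only id and key are used downstream, payload is not.
case class TrackingEvent(id: Long, key: String, payload: String)

object ColumnPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-pruning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Write a tiny Parquet dataset to a fresh temp location (assumed path).
    val path = java.nio.file.Files.createTempDirectory("events").resolve("parquet").toString
    Seq(TrackingEvent(1L, "a", "x"), TrackingEvent(-1L, "b", "y")).toDS().write.parquet(path)

    // Typed lambdas over the full case class: the deserializer needs every field,
    // so the scan typically reads all three columns.
    val unpruned = spark.read.parquet(path).as[TrackingEvent]
      .filter(x => x.id > 0)
      .map(x => (x.id, x.key))

    // Manual pruning: keep only the columns the lambdas touch before going typed.
    val pruned = spark.read.parquet(path)
      .select("id", "key")
      .as[(Long, String)]
      .filter(x => x._1 > 0)

    // Compare the ReadSchema of the two physical plans.
    unpruned.explain()
    pruned.explain()

    // Both variants produce the same rows.
    assert(unpruned.collect().toSet == pruned.collect().toSet)
    spark.stop()
  }
}
```

Comparing the `ReadSchema` entries printed by the two `explain()` calls is a quick way to check whether a given query reads more columns than it uses.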
A recommendation use case at LinkedIn
1. Join pair features with viewer features
2. Join the intermediate result with entity features
3. Score each joined record with an ML model
4. Rank the top N entities for each viewer
Exploding intermediate data
Can we perform the 3-way join and the scoring in a single step, without exploding the intermediate data?
2d partitioned join
- Partition the left, right, and pair tables into M, N, and M*N partitions, respectively
- The left and pair tables are sorted within each partition
- For each partition of the pair table:
  - join with the left table using a sort-merge join
  - join with the right table using a shuffle-hash join
  - for each joined record, perform scoring right away and output the scorable
- Rank the scorables
Before: 10+ h
After: 1 h
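The 2d partitioned join steps can be sketched on in-memory collections. This is a single-process illustration, not the Spark operator: the record types, the `score` function standing in for the ML model, and the hash bucketing are all assumptions. It also assumes viewer ids are unique in the left table.

```scala
object TwoDPartitionedJoinSketch {
  // Hypothetical record shapes for the left, right, and pair tables.
  final case class Viewer(id: Int, feature: Double)
  final case class Entity(id: Int, feature: Double)
  final case class Pair(viewerId: Int, entityId: Int, feature: Double)
  final case class Scorable(viewerId: Int, entityId: Int, score: Double)

  // Stand-in for the ML model (assumption).
  def score(v: Viewer, e: Entity, p: Pair): Double = v.feature * e.feature + p.feature

  private def bucket(key: Int, parts: Int): Int = Math.floorMod(key, parts)

  def join(viewers: Seq[Viewer], entities: Seq[Entity], pairs: Seq[Pair],
           m: Int, n: Int, topN: Int): Map[Int, Seq[Scorable]] = {
    // Left table: M partitions, each sorted by viewer id.
    val leftParts = viewers.groupBy(v => bucket(v.id, m))
      .map { case (i, vs) => i -> vs.sortBy(_.id).toVector }
    // Right table: N partitions, each built into a hash map.
    val rightParts = entities.groupBy(e => bucket(e.id, n))
      .map { case (j, es) => j -> es.map(e => e.id -> e).toMap }
    // Pair table: M*N cells keyed by (left bucket, right bucket).
    val pairCells = pairs.groupBy(p => (bucket(p.viewerId, m), bucket(p.entityId, n)))

    val scorables = pairCells.toSeq.flatMap { case ((i, j), cell) =>
      val left = leftParts.getOrElse(i, Vector.empty[Viewer])
      val right = rightParts.getOrElse(j, Map.empty[Int, Entity])
      val out = Vector.newBuilder[Scorable]
      var li = 0
      // Merge join against the sorted left partition, hash lookup into the right one.
      for (p <- cell.sortBy(_.viewerId)) {
        while (li < left.length && left(li).id < p.viewerId) li += 1
        if (li < left.length && left(li).id == p.viewerId) {
          // Score right away, so the joined record is never materialized.
          right.get(p.entityId).foreach { e =>
            out += Scorable(p.viewerId, p.entityId, score(left(li), e, p))
          }
        }
      }
      out.result()
    }
    // Rank the scorables: top N per viewer by descending score.
    scorables.groupBy(_.viewerId)
      .map { case (v, ss) => v -> ss.sortWith(_.score > _.score).take(topN) }
  }
}
```

Because each pair-table cell only needs one left partition and one right partition, the wide three-way joined row exists only for the duration of the `score` call.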
Adaptive Execution (AE) at LinkedIn
Optimize the query plan while the job is running (SPARK-23128)
- Handle data skew in joins: works great!
- Convert shuffle-based joins to broadcast joins at runtime: needs a shuffle map stage before converting to a broadcast join
Should we use Adaptive Execution to optimize the join plan at runtime?
CBO (Cost-based optimizer)
The CBO in Spark can optimize the query plan based on the operators' cost (data size, number of records).
Benefits:
- Choose the best join strategy: broadcast vs shuffle-hash vs sort-merge
- Multi-join reordering
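To illustrate how size estimates can drive the join strategy choice, here is a deliberately simplified decision rule. It is a sketch, not Spark's planner: the function name, the per-partition hash limit, and the factor of 3 are assumptions. Only the 10 MB default mirrors Spark's `spark.sql.autoBroadcastJoinThreshold` default.

```scala
object JoinStrategySketch {
  sealed trait JoinStrategy
  case object Broadcast extends JoinStrategy
  case object ShuffleHash extends JoinStrategy
  case object SortMerge extends JoinStrategy

  /** Pick a strategy from the estimated input sizes in bytes (simplified rule). */
  def choose(leftBytes: Long, rightBytes: Long,
             broadcastThreshold: Long = 10L * 1024 * 1024,    // 10 MB
             perPartitionHashLimit: Long = 64L * 1024 * 1024, // hypothetical limit
             numPartitions: Int = 200): JoinStrategy = {
    val small = math.min(leftBytes, rightBytes)
    val large = math.max(leftBytes, rightBytes)
    if (small <= broadcastThreshold) {
      // The small side fits on every executor: no shuffle of the large side.
      Broadcast
    } else if (small / numPartitions <= perPartitionHashLimit && small * 3 <= large) {
      // Each shuffled partition of the small side fits in an in-memory hash table.
      ShuffleHash
    } else {
      // Fallback that does not need either side to fit in memory.
      SortMerge
    }
  }
}
```

The rule is only as good as its inputs: with a wrong size estimate the same code picks the wrong strategy, which is why the quality of the statistics matters so much for a CBO.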
CBO (Cost-based optimizer)
The native CBO in Spark has usability issues:
- Requires detailed stats (count, min, max, distinct, histograms) to be available for the input datasets.
- Requires scheduled jobs to compute stats on all datasets, which is very expensive.
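For reference, this is roughly what the pre-computation looks like with the native CBO in open-source Spark. The table name `events`, its columns, and the local session with a temp warehouse directory are assumptions for illustration. The configuration keys and `ANALYZE TABLE` statements are standard Spark SQL.

```scala
import org.apache.spark.sql.SparkSession

object NativeCboSetupSketch {
  def main(args: Array[String]): Unit = {
    // Temp warehouse location so repeated runs do not collide (assumption).
    val warehouse = java.nio.file.Files.createTempDirectory("warehouse").toString
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("native-cbo-sketch")
      .config("spark.sql.warehouse.dir", warehouse)
      .getOrCreate()

    // Turn on the cost-based optimizer, join reordering, and histograms.
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
    spark.conf.set("spark.sql.statistics.histogram.enabled", "true")

    // Hypothetical input table.
    spark.range(0, 1000).selectExpr("id", "id % 10 AS key")
      .write.mode("overwrite").saveAsTable("events")

    // Table-level stats (row count, size), then column-level stats.
    // Each statement scans the table, which is the cost the slide refers to.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS id, key")

    // Inspect the collected column statistics.
    spark.sql("DESCRIBE EXTENDED events key").show(truncate = false)
    spark.stop()
  }
}
```

The stats go stale whenever the data changes, so in practice these statements would have to be re-run on a schedule for every input dataset.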
CBO (Cost-based optimizer)
Can we learn the stats from history? YES!
Learning-based CBO
- Eliminate the CBO’s dependency on pre-computed stats by learning stats from job histories
- A general approach that can benefit all SQL engines
Learning-based CBO
Approach 1: Instance-based learning
Ref: “LEO: DB2’s Learning Optimizer”
Approach 2: Model-based learning
Ref: “SageDB: A Learned Database System”
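One way to picture the instance-based approach is a lookup table from a plan signature to the statistics observed in earlier runs. The sketch below is a minimal illustration under that assumption. The class and method names, the string signature, and the averaging over the last few runs are all hypothetical, not the design described in the talk or in LEO.

```scala
object HistoryStatsSketch {
  /** Statistics observed for one operator after a job has run. */
  final case class Observed(rowCount: Long, sizeInBytes: Long)

  /** Keeps the most recent observations per plan signature (hypothetical design). */
  final class StatsHistory(maxRuns: Int = 5) {
    private val runs = scala.collection.mutable.Map.empty[String, Vector[Observed]]

    /** Record what a finished job actually produced for this sub-plan. */
    def record(signature: String, obs: Observed): Unit = {
      val updated = (runs.getOrElse(signature, Vector.empty[Observed]) :+ obs).takeRight(maxRuns)
      runs.update(signature, updated)
    }

    /** Estimate stats for a new run from history, or None if the sub-plan is unseen. */
    def estimate(signature: String): Option[Observed] =
      runs.get(signature).map { rs =>
        Observed(rs.map(_.rowCount).sum / rs.size, rs.map(_.sizeInBytes).sum / rs.size)
      }
  }
}
```

A usage sketch: after each run, call `record("scan(events) -> filter(id > 0)", Observed(rows, bytes))`. Before planning the next run of a recurring job, call `estimate` with the same signature and fall back to the optimizer's default estimate when it returns `None`.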
Learning-based CBO vs no-CBO
Summary
1. Automated column pruning for Dataset
2. 2d partitioned join
3. Adaptive Execution
4. History-based CBO (Cost-based optimizer)
Thank you
