Improving Spark SQL at LinkedIn


The document discusses enhancements to Spark SQL at LinkedIn: automated column pruning for Datasets, two-dimensional partitioned joins, adaptive execution, and cost-based optimization (CBO). It outlines current challenges, such as excessive conversion overhead in the Dataset API and the expensive statistics collection that Spark's native CBO requires, and presents a learning-based CBO that learns statistics from job histories. It also lays out LinkedIn's three-level roadmap for Spark SQL optimization.


Agenda
1. Automated column pruning for Dataset
2. 2d partitioned join
3. Adaptive Execution
4. Cost-based optimizer

Spark SQL adoption at LinkedIn
60% of the jobs running on our cluster are Spark jobs
Spark jobs:
- ⅔ Spark SQL
- ⅓ RDD
Spark SQL jobs:
- ⅔ DataFrame/SQL API
- ⅓ Dataset API

Goals
- Enable computations that could not be completed before
- Make every job run faster

Spark SQL roadmap at LinkedIn: 3-level optimization
- Operator-level: Dataset ser-de, joins
- Plan-level: Adaptive Execution, CBO
- Cluster-level: Multi-query optimization


Dataset performance
val ds: Dataset[TrackingEvent] = createDataset()
val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))
The Dataset API has performance issues due to:
1. Excessive conversion overhead
2. No column pruning for ORC/Parquet

Solutions
Apple:
- Spark + AI Summit 2019 talk: "Bridging the Gap Between Datasets and DataFrames"
- Uses a bytecode analyzer to convert the user's lambda functions into SQL expressions
- E.g., x.id > 0 ----> isLargerThan(col("id"), Literal(0))
LinkedIn:
- Uses a bytecode analyzer to find out which columns are used in the user's lambdas, and prunes the columns that are not needed
val ds: Dataset[TrackingEvent] = createDataset()
val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))
- Big performance boost for ORC/Parquet, since the pruned column list can be pushed down to the readers
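As a rough illustration of the pruning idea, the sketch below works out which columns a pair of user lambdas touch and derives a pruned read schema from that. It does not analyze bytecode: as a stand-in, it runs each lambda once against a recording row and notes which fields are read, so it only sees the executed code path. `RecordingRow`, `usedColumns`, `prunedSchema`, and the extra columns `payload` and `timestamp` are hypothetical names for illustration, not LinkedIn's implementation.

```scala
import scala.collection.mutable

object ColumnPruningSketch {
  // Hypothetical row wrapper that records every column name that is read.
  final class RecordingRow(values: Map[String, Any]) {
    val accessed: mutable.LinkedHashSet[String] = mutable.LinkedHashSet.empty[String]
    def get[T](column: String): T = {
      accessed += column
      values(column).asInstanceOf[T]
    }
  }

  // Run each lambda once on a sample row and collect the columns it touched.
  // A real bytecode analyzer would cover all branches; this only sees the executed path.
  def usedColumns(sample: Map[String, Any], lambdas: Seq[RecordingRow => Any]): Set[String] = {
    val row = new RecordingRow(sample)
    lambdas.foreach(f => f(row))
    row.accessed.toSet
  }

  // Keep only the schema columns the lambdas need, preserving schema order.
  def prunedSchema(schema: Seq[String], used: Set[String]): Seq[String] =
    schema.filter(used.contains)

  def main(args: Array[String]): Unit = {
    val schema = Seq("id", "key", "payload", "timestamp")
    val sample = Map[String, Any]("id" -> 1L, "key" -> "k", "payload" -> "blob", "timestamp" -> 0L)
    // Counterparts of ds.filter(x => x.id > 0).map(x => (x.id, x.key)).
    val filterFn: RecordingRow => Any = r => r.get[Long]("id") > 0
    val mapFn: RecordingRow => Any = r => (r.get[Long]("id"), r.get[String]("key"))
    println(prunedSchema(schema, usedColumns(sample, Seq(filterFn, mapFn))))
  }
}
```

With a columnar format, only the columns in the pruned schema would need to be requested from the reader.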

A recommendation use case at LinkedIn
1. Pair features are joined with viewer features
2. The intermediate result is joined with entity features
3. Each joined record is scored with an ML model
4. The top N entities are ranked for each viewer
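The four steps above can be sketched with plain Scala collections to show where the intermediate results come from. The record shapes, the single numeric feature per viewer and entity, and the `score` function are all assumptions made for illustration; the real features and model are not described in the slides.

```scala
object NaiveRecommendationPipeline {
  // Hypothetical record shapes; field names are illustrative only.
  final case class PairFeature(viewerId: Int, entityId: Int, affinity: Double)
  final case class Scored(viewerId: Int, entityId: Int, score: Double)

  // Placeholder for the ML model.
  def score(pair: Double, viewer: Double, entity: Double): Double = pair + viewer * entity

  def topN(
      pairs: Seq[PairFeature],
      viewerFeature: Map[Int, Double],
      entityFeature: Map[Int, Double],
      n: Int): Map[Int, Seq[Scored]] = {
    // Step 1: pair features join viewer features (first materialized intermediate result).
    val step1 = pairs.flatMap(p => viewerFeature.get(p.viewerId).map(v => (p, v)))
    // Step 2: the intermediate result joins entity features (a second, wider intermediate result).
    val step2 = step1.flatMap { case (p, v) => entityFeature.get(p.entityId).map(e => (p, v, e)) }
    // Step 3: score each joined record.
    val scored = step2.map { case (p, v, e) => Scored(p.viewerId, p.entityId, score(p.affinity, v, e)) }
    // Step 4: rank the top N entities for each viewer.
    scored.groupBy(_.viewerId).map { case (viewer, rows) =>
      viewer -> rows.sortBy(-_.score).take(n)
    }
  }
}
```

Here `step1` and `step2` each carry every pair row plus the joined features, which is the kind of intermediate data that grows quickly at scale.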

Exploding intermediate data
Can we perform the 3-way join and the scoring in a single step, without exploding the intermediate data?

2d partitioned join
- Partition the left, right, and pair tables into M, N, and M*N partitions, respectively
- The left and pair tables are sorted within each partition
- For each partition in the pair table:
  - join the left table with a sort-merge join
  - join the right table with a shuffle-hash join
  - for each joined record, perform scoring right away and output the scorable
- Rank the scorables
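A minimal single-process sketch of the steps above, using plain Scala collections in place of distributed partitions. It assumes that the left table holds viewer features and the right table holds entity features, that keys are unique on the left side, and that partitioning is a simple modulo on the id; the `score` function is a placeholder. None of these details are given in the slides.

```scala
object TwoDPartitionedJoinSketch {
  final case class Pair(viewerId: Int, entityId: Int, affinity: Double)
  final case class Scored(viewerId: Int, entityId: Int, score: Double)

  // Placeholder for the ML model.
  def score(pair: Double, viewer: Double, entity: Double): Double = pair + viewer * entity

  def part(key: Int, numPartitions: Int): Int = Math.floorMod(key, numPartitions)

  def topN(pairs: Seq[Pair], left: Seq[(Int, Double)], right: Seq[(Int, Double)],
           m: Int, n: Int, k: Int): Map[Int, Seq[Scored]] = {
    // Left table: M partitions, each sorted by key. Right table: N partitions, each a hash map.
    val leftParts = left.groupBy(r => part(r._1, m))
      .map { case (i, rows) => i -> rows.sortBy(_._1).toVector }
    val rightParts = right.groupBy(r => part(r._1, n))
      .map { case (j, rows) => j -> rows.toMap }
    // Pair table: M*N cells, each sorted by the left key.
    val pairParts = pairs.groupBy(p => (part(p.viewerId, m), part(p.entityId, n)))
      .map { case (cell, rows) => cell -> rows.sortBy(_.viewerId).toVector }

    val scored = pairParts.toSeq.flatMap { case ((i, j), cell) =>
      val l = leftParts.getOrElse(i, Vector.empty)
      val r = rightParts.getOrElse(j, Map.empty[Int, Double])
      val out = Vector.newBuilder[Scored]
      var li = 0
      for (p <- cell) {
        // Sort-merge step: advance the left cursor until it reaches the pair's key.
        while (li < l.length && l(li)._1 < p.viewerId) li += 1
        if (li < l.length && l(li)._1 == p.viewerId) {
          // Hash lookup on the right side, then score right away.
          r.get(p.entityId).foreach { e =>
            out += Scored(p.viewerId, p.entityId, score(p.affinity, l(li)._2, e))
          }
        }
      }
      out.result()
    }
    // Rank the scorables per viewer.
    scored.groupBy(_.viewerId).map { case (v, rows) => v -> rows.sortBy(-_.score).take(k) }
  }
}
```

Only the compact `Scored` records leave each cell, so the wide joined rows never have to be materialized across the whole dataset.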


Adaptive Execution (AE) at LinkedIn
Optimize the query plan while the job is running (SPARK-23128)
- Handle data skew in joins
  - Works great!
- Convert shuffle-based joins to broadcast joins at runtime
  - Needs a shuffle map stage before converting to a broadcast join
Should we use Adaptive Execution to optimize the join plan at runtime?
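The two runtime decisions above can be sketched as simple functions over the sizes observed once the shuffle map stage has finished. The skew rule is loosely modeled on the median-based check used by adaptive query execution in upstream Spark; the function names and all thresholds here are assumptions, not LinkedIn's settings.

```scala
object AdaptiveJoinSketch {
  sealed trait JoinPlan
  case object BroadcastJoin extends JoinPlan
  case object ShuffleJoin extends JoinPlan

  // After the shuffle map stage, the actual output sizes are known.
  // Switch to a broadcast join if the smaller side fits under the threshold.
  def replan(leftBytes: Long, rightBytes: Long, broadcastThresholdBytes: Long): JoinPlan =
    if (math.min(leftBytes, rightBytes) <= broadcastThresholdBytes) BroadcastJoin else ShuffleJoin

  // Flag partitions that are larger than factor * median and larger than an absolute floor.
  def skewedPartitions(sizes: Seq[Long], factor: Double, minBytes: Long): Seq[Int] =
    if (sizes.isEmpty) Seq.empty
    else {
      val sorted = sizes.sorted
      val median = sorted(sorted.length / 2)
      sizes.zipWithIndex.collect {
        case (size, idx) if size > median * factor && size > minBytes => idx
      }
    }
}
```

The sketch makes the trade-off on the slide visible: `replan` needs real sizes as input, so the shuffle map stage has already been paid for by the time the join is converted.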


CBO (Cost-based optimizer)
CBO in Spark can optimize the query plan based on the operators' cost (data size, number of records).
Benefits:
- Choose the best join strategy: broadcast vs shuffle-hash vs sort-merge
- Multi-join reordering
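To make the first benefit concrete, here is a simplified sketch of picking a join strategy from size estimates. The heuristic (broadcast when the smaller side is tiny, shuffle-hash when the smaller side fits a per-partition budget and is at least three times smaller than the other side, sort-merge otherwise) is an assumption for illustration and is not Spark's exact planner logic; all thresholds are hypothetical.

```scala
object JoinStrategySketch {
  sealed trait Strategy
  case object Broadcast extends Strategy
  case object ShuffleHash extends Strategy
  case object SortMerge extends Strategy

  // Estimated statistics for one join input; this sketch only uses the size.
  final case class Stats(sizeInBytes: Long, rowCount: Long)

  def choose(left: Stats, right: Stats, broadcastThreshold: Long,
             perPartitionHashBudget: Long, numPartitions: Int): Strategy = {
    val small = math.min(left.sizeInBytes, right.sizeInBytes)
    val large = math.max(left.sizeInBytes, right.sizeInBytes)
    if (small <= broadcastThreshold) Broadcast
    // Build a hash table per partition only if it fits and the sides are lopsided.
    else if (small / numPartitions <= perPartitionHashBudget && small * 3 <= large) ShuffleHash
    else SortMerge
  }
}
```

Every branch depends on the size estimates being roughly right, which is why the quality of the statistics matters so much to a CBO.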

CBO (Cost-based optimizer)
The native CBO in Spark has usability issues:
- Requires detailed stats (count, min, max, distinct, histograms) to be available for the input datasets.
- Requires scheduled jobs to compute stats on all datasets, which is very expensive.
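For reference, the settings and statement involved in upstream Apache Spark look roughly like the sketch below. The configuration keys and the `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` syntax come from upstream Spark; the table name `tracking_event` in the usage note and the helper `analyzeStatement` are hypothetical, and LinkedIn's internal setup may differ.

```scala
object CboSettingsSketch {
  // Upstream Spark SQL configuration keys related to the native CBO.
  val cboConf: Map[String, String] = Map(
    "spark.sql.cbo.enabled" -> "true",
    "spark.sql.cbo.joinReorder.enabled" -> "true",
    "spark.sql.statistics.histogram.enabled" -> "true"
  )

  // Build the statement that pre-computes column-level statistics for one table.
  def analyzeStatement(table: String, columns: Seq[String]): String =
    s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS ${columns.mkString(", ")}"
}
```

For example, `analyzeStatement("tracking_event", Seq("id", "key"))` builds one such statement; a statement like this has to be re-run for every dataset whenever its data changes, which is the scheduling cost the slide refers to.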

CBO (Cost-based optimizer)
Can we learn the stats from history? YES!

Learning-based CBO
- Eliminate the CBO's dependency on pre-computed stats by learning stats from job histories
- A general approach that can benefit all SQL engines
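One way to picture "learning stats from job histories" is a store that remembers the statistics observed for a sub-plan in past runs and returns them the next time the same sub-plan is planned. The sketch below is a hypothetical illustration: `HistoryStore`, `planSignature`, and the string-based signature are assumptions, not the design described in the talk.

```scala
import scala.collection.mutable

object HistoryStatsSketch {
  final case class Observed(rowCount: Long, sizeInBytes: Long)

  final class HistoryStore {
    private val byPlan = mutable.Map.empty[String, Observed]
    // Called after a job finishes, with the runtime metrics of one sub-plan.
    def record(sig: String, stats: Observed): Unit = byPlan.update(sig, stats)
    // Called at planning time; None means fall back to the default estimate.
    def lookup(sig: String): Option[Observed] = byPlan.get(sig)
  }

  // Hypothetical signature: a case- and whitespace-normalized description of the sub-plan.
  def planSignature(subPlan: String): String =
    subPlan.trim.toLowerCase.replaceAll("\\s+", " ")
}
```

Because the store is keyed by plan signature rather than by table, it needs no scheduled stats jobs: recurring jobs populate it as a side effect of running.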

Learning-based CBO
Approach 1: Instance-based learning
Ref: "LEO: DB2's Learning Optimizer"
Approach 2: Model-based learning
Ref: "SageDB: A Learned Database System"
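As a toy illustration of the model-based idea, the sketch below fits a straight line that predicts an operator's output row count from its input row count using past runs. Ordinary least squares is a deliberately simple stand-in chosen for this example; it is not the model used in SageDB or in the talk, and the names are hypothetical.

```scala
object ModelBasedStatsSketch {
  final case class LinearModel(slope: Double, intercept: Double) {
    def predict(inputRows: Double): Double = slope * inputRows + intercept
  }

  // Ordinary least squares over (inputRows, outputRows) pairs from past runs.
  def fit(history: Seq[(Double, Double)]): LinearModel = {
    require(history.size >= 2, "need at least two observations")
    val n = history.size.toDouble
    val meanX = history.map(_._1).sum / n
    val meanY = history.map(_._2).sum / n
    val sxx = history.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val sxy = history.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    require(sxx > 0, "input sizes must not all be equal")
    val slope = sxy / sxx
    LinearModel(slope, meanY - slope * meanX)
  }
}
```

Unlike an instance-based lookup, a fitted model can produce an estimate for an input size it has never seen, at the price of being wrong when the relationship is not close to the assumed shape.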

Learning-based CBO vs no-CBO
Approach 1: Instance-based learning
Ref: "LEO: DB2's Learning Optimizer"
Approach 2: Model-based learning
Ref: "SageDB: A Learned Database System"

Summary
1. Automated column pruning for Dataset
2. 2d partitioned join
3. Adaptive Execution
4. History-based CBO (Cost-based optimizer)

Fangshi Li, Staff Software Engineer, LinkedIn

2d partitioned join result (slide 13): before 10+ h, after 1 h
