On Improving Broadcast Joins in Spark SQL
Jianneng Li
Software Engineer, Workday
Safe Harbor Statement
This presentation may contain forward-looking statements for which there are risks, uncertainties, and assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions could differ materially from results implied by the forward-looking statements. Forward-looking statements include any statements regarding strategies or plans for future operations; any statements concerning new features, enhancements or upgrades to our existing applications or plans for future applications; and any statements of belief. Further information on risks that could affect Workday’s results is included in our filings with the Securities and Exchange Commission which are available on the Workday investor relations webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap, blog, our website, press release or public statement that are not currently available are subject to change at Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services, features, and functions that are currently available.
Agenda
▪ Apache Spark in Workday Prism Analytics
▪ Broadcast Joins in Spark
▪ Improving Broadcast Joins
▪ Production Case Study
Spark in Workday Prism Analytics
Spark in Prism Analytics
▪ Customers use our self-service product to build data transformation pipelines, which are compiled to DataFrames and executed by Spark
▪ Finance and HR use cases
▪ This talk focuses on our HR use cases - more on complex plans than big data
[Figure: Example Spark physical plan of our pipeline shown in Spark UI]
For more details, see session from SAIS 2019 - Lessons Learned using Apache Spark for Self-Service Data Prep in SaaS World
Broadcast Joins in Spark
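Broadcast Join Review
[Figure: broadcast join across two nodes. Before the broadcast, Node 1 holds rows (A, 1), (B, 2), (C, 3) of the bigger table and row (DD, 4) of the smaller table; Node 2 holds rows (D, 4), (E, 5), (F, 6) of the bigger table and row (AA, 1) of the smaller table. After the broadcast, each node keeps its own rows of the bigger table and holds both (AA, 1) and (DD, 4), then joins locally.]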
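The broadcast join pictured above can be mimicked with a small, single-process Python sketch. It only illustrates the idea and is not Spark's implementation: the partition dictionaries, the `broadcast_hash_join` helper, and the layout of the rows (taken from the figure) are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical layout taken from the figure: each node holds one partition of
# the bigger table and one partition of the smaller table, as (value, key) rows.
big_partitions = {
    "node1": [("A", 1), ("B", 2), ("C", 3)],
    "node2": [("D", 4), ("E", 5), ("F", 6)],
}
small_partitions = {
    "node1": [("DD", 4)],
    "node2": [("AA", 1)],
}


def broadcast_hash_join(big_partitions, small_partitions):
    """Join each partition of the big table against a full copy of the small table."""
    # Step 1: gather every row of the smaller table (the "broadcast").
    broadcasted = [row for part in small_partitions.values() for row in part]

    # Step 2: build a hash table keyed on the join key. This sketch builds it
    # once and shares it; in a cluster every node would hold its own copy.
    table = defaultdict(list)
    for value, key in broadcasted:
        table[key].append(value)

    # Step 3: each node probes the hash table with its local rows only,
    # so the bigger table is never moved between nodes.
    result = {}
    for node, rows in big_partitions.items():
        result[node] = [
            (big_value, small_value, key)
            for big_value, key in rows
            for small_value in table.get(key, [])
        ]
    return result


joined = broadcast_hash_join(big_partitions, small_partitions)
print(joined)
```

Only the rows with keys 1 and 4 find a match, and each match is produced on the node that already held the row of the bigger table.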
Broadcast Join vs. Shuffle Join
Broadcast Join | Shuffle Join
Avoids shuffling the bigger side | Shuffles both sides
Naturally handles data skew | Can suffer from data skew
Cheap for selective joins | Can produce unnecessary intermediate results
Broadcasted data needs to fit in memory | Data can be spilled and read from disk
Cannot be used for certain outer joins | Can be used for all joins
Where applicable, broadcast join should be faster than shuffle join
Broadcasting in Spark
▪ Spark's broadcasting mechanism is inefficient
▪ Broadcasted data goes through the driver
▪ Too much broadcasted data can run the driver out of memory
[Figure: Driver, Executor 1, Executor 2. (1) Executors send broadcasted data to the driver; (2) the driver sends broadcasted data to the executors]
Broadcast Joins in Spark
▪ Uses broadcasting mechanism to collect data to driver
▪ Planned per-join using size estimation and config spark.sql.autoBroadcastJoinThreshold
▪ BroadcastHashJoin (BHJ)
  ▪ Driver builds in-memory hashtable to distribute to executors
▪ BroadcastNestedLoopJoin (BNLJ)
  ▪ Distributes data as array to executors
  ▪ Useful for non-equi joins
  ▪ Disabled in Prism for stability reasons
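As a rough illustration of per-join planning, the sketch below picks a join strategy from an estimated size and a threshold. It is a simplification written for this transcript, not Spark's planner: `choose_join`, its parameters, and the returned labels are hypothetical, and the 10 MB default is the value quoted in the production case study later in the talk.

```python
# Default threshold quoted in the talk: 10 MB, expressed in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(estimated_small_side_bytes, is_equi_join,
                threshold_bytes=DEFAULT_THRESHOLD_BYTES, allow_bnlj=True):
    """Hypothetical, simplified per-join decision based on a size estimate."""
    # The smaller side is a broadcast candidate only if its estimate fits the threshold.
    fits = 0 <= estimated_small_side_bytes <= threshold_bytes
    if is_equi_join:
        # Equi joins: hash join on the broadcast side, otherwise sort merge join.
        return "BHJ" if fits else "SMJ"
    if fits and allow_bnlj:
        # Non-equi joins can still be broadcast, as an array instead of a hashtable.
        return "BNLJ"
    # Anything else is outside the scope of this sketch.
    return "other"


MB = 1024 * 1024
print(choose_join(50 * MB, is_equi_join=True))
print(choose_join(50 * MB, is_equi_join=True, threshold_bytes=100 * MB))
print(choose_join(1 * MB, is_equi_join=False, allow_bnlj=False))
```

The `allow_bnlj` flag is included only to mirror the remark that BNLJ is disabled in Prism; how that is actually enforced is not described in the talk.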
Improving Broadcast Joins
Goal: More broadcast joins
▪ Q: Is broadcast join faster as long as broadcasted data fits in memory?
▪ A: It depends
▪ Experiment: increase broadcast threshold, and see what breaks
▪ Spoiler: many things go wrong before driver runs out of memory
Experiment: Single Join
Experiment setup
▪ TPC-H Dataset, 10GB
▪ Query: 60M table (lineitem) joining 15M table (orders) on key
▪ Driver: 1 core, 12 GB memory
▪ Executor: 1 instance, 18 cores, 102 GB memory
Single join results
SELECT lineitem.*
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
Why is BHJ slower?
▪ Driver collects 15M rows
▪ Driver builds hashtable
▪ Driver sends hashtable to executor
▪ Executor deserializes hashtable
Can we reduce BHJ overhead?
▪ Yes - executor side broadcast
Executor Side Broadcast
▪ Based on prototype from SPARK-17556
▪ Data is broadcasted between executors directly
[Figure: Driver, Executor 1, Executor 2. Executors send broadcasted data to each other; the driver keeps track of the executors' data blocks]
Executor BHJ vs. Driver BHJ
Pros | Cons
Driver has less memory pressure | Each executor builds its own hashtable
Less data shuffled across network | More difficult to know size of broadcast
Pros of executor BHJ outweigh cons
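To make the "less data shuffled across network" point concrete, here is a back-of-the-envelope model of how many bytes cross the network in each scheme. It is a deliberate simplification and not a description of Spark internals: it assumes the broadcast table is spread evenly over the executors, ignores serialization and hashtable overhead, and the function names are made up for this example.

```python
def driver_side_bytes(table_bytes, num_executors):
    """Executors send their parts to the driver, then every executor receives a full copy."""
    collect = table_bytes                     # all partitions travel to the driver
    distribute = table_bytes * num_executors  # each executor receives the whole table
    return collect + distribute


def executor_side_bytes(table_bytes, num_executors):
    """Executors exchange parts directly; each already holds 1/num_executors of the table."""
    missing_per_executor = table_bytes - table_bytes / num_executors
    return missing_per_executor * num_executors


# Compare both schemes for a table of 100 (arbitrary units) and a few cluster sizes.
for executors in (1, 2, 8):
    print(executors,
          driver_side_bytes(100, executors),
          executor_side_bytes(100, executors))
```

Under these assumptions the executor-side scheme always moves less data, and with a single executor (as in the experiment setup) it moves none at all.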
New results
SELECT lineitem.*
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
Why is BHJ still slower?
▪ Let's compare the cost models of the joins
SMJ Cost
▪ Assume n cores, tables A and B, where A > B
1. Read A/n, Sort, Write A/n
2. Read B/n, Sort, Write B/n
3. Read A/n, Read B/n, Join
▪ Considering only I/O costs: 3 A/n + 3 B/n
BHJ Cost
▪ Assume n cores, tables A and B, where A > B
1. Read B/n, Build hashtable, Write B
2. Read A/n, Read B, Join
▪ Considering only I/O costs: A/n + B/n + 2B
Comparing SMJ and BHJ costs
▪ SMJ: 3 A/n + 3 B/n
▪ BHJ: A/n + B/n + 2B
▪ SMJ - BHJ: (2 A/n + 2 B/n) - (2B)
▪ SMJ vs. BHJ: (A + B)/n vs. B
▪ Analysis
  ▪ More cores, better performance from SMJ
  ▪ Larger A, better performance from BHJ
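The two I/O cost formulas can be evaluated directly. The sketch below is only a worked example of the model on the slides; the function names are made up, and using the row counts of the experiment (60M for lineitem, 15M for orders) as a stand-in for table size is an assumption.

```python
def smj_cost(a, b, n):
    """I/O cost of sort merge join: 3 A/n + 3 B/n."""
    return 3 * a / n + 3 * b / n


def bhj_cost(a, b, n):
    """I/O cost of broadcast hash join: A/n + B/n + 2B (the 2B term does not shrink with n)."""
    return a / n + b / n + 2 * b


def bhj_is_cheaper(a, b, n):
    """BHJ wins when SMJ - BHJ > 0, which simplifies to (A + B)/n > B."""
    return smj_cost(a, b, n) > bhj_cost(a, b, n)


# Row counts from the experiment, used here as a proxy for table size (assumption).
A, B = 60_000_000, 15_000_000
for cores in (1, 2, 4, 5, 8, 18):
    print(cores, smj_cost(A, B, cores), bhj_cost(A, B, cores),
          bhj_is_cheaper(A, B, cores))
```

With these inputs the break-even point is n = (A + B)/B = 5 cores. At 18 cores, as in the experiment setup, the model favours SMJ, which is in line with BHJ still being slower in the new results.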
Varying cores - SMJ better with more cores
SELECT lineitem.*
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
Varying size of A - BHJ better with larger difference
SELECT lineitem.*
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
Increasing size of B - driver BHJ fails, executor BHJ best
SELECT lineitem.*
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
Other broadcast join improvements
▪ Increase Xms and MetaspaceSize to reduce GC
▪ Fetch all broadcast variables concurrently
▪ Other memory improvements in planning and whole-stage codegen
▪ Planning to contribute code changes back to open source
Production Case Study
Join types in HR customer pipelines
▪ 98% of our joins are inner joins or left outer joins
Broadcast estimates in HR customer pipelines
▪ If we can increase broadcast threshold from default 10 MB to 100 MB, then 80% of our joins can be broadcasted
HR use case pipeline
▪ 30 tables
  ▪ 29 tables 10K rows
  ▪ 1 table 3M rows
▪ ~160 joins
▪ Using 18 executor cores
▪ Can broadcast joins make the pipeline run faster?
Varying broadcast thresholds (0 MB, 10MB, 1GB)
What if we increase the 3M table?
▪ Will it bring similar performance improvements as single join?
30M rows for the big table
Why are more broadcast joins slower?
▪ Self joins and left outer joins
▪ At the highest threshold, the biggest table gets broadcasted
  ▪ Introduces broadcast overhead
  ▪ Reduces join parallelism
  ▪ Takes up storage memory
Closing Thoughts
Broadcast joins are better… with caveats
▪ Executor side broadcast is better than driver side broadcast
▪ When evaluating whether broadcast is better, consider:
  ▪ Number of cores available
  ▪ Relative size difference between bigger and smaller tables
  ▪ Relative size of broadcast tables and available memory
  ▪ Presence of self joins and outer joins
Future improvements in broadcast joins
▪ Adaptive Query Execution in Spark 3.0
▪ Building hashtables in BHJ with multiple cores
▪ Smaller footprint for BHJ hashtables
▪ Skew handling in sort merge join using broadcast
Thank you
Questions?