Apache Spark’s
Performance
Project Tungsten and Beyond
Sameer Agarwal
Spark Summit | Brussels | Oct 26th 2016
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
Hardware Trends
           2010             2016              Improvement
Storage    50+MB/s (HDD)    500+MB/s (SSD)    10X
Network    1Gbps            10Gbps            10X
CPU        ~3GHz            ~3GHz             (no change)
On the flip side
Spark IO has been optimized
• Reduce IO by pruning input data that is not needed
• New shuffle and network implementations (2014 sort record)
Data formats have improved
• E.g. Parquet is a “dense” columnar format
CPU increasingly the bottleneck; trend expected to continue
Goals of Project Tungsten
Substantially improve the memory and CPU efficiency of Spark
backend execution and push performance closer to the limits of
modern hardware.
Note the focus on “execution”, not the “optimizer”: it is relatively easy to pick a broadcast hash join that is 1000X
faster than a Cartesian join, but hard to make the broadcast join itself an order of magnitude faster.
Phase 1
Foundation
Memory Management
Code Generation
Cache-aware Algorithms
Phase 2
Order-of-magnitude Faster
Whole-stage Codegen
Vectorization
Phase 1
Laying The Foundation
Summary
Perform explicit memory management instead of relying on Java objects
• Reduce memory footprint
• Eliminate garbage collection overheads
• Use sun.misc.Unsafe rows and off-heap memory
Code generation for expression evaluation
• Reduce virtual function calls and interpretation overhead
Cache conscious sorting
• Reduce bad memory access patterns
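To make the explicit memory management point concrete, here is a minimal sketch of fixed-width rows stored outside the Java heap. It is not Spark's implementation: it uses java.nio.ByteBuffer.allocateDirect as a safe stand-in for sun.misc.Unsafe, and the OffHeapRows class and its (long id, double score) layout are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-width row store: each row is (long id, double score) = 16 bytes.
// All rows live in one off-heap buffer, so the garbage collector tracks a single
// buffer object instead of one Java object per row.
final class OffHeapRows {
    private static final int ROW_BYTES = Long.BYTES + Double.BYTES;
    private final ByteBuffer buf;
    private final int capacity;

    OffHeapRows(int capacity) {
        this.capacity = capacity;
        // allocateDirect places the backing memory outside the Java heap.
        this.buf = ByteBuffer.allocateDirect(capacity * ROW_BYTES)
                             .order(ByteOrder.nativeOrder());
    }

    // Write both fields of a row at its computed byte offset.
    void set(int row, long id, double score) {
        int base = offset(row);
        buf.putLong(base, id);
        buf.putDouble(base + Long.BYTES, score);
    }

    long id(int row) {
        return buf.getLong(offset(row));
    }

    double score(int row) {
        return buf.getDouble(offset(row) + Long.BYTES);
    }

    // Rows are addressed by arithmetic on the row index, not by object references.
    private int offset(int row) {
        if (row < 0 || row >= capacity) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return row * ROW_BYTES;
    }
}
```

Each row costs exactly 16 bytes here, with no per-object header and no pointer chasing, which is the kind of footprint reduction the summary refers to.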
Phase 2
Order-of-magnitude Faster
Going back to the fundamentals
Difficult to get order-of-magnitude performance speedups with
profiling techniques
• For a 10x improvement, would need to find top hotspots that add up to
90% of runtime and make them instantaneous
• For 100x, 99%
Instead, look bottom up: how fast should it run?
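The 90% and 99% figures follow from simple arithmetic, worked out below. The symbols S (overall speedup) and p (fraction of total runtime spent in the hotspots) are introduced here for illustration; if the hotspots become instantaneous, only the remaining fraction 1 - p of the runtime is left.

```latex
\[
S = \frac{1}{1 - p},
\qquad p = 0.90 \;\Rightarrow\; S = \frac{1}{0.10} = 10,
\qquad p = 0.99 \;\Rightarrow\; S = \frac{1}{0.01} = 100
\]
```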
Scan -> Filter -> Project -> Aggregate
select count(*) from store_sales
where ss_item_sk = 1000
G. Graefe, Volcano - An Extensible and Parallel Query Evaluation System,
In IEEE Transactions on Knowledge and Data Engineering, 1994
Volcano Iterator Model
Standard for 30 years: almost
all databases do it
Each operator is an “iterator”
that consumes records
from its input operator
class Filter(
    child: Operator,
    predicate: (Row => Boolean))
  extends Operator {
  def next(): Row = {
    var current = child.next()
    // Skip rows that fail the predicate; stop at the end of input (null).
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}
What if we hire a college freshman to implement this query
in Java in 10 mins?
select count(*) from store_sales
where ss_item_sk = 1000
// Assumes store_sales is a long[] holding the ss_item_sk column.
long count = 0;
for (long ss_item_sk : store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
Volcano model: 30+ years of database research
vs
college freshman: hand-written code in 10 mins
Throughput
Volcano: 13.95 million rows/sec
College freshman: 125 million rows/sec
Note: End-to-end, single thread, single column, and data originated in Parquet on disk
How does a student beat 30 years of research?
Volcano Model
1. Too many virtual function calls
2. Intermediate data in memory (or L1/L2/L3 cache)
3. Can’t take advantage of modern CPU features: no loop unrolling, SIMD, pipelining, prefetching, branch prediction, etc.
Hand-written code
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
Whole-stage Codegen
Fusing operators together so the generated code looks like
hand-optimized code:
- Identify chains of operators (“stages”)
- Compile each stage into a single function
- Functionality of a general-purpose execution engine;
performance as if a system had been hand-built just to run your query
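The contrast between an interpreted operator chain and a fused stage can be sketched in plain Java. This is an illustration only, not the code Spark generates: the LongOperator interface, the ScanOp and FilterOp classes, the CountQuery helper, and the Long.MIN_VALUE end-of-input sentinel are all hypothetical and assume a single long column.

```java
import java.util.function.LongPredicate;

// Hypothetical operator interface in the Volcano style: one virtual call per row.
interface LongOperator {
    long NO_MORE_ROWS = Long.MIN_VALUE; // sentinel, assumed never to appear in the data
    long next();
}

// Scan: hands out the column values one at a time.
final class ScanOp implements LongOperator {
    private final long[] column;
    private int pos = 0;

    ScanOp(long[] column) {
        this.column = column;
    }

    public long next() {
        return pos < column.length ? column[pos++] : NO_MORE_ROWS;
    }
}

// Filter: pulls from its child until a row passes the predicate or input ends.
final class FilterOp implements LongOperator {
    private final LongOperator child;
    private final LongPredicate predicate;

    FilterOp(LongOperator child, LongPredicate predicate) {
        this.child = child;
        this.predicate = predicate;
    }

    public long next() {
        long current = child.next();
        while (current != NO_MORE_ROWS && !predicate.test(current)) {
            current = child.next();
        }
        return current;
    }
}

final class CountQuery {
    // Interpreted plan: the aggregate loop pulls from Filter, which pulls from Scan.
    static long volcano(long[] ssItemSk) {
        LongOperator plan = new FilterOp(new ScanOp(ssItemSk), v -> v == 1000);
        long count = 0;
        while (plan.next() != LongOperator.NO_MORE_ROWS) {
            count++;
        }
        return count;
    }

    // Fused stage: the same query as one loop, with no operator boundaries.
    static long fused(long[] ssItemSk) {
        long count = 0;
        for (long v : ssItemSk) {
            if (v == 1000) {
                count++;
            }
        }
        return count;
    }
}
```

Both methods return the same count; the fused version simply gives the JIT compiler a single tight loop to work with, which is the effect whole-stage code generation aims for.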
Whole-stage Codegen: Planner
Whole-stage Codegen: Spark as a “Compiler”
Scan -> Filter -> Project -> Aggregate
// Assumes store_sales is a long[] holding the ss_item_sk column.
long count = 0;
for (long ss_item_sk : store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, In VLDB 2011
But there are things we can’t fuse
Complicated I/O
• CSV, Parquet, ORC, …
• Sending across the network
External integrations
• Python, R, scikit-learn, TensorFlow, etc
Columnar in memory format
In-memory Row Format
1 john 4.1
2 mike 3.5
3 sally 6.4
In-memory Column Format
1 2 3
john mike sally
4.1 3.5 6.4
Why columnar?
1. More efficient: denser storage, regular data access, easier to
index into
2. More compatible: Most high-performance external systems
are already columnar (numpy, TensorFlow, Parquet); zero
serialization/copy to work with them
3. Easier to extend: process encoded data
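The two layouts can be sketched in Java using the three-record example from the previous slide. PersonRow and PersonColumns are hypothetical types written for illustration; they are not Spark's internal row or column classes.

```java
// Row format: one heap object per record. PersonRow is a hypothetical type
// modelled on the three-column example data.
final class PersonRow {
    final int id;
    final String name;
    final double score;

    PersonRow(int id, String name, double score) {
        this.id = id;
        this.name = name;
        this.score = score;
    }
}

// Column format: one array per attribute, so the values of a column sit
// next to each other in memory.
final class PersonColumns {
    final int[] ids;
    final String[] names;
    final double[] scores;

    // Transpose an array of rows into per-attribute arrays.
    PersonColumns(PersonRow[] rows) {
        ids = new int[rows.length];
        names = new String[rows.length];
        scores = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            ids[i] = rows[i].id;
            names[i] = rows[i].name;
            scores[i] = rows[i].score;
        }
    }

    // Scanning one column reads only that column's array, in order.
    double sumScores() {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum;
    }
}
```

An aggregate over the score column walks a single double[] sequentially, which is the "regular data access" the list above mentions; the row layout would instead dereference one object per record.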
Throughput
Parquet: 11 million rows/sec
Parquet vectorized: 90 million rows/sec
Note: End-to-end, single thread, single column, and data originated in Parquet on disk
Putting it All Together
Phase 1
Spark 1.4 - 1.6
Memory Management
Code Generation
Cache-aware Algorithms
Phase 2
Spark 2.0+
Whole-stage Code Generation
Columnar in Memory Support
Both whole stage codegen [SPARK-12795] and the vectorized
parquet reader [SPARK-12992] are enabled by default in Spark 2.0+
Demo
Operator Benchmarks: Cost/Row (ns)
[Chart: 5-30x speedups]
[Chart: Radix Sort, 10-100x speedups]
[Chart: Shuffling still the bottleneck]
[Chart: 10x speedup]
TPC-DS (Scale Factor 1500, 100 cores)
[Chart: query time by query #, Spark 2.0 vs Spark 1.6; lower is better]
What’s Next?
Spark 2.1, 2.2 and beyond
1. SPARK-16026: Cost Based Optimizer
- Leverage table/column level statistics to optimize joins and aggregates
- Statistics Collection Framework (Spark 2.1)
- Cost Based Optimizer (Spark 2.2)
2. Boosting Spark’s Performance on Many-Core Machines
- Qifan’s Talk Today at 2:55pm (Research Track)
- In-memory/single-node shuffle
3. Improving quality of generated code and better integration
with the in-memory column format in Spark
Questions?
I’ll also be at the Databricks booth tomorrow from 12-2pm!
Further Reading
http://tinyurl.com/project-tungsten