Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Spark
A Tale of Two Computation Engines
Andrii Rosa
Software Engineer
Wenlei Xie
Research Scientist
Agenda
Introduction
Design & Implementation
Introduction
SQL Use Cases @ Facebook
▪ Reporting and Dashboarding
▪ Low latency (<1s)
▪ High QPS
▪ Presto
▪ Adhoc Analysis
▪ Moderate latency (seconds to minutes)
▪ Mainly Presto
▪ Batch Processing
▪ High latency (up to tens of hours)
▪ Both Presto and Spark
Towards a Unified SQL Experience
▪ Batch Processing Uses Both Presto and Spark
▪ Presto doesn’t scale for large batch pipelines
▪ Inconsistent SQL Experience
▪ SQL Dialect
▪ Subtle Semantic Difference
▪ Null vs. Exception
▪ UDF/UDAF
▪ Best Practice
Presto and Spark Architecture
Presto:
▪ Designed for latency
▪ MPP Architecture
▪ In-memory shuffle
▪ Shared executor
Spark:
▪ Designed for Scalability
▪ MapReduce Architecture
▪ Disaggregated shuffle
▪ Isolated executor
Why Doesn't Presto (or Other MPPs) Scale?
A Decade-Old Question
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
[Diagram: Scan tasks feed an in-memory shuffle on custkey, which feeds the Aggr tasks]
Execute everything concurrently:
- inflexible scheduling
- fault tolerance is difficult
- might exceed the memory limit
Presto Unlimited
Brings MapReduce-style execution to an MPP-architected runtime
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
[Diagram: Scan tasks write their output, keyed by custkey, to an in-memory shuffle; Aggr tasks consume the shuffle partitions]
Independent partition execution on the "reducer" side:
- partition-level retry
- schedule a few partitions concurrently to reduce memory
Presto-on-Spark
Executes Presto Evaluation Library on Spark Runtime
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
[Diagram: Stage 1 — Scan tasks write to a disaggregated shuffle on custkey; Stage 2 — Read tasks pull the shuffle partitions and feed the Aggr tasks]
Why Presto-on-Spark Instead of Making Presto Unlimited More Scalable?
▪ What is Missing?
▪ Full Disaggregated Shuffle
▪ Isolated Executor
▪ Different Scheduler, Speculative Execution, etc.
▪ We would effectively have to embed a "mini-Spark Runtime" inside Presto!
Design & Implementation
Presto-on-Spark Design Principles
▪ Presto is run as a library
▪ A Presto cluster is not needed to run Presto-on-Spark
▪ Presto on Spark is just a Spark application
▪ Query is passed as a parameter
▪ Implemented on the RDD level
▪ Operations done by Presto are opaque to the Spark engine
spark-submit
# spark-submit \
    --master spark://spark-master:7077 \
    presto-spark-launcher-*.jar \
    --package presto-spark-package-*.tar.gz \
    --config ./config.properties \
    --catalogs ./catalogs \
    --catalog hive \
    --schema default \
    --file /tmp/query.sql
Planning
Query → Logical Plan → Distributed Plan

Query:
SELECT *
FROM lineitem l
JOIN orders o
ON l.orderkey = o.orderkey
WHERE o.orderstatus = 'O'

Logical Plan:
  JOIN [on orderkey]
    TABLE SCAN [lineitem]
    FILTER [o.orderstatus = 'O']
      TABLE SCAN [orders]

Distributed Plan:
  Fragment 1: TABLE SCAN [lineitem] → PARTITION BY [orderkey]
  Fragment 2: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → PARTITION BY [orderkey]
  Fragment 0: JOIN [on orderkey] over the partitioned outputs of Fragment 1 and Fragment 2
Translating to RDD

Fragment 1 (TABLE SCAN [lineitem] → PARTITION BY [orderkey]):
sparkContext
    .parallelize(lineitemSplits)
PairRDD<Integer, Row> = rdd
    .mapPartitionsToPair(fragment1Processor)
pairRdd.partitionBy()

Fragment 2 (TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → PARTITION BY [orderkey]):
sparkContext
    .parallelize(ordersSplits)
PairRDD<Integer, Row> = rdd
    .mapPartitionsToPair(fragment2Processor)
pairRdd.partitionBy()

Fragment 0 (JOIN [on orderkey]):
lineitemRdd.zipPartitions(ordersRdd,
    fragment0Processor)

Together these operations form the Spark DAG.
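As an illustration, a minimal sketch of this translation in Spark's Java API might look as follows. The fragment processors, the Split and PrestoSparkRow types, and variables such as jsc, lineitemSplits, ordersSplits and numPartitions are assumed to be supplied by the surrounding Presto-on-Spark code; this is a sketch, not the actual implementation.

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Fragment 1: scan lineitem splits, key each row by orderkey, then shuffle.
JavaPairRDD<Integer, PrestoSparkRow> lineitemRdd = jsc
        .parallelize(lineitemSplits)
        .mapPartitionsToPair(fragment1Processor)
        .partitionBy(new HashPartitioner(numPartitions));

// Fragment 2: scan orders splits, apply the filter, key by orderkey, then shuffle.
JavaPairRDD<Integer, PrestoSparkRow> ordersRdd = jsc
        .parallelize(ordersSplits)
        .mapPartitionsToPair(fragment2Processor)
        .partitionBy(new HashPartitioner(numPartitions));

// Fragment 0: join the co-partitioned sides; one Presto task per Spark partition.
// zipPartitions requires both sides to have the same number of partitions, which
// the two HashPartitioner calls above guarantee.
JavaRDD<PrestoSparkRow> result =
        lineitemRdd.zipPartitions(ordersRdd, fragment0Processor);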
Execution

Fragment 2 (Leaf Fragment: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O']):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits)

Fragment 0 (Intermediate Fragment: JOIN [on orderkey]):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(
    List<Iterator<Tuple2<Integer, PrestoSparkRow>>> inputs)
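To tie these signatures back to the RDD translation above: a leaf-fragment processor is what sits behind mapPartitionsToPair, and an intermediate-fragment processor is what sits behind zipPartitions. A hypothetical pair of adapters could look like the sketch below; toList is a trivial helper that drains an iterator into a list, and none of these names are the actual Presto-on-Spark classes.

import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

// Leaf fragment: Spark hands the processor the raw splits for this partition.
PairFlatMapFunction<Iterator<Split>, Integer, PrestoSparkRow> leafAdapter =
        splitIterator -> leafFragmentProcessor.process(toList(splitIterator));

// Intermediate fragment: Spark hands the processor the shuffled outputs of the child fragments.
FlatMapFunction2<Iterator<Tuple2<Integer, PrestoSparkRow>>,
        Iterator<Tuple2<Integer, PrestoSparkRow>>,
        Tuple2<Integer, PrestoSparkRow>> intermediateAdapter =
        (left, right) -> intermediateFragmentProcessor.process(List.of(left, right));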
Columnar Format to Row Format Conversion

STAGE 1: an operator pipeline (INPUT, PROJECT, FILTER, OUTPUT) passes columnar PAGEs between operators and converts them into ROWs at the stage boundary.
STAGE 2: the next pipeline (INPUT, FILTER, GROUP BY, OUTPUT) converts the incoming ROWs back into PAGEs and again processes PAGEs internally.

At the SHUFFLE boundary, a page such as
COL 1: VAL 1, VAL 2, VAL 3, VAL 4, VAL 5
COL 2: VAL 1, VAL 2, VAL 3, VAL 4, VAL 5
COL 3: VAL 1, VAL 2, VAL 3, VAL 4, VAL 5
is shipped one row at a time, e.g. [COL 1 VAL 1], [COL 2 VAL 1], [COL 3 VAL 1].
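A toy illustration of what the conversion at the shuffle boundary does, assuming a simplistic column-major page; the real Presto Page/Block classes and the PrestoSparkRow serialization format are far more compact than this sketch.

import java.util.ArrayList;
import java.util.List;

final class ToyPage {
    final Object[][] columns;              // columns[col][position], column-major layout
    ToyPage(Object[][] columns) { this.columns = columns; }
    int getPositionCount() { return columns.length == 0 ? 0 : columns[0].length; }
}

// Flatten a columnar page into one row per position before handing it to the shuffle.
static List<Object[]> pageToRows(ToyPage page) {
    List<Object[]> rows = new ArrayList<>();
    for (int position = 0; position < page.getPositionCount(); position++) {
        Object[] row = new Object[page.columns.length];
        for (int col = 0; col < page.columns.length; col++) {
            row[col] = page.columns[col][position];   // e.g. [COL 1 VAL 1], [COL 2 VAL 1], [COL 3 VAL 1]
        }
        rows.add(row);
    }
    return rows;
}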
Broadcast Join

Logical Plan:
  JOIN [on orderkey]
    TABLE SCAN [lineitem]
    FILTER [o.orderstatus = 'O']
      TABLE SCAN [orders]

Distributed Plan:
  Fragment 1: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → BROADCAST
  Fragment 0: TABLE SCAN [lineitem] → JOIN [on orderkey] against the broadcast side
Translating to RDD (executed as two Spark jobs: Job 1 and Job 0)

Fragment 1 (TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → BROADCAST):
sparkContext
    .parallelize(ordersSplits)
RDD<Row> = rdd
    .mapPartitions(fragment1Processor)
sc.broadcast(ordersRdd.collect())

Fragment 0 (TABLE SCAN [lineitem] → JOIN [on orderkey]):
sparkContext
    .parallelize(lineitemSplits)
RDD<Row> = rdd
    .mapPartitions(fragment0Processor)

Together these operations form the Spark DAG.
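Sketched against Spark's Java API, and assuming the same hypothetical processor objects and in-scope variables as before (jsc, ordersSplits, lineitemSplits), the broadcast variant might look like the following; calling collect() on the broadcast side is what triggers the first of the two Spark jobs.

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Fragment 1: materialize the filtered orders side on the driver and broadcast it.
JavaRDD<PrestoSparkRow> ordersRows = jsc
        .parallelize(ordersSplits)
        .mapPartitions(fragment1Processor);
Broadcast<List<PrestoSparkRow>> ordersBroadcast = jsc.broadcast(ordersRows.collect());

// Fragment 0: stream lineitem splits and join each partition against the broadcast.
// The process(...) call here is a hypothetical convenience overload; the slide's
// version takes a List<Split> and a list of broadcasts.
JavaRDD<PrestoSparkRow> result = jsc
        .parallelize(lineitemSplits)
        .mapPartitions(splitIterator ->
                fragment0Processor.process(splitIterator, ordersBroadcast));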
Execution

Fragment 1 (Broadcast Fragment: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O']):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits)

Fragment 0 (Join Fragment: TABLE SCAN [lineitem] → JOIN [on orderkey]):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(
    List<Split> splits,
    List<Broadcast<List<PrestoSparkRow>>> broadcasts)
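Conceptually, a join-fragment task can build a hash table from the broadcast rows once and then probe it with the rows streamed out of its lineitem splits. A toy sketch of that idea follows; orderKeyOf, scanRows and joinRows are hypothetical helpers, and this is not Presto's actual join operator.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Build side: index the broadcast orders rows by orderkey.
Map<Integer, List<PrestoSparkRow>> buildTable = new HashMap<>();
for (PrestoSparkRow build : broadcasts.get(0).value()) {
    buildTable.computeIfAbsent(orderKeyOf(build), key -> new ArrayList<>()).add(build);
}

// Probe side: scan the lineitem splits and emit one output row per match.
List<PrestoSparkRow> output = new ArrayList<>();
for (PrestoSparkRow probe : scanRows(splits)) {
    for (PrestoSparkRow build : buildTable.getOrDefault(orderKeyOf(probe), List.of())) {
        output.add(joinRows(probe, build));
    }
}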
Threading Model

[Diagram: a Presto task runs the operator pipeline INPUT → PROJECT → UNNEST → FILTER → OUTPUT. Inside a Spark task, a LOCAL SHUFFLE splits the same work so that several parallel pipeline instances (PROJECT → UNNEST → FILTER → PROJECT) run between the INPUT and the OUTPUT.]
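One way to picture the local shuffle inside a single Spark task is a bounded in-memory queue feeding several pipeline threads. The sketch below uses plain java.util.concurrent; runPipeline and emit are hypothetical stand-ins for the PROJECT/UNNEST/FILTER segment, and the actual Presto-on-Spark local exchange works differently.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

int pipelineCount = 4;
BlockingQueue<PrestoSparkRow> localShuffle = new ArrayBlockingQueue<>(10_000);
ExecutorService pool = Executors.newFixedThreadPool(pipelineCount);

// Each pipeline instance repeatedly pulls a row from the local shuffle and runs the
// downstream operator segment on it.
List<Future<?>> pipelines = new ArrayList<>();
for (int i = 0; i < pipelineCount; i++) {
    pipelines.add(pool.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            PrestoSparkRow row = localShuffle.take();   // blocks until the input side enqueues a row
            emit(runPipeline(row));
        }
        return null;
    }));
}

// The single-threaded Spark input side feeds the queue:
// for (PrestoSparkRow row : input) { localShuffle.put(row); }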
Classloader Isolation

Spark Classloader (presto-spark-launcher.jar):

int main() {
    ...
    sparkContext.addFile(
        "presto-spark-package.tar.gz"
    )
    ...
    IPrestoSparkService service =
        createService(
            "presto-spark-package.tar.gz"
        )
    ...
}

Presto Classloader (presto-spark-package.tar.gz):

IPrestoSparkService {
    getQueryExecutionFactory();
    getTaskExecutorFactory();
}

Each connector plugin gets its own classloader: Hive Plugin Classloader, Pinot Plugin Classloader, MySQL Plugin Classloader.
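A minimal sketch of the isolation idea in plain Java, assuming the downloaded package has been unpacked and its jars collected by a hypothetical listPackageJars helper, and using a made-up factory class name; a production launcher would additionally use a child-first classloader that shares only the IPrestoSparkService SPI with the Spark side.

import java.net.URL;
import java.net.URLClassLoader;

// Give the Presto evaluation library its own classloader so that its dependencies
// (and the per-connector plugin classloaders it creates) never clash with Spark's.
URL[] prestoJarUrls = listPackageJars("presto-spark-package.tar.gz");   // hypothetical helper
ClassLoader prestoClassLoader =
        new URLClassLoader(prestoJarUrls, IPrestoSparkService.class.getClassLoader());

// Instantiate the service through reflection; only the IPrestoSparkService interface
// is shared between the two classloaders.
Class<?> factoryClass = Class.forName(
        "com.example.PrestoSparkServiceFactory",    // hypothetical class inside the package
        true,
        prestoClassLoader);
IPrestoSparkService service =
        (IPrestoSparkService) factoryClass.getDeclaredMethod("createService").invoke(
                factoryClass.getDeclaredConstructor().newInstance());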
Current Status
▪ Under Active Development on GitHub: #13856
▪ Most query shapes supported
▪ Working on supporting remaining query shapes (some flavors of UNION ALL)
▪ Preparing the feature to become GA
▪ Initial Scalability Tests
▪ Scale to 10,000 Mappers / Reducers
▪ Supports Queries that Require 50TB+ Distributed Memory in Presto
▪ Up to 3x Wall Time Reduction for Presto Large Batch Queries (6h in Presto vs 2h in Presto on Spark)
Q&A