Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Spark
A Tale of Two Computation Engines
Andrii Rosa
Software Engineer
Wenlei Xie
Research Scientist
Agenda
Introduction
Design & Implementation
Introduction
SQL Use Cases @ Facebook
▪ Reporting and Dashboarding
▪ Low latency (<1s)
▪ High QPS
▪ Presto
▪ Adhoc Analysis
▪ Moderate latency (seconds to minutes)
▪ Mainly Presto
▪ Batch Processing
▪ High latency (up to tens of hours)
▪ Both Presto and Spark
Towards a Unified SQL Experience
▪ Batch Processing Uses Both Presto and Spark
▪ Presto doesn’t scale for large batch pipelines
▪ Inconsistent SQL Experience
▪ SQL Dialect
▪ Subtle Semantic Difference
▪ Null vs. Exception
▪ UDF/UDAF
▪ Best Practice
Presto and Spark Architecture
Presto:
▪ Designed for latency
▪ MPP Architecture
▪ In-memory shuffle
▪ Shared executor
Spark:
▪ Designed for Scalability
▪ MapReduce Architecture
▪ Disaggregated shuffle
▪ Isolated executor
Why Doesn't Presto (or Other MPPs) Scale?
A Decade-Old Question
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
[Diagram: Scan tasks feed an in-memory shuffle on custkey, which feeds the Aggr tasks]
Execute everything concurrently:
- inflexible scheduling
- fault tolerance is difficult
- might exceed the memory limit
Presto Unlimited
Brings MapReduce-style execution to an MPP-architected runtime
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
[Diagram: Scan tasks write their output, keyed by custkey, to an in-memory shuffle; Aggr tasks consume the shuffle partitions]
Independent partition execution on the "reducer" side:
- partition-level retry
- schedule a few partitions concurrently to reduce memory
Presto-on-Spark
Executes Presto Evaluation Library on Spark Runtime
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
[Diagram: Stage 1 — Scan tasks write to a disaggregated shuffle on custkey; Stage 2 — Read tasks pull the shuffle partitions and feed the Aggr tasks]
Why Presto-on-Spark Instead of Making Presto Unlimited More Scalable?
▪ What is Missing?
▪ Full Disaggregated Shuffle
▪ Isolated Executor
▪ Different Scheduler, Speculative Execution, etc.
▪ We would effectively have to embed a "mini-Spark Runtime" inside Presto!
Design & Implementation
Presto-on-Spark Design Principles
▪ Presto is run as a library
▪ A Presto cluster is not needed to run Presto-on-Spark
▪ Presto on Spark is just a Spark application
▪ Query is passed as a parameter
▪ Implemented on the RDD level
▪ Operations done by Presto are opaque to the Spark engine
spark-submit
# spark-submit \
    --master spark://spark-master:7077 \
    presto-spark-launcher-*.jar \
    --package presto-spark-package-*.tar.gz \
    --config ./config.properties \
    --catalogs ./catalogs \
    --catalog hive \
    --schema default \
    --file /tmp/query.sql
Planning
Query → Logical Plan → Distributed Plan

Query:
SELECT *
FROM lineitem l
JOIN orders o
ON l.orderkey = o.orderkey
WHERE o.orderstatus = 'O'

Logical Plan:
  JOIN [on orderkey]
    TABLE SCAN [lineitem]
    FILTER [o.orderstatus = 'O']
      TABLE SCAN [orders]

Distributed Plan:
  Fragment 1: TABLE SCAN [lineitem] → PARTITION BY [orderkey]
  Fragment 2: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → PARTITION BY [orderkey]
  Fragment 0: JOIN [on orderkey] over the partitioned outputs of Fragment 1 and Fragment 2
Translating to RDD

Fragment 1 (TABLE SCAN [lineitem] → PARTITION BY [orderkey]):
sparkContext
    .parallelize(lineitemSplits)
PairRDD<Integer, Row> = rdd
    .mapPartitionsToPair(fragment1Processor)
pairRdd.partitionBy()

Fragment 2 (TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → PARTITION BY [orderkey]):
sparkContext
    .parallelize(ordersSplits)
PairRDD<Integer, Row> = rdd
    .mapPartitionsToPair(fragment2Processor)
pairRdd.partitionBy()

Fragment 0 (JOIN [on orderkey]):
lineitemRdd.zipPartitions(ordersRdd,
    fragment0Processor)

Together these operations form the Spark DAG.
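As an illustration, a minimal sketch of this translation in Spark's Java API might look as follows. The fragment processors, the Split and PrestoSparkRow types, and variables such as jsc, lineitemSplits, ordersSplits and numPartitions are assumed to be supplied by the surrounding Presto-on-Spark code; this is a sketch, not the actual implementation.

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Fragment 1: scan lineitem splits, key each row by orderkey, then shuffle.
JavaPairRDD<Integer, PrestoSparkRow> lineitemRdd = jsc
        .parallelize(lineitemSplits)
        .mapPartitionsToPair(fragment1Processor)
        .partitionBy(new HashPartitioner(numPartitions));

// Fragment 2: scan orders splits, apply the filter, key by orderkey, then shuffle.
JavaPairRDD<Integer, PrestoSparkRow> ordersRdd = jsc
        .parallelize(ordersSplits)
        .mapPartitionsToPair(fragment2Processor)
        .partitionBy(new HashPartitioner(numPartitions));

// Fragment 0: join the co-partitioned sides; one Presto task per Spark partition.
// zipPartitions requires both sides to have the same number of partitions, which
// the two HashPartitioner calls above guarantee.
JavaRDD<PrestoSparkRow> result =
        lineitemRdd.zipPartitions(ordersRdd, fragment0Processor);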
Execution

Fragment 2 (Leaf Fragment: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O']):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits)

Fragment 0 (Intermediate Fragment: JOIN [on orderkey]):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(
    List<Iterator<Tuple2<Integer, PrestoSparkRow>>> inputs)
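To tie these signatures back to the RDD translation above: a leaf-fragment processor is what sits behind mapPartitionsToPair, and an intermediate-fragment processor is what sits behind zipPartitions. A hypothetical pair of adapters could look like the sketch below; toList is a trivial helper that drains an iterator into a list, and none of these names are the actual Presto-on-Spark classes.

import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

// Leaf fragment: Spark hands the processor the raw splits for this partition.
PairFlatMapFunction<Iterator<Split>, Integer, PrestoSparkRow> leafAdapter =
        splitIterator -> leafFragmentProcessor.process(toList(splitIterator));

// Intermediate fragment: Spark hands the processor the shuffled outputs of the child fragments.
FlatMapFunction2<Iterator<Tuple2<Integer, PrestoSparkRow>>,
        Iterator<Tuple2<Integer, PrestoSparkRow>>,
        Tuple2<Integer, PrestoSparkRow>> intermediateAdapter =
        (left, right) -> intermediateFragmentProcessor.process(List.of(left, right));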
Columnar Format to Row Format Conversion

STAGE 1: an operator pipeline (INPUT, PROJECT, FILTER, OUTPUT) passes columnar PAGEs between operators and converts them into ROWs at the stage boundary.
STAGE 2: the next pipeline (INPUT, FILTER, GROUP BY, OUTPUT) converts the incoming ROWs back into PAGEs and again processes PAGEs internally.

At the SHUFFLE boundary, a page such as
COL 1: VAL 1, VAL 2, VAL 3, VAL 4, VAL 5
COL 2: VAL 1, VAL 2, VAL 3, VAL 4, VAL 5
COL 3: VAL 1, VAL 2, VAL 3, VAL 4, VAL 5
is shipped one row at a time, e.g. [COL 1 VAL 1], [COL 2 VAL 1], [COL 3 VAL 1].
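A toy illustration of what the conversion at the shuffle boundary does, assuming a simplistic column-major page; the real Presto Page/Block classes and the PrestoSparkRow serialization format are far more compact than this sketch.

import java.util.ArrayList;
import java.util.List;

final class ToyPage {
    final Object[][] columns;              // columns[col][position], column-major layout
    ToyPage(Object[][] columns) { this.columns = columns; }
    int getPositionCount() { return columns.length == 0 ? 0 : columns[0].length; }
}

// Flatten a columnar page into one row per position before handing it to the shuffle.
static List<Object[]> pageToRows(ToyPage page) {
    List<Object[]> rows = new ArrayList<>();
    for (int position = 0; position < page.getPositionCount(); position++) {
        Object[] row = new Object[page.columns.length];
        for (int col = 0; col < page.columns.length; col++) {
            row[col] = page.columns[col][position];   // e.g. [COL 1 VAL 1], [COL 2 VAL 1], [COL 3 VAL 1]
        }
        rows.add(row);
    }
    return rows;
}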
Broadcast Join

Logical Plan:
  JOIN [on orderkey]
    TABLE SCAN [lineitem]
    FILTER [o.orderstatus = 'O']
      TABLE SCAN [orders]

Distributed Plan:
  Fragment 1: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → BROADCAST
  Fragment 0: TABLE SCAN [lineitem] → JOIN [on orderkey] against the broadcast side
Translating to RDD (executed as two Spark jobs: Job 1 and Job 0)

Fragment 1 (TABLE SCAN [orders] → FILTER [o.orderstatus = 'O'] → BROADCAST):
sparkContext
    .parallelize(ordersSplits)
RDD<Row> = rdd
    .mapPartitions(fragment1Processor)
sc.broadcast(ordersRdd.collect())

Fragment 0 (TABLE SCAN [lineitem] → JOIN [on orderkey]):
sparkContext
    .parallelize(lineitemSplits)
RDD<Row> = rdd
    .mapPartitions(fragment0Processor)

Together these operations form the Spark DAG.
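Sketched against Spark's Java API, and assuming the same hypothetical processor objects and in-scope variables as before (jsc, ordersSplits, lineitemSplits), the broadcast variant might look like the following; calling collect() on the broadcast side is what triggers the first of the two Spark jobs.

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Fragment 1: materialize the filtered orders side on the driver and broadcast it.
JavaRDD<PrestoSparkRow> ordersRows = jsc
        .parallelize(ordersSplits)
        .mapPartitions(fragment1Processor);
Broadcast<List<PrestoSparkRow>> ordersBroadcast = jsc.broadcast(ordersRows.collect());

// Fragment 0: stream lineitem splits and join each partition against the broadcast.
// The process(...) call here is a hypothetical convenience overload; the slide's
// version takes a List<Split> and a list of broadcasts.
JavaRDD<PrestoSparkRow> result = jsc
        .parallelize(lineitemSplits)
        .mapPartitions(splitIterator ->
                fragment0Processor.process(splitIterator, ordersBroadcast));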
Execution

Fragment 1 (Broadcast Fragment: TABLE SCAN [orders] → FILTER [o.orderstatus = 'O']):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits)

Fragment 0 (Join Fragment: TABLE SCAN [lineitem] → JOIN [on orderkey]):
Iterator<Tuple2<Integer, PrestoSparkRow>> process(
    List<Split> splits,
    List<Broadcast<List<PrestoSparkRow>>> broadcasts)
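Conceptually, a join-fragment task can build a hash table from the broadcast rows once and then probe it with the rows streamed out of its lineitem splits. A toy sketch of that idea follows; orderKeyOf, scanRows and joinRows are hypothetical helpers, and this is not Presto's actual join operator.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Build side: index the broadcast orders rows by orderkey.
Map<Integer, List<PrestoSparkRow>> buildTable = new HashMap<>();
for (PrestoSparkRow build : broadcasts.get(0).value()) {
    buildTable.computeIfAbsent(orderKeyOf(build), key -> new ArrayList<>()).add(build);
}

// Probe side: scan the lineitem splits and emit one output row per match.
List<PrestoSparkRow> output = new ArrayList<>();
for (PrestoSparkRow probe : scanRows(splits)) {
    for (PrestoSparkRow build : buildTable.getOrDefault(orderKeyOf(probe), List.of())) {
        output.add(joinRows(probe, build));
    }
}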
Threading Model

[Diagram: a Presto task runs the operator pipeline INPUT → PROJECT → UNNEST → FILTER → OUTPUT. Inside a Spark task, a LOCAL SHUFFLE splits the same work so that several parallel pipeline instances (PROJECT → UNNEST → FILTER → PROJECT) run between the INPUT and the OUTPUT.]
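One way to picture the local shuffle inside a single Spark task is a bounded in-memory queue feeding several pipeline threads. The sketch below uses plain java.util.concurrent; runPipeline and emit are hypothetical stand-ins for the PROJECT/UNNEST/FILTER segment, and the actual Presto-on-Spark local exchange works differently.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

int pipelineCount = 4;
BlockingQueue<PrestoSparkRow> localShuffle = new ArrayBlockingQueue<>(10_000);
ExecutorService pool = Executors.newFixedThreadPool(pipelineCount);

// Each pipeline instance repeatedly pulls a row from the local shuffle and runs the
// downstream operator segment on it.
List<Future<?>> pipelines = new ArrayList<>();
for (int i = 0; i < pipelineCount; i++) {
    pipelines.add(pool.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            PrestoSparkRow row = localShuffle.take();   // blocks until the input side enqueues a row
            emit(runPipeline(row));
        }
        return null;
    }));
}

// The single-threaded Spark input side feeds the queue:
// for (PrestoSparkRow row : input) { localShuffle.put(row); }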
Classloader Isolation

Spark Classloader (presto-spark-launcher.jar):

int main() {
    ...
    sparkContext.addFile(
        "presto-spark-package.tar.gz"
    )
    ...
    IPrestoSparkService service =
        createService(
            "presto-spark-package.tar.gz"
        )
    ...
}

Presto Classloader (presto-spark-package.tar.gz):

IPrestoSparkService {
    getQueryExecutionFactory();
    getTaskExecutorFactory();
}

Each connector plugin gets its own classloader: Hive Plugin Classloader, Pinot Plugin Classloader, MySQL Plugin Classloader.
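A minimal sketch of the isolation idea in plain Java, assuming the downloaded package has been unpacked and its jars collected by a hypothetical listPackageJars helper, and using a made-up factory class name; a production launcher would additionally use a child-first classloader that shares only the IPrestoSparkService SPI with the Spark side.

import java.net.URL;
import java.net.URLClassLoader;

// Give the Presto evaluation library its own classloader so that its dependencies
// (and the per-connector plugin classloaders it creates) never clash with Spark's.
URL[] prestoJarUrls = listPackageJars("presto-spark-package.tar.gz");   // hypothetical helper
ClassLoader prestoClassLoader =
        new URLClassLoader(prestoJarUrls, IPrestoSparkService.class.getClassLoader());

// Instantiate the service through reflection; only the IPrestoSparkService interface
// is shared between the two classloaders.
Class<?> factoryClass = Class.forName(
        "com.example.PrestoSparkServiceFactory",    // hypothetical class inside the package
        true,
        prestoClassLoader);
IPrestoSparkService service =
        (IPrestoSparkService) factoryClass.getDeclaredMethod("createService").invoke(
                factoryClass.getDeclaredConstructor().newInstance());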
Current Status
▪ Under Active Development on GitHub: #13856
▪ Most query shapes supported
▪ Working on supporting remaining query shapes (some flavors of UNION ALL)
▪ Preparing the feature to become GA
▪ Initial Scalability Tests
▪ Scale to 10,000 Mappers / Reducers
▪ Supports Queries that Require 50TB+ Distributed Memory in Presto
▪ Up to 3x Wall Time Reduction for Presto Large Batch Queries (6h in Presto vs 2h in Presto on Spark)
Q&A