Deep Dive Into Catalyst:
Apache Spark’s Optimizer
Yin Huai, yhuai@databricks.com
2017-06-06, Spark Summit
About me
• Software engineer at Databricks
• Apache Spark committer and PMC member
• One of the original developers of Spark SQL
• Before joining Databricks: Ohio State University
About Databricks
• TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
• PRODUCT: Unified Analytics Platform
• MISSION: Making Big Data Simple
Overview

[Stack diagram] Spark Core (RDD) at the base; Catalyst and the structured APIs (SQL, DataFrame/Dataset) on top of it; ML Pipelines, Structured Streaming, GraphFrames, and more built on Spark SQL.

Spark SQL applies structured views to data from different systems stored in different kinds of formats.
Why structure APIs?

RDD:
data.map { case (dept, age) => dept -> (age, 1) }
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .map { case (dept, (age, c)) => dept -> age / c }

DataFrame:
data.groupBy("dept").avg("age")

SQL:
select dept, avg(age) from data group by 1
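For reference, a minimal runnable sketch of the DataFrame variant. The sample rows and the local SparkSession are assumptions for illustration, not part of the original deck:

import org.apache.spark.sql.SparkSession

// Sketch: hypothetical sample data to exercise data.groupBy("dept").avg("age")
val spark = SparkSession.builder().master("local[*]").appName("why-structure").getOrCreate()
import spark.implicits._

val data = Seq(("eng", 30), ("eng", 40), ("sales", 50)).toDF("dept", "age")
data.groupBy("dept").avg("age").show()
// Expected: eng -> 35.0, sales -> 50.0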
Why structure APIs?
• Structure will limit what can be expressed.
• In practice, we can accommodate the vast majority of computations.

Limiting the space of what can be expressed enables optimizations.
Why structure APIs?

[Bar chart] Runtime performance of aggregating 10 million int pairs, in seconds (0-5 scale), comparing RDD, DataFrame, and SQL.
How to take advantage of optimization opportunities?

Get an optimizer that automatically finds out the most efficient plan to execute the data operations specified in the user's program.

Catalyst: Apache Spark's Optimizer
How Catalyst Works: An Overview

SQL AST / DataFrame / Dataset (Java/Scala)
  → Query Plan
  → Optimized Query Plan
  → RDDs (via Code Generation)

Catalyst performs these Transformations over query plans, which are abstractions of users' programs (Trees).
Trees: Abstractions of Users' Programs

SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50000) tmp
Expression
• An expression represents a new value, computed based on input values, e.g. 1 + 2 + t1.value
• Attribute: a column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v)
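To see the expression tree Catalyst builds for such an expression, you can parse it and print it; a small sketch (functions.expr, Column.expr, and treeString are real Spark APIs, while the column name value is just illustrative):

import org.apache.spark.sql.functions

// Parse "1 + 2 + value" into a Catalyst Expression and print its tree
val e = functions.expr("1 + 2 + value").expr
println(e.treeString)  // roughly: Add(Add(Literal(1), Literal(2)), 'value)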
Query Plan

Aggregate [sum(v)]
  Project [t1.id, 1+2+t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50000]
      Join
        Scan (t1)
        Scan (t2)
Logical Plan
• A Logical Plan describes computation on datasets without defining how to conduct the computation
• output: a list of attributes generated by this Logical Plan, e.g. [id, v]
• constraints: a set of invariants about the rows generated by this plan, e.g. t2.id > 50000
• statistics: size of the plan in rows/bytes; per-column stats (min/max/ndv/nulls)
Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation
• A Physical Plan is executable

Hash-Aggregate [sum(v)]
  Project [t1.id, 1+2+t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50000]
      Sort-Merge Join
        Parquet Scan (t1)
        JSON Scan (t2)
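Both kinds of plans can be printed for any query: Dataset.explain(true) shows the parsed, analyzed, and optimized logical plans followed by the physical plan (assuming a SparkSession named spark and that tables t1 and t2 exist):

// Prints the parsed/analyzed/optimized logical plans and the physical plan
spark.sql("""
  SELECT sum(v)
  FROM (
    SELECT t1.id, 1 + 2 + t1.value AS v
    FROM t1 JOIN t2
    WHERE t1.id = t2.id AND t2.id > 50000) tmp
""").explain(true)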
Transformations
• Transformations without changing the tree type (Transform and Rule Executor)
  • Expression => Expression
  • Logical Plan => Logical Plan
  • Physical Plan => Physical Plan
• Transforming a tree to another kind of tree
  • Logical Plan => Physical Plan
Transform
• A function associated with every tree, used to implement a single rule

1 + 2 + t1.value (evaluates 1 + 2 for every row):
  Add
    Add
      Literal(1)
      Literal(2)
    Attribute(t1.value)

is transformed to 3 + t1.value (evaluates 1 + 2 once):
  Add
    Literal(3)
    Attribute(t1.value)
Transform
• A transformation is defined as a Partial Function
• Partial Function: a function that is defined for a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  // Typed patterns (x: Int, y: Int) so that x + y compiles
  case Add(Literal(x: Int, IntegerType), Literal(y: Int, IntegerType)) =>
    Literal(x + y)
}

The case clause determines whether the partial function is defined for a given input.
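The same transform can be run standalone against Catalyst's expression classes; a sketch targeting the Spark 2.x internals used in this talk (these APIs are internal and unstable):

// Sketch: hand-rolled constant folding over a Catalyst expression tree
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Expression, Literal}
import org.apache.spark.sql.types.IntegerType

val value = AttributeReference("value", IntegerType)()          // stands in for t1.value
val tree: Expression = Add(Add(Literal(1), Literal(2)), value)  // 1 + 2 + t1.value

val folded = tree.transform {
  case Add(Literal(x: Int, IntegerType), Literal(y: Int, IntegerType)) =>
    Literal(x + y)
}
// folded is now Add(Literal(3), value), i.e. 3 + t1.value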
Combining Multiple Rules

Predicate Pushdown:

Before:
Project [t1.id, 3+t1.value AS v]
  Filter [t1.id = t2.id AND t2.id > 50000]
    Join
      Scan (t1)
      Scan (t2)

After:
Project [t1.id, 3+t1.value AS v]
  Join [t1.id = t2.id]
    Scan (t1)
    Filter [t2.id > 50000]
      Scan (t2)
Column Pruning:

After pruning, Projects that keep only the needed columns are inserted above each Scan:

Project [t1.id, 3+t1.value AS v]
  Join [t1.id = t2.id]
    Project [t1.id, t1.value]
      Scan (t1)
    Filter [t2.id > 50000]
      Project [t2.id]
        Scan (t2)
Before transformations vs. after transformations: the original plan (Join, then a Filter carrying both predicates, then the Project) becomes the plan above, with t2.id > 50000 pushed below the Join and pruning Projects (t1.id, t1.value and t2.id) inserted above the Scans.
Combining Multiple Rules: Rule Executor

A Rule Executor transforms a Tree to another Tree of the same type by applying many rules defined in batches (Batch 1, Batch 2, …, Batch n; each batch contains Rule 1, Rule 2, …).
• Every rule is implemented based on Transform
• Approaches of applying the rules in a batch:
  1. Fixed point: reapply the batch until the tree stops changing
  2. Once: apply the batch a single time
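In spirit, fixed-point execution of a batch looks like the following simplified sketch (not Spark's internal RuleExecutor, just the idea):

// Sketch: apply a batch of rules repeatedly until the tree stops changing
def runBatchToFixedPoint[T](tree: T, batch: Seq[T => T], maxIterations: Int = 100): T = {
  var current = tree
  var iterations = 0
  var changed = true
  while (changed && iterations < maxIterations) {
    val next = batch.foldLeft(current)((t, rule) => rule(t))
    changed = next != current  // relies on structural equality of the tree type
    current = next
    iterations += 1
  }
  current
}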
From Logical Plan to Physical Plan
• A Logical Plan is transformed to a Physical Plan by applying a set of Strategies
• Every Strategy uses pattern matching to convert a Logical Plan to a Physical Plan

object BasicOperators extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    …
    case logical.Project(projectList, child) =>
      execution.ProjectExec(projectList, planLater(child)) :: Nil
    case logical.Filter(condition, child) =>
      execution.FilterExec(condition, planLater(child)) :: Nil
    …
  }
}
// planLater triggers other Strategies to plan the children
The full Catalyst pipeline:

SQL AST / DataFrame / Dataset (Java/Scala)
  → Unresolved Logical Plan
  → [Analysis, using the Catalog] → Logical Plan
  → [Logical Optimization] → Optimized Logical Plan
  → [Physical Planning] → Physical Plans
  → [Cost Model] → Selected Physical Plan
  → RDDs
• Analysis (Rule Executor): transforms an Unresolved Logical Plan to a Resolved Logical Plan
  • Unresolved => Resolved: use the Catalog to find where datasets and columns are coming from, and the types of columns
• Logical Optimization (Rule Executor): transforms a Resolved Logical Plan to an Optimized Logical Plan
• Physical Planning (Strategies + Rule Executor):
  • Phase 1: transforms an Optimized Logical Plan to a Physical Plan
  • Phase 2: a Rule Executor is used to adjust the Physical Plan to make it ready for execution
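Each phase's output can be inspected programmatically through QueryExecution; these accessors are real Spark 2.x APIs, and the toy query is an assumption for illustration:

// Inspect the plan produced by each Catalyst phase
val qe = spark.range(10).filter("id > 5").queryExecution
println(qe.logical)        // parsed, unresolved logical plan
println(qe.analyzed)       // resolved logical plan, after Analysis
println(qe.optimizedPlan)  // after Logical Optimization
println(qe.executedPlan)   // selected physical plan, ready for execution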
Put what we have learned into action

Use Catalyst's APIs to customize Spark: roll your own planner rule.
Roll your own Planner Rule

import org.apache.spark.sql.functions._

// tableA is a dataset of integers in the range [0, 19999999]
val tableA = spark.range(20000000).as('a)
// tableB is a dataset of integers in the range [0, 9999999]
val tableB = spark.range(10000000).as('b)
// result is the number of records after joining tableA and tableB
val result = tableA
  .join(tableB, $"a.id" === $"b.id")
  .groupBy()
  .count()
result.show()

This takes 4-8s on Databricks Community Edition.
Roll your own Planner Rule

result.explain()

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
      +- *Project
         +- *SortMergeJoin [id#642L], [id#646L], Inner
            :- *Sort [id#642L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#642L, 200)
            :     +- *Range (0, 20000000, step=1, splits=8)
            +- *Sort [id#646L ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(id#646L, 200)
                  +- *Range (0, 10000000, step=1, splits=8)
Roll your own Planner Rule

Exploit the structure of the problem: we are joining two intervals, so the result will be the intersection of these intervals.

[Diagram] Intervals A and B overlap; the join result is A∩B.
Roll your own Planner Rule

// Import internal APIs of Catalyst
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.{Alias, EqualTo}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Join, Range}
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.execution.{ProjectExec, RangeExec, SparkPlan}

case object IntervalJoin extends Strategy with Serializable {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Join(
          Range(start1, end1, 1, part1, Seq(o1)),  // matches tableA
          Range(start2, end2, 1, part2, Seq(o2)),  // matches tableB
          Inner, Some(EqualTo(e1, e2)))            // matches the Join
        if ((o1 semanticEquals e1) && (o2 semanticEquals e2)) ||
           ((o1 semanticEquals e2) && (o2 semanticEquals e1)) =>
      // See the next slide for the rule body
    case _ => Nil
  }
}
Roll your own Planner Rule

// matches cases like:
// tableA: start1----------------------------end1
// tableB: ...------------------end2
if ((end2 >= start1) && (end2 <= end1)) {
  // start of the intersection
  val start = math.max(start1, start2)
  // end of the intersection
  val end = math.min(end1, end2)
  val part = math.max(part1.getOrElse(200), part2.getOrElse(200))
  // Create a new Range to represent the intersection
  val result = RangeExec(Range(start, end, 1, Some(part), o1 :: Nil))
  val twoColumns = ProjectExec(
    Alias(o1, o1.name)(exprId = o1.exprId) :: Nil,
    result)
  twoColumns :: Nil
} else {
  Nil
}
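For the tables above the condition holds, so start = max(0, 0) = 0 and end = min(20000000, 10000000) = 10000000: the join collapses to a single Range over [0, 10000000), which is exactly what the optimized physical plan below shows.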
Roll your own Planner Rule

Hook it up with Spark:
spark.experimental.extraStrategies = IntervalJoin :: Nil

Use it:
result.show()

This now takes ~0.5s to complete.
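Since extraStrategies is a plain mutable field on spark.experimental, the rule can be unplugged the same way (a usage note, not from the deck):

spark.experimental.extraStrategies = Nil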
Roll your own Planner Rule

result.explain()

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
      +- *Project
         +- *Project [id#642L AS id#642L]
            +- *Range (0, 10000000, step=1, splits=8)
Contribute your ideas to Spark

A 110-line patch took a user's query from "never finishing" to 200s.
Overall, 200+ people have contributed to the analyzer/optimizer/planner in the last 2 years.
Try Apache Spark in Databricks!

UNIFIED ANALYTICS PLATFORM
• Collaborative cloud environment
• Free version (community edition)

DATABRICKS RUNTIME 3.0
• Apache Spark, optimized for the cloud
• Caching and optimization layer: DBIO
• Enterprise security: DBES

Try for free today: databricks.com
Thank you!
Want to chat? Find me after this talk or at the Databricks booth, 3-3:40pm.