Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi

Lazy Join Optimizations without Upfront Statistics
MATTEO INTERLANDI

Open Source Data-Intensive Scalable Computing (DISC) Platforms: Hadoop MapReduce and Spark
◦ functional API
◦ map and reduce User-Defined Functions
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)
Several years later, introduction of high-level SQL-like declarative query languages (and systems)
◦ Conciseness
◦ Pick a physical execution plan from a number of alternatives
Cloud Computing Programs

Two-step process
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join order and implementation)
Physical optimizer in RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)
The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each plan
(3) select a sufficiently good plan
Query Optimization
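
The three steps above (enumerate, estimate, select) can be illustrated with a minimal sketch. The cardinalities, selectivities, and the cost model (sum of intermediate result sizes over left-deep plans) are hypothetical values chosen for illustration; they are not taken from the talk or from any particular system.

```python
from itertools import permutations

# Hypothetical base-table cardinalities and pairwise join selectivities.
CARD = {"A": 1_000, "B": 50_000, "C": 200}
SEL = {
    frozenset({"A", "B"}): 0.001,
    frozenset({"A", "C"}): 0.01,
    frozenset({"B", "C"}): 0.0005,
}


def plan_cost(order):
    """Estimated cost of a left-deep plan: sum of intermediate result sizes."""
    joined = [order[0]]
    size = float(CARD[order[0]])
    cost = 0.0
    for rel in order[1:]:
        # Independence assumption: multiply the selectivities of every
        # predicate that connects the new relation to the ones already joined.
        sel = 1.0
        for prev in joined:
            sel *= SEL.get(frozenset({prev, rel}), 1.0)
        size = size * CARD[rel] * sel
        cost += size
        joined.append(rel)
    return cost


def best_plan(relations):
    """(1) enumerate join orders, (2) estimate each cost, (3) select the cheapest."""
    candidates = list(permutations(relations))
    return min(candidates, key=plan_cost)


if __name__ == "__main__":
    for order in permutations(CARD):
        print(order, plan_cost(order))
    print("chosen:", best_plan(CARD))
```

With these made-up numbers the cheapest plans start by joining A and C, because that pair produces the smallest intermediate result. The estimate is only as good as the statistics fed into it.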

Query Optimization: Why Important?
[Figure: execution time in seconds (log scale, 0.25 to 16384) at scale factor 10 for Spark, AsterixDB, Hive, and Pig, with a W bar and an R bar per system. Data labels: 1082, 70, 343, 21, unable to finish in 5+ hours, 276, 15102, 954.]

Bad plans over Big Data can be disastrous!

No cost-based join enumeration
◦ Rely on order of relations in FROM clause
◦ Left-deep plans
No upfront statistics:
◦ Data often sits in HDFS and is unstructured
Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
Cost-based Optimizer in DISC
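
To make the "exponential error propagation in joins" point concrete, here is a small arithmetic sketch. It assumes, as a simplification, that every join's cardinality estimate is off by the same multiplicative factor and that the errors compound; the function name and the factor 2.0 are illustrative only.

```python
def compounded_error(per_join_error: float, num_joins: int) -> float:
    """Worst-case multiplicative error after a chain of joins.

    Assumes each join multiplies the estimation error by the same factor.
    """
    return per_join_error ** num_joins


if __name__ == "__main__":
    # A 2x misestimate per join grows quickly with the depth of the plan.
    for n in (1, 2, 4, 6):
        print(f"{n} joins -> estimate off by up to {compounded_error(2.0, n):.0f}x")
```

Under this assumption, an optimizer that is only 2x off per join can be 64x off after six joins, which is enough to flip the choice between join orders or join implementations.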

Bad statistics
◦ Adaptive Query planning
◦ RoPe [NSDI 12, VLDB 2013]
◦ Assumption is that some initial statistics exist
No upfront statistics
◦ Pilot runs (samples)
◦ DynO [SIGMOD 2014]
◦ Samples are expensive
◦ Only foreign-key joins
◦ No runtime plan revision
Cost-based Optimizer in DISC: State of the Art

Lazy Cost-based Optimizer for Spark
Key idea: interleave query planning and execution
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Joins are greedily scheduled
◦ Next join can be dynamically changed if a bad decision was made
◦ Execute-Gather-Aggregate-Plan strategy (EGAP)
Neither upfront statistics nor pilot runs are required
◦ Raw dataset size for initial guess
Support for non-foreign-key joins
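
The interleaving of planning and execution can be sketched as a toy, single-process loop. This is not the Spark implementation from the talk: relations are plain Python lists of dicts, every join is on one shared key, the only statistic is the record count, and the helper names (`hash_join`, `gather`, `lazy_join`) are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Execute: a plain in-memory equi-join on `key`."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def gather(rows):
    """Gather/Aggregate: here simply the record count of a materialized input."""
    return len(rows)


def lazy_join(relations, key):
    """Greedy loop: join the two smallest inputs, re-measure, then re-plan."""
    pending = dict(relations)
    stats = {name: gather(rows) for name, rows in pending.items()}
    order = []
    while len(pending) > 1:
        # Plan: pick the two inputs with the smallest observed size.
        a, b = sorted(pending, key=lambda n: stats[n])[:2]
        # Execute only the chosen join; later joins stay unplanned.
        result = hash_join(pending.pop(a), pending.pop(b), key)
        name = f"({a} JOIN {b})"
        pending[name] = result
        # Statistics observed at runtime drive the next Plan step.
        stats[name] = gather(result)
        order.append(name)
    (final_rows,) = pending.values()
    return final_rows, order


if __name__ == "__main__":
    A = [{"k": i, "a": i} for i in range(5)]
    B = [{"k": i % 5, "b": i} for i in range(50)]
    C = [{"k": i, "c": i} for i in range(3)]
    rows, order = lazy_join({"A": A, "B": B, "C": C}, "k")
    print(order, len(rows))
```

Because each decision is made only after the previous join's output has been measured, a misleading initial size guess affects at most the next join, not the whole plan.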

Lazy Optimizer: an Example
[Figure sequence: animated diagram of a join over relations A, B, and C, planned under the assumption A < C. Diagram labels include per-input statistics (S), U, AB, and Driver. The slides step through:]
◦ Execute step
◦ Gather step
◦ Aggregate step (Driver)
◦ Plan step
◦ Execute step (AB)
◦ Gather step
◦ Plan step
◦ Execute step (final join of A, B, and C)

Lazy Optimizer: Wrong Guess
[Figure sequence: the same example with selections applied to A and C. The plan starts from an initial size assumption on A and C, but the gathered statistics show σ(A) > σ(C). The slides then show a Repartition step, the join BC, and the final join of A, B, and C.]

Runtime Integrated Optimizer for Spark
Spark's batch execution model allows late binding of joins
Set of Statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record
Statistics are aggregated using a Spark job or accumulators
Join implementations are picked based on thresholds
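
As a rough illustration of picking a join implementation from the statistics listed above, the sketch below estimates each input's size as number of records times average record size and compares the smaller input against a threshold. The `RelStats` type, the `pick_join` helper, and the 10 MB threshold are assumptions made for this example, not values or APIs from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelStats:
    num_records: int
    avg_record_bytes: float

    @property
    def size_bytes(self) -> float:
        # Estimated size from the two runtime statistics.
        return self.num_records * self.avg_record_bytes


# Hypothetical threshold; a real system would make this configurable.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join(left: RelStats, right: RelStats,
              threshold: float = BROADCAST_THRESHOLD_BYTES) -> str:
    """Broadcast the smaller side if it fits under the threshold, else shuffle."""
    smaller = min(left.size_bytes, right.size_bytes)
    return "broadcast" if smaller <= threshold else "shuffle"


if __name__ == "__main__":
    small = RelStats(num_records=1_000, avg_record_bytes=100)
    large = RelStats(num_records=10**9, avg_record_bytes=100)
    print(pick_join(small, large))
    print(pick_join(large, large))
```

Since the decision uses sizes measured at runtime, a selective filter that shrinks an input can turn a shuffle join into a broadcast join without any upfront statistics.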

Challenges and Optimizations
Execute - Block and revise execution plans without wasting
computation
Gather - Asynchronous generation of statistics
Aggregate - Efficient accumulation of statistics
Plan - Try to schedule as many broadcast joins as possible
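
One way to keep the Gather and Aggregate steps cheap is to compute small per-partition summaries in a single pass and combine them with an associative, commutative merge, which is the same property accumulator-style aggregation relies on. The sketch below is plain Python with hypothetical names (`PartStats`, `gather`, `aggregate`) and treats records as byte strings; it does not use the Spark API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class PartStats:
    num_records: int = 0
    total_bytes: int = 0

    def merge(self, other: "PartStats") -> "PartStats":
        # Associative and commutative, so partitions can be merged in any order.
        return PartStats(self.num_records + other.num_records,
                         self.total_bytes + other.total_bytes)

    @property
    def avg_record_bytes(self) -> float:
        return self.total_bytes / self.num_records if self.num_records else 0.0


def gather(partition) -> PartStats:
    """Gather: one pass over a partition of byte-string records."""
    stats = PartStats()
    for rec in partition:
        stats = stats.merge(PartStats(1, len(rec)))
    return stats


def aggregate(per_partition) -> PartStats:
    """Aggregate: fold the per-partition summaries into one global summary."""
    return reduce(PartStats.merge, per_partition, PartStats())


if __name__ == "__main__":
    partitions = [[b"aa", b"bbbb"], [b"cccccc"]]
    total = aggregate(gather(p) for p in partitions)
    print(total, total.avg_record_bytes)
```

Only two integers per partition travel to the driver, so the cost of aggregation stays independent of the data size.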

Experiments
Q1: Is RIOS able to generate good query plans?
Q2: What is the performance of RIOS compared to regular Spark and pilot runs?
Q3: How expensive are wrong guesses?

Minibenchmark with 3 Fact Tables
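[Figure: execution time in seconds (log scale, 16 to 16384) versus scale factor (1, 10, 100, 1000) for five series: spark good-order, RIOS R2R, RIOS W2R, spark wrong-order, and pilot-run. Data labels in extraction order: 40, 66, 115, 997; 41, 61, 111, 868; 45, 66, 123, 1140; 136, 4194; 143, 4230.]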

Q1: RIOS is able to avoid bad plans

Q2: RIOS is always faster than pilot run approach

Q3: Bad guesses cost around 15% in the worst case

TPCDS and TPCH Queries
[Figure: four panels (Query 17, Query 50, Query 28, Query 9) showing execution time in seconds (log scale, 16 to 8192) versus scale factor (1, 10, 100, 1000) for four series: spark good-order, RIOS, pilot-run, and spark bad-order.]

Q1: RIOS generates optimal plans

Q2: RIOS is always the faster approach

Conclusions
RIOS: cost-based query optimizer for Spark
Statistics are gathered at runtime (no need for initial statistics or pilot runs)
Late binding of joins
Up to 2x faster than the best left-deep plans (Spark), and more than 100x faster than previous approaches for fact-table joins

Future Work
More flexible shuffle operations:
◦ Efficient switch from shuffle-based joins to broadcast joins
◦ Allow records to be partitioned in different ways
Take interesting orders and partitions into consideration
Add aggregation and additional statistics (I/O and network cost)

๏ Datasets:
• TPCDS
• TPCH
๏ Configuration:
• 16 machines, each with 4 cores (2 hyperthreads per core), 32GB of RAM, 1TB disk
• Spark 1.6.3
• Scale factor from 1 to 1000 (~1TB)
Experiment Configuration
