Cost-Based Optimizer in Apache Spark 2.2
Ron Hu, Zhenhua Wang (Huawei Technologies, Inc.)
Sameer Agarwal, Wenchen Fan (Databricks Inc.)
Session 1 Topics
• Motivation
• Statistics Collection Framework
• Cost-Based Optimizations
• TPC-DS Benchmark and Query Analysis
• Demo
How Does Spark Execute a Query?
[Diagram: SQL and DataFrames are parsed into a Logical Plan; the Optimizer, consulting the Catalog, turns it into a Physical Plan; the Code Generator then produces RDDs for execution.]
[The same pipeline diagram repeats with the Optimizer highlighted: the focus of today's talk.]
Catalyst Optimizer: An Overview

events = spark.read.json("/logs")
stats = events.join(users)
  .groupBy("loc", "status")
  .avg("duration")
errors = stats.where(stats.status == "ERR")

A query plan is an internal representation of a user's program. The optimizer is a series of transformations that convert the initial query plan into an optimized plan.
[Diagram: an initial plan (SCAN logs and SCAN users feeding a JOIN, then FILTER, then AGG) is transformed into an optimized plan over the same scans.]
Catalyst Optimizer: An Overview
In Spark, the optimizer's goal is to minimize end-to-end query response time. Two key ideas:
- Prune unnecessary data as early as possible, e.g., filter pushdown, column pruning
- Minimize per-operator cost, e.g., broadcast vs. shuffle, join re-ordering
[Diagram: the optimized plan from the previous slide.]
Rule-based Optimizer in Spark 2.1
• Most of the Spark SQL optimizer's rules are heuristics:
  PushDownPredicate, ColumnPruning, ConstantFolding, …
• They do NOT consider the cost of each operator
• They do NOT consider selectivity when estimating join relation size
• Join order is largely determined by the order in which tables appear in the SQL query
• The physical join implementation is chosen based on heuristics
An Example (TPC-DS q11 variant)

SELECT customer_id
FROM customer, store_sales, date_dim
WHERE c_customer_sk = ss_customer_sk AND
      ss_sold_date_sk = d_date_sk AND
      c_customer_sk > 1000

[Plan diagram: SCAN store_sales and SCAN customer feed the first JOIN, which then joins with SCAN date_dim; a FILTER applies the c_customer_sk predicate.]
An Example (TPC-DS q11 variant)
[Same plan annotated with cardinalities: store_sales = 3 billion rows, customer = 12 million rows, date_dim = 0.1 million rows. Joining store_sales with customer first produces a 2.5 billion row intermediate result; the filter reduces customer to 10 million rows, and the final result is 500 million rows.]
An Example (TPC-DS q11 variant)
[Reordered plan: with the filter pushed below the join, the customer side shrinks to 10 million rows and the largest intermediate result drops from 2.5 billion to 500 million rows: 40% faster, 80% less data.]
An Example (TPC-DS q11 variant)
How do we automatically optimize queries like these?
Cost-Based Optimizer (CBO)
• Collect, infer, and propagate table/column statistics on source/intermediate data
• Calculate the cost of each operator in terms of number of output rows, size of output, etc.
• Based on the cost calculation, pick the cheapest query execution plan
Rest of the Talk
• Statistics Collection Framework
– Table/Column Level Statistics Collected
– Cardinality Estimation (Filters, Joins, Aggregates etc.)
• Cost-based Optimizations
– Build Side Selection
– Multi-way Join Re-ordering
• TPC-DS Benchmarks
• Demo
Statistics Collection Framework and Cost-Based Optimizations
Ron Hu
Huawei Technologies
Step 1: Collect, infer, and propagate table and column statistics on source and intermediate data
Table Statistics Collected
• Command to collect table-level statistics:
  ANALYZE TABLE table-name COMPUTE STATISTICS
• It collects table-level statistics and saves them in the metastore:
  – Number of rows
  – Table size in bytes
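For example, a minimal Spark SQL session that collects and then inspects table-level statistics (the table name customers is hypothetical; a metastore-backed table is assumed):

  // Collect row count and size in bytes, persisted to the metastore
  spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS")
  // The stored statistics appear in the table's extended description
  spark.sql("DESCRIBE EXTENDED customers").show(100, truncate = false)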
Column Statistics Collected
• Command to collect column-level statistics of individual columns:
  ANALYZE TABLE table-name COMPUTE STATISTICS
  FOR COLUMNS column-name1, column-name2, …
• It collects column-level statistics and saves them in the metastore.
  – String/Binary types: distinct count, null count, average length, max length
  – Numeric/Date/Timestamp types: distinct count, max, min, null count, average length (fixed), max length (fixed)
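A minimal sketch, again with a hypothetical table and column names:

  // Column-level statistics for the columns the optimizer will use
  spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS " +
    "FOR COLUMNS c_customer_sk, c_birth_country")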
Filter Cardinality Estimation
• Between logical expressions: AND, OR, NOT
• Within each logical expression: =, <, <=, >, >=, in, etc.
• Currently supported types in expressions:
  – For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc.
  – For =, <=>: String, Integer, Double, Date, Timestamp, etc.
• Example: A <= B
  – Based on the min/max/distinct-count/null-count values of A and B, decide the relationship between A and B. After evaluating this expression, we set new min/max/distinct-count/null-count values.
  – Assume all data is evenly distributed if no histogram information is available.
Filter Operator Example
• Column A (op) literal B
  – (op) can be "=", "<", "<=", ">", ">=", "like"
  – e.g., "l_orderkey = 3", "l_shipdate <= '1995-03-21'"
  – Column A's max/min/distinct-count/null-count should be updated
  – Example: Column A < value B, without histograms, assuming data is evenly distributed (a sketch follows):
    – If B <= A.min: filtering factor = 0% (A's statistics need to change)
    – If B > A.max: filtering factor = 100% (no need to change A's statistics)
    – Otherwise: filtering factor = (B.value - A.min) / (A.max - A.min), and we update A's statistics: A.min unchanged, A.max = B.value, A.ndv = A.ndv * filtering factor
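The range arithmetic above is easy to sketch in Scala (illustrative only; the ColumnStat shape and function name are ours, not Spark's internal API):

  // Selectivity of `col < literal` under the even-distribution assumption
  case class ColumnStat(min: Double, max: Double, ndv: Long, nullCount: Long)

  def lessThanSelectivity(col: ColumnStat, literal: Double): Double =
    if (literal <= col.min) 0.0                     // no rows qualify
    else if (literal > col.max) 1.0                 // all rows qualify
    else (literal - col.min) / (col.max - col.min)  // interpolate linearly

  // e.g., min = 0, max = 100, literal = 25 gives selectivity 0.25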
Filter Operator Example
• Column A (op) Column B
  – (op) can be "<", "<=", ">", ">="
  – We cannot assume how the two columns' data is distributed relative to each other, so the empirical filtering factor is set to 1/3
  – Example: Column A < Column B
    – If A's range lies entirely below B's range: selectivity = 100%
    – If A's range lies entirely above B's range: selectivity = 0%
    – If the two ranges overlap: selectivity = 33.3%
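Continuing the sketch above (reusing the illustrative ColumnStat class):

  // Selectivity of `colA < colB` when only min/max are known
  def columnLessThanSelectivity(a: ColumnStat, b: ColumnStat): Double =
    if (a.max < b.min) 1.0         // A entirely below B: always true
    else if (a.min >= b.max) 0.0   // A entirely at/above B: never true
    else 1.0 / 3.0                 // overlapping ranges: empirical factor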
Join Cardinality Estimation
• Inner join: the number of rows of "A join B on A.k1 = B.k1" is estimated as (a sketch follows):
  num(A join B) = num(A) * num(B) / max(distinct(A.k1), distinct(B.k1))
  – where num(A) is the number of records in table A and distinct is the number of distinct values of that column.
  – The underlying assumption of this formula is that each value of the smaller domain is included in the larger domain.
• We estimate cardinalities similarly for left-outer, right-outer, and full-outer joins.
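A minimal sketch of the formula (the function name is ours):

  // Inner-join cardinality under the containment assumption: every key
  // of the smaller domain also appears in the larger domain.
  def innerJoinCardinality(rowsA: Long, rowsB: Long,
                           ndvA: Long, ndvB: Long): Long =
    // BigInt avoids overflow when both inputs are in the billions
    (BigInt(rowsA) * BigInt(rowsB) / math.max(ndvA, ndvB)).toLong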
Other Operator Estimation
• Project: does not change the row count
• Aggregate: considers the uniqueness of the group-by columns
• Limit, Sample, etc.
Step 2: Cost Estimation and Optimal Plan Selection
Build Side Selection
• For two-way hash joins, we need to choose one operand as the build side and the other as the probe side (a sketch follows).
• Choose the lower-cost child as the build side of the hash join.
  – Before: the build side was selected based on original table sizes ➔ BuildRight
  – Now with CBO: the build side is selected based on the estimated cost of the operators below the join ➔ BuildLeft
[Diagram: a Join of (Filter t1.value = 200 over Scan t1: 5 billion records / 500 GB, filtered down to 1 million records / 100 MB) with (Scan t2: 100 million records / 20 GB).]
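A minimal sketch of the decision (illustrative; Spark's actual logic lives in its join selection strategy):

  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Compare the estimated sizes of the two children (after filters,
  // aggregates, etc.) rather than the raw base-table sizes.
  def chooseBuildSide(leftEstBytes: BigInt, rightEstBytes: BigInt): BuildSide =
    if (leftEstBytes <= rightEstBytes) BuildLeft else BuildRight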
Hash Join Implementation: Broadcast vs. Shuffle
• Logical plan:
  ➢ Equi-join
    • Inner Join
    • LeftSemi/LeftAnti Join
    • LeftOuter/RightOuter Join
  ➢ Theta-join
• Physical plan:
  ➢ SortMergeJoinExec / BroadcastHashJoinExec / ShuffledHashJoinExec
  ➢ CartesianProductExec / BroadcastNestedLoopJoinExec
• Broadcast criterion: whether the join side's output size is small (default 10 MB).
[Diagram: the criterion applies to the estimated output of a join side, whether it is a filtered scan (Filter t1.value = 100 over Scan t1: 5 billion records / 500 GB, reduced to only 1000 records / 100 KB, joined with Scan t2: 100 million records / 20 GB), an Aggregate, or another Join.]
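The broadcast size threshold is a real Spark configuration key; a short sketch:

  // Default is 10 MB; a join side whose estimated output size is below
  // this threshold is broadcast instead of shuffled.
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)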
Multi-way Join Reorder
• Reorder the joins using a dynamic programming algorithm (sketched below):
  1. First, put all items (base joined nodes) into level 0.
  2. Build all two-way joins at level 1 from plans at level 0 (single items).
  3. Build all 3-way joins from plans at previous levels (two-way joins and single items).
  4. Build all 4-way joins, and so on, until we build all n-way joins and pick the best plan among them.
• When building m-way joins, keep only the best plan (the optimal sub-solution) for each set of m items.
  – E.g., for 3-way joins of items {A, B, C}, we keep only the best plan among (A J B) J C, (A J C) J B, and (B J C) J A.
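A minimal sketch of this level-wise dynamic programming, restricted to left-deep plans and a cardinality-only cost model (illustrative; Spark's real implementation is the CostBasedJoinReorder rule):

  import scala.collection.mutable

  case class Plan(items: Set[String], card: Double, cost: Double)

  // tables:      base table name -> estimated row count
  // selectivity: combined selectivity of the predicates joining the
  //              left-hand set to the right-hand table (1.0 if none)
  def reorderJoins(tables: Map[String, Double],
                   selectivity: (Set[String], String) => Double): Plan = {
    val best = mutable.Map[Set[String], Plan]()
    tables.foreach { case (t, n) => best(Set(t)) = Plan(Set(t), n, 0.0) }
    for (level <- 2 to tables.size;
         combo <- tables.keySet.subsets(level);
         right <- combo if best.contains(combo - right)) {
      val left = best(combo - right)
      val card = left.card * tables(right) * selectivity(left.items, right)
      val cost = left.cost + card // cost = sum of intermediate cardinalities
      if (!best.contains(combo) || cost < best(combo).cost)
        best(combo) = Plan(combo, card, cost)
    }
    best(tables.keySet)
  }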
Multi-way Join Reorder
This is the classic dynamic programming approach of Selinger et al., Access Path Selection in a Relational Database Management System, SIGMOD 1979.
Join Cost Formula
• The cost of a plan is the sum of the costs of all its intermediate tables.
• Cost = weight × Cost_CPU + (1 - weight) × Cost_IO
  – In Spark, we use: cost = weight × cardinality + (1 - weight) × size
  – weight is a tuning parameter configured via spark.sql.cbo.joinReorder.card.weight (0.7 by default)
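As a sketch (the function is ours; the configuration key is real):

  // Relative cost used to compare candidate join plans
  def planCost(cardinality: Double, sizeInBytes: Double,
               weight: Double = 0.7): Double =
    weight * cardinality + (1 - weight) * sizeInBytes

  // The weight can be tuned per session:
  spark.conf.set("spark.sql.cbo.joinReorder.card.weight", "0.7")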
TPC-DS Benchmarks and Query Analysis
Zhenhua Wang
Huawei Technologies
Session 2 Topics
• Motivation
• Statistics Collection Framework
• Cost-Based Optimizations
• TPC-DS Benchmark and Query Analysis
• Demo
Preliminary Performance Test
• Setup:
  – TPC-DS at 1 TB (scale factor 1000)
  – 4-node cluster (Huawei FusionServer RH2288: 40 cores, 384 GB memory)
  – Apache Spark 2.2 RC (dated 5/12/2017)
• Statistics collection
  – A total of 24 tables and 425 columns
  – It took 14 minutes to collect statistics for all tables and all columns.
  – Fast, because all statistics are computed by integrating with Spark's built-in aggregate functions.
  – It would take much less time if we collected statistics only for the columns used in predicates, joins, and group-bys.
TPC-DS Query Q11
WITH year_total AS (
  SELECT
    c_customer_id customer_id,
    c_first_name customer_first_name,
    c_last_name customer_last_name,
    c_preferred_cust_flag customer_preferred_cust_flag,
    c_birth_country customer_birth_country,
    c_login customer_login,
    c_email_address customer_email_address,
    d_year dyear,
    sum(ss_ext_list_price - ss_ext_discount_amt) year_total,
    's' sale_type
  FROM customer, store_sales, date_dim
  WHERE c_customer_sk = ss_customer_sk
    AND ss_sold_date_sk = d_date_sk
  GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag,
    c_birth_country, c_login, c_email_address, d_year
  UNION ALL
  SELECT
    c_customer_id customer_id,
    c_first_name customer_first_name,
    c_last_name customer_last_name,
    c_preferred_cust_flag customer_preferred_cust_flag,
    c_birth_country customer_birth_country,
    c_login customer_login,
    c_email_address customer_email_address,
    d_year dyear,
    sum(ws_ext_list_price - ws_ext_discount_amt) year_total,
    'w' sale_type
  FROM customer, web_sales, date_dim
  WHERE c_customer_sk = ws_bill_customer_sk AND ws_sold_date_sk = d_date_sk
  GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag,
    c_birth_country, c_login, c_email_address, d_year)
SELECT t_s_secyear.customer_preferred_cust_flag
FROM year_total t_s_firstyear,
     year_total t_s_secyear,
     year_total t_w_firstyear,
     year_total t_w_secyear
WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id
  AND t_s_firstyear.customer_id = t_w_secyear.customer_id
  AND t_s_firstyear.customer_id = t_w_firstyear.customer_id
  AND t_s_firstyear.sale_type = 's'
  AND t_w_firstyear.sale_type = 'w'
  AND t_s_secyear.sale_type = 's'
  AND t_w_secyear.sale_type = 'w'
  AND t_s_firstyear.dyear = 2001
  AND t_s_secyear.dyear = 2001 + 1
  AND t_w_firstyear.dyear = 2001
  AND t_w_secyear.dyear = 2001 + 1
  AND t_s_firstyear.year_total > 0
  AND t_w_firstyear.year_total > 0
  AND CASE WHEN t_w_firstyear.year_total > 0
           THEN t_w_secyear.year_total / t_w_firstyear.year_total
           ELSE NULL END
    > CASE WHEN t_s_firstyear.year_total > 0
           THEN t_s_secyear.year_total / t_s_firstyear.year_total
           ELSE NULL END
ORDER BY t_s_secyear.customer_preferred_cust_flag
LIMIT 100
Query Analysis – Q11 CBO OFF
[Plan diagram: Join #1 pairs store_sales (2.9 billion rows) with customer (12 million rows), yielding 2.7 billion rows before date_dim (73,049 rows) is joined in, leaving 534 million rows. Join #2 similarly pairs web_sales (720 million rows) with customer first (719 million rows), then date_dim, leaving 144 million rows. Large join results feed Joins #3 and #4.]
Query Analysis – Q11 CBO ON
[Plan diagram: with CBO, Join #1 pairs store_sales (2.9 billion rows) with date_dim (73,049 rows) first, yielding 534 million rows, before joining customer (12 million rows): still 534 million rows. Join #2 likewise pairs web_sales (720 million rows) with date_dim first, then customer, leaving 144 million rows. The join results are 80% smaller, giving a 1.4x speedup.]
TPC-DS Query Q72
SELECT
  i_item_desc,
  w_warehouse_name,
  d1.d_week_seq,
  count(CASE WHEN p_promo_sk IS NULL THEN 1 ELSE 0 END) no_promo,
  count(CASE WHEN p_promo_sk IS NOT NULL THEN 1 ELSE 0 END) promo,
  count(*) total_cnt
FROM catalog_sales
  JOIN inventory ON (cs_item_sk = inv_item_sk)
  JOIN warehouse ON (w_warehouse_sk = inv_warehouse_sk)
  JOIN item ON (i_item_sk = cs_item_sk)
  JOIN customer_demographics ON (cs_bill_cdemo_sk = cd_demo_sk)
  JOIN household_demographics ON (cs_bill_hdemo_sk = hd_demo_sk)
  JOIN date_dim d1 ON (cs_sold_date_sk = d1.d_date_sk)
  JOIN date_dim d2 ON (inv_date_sk = d2.d_date_sk)
  JOIN date_dim d3 ON (cs_ship_date_sk = d3.d_date_sk)
  LEFT OUTER JOIN promotion ON (cs_promo_sk = p_promo_sk)
  LEFT OUTER JOIN catalog_returns ON (cr_item_sk = cs_item_sk AND cr_order_number = cs_order_number)
WHERE d1.d_week_seq = d2.d_week_seq
  AND inv_quantity_on_hand < cs_quantity
  AND d3.d_date > (cast(d1.d_date AS DATE) + interval 5 days)
  AND hd_buy_potential = '>10000'
  AND d1.d_year = 1999
  AND cd_marital_status = 'D'
GROUP BY i_item_desc, w_warehouse_name, d1.d_week_seq
ORDER BY total_cnt DESC, i_item_desc, w_warehouse_name, d_week_seq
LIMIT 100
Query Analysis – Q72 CBO OFF
[Plan diagram: Join #1 pairs catalog_sales (1.4 billion rows) with inventory (783 million rows), exploding to 223 billion rows; the next joins with warehouse (20 rows) and item (300,000 rows) keep the intermediate result at 223 billion rows. Only the later joins against customer_demographics (1.9 million rows), household_demographics (7,200 rows), and the three date_dim copies (73,049 rows each) shrink it, through 44.7 million, 7.5 million, 1.6 million, and 9 million rows, to 8.7 million rows at the end. Really large intermediate results.]
Query Analysis – Q72 CBO ON
[Plan diagram: with CBO, catalog_sales (1.4 billion rows) is first joined against the filtered dimensions: date_dim d1 (reduced to 2,555 rows), customer_demographics (1.9 million rows), and household_demographics (7,200 rows), shrinking the intermediate results through 238 million and 47.6 million rows before the joins involving inventory (783 million rows) produce roughly 1 billion rows and a final 8.7 million rows. Much smaller intermediate results: 2-3 orders of magnitude less, for an 8.1x speedup.]
TPC-DS Query Performance
[Bar chart: per-query runtime in seconds (0 to 1750) across the TPC-DS queries q1 through q98, comparing runs without CBO and with CBO.]
TPC-DS Query Speedup
• TPC-DS query speedup ratio with CBO versus without CBO
• 16 queries show a speedup of more than 30%
• The maximum speedup is 8x
• The geometric mean of the speedups is 2.2x
TPC-DS Query 64
WITH cs_ui AS
(SELECT
cs_item_sk,
sum(cs_ext_list_price) AS sale,
sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit) AS refund
FROM catalog_sales, catalog_returns
WHERE cs_item_sk = cr_item_sk AND cs_order_number = cr_order_number
GROUP BY cs_item_sk
HAVING sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit)),
cross_sales AS
(SELECT
i_product_name product_name, i_item_sk item_sk, s_store_name store_name,
s_zip store_zip, ad1.ca_street_number b_street_number, ad1.ca_street_name b_streen_name,
ad1.ca_city b_city, ad1.ca_zip b_zip, ad2.ca_street_number c_street_number,
ad2.ca_street_name c_street_name, ad2.ca_city c_city, ad2.ca_zip c_zip,
d1.d_year AS syear, d2.d_year AS fsyear, d3.d_year AS s2year,
count(*) cnt, sum(ss_wholesale_cost) s1, sum(ss_list_price) s2, sum(ss_coupon_amt) s3
FROM store_sales, store_returns, cs_ui, date_dim d1, date_dim d2, date_dim d3,
store, customer, customer_demographics cd1, customer_demographics cd2,
promotion, household_demographics hd1, household_demographics hd2,
customer_address ad1, customer_address ad2, income_band ib1, income_band ib2, item
WHERE ss_store_sk = s_store_sk AND ss_sold_date_sk = d1.d_date_sk AND
ss_customer_sk = c_customer_sk AND ss_cdemo_sk = cd1.cd_demo_sk AND
ss_hdemo_sk = hd1.hd_demo_sk AND ss_addr_sk = ad1.ca_address_sk AND
ss_item_sk = i_item_sk AND ss_item_sk = sr_item_sk AND
ss_ticket_number = sr_ticket_number AND ss_item_sk = cs_ui.cs_item_sk AND
c_current_cdemo_sk = cd2.cd_demo_sk AND c_current_hdemo_sk = hd2.hd_demo_sk AND
c_current_addr_sk = ad2.ca_address_sk AND c_first_sales_date_sk = d2.d_date_sk AND
c_first_shipto_date_sk = d3.d_date_sk AND ss_promo_sk = p_promo_sk AND
hd1.hd_income_band_sk = ib1.ib_income_band_sk AND
hd2.hd_income_band_sk = ib2.ib_income_band_sk AND
cd1.cd_marital_status <> cd2.cd_marital_status AND
i_color IN ('purple', 'burlywood', 'indian', 'spring', 'floral', 'medium') AND
i_current_price BETWEEN 64 AND 64 + 10 AND i_current_price BETWEEN 64 + 1 AND 64 + 15
GROUP BY i_product_name, i_item_sk, s_store_name, s_zip, ad1.ca_street_number,
ad1.ca_street_name, ad1.ca_city, ad1.ca_zip, ad2.ca_street_number,
ad2.ca_street_name, ad2.ca_city, ad2.ca_zip, d1.d_year, d2.d_year, d3.d_year)
SELECT
cs1.product_name,
cs1.store_name,
cs1.store_zip,
cs1.b_street_number,
cs1.b_streen_name,
cs1.b_city,
cs1.b_zip,
cs1.c_street_number,
cs1.c_street_name,
cs1.c_city,
cs1.c_zip,
cs1.syear,
cs1.cnt,
cs1.s1,
cs1.s2,
cs1.s3,
cs2.s1,
cs2.s2,
cs2.s3,
cs2.syear,
cs2.cnt
FROM cross_sales cs1, cross_sales cs2
WHERE cs1.item_sk = cs2.item_sk AND
cs1.syear = 1999 AND
cs2.syear = 1999 + 1 AND
cs2.cnt <= cs1.cnt AND
cs1.store_name = cs2.store_name AND
cs1.store_zip = cs2.store_zip
ORDER BY cs1.product_name, cs1.store_name, cs2.cnt
Query Analysis – Q64 CBO ON
[Physical plan, two fragments over store_sales and store_returns: with CBO, the cs_ui subquery feeds a BroadcastHashJoin via a BroadcastExchange, alongside SortMergeJoins, Sorts, Exchanges, and ReusedExchanges. 10% slower.]
Query Analysis – Q64 CBO OFF
[Physical plan, two fragments: without CBO, the cs_ui subquery is joined via SortMergeJoin in both fragments, and its shuffle Exchange is reused (ReusedExchange) rather than rebuilt.]
CBO Demo
Wenchen Fan
Databricks
Current Status, Credits, and Future Work
Ron Hu
Huawei Technologies
Available in Apache Spark 2.2
• Configured via spark.sql.cbo.enabled
• 'Off' by default. Why?
  – Spark is used in production
  – Many Spark users may already rely on "human intelligence" to write their queries in the best order
  – We plan to enable it by default in Spark 2.3
• We encourage you to test CBO with Spark 2.2!
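Enabling it for a session is a one-liner (the configuration keys are real; statistics must have been collected first):

  // Turn on cost-based optimization and join reordering
  spark.conf.set("spark.sql.cbo.enabled", true)
  spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)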
Current Status
• SPARK-16026 is the umbrella JIRA.
  – 32 sub-tasks have been resolved
  – A big project spanning 8 months
  – 10+ Spark contributors involved
  – 7000+ lines of Scala code contributed
• A good framework to allow further integrations
  – Use statistics to derive whether a join attribute is unique
  – Benefits star-schema detection and its integration into join reordering
Birth of Spark SQL CBO
• Prototype
  – In 2015, Ron Hu, Fang Cao, and others from Huawei's research department prototyped the CBO concept on Spark 1.2.
  – After a successful prototype, we shared the technology with Zhenhua Wang, Fei Wang, and others from Huawei's product development team.
• We delivered a talk at Spark Summit 2016:
  – "Enhancing Spark SQL Optimizer with Reliable Statistics"
• The talk was well received by the community.
  – https://issues.apache.org/jira/browse/SPARK-16026
Collaboration
• Good community support
  – Developers: Zhenhua Wang, Ron Hu, Reynold Xin, Wenchen Fan, Xiao Li
  – Reviewers: Wenchen, Herman, Reynold, Xiao, Liang-chi, Ioana, Nattavut, Hyukjin, Shuai, …
  – Extensive discussion in JIRAs and PRs (tens to hundreds of conversations each)
  – All the comments lengthened the development time, but improved code quality
• It was a pleasure working with the community.
Future Work: Cost-Based Optimizer
• The current cost formula is coarse:
  cost = cardinality × weight + size × (1 - weight)
• It cannot tell the cost difference between a sort-merge join and a hash join
  – spark.sql.join.preferSortMergeJoin defaults to true
• It underestimates (or ignores) shuffle cost.
• We will improve the cost formula in the next release.
Future Work: Statistics Collection Framework
• Advanced statistics: e.g., histograms, sketches
• Hint mechanism
• Partition-level statistics
• Speed up statistics collection by sampling data for large tables
Conclusion
• Motivation
• Statistics Collection Framework
  – Table/Column Level Statistics Collected
  – Cardinality Estimation (Filters, Joins, Aggregates, etc.)
• Cost-Based Optimizations
  – Build Side Selection
  – Multi-way Join Re-ordering
• TPC-DS Benchmarks
• Demo
Thank You.
ron.hu@huawei.com
wangzhenhua@huawei.com
sameer@databricks.com
wenchen@databricks.com
Multi-way Join Reorder – Example
• Given A J B J C J D with join conditions A.k1 = B.k1, B.k2 = C.k2, and C.k3 = D.k3:
  level 0: p({A}), p({B}), p({C}), p({D})
  level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D})
  level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D})
  level 3: p({A, B, C, D}) -- the final output plan
Multi-way Join Reorder – Example
• Pruning strategy: exclude cartesian-product candidates. This significantly reduces the search space: given the join conditions above, candidates such as p({A, C}), p({A, D}), p({B, D}), p({A, B, D}), and p({A, C, D}) are pruned, because each could only be built with a cartesian product.
  level 0: p({A}), p({B}), p({C}), p({D})
  level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D})
  level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D})
  level 3: p({A, B, C, D}) -- the final output plan
New Commands in Apache Spark 2.2
• CBO commands
  – Collect table-level statistics:
    ANALYZE TABLE table_name COMPUTE STATISTICS
  – Collect column-level statistics:
    ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name1, column_name2, …
  – Display statistics in the optimized logical plan:
    EXPLAIN COST
    SELECT cc_call_center_sk, cc_call_center_id FROM call_center;
    …
    == Optimized Logical Plan ==
    Project [cc_call_center_sk#75, cc_call_center_id#76], Statistics(sizeInBytes=1680.0 B, rowCount=42, hints=none)
    +- Relation[…31 fields] parquet, Statistics(sizeInBytes=22.5 KB, rowCount=42, hints=none)
    …
  • 7. Rule-based Optimizer in Spark 2.1 • Most of Spark SQL optimizer’s rules are heuristics rules. – PushDownPredicate, ColumnPruning, ConstantFolding,… • Does NOT consider the cost of each operator • Does NOT consider selectivity when estimating join relation size • Join order is mostly decided by its position in the SQL queries • Physical Join implementation is decided based on heuristics 7
  • 8. An Example (TPC-DS q11 variant) 8 SCAN: store_sales SCAN: customer SCAN: date_dim FILTER JOIN JOIN SELECT customer_id FROM customer, store_sales, date_dim WHERE c_customer_sk = ss_customer_sk AND ss_sold_date_sk = d_date_sk AND c_customer_sk > 1000
  • 9. An Example (TPC-DS q11 variant) 9 SCAN: store_sales SCAN: customer SCAN: date_dim FILTER JOIN JOIN 3 billion 12 million 2.5 billion 10 million 500 million 0.1 million
  • 10. An Example (TPC-DS q11 variant) 10 SCAN: store_sales SCAN: customer SCAN: date_dim FILTERJOIN JOIN 3 billion 12 million 2.5 billion 500 million 10 million 500 million 0.1 million 40% faster 80% less data
  • 11. An Example (TPC-DS q11 variant) 11 SCAN: store_sales SCAN: customer SCAN: date_dim FILTERJOIN JOIN 3 billion 12 million 2.5 billion 500 million 10 million 500 million 0.1 million How do we automatically optimize queries like these?
  • 12. Cost Based Optimizer (CBO) • Collect, infer and propagate table/column statistics on source/intermediate data • Calculate the cost for each operator in terms of number of output rows, size of output, etc. • Based on the cost calculation, pick the most optimal query execution plan 12
  • 13. Rest of the Talk • Statistics Collection Framework – Table/Column Level Statistics Collected – Cardinality Estimation (Filters, Joins, Aggregates etc.) • Cost-based Optimizations – Build Side Selection – Multi-way Join Re-ordering • TPC-DS Benchmarks • Demo 13
  • 14. Statistics Collection Framework and Cost Based Optimizations Ron Hu Huawei Technologies
  • 15. Step 1: Collect, infer and propagate table and column statistics on source and intermediate data
  • 16. Table Statistics Collected • Command to collect statistics of a table. – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS • It collects table level statistics and saves into metastore. – Number of rows – Table size in bytes 16
  • 17. Column Statistics Collected • Command to collect column-level statistics of individual columns. – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, … • It collects column-level statistics and saves them into the metastore. – String/Binary types: distinct count, null count, average length, max length – Numeric/Date/Timestamp types: distinct count, max, min, null count, average length (fixed length), max length (fixed length) 17
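For concreteness, here is a minimal sketch (Spark 2.2, Scala shell) of issuing both commands; the table name customer and the column names are hypothetical stand-ins:

    // Minimal sketch: collect table-level and column-level statistics so the
    // optimizer can use them. Assumes a SparkSession `spark` backed by a Hive
    // metastore and a hypothetical table `customer`.
    spark.sql("ANALYZE TABLE customer COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_birth_country")
    // The collected statistics are persisted in the metastore and reused across sessions.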
  • 18. Filter Cardinality Estimation • Between logical expressions: AND, OR, NOT • Within each logical expression: =, <, <=, >, >=, in, etc. • Currently supported types in expressions: – For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc. – For =, <=>: String, Integer, Double, Date, Timestamp, etc. • Example: A <= B – Based on the min/max/distinct count/null count values of A and B, decide the relationship between A and B; after evaluating the expression, set the new min/max/distinct count/null count. – Assume all the data is evenly distributed if no histogram information is available. 18
  • 19. Filter Operator Example • Column A (op) literal B – (op) can be "=", "<", "<=", ">", ">=", "like" – e.g., predicates such as l_orderkey = 3 or l_shipdate <= '1995-03-21' – The column's max/min/distinct count/null count are updated accordingly. – Example: Column A < value B. Without histograms, suppose the data is evenly distributed: if B <= A.min, the filtering factor is 0% (no rows qualify); if B > A.max, the filtering factor is 100% and there is no need to change A's statistics; otherwise Filtering Factor = (B.value - A.min) / (A.max - A.min), and we set A.max = B.value (A.min unchanged) and A.ndv = A.ndv * Filtering Factor. 19
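A compact sketch of this rule, using a hypothetical ColStats record for the column statistics and assuming a uniform distribution with no histogram (this is an illustration, not Spark's actual code):

    case class ColStats(min: Double, max: Double, ndv: Long)

    // Selectivity and updated statistics for the predicate "A < b" (b a literal).
    def lessThanLiteral(a: ColStats, b: Double): (Double, ColStats) =
      if (b <= a.min) (0.0, a.copy(ndv = 0L))          // no rows qualify
      else if (b > a.max) (1.0, a)                     // all rows qualify; stats unchanged
      else {
        val factor = (b - a.min) / (a.max - a.min)     // filtering factor
        (factor, ColStats(a.min, b, (a.ndv * factor).toLong))  // A.max := b, scale ndv
      }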
  • 20. Filter Operator Example • Column A (op) Column B – (op) can be "<", "<=", ">", ">=" – We cannot assume the data is evenly distributed across two columns, so the empirical filtering factor is set to 1/3 when their ranges overlap. – Example: Column A < Column B. Comparing the ranges [A.min, A.max] and [B.min, B.max]: if A's range lies entirely below B's, selectivity = 100%; if it lies entirely above, selectivity = 0%; if the two ranges overlap (in either direction), selectivity = 33.3%. 20
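The same heuristic as a sketch, reusing the hypothetical ColStats above:

    // Selectivity for "A < B" (two columns), decided purely from range overlap.
    def lessThanColumn(a: ColStats, b: ColStats): Double =
      if (a.max < b.min) 1.0        // A's range lies entirely below B's
      else if (a.min >= b.max) 0.0  // A's range lies entirely above B's
      else 1.0 / 3                  // ranges overlap: empirical factor of 1/3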
  • 21. Join Cardinality Estimation • Inner-Join: The number of rows of "A join B on A.k1 = B.k1" is estimated as: num(A ⋈ B) = num(A) * num(B) / max(distinct(A.k1), distinct(B.k1)) – where num(A) is the number of records in table A and distinct is the number of distinct values of that column. – The underlying assumption for this formula is that each value of the smaller domain is included in the larger domain. • We similarly estimate cardinalities for Left-Outer Join, Right-Outer Join and Full-Outer Join. 21
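A worked example with made-up round numbers in the spirit of the q11 plans: for a 3-billion-row fact table A and a 12-million-row dimension B joined on a key with 12 million distinct values on both sides, num(A ⋈ B) = 3x10^9 * 1.2x10^7 / max(1.2x10^7, 1.2x10^7) = 3x10^9, i.e., every fact row finds exactly one match. As a one-line sketch:

    // Inner-join cardinality: a direct transcription of the formula above.
    def innerJoinRows(numA: BigInt, numB: BigInt, ndvA: BigInt, ndvB: BigInt): BigInt =
      numA * numB / (ndvA max ndvB)

    innerJoinRows(BigInt(3000000000L), BigInt(12000000), BigInt(12000000), BigInt(12000000))
    // => 3000000000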
  • 22. Other Operator Estimation • Project: does not change row count • Aggregate: consider uniqueness of group-by columns • Limit, Sample, etc. 22
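To make the aggregate case concrete, a common cap (a sketch consistent with the bullet above; Spark's estimator follows the same idea) bounds the output by both the input row count and the number of distinct group-by combinations:

    // Aggregate cardinality sketch: output rows cannot exceed the input row
    // count, nor the product of the group-by columns' distinct counts.
    def aggregateRows(childRows: BigInt, groupByNdvs: Seq[BigInt]): BigInt =
      if (groupByNdvs.isEmpty) BigInt(1) min childRows  // global aggregate: at most one row
      else childRows min groupByNdvs.product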
  • 23. Step 2: Cost Estimation and Optimal Plan Selection
  • 24. Build Side Selection • For two-way hash joins, we need to choose one operand as the build side and the other as the probe side. • Choose the lower-cost child as the build side of the hash join. – Before: the build side was selected based on original table sizes. ➔ BuildRight – Now with CBO: the build side is selected based on the estimated cost of the operators below the join. ➔ BuildLeft • Example: t1 holds 5 billion records (500 GB), but the filter t1.value = 200 leaves only 1 million records (100 MB), while t2 holds 100 million records (20 GB); by raw table size t2 would be the build side (BuildRight), whereas CBO picks the filtered t1 (BuildLeft). 24
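Once each child has an estimated cost, the choice itself is a comparison; a sketch with a hypothetical PlanStats holding the estimates:

    // Build-side selection sketch: hash the cheaper (smaller) side, probe with
    // the other. PlanStats is a toy stand-in for per-operator estimates.
    case class PlanStats(rowCount: BigInt, sizeInBytes: BigInt)
    sealed trait BuildSide
    case object BuildLeft extends BuildSide
    case object BuildRight extends BuildSide

    def chooseBuildSide(left: PlanStats, right: PlanStats): BuildSide =
      if (left.sizeInBytes <= right.sizeInBytes) BuildLeft else BuildRight

In the example above, the filtered t1 (about 100 MB) beats t2 (about 20 GB), flipping the decision from BuildRight to BuildLeft.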
  • 25. Hash Join Implementation: Broadcast vs. Shuffle • Logical plan: ➢ Equi-join (Inner Join, LeftSemi/LeftAnti Join, LeftOuter/RightOuter Join) ➢ Theta-join • Physical plan: ➢ SortMergeJoinExec / BroadcastHashJoinExec / ShuffledHashJoinExec for equi-joins ➢ CartesianProductExec / BroadcastNestedLoopJoinExec for theta-joins • Broadcast criterion: whether the join side's estimated output size is small (default 10 MB). • Example: t1 holds 5 billion records (500 GB), but the filter t1.value = 100 leaves only 1,000 records (100 KB), so that side can be broadcast for the join with t2 (100 million records, 20 GB); the same applies when the small side is the output of an aggregate or of another join. 25
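The size threshold is controlled by spark.sql.autoBroadcastJoinThreshold (10 MB by default); a sketch of inspecting the chosen physical join, using hypothetical tables t1 and t2:

    // Sketch: with CBO statistics in place, a selective filter can shrink one
    // side below the broadcast threshold, turning a shuffle join into a
    // broadcast join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)
    spark.sql("""
      SELECT *
      FROM t1 JOIN t2 ON t1.key = t2.key
      WHERE t1.value = 100
    """).explain()   // look for BroadcastHashJoin vs. SortMergeJoin in the plan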
  • 26. Multi-way Join Reorder • Reorder the joins using a dynamic programming algorithm (a sketch follows below). 1. First we put all items (basic joined nodes) into level 0. 2. Build all two-way joins at level 1 from plans at level 0 (single items). 3. Build all 3-way joins from plans at previous levels (two-way joins and single items). 4. Build all 4-way joins, and so on, until we have built all n-way joins and pick the best plan among them. • When building m-way joins, only keep the best plan (optimal sub-solution) for the same set of m items. – E.g., for 3-way joins of items {A, B, C}, we keep only the best plan among (A J B) J C, (A J C) J B and (B J C) J A. 26
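Below is a self-contained sketch of this level-wise dynamic program. Plan, joinCost, and the string join tree are toy stand-ins (Spark's real implementation is the CostBasedJoinReorder rule and is more involved); the cartesian-product pruning from the later slide is folded in via the edges set:

    case class Plan(items: Set[String], cost: Double, tree: String)

    def reorder(base: Seq[Plan],
                edges: Set[(String, String)],
                joinCost: (Plan, Plan) => Double): Plan = {
      import scala.collection.mutable

      // Two sub-plans can be joined without a cartesian product if some join
      // condition links an item on one side to an item on the other.
      def connected(l: Plan, r: Plan): Boolean =
        edges.exists { case (a, b) =>
          (l.items(a) && r.items(b)) || (l.items(b) && r.items(a))
        }

      val n = base.size
      // best(k): cheapest known plan for each set of k+1 items (level k).
      val best = Array.fill(n)(mutable.Map.empty[Set[String], Plan])
      base.foreach(p => best(0)(p.items) = p)

      for {
        level <- 1 until n
        lLvl  <- 0 to (level - 1) / 2      // split the item set across two smaller levels
        rLvl   = level - 1 - lLvl
        l     <- best(lLvl).values
        r     <- best(rLvl).values
        if (l.items & r.items).isEmpty && connected(l, r)
      } {
        val items = l.items | r.items
        val cost  = l.cost + r.cost + joinCost(l, r)
        if (best(level).get(items).forall(_.cost > cost))  // keep only the cheapest plan per item set
          best(level)(items) = Plan(items, cost, s"(${l.tree} J ${r.tree})")
      }
      best(n - 1).values.minBy(_.cost)     // assumes a connected join graph
    }

For the four-table example on slide 53 (A J B J C J D), base holds four single-table plans and edges = Set(("A","B"), ("B","C"), ("C","D")), which prunes exactly the cartesian candidates struck out on slide 54.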
  • 27. Multi-way Join Reorder 27 Selinger et al. Access Path Selection in a Relational Database Management System. In SIGMOD 1979
  • 28. Join Cost Formula • The cost of a plan is the sum of the costs of all intermediate tables. • Cost = weight * Cost_CPU + (1 - weight) * Cost_IO – In Spark, we use: cost = weight * cardinality + (1 - weight) * size – weight is a tuning parameter configured via spark.sql.cbo.joinReorder.card.weight (0.7 by default) 28
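Written out, the per-table cost reduces to one line; the 0.7 default is the value named above:

    // Join-reorder cost sketch: weighted blend of estimated cardinality (CPU
    // proxy) and estimated size in bytes (I/O proxy). A plan's total cost is
    // the sum of this quantity over its intermediate results.
    def planCost(cardinality: BigInt, sizeInBytes: BigInt, weight: Double = 0.7): Double =
      weight * cardinality.toDouble + (1 - weight) * sizeInBytes.toDouble

    // The weight is tunable per session:
    spark.conf.set("spark.sql.cbo.joinReorder.card.weight", "0.7")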
  • 29. TPC-DS Benchmarks and Query Analysis Zhenhua Wang Huawei Technologies
  • 30. Session 2 Topics • Motivation • Statistics Collection Framework • Cost Based Optimizations • TPC-DS Benchmark and Query Analysis • Demo 30
  • 31. Preliminary Performance Test • Setup: − TPC-DS size at 1 TB (scale factor 1000) − 4-node cluster (Huawei FusionServer RH2288: 40 cores, 384 GB memory) − Apache Spark 2.2 RC (dated 5/12/2017) • Statistics collection – A total of 24 tables and 425 columns ➢ It takes 14 minutes to collect statistics for all tables and all columns. – This is fast because all statistics are computed by integrating with Spark's built-in aggregate functions. – It would take much less time if we collected statistics only for the columns used in predicates, joins, and group-bys. 31
  • 32. TPC-DS Query Q11 32 WITH year_total AS ( SELECT c_customer_id customer_id, c_first_name customer_first_name, c_last_name customer_last_name, c_preferred_cust_flag customer_preferred_cust_flag, c_birth_country customer_birth_country, c_login customer_login, c_email_address customer_email_address, d_year dyear, sum(ss_ext_list_price - ss_ext_discount_amt) year_total, 's' sale_type FROM customer, store_sales, date_dim WHERE c_customer_sk = ss_customer_sk AND ss_sold_date_sk = d_date_sk GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year UNION ALL SELECT c_customer_id customer_id, c_first_name customer_first_name, c_last_name customer_last_name, c_preferred_cust_flag customer_preferred_cust_flag, c_birth_country customer_birth_country, c_login customer_login, c_email_address customer_email_address, d_year dyear, sum(ws_ext_list_price - ws_ext_discount_amt) year_total, 'w' sale_type FROM customer, web_sales, date_dim WHERE c_customer_sk = ws_bill_customer_sk AND ws_sold_date_sk = d_date_sk GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year) SELECT t_s_secyear.customer_preferred_cust_flag FROM year_total t_s_firstyear, year_total t_s_secyear, year_total t_w_firstyear, year_total t_w_secyear WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id AND t_s_firstyear.customer_id = t_w_secyear.customer_id AND t_s_firstyear.customer_id = t_w_firstyear.customer_id AND t_s_firstyear.sale_type = 's' AND t_w_firstyear.sale_type = 'w' AND t_s_secyear.sale_type = 's' AND t_w_secyear.sale_type = 'w' AND t_s_firstyear.dyear = 2001 AND t_s_secyear.dyear = 2001 + 1 AND t_w_firstyear.dyear = 2001 AND t_w_secyear.dyear = 2001 + 1 AND t_s_firstyear.year_total > 0 AND t_w_firstyear.year_total > 0 AND CASE WHEN t_w_firstyear.year_total > 0 THEN t_w_secyear.year_total / t_w_firstyear.year_total ELSE NULL END > CASE WHEN t_s_firstyear.year_total > 0 THEN t_s_secyear.year_total / t_s_firstyear.year_total ELSE NULL END ORDER BY t_s_secyear.customer_preferred_cust_flag LIMIT 100
  • 33. Query Analysis – Q11 CBO OFF [Join-order diagram] Without CBO, Join #1 combines store_sales (2.9 billion rows) with customer (12 million rows) first, producing a 2.7 billion-row intermediate result that only then shrinks to 534 million rows against date_dim (73,049 rows); likewise Join #2 combines web_sales (720 million rows) with customer into 719 million rows before date_dim reduces it to 144 million. Large join results throughout. 33
  • 34. Query Analysis – Q11 CBO ON [Join-order diagram] With CBO, Join #1 combines store_sales (2.9 billion rows) with date_dim (73,049 rows) first, shrinking the result to 534 million rows before the join with customer (12 million rows); likewise web_sales (720 million rows) joins date_dim down to 144 million rows before customer. Small join results: 80% less intermediate data and a 1.4x speedup. 34
  • 35. TPC-DS Query Q72 35 SELECT i_item_desc, w_warehouse_name, d1.d_week_seq, count(CASE WHEN p_promo_sk IS NULL THEN 1 ELSE 0 END) no_promo, count(CASE WHEN p_promo_sk IS NOT NULL THEN 1 ELSE 0 END) promo, count(*) total_cnt FROM catalog_sales JOIN inventory ON (cs_item_sk = inv_item_sk) JOIN warehouse ON (w_warehouse_sk = inv_warehouse_sk) JOIN item ON (i_item_sk = cs_item_sk) JOIN customer_demographics ON (cs_bill_cdemo_sk = cd_demo_sk) JOIN household_demographics ON (cs_bill_hdemo_sk = hd_demo_sk) JOIN date_dim d1 ON (cs_sold_date_sk = d1.d_date_sk) JOIN date_dim d2 ON (inv_date_sk = d2.d_date_sk) JOIN date_dim d3 ON (cs_ship_date_sk = d3.d_date_sk) LEFT OUTER JOIN promotion ON (cs_promo_sk = p_promo_sk) LEFT OUTER JOIN catalog_returns ON (cr_item_sk = cs_item_sk AND cr_order_number = cs_order_number) WHERE d1.d_week_seq = d2.d_week_seq AND inv_quantity_on_hand < cs_quantity AND d3.d_date > (cast(d1.d_date AS DATE) + interval 5 days) AND hd_buy_potential = '>10000' AND cd_marital_status = 'D' AND d1.d_year = 1999 GROUP BY i_item_desc, w_warehouse_name, d1.d_week_seq ORDER BY total_cnt DESC, i_item_desc, w_warehouse_name, d_week_seq LIMIT 100
  • 36. Query Analysis – Q72 CBO OFF [Join-order diagram] Without CBO, catalog_sales (1.4 billion rows) is joined with inventory (783 million rows) early, and the first joins carry intermediate results of up to 223 billion rows; the small dimension tables (warehouse: 20 rows, item: 300,000, date_dim: 73,049 each, household_demographics: 7,200, customer_demographics: 1.9 million) only shrink the result later, ending at 8.7 million rows. Really large intermediate results. 36
  • 37. Query Analysis – Q72 CBO ON [Join-order diagram] With CBO, the selective dimension joins run first, keeping intermediate results at 238 million and then 47.6 million rows, so the join with inventory (783 million rows) produces about 1 billion rows instead of 223 billion, again ending at 8.7 million rows. Much smaller intermediate results: 2-3 orders of magnitude less, and an 8.1x speedup. 37
  • 38. TPC-DS Query Performance 38 [Bar chart: runtime in seconds (0 to 1750) for TPC-DS queries q1 through q98, without CBO vs. with CBO.]
  • 39. TPC-DS Query Speedup • TPC-DS query speedup ratio with CBO versus without CBO • 16 queries show a speedup of more than 30%. • The maximum speedup is 8X. • The geometric mean of the speedups is 2.2X. 39
  • 40. TPC-DS Query 64 40 WITH cs_ui AS (SELECT cs_item_sk, sum(cs_ext_list_price) AS sale, sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit) AS refund FROM catalog_sales, catalog_returns WHERE cs_item_sk = cr_item_sk AND cs_order_number = cr_order_number GROUP BY cs_item_sk HAVING sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit)), cross_sales AS (SELECT i_product_name product_name, i_item_sk item_sk, s_store_name store_name, s_zip store_zip, ad1.ca_street_number b_street_number, ad1.ca_street_name b_streen_name, ad1.ca_city b_city, ad1.ca_zip b_zip, ad2.ca_street_number c_street_number, ad2.ca_street_name c_street_name, ad2.ca_city c_city, ad2.ca_zip c_zip, d1.d_year AS syear, d2.d_year AS fsyear, d3.d_year s2year, count(*) cnt, sum(ss_wholesale_cost) s1, sum(ss_list_price) s2, sum(ss_coupon_amt) s3 FROM store_sales, store_returns, cs_ui, date_dim d1, date_dim d2, date_dim d3, store, customer, customer_demographics cd1, customer_demographics cd2, promotion, household_demographics hd1, household_demographics hd2, customer_address ad1, customer_address ad2, income_band ib1, income_band ib2, item WHERE ss_store_sk = s_store_sk AND ss_sold_date_sk = d1.d_date_sk AND ss_customer_sk = c_customer_sk AND ss_cdemo_sk = cd1.cd_demo_sk AND ss_hdemo_sk = hd1.hd_demo_sk AND ss_addr_sk = ad1.ca_address_sk AND ss_item_sk = i_item_sk AND ss_item_sk = sr_item_sk AND ss_ticket_number = sr_ticket_number AND ss_item_sk = cs_ui.cs_item_sk AND c_current_cdemo_sk = cd2.cd_demo_sk AND c_current_hdemo_sk = hd2.hd_demo_sk AND c_current_addr_sk = ad2.ca_address_sk AND c_first_sales_date_sk = d2.d_date_sk AND c_first_shipto_date_sk = d3.d_date_sk AND ss_promo_sk = p_promo_sk AND hd1.hd_income_band_sk = ib1.ib_income_band_sk AND hd2.hd_income_band_sk = ib2.ib_income_band_sk AND cd1.cd_marital_status <> cd2.cd_marital_status AND i_color IN ('purple', 'burlywood', 'indian', 'spring', 'floral', 'medium') AND i_current_price BETWEEN 64 AND 64 + 10 AND i_current_price BETWEEN 64 + 1 AND 64 + 15 GROUP BY i_product_name, i_item_sk, s_store_name, s_zip, ad1.ca_street_number, ad1.ca_street_name, ad1.ca_city, ad1.ca_zip, ad2.ca_street_number, ad2.ca_street_name, ad2.ca_city, ad2.ca_zip, d1.d_year, d2.d_year, d3.d_year) SELECT cs1.product_name, cs1.store_name, cs1.store_zip, cs1.b_street_number, cs1.b_streen_name, cs1.b_city, cs1.b_zip, cs1.c_street_number, cs1.c_street_name, cs1.c_city, cs1.c_zip, cs1.syear, cs1.cnt, cs1.s1, cs1.s2, cs1.s3, cs2.s1, cs2.s2, cs2.s3, cs2.syear, cs2.cnt FROM cross_sales cs1, cross_sales cs2 WHERE cs1.item_sk = cs2.item_sk AND cs1.syear = 1999 AND cs2.syear = 1999 + 1 AND cs2.cnt <= cs1.cnt AND cs1.store_name = cs2.store_name AND cs1.store_zip = cs2.store_zip ORDER BY cs1.product_name, cs1.store_name, cs2.cnt
  • 41. Query Analysis – Q64 CBO ON 41 [Physical plan diagram, Fragments 1 and 2: over FileScans of store_sales and store_returns, the cs_ui subquery feeds a BroadcastHashJoin (BroadcastExchange plus ReusedExchange operators) followed by SortMergeJoins; despite the broadcast, this plan is 10% slower.]
  • 42. Query Analysis – Q64 CBO OFF 42 [Physical plan diagram, Fragments 1 and 2: the same subtree over FileScans of store_sales and store_returns is executed entirely with SortMergeJoins over the cs_ui subquery, with exchanges reused across the two fragments.]
  • 44. Current Status, Credits and Future Work Ron Hu Huawei Technologies
  • 45. Available in Apache Spark 2.2 • Configured via spark.sql.cbo.enabled • 'Off' by default. Why? – Spark is used in production. – Many Spark users may already rely on "human intelligence" to write queries in the best order. – We plan to enable it by default in Spark 2.3. • We encourage you to test CBO with Spark 2.2! 45
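A sketch of turning it on for a session; spark.sql.cbo.enabled is the switch named above, while spark.sql.cbo.joinReorder.enabled is assumed here to be the companion flag that gates join reordering separately:

    // Sketch: enable CBO (and, assuming the companion flag, join reordering)
    // before collecting statistics and running queries.
    spark.conf.set("spark.sql.cbo.enabled", true)
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)  // assumed companion flag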
  • 46. Current Status • SPARK-16026 is the umbrella JIRA. – 32 sub-tasks have been resolved – A big project spanning 8 months – 10+ Spark contributors involved – 7,000+ lines of Scala code contributed • A good framework that allows further integration – Use statistics to derive whether a join attribute is unique – This benefits star-schema detection and its integration into join reordering 46
  • 47. Birth of Spark SQL CBO • Prototype – In 2015, Ron Hu, Fang Cao, and others from Huawei's research department prototyped the CBO concept on Spark 1.2. – After a successful prototype, we shared the technology with Zhenhua Wang, Fei Wang, and others from Huawei's product development team. • We delivered a talk at Spark Summit 2016: – "Enhancing Spark SQL Optimizer with Reliable Statistics" • The talk was well received by the community. – https://issues.apache.org/jira/browse/SPARK-16026 47
  • 48. Collaboration • Good community support – Developers: Zhenhua Wang, Ron Hu, Reynold Xin, Wenchen Fan, Xiao Li – Reviewers: Wenchen, Herman, Reynold, Xiao, Liang-chi, Ioana, Nattavut, Hyukjin, Shuai, … – Extensive discussion in JIRAs and PRs (tens to hundreds of comments each). – All the comments made the development time longer, but improved code quality. • It was a pleasure working with the community. 48
  • 49. Future Work: Cost Based Optimizer • The current cost formula is coarse: Cost = cardinality * weight + size * (1 - weight) • It cannot tell the cost difference between sort-merge join and hash join – spark.sql.join.preferSortMergeJoin defaults to true. • It underestimates (or ignores) shuffle cost. • We will improve the cost formula in the next release. 49
  • 50. Future Work: Statistics Collection Framework • Advanced statistics: e.g. histograms, sketches. • Hint mechanism. • Partition level statistics. • Speed up statistics collection by sampling data for large tables. 50
  • 51. Conclusion • Motivation • Statistics Collection Framework – Table/Column Level Statistics Collected – Cardinality Estimation (Filters, Joins, Aggregates etc.) • Cost-based Optimizations – Build Side Selection – Multi-way Join Re-ordering • TPC-DS Benchmarks • Demo 51
  • 53. Multi-way Join Reorder – Example • Given A J B J C J D with join conditions A.k1 = B.k1 and B.k2 = C.k2 and C.k3 = D.k3 level 0: p({A}), p({B}), p({C}), p({D}) level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D}) level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D}) level 3: p({A, B, C, D}) -- final output plan 53
  • 54. Multi-way Join Reorder – Example • Pruning strategy: exclude cartesian-product candidates. This significantly reduces the search space. Given the join conditions above (A.k1 = B.k1, B.k2 = C.k2, C.k3 = D.k3), any item set not connected by a join condition is pruned: level 0: p({A}), p({B}), p({C}), p({D}) level 1: p({A, B}), p({B, C}), p({C, D}) (pruned: p({A, C}), p({A, D}), p({B, D})) level 2: p({A, B, C}), p({B, C, D}) (pruned: p({A, B, D}), p({A, C, D})) level 3: p({A, B, C, D}) -- final output plan 54
  • 55. New Commands in Apache Spark 2.2 • CBO commands – Collect table-level statistics • ANALYZE TABLE table_name COMPUTE STATISTICS – Collect column-level statistics • ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name1, column_name2, … – Display statistics in the optimized logical plan > EXPLAIN COST SELECT cc_call_center_sk, cc_call_center_id FROM call_center; … == Optimized Logical Plan == Project [cc_call_center_sk#75, cc_call_center_id#76], Statistics(sizeInBytes=1680.0 B, rowCount=42, hints=none) +- Relation[…31 fields] parquet, Statistics(sizeInBytes=22.5 KB, rowCount=42, hints=none) … 55