20 Cost Based Optimization Annotated
Cost-Based Query Optimization Recap
Recap
Query Optimization
Today’s Agenda
Cost-Based Query Optimization Plan Cost Estimation
Cost Estimation
Statistics
• The DBMS stores internal statistics about tables, attributes, and indexes in its internal
catalog.
• Different systems update them at different times.
• Manual invocations:
▶ Postgres/SQLite: ANALYZE
▶ Oracle/MySQL: ANALYZE TABLE
▶ SQL Server: UPDATE STATISTICS
▶ DB2: RUNSTATS
Derivable Statistics
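• The selection cardinality SC(A, R) is the average number of records that share a given
value of attribute A. It is given by N_R / V(A, R), where N_R is the number of tuples in R
and V(A, R) is the number of distinct values of A in R.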
• What could go wrong with this estimate?
• Note that this assumes data uniformity.
▶ 10,000 students, 10 colleges – how many students in SCS?
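As a minimal sketch of the estimate above, the following Python computes N_R / V(A, R) for the student example. The function name `selection_cardinality` and its parameter names are hypothetical, chosen only for illustration.

```python
def selection_cardinality(num_tuples: int, num_distinct: int) -> float:
    """Estimate tuples per distinct value, assuming values are uniformly distributed."""
    if num_distinct <= 0:
        raise ValueError("num_distinct must be positive")
    return num_tuples / num_distinct


# 10,000 students spread over 10 colleges.
estimate = selection_cardinality(10_000, 10)
print(estimate)  # 1000.0
```

Under uniformity every college, SCS included, is estimated at 1,000 students, whatever its real enrollment is.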
Selection Statistics
Complex Predicates
• Range Predicate:
▶ sel(A >= a) = (A_max - a) / (A_max - A_min)
▶ Example: sel(age >= 2) ≈ (4 - 2) / (4 - 0) = 1/2
SELECT * FROM people WHERE age >= 2
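The range formula above can be sketched in Python as follows. The function name `range_selectivity` is hypothetical, and clamping the result to [0, 1] is an added assumption for constants that fall outside the observed value range; it is not part of the formula on the slide.

```python
def range_selectivity(a: float, a_min: float, a_max: float) -> float:
    """Estimate sel(A >= a) as (A_max - a) / (A_max - A_min), assuming uniform values."""
    if a_max <= a_min:
        raise ValueError("a_max must be greater than a_min")
    sel = (a_max - a) / (a_max - a_min)
    # Assumption: constants outside [a_min, a_max] select everything or nothing.
    return max(0.0, min(1.0, sel))


# age ranges over 0..4 and the predicate is age >= 2.
print(range_selectivity(2, 0, 4))  # 0.5
```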
Selection Cardinality
Correlated Attributes
Cost Estimation
• Our formulas are nice, but we assume that data values are uniformly distributed.
• Vary the width of buckets so that the total number of occurrences for each bucket is
roughly the same.
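One way to picture buckets of varying width is the sketch below, which sorts the values and cuts them into groups of roughly equal size. The function name `equi_depth_buckets` and the sample data are hypothetical. As a simplification, equal values may straddle a bucket boundary here, which a real system would likely avoid.

```python
def equi_depth_buckets(values, num_buckets):
    """Split values into buckets holding roughly the same number of occurrences.

    Returns a list of (low, high, count) tuples, one per non-empty bucket.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        start = i * n // num_buckets
        end = (i + 1) * n // num_buckets
        if start < end:  # skip empty buckets when there are few values
            chunk = ordered[start:end]
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


data = [1, 1, 1, 1, 2, 3, 5, 8, 9, 9, 9, 15]
for low, high, count in equi_depth_buckets(data, 4):
    print(low, high, count)
```

Each bucket ends up with three values, while the value range a bucket covers differs: narrow where values are dense and wide where they are sparse.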
Sampling
Observation
• Now that we can (roughly) estimate the selectivity of predicates, what can we
actually do with them?
Cost-Based Query Optimization Plan Enumeration
Plan Enumeration
Query Optimization
• After performing rule-based rewriting, the DBMS will enumerate different plans for
the query and estimate their costs.
▶ Single relation
▶ Multiple relations
• It chooses the best plan it has seen for the query after exhausting all plans or
reaching a timeout.
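The enumerate-and-cost loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not how any particular DBMS works: it considers only left-deep join orders, applies one hypothetical join selectivity to every join, and uses the sum of intermediate result sizes as the cost. The name `choose_join_order` and the relation sizes are made up for the example.

```python
import itertools
import time


def choose_join_order(cardinalities, selectivity, timeout_s=1.0):
    """Cost every left-deep join order and keep the cheapest one seen.

    cardinalities: dict of relation name -> estimated tuple count.
    selectivity:   one join selectivity applied to every join (a toy assumption).
    """
    deadline = time.monotonic() + timeout_s
    best_order, best_cost = None, float("inf")
    for order in itertools.permutations(cardinalities):
        if best_order is not None and time.monotonic() > deadline:
            break  # out of time: settle for the best plan seen so far
        rows = cardinalities[order[0]]
        cost = 0.0
        for name in order[1:]:
            # Estimated output size of joining the running result with `name`.
            rows = rows * cardinalities[name] * selectivity
            cost += rows  # toy cost: sum of intermediate result sizes
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


order, cost = choose_join_order({"R": 1000, "S": 100, "T": 10}, selectivity=0.01)
print(order, cost)
```

With these made-up sizes, joining the two small relations first keeps the intermediate results small, so the search settles on an order that starts with S and T.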
• Query planning for OLTP queries is easy because they are sargable (Search Argument
Able).
▶ It is usually just picking the best index.
▶ Joins are almost always on foreign key relationships with a small cardinality.
▶ Can be implemented with simple heuristics.
Dynamic Programming
Candidate Plans
Postgres Optimizer
Cost-Based Query Optimization Conclusion
Conclusion
Parting Thoughts
• Selectivity estimations
▶ Histograms
▶ Join selectivity
• Key assumptions in query optimization
▶ Uniformity
▶ Independence
• Dynamic programming for join orderings
Next Class