0% found this document useful (0 votes)
27 views

20 Cost Based Optimization Annotated

The document discusses cost-based query optimization and focuses on plan cost estimation. It explains that the database management system stores statistics about tables, attributes, and indexes to estimate the cost of executing a query plan. These statistics include the number of tuples and distinct values for attributes. It also discusses how to estimate the selectivity of different predicate types like equality, range, and complex predicates. The document notes limitations in assuming uniform data distributions and independent predicates. It introduces techniques like histograms and sampling to improve selectivity estimates.

Uploaded by

smumin011
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views

20 Cost Based Optimization Annotated

The document discusses cost-based query optimization and focuses on plan cost estimation. It explains that the database management system stores statistics about tables, attributes, and indexes to estimate the cost of executing a query plan. These statistics include the number of tuples and distinct values for attributes. It also discusses how to estimate the selectivity of different predicate types like equality, range, and complex predicates. The document notes limitations in assuming uniform data distributions and independent predicates. It introduces techniques like histograms and sampling to improve selectivity estimates.

Uploaded by

smumin011
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 52

Cost-Based Query Optimization

Lecture 20: Cost-Based Query Optimization

1 / 52
Cost-Based Query Optimization Recap

Recap

2 / 52
Cost-Based Query Optimization Recap

Query Optimization

• Approach 1: Heuristics / Rules


▶ Rewrite the query to remove stupid / inefficient things.
▶ These techniques may need to examine catalog, but they do not need to examine data.
• Approach 2: Cost-based Search
▶ Use a model to estimate the cost of executing a plan.
▶ Evaluate multiple equivalent plans for a query and pick the one with the lowest cost.

3 / 52
Cost-Based Query Optimization Recap

Today’s Agenda

• Plan Cost Estimation


• Plan Enumeration

4 / 52
Cost-Based Query Optimization Plan Cost Estimation

Plan Cost Estimation

5 / 52
Cost-Based Query Optimization Plan Cost Estimation

Cost Estimation

• How long will a query take?


▶ CPU: Small cost; tough to estimate
▶ Disk: Number of block transfers
▶ Memory: Amount of DRAM used
▶ Network: Number of messages
• How many tuples will be read/written?
• It is too expensive to run every possible plan to determine this information, so the
DBMS need a way to derive this information. . .

6 / 52
Cost-Based Query Optimization Plan Cost Estimation

Statistics

• The DBMS stores internal statistics about tables, attributes, and indexes in its internal
catalog.
• Different systems update them at different times.
• Manual invocations:
▶ Postgres/SQLite: ANALYZE
▶ Oracle/MySQL: ANALYZE TABLE
▶ SQL Server: UPDATE STATISTICS
▶ DB2: RUNSTATS

7 / 52
Cost-Based Query Optimization Plan Cost Estimation

Statistics

• For each relation R, the DBMS maintains the following information:


▶ NR : Number of tuples in R.
▶ V(A, R): Number of distinct values for attribute A.

8 / 52
Cost-Based Query Optimization Plan Cost Estimation

Derivable Statistics

• The selection cardinality SC(A, R) is the average number of records with a value for
an attribute A is given by: NR / V(A, R)
• What could go wrong with this estimate?

9 / 52
Cost-Based Query Optimization Plan Cost Estimation

Derivable Statistics

• The selection cardinality SC(A, R) is the average number of records with a value for
an attribute A is given by: NR / V(A, R)
• Note that this assumes data uniformity.
▶ 10,000 students, 10 colleges – how many students in SCS?

10 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection Statistics

• Equality predicates on unique keys are easy to estimate.


• What about more complex predicates? What is their selectivity?
CREATE TABLE people (
id INT PRIMARY KEY,
val INT NOT NULL,
age INT NOT NULL,
status VARCHAR(16)
);
SELECT * FROM people WHERE id = 123 --- Easier
SELECT * FROM people WHERE val > 1000 --- Harder: Range predicate
SELECT * FROM people WHERE age = 30 AND status = 'Lit' --- Harder:
Complex predicate

11 / 52
Cost-Based Query Optimization Plan Cost Estimation

Complex Predicates

• The selectivity (sel) of a predicate P is the fraction of tuples that qualify.


• Formula depends on type of predicate:
▶ Equality
▶ Range
▶ Negation
▶ Conjunction
▶ Disjunction

12 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection – Complex Predicates


• Assume that V(age,people) has five distinct values (0–4) and NR = 5
• Equality Predicate: A=constant
▶ sel(A=constant) = SC(P) / NR
▶ Example: sel(age=2) = 1/5
SELECT * FROM people WHERE age = 2

13 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection – Complex Predicates

• Range Predicate:
▶ sel(A>=a) = (Amax – a) / (Amax – Amin )
▶ Example: sel(age>=2) ≈ (4 – 2) / (4 – 0) ≈ 1/2
SELECT * FROM people WHERE age >= 2

14 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection – Complex Predicates


• Negation Query:
▶ sel(not P) = 1 – sel(P)
▶ Example: sel(age != 2) = 1 – (1/5) = 4/5
• Observation: Selectivity ≈ Probability
SELECT * FROM people WHERE age != 2

15 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection – Complex Predicates


• Conjunction:
▶ sel(P1 ∧ P2) = sel(P1) × sel(P2)
▶ sel(age=2 ∧ name LIKE ’A%’)
• This assumes that the predicates are independent.
• Not always true in practice!
SELECT * FROM people WHERE age = 2 AND name LIKE 'A%'

16 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection – Complex Predicates


• Disjunction:
▶ sel(P1 ∨ P2) = sel(P1) + sel(P2) – sel(P1∧P2) = sel(P1) + sel(P2) – sel(P1) × sel(P2)
▶ sel(age=2 OR name LIKE ’A%’)
• This again assumes that the selectivities are independent.
SELECT * FROM people WHERE age = 2 OR name LIKE 'A%'

17 / 52
Cost-Based Query Optimization Plan Cost Estimation

Selection Cardinality

• Assumption 1: Uniform Data


▶ The distribution of values (except for the heavy hitters) is the same.
• Assumption 2: Independent Predicates
▶ The predicates on attributes are independent
• Assumption 3: Inclusion Principle
▶ The domain of join keys overlap such that each key in the inner relation will also exist in
the outer table.

18 / 52
Cost-Based Query Optimization Plan Cost Estimation

Correlated Attributes

• Consider a database of automobiles:


▶ Number of Makes = 10, Number of Models = 100
• And the following query: (make = ”Honda”ANDmodel = ”Accord”)
• With the independence and uniformity assumptions, the selectivity is:
▶ 1/10 × 1/100 = 0.001
• But since only Honda makes Accords, the real selectivity is 1/100 = 0.01

19 / 52
Cost-Based Query Optimization Plan Cost Estimation

Cost Estimation

• Our formulas are nice, but we assume that data values are uniformly distributed.

20 / 52
Cost-Based Query Optimization Plan Cost Estimation

Cost Estimation

• Our formulas are nice, but we assume that data values are uniformly distributed.

21 / 52
Cost-Based Query Optimization Plan Cost Estimation

Cost Estimation

• Our formulas are nice, but we assume that data values are uniformly distributed.

22 / 52
Cost-Based Query Optimization Plan Cost Estimation

Histograms With Quantiles

• Vary the width of buckets so that the total number of occurrences for each bucket is
roughly the same.

23 / 52
Cost-Based Query Optimization Plan Cost Estimation

Histograms With Quantiles

• Vary the width of buckets so that the total number of occurrences for each bucket is
roughly the same.

24 / 52
Cost-Based Query Optimization Plan Cost Estimation

Sampling

• Modern DBMSs also collect samples from tables to estimate selectivities.


• Update samples when the underlying tables changes significantly.
• Example: 1 billion tuples

SELECT AVG(age) FROM people WHERE age > 50

id name age status


1001 Shiyi 58 Senior
1002 Rahul 41 Sophomore
1003 Peter 25 Freshman
1004 Mark 25 Junior
1005 Alice 38 Senior

25 / 52
Cost-Based Query Optimization Plan Cost Estimation

Sampling

• Modern DBMSs also collect samples from tables to estimate selectivities.


• Update samples when the underlying tables changes significantly.
• Example: 1 billion tuples
• sel(age>50) = 1/3

SELECT AVG(age) FROM people WHERE age > 50

id name age status


1001 Shiyi 58 Senior
1003 Mark 25 Junior
1005 Alice 38 Senior

26 / 52
Cost-Based Query Optimization Plan Cost Estimation

Observation

• Now that we can (roughly) estimate the selectivity of predicates, what can we
actually do with them?

27 / 52
Cost-Based Query Optimization Plan Enumeration

Plan Enumeration

28 / 52
Cost-Based Query Optimization Plan Enumeration

Query Optimization

• After performing rule-based rewriting, the DBMS will enumerate different plans for
the query and estimate their costs.
▶ Single relation
▶ Multiple relations
• It chooses the best plan it has seen for the query after exhausting all plans or
some timeout.

29 / 52
Cost-Based Query Optimization Plan Enumeration

Single-Relation Query Planning

• Pick the best access method.


▶ Sequential Scan
▶ Binary Search (clustered indexes)
▶ Index Scan
• Predicate evaluation ordering.
• Simple heuristics are often good enough for this.
• OLTP queries are especially easy. . .

30 / 52
Cost-Based Query Optimization Plan Enumeration

OLTP Query Planning

• Query planning for OLTP queries is easy because they are sargable (Search Argument
Able).
▶ It is usually just picking the best index.
▶ Joins are almost always on foreign key relationships with a small cardinality.
▶ Can be implemented with simple heuristics.

CREATE TABLE people (


id INT PRIMARY KEY,
val INT NOT NULL,
);

SELECT * FROM people WHERE id = 123;

31 / 52
Cost-Based Query Optimization Plan Enumeration

Multi-Relation Query Planning

• As number of joins increases, number of alternative plans grows rapidly


▶ We need to restrict search space.
• Fundamental decision in System R: only left-deep join trees are considered.
▶ Modern DBMSs do not always make this assumption anymore.

32 / 52
Cost-Based Query Optimization Plan Enumeration

Multi-Relation Query Planning

• Fundamental decision in System R: Only consider left-deep join trees.

33 / 52
Cost-Based Query Optimization Plan Enumeration

Multi-Relation Query Planning

• Fundamental decision in System R: Only consider left-deep join trees.

34 / 52
Cost-Based Query Optimization Plan Enumeration

Multi-Relation Query Planning

• Fundamental decision in System R: Only consider left-deep join trees.


• Allows for fully pipelined plans where intermediate results are not written to temp
files.
▶ Not all left-deep trees are fully pipelined.

35 / 52
Cost-Based Query Optimization Plan Enumeration

Multi-Relation Query Planning

• Enumerate the orderings


▶ Example: Left-deep tree 1, Left-deep tree 2. . .
• Enumerate the physical join operator for each logical join operator
▶ Example: Hash, Sort-Merge, Nested Loop. . .
• Enumerate the access paths for each table
▶ Example: Index 1, Index 2, Seq Scan. . .
• Use dynamic programming to reduce the number of cost estimations.

36 / 52
Cost-Based Query Optimization Plan Enumeration

Dynamic Programming

37 / 52
Cost-Based Query Optimization Plan Enumeration

Dynamic Programming

38 / 52
Cost-Based Query Optimization Plan Enumeration

Dynamic Programming

39 / 52
Cost-Based Query Optimization Plan Enumeration

Dynamic Programming

40 / 52
Cost-Based Query Optimization Plan Enumeration

Dynamic Programming

41 / 52
Cost-Based Query Optimization Plan Enumeration

Candidate Plan Example

• How to generate plans for search algorithm:


▶ Enumerate relation orderings
▶ Enumerate join algorithm choices
▶ Enumerate access method choices
• No real DBMSs does it this way. It’s actually more messy. . .
SELECT * FROM R, S, T
WHERE R.a = S.a AND S.b = T.b

42 / 52
Cost-Based Query Optimization Plan Enumeration

Candidate Plans

• Step 1: Enumerate relation orderings

43 / 52
Cost-Based Query Optimization Plan Enumeration

Candidate Plans

• Step 2: Enumerate join algorithm choices

44 / 52
Cost-Based Query Optimization Plan Enumeration

Candidate Plans

• Step 3: Enumerate access method choices

45 / 52
Cost-Based Query Optimization Plan Enumeration

Postgres Optimizer

• Examines all types of join trees


▶ Left-deep, Right-deep, bushy
• Two optimizer implementations:
▶ Traditional Dynamic Programming Approach
▶ Genetic Query Optimizer (GEQO)
• Postgres uses the traditional algorithm when number of tables in query is less than 12
and switches to GEQO when there are 12 or more.

46 / 52
Cost-Based Query Optimization Plan Enumeration

Postgres Optimizer

47 / 52
Cost-Based Query Optimization Plan Enumeration

Postgres Optimizer

48 / 52
Cost-Based Query Optimization Plan Enumeration

Postgres Optimizer

49 / 52
Cost-Based Query Optimization Conclusion

Conclusion

50 / 52
Cost-Based Query Optimization Conclusion

Parting Thoughts

• Selectivity estimations
• Key assumptions in query optimization
▶ Uniformity
▶ Independence
▶ Histograms
▶ Join selectivity
• Dynamic programming for join orderings

51 / 52
Cost-Based Query Optimization Conclusion

Next Class

• Design Decisions in Query Optimization

52 / 52

You might also like