4 - Query Processing (Group 1)
Systems
Dr. Phan Thị Hà
Query
Processor
◼ Query language
❑ SQL: “intergalactic dataspeak”
◼ Query execution
❑ The steps that one goes through in executing high-level
(declarative) user queries.
◼ Query optimization
❑ How do we determine the “best” execution plan?
SELECT ENAME
FROM EMP NATURAL JOIN ASG
WHERE RESP = "Manager"
Strategy 1
Π_ENAME(σ_RESP="Manager" ∧ EMP.ENO=ASG.ENO (EMP × ASG))
Strategy 2
Π_ENAME(EMP ⋈_ENO (σ_RESP="Manager" (ASG)))
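The difference between the two strategies can be sketched on toy data. The relation contents below are hypothetical (they are not from the slides), and the tuples carry only the attributes the query needs. The sketch returns the result together with the size of the first intermediate relation each strategy builds.

```python
from itertools import product

# Hypothetical toy data, for illustration only.
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")]   # (ENO, ENAME)
ASG = [("E1", "Manager"), ("E2", "Analyst"), ("E3", "Manager")]  # (ENO, RESP)

def strategy_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    cartesian = list(product(emp, asg))
    selected = [(e, a) for e, a in cartesian
                if a[1] == "Manager" and e[0] == a[0]]
    return {e[1] for e, _ in selected}, len(cartesian)

def strategy_2(emp, asg):
    """Selection on ASG first, then join with EMP, then projection on ENAME."""
    managers = [a for a in asg if a[1] == "Manager"]
    joined = [(e, a) for e in emp for a in managers if e[0] == a[0]]
    return {e[1] for e, _ in joined}, len(managers)

names_1, size_1 = strategy_1(EMP, ASG)
names_2, size_2 = strategy_2(EMP, ASG)
print(names_1 == names_2, size_1, size_2)
```

Both strategies return the same names, but Strategy 1 materializes all |EMP| × |ASG| pairs before filtering, while Strategy 2 only carries the selected ASG tuples into the join.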
Operation                                    Complexity
Select, Project                              O(n)
Join, Semi-join, Division, Set Operators     O(n log n)
◼ Assume projection without duplicate elimination
◼ Exhaustive search
❑ Cost-based
❑ Optimal
❑ Combinatorial complexity in the number of relations
◼ Heuristics
❑ Not optimal
❑ Regroup common sub-expressions
❑ Perform selection, projection first
❑ Replace a join by a series of semijoins
❑ Reorder operations to reduce intermediate relation size
❑ Optimize individual operations
◼ Static
❑ Compilation ➔ optimize prior to the execution
❑ Difficult to estimate the size of the intermediate results ➔ error
propagation
❑ Can amortize over many executions
◼ Dynamic
❑ Run time optimization
❑ Exact information on the intermediate relation sizes
❑ Have to reoptimize for multiple executions
◼ Hybrid
❑ Compile using a static algorithm
❑ If the error in estimated sizes > threshold, reoptimize at run time
◼ Relation
❑ Cardinality
❑ Size of a tuple
❑ Fraction of tuples participating in a join with another relation
◼ Attribute
❑ Cardinality of domain
❑ Actual number of distinct values
◼ Simplifying assumptions
❑ Independence between different attribute values
❑ Uniform distribution of attribute values within their domain
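A minimal sketch of how these statistics and simplifying assumptions are typically combined into a size estimate. The function names and the numbers are assumptions made for illustration: under uniformity, an equality predicate on an attribute selects 1 / (number of distinct values) of the tuples, and under independence the selectivities of a conjunction multiply.

```python
def eq_selectivity(distinct_values: int) -> float:
    """Selectivity of 'A = value' assuming values of A are uniformly distributed."""
    return 1.0 / distinct_values

def and_selectivity(*factors: float) -> float:
    """Selectivity of a conjunction assuming the attributes are independent."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result

def estimated_cardinality(cardinality: int, selectivity: float) -> float:
    """Estimated number of tuples that survive the selection."""
    return cardinality * selectivity

# Hypothetical statistics: 10,000 tuples, 4 distinct RESP values, 50 distinct PNO values.
selectivity = and_selectivity(eq_selectivity(4), eq_selectivity(50))
estimated = estimated_cardinality(10_000, selectivity)
print(estimated)
```

When the real data is skewed or the attributes are correlated, the estimate drifts away from the true size, which is the error that static optimization has to live with.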
◼ Centralized
❑ Single site determines the “best” schedule
❑ Simple
❑ Need knowledge about the entire distributed database
◼ Distributed
❑ Cooperation among sites to determine the schedule
❑ Need only local information
❑ Cost of cooperation
◼ Hybrid
❑ One site determines the global schedule
❑ Each site optimizes the local subqueries
SELECT *
FROM EMP
WHERE ENO="E5"
Selections first
Example
Consider the following hybrid
fragmentation:
EMP1 = σ_ENO≤"E4" (Π_ENO,ENAME (EMP))
[Figure: input query ➔ equivalent QEPs ➔ best QEP]
◼ Cost model
❑ I/O cost + CPU cost + communication cost
❑ These might have different weights in different distributed
environments (LAN vs WAN)
❑ Can also maximize throughput
◼ Search algorithm
❑ How do we move inside the solution space?
❑ Exhaustive search, heuristic algorithms (iterative improvement,
simulated annealing, genetic,…)
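The weighted cost model can be sketched as follows. The weights, the two plans, and their resource counts are hypothetical values chosen only to show that the same pair of plans can rank differently once communication becomes expensive.

```python
from dataclasses import dataclass

@dataclass
class CostWeights:
    io: float
    cpu: float
    comm: float

# Hypothetical weights: communication is assumed far more expensive on a WAN.
LAN = CostWeights(io=1.0, cpu=0.1, comm=1.0)
WAN = CostWeights(io=1.0, cpu=0.1, comm=20.0)

# Hypothetical resource usage of two candidate plans.
PLANS = {
    "A": {"io": 100, "cpu": 500, "comm": 10},  # mostly local work
    "B": {"io": 40, "cpu": 200, "comm": 30},   # less local work, more shipping
}

def plan_cost(plan: dict, w: CostWeights) -> float:
    """Total cost = weighted I/O + weighted CPU + weighted communication."""
    return w.io * plan["io"] + w.cpu * plan["cpu"] + w.comm * plan["comm"]

def best_plan(w: CostWeights) -> str:
    """Name of the plan with the lowest weighted cost."""
    return min(PLANS, key=lambda name: plan_cost(PLANS[name], w))

print(best_plan(LAN), best_plan(WAN))
```

With these assumed numbers, plan B is cheaper under the LAN weights and plan A is cheaper under the WAN weights.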
◼ Deterministic
❑ R1 ⋈ R2 ➔ (R1 ⋈ R2) ⋈ R3 ➔ ((R1 ⋈ R2) ⋈ R3) ⋈ R4
◼ Randomized
❑ (R1 ⋈ R2) ⋈ R3 ➔ (R1 ⋈ R3) ⋈ R2
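The two search styles can be contrasted with a small sketch. Everything here is an illustrative assumption: the cardinalities, the single join selectivity, and a cost defined as the sum of intermediate result sizes of a left-deep plan. The deterministic search is written as a greedy builder that adds one relation per step; the randomized search is a simple iterative improvement that swaps two relations and keeps the change only if the cost drops.

```python
import random
from itertools import permutations

RELATIONS = ["R1", "R2", "R3", "R4"]
CARD = {"R1": 1000, "R2": 10, "R3": 100, "R4": 500}  # hypothetical cardinalities
SELECTIVITY = 0.01                                   # assumed equal for every join

def cost(order):
    """Sum of intermediate result sizes for a left-deep join order."""
    size = CARD[order[0]]
    total = 0.0
    for rel in order[1:]:
        size = size * CARD[rel] * SELECTIVITY
        total += size
    return total

def greedy_order(relations):
    """Deterministic: start from the cheapest pair, then add the cheapest next relation."""
    order = list(min(permutations(relations, 2), key=cost))
    remaining = set(relations) - set(order)
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: cost(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def iterative_improvement(relations, steps=200, seed=0):
    """Randomized: start from a random plan and accept only cost-reducing swaps."""
    rng = random.Random(seed)
    order = list(relations)
    rng.shuffle(order)
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        candidate = order[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        if cost(candidate) < cost(order):
            order = candidate
    return order

print(greedy_order(RELATIONS), iterative_improvement(RELATIONS))
```

Neither search is guaranteed to find the optimum in general; on this small input the greedy result can be checked against all 24 permutations.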
❑ Join Ordering
Consider
PROJ ⋈_PNO ASG ⋈_ENO EMP
5. EMP → Site 2
PROJ → Site 2
Site 2 computes EMP ⋈ PROJ ⋈ ASG
◼ Alternatives:
1. Do the join R ⋈_A S
2. Perform one of the semijoin equivalents
R ⋈_A S ⇔ (R ⋉_A S) ⋈_A S
⇔ R ⋈_A (S ⋉_A R)
⇔ (R ⋉_A S) ⋈_A (S ⋉_A R)
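The semijoin equivalences can be checked on a small example. In this sketch the relations are sets of two-field tuples whose first field is the join attribute A; the helper names and the data are made up for illustration.

```python
def join(r, s):
    """Join on A, where A is the first field of every tuple."""
    return {(a, x, y) for (a, x) in r for (b, y) in s if a == b}

def semijoin(r, s):
    """R semijoin S on A: the tuples of R whose A value also appears in S."""
    keys = {b for (b, _) in s}
    return {t for t in r if t[0] in keys}

# Hypothetical toy relations.
R = {(1, "r1"), (2, "r2"), (3, "r3")}
S = {(2, "s2"), (3, "s3"), (4, "s4")}

direct = join(R, S)
alt_1 = join(semijoin(R, S), S)
alt_2 = join(R, semijoin(S, R))
alt_3 = join(semijoin(R, S), semijoin(S, R))
print(direct == alt_1 == alt_2 == alt_3)
```

In a distributed setting, the appeal of the semijoin forms is that only the reduced relation has to be shipped; whether that pays off depends on how much the semijoin shrinks it compared with the cost of shipping the join attribute values.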
Consider
ET (ENO, ENAME, TITLE, CITY)
AT (ENO, PNO, RESP, DUR, CITY)
PT (PNO, PNAME, BUDGET, CITY)
◼ Cost functions
❑ Total Time (or Total Cost)
◼ Reduce each cost (in terms of time) component individually
◼ Do as little of each cost component as possible
◼ Optimizes resource utilization and increases system throughput
❑ Response Time
◼ Do as many things as possible in parallel
◼ May increase total time because of increased total activity
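The contrast between the two cost functions can be sketched for the simple case of two sites shipping data to a third site at the same time. The cost constants and data volumes are assumptions for illustration: each transfer is charged a fixed per-message cost plus a per-unit transmission cost.

```python
T_MSG = 1.0   # assumed fixed cost of initiating one message
T_TR = 0.01   # assumed cost of transmitting one unit of data

def transfer_cost(units: int) -> float:
    """Cost of one transfer: message overhead plus transmission."""
    return T_MSG + T_TR * units

# Hypothetical volumes shipped in parallel from two sites to a third site.
x_units, y_units = 1000, 4000

# Total time adds up every cost component, regardless of parallelism.
total_time = transfer_cost(x_units) + transfer_cost(y_units)

# Response time is bounded by the slower of the two parallel transfers.
response_time = max(transfer_cost(x_units), transfer_cost(y_units))

print(total_time, response_time)
```

A plan that splits work across more sites may lower the response time while raising the total time, because every extra transfer adds to the sum.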
◼ Dynamic approach
❑ Distributed INGRES
❑ No static cost estimation, only runtime cost information
◼ Static approach
❑ System R*
❑ Static cost model
◼ Hybrid approach
❑ 2-step
◼ Ship whole
❑ Larger data transfer
❑ Smaller number of messages
❑ Better if relations are small
◼ Fetch as needed
❑ Number of messages = O(cardinality of the external (outer) relation)
❑ Data transfer per message is minimal
❑ Better if relations are large and the selectivity is good
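A rough way to compare the two transfer strategies is to count messages and shipped data. The model below is a simplification with assumed parameters: ship-whole is counted as a single message carrying the entire inner relation, and fetch-as-needed is counted as one request plus one reply per outer tuple, shipping only the matching inner tuples.

```python
def ship_whole(inner_card: int, tuple_size: int) -> dict:
    """Ship the entire inner relation once (counted as one message here)."""
    return {"messages": 1, "data": inner_card * tuple_size}

def fetch_as_needed(outer_card: int, matches_per_outer: int, tuple_size: int) -> dict:
    """One request and one reply per outer tuple; only matching tuples are shipped."""
    return {
        "messages": 2 * outer_card,
        "data": outer_card * matches_per_outer * tuple_size,
    }

# Hypothetical case: large inner relation, small outer relation, selective join.
whole = ship_whole(inner_card=1_000_000, tuple_size=100)
needed = fetch_as_needed(outer_card=100, matches_per_outer=2, tuple_size=100)
print(whole, needed)
```

With these assumed numbers, fetch-as-needed sends many more messages but ships far less data, which matches the rule of thumb that it suits large relations with good selectivity.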
◼ Given
❑ A set of sites S = {s1, s2, …,sn} with the load of each site
❑ A query Q ={q1, q2, q3, q4} such that each subquery qi is the
maximum processing unit that accesses one relation and
communicates with its neighboring queries
❑ For each qi in Q, a feasible allocation set of sites Sq={s1, s2,
…,sk} where each site stores a copy of the relation in qi
◼ The objective is to find an optimal allocation of Q to S
such that
❑ The load unbalance of S is minimized
❑ The total communication cost is minimized
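For a handful of subqueries the allocation problem can be solved by brute force. This sketch makes several assumptions that the slide leaves open: site loads are integers, each allocated subquery adds one unit of load, two neighboring subqueries cost one unit of communication when placed on different sites, and the two objectives are combined into a single weighted sum.

```python
from itertools import product

LOAD = {"s1": 3, "s2": 1, "s3": 2}   # hypothetical current load per site
FEASIBLE = {                          # sites holding a copy of each subquery's relation
    "q1": ["s1", "s2"],
    "q2": ["s2", "s3"],
    "q3": ["s1", "s3"],
    "q4": ["s2"],
}
NEIGHBORS = [("q1", "q2"), ("q2", "q3"), ("q3", "q4")]  # subqueries that communicate

def objective(assignment: dict, comm_weight: float = 1.0) -> float:
    """Load imbalance after allocation plus weighted communication cost."""
    load = dict(LOAD)
    for site in assignment.values():
        load[site] += 1
    imbalance = max(load.values()) - min(load.values())
    comm = sum(1 for a, b in NEIGHBORS if assignment[a] != assignment[b])
    return imbalance + comm_weight * comm

def all_allocations():
    """Every assignment that respects the feasible site sets."""
    queries = list(FEASIBLE)
    for sites in product(*(FEASIBLE[q] for q in queries)):
        yield dict(zip(queries, sites))

best = min(all_allocations(), key=objective)
print(best, objective(best))
```

Exhaustive enumeration grows exponentially with the number of subqueries, so a real optimizer would need a heuristic; the sketch only makes the objective concrete.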
Adaptive Query Processing – Definition
◼ Query processing is adaptive if it receives information
from the execution environment and determines its
behavior accordingly
❑ Feedback loop between optimizer and runtime environment
❑ Communication of runtime information between DDBMS
components
◼ Additional components
❑ Monitoring, assessment, reaction
❑ Embedded in control operators of QEP
◼ Tradeoff between reactiveness and overhead of
adaptation
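The monitoring, assessment, and reaction steps and the reactiveness/overhead tradeoff can be sketched with a toy control loop. The function, the observation stream, and the threshold are hypothetical: the loop inspects an observed selectivity every `check_every` tuples, compares it with the optimizer's estimate, and reports when it would react.

```python
def adaptive_run(observations, estimate, check_every, threshold=0.5):
    """Return (number of checks performed, position at which a reaction is triggered)."""
    checks = 0
    for position, observed in enumerate(observations, start=1):
        if position % check_every != 0:
            continue                                        # no monitoring at this step
        checks += 1                                         # monitoring
        relative_error = abs(observed - estimate) / estimate  # assessment
        if relative_error > threshold:
            return checks, position                         # reaction (e.g. reoptimize)
    return checks, None

# Hypothetical stream: the selectivity drifts from 0.1 to 0.4 after four observations.
stream = [0.1] * 4 + [0.4] * 8

frequent = adaptive_run(stream, estimate=0.1, check_every=1)
sparse = adaptive_run(stream, estimate=0.1, check_every=4)
print(frequent, sparse)
```

Checking at every step reacts as soon as the drift appears but pays for more checks; checking every fourth step is cheaper but reacts later.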