Relational Operators
Napoleon Hill
CS3223 - Relational Operators 1
Introduction
• We’ve covered the basic underlying storage, buffering, and
indexing technology
• Now we are ready to move on to query processing
• Some database operations are EXPENSIVE
• Can greatly improve performance by being “smart”
• e.g., can speed up 1,000,000x over naïve approach
• Main approaches are:
• clever implementation techniques for operators
• exploit “equivalences” of relational operators
• use statistics and cost models to choose among these
[Figure: database statistics and a cost model are used to choose among alternative query plans over Reserves and Sailors; the plans combine a projection on sname, the selections bid=100 and rating > 5, and a join on sid=sid]
• Reserves (R):
• ||R|| - number of tuples
• |R| - number of pages
• pR tuples per page, |R| = M. Let pR = 100, M = 1000, so ||R|| = 100*1000 = 100,000
• Sailors (S):
• pS tuples per page, |S| = N. Let pS = 80, N = 500, so ||S|| = 80*500 = 40,000
• Cost metric: # of I/Os (pages)
• We will ignore output costs in the following discussion
SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid
• In algebra: R ⋈ S (join condition sid=sid)
• Most frequently used operation; very costly operation
• ||R ⋈ S|| = f × (||R|| × ||S||), where f is the join selectivity
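As a rough illustration of the result-size estimate, the sketch below plugs in the cardinalities of Reserves and Sailors. The selectivity value is an assumption, not something stated on the slides: it supposes that sid is a key of Sailors and that every Reserves tuple refers to an existing sailor, so each Reserves tuple joins with exactly one Sailors tuple.

```python
from fractions import Fraction

# Cardinalities from the running example.
num_reserves = 100 * 1000   # ||R||
num_sailors = 80 * 500      # ||S||

# Assumption (not from the slides): sid is a key of Sailors and every
# Reserves tuple matches exactly one sailor, so selectivity = 1 / ||S||.
join_selectivity = Fraction(1, num_sailors)

# Estimated result size = selectivity * ||R|| * ||S||.
estimated_result = join_selectivity * num_reserves * num_sailors
print(estimated_result)
```

Under that assumption the estimate equals ||R|| = 100,000 tuples, far smaller than the 4 billion tuples of the full cross product.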
Can we do better??
Page-based Nested Loop Join
for each page PR of R do
    for each page PS of S do
        for each tuple r ∈ PR do
            for each tuple s ∈ PS do
                if (r.sid == s.sid) then output (r, s) to result
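The pseudocode above can be sketched as runnable Python. This is a toy in-memory model, not a real storage engine: each relation is assumed to be a list of pages, each page a list of dicts, and `page_nested_loop_join` is a hypothetical helper name. The sample tuples are made up for illustration.

```python
def page_nested_loop_join(r_pages, s_pages):
    """Join R and S on sid, counting simulated page reads."""
    result = []
    page_reads = 0
    for r_page in r_pages:
        page_reads += 1                # read one page of the outer R
        for s_page in s_pages:
            page_reads += 1            # S is re-scanned once per R page
            for r in r_page:
                for s in s_page:
                    if r["sid"] == s["sid"]:
                        result.append((r, s))
    return result, page_reads


# Toy data (hypothetical): 2 pages of Reserves, 2 pages of Sailors.
reserves = [[{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}],
            [{"sid": 3, "bid": 102}, {"sid": 1, "bid": 103}]]
sailors = [[{"sid": 1, "sname": "a"}, {"sid": 2, "sname": "b"}],
           [{"sid": 3, "sname": "c"}]]

matches, reads = page_nested_loop_join(reserves, sailors)
print(len(matches), reads)
```

The read counter follows the pattern |R| + |R| * |S|: here 2 + 2*2 = 6 page reads for 4 matching pairs.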
Nested loop join variants:
• Tuple-based: for each R tuple, scan S (memory size = 3 pages). Number of scans of S: ||R||
• Page-based: for each page of R tuples, scan S (memory = 3 pages). Number of scans of S: |R|
• Block-based: for each block of R tuples, scan S (memory > 3 pages). Block size B = buffer size - 2. Number of scans of S: ⌈|R|/B⌉
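To make the difference between the three variants concrete, the following sketch counts how many times S is scanned with Reserves as the outer relation. The 102-page buffer is taken from the block nested loops example later in the slides; the variable names are illustrative only.

```python
import math

tuples_in_r = 100 * 1000        # ||R||
pages_in_r = 1000               # |R|
buffer_pages = 102              # buffer size used in the later example
block_size = buffer_pages - 2   # pages left for the block of R

# Number of scans of the inner relation S for each variant.
scans_of_s = {
    "tuple-based": tuples_in_r,
    "page-based": pages_in_r,
    "block-based": math.ceil(pages_in_r / block_size),
}

for variant, scans in scans_of_s.items():
    print(f"{variant}: {scans} scans of S")
```

With these numbers the inner relation is scanned 100,000 times, 1,000 times, and 10 times respectively.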
Block Nested Loops Join
• Use one page as an input buffer for scanning the inner S, one page as the
output buffer, and all remaining pages (buffer size - 2) to hold a “block” of outer R
• For each matching pair of tuples r in the R-block and s in the S-page, add (r, s) to the result. Then read
the next R-block, scan S again, and so on
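A minimal sketch of block nested loops, using the same toy page model as before (lists of pages, each a list of dicts). The function name and sample data are hypothetical, and the output buffer is not modelled since output costs are ignored in this discussion.

```python
def block_nested_loops_join(r_pages, s_pages, buffer_pages):
    """Join R (outer) and S (inner) on sid, counting simulated page reads."""
    block_size = buffer_pages - 2   # one page for S input, one for output
    if block_size < 1:
        raise ValueError("need at least 3 buffer pages")
    result, page_reads = [], 0
    for start in range(0, len(r_pages), block_size):
        block = r_pages[start:start + block_size]
        page_reads += len(block)            # read the next block of R
        r_tuples = [r for page in block for r in page]
        for s_page in s_pages:              # one full scan of S per block
            page_reads += 1
            for s in s_page:
                for r in r_tuples:
                    if r["sid"] == s["sid"]:
                        result.append((r, s))
    return result, page_reads


# Toy data (hypothetical): 4 one-tuple pages of R, 3 pages of S.
r_pages = [[{"sid": i, "bid": 100 + i}] for i in range(4)]
s_pages = [[{"sid": 0}, {"sid": 1}], [{"sid": 2}], [{"sid": 5}]]

pairs, reads = block_nested_loops_join(r_pages, s_pages, buffer_pages=4)
print(len(pairs), reads)
```

With 4 buffer pages the block holds 2 pages of R, so S is scanned twice: 4 + 2*3 = 10 page reads.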
Examples of Block Nested Loops
• Cost: size of outer + #outer blocks * size of inner
• #outer blocks = no. of pages in outer relation / block size
• With R as outer, block size of 100 pages (buffer size = 102):
• Cost of scanning R is 1000 I/Os; a total of 10 blocks
• Per block of R, we scan S; 10*500 I/Os
• Join cost = 6000 I/Os
• If the block holds just 90 pages of R, we scan S ⌈1000/90⌉ = 12 times; join cost = 1000 + 12*500 = 7000 I/Os
• With 100-page block of S as outer:
• Cost of scanning S is 500 I/Os; a total of 5 blocks
• Per block of S, we scan R; 5*1000 I/Os
• Join cost = 5500 I/Os
Ordering of inner/outer relations affects performance!
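The figures in these examples can be checked with a small cost function. `bnl_cost` is a hypothetical helper that applies the formula "size of outer + #outer blocks * size of inner", rounding the number of blocks up.

```python
import math


def bnl_cost(outer_pages, inner_pages, block_size):
    """I/O cost of block nested loops join, ignoring output cost."""
    num_blocks = math.ceil(outer_pages / block_size)
    return outer_pages + num_blocks * inner_pages


r_outer_100 = bnl_cost(1000, 500, 100)   # R as outer, 100-page blocks
r_outer_90 = bnl_cost(1000, 500, 90)     # R as outer, 90-page blocks
s_outer_100 = bnl_cost(500, 1000, 100)   # S as outer, 100-page blocks

print(r_outer_100, r_outer_90, s_outer_100)
```

This gives 6000, 7000, and 5500 I/Os: the smaller relation as outer is cheaper here, and a slightly smaller block already adds two extra scans of S.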
• Idea: Sort-Merge Join
• Sort R and S on the join column
• A sorted relation R consists of (implicit) partitions Ri of records, where r, r’ ∈ Ri iff r and r’
have the same values for the join attribute(s)
• Scan the sorted relations to do a “merge” (on the join column), and output result tuples
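The merge step can be sketched in Python as follows. This is an in-memory illustration only: it uses Python's built-in `sorted` rather than an external sort, and `sort_merge_join` and the sample tuples are hypothetical. Each run of equal join values corresponds to one implicit partition Ri.

```python
def sort_merge_join(r_tuples, s_tuples, key="sid"):
    """Join two lists of dicts on `key` by sorting and merging."""
    r_sorted = sorted(r_tuples, key=lambda t: t[key])
    s_sorted = sorted(s_tuples, key=lambda t: t[key])
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        r_val, s_val = r_sorted[i][key], s_sorted[j][key]
        if r_val < s_val:
            i += 1                      # advance R to catch up with S
        elif r_val > s_val:
            j += 1                      # advance S to catch up with R
        else:
            # Find the end of the matching partition in each relation.
            i_end = i
            while i_end < len(r_sorted) and r_sorted[i_end][key] == r_val:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][key] == r_val:
                j_end += 1
            # Output every pair from the two matching partitions.
            for r in r_sorted[i:i_end]:
                for s in s_sorted[j:j_end]:
                    result.append((r, s))
            i, j = i_end, j_end
    return result


# Toy data (hypothetical).
r_rows = [{"sid": 3, "bid": 1}, {"sid": 1, "bid": 2}, {"sid": 3, "bid": 3}]
s_rows = [{"sid": 3, "sname": "x"}, {"sid": 2, "sname": "y"},
          {"sid": 3, "sname": "z"}]
joined = sort_merge_join(r_rows, s_rows)
print(len(joined))
```

In the toy data only sid = 3 appears in both relations, with two tuples on each side, so the partition pair yields 2 * 2 = 4 result tuples.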
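[Figure: hash partitioning example with bucketID = X mod 4; the tuples of R and S are split into partitions 0 to 3]
[Figure: partitioning phase; the original relation is read through one INPUT buffer and split by hash function h into B-1 OUTPUT partitions]
[Figure: probing phase; for the partitions of R & S, a hash table for partition Ri (k < B-1 pages) is built with hash function h2 and probed with h2 to produce the join result]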
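The two phases shown in the figures above (partition both relations with h, then build and probe a per-partition hash table with h2) can be sketched in memory as follows. This is an illustration under assumptions: join keys are integers, h is taken to be "key mod number of partitions" as in the bucketID example, and a Python dict stands in for the h2 hash table. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def partition(tuples, num_partitions, key="sid"):
    """Partitioning phase: h(x) = x mod num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for t in tuples:
        parts[t[key] % num_partitions].append(t)
    return parts


def partitioned_hash_join(r_tuples, s_tuples, num_partitions=4, key="sid"):
    r_parts = partition(r_tuples, num_partitions, key)
    s_parts = partition(s_tuples, num_partitions, key)
    result = []
    # Probing phase: only partitions with the same index can match.
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)        # dict plays the role of h2
        for r in r_part:
            table[r[key]].append(r)      # build on the R partition
        for s in s_part:
            for r in table.get(s[key], []):   # probe with S tuples
                result.append((r, s))
    return result


# Toy data (hypothetical).
r_rows = [{"sid": n} for n in (0, 4, 5, 9)]
s_rows = [{"sid": n, "sname": f"s{n}"} for n in (4, 9, 9, 7)]
joined = partitioned_hash_join(r_rows, s_rows)
print(len(joined))
```

Tuples with equal join keys always land in the same partition, which is why each R partition only needs to be compared with the S partition of the same index.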
Access Path
• Access path refers to a way of accessing data records/entries
• Table scan
• With no index, unsorted: Must essentially scan the whole relation; cost is M (#pages in R)
• Index-only scan: treat the index as a table
• Step 1:
• Cost to scan records = |R|
• Cost to output temporary result = |π*_L(R)|
• Step 2:
• Cost to sort records = 2|π*_L(R)| (log_m(N0) + 1)
• N0 = number of initial sorted runs; m = merge factor
• Step 3:
• Cost to scan records = |π*_L(R)|
• Optional: Cost to store answer = |π_L(R)|
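The three steps can be added up with a short cost function. This is a sketch under assumptions: the temporary result in Steps 1 to 3 is taken to be the projection of R onto the attribute list L before duplicates are removed, the logarithm is rounded up to a whole number of merge passes (computed with integers to avoid floating-point error), and all input numbers other than |R| = 1000 are made up for illustration.

```python
def merge_passes(initial_runs, merge_factor):
    """Ceiling of log_m(N0): merge passes needed to reach one run."""
    passes, runs = 0, initial_runs
    while runs > 1:
        runs = -(-runs // merge_factor)   # ceiling division
        passes += 1
    return passes


def sort_projection_cost(r_pages, temp_pages, initial_runs, merge_factor,
                         answer_pages=0):
    """Total I/O for the three steps; answer_pages is the optional store."""
    step1 = r_pages + temp_pages                      # scan R, write temp
    step2 = 2 * temp_pages * (merge_passes(initial_runs, merge_factor) + 1)
    step3 = temp_pages                                # final scan
    return step1 + step2 + step3 + answer_pages


# Hypothetical numbers: 250-page temporary result, 16 initial runs,
# merge factor 4 (so 2 merge passes).
total = sort_projection_cost(r_pages=1000, temp_pages=250,
                             initial_runs=16, merge_factor=4)
print(total)
```

With these inputs the steps cost 1250 + 1500 + 250 = 3000 I/Os, so sorting the temporary result accounts for half of the total.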
• Without grouping:
• In general, requires scanning the relation
• Given index whose search key includes all attributes in the SELECT or
WHERE clauses, can do index-only scan