
Relational Operators

The document discusses relational database operators like selection, projection, join, set difference, and union. It explains that relational operators can be composed to process complex queries more efficiently. The main relational operators - selection, projection, join, set difference, union and aggregation - are introduced along with examples of join algorithms like nested loops join, block nested loops join, sort merge join and hash join. Optimizing the order of composing relational operators can improve query performance.

Uploaded by

Gilbert Heß

Relational Operators

First comes thought; then organization of that thought, into ideas and plans; then
transformation of those plans into reality. The beginning, as you will observe, is in
your imagination.

Napoleon Hill
CS3223 - Relational Operators
Introduction
• We’ve covered the basic underlying storage, buffering, and
indexing technology
• Now we are ready to move on to query processing
• Some database operations are EXPENSIVE
• Can greatly improve performance by being “smart”
• e.g., can speed up 1,000,000x over naïve approach
• Main approaches are:
• clever implementation techniques for operators
• exploit “equivalences” of relational operators
• use statistics and cost models to choose among these


Steps of processing a high-level query

[Figure: High Level Query → Query Parser → Parsed Query → Query Optimizer (consults Database Statistics and Cost Model) → QEP (query evaluation plan) → Query Evaluator → Query Result]

High Level Query:
SELECT * FROM EMP
WHERE SAL > 50k

Candidate plans:
P1: Sequential Scan
P2: Use SAL index


Relational Operations
• We will consider how to implement:
• Selection (σ) Selects a subset of rows from relation.
• Projection (π) Deletes unwanted columns from relation.
• Join (⋈) Allows us to combine two relations.
• Set-difference (−) Tuples in reln. 1, but not in reln. 2.
• Union (∪) Tuples in reln. 1 or in reln. 2.
• Aggregation (SUM, MIN, etc.) and GROUP BY

Since each op returns a relation, ops can be composed!

Queries that require multiple ops to be composed may be composed in
different ways - thus optimization is necessary for good performance


Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
      R.bid=100 AND S.rating>5

[Figure: three alternative operator trees for this query. Each projects sname at the root and joins Reserves and Sailors on sid=sid; the trees differ in whether the selections bid=100 and rating>5 are applied above the join or pushed below it onto Reserves and Sailors.]


Paradigm
• Iteration-based
• Index
• B+-tree, Hash
• assume index entries to be (rid,pointer) pair
• Clustered, Unclustered
• Sort
• Hash



Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

• Reserves (R):
• ||R|| - number of tuples
• |R| - number of pages
• pR tuples per page, |R| = M. Let pR = 100, M = 1000, ||R|| = 100*1000
• Sailors (S):
• pS tuples per page, |S| = N. Let pS = 80, N = 500, ||S|| = 80*500
• Cost metric: # of I/Os (pages)
• We will ignore output costs in the following discussion



Equality Joins With One Join Column

SELECT *
FROM Reserves R, Sailors S
WHERE R.sid=S.sid

• In algebra: R ⋈ S
• Most frequently used operation; very costly operation
• ||R ⋈ S|| = js × (||R|| × ||S||), where js is the join selectivity


Join Algorithms
• Iteration-based
• Block nested loop
• Index-based
• Index nested loop
• Sort-based
• Sort-merge join
• Partition-based
• Hash join



Join Algorithms
• Things to consider when choosing an algorithm
• Types of join predicates
  • Equality predicates (e.g., R.A = S.B)
  • Inequality predicates (e.g., R.A < S.B)
• Sizes of join operands
• Available buffer space
• Available access methods


Simple (Tuple-based) Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S do
        if r.sid == s.sid then add <r, s> to result

• For each tuple in the outer relation R, we scan the entire inner relation S
• I/O Cost?
• Memory?


Simple Nested Loops Join
• For each tuple in the outer relation R, we scan the entire inner relation S
• Cost: |R| + ||R|| × |S| (scan R once; scan S once per R tuple)
• M + (pR × M) × N = 1000 + 100 × 1000 × 500 = 50,001,000 I/Os
• Memory: 3 pages (one input page each for R and S, one output page)

Can we do better??
Page-based Nested Loop Join
for each page PR of R do
    for each page PS of S do
        for each tuple r ∈ PR do
            for each tuple s ∈ PS do
                if (r.sid == s.sid) then output (r, s) to result

• I/O cost = |R| + |R| × |S| = M + M × N = 1000 + 1000 × 500 = 501,000 I/Os
• Memory = 3 pages!
Can we do better??
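
To make the difference between the two loop orders concrete, here is a small sketch that stores each relation as a list of pages and counts page reads. The function names (`make_relation`, `tuple_nlj`, `page_nlj`) and the toy relation sizes are assumptions made for illustration; they are not part of the course material.

```python
def make_relation(keys, tuples_per_page):
    """Pack (key, payload) tuples into fixed-size pages (lists of tuples)."""
    rows = [(k, f"row{k}") for k in keys]
    return [rows[i:i + tuples_per_page] for i in range(0, len(rows), tuples_per_page)]

def tuple_nlj(R, S):
    """Simple nested loops: scan all of S once per R tuple."""
    ios, result = 0, []
    for page_r in R:
        ios += 1                      # read one page of R
        for r in page_r:
            for page_s in S:
                ios += 1              # re-read every page of S for this r
                result += [(r, s) for s in page_s if r[0] == s[0]]
    return result, ios

def page_nlj(R, S):
    """Page-based nested loops: scan all of S once per R page."""
    ios, result = 0, []
    for page_r in R:
        ios += 1                      # read one page of R
        for page_s in S:
            ios += 1                  # read one page of S
            result += [(r, s) for r in page_r for s in page_s if r[0] == s[0]]
    return result, ios

R = make_relation(range(12), tuples_per_page=4)        # |R| = 3 pages, ||R|| = 12 tuples
S = make_relation(range(0, 12, 2), tuples_per_page=3)  # |S| = 2 pages
res_t, io_t = tuple_nlj(R, S)
res_p, io_p = page_nlj(R, S)
print(io_t, io_p)  # |R| + ||R||*|S| = 27 versus |R| + |R|*|S| = 9
```

Both functions produce the same join result; only the number of page reads changes.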


Block Nested Loops Join
• Motivation: How to better exploit buffer space to minimize
number of I/Os?
R ⋈ S:
• For each R tuple, scan S (memory size = 3 pages). Number of iterations of S: ||R||
• For each page of R tuples, scan S (memory = 3 pages). Number of iterations: |R|
• For each block of R tuples, scan S (memory > 3 pages). Block size (B): buffer size - 2. Number of iterations: |R|/B
Block Nested Loops Join
• Use one page as an input buffer for scanning the inner S, one page as the
output buffer, and use all remaining pages (B -2) to hold “block” of outer R
• For each matching tuple r in R-block, s in S-page, add (r, s) to result. Then read
next R-block, scan S, etc

[Figure: buffer usage for R ⋈ S. A block of R occupies k pages (k < B-1), one page is the input buffer for S, and one page is the output buffer for the join result.]


[Figure sequence: Block Nested Loop Join: Example]
Examples of Block Nested Loops
• Cost: size of outer + #outer blocks * size of inner
• #outer blocks = no. of pages in outer relation / block size
• With R as outer, block size of 100 pages (buffer size = 102):
• Cost of scanning R is 1000 I/Os; a total of 10 blocks
• Per block of R, we scan S; 10*500 I/Os
• Join cost = 6000 I/Os
• If the block holds just 90 pages of R, we scan S 12 times (⌈1000/90⌉ = 12); join cost = 1000 + 12 × 500 = 7000 I/Os
• With 100-page block of S as outer:
• Cost of scanning S is 500 I/Os; a total of 5 blocks
• Per block of S, we scan R; 5*1000 I/Os
• Join cost = 5500 I/Os
Ordering of inner/outer relations affects performance!
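
The cost formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic; `bnl_cost` is a hypothetical helper name, and output cost is ignored as in the rest of the slides.

```python
import math

def bnl_cost(outer_pages, inner_pages, block_pages):
    """I/O cost of block nested loops: scan the outer once, and scan the
    inner once per block of the outer."""
    num_blocks = math.ceil(outer_pages / block_pages)
    return outer_pages + num_blocks * inner_pages

M, N = 1000, 500  # pages of Reserves (R) and Sailors (S)
print(bnl_cost(M, N, 100))  # R as outer, 100-page blocks
print(bnl_cost(M, N, 90))   # R as outer, 90-page blocks
print(bnl_cost(N, M, 100))  # S as outer, 100-page blocks
```

Making the smaller relation the outer one reduces the number of blocks, which is why the third call is the cheapest.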
Sort-Merge Join
• Idea:
• Sort R and S on the join column
• A sorted relation R consists of (implicit) partitions Ri of records, where r, r′ ∈ Ri iff r and r′ have the same values for the join attribute(s)
• Scan them to do a "merge" (on join col.), and output result tuples

[Figure: a sorted relation, highlighting the partition with sort key value 10]


Sort-Merge Join
• Idea: Sort R and S on the join column, then scan them to do a “merge”
(on join col.), and output result tuples
• Advance scan of R until current R-tuple's sort key ≥ current S tuple's sort key, then advance scan of S until current S-tuple's key ≥ current R tuple's key; do this until current R tuple's key = current S tuple's key
• At this point, all R tuples with same value in Ri (current R partition) and all S tuples with same value in Sj (current S partition) match; output (r, s) for all pairs of such tuples
• Then resume scanning R and S
• R is scanned once; each S partition is scanned once per matching R tuple (multiple scans of an S partition are likely to find needed pages in buffer)
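
The merge step described above can be sketched in memory as follows. This is an illustrative version that works on Python lists rather than disk pages; the function name, the `key` parameter and the sample tuples are assumptions. It handles duplicate join values by pairing every tuple of the current R partition with every tuple of the current S partition.

```python
def sort_merge_join(R, S, key=lambda t: t[0]):
    """Join two lists of tuples on key(t); returns a list of (r, s) pairs."""
    R, S = sorted(R, key=key), sorted(S, key=key)
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1                          # advance R until its key >= S's key
        elif key(R[i]) > key(S[j]):
            j += 1                          # advance S until its key >= R's key
        else:
            k = key(R[i])
            # Find the ends of the current R partition and S partition.
            i_end = i
            while i_end < len(R) and key(R[i_end]) == k:
                i_end += 1
            j_end = j
            while j_end < len(S) and key(S[j_end]) == k:
                j_end += 1
            # Every tuple in the R partition matches every tuple in the S partition.
            for r in R[i:i_end]:
                for s in S[j:j_end]:
                    result.append((r, s))
            i, j = i_end, j_end
    return result

# Made-up sample data: (sid, sname) and (sid, bid); sid 31 appears twice on both sides.
sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (31, "jonny"), (44, "guppy")]
reserves = [(28, 103), (31, 101), (31, 102), (58, 103)]
pairs = sort_merge_join(sailors, reserves)
print(len(pairs))
```

With sid 31 duplicated on both sides, that partition alone contributes 2 × 2 result pairs.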


Example of Sort-Merge Join

What if we have another record with sid = 31 (if sid is not the key), e.g., (31, jonny, 9, 37)?


Cost of Sort-Merge Join

• I/O Cost = Cost of sorting R + Cost of sorting S + Merging cost
• Cost to sort R = 2|R| (1 + ⌈log_{B-1} ⌈|R|/B⌉⌉) for external merge sort
• B = number of buffers
• Cost to sort S has a similar expression
• If each S partition is scanned at most once during merging
• Merging cost = |R| + |S|
• Best case?
• Worst case?
• Occurs when each tuple of R requires scanning entire S!
• Merging cost = |R| + |R| x |S| (can be reduced using block-nested loops)
Cost of Sort-Merge Join (Example)
• Reserves (R):
• pR tuples per page, M pages. pR = 100. M = 1000
• Sailors (S):
• pS tuples per page, N pages. pS = 80. N = 500

• Cost: 2M*K1+ 2N*K2+ (M+N)


• K1 and K2 are the number of passes to sort R and S respectively
• The cost of scanning, M+N, could be M*N (very unlikely!)
• With 35, 100 or 300 buffer pages, both R and S can be sorted in 2 passes
• total join cost = 2*1000*2 + 2*500*2 + 1000 + 500 = 7500

(BNL cost: 2500 to 15000 I/Os)


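
The 7500 I/O figure can be reproduced with a short sketch. It assumes a standard external merge sort (pass 0 creates sorted runs of B pages, each later pass merges up to B-1 runs) and that each S partition is scanned once during merging; `sort_passes` and `sort_merge_cost` are hypothetical helper names.

```python
import math

def sort_passes(pages, B):
    """Passes of external merge sort: pass 0 builds runs of B pages,
    each later pass merges up to B-1 runs."""
    runs = math.ceil(pages / B)
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (B - 1))
        passes += 1
    return passes

def sort_merge_cost(M, N, B):
    """2*M*K1 + 2*N*K2 + (M + N), with K1, K2 the sort passes for R and S."""
    return 2 * M * sort_passes(M, B) + 2 * N * sort_passes(N, B) + (M + N)

for B in (35, 100, 300):
    print(B, sort_passes(1000, B), sort_passes(500, B), sort_merge_cost(1000, 500, B))
```

For all three buffer sizes both relations sort in 2 passes, so the total stays at 7500 I/Os.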
GRACE Hash-Join
[Figure: R and S are each split into four buckets using bucketID = X mod 4. In the 4 × 4 grid of R buckets against S buckets, only the pairs with the same bucketID need to be joined; all other combinations are crossed out.]
• Operates in two phases:
• Partition phase
• Partition relation R (on join attribute) using hash fn h
• Partition relation S using the same hash fn h
• R tuples in partition i will only match S tuples in partition i
• Join phase
• Read in a partition of R
• Build a hash table for the partition using hash fn h2 (h2 ≠ h!)
• Scan matching partition of S, search for matches
• Using hash fn h2
• R ⋈ S = ∪i (Ri ⋈ Si)
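
The two phases can be sketched in memory as below. This is a simplified illustration, not a disk-based implementation: partitions are Python lists, h is assumed to be `key mod num_partitions` (so keys are assumed to be integers), and a Python dict stands in for the second hash function h2. All names are hypothetical.

```python
from collections import defaultdict

def grace_hash_join(R, S, num_partitions=4, key=lambda t: t[0]):
    """Partition both inputs with h, then build/probe each pair of partitions."""
    h = lambda k: k % num_partitions   # partitioning hash function h (integer keys assumed)

    # Partition phase: R tuples in partition i can only match S tuples in partition i.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for r in R:
        r_parts[h(key(r))].append(r)
    for s in S:
        s_parts[h(key(s))].append(s)

    # Join phase: build a hash table on Ri, then probe it with each tuple of Si.
    result = []
    for i in range(num_partitions):
        table = defaultdict(list)      # the dict's own hashing plays the role of h2
        for r in r_parts[i]:
            table[key(r)].append(r)
        for s in s_parts[i]:
            for r in table.get(key(s), []):
                result.append((r, s))
    return result

# Made-up sample data: (sid, bid) and (sid, sname).
reserves = [(28, 103), (31, 101), (31, 102), (58, 103)]
sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (58, "rusty")]
joined = grace_hash_join(reserves, sailors)
print(len(joined))
```

Because matching tuples always hash to the same partition, the union of the per-partition joins equals the full join.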


Partitioning Phase

[Figure: the original relation is read from disk through one INPUT buffer; hash function h routes each tuple to one of B-1 OUTPUT buffers, which are written out as partitions 1, 2, ..., B-1 on disk. B main memory buffers are used in total.]


[Figure sequence: Grace Hash Join: Partitioning Relation R, then Partitioning Relation S]


Joining Phase

[Figure: partitions of R and S are read from disk. A hash table for partition Ri (k < B-1 pages) is built using hash fn h2; Si is read through one input buffer and probed using h2; matches go to one output buffer for the join result. B main memory buffers are used in total.]


[Figure sequence: Grace Hash Join: Probing/Joining Phase]


Cost of Hash-Join

• In partitioning phase, read+write both relns
• 2(M+N) I/Os
• In matching phase, read both relns
• M+N I/Os
• Total: 3(M+N) I/Os. Independent of B???
• In our running example, this is a total of 4500 I/Os
• BNL cost: 2500 to 15000 I/Os
• SortMerge cost: 7500 I/Os
Observations on Hash-Join
• #partitions k ≤ B-1 (why?)
• B-2 ≥ size of largest partition to be held in memory
• Assuming uniformly sized partitions, and maximizing k, we get:
• k = B-1, and M/(B-1) ≤ B-2, i.e., B must be > √M
• If we build an in-memory hash table to speed up the matching of tuples, a little more memory is needed
• Note: the size of the hash table should be at least 1.2 × the total data size
• If the hash function does not partition uniformly, one or more R partitions may not fit in memory. How?
• Apply hash-join technique recursively to do the join of this R-partition with corresponding S-partition
• Can also apply any other join algorithms on that pair of partitions
• What if B < √M?
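
As a rough check of the memory bound, the sketch below searches for the smallest B that satisfies M/(B-1) ≤ B-2 under the uniform-partition assumption, and also evaluates the 3(M+N) cost. The helper names are hypothetical, and the extra memory for the in-memory hash table mentioned above is deliberately ignored.

```python
import math

def min_buffers_for_hash_join(M):
    """Smallest B such that, with k = B-1 uniform partitions, each partition
    of the M-page relation fits in B-2 pages."""
    B = 3
    while math.ceil(M / (B - 1)) > B - 2:
        B += 1
    return B

def hash_join_cost(M, N):
    """Partition (read + write both) plus matching (read both): 3(M + N)."""
    return 3 * (M + N)

print(min_buffers_for_hash_join(1000), math.isqrt(1000))  # a few pages above sqrt(M)
print(hash_join_cost(1000, 500))
```

The search lands slightly above √M, which matches the "B must be > √M" rule of thumb.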


Index Nested Loops Join
foreach tuple r in R do
    search index of S on sid using search key = r.sid
    for each matching key
        retrieve s; add (r, s) to result
• Precondition: there is an index on the join column of one relation, say S; make S the inner relation; and exploit the index


Index Nested Loops Join
• Consider R(A, B) ⋈A S(A, C)
• Assume that there is a B+-tree index on S.A
• First, join (5, r1) ∈ R with matching tuples in S
• Next, join (10, r2) ∈ R with matching tuples in S, and so on …
Index Nested Loops Join
• Cost: M + (M × pR) × cost of finding matching S tuples
• For each R tuple, cost of probing S index (assuming format 2, i.e., (Key, Rid)-pairs)
• B+ tree: H (traverse the index = height of tree) + cost of finding S tuples
• The second component depends on clustering
• Clustered index: 1 I/O (typical)
• Unclustered: up to 1 I/O per matching S tuple
• Hash index: 1.2 (magic number) + cost of finding S tuples
Examples of Index Nested Loops
• Hash-index on sid of S (as inner):
• Scan R: 1000 page I/Os, 100*1000 tuples
• For each R tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get (the exactly one) matching S tuple. Total: 100,000 × 2.2 = 220,000 I/Os for the probes, or 221,000 I/Os including the scan of R
• Hash-index on sid of R (as inner):
• Scan S: 500 page I/Os, 80*500 tuples
• For each S tuple: 1.2 I/Os to find index page with data entries, plus cost of
retrieving matching R tuples
• Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000).
Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is
clustered

How about B+-tree index??


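
The numbers in these examples follow directly from the cost formula. The sketch below just redoes that arithmetic; `inl_join_cost` is a hypothetical helper, and the 1.2 I/O probe cost and the 1 or 2.5 I/O fetch costs are the values assumed on the slide.

```python
def inl_join_cost(outer_pages, outer_tuples, probe_cost, fetch_cost):
    """Scan the outer once; for each outer tuple pay probe_cost I/Os in the
    index plus fetch_cost I/Os to retrieve the matching inner tuples."""
    return outer_pages + outer_tuples * (probe_cost + fetch_cost)

# R = Reserves (1000 pages, 100 tuples/page); S = Sailors (500 pages, 80 tuples/page)
s_inner = inl_join_cost(1000, 100 * 1000, 1.2, 1)             # hash index on S.sid
r_inner_clustered = inl_join_cost(500, 80 * 500, 1.2, 1)      # hash index on R.sid, clustered
r_inner_unclustered = inl_join_cost(500, 80 * 500, 1.2, 2.5)  # hash index on R.sid, unclustered
print(round(s_inner), round(r_inner_clustered), round(r_inner_unclustered))
```

With S as the outer relation there are fewer probes (40,000 instead of 100,000), so both R-inner variants come out cheaper than the S-inner plan under these assumptions.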
General Join Conditions
• Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname):
• Join on one predicate, and treat the rest as selections
• Which attribute should be used for the join? Choose the one that is more selective
• For Index NL, build index on <sid, sname> (if S is inner); or use existing indexes on sid or sname
• For Sort-Merge and Hash Join, sort/partition on combination of the two join columns
• Inequality join (R.sid < S.sid)
• Nested loops is fine; index nested loops join requires a B+-tree index
• Sort-merge join is fine: it incurs sorting overhead but saves some cost in scanning of S
• Hash-based joins are not directly applicable
Simple Selections
• Of the form: σ R.attr op value (R)
• selectivity = size of result / ||R||

SELECT *
FROM R
WHERE weight>64 and
      height>170


Access Path
• Access path refers to a way of accessing data records/entries
• Table scan
• With no index, unsorted: Must essentially scan the whole relation; cost is M (#pages in R)
• Index-only scan (treat the index as a table)
• An index containing the relevant information: Scan the index
• Index search
• With an index on selection attribute: Use index to find qualifying data entries, then retrieve corresponding data records. (You already know this: B+-tree and hash indexes)
• Index intersection: Combine results from multiple index scans/retrievals (e.g., intersection, union)
• Index-based access paths (except index-only scan) can be followed by RID lookups to retrieve data records
• Note: base tables are the original tables


Cost/Selectivity of an Access Path

• Cost/Selectivity of an access path = number of index and data pages retrieved to access data records/entries
• The most selective access path = one that retrieves the fewest pages
• Usually, index-based access paths are superior, but table scans can win too, e.g., over an unclustered index


Using an Index for Selections
• Cost depends on #qualifying tuples, and clustering
• Cost of finding qualifying data entries (typically small) plus
cost of retrieving records (could be large w/o clustering)
• In example, assuming uniform distribution of names, about
10% of tuples qualify (100 pages, 10000 tuples)
• Clustered index: ~ 100 I/Os
• Unclustered: up to 10000 I/Os!
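
A back-of-the-envelope comparison of the three access paths for this example might look like the sketch below. It assumes the running Reserves sizes (1000 pages, 100 tuples per page), ignores the typically small cost of locating the first qualifying data entry, and uses a hypothetical helper name.

```python
def selection_costs(pages, tuples_per_page, percent_qualifying):
    """Rough page I/Os for a selection where a given percentage of tuples qualify."""
    qualifying_tuples = pages * tuples_per_page * percent_qualifying // 100
    qualifying_pages = pages * percent_qualifying // 100
    return {
        "table scan": pages,                     # read every page
        "clustered index": qualifying_pages,     # matching tuples are stored contiguously
        "unclustered index": qualifying_tuples,  # up to one I/O per matching tuple
    }

costs = selection_costs(pages=1000, tuples_per_page=100, percent_qualifying=10)
print(costs)
```

Under these assumptions a plain table scan (1000 I/Os) beats the unclustered index (up to 10000 I/Os), which is why table scans can still win.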


Two Approaches to General Selections
• First approach: Find the most selective access path, retrieve
tuples using it, and apply any remaining terms that don’t
match the index:
• Most selective access path: An index or file scan that we estimate
will require the fewest page I/Os
• Terms that match this index reduce the number of tuples retrieved; other
terms are used to discard some retrieved tuples, but do not affect number of
tuples/pages fetched
• Consider day<8/9/94 AND bid=5 AND sid=3.
• A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each
retrieved tuple
• What is the I/O cost?
• Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked



Intersection of Rids

• Second approach (if we have 2 or more matching indexes, assuming leaf data entries are pointers):
• Get sets of rids of data records using each matching index
• Then intersect these sets of rids (we’ll discuss intersection soon!)
• Retrieve the records and apply any remaining terms
• Consider day<8/9/94 AND bid=5 AND sid=3
• If we have a B+ tree index on day and an index on sid, we can retrieve rids
of records satisfying day<8/9/94 using the first, rids of recs satisfying sid=3
using the second, intersect the rids, retrieve records and check for bid=5

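
The rid-intersection approach can be sketched with Python sets. Everything here is a toy stand-in: the table is a dict keyed by rid, the two "indexes" are simulated by set comprehensions, and the date 8/9/94 is read as 1994-08-09, which is an assumption since the slide does not say which date format is meant.

```python
def select_by_rid_intersection(records, rid_sets, residual):
    """Intersect the rid sets from each matching index, fetch only those
    records, and apply the remaining predicate."""
    rids = set.intersection(*rid_sets)
    return [records[rid] for rid in sorted(rids) if residual(records[rid])]

# Toy Reserves table keyed by rid: (sid, bid, day), with day as an ISO date string.
records = {
    1: (3, 5, "1994-07-01"),
    2: (3, 7, "1994-06-15"),
    3: (4, 5, "1994-05-20"),
    4: (3, 5, "1994-10-10"),
}
# Stand-ins for what the B+ tree index on day and the index on sid would return.
rids_day = {rid for rid, (_, _, day) in records.items() if day < "1994-08-09"}
rids_sid = {rid for rid, (sid, _, _) in records.items() if sid == 3}

# bid=5 is the remaining term, checked only on the fetched records.
matches = select_by_rid_intersection(records, [rids_day, rids_sid], lambda t: t[1] == 5)
print(matches)
```

Only the records whose rids survive the intersection are fetched, so the residual check on bid touches two records instead of four.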


B+-tree: Index Intersection

RID intersection = {RID51}



The Projection Operation (Duplicate Elimination)
• πL(R) projects columns given by list L from relation R
• Example: select distinct age from R
• What about duplicates? Relational algebra vs SQL (relational DBMS)
• π*L(R): same as πL(R) but preserves duplicates


The Projection Operation:
Sort-based Approach



Sort-based Approach: Cost Analysis

• Step 1 (scan R and project onto L):
• Cost to scan records = |R|
• Cost to output temporary result = |π*L(R)|
• Step 2 (sort the temporary result):
• Cost to sort records = 2 |π*L(R)| (log_m(N0) + 1)
• N0 = number of initial sorted runs; m = merge factor
• Step 3 (scan the sorted result and drop duplicates):
• Cost to scan records = |π*L(R)|
• Optional: Cost to store answer = |πL(R)|


Optimized Sort-based Approach
• An approach based on sorting:
• Modify phase 1 of external sort to eliminate unwanted fields
• Runs are produced, but tuples in runs are smaller than input tuples (Size ratio depends
on # and size of fields that are dropped)
• Modify merging passes to eliminate duplicates
• Number of result tuples smaller than input (Difference depends on # of duplicates)
• Cost:
• In phase 1, read original relation (size M), write out same number of smaller tuples
• In merging passes, fewer tuples written out in each pass



Hash-based Approach



Partitioning Phase
• Use one buffer for input and (B-1) buffers for output
• Read R one page at a time into input buffer
• For each tuple t in input buffer
• Project out unwanted attributes from t to form t'
• Apply a hash function h on t' to distribute t' into one output buffer
• Flush output buffer to disk whenever buffer is full
• Optimization: If a duplicate happens to be found in the output buffer, it can be removed immediately. Overhead??


Duplicate Elimination Phase
• For each partition Rj
• Initialize an in-memory hash table
• Read π*L(Rj) one page at a time; for each tuple t read
• Hash t into bucket Bj with hash function h' (h' ≠ h)
• Insert t into Bj if t ∉ Bj
• Write out tuples in hash table to results
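
The two phases of hash-based projection can be sketched in memory as follows. This is illustrative only: partitions are Python lists instead of disk files, Python's built-in `hash` modulo the partition count stands in for h, a `set` stands in for the hash table built with h', and the function name and sample rows are made up.

```python
def hash_project_distinct(rows, columns, num_partitions=3):
    """Duplicate-eliminating projection: partition on h, then dedupe each partition."""
    # Partitioning phase: project each tuple and route it with hash function h.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        t = tuple(row[c] for c in columns)
        partitions[hash(t) % num_partitions].append(t)

    # Duplicate elimination phase: one in-memory hash table per partition.
    result = []
    for part in partitions:
        seen = set()                 # plays the role of the table built with h'
        for t in part:
            if t not in seen:        # insert t only if it is not already present
                seen.add(t)
        result.extend(seen)
    return result

rows = [
    {"sid": 1, "sname": "dustin", "age": 45.0},
    {"sid": 2, "sname": "lubber", "age": 55.5},
    {"sid": 3, "sname": "rusty", "age": 45.0},
    {"sid": 4, "sname": "zorba", "age": 16.0},
]
ages = hash_project_distinct(rows, ["age"])
print(sorted(ages))
```

Equal projected tuples always land in the same partition, so removing duplicates within each partition is enough to remove them globally.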


[Figure sequence: Example: Partitioning Phase, followed by Example: Duplicate Elimination Phase]


Hash-based Approach: Partition Overflow
• What happens if π*L(Rj) is larger than the available memory?
• Recursively apply hash-based partitioning to the overflowed partition


Cost Analysis of Hash-based Approach

(ignore cost to write output)


What if an index is available?
• If the search key of an index contains all the wanted attributes
• Replace table scan with index scan!
• If the index is ordered (e.g., B+-tree) and its search key includes the wanted attributes as a prefix
• Scan data entries in order
• Compare adjacent data entries for duplicates
• Example
• Use B+-tree index on R with key (A,B) to evaluate query πA(R)
Set Operations
• Set operations
• Cross-product: R × S
• Intersection: R ∩ S
• Union: R ∪ S
• Difference: R - S
• Intersection and cross-product: special cases of join
• R(A, B) ∩ S(A, B) = πR.*(R ⋈p S)
• p = (R.A = S.A) ∧ (R.B = S.B)
• R × S = R ⋈ S with an empty join predicate
• Union (Distinct) and Difference are similar
• Sorting based approach
• Hash based approach


Set Operations
• Sorting based approach to union:
• Sort both relations (on combination of all attributes)
• Scan sorted relations and merge them
• Implementation typically removes duplicates while merging (why?)
• Hash based approach to union:
• Partition R and S using hash function h
• For each S-partition, build in-memory hash table (using h2), scan
corr. R-partition and add tuples to table while discarding duplicates
• Algorithms for R – S are similar
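
A minimal in-memory sketch of the hash-based approach to union, with difference handled analogously, is shown below. Partitions are Python lists, `hash(t) mod n` stands in for h, and a `set` stands in for the in-memory table built with h2; the function names and sample tuples are assumptions.

```python
def _partition(tuples, n):
    """Split tuples into n partitions using h = hash(t) mod n."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t) % n].append(t)
    return parts

def hash_union(R, S, n=4):
    """R ∪ S: per partition, build a table on S and add R tuples, dropping duplicates."""
    result = []
    for part_r, part_s in zip(_partition(R, n), _partition(S, n)):
        table = set(part_s)      # in-memory table on the S-partition
        table.update(part_r)     # add R tuples; duplicates are discarded
        result.extend(table)
    return result

def hash_difference(R, S, n=4):
    """R - S: per partition, keep the R tuples that are not in the S table."""
    result = []
    for part_r, part_s in zip(_partition(R, n), _partition(S, n)):
        table = set(part_s)
        result.extend({t for t in part_r if t not in table})
    return result

R = [(1, "a"), (2, "b"), (2, "b")]
S = [(2, "b"), (3, "c")]
print(sorted(hash_union(R, S)), sorted(hash_difference(R, S)))
```

Identical tuples hash to the same partition, so each pair of partitions can be processed independently.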
Aggregate Operations
(COUNT, SUM, AVG, MIN, MAX)
SELECT AVG(SALARY)
FROM EMPLOYEE

• Without grouping:
• In general, requires scanning the relation
• Given index whose search key includes all attributes in the SELECT or
WHERE clauses, can do index-only scan



Aggregate Operations
(COUNT, SUM, AVG, MIN, MAX)
SELECT DEPT, AVG(SALARY)
FROM EMPLOYEE
GROUP BY DEPT
• With grouping:
• Sort on group-by attributes, then scan relation and compute aggregate
for each group
• Similar approach based on hashing on group-by attributes
• Given tree index whose search key includes all attributes in SELECT, WHERE and GROUP BY clauses, can do index-only scan; if group-by attributes form prefix of search key, can retrieve data entries/tuples in group-by order (no sorting needed)
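
The sort-based and hash-based strategies for the GROUP BY query above can be sketched as follows. Rows are assumed to be (DEPT, SALARY) pairs held in memory; the function names and sample data are made up for illustration.

```python
from collections import defaultdict
from itertools import groupby

def avg_by_sort(rows):
    """Sort on the group-by attribute, then aggregate each run of equal keys."""
    out = {}
    by_dept = lambda r: r[0]
    for dept, group in groupby(sorted(rows, key=by_dept), key=by_dept):
        salaries = [salary for _, salary in group]
        out[dept] = sum(salaries) / len(salaries)
    return out

def avg_by_hash(rows):
    """Hash on the group-by attribute, keeping a running (sum, count) per group."""
    acc = defaultdict(lambda: [0, 0])
    for dept, salary in rows:
        acc[dept][0] += salary
        acc[dept][1] += 1
    return {dept: total / count for dept, (total, count) in acc.items()}

employees = [("cs", 100), ("ee", 80), ("cs", 120), ("ee", 60), ("me", 90)]
print(avg_by_sort(employees))
print(avg_by_hash(employees))
```

The hash version needs only one pass and constant state per group, while the sort version produces its output in group-by order.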


Summary
• A virtue of relational DBMSs: queries are composed of a few
basic operators; the implementation of these operators can
be carefully tuned (and it is important to do this!)
• Many alternative implementation techniques for each
operator; no universally superior technique for most
operators
• Must consider available alternatives for each operation in a
query and choose best one based on system statistics, etc.
This is part of the broader task of optimizing a query
composed of several ops
