Chapter15 2
Query Execution
Sukarna Barua
Associate Professor, CSE, BUET
03/20/2024
Nested-loop Joins
Nested-loop joins:
Require "one and a half" passes.
Tuples of one argument are read only once, while tuples of the other argument are read repeatedly.
Can be used for relations of any size. [no requirement on memory]
Neither relation needs to fit in main memory.
$M \ge 2$ is sufficient for this join.
Tuple-based Nested Loop Join
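Algorithm for computing $R(X,Y) \bowtie S(Y,Z)$: for each tuple $s$ of $S$, scan every tuple $r$ of $R$; if $r$ and $s$ agree on the join attributes $Y$, output the joined tuple.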
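A minimal sketch of the tuple-based nested loop, assuming relations are in-memory Python lists of tuples. The function name `tuple_nested_loop_join` and the parameters `r_key` and `s_key` (positions of the join attribute) are illustrative and not from the slides.

```python
def tuple_nested_loop_join(R, S, r_key, s_key):
    """Join every tuple of R with every tuple of S that agrees on the join attribute."""
    out = []
    for s in S:                      # outer loop: each tuple of S is read once
        for r in R:                  # inner loop: R is rescanned for every s
            if r[r_key] == s[s_key]:
                # keep the join attribute only once in the joined tuple
                out.append(r + s[:s_key] + s[s_key + 1:])
    return out


R = [(1, "a"), (2, "b"), (3, "b")]   # R(X, Y)
S = [("b", 10), ("c", 20)]           # S(Y, Z)
result = tuple_nested_loop_join(R, S, r_key=1, s_key=0)
```

The inner loop rescans all of `R` for each tuple of `S`, which is why the block-based variant is preferred in practice.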
Block-based Nested-loop joins
An improvement over tuple-based nested loop join.
Assume $B(S) \le B(R)$ and $B(S) > M$ [$R$ and $S$ don't fit in main memory]
Algorithm:
Repeatedly read $M-1$ blocks of $S$ into main memory.
Create a search data structure (search key = join attributes) in main memory.
Read blocks of $R$ one by one and for each tuple of $R$:
Find matching tuples of $S$ [from main memory].
Compute the joined tuple and send it to output.
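The steps above can be sketched as follows, assuming each relation is a list of blocks and each block is a list of tuples. The name `block_nested_loop_join` and the simulated read counter are illustrative assumptions; no real disk access takes place.

```python
from collections import defaultdict


def block_nested_loop_join(R_blocks, S_blocks, M, r_key, s_key):
    """Return (joined tuples, number of simulated block reads)."""
    out, reads = [], 0
    chunk = M - 1                                # buffers available for S
    for i in range(0, len(S_blocks), chunk):
        index = defaultdict(list)                # search structure on the join attribute
        for block in S_blocks[i:i + chunk]:      # load the next chunk of S
            reads += 1
            for s in block:
                index[s[s_key]].append(s)
        for block in R_blocks:                   # remaining buffer: scan R block by block
            reads += 1
            for r in block:
                for s in index.get(r[r_key], []):
                    out.append(r + s[:s_key] + s[s_key + 1:])
    return out, reads


R_blocks = [[(1, "a"), (2, "b")], [(3, "b"), (4, "c")]]
S_blocks = [[("b", 10)], [("c", 20)], [("d", 30)]]
joined, reads = block_nested_loop_join(R_blocks, S_blocks, M=3, r_key=1, s_key=0)
```

With `M=3` the three blocks of `S` are processed in two chunks, so `R` is scanned twice instead of three times.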
Block-based Nested-loop joins
I/O cost of block-based nested-loop join:
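Let $B(R) = 1000$, $B(S) = 500$, and $M = 101$.
$100$ disk I/Os to read every $100$ blocks of $S$ into main memory.
$1000$ disk I/Os of $R$ for the inner loop [$R$ must be retrieved repeatedly, once for every 100 blocks of $S$].
Total I/O $= 5 \times (100 + 1000) = 5500$.
If we read one block of $S$ at a time, then cost $= 500 \times (1 + 1000) = 500{,}500$.
Significantly higher than 5500!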
Analysis of nested-loop join
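Assume $S$ is the smaller relation: $B(S) \le B(R)$.
The number of iterations of the outer loop is $B(S)/(M-1)$.
At each iteration, we read $M-1$ blocks of $S$ and $B(R)$ blocks of $R$.
The number of disk I/Os: $\frac{B(S)}{M-1}\,\bigl(M - 1 + B(R)\bigr) = B(S) + \frac{B(S)\,B(R)}{M-1} \approx \frac{B(S)\,B(R)}{M}$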
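As a quick check of this formula, the sketch below evaluates it for example sizes $B(R) = 1000$, $B(S) = 500$ with $M = 101$ and with $M = 2$ (one block of $S$ at a time). The function name `nested_loop_io` is an illustrative assumption.

```python
import math


def nested_loop_io(b_r, b_s, m):
    """Disk I/Os of block-based nested-loop join with S read in chunks of M-1 blocks."""
    chunk = m - 1
    iterations = math.ceil(b_s / chunk)    # number of outer-loop iterations
    return b_s + iterations * b_r          # S read once, R read once per iteration


cost_chunked = nested_loop_io(1000, 500, 101)    # 100-block chunks of S
cost_one_block = nested_loop_io(1000, 500, 2)    # one block of S at a time
```

The ceiling handles a last chunk that is only partially full.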
Cost Summary
[Table: summary of costs for different operations]
Two-Phase Multi-way Merge Sort (TPMMS)
Algorithm:
Phase 1: Repeatedly fill the $M$ buffers with blocks of $R$ (from disk) and sort them using any main-memory algorithm. Write the sorted sublists to disk.
Phase 2: Merge the sorted sublists as follows.
Assume there are at most $M-1$ sorted sublists. [Constraint: why?]
Allocate one memory block for each sorted sublist and one block for output.
Keep a pointer into the current block of each input sublist:
It points to the first tuple in that block not yet moved to output.
Merge the input sublists [discussed next]
Two-Phase Multi-way Merge Sort (TPMMS)
Algorithm for merging sorted sublists:
Find the smallest tuple among all input sublists [main memory operation]
Move the smallest to the output block.
If the output block is full, write it to disk.
If any input block becomes empty, get the next block from the corresponding sorted sublist.
If no next block remains, the input block stays empty because the sublist is exhausted.
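A minimal in-memory sketch of both phases, assuming the relation is a list of blocks of sortable values. The name `tpmms` is illustrative, and block-level buffering in phase 2 is abstracted into one cursor per sublist; a heap stands in for "find the smallest tuple among all input sublists".

```python
import heapq


def tpmms(blocks, M):
    """Sort a relation given as a list of blocks, using at most M buffers."""
    # Phase 1: fill M buffers at a time, sort in memory, emit one sorted sublist.
    sublists = []
    for i in range(0, len(blocks), M):
        values = [v for block in blocks[i:i + M] for v in block]
        sublists.append(sorted(values))
    if len(sublists) > M - 1:
        raise ValueError("too many sublists for a single merge pass")
    # Phase 2: the heap holds the current smallest unread value of each sublist.
    heap = [(sub[0], idx, 0) for idx, sub in enumerate(sublists) if sub]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)    # smallest value over all sublists
        out.append(value)
        if pos + 1 < len(sublists[idx]):         # advance that sublist's cursor
            heapq.heappush(heap, (sublists[idx][pos + 1], idx, pos + 1))
    return out


blocks = [[5, 1], [9, 3], [2, 8], [7, 4], [6, 0]]
sorted_values = tpmms(blocks, M=3)
```

With `M=3` the five blocks form two sublists, which is the most that a merge using `M-1` input buffers can handle.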
Two-Phase Multi-way Merge Sort (TPMMS)
Memory requirement for TPMMS:
There cannot be more than $M-1$ sublists.
Each sublist consists of $M$ blocks.
The number of sublists: $B(R)/M$.
We require that: $B(R)/M \le M-1$, i.e. $B(R) \le M(M-1)$
$B(R) \le M^2$ [for simplicity!]
Two-Phase Multi-way Merge Sort (TPMMS)
Calculation of I/O cost of TPMMS:
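Suppose the block size is $b$ bytes and the memory size is $m$ bytes, so $M = m/b$. What is the maximum size of $R$ for TPMMS?
Max size of $R$: $B(R) \le M^2$ blocks, i.e. about $M^2 \cdot b = m^2/b$ bytes.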
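To make the bound concrete, here is a small calculation under assumed values: 64 KB blocks and 1 GB of main memory. These numbers are illustrative assumptions and are not taken from the slide; the function name `max_sortable_bytes` is likewise hypothetical.

```python
def max_sortable_bytes(block_bytes, memory_bytes):
    """Return (M, largest relation size in bytes) under the bound B(R) <= M^2."""
    m = memory_bytes // block_bytes      # number of memory buffers M
    max_blocks = m * m                   # approximate TPMMS limit on B(R)
    return m, max_blocks * block_bytes


# Assumed example values: 64 KB blocks, 1 GB of memory.
m, max_bytes = max_sortable_bytes(block_bytes=64 * 1024, memory_bytes=1024 ** 3)
```

Under these assumptions $M = 2^{14}$ and the limit is $2^{44}$ bytes (16 TB), which suggests that two passes suffice for most practical relations.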
Duplicate Elimination Using Sorting: Two Pass Algorithm
Algorithm: Similar to TPMMS.
Phase 1: First sort the tuples of $R$ into sublists as in TPMMS. [Do not merge]
Phase 2:
Like TPMMS, use one memory block for each sublist and one memory block for output. Bring one block from each sublist into memory.
Repeatedly perform the following:
Find the smallest tuple $t$ among all sublists, and all other tuples identical to $t$.
Send one copy of $t$ to output and discard the other copies.
When an output block is full, write it to disk.
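Phase 2 can be sketched as follows, assuming the sorted sublists are already available as Python lists. `heapq.merge` plays the role of repeatedly finding the smallest remaining tuple; the function name `distinct_from_sublists` is illustrative.

```python
import heapq


def distinct_from_sublists(sublists):
    """Merge sorted sublists, emitting each distinct tuple exactly once."""
    out = []
    for t in heapq.merge(*sublists):     # yields the smallest remaining tuple each time
        if not out or out[-1] != t:      # copies of t arrive consecutively
            out.append(t)
    return out


sublists = [[1, 2, 2, 5], [2, 3, 5], [1, 4]]
unique = distinct_from_sublists(sublists)
```

Because the merge is sorted, all copies of a tuple are adjacent, so comparing with the last emitted tuple is enough.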
Duplicate Elimination Using Sorting: Two Pass Algorithm
I/O cost and memory requirement:
Disk I/O: $3B(R)$ [ignoring the output write to disk]
Memory requirement: $B(R) \le M^2$, i.e. $M \ge \sqrt{B(R)}$
Compare: memory requirement for one-pass duplicate elimination is $M \ge B(\delta(R))$
Grouping and Aggregation Using Sorting: Two Pass Algorithm
Phase 1: First sort the tuples of $R$ into sublists as in TPMMS, based on the grouping attributes [sort key = $L$].
Phase 2:
Like TPMMS, use one memory block for each sublist and one memory block for
output. Bring one block from each sublist to memory.
Repeatedly perform the following:
Find the smallest tuple based on the sort key. Assume its sort-key value is $v$, which becomes the next group. Prepare aggregate accumulation variables for this group (e.g., MIN, MAX, SUM, COUNT, etc.).
Examine all tuples with sort key $v$ and update the accumulated result.
If an input buffer becomes empty, replace it with the next block from the same sublist.
When there are no more tuples with sort key $v$, output a tuple consisting of the grouping attributes and the aggregate values.
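A minimal sketch of phase 2 for SUM and COUNT, assuming each sublist is a Python list of `(group, value)` tuples already sorted on the grouping attribute. The name `group_sum_count` and the choice of aggregates are illustrative assumptions.

```python
import heapq


def group_sum_count(sublists, key=0, value=1):
    """Return (group, SUM, COUNT) rows from sublists sorted on the grouping attribute."""
    out = []
    current, total, count = None, 0, 0
    for t in heapq.merge(*sublists, key=lambda t: t[key]):
        if count and t[key] != current:      # previous group is complete: emit it
            out.append((current, total, count))
            total, count = 0, 0
        current = t[key]
        total += t[value]                    # update the accumulators for this group
        count += 1
    if count:                                # emit the last group
        out.append((current, total, count))
    return out


sublists = [[("a", 1), ("b", 5)], [("a", 2), ("c", 7)], [("b", 1)]]
groups = group_sum_count(sublists)
```

Only the accumulators of the current group are held in memory, which is what keeps the memory requirement independent of the number of groups.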
Grouping and Aggregation Using Sorting: Two Pass Algorithm
I/O Cost and memory requirement of two-pass grouping and aggregation:
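Disk I/O: $3B(R)$ [ignoring the output write to disk]
Memory requirement for two pass: $B(R) \le M^2$.
Compare with the one-pass algorithm: its memory requirement grows with the number of groups, since one accumulator entry per group must stay in memory.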
Sort-based Set Union: Two Pass Algorithm
Algorithm for $R \cup S$:
Phase 1: Create sorted sublists for both $R$ and $S$. [similar to TPMMS]
Phase 2:
Use one memory buffer for each sublist of $R$ and $S$, and one buffer for output.
Bring one block from each sublist of $R$ and $S$ into memory.
Repeatedly perform the following:
Find the smallest tuple $t$ among all buffers. Send one copy of $t$ to the output.
Repeatedly find all other copies of $t$ and discard them.
Handle input and output blocks as in TPMMS.
Sort-based Set Union: Two Pass Algorithm
Cost and memory requirement of two-pass set union:
Disk I/O:
$R$ and $S$ are read and written once while the sorted sublists are created.
$R$ and $S$ are read a second time in phase 2 while the output is generated.
Cost = $3(B(R) + B(S))$
Memory requirement: $B(R) + B(S) \le M^2$
The total number of sublists of $R$ and $S$ cannot exceed $M$ [$M-1$ to be more correct!]
Sort-based Set Intersection: Two Pass Algorithm
Algorithm for $R \cap S$:
Phase 1: Same as set union.
Phase 2:
Use one memory buffer for each sublist of $R$ and $S$, and one buffer for output. Bring one block from each sublist of $R$ and $S$ into memory.
Repeatedly perform the following:
Find the smallest tuple $t$ and check whether it is present in at least one sublist of $R$ and at least one sublist of $S$ [main memory operation]. If yes, copy $t$ to output.
Repeatedly find all tuples of $R$ and $S$ that are identical to $t$. Remove and discard them.
Handle input and output blocks as in TPMMS.
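One way to sketch phase 2 of the intersection, assuming the sorted sublists are Python lists. Each tuple is tagged with its source relation so the merge can tell whether a value occurred on both sides; the tagging scheme and the name `sorted_intersection` are illustrative assumptions rather than the slide's exact procedure.

```python
import heapq


def sorted_intersection(r_sublists, s_sublists):
    """Emit t once if it occurs in some sublist of R and in some sublist of S."""
    r_stream = ((t, "R") for t in heapq.merge(*r_sublists))
    s_stream = ((t, "S") for t in heapq.merge(*s_sublists))
    out, current, seen = [], None, set()
    for t, source in heapq.merge(r_stream, s_stream):
        if seen and t != current:            # all copies of `current` were consumed
            if seen == {"R", "S"}:
                out.append(current)
            seen = set()
        current = t
        seen.add(source)                     # remember which relations contained t
    if seen == {"R", "S"}:                   # handle the final run of equal tuples
        out.append(current)
    return out


common = sorted_intersection([[1, 3, 5], [2, 3]], [[3, 4], [0, 5, 5]])
```

Replacing the `seen == {"R", "S"}` test with `seen` alone would give the set union instead.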
Sort based Join operation
Algorithm for $R(X,Y) \bowtie S(Y,Z)$:
Sort $R$ using TPMMS with $Y$ as the sort key. Store sorted $R$ on disk.
Sort $S$ using TPMMS with $Y$ as the sort key. Store sorted $S$ on disk.
Merge sorted $R$ and $S$ as follows:
Use two memory buffers: one for $R$ and one for $S$.
Repeatedly do the following:
Find the least value $y$ of the join attribute $Y$ that is currently at the front of the blocks for $R$ and $S$.
If $y$ does not appear at the front of both relations, then remove the tuples with sort key $y$.
Otherwise, get all tuples from both relations having sort key $y$. [More blocks may need to be read from disk during this step. Why?]
Output all tuples formed by joining tuples from $R$ and $S$ having common $y$.
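The merge step can be sketched as follows, assuming both relations are already sorted on the join attribute and held as Python lists of tuples. The name `merge_join` and the index-based cursors are illustrative; block reads are not modelled.

```python
def merge_join(R_sorted, S_sorted, r_key, s_key):
    """Join two lists of tuples that are already sorted on the join attribute."""
    out, i, j = [], 0, 0
    while i < len(R_sorted) and j < len(S_sorted):
        y_r, y_s = R_sorted[i][r_key], S_sorted[j][s_key]
        if y_r < y_s:
            i += 1                       # y_r has no partner in S: skip it
        elif y_r > y_s:
            j += 1                       # y_s has no partner in R: skip it
        else:
            # collect every tuple with this join value from both relations
            i_end = i
            while i_end < len(R_sorted) and R_sorted[i_end][r_key] == y_r:
                i_end += 1
            j_end = j
            while j_end < len(S_sorted) and S_sorted[j_end][s_key] == y_r:
                j_end += 1
            for r in R_sorted[i:i_end]:
                for s in S_sorted[j:j_end]:
                    out.append(r + s[:s_key] + s[s_key + 1:])
            i, j = i_end, j_end
    return out


R_sorted = [(1, "a"), (2, "b"), (3, "b"), (4, "d")]
S_sorted = [("b", 10), ("b", 11), ("c", 20), ("d", 30)]
joined = merge_join(R_sorted, S_sorted, r_key=1, s_key=0)
```

The two inner `while` loops correspond to gathering all tuples with the same $y$, which is the step that may pull in extra blocks when a run of equal values spans block boundaries.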
Cost of Sort Based Join
Total I/O cost:
Cost for sorting $R$ and $S$ and storing the final result on disk: $4(B(R) + B(S))$
Cost for the final merge: $B(R) + B(S)$ [read sorted $R$ and $S$ once]
Total cost $= 5(B(R) + B(S))$.
Memory requirement:
Sorting $R$ using TPMMS requires $B(R) \le M^2$.
Sorting $S$ using TPMMS requires $B(S) \le M^2$.
Combining, we get $\max(B(R), B(S)) \le M^2$.
All tuples having the same $Y$ value must fit in $M$ memory blocks.
[If all tuples have the same $Y$ value, the memory requirement increases.]
Cost of Sort Based Join
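Calculation of I/O cost for sort-based join:
Assume $B(R) = 1000$, $B(S) = 500$, and $M = 101$.
Sorting $R$ using TPMMS takes $4B(R) = 4000$ I/Os.
Sorting $S$ using TPMMS takes $4B(S) = 2000$ I/Os.
Merging requires accessing $R$ and $S$ one last time, causing $B(R) + B(S) = 1500$ I/Os.
Total I/O cost $= 7500$.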
Cost of Sort Based Join
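Calculate the same cost for larger relations $R$ and $S$.
Sorting $R$ using TPMMS takes $4B(R)$ I/Os.
Sorting $S$ using TPMMS takes $4B(S)$ I/Os.
Merging requires accessing $R$ and $S$ one last time, causing $B(R) + B(S)$ I/Os.
Total I/O cost $= 5(B(R) + B(S))$, which is lower than the nested-loop cost once the relations are large enough.
Reason: nested-loop join requires I/O proportional to $B(R)\,B(S)$, while sort-based join requires I/O proportional to $B(R) + B(S)$.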
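The crossover can be illustrated numerically. The sketch below compares the two cost formulas for $B(R) = 1000$, $B(S) = 500$, $M = 101$ and for relations ten times larger; the tenfold sizes are an assumed example, and the function names are hypothetical.

```python
import math


def nested_loop_io(b_r, b_s, m):
    """Block-based nested-loop join: S read once, R read once per chunk of S."""
    return b_s + math.ceil(b_s / (m - 1)) * b_r


def sort_join_io(b_r, b_s):
    """Simple sort-based join: 4 I/Os per block to sort, 1 more for the merge."""
    return 5 * (b_r + b_s)


small = (nested_loop_io(1000, 500, 101), sort_join_io(1000, 500))
# Assumed larger example: both relations ten times bigger, same M.
large = (nested_loop_io(10000, 5000, 101), sort_join_io(10000, 5000))
```

For the small case the nested loop is cheaper (5500 versus 7500), while for the larger case the sort-based join wins by a wide margin (505000 versus 75000).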
An Efficient Sort Based Join: Two Pass Algorithm
An efficient version with a constraint: the number of tuples having a common sort-key value should be small.
Algorithm:
Phase 1: Create sorted sublists of $R$ and $S$ based on sort key $Y$.
Phase 2:
Bring the first block of each sublist of $R$ and $S$ into a memory buffer.
Repeatedly perform the following:
Find the smallest value $y$ among all sublists. Find all tuples that have the same sort key $y$.
Combine tuples of $R$ and $S$ with the common value $y$ to create joined tuples.
Copy joined tuples to output.
An Efficient Sort Based Join: Two Pass Algorithm
Cost of efficient sort-based join:
I/O cost: $3(B(R) + B(S))$.
Memory requirement: $B(R) + B(S) \le M^2$
Example: Assume $B(R) = 1000$, $B(S) = 500$, and $M = 101$.
Using 100 blocks per sublist: 10 sublists for $R$, 5 sublists for $S$.
This will require $15$ blocks of memory in phase 2 for reading the sublists of $R$ and $S$; the remaining $86$ blocks will be free to handle a large number of tuples with the same sort-key value [if more blocks are required].
$2 \times (1000 + 500) = 3000$ I/Os for phase 1 sorted sublist creation.
$1000 + 500 = 1500$ I/Os for phase 2 reading of the sorted sublists from both $R$ and $S$.
Total I/O $= 4500$
Summary of Sort Based Algorithms
[Table: cost summary for sort-based algorithms]
Two Pass Algorithm Based on Hashing
Involves partitioning $R$ into $M-1$ buckets of roughly equal size.
Algorithm for hash partitioning:
Assume $h$ is a hash function. The algorithm to partition $R$ into $M-1$ buckets using $h$ is as follows.
Allocate one memory buffer for each bucket and one memory buffer to read one block of $R$.
Read one block of $R$ at a time and for each tuple $t$:
Find $h(t)$ and copy $t$ to the buffer holding bucket $h(t)$.
If the buffer is full, write it to disk and reinitialize it for the next tuples of the same bucket.
At the end, write the memory buffers of all buckets to disk.
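A minimal simulation of the partitioning loop, assuming the relation is a list of blocks and "disk" is modelled as a per-bucket list of written blocks. The name `hash_partition` and the parameters `num_buckets`, `block_size`, and `h` are illustrative assumptions.

```python
def hash_partition(blocks, num_buckets, block_size, h=hash):
    """Partition tuples into buckets; each bucket is a list of written blocks."""
    buffers = [[] for _ in range(num_buckets)]    # one in-memory buffer per bucket
    disk = [[] for _ in range(num_buckets)]       # blocks already flushed, per bucket
    for block in blocks:                          # one input buffer: read block by block
        for t in block:
            b = h(t) % num_buckets
            buffers[b].append(t)
            if len(buffers[b]) == block_size:     # buffer full: write it out, start fresh
                disk[b].append(buffers[b])
                buffers[b] = []
    for b, buf in enumerate(buffers):             # final flush of non-empty buffers
        if buf:
            disk[b].append(buf)
    return disk


blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
buckets = hash_partition(blocks, num_buckets=3, block_size=2, h=lambda t: t)
```

With the identity hash used here, bucket `i` receives the values congruent to `i` modulo 3, split into blocks of at most two tuples.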
Duplicate Elimination Using Hashing
Compute $\delta(R)$:
Phase 1: Hash $R$ to $M-1$ buckets. Note that identical tuples hash to the same bucket.
Phase 2: Read one bucket at a time into main memory.
Use the one-pass duplicate elimination algorithm on the bucket.
Memory requirement:
Approx. size of each bucket: $B(R)/(M-1)$
For phase 2 to work, we require each bucket to fit in main memory: $B(R)/(M-1) \le M$
Thus, $B(R) \le M(M-1) \approx M^2$ [simplified]
I/O cost:
Phase 1 partitioning: $2B(R)$ [read $R$ once, write the buckets once]
Phase 2: $B(R)$
Total cost: $3B(R)$
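Both phases can be sketched in a few lines, assuming tuples are hashable Python values and buckets are in-memory lists rather than disk files. The name `hash_distinct` is illustrative.

```python
def hash_distinct(tuples, num_buckets):
    """Two-phase duplicate elimination: partition by hash, then dedupe each bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:                          # phase 1: identical tuples share a bucket
        buckets[hash(t) % num_buckets].append(t)
    out = []
    for bucket in buckets:                    # phase 2: one-pass elimination per bucket
        seen = set()
        for t in bucket:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
distinct = hash_distinct(rows, num_buckets=2)
```

The output order depends on the hash function, so only the set of returned tuples is meaningful.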
Hash based Union, Intersection, and Difference
Algorithm:
Phase 1: Hash $R$ and $S$ using the same hash function. Let $R_1, R_2, \ldots, R_{M-1}$ and $S_1, S_2, \ldots, S_{M-1}$ be the buckets of $R$ and $S$.
Phase 2: For each $i$, use a one-pass algorithm on the buckets $R_i$ and $S_i$ with the same hash value.
For example, for union, this can be done as follows:
Retrieve bucket $S_i$ into main memory blocks. Use a main-memory data structure to efficiently search for a tuple in $S_i$.
Copy all tuples of $S_i$ to the output.
Retrieve one block at a time from the bucket $R_i$ and do the following:
For each tuple $t$ of $R_i$, if it does not appear in $S_i$, then copy $t$ to output.
Otherwise, discard $t$.
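The union case can be sketched as follows, assuming $R$ and $S$ are duplicate-free lists of hashable values and that the second relation's bucket is the one held in memory. The name `hash_union` is illustrative, and buckets are plain lists rather than disk files.

```python
def hash_union(R, S, num_buckets):
    """Set union of two duplicate-free lists via hash partitioning."""
    r_buckets = [[] for _ in range(num_buckets)]
    s_buckets = [[] for _ in range(num_buckets)]
    for t in R:                                # phase 1: same hash function for both
        r_buckets[hash(t) % num_buckets].append(t)
    for t in S:
        s_buckets[hash(t) % num_buckets].append(t)
    out = []
    for r_bucket, s_bucket in zip(r_buckets, s_buckets):
        in_memory = set(s_bucket)              # search structure over the S bucket
        out.extend(s_bucket)                   # every tuple of the S bucket is output
        for t in r_bucket:                     # stream the matching R bucket
            if t not in in_memory:
                out.append(t)
    return out


union = hash_union([1, 2, 3], [3, 4], num_buckets=2)
```

Intersection and difference follow the same pattern, changing only the membership test and which tuples are emitted.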
Hash Based Union, Intersection, and Difference
Cost of set union, intersection, and difference:
I/O cost:
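Phase 1 for hashing: $2(B(R) + B(S))$
Phase 2 for one pass algorithm: $B(R) + B(S)$
Total cost = $3(B(R) + B(S))$
Memory requirement:
In phase 2, the smaller of the $R_i$ and $S_i$ buckets must fit in $M-1$ blocks.
The bucket sizes are roughly $B(R)/(M-1)$ and $B(S)/(M-1)$, and the smaller of the two should be $\le M-1$.
Thus the requirement is roughly: $\min(B(R), B(S)) \le M^2$.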
Hash Based Join
Algorithm for $R(X,Y) \bowtie S(Y,Z)$:
Algorithm is similar to the previous ones for set operations.
Phase 1: Hash $R$ and $S$ on the join attributes $Y$, using the same hash function, into $M-1$ buckets each.
Phase 2: Take each pair of buckets from $R$ and $S$ with the same hash value, and use the one-pass join algorithm.
Memory requirement:
Same as for set union, intersection, etc.
Thus, memory requirement is: $\min(B(R), B(S)) \le M^2$.
I/O cost:
Cost = $3(B(R) + B(S))$ [same as before for set union, intersection, etc.]
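A minimal sketch of the two phases, assuming relations are lists of tuples and the join attribute values are hashable. The name `hash_join` and the parameters `r_key`, `s_key`, and `num_buckets` are illustrative; buckets are in-memory lists rather than disk files.

```python
from collections import defaultdict


def hash_join(R, S, r_key, s_key, num_buckets):
    """Partition both relations on the join attribute, then join bucket pairs."""
    r_buckets = [[] for _ in range(num_buckets)]
    s_buckets = [[] for _ in range(num_buckets)]
    for r in R:                                    # phase 1: hash on the join attribute
        r_buckets[hash(r[r_key]) % num_buckets].append(r)
    for s in S:
        s_buckets[hash(s[s_key]) % num_buckets].append(s)
    out = []
    for r_bucket, s_bucket in zip(r_buckets, s_buckets):
        index = defaultdict(list)                  # phase 2: one-pass join per pair
        for s in s_bucket:
            index[s[s_key]].append(s)
        for r in r_bucket:
            for s in index.get(r[r_key], []):
                out.append(r + s[:s_key] + s[s_key + 1:])
    return out


R = [(1, "a"), (2, "b"), (3, "b")]     # R(X, Y)
S = [("b", 10), ("c", 20)]             # S(Y, Z)
joined = hash_join(R, S, r_key=1, s_key=0, num_buckets=4)
```

Tuples that join always share a join value and therefore land in buckets with the same index, so only matching bucket pairs need to be compared.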
Summary of Hash Based Algorithms
[Table: summary of costs for hash based algorithms]