Lecture 14: Hash-Based Indexing and Sorting (MHH, 18 Oct 2016)
Today’s Session:
DBMS Internals - Part V
Hash-based indexes and External Sorting
Announcements:
Mid-semester grades are out
PS3 is out. It is due on Nov 1st
Project 2 is due on Oct 27
DBMS Layers
Queries
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
(Alongside these layers: Transaction Manager, Lock Manager, Recovery Manager)
Outline
Extendible Hashing
Hash-Based Indexing
What indexing technique can we use to support range
searches (e.g., "Find s_name where gpa >= 3.0")?
Tree-Based Indexing
Static Hashing
Extendible Hashing
Static Hashing
A hash structure (or table or file) is a generalization of
the simpler notion of an ordinary array
In an array, an arbitrary position can be examined in O(1) time
Static Hashing
Extendible Hashing
Directory of Pointers
How else (as opposed to overflow pages) can we add a
data record to a full bucket in a static hash file?
Reorganize the table (e.g., by doubling the number of
buckets and redistributing the entries across the new
set of buckets)
But, reading and writing all pages is expensive!
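To make the cost concrete, here is a minimal sketch of reorganizing by doubling. It assumes one page per bucket, integer keys, and h(key) = key mod (number of buckets); the helper name `rehash_by_doubling` is hypothetical and not from the lecture.

```python
def rehash_by_doubling(buckets):
    """Double the number of buckets of a static hash file and redistribute.

    Assumptions (not from the lecture): one page per bucket, integer keys,
    and h(key) = key mod (number of buckets).
    Returns the new buckets and the number of page I/Os incurred.
    """
    new_count = 2 * len(buckets)
    new_buckets = [[] for _ in range(new_count)]
    pages_read = 0
    for bucket in buckets:              # every existing page must be read
        pages_read += 1
        for key in bucket:
            new_buckets[key % new_count].append(key)
    pages_written = new_count           # every new page must be written
    return new_buckets, pages_read + pages_written


old = [[4, 12, 32, 16], [1, 5, 21], [10], [15, 7, 19]]
new, io_cost = rehash_by_doubling(old)
```

Even for this tiny four-bucket file, every page is touched, which is the expense the directory-based approach tries to avoid.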
Example: search for 5* (5 = 101; the last two bits, 01, select the directory entry)
DIRECTORY -> DATA PAGES
00 -> Bucket A: 4* 12* 32* 16*
01 -> Bucket B: 1* 5* 21*
10 -> Bucket C: 10*
11 -> Bucket D: 15* 7* 19*
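The directory lookup can be sketched in a few lines. This assumes integer hash values and a directory indexed by the last `global_depth` bits, as in the figure; the function name `directory_slot` is a hypothetical choice.

```python
def directory_slot(key, global_depth):
    """Return the directory slot for an integer hash value:
    the number formed by its last `global_depth` bits."""
    return key & ((1 << global_depth) - 1)


slot = directory_slot(5, 2)       # 5 = 101, last two bits are 01
label = format(slot, "02b")       # slot rendered as a two-bit string
```

Masking the low-order bits is what lets the directory double later without rehashing: slot i and slot i + 2^d agree on the last d bits.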
Extendible Hashing: Inserting Entries
An entry can be inserted as follows:
Find the appropriate bucket (as in search)
Example: insert 13* (13 = 1101; the last two bits, 01, select Bucket B)
DIRECTORY -> DATA PAGES
00 -> Bucket A: 4* 12* 32* 16*
01 -> Bucket B: 1* 5* 21* 13*
10 -> Bucket C: 10*
11 -> Bucket D: 15* 7* 19*
Extendible Hashing: Inserting Entries
Find the appropriate bucket (as in search), split the bucket
if full, double the directory if necessary and insert the
given entry
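Example: insert 20* (20 = 10100; the last two bits, 00, select Bucket A, which is full)
DIRECTORY -> DATA PAGES
00 -> Bucket A: 4* 12* 32* 16*
01 -> Bucket B: 1* 5* 21* 13*
10 -> Bucket C: 10*
11 -> Bucket D: 15* 7* 19*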
Extendible Hashing: Inserting Entries
Find the appropriate bucket (as in search), split the bucket
if full, double the directory if necessary and insert the
given entry
Example: insert 20* (20 = 10100)
GLOBAL DEPTH 2
DIRECTORY -> DATA PAGES
00 -> Bucket A: 32* 16*
01 -> Bucket B: 1* 5* 21* 13*
10 -> Bucket C: 10*
11 -> Bucket D: 15* 7* 19*
Bucket A2 (`split image' of Bucket A): 4* 12* 20*
Is this enough?
Extendible Hashing: Inserting Entries
Find the appropriate bucket (as in search), split the bucket
if full, double the directory if necessary and insert the
given entry
Example: insert 20* (20 = 10100)
After splitting Bucket A (GLOBAL DEPTH 2):
00 -> Bucket A: 32* 16*
01 -> Bucket B: 1* 5* 21* 13*
10 -> Bucket C: 10*
11 -> Bucket D: 15* 7* 19*
After doubling the directory (GLOBAL DEPTH 3):
000 -> Bucket A: 32* 16*
001 -> Bucket B: 1* 5* 21* 13*
010 -> Bucket C: 10*
011 -> Bucket D: 15* 7* 19*
100 -> Bucket A2 (`split image' of A): 4* 12* 20*
101 -> Bucket B
110 -> Bucket C
111 -> Bucket D
The last two bits (00) indicate a data entry that belongs to one of these two buckets (A or A2)
The third bit distinguishes between these two buckets!
Example: insert 9* (9 = 1001)
GLOBAL DEPTH 3
000 -> Bucket A: 32* 16*
001 -> Bucket B: 1* 9*
010 -> Bucket C: 10*
011 -> Bucket D: 15* 7* 19*
100 -> Bucket A2 (`split image' of A): 4* 12* 20*
101 -> Bucket B2 (`split image' of B): 5* 21* 13*
110 -> Bucket C
111 -> Bucket D
Almost there…
Extendible Hashing: Inserting Entries
Find the appropriate bucket (as in search), split the bucket
if full, double the directory if necessary and insert the
given entry
Example: insert 9* (9 = 1001)
GLOBAL DEPTH 3
DIRECTORY -> DATA PAGES
000 -> Bucket A: 32* 16*
001 -> Bucket B: 1* 9*
010 -> Bucket C: 10*
011 -> Bucket D: 15* 7* 19*
100 -> Bucket A2 (`split image' of A): 4* 12* 20*
101 -> Bucket B2 (`split image' of B): 5* 21* 13*
110 -> Bucket C
111 -> Bucket D
There was no need to double the directory!
When NOT to double the directory?
Extendible Hashing: Inserting Entries
Find the appropriate bucket (as in search), split the bucket
if full, double the directory if necessary and insert the
given entry
Example: insert 9* (9 = 1001)
GLOBAL DEPTH 3
DIRECTORY -> DATA PAGES
000 -> Bucket A (LOCAL DEPTH 3): 32* 16*
001 -> Bucket B (LOCAL DEPTH 3): 1* 9*
010 -> Bucket C (LOCAL DEPTH 2): 10*
011 -> Bucket D (LOCAL DEPTH 2): 15* 7* 19*
100 -> Bucket A2 (`split image' of A, LOCAL DEPTH 3): 4* 12* 20*
101 -> Bucket B2 (`split image' of B, LOCAL DEPTH 3): 5* 21* 13*
110 -> Bucket C
111 -> Bucket D
If a bucket whose local depth equals the global depth is split, the directory must be doubled
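The insertion procedure, including the local-depth test, can be sketched as follows. This is a minimal in-memory sketch, not the lecture's implementation: it assumes integer keys that serve as their own hash values, a bucket capacity of 4, and hypothetical class names `Bucket` and `ExtendibleHash`. It replays the running example (insert 13*, 20*, then 9*).

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, key):
        # The last `global_depth` bits of the key select the directory slot.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._slot(key)].entries

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        # Note: many keys sharing the same low-order bits would keep splitting;
        # a real system would fall back to overflow pages (not modeled here).
        while len(bucket.entries) >= self.capacity:
            if bucket.local_depth == self.global_depth:
                # Double the directory: slot i + 2^d starts as a copy of slot i.
                self.directory = self.directory + self.directory
                self.global_depth += 1
            self._split(bucket)
            bucket = self.directory[self._slot(key)]
        bucket.entries.append(key)

    def _split(self, bucket):
        bit = 1 << bucket.local_depth        # the newly significant bit
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)   # the split image
        old_entries = bucket.entries
        bucket.entries = [k for k in old_entries if not k & bit]
        image.entries = [k for k in old_entries if k & bit]
        # Redirect the slots whose new bit is 1 to the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image


index = ExtendibleHash(global_depth=2, capacity=4)
for key in [4, 12, 32, 16, 1, 5, 21, 10, 15, 7, 19]:   # initial data pages
    index.insert(key)
for key in [13, 20, 9]:                                # the three examples
    index.insert(key)
```

Inserting 20* finds Bucket A full with local depth equal to the global depth, so the directory doubles; inserting 9* finds Bucket B full with local depth 2 and global depth 3, so only the bucket splits.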
Extendible Hashing: Inserting Entries
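Example: insert 9* (9 = 1001)
Repeat…
Before the insertion (GLOBAL DEPTH 3):
000 -> Bucket A (LOCAL DEPTH 3): 32* 16*
001 -> Bucket B (LOCAL DEPTH 2): 1* 5* 21* 13*  FULL, hence, split!
010 -> Bucket C (LOCAL DEPTH 2): 10*
After the split (GLOBAL DEPTH 3):
000 -> Bucket A (LOCAL DEPTH 3): 32* 16*
001 -> Bucket B (LOCAL DEPTH 3): 1* 9*
010 -> Bucket C (LOCAL DEPTH 2): 10*
011 -> Bucket D (LOCAL DEPTH 2): 15* 7* 19*
100 -> Bucket A2 (`split image' of A, LOCAL DEPTH 3): 4* 12* 20*
101 -> Bucket B2 (`split image' of B, LOCAL DEPTH 3): 5* 21* 13*
110 -> Bucket C
111 -> Bucket D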
Extendible Hashing: Inserting Entries
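Example: insert 9* (9 = 1001)
Repeat…
GLOBAL DEPTH 3
DIRECTORY -> DATA PAGES
000 -> Bucket A (LOCAL DEPTH 3): 32* 16*
001 -> Bucket B (LOCAL DEPTH 3): 1* 9*
010 -> Bucket C (LOCAL DEPTH 2): 10*
011 -> Bucket D (LOCAL DEPTH 2): 15* 7* 19*
100 -> Bucket A2 (`split image' of A, LOCAL DEPTH 3): 4* 12* 20*
101 -> Bucket B2 (`split image' of B, LOCAL DEPTH 3): 5* 21* 13*
110 -> Bucket C
111 -> Bucket D
FINAL STATE!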
Extendible Hashing: Inserting Entries
Example: insert 20* (20 = 10100)
Repeat…
GLOBAL DEPTH 2
DIRECTORY -> DATA PAGES
00 -> Bucket A (LOCAL DEPTH 2): 4* 12* 32* 16*  FULL, hence, split!
01 -> Bucket B (LOCAL DEPTH 2): 1* 5* 21* 13*
10 -> Bucket C (LOCAL DEPTH 2): 10*
11 -> Bucket D (LOCAL DEPTH 2): 15* 7* 19*
Because the local depth and the global depth are both 2, we should double the directory!
Extendible Hashing: Inserting Entries
Example: insert 20* (20 = 10100)
Repeat…
GLOBAL DEPTH 2
DIRECTORY -> DATA PAGES
00 -> Bucket A (LOCAL DEPTH 2): 32* 16*
01 -> Bucket B (LOCAL DEPTH 2): 1* 5* 21* 13*
10 -> Bucket C (LOCAL DEPTH 2): 10*
11 -> Bucket D (LOCAL DEPTH 2): 15* 7* 19*
Bucket A2 (`split image' of Bucket A, LOCAL DEPTH 2): 4* 12* 20*
Is this enough?
Extendible Hashing: Inserting Entries
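Example: insert 20* (20 = 10100)
Repeat…
GLOBAL DEPTH 3
DIRECTORY -> DATA PAGES
000 -> Bucket A (LOCAL DEPTH 3): 32* 16*
001 -> Bucket B (LOCAL DEPTH 2): 1* 5* 21* 13*
010 -> Bucket C (LOCAL DEPTH 2): 10*
011 -> Bucket D (LOCAL DEPTH 2): 15* 7* 19*
100 -> Bucket A2 (`split image' of A, LOCAL DEPTH 3): 4* 12* 20*
101 -> Bucket B
110 -> Bucket C
111 -> Bucket D
FINAL STATE!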
Level = 0
h1    h0    PRIMARY BUCKET PAGES
000   00    32* 44* 36*
001   01    9* 25* 5*    (data entry r with h(r) = 5; 5 = 101, so h0(r) = 01)
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*
Linear Hashing: Inserting Entries
Find bucket as in search
If the bucket to insert the data entry into is full:
Add an overflow page and insert data entry
(Maybe) Split Next bucket and increment Next
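The steps above can be sketched as follows. This is a minimal in-memory sketch under stated assumptions: integer keys, h_i(key) = key mod (2^i × N), primary pages holding 4 entries, and a split triggered by every insert into a full bucket (the slide's "(Maybe)" leaves the policy open). The class name `LinearHash` and its attribute names are hypothetical.

```python
class LinearHash:
    def __init__(self, n_initial=4, capacity=4):
        self.n_initial = n_initial      # N: buckets at Level 0
        self.capacity = capacity        # entries per primary page
        self.level = 0
        self.next = 0                   # next bucket to split
        # Each list holds a bucket's entries; anything beyond `capacity`
        # conceptually lives on overflow pages.
        self.buckets = [[] for _ in range(n_initial)]

    def _bucket_no(self, key):
        n_level = self.n_initial * 2 ** self.level
        b = key % n_level               # h_Level
        if b < self.next:               # bucket already split in this round
            b = key % (2 * n_level)     # h_(Level+1)
        return b

    def insert(self, key):
        b = self._bucket_no(key)
        was_full = len(self.buckets[b]) >= self.capacity
        self.buckets[b].append(key)     # goes to an overflow page if full
        if was_full:
            self._split_next()

    def _split_next(self):
        n_level = self.n_initial * 2 ** self.level
        old = self.buckets[self.next]
        stay = [k for k in old if k % (2 * n_level) == self.next]
        move = [k for k in old if k % (2 * n_level) != self.next]
        self.buckets[self.next] = stay
        self.buckets.append(move)       # new bucket number: Next + n_level
        self.next += 1
        if self.next == n_level:        # round finished: all buckets split
            self.level += 1
            self.next = 0


lh = LinearHash(n_initial=4, capacity=4)
lh.buckets = [[32, 44, 36], [9, 25, 5], [14, 18, 10, 30], [31, 35, 7, 11]]
lh.insert(43)                           # bucket 11 is full: overflow + split
```

Note that the bucket that overflows (11) is not the bucket that is split (Next = 0); splits proceed round-robin regardless of where the overflow happened.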
h1    h0    PRIMARY BUCKET PAGES
000   00    32* 44* 36*
001   01    9* 25* 5*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*
Add an overflow page and insert data entry
Linear Hashing: Inserting Entries
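Example: insert 43* (43 = 101011, so h0 = 11)
Before the insertion:
h1    h0    PRIMARY PAGES
000   00    32* 44* 36*
001   01    9* 25* 5*
010   10    14* 18* 10* 30*
Add an overflow page and insert data entry
After splitting bucket 00 (Next=1):
h1    h0    PRIMARY PAGES
000   00    32*
001   01    9* 25* 5*
100   00    44* 36*
After later splits of buckets 01 and 10:
h1    h0    PRIMARY PAGES
001   01    9* 25*
100   00    44* 36*
101   01    5* 37* 29*
110   10    14* 30* 22*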
Linear Hashing: Inserting Entries
Another Example: insert 50*
Level=0, N=4
50* = 110010, so h0 = 10
h1    h0    PRIMARY PAGES    OVERFLOW PAGES
000   00    32*
001   01    9* 25*
100   00    44* 36*
111   11    31* 7*
Linear Hashing: Inserting Entries
Another Example: insert 50*
Level=0
Next=0
50* = 110010, so h0 = 10
h1    h0    PRIMARY PAGES    OVERFLOW PAGES
000   00    32*
001   01    9* 25*
111   11    31* 7*
Linear Hashing: Inserting Entries
Another Example: insert 50*
Level=1
Next=0
50* = 110010, so h0 = 10
h1    h0    PRIMARY PAGES    OVERFLOW PAGES
000   00    32*
001   01    9* 25*
111   11    31* 7*
Linear Hashing: Deleting Entries
Deletion is essentially the inverse of insertion
Outline
Linear Hashing
Why Sorting?
In-Memory vs. External Sorting
Why Sorting?
A Simple Two-Way Merge Sort
Algorithm:
Pass 0: Read a page into memory, sort it, write it
1-page runs are produced
How many buffer pages are needed? ONE
Passes 1, 2, etc.: Merge pairs (hence, 2-way) of runs to
produce longer runs until only one run is left
How many buffer pages are needed? THREE
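The algorithm can be sketched as follows. This is a simplified in-memory model, not a real external sort: pages are Python lists, "reading" and "writing" are only counted, and the merge uses `heapq.merge` on whole runs rather than three buffer pages. The function name and the `page_size` parameter are assumptions for illustration.

```python
import heapq


def two_way_merge_sort(pages, page_size=2):
    """Sort a file given as a list of pages; return (sorted pages, passes, I/Os)."""
    if not pages:
        return [], 0, 0
    io = 0
    runs = []
    for page in pages:                   # sort each page: 1-page runs
        io += 2                          # one page read + one page written
        runs.append([sorted(page)])
    passes = 1
    while len(runs) > 1:                 # merge pairs of runs
        next_runs = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]         # the last run may have no partner
            io += sum(len(run) for run in pair)          # pages read
            streams = [[rec for page in run for rec in page] for run in pair]
            merged = list(heapq.merge(*streams))
            new_run = [merged[j:j + page_size]
                       for j in range(0, len(merged), page_size)]
            io += len(new_run)                           # pages written
            next_runs.append(new_run)
        runs = next_runs
        passes += 1
    return runs[0], passes, io


input_file = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_pages, n_passes, n_io = two_way_merge_sort(input_file)
```

Each pass reads and writes every page once, so counting the seven pages of this input gives 2 × 7 I/Os per pass.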
[Figure: three main-memory buffer pages; INPUT 1 and INPUT 2 are merged into OUTPUT]
2-Way Merge Sort: I/O Cost Analysis
If the number of pages in the input file is $2^k$:
How many runs are produced in pass 0 and of what size?
$2^k$ 1-page runs
How many runs are produced in pass 1 and of what size?
$2^{k-1}$ 2-page runs
How many runs are produced in pass 2 and of what size?
$2^{k-2}$ 4-page runs
How many runs are produced in pass k and of what size?
$2^{k-k}$ $2^k$-page runs (or 1 run of size $2^k$)
For N pages, how many passes are incurred?
$\lceil \log_2 N \rceil + 1$
How many pages do we read and write in each pass?
$2N$
What is the overall cost?
$2N(\lceil \log_2 N \rceil + 1)$
2-Way Merge Sort: An Example
Input File:    3,4  6,2  9,4  8,7  5,6  3,1  2
PASS 0
1-Page Runs:   [3,4] [2,6] [4,9] [7,8] [5,6] [1,3] [2]
PASS 1
2-Page Runs:   [2,3 4,6] [4,7 8,9] [1,3 5,6] [2]
PASS 2
4-Page Runs:   [2,3 4,4 6,7 8,9] [1,2 3,5 6]
PASS 3
8-Page Runs:   [1,2 2,3 3,4 4,5 6,6 7,8 9]
$\lceil \log_2 8 \rceil + 1 = 4$ passes
Formula Check:
$2N(\lceil \log_2 N \rceil + 1) = (2 \times 8) \times (3 + 1) = 64$ I/Os
Correct!
Outline
Linear Hashing
Why Sorting?
[Figure: B main memory buffers; INPUT 1, INPUT 2, ..., INPUT B-1 are read from disk and merged into OUTPUT, which is written back to disk]
B-Way Merge Sort: I/O Cost Analysis
I/O cost = 2N × Number of passes
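The number of passes can be computed with a short helper. This sketch assumes the usual B-way scheme, in which pass 0 uses all B buffers to produce runs of B pages and every later pass merges B-1 runs at a time; the function name `external_sort_cost` is hypothetical.

```python
def external_sort_cost(n_pages, n_buffers):
    """Return (number of passes, total page I/Os) for a B-way external sort.

    Assumes pass 0 creates ceil(N / B) sorted runs and each merge pass
    combines up to B - 1 runs, with one buffer reserved for output.
    """
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    runs = -(-n_pages // n_buffers)          # ceil(N / B) runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (n_buffers - 1))   # (B-1)-way merge
        passes += 1
    return passes, 2 * n_pages * passes


passes, cost = external_sort_cost(n_pages=108, n_buffers=5)
```

With 108 pages and 5 buffers, pass 0 leaves 22 runs, which shrink to 6, then 2, then 1, so four passes are needed in total.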
Why Sorting?
CURRENT SET: 12, 4, 8, 10
OUTPUT: 3, 5
IDEA: Pick the tuple in the current set with the smallest value that is greater than the
largest value in the output buffer and append it to the output buffer
Replacement Sort
With a more aggressive implementation of B-way sort,
we can write out runs of 2×B (on average) internally
sorted pages
This is referred to as replacement sort
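A minimal sketch of the idea, assuming records rather than pages and a heap for the current set; the function name `replacement_selection` and the `capacity` parameter (number of records that fit in memory) are assumptions, not the lecture's notation. A record smaller than the last one written cannot join the current run, so it is set aside for the next run.

```python
import heapq
from itertools import islice

_DONE = object()   # sentinel marking the end of the input


def replacement_selection(records, capacity):
    """Split `records` into sorted runs using at most `capacity` records of memory."""
    source = iter(records)
    current = list(islice(source, capacity))   # the current set
    heapq.heapify(current)
    deferred = []                              # records saved for the next run
    runs, run = [], []
    while current:
        smallest = heapq.heappop(current)      # smallest value that can still
        run.append(smallest)                   # extend the current run
        incoming = next(source, _DONE)
        if incoming is not _DONE:
            if incoming >= smallest:
                heapq.heappush(current, incoming)
            else:
                deferred.append(incoming)      # too small for this run
        if not current:                        # current run is finished
            runs.append(run)
            run = []
            current, deferred = deferred, []
            heapq.heapify(current)
    return runs


runs = replacement_selection([12, 4, 8, 10, 3, 5, 2, 9, 1, 7], capacity=3)
```

With room for only three records, the first run already contains four, which illustrates how runs can grow beyond the memory size.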
[Figure: INPUT 1, INPUT 2, ..., INPUT 5 are read from disk and merged into OUTPUT, which is written back to disk]
Blocked I/O
Normally, we go with B buffers of size (say) 1 page
INSTEAD: let us go with B/b buffers, each of size b pages
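The effect on the merge can be sketched with two small helpers. This assumes one of the B/b blocks is reserved for output, so the merge fan-in is floor(B/b) - 1; the function names `merge_fan_in` and `merge_passes` are hypothetical.

```python
def merge_fan_in(n_buffers, block_size):
    """Runs that can be merged at once with B pages grouped into b-page blocks.

    Assumes one block is reserved for output.
    """
    return n_buffers // block_size - 1


def merge_passes(n_runs, fan_in):
    """Number of merge passes needed to reduce `n_runs` runs to a single run."""
    if fan_in < 2:
        raise ValueError("need to merge at least 2 runs at a time")
    passes = 0
    while n_runs > 1:
        n_runs = -(-n_runs // fan_in)   # ceil(n_runs / fan_in)
        passes += 1
    return passes


single_page = merge_passes(100, merge_fan_in(100, 1))    # b = 1
blocked = merge_passes(100, merge_fan_in(100, 10))       # b = 10
```

Larger blocks mean fewer, longer sequential transfers per pass, at the price of a smaller fan-in and possibly more passes.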
[Figure: 3 main memory buffers; INPUT 1 and INPUT 2 are read from disk and merged into OUTPUT, which is written back to disk]
[Figure: B main memory buffers; INPUT 1, INPUT 2, ..., INPUT B-1 are read from disk and merged into OUTPUT, which is written back to disk]
Double Buffering
INSTEAD: pre-fetch INPUT 1' into a `shadow block'
When INPUT 1 is exhausted, issue a `read' to refill it
BUT, also proceed with INPUT 1' in the meantime
Thus, the CPU does not have to sit idle waiting for the read!
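The pre-fetching idea can be sketched with a background thread that reads the next block while the current one is being processed. This is only an illustration: `read_block` stands in for a disk read and `fake_read` is a hypothetical stand-in used for the demo.

```python
from concurrent.futures import ThreadPoolExecutor


def read_double_buffered(read_block, n_blocks):
    """Yield blocks 0..n_blocks-1, pre-fetching block i+1 while block i is in use."""
    if n_blocks <= 0:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_block, 0)              # fill the first block
        for i in range(n_blocks):
            block = pending.result()                      # wait only if not ready
            if i + 1 < n_blocks:
                pending = pool.submit(read_block, i + 1)  # fill the shadow block
            yield block                                   # caller processes this block


def fake_read(i):
    # Hypothetical stand-in for reading block i from disk.
    return [2 * i, 2 * i + 1]


total = 0
for block in read_double_buffered(fake_read, 3):
    total += sum(block)                                   # the "CPU work" on a block
```

While the loop body runs, the worker thread is already fetching the next block, which mirrors the roles of INPUT 1 and INPUT 1'.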
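[Figure: each of INPUT 1, INPUT 2, ..., INPUT k and OUTPUT is paired with a shadow block (INPUT 1', INPUT 2', ..., INPUT k', OUTPUT') of block size b; data flows from disk through the input blocks to the output blocks and back to disk]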
Query Optimization and Execution
Continue…
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
(Alongside these layers: Transaction Manager, Lock Manager, Recovery Manager)