DBMS-Indexing
Basic Concepts
Ordered Indices
B+-Tree Index Files
B-Tree Index Files
Hashing
Write-optimized indices
Spatio-Temporal Indexing
1/18/2021 1
Basic Concepts
Index entries have the form (search-key, pointer)
Index files are typically much smaller than the original file
Two basic kinds of indices:
• Ordered indices: search keys are stored in sorted order
• Hash indices: search keys are distributed uniformly across “buckets” using a “hash
function”.
Index Evaluation Metrics
Ordered Indices
In an ordered index, index entries are stored sorted on the search key value.
Clustering index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file.
• Also called primary index
• The search key of a primary index is usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order different from the
sequential order of the file. Also called nonclustering index.
Index-sequential file: sequential file ordered on a search key, with a clustering index on the
search key.
Dense Index Files
Dense index — Index record appears for every search-key value in the file.
E.g. index on ID attribute of instructor relation
Dense Index Files (Cont.)
Sparse Index Files
Sparse Index: contains index records for only some search-key values.
• Applicable when records are sequentially ordered on search-key
To locate a record with search-key value K we:
• Find index record with largest search-key value ≤ K
• Search file sequentially starting at the record to which the index record points
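The two lookup steps above can be sketched in Python. This is a minimal illustration, not a real storage engine: the `blocks` list, the one-index-entry-per-block layout, and the name `sparse_lookup` are all assumptions made for the example. The index step picks the last entry whose key does not exceed K, then the file is scanned sequentially from there.

```python
from bisect import bisect_right

# Hypothetical data file: records sorted on the search key, grouped into blocks.
blocks = [
    [(10101, "Srinivasan"), (12121, "Wu"), (15151, "Mozart")],
    [(22222, "Einstein"), (32343, "El Said"), (33456, "Gold")],
    [(45565, "Katz"), (58583, "Califieri"), (76543, "Singh")],
]

# Sparse index (assumed layout): one (search-key, block number) entry per block.
sparse_index = [(block[0][0], i) for i, block in enumerate(blocks)]
index_keys = [key for key, _ in sparse_index]


def sparse_lookup(key):
    """Return the record with the given search key, or None if absent."""
    # Last index entry whose search key does not exceed the target.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # target is smaller than every indexed key
    # Scan the file sequentially, starting at the block the entry points to.
    for block in blocks[sparse_index[pos][1]:]:
        for record in block:
            if record[0] == key:
                return record
            if record[0] > key:
                return None  # passed the position where the key would be
    return None
```

For example, `sparse_lookup(32343)` consults the index entry for 22222 and then scans the second block until it reaches the record for El Said.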
Sparse Index Files (Cont.)
• For unclustered index: sparse index on top of dense index (multilevel index)
Secondary Indices Example
Index record points to a bucket that contains pointers to all the actual records with that particular search-
key value.
Secondary indices have to be dense
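A small Python sketch of the bucket idea, assuming an in-memory list stands in for the data file and list positions stand in for record pointers. The records, the choice of salary as the secondary search key, and the function name `lookup_by_salary` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical file, stored in ID order (so salary order is unrelated to file order).
records = [
    (10101, "Srinivasan", 65000),
    (12121, "Wu", 90000),
    (15151, "Mozart", 40000),
    (76543, "Singh", 80000),
    (98345, "Kim", 80000),
]

# Dense secondary index on salary: every salary value maps to a bucket
# holding pointers (here: list positions) to all records with that value.
buckets = defaultdict(list)
for position, record in enumerate(records):
    buckets[record[2]].append(position)


def lookup_by_salary(salary):
    """Follow the bucket's pointers to fetch every matching record."""
    return [records[position] for position in buckets.get(salary, [])]
```

Because the file is not sorted on salary, an index entry is needed for every salary value; a sparse index could not locate the values it skips.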
Multilevel Index
Multilevel Index (Cont.)
Indices on Multiple Keys
Example of B+-Tree
B+-Tree Index Files (Cont.)
B+-Tree Node Structure
Typical node
Leaf Nodes in B+-Trees
Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
• All the search-keys in the subtree to which P1 points are less than K1
• For 2 ≤ i ≤ m – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
• All the search-keys in the subtree to which Pm points have values greater than or equal to Km–1
• General structure
Example of B+-tree
Observations about B+-trees
Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
• Level below root has at least 2 * ⌈n/2⌉ values
• Next level has at least 2 * ⌈n/2⌉ * ⌈n/2⌉ values
• ... etc.
• If there are K search-key values in the file, the tree height is no more than ⌈log_⌈n/2⌉(K)⌉
• Thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as
we shall see).
Queries on B+-Trees
function find(v)
    Set C = root
    while (C is not a leaf node)
        Let i be the least number such that v ≤ C.Ki
        if there is no such number i then
            Set C = last non-null pointer in C
        else if (v = C.Ki) then Set C = C.P(i+1)
        else Set C = C.Pi
    /* C is now a leaf node */
    if for some i, C.Ki = v then return C.Pi
    else return null /* no record with search-key value v exists */
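The find procedure can be written as a short runnable Python sketch. The `Node` class and its field names are assumptions made for illustration; the leaf's pointer to the next leaf is not modelled, and record pointers are represented by plain strings.

```python
from bisect import bisect_left


class Node:
    """Hypothetical in-memory B+-tree node (field names are illustrative)."""

    def __init__(self, keys, pointers, leaf):
        self.keys = keys          # sorted search-key values
        self.pointers = pointers  # children (non-leaf) or record pointers (leaf)
        self.leaf = leaf


def find(root, v):
    """Return the record pointer for search-key value v, or None."""
    c = root
    while not c.leaf:
        i = bisect_left(c.keys, v)      # least i such that v <= keys[i]
        if i == len(c.keys):
            c = c.pointers[-1]          # no such i: follow the last pointer
        elif c.keys[i] == v:
            c = c.pointers[i + 1]       # equal keys live in the right subtree
        else:
            c = c.pointers[i]
    i = bisect_left(c.keys, v)
    if i < len(c.keys) and c.keys[i] == v:
        return c.pointers[i]
    return None                         # no record with search-key value v


# A small two-level tree for illustration.
leaf1 = Node(["Adams", "Brandt"], ["rec-Adams", "rec-Brandt"], leaf=True)
leaf2 = Node(["Califieri", "Crick"], ["rec-Califieri", "rec-Crick"], leaf=True)
leaf3 = Node(["Einstein", "Gold"], ["rec-Einstein", "rec-Gold"], leaf=True)
root = Node(["Califieri", "Einstein"], [leaf1, leaf2, leaf3], leaf=False)
```

Using binary search inside a node is a design choice of this sketch; a linear scan over the keys of a node follows the same logic.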
Queries on B+-Trees (Cont.)
Range queries find all records with search key values in a given range
• See book for details of function findRange(lb, ub) which returns set of all such records
• Real implementations usually provide an iterator interface to fetch matching records one at a time, using a next()
function
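One way to picture the iterator interface is a Python generator that walks the chain of leaf nodes. This is a sketch under assumptions, not the book's findRange: the `Leaf` class, its `next_leaf` field, and the `start_leaf` argument (the leaf a real implementation would locate by descending from the root with lb) are all hypothetical names.

```python
class Leaf:
    """Hypothetical leaf node with a pointer to the next leaf in key order."""

    def __init__(self, keys, records, next_leaf=None):
        self.keys = keys
        self.records = records
        self.next_leaf = next_leaf


def find_range(start_leaf, lb, ub):
    """Yield, one at a time, records whose search key k satisfies lb <= k <= ub."""
    leaf = start_leaf
    while leaf is not None:
        for key, record in zip(leaf.keys, leaf.records):
            if key > ub:
                return          # keys are sorted, so nothing further can match
            if key >= lb:
                yield record
        leaf = leaf.next_leaf   # follow the leaf chain


# Three chained leaves for illustration.
l3 = Leaf([50, 60], ["r50", "r60"])
l2 = Leaf([30, 40], ["r30", "r40"], l3)
l1 = Leaf([10, 20], ["r10", "r20"], l2)
```

Calling `next()` on `find_range(l1, 20, 45)` fetches one matching record per call, so a caller that needs only the first few matches never touches later leaves.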
Queries on B+-Trees (Cont.)
If there are K search-key values in the file, the height of the tree is no more than ⌈log_⌈n/2⌉(K)⌉.
A node is generally the same size as a disk block, typically 4 kilobytes
• and n is typically around 100 (40 bytes per index entry).
With 1 million search key values and n = 100
• at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup traversal from root to leaf.
Contrast this with a balanced binary tree with 1 million search key values — around 20 nodes are accessed in a lookup
• above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds
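The arithmetic behind this comparison can be checked with a few lines of Python. The helper name `levels` is an assumption; it uses integer arithmetic to find the smallest h with branching^h ≥ K, which equals the ceiling of the logarithm without floating-point rounding concerns.

```python
def levels(num_keys, branching):
    """Smallest h such that branching**h >= num_keys."""
    if branching < 2:
        raise ValueError("branching factor must be at least 2")
    height, reach = 0, 1
    while reach < num_keys:
        reach *= branching
        height += 1
    return height


K = 1_000_000
n = 100
min_fanout = (n + 1) // 2              # ceil(n/2) = 50 for n = 100

bplus_nodes = levels(K, min_fanout)    # B+-tree with n = 100
binary_nodes = levels(K, 2)            # balanced binary tree

# Worst-case lookup time if every node access costs one 20 ms disk I/O.
bplus_ms = bplus_nodes * 20
binary_ms = binary_nodes * 20
```

Here `bplus_nodes` is 4 and `binary_nodes` is 20, so under the 20 ms per I/O figure above a cold lookup costs about 80 ms versus about 400 ms.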
Non-Unique Keys
If a search key ai is not unique, create instead an index on a composite key (ai , Ap), which is unique
• Ap could be a primary key, record ID, or any other attribute that guarantees uniqueness
Search for ai = v can be implemented by a range search on the composite key, with range (v, −∞) to (v, +∞)
But more I/O operations are needed to fetch the actual records
• If the index is clustering, all accesses are sequential
• If the index is non-clustering, each record access may need an I/O operation
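The composite-key range search can be illustrated with a sorted Python list standing in for the leaf level of the index. The data, the choice of (dept_name, ID) as the composite key, and the function name `search` are assumptions; the sketch also assumes the uniquifier Ap is numeric so that float infinities can serve as −∞ and +∞.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite keys (ai, Ap) = (dept_name, ID), kept in sorted order.
entries = sorted([
    ("Comp. Sci.", 10101),
    ("Comp. Sci.", 45565),
    ("Comp. Sci.", 83821),
    ("Finance", 12121),
    ("Finance", 76543),
    ("Physics", 22222),
    ("Physics", 33456),
])


def search(v):
    """Return all Ap values whose composite key lies in (v, -inf) .. (v, +inf)."""
    lo = bisect_left(entries, (v, float("-inf")))
    hi = bisect_right(entries, (v, float("inf")))
    return [ap for _, ap in entries[lo:hi]]
```

`search("Finance")` returns the IDs 12121 and 76543; fetching the actual records behind those IDs is the extra I/O the slide refers to.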
Updates on B+-Trees: Insertion
Updates on B+-Trees: Insertion (Cont.)
Result of splitting node containing Brandt, Califieri and Crick on inserting Adams
Next step: insert entry with (Califieri, pointer-to-new-node) into parent
B+-Tree Insertion
Affected nodes
Splitting a non-leaf node: when inserting (k,p) into an already full internal node N
• Copy N to an in-memory area M with space for n+1 pointers and n keys
• Insert (k,p) into M
• Copy P_1, K_1, …, K_(⌈n/2⌉−1), P_⌈n/2⌉ from M back into node N
• Copy P_(⌈n/2⌉+1), K_(⌈n/2⌉+1), …, K_n, P_(n+1) from M into newly allocated node N'
• Insert (K_⌈n/2⌉, N') into the parent of N
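The split steps can be sketched as a Python function operating on the in-memory area M after (k,p) has been inserted. The function name `split_internal` and the list-based representation of keys and pointers are assumptions for illustration; the split point follows the ⌈n/2⌉ convention used above.

```python
def split_internal(keys, pointers, n):
    """Split the contents of M (n keys, n+1 pointers) into N, a separator, and N'.

    Returns (left_keys, left_ptrs, separator, right_keys, right_ptrs).
    """
    if len(keys) != n or len(pointers) != n + 1:
        raise ValueError("M must hold exactly n keys and n+1 pointers")
    h = (n + 1) // 2                   # ceil(n/2) for integer n
    # N keeps P_1, K_1, ..., K_(h-1), P_h
    left_keys, left_ptrs = keys[:h - 1], pointers[:h]
    # K_h is not stored in either node; it moves up to the parent of N
    separator = keys[h - 1]
    # N' receives P_(h+1), K_(h+1), ..., K_n, P_(n+1)
    right_keys, right_ptrs = keys[h:], pointers[h:]
    return left_keys, left_ptrs, separator, right_keys, right_ptrs


# Example with n = 4: M holds 4 keys and 5 pointers after the insertion.
result = split_internal(
    ["Califieri", "Einstein", "Gold", "Mozart"],
    ["p1", "p2", "p3", "p4", "p5"],
    n=4,
)
```

In this example N keeps one key and two pointers, N' receives two keys and three pointers, and the separator "Einstein" is what gets inserted, together with a pointer to N', into the parent.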
Example
Examples of B+-Tree Deletion
Affected nodes
Leaf containing Singh and Wu became underfull, and borrowed a value Kim from its left sibling
Search-key value in the parent changes as a result
Example of B+-tree Deletion (Cont.)
Node with Gold and Katz became underfull, and was merged with its sibling
Parent node becomes underfull, and is merged with its sibling
• Value separating two nodes (at the parent) is pulled down when merging
Root node then has only one child, and is deleted
Updates on B+-Trees: Deletion
Assume record already deleted from file. Let V be the search key value of the record, and Pr be the pointer to the record.
Remove (Pr, V) from the leaf node
If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then
merge siblings:
• Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
• Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above
procedure.
Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a
single node, then redistribute pointers:
• Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.
• Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
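The merge-or-redistribute decision for a pair of adjacent leaves can be sketched in Python. Several things here are assumptions for illustration: the function name `rebalance_leaves`, the convention that a leaf holds at most n − 1 keys, and the choice to split the combined entries evenly when redistributing (a simplification; borrowing a single entry from the sibling is another common choice).

```python
def rebalance_leaves(left, right, n):
    """Decide how to repair two adjacent leaves, one of which is underfull.

    left, right: sorted key lists of the two sibling leaves.
    Returns ("merge", keys) or ("redistribute", new_left, new_right, separator).
    """
    max_keys = n - 1                    # assumed leaf capacity
    combined = left + right
    if len(combined) <= max_keys:
        # Entries fit in one node: keep the left node, delete the right one.
        return ("merge", combined)
    # Otherwise share the entries so both leaves meet the minimum.
    mid = len(combined) // 2
    new_left, new_right = combined[:mid], combined[mid:]
    # The parent's separating key becomes the smallest key of the right leaf.
    return ("redistribute", new_left, new_right, new_right[0])


# n = 4: a leaf holds at most 3 keys.
borrow = rebalance_leaves(["Gold", "Katz", "Kim"], ["Mozart"], n=4)
merge = rebalance_leaves(["Gold"], ["Kim", "Mozart"], n=4)
```

In the first call the right leaf ends up holding Kim and Mozart, and Kim becomes the new separating key in the parent; in the second call everything fits into one leaf, so the parent loses an entry and the procedure would continue one level up.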
Complexity of Updates
The cost (in number of I/O operations) of inserting or deleting a single entry is proportional to the height of the tree
• With K entries and maximum fanout of n, worst case complexity of insert/delete of an entry is O(log_⌈n/2⌉(K))
In practice, number of I/O operations is less:
• Internal nodes tend to be in buffer
• Splits/merges are rare, most insert/delete operations only affect a leaf node
Average node occupancy depends on insertion order
• About 2/3 with random insertion, 1/2 with insertion in sorted order
Non-Unique Search Keys
B+-Tree File Organization
B+-Tree File Organization (Cont.)
Good space utilization important since records use more space than pointers.
To improve space utilization, involve more sibling nodes in redistribution during splits and merges
• Involving 2 siblings in redistribution (to avoid split / merge where possible) results in each node having at least ⌊2n/3⌋ entries
Other Issues in Indexing
Indexing Strings
Bulk Loading and Bottom-Up Build
B-Tree Index File Example
Indexing on Flash
Indexing in Main Memory