Module 3
What We’ll Cover in This Session:
Definition:
● Logical Record: A record as perceived by the user or application (e.g., a row in a table).
● Physical Record: How the logical record is stored on disk (e.g., blocks or pages).
Example:
A logical record might be a single employee record, while the physical record could store multiple employee records in a block on disk.
Three record storage formats
(b) A record with two variable-length fields and three fixed-length fields.
Definition:
● The blocking factor is the number of logical records that can fit into a single physical block (or page) on disk.
Formula:
Blocking Factor = Block Size / Record Size
Importance:
● Maximizes disk space utilization.
● Reduces I/O operations by reading/writing multiple records at once.
Example:
If a block size is 4 KB and each record is 1 KB, the blocking factor is 4 (4 records per block).
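The formula can be checked with a one-line sketch (the function name is ours):

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole records per block (unspanned organization)."""
    return block_size // record_size  # integer division: a partial record does not fit

# The example above: 4 KB blocks, 1 KB records -> 4 records per block
print(blocking_factor(4096, 1024))  # 4
```

Integer division matters here: with 300-byte records in a 4 KB block, 13 records fit and the remaining 196 bytes are wasted space.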
Pinned vs. Unpinned Organization
Pinned Organization:
● Records are fixed in specific locations on disk.
● Useful for systems requiring predictable access patterns.
Unpinned Organization:
● Records can move around on disk (e.g., during reorganization).
● More flexible but may require additional overhead to track record locations.
Comparison:
● Pinned organization gives stable record addresses at the cost of flexibility; unpinned organization is flexible but needs extra bookkeeping to track records that move.
Pinned vs. Unpinned (Buffer Management & Memory Pages)
● Pinned Pages: A page (block of data) in memory is pinned when the database system ~
prevents it from being removed or swapped out because it is actively being used. -
○ Example: A query that scans a table might pin pages in memory to ensure they stay
-
available while the query executes.~ -
○ Once the query is done, the pages are unpinned, meaning they can be replaced by
other data if needed.
● Unpinned Pages: A page is unpinned when the database allows it to be removed from
memory. - ~ -
○ Example: If a page is no longer needed, it is unpinned, and the buffer manager can
evict it to free space.
🔹 Use Case: Used in buffer pool management to decide when to keep or remove data from
memory.
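A toy sketch of the pin/unpin protocol (the class and method names are illustrative, not a real DBMS API):

```python
class BufferPool:
    """Minimal pin-count bookkeeping: a page can be evicted only when unpinned."""
    def __init__(self):
        self.pin_counts = {}  # page_id -> number of active pins

    def pin(self, page_id):
        self.pin_counts[page_id] = self.pin_counts.get(page_id, 0) + 1

    def unpin(self, page_id):
        self.pin_counts[page_id] -= 1

    def can_evict(self, page_id):
        return self.pin_counts.get(page_id, 0) == 0

pool = BufferPool()
pool.pin("page7")               # a scanning query pins the page
print(pool.can_evict("page7"))  # False: the buffer manager must keep it
pool.unpin("page7")             # query finished
print(pool.can_evict("page7"))  # True: eligible for replacement
```

Real buffer managers track a pin count exactly like this, because several queries may pin the same page concurrently.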
Spanned vs. Unspanned (Storage Allocation & Records)
● Spanned Records: A record is spanned when it crosses multiple pages because it is too large to fit in a single page.
○ Example: A large row (e.g., a long text or blob field) that doesn’t fit in one page will be split across multiple pages.
○ The database uses pointers to link the fragments of the record across different pages.
● Unspanned Records: A record is unspanned when it fits entirely within a single page.
○ Example: A small row that can be stored in one page is an unspanned record.
○ This makes retrieval faster since no extra page reads are needed.
🔹 Use Case: Used in disk storage and file organization to decide whether records are stored within one page or across multiple pages.
Operations on Files
● Retrieval Operations: Locate and examine records without modifying them.
● Update Operations: Modify records by insertion, deletion, or modification.
Basic File Operations
● File Organization: How records are structured and stored (ordering, hashing, indexing).
● Access Method: The set of operations available for accessing and managing records.
● Some access methods are specific to certain file organizations.
Choosing an Efficient File Organization
● The simplest and most basic type of file organization
● Records are inserted in the order they arrive
● New records are added at the end of the file
● Also known as heap or pile file
Inserting Records in Heap Files
Steps:
1. Copy the last block of the file into a buffer
2. Add the new record to the buffer
3. Rewrite the block back to disk
4. Update the file header with the new last block address
✅ Fast and efficient insertion process
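The four steps can be sketched by modelling the file as a list of fixed-capacity blocks (the names and the capacity of 4 are our assumptions):

```python
BLOCK_CAPACITY = 4  # records per block (illustrative)

def heap_insert(file_blocks, record):
    """Append a record at the end of a heap file: one block read + one block write."""
    if not file_blocks or len(file_blocks[-1]) >= BLOCK_CAPACITY:
        file_blocks.append([])          # start a new last block
    buffer = list(file_blocks[-1])      # Step 1: copy the last block into a buffer
    buffer.append(record)               # Step 2: add the new record to the buffer
    file_blocks[-1] = buffer            # Step 3: rewrite the block back to "disk"
    return len(file_blocks) - 1         # Step 4: last-block address for the file header

blocks = [["r1", "r2", "r3", "r4"]]     # last block is already full
print(heap_insert(blocks, "r5"), blocks)  # 1 [['r1', 'r2', 'r3', 'r4'], ['r5']]
```

Insertion touches only the last block regardless of file size, which is why it is constant-cost.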
Searching Records in Heap Files
Process:
● Linear search is required (expensive operation)
● On average, (b/2) blocks must be read before finding a record
● If no match is found, all b blocks must be searched
● Search time increases as the file grows
Deleting Records in Heap Files
Two methods:
1. Physical Deletion
○ Find the block containing the record
○ Remove the record and rewrite the block to disk
○ Leaves unused space in blocks (wasted storage)
2. Logical Deletion (Using Deletion Marker)
○ Use an extra byte/bit as a deletion marker
○ Mark records as deleted instead of removing them
○ Search programs ignore marked records
Reorganizing Heap Files
Fixed-Length Records:
● Simple to manage
Variable-Length Records:
● May require deleting and reinserting modified records
● Complicated space management
🛠 Both spanned and unspanned storage methods can be used
Direct Access in Heap Files
● The ith record is located in:
○ Block ⌊i / bfr⌋, where bfr is the blocking factor
Definition:
● A heap file is an unsorted collection of records stored on disk.
● Records are inserted without any specific order.
Characteristics:
● Simple to implement.
● Fast for insertions since no sorting or indexing is required.
● Slow for searches since records must be scanned sequentially.
Example:
A heap file might store employee records in the order they were added, regardless of
their IDs or names.
Summary (Heap)
✅ Heap Files Pros:
● Simple to implement
● Fast insertions (no sorting or indexing required)
❌ Cons:
● Slow searches: records must be scanned sequentially
Definition: A file where records are physically ordered based on an ordering field.
● Also known as a sequential file.
● If the ordering field is a key field (unique per record), it is called an ordering key.
Benefits of Ordered Files
● Reading records in order of the ordering field is very efficient.
● Binary search can be used on the ordering field, greatly reducing search cost.
Binary Search in Ordered Files
● Algorithm: Divide file blocks into halves, check midpoint, and adjust search range accordingly.
Some blocks of an ordered (sequential) file of EMPLOYEE records with Name as the ordering key field.
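A sketch of block-level binary search (each inner list stands for one sorted disk block, and the blocks themselves are in key order):

```python
def block_binary_search(blocks, key):
    """Return (found, blocks_read); each blocks[mid] access models one disk read."""
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]               # one block read from disk
        reads += 1
        if key < block[0]:
            hi = mid - 1                  # key lies in an earlier block
        elif key > block[-1]:
            lo = mid + 1                  # key lies in a later block
        else:
            return (key in block, reads)  # key, if present, must be in this block
    return (False, reads)

ordered_blocks = [[2, 5], [8, 11], [14, 20], [25, 31]]
print(block_binary_search(ordered_blocks, 14))  # (True, 2)
```

The comparison uses only the first and last key of each block, so the cost is about log2(b) block reads instead of b/2 for a heap file.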
Limitations of Ordered Files
● Slow insertions: Maintaining order requires shifting records, increasing disk I/O.
● Expensive deletions: Removing a record creates gaps, requiring file reorganization.
● Limited flexibility: Only optimized for searches on the ordering field.
Managing Insertions & Deletions
● Insertion:
○ Records must be placed in their correct position.
○ Often requires shifting half of the records on average.
● Deletion:
○ Can leave empty spaces (fragmentation).
○ Using deletion markers & periodic reorganization helps.
Hashing Techniques
1. Internal Hashing
2. External Hashing for Disk Files
3. Hashing Techniques That Allow Dynamic File Expansion
a. Extendible Hashing
b. Dynamic Hashing
c. Linear Hashing
Hashing
Introduction to Hashing
Internal hashing data structures.
Internal Hashing Array (Figure 17.8 (a)):
● This is a fixed-size table (array) with M positions (or slots) used for hashing data.
● Each row consists of:
○ Data fields (e.g., Name, SSN, Job, Salary).
○ Each position can store a record directly.
⚡ Example:
● A collision occurs when two different keys map to the same index.
● Example: Both h(12345) and h(54325) might map to index 5.
Collision Resolution Methods
Open Addressing
● If a collision occurs (i.e., two different records get the same hash address), the program looks for the next available position in a sequential manner.
● This process continues until an empty space is found.
● A common technique under open addressing is linear probing, where the next available slot is checked one by one.
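Linear probing in a small table (M = 7 and the helper name are illustrative; the sketch assumes the table never fills up):

```python
M = 7
table = [None] * M

def probe_insert(key):
    """Insert with linear probing; returns the slot finally used."""
    i = key % M                  # home address from the hash function
    while table[i] is not None:  # collision: try the next slot, wrapping around
        i = (i + 1) % M
    table[i] = key
    return i

print(probe_insert(10))  # 10 % 7 = 3 -> slot 3
print(probe_insert(17))  # 17 % 7 = 3 -> occupied, probed to slot 4
```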
Chaining
● Instead of looking for a new location in the main hash table, extra overflow locations are maintained.
● Each record slot has a pointer field that links to an overflow record if a collision occurs.
● This creates a linked list of records that share the same hash address.
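The same colliding keys handled by chaining instead (Python lists stand in for the overflow chains):

```python
M = 7
chains = [[] for _ in range(M)]  # one (initially empty) chain per hash address

def chain_insert(key):
    chains[key % M].append(key)  # no probing: just extend the chain at the address

for k in (10, 17, 24):           # all three keys hash to address 3
    chain_insert(k)
print(chains[3])                 # [10, 17, 24] -- a linked list in a real system
```

Unlike open addressing, colliding records never displace other home addresses; the trade-off is the extra pointer storage per record.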
Multiple Hashing
● If the first hash function leads to a collision, a second hash function is applied to find an alternative address.
● If that also results in a collision, either a third hash function is applied or open addressing is used to resolve it.
● This method reduces clustering and improves efficiency.
External Hashing for Disk Files
3⃣ Disk Blocks:
● A disk block is the smallest unit of data read from or written to a disk.
● Each block contains:
○ Multiple records or entries.
○ Overflow pointers if records exceed the block size.
External Hashing for Disk Files
1. Inserting a Record:
○ The system computes the hash value using a function like h(Key) = Key % M.
○ The result points to a bucket number.
○ The bucket number is mapped to a block address on the disk.
2. Data Retrieval:
○ To retrieve a record:
■ Use the same hash function to determine the bucket number.
■ Follow the mapping to quickly locate the corresponding block address.
■ Retrieve the record directly from the disk block.
3. Handling Collisions:
○ If multiple records map to the same block:
■ Overflow records are stored in linked blocks (chaining).
■ Pointers in the block direct the system to overflow areas.
External Hashing for Disk Files
● Assume M = 4 (4 buckets).
● Hash function: h(Key) = Key % 4
Inserting Records:
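The M = 4 setup can be simulated as follows; the bucket capacity of 2 and the key list are our assumptions, chosen so the arithmetic of h(Key) = Key % 4 works out:

```python
M, CAPACITY = 4, 2
buckets = [[] for _ in range(M)]   # one main bucket per hash address
overflow = [[] for _ in range(M)]  # overflow chain per bucket (sketch)

def ext_insert(key):
    b = key % M                    # h(Key) = Key % 4
    target = buckets[b] if len(buckets[b]) < CAPACITY else overflow[b]
    target.append(key)

for k in (340, 460, 321, 761, 981):
    ext_insert(k)
print(buckets[0])                  # [340, 460]: both divisible by 4
print(buckets[1], overflow[1])     # [321, 761] [981]: third key overflows bucket 1
```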
● The hash table is divided into buckets (like Bucket 0, Bucket 1, Bucket 2, etc.).
● Each bucket can hold one or more records (or pointers to the records).
● For example:
○ Bucket 0 holds two records: 340 and 460.
○ Bucket 1 holds three records: 321, 761, and 91.
2. Overflow Buckets:
● When a bucket is full and more elements hash to the same bucket, overflow occurs.
● Instead of overwriting or rejecting the record, an overflow chain (linked list) is created.
● The additional records are stored in overflow buckets, and pointers link them sequentially.
For example:
○ In Bucket 1, after adding 321, 761, and 91, there’s still more data (981 and 182) that needs to be stored.
○ These extra elements are placed in overflow buckets linked to Bucket 1.
3. Chaining (Linked List Structure):
● Overflow records (981 and 182) are stored in a linked list attached to the corresponding bucket.
● If an overflow bucket becomes full, another overflow bucket can be linked, forming a chain (as shown with record 652 for
Bucket 2).
4. NULL Pointers:
● When there are no further overflow records, the chain ends with a NULL pointer indicating the end of the overflow list.
Advantages:
● Efficient storage: Allows dynamic handling of overflow without the need for a fixed-size bucket.
● Simplified insertions: New records can be easily appended to the end of the chain.
● Minimizes rehashing: No need for rehashing as records overflow naturally using linked lists.
Disadvantages:
● Slower searches: In the worst case (when many records hash to the same bucket), searching becomes linear as it needs to traverse the entire chain.
● Extra memory usage: Additional pointers (overhead) are needed for chaining.
Extendible Hashing
Structure of the extendible hashing scheme.
Extendible Hashing
1. Directory:
● The directory acts as a table that maps binary hash prefixes to the corresponding data buckets.
● Each entry in the directory corresponds to a binary combination of bits (in this case, 3 bits):
○ 000, 001, 010, 011, 100, 101, 110, 111
● The directory points to the correct bucket based on the first d bits (where d is the global depth).
● The global depth (d = 3) indicates how many bits from the hash value are used to index into the directory.
● With d = 3, there are 2^3 = 8 directory entries (ranging from 000 to 111).
Extendible Hashing
● Each bucket has its own local depth (d'), which represents how many bits are used to distinguish
records within that specific bucket.
● If a bucket overflows:
○ If d' < d: Only the bucket splits, and a new bit is used to differentiate records.
○ If d' = d: The directory size doubles, increasing the global depth (d).
● Each bucket stores actual records whose hash values match the corresponding binary prefix:
○ E.g., The bucket for 000 stores all records whose hash values start with 000.
● Buckets can hold multiple records until a threshold is reached (bucket overflow).
● When a bucket overflows, it either splits or triggers an expansion of the directory.
Extendible Hashing
1. Insertion of a Record:
○ Suppose a record’s hash value is 101110.
○ The system uses the first 3 bits (101) because the global depth (d) is 3.
○ The directory entry for 101 will point to the appropriate bucket for insertion.
2. Handling Overflow:
○ If the corresponding bucket overflows:
■ Case 1: If d' < d, split the bucket and increase its local depth by 1.
■ Case 2: If d' = d, double the directory size and increment the global depth (d =
4).
3. Directory Doubling:
○ Doubling the directory adds one more bit to all directory entries.
○ This creates new pointers for the newly created buckets after splitting.
Dynamic Hashing
● The directory is structured as a binary tree of nodes.
● Each internal node represents a decision point based on bits (0 or 1) from the hash value.
● Paths from the root to a leaf correspond to binary prefixes of the hash value.
Node Types:
● Internal Directory Node: Represents a branching decision based on bits (shown as circles in the figure).
● Leaf Directory Node: Points directly to the buckets (shown as rectangles).
● Each leaf node points to a bucket in which data records are stored.
● Records are placed into buckets based on the prefix of their hash values.
○ Example: A record with hash 001011 will go to the bucket where the hash starts with 001.
Dynamic Hashing
1. Inserting a Record:
○ The hash value of the record is computed.
○ The system follows the binary path:
■ At each node, based on the bit (either 0 or 1), it follows the corresponding branch.
■ When a leaf node is reached, the record is placed into the associated bucket.
2. Handling Overflow (Bucket Splitting):
○ If a bucket overflows:
■ A new internal node is introduced.
■ The records are redistributed into two new buckets based on an additional bit from their hash values.
3. Directory Expansion:
○ Unlike Extendible Hashing, the directory does not double entirely.
○ Only the required part of the tree expands, leading to more efficient space utilization.
Linear Hashing
● Buckets split dynamically based on data growth.
● Uses progressive expansion rather than directory doubling.
● Reduces large-scale reorganization costs.
Summary & Conclusion
● Hashing provides efficient record access.
● Collision resolution is crucial for performance.
● Extendible and Linear Hashing adapt well to data growth.
● Used in database indexing, memory allocation, and file storage.
Practical Exercise
Task 1:
Calculate the blocking factor for a system where:
● Block size = 8 KB
● Record size = 2 KB
Hint:
Use the formula:
Blocking Factor = Block Size / Record Size
Task 2:
Explain why heap files are inefficient for large-scale databases with frequent search operations.
Hint:
Consider the sequential scanning process and lack of ordering.
Recap and Key Takeaways 3.6
Questions?
Topic 3.7: Single-Level Indices, Numerical Examples
Module 3
What We’ll Cover in This Session:
1. Introduction to Indexing
● Why do we need indices?
2. Types of Single-Level Indices
● Primary Index
● Secondary Index
3. Numerical Examples
● Calculating Index Size and Search Efficiency
4. Practical Exercises
5. Recap and Key Takeaways
What is Indexing?
Definition:
● An index is a data structure that improves the speed of data retrieval operations on a database table.
● It works like an index in a book, allowing quick access to specific data without scanning the entire table.
Indexing
Definition:
● Indexing is a technique used to improve the speed of data retrieval operations on a database.
● An index is a data structure (e.g., B-tree, hash table) that maps keys to record locations.
Types of Indexes:
1. Primary Index: Built on the primary key.
2. Secondary Index: Built on non-primary key columns.
3. Clustered Index: Determines the physical order of data.
4. Non-Clustered Index: Stores a separate structure pointing to data.
Benefits of Indexing:
● Faster query performance.
● Enables efficient range queries and sorting.
Drawbacks:
● Increases storage requirements.
● Slows down insertions and updates due to index maintenance.
Single-Level Indices
Definition:
● A single-level index is an index that uses a single level of entries to map keys to record locations.
● It is simpler than multi-level indices but may not scale well for very large datasets.
Types of Single-Level Indices:
1. Primary Index: Built on the primary key of a table.
● Assumes records are stored in sorted order by the primary key.
2. Secondary Index: Built on non-primary key columns.
● Can be created on unsorted data.
Primary Index
Definition:
● A primary index is built on the primary key of a table.
● It assumes that records are stored in sorted order by the primary key.
Structure:
● Each entry in the index contains:
● Key value (e.g., primary key).
● Pointer to the block where the record is stored.
Advantages:
● Efficient for range queries.
● Reduces the number of disk accesses.
Example: Suppose we have a table with 1000 records sorted by EmployeeID. The primary index might look like this:
Secondary Index
Definition:
● A secondary index is built on non-primary key columns.
● It allows indexing on fields other than the primary key.
Structure:
● Each entry in the index contains:
● Key value (e.g., a secondary column like LastName).
● Pointer to the record location.
Advantages:
● Enables fast searches on non-primary key columns.
● Useful for tables with multiple search criteria.
Disadvantages:
● Requires additional storage.
● May slow down insertions and updates.
Example: Suppose we want to index employees by LastName. The secondary index might look like this:
Numerical Example - Secondary Index
Scenario:
The same file has a secondary index on LastName, with each index entry being 20 bytes. Calculate:
1. Number of index entries.
2. Size of the secondary index.
Solution:
1. Number of Index Entries:
Number of Entries = Total Records = 10,000
2. Index Size:
Index Size = Number of Entries × Entry Size = 10,000 × 20 = 200,000 bytes ≈ 200 KB
Practical Exercise
Task 1:
A file contains 5000 records, each of size 200 bytes. The block size is 4 KB. Calculate:
1. Blocking factor.
2. Number of blocks needed to store the file.
3. Size of the primary index if each index entry is 12 bytes.
Task 2:
If the same file has a secondary index on DepartmentID , with each index entry being 18 bytes, calculate the size of the
secondary index.
Topic 3.8: Multi-Level Indices, Numerical Examples
What We’ll Cover in This Session:
1. Introduction to Multi-Level Indices
● Why do we need multi-level indices?
2. Structure of Multi-Level Indices
● How they work and their advantages
3. Numerical Examples
● Calculating index levels and search efficiency
4. Practical Exercises
5. Recap and Key Takeaways
What are Multi-Level Indices?
Definition:
● A multi-level index is an indexing technique that uses multiple levels of indices to map keys to record locations.
● It is used to overcome the limitations of single-level indices when dealing with very large datasets.
Why Use Multi-Level Indices?
● Reduces the number of disk accesses required for searching.
● Scales better than single-level indices for large datasets.
Key Characteristics:
● The top level contains pointers to the next level.
● The bottom level points to actual data blocks.
Structure of Multi-Level Indices
Explanation:
● A multi-level index is like a tree structure where each level reduces the search space.
● The top level (root) points to intermediate levels.
● The bottom level points to data blocks.
Example:
Suppose we have a file with 1 million records. A multi-level index might look like this:
Advantages of Multi-Level Indices
1. Efficient Search:
● Reduces the number of disk accesses by narrowing down the search space at each level.
2. Scalability:
● Handles very large datasets more effectively than single-level indices.
3. Flexibility:
● Can be combined with other indexing techniques like B-Trees.
Disadvantages:
● Increased storage requirements due to multiple levels.
● More complex to implement and maintain.
Numerical Example - Multi-Level Index
Scenario:
A file contains 1,000,000 records, each of size 200 bytes. The block size is 4 KB. Each index entry is 12 bytes. Calculate:
1. Blocking factor for data blocks.
2. Number of blocks needed to store the file.
3. Number of levels in the multi-level index.
Solution:
1. Blocking Factor:
Blocking Factor = Block Size / Record Size = 4096 / 200 ≈ 20 records per block
2. Number of Blocks:
Number of Blocks = 1,000,000 / 20 = 50,000 blocks
Scenario:
3. Number of levels in the multi-level index.
Solution 3:
Number of Levels in Multi-Level Index:
Each index block can hold:
Entries per Index Block = Block Size / Entry Size = 4096 / 12 ≈ 341 entries
● Bottom Level: Points to 50,000 data blocks. Requires:
Blocks in Bottom Level = 50,000 / 341 ≈ 147 blocks
● Intermediate Level: Points to 147 blocks. Requires:
Blocks in Intermediate Level = 147 / 341 ≈ 1 block
● Top Level (Root): Points to 1 block.
Total Levels: 3 (Root → Intermediate → Bottom).
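The block counts above can be reproduced in a few lines (ceiling division as in the slide; the slide additionally counts a root level above the single intermediate block):

```python
import math

fan_out = 4096 // 12                        # 341 index entries per 4 KB block
bottom = math.ceil(50_000 / fan_out)        # blocks needed to point at 50,000 data blocks
intermediate = math.ceil(bottom / fan_out)  # blocks needed to point at the bottom level
print(fan_out, bottom, intermediate)        # 341 147 1
```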
Search Efficiency with Multi-Level Indices
Scenario:
How many disk accesses are required to retrieve a record using a multi-level index?
Explanation:
● Each level reduces the search space.
● For the previous example:
● Access the root block (1 access).
● Access the intermediate block (1 access).
● Access the bottom-level block (1 access).
● Access the data block (1 access).
Total Disk Accesses: 4
Comparison with Single-Level Index:
● Single-level index would require accessing all 50,000 blocks in the worst case.
Practical Exercise
Task 1:
A file contains 2,000,000 records, each of size 150 bytes. The block size is 8 KB. Each
index entry is 16 bytes. Calculate:
1. Blocking factor for data blocks.
2. Number of blocks needed to store the file.
3. Number of levels in the multi-level index.
Task 2:
If the same file uses a single-level index, how many disk accesses are required in the
worst case? Compare it with the multi-level index.
Recap and Key Takeaways
Questions?
Topic 3.9: B-Trees and B+Trees (Structure Only, Algorithms Not Required)
What We’ll Cover in This Session:
1. Introduction to B-Trees
● What are B-Trees?
● Key Characteristics
2. Structure of B-Trees
● Nodes, Keys, and Pointers
3. Introduction to B+Trees
Definition:
● A B-Tree is a self-balancing tree data structure that maintains sorted data and allows efficient insertion, deletion, and search operations.
● It is widely used in databases and file systems for indexing large datasets.
Key Characteristics:
1. Balanced Structure: All leaf nodes are at the same level.
2. Multi-Way Search: Each node can have multiple keys and pointers.
3. Efficient Disk Access: Minimizes the number of disk I/O operations.
Applications:
● Database indexing.
● File system organization.
(a) A node in a B-tree with q – 1 search values.
A B-tree of order p = 3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6.
Structure of B-Trees
Explanation:
● A B-Tree consists of nodes, each containing keys and pointers.
● Internal Nodes: Contain keys and pointers to child nodes.
● Leaf Nodes: Contain actual data or pointers to data blocks.
Example of a B-Tree Node:
| Pointer 1 | Key 1 | Pointer 2 | Key 2 | Pointer 3 |
Rules for B-Trees:
1. Each node can have up to m children (where m is the order of the tree).
2. Each non-root node has between ⌈m/2⌉ and m children, i.e., between ⌈m/2⌉ − 1 and m − 1 keys.
3. All leaf nodes are at the same level.
Example of a B-Tree
Scenario:
A B-Tree of order 3 (each node can have up to 3 children).
Explanation:
● Root node contains keys [10, 20].
● Left child contains keys [5, 7].
● Middle child contains key [15].
● Right child contains keys [25, 30].
Search Process:
To search for key 15:
1. Start at the root [10, 20].
2. Compare 15 with 10 and 20.
3. Move to the middle child [15].
A B-Tree is a specialized m-way tree designed to optimize data access, especially on disk-based storage systems.
● In a B-Tree of order m, each node can have up to m children and m-1 keys, allowing it to efficiently manage large datasets.
Need of a B-Tree
The B-Tree data structure is essential for several reasons:
● It keeps the tree balanced and shallow, so searches touch only a few nodes.
● High fan-out per node minimizes the number of disk I/O operations.
● It stays efficient under insertions and deletions without global reorganization.
Searching in a B-Tree
● Searching in a B-Tree is similar to searching in a Binary Search Tree (BST).
● Let the key to be searched be k.
Search Algorithm:
● Within a node, find the first key greater than or equal to k; if it equals k, the search succeeds.
● Otherwise, follow the child pointer between the two keys that bracket k and repeat; if a leaf is reached without a match, k is not in the tree.
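The descent can be sketched with nested dicts; the node layout (sorted "keys" plus optional "children", one more child than keys) is our own, not a standard API:

```python
def btree_search(node, k):
    """BST-style multi-way descent: at each node, find where k fits."""
    keys = node["keys"]
    i = 0
    while i < len(keys) and k > keys[i]:  # find the first key >= k
        i += 1
    if i < len(keys) and keys[i] == k:
        return True                       # found in this node
    if "children" not in node:
        return False                      # leaf reached without a match
    return btree_search(node["children"][i], k)  # descend between bracketing keys

# The order-3 example tree from the slides above
tree = {"keys": [10, 20], "children": [
    {"keys": [5, 7]}, {"keys": [15]}, {"keys": [25, 30]}]}
print(btree_search(tree, 15), btree_search(tree, 8))  # True False
```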
● B+ Tree is a variation of B-Tree, designed for efficient indexing and searching.
● Key Characteristics:
○ Data pointers are stored only at leaf nodes.
○ Leaf nodes contain every value of the search field along with a pointer to the actual data (record or block).
○ Internal nodes do not store actual data; they guide the search process.
○ Some values from leaf nodes are repeated in internal nodes for navigation.
○ Leaf nodes are linked together to provide ordered access to data.
Advantages of B+ Tree:
📌 Key Insight:
● Internal nodes help in searching, but actual data is always stored in leaf nodes.
● This structure makes B+ Trees more efficient for large databases! 🚀
Features of B+ Trees
1⃣ Balanced:
● Self-balancing structure.
● Automatically adjusts when data is added or removed.
● Ensures consistent search time regardless of tree size.
2⃣ Multi-Level Structure:
● Root node at the top, internal nodes in between, leaf nodes at the bottom.
● Leaf nodes store actual data.
3⃣ Ordered:
● Maintains sorted order of keys, making range queries efficient.
4⃣ High Fan-out:
● Each node holds many keys, keeping the tree shallow and reducing disk reads.
✅ Balanced Structure
Example of a B+Tree
Scenario:
Explanation:
● Root node contains keys [10, 20].
● Leaf nodes [5, 7], [15], and [25, 30] store all the data.
● Leaf nodes are linked sequentially for range queries.
Search Process:
To search for key 15:
1. Start at the root [10, 20].
2. Compare 15 with 10 and 20.
3. Move to the middle child [15].
Range Query Example:
To retrieve all keys between 10 and 25:
1. Start at the leaf node containing 10.
2. Traverse the linked list of leaf nodes until reaching 25.
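The linked-leaf range scan can be sketched as follows (the `Leaf` class is illustrative; a real B+ tree would first descend from the root to the starting leaf):

```python
class Leaf:
    """A B+ tree leaf: sorted keys plus a link to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

# Leaves from the example above: [5,7] -> [15] -> [25,30]
a, b, c = Leaf([5, 7]), Leaf([15]), Leaf([25, 30])
a.next, b.next = b, c

def range_query(start_leaf, lo, hi):
    out, leaf = [], start_leaf
    while leaf is not None:
        out += [k for k in leaf.keys if lo <= k <= hi]
        if leaf.keys and leaf.keys[-1] > hi:  # past the range: stop scanning
            break
        leaf = leaf.next                      # follow the leaf link
    return out

print(range_query(a, 10, 25))  # [15, 25]
```

This is exactly why B+ trees beat B-trees for range queries: once the first leaf is found, the scan never revisits internal nodes.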
Differences Between B-Trees and B+Trees
● B-Tree: data pointers may appear in internal nodes as well as leaves. B+Tree: data pointers appear only in leaf nodes.
● B+Tree leaf nodes are linked, supporting efficient ordered and range access; B-Tree leaves are not linked.
● In a B+Tree, some key values are repeated in internal nodes purely for navigation.
Practical Exercise
Task 1:
Draw a B-Tree of order 3 for the following keys:
[5, 10, 15, 20, 25, 30].
Task 2:
Draw a B+Tree of order 3 for the same set of keys. Highlight the differences
between the two structures.
Recap and Key Takeaways
1. Introduction to Hashing
● What is Hashing?
● Why Extendible Hashing?
2. Structure of Extendible Hashing
● Directory and Buckets
● Key Characteristics
3. How Extendible Hashing Works
● Splitting Buckets
● Expanding the Directory
4. Practical Examples
5. Recap and Key Takeaways
What is Hashing?
Definition:
● Hashing is a technique used to map keys to specific locations (buckets) in a hash table.
● It enables fast data retrieval by using a hash function to compute the location of a record.
Why Use Hashing?
● Provides constant-time search performance (O(1) on average).
● Efficient for equality-based queries (e.g., "Find record with key = X").
Limitations of Static Hashing:
● Fixed number of buckets leads to overflow when the dataset grows.
● Poor performance due to collisions in large datasets.
Solution:
● Extendible Hashing dynamically adjusts the hash table size to handle growing datasets.
What is Extendible Hashing?
Definition:
● Extendible Hashing is a dynamic hashing technique that allows the hash table to grow or shrink
as needed.
● It uses a directory structure to manage buckets and ensures efficient storage utilization.
Key Characteristics:
1. Directory: A table of pointers to buckets.
2. Buckets: Store actual records or pointers to records.
3. Dynamic Growth: Buckets split when they overflow, and the directory expands as needed.
Applications:
● Database indexing.
● File systems.
Structure of Extendible Hashing
Components:
1. Directory:
● An array of pointers to buckets.
● Each entry corresponds to a prefix of the hash value.
2. Buckets:
● Store records or pointers to records.
● Each bucket has a fixed capacity (e.g., can hold up to n records).
Key Concepts:
● Global Depth: Number of bits used to index the directory.
● Local Depth: Number of bits used to identify a bucket.
How Extendible Hashing Works
Step-by-Step Process:
1. Hash Function:
● Compute the hash value of a key.
● Use the first few bits (determined by the global depth) to locate the bucket.
2. Bucket Overflow:
● If a bucket exceeds its capacity, it splits into two buckets.
● The directory is updated to point to the new buckets.
3. Directory Expansion:
● If all buckets at a given depth are full, the directory doubles in size.
● The global depth increases by 1.
Example Scenario:
● Insert keys 5, 10, 15, 20 into an initially empty hash table.
● Show how buckets split and the directory expands as more keys are added.
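A compact simulation of this scenario (bucket capacity 2; we index by the last d bits of the key, one common convention, and append key 25 so a split actually occurs; the sketch assumes one split always resolves the overflow):

```python
class Bucket:
    def __init__(self, depth):
        self.depth, self.keys = depth, []  # local depth and stored keys

CAP = 2
global_depth = 1
directory = [Bucket(1), Bucket(1)]         # entries for bit-suffixes 0 and 1

def eh_insert(key):
    global global_depth, directory
    b = directory[key & ((1 << global_depth) - 1)]
    if len(b.keys) < CAP:
        b.keys.append(key)
        return
    if b.depth == global_depth:            # full at max depth: double the directory
        directory = directory + directory
        global_depth += 1
    b.depth += 1                           # split the bucket on one more bit
    b0, b1 = Bucket(b.depth), Bucket(b.depth)
    for i, slot in enumerate(directory):
        if slot is b:                      # re-point every entry that shared the bucket
            directory[i] = b1 if (i >> (b.depth - 1)) & 1 else b0
    for k in b.keys + [key]:               # redistribute the old keys plus the new one
        directory[k & ((1 << global_depth) - 1)].keys.append(k)

for k in (5, 10, 15, 20, 25):
    eh_insert(k)
print(global_depth)              # 2: inserting 25 forced a split and a doubling
print(sorted(directory[1].keys)) # [5, 25]: the keys whose bits end in ...01
```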
Example of Extendible Hashing
Initial State:
● Global Depth = 1.
● Directory:
| Prefix 0 | -> Bucket A
| Prefix 1 | -> Bucket B
Insert Key 5:
● Hash value of 5: 01.
● Insert into Bucket B.
Insert Key 10:
● Hash value of 10: 10.
● Insert into Bucket B.
Bucket B Overflows:
● Split Bucket B into two buckets (B1 and B2).
● Update the directory:
| Prefix 00 | -> Bucket A
| Prefix 01 | -> Bucket B1
| Prefix 10 | -> Bucket B2
| Prefix 11 | -> Bucket B2
Advantages of Extendible Hashing
1. Dynamic Growth:
● Handles growing datasets without wasting space.
2. Efficient Search:
● Constant-time search performance (O(1) on average).
3. Collision Handling:
● Splits buckets to resolve collisions dynamically.
Disadvantages:
● Increased complexity due to directory management.
● Potential for directory expansion overhead.
Practical Exercise
Task 1:
Given the following keys: [5, 10, 15, 20, 25, 30] , simulate the process of inserting them into an
extendible hash table. Assume each bucket can hold up to 2 keys.
Task 2:
Draw the final state of the directory and buckets after all keys are inserted. Highlight any bucket splits
or directory expansions.
Recap and Key Takeaways 3.10
Definition:
● Multi-key indexing allows efficient querying on multiple attributes (keys) simultaneously.
● It is particularly useful for range queries or multi-dimensional data.
Why Use Multi-Key Indexing?
● Enables fast retrieval for queries involving multiple attributes (e.g., "Find employees with salary > 50,000 AND age < 30").
● Commonly used in spatial databases, GIS systems, and multi-dimensional datasets.
Challenges:
● Traditional single-key indexes (e.g., B-Trees) are inefficient for multi-dimensional data.
● Grid files provide a solution by partitioning data into cells based on multiple keys.
What are Grid Files?
Definition:
● A grid file is a data structure used for indexing multi-dimensional data.
● It divides the data space into a grid of cells, where each cell corresponds to a range of values for each key.
Key Characteristics:
1. Partitioning:
● Each dimension (key) is divided into intervals.
● The intersection of intervals forms a grid of cells.
2. Dynamic Growth:
● Cells can split dynamically to handle overflows.
3. Efficient Range Queries:
● Queries involving ranges on multiple keys can be resolved by examining relevant cells.
Structure of Grid Files
Components:
1. Grid Directory:
● Maps cells to buckets (storage locations). -
● Each cell corresponds to a specific range of values for each key. -
2. Buckets:
● Store actual records or pointers to records.
-
● Each bucket has a fixed capacity.
-
Example Diagram:
Grid Cells:
| Salary 0-50k, Age 0-30 | Salary 50k-100k, Age 0-30 |
| Salary 0-50k, Age 30-60 | Salary 50k-100k, Age 30-60 |
How Grid Files Work
Step-by-Step Process:
1. Partitioning the Data Space:
● Divide each key’s range into intervals.
● Form a grid of cells by combining intervals from all keys.
2. Mapping Records to Cells:
● Each record is mapped to a cell based on its key values.
● For example, an employee with Salary = 60,000 and Age = 25 would map to the cell [50k-100k, 0-30].
3. Handling Overflows:
● If a cell’s bucket overflows, the cell splits into smaller sub-cells.
● The grid directory is updated to reflect the new structure.
Example Scenario:
● Insert records into a grid file with two keys: Salary and Age.
● Show how cells split when buckets overflow.
Example of Grid Files
Scenario:
A dataset contains employees with attributes Salary and Age.
Initial Grid:
● Salary intervals: [0-50k, 50k-100k].
● Age intervals: [0-30, 30-60].
Insert Records:
1. Employee 1: Salary = 40,000, Age = 25 → Cell [0-50k, 0-30].
2. Employee 2: Salary = 60,000, Age = 28 → Cell [50k-100k, 0-30].
3. Employee 3: Salary = 45,000, Age = 27 → Cell [0-50k, 0-30].
Bucket Overflow:
● Cell [0-50k, 0-30] overflows.
● Split the cell into sub-cells: [0-25k, 0-30] and [25k-50k, 0-30].
Advantages of Grid Files
1. Efficient Multi-Key Queries:
● Handles queries involving multiple keys efficiently.
2. Dynamic Growth:
● Cells split dynamically to accommodate growing datasets.
3. Range Queries:
● Range queries are resolved by examining only the cells that overlap the query range.
Practical Exercise
Task 1:
Given the following records:
● Record 1: Salary = 30,000, Age = 22
● Record 2: Salary = 55,000, Age = 28
● Record 3: Salary = 40,000, Age = 25
● Record 4: Salary = 60,000, Age = 35
Simulate the process of inserting these records into a grid file. Assume initial intervals:
● Salary: [0-50k, 50k-100k].
● Age: [0-30, 30-60].
Task 2:
Draw the final state of the grid after all records are inserted. Highlight any cell splits.
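The insertion process can also be simulated in Python. The sketch below is a deliberately simplified grid file, not a full implementation: it always splits the Salary dimension at the interval midpoint and redistributes every record on a split, and the bucket capacity is an assumption since the task does not state one:

```python
import bisect

class GridFile:
    """Simplified grid-file sketch: linear scales (one boundary list per
    dimension) map each record to a cell; an overflowing bucket triggers a
    split of the Salary dimension's interval at its midpoint. A real grid
    file would only touch the affected bucket and directory entries."""
    def __init__(self, scales, capacity):
        self.scales = [list(s) for s in scales]
        self.capacity = capacity
        self.cells = {}                      # cell coordinates -> records

    def _cell(self, record):
        return tuple(bisect.bisect_right(s, v) - 1
                     for s, v in zip(self.scales, record))

    def insert(self, record):
        cell = self._cell(record)
        bucket = self.cells.setdefault(cell, [])
        bucket.append(record)
        if len(bucket) > self.capacity:
            self._split(cell)

    def _split(self, cell):
        lo = self.scales[0][cell[0]]
        hi = self.scales[0][cell[0] + 1]
        if hi - lo <= 1:
            return                           # interval can no longer split
        bisect.insort(self.scales[0], (lo + hi) // 2)
        records = [r for b in self.cells.values() for r in b]
        self.cells = {}
        for r in records:                    # simplified: redistribute all
            self.cells.setdefault(self._cell(r), []).append(r)
        for c, b in list(self.cells.items()):
            if len(b) > self.capacity:       # cascade if still overflowing
                self._split(c)
                break

# Task 1: insert the four records with the given initial intervals
grid = GridFile([[0, 50000, 100000], [0, 30, 60]], capacity=2)
for rec in [(30000, 22), (55000, 28), (40000, 25), (60000, 35)]:
    grid.insert(rec)
```

With an assumed capacity of 2, no bucket exceeds its limit, so no split occurs; rerunning with capacity 1 forces the [0-50k] salary interval to split, which is a useful check on your hand-drawn answer for Task 2.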
Recap and Key Takeaways 3.11
Questions?
PREVIOUS YEAR
UNIVERSITY QUESTION PAPER
UNIVERSITY QUESTIONS 2021 July
UNIVERSITY QUESTIONS 2023 June
QP 2024 December
ii. Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate the number of first-level index entries and the number of first-level index blocks.
iii. Calculate the number of levels needed if we make it into a multilevel index.
b) What is a grid file? What are its advantages and disadvantages?
SOLUTIONS
Illustrate the structure of B-Tree and B+ Tree and explain how they are different?
B-Tree Structure:
A B-Tree is a self-balancing search tree that maintains sorted data and allows searches, sequential
access, insertions, and deletions in logarithmic time. It is commonly used in databases and file systems.
Characteristics of B-Tree:
● Keys are kept in sorted order within each node, and all leaves appear at the same level.
● A node with k keys has k + 1 children, and every node except the root stays at least half full.
● Both keys and their associated data records may be stored in internal nodes as well as in leaves.
B+ Tree Structure:
A B+ Tree is an extension of the B-Tree where all values (data) are stored in leaf nodes, and internal nodes only store keys.
Characteristics of B+ Tree:
● All data records are stored in the leaf nodes; internal nodes hold only keys used for navigation.
● Leaf nodes are linked together, which makes range and sequential scans efficient.
● Every search travels from the root to a leaf, so lookup cost is uniform.
What are the different types of single-level ordered indices? Explain.
A single-level ordered index is an auxiliary data structure that helps in faster retrieval of records by keeping sorted references to data entries.
Types of Single-Level Ordered Indices:
1. Primary Index:
○ Used for a sorted file where records are arranged sequentially based on a key field.
○ One index entry per block of data.
○ Example: A database table where students are sorted by roll number.
2. Clustering Index:
○ Used when records are clustered on a non-key field.
○ Multiple records can have the same index key.
○ Example: Employees grouped by department.
3. Secondary Index:
○ Used when the data is not stored in a sorted manner.
○ Can be created on non-primary-key attributes.
○ Example: Index on a "Salary" column in an employee table.
Differentiate between static hashing and dynamic hashing.
Hashing is a technique to map keys to a fixed-size table for fast retrieval.
Static Hashing:
● The size of the hash table is fixed.
● The hash function does not change over time.
● Collisions are handled using chaining or open addressing.
● Example: Hash tables used in memory-based applications.
Dynamic Hashing:
● The size of the hash table grows or shrinks dynamically.
● Uses extendible hashing or linear hashing.
● Reduces collisions by allowing buckets to expand.
● Example: Databases and file systems.
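The static side of this comparison can be sketched in a few lines; the bucket count and sample keys below are illustrative, and the point of the sketch is that the table size never changes, so chains simply grow under load (a dynamic scheme would split buckets instead):

```python
class StaticHashTable:
    """Static hashing sketch: a fixed number of buckets, with collisions
    resolved by chaining (each bucket holds a list of key/value pairs)."""
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]  # size is fixed

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)     # overwrite an existing key
                return
        bucket.append((key, value))          # collision: the chain grows

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

table = StaticHashTable(num_buckets=4)
for dept_id, name in [(1, "HR"), (2, "IT"), (5, "Sales"), (9, "R&D")]:
    table.put(dept_id, name)   # keys 1, 5, 9 all hash to bucket 1
```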
Write short notes on Nested Queries.
A nested query (or subquery) is a query inside another SQL query. It is used to retrieve data
dynamically based on results from another query.
Example of Nested Query:
SELECT name FROM Employees
WHERE salary > (SELECT AVG(salary) FROM Employees);
Here, the inner query calculates the average salary, and the outer query selects employees earning above this average.
1. Single-Row Subqueries: Returns a single value, as in the example above.
2. Multi-Row Subqueries: Returns multiple values using IN, ANY, or ALL.
SELECT name FROM Employees
WHERE salary IN (SELECT salary FROM Employees WHERE department_id = 2);
Write short notes on Nested Queries (Continued)
3. Correlated Subqueries: The inner query depends on the outer query.
SELECT name FROM Employees e
WHERE salary > (SELECT AVG(salary) FROM Employees WHERE department_id = e.department_id);
Nested queries enhance query flexibility but can be performance-heavy. Optimizations like indexing and
JOIN usage can improve efficiency.
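The subquery forms above can be run end-to-end with SQLite from Python; the Employees table and its rows below are made up for illustration:

```python
import sqlite3

# In-memory demo of single-row and multi-row subqueries; sample data
# is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (name TEXT, salary REAL, department_id INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)",
                 [("Asha", 70000, 1), ("Bala", 40000, 2),
                  ("Chitra", 60000, 2), ("Deepak", 30000, 1)])

# Single-row subquery: employees earning above the overall average (50,000)
above_avg = conn.execute(
    "SELECT name FROM Employees "
    "WHERE salary > (SELECT AVG(salary) FROM Employees)").fetchall()

# Multi-row subquery: employees whose salary matches some salary
# found in department 2 (uses IN)
dept2_match = conn.execute(
    "SELECT name FROM Employees WHERE salary IN "
    "(SELECT salary FROM Employees WHERE department_id = 2)").fetchall()
```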
Solution 16 a [2023 June]
To calculate the record size R in bytes, we sum the sizes of all the fields in a single record.
Given Data:
Solution 16 a [2023 June] (Continued)
ii. Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate the number of first-level index entries and the number of first-level index blocks.
(iii) Calculating the Number of Levels for a Multilevel Index
To construct a multilevel index, we treat the first-level index as a file and build an index on it, repeating until the top level fits in a single block.
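The level-building loop can be sketched generically. The slide's given figures did not survive extraction, so the example values below (30,000 first-level entries, index fan-out 32) are assumptions chosen because they reproduce the stated final answers of 938 first-level blocks and 3 levels:

```python
from math import ceil

def index_levels(first_level_entries, fan_out):
    """Repeatedly index the level below until the top level is one block.
    fan_out = index entries that fit in one block (index blocking factor).
    Returns (first_level_blocks, number_of_levels)."""
    blocks = ceil(first_level_entries / fan_out)
    first_level_blocks = blocks
    levels = 1
    while blocks > 1:
        blocks = ceil(blocks / fan_out)   # next level: one entry per block
        levels += 1
    return first_level_blocks, levels

# Assumed figures: 30,000 first-level entries, 32 entries per index block
blocks, levels = index_levels(first_level_entries=30000, fan_out=32)
```

With these assumed inputs the loop gives 938 first-level blocks, then 30 second-level blocks, then 1 third-level block, i.e. 3 levels.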
(b) Grid File: Definition, Advantages, and Disadvantages
What is a Grid File?
A grid file is a type of multi-attribute index structure used in databases to efficiently support range queries and multidimensional data retrieval. It organizes data into a grid based on multiple attributes.
It consists of:
● Grid Directory: A table-like structure dividing data into grid cells based on attribute values.
● Linear Scales: One for each attribute, mapping data values into grid cell coordinates.
● Data Buckets: Store actual records.
Advantages of Grid Files
1. Efficient Multidimensional Indexing – Handles queries on multiple attributes efficiently.
2. Faster Search – Reduces the search space using the grid structure.
3. Dynamic Growth – Adapts well when data size increases.
4. Direct Access – Uses direct pointers to locate records without full scans.
Disadvantages of Grid Files
1. Directory Overhead – The grid directory grows rapidly as the number of dimensions increases.
2. Skewed Data – Uneven data distributions cause frequent splits, affecting performance.
Final Answers
(ii) First-level index blocks = 938 blocks
(iii) Number of levels for multilevel index = 3 levels
(b) Grid File: A multidimensional index with fast search but high overhead for
large dimensions.
1. Retrieve the names of employees along with their department names
To get employee names and their department names, we need to join the Employees and Departments tables using the department_id field.
SELECT e.employee_name, d.department_name
FROM Employees e
JOIN Departments d ON e.department_id = d.department_id;
Explanation:
● We select employee_name from the Employees table.
● We join the Departments table using department_id, which is the foreign key in
Employees.
● This query fetches the employee's name along with their respective department name.
2. Find the total salary expenditure per department
To calculate the total salary expenditure per department, we need to group employees based on department_id and sum their salaries.
SELECT d.department_name, SUM(e.salary) AS total_salary_expenditure
FROM Employees e
JOIN Departments d ON e.department_id = d.department_id
GROUP BY d.department_name;
● The SUM(e.salary) function calculates the total salary expenditure for each
department.
● We use GROUP BY d.department_name to group employees by department.
● The result shows each department's name along with the total salary expenditure.
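Queries 1 and 2 can be verified with SQLite from Python; the sample departments and rows below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Departments "
             "(department_id INTEGER PRIMARY KEY, department_name TEXT)")
conn.execute("CREATE TABLE Employees "
             "(employee_name TEXT, salary REAL, department_id INTEGER)")
conn.executemany("INSERT INTO Departments VALUES (?, ?)",
                 [(1, "HR"), (2, "IT")])
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)",
                 [("Asha", 50000, 1), ("Bala", 60000, 2),
                  ("Chitra", 70000, 2)])

# Query 1: employee names with department names
names = conn.execute(
    "SELECT e.employee_name, d.department_name FROM Employees e "
    "JOIN Departments d ON e.department_id = d.department_id").fetchall()

# Query 2: total salary expenditure per department
totals = conn.execute(
    "SELECT d.department_name, SUM(e.salary) FROM Employees e "
    "JOIN Departments d ON e.department_id = d.department_id "
    "GROUP BY d.department_name").fetchall()
```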
3. List employees who are currently assigned to a project
To find employees currently assigned to a project, we need to use the Assignments table. The employees should have a record in this table.
SELECT DISTINCT e.employee_name
FROM Employees e
JOIN Assignments a ON e.employee_id = a.employee_id;  -- join key assumed
The DISTINCT keyword ensures that each employee appears only once, even if assigned to multiple projects.
4. Find the average salary of employees in projects that started after 22/11/2022
To calculate the average salary of employees in projects that started after 22/11/2022, we need to:
SELECT AVG(e.salary) AS average_salary
FROM Employees e
JOIN Assignments a ON e.employee_id = a.employee_id   -- join keys assumed
JOIN Projects p ON a.project_id = p.project_id
WHERE p.start_date > '2022-11-22';
● The WHERE p.start_date > '2022-11-22' condition filters projects that started after 22nd November 2022.
● We join the Employees, Assignments, and Projects tables to find employees assigned to these projects.
● The AVG(e.salary) function calculates the average salary of these employees.
Summary
(a) Calculating the Blocking Factor
Given Data:
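The slide's given figures were lost in extraction; as a sketch, here is the calculation using the 4 KB block / 1 KB record example from earlier in the module, plus an assumed file of 10,000 records:

```python
from math import ceil

# Blocking factor = floor(block size / record size); the figures below
# are the module's earlier example, not this slide's (lost) givens.
block_size = 4096        # bytes (4 KB block)
record_size = 1024       # bytes (1 KB fixed-length record)

bfr = block_size // record_size          # records that fit in one block
records = 10000                          # assumed file size, for illustration
blocks_needed = ceil(records / bfr)      # blocks to store the whole file
```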
(b) Differences
5. Illustrate the Concept of Trigger in SQL with an Example (UQ)
What is a Trigger?
A trigger in SQL is a special kind of stored procedure that automatically executes when a specific event occurs
in a table. Triggers are mainly used for:
Let’s consider an Employees table where we want to log any salary updates in a separate Salary_Audit table.
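The trigger body on the original slide did not survive extraction. As a sketch of the idea, the SQLite equivalent below (table and column names assumed) logs every salary update into Salary_Audit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL);
CREATE TABLE Salary_Audit (employee_id INTEGER,
                           old_salary REAL, new_salary REAL);
-- Fires after any salary change and records the old and new values
CREATE TRIGGER log_salary_update AFTER UPDATE OF salary ON Employees
FOR EACH ROW
BEGIN
    INSERT INTO Salary_Audit VALUES (OLD.id, OLD.salary, NEW.salary);
END;
""")
conn.execute("INSERT INTO Employees VALUES (1, 'Asha', 50000)")
conn.execute("UPDATE Employees SET salary = 55000 WHERE id = 1")
audit = conn.execute("SELECT * FROM Salary_Audit").fetchall()
```

After the UPDATE, the audit table holds one row recording the change from 50,000 to 55,000, without the application having to write to it explicitly.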
DDL commands are used to define and modify database schema and structure. Common DDL commands include CREATE, ALTER, DROP, and TRUNCATE. Example:
CREATE TABLE Students (   -- table name assumed; the slide was truncated
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT
);
DML commands deal with data within tables. Common DML commands include SELECT, INSERT, UPDATE, and DELETE.
5. Difference Between WHERE and HAVING Clause with Example
WHERE Clause: Filters individual rows before any grouping takes place; it cannot use aggregate functions.
HAVING Clause: Filters groups after GROUP BY has been applied; it can use aggregate functions.
Example:
SELECT department, AVG(salary)
FROM Employees
GROUP BY department
HAVING AVG(salary) > 50000;
1. Hash Indexes
● If we search WHERE employee_id = 3, the hash function quickly locates the record.
2. B+ Tree Indexes
● If we search WHERE salary BETWEEN 50000 AND 70000, the B+ Tree efficiently retrieves the range.
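The contrast can be sketched with in-memory analogues: a dict behaves like a hash index (fast equality lookups only), while a sorted list plus binary search behaves like the leaf level of a B+ tree (efficient range scans). The sample data is illustrative:

```python
import bisect

# Hash-index analogue: a dict resolves equality predicates in O(1)
by_id = {1: "Asha", 2: "Bala", 3: "Chitra"}
match = by_id.get(3)                       # WHERE employee_id = 3

# B+-tree analogue: sorted keys support efficient range predicates
salaries = [30000, 45000, 55000, 62000, 70000]   # kept in sorted order
lo = bisect.bisect_left(salaries, 50000)
hi = bisect.bisect_right(salaries, 70000)
in_range = salaries[lo:hi]           # WHERE salary BETWEEN 50000 AND 70000
```

Note the dict gives no help with the range query, which is exactly why databases prefer B+ trees for BETWEEN predicates and hash indexes for equality.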
5. Role of Triggers in SQL Databases
What is a Trigger?
A trigger in SQL is a special type of stored procedure that is automatically executed when a specific event occurs in a database. These events include INSERT, UPDATE, and DELETE operations on a table.
The following trigger ensures that an employee’s salary cannot be set below 30,000:
DELIMITER //
CREATE TRIGGER check_min_salary   -- trigger name assumed; slide truncated
BEFORE UPDATE ON Employees
FOR EACH ROW
BEGIN
    IF NEW.salary < 30000 THEN
        SIGNAL SQLSTATE '45000'
        SET MESSAGE_TEXT = 'Salary cannot be set below 30,000';
    END IF;
END;
//
DELIMITER ;
What Happens? Any UPDATE that tries to set a salary below 30,000 is rejected with an error, so the rule is enforced automatically by the database.
6. Difference Between Correlated and Non-Correlated Nested Queries
1. Correlated Nested Queries
● The inner query depends on the outer query for each row processed.
● The inner query executes once per outer row, making it less efficient.
Example: find employees who earn more than the average salary of their own department.
SELECT name
FROM Employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM Employees e2
    WHERE e2.department_id = e1.department_id
);
2. Non-Correlated Nested Queries
● The inner query is independent of the outer query and executes only once.
Example: find employees who earn more than the company’s average salary.
SELECT name
FROM Employees
WHERE salary > (
    SELECT AVG(salary)
    FROM Employees
);
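The difference between the two forms is easy to see by running both with SQLite; the sample data below is invented so that the company-wide and per-department averages select different rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (name TEXT, salary REAL, department_id INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)",
                 [("Asha", 70000, 1), ("Bala", 40000, 1),
                  ("Chitra", 61000, 2), ("Deepak", 59000, 2)])

# Non-correlated: the inner query runs once (company average = 57,500)
non_corr = conn.execute(
    "SELECT name FROM Employees WHERE salary > "
    "(SELECT AVG(salary) FROM Employees)").fetchall()

# Correlated: the inner query is re-evaluated per outer row, comparing
# each employee against their own department's average
corr = conn.execute(
    "SELECT name FROM Employees e1 WHERE salary > "
    "(SELECT AVG(salary) FROM Employees e2 "
    " WHERE e2.department_id = e1.department_id)").fetchall()
```

Here Deepak beats the company average but not his department's, so the two queries return different result sets.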
Thank You