Chapter 5
This chapter delves into the core concepts related to how data records are stored and accessed on
secondary storage devices, focusing on various file organizations and technologies that optimize
record management. Key topics include how records are stored in various ways (e.g., unordered,
ordered, hashed), strategies to speed up disk access, and how systems handle data efficiently.
5.1 Introduction
This section introduces the concept of record storage and explains the importance of file
organization. The goal is to ensure that records are stored in a manner that allows for efficient
access, insertion, deletion, and updating operations. This is a foundational topic for
understanding how database management systems (DBMS) handle data.
Secondary storage refers to devices like hard drives (HDDs), solid-state drives (SSDs), and
magnetic tapes, which provide non-volatile storage. Unlike primary memory (RAM), secondary
storage offers larger capacities at a lower cost but with slower access speeds. Various secondary
storage devices are discussed here in terms of their performance characteristics, capacities, and
their role in storing large amounts of data in a persistent manner.
Buffering refers to the use of main memory (RAM) to temporarily hold disk blocks as they are
transferred between secondary storage and the programs that process them. This section explains the
importance of buffer management for reducing disk access times. It discusses techniques like
buffer pools, replacement algorithms (such as Least Recently Used (LRU)), and how buffering
helps to improve the overall performance of disk I/O operations.
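To make the buffering idea concrete, here is a minimal, illustrative sketch of an LRU buffer pool in Python. The class name BufferPool, the capacity of three frames, and the read_block_from_disk callback are assumptions made for this example; a real DBMS buffer manager would also track pin counts and dirty pages and write modified blocks back to disk before eviction.

```python
from collections import OrderedDict

class BufferPool:
    """A toy LRU buffer pool: keeps at most `capacity` disk blocks in memory."""

    def __init__(self, capacity, read_block_from_disk):
        self.capacity = capacity
        self.read_block_from_disk = read_block_from_disk  # callable: block_id -> block data
        self.frames = OrderedDict()                       # block_id -> cached block data

    def get_block(self, block_id):
        if block_id in self.frames:
            # Buffer hit: mark the block as most recently used.
            self.frames.move_to_end(block_id)
            return self.frames[block_id]
        # Buffer miss: evict the least recently used block if the pool is full,
        # then read the requested block from secondary storage.
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)
        block = self.read_block_from_disk(block_id)
        self.frames[block_id] = block
        return block

# A dictionary stands in for secondary storage in this example.
disk = {i: f"contents of block {i}" for i in range(100)}
pool = BufferPool(capacity=3, read_block_from_disk=disk.__getitem__)
for block_id in (1, 2, 3, 1, 4):   # the request for block 4 evicts block 2 (the LRU block)
    pool.get_block(block_id)
```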
This section focuses on how file records are stored physically on disks. The organization of files
impacts how quickly records can be retrieved or updated. Key points include:
The physical layout of records on a disk (e.g., contiguous, linked, or scattered across
sectors).
File fragmentation and the strategies to minimize or eliminate it.
How file systems are designed to handle access to records based on their layout.
This section explains the operations that can be performed on files stored on secondary storage,
such as reading, writing, searching, inserting, deleting, and updating records.
In a database system, files stored on secondary storage (like hard drives or SSDs) hold large
volumes of data. To effectively manage and manipulate this data, several operations are
performed on these files. These operations impact performance, storage efficiency, and the
overall system's responsiveness. Below are the common file operations and the trade-offs
between different file organizations.
1. Reading and Writing Records
Reading: When reading records from a file, the system accesses the file stored on
secondary storage. The efficiency of reading depends on the file organization. For
example, sequential files require scanning the entire file for records, while indexed or
hashed files can provide direct access to records.
o Sequential Files: Best for reading large amounts of data in a specific order (e.g.,
scanning all records), but inefficient for random access.
o Indexed Files: Provide faster access for specific records based on indexed key values.
o Hash Files: Allow for constant-time access to records based on a hash key.
Writing: When writing a record to a file, the system must identify the correct location to
store the new record, which depends on the file organization.
o Sequential Files: Records are written at the end of the file, and writing is straightforward
but can be inefficient if updates are frequent.
o Indexed Files: Writing involves updating both the index and the file, which may be more
complex but allows for faster access.
o Hash Files: Records are written to the location determined by the hash function, and if
collisions occur, overflow handling must be addressed.
2. Searching for Records Using a Primary Key
Searching for records using a primary key is a common operation, and its efficiency is greatly
influenced by the file organization.
Sequential Files: Searching requires a linear scan of the entire file, which is slow for large files and
acceptable only when the file is small or when most records must be read anyway; if the file is kept
sorted on the search key, binary search can be used instead.
Indexed Files: Provide logarithmic search time (binary search) for primary keys in the index,
improving search performance significantly over sequential files. However, maintaining the
index can add overhead.
Hash Files: Offer constant-time search access (O(1)) for primary keys using a hash function. This
is ideal for applications where fast, direct access to records is required.
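The following sketch contrasts these three access patterns using in-memory stand-ins: a plain list for a sequential (unordered) file, a sorted key list for an index, and a dictionary for a hash table. The sample records and all names are illustrative assumptions.

```python
import bisect

records = [(17, "Ana"), (3, "Ben"), (42, "Caro"), (8, "Dev")]   # (key, data) pairs

def sequential_search(recs, key):
    # Unordered (sequential) file: linear scan, O(n).
    for k, data in recs:
        if k == key:
            return data
    return None

# Indexed file: a sorted list of keys plays the role of the index; lookup is O(log n).
sorted_recs = sorted(records)
sorted_keys = [k for k, _ in sorted_recs]

def indexed_search(key):
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_recs[i][1]
    return None

# Hash file: a dictionary plays the role of the hash table; average lookup is O(1).
hash_table = {k: data for k, data in records}

def hashed_search(key):
    return hash_table.get(key)

print(sequential_search(records, 42), indexed_search(42), hashed_search(42))
# -> Caro Caro Caro
```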
3. Inserting New Records
Sequential Files: Inserting records is simple when appending to the end of the file. However, if
records must be inserted in sorted order, the entire file may need to be reorganized, which can
be slow.
Indexed Files: When a record is inserted, it may require updating both the primary data file and
the index. This can be more complex than sequential insertion but allows for efficient retrieval.
Hash Files: New records are inserted into the location determined by the hash function. If a
collision occurs, overflow blocks are used, or the hash table may need resizing.
4. Deleting and Updating Records
Both deletion and updating can be more challenging in file organizations that must maintain
auxiliary access structures for consistency (such as indexes or hash directories).
Sequential Files: Deletion involves finding the record and then shifting subsequent records to fill
the gap. This can be inefficient, especially for large files. Updates often require rewriting the
entire record.
Indexed Files: Deletion or updates require removing or modifying the entry in both the data file
and the index, which can increase overhead.
Hash Files: Deleting records typically involves marking them as deleted (using a deletion marker)
and possibly reorganizing the hash table or overflow blocks. Updating a record may require
deleting and reinserting the record if its key changes.
The choice of file organization affects the efficiency of these operations. Each approach has its
pros and cons, depending on the use case.
Sequential Files:
o Pros: Simple, low overhead for writing data in order.
o Cons: Slow search, insertion, and deletion. Poor random access performance.
o Best For: Applications where records are processed in bulk or in a specific order (e.g.,
log files).
Indexed Files:
o Pros: Fast search, especially with secondary keys. Suitable for a mix of read and write
operations.
o Cons: Requires overhead for maintaining the index, and insertions/deletions can be
costly.
o Best For: Databases with frequent lookups and moderate updates or inserts (e.g.,
customer databases).
Hash Files:
o Pros: Very fast search, especially for equality-based queries. Efficient for direct access.
o Cons: Not suited for range queries. Collisions may require handling overflow blocks or
rehashing.
o Best For: Systems with frequent direct lookups (e.g., fast key-value storage).
Performance summary: sequential files are slow for search and random access; indexed files offer
moderate performance with some index-maintenance overhead; hash files are fast for direct access
but not good for range queries.
Heap files store records in no particular order, with new records being appended to the end of the
file. This organization is simple and efficient for insert-heavy workloads but not ideal for
searches, as every record might need to be scanned sequentially.
A heap file (also called an unordered file) is one of the simplest and most basic types of file
organization used in database systems. In a heap file, records are stored in the file in the order in
which they are inserted, without any particular arrangement or structure. New records are added
to the end of the file. This kind of file organization is used when data does not need to be
accessed in any specific order or when the efficiency of searching or retrieving records is not a
primary concern.
Heap files are particularly useful for scenarios where data is collected or stored for future use, or
when the system performs batch operations where the order of records doesn’t matter.
Characteristics of Heap Files
1. Unordered: The records are inserted sequentially as they come. There is no sorting or indexing
of the records.
2. Efficient Insertion: Since records are always added to the end of the file, inserting new records is
very efficient.
3. Slow Search: Searching for a record, particularly using a search condition, is slow. A linear search
must be performed because the records are not in any particular order (no indexing or sorting).
4. No Duplicate Control: There is no built-in mechanism for preventing duplicates or managing unique
records; however, a primary key can be enforced at the application or DBMS level for uniqueness.
5. Storage Efficiency: The file structure is simple, and it does not require overhead for indexing or
other organizational structures.
A heap file is a collection of blocks (also called pages) on secondary storage (e.g., disk), and
each block holds one or more records. The records are not ordered in any particular fashion
within each block. The records are simply added to the file one after another.
Inserting a new record into a heap file is straightforward. The record is simply appended at the
end of the file, either in the current block (if there is space) or in a new block.
Step-by-step Process:
1. Identify the last block (or page) of the heap file.
2. If there is space in the block, place the new record at the end of the block.
3. If the block is full, create a new block and insert the record there.
Searching for a record in a heap file involves scanning the entire file (or blocks) from the
beginning to the end until the desired record is found. This is called a linear search, and its time
complexity is O(n), where n is the number of records in the file. For a large dataset, this can be
highly inefficient.
Step-by-step Process:
1. Start from the first block in the heap file.
2. Read the block into memory.
3. Search through the records within the block for the target record.
4. If the record is found, return it; otherwise, move on to the next block.
5. Repeat this process until the record is found or the entire file is searched.
Reorganization: Over time, as records are deleted, the blocks may have unused space. To
reclaim this space, the heap file might need to be reorganized, which involves:
1. Scanning the file to find deleted or empty spaces.
2. Compacting the file by moving records to fill gaps and remove fragmented space.
This process can be expensive and may require temporarily locking the file.
Updating a record in a heap file typically involves the following steps:
1. Find the Record: The record to be updated is located by performing a linear search through the
blocks.
2. Delete the Record: The old record is either physically deleted or marked as deleted.
3. Insert the Updated Record: The new record is inserted into the file (typically at the end).
Let’s illustrate heap file operations using a database storing records of students with attributes
such as student_id, name, and age.
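Below is a minimal sketch of the heap-file operations just described (append-only insertion, linear search, deletion markers, and delete-then-reinsert updates) for the student records mentioned. An in-memory list of small blocks stands in for disk blocks; the class name, the block size of three records, and the field layout are illustrative assumptions.

```python
BLOCK_SIZE = 3   # records per block (illustrative; real blocks hold many records)

class StudentHeapFile:
    def __init__(self):
        self.blocks = [[]]                     # list of blocks, each a list of records

    def insert(self, student_id, name, age):
        # Append at the end: use the last block if it has room, else start a new block.
        if len(self.blocks[-1]) >= BLOCK_SIZE:
            self.blocks.append([])
        self.blocks[-1].append({"student_id": student_id, "name": name,
                                "age": age, "deleted": False})

    def search(self, student_id):
        # Linear search: scan block by block, record by record -> O(n).
        for block in self.blocks:
            for rec in block:
                if not rec["deleted"] and rec["student_id"] == student_id:
                    return rec
        return None

    def delete(self, student_id):
        # Mark the record with a deletion marker instead of physically removing it.
        rec = self.search(student_id)
        if rec is not None:
            rec["deleted"] = True
            return True
        return False

    def update(self, student_id, **changes):
        # Delete-then-reinsert, as described above for heap files.
        rec = self.search(student_id)
        if rec is None:
            return False
        self.delete(student_id)
        updated = {**rec, **changes, "deleted": False}
        self.insert(updated["student_id"], updated["name"], updated["age"])
        return True

heap = StudentHeapFile()
heap.insert(101, "Alice", 20)
heap.insert(102, "Bob", 22)
print(heap.search(102))      # found by scanning from the first block
heap.update(102, age=23)     # old record is marked deleted, new one appended
heap.delete(101)             # record 101 is now skipped by searches
```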
Heap files, or unordered record files, represent one of the simplest and most fundamental ways
of organizing data on disk. In this structure, records are stored in the file in the order in which
they are inserted, with no special arrangement or sorting. This type of organization is typically
used when records need to be appended or when additional access methods, such as secondary
indexes, are applied for faster retrieval.
1. Insertion of Records:
o New records are inserted at the end of the file.
o The process involves copying the last block in the file into memory, adding the new
record to it, and then writing the updated block back to disk.
o The address of the last disk block is stored in the file header, so new records can always
be appended efficiently.
2. Deletion Markers:
o Instead of physically removing records, a deletion marker (a special flag or byte) can be
set to mark a record as deleted.
o The deletion marker indicates whether a record is valid or deleted, allowing searches to
ignore deleted records.
o Over time, as records are deleted, the file may become fragmented, necessitating
reorganization to reclaim unused space.
o Reorganization can be done by copying records to new blocks, effectively packing the
file to remove gaps caused by deleted records.
3. Modification of Records:
o Modifying a fixed-length record does not cause major problems, as the record will still
occupy the same amount of space in the file.
o For variable-length records, modifications may require deleting the original record and
inserting a new one. This is because the modified record may no longer fit in the space
allocated for the original record.
Spanned vs. Unspanned Heap Files:
Spanned Organization: In spanned heap files, a record can span across multiple blocks if it
doesn't fit in a single block.
Unspanned Organization: In unspanned heap files, records are kept within a single block,
meaning that if a record is too large to fit in a block, it cannot be stored in the file unless it is
divided or resized.
Performance Considerations:
1. Efficiency of Insertion:
o Insertion is efficient since new records are simply appended to the end of the file, and
this process does not require searching or rearranging other records.
2. Inefficiency of Searching:
o Searching is slow due to the need for sequential scans, making heap files less efficient
for read-heavy workloads where frequent searches are required.
o Heap files are often used in conjunction with secondary indexes to mitigate this
inefficiency.
3. Reorganization Overhead:
o Periodic reorganization to reclaim deleted space can be costly in terms of time and
resources.
o In the worst case, if not done periodically, heap files can become fragmented, causing
significant inefficiency.
Efficient Data Insertion: Heap files are ideal when data is inserted frequently and there is no
immediate need to retrieve or query the data in any particular order.
Temporary Storage: Heap files are often used to temporarily store data before it is processed or
reorganized.
Secondary Indexes: Although the heap file itself does not provide efficient searching, it can be
used alongside secondary indexes to allow faster retrieval of specific records.
Batch Processing: For applications where data is inserted in batches and processed later (e.g.,
ETL processes), heap files are suitable due to their simple insertion mechanism.
In contrast to heap files, sorted files maintain records in a predefined order, usually based on a
primary key. This organization improves search performance (using binary search, for
example) but makes insertion and deletion more complex and slower due to the need to maintain
the order.
Sorted files are an organization method where records are stored in a specific sorted order
based on one or more key fields. This contrasts with heap files, where records are stored in the
order of insertion. Sorted files are particularly useful when you need to perform frequent
searches, range queries, or require ordered data.
1. Insertion of Records:
o Inserting a new record into a sorted file requires finding the correct position for the
record to maintain the sorted order.
o This can be done by:
Searching for the correct position where the new record should go (using
binary search or linear search).
Shifting records to make room for the new record, which may involve rewriting
part of the file.
o Cost of Insertion: Inserting a new record is typically more expensive than in heap files
because it may require rearranging existing records and maintaining the sort order,
which can involve shifting a significant number of records.
2. Modification of Records:
o Modifying a record may require:
1. Deleting the original record.
2. Inserting the modified record in the appropriate position to maintain the sorted
order.
o If the modified record’s key changes, it may need to be moved to a new location in the
file.
Spanned vs. Unspanned Sorted Files:
Spanned Sorted Files: In a spanned sorted file, records can span across multiple blocks if they
are too large to fit in a single block.
Unspanned Sorted Files: In an unspanned sorted file, records are stored entirely within a single
block, meaning each record must fit within the allocated block size.
Advantages of Sorted Files:
1. Efficient Search:
o The primary advantage of sorted files is the ability to perform fast searches, particularly
with binary search. This makes searching for a specific record, or a range of records,
much faster than in heap files.
o Sorted files are highly efficient for queries that involve range conditions, such as
retrieving all records with a key value between X and Y.
Disadvantages of Sorted Files:
1. Expensive Insertions:
o Inserting a record requires finding the correct position in the sorted order, which often
requires a binary search (O(log n)) followed by shifting records to make room for the
new record. This is generally more expensive than appending a record to the end of a
heap file.
o For large files with frequent insertions, the cost of maintaining the sorted order can
become significant.
2. Expensive Deletions:
o Deleting a record requires searching for it and shifting subsequent records, which can be
inefficient, especially if many records are deleted.
3. Potential Fragmentation:
o Over time, if many records are deleted and inserted, the file may become fragmented,
and periodic reorganization might be needed to maintain performance.
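To make the trade-off between fast search and expensive ordered insertion concrete, here is a minimal sketch that uses a sorted in-memory list as a stand-in for the ordered blocks of a sorted file. The sample keys and the use of Python's bisect module are illustrative choices, not part of any particular DBMS.

```python
import bisect

keys = [5, 12, 23, 37, 44, 58]                  # ordering-field values, kept sorted
records = ["r5", "r12", "r23", "r37", "r44", "r58"]

def search(key):
    # Binary search on the ordering field: O(log n) comparisons.
    i = bisect.bisect_left(keys, key)
    return records[i] if i < len(keys) and keys[i] == key else None

def range_query(low, high):
    # Find the first key >= low, then read sequentially until a key > high.
    i = bisect.bisect_left(keys, low)
    j = bisect.bisect_right(keys, high)
    return records[i:j]

def insert(key, record):
    # Find the correct position (O(log n)), then shift everything after it (O(n)),
    # mirroring the block rewrites needed to keep a sorted file in order.
    i = bisect.bisect_left(keys, key)
    keys.insert(i, key)
    records.insert(i, record)

print(search(37))            # -> 'r37'
print(range_query(10, 45))   # -> ['r12', 'r23', 'r37', 'r44']
insert(30, "r30")            # lands between 23 and 37; later entries shift right
```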
Range Queries: Sorted files are highly efficient for applications that require frequent
range queries. For example, retrieving all transactions for a specific time period (e.g., all
records with dates between January 1 and January 31).
Sequential Access: When you need to process records in a sorted order, sorted files offer
an easy and efficient way to access records one by one in order.
Indexing: Sorted files are often used as the basis for primary indexes (where the file is
sorted by the primary key) or secondary indexes (where an auxiliary file is sorted by a
non-primary key) to speed up searches.
Batch Processing with Sorted Output: Sorted files can be used when large datasets
need to be processed and output in sorted order, especially in applications like ETL
(Extract, Transform, Load) pipelines.
Performance Considerations:
1. Insertion Performance: The insertion of records in sorted files can be slow, especially if
you need to maintain sorted order in large datasets. This requires finding the correct spot
for the new record and possibly shifting other records.
2. Search Performance: Sorted files provide very fast search performance due to binary
search and can efficiently handle large datasets for range queries.
3. Space Utilization: Sorted files tend to have more efficient space utilization compared to
heap files, as there is no need for periodic reorganization unless there is significant
fragmentation due to deletions.
4. Reorganization Overhead: While sorted files are more stable than heap files with
respect to fragmentation, they can still suffer from inefficiencies if there are frequent
deletions and updates that cause gaps. Reorganization may be necessary in such cases.
How sorted files are maintained and their performance in search operations.
Trade-offs between search speed and the overhead of maintaining order during inserts
and deletes.
Use cases where sorting is beneficial, such as systems that require efficient range queries.
Hashing is a technique used to efficiently locate a record using a hash function that maps keys to
specific positions in a table (hash table). This technique allows for constant-time average lookups
but can suffer from collisions, where two keys map to the same hash value.
Hashing is a powerful technique used for fast record retrieval in a file system. It maps the search
key (hash field) to a specific location in memory or disk through a hash function, providing fast
access to records based on the search key. This section covers the different techniques of hashing
used for internal hashing (in-memory structures), external hashing (disk storage), and
dynamic file expansion (handling growing data sets).
Internal hashing is a technique used for fast access to records stored in main memory based on
a specific key field. This technique is typically implemented using hash tables, which allow
constant-time access to records under ideal conditions. In a database context, internal hashing
can be used to quickly locate records using an equality condition on the key field, such as finding
a record with a specific ID or email address.
As a simple example, consider a hash table of size 10 that uses the hash function h(K) = K mod 10.
Records with IDs 205 (Bob) and 405 (David) both hash to index 5. In the chaining method, index 5
therefore holds a linked list (chain) of these two records. When searching for a record with ID = 405,
we first hash it to index 5 and then search through the chain at that index.
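A minimal sketch of internal hashing with chaining, matching the example above: a table of size 10, the hash function h(K) = K mod 10, and record IDs 205 and 405 that collide at index 5. The function and variable names are illustrative.

```python
TABLE_SIZE = 10
table = [[] for _ in range(TABLE_SIZE)]      # each slot holds a chain (list) of records

def h(key):
    return key % TABLE_SIZE                  # simple division-remainder hash function

def insert(key, record):
    table[h(key)].append((key, record))      # a collision just lengthens the chain

def search(key):
    for k, record in table[h(key)]:          # hash to the slot, then scan its chain
        if k == key:
            return record
    return None

insert(205, "Bob")
insert(405, "David")                         # collides with 205: both hash to index 5
print(table[5])                              # [(205, 'Bob'), (405, 'David')]
print(search(405))                           # hash to 5, walk the chain -> 'David'
```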
How Internal Hashing Works:
1. Hash Function:
o A hash function, denoted as h, is applied to the hash field value (e.g., the key field) to
map it to an index or location in a hash table.
o For a given record, the hash function takes the value of the key field and computes an
index (hash value) that corresponds to the record’s location in the hash table.
2. Hash Table:
o The hash table is an array where each index corresponds to a potential record location,
and records are stored based on their computed hash values.
o Each index in the hash table points to a disk block (or a bucket in memory), which can
contain one or more records.
o When a record is inserted, the hash function calculates the hash value for the key, and
the record is stored at the corresponding index (bucket) in the table.
3. Search Efficiency:
o Searches for records are efficient because once the hash value is computed, it directly
points to the location where the record is stored. Typically, searching for a record
involves only accessing a single location in the hash table and possibly searching within
that block or bucket.
o The average search time is O(1), assuming there are minimal collisions (multiple keys
hashing to the same location).
4. Collisions:
o Collisions occur when two or more keys map to the same hash value (i.e., two records
hash to the same index in the table). There are several strategies for handling collisions:
Chaining: This involves storing multiple records in the same bucket (linked list)
at the same index. When a collision occurs, the new record is added to the list.
Open Addressing: This involves finding an alternate location (slot) in the table if
a collision occurs, using techniques like linear probing, quadratic probing, or
double hashing.
5. Dynamic Resizing:
o Dynamic resizing of the hash table may be needed to maintain performance as the
number of records grows. This involves rehashing the table to a larger size to reduce the
load factor (number of records per bucket).
o Typically, the hash table is resized when the load factor exceeds a threshold (e.g., 75%),
and all records are rehashed to the new larger table.
Advantages of Internal Hashing:
Fast access: Since hashing directly maps a key to a location in memory, searches are typically
very fast, with average-case constant time complexity O(1).
Efficient for equality searches: Hashing is ideal for situations where records are accessed using
an equality condition on a single field (e.g., searching for a record with a specific key).
Disadvantages of Internal Hashing:
Collision handling: When many keys hash to the same location, chains or probe sequences grow and
search times degrade.
No ordered access: Hashing does not keep records in key order, so range queries and sorted scans
are inefficient.
Resizing cost: When the load factor grows too high, the table must be enlarged and all records
rehashed.
External hashing refers to the use of hashing techniques for managing data stored on disk. Unlike
internal hashing, where records are kept in memory, external hashing is designed to work with
large datasets that exceed memory limits and reside in disk files.
1. Search Efficiency:
o When searching for a record, the hash function is applied to the key, and the
corresponding bucket (disk block) is accessed.
o Hash bucket accesses involve reading the appropriate disk block into memory. If there
are multiple records in the bucket (due to collisions), a linear search or linked list
traversal within the bucket may be needed.
2. Dynamic Hashing:
o External hashing often involves dynamic hashing to handle growing datasets and
maintain efficiency as the number of records increases.
o Extendable hashing and linear hashing are two popular dynamic hashing techniques for
handling overflow and expanding the hash table on disk as more records are added.
Advantages of External Hashing:
Fast searches for large files: External hashing is effective for very large datasets that need to be
stored on disk. It minimizes disk I/O by mapping a record’s key to a disk block directly.
Efficient space usage: Hashing minimizes the need for complex indexing structures like B-trees
while still offering efficient retrieval.
Disadvantages of External Hashing:
Handling collisions: Like internal hashing, external hashing faces challenges with collisions, and
strategies like chaining or overflow blocks may require extra memory or disk space.
Rehashing and resizing: As the file grows, resizing the hash table (rehashing) can be an
expensive operation.
External Hashing is a technique used for managing large datasets that are stored on disk rather
than in memory. Unlike internal hashing (which works in main memory), external hashing is
designed to handle records that do not fit into memory all at once. It provides efficient access to
records stored in external files by using a hash function to map record keys to specific disk
blocks (buckets). This allows for faster record retrieval compared to linear or binary search
methods, which would require scanning large files or indexes sequentially.
In this section, we will explore how external hashing works and how it is applied in database
systems with examples.
1. Hash Function:
o A hash function maps a key field (hash key) from a record to an index that
corresponds to a disk block (bucket).
o The hash function used in external hashing is typically the same as in internal
hashing, for example h(K) = K mod M, where:
K is the key value of a record.
M is the number of buckets (disk blocks).
The result of h(K) gives an index within the range of available disk
blocks, pointing to a specific bucket where the record is stored.
2. Disk Blocks (Buckets):
o Buckets are disk blocks, and each bucket can store multiple records. Each bucket
is essentially a fixed-size block on the disk that can hold several records.
o When a record is inserted into the file, the hash function computes the hash value,
and the record is placed in the corresponding bucket (disk block).
3. Overflow:
o If a bucket is full or if multiple records hash to the same bucket (a collision),
overflow techniques are used. One common technique for handling overflow is
overflow blocks.
o When a bucket becomes full, additional overflow blocks (also called spill-over
blocks) are allocated to store extra records.
4. Accessing Records:
o To search for a record, the database uses the hash function to compute the index
(bucket) of the record. Then, it reads the corresponding disk block into memory
and searches for the record in that block.
o If the bucket contains multiple records (due to collisions), a linear search or a
linked list traversal is performed within that bucket to locate the desired record.
Let’s walk through an example of external hashing for a database system with disk-based
records identified by a RecordID. Suppose the file uses M = 4 buckets and the hash function
h(K) = K mod 4, so each record is stored in the bucket given by its RecordID modulo 4.
Handling Collisions with Overflow Blocks:
If bucket 1 becomes full, additional overflow blocks can be used to store the extra records.
These overflow blocks are linked to the main bucket.
For example:
When bucket 1 becomes full after inserting records 101, 205, and 405 (all of which hash to
bucket 1), an overflow block is allocated.
A later record such as 501, which also hashes to bucket 1 but cannot fit in the main bucket, is
placed in the overflow block.
The system will continue checking for available space in overflow blocks if more records hash to
bucket 1.
Thus, when searching for a record with ID 501, the system first checks bucket 1, finds that it is not
there, and then looks into Overflow Block 1.
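Here is a minimal sketch of the bucket-and-overflow scheme just described, assuming M = 4 buckets, a bucket capacity of three records, and in-memory lists standing in for the disk blocks and overflow blocks; all of these parameters and names are illustrative.

```python
M = 4                  # number of primary buckets (disk blocks)
BUCKET_CAPACITY = 3    # records per block (illustrative assumption)

buckets = [{"records": [], "overflow": []} for _ in range(M)]

def h(key):
    return key % M

def insert(key, record):
    b = buckets[h(key)]
    if len(b["records"]) < BUCKET_CAPACITY:
        b["records"].append((key, record))        # fits in the primary bucket
    else:
        b["overflow"].append((key, record))       # placed in a linked overflow block

def search(key):
    b = buckets[h(key)]
    for k, record in b["records"] + b["overflow"]:   # primary bucket first, then overflow
        if k == key:
            return record
    return None

for rid in (101, 205, 405, 501):        # all hash to bucket 1 (rid mod 4 == 1)
    insert(rid, f"record-{rid}")

print([k for k, _ in buckets[1]["records"]])   # [101, 205, 405] fill the primary bucket
print([k for k, _ in buckets[1]["overflow"]])  # [501] spills into the overflow block
print(search(501))                             # checks bucket 1, then its overflow block
```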
1. Efficient for Large Datasets: External hashing is very efficient for large datasets that exceed
memory size because it avoids sequential scanning of large files.
2. Constant-Time Access: For a well-distributed hash function and minimal collisions, external
hashing can provide constant-time access for searching, inserting, and deleting records.
3. Minimal Disk I/O: By mapping keys directly to disk blocks, external hashing reduces the need for
multiple disk accesses compared to sequential searches.
Dynamic file expansion in hashing is necessary when the number of records in a file grows
significantly over time. As the file expands, traditional hashing schemes may become inefficient
due to increased collisions, overflow, or the need to rehash large datasets.
1. Extendable Hashing:
o In extendable hashing, the hash table grows dynamically by using a directory that points
to buckets.
o The hash function is applied to a key, and a prefix of the resulting hash value (its first d
bits) is used to index the directory; the number of bits d used for the prefix is called the
depth of the directory.
o When a bucket overflows and the directory can no longer distinguish the colliding keys, the
directory doubles in size and the overflowing bucket's records are redistributed into new buckets.
o Extendable hashing therefore allows the hash table to grow as needed without rehashing the
entire table at once (a sketch of this directory-doubling idea appears after the dynamic-hashing
discussion below).
2. Linear Hashing:
o Linear hashing is a dynamic hashing technique where the hash table expands
incrementally.
o In linear hashing, the hash function is applied, and records are stored in buckets based
on the hash value. However, the hash table is designed to expand by adding new
buckets in a linear fashion.
o When a bucket becomes full, a new bucket is added, and records are redistributed
between the old and new buckets.
o Unlike extendable hashing, linear hashing does not require the entire hash table to be
rehashed at once, making it more efficient for large datasets with frequent insertions.
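The following is a simplified, illustrative sketch of linear hashing, assuming in-memory lists stand in for disk buckets, a bucket capacity of two records, and a split triggered whenever an insertion overflows a bucket (one of several possible split policies). It is meant only to show how the split pointer, the two hash functions h_level and h_(level+1), and incremental bucket growth interact.

```python
BUCKET_CAPACITY = 2   # records per bucket; an illustrative choice

class LinearHashFile:
    """Toy linear hashing: in-memory lists stand in for disk buckets."""

    def __init__(self, initial_buckets=2):
        self.n0 = initial_buckets          # number of buckets at level 0
        self.level = 0                     # current hashing level
        self.split_ptr = 0                 # next bucket in line to be split
        self.buckets = [[] for _ in range(initial_buckets)]

    def _bucket_index(self, key):
        # Use h_level = key mod (n0 * 2^level); if that bucket was already
        # split in the current round, use h_(level+1) instead.
        idx = key % (self.n0 * 2 ** self.level)
        if idx < self.split_ptr:
            idx = key % (self.n0 * 2 ** (self.level + 1))
        return idx

    def insert(self, key, record):
        idx = self._bucket_index(key)
        self.buckets[idx].append((key, record))
        # Records beyond the capacity simply stay in the same list here,
        # playing the role of an overflow chain; an overflow triggers a split.
        if len(self.buckets[idx]) > BUCKET_CAPACITY:
            self._split()

    def _split(self):
        # Split the bucket pointed to by split_ptr; its new "image" bucket is
        # appended at index n0 * 2^level + split_ptr.
        old = self.split_ptr
        self.buckets.append([])
        new_modulus = self.n0 * 2 ** (self.level + 1)
        records, self.buckets[old] = self.buckets[old], []
        for key, record in records:                 # redistribute with h_(level+1)
            self.buckets[key % new_modulus].append((key, record))
        self.split_ptr += 1
        if self.split_ptr == self.n0 * 2 ** self.level:
            self.level += 1                          # a full round of splits is done
            self.split_ptr = 0

    def search(self, key):
        for k, record in self.buckets[self._bucket_index(key)]:
            if k == key:
                return record
        return None

f = LinearHashFile()
for k in (3, 7, 11, 15, 4, 8, 12):
    f.insert(k, f"record-{k}")
print(f.search(11))     # -> 'record-11'
print(len(f.buckets))   # the file has grown one bucket at a time (5 buckets here)
```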
Advantages of Dynamic Hashing Techniques:
Graceful growth: The file can expand as records are inserted without rehashing the entire table at
once.
Sustained performance: Access times remain close to constant as the dataset grows, since overflow
is handled by splitting buckets or expanding the directory incrementally.
Disadvantages of Dynamic Hashing Techniques:
Complexity: Implementing dynamic hashing techniques like extendable and linear hashing can
be more complex than static hashing methods.
Potential for Rehashing: While dynamic techniques minimize rehashing, they do not eliminate it
entirely, and rehashing can still be costly when the directory or hash table expands significantly.
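For comparison with the linear hashing sketch above, here is a simplified sketch of extendable hashing's directory-doubling idea. Python's built-in hash stands in for the hash function and the bucket capacity of two records is an illustrative assumption; a real implementation would operate on disk blocks and handle more edge cases.

```python
class ExtendibleHashFile:
    """Toy extendable hashing: a directory of 2^global_depth entries points to buckets."""

    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 1
        # Two initial buckets, each knowing its own local depth.
        self.directory = [{"local_depth": 1, "records": {}},
                          {"local_depth": 1, "records": {}}]

    def _dir_index(self, key):
        # The low global_depth bits of the hash value index the directory.
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key, record):
        bucket = self.directory[self._dir_index(key)]
        bucket["records"][key] = record
        while len(bucket["records"]) > self.capacity:
            self._split(bucket)
            bucket = self.directory[self._dir_index(key)]

    def _split(self, bucket):
        if bucket["local_depth"] == self.global_depth:
            # The directory doubles; new entries mirror the existing ones.
            self.directory += self.directory[:]
            self.global_depth += 1
        bucket["local_depth"] += 1
        new_bucket = {"local_depth": bucket["local_depth"], "records": {}}
        distinguishing_bit = 1 << (bucket["local_depth"] - 1)
        # Directory entries whose index has the distinguishing bit set now point
        # to the new bucket instead of the old one.
        for i, b in enumerate(self.directory):
            if b is bucket and i & distinguishing_bit:
                self.directory[i] = new_bucket
        # Redistribute the old bucket's records between the two buckets.
        old_records, bucket["records"] = bucket["records"], {}
        for k, r in old_records.items():
            self.directory[self._dir_index(k)]["records"][k] = r

    def search(self, key):
        return self.directory[self._dir_index(key)]["records"].get(key)

ehf = ExtendibleHashFile(bucket_capacity=2)
for k in (5, 9, 13, 2, 7):
    ehf.insert(k, f"record-{k}")
print(ehf.search(13))      # -> 'record-13'
print(ehf.global_depth)    # the directory has doubled as buckets overflowed
```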
Summary
Internal hashing is used for fast in-memory search and retrieval based on a hash key, typically
providing constant time complexity (O(1)) under ideal conditions.
External hashing is used for large datasets stored on disk, mapping record keys to disk blocks
(buckets) and handling collisions with overflow blocks; dynamic schemes such as extendable and
linear hashing let the file expand gracefully as it grows.
This section covers other methods of file organization that don't fit into the previous categories.
These might include:
B-trees and B+ trees: Used for indexing large datasets, providing efficient search,
insertion, and deletion operations.
Bitmap indexes: Used in situations where a limited number of distinct values exist for a
field.
Clustered files: Storing related records together on disk to reduce access time for related
data.
5.11 Summary
The summary recaps the primary concepts discussed in the chapter, emphasizing the importance
of selecting the right file organization method depending on the workload: heap files suit
insert-heavy workloads, sorted files suit range queries and ordered processing, and hashed files
suit fast equality lookups on a key.
Review questions test the reader’s understanding of the concepts covered in the chapter.
Exercises
B Tree
B+ Tree