
Chapter 5: Record Storage and Primary File Organizations

This chapter delves into the core concepts related to how data records are stored and accessed on
secondary storage devices, focusing on various file organizations and technologies that optimize
record management. Key topics include how records are stored in various ways (e.g., unordered,
ordered, hashed), strategies to speed up disk access, and how systems handle data efficiently.

5.1 Introduction

This section introduces the concept of record storage and explains the importance of file
organization. The goal is to ensure that records are stored in a manner that allows for efficient
access, insertion, deletion, and updating operations. This is a foundational topic for
understanding how database management systems (DBMS) handle data.

5.2 Secondary Storage Devices

Secondary storage refers to devices like hard drives (HDDs), solid-state drives (SSDs), and
magnetic tapes, which provide non-volatile storage. Unlike primary memory (RAM), secondary
storage offers larger capacities at a lower cost but with slower access speeds. Various secondary
storage devices are discussed here in terms of their performance characteristics, capacities, and
their role in storing large amounts of data in a persistent manner.

5.3 Parallelizing Disk Access Using RAID Technology

RAID (Redundant Array of Independent Disks) technology is introduced as a method to
enhance the performance and reliability of disk storage systems. RAID combines multiple
physical disk drives into one or more logical units, improving data access speed, fault tolerance,
and storage capacity. Different RAID levels (e.g., RAID 0, RAID 1, RAID 5) offer trade-offs
between redundancy, performance, and cost. RAID can significantly speed up data retrieval by
distributing the load across multiple disks.
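
To make the striping idea concrete, the short sketch below (Python, purely illustrative; the function name raid0_location and the 4-disk array are assumptions, not part of any real RAID controller interface) shows how RAID 0 maps a logical block number round-robin across disks, so that consecutive blocks can be read in parallel.

```python
# Illustrative sketch of RAID 0 striping (not a real controller interface):
# logical block i is stored on disk (i mod N) at stripe position (i div N).

def raid0_location(logical_block, num_disks):
    """Return (disk_index, stripe_index) for a logical block under RAID 0."""
    disk = logical_block % num_disks      # round-robin across the disks
    stripe = logical_block // num_disks   # position of the block on that disk
    return disk, stripe

# Logical blocks 0-7 on a 4-disk array: blocks 0-3 fall on different disks,
# so a request spanning them can be serviced by all four disks in parallel.
for block in range(8):
    print(block, raid0_location(block, num_disks=4))
```

RAID 1, by contrast, mirrors each block on a second disk for fault tolerance, and RAID 5 adds a rotating parity block per stripe, trading some write performance for redundancy.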
5.4 Buffering of Blocks

Buffering refers to the use of memory (RAM) to temporarily store disk blocks as they are
transferred between the secondary storage and the system’s CPU. This section explains the
importance of buffer management for reducing disk access times. It discusses techniques like
buffer pools, replacement algorithms (such as Least Recently Used (LRU)), and how buffering
helps to improve the overall performance of disk I/O operations.
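
As a rough illustration of these ideas, the sketch below implements a tiny LRU buffer pool in Python. It is a simplification under stated assumptions: block contents are returned by a caller-supplied read_block_from_disk function (a hypothetical name), and pinning and dirty-page write-back are omitted.

```python
from collections import OrderedDict

# Minimal sketch of an LRU buffer pool. read_block_from_disk is a hypothetical
# caller-supplied function; pinning and dirty-page write-back are omitted.

class LRUBufferPool:
    def __init__(self, capacity, read_block_from_disk):
        self.capacity = capacity
        self.read_block_from_disk = read_block_from_disk
        self.frames = OrderedDict()   # block_id -> block contents, oldest first

    def get_block(self, block_id):
        if block_id in self.frames:            # buffer hit: no disk I/O needed
            self.frames.move_to_end(block_id)  # mark as most recently used
            return self.frames[block_id]
        block = self.read_block_from_disk(block_id)   # buffer miss: disk I/O
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)    # evict the least recently used
        self.frames[block_id] = block
        return block

# Example: a pool of two frames over a fake "disk".
pool = LRUBufferPool(2, read_block_from_disk=lambda b: f"<contents of block {b}>")
pool.get_block(1); pool.get_block(2); pool.get_block(1)
pool.get_block(3)                      # evicts block 2 (least recently used)
print(list(pool.frames))               # [1, 3]
```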

5.5 Placing File Records on Disk

This section focuses on how file records are stored physically on disks. The organization of files
impacts how quickly records can be retrieved or updated. Key points include:

 The physical layout of records on a disk (e.g., contiguous, linked, or scattered across
sectors).
 File fragmentation and the strategies to minimize or eliminate it.
 How file systems are designed to handle access to records based on their layout.

5.6 Operations on Files

This section explains the operations that can be performed on files stored on secondary storage,
such as:

 Reading and writing records to a file.


 Searching for records using primary keys.
 Inserting new records.
 Deleting or updating existing records.
 The trade-offs between different file organizations and how they impact the efficiency of
these operations.

Operations on Files in Database Systems

In a database system, files stored on secondary storage (like hard drives or SSDs) hold large
volumes of data. To effectively manage and manipulate this data, several operations are
performed on these files. These operations impact performance, storage efficiency, and the
overall system's responsiveness. Below are the common file operations and the trade-offs
between different file organizations.

1. Reading and Writing Records to a File

 Reading: When reading records from a file, the system accesses the file stored on
secondary storage. The efficiency of reading depends on the file organization. For
example, sequential files require scanning the entire file for records, while indexed or
hashed files can provide direct access to records.
o Sequential Files: Best for reading large amounts of data in a specific order (e.g.,
scanning all records), but inefficient for random access.
o Indexed Files: Provide faster access for specific records based on indexed key values.
o Hash Files: Allow for constant-time access to records based on a hash key.

 Writing: When writing a record to a file, the system must identify the correct location to
store the new record, which depends on the file organization.
o Sequential Files: Records are written at the end of the file, and writing is straightforward
but can be inefficient if updates are frequent.
o Indexed Files: Writing involves updating both the index and the file, which may be more
complex but allows for faster access.
o Hash Files: Records are written to the location determined by the hash function, and if
collisions occur, overflow handling must be addressed.

2. Searching for Records Using Primary Keys

Searching for records using a primary key is a common operation, and its efficiency is greatly
influenced by the file organization.

 Sequential Files: Searching generally requires a linear scan of the file, which is inefficient for large
files. If the file is kept sorted on the search key, binary search can reduce the cost, but lookups are
still slower than with indexed or hashed organizations.
 Indexed Files: Provide logarithmic search time (binary search) for primary keys in the index,
improving search performance significantly over sequential files. However, maintaining the
index can add overhead.
 Hash Files: Offer constant-time search access (O(1)) for primary keys using a hash function. This
is ideal for applications where fast, direct access to records is required.
3. Inserting New Records

Inserting records into a file varies based on the file organization:

 Sequential Files: Inserting records is simple when appending to the end of the file. However, if
records must be inserted in sorted order, the entire file may need to be reorganized, which can
be slow.
 Indexed Files: When a record is inserted, it may require updating both the primary data file and
the index. This can be more complex than sequential insertion but allows for efficient retrieval.
 Hash Files: New records are inserted into the location determined by the hash function. If a
collision occurs, overflow blocks are used, or the hash table may need resizing.

4. Deleting or Updating Existing Records

Both deletion and updating records can be more challenging when dealing with file organizations
that require maintaining consistency (like indexes or hash functions).

 Sequential Files: Deletion involves finding the record and then shifting subsequent records to fill
the gap. This can be inefficient, especially for large files. Updates often require rewriting the
entire record.
 Indexed Files: Deletion or updates require removing or modifying the entry in both the data file
and the index, which can increase overhead.
 Hash Files: Deleting records typically involves marking them as deleted (using a deletion marker)
and possibly reorganizing the hash table or overflow blocks. Updating a record may require
deleting and reinserting the record if its key changes.

5. Trade-offs Between Different File Organizations

The choice of file organization affects the efficiency of these operations. Each approach has its
pros and cons, depending on the use case.

 Sequential Files:
o Pros: Simple, low overhead for writing data in order.
o Cons: Slow search, insertion, and deletion. Poor random access performance.
o Best For: Applications where records are processed in bulk or in a specific order (e.g.,
log files).

 Indexed Files:
o Pros: Fast search, especially with secondary keys. Suitable for a mix of read and write
operations.
o Cons: Requires overhead for maintaining the index, and insertions/deletions can be
costly.
o Best For: Databases with frequent lookups and moderate updates or inserts (e.g.,
customer databases).

 Hash Files:
o Pros: Very fast search, especially for equality-based queries. Efficient for direct access.
o Cons: Not suited for range queries. Collisions may require handling overflow blocks or
rehashing.
o Best For: Systems with frequent direct lookups (e.g., fast key-value storage).

Summary of Operations and File Organizations

Operation      | Sequential Files                     | Indexed Files                              | Hash Files
Read           | Slow (linear scan)                   | Fast (binary search on index)              | Fast (constant time, hash lookup)
Write          | Simple, appending is fast            | Requires updating both index and data file | Simple (write to hashed location)
Insert         | Simple (append, or sorting required) | Requires updating index and data file      | Simple (hash function or overflow handling)
Delete/Update  | Slow (shifting required)             | Requires index and data file update        | Requires overflow management
Performance    | Slow for search and random access    | Moderate, index overhead                   | Fast for direct access, not good for range queries
Best Use Case  | Sequential, ordered processing       | Frequent searches with moderate updates    | Direct access based on key field

5.7 Files of Unordered Records (Heap Files)

Heap files store records in no particular order, with new records being appended to the end of the
file. This organization is simple and efficient for insert-heavy workloads but not ideal for
searches, as every record might need to be scanned sequentially.

Files of Unordered Records (Heap Files)

A heap file (also called an unordered file) is one of the simplest and most basic types of file
organization used in database systems. In a heap file, records are stored in the file in the order in
which they are inserted, without any particular arrangement or structure. New records are added
to the end of the file. This kind of file organization is used when data does not need to be
accessed in any specific order or when the efficiency of searching or retrieving records is not a
primary concern.

Heap files are particularly useful for scenarios where data is collected or stored for future use, or
when the system performs batch operations where the order of records doesn’t matter.
Characteristics of Heap Files

1. Unordered: The records are inserted sequentially as they come. There is no sorting or indexing
of the records.
2. Efficient Insertion: Since records are always added to the end of the file, inserting new records is
very efficient.
3. Slow Search: Searching for a record, particularly using a search condition, is slow. A linear search
must be performed because the records are not in any particular order (no indexing or sorting).
4. No Uniqueness Enforcement: The file itself has no mechanism for preventing duplicates or enforcing
unique records; a primary key constraint can be applied at the application level if uniqueness is required.
5. Storage Efficiency: The file structure is simple, and it does not require overhead for indexing or
other organizational structures.

File Organization of Heap Files

A heap file is a collection of blocks (also called pages) on secondary storage (e.g., disk), and
each block holds one or more records. The records are not ordered in any particular fashion
within each block. The records are simply added to the file one after another.

Inserting a Record in a Heap File

Inserting a new record into a heap file is straightforward. The record is simply appended at the
end of the file, either in the current block (if there is space) or in a new block.

 Step-by-step Process:
1. Identify the last block (or page) of the heap file.
2. If there is space in the block, place the new record at the end of the block.
3. If the block is full, create a new block and insert the record there.

Searching for a Record in a Heap File

Searching for a record in a heap file involves scanning the entire file (or blocks) from the
beginning to the end until the desired record is found. This is called a linear search, and its time
complexity is O(n), where n is the number of records in the file. For a large dataset, this can be
highly inefficient.

 Step-by-step Process:
1. Start from the first block in the heap file.
2. Read the block into memory.
3. Search through the records within the block for the target record.
4. If the record is found, return it; otherwise, move on to the next block.
5. Repeat this process until the record is found or the entire file is searched.

Deleting a Record from a Heap File

Deleting a record from a heap file involves two main steps:


1. Finding the Record: First, the record needs to be found, which again involves scanning the file.
2. Removing the Record: Once the record is found, it can be deleted by either:
o Physically deleting the record, which may leave empty space in the block (leading to
fragmentation).
o Marking the record as deleted (by using a deletion marker), so it can be ignored during
searches but the space can be reused later. However, the file may need periodic
reorganization to reclaim space from deleted records.

 Reorganization: Over time, as records are deleted, the blocks may have unused space. To
reclaim this space, the heap file might need to be reorganized, which involves:
1. Scanning the file to find deleted or empty spaces.
2. Compacting the file by moving records to fill gaps and remove fragmented space.
3. This process can be expensive and may require temporarily locking the file.

Updating a Record in a Heap File

Updating a record in a heap file is somewhat similar to deletion and insertion:

1. Find the Record: The record to be updated is located by performing a linear search through the
blocks.
2. Delete the Record: The old record is either physically deleted or marked as deleted.
3. Insert the Updated Record: The new record is inserted into the file (typically at the end).

Example of Heap File Usage

Let’s illustrate heap file operations using a database storing records of students with attributes
such as student_id, name, and age.
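
A minimal sketch of these operations is shown below (Python lists stand in for disk blocks; the block capacity, student IDs, and names are illustrative assumptions, not data from the chapter). It demonstrates appending on insert, the linear scan on search, and the use of a deletion marker.

```python
# Hedged sketch of heap-file operations; in-memory lists stand in for disk blocks.
# All names (HeapFile, BLOCK_CAPACITY, the sample students) are illustrative.

BLOCK_CAPACITY = 3  # records per block (assumption for the example)

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # list of blocks; each block is a list of records

    def insert(self, record):
        # Append to the last block, or start a new block if the last one is full.
        if len(self.blocks[-1]) >= BLOCK_CAPACITY:
            self.blocks.append([])
        self.blocks[-1].append(dict(record, deleted=False))

    def search(self, student_id):
        # Linear search: up to b block reads in the worst case.
        for block in self.blocks:
            for rec in block:
                if not rec["deleted"] and rec["student_id"] == student_id:
                    return rec
        return None

    def delete(self, student_id):
        # Use a deletion marker rather than physically removing the record.
        rec = self.search(student_id)
        if rec is not None:
            rec["deleted"] = True

heap = HeapFile()
heap.insert({"student_id": 101, "name": "Alice", "age": 20})
heap.insert({"student_id": 205, "name": "Bob", "age": 22})
heap.insert({"student_id": 307, "name": "Carol", "age": 21})
heap.insert({"student_id": 405, "name": "David", "age": 23})
print(heap.search(205))   # found after scanning from the first block
heap.delete(205)
print(heap.search(205))   # None: the record is now marked as deleted
```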

Heap Files (Unordered Records) - Key Characteristics and Concepts

Heap files, or unordered record files, represent one of the simplest and most fundamental ways
of organizing data on disk. In this structure, records are stored in the file in the order in which
they are inserted, with no special arrangement or sorting. This type of organization is typically
used when records need to be appended or when additional access methods, such as secondary
indexes, are applied for faster retrieval.

Key Operations on Heap Files:

1. Insertion of Records:
o New records are inserted at the end of the file.
o The process involves copying the last block in the file into memory, adding the new
record to it, and then writing the updated block back to disk.
o The address of the last disk block is stored in the file header, so new records can always
be appended efficiently.

2. Search for Records:


o Searching is performed sequentially, one block at a time, from the beginning to
the end of the file.
o This leads to linear search complexity, making it inefficient for large files.
o In the worst case, if no index exists, a search could require checking every block
in the file, making the process O(b), where b is the number of blocks in the file.
o Average case: If only one record satisfies the search condition, the program may
need to search through half of the file blocks (on average), so the complexity is
about O(b/2).
3. Deletion of Records:
o To delete a record, you need to:
1. Locate the record's block.
2. Copy the block into memory.
3. Remove the record from memory.
4. Rewrite the modified block back to disk.
o Problems: Deleting records leaves empty spaces within blocks, leading to wasted
storage.

4. Deletion Markers:
o Instead of physically removing records, a deletion marker (a special flag or byte) can be
set to mark a record as deleted.
o The deletion marker indicates whether a record is valid or deleted, allowing searches to
ignore deleted records.
o Over time, as records are deleted, the file may become fragmented, necessitating
reorganization to reclaim unused space.
o Reorganization can be done by copying records to new blocks, effectively packing the
file to remove gaps caused by deleted records.

5. Reorganization of Heap Files:


o Reorganization typically involves reading all blocks sequentially, removing deleted
records, and repacking the blocks to reclaim space.
o This is needed after large numbers of deletions, as the leftover "gaps" from deleted
records can cause inefficiencies in space utilization.

6. Modification of Records:
o Modifying a fixed-length record does not cause major problems, as the record will still
occupy the same amount of space in the file.
o For variable-length records, modifications may require deleting the original record and
inserting a new one. This is because the modified record may no longer fit in the space
allocated for the original record.
Spanned vs. Unspanned Heap Files:

 Spanned Organization: In spanned heap files, a record can span across multiple blocks if it
doesn't fit in a single block.
 Unspanned Organization: In unspanned heap files, records are kept within a single block,
meaning that if a record is too large to fit in a block, it cannot be stored in the file unless it is
divided or resized.

Performance Considerations:

1. Efficiency of Insertion:
o Insertion is efficient since new records are simply appended to the end of the file, and
this process does not require searching or rearranging other records.

2. Inefficiency of Searching:
o Searching is slow due to the need for sequential scans, making heap files less efficient
for read-heavy workloads where frequent searches are required.
o Heap files are often used in conjunction with secondary indexes to mitigate this
inefficiency.

3. Reorganization Overhead:
o Periodic reorganization to reclaim deleted space can be costly in terms of time and
resources.
o In the worst case, if not done periodically, heap files can become fragmented, causing
significant inefficiency.

4. Sorting and Ordering:


o Since heap files do not maintain any ordering of records, if there is a need to read
records in a specific order (such as by a certain field), the file must be sorted externally.
o External sorting algorithms are used for this purpose, though sorting large files can be
computationally expensive.

Use Cases for Heap Files:

 Efficient Data Insertion: Heap files are ideal when data is inserted frequently and there is no
immediate need to retrieve or query the data in any particular order.
 Temporary Storage: Heap files are often used to temporarily store data before it is processed or
reorganized.
 Secondary Indexes: Although the heap file itself does not provide efficient searching, it can be
used alongside secondary indexes to allow faster retrieval of specific records.
 Batch Processing: For applications where data is inserted in batches and processed later (e.g.,
ETL processes), heap files are suitable due to their simple insertion mechanism.

The section outlines:


 Advantages: Fast insertion; in-place updates of fixed-length records are simple once the record is located.
 Disadvantages: Slow search and retrieval.
 Use cases: Situations where records are inserted frequently and searches are infrequent.

5.8 Files of Ordered Records (Sorted Files)

In contrast to heap files, sorted files maintain records in a predefined order, usually based on a
primary key. This organization improves search performance (using binary search, for
example) but makes insertion and deletion more complex and slower due to the need to maintain
the order.

Files of Ordered Records (Sorted Files) - Key Characteristics and Concepts

Sorted files are an organization method where records are stored in a specific sorted order
based on one or more key fields. This contrasts with heap files, where records are stored in the
order of insertion. Sorted files are particularly useful when you need to perform frequent
searches, range queries, or require ordered data.

Key Characteristics of Sorted Files:

1. Records Are Ordered:


o In a sorted file, records are kept in sorted order based on a chosen field (or key) when
they are inserted.
o Sorting is typically done using an ascending or descending order based on the key's
values (e.g., numerical values, dates, or alphabetic strings).

2. Insertion of Records:
o Inserting a new record into a sorted file requires finding the correct position for the
record to maintain the sorted order.
o This can be done by:
 Searching for the correct position where the new record should go (using
binary search or linear search).
 Shifting records to make room for the new record, which may involve rewriting
part of the file.
o Cost of Insertion: Inserting a new record is typically more expensive than in heap files
because it may require rearranging existing records and maintaining the sort order,
which can involve shifting a significant number of records.

3. Searching for Records:


o Searching in a sorted file is very efficient compared to heap files because records are in
sorted order.
o Binary search can be used to quickly locate records or ranges of records, making the
time complexity of searches O(log n), where n is the number of records in the file.
o Sorted files are especially useful for range queries, where you need to retrieve all
records between two key values.
4. Deletion of Records:
o Deleting a record from a sorted file involves:

1. Finding the record to be deleted (using binary search or linear search).


2. Removing the record.
3. Shifting the subsequent records to close the gap and maintain the sorted order.
o Deletion is somewhat expensive because it requires shifting records, especially if there
are many deletions.

5. Modification of Records:
o Modifying a record may require:
1. Deleting the original record.
2. Inserting the modified record in the appropriate position to maintain the sorted
order.
o If the modified record’s key changes, it may need to be moved to a new location in the
file.

6. Reorganization of Sorted Files:


o Sorted files do not require frequent reorganization as heap files do. The structure is
inherently sorted, so only minimal shifting of records is needed when inserting, deleting,
or updating records.
o However, if the sort order needs to be changed or if there’s significant fragmentation
(e.g., many deletions or updates), re-sorting or reorganizing the file may become
necessary.

Spanned vs. Unspanned Sorted Files:

 Spanned Sorted Files: In a spanned sorted file, records can span across multiple blocks if they
are too large to fit in a single block.
 Unspanned Sorted Files: In an unspanned sorted file, records are stored entirely within a single
block, meaning each record must fit within the allocated block size.

Advantages of Sorted Files:

1. Efficient Search:
o The primary advantage of sorted files is the ability to perform fast searches, particularly
with binary search. This makes searching for a specific record, or a range of records,
much faster than in heap files.
o Sorted files are highly efficient for queries that involve range conditions, such as
retrieving all records with a key value between X and Y.

2. Efficient Range Queries:


o Sorted files are ideal for applications where data needs to be accessed in a specific
order or where range queries (e.g., "find all records with a key between 50 and 100")
are common.
o Once the starting point for the range is found, records can be retrieved sequentially,
which is faster than searching a heap file for each individual record.
3. Efficient Sorting:
o If you need the records to be in sorted order, a sorted file already provides this without
needing a separate sorting step, as is the case with heap files.

4. Lower Overhead for Sequential Access:


o Since records are sorted, sequential access (e.g., reading all records in order) is
straightforward and efficient.

Disadvantages of Sorted Files:

1. Expensive Insertions:
o Inserting a record requires finding the correct position in the sorted order, which often
requires a binary search (O(log n)) followed by shifting records to make room for the
new record. This is generally more expensive than appending a record to the end of a
heap file.
o For large files with frequent insertions, the cost of maintaining the sorted order can
become significant.

2. Expensive Deletions:
o Deleting a record requires searching for it and shifting subsequent records, which can be
inefficient, especially if many records are deleted.

3. Modifications Require Deletion and Insertion:


o When a record is modified (especially if the key changes), it is essentially deleted and
reinserted, which can be inefficient.

4. Potential Fragmentation:
o Over time, if many records are deleted and inserted, the file may become fragmented,
and periodic reorganization might be needed to maintain performance.

Use Cases for Sorted Files:

 Range Queries: Sorted files are highly efficient for applications that require frequent
range queries. For example, retrieving all transactions for a specific time period (e.g., all
records with dates between January 1 and January 31).
 Sequential Access: When you need to process records in a sorted order, sorted files offer
an easy and efficient way to access records one by one in order.
 Indexing: Sorted files are often used as the basis for primary indexes (where the file is
sorted by the primary key) or secondary indexes (where an auxiliary file is sorted by a
non-primary key) to speed up searches.
 Batch Processing with Sorted Output: Sorted files can be used when large datasets
need to be processed and output in sorted order, especially in applications like ETL
(Extract, Transform, Load) pipelines.
Performance Considerations:

1. Insertion Performance: The insertion of records in sorted files can be slow, especially if
you need to maintain sorted order in large datasets. This requires finding the correct spot
for the new record and possibly shifting other records.
2. Search Performance: Sorted files provide very fast search performance due to binary
search and can efficiently handle large datasets for range queries.
3. Space Utilization: Sorted files tend to have more efficient space utilization compared to
heap files, as there is no need for periodic reorganization unless there is significant
fragmentation due to deletions.
4. Reorganization Overhead: While sorted files are more stable than heap files with
respect to fragmentation, they can still suffer from inefficiencies if there are frequent
deletions and updates that cause gaps. Reorganization may be necessary in such cases.
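
The sketch below illustrates, under simplifying assumptions, the access patterns just described: binary search for equality lookups, a single positioning step followed by sequential reads for range queries, and the shifting cost of keeping insertions in order. It keeps the "file" in memory as a sorted Python list; a real sorted file would apply the same ideas to sorted disk blocks.

```python
import bisect

# Hedged sketch of ordered-file behavior using sorted in-memory lists of keys
# and records; names and sample key values are illustrative.

class SortedFile:
    def __init__(self):
        self.keys = []
        self.records = []

    def insert(self, key, record):
        # Find the position with binary search, then shift later entries to the
        # right -- the expensive part of maintaining sorted order.
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        self.records.insert(pos, record)

    def search(self, key):
        # Binary search: O(log n) comparisons instead of a linear scan.
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.records[pos]
        return None

    def range_query(self, low, high):
        # Locate the start of the range once, then read sequentially.
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        return self.records[lo:hi]

f = SortedFile()
for k in (50, 10, 80, 30, 60):
    f.insert(k, {"key": k})
print(f.search(30))            # found via binary search
print(f.range_query(30, 60))   # all records with 30 <= key <= 60
```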

This section discusses:

 How sorted files are maintained and their performance in search operations.
 Trade-offs between search speed and the overhead of maintaining order during inserts
and deletes.
 Use cases where sorting is beneficial, such as systems that require efficient range queries.

5.9 Hashing Techniques

Hashing is a technique used to efficiently locate a record using a hash function that maps keys to
specific positions in a table (hash table). This technique allows for constant-time average lookups
but can suffer from collisions, where two keys map to the same hash value. The section covers:

 Static hashing: The size of the hash table is fixed.


 Dynamic hashing: The hash table can grow or shrink dynamically.
 Collision resolution: Techniques like chaining and open addressing to handle hash
collisions.
 When hashing is used most effectively (e.g., for point queries with fixed search keys).

5.9 Hashing Techniques

Hashing is a powerful technique used for fast record retrieval in a file system. It maps the search
key (hash field) to a specific location in memory or disk through a hash function, providing fast
access to records based on the search key. This section covers the different techniques of hashing
used for internal hashing (in-memory structures), external hashing (disk storage), and
dynamic file expansion (handling growing data sets).

5.9.1 Internal Hashing


Internal hashing refers to the use of hashing techniques in memory (main memory) for fast
access to records. This technique is often used in situations where a large number of records need
to be accessed and manipulated quickly, based on a specific key field. Examples of such
structures are hash tables and hash maps. For internal files, hashing is typically implemented as
a hash table through the use of an array of records. Suppose that the array index range is from 0
to M - 1 (Figure 05.10a); then we have M slots whose addresses correspond to the array indexes.
We choose a hash function that transforms the hash field value into an integer between 0 and M -
1. One common hash function is the h(K) = K mod M function, which returns the remainder of
an integer hash field value K after division by M; this value is then used for the record address.

Internal Hashing in Database Systems

Internal hashing is a technique used for fast access to records stored in main memory based on
a specific key field. This technique is typically implemented using hash tables, which allow
constant-time access to records under ideal conditions. In a database context, internal hashing
can be used to quickly locate records using an equality condition on the key field, such as finding
a record with a specific ID or email address.

How Internal Hashing Works in a Database

1. Hash Table Structure:


o The records are stored in an array (or hash table), where each slot in the array can
potentially hold a record or a pointer to a list of records.
o Suppose the size of the hash table is M, so the index range of the array is from 0
to M - 1. Each index corresponds to a potential location in memory where records
can be stored.
2. Hash Function:
o A hash function is applied to the key field (also called the hash field) of each
record. The goal of the hash function is to map the key value to an index in the
array that corresponds to a position in the hash table.
o A simple and commonly used hash function is h(K) = K mod M, where:
 K is the key value of the record.
 M is the size of the hash table (i.e., the number of slots in the array).
 The result of h(K) is an integer between 0 and M - 1, which gives the
index of the array slot where the record should be stored.
3. Inserting a Record:
o To insert a record with key K into the hash table:
 Compute the hash value using the hash function: h(K) = K mod M.
 Place the record in the slot corresponding to the index h(K).
4. Searching for a Record:
o To find a record with key K in the hash table:
 Compute the hash value: h(K) = K mod M.
 Access the array index h(K) to locate the record.
5. Handling Collisions:
o Collisions occur when two or more records hash to the same array index. Since
the hash function may produce the same value for different keys, special
techniques are used to handle these collisions:
 Chaining: This involves storing a list (or linked list) of records at each
array index. If multiple records hash to the same index, they are chained
together in a list.
 Open Addressing: In this approach, when a collision occurs, the system
searches for the next available slot in the array. This search can be done
using methods like linear probing, quadratic probing, or double
hashing.
6. Example in Database Systems: Suppose the database stores student records whose IDs include
205 (Bob) and 405 (David), and the hash table has M = 10 slots with h(K) = K mod 10. Both keys
hash to index 5 (205 mod 10 = 5 and 405 mod 10 = 5), so they collide.
In the chaining method, index 5 therefore holds a linked list (or chain) of records: 205 (Bob) and
405 (David). When searching for a record with ID = 405, we first hash it to index 5 and then
search through the chain at that index, as the sketch below illustrates.
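
A minimal sketch of this chaining example, assuming M = 10 and h(K) = K mod 10 as above (the record contents are illustrative):

```python
# Internal hashing with chaining: M = 10 slots, h(K) = K mod 10.
# IDs 205 and 405 collide at index 5 and are kept in the same chain.

M = 10
table = [[] for _ in range(M)]   # each slot holds a chain (list) of records

def h(key):
    return key % M

def insert(key, record):
    table[h(key)].append((key, record))   # append to the chain at index h(K)

def search(key):
    for k, record in table[h(key)]:       # scan only the one chain
        if k == key:
            return record
    return None

insert(205, {"id": 205, "name": "Bob"})
insert(405, {"id": 405, "name": "David"})
print(h(205), h(405))   # both 5: a collision handled by chaining
print(search(405))      # walks the chain at index 5 and finds David
```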
How Internal Hashing Works:

1. Hash Function:
o A hash function, denoted as h, is applied to the hash field value (e.g., the key field) to
map it to an index or location in a hash table.
o For a given record, the hash function takes the value of the key field and computes an
index (hash value) that corresponds to the record’s location in the hash table.

2. Hash Table:
o The hash table is an array where each index corresponds to a potential record location,
and records are stored based on their computed hash values.
o Each index in the hash table points to a disk block (or a bucket in memory), which can
contain one or more records.
o When a record is inserted, the hash function calculates the hash value for the key, and
the record is stored at the corresponding index (bucket) in the table.

3. Search Efficiency:
o Searches for records are efficient because once the hash value is computed, it directly
points to the location where the record is stored. Typically, searching for a record
involves only accessing a single location in the hash table and possibly searching within
that block or bucket.
o The average search time is O(1), assuming there are minimal collisions (multiple keys
hashing to the same location).

4. Collisions:
o Collisions occur when two or more keys map to the same hash value (i.e., two records
hash to the same index in the table). There are several strategies for handling collisions:
 Chaining: This involves storing multiple records in the same bucket (linked list)
at the same index. When a collision occurs, the new record is added to the list.
 Open Addressing: This involves finding an alternate location (slot) in the table if
a collision occurs, using techniques like linear probing, quadratic probing, or
double hashing.

5. Dynamic Resizing:
o Dynamic resizing of the hash table may be needed to maintain performance as the
number of records grows. This involves rehashing the table to a larger size to reduce the
load factor (number of records per bucket).
o Typically, the hash table is resized when the load factor exceeds a threshold (e.g., 75%),
and all records are rehashed to the new larger table.
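
The following sketch illustrates open addressing with linear probing and load-factor-driven resizing, as described in points 4 and 5 above. The class name, the 0.75 threshold, and the doubling policy are assumptions for illustration; deletion (which needs tombstones under open addressing) is omitted.

```python
# Hedged sketch of open addressing (linear probing) with resizing when the
# load factor exceeds a threshold; names and thresholds are illustrative.

class LinearProbingTable:
    def __init__(self, size=8, max_load=0.75):
        self.size = size
        self.max_load = max_load
        self.count = 0
        self.slots = [None] * size   # each slot holds (key, record) or None

    def _probe(self, key):
        # Start at h(K) = K mod M and step forward until a free slot or the key.
        i = key % self.size
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size
        return i

    def insert(self, key, record):
        if (self.count + 1) / self.size > self.max_load:
            self._resize(self.size * 2)      # rehash into a larger table
        i = self._probe(key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, record)

    def search(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry else None

    def _resize(self, new_size):
        old = [e for e in self.slots if e is not None]
        self.size, self.count, self.slots = new_size, 0, [None] * new_size
        for key, record in old:
            self.insert(key, record)

t = LinearProbingTable()
for k in (3, 11, 19):        # 11 and 19 collide with 3 modulo 8 and probe forward
    t.insert(k, {"key": k})
print(t.search(19))          # found after probing slots 3, 4, then 5
```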

Advantages of Internal Hashing:

 Fast access: Since hashing directly maps a key to a location in memory, searches are typically
very fast, with average-case constant time complexity O(1).
 Efficient for equality searches: Hashing is ideal for situations where records are accessed using
an equality condition on a single field (e.g., searching for a record with a specific key).
Disadvantages of Internal Hashing:

 Collisions: Collisions degrade performance, requiring extra memory or processing to handle them.
 Memory overhead: The hash table may require significant memory, particularly if a large
number of empty slots or buckets are needed to minimize collisions.
 Resizing overhead: When resizing the table, rehashing all the records can be an expensive
operation.

5.9.2 External Hashing for Disk Files

External hashing refers to the use of hashing techniques for managing data stored on disk. Unlike
internal hashing, where records are kept in memory, external hashing is designed to work with
large datasets that exceed memory limits and reside in disk files.

How External Hashing Works:

1. Disk-Based Hash Table:


o External hashing extends the concept of internal hashing to disk files. Instead of using an
in-memory hash table, records are stored in disk blocks or buckets.
o A hash function computes a hash value for each record's key field and maps it to a
particular bucket (or disk block). The disk block at the computed hash value stores the
record.

2. Buckets (Disk Blocks):


o Each bucket is a disk block that can hold one or more records. Multiple records can be
stored in a bucket using chaining or open addressing, similar to how collisions are
handled in internal hashing.
o The size of the bucket depends on the system's block size and the records' size.
Typically, a bucket holds several records.

3. Handling Large Datasets:


o External hashing is designed to handle large datasets that don’t fit in memory by
utilizing disk blocks. The main advantage is that it allows for efficient disk-based search
while still achieving constant-time complexity (in ideal cases) by accessing a single disk
block.

4. Search Efficiency:
o When searching for a record, the hash function is applied to the key, and the
corresponding bucket (disk block) is accessed.
o Hash bucket accesses involve reading the appropriate disk block into memory. If there
are multiple records in the bucket (due to collisions), a linear search or linked list
traversal within the bucket may be needed.

5. Collision Handling in External Hashing:


o Collisions are handled in the same way as in internal hashing, with two common
techniques:
 Chaining: Each bucket stores a linked list of records. If multiple records hash to
the same bucket, they are linked together.
 Overflow blocks: If a bucket becomes full, additional overflow blocks may be
allocated to store extra records, or the hashing strategy may be adjusted (e.g.,
dynamic hashing).

6. Dynamic Hashing:
o External hashing often involves dynamic hashing to handle growing datasets and
maintain efficiency as the number of records increases.
o Extendable hashing and linear hashing are two popular dynamic hashing techniques for
handling overflow and expanding the hash table on disk as more records are added.

Advantages of External Hashing:

 Fast searches for large files: External hashing is effective for very large datasets that need to be
stored on disk. It minimizes disk I/O by mapping a record’s key to a disk block directly.
 Efficient space usage: Hashing minimizes the need for complex indexing structures like B-trees
while still offering efficient retrieval.

Disadvantages of External Hashing:

 Handling collisions: Like internal hashing, external hashing faces challenges with collisions, and
strategies like chaining or overflow blocks may require extra memory or disk space.
 Rehashing and resizing: As the file grows, resizing the hash table (rehashing) can be an
expensive operation.

5.9.2 External Hashing for Disk Files in Database Systems

External Hashing is a technique used for managing large datasets that are stored on disk rather
than in memory. Unlike internal hashing (which works in main memory), external hashing is
designed to handle records that do not fit into memory all at once. It provides efficient access to
records stored in external files by using a hash function to map record keys to specific disk
blocks (buckets). This allows for faster record retrieval compared to linear or binary search
methods, which would require scanning large files or indexes sequentially.

In this section, we will explore how external hashing works and how it is applied in database
systems with examples.

How External Hashing Works

1. Hash Function:
o A hash function maps a key field (hash key) from a record to an index that
corresponds to a disk block (bucket).
o The hash function used in external hashing is typically the same as in internal
hashing, for example h(K) = K mod M, where:
 K is the key value of a record.
 M is the number of buckets (disk blocks).
 The result of h(K) gives an index within the range of available disk
blocks, pointing to a specific bucket where the record is stored.
2. Disk Blocks (Buckets):
o Buckets are disk blocks, and each bucket can store multiple records. Each bucket
is essentially a fixed-size block on the disk that can hold several records.
o When a record is inserted into the file, the hash function computes the hash value,
and the record is placed in the corresponding bucket (disk block).
3. Overflow:
o If a bucket is full or if multiple records hash to the same bucket (a collision),
overflow techniques are used. One common technique for handling overflow is
overflow blocks.
o When a bucket becomes full, additional overflow blocks (also called spill-over
blocks) are allocated to store extra records.
4. Accessing Records:
o To search for a record, the database uses the hash function to compute the index
(bucket) of the record. Then, it reads the corresponding disk block into memory
and searches for the record in that block.
o If the bucket contains multiple records (due to collisions), a linear search or a
linked list traversal is performed within that bucket to locate the desired record.

Example of External Hashing

Let’s walk through an example of external hashing for a database system with disk-based
records. Suppose the file stores records with RecordIDs 101, 205, 405, and 501, and the hash
table has M = 4 buckets, using h(K) = K mod 4 on the RecordID. All four IDs hash to bucket 1
(101 mod 4 = 205 mod 4 = 405 mod 4 = 501 mod 4 = 1), so they are all directed to the same bucket.
Handling Collisions with Overflow Blocks:

If bucket 1 becomes full, additional overflow blocks can be used to store the extra records.
These overflow blocks are linked to the main bucket.

For example, suppose each bucket (disk block) can hold three records:

 After 101, 205, and 405 are inserted, bucket 1 is full; when 501 also hashes to bucket 1, an overflow
block is allocated and linked to the main bucket.
 Records that hash to bucket 1 but cannot fit in the main bucket (here, 501) are placed in the overflow
block.
 The system will continue checking for available space in overflow blocks if more records hash to
bucket 1.

Thus, when searching for a record with ID 501, the system will check bucket 1, find that it is not there,
and then follow the link to Overflow Block 1, as the sketch below illustrates.
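
A small sketch of this bucket-and-overflow scheme follows, mirroring the example above (M = 4, h(K) = K mod 4, and an assumed bucket capacity of three records); Python lists stand in for the main and overflow disk blocks.

```python
# Hedged sketch of external hashing with overflow blocks; lists simulate
# disk blocks and all names are illustrative.

BUCKET_CAPACITY = 3
M = 4
buckets = [{"records": [], "overflow": []} for _ in range(M)]

def h(record_id):
    return record_id % M

def insert(record_id, data):
    bucket = buckets[h(record_id)]
    if len(bucket["records"]) < BUCKET_CAPACITY:
        bucket["records"].append((record_id, data))    # main bucket
    else:
        bucket["overflow"].append((record_id, data))   # overflow block(s)

def search(record_id):
    bucket = buckets[h(record_id)]
    for rid, data in bucket["records"] + bucket["overflow"]:
        if rid == record_id:
            return data
    return None

for rid in (101, 205, 405, 501):     # all four IDs hash to bucket 1
    insert(rid, {"RecordID": rid})
print(len(buckets[1]["records"]), len(buckets[1]["overflow"]))  # 3 and 1
print(search(501))   # found in the overflow area of bucket 1
```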

Advantages of External Hashing:

1. Efficient for Large Datasets: External hashing is very efficient for large datasets that exceed
memory size because it avoids sequential scanning of large files.
2. Constant-Time Access: For a well-distributed hash function and minimal collisions, external
hashing can provide constant-time access for searching, inserting, and deleting records.
3. Minimal Disk I/O: By mapping keys directly to disk blocks, external hashing reduces the need for
multiple disk accesses compared to sequential searches.

Disadvantages of External Hashing:


1. Handling Collisions: Collisions can still degrade performance, especially when many records
hash to the same bucket. This can lead to longer search times, especially when chaining or
handling large overflow blocks.
2. Overflow Blocks: The need for overflow blocks may lead to inefficient use of disk space and
extra disk I/O.
3. Directory Management: If the number of buckets grows dynamically, managing and resizing the
hash table can be complex and costly.
4. Non-Sequential Access: Since records are scattered across different buckets and overflow
blocks, accessing records sequentially (e.g., to perform range queries) is not as efficient as in
sorted files.

5.9.3 Hashing Techniques That Allow Dynamic File Expansion

Dynamic file expansion in hashing is necessary when the number of records in a file grows
significantly over time. As the file expands, traditional hashing schemes may become inefficient
due to increased collisions, overflow, or the need to rehash large datasets.

Techniques for Dynamic Hashing:

1. Extendable Hashing:
o In extendable hashing, the hash table grows dynamically by using a directory that points
to buckets.
o The hash function is applied to a key and produces a prefix of the key’s hash value. The
number of bits used for the prefix determines the depth of the directory.
o As more records are inserted and the directory becomes full, the directory doubles in
size and the hash values are reallocated to new buckets.
o Extendable hashing allows the hash table to grow as needed without the need to rehash
the entire table at once.

2. Linear Hashing:
o Linear hashing is a dynamic hashing technique where the hash table expands
incrementally.
o In linear hashing, the hash function is applied, and records are stored in buckets based
on the hash value. However, the hash table is designed to expand by adding new
buckets in a linear fashion.
o When a bucket becomes full, a new bucket is added, and records are redistributed
between the old and new buckets.
o Unlike extendable hashing, linear hashing does not require the entire hash table to be
rehashed at once, making it more efficient for large datasets with frequent insertions.
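
The sketch below illustrates the core mechanics of extendable (extendible) hashing described in point 1: a directory of 2^global_depth entries, bucket splits, and directory doubling only when a splitting bucket's local depth equals the global depth. It is an in-memory simplification with illustrative names and a tiny bucket capacity, not a disk-based implementation.

```python
# Hedged sketch of extendible hashing: the directory uses the low-order
# global_depth bits of the key; a full bucket splits, and the directory
# doubles only when the bucket's local depth equals the global depth.

class Bucket:
    def __init__(self, local_depth, capacity=2):
        self.local_depth = local_depth
        self.capacity = capacity
        self.records = {}            # key -> record

class ExtendibleHashFile:
    def __init__(self, bucket_capacity=2):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _index(self, key):
        # Use the low-order global_depth bits of the key as the directory index.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key, record):
        bucket = self.directory[self._index(key)]
        if key in bucket.records or len(bucket.records) < bucket.capacity:
            bucket.records[key] = record
            return
        self._split(bucket)
        self.insert(key, record)     # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory      # double the directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, bucket.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Directory entries whose new distinguishing bit is 1 now point to new_bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        # Redistribute the old bucket's records between the two buckets.
        old = bucket.records
        bucket.records = {}
        for k, r in old.items():
            self.directory[self._index(k)].records[k] = r

    def search(self, key):
        return self.directory[self._index(key)].records.get(key)

f = ExtendibleHashFile(bucket_capacity=2)
for k in (4, 8, 12, 16, 6):          # these insertions force splits and doubling
    f.insert(k, {"key": k})
print(f.global_depth, f.search(12))
```

Linear hashing, by contrast, avoids the directory entirely and splits buckets one at a time in a fixed, predetermined order as the file grows.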
Advantages of Dynamic Hashing Techniques:

 Scalability: These techniques allow hash-based storage systems to dynamically adjust to
growing datasets without sacrificing performance.
 Efficient File Expansion: Dynamic hashing handles the increase in file size while maintaining
quick access and minimal disruption to the system.

Disadvantages of Dynamic Hashing Techniques:

 Complexity: Implementing dynamic hashing techniques like extendable and linear hashing can
be more complex than static hashing methods.
 Potential for Rehashing: While dynamic techniques minimize rehashing, they do not eliminate it
entirely, and rehashing can still be costly when the directory or hash table expands significantly.

Summary

 Internal hashing is used for fast in-memory search and retrieval based on a hash key, typically
providing constant time complexity (O(1)) under ideal conditions.
 External hashing is used for large datasets stored on disk: the hash function maps each record's key to
a disk block (bucket), overflow blocks handle collisions, and dynamic techniques such as extendible and
linear hashing allow the file to expand as records are added.

5.10 Other Primary File Organizations

This section covers other methods of file organization that don't fit into the previous categories.
These might include:

 B-trees and B+ trees: Used for indexing large datasets, providing efficient search,
insertion, and deletion operations.
 Bitmap indexes: Used in situations where a limited number of distinct values exist for a
field.
 Clustered files: Storing related records together on disk to reduce access time for related
data.

5.11 Summary

The summary recaps the primary concepts discussed in the chapter, emphasizing the importance
of selecting the right file organization method depending on the workload. The chapter
highlights:

 The relationship between performance and file organization.


 How the choice of storage method can optimize or hinder operations like searching,
inserting, and deleting records.
 The various trade-offs involved in secondary storage strategies.
Review Questions

Review questions test the reader’s understanding of the concepts covered in the chapter. They
often include:

 The benefits and limitations of different file organizations.


 How specific technologies (like RAID or hashing) improve file management.
 Scenarios requiring decisions about file organization based on system requirements.

Exercises

Exercises help reinforce key concepts, often with practical examples:

 Designing efficient file structures.


 Implementing basic operations on different file organizations.
 Evaluating performance in real-world database systems.
