File Organization in DBMS
Indexing Techniques: B+ Trees: Search, Insert, Delete algorithms, File Organization and
Indexing, Cluster Indexes, Primary and Secondary Indexes, Index Data Structures, Hash-Based
Indexing, Tree-Based Indexing, Comparison of File Organizations, Indexes and Performance
Tuning
A database consists of a huge amount of data. In an RDBMS the data is grouped into tables,
and each table holds related records. A user sees the data in the form of tables, but
physically this data is stored on secondary storage in the form of files.
File – A file is a named collection of related information that is recorded on secondary storage
such as magnetic disks, magnetic tapes and optical disks.
File Operations
Operations on database files can be broadly classified into two categories −
Update Operations
Retrieval Operations
Update operations change the data values by insertion, deletion, or update.
Retrieval operations do not alter the data but retrieve them after optional conditional filtering.
In both types of operations, selection of the target records plays a significant role. Besides
creating and deleting a file, several other operations can be performed on files:
Open − A file can be opened in one of the two modes, read mode or write mode. In read
mode, the operating system does not allow anyone to alter data. In other words, data is read
only. Files opened in read mode can be shared among several entities. Write mode allows
data modification. Files opened in write mode can be read but cannot be shared.
Locate − Every file has a file pointer, which tells the current position where the data is to
be read or written. This pointer can be adjusted accordingly. Using find (seek) operation, it
can be moved forward or backward.
Read − By default, when a file is opened in read mode, the file pointer points to the
beginning of the file. The user can also tell the operating system where to place the file
pointer at the time of opening the file. The data immediately after the file pointer is
read.
Write − The user can open a file in write mode, which enables them to edit its
contents by deletion, insertion, or modification. The file pointer can be positioned at the
time of opening or changed dynamically if the operating system allows it.
Close − This is the most important operation from the operating system’s point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handles associated with the file.
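The open, locate, read, write and close operations above can be sketched with Python's built-in file objects. This is only an illustration: the file name `records.dat` and the fixed 8-byte record width are assumptions made for the example, not part of the text.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "records.dat")

# Write: open in write mode and store three fixed-width 8-byte records.
with open(path, "wb") as f:
    for name in (b"alice", b"bob", b"carol"):
        f.write(name.ljust(8))

# Open in read mode; the file pointer starts at the beginning (offset 0).
f = open(path, "rb")
start = f.tell()

# Locate: move the file pointer forward to the second record (offset 8).
f.seek(8)

# Read: the data immediately after the file pointer is returned.
second = f.read(8).rstrip()

# Close: release the handle and buffers held for this file.
f.close()
os.remove(path)
```

Here `start` is the initial pointer position and `second` holds the record that begins at offset 8.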
To search for, delete, or update a record in heap file organization, we traverse the data
from the beginning of the file until we reach the requested record. Thus, if the database is very
large, searching, deleting, or updating a record takes a lot of time.
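A minimal sketch of this linear scan, assuming the heap file is modelled as an unordered Python list of (key, value) pairs. The function name `heap_search` and the sample records are hypothetical.

```python
def heap_search(records, key):
    """Scan an unordered list of (key, value) records from the start."""
    for position, (record_key, value) in enumerate(records):
        if record_key == key:
            # position + 1 is the number of records examined so far.
            return value, position + 1
    # Not found: every record had to be examined.
    return None, len(records)

# Records sit in arrival order, not in key order.
heap_file = [(42, "D1"), (7, "D2"), (19, "D3"), (3, "D4")]
value, examined = heap_search(heap_file, 19)
```

The number of records examined grows with the file size, which is why the cost becomes noticeable for large files.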
Pros and Cons of Heap File Organization –
Pros –
Fetching and retrieving records is faster than in sequential file organization, but only for
small databases.
When a large amount of data needs to be loaded into the database at one time,
this method of file organization is best suited.
Cons –
Problem of unused memory blocks.
Inefficient for larger databases.
c. Hash File Organization :
Hashing is an efficient technique for directly computing the location of the desired data on
the disk without using an index structure. Data is stored in the data blocks whose address is
generated by a hash function. The location where these records are stored
is called a data block or data bucket.
Data bucket – Data buckets are the storage locations where the records are stored.
A bucket is also considered the unit of storage.
Hash Function – A hash function is a mapping function that maps the set of search
keys to actual record addresses. Generally, the hash function uses the primary key to generate
the hash index, that is, the address of the data block. A hash function can range from a simple
mathematical function to a complex one.
Hash Index – The prefix of an entire hash value is taken as the hash index. Every hash
index has a depth value that signifies how many bits of the hash value are used.
n bits can address 2^n buckets. When all these bits are consumed, the depth
value is increased by one and twice as many buckets are allocated.
i. Static Hashing –
In static hashing, when a search-key value is provided, the hash function always computes the
same address.
For example, if we want to generate an address for STUDENT_ID = 104 using a mod (5) hash
function, it always results in the same bucket address 4 (104 mod 5 = 4). The bucket address
never changes. Hence the number of data buckets for static hashing
remains constant throughout.
Operations –
Insertion – When a new record is inserted into the table, the hash function h generates
a bucket address for the new record based on its hash key K.
Bucket address = h(K)
Searching – When a record needs to be searched, the same hash function is used to
retrieve the bucket address for the record. For example, if we want to retrieve the whole
record for ID 104, and the hash function is mod (5) on that ID, the bucket address
generated is 4. We then go directly to address 4 and retrieve the whole record
for ID 104. Here ID acts as the hash key.
Deletion – To delete a record, we first use the hash function to fetch the
record that is to be deleted. Then we remove the record at that address
in memory.
Updation – The data record that needs to be updated is first located using the hash
function, and then the data record is updated.
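The four operations can be sketched as follows, using key 104 with the mod (5) hash function, for which 104 mod 5 = 4. The bucket layout (one Python dict per bucket, with no capacity limit) is an assumption made to keep the example short. Overflow is ignored here.

```python
NUM_BUCKETS = 5  # fixed for the lifetime of the file (static hashing)

def h(key):
    """Hash function from the text: key mod 5."""
    return key % NUM_BUCKETS

# Assumption: each bucket is a dict keyed by record ID, with no size limit.
buckets = [dict() for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)][key] = record      # bucket address = h(K)

def search(key):
    return buckets[h(key)].get(key)    # same hash function locates the bucket

def update(key, record):
    if key in buckets[h(key)]:         # locate first, then overwrite
        buckets[h(key)][key] = record

def delete(key):
    buckets[h(key)].pop(key, None)     # locate first, then remove

insert(104, "student 104")
address = h(104)
```

Every operation calls the same `h`, so a given key always maps to the same bucket.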
Suppose we want to insert a new record into the file, but the data bucket address generated by
the hash function is not empty, that is, data already exists at that address. This
situation in static hashing is called bucket overflow.
How do we insert data in this case?
There are several methods to overcome this situation. Some commonly used
methods are discussed below:
1. Open Hashing:
In the open hashing method, the next available data block is used to enter the new record,
instead of overwriting the older one. This method is also called linear probing.
For example, D3 is a new record that needs to be inserted, and the hash function generates
the address 105. But that bucket is already full, so the system searches for the next available
data bucket, 123, and assigns D3 to it.
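A small sketch of linear probing, assuming a table of seven single-record slots and a mod (7) hash function. Both are illustrative choices, as are the keys.

```python
def insert_linear(table, key, record):
    """Place record in the first free slot at or after the key's home slot."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size   # wrap around at the end of the table
        if table[slot] is None:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("hash file is full")

table = [None] * 7
first = insert_linear(table, 10, "D1")   # 10 % 7 = 3, slot 3 is free
second = insert_linear(table, 17, "D2")  # 17 % 7 = 3, slot 3 is taken
third = insert_linear(table, 24, "D3")   # 24 % 7 = 3, slots 3 and 4 are taken
```

All three keys hash to slot 3, so the second and third records land in the next free slots.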
2. Closed Hashing –
In the closed hashing method, a new data bucket is allocated with the same address and is
linked after the full data bucket. This method is also known as overflow chaining.
For example, we have to insert a new record D3 into the table. The static hash function
generates the data bucket address 105, but this bucket is too full to store the new data. In
this case a new data bucket is added at the end of data bucket 105 and is linked to it.
The new record D3 is then inserted into the new bucket.
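Overflow chaining for bucket 105 can be sketched as a linked chain of buckets. The capacity of two records per bucket and the class and function names are assumptions for the example.

```python
BUCKET_CAPACITY = 2  # hypothetical number of records per bucket

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None  # link to the next bucket in the chain

def insert_chained(bucket, record):
    """Walk the chain and append an overflow bucket if every one is full."""
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.records.append(record)

def chain_records(bucket):
    """Collect the records of a bucket and all its overflow buckets."""
    out = []
    while bucket is not None:
        out.extend(bucket.records)
        bucket = bucket.overflow
    return out

bucket_105 = Bucket()
for record in ("D1", "D2", "D3"):
    insert_chained(bucket_105, record)
```

D1 and D2 fill the primary bucket, and D3 goes into the overflow bucket linked to it.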
Quadratic probing :
Quadratic probing is very similar to open hashing or linear probing. The only difference is
that in linear probing the distance between the old and new bucket is linear, whereas here a
quadratic function is used to determine the new bucket address.
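The difference between the two probe sequences can be shown side by side. The table size 11, the key 10 and the offset i*i are illustrative assumptions. Other quadratic functions are possible.

```python
def linear_probes(key, size, attempts):
    """Slots tried for key: home, home + 1, home + 2, ..."""
    home = key % size
    return [(home + i) % size for i in range(attempts)]

def quadratic_probes(key, size, attempts):
    """Slots tried for key: home, home + 1, home + 4, home + 9, ..."""
    home = key % size
    return [(home + i * i) % size for i in range(attempts)]

linear = linear_probes(10, 11, 4)        # [10, 0, 1, 2]
quadratic = quadratic_probes(10, 11, 4)  # [10, 0, 3, 8]
```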
Double Hashing :
Double hashing is another method similar to linear probing. Here the difference between
successive buckets is fixed, as in linear probing, but this fixed difference is calculated by
using another hash function. That is why it is called double hashing.
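A sketch of the probe sequence under double hashing. The second hash function used here, 1 + key mod (size - 1), is one common choice and is an assumption, as are the table size and keys.

```python
def double_hash_probes(key, size, attempts):
    """Slots tried for key, stepping by a key-dependent fixed distance."""
    home = key % size              # first hash function
    step = 1 + key % (size - 1)    # second hash function, never zero
    return [(home + i * step) % size for i in range(attempts)]

probes_10 = double_hash_probes(10, 7, 4)  # home 3, step 5
probes_17 = double_hash_probes(17, 7, 4)  # home 3, step 6
```

Both keys start at slot 3, but because their step sizes differ they follow different probe sequences afterwards.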
For example:
Consider the following grouping of keys into buckets, depending on the last bits of their hash
address:
The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6
are 01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into
bucket B2. The last two bits of 7 are 11, so it goes into B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket B1.
But bucket B1 is full, so it is split.
o The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go
into bucket B1, and the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. Bucket B0 is pointed to by the 000 and 100 entries because
the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. Bucket B2 is pointed to by the 010 and 110 entries because
the last two bits of both entries are 10.
o Key 7 is still in B3. Bucket B3 is pointed to by the 111 and 011 entries because the last two
bits of both entries are 11.
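The split described above can be reproduced with a small sketch of a directory indexed by the low-order bits of the hash address. Only key 9's hash address (10001) is given in the text. The other hash addresses are hypothetical values chosen to match the stated last bits, and the bucket capacity of two keys is likewise an assumption.

```python
BUCKET_CAPACITY = 2  # assumption: each bucket holds two keys

# Only key 9's address comes from the text; the rest are hypothetical
# values chosen so that their low-order bits match the example.
HASH = {1: 0b11010, 2: 0b00000, 3: 0b11110, 4: 0b00100,
        5: 0b01001, 6: 0b10101, 7: 0b10111, 9: 0b10001}

class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth: bits this bucket distinguishes
        self.keys = []

class ExtendibleHash:
    def __init__(self, depth):
        self.depth = depth  # global depth: bits used to index the directory
        self.directory = [Bucket(depth) for _ in range(2 ** depth)]

    def _slot(self, key):
        return HASH[key] & ((1 << self.depth) - 1)  # low-order bits

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.depth == self.depth:
            # Double the directory; new entries point at existing buckets.
            self.directory = self.directory + self.directory
            self.depth += 1
        # Split the full bucket on one more bit and redistribute its keys.
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        pending, bucket.keys = bucket.keys + [key], []
        for i, entry in enumerate(self.directory):
            if entry is bucket and i & high_bit:
                self.directory[i] = sibling
        for k in pending:
            self.insert(k)

table = ExtendibleHash(depth=2)
for key in (2, 4, 5, 6, 1, 3, 7):
    table.insert(key)
table.insert(9)  # B1 is full, so the directory doubles and B1 splits
```

After the last insert the directory has eight entries: entry 001 holds keys 5 and 9, entry 101 holds key 6, and the remaining pairs of entries still share their unsplit buckets.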
A B+ tree is similar to a binary search tree, except that a node can have more than
two children. All the information is stored in the leaf nodes, and the
intermediate nodes act as pointers to the leaf nodes. The leaf nodes always
form a sorted sequential linked list.
In the above diagram, 56 is the root node, which is also called the main node of the tree.
The intermediate nodes here consist only of the addresses of leaf nodes. They do not contain
any actual records. The leaf nodes hold the actual records. All leaf nodes are at the same
level, so the tree is balanced.
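A minimal sketch of searching a B+ tree and scanning its leaf chain. The two-level tree built here is hypothetical: only the root key 56 comes from the text, and the class names, leaf keys and record values are made up for illustration. Insertion and deletion are not covered.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys, records):
        self.keys = keys
        self.records = records
        self.next = None  # link to the next leaf in key order

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys only, no records
        self.children = children  # len(children) == len(keys) + 1

def search(node, key):
    """Descend from the root to the leaf that may hold key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.records[node.keys.index(key)]
    return None

def scan(leaf):
    """Walk the leaf chain to list all keys in sorted order."""
    keys = []
    while leaf is not None:
        keys.extend(leaf.keys)
        leaf = leaf.next
    return keys

# Hypothetical two-level tree; only the root key 56 comes from the text.
left = Leaf([12, 30], ["r12", "r30"])
right = Leaf([56, 71], ["r56", "r71"])
left.next = right
root = Internal([56], [left, right])
```

`search(root, 56)` follows the root to the right leaf and returns its record, while `scan(left)` visits the leaves through their links without touching the root.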
Cluster file organization lowers the cost of searching for and retrieving related records
from different files, as they are now combined and kept in a single cluster.
For example, we have two tables or relations, Employee and Department. These tables are
related to each other.
Therefore these tables can be combined using a join operation and stored in a
cluster file.
If we have to insert, update, or delete any record, we can do so directly. Data is sorted based
on the primary key or the key with which searching is done. The cluster key is the key on which
the tables are joined.
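The idea of storing joined Employee and Department records under one cluster key can be sketched as follows. The column layout, the sample rows and the use of a department ID as the cluster key are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows; the department ID plays the role of the cluster key.
departments = [(10, "Sales"), (20, "Research")]
employees = [(1, "Asha", 20), (2, "Ben", 10), (3, "Chen", 20)]

# One cluster per cluster-key value holds the department and its employees.
clusters = defaultdict(lambda: {"department": None, "employees": []})
for dep_id, dep_name in departments:
    clusters[dep_id]["department"] = dep_name
for emp_id, emp_name, dep_id in employees:
    clusters[dep_id]["employees"].append(emp_name)

# Reading one cluster returns the joined data without scanning both tables.
research = clusters[20]
```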
Types of Cluster File Organization –
2. Hash Clusters – This is very similar to an indexed cluster, with the only difference that
instead of storing the records based on the cluster key, we generate a hash key value for the
cluster key and store together the records with the same hash key value.