DBMS - File Organization, Indexing and Hashing Notes
A database consists of a huge amount of data. In an RDBMS the data is grouped into tables, and
each table holds related records. A user sees the data stored in the form of tables, but in
reality this huge amount of data is stored in physical memory in the form of files.
File – A file is a named collection of related information that is recorded on secondary storage
such as magnetic disks, magnetic tapes and optical disks.
File Organization
File Organization refers to the logical relationships among the various records that constitute the file,
particularly with respect to the means of identification and access to any specific record. In
simple terms, storing the files in a certain order is called file organization. File structure refers
to the format of the label and data blocks and of any logical control record.
Various methods have been introduced to organize files. These methods have
advantages and disadvantages depending on how records are accessed or selected. Thus it is up to the
programmer to decide the best-suited file organization method according to the requirements.
Some types of file organizations are:
Sequential File Organization
Heap File Organization
Hash File Organization
B+ Tree File Organization
Clustered File Organization
The easiest method of file organization is the sequential method. In this method the records are
stored one after another in a sequential manner. There are two ways to implement this method:
1. Pile File Method – This method is quite simple: we store the records in a
sequence, i.e. one after another, in the order in which they are inserted into the tables.
2. Sorted File Method – In this method, as the name itself suggests, whenever a new record
has to be inserted, it is always inserted in a sorted (ascending or descending) manner.
Sorting of records may be based on a primary key or on any other key, as sketched below.
Heap File Organization works with data blocks. In this method records are inserted at the end of
the file, into the data blocks. No sorting or ordering is required in this method. If a data block is
full, the new record is stored in some other block; this other data block need not be the very
next data block, but can be any block in memory. It is the responsibility of the DBMS to store
and manage the new records.
If we want to search, delete or update data in heap file organization, then we have to traverse the data
from the beginning of the file until we get the requested record. Thus if the database is very huge,
searching, deleting or updating a record will take a lot of time.
Pros and Cons of Heap File Organization –
Pros –
Fetching and retrieving records is faster than in sequential file organization, but only in the
case of small databases.
When a huge amount of data needs to be loaded into the database at once, this method of file
organization is best suited.
Cons –
Problem of unused memory blocks.
Inefficient for larger databases.
B+ Tree File Organization
In this method, searching becomes very easy as all the records are stored only in the leaf
nodes and sorted in a sequential linked list.
Traversing through the tree structure is easier and faster.
The size of the B+ tree has no restrictions, so the number of records can increase or
decrease and the B+ tree structure can also grow or shrink.
It is a balanced tree structure, and any insert/update/delete does not affect the
performance of the tree.
Cluster File Organization
In this method, we can directly insert, update or delete any record. Data is sorted based on the
key with which searching is done. The cluster key is the key with which joining of the tables is
performed.
1. Indexed Clusters:
In an indexed cluster, records are grouped based on the cluster key and stored together. The above
EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster: here, all the
records are grouped based on the cluster key, DEP_ID.
2. Hash Clusters:
It is similar to the indexed cluster. In a hash cluster, instead of storing the records based on the
cluster key directly, we generate a hash value for the cluster key and store together the records with
the same hash value.
The cluster file organization is used when there are frequent requests to join the tables
with the same joining condition.
It provides efficient results when there is a 1:M mapping between the tables.
This method has low performance for very large databases.
If there is any change in the joining condition, then this method cannot be used; if we change
the joining condition, traversing the file takes a lot of time.
This method is not suitable for tables with a 1:1 relationship.
Indexing in Databases
Indexing Methods
Ordered Indices
The indices are usually sorted so that the searching is faster. The indices which are sorted are
known as ordered indices.
If the search key of any index specifies the same order as the sequential order of the file, it is
known as a primary index or clustering index. If the search key of any index specifies an
order different from the sequential order of the file, it is called a secondary index or non-
clustering index.
Clustered Indexing
A clustering index is defined on an ordered data file, where the data file is ordered on a non-key
field. In some cases, the index is created on non-primary-key columns, which may not be
unique for each record.
In such cases, in order to identify the records faster, we group two or more columns
together to get unique values and create an index out of them. This method is known as a
clustering index. Basically, records with similar characteristics are grouped together and
indexes are created for these groups.
For example, students studying in each semester are grouped together: 1st semester
students, 2nd semester students, 3rd semester students, etc.
Non-Clustered Indexing
A non-clustered index just tells us where the data lies, i.e. it gives us a list of virtual
pointers or references to the locations where the data is actually stored. Data is not
physically stored in the order of the index.
Instead, data is present in the leaf nodes. For example, think of the contents page of a book:
each entry gives us the page number or location of the information stored.
The actual data here (the information on each page of the book) is not organised, but we have an
ordered reference (the contents page) to where the data actually lies.
It requires more time than a clustered index because some extra work is done
in order to extract the data by following the pointer. In the case of a clustered
index, the data is directly present in front of the index.
Secondary Index
It is used to optimize query processing and to access records in a database using some
information other than the usual search key (the primary key). Here, two levels of indexing
are used in order to keep the size of the first-level mapping small.
Initially, for the first level, a large range of numbers is selected so that the mapping size
is small. Each range is then divided into further sub-ranges.
For quick memory access, the first level is stored in primary memory. The actual
physical location of the data is determined by the second-level mapping.
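A minimal Python sketch of this two-level idea (the range width and block names are assumptions, not part of the notes): the small first level, kept in memory, maps a key to its range, and the second level maps the key to the actual block address.

RANGE_SIZE = 100                                  # assumed width of each first-level range

first_level = {}                                  # range start -> second-level mapping (kept in memory)

def add_entry(key, block_address):
    start = (key // RANGE_SIZE) * RANGE_SIZE      # which coarse range the key falls into
    first_level.setdefault(start, {})[key] = block_address

def lookup(key):
    start = (key // RANGE_SIZE) * RANGE_SIZE
    second_level = first_level.get(start, {})     # first level: small, in-memory hop
    return second_level.get(key)                  # second level: actual physical location

add_entry(476, "block-17")
print(lookup(476))                                # block-17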
Difference between B Tree and B+ Tree Index Files
Compare the examples of B+ tree index files and B tree index files above. You can see that they
are almost similar, but there is a small difference between them, and this small difference has a
large effect on database performance.
B Tree –
It is a tree structure similar to the B+ tree, but here each node has only two branches and each
node holds some records. Hence there is no need to traverse all the way to a leaf node to get the data.
It has more height than width.
The number of nodes at any intermediary level l is 2^l, and each of the intermediary nodes has
only 2 sub-nodes.
Even the leaf-node level has 2^l nodes, so the total number of nodes in the B tree is 2^(l+1) - 1.
Records are in sorted order.
Advantages – It might have fewer nodes than a B+ tree, since each node holds data. Since each
node has records, it may not be necessary to traverse down to a leaf node.
Disadvantages – If the tree is very big, we have to traverse through most of the nodes to get the
records; only a few records can be fetched at the intermediary nodes or near the root, so this
method might be slower. Since each node holds data and can have only two child nodes, the tree
does not spread out much; its depth/height increases as the number of records grows, and as the
height increases the I/O also increases, so performance decreases. Insertion and deletion of nodes
require re-arrangements as in a B+ tree, but they are more complicated because the binary nodes
have to be balanced. Implementation of a B tree is a little more difficult than that of a B+ tree.
These disadvantages cannot be ignored, as they strongly affect the performance of the file.

B+ Tree –
It is a balanced tree with intermediary nodes and leaf nodes. Intermediary nodes contain only
pointers/addresses to the leaf nodes; all leaf nodes hold the records and all are at the same
distance from the root.
It has more width than height.
Each intermediary node can have n/2 to n children; only the root node may have as few as 2 children.
Each leaf node stores (n-1)/2 to n-1 values.
As the number of intermediary nodes, and hence leaf nodes, grows (i.e. as the B+ tree extends),
the traversal cost grows only logarithmically, roughly log base (n/2) of K, where K is the number
of keys.
Records are in sorted order.
Advantages – It automatically adjusts the nodes to fit a new record, and similarly re-organizes
the nodes on deletion if required, without violating the definition of the B+ tree. Reorganization
of the nodes does not affect the performance of the file, because even after the rearrangement all
the records are still found in the leaf nodes and are all equidistant from the root; neither the
distance of records from the root nor the time to traverse to a leaf changes. There is no file
degradation problem. Space utilization is good, since intermediary nodes contain only pointers to
the records and only leaf nodes contain records, and the space needed for pointers is very small
compared to records. It is suitable for partial and range searches too. Since all the leaf nodes
are at equal distance, the time for an I/O fetch is much lower, so the performance of the tree is
better.
Disadvantages – Any rearrangement of nodes during insertion or deletion is an overhead; it takes
a little effort, time and space, but this disadvantage can be ignored compared to the speed of
traversal.
B+ Tree Indexing
A B+ tree is a balanced search tree that follows a multi-level index format. The leaf nodes
of a B+ tree hold the actual data pointers. The B+ tree ensures that all leaf nodes remain at the same
height, and is thus balanced. Additionally, the leaf nodes are linked using a linked list; therefore, a
B+ tree can support random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at an equal distance from the root node. A B+ tree is of order n, where n is
fixed for a given B+ tree.
Internal nodes −
Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
At most, an internal node can contain n pointers.
Leaf nodes −
Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
At most, a leaf node can contain n record pointers and n key values.
Every leaf node contains one block pointer P to point to next leaf node and forms a
linked list.
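The constraints above can be summarised in a small Python sketch; this class layout is illustrative only and not a full B+ tree implementation.

# Illustrative node layout for a B+ tree of order n.
class InternalNode:
    def __init__(self, keys, children):
        self.keys = keys                          # search-key values separating the children
        self.children = children                  # between ceil(n/2) and n pointers (the root may have fewer)

class LeafNode:
    def __init__(self, keys, record_pointers, next_leaf=None):
        self.keys = keys                          # between ceil(n/2) and n key values
        self.record_pointers = record_pointers    # one record pointer per key value
        self.next_leaf = next_leaf                # block pointer P to the next leaf (forms the linked list)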
B+ Tree Insertion
B+ trees are filled from the bottom, and each entry is made at a leaf node.
If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o ith key is duplicated at the parent of the leaf.
If a non-leaf node overflows −
o Split node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o Rest of the entries are moved to a new node.
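A minimal Python sketch of the leaf-split step described above, assuming the tree order is m: the first i = ⌊(m+1)/2⌋ entries stay in the original node, the rest move to a new node, and the separating key is duplicated in the parent (as in the worked example further below, where 60 is copied up).

# Minimal sketch of a B+ tree leaf split; m is the (assumed) order of the tree.
def split_leaf(entries, m):
    # entries: sorted list of (key, record_pointer) pairs from an overflowing leaf
    i = (m + 1) // 2                  # partition point i = floor((m+1)/2)
    left = entries[:i]                # first i entries remain in the original node
    right = entries[i:]               # entries from i+1 onwards move to a new leaf node
    key_for_parent = right[0][0]      # this key is duplicated at the parent of the leaf
    return left, right, key_for_parent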
B+ Tree Deletion
B+ tree entries are deleted at the leaf nodes.
The target entry is searched and deleted.
o If it is an internal node, delete and replace with the entry from the left position.
After deletion, underflow is tested,
o If underflow occurs, distribute the entries from the node to its left.
If distribution is not possible from the left, then
o Distribute from the node to its right.
If distribution is not possible from the left or the right, then
o Merge the node with the nodes to its left and right.
B+ Tree Insertion Example
Suppose we want to insert a record with key 60 into the structure below. It will go into the 3rd
leaf node, after 55. The tree is balanced and that leaf node is already full, so we cannot simply
insert 60 there.
In this case, we have to split the leaf node so that 60 can be inserted into the tree without
affecting the fill factor, balance and order.
After adding 60, the 3rd leaf node has the values (50, 55, 60, 65, 70), and the intermediate-node
entry pointing to it is 50. We will split the leaf node in the middle so that the balance of the
tree is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch on 50 alone; 60 must be
added to it, and then we can have a pointer to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the leaf node where the key fits and then place it in that leaf node.
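Using the split_leaf sketch from earlier with an assumed order of 4, the same numbers reproduce this step:

entries = [(50, "r50"), (55, "r55"), (60, "r60"), (65, "r65"), (70, "r70")]
left, right, up_key = split_leaf(entries, m=4)
print([k for k, _ in left])           # [50, 55]
print([k for k, _ in right])          # [60, 65, 70]
print(up_key)                         # 60 -> added to the intermediate node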
B+ Tree Deletion Example
Suppose we want to delete 60 from the above example. In this case, we have to remove
60 from the intermediate node as well as from the 4th leaf node. If we simply remove it from
the intermediate node, the tree will no longer satisfy the rules of a B+ tree, so we need to
modify it to keep the tree balanced.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
Hashing
In a database management system, when we want to retrieve particular data, it becomes
very inefficient to search through all the index values to reach the desired data. In this
situation, the hashing technique comes into the picture.
Hashing is an efficient technique to directly find the location of the desired data on the
disk without using an index structure. Data is stored in data blocks whose addresses are
generated using a hash function. The memory locations where these records are stored are
called data blocks or data buckets.
Data bucket – Data buckets are the memory locations where the records are stored.
These buckets are also considered the unit of storage.
Hash Function – A hash function is a mapping function that maps the set of all search
keys to actual record addresses. Generally, the hash function uses the primary key to generate the
hash index, i.e. the address of the data block. The hash function can be any simple or complex
mathematical function.
Hash Index – The prefix of an entire hash value is taken as a hash index. Every hash index
has a depth value to signify how many bits are used for computing the hash function. These
bits can address 2^n buckets. When all these bits are consumed, the depth value is
increased linearly and twice as many buckets are allocated.
The diagram given below depicts how a hash function works.
Hashing is further divided into two sub-categories:
Static Hashing –
In static hashing, when a search-key value is provided, the hash function always computes the
same address. For example, if we want to generate the address for STUDENT_ID = 104 using the mod
(5) hash function, it always results in the same bucket address, 4. The bucket address does not
change here. Hence the number of data buckets in memory for static hashing remains constant
throughout.
Operations –
Insertion – When a new record is inserted into the table, the hash function h generates a
bucket address for the new record based on its hash key K.
Bucket address = h(K)
Searching – When a record needs to be searched, the same hash function is used to
retrieve the bucket address for the record. For example, if we want to retrieve the whole record
for ID 104, and the hash function is mod (5) on that ID, the bucket address generated
would be 4. We then go directly to address 4 and retrieve the whole record for ID
104. Here the ID acts as the hash key (see the sketch after this list).
Deletion – If we want to delete a record, we first fetch the record that is supposed to be
deleted using the hash function, and then remove the record from that address
in memory.
Updation – The data record that needs to be updated is first searched using the hash
function, and then the data record is updated.
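A minimal Python sketch of these four operations with the mod (5) hash function from the example (the record contents are hypothetical):

# Static hashing sketch: a fixed number of buckets and h(K) = K mod 5.
NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]        # data buckets

def h(key):
    return key % NUM_BUCKETS                      # bucket address = h(K)

def insert(key, record):
    buckets[h(key)].append((key, record))

def search(key):
    for k, rec in buckets[h(key)]:                # go directly to bucket h(K)
        if k == key:
            return rec
    return None

def delete(key):
    buckets[h(key)] = [(k, r) for k, r in buckets[h(key)] if k != key]

def update(key, record):
    delete(key)                                   # remove the old record, then store the new one
    insert(key, record)

insert(104, "record for ID 104")
print(h(104), search(104))                        # 4 record for ID 104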
Now, suppose we want to insert some new records into the file, but the data bucket address
generated by the hash function is not empty, i.e. data already exists at that address. This
becomes a critical situation to handle. This situation in static hashing is called bucket overflow.
1. Open Hashing –
In the open hashing method, the next available data bucket is used to store the new record,
instead of overwriting the older one. This method is also called linear probing.
For example, suppose D3 is a new record that needs to be inserted and the hash function generates
the address 105, but that bucket is already full. The system then searches for the next available
data bucket, 123, and assigns D3 to it.
2. Closed Hashing –
In the closed hashing method, a new data bucket is allocated with the same address and is linked
after the full data bucket. This method is also known as overflow chaining.
For example, suppose we have to insert a new record D3 into the table. The static hash function
generates the data bucket address 105, but this bucket is too full to store the new data. In
this case a new data bucket is added at the end of the 105 data bucket and linked to it, and
the new record D3 is then inserted into the new bucket. A combined sketch of both overflow
methods is given below.
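Both overflow strategies can be sketched in a few lines of Python (the bucket capacity, number of buckets and keys are assumptions): open hashing probes for the next free bucket, while closed hashing links an overflow bucket onto the full one.

BUCKET_CAPACITY = 2                               # assumed records per bucket
NUM_BUCKETS = 8                                   # assumed number of buckets

# Open hashing (linear probing): use the next available bucket.
probing_buckets = [[] for _ in range(NUM_BUCKETS)]

def insert_probing(key, record):
    addr = key % NUM_BUCKETS
    while len(probing_buckets[addr]) >= BUCKET_CAPACITY:
        addr = (addr + 1) % NUM_BUCKETS           # probe the next data bucket
    probing_buckets[addr].append((key, record))

# Closed hashing (overflow chaining): link a new bucket after the full one.
chained_buckets = [{"records": [], "overflow": None} for _ in range(NUM_BUCKETS)]

def insert_chaining(key, record):
    bucket = chained_buckets[key % NUM_BUCKETS]
    while len(bucket["records"]) >= BUCKET_CAPACITY:
        if bucket["overflow"] is None:            # allocate a new bucket and link it
            bucket["overflow"] = {"records": [], "overflow": None}
        bucket = bucket["overflow"]
    bucket["records"].append((key, record))

for k in (3, 11, 19):                             # all hash to bucket 3, forcing an overflow
    insert_probing(k, "D%d" % k)
    insert_chaining(k, "D%d" % k)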
Dynamic Hashing –
The drawback of static hashing is that it does not expand or shrink dynamically as the size of
the database grows or shrinks. In dynamic hashing, data buckets grow or shrink (are added or
removed dynamically) as the number of records increases or decreases. Dynamic hashing is also
known as extended hashing.
In dynamic hashing, the hash function is made to produce a large number of values. For
example, suppose there are three data records D1, D2 and D3, and the hash function generates the
three addresses 1001, 0101 and 1010 respectively. This method of storing considers only part of
each address, in this case only the first bit, to store the data. So it tries to store the three
records at addresses 0 and 1.
But no bucket address remains for D3. The buckets have to grow dynamically to accommodate D3,
so the addresses are changed to use 2 bits rather than 1 bit, the existing data is updated to
2-bit addresses, and then D3 is accommodated.
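The growth step can be sketched as follows in Python, using the three example hash values; the bucket capacity is assumed to be one record, and the code keeps only the first depth bits of each hash value as the bucket address.

# Dynamic hashing sketch: the bucket address is the first depth bits of the
# hash value; when a bucket overflows, the depth grows and the buckets double.
BUCKET_CAPACITY = 1                               # assumed: one record per bucket
depth = 1                                         # number of address bits currently in use
buckets = {"0": [], "1": []}

def insert(name, hash_bits):
    global depth, buckets
    addr = hash_bits[:depth]                      # use only the first depth bits
    if len(buckets[addr]) < BUCKET_CAPACITY:
        buckets[addr].append((name, hash_bits))
        return
    # Overflow: use one more bit, allocate twice the buckets, redistribute.
    depth += 1
    old_records = [rec for bucket in buckets.values() for rec in bucket]
    buckets = {format(i, "0%db" % depth): [] for i in range(2 ** depth)}
    for rec in old_records + [(name, hash_bits)]:
        buckets[rec[1][:depth]].append(rec)
    # Note: D1 (1001) and D3 (1010) still share the 2-bit prefix "10", so a real
    # implementation would grow again or chain an overflow bucket at this point.

insert("D1", "1001")                              # first bit 1 -> bucket "1"
insert("D2", "0101")                              # first bit 0 -> bucket "0"
insert("D3", "1010")                              # bucket "1" is full -> depth becomes 2
print(depth, {addr: [name for name, _ in recs] for addr, recs in buckets.items()})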
Bitmap Indexing
Bitmap Indexing is a special type of database indexing that uses bitmaps. This technique
is used for huge databases, when a column is of low cardinality and such columns are the most
frequently used in queries.
Need of Bitmap Indexing – The need for bitmap indexing will be clear from the example given
below:
For example, let us say that a company holds an employee table with columns like EmpNo,
EmpName, Job, New_Emp and Salary. Let us assume that employees are hired once a year, so the
table is updated rarely and remains static most of the time. But the columns are frequently used
in queries to retrieve data, such as the number of female employees in the company. In this case
we need a file organization method that is fast enough to give quick results. But none of the
traditional file organization methods is that fast; therefore we switch to a better method of
storing and retrieving data known as Bitmap Indexing.
How Bitmap Indexing is done –
o In the above example of the employee table, we can see that the column New_Emp has only
two values, Yes and No, based upon whether the employee is new to the company or
not.
o Similarly, let us assume that the Job of the employees is divided into only 4 categories:
Manager, Analyst, Clerk and Salesman. Such columns are called columns with low
cardinality. Even though these columns have few unique values, they can be queried very
often.
o Bit: A bit is a basic unit of information used in computing that can have only one of two
values, 0 or 1. The two values of a binary digit can also be interpreted as the logical
values true/false or yes/no.
In bitmap indexing these bits are used to represent the unique values in those low-cardinality
columns. This technique of storing the low-cardinality values in the form of bits is called a
bitmap index.
Continuing the employee example, given below is the Employee table:
If New_Emp is the data to be indexed, the content of the bitmap index is shown as bit vectors of
length four (as we have four rows in the above table) under the heading Bitmap Indices. Here the
bitmap for "Yes" has the value 1001 because rows 1 and 4 have the value "Yes" in column New_Emp.
In this case there are two such bitmaps, one for New_Emp "Yes" and one for New_Emp "No".
It is easy to see that each bit in a bitmap indicates whether a particular row refers to a
person who is new to the company or not.
The above scenario is the simplest form of bitmap indexing. Most columns will have more
distinct values; for example, the column Job here has 4 unique values (as mentioned
earlier). Variations on the bitmap index can effectively index this data as well. For the Job
column the bitmap indexing is shown below:
Now suppose we want to find the details of the employees who are not new to the
company and are salespersons. Then we will run the query:
SELECT *
FROM Employee
WHERE New_Emp = "No" and Job = "Salesperson";
For this query the DBMS will take the bitmap indexes of both columns, perform a logical
AND operation on those bits, and find the actual result:
Here the result 0100 indicates that the second row has to be retrieved as the result.
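A minimal Python sketch of that evaluation over the four example rows (the per-row Job values are assumed for illustration):

# Bitmap-index sketch over the four example rows.
new_emp = ["Yes", "No", "No", "Yes"]                        # New_Emp column, rows 1-4
job = ["Manager", "Salesperson", "Analyst", "Clerk"]        # assumed Job values per row

def build_bitmaps(column):
    # one bitmap (list of 0/1 per row) for each distinct value of a low-cardinality column
    return {value: [1 if v == value else 0 for v in column] for value in set(column)}

new_emp_bitmaps = build_bitmaps(new_emp)                    # {"Yes": [1, 0, 0, 1], "No": [0, 1, 1, 0]}
job_bitmaps = build_bitmaps(job)

# New_Emp = "No" AND Job = "Salesperson": bitwise AND of the two bitmaps
result = [a & b for a, b in zip(new_emp_bitmaps["No"], job_bitmaps["Salesperson"])]
print(result)                                               # [0, 1, 0, 0] -> only row 2 satisfies the query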
Bitmap Indexing in SQL – The syntax for creating a bitmap index in SQL is given below:
CREATE BITMAP INDEX Index_Name
ON Table_Name (Column_Name);
For the above example of employee table, the bitmap index on column New_Emp will be created
as follows:
CREATE BITMAP INDEX index_New_Emp
ON Employee (New_Emp);
Advantages –
Efficiency in terms of insertion, deletion and updation
Faster retrieval of records
Disadvantages –
Only suitable for large tables
Bitmap Indexing is time consuming