DBMS Unit-5
Indexes, Primary and Secondary Indexes, Index data Structures, Hash Based
Indexing, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic
Index Structure.
Databases are stored in file formats. Data is stored in bits and bytes in different
storage devices. At the physical level, actual data is stored in some devices in
electromagnetic form. Storage systems can be classified into three types of
storage devices.
1. Primary Storage
It is the memory storage that is directly accessible to the CPU. These devices
are typically small and ultra-fast. It provides quick access to the stored data.
These devices store data only temporarily, and therefore they are also termed
volatile storage. Volatile means that if the system goes down (restarts, loses
power, or crashes), the data in these devices is lost permanently, and the space
becomes free again for other purposes.
CPU's main memory and cache are two important primary devices.
Main memory: The main memory handles the instructions of the
computer. It is generally termed RAM(Random Access Memory). RAM
stores operating system software, software applications, and other
information for the CPU to have direct access when needed.
2. Secondary Storage
Secondary storage is the storage area that allows you to save and store data
permanently, and it is also called 'online storage'. Secondary storage is
non-volatile storage, i.e., unlike primary storage, it doesn't lose data if the
system restarts, crashes, or loses power. Secondary storage media are:
3. Tertiary Storage
Tertiary storage devices store immense amounts of data. These storage devices
are external to the computer system and are the slowest of the three types. Generally,
these devices are used to back up an entire system. Optical disks(optical
storage) and magnetic tapes(tape storage) are commonly used tertiary storage
devices. These two types of storage are
FILE ORGANIZATION
In the file organization, the programmer decides the best-suited file organization
method according to his requirement.
o In this method, the new record is always inserted at the file's end, and
then it will sort the sequence in ascending or descending order. Sorting of
records is based on any primary key or any other key.
o In the case of modification of any record, it will update the record and
then sort the file, and lastly, the updated record is placed in the right
place.
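The insert-then-sort behaviour described in the two points above can be sketched in Python. This is only a minimal illustration: the record layout (a tuple of key and payload) and the function names insert_record and update_key are assumptions, not part of any particular DBMS.

```python
def insert_record(sorted_file, record):
    """Append the record at the end of the file, then re-sort on the key."""
    sorted_file.append(record)               # new record goes at the file's end
    sorted_file.sort(key=lambda r: r[0])     # restore ascending key order
    return sorted_file


def update_key(sorted_file, old_key, new_key):
    """Change a record's key, then re-sort so it lands in the right place."""
    for i, (key, payload) in enumerate(sorted_file):
        if key == old_key:
            sorted_file[i] = (new_key, payload)
            break
    sorted_file.sort(key=lambda r: r[0])
    return sorted_file


# Hypothetical records, keyed by an integer primary key.
records = [(1, "R1"), (3, "R3"), (7, "R7")]
insert_record(records, (2, "R2"))   # R2 is appended, then sorted into place
update_key(records, 7, 0)           # R7 moves to the front after the re-sort
```

Re-sorting the whole file on every change is what makes this organization expensive for large files.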
If we want to search, update or delete the data in heap file organization, then we
need to traverse the data from the start of the file until we reach the requested record.
If the database is very large then searching, updating or deleting of record will
be time-consuming because there is no sorting or ordering of records. In the
heap file organization
3. Hash File Organization
Hash File Organization uses the computation of hash function on some fields of the records.
The hash function's output determines the location of disk block where the records are to be
placed.
When a record has to be retrieved using the hash key columns, the address is generated,
and the whole record is retrieved using that address. In the same way, when a new record has
to be inserted, the address is generated using the hash key and the record is directly inserted.
The same process is applied in the case of delete and update.
In this method, there is no effort for searching and sorting the entire file. Each record is
stored at the location given by the hash function, so records appear scattered rather than ordered.
4. Indexed Sequential Access Method (ISAM)
ISAM method is an advanced sequential file organization. In this method, records are stored
in the file using the primary key. An index value is generated for each primary key and
mapped with the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
o In this method, since each record has the address of its data block, searching for a record in a
huge database is quick and easy.
o This method supports range retrieval and partial retrieval of records. Since the index is based
on the primary key values, we can retrieve the data for the given range of value. In the same
way, the partial value can also be easily searched, i.e., the student name starting with 'JA' can
be easily searched.
Cons of ISAM
o This method requires extra space in the disk to store the index value.
o When the new records are inserted, then these files have to be reconstructed to maintain the
sequence.
o When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
5. B+ Tree File Organization
o B+ tree file organization is the advanced method of an indexed sequential access method. It
uses a tree-like structure to store records in File.
o It uses the same concept of key-index where the primary key is used to sort the records. For
each primary key, the value of the index is generated and mapped with the record.
o The B+ tree is similar to a binary search tree (BST), but it can have more than two children.
In this method, all the records are stored only at the leaf node. Intermediate nodes act as a
pointer to the leaf nodes. They do not contain any records.
1. Indexed Clusters:
In indexed cluster, records are grouped based on the cluster key and stored together. The
above EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster.
Here, all the records are grouped based on the cluster key DEP_ID.
2. Hash Clusters:
It is similar to the indexed cluster. In hash cluster, instead of storing the records based on the
cluster key, we generate the value of the hash key for the cluster key and store the records
with the same hash key value.
Indexing in DBMS
o Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a database table
quickly.
Index structure:
Indexes can be created using some database columns.
o The first column of the index is the search key, which contains a copy of the primary key or
candidate key of the table. The values of the primary key are stored in sorted order so that the
corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers holding
the address of the disk block where the value of the particular key can be found.
o If the index is created on the basis of the primary key of the table, then it is known as primary
indexing. Primary keys are unique to each record, so there is a 1:1 relation between key values
and records.
o As primary keys are stored in sorted order, the performance of the searching operation is quite
efficient.
Dense index
Sparse index.
Dense Index
In a dense index, an index record is created for every search-key value in the database.
This makes searching faster but needs more space to store the index records. In this
indexing method, each index record contains the search-key value and a pointer to the
actual record on the disk.
Sparse Index
A sparse index contains index records for only some of the search-key values in
the file. It helps to resolve the space issues of dense indexing in DBMS. In this
indexing technique, a range of search-key values shares the same data block
address, and when data needs to be retrieved, that block address is fetched and
the block is scanned. A sparse index needs less space and less maintenance
overhead for insertions and deletions, but it is slower than the dense index for
locating records.
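The difference between the two lookups can be sketched in Python. The data file, its block layout, and the names dense_lookup and sparse_lookup are all hypothetical, chosen only to show one index entry per key versus one index entry per block.

```python
import bisect

# Hypothetical data file: blocks of (key, record) pairs, sorted by key.
blocks = [
    [(10, "A"), (20, "B")],
    [(30, "C"), (40, "D")],
    [(50, "E"), (60, "F")],
]

# Dense index: one entry per search-key value -> (block number, slot).
dense_index = {
    key: (b, s)
    for b, block in enumerate(blocks)
    for s, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]


def dense_lookup(key):
    """Direct hit in the index, then a single block access."""
    pos = dense_index.get(key)
    if pos is None:
        return None
    b, s = pos
    return blocks[b][s][1]


def sparse_lookup(key):
    """Find the last index entry <= key, then scan that block sequentially."""
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None
    for k, record in blocks[i]:
        if k == key:
            return record
    return None
```

Here the dense index holds six entries and the sparse index only three, which is the space saving; the price is the sequential scan inside the block in sparse_lookup.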
2. Secondary Index
The secondary Index in DBMS can be generated by a field which has a unique
value for each record, and it should be a candidate key. It is also known as a non-
clustering index.
This two-level database indexing technique is used to reduce the mapping size of
the first level. For the first level, a large range of numbers is selected; because
of this, the mapping size always remains small.
Here, you can have a secondary index in DBMS for every search key. An index record
points to a bucket that contains pointers to all the records with that
specific search-key value.
3.Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the unique
value and create index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for these
group.
Data bucket – Data buckets are the memory locations where the records are
stored. These buckets are also considered as Unit Of Storage.
Hash Function – Hash function is a mapping function that maps all the set of
search keys to actual record address. Generally, hash function uses the primary
key to generate the hash index – address of the data block. Hash function can be
simple mathematical function to any complex mathematical function.
Hash Index – The prefix of an entire hash value is taken as a hash index. Every
hash index has a depth value to signify how many bits are used for computing a
hash function. With a depth of n bits, 2^n buckets can be addressed.
Types of Hashing:
Static Hashing:
In static hashing, when a search-key value is provided, the hash function always
computes the same address. For example, if we want to generate an address for
STUDENT_ID = 104 using mod (5) hash function, it always results in the same
bucket address 4. There will not be any changes to the bucket address here. Hence the
number of data buckets in memory for static hashing remains constant
throughout.
Operations:
Insertion – When a new record is inserted into the table, the hash function h
generates a bucket address for the new record based on its hash key K: bucket
address = h(K).
Searching – When a record needs to be searched, the same hash function is used
to retrieve the bucket address for the record. For example, if we want to retrieve
the whole record for ID 104, and the hash function is mod (5) on that ID, the
bucket address generated would be 4. We then go directly to address 4 and
retrieve the whole record for ID 104. Here ID acts as the hash key.
Deletion – To delete a record, we first use the hash function to fetch the record
that is to be deleted, and then remove the record at that address.
Update – The data record that needs to be updated is first located using the hash
function, and then the data record is updated.
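The four operations can be sketched in Python using the mod (5) hash function from the example. This is a minimal sketch under stated assumptions: each bucket is modelled as an in-memory dictionary with unlimited room (so overflow is ignored here), and the stored values "Alice" and "Bob" are made up for illustration.

```python
NUM_BUCKETS = 5  # fixed for the lifetime of the file in static hashing


def h(key):
    """mod (5) hash function: always yields the same bucket for a given key."""
    return key % NUM_BUCKETS


# Assumption: every bucket is a dict from hash key to record.
buckets = [dict() for _ in range(NUM_BUCKETS)]


def insert(key, record):
    buckets[h(key)][key] = record            # bucket address = h(K)


def search(key):
    return buckets[h(key)].get(key)          # go straight to bucket h(K)


def update(key, record):
    if search(key) is None:                  # locate the record first
        return False
    buckets[h(key)][key] = record
    return True


def delete(key):
    return buckets[h(key)].pop(key, None) is not None


insert(104, "Alice")                         # 104 mod 5 = 4, so bucket 4
found = search(104)
```

Every operation starts with the same call to h, so none of them has to scan the other buckets.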
Bucket overflow – If we want to insert a new record into the file but the data bucket
address generated by the hash function is not empty, i.e., data already exists at that
address, we have a critical situation to handle. In static hashing this situation is
called bucket overflow.
Two methods are used to handle it:
1. Open hashing – In the open hashing method, the next available data block is used to
enter the new record, instead of overwriting the older one. This method is also
called linear probing. For example, D3 is a new record that needs to be inserted,
and the hash function generates the address 105, but that bucket is already full. So the
system searches for the next available data bucket, 123, and assigns D3 to it.
2. Closed hashing – In the closed hashing method, a new data bucket is allocated with
the same address and is linked after the full data bucket. This method is also
known as overflow chaining. For example, we have to insert a new record D3
into the table. The static hash function generates the data bucket address 105,
but this bucket is too full to store the new data. In this case, a new data bucket is
added at the end of data bucket 105 and is linked to it. The new record D3 is then
inserted into the new bucket.
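Both overflow strategies can be sketched in Python. The sketch uses five small buckets and the keys 3 and 8 instead of the addresses 105 and 123 from the examples; those numbers, the function names, and the one-record-per-bucket capacity in the probing table are assumptions made to keep the example short.

```python
NUM_BUCKETS = 5


def h(key):
    return key % NUM_BUCKETS


def insert_linear_probing(table, key, record):
    """Open hashing in these notes' terminology: use the next free bucket."""
    start = h(key)
    for step in range(NUM_BUCKETS):
        slot = (start + step) % NUM_BUCKETS   # wrap around at the end
        if table[slot] is None:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("all buckets are full")


def insert_overflow_chaining(table, key, record):
    """Closed hashing in these notes' terminology: chain onto the home bucket."""
    slot = h(key)
    table[slot].append((key, record))   # each list element stands for a linked bucket
    return slot


# Linear probing: one record per bucket.
probe_table = [None] * NUM_BUCKETS
insert_linear_probing(probe_table, 3, "D1")     # home bucket 3 is free
insert_linear_probing(probe_table, 8, "D3")     # bucket 3 is taken, so bucket 4

# Overflow chaining: D3 is linked after D1 at the same address.
chain_table = [[] for _ in range(NUM_BUCKETS)]
insert_overflow_chaining(chain_table, 3, "D1")
insert_overflow_chaining(chain_table, 8, "D3")
```

With probing, a later search for key 8 must also check the buckets after its home bucket; with chaining it only follows the chain at bucket 3.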
Dynamic Hashing –
The drawback of static hashing is that it does not expand or shrink dynamically as
the size of the database grows or shrinks. In dynamic hashing, data buckets are
added or removed dynamically as the number of records increases or decreases.
Dynamic hashing is also known as extendible hashing. In dynamic hashing, the hash
function is made to produce a large number of values. For example, there are three
data records D1, D2 and D3. The hash function generates the three addresses 1001,
0101 and 1010 respectively. This method of storing considers only part of the
address, initially only the first bit, to store the data. So it tries to load the three
records at addresses 0 and 1.
But the problem is that no bucket address remains for D3. The bucket structure has to
grow dynamically to accommodate D3. So the address is changed to use 2 bits rather
than 1 bit, the existing data is updated to have 2-bit addresses, and then the method
tries to accommodate D3.
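The growth step can be sketched in Python with the three hash values from the example. This is a deliberately simplified sketch, not full extendible hashing: it assumes one record per bucket and 4-bit hash strings, and it redistributes every record when the prefix grows, whereas a real scheme keeps a per-bucket depth and splits only the overflowing bucket. The class name PrefixHashFile is hypothetical.

```python
HASH_BITS = 4          # assumption: hash values are 4-bit strings
BUCKET_CAPACITY = 1    # assumption: one record per bucket


class PrefixHashFile:
    def __init__(self):
        self.depth = 1                         # number of prefix bits in use
        self.buckets = {"0": [], "1": []}

    def _address(self, hash_bits):
        return hash_bits[: self.depth]         # first `depth` bits of the hash

    def _grow(self):
        """Use one more prefix bit and redistribute the stored records."""
        if self.depth == HASH_BITS:
            raise RuntimeError("no more hash bits available")
        old_records = [r for bucket in self.buckets.values() for r in bucket]
        self.depth += 1
        self.buckets = {
            format(i, f"0{self.depth}b"): [] for i in range(2 ** self.depth)
        }
        for hash_bits, record in old_records:
            self.buckets[self._address(hash_bits)].append((hash_bits, record))

    def insert(self, hash_bits, record):
        # Keep adding prefix bits until the target bucket has room.
        while len(self.buckets[self._address(hash_bits)]) >= BUCKET_CAPACITY:
            self._grow()
        self.buckets[self._address(hash_bits)].append((hash_bits, record))


f = PrefixHashFile()
f.insert("1001", "D1")   # 1 bit: goes to bucket 1
f.insert("0101", "D2")   # 1 bit: goes to bucket 0
f.insert("1010", "D3")   # bucket 1 is full, so the prefix has to grow
```

Under the one-record-per-bucket assumption, moving to 2 bits is not yet enough for these values, because 1001 and 1010 share the prefix 10; the sketch therefore grows once more, to 3 bits, before D3 fits.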
B+ Tree
o The B+ tree is a balanced multiway search tree. It follows a multi-level index format.
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.
o It contains internal nodes and leaf nodes.
Internal node
o An internal node of the B+ tree, except the root node, contains at least n/2 child pointers.
o At most, an internal node of the tree contains n pointers.
Leaf node
o The leaf node of the B+ tree contains at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to next leaf node.
So, in the intermediary node, we will find the branch between the keys 50 and 75. That
branch leads us to the third leaf node, where the DBMS performs a sequential search
to find 55.
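The search for 55 can be sketched in Python on a small two-level tree. Only the keys 50, 55 and 75 come from the example above; the remaining key values, the list-based node layout, and the function names are assumptions for illustration.

```python
import bisect

# Hypothetical two-level B+ tree: one intermediary node above four leaves.
root_keys = [25, 50, 75]
leaves = [
    [10, 15, 20],
    [25, 30, 35, 45],
    [50, 55, 65, 70],
    [75, 80, 85, 90],
]


def find_leaf(key):
    """Pick the branch whose key range contains `key` (0-based leaf number)."""
    return bisect.bisect_right(root_keys, key)


def search(key):
    """Return the leaf number holding `key`, or None if it is absent."""
    leaf_no = find_leaf(key)
    for k in leaves[leaf_no]:      # sequential search inside the leaf
        if k == key:
            return leaf_no
    return None


leaf_of_55 = search(55)   # branch between 50 and 75, i.e. the third leaf
```

The call returns the 0-based position 2, which corresponds to the third leaf node in the text.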
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node
after 55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert
60 there.
In this case, we have to split the leaf node, so that it can be inserted into tree without affecting
the fill factor, balance and order.
The 3rd leaf node would then hold the values (50, 55, 60, 65, 70), and its key in the
intermediate node is 50. We split the leaf node in the middle so that the balance of the tree
is not altered. So we can group (50, 55) and (60, 65, 70) into 2 leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone. It should
have 60 added to it, and then we can have a pointer to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where it fits and then place it in that leaf node.
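The split for inserting 60 can be sketched in Python. This is a minimal sketch under stated assumptions: a leaf holds at most four keys, nodes are plain lists, only the 3rd leaf and its neighbours are modelled, and overflow of the intermediate node itself is not handled. The values other than 50, 55, 60, 65 and 70 are made up.

```python
import bisect

LEAF_CAPACITY = 4  # assumption: a leaf holds at most four keys


def insert_into_leaf(parent_keys, leaves, leaf_no, key):
    """Insert `key` into leaves[leaf_no]; split the leaf if it overflows."""
    leaf = leaves[leaf_no]
    bisect.insort(leaf, key)                 # keep the leaf sorted
    if len(leaf) <= LEAF_CAPACITY:
        return                               # normal case: the key simply fits
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]     # split in the middle
    leaves[leaf_no] = left
    leaves.insert(leaf_no + 1, right)        # new leaf sits right after the old one
    bisect.insort(parent_keys, right[0])     # copy the new leaf's first key upward


parent_keys = [25, 50, 75]
leaves = [
    [10, 15, 20],
    [25, 30, 35, 45],
    [50, 55, 65, 70],    # the full 3rd leaf
    [75, 80, 85, 90],
]
insert_into_leaf(parent_keys, leaves, 2, 60)
```

After the call, the 3rd leaf holds (50, 55), a new 4th leaf holds (60, 65, 70), and 60 has been added to the intermediate node, matching the steps described above.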
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node too. If we remove it from the
intermediate node, then the tree will not satisfy the rule of the B+ tree. So we need to modify
it to have a balanced tree.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows: