DBMS Unit-5

This document discusses different methods of data storage and file organization. It covers three types of data storage - primary, secondary, and tertiary - and describes common storage devices like RAM, hard drives, and tapes. It also outlines several methods of file organization, including sequential, heap, hash-based, and indexed sequential access (ISAM). The sequential method stores records sequentially in either a pile or sorted order, while heap and hash-based methods store records randomly without sorting. ISAM generates an index value for each record's primary key to quickly retrieve records.

Uploaded by

Mr. Skull Editor

UNIT-5: Data on External Storage, File Organization and Indexing, Cluster
Indexes, Primary and Secondary Indexes, Index Data Structures, Hash Based
Indexing, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic
Index Structure.

Data on External Storage:

Databases are stored in file formats. Data is stored in bits and bytes in different
storage devices. At the physical level, actual data is stored in some devices in
electromagnetic form. Storage systems can be classified into three types of
storage devices.

Types of Data Storage


There are three types of data storage devices namely primary, secondary and
tertiary storage.                

1. Primary Storage
It is the memory storage that is directly accessible to the CPU. These devices
are typically small and ultra-fast, so they provide quick access to the stored
data. They store data only temporarily and are therefore also termed volatile
storage. Volatile means that if the system shuts down (restarts, loses power, or
crashes), the data in these devices is lost permanently, and the space becomes
free again for other purposes.

The CPU's main memory and cache are two important primary storage devices.
 Main memory: The main memory holds the instructions and data the computer
is currently working on. It is generally termed RAM (Random Access Memory).
RAM stores operating system software, software applications, and other
information so that the CPU has direct access when needed.

 Cache: It is a tiny storage medium that is maintained by the computer
hardware only. A cache is much faster than RAM and makes data retrieval
easier and more efficient.

2. Secondary Storage
Secondary storage is the storage area that allows you to save and store data
permanently, and it is also called 'online storage'. Secondary storage is
'non-volatile' storage, i.e., unlike primary storage, it doesn't lose data if the
system restarts, crashes, or loses power. Secondary storage media are:

 Hard Drive: A hard drive, hard disk, HDD, or HD is a non-volatile data
storage device. It is usually installed inside the computer and directly
connected to the motherboard. It contains one or more platters housed
inside an air-sealed casing; data is written on these platters by one or
more magnetic heads that move to and fro as the platters spin. Hard drives
come in various sizes and can store data as per the user's need.

 Magnetic Disk storage: A magnetic disk is a flat disk covered with a
magnetic coating that uses a magnetization process to read, write and
access data. It stores data in the form of tracks, spots, and sectors. Floppy
disks and zip disks are typical examples of magnetic disks.

3. Tertiary Storage

Tertiary storage devices are used to store immense amounts of data. These
storage devices are external to the computer system and are the slowest of the
three types. Generally, they are used to back up an entire system. Optical
disks (optical storage) and magnetic tapes (tape storage) are commonly used
tertiary storage devices:

 Tape Storage: In a tape storage system, magnetic tape is used as the
recording medium. Generally, magnetic tapes are used for archiving and
backing up data for long-term storage.
 Optical Storage: In optical storage, data is read and written with a laser.
Typically data is written on a Digital Versatile Disc (DVD) or Compact
Disc (CD). Optical media is more durable and less vulnerable to
environmental conditions than tape storage.

                                                                                       
FILE ORGANIZATION

 A file is a collection of records. Using the primary key, we can access
the records. The type of file organization used for a given set of records
should match the type and frequency of access those records need.
 File organization defines how file records are mapped onto disk blocks.
 Files of fixed-length records are easier to implement than files of
variable-length records.

Objective of file organization:

 Records should be selected optimally, i.e., as fast as possible.
 Insert, delete or update transactions on the records should be quick and
easy.
 Insert, update or delete operations should not introduce duplicate
records.
 Records should be stored efficiently, so that the cost of storage is minimal.

Types of file organization:

In file organization, the programmer chooses the best-suited file organization
method according to the requirements.

1. Sequential file organization


This is the easiest method of file organization. In this method, records are
stored sequentially in the file. It can be implemented in two ways:

 Pile File Method:

o It is quite a simple method. In this method, we store the records in a
sequence, i.e., one after another, in the order in which they are inserted
into the table.
o When a record is updated or deleted, it is first searched for in the memory
blocks. When it is found, it is marked for deletion, and the new record is
inserted.
 Insertion of a new record: Suppose we have a sequence of records R1, R3 and
so on up to R9 and R8. Here, a record is nothing but a row in a table.
Suppose we want to insert a new record R2 in the sequence; it will be
placed at the end of the file.
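
The pile file behaviour described above can be sketched in a few lines of
Python. This is only an illustration under simplifying assumptions: the class
name PileFile, its methods, and the use of an in-memory list in place of disk
blocks are all hypothetical, and only the four records named in the text are
used.

```python
class PileFile:
    """Unsorted sequential file: records are kept in arrival order."""

    def __init__(self, records=None):
        self.records = list(records or [])
        self.deleted = set()  # positions marked for deletion

    def insert(self, record):
        # A new record always goes to the end of the file.
        self.records.append(record)

    def find(self, record):
        # Sequential scan from the start of the file.
        for pos, rec in enumerate(self.records):
            if rec == record and pos not in self.deleted:
                return pos
        return None

    def update(self, old, new):
        # Mark the old version as deleted, then append the new version.
        pos = self.find(old)
        if pos is None:
            return False
        self.deleted.add(pos)
        self.insert(new)
        return True


pile = PileFile(["R1", "R3", "R9", "R8"])
pile.insert("R2")
print(pile.records)  # ['R1', 'R3', 'R9', 'R8', 'R2']
```

R2 ends up last even though it would sort second, which is exactly what makes
a pile file cheap to insert into and slow to search.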

 Sorted File Method:

o In this method, the new record is always inserted at the file's end, and
then the sequence is sorted in ascending or descending order. Records are
sorted on the primary key or any other key.
o When a record is modified, the record is updated first and the file is then
sorted again, so that the updated record ends up in the right place.

 Insertion of a new record: Suppose there is a preexisting sorted
sequence of records R1, R3 and so on up to R6 and R7. Suppose a new record
R2 has to be inserted in the sequence; it will be inserted at the end of
the file, and then the sequence will be sorted.
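
A minimal sketch of the sorted file insertion just described, assuming the
records are represented only by their integer keys (1 for R1, 3 for R3, and so
on). The function name sorted_file_insert is hypothetical.

```python
def sorted_file_insert(records, new_record, key=lambda r: r):
    """Append the new record at the end, then re-sort the whole file."""
    records.append(new_record)
    records.sort(key=key)
    return records


records = [1, 3, 6, 7]
sorted_file_insert(records, 2)
print(records)  # [1, 2, 3, 6, 7]
```

Re-sorting after every insert mirrors the description in the text. An
in-memory implementation could instead use Python's bisect.insort to place the
record directly at its sorted position.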
Advantages of sequential file organization

o It is a fast and efficient method for processing huge amounts of data.
o In this method, files can be easily stored on cheaper storage mechanisms
like magnetic tapes.
o This method is used when most of the records have to be accessed, as in
calculating student grades, generating salary slips, etc.
o This method is used for report generation or statistical calculations.

Disadvantages of sequential file organization

o It wastes time, because we cannot jump directly to a particular record that
is required; we have to move through the file sequentially.
o The sorted file method takes more time and space for sorting the records.

2. Heap file organization


o It is the simplest and most basic type of organization. It works with data
blocks. In heap file organization, the records are inserted at the file's end.
Inserting records does not require any sorting or ordering.
o When a data block is full, the new record is stored in some other block.
This need not be the very next block; the DBMS can select any data block in
memory to store new records. The heap file is also known as an unordered file.
o In the file, every record has a unique id, and every page in a file is of the
same size. It is the DBMS's responsibility to store and manage the new
records.
 Insertion of a new record: Suppose we have five records R1, R3, R6, R4
and R5 in a heap, and suppose we want to insert a new record R2. If data
block 3 is full, then R2 will be inserted in any of the data blocks
selected by the DBMS, let's say data block 1.

If we want to search, update or delete data in heap file organization, then we
need to traverse the data from the start of the file until we get the requested
record.

If the database is very large, then searching, updating or deleting a record
will be time-consuming, because there is no sorting or ordering of records in
heap file organization.
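
The heap file behaviour can be illustrated with a small sketch. Everything
here is an assumption made for the example: the class name HeapFile, the block
capacity of two records, the zero-based block numbers, and the "first block
with free space" placement policy (a real DBMS may choose any block).

```python
class HeapFile:
    """Unordered file made of fixed-capacity data blocks."""

    def __init__(self, num_blocks, block_capacity):
        self.block_capacity = block_capacity
        self.blocks = [[] for _ in range(num_blocks)]

    def insert(self, record):
        # Place the record in the first block that still has room.
        # A real DBMS is free to pick any block with free space.
        for block_no, block in enumerate(self.blocks):
            if len(block) < self.block_capacity:
                block.append(record)
                return block_no
        raise RuntimeError("no free space in any data block")

    def search(self, record):
        # No ordering, so scan every block from the start of the file.
        for block_no, block in enumerate(self.blocks):
            if record in block:
                return block_no
        return None


heap = HeapFile(num_blocks=3, block_capacity=2)
for rec in ["R1", "R3", "R6", "R4", "R5"]:
    heap.insert(rec)
print(heap.insert("R2"))  # 2: the only block with room left
print(heap.search("R4"))  # 1
```

The search method has to look at blocks in file order, which is why lookups in
a large heap file are slow.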
3. Hash File Organization
Hash File Organization computes a hash function on some fields of the records.
The hash function's output determines the location of the disk block where the records are to be
placed.

When a record has to be retrieved using the hash key columns, the address is generated,
and the whole record is retrieved using that address. In the same way, when a new record has
to be inserted, the address is generated using the hash key and the record is directly inserted.
The same process is applied in the case of delete and update.

In this method, no effort is spent on searching and sorting the entire file. Each record is
stored at the address computed from its hash key, so records appear in no particular order in
memory.
4. Indexed sequential access method (ISAM)
ISAM is an advanced sequential file organization. In this method, records are stored
in the file in primary key order. An index value is generated for each primary key and
mapped to the record. This index contains the address of the record in the file.

If a record has to be retrieved based on its index value, then the address of the data block is
fetched from the index and the record is retrieved from that block.
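
As a rough illustration of index-based retrieval, the sketch below keeps a
sorted list of primary keys alongside their record addresses and uses binary
search from Python's standard bisect module. The keys, the address strings and
the function names are made up for the example; a real ISAM file keeps its
index on disk.

```python
import bisect

# Hypothetical index: sorted primary keys with the matching record addresses.
index_keys = [101, 104, 109, 115, 120]
index_addrs = ["blk0:0", "blk0:1", "blk1:0", "blk1:1", "blk2:0"]


def lookup(key):
    """Binary-search the index and return the record address, or None."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return index_addrs[i]
    return None


def range_lookup(low, high):
    """Return addresses of all records with low <= key <= high."""
    start = bisect.bisect_left(index_keys, low)
    end = bisect.bisect_right(index_keys, high)
    return index_addrs[start:end]


print(lookup(109))             # blk1:0
print(range_lookup(104, 115))  # ['blk0:1', 'blk1:0', 'blk1:1']
```

Because the index is sorted, a range of keys maps to one contiguous slice of
index entries, which is what makes range retrieval cheap.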

Pros of ISAM:
o Since each index entry holds the address of its record's data block, searching for a record in
a huge database is quick and easy.
o This method supports range retrieval and partial retrieval of records. Since the index is based
on the primary key values, we can retrieve the data for a given range of values. In the same
way, a partial value can also be easily searched, e.g., student names starting with 'JA'.

Cons of ISAM
o This method requires extra space in the disk to store the index value.
o When the new records are inserted, then these files have to be reconstructed to maintain the
sequence.
o When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
5. B+ Tree File Organization
o B+ tree file organization is an advanced form of the indexed sequential access method. It
uses a tree-like structure to store records in the file.
o It uses the same key-index concept, where the primary key is used to sort the records. For
each primary key, an index value is generated and mapped to the record.
o The B+ tree is similar to a binary search tree (BST), but a node can have more than two
children. In this method, all the records are stored only at the leaf nodes. Intermediate nodes
act as pointers to the leaf nodes; they do not contain any records.

The above B+ tree shows that:


o There is one root node of the tree, i.e., 25.
o There is an intermediary layer of nodes. They do not store the actual records; they have
only pointers to the leaf nodes.
o The node to the left of the root holds a value smaller than the root, and the node to the right
holds a value larger than the root, i.e., 15 and 30 respectively.
o There is a single leaf level, which holds the actual values, i.e., 10, 12, 17, 20, 24, 27 and 29.
o Searching for any record is easier because all the leaf nodes are at the same depth.
o In this method, any record can be reached by traversing a single path from the root.
Pros of B+ tree file organization
o In this method, searching becomes very easy, as all the records are stored only in the leaf
nodes and kept sorted in a sequential linked list.
o Traversing through the tree structure is easier and faster.
o The size of the B+ tree has no restrictions, so the number of records can increase or decrease
and the B+ tree structure can grow or shrink accordingly.
o It is a balanced tree structure, so inserts, updates and deletes do not degrade the performance
of the tree.

Cons of B+ tree file organization


o This method is inefficient for static tables, i.e., tables whose data rarely changes.

6. Cluster file organization


o When records of two or more tables are stored in the same file, it is known as a cluster. These
files hold two or more tables in the same data block, and the key attributes that are used to map
these tables together are stored only once.
o This method reduces the cost of searching for related records in different files.
In this method, we can directly insert, update or delete any record. Data is sorted based on the
key with which searching is done. The cluster key is the key on which the tables are joined.

Types of Cluster file organization:


Cluster file organization is of two types:

1. Indexed Clusters:
In an indexed cluster, records are grouped based on the cluster key and stored together. An
EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster: all the
records are grouped based on the cluster key DEP_ID.

2. Hash Clusters:
It is similar to the indexed cluster. In a hash cluster, instead of storing the records based on the
cluster key itself, we generate a hash key value from the cluster key and store together the
records with the same hash key value.

Pros of Cluster file organization


o Cluster file organization is useful when there are frequent requests to join the tables
with the same joining condition.
o It gives efficient results when there is a 1:M mapping between the tables.

Cons of Cluster file organization


o This method has low performance for very large databases.
o If the joining condition changes, this method cannot be used: traversing the file under a
different joining condition takes a lot of time.

Indexing in DBMS
o Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a database table
quickly.

Index structure:
An index is built from some of the columns of a database table and itself has two columns.

o The first column of the index is the search key, which contains a copy of the primary key or
candidate key of the table. The values of the search key are stored in sorted order so that the
corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers holding
the address of the disk block where the value of the particular key can be found.

Types of Indexing in DBMS


1. Primary Index

o If the index is created on the basis of the primary key of the table, then it is known as primary
indexing. Primary keys are unique to each record, so there is a 1:1 relation between index
entries and records.
o As primary keys are stored in sorted order, the searching operation is quite efficient.

The primary index can be classified into two types:

 Dense index
 Sparse index.

Dense Index
In a dense index, an index record is created for every search-key value in the
database. This makes searching faster but needs more space to store the index
records. In this indexing method, each index record contains the search-key value
and a pointer to the real record on the disk.
Sparse Index
In a sparse index, index records appear for only some of the search-key values in
the file. A sparse index helps to resolve the space issues of dense indexing. In
this indexing technique, a range of search-key values shares the same data block
address, and when data needs to be retrieved, that block address is fetched and
the block is searched.

A sparse index needs less space and less maintenance overhead for insertions and
deletions, but it is slower than a dense index at locating records.
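
The trade-off between the two index types can be seen in a small sketch. The
data file, its block layout and the function names below are hypothetical; the
sparse index here keeps one entry per block (the first key of the block), which
is one common way to build it.

```python
import bisect

# Hypothetical sorted data file split into blocks of (key, value) records.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Dense index: one entry per search key -> (block number, slot in block).
dense = {key: (b, s)
         for b, block in enumerate(blocks)
         for s, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]


def dense_lookup(key):
    pos = dense.get(key)
    if pos is None:
        return None
    b, s = pos
    return blocks[b][s][1]


def sparse_lookup(key):
    # Find the last block whose first key is <= the search key ...
    b = bisect.bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    # ... then scan inside that block for the record.
    for k, value in blocks[b]:
        if k == key:
            return value
    return None


print(len(dense), len(sparse_keys))         # 9 3
print(dense_lookup(50), sparse_lookup(50))  # e e
```

The dense index needs nine entries and the sparse index only three, but the
sparse lookup pays for that with a scan inside the block.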

2. Secondary Index

A secondary index in DBMS is built on a field other than the one that orders the
data file. The field may be a candidate key with a unique value for each record, or
a non-key field with duplicate values. It is also known as a non-clustering index.
This two-level database indexing technique is used to reduce the mapping size of
the first level. For the first level, a large range of numbers is selected; because
of this, the mapping size always remains small.

Secondary Index Example


In a bank account database, data is stored sequentially by acc_no; you may want to
find all accounts of a specific branch of ABC bank.

Here, you can have a secondary index in DBMS with an index record for every
search-key value. Each index record points to a bucket that contains pointers to
all the records with that specific search-key value.
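
The bank example can be sketched as follows. The account records, the branch
names and the helper accounts_in_branch are invented for illustration, and list
positions stand in for record pointers.

```python
from collections import defaultdict

# Hypothetical account file, stored sequentially by acc_no.
accounts = [
    {"acc_no": 101, "branch": "North", "balance": 500},
    {"acc_no": 102, "branch": "South", "balance": 900},
    {"acc_no": 103, "branch": "North", "balance": 250},
    {"acc_no": 104, "branch": "East", "balance": 700},
]

# Secondary index: branch -> bucket of pointers (here, positions in the file).
branch_index = defaultdict(list)
for position, record in enumerate(accounts):
    branch_index[record["branch"]].append(position)


def accounts_in_branch(branch):
    """Follow the bucket's pointers to fetch every matching record."""
    return [accounts[pos] for pos in branch_index.get(branch, [])]


print([r["acc_no"] for r in accounts_in_branch("North")])  # [101, 103]
```

The file itself stays ordered by acc_no; only the index is organized by
branch.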

3. Clustering Index

o A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns, which may not be unique for each record.
o In this case, to identify the records faster, we group two or more columns to get a unique
value and create an index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for these
groups.

Example: Suppose a company has several employees in each department. Suppose we
use a clustering index, where all employees who belong to the same Dept_ID are
considered to be within a single cluster, and index pointers point to the cluster as a whole.
Here Dept_ID is a non-unique key. Using a separate disk block for each cluster is considered
the better technique.
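
A small sketch of the Dept_ID example, assuming one list per cluster stands in
for one disk block per cluster. The employee names and the helper employees_in
are hypothetical.

```python
from itertools import groupby

# Hypothetical employee records as (Dept_ID, name) pairs.
employees = [(2, "Asha"), (1, "Ravi"), (3, "Meena"), (1, "John"), (2, "Kiran")]

# Order the data file by the non-unique clustering key Dept_ID.
data_file = sorted(employees, key=lambda e: e[0])

# One disk block per cluster, and one index entry per distinct Dept_ID.
clusters = []
cluster_index = {}
for dept_id, group in groupby(data_file, key=lambda e: e[0]):
    cluster_index[dept_id] = len(clusters)  # pointer to the cluster's block
    clusters.append(list(group))


def employees_in(dept_id):
    block_no = cluster_index.get(dept_id)
    return [] if block_no is None else [name for _, name in clusters[block_no]]


print(employees_in(1))  # ['Ravi', 'John']
```

The index holds three entries for five employees, because each entry points to
a whole cluster rather than to a single record.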
HASHING:
In a database management system, when we want to retrieve a particular piece of
data, it becomes very inefficient to search all the index values to reach the desired
data. In this situation, the hashing technique comes into use.
Hashing is an efficient technique to directly compute the location of the desired
data on the disk without using an index structure. Data is stored in the data blocks
whose address is generated by using a hash function. The memory location where
these records are stored is called a data block or data bucket.

Hash File Organization:

 Data bucket – Data buckets are the memory locations where the records are
stored. These buckets are also considered the unit of storage.
 Hash Function – A hash function is a mapping function that maps the set of
search keys to actual record addresses. Generally, the hash function uses the
primary key to generate the hash index, i.e., the address of the data block. A
hash function can be anything from a simple mathematical function to a
complex one.
 Hash Index – The prefix of an entire hash value is taken as a hash index. Every
hash index has a depth value that signifies how many bits of the hash value are
used. With a depth of n bits, 2^n buckets can be addressed.
Types of Hashing:

Static Hashing:

In static hashing, when a search-key value is provided, the hash function always
computes the same address. For example, if we want to generate an address for
STUDENT_ID = 104 using the mod (5) hash function, it always results in the same
bucket address 4. The bucket address never changes. Hence the number of data
buckets in memory remains constant throughout.
Operations:

 Insertion – When a new record is inserted into the table, the hash function h
generates a bucket address for the new record based on its hash key K:
bucket address = h(K).
 Searching – When a record needs to be searched, the same hash function is used
to retrieve the bucket address for the record. For example, if we want to retrieve
the whole record for ID 104, and the hash function is mod (5) on that ID, the
bucket address generated would be 4. Then we go directly to address 4 and
retrieve the whole record for ID 104. Here ID acts as the hash key.
 Deletion – If we want to delete a record, we first use the hash function to
fetch the record which is supposed to be deleted. Then we remove the
record at that address in memory.
 Updation – The data record that needs to be updated is first searched using the
hash function, and then the data record is updated.
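
The four operations above can be sketched with the mod (5) hash function from
the text. The function names and the record contents are assumptions, and each
bucket is an unbounded Python list, so this sketch deliberately ignores bucket
capacity.

```python
NUM_BUCKETS = 5  # fixed for the lifetime of the file (static hashing)
buckets = [[] for _ in range(NUM_BUCKETS)]


def h(key):
    """The mod (5) hash function from the text."""
    return key % NUM_BUCKETS


def insert(key, record):
    buckets[h(key)].append((key, record))


def search(key):
    for k, record in buckets[h(key)]:
        if k == key:
            return record
    return None


def delete(key):
    bucket = buckets[h(key)]
    before = len(bucket)
    bucket[:] = [(k, r) for k, r in bucket if k != key]
    return len(bucket) < before


def update(key, new_record):
    bucket = buckets[h(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, new_record)
            return True
    return False


insert(104, "student 104")
print(h(104))       # 4
print(search(104))  # student 104
```

Every operation starts with the same call to h, which is why the bucket address
for a given key never changes.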
If we want to insert a new record into the file but the data bucket address
generated by the hash function is not empty, i.e., data already exists at that
address, we have a critical situation to handle. This situation in static
hashing is called bucket overflow.
Two methods are used to handle it:

1. Open hashing – In the open hashing method, the next available data block is
used to enter the new record, instead of overwriting the older one. This method
is also called linear probing. For example, D3 is a new record that needs to be
inserted, and the hash function generates the address 105. But that bucket is
already full. So the system searches for the next available data bucket, 123,
and assigns D3 to it.
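
Linear probing can be sketched as below. The table size of 7, the integer keys
and the function name are assumptions chosen to force collisions; they are not
the addresses 105 and 123 from the example above.

```python
def linear_probe_insert(table, key, record):
    """Insert into a fixed-size table, stepping to the next slot on collision."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size
        if table[slot] is None:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 7
print(linear_probe_insert(table, 10, "D1"))  # 3
print(linear_probe_insert(table, 17, "D2"))  # 4 (slot 3 is taken)
print(linear_probe_insert(table, 24, "D3"))  # 5 (slots 3 and 4 are taken)
```

All three keys hash to slot 3, so the second and third records slide forward to
the next free slots.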

2. Closed hashing – In the closed hashing method, a new data bucket is allocated
with the same address and is linked after the full data bucket. This method is
also known as overflow chaining. For example, we have to insert a new record D3
into the table. The static hash function generates the data bucket address 105,
but this bucket is too full to store the new data. In this case, a new data
bucket is added at the end of data bucket 105 and is linked to it. Then the new
record D3 is inserted into the new bucket.
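
Overflow chaining for a single bucket address can be sketched like this. The
Bucket class and the capacity of two records per bucket are assumptions made
for the example.

```python
class Bucket:
    """A fixed-capacity data bucket with a link to an overflow bucket."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None  # next bucket in the chain, if any

    def insert(self, record):
        bucket = self
        # Walk the chain until a bucket with free space is found.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(self.capacity)
            bucket = bucket.overflow
        bucket.records.append(record)

    def all_records(self):
        bucket, out = self, []
        while bucket is not None:
            out.extend(bucket.records)
            bucket = bucket.overflow
        return out


bucket_105 = Bucket(capacity=2)
for rec in ["D1", "D2", "D3"]:
    bucket_105.insert(rec)
print(bucket_105.records)           # ['D1', 'D2']
print(bucket_105.overflow.records)  # ['D3']
```

A search for D3 hashes to bucket 105 as usual and then follows the overflow
link, so long chains make lookups slower.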

 Quadratic probing: Quadratic probing is very similar to open hashing
or linear probing. The only difference is in the gap between the old and
the new bucket: in linear probing this gap is linear, whereas here a
quadratic function is used to determine the new bucket address.
 Double hashing: Double hashing is another method similar to linear
probing. Here the difference is fixed, as in linear probing, but this fixed
difference is calculated by using another hash function. That's why the
name is double hashing.
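
The two probe sequences can be compared side by side. This is a sketch under
assumptions: the table size of 11, the key 24, and in particular the second
hash function 1 + key % (size - 1) are illustrative choices, not something the
text prescribes.

```python
def quadratic_probes(key, size, attempts):
    """Slots tried by quadratic probing: home, home + 1, home + 4, home + 9, ..."""
    home = key % size
    return [(home + i * i) % size for i in range(attempts)]


def double_hash_probes(key, size, attempts):
    """Slots tried by double hashing: the step comes from a second hash."""
    home = key % size
    step = 1 + key % (size - 1)  # assumed second hash function; never zero
    return [(home + i * step) % size for i in range(attempts)]


print(quadratic_probes(24, 11, 4))    # [2, 3, 6, 0]
print(double_hash_probes(24, 11, 4))  # [2, 7, 1, 6]
```

Both sequences start at the same home slot 2. The quadratic gaps grow (1, 3,
5), while the double hashing gap stays fixed at 5 for this key and would be
different for another key.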

Dynamic Hashing –
The drawback of static hashing is that it does not expand or shrink dynamically as
the size of the database grows or shrinks. In dynamic hashing, data buckets are
added or removed dynamically as the number of records increases or decreases.
Dynamic hashing is also known as extended (or extendible) hashing. In dynamic
hashing, the hash function is made to produce a large number of values. For
example, there are three data records D1, D2 and D3. The hash function generates
three addresses 1001, 0101 and 1010 respectively. This method of storing considers
only part of this address, initially only the first bit, to store the data. So it
tries to load the three of them at addresses 0 and 1.

But the problem is that no bucket address remains for D3. The buckets have to
grow dynamically to accommodate D3. So the method changes the address to have 2
bits rather than 1 bit, and then it updates the existing data to have 2-bit
addresses. Then it tries to accommodate D3.
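
The effect of widening the address can be checked with a short sketch that
takes the first bits of the three hash values from the example. The function
name bucket_address and the string representation of the hash values are
assumptions; bucket splitting itself is not modelled.

```python
def bucket_address(hash_bits, depth):
    """Use only the first `depth` bits of the full hash value."""
    return hash_bits[:depth]


hashes = {"D1": "1001", "D2": "0101", "D3": "1010"}

for depth in (1, 2, 3):
    print(depth, {rec: bucket_address(bits, depth) for rec, bits in hashes.items()})
# 1 {'D1': '1', 'D2': '0', 'D3': '1'}
# 2 {'D1': '10', 'D2': '01', 'D3': '10'}
# 3 {'D1': '100', 'D2': '010', 'D3': '101'}
```

With these particular hash values, D1 and D3 still share the prefix 10 at a
depth of 2 bits. If each bucket holds only one record, a third bit is needed
before the two records land in different buckets.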

B+ Tree
o The B+ tree is a balanced search tree in which a node can have more than two children. It
follows a multi-level index format.
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf
nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at an equal distance from the root node. The B+ tree is of
order n, where n is fixed for every B+ tree.
o It contains internal nodes and leaf nodes.

Internal node

o Every internal node of the B+ tree except the root contains at least n/2 pointers.
o At most, an internal node of the tree contains n pointers.

Leaf node

o A leaf node of the B+ tree contains at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P that points to the next leaf node.

Searching a record in B+ Tree


Suppose we have to search for 55 in the example B+ tree structure. First, we fetch the
intermediary node, which directs us to the leaf node that can contain a record for 55.

So, in the intermediary node, we find the branch between the keys 50 and 75. At the
end, we are redirected to the third leaf node. Here the DBMS performs a sequential search
to find 55.
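
The search just described can be sketched on a tiny two-level tree. The node
contents below are hypothetical, chosen only so that 55 lies between the keys
50 and 75 and lands in the third leaf, and the tree is flattened into plain
lists instead of linked node objects.

```python
import bisect

# Hypothetical two-level B+ tree: one internal node above four linked leaves.
internal_keys = [25, 50, 75]
leaves = [
    [10, 15, 20],
    [25, 30, 35],
    [50, 55, 65, 70],
    [75, 80, 85],
]


def find_leaf(key):
    """Pick the child whose key range contains `key`."""
    # Keys below 25 go to leaf 0, keys in [25, 50) to leaf 1, and so on.
    return bisect.bisect_right(internal_keys, key)


def search(key):
    leaf_no = find_leaf(key)
    # Sequential search inside the leaf, as the text describes.
    for slot, k in enumerate(leaves[leaf_no]):
        if k == key:
            return leaf_no, slot
    return None


print(search(55))  # (2, 1): third leaf, second slot
print(search(60))  # None
```

Only one leaf is ever examined per search, because the internal node narrows
the search down to a single branch first.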
B+ Tree Insertion
Suppose we want to insert a record 60 in the example structure. It will go to the 3rd leaf
node, after 55. The tree is balanced and this leaf node is already full, so we cannot insert
60 there.

In this case, we have to split the leaf node, so that the record can be inserted into the tree
without affecting the fill factor, balance and order.

With 60 added, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its key in the
intermediate node is 50. We split the leaf node in the middle so that the balance of the tree is
not altered. So we can group (50, 55) and (60, 65, 70) into 2 leaf nodes.

If these two have to be leaf nodes, the intermediate node cannot branch from 50 alone. It
should have 60 added to it, and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is an overflow. In a normal scenario, it is very
easy to find the node where the record fits and then place it in that leaf node.
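
The leaf split in this example can be sketched as follows. The leaf capacity of
4 keys, the internal keys 25, 50 and 75, and the function name are assumptions;
the sketch handles only a single leaf and does not propagate splits further up
the tree.

```python
import bisect


def insert_into_leaf(leaf, key, capacity):
    """Insert `key` into a sorted leaf; split it in the middle on overflow.

    Returns (left, right, separator). `right` and `separator` are None
    when no split was needed.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # The first key of the new right leaf is copied up into the parent.
    return left, right, right[0]


internal_keys = [25, 50, 75]
left, right, separator = insert_into_leaf([50, 55, 65, 70], 60, capacity=4)
if separator is not None:
    bisect.insort(internal_keys, separator)

print(left, right)    # [50, 55] [60, 65, 70]
print(internal_keys)  # [25, 50, 60, 75]
```

The split produces the same two groups as in the text, and 60 appears both in
the new leaf and in the intermediate node.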

B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node. If we simply remove it from the
intermediate node, the tree will no longer satisfy the rules of the B+ tree, so we need to
rearrange the nodes to keep the tree balanced.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
