Unit 6 notes DBMS final

The document provides an overview of storage systems and indexing in database management systems (DBMS), detailing the types of storage (primary, secondary, tertiary) and file organization methods (heap, sequential, hash, clustered). It also explains indexing techniques, including ordered indices, primary indexing, clustered indexing, secondary indexing, multilevel indexing, and B-trees, emphasizing their roles in efficient data retrieval. Additionally, it covers hashing methods, including static and dynamic hashing, and their operations for managing data records in databases.

UNIT -6

OVERVIEW OF STORAGES AND INDEXING


DBMS - Storage System
Databases are stored in files, which contain records. At the physical level, the actual data is
stored in electromagnetic format on some device. These storage devices can be broadly
categorized into three types −

 Primary Storage − The memory storage that is directly accessible to the CPU comes
under this category. CPU's internal memory (registers), fast memory (cache), and main
memory (RAM) are directly accessible to the CPU, as they are all placed on the
motherboard or CPU chipset. This storage is typically very small, ultra-fast, and volatile.
Primary storage requires continuous power supply in order to maintain its state. In case
of a power failure, all its data is lost.
 Secondary Storage − Secondary storage devices are used to store data for future use or
as backup. They are non-volatile and are not a part of the CPU chipset or motherboard,
for example, magnetic disks (hard disks), optical disks (DVD, CD, etc.), flash drives,
and magnetic tapes.
 Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since such
storage devices are external to the computer system, they are the slowest in speed. These
storage devices are mostly used to back up an entire system. Optical disks and magnetic
tapes are widely used as tertiary storage.
File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
File Organization to organize file records −
Heap File Organization
When a file is created using Heap File Organization, the operating system allocates a memory
area to that file without any further accounting details. File records can be placed anywhere in
that memory area. It is the responsibility of the software to manage the records. A heap file does
not support any ordering, sequencing, or indexing on its own.

Sequential File Organization


Every file record contains a data field (attribute) to uniquely identify that record. In sequential
file organization, records are placed in the file in some sequential order based on the unique key
field or search key. Practically, it is not possible to store all the records sequentially in physical
form.

Hash File Organization


Hash File Organization uses Hash function computation on some fields of the records. The
output of the hash function determines the location of disk block where the records are to be
placed.

Clustered File Organization


Clustered file organization is not considered good for large databases. In this mechanism,
related records from one or more relations are kept in the same disk block, that is, the ordering
of records is not based on primary key or search key.
DBMS - Indexing
We know that data is stored in the form of records. Every record has a key field, which helps it
to be recognized uniquely.
Indexing is a data structure technique used to quickly locate and retrieve records from the
database files based on the attributes on which the index has been built. Indexing in database
systems is similar to the index we see in books.
Indexes are created using a few database columns, and each index entry has two columns.
 The first column is the Search key that contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding data can
be accessed quickly.
Note: The data may or may not be stored in sorted order.
 The second column is the Data Reference or Pointer which contains a set of pointers
holding the address of the disk block where that particular key value can be found.
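In general, indexing methods follow one of two file organization mechanisms to store the
data: an ordered (sequential) organization, in which index entries are kept sorted on the
search key, or a hash organization, in which a hash function decides where an entry is placed.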

Indexing Methods

Ordered indices
Index entries are usually kept sorted to make searching faster. Indices whose entries are sorted
are known as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is 10
bytes long. The IDs start with 1, 2, 3, ... and so on, and we have to find the employee with ID 543.
In the case of a database with no index, the DBMS has to scan the disk blocks from the start until
it reaches 543, so it reads 543*10 = 5430 bytes to obtain the record.
In the case of an index, assuming each index entry is 2 bytes long, the DBMS searches the index
and reaches the record after reading 542*2 = 1084 bytes, which is far less than in the previous case.
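
The arithmetic in this example can be reproduced with a short sketch. The 10-byte record size comes from the example; the 2-byte index entry size is an assumption implied by the 542*2 figure, and the function names are illustrative only.

```python
RECORD_SIZE = 10       # bytes per data record (from the example)
INDEX_ENTRY_SIZE = 2   # bytes per index entry (assumed, implied by 542*2)


def bytes_read_without_index(target_id):
    """Linear scan: records 1..target_id are read in order."""
    return target_id * RECORD_SIZE


def bytes_read_with_index(target_id):
    """Index scan: the entries before the target's entry are read."""
    return (target_id - 1) * INDEX_ENTRY_SIZE


print(bytes_read_without_index(543))  # 5430
print(bytes_read_with_index(543))     # 1084
```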

Primary Index
If the index is created on the basis of the primary key of the table, then it is known as primary
indexing. Primary keys are unique to each record, so there is a 1:1 relation between index
entries and records.
As primary keys are stored in sorted order, the searching operation is quite efficient.
The primary index can be classified into two types: Dense index and Sparse index.

o Dense Index:
 For every search key value in the data file, there is an index record.
 This record contains the search key and also a reference to the first data
record with that search key value.

o Sparse Index:
 An index record appears only for some of the search key values in the data
file. Each index record points to a block.
 To locate a record, we find the index record with the largest search key
value less than or equal to the search key value we are looking for.
 We start at that record pointed to by the index record, and proceed along
with the pointers in the file (that is, sequentially) until we find the desired
record.
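
The sparse index lookup described above can be sketched as follows. The blocks, keys and record values are made up for illustration, and the index is assumed to hold the first key of each block.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of (key, record) pairs sorted by search key.
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Sparse index: one entry per block, holding that block's first key.
sparse_keys = [block[0][0] for block in blocks]  # [10, 40, 70]


def sparse_lookup(key):
    """Return the record for key, or None if it is not in the file."""
    # Find the index entry with the largest key <= the search key.
    pos = bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every indexed key
    # Scan sequentially from the start of the block the entry points to.
    for k, record in blocks[pos]:
        if k == key:
            return record
        if k > key:
            break  # passed the place where the key would be
    return None


print(sparse_lookup(50))  # E
print(sparse_lookup(55))  # None
```

A dense index would instead hold one entry for every key, so the sequential scan inside the block would not be needed.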
Clustered Indexing

A clustering index is defined on an ordered data file. The data file is ordered on a non-key field.
In some cases, the index is created on non-primary key columns, which may not be unique for
each record. In such cases, in order to identify the records faster, we group two or more
columns together to get unique values and create an index out of them. This method is known
as the clustering index. Basically, records with similar characteristics are grouped together and
indexes are created for these groups.
For example, students studying in each semester are grouped together, i.e. 1st semester students,
2nd semester students, 3rd semester students, etc. are grouped.

Clustered index sorted according to first name (Search key)


Secondary Indexing
In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced.
In this method, large ranges of the column values are selected initially so that the mapping size of
the first level stays small. Then each range is further divided into smaller ranges. The mapping of
the first level is stored in the primary memory, so that address fetch is faster. The mapping of the
second level and the actual data are stored in the secondary memory (hard disk).

For example:

o To find the record with roll 111, the search first looks for the highest entry that is smaller
than or equal to 111 in the first level index. It gets 100 at this level.
o Then in the second index level, it again looks for the highest entry that is smaller than or
equal to 111 and gets 110. Now using the address for 110, it goes to the data block and
searches each record sequentially until it gets 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also
done in the same manner.
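
The two-level search for roll 111 can be sketched as below. The index contents and the ten-rolls-per-block layout are assumptions chosen so that the lookup passes through 100 and 110 as in the example; only the first level would be kept in main memory.

```python
from bisect import bisect_right

# Hypothetical first-level index (small, kept in main memory).
first_level = [100, 200, 300]

# Hypothetical second-level index: each first-level entry points to a
# finer list of keys, and each of those keys points to a data block.
second_level = {
    100: [100, 110, 120],
    200: [200, 210, 220],
    300: [300, 310, 320],
}

# Assumed data blocks: ten consecutive roll numbers per block.
data_blocks = {
    start: list(range(start, start + 10))
    for keys in second_level.values()
    for start in keys
}


def largest_at_most(sorted_keys, key):
    """Return the largest entry <= key, or None if there is none."""
    pos = bisect_right(sorted_keys, key) - 1
    return sorted_keys[pos] if pos >= 0 else None


def find_roll(roll):
    """Return (first-level key, second-level key, roll), or None."""
    outer = largest_at_most(first_level, roll)
    if outer is None:
        return None
    inner = largest_at_most(second_level[outer], roll)
    # Sequential search inside the data block the second level points to.
    for r in data_blocks[inner]:
        if r == roll:
            return outer, inner, r
    return None


print(find_roll(111))  # (100, 110, 111)
```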

Multilevel Indexing
With the growth of the size of the database, indices also grow. A single-level index might
become too large to keep in the main memory, and searching it then requires multiple disk
accesses. Multilevel indexing segregates the main index into various smaller blocks so
that each can be stored in a single block. The outer blocks are divided into inner blocks,
which in turn point to the data blocks. The outer level can be easily stored in the main memory
with fewer overheads.

Multi-level Index helps in breaking down the index into several smaller indices in order to make
the outermost level so small that it can be saved in a single disk block, which can easily be
accommodated anywhere in the main memory.

B Tree
B Tree is a specialized m-way tree that is widely used for disk access. Each node of a B-Tree of
order m can have at most m-1 keys and m children. One of the main reasons for using a B tree is
its capability to store a large number of keys in a single node, which keeps the height of the tree
relatively small even for a large number of key values.

A B tree of order m contains all the properties of an M way tree. In addition, it contains the
following properties.

1. Every node in a B-Tree contains at most m children.
2. Every node in a B-Tree except the root node and the leaf nodes contains at least ⌈m/2⌉
children.
3. The root node must have at least 2 children, unless it is a leaf.
4. All leaf nodes must be at the same level.

It is not necessary that all the nodes contain the same number of children, but each internal
node other than the root must have at least ⌈m/2⌉ children.
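
As a quick check of these limits, the sketch below computes the bounds for a given order m. It assumes the common convention that the minimum number of children is the ceiling of m/2; the function name and dictionary keys are illustrative.

```python
from math import ceil


def btree_bounds(m):
    """Return the child and key limits for a B-tree of order m."""
    min_children = ceil(m / 2)  # assumed convention: ceiling of m/2
    return {
        "max_children": m,
        "max_keys": m - 1,
        "min_children_internal": min_children,    # non-root internal nodes
        "min_keys_non_root": min_children - 1,
        "min_children_root": 2,                   # only if the root is not a leaf
    }


print(btree_bounds(4))
print(btree_bounds(5))
```

For order 4, a non-root internal node has between 2 and 4 children and holds between 1 and 3 keys.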

A B tree of order 4 is shown in the following image.

While performing some operations on a B Tree, a property of the B Tree may be violated, such as
the minimum number of children a node can have. To maintain the properties of the B Tree,
nodes may be split or joined.

B+ Tree
o The B+ tree is a balanced multi-way search tree (not a binary tree). It follows a multi-level
index format.
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.

Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of
the order n where n is fixed for every B+ tree.
o It contains internal nodes and leaf nodes.
Internal node
o An internal node of the B+ tree, except the root node, contains at least ⌈n/2⌉ child
pointers.
o At most, an internal node of the tree contains n pointers.

Leaf node
o A leaf node of the B+ tree contains at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.

Searching a record in B+ Tree

Suppose we have to search for 55 in a B+ tree. First, we fetch the intermediary node, which
directs us to the leaf node that can contain a record for 55.

So, in the intermediary node, we find the branch between the keys 50 and 75. We are then
redirected to the third leaf node. Here the DBMS performs a sequential search to find
55.
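
The search for 55 can be sketched with a tiny hand-built tree. The node classes, the separator keys 25, 50 and 75, and the leaf contents are assumptions made for illustration; only the branch between 50 and 75 and the third leaf come from the example.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.records = ["rec%d" % k for k in keys]  # stand-ins for record pointers
        self.next = None                            # link to the next leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children  # len(children) == len(keys) + 1


def search(node, key):
    """Descend to the leaf that may hold key, then scan that leaf."""
    while isinstance(node, Internal):
        # Keys equal to a separator are stored in the child to its right.
        node = node.children[bisect_right(node.keys, key)]
    for k, record in zip(node.keys, node.records):
        if k == key:
            return record
    return None


leaves = [Leaf([10, 15, 20]), Leaf([25, 30, 35]),
          Leaf([50, 55, 65, 70]), Leaf([75, 80, 85])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right  # leaf-level linked list for sequential access
root = Internal([25, 50, 75], leaves)

print(search(root, 55))  # rec55
```

Because 55 lies between the separators 50 and 75, the descent picks the third child, and the final step is a sequential scan within that leaf.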

B+ Tree Insertion

Suppose we want to insert a record 60 into the same structure. It belongs in the 3rd leaf node,
after 55. The tree is balanced, and this leaf node is already full, so we cannot insert 60 there.

In this case, we have to split the leaf node, so that the record can be inserted into the tree
without affecting the fill factor, balance and order.
With 60 added, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its key in the
parent node is 50. We split the leaf node in the middle so that the balance is not altered. So we
can group (50, 55) and (60, 65, 70) into 2 leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone. It should
have 60 added to it, and then we can have a pointer to the new leaf node.

This is how we insert an entry when there is overflow. In a normal scenario, it is very easy to
find the node where the entry fits and then place it in that leaf node.
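
The leaf split in this example can be sketched as follows. The leaf capacity of 4 keys and the parent keys 25, 50 and 75 are assumptions; the function only handles the leaf level and leaves any further splitting of the parent out.

```python
from bisect import insort


def insert_into_leaf(leaf_keys, key, capacity):
    """Insert key into a sorted leaf; split the leaf if it overflows.

    Returns (left_keys, right_keys, separator). right_keys and separator
    are None when no split was needed.
    """
    keys = list(leaf_keys)
    insort(keys, key)
    if len(keys) <= capacity:
        return keys, None, None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    # The first key of the new right leaf is copied up into the parent.
    return left, right, right[0]


left, right, separator = insert_into_leaf([50, 55, 65, 70], 60, capacity=4)
print(left, right, separator)  # [50, 55] [60, 65, 70] 60

parent_keys = [25, 50, 75]     # assumed keys of the intermediate node
insort(parent_keys, separator)
print(parent_keys)             # [25, 50, 60, 75]
```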

B+ Tree Deletion

Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from
the intermediate node as well as from the 4th leaf node. If we simply remove it from the
intermediate node, the tree will no longer satisfy the rules of the B+ tree. So we need to
rearrange the nodes to keep the tree balanced.

After deleting 60 from the above B+ tree, the nodes are re-arranged so that the tree again
satisfies the B+ tree rules.
DBMS - Hashing
For a huge database structure, it can be very costly to search all the index values
through all its levels and then reach the destination data block to retrieve the desired data.
Hashing is an effective technique to calculate the direct location of a data record on the disk
without using an index structure.
Hashing uses hash functions with search keys as parameters to generate the address of a data
record.

Hash Organization
 Bucket − A hash file stores data in bucket format. A bucket is considered a unit of storage.
A bucket typically corresponds to one complete disk block, which in turn can store one or
more records.
 Hash Function − A hash function, h, is a mapping function that maps the set of all
search-keys K to the addresses where the actual records are placed. It is a function from
search keys to bucket addresses.

Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the
same address. For example, if a mod-4 hash function is used, then it generates only 4 values
(0, 1, 2 and 3). The output address is always the same for a given key. The number of buckets
provided remains unchanged at all times.
Operation
 Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be
stored.
Bucket address = h(K)
 Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
 Delete − This is simply a search followed by a deletion operation.
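
The three operations can be sketched with a mod-4 hash function over a fixed set of buckets. Integer search keys, unbounded in-memory lists standing in for disk buckets, and the sample records are all assumptions.

```python
NUM_BUCKETS = 4  # fixed for the lifetime of the file (static hashing)
buckets = [[] for _ in range(NUM_BUCKETS)]


def h(key):
    """mod-4 hash function: bucket address = h(K), always one of 0..3."""
    return key % NUM_BUCKETS


def insert(key, record):
    buckets[h(key)].append((key, record))


def search(key):
    # The same hash function locates the bucket; then scan inside it.
    for k, record in buckets[h(key)]:
        if k == key:
            return record
    return None


def delete(key):
    # A search followed by removal of the matching entry.
    bucket = buckets[h(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            del bucket[i]
            return True
    return False


insert(10, "record for key 10")
print(h(10))       # 2
print(search(10))  # record for key 10
```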

Bucket Overflow
The condition of bucket overflow is known as collision. This is a critical state for any static
hash function. In this case, overflow chaining or linear probing can be used.
 Overflow Chaining − When buckets are full, a new bucket is allocated for the same
hash result and is linked after the previous one. This mechanism is called Closed
Hashing.
 Linear Probing − When a hash function generates an address at which data is already
stored, the next free bucket is allocated to it. This mechanism is called Open Hashing.
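
Both overflow strategies can be sketched side by side. The bucket capacity of 2, the four buckets, the mod-based hash and the helper names are assumptions for illustration.

```python
BUCKET_CAPACITY = 2
NUM_BUCKETS = 4


class Bucket:
    def __init__(self):
        self.keys = []
        self.overflow = None  # link to an overflow bucket, if any


def chain_insert(table, key):
    """Overflow chaining: link a new bucket after a full one."""
    bucket = table[key % NUM_BUCKETS]
    while len(bucket.keys) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.keys.append(key)


def probe_insert(table, key):
    """Linear probing: use the next bucket that still has free space."""
    start = key % NUM_BUCKETS
    for step in range(NUM_BUCKETS):
        idx = (start + step) % NUM_BUCKETS
        if len(table[idx]) < BUCKET_CAPACITY:
            table[idx].append(key)
            return idx
    raise RuntimeError("all buckets are full")


chained = [Bucket() for _ in range(NUM_BUCKETS)]
for k in (0, 4, 8):            # all three keys hash to bucket 0
    chain_insert(chained, k)
print(chained[0].keys, chained[0].overflow.keys)  # [0, 4] [8]

probed = [[] for _ in range(NUM_BUCKETS)]
print([probe_insert(probed, k) for k in (0, 4, 8)])  # [0, 0, 1]
```

With chaining, the third key stays reachable from its home bucket through the overflow link; with probing, it lands in the neighbouring bucket, so a search has to follow the same probe sequence.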

Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the size of
the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets
are added and removed dynamically and on demand. Dynamic hashing is also known
as extended (extendible) hashing.
The hash function, in dynamic hashing, is made to produce a large number of values, of which
only a few are used initially.
Organization
The prefix of an entire hash value is taken as a hash index. Only a portion of the hash value is
used for computing bucket addresses. Every hash index has a depth value to signify how many
bits are used for computing a hash function. With a depth of n, these bits can address 2^n
buckets. When all these bits are consumed − that is, when all the buckets are full − then the
depth value is increased by one and twice as many buckets are allocated.
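
The way the depth value selects a prefix of the hash value can be sketched as below. The 8-bit hash width and the multiplicative hash function are assumptions, and the sketch covers only address computation, not bucket splitting.

```python
HASH_BITS = 8  # assumed width of the full hash value


def hash_value(key):
    """Hypothetical hash function producing an 8-bit value."""
    return (key * 37) % (1 << HASH_BITS)


def bucket_address(key, depth):
    """Use only the first `depth` bits (the prefix) of the hash value."""
    return hash_value(key) >> (HASH_BITS - depth)


for depth in (1, 2, 3):
    # A depth of n bits can address 2**n buckets.
    print(depth, 2 ** depth, bucket_address(5, depth))
```

Here hash_value(5) is 185, or 10111001 in binary, so depths 1, 2 and 3 give addresses 1, 2 and 5. Increasing the depth by one only appends one more bit to the prefix, which is why the number of addressable buckets doubles.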
