dbms 5th Unit (2)


Storage System in DBMS

A database system presents an abstract, unified view of the stored data. Physically, however, the data is stored as bits and bytes across different storage devices.

In this section, we will take an overview of various types of storage devices that are used for accessing and
storing data.

Types of Data Storage

For storing the data, there are different types of storage options available. These storage types differ from one another in speed, cost, and accessibility. The following types of storage devices are used for storing data:

o Primary Storage
o Secondary Storage
o Tertiary Storage

Primary Storage

It is the storage area that offers the quickest access to data. Primary storage is also known as volatile storage, because it does not hold data permanently: if the system suffers a power cut or a crash, the data is lost. Main memory and cache are the two types of primary storage.

o Main Memory: It holds the data and instructions that the machine is currently operating on, and handles each instruction of the computer. It can store gigabytes of data, but it is usually too small (and too costly) to hold an entire database. Its entire contents are lost if the system shuts down because of a power failure or other reasons.

o Cache: It is the costliest storage medium, but also the fastest. A cache is a tiny storage area, usually maintained by the computer hardware. Designers of query processors and data structures take cache effects into account when designing their algorithms.

Secondary Storage

Secondary storage is also called online storage. It is the storage area that allows the user to save and store data permanently. This type of memory does not lose its data on a power failure or system crash; that is why we also call it non-volatile storage.

There are some commonly described secondary storage media which are available in almost every type of
computer system:

o Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys, which plug into the USB slots of a computer system. These devices come in various capacities and make it easy to transfer data between systems. Unlike main memory, flash memory retains its contents through a power cut or crash. It is commonly used in server systems for caching frequently accessed data, which leads to high performance, and it can hold larger amounts of data than main memory.
o Magnetic Disk Storage: This type of storage medium is also known as online storage. A magnetic disk stores data for the long term and is capable of holding an entire database. The computer system is responsible for making data on disk available in main memory for access, and for writing modified data back to the disk. A great strength of a magnetic disk is that its data is not affected by a system crash or power failure; however, a failure of the disk itself can easily destroy the stored data.

Tertiary Storage

It is the storage type that is external to the computer system. It has the slowest access speed but is capable of storing very large amounts of data. It is also known as offline storage and is generally used for data backup. The following tertiary storage devices are available:

o Optical Storage: Optical storage can hold megabytes or gigabytes of data. A Compact Disk (CD) can store 700 megabytes of data with a playtime of around 80 minutes. A Digital Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on each side of the disk.
o Tape Storage: Tape is cheaper than disk. Generally, tapes are used for archiving or backing up data. Access is slow because data is read sequentially from the start, so tape storage is also known as sequential-access storage. Disk storage, by contrast, is known as direct-access storage, since we can directly access data at any location on the disk.
Storage Hierarchy

Besides the above, various other storage devices reside in a computer system. Storage media can be organized by data access speed, by the cost per unit of data, and by the medium's reliability. Thus, we can create a hierarchy of storage media on the basis of cost and speed.

Arranging the storage media described above in such a hierarchy, the higher levels (cache, main memory) are expensive but fast. Moving down the hierarchy, the cost per bit decreases while the access time increases. The media from main memory upward are volatile; everything below main memory is non-volatile.

RAID (Redundant Arrays of Independent Disks)


RAID, or “Redundant Arrays of Independent Disks” is a technique which makes use of a combination of
multiple disks instead of using a single disk for increased performance, data redundancy or both. The term was
coined by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in
1987.
Why data redundancy?
Data redundancy, although it takes up extra space, adds to disk reliability. In case of a disk failure, if the same data is also stored on another disk, we can retrieve the data and continue operation. On the other hand, if data is simply spread across multiple disks without the RAID technique, the loss of a single disk can make the entire data set unusable.
Key evaluation points for a RAID System
 Reliability: How many disk faults can the system tolerate?
 Availability: What fraction of the total session time is a system in uptime mode, i.e. how available is the
system for actual use?
 Performance: How good is the response time? How high is the throughput (rate of processing work)?
Note that performance involves many parameters, not just these two.
 Capacity: Given a set of N disks each with B blocks, how much useful capacity is available to the user?
RAID is transparent to the underlying system. This means that, to the host system, it appears as a single big disk presenting itself as a linear array of blocks. This allows older technologies to be replaced by RAID without making too many changes to the existing code.
Different RAID levels
RAID-0 (Striping)
 Blocks are “striped” across disks: consecutive blocks are placed on consecutive disks in round-robin order.

 With four disks, blocks 0, 1, 2, 3 form one stripe.


 Instead of placing just one block on a disk at a time, we can place two (or more) consecutive blocks on a disk before moving on to the next one.

Evaluation:
 Reliability: 0
There is no duplication of data. Hence, a block once lost cannot be recovered.
 Capacity: N*B
The entire space is being used to store data. Since there is no duplication, N disks each having B blocks are
fully utilized.
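As a minimal sketch of the round-robin placement described above (the `raid0_locate` helper and its signature are illustrative assumptions, not part of any real RAID implementation):

```python
# Hypothetical sketch: map a logical block number to a (disk, offset) pair
# under RAID-0 striping with one block per disk per stripe.

def raid0_locate(logical_block: int, num_disks: int) -> tuple:
    """Return (disk index, block offset on that disk) for a logical block."""
    return logical_block % num_disks, logical_block // num_disks

# With 4 disks, blocks 0, 1, 2, 3 form the first stripe:
print(raid0_locate(0, 4))  # (0, 0) -> block 0 sits on disk 0
print(raid0_locate(5, 4))  # (1, 1) -> block 5 sits on disk 1, second stripe
```

Because every disk holds unique data, all N*B blocks are usable, which matches the capacity figure above.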
RAID-1 (Mirroring)
 More than one copy of each block is stored in a separate disk. Thus, every block has two (or more) copies,
lying on different disks.
 A RAID-1 system with mirroring level 2 keeps exactly two copies of every block, on different disks.
 RAID-0 cannot tolerate any disk failure; RAID-1 trades capacity for this reliability.
Evaluation:
Assume a RAID system with mirroring level 2.
 Reliability: 1 to N/2
1 disk failure can be handled for certain, because blocks of that disk would have duplicates on some other
disk. If we are lucky enough and disks 0 and 2 fail, then again this can be handled as the blocks of these
disks have duplicates on disks 1 and 3. So, in the best case, N/2 disk failures can be handled.
 Capacity: N*B/2
Only half the space is being used to store data. The other half is just a mirror to the already stored data.

RAID-4 (Block-Level Striping with Dedicated Parity)


 Instead of duplicating data, this level adopts a parity-based approach.

 One column (disk) is dedicated entirely to parity.


 Parity is calculated using a simple XOR function. If the data bits are 0,0,0,1 the parity bit is XOR(0,0,0,1) = 1. If the data bits are 0,1,1,0 the parity bit is XOR(0,1,1,0) = 0. A simple rule: an even number of ones results in parity 0, and an odd number of ones results in parity 1.
 Suppose one data column, say C3, is lost due to a disk failure. We can then recompute the bits stored in C3 by XOR-ing the values of all the other columns with the parity bit. This allows us to recover lost data.
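The XOR parity and recovery just described can be sketched as follows (the `parity` helper and the one-bit blocks are illustrative assumptions):

```python
# Sketch of XOR parity as used by RAID-4/5. Each "block" is a list of ints;
# here each block holds a single bit, matching the example in the text.

from functools import reduce

def parity(blocks):
    """XOR corresponding values of all blocks to produce the parity block."""
    return [reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)]

data = [[0], [0], [0], [1]]      # bits on disks C0..C3
p = parity(data)                 # XOR(0,0,0,1) = 1
print(p)                         # [1]

# Recover a lost disk (say C3) by XOR-ing the survivors with the parity:
survivors = data[:3]
recovered = parity(survivors + [p])
print(recovered)                 # [1] == the lost block
```

The same recovery works for any single lost column, which is why RAID-4 tolerates exactly one disk failure.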
Evaluation:
 Reliability: 1
RAID-4 allows recovery of at most 1 disk failure (because of the way parity works). If more than one disk
fails, there is no way to recover the data.
 Capacity: (N-1)*B
One disk in the system is reserved for storing the parity. Hence, (N-1) disks are made available for data
storage, each disk having B blocks.

RAID-5 (Block-Level Striping with Distributed Parity)


 This is a slight modification of the RAID-4 system, where the only difference is that the parity rotates among the drives instead of living on one dedicated disk.

 Distributing the parity this way was introduced to improve random write performance, since RAID-4’s dedicated parity disk becomes a bottleneck on writes.
Evaluation:
 Reliability: 1
RAID-5 allows recovery of at most 1 disk failure (because of the way parity works). If more than one disk
fails, there is no way to recover the data. This is identical to RAID-4.
 Capacity: (N-1)*B
Overall, space equivalent to one disk is utilized in storing the parity. Hence, (N-1) disks are made available
for data storage, each disk having B blocks.

What about the other RAID levels?


RAID-2 consists of bit-level striping using a Hamming code for error correction. RAID-3 consists of byte-level striping with dedicated parity. These two are less commonly used.
RAID-6 builds on RAID-5 with double distributed parity: block-level striping with 2 parity blocks instead of just 1, distributed across all the disks, so it can survive two simultaneous disk failures. There are also hybrid RAIDs, which make use of more than one RAID level nested one after the other, to fulfill specific requirements.
Heap File Organization in DBMS
This is the most basic type of file organization. Records are placed at the end of the file as they are added, with no sorting or ordering of any kind. When a data block is filled, the next record is saved in a new block, which need not be the adjacent one: the technique can choose any memory block to store new records. It is comparable to the pile file of the sequential method, except that data blocks are not selected in sequential order; they can be any blocks in memory. The database management system (DBMS) is in charge of storing and managing the records.
Table of Contents

 What is Heap File Organization in DBMS?
 Insertion of a New Record
 Pros of Heap File Organization
 Cons of Heap File Organization

What is Heap File Organization in DBMS?


It is the most fundamental and basic form of file organization, and it works with data blocks. In heap file organization, records are inserted at the end of the file; no ordering or sorting is required when entries are added.
When a data block is full, the new record is put in a different block, which need not be the next data block in memory: a new entry can be stored in any data block. A heap file is the same as an unordered file. Every record in the file has a unique id, and every page in the file is the same size. The DBMS is in charge of storing and managing the new records.

Insertion of a New Record


Let’s say a heap already holds five records, R1, R3, R6, R4, and R5, and we wish to add a sixth record, R2. If data block 3 is full, the DBMS will insert R2 into whichever data block it chooses, such as data block 1.
In a heap file organization, if we wish to search, update, or remove a record, we must traverse the data from the beginning of the file until we find the desired record.
Because there is no ordering or sorting of records, searching, updating, or removing a record will take a long time if the database is huge: we must check the data in the heap file until we find the required record.
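The behavior above can be sketched in a toy form (the `HeapFile` class and the two-record block size are purely illustrative assumptions):

```python
# Minimal sketch of heap file behavior: records are appended to any block
# with free space, and search is a linear scan from the start of the file.

BLOCK_SIZE = 2  # records per block (illustrative)

class HeapFile:
    def __init__(self):
        self.blocks = [[]]

    def insert(self, record):
        # Use any block with free space; otherwise allocate a new block.
        for block in self.blocks:
            if len(block) < BLOCK_SIZE:
                block.append(record)
                return
        self.blocks.append([record])

    def search(self, key):
        # No ordering, so we must scan every block until the key is found.
        for block in self.blocks:
            for record in block:
                if record[0] == key:
                    return record
        return None

hf = HeapFile()
for r in ["R1", "R3", "R6", "R4", "R5", "R2"]:
    hf.insert((r,))
print(hf.search("R2"))  # ('R2',)
```

The linear scan in `search` is exactly why lookups degrade as the file grows.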

Pros of Heap File Organization

 It is a great way to organize files for bulk loading: this method is best suited when a significant amount of data needs to be loaded into the database at once.
 In a small database, fetching and retrieving records is faster than with sequential organization.

Cons of Heap File Organization

 Because finding or altering a record in a large database takes time, this method is comparatively inefficient.
 This type of organization is not suitable for huge or complex databases.
Sequential File Organization

Sequential file organization is a popular method and the easiest method of file organization in a DBMS. In this method, records are stored in the file one after another, in sequence. It can be implemented in two ways:

1. Pile File Method:

o It is a quite simple method. In this method, records are stored in sequence, one after another, in the order in which they are inserted into the tables.
o To update or delete a record, the record is first searched for in the memory blocks. Once found, it is marked for deletion, and the new record is inserted.

Insertion of the new record:

Suppose we have records R1, R3, and so on up to R8 and R9 stored in a sequence; a record here is simply a row in a table. If we want to insert a new record R2 into the sequence, it is placed at the end of the file.

2. Sorted File Method:

o In this method, the new record is always inserted at the end of the file, and then the sequence is sorted in ascending or descending order. Sorting of records is based on the primary key or some other key.
o When a record is modified, it is updated, the file is sorted again, and the updated record ends up in the right place.

Insertion of the new record:

Suppose there is a preexisting sorted sequence of records R1, R3, and so on up to R6 and R7. If a new record R2 has to be inserted, it is first placed at the end of the file, and then the sequence is re-sorted.
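The append-then-sort behavior of the sorted file method can be sketched in a few lines (the record names follow the example; a real system would sort on a key field rather than the name):

```python
# Sketch of the sorted file method: append the new record at the end of
# the file, then re-sort the whole sequence to restore the order.

records = ["R1", "R3", "R5", "R6", "R7"]   # preexisting sorted sequence
records.append("R2")                        # new record placed at the end
records.sort()                              # sequence is re-sorted
print(records)  # ['R1', 'R2', 'R3', 'R5', 'R6', 'R7']
```

The re-sort after every modification is the extra time and space cost mentioned in the cons below.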

Pros of sequential file organization

o It is a fast and efficient method for handling huge amounts of data.
o Files can be stored on cheaper storage media such as magnetic tape.
o It is simple in design and requires little effort to store the data.
o This method is used when most of the records have to be accessed, as in calculating students' grades or generating salary slips.
o This method is also used for report generation and statistical calculations.

Cons of sequential file organization


o It wastes time, as we cannot jump directly to a required record but must move through the records sequentially.
o The sorted file method takes extra time and space to keep the records sorted.

Hashing in DBMS

In a huge database structure, it is very inefficient to search all the index values and reach the desired data.
Hashing technique is used to calculate the direct location of a data record on the disk without using index
structure.

In this technique, data is stored at the data blocks whose address is generated by using the hashing function. The
memory location where these records are stored is known as data bucket or data blocks.

In this technique, the hash function can use any column value to generate the address; most of the time, it uses the primary key. The hash function can be anything from a simple to a complex mathematical function. We can even take the primary key itself as the address of the data block, meaning each row is stored at the data block whose address equals its primary key value.

In the simplest case, the data block address is the same as the primary key value. The hash function can also be a simple mathematical function such as mod. Suppose we use a mod(5) hash function to determine the address of the data block: applied to the primary keys, it generates the addresses 3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
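The mod(5) example can be reproduced in a short sketch (the key values below are assumptions chosen to produce the addresses 3, 3, 1, 4 and 2; note that two keys land in bucket 3, the collision situation discussed under bucket overflow):

```python
# Sketch of static hashing with a mod(5) hash function mapping primary
# keys to data bucket addresses.

def hash_fn(key: int) -> int:
    return key % 5

keys = [103, 108, 106, 104, 102]    # illustrative primary key values
buckets = {}
for k in keys:
    buckets.setdefault(hash_fn(k), []).append(k)

print([hash_fn(k) for k in keys])   # [3, 3, 1, 4, 2]
print(buckets[3])                   # [103, 108] -> a collision on bucket 3
```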
Important Terminologies in Hashing
Here, are important terminologies which are used in Hashing:

 Data bucket – Data buckets are the memory locations where records are stored. A bucket is also known as a unit of storage.
 Key – A DBMS key is an attribute or set of attributes that helps you identify a row (tuple) in a relation (table). Keys also let you find relationships between two tables.
 Hash function – A hash function is a mapping function that maps the set of search keys to the addresses where the actual records are placed.
 Linear probing – In linear probing, probes are a fixed interval apart: the next available data bucket is used to store the new record, instead of overwriting the older record.
 Quadratic probing – Quadratic probing determines the new bucket address by adding successive outputs of a quadratic polynomial to the value given by the original hash computation.
 Hash index – A hash index is the address of a data block. The hash function can be anything from a simple to a complex mathematical function.
 Double hashing – Double hashing is a method used in hash tables to resolve hash collisions, using a second hash function to compute the probe interval.
 Bucket overflow – The condition of bucket overflow is called a collision. This is a fatal state for any static hash function.

Types of Hashing Techniques


There are mainly two types of hashing techniques:

1. Static Hashing
2. Dynamic Hashing
Static Hashing

In static hashing, the resultant data bucket address is always the same. That means if we generate an address for EMP_ID = 103 using the hash function mod(5), the result is always the same bucket address, 3. The bucket address never changes here.

Hence in this static hashing, the number of data buckets in memory remains constant throughout. In this
example, we will have five data buckets in the memory used to store the data.

Operations of Static Hashing

o Searching a record

When a record needs to be searched, the same hash function is used to retrieve the address of the bucket where the data is stored.

o Insert a Record

When a new record is inserted into the table, an address is generated for it based on the hash key, and the record is stored at that location.


o Delete a Record

To delete a record, we first fetch the record that is to be deleted, then delete the record at that address in memory.

o Update a Record

To update a record, we will first search it using a hash function, and then the data record is updated.
If we want to insert a new record into the file but the data bucket address generated by the hash function is not empty (data already exists at that address), the situation is known as bucket overflow. This is a critical situation for this method.

To overcome this situation, there are various methods. Some commonly used methods are as follows:

1. Open Hashing

When the hash function generates an address at which data is already stored, the next available bucket is allocated to the record. This mechanism is called linear probing.

For example: suppose R3 is a new record that needs to be inserted, and the hash function generates address 112 for it. But that address is already full, so the system searches for the next available data bucket, 113, and assigns R3 to it.
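A minimal sketch of linear probing (the table size and keys are illustrative, and it assumes the table is never completely full):

```python
# Sketch of linear probing: if the computed bucket is occupied, try the
# next bucket, wrapping around the table.

def insert(table, key, size):
    addr = key % size
    while table[addr] is not None:       # bucket occupied: probe the next
        addr = (addr + 1) % size
    table[addr] = key
    return addr

table = [None] * 5
print(insert(table, 103, 5))  # 3  (103 % 5 = 3, bucket free)
print(insert(table, 108, 5))  # 4  (bucket 3 is taken, so the next one)
```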

2. Close Hashing

When a bucket is full, a new data bucket is allocated for the same hash result and linked after the previous one. This mechanism is known as overflow chaining.

For example: suppose R3 is a new record that needs to be inserted into the table, and the hash function generates address 110 for it. But that bucket is too full to store the new data. In this case, a new bucket is allocated at the end of bucket 110's chain and linked to it.
Dynamic Hashing
Dynamic hashing offers a mechanism in which data buckets are added and removed dynamically, on demand, as the data grows and shrinks. In this scheme, the hash function is extended so that it can generate a large number of bucket addresses over time.

Difference between Ordered Indexing and Hashing


Below are the key differences between Indexing and Hashing

Storing of address
o Ordered indexing: Addresses in the memory are sorted according to a key value, called the primary key.
o Hashing: Addresses are always generated by applying a hash function to the key value.

Performance
o Ordered indexing: Performance can degrade as the data in the file grows, because the data is stored in sorted form and every insert/delete/update operation must maintain that order.
o Hashing: Performs best when there is a constant stream of additions and deletions of data. However, when the database is huge, hash file organization and its maintenance become costlier.

Use for
o Ordered indexing: Preferred for range retrieval of data: whenever data must be retrieved for a particular range, this method is the ideal option.
o Hashing: Ideal when you want to retrieve a particular record based on the search key. However, it only performs well when the hash function is applied to the search key.

Memory management
o Ordered indexing: Delete/update operations leave many unused data blocks that cannot be released for reuse, so regular maintenance of the memory is required.
o Hashing: In static and dynamic hashing methods, memory is always managed. Bucket overflow is also handled to extend static hashing.
Bitmap Indexing in DBMS
Bitmap indexing is a special type of database indexing that uses bitmaps. This technique is used for huge databases, when a column is of low cardinality and that column is frequently used in queries.

Need of Bitmap Indexing –


The need for bitmap indexing will be clear from the following example. Say a company holds an employee table with columns EmpNo, EmpName, Job, New_Emp and Salary. Assume that employees are hired once a year, so the table is updated rarely and remains static most of the time. But its columns are frequently used in queries to retrieve data, such as the number of female employees in the company. In this case, we need a file organization method fast enough to give quick results. None of the traditional file organization methods is that fast, so we switch to a better method of storing and retrieving data, known as bitmap indexing.

How Bitmap Indexing is done –


In the employee table above, the column New_Emp has only two values, Yes and No, depending on whether the employee is new to the company. Similarly, assume the Job of the employees is divided into only 4 categories: Manager, Analyst, Clerk and Salesman. Such columns are called low-cardinality columns. Even though these columns have few unique values, they are queried very often.

Bit: A bit is the basic unit of information in computing; it can have only one of two values, 0 or 1. The two values of a binary digit can also be interpreted as the logical values true/false or yes/no.
In bitmap indexing, these bits are used to represent the unique values in the low-cardinality columns. This technique of storing the low-cardinality columns in the form of bits is called a bitmap index.
Continuing the employee example: if New_Emp is the column to be indexed, the bitmap index contains one bit per row (four in this example, as the table has four rows). The bitmap index for “Yes” has the value 1001 because rows 1 and 4 have the value “Yes” in column New_Emp.

There are two such bitmaps, one for New_Emp = “Yes” (1001) and one for New_Emp = “No” (0110). Each bit in a bitmap indicates whether the corresponding row refers to a person who is new to the company or not.

This is the simplest form of bitmap indexing. Columns with more distinct values can be indexed effectively the same way, with one bitmap per distinct value; for the Job column, with its 4 unique values, four bitmaps are built, one each for Manager, Analyst, Clerk and Salesman.

Now suppose we want to find the details of an employee who is not new to the company and is a salesperson. We run the query:

SELECT *
FROM Employee
WHERE New_Emp = "No" and Job = "Salesperson";
For this query, the DBMS searches the bitmap index of both columns and performs a logical AND operation on those bits to find the actual result:

Here the result 0100 indicates that the second row is to be retrieved.
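The bitwise AND step can be sketched directly (the “No” bitmap follows from the 1001 example above; the “Salesperson” bitmap is an assumption consistent with the stated result):

```python
# Sketch of combining bitmap indices with a bitwise AND to answer
# New_Emp = "No" AND Job = "Salesperson" over the four-row example.

new_emp_no  = 0b0110   # rows 2 and 3 have New_Emp = "No" (complement of 1001)
salesperson = 0b0100   # assumed: row 2 has Job = "Salesperson"

result = new_emp_no & salesperson
print(f"{result:04b}")  # 0100 -> retrieve row 2
```

OR conditions combine bitmaps the same way with `|`, which is what makes bitmap indices fast for multi-condition queries.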
Bitmap Indexing in SQL – The syntax for creating a bitmap index in SQL is given below:
CREATE BITMAP INDEX Index_Name
ON Table_Name (Column_Name);
For the above example of employee table, the bitmap index on column New_Emp will be created as follows:
CREATE BITMAP INDEX index_New_Emp
ON Employee (New_Emp);
Advantages –
 Efficient for queries: multiple conditions are combined with fast bitwise AND/OR operations on the bitmaps.
 Faster retrieval of records, with compact storage for low-cardinality columns.
Disadvantages –
 Insertion, deletion and updating are comparatively costly and time consuming, since the affected bitmaps must be maintained on every change; bitmap indexes suit largely static tables best.
 Not suitable for high-cardinality columns, where the number of bitmaps grows large.
Indexing in DBMS

o An index is a type of data structure. It is used to locate and access the data in a database table quickly.
o Indexing is used to optimize the performance of a database by minimizing the number of disk accesses
required when a query is processed.

Index structure:

Indexes can be created using some database columns.

o The first column of the index is the search key, which contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers holding the address of the disk block where the value of that particular key can be found.

Indexing Methods

Ordered indices

The indices are usually sorted to make searching faster. The indices which are sorted are known as ordered
indices.
Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes long. If the IDs start with 1, 2, 3, ... and so on, and we have to search for the employee with ID 543:

o With no index, we have to scan the disk blocks from the start until we reach 543. The DBMS will find the record after reading 543*10 = 5430 bytes.
o With an index (assuming 2-byte index entries), the DBMS will find the record after reading 542*2 = 1084 bytes, which is far less than in the previous case.

Primary Index

o If the index is created on the basis of the primary key of the table, it is known as a primary index. Primary keys are unique to each record, so there is a 1:1 relation between index entries and records.
o As primary keys are stored in sorted order, the performance of the searching operation is quite efficient.
o The primary index can be classified into two types: dense index and sparse index.

Dense index

o The dense index contains an index record for every search key value in the data file. It makes searching
faster.
o In this, the number of records in the index table is same as the number of records in the main table.
o It needs more space to store index record itself. The index records have the search key and a pointer to the
actual record on the disk.

Sparse index

o In a sparse index, index records are created for only some of the items in the data file; each index entry points to a block of records.
o Instead of pointing to every record in the main table, the index points to records at intervals; to find a record, we locate the closest preceding index entry and scan forward from there.
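A small sketch of a sparse-index lookup (the blocks and keys are illustrative; `bisect_right` finds the last index entry not greater than the search key):

```python
# Sketch of a sparse index: one index entry per block (the first key of
# the block), then a scan within the selected block.

import bisect

blocks = [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
index = [b[0] for b in blocks]            # sparse: one key per block

def search(key):
    i = bisect.bisect_right(index, key) - 1   # last index entry <= key
    return key if key in blocks[i] else None

print(search(11))  # 11 (found in the second block)
```

A dense index would instead hold one entry per record, trading extra space for a direct pointer to each record.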
Clustering Index

o A clustered index is defined over an ordered data file. Sometimes the index is created on non-primary-key columns, which may not be unique for each record.
o In that case, to identify records faster, we group two or more columns to obtain a unique value and create the index on them. This method is called a clustering index.
o Records with similar characteristics are grouped together, and indexes are created for these groups.

Example: suppose a company has several employees in each department. With a clustering index, all employees belonging to the same Dept_ID are considered to be within a single cluster, and the index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.
This scheme can be a little confusing when one disk block is shared by records belonging to different clusters; using a separate disk block for each cluster is the better technique.

Secondary Index

With sparse indexing, as the size of the table grows, the size of the mapping grows too. These mappings are usually kept in primary memory so that address lookup is fast; the actual data is then fetched from secondary memory using the address obtained from the mapping. If the mapping itself grows large, fetching the address becomes slow and the sparse index is no longer efficient. To overcome this problem, secondary indexing is introduced.

In secondary indexing, another level of indexing is added to reduce the size of the mapping. A coarse range of column values is chosen initially, so that the first-level mapping stays small; each range is then further divided into smaller ranges. The first-level mapping is stored in primary memory, so address lookup is fast; the second-level mapping and the actual data are stored in secondary memory (the hard disk).
For example:

o To find the record with roll number 111, search the first-level index for the highest entry that is smaller than or equal to 111; this gives 100.
o In the second-level index for that range, again find the highest entry smaller than or equal to 111, which gives 110. Using address 110, go to the data block and scan the records until 111 is found.
o This is how a search is performed in this method. Insertion, updating and deletion are done in the same manner.
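The two-level lookup for roll number 111 can be sketched as follows (the range values follow the example above; the dictionaries standing in for disk-resident structures are an assumption):

```python
# Sketch of a secondary (two-level) index lookup: find the highest
# first-level entry <= key, then the highest second-level entry <= key,
# then scan the data block for the record.

import bisect

first_level  = [1, 100, 200]             # coarse ranges (in primary memory)
second_level = {100: [100, 110, 120]}    # finer ranges (in secondary memory)
data_blocks  = {110: [110, 111, 112]}    # actual records on disk

def lookup(key):
    lo1 = first_level[bisect.bisect_right(first_level, key) - 1]   # -> 100
    lvl2 = second_level[lo1]
    lo2 = lvl2[bisect.bisect_right(lvl2, key) - 1]                 # -> 110
    return key if key in data_blocks[lo2] else None

print(lookup(111))  # 111
```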
