dbms 5th Unit (2)
A database system gives users an abstract view of the stored data. Underneath that view, however, the data is
stored as bits and bytes on different storage devices.
In this section, we will take an overview of various types of storage devices that are used for accessing and
storing data.
Several types of storage are available for storing data. They differ from one another in speed and
accessibility. The following types of storage devices are used for storing data:
o Primary Storage
o Secondary Storage
o Tertiary Storage
Primary Storage
It is the primary area that offers quick access to the stored data. Primary storage is also known as volatile
storage, because this type of memory does not store data permanently: as soon as the system suffers a
power cut or a crash, the data is lost. Main memory and cache are the types of primary storage.
o Main Memory: It holds the data that the processor operates on once that data has been made available from
the storage medium. Every instruction a computer executes works on data in main memory. This type of memory
can store gigabytes of data, but it is usually too small (or too expensive) to hold an entire database. Finally,
main memory loses its whole content if the system shuts down because of a power failure or for other reasons.
o Cache: It is the fastest storage medium and also one of the costliest. A cache is a small
storage area that is usually managed by the computer hardware. When designing data structures, algorithms
and query processors, designers take cache effects into account.
Secondary Storage
Secondary storage is also called online storage. It is the storage area that allows the user to save and store data
permanently. This type of memory does not lose data because of a power failure or system crash, which is why
it is also called non-volatile storage.
There are some commonly described secondary storage media which are available in almost every type of
computer system:
o Flash Memory: Flash memory is the storage used in USB (Universal Serial Bus) keys, which are
plugged into the USB slots of a computer system. USB keys help transfer data to and from a computer
system and come in a range of capacities. Unlike main memory, flash memory keeps its data
through a power cut or similar failure. This type of storage is also commonly
used in server systems for caching frequently used data, which improves
performance, and it can store larger amounts of data than main memory.
o Magnetic Disk Storage: This type of storage medium is also known as online storage. A magnetic
disk is used for storing data for a long time, and it is capable of storing an entire database. It is the
responsibility of the computer system to bring data from the disk into main memory so that it
can be accessed. Likewise, if the system performs any operation on the data, the modified data must be
written back to the disk. The great strength of a magnetic disk is that a system crash or power failure does
not affect the data stored on it; a failure of the disk itself, however, can destroy the stored data.
Tertiary Storage
It is the storage type that is external to the computer system. It has the slowest speed, but it is capable of
storing a large amount of data. It is also known as offline storage. Tertiary storage is generally used for data
backup. The following tertiary storage devices are available:
o Optical Storage: Optical storage can hold megabytes or gigabytes of data. A Compact Disc (CD) can
store 700 megabytes of data, with a playtime of around 80 minutes. A Digital Video
Disc (DVD), on the other hand, can store 4.7 or 8.5 gigabytes of data on each side of the disc.
o Tape Storage: It is a cheaper storage medium than disk. Generally, tapes are used for archiving or
backing up data. Tape provides slow access because the data must be read sequentially from the start. Thus,
tape storage is also known as sequential-access storage. Disk storage, by contrast, is known as direct-access
storage, as data can be accessed directly at any location on the disk.
Storage Hierarchy
Besides the above, various other storage devices reside in a computer system. Storage media are
classified by the speed of data access, the cost per unit of data of the medium, and the medium's
reliability. We can therefore arrange the storage media described above in a hierarchy according to cost and
speed, as shown in the image below:
In the image, the higher levels are expensive but fast. Moving down the hierarchy, the cost per bit decreases and
the access time increases. The storage media from main memory upwards are volatile, while all the media
below main memory are non-volatile.
Evaluation (RAID-0):
Reliability: 0
There is no duplication of data. Hence, a block once lost cannot be recovered.
Capacity: N*B
The entire space is used to store data. Since there is no duplication, N disks, each having B blocks, are
fully utilized.
RAID-1 (Mirroring)
More than one copy of each block is stored in a separate disk. Thus, every block has two (or more) copies,
lying on different disks.
The above figure shows a RAID-1 system with mirroring level 2.
RAID 0 cannot tolerate any disk failure, whereas RAID 1 provides reliability through its duplicate copies.
Evaluation:
Assume a RAID system with mirroring level 2.
Reliability: 1 to N/2
One disk failure can be handled for certain, because the blocks of that disk have duplicates on some other
disk. If we are lucky and the failed disks are, say, disks 0 and 2, this can still be handled, as the blocks of
these disks have duplicates on disks 1 and 3. So, in the best case, N/2 disk failures can be handled.
Capacity: N*B/2
Only half the space is being used to store data. The other half is just a mirror to the already stored data.
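The capacity and reliability figures above can be checked with a small sketch. The function names, and the assumption that disk 2k is mirrored on disk 2k+1 (matching the example in which disks 0 and 2 have their duplicates on disks 1 and 3), are illustrative only.

```python
def raid0_capacity(n_disks, blocks_per_disk):
    """Usable blocks with no duplication: N * B."""
    return n_disks * blocks_per_disk


def raid1_capacity(n_disks, blocks_per_disk, mirror_level=2):
    """Usable blocks when every block is stored mirror_level times."""
    return n_disks * blocks_per_disk // mirror_level


def raid1_survives(failed_disks, n_disks):
    """Assumed layout: disk 2k is mirrored on disk 2k+1.

    Data survives as long as no mirror pair has lost both of its disks.
    """
    failed = set(failed_disks)
    return not any({d, d + 1} <= failed for d in range(0, n_disks, 2))


print(raid0_capacity(4, 100))       # N*B
print(raid1_capacity(4, 100))       # N*B/2
print(raid1_survives([0, 2], 4))    # one disk lost from each pair
print(raid1_survives([0, 1], 4))    # both disks of one pair lost
```

With four disks, losing disks 0 and 2 is survivable under this layout, while losing disks 0 and 1 is not, which is why the reliability is quoted as a range from 1 to N/2.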
This method is well suited to bulk insertion: it works best when a significant
amount of data needs to be loaded into the database at once.
Fetching and retrieving records is faster in a small database than with sequentially stored records.
Because it takes time to find or alter a record in a large database, this method is comparatively inefficient.
For huge or complicated databases, this type of organization cannot be used.
Sequential File Organization
Sequential file organization is a popular method of file organization in a database management system (DBMS),
and it is also the simplest one. The records of a file are stored in sequence, one after the other, in binary
format. This method can be implemented in two ways:
1. Pile file method (records are kept in arrival order):
o It is quite a simple method. Records are stored in sequence, one after another, in the
order in which they are inserted into the table.
o To update or delete a record, the record is first searched for in the memory blocks. When it
is found, it is marked for deletion, and the new record is inserted.
Suppose we have a sequence of records R1, R3, and so on up to R9 and R8, where each record is simply a
row of the table. If we want to insert a new record R2 into the sequence, it is placed at the end of
the file.
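A minimal sketch of the unsorted case, assuming the file can be modelled as a Python list of record names. The helper names `pile_insert` and `pile_search` and the sample contents are hypothetical.

```python
def pile_insert(file_records, new_record):
    """Append the new record at the end of the file; no ordering is kept."""
    file_records.append(new_record)
    return file_records


def pile_search(file_records, wanted):
    """Scan the file from the start; return the position, or -1 if absent."""
    for position, record in enumerate(file_records):
        if record == wanted:
            return position
    return -1


records = ["R1", "R3", "R9", "R8"]   # hypothetical file contents
pile_insert(records, "R2")
print(records)                        # R2 ends up last
print(pile_search(records, "R2"))     # found only after scanning the whole file
```

Insertion is cheap because nothing is reordered, but every search is a scan from the beginning of the file.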
2. Sorted file method (records are kept sorted on a key):
o In this method, the new record is always inserted at the end of the file, and then the sequence is sorted in
ascending or descending order. Records are sorted on the primary key or on any other key.
o When a record is modified, the record is updated first and the file is then sorted again, so that the
updated record ends up in the right place.
Suppose there is a pre-existing sorted sequence of records R1, R3, and so on up to R6 and R7. If a new
record R2 has to be inserted into the sequence, it is first inserted at the end of the file, and then the
sequence is sorted.
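The append-then-sort behaviour can be sketched as follows. The list model of the file, the helper name `sorted_insert`, and the choice of the number after "R" as the sort key are assumptions made for illustration.

```python
def record_key(record):
    """Assumed sort key: the number that follows the letter R."""
    return int(record[1:])


def sorted_insert(file_records, new_record, descending=False):
    """Place the record at the end of the file, then re-sort the whole file."""
    file_records.append(new_record)
    file_records.sort(key=record_key, reverse=descending)
    return file_records


records = ["R1", "R3", "R6", "R7"]   # hypothetical sorted file
sorted_insert(records, "R2")
print(records)                        # R2 now sits between R1 and R3
```

Re-sorting after every insertion is what makes this variant slower to update than the unsorted one, in exchange for keeping the file ordered on the key.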
Advantages of sequential file organization:
o It is a fast and efficient method for handling huge amounts of data.
o Files can easily be stored on cheaper storage media such as magnetic tapes.
o It is simple in design and requires little effort to store the data.
o It is used when most of the records have to be accessed, for example when calculating students' grades or
generating salary slips.
o It is used for report generation and statistical calculations.
Hashing in DBMS
In a huge database structure, it is very inefficient to search through all the index values to reach the desired
data. The hashing technique calculates the location of a data record on disk directly, without using an index
structure.
In this technique, data is stored in data blocks whose addresses are generated by a hash function. The
memory locations where these records are stored are known as data buckets or data blocks.
A hash function can use any column value to generate the address, but most of the time it
uses the primary key. The hash function can be anything from a simple mathematical
function to a complex one. We can even use the primary key itself as the address of the
data block, which means that each row is stored in the data block whose address is the same as its primary key.
The above diagram shows data block addresses that are the same as the primary key values. The hash function
can also be a simple mathematical function such as exponential, mod, cos or sin. Suppose we use a mod (5) hash
function to determine the address of the data block. Applying the mod (5) hash function to the primary keys
generates 3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
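As a small illustration of a mod (5) hash function, the sketch below maps a primary key to a bucket address. The function name `bucket_address` and the sample keys are hypothetical and are not the keys from the diagram.

```python
NUM_BUCKETS = 5


def bucket_address(primary_key, num_buckets=NUM_BUCKETS):
    """mod hash: the remainder of the key divided by the number of buckets."""
    return primary_key % num_buckets


for key in (103, 104, 106, 102):          # hypothetical primary keys
    print(key, "->", bucket_address(key))
```

The same key always yields the same address, and two different keys can yield the same address, which is what leads to collisions.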
Important Terminologies in Hashing
The following terms are important in hashing:
Data bucket: Data buckets are the memory locations where the records are stored. A data bucket is also
known as a unit of storage.
Key: A DBMS key is an attribute or set of attributes that identifies a row (tuple) in a
relation (table). Keys also let you find the relationship between two tables.
Hash function: A hash function is a mapping function that maps the set of all search keys to the
addresses where the actual records are placed.
Linear probing: Linear probing uses a fixed interval between probes. In this method, the next available
data block is used to store the new record, instead of overwriting the older record.
Quadratic probing: It helps to determine the new bucket address. The interval
between probes is computed by adding the successive outputs of a quadratic polynomial to the starting value
given by the original computation.
Hash index: It is the address of the data block. A hash function can be anything from a simple mathematical
function to a complex one.
Double hashing: Double hashing is a method used in hash tables to resolve
hash collisions.
Bucket overflow: The condition of bucket overflow is called a collision. This is a fatal state for any static
hash function.
Hashing is of two types:
1. Static Hashing
2. Dynamic Hashing
Static Hashing
In static hashing, the resultant data bucket address is always the same. That means if we generate an address
for EMP_ID = 103 using the hash function mod (5), the result is always the same bucket address, 3. The
bucket address never changes.
Hence, in static hashing, the number of data buckets in memory remains constant throughout. In this
example, there are five data buckets in memory for storing the data.
o Searching a record
When a record needs to be searched, then the same hash function retrieves the address of the bucket where the
data is stored.
o Insert a Record
When a new record is inserted into the table, an address is generated for the new record from its hash
key, and the record is stored at that location.
o Delete a Record
To delete a record, we first fetch the record that is to be deleted. Then we delete the record stored at that
address in memory.
o Update a Record
To update a record, we will first search it using a hash function, and then the data record is updated.
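The four operations can be sketched with a fixed list of buckets. The class name `StaticHashFile` and its methods are hypothetical, and each bucket is assumed to have unlimited room so that the basic operations stay in focus.

```python
class StaticHashFile:
    """Fixed number of buckets; a key always maps to the same bucket."""

    def __init__(self, num_buckets=5):
        # One dictionary per bucket, mapping key -> record.
        self.buckets = [dict() for _ in range(num_buckets)]

    def _address(self, key):
        return key % len(self.buckets)       # mod hash function

    def insert(self, key, record):
        self.buckets[self._address(key)][key] = record

    def search(self, key):
        return self.buckets[self._address(key)].get(key)

    def update(self, key, record):
        bucket = self.buckets[self._address(key)]
        if key not in bucket:                # search first, then update
            raise KeyError(key)
        bucket[key] = record

    def delete(self, key):
        # Returns the removed record, or None if the key was not stored.
        return self.buckets[self._address(key)].pop(key, None)


emp_file = StaticHashFile(num_buckets=5)
emp_file.insert(103, "first version")        # hypothetical record contents
emp_file.update(103, "second version")
print(emp_file.search(103))
```

Every operation recomputes the same address from the key, so no index lookup is needed to find the bucket.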
Suppose we want to insert a new record into the file, but the data bucket address generated by the hash function
is not empty, that is, data already exists at that address. This situation in static hashing is known as bucket
overflow, and it is a critical situation in this method.
There are various methods to overcome this situation. Some commonly used methods are as follows:
1. Open Hashing
When a hash function generates an address at which data is already stored, the next available bucket is
allocated to the record. This mechanism is called linear probing.
For example: suppose R3 is a new record that needs to be inserted, and the hash function generates the address
112 for R3. But the bucket at that address is already full. So the system searches for the next available data
bucket, 113, and assigns R3 to it.
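Linear probing can be sketched as below, assuming each bucket holds a single record and that the search wraps around to the first bucket after the last one. The function name and the bucket contents are hypothetical; only the addresses 112 and 113 come from the example.

```python
def linear_probe_insert(buckets, home_address, record):
    """buckets maps address -> record, with None marking an empty bucket.

    Starts at the address given by the hash function and moves to the next
    address until an empty bucket is found. Returns the address used.
    """
    addresses = sorted(buckets)
    start = addresses.index(home_address)
    for step in range(len(addresses)):
        address = addresses[(start + step) % len(addresses)]
        if buckets[address] is None:
            buckets[address] = record
            return address
    raise OverflowError("every bucket is full")


# Hypothetical file: bucket 112 is already occupied, 113 is free.
buckets = {110: "R1", 111: None, 112: "R2", 113: None, 114: None}
placed_at = linear_probe_insert(buckets, 112, "R3")
print(placed_at)
```

A later search for R3 has to follow the same probe sequence, starting at 112 and moving forward, because the record is no longer at its home address.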
2. Closed Hashing
When a bucket is full, a new data bucket is allocated for the same hash result and is linked after the
previous one. This mechanism is known as overflow chaining.
For example: suppose R3 is a new record that needs to be inserted into the table, and the hash function generates
the address 110 for it. But this bucket is too full to store the new data. In this case, a new bucket is added
at the end of bucket 110 and is linked to it.
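Overflow chaining can be sketched with buckets that carry a link to the next overflow bucket. The `Bucket` class, the capacity of two records per bucket, and the record names other than R3 are assumptions for illustration.

```python
class Bucket:
    """A data bucket with a fixed capacity and a link to an overflow bucket."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None    # next bucket in the chain, if any


def chain_insert(bucket, record):
    """Insert into the first bucket of the chain that still has room.

    A new overflow bucket is linked at the end of the chain when every
    bucket is full. Returns how many links were followed.
    """
    links_followed = 0
    while len(bucket.records) >= bucket.capacity:
        if bucket.overflow is None:
            bucket.overflow = Bucket(bucket.capacity)
        bucket = bucket.overflow
        links_followed += 1
    bucket.records.append(record)
    return links_followed


bucket_110 = Bucket(capacity=2)          # hypothetical capacity
bucket_110.records = ["R1", "R2"]        # bucket 110 is already full
depth = chain_insert(bucket_110, "R3")
print(depth, bucket_110.overflow.records)
```

Unlike linear probing, the record stays logically attached to its home address; the cost is that long chains make searches slower.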
Dynamic Hashing
Dynamic hashing offers a mechanism in which data buckets are added and removed dynamically, on demand.
In this type of hashing, the hash function is made to produce a large number of values.
Bit: A bit is the basic unit of information used in computing. It can have only one of two values, either 0 or 1.
The two values of a binary digit can also be interpreted as the logical values true/false or yes/no.
In bitmap indexing, these bits are used to represent the distinct values of low-cardinality columns. An index
that stores the values of a low-cardinality column in the form of bits is called a bitmap index.
Continuing the Employee example, the Employee table is given below:
If New_Emp is the column to be indexed, each bitmap in the index has four bits (as there are four rows
in the above table), shown under the heading Bitmap Indices. Here the bitmap for “Yes” has the value 1001,
because row 1 and row 4 have the value “Yes” in the column New_Emp.
In this case there are two such bitmaps, one for New_Emp “Yes” and one for New_Emp “No”. It is easy to see
that each bit in a bitmap shows whether a particular row refers to a person who is new to the
company or not.
The above scenario is the simplest form of bitmap indexing. Most columns have more distinct values. For
example, the column Job here has only 4 unique values (as mentioned earlier). Variations on the bitmap
index can effectively index this data as well. For the Job column, the bitmap index is shown below:
Now suppose we want to find the details of the employee who is not new to the company and is a
salesperson. We run the query:
SELECT *
FROM Employee
WHERE New_Emp = 'No' AND Job = 'Salesperson';
For this query, the DBMS looks up the bitmaps for both columns in the bitmap indexes, performs a logical AND
operation on those bits, and finds the actual result:
Here the result 0100 indicates that the second row has to be retrieved.
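The AND step can be reproduced with a short sketch. The New_Emp column follows the bitmap 1001 given for “Yes”; the Job values are assumed for illustration, and the helper names are hypothetical.

```python
def build_bitmaps(column_values):
    """Return {value: bit string} with one bit per row (leftmost bit = row 1)."""
    bitmaps = {}
    for value in set(column_values):
        bitmaps[value] = "".join(
            "1" if v == value else "0" for v in column_values
        )
    return bitmaps


def bitmap_and(first, second):
    """Bitwise AND of two bit strings of equal length."""
    return "".join(
        "1" if a == "1" and b == "1" else "0" for a, b in zip(first, second)
    )


new_emp = ["Yes", "No", "No", "Yes"]                    # gives 1001 for "Yes"
job = ["Analyst", "Salesperson", "Clerk", "Manager"]    # assumed Job values

not_new = build_bitmaps(new_emp)["No"]                  # 0110
salesperson = build_bitmaps(job)["Salesperson"]         # 0100
result = bitmap_and(not_new, salesperson)
matching_rows = [i + 1 for i, bit in enumerate(result) if bit == "1"]
print(result, matching_rows)
```

Only the rows whose bit is 1 in the result need to be fetched from the table, so the filtering itself never touches the row data.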
Bitmap Indexing in SQL: The syntax for creating a bitmap index (as supported by Oracle Database) is given below:
CREATE BITMAP INDEX Index_Name
ON Table_Name (Column_Name);
For the above example of employee table, the bitmap index on column New_Emp will be created as follows:
CREATE BITMAP INDEX index_New_Emp
ON Employee (New_Emp);
Advantages –
Efficiency in terms of insertion, deletion and update.
Faster retrieval of records
Disadvantages –
Only suitable for large tables
Bitmap Indexing is time consuming
Indexing in DBMS
o The index is a type of data structure. It is used to locate and access the data in a database table quickly.
o Indexing is used to optimize the performance of a database by minimizing the number of disk accesses
required when a query is processed.
Index structure:
o The first column of the index is the search key, which contains a copy of the primary key or candidate
key of the table. The search key values are stored in sorted order so that the corresponding data
can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers holding the address
of the disk block where the value of the particular key can be found.
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster. Indices that are sorted are known as ordered
indices.
Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes long. Suppose
the IDs start with 1, 2, 3 and so on, and we have to search for the employee with ID 543.
o In the case of a database with no index, we have to scan the disk blocks from the start until we reach 543.
The DBMS will read the record after reading 543*10 = 5430 bytes.
o In the case of an index, with each index entry taking 2 bytes, the DBMS searches the index and reads the
record after reading 542*2 = 1084 bytes, which is far less than in the previous case.
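The arithmetic in this example can be written out as a short sketch. The 2-byte index entry is inferred from the 542*2 figure, and the function names are hypothetical.

```python
RECORD_SIZE = 10        # bytes per record, from the example
INDEX_ENTRY_SIZE = 2    # bytes per index entry, implied by 542*2


def bytes_read_without_index(target_id):
    """Every record up to and including the target is read."""
    return target_id * RECORD_SIZE


def bytes_read_with_index(target_id):
    """Only the index entries that precede the target are read."""
    return (target_id - 1) * INDEX_ENTRY_SIZE


print(bytes_read_without_index(543))   # 5430
print(bytes_read_with_index(543))      # 1084
```

The saving comes entirely from index entries being much smaller than full records, so the gap widens as records get larger.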
Primary Index
o If the index is created on the basis of the primary key of the table, it is known as primary indexing.
Primary keys are unique to each record, so there is a 1:1 relation between key values and records.
o As primary keys are stored in sorted order, the performance of the searching operation is quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. This makes searching
faster.
o The number of records in the index table is the same as the number of records in the main table.
o It needs more space to store the index records themselves. Each index record holds the search key and a
pointer to the actual record on the disk.
Sparse index
o Index records appear for only a few of the items in the data file. Each index record points to a block.
o Instead of pointing to each record in the main table, the index points to records in the main
table at intervals. The remaining records are found by scanning sequentially from the record the index
points to.
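The difference between the two index types can be sketched as follows. The block size of three records, the key values, and the function names are assumptions made for illustration.

```python
import bisect

BLOCK_SIZE = 3   # records per block (hypothetical)
records = [(k, f"row-{k}") for k in (10, 20, 30, 40, 50, 60, 70, 80)]
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Dense index: one entry per search key -> (block number, offset in block).
dense_index = {
    key: (b, o)
    for b, block in enumerate(blocks)
    for o, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding the first key in that block.
sparse_keys = [block[0][0] for block in blocks]


def dense_lookup(key):
    location = dense_index.get(key)
    if location is None:
        return None
    b, o = location
    return blocks[b][o][1]


def sparse_lookup(key):
    # Find the last index entry whose key is <= the wanted key ...
    pos = bisect.bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None
    # ... then scan that block sequentially.
    for k, row in blocks[pos]:
        if k == key:
            return row
    return None


print(len(dense_index), len(sparse_keys))   # index sizes
print(dense_lookup(50), sparse_lookup(50))
```

The dense index needs one entry per record but jumps straight to it, while the sparse index is much smaller and pays for that with a short scan inside the block.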
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created on non-primary
key columns, which may not be unique for each record.
o In this case, to identify records faster, we group two or more columns to get a unique value and
create an index out of them. This method is called a clustering index.
o Records that have similar characteristics are grouped, and indexes are created for these groups.
Example: suppose a company has several employees in each department. Suppose we use a clustering
index, where all employees who belong to the same Dept_ID are considered to be in a single cluster, and the
index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.
The previous scheme is a little confusing, because one disk block is shared by records that belong to different
clusters. Using a separate disk block for each cluster is considered a better technique.
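A minimal sketch of the Dept_ID example, assuming each cluster is kept in its own list (standing in for its own disk block). The employee rows and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (Emp_ID, Name, Dept_ID)
employees = [
    (1, "Asha", 10),
    (2, "Ben", 20),
    (3, "Chen", 10),
    (4, "Dev", 30),
    (5, "Eli", 20),
]


def build_cluster_index(rows):
    """Group rows by Dept_ID; each index entry points to a whole cluster."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[2]].append(row)
    # Keep the index entries ordered by the (non-unique) Dept_ID key.
    return {dept: clusters[dept] for dept in sorted(clusters)}


cluster_index = build_cluster_index(employees)
print(list(cluster_index))                     # one entry per department
print([row[1] for row in cluster_index[10]])   # everyone in department 10
```

The index has one entry per distinct Dept_ID rather than one per employee, and following an entry returns all rows of that department together.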
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are usually
kept in primary memory so that address fetches are fast. The actual data is then searched for in secondary
memory, based on the address obtained from the mapping. If the mapping grows too large, fetching the address
itself becomes slower, and the sparse index is no longer efficient. To overcome this problem, secondary
indexing is introduced.
In secondary indexing, another level of indexing is introduced to reduce the size of the mapping. In this method,
large ranges of the column values are selected initially, so that the mapping of the first level stays small. Then
each range is further divided into smaller ranges. The mapping of the first level is stored in primary memory,
so that address fetches are fast. The mapping of the second level and the actual data are stored in secondary
memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, the search first looks in the first-level index for
the highest entry that is smaller than or equal to 111. It gets 100 at this level.
o Then, in the second-level index, it again looks for the highest entry that is smaller than or equal to 111 and
gets 110. Using the address 110, it goes to the data block and searches each record until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is done in the same
manner.
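The two-level search for roll 111 can be sketched as below. The layout (ten consecutive roll numbers per data block, first-level entries 100, 200 and 300) is an assumption chosen so that the lookup passes through 100 and 110 as in the example; the function names are hypothetical.

```python
import bisect

# Hypothetical layout: each data block holds 10 consecutive roll numbers.
data_blocks = {start: list(range(start, start + 10))
               for start in range(100, 400, 10)}
# Second level: for each first-level entry, the block addresses in its range.
second_level = {first: list(range(first, first + 100, 10))
                for first in (100, 200, 300)}
# First level: small enough to be kept in primary memory.
first_level = sorted(second_level)


def floor_entry(sorted_keys, target):
    """Highest entry that is <= target, or None if there is none."""
    pos = bisect.bisect_right(sorted_keys, target) - 1
    return sorted_keys[pos] if pos >= 0 else None


def find_roll(roll):
    """Return (first-level entry, block address, roll), or None if absent."""
    first = floor_entry(first_level, roll)
    if first is None:
        return None
    block_address = floor_entry(second_level[first], roll)
    if block_address is None:
        return None
    for stored_roll in data_blocks[block_address]:   # sequential scan
        if stored_roll == roll:
            return (first, block_address, stored_roll)
    return None


print(find_roll(111))
```

Each level narrows the range, so only one small first-level list needs to stay in memory while the larger second level and the data stay on disk.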