DBMS Module-6

DBMS

DATABASE MANAGEMENT SYSTEMS
Physical Database Design
Topics to be covered:
• Storage and file structure:
  – Memory Hierarchies and Storage Devices
  – Placing File Records on Disk
• Hashing Techniques
• Indexing Techniques
  – Primary Indexes
  – Secondary Indexes
  – Clustering Indexes
  – Multilevel Indexes
  – Dynamic Multilevel Indexes Using B-Trees and B+-Trees
Memory Hierarchies
• The computer memory hierarchy is a pyramid structure used to
  describe the differences among memory types.
• Databases are stored in file formats, which contain records.
• At the physical level, the actual data is stored in electromagnetic
  format on some device.
• The hierarchy separates computer storage into levels:
  – Level 0: CPU registers
  – Level 1: Cache memory
  – Level 2: Main memory or primary memory
  – Level 3: Magnetic disks or secondary memory
  – Level 4: Optical disks or magnetic tapes, i.e., tertiary
    memory
Level-0 − Registers

Registers are present inside the CPU, so they have the least access time. They are the most expensive and
smallest in size. They are implemented using flip-flops.

Level-1 − Cache

Cache memory stores the segments of a program that are frequently accessed by the processor. It is expensive and
small in size, generally megabytes, and is implemented using static RAM.

Level-2 − Primary or Main Memory

Main memory communicates directly with the CPU, and with auxiliary memory devices through an I/O processor. It is less
expensive than cache memory and larger in size, generally gigabytes. It is implemented using dynamic RAM.

Level-3 − Secondary storage

Secondary storage devices like magnetic disks are present at level 3. They are used as backup storage. They are cheaper than
main memory and larger in size, generally a few TB.

Level-4 − Tertiary storage

Tertiary storage devices such as optical disks and magnetic tapes sit at level 4. They are the cheapest, slowest,
and largest in capacity, and are mainly used for archival backup.

The memory levels differ in terms of size, access time, and bandwidth.
Primary Storage
• The memory storage that is directly accessible to the
CPU comes under this category.
• CPU's internal memory (registers), fast memory
(cache), and main memory (RAM) are directly
accessible to the CPU, as they are all placed on the
motherboard or CPU chipset.
• This storage is typically very small, ultra-fast, and
volatile.
• Primary storage requires continuous power supply in
order to maintain its state.
• In case of a power failure, all its data is lost.
Secondary Storage
• Secondary storage devices are used to store data
  for future use or as backup.
• Secondary storage includes memory devices that
  are not a part of the CPU chipset or motherboard.
  – Examples: magnetic disks, hard disks, flash drives.
Tertiary Storage

• Tertiary storage is used to store huge volumes of data.
• Since such storage devices are external to the
  computer system, they are the slowest in speed.
• These storage devices are mostly used to take a
  backup of an entire system.
• Optical disks and magnetic tapes are widely used
  as tertiary storage.
Storage Devices
• There are five types of devices in which
  computer data can be stored.
  – Primary storage devices
  – Magnetic storage devices
  – Optical storage devices
  – Flash memory devices
  – Cloud and virtual storage
Primary storage devices
•RAM: It stands for Random Access Memory. It stores information that is in
immediate use; in other words, it is temporary memory. The computer brings
software installed on the hard disk into RAM to process it and make it available
to the user. Once the computer is turned off, the data is lost. RAM lets the
computer perform multiple tasks, such as loading applications, browsing the web,
or editing a spreadsheet, and switch quickly among these tasks, remembering
where you were in one task when you move to another. RAM is almost always in
active use by the computer. Its capacity typically ranges from 1 GB to
32 GB/64 GB depending on the specification. There are different types of RAM;
although they all serve the same purpose, the most common ones are:
• SRAM: It stands for Static Random Access Memory. It consists of circuits that
retain stored information as long as the power supply is on, so it is
volatile memory. It is used to build cache memory. The access time of SRAM
is lower, making it much faster than DRAM, but it is
costlier than DRAM.
• DRAM: It stands for Dynamic Random Access Memory. It stores
binary bits as electrical charges on capacitors. The
access time of DRAM is slower than SRAM's, but it is cheaper than
SRAM and has a higher packaging density.
• SDRAM: It stands for Synchronous Dynamic Random Access Memory. It is
faster than plain DRAM and is widely used in computers. After SDRAM was
introduced, upgraded double data rate versions, i.e., DDR1, DDR2,
DDR3, and DDR4, entered the market and are widely used in home/office computers.
•ROM: It stands for Read-Only Memory. Data written to these
devices is non-volatile, i.e., once stored in the memory it cannot
be modified or deleted; the memory can only be read, not
written. The information is stored permanently, once, during manufacture.
ROM stores the instructions that are used to start a computer;
this operation is referred to as bootstrapping. ROM
is also used in other electronic items like washing machines and microwaves.
ROM chips can only store a few megabytes (MB) of data, typically between
4 and 8 MB per chip. There are three common types of ROM:
• PROM: Programmable Read-Only Memory. These are ROMs
that can be programmed once. A special PROM programmer is employed
to enter the program on the PROM. Once the chip has been
programmed, the information on it cannot be altered. PROM is
non-volatile, that is, data is not lost when power is switched off.
• EPROM: Erasable Programmable Read-Only Memory. It is possible to erase
the data previously stored on an EPROM (using ultraviolet light) and write
new data onto the chip.
• EEPROM: Electrically Erasable Programmable Read-Only Memory. Here, data
can be erased without using ultraviolet light, just by
applying an electric field.
Magnetic Storage Devices
• The most commonly used storage devices today are magnetic storage devices. They are affordable and
easily accessible, and a large amount of data can be stored on their magnetized media.
• A magnetic field is created when the device is attached to the computer, and with the help of the two magnetic
polarities, the device is able to read the binary language and store the information. Given below are examples
of magnetic storage devices.
• Floppy Disk – Also known as a floppy diskette, it is a removable, square-shaped storage device
comprising magnetic elements. When placed in the computer's disk reader, it spins and can
store information. Floppy disks have since been replaced by CDs, DVDs and USB drives.
• Hard Drive – This storage device is directly attached to the motherboard's disk controller. It is integral
storage space, as it is required to install any new program or application on the device. Software programs, images,
videos, etc. can all be saved on a hard drive, and hard drives with storage space in terabytes are now easily
available.
• Magnetic Card: A card in which data is stored by modifying or rearranging the magnetism of tiny iron-based
magnetic particles present on the band of the card. It is also known as a swipe card. It is used as a passcode (to
enter a house or hotel room), credit card, identity card, etc.
• Tape Cassette: Also known as a music cassette, it is a rectangular flat container in which data is stored on
analog magnetic tape. It is generally used to store audio recordings.
• SuperDisk: Also called LS-240 and LS-120, it was introduced by Imation Corporation and was popular with OEM
computers. It can store up to 240 MB of data.
• Zip Disk – Introduced by Iomega, a removable storage device initially released with a storage space of
100 MB, later increased to 250 MB and finally 750 MB.
• Magnetic Strip – A magnetic strip attached to a device comprising digital data. The most familiar example
is a debit card, which has a strip on one side that stores the digital data.
Optical Storage Devices
• Such devices use lasers and light to detect and store data. They are cheaper in
comparison to USB drives and can store more data. Discussed below are a few commonly
used optical storage devices.
• CD-ROM – Stands for Compact Disc – Read-Only Memory, an external medium
which can store and deliver data such as audio or software data.
• CD-R: Stands for Compact Disc Recordable. On this type of CD, once data is
written it cannot be erased; it becomes read-only.
• CD-RW: Stands for Compact Disc ReWritable. On this type of CD, you can easily
write or erase data multiple times.
• Blu-ray Disc – Introduced in 2006, the Blu-ray disc was backed by major IT and computer
companies. It can store up to 25 GB of data on a single-layer disc and 50 GB on a dual-
layer disc.
• DVD – Digital Versatile Disc is another type of optical storage device. It can be read-only,
recordable, or rewritable. Recordings can be made on such devices and then
attached to the system.
• DVD-R: Stands for Digital Versatile Disc Recordable. On this type of DVD, once
data is written it cannot be erased; it becomes read-only. It is generally used to record
movies, etc.
• DVD-RW: Stands for Digital Versatile Disc ReWritable. On this type of DVD, you
can easily write or erase data multiple times.
Flash Memory Devices
• These storage devices have now largely replaced both magnetic and optical storage devices. They are easy to use, portable,
and easily available. They have become a cheaper and more convenient option for storing data.
• Discussed below are the major flash memory devices in common use today.
• USB Drive – Also known as a pen drive, this storage device is small and portable, with storage
space ranging from 2 GB to 1 TB. It comprises an integrated circuit that allows it to store and replace data.
• SSD – Stands for Solid State Drive, a mass storage device like an HDD. It is more durable because, unlike a hard
disk, it contains no moving mechanical parts. It needs less power than a hard disk, is lightweight, and has roughly 10x
faster read and write speeds. However, SSDs are also more expensive. While SSDs serve the same
function as hard drives, their internal components are very different: instead of storing data on spinning magnetic
platters, SSDs store data in non-volatile flash memory. Since SSDs have no moving parts, they do not need to
"spin up". Capacities range from 150 GB to several TB.
• Memory Card – Usually used in smaller electronic devices like mobile phones or digital
cameras, a memory card can store images, videos and audio, and is compact and small in size.
• Memory Stick – Originally launched by Sony, a memory stick can store more data, and transferring data
with it is easy and quick. Later on, various other versions of the memory stick were also released.
• Multimedia Card: Also known as MMC, it is an integrated-circuit card generally used in car radios, digital
cameras, etc. It is an external device used to store data/information.
• SD Card – Known as a Secure Digital Card, it is used in various electronic devices to store data and is available in mini
and micro sizes. Generally, computers have a separate slot for an SD card. If they do not have one, separate
USB readers are available into which these cards can be inserted and then connected to the computer.

Cloud and virtual Storage
Nowadays, secondary storage has been extended by virtual or cloud storage.
• We can store our files and other data in the cloud, and the data is stored for
as long as we pay for the cloud storage.
• Many companies provide cloud services, most notably Google,
Amazon, and Microsoft.
• We pay rent for the amount of space we need and get multiple
benefits out of it.
• Though the data is actually stored on physical devices located in the data
centers of the service provider, the user doesn't interact with the physical
devices or their maintenance.
• For example, Amazon Web Services offers AWS S3 as a type of storage where
users can store data virtually instead of on their own physical hard
drives.
• These sorts of innovations represent the frontier of where storage media
is going.
Redundant Array of Independent Disks
(RAID)
• RAID stands for Redundant Array of Independent Disks,
a technology to connect multiple secondary storage
devices and use them as a single storage medium.
• RAID consists of an array of disks in which multiple disks are
connected together to achieve different goals. RAID levels
define the use of disk arrays.
RAID 0
• In this level, a striped array of disks is implemented.
• The data is broken down into blocks and the blocks are
distributed among disks.
• Each disk receives a block of data to write/read in
parallel.
• It enhances the speed and performance of the storage
device.
• There is no parity or backup in level 0.
RAID 1
• RAID 1 uses mirroring techniques.
• When data is sent to a RAID controller, it sends a
copy of data to all the disks in the array.
• RAID level 1 is also called mirroring and provides
100% redundancy in case of a failure.
RAID 2

• RAID 2 records an Error Correction Code (ECC), using
Hamming codes, for its data, striped on different
disks.
• Each data bit in a word is recorded on a
separate disk, and the ECC codes of the data words are
stored on a different set of disks.
• Due to its complex structure and high cost, RAID 2 is
not commercially available.
RAID 3
• RAID 3 stripes the data onto multiple disks.
• The parity bit generated for each data word is stored on a
separate disk.
• This technique makes it possible to overcome single-disk failures.
RAID 4
• In this level, an entire block of data is written onto data
disks and then the parity is generated and stored on a
different disk.
• Note that level 3 uses byte-level striping, whereas level 4
uses block-level striping.
• Both level 3 and level 4 require at least three disks to
implement RAID.
RAID 5

• RAID 5 writes whole data blocks onto different disks, but
the parity bits generated for each data block stripe are
distributed among all the data disks rather than stored
on a separate dedicated disk.
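The parity used by RAID levels 3-6 is a byte-wise XOR of the data blocks, which is what allows a single failed disk to be rebuilt. A minimal Python sketch (illustrative only, not a real RAID driver; the two-byte blocks are made-up values):

```python
from functools import reduce

def parity(blocks):
    """XOR all data blocks byte-by-byte to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple)
                 for byte_tuple in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Recover the lost block: XOR the parity with all surviving blocks."""
    return parity(surviving_blocks + [parity_block])

d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
p = parity([d0, d1, d2])                   # parity stored on another disk
assert reconstruct([d0, d2], p) == d1      # rebuild after the d1 disk fails
```

RAID 6 extends this idea with a second, independent parity so two disk failures can be tolerated.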
RAID 6
• RAID 6 is an extension of level 5.
• In this level, two independent parities are generated and
stored in a distributed fashion among multiple disks.
• Two parities provide additional fault tolerance. This level
requires at least four disk drives to implement RAID.
DBMS - File Structure or Placing File Records on Disk

• Related data and information are stored collectively in
file formats.
• A file is a sequence of records stored in binary
format.
• A disk drive is formatted into several blocks that can
store records.
• File records are mapped onto those disk blocks.
File Organization
File Organization defines how file records are mapped onto disk blocks.
We have different types of file organization to organize file records:

1. Sequential File Organization
• It is one of the simplest methods of file organization.
• Here the records are stored one after the other in a sequential
manner. This can be achieved in two ways.
• In the first method, records are stored one after the other as they are
inserted into the tables.
• This is called the pile file method.
• When a new record is inserted, it is placed at the end of the file.
• In the case of any modification or deletion of a record, the record
is first searched for in the memory blocks.
• Once found, it is marked for deletion and the new
record is entered.
Inserting a new record:

In the diagram above, R1, R2, R3, etc. are the records.
They contain all the attributes of a row; i.e., a student record will
have id, name, address, course, DOB, etc.
Similarly, R1, R2, R3, etc. can each be considered one full set of attributes.

In the second method, records are sorted (either ascending or
descending) each time they are inserted into the system.
This is called the sorted file method. Sorting of records may be
based on the primary key or on any other column.
Whenever a new record is inserted, it is placed at the end of
the file and the file is then sorted – ascending or descending based on the key
value – so the record lands in the correct position.
In the case of an update, the record is updated and the file is then sorted
to place the updated record in the right place. The same applies to
deletes.
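The sorted file method can be sketched in a few lines of Python. This is a minimal illustration, not DBMS code; the (id, name) records and their keys are made-up:

```python
import bisect

# Sorted file method: the file is kept ordered by key on every insert.
records = [(10, "Anu"), (20, "Ravi"), (40, "Sita")]   # already sorted by id

def insert_sorted(records, record):
    """Place the new record at its correct sorted position by key."""
    bisect.insort(records, record)   # binary-search for the spot, then insert

insert_sorted(records, (30, "John"))
# records now holds ids 10, 20, 30, 40 in order
```

Note that a real implementation sorts disk blocks rather than an in-memory list, which is exactly why the repeated sorting listed under the disadvantages below is costly.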
Advantages of Sequential File Organization
 The design is very simple compared to other file organizations. There
is not much effort involved in storing the data.
 When there are large volumes of data, this method is fast and
efficient.
 This method is helpful when most of the records have to be
accessed, such as calculating the grade of every student or generating
salary slips, where all the records are used in the calculation.
 This method is good for report generation and statistical
calculations.
 These files can be stored on magnetic tapes, which are
comparatively cheap.
Disadvantages of Sequential File Organization

 The sorted file method always involves the effort of
sorting the records.
 Each time an insert/update/delete transaction is
performed, the file is sorted.
 Hence identifying the record, inserting/updating/
deleting it, and then sorting the file always
takes time and may make the system slow.
2. Heap File Organization
• This is the simplest form of file organization.
• Records are inserted at the end of the file as and when
they arrive.
• There is no sorting or ordering of the records. Once a data
block is full, the next record is stored in a new block.
• This new block need not be the very next block; this
method can select any block in memory to store the new
records.
• It is similar to the pile file in the sequential method, but here
data blocks are not selected sequentially.
• They can be any data blocks in memory. It is the
responsibility of the DBMS to store the records and manage
them.
Diagrammatic representation of Heap File Organization
If a new record is inserted, then in the above case it will be inserted into data block 1.
• When a record has to be retrieved from the database, in this
method we need to traverse from the beginning of the file until we
reach the requested record.
• Hence fetching records from very large tables is time-
consuming.
• This is because there is no sorting or ordering of the records:
we need to check all the data.
• Similarly, if we want to delete or update a record, we first need to
search for it. Again, searching is similar to
retrieval – start from the beginning of the file and scan until the
record is found.
• If it is a small file, the record can be fetched quickly.
• But the larger the file, the more time must be spent
fetching.
• In addition, when a record is deleted, it is removed
from the data block.
• But the space is not freed and cannot be re-used.
Hence, as the number of records increases, the memory
size also increases, and the efficiency decreases.
• For the database to perform better, the DBA has to free
this unused memory periodically.
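Retrieval from a heap file is a plain linear scan, as the points above describe. A minimal sketch (the block/record layout and sample values are hypothetical, chosen only to illustrate the scan):

```python
# Heap file retrieval: scan every block from the start until the key is found.
# Blocks are unordered, so every block may potentially hold the record.
heap_blocks = [
    [(101, "Anu"), (205, "Ravi")],   # block 1
    [(150, "Sita")],                 # block 2
    [(99, "John")],                  # block 3
]

def find_record(blocks, key):
    for block in blocks:             # no ordering to exploit
        for record in block:
            if record[0] == key:
                return record
    return None                      # a full scan is needed to conclude "not found"

assert find_record(heap_blocks, 150) == (150, "Sita")
```

The cost is proportional to the number of blocks, which is why the disadvantages below single out large files.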
Advantages of Heap File Organization
 Very good method of file organization for bulk insertion,
i.e., when a huge amount of data needs to be loaded
into the database at one time, this method of file
organization is best suited.
 Records are simply inserted one after the other in the
memory blocks.
 It is suited for very small files, as fetching records
from them is fast.
Disadvantages of Heap File Organization

 This method is inefficient for larger databases, as it
takes time to search for or modify a record.
 Proper memory management is required to maintain
performance; otherwise there will be many unused memory
blocks lying around and the file size will simply keep growing.
3. Hash/Direct File Organization

• In this method of file organization, a hash function is used to
calculate the address of the block in which to store each record.
• The hash function can be any simple or complex mathematical
function.
• The hash function is applied to some column/attribute – either a
key or a non-key column – to get the block address.
• Hence each record is stored "randomly", irrespective of the order in which
records arrive. This method is therefore also known as Direct or Random file
organization.
• If the hash function is computed on a key column, that column
is called the hash key; if it is computed on a non-key
column, the column is called the hash column.
• When a record has to be retrieved, the address is generated from the
hash key column, and the whole record is retrieved directly from that
address.
• There is no need to traverse the whole file. Similarly, when a new
record has to be inserted, the address is generated from the hash key and the
record is inserted directly.
• The same applies to update and delete: there is no effort spent
searching the entire file or sorting the file. Each record is
stored "randomly" in memory.
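The direct-access idea can be sketched as follows. This is an illustrative toy, assuming a division-method hash on the key column; the block count, keys, and values are made-up:

```python
# Hash/direct file organization: the block address comes straight from a
# hash of the key, so fetch/insert/delete touch exactly one block.
NUM_BLOCKS = 5
blocks = {i: {} for i in range(NUM_BLOCKS)}   # block address -> records

def block_address(key):
    return key % NUM_BLOCKS            # simple division-method hash

def insert(key, row):
    blocks[block_address(key)][key] = row

def fetch(key):
    return blocks[block_address(key)].get(key)   # one block access, no scan

insert(103, "Anu")
insert(207, "Ravi")
assert fetch(207) == "Ravi"            # no traversal of the file was needed
```

Contrast this with the heap file above, where the same fetch requires a linear scan.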
Advantages of Hash File Organization
 Records need not be sorted after any transaction, so
the effort of sorting is avoided in this method.
 Since the block address is known from the hash function, accessing any
record is very fast. Similarly, updating or deleting a record is
also very quick.
 This method can handle multiple transactions, as each record is
independent of the others; i.e., since there is no dependency on
storage location between records, multiple records can be
accessed at the same time.
 It is suitable for online transaction systems like online
banking, ticket booking systems, etc.
Disadvantages of Hash File Organization
 This method may accidentally delete data.
 Since all the records are stored "randomly", they are scattered in memory. Hence memory is not
used efficiently.
 If we are searching for a range of data, this method is not suitable, because each record is
stored at a random address. A range search will not map to a contiguous address range, so it
will be inefficient. For example, searching for employees with salary from 20K to 30K will be
inefficient.
 Searching for records by an exact name or value is efficient, but a partial-match search, e.g.
for student names starting with 'B', is not, as it does not supply the exact hash key value.
 If the search is on some column that is not the hash column, the search will not be
efficient. This method is efficient only when the search is on the hash column; otherwise it will
not be able to find the correct address of the data.
 If there are multiple hash columns – say the name and phone number of a person – used to generate the
address, then searching for a record using the phone number or name alone will not give correct results.
 If the hash columns are frequently updated, the data block address also changes
accordingly: each update generates a new address. This is also not acceptable.
 The hardware and software required for memory management are costlier in this case. Complex
programs need to be written to make this method efficient.
4. Cluster File Organization

There are two types of cluster file organization:

•Indexed Clusters: Here records are grouped based on the cluster key and stored
together; for example, a STUDENT-COURSE cluster keyed on COURSE_ID is an indexed
cluster. The records are grouped by the cluster key – COURSE_ID – and all the
related records are stored together. This method is used when data is retrieved for a
range of cluster key values, or when there is huge data growth within the clusters –
for example, when we have to select the students attending courses with
COURSE_ID 230-240, or when a large number of students, say 250, attend the same
course.

•Hash Clusters: This is similar to an indexed cluster, except that instead of storing the
records based on the cluster key directly, we generate a hash value of the cluster key
and store records with the same hash value together on disk.
Advantages of Clustered File Organization
This method is best suited when there are frequent requests for joining
the tables with the same join condition.
When there is a 1:M mapping between the tables, it works
efficiently.

Disadvantages of Clustered File Organization

This method is not suitable for very large databases, since its
performance on them is low.
We cannot use these clusters if the join condition changes;
if it does, traversing the file takes a lot of time.
•This method is not suitable for infrequently joined tables or for tables
with 1:1 join conditions.
Hashing in DBMS
• In a huge database structure, it is very inefficient to search all the
index values to reach the desired data. The hashing technique is used
to calculate the direct location of a data record on disk without
using an index structure.
• In this technique, data is stored in the data blocks whose addresses are
generated using a hashing function. The memory locations
where these records are stored are known as data buckets or data
blocks.
• A hash function can use any column value to
generate the address. Most of the time, the hash function uses the
primary key to generate the address of the data block. The hash
function can range from a simple mathematical function to a complex
one.
• We can even consider the primary key itself as the address of the
data block; that is, each row is stored in the data block whose address
equals its primary key value.
Hash function
• There are many hash functions that use numeric or alphanumeric keys. Common numeric
hash functions are:

1. Division Method
   Formula: h(K) = k mod M
   where k is the key value and M is the size of the hash table.
   Example: k = 12345, M = 95
   h(12345) = 12345 mod 95 = 90
   k = 1276, M = 11
   h(1276) = 1276 mod 11 = 0

2. Mid Square Method
   Formula: h(K) = middle digits of k x k
   where k is the key value.
   Example: k = 60
   k x k = 60 x 60 = 3600
   h(60) = 60 (the middle two digits of 3600)

3. Folding Method
   Formula: split the key into parts k1, k2, ..., kn and add them:
   s = k1 + k2 + ... + kn, and h(K) = s
   Example: k = 12345
   k1 = 12, k2 = 34, k3 = 5
   s = 12 + 34 + 5 = 51, so h(12345) = 51

4. Multiplication Method
   Formula: h(K) = floor(M (kA mod 1))
   where k is the key value, M is the size of the hash table,
   and A is a constant (0 < A < 1).
   Example: k = 12345, A = 0.357840, M = 100
   h(12345) = floor[100 (12345 x 0.357840 mod 1)]
            = floor[100 (4417.5348 mod 1)]
            = floor[100 x 0.5348]
            = floor[53.48]
            = 53
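The four methods above can be written out directly; the sketch below uses the document's own example values. The `digits` and `part_size` parameters are illustrative choices, not fixed by the methods themselves:

```python
import math

def division(k, M):
    """Division method: h(K) = k mod M."""
    return k % M

def mid_square(k, digits=2):
    """Mid square method: square the key and take its middle digits."""
    s = str(k * k)
    mid = len(s) // 2
    return int(s[mid - digits // 2 : mid + (digits + 1) // 2])

def folding(k, part_size=2):
    """Folding method: split the key into parts and add them."""
    s = str(k)
    parts = [int(s[i:i + part_size]) for i in range(0, len(s), part_size)]
    return sum(parts)

def multiplication(k, M, A=0.357840):
    """Multiplication method: h(K) = floor(M * (k*A mod 1))."""
    return math.floor(M * ((k * A) % 1))

assert division(12345, 95) == 90
assert mid_square(60) == 60          # 60*60 = 3600, middle digits "60"
assert folding(12345) == 51          # 12 + 34 + 5
assert multiplication(12345, 100) == 53
```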
The diagram above shows data block addresses that are the same as the primary key
values. The hash function can also be a simple mathematical function like
exponential, mod, cos, sin, etc. Suppose we use a mod(5) hash
function to determine the address of the data block. In this case, it
applies mod(5) to the primary keys and generates 3, 3, 1,
4 and 2 respectively, and the records are stored at those data block
addresses.
Types of Hashing:

• Static Hashing
• Dynamic Hashing
Static Hashing
• In static hashing, the resultant data bucket address is always the
same. That means if we generate an address for EMP_ID = 103 using the
hash function mod(5), it will always result in the same bucket address,
3. There is no change in the bucket address.
• Hence in static hashing, the number of data buckets in memory
remains constant throughout. In this example, we have five data
buckets (addresses 0-4) in memory to store the data.
Operations of Static Hashing:
Searching a record:
When a record needs to be searched, the same hash function is used to retrieve
the address of the bucket where the data is stored.

Insert a Record:
When a new record is inserted into the table, an address is generated for the
new record based on the hash key, and the record is stored at that location.

Delete a Record:
To delete a record, we first fetch the record which is to be deleted.
Then we delete the record at that address in memory.

Update a Record:
To update a record, we first search for it using the hash function, and then the
record is updated.

If we want to insert a new record into the file but the address of the data bucket
generated by the hash function is not empty, i.e., data already exists at that
address, the situation is known as bucket overflow. This is a
critical situation in static hashing.

To overcome this situation, there are various methods. Some commonly used
methods are as follows:
1. Open Hashing

When the hash function generates an address at which data is already
stored, the next available bucket is allocated to the record. This mechanism
is called Linear Probing.

For example: suppose R3 is a new record which needs to be inserted, and
the hash function generates address 110 for R3. But the generated
address is already full, so the system searches for the next available data
bucket, 113, and assigns R3 to it.
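Linear probing can be sketched as below. This is a toy illustration, assuming a mod-based hash and one record per bucket; the table size and keys are made-up:

```python
# Linear probing: on collision, try successive bucket addresses until a
# free one is found (wrapping around at the end of the table).
SIZE = 7
table = [None] * SIZE                 # one record slot per bucket (simplified)

def insert(key, row):
    addr = key % SIZE                 # home bucket
    for probe in range(SIZE):         # scan forward from the home bucket
        slot = (addr + probe) % SIZE
        if table[slot] is None:
            table[slot] = (key, row)
            return slot
    raise RuntimeError("table full")

insert(110, "R1")                     # home bucket 110 % 7 = 5: free, placed there
insert(117, "R3")                     # also hashes to 5: full, so placed in bucket 6
```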
2. Close Hashing

When a bucket is full, a new data bucket is allocated for the same hash
result and is linked after the previous one. This mechanism is known as
Overflow Chaining.

For example: suppose R3 is a new record which needs to be inserted into the
table, and the hash function generates address 110 for it. But this bucket is
too full to store the new data. In this case, a new bucket is appended at the
end of bucket 110 and is linked to it.
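Overflow chaining can be sketched as follows. The bucket capacity, hash function, and keys are illustrative assumptions, not part of the method itself:

```python
from collections import defaultdict

# Overflow chaining: when a bucket is full, link a new overflow bucket
# after it and continue inserting there.
BUCKET_CAPACITY = 2
buckets = defaultdict(lambda: [[]])   # hash address -> chain of buckets

def insert(key, row):
    chain = buckets[key % 5]
    if len(chain[-1]) >= BUCKET_CAPACITY:
        chain.append([])              # allocate a new bucket, linked at the end
    chain[-1].append((key, row))

for k in (110, 115, 120):             # all three keys hash to address 0
    insert(k, f"R{k}")
# address 0 now has a full primary bucket plus one overflow bucket
```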
2. Dynamic Hashing
•The dynamic hashing method is used to overcome the
problems of static hashing, like bucket overflow.
•In this method, data buckets grow or shrink as the number of records
increases or decreases. This method is also known as the
Extendible hashing method.
•This method makes hashing dynamic, i.e., it allows insertion and
deletion without resulting in poor performance.
How to search a key
•First, calculate the hash address of the key.
•Check how many bits of the address the directory uses; this number of
bits is called i.
•Take the least significant i bits of the hash address. This gives
an index into the directory.
•Now, using the index, go to the directory and find the bucket address
where the record might be.
How to insert a new record
•First, follow the same procedure as for retrieval,
ending up at some bucket.
•If there is still space in that bucket, place the record in it.
•If the bucket is full, split the bucket and redistribute
the records.
For example:
•Consider the following grouping of keys into buckets,
depending on the suffix of their hash address:

•The last two bits of 2 and 4 are 00, so they go into bucket
B0. The last two bits of 5 and 6 are 01, so they go into bucket
B1. The last two bits of 1 and 3 are 10, so they go into bucket
B2. The last two bits of 7 are 11, so it goes into B3.
• Insert key 9, whose hash address is 10001, into the above structure.
• The last two bits of 10001 are 01, so key 9 must go into bucket B1. But bucket B1 is full, so it gets split.
• The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5 and 9 are 001, so they go into bucket B1, while the last three bits of 6's address are 101, so it goes into bucket B5.
• Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory entries, because the last two bits of both entries are 00.
• Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 entries, because the last two bits of both entries are 10.
• Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries, because the last two bits of both entries are 11.
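The directory-doubling behavior walked through above can be sketched compactly. This is an illustration, not the textbook algorithm verbatim: the bucket capacity of 2 is an assumption, keys are used directly as hash addresses, and the directory is indexed by the least significant `global_depth` bits.

```python
BUCKET_SIZE = 2  # assumed bucket capacity, chosen small to force splits

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # local depth: low-order bits all keys share
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # indexed by last i bits

    def _bucket(self, addr):
        return self.directory[addr & ((1 << self.global_depth) - 1)]

    def insert(self, addr):
        b = self._bucket(addr)
        if len(b.keys) < BUCKET_SIZE:
            b.keys.append(addr)
            return
        # Bucket full: split it, doubling the directory if necessary.
        if b.depth == self.global_depth:
            self.directory += self.directory      # double the directory
            self.global_depth += 1
        new = Bucket(b.depth + 1)
        b.depth += 1
        bit = b.depth - 1                         # the newly significant bit
        for idx in range(len(self.directory)):    # re-point half the entries
            if self.directory[idx] is b and (idx >> bit) & 1:
                self.directory[idx] = new
        pending, b.keys = b.keys + [addr], []
        for k in pending:                         # redistribute (may resplit)
            self.insert(k)

h = ExtendibleHash()
for addr in (0b00000, 0b00100, 0b00001, 0b00101, 0b00010):
    h.insert(addr)
h.insert(0b10001)   # like key 9's address 10001: the ...01 bucket must split
print(h.global_depth)
```

After inserting 10001, the bucket holding addresses ending in 01 overflows and splits on the third bit, so the directory grows to use three bits, mirroring the split of B1 in the example.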
Advantages of dynamic hashing
•Performance does not degrade as the data in the system grows; the structure simply adds memory to accommodate the data.
•Memory is well utilized, since the structure grows and shrinks with the data; there is no large block of unused memory lying idle.
•This method is good for dynamic databases where data grows and shrinks frequently.
Disadvantages of dynamic hashing
•If the data size increases, the bucket size also increases, and the addresses of the data are maintained in a bucket address table. This is needed because the data addresses keep changing as buckets grow and shrink.
•If there is a huge increase in data, maintaining the bucket address table becomes tedious.
•Bucket overflow can still occur, although it tends to occur far less often than in static hashing.
Indexing
• We know that data is stored in the form of records. Every record
has a key field, which helps it to be recognized uniquely.
• Indexing is a data structure technique to efficiently retrieve
records from the database files based on some attributes on which
the indexing has been done. Indexing in database systems is
similar to what we see in books.
• Indexing is defined based on its indexing attributes. Indexing can
be of the following types −
– Primary Index
– Secondary Index
– Clustering Index
Primary Index − Primary index is defined on an ordered data file.
The data file is ordered on a key field. The key field is generally the
primary key of the relation.
Primary Indexes
• An ordered file with two fields:
  – the primary key, K(i)
  – a pointer to a disk block, P(i)
• There is one index entry in the index file for each block in the data file.
• Indexes may be dense or sparse:
  – a dense index has an index entry for every search key value in the data file;
  – a sparse index has entries for only some search values.
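A sparse primary index lookup amounts to a binary search over the block-anchor entries: find the rightmost entry whose key is not greater than the search key, then fetch that block. A small sketch, with made-up keys and block names:

```python
import bisect

# Hypothetical sparse primary index: one (anchor key, block pointer) entry
# per data block of a file that is ordered on the key.
index_keys   = [101, 111, 151]                  # first key in each data block
index_blocks = ["T1S3B2", "T1S3B4", "T1S3B6"]   # pointers to those blocks

def find_block(key):
    """Return the block that could contain `key`, or None if key < all anchors."""
    pos = bisect.bisect_right(index_keys, key) - 1   # rightmost anchor <= key
    return index_blocks[pos] if pos >= 0 else None

print(find_block(112))   # 112 falls in the block starting at 111 -> "T1S3B4"
```

Only the small index is binary-searched; the target record is then located inside the single fetched data block.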
Clustering Indexes
•When file records are physically ordered on a non-key field that does not have a distinct value for each record, that field is called the clustering field and the data file is called a clustered file.
•We can create a different type of index, called a clustering index, to speed up retrieval of all the records that have the same value for the clustering field.
•This differs from a primary index, which requires that the ordering field of the data file have a distinct value for each record.
•A clustering index is also an ordered file with two fields: the first field is of the same type as the clustering field of the data file, and the second field is a disk block pointer.
(Figure: a clustering index on the Dept_number ordering non-key field of an EMPLOYEE file.)
• Suppose that we consider the same ordered file with r = 300,000 records stored on a disk with block size B = 4,096 bytes.
• Imagine that it is ordered by the attribute Zipcode and there are 1,000 zip codes in the file (an average of 300 records per zip code, assuming an even distribution).
• The index in this case has 1,000 index entries of 11 bytes each (5-byte Zipcode and 6-byte block pointer), giving a blocking factor
  bfri = ⌊B/Ri⌋ = ⌊4,096/11⌋ = 372 index entries per block.
• The number of index blocks is hence
  bi = ⌈ri/bfri⌉ = ⌈1,000/372⌉ = 3 blocks.
• A binary search on the index file therefore needs ⌈log2 bi⌉ = ⌈log2 3⌉ = 2 block accesses.
Secondary Index − Secondary index may be generated from a field
which is a candidate key and has a unique value in every record, or a
non-key with duplicate values.
Secondary Indexes
• Provide a secondary means of accessing a data file for which some primary access already exists.
• The data file records could be ordered, unordered, or hashed.
• The secondary index may be created on a field that is a candidate key and has a unique value in every record, or on a non-key field with duplicate values.
• It is an ordered file with two fields:
  – the indexing field, K(i)
  – a block pointer or record pointer, P(i)
• A secondary index usually needs more storage space and a longer search time than a primary index, but it improves search time for an arbitrary record.
Introduction
• Most indexes are based on ordered files.
• Tree data structures organize the index.

Example data file (fields: Sid, Dept, Name, Mark), stored in three blocks:
  Block T1S3B2: (101, BCD, Tony, 89), (102, BCN, Roy, 90), (104, BCD, Mala, 95)
  Block T1S3B4: (111, BCI, Reddy, 99), (112, BCN, Rao, 70), (114, BCD, Harini, 85)
  Block T1S3B6: (151, BCD, Menon, 93), (152, BCI, Nair, 96), (154, BCD, Komal, 91)

A sparse index on Sid keeps one (Field, Pointer) entry per block:
  101 -> T1S3B2
  111 -> T1S3B4
  151 -> T1S3B6

A dense secondary index on the non-ordering field Mark keeps one entry per record, in sorted order (70, 85, 89, 90, 91, 93, 95, 96, 99), each entry carrying a pointer to its record.
Secondary Indexes
(Figure: a dense secondary index, with block pointers, on a non-ordering key field of a file.)
Secondary Indexes: Problem
Assume (with a secondary index):
• Total number of records: 30,000; record size: 100 bytes
• Block size: 1,024 bytes; index entry size: 15 bytes (9 for the field, 6 for the pointer)
• How many records can a block hold? ⌊1,024/100⌋ = 10 records/block
• How many blocks are required for the data file? 30,000/10 = 3,000 blocks
• How many index entries fit in a block? ⌊1,024/15⌋ = 68 entries/block
• How many blocks are required to store the 30,000 index entries (the index is dense, one entry per record)? ⌈30,000/68⌉ = 442 blocks
• How many block accesses does a binary search on the index need? ⌈log2 442⌉ = 9
• One additional disk access is needed to go from the index entry to the file: 1 block
• Therefore the total disk accesses using the index: 9 + 1 = 10 blocks
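The steps above can be checked mechanically with a few lines of arithmetic:

```python
from math import ceil, log2

# Re-deriving the secondary-index cost figures step by step.
records, record_size = 30_000, 100
block_size, entry_size = 1_024, 15

records_per_block = block_size // record_size      # 10 records per block
data_blocks = ceil(records / records_per_block)    # 3,000 data blocks
entries_per_block = block_size // entry_size       # 68 index entries per block
index_blocks = ceil(records / entries_per_block)   # dense: one entry/record
total_accesses = ceil(log2(index_blocks)) + 1      # binary search + 1 data block
print(index_blocks, total_accesses)                # 442 10
```

The same function-free arithmetic applies to any record, block, and entry size; only the four constants change.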
Secondary Indexes: Advantages
• A secondary index enables binary search when the primary index is not applicable.
• It is better than a linear search, though not as fast as a primary index.

Sparse index: if the index holds entries for only some of the records, it is called a sparse index.
Dense index: if there is an index entry for every record, it is called a dense index.
Multilevel Index
Index records comprise search-key values and data pointers. A multilevel index is stored on disk along with the actual database files. As the database grows, so does the size of the index. There is a strong incentive to keep the index records in main memory to speed up search operations, but a large single-level index cannot be kept in memory, which leads to multiple disk accesses.
• Suppose that a dense secondary index with a blocking factor bfri = 273 index entries per block (which is also the fan-out fo of the multilevel index) and b1 = 1,099 first-level blocks is converted into a multilevel index.
• The number of second-level blocks will be b2 = ⌈b1/fo⌉ = ⌈1,099/273⌉ = 5 blocks, and the number of third-level blocks will be b3 = ⌈b2/fo⌉ = ⌈5/273⌉ = 1 block.
• Hence the third level is the top level of the index, and t = 3. To access a record by searching the multilevel index, we must access one block at each level plus one block from the data file, so we need t + 1 = 3 + 1 = 4 block accesses.
• Compare this with the single-level case, where ⌈log2 1,099⌉ + 1 = 11 + 1 = 12 block accesses were needed with binary search.
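The level computation above generalizes to a short loop: divide the block count at each level by the fan-out until a single block remains.

```python
from math import ceil

# How many levels a multilevel index needs, and the cost of one lookup:
# repeatedly divide the number of blocks at the current level by the
# fan-out until a single (top-level) block remains.
def multilevel_cost(first_level_blocks, fan_out):
    """Return (t, t + 1): index levels and block accesses per record lookup."""
    levels, blocks = 1, first_level_blocks
    while blocks > 1:
        blocks = ceil(blocks / fan_out)   # blocks at the next level up
        levels += 1
    return levels, levels + 1

print(multilevel_cost(1099, 273))   # (3, 4): matches the worked example
```

Running it on the earlier secondary-index figures (442 first-level blocks, fan-out 68) also gives three levels and four block accesses, versus ten with a single-level index and binary search.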
Indexed Sequential Access Method (ISAM)
• This is an advanced sequential file organization method.
• Records are stored in order of the primary key; the file is sorted using the primary key.
• For each primary key, an index value is generated and mapped to the record. This index is simply the address of the record in the file.
Advantages of ISAM
 Since each record has its data block address, searching for a record in a large database is easy and quick; no extra effort is needed. However, a proper primary key has to be selected for ISAM to be efficient.
 This method gives the flexibility of using any column as the key field, with the index generated on it. In addition to the primary key and its index, indexes can be generated on other fields too, so searching becomes more efficient when queries use columns other than the primary key.
 It supports range retrieval and partial retrieval of records. Since the index is based on the key value, we can retrieve the data for a given range of values. Likewise, a partial key value, say student names starting with 'JA', can also be searched easily.
Disadvantages of ISAM
 There is an extra cost to maintain the index: extra disk space is needed to store the index values, and with multiple key-index combinations the disk space grows further.
 As new records are inserted, these files have to be restructured to maintain the sequence. Similarly, when a record is deleted, the space it used needs to be released; otherwise the performance of the database will degrade.
Worked example (primary index with two levels):
• r = 850,000 records; record size = 120 bytes; block size = 10 KB = 10 × 1,024 = 10,240 bytes.
• Index entry size = 6-byte key + 6-byte pointer = 12 bytes.
• Records per block: ⌊10,240/120⌋ = 85.
• Data blocks required: 850,000/85 = 10,000.
• Index entries per block (fan-out): ⌊10,240/12⌋ = 853.
• First-level index blocks: b1 = ⌈10,000/853⌉ = 12.
• Second-level index blocks: b2 = ⌈12/853⌉ = 1, so the index has t = 2 levels.
• Block accesses to fetch a record: t + 1 = 2 + 1 = 3.
Example
• Table T has 50,000 records, with a record length of 88 bytes and a block size of 2,048 bytes. Make a comparative study of a linear search on the file records (fixed-size records, unspanned allocation) with primary indexing and without indexing.
• Linear search is a basic algorithm for finding a record in a file: the file is scanned sequentially from the beginning until the target record is found or the end of the file is reached.
• In this scenario, we have a table with 50,000 records, each 88 bytes long, and a block size of 2,048 bytes. We will compare a linear search over the file (fixed-size records, unspanned allocation) without indexing against a search that uses a primary index.
Without Indexing:
• When searching a file without an index, the file must be scanned sequentially.
• Each block can hold ⌊2,048/88⌋ = 23 records, so the file occupies ⌈50,000/23⌉ = 2,174 blocks.
• On average, a linear search reads half the blocks before finding the record: 2,174/2 ≈ 1,087 block accesses (and all 2,174 in the worst case).
With Primary Indexing:
• Primary indexing creates an index on the key field of the ordered file, with one entry per data block: 2,174 entries. Suppose the key field is a 4-byte integer and each entry also carries a block pointer (say 8 bytes, so 12 bytes per entry in total).
• Each index block then holds ⌊2,048/12⌋ = 170 entries, so the index occupies ⌈2,174/170⌉ = 13 blocks.
• To search for a record with key value K, we binary-search the index for the block containing the record: ⌈log2 13⌉ = 4 block accesses.
• One more access retrieves the data block itself; the record is then located within that block in memory.
• The total is 4 + 1 = 5 block accesses, versus about 1,087 on average without an index.
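One consistent way to run the comparison is sketched below. The 12-byte index entry (4-byte key plus an assumed 8-byte block pointer) is our assumption, since the text fixes only the key size.

```python
from math import ceil, log2

# Comparing average lookup cost without and with a sparse primary index.
records, record_size, block_size = 50_000, 88, 2_048

bfr = block_size // record_size                       # 23 records per block
data_blocks = ceil(records / bfr)                     # 2,174 data blocks
linear_avg = data_blocks // 2                         # average linear-search cost

entries_per_block = block_size // 12                  # 170 index entries/block
index_blocks = ceil(data_blocks / entries_per_block)  # one entry per data block
indexed_cost = ceil(log2(index_blocks)) + 1           # binary search + 1 data block
print(linear_avg, indexed_cost)                       # 1087 5
```

Under these assumptions the index cuts an average lookup from roughly a thousand block reads to five.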
Locking protocols
There are four types of lock protocols available:

1. Simplistic Lock Protocol
• Simplistic lock-based protocols allow transactions to obtain a lock on every object before a 'write' operation is performed.
• Transactions may unlock the data item after completing the 'write' operation.
2. Pre-claiming Lock Protocol
• Pre-claiming protocols evaluate their operations and create a list of data items on which they need locks.
• Before initiating an execution, the transaction requests the system for all the locks it needs beforehand.
• If all the locks are granted, the transaction executes and releases all the locks when all its operations are over.
• If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.
3. Two-Phase Locking (2PL): this protocol divides the execution of a transaction into three parts.
• In the first part, when the transaction starts executing, it seeks permission for the locks it requires.
• In the second part, the transaction acquires all the locks.
• As soon as the transaction releases its first lock, the third part starts; in this phase, the transaction cannot demand any new locks, it only releases the acquired locks.
Two-Phase Locking (2PL), cont.
• Two-phase locking has two phases: a growing phase, in which the transaction acquires all its locks, and a shrinking phase, in which the locks held by the transaction are released.
• 2PL also permits lock conversion: a transaction may acquire a shared (read) lock first and upgrade it to an exclusive (write) lock later, but only during the growing phase.
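The growing/shrinking rule can be sketched as a tiny lock-tracking class. This is a toy illustration of the 2PL constraint for a single transaction, not a real lock manager (no shared/exclusive modes, no conflict checking between transactions).

```python
# Toy sketch of the two-phase rule: once the first lock is released
# (shrinking phase begins), no new lock may ever be acquired.

class TwoPhaseTransaction:
    def __init__(self):
        self.locks = set()
        self.shrinking = False      # flips to True at the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after an unlock")
        self.locks.add(item)

    def unlock(self, item):
        self.shrinking = True       # the growing phase is over for good
        self.locks.discard(item)

t = TwoPhaseTransaction()
t.lock("A"); t.lock("B")            # growing phase
t.unlock("A")                       # shrinking phase begins here
try:
    t.lock("C")                     # illegal under 2PL
except RuntimeError as err:
    print(err)
```

The variants below differ only in *when* `unlock` is allowed: strict 2PL delays write-lock release, and rigorous 2PL delays all releases, until commit.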
Three categories of 2PL:
• Strict 2PL
• Rigorous 2PL and
• Conservative 2PL
Strict Two-Phase Locking:
• The first phase of Strict-2PL is the same as in basic 2PL.
• After acquiring its locks, the transaction continues to execute normally, but it does not release any of its exclusive (write) locks until after it commits or aborts.
• Because no other transaction can read a value written by an uncommitted transaction, Strict-2PL does not suffer the cascading aborts that basic 2PL does.
Rigorous Two-Phase Locking
• Rigorous 2PL goes a step further than strict 2PL: a transaction holds all of its locks, shared (read) as well as exclusive (write), until it commits or aborts.
• In other words, no lock of any kind is released before the end of the transaction.
Advantages:
1. Rigorous 2PL is simple to implement, since the scheduler never has to decide when a lock can safely be released early.
2. It guarantees strict (cascadeless) schedules, and transactions can be serialized in the order in which they commit.
Disadvantages:
• Concurrency is lower than with plain 2PL, because even read locks are held until commit, so other transactions wait longer for data items.
Conservative Two-Phase Locking
• Conservative (or static) 2PL requires a transaction to lock all the items it will access before the transaction begins execution, by predeclaring its read-set and write-set.
• If any of the predeclared items cannot be locked, the transaction does not lock anything; it waits until all the items are available and then acquires every lock at once.
• The advantage of conservative 2PL is that it is deadlock-free: a transaction never waits for a lock while already holding one.
• The disadvantages are that it is difficult to know the complete read-set and write-set in advance, and concurrency suffers because locks are held from before the transaction starts until it finishes.
