RDBMS Unit - II 2023
____________________________________________________
DATABASE INTEGRITY AND NORMALISATION: Relational Database Integrity - The Keys
- Referential Integrity - Entity Integrity - Redundancy and Associated Problems – Single Valued
Dependencies – Normalisation - Rules of Data Normalisation - The First Normal Form - The
Second Normal Form - The Third Normal Form - Boyce Codd Normal Form - Attribute
Preservation - Lossless-join Decomposition - Dependency Preservation.
File Organisation : Physical Database Design Issues - Storage of Database on Hard Disks - File
Organisation and Its Types - Heap files (Unordered files) - Sequential File Organisation - Indexed
(Indexed Sequential) File Organisation - Hashed File Organisation - Types of Indexes - Index and
Tree Structure - Multi-key File Organisation - Need for Multiple Access Paths - Multi-list File
Organisation - Inverted File Organisation.
Example: consider a STUDENT table that stores the college details along with each student (illustrative data):

Student_ID   Course   College_Name   College_Rank
1            BCA      ABC College    1
2            BSc      ABC College    1
3            BCA      ABC College    1

As can be observed, the values of the attributes College_Name, College_Rank and Course are repeated, which can lead to problems. The problems caused by redundancy are: insertion anomaly, deletion anomaly and updation anomaly.
1. Insertion Anomaly –
If the details of a student whose course has not yet been decided are to be inserted, the insertion is not possible until a course is assigned to the student.
This problem occurs when a data record cannot be inserted without adding some additional, unrelated data to the record.
2. Deletion Anomaly –
If the details of the students in this table are deleted, the details of the college are also deleted, which by common sense should not happen.
This anomaly occurs when deleting a data record also loses some unrelated information that was stored as part of the deleted record.
3. Updation Anomaly –
Suppose the rank of the college changes; the change then has to be made all over the database, which is time-consuming and computationally costly.
If the update does not occur in every place, the database is left in an inconsistent state.
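The updation anomaly can be demonstrated in a few lines of Python; the STUDENT rows and college values here are illustrative, not from the notes:

```python
# A redundant table: the college rank is repeated in every student row,
# so an update must touch every copy or the data becomes inconsistent.
students = [
    {"name": "Amit", "course": "BCA", "college": "ABC College", "college_rank": 2},
    {"name": "Rita", "course": "BSc", "college": "ABC College", "college_rank": 2},
    {"name": "John", "course": "BCA", "college": "ABC College", "college_rank": 2},
]

# Update the rank in only one row (a partial update):
students[0]["college_rank"] = 1

# The table is now inconsistent: the same college has two different ranks.
ranks = {row["college_rank"] for row in students if row["college"] == "ABC College"}
print(len(ranks) > 1)  # True -> inconsistent state
```

Storing the college rank once, in a separate COLLEGE table, avoids the problem entirely.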
Functional Dependency: a functional dependency exists when one attribute (or set of attributes) uniquely determines another. It typically exists between the primary key and the non-key attributes within a table, and is written as
X → Y
The left side of the FD is known as the determinant, and the right side is known as the dependent.
Here the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee table, because if we know the Emp_Id, we can tell the employee name associated with it.
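As a sketch, a functional dependency X → Y can be checked on any table by verifying that rows agreeing on X also agree on Y; the `fd_holds` helper and the sample rows below are illustrative:

```python
def fd_holds(rows, x, y):
    """Return True if the functional dependency x -> y holds in rows."""
    seen = {}  # maps an X-value tuple to the Y-value tuple it determines
    for row in rows:
        key = tuple(row[a] for a in x)
        val = tuple(row[a] for a in y)
        if key in seen and seen[key] != val:
            return False  # same X, different Y -> dependency violated
        seen[key] = val
    return True

employees = [
    {"Emp_Id": 1, "Emp_Name": "Harry", "Dept": "Sales"},
    {"Emp_Id": 2, "Emp_Name": "George", "Dept": "Sales"},
    {"Emp_Id": 3, "Emp_Name": "Harry", "Dept": "HR"},
]

print(fd_holds(employees, ["Emp_Id"], ["Emp_Name"]))  # True: Emp_Id -> Emp_Name
print(fd_holds(employees, ["Emp_Name"], ["Emp_Id"]))  # False: two employees named Harry
```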
There are mainly six types of Functional Dependency in DBMS. Following are the types of Functional Dependencies in DBMS:
A. Multivalued Dependency
B. Trivial Functional Dependency
C. Non-Trivial Functional Dependency
D. Transitive Dependency
E. Fully Functional Dependency
F. Partial Functional Dependency
A) Multivalued Dependency in DBMS
Multivalued dependency occurs in the situation where there are multiple independent multivalued
attributes in a single table. A multivalued dependency is a complete constraint between two sets of
attributes in a relation. It requires that certain tuples be present in a relation. Consider the
following Multivalued Dependency Example to understand.
Example:
Bike_Model   Colour
M2000        Black
M2000        Red
M5000        Black
M5000        Red
Here Colour is independent of the other attributes, and each Bike_Model is associated with more than one Colour value, so the multivalued dependency Bike_Model ->-> Colour holds.

B) Trivial Functional Dependency in DBMS
A functional dependency X -> Y is trivial if Y is a subset of X. Trivial dependencies always hold.
Example:
Emp_id   Emp_name
AS555    Harry
AS811    George
AS999    Kevin
Here {Emp_id, Emp_name} -> Emp_id is a trivial functional dependency, because Emp_id is a subset of {Emp_id, Emp_name}.

C) Non-Trivial Functional Dependency in DBMS
A functional dependency X -> Y is non-trivial if Y is not a subset of X.
Example:
Company   CEO             Age
Google    Sundar Pichai   46

{Company} -> {CEO} (if we know the Company, we know the CEO's name)
But CEO is not a subset of Company, and hence it is a non-trivial functional dependency.
pg. 5 SHWETA K/SWAPNA A
RDBMS UNIT - II Ca iii SEM
____________________________________________________
D) Transitive Dependency in DBMS
A transitive dependency is a type of functional dependency which is formed indirectly by two other functional dependencies. Let's understand it with the following Transitive Dependency Example.
Example:
Company   CEO             Age
Google    Sundar Pichai   46
Alibaba   Jack Ma         54

{Company} -> {CEO} (if we know the company, we know its CEO's name)
{CEO} -> {Age} (if we know the CEO, we know the Age)
Therefore, according to the rule of transitive dependency:
{Company} -> {Age} should hold. That makes sense, because if we know the company name, we can know its CEO's age.
A transitive functional dependency exists when changing a non-key column might cause another non-key column to change.
Consider table 1: changing the non-key column Full Name may change Salutation.
https://www.youtube.com/watch?v=ABwD8IYByfk
A)
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize redundancy in a relation or set of relations. It is also used to eliminate undesirable characteristics like insertion, update and deletion anomalies.
o Normalization divides the larger table into smaller tables and links them using relationships.
o The normal forms are used to reduce redundancy in the database tables.
First Normal Form (1NF): a relation is in 1NF if every attribute contains only atomic (single) values.
EMPLOYEE table:
EMP_ID   EMP_NAME   EMP_PHONE                EMP_STATE
14       John       7272826385, 9064738238   UP
The decomposition of the EMPLOYEE table into 1NF is shown below:
EMP_ID   EMP_NAME   EMP_PHONE    EMP_STATE
14       John       7272826385   UP
14       John       9064738238   UP
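The 1NF decomposition above can be sketched in Python; the record layout is a hypothetical in-memory form of the EMPLOYEE table:

```python
# A non-1NF row: EMP_PHONE holds multiple values in one attribute.
employee = {"EMP_ID": 14, "EMP_NAME": "John",
            "EMP_PHONE": ["7272826385", "9064738238"], "EMP_STATE": "UP"}

# Convert to 1NF by emitting one row per atomic phone number.
rows_1nf = [
    {"EMP_ID": employee["EMP_ID"], "EMP_NAME": employee["EMP_NAME"],
     "EMP_PHONE": phone, "EMP_STATE": employee["EMP_STATE"]}
    for phone in employee["EMP_PHONE"]
]

for r in rows_1nf:
    print(r["EMP_ID"], r["EMP_NAME"], r["EMP_PHONE"], r["EMP_STATE"])
```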
Second Normal Form (2NF): a relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on the whole primary key.
Example: Let's assume a school stores the data of teachers and the subjects they teach. In this school, a teacher can teach more than one subject.
TEACHER table
TEACHER_ID   SUBJECT     TEACHER_AGE
25           Chemistry   30
25           Biology     30
47           English     35
83           Math        38
83           Computer    38
In this table, the candidate key is {TEACHER_ID, SUBJECT}, but the non-prime attribute TEACHER_AGE depends only on TEACHER_ID, a proper subset of the candidate key. This partial dependency violates 2NF.
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID   TEACHER_AGE
25           30
47           35
83           38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
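A minimal sketch of this 2NF decomposition as relational projections; the tuple layout is an assumed in-memory form of the TEACHER table:

```python
# TEACHER rows as (TEACHER_ID, SUBJECT, TEACHER_AGE) tuples.
teacher = [
    (25, "Chemistry", 30), (25, "Biology", 30),
    (47, "English", 35),
    (83, "Math", 38), (83, "Computer", 38),
]

# Project onto (TEACHER_ID, TEACHER_AGE), deduplicating: the age is
# stored once per teacher, removing the partial dependency.
teacher_detail = sorted({(tid, age) for tid, _, age in teacher})

# Project onto (TEACHER_ID, SUBJECT): the fully key-dependent part.
teacher_subject = [(tid, subj) for tid, subj, _ in teacher]

print(teacher_detail)   # [(25, 30), (47, 35), (83, 38)]
print(teacher_subject)
```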
Third Normal Form (3NF): a relation is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on the primary key.
Example: EMPLOYEE_DETAIL table
EMP_ID   EMP_NAME    EMP_ZIP   EMP_STATE   EMP_CITY
222      Harry       201010    UP          Noida
333      Stephan     02228     US          Boston
444      Lan         60007     US          Chicago
555      Katharine   06389     UK          Norwich
666      John        462007    MP          Bhopal
Non-prime attributes: in the given table, all attributes except EMP_ID are non-prime.
Here, EMP_STATE and EMP_CITY depend on EMP_ZIP, and EMP_ZIP depends on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) are therefore transitively dependent on the super key (EMP_ID). This violates the rule of third normal form.
That's why we need to move EMP_CITY and EMP_STATE to a new EMPLOYEE_ZIP table, with EMP_ZIP as its primary key.
EMPLOYEE table:
EMP_ID   EMP_NAME    EMP_ZIP
222      Harry       201010
333      Stephan     02228
444      Lan         60007
555      Katharine   06389
666      John        462007
EMPLOYEE_ZIP table:
EMP_ZIP   EMP_STATE   EMP_CITY
201010    UP          Noida
02228     US          Boston
60007     US          Chicago
06389     UK          Norwich
462007    MP          Bhopal
Boyce Codd Normal Form (BCNF): a table is in BCNF if, for every non-trivial functional dependency X → Y, X is a super key of the table.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
EMP_ID   EMP_COUNTRY   EMP_DEPT     DEPT_TYPE   EMP_DEPT_NO
264      India         Designing    D394        283
264      India         Testing      D394        300
364      UK            Stores       D283        232
364      UK            Developing   D283        549
The table is not in BCNF, because neither EMP_DEPT nor EMP_ID alone is a key.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID   EMP_COUNTRY
264      India
364      UK
EMP_DEPT table:
EMP_DEPT     DEPT_TYPE   EMP_DEPT_NO
Designing    D394        283
Testing      D394        300
Stores       D283        232
Developing   D283        549
EMP_DEPT_MAPPING table:
EMP_ID   EMP_DEPT
264      Designing
264      Testing
364      Stores
364      Developing
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now this is in BCNF, because the left side of each functional dependency is a key in its table.
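A quick sketch showing that the decomposition above is lossless-join: natural-joining the decomposed tables on EMP_ID reproduces the original rows. DEPT_TYPE and EMP_DEPT_NO are omitted here for brevity:

```python
# Original EMPLOYEE rows as (EMP_ID, EMP_COUNTRY, EMP_DEPT) tuples.
employee = {("264", "India", "Designing"), ("264", "India", "Testing"),
            ("364", "UK", "Stores"), ("364", "UK", "Developing")}

# Decomposed tables.
emp_country = {("264", "India"), ("364", "UK")}            # EMP_ID -> EMP_COUNTRY
emp_dept_map = {("264", "Designing"), ("264", "Testing"),  # (EMP_ID, EMP_DEPT)
                ("364", "Stores"), ("364", "Developing")}

# Natural join on EMP_ID.
joined = {(eid, country, dept)
          for eid, country in emp_country
          for eid2, dept in emp_dept_map if eid == eid2}

print(joined == employee)  # True -> no spurious tuples, the join is lossless
```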
• The purpose of physical database design is to translate the logical description of data into the technical specifications for storing and retrieving data for the DBMS.
• The goal is to create a design for storing data that will provide adequate performance and ensure database integrity, security and recoverability.
Some of the basic inputs required for physical database design are:
• Normalised relations
• Attribute definitions
• Data usage: entered, retrieved, deleted, updated
• Requirements for security, backup, recovery, retention, integrity
• DBMS characteristics
• Performance criteria, such as response-time requirements with respect to volume estimates.
The issues relating to the Design of the Physical Database Files
Physical File is a file as stored on the disk. The main issues relating to physical files are:
• Constructs to link two pieces of data:
• Sequential storage.
• Pointers.
• File Organisation: how are the files arranged on the disk?
• Access Method: how can the data be retrieved, based on the file organisation?
Mostly, databases are stored persistently on magnetic disks for the reasons given below:
• Databases can be very large and may not fit completely in main memory.
• Non-volatile storage allows the data to be stored permanently and accessed by users through front-end applications.
• Primary storage is very expensive, so secondary storage is used to cut the cost of storage per unit of data substantially.
We realise that each record of a table can contain a different amount of data. This is because in some records, some attribute values may be 'null', or some attributes may be of type varchar(), so each record may hold a different-length string as the value of such an attribute. Therefore, the record is stored with each attribute separated from the next by a special ASCII character called a field separator. Of course, in each block we may place many records. Each record is separated from the next, again, by another special ASCII character called the record separator.
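The separator scheme above can be sketched as follows; the particular control characters used (ASCII FS 0x1C and RS 0x1E) are an illustrative assumption, as real systems choose their own separators:

```python
FS = "\x1c"  # ASCII field separator: divides attributes within a record
RS = "\x1e"  # ASCII record separator: divides records within a block

records = [("14", "John", "UP"), ("20", "Harry", "Bihar")]

# Encode a block: fields joined with FS, records joined with RS.
# Variable-length values need no padding because the separators mark the ends.
block = RS.join(FS.join(fields) for fields in records)

# Decode the block back into records.
decoded = [tuple(rec.split(FS)) for rec in block.split(RS)]
print(decoded == records)  # True
```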
File Organization
Q6) What is a File?
A) A file is a collection of records related to each other. The file size is limited by the size of the memory and storage medium.
There are two important features of a file:
1. File Activity: file activity specifies the percentage of actual records processed in a single run.
2. File Volatility: file volatility addresses the properties of record changes. Highly volatile files are handled more efficiently by a disk-based design than by tape.
A) File Organization in DBMS: A database contains a huge amount of data, which is stored in physical memory in the form of files. A file is a set of multiple records stored in binary format.
Following are the various methods introduced to organize the files in a database management system:
Sequential File Organisation: it is a method in which the records are stored one after another on the disk. It is the simplest method of file organization.
This file organization can arrange the records in either descending or ascending order of a key column. When the files are sorted in a specific order, the binary search technique can be used to reduce the time spent searching a file.
Pile File Method: in this sequential file organization method, the records are entered in the same sequence in which they are inserted into the database tables. This method is very simple.
When a user inserts a new record, it is placed at the end of the file. If we delete or update a record, it is first searched for in the blocks of memory; once found, the record is marked for deletion, and the new block of records is written in its place.
Suppose four records are already stored in the sequence, and we want to insert a new record (R4); the R4 record is simply placed at the end of the sequence.
Sorted File Method: in this sequential organization method, the records are sorted on the primary key or another key attribute as they are entered into the database system.
Suppose five records are already stored in sorted order, and you want to enter a new record (R4) between the existing records: it is first placed at the end of the file, and the file is then re-sorted into the specified sequence.
• The sorted file method of sequential file organization is inefficient, because sorting the records takes extra space and time.
• It is a time-consuming process.
ISAM (Indexed Sequential Access Method) is an advanced sequential file organization. In this method, records are stored in the file using the primary key. In an indexed sequential access file, records are stored randomly on a direct-access device, such as a magnetic disk, and located by primary key: an index value is generated for each primary key and mapped to the record.
If a record has to be retrieved based on its index value, the address of its data block is fetched from the index and the record is retrieved from memory.
Pros of ISAM:
• Since each index entry holds the address of its record's data block, searching for a record even in a large database is quick and easy.
• Because the index is built on the primary key, range retrieval and partial retrieval of records is supported.
Cons of ISAM
• This method requires extra space in the disk to store the index value.
• When the new records are inserted, then these files have to be reconstructed to maintain the
sequence.
• When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
Heap file organization is the simplest and most basic type of file organization. The heap file is also sometimes called an unordered file. This type of organization works with blocks of data, and each new record is inserted in the last page of the file. This type of file organization does not require any sorting of the records.
If there is insufficient space in the last data block, then the new data block is added to the file. And,
then we can easily insert the record in that data block. This makes the insertion of records very
efficient.
As there is no particular ordering of the field values, a linear search must be performed to access records in the file. The linear search reads the blocks of the file one by one until the data is found.
In the heap file organization, each record has an ID which is unique, and every page or every data
block of the file is of the same size.
If we want to delete a record from the file, the required record is first accessed and marked for deletion, and then the block is written back to the disk. However, the space occupied by the deleted record is not reused again.
Suppose three records are already stored in the heap, and we want to insert a new record, Record2, into the heap.
If Data Block 2 is full, Record2 will be added to any one of the data blocks selected by the database system; let's say Data Block 1.
If we want to update, search for, or delete a record in the heap file, we have to read the file from the beginning until the required record is found. If the database contains a huge amount of data, these operations take a lot of time, because the records are not sorted or kept in any specified order.
• For small database systems, users can access records faster than with sequential file organization.
• It is a simple file organization method.
• It is the best method for loading a large amount of data into the database at one time.
• It is not efficient for large database systems, because operations on the data take more time with this method.
• The main disadvantage of this file organization is the problem of unused blocks of memory.
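The heap-file behaviour described above can be sketched as follows; the block size of 3 records and the record layout are illustrative assumptions:

```python
BLOCK_SIZE = 3   # records per data block (illustrative)
heap = [[]]      # the file: a list of blocks, each a list of records

def insert(record):
    """Insert at the end of the file; add a new block when the last is full."""
    if len(heap[-1]) >= BLOCK_SIZE:
        heap.append([])
    heap[-1].append(record)

def linear_search(key):
    """Scan every block from the start until the record is found."""
    for block in heap:
        for rec in block:
            if rec[0] == key:
                return rec
    return None

for i in range(5):
    insert((i, f"record-{i}"))

print(len(heap))         # 2 blocks: records 0-2 and records 3-4
print(linear_search(4))  # (4, 'record-4')
```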
In DBMS, hashing is a technique for locating the desired data on the disk directly, without using an index structure. Hashing is used to index and retrieve items in a database because it is faster to search for a specific item using a short hashed key than its original value.
Data is stored in data blocks (also known as data buckets) whose addresses are generated by applying a hash function.
This file organization uses the hash function to calculate the block addresses; the output of the hash function gives the disk location where the record is actually stored.
The field on which the hash function is computed is called the hash field, and when that field is also a key of the table it is called the hash key.
• Users can access the record at a fast speed because the address of the block is known by the
hash function.
• It is the best method for online transactions like ticket booking, online banking, etc.
• There is a greater chance of losing data. For example, in the employee table, if the hash field is Employee_Name and two employees share the name 'Areena', the same address is generated for both. In such a case, unless collisions are handled, the older record may be overwritten by the newer one.
• This method of file organization is not suitable when we are searching for a given range of data, because each record is stored at an effectively random address; range searches are therefore inefficient.
• If we search on columns that are not the hash field, the search cannot use the computed block addresses, so the records cannot be located directly.
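A minimal sketch of hashed file organisation; the modulo hash function and the bucket count are illustrative assumptions:

```python
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]  # data buckets (blocks)

def bucket_address(key):
    """The hash function maps a key directly to a block address."""
    return hash(key) % NUM_BUCKETS

def store(key, record):
    buckets[bucket_address(key)].append((key, record))

def fetch(key):
    """Only the single bucket computed from the key is searched."""
    for k, rec in buckets[bucket_address(key)]:
        if k == key:
            return rec
    return None

store(101, "Areena")
store(102, "Harry")
print(fetch(101))  # Areena
print(fetch(999))  # None -> no such key stored
```

Keeping each bucket as a list of (key, record) pairs is one simple way to handle two keys hashing to the same address, which sidesteps the overwrite problem noted above.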
Video : https://www.youtube.com/watch?v=krrSzX7q30c
A) Indexing is a data structure technique which allows you to quickly retrieve records from a database file. An index is a small table having only two columns. The first column comprises a copy of the primary or candidate key of the table. Its second column contains a set of pointers holding the addresses of the disk blocks where that specific key value is stored.
An index has two columns:
o The first column is the search key, which contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed easily.
o The second column is the data reference, which contains a set of pointers holding the addresses of the disk blocks where the value of that particular key can be found.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
• Primary Index − Primary index is defined on an ordered data file. The data file is ordered on a primary-key field. The key field is generally the primary key of the relation.
• Dense Index
• Sparse Index
Dense Index
In a dense index, there is an index record for every search key value in the database. This makes searching faster but requires more space to store the index records themselves. Each index record contains a search key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here contains a search key and an actual pointer to the data on the disk. To search for a record, we first follow the index record to reach an actual location of the data. If the data we are looking for is not where we land by following the index, the system starts a sequential search from there until the desired data is found.
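The sparse-index lookup can be sketched as follows; the block contents and keys are illustrative:

```python
import bisect

# Data file: sorted records grouped into fixed blocks.
blocks = [[(1, "a"), (3, "b")], [(5, "c"), (7, "d")], [(9, "e"), (11, "f")]]

# Sparse index: one entry per block, holding the block's first search key.
index_keys = [blk[0][0] for blk in blocks]   # [1, 5, 9]

def lookup(key):
    """Binary-search the index, jump to the block, then scan it sequentially."""
    i = bisect.bisect_right(index_keys, key) - 1  # block whose range covers key
    if i < 0:
        return None
    for k, value in blocks[i]:                    # sequential scan inside the block
        if k == key:
            return value
    return None

print(lookup(7))   # 'd'
print(lookup(8))   # None -> key absent
```

A dense index would instead keep one entry per record; the sparse variant trades a short in-block scan for a much smaller index.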
• Secondary Index − A secondary index may be generated from a field which is a candidate key and has a unique value in every record, or from a non-key field with duplicate values.
• Clustering Index − Clustering index is defined on an ordered data file. The data file is
ordered on a non-key field. In some cases, the index is created on non-primary key
columns which may not be unique for each record. In such cases, in order to identify
the records faster, we will group two or more columns together to get the unique values
and create index out of them. This method is known as the clustering index.
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored on the disk
along with the actual database files. As the size of the database grows, so does the size of the
indices. There is an immense need to keep the index records in the main memory so as to speed
up the search operations. If single-level index is used, then a large size index cannot be kept in
memory which leads to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices in order to make
the outermost level so small that it can be saved in a single disk block, which can easily be
accommodated anywhere in the main memory.
https://www.youtube.com/watch?v=c3CrNZaReNM
https://www.youtube.com/watch?v=KnXohGgIpQU
BTREE :
A B-tree index stands for "balanced tree" and is a type of index that can be created in relational databases.
A B-tree index works by creating a series of nodes in a hierarchy. It is often compared to a tree, which has a root, several branches, and many leaves.
To find a record, the search starts at the root node: at each level, the lookup finds the key range the search value falls into and steps down to the next level. This is repeated a few times until the correct row is found.
B + TREE:
A B+ tree is a balanced search tree that follows a multi-level index format. The leaf nodes of a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same height, and is thus balanced. Additionally, the leaf nodes are linked in a linked list; therefore, a B+ tree can support both random access and sequential access.
Structure of B+ Tree
Every leaf node is at an equal distance from the root node. A B+ tree is said to be of order n, where n is fixed for every B+ tree.
Internal nodes −
• Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
• At most, an internal node can contain n pointers.
Leaf nodes −
• Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
• At most, a leaf node can contain n record pointers and n key values.
• Every leaf node contains one block pointer P to point to next leaf node and forms a linked
list.
• B+ trees are filled from bottom and each entry is done at the leaf node.
• If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o ith key is duplicated at the parent of the leaf.
• If a non-leaf node overflows −
o Split node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o Rest of the entries are moved to a new node.
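The leaf-split rule above can be sketched as a small function; the order-3 example keys are illustrative:

```python
def split_leaf(keys, m):
    """Split an overflowing sorted leaf of a B+ tree of order m."""
    i = (m + 1) // 2         # partition point i = floor((m + 1) / 2)
    left = keys[:i]          # first i entries stay in the old leaf
    right = keys[i:]         # entries from i+1 onwards move to a new leaf
    copied_up = keys[i - 1]  # the i-th key is duplicated at the parent
    return left, right, copied_up

# An order-3 leaf holding [5, 10, 15] overflows when 20 is inserted:
left, right, up = split_leaf([5, 10, 15, 20], 3)
print(left, right, up)   # [5, 10] [15, 20] 10
```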
B+ Tree Deletion
Deletion is the reverse of insertion: the entry is removed from its leaf node, and if the node then underflows (holds fewer than ⌈n/2⌉ entries), entries are either borrowed from a neighbouring sibling or the node is merged with a sibling, with the keys in the parent adjusted accordingly.
In inverted file organisation, a linkage is provided between an index and the file of data records. A
key’s inverted index contains all of the values that the key presently has in the records of the data
file. Each key-value entry in the inverted index points to all of the data records that have the
corresponding value. Inverted files represent one extreme of file organisation in which only the
index structures are important. The records themselves may be stored in any way (sequentially
ordered by primary key, random, linked ordered by primary key etc.).
Example: an inverted index for a sequential data file sorted on primary key A/C No. Each index entry lists the record numbers holding that key value:
Key Value   Record Numbers
01          1, 4, 5
02          2, 6
03          3
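Building such an inverted index can be sketched as follows, using (record number, key value) pairs matching the example index above:

```python
# (record number, key value) pairs read from the sequential data file.
data_file = [(1, "01"), (2, "02"), (3, "03"), (4, "01"), (5, "01"), (6, "02")]

# Invert: map every key value to the record numbers that hold it.
inverted = {}
for rec_no, key in data_file:
    inverted.setdefault(key, []).append(rec_no)

print(inverted)  # {'01': [1, 4, 5], '02': [2, 6], '03': [3]}
```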
Example:-
Given a relation SKILL with the attributes Worker_ID, Skill_Type and Bonus_Rate, where Worker_ID is the primary key, the functional dependency is Worker_ID → {Skill_Type, Bonus_Rate}.
Skill
Worker_ID   Skill_Type   Bonus_Rate
According to the rule of full functional dependency, every non-key attribute must be fully dependent on the key attribute.
But the functional dependency
Skill_Type → Bonus_Rate is also possible,
which violates this rule, because Bonus_Rate then depends on the key only transitively.
To resolve the situation, we need to decompose the SKILL relation (table).
Worker
Worker_ID   Skill_Type
Skill
Skill_Type   Bonus_Rate