
UNIT – I

Introduction to Database Management Systems

OVERVIEW:

Unit – 1 provides a general overview of the nature and purpose of database systems. It also
explains how the concept of a database system has developed, what the common features of
database systems are, what a database system does for the user, and how a database system
interfaces with operating systems. This unit is motivational, historical, and explanatory in nature.

The unit also gives details of the entity-relationship (E-R) model, which provides a high-level view of the data. To design a database we need to follow a proper approach, and that approach is called a data model. We will see how to use the E-R model to design a database.

CONTENTS:
Introduction to database systems
File systems Vs. DBMS
Various data models
Levels of abstraction
Database languages
Structure of DBMS
1.1 Database Management System (DBMS) and Its Applications:
A Database Management System is a computerized record-keeping system. It is a repository, or container, for a collection of computerized data files. The overall purpose of a DBMS is to allow users to define, store, retrieve, and update the information contained in the database on demand. Information can be anything that is of significance to an individual or organization.

Databases touch all aspects of our lives. Some of the major areas of application are as
follows:
1. Banking
2. Airlines
3. Universities
4. Manufacturing and selling
5. Human resources

DBMS is software which is used to manage the collection of interrelated data.


1.2 Purpose of DBMS systems:
File systems Vs DBMS: The typical file processing system is supported by the operating system. Files are created and manipulated by writing programs, so the permanent records are stored in various files. Before the advent of DBMS, organizations typically stored information using such systems.

Ex: Using COBOL we can maintain several files (collections of records). To access those files we have to go through the application programs that were written for creating the files, updating files, and inserting records.

The problems in a file processing system are:

Data redundancy and inconsistency
Difficulty in accessing data
Data isolation
Integrity problems
Atomicity problems
Security problems

DBMS was invented to solve the above problems.

1.3 View of data:


The main purpose of a DBMS is to provide users with an abstract view of the data.
Data abstraction is provided at three levels:
Physical level: how the data are actually stored, that is, what data structures are used to store the data on the hard disk.
Ex: sequential or tree-structured storage
Logical level: what data are stored in the database.
View level: describes only part of the database, hiding the rest from a given group of users.
Ex: only the required records of a table.
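As a small sketch of how the logical and view levels show up in SQL (the student table and its columns are illustrative assumptions, not from any particular system):

-- Logical level: the table declares what data are stored.
CREATE TABLE student (roll_number CHAR(10) PRIMARY KEY,
                      name        VARCHAR(30),
                      gpa         REAL);

-- View level: a view exposes only part of the database;
-- users of this view never see the gpa column.
CREATE VIEW student_names AS
    SELECT roll_number, name
    FROM student;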

Instance: The collection of information stored in the database at a particular moment is called an instance of the database.

Schema: The database schema is the skeleton structure of the database and represents its logical view. It describes how the data is organized and how the relations among the data are associated.

Data independence:
The ability to modify a schema definition at one level without affecting the schema definition at the next higher level is called data independence. It comes in two forms: physical data independence and logical data independence.
Data models:

Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, and data semantics.
There are three types of data models:

Object-based logical models:

These are used to describe data at the logical and view levels. They are divided into several types:

Entity-relationship model
Object-oriented model
Semantic data model
Functional data model

Record-based logical models:

In contrast to object-based models, these are used both to specify the overall logical structure of the database and to provide a higher-level description of the implementation.

Relational model
Network model
Hierarchical model

Physical data models: These describe data at the lowest level, capturing how data is physically stored.

1.4 Database Languages:

A data sublanguage mainly has two parts: the Data Definition Language (DDL) and the Data Manipulation Language (DML).
The Data Definition Language is used for specifying the database schema, and the Data Manipulation Language is used for both reading and updating the database. These languages are called data sub-languages because they do not include constructs for all computational requirements.

Computation purposes include conditional or iterative statements, which are supported by high-level programming languages. Many DBMSs allow the sublanguage to be embedded in a high-level programming language such as Fortran, C, C++, Java, or Visual Basic. Here the high-level language is sometimes referred to as the host language, as it acts as a host for the sublanguage. To compile the embedded file, the commands in the data sub-language are first detached from the host-language program and substituted by function calls. The pre-processed file is then compiled and placed in an object module, which gets linked with a DBMS-specific library containing the replaced functions, and executed as required. Most data sub-languages also supply non-embedded or interactive commands that can be entered directly from a terminal.

Data Definition Language:

Data Definition Language (DDL) statements are used to define the database structure or schema. It is a type of language that allows the DBA or user to describe and name the entities, attributes, and relationships that are required for the application, along with any associated integrity and security constraints. Here is the list of tasks that come under DDL:

 CREATE – used to create objects in the database

 ALTER – used to alter the structure of the database
 DROP – used to delete objects from the database
 TRUNCATE – used to remove all records from a table, releasing all space allocated for those records
 COMMENT – used to add comments to the data dictionary

 RENAME – used to rename an object
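A brief sketch of these DDL statements in use (Oracle-style syntax, since the list above includes COMMENT and RENAME; the employee table and its columns are illustrative assumptions, and exact syntax varies by DBMS):

-- CREATE: make a new table
CREATE TABLE employee (eid INTEGER, ename VARCHAR2(20));
-- ALTER: change the structure of an existing table
ALTER TABLE employee ADD (salary NUMBER);
-- COMMENT: attach a note to the data dictionary
COMMENT ON TABLE employee IS 'staff master data';
-- RENAME: give the object a new name
RENAME employee TO staff;
-- TRUNCATE: remove all rows and release the space allocated for them
TRUNCATE TABLE staff;
-- DROP: delete the object itself
DROP TABLE staff;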

Data Manipulation Language:

A language that offers a set of operations to support the fundamental data manipulation operations on the data held in the database. Data Manipulation Language (DML) statements are used to manage data within schema objects. Here is the list of tasks that come under DML:

 SELECT – retrieves data from a database

 INSERT – inserts data into a table
 UPDATE – updates existing data within a table
 DELETE – deletes records from a table; the space allocated for the records remains
 MERGE – UPSERT operation (insert or update)
 CALL – calls a PL/SQL or Java subprogram
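A minimal sketch of the most common DML statements, continuing the illustrative employee table assumed above:

-- INSERT: add a new row
INSERT INTO employee (eid, ename, salary) VALUES (101, 'Smith', 2000);
-- UPDATE: change existing rows
UPDATE employee SET salary = salary + 500 WHERE eid = 101;
-- SELECT: retrieve rows
SELECT ename, salary FROM employee WHERE salary > 1000;
-- DELETE: remove rows (the space allocated for them remains)
DELETE FROM employee WHERE eid = 101;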

Data Control Language:
There are two further forms of database sub-language. The Data Control Language (DCL) is used to control privileges in the database. To perform any operation in the database, such as creating tables, sequences, or views, we need privileges. Privileges are of two types:

 System – creating a session, a table, etc. are all types of system privilege.
 Object – any command or query that works on tables comes under object privilege. DCL is
used to define two commands. These are:
 GRANT – gives a user access privileges to the database.
 REVOKE – takes back permissions from a user.
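A short sketch of the two DCL commands (the clerk user and the employee table are assumed for illustration):

-- Object privileges: allow clerk to read and insert into a table
GRANT SELECT, INSERT ON employee TO clerk;
-- A system privilege (Oracle syntax)
GRANT CREATE SESSION TO clerk;
-- Take a privilege back
REVOKE INSERT ON employee FROM clerk;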

Transaction Control Language (TCL):


Transaction Control statements are used to manage the changes made by DML statements. They allow statements to be grouped together into logical transactions.

 COMMIT – saves the work done

 SAVEPOINT – identifies a point in a transaction to which you can later roll back
 ROLLBACK – restores the database to its state as of the last COMMIT
 SET TRANSACTION – changes transaction options, such as the isolation level and which rollback segment to use
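A sketch of these TCL statements in use (the employee table and the values are assumed for illustration):

UPDATE employee SET salary = salary * 1.10 WHERE eid = 101;
SAVEPOINT before_bonus;                 -- a point we can later roll back to
UPDATE employee SET salary = salary + 9999 WHERE eid = 101;
ROLLBACK TO SAVEPOINT before_bonus;     -- undoes only the second UPDATE
COMMIT;                                 -- makes the 10% raise permanent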

1.5 Relational Databases:


A relational database contains two or more tables that are related to each other in some way. For example, a database might contain a Customers table and an Invoices table that records those customers' orders.
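A minimal sketch of such a pair of related tables (the column names are assumed for illustration):

CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY,
                        name        VARCHAR(30));

CREATE TABLE Invoices  (invoice_id  INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES Customers,
                        amount      REAL);

-- The shared customer_id column is what relates the two tables.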

1.6 Database Design:

Database design is the process of producing a detailed data model of a database. This data model contains all the logical and physical design choices and physical storage parameters needed to generate a design in a data definition language, which can then be used to create the database.

1.7 DATA STORAGE & QUERYING

The physical storage can be classified into different types.


(1) Cache: Cache is the quickest but most expensive storage. It is usually managed by the system hardware, so there is no need to manage it within the database system.

(2) Main memory: Main memory is used to store the data being worked on, and all operations are done in main memory. It is usually too small, and too expensive, to hold a whole database. In addition, in case of a system crash or power failure, the contents of main memory are lost.

(3) Flash memory: Unlike main memory, the data in flash memory remains after a power failure. Though reading data from flash memory is as fast as from main memory, writing data to it is complex: the data must be erased before writing, and the number of times flash memory can be erased is limited.

(4) Magnetic-disk storage: This is the main medium for storing data over the long term. Usually the whole database is stored on magnetic disk. Data must be read into main memory to be operated on, and the results must be written back to disk.

DATA QUERYING:

Queries are the primary mechanism for retrieving information from a database and consist of
questions presented to the database in a predefined format. Many database management systems
use the Structured Query Language (SQL) standard query format.

 Choosing parameters from a menu: In this method, the database system presents a list
of parameters from which you can choose. This is perhaps the easiest way to pose a query
because the menus guide you, but it is also the least flexible.

 Query by example (QBE): In this method, the system presents a blank record and lets
you specify the fields and values that define the query.

 Query language: Many database systems require you to make requests for information in
the form of a stylized query that must be written in a special query language. This is the
most complex method because it forces you to learn a specialized language, but it is also
the most powerful.
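For the query-language method, a small SQL example (the employee table is an assumed illustration):

SELECT ename
FROM employee
WHERE salary > 1000;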

1.8 TRANSACTION MANAGEMENT:

A transaction is a very small unit of a program, and it may contain several low-level tasks.
A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability.
ACID Properties

A transaction in a database system must maintain certain properties in order to ensure the accuracy of its completeness and data integrity. These properties are referred to as the ACID properties and are described below:

 Atomicity: Though a transaction involves several low-level operations, this property
states that a transaction must be treated as an atomic unit: either all of its operations are
executed or none is. There must be no state in which the database is left with a transaction
partially completed. The database state should be defined either as before the execution of
the transaction or as after the execution/abortion/failure of the transaction.

 Consistency: This property states that after the transaction is finished, the database must
remain in a consistent state. There must be no possibility that some data is incorrectly
affected by the execution of the transaction. If the database was in a consistent state before
the execution of the transaction, it must remain consistent after the execution of the
transaction.

 Durability: This property states that in any case all updates made on the database will
persist, even if the system fails and restarts. If a transaction writes or updates some data in
the database and commits, that data will always be there in the database. If the transaction
commits but the data is not yet written to disk when the system fails, that data will be
updated once the system comes back up.

 Isolation: In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that each transaction will be
carried out and executed as if it were the only transaction in the system. No transaction
will affect the existence of any other transaction.
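As an illustration of atomicity, here is a classic funds-transfer transaction sketched in SQL (the account table, its columns, and the account numbers are assumptions made for this example):

-- Transfer 100 from account A-101 to account A-102.
-- Either both updates take effect, or neither does.
UPDATE account SET balance = balance - 100 WHERE acct_no = 'A-101';
UPDATE account SET balance = balance + 100 WHERE acct_no = 'A-102';
COMMIT;
-- If the system fails before COMMIT, the DBMS rolls both updates back,
-- so no money is lost or created and consistency is preserved.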

1.9 Structure of a DBMS:

Figure 1.3 shows the structure of a typical DBMS.

1.10 DATA MINING AND INFORMATION RETRIEVAL:

Information Retrieval - the ability to query a computer system to return relevant results.
The most widely used example is the Google web search engine.

Data Mining - the ability to retrieve information from one or more data sources in order to
combine it, cluster it, visualize it and discover patterns in the data.

Big Data - the ability to manipulate huge volumes of data (that far exceed the capacity of a
single machine) in order to perform data mining techniques on that data.

Text/data mining currently involves analyzing a large collection of often unrelated digital items in a systematic way to discover previously unknown facts, which might take the form of relationships or patterns that are buried deep in an extensive collection. These relationships
would be extremely difficult, if not impossible, to discover using traditional manual-based search
and browse techniques. Both text and data mining build on the corpus of past publications and
build not so much on the shoulders of giants as on the breadth of past published knowledge and
accumulated mass wisdom.

1.11 Specialty Databases:

These databases are special in every sense of the term. This sort of database contains formatted information NOT READILY AVAILABLE ANYWHERE ELSE. These databases are excellent for telemarketing, direct mail marketing, email marketing, and fax marketing.
Some databases in this category may or may not contain email addresses. Therefore, if you are specifically looking for email marketing lists in this category, please read through the information provided on the product page carefully before purchasing.

RDBMS vs. ORDBMS:

RDBMS is known as a Relational Database Management System; ORDBMS is known as an Object-Relational Database Management System.
RDBMS is based on the Relational Data Model; ORDBMS is based on the Object Data Model (ODM).
RDBMS is the dominant model; ORDBMS is gaining popularity.
RDBMS supports a small, fixed collection of data types (e.g. integers, dates, strings), which has proven adequate for traditional application domains such as administrative data processing; ORDBMS builds on object-oriented and relational database systems and is aimed at application domains where complex objects play a central role.

1.12 Database users and Administrators:

Database users are the ones who actually use and take the benefits of the database. There are different types of users depending on their needs and the way they access the database.

Database Users:

Application Programmers - They are the developers who interact with the database by means of DML queries. These DML queries are written in application programs in languages like C, C++, Java, or Pascal. The queries are converted into object code to communicate with the database. For example, writing a C program to generate a report of the employees working in a particular department will involve a query to fetch the data from the database; it will include an embedded SQL query in the C program.

Sophisticated Users - They are database developers who write SQL queries to select/insert/delete/update data. They do not use application programs to access the database; they interact with it directly by means of a query language like SQL. These users may be scientists, engineers, or analysts who thoroughly study SQL and DBMS to apply the concepts to their requirements. In short, this category includes designers and developers of DBMS and SQL.

Specialized Users - These are also sophisticated users, but they write special database application programs. They are developers who build complex programs tailored to their requirements.

Stand-alone Users - These users have a stand-alone database for their personal use. Such databases usually come as ready-made packages with menus and graphical interfaces.

Naive Users - These are users who use existing applications to interact with the database. For example, online library systems, ticket booking systems, ATMs, etc. have existing applications, and users use them to interact with the database to fulfill their requests.

Database Administrators:

The life cycle of a database runs from design through implementation to administration. A database for any kind of requirement needs to be designed carefully so that it works without issues. Once the design is complete, it needs to be installed; after that step, users start using the database. The database grows as the data in it grows. When the database becomes huge, its performance degrades, and accessing the data becomes a challenge; unused space accumulates, making the database unnecessarily large. This administration and maintenance of the database is taken care of by the Database Administrator (DBA).

The DBA has many responsibilities; a well-performing database is in the hands of the DBA.

Installing and upgrading the DBMS servers: - The DBA is responsible for installing a new DBMS server for new projects. He is also responsible for upgrading these servers as new versions come to market or as requirements change. If an upgrade of an existing server fails, he should be able to revert the changes back to the older version, keeping the DBMS working. He is also responsible for applying service packs, hot fixes, and patches to the DBMS servers.

Design and implementation: - Designing the database and implementing it is also the DBA's responsibility. He should be able to decide on proper memory management, file organization, error handling, log maintenance, etc. for the database.

Performance tuning: - Since the database is huge and has lots of tables, data, constraints, and indexes, performance varies from time to time. Also, because of design issues or data growth, the database may not work as expected. It is the responsibility of the DBA to tune the database's performance, making sure that queries and programs run in fractions of a second.

Migrating database servers: - Sometimes users of Oracle would like to shift to SQL Server or Netezza. It is the responsibility of the DBA to make sure the migration happens without any failure, and without data loss.

Backup and Recovery: - Proper backup and recovery programs need to be developed and maintained by the DBA. This is one of the main responsibilities of the DBA. Data and objects should be backed up regularly so that if there is a crash, they can be recovered without much effort or data loss.

Security: - DBA is responsible for creating various database users and roles, and giving them
different levels of access rights.

Documentation: - The DBA should properly document all his activities, so that if he quits or a new DBA comes in, the new DBA can understand the database without much effort. He should maintain records of all his installation, backup, recovery, and security procedures, and keep regular reports on database performance.

1.13 A Brief History of Database Management Systems:

A Database Management System allows a person to organize, store, and retrieve data from a
computer. It is a way of communicating with a computer’s “stored memory.” In the very early
years of computers, “punch cards” were used for input, output, and data storage. Punch cards
offered a fast way to enter data, and to retrieve it. Herman Hollerith is given credit for adapting
the punch cards used for weaving looms to act as the memory for a mechanical tabulating
machine, in 1890. Much later, databases came along.

Databases (or DBs) have played a very important part in the recent evolution of computers. The
first computer programs were developed in the early 1950s, and focused almost completely on
coding languages and algorithms. At the time, computers were basically giant calculators and
data (names, phone numbers) was considered the leftovers of processing information. Computers
were just starting to become commercially available, and when business people started using
them for real-world purposes, this leftover data suddenly became important.

Enter the Database Management System (DBMS). A database, as a collection of information, can be organized so a Database Management System can access and pull specific information.
In the early 1960s, Charles W. Bachman designed the Integrated Data Store (IDS), the “first” DBMS. IBM, not wanting to be left out, created a database system of its own, known as IMS. Both database systems are described as the forerunners of navigational databases.

By the mid-1960s, as computers developed speed and flexibility and started becoming popular, many kinds of general-use database systems became available. As a result, customers demanded that a standard be developed, in turn leading to Bachman forming the Database Task Group. This group, working within the body responsible for the Common Business Oriented Language (COBOL), took charge of the design and standardization of database facilities. The Database Task Group presented this standard in 1971, which also came to be known as the “CODASYL approach.”

The CODASYL approach was a very complicated system and required substantial training. It
depended on a “manual” navigation technique using a linked data set, which formed a large
network. Searching for records could be accomplished by one of three techniques:

Using the primary key (also known as the CALC key)


Moving through relationships (also called sets) from one record to another
Scanning all records in sequential order
Eventually, the CODASYL approach lost its popularity as simpler, easier-to-work-with systems
came on the market.
Edgar Codd worked for IBM in the development of hard disk systems, and he was not happy with the lack of a search engine in the CODASYL approach and the IMS model. In 1970 he wrote a series of papers outlining novel ways to construct databases. His ideas eventually evolved into a paper titled A Relational Model of Data for Large Shared Data Banks, which described a new method for storing data and processing large databases. Records would not be stored in a free-form list of linked records, as in the CODASYL navigational model, but instead in a “table with fixed-length records.”
IBM had invested heavily in the IMS model, and wasn’t terribly interested in Codd’s ideas.
Fortunately, some people who didn’t work for IBM “were” interested. In 1973, Michael
Stonebraker and Eugene Wong (both then at UC Berkeley) made the decision to research
relational database systems. The project was called INGRES (Interactive Graphics and Retrieval
System), and successfully demonstrated a relational model could be efficient and practical.
INGRES worked with a query language known as QUEL, in turn pressuring IBM to develop SQL in 1974, which was more advanced (SQL became ANSI and ISO standards in 1986 and 1987). SQL quickly replaced QUEL as the more functional query language.
RDBM Systems were an efficient way to store and process structured data. Then processing speeds got faster, and “unstructured” data (art, photographs, music, etc.) became much more commonplace. Unstructured data is both non-relational and schema-less, and Relational Database Management Systems simply were not designed to handle this kind of data.
NoSQL
NoSQL (“Not only” Structured Query Language) came about as a response to the Internet and the need for faster speed and for the processing of unstructured data. Generally speaking, NoSQL databases are preferable to relational databases in certain use cases because of their speed and flexibility. The NoSQL model is non-relational and uses a “distributed” database system. This non-relational system is fast, uses an ad-hoc method of organizing data, and processes high volumes of different kinds of data.
“Not only” does it handle structured and unstructured data, it can also process unstructured Big
Data, very quickly. The widespread use of NoSQL can be connected to the services offered by
Twitter, LinkedIn, Facebook, and Google. Each of these organizations store and process colossal
amounts of unstructured data. These are the advantages NoSQL has over SQL and RDBM
Systems:

Higher scalability
A distributed computing system
Lower costs
A flexible schema
Can process unstructured and semi-structured data
No complex relationships
Unfortunately, NoSQL does come with some problems. Some NoSQL databases can be quite
resource intensive, demanding high RAM and CPU allocations. It can also be difficult to find
tech support if your open source NoSQL system goes down.
NoSQL Data Distribution
Hardware can fail, but NoSQL databases are designed with a distribution architecture that
includes redundant backup storage of both data and function. It does this by using multiple nodes
(database servers). If one or more of the nodes goes down, the other nodes can continue with normal operations and suffer no data loss.
high performance at an extremely large scale, and never shut down. In general, there are four
kinds of NoSQL databases, with each having specific qualities and characteristics.

15
KITSW
Document Stores
A Document Store (often called a document-oriented database) manages, stores, and retrieves
semi-structured data (also known as document-oriented information). Documents can be
described as independent units that improve performance and make it easier to spread data across
a number of servers. Document Stores typically come with a powerful query engine and indexing
controls that make queries fast and easy. Examples of Document Stores are MongoDB and Amazon DynamoDB.
Document-oriented databases store all information for a given “object” within the database, and
each object in storage can be quite different from the others. This makes it easier for mapping
objects to the database and makes document storage for web programming applications very
attractive. (An “object” is a set of relationships. An article object could be related to a tag [an
object], a category [another object], or a comment [another object].)

Column Stores
A DBMS using columns is quite different from traditional relational database systems. It stores
data as portions of columns, instead of as rows. The change in focus, from row to a column, lets
column databases maximize their performance when large amounts of data are stored in a single
column. This strength can be extended to data warehouses and CRM applications. Examples of
column-style databases include Cloudera, Cassandra, and HBase (Hadoop based).
Key-value Stores
A key-value pair database is useful for shopping cart data or storing user profiles. All access to the database is done using a primary key. Typically, there is no fixed schema or data model, and the value can be an arbitrary, opaque lump of data. Key-value stores “are not” useful when there are complex relationships between data elements or when data needs to be queried by anything other than the primary key. Examples of key-value stores are: Riak, Berkeley DB, and Aerospike.
An element can be any single “named” unit of stored data that might, or might not, contain other data components.

Graph Data Stores


Location aware systems, routing and dispatch systems, and social networks are the primary users
of Graph Databases (also called Graph Data Stores). These databases are based on graph theory,
and work well with data that can be displayed as graphs. They provide a very functional,
cohesive picture of Big Data.

Graph databases differ from relational databases, and from other NoSQL databases, by storing data relationships as actual relationships. This type of storage for relationship data results in fewer disconnects between an evolving schema and the actual database. A graph database has interconnected elements, using an undetermined number of relationships between them. Examples of Graph Databases are: Neo4j, GraphBase, and Titan.
Polyglot Persistence
Polyglot Persistence is a spin-off of “polyglot programming,” a concept developed in 2006 by
Neal Ford. The original idea promoted applications be written using a mix of languages, with the
understanding that a specific language may solve a certain kind of problem easily, while another
language would have difficulties. Different languages are suitable for tackling different
problems.

Many NoSQL systems run on nodes and large clusters. This allows for significant scalability and
redundant backups of data on each node. Using different technologies at each node supports a
philosophy of Polyglot Persistence. This means “storing” data on multiple technologies, with the understanding that certain technologies will solve one kind of problem easily, while others will not.
An application communicating with different database management technologies uses each for
the best fit in achieving the end goal.

1.14 Introduction to Database Design:

The database design process can be divided into six steps. The ER model is most relevant to the first three steps.

Requirements Analysis:

The very first step in designing a database application is to understand what data is to be stored in the database, what applications must be built on top of it, and what operations are most frequent and subject to performance requirements. In other words, we must find out what the users want from the database.

Conceptual Database Design:


The information gathered in the requirements analysis step is used to develop a high-level description of the data to be stored in the database, along with the constraints that are known to hold over this data. This step is often carried out using the ER model, or a similar high-level data model.

Logical Database Design:

We must choose a DBMS to implement our database design, and convert the conceptual
database design into a database schema in the data model of the chosen DBMS.

Schema Refinement:
The fourth step in database design is to analyse the collection of relations in our relational
database schema to identify potential problems, and to refine it. In contrast to the requirements
analysis and conceptual design steps, which are essentially subjective, schema refinement can be
guided by some elegant and powerful theory.

Physical Database Design:

In this step we must consider typical expected workloads that our database must support and
further refine the database design to ensure that it meets desired performance criteria. This step
may simply involve building indexes on some tables and clustering some tables, or it may
involve a substantial redesign of parts of the database schema obtained from the earlier design
steps.

Security Design:
In this step, we identify different user groups and different roles played by various users (e.g., the development team for a product, customer support representatives, the product manager). For each role and user group, we must identify the parts of the database that they must be able to access and the parts that they should not be allowed to access, and take steps to ensure that they can access only the appropriate parts.

ER (Entity – Relationship) model:

The entity relationship (E-R) data model is based on a perception of a real world that consists of
a set of basic objects called entities, and of relationships among these objects.

The appropriate mapping cardinality for a particular relationship set is obviously dependent on
the real world situation that is being modeled by the relationship set. The overall logical structure
of a database can be expressed graphically by an E-R diagram, which is built up from the
following components.

Rectangles, which represent entity sets

Ellipses, which represent attributes
Diamonds, which represent relationship sets
Lines, which link attributes to entity sets and entity sets to relationship sets
Double ellipses, which represent multivalued attributes
Double lines, which indicate total participation of an entity in a relationship set

Entity: An entity is a real-world object or concept which is distinguishable from other objects. It
may be something tangible, such as a particular student or building. It may also be somewhat
more conceptual, such as CS A-341, or an email address.

Attributes: These are used to describe a particular entity (e.g. name, SS#, height).

Domain: Each attribute comes from a specified domain (e.g., name may be a 20 character string;
SS# is a nine-digit integer)

Entity set: a collection of similar entities (i.e., those which are distinguished using the same set of attributes). As an example, I may be an entity, whereas Faculty might be an entity set to which I belong. Note that entity sets need not be disjoint: I may also be a member of Staff or of Softball Players.

Key: a minimal set of attributes for an entity set, such that each entity in the set can be uniquely
identified. In some cases, there may be a single attribute (such as SS#) which serves as a key, but
in some models you might need multiple attributes as a key ("Bob from Accounting"). There
may be several possible candidate keys. We will generally designate one such key as
the primary key.

ER diagrams:

It is often helpful to visualize an ER model via a diagram. There are many variant conventions
for such diagrams; we will adapt the one used in the text.

Diagram conventions

 An entity set is drawn as a rectangle.

 Attributes are drawn as ovals.

 Attributes which belong to the primary key are underlined.


Example:

ER Model
The entity-relationship model defines the conceptual view of a database. It works around real-world entities and the associations among them. At the view level, the ER model is considered a good choice for designing databases.
Entity

A real-world thing, either animate or inanimate, that can be easily identified and distinguished. For example, in a school database, students, teachers, classes, and courses offered can be considered entities. All entities have some attributes or properties that give them their identity.

An entity set is a collection of similar types of entities. An entity set may contain entities with attributes sharing similar values. For example, a Students set may contain all the students of a school; likewise, a Teachers set may contain all the teachers of a school from all faculties. Entity sets need not be disjoint.

Attributes

Entities are represented by means of their properties, called attributes. All attributes have values.
For example, a student entity may have name, class, age as attributes.

There exists a domain or range of values that can be assigned to attributes. For example, a
student's name cannot be a numeric value. It has to be alphabetic. A student's age cannot be
negative, etc.

Types of Attributes
Simple attribute
Simple attributes are atomic values, which cannot be divided further. For example, a student's phone number is an atomic value of 10 digits.

Composite attribute
Composite attributes are made of more than one simple attribute. For example, a student's
complete name may have first_name and last_name.

Derived attribute
Derived attributes are attributes that do not exist physically in the database; rather, their values are derived from other attributes present in the database. For example, average_salary in a department need not be saved in the database, since it can be derived. As another example, age can be derived from date_of_birth.
Single-valued attribute
Single-valued attributes contain a single value. For example: Social_Security_Number.

Multi-valued attribute
Multi-valued attributes may contain more than one value. For example, a person can have more than one phone number, email address, etc.

These attribute types can be combined, as in:

o simple single-valued attributes


o simple multi-valued attributes
o composite single-valued attributes
o composite multi-valued attributes

Entity-Sets & Keys


A key is an attribute or collection of attributes that uniquely identifies an entity within an entity set.
Example: the roll_number of a student makes him/her identifiable among students.

o Super Key: A set of attributes (one or more) that collectively identifies an entity in an
entity set.

o Candidate Key: A minimal super key is called a candidate key, that is, a super key for
which no proper subset is a super key. An entity set may have more than one candidate key.

o Primary Key: One of the candidate keys, chosen by the database designer to
uniquely identify the entity set.

Relationship

The association among entities is called a relationship. For example, an employee entity has the relation works_at with a department. Another example is a student who enrolls in some course. Here, works_at and Enrolls are called relationships.

Relationship Set

Relationships of a similar type are collected into a relationship set. Like entities, a relationship too can have attributes. These attributes are called descriptive attributes.
Degree of Relationship
The number of participating entities in a relationship defines the degree of the relationship.

o Binary = degree 2
o Ternary = degree 3
o n-ary = degree n

Mapping Cardinalities
Cardinality defines the number of entities in one entity set that can be associated with the number of entities of the other set via the relationship set.

o One-to-one: One entity from entity set A can be associated with at most one entity of
entity set B, and vice versa.

o One-to-many: One entity from entity set A can be associated with more than one entity
of entity set B, but an entity from entity set B can be associated with at most one entity of A.

o Many-to-one: More than one entity from entity set A can be associated with at most
one entity of entity set B, but an entity from entity set B can be associated with more
than one entity from entity set A.

o Many-to-many: One entity from A can be associated with more than one entity from B,
and vice versa.

Additional Features of ER Diagram:

Ternary Relationship Set: A relationship set need not be an association of precisely two entities; it can involve three or more when applicable. Here is another example from the text, in which a store has multiple locations.

Using several entities from same entity set

A relationship might associate several entities from the same underlying entity set, such
as in the following example, Reports_To. In this case, an additional role indicator (e.g.,
"supervisor") is used in the diagram to further distinguish the two similar entities.

Specifying additional constraints:

If you take a 'snapshot' of the relationship set at some instant in time, we call this an instance.

A (binary) relationship set can further be classified as either


o many-to-many
o one-to-many
o one-to-one
based on whether an individual entity from one of the underlying sets is allowed to be in more
than one such relationship at a time. The above figure contains a many-to-many relationship, as
departments may employ more than one person at a time, and an individual person may be
employed by more than one department.
Sometimes an additional constraint exists for a given relationship set: an entity from one of the associated sets may appear in at most one such relationship. For example, consider a relationship set "Manages" which associates departments with employees. If a department cannot have more than one manager, this is an example of a one-to-many relationship set (it may be that an individual manages multiple departments).

This type of constraint is called a key constraint. It is represented in the ER diagrams by
drawing an arrow from an entity set E to a relationship set R when each entity in an instance of E
appears in at most one relationship in (a corresponding instance of) R.

An instance of this relationship is given in Figure 2.7.

If both entity sets of a relationship set have key constraints, we would call this a "one-to-one"
relationship set. In general, note that key constraints can apply to relationships between more
than two entities, as in the following example.

An instance of this relationship:

Participation Constraints
Recall that a key constraint requires that each entity of a set participate in at most one relationship. Dual to this, we may ask whether each entity of a set is required to participate in at least one relationship.
If this is required, we call this a total participation constraint; otherwise the participation
is partial. In our ER diagrams, we will represent a total participation constraint by using
a thick line.

Weak Entities

There are times you might wish to define an entity set even though its attributes do not formally
contain a key (recall the definition for a key).
Usually, this is the case only because the information represented in such an entity set is only interesting when combined, through an identifying relationship set, with another entity set we call the identifying owner.
We will call such a set a weak entity set, and insist on the following:
 The weak entity set must exhibit a key constraint with respect to the identifying
relationship set.
 The weak entity set must have total participation in the identifying relationship set.

Together, this assures us that we can uniquely identify each entity from the weak set by
considering the primary key of its identifying owner together with a partial key from the weak
entity.
In our ER diagrams, we will represent a weak entity set by outlining the entity and the
identifying relationship set with dark lines. The required key constraint and total participation are
diagrammed with our existing conventions. We underline the partial key with a dotted line.

Class Hierarchies

As with object-oriented programming, it is often convenient to classify an entity set as a subclass of another. In this case, the child entity set inherits the attributes of the parent entity set. We will denote this scenario using an "ISA" triangle, as in the following ER diagram:

Furthermore, we can impose additional constraints on such subclassing. By default, we will assume that two subclasses of an entity set are disjoint. However, if we wish to allow an entity to lie in more than one such subclass, we will specify an overlap constraint (e.g. "Contract_Emps OVERLAPS Senior_Emps").

Dually, we can ask whether every entity in a superclass is required to lie in (at least) one subclass. By default we will assume not, but we can specify a covering constraint if desired (e.g. "Motorboats AND Cars COVER Motor_Vehicles").

Aggregation

Thus far, we have defined relationships to be associations between two or more entities. However, it sometimes seems desirable to define a new relationship which associates some entity with some other existing relationship. To do this, we will introduce a new feature to our model called aggregation. We identify an existing relationship set by enclosing it in a larger dashed box, and then we allow it to participate in another relationship set.

A motivating example follows:

Conceptual Design with the ER Model

It is most important to recognize that there is more than one way to model a given situation. Our
next goal is to start to compare the pros and cons of common choices.

Should a concept be modeled as an entity or an attribute?

Consider the scenario in which we want to add address information to the Employees entity set. We might choose to add a single attribute address to the entity set. Alternatively, we could introduce a new entity set, Addresses, and then a relationship associating employees with addresses. What are the pros and cons?

Adding a new entity set makes the model more complex, and should only be done when there is a need for the complexity. For example, if some employees have multiple addresses to be associated, then the more complex model is needed. Also, representing addresses as a separate entity allows a further breakdown, for example by zip code or city.

What if we wanted to modify the Works_In relationship to have both a start and an end date, rather than just a start date? We could add one new attribute for the end date; alternatively, we could create a new entity set Duration which represents intervals, and then the Works_In relationship can be made ternary (associating an employee, a department, and an interval). What are the pros and cons?

If the duration is described through descriptive attributes, only a single such duration can be
modeled. That is, we could not express an employment history involving someone who left the
department yet later returned.

Should a concept be modeled as an entity or a relationship?

Consider a situation in which a manager controls several departments. Let's presume that a
company budgets a certain amount (budget) for each department. Yet it also wants managers to
have access to some discretionary budget (dbudget). There are two corporate models. A
discretionary budget may be created for each individual department; alternatively, there may be a
discretionary budget for each manager, to be used as she desires.

Which scenario is represented by the following ER diagram? If you want the alternate
interpretation, how would you adjust the model?

Should we use binary or ternary relationships?


Consider the following ER diagram, representing insurance policies owned by employees at a company. Each employee can own several policies, each policy can be owned by several employees, and each dependent can be covered by several policies.

What if we wish to model the following additional requirements:


 A policy cannot be owned jointly by two or more employees.
 Every policy must be owned by some employee.

Dependents is a weak entity set, and each dependent entity is uniquely identified by
taking pname in conjunction with the policyid of a policy entity (which, intuitively, covers the
given dependent).
The best way to model this is to switch away from the ternary relationship set, and instead use
two distinct binary relationship sets.

Should we use aggregation?


Consider again the following ER diagram:

If we did not need the until or since attributes, we could model the identical setting using the following ternary relationship:

Let's compare these two models. What if we wanted to add an additional constraint to each: that each sponsorship (of a project by a department) be monitored by at most one employee? Can you add this constraint to either of the above models?

1.15 Introduction to the Relational Model

The main construct for representing data in the relational model is a relation. A relation
consists of a relation schema and a relation instance. The relation instance is a table, and the
relation schema describes the column heads for the table. We first describe the relation schema
and then the relation instance. The schema specifies the relation’s name, the name of each field
(or column, or attribute), and the domain of each field. A domain is referred to in a relation
schema by the domain name and has a set of associated values.

Eg:

Students(sid: string, name: string, login: string, age: integer, gpa: real)

This says, for instance, that the field named sid has a domain named string. The set of
values associated with domain string is the set of all character strings.

An instance of a relation is a set of tuples, also called records, in which each tuple has the
same number of fields as the relation schema. A relation instance can be thought of as a
table in which each tuple is a row, and all rows have the same number of fields.

A relation schema specifies the domain of each field or column in the relation instance. These
domain constraints in the schema specify an important condition that we want each instance of
the relation to satisfy: The values that appear in a column must be drawn from the domain
associated with that column. Thus, the domain of a field is essentially the type of that field, in
programming language terms, and restricts the values that can appear in the field.

Domain constraints are so fundamental in the relational model that we will henceforth consider
only relation instances that satisfy them; therefore, relation instance means relation instance that
satisfies the domain constraints in the relation schema.

The degree, also called arity, of a relation is the number of fields. The cardinality of a relation
instance is the number of tuples in it. In Figure 3.1, the degree of the relation (the number of
columns) is five, and the cardinality of this instance is six.

A relational database is a collection of relations with distinct relation names. The relational
database schema is the collection of schemas for the relations in the database.
Creating and Modifying Relations

The SQL-92 language standard uses the word table to denote relation, and we will often
follow this convention when discussing SQL. The subset of SQL that supports the creation,
deletion, and modification of tables is called the Data Definition Language (DDL).

To create the Students relation, we can use the following statement:

The CREATE TABLE statement is used to define a new table.

CREATE TABLE Students ( sid   CHAR(20),
                        name  CHAR(30),
                        login CHAR(20),
                        age   INTEGER,
                        gpa   REAL )
Tuples are inserted using the INSERT command. We can insert a single tuple into the Students table as follows:

INSERT INTO Students (sid, name, login, age, gpa)
VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2)
We can delete tuples using the DELETE command. We can delete all Students tuples with name equal to Smith using the command:

DELETE FROM Students S WHERE S.name = ‘Smith’
We can modify the column values in an existing row using the UPDATE command. For
example, we can increment the age and decrement the gpa of the student with sid 53688:

UPDATE Students S SET S.age = S.age + 1, S.gpa = S.gpa - 1 WHERE S.sid = 53688

1.16 Integrity Constraints over Relations:


An integrity constraint (IC) is a condition that is specified on a database schema and restricts the data that can be stored in an instance of the database. If a database instance satisfies all the integrity constraints specified on the database schema, it is a legal instance. A DBMS enforces integrity constraints, in that it permits only legal instances to be stored in the database.

Relational Model – Constraints

 Integrity Constraints: An integrity constraint (IC) is a condition specified on a


database schema and restricts the data that can be stored in an instance of the database. If a database instance satisfies all the integrity constraints specified on the database schema, it is a legal instance. A DBMS permits only legal instances to be stored in the database.
Many kinds of integrity constraints can be specified in the relational model:

 Domain Constraints: A relation schema specifies the domain of each field in the relation
instance. These domain constraints in the schema specify the condition that each
instance of the relation has to satisfy: the values that appear in a column must be drawn
from the domain associated with that column. Thus, the domain of a field is essentially
the type of that field.

Key Constraints
A Key Constraint is a statement that a certain minimal subset of the fields of a relation is a
unique identifier for a tuple.

 Super Key: An attribute, or set of attributes, that uniquely identifies a tuple within a
relation. However, a super key may contain additional attributes that are not necessary for
a unique identification.

Example: The customer_id of the relation customer is sufficient to distinguish one tuple
from another. Thus, customer_id is a super key. Similarly, the combination
of customer_id and customer_name is a super key for the relation customer. Here
customer_name alone is not a super key, because several people may have the same
name. We are often interested in super keys for which no proper subset is a super key.
Such minimal super keys are called candidate keys.

 Candidate Key: A super key such that no proper subset is a super key within the
relation. There are two parts to the candidate key definition:
o Two distinct tuples in a legal instance cannot have identical values in all the fields
of a key.
o No proper subset of the set of fields in a candidate key is a unique identifier for a
tuple. A relation may have several candidate keys.

Example: The combination of customer_name and customer_street is sufficient to distinguish the members of the customer relation. Then both {customer_id} and {customer_name, customer_street} are candidate keys.
Although customer_id and customer_name together can distinguish customer tuples, their combination does not form a candidate key, since customer_id alone is a candidate key.

 Primary Key: The candidate key that is selected to identify tuples uniquely within the
relation. Out of all the available candidate keys, a database designer chooses one as the
primary key. The candidate keys that are not selected as the primary key are called
alternate keys.

Features of the primary key:


o Primary key will not allow duplicate values.
o Primary key will not allow null values.
o Only one primary key is allowed per table.

Example: For the student relation, we can choose student_id as the primary key.
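In SQL, this choice might be declared as follows (a sketch; the column list is assumed for illustration):

CREATE TABLE student (student_id     CHAR(10) PRIMARY KEY,  -- no duplicates, no nulls
                      student_name   VARCHAR(30),
                      student_street VARCHAR(40));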

 Foreign Key: Foreign keys represent the relationships between tables. A foreign key is a
column (or a group of columns) whose values are derived from the primary key of some
other table. The table in which the foreign key is defined is called a foreign table or detail
table. The table that defines the primary key and is referenced by the foreign key is called
the primary table or master table.

Features of foreign key:


o Records cannot be inserted into a detail table if corresponding records in the
master table do not exist.
o Records of the master table cannot be deleted or updated if corresponding
records in the detail table exist.
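A sketch of a master/detail pair showing the behaviour described above (the dept and emp tables are assumed for illustration):

CREATE TABLE dept (dept_id INTEGER PRIMARY KEY,        -- master table
                   dname   VARCHAR(20));
CREATE TABLE emp  (eid     INTEGER PRIMARY KEY,
                   dept_id INTEGER REFERENCES dept);   -- detail table

INSERT INTO dept VALUES (10, 'Sales');
INSERT INTO emp  VALUES (1, 10);       -- accepted: dept 10 exists in the master table
INSERT INTO emp  VALUES (2, 99);       -- rejected: no dept 99 in the master table
DELETE FROM dept WHERE dept_id = 10;   -- rejected: a detail row still references it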

General Constraints
Domain, primary key, and foreign key constraints are considered to be a fundamental part of the
relational data model. Sometimes, however, it is necessary to specify more general constraints.

Example: we may require that student ages be within a certain range of values. Giving such an
IC, the DBMS rejects inserts and updates that violate the constraint.

Current database systems support such general constraints in the form of table constraints and assertions. Table constraints are associated with a single table and are checked whenever that table is modified. In contrast, assertions involve several tables and are checked whenever any of these tables is modified.

Example of a table constraint, which ensures that the salary of an employee is always above 1000:
CREATE TABLE employee (eid integer, ename varchar2(20), salary real,
CHECK (salary > 1000));

Example of an assertion, which enforces the constraint that the number of boats plus the number of sailors should be less than 100:
CREATE ASSERTION smallClub CHECK ((SELECT COUNT (S.sid) FROM Sailors S) +
(SELECT COUNT (B.bid) FROM Boats B) < 100);

Referential/Enforcing Integrity Constraints


This integrity constraint works on the concept of a foreign key. A key attribute of a relation can be referred to in another relation, where it is called a foreign key.

The referential integrity constraint states that if a relation refers to a key attribute of a different or the same relation, that key element must exist.

Specifying Key Constraints in SQL

CREATE TABLE Students ( sid   CHAR(20),
                        name  CHAR(30),
                        login CHAR(20),
                        age   INTEGER,
                        gpa   REAL,
                        UNIQUE (name, age),
                        CONSTRAINT StudentsKey PRIMARY KEY (sid) )
Foreign Key Constraints
Sometimes the information stored in a relation is linked to the information stored in another
relation. If one of the relations is modified, the other must be checked, and perhaps modified, to
keep the data consistent. An IC involving both relations must be specified if a DBMS is to make
such checks. The most common IC involving two relations is a foreign key constraint.

Suppose that in addition to Students, we have a second relation:

Enrolled(sid: string, cid: string, grade: string)
To ensure that only bonafide students can enroll in courses, any value that appears in the sid field
of an instance of the Enrolled relation should also appear in the sid field of some tuple in the

Students relation. The sid field of Enrolled is called a foreign key and refers to Students. The
foreign key in the referencing relation (Enrolled, in our example) must match the primary key of
the referenced relation (Students), i.e., it must have the same number of columns and compatible
data types, although the column names can be different.

Specifying Foreign Key Constraints in SQL

CREATE TABLE Enrolled ( sid CHAR(20), cid CHAR(20), grade CHAR(10), PRIMARY KEY
(sid, cid), FOREIGN KEY (sid) REFERENCES Students )

1.17 Enforcing Integrity Constraints:

Consider the instance S1 of Students shown in Figure 3.1. The following insertion violates the
primary key constraint because there is already a tuple with the sid 53688, and it will be rejected
by the DBMS:
INSERT INTO Students (sid, name, login, age, gpa) VALUES (53688, ‘Mike’, ‘mike@ee’, 17,
3.4)
The following insertion violates the constraint that the primary key cannot contain null:
INSERT INTO Students (sid, name, login, age, gpa) VALUES (null, ‘Mike’, ‘mike@ee’, 17,
3.4)
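
Violations of foreign key constraints are handled similarly: by default, an INSERT into Enrolled with a sid that does not appear in Students is rejected. SQL also lets the schema designer choose a different reaction through referential actions; a hedged sketch using the standard ON DELETE/ON UPDATE options (the table definitions follow the running example):

CREATE TABLE Enrolled ( sid CHAR(20), cid CHAR(20), grade CHAR(10),
    PRIMARY KEY (sid, cid),
    FOREIGN KEY (sid) REFERENCES Students
        ON DELETE CASCADE     -- deleting a student also deletes that student's enrollments
        ON UPDATE NO ACTION ) -- reject updates that would orphan Enrolled rows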

1.18 Querying Relational Data
A relational database query is a question about the data, and the answer consists of a new
relation containing the result. For example, we might want to find all students younger than 18 or
all students enrolled in Reggae203.
A query language is a specialized language for writing queries.
SQL is the most popular commercial query language for a relational DBMS. Consider the
instance of the Students relation shown in Figure 3.1. We can retrieve rows corresponding to
students who are younger than 18 with the following SQL query:

SELECT * FROM Students S WHERE S.age < 18

The symbol * means that we retain all fields of selected tuples in the result. The condition S.age < 18 in the WHERE clause specifies that we want to select only tuples in which the age field has a value less than 18.

Logical Data Base Design: ER to Relational


Conversion of ER Diagram to Relational Database
The ER model is intended as a description of real-world entities. Although it is constructed in such a way as to allow easy translation to the relational schema model, this is not an entirely trivial process. The ER diagram represents the conceptual level of database design, while the relational schema represents the logical level. We will follow these simple rules:

1. Entities and Simple Attributes:
An entity type within an ER diagram is turned into a table. You may preferably keep the same name for the entity or give it a sensible name, but avoid DBMS reserved words and special characters. Each attribute turns into a column (attribute) in the table. The key attribute of the entity is the primary key of the table, which is usually underlined. It can be composite if required but can never be null.

It is highly recommended that every table start with its primary key attribute, conventionally named TablenameID.

Taking the following simple ER diagram:

The initial relational schema is expressed in the following format, writing the table names with the attribute list inside parentheses, as shown below:
Persons( personid , name, lastname, email )
Persons and Phones are tables; name and lastname are table columns (attributes).

personid is the primary key for the table : Person


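As a small SQL sketch of this rule (the column types are assumptions for illustration, not part of the diagram):

CREATE TABLE Persons (
    personid INTEGER PRIMARY KEY,  -- key attribute of the entity
    name     VARCHAR(30),
    lastname VARCHAR(30),
    email    VARCHAR(50)
);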
2. Multi-Valued Attributes
A multi-valued attribute is usually represented with a double-line oval.

If you have a multi-valued attribute, take the attribute and turn it into a new entity or table of its own. Then make a 1:N relationship between the new entity and the existing one. In simple words:
1. Create a table for the attribute.
2. Add the primary (id) column of the parent entity as a foreign key within the new table, as shown below:

Persons( personid , name, lastname, email )


Phones ( phoneid , personid, phone )
personid within the table Phones is a foreign key referring to the personid of Persons

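A minimal SQL sketch of this mapping (column types assumed for illustration):

CREATE TABLE Phones (
    phoneid  INTEGER PRIMARY KEY,
    personid INTEGER REFERENCES Persons(personid),  -- parent key as foreign key
    phone    VARCHAR(20)
);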
3. 1:1 Relationships

To keep it simple, and for better performance at data retrieval, I would personally recommend using attributes to represent such a relationship. For instance, let us consider the case where the Person has, or optionally has, one wife. You can place the primary key of the wife within the Persons table, which in this case we call a foreign key, as shown below.

Persons( personid , name, lastname, email , wifeid )


Wife ( wifeid , name )
Or vice versa to put the personid as a foreign key within the Wife table as shown below:
Persons( personid , name, lastname, email )
Wife ( wifeid , name , personid)
For cases when the Person is not married, i.e. has no wifeid, the attribute can be set to NULL.
4. 1:N Relationships
This is the tricky part! For simplicity, use attributes in the same way as for a 1:1 relationship, but here we have only one choice as opposed to two. For instance, a Person can have from zero to many Houses, but a House can have only one Person. To represent such a relationship, the personid of the parent must be placed within the child table as a foreign key, but not the other way around, as shown next:

It should convert to :
Persons( personid , name, lastname, email )
House ( houseid , num , address, personid)
5. N:N Relationships
We normally use a separate table to express this type of relationship; the same holds for N-ary relationships in ER diagrams. For instance, a Person can live or work in many countries, and a country can have many people. To express this relationship within a relational schema we use a separate table, as shown below:

It should convert into :

Persons( personid , name, lastname, email )


Countries ( countryid , name, code)
HasRelat ( hasrelatid , personid , countryid)
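A hedged SQL sketch of the junction table (column types assumed):

CREATE TABLE HasRelat (
    hasrelatid INTEGER PRIMARY KEY,
    personid   INTEGER REFERENCES Persons(personid),    -- one side of the N:N
    countryid  INTEGER REFERENCES Countries(countryid)  -- other side of the N:N
);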
Relationship with attributes:
It is recommended to use a table to represent such relationships, to keep the design tidy and clean, regardless of the cardinality of the relationship.
Case Study
For the sake of simplicity, we will be producing the relational schema for the following ER
diagram:

The relational schema for the ER Diagram is given below as:

Company( CompanyID , name , address )


Staff( StaffID , dob , address , WifeID)
Child( ChildID , name , StaffID )

Wife ( WifeID , name )
Phone(PhoneID , phoneNumber , StaffID)
Task ( TaskID , description)
Work(WorkID , CompanyID , StaffID , since )
Perform(PerformID , StaffID , TaskID )

1.19 Introduction to Views


A view is a table whose rows are not explicitly stored in the database but are computed as needed
from a view definition. Consider the Students and Enrolled relations. Suppose that we are often
interested in finding the names and student identifiers of students who got a grade of B in some
course, together with the cid for the course. We can define a view for this purpose. Using SQL
notation:

CREATE VIEW B-Students (name, sid, course) AS SELECT S.sname, S.sid, E.cid FROM
Students S, Enrolled E WHERE S.sid = E.sid AND E.grade = ‘B’
This view can be used just like a base table, or explicitly stored table, in defining new queries or
views. Given the instances of Enrolled and Students shown in Figure 3.4, BStudents contains
the tuples shown in Figure 3.18.

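For instance, the view can be queried just like a base table; a small sketch (the course value is illustrative, and the view name follows the definition above):

SELECT B.name, B.sid FROM B-Students B WHERE B.course = 'Reggae203'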
1.20 Destroying/Altering Tables and Views

To destroy views, use the DROP TABLE command. For example, DROP TABLE Students
RESTRICT destroys the Students table unless some view or integrity constraint refers to
Students; if so, the command fails. If the keyword RESTRICT is replaced by CASCADE,
Students is dropped and any referencing views or integrity constraints are (recursively) dropped
as well; one of these two keywords must always be specified. A view can be dropped using the
DROP VIEW command, which is just like DROP TABLE.

ALTER TABLE modifies the structure of an existing table. To add a column called maiden-name to Students, for example, we would use the following command:
ALTER TABLE Students ADD COLUMN maiden-name CHAR(10)
The definition of Students is modified to add this column, and all existing rows are padded with
null values in this column. ALTER TABLE can also be used to delete columns and to add or
drop integrity constraints on a table.
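For example, the column added above could be removed again; a hedged sketch (the exact DROP COLUMN syntax varies slightly across systems):

ALTER TABLE Students DROP COLUMN maiden-name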

UNIT – II

INTRODUCTION TO RELATIONAL ALGEBRA

Overview:

The Relational Model defines two root languages for accessing a relational database -- Relational
Algebra and Relational Calculus. Relational Algebra is a low-level, operator-oriented language.
Creating a query in Relational Algebra involves combining relational operators using algebraic
notation. Relational Calculus is a high-level, declarative language. Creating a query in Relational
Calculus involves describing what results are desired.

SQL (Structured Query Language) is a database sublanguage for querying and modifying relational databases. The basic structure in SQL is the statement; we will see how to write queries and how to modify tables and columns.

Contents:

Relational Algebra and Calculus


Selection and projection set operations
Renaming
Joins
Division
Relational calculus
Tuple relational Calculus
Domain relational calculus
Expressive Power of Algebra and calculus

Form of Basic SQL Query, Examples:

Introduction to Nested Queries, Correlated Nested Queries
Set-Comparison Operators, Aggregate Operators
NULL values
Logical connectives

2.1 Relational Algebra and Calculus:

Relational algebra is one of the two formal query languages associated with the relational
model. Queries in algebra are composed using a collection of operators. A fundamental property
is that every operator in the algebra accepts (one or two) relation instances as arguments and
returns a relation instance as the result. This property makes it easy to compose operators to form
a complex query —a relational algebra expression is recursively defined to be a relation, a
unary algebra operator applied to a single expression, or a binary algebra operator applied to two
expressions. We describe the basic operators of the algebra (selection, projection, union, cross-
product, and difference).

Selection and Projection

Relational algebra includes operators to select rows from a relation (σ) and to project columns (π).

These operations allow us to manipulate data in a single relation. Consider the instance of the Sailors relation shown in Figure 4.2, denoted as S2. We can retrieve rows corresponding to expert sailors by using the σ operator. The expression σ_rating>8(S2) evaluates to the relation shown in Figure 4.4. The subscript rating>8 specifies the selection criterion to be applied while retrieving tuples.

Set Operations:

The following standard operations on sets are also available in relational algebra: union (∪), intersection (∩), set-difference (−), and cross-product (×).

Union: R ∪ S returns a relation instance containing all tuples that occur in either relation instance R or relation instance S (or both). R and S must be union-compatible, and the schema of the result is defined to be identical to the schema of R.

Intersection: R ∩ S returns a relation instance containing all tuples that occur in both R and S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.

Set-difference: R-S returns a relation instance containing all tuples that occur in R but not in S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.

Cross-product: R×S returns a relation instance whose schema contains all the fields of R (in the
same order as they appear in R) followed by all the fields of S (in the same order as they appear
in S). The result of R × S contains one tuple ⟨r, s⟩ (the concatenation of tuples r and s) for each pair of tuples r ∈ R, s ∈ S. The cross-product operation is sometimes called the Cartesian product.

Joins

The join operation is one of the most useful operations in relational algebra and is the most
commonly used way to combine information from two or more relations. Although a join can be
defined as a cross-product followed by selections and projections, joins arise much more
frequently in practice than plain cross-products.

Condition Joins
The most general version of the join operation accepts a join condition c and a pair of relation
instances as arguments, and returns a relation instance. The join condition is identical to a
selection condition in form.

The operation is defined as follows:

R ⋈c S = σc(R × S)

That is, the condition join computes the cross-product of R and S and then retains only the tuples that satisfy the join condition c.

Select Operation (σ)


It selects tuples that satisfy the given predicate from a relation.

Notation − σp(r)

Where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula which may use connectors like and, or, and not. These terms may use relational operators like =, ≠, ≥, <, >, ≤.

For example −

σsubject = "database"(Books)
Output − Selects tuples from books where subject is 'database'.

σsubject = "database" and price = "450"(Books)


Output − Selects tuples from books where subject is 'database' and 'price' is 450.

σsubject = "database" and price = "450" or year > "2010"(Books)


Output − Selects tuples from books where subject is 'database' and 'price' is 450 or those books
published after 2010.

Project Operation (∏)
It projects column(s) that satisfy a given predicate.

Notation − ∏A1, A2, ..., An (r)

Where A1, A2, ..., An are attribute names of relation r.

Duplicate rows are automatically eliminated, as relation is a set.

For example −

∏subject, author (Books)


Selects and projects columns named as subject and author from the relation Books.

Union Operation (∪)


It performs binary union between two given relations and is defined as −

r ∪ s = { t | t ∈ r or t ∈ s}
Notation − r U s

Where r and s are either database relations or relation result set (temporary relation).

For a union operation to be valid, the following conditions must hold −

 r and s must have the same number of attributes.
 Attribute domains must be compatible.
 Duplicate tuples are automatically eliminated.
∏ author (Books) ∪ ∏ author (Articles)
Output − Projects the names of the authors who have either written a book or an article or both.

Set Difference (−)


The result of set difference operation is tuples, which are present in one relation but are not in
the second relation.

Notation − r − s

Finds all the tuples that are present in r but not in s.

∏ author (Books) − ∏ author (Articles)


Output − Provides the name of authors who have written books but not articles.

Cartesian Product (Χ)
Combines information of two different relations into one.

Notation − r Χ s

Where r and s are relations and their output will be defined as −

r Χ s = { q t | q ∈ r and t ∈ s}

σauthor = 'tutorialspoint'(Books Χ Articles)


Output − Yields a relation, which shows all the books and articles written by tutorialspoint.

Rename Operation (ρ)


The results of relational algebra are also relations but without any name. The rename operation
allows us to rename the output relation. 'rename' operation is denoted with small Greek
letter rho ρ.

Notation − ρ x (E)

Where the result of expression E is saved with name of x.

Additional operations are −

 Set intersection
 Assignment
 Natural join

2.2 Relational Calculus:

Relational calculus is an alternative to relational algebra. In contrast to the algebra, which is procedural, the calculus is nonprocedural, or declarative, in that it allows us to describe the set of answers without being explicit about how they should be computed.

The variant of the calculus that we present in detail is called the tuple relational calculus
(TRC). Variables in TRC take on tuples as values. In another variant, called the domain
relational calculus (DRC), the variables range over field values.

2.3 Tuple Relational Calculus

A tuple variable is a variable that takes on tuples of a particular relation schema as values. That
is, every value assigned to a given tuple variable has the same number and type of fields. A tuple
relational calculus query has the form { T | p(T) },where T is a tuple variable and p(T) denotes a
formula that describes T. The result of this query is the set of all tuples t for which the formula
p(T)evaluates to true with T = t. The language for writing formulas p(T) is thus at the heart of
TRC and is essentially a simple subset of first-order logic
As a simple example, consider the following query.

Find all sailors with a rating above 7.


{S | S ∈ Sailors ∧ S.rating > 7}

Syntax of TRC Queries

Let Rel be a relation name, R and S be tuple variables, a an attribute of R, and b an attribute of S. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of the following:
R ∈ Rel

R.a op S.b
R.a op constant, or constant op R.a

A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(R) denotes a formula in which the variable R appears:

any atomic formula


¬p, p ∧ q, p ∨ q, or p ⇒ q
∃R(p(R)), where R is a tuple variable
∀R(p(R)), where R is a tuple variable

2.4 Domain Relational Calculus

A domain variable is a variable that ranges over the values in the domain of some attribute (e.g.,
the variable can be assigned an integer if it appears in an attribute whose domain is the set of
integers). A DRC query has the form {⟨x1, x2, ..., xn⟩ | p(x1, x2, ..., xn)}, where each xi is either a domain variable or a constant and p(x1, x2, ..., xn) denotes a DRC formula whose only free variables are the variables among the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples ⟨x1, x2, ..., xn⟩ for which the formula evaluates to true.

A DRC formula is defined in a manner that is very similar to the definition of a TRC formula. The main difference is that the variables are now domain variables. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables.

An atomic formula in DRC is one of the following:

⟨x1, x2, ..., xn⟩ ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i ≤ n, is either a variable or a constant
X op Y
X op constant, or constant op X

A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(X) denotes a formula in which the variable X appears:

any atomic formula


¬p, p ∧ q, p ∨ q, or p ⇒ q
∃X(p(X)), where X is a domain variable
∀X(p(X)), where X is a domain variable
Eg:Find all sailors with a rating above 7.
{<I, N, T, A>|<I, N, T, A>∈Sailors ∧ T>7}

Relational calculus exists in two forms −

Tuple Relational Calculus (TRC)


Filtering variable ranges over tuples

Notation − {T | Condition}

Returns all tuples T that satisfies a condition.

For example −{ T.name | Author(T) AND T.article = 'database' }

Output − Returns tuples with 'name' from Author who has written article on 'database'.

TRC can be quantified. We can use Existential (∃) and Universal Quantifiers (∀).

For example −

{ R| ∃T ∈ Authors(T.article='database' AND R.name=T.name)}


Output − The above query will yield the same result as the previous one.

Domain Relational Calculus (DRC)


In DRC, the filtering variable uses the domain of attributes instead of entire tuple values (as
done in TRC, mentioned above).

Notation −

{ a1, a2, a3, ..., an | P (a1, a2, a3, ... ,an)}

Where a1, a2 are attributes and P stands for formulae built by inner attributes.

For example −

{< article, page, subject > | < article, page, subject > ∈ TutorialsPoint ∧ subject = 'database'}


Output − Yields Article, Page, and Subject from the relation TutorialsPoint, where subject is
database.

Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also
involves relational operators.

The expression power of Tuple Relation Calculus and Domain Relation Calculus is equivalent
to Relational Algebra.
Expressive Power of Algebra and Calculus

Key Differences Between Relational Algebra and Relational Calculus

1. The basic difference between Relational Algebra and Relational Calculus is that Relational
Algebra is a Procedural language whereas, the Relational Calculus is a Non-Procedural,
instead it is a Declarative language.
2. The Relational Algebra defines how to obtain the result whereas, the Relational Calculus
define what information the result must contain.
3. Relational Algebra specifies the sequence in which operations have to be performed in the query. On the other hand, Relational Calculus does not specify the sequence of operations to be performed in the query.

4. Relational Algebra is not domain dependent whereas Relational Calculus can be domain dependent, as we have Domain Relational Calculus.
5. The Relational Algebra query language is closely related to programming languages whereas Relational Calculus is closely related to natural language.

SQL:
2.5 The Form of a Basic SQL Query:
SQL is the standard language used to query relational databases. It is simple to learn and appears to do very little, but it is the heart of a successful database application. Understanding SQL and using it efficiently is essential to designing an efficient database application; the better your understanding of SQL, the more versatile you will be in getting information out of databases. A SQL SELECT statement can be broken down into numerous elements, each beginning with a keyword. Although it is not necessary, common convention is to write these keywords in all capital letters. Here, we will focus on the most fundamental and common elements of a SELECT statement, namely

SELECT
FROM
WHERE
ORDER BY

The SELECT ... FROM Clause


The most basic SELECT statement has only 2 parts:
 What columns you want to return
 What table(s) those columns come from.

Examples of Basic SQL Queries:


If we want to retrieve all of the information about all of the employees in the Employees table, we could use the asterisk (*) as a shortcut for all of the columns, and our query looks like

SELECT * FROM Employees

If we want only specific columns (as is usually the case), we can/should explicitly specify them
in a comma-separated list, as in
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees

Explicitly specifying the desired fields also allows us to control the order in which the fields are
returned, so that if we wanted the last name to appear before the first name, we could write
SELECT EmployeeID, LastName, FirstName, HireDate, City FROM Employees

The WHERE Clause
The next thing we want to do is to start limiting, or filtering, the data we fetch from the database.
By adding a WHERE clause to the SELECT statement, we add one (or more) conditions that
must be met by the selected data. This will limit the number of rows that answer the query and
are fetched. In many cases, this is where most of the "action" of a query takes place.

Examples
We can continue with our previous query, and limit it to only those employees living in London:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London'

If you wanted to get the opposite, the employees who do not live in London, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City <> 'London'

It is not necessary to test for equality; you can also use the standard equality/inequality operators
that you would expect. For example, to get a list of employees who were hired on or after a given
date, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE HireDate >= '1-july-1993'

Of course, we can write more complex conditions. The obvious way to do this is by having
multiple conditions in the WHERE clause. If we want to know which employees were hired
between two given dates, we could write
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE (HireDate >= '1-june-1992') AND (HireDate <= '15-december-1993')

Note that SQL also has a special BETWEEN operator that checks to see if a value is between two values (including equality on both ends). This allows us to rewrite the previous query as
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate BETWEEN '1-june-1992' AND '15-december-1993'

We could also use the NOT operator, to fetch those rows that are not between the specified dates:
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate NOT BETWEEN '1-june-1992' AND '15-december-1993'

Let us finish this section on the WHERE clause by looking at two additional, slightly more
sophisticated, comparison operators.
What if we want to check if a column value is equal to more than one value? If it is only 2
values, then it is easy enough to test for each of those values, combining them with the OR
operator and writing something like
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London' OR City = 'Seattle'

However, if there are three, four, or more values that we want to compare against, the above
approach quickly becomes messy. In such cases, we can use the IN operator to test against a set
of values. If we wanted to see if the City was either Seattle, Tacoma, or Redmond, we would
write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City IN ('Seattle', 'Tacoma', 'Redmond')

As with the BETWEEN operator, here too we can reverse the results obtained and query for
those rows where City is not in the specified list:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City NOT IN ('Seattle', 'Tacoma', 'Redmond')

Finally, the LIKE operator allows us to perform basic pattern-matching using wildcard
characters. For Microsoft SQL Server, the wildcard characters are defined as follows:

Wildcard Description
_ (underscore) matches any single character

% matches a string of zero or more characters

[] matches any single character within the specified range (e.g. [a-f])
or set (e.g. [abcdef]).

[^] matches any single character not within the specified range (e.g.
[^a-f]) or set (e.g. [^abcdef]).

Here too, we can opt to use the NOT operator: to find all of the employees whose first name
does not start with 'M' or 'A', we would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE (FirstName NOT LIKE 'M%') AND (FirstName NOT LIKE 'A%')

The ORDER BY Clause
Until now, we have been discussing filtering the data: that is, defining the conditions that
determine which rows will be included in the final set of rows to be fetched and returned from
the database. Once we have determined which columns and rows will be included in the results
of our SELECT query, we may want to control the order in which the rows appear—sorting the
data.
To sort the data rows, we include the ORDER BY clause. The ORDER BY clause includes one
or more column names that specify the sort order. If we return to one of our first SELECT
statements, we can sort its results by City with the following statement:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY City

If we want the sort order for a column to be descending, we can include the DESC keyword after
the column name.

The ORDER BY clause is not limited to a single column. You can include a comma-delimited
list of columns to sort by—the rows will all be sorted by the first column specified and then by
the next column specified. If we add the Country field to the SELECT clause and want to sort
by Country and City, we would write:
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country, City DESC

Note that to make it interesting, we have specified the sort order for the City column to be
descending (from highest to lowest value). The sort order for the Country column is still
ascending. We could be more explicit about this by writing
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country ASC, City DESC

It is important to note that a column does not need to be included in the list of selected (returned)
columns in order to be used in the ORDER BY clause. If we don't need to see/use the Country
values, but are only interested in them as the primary sorting field we could write the query as
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY Country ASC, City DESC

2.6 UNION, INTERSECT, AND EXCEPT

SQL provides three set-manipulation constructs that extend the basic query form. Since the answer to a query is a multiset of rows, it is natural to consider the use of operations such as union, intersection, and difference. SQL supports these operations under the names UNION, INTERSECT, and EXCEPT.

Union:
Eg: Find the names of sailors who have reserved a red or a green boat.

SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
union
SELECT S2.sname FROM Sailors S2, Boats B2, Reserves R2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’

This query says that we want the union of the set of sailors who have reserved red boats and
the set of sailors who have reserved green boats.

Intersect:
Eg:Find the names of sailors who have reserved both a red and a green boat.

SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
intersect
SELECT S2.sname FROM Sailors S2, Boats B2, Reserves R2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’

Except:
Eg:Find the sids of all sailors who have reserved red boats but not green boats.

SELECT S.sid FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid =
B.bid AND B.color = ‘red’
Except
SELECT S2.sid FROM Sailors S2, Reserves R2, Boats B2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’

SQL also provides other set operations: IN (to check if an element is in a given set), op ANY, op
ALL (to compare a value with the elements in a given set, using comparison operator op), and

EXISTS (to check whether a set is nonempty). IN and EXISTS can be prefixed by NOT, with the obvious modification to their meaning. We covered UNION, INTERSECT, and EXCEPT in this section; the other operations are illustrated with nested queries below.

2.7 NESTED QUERIES


A Subquery or Inner query or Nested query is a query within another SQL query, embedded within the WHERE clause. A subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved.

Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along
with the operators like =, <, >, >=, <=, IN, BETWEEN etc.

There are a few rules that subqueries must follow:


 Subqueries must be enclosed within parentheses.
 A subquery can have only one column in the SELECT clause, unless multiple columns
are in the main query for the subquery to compare its selected columns.
 An ORDER BY cannot be used in a subquery, although the main query can use an
ORDER BY. The GROUP BY can be used to perform the same function as the ORDER
BY in a subquery.Subqueries that return more than one row can only be used with
multiple value operators, such as the IN operator.
 The SELECT list cannot include any references to values that evaluate to a BLOB,
ARRAY, CLOB, or NCLOB.

EXISTS (sub query)


The argument of EXISTS is an arbitrary SELECT statement. The sub query is evaluated to
determine whether it returns any rows. If it returns at least one row, the result of EXISTS is
TRUE; if the sub query returns no rows, the result of EXISTS is FALSE.

The sub query can refer to variables from the surrounding query, which will act as constants
during any one evaluation of the sub query.

This simple example is like an inner join on col2, but it produces at most one output row for
each tab1 row, even if there are multiple matching tab2 rows:

SELECT col1
FROM tab1
WHERE EXISTS (SELECT 1
FROM tab2
WHERE col2 = tab1.col2);

Example "Students in Projects":

SELECT s.name
FROM stud s
WHERE EXISTS (SELECT 1
              FROM assign a
              WHERE a.stud = s.id);

IN / NOT IN


The right-hand side of this form of IN is a parenthesized list of scalar expressions. The result is
TRUE if the left-hand expression's result is equal to any of the right-hand expressions.

The right-hand side of this form of IN is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result. The result of IN is TRUE if any equal sub query row is found.

SELECT id, name


FROM stud
WHERE id IN ( SELECT stud
FROM assign
WHERE id = 1);

ANY and SOME


The right-hand side of this form of ANY is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result using the given operator, which must yield a Boolean result. The result of ANY is TRUE if
any true result is obtained.
SOME is a synonym for ANY. IN is equivalent to = ANY.

ALL
The right-hand side of this form of ALL is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result using the given operator, which must yield a Boolean result. The result of ALL is TRUE if
all rows yield TRUE (including the special case where the sub query returns no rows). NOT IN
is equivalent to <> ALL.

Row-wise comparison
The left-hand side is a list of scalar expressions. The right-hand side can be either a list of scalar
expressions of the same length, or a parenthesized sub query, which must return exactly as many
columns as there are expressions on the left-hand side. Furthermore, the sub query cannot return

more than one row. (If it returns zero rows, the result is taken to be NULL.) The left-hand side is
evaluated and compared row-wise to the single sub query result row, or to the right-hand
expression list. Presently, only = and <> operators are allowed in row-wise comparisons. The
result is TRUE if the two rows are equal or unequal, respectively.

A nested query is a query that has another query embedded within it; the embedded query is
called a subquery.

SQL provides other set operations: IN (to check if an element is in a given set),NOT IN(to
check if an element is not in a given set).

Eg:1. Find the names of sailors who have reserved boat 103.

SELECT S.sname FROM Sailors S

WHERE S.sid IN ( SELECT R.sid

FROM Reserves R

WHERE R.bid = 103 )

The nested subquery computes the (multi)set of sids for sailors who have reserved boat 103,
and the top-level query retrieves the names of sailors whose sid is in this set. The IN operator
allows us to test whether a value is in a given set of elements; an SQL query is used to generate
the set to be tested.

2.Find the names of sailors who have not reserved a red boat.

SELECT S.sname FROM Sailors S
WHERE S.sid NOT IN ( SELECT R.sid
                     FROM Reserves R
                     WHERE R.bid IN ( SELECT B.bid
                                      FROM Boats B
                                      WHERE B.color = ‘red’ ) )

Correlated Nested Queries

In the nested queries that we have seen, the inner subquery has been completely independent of
the outer query. In general the inner subquery could depend on the row that is currently being
examined in the outer query .

Eg: Find the names of sailors who have reserved boat number 103.

SELECT S.sname FROM Sailors S


WHERE EXISTS ( SELECT *
FROM Reserves R
WHERE R.bid = 103 AND R.sid = S.sid )

The EXISTS operator is another set comparison operator, such as IN. It allows us to test
whether a set is nonempty.
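
Prefixed with NOT, EXISTS tests the opposite; a small sketch in the same style, finding sailors who have not reserved boat 103:

SELECT S.sname FROM Sailors S
WHERE NOT EXISTS ( SELECT *
                   FROM Reserves R
                   WHERE R.bid = 103 AND R.sid = S.sid )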

Set-Comparison Operators

SQL also supports op ANY and op ALL, where op is one of the arithmetic comparison operators {<, <=, =, <>, >=, >}.

Eg:1. Find sailors whose rating is better than some sailor called Horatio.

SELECT S.sid FROM Sailors S


WHERE S.rating > ANY ( SELECT S2.rating
FROM Sailors S2 WHERE S2.sname = ‘Horatio’ )

If there are several sailors called Horatio, this query finds all sailors whose rating is better
than that of some sailor called Horatio.
2.Find the sailors with the highest rating.

SELECT S.sid FROM Sailors S


WHERE S.rating >= ALL ( SELECT S2.rating
FROM Sailors S2 )

SQL Operators
There are two types of operators, namely Comparison Operators and Logical Operators. These operators are used mainly in the WHERE clause and HAVING clause to filter the data to be selected.

Comparison Operators:Comparison operators are used to compare the column data with
specific values in a condition.Comparison Operators are also used along with the SELECT
statement to filter data based on specific conditions.

Comparison Operators Description


= equal to
<>, != is not equal to
< less than
> greater than
>= greater than or equal to
<= less than or equal to

Logical Operators:There are three Logical Operators namely AND, OR and NOT.

SQL Comparison Keywords


There are other comparison keywords available in SQL which are used to enhance the search capabilities of an SQL query. They are "IN", "BETWEEN...AND", "IS NULL", "LIKE".

Comparison Operators Description


LIKE column value is similar to specified character(s).
IN column value is equal to any one of a specified set of values.
BETWEEN...AND column value is between two values, including the end values
specified in the range.
IS NULL column value does not exist.

SQL LIKE Operator


The LIKE operator is used to list all rows in a table whose column values match a specified
pattern. It is useful when you want to search rows to match a specific pattern, or when you do not
know the entire value. For this purpose we use a wildcard character '%'.

To select all the students whose name begins with 'S'


SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE 'S%';

The above select statement searches for all the rows where the first letter of the column
first_name is 'S' and rest of the letters in the name can be any character.

There is another wildcard character you can use with LIKE operator. It is the underscore
character, ' _ ' . In a search string, the underscore signifies a single character.

To display all the names with 'a' second character,
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE '_a%';

NOTE: Each underscore acts as a placeholder for exactly one character, so you can use more than one underscore. Eg: '__i%' has two underscores towards the left; 'S__j%' has two underscores between the characters 'S' and 'j'.

SQL BETWEEN ... AND Operator


The BETWEEN ... AND operator is used to compare data against a range of values.

To find the names of the students between age 10 to 15 years, the query would be like,
SELECT first_name, last_name, age
FROM student_details
WHERE age BETWEEN 10 AND 15;

SQL IN Operator
The IN operator is used when you want to compare a column with more than one value. It is
similar to an OR condition.

If you want to find the names of students who are studying either Maths or Science, the query
would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject IN ('Maths', 'Science');
You can include more subjects in the list like ('maths','science','history')

NOTE: The data used for comparison is case-sensitive.

SQL IS NULL Operator


A column value is NULL if it does not exist. The IS NULL operator is used to display all the
rows for columns that do not have a value.

If you want to find the names of students who do not participate in any games, the query would
be as given below
SELECT first_name, last_name
FROM student_details
WHERE games IS NULL

There would be no output if every student in the table student_details participates in a game; otherwise, the names of the students who do not participate in any games would be displayed.

2.8 Aggregate Operators


The SQL Aggregate Functions are functions that provide mathematical operations. If you need to
add, count or perform basic statistics, these functions will be of great help.

The functions include:


 count() - counts a number of rows
 sum() - compute sum
 avg() - compute average
 min() - compute minimum
 max() - compute maximum

Use of SQL Aggregate Functions


SQL Aggregate Functions are used as follows. If a grouping of values is needed, also include the GROUP BY clause. Use a column name or expression as the parameter to the aggregate function. The parameter '*' represents all rows.

SELECT <column_name1>, <column_name2>, <aggregate_function(s)>
FROM <table_name>
GROUP BY <column_name1>, <column_name2>

Example
The following example Aggregate Functions are applied to the employee_count of the branch
table. The region_nbr is the level of grouping.Here are the contents of the table:

Table: BRANCH
branch_nbr branch_name region_nbr employee_count
108 New York 100 10
110 Boston 100 6
212 Chicago 200 5
404 San Diego 400 6
415 San Jose 400 3

This SQL Statement with aggregate functions is executed:


SELECT region_nbr A, count(branch_nbr) B, sum(employee_count) C,
       min(employee_count) D,
       max(employee_count) E, avg(employee_count) F
FROM dbo.branch

GROUP BY region_nbr
ORDER BY region_nbr

Here is the result.


A B C D E F
100 2 16 6 10 8
200 1 5 5 5 5
400 2 9 3 6 4

2.9 NULL VALUES


The SQL NULL is the term used to represent a missing value. A NULL value in a table is a value in a field that appears to be blank. A field with a NULL value is a field with no value. It is very important to understand that a NULL value is different from a zero value or a field that contains spaces.

Syntax:
The basic syntax of NULL while creating a table:

CREATE TABLE CUSTOMERS


( ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25),
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID));

Here, NOT NULL signifies that the column must always accept an explicit value of the given data type. There are two columns where we did not use NOT NULL, which means these columns may be NULL.

A field with a NULL value is one that has been left blank during record creation.

Example:
The NULL value can cause problems when selecting data, however, because when comparing an
unknown value to any other value, the result is always unknown and not included in the final
results.

You must use the IS NULL or IS NOT NULL operators in order to check for a NULL value.

Consider the following table, CUSTOMERS having the following records:

ID NAME AGE ADDRESS SALARY
1 Ramesh 32 Ahmedabad 2000.00
2 Khilan 25 Delhi 1500.00
3 kaushik 23 Kota 2000.00
4 Chaitali 25 Mumbai 6500.00
5 Hardik 27 Bhopal 8500.00
6 Komal 22 MP
7 Muffy 24 Indore

Now, following is the usage of IS NOT NULL operator:

SELECT ID, NAME, AGE, ADDRESS, SALARY


FROM CUSTOMERS
WHERE SALARY IS NOT NULL;

This would produce the following result:

ID NAME AGE ADDRESS SALARY


1 Ramesh 32 Ahmedabad 2000.00
2 Khilan 25 Delhi 1500.00
3 kaushik 23 Kota 2000.00
4 Chaitali 25 Mumbai 6500.00
5 Hardik 27 Bhopal 8500.00

Now, following is the usage of IS NULL operator:

SELECT ID, NAME, AGE, ADDRESS, SALARY


FROM CUSTOMERS
WHERE SALARY IS NULL;

This would produce the following result:

ID NAME AGE ADDRESS SALARY


6 Komal 22 MP
7 Muffy 24 Indore

SQL LOGICAL OPERATORS


There are three Logical Operators namely, AND, OR, and NOT. These operators compare two
conditions at a time to determine whether a row can be selected for the output. When retrieving
data using a SELECT statement, you can use logical operators in the WHERE clause, which
allows you to combine more than one condition.

Logical Operators Description

OR    For the row to be selected, at least one of the conditions must be true.
AND   For a row to be selected, all the specified conditions must be true.
NOT   For a row to be selected, the specified condition must be false.

"OR" Logical Operator


If you want to select rows that satisfy at least one of the given conditions, you can use the logical
operator, OR.

Example: if you want to find the names of students who are studying either Maths or Science, the
query would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject = 'Maths' OR subject = 'Science'

first_name  last_name  subject
----------  ---------  -------
Anajali     Bhagwat    Maths
Shekar      Gowda      Maths
Rahul       Sharma     Science
Stephen     Fleming    Science
The following table describes how logical "OR" operator selects a row.

Column1 Column2 Row


Satisfied? Satisfied? Selected
YES YES YES
YES NO YES
NO YES YES
NO NO NO

"AND" Logical Operator


If you want to select rows that must satisfy all the given conditions, you can use the logical
operator, AND.

Example: To find the names of the students between the age 10 to 15 years, the query would be
like:
SELECT first_name, last_name, age
FROM student_details
WHERE age >= 10 AND age <= 15;

The output would be something like,


first_name  last_name  age
----------  ---------  ---
Rahul       Sharma     10
Anajali     Bhagwat    12
Shekar      Gowda      15

The following table describes how logical "AND" operator selects a row.

Column1 Column2 Row


Satisfied? Satisfied? Selected
YES YES YES
YES NO NO
NO YES NO
NO NO NO

"NOT" Logical Operator


If you want to find rows that do not satisfy a condition, you can use the logical operator, NOT.
NOT results in the reverse of a condition. That is, if a condition is satisfied, then the row is not
returned.

Example: If you want to find out the names of the students who do not play football, the query
would be like:

SELECT first_name, last_name, games


FROM student_details
WHERE NOT games = 'Football'

OUTER JOINS
All the joins mentioned above, that is, Theta Join, Equi Join, and Natural Join, are called inner joins. An inner join includes only tuples with matching attributes; the rest are discarded from the resulting relation. There exist methods by which all tuples of either relation are included in the resulting relation.

There are three kinds of outer joins:

Left outer join ( R ⟕ S )

All tuples of the left relation, R, are included in the resulting relation, and if there exist tuples in R without any matching tuple in S, then the S-attributes of the resulting relation are made NULL.

Left
A B
100 Database
101 Mechanics
102 Electronics

Right
A B
100 Alex
102 Maya
104 Mira

Left outer join output


A B C D
100 Database 100 Alex
101 Mechanics --- ---
102 Electronics 102 Maya

Right outer join: ( R ⟖ S )


All tuples of the right relation, S, are included in the resulting relation, and if there exist tuples in S without any matching tuple in R, then the R-attributes of the resulting relation are made NULL.

Right outer join output


A B C D
100 Database 100 Alex
102 Electronics 102 Maya
--- --- 104 Mira
Full outer join: ( R ⟗ S )
All tuples of both participating relations are included in the resulting relation, and if there are no matching tuples for both relations, their respective unmatched attributes are made NULL.

Full outer join output

A B C D
100 Database 100 Alex
101 Mechanics --- ---
102 Electronics 102 Maya
--- --- 104 Mira
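
In SQL, outer joins are written with explicit JOIN syntax. A hedged sketch against the two relations above (the tables are renamed Lft(A, B) and Rgt(C, D) here to avoid the reserved words LEFT and RIGHT; the join condition matches A with C):

SELECT L.A, L.B, R.C, R.D
FROM Lft L LEFT OUTER JOIN Rgt R ON L.A = R.C;
-- unmatched Lft rows appear with NULLs in columns C and D;
-- RIGHT OUTER JOIN and FULL OUTER JOIN use the same syntax with the keyword changed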

DISALLOWING NULL VALUES

SQL NOT NULL Statement


To display only the rows whose Location field is not left blank, here is a statement example.

SELECT * FROM Employee


WHERE Location IS NOT NULL;

SQL NOT NULL Statement Output:


The NOT NULL statement will display the following results

EmployeeID  EmployeeName  Age  Gender  Location  Salary
1001        Henry         54   Male    New York  100000
1002        Tina          36   Female  Moscow    80000
1003        John          24   Male    London    40000
1006        Sophie        29   Female  London    60000

2.10 Complex Integrity Constraints in SQL; Triggers

An integrity constraint defines a business rule for a table column. When enabled, the rule will be enforced by Oracle (and so will always be true). To create an integrity constraint, all existing table data must satisfy the constraint.

Default values are also subject to integrity constraint checking (defaults are included as part of
an INSERT statement before the statement is parsed.)

If the results of an INSERT or UPDATE statement violate an integrity constraint, the statement
will be rolled back.

Integrity constraints are stored as part of the table definition, (in the data dictionary.)

If multiple applications access the same table they will all adhere to the same rule.

The following integrity constraints are supported by Oracle:

 NOT NULL
 UNIQUE

 CHECK constraints for complex integrity rules
 PRIMARY KEY
 FOREIGN KEY integrity constraints, with referential integrity actions: On Update, On Delete, Delete CASCADE, Delete SET NULL

Constraint States
The current status of an integrity constraint can be changed to any of the following 4 options
using the CREATE TABLE or ALTER TABLE statement.

 ENABLE - Ensure that all incoming data conforms to the constraint


 DISABLE - Allow incoming data, regardless of whether it conforms to the constraint
 VALIDATE - Ensure that existing data conforms to the constraint
 NOVALIDATE - Allow existing data to not conform to the constraint

These can be used in combination


 ENABLE { VALIDATE | NOVALIDATE }
 DISABLE { VALIDATE | NOVALIDATE }
 ENABLE VALIDATE is the same as ENABLE.

ENABLE NOVALIDATE means that the constraint is checked, but it does not have to be true
for all rows. This will resume constraint checking for Inserts and Updates but will not validate
any data that already exists in the table.

DISABLE NOVALIDATE is the same as DISABLE.

DISABLE VALIDATE disables the constraint, drops the index on the constraint, and disallows any modification of the constrained columns.
For a UNIQUE constraint, this enables you to load data from a nonpartitioned table into a partitioned table using the ALTER TABLE ... EXCHANGE PARTITION statement.
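
A brief hedged sketch of switching constraint states in Oracle (the table and constraint names are illustrative):

ALTER TABLE employee ENABLE NOVALIDATE CONSTRAINT emp_salary_chk;  -- check new rows only
ALTER TABLE employee DISABLE CONSTRAINT emp_salary_chk;            -- stop checking entirely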

2.11 Triggers and Active Databases

A trigger is a procedure that is automatically invoked by the DBMS in response to specified


changes to the database, and is typically specified by the DBA. A database that has a set of
associated triggers is called an active database.
A trigger description contains three parts:

Event: A change to the database that activates the trigger.

Condition: A query or test that is run when the trigger is activated.
Action: A procedure that is executed when the trigger is activated and its condition is true.
Eg: The trigger called init_count initializes a counter variable before every execution of an INSERT statement that adds tuples to the Students relation. The trigger called incr_count increments the counter for each inserted tuple that satisfies the condition age < 18.
CREATE TRIGGER init_count BEFORE INSERT ON Students    /* Event */
DECLARE
    count INTEGER;                                     /* counter shared by the two triggers */
BEGIN                                                  /* Action */
    count := 0;
END

CREATE TRIGGER incr_count AFTER INSERT ON Students     /* Event */
WHEN (new.age < 18)                                    /* Condition */
FOR EACH ROW
BEGIN                                                  /* Action */
    count := count + 1;
END
UNIT-III

Schema Refinement and Normal Forms

Overview:
Constructing the tables alone does not make an efficient database design; the redundant-data problem must also be solved. For this we use functional dependencies and normal forms, which are discussed in this chapter.

Contents:
Schema refinement

Use of Decompositions
Functional dependencies
Normal forms
Multi valued dependencies

3.1 Introduction to Schema Refinement:

We now present an overview of the problems that schema refinement is intended to address and
a refinement approach based on decompositions. Redundant storage of information is the root
cause of these problems. Although decomposition can eliminate redundancy, it can lead to
problems of its own and should be used with caution.

Problems Caused by Redundancy

Storing the same information redundantly, that is, in more than one place within a database, can
lead to several problems:
Redundant storage: Some information is stored repeatedly.
Update anomalies: If one copy of such repeated data is updated, an inconsistency is created
unless all copies are similarly updated.
Insertion anomalies: It may not be possible to store some information unless some other
information is stored as well.
Deletion anomalies: It may not be possible to delete some information without losing some
other information as well.

Use of Decompositions

Redundancy arises when a relational schema forces an association between attributes that is not
natural. Functional dependencies can be used to identify such situations and to suggest
refinements to the schema. The essential idea is that many problems arising from redundancy can
be addressed by replacing a relation with a collection of ‘smaller’ relations. Each of the smaller
relations contains a subset of the attributes of the original relation. We refer to this process as
decomposition of the larger relation into the smaller relations.

Problems Related to Decomposition

Decomposing a relation schema can create more problems than it solves. Two important
questions must be asked repeatedly:

Do we need to decompose a relation?

What problems (if any) does a given decomposition cause?

To help with the first question, several normal forms have been proposed for relations. If a
relation schema is in one of these normal forms, we know that certain kinds of problems cannot
arise. Considering the normal form of a given relation schema can help us to decide whether or
not to decompose it further. If we decide that a relation schema must be decomposed further, we
must choose a particular decomposition.

With respect to the second question, two properties of decompositions are of particular interest.
The lossless-join property enables us to recover any instance of the decomposed relation from
corresponding instances of the smaller relations. The dependency preservation property enables
us to enforce any constraint on the original relation by simply enforcing some constraints on
each of the smaller relations. That is, we need not perform joins of the smaller relations to check
whether a constraint on the original relation is violated.

3.2 Functional dependencies:

A functional dependency A->B holds in a relation if any two tuples having the same value of attribute A also have the same value for attribute B. For example, in the relation STUDENT shown in Table 1, the functional dependencies
STUD_NO->STUD_NAME and STUD_NO->STUD_ADDR hold, but
STUD_NAME->STUD_ADDR does not hold.

How to find functional dependencies for a relation?


Functional Dependencies in a relation are dependent on the domain of the relation. Consider the
STUDENT relation given in Table 1.
 We know that STUD_NO is unique for each student. So STUD_NO->STUD_NAME,
STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE, STUD_NO-
>STUD_COUNTRY and STUD_NO -> STUD_AGE all will be true.
 Similarly, STUD_STATE->STUD_COUNTRY will be true as if two records have same
STUD_STATE, they will have same STUD_COUNTRY as well.
 For relation STUDENT_COURSE, COURSE_NO->COURSE_NAME will be true as
two records with same COURSE_NO will have same COURSE_NAME.
Functional Dependency Set: Functional Dependency set or FD set of a relation is the set of all
FDs present in the relation. For Example, FD set for relation STUDENT shown in table 1 is:
{ STUD_NO->STUD_NAME, STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE,
STUD_NO->STUD_COUNTRY,
STUD_NO -> STUD_AGE, STUD_STATE->STUD_COUNTRY }

Attribute Closure: Attribute closure of an attribute set can be defined as set of attributes which
can be functionally determined from it.
How to find attribute closure of an attribute set?
To find attribute closure of an attribute set:
 Add elements of attribute set to the result set.
 Recursively add elements to the result set which can be functionally determined from the
elements of the result set.
Using FD set of table 1, attribute closure can be determined as:
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
How to find Candidate Keys and Super Keys using Attribute Closure?
 If attribute closure of an attribute set contains all attributes of relation, the attribute set
will be super key of the relation.
 If no subset of this attribute set can functionally determine all attributes of the relation,
the set will be candidate key as well. For Example, using FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO, STUD_NAME) will be a super key but not a candidate key, because its proper subset
{STUD_NO} already determines all attributes of the relation. So, STUD_NO will be a candidate key.
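
The same closure computation yields a direct test for super keys and candidate keys. A sketch,
reusing attribute_closure from the sketch above (the function names are illustrative):

from itertools import combinations

def is_superkey(attrs, all_attrs, fds):
    return attribute_closure(attrs, fds) == set(all_attrs)

def is_candidate_key(attrs, all_attrs, fds):
    if not is_superkey(attrs, all_attrs, fds):
        return False
    # a candidate key is a super key no proper subset of which is a super key
    return not any(is_superkey(set(sub), all_attrs, fds)
                   for r in range(1, len(attrs))
                   for sub in combinations(attrs, r))

# With the FD set of table 1: {STUD_NO} is a candidate key, while
# {STUD_NO, STUD_NAME} is a super key but not a candidate key.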

3.3 Normalization:

In general, database normalization involves splitting tables with columns that have different
types of data (and perhaps even unrelated data) into multiple tables, each with fewer columns that
describe the attributes of a single concept or physical object.

The goal of normalization is to prevent the problems (called modification anomalies) that plague
a poorly designed relation (table).

Suppose, for example, that you have a table with resort guest ID numbers, activities the guests
have signed up to do, and the cost of each activity, all together in the following
GUEST-ACTIVITY-COST table:

Each row in the table represents a guest that has signed up for the named activity and paid the
specified cost. Assume that the cost depends only on the activity; that is, a specific activity
costs the same for all guests. If you delete the row for GUEST-ID 2587, you lose not only the
fact that guest 2587 signed up for scuba diving, but also the fact that scuba diving costs $250.00
per outing. This is called a deletion anomaly: when you delete a row, you lose more information
than you intended to remove.

In the current example, a single deletion resulted in the loss of information on two entities: what
activity a guest signed up to do, and how much a particular activity costs.

Now, suppose the resort adds a new activity such as horseback riding. You cannot enter the
activity name (horseback riding) or cost ($190.00) into the table until a guest decides to sign up
for it. The unnecessary restriction of having to wait until someone signs up for an activity before
you can record its name and cost is called an insertion anomaly.

In the current example, each insertion adds facts about two entities. Therefore, you cannot
INSERT a fact about one entity until you have an additional fact about the other entity.
Conversely, each deletion removes facts about two entities. Thus, you cannot DELETE the
information about one entity while leaving the information about the other in the table.

You can eliminate modification anomalies through normalization – that is, splitting the single
table with rows that have attributes about two entities into two tables, each of which has rows
with attributes that describe a single entity.

You will be able to remove the aromatherapy appointment for guest 1269 without losing the
fact that an aromatherapy session costs $75.00. Similarly, you can now add the fact that
horseback riding costs $190.00 per day to the ACTIVITY-COST table without having to wait
for a guest to sign up for the activity.
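
A tiny Python sketch of this split, using illustrative data (the guest IDs and amounts follow the
example above; the in-memory representation is an assumption made for the example):

# Flat rows: (guest_id, activity, cost) - one relation holding two concepts.
rows = [(2587, "Scuba diving", 250.00), (1269, "Aromatherapy", 75.00)]

# Decompose into two relations, each describing a single entity.
guest_activity = {(g, a) for g, a, c in rows}
activity_cost = {a: c for g, a, c in rows}

# Deleting a sign-up no longer destroys the activity's cost ...
guest_activity.discard((1269, "Aromatherapy"))
print(activity_cost["Aromatherapy"])  # 75.0 is still recorded

# ... and a new activity can be recorded before anyone signs up.
activity_cost["Horseback riding"] = 190.00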
During the development of relational database systems in the 1970s, relational theorists kept
discovering new modification anomalies. Someone would find an anomaly, classify it, and then
figure out a way to prevent it by adding additional design criteria to the definition of a "well-
formed relation". These design criteria are known as normal forms. Not surprisingly, E. F. Codd (of
the 12-rule database definition fame) defined the first, second, and third normal forms (1NF,
2NF, and 3NF).

After Codd postulated 3NF, relational theorists formulated Boyce-Codd normal form (BCNF),
and then fourth normal form (4NF) and fifth normal form (5NF).

First Normal Form:


Normalization is a process by which database designers attempt to eliminate modification
anomalies such as the following:

Deletion anomaly:
The inability to remove a single fact from a table without removing other (unrelated) facts you
want to keep.
Insertion anomaly:
The inability to insert one fact without inserting another (and sometimes unrelated) fact.
Update anomaly:
Changing a fact in one column creates a false fact in another set of columns. Modification
anomalies are a result of functional dependencies among the columns in a row (or tuple, to use
the precise relational database term).
A functional dependency means that if you know the value in one column or set of columns, you
can always determine the value of another. To put the table in first normal form (1NF) you could
break up the student number list in the STUDENTS column of each row such that each row had
only one of the student IDs in the STUDENTS column. Doing so would change the table's
structure so that the combination (CLASS, SECTION, STUDENT) becomes the composite key
for the table, because it makes each row unique and all columns atomic. Now that the table in the
current example is in 1NF, each column has a single, scalar value.
Unfortunately, the table still exhibits modification anomalies:
Deletion anomaly:
If professor SMITH goes to another school and you remove his rows from the table, you also
lose the fact that STUDENTS 1005, 2110 and 3115 are enrolled in a history class.
Insertion anomaly:
If the school wants to add an English class (E100), it cannot do so until a student signs up for the
course (remember, no part of a primary key can have a NULL value).
Update anomaly:
If STUDENT 4587 decides to sign up for the SECTION 1, CS100 CLASS instead of his math
class, updating the CLASS and SECTION columns in the row for STUDENT 4587 to reflect the
change will cause the table to show TEACHER RAWLINS as being in both the MATH and the
COMP-SCI departments.
Thus, 'flattening' a table's columns to put it into first normal form (1NF) does not solve any of
the modification anomalies. All it does is guarantee that the table satisfies the requirements for a
table defined as "relational" and that there are no multi-valued dependencies between the
columns in each row.

Second Normal Form:


The process of normalization involves removing functional dependencies between columns in order
to eliminate the modification anomalies caused by these dependencies.
Putting a table in first normal form (1NF) requires removing all multi-valued dependencies.

When a table is in second normal form, it must be in first normal form (no multi-valued
dependencies) and have no partial key dependencies.

A partial key dependency is a situation in which the value in part of a key can be used to
determine the value of another attribute (column). Thus, a table is in 2NF when the value in all
nonkey columns depends on the entire key. Or, said another way, you cannot determine the value
of any of the columns by using only part of the key. Consider a table with (CLASS, SECTION,
STUDENT) as its primary key. If the university has two rules about taking classes (no student
can sign up for more than one section of the same class, and a student can have only one major),
then the table, while in 1NF, is not in 2NF.
Given the value of (STUDENT, CLASS) you can determine the value of the SECTION, since
no student can sign up for two sections of the same class. Similarly, since students can sign up
for only one major, knowing STUDENT determines the value of MAJOR. In both instances, the
value of a third column can be deduced (or is determined) by the value in a portion of the key
(CLASS, SECTION, STUDENT) that makes each row unique.
To put the table in the current example in 2NF will require that it be split into three tables,
described by:
Courses (Class, Section, Teacher, Department)
PRIMARY KEY (Class, Section)
Enrollment (Student, Class, Section)
PRIMARY KEY (Student, Class)
Students (Student, Major)
PRIMARY KEY (Student)

Unfortunately, putting a table in 2NF does not eliminate modification anomalies.
Suppose, for example, that professor Jones leaves the university. Removing his row from the
COURSES table would eliminate the entire ENGINEERING department, since he is currently
the only professor in the department.
Similarly, if the university wants to add a music department, it cannot do so until it hires a
professor to teach in the department.
Understanding Third Normal Form:
To be in third normal form (3NF), a table must satisfy the requirements for 1NF (no multi-valued
dependencies) and 2NF (all nonkey attributes must depend on the entire key). In addition, a
table in 3NF has no transitive dependencies between nonkey columns.
Given a table with columns (A, B, C), a transitive dependency is one in which A determines B, and
B determines C, and therefore A determines C; or, expressed using relational theory notation:
if A -> B and B -> C, then A -> C.
When a table is in 3NF, the value in every nonkey column of the table can be determined by
using the entire key, and only the entire key. Therefore, given a table in 3NF with columns
(A, B, C), if A is the PRIMARY KEY, you could not use the value of B (a nonkey column) to
determine the value of C (another nonkey column). As such, A determines B (A -> B), and A
determines C (A -> C). However, knowing the value of column B does not tell you the value in
column C; that is, it is not the case that B -> C.
Suppose, for example, that you have a COURSES table with columns and PRIMARY KEY
described by
Courses (Class, Section, Teacher, Department, Department Head)
PRIMARY KEY (Class, Section)
that contains the data:


(-------- A --------)  (B)      (C)        (D)
Class   Section        Teacher  Dept.      Dept. Head
--------------------------------------------------------------------------------
H100    1              Smith    History    Smith
H100    2              Riley    History    Smith
CS100   1              Bowls    Comp.Sci   Peroit
M200    3              Rawlins  Math       Hastings
M200    2              Brown    Math       Hastings
M200    4              Riley    Math       Hastings
E100    1              Jones    Engg.      Jones

Given that a TEACHER can be assigned to only one DEPARTMENT and that a
DEPARTMENT can have only one department head, the table has multiple transitive
dependencies.
For example, the value of TEACHER is dependent on the PRIMARY KEY (CLASS,
SECTION), since a particular SECTION of a particular CLASS can have only one teacher; that
is, A -> B. Moreover, since a TEACHER can be in only one DEPARTMENT, the value in
DEPARTMENT is dependent on the value in TEACHER; that is, B -> C. However, since the
PRIMARY KEY (CLASS, SECTION) determines the value of TEACHER, it also determines
the value of DEPARTMENT; that is, A -> C. Thus, the table exhibits the transitive dependency
in which A -> B and B -> C, therefore A -> C.

The problem with a transitive dependency is that it makes the table subject to the deletion
anomaly. When Smith retires and we remove his row from the table, we lose not only the fact
that Smith taught SECTION 1 of H100, but also the fact that SECTION 1 of H100 was a class
that belonged to the HISTORY department.
To put a table with transitive dependencies between nonkey columns into 3NF requires that the
table be split into multiple tables. To do so for the table in the current example, we would need
to split it into three tables, described by:
Courses (Class, Section, Teacher)
PRIMARY KEY (Class, Section)
Teachers (Teacher, Department)
PRIMARY KEY (Teacher)
Departments (Department, Department Head)
PRIMARY KEY (Department)

After Normalization

3.4 Schema Refinement or Database design:


 Normalisation or Schema Refinement is a technique of organizing the data in the
database. It is a systematic approach of decomposing tables to eliminate data redundancy
and undesirable characteristics like Insertion, Update and Deletion Anomalies.

 The term schema refinement refers to refining the schema by using some technique. The best
technique for schema refinement is decomposition.
 The basic goal of normalisation is to eliminate redundancy.
 Redundancy refers to repetition of the same data, i.e., duplicate copies of the same data
stored in different locations.

Normalization is used mainly for two purposes:
 Eliminating redundant (useless) data.
 Ensuring data dependencies make sense, i.e., data is logically stored.

Anomalies or Problems Faced without Normalisation:


Anomalies refer to the problems that occur in poorly planned, unnormalised databases where all
the data is stored in one table, which is sometimes called a flat file database. Let us consider
such a schema:


SID   Sname   CID   Cname   FEE

S1    A       C1    C       5k

S2    A       C1    C       5k

S1    A       C2    C       10k

S3    B       C2    C       10k

S3    B       C2    JAVA    15k

Primary Key (SID, CID)

Here all the data is stored in a single table, which causes redundancy of data (and hence
anomalies), as SID and Sname are repeated for the same CID.
3.5 OTHER KINDS OF DEPENDENCIES:
Finish-to-Start Dependencies:

The most common type of dependency is the finish-to-start relationship (FS). This relationship
means that the first task, the predecessor, must be finished before the next task, the successor,
can start. On the Gantt chart it is usually represented as follows:

Start-to-Start Dependencies

The next type of dependency is the start-to-start relationship (SS). This relationship means that
the successor task cannot start until the predecessor task starts. On the Gantt chart, it is usually
represented as follows:

Finish-to-Finish Dependencies

The third type of dependency is the finish-to-finish relationship (FF). This relationship means
that the successor task cannot finish until the predecessor task finishes. On the Gantt chart, it is
usually represented as follows:

Start-to-Finish Dependencies

The start-to-finish relationship (SF) is the least common task relationship and means that the
successor cannot finish until the predecessor starts. On the Gantt chart, it is usually represented
as follows:

Variations of Task Dependency Types

Of course, tasks sometimes overlap; this is termed lead (or lead time). Tasks can also be delayed
(for example, to wait while concrete dries), which is called lag (or lag time).

UNIT-IV

TRANSACTION MANAGEMENT

Overview:

In this unit we introduce two topics. The first is concurrency control: the stored data will be
accessed by many users, and if two or more users try to access the same data at the same time,
the problem of data inconsistency may arise; concurrency control methods were invented to solve
this. The second is recovery, which is used to maintain the data without loss in the event of
power failure, software failure, or hardware failure.

Contents:

Concepts of transactions and schedules

Lock based concurrency control

Crash Recovery

Introduction to crash recovery

Log recovery

Check pointing

ARIES

4.1 Transactions

Collections of operations that form a single logical unit of work are called Transactions. A
database system must ensure proper execution of transactions despite failures – either the entire
transaction executes, or none of it does.
4.2 Transaction Concept:
A transaction is a unit of program execution that accesses and possibly updates various data
items. Usually, a transaction is initiated by a user program written in a high-level data
manipulation language or programming language (for example SQL, COBOL, C, C++ or
Java), where it is delimited by statements (or function calls) of the form begin transaction and
end transaction. The transaction consists of all operations executed between the begin
transaction and end transaction.

To ensure integrity of the data, we require that the database system maintain the following
properties of the transaction.

Atomicity:

Either all operations of the transaction are reflected properly in the database, or none are.

Consistency:

Execution of a transaction in isolation ( that is, with no other transaction executing concurrently)
preserves the consistency of the database.

Isolation:

Even though multiple transactions may execute concurrently, the system guarantees that, for
every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti
started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other
transactions executing concurrently in the system.

Durability:

After a transaction completes successfully, the changes it has made to the database persist, even
if there are system failures.
4.3 A Simple Transaction Model:

Transaction state:

In the absence of failures, all transactions complete successfully. However, a transaction may
not always complete its execution successfully; such a transaction is termed aborted. If we are
to ensure the atomicity property, an aborted transaction must have no effect on the state of the
database.

Thus, any changes that the aborted transaction made to the database must be undone. Once the
changes caused by an aborted transaction have been undone, we say that the transaction has been
rolled back. It is part of the responsibility of the recovery scheme to manage transaction aborts.

A transaction that completes its execution successfully is said to be committed. A committed
transaction that has performed updates transforms the database into a new consistent state, which
must persist even if there is a system failure.

Once a transaction has committed, we cannot undo its effects by aborting it. The only way to undo
the effects of a committed transaction is to execute a compensating transaction. For instance, if a
transaction added $20 to an account, the compensating transaction would subtract $20 from the
account. However, it is not always possible to create such a compensating transaction. Therefore,
the responsibility of writing and executing a compensating transaction is left to the user, and is
not handled by the database system. A transaction must be in one of the following states:

Active:
The initial state; the transaction stays in this state while it is executing.
Partially committed:
After the final statement has been executed.

Failed:
After the discovery that normal execution can no longer proceed.

Aborted:
After the transaction has been rolled back and the database has been restored to its state prior to
the start of the transaction.

Committed:
After successful completion

We say that a transaction has committed only if it has entered the committed state. Similarly, we
say that a transaction has aborted only if it has entered the aborted state. A transaction is said
to have terminated if it has either committed or aborted.

A transaction starts in the active state. When it finishes its final statement, it enters the partially
committed state. At this point, the transaction has completed its execution, but it is still possible
that it may have to be aborted, since the actual output may still be temporarily residing in main
memory, and thus a hardware failure may preclude its successful completion.

The database system then writes out enough information to disk that, even in the event of a
failure, the updates performed by the transaction can be recreated when the system restarts after
the failure. When the last of this information is written out, the transaction enters the committed
state.

A transaction enters the failed state after the system determines that the transaction can no longer
proceed with its normal execution (for example, because of hardware or logical errors). Such a
transaction must be rolled back. Then, it enters the aborted state. At this point, the system has
two options.

It can restart the transaction, but only if the transaction was aborted as a result of some
hardware or software error that was not created through the internal logic of the transaction. A
restarted transaction is considered to be a new transaction.

It can kill the transaction. It usually does so because of some internal logical error that can be
corrected only by rewriting the application program, or because the input was bad, or because the
desired data were not found in the database.

We must be cautious when dealing with observable external writes, such as writes to a terminal
or printer. Once such a write has occurred, it cannot be erased, since it may have been seen
external to the database system. Most systems allow such writes to take place only after the
transaction has entered the committed state.


4.4 Storage Structure:

These properties are often called the ACID properties; the acronym is derived from the first letter
of each of the four properties.

Volatile Memory
These are the primary memory devices in the system, and are placed along with the CPU. These
memories can store only a small amount of data, but they are very fast, e.g., main memory, cache
memory etc. These memories cannot endure system crashes; data in these memories will be lost
on failure.
Non-Volatile Memory
These are secondary memories and are huge in size, but slow in processing, e.g., flash memory,
hard disk, magnetic tapes etc. These memories are designed to withstand system crashes.
Stable Memory
This is said to be a third form of memory structure, but it is built from non-volatile memory:
copies of the same data are stored at different places, on different non-volatile devices. This is
because, in case of any crash and data loss, the data can be recovered from the other copies. This
is even helpful if one of the non-volatile memories is lost due to fire or flood: the data can be
recovered from another network location. But there can be failures while taking the backup of the
DB onto the different stable storage devices: the system may fail to transfer all the data
successfully, either partially transferring the data to the remote devices or completely failing to
store the data in stable memory. Hence extra caution has to be taken while taking the backup of
data from one stable memory to another. There are different methods of copying the data. One of
them is to copy the data in two phases: copy the data blocks to the first storage device, and if
that is successful, copy them to the second storage device. The copying is complete only when the
second copy executes successfully. But the second copy may fail to copy all the blocks. In such a
case, each data block in the first copy and second copy would need to be compared for
inconsistency, and verifying each block would be a very costly task, as we may have a huge number
of data blocks. A better way to identify the failed block is to identify the block which was in
progress during the failure, take only this block, compare the data, and correct the mismatches.
Failure Classification
When a transaction is being executed in the system, it may fail to execute due to various reasons:
a bug in a program, an action of the user, or a system crash. These failures can be broadly
classified into three categories.
Transaction Failure: This type of failure affects only a few tables or processes. It is the
condition in which a transaction can no longer continue its execution. The failure can be caused
by the user or by the executing program/transaction. The user may cancel the transaction while it
is executing, by pressing a cancel button or aborting it using DB commands. The transaction may
also fail because of violation of constraints on the tables. It can even fail if there is concurrent
processing of multiple transactions and there is a lack of resources for all of them, or a deadlock
situation. All of these cause the transaction to stop processing in the middle of its execution.
When a transaction fails or stops in the middle, it will have partially changed the DB, and it
needs to be rolled back to the previous consistent state. In the ATM withdrawal example, if the
user cancels his transaction after step (i), the system should be able to stop further processing
of the transaction; if he cancels it after step (ii), the system should be strong enough to update
the balance in his account. The system may also cancel the transaction due to insufficient
balance. In short, the failure can be because of errors in the code (logical errors) or because of
system errors like deadlock or unavailability of system resources to execute the transactions.
System Crash: This can be because of hardware or software failure or because of external
factors like power failure; that is, failure of the system because of a bug in the software or
failure of the system processor. Such a crash mainly affects the data in primary memory. If it
affects only primary memory, the actual data is not really affected, and recovery from the failure
is easy: primary memory is temporary storage and would not yet have updated the actual database,
so the database remains in the consistent state it was in before the transaction. But when
secondary memory crashes, there is a loss of data, and serious action is needed to recover the
lost data, because secondary memory contains the actual DB data. Recovering from such a crash is
a little tedious and requires more effort. The DB recovery system provides strong mechanisms to
recover from a crash and maintain the atomicity of the transactions. In most cases, data in
secondary memory is not affected by this kind of crash, because the database has many integrity
checkpoints to prevent data loss from secondary memory.
Disk Failure: These are issues with hard disks, like formation of bad sectors, disk head crash,
unavailability of the disk, etc. Data can even be lost because of fire, flood, theft, etc. This
mainly affects the secondary memory, where the actual data lies. In these cases, we need
alternative ways of storing the DB: we can create backups of the DB on a regular basis and store
them separately from the memory where the DB is stored, or maintain multiple copies of the DB at
different network locations to recover from failure.

4.5 Transaction Atomicity and Durability:

To gain a better understanding of ACID properties and the need for them, consider a simplified
banking system consisting of several accounts and a set of transactions that access and update
those accounts.

Transactions access data using two operations:

Read(X), which transfers the data item X from the database to a local buffer belonging to the
transaction that executed the read operation.

Write(X), which transfers the data item X from the local buffer of the transaction that
executed the write back to the database.

In a real database system, the write operation does not necessarily result in the immediate update
of the data on the disk; the write operation may be temporarily stored in memory and executed
on the disk later.

For now, however, we shall assume that the write operation updates the database immediately.

Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be
defined as

Ti: read(A);

A := A - 50;

write(A);

read(B);

B := B + 50;

write(B).
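
A runnable sketch of Ti using Python's sqlite3 module; the accounts table and its columns are
assumptions made for this example. Committing makes both writes take effect together, while any
failure in between rolls both back, which is exactly the atomicity requirement discussed below.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 1000), ("B", 2000)])
conn.commit()

try:
    # read(A); A := A - 50; write(A)
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
    # read(B); B := B + 50; write(B)
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()    # both updates become visible and durable together
except Exception:
    conn.rollback()  # on any failure, neither update survives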

Let us now consider each of the ACID requirements.

Consistency:

Execution of a transaction in isolation ( that is, with no other transaction executing concurrently)
preserves the consistency of the database.

The consistency requirement here is that the sum of A and B be unchanged by the execution of
the transaction. Without the consistency requirement, money could be created or destroyed by
the transaction. It can be verified easily that, if the database is consistent before an execution of
the transaction, the database remains consistent after the execution of the transaction.
Ensuring consistency for an individual transaction is the responsibility of the application
programmer who codes the transaction. This task may be facilitated by automatic testing of
integrity constraints.

Atomicity:

Suppose that, just before the execution of transaction Ti, the values of accounts A and B are
$1000 and $2000, respectively.
Now suppose that, during the execution of transaction Ti, a failure occurs that prevents Ti from
completing its execution successfully.

Examples of such failures include power failures, hardware failures, and software errors.

Further, suppose that the failure happened after the write(A) operation but before the write(B)
operation. In this case, the values of accounts A and B reflected in the database are $950 and
$2000. The system destroyed $50 as a result of this failure.
In particular, we note that the sum A + B is no longer preserved. Thus, because of the failure,
the state of the system no longer reflects a real state of the world that the database is supposed
to capture. We term such a state an inconsistent state. We must ensure that such inconsistencies
are not visible in a database system.

Note, however, that the system must at some point be in an inconsistent state. Even if transaction
Ti is executed to completion, there exists a point at which the value of account A is $950 and
the value of account B is $2000, which is clearly an inconsistent state.

This state, however, is eventually replaced by the consistent state where the value of account A
is $950, and the value of account B is $2050.

Thus, if the transaction never started or was guaranteed to complete, such an inconsistent state
would not be visible except during the execution of the transaction.

That is the reason for the atomicity requirement:

If the atomicity property is present, all actions of the transaction are reflected in the database or
none are.

The basic idea behind ensuring atomicity is this:


The database system keeps track ( or disk) of the old values of any data on which a transaction
performs a write, and, if the transaction does not complete its execution, the database system
restores the old values to make it appear as though the transaction never executed.
Ensuring atomicity is the responsibility of the database system itself; specifically, it is handed by
a component called the transaction management component. ]

Serializability:
When multiple transactions are being executed by the operating system in a multiprogramming
environment, there are possibilities that instructions of one transaction are interleaved with
those of some other transaction.

 Schedule − A chronological execution sequence of a transaction is called a schedule. A
schedule can have many transactions in it, each comprising a number of
instructions/tasks.
 Serial Schedule − It is a schedule in which transactions are aligned in such a way that
one transaction is executed first. When the first transaction completes its cycle, then the
next transaction is executed. Transactions are ordered one after the other. This type of
schedule is called a serial schedule, as transactions are executed in a serial manner.

In a multi-transaction environment, serial schedules are considered a benchmark. The
execution sequence of the instructions in a transaction cannot be changed, but two transactions
can have their instructions executed in a random fashion. This execution does no harm if the two
transactions are mutually independent and working on different segments of data; but in case
these two transactions are working on the same data, then the results may vary. This ever-varying
result may bring the database to an inconsistent state.

To resolve this problem, we allow parallel execution of a transaction schedule, if its transactions
are either serializable or have some equivalence relation among them.

Equivalence Schedules
An equivalence schedule can be of the following types −

Result Equivalence
If two schedules produce the same result after execution, they are said to be result equivalent.
They may yield the same result for some value and different results for another set of values.
That's why this equivalence is not generally considered significant.

View Equivalence
Two schedules would be view equivalent if the transactions in both the schedules perform
similar actions in a similar manner.

For example −

 If T reads the initial data in S1, then it also reads the initial data in S2.
 If T reads the value written by J in S1, then it also reads the value written by J in S2.
 If T performs the final write on the data value in S1, then it also performs the final write
on the data value in S2.

Conflict Equivalence
Two operations would be conflicting if they have the following properties −

 Both belong to separate transactions.


 Both access the same data item.
 At least one of them is a "write" operation.
Two schedules having multiple transactions with conflicting operations are said to be conflict
equivalent if and only if −

 Both the schedules contain the same set of Transactions.


 The order of conflicting pairs of operation is maintained in both the schedules.
Note − View equivalent schedules are view serializable and conflict equivalent schedules are
conflict serializable. All conflict serializable schedules are view serializable too.
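
Conflict serializability is commonly tested with a precedence graph: draw an edge Ti -> Tj for
every conflicting pair in which Ti's operation comes first; the schedule is conflict serializable
if and only if the graph has no cycle. A minimal Python sketch, assuming a schedule is given as an
ordered list of (transaction, operation, item) triples (representation chosen for illustration):

def conflict_serializable(schedule):
    # schedule: ordered list of (txn, op, item) with op in {"R", "W"}
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))  # t1's conflicting action precedes t2's
    txns = {t for t, _, _ in schedule}
    # the graph is acyclic iff we can repeatedly remove a node with no in-edges
    while txns:
        source = next((t for t in txns if all(v != t for _, v in edges)), None)
        if source is None:
            return False  # a cycle remains: not conflict serializable
        txns.discard(source)
        edges = {(u, v) for u, v in edges if u != source}
    return True

print(conflict_serializable([("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]))  # False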

4.6 Transaction Isolation and Atomicity

Durability:
Once the execution of the transaction completes successfully, and the user who initiated the
transaction has been notified that the transfer of funds has taken place, it must be the case that no
system failure will result in a loss of data corresponding to this transfer of funds.

The durability property guarantees that, once a transaction completes successfully, all the
updates that it carried out on the data base persist, even if there is a system failure after the
transaction complete execution.

We assume for now that a failure of the computer system may result in loss of data in main
memory, but data written to disk are never lost. We can guarantee durability by ensuring
either that:

 The updates carried out by the transaction have been written to disk before the transaction
completes, or

 Information about the updates carried out by the transaction and written to disk is sufficient
to enable the database to reconstruct the updates when the database system is restarted after the
failure.

Ensuring durability is the responsibility of a component of the database system called the
recovery management component. The transaction management component and the recovery
management component are closely related.

Isolation:
Even if the consistency and atomicity properties are ensured for each transaction, if several
transactions are executed concurrently, their operations may interleave in some undesirable way,
resulting in an inconsistent state.

For example, as we saw earlier, the database is temporarily inconsistent while the transaction to
transfer funds from A to B is executing, with the deducted total written to A and the increased
total yet to be written to B.

If a second concurrently running transaction reads A and B at this intermediate point and
computes A + B it will observe an inconsistent value. Furthermore, if this second transaction
then performs updates on A and B based on the inconsistent values that it read, the database may
be left in an inconsistent state even after both transactions have completed.

A way to avoid the problem of concurrently executing transactions is to execute transactions
serially, that is, one after the other. However, concurrent execution of transactions provides
significant performance benefits.

Other solutions have therefore been developed; they allow multiple transactions to execute
concurrently.
The isolation property of a transaction ensures that the concurrent execution of transactions
results in a system state that is equivalent to a state that could have been obtained had these
transactions executed one at a time in some order.

Ensuring the isolation property is the responsibility of a component of the database system called
the concurrency control component.
4.7 Transaction isolation levels:
Transaction isolation levels are a measure of the extent to which transaction isolation succeeds.
In particular, transaction isolation levels are defined by the presence or absence of the following
phenomena:
Dirty Reads A dirty read occurs when a transaction reads data that has not yet been committed.
For example, suppose transaction 1 updates a row. Transaction 2 reads the updated row before
transaction 1 commits the update. If transaction 1 rolls back the change, transaction 2 will have
read data that is considered never to have existed.
Nonrepeatable Reads A nonrepeatable read occurs when a transaction reads the same row
twice but gets different data each time. For example, suppose transaction 1 reads a row.
Transaction 2 updates or deletes that row and commits the update or delete. If transaction 1
rereads the row, it retrieves different row values or discovers that the row has been deleted.

Phantoms A phantom is a row that matches the search criteria but is not initially seen. For
example, suppose transaction 1 reads a set of rows that satisfy some search criteria. Transaction
2 generates a new row (through either an update or an insert) that matches the search criteria for
transaction 1. If transaction 1 reexecutes the statement that reads the rows, it gets a different set
of rows.
The four transaction isolation levels (as defined by SQL-92) are defined in terms of these
phenomena. In the following table, an "X" marks each phenomenon that can occur.

Transaction isolation level Dirty reads Nonrepeatable reads Phantoms

Read uncommitted X X X

Read committed -- X X

Repeatable read -- -- X

Serializable -- -- --

The following describes simple ways that a DBMS might implement each transaction isolation
level.

Read uncommitted: Transactions are not isolated from each other. If the DBMS supports other
transaction isolation levels, it ignores whatever mechanism it uses to implement those levels. So
that they do not adversely affect other transactions, transactions running at the Read
Uncommitted level are usually read-only.

Read committed: The transaction waits until rows write-locked by other transactions are unlocked;
this prevents it from reading any "dirty" data. The transaction holds a read lock (if it only
reads the row) or write lock (if it updates or deletes the row) on the current row to prevent
other transactions from updating or deleting it. The transaction releases read locks when it moves
off the current row. It holds write locks until it is committed or rolled back.

Repeatable read: The transaction waits until rows write-locked by other transactions are unlocked;
this prevents it from reading any "dirty" data. The transaction holds read locks on all rows it
returns to the application and write locks on all rows it inserts, updates, or deletes. For
example, if the transaction includes the SQL statement SELECT * FROM Orders, the transaction
read-locks rows as the application fetches them. If the transaction includes the SQL statement
DELETE FROM Orders WHERE Status = 'CLOSED', the transaction write-locks rows as it deletes them.
Because other transactions cannot update or delete these rows, the current transaction avoids any
nonrepeatable reads. The transaction releases its locks when it is committed or rolled back.

Serializable: The transaction waits until rows write-locked by other transactions are unlocked;
this prevents it from reading any "dirty" data. The transaction holds a read lock (if it only
reads rows) or write lock (if it can update or delete rows) on the range of rows it affects. For
example, if the transaction includes the SQL statement SELECT * FROM Orders, the range is the
entire Orders table; the transaction read-locks the table and does not allow any new rows to be
inserted into it. If the transaction includes the SQL statement DELETE FROM Orders WHERE
Status = 'CLOSED', the range is all rows with a Status of "CLOSED"; the transaction write-locks
all rows in the Orders table with a Status of "CLOSED" and does not allow any rows to be inserted
or updated such that the resulting row has a Status of "CLOSED". Because other transactions cannot
update or delete the rows in the range, the current transaction avoids any nonrepeatable reads.
Because other transactions cannot insert any rows in the range, the current transaction avoids any
phantoms. The transaction releases its lock when it is committed or rolled back.

It is important to note that the transaction isolation level does not affect a transaction's ability to
see its own changes; transactions can always see any changes they make. For example, a
transaction might consist of two UPDATE statements, the first of which raises the pay of all
employees by 10 percent and the second of which sets the pay of any employees over some
maximum amount to that amount. This succeeds as a single transaction only because the
second UPDATE statement can see the results of the first.
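
Applications usually select one of these levels with the SQL-92 statement SET TRANSACTION
ISOLATION LEVEL. A hedged sketch using Python's DB-API; the driver (psycopg2), the connection
string, and the Orders table are placeholders for whatever server and schema are actually in use:

import psycopg2  # assumed PostgreSQL-style driver; any SQL-92 DBMS is similar

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()
# Must be issued before the transaction's first data-access statement.
cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
cur.execute("SELECT * FROM Orders WHERE Status = 'CLOSED'")
rows = cur.fetchall()
conn.commit()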

4.8 Concurrency Control:

4.8.1 Lock-Based Protocols:

A DBMS must be able to ensure that only serializable, recoverable schedules are allowed, and
that no actions of committed transactions are lost while undoing aborted transactions. A
DBMS typically uses a locking protocol to achieve this. A locking protocol is a set of rules to
be followed by each transaction, in order to ensure that even though actions of several
transactions might be interleaved, the net effect is identical to executing all transactions in
some serial order.
Strict Two-Phase Locking (Strict 2PL):

The most widely used locking protocol, called Strict Two-Phase Locking, or Strict 2PL,
has two rules. The first rule is:

1. If a transaction T wants to read (respectively, modify) an object, it first requests a shared
(respectively, exclusive) lock on the object.

Of course, a transaction that has an exclusive lock can also read the object; an additional shared
lock is not required. A transaction that requests a lock is suspended until the DBMS is able to
grant it the requested lock. The DBMS keeps track of the locks it has granted and ensures that if
a transaction holds an exclusive lock on an object no other transaction holds a shared or
exclusive lock on the same object.

The second rule in Strict 2PL is:

2. All locks held by a transaction are released when the transaction is completed.
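
A minimal in-memory sketch of the lock table such a protocol needs; all names are illustrative,
and a real lock manager would also queue blocked transactions, handle lock upgrades more
carefully, and detect deadlocks:

class LockTable:
    def __init__(self):
        self.locks = {}  # object id -> (mode, set of holding txns)

    def acquire(self, txn, obj, mode):
        """Request a shared ('S') or exclusive ('X') lock; True if granted."""
        if obj not in self.locks:
            self.locks[obj] = (mode, {txn})
            return True
        held_mode, holders = self.locks[obj]
        if mode == "S" and held_mode == "S":
            holders.add(txn)          # shared locks are compatible
            return True
        if holders == {txn}:          # sole holder may keep or upgrade its lock
            self.locks[obj] = ("X" if "X" in (mode, held_mode) else "S", {txn})
            return True
        return False                  # conflict: the caller must suspend txn

    def release_all(self, txn):
        """Strict 2PL: every lock is held until txn commits or aborts."""
        for obj in list(self.locks):
            mode, holders = self.locks[obj]
            holders.discard(txn)
            if not holders:
                del self.locks[obj]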

4.8.2 Multiple-Granularity Locking

Another specialized locking strategy is called multiple-granularity locking, and it allows us to
efficiently set locks on objects that contain other objects. For instance, a database contains
several files, a file is a collection of pages, and a page is a collection of records. A
transaction that expects to access most of the pages in a file should probably set a lock on the
entire file, rather than locking individual pages as and when it needs them. Doing so reduces the
locking overhead considerably. On the other hand, other transactions that require access to parts
of the file (even parts that are not needed by this transaction) are blocked. If a transaction
accesses relatively few pages of the file, it is better to lock only those pages. Similarly, if a
transaction accesses several records on a page, it should lock the entire page, and if it accesses
just a few records, it should lock just those records.

The question to be addressed is how a lock manager can efficiently ensure that a page,
for example, is not locked by a transaction while another transaction holds a conflicting lock on
the file containing the page.

The recovery manager of a DBMS is responsible for ensuring two important properties of
transactions: atomicity and durability. It ensures atomicity by undoing the actions of transactions
that do not commit and durability by making sure that all actions of committed transactions
survive system crashes, (e.g., a core dump caused by a bus error) and media failures (e.g., a
disk is corrupted).

The Log

The log, sometimes called the trail or journal, is a history of actions executed by the DBMS.
Physically, the log is a file of records stored in stable storage, which is assumed to survive
crashes; this durability can be achieved by maintaining two or more copies of the log on
different disks, so that the chance of all copies of the log being simultaneously lost is
negligibly small.

The most recent portion of the log, called the log tail, is kept in main memory and is
periodically forced to stable storage. This way, log records and data records are written to disk
at the same granularity.
Every log record is given a unique id called the log sequence number (LSN). As with
any record id, we can fetch a log record with one disk access given the LSN. Further, LSNs
should be assigned in monotonically increasing order; this property is required for the ARIES
recovery algorithm. If the log is a sequential file, in principle growing indefinitely, the LSN can
simply be the address of the first byte of the log record.

A log record is written for each of the following actions:


Updating a page: After modifying the page, an update type record is appended to the log tail.
The page LSN of the page is then set to the LSN of the update log record.
Commit: When a transaction decides to commit, it force-writes a commit type log record
containing the transaction id. That is, the log record is appended to the log, and the log tail is
written to stable storage, up to and including the commit record.

The transaction is considered to have committed at the instant that its commit log record is
written to stable storage.

Abort: When a transaction is aborted, an abort type log record containing the transaction id is
appended to the log, and Undo is initiated for this transaction.

End: As noted above, when a transaction is aborted or committed, some additional actions must
be taken beyond writing the abort or commit log record. After all these additional steps are
completed, an end type log record containing the transaction id is appended to the log.

Undoing an update: When a transaction is rolled back (because the transaction is aborted, or
during recovery from a crash), its updates are undone. When the action described by an update
log record is undone, a compensation log record, or CLR, is written.

Other Recovery-Related Data Structures

In addition to the log, the following two tables contain important recovery-related information:

Transaction table: This table contains one entry for each active transaction. The entry
contains the transaction id, the status, and a field called lastLSN, which is the LSN of the
most recent log record for this transaction. The status of a transaction can be that it is in
progress, is committed, or is aborted.

Dirty page table: This table contains one entry for each dirty page in the buffer pool, that is,
each page with changes that are not yet reflected on disk. The entry contains a field recLSN,
which is the LSN of the first log record that caused the page to become dirty. Note that this LSN
identifies the earliest log record that might have to be redone for this page during restart from a
crash.
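
A sketch of these structures as plain Python objects; the field names (LSN, lastLSN, recLSN)
follow the text, while the record layout and the toy LSN assignment are assumptions made for the
example:

from dataclasses import dataclass, field

@dataclass
class LogRecord:
    lsn: int              # log sequence number, monotonically increasing
    txn_id: int
    kind: str             # "update", "commit", "abort", "end", or "CLR"
    payload: dict = field(default_factory=dict)

log_tail = []             # most recent portion of the log, still in memory
transaction_table = {}    # txn_id -> {"status": ..., "lastLSN": ...}
dirty_page_table = {}     # page_id -> recLSN of the log record that dirtied it

def append_log(txn_id, kind, **payload):
    lsn = len(log_tail)   # toy scheme: the LSN is the record's position
    log_tail.append(LogRecord(lsn, txn_id, kind, payload))
    entry = transaction_table.setdefault(txn_id, {"status": "in progress"})
    entry["lastLSN"] = lsn
    return lsn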

Checkpoint
A checkpoint is like a snapshot of the DBMS state, and by taking checkpoints periodically, as
we will see, the DBMS can reduce the amount of work to be done during restart in the event of a
subsequent crash.
4.8.3 Timestamp-based Protocols:

The most commonly used concurrency protocol is the timestamp based protocol. This protocol
uses either system time or logical counter as a timestamp.
Lock-based protocols manage the order between the conflicting pairs among transactions at the
time of execution, whereas timestamp-based protocols start working as soon as a transaction is
created.
Every transaction has a timestamp associated with it, and the ordering is determined by the age
of the transaction. A transaction created at 0002 clock time would be older than all other
transactions that come after it. For example, any transaction 'y' entering the system at 0004 is two
seconds younger and the priority would be given to the older one.
In addition, every data item is given the latest read and write-timestamp. This lets the system
know when the last ‘read and write’ operation was performed on the data item.
Timestamp Ordering Protocol
The timestamp-ordering protocol ensures serializability among transactions in their conflicting
read and write operations. This is the responsibility of the protocol system that the conflicting
pair of tasks should be executed according to the timestamp values of the transactions.
 The timestamp of transaction Ti is denoted as TS(Ti).
 Read time-stamp of data-item X is denoted by R-timestamp(X).
 Write time-stamp of data-item X is denoted by W-timestamp(X).
Timestamp ordering protocol works as follows −
 If a transaction Ti issues a read(X) operation −
o If TS(Ti) < W-timestamp(X)
 Operation rejected.
o If TS(Ti) >= W-timestamp(X)
 Operation executed.
o All data-item timestamps updated.
 If a transaction Ti issues a write(X) operation −
o If TS(Ti) < R-timestamp(X)
 Operation rejected.
o If TS(Ti) < W-timestamp(X)
 Operation rejected and Ti rolled back.
o Otherwise, operation executed.

Thomas' Write Rule
The basic protocol states that if TS(Ti) < W-timestamp(X), then the write operation is rejected
and Ti is rolled back. The time-stamp ordering rules can be modified to make the schedule view
serializable: instead of rolling back Ti, the 'write' operation itself is simply ignored.
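
A compact sketch of both sets of rules; each data item carries its two timestamps, and a raised
exception stands for rolling back the issuing transaction (names are illustrative):

class DataItem:
    def __init__(self):
        self.r_ts = 0  # R-timestamp(X): largest TS of a successful read
        self.w_ts = 0  # W-timestamp(X): TS of the last successful write

def read(ts, item):
    if ts < item.w_ts:
        raise RuntimeError("read rejected: roll back the transaction")
    item.r_ts = max(item.r_ts, ts)

def write(ts, item, thomas=False):
    if ts < item.r_ts:
        raise RuntimeError("write rejected: roll back the transaction")
    if ts < item.w_ts:
        if thomas:
            return  # Thomas' write rule: the obsolete write is simply ignored
        raise RuntimeError("write rejected: roll back the transaction")
    item.w_ts = ts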

4.8.4 Validation-Based Protocols:

In cases where a majority of transactions are read-only transactions, the rate of conflicts among
transactions may be low. Thus, many of these transactions, if executed without the supervision of
a concurrency-control scheme, would nevertheless leave the system in a consistent state. A
concurrency-control scheme imposes overhead of code execution and possible delay of
transactions. It may be better to use an alternative scheme that imposes less overhead. A
difficulty in reducing the overhead is that we do not know in advance which transactions will be
involved in a conflict. To gain that knowledge, we need a scheme for monitoring the system.
We assume that each transaction Ti executes in two or three different phases in its lifetime,
depending on whether it is a read-only or an update transaction. The phases are, in order,
1. Read phase. During this phase, the system executes transaction Ti. It reads the values of the
various data items and stores them in variables local to Ti. It performs all write operations on
temporary local variables, without updates of the actual database.
2. Validation phase. Transaction Ti performs a validation test to determine whether it can copy
to the database the temporary local variables that hold the results of write operations without
causing a violation of serializability.
3. Write phase. If transaction Ti succeeds in validation (step 2), then the system applies the
actual updates to the database. Otherwise, the system rolls back Ti.
Each transaction must go through the three phases in the order shown. However, all three phases
of concurrently executing transactions can be interleaved.
To perform the validation test, we need to know when the various phases of transaction Ti took
place. We shall, therefore, associate three different timestamps with transaction Ti:
1. Start(Ti), the time when Ti started its execution.
2. Validation(Ti ), the time when Ti finished its read phase and started its validation phase.

3. Finish(Ti), the time when Ti finished its write phase.
We determine the serializability order by the timestamp-ordering technique, using the value of
the timestamp Validation(Ti). Thus, the value TS(Ti) = Validation(Ti) and, if TS(Tj ) < TS(Tk ),
then any produced schedule must be equivalent to a serial schedule in which
transaction Tj appears before transaction Tk . The reason we have chosen Validation(Ti), rather
than Start(Ti), as the timestamp of transaction Ti is that we can expect faster response time
provided that conflict rates among transactions are indeed low.
The validation test for transaction Tj requires that, for all transactions Ti with TS(Ti) < TS(Tj ),
one of the following two conditions must hold:
1. Finish(Ti) < Start(Tj ). Since Ti completes its execution before Tj started, the serializability
order is indeed maintained.
2. The set of data items written by Ti does not intersect with the set of data items read by Tj ,
and Ti completes its write phase before Tj starts its validation phase
(Start(Tj ) < Finish(Ti) < Validation(Tj )). This condition ensures that

the writes of Ti and Tj do not overlap. Since the writes of Ti do not affect the read of Tj , and
since Tj cannot affect the read of Ti, the serializability order is indeed maintained.
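
A sketch of this validation test; each transaction object is assumed to carry its three
timestamps and its read/write sets (the attribute names are illustrative):

def validate(tj, committed):
    """Return True if Tj may enter its write phase, checking the two
    conditions above against every Ti with TS(Ti) < TS(Tj)."""
    for ti in committed:
        if ti.validation_ts < tj.validation_ts:       # TS(Ti) < TS(Tj)
            if ti.finish_ts < tj.start_ts:            # condition 1
                continue
            if (not (ti.write_set & tj.read_set)      # condition 2
                    and ti.finish_ts < tj.validation_ts):
                continue
            return False  # neither condition holds: Tj must be rolled back
    return True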
As an illustration, consider two transactions T14 and T15, and suppose that TS(T14) < TS(T15).
Then the validation phase succeeds, since the writes to the actual variables are performed only
after the validation phase of T15. Thus, T14 reads the old values of B and A, and this schedule
is serializable.
The validation scheme automatically guards against cascading rollbacks, since the actual writes
take place only after the transaction issuing the write has committed.
However, there is a possibility of starvation of long transactions, due to a sequence of conflicting
short transactions that cause repeated restarts of the long transaction.

To avoid starvation, conflicting transactions must be temporarily blocked, to enable the long
transaction to finish.
This validation scheme is called the optimistic concurrency control scheme since transactions
execute optimistically, assuming they will be able to finish execution and validate at the end. In
contrast, locking and timestamp ordering are pessimistic in that they force a wait or a rollback
whenever a conflict is detected, even though there is a chance that the schedule may be conflict
serializable.

4.9 Multi version Concurrency Control Techniques:

Other protocols for concurrency control keep the old values of a data item when the item is
updated. These are known as multiversion concurrency control, because several versions
(values) of an item are maintained. When a transaction requires access to an item,
an appropriate version is chosen to maintain the serializability of the currently executing
schedule, if possible. The idea is that some read operations that would be rejected in other
techniques can still be accepted by reading an older version of the item to maintain
serializability. When a transaction writes an item, it writes a new version and the old version(s)
of the item are retained. Some multiversion concurrency control algorithms use the concept of
view serializability rather than conflict serializability.

An obvious drawback of multiversion techniques is that more storage is needed to maintain
multiple versions of the database items. However, older versions may have to be maintained
anyway, for example, for recovery purposes. In addition, some database applications require
older versions to be kept to maintain a history of the evolution of data item values. The extreme
case is a temporal database, which keeps track of all changes and the times at
which they occurred. In such cases, there is no additional storage penalty for multiversion
techniques, since older versions are already maintained.

Several multiversion concurrency control schemes have been proposed. We discuss two schemes
here, one based on timestamp ordering and the other based on 2PL. In addition, the validation
concurrency control method discussed earlier also maintains multiple versions.

1. Multi version Technique Based on Timestamp Ordering

In this method, several versions X1, X2, ..., Xk of each data item X are maintained. For each
version, the value of version Xi and the following two timestamps are kept:

read_TS(Xi). The read timestamp of Xi is the largest of all the timestamps of transactions
that have successfully read version Xi.
write_TS(Xi). The write timestamp of Xi is the timestamp of the transaction that wrote the
value of version Xi.

Whenever a transaction T is allowed to execute a write_item(X) operation, a new version Xk+1 of
item X is created, with both the write_TS(Xk+1) and the read_TS(Xk+1) set to TS(T).
Correspondingly, when a transaction T is allowed to read the value of version Xi, the value
of read_TS(Xi) is set to the larger of the current read_TS(Xi) and TS(T).

To ensure serializability, the following rules are used:

If transaction T issues a write_item(X) operation, and version i of X has the
highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and
read_TS(Xi) > TS(T), then abort and roll back transaction T; otherwise, create a new
version Xj of X with read_TS(Xj) = write_TS(Xj) = TS(T).

If transaction T issues a read_item(X) operation, find the version i of X that has the
highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return
the value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the
current read_TS(Xi).

As we can see in case 2, a read_item(X) is always successful, since it finds the appropriate
version Xi to read based on the write_TS of the various existing versions of X. In case 1,
however, transaction T may be aborted and rolled back. This happens if T attempts to write a
version of X that should have been read by another transaction T′ whose timestamp is read_TS(Xi);
however, T′ has already read version Xi, which was written by the transaction with timestamp
equal to write_TS(Xi). If this conflict occurs, T is rolled back; otherwise, a new version of X,
written by transaction T, is created. Notice that if T is rolled back, cascading rollback may occur.
Hence, to ensure recoverability, a transaction T should not be allowed to commit until after all
the transactions that have written some version that T has read have committed.
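The two rules can be collected into a small sketch. The following is a minimal illustration only, assuming each version is kept as a dict with value, read_TS and write_TS fields; the function names and the Abort exception are our own, not part of any particular DBMS.

class Abort(Exception):
    pass

def latest_version(versions, ts):
    # The version with the highest write_TS that is <= TS(T).
    eligible = [v for v in versions if v["write_TS"] <= ts]
    return max(eligible, key=lambda v: v["write_TS"])

def read_item(versions, ts):
    # Rule 2: a read always succeeds against the appropriate version.
    v = latest_version(versions, ts)
    v["read_TS"] = max(v["read_TS"], ts)
    return v["value"]

def write_item(versions, ts, new_value):
    # Rule 1: abort if a younger transaction has already read the
    # version this write would have to supersede.
    v = latest_version(versions, ts)
    if v["read_TS"] > ts:
        raise Abort("a younger transaction already read this version")
    versions.append({"value": new_value, "read_TS": ts, "write_TS": ts})

For example, starting from versions = [{"value": 10, "read_TS": 0, "write_TS": 0}], a reader with timestamp 4 raises the initial version's read_TS to 4, after which a write by a transaction with timestamp 3 is aborted by rule 1.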

2. Multiversion Two-Phase Locking Using Certify Locks:

In this multiple-mode locking scheme, there are three locking modes for an item: read, write,
and certify, instead of just the two modes (read, write) discussed previously. Hence, the state
of LOCK(X) for an item X can be one of read-locked, write-locked, certify-locked, or unlocked.
In the standard locking scheme, with only read and write locks (see Section 22.1.1), a write lock
is an exclusive lock. We can describe the relationship between read and write locks in the
standard scheme by means of the lock compatibility table shown in Figure 22.6(a). An entry
of Yes means that if a transaction T holds the type of lock specified in the column header
on item X and if transaction T′ requests the type of lock specified in the row header on the same
item X, then T′ can obtain the lock because the locking modes are compatible. On the other hand,
an entry of No in the table indicates that the locks are not compatible, so T′ must wait until T
releases the lock.

In the standard locking scheme, once a transaction obtains a write lock on an item, no other
transactions can access that item. The idea behind multiversion 2PL is to allow other
transactions T′ to read an item X while a single transaction T holds a write lock on X. This is
accomplished by allowing two versions for each item X; one version must always have been
written by some committed transaction. The second version X′ is created when a
transaction T acquires a write lock on the item. Other transactions can continue to read
the committed version of X while T holds the write lock. Transaction T can write the value of X′ as
needed, without affecting the value of the committed version X. However, once T is ready to
commit, it must obtain a certify lock on all items that it currently holds write locks on before it
can commit. The certify lock is not compatible with read locks, so the transaction may have to
delay its commit until all its write-locked items are released by any reading transactions in order
to obtain the certify locks. Once the certify locks—which are exclusive locks—are acquired, the
committed version X of the data item is set to the value of version X′, version X′ is discarded, and
the certify locks are then released. The lock compatibility table for this scheme is shown in
Figure 22.6(b).
In this multiversion 2PL scheme, reads can proceed concurrently with a single write operation—
an arrangement not permitted under the standard 2PL schemes. The cost is that a transaction may
have to delay its commit until it obtains exclusive certify locks on all the items it has updated. It
can be shown that this scheme avoids cascading aborts, since transactions are only allowed to
read the version X that was written by a committed transaction.
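Since Figure 22.6(b) is not reproduced in these notes, the compatibility rules just described can be encoded directly, as a minimal sketch of our own: reads coexist with reads and with a single write, while certify locks are compatible with nothing.

# Lock compatibility for multiversion 2PL: an entry is True when a request
# can be granted while another transaction holds the given lock.
COMPATIBLE = {
    ("read",    "read"):    True,
    ("read",    "write"):   True,   # others may still read while T writes X'
    ("read",    "certify"): False,  # commit waits for readers to finish
    ("write",   "read"):    True,
    ("write",   "write"):   False,
    ("write",   "certify"): False,
    ("certify", "read"):    False,  # certify locks are exclusive
    ("certify", "write"):   False,
    ("certify", "certify"): False,
}

def can_grant(held_mode, requested_mode):
    return COMPATIBLE[(held_mode, requested_mode)]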
4.10 Recovery System:
Crash Recovery:
A DBMS is a highly complex system with hundreds of transactions being executed every second.
The durability and robustness of a DBMS depend on its complex architecture and its underlying
hardware and system software. If it fails or crashes in the middle of transactions, the system is
expected to follow some algorithm or technique to recover the lost data.
4.11 Failure Classification:
To see where the problem has occurred, we generalize a failure into various categories, as
follows −
Transaction failure

A transaction has to abort when it fails to execute or when it reaches a point from where it can’t
go any further. This is called transaction failure, where only a few transactions or processes are
affected.
Reasons for a transaction failure could be −
 Logical errors − Where a transaction cannot complete because it has some code error or
any internal error condition.
 System errors − Where the database system itself terminates an active transaction
because the DBMS is not able to execute it, or it has to stop because of some system
condition. For example, in case of deadlock or resource unavailability, the system aborts
an active transaction.
System Crash
There are problems external to the system that may cause the system to stop abruptly and
crash. For example, interruptions in the power supply may cause the failure of underlying
hardware or software.
Examples may include operating system errors.
Disk Failure
In the early days of technology evolution, it was a common problem that hard-disk drives or
storage drives failed frequently.
Disk failures include formation of bad sectors, unreachability to the disk, disk head crash or any
other failure, which destroys all or a part of disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories −
 Volatile storage − As the name suggests, a volatile storage cannot survive system
crashes. Volatile storage devices are placed very close to the CPU; normally they are
embedded onto the chipset itself. For example, main memory and cache memory are
examples of volatile storage. They are fast but can store only a small amount of
information.
 Non-volatile storage − These memories are made to survive system crashes. They are
huge in data storage capacity, but slower in accessibility. Examples may include hard-
disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.

4.12 Recovery and Atomicity:

When a system crashes, it may have several transactions being executed and various files opened
for them to modify the data items. Transactions are made of various operations, which are atomic
in nature. But according to ACID properties of DBMS, atomicity of transactions as a whole must
be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
 It should check the states of all the transactions, which were being executed.
 A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
 It should check whether the transaction can be completed now or it needs to be rolled
back.
 No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction −
 Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
 Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe.
Log-based recovery works as follows −
 The log file is kept on a stable storage media.
 When a transaction enters the system and starts execution, it writes a log about it.
<Tn, Start>
 When the transaction modifies an item X, it writes a log record as follows −
<Tn, X, V1, V2>
It reads: Tn has changed the value of X from V1 to V2.

 When the transaction finishes, it logs −
<Tn, commit>
The database can be modified using two approaches −
 Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.
 Immediate database modification − Each log follows an actual database modification.
That is, the database is modified immediately after every operation.
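As a toy illustration of the records above under immediate database modification, the sketch below appends a <Tn, X, V1, V2> record before the data item itself is changed. The record layout and the in-memory dicts (standing in for the stable log and the data pages) are our own simplifications; a real DBMS would also force the log to stable storage.

log = []          # stands in for the log file on stable storage
database = {}     # stands in for the data pages on disk

def start(tn):
    log.append(("start", tn))                            # <Tn, Start>

def write(tn, x, new_value):
    old_value = database.get(x)
    log.append(("update", tn, x, old_value, new_value))  # <Tn, X, V1, V2>
    database[x] = new_value         # the data changes only AFTER logging

def commit(tn):
    log.append(("commit", tn))                           # <Tn, Commit>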
Recovery with Concurrent Transactions
When more than one transaction are being executed in parallel, the logs are interleaved. At the
time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in a real environment may fill up all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
Checkpoint is a mechanism where all the previous logs are removed from the system and stored
permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner −

 The recovery system reads the logs backwards from the end to the last checkpoint.
 It maintains two lists, an undo-list and a redo-list.
 If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn,
Commit>, it puts the transaction in the redo-list.

 If the recovery system sees a log with <Tn, Start> but no commit or abort log, it
puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list are then redone using their log records, and their logs are saved.
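The backward scan just described can be sketched as follows, reusing the record layout of the earlier logging sketch; the checkpoint's active-transaction list and the actual undoing and redoing of updates are omitted, so this is illustrative only.

def classify(log_records):
    # Read the log backwards (newest record first) down to the checkpoint,
    # splitting transactions into an undo-list and a redo-list.
    redo_list, undo_list = set(), set()
    committed, finished = set(), set()
    for rec in reversed(log_records):
        tag = rec[0]
        if tag == "checkpoint":
            break                          # the scan stops at the last checkpoint
        if tag in ("commit", "abort"):
            finished.add(rec[1])
            if tag == "commit":
                committed.add(rec[1])
        elif tag == "start":
            tn = rec[1]
            if tn in committed:
                redo_list.add(tn)          # <Tn, Start> ... <Tn, Commit>
            elif tn not in finished:
                undo_list.add(tn)          # started but never finished
        # ("update", ...) records are ignored while classifying
    return undo_list, redo_list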

4.13 Recovery Algorithm:


Introduction to ARIES
ARIES is a recovery algorithm that is designed to work with a steal, no-force approach. When
the recovery manager is invoked after a crash, restart proceeds in three phases:
1. Analysis: Identifies dirty pages in the buffer pool and active transactions at the time of the
crash.
2. Redo: Repeats all actions, starting from an appropriate point in the log, and restores the
database state to what it was at the time of the crash.
3. Undo: Undoes the actions of transactions that did not commit, so that the database reflects
only the actions of committed transactions.

There are three main principles behind the ARIES recovery algorithm:

Write-ahead logging: Any change to a database object is first recorded in the log; the record in
the log must be written to stable storage before the change to the database object is written to
disk.

Repeating history during Redo: Upon restart following a crash, ARIES retraces all actions of
the DBMS before the crash and brings the system back to the exact state that it was in at the time
of the crash. Then, it undoes the actions of transactions that were still active at the time of the
crash.
Logging changes during Undo: Changes made to the database while undoing a transaction are
logged in order to ensure that such an action is not repeated in the event of repeated restarts.

4.14 Buffer Management:

A DBMS must manage a huge amount of data, and in the course of processing, the space
required for the blocks of data will often be greater than the memory space available. The
system therefore needs to manage a memory area into which blocks are loaded and from which
they are unloaded. The buffer manager is responsible primarily for the operations involved in
saving and loading blocks. The operations that the buffer manager provides are the following:

* FIX: This command tells the buffer manager to load a block from disk and return a pointer to
the memory where it is loaded. If the block is already in memory, the buffer manager needs only
to return the pointer; otherwise it must load the block from disk and bring it into memory. If the
buffer memory is full, two situations are possible:
o It may be possible to release a portion of memory that is occupied by transactions already
completed. In this case, before freeing the area, its content is written to disk if any block of this
area has been changed.
o The memory may be occupied by transactions still ongoing. In this case, the buffer manager
can work in two ways: in the first mode (STEAL), it frees buffer memory occupied by a
still-active transaction, saving its changes to disk if necessary; in the second mode (NOT
STEAL), the transaction that requested the block is made to wait until memory is freed.

* SET DIRTY: Invoking this command marks a block of memory as amended (modified).

Before introducing the last two commands, note that the DBMS can operate in two modes:
FORCE and NOT FORCE. When working in FORCE mode, the write to disk is performed
synchronously with the commit of a transaction. When the working mode is NOT FORCE, the
write is carried out from time to time in an asynchronous manner. Typically, commercial
databases operate in NOT FORCE mode because this allows an increase in performance: a block
may undergo multiple changes in memory before being saved, and the saves can be performed
when the system is lightly loaded.

* FORCE: This command causes the buffer manager to perform the write synchronously with
the completion (commit) of the transaction.

* FLUSH: This command causes the buffer manager to perform the write when operating in
NOT FORCE mode.
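A minimal sketch of these primitives, assuming a fixed-size pool, a STEAL eviction policy, and dictionaries standing in for the disk and the page table; pin counts, latching, and real I/O are deliberately left out.

class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # maps page_id -> page contents on "disk"
        self.frames = {}              # page_id -> in-memory copy
        self.dirty = set()

    def fix(self, page_id):
        # Return the in-memory copy, loading it from disk if necessary.
        if page_id not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()         # STEAL: may save a victim's changes first
            self.frames[page_id] = self.disk[page_id]
        return self.frames[page_id]

    def set_dirty(self, page_id):
        self.dirty.add(page_id)       # mark the block as amended

    def force(self, page_id):
        # Synchronous save, as used at commit under the FORCE policy.
        self.disk[page_id] = self.frames[page_id]
        self.dirty.discard(page_id)

    def flush(self):
        # Asynchronous bulk save, as used under the NOT FORCE policy.
        for page_id in list(self.dirty):
            self.force(page_id)

    def _evict(self):
        victim = next(iter(self.frames))
        if victim in self.dirty:
            self.force(victim)        # write back before reusing the frame
        del self.frames[victim]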

4.15 Failure with Loss of Nonvolatile Storage:


Until now, we have considered only the case where a failure results in the loss of information
residing in volatile storage while the content of the nonvolatile storage remains intact. Although

failures in which the content of nonvolatile storage is lost are rare, we nevertheless need to be
prepared to deal with this type of failure. In this section, we discuss only disk storage. Our
discussions apply as well to other nonvolatile storage types.
The basic scheme is to dump the entire content of the database to stable storage periodically—
say, once per day. For example, we may dump the database to one or more magnetic tapes. If a
failure occurs that results in the loss of physical database blocks, the system uses the most recent
dump in restoring the database to a previous consistent state. Once this restoration has been
accomplished, the system uses the log to bring the database system to the most recent consistent
state.
More precisely, no transaction may be active during the dump procedure, and a procedure similar
to checkpointing must take place:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all buffer blocks onto the disk.
3. Copy the contents of the database to stable storage.
4. Output a log record <dump> onto the stable storage.

Steps 1, 2, and 4 correspond to the three steps used for checkpoints in Section 17.4.3.
To recover from the loss of nonvolatile storage, the system restores the database to disk by using
the most recent dump. Then, it consults the log and redoes all the transactions that have
committed since the most recent dump occurred. Notice that no undo operations need to be
executed.
A dump of the database contents is also referred to as an archival dump, since we can archive
the dumps and use them later to examine old states of the database.
Dumps of a database and checkpointing of buffers are similar.
The simple dump procedure described here is costly for the following two reasons.
First, the entire database must be copied to stable storage, resulting in considerable data
transfer. Second, since transaction processing is halted during the dump procedure, CPU cycles
are wasted. Fuzzy dump schemes have been developed, which allow transactions to be active
while the dump is in progress. They are similar to fuzzy checkpointing schemes; see the
bibliographical notes for more details.

4.16 Early Lock Release and Logical Undo Operations:

Operations like B+-tree insertions and deletions release locks early. They cannot be undone by
restoring old values (physical undo), since once a lock is released, other transactions may have
updated the B+-tree. Instead, insertions (respectively, deletions) are undone by executing a
deletion (respectively, insertion) operation, known as a logical undo.
 For such operations, undo log records should contain the undo operation to be executed.
Such logging is called logical undo logging, in contrast to physical undo logging; the
operations themselves are called logical operations.
 Other examples: a delete of a tuple, to undo an insert of a tuple, allows early lock release
on space allocation information; subtracting the amount deposited, to undo a deposit,
allows early lock release on a bank balance.
 Redo information is logged physically (that is, the new value for each write) even for
operations with logical undo. Logical redo is very complicated, since the database state
on disk may not be “operation consistent” when recovery starts. Physical redo logging
does not conflict with early lock release.
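The pairing of each operation with its compensating operation, as described above, can be written down directly; the operation names and record layout below are a schematic of our own, not a real logging format.

# Logical undo: each operation is undone by running its inverse,
# not by restoring old bytes (physical undo).
LOGICAL_UNDO = {
    "btree_insert": "btree_delete",   # undo an insertion with a deletion
    "btree_delete": "btree_insert",
    "deposit":      "withdraw",       # undo a deposit by subtracting it
}

def undo_record(op, args):
    # An undo log record stores the compensating operation to execute.
    return (LOGICAL_UNDO[op], args)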

The following actions are taken when recovering from a system crash:
1. (Redo phase): Scan the log forward from the last <checkpoint L> record till the end of
the log.
 Repeat history by physically redoing all updates of all transactions.
 Create an undo-list during the scan as follows:
o The undo-list is set to L initially.
o Whenever <Ti start> is found, Ti is added to the undo-list.
o Whenever <Ti commit> or <Ti abort> is found, Ti is deleted from the undo-list.
This brings the database to its state as of the crash, with committed as well as uncommitted
transactions having been redone. The undo-list now contains transactions that are incomplete,
that is, that have neither committed nor been fully rolled back.
2. (Undo phase): The log is then scanned backward, and the updates of every transaction in the
undo-list are rolled back.

4.17 Remote database systems:

A remote, online, or managed backup service, sometimes marketed as cloud backup or backup-
as-a-service, is a service that provides users with a system for the backup, storage, and recovery
of computer files. Online backup providers are companies that provide this type of service to end
users (or clients). Such backup services are considered a form of cloud computing.
Online backup systems are typically built around a client software program that runs on a
schedule. Some systems run once a day, usually at night while computers aren't in use. Other,
newer cloud backup services run continuously to capture changes to user systems nearly in real
time. The online backup system typically collects, compresses, encrypts, and transfers the data to
the remote backup service provider's servers or off-site hardware.
There are many products on the market – all offering different feature sets, service levels, and
types of encryption. Providers of this type of service frequently target specific market segments.
High-end LAN-based backup systems may offer services such as Active Directory, client remote
control, or open file backups. Consumer online backup companies frequently have beta software
offerings and/or free-trial backup services with fewer live support options.

UNIT-V

Storage and Indexing


Overview:

In this unit we discuss data storage and retrieval. It deals with disk, file, and file system
structure, and with the mapping of relational and object data to a file system. A variety of data
access techniques are presented in this unit, including hashing, B+-tree indices, and grid file
indices. External sorting, which is carried out in secondary memory, is also discussed.

Contents:

File Organisation:

Storage Media
Buffer Management
Record and Page formats
File organizations
Various kinds of indexes and external storing
ISAM
B+ trees
Extendible vs. Linear Hashing

This chapter describes the internals of an RDBMS.

The lowest layer of the software deals with management of space on disk, where the data is to be
stored. Higher layers allocate, de-allocate, read, and write pages through (routines provided by)
this layer, called the disk space manager.

On top of the disk space manager, we have the buffer manager, which partitions the available
main memory into a collection of frames (slots for pages). The purpose of the buffer manager is
to bring pages in from disk to main memory as needed in response to read requests from
transactions.

The next layer includes a variety of software for supporting the concept of a file, which, in a
DBMS, is a collection of pages or a collection of records. This layer typically supports a heap
file, or file of unordered pages, as well as indexes. In addition to keeping track of the pages in a
file, this layer organizes the information within a page.

The code that implements relational operators sits on top of the file and access methods layer.
These operators serve as the building blocks for evaluating queries posed against the data.

When a user issues a query, the query is presented to a query optimizer, which uses information
about how the data is stored to produce an efficient execution plan for evaluating the query. An
execution plan is usually represented as a tree of relational operators, with annotations that
contain additional detailed information about which access methods to use.

Data in a DBMS is stored on storage devices such as disks and tapes; the disk space manager is
responsible for keeping track of available disk space. The file manager, which provides the
abstraction of a file of records to higher levels of DBMS code, issues requests to the disk space
manager to obtain and relinquish space on disk.

When a record is needed for processing, it must be fetched from disk to main memory. The page
on which the record resides is determined by the file manager.

Sometimes, the file manager uses auxiliary data structures to quickly identify the page that
contains a desired record. After identifying the required page, the file manager issues a request
for the page to a layer of DBMS code called the buffer manager. The buffer manager fetches
requested pages from disk into a region of main memory called the buffer pool, and informs the
file manager.

5.1 Overview of Storage and Indexing:

Databases are stored in file formats, which contain records. At physical level, the actual data is
stored in electromagnetic format on some device. These storage devices can be broadly
categorized into three types −

 Primary Storage − The memory storage that is directly accessible to the CPU comes
under this category. CPU's internal memory (registers), fast memory (cache), and main
memory (RAM) are directly accessible to the CPU, as they are all placed on the
motherboard or CPU chipset. This storage is typically very small, ultra-fast, and volatile.
Primary storage requires continuous power supply in order to maintain its state. In case of
a power failure, all its data is lost.
 Secondary Storage − Secondary storage devices are used to store data for future use or as
backup. Secondary storage includes memory devices that are not a part of the CPU
chipset or motherboard, for example, magnetic disks, optical disks (DVD, CD, etc.), hard
disks, flash drives, and magnetic tapes.
 Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since such
storage devices are external to the computer system, they are the slowest in speed. These
storage devices are mostly used to take the back up of an entire system. Optical disks and
magnetic tapes are widely used as tertiary storage.
Memory Hierarchy
A computer system has a well-defined hierarchy of memory. A CPU has direct access to its main
memory as well as its inbuilt registers. Main memory access is obviously slower than the CPU's
speed. To minimize this speed mismatch, cache memory is introduced. Cache memory provides
the fastest access time and it contains data that is most frequently accessed by the CPU.
The memory with the fastest access is the costliest one. Larger storage devices offer slow speed
and they are less expensive, however they can store huge volumes of data as compared to CPU
registers or cache memory.
Magnetic Disks
Hard disk drives are the most common secondary storage devices in present computer systems.
These are called magnetic disks because they use the concept of magnetization to store
information. Hard disks consist of metal disks coated with magnetizable material. These disks
are placed vertically on a spindle. A read/write head moves in between the disks and is used to
magnetize or de-magnetize the spot under it. A magnetized spot can be recognized as 0 (zero) or
1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard disk plate has
many concentric circles on it, called tracks. Every track is further divided into sectors. A sector
on a hard disk typically stores 512 bytes of data.
Redundant Array of Independent Disks
RAID or Redundant Array of Independent Disks, is a technology to connect multiple secondary
storage devices and use them as a single storage media.
RAID consists of an array of disks in which multiple disks are connected together to achieve
different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into blocks and the
blocks are distributed among disks. Each disk receives a block of data to write/read in parallel. It
enhances the speed and performance of the storage device. There is no parity and backup in
Level 0.

RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a copy of
data to all the disks in the array. RAID level 1 is also called mirroring and provides 100%
redundancy in case of a failure.

RAID 2
RAID 2 records Error Correction Code using Hamming distance for its data, striped on different
disks. Like level 0, each data bit in a word is recorded on a separate disk and ECC codes of the
data words are stored on a different set of disks. Due to its complex structure and high cost, RAID 2
is not commercially available.

RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for a data word is stored on a
different disk. This technique makes it possible to recover from single-disk failures.

RAID 4
In this level, an entire block of data is written onto data disks and then the parity is generated and
stored on a different disk. Note that level 3 uses byte-level striping, whereas level 4 uses block-
level striping. Both level 3 and level 4 require at least three disks to implement RAID.

RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits generated for data block
stripe are distributed among all the data disks rather than storing them on a different dedicated
disk.
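The parity used in levels 3 to 6 is typically a bitwise XOR across the data blocks, which lets any single lost block be rebuilt from the survivors. A small demonstration with made-up block values:

from functools import reduce
from operator import xor

# XOR parity: parity = b1 ^ b2 ^ ... ^ bn, so any one missing block can
# be recovered by XOR-ing the parity with the remaining blocks.
blocks = [0b1011, 0b0110, 0b1100]          # three data blocks (example values)
parity = reduce(xor, blocks)               # stored on another disk

lost = blocks[1]                           # suppose the second disk fails
rebuilt = reduce(xor, [blocks[0], blocks[2], parity])
assert rebuilt == lost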

RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated and stored
in distributed fashion among multiple disks. Two parities provide additional fault tolerance. This
level requires at least four disk drives to implement RAID.

5.2 Data on External Storage:
Secondary Storage − Secondary storage devices are used to store data for future use or as
backup. Secondary storage includes memory devices that are not a part of the CPU chipset or
motherboard, for example, magnetic disks, optical disks (DVD, CD, etc.), hard disks, flash
drives, and magnetic tapes.

5.3 File Organizations And Indexing:

File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
File Organization to organize file records −

Heap File Organization


When a file is created using Heap File Organization, the Operating System allocates memory
area to that file without any further accounting details. File records can be placed anywhere in
that memory area. It is the responsibility of the software to manage the records. Heap File does
not support any ordering, sequencing, or indexing on its own.
Sequential File Organization
Every file record contains a data field (attribute) to uniquely identify that record. In sequential
file organization, records are placed in the file in some sequential order based on the unique key
field or search key. Practically, it is not possible to store all the records sequentially in physical
form.
Hash File Organization:Hash File Organization uses Hash function computation on some fields
of the records. The output of the hash function determines the location of disk block where the
records are to be placed.

Clustered File Organization
Clustered file organization is not considered good for large databases. In this mechanism, related
records from one or more relations are kept in the same disk block, that is, the ordering of
records is not based on primary key or search key.
File Operations
Operations on database files can be broadly classified into two categories −
 Update Operations
 Retrieval Operations
 Update operations change the data values by insertion, deletion, or update. Retrieval
operations, on the other hand, do not alter the data but retrieve them after optional
conditional filtering. In both types of operations, selection plays a significant role. Other
than creation and deletion of a file, there could be several operations, which can be done
on files.
 Open − A file can be opened in one of the two modes, read mode or write mode. In
read mode, the operating system does not allow anyone to alter data. In other words, data
is read only. Files opened in read mode can be shared among several entities. Write mode
allows data modification. Files opened in write mode can be read but cannot be shared.
 Locate − Every file has a file pointer, which tells the current position where the data is to
be read or written. This pointer can be adjusted accordingly. Using find (seek) operation,
it can be moved forward or backward.
 Read − By default, when files are opened in read mode, the file pointer points to the
beginning of the file. There are options where the user can tell the operating system
where to locate the file pointer at the time of opening a file. The very next data to the file
pointer is read.
 Write − User can select to open a file in write mode, which enables them to edit its
contents. It can be deletion, insertion, or modification. The file pointer can be located at
the time of opening or can be dynamically changed if the operating system allows to do
so.
 Close − This is the most important operation from the operating system’s point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),

o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process of locating the file
pointer at a desired record inside a file varies based on whether the records are arranged
sequentially or clustered. We know that data is stored in the form of records. Every record has a key field,
which helps it to be recognized uniquely.
Indexing is a data structure technique to efficiently retrieve records from the database files based
on some attributes on which the indexing has been done. Indexing in database systems is similar
to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
 Primary Index − Primary index is defined on an ordered data file. The data file is
ordered on a key field. The key field is generally the primary key of the relation.
 Secondary Index − Secondary index may be generated from a field which is a candidate
key and has a unique value in every record, or a non-key with duplicate values.
 Clustering Index − Clustering index is defined on an ordered data file. The data file is
ordered on a non-key field.
Ordered Indexing is of two types −
 Dense Index
 Sparse Index
Dense Index
In dense index, there is an index record for every search key value in the database. This makes
searching faster but requires more space to store index records itself. Index records contain
search key value and a pointer to the actual record on the disk.

Sparse Index
In sparse index, index records are not created for every search key. An index record here
contains a search key and an actual pointer to the data on the disk. To search a record, we first

proceed to the index record and reach the actual location of the data. If the data we are looking for
is not where we directly reach by following the index, the system starts a sequential search
until the desired data is found.
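A sparse index lookup can be sketched as a binary search over the per-block index entries followed by a short scan inside the chosen block; the toy data layout here is an assumption for illustration.

import bisect

# Sparse index: one (search_key, block_number) entry per block, not per record.
index_keys   = [1, 40, 80]        # smallest key stored in each block
index_blocks = [0, 1, 2]
blocks = [[(1, "a"), (17, "b")], [(40, "c"), (55, "d")], [(80, "e")]]

def lookup(key):
    # Find the last index entry whose key is <= the search key...
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    # ...then scan sequentially inside that block.
    for k, rec in blocks[index_blocks[i]]:
        if k == key:
            return rec
    return None

print(lookup(55))                 # -> "d"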

Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored on the
disk along with the actual database files. As the size of the database grows, so does the size of
the indices. There is an immense need to keep the index records in the main memory so as to
speed up the search operations. If single-level index is used, then a large size index cannot be
kept in memory which leads to multiple disk accesses.

Multi-level Index helps in breaking down the index into several smaller indices in order to make
the outermost level so small that it can be saved in a single disk block, which can easily be
accommodated anywhere in the main memory.

5.4 Comparison Of Three File Organizations:

We compare the costs of some simple operations for three basic file organizations:

Files of randomly ordered records, or heap files;
Files sorted on a sequence of fields, or sorted files;
Files that are hashed on a sequence of fields, or hashed files.

The choice of file organization can have a significant impact on performance. The choice of an
appropriate file organization depends on the following operations.

Scan :

Fetch all records in the file. The pages in the file must be fetched from disk into the buffer pool.
There is also a CPU overhead per record for locating the record on the page (in the pool).

Search with equality selection:

Fetch all records that satisfy an equality selection, for example, “find the Students record for the
student with sid 23.” Pages that contain qualifying records must be fetched from disk, and
qualifying records must be located within retrieved pages.

Search with Range selection ;

Fetch all records that satisfy a range selection, for example, “find all Students records with name
alphabetically after ‘Smith’.”
Insert :
Insert a given record into the file. We must identify the page in the file into which the new record
must be inserted, fetch that page from disk, modify it to include the new record, and then write

back the modified page. Depending on the file organization, we may have to fetch, modify and
write back other pages as well.

Delete :
Delete a record that is specified using its record id (rid). We must identify the page that
contains the record, fetch it from disk, modify it, and write it back. Depending on the file
organization, we may have to fetch, modify, and write back other pages as well.

Heap files :

Files of randomly ordered records are called heap files.


The various operations in heap files are :

Scan :

The cost is B(D+RC) because we must retrieve each of B pages taking time D per page, and for
each page, process R records taking time C per record.

Search with Equality selection:


Suppose that the user knows in advance that exactly one record matches the desired equality
selection, that is, the selection is specified on a candidate key. On average, the user must scan half
the file, assuming that the record exists and the distribution of values in the search field is
uniform.

For each retrieved data page, the user must check all records on the page to see if it is the desired
record. The cost is 0.5B(D+RC). If no record satisfies the selection, then the user must
scan the entire file to verify it.

Search with Range selection :


The entire file must be scanned because qualifying records could appear anywhere in the file,
and does not know how many records exist. The cost is B(D+RC).

Insert: Assume that records are always inserted at the end of the file, so fetch the last page in the
file, add the record, and write the page back. The cost is 2D+C.

Delete:
First find the record, remove the record from the page, and write the modified page back. For
simplicity, we assume that no attempt is made to compact the file to reclaim the free
space created by deletions. The cost is the cost of searching plus C+D.

The record to be deleted is specified using the record id. Since the page id can easily be obtained
from the rid, the user can directly read in the page. The cost of searching is therefore D.

Sorted files :
The files sorted on a sequence of field are known as sorted files.
The various operation of sorted files are

(i) Scan: The cost is B(D+RC) because all pages must be examined; the order in which records
are retrieved corresponds to the sort order.
(ii) Search with equality selection:

Here the assumption is made that the equality selection is specified on the field by which the
file is sorted; if not, the cost is identical to that for a heap file. We can locate the first page
containing the desired record or records, should any qualifying records exist, with a binary search
in log2 B steps. Each step requires a disk I/O and two comparisons. Once the page is known, the
first qualifying record can again be located by a binary search of the page at a cost of
C log2 R. The cost is D log2 B + C log2 R. This is a significant improvement over searching heap files.

(iii) Search with range selection:

Assume that the range selection is on the sort field. The first record that satisfies the
selection is located as it is for a search with equality. Subsequently, data pages are sequentially
retrieved until a record is found that does not satisfy the range selection; this is similar to an
equality search with many qualifying records.

(iv) Insert :

To insert a record preserving the sort order, first find the correct position in the file, add
the record, and then fetch and rewrite all subsequent pages. On average, we assume that the inserted
record belongs in the middle of the file. Thus, we read the latter half of the file and then write it back
after adding the new record. The cost is therefore the cost of searching to find the position of the
new record plus 2 * (0.5B(D+RC)), that is, search cost plus B(D+RC).

(v) Delete :

First search for the record, remove the record from the page, and write the modified page
back. We must also read and write all subsequent pages, because all records that follow the
deleted record must be moved up to compact the free space. The cost is search cost plus
B(D+RC). Given the record id (rid) of the record to delete, we can fetch the page
containing the record directly.

Hashed files :

A hashed file has an associated search key, which is a combination of one or more fields of the
file. It enables us to locate records with a given search key value quickly; for example, “find the
Students record for Joe” can be answered quickly if the file is hashed on the name field.

This organization is called a static hashed file; its main drawback is that long chains of overflow
pages can develop. This can affect performance, because all pages in a bucket have to be
searched.
The various operations of hashed files are ;

Fig: File hashed on age, with index on salary

Scan :

In a hashed file, pages are kept at about 80% occupancy (in order to leave some space for future
insertions and minimize overflow pages as the file expands). This is achieved by adding a new
page to a bucket when each existing page is 80% full, when records are initially organized into a
hashed file structure. Thus the number of pages, and therefore the cost of scanning all the data
pages, is about 1.25 times the cost of scanning an unordered file, that is, 1.25B(D+RC).

Search with Equality selection:

The hash function associated with a hashed file maps a record to a bucket based on the values in
all the search key fields; if the value for anyone of these fields is not specified, we cannot tell
which bucket the record belongs to. Thus if the selection is not an equality condition on all the
search key fields, we have to scan the entire file.
Search with range selection:
The hash structure offers no help at all; even if the range selection is on the search key, the
entire file must be scanned. The cost is 1.25B(D+RC).

Insert :

The appropriate page must be located, modified and then written back. The cost is thus the cost
of search plus C+D.

Delete :

We must search for the record, remove it from the page, and write the modified page back. The
cost is again the cost of search plus C+D (for writing the modified page ).

Choosing a file organization:

Comparing the I/O costs derived above for the three file organizations:

A heap file has good storage efficiency and supports fast scan, insertion, and deletion of
records. However, it is slow for searches.

A sorted file also offers good storage efficiency, but insertion and deletion of records are slow.
It is quick for searches, and in particular, it is the best structure for range selections.

A hashed file does not utilize space quite as well as a sorted file, but insertions and deletions
are fast, and equality selections are very fast. However, the structure offers no support for range
selections, and full file scans are a little slower; the lower space utilization means that files contain
more pages.
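The cost formulas above can be collected in one place and evaluated side by side. The sketch below uses the symbols of the text (B pages, R records per page, D per-page I/O time, C per-record CPU time); the parameter values in the example are invented for illustration.

import math

# Cost formulas from the discussion above.
def heap_costs(B, R, D, C):
    return {"scan": B * (D + R * C),
            "equality": 0.5 * B * (D + R * C),
            "range": B * (D + R * C),
            "insert": 2 * D + C}

def sorted_costs(B, R, D, C):
    return {"scan": B * (D + R * C),
            "equality": D * math.log2(B) + C * math.log2(R)}

def hashed_costs(B, R, D, C):
    return {"scan": 1.25 * B * (D + R * C),
            "range": 1.25 * B * (D + R * C)}

# Example: a 10,000-page file, 100 records/page, 15 ms per I/O and
# 100 ns of CPU time per record (illustrative numbers only).
for name, fn in (("heap", heap_costs), ("sorted", sorted_costs), ("hashed", hashed_costs)):
    print(name, fn(10_000, 100, 0.015, 1e-7))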

5.5 Tree Structured indexing:

5.5.1 INDEXED SEQUENTIAL ACCESS METHOD (ISAM)

The potentially large size of the index file motivates the ISAM idea: build an auxiliary file on
the index file, and so on recursively, until the final auxiliary file fits on one page. This repeated
construction of a one-level index leads to the tree structure illustrated in the figure below. The data
entries of the ISAM index are in the leaf pages of the tree and additional overflow pages that are
chained to some leaf page. In addition, some systems carefully organize the layout of pages so
that page boundaries correspond closely to the physical characteristics of the underlying storage
device. The ISAM structure is completely static and facilitates such low-level optimizations.

Fig: ISAM Index Structure


Each tree node is a disk page, and all the data resides in the leaf pages. This corresponds to an
index that uses Alternative (1) for data entries; we can create an index with Alternative (2) by
storing the data records in a separate file and storing (key, rid) pairs in the leaf pages of the ISAM
index. When the file is created, all leaf pages are allocated sequentially and sorted on the search
key value. The non-leaf level pages are then allocated. If there are several inserts to the file
subsequently, so that more entries are inserted into a leaf than will fit onto a single page,
additional pages are needed, because the index structure is static. These additional pages are
allocated from an overflow area. The allocation of pages is illustrated in the figure below.

Fig: Page allocation in ISAM

5.6 B+ Tree:
A B+ tree is a balanced search tree that follows a multi-level index format. The leaf nodes of
a B+ tree denote actual data pointers. A B+ tree ensures that all leaf nodes remain at the same height,
and is thus balanced. Additionally, the leaf nodes are linked in a linked list; therefore, a B+ tree can
support random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at an equal distance from the root node. A B+ tree is of order n, where n is
fixed for every B+ tree.

Internal nodes −
Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
At most, an internal node can contain n pointers.
Leaf nodes −
Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
 At most, a leaf node can contain n record pointers and n key values.
 Every leaf node contains one block pointer P to point to next leaf node and forms a
linked list.
 B+ Tree Insertion
 B+ trees are filled from bottom and each entry is done at the leaf node.
 If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o ith key is duplicated at the parent of the leaf.
 If a non-leaf node overflows −

o Split node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o Rest of the entries are moved to a new node.
 B+ Tree Deletion
 B+ tree entries are deleted at the leaf nodes.
 The target entry is searched and deleted.
o If it is an internal node, delete and replace with the entry from the left position.
 After deletion, underflow is tested,
o If underflow occurs, distribute the entries from the nodes left to it.
 If distribution is not possible from left, then
o Distribute from the nodes right to it.
 If distribution is not possible from left or from right, then
o Merge the node with left and right to it.
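The split positions ⌊(m+1)/2⌋ and ⌈(m+1)/2⌉ used in the insertion rules above are easy to compute; the helper below is a tiny illustration of just that arithmetic, not a full B+ tree.

def leaf_split_point(m):
    return (m + 1) // 2        # floor((m+1)/2): the first i entries stay in the leaf

def internal_split_point(m):
    return -(-(m + 1) // 2)    # ceil((m+1)/2) without importing math

# For order m = 4: a full leaf keeps its first 2 entries and moves the
# rest; an overflowing internal node keeps entries up to position 3.
print(leaf_split_point(4), internal_split_point(4))   # -> 2 3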

5.7 Hash based Indexing:


 Hash Organization
 Bucket − A hash file stores data in bucket format. Bucket is considered a unit of storage.
A bucket typically stores one complete disk block, which in turn can store one or more
records.
 Hash Function − A hash function, h, is a mapping function that maps all the set of
search-keys K to the address where actual records are placed. It is a function from search
keys to bucket addresses.
 Static Hashing
 In static hashing, when a search-key value is provided, the hash function always
computes the same address. For example, if a mod-4 hash function is used, it generates
only 4 values (0 through 3). The output address is always the same for a given key, and
the number of buckets provided remains unchanged at all times.

 Operation
 Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be stored.
 Bucket address = h(K)
 Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
 Delete − This is simply a search followed by a deletion operation.
 Bucket Overflow
 The condition of bucket-overflow is known as collision. This is a fatal state for any static
hash function. In this case, overflow chaining can be used.
 Overflow Chaining − When buckets are full, a new bucket is allocated for the same hash
result and is linked after the previous one. This mechanism is called Closed Hashing.

 Linear Probing − When a hash function generates an address at which data is already
stored, the next free bucket is allocated to it. This mechanism is called Open Hashing.
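A toy version of static hashing with overflow chaining; the bucket capacity, the mod hash, and the class layout are assumptions made for illustration.

# Static hashing sketch: h(K) = K mod NBUCKETS, with overflow chaining
# (a full bucket grows a linked overflow bucket).
NBUCKETS, CAPACITY = 4, 2

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None          # next bucket in the chain

buckets = [Bucket() for _ in range(NBUCKETS)]

def insert(key, record):
    b = buckets[key % NBUCKETS]       # bucket address = h(K)
    while len(b.records) >= CAPACITY: # collision: follow/extend the chain
        if b.overflow is None:
            b.overflow = Bucket()
        b = b.overflow
    b.records.append((key, record))

def search(key):
    b = buckets[key % NBUCKETS]
    while b is not None:
        for k, r in b.records:
            if k == key:
                return r
        b = b.overflow
    return None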

 Dynamic Hashing
 The problem with static hashing is that it does not expand or shrink dynamically as the
size of the database grows or shrinks. Dynamic hashing provides a mechanism in which
data buckets are added and removed dynamically and on-demand. Dynamic hashing is
also known as extended hashing.
 Hash function, in dynamic hashing, is made to produce a large number of values and only
a few are used initially.

Organization
 The prefix of an entire hash value is taken as a hash index. Only a portion of the hash
value is used for computing bucket addresses. Every hash index has a depth value to
signify how many bits are used for computing a hash function. These bits can address 2^n
buckets. When all these bits are consumed − that is, when all the buckets are full − then
the depth value is increased and twice the number of buckets are allocated.
 Operation
 Querying − Look at the depth value of the hash index and use those bits to compute the
bucket address.
 Update − Perform a query as above and update the data.
 Deletion − Perform a query to locate the desired data and delete the same.
 Insertion − Compute the address of the bucket.
If the bucket is already full:
o Add more buckets.
o Add additional bits to the hash value.
o Re-compute the hash function.
Else:
o Add data to the bucket.
If all the buckets are full, perform the remedies of static hashing.
 Hashing is not favorable when the data is organized in some ordering and the queries
require a range of data. When data is discrete and random, hashing performs best.
Hashing algorithms have higher complexity than indexing. All hash operations are done in
constant time.
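The use of a hash-value prefix as the directory index can be shown in isolation. The sketch below is illustrative only; the choice of CRC-32 as the hash and the omitted directory-doubling logic are our own assumptions.

from zlib import crc32

# Directory index = the first `depth` bits of a 32-bit hash value.
def bucket_index(key, depth):
    h = crc32(key.encode())            # deterministic 32-bit hash (our choice)
    return h >> (32 - depth)           # top `depth` bits address 2**depth buckets

# With depth 2 the directory has 4 entries; increasing the depth to 3
# doubles the directory to 8 entries over the same hash values.
for depth in (2, 3):
    print(depth, [bucket_index(k, depth) for k in ("apple", "pear", "plum")])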

Assignment Questions

UNIT – I

1. Discuss about Data Definition language, Data Manipulation language commands with
example?
2. Elaborate the relational model? Explain about various domain and integrity constraint in
Relational Model with examples?
3. Name the main steps in database design. What is the goal of each step? In which step is
the ER model mainly used?
4. Justify the difference between binary and ternary relationships.
5. Organize the process of evaluating a query using the conceptual evaluation strategy with
an example.

UNIT – II

1. Illustrate different set operations in Relational algebra with an example?


2. Summarize the following fundamental operations of relational algebra:
i) select ii) project iii) rename
3. Classify about nested queries? What are correlated nested queries? How would you use
the operators IN, EXISTS, UNIQUE, ANY, ALL in writing nested queries?
4. Discuss short notes on union, set difference, Cartesian product operations with an
example
5. Define trigger and explain its three parts? Differentiate row level and statement
level triggers?

UNIT – III

1. Demonstrate about functional dependencies? Discuss about second normal form.


2. Estimate the problems related to decomposition
3. Evaluate the dependency preserving property with example.
4. Compare fourth normal form and BCNF.
5. Identify about dependency preserving decomposition.

UNIT – IV

1. Organize a locking protocol? Describe the Strict Two Phase Locking Protocol? What can
you say about the schedules allowed by this protocol?
2. Discuss short notes on : a) Multiple granularity b) Serializability c) Complete schedule d)
Serial Schedule.
3. Experiment the Time Stamp - Based Concurrency Control protocol? How is it used to
ensure serializability?
4. Illustrate a log file? Explain about the check point log based recovery schema for
recovering the database.
5. Discuss the failures that can occur with loss of non-volatile storage.

UNIT – V

1. Illustrate extendable hashing techniques for indexing data records. Consider your class
students data records and roll number as index attribute and show the hash directory.
2. Is disk cylinder a logical concept? Justify your answer.
3. Formulate the performance implications of disk structure? Explain briefly about
redundant arrays of independent disks.
4. Measure the indexing? Explain what are the differences between trees based index and
Hash based index.
5. Justify extendable hashing? How it is different from linear hashing?

Tutorial Problems

Tutorial-1

1. Construct an Entity-Relationship diagram for a online shopping systems such as


Jabong/Flipcart. Quote your assumptions and list the requirements considered by you for
conceptual database design for the above system.
2. Construct an ER diagram? Specify the notations used to indicate various components of
ER diagram.
3. Identify the different types of relationships in ER modeling.
4. Analyze primary key and foreign key constraint. How these constraints are expressed in
SQL?
5. Discuss about query processor and database system structure.
Tutorial -2

1. Consider the following schema to write queries in Domain relational calculus:


Sailor(sid, sname, age, rating)
Boats(bid, bname, bcolor)
Reserves(sid, bid, day)
a) Find the boats reserved by sailor with id 567.
b) Find the names of the sailors who reserved 'red' boats.
c) Find the boats which have at least two reservations by different sailors.
2. Identity the aggregate and comparison operators in SQL? Explain with an example in
detail.
3. Answer each of the following questions briefly. The questions are based on the
following relational schema:
Suppliers(sid: integer, sname:string, address:string)
Parts(pid:integer, pname:string, color:string)
Catalog(sid:integer, pid:integer, cost:real)
a) Find the sids of suppliers who charge more for some part than the average cost of that
part (averaged over all the suppliers who supply that part).
b) For each part, find the sname of the supplier who charges the most for that part.
c) Find the sids of suppliers who supply only red parts.
d) For every supplier that supplies more than 1 part, print the name of the supplier and the
total number of parts that she supplies.

4. Elaborate the Trigger? Explain how to implement Triggers in SQL with example.
5. Discuss the following operators in SQL with examples
i) Some ii) Not In iii) In iv) Except

Tutorial -3

1. Consider a relation R with five attributes ABCDE. You are given the following
dependencies: A->B, BC->E and ED->A
i) List all keys for R
ii) Is R in 3NF? If not, explain why not.
iii) Is R in BCNF? If not, explain why not.
2. Define 1NF, 2NF, 3NF and BCNF. What is the motivation for putting a relation in
BCNF? What is the motivation for 3NF?
3. Construct the closure of F, where F is a set of functional dependencies. Explain computing
F+ with suitable examples.
4. Differentiate between FD and MFD
5. Summarize the problems are caused by redundancy and decomposition of relation.

Tutorial -4

1. Discuss about log? What is log tail? Explain the concept of checkpoint log record.
2. Elaborate to test serializability of a schedule? Explain with an example.
3. Construct the concurrency control using time stamp ordering protocol.
4. Demonstrate ACID properties of transactions.
5. Differentiate transaction rollback and restart recovery.

Tutorial -5

1. Illustrate the indexed data structures? Explain any one of them.


2. Compare heap file organization with hash file organization.
3. Formulate all operations of B+ tree for indexing with suitable example.
4. Organize the cluster index, primary and secondary indexes with examples.
5. Discuss about composite search key? What are the pros and cons of composite search
keys

Important Questions

Unit-1

Explain DBMS? Explain Database system Applications.

Make a comparison between Database system and File system.

Explain storage manager component of Database System structure.

Explain the Database users and user interfaces.

Explain levels of data abstraction.

List and explain the functions of data base administrator

What is an ER diagram? Specify the notations used to indicate various components of ER-
diagram

Explain the Transaction management in a database.

What are the types of languages a database system provides? Explain.

How to specify different constraints in ER diagram with examples.

What is an unsafe query? Give an example and explain why it is important to disallow
such queries?

Explain the Participation Constraints.

List the six design goals for relational database and explain why they are desirable.

A company database needs to store data about employees, departments and children
of employees. Draw an ER diagram that captures the above data.

Discuss aggregation versus ternary Relationships.

Explain conceptual design for large Databases.

Explain how to differentiate attributes in Entity set?

What is a composite attribute? How is it modeled in the ER diagram? Explain with an example.
Compare candidate key, primary key, and super key.

Unit-2

What is a relational database query? Explain with an example.

Relational Calculus is said to be a declarative language, in contrast to algebra, which is a


procedural language. Explain the distinction.

Discuss about Tuple Relational Calculus in detail.

Write the following queries in Tuple Relational Calculus for the following schema:

Sailors (sid: integer, sname: string, rating: integer, age: real)
Boats (bid: integer, bname: string, color: string)
Reserves (sid: integer, bid: integer, day: date)

i. Find the names of sailors who have reserved a red boat.
ii. Find the names of sailors who have reserved at least one boat.
iii. Find the names of sailors who have reserved at least two boats.
iv. Find the names of sailors who have reserved all boats.

Explain various operations in relational algebra with examples.

Compare procedural and non-procedural DMLs.

Explain about Relational Completeness.

Consider the following schema:

Suppliers (sid : integer, sname: string, address: string)

Parts (pid : integer, pname: string, color: string)

Catalog (sid : integer; pid : integer, cost: real)

The key fields are underlined. The catalog relation lists the price
changes for parts by supplies. Write the following queries in SQL.

i. Find the pnames of parts for which there is some supplier.
ii. Find the snames of suppliers who supply every part.
iii. Find the pnames of parts supplied by supplier raghu and no one else.
iv. Find the sids of suppliers who supply only red parts.
(Hedged SQL sketches for queries ii and iv follow below.)
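Hedged SQL sketches for queries ii and iv, assuming standard SQL and the Suppliers/Parts/Catalog schema as given; the nested NOT EXISTS pattern is one common way to express relational division ("every part"):

    -- ii. Suppliers who supply every part:
    --     there must be no part that this supplier does not supply.
    SELECT S.sname
    FROM   Suppliers S
    WHERE  NOT EXISTS (SELECT *
                       FROM   Parts P
                       WHERE  NOT EXISTS (SELECT *
                                          FROM   Catalog C
                                          WHERE  C.sid = S.sid
                                            AND  C.pid = P.pid));

    -- iv. Suppliers who supply only red parts:
    --     none of this supplier's catalog entries may be for a non-red part.
    SELECT DISTINCT C.sid
    FROM   Catalog C
    WHERE  NOT EXISTS (SELECT *
                       FROM   Parts P
                       WHERE  P.pid = C.pid
                         AND  P.color <> 'red');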

Explain in detail the following:
i. Join operation
ii. Nested-loop join
iii. Block nested-loop join

Write the SQL expressions for the following relational database:
Sailors (sailor_id, sailor_name, rating, age)
Reserves (sailor_id, boat_id, day)
Boats (boat_id, boat_name, color)

i. Find the age of the youngest sailor for each rating level.
ii. Find the age of the youngest sailor who is eligible to vote for each rating level with at least two such sailors.
iii. Find the number of reservations for each red boat.
iv. Find the average age of sailors for each rating level that has at least two sailors.
(A hedged sketch of possible answers to i and ii follows below.)
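A minimal SQL sketch of possible answers to queries i and ii, assuming the cleaned-up column names shown in the schema above and a voting age of 18 (the question fixes neither, so both are assumptions):

    -- i. Age of the youngest sailor for each rating level.
    SELECT   S.rating, MIN(S.age) AS youngest
    FROM     Sailors S
    GROUP BY S.rating;

    -- ii. The same, restricted to sailors old enough to vote, and kept
    --     only for rating levels with at least two such sailors.
    SELECT   S.rating, MIN(S.age) AS youngest
    FROM     Sailors S
    WHERE    S.age >= 18
    GROUP BY S.rating
    HAVING   COUNT(*) >= 2;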
What is an outer join? Explain the different types of joins.
What is a trigger and what are its three parts? Explain in detail.
What is a view? Explain views in SQL.
(Hedged outer-join and trigger sketches follow below.)
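Hedged sketches for the outer join and trigger questions above. First, a LEFT OUTER JOIN keeps every sailor and pads unmatched reservation columns with NULLs (column names as assumed in the schema above):

    SELECT S.sailor_name, R.boat_id
    FROM   Sailors S LEFT OUTER JOIN Reserves R
           ON S.sailor_id = R.sailor_id;

Second, a trigger in roughly SQL:1999 style, showing the three classic parts of the event-condition-action model; the trigger name and the default rating are invented for illustration, and real syntax varies across DBMSs:

    CREATE TRIGGER set_default_rating        -- hypothetical name
    AFTER INSERT ON Sailors                  -- event
    REFERENCING NEW ROW AS NewSailor
    FOR EACH ROW
    WHEN (NewSailor.rating IS NULL)          -- condition
      UPDATE Sailors                         -- action
      SET    rating = 1
      WHERE  sailor_id = NewSailor.sailor_id;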
Unit-3

What is normalization? Give the types of normal forms.


What are the advantages of normalized relations over unnormalized relations?
What is redundancy? What are the problems caused by redundancy?
What is a dependency-preserving decomposition?
Explain multivalued dependencies with an example.
Explain lossless-join decomposition.
Consider the relation R(A, B, C, D, E, F) and the FDs

A → BC
C → A
D → E
F → A
E → D

Is the decomposition of R into R1(A, C, D), R2(B, C, D), and R3(E, F, D) lossless? (A hedged worked check follows below.)
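A hedged worked check for the decomposition question above, assuming R is R(A, B, C, D, E, F) (F must belong to R, since it appears in R3 and in the FD F → A). One standard method is the chase: build a table with one row per Ri, put 'a' in the attributes that Ri keeps (each '.' below is a distinct placeholder symbol), and equate symbols whenever two rows agree on the left-hand side of an FD.

          A    B    C    D    E    F
    R1    a    .    a    a    .    .
    R2    .    a    a    a    .    .
    R3    .    .    .    a    a    a

Rows 1 and 2 agree on C, so C → A makes their A values equal ('a'); then A → BC makes their B values equal ('a'). All three rows agree on D, so D → E copies row 3's E = 'a' into rows 1 and 2. At this point rows 1 and 2 hold 'a' on A through E but not on F, row 3 holds 'a' only on D, E, F, and no FD can fire again (no two rows agree on F, and E → D adds nothing new). Since no row ever becomes all 'a', the decomposition is not lossless under this check.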


Explain BCNF with an example.
Explain 3NF and 4NF with examples.
Explain 5NF with examples.

Unit-4

What is a transaction? Explain the states and properties of a transaction.


Explain the timestamp-based protocols.
Discuss how to handle deadlocks.
Explain multiple granularity locking.
Explain read-only, write-only, and read-before-write protocols in serializability.
Describe each of the following locking protocols:
i. Two-phase locking
ii. Conservative two-phase locking
Explain the implementation of atomicity and durability.
Explain the ACID properties of transactions.
Explain the different types of failures.
Explain logical undo logging.
Explain transaction rollback.
Explain Log-Record Buffering in detail.
What are the merits and demerits of using fuzzy dumps for media recovery?
Explain the phases of the ARIES algorithm.
Explain the three main properties of the ARIES algorithm.
What information do the dirty page table and the transaction table contain?
Explain about Buffer Manager in detail.
Describe the shadow paging recovery technique.
Explain the difference between a system crash and a disaster.

Unit-5

Explain the following

a. Cluster indexes
b. Primary and secondary indexes
c. Clustering file organization

Discuss various file organizations.


Write short notes on dense and sparse indices.
Explain about the B+ tree structure in detail with an example
Write short notes on ISAM.
Compare ordered indexing with hashing.
Compare linear hashing with extendible hashing.
Explain about external storage media.
Differentiate between extendible and linear hashing.

Unit wise Objective Questions

Unit-I

Q.1 In the relational modes, cardinality is termed as:

(A) Number of tuples. (B) Number of attributes. (C) Number of tables. (D) Number of constraints.

Q.2 Relational calculus is a

(A) Procedural language. (B) Non-procedural language. (C) Data definition language. (D) High level language.

Q.3 The view of total database content is

(A) Conceptual view. (B) Internal view. (C) External view. (D) Physical view.

Q.4 Cartesian product in relational algebra is

(A) A unary operator. (B) A binary operator. (C) A ternary operator. (D) Not defined.

Ans: Cartesian product in relational algebra is a binary operator (it requires two operands, e.g., P × Q).

Q.5 DML is provided for


(A) Description of the logical structure of a database.
(B) Addition of new structures in the database system.
(C) Manipulation and processing of a database.
(D) Definition of the physical structure of a database system.

Q.6 ‘AS’ clause is used in SQL for

(A) Selection operation. (B) Rename operation. (C) Join operation. (D) Projection operation.

Q.7 ODBC stands for


(A) Object Database Connectivity. (B) Oral Database Connectivity. (C) Oracle Database Connectivity. (D) Open Database Connectivity.

Q.8 Architecture of the database can be viewed as

(A) two levels. (B) four levels. (C) three levels. (D) one level.

Q.9 In a relational model, relations are termed as

(A) Tuples. (B) Attributes. (C) Tables. (D) Rows. Ans: (C)

Q.10 In the architecture of a database system external level is the

(A) physical level. (B) logical level. (C) conceptual level. (D) view level.

Unit-II

Q.1 An entity set that does not have sufficient attributes to form a primary key is a

(A) strong entity set. (B) weak entity set. (C) simple entity set. (D) primary entity set.

Q.2 In a Hierarchical model records are organized as

(A) Graph. (B) List. (C) Links. (D) Tree.

Q.3 In tuple relational calculus, P1 → P2 is equivalent to

(A) ¬P1 ∨ P2 (B) P1 ∨ P2 (C) P1 ∧ P2 (D) ¬P1 ∧ P2

Q.4 The language used in application programs to request data from the DBMS is
referred to as the

(A) DML (B) DDL (C) VDL (D) SDL

Q.5 A logical schema

(A) is the entire database.
(B) is a standard way of organizing information into accessible parts.
(C) describes how data is actually stored on disk.
(D) both (A) and (C).

Q.6 The database environment has all of the following components except:
(A) users. (B) separate files. (C) database. (D) database administrator.

Q.7 The way a particular application views the data from the database that
the application uses is a

(A) module. (B) relational model. (C) schema. (D) sub schema.

Q.8 In an E-R diagram an entity set is represented by a

(A) rectangle. (B) ellipse. (C) diamond box. (D) circle.

Unit-III

Q.1 A report generator is used to

(A) update files. (B) print files on paper. (C) data entry. (D) delete files.

Q.2 The property / properties of a database is / are :

(A) It is an integrated collection of logically related records.
(B) It consolidates separate files into a common pool of data records.
(C) Data stored in a database is independent of the application programs using it.
(D) All of the above.

Q.3 The DBMS language component which can be embedded in a program is

(A) The data definition language (DDL).
(B) The data manipulation language (DML).
(C) The database administrator (DBA).
(D) A query language.

Q.6 Conceptual design

(A) is a documentation technique.
(B) needs data volume and processing frequencies to determine the size of the database.
(C) involves modelling independent of the DBMS.
(D) is designing the relational model.

Q.7 The method in which records are physically stored in a specified order according
to a key field in each record is

(A) hash. (B) direct. (C) sequential. (D) all of the above.

Q.8 A subschema expresses

(A) the logical view. (B) the physical view. (C) the external view. (D) all of the above.

Q.9 Count function in SQL returns the number of


(A) values. (B) distinct values. (C) groups. (D) columns.

Q.10 Which one of the following statements is false?

(A) The data dictionary is normally maintained by the database administrator.
(B) Data elements in the database can be modified by changing the data dictionary.
(C) The data dictionary contains the name and description of each data element.
(D) The data dictionary is a tool used exclusively by the database administrator.

Unit-IV

Q.1 An advantage of the database management approach is

(A) data is dependent on programs.
(B) data redundancy increases.
(C) data is integrated and can be accessed by multiple programs.
(D) none of the above.

Q.2 A DBMS query language is designed to

(A) support end users who use English-like commands.
(B) support in the development of complex applications software.
(C) specify the structure of a database.
(D) all of the above.

Q.3 Transaction processing is associated with everything below except

(A) producing detail, summary, or exception reports.
(B) recording a business activity.
(C) confirming an action or triggering a response.
(D) maintaining data.

Q.4 It is possible to define a schema completely using

(A) VDL and DDL. (B) DDL and DML. (C) SDL and DDL. (D) VDL and DML.

Q.5 The method of access which uses key transformation is known as

(A) direct. (B) hash. (C) random. (D) sequential.

Q.6 Data independence means

(A) data is defined separately and not included in programs.
(B) programs are not dependent on the physical attributes of data.
(C) programs are not dependent on the logical attributes of data.
(D) both (B) and (C).

Q.7 The statement in SQL which allows to change the definition of a table is

(A) Alter. (B) Update. (C) Create. (D) Select.

Q.8 Key to represent relationship between tables is called

(A) Primary key (B) Secondary key (C) Foreign key (D) None of these

Unit-V

Q.1 The file organization that provides very fast access to any arbitrary record of a file is

(A) Ordered file (B) Unordered file (C) Hashed file (D) B-tree

Q.2 DBMS helps achieve

(A) Data independence (B) Centralized control of data (C) Neither (A) nor (B) (D) Both (A) and (B)

Q.3 What is a relationship called when it is maintained between two entities?

(A) Unary (B) Binary (C) Ternary (D) Quaternary

Q.4 Which of the following operations is used if we are interested in only certain columns of a table?

(A) PROJECTION (B) SELECTION (C) UNION (D) JOIN

Long answer questions

II year CSE – II Sem DBMS

Unit-1

1. What is a DBMS? Explain database system applications.

2. Make a comparison between a database system and a file system.

3. Explain the storage manager component of the database system structure.

4. Explain the Database users and user interfaces.

5. Explain levels of data abstraction.

6. List and explain the functions of a database administrator.

7. What is an ER diagram? Specify the notations used to indicate various components of an ER diagram.

8. Explain transaction management in a database.

9. What are the types of languages a database system provides? Explain.

10. How are different constraints specified in an ER diagram? Give examples.

11. What is an unsafe query? Give an example and explain why it is important to disallow such queries.

12. Explain the Participation Constraints.

13. List the six design goals for relational databases and explain why they are desirable.

Unit-2

1. A company database needs to store data about employees, departments and


children of employees. Draw an ER diagram that captures the above data.

2. Discuss aggregation versus ternary Relationships.

3. Explain conceptual design for large Databases.

4. Explain how to differentiate attributes in an entity set.

5. What is a composite attribute? How is it modeled in the ER diagram? Explain with an example.

6. Compare candidate keys, primary keys, and super keys.

7.What is a relational database query? Explain with an example.

8. Relational Calculus is said to be a declarative language, in contrast to relational algebra, which is a procedural language. Explain the distinction.

9. Discuss Tuple Relational Calculus in detail. Write the following queries in Tuple Relational Calculus for the following schema.

Sailors (sid: integer, sname: string, rating: integer, age: real)

Boats (bid: integer, bname: string, color: string)

Reserves (sid: integer, bid: integer, day: date)

i. Find the names of sailors who have reserved a red boat.

ii. Find the names of sailors who have reserved at least one boat.

iii. Find the names of sailors who have reserved at least two boats.

iv. Find the names of sailors who have reserved all boats.

10. Explain various operations in relational algebra with examples. Compare procedural and non-procedural DMLs.

11. Explain Relational Completeness.

12. Consider the following schema:

Suppliers (sid : integer, sname: string, address: string)

Parts (pid : integer, pname: string, color: string)

Catalog (sid : integer; pid : integer, cost: real)

The key fields are underlined. The Catalog relation lists the prices charged for parts by suppliers.
13. Write the following queries in SQL.

i. Find the pnames of parts for which there is some supplier.

ii. Find the snames of suppliers who supply every part.

iii. Find the pnames of parts supplied by supplier raghu and no one else.

iv. Find the sids of suppliers who supply only red parts.

14. Explain in detail the following:
    i. Join operation
    ii. Nested-loop join
    iii. Block nested-loop join

15. Write the SQL expressions for the following relational database:
    Sailors (sailor_id, sailor_name, rating, age)
    Reserves (sailor_id, boat_id, day)
    Boats (boat_id, boat_name, color)

i. Find the age of the youngest sailor for each rating level.

16. Find the age of the youngest sailor who is eligible to vote for each rating level with at least two such sailors.
17. Find the number of reservations for each red boat.
18. Find the average age of sailors for each rating level that has at least two sailors.
19. What is an outer join? Explain the different types of joins.
20. What is a trigger and what are its three parts? Explain in detail.

Unit-3

1. a. What is normalization? Give the types of normal forms.
   b. What are the advantages of normalized relations over unnormalized relations?
2. What is redundancy? What are the problems caused by redundancy?
3. What is a dependency-preserving decomposition?
4. Explain multivalued dependencies with an example.
5. Explain lossless-join decomposition.

6. Consider the relation R(A, B, C, D, E, F) and the FDs A → BC, C → A, D → E, F → A, E → D.

Is the decomposition of R into R1(A, C, D), R2(B, C, D), and R3(E, F, D) lossless?

7. Explain BCNF with an example.

8. Explain 3NF and 4NF with examples.

Unit-4

1. What is a transaction? Explain the states and properties of a transaction.

2. Explain the timestamp-based protocols.

3. Discuss how to handle deadlocks.

4. Explain multiple granularity locking.

5. Explain read-only, write-only, and read-before-write protocols in serializability.

6. Describe each of the following locking protocols:
   i. Two-phase locking
   ii. Conservative two-phase locking

7. Explain the implementation of atomicity and durability.

8. Explain the ACID properties of transactions.

9. Explain the different types of failures. Explain logical undo logging. Explain transaction rollback.

10. Explain log-record buffering in detail.

11. What are the merits and demerits of using fuzzy dumps for media recovery?

12. a. Explain the phases of the ARIES algorithm.
    b. Explain the three main properties of the ARIES algorithm.

13. What information do the dirty page table and the transaction table contain?

14. Explain about the buffer manager in detail.

15. Describe the shadow paging recovery technique.

16. Explain the difference between a system crash and a disaster.

Unit-5

1. Explain the following

a. Cluster indexes
b. Primary and secondary indexes
c. Clustering file organization

2. Discuss various file organizations.


3. Write short notes on dense and sparse indices.
4. Explain about the B+ tree structure in detail with an example
5. Write short notes on ISAM.
6. Compare ordered indexing with hashing.
7. Compare linear hashing with extendible hashing.
8. Explain about external storage media.

Sample Mid Paper
II B.Tech II Sem CSE Database Management Systems I Mid Question Paper

PART-A
1. a) List the responsibilities of a DBA.
b) Write brief notes on views.
c) List the primitive operations in relational algebra.
d) What is meant by nested queries?
e) What is a trigger? What is an active database?

PART-B

2) Explain the different types of relationships in E-R modeling?


(or)
3) Name the main steps in database design. What is the goal of each step? In which step is the E-R model mainly used?

4) What are integrity constraints? How are these constraints expressed in SQL?
(or)
5) Explain the operations of relational algebra. What are aggregate operations and logical operators in SQL?

6) Describe DDL and DML commands with syntax and examples.
(or)
7) What is normalization? Explain the 1NF, 2NF, and 3NF normal forms with examples.

University Question papers of previous years

Code No: 114CQ R13
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
B.Tech II Year II Semester Examinations, May - 2016
DATABASE MANAGEMENT SYSTEMS
(Common to CSE, IT)
Time: 3 Hours Max. Marks: 75
Note: This question paper contains two parts A and B.
Part A is compulsory which carries 25 marks. Answer all questions in Part A.
Part B consists of 5 Units. Answer any one full question from each unit.
Each question carries 10 marks and may have a, b, c as sub questions.

PART - A(25 Marks)


1.a) Discuss about DDL. [2]
b) Write brief notes on altering tables and views. [3]
c) Describe about outer join. [2]
d) What is meant by nested queries? [3]
e) What is second normal form? [2]
f) Describe the inclusion dependencies. [3]
g) What is meant by buffer management? [2]
h) What is meant by remote backup system? [3]
i) Discuss about primary indexes. [2]
j) What is meant by linear hashing? [3]

PART - B (50 Marks)


2. Explain the relational database architecture. [10]
OR
3. State and explain various features of E-R Models. [10]

4. Explain Tuple relational calculus. [10]


OR
5. Discuss about domain relational calculus. [10]

6. What is meant by functional dependencies? Discuss about second normal form. [10]
OR
7. Explain fourth normal form and BCNF. [10]

8. What is meant by concurrency control? [10]


OR
9. Discuss about failure with loss of nonvolatile storage. [10]

10. What is meant by extendable hashing? How is it different from linear hashing? [10]
OR
11. What are the indexed data structures? Explain any one of them. [10]

REFERENCES

Reference Text Books:

1. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Tata McGraw-Hill, 3rd Edition.

2. Database System Concepts, Silberschatz, Korth, McGraw-Hill, 5th Edition.

3. Database Systems: Design, Implementation, and Management, Peter Rob & Carlos Coronel, 7th Edition.

4. Fundamentals of Database Systems, Elmasri & Navathe, Pearson Education.

5. Introduction to Database Systems, C.J. Date, Pearson Education.

Websites:-

http://en.wikipedia.org/wiki/Database_management_system

https://www.tutorialspoint.com/dbms

http://helpingnotes.com/notes/msc_notes/dbms_notes/

http://www.geeksforgeeks.org

Journals:-

1. Specifying Integrity Constraints in a Network DBMS, N. Prakash, N. Parimala, and N. Bolloju.

2. Design and Implementation of a Relational DBMS for Microcomputers, F. Cesarini and G. Soda.

