DBMS Digital Notes
OVERVIEW:
Unit – 1 provides a general overview of the nature and purpose of database systems. It also
explains how the concept of a database system has developed, what the common features of
database systems are, what a database system does for the user, and how a database system
interfaces with operating systems. This unit is motivational, historical, and explanatory in nature.
It also gives details about the entity-relationship (E-R) model, which provides a high-level view of design issues.
To design a database we need to follow a proper approach, and that approach is called a data model. So we see
how to use the E-R model to design the database.
CONTENTS:
Introduction to database systems
File systems Vs. DBMS
Various data models
Levels of abstraction
Database languages
Structure of DBMS
1.1 Database Management System (DBMS) and Its Applications:
A database management system is a computerized record-keeping system. It is a repository or a
container for a collection of computerized data files. The overall purpose of a DBMS is to allow
users to define, store, retrieve and update the information contained in the database on demand.
Information can be anything that is of significance to an individual or organization.
Databases touch all aspects of our lives. Some of the major areas of application are as
follows:
1. Banking
2. Airlines
3. Universities
4. Manufacturing and selling
5. Human resources
stored in various files. Before the advent of the DBMS, organizations typically stored
information using such file systems.
Ex: Using COBOL we can maintain several files (collections of records); to access those files we
have to go through the application programs that were written for creating the files, updating
them, and inserting records.
Instance: The collection of information stored in the database at a particular moment is called an
instance of the database.
Schema: The database schema is the skeleton structure that represents the logical view of the entire
database. It describes how the data is organized and how the relations among the data are associated.
Data independence:
The ability to modify a schema definition at one level without affecting the schema definition at the
next higher level is called data independence. There are two kinds: physical data independence and
logical data independence.
Data models:
Underlying the structure of a database is the data model: a collection of conceptual tools for
describing data, data relationships, and data semantics.
In contrast to object-based models, record-based models are used both to specify the overall logical
structure of the database and to provide a higher-level description of the implementation.
Commonly used data models include:
E-R Model
Relational model
Network model
Hierarchical model
Computation purposes include conditional or iterative statements that are supported by high-
level programming languages. Many DBMSs allow the data sublanguage to be embedded in a
high-level programming language such as Fortran, C, C++, Java, or Visual Basic. Here, the
high-level language is sometimes referred to as the host language, as it acts as a host for
the data sublanguage. To compile the embedded file, the commands in the data sublanguage are first
detached from the host-language program and substituted by function calls. The pre-
processed file is then compiled, placed in an object module, linked with a DBMS-
specific library containing the replaced functions, and executed as required. Most
data sublanguages also provide non-embedded or interactive commands that can be entered
directly at a terminal.
Data Definition Language (DDL) statements are used to define the database structure or
schema. DDL is the type of language that allows the DBA or user to describe and name the entities,
attributes, and relationships that are required for the application, along with any associated
integrity and security constraints. The following tasks come under DDL:
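The list itself is not reproduced in these notes; the usual DDL statements are CREATE, ALTER, DROP, TRUNCATE, and RENAME. As a minimal sketch (the table and column names here are purely illustrative):
CREATE TABLE department (dept_id INTEGER, dept_name VARCHAR(30));
ALTER TABLE department ADD COLUMN budget REAL;
DROP TABLE department;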
Data Manipulation Language (DML): a language that offers a set of operations to support the
fundamental data manipulation operations on the data held in the database. DML statements are
used to manage data within schema objects. The following tasks come under DML:
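Again the list is not reproduced here; the typical DML statements are SELECT, INSERT, UPDATE, and DELETE. A minimal sketch using the same illustrative department table as above:
INSERT INTO department (dept_id, dept_name) VALUES (10, 'Sales');
UPDATE department SET dept_name = 'Marketing' WHERE dept_id = 10;
DELETE FROM department WHERE dept_id = 10;
SELECT dept_id, dept_name FROM department;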
Data Control Language:
There are two further forms of database sublanguage. The Data Control Language (DCL) is
used to control privileges in the database. To perform any operation in the database, such as
creating tables, sequences or views, we need privileges. Privileges are of two types:
System – creating a session, creating a table, etc. are types of system privileges.
Object – any command or query that works on tables comes under object privileges.
DCL is used to define two commands. These are:
Grant – gives a user access privileges to the database.
Revoke – takes back permissions from a user.
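As a brief illustration (the user name and table here are hypothetical):
GRANT SELECT, INSERT ON department TO clerk1;
REVOKE INSERT ON department FROM clerk1;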
1.6 Database Design:
Database design is the process of producing a detailed data model of a database. This data
model contains all the needed logical and physical design choices and physical storage
parameters needed to generate a design in a data definition language, which can then be used to
create a database.
(2) Main memory: Main memory is used to store the data being operated on, and all operations
are done in main memory. It is usually too small and too expensive to hold a whole database.
In addition, in case of a system crash or power failure, the contents of main memory are lost.
(3) Flash memory: Unlike main memory, data in flash memory survives a power failure.
Although reading data from flash memory is as fast as reading from main memory,
writing data to it is more complex: the data must be erased before writing, and the number of
times flash memory can be erased is limited.
(4) Magnetic-disk storage: This is the main method of storing data for a long time. Usually the
whole database is stored on magnetic disk. Data must be read into main memory to be operated on,
and the results must be written back to disk.
DATA QUERYING:
Queries are the primary mechanism for retrieving information from a database and consist of
questions presented to the database in a predefined format. Many database management systems
use the Structured Query Language (SQL) standard query format.
Choosing parameters from a menu: In this method, the database system presents a list
of parameters from which you can choose. This is perhaps the easiest way to pose a query
because the menus guide you, but it is also the least flexible.
Query by example (QBE): In this method, the system presents a blank record and lets
you specify the fields and values that define the query.
Query language: Many database systems require you to make requests for information in
the form of a stylized query that must be written in a special query language. This is the
most complex method because it forces you to learn a specialized language, but it is also
the most powerful.
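For instance, a request posed in SQL (the Students table here is only an illustration; SQL itself is covered later in these notes) might look like:
SELECT name FROM Students WHERE age < 18;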
1.8 TRANSACTION MANAGEMENT:
A transaction is a small unit of a program, and it may contain several low-level tasks.
A transaction in a database system must maintain Atomicity, Consistency, Isolation, and
Durability.
ACID Properties
A transaction in a database system must maintain certain properties in order to
ensure the accuracy of its completeness and data integrity. These properties are referred to as the
ACID properties and are described below:
Atomicity: Although a transaction involves several low-level operations, this property
states that a transaction must be treated as an atomic unit; that is, either all of its
operations are executed or none are. There must be no state in the database where a transaction
is left partially completed. The database state should be defined either as it was before the
execution of the transaction or as it is after the execution/abortion/failure of the transaction.
Consistency: This property states that after the transaction is finished, the database must
remain in a consistent state. There must not be any possibility that some data is
incorrectly affected by the execution of the transaction. If the database was in a consistent
state before the execution of the transaction, it must remain in a consistent state after the
execution of the transaction.
Durability: This property states that all updates made on the database will
persist even if the system fails and restarts. If a transaction writes or updates some data in
the database and commits, that data will always be there in the database. If the transaction
commits but the data is not yet written to disk when the system fails, that data will be updated
once the system comes up.
Isolation: In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that all the transactions will
be carried out and executed as if each were the only transaction in the system. No transaction
will affect the existence of any other transaction.
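The sketch below, which assumes a hypothetical accounts table (the exact statement for starting a transaction varies by DBMS), shows how these properties appear in SQL: the two updates form one atomic unit, so either both balances change or neither does.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1; -- debit one account
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2; -- credit the other
COMMIT; -- on commit the changes become durable; a failure before COMMIT rolls both back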
1.9 Structure of a DBMS:
1.10 DATA MINING AND INFORMATION RETRIEVAL:
Information Retrieval - the ability to query a computer system to return relevant results.
The most widely used example is the Google web search engine.
Data Mining - the ability to retrieve information from one or more data sources in order to
combine it, cluster it, visualize it and discover patterns in the data.
Big Data - the ability to manipulate huge volumes of data (that far exceed the capacity of a
single machine) in order to perform data mining techniques on that data.
Text/data mining currently involves analyzing a large collection of often unrelated digital items
in a systematic way to discover previously unknown facts, which might take the form of
relationships or patterns that are buried deep in an extensive collection. These relationships
would be extremely difficult, if not impossible, to discover using traditional manual-based search
and browse techniques. Both text and data mining build on the corpus of past publications and
build not so much on the shoulders of giants as on the breadth of past published knowledge and
accumulated mass wisdom.
RDBMS vs. ORDBMS
RDBMS is the Relational Database Management System; ORDBMS is the Object-Relational Database Management System.
RDBMS is based on the Relational Data Model; ORDBMS is based on the Object Data Model (ODM).
RDBMS is the dominant model; ORDBMS is gaining popularity.
RDBMS supports a small, fixed collection of data types (e.g. integers, dates, strings), which has proven adequate for traditional application domains such as administrative data processing; ORDBMS builds on both object-oriented database systems and relational database systems and is aimed at application domains where complex objects play a central role.
1.12 Database users and Administrators:
Database users are the ones who really use and take the benefits of the database. There are
different types of users depending on their needs and the way they access the database.
Database Users:
Application Programmers - They are the developers who interact with the database by means
of DML queries. These DML queries are written in application programs in languages like C, C++, Java,
Pascal, etc. The queries are converted into object code to communicate with the database. For
example, writing a C program to generate a report of employees who work in a particular
department will involve a query to fetch the data from the database; it will include an embedded SQL
query in the C program.
Sophisticated Users - They are database developers who write SQL queries to
select/insert/delete/update data. They do not use any application programs to access the
database; they interact with it directly by means of a query language like SQL. These
users are typically scientists, engineers, and analysts who have thoroughly studied SQL and the DBMS
and apply the concepts to their requirements. In short, this category includes designers and
developers of DBMS and SQL.
Specialized Users - These are also sophisticated users, but they write special database
application programs. They are the developers who develop complex programs for specific
requirements.
Stand-alone Users - These users have a stand-alone database for their personal use. Such
databases use ready-made database packages that provide menus and graphical
interfaces.
Naive Users - These are the users who use existing applications to interact with the database.
For example, online library systems, ticket booking systems, ATMs, etc. have existing
applications, and users use them to interact with the database to fulfil their requests.
Database Administrators:
The life cycle of a database starts from designing and implementing it, through to administering it. A database
for any kind of requirement needs to be designed properly so that it works without any
issues. Once the design is complete, it needs to be installed. Once this step is complete, users
start using the database. The database grows as the data in it grows. When the
database becomes huge, its performance degrades, and accessing the data
becomes a challenge. There may also be unused space in the database, making it unnecessarily
large. This administration and maintenance of the database is taken care of by the Database Administrator (DBA).
The DBA has many responsibilities; a well-performing database is in the hands of the DBA.
Installing and upgrading the DBMS servers: - The DBA is responsible for installing a new DBMS
server for new projects. He is also responsible for upgrading these servers as new
versions come to market or as requirements change. If an upgrade of an existing server fails,
he should be able to revert the changes back to the older version, keeping the
DBMS working. He is also responsible for applying service packs, hot fixes, and patches to the
DBMS servers.
Design and implementation: - Designing and implementing the database is also the DBA's
responsibility. He should be able to decide on proper memory management, file organization, error
handling, log maintenance, etc. for the database.
Performance tuning: - Since the database is huge and has lots of tables, data, constraints
and indices, there will be variations in performance from time to time. Also, because of
design issues or data growth, the database may not work as expected. It is the responsibility of the
DBA to tune the database performance and to make sure all the queries and
programs run in a fraction of a second.
Migrate database servers: - Sometimes, users using Oracle would like to shift to SQL Server or
Netezza. It is the responsibility of the DBA to make sure that the migration happens without any failure,
and that there is no data loss.
Backup and Recovery: - Proper backup and recovery programs need to be developed by the DBA
and maintained by him. This is one of the main responsibilities of the DBA. Data/objects
should be backed up regularly so that if there is any crash, they can be recovered without much
effort and data loss.
Security: - DBA is responsible for creating various database users and roles, and giving them
different levels of access rights.
Documentation: - The DBA should properly document all his activities so that if he quits or
a new DBA comes in, the newcomer can understand the database without much effort. He
should document all his installation, backup, recovery, and security methods, and keep
various reports about database performance.
A Database Management System allows a person to organize, store, and retrieve data from a
computer. It is a way of communicating with a computer’s “stored memory.” In the very early
years of computers, “punch cards” were used for input, output, and data storage. Punch cards
offered a fast way to enter data, and to retrieve it. Herman Hollerith is given credit for adapting
the punch cards used for weaving looms to act as the memory for a mechanical tabulating
machine, in 1890. Much later, databases came along.
Databases (or DBs) have played a very important part in the recent evolution of computers. The
first computer programs were developed in the early 1950s, and focused almost completely on
coding languages and algorithms. At the time, computers were basically giant calculators and
data (names, phone numbers) was considered the leftovers of processing information. Computers
were just starting to become commercially available, and when business people started using
them for real-world purposes, this leftover data suddenly became important.
By the mid-1960s, as computers developed speed and flexibility, and started becoming popular,
many kinds of general use database systems became available. As a result, customers demanded
a standard be developed, in turn leading to Bachman forming the Database Task Group. This
group took responsibility for the design and standardization of database facilities within the Common
Business Oriented Language (COBOL). The Database Task Group presented this standard in
1971, which also came to be known as the “CODASYL approach.”
The CODASYL approach was a very complicated system and required substantial training. It
depended on a “manual” navigation technique using a linked data set, which formed a large
network. Searching for records could be accomplished by one of three techniques:
common place. Unstructured data is both non-relational and schema-less, and Relational
Database Management Systems simply were not designed to handle this kind of data.
NoSQL
NoSQL (“Not only” Structured Query Language) came about as a response to the Internet and
the need for faster speed and the processing of unstructured data. Generally speaking, NoSQL
databases are preferable in certain use cases to relational databases because of their speed and
flexibility. The NoSQL model is non-relational and uses a “distributed” database system. This
non-relational system is fast, uses an ad-hoc method of organizing data, and processes high-
volumes of different kinds of data.
“Not only” does it handle structured and unstructured data, it can also process unstructured Big
Data, very quickly. The widespread use of NoSQL can be connected to the services offered by
Twitter, LinkedIn, Facebook, and Google. Each of these organizations stores and processes colossal
amounts of unstructured data. These are the advantages NoSQL has over SQL and RDBMS
systems:
Higher scalability
A distributed computing system
Lower costs
A flexible schema
Can process unstructured and semi-structured data
Has no complex relationship
Unfortunately, NoSQL does come with some problems. Some NoSQL databases can be quite
resource intensive, demanding high RAM and CPU allocations. It can also be difficult to find
tech support if your open source NoSQL system goes down.
NoSQL Data Distribution
Hardware can fail, but NoSQL databases are designed with a distribution architecture that
includes redundant backup storage of both data and function. It does this by using multiple nodes
(database servers). If one, or more, of the nodes goes down, the other nodes can continue with
normal operations and suffer no data loss. When used correctly, NoSQL databases can provide
high performance at an extremely large scale, and never shut down. In general, there are four
kinds of NoSQL databases, with each having specific qualities and characteristics.
Document Stores
A Document Store (often called a document-oriented database), manages, stores, and retrieves
semi-structured data (also known as document-oriented information). Documents can be
described as independent units that improve performance and make it easier to spread data across
a number of servers. Document Stores typically come with a powerful query engine and indexing
controls that make queries fast and easy. Examples of Document Stores are MongoDB
and Amazon DynamoDB.
Document-oriented databases store all information for a given “object” within the database, and
each object in storage can be quite different from the others. This makes it easier for mapping
objects to the database and makes document storage for web programming applications very
attractive. (An “object” is a set of relationships. An article object could be related to a tag [an
object], a category [another object], or a comment [another object].)
Column Stores
A DBMS using columns is quite different from traditional relational database systems. It stores
data as portions of columns, instead of as rows. The change in focus, from row to a column, lets
column databases maximize their performance when large amounts of data are stored in a single
column. This strength can be extended to data warehouses and CRM applications. Examples of
column-style databases include Cloudera, Cassandra, and HBase (Hadoop based).
Key-value Stores
A key-value pair database is useful for shopping cart data or storing user profiles. All access to
the database is done using a primary key. Typically, there is no fixed schema or data model. The
key can be identified by using a random lump of data. Key-value stores “are not” useful when
there are complex relationships between data elements or when data needs to be queried by other
than the primary key. Examples of key-value stores are: Riak, Berkeley DB, and Aerospike.
An element can be any single “named” unit of stored data that might, or might not, contain other
data components.
Graph Stores
A graph database differs from relational databases, and from other NoSQL databases, by storing data relationships as
actual relationships. This type of storage for relationship data results in fewer disconnects
between an evolving schema and the actual database. It has interconnected elements, using an
undetermined number of relationships between them. Examples of graph databases
are Neo4j, GraphBase, and Titan.
Polyglot Persistence
Polyglot Persistence is a spin-off of “polyglot programming,” a concept developed in 2006 by
Neal Ford. The original idea promoted applications be written using a mix of languages, with the
understanding that a specific language may solve a certain kind of problem easily, while another
language would have difficulties. Different languages are suitable for tackling different
problems.
Many NoSQL systems run on nodes and large clusters. This allows for significant scalability and
redundant backups of data on each node. Using different technologies at each node supports a
philosophy of Polyglot Persistence. This means “storing” data on multiple technologies with the
understanding certain technologies will solve one kind of problem easily, while others will not.
An application communicating with different database management technologies uses each for
the best fit in achieving the end goal.
The database design process can be divided into six steps. The ER model is most relevant to the
first three steps.
Requirements Analysis:
The very first step in designing a database application is to understand what data is to be stored
in the database, what applications must be built on top of it, and what operations are most
frequent and subject to performance requirements. In other words, we must find out what the
users want from the database.
Conceptual Database Design:
The information gathered in the requirements analysis step is used to develop a high-level description
of the data to be stored in the database, along with the constraints that are known to hold over this data.
This step is often carried out using the ER model, or a similar high-level data model.
Logical Database Design:
We must choose a DBMS to implement our database design, and convert the conceptual
database design into a database schema in the data model of the chosen DBMS.
Schema Refinement:
The fourth step in database design is to analyse the collection of relations in our relational
database schema to identify potential problems, and to refine it. In contrast to the requirements
analysis and conceptual design steps, which are essentially subjective, schema refinement can be
guided by some elegant and powerful theory.
Physical Database Design:
In this step we must consider typical expected workloads that our database must support and
further refine the database design to ensure that it meets desired performance criteria. This step
may simply involve building indexes on some tables and clustering some tables, or it may
involve a substantial redesign of parts of the database schema obtained from the earlier design
steps.
Security Design:
In this step, we identify different user groups and different roles played by various users (Eg : the
development team for a product, the customer support representatives, the product manager ).
For each role and user group, we must identify the parts of the database that they must be able to
access and the parts of the database that they should not be allowed to access, and take steps to
ensure that they can access only the former.
The entity relationship (E-R) data model is based on a perception of a real world that consists of
a set of basic objects called entities, and of relationships among these objects.
Rectangles- which represent entity sets
Ellipse-which represent attributes
Diamonds-which represent relationship sets
Lines-which link attributes to entity sets and entity sets to relationship sets
Double ellipses-which represent multivalued attributes
Double lines- which indicate total participation of an entity in a relationship set
The appropriate mapping cardinality for a particular relationship set is obviously dependent on
the real world situation that is being modeled by the relationship set. The overall logical structure
of a database can be expressed graphically by an E-R diagram, which is built up from the
following components.
Entity: An entity is a real-world object or concept which is distinguishable from other objects. It
may be something tangible, such as a particular student or building. It may also be somewhat
more conceptual, such as CS A-341, or an email address.
Attributes: These are used to describe a particular entity (e.g. name, SS#, height).
Domain: Each attribute comes from a specified domain (e.g., name may be a 20 character string;
SS# is a nine-digit integer)
Entity set: a collection of similar entities (i.e., those which are distinguished using the same set
of attributes). As an example, I may be an entity, whereas Faculty might be an entity set to which
I belong. Note that entity sets need not be disjoint. I may also be a member of Staff or of Softball
Players.
Key: a minimal set of attributes for an entity set, such that each entity in the set can be uniquely
identified. In some cases, there may be a single attribute (such as SS#) which serves as a key, but
in some models you might need multiple attributes as a key ("Bob from Accounting"). There
may be several possible candidate keys. We will generally designate one such key as
the primary key.
ER diagrams:
It is often helpful to visualize an ER model via a diagram. There are many variant conventions
for such diagrams; we will adapt the one used in the text.
Diagram conventions
ER Model
The entity relationship model defines the conceptual view of a database. It works with real-world
entities and the associations among them. At the view level, the ER model is considered a good
choice for designing databases.
Entity
An entity is a real-world thing, either animate or inanimate, that can be easily identified and distinguished.
For example, in a school database, students, teachers, classes and courses offered can be considered
entities. All entities have some attributes or properties that give them their identity.
An entity set is a collection of similar types of entities. An entity set may contain entities whose
attributes share similar values. For example, a Students set may contain all the students of a
school; likewise, a Teachers set may contain all the teachers of a school from all faculties. Entity
sets need not be disjoint.
Attributes
Entities are represented by means of their properties, called attributes. All attributes have values.
For example, a student entity may have name, class, age as attributes.
There exists a domain or range of values that can be assigned to attributes. For example, a
student's name cannot be a numeric value. It has to be alphabetic. A student's age cannot be
negative, etc.
Types of Attributes
Simple attribute
Simple attributes are atomic values, which cannot be divided further. For example, student's
phone-number is an atomic value of 10 digits.
Composite attribute
Composite attributes are made of more than one simple attribute. For example, a student's
complete name may have first_name and last_name.
Derived attribute
Derived attributes are attributes that do not exist physically in the database; their values are
derived from other attributes present in the database. For example, average_salary in a
department should not be stored directly, since it can be derived. As another example, age can
be derived from date_of_birth.
Single-valued attribute
Single-valued attributes contain only a single value. For example: Social_Security_Number.
Multi-valued attribute
Multi-valued attributes may contain more than one value. For example, a person can have more
than one phone number, email address, etc.
o Super Key: A set of attributes (one or more) that collectively identifies an entity in an
entity set.
o Candidate Key: A minimal super key is called a candidate key; that is, a super key for which
no proper subset is also a super key. An entity set may have more than one candidate key.
o Primary Key: One of the candidate keys, chosen by the database designer to
uniquely identify entities in the entity set.
Relationship
The association among entities is called a relationship. For example, an employee entity has the relation
works_at with a department. Another example is a student who enrolls in a course. Here,
Works_at and Enrolls are called relationships.
Relationship Set
A set of relationships of a similar type is called a relationship set. Like entities, a relationship too can have
attributes. These attributes are called descriptive attributes.
Degree of Relationship
The number of participating entity sets in a relationship defines the degree of the relationship.
o Binary = degree 2
o Ternary = degree 3
o n-ary = degree n
Mapping Cardinalities
Cardinality defines the number of entities in one entity set which can be associated to the
number of entities of other set via relationship set.
o One-to-one: one entity from entity set A can be associated with at most one entity of
entity set B and vice versa.
o One-to-many: One entity from entity set A can be associated with more than one entity
of entity set B, but an entity from entity set B can be associated with at most one entity of A.
o Many-to-one: More than one entity from entity set A can be associated with at most
one entity of entity set B, but an entity from entity set B can be associated with more
than one entity from entity set A.
o Many-to-many: One entity from A can be associated with more than one entity from B
and vice versa.
Ternary Relationship Set
A relationship set need not be an association of precisely two entities;
it can involve three or more when applicable. Here is another example from the text, in which a
store has multiple locations.
A relationship might associate several entities from the same underlying entity set, such
as in the following example, Reports_To. In this case, an additional role indicator (e.g.,
"supervisor") is used in the diagram to further distinguish the two similar entities.
If you took a 'snapshot' of the relationship set at some instant in time, we would call this
an instance.
This type of constraint is called a key constraint. It is represented in the ER diagrams by
drawing an arrow from an entity set E to a relationship set R when each entity in an instance of E
appears in at most one relationship in (a corresponding instance of) R.
If both entity sets of a relationship set have key constraints, we would call this a "one-to-one"
relationship set. In general, note that key constraints can apply to relationships between more
than two entities, as in the following example.
Participation Constraints
Recall that a key constraint requires that each entity of a set be required to participate in at most
one relationship. Dual to this, we may ask whether each entity of a set be required to participate
in at least one relationship.
If this is required, we call this a total participation constraint; otherwise the participation
is partial. In our ER diagrams, we will represent a total participation constraint by using
a thick line.
Weak Entities
There are times you might wish to define an entity set even though its attributes do not formally
contain a key (recall the definition for a key).
Usually, this is the case only because the information represented in such an entity set is only
interesting when combined, through an identifying relationship set, with another entity set we
call the identifying owner.
We will call such a set a weak entity set, and insist on the following:
The weak entity set must exhibit a key constraint with respect to the identifying
relationship set.
The weak entity set must have total participation in the identifying relationship set.
Together, this assures us that we can uniquely identify each entity from the weak set by
considering the primary key of its identifying owner together with a partial key from the weak
entity.
In our ER diagrams, we will represent a weak entity set by outlining the entity and the
identifying relationship set with dark lines. The required key constraint and total participation are
diagrammed with our existing conventions. We underline the partial key with a dotted line.
Class Hierarchies
Furthermore, we can impose additional constraints on such subclassing. By default, we will
assume that two subclasses of an entity set are disjoint. However, if we wish to allow an entity to
lie in more than one such subclass, we will specify an overlap constraint (e.g. "Contract_Emps
OVERLAPS Senior_Emps").
Dually, we can ask whether every entity in a superclass is required to lie in (at least) one
subclass. By default we will assume not, but we can specify a covering constraint if desired
(e.g. "Motorboats AND Cars COVER Motor_Vehicles").
Aggregation
Thus far, we have defined relationships to be associations between two or more entities.
However, it sometimes seems desirable to define a new relationship which associates some entity
with some other existing relationship. To do this, we will introduce a new feature to our model
called aggregation. We identify an existing relationship set by enclosing it in a larger dashed
box, and then we will allow it to participate in another relationship set.
A motivating example follows:
It is most important to recognize that there is more than one way to model a given situation. Our
next goal is to start to compare the pros and cons of common choices.
Consider the scenario in which we want to add address information to the Employees entity set. We
might choose to add a single attribute address to the entity set. Alternatively, we could introduce
a new entity set, Addresses, and then a relationship associating employees with addresses. What
are the pros and cons?
Adding a new entity set makes the model more complex, so it should only be done when there is a need for
the complexity. For example, if some employees have multiple addresses to be associated with them, then the
more complex model is needed. Also, representing addresses as a separate entity would allow a
further breakdown, for example by zip code or city.
What if we wanted to modify the Works_In relationship to have both a start and an end date, rather
than just a start date? We could add one new attribute for the end date; alternatively, we could
create a new entity set Duration which represents intervals, and then the Works_In relationship
can be made ternary (associating an employee, a department and an interval). What are the pros
and cons?
If the duration is described through descriptive attributes, only a single such duration can be
modeled. That is, we could not express an employment history involving someone who left the
department yet later returned.
Consider a situation in which a manager controls several departments. Let's presume that a
company budgets a certain amount (budget) for each department. Yet it also wants managers to
have access to some discretionary budget (dbudget). There are two corporate models. A
discretionary budget may be created for each individual department; alternatively, there may be a
discretionary budget for each manager, to be used as she desires.
Which scenario is represented by the following ER diagram? If you want the alternate
interpretation, how would you adjust the model?
Dependents is a weak entity set, and each dependent entity is uniquely identified by
taking pname in conjunction with the policyid of a policy entity (which, intuitively, covers the
given dependent).
The best way to model this is to switch away from the ternary relationship set, and instead use
two distinct binary relationship sets.
If we did not need the until or since attributes, we could model the identical setting
using the following ternary relationship:
Let's compare these two models. What if we wanted to add an additional constraint to
each, that each sponsorship (of a project by a department) be monitored by at most one
employee? Can you add this constraint to either of the above models?
The main construct for representing data in the relational model is a relation. A relation
consists of a relation schema and a relation instance. The relation instance is a table, and the
relation schema describes the column heads for the table. We first describe the relation schema
and then the relation instance. The schema specifies the relation’s name, the name of each field
(or column, or attribute), and the domain of each field. A domain is referred to in a relation
schema by the domain name and has a set of associated values.
Eg:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
This says, for instance, that the field named sid has a domain named string. The set of
values associated with domain string is the set of all character strings.
An instance of a relation is a set of tuples, also called records, in which each tuple has the
same number of fields as the relation schema. A relation instance can be thought of as a
table in which each tuple is a row, and all rows have the same number of fields.
A relation schema specifies the domain of each field or column in the relation instance. These
domain constraints in the schema specify an important condition that we want each instance of
the relation to satisfy: The values that appear in a column must be drawn from the domain
associated with that column. Thus, the domain of a field is essentially the type of that field, in
programming language terms, and restricts the values that can appear in the field.
Domain constraints are so fundamental in the relational model that we will henceforth consider
only relation instances that satisfy them; therefore, relation instance means relation instance that
satisfies the domain constraints in the relation schema.
The degree, also called arity, of a relation is the number of fields. The cardinality of a relation
instance is the number of tuples in it. In Figure 3.1, the degree of the relation (the number of
columns) is five, and the cardinality of this instance is six.
A relational database is a collection of relations with distinct relation names. The relational
database schema is the collection of schemas for the relations in the database.
Creating and Modifying Relations
The SQL-92 language standard uses the word table to denote relation, and we will often
follow this convention when discussing SQL. The subset of SQL that supports the creation,
deletion, and modification of tables is called the Data Definition Language (DDL).
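The CREATE TABLE statement that the notes refer to at this point is not reproduced; a sketch consistent with the Students schema used throughout this section (the example values are illustrative) would be:
CREATE TABLE Students (sid CHAR(20), name CHAR(30), login CHAR(20), age INTEGER, gpa REAL)
Tuples can then be added with INSERT, for example:
INSERT INTO Students (sid, name, login, age, gpa) VALUES (53688, 'Smith', 'smith@ee', 18, 3.2)
Existing rows can be modified with UPDATE, as in the following statement: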
UPDATE Students S SET S.age = S.age + 1, S.gpa = S.gpa - 1 WHERE S.sid = 53688
Relational Model – Constraints
Domain Constraints:A relation schema specifies the domain of each field in the relation
instance. These domain constraints in the schema specify the condition that each
instance of the relation has to satisfy: The values that appear in a column must be drawn
from the domain associated with that column. Thus, the domain of a field is essentially
the type of that field.
Key Constraints
A Key Constraint is a statement that a certain minimal subset of the fields of a relation is a
unique identifier for a tuple.
Super Key:An attribute, or set of attributes, that uniquely identifies a tuple within a
relation.However, a super key may contain additional attributes that are not necessary for
a unique identification.
Example: The customer_id of the relation customer is sufficient to distinguish one tuple
from other. Thus,customer_id is a super key. Similarly, the combination
of customer_id and customer_name is a super key for the relation customer. Here
the customer_name is not a super key, because several people may have the same
name. We are often interested in super keys for which no proper subset is a super key.
Such minimal super keys are called candidate keys.
Candidate Key:A super key such that no proper subset is a super key within the
relation.There are two parts of the candidate key definition:
o Two distinct tuples in a legal instance cannot have identical values in all the fields
of a key
o No subset of the set of fields in a candidate key is a unique identifier for a tuple. A
relation may have several candidate keys.
Primary Key:The candidate key that is selected to identify tuples uniquely within the
relation. Out of all the available candidate keys, a database designer can identify
a primary key. The candidate keys that are not selected as the primary key are called
as alternate keys.
Example: For the student relation, we can choose student_id as the primary key.
Foreign Key:Foreign keys represent the relationships between tables. A foreign key is a
column (or a group of columns) whose values are derived from the primary key of some
other table.The table in which foreign key is defined is called a Foreign table or Details
table. The table that defines the primary key and is referenced by the foreign key is called
the Primary table or Master table.
General Constraints
Domain, primary key, and foreign key constraints are considered to be a fundamental part of the
relational data model. Sometimes, however, it is necessary to specify more general constraints.
Example: we may require that student ages be within a certain range of values. Given such an
IC, the DBMS rejects inserts and updates that violate the constraint.
Current database systems support such general constraints in the form of table
constraints and assertions. Table constraints are associated with a single table and checked
whenever that table is modified. In contrast, assertions involve several tables and are checked
whenever any of these tables is modified.
Example of a table constraint, which ensures that the salary of an employee is always above 1000:
CREATE TABLE employee (eid integer, ename varchar2(20), salary real,
CHECK(salary>1000));
Example of an assertion, which enforces the constraint that the number of boats plus the number of
sailors should be less than 100:
CREATE ASSERTION smallClub CHECK ((SELECT COUNT (S.sid) FROM Sailors S) +
(SELECT COUNT (B.bid) FROM Boats B) < 100);
The referential integrity constraint states that if a relation refers to a key attribute of a different or
the same relation, that key element must exist.
CREATE TABLE Students ( sid CHAR(20), name CHAR(30), login CHAR(20), age
INTEGER, gpa REAL, UNIQUE (name, age), CONSTRAINT StudentsKey PRIMARY KEY
(sid) )
Foreign Key Constraints
Sometimes the information stored in a relation is linked to the information stored in another
relation. If one of the relations is modified, the other must be checked, and perhaps modified, to
keep the data consistent. An IC involving both relations must be specified if a DBMS is to make
such checks. The most common IC involving two relations is a foreign key constraint.
Enrolled(sid: string, cid: string, grade: string)
To ensure that only bonafide students can enroll in courses, any value that appears in the sid field
of an instance of the Enrolled relation should also appear in the sid field of some tuple in the
Students relation. The sid field of Enrolled is called a foreign key and refers to Students. The
foreign key in the referencing relation (Enrolled, in our example) must match the primary key of
the referenced relation (Students), i.e., it must have the same number of columns and compatible
data types, although the column names can be different.
CREATE TABLE Enrolled ( sid CHAR(20), cid CHAR(20), grade CHAR(10), PRIMARY KEY
(sid, cid), FOREIGN KEY (sid) REFERENCES Students )
Consider the instance S1 of Students shown in Figure 3.1. The following insertion violates the
primary key constraint because there is already a tuple with the sid 53688, and it will be rejected
by the DBMS:
INSERT INTO Students (sid, name, login, age, gpa) VALUES (53688, ‘Mike’, ‘mike@ee’, 17,
3.4)
The following insertion violates the constraint that the primary key cannot contain null:
INSERT INTO Students (sid, name, login, age, gpa) VALUES (null, ‘Mike’, ‘mike@ee’, 17,
3.4)
1.18 Querying Relational Data
A relational database query is a question about the data, and the answer consists of a new
relation containing the result. For example, we might want to find all students younger than 18 or
all students enrolled in Reggae203.
A query language is a specialized language for writing queries.
SQL is the most popular commercial query language for a relational DBMS. Consider the
instance of the Students relation shown in Figure 3.1. We can retrieve rows corresponding to
students who are younger than 18 with the following SQL query:
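The query itself is not shown in the notes; a statement matching the description that follows would be:
SELECT * FROM Students S WHERE S.age < 18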
The symbol * means that we retain all fields of selected tuples in the result. The condition S.age
< 18 in the WHERE clause specifies that we want to select only tuples in which the age field has a
value less than 18.
1. Entities and Simple Attributes:
An entity type within ER diagram is turned into a table. You may preferably keep the same name
for the entity or give it a sensible name but avoid DBMS reserved words as well as avoid the use
of special characters.Each attribute turns into a column (attribute) in the table. The key attribute
of the entity is the primary key of the table which is usually underlined. It can be composite if
required but can never be null.
It is highly recommended that every table should start with its primary key attribute
conventionally named as TablenameID.
The initial relational schema is expressed in the following format, writing the table names with
the attribute list inside parentheses, as shown below for
Persons( personid , name, lastname, email )
Persons and Phones are tables; name and lastname are table columns (attributes).
2. Multi-valued Attributes
If you have a multi-valued attribute, take the attribute and turn it into a new entity or table of its
own. Then make a 1:N relationship between the new entity and the existing one. In simple words:
1. Create a table for the attribute.
2. Add the primary (id) column of the parent entity as a foreign key within the new table, as shown below:
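The figure referred to above is not reproduced here; a sketch of the result, using a hypothetical multi-valued phone attribute of Persons, would be:
Persons( personid , name, lastname, email )
Phones( phoneid , phoneNumber , personid )
Here personid in Phones is a foreign key referencing the primary key of Persons, giving the 1:N relationship described above.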
3. 1:1 Relationships
To keep it simple, and for better performance at data retrieval, I would personally
recommend using attributes to represent such a relationship. For instance, let us consider the case
where the Person has, or optionally has, one wife. You can place the primary key of the wife
within the Persons table; in this case we call it a foreign key, as shown below.
It should convert to :
Persons( personid , name, lastname, email )
House ( houseid , num , address, personid)
5. N:N Relationships
We normally use a separate table to express this type of relationship. The same applies to N-ary
relationships in ER diagrams. For instance, a Person can live or work in many countries, and
a country can have many people. To express this relationship within a relational schema we use a
separate table, as shown below:
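The original figure is not reproduced; a sketch of the resulting schema for the Person/Country example (attribute names are illustrative) would be:
Persons( personid , name, lastname, email )
Countries( countryid , name )
Lives( personid , countryid )
The separate Lives table holds one row per person/country pair, which is how the N:N relationship is expressed.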
It should convert into :
Wife ( WifeID , name )
Phone(PhoneID , phoneNumber , StaffID)
Task ( TaskID , description)
Work(WorkID , CompanyID , StaffID , since )
Perform(PerformID , StaffID , TaskID )
CREATE VIEW B-Students (name, sid, course) AS SELECT S.sname, S.sid, E.cid FROM
Students S, Enrolled E WHERE S.sid = E.sid AND E.grade = ‘B’
This view can be used just like a base table, or explicitly stored table, in defining new queries or
views. Given the instances of Enrolled and Students shown in Figure 3.4, BStudents contains
the tuples shown in Figure 3.18.
To destroy views, use the DROP TABLE command. For example, DROP TABLE Students
RESTRICT destroys the Students table unless some view or integrity constraint refers to
Students; if so, the command fails. If the keyword RESTRICT is replaced by CASCADE,
Students is dropped and any referencing views or integrity constraints are (recursively) dropped
as well; one of these two keywords must always be specified. A view can be dropped using the
DROP VIEW command, which is just like DROP TABLE.
ALTER TABLE modifies the structure of an existing table. To add a column called maiden-name
to Students, for example, we would use the following command:
ALTER TABLE Students ADD COLUMN maiden-name CHAR(10)
The definition of Students is modified to add this column, and all existing rows are padded with
null values in this column. ALTER TABLE can also be used to delete columns and to add or
drop integrity constraints on a table.
UNIT – II
Overview:
The Relational Model defines two root languages for accessing a relational database -- Relational
Algebra and Relational Calculus. Relational Algebra is a low-level, operator-oriented language.
Creating a query in Relational Algebra involves combining relational operators using algebraic
notation. Relational Calculus is a high-level, declarative language. Creating a query in Relational
Calculus involves describing what results are desired.
SQL (Structured Query Language) is a database sublanguage for querying and modifying
relational databases. The basic structure in SQL is the statement; this unit covers how to
write queries and how to modify tables and columns.
Contents:
2.1 Relational Algebra and Calculus:
Relational algebra is one of the two formal query languages associated with the relational
model. Queries in algebra are composed using a collection of operators. A fundamental property
is that every operator in the algebra accepts (one or two) relation instances as arguments and
returns a relation instance as the result. This property makes it easy to compose operators to form
a complex query —a relational algebra expression is recursively defined to be a relation, a
unary algebra operator applied to a single expression, or a binary algebra operator applied to two
expressions. We describe the basic operators of the algebra (selection, projection, union, cross-
product, and difference).
Relational algebra includes operators to select rows from a relation (σ)and to project columns
(π).
These operations allow us to manipulate data in a single relation. Consider the instance of the
Sailors relation shown in Figure 4.2, denoted as S2. We can retrieve rows corresponding to
expert sailors by using the σ operator. The expression σrating>8(S2) evaluates to the relation shown in
Figure 4.4. The subscript rating>8 specifies the selection criterion to be applied while retrieving
tuples.
Set Operations:
The following standard operations on sets are also available in relational algebra: union (∪),
intersection (n), set-difference (-), and cross-product (×).
Union: R∪S returns a relation instance containing all tuples that occur in either relation instance
R or relation instance S (or both). R and S must be union-compatible, and the schema of the result
is defined to be identical to the schema of R.
Intersection: RnS returns a relation instance containing all tuples that occur in both R and S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
Set-difference: R-S returns a relation instance containing all tuples that occur in R but not in S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
Cross-product: R×S returns a relation instance whose schema contains all the fields of R (in the
same order as they appear in R) followed by all the fields of S (in the same order as they appear
in S). The result of R × S contains one tuple ⟨r, s⟩ (the concatenation of tuples r and s) for each pair
of tuples r ∈ R, s ∈ S. The cross-product operation is sometimes called Cartesian product.
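These algebra operations have direct counterparts in SQL: UNION, INTERSECT, and EXCEPT (called MINUS in some systems). As a small sketch, assuming two union-compatible tables Sailors1 and Sailors2 with the same schema:
SELECT sid FROM Sailors1
UNION
SELECT sid FROM Sailors2;
Replacing UNION with INTERSECT or EXCEPT gives the intersection and the set-difference, respectively.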
Joins
The join operation is one of the most useful operations in relational algebra and is the most
commonly used way to combine information from two or more relations. Although a join can be
defined as a cross-product followed by selections and projections, joins arise much more
frequently in practice than plain cross-products.
Condition Joins
The most general version of the join operation accepts a join condition c and a pair of relation
instances as arguments, and returns a relation instance. The join condition is identical to a
selection condition in form.
Select Operation (σ)
The selection operation is defined as follows:
Notation − σp(r)
Where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula
which may use connectors like and, or, and not. These terms may use relational operators like
=, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − Selects tuples from books where subject is 'database'.
Project Operation (∏)
It projects column(s) that satisfy a given predicate.
For example −
∏subject, author (Books)
Output − Selects and projects the columns named subject and author from the relation Books.
Union Operation (∪)
It performs binary union between two given relations and is defined as −
r ∪ s = { t | t ∈ r or t ∈ s}
Notation − r ∪ s
Where r and s are either database relations or relation result set (temporary relation).
Set Difference (−)
The result of the set difference operation is tuples that are present in one relation but not in the second relation.
Notation − r − s
Cartesian Product (Χ)
Combines information of two different relations into one.
Notation − r Χ s
r Χ s = { qt | q ∈ r and t ∈ s}
Rename Operation (ρ)
The results of relational algebra are also relations, but without any name; the rename operation allows us to name the output relation.
Notation − ρx(E), where the result of expression E is saved with the name x.
Additional relational algebra operations include −
Set intersection
Assignment
Natural join
The variant of the calculus that we present in detail is called the tuple relational calculus
(TRC). Variables in TRC take on tuples as values. In another variant, called the domain
relational calculus (DRC), the variables range over field values.
2.3 Tuple Relational Calculus
A tuple variable is a variable that takes on tuples of a particular relation schema as values. That
is, every value assigned to a given tuple variable has the same number and type of fields. A tuple
relational calculus query has the form { T | p(T) },where T is a tuple variable and p(T) denotes a
formula that describes T. The result of this query is the set of all tuples t for which the formula
p(T)evaluates to true with T = t. The language for writing formulas p(T) is thus at the heart of
TRC and is essentially a simple subset of first-order logic
As a simple example, consider the following query.
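The query that the notes refer to here is not reproduced; a typical TRC query of this form, assuming the Sailors relation used earlier in this unit, would be:
{ S | S ∈ Sailors ∧ S.rating > 7 }
which returns all sailors whose rating is above 7.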
Let Rel be a relation name, R and S be tuple variables, a an attribute of R, and b an attribute of S.
Let op denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of the following:
R ∈ Rel
R.a op S.b
R.a op constant, or constant op R.a
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(R) denotes a formula in which the variable R appears:
A domain variable is a variable that ranges over the values in the domain of some attribute (e.g.,
the variable can be assigned an integer if it appears in an attribute whose domain is the set of
integers). A DRC query has the form {⟨x1, x2, ..., xn⟩ | p(x1, x2, ..., xn)}, where each xi is either a domain
variable or a constant and p(x1, x2, ..., xn) denotes a DRC formula whose only free variables are
the variables among the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples ⟨x1, x2, ..., xn⟩
for which the formula evaluates to true.
DRC formula is defined in a manner that is very similar to the definition of a TRC formula.
The main difference is that the variables are now domain variables. Let op denote an operator in
the set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables. An atomic formula is one of the following:
X op Y
X op constant, or constant op X
A formula is recursively defined to be one of the following, where p and q are themselves formulas, and p(X) denotes a formula in which the variable X appears:
any atomic formula
¬p, p ∧ q, p ∨ q, or p ⇒ q
∃X(p(X)), where X is a domain variable
∀X(p(X)), where X is a domain variable
Notation − {T | Condition}
For example − { T.name | Author(T) AND T.article = 'database' }
Output − Returns tuples with 'name' from Author who has written an article on 'database'.
TRC can be quantified: we can use the existential (∃) and universal (∀) quantifiers.
Domain relational calculus uses the notation − { a1, a2, ..., an | P(a1, a2, ..., an) }
Where a1, a2, ..., an are attributes and P stands for a formula built from inner attributes.
For example − { <subject> | <subject> ∈ Books ∧ subject = 'database' }
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also
involves relational operators.
The expressive power of tuple relational calculus and domain relational calculus is equivalent to that of relational algebra.
Expressive Power of Algebra and Calculus
1. The basic difference between relational algebra and relational calculus is that relational algebra is a procedural language, whereas relational calculus is a non-procedural (declarative) language.
2. Relational algebra defines how to obtain the result, whereas relational calculus defines what information the result must contain.
3. Relational algebra specifies the sequence in which operations have to be performed in the query. Relational calculus, on the other hand, does not specify the sequence of operations to be performed in the query.
4. Relational algebra is not domain dependent, whereas relational calculus can be domain dependent, as in domain relational calculus.
5. The relational algebra query language is more closely related to programming languages, whereas relational calculus is closer to natural language.
SQL:
2.5 The Form of a Basic SQL Query:
SQL is the language used to query relational databases. It is simple to learn and appears to do very little, but it is the heart of a successful database application. Understanding SQL and using it efficiently is essential to designing an efficient database application. The better your understanding of SQL, the more versatile you'll be in getting information out of databases. A SQL SELECT statement can be broken down into numerous elements, each beginning with a keyword.
Although it is not necessary, common convention is to write these keywords in all capital letters.
In this article, we will focus on the most fundamental and common elements of a SELECT
statement, namely
SELECT
FROM
WHERE
ORDER BY
If we want only specific columns (as is usually the case), we can/should explicitly specify them
in a comma-separated list, as in
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
Explicitly specifying the desired fields also allows us to control the order in which the fields are
returned, so that if we wanted the last name to appear before the first name, we could write
SELECT EmployeeID, LastName, FirstName, HireDate, City FROM Employees
The WHERE Clause
The next thing we want to do is to start limiting, or filtering, the data we fetch from the database.
By adding a WHERE clause to the SELECT statement, we add one (or more) conditions that
must be met by the selected data. This will limit the number of rows that answer the query and
are fetched. In many cases, this is where most of the "action" of a query takes place.
Examples
We can continue with our previous query, and limit it to only those employees living in London:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London'
If you wanted to get the opposite, the employees who do not live in London, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City <> 'London'
You are not limited to testing for equality; you can also use the other standard comparison operators that you would expect. For example, to get a list of employees who were hired on or after a given date, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE HireDate >= '1-july-1993'
Of course, we can write more complex conditions. The obvious way to do this is by having
multiple conditions in the WHERE clause. If we want to know which employees were hired
between two given dates, we could write
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE (HireDate>= '1-june-1992') AND (HireDate<= '15-december-1993')
Note that SQL also has a special BETWEEN operator that checks to see if a value is between
two values (including equality on both ends). This allows us to rewrite the previous query as
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate BETWEEN '1-june-1992' AND '15-december-1993'
We could also use the NOT operator, to fetch those rows that are not between the specified dates:
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate NOT BETWEEN '1-june-1992' AND '15-december-1993'
Let us finish this section on the WHERE clause by looking at two additional, slightly more
sophisticated, comparison operators.
What if we want to check if a column value is equal to more than one value? If it is only 2
values, then it is easy enough to test for each of those values, combining them with the OR
operator and writing something like
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London' OR City = 'Seattle'
However, if there are three, four, or more values that we want to compare against, the above
approach quickly becomes messy. In such cases, we can use the IN operator to test against a set
of values. If we wanted to see if the City was either Seattle, Tacoma, or Redmond, we would
write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City IN ('Seattle', 'Tacoma', 'Redmond')
As with the BETWEEN operator, here too we can reverse the results obtained and query for
those rows where City is not in the specified list:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City NOT IN ('Seattle', 'Tacoma', 'Redmond')
Finally, the LIKE operator allows us to perform basic pattern-matching using wildcard
characters. For Microsoft SQL Server, the wildcard characters are defined as follows:
Wildcard Description
% (percent) matches any string of zero or more characters
_ (underscore) matches any single character
[] matches any single character within the specified range (e.g. [a-f]) or set (e.g. [abcdef])
[^] matches any single character not within the specified range (e.g. [^a-f]) or set (e.g. [^abcdef])
Here too, we can opt to use the NOT operator: to find all of the employees whose first name
does not start with 'M' or 'A', we would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE (FirstName NOT LIKE 'M%') AND (FirstName NOT LIKE 'A%')
The ORDER BY Clause
Until now, we have been discussing filtering the data: that is, defining the conditions that
determine which rows will be included in the final set of rows to be fetched and returned from
the database. Once we have determined which columns and rows will be included in the results
of our SELECT query, we may want to control the order in which the rows appear—sorting the
data.
To sort the data rows, we include the ORDER BY clause. The ORDER BY clause includes one
or more column names that specify the sort order. If we return to one of our first SELECT
statements, we can sort its results by City with the following statement:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY City
If we want the sort order for a column to be descending, we can include the DESC keyword after
the column name.
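For instance, reusing the query above, sorting the employees from the most recently hired to the earliest would look like:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY HireDate DESC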
The ORDER BY clause is not limited to a single column. You can include a comma-delimited
list of columns to sort by—the rows will all be sorted by the first column specified and then by
the next column specified. If we add the Country field to the SELECT clause and want to sort
by Country and City, we would write:
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country, City DESC
Note that to make it interesting, we have specified the sort order for the City column to be
descending (from highest to lowest value). The sort order for the Country column is still
ascending. We could be more explicit about this by writing
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country ASC, City DESC
It is important to note that a column does not need to be included in the list of selected (returned)
columns in order to be used in the ORDER BY clause. If we don't need to see/use the Country
values, but are only interested in them as the primary sorting field we could write the query as
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY Country ASC, City DESC
SQL provides three set-manipulation constructs that extend the basic query form. Since the
answer to a query is a multiset of rows, it is natural to consider the use of operations such as
union, intersection, and difference. SQL supports these operations under the names UNION,
INTERSECT, and EXCEPT.
Union:
Eg: Find the names of sailors who have reserved a red or a green boat.
SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
UNION
SELECT S2.sname FROM Sailors S2, Boats B2, Reserves R2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’
This query says that we want the union of the set of sailors who have reserved red boats and
the set of sailors who have reserved green boats.
Intersect:
Eg:Find the names of sailors who have reserved both a red and a green boat.
SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
INTERSECT
SELECT S2.sname FROM Sailors S2, Boats B2, Reserves R2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’
Except:
Eg:Find the sids of all sailors who have reserved red boats but not green boats.
SELECT S.sid FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid =
B.bid AND B.color = ‘red’
EXCEPT
SELECT S2.sid FROM Sailors S2, Reserves R2, Boats B2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’
SQL also provides other set operations: IN (to check if an element is in a given set), op ANY, op
ALL (to compare a value with the elements in a given set, using comparison operator op), and
EXISTS (to check if a set is empty). IN and EXISTS can be prefixed by NOT, with the obvious modification to their meaning.
UNION, INTERSECT, and EXCEPT were covered above; IN, EXISTS, ANY, and ALL are illustrated with the subqueries and nested queries below.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along
with the operators like =, <, >, >=, <=, IN, BETWEEN etc.
The sub query can refer to variables from the surrounding query, which will act as constants
during any one evaluation of the sub query.
This simple example is like an inner join on col2, but it produces at most one output row for
each tab1 row, even if there are multiple matching tab2 rows:
SELECT col1
FROM tab1
WHERE EXISTS (SELECT 1
FROM tab2
WHERE col2 = tab1.col2);
Example "Students in Projects":
SELECT name
FROM stud
WHERE EXISTS (SELECT 1
FROM assign
WHERE assign.stud = stud.id);
The right-hand side of this form of IN is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result. The result of IN is TRUE if any equal sub query row is found.
ALL
The right-hand side of this form of ALL is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result using the given operator, which must yield a Boolean result. The result of ALL is TRUE if
all rows yield TRUE (including the special case where the sub query returns no rows). NOT IN
is equivalent to <> ALL.
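As a small sketch (tab1 and tab2 are the illustrative tables used in the EXISTS example above; col2 is assumed to be a comparable column in both):
SELECT col1
FROM tab1
WHERE col2 <> ALL (SELECT col2
FROM tab2);
This keeps only the tab1 rows whose col2 differs from every tab2 value, which is exactly what WHERE col2 NOT IN (SELECT col2 FROM tab2) would express.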
Row-wise comparison
The left-hand side is a list of scalar expressions. The right-hand side can be either a list of scalar
expressions of the same length, or a parenthesized sub query, which must return exactly as many
columns as there are expressions on the left-hand side. Furthermore, the sub query cannot return
more than one row. (If it returns zero rows, the result is taken to be NULL.) The left-hand side is
evaluated and compared row-wise to the single sub query result row, or to the right-hand
expression list. Presently, only = and <> operators are allowed in row-wise comparisons. The
result is TRUE if the two rows are equal or unequal, respectively.
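A sketch using the tab1/tab2 names from the earlier example (the id column is an assumption):
SELECT *
FROM tab1
WHERE (col1, col2) = (SELECT col1, col2
FROM tab2
WHERE id = 1);
The row (col1, col2) from tab1 is compared element by element with the single row returned by the subquery.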
A nested query is a query that has another query embedded within it; the embedded query is
called a subquery.
SQL provides other set operations: IN (to check if an element is in a given set),NOT IN(to
check if an element is not in a given set).
Eg:1. Find the names of sailors who have reserved boat 103.
SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid
FROM Reserves R
WHERE R.bid = 103)
The nested subquery computes the (multi)set of sids for sailors who have reserved boat 103,
and the top-level query retrieves the names of sailors whose sid is in this set. The IN operator
allows us to test whether a value is in a given set of elements; an SQL query is used to generate
the set to be tested.
2.Find the names of sailors who have not reserved a red boat.
SELECT S.sname
FROM Sailors S
WHERE S.sid NOT IN (SELECT R.sid
FROM Reserves R
WHERE R.bid IN (SELECT B.bid
FROM Boats B
WHERE B.color = 'red'))
Correlated Nested Queries
In the nested queries that we have seen, the inner subquery has been completely independent of
the outer query. In general the inner subquery could depend on the row that is currently being
examined in the outer query .
Eg: Find the names of sailors who have reserved boat number 103.
The EXISTS operator is another set comparison operator, such as IN. It allows us to test
whether a set is nonempty.
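A sketch of the correlated form of this query, assuming the usual Sailors(sid, sname, rating, age) and Reserves(sid, bid, day) schema; note that the subquery refers to the tuple variable S of the outer query:
SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
FROM Reserves R
WHERE R.bid = 103
AND R.sid = S.sid)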
Set-Comparison Operators
SQL also supports op ANY and op ALL, where op is one of the arithmetic comparison
operators {<, <=, =, <>, >=,>}.
Eg:1. Find sailors whose rating is better than some sailor called Horatio.
If there are several sailors called Horatio, this query finds all sailors whose rating is better
than that of some sailor called Horatio.
2.Find the sailors with the highest rating.
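Sketches of these two queries, again assuming the Sailors(sid, sname, rating, age) schema:
SELECT S.sid
FROM Sailors S
WHERE S.rating > ANY (SELECT S2.rating
FROM Sailors S2
WHERE S2.sname = 'Horatio')
SELECT S.sid
FROM Sailors S
WHERE S.rating >= ALL (SELECT S2.rating
FROM Sailors S2)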
SQL Operators
There are two type of Operators, namely Comparison Operators and Logical Operators. These
operators are used mainly in the WHERE clause, HAVING clause to filter the data to be
selected.
Comparison Operators: Comparison operators are used to compare the column data with specific values in a condition. Comparison operators are also used along with the SELECT statement to filter data based on specific conditions.
Logical Operators: There are three logical operators, namely AND, OR and NOT.
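For example, the following statement (a sketch using the student_details table from these notes) uses LIKE with the '%' wildcard:
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE 'S%';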
The above select statement searches for all the rows where the first letter of the column
first_name is 'S' and rest of the letters in the name can be any character.
There is another wildcard character you can use with LIKE operator. It is the underscore
character, ' _ ' . In a search string, the underscore signifies a single character.
To display all the names with 'a' as the second character,
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE '_a%';
NOTE: Each underscore acts as a placeholder for only one character, so you can use more than one underscore. Eg: '__i%' has two underscores towards the left; 'S__j%' has two underscores between the characters 'S' and 'j'.
To find the names of the students between age 10 to 15 years, the query would be like,
SELECT first_name, last_name, age
FROM student_details
WHERE age BETWEEN 10 AND 15;
SQL IN Operator
The IN operator is used when you want to compare a column with more than one value. It is
similar to an OR condition.
If you want to find the names of students who are studying either Maths or Science, the query
would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject IN ('Maths', 'Science');
You can include more subjects in the list like ('maths','science','history')
If you want to find the names of students who do not participate in any games, the query would
be as given below
SELECT first_name, last_name
FROM student_details
WHERE games IS NULL
There would be no output, since every student in the student_details table participates in a game; otherwise, the names of the students who do not participate in any game would be displayed.
Example
In the following example, aggregate functions are applied to the employee_count column of the branch table. The region_nbr column is the level of grouping. Here are the contents of the table:
Table: BRANCH
branch_nbr branch_name region_nbr employee_count
108 New York 100 10
110 Boston 100 6
212 Chicago 200 5
404 San Diego 400 6
415 San Jose 400 3
-- The aggregate list below is illustrative; the notes specify only the grouping and ordering.
SELECT region_nbr,
COUNT(branch_nbr) AS branch_count,
SUM(employee_count) AS total_employees
FROM branch
GROUP BY region_nbr
ORDER BY region_nbr
Syntax:
The basic syntax of NULL while creating a table:
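A sketch of the kind of CREATE TABLE statement being described here, using the CUSTOMERS columns shown in the data further below (the exact data types are assumptions):
CREATE TABLE CUSTOMERS (
ID INT NOT NULL,
NAME VARCHAR(20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(25),
SALARY DECIMAL(18, 2),
PRIMARY KEY (ID)
);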
Here, NOT NULL signifies that the column should always accept an explicit value of the given data type. There are two columns where we did not use NOT NULL, which means these columns could be NULL.
A field with a NULL value is one that has been left blank during record creation.
Example:
The NULL value can cause problems when selecting data, however, because when comparing an
unknown value to any other value, the result is always unknown and not included in the final
results.
You must use the IS NULL or IS NOT NULL operators in order to check for a NULL value.
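For example, to list the customers in the table below whose SALARY has been left NULL (the table name CUSTOMERS matches the sketch above and is an assumption):
SELECT ID, NAME, AGE, ADDRESS
FROM CUSTOMERS
WHERE SALARY IS NULL;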
ID NAME AGE ADDRESS SALARY
1 Ramesh 32 Ahmedabad 2000.00
2 Khilan 25 Delhi 1500.00
3 kaushik 23 Kota 2000.00
4 Chaitali 25 Mumbai 6500.00
5 Hardik 27 Bhopal 8500.00
6 Komal 22 MP
7 Muffy 24 Indore
Logical Operator   Description
OR                 For the row to be selected at least one of the conditions must be true.
AND                For a row to be selected all the specified conditions must be true.
NOT                For a row to be selected the specified condition must be false.
Example: if you want to find the names of students who are studying either Maths or Science, the
query would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject = 'Maths' OR subject = 'Science'
first_name   last_name   subject
----------   ---------   -------
Anajali      Bhagwat     Maths
Shekar       Gowda       Maths
Rahul        Sharma      Science
Stephen      Fleming     Science
The following table describes how logical "OR" operator selects a row.
Example: To find the names of the students between the age 10 to 15 years, the query would be
like:
SELECT first_name, last_name, age
FROM student_details
WHERE age >= 10 AND age <= 15;
first_name   last_name   age
----------   ---------   ---
Rahul        Sharma      10
Anajali      Bhagwat     12
Shekar       Gowda       15
The following table describes how logical "AND" operator selects a row.
Example: If you want to find out the names of the students who do not play football, the query
would be like:
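A sketch of that query, assuming the student_details table has the games column used in the IS NULL example above:
SELECT first_name, last_name, games
FROM student_details
WHERE NOT games = 'Football';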
OUTER JOINS
All joins mentioned above, that is Theta Join, Equi Join and Natural Join are called inner-joins.
An inner-join process includes only tuples with matching attributes, rest are discarded in
resulting relation. There exist methods by which all tuples of either relation can be included in the resulting relation.
All tuples of Left relation, R, are included in the resulting relation and if there exists tuples in R
without any matching tuple in S then the S-attributes of resulting relation are made NULL.
Left
A B
100 Database
101 Mechanics
102 Electronics
Right
A B
100 Alex
102 Maya
104 Mira
A B C D
100 Database 100 Alex
101 Mechanics --- ---
102 Electronics 102 Maya
--- --- 104 Mira
DISALLOWING NULL VALUES
Employee ID   Employee Name   Age   Gender   Location   Salary
1001          Henry           54    Male     New York   100000
1002          Tina            36    Female   Moscow     80000
1003          John            24    Male     London     40000
1006          Sophie          29    Female   London     60000
Default values are also subject to integrity constraint checking (defaults are included as part of
an INSERT statement before the statement is parsed.)
If the results of an INSERT or UPDATE statement violate an integrity constraint, the statement
will be rolled back.
Integrity constraints are stored as part of the table definition, (in the data dictionary.)
If multiple applications access the same table they will all adhere to the same rule.
NOT NULL
UNIQUE
CHECK constraints for complex integrity rules
PRIMARY KEY
FOREIGN KEY integrity constraints (referential integrity actions: ON UPDATE, ON DELETE, ON DELETE CASCADE, ON DELETE SET NULL)
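A sketch showing these constraint types together (the table and column names are purely illustrative):
CREATE TABLE enrollment (
student_id INT NOT NULL,
course_id INT NOT NULL,
email VARCHAR(50) UNIQUE,
marks INT CHECK (marks BETWEEN 0 AND 100),
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES student(id) ON DELETE CASCADE
);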
Constraint States
The current status of an integrity constraint can be changed to any of the following four options using the CREATE TABLE or ALTER TABLE statement: ENABLE VALIDATE, ENABLE NOVALIDATE, DISABLE VALIDATE, and DISABLE NOVALIDATE.
ENABLE NOVALIDATE means that the constraint is checked, but it does not have to be true
for all rows. This will resume constraint checking for Inserts and Updates but will not validate
any data that already exists in the table.
DISABLE VALIDATE disables the constraint, drops the index on the constraint, and disallows
any modification of the constrained columns.
For a UNIQUE constraint, this enables you to load data from a nonpartitioned table into a partitioned table using the ALTER TABLE ... EXCHANGE PARTITION statement.
Condition: A query or test that is run when the trigger is activated.
Action: A procedure that is executed when the trigger is activated and its condition is true.
Eg: The trigger called init_count initializes a counter variable before every execution of an INSERT statement that adds tuples to the Students relation. The trigger called incr_count increments the counter for each inserted tuple that satisfies the condition age < 18.
CREATE TRIGGER init_count BEFORE INSERT ON Students /* Event */
DECLARE
count INTEGER;
BEGIN /* Action */
count := 0;
END
CREATE TRIGGER incr_count AFTER INSERT ON Students /* Event */
WHEN (new.age < 18) /* Condition; 'new' refers to the tuple being inserted */
FOR EACH ROW
BEGIN /* Action */
count := count + 1;
END
UNIT-III
Overview:
Constructing the tables alone does not make a database design efficient; solving the redundant-data problem does. For this we use functional dependencies and normal forms, which are discussed in this chapter.
Contents:
Schema refinement
Use of Decompositions
Functional dependencies
Normal forms
Multi valued dependencies
3.1 Introduction to Schema Refinement:
We now present an overview of the problems that schema refinement is intended to address and
a refinement approach based on decompositions. Redundant storage of information is the root
cause of these problems. Although decomposition can eliminate redundancy, it can lead to
problems of its own and should be used with caution.
Storing the same information redundantly, that is, in more than one place within a database, can
lead to several problems:
Redundant storage: Some information is stored repeatedly.
Update anomalies: If one copy of such repeated data is updated, an inconsistency is created
unless all copies are similarly updated.
Insertion anomalies: It may not be possible to store some information unless some other
information is stored as well.
Deletion anomalies: It may not be possible to delete some information without losing some
other information as well.
Use of Decompositions
Redundancy arises when a relational schema forces an association between attributes that is not
natural. Functional dependencies can be used to identify such situations and to suggest
refinements to the schema. The essential idea is that many problems arising from redundancy can
be addressed by replacing a relation with a collection of ‘smaller’ relations. Each of the smaller
relations contains a subset of the attributes of the original relation. We refer to this process as
decomposition of the larger relation into the smaller relations.
Decomposing a relation schema can create more problems than it solves. Two important questions must be asked repeatedly:
1. Do we need to decompose a relation?
2. What problems (if any) does a given decomposition cause?
To help with the first question, several normal forms have been proposed for relations. If a
relation schema is in one of these normal forms, we know that certain kinds of problems cannot
arise. Considering the normal form of a given relation schema can help us to decide whether or
not to decompose it further. If we decide that a relation schema must be decomposed further, we
must choose a particular decomposition.
With respect to the second question, two properties of decompositions are of particular interest.
The lossless-join property enables us to recover any instance of the decomposed relation from
corresponding instances of the smaller relations. The dependency preservation property enables
us to enforce any constraint on the original relation by simply enforcing some constraints on
each of the smaller relations. That is, we need not perform joins of the smaller relations to check
whether a constraint on the original relation is violated.
3.2 Functional dependencies:
A functional dependency A->B holds in a relation if any two tuples having the same value of attribute A also have the same value of attribute B. For example, in the relation STUDENT shown in table 1, the functional dependencies
STUD_NO->STUD_NAME and STUD_NO->STUD_ADDR hold, but
STUD_NAME->STUD_ADDR does not hold.
Attribute Closure: The attribute closure of an attribute set is the set of attributes which can be functionally determined from it.
How to find attribute closure of an attribute set?
To find attribute closure of an attribute set:
Add elements of attribute set to the result set.
Recursively add elements to the result set which can be functionally determined from the
elements of the result set.
Using FD set of table 1, attribute closure can be determined as:
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
How to find candidate keys and super keys using attribute closure?
If the attribute closure of an attribute set contains all attributes of the relation, the attribute set is a super key of the relation.
If, in addition, no proper subset of this attribute set can functionally determine all attributes of the relation, the set is a candidate key as well. For example, using the FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO, STUD_NAME) is a super key but not a candidate key, because the closure of its subset (STUD_NO)+ already contains all attributes of the relation. So STUD_NO alone is a candidate key.
3.3 Normalization:
In general, database normalization involves splitting tables with columns that have different types of data (and perhaps even unrelated data) into multiple tables, each with fewer columns that describe the attributes of a single concept or physical object.
The goal of normalization is to prevent the problems (called modification anomalies) that plague a poorly designed relation (table).
Suppose, for example, that you have a table with resort guest ID numbers, activities the guests
have signed up to do, and the cost of each activity – all together in the following GUEST –
ACTIVITY-COST table:
Each row in the table represents a guest who has signed up for the named activity and paid the specified cost. Assuming that the cost depends only on the activity (that is, a specific activity costs the same for all guests), if you delete the row for GUEST-ID 2587, you lose not only the fact that guest 2587 signed up for scuba diving, but also the fact that scuba diving costs $250.00 per outing. This is called a deletion anomaly: when you delete a row, you lose more information than you intended to remove.
In the current example, a single deletion resulted in the loss of information on two entities: what activity a guest signed up to do and how much a particular activity costs.
Now, suppose the resort adds a new activity such as horseback riding. You cannot enter the activity name (horseback riding) or cost ($190.00) into the table until a guest decides to sign up for it. The unnecessary restriction of having to wait until someone signs up for an activity before you can record its name and cost is called an insertion anomaly.
In the current example, each insertion adds facts about two entities. Therefore, you cannot INSERT a fact about one entity until you have an additional fact about the other entity. Conversely, each deletion removes facts about two entities. Thus, you cannot DELETE the information about one entity while leaving the information about the other in the table.
You can eliminate modification anomalies through normalization – that is, splitting the single
table with rows that have attributes about two entities into two tables, each of which has rows
with attributes that describe a single entity.
You will be able to remove the aromatherapy appointment for guest 1269 without losing the fact that an aromatherapy session costs $75.00. Similarly, you can now add the fact that horseback riding costs $190.00 per day to the ACTIVITY-COST table without having to wait for a guest to sign up for the activity.
During the development of relational database systems in the 1970s, relational theorists kept discovering new modification anomalies. Someone would find an anomaly, classify it, and then figure out a way to prevent it by adding additional design criteria to the definition of a "well-formed relation". These design criteria are known as normal forms. Not surprisingly, E. F. Codd (of the 12-rule database definition fame) defined the first, second, and third normal forms (1NF, 2NF, and 3NF).
After Codd postulated 3NF, relational theorists formulated Boyce-Codd normal form (BCNF) and then fourth normal form (4NF) and fifth normal form (5NF).
Deletion anomaly:
The inability to remove a single fact from a table without removing other (unrelated) facts you want to keep.
Insertion anomaly:
The inability to insert one fact without inserting another (and sometimes unrelated) fact.
Update anomaly:
Changing a fact in one column creates a false fact in another set of columns. Modification anomalies are a result of functional dependencies among the columns in a row (or tuple, to use the precise relational database term).
A functional dependency means that if you know the value in one column or set of columns, you can always determine the value of another. To put the table in first normal form (1NF) you could break up the student number list in the STUDENTS column of each row such that each row had only one of the student IDs in the STUDENTS column. Doing so would change the table's structure and rows. The combination (CLASS, SECTION, STUDENT) is the composite key for the table because it makes each row unique and all columns atomic. Now that the table in the current example is in 1NF, each column has a single, scalar value. Unfortunately, the table still exhibits modification anomalies:
Deletion anomaly:
If professor SMITH goes to another school and you remove his rows from the table, you also
lose the fact that STUDENTS 1005, 2110 and 3115 are enrolled in a history class.
Insertion anomaly:
If the school wants to add an English class (E100), it cannot do so until a student signs up for the course (remember, no part of a primary key can have a NULL value).
Update anomaly:
If STUDENT 4587 decides to sign up for the SECTION 1, CS100 class instead of his math class, updating the CLASS and SECTION columns in the row for STUDENT 4587 to reflect the change will cause the table to show TEACHER RAWLINS as being in both the MATH and the COMP-SCI departments.
Thus, 'flattening' a table's columns to put it into first normal form (1NF) does not solve any of the modification anomalies. All it does is guarantee that the table satisfies the requirements for a
table defined as “relational” and that there are no multi valued dependencies between the
columns in each row.
When a table is in second normal form, it must be in first normal form (no multivalued dependencies) and have no partial key dependencies.
A partial key dependency is a situation in which the value in part of a key can be used to determine the value of another attribute (column). Thus, a table is in 2NF when the value in all nonkey columns depends on the entire key. Or, said another way, you cannot determine the value of any of the columns by using only part of the key. Consider the table with (CLASS, SECTION, STUDENT) as its primary key. If the university has two rules about taking classes (no student can sign up for more than one section of the same class, and a student can have only one major), then the table, while in 1NF, is not in 2NF.
Given the value of (STUDENT, COURSE) you can determine the value of the SECTION, since
no student can sign up for two sections of the same course. Similarly since students can sign up
for only one major, knowing STUDENT determines the value of MAJOR. In both instances, the
value of a third column can be deduced (or is determined) by the value in a portion of the key
(CLASS, SECTION, STUDENT) that makes each row unique.
To put the table in the current example in 2NF will require that it be split in to three tables
described by :
Courses (Class, Section, Teacher, Department)
PRIMARY KEY (Class, Section)
Enrollment (Student, Class, Section)
PRIMARY KEY (Student, class)
Students (student, major)
PRIMARY KEY (Student)
Unfortunately, putting a table in 2NF does not eliminate modification anomalies.
Suppose, for example, that professor Jones leaves the university. Removing his row from the
COURSES table would eliminate the entire ENGINEERING department, since he is currently
the only professor in the department.
Similarly, if the university wants to add a music department, it cannot do so until it hires a
professor to teach in the department.
Understanding Third Normal Form :
To be in third normal form (3NF) a table must satisfy the requirements for 1NF (no multivalued dependencies) and 2NF (all nonkey attributes must depend on the entire key). In addition, a table in 3NF has no transitive dependencies between nonkey columns.
Given a table with columns (A, B, C), a transitive dependency is one in which A determines B, and B determines C, therefore A determines C; or, expressed using relational theory notation:
If A->B and B->C then A->C.
When a table is in 3NF, the value in every nonkey column of the table can be determined by using the entire key and only the entire key. Therefore, given a table in 3NF with columns (A, B, C), if A is the PRIMARY KEY, you could not use the value of B (a nonkey column) to determine the value of C (another nonkey column). As such, A determines B (A->B), and A determines C (A->C). However, knowing the value of column B does not tell you the value in column C; that is, it is not the case that B->C.
Suppose, for example, that you have a COURSES tables with columns and PRIMARY KEY
described by
Courses (Class, section, teacher, department , department head)
PRIMARY KEY (Class, Section)
That contains the Data :
Class   Teacher   Department   Department Head
H100    Smith     History      Smith
H1002   Riley     History      Smith
M2003   Rawlins   Math         Hastings
M2002   Brown     Math         Hastings
M2004   Riley     Math         Hastings
Given that a TEACHER can be assigned to only one DEPARTMENT and that a DEPARTMENT can have only one department head, the table has multiple transitive dependencies.
For example, the value of TEACHER is dependent on the PRIMARY KEY (CLASS, SECTION), since a particular SECTION of a particular CLASS can have only one teacher; that is, A->B. Moreover, since a TEACHER can be in only one DEPARTMENT, the value in DEPARTMENT is dependent on the value in TEACHER; that is, B->C. However, since the PRIMARY KEY (CLASS, SECTION) determines the value of TEACHER, it also determines the value of DEPARTMENT; that is, A->C. Thus, the table exhibits the transitive dependency in which A->B and B->C, therefore A->C.
The problem with a transitive dependency is that it makes the table subject to the deletion anomaly. When Smith retires and we remove his row from the table, we lose not only the fact that Smith taught SECTION 1 of H100, but also the fact that SECTION 1 of H100 was a class that belonged to the HISTORY department.
To put a table with transitive dependencies between nonkey columns into 3NF requires that the table be split into multiple tables. To do so for the table in the current example, we would need to split it into the tables described by:
Courses (Class, Section, Teacher)
PRIMARY KEY (class, section)
Teachers (Teacher, department)
PRIMARY KEY (teacher)
Departments (Department, Department head)
PRIMARY KEY (department )
After Normalization
The term schema refinement refers to refining the schema by using some technique. The best-known technique for schema refinement is decomposition.
The basic goal of normalization is to eliminate redundancy. Redundancy refers to the repetition of the same data, or duplicate copies of the same data stored in different locations.
Normalization is mainly used for two purposes:
Eliminating redundant (useless) data.
Ensuring data dependencies make sense, i.e., data is logically stored.
SID   Sname   CID   Cname   FEE
S1    A       C1    C       5k
S2    A       C1    C       5k
S1    A       C2    C       10k
S3    B       C2    C       10k
S3    B       C2    JAVA    15k
Primary Key(SID,CID)
Here all the data is stored in a single table, which causes redundancy of data (anomalies), as SID and Sname are repeated for the same CID.
3.5 OTHER KINDS OF DEPENDENCIES:
Finish-to-Start Dependencies:
The most common type of dependency is the finish-to-start relationship (FS). This relationship
means that the first task, the predecessor, must be finished before the next task, the successor,
can start. On the Gantt chart it is usually represented as follows:
Start-to-Start Dependencies
The next type of dependency is the start-to-start relationship (SS). This relationship means that
the successor task cannot start until the predecessor task starts. On the Gantt chart, it is usually
represented as follows:
Finish-to-Finish Dependencies
The third type of dependency is the finish-to-finish relationship (FF). This relationship means
that the successor task cannot finish until the predecessor task finishes. On the Gantt chart, it is
usually represented as follows:
Start-to-Finish Dependencies
The start-to-finish relationship (SF) is the least common task relationship and means that the
successor cannot finish until the predecessor starts. On the Gantt chart, it is usually represented
as follows:
Of course tasks sometimes overlap – this is termed lead (or lead time). Tasks can also be delayed
(for example, to wait while concrete dries) which is called lag (or lag time).
UNIT-IV
TRANSACTION MANAGEMENT
Overview:
In this unit we introduce two topics. The first is concurrency control: stored data is accessed by many users, so if two or more users try to access the same data at the same time, a problem of data inconsistency may arise; concurrency control methods were invented to solve this. The second is recovery, which is used to preserve the data without loss when power failure, software failure, or hardware failure occurs.
Contents:
Crash Recovery
Log recovery
Check pointing
ARIES
4.1 Transactions
Collections of operations that form a single logical unit of work are called Transactions. A
database system must ensure proper execution of transactions despite failures – either the entire
transaction executes, or none of it does.
4.2 Transaction Concept:
A transaction is a unit of program execution that accesses and possibly updates various data
items. Usually, a transaction is initiated by a user program written in a high level data
manipulation language or programming language ( for example SQL, COBOL, C, C++ or
JAVA), where it is delimited by statements ( or function calls) of the form Begin transaction and
end transaction. The transaction consists of all operations executed between the begin transaction
and end transaction.
To ensure integrity of the data, we require that the database system maintain the following
properties of the transaction.
Atomicity:
Either all operations of the transaction are reflected properly in the database, or none are.
Consistency:
Execution of a transaction in isolation ( that is, with no other transaction executing concurrently)
preserves the consistency of the database.
Isolation:
Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
Durability:
After a transaction completes successfully, the changes it has made to the database persist, even
if there are system failures.
4.3 A Simple Transaction Model:
Transaction state:
In the absence of failures, all transactions complete successfully. However, a transaction may not always complete its execution successfully; such a transaction is termed aborted. If we are to ensure the atomicity property, an aborted transaction must have no effect on the state of the database. Thus, any changes that the aborted transaction made to the database must be undone. Once the changes caused by an aborted transaction have been undone, we say that the transaction has been rolled back. It is part of the responsibility of the recovery scheme to manage transaction aborts.
Once a transaction has committed, we cannot undo its effects by aborting it. The only way to undo the effects of a committed transaction is to execute a compensating transaction. For instance, if a transaction added $20 to an account, the compensating transaction would subtract $20 from the account. However, it is not always possible to create such a compensating transaction. Therefore, the responsibility of writing and executing a compensating transaction is left to the user, and is not handled by the database system. A transaction must be in one of the following states:
Active:
The initial state ; the transaction stays in this state while it is executing
Partially committed :
After the final statement has been executed
Failed:
After the discovery that normal execution can no longer proceed
Aborted:
After the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction
Committed:
After successful completion
We say that a transaction has committed only if it has entered the committed state. Similarly, we say that a transaction has aborted only if it has entered the aborted state. A transaction is said to have terminated if it has either committed or aborted.
A transaction starts in the active state. When it finishes its final statement, it enters the partially committed state. At this point, the transaction has completed its execution, but it is still possible that it may have to be aborted, since the actual output may still be temporarily residing in main memory, and thus a hardware failure may preclude its successful completion.
The database system then writes out enough information to disk that, even in the event of a
failure, the updates performed by the transaction can be recreated when the system restarts after
the failure. When the last of this information is written out, the transaction enters the committed
state.
A transaction enters the failed state after the system determines that the transaction can no longer proceed with its normal execution (for example, because of hardware or logical errors). Such a transaction must be rolled back. Then, it enters the aborted state. At this point, the system has two options.
It can restart the transaction, but only if the transaction was aborted as a result of some
hardware or software error that was not created through the internal logic of the transaction. A
restarted transaction is considered to be a new transaction.
It can kill the transaction. It usually does so because of some internal logical error that can be
corrected only by rewriting the application program, or because the input was bad, or because the
desired data were not found in the database.
We must be cautious when dealing with observable external writes, such as writes to a terminal
or printer. Once such a write has occurred, it cannot be erased, since it may have been seen
external to the database system. Most systems allow such writes to take place only after the transaction has entered the committed state.
These properties are often called the ACID properties; the acronym is derived from the first letter of each of the four properties.
Volatile Memory
These are the primary memory devices in the system, and are placed along with the CPU. These
memories can store only small amount of data, but they are very fast. E.g.:- main memory, cache
memory, etc. These memories cannot endure system crashes; data in these memories will be lost on failure.
Non-Volatile memory
These are secondary memories and are huge in size, but slow in processing. E.g.: flash memory, hard disk, magnetic tapes, etc. These memories are designed to withstand system crashes.
Stable Memory
This is said to be a third form of memory structure, but it is built from non-volatile memory. In this case, copies of the same non-volatile memory are stored at different places, so that in case of any crash and data loss, data can be recovered from the other copies. This even helps if one of the non-volatile memories is lost due to fire or flood: the data can be recovered from another network location. But there can be failures while taking the backup of the DB onto different stable storage devices; the transfer may move only part of the data to the remote devices, or fail to store the data in stable memory at all. Hence extra caution has to be taken while copying data from one stable memory to another. There are different methods for copying the data. One of them is to copy the data in two phases: copy the data blocks to the first storage device, and if that succeeds, copy them to the second storage device. The copying is complete only when the second copy executes successfully. But the second copy of the data blocks may fail partway through. In such a case, each data block in the first copy and the second copy needs to be compared for inconsistency; however, verifying every block would be a very costly task, as there may be a huge number of data blocks. A better way to identify a failed block is to identify the block that was in progress during the failure, compare only that block, and correct the mismatches.
Failure Classification
When a transaction is being executed in the system, it may fail to execute due to various reasons.
The failure can be because of system program, bug in a program, user, or system crash. These
failures can be broadly classified into three categories.
Transaction Failure: This type of failure affects only a few tables or processes. It is the condition in which a transaction cannot continue with its execution. The failure can be caused by the user or by the executing program/transaction. The user may cancel the transaction while it is executing, by pressing a cancel button or aborting it using DB commands. The transaction may also fail because of the constraints on the tables, i.e., violation of constraints. It can even
fail if there is concurrent processing of multiple transactions and there is lack of resources for all
of them or deadlock situation. All these will cause the transaction to stop processing in the
middle of its execution. When a transaction fails / stops in the middle, it would have partially
changed DB and it needs to be rolled back to previous consistent state. In ATM withdrawal
example, if the user cancels his transaction after step (i), the system should be able to stop further
processing of the transaction, or if he cancels the transaction after step (ii), the system should be
strong enough to update his balance in his account. Here system may cancel the transaction due
to insufficient balance. The failure can be because of errors in the code – logical errors or
because of system errors like deadlock or unavailability of system resources to execute the
transactions.
System Crash: This can be because of hardware or software failure or because of external
factors like power failure. This is the failure of the system because of the bug in the software or
the failure of system processor. This crash mainly affects the data in the primary memory. If it
affects only the primary memory, the actual data will not be really affected and recovery from
this failure is easy. This is because primary memories are temporary storages and it would not
have updated the actual database. Hence the system will be in the consistent state it was in before the transaction. But when secondary memory crashes, there can be a loss of data, and serious actions are needed to recover the lost data, because secondary memory contains the actual DB data. Recovering it from a crash is a little tedious and requires more effort. The DB recovery system provides strong mechanisms to recover the system from a crash and maintain the atomicity of the transactions. In most cases data in the secondary memory is not affected by this kind of crash, because the database has many integrity checkpoints to prevent loss of data from secondary memory.
Disk Failure: These are issues with hard disks such as the formation of bad sectors, disk head crash, unavailability of the disk, etc. Data can even be lost because of fire, flood, theft, etc. This mainly affects the secondary memory, where the actual data lies. In these cases, we need alternative ways of storing the DB. We can create backups of the DB on a regular basis and store them separately from the memory where the DB is stored, or maintain multiple copies of the DB at different network locations to recover from failure.
4.5 Transaction Atomicity and Durability:
To gain a better understanding of ACID properties and the need for them, consider a simplified
banking system consisting of several accounts and a set of transactions that access and update
those accounts.
Read (X) which transfers the data item X from the database to a local buffer belonging to the
transaction that executed the read operation
Write (X), which transfers the data item X from the local buffer of the transaction that
executed the write back to the database.
In a real database system, the write operation does not necessarily result in the immediate update
of the data on the disk; the write operation may be temporarily stored in memory and executed
on the disk later.
For now, however, we shall assume that the write operation updates the database immediately.
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined as
Ti: read(A);
A := A - 50;
write(A);
read(B);
B := B + 50;
write(B).
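Expressed in SQL, a sketch of the same transfer might look like the following (the accounts table, its columns, and the exact transaction syntax vary between DBMSs and are assumptions here):
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 50 WHERE account_id = 'B';
COMMIT;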
Consistency:
Execution of a transaction in isolation ( that is, with no other transaction executing concurrently)
preserves the consistency of the database.
The consistency requirement here is that the sum of A and B be unchanged by the execution of
the transaction. Without the consistency requirement, money could be created or destroyed by
the transaction. It can be verified easily that, if the database is consistent before an execution of
the transaction, the database remains consistent after the execution of the transaction.
Ensuring consistency for an individual transaction is the responsibility of the application
programmer who codes the transaction. This task may be facilitated by automatic testing of
integrity constraints.
Atomicity:
Suppose that, just before the execution of transaction Ti, the values of accounts A and B are $1,000 and $2,000, respectively.
Now suppose that, during the execution of transaction Ti, a failure occurs that prevents Ti from completing its execution successfully.
Examples of such failures include power failures, hardware failures, and software errors
Further, suppose that the failure happened after the write (A) operation but before the write (B)
operation. In this case, the values of amounts A and B reflected in the database are $950 and
$2000. The system destroyed $50 as a result of this failure.
In particular, we note that the sum A + B is no longer preserved. Thus, because of the failure, the state of the system no longer reflects a real state of the world that the database is supposed to capture. We term such a state an inconsistent state. We must ensure that such inconsistencies are not visible in a database system.
Note, however, that the system must at some point be in an inconsistent state. Even if transaction Ti is executed to completion, there exists a point at which the value of account A is $950 and the value of account B is $2,000, which is clearly an inconsistent state. This state, however, is eventually replaced by the consistent state where the value of account A is $950 and the value of account B is $2,050.
Thus, if the transaction never started or was guaranteed to complete, such an inconsistent state
would not be visible except during the execution of the transaction.
If the atomicity property is present, all actions of the transaction are reflected in the database or
none are.
Serializability:
When multiple transactions are being executed by the operating system in a multiprogramming environment, there are possibilities that instructions of one transaction are interleaved with those of some other transaction.
To resolve this problem, we allow parallel execution of a transaction schedule, if its transactions
are either serializable or have some equivalence relation among them.
Equivalence Schedules
An equivalence schedule can be of the following types −
Result Equivalence
If two schedules produce the same result after execution, they are said to be result equivalent.
They may yield the same result for some value and different results for another set of values.
That's why this equivalence is not generally considered significant.
View Equivalence
Two schedules would be view equivalent if the transactions in both schedules perform similar actions in a similar manner.
For example −
If T reads the initial data in S1, then it also reads the initial data in S2.
If T reads the value written by J in S1, then it also reads the value written by J in S2.
If T performs the final write on the data value in S1, then it also performs the final write
on the data value in S2.
Conflict Equivalence
Two operations are said to be conflicting if they belong to different transactions, access the same data item, and at least one of them is a write operation. Two schedules are conflict equivalent if one can be obtained from the other by swapping adjacent non-conflicting operations.
Durability:
Once the execution of the transaction completes successfully, and the user who initiated the
transaction has been notified that the transfer of funds has taken place, it must be the case that no
system failure will result in a loss of data corresponding to this transfer of funds.
The durability property guarantees that, once a transaction completes successfully, all the
updates that it carried out on the data base persist, even if there is a system failure after the
transaction complete execution.
We assume for now that a failure of the computer system may result in loss of data in main
memory, but data written to disk are never lost. We can guarantee durability by ensuring that
either The updates carried out by the transaction have been written to disk before the transaction
completes.
Information about the updates carried out by the transaction and written to disk is sufficient
to enable the database to reconstruct the updates when the database system is restarted after the
failure.
Ensuring durability is the responsibility of a component of the database system called the recovery management component. The transaction management component and the recovery management component are closely related.
Isolation:
Even if the consistency and atomicity properties are ensured for each transaction, if several
transactions are executed concurrently, their operations may interleave in some undesirable way,
resulting in an inconsistent state.
For example, as we saw earlier, the database is temporarily inconsistent while the transaction to
transfer funds from A to B is executing, with the deducted total written to A and the increased
total yet to be written to B.
If a second concurrently running transaction reads A and B at this intermediate point and
computes A + B it will observe an inconsistent value. Furthermore, if this second transaction
then performs updates on A and B based on the inconsistent values that it read, the database may
be left in an inconsistent state even after both transactions have completed.
Other solutions have therefore been developed; they allow multiple transactions to execute
concurrently.
The isolation property of a transaction ensures that the concurrent execution of transactions
results in a system state that is equivalent to a state that could have been obtained had these
transactions executed one at a time in some order.
Ensuring the isolation property is the responsibility of a component of the database system called
the concurrency control component.
4.7 Transaction isolation levels:
Transaction isolation levels are a measure of the extent to which transaction isolation succeeds.
In particular, transaction isolation levels are defined by the presence or absence of the following
phenomena:
Dirty Reads A dirty read occurs when a transaction reads data that has not yet been committed.
For example, suppose transaction 1 updates a row. Transaction 2 reads the updated row before
transaction 1 commits the update. If transaction 1 rolls back the change, transaction 2 will have
read data that is considered never to have existed.
Nonrepeatable Reads A nonrepeatable read occurs when a transaction reads the same row
twice but gets different data each time. For example, suppose transaction 1 reads a row.
Transaction 2 updates or deletes that row and commits the update or delete. If transaction 1
rereads the row, it retrieves different row values or discovers that the row has been deleted.
Phantoms A phantom is a row that matches the search criteria but is not initially seen. For
example, suppose transaction 1 reads a set of rows that satisfy some search criteria. Transaction
2 generates a new row (through either an update or an insert) that matches the search criteria for
transaction 1. If transaction 1 reexecutes the statement that reads the rows, it gets a different set
of rows.
The four transaction isolation levels (as defined by SQL-92) are defined in terms of these
phenomena. In the following table, an "X" marks each phenomenon that can occur.
Isolation level      Dirty read   Nonrepeatable read   Phantom
Read uncommitted     X            X                    X
Read committed       --           X                    X
Repeatable read      --           --                   X
Serializable         --           --                   --
The following table describes simple ways that a DBMS might implement the transaction
isolation levels.

Read uncommitted: Transactions are not isolated from each other. If the DBMS supports other
transaction isolation levels, it ignores whatever mechanism it uses to implement those levels. So
that they do not adversely affect other transactions, transactions running at the Read Uncommitted
level are usually read-only.

Read committed: The transaction waits until rows write-locked by other transactions are unlocked;
this prevents it from reading any "dirty" data. The transaction holds a read lock (if it only reads
the row) or a write lock (if it updates or deletes the row) on the current row to prevent other
transactions from updating or deleting it. The transaction releases read locks when it moves off the
current row; it holds write locks until it is committed or rolled back.

Repeatable read: The transaction waits until rows write-locked by other transactions are unlocked;
this prevents it from reading any "dirty" data. The transaction holds read locks on all rows it
returns to the application and write locks on all rows it inserts, updates, or deletes. For example,
if the transaction includes the SQL statement SELECT * FROM Orders, the transaction read-locks rows
as the application fetches them. If the transaction includes the SQL statement DELETE FROM Orders
WHERE Status = 'CLOSED', the transaction write-locks rows as it deletes them. Because other
transactions cannot update or delete these rows, the current transaction avoids any nonrepeatable
reads. The transaction releases its locks when it is committed or rolled back.

Serializable: The transaction waits until rows write-locked by other transactions are unlocked;
this prevents it from reading any "dirty" data. The transaction holds a read lock (if it only reads
rows) or a write lock (if it can update or delete rows) on the range of rows it affects. For example,
if the transaction includes the SQL statement SELECT * FROM Orders, the range is the entire Orders
table; the transaction read-locks the table and does not allow any new rows to be inserted into it.
If the transaction includes the SQL statement DELETE FROM Orders WHERE Status = 'CLOSED', the range
is all rows with a Status of 'CLOSED'; the transaction write-locks all rows in the Orders table with
a Status of 'CLOSED' and does not allow any rows to be inserted or updated such that the resulting
row has a Status of 'CLOSED'. Because other transactions cannot update or delete the rows in the
range, the current transaction avoids any nonrepeatable reads. Because other transactions cannot
insert any rows in the range, the current transaction avoids any phantoms. The transaction releases
its locks when it is committed or rolled back.
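As a practical illustration, an application typically selects one of these levels per transaction. The snippet below is a minimal sketch using the psycopg2 driver against a hypothetical PostgreSQL-style database; the connection string and the Orders table are placeholders, and the exact locking mechanism used to honour the request is up to the DBMS, as described above.

# Sketch: run one transaction under a chosen SQL-92 isolation level.
import psycopg2

conn = psycopg2.connect("dbname=test user=dbuser")   # placeholder DSN
try:
    with conn:                        # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Must be the first statement of the transaction.
            cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
            cur.execute("SELECT COUNT(*) FROM Orders WHERE Status = 'CLOSED'")
            (closed,) = cur.fetchone()
            cur.execute("DELETE FROM Orders WHERE Status = 'CLOSED'")
finally:
    conn.close()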
It is important to note that the transaction isolation level does not affect a transaction's ability to
see its own changes; transactions can always see any changes they make. For example, a
transaction might consist of two UPDATE statements, the first of which raises the pay of all
employees by 10 percent and the second of which sets the pay of any employees over some
maximum amount to that amount. This succeeds as a single transaction only because the
second UPDATE statement can see the results of the first.
4.8.1 Lock-Based Protocols:
A DBMS must be able to ensure that only serializable, recoverable schedules are allowed, and
that no actions of committed transactions are lost while undoing aborted transactions. A
DBMS typically uses a locking protocol to achieve this. A locking protocol is a set of rules to
be followed by each transaction, in order to ensure that even though actions of several
transactions might be interleaved, the net effect is identical to executing all transactions in
some serial order.
Strict Two-Phase Locking (Strict 2PL):
The most widely used locking protocol, called Strict Two-Phase Locking or Strict 2PL,
has two rules. The first rule is:
(1) If a transaction T wants to read (respectively, modify) an object, it first requests a shared (respectively, exclusive) lock on the object.
Of course, a transaction that has an exclusive lock can also read the object; an additional shared
lock is not required. A transaction that requests a lock is suspended until the DBMS is able to
grant it the requested lock. The DBMS keeps track of the locks it has granted and ensures that if
a transaction holds an exclusive lock on an object no other transaction holds a shared or
exclusive lock on the same object.
(2) All locks held by a transaction are released when the transaction is completed.
If a transaction accesses several pages of a file, it should lock the entire file, and if it accesses just a few pages, it should lock just
those pages. Similarly, if a transaction accesses several records on a page, it should lock the entire
page, and if it accesses just a few records, it should lock just those records.
The question to be addressed is how a lock manager can efficiently ensure that a page,
for example, is not locked by a transaction while another transaction holds a conflicting lock on
the file containing the page.
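A lock manager enforcing the two Strict 2PL rules can be sketched as a table of lock entries keyed by object id. The following is a minimal, single-object-granularity illustration with no queueing or deadlock handling; all names are chosen for this example only.

# Sketch: shared/exclusive lock table with Strict 2PL release-at-end semantics.
class LockManager:
    def __init__(self):
        self.locks = {}   # obj -> (mode, set of holding transactions)

    def acquire(self, txn, obj, mode):
        """mode is 'S' (shared) or 'X' (exclusive). Returns True if granted."""
        held = self.locks.get(obj)
        if held is None:
            self.locks[obj] = (mode, {txn})
            return True
        held_mode, holders = held
        if holders == {txn}:                  # re-request or upgrade by the sole holder
            self.locks[obj] = (max(held_mode, mode, key="SX".index), holders)
            return True
        if mode == "S" and held_mode == "S":  # shared locks are mutually compatible
            holders.add(txn)
            return True
        return False                          # conflict: caller must suspend the transaction

    def release_all(self, txn):
        """Strict 2PL: every lock is released only when the transaction completes."""
        for obj in list(self.locks):
            mode, holders = self.locks[obj]
            holders.discard(txn)
            if not holders:
                del self.locks[obj]

lm = LockManager()
print(lm.acquire("T1", "A", "S"))   # True
print(lm.acquire("T2", "A", "S"))   # True  (S compatible with S)
print(lm.acquire("T2", "A", "X"))   # False (conflicts with T1's shared lock)
lm.release_all("T1")
print(lm.acquire("T2", "A", "X"))   # True  (upgrade after T1 completes)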
The recovery manager of a DBMS is responsible for ensuring two important properties of
transactions: atomicity and durability. It ensures atomicity by undoing the actions of transactions
that do not commit and durability by making sure that all actions of committed transactions
survive system crashes, (e.g., a core dump caused by a bus error) and media failures (e.g., a
disk is corrupted).
The Log
The log, sometimes called the trail or journal, is a history of actions executed by the DBMS.
Physically, the log is a file of records stored in stable storage, which is assumed to survive
crashes; this durability can be achieved by maintaining two or more copies of the log on
different disks, so that the chance of all copies of the log being simultaneously lost is
negligibly small.
The most recent portion of the log, called the log tail, is kept in main memory and is
periodically forced to stable storage. This way, log records and data records are written to disk at
the same granularity.
Every log record is given a unique id called the log sequence number (LSN). As with
any record id, we can fetch a log record with one disk access given the LSN. Further, LSNs
should be assigned in monotonically increasing order; this property is required for the ARIES
recovery algorithm. If the log is a sequential file, in principle growing indefinitely, the LSN can
simply be the address of the first byte of the log record.
The transaction is considered to have committed at the instant that its commit log record is
written to stable storage
Abort: When a transaction is aborted, an abort type log record containing the transaction id is
appended to the log, and Undo is initiated for this transaction
End: As noted above, when a transaction is aborted or committed, some additional actions must
be taken beyond writing the abort or commit log record. After all these additional steps are
completed, an end type log record containing the transaction id is appended to the log.
Undoing an update: When a transaction is rolled back (because the transaction is aborted, or
during recovery from a crash), its updates are undone. When the action described by an update
log record is undone, a compensation log record,or CLR, is written.
In addition to the log, the following two tables contain important recovery-related information:
Transaction table: This table contains one entry for each active transaction. The entry
contains the transaction id, the status, and a field called lastLSN, which is the LSN of the
most recent log record for this transaction. The status of a transaction can be that it is in
progress, is committed, or is aborted.
Dirty page table: This table contains one entry for each dirty page in the buffer pool, that is,
each page with changes that are not yet reflected on disk. The entry contains a field recLSN,
which is the LSN of the first log record that caused the page to become dirty. Note that this LSN
identifies the earliest log record that might have to be redone for this page during restart from a
crash.
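A minimal sketch of these structures follows, using plain Python dictionaries; the field names lastLSN and recLSN follow the description above, and everything else is illustrative only.

# Sketch: log records plus the transaction table and dirty page table described above.
from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int                  # log sequence number, assigned in increasing order
    txn_id: str
    rec_type: str             # "update", "commit", "abort", "end", "CLR"
    page_id: str = None
    payload: str = None

log = []                      # in-memory log tail, periodically forced to stable storage
transaction_table = {}        # txn_id -> {"status": ..., "lastLSN": ...}
dirty_page_table = {}         # page_id -> recLSN (earliest record that dirtied the page)
_next_lsn = [0]

def append(txn_id, rec_type, page_id=None, payload=None):
    rec = LogRecord(_next_lsn[0], txn_id, rec_type, page_id, payload)
    _next_lsn[0] += 1
    log.append(rec)
    entry = transaction_table.setdefault(txn_id, {"status": "in progress", "lastLSN": None})
    entry["lastLSN"] = rec.lsn
    if rec_type == "update" and page_id not in dirty_page_table:
        dirty_page_table[page_id] = rec.lsn
    elif rec_type == "commit":
        entry["status"] = "committed"
    elif rec_type == "abort":
        entry["status"] = "aborted"
    elif rec_type == "end":
        transaction_table.pop(txn_id, None)
    return rec.lsn

append("T1", "update", page_id="P5", payload="A: 100 -> 90")
append("T1", "commit")
append("T1", "end")
print(dirty_page_table)       # {'P5': 0}
print(transaction_table)      # {}  (T1 removed by its end record)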
Checkpoint
A checkpoint is like a snapshot of the DBMS state, and by taking checkpoints periodically, as
we will see, the DBMS can reduce the amount of work to be done during restart in the event of a
subsequent crash.
4.8.3 Timestamp-based Protocols:
The most commonly used concurrency protocol is the timestamp based protocol. This protocol
uses either system time or logical counter as a timestamp.
Lock-based protocols manage the order between the conflicting pairs among transactions at the
time of execution, whereas timestamp-based protocols start working as soon as a transaction is
created.
Every transaction has a timestamp associated with it, and the ordering is determined by the age
of the transaction. A transaction created at 0002 clock time would be older than all other
transactions that come after it. For example, any transaction 'y' entering the system at 0004 is two
seconds younger and the priority would be given to the older one.
In addition, every data item is given the latest read and write-timestamp. This lets the system
know when the last ‘read and write’ operation was performed on the data item.
Timestamp Ordering Protocol
The timestamp-ordering protocol ensures serializability among transactions in their conflicting
read and write operations. This is the responsibility of the protocol system that the conflicting
pair of tasks should be executed according to the timestamp values of the transactions.
The timestamp of transaction Ti is denoted as TS(Ti).
Read time-stamp of data-item X is denoted by R-timestamp(X).
Write time-stamp of data-item X is denoted by W-timestamp(X).
Timestamp ordering protocol works as follows −
If a transaction Ti issues a read(X) operation −
o If TS(Ti) < W-timestamp(X)
Operation rejected.
o If TS(Ti) >= W-timestamp(X)
Operation executed.
o R-timestamp(X) is updated to the larger of its current value and TS(Ti).
If a transaction Ti issues a write(X) operation −
o If TS(Ti) < R-timestamp(X)
Operation rejected.
o If TS(Ti) < W-timestamp(X)
Operation rejected and Ti rolled back.
o Otherwise, operation executed.
Thomas' Write Rule
Under the basic timestamp-ordering protocol, if TS(Ti) < W-timestamp(X), the write operation is rejected and Ti is rolled back.
Thomas' write rule relaxes the timestamp-ordering rules to make the schedule view serializable:
instead of rolling Ti back, the obsolete 'write' operation itself is simply ignored.
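A minimal sketch of the read and write checks above, with Thomas' write rule as an option for writes; the per-item read/write timestamps are kept in plain dictionaries, which is illustrative only.

# Sketch: basic timestamp-ordering checks, with Thomas' write rule as a variant.
r_ts = {}   # item -> largest timestamp of any transaction that read it
w_ts = {}   # item -> timestamp of the transaction that last wrote it

def read(ts, item):
    if ts < w_ts.get(item, 0):
        return "reject (roll back transaction)"     # value was overwritten by a younger txn
    r_ts[item] = max(r_ts.get(item, 0), ts)
    return "execute"

def write(ts, item, thomas=False):
    if ts < r_ts.get(item, 0):
        return "reject (roll back transaction)"     # a younger txn already read the old value
    if ts < w_ts.get(item, 0):
        if thomas:
            return "ignore obsolete write"           # Thomas' write rule
        return "reject (roll back transaction)"
    w_ts[item] = ts
    return "execute"

print(write(10, "X"))              # execute: W-timestamp(X) becomes 10
print(read(5, "X"))                # reject: TS(Ti) = 5 < W-timestamp(X) = 10
print(write(7, "X", thomas=True))  # ignore obsolete write (no rollback)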
In cases where a majority of transactions are read-only transactions, the rate of conflicts among
transactions may be low. Thus, many of these transactions, if executed without the supervision of
a concurrency-control scheme, would nevertheless leave the system in a consistent state. A
concurrency-control scheme imposes overhead of code execution and possible delay of
transactions. It may be better to use an alternative scheme that imposes less overhead. A
difficulty in reducing the overhead is that we do not know in advance which transactions will be
involved in a conflict. To gain that knowledge, we need a scheme for monitoring the system.
We assume that each transaction Ti executes in two or three different phases in its lifetime,
depending on whether it is a read-only or an update transaction. The phases are, in order,
1. Read phase. During this phase, the system executes transaction Ti. It reads the values of the
various data items and stores them in variables local to Ti. It performs all write operations on
temporary local variables, without updates of the actual database.
2. Validation phase. Transaction Ti performs a validation test to determine whether it can copy
to the database the temporary local variables that hold the results of write operations without
causing a violation of serializability.
3. Write phase. If transaction Ti succeeds in validation (step 2), then the system applies the
actual updates to the database. Otherwise, the system rolls back Ti.
Each transaction must go through the three phases in the order shown. However, all three phases
of concurrently executing transactions can be interleaved.
To perform the validation test, we need to know when the various phases of transaction Ti
took place. We shall, therefore, associate three different timestamps with
transaction Ti:
1. Start(Ti), the time when Ti started its execution.
2. Validation(Ti ), the time when Ti finished its read phase and started its validation phase.
3. Finish(Ti), the time when Ti finished its write phase.
We determine the serializability order by the timestamp-ordering technique, using the value of
the timestamp Validation(Ti). Thus, the value TS(Ti) = Validation(Ti) and, if TS(Tj ) < TS(Tk ),
then any produced schedule must be equivalent to a serial schedule in which
transaction Tj appears before transaction Tk . The reason we have chosen Validation(Ti), rather
than Start(Ti), as the timestamp of transaction Ti is that we can expect faster response time
provided that conflict rates among transactions are indeed low.
The validation test for transaction Tj requires that, for all transactions Ti with TS(Ti) < TS(Tj ),
one of the following two conditions must hold:
1. Finish(Ti) < Start(Tj ). Since Ti completes its execution before Tj started, the serializability
order is indeed maintained.
2. The set of data items written by Ti does not intersect with the set of data items read by Tj ,
and Ti completes its write phase before Tj starts its validation phase
(Start(Tj ) < Finish(Ti) < Validation(Tj )). This condition ensures that
the writes of Ti and Tj do not overlap. Since the writes of Ti do not affect the read of Tj , and
since Tj cannot affect the read of Ti, the serializability order is indeed maintained.
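The two conditions of the validation test can be coded directly. The sketch below assumes each transaction records its read set, write set, and the three timestamps introduced above; all names are illustrative.

# Sketch: the optimistic (validation-based) test for transaction Tj against all
# earlier transactions Ti with TS(Ti) < TS(Tj).
def valid(tj, earlier):
    """tj and each ti are dicts with keys: start, validation, finish (None if still
    writing), read_set, write_set."""
    for ti in earlier:
        # Condition 1: Ti finished before Tj started.
        if ti["finish"] is not None and ti["finish"] < tj["start"]:
            continue
        # Condition 2: write set of Ti does not intersect read set of Tj, and
        # Ti finishes its write phase before Tj starts its validation phase.
        if (not ti["write_set"] & tj["read_set"]
                and ti["finish"] is not None
                and ti["finish"] < tj["validation"]):
            continue
        return False        # neither condition holds: Tj fails validation, roll back
    return True

t14 = {"start": 1, "validation": 4, "finish": 5, "read_set": {"A", "B"}, "write_set": set()}
t15 = {"start": 2, "validation": 6, "finish": None, "read_set": {"B"}, "write_set": {"A", "B"}}
print(valid(t15, [t14]))   # True: T14 wrote nothing T15 read, and finished before T15 validated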
As an illustration, consider again transactions T14 and T15. Suppose that TS(T14) < TS(T15).
Then, the validation phase succeeds in the schedule 5 in Figure 16.15. Note that the writes to the
actual variables are performed only after the validation phase of T15. Thus, T14 reads the old
values of B and A, and this schedule is serializable.
The validation scheme automatically guards against cascading rollbacks, since the actual writes
take place only after the transaction issuing the write has committed.
However, there is a possibility of starvation of long transactions, due to a sequence of conflicting
short transactions that cause repeated restarts of the long transaction.
To avoid starvation, conflicting transactions must be temporarily blocked, to enable the long
transaction to finish.
This validation scheme is called the optimistic concurrency control scheme since transactions
execute optimistically, assuming they will be able to finish execution and validate at the end. In
contrast, locking and timestamp ordering are pessimistic in that they force a wait or a rollback
whenever a conflict is detected, even though there is a chance that the schedule may be conflict
serializable.
Other protocols for concurrency control keep the old values of a data item when the item is
updated. These are known as multiversion concurrency control, because several versions
(values) of an item are maintained. When a transaction requires access to an item,
an appropriate version is chosen to maintain the serializability of the currently executing
schedule, if possible. The idea is that some read operations that would be rejected in other
techniques can still be accepted by reading an older version of the item to maintain
serializability. When a transaction writes an item, it writes a new version and the old version(s)
of the item are retained. Some multiversion concurrency control algorithms use the concept of
view serializability rather than conflict serializability.
Several multiversion concurrency control schemes have been proposed. We discuss two schemes
here, one based on timestamp ordering and the other based on 2PL. In addition, the validation
concurrency control method (see Section 22.4) also maintains multiple versions.
1. Multi version Technique Based on Timestamp Ordering
In this method, several versions X1, X2, ..., Xk of each data item X are maintained. For each
version, the value of version Xi and the following two timestamps are kept:
read_TS(Xi). The read timestamp of Xi is the largest of all the timestamps of transactions
that have successfully read version Xi.
write_TS(Xi). The write timestamp of Xi is the timestamp of the transaction that wrote the
value of version Xi.
If transaction T issues a read_item(X) operation, find the version i of X that has the
highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the
value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the
current read_TS(Xi).
If transaction T issues a write_item(X) operation, again find the version i of X that has the
highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T). If read_TS(Xi) > TS(T),
then abort and roll back T; otherwise, create a new version of X written by T, with its read and
write timestamps set to TS(T).
As we can see, a read_item(X) is always successful, since it finds the appropriate
version Xi to read based on the write_TS of the various existing versions of X. A write_item(X),
however, may cause transaction T to be aborted and rolled back. This happens if T attempts to write a
version of X that should have been read by another transaction T' whose timestamp is read_TS(Xi);
however, T' has already read version Xi, which was written by the transaction with timestamp
equal to write_TS(Xi). If this conflict occurs, T is rolled back; otherwise, a new version of X,
written by transaction T, is created. Notice that if T is rolled back, cascading rollback may occur.
Hence, to ensure recoverability, a transaction T should not be allowed to commit until after all
the transactions that have written some version that T has read have committed.
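A minimal sketch of these multiversion read/write rules on a single item; each item maps to a list of versions carrying its value, read_TS, and write_TS. Names and the dictionary representation are illustrative only.

# Sketch: multiversion timestamp ordering on a single item X.
versions = {"X": [{"value": 0, "read_TS": 0, "write_TS": 0}]}   # initial version

def pick(item, ts):
    """Version with the highest write_TS that is <= TS(T)."""
    cands = [v for v in versions[item] if v["write_TS"] <= ts]
    return max(cands, key=lambda v: v["write_TS"])

def read_item(ts, item):
    v = pick(item, ts)
    v["read_TS"] = max(v["read_TS"], ts)
    return v["value"]                       # a read is always successful

def write_item(ts, item, value):
    v = pick(item, ts)
    if v["read_TS"] > ts:                   # a younger transaction already read this version
        return "abort and roll back"
    if v["write_TS"] == ts:                 # T rewrites its own version
        v["value"] = value
    else:
        versions[item].append({"value": value, "read_TS": ts, "write_TS": ts})
    return "ok"

print(read_item(5, "X"))          # reads the initial version; its read_TS becomes 5
print(write_item(3, "X", 42))     # abort: the version was already read by timestamp 5
print(write_item(7, "X", 42))     # ok: a new version with write_TS = 7 is created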
2. Multiversion Two-Phase Locking Using Certify Locks
In this multiple-mode locking scheme, there are three locking modes for an item: read, write,
and certify, instead of just the two modes (read, write) discussed previously. Hence, the state
of LOCK(X) for an item X can be one of read-locked, write-locked, certify-locked, or unlocked.
In the standard locking scheme, with only read and write locks (see Section 22.1.1), a write lock
is an exclusive lock. We can describe the relationship between read and write locks in the
standard scheme by means of the lock compatibility table shown in Figure 22.6(a). An entry
of Yes means that if a transaction T holds the type of lock specified in the column header
on item X and if transaction T' requests the type of lock specified in the row header on the same
item X, then T' can obtain the lock because the locking modes are compatible. On the other hand,
an entry of No in the table indicates that the locks are not compatible, so T' must wait until T
releases the lock.
In the standard locking scheme, once a transaction obtains a write lock on an item, no other
transactions can access that item. The idea behind multiversion 2PL is to allow other
transactions T' to read an item X while a single transaction T holds a write lock on X. This is
accomplished by allowing two versions for each item X; one version must always have been
written by some committed transaction. The second version X' is created when a
transaction T acquires a write lock on the item. Other transactions can continue to read
the committed version of X while T holds the write lock. Transaction T can write the value of X' as
needed, without affecting the value of the committed version X. However, once T is ready to
commit, it must obtain a certify lock on all items that it currently holds write locks on before it
can commit. The certify lock is not compatible with read locks, so the transaction may have to
delay its commit until all its write-locked items are released by any reading transactions in order
to obtain the certify locks. Once the certify locks (which are exclusive locks) are acquired, the
committed version X of the data item is set to the value of version X', version X' is discarded, and
the certify locks are then released. The lock compatibility table for this scheme is shown in
Figure 22.6(b).
In this multiversion 2PL scheme, reads can proceed concurrently with a single write operation—
an arrangement not permitted under the standard 2PL schemes. The cost is that a transaction may
have to delay its commit until it obtains exclusive certify locks on all the items it has updated. It
can be shown that this scheme avoids cascading aborts, since transactions are only allowed to
read the version X that was written by a committed transaction.
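The read/write/certify compatibility relationships just described can be summarized in a small matrix. The sketch below simply encodes them as a dictionary; the layout mirrors the compatibility table referenced above, but the encoding itself is only an illustration.

# Sketch: lock compatibility for multiversion 2PL with certify locks.
# compatible[held][requested] -> can the requested lock be granted?
compatible = {
    "read":    {"read": True,  "write": True,  "certify": False},
    "write":   {"read": True,  "write": False, "certify": False},
    "certify": {"read": False, "write": False, "certify": False},
}
# A write lock is compatible with reads (readers see the committed version),
# but a certify lock is exclusive: it must wait until all read locks are released.
print(compatible["write"]["read"])     # True
print(compatible["read"]["certify"])   # False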
4.10 Recovery System:
Crash Recovery:
DBMS is a highly complex system with hundreds of transactions being executed every second.
The durability and robustness of a DBMS depends on its complex architecture and its underlying
hardware and system software. If it fails or crashes amid transactions, it is expected that the
system would follow some sort of algorithm or techniques to recover lost data.
4.11 Failure Classification:
To see where the problem has occurred, we generalize a failure into various categories, as
follows −
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from where it can’t
go any further. This is called transaction failure where only a few transactions or processes are
hurt.
Reasons for a transaction failure could be −
Logical errors − Where a transaction cannot complete because it has some code error or
any internal error condition.
System errors − Where the database system itself terminates an active transaction
because the DBMS is not able to execute it, or it has to stop because of some system
condition. For example, in case of deadlock or resource unavailability, the system aborts
an active transaction.
System Crash
There are problems − external to the system − that may cause the system to stop abruptly and
cause the system to crash. For example, interruptions in power supply may cause the failure of
underlying hardware or software failure.
Examples may include operating system errors.
Disk Failure
In early days of technology evolution, it was a common problem where hard-disk drives or
storage drives used to fail frequently.
Disk failures include formation of bad sectors, unreachability to the disk, disk head crash or any
other failure, which destroys all or a part of disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories −
Volatile storage − As the name suggests, a volatile storage cannot survive system
crashes. Volatile storage devices are placed very close to the CPU; normally they are
embedded onto the chipset itself. For example, main memory and cache memory are
examples of volatile storage. They are fast but can store only a small amount of
information.
Non-volatile storage − These memories are made to survive system crashes. They are
huge in data storage capacity, but slower in accessibility. Examples may include hard-
disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
4.12 Recovery and Atomicity:
When a system crashes, it may have several transactions being executed and various files opened
for them to modify the data items. Transactions are made of various operations, which are atomic
in nature. But according to ACID properties of DBMS, atomicity of transactions as a whole must
be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
It should check the states of all the transactions, which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled
back.
No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction −
Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe.
Log-based recovery works as follows −
The log file is kept on a stable storage media.
When a transaction enters the system and starts execution, it writes a log about it.
<Tn, Start>
When the transaction modifies an item X, it writes a log record as follows −
<Tn, X, V1, V2>
This record indicates that Tn has changed the value of X from V1 to V2.
When the transaction finishes, it logs −
<Tn, commit>
The database can be modified using two approaches −
Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.
Immediate database modification − Each log follows an actual database modification.
That is, the database is modified immediately after every operation.
Recovery with Concurrent Transactions
When more than one transaction are being executed in parallel, the logs are interleaved. At the
time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
Checkpoint is a mechanism where all the previous logs are removed from the system and stored
permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner −
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn,
Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log is found, it
puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list are then redone using their log records.
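The backward scan just described can be sketched as follows; the log is a list of tuples and the record shapes follow the <Tn, Start> / <Tn, Commit> convention used in these notes, so everything here is illustrative.

# Sketch: classify transactions into a redo-list and an undo-list by reading the log
# backwards from the end to the last checkpoint.
log = [
    ("CHECKPOINT",),
    ("T1", "START"),
    ("T1", "X", "100", "90"),
    ("T1", "COMMIT"),
    ("T2", "START"),
    ("T2", "Y", "50", "60"),
    # crash happens here: T2 never committed
]

def classify(log):
    redo, undo = set(), set()
    seen_commit = set()
    for rec in reversed(log):                # read the logs backwards
        if rec == ("CHECKPOINT",):
            break                            # stop at the last checkpoint
        txn, kind = rec[0], rec[1]
        if kind == "COMMIT":
            seen_commit.add(txn)
        elif kind == "START":
            (redo if txn in seen_commit else undo).add(txn)
    return redo, undo

redo_list, undo_list = classify(log)
print(redo_list)   # {'T1'}  -> redone
print(undo_list)   # {'T2'}  -> undone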
When the recovery manager is invoked after a crash, restart proceeds in three phases.
1. Analysis: Identifies dirty pages in the buffer pool and transactions that were active at the time of the crash.
2. Redo: Repeats all actions, starting from an appropriate point in the log, and restores the
database state to what it was at the time of the crash.
3. Undo: Undoes the actions of transactions that did not commit, so that the database reflects only the actions of committed transactions.
There are three main principles behind the ARIES recovery algorithm:
Write-ahead logging: Any change to a database object is first recorded in the log; the record in
the log must be written to stable storage before the change to the database object is written to
disk.
Repeating history during Redo: Upon restart following a crash, ARIES retraces all actions of
the DBMS before the crash and brings the system back to the exact state that it was in at the time
of the crash. Then, it undoes the actions of transactions that were still active at the time of the
crash.
Logging changes during Undo: Changes made to the database while undoing a transaction are
logged in order to ensure that such an action is not repeated in the event of repeated restarts.
A DBMS must manage a huge amount of data, and in the course of processing, the space required
for the blocks of data will often be greater than the memory space available. There is therefore a
need to manage a memory area into which blocks are loaded and from which they are unloaded. The
buffer manager is primarily responsible for managing the operations involved in saving and loading
blocks. The operations that the buffer manager provides are:
* FIX: This command tells the buffer manager to load a block from disk and return the pointer to
the memory where it is loaded. If the block was already in memory, the buffer manager only needs
to return the pointer; otherwise it must load the block from disk and bring it into memory. If the
buffer memory is full, two situations are possible:
o There is the possibility of releasing a portion of memory that is occupied by transactions
already completed. In this case, before freeing the area, its content is written to disk if any
block of the area has been changed.
o There is no memory that can be freed, because it is all occupied by transactions still ongoing.
In this case, the buffer manager can work in two ways: in the first mode (STEAL), it frees buffer
memory occupied by a transaction that is still active, saving its changes to disk if necessary; in
the second mode (NO STEAL), the transaction that requested the block is made to wait until memory
is freed.
* SET DIRTY: invoking this command marks a block of memory as modified.
Before introducing the last two commands, note that the DBMS can operate in two modes: FORCE and
NO FORCE. When working in FORCE mode, the write to disk is performed synchronously with the commit
of a transaction. When working in NO FORCE mode, the write is carried out from time to time in an
asynchronous manner. Typically, commercial databases operate in NO FORCE mode because this allows
an increase in performance: a block may undergo multiple changes in memory before being saved, so
the saves can be scheduled for when the system is lightly loaded.
* FORCE: This command causes the buffer manager to perform the write synchronously with the
completion (commit) of the transaction.
* FLUSH: This command causes the buffer manager to perform the write when operating in NO FORCE
mode.
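A minimal sketch of this interface (FIX, SET DIRTY, FORCE, FLUSH) with a trivial replacement policy; the "disk" is simulated with a dictionary and all class and method names are illustrative only.

# Sketch: a toy buffer manager exposing FIX / SET DIRTY / FORCE / FLUSH.
class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk          # block_id -> contents (simulated disk)
        self.frames = {}          # block_id -> contents (in-memory copies)
        self.dirty = set()

    def fix(self, block_id):
        """Load a block (if needed) and return its in-memory copy."""
        if block_id not in self.frames:
            if len(self.frames) >= self.capacity:
                victim = next(iter(self.frames))      # naive victim choice
                self.flush(victim)
                del self.frames[victim]
            self.frames[block_id] = self.disk[block_id]
        return self.frames[block_id]

    def set_dirty(self, block_id):
        self.dirty.add(block_id)                      # mark block as modified

    def flush(self, block_id):
        """Write one block back to disk (used asynchronously in NO FORCE mode)."""
        if block_id in self.dirty:
            self.disk[block_id] = self.frames[block_id]
            self.dirty.discard(block_id)

    def force(self, block_ids):
        """Write the given blocks synchronously, e.g. at commit (FORCE mode)."""
        for b in block_ids:
            self.flush(b)

disk = {"B1": "old", "B2": "other"}
bm = BufferManager(capacity=1, disk=disk)
bm.fix("B1")
bm.frames["B1"] = "new"
bm.set_dirty("B1")
bm.force(["B1"])
print(disk["B1"])   # "new"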
Although failures in which the content of nonvolatile storage is lost are rare, we nevertheless need to be
prepared to deal with this type of failure. In this section, we discuss only disk storage. Our
discussions apply as well to other nonvolatile storage types.
The basic scheme is to dump the entire content of the database to stable storage periodically—
say, once per day. For example, we may dump the database to one or more magnetic tapes. If a
failure occurs that results in the loss of physical database blocks, the system uses the most recent
dump in restoring the database to a previous consistent state. Once this restoration has been
accomplished, the system uses the log to bring the database system to the most recent consistent
state.
More precisely, no transaction may be active during the dump procedure, and a procedure similar
to checkpointing must take place:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all buffer blocks onto the disk.
3. Copy the contents of the database to stable storage.
4. Output a log record <dump> onto the stable storage.
Steps 1, 2, and 4 correspond to the three steps used for checkpoints in Section 17.4.3.
To recover from the loss of nonvolatile storage, the system restores the database to disk by using
the most recent dump. Then, it consults the log and redoes all the transactions that have
committed since the most recent dump occurred. Notice that no undo operations need to be
executed.
A dump of the database contents is also referred to as an archival dump, since we can archive
the dumps and use them later to examine old states of the database.
Dumps of a database and checkpointing of buffers are similar.
The simple dump procedure described here is costly for the following two reasons.
First, the entire database must be copied to stable storage, resulting in considerable data
transfer. Second, since transaction processing is halted during the dump procedure, CPU cycles
are wasted. Fuzzy dump schemes have been developed, which allow transactions to be active
while the dump is in progress. They are similar to fuzzy checkpointing schemes; see the
bibliographical notes for more details.
4.16 Early Lock Release and Logical Undo Operations:
The following actions are taken when recovering from system crash
1. (Redo phase): Scan log forward from the last <checkpoint L> record till end of log
1. Repeat history by physically redoing all updates of all transactions,
2. Create an undo-list during the scan as follows
undo-list is set to L initially
Whenever <Ti start> is found Ti is added to undo-list
Whenever <Ti commit> or <Ti abort> is found, Ti is deleted from undo-list
This brings the database to the state it was in as of the crash, with committed as well as uncommitted transactions having
been redone.
Now the undo-list contains transactions that are incomplete, that is, transactions that have neither committed nor been
fully rolled back.
A remote, online, or managed backup service, sometimes marketed as cloud backup or backup-
as-a-service, is a service that provides users with a system for the backup, storage, and recovery
of computer files. Online backup providers are companies that provide this type of service to end
users (or clients). Such backup services are considered a form of cloud computing.
Online backup systems are typically built around a client software program that runs on a
schedule. Some systems run once a day, usually at night while computers aren't in use. Other
newer cloud backup services run continuously to capture changes to user systems nearly in real-
time. The online backup client typically collects, compresses, encrypts, and transfers the data to
the remote backup service provider's servers or off-site hardware.
There are many products on the market – all offering different feature sets, service levels, and
types of encryption. Providers of this type of service frequently target specific market segments.
High-end LAN-based backup systems may offer services such as Active Directory, client remote
control, or open file backups. Consumer online backup companies frequently have beta software
offerings and/or free-trial backup services with fewer live support options.
UNIT-V
In this Unit we discuss about Data storage and retrieval. It deals with disk, file, and file system
structure, and with the mapping of relational and object data to a file system. A variety of data
access techniques are presented in this unit, including hashing, B+-tree indices, and grid file
indices. External sorting which will be done in secondary memory is discussed here.
Contents:
File Organisation:
Storage Media
Buffer Management
Record and Page formats
File organizations
Various kinds of indexes and external sorting
ISAM
B+ trees
Extendible vs. Linear Hashing
This chapter covers the internals of an RDBMS.
The lowest layer of the software deals with management of space on disk, where the data is to be
stored. Higher layers allocate, deallocate, read and write pages through (routines provided by)
this layer, called the disk space manager.
On top of the disk space manager, we have the buffer manager, which partitions the available
main memory into a collection of page-sized frames. The purpose of the buffer manager is to bring
pages in from disk to main memory as needed in response to read requests from transactions.
The next layer includes a variety of software for supporting the concepts of a file, which, in
DBMS, is a collection of pages or a collection of records. This layer typically supports a heap
file, or file of unordered pages, as well as indexes. In addition to keeping track of the pages in a
file, this layer organizes the information within a page.
The code that implements relational operators sits on top of the file and access methods layer.
These operators serve as the building blocks for evaluating queries posed against the data.
When a user issues a query, the query is presented to a query optimizer, which uses information
about how the data is stored to produce an efficient execution plan for evaluating the query. An
execution plan is usually represented as a tree of relational operators, with annotations that
contain additional detailed information about which access methods to use.
Data in a DBMS is stored on storage devices such as disks and tapes ; the disk space manager is
responsible for keeping track of available disk space. The file manager, which provides the
abstraction of a file of records to higher levels of DBMS code, issues requests to the disk space
manager to obtain and relinquish space on disk.
When a record is needed for processing, it must be fetched from disk to main memory. The page
on which the record resides is determined by the file manager.
Sometimes, the file manager uses auxiliary data structures to quickly identify the page that
contains a desired record. After identifying the required page, the file manager issues a request
for the page to a layer of DBMS code called the buffer manager. The buffer manager fetches
requested pages from disk into a region of main memory called the buffer pool, and informs the
file manager.
5.1 Overview of Storage and Indexing:
Databases are stored in file formats, which contain records. At physical level, the actual data is
stored in electromagnetic format on some device. These storage devices can be broadly
categorized into three types −
Primary Storage − The memory storage that is directly accessible to the CPU comes
under this category. CPU's internal memory (registers), fast memory (cache), and main
memory (RAM) are directly accessible to the CPU, as they are all placed on the
motherboard or CPU chipset. This storage is typically very small, ultra-fast, and volatile.
Primary storage requires continuous power supply in order to maintain its state. In case of
a power failure, all its data is lost.
Secondary Storage − Secondary storage devices are used to store data for future use or as
backup. Secondary storage includes memory devices that are not a part of the CPU
chipset or motherboard, for example, magnetic disks, optical disks (DVD, CD, etc.), hard
disks, flash drives, and magnetic tapes.
Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since such
storage devices are external to the computer system, they are the slowest in speed. These
storage devices are mostly used to take the back up of an entire system. Optical disks and
magnetic tapes are widely used as tertiary storage.
Memory Hierarchy
A computer system has a well-defined hierarchy of memory. A CPU has direct access to its main
memory as well as its inbuilt registers. The access time of main memory is, however, much longer
than a CPU cycle. To minimize this speed mismatch, cache memory is introduced. Cache
memory provides the fastest access time and it contains data that is most frequently accessed by
the CPU.
The memory with the fastest access is the costliest one. Larger storage devices offer slow speed
and they are less expensive, however they can store huge volumes of data as compared to CPU
registers or cache memory.
Magnetic Disks
Hard disk drives are the most common secondary storage devices in present computer systems.
These are called magnetic disks because they use the concept of magnetization to store
information. Hard disks consist of metal disks coated with magnetizable material. These disks
are placed vertically on a spindle. A read/write head moves in between the disks and is used to
magnetize or de-magnetize the spot under it. A magnetized spot can be recognized as 0 (zero) or
1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard disk plate has
many concentric circles on it, called tracks. Every track is further divided into sectors. A sector
on a hard disk typically stores 512 bytes of data.
Redundant Array of Independent Disks
RAID or Redundant Array of Independent Disks, is a technology to connect multiple secondary
storage devices and use them as a single storage media.
RAID consists of an array of disks in which multiple disks are connected together to achieve
different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into blocks and the
blocks are distributed among disks. Each disk receives a block of data to write/read in parallel. It
enhances the speed and performance of the storage device. There is no parity and backup in
Level 0.
RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a copy of
data to all the disks in the array. RAID level 1 is also called mirroring and provides 100%
redundancy in case of a failure.
RAID 2
RAID 2 records Error Correction Code using Hamming distance for its data, striped on different
disks. Like level 0, each data bit in a word is recorded on a separate disk and ECC codes of the
data words are stored on a different set of disks. Due to its complex structure and high cost, RAID 2
is not commercially available.
RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each data word is stored on a
different disk. This technique makes it possible to overcome single-disk failures.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity is generated and
stored on a different disk. Note that level 3 uses byte-level striping, whereas level 4 uses block-
level striping. Both level 3 and level 4 require at least three disks to implement RAID.
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits generated for data block
stripe are distributed among all the data disks rather than storing them on a different dedicated
disk.
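The parity used in levels 3 through 6 is an XOR over the data blocks of a stripe, so any single missing block can be recovered from the remaining blocks plus the parity. A small sketch, with byte strings standing in for disk blocks:

# Sketch: XOR parity as used in RAID 3/4/5 - recover one lost block from the others.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d1, d2, d3 = b"\x01\x02", b"\x0f\x00", b"\x10\x20"
parity = xor_blocks([d1, d2, d3])            # written to the parity block
recovered_d2 = xor_blocks([d1, d3, parity])  # the disk holding d2 fails
print(recovered_d2 == d2)                    # True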
RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated and stored
in distributed fashion among multiple disks. Two parities provide additional fault tolerance. This
level requires at least four disk drives to implement RAID.
5.2 Data on External Storage:
Secondary Storage − Secondary storage devices are used to store data for future use or as
backup. Secondary storage includes memory devices that are not a part of the CPU chipset or
motherboard, for example, magnetic disks, optical disks (DVD, CD, etc.), hard disks, flash
drives, and magnetic tapes.
File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
File Organization to organize file records −
Clustered File Organization
Clustered file organization is not considered good for large databases. In this mechanism, related
records from one or more relations are kept in the same disk block, that is, the ordering of
records is not based on primary key or search key.
File Operations
Operations on database files can be broadly classified into two categories −
Update Operations
Retrieval Operations
Update operations change the data values by insertion, deletion, or update. Retrieval
operations, on the other hand, do not alter the data but retrieve them after optional
conditional filtering. In both types of operations, selection plays a significant role. Other
than creation and deletion of a file, there could be several operations, which can be done
on files.
Open − A file can be opened in one of the two modes, read mode or write mode. In
read mode, the operating system does not allow anyone to alter data. In other words, data
is read only. Files opened in read mode can be shared among several entities. Write mode
allows data modification. Files opened in write mode can be read but cannot be shared.
Locate − Every file has a file pointer, which tells the current position where the data is to
be read or written. This pointer can be adjusted accordingly. Using find (seek) operation,
it can be moved forward or backward.
Read − By default, when files are opened in read mode, the file pointer points to the
beginning of the file. There are options where the user can tell the operating system
where to locate the file pointer at the time of opening a file. The very next data to the file
pointer is read.
Write − User can select to open a file in write mode, which enables them to edit its
contents. It can be deletion, insertion, or modification. The file pointer can be located at
the time of opening or can be dynamically changed if the operating system allows to do
so.
Close − This is the most important operation from the operating system’s point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process to locate the file
pointer to a desired record inside a file varies based on whether the records are arranged
sequentially or clustered. We know that data is stored in the form of records. Every record has a key field,
which helps it to be recognized uniquely.
Indexing is a data structure technique to efficiently retrieve records from the database files based
on some attributes on which the indexing has been done. Indexing in database systems is similar
to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
Primary Index − Primary index is defined on an ordered data file. The data file is
ordered on a key field. The key field is generally the primary key of the relation.
Secondary Index − Secondary index may be generated from a field which is a candidate
key and has a unique value in every record, or a non-key with duplicate values.
Clustering Index − Clustering index is defined on an ordered data file. The data file is
ordered on a non-key field.
Ordered Indexing is of two types −
Dense Index
Sparse Index
Dense Index
In dense index, there is an index record for every search key value in the database. This makes
searching faster but requires more space to store index records itself. Index records contain
search key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here
contains a search key and an actual pointer to the data on the disk. To search a record, we first
proceed to the index record and reach the actual location of the data. If the data we are looking for
is not where we directly reach by following the index, then the system starts a sequential search
until the desired data is found.
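A minimal sketch contrasting a sparse index lookup over a sorted data file; blocks and the index are plain Python lists, and all names are illustrative only.

# Sketch: searching a sorted file through a sparse index.
# Each data block holds a few records sorted on the search key.
blocks = [
    [(1, "a"), (3, "b")],
    [(5, "c"), (7, "d")],
    [(9, "e"), (12, "f")],
]
# Sparse index: one entry per block -> (first key in block, block number).
sparse_index = [(1, 0), (5, 1), (9, 2)]

def lookup(key):
    # Find the last index entry whose key is <= the search key...
    block_no = 0
    for k, b in sparse_index:
        if k <= key:
            block_no = b
        else:
            break
    # ...then scan sequentially inside that block.
    for k, value in blocks[block_no]:
        if k == key:
            return value
    return None

print(lookup(7))    # "d"
print(lookup(8))    # None (not found after the sequential scan inside block 1)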
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored on the
disk along with the actual database files. As the size of the database grows, so does the size of
the indices. There is an immense need to keep the index records in the main memory so as to
speed up the search operations. If single-level index is used, then a large size index cannot be
kept in memory which leads to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices in order to make
the outermost level so small that it can be saved in a single disk block, which can easily be
accommodated anywhere in the main memory.
The costs of some simple operations for three basic file organizations:
Scan :
Fetch all records in the file. The pages in the file must be fetched from disk into the buffer pool.
There is also a CPU overhead per record for locating the record on the page ( in the pool).
Search with equality selection:
Fetch all records that satisfy an equality selection, for example, "find the students record for the
student with sid 23". Pages that contain qualifying records must be fetched from disk, and
qualifying records must be located within the retrieved pages.
Search with range selection:
Fetch all records that satisfy a range selection, for example, "find all students records with name
alphabetically after 'Smith'".
Insert :
Insert a given record into the file. We must identify the page in the file into which the new record
must be inserted, fetch that page from disk, modify it to include the new record, and then write
back the modified page. Depending on the file organization, we may have to fetch, modify and
write back other pages as well.
Delete :
Delete a record that is specified using its record identity (rid). We must identify the page that
contains the record, fetch it from disk, modify it, and write it back. Depending on the file
organization, we may have to fetch, modify and write back other pages as well.
Heap files :
Scan :
The cost is B(D+RC) because we must retrieve each of B pages taking time D per page, and for
each page, process R records taking time C per record.
Search with equality selection:
For each retrieved data page, the user must check all records on the page to see if it is the desired
record. The average cost is 0.5B(D+RC). If there is no record that satisfies the selection, then the user must
scan the entire file to verify it.
Insert : Assume that records are always inserted at the end of the file so fetch the last page in the
file, add the record, and write the page back. The cost is 3D+C.
Delete :
First find the record, remove the record from the page, and write the modified page back. For
simplicity, assumption is made that no attempt is made to compact the file to reclaim the free
space created by deletions. The cost is the cost of searching plus C+D.
The record to be deleted is specified using the record id. Since the page id can easily be obtained
from the record it, user can directly read in the page. The cost of searching is therefore D
Sorted files :
The files sorted on a sequence of field are known as sorted files.
The various operation of sorted files are
Scan : The cost is B(D+RC) because all pages must be examined; however, the order in which records
are retrieved corresponds to the sort order.
(ii) Search with equality selection:
Here the assumption is made that the equality selection is specified on the field by which the
file is sorted; if not, the cost is identical to that for a heap file. To locate the first page
containing the desired record or records, assuming qualifying records exist, we can perform a binary search
in log2 B steps. Each step requires a disk I/O and two comparisons. Once the page is known, the
first qualifying record can again be located by a binary search of the page at a cost of Clog2 R.
The cost is Dlog2 B + Clog2 R, which is a significant improvement over searching heap files.
(iv) Insert :
To insert a record preserving the sort order, first find the correct position in the file, add
the record, and then fetch and rewrite all subsequent pages. On average, assume that the inserted
record belongs in the middle of the file. Thus, read the latter half of the file and then write it back
after adding the new record. The cost is therefore the cost of searching to find the position of the
new record plus 2 * (0.5B(D+RC)), that is, search cost plus B(D+RC)
(v) Delete :
First search for the record, remove the record from the page, and write the modified page
back. User must also read and write all subsequent pages because all records that follow the
deleted record must be moved up to compact the free space. The cost is search cost plus
B(D+RC). Given the record identity (rid) of the record to delete, the user can fetch the page
containing the record directly.
Hashed files :
A hashed file has an associated search key, which is a combination of one or more fields of the
file. It enables us to locate records with a given search key value quickly, for example, "Find the
students record for Joe” if the file is hashed on the name field we can retrieve the record quickly.
This organization is called a static hashed file; its main drawback is that long chains of overflow
pages can develop. This can affect performance because all pages in a bucket have to be
searched.
The various operations of hashed files are ;
Fig: File Hashed on age,with Index on salary
Scan :
In a hashed file, pages are kept at about 80% occupancy (in order to leave some space for future
insertions and minimize overflow pages as the file expands). This is achieved by adding a new
page to a bucket when each existing page is 80% full, when records are initially organized into a
hashed file structure. Thus the number of pages, and therefore the cost of scanning all the data
pages, is about 1.25 times the cost of scanning an unordered file, that is, 1.25B(D+RC).
Search with equality selection:
The hash function associated with a hashed file maps a record to a bucket based on the values in
all the search key fields; if the value for any one of these fields is not specified, we cannot tell
which bucket the record belongs to. Thus, if the selection is not an equality condition on all the
search key fields, we have to scan the entire file.
Search with Range selection :
The hash structure offers no help at all; even if the range selection is on the search key, the
entire file must be scanned. The cost is 1.25B(D+RC).
Insert :
The appropriate page must be located, modified and then written back. The cost is thus the cost
of search plus C+D.
Delete :
We must search for the record, remove it from the page, and write the modified page back. The
cost is again the cost of search plus C+D (for writing the modified page ).
The following comparison summarizes the I/O costs for the three file organizations:
A heap file has good storage efficiency, and supports fast scan, insertion, and deletion of
records. However it is slow for searches.
A sorted file also offers good storage efficiency, but insertion and deletion of records is slow.
It is quick for searches, and in particular, it is the best structure for range selections.
A hashed file does not utilize space quite as well as sorted file, but insertions and deletions
are fast, and equality selections are very fast. However, the structure offers no support for range
selections, and full file scans are a little slower; the lower space utilization means that files contain
more pages.
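The formulas above (B data pages, R records per page, D per-page I/O time, C per-record processing time) can be evaluated directly for an example file. In the sketch below the equality-search formula for the hashed file (one bucket page fetched, about half its records examined) is an assumption not spelled out in the notes; the other entries follow the formulas given above.

# Sketch: scan and equality-search costs for heap, sorted, and hashed files.
import math

B, R, D, C = 1000, 100, 15.0, 0.1   # pages, records/page, ms per I/O, ms per record

costs = {
    "heap":   {"scan": B * (D + R * C),
               "equality search": 0.5 * B * (D + R * C)},
    "sorted": {"scan": B * (D + R * C),
               "equality search": D * math.log2(B) + C * math.log2(R)},
    "hashed": {"scan": 1.25 * B * (D + R * C),
               "equality search": D + 0.5 * R * C},   # assumption: one bucket page, half its records
}
for org, ops in costs.items():
    print(org, {op: round(ms, 1) for op, ms in ops.items()})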
The potentially large size of the index file motivates the ISAM idea: build an auxiliary index file on
the index file, and so on recursively, until the final auxiliary file fits on one page. This repeated
construction of a one-level index leads to a tree structure that is illustrated in the figure. The data
entries of the ISAM index are in the leaf pages of the tree and additional overflow pages that are
chained to some leaf page. In addition, some systems carefully organize the layout of pages so
that page boundaries correspond closely to the physical characteristics of the underlying storage
device. The ISAM structure is completely static and facilitates such low-level optimizations.
5.6 B+ Tree:
A B+ tree is a balanced multiway search tree that follows a multi-level index format. The leaf nodes of
a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same height,
and is thus balanced. Additionally, the leaf nodes are linked using a linked list; therefore, a B+ tree can
support random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at an equal distance from the root node. A B+ tree is of order n, where n is
fixed for every B+ tree.
Internal nodes −
Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
At most, an internal node can contain n pointers.
Leaf nodes −
Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
At most, a leaf node can contain n record pointers and n key values.
Every leaf node contains one block pointer P to point to next leaf node and forms a
linked list.
B+ Tree Insertion
B+ trees are filled from the bottom, and each entry is made at a leaf node.
If a leaf node overflows −
o Split the node into two parts.
o Partition at i = ⌊(n+1)/2⌋.
o The first i entries are stored in one node.
o The rest of the entries (i+1 onwards) are moved to a new node.
o The i-th key is duplicated at the parent of the leaf (a small split sketch follows these insertion rules).
If a non-leaf node overflows −
o Split the node into two parts.
o Partition the node at i = ⌈(n+1)/2⌉.
o Entries up to i are kept in one node.
o The rest of the entries are moved to a new node.
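The sketch below illustrates just the leaf-split rule from the steps above, using the common convention that the smallest key of the new right leaf is the key copied up to the parent; the function name and sample keys are hypothetical, and this is not a complete B+ tree implementation.

    def split_leaf(keys, order):
        """Split an overflowing leaf of a B+ tree of the given order.

        Returns (left_keys, right_keys, key_copied_to_parent). The separating key
        stays in the right leaf and is copied (not moved) into the parent, which is
        what distinguishes a B+ tree from a B tree.
        """
        assert len(keys) == order + 1        # the leaf has just overflowed
        i = (order + 1) // 2                 # partition point i = floor((n+1)/2)
        left, right = keys[:i], keys[i:]
        return left, right, right[0]

    # Example: an order-4 leaf overflows after inserting 25
    left, right, up = split_leaf([10, 20, 25, 30, 40], order=4)
    print(left, right, up)                   # [10, 20] [25, 30, 40] 25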
B+ Tree Deletion
B+ tree entries are deleted at the leaf nodes.
The target entry is searched and deleted.
o If it is an internal node, delete and replace with the entry from the left position.
After deletion, underflow is tested.
o If underflow occurs, borrow (redistribute) entries from the node to its left.
If redistribution from the left is not possible, then
o Redistribute from the node to its right.
If redistribution is not possible from either side, then
o Merge the node with its left or right sibling.
Static Hashing: Operations
Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.
Bucket Overflow
The condition of bucket overflow is known as a collision. A static hash function cannot avoid
collisions, so overflow handling is needed; in this case, overflow chaining can be used.
Overflow Chaining − When buckets are full, a new bucket is allocated for the same hash
result and is linked after the previous one. This mechanism is called Closed Hashing.
Linear Probing − When a hash function generates an address at which data is already
stored, the next free bucket is allocated to it. This mechanism is called Open Hashing.
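A minimal sketch of these static-hashing operations, assuming a toy hash function h(K) = K mod N and in-memory lists standing in for pages; the class and parameter names (StaticHashFile, bucket_capacity, and so on) are hypothetical, and overflow chaining is modelled by appending extra pages to a bucket's chain.

    class StaticHashFile:
        """Toy static hashed file: N primary buckets, overflow chaining on collision."""

        def __init__(self, n_buckets=4, bucket_capacity=2):
            self.n = n_buckets
            self.cap = bucket_capacity
            # each bucket is a chain of pages; extra pages are overflow pages
            self.buckets = [[[]] for _ in range(n_buckets)]

        def _h(self, key):
            return key % self.n              # bucket address = h(K)

        def insert(self, key, record):
            chain = self.buckets[self._h(key)]
            for page in chain:
                if len(page) < self.cap:     # room on an existing page
                    page.append((key, record))
                    return
            chain.append([(key, record)])    # all pages full: chain a new overflow page

        def search(self, key):
            # an equality search touches only one bucket chain, never the whole file
            return [rec for page in self.buckets[self._h(key)]
                    for k, rec in page if k == key]

        def delete(self, key):
            for page in self.buckets[self._h(key)]:
                page[:] = [(k, r) for k, r in page if k != key]

    f = StaticHashFile()
    for rid in [3, 7, 11, 15, 19]:           # all hash to bucket 3, so its overflow chain grows
        f.insert(rid, f"record-{rid}")
    print(f.search(11))                      # ['record-11']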
Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the
size of the database grows or shrinks. Dynamic hashing provides a mechanism in which
data buckets are added and removed dynamically and on-demand. Dynamic hashing is
also known as extendible (or extendable) hashing.
Hash function, in dynamic hashing, is made to produce a large number of values and only
a few are used initially.
Organization
The prefix of the hash value is taken as the hash index; only a portion of the hash value is used
for computing bucket addresses. Every hash index has a depth value that signifies how many bits
are used for computing bucket addresses; with a depth of n, these bits can address 2^n buckets.
When all these bits are consumed − that is, when all the buckets are full − the depth value is
increased by one and the number of buckets is doubled.
Operation
Querying − Look at the depth value of the hash index and use those bits to compute the
bucket address.
Update − Perform a query as above and update the data.
Deletion − Perform a query to locate the desired data and delete the same.
Insertion − Compute the address of the bucket.
If the bucket is already full −
o Add more buckets.
o Add additional bits to the hash value.
o Re-compute the hash function.
Else −
o Add the data to the bucket.
If all the buckets are still full, perform the remedies of static hashing (overflow chaining). A
small sketch of this extendible hashing scheme follows below.
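The sketch below is a simplified, in-memory illustration of this scheme: a directory of 2^global_depth entries indexed by bits of the hash value, buckets with a local depth, and directory doubling when a full bucket's local depth equals the global depth. For simplicity it uses the low-order bits of the hash value rather than a prefix; the idea is the same. All names (ExtendibleHash, bucket_capacity, and so on) are hypothetical, and this is a sketch rather than a production implementation.

    class ExtendibleHash:
        """Toy extendible (dynamic) hashing: directory of 2**global_depth pointers."""

        def __init__(self, bucket_capacity=2):
            self.cap = bucket_capacity
            self.global_depth = 1
            b0, b1 = {"depth": 1, "items": []}, {"depth": 1, "items": []}
            self.directory = [b0, b1]

        def _index(self, key):
            # use the low-order global_depth bits of the hash value
            return hash(key) & ((1 << self.global_depth) - 1)

        def insert(self, key, value):
            bucket = self.directory[self._index(key)]
            if len(bucket["items"]) < self.cap:
                bucket["items"].append((key, value))
                return
            if bucket["depth"] == self.global_depth:
                # the bucket cannot be split with the current directory: double it
                self.directory = self.directory * 2
                self.global_depth += 1
            self._split(bucket)
            self.insert(key, value)          # retry after the split

        def _split(self, bucket):
            bucket["depth"] += 1
            new = {"depth": bucket["depth"], "items": []}
            old_items, bucket["items"] = bucket["items"], []
            # re-point half of the directory entries that referenced the old bucket
            for i, b in enumerate(self.directory):
                if b is bucket and (i >> (bucket["depth"] - 1)) & 1:
                    self.directory[i] = new
            for k, v in old_items:           # redistribute the old entries
                self.directory[self._index(k)]["items"].append((k, v))

        def search(self, key):
            return [v for k, v in self.directory[self._index(key)]["items"] if k == key]

    eh = ExtendibleHash()
    for k in range(8):
        eh.insert(k, f"rec-{k}")
    print(eh.global_depth, eh.search(5))     # the directory has doubled; search touches one bucket

With a capacity of two entries per bucket, inserting the keys 0 to 7 doubles the directory once, and a subsequent equality search still reads only a single bucket.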
Hashing is not favorable when the data is organized in some order and the queries require a
range of data; when the data is discrete and random, hashing performs best. Hashing schemes are
more complex to implement than tree-based indexing, but individual hash operations run in
(expected) constant time.
Assignment Questions
UNIT – I
1. Discuss about Data Definition language, Data Manipulation language commands with
example?
2. Elaborate on the relational model. Explain the various domain and integrity constraints in the
relational model with examples.
3. Name the main steps in database design. What is the goal of each step? In which step is
the ER model mainly used?
4. Justify the difference between binary and ternary relationships.
5. Organize the process of evaluating a query using the conceptual evaluation strategy with an
example.
UNIT – II
UNIT – III
UNIT – IV
1. Organize a locking protocol? Describe the Strict Two Phase Locking Protocol? What can
you say about the schedules allowed by this protocol?
2. Discuss short notes on : a) Multiple granularity b) Serializability c) Complete schedule d)
Serial Schedule.
3. Experiment the Time Stamp - Based Concurrency Control protocol? How is it used to
ensure serializability?
4. Illustrate a log file? Explain about the check point log based recovery schema for
recovering the database.
5. Discuss the failures that can occur with loss of non-volatile storage.
UNIT – V
1. Illustrate extendable hashing techniques for indexing data records. Consider your class
students data records and roll number as index attribute and show the hash directory.
2. Is disk cylinder a logical concept? Justify your answer.
3. Formulate the performance implications of disk structure? Explain briefly about
redundant arrays of independent disks.
4. Measure the indexing? Explain the differences between tree-based indexes and
hash-based indexes.
5. Justify extendable hashing? How is it different from linear hashing?
Tutorial Problems
Tutorial-1
4. Elaborate the Trigger? Explain how to implement Triggers in SQL with example.
5. Discuss the following operators in SQL with examples
i) Some ii) Not In iii) In iv) Except
Tutorial -3
1. Consider a relation R with five attributes ABCDE. You are given the following
dependencies: A->B, BC->E and ED->A
i) List all keys for R
ii) Is R in 3NF? If not, explain why not.
iii) Is R in BCNF? If not, explain why not.
2. Define 1NF, 2NF, 3NF and BCNF, what is the motivation for putting a relation in
BCNF? What is the motivation for 3NF?
3. Construct the closure of F, where F is a set of functional dependencies. Explain how to compute
F+ with suitable examples.
4. Differentiate between FD and MFD
5. Summarize the problems that are caused by redundancy and by decomposition of relations.
Tutorial -4
1. Discuss about log? What is log tail? Explain the concept of checkpoint log record.
2. Elaborate on how to test the serializability of a schedule. Explain with an example.
3. Construct the concurrency control using time stamp ordering protocol.
4. Demonstrate ACID properties of transactions.
5. Differentiate transaction rollback and restart recovery.
Tutorial -5
Important Questions
Unit-1
What is an ER diagram? Specify the notations used to indicate various components of ER-
diagram
What is an unsafe query? Give an example and explain why it is important to disallow
such queries?
List the six design goals for relational database and explain why they are desirable.
A company database needs to store data about employees, departments and children
of employees. Draw an ER diagram that captures the above data.
What is the composite Attribute? How to model it in the ER diagram? Explain with an example.
Compare candidate key , primary key and super key.
Unit-2
Write the following queries in Tuple Relational Calculus for following Schema.
ii. Find the names of sailors who have reserved at least one boat
(b) Find the names of sailors who have reserved at least two boats
(c) Find the names of sailors who have reserved all boats.
The key fields are underlined. The Catalog relation lists the prices
charged for parts by suppliers. Write the following queries in SQL.
Explain in detail the following:
i. Join operation
ii. Nested-loop join
iii. Block nested-loop join
Write the SQL expressions for the following relational database:
Sailor schema(sailor id, boat id, sailorname, rating, age)
Reserves(sailor id, boat id, day)
Boat schema(boat id, boatname, color)
i. Find the age of the youngest sailor for each rating level.
ii. Find the age of the youngest sailor who is eligible to vote for each rating level with at
least two such sailors.
iii. Find the number of reservations for each red boat.
iv. Find the average age of sailors for each rating level that has at least 2 sailors.
What is outer join? Explain different types of joins?
What is a trigger and what are its 3 parts. Explain in detail.
What is view? Explain the Views in SQL.
Unit-3
A->BC
C->A
D->E
F->A
E->D
Unit-4
Unit-5
a. Cluster indexes
b. Primary and secondary indexes
c. Clustering file organization
Unit wise Objective Questions
Unit-I
Q.8 Architecture of the database can be viewed as
(A) two levels. (B) four levels. (C) three levels. (D) one level.
(A) physical level. (B) logical level. (C) conceptual level. (D) view level.
Unit-II
Q.1 An entity set that does not have sufficient attributes to form a primary key is a
(A) strong entity set. (B) weak entity set. (C) simple entity set. (D) primary entity set.
Q.3 In tuple relational calculus, P1 → P2 is equivalent to
(A) ¬P1 ∨ P2 (B) P1 ∨ P2 (C) P1 ∧ P2 (D) P1 ∧ ¬P2
Q.4 The language used in application programs to request data from the DBMS is
referred to as the
Q.6 The database environment has all of the following components except:
(A) users. (B) separate files. (C) database. (D) database administrator.
Q.7 The way a particular application views the data from the database that
the application uses is a
Q.8 In an E-R diagram, an entity set is represented by a
Unit-III
Q.7 The method in which records are physically stored in a specified order according
to a key field in each record is
(A) hash. (B) direct. (C) sequential. (D) all of the above.
(A) the logical view. (B) the physical view. (C) the external view. (D) all of the above.
Q.10 Which one of the following statements is false?
Unit-IV
(A) data is defined separately and not included in programs.
(B) programs are not dependent on the physical attributes of data.
(C) programs are not dependent on the logical attributes of data.
(D) both (B) and (C).
Q.7 The statement in SQL which allows to change the definition of a table is
(A) Primary key (B) Secondary key (C) Foreign key (D) None of these
Unit-V
1. The file organization that provides very fast access to any arbitrary record of a file is
2. DBMS helps achieve
(C) Neither (A) nor (B) (D) Both (A) and (B)
Q.4 Which of the following operations is used if we are interested in only certain columns of a
table?
II year CSE – II Sem DBMS
Unit-1
11. What is an unsafe query? Give an example and explain why it is important to disallow
such queries?
13. List the six design goals for relational database and explain why they are desirable.
Unit-2
5. What is the composite Attribute? How to model it in the ER diagram? Explain with an
example.
Write the following queries in Tuple Relational Calculus for following Schema.
Find the names of sailors who have reserved at least one boat
Find the names of sailors who have reserved at least two boats
Catalog(sid: integer, pid: integer, cost: real)
The key fields are underlined. The Catalog relation lists the
prices charged for parts by suppliers.
13. Write the following queries in SQL.
Find the pnames of parts supplied by raghu supplier and no one else.
14. Explain in detail the following:
i. Join operation
ii. Nested-loop join
iii. Block nested-loop join
15. Write the SQL expressions for the following relational database:
Sailor schema(sailor id, boat id, sailorname, rating, age)
Reserves(sailor id, boat id, day)
i. Find the age of the youngest sailor for each rating level.
16. Find the age of the youngest sailor who is eligible to vote for each rating level
with at least two such sailors.
17. Find the number of reservations for each red boat.
18. Find the average age of sailors for each rating level that has at least 2 sailors.
19. What is outer join? Explain different types of joins?
20. What is a trigger and what are its 3 parts. Explain in detail.
Unit-3
6. Consider the relation R(A,B,C,D,E) and
Unit-4
10. What are the merits and demerits of using fuzzy dumps for media recovery?
11. What information do the dirty page table and the transaction table contain?
Unit-5
a. Cluster indexes
b. Primary and secondary indexes
c. Clustering file organization
Sample Mid Paper
II B.Tech II Sem CSE Database Management Systems I Mid Question Paper
PART-A
1. a) List the responsibilities of DBA?
b) Write brief notes on views?
c) List the primitive operations in relational algebra?
d) What is meant by nested queries?
e) What is Trigger and Active database?
PART-B
4) What are integrity constraints? How these constraints are expressed in SQL?
(or)
5) Explain the operations of relational algebra? What are aggregate operations
and logical operators in SQL?
6) Describe about DDL & DML commands with syntaxes and examples?
(or)
7) What is normalization? Explain 1NF, 2NF and 3NF Normal forms with
examples?
University Question papers of previous years
Code No: 114CQ R13
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
B.Tech II Year II Semester Examinations, May - 2016
DATABASE MANAGEMENT SYSTEMS
(Common to CSE, IT)
Time: 3 Hours Max. Marks: 75
Note: This question paper contains two parts A and B.
Part A is compulsory which carries 25 marks. Answer all questions in Part A.
Part B consists of 5 Units. Answer any one full question from each unit.
Each question carries 10 marks and may have a, b, c as sub questions.
6. What is meant by functional dependencies? Discuss about second normal form. [10]
OR
7. Explain fourth normal form and BCNF. [10]
10. What is meant by extendable hashing? How is it different from linear hashing? [10]
OR
11. What are the indexed data structures? Explain any one of them. [10]
REFERENCES
3. Database Systems: Design, Implementation, and Management, Peter Rob & Carlos
Coronel, 7th Edition.
Websites:-
http://en.wikipedia.org/wiki/Database_management_system
https://www.tutorialspoint.com/dbms
http://helpingnotes.com/notes/msc_notes/dbms_notes/
http://www.geeksforgeeks.org
Journals:-