DATABASE MANAGEMENT SYSTEMS
UNIT II:
Relational Algebra – Selection and projection set operations – renaming – Joins – Division –
Examples of Algebra overviews – Relational calculus – Tuple Relational Calculus (TRC) –
Domain relational calculus (DRC).
Overview of the SQL Query Language – Basic Structure of SQL Queries, Set Operations,
Aggregate Functions – GROUP BY – HAVING, Nested Subqueries, Views, Triggers,
Procedures.
UNIT III:
Normalization – Introduction, Non loss decomposition and functional dependencies, First,
Second, and third normal forms – dependency preservation, Boyce/Codd normal form.
Higher Normal Forms - Introduction, Multi-valued dependencies and Fourth normal form,
Join dependencies and Fifth normal form
UNIT IV:
Transaction Concept- Transaction State- Implementation of Atomicity and Durability –
Concurrent Executions – Serializability- Recoverability – Implementation of Isolation –
Testing for serializability – Lock-Based Protocols – Timestamp-Based Protocols –
Validation-Based Protocols – Multiple Granularity.
UNIT V:
Recovery and Atomicity – Log-Based Recovery – Recovery with Concurrent Transactions
– Checkpoints – Buffer Management – Failure with Loss of Nonvolatile Storage – Advanced
Recovery Systems – ARIES Algorithm, Remote Backup Systems.
File organization – various kinds of indexes - B+ Trees- Query Processing – Relational Query
Optimization.
TEXT BOOKS:
1. Database System Concepts, Silberschatz, Korth, McGraw Hill, Sixth Edition. (All
units except Unit III)
2. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Tata
McGraw-Hill, 3rd Edition.
REFERENCE BOOKS:
1. Fundamentals of Database Systems, Elmasri, Navathe, Pearson Education.
2. An Introduction to Database Systems, C.J. Date, A. Kannan, S. Swamynathan,
Pearson, Eighth Edition (for Unit III).
Outcomes:
Demonstrate the basic elements of a relational database management system
Ability to identify the data models for relevant problems
Ability to design entity-relationship diagrams, convert them into relational schemas, and
formulate SQL queries on the respective data
Apply normalization in the development of application software
INDEX

Unit I
1. INTRODUCTION TO DATABASE MANAGEMENT SYSTEM
2. VIEW OF DATA
4. ENTITY-RELATIONSHIP MODEL
5. DATABASE SCHEMA

Unit II
1. PRELIMINARIES
2. RELATIONAL ALGEBRA
3. RELATIONAL CALCULUS
4. THE FORM OF A BASIC SQL QUERY
5. INTRODUCTION TO VIEWS
6. TRIGGERS

Unit III
1. SCHEMA REFINEMENT
2. FUNCTIONAL DEPENDENCIES
3. NORMAL FORMS
4. DECOMPOSITIONS
5. DEPENDENCY-PRESERVING DECOMPOSITION INTO 3NF
6. OTHER KINDS OF DEPENDENCIES

Unit IV
1. TRANSACTION CONCEPT
2. CONCURRENT EXECUTION
3. TRANSACTION CHARACTERISTICS
4. RECOVERABLE SCHEDULES
5. RECOVERY SYSTEM
6. TIMESTAMP-BASED PROTOCOLS
7. MULTIPLE GRANULARITY

Unit V
1. FAILURE WITH LOSS OF NON-VOLATILE STORAGE
2. REMOTE BACKUP
3. RECOVERY AND ATOMICITY
4. LOG-BASED RECOVERY
5. RECOVERY WITH CONCURRENT TRANSACTIONS
6. DBMS FILE STRUCTURE
UNIT-1
Introduction to Database Management System
As the name suggests, the database management system consists of two parts. They are:
1. Database and
2. Management System
What is a Database?
To find out what database is, we have to start from data, which is the basic building block of any
DBMS.
Data: Facts, figures, statistics etc. having no particular meaning (e.g. 1, ABC, 19 etc).
Record: Collection of related data items, e.g. in the above example the three data items had no
meaning on their own. But if we organize them in the following way, then they collectively
represent meaningful information:

Roll  Name  Age
1     ABC   19
2     DEF   22
3     XYZ   28

Related data can also be split across several tables keyed by Roll, so that, for example, the Age
and Hostel attributes end up in different tables:

Roll  City          Roll  Year  Hostel
1     KOL           2     II    H2
2     DEL           3     I
3     MUM
For example, within a company there are different departments, as well as customers, who each need
to see different kinds of data. Each employee in the company will have different levels of access to the
database with their own customized front-end application.
In a database, data is organized strictly in row and column format. The rows are called Tuple or
Record. The data items within one row may belong to different data types. On the other hand, the
columns are often called Domain or Attribute. All the data items within a single attribute are of the
same data type.
Database systems are designed to manage large bodies of information. Management of data involves
both defining structures for storage of information and providing mechanisms for the manipulation of
information. In addition, the database system must ensure the safety of the information stored, despite
system crashes or attempts at unauthorized access. If data are to be shared among several users, the
system must avoid possible anomalous results.
Databases touch all aspects of our lives. Some of the major areas of application are as follows:
1. Banking
2. Airlines
3. Universities
4. Manufacturing and selling
5. Human resources
Enterprise Information
◦ Sales: For customer, product, and purchase information.
◦ Accounting: For payments, receipts, account balances, assets and other accounting information.
◦ Human resources: For information about employees, salaries, payroll taxes, and benefits, and for
generation of paychecks.
◦ Manufacturing: For management of the supply chain and for tracking production of items in factories,
inventories of items in warehouses and stores, and orders for items.
◦ Online retailers: For sales data noted above plus online order tracking, generation of
recommendation lists, and maintenance of online product evaluations.
A typical file-processing system is supported by a conventional operating system. The system
stores permanent records in various files, and it needs different application programs to extract records
from, and add records to, the appropriate files. Before database management systems (DBMSs) were
introduced, organizations usually stored information in such systems. Keeping organizational
information in a file-processing system has a number of major disadvantages:
Data redundancy and inconsistency. Since different programmers create the files and application
programs over a long period, the various files are likely to have different structures and the programs
may be written in several programming languages. Moreover, the same information may be duplicated
in several places (files). For example, if a student has a double major (say, music and mathematics) the
address and telephone number of that student may appear in a file that consists of student records of
students in the Music department and in a file that consists of student records of students in the
Mathematics department. This redundancy leads to higher storage and access cost. In addition, it may
lead to data inconsistency; that is, the various copies of the same data may no longer agree. For
example, a changed student address may be reflected in the Music department records but not
elsewhere in the system.
Difficulty in accessing data. Suppose that one of the university clerks needs to find out the names of
all students who live within a particular postal-code area. The clerk asks the data-processing
department to generate such a list. Because the designers of the original system did not anticipate this
request, there is no application program on hand to meet it. There is, however, an application program
to generate the list of all students.
Data isolation. Because data are scattered in various files, and files may be in different formats,
writing new application programs to retrieve the appropriate data is difficult.
Integrity problems. The data values stored in the database must satisfy certain types of consistency
constraints. Suppose the university maintains an account for each department, and records the balance
amount in each account. Suppose also that the university requires that the account balance of a
department may never fall below zero. Developers enforce these constraints in the system by adding
appropriate code in the various application programs. However, when new constraints are added, it is
difficult to change the programs to enforce them. The problem is compounded when constraints
involve several data items from different files.
Atomicity problems. A computer system, like any other device, is subject to failure. In many
applications, it is crucial that, if a failure occurs, the data be restored to the consistent state that existed
prior to the failure.
Consider a program to transfer $500 from the account balance of department A to the account balance
of department B. If a system failure occurs during the execution of the program, it is possible that the
$500 was removed from the balance of department A but was not credited to the balance of department
B, resulting in an inconsistent database state.
That is, the funds transfer must be atomic: it must happen in its entirety or not at all. It is difficult to
ensure atomicity in a conventional file-processing system.
Concurrent-access anomalies. For the sake of overall performance of the system and faster response,
many systems allow multiple users to update the data simultaneously. Indeed, today, the largest
Internet retailers may have millions of accesses per day to their data by shoppers. In such an
environment, interaction of concurrent updates is possible and may result in inconsistent data.
Consider department A, with an account balance of $10,000. If two department clerks debit the
account balance (by say $500 and $100, respectively) of department A at almost exactly the same time,
the result of the concurrent executions may leave the budget in an incorrect (or inconsistent) state.
Suppose that the programs executing on behalf of each withdrawal read the old balance, reduce that
value by the amount being withdrawn, and write the result back. If the two programs run concurrently,
they may both read the value $10,000, and write back $9500 and $9900, respectively. Depending on
which one writes the value last, the account balance of department A may contain either $9500 or
$9900, rather than the correct value of $9400. To guard against this possibility, the system must
maintain some form of supervision.
But supervision is difficult to provide because data may be accessed by many different application
programs that have not been coordinated previously.
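The anomaly can be seen directly in SQL. Below is a minimal sketch, assuming a hypothetical
table department(dept_name, balance) that holds the $10,000 balance; the interleaving shown is
one possible schedule:

-- Both clerks first read the current balance:
SELECT balance FROM department WHERE dept_name = 'A';        -- each reads 10000
-- Clerk 1 computes 10000 - 500, clerk 2 computes 10000 - 100, then each writes:
UPDATE department SET balance = 9500 WHERE dept_name = 'A';  -- clerk 1's write
UPDATE department SET balance = 9900 WHERE dept_name = 'A';  -- clerk 2 overwrites it; the $500 debit is lost

If each read-compute-write sequence were run as a transaction under proper concurrency control,
the second sequence would be blocked or restarted, and the final balance would be the correct $9,400.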
Security problems. Not every user of the database system should be able to access all the data. For
example, in a university, payroll personnel need to see only that part of the database that has financial
information. They do not need access to information about academic records. But, since application
programs are added to the file-processing system in an ad hoc manner, enforcing such security
constraints is difficult.
These difficulties, among others, prompted the development of database systems. In what follows, we
shall see the concepts and algorithms that enable database systems to solve the problems with file-
processing systems.
Advantages of DBMS:
Controlling of Redundancy: Data redundancy refers to the duplication of data (i.e., storing the same
data multiple times). In a database system, by having a centralized database and centralized control of
data by the DBA, unnecessary duplication of data is avoided. It also eliminates the extra time needed
to process a large volume of data, and it saves storage space.
Improved Data Sharing : DBMS allows a user to share the data in any number of application programs.
Data Integrity : Integrity means that the data in the database is accurate. Centralized control of the
data permits the administrator to define integrity constraints on the data in the database. For
example, in a customer database we can enforce an integrity constraint that customers may only be
from the cities of Noida and Meerut.
Security : Having complete authority over the operational data enables the DBA to ensure that the
only means of access to the database is through proper channels. The DBA can define authorization
checks to be carried out whenever access to sensitive data is attempted.
Efficient Data Access : In a database system, the data is managed by the DBMS and all access to the
data is through the DBMS, providing a key to effective data processing.
Enforcement of Standards : With centralized control of data, the DBA can establish and enforce data
standards, which may include naming conventions, data quality standards, etc.
Data Independence : In a database system, the database management system provides the interface
between the application programs and the data. When changes are made to the data representation, the
metadata maintained by the DBMS is updated, but the DBMS continues to provide the data to
application programs in the previously used way. The DBMS handles the task of transformation of data
wherever necessary.
Reduced Application Development and Maintenance Time : A DBMS supports many important
functions that are common to the many applications accessing data stored in it, which facilitates
quick application development.
Disadvantages of DBMS
1) It is a bit complex. Since it supports multiple functionalities to give the user the best, the underlying
software has become complex. Designers and developers should have thorough knowledge
of the software to get the most out of it.
2) Because of its complexity and functionality, it uses a large amount of memory. It also needs a large
amount of memory to run efficiently.
3) A DBMS works as a centralized system, i.e., all the users from all over the world access
this database. Hence any failure of the DBMS will impact all the users.
4) A DBMS is generalized software, i.e., it is written to work on entire systems rather than a specific one.
Hence some applications will run slowly.
View of Data
A database system is a collection of interrelated data and a set of programs that allow users to access
and modify these data. A major purpose of a database system is to provide users with an abstract view
of the data. That is, the system hides certain details of how the data are stored and maintained.
Data Abstraction
For the system to be usable, it must retrieve data efficiently. The need for efficiency has led designers
to use complex data structures to represent data in the database. Since many database-system users
are not computer trained, developers hide the complexity from users through several levels of
abstraction, to simplify users’ interactions with the system:
• Physical level (or Internal View / Schema): The lowest level of abstraction describes how the data
are actually stored. The physical level describes complex low-level data structures in detail.
• Logical level (or Conceptual View / Schema): The next-higher level of abstraction describes what
data are stored in the database, and what relationships exist among those data. The logical level thus
describes the entire database in terms of a small number of relatively simple structures. Although
implementation of the simple structures at the logical level may involve complex physical-level
structures, the user of the logical level does not need to be aware of this complexity. This is referred
to as physical data independence.
• View level (or External View / Schema): The highest level of abstraction describes only part of the
entire database. Even though the logical level uses simpler structures, complexity remains because of
the variety of information stored in a large database. Many users of the database system do not need
all this information; instead, they need to access only a part of the database. The view level of
abstraction exists to simplify their interaction with the system. The system may provide many views
for the same database.
For example, we may describe a record as follows:
type instructor = record
ID : char (5);
name : char (20);
dept name : char (20);
salary : numeric (8,2);
end;
This code defines a new record type called instructor with four fields. Each field has a name
and a type associated with it. A university organization may have several such record types,
including department, course, and student.
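At the logical level, the same record type could equally be declared as a relational table. A
minimal SQL sketch; the column names and types simply mirror the record definition above:

CREATE TABLE instructor (
    ID        CHAR(5),
    name      VARCHAR(20),
    dept_name VARCHAR(20),
    salary    NUMERIC(8,2)
);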
At the physical level, an instructor, department, or student record can be described as a block of
consecutive storage locations.
At the logical level, each such record is described by a type definition, as in the previous code
segment, and the interrelationship of these record types is defined as well.
Finally, at the view level, computer users see a set of application programs that hide details of the
data types. At the view level, several views of the database are defined, and a database user sees some
or all of these views.
Databases change over time as information is inserted and deleted. The collection of information
stored in the database at a particular moment is called an instance of the database. The overall design
of the database is called the database schema. Schemas are changed infrequently, if at all. The
concept of database schemas and instances can be understood by analogy to a program written in a
programming language.
Each variable has a particular value at a given instant. The values of the variables in a program at a
point in time correspond to an instance of a database schema. Database systems have several
schemas, partitioned according to the levels of abstraction. The physical schema describes the
database design at the physical level, while the logical schema describes the database design at the
logical level. A database may also have several schemas at the view level, sometimes called
subschemas, which describe different views of the database. Of these, the logical schema is by far
the most important, in terms of its effect on application programs, since programmers construct
applications by using the logical schema. Application programs are said to exhibit physical data
independence if they do not depend on the physical schema, and thus need not be rewritten if the
physical schema changes.
Data Models
Underlying the structure of a database is the data model: a collection of conceptual tools for
describing data, data relationships, data semantics, and consistency constraints.
• Relational Model. The relational model uses a collection of tables to represent both data and the
relationships among those data. Each table has multiple columns, and each column has a unique
name. Tables are also known as relations. The relational model is an example of a record-based
model.
• Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of basic
objects, called entities, and relationships among these objects.
For example, suppose that each department has offices in several locations and we want to record
the locations at which each employee works. A variant of the Works_In relationship (call it
Works_In2) that relates employees, departments, and locations can capture this in an ER diagram.
• Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has
become the dominant software-development methodology. This led to the development of an object-
oriented data model that can be seen as extending the E-R model with notions of encapsulation,
methods (functions), and object identity.
• Semi-structured Data Model. The semi-structured data model permits the specification of data
where individual data items of the same type may have different sets of attributes. This is in contrast
to the data models mentioned earlier, where every data item of a particular type must have the same
set of attributes. The Extensible Markup Language (XML) is widely used to represent semi-
structured data.
Historically, the network data model and the hierarchical data model preceded the relational data
model.
These models were tied closely to the underlying implementation, and complicated the task of modeling
data.
As a result they are used little now, except in old database code that is still in service in some places.
Database Languages
A database system provides a data-definition language to specify the database
schema and a data-manipulation language to express database queries and updates. In practice,
the data-definition and data-manipulation languages are not two separate languages; instead they
simply form parts of a single database language, such as the widely used SQL language.
Data-Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or manipulate data
as organized by the appropriate data model. The types of access are:
• Retrieval of information stored in the database
• Insertion of new information into the database
• Deletion of information from the database
• Modification of information stored in the database
• Domain Constraints. A domain of possible values must be associated with every attribute (for
example, integer types, character types, date/time types). Declaring an attribute to be of a particular
domain acts as a constraint on the values that it can take. Domain constraints are the most elementary
form of integrity constraint. They are tested easily by the system whenever a new data item is entered
into the database.
• Authorization. We may want to differentiate among the users as far as the type of access they are
permitted on various data values in the database. These differentiations are expressed in terms of
authorization, the most common being: read authorization, which allows reading, but not
modification, of data; insert authorization, which allows insertion of new data, but not modification
of existing data; update authorization, which allows modification, but not deletion, of data; and
delete authorization, which allows deletion of data. We may assign the user all, none, or a
combination of these types of authorization.
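Both kinds of restriction can be stated in SQL. A hedged sketch, using an illustrative account
table and made-up user names (payroll_user, dept_clerk); the CHECK clause expresses a simple
domain-style constraint, and the GRANT statements assign the types of authorization listed above:

-- Domain constraint: balance values must be non-negative numerics.
CREATE TABLE account (
    dept_name VARCHAR(20),
    balance   NUMERIC(12,2) CHECK (balance >= 0)
);

GRANT SELECT ON account TO payroll_user;        -- read authorization only
GRANT INSERT, UPDATE ON account TO dept_clerk;  -- insert and update authorization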
The DDL, just like any other programming language, gets as input some instructions (statements) and
generates some output. The output of the DDL is placed in the data dictionary, which contains
metadata—that is, data about data.
Data Dictionary
We can define a data dictionary as a DBMS component that stores the definition of data
characteristics and relationships. You may recall that such “data about data” were labeled metadata.
The DBMS data dictionary provides the DBMS with its self-describing characteristic. In effect, the
data dictionary resembles an X-ray of the company's entire data set, and is a crucial element in the
data administration function.
For example, the data dictionary typically stores descriptions of all:
• Data elements that are defined in all tables of all databases. Specifically, the data dictionary stores
the names, datatypes, display formats, internal storage formats, and validation rules. The data
dictionary tells where an element is used, by whom it is used, and so on.
• Tables defined in all databases. For example, the data dictionary is likely to store the name of the
table creator, the date of creation, access authorizations, the number of columns, and so on.
• Indexes defined for each database table. For each index the DBMS stores at least the index name,
the attributes used, the location, specific index characteristics, and the creation date.
• Databases defined: who created each database, the date of creation, where the database is located,
who the DBA is, and so on.
• End users and administrators of the database.
• Programs that access the database, including screen formats, report formats, application formats,
SQL queries, and so on.
• Access authorizations for all users of all databases.
• Relationships among data elements: which elements are involved, whether the relationships are
mandatory or optional, the connectivity and cardinality, and so on.
Database Administrators and Database Users
A primary goal of a database system is to retrieve information from and store new information in the
database.
Database Users and User Interfaces
A database system is partitioned into modules that deal with each of the responsibilities of the overall
system. The functional components of a database system can be broadly divided into the storage
manager and the query processor components. The storage manager is important because databases
typically require a large amount of storage space.
Query Processor:
The query processor components include
· DDL interpreter, which interprets DDL statements and records the definitions in the data
dictionary.
· DML compiler, which translates DML statements in a query language into an evaluation plan
consisting of low-level instructions that the query evaluation engine understands.
A query can usually be translated into any of a number of alternative evaluation plans that all give the
same result. The DML compiler also performs query optimization, that is, it picks the lowest cost
evaluation plan from among the alternatives.
Query evaluation engine, which executes low-level instructions generated by the DML compiler.
Storage Manager:
A storage manager is a program module that provides the interface between the low-level data stored in
the database and the application programs and queries submitted to the system. The storage
manager is responsible for the interaction with the file manager.
The storage manager components include the authorization and integrity manager, the transaction
manager, the file manager, and the buffer manager.
Transaction Manager:
The transaction manager ensures that the database remains in a consistent state despite system
failures, and that concurrent transaction executions proceed without conflicting.
What is ER Modeling?
A graphical technique for understanding and organizing the data, independent of the actual
database implementation.
We need to be familiar with the following terms to go further.
Entity
Anything that has an independent existence and about which we collect data. It is also known as an
entity type.
Entity instance
Entity instance is a particular member of the entity type.
Example for entity instance : A particular employee
Regular Entity
An entity which has its own key attribute is a regular entity.
Example for regular entity : Employee.
Weak entity
An entity which depends on other entity for its existence and doesn't have any key attribute of its own is
a weak entity.
Attributes
Properties/characteristics which describe entities are called attributes.
Domain of Attributes
The set of possible values that an attribute can take is called the domain of the attribute. For example,
the attribute day may take any value from the set {Monday, Tuesday ... Friday}. Hence this set can
be termed as the domain of the attribute day.
Key attribute
The attribute (or combination of attributes) which is unique for every entity instance is called key
attribute.
E.g. the employee_id of an employee, the pan_card_number of a person, etc. If the key attribute
consists of two or more attributes in combination, it is called a composite key.
Simple attribute
If an attribute cannot be divided into simpler components, it is a simple attribute.
Example for simple attribute : employee_id of an employee.
Composite attribute
If an attribute can be split into components, it is called a composite attribute.
Example for composite attribute : Name of the employee which can be split into First_name,
Middle_name, and Last_name.
Single valued Attributes
If an attribute can take only a single value for each entity instance, it is a single valued attribute.
Example for single valued attribute : age of a student. It can take only one value for a particular student.
Multi-valued Attributes
If an attribute can take more than one value for a given entity instance, it is a multi-valued attribute.
Example for multi-valued attribute : email_address of an employee, since an employee can have
more than one email address.
Stored Attribute
An attribute which needs to be stored permanently is a stored attribute.
Example for stored attribute : name of a student
Derived Attribute
An attribute which can be calculated or derived based on other attributes is a derived attribute.
Example for derived attribute : age of employee which can be calculated from date of birth and current
date.
Relationships
Associations between entities are called relationships
Example : An employee works for an organization. Here "works for" is a relation between the
entities employee and organization.
In ER modeling, a relationship is drawn as a diamond connecting the participating entities. To
connect a weak entity with others, a weak (identifying) relationship notation, drawn as a double
diamond, is used.
One employee is assigned with only one parking space and one parking space is assigned to
only one employee. Hence it is a 1:1 relationship and cardinality is One-To-One (1:1)
One organization can have many employees, but one employee works in only one organization.
Hence it is a 1:N relationship and cardinality is One-To-Many (1:N)
One employee works in only one organization, but one organization can have many employees.
Hence it is a M:1 relationship and cardinality is Many-to-One (M :1)
One student can enroll for many courses and one course can be enrolled by many students. Hence
it is a M:N relationship and cardinality is Many-to-Many (M:N)
Relationship Participation
1. Total
In total participation, every entity instance will be connected through the relationship to another
instance of the other participating entity types.
2. Partial
In partial participation, only some instances of the entity type participate in the relationship.
Example for relationship participation
Consider the relationship - Employee is head of the department.
Here all employees will not be the head of the department. Only one employee will be the head
of the department. In other words, only few instances of employee entity participate in the
above relationship. So employee entity's participation is partial in the said relationship.
Relational Model
The relational model is today the primary data model for commercial data processing applications. It
attained its primary position because of its simplicity, which eases the job of the programmer,
compared to earlier data models such as the network model or the hierarchical model.
Structure of Relational Databases:
A relational database consists of a collection of tables, each of which is assigned a unique name. For
example, consider the instructor table of Figure:1.5, which stores information about instructors. The
table has four column headers: ID, name, dept name, and salary. Each row of this table records
information about an instructor, consisting of the instructor’s ID, name, dept name, and salary.
Database Schema
When we talk about a database, we must differentiate between the database schema, which is the
logical design of the database, and the database instance, which is a snapshot of the data in the
database at a given instant in time. The concept of a relation corresponds to the programming-
language notion of a variable, while the concept of a relation schema corresponds to the
programming-language notion of type definition.
Keys
A superkey is a set of one or more attributes that, taken collectively, allow us to identify uniquely a
tuple in the relation. For example, the ID attribute of the relation instructor is sufficient to distinguish
one instructor tuple from another. Thus, ID is a superkey. The name attribute of instructor, on the
other hand, is not a superkey, because several instructors might have the same name.
A superkey may contain extraneous attributes. For example, the combination of ID and name is a
superkey for the relation instructor. If K is a superkey, then so is any superset of K. We are often
interested in superkeys for which no proper subset is a superkey. Such minimal superkeys are called
candidate keys.
It is customary to list the primary key attributes of a relation schema before the other attributes; for
example, the dept name attribute of department is listed first, since it is the primary key. Primary key
attributes are also underlined. A relation, say r1, may include among its attributes the primary key of
another relation, say r2. This attribute is called a foreign key from r1, referencing r2.
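In SQL, these key constraints are declared when the tables are created. A sketch for the university
schema discussed above; the non-key columns and their types are assumptions for illustration:

CREATE TABLE department (
    dept_name VARCHAR(20) PRIMARY KEY,   -- primary key of department
    building  VARCHAR(15),
    budget    NUMERIC(12,2)
);

CREATE TABLE instructor (
    ID        CHAR(5) PRIMARY KEY,       -- candidate key chosen as primary key
    name      VARCHAR(20),
    dept_name VARCHAR(20) REFERENCES department(dept_name),  -- foreign key referencing department
    salary    NUMERIC(8,2)
);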
Schema Diagrams
A database schema, along with primary key and foreign key dependencies, can be depicted by
schema diagrams. Figure 1.12 shows the schema diagram for our university organization.
Referential integrity constraints other than foreign key constraints are not shown explicitly in schema
diagrams. We will study a different diagrammatic representation called the entity-relationship diagram.
UNIT-2
Relational Algebra and Calculus
PRELIMINARIES
In defining relational algebra and calculus, the alternative of referring to fields by position is
more convenient than referring to fields by name: Queries often involve the computation of
intermediate results, which are themselves relation instances, and if we use field names to refer
to fields, the definition of query language constructs must specify the names of fields for all
intermediate relation instances.
We will use the following schema in our examples:
Sailors(sid: integer, sname: string, rating: integer, age: real)
Boats(bid: integer, bname: string, color: string)
Reserves(sid: integer, bid: integer, day: date)
The key fields are underlined, and the domain of each field is listed after the field name.
Thus sid is the key for Sailors, bid is the key for Boats, and all three fields together form the key
for Reserves. Fields in an instance of one of these relations will be referred to by name, or
positionally, using the order in which they are listed above.
RELATIONAL ALGEBRA
Relational algebra is one of the two formal query languages associated with the relational model.
Queries in algebra are composed using a collection of operators. A fundamental property is that
every operator in the algebra accepts (one or two) relation instances as arguments and returns a
relation instance as the result.
Each relational query describes a step-by-step procedure for computing the desired
answer, based on the order in which operators are applied in the query.
Selection and Projection
Relational algebra includes operators to select rows from a relation (σ) and to project columns
(π). These operations allow us to manipulate data in a single relation. Consider an instance S2 of
Sailors. We can retrieve rows corresponding to highly rated sailors with the expression
σrating>8(S2)
The selection operator σ specifies the tuples to retain through a selection condition. In general,
the selection condition is a boolean combination (i.e., an expression using the logical connectives
∧ and ∨) of terms that have the form attribute op constant or attribute1 op attribute2, where op is
one of the comparison operators <, <=, =, ≠, >=, or >.
The projection operator π allows us to extract columns from a relation; for example, we can find
out all sailor names and ratings by using π. The expression πsname,rating(S2) extracts the sname
and rating columns. Suppose that we wanted to find out only the ages of sailors. The expression
πage(S2)
returns the set of distinct ages; even if two sailors share the age 35.0, only a single tuple with
age=35.0 appears in the result of the projection. This follows from
the definition of a relation as a set of tuples. However, our discussion of relational algebra and
calculus assumes that duplicate elimination is always done so that relations are always sets of
tuples.
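For readers who already know SQL, both operators have direct counterparts. A sketch, treating
the instance S2 as a table named S2:

SELECT * FROM S2 WHERE rating > 8;       -- selection: sigma rating>8 (S2)
SELECT DISTINCT sname, rating FROM S2;   -- projection: pi sname,rating (S2); DISTINCT performs
                                         -- the duplicate elimination discussed above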
Set Operations
The following standard operations on sets are also available in relational algebra: union (U),
intersection (∩), set-difference (−), and cross-product (×).
Union: R ∪ S returns a relation instance containing all tuples that occur in either
relation instance R or relation instance S (or both). R and S must be union-compatible, and
the schema of the result is defined to be identical to the schema of R.
Figures 4.9 (S1 ∩ S2) and 4.10 (S1 − S2) show the results of intersection and set-difference on
two instances S1 and S2 of Sailors; for example, S1 − S2 is:

sid  sname   rating  age
22   Dustin  7       45.0

In a cross-product such as S1 × R1, the field sid occurs in both input relations; the corresponding
result fields are left unnamed and referred to by position, to emphasize that it is not an inherited
field name; only the corresponding domain is inherited.
We introduce a renaming operator ρ for this purpose. The expression ρ(R(F ), E) takes an
arbitrary relational algebra expression E and returns an instance of a (new) relation called R. R
contains the same tuples as the result of E, and has the same schema as E, but some fields are
renamed. The field names in relation R are the same as in E, except for fields renamed in the
renaming list F.
For example, the expression ρ(C(1 → sid1, 5 → sid2), S1 × R1) returns a relation that contains
the tuples shown in Figure 4.11 and has the following schema: C(sid1: integer, sname: string,
rating: integer, age: real, sid2: integer, bid: integer, day: dates).
Joins
The join operation is one of the most useful operations in relational algebra and is the most
commonly used way to combine information from two or more relations. Although a join can
be defined as a cross-product followed by selections and projections, joins arise much more
frequently in practice than plain cross-products. Joins have received a lot of attention, and there
are several variants of the join operation.
Condition Joins
The most general version of the join operation accepts a join condition c and a pair of relation
instances as arguments, and returns a relation instance. The join condition is identical to a
selection condition in form. The operation is defined as follows:
R ⊲⊳c S = σc(R × S)
Thus ⊲⊳ is defined to be a cross-product followed by a selection. Note that the condition c can
(and typically does) refer to attributes of both R and S.
(sid) sname rating age (sid) bid day
22 Dustin 7 45.0 58 103 11/12/96
31 Lubber 8 55.5 58 103 11/12/96
Figure 4.12 S1 ⊲⊳ S1.sid<R1.sid R1
Equijoin
A common special case of the join operation R ⊲⊳ S is when the join condition consists solely
of equalities (connected by ∧) of the form R.name1 = S.name2, that is, equalities between two
fields in R and S. In this case, obviously, there is some redundancy in retaining both attributes
in the result.
Natural Join
A further special case of the join operation R ⊲⊳ S is an equijoin in which equalities are
specified on all fields having the same name in R and S. In this case, we can simply omit the
join condition; the default is that the join condition is a collection of equalities on all common
fields.
Division
The division operator is useful for expressing certain kinds of queries, for example: "Find the
names of sailors who have reserved all boats." Understanding how to use the basic operators of
the algebra to define division is a useful exercise.
(Q1) Find the names of sailors who have reserved boat 103.
This query can be written as follows:
πsname((σbid=103 Reserves) ⊲⊳ Sailors)
We first compute the set of tuples in Reserves with bid = 103 and then take the natural join
of this set with Sailors. This expression can be evaluated on instances of Reserves and
Sailors. Evaluated on the instances R2 and S3, it yields a relation containing the names of the
sailors who have reserved boat 103.
(Q2) Find the names of sailors who have reserved a red boat.
πsname((σcolor=′red′ Boats) ⊲⊳ Reserves ⊲⊳ Sailors)
This query involves a series of two joins. First we choose (tuples describing) red boats.
(Q4) Find the names of sailors who have reserved at least one boat.
πsname(Sailors ⊲⊳ Reserves)
(Q5) Find the names of sailors who have reserved a red or a green boat.
ρ(Tempboats, (σcolor=′red′ Boats) ∪ (σcolor=′green′ Boats))
πsname(Tempboats ⊲⊳ Reserves ⊲⊳ Sailors)
(Q6) Find the names of sailors who have reserved a red and a green boat.
ρ(Tempboats2, (σcolor=′red′ Boats) ∩ (σcolor=′green′ Boats))
πsname(Tempboats2 ⊲⊳ Reserves ⊲⊳ Sailors)
However, this solution is incorrect: it instead computes sailors who have reserved a boat
that is both red and green. The correct approach is to intersect the sids:
ρ(Tempred, πsid((σcolor=′red′ Boats) ⊲⊳ Reserves))
ρ(Tempgreen, πsid((σcolor=′green′ Boats) ⊲⊳ Reserves))
πsname((Tempred ∩ Tempgreen) ⊲⊳ Sailors)
(Q7) Find the names of sailors who have reserved at least two boats.
πsname1(σ(sid1=sid2)∧(bid1≠bid2) Reservationpairs)
(Here Reservationpairs denotes a suitably renamed cross-product of two copies of Sailors ⊲⊳
Reserves, so that the fields sname1, sid1, bid1, sid2, and bid2 are available.)
(Q8) Find the sids of sailors with age over 20 who have not reserved a red boat.
πsid(σage>20Sailors) −πsid((σcolor=′red′ Boats) ⊲⊳ Reserves ⊲⊳ Sailors)
This query illustrates the use of the set-difference operator. Again, we use the fact that sid is the
key for Sailors.
(Q9) Find the names of sailors who have reserved all boats.
The use of the word all (or every) is a good indication that the division operation might be
applicable:
ρ(Tempsids, (πsid,bid Reserves)/(πbid Boats))
πsname(Tempsids ⊲⊳ Sailors)
(Q10) Find the names of sailors who have reserved all boats called Interlake.
ρ(Tempsids, (πsid,bid Reserves)/(πbid(σbname=′Interlake′ Boats)))
πsname(Tempsids ⊲⊳ Sailors)
RELATIONAL CALCULUS
A tuple variable is a variable that takes on tuples of a particular relation schema as values. That
is, every value assigned to a given tuple variable has the same number and type of fields.
A formula F is recursively defined to be true when:
F is of the form ¬p, and p is not true; or of the form p ∧ q, and both p and q are true;
or of the form p ∨ q, and one of them is true; or of the form p ⇒ q, and q is true
whenever p is true.
F is of the form ∃R(p(R)), and there is some assignment of tuples to the free variables
in p(R), including the variable R, that makes the formula p(R) true.
F is of the form ∀R(p(R)), and there is some assignment of tuples to the free variables
in p(R) that makes the formula p(R) true no matter what tuple is assigned to R.
(Q12) Find the names and ages of sailors with a rating above 7.
(Q13) Find the sailor name, boat id, and reservation date for each reservation.
{P | ∃R ∈ Reserves ∃S ∈ Sailors
(R.sid = S.sid ∧ P.bid = R.bid ∧ P.day = R.day ∧ P.sname = S.sname)}
(Q1) Find the names of sailors who have reserved boat 103.
This query can be read as follows: “Retrieve all sailor tuples for which there exists
a tuple in Reserves, having the same value in the sid field, and with bid = 103.”
(Q2) Find the names of sailors who have reserved a red boat.
{P | ∃S ∈ Sailors ∃R ∈ Reserves(R.sid = S.sid ∧ P.sname = S.sname
∧ ∃B ∈ Boats(B.bid = R.bid ∧ B.color =′red′))}
(Q7) Find the names of sailors who have reserved at least two boats.
{P | ∃S ∈ Sailors ∃R1 ∈ Reserves ∃R2 ∈ Reserves (S.sid = R1.sid
∧ R1.sid = R2.sid ∧ R1.bid ≠ R2.bid ∧ P.sname = S.sname)}
(Q9) Find the names of sailors who have reserved all boats.
{P | ∃S ∈ Sailors ∀B ∈ Boats
(∃R ∈ Reserves(S.sid = R.sid ∧ R.bid = B.bid ∧ P.sname = S.sname))}
(Q14) Find sailors who have reserved all red boats.
{S | S ∈ Sailors ∧ ∀B ∈ Boats
(B.color =′red′ ⇒ (∃R ∈ Reserves(S.sid = R.sid ∧ R.bid = B.bid)))}
A domain variable is a variable that ranges over the values in the domain of some attribute (e.g.,
the variable can be assigned an integer if it appears in an attribute
whose domain is the set of integers). A DRC query has the form {⟨x1, x2, . . . , xn⟩ |
p(⟨x1, x2, . . . , xn⟩)}, where each xi is either a domain variable or a constant and
p(⟨x1, x2, . . . , xn⟩) denotes a DRC formula whose only free variables are the variables among
the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples ⟨x1, x2, . . . , xn⟩ for which the
formula evaluates to true.
(Q1) Find the names of sailors who have reserved boat 103.
{⟨N⟩ | ∃I, T, A(⟨I, N, T, A⟩ ∈ Sailors ∧
∃Ir, Br, D(⟨Ir, Br, D⟩ ∈ Reserves ∧ Ir = I ∧ Br = 103))}
(Q2) Find the names of sailors who have reserved a red boat.
THE FORM OF A BASIC SQL QUERY
This section presents the syntax of a simple SQL query and explains its meaning through a
conceptual evaluation strategy. A conceptual evaluation strategy is a way to evaluate the query
that is intended to be easy to understand, rather than efficient. A DBMS would typically execute
a query in a different and more efficient way.
The answer to this query with and without the keyword DISTINCT on instance S3 of Sailors is
shown in Figures 5.4 and 5.5. The only difference is that the tuple for Horatio appears twice if
DISTINCT is omitted; this is because there are two sailors called Horatio, both of age 35.
SELECT S.sid, S.sname, S.rating, S.age FROM Sailors AS S WHERE S.rating > 7
(Q16) Find the sids of sailors who have reserved a red boat.
(Q2) Find the names of sailors who have reserved a red boat.
SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
(Q4) Find the names of sailors who have reserved at least one boat.
SQL supports a more general version of the select-list than just a list of columns. Each item in a
select-list can be of the form expression AS column name, where expression is any arithmetic or
string expression over column names (possibly prefixed by range variables) and constants.
(Q5) Compute increments for the ratings of persons who have sailed two different boats on
the same day.
(Q6) Find the ages of sailors whose name begins and ends with B and has at least three
characters.
SQL provides three set-manipulation constructs that extend the basic query form presented
earlier. Since the answer to a query is a multiset of rows, it is natural to consider the use of
operations such as union, intersection, and difference. SQL supports these operations under the
names UNION, INTERSECT, and EXCEPT. SQL also provides other set operations: IN (to
check if an element is in a given set), op ANY, op ALL (to compare a value with the elements in a
given set, using comparison operator op), and EXISTS (to check if a set is empty). IN and
EXISTS can be prefixed by NOT, with the obvious modification to their meaning. We cover
UNION, INTERSECT, and EXCEPT in this section. Consider the following query:
(Q1) Find the names of sailors who have reserved both a red and a green boat.
SELECT S.sname FROM Sailors S, Reserves R1, Boats B1, Reserves R2, Boats
B2 WHERE S.sid = R1.sid AND R1.bid = B1.bid AND S.sid = R2.sid AND R2.bid
= B2.bid AND B1.color=‘red’ AND B2.color = ‘green’
(Q2) Find the sids of all sailors who have reserved red boats but not green boats.
A nested query is a query that has another query embedded within it; the embedded query is
called a subquery.
(Q1) Find the names of sailors who have reserved boat 103.
SELECT S.sname
FROM Sailors S
WHERE S.sid IN ( SELECT R.sid
                 FROM Reserves R
                 WHERE R.bid = 103 )
(Q2) Find the names of sailors who have reserved a red boat.
SELECT S.sname
FROM Sailors S
WHERE S.sid IN ( SELECT R.sid
                 FROM Reserves R
                 WHERE R.bid IN ( SELECT B.bid
                                  FROM Boats B
                                  WHERE B.color = ‘red’ ) )
(Q3) Find the names of sailors who have not reserved a red boat.
SELECT S.sname
FROM Sailors S
WHERE S.sid NOT IN ( SELECT R.sid
                     FROM Reserves R
                     WHERE R.bid IN ( SELECT B.bid
                                      FROM Boats B
                                      WHERE B.color = ‘red’ ) )
In the nested queries that we have seen thus far, the inner subquery has been completely
independent of the outer query:
(Q1) Find the names of sailors who have reserved boat number 103.
SELECT S.sname
FROM Sailors S
WHERE EXISTS ( SELECT *
               FROM Reserves R
               WHERE R.bid = 103
               AND R.sid = S.sid )
Set-Comparison Operators
(Q1) Find sailors whose rating is better than some sailor called Horatio.
SELECT S.sid
FROM Sailors S
WHERE S.rating > ANY ( SELECT S2.rating
                       FROM Sailors S2
                       WHERE S2.sname = ‘Horatio’ )
The following query, using ALL, finds the sailors with the highest rating:
SELECT S.sid
FROM Sailors S
WHERE S.rating >= ALL ( SELECT S2.rating
                        FROM Sailors S2 )
(Q1) Find the names of sailors who have reserved both a red and a green boat.
SELECT S.sname
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’
AND S.sid IN ( SELECT S2.sid
               FROM Sailors S2, Boats B2, Reserves R2
               WHERE S2.sid = R2.sid AND R2.bid = B2.bid
               AND B2.color = ‘green’ )
(Q9) Find the names of sailors who have reserved all boats. Division is expressed in SQL
using NOT EXISTS with EXCEPT:
SELECT S.sname
FROM Sailors S
WHERE NOT EXISTS (( SELECT B.bid
                    FROM Boats B )
                  EXCEPT
                  ( SELECT R.bid
                    FROM Reserves R
                    WHERE R.sid = S.sid ))
AGGREGATE OPERATORS
We now consider a powerful class of constructs for computing aggregate values such as MIN
and SUM.
The following query, which attempts to find the name and age of the oldest sailor, is illegal in
SQL: if the SELECT clause uses an aggregate operation, it must use only aggregate operations
unless the query contains a GROUP BY clause.
SELECT S.sname, MAX (S.age)
FROM Sailors S
(Q32) Find the age of the youngest sailor who is eligible to vote (i.e., is at least 18 years old)
for each rating level with at least two such sailors.
SELECT S.rating, MIN (S.age) AS minage
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
HAVING COUNT (*) > 1
(Q3) For each red boat, find the number of reservations for this boat.
SELECT B.bid, COUNT (*) AS sailorcount
FROM Boats B, Reserves R
WHERE R.bid = B.bid AND B.color = ‘red’
GROUP BY B.bid
(Q4) Find the average age of sailors for each rating level that has at least two sailors.
(Q5) Find the average age of sailors who are of voting age (i.e., at least 18 years old) for each
rating level that has at least two sailors.
SELECT S.rating, AVG (S.age) AS avgage
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
HAVING 1 < ( SELECT COUNT (*)
             FROM Sailors S2
             WHERE S.rating = S2.rating )
(Q6) Find the average age of sailors who are of voting age (i.e., at least 18 years old) for
each rating level that has at least two such sailors.
SELECT S.rating, AVG (S.age) AS avgage
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
The above formulation of the query reflects the fact that it is a variant of Q5. The answer to
Q6 on instance S3 differs from the answer to Q5 in that there is no
tuple for rating 10, since there is only one tuple with rating 10 and age ≥ 18.
This formulation of Q6 takes advantage of the fact that the WHERE clause is applied before
grouping is done; thus, only sailors with age ≥ 18 are left when grouping is done. It is
instructive to consider yet another way of writing this query:
SELECT Temp.rating, Temp.avgage
FROM ( SELECT S.rating, AVG (S.age) AS avgage, COUNT (*) AS ratingcount
       FROM Sailors S
       WHERE S.age >= 18
       GROUP BY S.rating ) AS Temp
WHERE Temp.ratingcount > 1
NULL VALUES
So far, we have assumed that column values in a row are always known. In practice, column values
can be unknown. For example, when a sailor, say Dan, joins a yacht club, he may not yet have a
rating assigned. Since the definition for the Sailors table has a rating column, what row should
we insert for Dan? What is needed here is a special value that denotes unknown.
Consider a comparison such as rating = 8. If this is applied to the row for Dan, is this condition
true or false? Since Dan’s rating is unknown, it is reasonable to say that this comparison should
evaluate to the value unknown.
SQL also provides a special comparison operator IS NULL to test whether a column value is
null; for example, we can say rating IS NULL, which would evaluate to true on the row
representing Dan. We can also say rating IS NOT NULL, which would evaluate to false on the
row for Dan.
Now, what about boolean expressions such as rating = 8 OR age < 40 and rating = 8 AND
age < 40? Considering the row for Dan again, because age < 40, the first expression evaluates to
true regardless of the value of rating, but what about the second? We can only say unknown.
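This three-valued behavior can be checked directly. A sketch, assuming Dan's row has been
inserted with a NULL rating:

SELECT * FROM Sailors WHERE rating = 8;               -- Dan is not returned: the comparison is unknown
SELECT * FROM Sailors WHERE rating IS NULL;           -- returns Dan's row
SELECT * FROM Sailors WHERE rating = 8 OR age < 40;   -- returns Dan: age < 40 makes the OR true
SELECT * FROM Sailors WHERE rating = 8 AND age < 40;  -- does not return Dan: the AND is unknown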
INTRODUCTION TO VIEWS
A view is a table whose rows are not explicitly stored in the database but are computed as needed
from a view definition. Consider the Students and Enrolled relations.
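The view definition itself is not reproduced above; a representative sketch along the lines of the
textbook's example, computing students with a B grade (the column list and the grade value are
assumptions):

CREATE VIEW B_Students (name, sid, course)
    AS SELECT S.sname, S.sid, E.cid
       FROM Students S, Enrolled E
       WHERE S.sid = E.sid AND E.grade = 'B';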
This view can be used just like a base table, or explicitly stored table, in defining new queries or
views.
If we decide that we no longer need a base table and want to destroy it (i.e., delete all the rows
and remove the table definition information), we can use the DROP TABLE command. For
example, DROP TABLE Students RESTRICT destroys the Students table unless some view or
integrity constraint refers to Students; if so, the command fails. If the keyword RESTRICT is
replaced by CASCADE, Students is dropped and any referencing views or integrity constraints
are (recursively) dropped as well; one of these two keywords must always be specified. A view
can be dropped using the DROP VIEW command, which is just like DROP TABLE.
ALTER TABLE modifies the structure of an existing table. To add a column called maiden-name
to Students, for example, we would use the following command:
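ALTER TABLE Students ADD COLUMN maiden-name CHAR(10)
(Most SQL systems would require a column name without the hyphen, e.g. maiden_name.)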
TRIGGERS
A trigger action can examine the answers to the query in the condition part of the trigger, refer to
old and new values of tuples modified by the statement activating the trigger, execute new
queries, and make changes to the database.
For example, a trigger can maintain a statistics table: the first two fields of each record identify
the modified table (Students) and the kind of modifying statement (an INSERT), and the third
field is the number of inserted Students tuples with age < 18. (The trigger in Figure 5.19 only
computes the count; an additional trigger is required to insert the appropriate tuple into the
statistics table.)
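The figure itself is not reproduced here; the following is a sketch of such a trigger in Oracle-style
syntax, assuming an integer variable count that a companion trigger initializes to zero before
each INSERT statement:

CREATE TRIGGER incr_count AFTER INSERT ON Students  /* Event */
WHEN (new.age < 18)                                 /* Condition */
FOR EACH ROW
BEGIN                                               /* Action: count one young student */
    count := count + 1;
END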
UNIT-III
SCHEMA REFINEMENT
Storing the same information redundantly, that is, in more than one place within a database, can
lead to several problems: redundant storage, update anomalies, insertion anomalies, and deletion
anomalies.
Unless we are careful, decomposing a relation schema can create more problems than it solves.
Two important questions must be asked repeatedly:
1. Do we need to decompose a relation?
2. What problems (if any) does a given decomposition cause?
FUNCTIONAL DEPENDENCIES
A functional dependency (FD) X → Y holds over relation R if, in every allowable instance of R,
any two tuples that agree on the attributes in X also agree on the attributes in Y.
The set of all FDs implied by a given set F of FDs is called the closure of F and is
denoted as F+. An important question is how we can infer, or
compute, the closure of a given set F of FDs. The answer is simple
and elegant. The following three rules, called Armstrong's Axioms,
can be applied repeatedly to infer all FDs implied by a set F of FDs.
We use X, Y, and Z to denote sets of attributes over a relation schema
R:
Reflexivity: If X ⊇ Y, then X → Y.
Augmentation: If X → Y, then XZ → YZ for any Z.
Transitivity: If X → Y and Y → Z, then X → Z.
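For example, suppose F = {A → B, B → C, CD → E} over a relation R(A, B, C, D, E). By
transitivity on A → B and B → C, we infer A → C; augmenting A → C with D gives AD → CD;
and transitivity on AD → CD and CD → E gives AD → E. All of these inferred FDs belong to F+.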
NORMAL FORMS
The normal forms based on FDs are first normal form (1NF), second
normal form (2NF), third normal form (3NF), and Boyce-Codd
normal form (BCNF). These forms have increasingly restrictive
requirements: Every relation in BCNF is also in 3NF, every relation
in 3NF is also in 2NF, and every relation in 2NF is in 1NF. A
relation is in first normal form if every field contains only atomic
values, that is, not lists or sets. This requirement is implicit in our
definition of the relational model. Although some of the newer database
systems are relaxing this requirement, in this chapter we will assume
that it always holds. 2NF is mainly of historical interest. 3NF and
BCNF are important from a database design standpoint.
Boyce-Codd Normal Form
Let R be a relation schema, X a subset of the attributes of R, and A an attribute of R. R is in
Boyce-Codd normal form if, for every FD X → A that holds over R, one of the following is true:
A ∈ X; that is, it is a trivial FD, or
X is a superkey.
R is in third normal form if, for every FD X → A, one of the following is true:
A ∈ X; that is, it is a trivial FD, or
X is a superkey, or
A is part of some key for R.
Transitive Dependencies
An FD X → A is a transitive dependency if there is a set of attributes Y such that X → Y and
Y → A hold, but Y → X does not hold. 3NF can equivalently be described as prohibiting
transitive dependencies of non-prime attributes (attributes not part of any key) on candidate keys.
DECOMPOSITIONS
Lossless-Join Decomposition
Let R be a relation and F be a set of FDs that hold over R.
The decomposition of R into relations with attribute sets R1 and R2 is lossless if and only if
F+ contains either the FD R1 ∩ R2 → R1 or the FD R1 ∩ R2 → R2.
Dependency-Preserving Decomposition
Consider the relation CSJDPQV with FDs SD → P and J → S. Using SD → P, we decompose
CSJDPQV into SDP and CSJDQV; using J → S, we then decompose CSJDQV into JS and
CJDQV:

CSJDPQV
  → SDP, CSJDQV   (using SD → P)
  → JS, CJDQV     (using J → S)

Each of the schemas SDP, JS, and CJDQV is in BCNF, and this collection of schemas also
represents a lossless-join decomposition of CSJDPQV.
Redundancy in BCNF Revisited
The following example, used in computing a minimal cover of a set of FDs, illustrates the
manipulations involved. Consider the FDs: A → B, ABCD → E, EF → G, EF → H, and
ACDF → EG. First, let us rewrite ACDF → EG so that every right side is a single attribute:
ACDF → E and ACDF → G.
Both of these FDs are implied by the remaining FDs (for example, ACDF → G follows from
A → B, ABCD → E, and EF → G), so they can be discarded. Because A → B holds, ABCD → E
can then be replaced by ACD → E, giving the minimal cover:
A → B, ACD → E, EF → G, and EF → H.
Multivalued Dependencies
The multivalued dependency (MVD) X →→ Y is said to hold over a relation R if, in every legal
instance of R, the set of Y values associated with a given X value is independent of the values of
the remaining attributes.
Suppose that we have a relation with attributes course, teacher, and
book, which we denote as CTB. The meaning of a tuple is that
teacher T can teach course C, and book
B is a recommended text for the course. There are no FDs; the
key is CTB.
However, the recommended texts for a course are independent of the
instructor.
The instance shown in Figure 15.13 illustrates this situation.
The redundancy can be eliminated by decomposing CTB into CT
and CB.
MVD Complementation: If X →→ Y, then X →→ R − XY.
MVD Augmentation: If X →→ Y and W ⊇ Z, then WX →→ YZ.
MVD Transitivity: If X →→ Y and Y →→ Z, then X →→ (Z − Y).
Replication: If X → Y, then X →→ Y.
Coalescence: If X →→ Y and there is a W such that W ∩ Y is empty, W → Z, and Y ⊇ Z,
then X → Z.
Observe that replication states that every FD is also an MVD.
Fourth Normal Form
A relation schema R is in 4NF if, for every MVD X →→ Y that holds over R, either the MVD is
trivial (Y ⊆ X or XY = R) or X is a superkey.
B C A D
b c1 a1 d1 | tuple t1
b c2 a2 d2 | tuple t2
b c1 a2 d2 | tuple t3
Consider tuples t2 and t3. From the given FD A → BCD and the fact that these tuples have the
same A-value, we can deduce that c1 = c2. Thus, we see that the FD B → C must hold over
ABCD whenever the FD A → BCD and the MVD B →→ C hold. If B → C holds, the relation
ABCD is not in BCNF (unless additional FDs hold that make B a key)!
Join Dependencies
A join dependency (JD) ⊲⊳ {R1, . . . , Rn} is said to hold over a relation R if R1, . . . , Rn is a
lossless-join decomposition of R. The second condition deserves some explanation, since we
have not presented inference rules for FDs and JDs taken together. Intuitively, we must be able
to show that the decomposition of R into {R1, . . . , Rn} is lossless-join whenever the key
dependencies (FDs in which the left side is a key for R) hold. ⊲⊳ {R1, . . . , Rn} is a trivial JD
if Ri = R for some i; such a JD always holds.
The following result, also due to Date and Fagin, identifies conditions (again, detected using
only FD information) under which we can safely ignore JD information.
UNIT-IV
TRANSACTION MANAGEMENT
What is a Transaction?
A transaction is an event which occurs on the database. Generally a transaction reads a value from
the database or writes a value to the database. If you have any concept of Operating Systems, then
we can say that a transaction is analogous to a process.
Although a transaction can both read and write on the database, there are some fundamental
differences between these two classes of operations. A read operation does not change the image of
the database in any way. But a write operation, whether performed with the intention of inserting,
updating or deleting data from the database, changes the image of the database. That is, we may say
that these transactions bring the database from an image which existed before the transaction
occurred (called the Before Image or BFIM) to an image which exists after the transaction occurred
(called the After Image or AFIM).
Atomicity: This means that either all of the instructions within the transaction will be reflected in the
database, or none of them will be reflected.
Say, for example, we have two accounts A and B, each containing Rs 1000/-. We now start a transaction to transfer Rs 100/- from account A to account B.
Read A;
A = A – 100;
Write A;
Read B;
B = B + 100;
Write B;
Now, suppose there is a power failure just after instruction 3 (Write A) has completed. What happens now? After the system recovers, the AFIM will show Rs 900/- in A, but the same Rs 1000/- in B. It would seem that Rs 100/- vanished into thin air because of the power failure. Clearly, such a situation is not acceptable.
The solution is to keep every value calculated by the instructions of the transaction not in stable storage (hard disk) but in volatile storage (RAM), until the transaction completes its last instruction. When we see that there has not been any error, we perform what is known as a COMMIT operation. Its job is to write every temporarily calculated value from the volatile storage onto the stable storage. In this way, even if power fails at instruction 3, the post-recovery image of the database will show accounts A and B both containing Rs 1000/-, as if the failed transaction had never occurred.
Consistency: To give better performance, every database management system supports the execution of multiple transactions at the same time, using CPU time sharing. Concurrently executing transactions may have to deal with the problem of sharable resources, i.e. resources that multiple transactions are trying to read/write at the same time. For example, we may have a table or a record on which two transactions are trying to read or write at the same time. Careful mechanisms are created in order to prevent mismanagement of these sharable resources, so that there should not be any change in the way a transaction performs. A transaction which deposits Rs 100/- to account A must deposit the same amount whether it is acting alone or in conjunction with another transaction that may be trying to deposit or withdraw some amount at the same time.
Isolation: In case multiple transactions are executing concurrently and trying to access a sharable
resource at the same time, the system should create an ordering in their execution so that they should
not create any anomaly in the value stored at the sharable resource.
Durability: It states that once a transaction has completed, the changes it has made should be permanent.
As we have seen in the explanation of the Atomicity property, the transaction, if it completes successfully, is committed. Once the COMMIT is done, the changes which the transaction has made to the database are immediately written into permanent storage. So, after the transaction has been committed successfully, there is no question of any loss of information even if the power fails. Committing a transaction guarantees that the AFIM has been reached.
There are several ways Atomicity and Durability can be implemented. One of them is called Shadow
Copy. In this scheme a database pointer is used to point to the BFIM of the database. During the
transaction, all the temporary changes are recorded into a Shadow Copy, which is an exact copy of
the original database plus the changes made by the transaction, which is the AFIM. Now, if the
transaction is required to COMMIT, then the database pointer is updated to point to the AFIM copy,
and the BFIM copy is discarded. On the other hand, if the transaction is not committed, then the
database pointer is not updated. It keeps pointing to the BFIM, and the AFIM is discarded. This is a
simple scheme, but takes a lot of memory space and time to implement.
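As a minimal sketch of the shadow-copy idea in Python: the whole database is assumed to fit in one file, the file name stands in for the database pointer, and os.replace plays the role of the atomic pointer update. The file names and the update callback are illustrative choices, not from the text.

import os, shutil

DB = 'db.dat'   # stands in for whatever the database pointer currently points to

def run_transaction(update):
    # Copy the BFIM to a shadow file and apply all changes there (the AFIM).
    shutil.copyfile(DB, 'db.shadow')
    try:
        update('db.shadow')
        # COMMIT: atomically swing the "pointer" to the AFIM.
        os.replace('db.shadow', DB)
    except Exception:
        # Abort: discard the AFIM; the BFIM is untouched.
        os.remove('db.shadow')
        raise

Until os.replace runs, readers still see the BFIM; a crash before that point simply leaves a stale shadow file to be discarded on restart, which mirrors the scheme described above.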
If you study carefully, you can understand that Atomicity and Durability are closely related and are typically implemented together by the recovery mechanism, just as Consistency and Isolation are closely related and are enforced together by concurrency control.
Active: This is the initial state; the transaction stays in this state while its instructions are executing.
Partially Committed: At any given point of time, if the transaction is executing properly, then it is moving towards its COMMIT POINT. The values generated during the execution are all stored in volatile storage.
Failed: If the transaction fails for some reason, the temporary values are no longer required, and the transaction is set to ROLLBACK. This means that any change made to the database by this transaction up to the point of the failure must be undone. If the failed transaction has withdrawn Rs 100/- from account A, then the ROLLBACK operation should add Rs 100/- back to account A.
Aborted: When the ROLLBACK operation is over, the database reaches the BFIM. The
transaction is now said to have been aborted.
Committed: If no failure occurs then the transaction reaches the COMMIT POINT. All the
temporary values are written to the stable storage and the transaction is said to have been
committed.
Terminated: Either committed or aborted, the transaction finally reaches this state.
[State transition diagram: from the Entry Point a transaction is ACTIVE; it moves either to PARTIALLY COMMITTED and then COMMITTED, or to FAILED and then ABORTED; from COMMITTED or ABORTED it becomes TERMINATED.]
In a serial schedule, there is no question of sharing a single data item among many transactions, because no more than a single transaction is executing at any point of time. However, a serial schedule is inefficient in the sense that the transactions suffer from longer waiting times and response times, as well as low resource utilization.
In a concurrent schedule, CPU time is shared among two or more transactions in order to run them concurrently. However, this creates the possibility that more than one transaction may need to access a single data item for read/write purposes, and the database could contain inconsistent values if such accesses are not handled properly. Let us explain with the help of an example.
Let us consider two transactions T1 and T2, whose instruction sets are given below. T1 is the same as we have seen earlier, while T2 is a new transaction.
T1
Read A;
A = A – 100;
Write A;
Read B;
B = B + 100;
Write B;
T2
Read A;
Temp = A * 0.1;
Read C;
C = C + Temp;
Write C;
If we prepare a serial schedule, then either T1 will completely finish before T2 can begin, or T2 will completely finish before T1 can begin. However, if we want to create a concurrent schedule, then some Context Switching needs to be made, so that some portion of T1 will be executed, then some portion of T2 will be executed, and so on. For example, say we have prepared the following concurrent schedule.
T1                          T2
Read A;
A = A – 100;
Write A;
                            Read A;
                            Temp = A * 0.1;
                            Read C;
                            C = C + Temp;
                            Write C;
Read B;
B = B + 100;
Write B;
No problem here. We have made two Context Switches in this schedule: the first after executing the third instruction of T1, and the second after executing the last instruction of T2. T1 first deducts Rs 100/- from A and writes the new value of Rs 900/- into A. T2 reads the value of A, calculates the value of Temp to be Rs 90/-, and adds that value to C. The remaining part of T1 is then executed, and Rs 100/- is added to B.
It is clear that proper Context Switching is very important in order to maintain the Consistency and Isolation properties of the transactions. But let us take another example where a wrong Context Switch can bring about disaster. Consider the following schedule involving the same T1 and T2.
T1                          T2
Read A;
A = A – 100;
                            Read A;
                            Temp = A * 0.1;
                            Read C;
                            C = C + Temp;
                            Write C;
Write A;
Read B;
B = B + 100;
Write B;
This schedule is wrong, because we have made the switch at the second instruction of T1, and the result is incorrect. If accounts A and B both contain Rs 1000/- each, then this schedule should have left Rs 900/- in A, Rs 1100/- in B, and added Rs 90/- to C (as C should be increased by 10% of the amount in A). But in this wrong schedule, the Context Switch is performed before the new value of Rs 900/- has been written to A. T2 therefore reads the old value of A, which is still Rs 1000/-, and deposits Rs 100/- in C. C makes an unjust gain of Rs 10/- out of nowhere.
Serializability
When several concurrent transactions are trying to access the same data item, the instructions within these concurrent transactions must be ordered in some way so that there is no problem in accessing and releasing the shared data item. Two aspects of serializability, view serializability and conflict serializability, are described here:
View Serializability:
This is one type of serializability that can be checked by creating another schedule out of an existing schedule, involving the same set of transactions. The two schedules would be called View Equivalent if the following rules are followed while creating the second schedule out of the first. Let us consider that the transactions T1 and T2 are being serialized to create two different schedules S1 and S2, which we want to be View Equivalent, and both T1 and T2 want to access the same data item.
1. If in S1, T1 reads the initial value of the data item, then in S2 also, T1 should read the initial value of that same data item.
2. If in S1, T1 writes a value into the data item which is read by T2, then in S2 also, T1 should write the value into the data item before T2 reads it.
3. If in S1, T1 performs the final write on the data item, then in S2 also, T1 should perform the final write on that data item.
Conflict Serializability:
Let us consider a schedule S in which there are two consecutive instructions, I and J, of transactions Ti and Tj, respectively (i ≠ j). If I and J refer to different data items, then we can swap I and J without affecting the results of any instruction in the schedule. However, if I and J refer to the same data item Q, then the order of the two steps may matter. Since we are dealing with only read and write instructions, there are four cases that we need to consider:
1. I = read(Q), J = read(Q). The order of I and J does not matter, since the same value of Q is read by Ti and Tj, regardless of the order.
2. I = read(Q), J = write(Q). If I comes before J, then Ti does not read the value of Q that is written by Tj in instruction J. If J comes before I, then Ti reads the value of Q that is written by Tj. Thus, the order of I and J matters.
3. I = write(Q), J = read(Q). The order of I and J matters for reasons similar to those of the previous case.
4. I = write(Q), J = write(Q). Since both instructions are write operations, the order of these instructions does not affect either Ti or Tj. However, the value obtained by the next read(Q) instruction of S is affected, since the result of only the latter of the two write instructions is preserved in the database. If there is no other write(Q) instruction after I and J in S, then the order of I and J directly affects the final value of Q in the database state that results from schedule S.
Transaction Characteristics
Every transaction has three characteristics: access mode, diagnostics size, and isolation level.
The diagnostics size determines the number of error conditions that can be recorded.
If the access mode is READ ONLY, the transaction is not allowed to modify the database.
Thus, INSERT, DELETE, UPDATE, and CREATE commands cannot be executed. If we have
to execute one of these commands, the access mode should be set to READ WRITE. For
transactions with READ ONLY access mode, only shared locks need to be obtained, thereby
increasing concurrency.
The isolation level controls the extent to which a given transaction is exposed to the actions of
other transactions executing concurrently. By choosing one of four possible isolation level
settings, a user can obtain greater concurrency at the cost of increasing the transaction's
exposure to other transactions' uncommitted changes.
REPEATABLE READ ensures that T reads only the changes made by committed transactions, and that no value read or written by T is changed by any other transaction until T is complete. However, T could experience the phantom phenomenon: for example, while T examines all Sailors records with a given rating, another transaction might add a new such Sailors record, which is missed by T.
READ COMMITTED ensures that T reads only the changes made by committed transactions,
and that no value written by T is changed by any other transaction until T is complete. However,
a value read by T may well be modified by another transaction while T is still in progress, and T
is, of course, exposed to the phantom problem.
A READ COMMITTED transaction obtains exclusive locks before writing objects and holds
these locks until the end. It also obtains shared locks before reading objects, but these locks are
released immediately; their only effect is to guarantee that the transaction that last modified the
object is complete. (This guarantee relies on the fact that every SQL transaction obtains
exclusive locks before writing objects and holds exclusive locks until the end.)
A READ UNCOMMITTED transaction does not obtain shared locks before reading objects.
This mode represents the greatest exposure to uncommitted changes of other transactions; so
much so that SQL prohibits such a transaction from making any changes itself - a READ
UNCOMMITTED transaction is required to have an access mode of READ ONLY. Since such a
transaction obtains no locks for reading objects, and it is not allowed to write objects (and
therefore never requests exclusive locks), it never makes any lock requests.
The SERIALIZABLE isolation level is generally the safest and is recommended for most
transactions. Some transactions, however, can run with a lower isolation level, and the smaller
number of locks requested can contribute to improved system performance.
For example, a statistical query that finds the average sailor age can be run at the READ
COMMITTED level, or even the READ UNCOMMITTED level, because a few incorrect or
missing values will not significantly affect the result if the number of sailors is large. The
isolation level and access mode can be set using the SET TRANSACTION command. For
example, the following command declares the current transaction to be SERIALIZABLE and
READ ONLY:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY
When a transaction is started, the default is SERIALIZABLE and READ WRITE.
PRECEDENCE GRAPH
Consider a precedence graph of a schedule D with 3 transactions. If there is a cycle (here, of length 2, with two edges) through the committed transactions T1 and T2, the schedule (history) is not conflict serializable.
To test for conflict serializability, we construct the precedence graph and invoke a cycle-detection algorithm. Cycle-detection algorithms exist which take order n^2 time, where n is the number of vertices in the graph. If the precedence graph is acyclic, a serializability order can be obtained through a topological sort of the graph.
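A sketch in Python of building a precedence graph from a schedule and testing it for a cycle with depth-first search. The encoding of a schedule as (transaction, operation, item) triples and the two-transaction example are illustrative assumptions, not from the text.

def precedence_graph(schedule):
    # schedule: list of (txn, op, item) with op 'R' or 'W', in execution order.
    # Edge Ti -> Tj whenever an operation of Ti conflicts with a later
    # operation of Tj (same item, at least one write, different transactions).
    edges = set()
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            if ti != tj and x == y and 'W' in (op1, op2):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    visiting, done = set(), set()
    def dfs(u):
        visiting.add(u)
        for v in graph.get(u, []):
            if v in visiting or (v not in done and dfs(v)):
                return True            # a back edge closes a cycle
        visiting.discard(u)
        done.add(u)
        return False
    return any(dfs(u) for u in list(graph) if u not in done)

s = [('T1', 'R', 'A'), ('T2', 'W', 'A'),   # gives edge T1 -> T2 on A
     ('T2', 'R', 'B'), ('T1', 'W', 'B')]   # gives edge T2 -> T1 on B
print(has_cycle(precedence_graph(s)))      # True: not conflict serializable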
RECOVERABLE SCHEDULES
Recoverable schedule — if a transaction Tj reads a data item previously written by a transaction Ti, then the commit operation of Ti must appear before the commit operation of Tj.
Consider a schedule in which T9 reads a data item A written by T8 and commits immediately after the read(A) operation. Such a schedule is not recoverable: if T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence, the database must ensure that schedules are recoverable.
CASCADING ROLLBACKS
Cascading rollback — a single transaction failure leads to a series of transaction rollbacks: if Tj has read a data item written by Ti and Ti aborts, then Tj must be rolled back as well, and so on. This can lead to the undoing of a significant amount of work.
CASCADELESS SCHEDULES
Cascadeless schedules — for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj. Every cascadeless schedule is also recoverable, and cascadeless schedules avoid cascading rollbacks by construction.
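A small Python sketch of checking the cascadeless condition: every read of an item must come after the commit of the transaction that last wrote it. The schedule encoding and the T8/T9 example are illustrative assumptions.

def is_cascadeless(schedule):
    # schedule: list of (txn, action, item); action is 'R', 'W' or 'COMMIT'
    # (item is None for COMMIT). Cascadeless: no transaction ever reads an
    # item whose last writer has not yet committed.
    committed, last_writer = set(), {}
    for txn, action, item in schedule:
        if action == 'COMMIT':
            committed.add(txn)
        elif action == 'W':
            last_writer[item] = txn
        elif action == 'R':
            w = last_writer.get(item)
            if w is not None and w != txn and w not in committed:
                return False
    return True

s = [('T8', 'W', 'A'), ('T9', 'R', 'A'),
     ('T8', 'COMMIT', None), ('T9', 'COMMIT', None)]
print(is_cascadeless(s))   # False: T9 reads A before T8 commits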
CONCURRENCY CONTROL
A database must provide a mechanism that will ensure that all possible schedules are both:
conflict (or view) serializable, and
recoverable, and preferably cascadeless.
A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency. Testing a schedule for serializability after it has executed is a little too late! The goal is therefore to develop concurrency control protocols that assure serializability in advance.
Some applications are willing to live with weak levels of consistency, allowing schedules that are not serializable:
E.g., a read-only transaction that wants to get an approximate total balance of all accounts.
E.g., database statistics computed for query optimization can be approximate (why?).
Such transactions need not be serializable with respect to other transactions.
Repeatable read — only committed records may be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable: it may find some records inserted by another transaction but not find others.
Read committed — only committed records can be read, but successive reads of a record may return different (but committed) values.
Lower degrees of consistency are useful for gathering approximate information about the database.
E.g., Oracle and PostgreSQL by default support a level of consistency called snapshot isolation (not part of the SQL standard).
TRANSACTION DEFINITION IN SQL
The data manipulation language must include a construct for specifying the set of actions that comprise a transaction. In SQL, a transaction begins implicitly and ends with either COMMIT WORK, which commits it, or ROLLBACK WORK, which aborts it. In almost all database systems, by default, every SQL statement also commits implicitly if it executes successfully.
RECOVERY SYSTEM
Failure Classification:
Transaction failure:
  Logical errors: the transaction cannot complete due to some internal error condition.
  System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock).
System crash: a power failure or other hardware or software failure causes the system to crash. Database systems have numerous integrity checks to prevent corruption of disk data.
Disk failure: a head crash or similar disk failure destroys all or part of disk storage.
RECOVERY ALGORITHMS
Consider a transaction that transfers money from account A to account B: it requires two modifications, one to A and one to B. A failure may occur after one of these modifications has been made but before both of them are made.
Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state.
Not modifying the database may result in lost updates if a failure occurs just after the transaction commits.
Recovery algorithms therefore have two parts:
1. Actions taken during normal transaction processing to ensure that enough information exists to recover from failures.
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability.
STORAGE STRUCTURE
Volatile storage: does not survive system crashes; examples are main memory and cache memory.
Nonvolatile storage: survives system crashes; examples are disk, tape, and flash memory.
Stable storage: an idealized form of storage that survives all failures; it is approximated by maintaining multiple copies on distinct nonvolatile media. Copies can be at remote sites to protect against disasters such as fire or flooding.
A block transfer can result in successful completion, partial failure, or total failure.
Protecting storage media from failure during data transfer (one solution, assuming two copies of each block):
1. Write the information onto the first physical block.
2. When the first write successfully completes, write the same information onto the second physical block.
3. The output is completed only after the second write successfully completes.
Copies of a block may differ due to a failure during an output operation. To recover from the failure:
1. First find the inconsistent blocks. An expensive solution is to compare the two copies of every disk block.
2. Better solution: record in-progress disk writes on nonvolatile storage. Use this information during recovery to find blocks that may be inconsistent, and only compare copies of these.
DATA ACCESS
Physical blocks are the blocks residing on the disk; buffer blocks are the blocks residing temporarily in main memory. Block movements between disk and main memory are initiated through the following two operations:
input(B) transfers the physical block B to main memory.
output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.
We assume, for simplicity, that each data item fits in, and is stored inside, a single block.
Each transaction Ti has its private work-area in which local copies of all data items accessed and updated by it are kept. Transferring data items between system buffer blocks and this private work-area is done by:
read(X) assigns the value of data item X to the local variable xi.
write(X) assigns the value of local variable xi to data item X in the buffer block.
Transactions must perform read(X) before accessing X for the first time (subsequent reads can be from the local copy), and write(X) can be executed at any time before the transaction commits. Note that output(BX) need not immediately follow write(X): the system can perform the output operation when it sees fit.
LOCK-BASED PROTOCOLS
A lock is a mechanism to control concurrent access to a data item. Data items can be locked in two modes: shared (S), requested for reading, and exclusive (X), requested for writing.
1) A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
2) Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
3) If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released. The lock is then granted.
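A minimal Python sketch of these granting rules, with a compatibility matrix for shared (S) and exclusive (X) locks; the data structures are illustrative choices, not from the text.

COMPATIBLE = {('S', 'S'): True, ('S', 'X'): False,
              ('X', 'S'): False, ('X', 'X'): False}

def can_grant(requested, held):
    # held: list of (txn, mode) locks currently held on the data item.
    # Grant only if the requested mode is compatible with every held lock.
    return all(COMPATIBLE[(mode, requested)] for _, mode in held)

held = [('T1', 'S'), ('T2', 'S')]
print(can_grant('S', held))   # True: any number of shared locks can coexist
print(can_grant('X', held))   # False: the requester must wait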
TIMESTAMP-BASED PROTOCOLS
1. Each transaction is issued a timestamp when it enters the system. If an old transaction
Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that
TS(Ti) <TS(Tj).
2. The protocol manages concurrent execution such that the time-stamps determine the
serializability order.
3. In order to assure such behavior, the protocol maintains for each data item Q two timestamp values:
a.W-timestamp(Q) is the largest time-stamp of any transaction that executed
write(Q) successfully.
b.R-timestamp(Q) is the largest time-stamp of any transaction that executed
read(Q) successfully.
4. The timestamp ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order.
5. Suppose a transaction Ti issues a read(Q):
  1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
  2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).
6. Suppose a transaction Ti issues a write(Q). Under the basic timestamp-ordering protocol, the write is rejected and Ti is rolled back if TS(Ti) < R-timestamp(Q) or TS(Ti) < W-timestamp(Q); otherwise, the write is executed and W-timestamp(Q) is set to TS(Ti).
In the textbook's example of two transactions T27 and T28, where T28 has already written Q, any transaction Tj with TS(Tj) > TS(T28) must read the value of Q written by T28, rather than the value that T27 is attempting to write. This observation leads to a modified version of the timestamp-ordering protocol in which obsolete write operations can be ignored under certain circumstances. The protocol rules for read operations remain unchanged. The protocol rules for write operations, however, are slightly different from the timestamp-ordering protocol.
The modification to the timestamp-ordering protocol, called Thomas' write rule, is this: suppose that transaction Ti issues write(Q).
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was previously needed, and it had been assumed that the value would never be produced. Hence, the system rejects the write operation and rolls Ti back.
2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation can be ignored.
3. Otherwise, the system executes the write operation and sets W-timestamp(Q) to TS(Ti).
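A sketch in Python of the read and write rules just described, including Thomas' write rule. Timestamps are assumed to be integers and per-item R/W timestamps are assumed to start at 0; all names are illustrative.

R_ts, W_ts = {}, {}   # per-item read/write timestamps, defaulting to 0

def read(ts, q):
    if ts < W_ts.get(q, 0):
        return 'ROLLBACK'    # Q was already overwritten by a younger txn
    R_ts[q] = max(R_ts.get(q, 0), ts)
    return 'OK'

def write(ts, q):
    if ts < R_ts.get(q, 0):
        return 'ROLLBACK'    # a younger txn already read the old value
    if ts < W_ts.get(q, 0):
        return 'IGNORED'     # Thomas' write rule: skip the obsolete write
    W_ts[q] = ts
    return 'OK'

print(write(5, 'Q'))   # OK: W-timestamp(Q) becomes 5
print(write(3, 'Q'))   # IGNORED rather than rolled back
print(read(2, 'Q'))    # ROLLBACK: TS 2 < W-timestamp(Q) = 5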
VALIDATION-BASED PROTOCOLS
Phases in Validation-Based Protocols: each transaction executes in up to three phases:
1. Read phase: the transaction reads data items and performs all writes on temporary local variables, without updating the actual database.
2. Validation phase: a validation test is applied to determine whether the transaction can commit without violating serializability.
3. Write phase: if the transaction passes validation, the temporary local results are copied into the database; otherwise, it is rolled back.
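A hedged sketch of the validation test under one common, simplified formulation (an assumption on my part, not spelled out in the text): a transaction Tj passes validation only if no transaction that validated earlier and overlapped Tj in time wrote an item that Tj read.

def validate(read_set_tj, write_sets_of_overlapping_earlier):
    # Tj passes only if its read set is disjoint from the write set of
    # every earlier-validated transaction that overlapped it in time.
    return all(not (ws & read_set_tj)
               for ws in write_sets_of_overlapping_earlier)

print(validate({'A', 'B'}, [{'C'}]))   # True: Tj may enter its write phase
print(validate({'A', 'B'}, [{'B'}]))   # False: Tj is rolled back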
MULTIPLE GRANULARITY LOCKING (MGL)
In MGL, locks are set on objects that contain other objects. MGL exploits the hierarchical nature of the contains relationship. For example, a database may have files, which contain pages, which in turn contain records. This can be thought of as a tree of objects, where each node contains its children. A lock on a node, such as a shared or exclusive lock, implicitly locks the targeted node as well as all of its descendants.
RECOVERY:
When the system recovers from a failure, it can restore the latest dump. It can maintain a redo-list and an undo-list as in checkpointing, and it can recover the system by consulting the undo-redo lists to restore the state of all transactions up to the last checkpoint.
REMOTE BACKUP
Remote backup provides a sense of security and safety in case the primary location where the
database is located gets destroyed. Remote backup can be offline or real-time and online. In case
it is offline it is maintained manually.
BUFFER MANAGEMENT
Database buffers are generally implemented in virtual memory, in spite of some drawbacks: when the operating system needs to evict a page that has been modified, the page is written to swap space on disk.
FUZZY CHECKPOINTING
To avoid a long interruption of normal processing during checkpointing, updates are allowed to happen while the checkpoint is in progress. Fuzzy checkpointing is done as follows:
1. Temporarily stop all updates by transactions.
2. Write a <checkpoint L> log record and force the log to stable storage.
3. Note the list M of modified buffer blocks.
4. Now permit transactions to proceed with their actions.
5. Output to disk all modified buffer blocks in list M.
   Blocks should not be updated while being output.
   Follow WAL: all log records pertaining to a block must be output before the block is output.
6. Store a pointer to the checkpoint record in a fixed position last_checkpoint on disk.
When recovering using a fuzzy checkpoint, start the scan from the checkpoint record pointed to by last_checkpoint. Log records before last_checkpoint have their updates reflected in the database on disk, and need not be redone.
DBMS FILE STRUCTURE
Heap File Organization: When a file is created using the Heap File Organization mechanism, the Operating System allocates a memory area to that file without any further accounting details. File records can be placed anywhere in that memory area.
Sequential File Organization: Every file record contains a data field (attribute) to uniquely identify that record. In the sequential file organization mechanism, records are placed in the file in some sequential order based on the unique key field or search key. Practically, it is not possible to store all the records sequentially in physical form.
Hash File Organization: This mechanism uses a hash function computed on some field of the records. As we know, a file is a collection of records, which has to be mapped onto some block of the disk space allocated to it.
Clustered File Organization: Clustered file organization is not considered good for large databases. In this mechanism, related records from one or more relations are kept in the same disk block; that is, the ordering of records is not based on the primary key or search key.
FILE OPERATIONS
Operations on database files can be broadly classified into two categories:
Update Operations
Retrieval Operations
Update operations change the data values by insertion, deletion or update. Retrieval operations, on the other hand, do not alter the data but retrieve it after optional conditional filtering. In both types of operations, selection plays a significant role. Other than the creation and deletion of a file, several operations can be performed on files.
Open: A file can be opened in one of two modes, read mode or write mode. In read mode, the operating system does not allow anyone to alter data; it is solely for reading. Files opened in read mode can be shared among several entities. The other mode is write mode, in which data modification is allowed. Files opened in write mode can be read as well, but cannot be shared.
DBMS INDEXING
We know that information in DBMS files is stored in the form of records. Every record is equipped with some key field which helps it to be recognized uniquely.
Indexing is defined based on its indexing attributes. Indexing can be of the following types:
Primary Index: If the index is built on the ordering key field of a file, it is called a Primary Index. Generally it is the primary key of the relation.
Secondary Index: If the index is built on a non-ordering field of a file, it is called a Secondary Index.
Clustering Index: If the index is built on an ordering non-key field of a file, it is called a Clustering Index.
The ordering field is the field on which the records of the file are ordered. It can be different from the primary or candidate key of the file.
Ordered Indexing is of two types:
Dense Index
Sparse Index
Dense Index
In a dense index, there is an index record for every search key value in the database. This makes searching faster but requires more space to store the index records themselves. An index record contains the search key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here contains a search key and a pointer to the data on the disk. To search for a record, we follow the index record with the largest search key value that does not exceed the target, and from there reach the actual location of the data.
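A small Python sketch of a sparse-index lookup, assuming one index entry per block holding the block's first search-key value; the data layout and names are illustrative, and bisect is used to find the candidate block.

import bisect

# One index entry per block: (first search key in the block, the block itself).
index = [(1, [(1, 'r1'), (3, 'r3')]),
         (5, [(5, 'r5'), (7, 'r7')]),
         (9, [(9, 'r9'), (12, 'r12')])]
keys = [k for k, _ in index]

def lookup(key):
    pos = bisect.bisect_right(keys, key) - 1   # last index entry <= key
    if pos < 0:
        return None
    for k, rec in index[pos][1]:               # scan just that one block
        if k == key:
            return rec
    return None

print(lookup(7))    # 'r7', found via the entry whose first key is 5
print(lookup(8))    # None: no such record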
A Multi-level Index helps break the index down into several smaller indices, so that the outermost level is small enough to be saved in a single disk block, which can easily be accommodated anywhere in main memory.
B+ TREE
A B+ tree is a multi-level index format based on balanced search trees. As mentioned earlier, a single-level index becomes large as the database size grows, which also degrades performance. All leaf nodes of a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same height, and is thus balanced. Additionally, the leaf nodes are linked in a linked list, which enables a B+ tree to support both random access and sequential access.
STRUCTURE OF B+ TREE
Every leaf node is at equal distance from the root node. A B+ tree is of order n where n is fixed
for every B+ tree.
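A minimal sketch of B+ tree search in Python, assuming an already-built tree whose internal nodes hold keys and children and whose leaf nodes hold keys and values; insertion and rebalancing are omitted, and the node classes are my own illustrative construction.

import bisect

class Internal:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

class Leaf:
    def __init__(self, keys, values, nxt=None):
        self.keys, self.values, self.nxt = keys, values, nxt

def search(node, key):
    # Descend from the root to the unique leaf that could contain key;
    # keys equal to a separator are sent to the right child.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

leaf1 = Leaf([10, 20], ['a', 'b'])
leaf2 = Leaf([30, 40], ['c', 'd'])
leaf1.nxt = leaf2                    # the leaf chain supports sequential access
root = Internal([30], [leaf1, leaf2])
print(search(root, 40))              # 'd'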
B+ tree deletion
B+ tree entries are deleted at the leaf nodes. The target entry is searched for and deleted.
o If it is in an internal node, delete it and replace it with the entry from the left position.
After deletion, underflow is tested.
o If underflow occurs, redistribute the entries from the node to its left.
o If redistribution from the left is not possible, redistribute from the node to its right.
o If redistribution from both left and right is not possible, merge the node with its left or right sibling.
DBMS HASHING
For a huge database structure, it is sometimes not feasible to search through all the index levels and then reach the destination data block to retrieve the desired data. Hashing is an effective technique for calculating the direct location of a data record on the disk without using an index structure.
Hash Organization
Bucket: A hash file stores data in bucket format. A bucket is considered a unit of storage. A bucket typically stores one complete disk block, which in turn can store one or more records.
Operation:
Insertion: When a record is required to be entered using static hash, the hash function h,
computes the bucket address for search key K, where the record will be stored.
Bucket Overflow:
The condition of bucket-overflow is known as collision. This is a fatal state for any static hash
function. In this case overflow chaining can be used.
Overflow Chaining: When buckets are full, a new bucket is allocated for the same hash result
and is linked after the previous one. This mechanism is called Closed Hashing.
Linear Probing: When hash function generates an address at which data is already stored, the
next free bucket is allocated to it. This mechanism is called Open Hashing.
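A sketch in Python of static hashing with overflow chaining: a fixed number of buckets, each of fixed capacity, with an overflow bucket linked in when a bucket fills. The bucket count, capacity, and account records are illustrative assumptions.

N_BUCKETS, CAPACITY = 4, 2

class Bucket:
    def __init__(self):
        self.records, self.overflow = [], None

buckets = [Bucket() for _ in range(N_BUCKETS)]

def insert(key, record):
    b = buckets[key % N_BUCKETS]          # h(K): the hash function
    while len(b.records) >= CAPACITY:     # bucket full: follow/extend the chain
        if b.overflow is None:
            b.overflow = Bucket()         # overflow chaining (closed hashing)
        b = b.overflow
    b.records.append((key, record))

def lookup(key):
    b = buckets[key % N_BUCKETS]
    while b is not None:
        for k, r in b.records:
            if k == key:
                return r
        b = b.overflow
    return None

insert(42, 'acct-42'); insert(46, 'acct-46'); insert(50, 'acct-50')
print(lookup(50))   # 'acct-50', reached through the overflow bucket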
Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the size of the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets are added and removed dynamically and on demand. Dynamic hashing is also known as extended hashing.
Operations:
Querying: Look at the depth value of the hash index and use those bits to compute the bucket address.
Update: Perform a query as above and update the data.
Deletion: Perform a query to locate the desired data and delete it.
Insertion: Compute the address of the bucket.
o If the bucket is already full:
   Add more buckets.
   Add an additional bit to the hash value.
   Re-compute the hash function.
o Else:
   Add the data to the bucket.
o If all buckets are full, perform the remedies of static hashing.
Hashing is not favorable when the data is organized in some order and queries require a range of data; it performs best when the data is discrete and random. Hashing algorithms and their implementations have higher complexity than indexing, but all hash operations are done in constant time.
QUERY OPTIMIZATION
A given SQL query can be translated into many different algebraic expressions and evaluated through many different plans; query optimization is the task of choosing the cheapest one.
QUERY FLOW
Query Parser – Verifies the validity of the SQL statement and translates the query into an internal structure using relational calculus.
Query Optimizer – Finds the best expression among the various equivalent algebraic expressions. The criterion used is 'cheapness' (estimated evaluation cost).
Code Generator/Interpreter – Makes calls for the Query Processor as a result of the work done by the optimizer.
Query Processor – Executes the calls obtained from the code generator.
We use the number of block transfers from disk and the number of disk seeks to estimate the cost of a query-evaluation plan. If the disk subsystem takes an average of tT seconds to transfer a block of data, and has an average block-access time (disk seek time plus rotational latency) of tS seconds, then an operation that transfers b blocks and performs S seeks would take b ∗ tT + S ∗ tS seconds. The values of tT and tS must be calibrated for the disk system used, but typical values for high-end disks today would be tS = 4 milliseconds and tT = 0.1 milliseconds, assuming a 4-kilobyte block size and a transfer rate of 40 megabytes per second.
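A worked instance of this cost formula in Python, using the typical tT and tS values quoted above; the block and seek counts are made-up inputs for illustration.

def plan_cost(b, S, tT=0.1, tS=4.0):
    # Estimated cost in milliseconds: b * tT + S * tS
    return b * tT + S * tS

# e.g., a plan that transfers 1000 blocks and performs 10 seeks:
print(plan_cost(1000, 10))   # 1000 * 0.1 + 10 * 4 = 140.0 ms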