
Introduction to

Database Management
System

Kathmandu University
BE Computer Engineering
Year II/II
Overview
● Database Systems Applications
● Database System versus File systems
● View of Data
● Database Languages
● Database Users and Administrators
● Transaction Management
● Database Architecture

Introduction: Database Management System
As the name suggests, the Database Management System consists
of two parts. They are:
1. Database and
2. Management System

Database
● What is a Database?
● To find out what a database is, we have to start from data, which is
the basic building block of any DBMS.
● Data: Facts, figures, statistics, etc., having no particular meaning
on their own (e.g., 1, ABC, 19).
● Record: Collection of related data items, e.g. in the above example
the three data items had no meaning. But if we organize them in
the following way, then they collectively represent meaningful
information.

Database
● Table or Relation: Collection of related records

● The columns of this relation are called Fields, Attributes or


Domains. The rows are called Tuples or Records.
Database
● Database: Collection of related relations. Consider the following
collection of tables:

Database
● We now have a collection of 4 tables. They can be called a “related
collection” because we can clearly see that some common attributes
exist between pairs of tables.
● Because of these common attributes we may combine the data of
two or more tables together to find out the complete details of a
student.
● Questions like “Which hostel does the youngest student live in?”
can be answered now, although Age and Hostel attributes are in
different tables.
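
As an illustrative sketch (table and column names are assumed, not taken
from the slides), such a cross-table question can be answered in SQL by
joining on the common attribute:

    -- Which hostel does the youngest student live in?
    -- Assumes Student(RollNo, Name, Age) and Hostel(RollNo, HostelName)
    select h.HostelName
    from Student s, Hostel h
    where s.RollNo = h.RollNo
    order by s.Age asc
    limit 1;  -- LIMIT syntax varies by system (e.g., FETCH FIRST 1 ROW ONLY)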

Database Management System
● A database-management system (DBMS) is a collection of
interrelated data and a set of programs to access those data. This
is a collection of related data with an implicit meaning and hence
is a database.
● The collection of data, usually referred to as the database,
contains information relevant to an enterprise. The primary goal
of a DBMS is to provide a way to store and retrieve database
information that is both convenient and efficient.
● By data, we mean known facts that can be recorded and that have
implicit meaning.
Database Management System
● Database systems are designed to manage large bodies of information.
Management of data involves both defining structures for storage of
information and providing mechanisms for the manipulation of
information.
● In addition, the database system must ensure the safety of the
information stored, despite system crashes or attempts at unauthorized
access.
● If data are to be shared among several users, the system must avoid
possible anomalous results.

Database Management Systems
● DBMS contains information about a particular enterprise
○ Collection of interrelated data
○ Set of programs to access the data
○ An environment that is both convenient and efficient to use
● Database systems are used to manage collections of data that are:
○ Highly valuable
○ Relatively large
○ Accessed by multiple users and applications, often at the same time.
● A modern database system is a complex software system whose task is to
manage a large, complex collection of data.
● Databases touch all aspects of our lives

Views of DBMS
● A database in a DBMS could be viewed by lots of different people with
different responsibilities.

Applications of DBMS
● Enterprise Information
○ Sales: customers, products, purchases
○ Accounting: payments, receipts, assets
○ Human Resources: Information about employees, salaries, payroll taxes.
● Manufacturing: management of production, inventory, orders,
supply chain.
● Banking and finance
○ customer information, accounts, loans, and banking transactions.
○ Credit card transactions
○ Finance: sales and purchases of financial instruments (e.g., stocks and bonds);
storing real-time market data
● Universities: registration, grades
Applications of DBMS …
● Airlines: reservations, schedules
● Telecommunication: records of calls, texts, and data usage, generating
monthly bills, maintaining balances on prepaid calling cards
● Web-based services
○ Online retailers: order tracking, customized recommendations
○ Online advertisements
● Document databases
● Navigation systems: For maintaining the locations of various places of
interest along with the exact routes of roads, train systems, buses, etc.

University Database Example
● In this course we will be using a university database to illustrate most of
the concepts
● Data consists of information about:
○ Students
○ Instructors
○ Classes
● Application program examples:
○ Add new students, instructors, and courses
○ Register students for courses, and generate class rosters
○ Assign grades to students, compute grade point averages (GPA) and generate transcripts

History of Database Systems
1950s and early 1960s:
● Data processing using magnetic tapes for storage
○ Tapes provided only sequential access
● Punched cards for input
Late 1960s and 1970s:
● Hard disks allowed direct access to data
● Network and hierarchical data models in widespread use
● Ted Codd defines the relational data model; he would later win the ACM
Turing Award for this work
○ IBM Research begins the System R prototype
○ UC Berkeley (Michael Stonebraker) begins the Ingres prototype
○ Oracle releases the first commercial relational database
● High-performance (for the era) transaction processing
History of Database Systems …
1980s:
● Research relational prototypes evolve into commercial systems
○ SQL becomes industrial standard
● Parallel and distributed database systems
○ Wisconsin, IBM, Teradata
● Object-oriented database systems
1990s:
● Large decision support and data-mining applications
● Large multi-terabyte data warehouses
● Emergence of Web commerce

History of Database Systems …
2000s
● Big data storage systems
○ Google BigTable, Yahoo! PNUTS, Amazon
○ “NoSQL” systems.
● Big data analysis: beyond SQL
○ MapReduce

2010s
● SQL reloaded
○ SQL front end to MapReduce systems
○ Massively parallel database systems
○ Multi-core main-memory databases
File System?

● What is a file system?


● How is it used to store data/information?
● What are the advantages and disadvantages of file systems?
● How does a file system compare with a DBMS?

File System
In computing, a file system or filesystem (often abbreviated to fs) is the
method and data structure that an operating system uses to control how data
is stored and retrieved.

A file system organizes files and helps retrieve them when they are
required. File systems consist of files grouped into directories; the
directories may in turn contain other directories and files. A file system
performs basic operations such as file management, file naming, and
enforcing access rules.

File System
The FAT (short for File Allocation Table) file system is a general purpose file
system that is compatible with all major operating systems (Windows, Mac OS
X, and Linux/Unix).

Examples of file systems:

Windows: NTFS

macOS: APFS

Linux/Unix: Ext4

Purpose of Database Systems
In the early days, database applications were built directly on top of file
systems, which led to:
● Data redundancy and inconsistency: data is stored in multiple file
formats resulting in duplication of information in different files
● Difficulty in accessing data
○ Need to write a new program to carry out each new task
● Data isolation
○ Multiple files and formats
● Integrity problems
○ Integrity constraints (e.g., account balance > 0) become “buried” in program code rather
than being stated explicitly
○ Hard to add new constraints or change existing ones

Purpose of Database Systems
● Atomicity of updates
○ Failures may leave database in an inconsistent state with partial updates carried
out
○ Example: Transfer of funds from one account to another should either be complete
or not happen at all
● Concurrent access by multiple users
○ Concurrent access needed for performance
○ Uncontrolled concurrent accesses can lead to inconsistencies
○ Ex: Two people reading a balance (say 100) and updating it by withdrawing money
(say 50 each) at the same time
● Security problems
○ Hard to provide user access to some, but not all, data

Database systems offer solutions to all the above problems


Advantages of DBMS
● Reducing Data Redundancy
● Sharing of Data
● Data Integrity
● Data Security
● Privacy
● Backup and Recovery
● Data Consistency

File System vs DBMS
● Structure: A file system is software that manages and organizes the files
in a storage medium within a computer; a DBMS is software for managing the
database.
● Data redundancy: Redundant data can be present in a file system; a DBMS
keeps redundancy to a minimum.
● Backup and recovery: A file system does not provide backup and recovery of
data if it is lost; a DBMS provides backup and recovery of data.
● Query processing: There is no efficient query processing in a file system;
a DBMS provides efficient query processing.
File System vs DBMS ...
● Consistency: There is less data consistency in a file system; a DBMS gives
more data consistency because of the process of normalization.
● Complexity: A file system is less complex than a DBMS; a DBMS is more
complex than a file system.
● Security constraints: File systems provide less security; a DBMS has more
security mechanisms than a file system.
● Cost: A file system is relatively less expensive; a DBMS is relatively
more expensive than a file system.
View of Data
● A database system is a collection of interrelated data and a set of
programs that allow users to access and modify these data.
● A major purpose of a database system is to provide users with an abstract
view of the data. That is, the system hides certain details of how the data
are stored and maintained.
○ Data models
■ A collection of conceptual tools for describing data, data relationships,
data semantics, and consistency constraints.
○ Data abstraction
■ Hide the complexity of data structures to represent data in the database
from users through several levels of data abstraction.

Data Abstraction
● For the system to be usable, it must retrieve data efficiently. The
need for efficiency has led designers to use complex data
structures to represent data in the database.
● Since many database-system users are not computer trained,
developers hide the complexity from users through several levels
of abstraction, to simplify users’ interactions with the system:

Data Abstraction/View of Data

Physical Level (Internal View/Schema)
● Physical level (or Internal View / Schema): The lowest level of abstraction
describes how the data are actually stored.
● The physical level describes complex low-level data structures in detail.

Logical Level (Conceptual View/Schema)
● Logical level (or Conceptual View / Schema): The next-higher level of
abstraction describes what data are stored in the database, and what
relationships exist among those data. The logical level thus describes the entire
database in terms of a small number of relatively simple structures.
● Although implementation of the simple structures at the logical level may
involve complex physical-level structures, the user of the logical level does not
need to be aware of this complexity. This is referred to as physical data
independence.
● Database administrators, who must decide what information to keep in the
database, use the logical level of abstraction.

Physical Data independence
● Physical Data Independence – the ability to modify the physical schema
without changing the logical schema
○ Applications depend on the logical schema
○ In general, the interfaces between the various levels and components
should be well defined so that changes in some parts do not seriously
influence others.
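
As a minimal sketch of physical data independence (the index name is
assumed): adding an index changes the physical schema only, while the
logical schema and the queries written against it stay unchanged:

    -- physical-level change: no application needs to be rewritten
    create index instructor_name_idx on instructor (name);

    -- the same logical-level query now simply runs faster
    select * from instructor where name = 'Einstein';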

View level (or External View / Schema):
● View level (or External View / Schema): The highest level of abstraction
describes only part of the entire database.
● Even though the logical level uses simpler structures, complexity remains
because of the variety of information stored in a large database. Many
users of the database system do not need all this information; instead,
they need to access only a part of the database.
● The view level of abstraction exists to simplify their interaction with the
system. The system may provide many views for the same database.
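
A hedged sketch of a view in SQL (the view and role names are assumed): it
exposes only part of the database and doubles as a security mechanism, since
salary is hidden from anyone allowed to read only the view:

    create view instructor_public as
        select ID, name, dept_name
        from instructor;

    grant select on instructor_public to registrar_clerk;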

Data Types and Levels of Abstraction
Programming vs DBMS: an Analogy
● Many high-level programming languages support the notion of a structure
type. For example, we may describe a record as follows:
type instructor = record
    ID : char(5);
    name : char(20);
    dept_name : char(20);
    salary : numeric(8,2);
end;
● This code defines a new record type called instructor with four fields. Each
field has a name and a type associated with it. A university organization
may have several other such record types.

Physical level of abstraction: Programming vs DBMS
● At the physical level, an instructor (department, or student)
record can be described as a block of consecutive storage
locations. The compiler hides this level of detail from
programmers.
● Similarly, the database system hides many of the lowest-level
storage details from database programmers. Database
administrators, on the other hand, may be aware of certain details
of the physical organization of the data.

Logical Level of Abstraction: Programming Vs DBMS
● At the logical level, each such record is described by a type definition, as
in the previous code segment, and the interrelationship of these record
types is defined as well.
● Programmers using a programming language work at this level of
abstraction. Similarly, database administrators usually work at this level of
abstraction.

View Level of Abstraction: Programming vs DBMS
● At the view level, computer users see a set of application programs that
hide details of the data types.

● At the view level, several views of the database are defined, and a
database user sees some or all of these views. In addition to hiding details
of the logical level of the database, the views also provide a security
mechanism to prevent users from accessing certain parts of the database.

● For example, clerks in the university registrar's office can see only that part
of the database that has information about students; they cannot access
information about salaries of instructors.

Instance and Schemas
● Databases change over time as information is inserted and deleted. The
collection of information stored in the database at a particular moment is
called an instance of the database. The overall design of the database is
called the database schema. Schemas are changed infrequently, if at all.
● Schema – the logical structure of the database
○ Physical schema: database design at the physical level
○ Logical schema: database design at the logical level
● Instance – the actual content of the database at a particular point in time
● Physical Data Independence – the ability to modify the physical schema
without changing the logical schema
● Applications depend on the logical schema
● In general, the interfaces between the various levels and components
should be well defined so that changes in some parts do not seriously
influence others.
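
In SQL terms (an illustrative sketch): the data definition fixes the schema,
while the rows present at any moment form the instance:

    -- schema: changed infrequently, if at all
    create table department (
        dept_name varchar(20),
        building  varchar(15),
        budget    numeric(12,2)
    );

    -- instance: the content right now, changing with every insert/update/delete
    insert into department values ('Comp. Sci.', 'Taylor', 100000);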
Instances and Schemas
● Similar to types and variables in programming languages
● Logical Schema – the overall logical structure of the database
Example: The database consists of information about a set of customers
and accounts in a bank and the relationship between them
Analogous to type information of a variable in a program
● Physical schema– the overall physical structure of the database
● Instance – the actual content of the database at a particular point in time
Analogous to the value of a variable

Data Model
● The underlying structure of a database is the data model. It is a collection of
conceptual tools for describing
○ Data
○ data relationships
○ data semantics
○ consistency constraints.
● A data model provides a way to describe the design of a database at the
physical, logical, and view levels.
● The data models can be classified into different categories:
○ Relational model
○ Entity Relationship Model
○ Object Based Model
○ Semi Structured Data Model
Other older models: Network model, Hierarchical model, etc.

Relational Model
● The relational model uses a collection of tables to represent both data
and the relationships among those data. Each table has multiple columns,
and each column has a unique name. Tables are also known as relations.
The relational model is an example of a record-based model.
● Record-based models are so named because the database is structured in
fixed-format records of several types. Each table contains records of a
particular type. Each record type defines a fixed number of fields, or
attributes. The columns of the table correspond to the attributes of the
record type.
● The relational data model is the most widely used data model, and a vast
majority of current database systems are based on the relational model.

A Sample Relational Database

Database languages
● A database system provides a data-definition language to specify the
database schema and a data-manipulation language to express database
queries and updates.
● In practice, the data-definition and data-manipulation languages are not
two separate languages; instead they simply form parts of a single
database language, such as the widely used SQL language.

Data Definition Language (DDL)
● Specification notation for defining the database schema
Example: create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))
● DDL compiler generates a set of table templates stored in a data dictionary
● Data dictionary contains metadata (i.e., data about data)
Database schema
Integrity constraints
○ Primary key (ID uniquely identifies instructors)
Authorization
○ Who can access what
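
A sketch extending the slide's example with the integrity and authorization
notions listed above (the role name data_analyst is assumed):

    create table instructor (
        ID        char(5),
        name      varchar(20),
        dept_name varchar(20),
        salary    numeric(8,2),
        primary key (ID)  -- integrity constraint: ID uniquely identifies instructors
    );

    -- authorization: who can access what
    grant select on instructor to data_analyst;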
Data Manipulation Language (DML)
● Language for accessing and manipulating the data organized by the appropriate data model
○ DML is also known as query language
● Two classes of languages
Pure – used for proving properties about computational power and for optimization
○ Relational Algebra
○ Tuple relational calculus
○ Domain relational calculus
Commercial – used in commercial systems
○ SQL is the most widely used commercial language
● Data Manipulation Language enables users to access or manipulate data as organized by the
appropriate data model.
● The types of access are:
○ Retrieval of information stored in the database
○ Insertion of new information into the database
○ Deletion of information from the database
○ Modification of information stored in the database
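
One illustrative SQL statement for each access type above (the values are
assumed, following the course's instructor table):

    select name from instructor;                                  -- retrieval
    insert into instructor
        values ('10101', 'Srinivasan', 'Comp. Sci.', 65000);      -- insertion
    delete from instructor where ID = '10101';                    -- deletion
    update instructor set salary = salary * 1.05;                 -- modification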

Data Manipulation Language …
● There are basically two types of data-manipulation language
Procedural DML -- require a user to specify what data are needed and
how to get those data.
Declarative DML -- require a user to specify what data are needed
without specifying how to get those data.
● Declarative DMLs are usually easier to learn and use than are procedural
DMLs.
● Declarative DMLs are also referred to as non-procedural DMLs
● The portion of a DML that involves information retrieval is called a query
language.

SQL Query Language
● The most widely used commercial language
Example to find all instructors in Comp. Sci. dept
○ select name
○ from instructor
○ where dept_name = 'Comp. Sci.'
● SQL is NOT a Turing machine equivalent language
● To be able to compute complex functions SQL is usually embedded in
some higher-level language
● Application programs generally access databases through one of
○ Language extensions to allow embedded SQL
○ Application program interface (e.g., ODBC/JDBC) which allow SQL queries to be sent to a
database
Database Access from Application Program
● Non-procedural query languages such as SQL are not as powerful as a
universal Turing machine.
● SQL does not support actions such as input from users, output to
displays, or communication over the network.
● Such computations and actions must be written in a host language, such
as C/C++, Java or Python, with embedded SQL queries that access the
data in the database.
● Application programs -- are programs that are used to interact with the
database in this fashion.

Data Dictionary
● We can define a data dictionary as a DBMS component that stores the
definition of data characteristics and relationships. You may recall that such
“data about data” were labeled metadata.
● The DBMS data dictionary provides the DBMS with its self-describing
characteristic. In effect, the data dictionary resembles an X-ray of the
company's entire data set, and is a crucial element in the data administration
function.
● Two main types of data dictionary exist: integrated and stand-alone. An
integrated data dictionary is included with the DBMS. For example, all relational
DBMSs include a built-in data dictionary or system catalog that is frequently
accessed and updated by the RDBMS. Other DBMSs, especially older types, do
not have a built-in data dictionary; instead, the DBA may use third-party
stand-alone data dictionary systems.

Database Design
● The process of designing the general structure of the database:
● Logical Design – Deciding on the database schema. Database design
requires that we find a “good” collection of relation schemas.
○ Business decision – What attributes should we record in the database?
○ Computer Science decision – What relation schemas should we have and how should the
attributes be distributed among the various relation schemas?
● Physical Design – Deciding on the physical layout of the database

Database Engine
● A database system is partitioned into modules that deal with each of the
responsibilities of the overall system.
● The functional components of a database system can be divided into
○ The storage manager,
○ The query processor component,
○ The transaction management component.

Database System Internals

Storage Manager
● A storage manager is a program module that provides the interface
between the low level data stored in the database and the application
programs and queries submitted to the system.
● The storage manager is responsible for the interaction with the file
manager. The raw data are stored on the disk using the file system, which
is usually provided by a conventional operating system. The storage
manager translates the various DML statements into low-level file-system
commands. Thus, the storage manager is responsible for storing,
retrieving, and updating data in the database.

Storage Manager
The storage manager components include:
Authorization and integrity manager, which tests for the satisfaction of integrity
constraints and checks the authority of users to access data.
Transaction manager, which ensures that the database remains in a consistent
(correct) state despite system failures, and that concurrent transaction executions
proceed without conflicting.
File manager, which manages the allocation of space on disk storage and the data
structures used to represent information stored on disk.
Buffer manager, which is responsible for fetching data from disk storage into main
memory, and deciding what data to cache in main memory. The buffer manager is a
critical part of the database system, since it enables the database to handle data
sizes that are much larger than the size of main memory.

Query Processor
● The query processor components include
○ DDL interpreter, which interprets DDL statements and records the definitions in the data
dictionary.
○ DML compiler, which translates DML statements in a query language into an evaluation
plan consisting of low-level instructions that the query evaluation engine understands.
● A query can usually be translated into any of a number of alternative
evaluation plans that all give the same result. The DML compiler also
performs query optimization, that is, it picks the lowest cost evaluation
plan from among the alternatives.
○ Query evaluation engine, which executes low-level instructions generated by the DML
compiler.

Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

Transaction Management
● What if the system fails?
● What if more than one user is concurrently updating the same data?
● A transaction is a collection of operations that performs a single logical
function in a database application
● Transaction-management component ensures that the database
remains in a consistent (correct) state despite system failures (e.g., power
failures and operating system crashes) and transaction failures.
● Concurrency-control manager controls the interaction among the
concurrent transactions, to ensure the consistency of the database.
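
A sketch of the funds-transfer example as a SQL transaction, assuming a
hypothetical account(account_id, balance) table; the two updates either both
take effect or neither does:

    begin;   -- start transaction (exact keyword varies by system)
    update account set balance = balance - 50 where account_id = 'A-101';
    update account set balance = balance + 50 where account_id = 'A-215';
    commit;  -- atomically makes both updates permanent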

Transaction manager
● A transaction is a collection of operations that performs a single logical
function in a database application.
● Each transaction is a unit of both atomicity and consistency. Thus, we
require that transactions do not violate any database-consistency
constraints. That is, if the database was consistent when a transaction
started, the database must be consistent when the transaction
successfully terminates.
● Transaction - manager ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating
system crashes) and transaction failures.

Database Architecture
The architecture of a database system is greatly influenced by the underlying computer
system on which the database is running:

● Centralized
○ A centralized database is a database that is located, stored, and maintained in a single location.
This location is most often a central computer or database system, for example a desktop or
server CPU, or a mainframe computer.
● Client-server
○ A client-server database is one where the database resides on a server, and client applications are written to access the
database.
● Parallel (multi-processor)
○ A parallel database system seeks to improve performance through parallelization of
various operations, such as loading data, building indexes, and evaluating queries.
Parallel databases improve processing and input/output speeds by using multiple
CPUs and disks in parallel
● Distributed

Database Architecture
● Database applications are usually partitioned into two or three parts. In a two-tier
architecture, the application resides at the client machine, where it invokes database
system functionality at the server machine through query language statements.
● Application program interface standards like ODBC and JDBC are used for interaction
between the client and the server.
● In contrast, in a three-tier architecture, the client machine acts as merely a front end and
does not contain any direct database calls. Instead, the client end communicates with an
application server, usually through a forms interface.
● The application server in turn communicates with a database system to access data. The
business logic of the application, which says what actions to carry out under what
conditions, is embedded in the application server, instead of being distributed across
multiple clients.
● Three-tier applications are more appropriate for large applications, and for applications
that run on the World Wide Web.

Database Applications
● Database applications are usually partitioned into two or three parts.
○ Two-tier architecture -- the application resides at the client machine, where
it invokes database system functionality at the server machine
○ Three-tier architecture -- the client machine acts as a front end and does not
contain any direct database calls.
■ The client end communicates with an application server, usually through
a forms interface.
■ The application server in turn communicates with a database system to access
data.

Two-tier and Three-tier Architecture

Design Approaches
● Need to come up with a methodology to ensure that each of the relations
in the database is “good”

Two ways of doing so:

● Entity Relationship Model


○ Models an enterprise as a collection of entities and relationships
○ Represented diagrammatically by an entity-relationship diagram:
● Normalization Theory
○ Formalize what designs are bad, and test for them

Entity Relationship Model
● The entity-relationship (E-R) data model uses a collection of basic objects,
called entities, and relationships among these objects.
● An entity is a “thing” or “object” in the real world that is distinguishable
from other objects. The entity relationship model is widely used in
database design.

Semi Structured Data Model
● The semi-structured data model permits the specification of data where
individual data items of the same type may have different sets of
attributes.
● This is in contrast to the data models mentioned earlier, where every data
item of a particular type must have the same set of attributes.
● The Extensible Markup Language (XML) is widely used to represent
semi-structured data.

Object Based Model
● Object-oriented programming (especially in Java, C++, or C#) has become
the dominant software-development methodology. This led to the
development of an object-oriented data model that can be seen as
extending the E-R model with notions of encapsulation, methods
(functions), and object identity.
● The object-relational data model combines features of the object-oriented
data model and relational data model.

Database Users and Administrators

Database Users
● Naive users -- unsophisticated users who interact with the system by invoking
one of the application programs that have been written previously.
● Application programmers -- are computer professionals who write application
programs.
● Sophisticated users -- interact with the system without writing programs
○ using a database query language or
○ by using tools such as data analysis software.
● Specialized users --write specialized database applications that do not fit into
the traditional data-processing framework. For example, CAD, graphic data,
audio, video.

Database Administrator
A person who has central control over the system is called a database
administrator (DBA), whose functions are:
● Schema definition
● Storage structure and access-method definition
● Schema and physical-organization modification
● Granting of authorization for data access
● Routine maintenance
● Periodically backing up the database
● Ensuring that enough free disk space is available for normal operations,
and upgrading disk space as required
● Monitoring jobs running on the database and ensuring that performance is
not degraded by very expensive tasks submitted by some users

References
1. Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System
Concepts, McGraw-Hill Higher Education
2. Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems,
2nd Edition

Database Design:
Entity
Relationship Model

Kathmandu University
BE Computer Engineering
Year II/II
Outline
● Design Process
● Entity Relationship Model
● Constraints
● Keys
● Entity-Relationship Diagram
● Design Issues
● Weak Entity Sets
● Extended E-R features
● Design of an E-R Database Schema
● Reduction of an E-R Schema to tables
Design Phases
▪ Initial phase -- characterize fully the data needs of the
prospective database users.
▪ Second phase -- choosing a data model
• Applying the concepts of the chosen data model
• Translating these requirements into a conceptual schema
of the database.
• A fully developed conceptual schema indicates the
functional requirements of the enterprise.
▪ Describe the kinds of operations (or transactions) that
will be performed on the data.
Design Phases (Cont.)
▪ Final Phase -- Moving from an abstract data model to the
implementation of the database
• Logical Design – Deciding on the database schema.
Database design requires that we find a “good”
collection of relation schemas.
▪ Business decision – What attributes should we
record in the database?
▪ Computer Science decision – What relation
schemas should we have and how should the
attributes be distributed among the various
relation schemas?
• Physical Design – Deciding on the physical layout of
the database
Design Alternatives
▪ In designing a database schema, we must ensure that
we avoid two major pitfalls:
• Redundancy: a bad design may result in repeated information.
▪ Redundant representation of information may
lead to data inconsistency among the various
copies of information
• Incompleteness: a bad design may make certain
aspects of the enterprise difficult or impossible to
model.
▪ Avoiding bad designs is not enough. There may be a
large number of good designs from which we must
choose.
Design Approaches
▪ Entity Relationship Model (covered in this unit)
• Models an enterprise as a collection of entities and
relationships
▪ Entity: a “thing” or “object” in the enterprise that is
distinguishable from other objects
• Described by a set of attributes
▪ Relationship: an association among several entities
• Represented diagrammatically by an entity-relationship
diagram:
▪ Normalization Theory (another chapter)
• Formalize what designs are bad, and test for them
Outline of the ER Model
ER model -- Database Modeling
▪ The ER data model was developed to facilitate database
design by allowing specification of an enterprise schema
that represents the overall logical structure of a
database.
▪ The ER data model employs three basic concepts:
• entity sets,
• relationship sets,
• attributes.
▪ The ER model also has an associated diagrammatic
representation, the ER diagram, which can express the
overall logical structure of a database graphically.
Entity Sets
▪ An entity is an object that exists and is distinguishable
from other objects.
• Example: specific person, company, event, plant
▪ An entity set is a set of entities of the same type that
share the same properties.
• Example: set of all persons, companies, trees,
holidays
▪ An entity is represented by a set of attributes; i.e.,
descriptive properties possessed by all members of an
entity set.
• Example:
instructor = (ID, name, salary )
course= (course_id, title, credits)
▪ A subset of the attributes form a primary key of the
entity set; i.e., uniquely identifying each member of the
set.
Entity Sets -- instructor and student
Representing Entity sets in ER Diagram

▪ Entity sets can be represented graphically as follows:


• Rectangles represent entity sets.
• Attributes listed inside entity rectangle
• Underline indicates primary key attributes
Relationship Sets
▪ A relationship is an association among several entities
Example:
44553 (Peltier) advisor 22222 (Einstein)
student entity relationship set instructor entity
▪ A relationship set is a mathematical relation among n ≥ 2
entities, each taken from entity sets
{(e1, e2, … en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}

where (e1, e2, …, en) is a relationship


• Example:
(44553,22222) ∈ advisor
Relationship Sets (Cont.)
▪ Example: we define the relationship set advisor to denote
the associations between students and the instructors who
act as their advisors.
▪ Pictorially, we draw a line between related entities.
Representing Relationship Sets via ER Diagrams

▪ Diamonds represent relationship sets.


Relationship Sets (Cont.)
▪ An attribute can also be associated with a relationship set.
▪ For instance, the advisor relationship set between entity
sets instructor and student may have the attribute date
which tracks when the student started being associated
with the advisor
Relationship Sets with Attributes
Roles
▪ Entity sets of a relationship need not be distinct
• Each occurrence of an entity set plays a “role” in the
relationship
▪ The labels “course_id” and “prereq_id” are called roles.
Degree of a Relationship Set
▪ Binary relationship
• involve two entity sets (or degree two).
• most relationship sets in a database system are
binary.
▪ Relationships between more than two entity sets are
rare. Most relationships are binary.
• Example: students work on research projects under
the guidance of an instructor.
• relationship proj_guide is a ternary relationship
between instructor, student, and project
Non-binary Relationship Sets
▪ Most relationship sets are binary
▪ There are occasions when it is more convenient to
represent relationships as non-binary.
▪ E-R Diagram with a Ternary Relationship
Complex Attributes
▪ Attribute types:
• Simple and composite attributes.
• Single-valued and multivalued attributes
▪ Example: multivalued attribute: phone_numbers
• Derived attributes
▪ Can be computed from other attributes
▪ Example: age, given date_of_birth
▪ Domain – the set of permitted values for each attribute
Composite Attributes
▪ Composite attributes allow us to divide attributes into
subparts (other attributes).
Representing Complex Attributes in ER Diagram
Mapping Cardinality Constraints
▪ Express the number of entities to which another entity
can be associated via a relationship set.
▪ Most useful in describing binary relationship sets.
▪ For a binary relationship set the mapping cardinality must
be one of the following types:
• One to one
• One to many
• Many to one
• Many to many
Mapping Cardinalities

(Figure: one-to-one and one-to-many mappings)

Note: Some elements in A and B may not be mapped to any elements in the other set
Mapping Cardinalities

(Figure: many-to-one and many-to-many mappings)

Note: Some elements in A and B may not be mapped to any elements in the other set
Representing Cardinality Constraints in ER Diagram
▪ We express cardinality constraints by drawing either a
directed line (→), signifying “one,” or an undirected line (—),
signifying “many,” between the relationship set and the
entity set.

▪ One-to-one relationship between an instructor and a student :


• A student is associated with at most one instructor via the
relationship advisor
• A student is associated with at most one department via
stud_dept
One-to-Many Relationship
▪ one-to-many relationship between an instructor and a student
• an instructor is associated with several (including 0)
students via advisor
• a student is associated with at most one instructor via
advisor,
Many-to-One Relationships
▪ In a many-to-one relationship between an instructor and a
student,
• an instructor is associated with at most one student via
advisor,
• and a student is associated with several (including 0)
instructors via advisor
Many-to-Many Relationship
▪ An instructor is associated with several (possibly 0)
students via advisor
▪ A student is associated with several (possibly 0)
instructors via advisor
Total and Partial Participation
▪ Total participation (indicated by double line): every entity in
the entity set participates in at least one relationship in the
relationship set

participation of student in advisor relation is total


▪ every student must have an associated instructor
▪ Partial participation: some entities may not participate in
any relationship in the relationship set
• Example: participation of instructor in advisor is partial
Notation for Expressing More Complex Constraints
▪ A line may have an associated minimum and maximum
cardinality, shown in the form l..h, where l is the minimum
and h the maximum cardinality
• A minimum value of 1 indicates total participation.
• A maximum value of 1 indicates that the entity
participates in at most one relationship
• A maximum value of * indicates no limit.

Instructor can advise 0 or more students. A student must have 1 advisor;
a student cannot have multiple advisors
Cardinality Constraints on Ternary Relationship
▪ We allow at most one arrow out of a ternary (or greater
degree) relationship to indicate a cardinality constraint
▪ For example, an arrow from proj_guide to instructor
indicates each student has at most one guide for a project
▪ If there is more than one arrow, there are two ways of
defining the meaning.
• For example, a ternary relationship R between A, B and C with arrows to B
and C could mean
1. Each A entity is associated with a unique entity from B and C or
2. Each pair of entities from (A, B) is associated with a unique C entity,
and each pair (A, C) is associated with a unique B

• Each alternative has been used in different formalisms

• To avoid confusion we outlaw more than one arrow


Primary Key
▪ Primary keys provide a way to specify how entities and
relations are distinguished. We will consider:
• Entity sets
• Relationship sets.
• Weak entity sets
Primary key for Entity Sets
▪ By definition, individual entities are distinct.
▪ From database perspective, the differences among
them must be expressed in terms of their attributes.
▪ The attribute values of an entity must be
such that they can uniquely identify the entity.
• No two entities in an entity set are allowed to have
exactly the same value for all attributes.
▪ A key for an entity is a set of attributes that suffice to
distinguish entities from each other
Primary Key for Relationship Sets
▪ To distinguish among the various relationships of a
relationship set we use the individual primary keys of the
entities in the relationship set.
• Let R be a relationship set involving entity sets E1, E2, ..
En
• The primary key for R consists of the union of the
primary keys of entity sets E1, E2, ..En
• If the relationship set R has attributes a1, a2, .., am
associated with it, then the primary key of R also
includes the attributes a1, a2, .., am
▪ Example: relationship set “advisor”.
• The primary key consists of instructor.ID and
student.ID
▪ The choice of the primary key for a relationship set
depends on the mapping cardinality of the relationship
set.
Choice of Primary key for Binary Relationship
▪ Many-to-Many relationships. The preceding union of
the primary keys is a minimal superkey and is chosen
as the primary key.
▪ One-to-Many relationships . The primary key of the
“Many” side is a minimal superkey and is used as the
primary key.
▪ Many-to-one relationships. The primary key of the
“Many” side is a minimal superkey and is used as the
primary key.
▪ One-to-one relationships. The primary key of either
one of the participating entity sets forms a minimal
superkey, and either one can be chosen as the primary
key.
Choice of Primary key for Nonbinary Relationship
▪ If no cardinality constraints are present, the superkey
is formed as described earlier, and it is chosen as the
primary key.
▪ If cardinality constraints are present:
• Recall that we permit at most one arrow out of a
relationship set.
Weak Entity Sets
▪ Consider a section entity, which is uniquely identified by a
course_id, semester, year, and sec_id.
▪ Clearly, section entities are related to course entities.
Suppose we create a relationship set sec_course between
entity sets section and course.
▪ Note that the information in sec_course is redundant, since
section already has an attribute course_id, which identifies
the course with which the section is related.
▪ One option to deal with this redundancy is to get rid of the
relationship sec_course; however, by doing so the
relationship between section and course becomes implicit in
an attribute, which is not desirable.
Weak Entity Sets (Cont.)
▪ An alternative way to deal with this redundancy is to not
store the attribute course_id in the section entity and to only
store the remaining attributes sec_id, year, and semester.
• However, the entity set section then does not have
enough attributes to identify a particular section entity
uniquely
▪ To deal with this problem, we treat the relationship
sec_course as a special relationship that provides extra
information, in this case, the course_id, required to identify
section entities uniquely.
▪ A weak entity set is one whose existence is dependent on
another entity, called its identifying entity
▪ Instead of associating a primary key with a weak entity, we
use the identifying entity, along with extra attributes called
discriminator to uniquely identify a weak entity.
Weak Entity Sets (Cont.)
▪ An entity set that is not a weak entity set is termed a strong
entity set.
▪ Every weak entity must be associated with an identifying
entity; that is, the weak entity set is said to be existence
dependent on the identifying entity set.
▪ The identifying entity set is said to own the weak entity set
that it identifies.
▪ The relationship associating the weak entity set with the
identifying entity set is called the identifying relationship.
▪ Note that the relational schema we eventually create from the
entity set section does have the attribute course_id, for reasons
that will become clear later, even though we have dropped the
attribute course_id from the entity set section.
Expressing Weak Entity Sets
▪ In E-R diagrams, a weak entity set is depicted via a double
rectangle.
▪ We underline the discriminator of a weak entity set with a
dashed line.
▪ The relationship set connecting the weak entity set to the
identifying strong entity set is depicted by a double
diamond.
▪ Primary key for section – (course_id, sec_id, semester, year)
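
A hedged SQL sketch of the resulting schema: the identifying strong entity's
key (course_id) joins the discriminator attributes to form the weak entity's
primary key, and references the course table from the course examples:

    create table section (
        course_id varchar(8),
        sec_id    varchar(8),
        semester  varchar(6),
        year      numeric(4,0),
        primary key (course_id, sec_id, semester, year),
        foreign key (course_id) references course
    );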
Redundant Attributes
▪ Suppose we have entity sets:
• instructor, with attributes: ID, name, dept_name, salary

• department, with attributes: dept_name, building, budget

▪ We model the fact that each instructor has an associated department
using a relationship set inst_dept
▪ The attribute dept_name in instructor replicates
information present in the relationship and is
therefore redundant
• and needs to be removed.

▪ BUT: when converting back to tables, in some cases


the attribute gets reintroduced, as we will see later.
(Diagram: entity set instructor (ID, name, dept_name, salary) connected via
relationship set inst_dept to entity set department (dept_name, building, budget))
E-R Diagram for a University Enterprise
Reduction to Relational Schemas
Reduction to Relational Schemas
▪ Entity sets and relationship sets can be expressed uniformly
as relation schemas that represent the contents of the
database.
▪ A database which conforms to an E-R diagram can be
represented by a collection of schemas.
▪ For each entity set and relationship set there is a unique
schema that is assigned the name of the corresponding
entity set or relationship set.
▪ Each schema has a number of columns (generally
corresponding to attributes), which have unique names.
Representing Entity Sets
▪ A strong entity set reduces to a schema with the same
attributes

student(ID, name, tot_cred)

▪ A weak entity set becomes a table that includes a column for the
primary key of the identifying strong entity set

section ( course_id, sec_id, sem, year )


Representation of Entity Sets with Composite Attributes

▪ Composite attributes are flattened out by creating a separate
attribute for each component attribute
• Example: given entity set instructor with composite attribute name,
with component attributes first_name and last_name, the schema
corresponding to the entity set has two attributes
name_first_name and name_last_name
▪ Prefix omitted if there is no ambiguity (name_first_name
could be first_name)

▪ Ignoring multivalued attributes, the extended instructor schema is
• instructor(ID,
first_name, middle_initial, last_name,
street_number, street_name,
apt_number, city, state, zip_code,
date_of_birth)
Representation of Entity Sets with Multivalued Attributes

▪ A multivalued attribute is placed in a separate schema keyed by the
entity's primary key, with one tuple per value. The slide's example:

phone(id, phone_number):
45 98462888888
45 985623456888
45 9841288888

person(id, fname, mname, lname):
45 dikshya poudel
46 pradhan prajina
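
The same split written as SQL DDL (a sketch; names are taken from the
reconstructed example above):

    create table person (
        id    numeric(5,0),
        fname varchar(20),
        mname varchar(20),
        lname varchar(20),
        primary key (id)
    );

    -- one row per phone number; (id, phone_number) identifies each value
    create table person_phone (
        id           numeric(5,0),
        phone_number varchar(15),
        primary key (id, phone_number),
        foreign key (id) references person
    );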
Representing Relationship Sets
▪ A many-to-many relationship set is represented as a
schema with attributes for the primary keys of the two
participating entity sets, and any descriptive attributes of
the relationship set.
▪ Example: schema for relationship set advisor
advisor = (s_id, i_id)
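
As a sketch, the many-to-many advisor schema in SQL DDL; the primary key is
the union of the two participating entity sets' keys (the student and
instructor tables are assumed from the course examples):

    create table advisor (
        s_id varchar(5),
        i_id varchar(5),
        primary key (s_id, i_id),
        foreign key (s_id) references student,
        foreign key (i_id) references instructor
    );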
Redundancy of Schemas
▪ Many-to-one and one-to-many relationship sets that are total
on the many-side can be represented by adding an extra
attribute to the “many” side, containing the primary key of the
“one” side
▪ Example: Instead of creating a schema for relationship set
inst_dept, add an attribute dept_name to the schema arising
from entity set instructor
Redundancy of Schemas (Cont.)
▪ For one-to-one relationship sets, either side can be
chosen to act as the “many” side
• That is, an extra attribute can be added to either
of the tables corresponding to the two entity sets
▪ If participation is partial on the “many” side, replacing
a schema by an extra attribute in the schema
corresponding to the “many” side could result in null
values
Redundancy of Schemas (Cont.)
▪ The schema corresponding to a relationship set linking a
weak entity set to its identifying strong entity set is
redundant.

▪ Example: The section schema already contains the attributes that
would appear in the sec_course schema
Extended E-R Features
Specialization
▪ Top-down design process; we designate sub-groupings
within an entity set that are distinctive from other
entities in the set.
▪ These sub-groupings become lower-level entity sets
that have attributes or participate in relationships that
do not apply to the higher-level entity set.
▪ Depicted by a triangle component labeled ISA (e.g.,
instructor “is a” person).
▪ Attribute inheritance – a lower-level entity set
inherits all the attributes and relationship participation
of the higher-level entity set to which it is linked.
Specialization Example
▪ Overlapping – employee and student
▪ Disjoint – instructor and secretary
▪ Total and partial
Representing Specialization via Schemas
▪ Method 1:
• Form a schema for the higher-level entity
• Form a schema for each lower-level entity set, include
primary key of higher-level entity set and local
attributes
person(ID, name, street, city)
student(ID, tot_cred)
employee(ID, salary)

• Drawback: getting information about an employee requires accessing
two relations, the one corresponding to the low-level schema and
the one corresponding to the high-level schema
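
A sketch of Method 1 in SQL DDL: each lower-level table carries the
higher-level primary key plus its local attributes:

    create table person (
        ID     char(5) primary key,
        name   varchar(20),
        street varchar(20),
        city   varchar(20)
    );

    create table student (
        ID       char(5) primary key,
        tot_cred numeric(3,0),
        foreign key (ID) references person
    );

    create table employee (
        ID     char(5) primary key,
        salary numeric(8,2),
        foreign key (ID) references person
    );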
Representing Specialization as Schemas (Cont.)

▪ Method 2:
• Form a schema for each entity set with all local and
inherited attributes
person(ID, name, street, city)
student(ID, name, street, city, tot_cred)
employee(ID, name, street, city, salary)

• Drawback: name, street and city may be stored redundantly for
people who are both students and employees
Generalization
▪ A bottom-up design process – combine a number of
entity sets that share the same features into a
higher-level entity set.
▪ Specialization and generalization are simple inversions
of each other; they are represented in an E-R diagram in
the same way.
▪ The terms specialization and generalization are used
interchangeably.
Completeness constraint
▪ Completeness constraint -- specifies whether or not
an entity in the higher-level entity set must belong to
at least one of the lower-level entity sets within a
generalization.
• total: an entity must belong to one of the
lower-level entity sets
• partial: an entity need not belong to one of the
lower-level entity sets
Completeness constraint (Cont.)
▪ Partial generalization is the default. We can specify total
generalization in an ER diagram by adding the keyword
total in the diagram and drawing a dashed line from the
keyword to the corresponding hollow arrow-head to
which it applies (for a total generalization), or to the set of
hollow arrow-heads to which it applies (for an overlapping
generalization).
▪ The student generalization is total: All student entities
must be either graduate or undergraduate. Because the
higher-level entity set arrived at through generalization is
generally composed of only those entities in the
lower-level entity sets, the completeness constraint for a
generalized higher-level entity set is usually total
Aggregation
▪ Consider the ternary relationship proj_guide, which we saw
earlier
▪ Suppose we want to record evaluations of a student by a
guide on a project
Aggregation (Cont.)
▪ Relationship sets eval_for and proj_guide represent
overlapping information
• Every eval_for relationship corresponds to a proj_guide
relationship
• However, some proj_guide relationships may not
correspond to any eval_for relationships
▪ So we can’t discard the proj_guide relationship
▪ Eliminate this redundancy via aggregation
• Treat relationship as an abstract entity
• Allows relationships between relationships
• Abstraction of relationship into new entity
Aggregation (Cont.)
▪ Eliminate this redundancy via aggregation without
introducing redundancy, the following diagram represents:
• A student is guided by a particular instructor on a
particular project
• A student, instructor, project combination may have an
associated evaluation
Reduction to Relational Schemas
▪ To represent aggregation, create a schema containing
• Primary key of the aggregated relationship,
• The primary key of the associated entity set
• Any descriptive attributes
▪ In our example:
• The schema eval_for is:
eval_for (s_ID, project_id, i_ID, evaluation_id)
• The schema proj_guide is redundant.
Design Issues
Common Mistakes in E-R Diagrams
▪ Example of erroneous E-R diagrams
Common Mistakes in E-R Diagrams (Cont.)

▪ Correct versions of the E-R diagram of the previous slide


Entities vs. Attributes
▪ Use of entity sets vs. attributes

▪ Use of phone as an entity allows extra information about phone
numbers (plus multiple phone numbers)
Entities vs. Relationship sets
▪ Use of entity sets vs. relationship sets
A possible guideline is to designate a relationship set to
describe an action that occurs between entities

▪ Placement of relationship attributes: for example, attribute date
as an attribute of advisor or as an attribute of student
Binary Vs. Non-Binary Relationships
▪ Although it is possible to replace any non-binary (n-ary,
for n > 2) relationship set by a number of distinct binary
relationship sets, a n-ary relationship set shows more
clearly that several entities participate in a single
relationship.
▪ Some relationships that appear to be non-binary may be
better represented using binary relationships
• For example, a ternary relationship parents, relating
a child to his/her father and mother, is best replaced
by two binary relationships, father and mother
▪ Using two binary relationships allows partial
information (e.g., only mother being known)
• But there are some relationships that are naturally
non-binary
▪ Example: proj_guide
Converting Non-Binary Relationships to Binary Form

▪ In general, any non-binary relationship can be represented using
binary relationships by creating an artificial entity set.
• Replace R between entity sets A, B and C by an entity set E, and three
relationship sets:
1. RA, relating E and A
2. RB, relating E and B
3. RC, relating E and C
• Create an identifying attribute for E and add any attributes of R to E
• For each relationship (ai, bi, ci) in R, create
1. a new entity ei in the entity set E
2. add (ei, ai) to RA
3. add (ei, bi) to RB
4. add (ei, ci) to RC
Converting Non-Binary Relationships (Cont.)

▪ Also need to translate constraints


• Translating all constraints may not be possible
• There may be instances in the translated schema that cannot
correspond to any instance of R
▪ Exercise: add constraints to the relationships RA, RB
and RC to ensure that a newly created entity
corresponds to exactly one entity in each of entity
sets A, B and C
• We can avoid creating an identifying attribute by
making E a weak entity set (described shortly)
identified by the three relationship sets
E-R Design Decisions
▪ The use of an attribute or entity set to represent an
object.
▪ Whether a real-world concept is best expressed by an
entity set or a relationship set.
▪ The use of a ternary relationship versus a pair of binary
relationships.
▪ The use of a strong or weak entity set.
▪ The use of specialization/generalization – contributes to
modularity in the design.
▪ The use of aggregation – can treat the aggregate entity set
as a single unit without concern for the details of its
internal structure.
Summary of Symbols Used in E-R Notation
Symbols Used in E-R Notation (Cont.)
Alternative ER Notations
▪ Chen, IDEF1X, …
Alternative ER Notations
Chen IDEF1X (Crow's foot notation)
UML
▪ UML: Unified Modeling Language
▪ UML has many components to graphically model different
aspects of an entire software system
▪ UML Class Diagrams correspond to E-R Diagram, but
several differences.
ER vs. UML Class Diagrams
*Note reversal of position in cardinality constraint depiction
ER vs. UML Class Diagrams
ER Diagram Notation | Equivalent in UML
*Generalization can use merged or separate arrows independent of disjoint/overlapping
UML Class Diagrams (Cont.)
▪ Binary relationship sets are represented in UML by just
drawing a line connecting the entity sets. The relationship
set name is written adjacent to the line.
▪ The role played by an entity set in a relationship set may
also be specified by writing the role name on the line,
adjacent to the entity set.
▪ The relationship set name may alternatively be written in a
box, along with attributes of the relationship set, and the
box is connected, using a dotted line, to the line depicting
the relationship set.
ER vs. UML Class Diagrams
Other Aspects of Database Design
▪ Functional Requirements
▪ Data Flow, Workflow
▪ Schema Evolution
End of Unit 2
Relational Model
Kathmandu University
BE Computer Engineering /
BSc Computer Science
Year II/II
Outline
● Relational Algebraic Operations
● Operations on Basic SQL
● Operations on Intermediate SQL
Relational Model
● Relational data model is the primary data model, which is used widely
around the world for data storage and processing. This model is simple
and it has all the properties and capabilities required to process data with
storage efficiency.
● In the relational model, data and relationships are represented by a collection of inter-related tables. Each table is a group of columns and rows, where columns represent attributes of an entity and rows represent records.
Example of an Instructor Relation
attributes (or columns)
tuples (or rows)
Attribute
▪ The set of allowed values for each attribute is called the
domain of the attribute
▪ Attribute values are (normally) required to be atomic; that
is, indivisible
▪ The special value null is a member of every domain. It indicates that the value is “unknown”
▪ The null value causes complications in the definition of
many operations
Relations are Unordered
▪ Order of tuples is irrelevant (tuples may be stored in an
arbitrary order)
▪ Example: instructor relation with unordered tuples
Database Schema
▪ Database schema - is the logical structure of the
database.
▪ Database instance - is a snapshot of the data in the
database at a given instant in time.
▪ Example:
• schema: instructor (ID, name, dept_name, salary)
• Instance:
Keys
▪ Let K ⊆ R
▪ K is a superkey of R if values for K are sufficient to identify a
unique tuple of each possible relation r(R)
• Example: {ID} and {ID,name} are both superkeys of instructor.
▪ Superkey K is a candidate key if K is minimal
• Example: {ID} is a candidate key for instructor
▪ One of the candidate keys is selected to be the primary key.
• which one?
▪ Foreign key constraint: Value in one relation must appear in another
• Referencing relation
• Referenced relation
• Example – dept_name in instructor is a foreign key from instructor referencing department
Schema Diagram for University Database
Find referencing relation and referenced relation from the university database schema
Relational Query Languages
▪ Procedural versus non-procedural, or declarative
▪ “Pure” languages:
• Relational algebra
• Tuple relational calculus
• Domain relational calculus
▪ The above 3 pure languages are equivalent in computing power
▪ We will concentrate in this chapter on relational algebra
• Not Turing-machine equivalent
• Consists of 6 basic operations
Relational Algebra
▪ A procedural language consisting of a set of operations that
take one or two relations as input and produce a new relation
as their result.
▪ Six basic operators
• select: σ
• project: ∏
• union: ∪
• set difference: –
• Cartesian product: x
• rename: ρ
Select Operation
▪ The select operation selects tuples that satisfy a given predicate.
▪ Notation: σ p(r)
▪ p is called the selection predicate
▪ Example: select those tuples of the instructor relation where the
instructor is in the “Physics” department.
• Query:
σ dept_name=“Physics” (instructor)
• Result:
Select Operation (Cont.)
▪ We allow comparisons using
=, ≠, >, ≥, <, ≤
in the selection predicate.
▪ We can combine several predicates into a larger predicate by using
the connectives:
∧ (and), ∨ (or), ¬ (not)
▪ Example: to find the instructors in Physics with a salary greater than $90,000, we write:
σ dept_name=“Physics” ∧ salary > 90,000 (instructor)
▪ The select predicate may include comparisons between two attributes.
• Example: find all departments whose name is the same as their building name:
σ dept_name=building (department)
Project Operation
▪ A unary operation that returns its argument relation,
with certain attributes left out.
▪ Notation:
∏ A1, A2, …, Ak (r)
where A1, A2, …, Ak are attribute names and r is a relation name.
▪ The result is defined as the relation of k columns
obtained by erasing the columns that are not listed
▪ Duplicate rows removed from result, since relations are
sets
Project Operation (Cont.)
▪ Example: eliminate the dept_name attribute of instructor
▪ Query:
∏ ID, name, salary (instructor)
▪ Result:
Composition of Relational Operations
▪ The result of a relational-algebra operation is a relation, and therefore relational-algebra operations can be composed together into a relational-algebra expression.
▪ Consider the query -- Find the names of all instructors in the Physics department.
∏ name (σ dept_name=“Physics” (instructor))
▪ Instead of giving the name of a relation as the argument of the projection operation, we give an expression that evaluates to a relation.
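▪ For comparison, the same composed query written in SQL (covered later in this unit) is:
select name
from instructor
where dept_name = 'Physics';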
Cartesian-Product Operation
▪ The Cartesian-product operation (denoted by X) allows us to
combine information from any two relations.
▪ Example: the Cartesian product of the relations instructor and
teaches is written as:
instructor X teaches
▪ We construct a tuple of the result out of each possible pair of
tuples: one from the instructor relation and one from the
teaches relation (see next slide)
▪ Since the instructor ID appears in both relations, we distinguish between these attributes by attaching to the attribute the name of the relation from which it originally came.
• instructor.ID
• teaches.ID
The instructor X teaches table
Join Operation
▪ The Cartesian-Product
instructor X teaches
associates every tuple of instructor with every tuple of
teaches.
• Most of the resulting rows have information about instructors who did NOT teach a particular course.
▪ To get only those tuples of “instructor X teaches” that pertain to instructors and the courses that they taught, we write:
σ instructor.ID = teaches.ID (instructor X teaches)
• We get only those tuples of “instructor X teaches” that pertain to instructors and the courses that they taught.
▪ The result of this expression is shown in the next slide
Join Operation (Cont.)
▪ The table corresponding to:
σ instructor.ID = teaches.ID (instructor X teaches)
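▪ The SQL counterpart of this select-over-product expression (SQL is introduced later in this unit) is:
select *
from instructor, teaches
where instructor.ID = teaches.ID;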
Join Operation (Cont.)
Union Operation
▪ The union operation allows us to combine two relations
▪ Notation: r ∪ s
▪ For r ∪ s to be valid.
1. r, s must have the same arity (same number of attributes)
2. The attribute domains must be compatible (example: 2nd
column of r deals with the same type of values as does the 2nd
column of s)
▪ Example: to find all courses taught in the Fall 2017 semester, or
in the Spring 2018 semester, or in both
∏ course_id (σ semester=“Fall” ∧ year=2017 (section)) ∪
∏ course_id (σ semester=“Spring” ∧ year=2018 (section))
Union Operation (Cont.)
▪ Result of:
∏ course_id (σ semester=“Fall” ∧ year=2017 (section)) ∪
∏ course_id (σ semester=“Spring” ∧ year=2018 (section))
Set-Intersection Operation
▪ The set-intersection operation allows us to find tuples that
are in both the input relations.
▪ Notation: r ∩ s
▪ Assume:
• r, s have the same arity
• attributes of r and s are compatible
▪ Example: Find the set of all courses taught in both the Fall 2017 and the Spring 2018 semesters.
∏ course_id (σ semester=“Fall” ∧ year=2017 (section)) ∩
∏ course_id (σ semester=“Spring” ∧ year=2018 (section))
• Result
Set Difference Operation
▪ The set-difference operation allows us to find tuples that are in one relation but are not in another.
▪ Notation: r – s
▪ Set differences must be taken between compatible relations.
• r and s must have the same arity
• attribute domains of r and s must be compatible
▪ Example: to find all courses taught in the Fall 2017 semester, but
not in the Spring 2018 semester
∏ course_id (σ semester=“Fall” ∧ year=2017 (section)) −
∏ course_id (σ semester=“Spring” ∧ year=2018 (section))
The Assignment Operation
▪ It is convenient at times to write a relational-algebra expression
by assigning parts of it to temporary relation variables.
▪ The assignment operation is denoted by ← and works like
assignment in a programming language.
▪ Example: Find all instructors in the “Physics” and “Music” departments.
Physics ← σ dept_name=“Physics” (instructor)
Music ← σ dept_name=“Music” (instructor)
Physics ∪ Music
▪ With the assignment operation, a query can be written as a sequential program consisting of a series of assignments followed by an expression whose value is displayed as the result of the query.
The Rename Operation
▪ The results of relational-algebra expressions do not have a
name that we can use to refer to them. The rename
operator, ρ, is provided for that purpose
▪ The expression:
ρx (E)
returns the result of expression E under the name x
▪ Another form of the rename operation, which returns the result of E under the name x with the attributes renamed to A1, A2, …, An:
ρ x(A1, A2, …, An) (E)
Equivalent Queries
▪ There is more than one way to write a query in relational
algebra.
▪ Example: Find information about courses taught by instructors
in the Physics department with salary greater than 90,000
▪ Query 1
σ dept_name=“Physics” ∧ salary > 90,000 (instructor)
▪ Query 2
σ dept_name=“Physics” (σ salary > 90,000 (instructor))

▪ The two queries are not identical; they are, however, equivalent
-- they give the same result on any database.
Equivalent Queries
Basic SQL
History
▪ IBM Sequel language developed as part of System R project at
the IBM San Jose Research Laboratory
▪ Renamed Structured Query Language (SQL)
▪ ANSI and ISO standard SQL:
• SQL-86
• SQL-89
• SQL-92
• SQL:1999 (language name became Y2K compliant!)
• SQL:2003
▪ Commercial systems offer most, if not all, SQL-92 features, plus varying feature sets from later standards and special proprietary features.
• Not all examples here may work on your particular system.
SQL Parts
▪ DML -- provides the ability to query information from the
database and to insert tuples into, delete tuples from, and
modify tuples in the database.
▪ Integrity – the DDL includes commands for specifying integrity constraints.
▪ View definition – the DDL includes commands for defining views.
▪ Transaction control – includes commands for specifying the beginning and ending of transactions.
▪ Embedded SQL and dynamic SQL -- define how SQL
statements can be embedded within general-purpose
programming languages.
▪ Authorization – includes commands for specifying access
rights to relations and views.
Data Definition Language
The SQL data-definition language (DDL) allows the specification
of information about relations, including:
▪ The schema for each relation.
▪ The type of values associated with each attribute.
▪ The Integrity constraints
▪ The set of indices to be maintained for each relation.
▪ Security and authorization information for each relation.
▪ The physical storage structure of each relation on disk.
Domain Types in SQL
▪ char(n). Fixed length character string, with user-specified length
n.
▪ varchar(n). Variable length character strings, with
user-specified maximum length n.
▪ int. Integer (a finite subset of the integers that is
machine-dependent).
▪ smallint. Small integer (a machine-dependent subset of the
integer domain type).
▪ numeric(p,d). Fixed point number, with user-specified precision of p digits, with d digits to the right of the decimal point. (e.g., numeric(3,1) allows 44.5 to be stored exactly, but not 444.5 or 0.32)
▪ real, double precision. Floating point and double-precision
floating point numbers, with machine-dependent precision.
▪ float(n). Floating point number, with user-specified precision of
at least n digits.
Create Table Construct
▪ An SQL relation is defined using the create table command:
create table r
(A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1),
...,
(integrity-constraint k))
• r is the name of the relation
• each Ai is an attribute name in the schema of relation r
• Di is the data type of values in the domain of attribute Ai
▪ Example:
create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))
Integrity Constraints in Create Table
▪ Types of integrity constraints
• primary key (A1, ..., An)
• foreign key (Am, ..., An) references r
• not null
▪ SQL prevents any update to the database that violates an integrity constraint.
▪ Example:
create table instructor (
ID char(5),
name varchar(20) not null,
dept_name varchar(20),
salary numeric(8,2),
primary key (ID),
foreign key (dept_name) references
department);
And a Few More Relation Definitions
▪ create table student (
ID varchar(5),
name varchar(20) not null,
dept_name varchar(20),
tot_cred numeric(3,0),
primary key (ID),
foreign key (dept_name) references department);
▪ create table takes (
ID varchar(5),
course_id varchar(8),
sec_id varchar(8),
semester varchar(6),
year numeric(4,0),
grade varchar(2),
primary key (ID, course_id, sec_id, semester, year) ,
foreign key (ID) references student,
foreign key (course_id, sec_id, semester, year) references
section);
And more still
▪ create table course (
course_id varchar(8),
title varchar(50),
dept_name varchar(20),
credits numeric(2,0),
primary key (course_id),
foreign key (dept_name) references department);
Updates to tables
▪ Insert
• insert into instructor values ('10211', 'Smith', 'Biology', 66000);
▪ Delete
• Remove all tuples from the student relation
▪ delete from student
▪ Drop Table
• drop table r
▪ Alter
• alter table r add A D
▪ where A is the name of the attribute to be added to relation r and D is the domain of A.
▪ All existing tuples in the relation are assigned null as the value for the new attribute.
• alter table r drop A
▪ where A is the name of an attribute of relation r
▪ Dropping of attributes is not supported by many databases.
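▪ For instance (the attribute phone and its type are illustrative, not part of the university schema):
alter table instructor add phone varchar(15); -- existing tuples get null for phone
alter table instructor drop phone;            -- may be unsupported on some systems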
Basic Query Structure
▪ A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
• Ai represents an attribute
• ri represents a relation
• P is a predicate.
▪ The result of an SQL query is a relation.
The select Clause
▪ The select clause lists the attributes desired in the result
of a query
• corresponds to the projection operation of the relational algebra
▪ Example: find the names of all instructors:
select name
from instructor
▪ NOTE: SQL names are case insensitive (i.e., you may use upper- or lower-case letters.)
• E.g., Name ≡ NAME ≡ name
• Some people use upper case wherever we use bold font.
The select Clause (Cont.)
▪ SQL allows duplicates in relations as well as in query
results.
▪ To force the elimination of duplicates, insert the
keyword distinct after select.
▪ Find the department names of all instructors, and
remove duplicates
select distinct dept_name
from instructor
▪ The keyword all specifies that duplicates should not be removed.
select all dept_name
from instructor
The select Clause (Cont.)
▪ An asterisk in the select clause denotes “all attributes”
select *
from instructor
▪ An attribute can be a literal with no from clause
select '437'
• Result is a table with one column and a single row with value “437”
• Can give the column a name using:
select '437' as FOO
▪ An attribute can be a literal with a from clause
select 'A'
from instructor
• Result is a table with one column and N rows (the number of tuples in the instructor table), each row with value “A”
The select Clause (Cont.)
▪ The select clause can contain arithmetic expressions
involving the operation, +, –, *, and /, and operating on
constants or attributes of tuples.
• The query:
select ID, name, salary/12
from instructor
would return a relation that is the same as the instructor relation, except that the value of the attribute salary is divided by 12.
• Can rename “salary/12” using the as clause:
select ID, name, salary/12 as monthly_salary
The where Clause
▪ The where clause specifies conditions that the result must
satisfy
• Corresponds to the selection predicate of the relational algebra.
▪ To find all instructors in Comp. Sci. dept
select name
from instructor
where dept_name = 'Comp. Sci.'
▪ SQL allows the use of the logical connectives and, or, and
not
▪ The operands of the logical connectives can be expressions
involving the comparison operators <, <=, >, >=, =, and <>.
▪ Comparisons can be applied to results of arithmetic
expressions
▪ To find all instructors in Comp. Sci. dept with salary > 80000
select name
from instructor
where dept_name = 'Comp. Sci.' and salary > 80000
The from Clause
▪ The from clause lists the relations involved in the query
• Corresponds to the Cartesian product operation of the relational algebra.
▪ Find the Cartesian product instructor X teaches
select *
from instructor, teaches
• generates every possible instructor – teaches pair, with all attributes from both relations.
• For common attributes (e.g., ID), the attributes in the resulting table are renamed using the relation name (e.g., instructor.ID)
▪ The Cartesian product is not very useful directly, but is useful combined with a where-clause condition (selection operation in relational algebra).
Examples
▪ Find the names of all instructors who have taught some course
and the course_id
• select name, course_id
from instructor , teaches
where instructor.ID = teaches.ID
▪ Find the names of all instructors in the Art department who have
taught some course and the course_id
• select name, course_id
from instructor, teaches
where instructor.ID = teaches.ID and instructor.dept_name = 'Art'
The Rename Operation
▪ The SQL allows renaming relations and attributes using the
as clause:
old-name as new-name
▪ Find the names of all instructors who have a higher salary than some instructor in 'Comp. Sci'.
• select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Comp. Sci.'
▪ Keyword as is optional and may be omitted
instructor as T ≡ instructor T
Self Join Example
▪ Relation emp-super
▪ Find the supervisor of “Bob”
▪ Find the supervisor of the supervisor of “Bob”
▪ Can you find ALL the supervisors (direct and indirect) of
“Bob”?
String Operations
▪ SQL includes a string-matching operator for comparisons on
character strings. The operator like uses patterns that are
described using two special characters:
• percent ( % ). The % character matches any substring.
• underscore ( _ ). The _ character matches any character.
▪ Find the names of all instructors whose name includes the substring “dar”.
select name
from instructor
where name like '%dar%'
▪ Match the string “100%”:
like '100\%' escape '\'
where we use the backslash (\) as the escape character.
String Operations (Cont.)
▪ Patterns are case sensitive.
▪ Pattern matching examples:
• 'Intro%' matches any string beginning with “Intro”.
• '%Comp%' matches any string containing “Comp” as a substring.
• '_ _ _' matches any string of exactly three characters.
• '_ _ _ %' matches any string of at least three characters.
▪ SQL supports a variety of string operations such as
• concatenation (using “||”)
• converting from upper to lower case (and vice versa)
• finding string length, extracting substrings, etc.
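▪ A short sketch combining these operations (function names follow the SQL standard; exact support varies by system):
select upper(name) as name_caps,
       name || ' (' || dept_name || ')' as labeled  -- concatenation with ||
from instructor
where name like 'S%';  -- names beginning with S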
Ordering the Display of Tuples
▪ List in alphabetic order the names of all instructors
select distinct name
from instructor
order by name
▪ We may specify desc for descending order or asc for ascending
order, for each attribute; ascending order is the default.
• Example: order by name desc
▪ Can sort on multiple attributes
• Example: order by dept_name, name
Where Clause Predicates
▪ SQL includes a between comparison operator
▪ Example: Find the names of all instructors with salary
between $90,000 and $100,000 (that is, ≥ $90,000 and ≤
$100,000)
• select name
from instructor
where salary between 90000 and 100000
▪ Tuple comparison
• select name, course_id
from instructor, teaches
where (instructor.ID, dept_name) = (teaches.ID, 'Biology');
Set Operations
▪ Find courses that ran in Fall 2017 or in Spring 2018
(select course_id from section where semester = 'Fall' and year = 2017)
union
(select course_id from section where semester = 'Spring' and year = 2018)
● Find courses that ran in Fall 2017 and in Spring 2018
(select course_id from section where semester = 'Fall' and year = 2017)
intersect
(select course_id from section where semester = 'Spring' and year = 2018)
● Find courses that ran in Fall 2017 but not in Spring 2018
(select course_id from section where semester = 'Fall' and year = 2017)
except
(select course_id from section where semester = 'Spring' and year = 2018)
Set Operations (Cont.)
▪ Set operations union, intersect, and except
• Each of the above operations automatically eliminates duplicates
▪ To retain all duplicates, use the corresponding multiset versions:
• union all
• intersect all
• except all
Null Values
▪ It is possible for tuples to have a null value, denoted by
null, for some of their attributes
▪ null signifies an unknown value or that a value does
not exist.
▪ The result of any arithmetic expression involving null is
null
• Example: 5 + null returns null
▪ The predicate is null can be used to check for null values.
• Example: Find all instructors whose salary is null.
select name
from instructor
where salary is null
▪ The predicate is not null succeeds if the value on
which it is applied is not null.
Null Values (Cont.)
▪ SQL treats as unknown the result of any comparison
involving a null value (other than predicates is null and is
not null).
• Example: 5 < null, null <> null, and null = null all evaluate to unknown
▪ The predicate in a where clause can involve Boolean operations (and, or, not); thus the definitions of the Boolean operations need to be extended to deal with the value unknown.
• and: (true and unknown) = unknown, (false and unknown) = false, (unknown and unknown) = unknown
• or: (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown
▪ The result of a where clause predicate is treated as false if it evaluates to unknown
Aggregate Functions
▪ These functions operate on the multiset of values of a
column of a relation, and return a value
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
Aggregate Functions Examples
▪ Find the average salary of instructors in the Computer Science
department
• select avg (salary)
from instructor
where dept_name= 'Comp. Sci.';
▪ Find the total number of instructors who teach a course in the Spring 2018 semester
• select count (distinct ID)
from teaches
where semester = 'Spring' and year = 2018;
▪ Find the number of tuples in the course relation
• select count (*)
from course;
Aggregate Functions – Group By
▪ Find the average salary of instructors in each department
• select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name;
Aggregation (Cont.)
▪ Attributes in select clause outside of aggregate functions must
appear in group by list
• /* erroneous query */
select dept_name, ID, avg (salary)
from instructor
group by dept_name;
Aggregate Functions – Having Clause
▪ Find the names and average salaries of all departments whose
average salary is greater than 42000
select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name
having avg (salary) > 42000;
Note: predicates in the having clause are applied after the formation of groups, whereas predicates in the where clause are applied before forming groups
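▪ A sketch using both clauses together (the 30000 cutoff is illustrative):
select dept_name, avg (salary) as avg_salary
from instructor
where salary > 30000          -- filters individual tuples before grouping
group by dept_name
having avg (salary) > 42000;  -- filters whole groups after grouping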
Null Values and Aggregates
▪ Total all salaries
select sum (salary )
from instructor
• The above statement ignores null amounts
• The result is null if there is no non-null amount
▪ All aggregate operations except count(*) ignore tuples with null values on the aggregated attributes
▪ What if the collection has only null values?
• count returns 0
• all other aggregates return null
Nested Subqueries
▪ SQL provides a mechanism for the nesting of subqueries. A
subquery is a select-from-where expression that is nested
within another query.
▪ The nesting can be done in the following SQL query
select A1, A2, ..., An
from r1, r2, ..., rm
where P
as follows:
• From clause: ri can be replaced by any valid subquery
• Where clause: P can be replaced with an expression of the form:
B <operation> (subquery)
where B is an attribute and <operation> is defined later.
• Select clause: Ai can be replaced by a subquery that generates a single value.
Set Membership
▪ Find courses offered in Fall 2017 and in Spring 2018
select distinct course_id
from section
where semester = 'Fall' and year= 2017 and
course_id in (select course_id
from section
where semester = 'Spring' and year= 2018);
▪ Find courses offered in Fall 2017 but not in Spring 2018
select distinct course_id
from section
where semester = 'Fall' and year= 2017 and
course_id not in (select course_id
from section
where semester = 'Spring' and year= 2018);
Set Membership (Cont.)
▪ Name all instructors whose name is neither “Mozart” nor “Einstein”
select distinct name
from instructor
where name not in ('Mozart', 'Einstein')
▪ Find the total number of (distinct) students who have taken course sections taught by the instructor with ID 10101
select count (distinct ID)
from takes
where (course_id, sec_id, semester, year) in
(select course_id, sec_id, semester, year
from teaches
where teaches.ID = '10101');
▪ Note: The above query can be written in a much simpler manner; the formulation above is simply to illustrate SQL features (see the sketch below).
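▪ One such simpler formulation, as a sketch (it joins takes and teaches directly on the section-identifying attributes):
select count (distinct takes.ID)
from takes, teaches
where takes.course_id = teaches.course_id
and takes.sec_id = teaches.sec_id
and takes.semester = teaches.semester
and takes.year = teaches.year
and teaches.ID = '10101';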
Set Comparison
Set Comparison – “some” Clause
▪ Find names of instructors with salary greater than that of
some (at least one) instructor in the Biology department.
select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Biology';
▪ Same query using > some clause
select name
from instructor
where salary > some (select salary
from instructor
where dept_name = 'Biology');
Set Comparison – “all” Clause
▪ Find the names of all instructors whose salary is greater than
the salary of all instructors in the Biology department.
select name
from instructor
where salary > all (select salary
from instructor
where dept_name = 'Biology');
Test for Absence of Duplicate Tuples
▪ The unique construct tests whether a subquery has any
duplicate tuples in its result.
▪ The unique construct evaluates to “true” if a given subquery contains no duplicates.
▪ Find all courses that were offered at most once in 2017
select T.course_id
from course as T
where unique ( select R.course_id
from section as R
where T.course_id= R.course_id
and R.year = 2017);
Subqueries in the From Clause
Subqueries in the From Clause
▪ SQL allows a subquery expression to be used in the from clause
▪ Find the average instructors’ salaries of those departments where the average salary is greater than $42,000
select dept_name, avg_salary
from ( select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name)
where avg_salary > 42000;
▪ Note that we do not need to use the having clause
▪ Another way to write the above query:
select dept_name, avg_salary
from ( select dept_name, avg (salary)
from instructor
group by dept_name)
as dept_avg (dept_name, avg_salary)
where avg_salary > 42000;
With Clause
▪ The with clause provides a way of defining a temporary
relation whose definition is available only to the query in
which the with clause occurs.
▪ Find all departments with the maximum budget
with max_budget (value) as
(select max(budget)
from department)
select department.dept_name
from department, max_budget
where department.budget = max_budget.value;
Complex Queries using With Clause
▪ Find all departments where the total salary is greater than
the average of the total salary at all departments
with dept_total (dept_name, value) as
(select dept_name, sum(salary)
from instructor
group by dept_name),
dept_total_avg(value) as
(select avg(value)
from dept_total)
select dept_name
from dept_total, dept_total_avg
where dept_total.value > dept_total_avg.value;
Scalar Subquery
▪ Scalar subquery is one which is used where a single value is
expected
▪ List all departments along with the number of instructors in
each department
select dept_name,
( select count(*)
from instructor
where department.dept_name = instructor.dept_name)
as num_instructors
from department;
▪ Runtime error if subquery returns more than one result tuple
Modification of the Database
▪ Deletion of tuples from a given relation.
▪ Insertion of new tuples into a given relation
▪ Updating of values in some tuples in a given relation
Deletion
▪ Delete all instructors
delete from instructor
▪ Delete all instructors from the Finance department
delete from instructor
where dept_name = 'Finance';
▪ Delete all tuples in the instructor relation for those instructors associated with a department located in the Watson building.
delete from instructor
where dept_name in (select dept_name
from department
where building = 'Watson');
Deletion (Cont.)
▪ Delete all instructors whose salary is less than the average
salary of instructors
delete from instructor
where salary < (select avg (salary)
from instructor);
● Problem: as we delete tuples from instructor, the average salary changes
● Solution used in SQL:
1. First, compute avg (salary) and find all tuples to delete
2. Next, delete all tuples found above (without recomputing avg or retesting the tuples)
Insertion
▪ Add a new tuple to course
insert into course
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);
▪ or equivalently
insert into course (course_id, title, dept_name, credits)
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);
▪ Add a new tuple to student with tot_cred set to null
insert into student
values ('3003', 'Green', 'Finance', null);
Insertion (Cont.)
▪ Make each student in the Music department who has earned more
than 144 credit hours an instructor in the Music department with a
salary of $18,000.
insert into instructor
select ID, name, dept_name, 18000
from student
where dept_name = 'Music' and tot_cred > 144;
▪ The select from where statement is evaluated fully before any of its results are inserted into the relation. Otherwise, queries like
insert into table1 select * from table1
would cause problems
Updates
▪ Give a 5% salary raise to all instructors
update instructor
set salary = salary * 1.05
▪ Give a 5% salary raise to those instructors who earn less than 70000
update instructor
set salary = salary * 1.05
where salary < 70000;
▪ Give a 5% salary raise to instructors whose salary is
less than average
update instructor
set salary = salary * 1.05
where salary < (select avg (salary)
from instructor);
Updates (Cont.)
▪ Increase salaries of instructors whose salary is over $100,000 by 3%, and all others by 5%
• Write two update statements:
update instructor
set salary = salary * 1.03
where salary > 100000;
update instructor
set salary = salary * 1.05
where salary <= 100000;
• The order is important
• Can be done better using the case statement (next slide)
Case Statement for Conditional Updates
▪ Same query as before but with case statement
update instructor
set salary = case
when salary <= 100000 then salary * 1.05
else salary * 1.03
end
Updates with Scalar Subqueries
▪ Recompute and update tot_creds value for all students
update student S
set tot_cred = (select sum(credits)
from takes, course
where takes.course_id = course.course_id and
S.ID = takes.ID and
takes.grade <> 'F' and
takes.grade is not null);
▪ Sets tot_cred to null for students who have not taken any course
▪ Instead of sum(credits), use:
case
when sum(credits) is not null then sum(credits)
else 0
end
Intermediate
SQL
Chapter 4: Intermediate SQL
▪ Join Expressions
▪ Views
▪ Transactions
▪ Integrity Constraints
▪ SQL Data Types and Schemas
▪ Index Definition in SQL
▪ Authorization
Joined Relations
▪ Join operations take two relations and return as a result
another relation.
▪ A join operation is a Cartesian product which requires that
tuples in the two relations match (under some condition). It
also specifies the attributes that are present in the result of
the join
▪ The join operations are typically used as subquery
expressions in the from clause
▪ Three types of joins:
• Natural join
• Inner join
• Outer join
Natural Join in SQL
▪ Natural join matches tuples with the same values for all
common attributes, and retains only one copy of each
common column.
▪ List the names of instructors along with the course ID
of the courses that they taught
• select name, course_id
from student, takes
where student.ID = takes.ID;
▪ Same query in SQL with the “natural join” construct
• select name, course_id
from student natural join takes;
Natural Join in SQL (Cont.)
▪ The from clause in natural join can have multiple
relations combined using natural join:
select A1, A2, … An
from r1 natural join r2 natural join .. natural join rn
where P ;
Student Relation
Takes Relation
student natural join takes
Natural Join with Using Clause
▪ To avoid the danger of equating attributes erroneously, we can
use the “using” construct that allows us to specify exactly which
columns should be equated.
▪ Query example
select name, title
from (student natural join takes) join course using
(course_id)
Join Condition
▪ The on condition allows a general predicate over the relations being
joined
▪ This predicate is written like a where clause predicate except for the
use of the keyword on
▪ Query example
select *
from student join takes on student.ID = takes.ID
• The on condition above specifies that a tuple from student matches a tuple from takes if their ID values are equal.
▪ Equivalent to:
select *
from student, takes
where student.ID = takes.ID
Outer Join
▪ An extension of the join operation that avoids loss of
information.
▪ Computes the join and then adds tuples from one relation that do not match tuples in the other relation to the result of the join.
▪ Uses null values.
▪ Three forms of outer join:
• left outer join
• right outer join
• full outer join
Outer Join Examples
▪ Relation course
▪ Relation prereq
▪ Observe that
course information is missing for CS-437
prereq information is missing for CS-315
Left Outer Join
▪ course natural left outer join prereq
▪ In relational algebra: course ⟕ prereq
Right Outer Join
▪ course natural right outer join prereq
▪ In relational algebra: course ⟖ prereq
Full Outer Join
▪ course natural full outer join prereq
▪ In relational algebra: course ⟗ prereq
Join Types and Conditions
▪ Join operations take two relations and return as a result
another relation.
▪ These additional operations are typically used as subquery
expressions in the from clause
▪ Join condition – defines which tuples in the two relations
match, and what attributes are present in the result of the
join.
▪ Join type – defines how tuples in each relation that do not
match any tuple in the other relation (based on the join
condition) are treated.
Joined Relations – Examples
▪ course natural right outer join prereq
▪ course full outer join prereq using (course_id)
Joined Relations – Examples
▪ course inner join prereq on
course.course_id = prereq.course_id
▪ What is the difference between the above and a natural join?
▪ course left outer join prereq on
course.course_id = prereq.course_id
Views
▪ In some cases, it is not desirable for all users to see the
entire logical model (that is, all the actual relations stored
in the database.)
▪ Consider a person who needs to know an instructor's name and department, but not the salary. This person should see a relation described, in SQL, by
select ID, name, dept_name
from instructor
▪ A view provides a mechanism to hide certain data from the view of certain users.
▪ Any relation that is not of the conceptual model but is
made visible to a user as a “virtual relation” is called a
view.
View Definition
▪ A view is defined using the create view statement
which has the form
create view v as < query expression >
where <query expression> is any legal SQL expression.
The view name is represented by v.
▪ Once a view is defined, the view name can be used to
refer to the virtual relation that the view generates.
▪ View definition is not the same as creating a new
relation by evaluating the query expression
• Rather, a view definition causes the saving of an
expression; the expression is substituted into
queries using the view.
View Definition and Use
▪ A view of instructors without their salary
create view faculty as
select ID, name, dept_name
from instructor
▪ Find all instructors in the Biology department
select name
from faculty
where dept_name = 'Biology'
▪ Create a view of department salary totals
create view departments_total_salary(dept_name, total_salary) as
select dept_name, sum (salary)
from instructor
group by dept_name;
Views Defined Using Other Views
▪ One view may be used in the expression defining
another view
▪ A view relation v1 is said to depend directly on a view
relation v2 if v2 is used in the expression defining v1
▪ A view relation v1 is said to depend on view relation v2 if
either v1 depends directly to v2 or there is a path of
dependencies from v1 to v2
▪ A view relation v is said to be recursive if it depends on
itself.
Views Defined Using Other Views
▪ create view physics_fall_2017 as
select course.course_id, sec_id, building, room_number
from course, section
where course.course_id = section.course_id
and course.dept_name = 'Physics'
and section.semester = 'Fall'
and section.year = '2017';
▪ create view physics_fall_2017_watson as
select course_id, room_number
from physics_fall_2017
where building= 'Watson';
Materialized Views
▪ Certain database systems allow view relations to be
physically stored.
• Physical copy created when the view is defined.
• Such views are called Materialized view:
▪ If relations used in the query are updated, the
materialized view result becomes out of date
• Need to maintain the view, by updating the view
whenever the underlying relations are updated.
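▪ The syntax is system-specific; a PostgreSQL-style sketch (the view name is illustrative):
create materialized view dept_total_salary as
select dept_name, sum(salary) as total_salary
from instructor
group by dept_name;
-- later, one way to bring the stored result up to date:
refresh materialized view dept_total_salary;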
Update of a View
▪ Add a new tuple to faculty view which we defined earlier
insert into faculty
values ('30765', 'Green', 'Music');
▪ This insertion must be represented by the insertion into
the instructor relation
• Must have a value for salary.
▪ Two approaches
• Reject the insert
• Insert the tuple
('30765', 'Green', 'Music', null)
into the instructor relation
Some Updates Cannot be Translated Uniquely
▪ create view instructor_info as
select ID, name, building
from instructor, department
where instructor.dept_name= department.dept_name;
▪ insert into instructor_info
values ('69987', 'White', 'Taylor');
▪ Issues
• Which department, if multiple departments in Taylor?
• What if no department is in Taylor?
View Updates in SQL
▪ Most SQL implementations allow updates only on simple views
• The from clause has only one database relation.
• The select clause contains only attribute names of the
relation, and does not have any expressions,
aggregates, or distinct specification.
• Any attribute not listed in the select clause can be set
to null
• The query does not have a group by or having clause.
Transactions
▪ A transaction consists of a sequence of query and/or update
statements and is a “unit” of work
▪ The SQL standard specifies that a transaction begins implicitly
when an SQL statement is executed.
▪ The transaction must end with one of the following
statements:
• Commit work. The updates performed by the transaction become permanent in the database.
• Rollback work. All the updates performed by the SQL statements in the transaction are undone.
▪ Atomic transaction
• either fully executed or rolled back as if it never occurred
▪ Isolation from concurrent transactions
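▪ A minimal sketch over the university schema (the 5000 budget transfer is illustrative; some systems spell the delimiters begin / commit / rollback):
-- move budget between two departments as one atomic unit of work
update department set budget = budget - 5000 where dept_name = 'Music';
update department set budget = budget + 5000 where dept_name = 'Physics';
commit work;
-- or instead: rollback work;  to undo both updates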
Integrity Constraints
▪ Integrity constraints guard against accidental
damage to the database, by ensuring that
authorized changes to the database do not result in
a loss of data consistency.
• A checking account must have a balance greater
than $10,000.00
• A salary of a bank employee must be at least
$4.00 an hour
• A customer must have a (non-null) phone
number
Constraints on a Single Relation
▪ not null
▪ primary key
▪ unique
▪ check (P), where P is a predicate
Not Null Constraints
▪ not null
• Declare name and budget to be not null
name varchar(20) not null
budget numeric(12,2) not null
Unique Constraints
▪ unique (A1, A2, …, Am)
• The unique specification states that the attributes A1, A2, …, Am form a candidate key.
• Candidate keys are permitted to be null (in contrast to primary keys).
The check clause
▪ The check (P) clause specifies a predicate P that must be
satisfied by every tuple in a relation.
▪ Example: ensure that semester is one of fall, winter, spring or
summer
create table section
(course_id varchar (8),
sec_id varchar (8),
semester varchar (6),
year numeric (4,0),
building varchar (15),
room_number varchar (7),
time_slot_id varchar (4),
primary key (course_id, sec_id, semester, year),
check (semester in ('Fall', 'Winter', 'Spring', 'Summer')))
Referential Integrity
▪ Ensures that a value that appears in one relation
for a given set of attributes also appears for a
certain set of attributes in another relation.
• Example: If “Biology” is a department name
appearing in one of the tuples in the instructor
relation, then there exists a tuple in the
department relation for “Biology”.
▪ Let A be a set of attributes. Let R and S be two
relations that contain attributes A and where A is
the primary key of S. A is said to be a foreign key
of R if for any values of A appearing in R these
values also appear in S.
Referential Integrity (Cont.)
▪ Foreign keys can be specified as part of the SQL create
table statement
foreign key (dept_name) references department
▪ By default, a foreign key references the primary-key
attributes of the referenced table.
▪ SQL allows a list of attributes of the referenced relation
to be specified explicitly.
foreign key (dept_name) references department
(dept_name)
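▪ SQL also lets the schema designer say what should happen when a referenced tuple is deleted or updated; cascade is one option (set null and set default are others). A sketch:
create table instructor (
ID varchar(5),
name varchar(20) not null,
dept_name varchar(20),
salary numeric(8,2),
primary key (ID),
foreign key (dept_name) references department
on delete cascade
on update cascade);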
Assertions
▪ An assertion is a predicate expressing a condition that we
wish the database always to satisfy.
▪ The following constraints, can be expressed using
assertions:
▪ For each tuple in the student relation, the value of the
attribute tot_cred must equal the sum of credits of courses
that the student has completed successfully.
▪ An instructor cannot teach in two different classrooms in a
semester in the same time slot
▪ An assertion in SQL takes the form:
create assertion <assertion-name> check (<predicate>);
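▪ A sketch of the first constraint as an assertion (it assumes a grade of 'F' or a null grade means the course was not completed; few systems actually implement create assertion):
create assertion credits_earned_constraint check
(not exists (select ID
from student
where tot_cred <> (select coalesce(sum(credits), 0)
from takes natural join course
where student.ID = takes.ID
and grade is not null and grade <> 'F')));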
Built-in Data Types in SQL
▪ date: Dates, containing a (4 digit) year, month and date
• Example: date '2005-7-27'
▪ time: Time of day, in hours, minutes and seconds.
• Example: time '09:00:30' time '09:00:30.75'
▪ timestamp: date plus time of day
• Example: timestamp '2005-7-27 09:00:30.75'
▪ interval: period of time
• Example: interval '1' day
• Subtracting a date/time/timestamp value from
another gives an interval value
• Interval values can be added to date/time/timestamp
values
Large-Object Types
▪ Large objects (photos, videos, CAD files, etc.) are stored
as a large object:
• blob: binary large object -- object is a large collection
of uninterpreted binary data (whose interpretation is
left to an application outside of the database system)
• clob: character large object -- object is a large
collection of character data
▪ When a query returns a large object, a pointer is
returned rather than the large object itself.
User-Defined Types
▪ create type construct in SQL creates user-defined type
create type Dollars as numeric (12,2) final
▪ Example:
create table department
(dept_name varchar (20),
building varchar (15),
budget Dollars);
Domains
▪ create domain construct in SQL-92 creates user-defined
domain types
create domain person_name char(20) not null
▪ Types and domains are similar. Domains can have constraints, such as not null, specified on them.
▪ Example:
create domain degree_level varchar(10)
constraint degree_level_test
check (value in ('Bachelors', 'Masters', 'Doctorate'));
Index Creation
▪ Many queries reference only a small proportion of the
records in a table.
▪ It is inefficient for the system to read every record to find a
record with particular value
▪ An index on an attribute of a relation is a data structure
that allows the database system to find those tuples in the
relation that have a specified value for that attribute
efficiently, without scanning through all the tuples of the
relation.
▪ We create an index with the create index command
create index <name> on <relation-name> (attribute);
Index Creation Example
▪ create table student
(ID varchar (5),
name varchar (20) not null,
dept_name varchar (20),
tot_cred numeric (3,0) default 0,
primary key (ID))
▪ create index studentID_index on student(ID)
▪ The query:
select *
from student
where ID = '12345'
can be executed by using the index to find the required
record, without looking at all records of student
Authorization
▪ We may assign a user several forms of authorizations
on parts of the database.
• Read - allows reading, but not modification of data.
• Insert - allows insertion of new data, but not
modification of existing data.
• Update - allows modification, but not deletion of
data.
• Delete - allows deletion of data.
▪ Each of these types of authorizations is called a
privilege. We may authorize the user all, none, or a
combination of these types of privileges on specified
parts of a database, such as a relation or a view.
Authorization (Cont.)
▪ Forms of authorization to modify the database schema
• Index - allows creation and deletion of indices.
• Resources - allows creation of new relations.
• Alteration - allows addition or deletion of attributes
in a relation.
• Drop - allows deletion of relations.
Authorization Specification in SQL
▪ The grant statement is used to confer authorization
grant <privilege list> on <relation or view > to <user list>
▪ <user list> is:
• a user-id
• public, which allows all valid users the privilege granted
• A role (more on this later)
▪ Example:
• grant select on department to Amit, Satoshi
▪ Granting a privilege on a view does not imply granting any privileges on the underlying relations.
▪ The grantor of the privilege must already hold the privilege
on the specified item (or be the database administrator).
Privileges in SQL
▪ select: allows read access to relation, or the ability to
query using the view
• Example: grant users U1, U2, and U3 select
authorization on the instructor relation:
grant select on instructor to U1, U2, U3
▪ insert: the ability to insert tuples
▪ update: the ability to update using the SQL update
statement
▪ delete: the ability to delete tuples.
▪ all privileges: used as a short form for all the allowable
privileges
Revoking Authorization in SQL
▪ The revoke statement is used to revoke authorization.
revoke <privilege list> on <relation or view> from <user list>
▪ Example:
revoke select on student from U1, U2, U3
▪ <privilege-list> may be all to revoke all privileges the revokee may
hold.
▪ If <revokee-list> includes public, all users lose the privilege except
those granted it explicitly.
▪ If the same privilege was granted twice to the same user by different grantors, the user may retain the privilege after the revocation.
▪ All privileges that depend on the privilege being revoked are also
revoked.
Roles
▪ A role is a way to distinguish among various users as far
as what these users can access/update in the database.
▪ To create a role we use:
create role <name>
▪ Example:
• create role instructor
▪ Once a role is created we can assign “users” to the role
using:
• grant <role> to <users>
Roles Example
▪ create role instructor;
▪ grant instructor to Amit;
▪ Privileges can be granted to roles:
• grant select on takes to instructor;
▪ Roles can be granted to users, as well as to other roles
• create role teaching_assistant
• grant teaching_assistant to instructor;
▪ Instructor inherits all privileges of teaching_assistant
▪ Chain of roles
• create role dean;
• grant instructor to dean;
• grant dean to Satoshi;
Authorization on Views
▪ create view geo_instructor as
(select *
from instructor
where dept_name = 'Geology');
▪ grant select on geo_instructor to geo_staff
▪ Suppose that a geo_staff member issues
• select *
from geo_instructor;
▪ What if
• geo_staff does not have permissions on instructor?
• creator of view did not have some permissions on
instructor?
Other Authorization Features
▪ references privilege to create foreign key
• grant references (dept_name) on department to Mariano;
• why is this required?
▪ transfer of privileges
• grant select on department to Amit with grant option;
• revoke select on department from Amit, Satoshi
cascade;
• revoke select on department from Amit, Satoshi
restrict;
• And more!
End of Chapter 4
Normalization
Outline
▪ Features of Good Relational Design
▪ Functional Dependencies
▪ Decomposition Using Functional Dependencies
▪ Normal Forms
▪ Functional Dependency Theory
▪ Algorithms for Decomposition using Functional
Dependencies
▪ Decomposition Using Multivalued Dependencies
▪ More Normal Forms
▪ Atomic Domains and First Normal Form
▪ Database-Design Process
▪ Modeling Temporal Data
Overview of Normalization
Features of Good Relational Designs
▪ Suppose we combine instructor and department into in_dep, which
represents the natural join on the relations instructor and department
▪ There is repetition of information
▪ Need to use null values (if we add a new department with no
instructors)
A Combined Schema Without Repetition
▪ Not all combined schemas result in repetition of
information
• Consider combining relations
▪ sec_class(sec_id, building, room_number) and
▪ section(course_id, sec_id, semester, year)
into one relation
▪ section(course_id, sec_id, semester, year,
building, room_number)
• No repetition in this case
Decomposition
▪ The only way to avoid the repetition-of-information problem in
the in_dep schema is to decompose it into two schemas –
instructor and department schemas.
▪ Not all decompositions are good. Suppose we decompose
employee(ID, name, street, city, salary)
into
employee1 (ID, name)
employee2 (name, street, city, salary)
The problem arises when we have two employees with the same name
▪ The next slide shows how we lose information -- we cannot
reconstruct the original employee relation -- and so, this is a
lossy decomposition.
A Lossy Decomposition
Lossless Decomposition
▪ Let R be a relation schema and let R1 and R2 form a
decomposition of R . That is R = R1 U R2
▪ We say that the decomposition is a lossless decomposition if there is no loss of information by replacing R with the two relation schemas R1 and R2
▪ Formally,
∏R1 (r) ⋈ ∏R2 (r) = r
▪ And, conversely, a decomposition is lossy if
r ⊂ ∏R1 (r) ⋈ ∏R2 (r)
Example of Lossless Decomposition
▪ Decomposition of R = (A, B, C)
R1 = (A, B) R2 = (B, C)
Normalization Theory
▪ Decide whether a particular relation R is in “good” form.
▪ In the case that a relation R is not in “good” form,
decompose it into set of relations {R1, R2, ..., Rn} such
that
• Each relation is in good form
• The decomposition is a lossless decomposition
▪ Our theory is based on:
• functional dependencies
• multivalued dependencies
Functional Dependencies
▪ There are usually a variety of constraints (rules) on the
data in the real world.
▪ For example, some of the constraints that are expected to
hold in a university database are:
• Students and instructors are uniquely identified by
their ID.
• Each student and instructor has only one name.
• Each instructor and student is (primarily) associated
with only one department.
• Each department has only one value for its budget, and
only one associated building.
Functional Dependencies (Cont.)
▪ An instance of a relation that satisfies all such real-world
constraints is called a legal instance of the relation;
▪ A legal instance of a database is one where all the relation
instances are legal instances
▪ Constraints on the set of legal relations.
▪ Require that the value for a certain set of attributes
determines uniquely the value for another set of
attributes.
▪ A functional dependency is a generalization of the notion
of a key.
Functional Dependencies Definition
▪ Let R be a relation schema
α ⊆ R and β ⊆ R
▪ The functional dependency
α→β
holds on R if and only if for any legal relations r(R), whenever
any two tuples t1 and t2 of r agree on the attributes α, they
also agree on the attributes β. That is,
t1[α] = t2 [α] ⇒ t1[β ] = t2 [β ]
▪ Example: Consider r(A, B) with the following instance of r.
A  B
1  4
1  5
3  7
▪ On this instance, B → A holds; A → B does NOT hold.
Closure of a Set of Functional Dependencies
▪ Given a set F of functional dependencies, there are certain other functional dependencies that are logically implied by F.
• If A → B and B → C, then we can infer that A → C
• etc.
▪ The set of all functional dependencies logically implied
by F is the closure of F.
▪ We denote the closure of F by F+.
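▪ Example: if F = {A → B, B → C} on R = (A, B, C), then F+ also contains A → C, A → BC, and AB → C, along with all trivial dependencies such as AB → A.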
Keys and Functional Dependencies
▪ K is a superkey for relation schema R if and only if K → R
▪ K is a candidate key for R if and only if
• K → R, and
• for no α ⊂ K, α → R
▪ Functional dependencies allow us to express constraints that
cannot be expressed using superkeys. Consider the schema:
in_dep (ID, name, salary, dept_name, building, budget ).
We expect these functional dependencies to hold:
dept_name → building
ID → building
but would not expect the following to hold:
dept_name → salary
Use of Functional Dependencies
▪ We use functional dependencies to:
• To test relations to see if they are legal under a given set
of functional dependencies.
▪ If a relation r is legal under a set F of functional
dependencies, we say that r satisfies F.
• To specify constraints on the set of legal relations
▪ We say that F holds on R if all legal relations on R
satisfy the set of functional dependencies F.
▪ Note: A specific instance of a relation schema may satisfy a
functional dependency even if the functional dependency
does not hold on all legal instances.
• For example, a specific instance of instructor may, by
chance, satisfy
name → ID.
Trivial Functional Dependencies
▪ A functional dependency is trivial if it is satisfied by all
instances of a relation
• Example:
▪ ID, name → ID
▪ name → name
• In general, α → β is trivial if β ⊆ α
Lossless Decomposition
▪ We can use functional dependencies to show when certain
decomposition are lossless.
▪ For the case of R = (R1, R2), we require that for all possible
relations r on schema R
r = ∏R1 (r) ⋈ ∏R2 (r)
▪ A decomposition of R into R1 and R2 is lossless
decomposition if at least one of the following dependencies
is in F+:
• R1 ∩ R2 → R1
• R1 ∩ R2 → R2
▪ The above functional dependencies are a sufficient
condition for lossless join decomposition; the dependencies
are a necessary condition only if all constraints are
functional dependencies
Example
▪ R = (A, B, C)
F = {A → B, B → C}
▪ R1 = (A, B), R2 = (B, C)
• Lossless decomposition:
R1 ∩ R2 = {B} and B → BC
▪ R1 = (A, B), R2 = (A, C)
• Lossless decomposition:
R1 ∩ R2 = {A} and A → AB
▪ Note:
• B → BC
is a shorthand notation for
• B → {B, C}
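The condition r = ∏R1(r) ⋈ ∏R2(r) can be checked mechanically on any given instance. A self-contained Python sketch (encodings and helper names are our own), using the decomposition R1 = (A, B), R2 = (B, C) from the example above:

# Project an instance onto a set of attributes, removing duplicates.
def project(r, attrs):
    seen, out = set(), []
    for t in r:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append({a: t[a] for a in attrs})
    return out

# Natural join: combine tuples that agree on all common attributes.
def natural_join(r1, r2):
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

# An instance of R = (A, B, C) satisfying A -> B and B -> C
r = [{'A': 1, 'B': 'x', 'C': 10},
     {'A': 2, 'B': 'x', 'C': 10},
     {'A': 3, 'B': 'y', 'C': 20}]

joined = natural_join(project(r, ['A', 'B']), project(r, ['B', 'C']))
print(sorted(sorted(t.items()) for t in joined) ==
      sorted(sorted(t.items()) for t in r))        # True: no spurious tuples

On an instance violating the dependencies, the join of the projections would typically contain spurious tuples, and the comparison would print False.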
Dependency Preservation
▪ Testing functional dependency constraints each time the
database is updated can be costly
▪ It is useful to design the database in a way that constraints
can be tested efficiently.
▪ If testing a functional dependency can be done by
considering just one relation, then the cost of testing this
constraint is low
▪ After decomposing a relation, it may no longer be possible
to test a dependency without performing a join (or
Cartesian product) of the decomposed relations.
▪ A decomposition that makes it computationally hard to
enforce functional dependency is said to be NOT
dependency preserving.
Dependency Preservation Example
▪ Consider a schema:
dept_advisor(s_ID, i_ID, dept_name)
▪ With functional dependencies:
i_ID → dept_name
s_ID, dept_name → i_ID
▪ In the above design we are forced to repeat the department name
once for each time an instructor participates in a dept_advisor
relationship.
▪ To fix this, we need to decompose dept_advisor
▪ Any decomposition will not include all the attributes in
s_ID, dept_name → i_ID
▪ Thus, the decomposition will NOT be dependency preserving
Normal Forms
Boyce-Codd Normal Form
▪ A relation schema R is in BCNF with respect to a set F of
functional dependencies if for all functional dependencies
in F+ of the form
α→β
where α ⊆ R and β ⊆ R, at least one of the following
holds:
• α → β is trivial (i.e., β ⊆ α)
• α is a superkey for R
Boyce-Codd Normal Form (Cont.)
▪ Example schema that is not in BCNF:
in_dep (ID, name, salary, dept_name, building, budget )
because :
• dept_name→ building, budget
▪ holds on in_dep
▪ but
• dept_name is not a superkey
▪ When we decompose in_dep into instructor and department
• instructor is in BCNF
• department is in BCNF
Decomposing a Schema into BCNF
▪ Let R be a schema that is not in BCNF. Let α → β be the
FD that causes a violation of BCNF.
▪ We decompose R into:
• (α U β )
• (R-(β-α))
▪ In our example of in_dep,
• α = dept_name
• β = building, budget
and in_dep is replaced by
• (α U β ) = ( dept_name, building, budget )
• ( R - ( β - α ) ) = ( ID, name, dept_name, salary )
Example
▪ R = (A, B, C)
F = {A → B, B → C}
▪ R1 = (A, B), R2 = (B, C)
• Lossless-join decomposition:
R1 ∩ R2 = {B} and B → BC
• Dependency preserving
▪ R1 = (A, B), R2 = (A, C)
• Lossless-join decomposition:
R1 ∩ R2 = {A} and A → AB
• Not dependency preserving
(cannot check B → C without computing R1 ⋈ R2)
BCNF and Dependency Preservation
▪ It is not always possible to achieve both BCNF and
dependency preservation
▪ Consider a schema:
dept_advisor(s_ID, i_ID, department_name)
▪ With functional dependencies:
i_ID → dept_name
s_ID, dept_name → i_ID
▪ dept_advisor is not in BCNF
• i_ID is not a superkey.
▪ Any decomposition of dept_advisor will not include all the
attributes in
s_ID, dept_name → i_ID
▪ Thus, the decomposition is NOT dependency preserving
Third Normal Form
▪ A relation schema R is in third normal form (3NF) if, for all
α → β in F+, at least one of the following holds:
• α → β is trivial (i.e., β ⊆ α)
• α is a superkey for R
• Each attribute A in β – α is contained in a candidate key for R.
(NOTE: each attribute may be in a different candidate key)
▪ If a relation is in BCNF it is in 3NF (since in BCNF one of the first
two conditions above must hold).
▪ Third condition is a minimal relaxation of BCNF to ensure
dependency preservation (will see why later).
3NF Example
▪ Consider a schema:
dept_advisor(s_ID, i_ID, dept_name)
▪ With functional dependencies:
i_ID → dept_name
s_ID, dept_name → i_ID
▪ Two candidate keys = {s_ID, dept_name}, {s_ID, i_ID }
▪ We have seen before that dept_advisor is not in BCNF
▪ R, however, is in 3NF
• s_ID, dept_name is a superkey
• i_ID → dept_name and i_ID is NOT a superkey, but:
▪ { dept_name} – {i_ID } = {dept_name } and
▪ dept_name is contained in a candidate key
Redundancy in 3NF
▪ Consider the schema R below, which is in 3NF
• R = (J, K, L)
• F = {JK → L, L → K}
• And an instance table:
J L K
j1 l1 k1
j2 l1 k1
j3 l1 k1
null l2 k2
▪ What is wrong with the table?
• Repetition of information
• Need to use null values (e.g., to represent the relationship l2, k2
where there is no corresponding value for J)
Comparison of BCNF and 3NF
▪ Advantages to 3NF over BCNF. It is always possible to
obtain a 3NF design without sacrificing losslessness or
dependency preservation.
▪ Disadvantages to 3NF.
• We may have to use null values to represent some of
the possible meaningful relationships among data
items.
• There is the problem of repetition of information.
Goals of Normalization
▪ Let R be a relation scheme with a set F of functional
dependencies.
▪ Decide whether a relation scheme R is in “good” form.
▪ In the case that a relation scheme R is not in “good” form,
decompose it into a set of relation scheme {R1, R2, ..., Rn}
such that
• Each relation scheme is in good form
• The decomposition is a lossless decomposition
• Preferably, the decomposition should be dependency
preserving.
How good is BCNF?
▪ There are database schemas in BCNF that do not seem to
be sufficiently normalized
▪ Consider a relation
inst_info (ID, child_name, phone)
• where an instructor may have more than one phone
and can have multiple children
• Instance of inst_info
How good is BCNF? (Cont.)
▪ There are no non-trivial functional dependencies and
therefore the relation is in BCNF
▪ Insertion anomalies – i.e., if we add a phone 981-992-3443 to
99999, we need to add two tuples
(99999, David, 981-992-3443)
(99999, William, 981-992-3443)
Higher Normal Forms
▪ It is better to decompose inst_info into:
• inst_child(ID, child_name)
• inst_phone(ID, phone)
▪ This suggests the need for higher normal forms, such
as Fourth Normal Form (4NF), which we shall see later
Functional-Dependency Theory
Functional-Dependency Theory Roadmap
▪ We now consider the formal theory that tells us which
functional dependencies are logically implied by a given set
of functional dependencies.
▪ We then develop algorithms to generate lossless
decompositions into BCNF and 3NF
▪ We then develop algorithms to test if a decomposition is
dependency-preserving
Closure of a Set of Functional Dependencies
▪ Given a set F of functional dependencies, there are
certain other functional dependencies that are logically
implied by F.
• If A → B and B → C, then we can infer that A → C
• etc.
▪ The set of all functional dependencies logically implied
by F is the closure of F.
▪ We denote the closure of F by F+.
Closure of a Set of Functional Dependencies
▪ We can compute F+, the closure of F, by repeatedly
applying Armstrong's Axioms:
• Reflexive rule: if β ⊆ α, then α → β
• Augmentation rule: if α → β, then γ α → γ β
• Transitivity rule: if α → β, and β → γ, then α → γ
▪ These rules are
• sound -- generate only functional dependencies
that actually hold, and
• complete -- generate all functional dependencies
that hold.
Example of F+
▪ R = (A, B, C, G, H, I)
F={A→B
A→C
CG → H
CG → I
B → H}
▪ Some members of F+
• A→H
▪ by transitivity from A → B and B → H
• AG → I
▪ by augmenting A → C with G, to get AG → CG
and then transitivity with CG → I
• CG → HI
▪ by augmenting CG → I to infer CG → CGI,
and augmenting of CG → H to infer CGI → HI,
and then transitivity
Closure of Functional Dependencies (Cont.)
▪ Additional rules:
• Union rule: If α → β holds and α → γ holds, then α → βγ holds.
• Decomposition rule: If α → βγ holds, then α → β holds and α → γ holds.
• Pseudotransitivity rule: If α → β holds and γβ → δ holds, then αγ → δ holds.
▪ The above rules can be inferred from Armstrong’s
axioms.
Procedure for Computing F+
▪ To compute the closure of a set of functional dependencies F:
F+ = F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F+
for each pair of functional dependencies f1 and f2 in F+
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F+
until F+ does not change any further
▪ NOTE: We shall see an alternative procedure for this task later
Closure of Attribute Sets
▪ Given a set of attributes α, define the closure of α under F
(denoted by α+) as the set of attributes that are
functionally determined by α under F
▪ Algorithm to compute α+, the closure of α under F
result := α;
while (changes to result) do
for each β → γ in F do
begin
if β ⊆ result then result := result ∪ γ
end
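A direct Python transcription of this algorithm (a sketch; representing each FD as a (left, right) pair of attribute strings is our own choice):

# Compute alpha+, the closure of the attribute set alpha under F.
def closure(alpha, F):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for beta, gamma in F:
            # if beta is contained in the result, add gamma to it
            if set(beta) <= result and not set(gamma) <= result:
                result |= set(gamma)
                changed = True
    return result

F = [('A', 'B'), ('A', 'C'), ('CG', 'H'), ('CG', 'I'), ('B', 'H')]
print(sorted(closure('AG', F)))   # ['A', 'B', 'C', 'G', 'H', 'I'] -> AG is a superkey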
Example of Attribute Set Closure
▪ R = (A, B, C, G, H, I)
▪ F = {A → B
A→C
CG → H
CG → I
B → H}
▪ (AG)+
1. result = AG
2. result = ABCG (A → C and A → B)
3. result = ABCGH (CG → H and CG ⊆ AGBC)
4. result = ABCGHI (CG → I and CG ⊆ AGBCH)
▪ Is AG a candidate key?
1. Is AG a superkey?
1. Does AG → R? == Is (AG)+ ⊇ R?
2. Is any subset of AG a superkey?
1. Does A → R? == Is (A)+ ⊇ R?
2. Does G → R? == Is (G)+ ⊇ R?
3. In general: check each subset of size n-1
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
▪ Testing for superkey:
• To test if α is a superkey, we compute α+, and check if
α+ contains all attributes of R.
▪ Testing functional dependencies
• To check if a functional dependency α → β holds (or, in
other words, is in F+), just check if β ⊆ α+.
• That is, we compute α+ by using attribute closure, and
then check if it contains β.
• This is a simple and cheap test, and very useful
▪ Computing closure of F
• For each γ ⊆ R, we find the closure γ+, and for each S
⊆ γ+, we output a functional dependency γ → S.
Canonical Cover
▪ Suppose that we have a set of functional dependencies F on a relation
schema. Whenever a user performs an update on the relation, the
database system must ensure that the update does not violate any
functional dependencies; that is, all the functional dependencies in F are
satisfied in the new database state.
▪ If an update violates any functional dependencies in the set F, the
system must roll back the update.
▪ We can reduce the effort spent in checking for violations by testing a
simplified set of functional dependencies that has the same closure as
the given set.
▪ This simplified set is termed the canonical cover
▪ To define canonical cover we must first define extraneous attributes.
• An attribute of a functional dependency in F is extraneous if we
can remove it without changing F +
Extraneous Attributes
▪ Removing an attribute from the left side of a functional
dependency could make it a stronger constraint.
• For example, if we have AB → C and remove B, we get the
possibly stronger result A → C. It may be stronger
because A → C logically implies AB → C, but AB → C does
not, on its own, logically imply A → C
▪ But, depending on what our set F of functional dependencies
happens to be, we may be able to remove B from AB → C
safely.
• For example, suppose that
• F = {AB → C, A → D, D → C}
• Then we can show that F logically implies A → C, making
B extraneous in AB → C.
Extraneous Attributes (Cont.)
▪ Removing an attribute from the right side of a functional
dependency could make it a weaker constraint.
• For example, if we have AB → CD and remove C, we get
the possibly weaker result AB → D. It may be weaker
because using just AB → D, we can no longer infer AB →
C.
▪ But, depending on what our set F of functional dependencies
happens to be, we may be able to remove C from AB → CD
safely.
• For example, suppose that
F = {AB → CD, A → C}
• Then we can show that even after replacing AB → CD by
AB → D, we can still infer AB → C and thus AB → CD.
Extraneous Attributes
▪ An attribute of a functional dependency in F is extraneous if we
can remove it without changing F +
▪ Consider a set F of functional dependencies and the functional
dependency α → β in F.
• Remove from the left side: Attribute A is extraneous in α if
▪ A ∈ α and
▪ F logically implies (F – {α → β}) ∪ {(α – A) → β}.
• Remove from the right side: Attribute A is extraneous in β if
▪ A ∈ β and
▪ The set of functional dependencies
(F – {α → β}) ∪ {α →(β – A)} logically implies F.
▪ Note: implication in the opposite direction is trivial in each of the
cases above, since a “stronger” functional dependency always
implies a weaker one
Testing if an Attribute is Extraneous
▪ Let R be a relation schema and let F be a set of
functional dependencies that hold on R . Consider an
attribute in the functional dependency α → β.
▪ To test if attribute A ∈ β is extraneous in β
• Consider the set:
F' = (F – {α → β}) ∪ {α → (β – A)}
• Check that α+ (computed under F') contains A; if it does, A is extraneous in β
▪ To test if attribute A ∈ α is extraneous in α
• Let γ = α – {A}. Check if γ → β can be inferred from F.
▪ Compute γ+ using the dependencies in F
▪ If γ+ includes all attributes in β then , A is extraneous
in α
Examples of Extraneous Attributes
▪ Let F = {AB → CD, A → E, E → C }
▪ To check if C is extraneous in AB → CD, we:
• Compute the attribute closure of AB under F' = {AB → D,
A → E, E → C}
• The closure is ABCDE, which includes CD
• This implies that C is extraneous
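The same check can be scripted with the attribute-closure routine shown earlier. A sketch (the FD encoding is our own):

# Is C extraneous in AB -> CD, given F = {AB -> CD, A -> E, E -> C}?
def closure(alpha, F):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

F_prime = [('AB', 'D'), ('A', 'E'), ('E', 'C')]   # F' = F with C dropped from AB -> CD
print('C' in closure('AB', F_prime))              # True -> C is extraneous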
Canonical Cover
▪ A canonical cover for F is a set of dependencies Fc such
that
• F logically implies all dependencies in Fc , and
• Fc logically implies all dependencies in F, and
• No functional dependency in Fc contains an extraneous
attribute, and
• Each left side of functional dependency in Fc is unique.
That is, there are no two dependencies in Fc
▪ α1 → β1 and α2 → β2 such that
▪ α1 = α2
Canonical Cover
▪ To compute a canonical cover for F:
Fc = F
repeat
Use the union rule to replace any dependencies in Fc of the
form
α1 → β1 and α1 → β2 with α1 → β1β2
Find a functional dependency α → β in Fc with an extraneous
attribute
either in α or in β
/* Note: the test for extraneous attributes is done using
Fc, not F */
If an extraneous attribute is found, delete it from α → β
until (Fc does not change)
▪ Note: Union rule may become applicable after some
extraneous attributes have been deleted, so it has to be
re-applied
Example: Computing a Canonical Cover
▪ R = (A, B, C)
F = {A → BC
B→C
A→B
AB → C}
▪ Combine A → BC and A → B into A → BC
• Set is now {A → BC, B → C, AB → C}
▪ A is extraneous in AB → C
• Check if the result of deleting A from AB → C is implied by the other
dependencies
▪ Yes: in fact, B → C is already present!
• Set is now {A → BC, B → C}
▪ C is extraneous in A → BC
• Check if A → C is logically implied by A → B and the other dependencies
▪ Yes: using transitivity on A → B and B → C.
• Can use attribute closure of A in more complex cases
▪ The canonical cover is: A→B
B→C
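The whole procedure can be sketched in Python (a sketch under our own representation choices: right-hand sides are first split into single attributes, extraneous left-side attributes are dropped, redundant FDs are removed, and the union rule is applied at the end):

def closure(alpha, F):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def canonical_cover(F):
    # 1. split right-hand sides into single attributes
    G = [(frozenset(l), frozenset({a})) for l, r in F for a in r]
    # 2. remove extraneous attributes from left sides
    changed = True
    while changed:
        changed = False
        for i, (l, r) in enumerate(G):
            for a in l:
                # test uses the current set G, with l -> r still present
                if r <= closure(l - {a}, G):
                    G[i] = (l - {a}, r)
                    changed = True
                    break
    # 3. remove dependencies implied by the remaining ones
    for fd in list(G):
        rest = [g for g in G if g != fd]
        if fd[1] <= closure(fd[0], rest):
            G = rest
    # 4. union rule: merge FDs with the same left side
    merged = {}
    for l, r in G:
        merged[l] = merged.get(l, frozenset()) | r
    return sorted((sorted(l), sorted(r)) for l, r in merged.items())

F = [('A', 'BC'), ('B', 'C'), ('A', 'B'), ('AB', 'C')]
print(canonical_cover(F))   # [(['A'], ['B']), (['B'], ['C'])]

Run on the example above, it prints [(['A'], ['B']), (['B'], ['C'])], matching the canonical cover derived by hand.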
Dependency Preservation
▪ Let Fi be the set of dependencies in F+ that include only
attributes in Ri.
• A decomposition is dependency preserving if
(F1 ∪ F2 ∪ … ∪ Fn)+ = F+
▪ Using the above definition, testing for dependency
preservation takes exponential time.
▪ Note that if a decomposition is NOT dependency preserving,
then checking updates for violation of functional
dependencies may require computing joins, which is
expensive.
Dependency Preservation (Cont.)
▪ Let F be the set of dependencies on schema R and let R1, R2
, .., Rn be a decomposition of R.
▪ The restriction of F to Ri is the set Fi of all functional
dependencies in F + that include only attributes of Ri .
▪ Since all functional dependencies in a restriction involve
attributes of only one relation schema, it is possible to test
such a dependency for satisfaction by checking only one
relation.
▪ Note that the definition of restriction uses all dependencies
in F+, not just those in F.
▪ The set of restrictions F1, F2 , .. , Fn is the set of functional
dependencies that can be checked efficiently.
Testing for Dependency Preservation
▪ To check if a dependency α → β is preserved in a
decomposition of R into R1, R2, …, Rn , we apply the following
test (with attribute closure done with respect to F)
• result = α
repeat
for each Ri in the decomposition
t = (result ∩ Ri)+ ∩ Ri
result = result ∪ t
until (result does not change)
• If result contains all attributes in β, then the functional
dependency α → β is preserved.
▪ We apply the test on all dependencies in F to check if a
decomposition is dependency preserving
▪ This procedure takes polynomial time, instead of the
exponential time required to compute F+ and (F1 ∪ F2 ∪ … ∪
Fn)+
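A direct transcription of this test (a sketch; giving schemas as attribute strings is our own encoding):

def closure(alpha, F):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Polynomial-time test: is alpha -> beta preserved by the decomposition?
def preserved(alpha, beta, F, decomposition):
    result = set(alpha)
    while True:
        before = set(result)
        for Ri in decomposition:
            result |= closure(result & set(Ri), F) & set(Ri)
        if result == before:
            break
    return set(beta) <= result

F = [('A', 'B'), ('B', 'C')]
decomposition = ['AB', 'BC']
print(all(preserved(a, b, F, decomposition) for a, b in F))   # True: F is preserved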
Example
▪ R = (A, B, C )
F = {A → B
B → C}
Key = {A}
▪ R is not in BCNF
▪ Decomposition R1 = (A, B), R2 = (B, C)
• R1 and R2 in BCNF
• Lossless-join decomposition
• Dependency preserving
Algorithm for Decomposition Using
Functional Dependencies
Testing for BCNF
▪ To check if a non-trivial dependency α →β causes a violation of
BCNF
1. compute α+ (the attribute closure of α), and
2. verify that it includes all attributes of R, that is, it is a superkey of R.
▪ Simplified test: To check if a relation schema R is in BCNF, it
suffices to check only the dependencies in the given set F for
violation of BCNF, rather than checking all dependencies in F+.
• If none of the dependencies in F causes a violation of BCNF, then none of the
dependencies in F+ will cause a violation of BCNF either.
▪ However, simplified test using only F is incorrect when testing a
relation in a decomposition of R
• Consider R = (A, B, C, D, E), with F = { A → B, BC → D}
▪ Decompose R into R1 = (A,B) and R2 = (A,C,D, E)
▪ Neither of the dependencies in F contains only attributes from
(A,C,D,E), so we might be misled into thinking R2 satisfies BCNF.
▪ In fact, dependency AC → D in F+ shows R2 is not in BCNF.
Testing Decomposition for BCNF
▪ To check if a relation Ri in a decomposition of R is in BCNF,
• Either test Ri for BCNF with respect to the restriction of F to Ri (that is, all FDs
in F+ that contain only attributes from Ri)
• or use the original set of dependencies F that hold on R, but with the following
test:
• for every set of attributes α ⊆ Ri, check that α+ (the attribute closure of
α) either includes no attribute of Ri- α, or includes all attributes of Ri.
▪ If the condition is violated by some α → β in F+, the dependency
α → (α+ - α) ∩ Ri
can be shown to hold on Ri, and Ri violates BCNF.
▪ We use above dependency to decompose Ri
BCNF Decomposition Algorithm
result := {R};
done := false;
compute F+;
while (not done) do
if (there is a schema Ri in result that is not in BCNF)
then begin
let α → β be a nontrivial functional dependency that
holds on Ri such that α → Ri is not in F+,
and α ∩ β = ∅;
result := (result – Ri) ∪ (Ri – β) ∪ (α, β);
end
else done := true;
Note: each Ri is in BCNF, and the decomposition is lossless-join.
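A compact Python sketch of this loop (simplified: violations are detected with the attribute-closure test described under "Testing Decomposition for BCNF" above, i.e., for every α ⊆ Ri, α+ must either include no attribute of Ri – α or include all of Ri; names and encodings are our own):

import itertools

def closure(alpha, F):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violation(Ri, F):
    # Return (alpha, beta) witnessing a BCNF violation on Ri, or None.
    for k in range(1, len(Ri)):
        for alpha in itertools.combinations(sorted(Ri), k):
            cl = closure(alpha, F)
            beta = (cl & Ri) - set(alpha)
            if beta and not Ri <= cl:      # nontrivial FD, alpha not a superkey
                return set(alpha), beta
    return None

def bcnf_decompose(R, F):
    result = [set(R)]
    while True:
        for Ri in result:
            violation = bcnf_violation(Ri, F)
            if violation:
                alpha, beta = violation
                result.remove(Ri)
                result += [alpha | beta, Ri - beta]
                break
        else:
            return result

F = [('A', 'B'), ('B', 'C')]
print(bcnf_decompose('ABC', F))   # [{'B', 'C'}, {'A', 'B'}] (element order may vary)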
Example of BCNF Decomposition
▪ class (course_id, title, dept_name, credits, sec_id, semester, year,
building, room_number, capacity, time_slot_id)
▪ Functional dependencies:
• course_id→ title, dept_name, credits
• building, room_number→capacity
• course_id, sec_id, semester, year→building, room_number,
time_slot_id
▪ A candidate key {course_id, sec_id, semester, year}.
▪ BCNF Decomposition:
• course_id→ title, dept_name, credits holds
▪ but course_id is not a superkey.
• We replace class by:
▪ course(course_id, title, dept_name, credits)
▪ class-1 (course_id, sec_id, semester, year, building,
room_number, capacity, time_slot_id)
BCNF Decomposition (Cont.)
▪ course is in BCNF
• How do we know this?
▪ building, room_number→capacity holds on class-1
• but {building, room_number} is not a superkey for class-1.
• We replace class-1 by:
▪ classroom (building, room_number, capacity)
▪ section (course_id, sec_id, semester, year, building,
room_number, time_slot_id)
▪ classroom and section are in BCNF.
Third Normal Form
▪ There are some situations where
• BCNF is not dependency preserving, and
• efficient checking for FD violation on updates is
important
▪ Solution: define a weaker normal form, called Third
Normal Form (3NF)
• Allows some redundancy (with resultant problems;
we will see examples later)
• But functional dependencies can be checked on
individual relations without computing a join.
• There is always a lossless-join,
dependency-preserving decomposition into 3NF.
3NF Example
▪ Relation dept_advisor:
• dept_advisor (s_ID, i_ID, dept_name)
F = {s_ID, dept_name → i_ID, i_ID → dept_name}
• Two candidate keys: s_ID, dept_name, and i_ID, s_ID
• R is in 3NF
▪ s_ID, dept_name → i_ID
• s_ID, dept_name is a superkey
▪ i_ID → dept_name
• dept_name is contained in a candidate key
Testing for 3NF
▪ Need to check only FDs in F, need not check all FDs in F+.
▪ Use attribute closure to check for each dependency α → β, if
α is a superkey.
▪ If α is not a superkey, we have to verify if each attribute in β
is contained in a candidate key of R
• This test is rather more expensive, since it involves finding
candidate keys
• Testing for 3NF has been shown to be NP-hard
• Interestingly, decomposition into third normal form
(described shortly) can be done in polynomial time
3NF Decomposition Algorithm
Let Fc be a canonical cover for F;
i := 0;
for each functional dependency α → β in Fc do
if none of the schemas Rj, 1 ≤ j ≤ i, contains αβ
then begin
i := i + 1;
Ri := αβ
end
if none of the schemas Rj, 1 ≤ j ≤ i contains a candidate key for R
then begin
i := i + 1;
Ri := any candidate key for R;
end
/* Optionally, remove redundant relations */
repeat
if any schema Rj is contained in another schema Rk
then /* delete Rj */
Rj := Ri;
i := i - 1;
return (R1, R2, ..., Ri)
3NF Decomposition Algorithm (Cont.)
▪ Above algorithm ensures:
• each relation schema Ri is in 3NF
• decomposition is dependency preserving and
lossless-join
• Proof of correctness is at the end of this presentation
3NF Decomposition: An Example
▪ Relation schema:
cust_banker_branch = (customer_id, employee_id, branch_name, type )
▪ The functional dependencies for this relation schema are:
• customer_id, employee_id → branch_name, type
• employee_id → branch_name
• customer_id, branch_name → employee_id
▪ We first compute a canonical cover
• branch_name is extraneous in the r.h.s. of the 1st dependency
• No other attribute is extraneous, so we get FC =
customer_id, employee_id → type
employee_id → branch_name
customer_id, branch_name → employee_id
3NF Decomposition Example (Cont.)
▪ The for loop generates the following 3NF schemas:
(customer_id, employee_id, type )
(employee_id, branch_name)
(customer_id, branch_name, employee_id)
• Observe that (customer_id, employee_id, type ) contains a
candidate key of the original schema, so no further
relation schema needs be added
▪ At end of for loop, detect and delete schemas, such as
(employee_id, branch_name), which are subsets of other
schemas
• result will not depend on the order in which FDs are
considered
▪ The resultant simplified 3NF schema is:
(customer_id, employee_id, type)
(customer_id, branch_name, employee_id)
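Assuming Fc and a candidate key are already available (both can be computed as sketched earlier), the synthesis loop itself is short. A Python sketch of the example above, abbreviating customer_id, employee_id, branch_name, type as c, e, b, t (our own shorthand):

def three_nf_synthesize(Fc, candidate_key):
    schemas = []
    # one schema per FD in Fc, unless already covered
    for alpha, beta in Fc:
        ab = set(alpha) | set(beta)
        if not any(ab <= s for s in schemas):
            schemas.append(ab)
    # make sure some schema contains a candidate key of R
    if not any(set(candidate_key) <= s for s in schemas):
        schemas.append(set(candidate_key))
    # optional cleanup: drop any schema strictly contained in another
    return [s for s in schemas if not any(s < t for t in schemas)]

Fc = [('ce', 't'), ('e', 'b'), ('cb', 'e')]
print(three_nf_synthesize(Fc, 'ce'))
# -> [{'c', 'e', 't'}, {'c', 'b', 'e'}] (element order may vary);
#    the (e, b) schema is dropped because it is contained in {c, b, e}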
Comparison of BCNF and 3NF
▪ It is always possible to decompose a relation into a set of
relations that are in 3NF such that:
• The decomposition is lossless
• The dependencies are preserved
▪ It is always possible to decompose a relation into a set of
relations that are in BCNF such that:
• The decomposition is lossless
• It may not be possible to preserve dependencies.
Design Goals
▪ Goal for a relational database design is:
• BCNF.
• Lossless join.
• Dependency preservation.
▪ If we cannot achieve this, we accept one of
• Lack of dependency preservation
• Redundancy due to use of 3NF
▪ Interestingly, SQL does not provide a direct way of specifying
functional dependencies other than superkeys.
We can specify FDs using assertions, but they are expensive to test (and
currently not supported by any of the widely used databases!)
▪ Even if we had a dependency preserving decomposition, using SQL
we would not be able to efficiently test a functional dependency
whose left hand side is not a key.
Multivalued Dependencies
Multivalued Dependencies (MVDs)
▪ Suppose we record names of children, and phone numbers
for instructors:
• inst_child(ID, child_name)
• inst_phone(ID, phone_number)
▪ If we were to combine these schemas to get
• inst_info(ID, child_name, phone_number)
• Example data:
(99999, David, 512-555-1234)
(99999, David, 512-555-4321)
(99999, William, 512-555-1234)
(99999, William, 512-555-4321)
▪ This relation is in BCNF
• Why?
Multivalued Dependencies
▪ Let R be a relation schema and let α ⊆ R and β ⊆ R. The
multivalued dependency
α →→ β
holds on R if, in any legal relation r(R), for all pairs of
tuples t1 and t2 in r such that t1[α] = t2[α], there exist
tuples t3 and t4 in r such that:
t1[α] = t2 [α] = t3 [α] = t4 [α]
t3[β] = t1 [β]
t3[R – β] = t2[R – β]
t4 [β] = t2[β]
t4[R – β] = t1[R – β]
MVD -- Tabular representation
▪ Tabular representation of α →→ β
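With illustrative values (all four tuples share the same α-value a, and b1, b2, c1, c2 stand for arbitrary values), the pattern the definition requires is:

α β R – α – β
t1: a b1 c1
t2: a b2 c2
t3: a b1 c2
t4: a b2 c1

That is, whenever t1 and t2 appear in r, tuples t3 and t4 must also appear.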
MVD (Cont.)
▪ Let R be a relation schema with a set of attributes that are
partitioned into 3 nonempty subsets.
Y, Z, W
▪ We say that Y →→ Z (Y multidetermines Z )
if and only if for all possible relations r (R )
< y1, z1, w1 > ∈ r and < y1, z2, w2 > ∈ r
then
< y1, z1, w2 > ∈ r and < y1, z2, w1 > ∈ r
▪ Note that since the behavior of Z and W is symmetric, it
follows that
Y →→ Z if and only if Y →→ W
Example
▪ In our example:
ID →→ child_name
ID →→ phone_number
▪ The above formal definition is supposed to formalize the
notion that given a particular value of Y (ID) it has
associated with it a set of values of Z (child_name) and a set
of values of W (phone_number), and these two sets are in
some sense independent of each other.
▪ Note:
• If Y → Z then Y →→ Z
• Indeed we have (in above notation) Z1 = Z2
The claim follows.
Use of Multivalued Dependencies
▪ We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal
under a given set of functional and multivalued
dependencies
2. To specify constraints on the set of legal relations. We
shall concern ourselves only with relations that satisfy a
given set of functional and multivalued dependencies.
▪ If a relation r fails to satisfy a given multivalued dependency,
we can construct a relation r′ that does satisfy the
multivalued dependency by adding tuples to r.
Theory of MVDs
▪ From the definition of multivalued dependency, we can
derive the following rule:
• If α → β, then α →→ β
That is, every functional dependency is also a multivalued
dependency
▪ The closure D+ of D is the set of all functional and
multivalued dependencies logically implied by D.
• We can compute D+ from D, using the formal definitions
of functional dependencies and multivalued
dependencies.
• We can manage with such reasoning for very simple
multivalued dependencies, which seem to be most
common in practice
• For complex dependencies, it is better to reason about
sets of dependencies using a system of inference rules
(Appendix C).
Fourth Normal Form
▪ A relation schema R is in 4NF with respect to a set D of
functional and multivalued dependencies if for all multivalued
dependencies in D+ of the form α →→ β, where α ⊆ R and β
⊆ R, at least one of the following hold:
• α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
• α is a superkey for schema R
▪ If a relation is in 4NF it is in BCNF
Restriction of Multivalued Dependencies
▪ The restriction of D to Ri is the set Di consisting of
• All functional dependencies in D+ that include only
attributes of Ri
• All multivalued dependencies of the form
α →→ (β ∩ Ri)
where α ⊆ Ri and α →→ β is in D+
4NF Decomposition Algorithm
result: = {R};
done := false;
compute D+;
Let Di denote the restriction of D+ to Ri
while (not done)
if (there is a schema Ri in result that is not in 4NF) then
begin
let α →→ β be a nontrivial multivalued dependency that holds
on Ri such that α → Ri is not in Di, and α ∩ β = ∅;
result := (result - Ri) ∪ (Ri - β) ∪ (α, β);
end
else done:= true;
Note: each Ri is in 4NF, and decomposition is lossless-join
Example
▪ R =(A, B, C, G, H, I)
F ={ A →→ B
B →→ HI
CG →→ H }
▪ R is not in 4NF since A →→ B and A is not a superkey for R
▪ Decomposition
a) R1 = (A, B) (R1 is in 4NF)
b) R2 = (A, C, G, H, I) (R2 is not in 4NF, decompose into R3 and R4)
c) R3 = (C, G, H) (R3 is in 4NF)
d) R4 = (A, C, G, I) (R4 is not in 4NF, decompose into R5 and R6)
• A →→ B and B →→ HI imply A →→ HI (MVD transitivity), and
• hence A →→ I (MVD restriction to R4)
e) R5 = (A, I) (R5 is in 4NF)
f) R6 = (A, C, G) (R6 is in 4NF)
Additional issues
Further Normal Forms
▪ Join dependencies generalize multivalued dependencies
• lead to project-join normal form (PJNF) (also called
fifth normal form)
▪ A class of even more general constraints, leads to a
normal form called domain-key normal form.
▪ Problem with these generalized constraints: they are hard to
reason with, and no sound and complete set of inference
rules exists.
▪ Hence rarely used
Overall Database Design Process
▪ We have assumed schema R is given
• R could have been generated when converting E-R
diagram to a set of tables.
• R could have been a single relation containing all
attributes that are of interest (called universal
relation).
• Normalization breaks R into smaller relations.
• R could have been the result of some ad hoc design of
relations, which we then test/convert to normal form.
ER Model and Normalization
▪ When an E-R diagram is carefully designed, identifying all entities
correctly, the tables generated from the E-R diagram should not
need further normalization.
▪ However, in a real (imperfect) design, there can be functional
dependencies from non-key attributes of an entity to other
attributes of the entity
• Example: an employee entity with
▪ attributes
department_name and building,
▪ functional dependency
department_name→ building
▪ Good design would have made department an entity
▪ Functional dependencies from non-key attributes of a relationship
set are possible, but rare --- most relationships are binary
Denormalization for Performance
▪ May want to use non-normalized schema for performance
▪ For example, displaying prereqs along with course_id, and
title requires join of course with prereq
▪ Alternative 1: Use denormalized relation containing
attributes of course as well as prereq with all above attributes
• faster lookup
• extra space and extra execution time for updates
• extra coding work for programmer and possibility of
error in extra code
▪ Alternative 2: use a materialized view defined as
course ⋈ prereq
• Benefits and drawbacks same as above, except no extra
coding work for programmer and avoids possible errors
Other Design Issues
▪ Some aspects of database design are not caught by normalization
▪ Examples of bad database design, to be avoided:
Instead of earnings (company_id, year, amount ), use
• earnings_2004, earnings_2005, earnings_2006, etc., all on the
schema (company_id, earnings).
▪ The above are in BCNF, but make querying across years difficult
and need a new table each year
• company_year (company_id, earnings_2004, earnings_2005,
earnings_2006)
▪ Also in BCNF, but also makes querying across years difficult
and requires new attribute each year.
▪ Is an example of a crosstab, where values for one attribute
become column names
▪ Used in spreadsheets, and in data analysis tools
Modeling Temporal Data
▪ Temporal data have an associated time interval during which
the data are valid.
▪ A snapshot is the value of the data at a particular point in time
▪ Several proposals to extend ER model by adding valid time to
• attributes, e.g., address of an instructor at different points in
time
• entities, e.g., time duration when a student entity exists
• relationships, e.g., time during which an instructor was
associated with a student as an advisor.
▪ But no accepted standard
▪ Adding a temporal component results in functional
dependencies like
ID → street, city
not holding, because the address varies over time
▪ A temporal functional dependency X →τ Y holds on schema R if
the functional dependency X → Y holds on all snapshots for all
legal instances r(R).
Modeling Temporal Data (Cont.)
▪ In practice, database designers may add start and end time
attributes to relations
• E.g., course(course_id, course_title) is replaced by
course(course_id, course_title, start, end)
▪ Constraint: no two tuples can have overlapping valid
times
• Hard to enforce efficiently
▪ Foreign key references may be to current version of data, or
to data at a point in time
• E.g., student transcript should refer to course
information at the time the course was taken
End of Chapter 7
Proof of Correctness of 3NF
Decomposition Algorithm
Correctness of 3NF Decomposition Algorithm

▪ 3NF decomposition algorithm is dependency preserving (since


there is a relation for every FD in Fc)
▪ Decomposition is lossless
• A candidate key (C) is in one of the relations Ri in the
decomposition
• Closure of candidate key under Fc must contain all
attributes in R.
• Follow the steps of attribute closure algorithm to show
there is only one tuple in the join result for each tuple in Ri
Correctness of 3NF Decomposition Algorithm (Cont.)
▪ Claim: if a relation Ri is in the decomposition generated by the above
algorithm, then Ri satisfies 3NF.
▪ Proof:
• Let Ri be generated from the dependency α → β
• Let γ → B be any non-trivial functional dependency on Ri. (We need
only consider FDs whose right-hand side is a single attribute.)
• Now, B can be in either β or α but not in both. Consider each case
separately.
Correctness of 3NF Decomposition (Cont.)
▪ Case 1: If B in β:
• If γ is a superkey, the 2nd condition of 3NF is satisfied
• Otherwise α must contain some attribute not in γ
• Since γ → B is in F+, it must be derivable from Fc, by using
attribute closure on γ.
• The attribute closure computation cannot have used α → β: if it
had, α would be contained in the attribute closure of γ, which
is not possible, since we assumed γ is not a superkey.
• Now, using α → (β – {B}) and γ → B, we can derive α → B
(since γ ⊆ αβ, and B ∉ γ because γ → B is non-trivial)
• Then, B is extraneous in the right-hand side of α → β, which is
not possible since α → β is in Fc.
• Thus, if B is in β then γ must be a superkey, and the second
condition of 3NF must be satisfied.
Correctness of 3NF Decomposition (Cont.)
▪ Case 2: B is in α.
• Since α is a candidate key, the third alternative in the
definition of 3NF is trivially satisfied.
• In fact, we cannot show that γ is a superkey.
• This shows exactly why the third alternative is present in
the definition of 3NF.
Q.E.D.
First Normal Form
▪ Domain is atomic if its elements are considered to be
indivisible units
• Examples of non-atomic domains:
▪ Set of names, composite attributes
▪ Identification numbers like CS101 that can be broken
up into parts
▪ A relational schema R is in first normal form if the domains
of all attributes of R are atomic
▪ Non-atomic values complicate storage and encourage
redundant (repeated) storage of data
• Example: Set of accounts stored with each customer, and
set of owners stored with each account
• We assume all relations are in first normal form (and
revisit this in Chapter 22: Object Based Databases)
First Normal Form (Cont.)
▪ Atomicity is actually a property of how the elements of the
domain are used.
• Example: Strings would normally be considered
indivisible
• Suppose that students are given roll numbers which are
strings of the form CS0012 or EE1127
• If the first two characters are extracted to find the
department, the domain of roll numbers is not atomic.
• Doing so is a bad idea: leads to encoding of information
in application program rather than in the database.
Transactions
Transaction Concept
▪ A transaction is a unit of program execution that accesses and possibly
updates various data items.
OR, a transaction is the DBMS’s abstract view of a user program: a series of
reads/writes of database objects
▪ Users submit transactions, and can think of each transaction as executing by
itself
• The concurrency is achieved by the DBMS, which interleaves actions of the various
transactions
▪ E.g. transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
▪ Two main issues to deal with:
• Failures of various kinds, such as hardware failures and system crashes
• Concurrent execution of multiple transactions
Example of Fund Transfer
▪ Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
▪ Atomicity requirement
• If the transaction fails after step 3 and before step 6, money will be “lost”
leading to an inconsistent database state
▪ Failure could be due to software or hardware
• The system should ensure that updates of a partially executed transaction are
not reflected in the database
▪ Durability requirement — once the user has been notified that the transaction
has completed (i.e., the transfer of the $50 has taken place), the updates to the
database by the transaction must persist even if there are software or hardware
failures.
Example of Fund Transfer (Cont.)
▪ Consistency requirement in above example:
• The sum of A and B is unchanged by the execution of the transaction
▪ In general, consistency requirements include
• Explicitly specified integrity constraints such as primary keys and foreign keys
• Implicit integrity constraints
▪ e.g., sum of balances of all accounts, minus sum of loan amounts must
equal value of cash-in-hand
• A transaction must see a consistent database.
• During transaction execution the database may be temporarily inconsistent.
• When the transaction completes successfully the database must be consistent
▪ Erroneous transaction logic can lead to inconsistency
Example of Fund Transfer (Cont.)
▪ Isolation requirement — if between steps 3 and 6, another
transaction T2 is allowed to access the partially updated database,
it will see an inconsistent database (the sum A + B will be less than
it should be).
T1 T2
1. read(A)
2. A := A – 50
3. write(A)
read(A), read(B), print(A+B)
4. read(B)
5. B := B + 50
6. write(B)
▪ Isolation can be ensured trivially by running transactions serially
• That is, one after the other.
▪ However, executing multiple transactions concurrently has
significant benefits, as we will see later.
ACID Properties
A transaction is a unit of program execution that accesses and
possibly updates various data items. To preserve the integrity of data
the database system must ensure:
▪ Atomicity. Either all operations of the transaction are properly
reflected in the database or none are.
▪ Consistency. Execution of a transaction in isolation preserves the
consistency of the database.
▪ Isolation. Although multiple transactions may execute concurrently,
each transaction must be unaware of other concurrently executing
transactions. Intermediate transaction results must be hidden from
other concurrently executed transactions.
• That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj, finished execution
before Ti started, or Tj started execution after Ti finished.
▪ Durability. After a transaction completes successfully, the changes it
has made to the database persist, even if there are system failures.
Atomicity
▪ A transaction can
• Commit after completing its actions, or
• Abort because of
▪ Internal DBMS decision: restart
▪ System crash: power, disk failure, …
▪ Unexpected situation: unable to access disk, data value, …
▪ A transaction interrupted in the middle could leave the
database inconsistent
▪ DBMS needs to remove the effects of partial
transactions to ensure atomicity: either all a
transaction’s actions are performed or none
Atomicity cont.
▪ A DBMS ensures atomicity by undoing the actions of partial
transactions
▪ To enable this, the DBMS maintains a record, called a log, of all
writes to the database
▪ The component of a DBMS responsible for this is called the
recovery manager
Consistency
▪ Consistency refers to maintaining data integrity constraints.
▪ Database consistency is the property that every transaction
sees a consistent database instance. It follows from transaction
atomicity, isolation and transaction consistency.
▪ A consistent transaction will not violate integrity constraints
placed on the data by the database rules. Enforcing consistency
ensures that if a database enters into an illegal state (if a
violation of data integrity constraints occurs) the process will be
aborted and changes rolled back to their previous, legal state.
Isolation
▪ Guarantee that even though transactions may
be interleaved, the net effect is identical to
executing the transactions serially
▪ For example, if transactions T1 and T2 are
executed concurrently, the net effect is
equivalent to executing
• T1 followed by T2, or
• T2 followed by T1
▪ NOTE: The DBMS provides no guarantee of
effective order of execution
Durability
▪ DBMS uses the log to ensure durability
▪ If the system crashed before the changes made by a completed
transaction are written to disk, the log is used to remember and
restore these changes when the system is restarted
▪ Again, this is handled by the recovery manager
Transaction State
▪ Active – the initial state; the transaction stays in this state while it
is executing
▪ Partially committed – after the final statement has been
executed.
▪ Failed -- after the discovery that normal execution can no longer
proceed.
▪ Aborted – after the transaction has been rolled back and the
database restored to its state prior to the start of the transaction.
Two options after it has been aborted:
• restart the transaction
▪ can be done only if no internal logical error
• kill the transaction
▪ Committed – after successful completion.
Transaction State (Cont.)
States in Transaction
Concurrent Executions
▪ Multiple transactions are allowed to run concurrently in the
system. Advantages are:
• Increased processor and disk utilization, leading to better transaction throughput
▪ e.g., one transaction can be using the CPU while another is reading from or writing to the disk
• Reduced average response time for transactions: short transactions need not wait behind long ones.
▪ Concurrency control schemes – mechanisms to achieve isolation
• That is, to control the interaction among the concurrent transactions in order to prevent
them from destroying the consistency of the database
Schedules
▪ Schedule – a sequence of instructions that specify the
chronological order in which instructions of concurrent
transactions are executed
• A schedule for a set of transactions must consist of all instructions of those transactions
• Must preserve the order in which the instructions appear in each individual transaction.
▪ A transaction that successfully completes its execution will have a
commit instruction as the last statement
• By default a transaction is assumed to execute a commit instruction as its last step
▪ A transaction that fails to successfully complete its execution will
have an abort instruction as the last statement
Schedule 1
▪ Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance
from A to B.
▪ A serial schedule in which T1 is followed by T2 :
Schedule 2
▪ A serial schedule where T2 is followed by T1
Schedule 3
▪ Let T1 and T2 be the transactions defined previously. The following
schedule is not a serial schedule, but it is equivalent to Schedule 1.
In Schedules 1, 2 and 3, the sum A + B is preserved.
Schedule 4
▪ The following concurrent schedule does not preserve the value of
(A + B ).
Types of Schedule
a. Serial Schedule
▪ A serial schedule is a type of schedule where one transaction
is executed completely before another transaction starts. In
a serial schedule, when the first transaction completes its
cycle, the next transaction is executed.
▪ For example: Suppose there are two transactions T1 and T2
with some operations. If there is no interleaving of
operations, then there are the following two possible outcomes:
1. Execute all the operations of T1, followed by all the
operations of T2.
2. Execute all the operations of T2, followed by all the
operations of T1.
• In figure (a), Schedule A shows the serial schedule
where T1 is followed by T2.
• In figure (b), Schedule B shows the serial schedule
where T2 is followed by T1.
Example: Serial Schedule
b. Non-serial Schedule
• If interleaving of operations is allowed, then the
schedule is a non-serial schedule.
• It contains many possible orders in which the system
can execute the individual operations of the
transactions.
• In figures (c) and (d), Schedule C and Schedule D are
non-serial schedules; they have interleaving of
operations.
Example: Non-serial schedule
c. Serializable Schedule
• A serializable schedule is a schedule whose effect on any
consistent database instance is identical to that of some
complete serial schedule.
• The serializability of schedules is used to find non-serial
schedules that allow transactions to execute
concurrently without interfering with one another.
• It identifies which schedules are correct when executions
of the transaction have interleaving of their operations.
• A non-serial schedule will be serializable if its result is
equal to the result of its transactions executed serially.
Anomalies with interleaved execution
▪ Two actions on the same data object conflict if at least one of
them is a write
▪ We’ll now consider THREE ways in which a schedule involving two
consistency-preserving transactions can leave a consistent
database inconsistent
Problems associated with concurrency
● To make the system efficient and save time, it is necessary to execute
more than one transaction concurrently. But concurrency also leads to
several problems.
● In a database transaction, the two main operations are READ and
WRITE. These operations need to be managed carefully during
concurrent execution, because if they are performed in an
uncontrolled interleaved manner, the data may become inconsistent.
● These problems are commonly referred to as concurrency
problems in a database environment.
1. Lost Update Problem (W-W Conflict)
2. Temporary Update or Dirty Read Problem (W-R Conflict)
3. Unrepeatable Read Problem (R-W Conflict)
Temporary Update Problem (W-R Conflict)
Dirty Read Problem
Temporary update or dirty read problem occurs when one transaction updates an item and fails.
But the updated item is used by another transaction before the item is changed or reverted back to
its last value.
Consider two transactions TX and TY in the below diagram performing read/write operations on account A where the
available balance in account A is $300:
1. At time t1, transaction TX reads the value of account A,
i.e., $300.
2. At time t2, transaction TX adds $50 to account A that
becomes $350.
3. At time t3, transaction TX writes the updated value in
account A, i.e., $350.
4. Then at time t4, transaction TY reads account A that will
be read as $350.
5. Then at time t5, transaction TX rolls back due to a server
problem, and the value changes back to $300 (as
initially).
6. But the value for account A remains $350 for transaction
TY, which has already used it: this is the dirty read.
In the above transaction instance, if TX fails for some reason, then A will revert back to its previous value.
But transaction TY has already read the incorrect value of A.
Lost Update Problem (W - W Conflict)
The problem occurs when two different database transactions perform read/write
operations on the same database items in an interleaved manner (i.e., concurrent
execution), which makes the values of the items incorrect, hence making the database
inconsistent.
Consider the below schedule where two transactions TX and TY are performed on the same account A,
where the balance of account A is $300.
1. At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
2. At time t2, transaction TX deducts $50 from account A, which becomes $250 (only deducted, not yet written).
3. Alternately, at time t3, transaction TY reads the value of account A, which will still be $300, because TX didn't write the value yet.
4. At time t4, transaction TY adds $100 to account A, which becomes $400 (only added, not yet written).
5. At time t6, transaction TX writes the value of account A, which will be updated as $250, as TY didn't write the value yet.
6. Similarly, at time t7, transaction TY writes the value of account A as computed at time t4, i.e., $400. It means the value written by TX is lost: $250 is lost.
Hence the data becomes incorrect, and the database is left in an inconsistent state.
Unrepeatable Read Problem (R - W Conflict)
▪ Also known as the Inconsistent Retrievals Problem, which occurs when, within a
transaction, two different values are read for the same database item.
Consider two transactions, TX and TY, performing read/write operations on account A, having an
available balance = $300. The diagram is shown below:
1. At time t1, transaction TX reads the value from account A, i.e., $300.
2. At time t2, transaction TY reads the value from account A, i.e., $300.
3. At time t3, transaction TY updates the value of account A by adding $100 to the available balance, which then becomes $400.
4. At time t4, transaction TY writes the updated value, i.e., $400.
5. After that, at time t5, transaction TX reads the available value of account A, which will be read as $400.
6. It means that within the same transaction TX, two different values of account A are read: $300 initially and, after the update
made by transaction TY, $400. This is an unrepeatable read and is therefore known as the Unrepeatable Read problem.
Thus, in order to maintain consistency in the database and avoid such problems in
concurrent execution, management is needed, and that is where the concept of Concurrency Control
comes into play.
Aborting
▪ If a transaction Ti is aborted, then all of its actions
must be undone
• Also, if Tj reads an object last written by Ti, then Tj must
be aborted!
▪ In order to undo changes, the DBMS maintains
a log which records every write
The log
▪ The following facts are recorded in the log
• "Ti writes an object": store the new and old values
• "Ti commits/aborts": store just a record
▪ Log records are chained together by transaction id, so it's easy to
undo a specific transaction
▪ The log is often duplexed and archived on stable storage (it's
important!)
Connection to Normalization
▪ The more redundancy in a database, the more
locking is required for (update) transactions.
• Extreme case: so much redundancy that all update
transactions are forced to execute serially.
▪ In general, less redundancy allows for greater
concurrency and greater transaction throughput.
Serializability
Consider a set of transactions (T1, T2, ..., Ti). S1 is the state of database
after they are concurrently executed and successfully completed and S2 is
the state of database after they are executed in any serial manner
(one-by-one) and successfully completed. If S1 and S2 are same then the
database maintains serializability.
Simply said, a non-serial schedule is referred to as a serializable
schedule if it yields the same results as a serial schedule.
▪ Basic Assumption – Each transaction preserves database consistency.
▪ Thus, serial execution of a set of transactions preserves database
consistency.
▪ A (possibly concurrent) schedule is serializable if it is equivalent to a
serial schedule.
Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
Simplified view of transactions
▪ We ignore operations other than read and write instructions
▪ We assume that transactions may perform arbitrary computations
on data in local buffers in between reads and writes.
▪ Our simplified schedules consist of only read and write
instructions.
Conflicting Instructions
▪ Instructions li and lj of transactions Ti and Tj respectively, conflict if
and only if there exists some item Q accessed by both li and lj, and
at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don’t conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict.
4. li = write(Q), lj = write(Q). They conflict.
Conflict Serializability
▪ If a schedule S can be transformed into a schedule S’ by a series of
swaps of non-conflicting instructions, we say that S and S’ are
conflict equivalent.
▪ We say that a schedule S is conflict serializable if it is conflict
equivalent to a serial schedule.
Conflict Serializability (Cont.)
▪ Schedule 3 can be transformed into Schedule 6, a serial schedule
where T2 follows T1, by series of swaps of non-conflicting
instructions. Therefore Schedule 3 is conflict serializable.
Schedule 3 Schedule 6
Conflict Serializability (Cont.)
▪ Example of a schedule that is not conflict serializable:
▪ We are unable to swap instructions in the above schedule to
obtain either the serial schedule <T3, T4> or the serial
schedule <T4, T3>.
View Serializability
▪ Let S and S’ be two schedules with the same set of transactions. S
and S’ are view equivalent if the following three conditions are
met, for each data item Q,
1. If in schedule S, transaction Ti reads the initial value of Q, then in schedule S’ also
transaction Ti must read the initial value of Q.
2. If in schedule S transaction Ti executes read(Q), and that value was produced by
transaction Tj (if any), then in schedule S’ also transaction Ti must read the value of Q that
was produced by the same write(Q) operation of transaction Tj .
3. The transaction (if any) that performs the final write(Q) operation in schedule S must also
perform the final write(Q) operation in schedule S’.
▪ As can be seen, view equivalence is also based purely on reads
and writes alone.
View Serializability (Cont.)
▪ A schedule S is view serializable if it is view equivalent to a serial
schedule.
▪ Every conflict serializable schedule is also view serializable.
▪ Below is a schedule which is view-serializable but not conflict
serializable.
▪ Every view serializable schedule that is not conflict serializable has
blind writes.
Testing for Serializability
▪ Consider some schedule of a set of transactions T1, T2, ..., Tn
▪ Precedence graph — a directed graph where the vertices are the
transactions (names).
▪ We draw an arc from Ti to Tj if the two transactions conflict, and Ti
accessed the data item on which the conflict arose earlier.
▪ We may label the arc by the item that was accessed.
▪ Example 1
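A schedule is conflict serializable if and only if its precedence graph is acyclic, so the test is mechanical. A Python sketch (the triple encoding of a schedule is our own; the sample is a schedule in the spirit of the earlier T3/T4 example):

# Build the precedence graph of a schedule and test it for cycles.
# A schedule is encoded as a list of (transaction, operation, item) triples.
def precedence_edges(schedule):
    edges = set()
    for i, (ti, op1, q1) in enumerate(schedule):
        for tj, op2, q2 in schedule[i + 1:]:
            if q1 == q2 and ti != tj and 'write' in (op1, op2):
                edges.add((ti, tj))   # Ti accessed the conflicting item first
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, on_stack = set(), set()
    def dfs(u):
        visited.add(u)
        on_stack.add(u)
        for v in graph.get(u, ()):
            if v in on_stack or (v not in visited and dfs(v)):
                return True
        on_stack.discard(u)
        return False
    return any(dfs(u) for u in list(graph) if u not in visited)

s = [('T3', 'read', 'Q'), ('T4', 'write', 'Q'), ('T3', 'write', 'Q')]
e = precedence_edges(s)
print(sorted(e))        # [('T3', 'T4'), ('T4', 'T3')]
print(has_cycle(e))     # True -> the schedule is not conflict serializable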
Recoverable Schedules
Need to address the effect of transaction failures on concurrently
running transactions.
▪ Recoverable schedule — if a transaction Tj reads a data item
previously written by a transaction Ti, then the commit operation
of Ti appears before the commit operation of Tj.
▪ The following schedule (Schedule 11) is not recoverable
▪ If T8 should abort, T9 would have read (and possibly shown to the
user) an inconsistent database state. Hence, the database must
ensure that schedules are recoverable.
Cascading Rollbacks
▪ Cascading rollback – a single transaction failure leads to a series
of transaction rollbacks. Consider the following schedule where
none of the transactions has yet committed (so the schedule is
recoverable)
If T10 fails, T11 and T12 must also be rolled back.
▪ Can lead to the undoing of a significant amount of work
Cascadeless Schedules
▪ Cascadeless schedules — cascading rollbacks cannot occur;
• For each pair of transactions Ti and Tj such that Tj reads a data item previously written by
Ti, the commit operation of Ti appears before the read operation of Tj.
▪ Every cascadeless schedule is also recoverable
▪ It is desirable to restrict the schedules to those that are
cascadeless
Concurrency Control
▪ A database must provide a mechanism that will ensure that all
possible schedules are
• either conflict or view serializable, and
• recoverable and preferably cascadeless
▪ A policy in which only one transaction can execute at a time
generates serial schedules, but provides a poor degree of
concurrency
• Are serial schedules recoverable/cascadeless?
▪ Testing a schedule for serializability after it has executed is a little
too late!
▪ Goal – to develop concurrency control protocols that will assure
serializability.
Concurrency Control (Cont.)
▪ Schedules must be conflict or view serializable, and recoverable,
for the sake of database consistency, and preferably cascadeless.
▪ A policy in which only one transaction can execute at a time
generates serial schedules, but provides a poor degree of
concurrency.
▪ Concurrency-control schemes tradeoff between the amount of
concurrency they allow and the amount of overhead that they
incur.
▪ Some schemes allow only conflict-serializable schedules to be
generated, while others allow view-serializable schedules that are
not conflict-serializable.
Concurrency Control vs. Serializability Tests
▪ Concurrency-control protocols allow concurrent schedules, but ensure that the schedules are conflict/view serializable, and are recoverable and cascadeless.
▪ Concurrency control protocols (generally) do not examine the precedence graph as it is being created
• Instead a protocol imposes a discipline that avoids non-serializable schedules.
▪ Different concurrency control protocols provide different tradeoffs between the amount of concurrency they allow and the amount of overhead that they incur.
▪ Tests for serializability help us understand why a concurrency
control protocol is correct.
Thank you !
Concurrency Control
Problems associated with concurrency
● The lost update problem
● The uncommitted dependency problem
● The inconsistent analysis problem
Concurrency control techniques
▪ Lock-based protocols
▪ Timestamp-Based Protocols
▪ Validation-Based Protocols
Lock-Based Protocols
▪ A lock is a mechanism to control concurrent access to a data item
▪ Data items can be locked in two modes :
1. exclusive (X) mode. Data item can be both read as well as
written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is
requested using lock-S instruction.
▪ To access a data item Q, a transaction must first lock that item by executing
the lock-S(Q) or lock-X(Q) instruction.
▪ Lock requests are made to concurrency-control manager. Transaction can
proceed only after request is granted.
▪ A transaction can unlock a data item Q by the unlock(Q) instruction.
Lock-Based Protocols (Cont.)
▪ Example of a transaction performing locking:
T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)
▪ Locking as above is not sufficient to guarantee serializability
Lock-Based Protocols (Cont.)
▪ Lock-compatibility matrix

        S      X
S       true   false
X       false  false

▪ A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions. Otherwise the transaction requesting a lock is made to wait until all incompatible locks held by other transactions have been released.
▪ Any number of transactions can hold shared locks on an item,
▪ But if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
Locking Protocols
▪ WR, RW and WW anomalies can be avoided using a locking protocol
▪ A locking protocol:
▪ Is a set of rules to be followed by each transaction to ensure that only serializable schedules are allowed (extended later)
▪ Associates a lock with each database object, which could be of different types (e.g., shared or exclusive)
▪ Grants and denies locks to transactions according to the specified rules
▪ The part of the DBMS that keeps track of locks is called the lock manager
Schedule With Lock Grants
▪ Grants omitted in rest of chapter
• Assume grant happens just before the next instruction following the lock request
▪ A locking protocol is a set of rules followed by all transactions while requesting and releasing locks.
▪ Locking protocols enforce serializability by restricting the set of possible schedules.
Locking Protocols
▪ Given a locking protocol (such as 2PL)
• A schedule S is legal under a locking protocol if it can be generated by a
set of transactions that follow the protocol
• A protocol ensures serializability if all legal schedules under that
protocol are serializable
The Two-Phase Locking Protocol (Cont.)
▪ Two-phase locking does not ensure freedom from deadlocks
▪ Extensions to basic two-phase locking are needed to ensure recoverability and freedom from cascading roll-back
• Strict two-phase locking: a transaction must hold all its exclusive locks till it commits/aborts.
▪ Ensures recoverability and avoids cascading roll-backs
• Rigorous two-phase locking: a transaction must hold all locks till commit/abort.
▪ Transactions can be serialized in the order in which they commit.
▪ Most databases implement rigorous two-phase locking, but refer to it as simply two-phase locking
The Two-Phase Locking Protocol (Cont.)
▪ Two-phase locking is not a necessary condition for serializability
• There are conflict serializable schedules that cannot be obtained if the two-phase locking protocol is used.
▪ In the absence of extra information (e.g., ordering of access to data), two-phase locking is necessary for conflict serializability in the following sense:
• Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Lock Managers
▪ Usually, a lock manager in a DBMS maintains three types of data structures:
▪ A queue, Q, for each lock, L, to hold its pending requests
▪ A lock table, which keeps for each lock L, associated with each object O, a record R that contains:
▪ The type of L (e.g., shared or exclusive)
▪ The number of transactions currently holding L on O
▪ A pointer to Q
▪ A transaction table, which maintains for each transaction, T, a pointer to a list of locks held by T
(Figure: a transaction-table entry T1 pointing to lock list LS1, and a lock-table record for object O: lock type S, one holding transaction, queue Q1.)
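A minimal Python sketch of this bookkeeping follows; the names, structure, and simplifications (no release or upgrade handling) are illustrative assumptions, not a real DBMS implementation.

# Lock-manager bookkeeping: a lock table keyed by object (lock type,
# current holders, queue of waiters) and a transaction table mapping
# each transaction to the objects it has locked.
from collections import deque

# True iff a requested mode is compatible with the mode already held.
COMPATIBLE = {('S', 'S'): True, ('S', 'X'): False,
              ('X', 'S'): False, ('X', 'X'): False}

class LockManager:
    def __init__(self):
        self.lock_table = {}   # object -> {'mode', 'holders', 'queue'}
        self.txn_table = {}    # transaction -> set of objects it holds locks on

    def request(self, txn, obj, mode):
        """Grant the lock if compatible with current holders (and nobody is
        queued ahead, which keeps the queue FIFO and avoids starvation);
        otherwise queue the request. Returns True if granted."""
        rec = self.lock_table.setdefault(
            obj, {'mode': None, 'holders': set(), 'queue': deque()})
        if not rec['holders'] or (
                not rec['queue'] and COMPATIBLE[(mode, rec['mode'])]):
            rec['mode'] = mode
            rec['holders'].add(txn)
            self.txn_table.setdefault(txn, set()).add(obj)
            return True
        rec['queue'].append((txn, mode))     # the pending-requests queue Q
        return False

lm = LockManager()
print(lm.request('T1', 'O', 'S'))   # True: first shared lock granted
print(lm.request('T2', 'O', 'S'))   # True: shared locks are compatible
print(lm.request('T3', 'O', 'X'))   # False: exclusive request is queued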
Two-Phase Locking
▪ A widely used locking protocol, called Two-Phase Locking (2PL), has two rules:
▪ Rule 1: if a transaction T wants to read (or write) an object O, it first requests the lock manager for a shared (or exclusive) lock on O
(Figures: timelines of transactions T0, T1, T2 and the lock manager's queue for an object O. Shared read locks are granted to T0 and T1; T2's exclusive write request is denied and queued; once T0 and T1 release their locks, T2 is granted the exclusive lock; later read requests by T0 and T1 are denied and queued until T2 releases the lock.)
Two-Phase Locking
▪ A widely used locking protocol, called Two-Phase Locking (2PL), has two rules:
▪ Rule 2: T can release locks before it commits or aborts, and cannot request additional locks once it releases any lock
▪ Thus, every transaction has a “growing” phase in which it acquires locks, followed by a “shrinking” phase in which it releases locks
(Figure: the number of locks held over time rises through the growing phase and falls through the shrinking phase; acquiring a lock after a release is a violation of 2PL.)
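The two-phase discipline is easy to verify mechanically. A small, illustrative Python sketch:

# Minimal sketch: verify that one transaction's action sequence obeys 2PL,
# i.e. no lock is requested after the first unlock (growing then shrinking).
def obeys_2pl(actions):
    shrinking = False
    for act in actions:            # each action is 'lock' or 'unlock'
        if act == 'unlock':
            shrinking = True
        elif shrinking:            # a lock request in the shrinking phase
            return False
    return True

print(obeys_2pl(['lock', 'lock', 'unlock', 'unlock']))   # True
print(obeys_2pl(['lock', 'unlock', 'lock']))             # False: violates 2PL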
Automatic Acquisition of Locks
▪ A transaction Ti issues the standard read/write instruction, without explicit
locking calls.
▪ The operation read(D) is processed as:
if Ti has a lock on D
then
read(D)
else begin
if necessary wait until no other
transaction has a lock-X on D
grant Ti a lock-S on D;
read(D)
end
Automatic Acquisition of Locks (Cont.)
▪ write(D) is processed as:
if Ti has a lock-X on D
then
write(D)
else begin
if necessary wait until no other trans. has any lock on D,
if Ti has a lock-S on D
then
upgrade lock on D to lock-X
else
grant Ti a lock-X on D
write(D)
end;

▪ All locks are released after commit or abort
Exercise
Consider the following two transactions:
T34:
read(A);
read(B);
if A = 0 then B:=B+1;
write(B).

T35:
read(B);
read(A);
if B = 0 then A:=A+1;
write(A).
Add lock and unlock instructions to transactions T34 and T35, so that they observe the two-phase locking protocol.
Resolving RW Conflicts Using 2PL
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 reads A
▪ T2 reads A, decrements A and commits
▪ T1 tries to decrement A

▪ T1 and T2 can be represented by the following schedule (exposes the RW anomaly):

T1                T2
R(A)
                  R(A)
                  W(A)
                  Commit
W(A)
Commit

▪ With 2PL the RW conflict is resolved:

T1                T2
EXCLUSIVE(A)
R(A)
                  Lock(A) → wait
W(A)
Commit
                  EXCLUSIVE(A)
                  R(A)
                  W(A)
                  Commit
Resolving RW Conflicts Using 2PL
▪ The same schedule as above: 2PL resolves the RW conflict, but it can limit parallelism! T2 can do nothing until T1 releases its exclusive lock on A.
Resolving WW Conflicts Using 2PL
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 sets Mohammad’s Salary to $1000
▪ T2 sets Ahmad’s Salary to $2000
▪ T1 sets Ahmad’s Salary to $1000
▪ T2 sets Mohammad’s Salary to $2000

▪ T1 and T2 can be represented by the following schedule (exposes the WW anomaly, assuming MS and AS must be kept equal):

T1                T2
W(MS)
                  W(AS)
W(AS)
Commit
                  W(MS)
                  Commit

▪ With 2PL the WW conflict is resolved:

T1                T2
EXCLUSIVE(MS)
W(MS)
EXCLUSIVE(AS)
                  Lock(AS) → wait
W(AS)
Commit
                  EXCLUSIVE(AS)
                  W(AS)
                  EXCLUSIVE(MS)
                  W(MS)
                  Commit
Resolving WW Conflicts Using 2PL
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 sets Mohammad’s Salary to $1000
▪ T2 sets Ahmad’s Salary to $2000
▪ T1 sets Ahmad’s Salary to $1000
▪ T2 sets Mohammad’s Salary to $2000

▪ With 2PL, the same interleaving can instead lead to a deadlock:

T1                T2
EXCLUSIVE(MS)
W(MS)
                  EXCLUSIVE(AS)
                  W(AS)
Lock(AS) → wait
                  Lock(MS) → wait

▪ Neither transaction can proceed: a deadlock!
Resolving WR Conflicts
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 deducts $100 from account A
▪ T2 adds 6% interest to accounts A and B
▪ T1 credits $100 to account B

▪ T1 and T2 can be represented by the following schedule (exposes the WR anomaly):

T1                T2
R(A)
W(A)
                  R(A)
                  W(A)
                  R(B)
                  W(B)
                  Commit
R(B)
W(B)
Commit

▪ With 2PL the WR conflict is resolved:

T1                T2
EXCLUSIVE(A)
R(A)
W(A)
                  Lock(A) → wait
EXCLUSIVE(B)
R(B)
W(B)
Commit
                  EXCLUSIVE(A)
                  R(A)
                  W(A)
                  EXCLUSIVE(B)
                  R(B)
                  W(B)
                  Commit
Resolving WR Conflicts
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 deducts $100 from account A
▪ T2 adds 6% interest to accounts A and B
▪ T1 credits $100 to account B

▪ Under plain 2PL, however, T1 may release its lock on A before committing:

T1                T2
EXCLUSIVE(A)
R(A)
W(A)
EXCLUSIVE(B)
RELEASE(A)
                  EXCLUSIVE(A)
                  R(A)
                  W(A)
                  Lock(B) → wait
R(B)
W(B)
Commit
                  EXCLUSIVE(B)
                  R(B)
                  W(B)
                  Commit

▪ T2 reads the uncommitted value of A written by T1: the WR conflict is NOT resolved! How can we solve this?
Strict Two-Phase Locking
▪ WR conflicts (as well as RW & WW) can be solved by making 2PL stricter
▪ In particular, Rule 2 in 2PL can be modified as follows:
▪ Rule 2: locks of a transaction T can only be released after T completes (i.e., commits or aborts)
▪ This version of 2PL is called Strict Two-Phase Locking
Resolving WR Conflicts: Revisit
▪ The previous schedule exposed the WR anomaly because T1 executed RELEASE(A) before it committed; that early release is not allowed with strict 2PL.
Resolving WR Conflicts: Revisit
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 deducts $100 from account A
▪ T2 adds 6% interest to accounts A and B
▪ T1 credits $100 to account B

▪ With Strict 2PL the WR conflict is resolved:

T1                T2
EXCLUSIVE(A)
R(A)
W(A)
                  Lock(A) → wait
EXCLUSIVE(B)
R(B)
W(B)
Commit
                  EXCLUSIVE(A)
                  R(A)
                  W(A)
                  EXCLUSIVE(B)
                  R(B)
                  W(B)
                  Commit

▪ But, parallelism is limited even more!
2PL vs. Strict 2PL
▪ Two-Phase Locking (2PL):
▪ Limits concurrency
▪ May lead to deadlocks
▪ May have ‘dirty reads’
▪ Strict 2PL:
▪ Limits concurrency more (but, actions of different transactions can still be interleaved)
▪ May still lead to deadlocks
▪ Avoids ‘dirty reads’

A Schedule with Strict 2PL and Interleaved Actions:

T1                T2
SHARED(A)
R(A)
                  SHARED(A)
                  R(A)
EXCLUSIVE(B)
                  EXCLUSIVE(C)
R(B)
W(B)
Commit
                  R(C)
                  W(C)
                  Commit
Implementation of Locking
▪ A lock manager can be implemented as a separate process
▪ Transactions can send lock and unlock requests as messages
▪ The lock manager replies to a lock request by sending a lock grant message (or a message asking the transaction to roll back, in case of a deadlock)
• The requesting transaction waits until its request is
answered
▪ The lock manager maintains an in-memory data-structure called a lock table
to record granted locks and pending requests
Lock Table
▪ Dark rectangles indicate granted locks, light
colored ones indicate waiting requests
▪ Lock table also records the type of lock
granted or requested
▪ New request is added to the end of the
queue of requests for the data item, and
granted if it is compatible with all earlier
locks
▪ Unlock requests result in the request being
deleted, and later requests are checked to
see if they can now be granted
▪ If transaction aborts, all waiting or granted
requests of the transaction are deleted
• the lock manager may keep a list of locks held by each transaction, to implement this efficiently
Graph-Based Protocols
▪ Graph-based protocols are an alternative to two-phase locking
▪ Impose a partial ordering → on the set D = {d1, d2 ,..., dh} of all data items.
• If di → dj then any transaction accessing both di and dj must access di
before accessing dj.
• Implies that the set D may now be viewed as a directed acyclic graph,
called a database graph.
▪ The tree-protocol is a simple kind of graph protocol.
Tree Protocol

Tree protocol:
1. Only exclusive locks are allowed.
2. The first lock by Ti may be on any data item. Subsequently, a data Q can be
locked by Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently
be relocked by Ti
Graph-Based Protocols (Cont.)
▪ The tree protocol ensures conflict serializability as well as freedom from
deadlock.
▪ Unlocking may occur earlier in the tree-locking protocol than in the
two-phase locking protocol.
• Shorter waiting times, and increase in concurrency
• Protocol is deadlock-free, no rollbacks are required
▪ Drawbacks
• Protocol does not guarantee recoverability or cascade freedom
▪ Need to introduce commit dependencies to ensure
recoverability
• Transactions may have to lock data items that they do not access.
▪ increased locking overhead, and additional waiting time
▪ potential decrease in concurrency
▪ Schedules not possible under two-phase locking are possible under the tree
protocol, and vice versa.
Performance of Locking
▪ Locking comes with delays mainly from blocking
▪ Usually, the first few transactions are unlikely to conflict
▪ Throughput can rise in proportion to the number of active transactions
▪ As more transactions are executed concurrently, the likelihood of blocking increases
▪ Throughput will increase more slowly with the number of active transactions
▪ There comes a point when adding another active transaction will actually decrease throughput
▪ When the system thrashes!
Performance of Locking (Cont’d)
(Figure: throughput vs. number of active transactions; throughput rises, flattens, then falls once thrashing sets in.)
▪ If a database begins to thrash, the DBA should reduce the number of active transactions
▪ Empirically, thrashing is seen to occur when 30% of active transactions are blocked!
Schedules with Aborted Transactions
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 deducts $100 from account A
▪ T2 adds 6% interest to accounts A and B, and commits
▪ T1 is aborted

▪ T1 and T2 can be represented by the following schedule:

T1                T2
R(A)
W(A)
                  R(A)
                  W(A)
                  R(B)
                  W(B)
                  Commit
Abort

▪ T2 read a value for A that should never have been there! How can we deal with the situation, assuming T2 had not yet committed?
Schedules with Aborted Transactions (Cont.)
▪ In the schedule above, T2 read a value for A that should never have been there!
▪ We can cascade the abort of T1 by aborting T2 as well!
▪ This “cascading process” can be recursively applied to any transaction that read A written by T1
Schedules with Aborted Transactions (Cont.)
▪ How can we deal with the situation, assuming T2 had actually committed?
▪ The schedule is indeed unrecoverable!
Schedules with Aborted Transactions (Cont.)
▪ For a schedule to be recoverable, transactions should commit only after all transactions whose changes they read commit!
▪ Cascadeless schedules, in which transactions read only committed changes, additionally avoid cascading aborts!
Schedules with Aborted Transactions (Cont.)
▪ How can we ensure recoverable schedules? By using Strict 2PL!
Schedules with Aborted Transactions
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 deducts $100 from account A
▪ T2 adds 6% interest to accounts A and B, and commits
▪ T1 is aborted

▪ With Strict 2PL, T1 and T2 can be represented by the following schedule:

T1                T2
EXCLUSIVE(A)
R(A)
W(A)
                  Lock(A) → wait
Abort
UNDO(T1)
                  EXCLUSIVE(A)
                  R(A)
                  W(A)
                  EXCLUSIVE(B)
                  R(B)
                  W(B)
                  Commit

▪ Cascaded aborts are avoided!
Serializable Schedules: Redefined
▪ Two schedules are said to be equivalent if for any database state, the effect of executing the 1st schedule is identical to the effect of executing the 2nd schedule
▪ Previously: a serializable schedule is a schedule that is equivalent to a serial schedule
▪ Now: a serializable schedule is a schedule that is equivalent to a serial schedule over a set of committed transactions
▪ This definition captures serializability as well as recoverability
Deadlock
▪ Consider the partial schedule
▪ Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
▪ Such a situation is called a deadlock.
• To handle a deadlock one of T3 or T4 must be rolled back
and its locks released.
Deadlock (Cont.)
▪ The potential for deadlock exists in most locking protocols. Deadlocks are a
necessary evil (deadlocks are preferable to inconsistent states).
▪ Starvation is also possible if concurrency control manager is badly designed.
For example:
• A transaction may be waiting for an X-lock on an item, while a sequence
of other transactions request and are granted an S-lock on the same
item.
• The same transaction is repeatedly rolled back due to deadlocks.
▪ Concurrency control manager can be designed to prevent starvation.
Deadlock Handling
▪ System is deadlocked if there is a set of transactions such that every
transaction in the set is waiting for another transaction in the set.
Deadlock Detection
▪ Wait-for graph
• Vertices: transactions
• Edge from Ti →Tj. : if Ti is waiting for a lock held in conflicting mode byTj
▪ The system is in a deadlock state if and only if the wait-for graph has a cycle.
▪ Invoke a deadlock-detection algorithm periodically to look for cycles.
(Figures: a wait-for graph without a cycle, and a wait-for graph with a cycle.)
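The cycle test itself is a standard graph search. A small, illustrative Python sketch (the graph encoding is an assumption):

# Minimal sketch: deadlock detection on a wait-for graph given as a dict
# {transaction: set of transactions it is waiting for}.
def has_deadlock(wait_for):
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wait_for.get(t, ()):
            if u in on_stack or (u not in visited and dfs(u)):
                return True        # back edge found: the graph has a cycle
        on_stack.discard(t)
        return False

    return any(t not in visited and dfs(t) for t in wait_for)

print(has_deadlock({'T1': {'T2'}, 'T2': {'T3'}, 'T3': set()}))   # False
print(has_deadlock({'T1': {'T2'}, 'T2': {'T1'}}))                # True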
Exercise
Consider the following sequence of events in a
schedule involving transactions T1, T2, T3, T4
and T5.
A, B, C, D, E, F are items in the database.
Draw a wait-for-graph for the data above and
find whether the transactions are in a deadlock
or not?
Lock Conversions
▪ A transaction may need to change the lock it
already acquires on an object
▪ From Shared to Exclusive
▪ This is referred to as lock upgrade
▪ From Exclusive to Shared
▪ This is referred to as lock downgrade

▪ For example, an SQL update statement might acquire a Shared lock on each row, R, in a table and, if R satisfies the condition (in the WHERE clause), an Exclusive lock must be obtained for R
Lock Upgrades
▪ A lock upgrade request from a transaction T on object O
must be handled specially by:
▪ Granting an Exclusive lock to T immediately if no other
transaction holds a lock on O
▪ Otherwise, queuing T at the front of O’s queue
(i.e., T is favored)

▪ T is favored because it already holds a Shared lock on O
▪ Queuing T in front of another transaction T’ that holds no lock on O, but requested an Exclusive lock on O, averts a deadlock!
▪ However, if T and T’ hold a Shared lock on O, and both request upgrades to an Exclusive lock, a deadlock will arise regardless!
Lock Downgrades
▪ Lock upgrades can be entirely avoided by obtaining Exclusive locks initially, and downgrading them to Shared locks when appropriate
▪ Would this violate any 2PL requirement?
▪ On the surface yes; since the transaction (say, T) may need to upgrade later
▪ This is a special case, as T conservatively obtained an Exclusive lock, and did nothing but read the object that it downgraded
▪ 2PL can be safely extended to allow lock downgrades in the growing phase, provided that the transaction has not modified the object
Deadlock Detection
▪ The lock manager maintains a structure called a waits-for graph to periodically detect deadlocks
▪ In a waits-for graph:
▪ The nodes correspond to active transactions
▪ There is an edge from Ti to Tj if and only if Ti is waiting for Tj to release a lock
▪ The lock manager adds and removes edges to and from a waits-for graph when it queues and grants lock requests, respectively
▪ A deadlock is detected when a cycle in the waits-for graph is found
Deadlock Detection (Cont’d)
▪ The following schedule is free of deadlocks:
T1        T2        T3        T4
S(A)
R(A)
          X(B)
          W(B)
S(B)
          S(C)
          R(C)
                    X(C)
                              X(B)

A schedule without a deadlock. In the corresponding waits-for graph (nodes are the active transactions; there is an edge from Ti to Tj if and only if Ti is waiting for Tj to release a lock), T1, T3 and T4 all wait for T2: no cycles; hence, no deadlocks!
Deadlock Detection (Cont’d)
▪ The following schedule is NOT free of deadlocks:

T1        T2        T3        T4
S(A)
R(A)
          X(B)
          W(B)
S(B)
          S(C)
          R(C)
                    X(C)
                              X(B)
          X(A)

A schedule with a deadlock. T2's X(A) request waits for T1's shared lock on A, adding the edge T2 → T1 to the waits-for graph; together with the edge T1 → T2 (from T1's waiting S(B) request) this forms a cycle: a deadlock is detected!
Resolving Deadlocks
▪ A deadlock is resolved by aborting a transaction that is on a cycle and releasing its locks
▪ This allows some of the waiting transactions to proceed
▪ The choice of which transaction to abort can be made using different criteria:
▪ The one with the fewest locks
▪ Or the one that has done the least work
▪ Or the one that is farthest from completion (more accurate)
▪ Caveat: a transaction that was aborted in the past should be favored subsequently and not aborted upon a deadlock detection!
Deadlock Prevention
▪ Studies indicate that deadlocks are relatively infrequent and detection-based schemes work well in practice
▪ However, if there is a high level of contention for locks, prevention-based schemes could perform better
▪ Deadlocks can be averted by giving each transaction a priority and ensuring that lower-priority transactions are not allowed to wait for higher-priority ones (or vice versa)
Deadlock Handling
▪ Deadlock prevention protocols ensure that the system will never enter into
a deadlock state. Some prevention strategies:
• Require that each transaction locks all its data items before it begins
execution (pre-declaration).
• Impose partial ordering of all data items and require that a transaction
can lock data items only in the order specified by the partial order
(graph-based protocol).
Strategies
▪ wait-die scheme — non-preemptive
• Older transaction may wait for younger one to release data item.
• Younger transactions never wait for older ones; they are rolled back
instead.
• A transaction may die several times before acquiring a lock
▪ wound-wait scheme — preemptive
• Older transaction wounds (forces rollback) of younger transaction
instead of waiting for it.
• Younger transactions may wait for older ones.
• Fewer rollbacks than wait-die scheme.
▪ In both schemes, a rolled-back transaction is restarted with its original timestamp.
• Ensures that older transactions have precedence over newer ones, and
starvation is thus avoided.
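Both policies reduce to a single timestamp comparison between the requester and the lock holder. A small, illustrative Python sketch:

# Minimal sketch: deadlock-prevention decision when transaction `req`
# requests a lock held by `holder`; ts maps each transaction to its start
# timestamp (smaller timestamp = older transaction).
def wait_die(req, holder, ts):
    # Non-preemptive: only an older requester may wait; a younger one dies.
    return 'wait' if ts[req] < ts[holder] else 'roll back requester'

def wound_wait(req, holder, ts):
    # Preemptive: an older requester wounds (rolls back) the younger holder.
    return 'roll back holder' if ts[req] < ts[holder] else 'wait'

ts = {'T1': 10, 'T2': 20}                 # T1 is the older transaction
print(wait_die('T2', 'T1', ts))           # roll back requester
print(wound_wait('T2', 'T1', ts))         # wait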
Deadlock prevention (Cont.)
▪ Timeout-Based Schemes:
• A transaction waits for a lock only for a specified amount of time. After
that, the wait times out and the transaction is rolled back.
• Ensures that deadlocks get resolved by timeout if they occur
• Simple to implement
• But may roll back transaction unnecessarily in absence of deadlock
▪ difficult to determine good value of the timeout interval.
• Starvation is also possible
Deadlock Recovery
▪ When deadlock is detected :
• Some transaction will have to be rolled back (made a victim) to break
deadlock cycle.
▪ Select that transaction as victim that will incur minimum cost
• Rollback -- determine how far to roll back transaction
▪ Total rollback: Abort the transaction and then restart it.
▪ Partial rollback: Roll back victim transaction only as far as necessary
to release locks that another transaction in cycle is waiting for
▪ Starvation can happen (why?)
• One solution: oldest transaction in the deadlock set is never chosen as
victim
Multiple Granularity
▪ Allow data items to be of various sizes and define a hierarchy of data
granularities, where the small granularities are nested within larger ones
▪ Can be represented graphically as a tree (but don't confuse with tree-locking
protocol)
▪ When a transaction locks a node in the tree explicitly, it implicitly locks all
the node's descendents in the same mode.
▪ Granularity of locking (level in tree where locking is done):
• Fine granularity (lower in tree): high concurrency, high locking overhead
• Coarse granularity (higher in tree): low locking overhead, low
concurrency
Example of Granularity Hierarchy

The levels, starting from the coarsest (top) level, are:
• database
• area
• file
• record
Intention Lock Modes
▪ In addition to S and X lock modes, there are three additional lock modes
with multiple granularity:
• intention-shared (IS): indicates explicit locking at a lower level of the
tree but only with shared locks.
• intention-exclusive (IX): indicates explicit locking at a lower level with
exclusive or shared locks
• shared and intention-exclusive (SIX): the subtree rooted by that node is
locked explicitly in shared mode and explicit locking is being done at a
lower level with exclusive-mode locks.
▪ intention locks allow a higher level node to be locked in S or X mode without
having to check all descendent nodes.
Compatibility Matrix with Intention Lock Modes
▪ The compatibility matrix for all lock modes is:

          IS     IX     S      SIX    X
   IS     yes    yes    yes    yes    no
   IX     yes    yes    no     no     no
   S      yes    no     yes    no     no
   SIX    yes    no     no     no     no
   X      no     no     no     no     no
Multiple Granularity Locking Scheme
▪ Transaction Ti can lock a node Q, using the following rules:
1. The lock compatibility matrix must be observed.
2. The root of the tree must be locked first, and may
be locked in any mode.
3. A node Q can be locked by Ti in S or IS mode only if
the parent of Q is currently locked by Ti in either IX
or IS mode.
4. A node Q can be locked by Ti in X, SIX, or IX mode
only if the parent of Q is currently locked by Ti in
either IX or SIX mode.
5. Ti can lock a node only if it has not previously unlocked any node (that is, Ti is two-phase).
6. Ti can unlock a node Q only if none of the children of Q are currently locked by Ti.
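Rules 2 to 4 amount to a check of the mode held on a node's parent. A small, illustrative Python sketch:

# Minimal sketch: may Ti lock node q in `mode`, given the mode Ti currently
# holds on q's parent (None means q is the root)?
def can_lock(mode, parent_mode):
    if parent_mode is None:                  # rule 2: root may be locked in any mode
        return True
    if mode in ('S', 'IS'):                  # rule 3
        return parent_mode in ('IX', 'IS')
    if mode in ('X', 'SIX', 'IX'):           # rule 4
        return parent_mode in ('IX', 'SIX')
    return False

print(can_lock('S', 'IS'))    # True: parent held in IS allows S on a child
print(can_lock('X', 'IS'))    # False: X requires the parent in IX or SIX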
Deadlock Prevention (Cont’d)
▪ One way to assign priorities is to give each
transaction a timestamp when it is started
▪ Thus, the lower the timestamp, the higher is the
transaction’s priority

▪ If a transaction Ti requests a lock and a transaction Tj holds a conflicting lock, the lock manager can use one of the following policies:
▪ Wound-Wait: If Ti has higher priority, Tj is aborted;
otherwise, Ti waits
▪ Wait-Die: If Ti has higher priority, it is allowed to wait;
otherwise, it is aborted
Timestamp-Based Protocols
▪ Each transaction Ti is issued a timestamp TS(Ti) when it enters the system.
• Each transaction has a unique timestamp
• Newer transactions have timestamps strictly greater than earlier ones
• Two simple methods for implementing this scheme:
▪ Use the value of the system clock as the timestamp
▪ Use a logical counter that is incremented after a new timestamp has
been assigned
▪ Timestamp-based protocols manage concurrent execution such that
time-stamp order = serializability order
i.e., if TS(Ti ) < TS(Tj ), the produced schedule is equivalent to a serial schedule
in which Ti appears before Tj.
▪ Several alternative protocols based on timestamps
Timestamp-Ordering Protocol
The timestamp ordering (TSO) protocol
▪ Maintains for each data Q two timestamp values:
• W-timestamp(Q) is the largest time-stamp of any transaction that
executed write(Q) successfully.
• R-timestamp(Q) is the largest time-stamp of any transaction that
executed read(Q) successfully.
▪ These timestamps are updated whenever a new read(Q) or write(Q)
instruction is executed.
▪ Imposes rules on read and write operations to ensure that
• any conflicting operations are executed in timestamp order
• out of order operations cause transaction rollback
Timestamp-Based Protocols (Cont.)
▪ Suppose a transaction Ti issues a read(Q)
1. If TS(Ti) ≤ W-timestamp(Q), then Ti needs to read a value of Q that was
already overwritten.
▪ Hence, the read operation is rejected, and Ti is rolled back.
2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and
R-timestamp(Q) is set to
max(R-timestamp(Q), TS(Ti)).
Timestamp-Based Protocols (Cont.)
▪ Suppose that transaction Ti issues write(Q).
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was
needed previously, and the system assumed that that value would
never be produced.
Hence, the write operation is rejected, and Ti is rolled back.
2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete
value of Q.
Hence, this write operation is rejected, and Ti is rolled back.
3. Otherwise, the write operation is executed, and W-timestamp(Q) is set
to TS(Ti).
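The read and write rules translate directly into code. A small, illustrative Python sketch, where rts and wts hold R-timestamp(Q) and W-timestamp(Q) for each data item:

# Minimal sketch of the timestamp-ordering checks for read(Q) and write(Q).
def tso_read(ts_ti, q, rts, wts):
    if ts_ti < wts[q]:
        return 'rollback'                  # Q was already overwritten
    rts[q] = max(rts[q], ts_ti)
    return 'read'

def tso_write(ts_ti, q, rts, wts):
    if ts_ti < rts[q]:
        return 'rollback'                  # a later reader needed the old value
    if ts_ti < wts[q]:
        return 'rollback'                  # obsolete write (but see Thomas' rule)
    wts[q] = ts_ti
    return 'write'

rts, wts = {'Q': 0}, {'Q': 0}
print(tso_read(25, 'Q', rts, wts))    # 'read'  -> R-TS(Q) becomes 25
print(tso_write(26, 'Q', rts, wts))   # 'write' -> W-TS(Q) becomes 26
print(tso_read(25, 'Q', rts, wts))    # 'rollback': 25 < W-TS(Q) = 26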
Example of Schedule Under TSO
▪ Is this schedule valid under TSO? Assume that initially R-TS(A) = W-TS(A) = 0 and R-TS(B) = W-TS(B) = 0, and that TS(T25) = 25 and TS(T26) = 26.
▪ And how about this one, where initially R-TS(Q) = W-TS(Q) = 0?
Another Example Under TSO
A partial schedule for several data items for transactions with
timestamps 1, 2, 3, 4, 5, with all R-TS and W-TS = 0 initially
Correctness of Timestamp-Ordering Protocol
▪ The timestamp-ordering protocol guarantees serializability since all the arcs in the precedence graph are of the form: transaction with smaller timestamp → transaction with larger timestamp.
Thus, there will be no cycles in the precedence graph.
▪ The timestamp protocol ensures freedom from deadlock as no transaction ever waits.
▪ But the schedule may not be cascade-free, and may not even be recoverable.
Recoverability and Cascade Freedom

▪ Solution 1:
• A transaction is structured such that its writes are all performed at the
end of its processing
• All writes of a transaction form an atomic action; no transaction may
execute while a transaction is being written
• A transaction that aborts is restarted with a new timestamp
▪ Solution 2: Limited form of locking: wait for data to be committed before
reading it
▪ Solution 3: Use commit dependencies to ensure recoverability
Thomas’ Write Rule
▪ Modified version of the timestamp-ordering protocol in which obsolete
write operations may be ignored under certain circumstances.
▪ When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
• Rather than rolling back Ti as the timestamp ordering protocol would have done, this write operation can be ignored.
▪ Otherwise this protocol is the same as the timestamp ordering protocol.
▪ Thomas' Write Rule allows greater potential concurrency.
• Allows some view-serializable schedules that are not
conflict-serializable.
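In code, the rule changes only the obsolete-write branch of the TSO write check sketched earlier (again an illustrative sketch):

# Minimal sketch: Thomas' write rule -- identical to tso_write above, except
# that an obsolete write is silently skipped instead of rolling Ti back.
def thomas_write(ts_ti, q, rts, wts):
    if ts_ti < rts[q]:
        return 'rollback'
    if ts_ti < wts[q]:
        return 'ignore write'              # obsolete write: skip it, don't abort
    wts[q] = ts_ti
    return 'write'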
Validation-Based Protocol
▪ Idea: can we use commit time as serialization order?
▪ To do so:
• Postpone writes to end of transaction
• Keep track of data items read/written by
transaction
• Validation performed at commit time, detect any
out-of-serialization order reads/writes
▪ Also called optimistic concurrency control, since the transaction executes fully in the hope that all will go well during validation
Validation-Based Protocol
▪ Execution of transaction Ti is done in three phases.
1. Read and execution phase: Transaction Ti writes only to
temporary local variables
2. Validation phase: Transaction Ti performs a “validation test” to determine if its local variables can be written without violating serializability.
3. Write phase: If Ti is validated, the updates are applied to the
database; otherwise, Ti is rolled back.
▪ The three phases of concurrently executing transactions can be
interleaved, but each transaction must go through the three phases in that
order.
• We assume for simplicity that the validation and
write phase occur together, atomically and serially
▪ I.e., only one transaction executes validation/write at a
time.
Validation-Based Protocol (Cont.)
▪ Each transaction Ti has 3 timestamps
• StartTS(Ti) : the time when Ti started its execution
• ValidationTS(Ti): the time when Ti entered its
validation phase
• FinishTS(Ti) : the time when Ti finished its write
phase
▪ Validation tests use above timestamps and read/write sets to ensure that
serializability order is determined by validation time
• Thus, TS(Ti) = ValidationTS(Ti)
▪ Validation-based protocol has been found to give greater degree of
concurrency than locking/TSO if probability of conflicts is low.
Validation Test for Transaction Tj
▪ If for all Ti with TS (Ti) < TS (Tj) either one of the following condition holds:
• finishTS(Ti) < startTS(Tj)
• startTS(Tj) < finishTS(Ti) < validationTS(Tj) and the
set of data items written by Ti does not intersect
with the set of data items read by Tj.
then validation succeeds and Tj can be committed.
▪ Otherwise, validation fails and Tj is aborted.
▪ Justification:
• First condition applies when execution is not concurrent
▪ The writes of Tj do not affect reads of Ti since they occur after Ti has finished its reads.
• If the second condition holds, execution is concurrent, but the writes of Ti do not affect the reads of Tj (their sets do not intersect), and the writes of Tj begin only after Ti has finished its writes; so the serialization order Ti before Tj is preserved.
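A small, illustrative Python sketch of the test (the transaction records and field names are assumptions):

# Minimal sketch: validate Tj against every earlier transaction Ti,
# i.e. every Ti with TS(Ti) < TS(Tj). Each record carries startTS,
# validationTS, finishTS and read/write sets of item names.
def validate(tj, earlier):
    for ti in earlier:
        if ti['finishTS'] < tj['startTS']:
            continue                       # condition 1: effectively serial
        if (tj['startTS'] < ti['finishTS'] < tj['validationTS']
                and not (ti['write_set'] & tj['read_set'])):
            continue                       # condition 2: Ti's writes don't
                                           # touch anything Tj read
        return False                       # validation fails: abort Tj
    return True                            # Tj may commit

ti = {'startTS': 1, 'validationTS': 2, 'finishTS': 3,
      'read_set': {'A'}, 'write_set': {'A'}}
tj = {'startTS': 2, 'validationTS': 5, 'finishTS': 6,
      'read_set': {'B'}, 'write_set': {'B'}}
print(validate(tj, [ti]))   # True: Ti wrote only A, which Tj never read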
Schedule Produced by Validation
▪ Example of schedule produced using validation
Database Recovery
Techniques
Outline

▪ Failure Classification
▪ Storage Structure
▪ Recovery and Atomicity
▪ Log-Based Recovery
▪ Remote Backup Systems
Failure Classification

▪ Transaction failure :
• Logical errors: transaction cannot complete due to some internal error
condition
• System errors: the database system must terminate an active transaction
due to an error condition (e.g., deadlock)
▪ System crash: a power failure or other hardware or software
failure causes the system to crash.
• Fail-stop assumption: non-volatile storage contents are assumed to not
be corrupted by system crash
▪ Database systems have numerous integrity checks to prevent
corruption of disk data
▪ Disk failure: a head crash or similar disk failure destroys all or
part of disk storage
• Destruction is assumed to be detectable: disk drives use checksums to
detect failures
Recovery Manager
▪ The recovery manager of a DBMS is responsible for ensuring two important properties of transactions:
• atomicity by undoing the actions of transactions that do not commit
• durability by making sure that all actions of committed transactions survive system crashes and media failures.
Recovery Algorithms
▪ Consider transaction Ti that transfers $50 from account A
to account B
• Two updates: subtract 50 from A and add 50 to B
▪ Transaction Ti requires updates to A and B to be output to
the database.
• A failure may occur after one of these modifications have been made but
before both of them are made.
• Modifying the database without ensuring that the transaction will commit may
leave the database in an inconsistent state
• Not modifying the database may result in lost updates if failure occurs just
after transaction commits

▪ Recovery algorithms have two parts:
1. Actions taken during normal transaction processing to ensure enough
information exists to recover from failures
2. Actions taken after a failure to recover the database contents to a state that
ensures atomicity, consistency and durability
Storage Structure

▪ Volatile storage:
• Does not survive system crashes
• Examples: main memory, cache memory
▪ Nonvolatile storage:
• Survives system crashes
• Examples: disk, tape, flash memory, non-volatile RAM
• But may still fail, losing data
▪ Stable storage:
• A mythical form of storage that survives all failures
• Approximated by maintaining multiple copies on
distinct nonvolatile media
Data Access

▪ Physical blocks are those blocks residing on the disk.
▪ Buffer blocks are the blocks residing temporarily in main memory.
▪ Block movements between disk and main memory are initiated through the following two operations:
• input(B) transfers the physical block B to main memory.
• output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.
▪ We assume, for simplicity, that each data item fits in, and is
stored inside, a single block.
Data Access (Cont.)
▪ Each transaction Ti has its private work-area in which local copies of all data items accessed and updated by it are kept.
• Ti's local copy of a data item X is called xi.
▪ Transferring data items between system buffer blocks and its private work-area is done by:
• read(X) assigns the value of data item X to the local variable xi.
• write(X) assigns the value of local variable xi to data item X in the buffer block.
• Note: output(BX) need not immediately follow write(X). The system can perform the output operation when it deems fit.
▪ Transactions
• Must perform read(X) before accessing X for the first time (subsequent reads can be from the local copy)
• write(X) can be executed at any time before the transaction commits
Example of Data Access
(Figure: buffer blocks A and B in memory, moved to and from disk by input(A) and output(B); transactions T1 and T2 use read(X) and write(Y) to move values between the buffer and the local copies x1, x2, y1 in their private work areas.)
Recovery and Atomicity
▪ To ensure atomicity despite failures, we first output information describing the modifications to stable storage without modifying the database itself.
▪ We study log-based recovery mechanisms in detail
▪ Less used alternatives: shadow-copy and shadow-paging
Log-Based Recovery

▪ A log is a sequence of log records. The records keep information about update activities on the database.
• The log is kept on stable storage
▪ When transaction Ti starts, it registers itself by writing a
<Ti start>log record
▪ Before Ti executes write(X), a log record
<Ti, X, V1, V2> is written, where V1 is the value of X before
the write (the old value), and V2 is the value to be written to X (the
new value).
▪ When Ti finishes its last statement, the log record <Ti commit>
is written.
▪ Two approaches using logs
• Immediate database modification
• Deferred database modification.
Immediate Database Modification

▪ The immediate-modification scheme allows updates of an uncommitted transaction to be made to the buffer, or the disk itself, before the transaction commits
▪ Update log record must be written before database item is written
• We assume that the log record is output directly to stable storage
▪ Output of updated blocks to disk can take place at any time before or
after transaction commit
▪ Order in which blocks are output can be different from the order in
which they are written.
▪ The deferred-modification scheme performs updates to buffer/disk
only at the time of transaction commit
• Simplifies some aspects of recovery
• But has overhead of storing local copy
Transaction Commit

▪ A transaction is said to have committed when its commit log record is output to stable storage
• All previous log records of the transaction must have been output already
▪ Writes performed by a transaction may still be in the buffer when
the transaction commits, and may be output later
Immediate Database Modification Example

Log                       Write        Output
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                          A = 950
                          B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                          C = 600
                                       BB, BC   (BC output before T1 commits)
<T1 commit>
                                       BA       (BA output after T0 commits)

Note: BX denotes the block containing X.
Concurrency Control and Recovery
▪ With concurrent transactions, all transactions share a single disk buffer and a single log
• A buffer block can have data items updated by one or more transactions
▪ We assume that if a transaction Ti has modified an item, no other transaction can modify the same item until Ti has committed or aborted
• i.e., the updates of uncommitted transactions should not be visible to other transactions
▪ Otherwise, how to perform undo of: T1 updates A, then T2 updates A and commits, and finally T1 has to abort?
• Can be ensured by obtaining exclusive locks on updated items and holding the locks till the end of the transaction (strict two-phase locking)
▪ Log records of different transactions may be interspersed in the log.
Undo and Redo Operations
▪ Undo and Redo of Transactions
• undo(Ti) -- restores the value of all data items updated by Ti to their old values, going backwards from the last log record for Ti
▪ Each time a data item X is restored to its old value V, a special log record <Ti, X, V> is written out
▪ When undo of a transaction is complete, a log record <Ti abort> is written out.
• redo(Ti) -- sets the value of all data items updated by Ti to the new values, going forward from the first log record for Ti
▪ No logging is done in this case
Recovering from Failure
▪ When recovering after failure:
• Transaction Ti needs to be undone if the log
▪ Contains the record <Ti start>,
▪ But does not contain either the record <Ti commit> or <Ti abort>.
• Transaction Ti needs to be redone if the log
▪ Contains the record <Ti start>,
▪ And contains the record <Ti commit> or <Ti abort>
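These two rules amount to a single scan of the log. A small, illustrative Python sketch (the log encoding is an assumption):

# Minimal sketch: scan a log (a list of records) after a crash and decide
# which transactions to undo and which to redo, per the rules above.
def classify(log):
    started, finished = set(), set()
    for rec in log:                     # e.g. ('start','T0'), ('commit','T0')
        kind, t = rec[0], rec[1]
        if kind == 'start':
            started.add(t)
        elif kind in ('commit', 'abort'):
            finished.add(t)
    redo = started & finished           # has start and commit/abort
    undo = started - finished           # has start but no commit/abort
    return undo, redo

log = [('start', 'T0'), ('update', 'T0', 'A', 1000, 950),
       ('commit', 'T0'), ('start', 'T1'), ('update', 'T1', 'C', 700, 600)]
print(classify(log))    # ({'T1'}, {'T0'}): undo T1, redo T0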
Recovering from Failure (Cont.)
▪ Suppose that transaction Ti was undone earlier and the <Ti abort> record was written to the log, and then a failure occurs.
▪ On recovery from failure, transaction Ti is redone
• Such a redo redoes all the original actions of transaction Ti, including the steps that restored old values
▪ Known as repeating history
▪ Seems wasteful, but simplifies recovery greatly
Immediate DB Modification Recovery Example

Below we show the log as it appears at three instances of time. Recovery actions in each case are:
(a) undo (T0): B is restored to 2000 and A to 1000, and log records
<T0, B, 2000>, <T0, A, 1000>, <T0, abort> are written out
(b) redo (T0) and undo (T1): A and B are set to 950 and 2050 and C is
restored to 700. Log records <T1, C, 700>, <T1, abort> are written out.
(c) redo (T0) and redo (T1): A and B are set to 950 and 2050
respectively. Then C is set to 600
Checkpoints
▪ Redoing/undoing all transactions recorded in the log can be very slow
• Processing the entire log is time-consuming if the system has run for a long time
• We might unnecessarily redo transactions which have already output their updates to the database.
▪ Streamline the recovery procedure by periodically performing checkpointing:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all modified buffer blocks to the disk.
3. Write a log record <checkpoint L> onto stable storage, where L is a list of all transactions active at the time of the checkpoint.
4. All updates are stopped while doing checkpointing.
Checkpoints (Cont.)
▪ During recovery we need to consider only the most recent transaction Ti that started before the checkpoint, and transactions that started after Ti.
• Scan backwards from the end of the log to find the most recent <checkpoint L> record
• Only transactions that are in L or started after the checkpoint need to be redone or undone
• Transactions that committed or aborted before the checkpoint already have all their updates output to stable storage.
▪ Some earlier part of the log may be needed for undo operations
• Continue scanning backwards till a record <Ti start> is found for every transaction Ti in L.
• Parts of the log prior to the earliest <Ti start> record above are not needed for recovery, and can be erased whenever desired.
Example of Checkpoints
(Figure: a timeline with a checkpoint at time Tc and a system failure at time Tf; T1 completes before the checkpoint; T2 and T3 commit between the checkpoint and the failure; T4 is still active at the failure.)
● T1 can be ignored (updates already output to disk due to the checkpoint)
● T2 and T3 redone
● T4 undone
Failure with Loss of Nonvolatile Storage
▪ So far we assumed no loss of non-volatile storage
▪ A technique similar to checkpointing is used to deal with loss of non-volatile storage:
• Periodically dump the entire content of the database to stable storage
• No transaction may be active during the dump procedure; a
procedure similar to checkpointing must take place
▪ Output all log records currently residing in main memory
onto stable storage.
▪ Output all buffer blocks onto the disk.
▪ Copy the contents of the database to stable storage.
▪ Output a record <dump> to log on stable storage.
Recovering from Failure of Non-Volatile Storage
▪ To recover from disk failure:
• restore the database from the most recent dump.
• Consult the log and redo all transactions that committed
after the dump
▪ Can be extended to allow transactions to be active during
dump;
known as fuzzy dump or online dump
• Similar to fuzzy checkpointing
Remote Recovery System
● Remote backup systems provide a wide range of availability, allowing the
transaction processing to continue even if the primary site is destroyed by a fire,
flood or earthquake.
● Data and log records from a primary site are continuously backed up into a
remote backup site.
● One can achieve ‘wide range availability’ of data by performing transaction
processing at one site, called the ‘primary site’, and having a ‘remote backup’
site where all the data from the primary site are duplicated
● The remote site is also called ‘secondary site’.
● The remote site must be synchronized with the primary site, as updates are
performed at the primary.
Remote Backup Systems
● Remote backup systems provide high availability by allowing
transaction processing to continue even if the primary site is
destroyed.
Remote Backup Systems (Cont.)
● Detection of failure: Backup site must detect when primary site has
failed
○ to distinguish primary site failure from link failure maintain several
communication links between the primary and the remote backup.
● Transfer of control:
○ To take over control, the backup site first performs recovery using its copy of the database and all the log records it has received from the primary.
■ Thus, completed transactions are redone and incomplete transactions are rolled back.
○ When the backup site takes over processing it becomes the new primary
○ To transfer control back to old primary when it recovers, old primary must
receive redo logs from the old backup and apply all updates locally.
Remote Backup Systems (Cont.)
● Time to recover: To reduce delay in takeover, the backup site periodically processes the redo log records (in effect, performing recovery from the previous database state), performs a checkpoint, and can then delete earlier parts of the log.
● Hot-Spare configuration permits very fast takeover:
● Backup continually processes redo log records as they arrive, applying the updates locally.
When failure of the primary is detected the backup rolls back
incomplete transactions, and is ready to process new
transactions.
● Alternative to remote backup: distributed database with
replicated data
Remote backup is faster and cheaper, but less tolerant to failure
Remote Backup Systems (Cont.)
● Ensure durability of updates by delaying transaction commit until
update is logged at backup; avoid this delay by permitting lower
degrees of durability.
● One-safe: commit as soon as transaction’s commit log record
is written at primary
Problem: updates may not arrive at backup before it takes over.
● Two-very-safe: commit when transaction’s commit log record
is written at primary and backup
Reduces availability since transactions cannot commit if either site
fails.
● Two-safe: proceed as in two-very-safe if both primary and backup are active. If only the primary is active, the transaction commits as soon as its commit log record is written at the primary.
Better availability than two-very-safe; avoids the problem of lost transactions in one-safe.
Remote Recovery System Design
In designing a remote backup system, the following points are important:
a) Detection of failure: It is important for the remote backup system to detect when the primary has failed.
b) Transfer of control: When the primary site fails, the backup site takes over the processing and becomes the new primary site.
c) Time to recover: If the log at the remote backup becomes large, recovery will take a long time.
d) Time to commit: To ensure that the updates of a committed transaction are durable, a transaction should not be announced committed until its log records have reached the backup site.
Thank you !
Chapter 8:
New Trends in Databases
NoSQL Databases
● Non-tabular databases
● Store data differently than relational tables
● NoSQL = Not Only SQL or Non-SQL
● A piece of software is a NoSQL database if it adheres to the following (Fowler, 2015):
○ Doesn’t require a stringent schema for every record created.
○ Is distributable on commodity hardware.
○ Doesn’t use relational database mathematical theory.
Features of NoSQL Databases
● Schema agnostic / Flexible schemas
○ Gives you the freedom to store information without doing up‐front schema design.
○ Schema on read: You need to know how the data is stored only when constructing a query.
Allows you to easily make changes to your database as requirements change
● Horizontal scaling
○ add cheaper, commodity servers/off‐the‐shelf servers whenever you need to instead of migrating
to a larger, more expensive server
● Non Relational
○ Information is stored as an aggregate (a single record with all the information) instead of storing
in multiple tables and connecting them.
○ Deliberately denormalize data, storing some data multiple times
○ Advantages: easy storage and retrieval, query speed
● Highly distributable
○ A cluster of servers can be used to hold a single large database
Problems with conventional approaches

● Schema redesign overhead
○ Modifying a schema requires restructuring of queries, updating of views, locking the database while updating the schema, etc.
● Unstructured data explosion
○ Data that does not have a pre-defined data model or is not organized in a pre-defined manner.
Examples Word, PDF, text, logs, images etc.
● The sparse data problem
○ Using an RDBMS requires a null value be placed into unused columns. An RDBMS will still
allocate disk space for these columns.
● Dynamically changing relationships
● Global distribution and access
Types of NoSQL databases
Common types: i) Columnar

A columnar database is a type of database management system that stores data in columns
rather than rows, optimizing query performance by enabling efficient data retrieval and
analysis. Examples of columnar databases include:
■ Apache Cassandra
■ Amazon redshift
■ Google BigQuery
■ Vertica
■ ClickHouse
■ Snowflake
Key Benefits of Columnar Databases …
1. Improved data compression
2. Enhanced query performance
3. Efficient use of cache memory
4. Vectorization and parallel processing
5. Improved analytics and reporting
6. Better handling of sparse data
7. Flexible indexing options
8. Ease of scalability
9. Real-time data analytics and updates
ii) Document store databases
A document database (also known as a document-oriented
database or a document store) is a database that stores
information in documents.
E.g. Mongodb
Document databases offer a variety of advantages, including:
● An intuitive data model that is fast and easy for developers
to work with
● A flexible schema that allows for the data model to evolve as
application needs change

Document databases are considered to be non-relational (or NoSQL) databases. Instead of storing data in fixed rows and columns, document databases use flexible documents. Document databases are the most popular alternative to tabular, relational databases.
Characteristics of Document based database
• Key-value pair structure: documents are organized as key-value
pairs, where the key is the attribute name (e.g., "Name") and the
value is the attribute's data (e.g., "John Doe").
• Allows hierarchical and nested data storage.
• Easy Scalability: Supports horizontal scaling by distributing
documents across multiple servers.
• Suited for high-availability and high-volume applications.
• Examples of Document Stores
• MongoDB (JSON-like storage).
• CouchDB (JSON storage with RESTful APIs).
• Firebase Realtime Database (NoSQL document database for real-time apps).
MongoDB CRUD operations
• CRUD operations include
• create, read, update, and delete documents.
Create Operations
• Create or insert operations add new documents to a collection.
• If the collection does not currently exist, insert operations will create the
collection.

• MongoDB provides the following methods to insert documents into a collection:
• db.collection.insertOne()
• db.collection.insertMany()
Create Operations
Replace ‘db’ with your database name
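The example from the original slide is not reproduced here; the following is a minimal sketch of insertOne() in the mongo shell, using a hypothetical collection named myCollection (in the shell, db refers to the currently selected database, chosen with use <name>):

// Insert a single document into a collection
db.myCollection.insertOne({
  name: "Alice",
  age: 25,
  city: "Kathmandu"
})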
Create Operations

// Insert multiple documents into a collection
db.myCollection.insertMany([
  {
    name: "Alice",
    age: 25,
    city: "Kathmandu",
    skills: ["JavaScript", "Python"],
    isActive: true
  },
  {
    name: "Bob",
    age: 30,
    city: "Pokhara",
    skills: ["Java", "MongoDB"],
    isActive: false
  },
  {
    name: "Carol",
    age: 28,
    city: "Lalitpur",
    skills: ["HTML", "CSS"],
    isActive: true
  }
]);
Read Operations
• Read operations retrieve documents from a collection; i.e. query a
collection for documents.
• MongoDB provides the following methods to read documents from a
collection:
• db.collection.find()
• You can specify query filters or criteria that identify the
documents to return.
Read Operations
• db.managers.find({city: "Pokhara"}, {name: 1, age: 1})
• This returns the name and age (plus _id, which is included by default) of each person whose city is "Pokhara".
• 1 -> include the field
• 0 -> exclude the field
Update Operations
• Update operations modify existing documents in a collection.
MongoDB provides the following methods to update documents of a
collection:
• db.collection.updateOne()
• db.collection.updateMany()
• db.collection.replaceOne()
• In MongoDB, update operations target a single collection.
• All write operations in MongoDB are atomic on the level of a single
document.
• Updates are permanent and can’t be rolled back.
• You can specify criteria, or filters, that identify the documents to
update.
• These filters use the same syntax as read operations.
Update Operations
• To update a document, we provide the method with two arguments: an update filter and an
update action.
• The update filter defines which items we want to update, and the update action defines how
to update those items.
• We first pass in the update filter. Then, we use the “$set” key and provide the fields we want
to update as a value.
• This method will update the first record that matches the provided filter.
updateOne():
• Updates a currently existing record, changing a single document with an update
operation.

db.MyCollection.updateOne({name: "Marsh"},
{$set: {ownerAddress: "Lagankhel, Lalitpur"}})
Update Operations
• updateMany()
• updateMany() allows us to update multiple items by passing in a list of items.
• This update operation uses the same syntax for updating a single document.
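As an illustration, a minimal sketch reusing the hypothetical MyCollection from the updateOne() example: the filter matches every document whose city is "Pokhara", and $set marks them all inactive.

// Update every document matching the filter
db.MyCollection.updateMany({city: "Pokhara"}, {$set: {isActive: false}})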
Update Operations
replaceOne()
• The replaceOne() method replaces a single document in the specified collection.
• replaceOne() replaces the entire document, meaning fields in the old document not
contained in the new one will be lost.

db.RecordsDB.replaceOne({name: "Kevin"}, {name: "Marki"})


Delete Operations
• Delete operations remove documents from a collection. MongoDB
provides the following methods to delete documents of a
collection:
• db.collection.deleteOne()
• db.collection.deleteMany()

• In MongoDB, delete operations target a single collection.


• All write operations in MongoDB are atomic on the level of a
single document.
• You can specify criteria, or filters, that identify the documents to
remove.
• These filters use the same syntax as read operations.
Delete Operations
db.collection.deleteOne()
• deleteOne() removes a document from a specified collection on the
MongoDB server.
• Filter criteria are used to specify the item to delete.
• It deletes the first record that matches the provided filter.

db.RecordsDB.deleteOne({name:"Marki"})
Delete Operations
db.collection.deleteMany()
• deleteMany() is a method used to delete multiple documents from a desired
collection with a single delete operation.
• A list is passed into the method and the individual items are defined with
filter criteria as in deleteOne().
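As an illustration, a minimal sketch assuming the same hypothetical RecordsDB collection: the filter matches every inactive document, and all of them are removed in a single operation.

// Delete every document matching the filter
db.RecordsDB.deleteMany({isActive: false})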
iii) Key-value stores
● Key value databases, also known as key value stores, are NoSQL database types
where data is stored as key value pairs and optimized for reading and writing that
data.
● The data is fetched by a unique key or a number of unique keys to retrieve the
associated value with each key.
● The values can be simple data types like strings and numbers or complex objects.
● The unique key can be anything.
● Most of the time, it is an id field, since that's the unique field in all the documents.
● To group related items, you can also add a common prefix to the key. The general
structure of a key value pair is key: value. For example, “name”: “John Drake.”
● Examples: Basho Riak, Redis, Voldemort, Aerospike, Oracle NoSQL, Amazon
DynamoDB, Azure Cosmos DB etc.
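As an illustration, a short redis-cli session against Redis (one of the stores listed above); the key names are hypothetical and use the prefix-grouping idea described above:

SET user:1001:name "John Drake"
OK
GET user:1001:name
"John Drake"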
Types of NoSQL databases contd.
iv) Triple stores

● A single fact is represented by three elements:
○ Subject (The subject you’re describing),
○ Predicate (The name of its property or relationship to another subject),
and
○ Object (The value).
● Examples: MarkLogic, Ontotext-OWLIM, Oracle NoSQL etc.
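As an illustration, the facts "Alice works for AcmeCorp" and "Alice's name is Alice" written in RDF's Turtle syntax, a common triple format (the namespace and facts are hypothetical); each line is one subject-predicate-object triple:

@prefix ex: <http://example.org/> .
ex:Alice  ex:worksFor  ex:AcmeCorp .
ex:Alice  ex:hasName   "Alice" .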

v) Graph databases

● Store data in nodes and edges.
● Examples: Neo4j, Virtuoso, Apache Giraph etc.
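As an illustration, a minimal sketch in Neo4j's Cypher query language (the labels and property names are hypothetical): the first statement creates two nodes joined by an edge, and the second traverses that edge.

// Create two Person nodes connected by a KNOWS relationship
CREATE (a:Person {name: "Alice"})-[:KNOWS]->(b:Person {name: "Bob"});
// Traverse the relationship and return both names
MATCH (p:Person)-[:KNOWS]->(q:Person) RETURN p.name, q.name;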
Consistency
ACID consistency

● Once data is written, you have full consistency in reads.

Eventual consistency (BASE)

● Once data is written, it will eventually appear for reading.


BASE
● Basically Available
○ The system is guaranteed to be available even in the event of a failure.
● Soft State
○ The state of the data could change without application interactions due to eventual consistency.
● Eventual Consistency
○ The system will be eventually consistent after the application input. The data will be replicated
to different nodes and will eventually reach a consistent state. But the consistency is not
guaranteed at a transaction level.
The CAP Theorem
CAP stands for Consistency, Availability, and Partition tolerance.
The CAP theorem states that a distributed system cannot guarantee all three properties at
the same time; when a network partition occurs, it must trade off consistency against availability.
Which type of database is for you?

Choose RDBMS if you have or need:
● Consistent data / ACID transactions
● Complex dynamic queries requiring stored procedures or views
● The option to migrate to another database without significant change to the existing application's access paths or logic
● A data warehouse, analytics or BI use case

Choose NoSQL if you have or need:
● Semi-structured or unstructured data / flexible schema
● Limited pre-defined access paths and query patterns
● No complex queries, stored procedures, or views
● High velocity transactions
● Large volume of data (in the terabyte range) requiring quick and cheap scalability
● Distributed computing and storage
Which NoSQL database is for you?

Choose Key-value Stores if:
● Simple schema
● High velocity read/write with no frequent updates
● High performance and scalability
● No complex queries involving multiple keys or joins

Choose Document Stores if:
● Flexible schema with complex querying
● JSON/BSON or XML data formats
● Leverage complex indexes (multikey, geospatial, full text search etc.)
● High performance and balanced R:W ratio
Which NoSQL database is for you?

Choose Column-Oriented Databases if:
● High volume of data
● Extreme write speeds with relatively low-velocity reads
● Data extraction by columns using row keys
● No ad-hoc query patterns, complex indices or high levels of aggregation

Choose Graph Databases if:
● Applications requiring traversal between data points
● Ability to store properties of each data point as well as relationships between them
● Complex queries to determine relationships between data points
● Need to detect patterns between data points
Polyglot Persistence
Polyglot persistence is the idea that a single application that uses different types of
data should use multiple types of databases behind it, each suited to the data it stores.
NewSQL
Modern SQL databases that seek to provide the scalability of NoSQL systems while
maintaining the ACID guarantees of traditional database systems.

Examples: Amazon Aurora, Couchbase etc.


LAB 6
Neo4j
Tasks
1. Download and install Neo4j Desktop (https://neo4j.com/download/).
2. Create a new project.
3. Create a local DBMS in the project.
4. Start the newly created graph database and explore neo4j.
Check out "Getting started with Neo4j Browser", "Cypher basics" and "Try Neo4j with live data" guides.
5. Create another local DBMS, and write a script in Cypher query language to do the following:
● Create some nodes of type Book with the following properties: title, publisher, published_year, genre
● Create some nodes of type Author with the following properties: name
● Create some nodes of type User with the following properties: name
● Create some relationship of type RATED between users and books. The relationship must have the following
attribute: rating. A User node must be related to a Book node if the user has rated the book.
● Create some relationship of type WRITTEN_BY between authors and books. A Book node must be
connected to an Author node if the book is written by that author.
● Find all books rated by a particular user.
● Find all books not rated by anyone.
● Find all authors who have written more than one book.
● Find all authors whose one or more books are rated.
Deliverable
Script for Task 5.
References
● Fowler, Adam (2015). NoSQL for Dummies.
● https://www.dataversity.net/choose-right-nosql-database-application/
● https://www.techtarget.com/searchdatamanagement/feature/Key-criteria-for-choosing-different-types-of-NoSQL-databases
Thank you !
Normalization
Normalization is based on the analysis of functional dependencies. A functional dependency is a
constraint between two attributes or two sets of attributes. The purpose of the database design is to
arrange the various data items into an organized structure so that it generates a set of relationships and
stores the information without any repetition. A bad database design may result in redundant and
spurious data and information.

Normalization is a process for deciding which attributes should be grouped together in a relation. It is
a tool to validate and improve a logical design, so that it satisfies certain constraints that avoid
redundancy of data. Furthermore, Normalization is defined as the process of decomposing relations
with anomalies to produce smaller, well-organized relations. Thus, in normalization process, a relation
with redundancy can be refined by decomposing it or replacing it with smaller relations that contain
the same information, but without redundancy.

Functional Dependencies

Functional dependencies are the result of interrelationships between attributes, or between tuples, in
a relation.

Definition: In a relation R, let X and Y be two subsets of the set of attributes. Y is said to be
functionally dependent on X if a given value of X (all attributes in X) uniquely determines the value of
Y (all attributes in Y).

It is denoted by X → Y (Y depends upon X).

Determinant : Here X is known as the determinant of functional dependency.

Consider the example of Employee relation:


In the Employee relation, EID is the primary key. Suppose you want to know the name and salary of an
employee. If you have the EID of that employee, then you can easily find that information. So, the
Name and Salary attributes depend upon the EID attribute.
Here, X is (EID) and Y is (Name, Salary), i.e., EID → (Name, Salary).
The determinant is EID.
For example, if X (EID) has the value 5, then Y has the value (Manoj, 9,000).

Functional Dependency Chart/Diagram


It is the graphical representation of functional dependencies among attributes in a relation.
The following four steps are followed to draw the FD chart.
1. Find out the primary key attributes.
2. Make a rectangle and write all primary key attributes inside it.
3. Write all non-prime key attributes outside the rectangle.
4. Use arrows to show functional dependency among attributes.

Types of Functional Dependencies

The major types of FDs, described below in contrasting pairs, are:

1. Partial Dependency and Fully Functional Dependency


Partial dependency: Suppose the primary key consists of more than one attribute, and let A be a
non-prime-key attribute. If A does not depend upon all the prime-key attributes (i.e., it depends on only
part of the key), then a partial dependency exists.
Fully functional dependency: Let A be a non-prime-key attribute whose value depends upon all the
prime-key attributes; then A is said to be fully functionally dependent on the key. Consider a relation
Student having prime-key attributes (RollNo and Game)
and non-prime-key attributes (Grade, Name and Fee).

As shown in the Figure, Name and Fee are partially dependent: you can find the name of a student by
his RollNo alone, and the fee of any game by the name of the game alone.
Grade is fully functionally dependent because you can find the grade of a student in a particular
game only if you know both the RollNo and the Game of that student. Partial dependency can arise only
when the primary key consists of more than one attribute.

2. Transitive Dependency and Non-transitive Dependency


Transitive dependency: Transitive dependency is due to dependency between non-prime-key attributes.
Suppose in a relation R, X → Y (Y depends upon X) and Y → Z (Z depends upon Y); then X → Z (Z
depends upon X). Therefore, Z is said to be transitively dependent upon X.
Non-transitive dependency : Any functional dependency which is not transitive is known as
Non-transitive dependency.
Non-transitive dependency exists if there is no dependency between non-prime key attributes.
Consider a relation Student (whose functional dependency chart is shown in Figure) having the prime-key
attribute (RollNo) and non-prime-key attributes (Name, Semester, Hostel).
For each semester there is a different hostel. The semester of any student can be found from his RollNo,
and the hostel can be found from the semester; hence Hostel is transitively dependent upon RollNo.
Name, however, is non-transitively dependent upon RollNo.

3. Single Valued Dependency and Multivalued Dependency


Single valued dependency: In any relation R, if for a particular value of X, Y has a single value, then it is
known as a single valued dependency.
Multivalued dependency (MVD): In any relation R, if for a particular value of X, Y has more than one
value, then it is known as a multivalued dependency. It is denoted by X →→ Y.
Consider the relation Teacher shown in Figure

There is an MVD between Teacher and Class because a teacher can take more than one class. There is
another MVD between Class and Days because a class can be held on more than one day.
There is a single valued dependency between ID and Teacher because each teacher has a unique ID.

Now,
normalization is a process by which we can decompose or divide any relation into more than one
relation to remove anomalies in relational databases. It is a step-by-step process, and each step is known
as a normal form. Normalization is a reversible process: the original relation can be recovered by
joining the decomposed relations.

Benefits of Normalisation
The benefits of normalization include
(a) Normalization produces smaller tables with smaller rows; this means more rows per page and hence
less logical I/O.
(b) Searching, sorting, and creating indexes are faster, since tables are narrower, and more rows fit on a
data page.
(c) Normalization produces more tables by splitting the original tables. Thus there can be more
clustered indexes, and hence more flexibility in tuning queries.
(d) Index searching is generally faster as indexes tend to be narrower and shorter.
(e) Having more tables allows better use of segments to control the physical placement of data.
(f) There are fewer indexes per table and hence data modification commands are faster.
(g) There are fewer null values and less redundant data. This makes the database more
compact.
(h) Data modification anomalies are reduced.
(i) Normalization is conceptually cleaner and easier to maintain and change as the needs change.

Various Normal Forms


The different normal forms are as follows; each has its importance and is more desirable
than the previous one.
First Normal Form (1NF)
A relation is in first normal form if the domain of each attribute contains only atomic values; that is,
every attribute value in the relation must be atomic (indivisible).

Consider the relation Employee as shown in Figure above. It is not in its first normal form because
attribute Name is not atomic. So, divide it into two attributes First Name and Last Name as shown in
Figure below.
Now, relation Employee is in 1NF.
Anomalies in First Normal Form: 1NF deals only with atomicity, so a relation in 1NF may still suffer from the anomalies described below.
Second Normal Form (2NF)
A relation is in second normal form if it is in 1NF and all non-primary key attributes must be fully
functionally dependent upon primary key attributes.
Consider the relation Student as shown in Figure

The Primary Key is (RollNo., Game). Each Student can participate in more than one game. Relation
Student is in 1NF but still contains anomalies.
1. Deletion anomaly: Suppose you want to delete student Jack. You lose information about
the game Hockey because he is the only player participating in hockey.
2. Insertion anomaly: Suppose you want to add a new game, Basketball, with no student
participating in it. You cannot add this information unless there is a player for it.
3. Updation anomaly: Suppose you want to change the fee of Cricket. You have to search all the
students who participate in cricket and update the fee individually; otherwise it produces inconsistency.
The solution to this problem is to separate the partial dependencies from the fully functional dependencies. So,
divide the Student relation into three relations, Student(RollNo, Name), Games(Game, Fee) and
Performance(RollNo, Game, Grade), as shown in Figure and in the SQL sketch below.
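Since the figure is not reproduced here, a minimal SQL sketch of the decomposition (column names follow the text; column types are assumptions):

CREATE TABLE Student (
  RollNo INT PRIMARY KEY,
  Name   VARCHAR(50)
);
CREATE TABLE Games (
  Game VARCHAR(30) PRIMARY KEY,
  Fee  DECIMAL(8,2)
);
CREATE TABLE Performance (
  RollNo INT REFERENCES Student(RollNo),
  Game   VARCHAR(30) REFERENCES Games(Game),
  Grade  CHAR(2),
  PRIMARY KEY (RollNo, Game)
);

Each non-key attribute now depends on the whole key of its own table, so the partial dependencies are gone.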

Now, Deletion, Insertion and updation operations can be performed without causing inconsistency.

Third Normal Form (3NF)


A relation is in third normal form if it is in 2NF and all non-primary-key attributes are
non-transitively dependent upon the primary key attributes.
In other words, a relation is in 3NF if it is in 2NF and has no transitive dependency.
Consider the relation Student as shown in Figure
The primary key is (RollNo). The constraint is that a different hostel is allotted for each semester.
The Student relation is in 2NF but still contains anomalies.
1. Deletion anomaly: If you want to delete student Gaurav, you lose information about hostel H2
because he is the only student staying in hostel H2.
2. Insertion anomaly: If you want to add a new hostel H8 that is not allotted to any student, you
cannot add this information.
3. Updation anomaly: If you want to change the hostel of all students of the first semester, you have to
search all the students of the first semester and update them individually; otherwise it causes inconsistency.

The solution to this problem is to divide the relation Student into two relations, Student(RollNo,
Name, Semester) and Hostels(Semester, Hostel), as shown in Figure.
Now, deletion, insertion and updation operations can be performed without causing inconsistency.

Boyce Codd Normal Form (BCNF)


BCNF is a stricter form of 3NF. A relation is in BCNF if and only if every determinant is a candidate
key. BCNF deals with relations that have multiple candidate keys.
Relations in 3NF may still contain anomalies. Consider the relation Student as shown in Figure below.

Assumptions:
— Student can have more than 1 subject.
— A Teacher can teach only 1 subject.
— A subject can be taught by more than 1 teacher
There are two candidate keys, (RollNo, Subject) and (RollNo, Teacher). The relation Student is in 3NF
but still contains anomalies.
1. Deletion anomaly: If you delete the student whose RollNo is 7, you also lose the information that
teacher T4 teaches the subject VB.
2. Insertion anomaly: If you want to add a new subject VC++, you cannot do so until a student
chooses subject VC++ and a teacher teaches it.
3. Updation anomaly: Suppose you want to change the teacher for subject C. You have to search all the
students having subject C and update each record individually; otherwise it causes inconsistency.

In the relation Student, the candidate keys overlap: you can find Teacher from RollNo and Subject, and
you can also find Subject from RollNo and Teacher. But in addition, Subject can be found from Teacher
alone, so Teacher is a determinant without being a candidate key.

The solution to this problem is to divide the relation Student into two relations, Stu-Teac (RollNo,
Teacher) and Teac-Sub (Teacher, Subject), as shown in Figure below.
Relations in BCNF may still contain anomalies. Consider the relation Project-Work as shown
in Figure.

Assumptions:
– A Programmer can work on any number of projects.
– A project can have more than one module.
Relation Project-work is in BCNF but still contains anomalies.
1. Deletion anomaly: If you delete project 2, you lose information about programmer P3.
2. Insertion anomaly: If you want to add a new project 4, you cannot add it until it is
assigned to a programmer.
3. Updation anomaly: If you want to change the name of project 1, you have to search all the
programmers having project 1 and update them individually; otherwise it causes inconsistency.
Dependencies in Relation Project-work are
Programmer →→ Project
Project →→ Module
The solution to this problem is to divide the relation Project-Work into two relations, Prog-Prj
(Programmer, Project) and Prj-Module (Project, Module), as shown in Figure and in the SQL sketch below.
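Since the figure is not reproduced here, a minimal SQL sketch of this decomposition (column types are assumptions); each table now records one independent multivalued fact:

CREATE TABLE Prog_Prj (
  Programmer VARCHAR(10),
  Project    INT,
  PRIMARY KEY (Programmer, Project)
);
CREATE TABLE Prj_Module (
  Project INT,
  Module  VARCHAR(20),
  PRIMARY KEY (Project, Module)
);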

Multivalued Dependencies and Fourth Normal Form


Definition: Let R be a relation having attributes or sets of attributes A, B, and C. There is a
multivalued dependency of attribute B on attribute A if and only if the set of B values associated with a
given A value is independent of the C values.

We write this as A →→ B and read it as "A multidetermines B". If R has the three attributes A, B,
and C, then in R(A, B, C), A →→ B implies A →→ C as well.
Alternate definition of Multivalued Dependency
More generally, if R is a relation with multivalued dependency
A →→ B
then in any table for R, if two tuples, t1 and t2, have the same A value, then there must exist two other
tuples t3 and t4 obeying these rules
1. t3 and t4 have the same A value as t1 and t2
2. t3 has the same B value as t1
3. t4 has the same B value as t2
4. If R – B represents the attributes of R that are not in B, then t2 and t3 have the same
values for R – B, and
5. t1 and t4 have the same values for R – B
The dependency A →→ B is called a trivial multivalued dependency if B is a subset of A or A ∪ B
is all of R. Now we are ready to consider fourth normal form.
Definition: A relation is in fourth normal form (4NF) if and only if it is in Boyce-Codd normal form
and there are no nontrivial multivalued dependencies.
Fourth Normal Form (4NF)
A relation is in 4NF if it is in BCNF and, for every multivalued dependency (MVD) of the
form X →→ Y, either X →→ Y is a trivial MVD or X is a super key of the relation.
Candidate Key

The minimal set of attributes that can uniquely identify a tuple is known as a candidate
key. For example, STUD_NO in the STUDENT relation.
● It is a minimal super key, i.e., a super key from which no attribute can be removed.
● It must contain unique values: no two tuples may share the same candidate key value.
● Unlike a primary key, a candidate key that is not chosen as the primary key may contain NULL values.
● Every table must have at least one candidate key.
● A table can have multiple candidate keys, but only one of them becomes the primary key.

Primary Key

There can be more than one candidate key in a relation, out of which one can be chosen as the primary
key. For example, STUD_NO as well as STUD_PHONE are candidate keys for the relation STUDENT,
but STUD_NO can be chosen as the primary key (only one out of many candidate keys).

● It is a unique key: its values are unique, with no duplicates.
● It identifies exactly one tuple (record) at a time.
● It cannot be NULL.
● A primary key is not necessarily a single column; more than one column can together
form the primary key of a table.
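To make the distinction concrete, a minimal SQL sketch using the STUDENT example above (column types are assumptions): the chosen candidate key is declared PRIMARY KEY, while the remaining candidate key is declared UNIQUE.

CREATE TABLE STUDENT (
  STUD_NO    INT,
  STUD_PHONE VARCHAR(15),
  STUD_NAME  VARCHAR(50),
  PRIMARY KEY (STUD_NO),   -- the chosen candidate key
  UNIQUE (STUD_PHONE)      -- the remaining candidate key
);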
Super Key

The set of attributes that can uniquely identify a tuple is known as a super key. For example,
STUD_NO and (STUD_NO, STUD_NAME) are super keys of STUDENT. A super key is a group of one
or more attributes that identifies rows in a table, and it may include attributes that hold NULL values.
● Adding zero or more attributes to a candidate key generates a super key.
● Every candidate key is a super key, but the converse is not true.
● Super key values may also be NULL.
