DBMS
Database Management System
Kathmandu University
BE Computer Engineering
Year II/II
Overview
● Database Systems Applications
● Database System versus File systems
● View of Data
● Database Languages
● Database Users and Administrators
● Transaction Management
● Database Architecture
2
Introduction: Database management System
As the name suggests, the Database Management System consists
of two parts. They are:
1. Database and
2. Management System
3
Database
● What is a Database?
● To find out what a database is, we have to start from data, which is
the basic building block of any DBMS.
● Data: Facts, figures, statistics etc. having no particular meaning
(e.g. 1, ABC, 19 etc).
● Record: Collection of related data items, e.g. in the above example
the three data items had no meaning. But if we organize them in
the following way, then they collectively represent meaningful
information.
4
Database
● Table or Relation: Collection of related records
6
Database
● We now have a collection of 4 tables. They can be called a “related
collection” because we can clearly find out that there are some
common attributes existing in a selected pair of tables.
● Because of these common attributes we may combine the data of
two or more tables together to find out the complete details of a
student.
● Questions like “Which hostel does the youngest student live in?”
can be answered now, although Age and Hostel attributes are in
different tables.
7
Database Management System
● A database-management system (DBMS) is a collection of
interrelated data and a set of programs to access those data. This
is a collection of related data with an implicit meaning and hence
is a database.
● The collection of data, usually referred to as the database,
contains information relevant to an enterprise. The primary goal
of a DBMS is to provide a way to store and retrieve database
information that is both convenient and efficient.
● By data, we mean known facts that can be recorded and that have
implicit meaning.
8
Database Management System
● Database systems are designed to manage large bodies of information.
Management of data involves both defining structures for storage of
information and providing mechanisms for the manipulation of
information.
● In addition, the database system must ensure the safety of the
information stored, despite system crashes or attempts at unauthorized
access.
● If data are to be shared among several users, the system must avoid
possible anomalous results.
9
Database Management Systems
● DBMS contains information about a particular enterprise
○ Collection of interrelated data
○ Set of programs to access the data
○ An environment that is both convenient and efficient to use
● Database systems are used to manage collections of data that are:
○ Highly valuable
○ Relatively large
○ Accessed by multiple users and applications, often at the same time.
● A modern database system is a complex software system whose task is to
manage a large, complex collection of data.
● Databases touch all aspects of our lives
10
Views of DBMS
● A database in a DBMS could be viewed by lots of different people with
different responsibilities.
11
Applications of DBMS
● Enterprise Information
○ Sales: customers, products, purchases
○ Accounting: payments, receipts, assets
○ Human Resources: Information about employees, salaries, payroll taxes.
● Manufacturing: management of production, inventory, orders,
supply chain.
● Banking and finance
○ customer information, accounts, loans, and banking transactions.
○ Credit card transactions
○ Finance: sales and purchases of financial instruments (e.g., stocks and bonds);
storing real-time market data
● Universities: registration, grades
12
Applications of DBMS …
● Airlines: reservations, schedules
● Telecommunication: records of calls, texts, and data usage, generating
monthly bills, maintaining balances on prepaid calling cards
● Web-based services
○ Online retailers: order tracking, customized recommendations
○ Online advertisements
● Document databases
● Navigation systems: For maintaining the locations of various places of
interest along with the exact routes of roads, train systems, buses, etc.
13
University Database Example
● In this course we will be using a university database to illustrate most of
the concepts
● Data consists of information about:
○ Students
○ Instructors
○ Classes
● Application program examples:
○ Add new students, instructors, and courses
○ Register students for courses, and generate class rosters
○ Assign grades to students, compute grade point averages (GPA) and generate transcripts
14
History of Database Systems
1950s and early 1960s:
● Data processing using magnetic tapes for storage
○ Tapes provided only sequential access
● Punched cards for input
Late 1960s and 1970s:
● Hard disks allowed direct access to data
● Network and hierarchical data models in widespread use
● Ted Codd defines the relational data model
○ Would win the ACM Turing Award for this work
○ IBM Research begins System R prototype
○ UC Berkeley (Michael Stonebraker) begins Ingres prototype
○ Oracle releases first commercial relational database
● High-performance (for the era) transaction processing
15
History of Database Systems …
1980s:
● Research relational prototypes evolve into commercial systems
○ SQL becomes industrial standard
● Parallel and distributed database systems
○ Wisconsin, IBM, Teradata
● Object-oriented database systems
1990s:
● Large decision support and data-mining applications
● Large multi-terabyte data warehouses
● Emergence of Web commerce
16
History of Database Systems …
2000s
● Big data storage systems
○ Google BigTable, Yahoo! PNUTS, Amazon
○ “NoSQL” systems.
● Big data analysis: beyond SQL
○ MapReduce
2010s
● SQL reloaded
○ SQL front end to MapReduce systems
○ Massively parallel database systems
○ Multi-core main-memory databases
17
File System ?
18
File System
In computing, a file system or filesystem (often abbreviated fs) is a method
and data structure that the operating system uses to control how data is
stored and retrieved.
A file system organizes files and helps retrieve them when they are
required. A file system consists of different files which are grouped into
directories. The directories may further contain other folders and files. The file
system performs basic operations such as file management, file naming, and
setting access rules.
19
File System
The FAT (short for File Allocation Table) file system is a general-purpose file
system that is compatible with all major operating systems (Windows, Mac OS
X, and Linux/Unix).
Windows: NTFS
Linux/Unix: ext4
20
Purpose of Database Systems
In the early days, database applications were built directly on top of file
systems, which leads to:
● Data redundancy and inconsistency: data is stored in multiple file
formats resulting in duplication of information in different files
● Difficulty in accessing data
○ Need to write a new program to carry out each new task
● Data isolation
○ Multiple files and formats
● Integrity problems
○ Integrity constraints (e.g., account balance > 0) become “buried” in program code rather
than being stated explicitly
○ Hard to add new constraints or change existing ones
21
Purpose of Database Systems
● Atomicity of updates
○ Failures may leave database in an inconsistent state with partial updates carried
out
○ Example: Transfer of funds from one account to another should either be complete
or not happen at all
● Concurrent access by multiple users
○ Concurrent access needed for performance
○ Uncontrolled concurrent accesses can lead to inconsistencies
○ Ex: Two people reading a balance (say 100) and updating it by withdrawing money
(say 50 each) at the same time
● Security problems
○ Hard to provide user access to some, but not all, data
23
File System Vs DBMS
Basis: File System vs DBMS
● Structure: A file system is software that manages and organizes the files in a
storage medium within a computer; a DBMS is software for managing the database.
● Data Redundancy: Redundant data can be present in a file system; in a DBMS,
redundancy is controlled.
● Backup and Recovery: A file system does not provide backup and recovery of data
if it is lost; a DBMS provides backup and recovery of data even if it is lost.
● Query Processing: There is no efficient query processing in a file system; a DBMS
provides efficient query processing.
24
File system Vs DBMS ...
Basis: File System vs DBMS
● Consistency: There is less data consistency in a file system; a DBMS gives more
data consistency because of the process of normalization.
● Security Constraints: A file system provides less security in comparison to a DBMS;
a DBMS has more security mechanisms.
● Cost: A file system is relatively less expensive; a DBMS is relatively more expensive.
25
View of Data
● A database system is a collection of interrelated data and a set of
programs that allow users to access and modify these data.
● A major purpose of a database system is to provide users with an abstract
view of the data. That is, the system hides certain details of how the data
are stored and maintained.
○ Data models
■ A collection of conceptual tools for describing data, data relationships,
data semantics, and consistency constraints.
○ Data abstraction
■ Hide the complexity of data structures to represent data in the database
from users through several levels of data abstraction.
26
Data Abstraction
● For the system to be usable, it must retrieve data efficiently. The
need for efficiency has led designers to use complex data
structures to represent data in the database.
● Since many database-system users are not computer trained,
developers hide the complexity from users through several levels
of abstraction, to simplify users’ interactions with the system:
27
Data Abstraction/View of Data
28
Physical Level (Internal View/Schema)
● Physical level (or Internal View / Schema): The lowest level of abstraction
describes how the data are actually stored.
● The physical level describes complex low-level data structures in detail.
29
Logical Level (Conceptual View/Schema)
● Logical level (or Conceptual View / Schema): The next-higher level of
abstraction describes what data are stored in the database, and what
relationships exist among those data. The logical level thus describes the entire
database in terms of a small number of relatively simple structures.
● Although implementation of the simple structures at the logical level may
involve complex physical-level structures, the user of the logical level does not
need to be aware of this complexity. This is referred to as physical data
independence.
● Database administrators, who must decide what information to keep in the
database, use the logical level of abstraction.
30
Physical Data independence
● Physical Data Independence – the ability to modify the physical schema
without changing the logical schema
○ Applications depend on the logical schema
○ In general, the interfaces between the various levels and components
should be well defined so that changes in some parts do not seriously
influence others.
31
View level (or External View / Schema):
● View level (or External View / Schema): The highest level of abstraction
describes only part of the entire database.
● Even though the logical level uses simpler structures, complexity remains
because of the variety of information stored in a large database. Many
users of the database system do not need all this information; instead,
they need to access only a part of the database.
● The view level of abstraction exists to simplify their interaction with the
system. The system may provide many views for the same database.
32
Data Types and Level of Abstraction
Programming Vs DBMS: an Analogy
● Many high-level programming languages support the notion of a structure
type. For example, we may describe a record as follows:
type instructor = record
ID : char (5);
name : char (20);
dept name : char (20);
salary : numeric (8,2);
end;
● This code defines a new record type called instructor with four fields. Each
field has a name and a type associated with it. A university organization
may have several other such record types.
33
Physical level of abstraction: Programming vs DBMS
● At the physical level, an instructor (department, or student)
record can be described as a block of consecutive storage
locations. The compiler hides this level of detail from
programmers.
● Similarly, the database system hides many of the lowest-level
storage details from database programmers. Database
administrators, on the other hand, may be aware of certain details
of the physical organization of the data.
34
Logical Level of Abstraction: Programming Vs DBMS
● At the logical level, each such record is described by a type definition, as
in the previous code segment, and the interrelationship of these record
types is defined as well.
● Programmers using a programming language work at this level of
abstraction. Similarly, database administrators usually work at this level of
abstraction.
35
View Level of Abstraction: Programming vs DBMS
● At the view level, computer users see a set of application programs that
hide details of the data types.
● At the view level, several views of the database are defined, and a
database user sees some or all of these views. In addition to hiding details
of the logical level of the database, the views also provide a security
mechanism to prevent users from accessing certain parts of the database.
● For example, clerks in the university registrar office can see only that part
of the database that has information about students; they cannot access
information about salaries of instructors.
36
Instance and Schemas
● Databases change over time as information is inserted and deleted. The
collection of information stored in the database at a particular moment is
called an instance of the database. The overall design of the database is
called the database schema. Schemas are changed infrequently, if at all.
● Schema – the logical structure of the database
○ Physical schema: database design at the physical level
○ Logical schema: database design at the logical level
● Instance – the actual content of the database at a particular point in time
● Physical Data Independence – the ability to modify the physical schema
without changing the logical schema
● Applications depend on the logical schema
● In general, the interfaces between the various levels and components
should be well defined so that changes in some parts do not seriously
influence others.
37
Instances and Schemas
● Similar to types and variables in programming languages
● Logical Schema – the overall logical structure of the database
Example: The database consists of information about a set of customers
and accounts in a bank and the relationship between them
Analogous to type information of a variable in a program
● Physical schema– the overall physical structure of the database
● Instance – the actual content of the database at a particular point in time
Analogous to the value of a variable
38
Data Model
● The underlying structure of a database is the data model: a collection of
conceptual tools for describing
○ Data
○ data relationships
○ data semantics
○ consistency constraints.
● A data model provides a way to describe the design of a database at the
physical, logical, and view levels.
● The data models can be classified into different categories:
○ Relational model
○ Entity Relationship Model
○ Object Based Model
○ Semi Structured Data Model
Other older models: Network model, Hierarchical model, etc.
39
Relational Model
● The relational model uses a collection of tables to represent both data
and the relationships among those data. Each table has multiple columns,
and each column has a unique name. Tables are also known as relations.
The relational model is an example of a record-based model.
● Record-based models are so named because the database is structured in
fixed-format records of several types. Each table contains records of a
particular type. Each record type defines a fixed number of fields, or
attributes. The columns of the table correspond to the attributes of the
record type.
● The relational data model is the most widely used data model, and a vast
majority of current database systems are based on the relational model.
40
A Sample Relational Database
41
42
Database languages
● A database system provides a data-definition language to specify the
database schema and a data-manipulation language to express database
queries and updates.
● In practice, the data-definition language and data-manipulation language
are not two separate languages; instead they simply form parts of a single
database language, such as the widely used SQL language.
43
Data Definition Language (DDL)
● Specification notation for defining the database schema
Example: create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))
● DDL compiler generates a set of table templates stored in a data dictionary
● Data dictionary contains metadata (i.e., data about data)
Database schema
Integrity constraints
○ Primary key (ID uniquely identifies instructors)
Authorization
○ Who can access what
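● For illustration (a minimal sketch, not from the original slides), such authorization is
typically expressed with standard SQL grant/revoke statements; the instructor table is
the running example used in these slides and public is the built-in role for all users:
grant select on instructor to public;
revoke select on instructor from public;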
44
Data Manipulation Language (DML)
● Language for accessing and manipulating the data organized by the appropriate data model
○ DML is also known as query language
● Two classes of languages
Pure – used for proving properties about computational power and for optimization
○ Relational Algebra
○ Tuple relational calculus
○ Domain relational calculus
Commercial – used in commercial systems
○ SQL is the most widely used commercial language
● Data Manipulation Language enables users to access or manipulate data as organized by the
appropriate data model.
● The types of access are:
○ Retrieval of information stored in the database
○ Insertion of new information into the database
○ Deletion of information from the database
○ Modification of information stored in the database
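● As a quick sketch (assuming the instructor relation defined earlier in these slides;
the specific values are illustrative), the four types of access map onto SQL roughly as:
select * from instructor where dept_name = 'Physics';                 -- retrieval
insert into instructor values ('10211', 'Smith', 'Biology', 66000);   -- insertion
delete from instructor where dept_name = 'Finance';                   -- deletion
update instructor set salary = salary * 1.05;                         -- modification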
45
Data Manipulation Language …
● There are basically two types of data-manipulation language
Procedural DML -- require a user to specify what data are needed and
how to get those data.
Declarative DML -- require a user to specify what data are needed
without specifying how to get those data.
● Declarative DMLs are usually easier to learn and use than are procedural
DMLs.
● Declarative DMLs are also referred to as non-procedural DMLs
● The portion of a DML that involves information retrieval is called a query
language.
46
SQL Query Language
● The most widely used commercial language
Example to find all instructors in Comp. Sci. dept
○ select name
○ from instructor
○ where dept_name = 'Comp. Sci.'
● SQL is NOT a Turing machine equivalent language
● To be able to compute complex functions SQL is usually embedded in
some higher-level language
● Application programs generally access databases through one of
○ Language extensions to allow embedded SQL
○ Application program interface (e.g., ODBC/JDBC) which allow SQL queries to be sent to a
database
47
Database Access from Application Program
● Non-procedural query languages such as SQL are not as powerful as a
universal Turing machine.
● SQL does not support actions such as input from users, output to
displays, or communication over the network.
● Such computations and actions must be written in a host language, such
as C/C++, Java or Python, with embedded SQL queries that access the
data in the database.
● Application programs -- are programs that are used to interact with the
database in this fashion.
48
Data Dictionary
● We can define a data dictionary as a DBMS component that stores the
definition of data characteristics and relationships. You may recall that such
“data about data” were labeled metadata.
● The DBMS data dictionary provides the DBMS with its self-describing
characteristic. In effect, the data dictionary resembles an X-ray of the
company’s entire data set, and is a crucial element in the data administration
function.
● Two main types of data dictionary exist: integrated and stand-alone. An
integrated data dictionary is included with the DBMS. For example, all relational
DBMSs include a built-in data dictionary or system catalog that is frequently
accessed and updated by the RDBMS. Other DBMSs, especially older types, do
not have a built-in data dictionary; instead, the DBA may use third-party
stand-alone data dictionary systems.
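● A minimal sketch of querying such a catalog, assuming a DBMS that exposes the
SQL-standard information_schema views (e.g., PostgreSQL or MySQL); the exact catalog
tables and the schema name used below vary by product:
select table_name
from information_schema.tables
where table_schema = 'public';   -- 'public' is an assumption; depends on the DBMS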
49
Database Design
● The process of designing the general structure of the database:
● Logical Design – Deciding on the database schema. Database design
requires that we find a “good” collection of relation schemas.
○ Business decision – What attributes should we record in the database?
○ Computer Science decision – What relation schemas should we have and how should the
attributes be distributed among the various relation schemas?
● Physical Design – Deciding on the physical layout of the database
50
Database Engine
● A database system is partitioned into modules that deal with each of the
responsibilities of the overall system.
● The functional components of a database system can be divided into
○ The storage manager,
○ The query processor component,
○ The transaction management component.
51
Database System Internals
52
Storage Manager
● A storage manager is a program module that provides the interface
between the low level data stored in the database and the application
programs and queries submitted to the system.
● The storage manager is responsible for the interaction with the file
manager. The raw data are stored on the disk using the file system, which
is usually provided by a conventional operating system. The storage
manager translates the various DML statements into low-level file-system
commands. Thus, the storage manager is responsible for storing,
retrieving, and updating data in the database.
53
Storage Manager
The storage manager components include:
Authorization and integrity manager, which tests for the satisfaction of integrity
constraints and checks the authority of users to access data.
Transaction manager, which ensures that the database remains in a consistent
(correct) state despite system failures, and that concurrent transaction executions
proceed without conflicting.
File manager, which manages the allocation of space on disk storage and the data
structures used to represent information stored on disk.
Buffer manager, which is responsible for fetching data from disk storage into main
memory, and deciding what data to cache in main memory. The buffer manager is a
critical part of the database system, since it enables the database to handle data
sizes that are much larger than the size of main memory.
54
Query Processor
● The query processor components include
○ DDL interpreter, which interprets DDL statements and records the definitions in the data
dictionary.
○ DML compiler, which translates DML statements in a query language into an evaluation
plan consisting of low-level instructions that the query evaluation engine understands.
● A query can usually be translated into any of a number of alternative
evaluation plans that all give the same result. The DML compiler also
performs query optimization, that is, it picks the lowest cost evaluation
plan from among the alternatives.
○ Query evaluation engine, which executes low-level instructions generated by the DML
compiler.
55
Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
56
Transaction Management
● What if the system fails?
● What if more than one user is concurrently updating the same data?
● A transaction is a collection of operations that performs a single logical
function in a database application
● Transaction-management component ensures that the database
remains in a consistent (correct) state despite system failures (e.g., power
failures and operating system crashes) and transaction failures.
● Concurrency-control manager controls the interaction among the
concurrent transactions, to ensure the consistency of the database.
57
Transaction manager
● A transaction is a collection of operations that performs a single logical
function in a database application.
● Each transaction is a unit of both atomicity and consistency. Thus, we
require that transactions do not violate any database-consistency
constraints. That is, if the database was consistent when a transaction
started, the database must be consistent when the transaction
successfully terminates.
● Transaction - manager ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating
system crashes) and transaction failures.
58
Database Architecture
The architecture of a database system is greatly influenced by the underlying computer
system on which the database is running:
● Centralized
○ A centralized database is a database that is located, stored, and maintained in a single location.
This location is most often a central computer or database system, for example a desktop or
server CPU, or a mainframe computer.
● Client-server
○ A client-server database is one where the database resides on a server, and client applications are written to access the
database.
● Parallel (multi-processor)
○ A parallel database system seeks to improve performance through parallelization of various
operations, such as loading data, building indexes and evaluating queries. Parallel databases
improve processing and input/output speeds by using multiple CPUs and disks in parallel.
● Distributed
○ A distributed database is stored across multiple sites (computers) connected by a network,
while appearing to users as a single database.
59
Database Architecture
● Database applications are usually partitioned into two or three parts. In a two-tier
architecture, the application resides at the client machine, where it invokes database
system functionality at the server machine through query language statements.
● Application program interface standards like ODBC and JDBC are used for interaction
between the client and the server.
● In contrast, in a three-tier architecture, the client machine acts as merely a front end and
does not contain any direct database calls. Instead, the client end communicates with an
application server, usually through a forms interface.
● The application server in turn communicates with a database system to access data. The
business logic of the application, which says what actions to carry out under what
conditions, is embedded in the application server, instead of being distributed across
multiple clients.
● Three-tier applications are more appropriate for large applications, and for applications
that run on the World Wide Web.
60
Database Applications
● Database applications are usually partitioned into two or three parts.
○ Two-tier architecture -- the application resides at the client machine, where
it invokes database system functionality at the server machine
○ Three-tier architecture -- the client machine acts as a front end and does not
contain any direct database calls.
■ The client end communicates with an application server, usually through
a forms interface.
■ The application server in turn communicates with a database system to access
data.
61
Two-tier and Three-tier Architecture
62
Design Approaches
● Need to come up with a methodology to ensure that each of the relations
in the database is “good”
63
Entity Relationship Model
● The entity-relationship (E-R) data model uses a collection of basic objects,
called entities, and relationships among these objects.
● An entity is a “thing” or “object” in the real world that is distinguishable
from other objects. The entity relationship model is widely used in
database design.
64
Semi Structured Data Model
● The semi-structured data model permits the specification of data where
individual data items of the same type may have different sets of attributes.
● This is in contrast to the data models mentioned earlier, where every data
item of a particular type must have the same set of attributes.
● The Extensible Markup Language (XML) is widely used to represent
semi-structured data.
65
Object Based Model
● Object-oriented programming (especially in Java, C++, or C#) has become the
dominant software-development methodology. This led to the development of
an object-oriented data model that can be seen as extending the E-R model
with notions of encapsulation, methods (functions), and object identity.
● The object-relational data model combines features of the object-oriented
data model and relational data model.
66
Database Users and Administrators
67
Database Users
● Naive users -- unsophisticated users who interact with the system by invoking
one of the application programs that have been written previously.
● Application programmers -- are computer professionals who write application
programs.
● Sophisticated users -- interact with the system without writing programs
○ using a database query language or
○ by using tools such as data analysis software.
● Specialized users --write specialized database applications that do not fit into
the traditional data-processing framework. For example, CAD, graphic data,
audio, video.
68
Database Administrator
A person who has central control over the system is called a database
administrator (DBA), whose functions are:
● Schema definition
● Storage structure and access-method definition
● Schema and physical-organization modification
● Granting of authorization for data access
● Routine maintenance
● Periodically backing up the database
● Ensuring that enough free disk space is available for normal operations,
and upgrading disk space as required
● Monitoring jobs running on the database and ensuring that performance is
not degraded by very expensive tasks submitted by some users
69
References
1. Database System Concepts, Abraham Silberschatz, Henry F. Korth, S.
Sudarshan, McGraw-Hill Higher Education
2. Database Management Systems, 2nd Edition, Raghu Ramakrishnan,
Johannes Gehrke
70
Database Design:
Entity
Relationship Model
Kathmandu University
BE Computer Engineering
Year II/II
Outline
● Design Process
● Entity Relationship Model
● Constraints
● Keys
● Entity-Relationship Diagram
● Design Issues
● Weak Entity Sets
● Extended E-R features
● Design of an E-R Database Schema
● Reduction of an E-R Schema to tables
Design Phases
▪ Initial phase -- characterize fully the data needs of the
prospective database users.
▪ Second phase -- choosing a data model
• Applying the concepts of the chosen data model
• Translating these requirements into a conceptual schema
of the database.
• A fully developed conceptual schema indicates the
functional requirements of the enterprise.
▪ Describe the kinds of operations (or transactions) that
will be performed on the data.
Design Phases (Cont.)
▪ Final Phase -- Moving from an abstract data model to the
implementation of the database
• Logical Design – Deciding on the database schema.
Database design requires that we find a “good”
collection of relation schemas.
▪ Business decision – What attributes should we
record in the database?
▪ Computer Science decision – What relation
schemas should we have and how should the
attributes be distributed among the various
relation schemas?
• Physical Design – Deciding on the physical layout of
the database
Design Alternatives
▪ In designing a database schema, we must ensure that
we avoid two major pitfalls:
• Redundancy: a bad design may result in repeated
information.
▪ Redundant representation of information may
lead to data inconsistency among the various
copies of information
• Incompleteness: a bad design may make certain
aspects of the enterprise difficult or impossible to
model.
▪ Avoiding bad designs is not enough. There may be a
large number of good designs from which we must
choose.
Design Approaches
▪ Entity Relationship Model (covered in this unit)
• Models an enterprise as a collection of entities and
relationships
▪ Entity: a “thing” or “object” in the enterprise that is
distinguishable from other objects
• Described by a set of attributes
▪ Relationship: an association among several entities
• Represented diagrammatically by an entity-relationship
diagram:
▪ Normalization Theory (another chapter)
• Formalize what designs are bad, and test for them
Outline of the ER Model
ER model -- Database Modeling
▪ The ER data model was developed to facilitate database
design by allowing specification of an enterprise schema
that represents the overall logical structure of a
database.
▪ The ER data model employs three basic concepts:
• entity sets,
• relationship sets,
• attributes.
▪ The ER model also has an associated diagrammatic
representation, the ER diagram, which can express the
overall logical structure of a database graphically.
Entity Sets
▪ An entity is an object that exists and is distinguishable
from other objects.
• Example: specific person, company, event, plant
▪ An entity set is a set of entities of the same type that
share the same properties.
• Example: set of all persons, companies, trees,
holidays
▪ An entity is represented by a set of attributes; i.e.,
descriptive properties possessed by all members of an
entity set.
• Example:
instructor = (ID, name, salary )
course= (course_id, title, credits)
▪ A subset of the attributes form a primary key of the
entity set; i.e., uniquely identifying each member of the
set.
Entity Sets -- instructor and student
Representing Entity sets in ER Diagram
(Figure: mapping cardinalities — many-to-one and many-to-many mappings
between entity sets A and B)
Note: Some elements in A and B may not be mapped to any
elements in the other set
Representing Cardinality Constraints in ER Diagram
▪ We express cardinality constraints by drawing either a
directed line (→), signifying “one,” or an undirected line (—),
signifying “many,” between the relationship set and the
entity set.
Representing Relationship Sets
▪ A many-to-many relationship set is represented as a
schema with attributes for the primary keys of the two
participating entity sets, and any descriptive attributes of
the relationship set.
▪ Example: schema for relationship set advisor
advisor = (s_id, i_id)
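▪ A possible SQL declaration of this relationship schema (a sketch; it assumes
student and instructor tables already exist and that varchar(5) keys are used,
which are naming and typing assumptions, not part of the original slides):
create table advisor (
s_id varchar(5),
i_id varchar(5),
primary key (s_id, i_id),
foreign key (s_id) references student,
foreign key (i_id) references instructor);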
Redundancy of Schemas
▪ Many-to-one and one-to-many relationship sets that are total
on the many-side can be represented by adding an extra
attribute to the “many” side, containing the primary key of the
“one” side
▪ Example: Instead of creating a schema for relationship set
inst_dept, add an attribute dept_name to the schema arising
from entity set instructor
Redundancy of Schemas (Cont.)
▪ For one-to-one relationship sets, either side can be
chosen to act as the “many” side
• That is, an extra attribute can be added to either
of the tables corresponding to the two entity sets
▪ If participation is partial on the “many” side, replacing
a schema by an extra attribute in the schema
corresponding to the “many” side could result in null
values
Redundancy of Schemas (Cont.)
▪ The schema corresponding to a relationship set linking a
weak entity set to its identifying strong entity set is
redundant.
▪ Method 2:
• Form a schema for each entity set with all local and
inherited attributes
schema: attributes
person: ID, name, street, city
student: ID, name, street, city, tot_cred
employee: ID, name, street, city, salary
Kathmandu University
BE Computer engineering/
BSc Computer Science
Year II/II
Outline
● Relational Algebraic Operations
● Operations on Basic SQL
● Operations on Intermediate SQL
Relational Model
● Relational data model is the primary data model, which is used widely
around the world for data storage and processing. This model is simple
and it has all the properties and capabilities required to process data with
storage efficiency.
(Figure: an example relation; the rows are tuples and the columns are attributes)
Attribute
▪ The set of allowed values for each attribute is called the
domain of the attribute
▪ Attribute values are (normally) required to be atomic; that
is, indivisible
▪ The special value null is a member of every domain. It
indicates that the value is “unknown”
▪ The null value causes complications in the definition of
many operations
Relations are Unordered
▪ Order of tuples is irrelevant (tuples may be stored in an
arbitrary order)
▪ Example: instructor relation with unordered tuples
Database Schema
▪ Database schema - is the logical structure of the
database.
▪ Database instance - is a snapshot of the data in the
database at a given instant in time.
▪ Example:
• schema: instructor (ID, name, dept_name, salary)
• Instance:
Keys
▪ Let K ⊆ R
▪ K is a superkey of R if values for K are sufficient to identify a
unique tuple of each possible relation r(R)
• Example: {ID} and {ID,name} are both superkeys of instructor.
• Referenced relation
Find referencing relation and referenced relation from the university database schema
Relational Query Languages
▪ Procedural versus non-procedural, or declarative
▪ “Pure” languages:
• Relational algebra
• union: ∪
• set difference: –
• Cartesian product: x
• rename: ρ
Select Operation
▪ The select operation selects tuples that satisfy a given predicate.
▪ Notation: σ p(r)
▪ p is called the selection predicate
▪ Example: select those tuples of the instructor relation where the
instructor is in the “Physics” department.
• Query
σ dept_name=“Physics” (instructor)
• Result
Select Operation (Cont.)
▪ We allow comparisons using
=, ≠, >, ≥, <, ≤
in the selection predicate.
▪ We can combine several predicates into a larger predicate by using
the connectives:
∧ (and), ∨ (or), ¬ (not)
▪ Example: Find the instructors in Physics with a salary greater than
$90,000. We write:
σ dept_name=“Physics” ∧ salary>90000 (instructor)
(Figure: the instructor × teaches table)
Join Operation
▪ The Cartesian-Product
instructor X teaches
associates every tuple of instructor with every tuple of
teaches.
• Most of the resulting rows have information about instructors who did NOT
teach a particular course.
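▪ To keep only the meaningful pairs — an instructor matched with the sections that
instructor actually teaches — a selection is applied on top of the product; a sketch
in the notation used above:
σ instructor.ID = teaches.ID (instructor × teaches)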
▪ Example: Find the set of all courses taught in both the Fall
2017 and the Spring 2018 semesters.
∏course_id (σ semester=“Fall” Λ year=2017 (section)) ∩
∏course_id (σ semester=“Spring” Λ year=2018 (section))
• Result
Set Difference Operation
▪ The set-difference operation allows us to find tuples that are in
one relation but are not in another.
▪ Notation r – s
▪ Set differences must be taken between compatible relations.
• r and s must have the same arity
• attribute domains of r and s must be compatible
▪ Example: to find all courses taught in the Fall 2017 semester, but
not in the Spring 2018 semester
▪ Query 2
σ dept_name=“Physics” (σ salary > 90000 (instructor))
▪ The two queries are not identical; they are, however, equivalent
-- they give the same result on any database.
Equivalent Queries
Basic SQL
History
▪ IBM Sequel language developed as part of System R project at
the IBM San Jose Research Laboratory
▪ Renamed Structured Query Language (SQL)
▪ ANSI and ISO standard SQL:
• SQL-86
• SQL-89
• SQL-92
• SQL:2003
▪ Example:
create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))
Integrity Constraints in Create Table
▪ Types of integrity constraints
• primary key (A1, ..., An )
• not null
• Ai represents an attribute
• Ri represents a relation
• P is a predicate.
• For common attributes (e.g., ID), the attributes in the resulting table are
renamed using the relation name (e.g., instructor.ID)
▪ Find the names of all instructors in the Art department who have
taught some course, along with the course_id
• select name, course_id
from instructor , teaches
where instructor.ID = teaches.ID and instructor. dept_name = 'Art'
The Rename Operation
▪ The SQL allows renaming relations and attributes using the
as clause:
old-name as new-name
▪ Tuple comparison
• select name, course_id
from instructor, teaches
where (instructor.ID, dept_name) = (teaches.ID, 'Biology');
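▪ A small sketch of the as clause used for relation aliases (illustrative example on the
instructor relation from these slides): find the names of instructors who earn more than
some instructor in the Comp. Sci. department:
select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Comp. Sci.';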
Set Operations
▪ Find courses that ran in Fall 2017 or in Spring 2018
(select course_id from section where sem = 'Fall' and year = 2017)
union
(select course_id from section where sem = 'Spring' and year = 2018)
● Find courses that ran in Fall 2017 but not in Spring 2018
(select course_id from section where sem = 'Fall' and year = 2017)
except
(select course_id from section where sem = 'Spring' and year = 2018)
Set Operations (Cont.)
▪ Set operations union, intersect, and except
• Each of the above operations automatically eliminates duplicates
• intersect all
• except all.
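▪ To retain duplicates, the all forms can be used; for example, a sketch built on the
Fall 2017 / Spring 2018 queries above:
(select course_id from section where sem = 'Fall' and year = 2017)
union all
(select course_id from section where sem = 'Spring' and year = 2018)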
Null Values
▪ It is possible for tuples to have a null value, denoted by
null, for some of their attributes
▪ null signifies an unknown value or that a value does
not exist.
▪ The result of any arithmetic expression involving null is
null
• Example: 5 + null returns null
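▪ Because comparisons with null evaluate to unknown, SQL provides the is null predicate
to test for nulls; a minimal sketch on the instructor relation:
select name
from instructor
where salary is null;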
▪ A subquery can appear in several parts of an SQL query, as follows:
• From clause: ri can be replaced by any valid subquery
• Select clause:
Ai can be replaced by a subquery that generates a single value.
Set Membership
▪ Find courses offered in Fall 2017 and in Spring 2018
select distinct course_id
from section
where semester = 'Fall' and year= 2017 and
course_id in (select course_id
from section
where semester = 'Spring' and year= 2018);
▪ Find the names of all instructors whose salary is greater than the
salary of at least one (some) instructor in the Biology department
select name
from instructor
where salary > some (select salary
from instructor
where dept_name = 'Biology');
Set Comparison – “all” Clause
▪ Find the names of all instructors whose salary is greater than
the salary of all instructors in the Biology department.
select name
from instructor
where salary > all (select salary
from instructor
where dept name = 'Biology');
Test for Absence of Duplicate Tuples
▪ The unique construct tests whether a subquery has any
duplicate tuples in its result.
▪ The unique construct evaluates to “true” if a given
subquery contains no duplicates .
▪ Find all courses that were offered at most once in 2017
select T.course_id
from course as T
where unique ( select R.course_id
from section as R
where T.course_id= R.course_id
and R.year = 2017);
Subqueries in the From Clause
▪ SQL allows a subquery expression to be used in the from clause
▪ Find the average instructors’ salaries of those departments
where the average salary is greater than $42,000.
select dept_name, avg_salary
from ( select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name)
where avg_salary > 42000;
▪ or equivalently
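The equivalent formulation hinted at above presumably uses a having clause instead of a
from-clause subquery; a sketch:
select dept_name, avg (salary) as avg_salary
from instructor
group by dept_name
having avg (salary) > 42000;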
▪ Insertion of a new tuple, for example:
insert into course (course_id, title, dept_name, credits)
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);
Join Condition
▪ The on condition allows a general predicate over the relations being
joined.
▪ This predicate is written like a where clause predicate except for the
use of the keyword on.
▪ Query example
select *
from student join takes on student.ID = takes.ID
• The on condition above specifies that a tuple from student matches a tuple from takes if
their ID values are equal.
▪ Equivalent to:
select *
from student, takes
where student.ID = takes.ID
Outer Join
▪ An extension of the join operation that avoids loss of
information.
▪ Computes the join and then adds tuples from one
relation that do not match tuples in the other relation
to the result of the join.
▪ Uses null values.
▪ Three forms of outer join:
• left outer join
• right outer join
• full outer join
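▪ For instance (a sketch, using the course and prereq relations shown below):
select * from course natural left outer join prereq;
select * from course natural right outer join prereq;
select * from course natural full outer join prereq;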
Outer Join Examples
▪ Relation course
▪ Relation prereq
▪ Observe that
course information is missing for CS-437
prereq information is missing for CS-315
Left Outer Join
▪ course natural left outer join prereq
select name
from faculty
where dept_name = 'Biology'
▪ Create a view of department salary totals
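A sketch of such a view definition (the view name is illustrative, not from the original slides):
create view departments_total_salary(dept_name, total_salary) as
select dept_name, sum (salary)
from instructor
group by dept_name;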
• Rollback work. All the updates performed by the SQL statements in the
transaction are undone.
▪ Atomic transaction
• either fully executed or rolled back as if it never occurred
▪ not null
▪ primary key
▪ unique
▪ check (P), where P is a predicate
Not Null Constraints
▪ not null
• Declare name and budget to be not null
name varchar(20) not null
budget numeric(12,2) not null
Unique Constraints
▪ Example:
create table department
(dept_name varchar (20),
building varchar (15),
budget Dollars);
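▪ The unique constraint itself declares that a set of attributes forms a candidate key;
an illustrative sketch (the email attribute here is hypothetical, not part of the
university schema used in these slides):
create table student (
ID varchar(5),
name varchar(20) not null,
email varchar(50),
primary key (ID),
unique (email));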
Domains
▪ create domain construct in SQL-92 creates user-defined
domain types
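▪ For example, the Dollars type used in the department definition above could be
declared as a domain (a sketch of the SQL-92 syntax):
create domain Dollars as numeric(12,2);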
A B
1 4
1 5
3 7
▪ On this instance, B → A holds; A → B does NOT hold.
Closure of a Set of Functional Dependencies
• In general, α → β is trivial if β ⊆ α
Lossless Decomposition
▪ We can use functional dependencies to show when certain
decomposition are lossless.
▪ For the case of R = (R1, R2), we require that for all possible
relations r on schema R
r = ∏R1 (r) ⋈ ∏R2 (r)
▪ A decomposition of R into R1 and R2 is lossless
decomposition if at least one of the following dependencies
is in F+:
• R1 ∩ R2 → R1
• R1 ∩ R2 → R2
▪ The above functional dependencies are a sufficient
condition for lossless join decomposition; the dependencies
are a necessary condition only if all constraints are
functional dependencies
Example
▪ R = (A, B, C)
F = {A → B, B → C}
▪ R1 = (A, B), R2 = (B, C)
• Lossless decomposition:
R1 ∩ R2 = {B} and B → BC
▪ R1 = (A, B), R2 = (A, C)
• Lossless decomposition:
R1 ∩ R2 = {A} and A → AB
▪ Note:
• B → BC
is a shorthand notation for
• B → {B, C}
Dependency Preservation
▪ Testing functional dependency constraints each time the
database is updated can be costly
▪ It is useful to design the database in a way that constraints
can be tested efficiently.
▪ If testing a functional dependency can be done by
considering just one relation, then the cost of testing this
constraint is low
▪ When decomposing a relation, it may no longer be possible
to do the testing without having to perform a Cartesian
product.
▪ A decomposition that makes it computationally hard to
enforce functional dependency is said to be NOT
dependency preserving.
Dependency Preservation Example
▪ Consider a schema:
dept_advisor(s_ID, i_ID, dept_name)
▪ With functional dependencies:
i_ID → dept_name
s_ID, dept_name → i_ID
▪ In the above design we are forced to repeat the department name
once for each time an instructor participates in a dept_advisor
relationship.
▪ To fix this, we need to decompose dept_advisor
▪ Any decomposition will not include all the attributes in
s_ID, dept_name → i_ID
▪ Thus, the decomposition is NOT dependency preserving
Normal Forms
Boyce-Codd Normal Form
▪ A relation schema R is in BCNF with respect to a set F of
functional dependencies if for all functional dependencies
in F+ of the form
α→β
where α ⊆ R and β ⊆ R, at least one of the following
holds:
• α → β is trivial (i.e., β ⊆ α)
• α is a superkey for R
Boyce-Codd Normal Form (Cont.)
▪ Example schema that is not in BCNF:
in_dep (ID, name, salary, dept_name, building, budget )
because :
• dept_name→ building, budget
▪ holds on in_dep
▪ but
• dept_name is not a superkey
▪ When we decompose in_dep into instructor and department
• instructor is in BCNF
• department is in BCNF
Decomposing a Schema into BCNF
▪ Let R be a schema that is not in BCNF. Let α → β be the
FD that causes a violation of BCNF.
▪ We decompose R into:
• (α U β )
• (R-(β-α))
▪ In our example of in_dep,
• α = dept_name
• β = building, budget
and in_dep is replaced by
• (α U β ) = ( dept_name, building, budget )
• ( R - ( β - α ) ) = ( ID, name, dept_name, salary )
Example
▪ R = (A, B, C)
F = {A → B, B → C}
▪ R1 = (A, B), R2 = (B, C)
• Lossless-join decomposition:
R1 ∩ R2 = {B} and B → BC
• Dependency preserving
▪ R1 = (A, B), R2 = (A, C)
• Lossless-join decomposition:
R1 ∩ R2 = {A} and A → AB
• Not dependency preserving
(cannot check B → C without computing R1 ⋈ R2)
BCNF and Dependency Preservation
• R = (J, K, L )
• F = {JK → L, L → K }
• And an instance table:
J L K
j1 l1 k1
j2 l1 k1
j3 l1 k1
null l2 k2
▪ What is wrong with the table?
• Repetition of information
• Need to use null values (e.g., to represent the relationship l2, k2
where there is no corresponding value for J)
Comparison of BCNF and 3NF
▪ Advantages to 3NF over BCNF. It is always possible to
obtain a 3NF design without sacrificing losslessness or
dependency preservation.
▪ Disadvantages to 3NF.
• We may have to use null values to represent some of
the possible meaningful relationships among data
items.
• There is the problem of repetition of information.
Goals of Normalization
▪ Let R be a relation scheme with a set F of functional
dependencies.
▪ Decide whether a relation scheme R is in “good” form.
▪ In the case that a relation scheme R is not in “good” form,
decompose it into a set of relation scheme {R1, R2, ..., Rn}
such that
• Each relation scheme is in good form
• The decomposition is a lossless decomposition
• Preferably, the decomposition should be dependency
preserving.
How good is BCNF?
▪ There are database schemas in BCNF that do not seem to
be sufficiently normalized
▪ Consider a relation
inst_info (ID, child_name, phone)
• where an instructor may have more than one phone
and can have multiple children
• Instance of inst_info
How good is BCNF? (Cont.)
▪ There are no non-trivial functional dependencies and
therefore the relation is in BCNF
▪ Insertion anomalies – i.e., if we add a phone 981-992-3443 to
99999, we need to add two tuples
(99999, David, 981-992-3443)
(99999, William, 981-992-3443)
Higher Normal Forms
▪ It is better to decompose inst_info into:
• inst_child:
• inst_phone:
▪ Additional rules:
• Union rule: If α → β holds and α → γ holds, then α →
β γ holds.
• Decomposition rule: If α → β γ holds, then α → β
holds and α → γ holds.
• Pseudotransitivity rule:If α → β holds and γ β → δ
holds, then α γ → δ holds.
▪ The above rules can be inferred from Armstrong’s
axioms.
Procedure for Computing F+
▪ To compute the closure of a set of functional dependencies F:
F+=F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F +
for each pair of functional dependencies f1and f2 in F +
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F +
until F + does not change any further
▪ To compute α+, the closure of an attribute set α under F:
result := α;
while (changes to result) do
for each β → γ in F do
begin
if β ⊆ result then result := result ∪ γ
end
Example of Attribute Set Closure
▪ R = (A, B, C, G, H, I)
▪ F = {A → B
A→C
CG → H
CG → I
B → H}
▪ (AG)+
1. result = AG
2. result = ABCG (A → C and A → B)
3. result = ABCGH (CG → H and CG ⊆ AGBC)
4. result = ABCGHI (CG → I and CG ⊆ AGBCH)
▪ Is AG a candidate key?
1. Is AG a superkey?
1. Does AG → R? == Is (AG)+ ⊇ R
2. Is any subset of AG a superkey?
1. Does A → R? == Is (A)+ ⊇ R
2. Does G → R? == Is (G)+ ⊇ R
3. In general: check for each subset of size n-1
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
▪ Testing for superkey:
• To test if α is a superkey, we compute α+, and check if
α+ contains all attributes of R.
▪ Testing functional dependencies
• To check if a functional dependency α → β holds (or, in
other words, is in F+), just check if β ⊆ α+.
• That is, we compute α+ by using attribute closure, and
then check if it contains β.
• Is a simple and cheap test, and very useful
▪ Computing closure of F
• For each γ ⊆ R, we find the closure γ+, and for each S
⊆ γ+, we output a functional dependency γ → S.
Canonical Cover
▪ Suppose that we have a set of functional dependencies F on a relation
schema. Whenever a user performs an update on the relation, the
database system must ensure that the update does not violate any
functional dependencies; that is, all the functional dependencies in F are
satisfied in the new database state.
▪ If an update violates any functional dependencies in the set F, the
system must roll back the update.
▪ We can reduce the effort spent in checking for violations by testing a
simplified set of functional dependencies that has the same closure as
the given set.
▪ This simplified set is termed the canonical cover
▪ To define canonical cover we must first define extraneous attributes.
• An attribute of a functional dependency in F is extraneous if we
can remove it without changing F +
Extraneous Attributes
▪ Removing an attribute from the left side of a functional
dependency could make it a stronger constraint.
• For example, if we have AB → C and remove B, we get the
possibly stronger result A → C. It may be stronger
because A → C logically implies AB → C, but AB → C does
not, on its own, logically imply A → C
▪ But, depending on what our set F of functional dependencies
happens to be, we may be able to remove B from AB → C
safely.
• For example, suppose that
• F = {AB → C, A → D, D → C}
• Then we can show that F logically implies A → C, making B
extraneous in AB → C.
Extraneous Attributes (Cont.)
▪ Removing an attribute from the right side of a functional
dependency could make it a weaker constraint.
• For example, if we have AB → CD and remove C, we get
the possibly weaker result AB → D. It may be weaker
because using just AB → D, we can no longer infer AB →
C.
▪ But, depending on what our set F of functional dependencies
happens to be, we may be able to remove C from AB → CD
safely.
• For example, suppose that
F = {AB → CD, A → C}.
• Then we can show that even after replacing AB → CD by
AB → D, we can still infer AB → C and thus AB → CD.
Extraneous Attributes
▪ An attribute of a functional dependency in F is extraneous if we
can remove it without changing F +
▪ Consider a set F of functional dependencies and the functional
dependency α → β in F.
• Remove from the left side: Attribute A is extraneous in α if
▪ A ∈ α and
▪ F logically implies (F – {α → β}) ∪ {(α – A) → β}.
• or use the original set of dependencies F that hold on R, but with the following
test:
• for every set of attributes α ⊆ Ri, check that α+ (the attribute closure of
α) either includes no attribute of Ri- α, or includes all attributes of Ri.
▪ Case 2: B is in α.
• Since α is a candidate key, the third alternative in the
definition of 3NF is trivially satisfied.
• In fact, we cannot show that γ is a superkey.
• This shows exactly why the third alternative is present in
the definition of 3NF.
Q.E.D.
First Normal Form
▪ Domain is atomic if its elements are considered to be
indivisible units
• Examples of non-atomic domains:
▪ Set of names, composite attributes
▪ Identification numbers like CS101 that can be broken
up into parts
▪ A relational schema R is in first normal form if the domains
of all attributes of R are atomic
▪ Non-atomic values complicate storage and encourage
redundant (repeated) storage of data
• Example: Set of accounts stored with each customer, and
set of owners stored with each account
• We assume all relations are in first normal form (and
revisit this in Chapter 22: Object Based Databases)
First Normal Form (Cont.)
▪ Atomicity is actually a property of how the elements of the
domain are used.
• Example: Strings would normally be considered
indivisible
• Suppose that students are given roll numbers which are
strings of the form CS0012 or EE1127
• If the first two characters are extracted to find the
department, the domain of roll numbers is not atomic.
• Doing so is a bad idea: leads to encoding of information
in application program rather than in the database.
Transactions
Transaction Concept
▪ A transaction is a unit of program execution that accesses and possibly
updates various data items.
OR, a transaction is the DBMS’s abstract view of a user program: a series of
reads/writes of database objects
▪ Users submit transactions, and can think of each transaction as executing by
itself
• The concurrency is achieved by the DBMS, which interleaves actions of the various
transactions
▪ E.g. transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
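▪ Expressed in SQL, the same transfer might look like the following sketch (assuming a
hypothetical account(account_id, balance) table; the statement that starts a transaction
varies by DBMS):
start transaction;  -- some systems use begin
update account set balance = balance - 50 where account_id = 'A';
update account set balance = balance + 50 where account_id = 'B';
commit;             -- or rollback if a failure occurs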
▪ Two main issues to deal with:
• Failures of various kinds, such as hardware failures and system crashes
T1                            T2
1. read(A)
2. A := A – 50
3. write(A)
                              read(A), read(B), print(A+B)
4. read(B)
5. B := B + 50
6. write(B)
▪ Isolation can be ensured trivially by running transactions serially
• That is, one after the other.
8
Consistency
9
Isolation
▪ Guarantee that even though transactions may
be interleaved, the net effect is identical to
executing the transactions serially
▪ For example, if transactions T1 and T2 are
executed concurrently, the net effect is
equivalent to executing
• T1 followed by T2, or
• T2 followed by T1
▪ NOTE: The DBMS provides no guarantee of
effective order of execution
10
Durability
▪ DBMS uses the log to ensure durability
▪ If the system crashed before the changes made by a completed
transaction are written to disk, the log is used to remember and
restore these changes when the system is restarted
▪ Again, this is handled by the recovery manager
11
Transaction State
▪ Active – the initial state; the transaction stays in this state while it
is executing
▪ Partially committed – after the final statement has been
executed.
▪ Failed -- after the discovery that normal execution can no longer
proceed.
▪ Aborted – after the transaction has been rolled back and the
database restored to its state prior to the start of the transaction.
Two options after it has been aborted:
• restart the transaction (can be done only if no internal logical error caused the abort)
• kill the transaction
▪ Committed – after successful completion.
Concurrent Executions
▪ Multiple transactions are allowed to run concurrently in the system. Advantages are:
• Increased processor and disk utilization, leading to better transaction throughput; e.g., one transaction can be using the CPU while another is reading from or writing to the disk
• Reduced average response time for transactions: short transactions need not wait behind long ones.
▪ When interleaving transactions, the system must preserve the order in which the instructions appear in each individual transaction.
21
a. Serial Schedule
• In a serial schedule, the operations of one transaction are executed completely before the next transaction begins; there is no interleaving of operations.
23
b. Non-serial Schedule
• If interleaving of operations is allowed, the result is a non-serial schedule.
• There are many possible orders in which the system can execute the individual operations of the transactions.
• In the given figures (c) and (d), Schedule C and Schedule D are non-serial schedules; they have interleaving of operations.
24
Example: Non-serial schedule
25
c. Serializable Schedule
• A serializable schedule is a schedule whose effect on any consistent database instance is identical to that of some complete serial schedule.
• Serializability is used to find non-serial schedules that allow the transactions to execute concurrently without interfering with one another.
• It identifies which schedules are correct when the execution of the transactions has interleaving of their operations.
• A non-serial schedule is serializable if its result is equal to the result of its transactions executed serially.
26
Anomalies with interleaved execution
▪ Two actions on the same data object conflict if at least one of
them is a write
▪ We’ll now consider THREE ways in which a schedule involving two
consistency-preserving transactions can leave a consistent
database inconsistent
27
Problems associated with concurrency
● To make the system efficient and save time, more than one transaction may be executed at the same time (concurrently). But concurrency also leads to several problems.
● In a database transaction, the two main operations are READ and WRITE. These operations must be managed carefully during concurrent execution, because if they are interleaved without control, the data may become inconsistent.
● These problems are commonly referred to as concurrency problems in a database environment.
1. Lost Update Problem (W-W Conflict)
2. Temporary Update or Dirty Read Problem (W-R Conflict)
3. Unrepeatable Read Problem (R-W Conflict)
Temporary Update Problem (W-R Conflict)
Dirty Read Problem
The temporary update or dirty read problem occurs when one transaction updates an item and then fails, but the updated item is read by another transaction before the item is changed back to its previous value.
Consider two transactions TX and TY in the diagram below performing read/write operations on account A, where the available balance in account A is $300:
In the above transaction instance, if TX fails for some reason, then A will revert back to its previous value. But transaction TY has already read the incorrect (dirty) value of A.
Lost Update Problem (W-W Conflict)
The problem occurs when two different transactions perform read/write operations on the same database item in an interleaved manner (i.e., concurrent execution) in such a way that the final value of the item is incorrect, making the database inconsistent.
Consider the schedule below, where two transactions TX and TY are performed on the same account A whose balance is $300.
Thus, in order to maintain consistency in the database and avoid such problems in concurrent execution, control is needed, and that is where the concept of Concurrency Control comes into play. A small simulation of this schedule is sketched below.
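The following is a minimal Python sketch, not part of the original slides, that replays a lost-update (W-W) schedule on an account with balance $300; the two transactions and the amounts (a $50 withdrawal by TX and a $100 deposit by TY) are illustrative assumptions.

def run_serial():
    A = 300
    a = A; a = a - 50; A = a      # TX: read(A), A := A - 50, write(A)
    a = A; a = a + 100; A = a     # TY: read(A), A := A + 100, write(A)
    return A                      # 350

def run_interleaved():
    A = 300
    tx_a = A                      # TX: read(A)
    ty_a = A                      # TY: read(A), before TX writes A back
    tx_a = tx_a - 50              # TX: A := A - 50
    ty_a = ty_a + 100             # TY: A := A + 100
    A = tx_a                      # TX: write(A)
    A = ty_a                      # TY: write(A), overwrites TX's update
    return A                      # 400: TX's withdrawal is lost

print(run_serial())               # 350
print(run_interleaved())          # 400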
Aborting
▪ If a transaction Ti is aborted, then all actions
must be undone
• Also, if Tj reads object last written by Ti, then Tj must
be aborted!
32
The log
▪ The following facts are recorded in the log
• “Ti writes an object”: store new and old values
33
Connection to Normalization
34
Serializability
Consider a set of transactions (T1, T2, ..., Ti). S1 is the state of database
after they are concurrently executed and successfully completed and S2 is
the state of database after they are executed in any serial manner
(one-by-one) and successfully completed. If S1 and S2 are same then the
database maintains serializability.
[Figures: Schedule 3 and Schedule 6]
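As an aside (not from the slides), conflict serializability is usually tested by building a precedence graph: add an edge Ti → Tj whenever an operation of Ti conflicts with, and comes before, an operation of Tj; the schedule is conflict serializable iff the graph is acyclic. A minimal Python sketch follows, using an illustrative two-transaction schedule rather than the schedules in the figures.

def precedence_graph(schedule):
    # schedule is a list of (transaction, operation, data item) triples
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and 'W' in (op_i, op_j):
                edges.add((ti, tj))          # earlier conflicting op: edge Ti -> Tj
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, stack = set(), set()
    def dfs(u):
        visited.add(u); stack.add(u)
        for v in graph.get(u, ()):
            if v in stack or (v not in visited and dfs(v)):
                return True
        stack.discard(u)
        return False
    return any(dfs(u) for u in list(graph) if u not in visited)

s = [('T1', 'R', 'A'), ('T2', 'W', 'A'), ('T2', 'R', 'B'), ('T1', 'W', 'B')]
print(has_cycle(precedence_graph(s)))   # True: not conflict serializable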
Conflict Serializability (Cont.)
▪ Example of a schedule that is not conflict serializable:
▪ Consider a transaction that locks each item in shared mode, reads it, and unlocks it immediately before displaying a total:
lock-S(A); read(A); unlock(A);
lock-S(B); read(B); unlock(B);
display(A+B)
▪ Locking as above is not sufficient to guarantee serializability: if another transaction updates A and B between the unlock of A and the read of B, an inconsistent sum may be displayed
Lock-Based Protocols (Cont.)
▪ Lock-compatibility matrix (a lock is granted only if it is compatible with all locks already held on the item):
        S       X
  S   true    false
  X   false   false
▪ A locking protocol:
▪ Is a set of rules to be followed by each transaction to ensure
that only serializable schedules are allowed (extended later)
▪ Associates a lock with each database object, which could be of
different types (e.g., shared or exclusive)
▪ Grants and denies locks to transactions according to the
specified rules
▪ The part of the DBMS that keeps track of locks is called the
lock manager
Schedule With Lock Grants
▪ Grants omitted in rest of
chapter
• Assume grant happens
just before the next
instruction following lock
request
▪ A locking protocol is a set of
rules followed by all
transactions while
requesting and releasing
locks.
▪ Locking protocols enforce
serializability by restricting
the set of possible
schedules.
Locking Protocols
▪ Given a locking protocol (such as 2PL)
• A schedule S is legal under a locking protocol if it can be generated by a
set of transactions that follow the protocol
• A protocol ensures serializability if all legal schedules under that
protocol are serializable
The Two-Phase Locking Protocol (Cont.)
▪ Two-phase locking does not ensure freedom from deadlocks
▪ Extensions to basic two-phase locking are needed to ensure recoverability and freedom from cascading rollback
[Figure: for each database object, the lock manager keeps the currently granted lock and a queue of waiting lock requests (times t0-t2)]
Two-Phase Locking
▪ A widely used locking protocol, called Two-Phase
Locking (2PL), has two rules:
▪ Rule 1: if a transaction T wants to read (or write) an
object O, it first requests the lock manager for a shared
(or exclusive) lock on O
[Figure sequence, times t3-t10: transactions T0, T1, T2 issue shared and exclusive lock requests on object O; the lock manager grants compatible requests, queues conflicting ones (lock denied), and grants queued requests once the conflicting lock on O is released]
Two-Phase Locking
▪ A widely used locking protocol, called Two-Phase Locking
(2PL), has two rules:
▪ Rule 2: T can release locks before it commits or aborts, and
cannot request additional locks once it releases any lock
[Figure: number of locks held by a transaction over time; acquiring a new lock after releasing any lock is a violation of 2PL]
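Below is a minimal, single-threaded Python sketch (not from the slides) of a lock manager with shared/exclusive modes and per-object queues, plus a transaction wrapper that enforces the 2PL rule of never acquiring a lock after a release. Class and method names are illustrative assumptions, and deadlock handling is omitted.

class LockManager:
    def __init__(self):
        self.locks = {}    # obj -> (mode, set of holding transactions)
        self.queue = {}    # obj -> list of waiting (txn, mode) requests

    def acquire(self, txn, obj, mode):           # mode is 'S' or 'X'
        held = self.locks.get(obj)
        compatible = held is None or (mode == 'S' and held[0] == 'S')
        if compatible and not self.queue.get(obj):
            m, holders = held or ('S', set())
            self.locks[obj] = (mode if mode == 'X' else m, holders | {txn})
            return True                           # granted
        self.queue.setdefault(obj, []).append((txn, mode))
        return False                              # transaction must wait

    def release(self, txn, obj):
        mode, holders = self.locks[obj]
        holders.discard(txn)
        if not holders:
            del self.locks[obj]                   # a real manager would now re-check the queue

class TwoPLTransaction:
    def __init__(self, name, lm):
        self.name, self.lm, self.shrinking = name, lm, False

    def lock(self, obj, mode):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after a release")
        return self.lm.acquire(self.name, obj, mode)

    def unlock(self, obj):
        self.shrinking = True                     # transaction enters its shrinking phase
        self.lm.release(self.name, obj)

lm = LockManager()
t1, t2 = TwoPLTransaction("T1", lm), TwoPLTransaction("T2", lm)
print(t1.lock("A", "S"))   # True: shared lock on A granted
print(t2.lock("A", "S"))   # True: S is compatible with S
print(t2.lock("B", "X"))   # True: nobody holds B
print(t1.lock("B", "S"))   # False: conflicts with T2's X lock, so T1 is queued
t1.unlock("A")             # T1 enters its shrinking phase
# t1.lock("C", "S") would now raise: 2PL forbids acquiring after releasing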
Automatic Acquisition of Locks
▪ A transaction Ti issues the standard read/write instruction, without explicit
locking calls.
▪ The operation read(D) is processed as:
if Ti has a lock on D
then
read(D)
else begin
if necessary wait until no other
transaction has a lock-X on D
grant Ti a lock-S on D;
read(D)
end
Automatic Acquisition of Locks (Cont.)
▪ write(D) is processed as:
if Ti has a lock-X on D
then
write(D)
else begin
if necessary wait until no other trans. has any lock on D,
if Ti has a lock-S on D
then
upgrade lock on D to lock-X
else
grant Ti a lock-X on D
write(D)
end;
T35:
read(B);
read(A);
if B = 0 then A := A + 1;
write(A)
Exercise: add lock and unlock instructions to transaction T35 so that it observes the two-phase locking protocol. For example, one valid answer acquires lock-S(B) before read(B) and lock-X(A) before read(A), and issues unlock(B) and unlock(A) only after write(A).
Resolving RW Conflicts Using 2PL
▪ Suppose that T1 and T2 actions are interleaved as follows:
▪ T1 reads A
▪ T2 reads A, decrements A and commits
▪ T1 tries to decrement A
▪ Under 2PL this interleaving cannot occur: T1 acquires a shared lock on A before reading, and since T1 still needs an exclusive lock on A to decrement it, it cannot release the shared lock first; T2's request for an exclusive lock on A is therefore blocked until T1 finishes.
Tree protocol:
1. Only exclusive locks are allowed.
2. The first lock by Ti may be on any data item. Subsequently, a data Q can be
locked by Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently
be relocked by Ti
Graph-Based Protocols (Cont.)
▪ The tree protocol ensures conflict serializability as well as freedom from
deadlock.
▪ Unlocking may occur earlier in the tree-locking protocol than in the
two-phase locking protocol.
• Shorter waiting times, and increase in concurrency
• Protocol is deadlock-free, no rollbacks are required
▪ Drawbacks
• Protocol does not guarantee recoverability or cascade freedom
▪ Need to introduce commit dependencies to ensure
recoverability
• Transactions may have to lock data items that they do not access.
▪ increased locking overhead, and additional waiting time
▪ potential decrease in concurrency
▪ Schedules not possible under two-phase locking are possible under the tree
protocol, and vice versa.
Performance of Locking
▪ Locking comes with delays mainly from blocking
[Figure: throughput versus the number of active transactions; throughput first rises, then drops off due to thrashing when too many transactions block one another]
▪ In a waits-for graph:
▪ The nodes correspond to active transactions
▪ There is an edge from Ti to Tj if and only if Ti is waiting for Tj
to release a lock
Deadlock Detection (Cont’d)
▪ The following schedule is NOT free of deadlocks:

T1      T2      T3      T4
S(A)
R(A)
        X(B)
        W(B)
S(B)
                S(C)
                R(C)
        X(C)
                        X(B)
                X(A)

▪ Waits-for graph (nodes are the active transactions; there is an edge from Ti to Tj if and only if Ti is waiting for Tj to release a lock): T1 → T2, T2 → T3, T3 → T1, T4 → T2
▪ A cycle T1 → T2 → T3 → T1 is detected in the waits-for graph; hence, a deadlock!
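A tiny Python sketch (not from the slides) of the cycle check a deadlock detector performs; the waits-for edges below are read off the reconstructed schedule above and are an assumption.

waits_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T2"]}

def find_cycle(graph):
    # depth-first search that returns the first cycle found, or None
    def visit(node, path):
        if node in path:
            return path[path.index(node):] + [node]
        for nxt in graph.get(node, []):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None
    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

print(find_cycle(waits_for))   # ['T1', 'T2', 'T3', 'T1'] -> deadlock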
Resolving Deadlocks
▪ A deadlock is resolved by aborting a transaction that is
on a cycle and releasing its locks
▪ This allows some of the waiting transactions to proceed
▪ Ensuring recoverability and cascade freedom (e.g., under timestamp-based protocols) can be done in several ways:
• Solution 1:
▪ A transaction is structured such that its writes are all performed at the end of its processing
▪ All writes of a transaction form an atomic action; no transaction may execute while a transaction is being written
▪ A transaction that aborts is restarted with a new timestamp
• Solution 2: Limited form of locking: wait for data to be committed before reading it
• Solution 3: Use commit dependencies to ensure recoverability
Thomas’ Write Rule
▪ Modified version of the timestamp-ordering protocol in which obsolete
write operations may be ignored under certain circumstances.
▪ When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
• Rather than rolling back Ti as the timestamp-ordering protocol would have done, this write operation can simply be ignored.
▪ Otherwise this protocol is the same as the timestamp ordering protocol.
▪ Thomas' Write Rule allows greater potential concurrency.
• Allows some view-serializable schedules that are not
conflict-serializable.
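A minimal Python sketch, not taken from the slides, of the write test under timestamp ordering with Thomas' write rule; the DataItem class and the timestamps used below are illustrative assumptions.

class DataItem:
    def __init__(self):
        self.r_ts = 0       # R-timestamp(Q): largest TS of any successful read
        self.w_ts = 0       # W-timestamp(Q): largest TS of any successful write
        self.value = None

def write(q, ti_ts, value):
    if ti_ts < q.r_ts:
        return "rollback"   # a later transaction has already read the old value
    if ti_ts < q.w_ts:
        return "ignore"     # Thomas' write rule: obsolete write, skip it
    q.value, q.w_ts = value, ti_ts
    return "written"

q = DataItem()
print(write(q, 10, "v1"))   # written
print(write(q, 5,  "v2"))   # ignore (TS(Ti) = 5 < W-timestamp(Q) = 10, no rollback)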
Validation-Based Protocol
▪ Idea: can we use commit time as serialization order?
▪ To do so:
• Postpone writes to end of transaction
• Keep track of data items read/written by
transaction
• Validation performed at commit time, detect any
out-of-serialization order reads/writes
▪ Also called optimistic concurrency control, since the transaction executes fully in the hope that all will go well during validation
Validation-Based Protocol
▪ Execution of transaction Ti is done in three phases.
1. Read and execution phase: Transaction Ti writes only to
temporary local variables
2. Validation phase: Transaction Ti performs a "validation test" to determine if its local variables can be written without violating serializability.
3. Write phase: If Ti is validated, the updates are applied to the
database; otherwise, Ti is rolled back.
▪ The three phases of concurrently executing transactions can be
interleaved, but each transaction must go through the three phases in that
order.
• We assume for simplicity that the validation and
write phase occur together, atomically and serially
▪ I.e., only one transaction executes validation/write at a
time.
Validation-Based Protocol (Cont.)
▪ Each transaction Ti has 3 timestamps
• StartTS(Ti) : the time when Ti started its execution
• ValidationTS(Ti): the time when Ti entered its
validation phase
• FinishTS(Ti) : the time when Ti finished its write
phase
▪ Validation tests use above timestamps and read/write sets to ensure that
serializability order is determined by validation time
• Thus, TS(Ti) = ValidationTS(Ti)
▪ Validation-based protocol has been found to give greater degree of
concurrency than locking/TSO if probability of conflicts is low.
Validation Test for Transaction Tj
▪ If, for all Ti with TS(Ti) < TS(Tj), one of the following conditions holds:
• finishTS(Ti) < startTS(Tj)
• startTS(Tj) < finishTS(Ti) < validationTS(Tj) and the
set of data items written by Ti does not intersect
with the set of data items read by Tj.
then validation succeeds and Tj can be committed.
▪ Otherwise, validation fails and Tj is aborted.
▪ Justification:
• The first condition applies when the executions do not overlap in time:
▪ the writes of Tj do not affect reads of Ti since they occur after Ti has finished.
• If the second condition holds, the executions overlap, but:
▪ the writes of Ti do not affect the reads of Tj, because the items written by Ti are not read by Tj, and
▪ the writes of Tj do not affect Ti, because Tj's write phase begins only after Ti has finished.
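A minimal Python sketch of this validation test, not from the slides; representing each transaction as a dictionary of its timestamps and read/write sets is an illustrative assumption.

def validate(tj, committed):
    """tj and each Ti in committed hold: start, validation, finish, read_set, write_set."""
    for ti in committed:
        if ti["validation"] >= tj["validation"]:
            continue                                  # only consider Ti with TS(Ti) < TS(Tj)
        if ti["finish"] < tj["start"]:
            continue                                  # condition 1: executions do not overlap
        if (tj["start"] < ti["finish"] < tj["validation"]
                and not (ti["write_set"] & tj["read_set"])):
            continue                                  # condition 2: Ti's writes unseen by Tj
        return False                                  # validation fails, Tj is aborted
    return True                                       # Tj may commit

t1 = {"start": 1, "validation": 3, "finish": 4,
      "read_set": {"A"}, "write_set": {"B"}}
t2 = {"start": 2, "validation": 6, "finish": 7,
      "read_set": {"B"}, "write_set": {"C"}}
print(validate(t2, [t1]))   # False: T1 wrote B, which T2 read while they overlapped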
Schedule Produced by Validation
▪ Example of schedule produced using validation
Database Recovery
Techniques
Outline
▪ Failure Classification
▪ Storage Structure
▪ Recovery and Atomicity
▪ Log-Based Recovery
▪ Remote Backup Systems
Failure Classification
▪ Transaction failure :
• Logical errors: transaction cannot complete due to some internal error
condition
• System errors: the database system must terminate an active transaction
due to an error condition (e.g., deadlock)
▪ System crash: a power failure or other hardware or software
failure causes the system to crash.
• Fail-stop assumption: non-volatile storage contents are assumed to not
be corrupted by system crash
▪ Database systems have numerous integrity checks to prevent
corruption of disk data
▪ Disk failure: a head crash or similar disk failure destroys all or
part of disk storage
• Destruction is assumed to be detectable: disk drives use checksums to
detect failures
Recovery Manager
▪ Volatile storage:
• Does not survive system crashes
• Examples: main memory, cache memory
▪ Nonvolatile storage:
• Survives system crashes
• Examples: disk, tape, flash memory, non-volatile RAM
• But may still fail, losing data
▪ Stable storage:
• A mythical form of storage that survives all failures
• Approximated by maintaining multiple copies on
distinct nonvolatile media
Data Access
• write(X) assigns the value of local variable xi to data item X in the buffer block.
• Note: output(BX) need not immediately follow write(X). The system can perform the output operation when it deems fit.
▪ Transactions
• Must perform read(X) before accessing X for the first time (subsequent reads can be
from local copy)
[Figure: movement of data items x and y between disk blocks, main-memory buffer blocks, and transactions' local copies (x1, x2, y1)]
Recovery and Atomicity
[Figure: the shadow-copy scheme]
Log-Based Recovery
Log                          Database writes / block output
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                             A = 950
                             B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                             C = 600
                             BB, BC output   (BC is output before T1 commits)
<T1 commit>
                             BA output       (BA is output after T0 commits)

Note: BX denotes the block containing X.
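To make the idea concrete, here is a minimal Python sketch of recovery from such a log: committed transactions are redone using the new values and uncommitted ones are undone using the old values. This is a simplification (no checkpoints, no buffering), not the textbook's full algorithm.

def recover(log):
    db, committed = {}, set()
    for rec in log:
        if rec[1] == "commit":
            committed.add(rec[0])
    # redo phase: replay new values of committed transactions, in log order
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            ti, x, old, new = rec
            db[x] = new
    # undo phase: restore old values of uncommitted transactions, in reverse order
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] not in committed:
            ti, x, old, new = rec
            db[x] = old
    return db

log = [("T0", "start"), ("T0", "A", 1000, 950), ("T0", "B", 2000, 2050),
       ("T0", "commit"), ("T1", "start"), ("T1", "C", 700, 600)]
print(recover(log))   # {'A': 950, 'B': 2050, 'C': 700}  (T1 is undone)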
Concurrency Control and Recovery
▪ Updates of uncommitted transactions should not be visible to other transactions (otherwise it may be impossible to undo an aborted transaction correctly).
• Can be ensured by obtaining exclusive locks on updated items and holding the locks till the end of the transaction (strict two-phase locking)
▪ Checkpoints: scanning and reprocessing the entire log after a crash is time-consuming, and
• We might unnecessarily redo transactions which have already output their updates to the database.
• Only transactions that are in L or started after the checkpoint need to be redone or
undone
• Transactions that committed or aborted before the checkpoint already have all their
updates output to stable storage.
▪ Some earlier part of the log may be needed for undo operations
• Continue scanning backwards till a record <Ti start> is found for every transaction Ti
in L.
• Parts of log prior to earliest <Ti start> record above are not needed for recovery, and
can be erased whenever desired.
Example of Checkpoints
[Figure: transactions T1-T4 relative to checkpoint time Tc and failure time Tf; T1 completed before the checkpoint and can be ignored, T2 and T3 committed after the checkpoint and must be redone, and T4 was still active at the failure and must be undone]
Remote Backup Systems
a) Detection of failure: the backup site must be able to detect when the primary site has failed.
b) Transfer of control: when the primary site fails, the backup site takes over the processing and becomes the new primary site.
c) Time to recover: if the log at the remote backup becomes large, recovery will take a long time.
NoSQL Databases
● Non-tabular databases
● Store data differently than relational tables
● NoSQL = "Not Only SQL" or "Non-SQL"
i) Columnar databases
A columnar database is a type of database management system that stores data in columns
rather than rows, optimizing query performance by enabling efficient data retrieval and
analysis. Examples of columnar databases include:
■ Apache Cassandra
■ Amazon Redshift
■ Google BigQuery
■ Vertica
■ ClickHouse
■ Snowflake
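A small Python sketch, purely illustrative and not from the slides, contrasting a row-wise layout with a column-wise layout of the same records; it shows why a column store can answer an aggregate over one attribute by scanning a single array.

# Row-oriented layout: one record per entry
rows = [
    {"id": 1, "city": "Kathmandu", "amount": 120},
    {"id": 2, "city": "Lalitpur",  "amount":  80},
    {"id": 3, "city": "Kathmandu", "amount": 200},
]

# Column-oriented layout: one array per attribute
columns = {
    "id":     [1, 2, 3],
    "city":   ["Kathmandu", "Lalitpur", "Kathmandu"],
    "amount": [120, 80, 200],
}

# An aggregate like SUM(amount) only needs to scan one column
print(sum(columns["amount"]))            # 400
# A row store must touch every record to answer the same query
print(sum(r["amount"] for r in rows))    # 400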
Key Benefits of Columnar Databases …
1. Improved data compression
2. Enhanced query performance
3. Efficient use of cache memory
4. Vectorization and parallel processing
5. Improved analytics and reporting
6. Better handling of sparse data
7. Flexible indexing options
8. Ease of scalability
9. Real-time data analytics and updates
ii) Document store databases
A document database (also known as a document-oriented
database or a document store) is a database that stores
information in documents.
e.g., MongoDB
Document databases offer a variety of advantages, including:
● An intuitive data model that is fast and easy for developers
to work with
● A flexible schema that allows for the data model to evolve as
application needs change
// Reconstructed insertMany() example: the extraction scrambled the two documents.
// The collection name ("users") and the first document's name ("Alice") are
// placeholders; they are not recoverable from the original slide.
db.users.insertMany([
  {
    name: "Alice",        // placeholder value
    age: 25,
    city: "Lalitpur",
    skills: ["HTML", "CSS"],
    isActive: true
  },
  {
    name: "Carol",
    age: 28,
    city: "Kathmandu",
    skills: ["JavaScript", "Python"],
    isActive: true
  }
]);
Read Operations
• Read operations retrieve documents from a collection; i.e. query a
collection for documents.
• MongoDB provides the following methods to read documents from a
collection:
• db.collection.find()
• You can specify query filters or criteria that identify the
documents to return.
Update Operations
db.MyCollection.updateOne({name: "Marsh"},
{$set: {ownerAddress: "Lagankhel, Lalitpur"}})
Update Operations
• updateMany()
• updateMany() allows us to update multiple items by passing in a list of items.
• This update operation uses the same syntax for updating a single document.
Update Operations
replaceOne()
• The replaceOne() method replaces a single document in the specified collection.
• replaceOne() replaces the entire document, meaning fields in the old document that are not contained in the new one will be lost.
Delete Operations
db.RecordsDB.deleteOne({name: "Marki"})
db.collection.deleteMany()
• deleteMany() is a method used to delete multiple documents from a desired
collection with a single delete operation.
• A list is passed into the method and the individual items are defined with
filter criteria as in deleteOne().
iii) Key-value stores
● Key value databases, also known as key value stores, are NoSQL database types
where data is stored as key value pairs and optimized for reading and writing that
data.
● The data is fetched by a unique key or a number of unique keys to retrieve the
associated value with each key.
● The values can be simple data types like strings and numbers or complex objects.
● The unique key can be anything.
● Most of the time, it is an id field, since that's the unique field in all the documents.
● To group related items, you can also add a common prefix to the key. The general
structure of a key value pair is key: value. For example, “name”: “John Drake.”
● Examples: Basho Riak, Redis, Voldemort, Aerospike, Oracle NoSQL, Amazon
DynamoDB, Azure Cosmos DB etc.
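The following toy Python sketch (not from the slides) mimics key-value access with an in-memory dict; the keys, the "user:" prefix, and the stored values are illustrative assumptions.

store = {}

# put: each value is addressed by a single unique key
store["user:101"] = {"name": "John Drake", "city": "Kathmandu"}
store["user:102"] = "plain string values are fine too"

# get: fetch the value associated with a key
print(store["user:101"])

# group related items by a common key prefix
users = {k: v for k, v in store.items() if k.startswith("user:")}
print(list(users))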
Types of NoSQL databases contd …
iv) Triple stores
Choose RDBMS if you have or need:
● Consistent data / ACID transactions
● Complex dynamic queries requiring stored procedures, or views
● Option to migrate to another database without significant change to the existing application's access paths or logic
● Data Warehouse, Analytics or BI use case

Choose NoSQL if you have or need:
● Semi-structured or unstructured data / flexible schema
● Limited pre-defined access paths and query patterns
● No complex queries, stored procedures, or views
● High velocity transactions
● Large volume of data (in Terabyte range) requiring quick and cheap scalability
● Requires distributed computing and storage
Which NoSQL database is for you ?
Choose Key-value Stores if:
● Simple schema
● High velocity read/write with no frequent updates
● High performance and scalability
● No complex queries involving multiple keys or joins

Choose Document Stores if:
● Flexible schema with complex querying
● JSON/BSON or XML data formats
● Leverage complex indexes (multikey, geospatial, full text search etc.)
● High performance and balanced R:W ratio
Which NoSQL database is for you ?
Choose Column-Oriented Database if:
● High volume of data
● Extreme write speeds with relatively less velocity reads
● Data extractions by columns using row keys
● No ad-hoc query patterns, complex indices or high level of aggregations

Choose Graph Database if:
● Applications requiring traversal between data points
● Ability to store properties of each data point as well as relationships between them
● Complex queries to determine relationships between data points
● Need to detect patterns between data points
Polyglot Persistence
Polyglot persistence is the idea that a single application that works with different types of data may use multiple, specialized databases behind the scenes.
NewSQL
Modern SQL databases that seek to provide the scalability of NoSQL systems while maintaining the ACID guarantees of a traditional database system.
Normalization
Normalization is a process for deciding which attributes should be grouped together in a relation. It is
a tool to validate and improve a logical design, so that it satisfies certain constraints that avoid
redundancy of data. Furthermore, Normalization is defined as the process of decomposing relations
with anomalies to produce smaller, well-organized relations. Thus, in normalization process, a relation
with redundancy can be refined by decomposing it or replacing it with smaller relations that contain
the same information, but without redundancy.
Functional Dependencies
Functional dependencies are the result of interrelationships between attributes or between tuples in any relation.
Definition : In relation R, X and Y are the two subsets of the set of attributes, Y is said to be
functionally dependent on X if a given value of X (all attributes in X) uniquely determines the value of
Y (all attributes in Y).
As shown in the Figure, Name and Fee are partially dependent because you can find the name of a student from his RollNo. alone, and the fee of any game from the name of the game alone.
Grade is fully functionally dependent because you can find the grade of a student in a particular game only if you know both the RollNo. and the Game of that student. Partial dependency can occur only when the primary key is composed of more than one attribute.
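To make the distinction concrete, here is a small Python sketch, not from the slides, that tests whether a functional dependency X → Y holds in a relation given as a list of rows; the sample data is modeled loosely on the Student/Game example.

def holds(relation, X, Y):
    # X -> Y holds if every X-value determines exactly one Y-value
    seen = {}
    for row in relation:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if seen.setdefault(x_val, y_val) != y_val:
            return False        # same X value maps to two different Y values
    return True

student = [
    {"RollNo": 1, "Name": "Anita",  "Game": "Cricket", "Fee": 200},
    {"RollNo": 1, "Name": "Anita",  "Game": "Hockey",  "Fee": 150},
    {"RollNo": 2, "Name": "Bikash", "Game": "Cricket", "Fee": 200},
]
print(holds(student, ["RollNo"], ["Name"]))          # True  (partial dependency)
print(holds(student, ["RollNo", "Game"], ["Fee"]))   # True  (full dependency)
print(holds(student, ["RollNo"], ["Fee"]))           # False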
There is MVD between Teacher and Class because a teacher can take more than one class. There is
another MVD between Class and Days because a class can be on more than one day.
There is a single valued dependency between ID and Teacher because each teacher has a unique ID.
Now,
Normalisation is a process by which we can decompose or divide any relation into more than one
relation to remove anomalies in relational databases. It is a step by step process and each step is known
as Normal Form. Normalisation is a reversible process.
Benefits of Normalisation
The benefits of normalization include
(a) Normalization produces smaller tables with smaller rows, which means more rows per page and hence less logical I/O.
(b) Searching, sorting, and creating indexes are faster, since tables are narrower, and more rows fit on a
data page.
(c) The normalization produces more tables by splitting the original tables. Thus there can be more
clustered indexes and hence there is more flexibility in tuning the queries.
(d) Index searching is generally faster as indexes tend to be narrower and shorter.
(e) The more tables allow better use of segments to control physical placement of data.
(f) There are fewer indexes per table and hence data modification commands are faster.
(g) There are fewer null values and less redundant data. This makes the database more compact.
(h) Data modification anomalies are reduced.
(i) Normalization is conceptually cleaner and easier to maintain and change as the needs change.
Consider the relation Employee as shown in Figure above. It is not in its first normal form because
attribute Name is not atomic. So, divide it into two attributes First Name and Last Name as shown in
Figure below.
Now, relation Employee is in 1NF.
Anomalies in First Normal Form: first normal form deals only with atomicity, so a relation in 1NF can still contain other anomalies.
Second Normal Form (2NF)
A relation is in second normal form if it is in 1NF and all non-primary key attributes must be fully
functionally dependent upon primary key attributes.
Consider the relation Student as shown in Figure
The Primary Key is (RollNo., Game). Each Student can participate in more than one game. Relation
Student is in 1NF but still contains anomalies.
1. Deletion anomaly: Suppose you want to delete student Jack. Here you lose information about the game Hockey because he is the only player participating in hockey.
2. Insertion anomaly: Suppose you want to add a new game Basket Ball in which no student has participated yet. You cannot add this information unless there is a player for it.
3. Updation anomaly: Suppose you want to change the Fee of Cricket. Here, you have to search all the students who participate in cricket and update the fee individually, otherwise it produces inconsistency.
The solution to this problem is to separate the partial dependencies from the fully functional dependencies. So, divide the Student relation into three relations: Student(RollNo., Name), Games(Game, Fee) and Performance(RollNo., Game, Grade), as shown in the Figure.
Now, Deletion, Insertion and updation operations can be performed without causing inconsistency.
The solution to this problem is to divide relation Student into two relations: Student(RollNo., Name, Semester) and Hostels(Semester, Hostel), as shown in the Figure.
Now, deletion, insertion and updation operations can be performed without causing inconsistency.
Assumptions:
— Student can have more than 1 subject.
— A Teacher can teach only 1 subject.
— A subject can be taught by more than 1 teacher
There are two candidate keys: (RollNo., Subject) and (RollNo., Teacher). Relation Student is in 3NF but still contains anomalies.
1. Deletion anomaly: If you delete the student whose RollNo. is 7, you will also lose the information that Teacher T4 teaches the subject VB.
2. Insertion anomaly : If you want to add a new Subject VC++, you cannot do that until a student
chooses subject VC++ and a teacher teaches subject VC++.
3. Updation anomaly: Suppose you want to change the Teacher for Subject C. You have to search all the students having subject C and update each record individually, otherwise it causes inconsistency.
In relation Student, the candidate key is overloaded. You can find the Teacher by RollNo. and Subject. You can also find the Subject by RollNo. and Teacher. Here RollNo. is overloaded, because you can also find the Subject by the Teacher alone.
The solution of this problem is to divide relation Student in two relations Stu-Teac and Teac-Sub as
shown in Figure below.
Relations in BCNF can also contain anomalies. Consider the relation Project-Work as shown in the Figure.
Assumptions:
– A Programmer can work on any number of projects.
– A project can have more than one module.
Relation Project-work is in BCNF but still contains anomalies.
1. Deletion anomaly: If you delete project 2, you will lose information about Programmer P3.
2. Insertion anomaly: If you want to add a new project 4, you cannot add it until it is assigned to a programmer.
3. Updation anomaly: If you want to change the name of project 1, then you have to search all the programmers having project 1 and update them individually, otherwise it causes inconsistency.
Dependencies in Relation Project-work are
Programmer →→ Project
Project →→ Module
The solution of this problem is to divide relation Project-Work into two relations Prog-Prj
(Programmer, Project) and Prj-Module (Project, Module) as shown in Figure
The minimal set of attributes that can uniquely identify a tuple is known as a candidate
key. For Example, STUD_NO in STUDENT relation.
● It is a minimal super key, i.e., a super key with no redundant attributes.
● It is the minimal set of attributes that can uniquely identify a record.
● It must contain unique values.
● Unlike the primary key, a candidate key's attributes may contain NULL values for a tuple.
● Every table must have at least one candidate key.
● A table can have multiple candidate keys but only one primary key.
● There can be more than one candidate key in a relation.
Primary Key
There can be more than one candidate key in relation out of which one can be chosen as the primary
key. For Example, STUD_NO, as well as STUD_PHONE, are candidate keys for relation STUDENT
but STUD_NO can be chosen as the primary key (only one out of many candidate keys).
● It is a unique key.
● It identifies exactly one tuple (record) at a time.
● It has no duplicate values; its values are unique.
● It cannot be NULL.
● A primary key need not be a single column; more than one column can together form the primary key of a table.
Super Key
The set of attributes that can uniquely identify a tuple is known as Super Key. For Example,
STUD_NO, (STUD_NO, STUD_NAME), etc. A super key is a group of single or
multiple keys that identifies rows in a table. It supports NULL values.
● Adding zero or more attributes to the candidate key generates the super key.
● A candidate key is a super key but vice versa is not true.
● Super Key values may also be NULL.