Database Management System

Data – Data means known facts that can be recorded and that have implicit meaning. In simple
words, data is information. Data, in the context of databases, refers to all the single items that are
stored in a database, either individually or as a set. In software systems we manage data; here,
managing data means sorting, processing and extracting it.

Data mostly represents recordable facts. Data aids in producing information, which is based on facts.
For example, if we have data about the marks obtained by all students, we can then determine the
toppers and the average marks.

Data Persistence – Data persistence refers to the continued existence of data. Data that is required
in a software system needs to be stored somewhere (in variables, in secondary storage) until the job
that needs it is over. In other words, data persistence describes the lifetime of data.

Field - In a database table, a field is a data structure for a single piece of data.

Tuple / Records - Records are composed of fields, each of which contains one item of information. In
the context of a relational database, a row, also called a record or tuple, represents a single,
implicitly structured data item in a table.

File – A file is an operating system concept that separates bundles of heterogeneous data stored on
a storage device. In simple words, a file is a collection of records.

Database – A database is a place where all your application-related data is stored. One application's
data can be stored in a bunch of files; thus, we can say that a database is a collection of files. A
database is a collection of inter-related data stored in such a way that it becomes easier to
retrieve, create and manipulate it and to produce information. A database is a collection of
information that is organized so that it can be easily accessed, managed and updated.

DBMS - A Database Management System is software that allows the creation, definition and
manipulation of a database, allowing users to store, process and analyse data easily. A DBMS provides
us with an interface or a tool to perform various operations like creating a database, storing data in it,
updating data, creating tables in the database and a lot more. A DBMS also provides protection and
security to the databases, and it maintains data consistency in the case of multiple users.

The DBMS is general-purpose software that facilitates the processes of defining, constructing,
manipulating and sharing databases among various users and applications.

Characteristics / Advantages of DBMS - Traditionally, data was organized in file formats. The DBMS
was a new concept then, and much research was done to make it overcome the deficiencies of the
traditional style of data management. A modern DBMS has the following characteristics −

• Real-world entity − A modern DBMS is more realistic and uses real-world entities to design
its architecture. It models their behaviour and attributes too. For example, a school database
may use students as an entity and their age as an attribute.
• Relation-based tables − DBMS allows entities and relations among them to form tables. A
user can understand the architecture of a database just by looking at the table names.
• Isolation of data and application − A database system is entirely different from its data. A
database is an active entity, whereas data is said to be passive: it is what the database works
on and organizes. A DBMS also stores metadata, which is data about data, to ease its own
process.
• Less redundancy − A DBMS follows the rules of normalization, which splits a relation when any
of its attributes has redundant values. Normalization is a mathematically rich and
scientific process that reduces data redundancy.
• Consistency − Consistency is a state where every relation in a database remains consistent.
There exist methods and techniques that can detect an attempt to leave the database in an
inconsistent state. A DBMS can provide greater consistency than earlier forms of
data-storing applications like file-processing systems.
• Query Language − A DBMS is equipped with a query language, which makes it more efficient to
retrieve and manipulate data. A user can apply as many and as varied filtering options as
required to retrieve a set of data. Traditionally this was not possible where a file-processing
system was used. (A minimal usage sketch follows this list.)
• Backup and Recovery - If the whole database fails and there is no way to get it back, the
organization will surely suffer a big loss. The only protection is to take backups of the
database so that, whenever needed, it can be restored. Every database must have this
characteristic.
• Data Integrity - This is one of the most important characteristics of a database management
system. Integrity ensures the quality and reliability of the database system. It protects the
database from unauthorized access and makes it more secure. It brings only consistent
and accurate data into the database.
• ACID Properties − A DBMS follows the concepts of Atomicity, Consistency, Isolation, and
Durability (normally shortened to ACID). These concepts are applied to transactions, which
manipulate data in a database. ACID properties help the database stay healthy in multi-
transactional environments and in case of failure.
• Multiuser and Concurrent Access − A DBMS supports a multi-user environment and allows users
to access and manipulate data in parallel. Though there are restrictions on transactions
when users attempt to handle the same data item, users are always unaware of them.
• Multiple views − A DBMS offers multiple views for different users. A user in the Sales
department will have a different view of the database than a person working in the Production
department. This feature enables users to have a concentrated view of the database
according to their requirements.
• Security − Features like multiple views offer security to some extent, as users are unable
to access the data of other users and departments. A DBMS offers methods to impose constraints
while entering data into the database and retrieving the same at a later stage. A DBMS offers
many different levels of security features, which enables multiple users to have different
views with different features. For example, a user in the Sales department cannot see the
data that belongs to the Purchase department. Additionally, it can also be managed how
much data of the Sales department should be displayed to the user. Since a DBMS is not
saved on disk in the way traditional file systems are, it is very hard for miscreants to break into it.
• Remote Data Access - Online shopping has become commonplace, apps are being rolled out
everywhere, and everything is moving online, so data must be accessible remotely. With a
simple plugin or command, data can be accessed by online websites or apps.
• Data Sharing - A file system does not allow sharing of data, or makes sharing too complex,
whereas in a DBMS data can be shared easily because the system is centralized.
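
To make the interface and query-language ideas above concrete, here is a minimal sketch using Python's standard-library sqlite3 module as a stand-in DBMS; the student table and its columns are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # an in-memory database; a file name would create a database file
    cur = conn.cursor()

    # Create a table in the database.
    cur.execute("CREATE TABLE IF NOT EXISTS student "
                "(roll_no INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")

    # Store data in it.
    cur.executemany("INSERT INTO student (roll_no, name, marks) VALUES (?, ?, ?)",
                    [(1, "Asha", 91), (2, "Ravi", 78), (3, "Meena", 85)])
    conn.commit()

    # Use the query language to retrieve a filtered set of data.
    for row in cur.execute("SELECT name, marks FROM student WHERE marks >= 85 ORDER BY marks DESC"):
        print(row)  # ('Asha', 91) then ('Meena', 85)

    conn.close()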

Data Redundancy - Data redundancy is a condition created within a database or data storage
technology in which the same piece of data is held in two or more separate places. Data redundancy
means having more than one copy of your data. It can occur either at the table level or at the field level.

Data Consistency - Consistency, in the context of databases, states that data cannot be written that
would violate the database’s own rules for valid data. If a certain transaction occurs that attempts to
introduce inconsistent data, the entire transaction is rolled back and an error is returned to the user.
Consistency states that only valid data will be written to the database. Consistency in database
systems refers to the requirement that any given database transaction must change affected data
only in allowed ways. Any data written to the database must be valid according to all defined rules,
including constraints, cascades, triggers, and any combination thereof.
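
A minimal sketch of this behaviour, assuming SQLite and an invented account table: the CHECK constraint encodes the rule for valid data, and a transaction that violates it is rolled back instead of being written.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (acc_no INTEGER PRIMARY KEY, "
                 "balance INTEGER CHECK (balance >= 0))")
    conn.execute("INSERT INTO account VALUES (1, 100)")
    conn.commit()

    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - 500 WHERE acc_no = 1")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)  # the rule balance >= 0 would be violated

    # The database is still in its previous, consistent state.
    print(conn.execute("SELECT balance FROM account WHERE acc_no = 1").fetchone())  # (100,)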

Data Integrity - Data integrity is the maintenance of, and the assurance of the accuracy and
consistency of, data over its entire life-cycle, and is a critical aspect to the design, implementation
and usage of any system which stores, processes, or retrieves data. Data integrity is the overall
completeness, accuracy and consistency of data.

Data Dictionary - A data dictionary is a file or a set of files that contains a database's metadata (data
that describes other data or, simply put, “data about data”). The data dictionary contains records
about other objects, such as data ownership, data relationships to other objects, and other data.
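
As a concrete example, SQLite keeps its data dictionary in a catalog table called sqlite_master, which can be queried like any other table; other systems expose similar catalogs (for example, an information_schema).

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE INDEX idx_employee_name ON employee (name)")

    # Read the metadata ("data about data") from the catalog.
    for name, obj_type, sql in conn.execute("SELECT name, type, sql FROM sqlite_master"):
        print(obj_type, name, "->", sql)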

Concurrent Access - The ability to gain admittance to a system or component by more than one user
or process. For example, concurrent access to a computer means multiple users are interacting with
the system simultaneously. Any multi-user database application has to have some method for
dealing with concurrent access to data, when more than one user is accessing the same data at the
same time.

Difference Between Traditional File Processing System and DBMS -

• A File System is a collection of raw data files stored in the hard-drive, whereas a database is
intended for easily organizing, storing and retrieving large amounts of data.
• In a file system, most tasks such as storage, retrieval and search are done manually and are
quite tedious, whereas when using a database, the inbuilt DBMS provides automated
methods to complete these tasks.
• Redundancy is controlled in a DBMS, whereas a file system cannot control redundancy.
• Using a file system leads to problems with data integrity, data inconsistency and data
security, but these problems can be avoided by using a database.
• A file system requires excessive program maintenance, whereas a database requires minimal
maintenance.
• Unlike a File System, databases are efficient because reading line by line is not required, and
certain control mechanisms are in place.
• A database management system coordinates both the physical and the logical access to the
data, whereas a file-processing system coordinates only the physical access.
• Unauthorized access is restricted in DBMS but not in the file system.
• DBMS provide backup and recovery whereas data lost in file system can't be recovered.
• DBMS provide multiple user interfaces. Data is isolated in file system.
• A database management system is designed to coordinate multiple users accessing the same
data at the same time. A file-processing system is usually designed to allow one or more
programs to access different data files at the same time. In a file-processing system, a file
can be accessed by two programs concurrently only if both programs have read-only access
to the file.
• A database management system is designed to allow flexible access to data (i.e. queries),
whereas a file-processing system is designed to allow predetermined access to data (i.e.
compiled programs).

Database Users - A typical DBMS has users with different rights and permissions who use it for
different purposes. Some users retrieve data and some back it up.

• Administrators − Administrators maintain the DBMS and are responsible for administrating
the database. They are responsible for looking after its usage and deciding who should use it.
They create access profiles for users and apply limitations to maintain isolation and enforce
security. Administrators also look after DBMS resources like the system licence, required tools,
and other software- and hardware-related maintenance.
• Designers − Designers are the group of people who actually work on the designing part of
the database. They keep a close watch on what data should be kept and in what format.
They identify and design the whole set of entities, relations, constraints, and views.
• Application Programmers - They are the developers who interact with the database by
means of DML queries. These DML queries are written in application programs in languages like C,
C++, Java, Pascal, etc. The queries are converted into object code to communicate with
the database. For example, writing a C program to generate a report of the employees who
are working in a particular department will involve a query to fetch the data from the database;
it will include an embedded SQL query in the C program.
• Sophisticated Users - They are database developers who write SQL queries to
select/insert/delete/update data. They do not use any application or program to access the
database; they interact with it directly by means of a query language like SQL.
These users may be scientists, engineers or analysts who thoroughly study SQL and DBMS
concepts to apply them to their requirements. In short, this category includes
designers and developers of DBMS and SQL.
• Specialized Users - These are also sophisticated users, but they write special database
application programs. They are the developers who develop complex programs according to the
requirement.
• Naive Users - Any user who does not have any knowledge about database can be in this
category. Their task is to just use the developed application and get the desired results. For
example, Clerical staff in any bank is a naïve user. They don’t have any DBMS knowledge but
they still use the database and perform their given task.
• End Users − End users are those who actually reap the benefits of having a DBMS. End users
can range from simple viewers who pay attention to the logs or market rates to
sophisticated users such as business analysts.

Database Administrator - A database administrator (DBA) directs or performs all activities related to
maintaining a successful database environment. Responsibilities include designing, implementing,
and maintaining the database system; establishing policies and procedures pertaining to the
management, security, maintenance, and use of the database management system; and training
employees in database management and use. A DBA is expected to stay abreast of emerging
technologies and new design approaches. A DBA is an individual person or a group of people with an
overview of one or more databases, who controls the design and use of those databases.

Functions and Responsibilities of DBA –


• Installing and upgrading the DBMS servers - The DBA is responsible for installing a new DBMS
server for new projects. He is also responsible for upgrading these servers as new versions
come onto the market or as requirements change. If an upgrade of an existing server fails,
he should be able to revert the changes back to the older version, thus keeping the DBMS
working. He is also responsible for applying the service packs / hot fixes / patches to the
DBMS servers.
• Design and implementation - Designing the database and implementing it is also the DBA’s
responsibility. He should be able to decide on proper memory management, file organization,
error handling, log maintenance, etc. for the database.
• Performance Tuning - Since the database is huge and will have lots of tables, data, constraints
and indices, there will be variations in performance from time to time. Also, because of
design issues or data growth, the database may not work as expected. It is the
responsibility of the DBA to tune the database's performance and to make sure all the
queries and programs run in a fraction of a second.
• Migrating database servers - Sometimes, users of Oracle would like to shift to SQL Server
or Netezza. It is the responsibility of the DBA to make sure that the migration happens without
any failure and without data loss.
• Backup and Recovery - Proper backup and recovery programs need to be developed and
maintained by the DBA; this is one of his main responsibilities.
Data/objects should be backed up regularly so that after any crash they can be
recovered without much effort or data loss.
• Security - DBA is responsible for creating various database users and roles and giving them
different levels of access rights.
• Storage Structure and Access Method Definition – DBA decides how the data is to be
represented in the stored database. This process is called physical database design.
• Documentation - The DBA should properly document all his activities so that if he quits or
a new DBA comes in, the newcomer can understand the database without much effort. He
should document all his installation, backup, recovery and security methods, and he should
keep various reports about database performance.
• Producing Reports from Queries - DBAs are frequently called upon to generate reports by
writing queries, which are then run against the database.
• Job Monitoring - DBA monitors jobs running on the database and ensures their performance
is not degraded by very expensive tasks. With changing requirements, DBA is responsible for
making appropriate adjustments in the database.
• Troubleshooting - DBAs are on call for troubleshooting in case of any problems. Whether
they need to quickly restore lost data or correct an issue to minimise damage, a DBA needs
to quickly understand and respond to problems when they occur.

Types of DBA –

• Administrative DBA - This DBA is mainly concerned with installing and maintaining DBMS
servers. His prime tasks are installing, backups, recovery, security, replications, memory
management, configurations and tuning. He is mainly responsible for all administrative tasks
of a database.
• Development DBA - He is responsible for creating queries and procedures for the
requirement. Basically, his task is similar to that of any database developer.
• Database Architect - Database architect is responsible for creating and maintaining the
users, roles, access rights, tables, views, constraints and indexes. He is mainly responsible for
designing the structure of the database depending on the requirement. These structures will
be used by developers and development DBA to code.
• Data Warehouse DBA - This DBA should be able to maintain the data and procedures coming from
various sources into the data warehouse. These sources can be files, COBOL programs, or any other
programs. Since data and programs come from different sources, a good DBA should be
able to keep the performance and function levels of these sources at the same pace to make
the data warehouse work.
• Application DBA - He acts as a bridge between the application programs and the database.
He makes sure all the application programs are optimized to interact with the database, and he
ensures that all activities, from installing, upgrading and patching to maintaining, backup and
recovery, run without any issues.
• OLAP DBA - He is responsible for installing and maintaining the database in OLAP systems.
He maintains only OLAP databases.

OLAP - OLAP (Online Analytical Processing) is computer processing that enables a user to easily and
selectively extract and view data from different points of view. For example, a user can request that
data be analysed to display a spreadsheet showing all of a company's beach ball products sold in
Florida in the month of July, compare revenue figures with those for the same products in
September, and then see a comparison of other product sales in Florida in the same time period. To
facilitate this kind of analysis, OLAP data is stored in a multidimensional database. Whereas a
relational database can be thought of as two-dimensional, a multidimensional database considers
each data attribute (such as product, geographic sales region, and time period) as a separate
"dimension." OLAP software can locate the intersection of dimensions (all products sold in the
Eastern region above a certain price during a certain time period) and display them. Attributes such
as time periods can be broken down into sub-attributes. OLAP is used to store historical data.

OLAP can be used for data mining or the discovery of previously undiscerned relationships between
data items. An OLAP database does not need to be as large as a data warehouse, since not all
transactional data is needed for trend analysis. Using Open Database Connectivity (ODBC), data can
be imported.

OLAP performs multidimensional analysis of business data and provides the capability for complex
calculations, trend analysis, and sophisticated data modelling. It is the foundation for many kinds of
business applications for Business Performance Management, Planning, Budgeting, Forecasting,
Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse
Reporting. OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions,
thereby providing the insight and understanding they need for better decision making.
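
The analysis described above amounts to aggregating a fact table along chosen dimensions. A hedged sketch follows, with an invented sales table whose columns (product, region, month) play the role of dimensions and revenue the role of the measure.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, region TEXT, month TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("beach ball", "Florida", "July",      1200.0),
        ("beach ball", "Florida", "September",  700.0),
        ("umbrella",   "Florida", "July",       450.0),
    ])

    # Locate one intersection of the dimensions:
    # beach ball revenue in Florida, compared across months.
    query = """SELECT month, SUM(revenue)
               FROM sales
               WHERE product = 'beach ball' AND region = 'Florida'
               GROUP BY month"""
    for row in conn.execute(query):
        print(row)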

OLTP - OLTP (Online Transaction Processing) is a class of software programs capable of supporting
transaction-oriented applications on the Internet. Typically, OLTP systems are used for order entry,
financial transactions, customer relationship management (CRM) and retail sales. Such systems have
a large number of users who conduct short transactions. Database queries are usually simple,
require sub-second response times and return relatively few records. An important attribute of an
OLTP system is its ability to maintain concurrency. IBM's CICS (Customer Information Control
System) is a well-known OLTP product. It is used to store current / operational data.

Online transaction processing (OLTP) is a class of systems that supports or facilitates high
transaction-oriented applications. OLTP’s primary system features are immediate client feedback
and high individual transaction volume. OLTP is mainly used in industries that rely heavily on the
efficient processing of a large number of client transactions, e.g., banks, airlines and retailers.

Database systems that support OLTP are usually decentralized to avoid single points of failure and to
spread the volume between multiple servers.

OLTP systems must provide atomicity, which is the ability to fully process or completely undo an
order. Partial processing is never an option. When airline passenger seats are booked, atomicity
combines the two system actions of reserving and paying for the seat. Both actions must happen
together or not at all. Heavy OLTP system reliance brings added challenges. For example, if server or
communication channels fail, an entire business chain can grind to an immediate halt.
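
A minimal sketch of the seat-booking example, assuming SQLite and invented seat and payment tables: wrapping both actions in one transaction makes them atomic, so they happen together or not at all.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE seat (seat_no TEXT PRIMARY KEY, passenger TEXT);
        CREATE TABLE payment (seat_no TEXT, amount REAL);
        INSERT INTO seat VALUES ('12A', NULL);
    """)

    try:
        with conn:  # one transaction: both statements commit together, or neither does
            conn.execute("UPDATE seat SET passenger = 'R. Sharma' WHERE seat_no = '12A'")
            conn.execute("INSERT INTO payment VALUES ('12A', 250.0)")
    except sqlite3.Error:
        pass  # on any failure, both the reservation and the payment are undone
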
OLAP (Online Analytical Processing)            OLTP (Online Transaction Processing)
Historical data                                Current data
Subject oriented                               Application oriented
Used for decision making                       Used for day-to-day operations
Takes terabytes or petabytes of space          Takes megabytes or gigabytes of space
Most of the data is stored here (80%-95%)      Very little data is stored here (1%-20%)

One-Tier Architecture - A one-tier architecture (standalone application) has all the layers, such as the
Presentation, Business and Data Access layers, in a single software package. Applications that handle
all three tiers, such as an MP3 player or MS Office, come under one-tier applications. The data is
stored in the local system or on a shared drive.

Basically, a one-tier architecture keeps all of the elements of an application, including the interface,
middleware and back-end data, in one place. Developers see these types of systems as the simplest
and most direct. Some experts describe them as applications that could be installed and run on a
single computer. The need for distributed models for Web applications and cloud hosting solutions
has created many situations where one-tier architectures are not sufficient. That caused three-tier
or multi-tier architecture to become more popular.

Two-Tier Architecture - In the Two-Tier Architecture, the user system interface is usually located
in the user’s desktop environment and the database management services are usually on a server, i.e.
a more powerful machine that services many clients. Processing management is split between the
user system interface environment and the database management server environment.

The Two-Tier Architecture is used for application programs that run on the client side. An interface called
ODBC (Open Database Connectivity) provides an API that allows client-side programs to call the
DBMS. The DBMS server provides stored procedures and triggers. A number of software
vendors provide tools to simplify the development of applications for the Two-Tier Architecture.

The Two-Tier Architecture is a good solution for distributed computing when workgroups of a dozen
to 100 people interact on a LAN simultaneously. It does, however, have a number of limitations. When
the number of users exceeds 100, performance begins to deteriorate; this limitation is a result of the
server maintaining a connection via keep-alive messages with each client, even when no work is
being done. A second limitation of the Two-Tier Architecture is that implementing processing-
management services using vendor-proprietary database procedures restricts flexibility and the
choice of DBMS for applications. Finally, current implementations of the Two-Tier Architecture
provide limited flexibility in moving (repartitioning) program functionality from one server to
another without manually regenerating procedural code.

The Two-tier architecture is divided into two parts -

• Client Application (Client Tier)
• Database (Data Tier)

Client system handles both Presentation and Application layers and Server system handles Database
layer. It is also known as client server application. The communication takes place between the
Client and the Server. Client system sends the request to the Server system and the Server system
processes the request and sends back the data to the Client System.

The application processing is done separately for database queries and updates and for business
logic processing and user interface presentation. Usually, the network binds the back-end of an
application to the front-end, although both tiers can be present on the same hardware.

Sometimes, the application logic (the real business logic) is located both in the client program and in
the database itself. Quite often, the business logic is merged into the presentation logic on the client
side. As a result, code maintenance and reusability become difficult to achieve on the client side. On
the database side, logic is often developed using stored procedures.

In the two-tier architecture, if the Client/Server application has a number of business rules needed
to be processed, then those rules can reside at either the Client or at the Server.

The architecture of any client/server environment is by definition at least a two-tier system, the
client being the first tier and the server being the second. The Client requests services directly from
server i.e. client communicates directly with the server without the help of another server or server
process. In a typical two-tier implementation, SQL statements are issued by the application and then
handed on by the driver to the database for execution. The results are then sent back via the same
mechanism, but in the reverse direction. It is the responsibility of the driver (ODBC) to present the
SQL statement to the database in a form that the database understands.

A two-tier architecture is a software architecture in which a presentation layer or interface runs on a
client, and a data layer or data structure gets stored on a server. Separating these two components
into different locations represents a two-tier architecture, as opposed to a single-tier architecture.

There are several advantages of two-tier systems -

1. Availability of well-integrated PC-based tools like PowerBuilder, MS Access, 4GL tools
provided by the RDBMS manufacturer, remote SQL, ODBC.
2. Tools are relatively inexpensive.
3. Least complicated to implement.
4. PC-based tools show Rapid Application Development (RAD) i.e., the application can be
developed in a comparatively short time.
5. The 2-tier Client/Server provides much more attractive graphical user interface (GUI)
applications than was possible with earlier technology.
6. Architecture maintains a persistent connection between the client and database, thereby
eliminating overhead associated with the opening and closing of connections.
7. Faster than three-tier implementation.
8. Offers a great deal of flexibility and simplicity in management.

Conversely, a two-tier architecture has some disadvantages -

1. As application development is done on the client side, the maintenance cost of the application,
as well as of client-side tools etc., is expensive. That is why in the 2-tier architecture the client is
called a ‘fat client’.
2. Increased network load - Since the actual processing of data takes place on the remote client, the
data has to be transported over the network. This leads to increased network stress.
3. Applications are loaded on individual PC i.e. each application is bound to an individual PC.
For this reason, the application logic cannot be reused.
4. Due to dynamic business scenario, business processes/logic have to be changed. These
changed processes have to be implemented in all individual PCs. Not only that, the programs
have to undergo quality control to check whether all the programs generate the same result
or not.
5. Software distribution procedure is complicated in 2-tier Client/Server model. As all the
application logic is executed on the PCs, all these machines have to be updated in case of a
new release. The procedure is complicated, expensive, prone to errors and time consuming.
6. PCs are considered to be weak in terms of security i.e., they are relatively easy to crack.
7. Most currently available drivers require that native libraries be loaded on a client machine.
8. Load configurations must be maintained for native code if required by the driver.
9. Problem areas are encountered upon implementing this architecture on the Internet.

Three-Tier Architecture - The Three-Tier Architecture (also referred to as the Multi-Tier
Architecture) emerged to overcome the limitations of the Two-Tier Architecture. In the Three-Tier
Architecture, a middle tier was added between the user system interface client environment and the
DBMS server environment. There are a variety of ways of implementing this middle tier, such as (1)
transaction processing monitors (TP monitors), (2) message servers or (3) application servers. The
middle tier can perform queuing, application execution, and database staging. For example, if the
middle tier provides queuing, the client can deliver its request to the middle layer and disengage,
because the middle tier will access the data and return the answer to the client. In addition, the
middle layer adds scheduling and prioritization for work in progress. The Three-Tier Architecture has
been shown to improve performance for groups with a large number of users (in the thousands) and
improves flexibility when compared to the Two-Tier Architecture.

Flexibility in partitioning can be as simple as dragging and dropping application code modules onto
different computers in some Three-Tier Architectures. A limitation of the Three-Tier Architecture is
that the development environment is reportedly more difficult to use than the visually oriented
development of Two-Tier applications. Recently, mainframes have found a new use as servers in Three-
Tier Architectures.

A three-tier architecture is a client-server architecture in which the functional process logic, data
access, computer data storage and user interface are developed and maintained as independent
modules on separate platforms. In three-tier client/server system the client request is handled by
intermediate servers which coordinate the execution of the client request with subordinate servers.

There are three important points to remember in the three-tier architecture, which are –

• User (Presentation) Tier − End-users operate on this tier and they know nothing about any
existence of the database beyond this layer. At this layer, multiple views of the database can
be provided by the application. All views are generated by applications that reside in the
application tier.
• Application (Middle) Tier − At this tier reside the application server and the programs that
access the database. For a user, this application tier presents an abstracted view of the
database. End-users are unaware of any existence of the database beyond the application.
At the other end, the database tier is not aware of any other user beyond the application
tier. Hence, the application layer sits in the middle and acts as a mediator between the end-
user and the database.
• Database (Data) Tier − At this tier, the database resides along with its query processing
languages. We also have the relations that define the data and their constraints at this level.

Advantages of Three-Tier Architecture –

1. Application maintenance is centralized with the transfer of the business logic for many end
users into a single application server. This eliminates the concern of software distribution
that are problematic in the traditional two-tier Client/Server model.
2. Clear separation of user-interface-control and data presentation from application logic.
3. Through this separation more clients are able to have access to a wide variety of server
applications. The two main advantages for client-applications are clear:
• Quicker development through the reuse of pre-built business-logic components and a
shorter test phase, because the server-components have already been tested.
• Many users are able to access a wide variety of server applications, as all application
logic are loaded in the applications server.
4. As a rule, servers are “trusted” systems. Their authorization is simpler than that of
thousands of “untrusted” client-PCs. Data protection and security is simpler to obtain.
Therefore, it makes sense to run critical business processes that work with security sensitive
data, on the server.
5. Redefinition of the storage strategy won’t influence the clients. RDBMS offer a certain
independence from storage details for the clients. However, cases like changing table
attributes make it necessary to adapt the client’s application. In the future, even radical
changes, like switching from an RDBMS to an OODBMS, won’t influence the client. In well-
designed systems, the client still accesses data over a stable and well-designed interface,
which encapsulates all the storage details.
6. Load balancing is easier with the separation of the core business logic from the database
server.
7. Dynamic load balancing, if bottlenecks in terms of performance occur, the server process
can be moved to other servers at runtime.
8. Business objects and data storage should be brought as close together as possible. Ideally,
they should be together physically on the same server. This way network load for complex
access can be reduced.
9. The need for less expensive hardware because the client is ‘thin’.
10. Change management is easier and faster to execute. This is because a component/program
logic/business logic is implemented on the server rather than furnishing numerous PCs with
new program versions.
11. The added modularity makes it easier to modify or replace one tier without affecting the
other tier.
12. Clients do not need to have native libraries loaded locally.
13. Drivers can be managed centrally.
14. Your database server does not have to be directly visible to the Internet.

N-Tier Architecture - N-tier architecture is also called multi-tier architecture because the software is
engineered to have the processing, data management, and presentation functions physically and
logically separated. That means that these different functions are hosted on several machines or
clusters, ensuring that services are provided without resources being shared and, as such, these
services are delivered at top capacity. The “N” in the name n-tier architecture refers to any number
from 1.

Not only does your software gain from being able to get services at the best possible rate, but it’s
also easier to manage. This is because when you work on one section, the changes you make will not
affect the other functions. And if there is a problem, you can easily pinpoint where it originates.

An N-tier architecture used to be seen in enterprise scenarios where 3 tiers were not enough. You
can think of it as a layered architecture with more than one middle layer (or one middle layer that
may evolve into more than one). It could even be designed with multiple presentation layers.

In software architecture the term n-tier architecture refers to breaking an application into tiered
components such that each tier can be deployed separately (and, preferably, in isolation). Done
correctly this allows for greater scalability (as each layer can be scaled-out/scaled-up as needed, due
to each being able to run on their own physical environment) among other benefits.

Database Schema - In computer programming, a schema is the organization or structure for a
database. A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are associated. It
formulates all the constraints that are to be applied on the data. A database schema defines its
entities and the relationships among them. It contains a descriptive detail of the database, which can
be depicted by means of schema diagrams. It is the database designers who design the schema to
help programmers understand the database and make it useful. In other words, "the description of a
database is called the database schema, which is specified during database design and is not
expected to change frequently". A database schema can be divided broadly into two categories −

• Physical Database Schema − This schema pertains to the actual storage of data and its form
of storage like files, indices, etc. It defines how the data will be stored in a secondary
storage.
• Logical Database Schema − This schema defines all the logical constraints that need to be
applied on the data stored. It defines tables, views, and integrity constraints (sketched below).
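
A hedged sketch of the split, using SQLite DDL with invented names: the table and view describe the logical schema, while the index is a physical-schema concern that changes how data is reached on storage, not what the data means.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Logical schema: tables, views and integrity constraints.
    conn.executescript("""
        CREATE TABLE employee (
            emp_id INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            dept   TEXT NOT NULL
        );
        CREATE VIEW sales_staff AS
            SELECT emp_id, name FROM employee WHERE dept = 'Sales';
    """)

    # Physical schema: an access structure. Adding or dropping it changes
    # performance and storage, not the logical meaning of the data.
    conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")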

Subschema - A subschema is a subset of the schema and inherits the same properties that a schema
has. It can be defined as the subset or sub-level of the schema that has the same properties as the
schema; in simple words, it is just an effective plan or schema for a view. Interestingly, it provides
users a window through which they can view only that part of the database which is of interest to
them. It identifies the subset of areas, sets, records and data names defined in the database that are
of interest to a given user. Thus, a portion of the database can be seen by application programs, and
different application programs have different views of the data. Subschemas enhance security and
help prevent data compromise.

Database Instance - The data in the database at a particular moment of time is called an instance or
a database state. In a given instance each schema construct has its own current set of instances.
Every time we update the value of a data item in record, one state of the database changes into
another state. A database instance is a state of operational database with data at any given time. It
contains a snapshot of the database. Database instances tend to change with time.

Levels / Views of Database Architecture / Data Abstraction –

Physical View - It is the physical representation of the database on the computer. This level
describes how the data is stored in the memory. The internal level is the one that concerns the way
the data are physically stored on the hardware. The internal level covers the physical
implementation of the database to achieve optimal runtime performance and storage space
utilization. It covers the data structures and file organizations used to store data on storage devices.
It interfaces with the operating system access methods to place the data on the storage devices,
build the indexes, retrieve the data, and so on.

This is the lowest level of data abstraction. It tells us how the data is actually stored in memory,
using access methods like sequential or random access and file organisation methods like B+ trees
or hashing. Usability, the size of memory, and the number of times the records are accessed are
factors we need to consider while designing the database. Suppose we need to store the details
of an employee: the blocks of storage and the amount of memory used for these purposes are kept
hidden from the user. The internal level is concerned with such things as -

• Storage space allocation for data and indexes.
• Record descriptions for storage (with stored sizes for data items).
• Record placement.
• Data compression and data encryption techniques.

Logical / Conceptual View - This level describes what data is stored in the database and the
relationships among the data. The middle level in the three-level architecture is the conceptual level.
This level contains the logical structure of the entire database as seen by the DBA. It
comprises the information that is actually stored in the database in the form of tables, and it also
stores the relationships among the data entities in relatively simple structures. At this level, the
information available to the user at the view level is unknown. For example, we can store the various
attributes of an employee as well as relationships, such as the one with the manager.

External View - It is the user’s view of the database. This level describes that part of the database
that is relevant to each user. External level is the one which is closest to the end users. This level
deals with the way in which individual users view the data. Individual users are given different views
according to the user's requirement. A view involves only those portions of a database which are of
concern to a user. Therefore, same database can have different views for different users. The
external view insulates users from the details of the internal and conceptual levels. External level is
also known as the view level. In addition, different views may have different representations of the
same data. For example, one user may view dates in the form (day, month, year), while another may
view dates as (year, month, day).
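
A minimal sketch of two external views over the same conceptual schema, assuming SQLite and an invented employee table; each view exposes only the portion, and the representation, relevant to its user.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT,
                               salary REAL, hired TEXT);  -- hired stored as 'YYYY-MM-DD'
        INSERT INTO employee VALUES (1, 'Asha', 52000, '2020-03-15');

        -- View for HR: includes salary, dates shown as (year, month, day).
        CREATE VIEW hr_view AS
            SELECT name, salary, hired FROM employee;

        -- View for reception: no salary, date re-presented as (day, month, year).
        CREATE VIEW reception_view AS
            SELECT name, strftime('%d/%m/%Y', hired) AS hired FROM employee;
    """)

    print(conn.execute("SELECT * FROM hr_view").fetchall())         # [('Asha', 52000.0, '2020-03-15')]
    print(conn.execute("SELECT * FROM reception_view").fetchall())  # [('Asha', '15/03/2020')]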

Physical Data Independence - It refers to the characteristic of being able to modify the physical
schema without any alterations to the conceptual or logical schema, typically done for optimisation
purposes. For example, the conceptual structure of the database would not be affected by any change
in the storage size of the database system server. Changing from sequential to random-access files is
one such example. These alterations or modifications to the physical structure may include -

• Utilising new storage devices.
• Modifying data structures used for storage.
• Altering indexes or using alternative file organisation techniques etc.

Logical Data Independence - It refers to the characteristic of being able to modify the logical schema
without affecting the external schema or application programs. The user's view of the data would not
be affected by any changes to the conceptual view of the data. These changes may include the
insertion or deletion of attributes or alterations to table structures, entities or relationships in the
logical schema. Alterations in the conceptual schema, such as the addition or deletion of entities,
attributes or relationships, should be possible without altering existing external schemas or having
to rewrite application programs.
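
A hedged sketch of this idea with SQLite and invented names: the application reads through a view (its external schema), so adding an attribute to the underlying table does not disturb it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT);
        CREATE VIEW student_names AS SELECT roll_no, name FROM student;
        INSERT INTO student VALUES (1, 'Ravi');
    """)

    # The conceptual schema changes: a new attribute is added...
    conn.execute("ALTER TABLE student ADD COLUMN email TEXT")

    # ...but the external schema, and any program written against it, still works.
    print(conn.execute("SELECT * FROM student_names").fetchall())  # [(1, 'Ravi')]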

Distributed Database - A distributed database is a collection of multiple interconnected databases,
which are spread physically across various locations and communicate via a computer network.

Features of Distributed Database –

• Databases in the collection are logically interrelated with each other. Often, they represent a
single logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by a DBMS
independent of the other sites.
• The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing, but it is not synonymous with a
transaction processing system.

Distributed Database Management System - A distributed database management system (DDBMS)
is a centralized software system that manages a distributed database as if it were all stored in a
single location.

Features of DDBMS -

• It is used to create, retrieve, update and delete distributed databases.
• It synchronizes the database periodically and provides access mechanisms by virtue of
which the distribution becomes transparent to the users.
which the distribution becomes transparent to the users.
• It ensures that the data modified at any site is universally updated.
• It is used in application areas where large volumes of data are processed and accessed by
numerous users simultaneously.
• It is designed for heterogeneous database platforms.
• It maintains confidentiality and data integrity of the databases.

Data Models - Data models define how the logical structure of a database is modelled. Data models
are fundamental entities for introducing abstraction in a DBMS. They define how data is
connected to each other and how it is processed and stored inside the system. The very first
data models were flat data models, where all the data used was kept in the same plane.
Earlier data models were not very scientific; hence, they were prone to introduce lots of duplication
and update anomalies.

Entity Relationship (ER) Model - The ER model defines the conceptual view of a database. It works
around real-world entities and the associations among them. At view level, the ER model is
considered a good option for designing databases. The entity-relationship model (or ER model) is a
way of graphically representing the logical relationships of entities (or objects) in order to create a
database. The ER model was first proposed by Peter Pin-Shan Chen of Massachusetts Institute of
Technology (MIT) in the 1970s. An entity relationship model, also called an entity-relationship (ER)
diagram, is a graphical representation of entities and their relationships to each other, typically used
in computing in regard to the organization of data within databases or information systems.

Entity - An entity can be a real-world object, either animate or inanimate, that can be easily
identifiable. For example, in a school database, students, teachers, classes, and courses offered can
be considered as entities. All these entities have some attributes or properties that give them their
identity. In relation to a database, an entity is a single person, place, or thing about which data can
be stored.

Entity Set - An entity set is a collection of similar types of entities. An entity set may contain entities
with attributes sharing similar values. For example, a Students set may contain all the students of a
school; likewise, a Teachers set may contain all the teachers of a school from all faculties. Entity sets
need not be disjoint. An entity set is the set of entities of the same type that share the same attributes.
For example, the set of all people who are customers at a particular bank can be defined as the entity
set customer.

Strong Entity - The Strong Entity is the one whose existence does not depend on the existence of
any other entity in a schema. It is denoted by a single rectangle. A strong entity always has the
primary key in the set of attributes that describes the strong entity. It indicates that each entity in a
strong entity set can be uniquely identified.

A set of similar types of strong entities together forms a Strong Entity Set. A strong entity holds a
relationship with a weak entity via an Identifying Relationship, which is denoted by a double
diamond in the ER diagram. On the other hand, the relationship between two strong entities is
denoted by a single diamond and is simply called a relationship.

Let us understand this concept with the help of an example: a customer borrows a loan. Here we
have two entities, first a customer entity and second a loan entity.

Observing the ER diagram above, for each loan there should be at least one borrower, otherwise
that loan would not be listed in the Loan entity set. But even if a customer does not borrow any loan,
the customer would still be listed in the Customer entity set. So, we can conclude that a customer
entity does not depend on a loan entity.

Weak Entity - A weak entity is one that depends on a strong entity for its existence. A weak
entity is denoted by a double rectangle. A weak entity does not have a primary key; instead it has
a partial key that uniquely discriminates among the weak entities. The primary key of a weak entity is a
composite key formed from the primary key of the strong entity and the partial key of the weak entity.
The collection of similar weak entities is called a Weak Entity Set. The relationship between a weak
entity and a strong entity is always denoted with an Identifying Relationship, i.e. a double diamond.
(A DDL sketch follows.)
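
A hedged DDL sketch with invented loan and payment tables: payment is the weak entity, payment_no alone is only a partial key, so the primary key is composed of the owner's key plus the partial key, and the identifying relationship is enforced with a foreign key.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE loan (loan_no INTEGER PRIMARY KEY, amount REAL);

        -- Weak entity: payment_no alone does not identify a payment;
        -- the composite (loan_no, payment_no) does.
        CREATE TABLE payment (
            loan_no    INTEGER,
            payment_no INTEGER,  -- partial key / discriminator
            paid_on    TEXT,
            PRIMARY KEY (loan_no, payment_no),
            FOREIGN KEY (loan_no) REFERENCES loan (loan_no)
        );
    """)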

Difference Between Strong and Weak Entity -

• The basic difference between strong entity and a weak entity is that the strong entity has a
primary key whereas, a weak entity has the partial key which acts as a discriminator
between the entities of a weak entity set.
• A weak entity always depends on the strong entity for its existence whereas, a strong entity
is independent of any other entity’s existence.
• A strong entity is denoted with a single rectangle and a weak entity is denoted with a double
rectangle.
• The relationship between two strong entities is denoted with single diamond whereas, a
relationship between a weak and a strong entity is denoted with double diamond called
Identifying Relationship.
• The strong entity may or may not show total participation in its relationships, but the weak
entity always shows total participation in the identifying relationship, which is denoted by
a double line.

Attributes - Entities are represented by means of their properties, called attributes. All attributes
have values. For example, a student entity may have name, class, and age as attributes. There exists
a domain or range of values that can be assigned to attributes. For example, a student's name
cannot be a numeric value. It has to be alphabetic. A student's age cannot be negative, etc. A
database attribute is a column or field in a database table. For example, given a Customers table, a
Name column is an attribute of that table.

Types of Attributes -

• Simple / Atomic Attribute − Simple attributes are atomic values, which cannot be divided
further. For example, a student's phone number is an atomic value of 10 digits.
• Composite Attribute − Composite attributes are made of more than one simple attribute.
For example, a student's complete name may have first_name and last_name.
• Derived Attribute − Derived attributes are attributes that do not exist in the physical
database; their values are derived from other attributes present in the database. For
example, average_salary in a department should not be saved directly in the database;
instead it can be derived. As another example, age can be derived from date_of_birth (see
the sketch after this list).
• Single-value Attribute − Single-value attributes contain single value. For example −
Social_Security_Number.
• Multi-value Attribute − Multi-value attributes may contain more than one value. For
example, a person can have more than one phone number, email_address, etc.
• Prime or Key Attributes - Attributes of the relation which exist in at least one of the possible
candidate keys are called prime or key attributes.
• Non-Prime or Non-Key Attributes - Attributes of the relation which do not exist in any of
the possible candidate keys of the relation are called non-prime or non-key attributes.
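
A small sketch of a derived attribute, assuming SQLite and an invented person table: age is never stored; it is computed from date_of_birth at query time.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, date_of_birth TEXT)")
    conn.execute("INSERT INTO person VALUES ('Asha', '2000-06-01')")

    # age is derived from date_of_birth rather than saved in the table.
    query = """SELECT name,
                      CAST((julianday('now') - julianday(date_of_birth)) / 365.25 AS INTEGER) AS age
               FROM person"""
    print(conn.execute(query).fetchone())  # e.g. ('Asha', 25), depending on the current date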

Keys - A key is an attribute or collection of attributes that uniquely identifies an entity within an entity
set. For example, the roll_number of a student makes him/her identifiable among students. (A DDL
sketch covering several of the key types below follows this list.)

• Super Key − A set of attributes (one or more) that collectively identifies an entity in an entity
set. With the help of super key, we can identify a row uniquely among the set of rows.
• Candidate Key − A minimal super key is called a candidate key. An entity set may
have more than one candidate key.
• Primary Key − A primary key is one of the candidate keys chosen by the database designer
to uniquely identify the entity set. A primary key uniquely identifies each record in a table
and must never be the same for 2 records. A primary key is a set of one or more fields
(columns) of a table that uniquely identify a record in a database table. A table can have only
one primary key, and only one candidate key can be selected as the primary key. The primary
key should be chosen such that its attributes are never or rarely changed.
• Alternate Key - Alternate keys are candidate keys that are not selected as primary key.
Alternate key can also work as a primary key. Alternate key is also called “Secondary Key”.
Out of all candidate keys, only one gets selected as primary key, remaining keys are known
as alternate or secondary keys.
• Unique Key - A unique key is a set of one or more attributes that can be used to uniquely
identify the records in a table. A unique key is similar to a primary key, but a unique key field
can contain a NULL value, whereas a primary key does not allow NULL values.
• Composite Key - A composite key is a combination of more than one attribute that can be
used to uniquely identify each record. It is also known as a “compound” key. A composite key
may be a candidate or primary key.
• Natural Keys - A natural key is a key composed of columns that actually have a logical
relationship to other columns within a table. For example, if we use student_id,
student_name and father_name columns to form a key then it would be “Natural Key”
because there is definitely a relationship between these columns and other columns that
exist in table. Natural keys are often called “Business Key” or “Domain Key”.
• Surrogate Key - A surrogate key is an artificial key that is used to uniquely identify the records
in a table. For example, database systems such as SQL Server or Sybase contain an artificial key
known as “Identity”. Surrogate keys are just simple sequential numbers, and they are
used only to act as a primary key.
• Partial Key or Discriminator key - It is a set of attributes that can uniquely identify weak
entities and that are related to same owner entity. It is sometime called as Discriminator.
• Foreign Keys - A foreign key is used to establish the relationship between tables. A foreign
key is a field in a database table that is the primary key in another table. A foreign key can accept
NULL and duplicate values. Foreign keys are the columns of a table that point to the primary
key of another table; they act as a cross-reference between tables. A foreign key is an
attribute or combination of attributes in one base table that points to the candidate key
(generally the primary key) of another table. The purpose of the foreign key is to ensure the
referential integrity of the data, i.e. only values that are supposed to appear in the database
are permitted.
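
The following hedged DDL sketch pulls several of these key types together in SQLite; all table and column names are invented. dept_id acts as a surrogate-style primary key, dname as an alternate/unique key, email as a unique key that may be NULL, (roll_no, course_no) as a composite primary key, and student.dept_id as a foreign key.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript("""
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,  -- surrogate-style key
            dname   TEXT UNIQUE           -- remaining candidate key kept as a unique/alternate key
        );
        CREATE TABLE student (
            roll_no INTEGER PRIMARY KEY,  -- chosen primary key (a natural key here)
            email   TEXT UNIQUE,          -- unique key: no duplicates, but NULL is allowed
            dept_id INTEGER,
            FOREIGN KEY (dept_id) REFERENCES department (dept_id)
        );
        CREATE TABLE enrolment (
            roll_no   INTEGER,
            course_no INTEGER,
            PRIMARY KEY (roll_no, course_no)  -- composite key
        );
    """)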

Relationship - The association among entities is called a relationship. For example, an employee
works_at a department, a student enrols in a course. Here, works_at and enrols are called
relationships.

Relationship Set - A set of relationships of similar type is called a relationship set. Like entities, a
relationship too can have attributes. These attributes are called descriptive attributes.

Degree of Relationship - The number of participating entities in a relationship defines the degree of
the relationship.

• Binary = degree 2
• Ternary = degree 3
• n-ary = degree n

Total Participation - Total participation of an entity set means that each entity in the entity set must participate in at least one relationship in the relationship set. For example, if each college must have at least one associated student, the participation of the College entity set is total.

Mapping Cardinalities - Cardinality defines the number of entities in one entity set, which can be
associated with the number of entities of other set via relationship set.

• One-to-one − One entity from entity set A can be associated with at most one entity of
entity set B and vice versa.
D B M S N o t e s b y N a r a y a n V y a s | P a g e | 20

• One-to-many − One entity from entity set A can be associated with more than one entity of entity set B; however, an entity from entity set B can be associated with at most one entity of entity set A.

• Many-to-one − More than one entity from entity set A can be associated with at most one entity of entity set B; however, an entity from entity set B can be associated with more than one entity from entity set A.

• Many-to-many − One entity from A can be associated with more than one entity from B and
vice versa.
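As a simple sketch of how these cardinalities are typically realized in tables (names are hypothetical, and the DEPARTMENT, STUDENT and COURSE tables are assumed to already exist): a one-to-many relationship is usually implemented with a foreign key on the “many” side, while a many-to-many relationship needs a separate junction table.

-- One-to-many: one DEPARTMENT is associated with many EMPLOYEEs
CREATE TABLE EMPLOYEE (
    EMP_NO  INT PRIMARY KEY,
    DEPT_NO INT REFERENCES DEPARTMENT (DEPT_NO)
);

-- Many-to-many: a STUDENT enrols in many COURSEs, and a COURSE has many STUDENTs
CREATE TABLE ENROLS (
    STUD_NO   INT REFERENCES STUDENT (STUD_NO),
    COURSE_ID INT REFERENCES COURSE (COURSE_ID),
    PRIMARY KEY (STUD_NO, COURSE_ID)
);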
Crow’s Foot Notation - Crow's Foot notation is used in Barker's Notation, SSADM and Information
Engineering. Crow's Foot diagrams represent entities as boxes, and relationships as lines between
the boxes. Different shapes at the ends of these lines represent the cardinality of the relationship.
Each end of a relationship line carries two indicators -

• The first (often called multiplicity) refers to the maximum number of times that an instance of one entity can be associated with instances of the related entity. It can be one or many.

• The second describes the minimum number of times one instance can be related to others. It can be zero or one, and accordingly describes the relationship as optional or mandatory.

The combination of these two indicators always appears in a specific order: placed on the outside edge of the relationship, the symbol of multiplicity comes first, and the symbol indicating whether the relationship is mandatory or optional is shown after it.

In crow’s foot notation -

• A multiplicity of one and a mandatory relationship is represented by a straight line perpendicular to the relationship line.
• A multiplicity of many is represented by the three-pronged ‘crow-foot’ symbol.
• An optional relationship is represented by an empty circle.

Combined, these symbols make relationship cardinalities readable as –

• One-to-one
• One-to-many
• Many-to-many
Let us now learn how the ER Model is represented by means of an ER diagram. Any object, for
example, entities, attributes of an entity, relationship sets, and attributes of relationship sets, can be
represented with the help of an ER diagram.

Entity - Entities are represented by means of rectangles. Rectangles are named with the entity set
they represent.

Attributes - Attributes are the properties of entities. Attributes are represented by means of ellipses.
Every ellipse represents one attribute and is directly connected to its entity (rectangle).

If the attributes are composite, they are further divided in a tree-like structure. Every node is then connected to its attribute. That is, composite attributes are represented by ellipses that are connected to further ellipses.
Multivalued attributes are depicted by double ellipse.

Derived attributes are depicted by dashed ellipse.


ER Diagram Symbols -

Examples of ER Diagrams –

Generalization - The ER Model has the power of expressing database entities in a conceptual
hierarchical manner. As the hierarchy goes up, it generalizes the view of entities, and as we go deep
in the hierarchy, it gives us the detail of every entity included.

Going up in this structure is called generalization, where entities are clubbed together to represent a
more generalized view. For example, a particular student named Mira can be generalized along with
all the students. The entity shall be a student, and further, the student is a person. The reverse is
called specialization where a person is a student, and that student is Mira.

As mentioned above, the process of generalizing entities, where the generalized entity contains the properties of all the entities it generalizes, is called generalization. In generalization, a number of entities are brought together into one generalized entity based on their similar characteristics. For example, pigeon, house sparrow, crow and dove can all be generalized as Birds.

Generalization is the process of extracting common properties from a set of entities and creating a generalized entity from them. It is a bottom-up approach in which two or more entities can be generalized into a higher-level entity if they have some attributes in common.

Specialization - Specialization is the opposite of generalization. In specialization, a group of entities is divided into sub-groups based on their characteristics. Take a group ‘Person’ for example. A person has a name, date of birth, gender, etc. These properties are common to all persons, all human beings. But in a company, persons can be identified as employee, employer, customer, or vendor, based on what role they play in the company. Similarly, in a school database, persons can be specialized as teacher, student, or staff, based on what role they play in school as entities. It is a top-down approach in which a higher-level entity is specialized into two or more lower-level entities.

Inheritance - We use all the above features of the ER model in order to create classes of objects in object-oriented programming. The details of entities are generally hidden from the user; this process is known as abstraction. Inheritance is an important feature of generalization and specialization. It allows lower-level entities to inherit the attributes of higher-level entities. For example, the attributes of a Person class such as name, age, and gender can be inherited by lower-level entities such as Student or Teacher.
Aggregation - A concept which is used to model a relationship between a collection of entities and
relationships. It is used when we need to express a relationship among relationships. Aggregation is
a feature of the entity relationship model that allows a relationship set to participate in another
relationship set. This is indicated on an ER diagram by drawing a dashed box around the aggregation.
Aggregation is an abstraction that treats relationships as entities.

An ER diagram is not capable of representing a relationship between an entity and a relationship, which may be required in some scenarios. In those cases, a relationship with its corresponding entities is aggregated into a higher-level entity. For example, an employee working on a project may require some machinery, so a REQUIRE relationship is needed between the relationship WORKS_FOR and the entity MACHINERY. Using aggregation, the WORKS_FOR relationship with its entities EMPLOYEE and PROJECT is aggregated into a single entity, and the relationship REQUIRE is created between the aggregated entity and MACHINERY.

Hierarchical Model - In the hierarchical model, data is organized into a tree-like structure, with each record having one parent record and many children. The main drawback of this model is that it can represent only one-to-many relationships between nodes.
The earliest model was the hierarchical database model, resembling an upside-down tree. Files are
related in a parent-child manner, with each parent capable of relating to more than one child, but
each child only being related to one parent. Most of you will be familiar with this kind of structure: it’s the way most file systems work. There is usually a root, or top-level, directory that contains various other directories and files. Each subdirectory can then contain more files and directories, and so on. Each file or directory can exist in only one directory; it has only one parent.

This database model organises data into a tree-like-structure, with a single root, to which all the
other data is linked. The hierarchy starts from the Root data, and expands like a tree, adding child
nodes to the parent nodes. In this model, a child node will only have a single parent node. This
model efficiently describes many real-world relationships like index of a book, recipes etc.

The hierarchical database model is one of the oldest database models, dating from the late 1950s. One of the first hierarchical databases, Information Management System (IMS), was developed jointly by the North American Rockwell Company and IBM. This model is like the structure of a tree, with the records forming the nodes and the fields forming the branches. The hierarchical model organizes data elements as tabular rows, one for each instance of an entity.

Advantages of Hierarchical Model -

• Simplicity - Data naturally have hierarchical relationships in most practical situations, so it is easier to view data arranged in this manner. This makes this type of database more suitable for such purposes.
• Security - These database systems can enforce varying degrees of security, unlike flat-file systems.
• Database Integrity - Because of its inherent parent-child structure, database integrity is highly promoted in these systems.
• Efficiency - The hierarchical database model is very efficient when the database contains a large number of 1:N (one-to-many) relationships and when users require a large number of transactions using data whose relationships are fixed.
• Minimized Disk Input and Output - Parent and child records are stored close to each other on storage devices, which helps to minimize disk input and output.
• Fast Navigation - Due to the short distance between parent and child, database access time and performance are improved.
Disadvantages of Hierarchical Model -

• Complexity of Implementation - The actual implementation of a hierarchical database depends on the physical storage of data. This makes the implementation complicated.
• Difficulty in Management - The movement of a data segment from one location to another causes all the accessing programs to be modified, making database management a complex affair.
• Complexity of Programming - Programming a hierarchical database is relatively complex
because the programmers must know the physical path of the data items.
• Poor Portability - The database is not easily portable mainly because there is little or no
standard existing for these types of database.
• Database Management Problems - If you make any changes in the database structure of a
hierarchical database, then you need to make the necessary changes in all the application
programs that access the database. Thus, maintaining the database and the applications can
become very difficult.
• Lack of Structural Independence - Structural independence exists when the changes to the
database structure does not affect the DBMS's ability to access data. Hierarchical database
systems use physical storage paths to navigate to the different data segments. So, the
application programs should have a good knowledge of the relevant access paths to access
the data. So, if the physical structure is changed the applications will also have to be
modified. Thus, in a hierarchical database the benefits of data independence are limited by
structural dependence.
• Programs Complexity - Due to the structural dependence and the navigational structure, the
application programs and the end users must know precisely how the data is distributed
physically in the database in order to access data. This requires knowledge of complex
pointer systems, which is often beyond the grasp of ordinary users (users who have little or
no programming knowledge).
• Operational Anomalies - Hierarchical model suffers from the Insert anomalies, Update
anomalies and Deletion anomalies, also the retrieval operation is complex and asymmetric,
thus hierarchical model is not suitable for all the cases.
• Implementation Limitation - Many common relationships do not conform to the 1:N format required by the hierarchical model. The many-to-many (N:N) relationships, which are more common in real life, are very difficult to implement in a hierarchical model.
• Deletion Problem - If a parent is deleted, its child records are also deleted automatically.
• Difficult to Re-organize - It is difficult to re-organize the database due to the hierarchy, because parent-to-child relationships can be disturbed.

Network Model – In network model, entities are organized in a graph, in which some entities can be
accessed through several paths. The network model builds on the hierarchical model by allowing
many-to-many relationships between linked records, implying multiple parent records. Based on
mathematical set theory, the model is constructed with sets of related records. Each set consists of
one owner or parent record and one or more member or child records. A record can be a member or
child in multiple sets, allowing this model to convey complex relationships. It was most popular in
the 70s after it was formally defined by the Conference on Data Systems Languages (CODASYL).
The main difference of the network model from the hierarchical model is its ability to handle many-to-many (N:N) relations. In other words, it allows a record to have more than one parent. Suppose an employee works for two departments: the strict hierarchical arrangement is not possible here, and the tree becomes a more generalized graph, a network. The network model evolved specifically to handle non-hierarchical relationships. Data can belong to more than one parent, and there are lateral connections as well as top-down connections. A network structure thus allows 1:1 (one-to-one), 1:M (one-to-many) and M:M (many-to-many) relationships among entities.

In network database terminology, a relationship is a set. Each set is made up of at least two types of
records, an owner record (equivalent to parent in the hierarchical model) and a member record
(similar to the child record in the hierarchical model).

Advantages of Network Model -

• Conceptual simplicity - Just like the hierarchical model, the network model is also
conceptually simple and easy to design.
• Capability to handle more relationship types - The network model can handle the one-to-many (1:N) and many-to-many (N:N) relationships, which is a real help in modelling real-life situations.
• Ease of data access - Data access is easier and more flexible than in the hierarchical model.
• Data Integrity - The network model does not allow a member to exist without an owner.
Thus, a user must first define the owner record and then the member record. This ensures
the data integrity.
• Data independence - The network model is better than the hierarchical model in isolating
the programs from the complex physical storage details.
• Database Standards - One of the major drawbacks of the hierarchical model was the non-availability of universal standards for database design and modelling. The network model is based on the standards formulated by the DBTG and augmented by ANSI/SPARC (American National Standards Institute / Standards Planning and Requirements Committee) in the 1970s. All the network database management systems conformed to these standards, which included a Data Definition Language (DDL) and a Data Manipulation Language (DML), thus greatly enhancing database administration and portability.

Disadvantages of Network Model -
• System complexity - All the records are maintained using pointers and hence the whole
database structure becomes very complex.
• Operational Anomalies - As discussed earlier, the network model's insertion, deletion and updating operations on any record require a large number of pointer adjustments, which makes its implementation very complex and complicated.
• Absence of structural independence - Since the data access method in the network
database model is a navigational system, making structural changes to the database is very
difficult in most cases and impossible in some cases. If changes are made to the database
structure then all the application programs need to be modified before they can access data.
Thus, even though the network database model succeeds in achieving data independence, it
still fails to achieve structural independence.

Relational Model - The most common model, the relational model stores data into tables, also
known as relations, each of which consists of columns and rows. Each column lists an attribute of the
entity in question, such as price, zip code, or birth date. The set of values permitted for an attribute is called its domain. A particular attribute or combination of attributes is chosen as a primary key that can be referred to in other tables, where it is called a foreign key. Each row, also called a tuple,
includes data about a specific instance of the entity in question, such as a particular employee. The
model also accounts for the types of relationships between those tables, including one-to-one, one-
to-many, and many-to-many relationships.

The relational model is the conceptual basis of relational databases. Proposed by E.F. Codd in 1969,
it is a method of structuring data using relations, which are grid-like mathematical structures
consisting of columns and rows. Codd proposed the relational model for IBM, but he had no idea
how extremely vital and influential his work would become as the basis of relational databases.
Relational data model is the primary data model, which is used widely around the world for data
storage and processing. This model is simple and it has all the properties and capabilities required to
process data with storage efficiency.

Terminology –

• Tables − In the relational data model, relations are saved in the format of tables. This format stores the relation among entities. A table has rows and columns, where rows represent records and columns represent attributes.
• Tuple − A single row of a table, which contains a single record for that relation is called a
tuple.
• Relation Instance − A finite set of tuples in the relational database system represents a relation instance. Relation instances do not have duplicate tuples. The set of values present in a relation at a particular instant of time is known as a relation instance.
• Relation Schema − A relation schema describes the relation name (table name), attributes,
and their names.
• Relation Key − Each row has one or more attributes, known as relation key, which can
identify the row in the relation (table) uniquely.
• Domain − A domain is defined as the set of all unique values permitted for an attribute. For
example, a domain of date is the set of all possible valid dates, a domain of integer is all
possible whole numbers, a domain of day-of-week is Monday, Tuesday etc. Every attribute
has some pre-defined value scope, known as attribute domain.
• NULL values - Values of some attribute for some tuples may be unknown, missing or
undefined which are represented by NULL. Two NULL values in a relation are considered
different from each other.
• Degree - The number of attributes in the relation is known as degree of the relation.
• Cardinality - The number of tuples (rows) in the relation is known as the cardinality of the relation.
• Column - Column represents the set of values for a particular attribute.

The Relational Model is a depiction of how each piece of stored information relates to the other
stored information. It shows how tables are linked, what type of links are between tables, what keys
are used, what information is referenced between tables. It's an essential part of developing a
normalized database structure to prevent repeat and redundant data storage.

The basic idea behind the relational model is that a database consists of a series of unordered tables (or relations) that can be manipulated using non-procedural operations that return tables. This model was in stark contrast to the more traditional database theories of the time, which were much more complicated, less flexible and dependent on the physical storage methods of the data. It is based on relational algebra, set theory and predicate logic.

It is commonly thought that the word relational in the relational model comes from the fact that you
relate together tables in a relational database. Although this is a convenient way to think of the
term, it's not accurate. Instead, the word relational has its roots in the terminology that Codd used
to define the relational model. The table in Codd's writings was actually referred to as a relation (a
related set of information).

In fact, Codd (and other relational database theorists) use the terms relations, attributes and tuples
where most of us use the more common terms tables, columns and rows, respectively (or the more
physical—and thus less preferable for discussions of database design theory—files, fields and
records).

The relational model can be applied to both databases and database management systems (DBMS) themselves. The relational fidelity of database programs can be compared using Codd's 12 rules (since Codd's seminal paper on the relational model, the number of rules has been expanded considerably) for determining how DBMS products conform to the relational model. When compared with other database management programs, Microsoft Access fares quite well in terms of relational fidelity, but it still has a long way to go before it meets all twelve rules completely.

Advantages of Relational Database Model -

• Ease of use - The representation of any information as tables consisting of rows and columns is quite natural, and therefore even first-time users find it attractive.
• Flexibility - Different tables from which information has to be linked and extracted can be
easily manipulated by operators such as project and join to give information in the form in
which it is desired.
• Security - Security control and authorization can also be implemented more easily by moving
sensitive attributes in a given table into a separate relation with its own authorization
controls. If authorization requirement permits, a particular attribute could be joined back
with others to enable full information retrieval.
• Data Independence - Data independence is achieved more easily with normalization
structure used in a relational database than in the more complicated tree or network
structure.
• Simplicity - The relational model structures data in a manner that avoids complexity. The
table structure is an intuitive organization familiar to most users, particularly those who
have worked with physical or software spreadsheets, check registers or other tabular data.
Data are organized naturally within the model, simplifying the development and use of the
database.
• Ease of Data Retrieval - Under the relational model, accessing data in a database does not
require navigating a rigid pathway through a tree or hierarchy. Users can query any table in
the database, and combine related tables using special join functions to include relevant
data contained in other tables in the results. Results can be filtered based on the content of
any column, and on any number of columns, allowing users to easily retrieve meaningful
results. Users can choose which columns to include in the results so that only relevant data
are displayed.
• Data Integrity - Data integrity is an essential feature of the relational model. Strong data
typing and validity checks ensure data fall within acceptable ranges, and required data are
present. Referential integrity among tables prevents records from becoming incomplete or
orphaned. Data integrity helps to ensure accuracy and consistency of the data.
• Normalization - A systematic methodology exists for ensuring a relational database design is
free of anomalies that may impact the integrity and accuracy of the database. "Database
normalization" provides a set of rules, qualities and objectives for the design and review of a
database structure. Normalization objectives are described in levels called "normal forms."
Each level of normalization must be completed before progressing to the next level. A
database design is generally considered normalized when it meets the requirements of the
third normal form. Normalization provides designers with confidence the database design is
robust and dependable.
• Data Manipulation Language - The possibility of responding to query by means of a
language based on relational algebra and relational calculus. For example, SQL is easy in the
relational database approach. For data organized in other structure the query language
either becomes complex or extremely limited in its capabilities.

Disadvantages of Relational Database –

• Performance - A major constraint, and therefore disadvantage, in the use of a relational database system is machine performance. If the number of tables between which relationships are established is large, and the tables themselves are large, performance in responding to SQL queries suffers.
• Physical Storage Consumption - With an interactive system, an operation like join depends on the physical storage as well. It is therefore common to tune relational databases, choosing the physical data layout so as to give good performance for the most frequently run operations; as a natural result, the less frequently run operations tend to become even slower.
• Cost - One disadvantage of relational databases is the expense of setting up and maintaining the database system. In order to set up a relational database, you generally need to purchase special software. If you are not a programmer, you can use any number of products to set up a relational database, but it takes time to enter in all the information and set up the program. If your company is large and you need a more robust database, you will need to hire a programmer to create a relational database using SQL and a database administrator to maintain the database once it is built. Regardless of what data you use, you will have to either import it from other sources such as text files or Excel spreadsheets, or have the data entered at the keyboard. No matter the size of your company, if you store legally confidential or protected information in your database, such as health information, social security numbers or credit card numbers, you will also have to secure your data against unauthorized access in order to meet regulatory standards.
• Structured Limits - Some relational databases have limits on field lengths. When you design the database, you have to specify the amount of data each field can hold. Some names or search entries are longer than the allotted field length, and this can lead to data loss.
• Isolated Databases - Complex relational database systems can lead to these databases
becoming "islands of information" where the information cannot be shared easily from one
large system to another. Often, with big firms or institutions, you find relational databases
grew in separate divisions differently. For example, maybe the hospital billing department
used one database while the hospital personnel department used a different database.
Getting those databases to "talk" to each other can be a large, and expensive, undertaking,
yet in a complex hospital system, all the databases need to be involved for good patient and
employee care.
• Hardware Overheads - Relational database systems hide the implementation complexities and the physical data storage details from the user. To do this, relational database systems need more powerful hardware and data storage devices.
• Ease of design can lead to bad design - The relational database is easy to design and use, and users need not know the complexities of data storage. This ease of design and use can, however, lead to the development and implementation of very poorly designed database management systems.

Constraints - While designing the relational model, we define some conditions which must hold for the data present in the database; these are called constraints. These constraints are checked before performing any operation (insertion, deletion or updating) on the database. If any constraint is violated, the operation fails.

Database Integrity - The preservation of the integrity of a database system is concerned with the
maintenance of the correctness and consistency of the data in a multi-user database environment.
This is a major task, since integrity violations may arise from many different sources, such as typing
errors by data entry clerks, logical errors in application programs, or errors in system software which
result in data corruption.

Many commercial database management systems have an integrity subsystem, which is responsible
for monitoring transactions, which update the database and detecting integrity violations. In the
event of an integrity violation, the system then takes appropriate action, which should involve
rejecting the operation, reporting the violation, and if necessary returning the database to a
consistent state.

Data Integrity - Data integrity is a fundamental component of information security. In its broadest
use, “data integrity” refers to the accuracy and consistency of data stored in a database, data
warehouse, data mart or other construct. Data with “integrity” is said to have a complete or whole
structure. Data values are standardized according to a data model and / or data type. All
characteristics of the data must be correct, including business rules, relations, dates, definitions and
lineage for data to be complete. Data integrity is imposed within a database when it is designed and
is authenticated through the ongoing use of error checking and validation routines. As a simple
example, to maintain data integrity numeric columns / cells should not accept alphabetic data.
Domain Integrity Constraints - These are attribute-level constraints. An attribute can only take values which lie inside its domain range. For example, if a constraint AGE > 0 is applied to the STUDENT relation, inserting a negative value of AGE will fail. These domain constraints are the most basic form of integrity constraint and are easy to test when data is entered.
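In SQL, such a domain constraint is commonly expressed with a CHECK clause; a minimal sketch, assuming a simple STUDENT table:

CREATE TABLE STUDENT (
    STUD_NO INT PRIMARY KEY,
    AGE     INT CHECK (AGE > 0)        -- domain constraint: AGE must be positive
);

INSERT INTO STUDENT VALUES (1, -5);    -- fails: violates the CHECK constraint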

Entity Integrity Constraints - Entity Integrity is the mechanism the system provides to maintain
primary keys. The primary key serves as a unique identifier for rows in the table. Entity Integrity
ensures two properties for primary keys –

• The primary key for a row is unique and not NULL.
• It does not match the primary key of any other row in the table.

Referential Integrity Constraints - Referential integrity is a relational database concept which states that table relationships must always be consistent. In other words, any foreign key value must agree with the primary key that it references. Thus, any change to a primary key value must either be propagated to all foreign keys that reference it, or not be made at all. Consider a bank database, which contains two tables -

• CUSTOMER_MASTER Table - This holds basic customer / account holder data such as name,
social security number, address and date of birth.
• ACCOUNTS_MASTER Table - This store basic bank account data such as account type,
account creation date, account holder and withdrawal limits.

To uniquely identify each customer / account holder in the CUSTOMER_MASTER table, a primary key
column named CUSTOMER_ID is created.

To identify a customer and bank account relationship in the ACCOUNTS_MASTER table, an existing customer in the CUSTOMER_MASTER table must be referenced. Thus, a CUSTOMER_ID column is also created in the ACCOUNTS_MASTER table, and this column is a foreign key. This column is special because its values are not newly created; rather, these values must reference existing and identical values in the primary key column of another table, namely the CUSTOMER_ID column of the CUSTOMER_MASTER table.

Referential integrity is a standard that means any CUSTOMER_ID value in the CUSTOMER_MASTER
table may not be edited without editing the corresponding value in the ACCOUNTS_MASTER table.
For example, if Andrew Smith’s customer ID is changed in the CUSTOMER_MASTER table, this
change also must be applied to the ACCOUNTS_MASTER table, thus allowing Andrew Smith’s
account information to link to his customer ID.

Referential integrity is a feature provided by relational database management systems (RDBMSs) that prevents users or applications from entering inconsistent data. Most RDBMSs have various referential integrity rules that you can apply when you create a relationship between two tables.

For example, suppose Table B has a foreign key that points to a field in Table A. Referential integrity
would prevent you from adding a record to Table B that cannot be linked to Table A. In addition, the
referential integrity rules might also specify that whenever you delete a record from Table A, any
records in Table B that are linked to the deleted record will also be deleted. This is called cascading
delete. Finally, the referential integrity rules could specify that whenever you modify the value of a
linked field in Table A, all records in Table B that are linked to it will also be modified accordingly.
This is called cascading update. The rules are -
• You can't delete a record from a primary table if matching records exist in a related table.
• You can't change a primary key value in the primary table if that record has related records.
• You can't enter a value in the foreign key field of the related table that doesn't exist in the
primary key of the primary table.
• However, you can enter a Null value in the foreign key, specifying that the records are
unrelated.
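Assuming the CUSTOMER_MASTER and ACCOUNTS_MASTER tables above (with a hypothetical ACCOUNT_ID column), these rules play out as follows:

-- Rejected if customer 101 still has matching rows in ACCOUNTS_MASTER:
DELETE FROM CUSTOMER_MASTER WHERE CUSTOMER_ID = 101;

-- Rejected: no customer 999 exists in CUSTOMER_MASTER:
INSERT INTO ACCOUNTS_MASTER (ACCOUNT_ID, CUSTOMER_ID) VALUES (5001, 999);

-- Allowed: a NULL foreign key marks the account as unrelated to any customer:
INSERT INTO ACCOUNTS_MASTER (ACCOUNT_ID, CUSTOMER_ID) VALUES (5002, NULL);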

Key Integrity Constraints - Every relation in the database should have at least one set of attributes which identifies a tuple uniquely. That set of attributes is called a key; e.g., ROLL_NO in STUDENT is a key, since no two students can have the same roll number. A key has two properties: it must be unique for all tuples and it cannot have NULL values.

User-defined Integrity Constraints - User-defined integrity refers to a set of rules specified by a user,
which do not belong to the entity, domain and referential integrity categories.

Cascade Update Related Fields - Any time you change the primary key of a row in the primary table,
the foreign key values are updated in the matching rows in the related table.

Cascade Delete Related Rows - Any time you delete a row in the primary table, the matching rows
are automatically deleted in the related table.

Anomalies - A database anomaly is normally a flaw in a database that occurs because of poor planning or storing everything in a flat database. An anomaly is an irregularity, or something which deviates from the expected or normal state. When designing databases, we identify three types of anomalies: insert, update and delete. Consider, as an example, a STUDENT table referenced by a STUDENT_COURSE table.

Insertion Anomaly - An insert anomaly occurs when certain attributes cannot be inserted into the database without the presence of other attributes. For example, if we try to insert a record in STUDENT_COURSE with STUD_NO = 7 when no such student exists in STUDENT, it will not be allowed.

Deletion and Updating Anomaly - A deletion anomaly is the unintended loss of data due to the deletion of other data. If a tuple is deleted or updated in the referenced relation while its referenced attribute value is used by a referencing attribute in the referencing relation, the DBMS will not allow the tuple to be deleted from the referenced relation. For example, if we try to delete a record from STUDENT with STUD_NO = 1 that is referenced in STUDENT_COURSE, it will not be allowed. To avoid this, the following can be used in the foreign key definition -
• ON DELETE/UPDATE SET NULL - If a tuple is deleted or updated in the referenced relation and its referenced attribute value is used by a referencing attribute in the referencing relation, the tuple is deleted/updated in the referenced relation and the value of the referencing attribute is set to NULL.
• ON DELETE/UPDATE CASCADE - If a tuple is deleted or updated in the referenced relation and its referenced attribute value is used by a referencing attribute in the referencing relation, the tuple is deleted/updated in the referenced relation and in the referencing relation as well.
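These referential actions are declared as part of the foreign key definition; a sketch using the STUDENT / STUDENT_COURSE example (column names assumed):

CREATE TABLE STUDENT_COURSE (
    STUD_NO   INT,
    COURSE_ID INT,
    FOREIGN KEY (STUD_NO) REFERENCES STUDENT (STUD_NO)
        ON DELETE CASCADE    -- or ON DELETE SET NULL
        ON UPDATE CASCADE    -- propagate primary key changes
);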

RDBMS - RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL,
and for all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft
Access. A Relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model as introduced by E. F. Codd.

Codd’s 12 Rules - Dr Edgar F. Codd, after his extensive research on the relational model of database systems, came up with twelve rules of his own which, according to him, a database must obey in order to be regarded as a true relational database. These rules can be applied to any database system that manages stored data using only its relational capabilities. Rule 0 is a foundation rule, which acts as a base for all the other rules.

• Rule 0: Ability to Manage Database - This rule states that for a system to qualify as an
RDBMS, it must be able to manage database entirely through the relational capabilities.
• Rule 1: Information Rule - The data stored in a database, may it be user data or metadata,
must be a value of some table cell. Everything in a database must be stored in a table
format.
• Rule 2: Guaranteed Access Rule - Every single data element (value) is guaranteed to be
accessible logically with a combination of table-name, primary-key (row value), and
attribute-name (column value). No other means, such as pointers, can be used to access
data.
• Rule 3: Systematic Treatment of NULL Values - The NULL values in a database must be given a systematic and uniform treatment. This is a very important rule because a NULL can be interpreted as one of the following: data is missing, data is not known, or data is not applicable.
• Rule 4: Active Online Catalog - The structure description of the entire database must be
stored in an online catalog, known as data dictionary, which can be accessed by authorized
users. Users can use the same query language to access the catalog which they use to access
the database itself.
• Rule 5: Comprehensive Data Sub-Language Rule - A database can only be accessed using a
language having linear syntax that supports data definition, data manipulation, and
transaction management operations. This language can be used directly or by means of
some application. If the database allows access to data without any help of this language,
then it is considered as a violation.
• Rule 6: View Updating Rule - All the views of a database, which can theoretically be
updated, must also be updatable by the system.
• Rule 7: High-Level Insert, Update, and Delete Rule - A database must support high-level
insertion, updating, and deletion. This must not be limited to a single row, that is, it must
also support union, intersection and minus operations to yield sets of data records.
• Rule 8: Physical Data Independence - The data stored in a database must be independent of
the applications that access the database. Any change in the physical structure of a database
must not have any impact on how the data is being accessed by external applications.
• Rule 9: Logical Data Independence - The logical data in a database must be independent of its user’s view (application). Any change in logical data must not affect the applications using it. For example, if two tables are merged or one is split into two different tables, there should be no impact or change on the user application. This is one of the most difficult rules to apply.
• Rule 10: Integrity Independence - A database must be independent of the application that
uses it. All its integrity constraints can be independently modified without the need of any
change in the application. This rule makes a database independent of the front-end
application and its interface.
• Rule 11: Distribution Independence - The end-user must not be able to see that the data is
distributed over various locations. Users should always get the impression that the data is
located at one site only. This rule has been regarded as the foundation of distributed
database systems.
• Rule 12: Non-Subversion Rule - If a system has an interface that provides access to low-level
records, then the interface must not be able to subvert the system and bypass security and
integrity constraints. For example, bypassing a relational security or integrity constraint.

Graph Model - The graph model is another model that is gaining popularity. These databases are based on graph theory and use nodes and edges to represent data. The structure is somewhat similar to object-oriented applications. Graph databases are generally easier to scale and usually perform faster for associative data sets.

Dimensional Model - A dimensional model is a database structure that is optimized for online queries and data warehousing tools. It is composed of "fact" and "dimension" tables. A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point for getting at the facts.

Object-oriented Database Model - This model defines a database as a collection of objects, or reusable software elements, with associated features and methods. An object database is a database management system in which information is represented in the form of objects, as used in object-oriented programming. There are several kinds of object-oriented databases -

• A multimedia database incorporates media, such as images, that could not be stored in a
relational database.
• A hypertext database allows any object to link to any other object. It’s useful for organizing
lots of disparate data, but it’s not ideal for numerical analysis.

The object-oriented database model is the best known post-relational database model, since it
incorporates tables, but isn’t limited to tables. Such models are also known as hybrid database
models.

Object-relational Model - This hybrid database model combines the simplicity of the relational
model with some of the advanced functionality of the object-oriented database model. In essence, it
allows designers to incorporate objects into the familiar table structure. Languages and call
interfaces include SQL3, vendor languages, ODBC, JDBC, and proprietary call interfaces that are
extensions of the languages and interfaces used by the relational model.

Difference Between Network, Hierarchical and Relational Model -

• Database Evolution - Hierarchical: 2nd generation. Network: 2nd generation. Relational: 3rd generation.
• Acronym - Hierarchical: DBMS. Network: DBMS. Relational: RDBMS.
• Description - Hierarchical: hierarchical orientation and navigation; hierarchies of related records; standard interfaces. Network: network orientation and navigation; uses hierarchically arranged data, with the exception that child tables can have more than one parent; standard interfaces. Relational: relational orientation; data retrieved by unique keys; relationships expressed through matching keys; physical organization of data managed by the RDBMS.
• Physical Structure - Hierarchical: a tree of parent-child relationships; a single table acts as the “root” of the database, from which other tables “branch” out; a child can have only one parent, but a parent can have multiple children. Network: a network of interrelated lists that looks like several trees sharing branches; children can have multiple parents and parents multiple children. Relational: data is stored in relations (tables); relationships are maintained by placing a key field value of one record as an attribute in the related record.
• Programming Languages Used - Hierarchical and Network: commands embedded in programming languages such as COBOL, PL/1, Fortran, ADS and Assembler. Relational: SQL, ODBC.
• Structural Changes - Hierarchical and Network: inflexible (once data is organized in a particular way, it is difficult to change); data reorganization is complicated and requires careful design. Relational: flexible, because tables are subject-specific and key fields relate one entity to another; both the data and the database structure can be easily modified and manipulated; programs are independent of the data format, which yields flexibility when modifications are needed.
• Relationships - Hierarchical: linked lists using pointers stored in the parent/child records to navigate through the records; a pointer can be a disk address, the key field, or another random-access technique; access starts at the root and works down the tree to reach the target data; supports one-to-one and one-to-many relationships. Network: uses a series of linked lists to implement relationships between records; each list has an owner record and possibly many member records; a single record can be the owner or a member of several lists of various types; supports one-to-one, one-to-many and many-to-many relationships. Relational: uses key fields to link data in many different ways; supports one-to-one, one-to-many and many-to-many relationships.
• Advantages - Hierarchical: easily shows one-to-one and one-to-many relationships; more efficient than the flat-file model because there is less need for redundant data. Network: solves the problem of data redundancy by representing relationships in terms of sets rather than a hierarchy; allows complex data structures to be built; very efficient in storage and fast; does a better job with many-to-many relationships. Relational: data organization and relationships are more easily visualized; ease of design and user-friendly GUI interfaces; modern query tools and report generators; ease of data entry; tables represent a single subject and contain no duplicate data, which reduces errors, improves consistency and eases database maintenance.
• Disadvantages - Hierarchical: no support for many-to-many relationships; the user must know how the tree is structured to find anything; a record cannot be added to a child table until it has been incorporated into the parent table; still creates data duplication. Network: more difficult to navigate and visualize than the hierarchical model; difficult to implement and maintain; most implementations were used by programmers rather than end users. Relational: initially had slow performance in the 70s and 80s, though today’s more powerful machines speed up performance; wide price range and can be expensive; additional hardware might be required.
• Examples - Hierarchical: IMS (Information Management System) by IBM. Network: satellite communications, airline reservations, IDMS (Integrated Database Management System) from Cullinet. Relational: Oracle, Informix, Ingres, Sybase, DB2, MS Access, FileMaker Pro, Visual FoxPro, SQL Server, MySQL (free), PostgreSQL (free).
• Status Today - Hierarchical: limited usage (e.g. the Windows file structure). Network: limited usage (e.g. still used in satellite communications and airline reservation systems). Relational: the most popular DBMS in use today, as a result of technical development efforts ensuring that advances such as object orientation, web serving, etc. appear quickly and reliably.

Relational Algebra - Relational algebra is a procedural query language, which takes instances of
relations as input and yields instances of relations as output. It uses operators to perform queries.
An operator can be either unary or binary. They accept relations as their input and yield relations as
their output. Relational algebra is performed recursively on a relation and intermediate results are
also considered relations. The fundamental operations of relational algebra are as follows −

• Select
• Project
• Union
• Set difference
• Cartesian product
• Rename

Select Operation (σ) - It selects tuples that satisfy the given predicate from a relation.

Notation − σp(r)

Where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula which may use connectors like and, or, and not. These terms may use relational operators like =, ≠, ≥, <, >, ≤. Examples –
σsubject = "database"(Books)

Output − Selects tuples from books where subject is 'database'.

σsubject = "database" and price = "450"(Books)

Output − Selects tuples from books where subject is 'database' and 'price' is 450.

σsubject = "database" and price = "450" or year > "2010"(Books)

Output − Selects tuples from books where subject is 'database' and 'price' is 450 or those books
published after 2010.

Project Operation (∏) - It projects column(s) that satisfy a given predicate.

Notation − ∏A1, A2, ..., An (r)

Where A1, A2, ..., An are attribute names of relation r. Duplicate rows are automatically eliminated, as a relation is a set. For example −

∏subject, author (Books)


Selects and projects columns named subject and author from the relation Books. Select and project operations can also be composed; for example −

∏name (σcgpa > 9 (Student))
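For readers who know SQL, select and project correspond roughly to the WHERE clause and the column list; a sketch (note that SQL keeps duplicate rows unless DISTINCT is used, unlike the relational algebra project):

-- σ subject = "database" (Books)
SELECT * FROM Books WHERE subject = 'database';

-- ∏ subject, author (Books)
SELECT DISTINCT subject, author FROM Books;

-- ∏ name (σ cgpa > 9 (Student))
SELECT DISTINCT name FROM Student WHERE cgpa > 9;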


Union Operation (∪) - It performs a binary union between two given relations and is defined as −

Notation − r ∪ s

r ∪ s = {t | t ∈ r or t ∈ s}

Where r and s are either database relations or relation result sets (temporary relations). For a union operation to be valid, the following conditions must hold –

• r and s must have the same number of attributes.
• Attribute domains must be compatible.
• Duplicate tuples are automatically eliminated.

∏ author (Books) ∪ ∏ author (Articles)


Output − Projects the names of the authors who have either written a book or an article or both.

Set Difference (−) - The result of set difference operation is tuples, which are present in one relation
but are not in the second relation.

Notation − r−s

Finds all the tuples that are present in r but not in s.

∏ author (Books) − ∏ author (Articles)


Output − Provides the name of authors who have written books but not articles.
Cartesian Product (Χ) - Combines information of two different relations into one.

Notation − r Χ s

Where r and s are relations and their output will be defined as −

r Χ s = {q t | q ∈ r and t ∈ s}

σauthor = 'tutorialspoint'(Books Χ Articles)


Output − Yields a relation, which shows all the books and articles written by tutorialspoint.

Rename Operation (ρ) - The results of relational algebra are also relations but without any name.
The rename operation allows us to rename the output relation. 'rename' operation is denoted with
small Greek letter rho ρ.

Notation − ρ x (E)

Where the result of expression E is saved with name of x.

Relational Calculus - In contrast to relational algebra, relational calculus is a non-procedural query language; that is, it tells what to do but not how to do it. Relational calculus exists in two forms –

Tuple Relational Calculus (TRC) - Filtering variable ranges over tuples

Notation − {T | Condition}

Returns all tuples T that satisfies a condition. For example −

{T.name | Author(T) AND T.article = 'database' }


Output − Returns tuples with 'name' from Author who has written article on 'database'.

TRC can be quantified. We can use Existential (∃) and Universal Quantifiers (∀). For example −

{R| ∃T ∈ Authors(T.article='database' AND R.name=T.name)}


Output − The above query will yield the same result as the previous one.

Domain Relational Calculus (DRC) - In DRC, the filtering variable uses the domain of attributes
instead of entire tuple values (as done in TRC, mentioned above).

Notation − {a1, a2, a3, ..., an | P (a1, a2, a3, ...,an)}

Where a1, a2 are attributes and P stands for formulae built by inner attributes. For example −

{<article, page, subject> | <article, page, subject> ∈ TutorialsPoint ∧ subject = 'database'}
Output − Yields article, page, and subject from the relation TutorialsPoint, where subject is 'database'. Just like TRC, DRC can also be written using existential and universal quantifiers, and DRC also involves relational operators. The expressive power of tuple relational calculus and domain relational calculus is equivalent to that of relational algebra.

Decomposition of Relational Schema – Decomposition in DBMS is nothing but another name for normalization. Decomposition, or normalization, is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics such as insertion, update, and deletion anomalies that could lead to a loss of data integrity.

A functional decomposition is the process of breaking down the functions of an organization into
progressively greater (finer and finer) levels of detail. In decomposition, one function is described in
greater detail by a set of other supporting functions.

The decomposition of a relation scheme R consists of replacing the relation schema by two or more
relation schemas that each contain a subset of the attributes of R and together include all attributes
in R. Decomposition helps in eliminating some of the problems of bad design such as redundancy,
inconsistencies and anomalies. There are two types of decomposition -

• Lossy Decomposition
• Lossless Join Decomposition

Lossy Decomposition - "The decomposition of relation R into R1 and R2 is lossy when the join of R1
and R2 does not yield the same relation as in R." One of the disadvantages of decomposition into
two or more relational schemes (or tables) is that some information is lost during retrieval of original
relation or table. In lossy decomposition, spurious tuples are generated when a natural join is
applied to the relations in the decomposition.

Lossless Join Decomposition - "The decomposition of relation R into R1 and R2 is lossless when the join of R1 and R2 yields the same relation as R." A relational table is decomposed (or factored) into two or more smaller tables in such a way that the designer can capture the precise content of the original table by joining the decomposed parts. This is called lossless-join (or non-additive join) decomposition. The lossless-join decomposition is always defined with respect to a specific set F of dependencies. In a lossless decomposition, no spurious tuples are generated when a natural join is applied to the relations in the decomposition.
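For example (a standard textbook illustration), take R(A, B, C) with the functional dependency A → B. Decomposing R into R1(A, B) and R2(A, C) is lossless, because the common attribute A is a key of R1, so the natural join R1 ⋈ R2 reconstructs exactly R. Decomposing R instead into R1(A, B) and R2(B, C) is lossy when B is not a key: the natural join on B can generate spurious tuples that were never in R.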

SQL - Structured Query Language (SQL), as we all know, is the database language by the use of which we can perform certain operations on an existing database, and we can also use this language to create a database. SQL uses certain commands like CREATE, DROP, INSERT, etc. to carry out the required tasks. These SQL commands are mainly categorized into four categories, as discussed below -

DDL (Data Definition Language) - DDL or Data Definition Language actually consists of the SQL
commands that can be used to define the database schema. It simply deals with descriptions of the
database schema and is used to create and modify the structure of database objects in database.
Examples of DDL commands -

• CREATE – It is used to create the database or its objects (like table, index, function, views,
store procedure and triggers).
• DROP - It is used to delete objects from the database.
• ALTER - It is used to alter the structure of the database.
• TRUNCATE - It is used to remove all records from a table, including all space allocated for the records.
• COMMENT - It is used to add comments to the data dictionary.
• RENAME - It is used to rename an object existing in the database.
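
As a quick, hedged illustration, the DDL commands can be tried out with Python's built-in sqlite3 module. The table and column names here are made up, and not every engine supports every DDL command (SQLite, for instance, has no TRUNCATE):

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# CREATE - define a new schema object (here, a table)
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")

# ALTER - change the structure of an existing object
cur.execute("ALTER TABLE student ADD COLUMN city TEXT")

# DROP - remove the object (and its data) from the database
cur.execute("DROP TABLE student")

conn.close()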

DML (Data Manipulation Language) - The SQL commands that deal with the manipulation of data present in the database belong to DML or Data Manipulation Language, and this includes most of the SQL statements. Examples of DML -

• SELECT - It is used to retrieve data from a database.
• INSERT - It is used to insert data into a table.
• UPDATE - It is used to update existing data within a table.
• DELETE - It is used to delete records from a database table.
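
Continuing the same hedged sqlite3 sketch, here are the four DML commands in action (the table and rows are again invented purely for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")

# INSERT - add rows
cur.execute("INSERT INTO student VALUES (1, 'Asha')")
cur.execute("INSERT INTO student VALUES (2, 'Ravi')")

# UPDATE - modify existing rows
cur.execute("UPDATE student SET name = 'Ravi K' WHERE roll_no = 2")

# SELECT - retrieve rows
print(cur.execute("SELECT * FROM student").fetchall())  # [(1, 'Asha'), (2, 'Ravi K')]

# DELETE - remove rows
cur.execute("DELETE FROM student WHERE roll_no = 1")
conn.close()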

DCL (Data Control Language) - DCL includes commands such as GRANT and REVOKE, which mainly deal with the rights, permissions and other controls of the database system. Examples of DCL commands -

• GRANT - Gives users access privileges to the database.
• REVOKE - Withdraws users' access privileges given by using the GRANT command.

TCL (Transaction Control Language) - TCL commands deal with transactions within the database. Examples of TCL commands -

• COMMIT - Commits a transaction.
• ROLLBACK - Rolls back a transaction in case an error occurs.
• SAVEPOINT - Sets a save point within a transaction.
• SET TRANSACTION - Specifies characteristics for the transaction.
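
A minimal sketch of COMMIT, SAVEPOINT and ROLLBACK, again with sqlite3 (the account table and balances are assumptions for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 1000)")
conn.commit()                                    # COMMIT - make the changes permanent

conn.execute("SAVEPOINT before_debit")           # SAVEPOINT - mark a point inside the transaction
conn.execute("UPDATE account SET balance = balance - 500 WHERE id = 1")
conn.execute("ROLLBACK TO before_debit")         # ROLLBACK - undo back to the save point
conn.commit()

print(conn.execute("SELECT balance FROM account").fetchone())  # (1000,) - the debit was undone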

Functional Dependency - Functional dependency is a relationship that exists when one attribute
uniquely determines another attribute. If R is a relation with attributes X and Y, a functional
dependency between the attributes is represented as X->Y, which specifies Y is functionally
dependent on X. Here X is a determinant set and Y is a dependent attribute. Each value of X is
associated precisely with one Y value. Functional dependency in a database serves as a constraint
between two sets of attributes. Defining functional dependencies is an important part of relational database design and underlies normalization.

Attribute Closure - The attribute closure of an attribute set can be defined as the set of attributes which can be functionally determined from it.
Armstrong's Axioms - If F is a set of functional dependencies then the closure of F, denoted as F+, is the set of all functional dependencies logically implied by F. Armstrong's Axioms are a set of rules that, when applied repeatedly, generate the closure of a set of functional dependencies. Let A, B, C and D be arbitrary subsets of the set of attributes of the given relation R, and let AB denote the union of A and B. Then -

• Reflexivity - If B is a subset of A, then A → B.
• Augmentation - If A → B, then AC → BC.
• Transitivity - If A → B and B → C, then A → C.
• Projectivity or Decomposition Rule - If A → BC, then A → B and A → C.

Proof:

Step 1: A → BC (GIVEN)

Step 2: BC → B (Using Rule 1, since B ⊆ BC)

Step 3: A → B (Using Rule 3, on step 1 and step 2)

• Union or Additive Rule - If A→B, and A→C Then A→BC.



Proof:

Step 1: A → B (GIVEN)

Step 2: A → C (given)

Step 3: A → AB (using Rule 2 on step 1, since AA=A)

Step 4: AB → BC (using rule 2 on step 2)

Step 5: A → BC (using rule 3 on step 3 and step 4)

• Pseudo Transitive Rule - If A → B, DB → C, then DA → C

Proof:

Step 1: A → B (Given)

Step 2: DB → C (Given)

Step 3: DA → DB (Rule 2 on step 1)

Step 4: DA → C (Rule 3 on step 3 and step 2)

• Functional dependencies are neither commutative nor associative; i.e. X → Y does not imply Y → X.
• Composition Rule - If A → B and C → D, then AC → BD.
• Self Determination Rule - A → A always holds (self-determination).
Example to find closures –
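
A minimal Python sketch of the standard closure algorithm; the relation R(A, B, C) and the FD set F = {A → B, B → C} below are assumed purely for illustration:

def closure(attrs, fds):
    # Attribute closure: repeatedly apply FDs (lhs -> rhs) until nothing changes.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Assumed example: R(A, B, C) with F = {A -> B, B -> C}
fds = [("A", "B"), ("B", "C")]
print(sorted(closure("A", fds)))   # ['A', 'B', 'C']  => A+ = ABC, so A is a key of R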

Types of Functional Dependency –


Single Valued Functional Dependency – A database is a collection of related information in which one piece of information depends on another. The information is either single-valued or multi-valued. For example, the name of a person or his date of birth are single-valued facts, but the qualifications of a person are a multi-valued fact.
A simple example of single value functional dependency is when A is the primary key of an
entity (like sid) and B is some single valued attribute of the entity (like sname). Then, A → B
must always hold.
Fully Functional Dependency – A functional dependency P → Q is a fully functional dependency if the removal of any attribute A from P means that the dependency no longer holds. In other words, in a relation R an attribute Q is said to be fully functionally dependent on attribute P if it is functionally dependent on P and not functionally dependent on any proper subset of P. The dependency P → Q is then left-reduced, there being no extraneous attributes on the left-hand side of the dependency.
If AD → C is a fully functional dependency, then we cannot remove A or D; i.e. C is fully functionally dependent on AD. If A or D could be removed and the dependency still held, it would not be a fully functional dependency.
Partial Functional Dependency – A functional dependency in which one or more non-key attributes functionally depend on a part of the primary key is called a partial functional dependency; that is, the determinant consists of key attributes, but not the entire primary key, and the determined attributes are non-key attributes.
Transitive Dependency – When a non-prime attribute determines another non-prime attribute, it is called a transitive dependency. Given a relation R (A, B, C), dependencies like A → B, B → C form a transitive dependency, since A → C is implied.
Trivial − If a functional dependency (FD) X → Y holds, where Y is a subset of X, then it is called a trivial FD. Trivial FDs always hold, and they do not yield any new values.
Non-trivial − If an FD X → Y holds, where Y is not a subset of X, then it is called a non-trivial FD; such dependencies do determine new values.
Completely non-trivial − If an FD X → Y holds, where X ∩ Y = Φ, it is said to be a completely non-trivial FD.
If a database design is not perfect, it may contain anomalies, which are like a bad dream for
any database administrator. Managing a database with anomalies is next to impossible.
• Update anomalies − If data items are scattered and are not linked to each other properly, then strange situations can arise. For example, when we try to update one data item whose copies are scattered over several places, a few instances get updated properly while a few others are left with old values. Such instances leave the database in an inconsistent state.
• Deletion anomalies − We try to delete a record, but parts of it are left undeleted because, without our being aware of it, the data is also saved somewhere else.
• Insert anomalies − We try to insert data into a record that does not exist at all.
Redundancy - Data redundancy is a condition created within a database or data storage
technology in which the same piece of data is held in two separate places. This can mean
two different fields within a single database, or two different spots in multiple software
environments or platforms.

Normalization - Normalization is a process of organizing the data in a database to avoid data redundancy and insertion, update and deletion anomalies. Normalization is a systematic approach of decomposing tables to eliminate data redundancy (repetition) and undesirable characteristics. It is a multi-step process that puts data into tabular form, removing duplicated data from the relation tables. It divides larger tables into smaller tables and links them using relationships.
The inventor of the relational model Edgar Codd proposed the theory of normalization with
the introduction of First Normal Form, and he continued to extend theory with Second and
Third Normal Form. Later he joined with Raymond F. Boyce to develop the theory of Boyce-
Codd Normal Form. Theory of Data Normalization in SQL is still being developed further.
However, in most practical applications, normalization achieves its best in 3rd Normal Form.
Examples to find Candidate Keys –
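
A small Python sketch that finds candidate keys by testing attribute closures; the relation R(A, B, C, D) and the FD set below are assumptions chosen for illustration:

from itertools import combinations

def closure(attrs, fds):
    # Same closure idea as above: apply FDs until a fixed point is reached.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(all_attrs, fds):
    # Smallest attribute sets whose closure covers the whole relation.
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            # Skip supersets of keys already found (candidate keys are minimal).
            if any(set(k) <= set(combo) for k in keys):
                continue
            if closure(combo, fds) == set(all_attrs):
                keys.append(combo)
    return keys

# Assumed example: R(A, B, C, D) with F = {A -> B, B -> C, C -> D}
print(candidate_keys("ABCD", [("A", "B"), ("B", "C"), ("C", "D")]))  # [('A',)]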

First Normal Form (1NF) - For a table to be in the First Normal Form, it should follow the following 4 rules -
1. All the attributes (columns) in a relation must have atomic values/domains; a column cannot hold multiple values.
2. Values stored in a column should be of the same domain. For example, if you are storing roll numbers in a particular column, it cannot contain any other information besides roll numbers.
3. All the columns in a table should have unique names.
4. The order in which data is stored does not matter.

We re-arrange the relation (table), splitting multi-valued entries into separate rows, to convert it to First Normal Form.

Second Normal Form (2NF) - Before we learn about the second normal form, we need to
understand the following –
• Prime attribute − An attribute which is a part of a candidate key is known as a prime attribute.
• Non-prime attribute − An attribute which is not a part of any candidate key is said to be a non-prime attribute.
• Partial Dependency - When a non-prime attribute, instead of depending upon the entire candidate key, depends upon a part of the candidate key.
Now, rules for Second Normal Form (2NF) –
1. It should be in the First Normal Form.
2. It should not have any Partial Dependency; in other words, every non-prime attribute should be fully functionally dependent on the whole of every candidate key.

Here, Stu_ID and Proj_ID combined form the candidate key. We see in the Student_Project relation that the prime attributes are Stu_ID and Proj_ID. According to the rule, the non-key attributes, i.e. Stu_Name and Proj_Name, must be dependent upon both together and not on either of the prime attributes individually. But we find that Stu_Name can be identified by Stu_ID alone and Proj_Name can be identified by Proj_ID alone. This is called partial dependency, which is not allowed in Second Normal Form.


We break the relation into two, one relating Stu_ID to Stu_Name and the other relating Proj_ID to Proj_Name, so that no partial dependency remains.
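
As a hedged sketch of what the decomposed schema could look like in SQL, one common layout keeps a third linking table for the original (Stu_ID, Proj_ID) key pairs; all names here are assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
# Decomposed 2NF schema: each non-prime attribute now depends on a whole key.
conn.execute("CREATE TABLE student (stu_id INTEGER PRIMARY KEY, stu_name TEXT)")
conn.execute("CREATE TABLE project (proj_id INTEGER PRIMARY KEY, proj_name TEXT)")
conn.execute("""CREATE TABLE student_project (
    stu_id  INTEGER REFERENCES student(stu_id),
    proj_id INTEGER REFERENCES project(proj_id),
    PRIMARY KEY (stu_id, proj_id))""")
conn.close()
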
Third Normal Form (3NF) - Before we learn about the third normal form, we need to
understand the following –
• Transitive Dependency – When a non-prime attribute determines another non-prime attribute, it is called a transitive dependency. Given a relation R (A, B, C), dependencies like A → B, B → C form a transitive dependency, since A → C is implied.
Now, rules for Third Normal Form (3NF) –
1. It should be in the Second Normal form.
2. There should be no transitive dependency.

We find that in the above Student_detail relation, Stu_ID is the candidate key and only
prime key attribute. We find that City can be identified by Stu_ID as well as Zip itself.
Neither Zip is a super key nor is City a prime attribute. Additionally, Stu_ID → Zip → City, so
there exists transitive dependency. To bring this relation into third normal form, we break
the relation into two relations as follows –

If the non-prime attribute (Zip) that determines another non-prime attribute (City) were ever to become null or be updated inconsistently, it would make the database inconsistent. That is why we use Third Normal Form.
BCNF - Boyce-Codd Normal Form (BCNF) was developed by Raymond Boyce and E.F. Codd, the latter widely considered the father of relational database design. BCNF is really an extension of Third Normal Form (3NF); for this reason, it is frequently termed 3.5NF. This form deals with a certain type of anomaly that is not handled by 3NF. Rules for BCNF –
• A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF.
• Every prime attribute (partial key) may depend only on a super key.
In simple words, for every dependency from Alpha to Beta, Alpha must be a super key. It is the job of a candidate key to determine the other attributes; the situation where a non-prime attribute determines a candidate key, or part of one, is wrong, and 3NF alone does not rule it out. That is why we use BCNF.

Fourth Normal Form (4NF) – These are the rules for 4NF -
• The table must be in BCNF.
• A table is in the 4NF if it is in BCNF and has no multivalued dependencies.

To understand it clearly, consider a table with Subject, Lecturer who teaches each subject
and recommended Books for each subject.

If we observe the data in the table above, it satisfies 3NF. But LECTURER and BOOKS are two independent entities here; there is no relationship between Lecturer and Books. In the above example, either Alex or Bosco can teach Mathematics, and for the Mathematics subject a student can refer to either 'Maths Book1' or 'Maths Book2'. That is -
• SUBJECT --> LECTURER
• SUBJECT --> BOOKS
This is a multivalued dependency on SUBJECT. If we need to select both the lecturer and the books recommended for any subject, it will show up (lecturer, book) combinations, which imply that a particular lecturer recommends a particular book. This is not correct. To eliminate this dependency, we divide the table into two as below –

Now if we want to know the lecturer names and the books recommended for any subject, we fire two independent queries. Hence it removes the multi-valued dependency and the confusion around the data. Thus, the table is in 4NF.
Fifth Normal Form (5NF) – These are the rules for 5NF -
• It is in 4NF.
• If we can decompose the table further to eliminate redundancy and anomalies, then when we re-join the decomposed tables by means of candidate keys, we should not be losing the original data, nor should any new record set arise. In simple words, joining two or more decomposed tables should neither lose records nor create new records.
Consider an example of different Subjects taught by different lecturers and the lecturers
taking classes for different semesters.
Note - Please consider that Semester 1 has Mathematics, Physics and Chemistry and
Semester 2 has only Physics in its academic year.

In the above table, Rose takes both Mathematics and Physics classes for Semester 1, but she does not take the Physics class for Semester 2. In this case, a combination of all these 3 fields is required to identify valid data. Imagine we want to add a new class – Semester 3 – but do not yet know the Subject and who will be taking it. We would be simply inserting a new entry with Class as Semester 3 and leaving Lecturer and Subject as NULL. As discussed above, it is not good to have such entries. Moreover, since all three columns together act as the primary key, we cannot leave the other two columns blank!
Hence, we have to decompose the table in such a way that it satisfies all the rules up to 4NF and, when we join the parts using keys, it yields the correct records. Here, we can represent each lecturer's subject area and their classes in a better way. We can divide the above table into three - (SUBJECT, LECTURER), (LECTURER, CLASS), (SUBJECT, CLASS).

Now, each of the combinations is in three different tables. If we need to identify who is teaching which subject to which semester, we need to join the keys of each table and get the result.
For example, to find who teaches Physics to Semester 1, we would select Physics and Semester 1 from table 3 above, join with table 1 using Subject to filter out the lecturer names, and then join with table 2 using Lecturer to get the correct lecturer name. That is, we joined the key columns of each table to get the correct data. Hence there is no lost or spurious data - satisfying the 5NF condition.
DKNF or Sixth Normal Form (6NF) - DKNF stands for Domain Key Normal Form; it requires that the database contain no constraints other than domain constraints and key constraints. A relation is said to be in domain-key normal form if all possible types of dependencies that
should hold on the relation can be enforced simply by enforcing the domain constraints and
key constraints on the relation. For a relation in DKNF, it becomes easy to enforce all
database constraints by simply checking that each attribute value in a tuple is of the
appropriate domain and that every key constraint is enforced. Even though the DKNF is
intended to be the ultimate normal form, because of the difficulty in including complex
constraints in a DKNF relation and the difficulty in specifying general integrity constraints,
its practical utility is limited. A schema is in DKNF if and only if it has no insertion or deletion
anomalies.
Process - A process is an instance of a program running in a computer. It is close in meaning
to task, a term used in some operating systems. In UNIX and some other operating systems,
a process is started when a program is initiated (either by a user entering a shell command
or by another program). A process is just a sequence of instructions; some of them execute and some may not.
Transaction - A transaction can be defined as a group of tasks. A single task is the minimum
processing unit which cannot be divided further. A transaction is a set of logically related
operations. A transaction is executed as a single unit. If the database was in a consistent state before a transaction, then after execution of the transaction the database must also be in a consistent state. For example, a transfer of money from one bank account to another requires two changes to the database; both must succeed or fail together. A transaction is defined as a set of changes that must be made together.
Transactions are atomic in nature, either all the instructions execute or they do not execute
at all. A transaction is a set of instructions which performs a logical work.
A transaction is an action, or series of actions that are being performed by a single user or
application program, which reads or updates the contents of the database. This may be an
entire program, a piece of a program or a single command (like the SQL commands such as
INSERT or UPDATE) and it may engage in any number of operations on the database. In the
database context, the execution of an application program can be thought of as one or
more transactions with non-database processing taking place in between.
Let’s take an example of a simple transaction. Suppose a bank employee transfers Rs 500
from A's account to B's account. This very simple and small transaction involves several low-
level tasks.
A’s Account
Open_Account(A)
Old_Balance = A.balance
New_Balance = Old_Balance - 500
A.balance = New_Balance
Close_Account(A)

B’s Account
Open_Account(B)
Old_Balance = B.balance
New_Balance = Old_Balance + 500
B.balance = New_Balance
Close_Account(B)
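
A minimal runnable version of this transfer, using Python's sqlite3 module; the account names and amounts follow the example, while the table layout is an assumption:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 1000), ("B", 200)])
conn.commit()

try:
    # The two updates form one transaction: both succeed or neither does.
    conn.execute("UPDATE account SET balance = balance - 500 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 500 WHERE name = 'B'")
    conn.commit()        # make both changes permanent together
except Exception:
    conn.rollback()      # undo both changes if anything failed in between

print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())
# [('A', 500), ('B', 700)]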

Process of Transaction - The transaction is executed as a series of reads and writes of database objects, which are explained below –
Read Operation - To read a database object, it is first brought into main memory from disk, and then its value is copied into a program variable.

Write Operation - To write a database object, an in-memory copy of the object is first
modified and then written to disk.

Transaction failure in between the operations - Now that we understand what a transaction is, we should understand the problems associated with it.
The main problem that can happen during a transaction is that it can fail before finishing all the operations in the set. This can happen due to power failure, system crash, etc. This is a serious problem that can leave the database in an inconsistent state. Assume that the transaction fails after the third operation; then the amount would be deducted from your account but your friend would not receive it. To solve this problem, we have the following two operations -
Commit - If all the operations in a transaction are completed successfully then commit
those changes to the database permanently.
Rollback - If any of the operations fails, then roll back all the changes done by the previous operations.

Even though these operations can help us avoid several issues that may arise during a transaction, they are not sufficient when two transactions are running concurrently. To handle those problems, we need to understand the database ACID properties.
ACID Properties - A transaction is a very small unit of a program and it may contain several low-level tasks. A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability, commonly known as the ACID properties, in order to ensure accuracy, completeness, and data integrity.
The ACID properties, in totality, provide a mechanism to ensure correctness and
consistency of a database in a way such that each transaction is a group of operations that
acts a single unit, produces consistent results, acts in isolation from other operations and
updates that it makes are durably stored.
The ACID concept is described in ISO/IEC 10026-1:1992 Section 4. Each of these attributes
can be measured against a benchmark. In general, however, a transaction manager or
monitor is designed to realize the ACID concept. In a distributed system, one way to achieve
ACID is to use a two-phase commit (2PC), which ensures that all involved sites must commit
to transaction completion or none do, and the transaction is rolled back (see rollback).
Atomicity − This property states that a transaction must be treated as an atomic unit, that
is, either all of its operations are executed or none. There must be no state in a database
where a transaction is left partially completed. States should be defined either before the
execution of the transaction or after the execution/abortion/failure of the transaction.
By this, we mean that either the entire transaction takes place at once or it doesn't happen at all; there is no midway. It involves the following two operations. The Transaction Management Component takes care of atomicity.
• Abort - If a transaction aborts, changes made to database are not visible. Atomicity
is also known as the ‘All or nothing rule’.
• Commit - If a transaction commits, changes made are visible.
Consider the following transaction T consisting of T1 and T2: a transfer of 100 from account X to account Y.

T1: read(X); X = X - 100; write(X)
T2: read(Y); Y = Y + 100; write(Y)
If the transaction fails after completion of T1 but before completion of T2. Say, after
write(X) but before write(Y), then amount has been deducted from X but not added to Y.
This results in an inconsistent database state. Therefore, the transaction must be executed
in entirety in order to ensure correctness of database state.
Consistency − The database must remain in a consistent state after any transaction. No
transaction should have any adverse effect on the data residing in the database. If the
database was in a consistent state before the execution of a transaction, it must remain
consistent after the execution of the transaction as well.
This means that integrity constraints must be maintained so that the database is consistent
before and after the transaction. It refers to correctness of a database. Inconsistency occurs
in case T1 completes but T2 fails. As a result, T is incomplete. No special component takes care of consistency.
Isolation − In a database system where more than one transaction is being executed simultaneously and in parallel, the property of isolation states that all the transactions will be carried out and executed as if each were the only transaction in the system. No transaction will affect the existence of any other transaction.
This property ensures that multiple transactions can occur concurrently without leading to inconsistency of the database state. Transactions occur independently, without interference. Changes occurring in a particular transaction will not be visible to any other transaction until the change has been written to memory or committed. This property ensures that concurrent execution of transactions results in a state equivalent to one that would be achieved if they were executed serially in some order. The Concurrency Control Component takes care of isolation.
Durability − The database should be durable enough to hold all its latest updates even if the
system fails or restarts. If a transaction updates a chunk of data in a database and commits,
then the database will hold the modified data. If a transaction commits but the system fails
before the data could be written on to the disk, then that data will be updated once the
system springs back into action.
This property ensures that once the transaction has completed execution, its updates and modifications to the database are stored in and written to disk, and they persist even if a system failure occurs. These updates become permanent and are stored in non-volatile memory; the effects of the transaction are thus never lost. The Recovery Management Component takes care of durability.
States of Transactions - A transaction in a database can be in one of the following states –

• Active − In this state, the transaction is being executed. This is the initial state of
every transaction. For example, updating or inserting or deleting a record is done
here, but it is still not saved to the database.
• Partially Committed − When a transaction executes its final operation, it is said to
be in a partially committed state. This is also an execution phase where last step in
the transaction is executed, but data is still not saved to the database.
• Committed − If a transaction executes all its operations successfully, it is said to be committed. All its effects are now permanently established on the database system. In this state, all the changes are permanently saved to the database. This is the last step of a transaction, if it executes without failure.
• Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.
If a transaction cannot proceed to the execution state because of the failure of the
system or database, then the transaction is said to be in failed state.
• Aborted − If any of the checks fails and the transaction has reached a failed state,
then the recovery manager rolls back all its write operations on the database to
bring the database back to its original state where it was prior to the execution of
the transaction. Transactions in this state are called aborted. The database recovery
module can select one of the two operations after a transaction aborts −
o Re-start the transaction
o Kill the transaction
If the transaction fails in the middle, all the operations executed so far are rolled back, returning the database to the consistent state it was in before the transaction began.
Concurrency - Concurrency can be defined as the ability for multiple processes to access or
change shared data at the same time. The greater the number of concurrent user processes
that can execute without blocking each other, the greater the concurrency of the database
system. Concurrency is the ability of a database to allow multiple users to affect multiple
transactions. This is one of the main properties that separates a database from other forms
of data storage like spreadsheets or file processing systems.
The problems caused by concurrency are even more important than the ability to support
concurrent transactions. For example, when one user is changing data but has not yet saved
(committed) that data, then the database should not allow other users who query the same
data to view the changed, unsaved data. Instead the user should only view the original data.
Almost all databases deal with concurrency the same way, although the terminology may
differ. The general principle is that changed but unsaved data is held in some sort of
temporary log or file. Once it is saved, it is then written to the database’s physical storage in
place of the original data. As long as the user performing the change has not saved the data,
only he should be able to view the data he is changing. All other users querying for the
same data should view the data that existed prior to the change. Once the user saves the
data, new queries should reveal the new value of the data.
Advantages of Concurrency –
1. Decreased waiting time.
2. Decreased response time.
3. Increased resource utilization.
4. Increased efficiency.
Concurrency Control - Concurrency control is the process of managing simultaneous
execution of transactions (such as queries, updates, inserts, deletes and so on) in a
multiprocessing database system without having them interfere with one another. This
property of DBMS allows many transactions to access the same database at the same time
without interfering with each other. The primary goal of concurrency is to ensure the
atomicity of the execution of transactions in a multi-user database environment.
Concurrency control mechanisms attempt to interleave (in parallel) the READ and WRITE operations of multiple transactions so that the interleaved execution yields results identical to those of some serial schedule execution.

Problems of Concurrency Control – When concurrent transactions are executed in an uncontrolled manner, several problems can occur. Concurrency control has the following three main problems –
• Dirty read (or uncommitted data).
• Lost updates.
• Unrepeatable read (or inconsistent retrievals).
Dirty Read Problem – If a running (uncommitted) transaction reads a value from the buffer that was written by another uncommitted transaction, that read is called a dirty read. A dirty read is vulnerable to inconsistency: if the writing transaction (say A) later suffers a problem and rolls back, there is no way left to roll back or recover the reading transaction (B).
A dirty read problem occurs when one transaction updates a database item and then the
transaction fails for some reason. The updated database item is accessed by another
transaction before it is changed back to the original value. In other words, a transaction T1
updates a record, which is read by the transaction T2. Then T1 aborts and T2 now has
values which have never formed part of the stable database.

Transaction - A     Time    Transaction - B
----                t0      ----
----                t1      Update X
Read X              t2      ----
----                t3      Rollback
----                t4      ----

• At time t1, Transaction-B writes the value of X.
• At time t2, Transaction-A reads the value of X.
• At time t3, Transaction-B rolls back, changing the value of X back to what it was prior to t1.
So, Transaction-A now has a value which has never become part of the stable database. Such a problem is referred to as the Dirty Read Problem, as one transaction reads a dirty value which has not been committed.
Unrepeatable Read - Unrepeatable read (or inconsistent retrievals) occurs when a
transaction calculates some summary (aggregate) function over a set of data while other
transactions are updating the data. The problem is that the transaction might read some
data before they are changed and other data after they are changed, thereby yielding
inconsistent results.

Transaction - A     Transaction - B
10 – R (X)          ----
----                R (X) - 10
15 – W (X)          ----
----                R (X) - 15

Lost Update / Write-Write Conflict - A lost update problem occurs when two transactions
that access the same database items have their operations in a way that makes the value of
some database item incorrect. In other words, if transactions T1 and T2 both read a record
and then update it, the effects of the first update will be overwritten by the second update.

Transaction - A     Time    Transaction - B
----                t0      ----
Read X              t1      ----
----                t2      Read X
Update X            t3      ----
----                t4      Update X
----                t5      ----

Phantom Read - A phantom read occurs when, in the course of a transaction, two identical
queries are executed, and the collection of rows returned by the second query is different
from the first.

Transaction - A     Transaction - B
10 – R (X)          ----
----                R (X) - 10
Delete (X)          ----
----                R (X)

Blind Write - A blind write occurs when a transaction writes a value without reading it.
Schedule - A schedule is a process of grouping the transactions into one and executing them
in a predefined order. A schedule is required in a database because when some transactions
execute in parallel, they may affect the result of the transaction, means if one transaction is
updating the values which the other transaction is accessing, then the order of these two
transactions will change the result of second transaction. Hence a schedule is created to
execute the transactions. Types of schedules –
Serial Schedule - When one transaction completely executes before another transaction starts, the schedule is called a serial schedule. A serial schedule is always consistent. E.g., if a schedule S has a debit transaction T1 and a credit transaction T2, the possible serial schedules are T1 followed by T2 (T1->T2) or T2 followed by T1 (T2->T1). A serial schedule has low throughput and less resource utilization.

Non-serial / Concurrent Schedule - When operations of a transaction are interleaved with operations of other transactions in a schedule, the schedule is called a concurrent schedule. But concurrency can lead to inconsistency in the database.
Serializability - Serializability is the classical concurrency scheme. It ensures that a schedule
for executing concurrent transactions is equivalent to one that executes the transactions
serially in some order. It assumes that all accesses to the database are done using read and
write operations.
Equivalence Schedules - An equivalence schedule can be of the following types −
• Result Equivalence - If two schedules produce the same result after execution, they
are said to be result equivalent. They may yield the same result for some value and
different results for another set of values. That's why this equivalence is not
generally considered significant.
• View Equivalence - Two schedules are view equivalent if the transactions in both schedules perform similar actions in a similar manner. For example −
o If T reads the initial data in S1, then it also reads the initial data in S2.
o If T reads the value written by J in S1, then it also reads the value written by J in S2.
o If T performs the final write on a data value in S1, then it also performs the final write on that data value in S2.
• Conflict Equivalence - Two operations are said to be conflicting if they have the following properties −
o They belong to two different transactions.
o They access the same data item.
o At least one of them is a "write" operation.
Two schedules having multiple transactions with conflicting operations are said to
be conflict equivalent if and only if −
o Both the schedules contain the same set of Transactions.
o The order of conflicting pairs of operation is maintained in both the
schedules.
Note − View equivalent schedules are view serializable and conflict equivalent
schedules are conflict serializable. All conflict serializable schedules are view
serializable too.
Conflict Serializable - A schedule is called conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations. Suppose T1 and T2 are two transactions and I1 and I2 are instructions in T1 and T2 respectively. Then the two instructions are conflicting if both access the same data item d and at least one of them is a write operation.
In the case below, we can see that T1 has a set of instructions which modify X and Y, whereas T2 has instructions that modify X and Z. These two transactions contain conflicting operations, since instructions of both transactions modify the value of X (write). But the instructions that modify Y and Z are not conflicting, as they access different data items.

A pair of instructions is non-conflicting if interchanging them does not change the result. In the example below, WRITE (X) of T1 and READ (X) of T2 are conflicting, but the schedule can still be transformed into a serial one by a series of swaps of the non-conflicting instructions.

Example –
• (R1(A), W2(A)) is a conflicting pair of operations, because the two operations belong to different transactions, act on the same data item A, and one of them is a write operation.
• Similarly, (W1(A), W2(A)) and (W1(A), R2(A)) are also conflicting pairs.
• On the other hand, (R1(A), W2(B)) is a non-conflicting pair, because the operations work on different data items.
• Similarly, (W1(A), W2(B)) is a non-conflicting pair.
Question - Consider the following schedules involving two transactions. Which one of the following statements is true?
S1: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X)
S2: R1(X) R2(X) R2(Y) W2(Y) R1(Y) W1(X)
1. Both S1 and S2 are conflict serializable
2. Only S1 is conflict serializable
3. Only S2 is conflict serializable
4. None
Solution - The two transactions of the given schedules are:

T1: R1(X) R1(Y) W1(X)
T2: R2(X) R2(Y) W2(Y)

Let us first check the serializability of S1:

S1: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X)

To convert it to a serial schedule, we have to swap non-conflicting operations so that S1 becomes equivalent to the serial schedule T1->T2 or T2->T1. In this case, we would have to swap R2(X) and W1(X), but they are conflicting. So S1 cannot be converted to a serial schedule.
Now, let us check the serializability of S2:

S2: R1(X) R2(X) R2(Y) W2(Y) R1(Y) W1(X)

Swapping the non-conflicting operations R1(X) and R2(X) of S2, we get

S2’: R2(X) R1(X) R2(Y) W2(Y) R1(Y) W1(X)

Again, swapping the non-conflicting operations R1(X) and R2(Y) of S2’, we get

S2’’: R2(X) R2(Y) R1(X) W2(Y) R1(Y) W1(X)

Again, swapping the non-conflicting operations R1(X) and W2(Y) of S2’’, we get

S2’’’: R2(X) R2(Y) W2(Y) R1(X) R1(Y) W1(X)

which is equivalent to the serial schedule T2->T1.

So, the correct option is C: only S2 is conflict serializable.
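
This swap-based check can be automated by building a precedence graph (an edge Ti → Tj for each conflicting pair in which Ti acts first) and testing it for a cycle: the schedule is conflict serializable iff the graph is acyclic. A small Python sketch; the (transaction, op, item) encoding of a schedule is invented here for illustration:

def conflict_serializable(schedule):
    # schedule: list of (txn, op, item) tuples, e.g. (1, 'R', 'X').
    # Edge Ti -> Tj when an operation of Ti conflicts with a later one of Tj.
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))

    # Cycle detection by depth-first search over the precedence graph.
    nodes = {t for t, _, _ in schedule}
    def has_cycle(node, stack, seen):
        seen.add(node); stack.add(node)
        for a, b in edges:
            if a == node:
                if b in stack or (b not in seen and has_cycle(b, stack, seen)):
                    return True
        stack.discard(node)
        return False
    seen = set()
    return not any(has_cycle(n, set(), seen) for n in nodes if n not in seen)

S1 = [(1,'R','X'),(1,'R','Y'),(2,'R','X'),(2,'R','Y'),(2,'W','Y'),(1,'W','X')]
S2 = [(1,'R','X'),(2,'R','X'),(2,'R','Y'),(2,'W','Y'),(1,'R','Y'),(1,'W','X')]
print(conflict_serializable(S1), conflict_serializable(S2))  # False True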

Concurrency Control - In a multiprogramming environment where multiple transactions can be executed simultaneously, it is highly important to control the concurrency of transactions. We have concurrency control protocols to ensure the atomicity, isolation, and serializability of concurrent transactions. Concurrency control deals with the interleaved execution of more than one transaction. Concurrency control protocols can be broadly divided into two categories –

o Lock based protocols
o Time-stamp based protocols

Locks - A lock is a kind of mechanism that ensures that the integrity of data is maintained. Database systems equipped with lock-based protocols use a mechanism by which any transaction cannot read or write data until it acquires an appropriate lock on it. Locks are of two kinds –

Binary Locks − A lock on a data item can be in two states, it is either locked or unlocked.

Shared/exclusive − This type of locking mechanism differentiates the locks based on their uses. If a
lock is acquired on a data item to perform a write operation, it is an exclusive lock. Allowing more
than one transaction to write on the same data item would lead the database into an inconsistent
state. Read locks are shared because no data value is being changed.

• Shared Lock - A shared lock is placed when we are reading the data. Multiple shared locks can be placed on the same data item, but while a shared lock is held, no exclusive lock can be placed. For example, when two transactions are reading Steve's account balance, let them read by placing shared locks, but if at the same time another transaction wants to update Steve's account balance by placing an exclusive lock, do not allow it until the reading is finished.

• Exclusive Lock - An exclusive lock is placed when we want to both read and write the data. This lock allows both read and write operations; once it is placed on a data item, no other lock (shared or exclusive) can be placed until the exclusive lock is released. For example, when a transaction wants to update Steve's account balance, let it do so by placing an X lock on it, but if a second transaction wants to read the data (S lock), do not allow it, and if another transaction wants to write the data (X lock), do not allow that either.

Lock Compatibility Matrix –

        S        X
S       True     False
X       False    False

How to read this matrix - There are two rows. The first row says that when an S lock is placed, another S lock can be acquired, so it is marked True, but no exclusive lock can be acquired, so that is marked False. The second row says that when an X lock is held, neither an S nor an X lock can be acquired, so both are marked False.
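
A toy Python sketch of a shared/exclusive lock that enforces this compatibility matrix; real DBMS lock managers also handle queuing, lock upgrades and deadlocks, which this sketch ignores:

import threading

class SXLock:
    # Toy shared/exclusive lock following the compatibility matrix above.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0          # number of S locks currently held
        self._writer = False       # whether an X lock is currently held

    def acquire_s(self):
        with self._cond:
            while self._writer:                    # S is incompatible with X
                self._cond.wait()
            self._readers += 1

    def release_s(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_x(self):
        with self._cond:
            while self._writer or self._readers:   # X is incompatible with S and X
                self._cond.wait()
            self._writer = True

    def release_x(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()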

Types of Lock Protocols -

• Simplistic Lock Protocol − Simplistic lock-based protocols allow transactions to obtain a lock
on every object before a 'write' operation is performed. Transactions may unlock the data
item after completing the ‘write’ operation.
• Pre-claiming Lock Protocol - Pre-claiming protocols evaluate their operations and create a
list of data items on which they need locks. Before initiating an execution, the transaction
requests the system for all the locks it needs beforehand. If all the locks are granted, the
transaction executes and releases all the locks when all its operations are over. If all the
locks are not granted, the transaction rolls back and waits until all the locks are granted.

• Two-Phase Locking 2PL - This locking protocol divides the execution phase of a transaction
into three parts. In the first part, when the transaction starts executing, it seeks permission
for the locks it requires. The second part is where the transaction acquires all the locks. As
soon as the transaction releases its first lock, the third phase starts. In this phase, the
transaction cannot demand any new locks, it only releases the acquired locks.

Two-phase locking has two phases, one is growing, where all the locks are being acquired by
the transaction, and the second phase is shrinking, where the locks held by the transaction
are being released. To claim an exclusive (write) lock, a transaction must first acquire a
shared (read) lock and then upgrade it to an exclusive lock.
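
A minimal sketch of a transaction obeying 2PL, using plain Python threading.Lock objects as stand-ins for exclusive database locks (the two accounts are assumptions):

import threading

lock_a, lock_b = threading.Lock(), threading.Lock()   # one exclusive lock per account

def transfer():
    # Growing phase: acquire every lock the transaction needs, in a fixed order.
    lock_a.acquire()
    lock_b.acquire()
    try:
        pass  # ... read and write both accounts here ...
    finally:
        # Shrinking phase: from the first release onward, no new lock may be requested.
        lock_b.release()
        lock_a.release()

transfer()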

• Strict Two-Phase Locking - The first phase of Strict-2PL is same as 2PL. After acquiring all the
locks in the first phase, the transaction continues to execute normally. But in contrast to 2PL,
Strict-2PL does not release a lock after using it. Strict-2PL holds all the locks until the commit
point and releases all the locks at a time. Strict-2PL does not have cascading abort as 2PL
does.

Timestamp-based Protocols - The most commonly used concurrency protocol is the timestamp-
based protocol. Timestamp is used to associate time to some transactions or some events. This
protocol uses either system time or logical counter as a timestamp. Lock-based protocols manage
the order between the conflicting pairs among transactions at the time of execution, whereas
timestamp-based protocols start working as soon as a transaction is created.

Every transaction has a timestamp associated with it, and the ordering is determined by the age of the transaction. A transaction created at clock time 0002 is older than all transactions that come after it; for example, a transaction 'y' entering the system at 0004 is two seconds younger, and priority is given to the older one. In addition, every data item is given the latest read and write timestamps, which let the system know when the last 'read' and 'write' operations were performed on the data item.

Timestamp Ordering Protocol - The timestamp-ordering protocol ensures serializability among transactions in their conflicting read and write operations. It is the responsibility of the protocol system that conflicting pairs of tasks be executed according to the timestamp values of the transactions. The basic idea is that we order the transactions on the basis of their time of arrival.

• The timestamp of transaction Ti is denoted as TS(Ti).
• The read timestamp of data item X is denoted by R-timestamp(X).
• The write timestamp of data item X is denoted by W-timestamp(X).
Timestamp ordering protocol works as follows −

• If a transaction Ti issues a read(X) operation −
o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
o If TS(Ti) >= W-timestamp(X), the operation is executed and R-timestamp(X) is updated to max(R-timestamp(X), TS(Ti)).
• If a transaction Ti issues a write(X) operation −
o If TS(Ti) < R-timestamp(X), the operation is rejected and Ti is rolled back.
o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
o Otherwise, the operation is executed and W-timestamp(X) is updated to TS(Ti).

For every data item Q that is shared, we keep two timestamps –
• W-Timestamp(Q) - It denotes the largest timestamp of any transaction that successfully executed Write(Q).
• R-Timestamp(Q) - It denotes the largest timestamp of any transaction that successfully executed Read(Q).
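
These rules translate almost directly into code. A toy Python sketch (the class and its encoding are invented for illustration; a real system would also have to actually roll transactions back):

class TimestampScheduler:
    # Toy basic timestamp-ordering check; one (R, W) timestamp pair per item.
    def __init__(self):
        self.r_ts = {}   # item -> largest TS that read it
        self.w_ts = {}   # item -> largest TS that wrote it

    def read(self, ts, item):
        if ts < self.w_ts.get(item, 0):
            return "reject: roll back"   # Ti would read an already-overwritten value
        self.r_ts[item] = max(self.r_ts.get(item, 0), ts)
        return "execute"

    def write(self, ts, item):
        if ts < self.r_ts.get(item, 0) or ts < self.w_ts.get(item, 0):
            return "reject: roll back"   # a younger transaction already read/wrote the item
        self.w_ts[item] = ts
        return "execute"

s = TimestampScheduler()
print(s.read(5, "X"))    # execute, R-timestamp(X) becomes 5
print(s.write(3, "X"))   # reject: roll back (TS 3 < R-timestamp 5)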

Thomas' Write Rule - Under basic timestamp ordering, if TS(Ti) < W-timestamp(X) then the write is rejected and Ti is rolled back. Thomas' write rule modifies this: instead of rolling Ti back, the obsolete 'write' operation itself is simply ignored. The timestamp-ordering rules are thus modified to make the schedule view serializable.

Deadlock - In a multi-process system, deadlock is an unwanted situation that arises in a shared resource environment, where a process indefinitely waits for a resource that is held by another process. For example, assume a set of transactions {T0, T1, T2, ..., Tn}. T0 needs a resource X to
complete its task. Resource X is held by T1, and T1 is waiting for a resource Y, which is held by T2. T2
is waiting for resource Z, which is held by T0. Thus, all the processes wait for each other to release
resources.

In this situation, none of the processes can finish their task. This situation is known as a deadlock.
Deadlocks are not healthy for a system. In case a system is stuck in a deadlock, the transactions
involved in the deadlock are either rolled back or restarted. Deadlock is a situation where a set of
processes are blocked because each process is holding a resource and waiting for another resource
acquired by some other process.

Deadlock can arise if following four conditions hold simultaneously (Necessary Conditions) -

• Mutual Exclusion - One or more resources are non-sharable (only one process can use a resource at a time).
• Hold and Wait - A process is holding at least one resource and waiting for additional resources.
• No Pre-emption - A resource cannot be taken from a process unless the process releases the resource.
• Circular Wait - A set of processes are waiting for each other in circular form.

Deadlock Prevention - To prevent any deadlock situation in the system, the DBMS aggressively
inspects all the operations, where transactions are about to execute. The DBMS inspects the
operations and analyses if they can create a deadlock situation. If it finds that a deadlock situation
might occur, then that transaction is never allowed to be executed. There are deadlock prevention schemes that use the timestamp ordering mechanism of transactions in order to predetermine a deadlock situation.

Wait-Die Scheme - In this scheme, if a transaction requests to lock a resource (data item), which is
already held with a conflicting lock by another transaction, then one of the two possibilities may
occur −

• If TS(Ti) < TS(Tj) − that is, Ti, which is requesting a conflicting lock, is older than Tj − then Ti is allowed to wait until the data item is available.
• If TS(Ti) > TS(Tj) − that is, Ti is younger than Tj − then Ti dies. Ti is restarted later with a random delay but with the same timestamp.

This scheme allows the older transaction to wait but kills the younger one.

Wound-Wait Scheme - In this scheme, if a transaction requests to lock a resource (data item), which
is already held with conflicting lock by some another transaction, one of the two possibilities may
occur −

• If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back, that is Ti wounds Tj. Tj is restarted later
with a random delay but with the same timestamp.
• If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.

This scheme allows the younger transaction to wait; but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item. In both cases, the transaction that entered the system later is the one that gets aborted.
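
The two schemes differ only in which side backs off. A tiny sketch (the function names and the smaller-timestamp-means-older convention are assumptions):

def wait_die(ts_requester, ts_holder):
    # Older requester waits; younger requester dies (is restarted later).
    return "wait" if ts_requester < ts_holder else "die"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (aborts) the younger holder; younger requester waits.
    return "wound holder" if ts_requester < ts_holder else "wait"

# Smaller timestamp = older transaction.
print(wait_die(1, 2))     # wait          (older asks younger)
print(wait_die(2, 1))     # die           (younger asks older)
print(wound_wait(1, 2))   # wound holder  (older asks younger)
print(wound_wait(2, 1))   # wait          (younger asks older)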

Deadlock Avoidance - Aborting a transaction is not always a practical approach. Instead, deadlock avoidance mechanisms can be used to detect any deadlock situation in advance. Methods like the "wait-for graph" are available, but they are suitable only for systems where transactions are lightweight and hold few instances of resources. In a bulky system, deadlock prevention techniques may work well.

Wait-for Graph - This is a simple method available to track whether any deadlock situation may arise. For each transaction entering the system, a node is created. When a transaction Ti requests a lock on an item, say X, which is held by some other transaction Tj, a directed edge is created from Ti to Tj. If Tj releases item X, the edge between them is dropped and Ti locks the data item. The system maintains this wait-for graph for every transaction waiting for some data items held by others, and keeps checking if there is any cycle in the graph.
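
Cycle detection on the wait-for graph is a standard depth-first search. A minimal sketch (the dict encoding of the graph is invented for illustration):

def deadlocked(wait_for):
    # wait_for: dict mapping a transaction to the set of transactions it waits on.
    # A deadlock exists iff the wait-for graph contains a cycle.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wait_for}

    def dfs(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:        # back edge: cycle found
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wait_for)

# T0 waits for T1, T1 waits for T2, T2 waits for T0 -> deadlock
print(deadlocked({"T0": {"T1"}, "T1": {"T2"}, "T2": {"T0"}}))  # True
print(deadlocked({"T0": {"T1"}, "T1": set(), "T2": set()}))    # False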

Here, we can use any of the two following approaches −

• First, do not allow any request for an item, which is already locked by another transaction.
This is not always feasible and may cause starvation, where a transaction indefinitely waits
for a data item and can never acquire it.
• The second option is to roll back one of the transactions. It is not always feasible to roll back the younger transaction, as it may be more important than the older one. With the help of some relative algorithm, a transaction is chosen to be aborted. This transaction is known as the victim, and the process is known as victim selection.

Deadlock Recovery - Traditional operating systems such as Windows do not deal with deadlock recovery, as it is a time- and space-consuming process. Real-time operating systems use deadlock recovery, by means of –

• Killing the process.
• Resource pre-emption.

Storage Systems - Databases are stored in file formats, which contain records. At physical level, the
actual data is stored in electromagnetic format on some device. These storage devices can be
broadly categorized into three types −

• Primary Storage − The memory storage that is directly accessible to the CPU comes under
this category. CPU's internal memory (registers), fast memory (cache), and main memory
(RAM) are directly accessible to the CPU, as they are all placed on the motherboard or CPU
chipset. This storage is typically very small, ultra-fast, and volatile. Primary storage requires
continuous power supply in order to maintain its state. In case of a power failure, all its data
is lost.
• Secondary Storage − Secondary storage devices are used to store data for future use or as
backup. Secondary storage includes memory devices that are not a part of the CPU chipset
or motherboard, for example, magnetic disks, optical disks (DVD, CD, etc.), hard disks, flash
drives, and magnetic tapes.
• Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since such storage
devices are external to the computer system, they are the slowest in speed. These storage
devices are mostly used to take the back up of an entire system. Optical disks and magnetic
tapes are widely used as tertiary storage.

Memory Hierarchy - A computer system has a well-defined hierarchy of memory. A CPU has direct access to its main memory as well as its inbuilt registers. The speed of the main memory is obviously less than the CPU speed. To minimize this speed mismatch, cache memory is introduced.
Cache memory provides the fastest access time, and it contains the data that is most frequently accessed by the CPU. The memory with the fastest access is the costliest one. Larger storage devices offer slower speeds and are less expensive; however, they can store huge volumes of data as compared to CPU registers or cache memory.

Magnetic Disks - Hard disk drives are the most common secondary storage devices in present
computer systems. These are called magnetic disks because they use the concept of magnetization
to store information. Hard disks consist of metal disks coated with magnetizable material. These
disks are placed vertically on a spindle. A read/write head moves in between the disks and is used to
magnetize or de-magnetize the spot under it. A magnetized spot can be recognized as 0 (zero) or 1
(one). Hard disks are formatted in a well-defined order to store data efficiently. A hard disk plate has
many concentric circles on it, called tracks. Every track is further divided into sectors. A sector on a
hard disk typically stores 512 bytes of data.

Redundant Array of Independent Disks (RAID) - RAID is a technology to connect multiple secondary
storage devices and use them as a single storage media. RAID consists of an array of disks in which
multiple disks are connected together to achieve different goals. RAID levels define the use of disk
arrays.

• RAID 0 - In this level, a striped array of disks is implemented. The data is broken down into
blocks and the blocks are distributed among disks. Each disk receives a block of data to
write/read in parallel. It enhances the speed and performance of the storage device. There is
no parity and backup in Level 0.

• RAID 1 - RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a
copy of data to all the disks in the array. RAID level 1 is also called mirroring and provides
100% redundancy in case of a failure.

• RAID 2 - RAID 2 records Error Correction Code using Hamming distance for its data, striped on different disks. Like level 0, each data bit in a word is recorded on a separate disk and the ECC codes of the data words are stored on a different set of disks. Due to its complex structure and high cost, RAID 2 is not commercially available.

• RAID 3 - RAID 3 stripes the data onto multiple disks. The parity bit generated for the data word is stored on a different disk. This technique makes it possible to overcome single disk failures.

• RAID 4 - In this level, an entire block of data is written onto data disks and then the parity is
generated and stored on a different disk. Note that level 3 uses byte-level striping, whereas
level 4 uses block-level striping. Both level 3 and level 4 require at least three disks to
implement RAID.

• RAID 5 - RAID 5 writes whole data blocks onto different disks, but the parity bits generated
for data block stripe are distributed among all the data disks rather than storing them on a
different dedicated disk.

• RAID 6 - RAID 6 is an extension of level 5. In this level, two independent parities are
generated and stored in distributed fashion among multiple disks. Two parities provide
additional fault tolerance. This level requires at least four disk drives to implement RAID.

• RAID 7 - This RAID level is based on RAID 3 and RAID 4 but adds caching to the mix. It
includes a real-time embedded OS as a controller, caching via a high-speed bus and other
characteristics of a stand-alone computer. It is a nonstandard, trademarked RAID level
owned by the now defunct Storage Computer Corp.
• RAID 10 (RAID 1+0) - Combining RAID 1 and RAID 0, this level is often referred to as RAID 10,
which offers higher performance than RAID 1, but at a much higher cost. In RAID 1+0, the
data is mirrored and the mirrors are striped.

File Structure - Relative data and information is stored collectively in file formats. A file is a sequence
of records stored in binary format. A disk drive is formatted into several blocks that can store
records. File records are mapped onto those disk blocks.

File Organization - File Organization defines how file records are mapped onto disk blocks. We have
four types of File Organization to organize file records −

• Heap File Organization - When a file is created using Heap File Organization, the Operating
System allocates memory area to that file without any further accounting details. File
records can be placed anywhere in that memory area. It is the responsibility of the software
to manage the records. Heap File does not support any ordering, sequencing, or indexing on
its own.
• Sequential File Organization - Every file record contains a data field (attribute) to uniquely
identify that record. In sequential file organization, records are placed in the file in some
sequential order based on the unique key field, or search key. In practice, it is not always
possible to store all the records sequentially in physical form.
• Hash File Organization - Hash File Organization uses a hash function computed on some
fields of the records. The output of the hash function determines the location of the disk block
where the record is to be placed (a minimal sketch follows this list).
• Clustered File Organization - Clustered file organization is not considered good for large
databases. In this mechanism, related records from one or more relations are kept in the
same disk block, that is, the ordering of records is not based on primary key or search key.
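
A hash file organization can be pictured as mapping a record's search key to a bucket (disk block)
number. The Python sketch below is illustrative only — the bucket count, the key format, and the
choice of hash are assumptions, not part of any particular DBMS.

import hashlib

NUM_BUCKETS = 8   # illustrative number of disk blocks (buckets)

def bucket_for(key: str) -> int:
    # A stable hash of the search key decides which block stores the record.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

buckets = {i: [] for i in range(NUM_BUCKETS)}
for record in [("S101", "Asha"), ("S205", "Ravi"), ("S309", "Mina")]:
    buckets[bucket_for(record[0])].append(record)   # place record in its block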

File Operations - Operations on database files can be broadly classified into two categories −

• Update Operations
• Retrieval Operations

Update operations change the data values by insertion, deletion, or update. Retrieval operations, on
the other hand, do not alter the data but retrieve it after optional conditional filtering. In both
types of operations, selection plays a significant role. Apart from the creation and deletion of a
file, several other operations can be performed on files (a short example follows the list below).

• Open − A file can be opened in one of the two modes, read mode or write mode. In read
mode, the operating system does not allow anyone to alter data. In other words, data is
read only. Files opened in read mode can be shared among several entities. Write mode
allows data modification. Files opened in write mode can be read but cannot be shared.
• Locate − Every file has a file pointer, which tells the current position where the data is to be
read or written. This pointer can be adjusted accordingly. Using find (seek) operation, it can
be moved forward or backward.
• Read − By default, when files are opened in read mode, the file pointer points to the
beginning of the file. There are options where the user can tell the operating system where
to locate the file pointer at the time of opening a file. The very next data to the file pointer is
read.
• Write − A user can select to open a file in write mode, which enables them to edit its contents.
This can be deletion, insertion, or modification. The file pointer can be located at the time of
opening or can be dynamically changed if the operating system allows it.
• Close − This is the most important operation from the operating system’s point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
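
The Python sketch below walks through the operations above — open, locate (seek), read, write, and
close — against a hypothetical file of fixed-length records; the file name and record layout are
assumptions made purely for illustration.

# Write mode: create three fixed-length (6-byte) records.
with open("records.dat", "wb") as f:
    f.write(b"REC001REC002REC003")

# Read mode: data cannot be altered, only read.
with open("records.dat", "rb") as f:
    f.seek(6)               # locate: move the file pointer to the second record
    record = f.read(6)      # read the very next data after the file pointer
    print(record)           # b'REC002'

# Close happens automatically when each 'with' block ends: locks are
# released and buffers are flushed to secondary storage.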

Loss of Volatile Storage - A volatile storage like RAM stores all the active logs, disk buffers, and
related data. In addition, it stores all the transactions that are being currently executed. What
happens if such a volatile storage crashes abruptly? It would obviously take away all the logs and
active copies of the database. It makes recovery almost impossible, as everything that is required to
recover the data is lost. Following techniques may be adopted in case of loss of volatile storage −

• We can have checkpoints at multiple stages so as to save the contents of the database
periodically.
• A state of the active database in the volatile memory can be periodically dumped onto stable
storage, which may also contain logs, active transactions, and buffer blocks.
• A <dump> marker can be written to the log file whenever the database contents are dumped
from volatile memory to stable storage.

Recovery –

• When the system recovers from a failure, it can restore the latest dump.
• It can maintain a redo-list and an undo-list based on checkpoints.
• It can recover the system by consulting the undo-redo lists to restore the state of all
transactions up to the last checkpoint.

Database Backup and Recovery from Catastrophic Failure - A catastrophic failure is one where a
stable, secondary storage device gets corrupt. With the storage device, all the valuable data that is
stored inside is lost. We have two different strategies to recover data from such a catastrophic
failure −

• Remote backup − a backup copy of the database is stored at a remote location, from where it
can be restored in case of a catastrophe.
• Alternatively, database backups can be taken on magnetic tapes and stored at a safer place.
This backup can later be transferred onto a freshly installed database to bring it to the point
of backup.

Large, mature databases are too bulky to be backed up frequently. In such cases, we have techniques
by which a database can be restored just by looking at its logs. So all we need to do is take a backup
of all the logs at frequent intervals of time.

Remote Backup - Remote backup provides a sense of security in case the primary location where the
database is located gets destroyed. Remote backup can be offline or online (real-time). In case it is
offline, it is maintained manually.

Online backup systems are more real-time and lifesavers for database administrators and investors.
An online backup system is a mechanism where every bit of the real-time data is backed up
simultaneously at two distant places. One of them is directly connected to the system and the other
one is kept at a remote place as backup. As soon as the primary database storage fails, the backup
system senses the failure and switches the user system to the remote storage. Sometimes this is so
instant that the users can’t even realize a failure.

Crash Recovery - DBMS is a highly complex system with hundreds of transactions being executed
every second. The durability and robustness of a DBMS depends on its complex architecture and its
underlying hardware and system software. If it fails or crashes amid transactions, it is expected that
the system would follow some sort of algorithm or techniques to recover lost data.

Failure Classification - To see where the problem has occurred, we generalize a failure into various
categories, as follows −

• Transaction Failure - A transaction has to abort when it fails to execute or when it reaches a
point from where it can’t go any further. This is called transaction failure where only a few
transactions or processes are hurt. Reasons for a transaction failure could be −
o Logical errors − Where a transaction cannot complete because it has some code
error or any internal error condition.
o System errors − Where the database system itself terminates an active transaction
because the DBMS is not able to execute it, or it has to stop because of some system
condition. For example, in case of deadlock or resource unavailability, the system
aborts an active transaction.
• System Crash - There are problems external to the system that may cause it to stop abruptly
and crash. For example, interruptions in the power supply may cause the failure of the
underlying hardware or software. Examples may include operating system errors.

• Disk Failure - In the early days of technology evolution, it was a common problem that hard-
disk drives or storage drives failed frequently. Disk failures include the formation of bad
sectors, unreachability of the disk, a disk head crash, or any other failure that destroys all or
a part of the disk storage.

Storage Structure -

• Volatile Storage − As the name suggests, volatile storage cannot survive system crashes.
Volatile storage devices are placed very close to the CPU; normally they are embedded on
the chipset itself. Main memory and cache memory are examples of volatile storage. They
are fast but can store only a small amount of information.
• Non-volatile Storage − These memories are made to survive system crashes. They are huge
in data storage capacity, but slower in accessibility. Examples may include hard-disks,
magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.

Recovery and Atomicity - When a system crashes, it may have several transactions being executed
and various files opened for them to modify the data items. Transactions are made of various
operations, which are atomic in nature. But according to ACID properties of DBMS, atomicity of
transactions as a whole must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −

• It should check the states of all the transactions, which were being executed.
• A transaction may be in the middle of some operation; the DBMS must ensure the atomicity
of the transaction in this case.
• It should check whether the transaction can be completed now or it needs to be rolled back.
• No transactions would be allowed to leave the DBMS in an inconsistent state.

There are two types of techniques, which can help a DBMS in recovering as well as maintaining the
atomicity of a transaction −

• Maintaining the logs of each transaction and writing them onto some stable storage before
actually modifying the database.
• Maintaining shadow paging, where the changes are made on a shadow copy in volatile
memory, and later the actual database is updated.

Log-based Recovery - A log is a sequence of records that maintains a record of the actions
performed by a transaction. It is important that the logs are written prior to the actual modification
and stored on stable storage media, which is failsafe. Log-based recovery works as follows −

• The log file is kept on a stable storage media.
• When a transaction enters the system and starts execution, it writes a log record about it −

<Tn, Start>

• When the transaction modifies an item X, it writes a log record as follows −

<Tn, X, V1, V2>

This record reads: transaction Tn has changed the value of X from V1 to V2.

• When the transaction finishes, it logs −

<Tn, commit>

The database can be modified using two approaches −

• Deferred database modification − All logs are written on to the stable storage and the
database is updated only when a transaction commits (a minimal sketch follows this list).
• Immediate database modification − Each log record is immediately followed by the actual
database modification. That is, the database is modified immediately after every operation.
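
As a minimal illustration of the deferred approach — not any specific DBMS's implementation — the
Python sketch below appends log records first and applies updates to the database only for
transactions whose <Tn, commit> record is present.

log = []                    # stands in for the log on stable storage
database = {"X": 100}

def log_write(txn, item, new_value):
    # <Tn, X, V1, V2>: old and new values are logged before any change.
    log.append((txn, item, database[item], new_value))

log.append(("T1", "Start"))
log_write("T1", "X", 150)
log.append(("T1", "commit"))

# Deferred modification: redo the updates of committed transactions only.
committed = {rec[0] for rec in log if rec[-1] == "commit"}
for rec in log:
    if len(rec) == 4 and rec[0] in committed:
        txn, item, old, new = rec
        database[item] = new

print(database)             # {'X': 150}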

Shadow Paging - This is a method where all the updates of a transaction are performed on a shadow
copy of the database in primary memory, and the actual database is updated only once the transaction
has completely executed. Hence, if there is a failure in the middle of a transaction, the changes are
not reflected in the database. A database pointer always points to the consistent copy of the
database, while the shadow copy is the one the transaction updates.

Once the transaction is complete, the database pointer is modified to point to the new (shadow) copy
of the database, and the old copy is deleted. If there is a failure during the transaction, the pointer
still points to the old copy of the database, and the shadow copy is discarded. Shadow paging is
useful only if the database is comparatively small, because the shadow copy consumes the same
space as the actual database; hence it is not efficient for huge databases. In addition, it cannot
handle concurrent execution of transactions and is suitable for one transaction at a time.
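
The essence of shadow paging is an atomic pointer swap. The Python sketch below assumes a simple
in-memory page table; the names and structure are illustrative, not a real DBMS API.

import copy

db_pointer = {"pages": {1: "old data", 2: "more old data"}}  # consistent copy

def run_transaction(updates, fail=False):
    global db_pointer
    shadow = copy.deepcopy(db_pointer)     # transaction works on the shadow copy
    shadow["pages"].update(updates)
    if fail:
        return                             # pointer still references the old copy
    db_pointer = shadow                    # atomic pointer swap on commit

run_transaction({1: "new data"}, fail=True)
print(db_pointer["pages"][1])   # 'old data'  (failure left the DB untouched)

run_transaction({1: "new data"})
print(db_pointer["pages"][1])   # 'new data'  (commit swapped the pointer)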

Recovery with Concurrent Transactions - When more than one transaction is being executed in
parallel, the logs are interleaved. At the time of recovery, it would become hard for the recovery
system to backtrack all logs, and then start recovering. To ease this situation, most modern DBMS
use the concept of 'checkpoints'.

Checkpoint - Keeping and maintaining logs in real time in a real environment may fill up all the
memory space available in the system. As time passes, the log file may grow too big to be handled at
all. Checkpointing is a mechanism by which all the previous logs are removed from the system and
stored permanently on a storage disk. A checkpoint declares a point before which the DBMS was in a
consistent state and all the transactions were committed. When a system with concurrent transactions
crashes and recovers, it behaves in the following manner –

• The recovery system reads the logs backwards from the end to the last checkpoint.
• It maintains two lists, an undo-list and a redo-list.
• If the recovery system sees a log with <Tn, Start> and <Tn, commit> or just <Tn, commit>, it
puts the transaction in the redo-list.

• If the recovery system sees a log with <Tn, Start> but finds no commit or abort log, it puts
the transaction in the undo-list.

All the transactions in the undo-list are then undone and their logs are removed. All the transactions
in the redo-list are redone, and their logs are preserved.
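
A minimal Python sketch of this backward scan follows; the log record format mirrors the one used
above, and the checkpoint marker is an assumption made for illustration.

log = [
    ("checkpoint",),
    ("T1", "Start"), ("T1", "X", 10, 20), ("T1", "commit"),
    ("T2", "Start"), ("T2", "Y", 5, 7),     # T2 never committed
]

started, committed = set(), set()
for rec in reversed(log):                   # read backwards to the checkpoint
    if rec == ("checkpoint",):
        break
    if rec[-1] == "commit":
        committed.add(rec[0])
    elif rec[-1] == "Start":
        started.add(rec[0])

redo_list = sorted(committed)               # has <Tn, commit>
undo_list = sorted(started - committed)     # has <Tn, Start> but no commit/abort
print(redo_list, undo_list)                 # ['T1'] ['T2']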

Database Security and Threats - Data security is an imperative aspect of any database system. It is
of particular importance in distributed systems because of the large number of users, fragmented and
replicated data, multiple sites, and distributed control.

Threats in a Database -

• Availability loss − Availability loss refers to the non-availability of database objects to
legitimate users.
• Integrity loss − Integrity loss occurs when unacceptable operations are performed upon the
database either accidentally or maliciously. This may happen while creating, inserting,
updating or deleting data. It results in corrupted data leading to incorrect decisions.
• Confidentiality loss − Confidentiality loss occurs due to unauthorized or unintentional
disclosure of confidential information. It may result in illegal actions, security threats and
loss in public confidence.

The measures of control can be broadly divided into the following categories −

• Access Control − Access control includes security mechanisms in a database management
system to protect against unauthorized access. A user can gain access to the database only
after clearing the login process through valid user accounts. Each user account is password
protected.
• Flow Control − Distributed systems encompass a lot of data flow from one site to another
and also within a site. Flow control prevents data from being transferred in such a way that
it can be accessed by unauthorized agents. A flow policy lists out the channels through which
information can flow. It also defines security classes for data as well as transactions.
• Data Encryption − Data encryption refers to encoding data when sensitive data is to be
communicated over public channels. Even if an unauthorized agent gains access to the data,
he cannot understand it, since it is in an incomprehensible format.

Authentication - Authentication is the process of confirming that a user logs in only in accordance
with the rights to perform the activities he is authorized to perform. The authentication mechanism
determines the user’s identity before revealing sensitive information. User authentication can be
performed at the operating system level or at the database level itself. Biometric authentication
tools, such as retina scans and fingerprints, are also in use to keep the database safe from hackers
or malicious users. Database security can also be managed from outside the database system.
Authentication typically requires two different credentials: a user ID (or username) and a password.
Some types of security authentication processes are −

• Operating System based authentication.
• Lightweight Directory Access Protocol (LDAP).

Features of Authentication –

• Authentication is used by a server when the server needs to know exactly who is accessing
their information or site.

• Authentication is used by a client when the client needs to know that the server is the system
it claims to be.
• In authentication, the user or computer has to prove its identity to the server or client.
• Usually, authentication by a server entails the use of a user name and password. Other ways
to authenticate can be through cards, retina scans, voice recognition, and fingerprints.
• Authentication by a client usually involves the server giving a certificate to the client in
which a trusted third party such as Verisign or Thawte states that the server belongs to the
entity (such as a bank) that the client expects it to.
• Authentication does not determine what tasks the individual can do or what files the
individual can see. Authentication merely identifies and verifies who the person or system is.

Authorization - Authorization is the process of verifying that you have access to something. An
authorization technique is used to determine the permissions that are granted to an authenticated
user. Gaining access to a resource (e.g. a directory on a hard disk) because the permissions
configured on it allow you access is authorization. Authorization is the process of confirming what
you are authorized to perform. For example, you may be allowed to log in to your Unix server via an
ssh client, but not allowed to browse certain directories or file systems. Authorization occurs after
authentication is successful. Authorization can be controlled at the level of the file system, or
through a variety of configuration options such as an application-level chroot. Normally, a
connection attempt must pass both authentication and authorization checks by the system.

Features of Authorization –

• Authorization is a process by which a server determines if the client has permission to use a
resource or access a file.
• Authorization is usually coupled with authentication so that the server has some concept of
who the client is that is requesting access.
• The type of authentication required for authorization may vary; passwords may be required
in some cases but not in others.
• In some cases, there is no authorization; any user may use a resource or access a file
simply by asking for it. Most of the web pages on the Internet require no authentication or
authorization.

Encryption - Encryption involves the process of transforming data so that it is unreadable by anyone
who does not have a decryption key. The Secure Shell (SSH) and Secure Sockets Layer (SSL) protocols
are usually used in encryption processes. SSL drives the secure part of “https://” sites used in e-
commerce (such as eBay and Amazon.com). All data in SSL transactions is encrypted between the
client (browser) and the server (web server) before the data is transferred between the two. All data
in SSH sessions is encrypted between the client and the server when communicating at the shell. By
encrypting the data exchanged between the client and server, information like social security
numbers, credit card numbers, and home addresses can be sent over the Internet with less risk of
being intercepted during transit.
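
The following Python sketch shows the basic idea of symmetric encryption with a secret key. It
assumes the third-party 'cryptography' package is installed (pip install cryptography); SSL and SSH
themselves are protocol-level mechanisms and are not reproduced here.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # the decryption key must be kept secret
cipher = Fernet(key)

token = cipher.encrypt(b"credit card: 4111-1111-1111-1111")
print(token)                       # unreadable without the key

print(cipher.decrypt(token))       # readable again only with the key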

Comparison between Authentication and Authorization -


• Steps − Authentication is the first step of authorization, so it always comes first;
authorization is done after successful authentication.
• Basic − Authentication checks the person's identity to grant access to the system;
authorization checks the person's privileges or permissions to access the resources.
• Process of − Authentication verifies user credentials; authorization validates user
permissions.
• Order of the process − Authentication is performed at the very first step; authorization is
usually performed after authentication.
• Examples − In online banking applications, the identity of the person is first determined with
the help of the user ID and password; in a multi-user system, the administrator decides what
privileges or access rights each user has.

Take-Grant Protection Model – It is a formal model used in the field of computer security to
establish or disprove the safety of a given computer system that follows specific rules. The model
represents a system as a directed graph, where vertices are either subjects or objects. The edges
between them are labelled, and the label indicates the rights that the source of the edge has over the
destination. Two rights occur in every instance of the model: take and grant. They play a special role
in the graph-rewriting rules describing admissible changes of the graph. There are in total four such
rules -

• take rule allows a subject to take the rights of another object (add an edge originating at the
subject); a small sketch of this rule follows the list
• grant rule allows a subject to grant its own rights to another object (add an edge terminating
at the subject)
• create rule allows a subject to create new objects (add a vertex and an edge from the
subject to the new vertex)
• remove rule allows a subject to remove rights it has over another object (remove an edge
originating at the subject)
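
The Python sketch below represents the labelled graph as a dictionary of edges and applies the take
rule; the subject and object names are illustrative, and this is not a full safety-analysis
implementation.

# edges[(src, dst)] = set of rights that src holds over dst
edges = {
    ("alice", "bob"): {"take"},       # alice holds 'take' over bob
    ("bob", "file1"): {"read"},       # bob holds 'read' over file1
}

def take(subject, via, target, right):
    # take rule: 'subject' holds 'take' over 'via', and 'via' holds 'right'
    # over 'target', so 'subject' may add that right over 'target'.
    assert "take" in edges.get((subject, via), set())
    assert right in edges.get((via, target), set())
    edges.setdefault((subject, target), set()).add(right)

take("alice", "bob", "file1", "read")
print(edges[("alice", "file1")])      # {'read'}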

Access Control Matrix - An access control matrix is a table that states a subject’s access rights on an
object. A subject’s access rights can be of the type read, write, and execute. It is used for the
implementation of the protection model. The matrix consists of rows and columns.

• Rows are used to represent the domains (user, process or procedure) and columns are used
to represent the objects (resources).
• Each entry in the matrix consists of a set of access rights.
• The entry access(i, j) defines the set of operations that a process executing in domain Di
can invoke on object Oj.
• It must be ensured that a process executing in domain Di can access only those objects
specified in row i (a minimal sketch follows the table below).
Domains \ Objects    File 1     File 2      File 3     Printer
D1                   Read                   Read
D2                                                     Print
D3                              Execute
D4                   Write
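
A minimal Python sketch of this matrix as a dictionary of dictionaries follows; the domain and object
names mirror the example table above, and the check function is an illustration rather than a real
protection-system API.

acm = {
    "D1": {"File 1": {"read"}, "File 3": {"read"}},
    "D2": {"Printer": {"print"}},
    "D3": {"File 2": {"execute"}},
    "D4": {"File 1": {"write"}},
}

def allowed(domain, obj, operation):
    # A process in 'domain' may invoke 'operation' on 'obj' only if the
    # entry access(domain, obj) contains that operation.
    return operation in acm.get(domain, {}).get(obj, set())

print(allowed("D1", "File 1", "read"))    # True
print(allowed("D2", "File 1", "read"))    # False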
