DBMS
Unit I
Syllabus
Unit II
Syllabus
Unit III
Syllabus
Physical and logical hierarchy, concept of index, B-tree, hash index, function index,
bitmap index, concept of functional dependency, normalization, business data
analytics, tools and techniques for business data analytics.
Unit I
Data
Data are raw or isolated facts from which the required information is produced.
Data are distinct pieces of information, usually formatted in a special way. Data can
exist in a variety of forms that have meaning in the user's environment, such as
numbers or text on a piece of paper, bits stored in a computer's memory, or facts
stored in a person's mind.
Information
Data and information are closely related and are often used interchangeably.
Information is processed, organized or summarized data. It may be defined as a
collection of related data that, when put together, communicates a meaningful and
useful message to a recipient who uses it to make decisions or to interpret the data
to get the meaning.
Metadata
Metadata is data about the data. It is also called the system catalog, which reflects
the self-describing nature of the database and provides program-data independence.
Metadata describes the objects in the database and makes it easier for those objects
to be accessed or manipulated. It describes the database structure, constraints,
applications, authorization, the sizes of data types, and so on.
Data Item
A data item is the smallest unit of data that has meaning to its user. It is
traditionally called a field or data element. It is an occurrence of the smallest unit of
named data and is represented in the database by a value. Name, telephone number
and bill amount are a few examples of data items.
Records
A record is a collection of logically related fields or data items, with each field
possessing a fixed number of bytes and having a fixed data type. A record consists of
values for each field. It is an occurrence of a named collection of zero, one or more
than one data items or aggregates.
Files
A file is a collection of related sequences of records. All records in a file are of the
same record type. If every record in the file has exactly the same size, the file is said
to be made up of fixed-length records. If different records in the file have different
sizes, the file is said to be made up of variable-length records.
Database
Data Redundancy
Data redundancy means the same information is duplicated in several files. This
wastes storage space and sets the stage for inconsistency.
Data Inconsistency
Data inconsistency means different copies of the same data do not match; that is,
different versions of the same basic data exist. This occurs as the result of update
operations that do not update the same data stored at different places.
Example: Address Information of a customer is recorded differently in different files.
Data Isolation
Because data are scattered in various files, and the files may be in different formats,
writing new application programs to retrieve the data is difficult.
Integrity Problems
The data values may need to satisfy some integrity constraints. For example, the
balance field value must be greater than 5000. In a file processing system we have to
handle this through program code, but in a database we can declare the integrity
constraint along with the data definition itself.
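A minimal sketch of declaring such a constraint in SQL, assuming a hypothetical customer_account table (names and types are illustrative):

create table customer_account (
    account_no varchar(10) primary key,
    balance    numeric(10,2) check (balance > 5000)  -- the DBMS rejects any insert or update that violates this
);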
Atomicity Problem
It is difficult to ensure atomicity in a file processing system. For example, when
transferring $100 from account A to account B, if a failure occurs during execution,
there could be a situation where $100 is deducted from account A but never credited
to account B.
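In a DBMS the transfer can be wrapped in a transaction so that either both updates happen or neither does. A hedged sketch, reusing the hypothetical customer_account table from above:

start transaction;   -- begin the atomic unit of work
update customer_account set balance = balance - 100 where account_no = 'A';
update customer_account set balance = balance + 100 where account_no = 'B';
commit;              -- both updates become permanent together; a failure before this rolls both back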
Security Problems
Not every user of the system should be able to access all the data. In a file
processing system it is difficult to enforce such selective access, whereas a database
can be protected from unauthorized users.
Database Management System (DBMS)
Database management system (DBMS) is system software for creating and managing
databases. The DBMS provides users and programmers with a systematic way to
create, retrieve, update and manage data. Database is a collection of related data
and data is a collection of facts and figures that can be processed to produce
information. Mostly data represents recordable facts. Data aids in producing
information, which is based on facts. For example, if we have data about marks
obtained by all students, we can then conclude about toppers and average marks. A
database management system stores data in such a way that it becomes easier to
retrieve, manipulate, and produce information.
Characteristics
Traditionally, data was organized in file formats. DBMS was a new concept then, and
all the research was done to make it overcome the deficiencies in traditional style of
data management. A modern DBMS has the following characteristics −
Real-world entity − A modern DBMS is more realistic and uses real-world entities
to design its architecture. It uses the behavior and attributes too. For example, a
school database may use students as an entity and their age as an attribute.
Relation-based tables − DBMS allows entities and relations among them to form
tables. A user can understand the architecture of a database just by looking at the
table names.
Isolation of data and application − A database system is entirely different from its
data. A database is an active entity, whereas data is said to be passive, on which the
database works and organizes. DBMS also stores metadata, which is data about data,
to ease its own process.
Less redundancy − DBMS follows the rules of normalization, which splits a relation
when any of its attributes is having redundancy in values. Normalization is a
mathematically rich and scientific process that reduces data redundancy.
Query Language − DBMS is equipped with query language, which makes it more
efficient to retrieve and manipulate data. A user can apply as many and as different
filtering options as required to retrieve a set of data. Traditionally this was not
possible where file-processing systems were used.
Multiple views − DBMS offers multiple views for different users. A user who is in
the Sales department will have a different view of database than a person working
in the Production department. This feature enables the users to have a concentrated
view of the database according to their requirements.
Security − Features like multiple views offer security to some extent where users
are unable to access data of other users and departments. DBMS offers methods to
impose constraints while entering data into the database and retrieving the same at
a later stage. DBMS offers many different levels of security features, which enables
multiple users to have different views with different features.
Advantages of DBMS
1. Controlling Data Redundancy: In a DBMS, data is stored centrally, so the
duplication of the same data in several files is controlled.
2. Data Consistency:
By controlling the data redundancy, the data consistency is obtained. If a data item
appears only once, any update to its value has to be performed only once and the
updated value (new value of item) is immediately available to all users.
Even where the DBMS has only reduced redundancy to a minimum level rather than
eliminating it, the database system enforces consistency: when a data item that
appears more than once in the database is updated, the DBMS automatically updates
each occurrence of that data item.
3. Data Sharing:
In DBMS, data can be shared by authorized users of the organization. The DBA
manages the data and gives rights to users to access the data. Many users can be
authorized to access the same set of information simultaneously. The remote users
can also share same data. Similarly, the data of same database can be shared
between different application programs.
4. Data Integration:
In DBMS, data in a database is stored in tables. A single database contains multiple
tables and relationships can be created between tables (or associated data entities).
This makes it easy to retrieve and update data.
5. Integrity Constraints:
Integrity constraints or consistency rules can be applied to database so that the
correct data can be entered into database. The constraints may be applied to data
item within a single record or they may be applied to relationships between records.
There are also some standard constraints that are intrinsic to most DBMSs, such as
primary key (uniqueness) constraints, referential integrity constraints, and data
type (domain) constraints.
6. Data Security:
Data security is the protection of the database from unauthorized users. Only the
authorized persons are allowed to access the database. Some of the users may be
allowed to access only a part of database i.e., the data that is related to them or
related to their department. Mostly, the DBA or head of a department can access all
the data in the database. Some users may be permitted only to retrieve data,
whereas others are allowed to retrieve as well as to update data. The database
access is controlled by the DBA, who creates the accounts of users and grants rights
to access the database. Typically, users or groups of users are given usernames
protected by passwords.
7. Data Atomicity:
A transaction in commercial databases is referred to as an atomic unit of work. For
example, when you purchase something from a point of sale (POS) terminal, a
number of tasks are performed, such as updating the inventory, recording the
payment, and printing the receipt.
All these tasks collectively are called an atomic unit of work or transaction. These
tasks must all be completed; otherwise the partially completed tasks are rolled back.
Thus through DBMS, it is ensured that only consistent data exists within the
database.
Structure of DBMS
A typical structure of a DBMS, with its components and the relationships between
them, is shown. The DBMS software is partitioned into several modules. Each
module or component is assigned a specific operation to perform. Some of the
functions of the DBMS are supported by the operating system (OS), which provides
basic services, and the DBMS is built on top of them. The physical data and the
system catalog are stored on a physical disk. Access to the disk is controlled
primarily by the OS, which schedules disk input/output.
Conceptually, the following logical steps are followed while executing a user's
request to access the database system:
Users issue a query using a particular database language, for example, SQL
commands.
The DBMS accepts the user's SQL commands and analyses them.
The parsed query is presented to a query optimizer, which uses information
about how the data is stored to produce an efficient execution plan for
evaluating the query.
The DBMS produces query evaluation plans using the external schema for
the user, the corresponding external/conceptual mapping, the conceptual
schema, the conceptual/internal mapping, and the storage structure
definition.
Components of a DBMS
The DBMS accepts the SQL commands generated from a variety of user interfaces,
produces query evaluation plans, executes these plans against the database, and
returns the answers. As shown, the major software modules or components of
DBMS are as follows:
(i) Query processor: The query processor transforms user queries into a series
of low level instructions. It is used to interpret the online user's query and
convert it into an efficient series of operations in a form capable of being sent
to the run time data manager for execution. The query processor uses the
data dictionary to find the structure of the relevant portion of the database
and uses this information in modifying the query and preparing an optimal
plan to access the database.
(ii) Run time database manager: Run time database manager is the central
software component of the DBMS, which interfaces with user-submitted
application programs and queries. It handles database access at run time. It
converts operations in users' queries, coming directly via the query processor
or indirectly via an application program, from the user's logical view to a
physical file system. It accepts queries and examines the external and
conceptual schemas to determine what conceptual records are required to
satisfy the user's request. It enforces constraints to maintain the consistency
and integrity of the data, as well as its security. It also performs backup and
recovery operations. The run time database manager is sometimes referred to
as the database control system and has the following components:
(iii) Data Manager: The data manager is responsible for the actual handling of
data in the database. It provides recovery facilities so that the system is able
to recover the data after a failure. It includes the recovery manager and the
buffer manager. The buffer manager is responsible for the transfer of data
between main memory and secondary storage; it is also referred to as the
cache manager.
(iv) DML Processor: Using a DML compiler, the DML processor converts the
DML statements embedded in an application program into standard
function calls in the host language. The DML compiler converts the DML
statements written in a host programming language into object code for
database access. The DML processor must interact with the query
processor to generate the appropriate code.
(v) DDL Processor: Using a DDL compiler, the DDL processor converts the
DDL statements into a set of tables containing metadata. These tables
contain the metadata concerning the database and are in a form that can
be used by other components of the DBMS. The DDL compiler processes
schema definitions, specified in the DDL, and stores descriptions of the
schemas in the DBMS system catalog.
Codd's 12 Rules
Dr Edgar F. Codd, after his extensive research on the relational model of database
systems, came up with twelve rules of his own which, according to him, a database
must obey in order to be regarded as a true relational database.
These rules can be applied on any database system that manages stored data using
only its relational capabilities. This is a foundation rule, which acts as a base for all
the other rules.
Rule 1: Information Rule
The data stored in a database, may it be user data or metadata, must be a value of
some table cell. Everything in a database must be stored in a table format.
Rule 2: Guaranteed Access Rule
Every single data element (value) is guaranteed to be accessible logically with a
combination of table name, primary key (row value), and attribute name (column
value).
Rule 3: Systematic Treatment of NULL Values
The NULL values in a database must be given a systematic and uniform treatment.
This is a very important rule because a NULL can be interpreted as one of the
following − data is missing, data is not known, or data is not applicable.
Rule 4: Active Online Catalog
The structure description of the entire database must be stored in an online catalog,
known as the data dictionary, which can be accessed by authorized users. Users can
use the same query language to access the catalog which they use to access the
database itself.
Rule 5: Comprehensive Data Sub-Language Rule
A database can only be accessed using a language having linear syntax that supports
data definition, data manipulation, and transaction management operations. This
language can be used directly or by means of some application. If the database
allows access to data without any help of this language, then it is considered a
violation.
Rule 6: View Updating Rule
All the views of a database, which can theoretically be updated, must also be
updatable by the system.
Rule 7: High-Level Insert, Update, and Delete Rule
A database must support high-level insertion, updation, and deletion. This must not
be limited to a single row; that is, it must also support union, intersection and minus
operations to yield sets of data records.
Rule 8: Physical Data Independence
The data stored in a database must be independent of the applications that access
the database. Any change in the physical structure of a database must not have any
impact on how the data is being accessed by external applications.
Rule 9: Logical Data Independence
The logical data in a database must be independent of its user's view (application).
Any change in logical data must not affect the applications using it. For example, if
two tables are merged or one is split into two different tables, there should be no
impact or change on the user application. This is one of the most difficult rules to
apply.
Rule 10: Integrity Independence
A database must be independent of the application that uses it. All its integrity
constraints can be independently modified without the need of any change in the
application. This rule makes a database independent of the front-end application
and its interface.
Rule 11: Distribution Independence
The end-user must not be able to see that the data is distributed over various
locations. Users should always get the impression that the data is located at one site
only. This rule has been regarded as the foundation of distributed database systems.
Rule 12: Non-Subversion Rule
If a system has an interface that provides access to low-level records, then the
interface must not be able to subvert the system and bypass security and integrity
constraints.
Data Model
Hierarchical Model
The hierarchical database model is one of the oldest database models, dating from
the late 1950s. One of the first hierarchical databases, Information Management
System (IMS), was developed jointly by North American Rockwell Company and
IBM. This model is like the structure of a tree, with the records forming the nodes
and the fields forming the branches of the tree.
Advantages
1. Simplicity
Data naturally have hierarchical relationships in most practical situations.
Therefore, it is easier to view data arranged in this manner. This makes this
type of database more suitable for the purpose.
2. Security
These database systems can enforce varying degree of security feature unlike
flat-file system.
3. Database Integrity
Because of its inherent parent-child structure, database integrity is highly
promoted in these systems.
4. Efficiency: The hierarchical database model is a very efficient one when the
database contains a large number of 1:N (one-to-many) relationships and
when the users require a large number of transactions, using data whose
relationships are fixed.
5. Data sharing: Because all data are held in a common database, data sharing
becomes practical.
6. Data Independence: The DBMS creates an environment in which data
independence can be maintained. This substantially decreases the
programming efforts and program maintenance.
7. Available expertise: Due to the large installed base of mainframe computers,
experienced programmers were available.
8. Tried business applications: There was a large number of tried-and-true
business applications available within the mainframe environment.
Disadvantages
Network Model
The network model replaces the hierarchical tree with a graph, thus allowing more
general connections among the nodes. The main difference of the network model
from the hierarchical model is its ability to handle many-to-many (N:N) relations. In
other words, it allows a record to have more than one parent. Suppose an employee
works for two departments. The strict hierarchical arrangement is not possible here
and the tree becomes a more generalized graph - a network. The network model
was evolved to specifically handle non-hierarchical relationships. As shown below,
data can belong to more than one parent. Note that there are lateral connections as
well as top-down connections. A network structure thus allows 1:1 (one:one), 1:M
(one:many), and M:M (many:many) relationships among entities.
The Network model retains almost all the advantages of the hierarchical model
while eliminating some of its shortcomings.
1. Conceptual simplicity: Just like the hierarchical model, the network model is
also conceptually simple and easy to design.
3. Ease of data access: The data access is easier and flexible than the
hierarchical model.
4. Data Integrity: The network model does not allow a member to exist without
an owner. Thus, a user must first define the owner record and then the
member record. This ensures the data integrity.
5. Data independence: The network model is better than the hierarchical model
in isolating the programs from the complex physical storage details.
Even though the network database model was significantly better than the
hierarchical database model, it also had many drawbacks. Some of them are:
1. System complexity: All the records are maintained using pointers and hence
the whole database structure becomes very complex.
4. Not user-friendly: The network data model is not designed to be user-friendly
and is a highly skill-oriented system.
Relational Model
The relational model stores data in the form of tables. This concept was proposed by
Dr. E. F. Codd, a researcher at IBM, in 1970. The relational model consists of three
major components:
1. The set of relations and set of domains that defines the way data can be
represented (data structure).
2. Integrity rules that define the procedure to protect the data (data integrity).
3. The operations that can be performed on data (data manipulation).
A relational model database is defined as a database that allows you to group its data
items into one or more independent tables that can be related to one another by
using fields common to each related table.
2. Conceptual simplicity: We have seen that both the hierarchical and the
network database model were conceptually simple. But the relational database
model is even simpler at the conceptual level. Since the relational data model
frees the designer from the physical data storage details, the designers can
concentrate on the logical view of the database.
4. Ad hoc query capability: The presence of very powerful, flexible and easy-to-
use query capability is one of the main reasons for the immense popularity of
the relational database model. The query language of the relational database
model, Structured Query Language or SQL, makes ad hoc queries a reality. SQL is
a fourth-generation language (4GL). A 4GL allows the user to specify what
must be done without specifying how it must be done. So, using SQL, the users
can specify what information they want and leave the details of how to get the
information to the database.
The relational model's disadvantages are very minor as compared to the advantages,
and its capabilities far outweigh the shortcomings. Also, the drawbacks of relational
database systems could be avoided if proper corrective measures are taken. The
drawbacks are not because of shortcomings in the database model, but because of
the way it is implemented.
2. Ease of design can lead to bad design: The relational database is easy to
design and use. The users need not know the complex details of physical data
storage; they need not know how the data is actually stored to access it. This
ease of design and use can lead to the development and implementation of
very poorly designed database management systems. Since the database is
efficient, these design inefficiencies will not come to light when the database is
designed and when there is only a small amount of data. As the database
grows, the poorly designed databases will slow the system down and will
result in performance degradation and data corruption.
Although DBMS and RDBMS are both used to store information in a physical
database, there are some remarkable differences between them.
The main differences between DBMS and RDBMS are given below:
A DBMS stores data as files, whereas an RDBMS stores data in tabular form.
In a DBMS, generally no relationships exist between the stored data, whereas
in an RDBMS tables are related to each other through primary and foreign keys.
A DBMS typically suits small, single-user applications, whereas an RDBMS
supports multiple concurrent users and larger volumes of data.
An RDBMS enforces integrity constraints and normalization, which a plain
DBMS need not support.
There are two techniques used for the purpose of database design from the
system requirements. These are entity-relationship (ER) modeling, a top-down
approach, and normalization, a bottom-up approach. The ER model offers several
advantages:
It maps well to the relational model. The constructs used in the ER model can
easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore,
the model can be used by the database designer to communicate the design to
the end user.
In addition, the model can be used as a design plan by the database developer
to implement a data model in specific database management software.
DBA Responsibilities
The database administrator (DBA) is responsible for schema definition, storage
structure and access method definition, granting users the authority to access the
database, and routine maintenance such as periodic backups and performance
monitoring.
Types of DBA
Common types of DBA include the administrative DBA, the development DBA, the
data warehouse DBA, and the application DBA.
Database Design Steps
1. Determine the purpose of the database - This helps prepare for the
remaining steps.
2. Find and organize the information required - Gather all of the types of
information to record in the database, such as product name and order
number.
3. Divide the information into tables - Divide information items into major
entities or subjects, such as Products or Orders. Each subject then becomes a
table.
4. Turn information items into columns - Decide what information needs to
be stored in each table. Each item becomes a field, and is displayed as a
column in the table. For example, an Employees table might include fields
such as Last Name and Hire Date.
5. Specify primary keys - Choose each table’s primary key. The primary key is
a column, or a set of columns, that is used to uniquely identify each row. An
example might be Product ID or Order ID.
6. Set up the table relationships - Look at each table and decide how the data
in one table is related to the data in other tables. Add fields to tables or create
new tables to clarify the relationships, as necessary.
7. Refine the design - Analyze the design for errors. Create tables and add a few
records of sample data. Check if results come from the tables as expected.
Make adjustments to the design, as needed.
8. Apply the normalization rules - Apply the data normalization rules to see if
tables are structured correctly. Make adjustments to the tables as needed.
A database is organised in such a way that a computer program can quickly select
the desired piece of data. A database can further be defined as a shared collection of
logically related data, designed to meet the information needs of an organization.
A DBMS is a collection of interrelated files and a set of programs that allow several
users to access and modify these files. A major purpose of a database system is to
provide users with an abstract view of the data. That is the system hides certain
details of how the data is stored and maintained. We can imagine that the whole
database system is divided into levels. The generalised architecture of a database
system is called the ANSI/SPARC (American National Standards Institute -
Standards Planning and Requirements Committee) model.
It hides the physical storage details from users: Users should not have to
deal with physical database storage details.
The database administrator should be able to change the database storage
structures without affecting the users’ views.
The internal structure of the database should be unaffected by changes to the
physical aspects of the storage: For example, a changeover to a new disk.
The three levels of the ANSI/SPARC architecture are:
External level
Conceptual level
Internal level
External level:
The external level is the user's view of the database and the level closest to the
users. This level describes that part of the database that is relevant to each user.
Most users of the database are not concerned with all the information contained in
the database. Instead, they need only the part of the database relevant to them. For
example, even though the bank database stores a lot more information, an account
holder would be interested only in the account details such as the current balance
and the transactions made. They may not need the rest of the information stored in
the account holder’s database. An external schema describes each external view.
The external schema consists of the definition of the logical records and the
relationships in the external view.
Conceptual level:
Conceptual level is the middle level of the three-tier architecture. At this level of
database abstraction, all the database entities and relationships among them are
included. Conceptual level provides the community view of the database and
describes what data is stored in the database and the relationships among the data.
One conceptual view represents the entire database of an organization. It is a
complete view of the data requirements of the organization that is independent of
any storage consideration. The conceptual schema defines conceptual view. It is also
called the logical schema. There is only one conceptual schema per database. The
figure shows the conceptual view record of a database.
Internal level:
The lowest level of abstraction is the internal level. It is the one closest to physical
storage device. This level is also termed as physical level, because it describes how
data are actually stored on the storage medium such as hard disk, magnetic tape etc.
This level indicates how the data will be stored in the database and describe the
data structures, file structures and access methods to be used by the database. The
internal schema defines the internal level. The internal schema contains the
definition of the stored record, the methods of representing the data fields, and the
access methods used. The figure shows the internal view record of a database.
2) Difficulty in accessing data:- The file system environment does not allow
needed data to be retrieved in a convenient and efficient manner.
7) Security problems:- Not every user of the database system should be able to
access all data. The database should be protected from access by unauthorized
users.
Data Independence
Logical data independence: It is the capacity to change the conceptual schema
without having to change external
schemas or application programs. We may change the conceptual schema to expand
the database (by adding a record type or data item), or to reduce the database (by
removing a record type or data item). In the latter case, external schemas that refer
only to the remaining data should not be affected. Only the view definition and the
mappings need be changed in a DBMS that supports logical data independence.
Application programs that reference the external schema constructs must work as
before, after the conceptual schema undergoes a logical reorganization. Changes to
constraints can be applied also to the conceptual schema without affecting the
external schemas or application programs.
Physical data independence: It is the capacity to change the internal schema
without having to change the
conceptual (or external) schemas. Changes to the internal schema may be needed
because some physical files had to be reorganized—for example, by creating
additional access structures—to improve the performance of retrieval or update. If
the same data as before remains in the database, we should not have to change the
conceptual schema. Whenever we have a multiple level DBMS, its catalog must be
expanded to include information on how to map requests and data among the
various levels. The DBMS uses additional software to accomplish these mappings by
referring to the mapping information in the catalog. Data independence is
accomplished because, when the schema is changed at some level, the schema at the
next higher level remains unchanged; only the mapping between the two levels is
changed. Hence, application programs referring to the higher-level schema need not
be changed.
Data Abstraction:
The major purpose of a DBMS is to provide users with an abstract view of data, i.e.
the system hides certain details of how the data are stored and maintained.
Since database system users are not computer trained, developers hide the
complexity from users through 3 levels of abstraction, to simplify user’s interaction
with the system.
Physical level: This is the lowest level of abstraction, which describes how the data
are actually stored.
Logical level: The next level of abstraction describes what data are stored in the
database and what relationships exist among those data.
View level: The highest level of abstraction describes only a part of the entire
database. Views provide a security mechanism to prevent users from accessing
certain parts of the database.
SQL
A database system provides two different types of languages: One to specify the database
schema and other to express database queries and updates.
A data dictionary is a file that contains metadata, that is, data about data. This file is
consulted before actual data are read or modified in the database system.
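Because the data dictionary is itself stored as tables, it can usually be queried with ordinary SQL. A sketch using the standard information_schema views (supported by PostgreSQL, MySQL and SQL Server, though exact catalog layouts vary by product):

-- list the base tables visible in the current database
select table_name
from information_schema.tables
where table_type = 'BASE TABLE';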
The storage structure and access methods used by the database system are specified by a
set of definitions in a special type of DDL called the data storage and definition language.
The result of compiling these definitions is a set of instructions that specify the
implementation details of the database schemas; these details are usually hidden from the
user.
The create table command is used for the creation of a table. Suppose we wish to create the
table account having columns account_no, branch, and amount; then we have to write the
following query (the column data types below are illustrative):
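create table account
(   account_no  varchar(10),
    branch      varchar(20),
    amount      numeric(10,2)   -- column data types here are illustrative assumptions
);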
In this example, table account is created on which various modification operations can be
performed.
Data Definition Language (DDL) : Statements are used to define the database structure or
schema.
Some examples:
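CREATE - creates a new table, view, or other database object
ALTER - modifies the structure of an existing object
DROP - deletes an entire table, view, or other object
TRUNCATE - removes all records from a table, including the space allocated for them
RENAME - renames an object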
At the physical level, we must define algorithms that allow efficient access to data. At
higher levels of abstraction, we emphasize ease of use. The goal is to provide efficient
human interaction with the system.
a) Procedural DMLs require a user to specify what data are needed and how to get those
data.
b) Non procedural DMLs require a user to specify what data are needed without
specifying how to get those data.
Non-procedural DMLs are usually easier to learn and use than procedural DMLs.
However, since a user does not have to specify how to get the data, these languages may
generate code that is not as efficient as that produced by procedural languages.
For example: The simplest insert statement is a request to insert one tuple.
Suppose that we wish to insert the fact that there is an account A-9732 at the Perryridge
branch and that it has a balance of $1200. The query is (assuming the account table
created earlier):
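insert into account
values ('A-9732', 'Perryridge', 1200);   -- values follow the account table's column order: account_no, branch, amount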
Data Manipulation Language (DML) : Statements are used for managing data within
schema objects.
Some examples:
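SELECT - retrieves data from one or more tables
INSERT - inserts data into a table
UPDATE - updates existing data within a table
DELETE - deletes records from a table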
Transaction Control (TCL) : Statements are used to manage the changes made
by DML statements. It allows statements to be grouped together into logical
transactions.
Some examples:
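COMMIT - makes the changes of the current transaction permanent
SAVEPOINT - marks a point within a transaction to which one can later roll back
ROLLBACK - undoes the changes of the current transaction, back to the last COMMIT or SAVEPOINT
SET TRANSACTION - sets properties such as the isolation level for the current transaction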
Assertion
An assertion is a predicate expressing a condition that we wish the database always to
satisfy; domain constraints and referential-integrity constraints are special forms of
assertions.
An example: All new customers opening an account must have a balance of $100; however,
once the account is opened their balance can fall below that amount. In this case you have
to use a trigger rather than an assertion, because you only want the condition evaluated
when a new record is inserted.
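A hedged sketch of such a trigger, written in SQL:1999-style syntax (real products differ; PostgreSQL, for example, requires a separate trigger function). The cust_account table and its balance column are assumptions for illustration:

create trigger check_opening_balance
before insert on cust_account          -- fires only for newly inserted rows
referencing new row as nrow
for each row
when (nrow.balance < 100)
    signal sqlstate '45000';           -- reject openings below the $100 minimum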
ER Diagrams
1. Hotel Management
2. Library Management
Generalization
Generalization is a bottom-up approach in which two lower-level entities combine to form a
higher-level entity. In generalization, the higher-level entity can also combine with another
lower-level entity to form a still higher-level entity.
Specialization
Specialization is the opposite of generalization: a top-down approach in which a higher-level
entity is divided into two or more lower-level entities on the basis of some distinguishing
characteristics.
Aggregation
Aggregation is a process in which a relationship between two entities is treated as a single
entity. Here the relationship between Center and Course is acting as an entity in relation
with Visitor.
VIEW
A view is a "virtual table" in the database whose contents are defined by a query.
The tables of a database define the structure and organization of its data. However, SQL
also lets you look at the stored data in other ways by defining alternative views of the data.
A view is a SQL query that is permanently stored in the database and assigned a name. The
results of the stored query are "visible" through the view, and SQL lets you access these
query results as if they were, in fact, a "real" table in the database.
Views let you tailor the appearance of a database so that different users see it from
different perspectives.
Views let you restrict access to data, allowing different users to see only certain
rows or certain columns of a table.
Views simplify database access by presenting the structure of the stored data in the
way that is most natural for each user.
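A brief sketch of defining and querying a view (the employee table and its columns are hypothetical):

create view sales_staff as
    select emp_id, name
    from employee
    where dept = 'Sales';    -- only rows of the Sales department are exposed

select * from sales_staff;   -- the view is queried exactly like a real table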
Advantages of VIEW
Views provide a variety of benefits and can be useful in many different types of databases.
In a personal computer database, views are usually a convenience, defined to simplify
database requests.
Security: Each user can be given permission to access the database only through a
small set of views that contain the specific data the user is authorized to see, thus
restricting the user's access to stored data.
Query simplicity: A view can draw data from several different tables and present it
as a single table, turning multi-table queries into single-table queries against the
view.
Structural simplicity: Views can give a user a "personalized" view of the database
structure, presenting the database as a set of virtual tables that make sense for that
user.
Insulation from change: A view can present a consistent, unchanged image of the
structure of the database, even if the underlying source tables are split,
restructured, or renamed.
Data integrity: If data is accessed and entered through a view, the DBMS can
automatically check the data to ensure that it meets specified integrity constraints.
Disadvantages of VIEW
While views provide substantial advantages, there are also two major disadvantages to
using a view instead of a real table:
Performance: Views create the appearance of a table, but the DBMS must still translate
queries against the view into queries against the underlying source tables. If the view is
defined by a complex, multi-table query, then even a simple query against the view
becomes a complicated join, and it may take a long time to complete.
Update restrictions: When a user tries to update rows of a view, the DBMS must translate
the request into an update on rows of the underlying source tables. This is possible for
simple views, but more complex views cannot be updated; they are "read-only."
Unit II
STORAGE HIERARCHIES
The collection of data that makes up a computerized database must be stored physically on
some computer storage medium. The DBMS software can then retrieve, update, and
process this data as needed. Computer storage media form a storage hierarchy that
includes two main categories:
• Primary storage: This category includes storage media that can be operated on directly
by the computer's central processing unit (CPU), such as the computer's main memory and
smaller but faster cache memories. Primary storage usually provides fast access to data but
is of limited storage capacity.
• Secondary storage: This category includes magnetic disks, optical disks, and tapes.
These devices usually have a larger capacity, cost less, and provide slower access to data
than do primary storage devices. Data in secondary storage cannot be processed directly by
the CPU; it must first be copied into primary storage.
Between DRAM and magnetic disk storage, another form of memory, flash memory, is
becoming common, particularly because it is nonvolatile. Flash memories are high-density,
high performance memories using EEPROM (Electrically Erasable Programmable Read-
Only Memory) technology. The advantage of flash memory is the fast access speed; the
disadvantage is that an entire block must be erased and written over at a time.
CD-ROM disks store data optically and are read by a laser. CD-ROMs contain prerecorded
data that cannot be overwritten. WORM (Write-Once-Read-Many) disks are a form of
optical storage used for archiving data; they allow data to be written once and read any
number of times without the possibility of erasing. They hold about half a gigabyte of data
per disk and last much longer than magnetic disks. Optical juke box memories use an array
of CD-ROM platters, which are loaded onto drives on demand. Although optical juke boxes
have capacities in the hundreds of gigabytes, their retrieval times are in the hundreds of
milliseconds, quite a bit slower than magnetic disks. This type of storage has not become as
popular as it was expected to be because of the rapid decrease in cost and increase in
capacities of magnetic disks. The DVD (Digital Video Disk) is a recent standard for optical
disks allowing four to fifteen gigabytes of storage per disk. Finally, magnetic tapes are used
for archiving and backup storage of data. Tape jukeboxes—which contain a bank of tapes
that are catalogued and can be automatically loaded onto tape drives—are becoming
popular as tertiary storage to hold terabytes of data.
RDBMS Concepts
What is a Table ?
In the relational model, a table is a collection of data elements organised in terms of rows
and columns. A table can have duplicate tuples while a true relation cannot have duplicate
tuples. The table is the simplest form of data storage. Below is an example of an Employee
table.
ID Name Age Salary
1 Adam 34 13000
2 Alex 28 15000
3 Stuart 20 18000
4 Ross 42 19020
What is a Record ?
A single entry in a table is called a Record or Row. A Record in a table represents a set of
related data. For example, the above Employee table has 4 records. Following is an
example of a single record.
1 Adam 34 13000
What is a Field ?
A table consists of several records (rows); each record can be broken into several smaller
entities known as Fields. The above Employee table consists of four
fields: ID, Name, Age and Salary.
What is a Column ?
In a relational table, a column is a set of values of a particular type. The term Attribute is
also used to refer to a column. For example, in the Employee table, Name is a column that
represents the names of employees.
Name
Adam
Alex
Stuart
Ross
Database Keys
Keys are a very important part of a relational database. They are used to establish and
identify relationships between tables. They also ensure that each record within a table can
be uniquely identified by a combination of one or more fields within the table.
Super Key
Super Key is defined as a set of attributes within a table that uniquely identifies each
record within a table. Super Key is a superset of Candidate key.
Candidate Key
Candidate keys are defined as the set of fields from which primary key can be selected. It is
an attribute or set of attribute that can act as a primary key for a table to uniquely identify
each record in that table.
Primary Key
Primary key is a candidate key that is most appropriate to become the main key of the
table. It is a key that uniquely identifies each record in a table.
Composite Key
A key that consists of two or more attributes that uniquely identify an entity occurrence is
called a Composite key. No attribute that makes up the composite key is a simple key on
its own.
Secondary or Alternative Key
The candidate keys which are not selected as the primary key are known as secondary
keys or alternative keys.
Non-key Attribute
Non-key attributes are attributes other than candidate key attributes in a table.
Non-prime Attribute
Non-prime attributes are those attributes that do not form part of any candidate key.
Database Schema and Instance
The actual content of the database, the data, changes often over the years. A database state
at a specific time, defined through the currently existing content and relationships and
their attributes, is called a database instance.
The following illustration shows that a database scheme could be looked at like a template
or building plan for one or several database instances.
When designing a database it is differentiated between two levels of abstraction and their
respective data schemes, the conceptual and the logical data scheme.
A conceptual data scheme is a system independent data description. That means that it is
independent from the database or computer systems used.
A logical data scheme describes the data in a data definition language DDL of a specific
database management system.
The conceptual data scheme orients itself exclusively towards the database application and
therefore towards the real world. It does not consider any technical infrastructure such as
the DBMS or computer systems that are eventually employed. Entity relationship diagrams
and relations are tools for the development of a conceptual scheme.
When designing a database, the logical data scheme is derived from the conceptual data
scheme (see unit Relational Database Design). This derivation results in a logical data
scheme for one specific application and one specific DBMS. A DB development system
then converts the logical scheme directly into instructions for the DBMS.
Relational Algebra
Relational algebra is a procedural query language, which takes instances of relations as
input and yields instances of relations as output. It uses operators to perform queries. An
operator can be either unary or binary. They accept relations as their input and yield
relations as their output. Relational algebra is performed recursively on a relation and
intermediate results are also considered relations.
The fundamental operations of relational algebra are as follows −
Select
Project
Union
Set difference
Cartesian product
Rename
Select Operation (σ)
It selects tuples from a relation that satisfy the given predicate.
Notation − σp(r)
Where σ stands for the selection predicate and r stands for the relation. p is a propositional
logic formula which may use connectors like and, or, and not. These terms may use
relational operators like − =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − Selects tuples from Books where subject is 'database'.
σsubject = "database" and price = "450"(Books)
Output − Selects tuples from Books where subject is 'database' and 'price' is 450.
σsubject = "database" and price = "450" or year > "2010"(Books)
Output − Selects tuples from Books where subject is 'database' and 'price' is 450 or those
books published after 2010.
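For comparison, the second selection above corresponds closely to the following SQL (a sketch; it assumes a Books table with subject and price columns):

select *
from Books
where subject = 'database' and price = 450;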
Project Operation (∏)
It projects the column(s) that satisfy a given predicate.
Notation − ∏A1, A2, An (r)
Where A1, A2, An are attribute names of relation r. Duplicate rows are automatically
eliminated, as a relation is a set.
For example −
∏subject, author (Books)
Selects and projects the columns named subject and author from the relation Books.
Union Operation (∪)
It performs binary union between two given relations.
Notation − r ∪ s
r ∪ s = { t | t ∈ r or t ∈ s}
Where r and s are either database relations or relation result sets (temporary relations).
For example −
∏author (Books) ∪ ∏author (Articles)
Output − Projects the names of the authors who have either written a book or an article or
both.
Set Difference (−)
The result of the set difference operation is the tuples that are present in one relation but
not in the second relation.
Notation − r − s
For example −
∏author (Books) − ∏author (Articles)
Output − Provides the names of authors who have written books but not articles.
Cartesian Product (Χ)
It combines information from two different relations into one.
Notation − r Χ s
r Χ s = { q t | q ∈ r and t ∈ s}
For example −
σauthor = 'tutorialspoint'(Books Χ Articles)
Output − Yields a relation which shows all the books and articles written by
tutorialspoint.
Rename Operation (ρ)
The results of relational algebra are also relations, but without any name. The rename
operation allows us to name the output relation.
Notation − ρ x (E)
Where the result of expression E is saved with the name x.
Additional operations include −
Set intersection
Assignment
Natural join
Relational Calculus
In contrast to Relational Algebra, Relational Calculus is a non-procedural query language,
that is, it tells what to do but never explains how to do it.
Tuple Relational Calculus (TRC)
In TRC, the filtering variable ranges over the tuples of a relation.
Notation − {T | Condition}
For example −
{ T.name | Author(T) AND T.article = 'database' }
Output − Returns tuples with 'name' from Author who has written an article on 'database'.
TRC can be quantified. We can use Existential (∃) and Universal (∀) quantifiers.
For example −
{ R | ∃T ∈ Authors (T.article = 'database' AND R.name = T.name) }
Output − The above query will yield the same result as the previous one.
Domain Relational Calculus (DRC)
In DRC, the filtering variable uses the domain of attributes instead of entire tuple values.
Notation −
{ a1, a2, a3, ..., an | P (a1, a2, a3, ..., an) }
Where a1, a2, ..., an are attributes and P stands for formulae built by inner attributes.
For example −
{< article, page, subject > | ∈ TutorialsPoint ∧ subject = 'database'}
Output − Yields Article, Page, and Subject from the relation TutorialsPoint, where subject
is database.
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also
involves relational operators.
The expressive power of Tuple Relational Calculus and Domain Relational Calculus is
equivalent to that of Relational Algebra.
Indexing
Indexing is a data structure technique to efficiently retrieve records from the database files
based on some attributes on which the indexing has been done. Indexing in database
systems is similar to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types
−
Primary Index − Primary index is defined on an ordered data file. The data file is
ordered on a key field. The key field is generally the primary key of the relation.
Clustering Index − Clustering index is defined on an ordered data file. The data file
is ordered on a non-key field.
Secondary Index − Secondary index may be generated from a field which is a
candidate key and has a unique value in every record, or from a non-key field with
duplicate values.
Ordered indexing is of two types −
Dense Index
Sparse Index
Dense Index
In dense index, there is an index record for every search key value in the database. This
makes searching faster but requires more space to store index records itself. Index records
contain search key value and a pointer to the actual record on the disk.
Sparse Index
In sparse index, index records are not created for every search key. An index record here
contains a search key and an actual pointer to the data on the disk. To search a record, we
first proceed by index record and reach at the actual location of the data. If the data we are
looking for is not where we directly reach by following the index, then the system starts
sequential search until the desired data is found.
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored on
the disk along with the actual database files. As the size of the database grows, so does the
size of the indices. There is an immense need to keep the index records in the main
memory so as to speed up the search operations. If single-level index is used, then a large
size index cannot be kept in memory which leads to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices in order to
make the outermost level so small that it can be saved in a single disk block, which can
easily be accommodated anywhere in the main memory.
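Before turning to tree structures, it is worth noting how indexes are actually created. A hedged sketch of the index types named in the syllabus, against a hypothetical employee table; plain CREATE INDEX is near-universal, while the hash, function, and bitmap forms shown use PostgreSQL and Oracle syntax:

create index idx_emp_name on employee (name);              -- default index, typically a B-tree
create index idx_emp_id on employee using hash (emp_id);   -- hash index (PostgreSQL)
create index idx_emp_lname on employee (lower(name));      -- function (expression) index (PostgreSQL/Oracle)
create bitmap index idx_emp_dept on employee (dept);       -- bitmap index (Oracle)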
B+ Tree
A B+ tree is a balanced search tree (not a binary tree: each node can have many children)
that follows a multi-level index format. The leaf nodes of a B+ tree hold the actual data
pointers. A B+ tree ensures that all leaf nodes remain at the same height, thus balanced.
Additionally, the leaf nodes are linked in a linked list; therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at equal distance from the root node. A B+ tree is of order n, where n is
fixed for every B+ tree.
Internal nodes −
Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node. At
most, an internal node can contain n pointers.
Leaf nodes −
Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
At most, a leaf node can contain n record pointers and n key values.
Every leaf node contains one block pointer P to point to next leaf node and forms a
linked list.
B+ Tree Insertion
B+ trees are filled from bottom and each entry is done at the leaf node.
o Partition at i = ⌊(m+1)/2⌋.
B+ Tree Deletion
B+ tree entries are deleted at the leaf nodes; the target entry is searched for and
deleted there.
o If it is an internal node, delete it and replace it with the entry from the left
position.
o After deletion, the node is tested for underflow. If underflow occurs, distribute
the entries from the node's left sibling; if that is not possible, distribute from the
right sibling; if neither is possible, merge the node with its siblings.
UNIT III
Normalization of Database
Database normalization is a technique of organizing the data in a database.
Normalization is a systematic approach of decomposing tables to eliminate data
redundancy and undesirable characteristics like insertion, update and deletion anomalies.
It is a multi-step process that puts data into tabular form by removing duplicated data from
the relation tables.
Without normalization, it becomes difficult to handle and update the database without
facing data loss. Insertion, updation and deletion anomalies are very frequent if the
database is not normalized. To understand these anomalies, let us take an example of a
Student table.
Updation Anomaly : To update the address of a student whose record occurs twice or
more in the table, we will have to update the S_Address column in all those rows;
otherwise the data will become inconsistent.
Insertion Anomaly : Suppose for a new admission we have a student id (S_id), name
and address of a student, but if the student has not opted for any subjects yet, then we
have to insert NULL there, leading to an insertion anomaly.
Deletion Anomaly : If (S_id) 401 has only one subject and temporarily drops it, then
when we delete that row, the entire student record will be deleted along with it.
Normalization Rule
Normalization rules are divided into the following normal forms:
1. First Normal Form
2. Second Normal Form
3. Third Normal Form
4. BCNF
First Normal Form (1NF)
As per First Normal Form, no two rows of data may contain a repeating group of
information, i.e. each set of columns must have a unique value, such that multiple columns
cannot be used to fetch the same row. Each table should be organized into rows, and each
row should have a primary key that distinguishes it as unique.
The primary key is usually a single column, but sometimes more than one column can be
combined to create a single primary key. For example, consider a table which is not in First
Normal Form:
Student Table :
Student Age Subject
Adam 15 Biology, Maths
Alex 14 Maths
Stuart 17 Maths
In First Normal Form, no row may have a column in which more than one value is saved,
for example values separated with commas. Rather, we must separate such data into
multiple rows.
Student Age Subject
Adam 15 Biology
Adam 15 Maths
Alex 14 Maths
Stuart 17 Maths
Using the First Normal Form, data redundancy increases, as there will be many rows with
the same data in some columns, but each row as a whole will be unique.
Second Normal Form (2NF)
As per the Second Normal Form, there must not be any partial dependency of any column
on the primary key. It means that for a table that has a concatenated primary key, each column in
the table that is not part of the primary key must depend upon the entire concatenated key
for its existence. If any column depends only on one part of the concatenated key, then the
table fails Second normal form.
In the example of First Normal Form above there are two rows for Adam, to include the
multiple subjects that he has opted for. While this is searchable, and follows First Normal
Form, it is an inefficient use of space. Also in the above table in First Normal Form, while the candidate
key is {Student, Subject}, Age of Student only depends on Student column, which is
incorrect as per Second Normal Form. To achieve second normal form, it would be helpful
to split out the subjects into an independent table, and match them up using the student
names as foreign keys.
Student Age
Adam 15
Alex 14
Stuart 17
In the Student Table the candidate key will be the Student column, because the only other
column, Age, is dependent on it.
Student Subject
Adam Biology
Adam Maths
Alex Maths
Stuart Maths
In the Subject Table the candidate key will be the {Student, Subject} columns. Now both of
the above tables qualify for Second Normal Form and will never suffer from update
anomalies. Although there are a few complex cases in which a table in Second Normal
Form still suffers update anomalies; to handle those scenarios, Third Normal Form is there.
Third Normal Form (3NF)
Third Normal Form requires that every non-prime attribute of a table be dependent
directly on the primary key; that is, there should not be a case where a non-prime attribute
is determined by another non-prime attribute. Such transitive functional dependencies
should be removed from the table, and the table must also be in Second Normal Form. For
example, consider a table with the following fields.
Student_Detail Table :
Student_id Student_name DOB Street City State Zip
In this table Student_id is the primary key, but Street, City and State depend upon Zip. The
dependency between Zip and the other fields is called a transitive dependency. Hence, to
apply 3NF, we need to move Street, City and State to a new table, with Zip as its primary key.
New Student_Detail Table :
Student_id Student_name DOB Zip
Address Table :
Zip Street City State
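A hedged SQL sketch of this 3NF decomposition (the column data types are illustrative assumptions):

create table address (
    zip     varchar(10) primary key,
    street  varchar(50),
    city    varchar(30),
    state   varchar(30)
);

create table student_detail (
    student_id   int primary key,
    student_name varchar(50),
    dob          date,
    zip          varchar(10) references address (zip)  -- the transitive facts now live in address
);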
Boyce and Codd Normal Form (BCNF)
Boyce and Codd Normal Form is a higher version of the Third Normal Form. This form
deals with a certain type of anomaly that is not handled by 3NF. A 3NF table which does not
have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF,
the following conditions must be satisfied:
The table must be in Third Normal Form.
For each functional dependency X → Y, X must be a super key.
INTEGRITY CONSTRAINTS
To preserve the consistency and correctness of its stored data, a relational DBMS typically
imposes one or more data integrity constraints. These constraints restrict the data values
that can be inserted into the database or created by a database update. Several different
types of data integrity constraints are commonly found in relational databases, including:
Required data: Some columns in a database must contain a valid data value in every row;
they are not allowed to contain missing or NULL values. In the sample database, every
order must have an associated customer who placed the order. The DBMS can be asked to
prevent NULL values in this column.
Validity checking: Every column in a database has a domain, a set of data values that are
legal for that column. The DBMS can be asked to prevent other data values in these
columns.
Entity integrity: The primary key of a table must contain a unique value in each row,
which is different from the values in all other rows. Duplicate values are illegal, because
they wouldn't allow the database to distinguish one entity from another. The DBMS can be
asked to enforce this unique values constraint.
Referential integrity: A foreign key in a relational database links each row in the child
table containing the foreign key to the row of the parent table containing the matching
primary key value. The DBMS can be asked to enforce this foreign key/primary key
constraint.
Other data relationships: The real-world situation modeled by a database will often have
additional constraints that govern the legal data values that may appear in the database.
The DBMS can be asked to check modifications to the tables to make sure that their values
are constrained in this way.
Business rules: Updates to a database may be constrained by business rules governing the
real-world transactions that are represented by the updates.
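A consolidated sketch showing how several of these constraint types are declared in SQL (the customer and orders tables are made up for illustration):

create table customer (
    cust_id int primary key,        -- entity integrity: unique, non-NULL identifier
    name    varchar(50) not null    -- required data
);

create table orders (
    order_id int primary key,
    cust_id  int not null references customer (cust_id),  -- referential integrity
    amount   numeric(10,2) check (amount > 0)              -- validity checking on the column's domain
);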
Data Analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with
the goal of discovering useful information, suggesting conclusions, and supporting
decision-making. Data analysis has multiple facets and approaches, encompassing diverse
techniques under a variety of names, in different business, science, and social science
domains.
Data mining is a particular data analysis technique that focuses on modeling and
knowledge discovery for predictive rather than purely descriptive purposes. Business
intelligence covers data analysis that relies heavily on aggregation, focusing on business
information. In statistical applications, some people divide data analysis into descriptive
statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA
focuses on discovering new features in the data and CDA on confirming or falsifying
existing hypotheses. Predictive analytics focuses on application of statistical models for
predictive forecasting or classification, while text analytics applies statistical, linguistic, and
structural techniques to extract and classify information from textual sources, a species
of unstructured data. All are varieties of data analysis.
Data integration is a precursor to data analysis, and data analysis is closely linked to data
visualization and data dissemination. The term data analysis is sometimes used as a
synonym for data modeling.
Data analysis is a process for obtaining raw data and converting it into information useful
for decision-making by users. Data is collected and analyzed to answer questions, test
hypotheses or disprove theories.
Data requirements
The data necessary as inputs to the analysis are specified based upon the requirements of
those directing the analysis or customers who will use the finished product of the analysis.
The general type of entity upon which the data will be collected is referred to as an
experimental unit (e.g., a person or population of people). Specific variables regarding a
population (e.g., age and income) may be specified and obtained. Data may be numerical or
categorical (i.e., a text label for numbers).
Data collection
Data is gathered from a variety of sources, such as sensors, surveys, interviews, or existing
documents and online records.
Data processing
Data initially obtained must be processed or organized for analysis. For instance, this may
involve placing data into rows and columns in a table format for further analysis, such as
within a spreadsheet or statistical software.
Data cleaning
Once processed and organized, the data may be incomplete, contain duplicates, or contain
errors. The need for data cleaning will arise from problems in the way that data is entered
and stored. Data cleaning is the process of preventing and correcting these errors. Common
tasks include record matching, deduplication, and column segmentation. Such data
problems can also be identified through a variety of analytical techniques. For example,
with financial information, the totals for particular variables may be compared against
separately published numbers believed to be reliable. Unusual amounts above or below
pre-determined thresholds may also be reviewed. There are several types of data cleaning
that depend on the type of data. Quantitative data methods for outlier detection can be
used to get rid of likely incorrectly entered data. Textual data spellcheckers can be used to
lessen the amount of mistyped words, but it is harder to tell if the words themselves are
correct.
Exploratory data analysis
Once the data is cleaned, it can be analyzed. Analysts may apply a variety of techniques
referred to as exploratory data analysis to begin understanding the messages contained in
the data. The process of exploration may result in additional data cleaning or additional
requests for data, so these activities may be iterative in nature. Descriptive statistics such
as the average or median may be generated to help understand the data. Data
visualization may also be used to examine the data in graphical format, to obtain additional
insight regarding the messages within the data.
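For instance, simple descriptive statistics can be generated directly in SQL. A sketch against a hypothetical sales(amount) table; percentile_cont for the median is supported by PostgreSQL and Oracle, among others:

select count(*)    as n_rows,
       avg(amount) as mean_amount,
       min(amount) as min_amount,
       max(amount) as max_amount,
       percentile_cont(0.5) within group (order by amount) as median_amount
from sales;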
Mathematical formulas or models called algorithms may be applied to the data to identify
relationships among the variables, such as correlation or causation. In general terms,
models may be developed to evaluate a particular variable in the data based on other
variable(s) in the data, with some residual error depending on model accuracy (i.e., Data =
Model + Error).
Data product
A data product is a computer application that takes data inputs and generates outputs,
feeding them back into the environment. It may be based on a model or algorithm. An
example is an application that analyzes data about customer purchasing history and
recommends other purchases the customer might enjoy.
Communication
Once the data is analyzed, it may be reported in many formats to the users of the analysis to
support their requirements. The users may have feedback, which results in additional
analysis. As such, much of the analytical cycle is iterative.[3]
When determining how to communicate the results, the analyst may consider data
visualization techniques to help clearly and efficiently communicate the message to the
audience. Data visualization uses information displays such as tables and charts to help
communicate key messages contained in the data. Tables are helpful to a user who might
look up specific numbers, while charts (e.g., bar charts or line charts) may help explain the
quantitative messages contained in the data.