Dbms Notes
Dbms Notes
1,
2,
3,
1,
2,
3,
1,
2,
3,
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Now Suppose in a relational model a table T has columns titled as A, B and C. If T is
representing an entity set then A, B, C will be its attributes. Every attribute corresponds
to a limited number of values that can be assigned to it.
The set of values that can be assigned to a particular column is called domain for that. So
A, B and C will have their specified domains. Suppose those domain sets are denoted as
D
A
, D
B
, and D
C
.
Any row of the table will have a value from D
A
in first column, D
B
in second column, and
D
C
in third column. So a row of relational table is similar to a tuple of a mathematical
relation between the sets D
A
, D
B
, and D
C
.
For Example: Consider an BOOK table having attributes acc-no, title, and author. To
make it simple we restrict the domain for acc-no. as A={100, 101, 102}, for title as
B={DBMS, COMPILER, OS} and for author as C={Ramanuj, Aristotle and
Silbershatz}. That means the first column of BOOK can have any value from only A,
second from only B and third from only C. the Cartesian product of these domain sets can
be represented in a tabular form as :
A B C =
Now we can observe any valid table representing the entity set BOOK for a library given
above restriction on domains will have only a subset of the rows from the above table
which represents A B C. For Example a valid entity set for all books in the library
can be
BOOK
So we can say any table of relational model is actually similar to the mathematical
relation.
100 DBMS Ramanuj
100 DBMS Aristotle
100 DBMS Silbershatz
100 COMPILER Ramanuj
100 COMPILER Aristotle
100 COMPILER Silbershatz
100 OS Ramanuj
100 OS Aristotle
200 DBMS Ramanuj
200 DBMS Aristotle
200 DBMS Silbershatz
200 COMPILER Ramanuj
200 COMPILER Aristotle
200 COMPILER Silbershatz
200 OS Ramanuj
200 OS Aristotle
200 OS Silbershatz
300 DBMS Ramanuj
300 DBMS Aristotle
300 DBMS Silbershatz
300 COMPILER Ramanuj
300 COMPILER Aristotle
300 COMPILER Silbershatz
300 OS Ramanuj
300 OS Aristotle
300 OS Silbershatz
100 DBMS Silbershatz
200 DBMS Ramanuj
300 COMPILER Silbershatz
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Every row of such relational table is similar to a tuple of a mathematical relation. Let the
tuple variable t refers to the first tuple (first row) in above mentioned BOOK table then
we can various elements of the tuple as t[acc-no]= 100, t[title]= DBMS and t[author] =
Silbershatz.
Query Languages
A language in which a user requests information from the database is called a query
language.
o Procedural- user instructs the system to perform a sequence of operations on the
database to compute the desired result. E.g. Relational algebra
o Nonprocedural- user describes the information desired without giving a specific
procedure for obtaining the desired information. E.g. tuple calculus, domain
calculus.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture 10
Relational Algebra:
Basic operations:
o Selection () Selects a subset of rows from relation.
o Projection () Selects a subset of columns from relation.
o Cross-product () Allows us to combine two relations.
o Set-difference () Tuples in reln. 1, but not in reln. 2.
o Union (U) Tuples in reln. 1 and in reln. 2.
o Rename( ) Use new name for the Tables or fields.
Additional operations:
o Intersection (), join( ), division(): Not essential, but (very!) useful.
Since each operation returns a relation, operations can be composed! (Algebra is
closed.)
Projection
Deletes attributes that are not in projection list.
Schema of result contains exactly the fields in the projection list, with the same names
that they had in the (only) input relation. ( Unary Operation)
Projection operator has to eliminate duplicates! (as it returns a relation which is a set)
o Note: real systems typically dont do duplicate elimination unless the user
explicitly asks for it. (Duplicate values may be representing different real world
entity or relationship)
Consider the BOOK table:
Title
(BOOK) =
Selection
Selects rows that satisfy selection condition.
No duplicates in result! (Why?)
Schema of result identical to schema of (only) input relation.
Result relation can be the input for another relational algebra operation! (Operator
composition.)
Acc-no>300
(BOOK) =
Acc-No Title Author
100 DBMS Silbershatz
200 DBMS Ramanuj
300 COMPILER Silbershatz
400 COMPILER Ullman
500 OS Sudarshan
600 DBMS Silbershatz
Title
DBMS
COMPILER
OS
Acc-No Title Author
400 COMPILER Ullman
500 OS Sudarshan
600 DBMS Silbershatz
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Title=DBMS
(BOOK)=
Acc-no
(
Title=DBMS
(BOOK))=
Union, Intersection, Set-Difference
All of these operations take two input relations, which must be union-compatible:
o Same number of fields.
o `Corresponding fields have the same type.
What is the schema of result?
Consider:
Borrower Depositor
List of customers who are either borrower or depositor at bank=
Cust-name
(Borrower) U
Cust-name
(Depositor)=
Customers who are both borrowers and depositors =
Cust-name
(Borrower)
Cust-name
(Depositor)=
Customers who are borrowers but not depositors =
Cust-name
(Borrower)
Cust-name
(Depositor)=
Acc-No Title Author
100 DBMS Silbershatz
200 DBMS Ramanuj
600 DBMS Silbershatz
Acc-No
100
200
600
Cust-name Loan-no
Ram L-13
Shyam L-30
Suleman L-42
Cust-name Acc-no
Suleman A-100
Radheshyam A-300
Ram A-401
Cust-name
Ram
Shyam
Suleman
Radeshyam
Cust-name
Ram
Suleman
Cust-name
Shyam
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-11
Cartesian-Product or Cross-Product (S1 R1)
Each row of S1 is paired with each row of R1.
Result schema has one field per field of S1 and R1, with field names `inherited if
possible.
Consider the borrower and loan tables as follows:
Borrower: Loan:
Cross product of Borrower and Loan, Borrower Loan =
The rename operation can be used to rename the fields to avoid confusion when two field
names are same in two participating tables:
For example the statement,
Loan-borrower(Cust-name,Loan-No-1, Loan-No-2,Amount)
( Borrower Loan)
results into- A new Table named Loan-borrower is created where it has four fields which
are renamed as Cust-name, Loan-No-1, Loan-No-2 and Amount and the rows contains
the same data as the cross product of Borrower and Loan.
Loan-borrower:
Rename Operation:
It can be used in two ways :
o ( ) return the result of expression E in the table named x.
Cust-name Loan-no
Ram L-13
Shyam L-30
Suleman L-42
Loan-no Amount
L-13 1000
L-30 20000
L-42 40000
Borrower.Cust-
name
Borrower.Loan-
no
Loan.Loan-
no
Loan.Amount
Ram L-13 L-13 1000
Ram L-13 L-30 20000
Ram L-13 L-42 40000
Shyam L-30 L-13 1000
Shyam L-30 L-30 20000
Shyam L-30 L-42 40000
Suleman L-42 L-13 1000
Suleman L-42 L-30 20000
Suleman L-42 L-42 40000
Cust-name Loan-No-1 Loan-No-2 Amount
Ram L-13 L-13 1000
Ram L-13 L-30 20000
Ram L-13 L-42 40000
Shyam L-30 L-13 1000
Shyam L-30 L-30 20000
Shyam L-30 L-42 40000
Suleman L-42 L-13 1000
Suleman L-42 L-30 20000
Suleman L-42 L-42 40000
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
o
( , ,, )
( ) return the result of expression E in the table named x with
the attributes renamed to A
1,
A
2,
, A
n
.
o Its benefit can be understood by the solution of the query Find the largest
account balance in the bank
It can be solved by following steps:
1. Find out the relation of those balances which are not largest.
a. Consider Cartesion product of Account with itself i.e. Account
Account
b. Compare the balances of first Account table with balances of
second Account table in the product.
c. For that we should rename one of the account table by some
other name to avoid the confusion
d. It can be done by following operation
Account.balance
(
Account.balance < d.balance
(Account
d
(Account))
e. So the above relation contains the balances which are not
largest.
2. Subtract this relation from the relation containing all the balances i.e .
balance
(Account).
3. So the final statement for solving above query is
balance
(Account)-
Account.balance
(
Account.balance < d.balance
(Account
d
(Account))
Additional Operations
Natural Join ( )
Forms Cartesian product of its two arguments, performs selection forcing equality
on those attributes that appear in both relations
For example consider Borrower and Loan relations, the natural join between them
Borrower Loan will automatically perform the selection on the table returned
by Borrower Loan which force equality on the attribute that appear in both
Borrower and Loan i.e. Loan-no and also will have only one of the column named
Loan-No.
That means Borrower Loan =
Borrower.Loan-no = Loan.Loan-no
(Borrower Loan).
The table returned from this will be as follows:
Eliminate rows that does not satisfy the selection criteria
Borrower.Loan-no = Loan.Loan-
no
from Borrower Loan =
Customer-name. branch-name
(Depositor )
branch-name
(
Branch-city=Agra
(Branch)
The division operations can be specified by using only basic operations as
follows: Let r(R) and s(S) be given relations for schema R and S with S R
r s =
R-S
(r) -
R-S
((
R-S
(r) s) -
R-S,S
(r))
Cust-name Loan-no Amount
Ram L-13 1000
Shyam L-30 20000
Suleman L-42 40000
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-12
Tuple Relational Calculus
Relational algebra is an example of procedural language while tuple relational
calculus is a nonprocedural query language.
A query is specified as:
{t | P(t)}, i.e it is the set of all tuples t such that predicate P is true for t.
The formula P(t) is formed using atoms which uses the relations, tuples of
relations and fields of tuples and following symbols
o ( belongs to),<,>,,,,=, (comparison operators)
These atoms can then be used to form formulas with following symbols
o ( universal qualifier generally called "for all")
o ( existential qualifier generally called "there exists")
o ( and), (or), ( not)
For example : here are some queries and a way to express them using tuple
calculus:
o Find the branch-name, loan-number and amount for loans over Rs 1200.
{t| t Loan t[amount] > 1200}.
o Find the loan number for each loan of an amount greater that Rs1200.
{t| s Loan(t[loan-number] = s[loan-number] s[amount] >1200}
o Find the names of all the customers who have a loan from the Sadar
branch.
{t | s Borrower ( t customer-name =s customer-name
u Loan ( u[loan-number] = s[loan-number
u[branch-name] = "Sadar"))}
o Find all customers who have a loan , an account, or both at the bank
{t| s Borrower ( t[customer-name] = s[customer-name])
u Depositor (t[customer-name] = u[customer-name])}
o Find only those customers who have both an account and a loan.
{t| s Borrower ( t[customer-name] = s[customer-name])
u Depositor (t[customer-name] = u[customer-name])}
o Find all customers who have an account but do not have loan.
{t| u Depositor (t[customer-name] = u[customer-name])
s Borrower ( t[customer-name] = s[customer-name])}
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
o Find all customers who have an account at all branches located in Agra
{t | w Branch( w[branch-city] = "Agra"
s Depositor ( t customer-name = s customer-name
u Account ( u[account-number] = s[account-number]
u[branch-name] = w[branch-name])))}
Domain Relational Calculus
Domain relational calculus is another non procedural language for expressing database
queries.
A query is specified as:
{<x
1
,x
2
,,x
n
> | P(x
1
,x
2
,,x
n
)} where x
1
,x
2
,,x
n
represents domain variables. P
represent a predicate formula as in tuple calculus
Since the domain variables are referred in place of tuples the formula doesnt refer the
fields of tuples rather they refer the domain variables.
For example the queries in domain calculus are mentioned as follows:
o Find the branch-name, loan-number and amount for loans over Rs 1200.
{<b, l, a>| <b, l, a> Loan a >1200}.
o Find the loan number for each loan of an amount greater that Rs1200.
{< l >| b,a( <b, l, a> Loan a >1200}
o Find the names of all the customers who have a loan from the Sadar branch and
find the loan amount
{<c, a >| l(<c, l > Borrower
b( <b, l, a > Loan b="Sadar"))}
o Find names of all customers who have a loan , an account, or both at the Sadar
Branch
{<c>| l(<c, l > Borrower b, a(<b, l, a> Loan b ="Sadar"))
a(<c, a> Depositor b, n(<b, a, n> Account b ="Sadar"))}
o Find only those customers who have both an account and a loan.
{<c>| l(<c, l> Borrower ) a(<c, a> Depositor )}
o Find all customers who have an account but do not have loan.
{t| a(<c, a> Depositor ) l(<c, l> Borrower )}
o Find all customers who have an account at all branches located in Agra
{<c>| x, y, z(<x, y, z> Branch) y = "Agra"
a, b(<x, a, b> Account <c, a> Depositor)}
Outer Join
Outer join operation is an extension of join operation to deal with missing information
Suppose that we have following relational schemas:
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Employee( employee-name, street, city)
Fulltime-works(employee-name, branch-name, salary)
A snapshot of these relations is as follows:
Employee:
Fulltime-works
Suppose we want complete information of the full time employees.
The natural join (Employee Fulltime-works)will result into the loss of information for
Suleman and Rehman because they dont have record in both the tables ( left and right
relation). The outer join will solve the problem.
Three forms of outer join:
o Left outer join():the tuples which doesnt match while doing natural join
from left relation are also added in the result putting null values in missing field
of right relation.
o Right outer join():the tuples which doesnt match while natural join from
right relation are also added in the result putting null values in missing field of left
relation.
o Full outer join(): include both of the left and right outer joins i.e. adds the
tuples which did not match either in left relation or right relation and put null in
place of missing values.
The result for three forms of outer join are as follows:
Left join: Employee Fulltime-works=
Right join: Employee Fulltime-works=
Full join: Employee Fulltime-works=
employee-name street city
Ram M G Road Agra
Shyam New Mandi Road Mathura
Suleman Bhagat Singh Road Aligarh
employee-name branch-name salary
Ram Sadar 30000
Shyam Sanjay Place 20000
Rehman Dayalbagh 40000
employee-name street City branch-name salary
Ram M G Road Agra Sadar 30000
Shyam New Mandi Road Mathura Sanjay Place 20000
Suleman Bhagat Singh Road Aligarh Null Null
employee-name street city branch-name salary
Ram M G Road Agra Sadar 30000
Shyam New Mandi Road Mathura Sanjay Place 20000
Rehman null null Dayalbagh 40000
employee-name street city branch-name salary
Ram M G Road Agra Sadar 30000
Shyam New Mandi Road Mathura Sanjay Place 20000
Suleman Bhagat Singh Road Aligarh null null
Rehman null null Dayalbagh 40000
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Aggregate Functions
Aggregate functions are functions that take a collection of values and return a single
value as a result.
Examples are sum, avg, count, max, min.
Find the total balance of all the accounts
sum
balance
(Account).
Find the no of borrowers
count
customer-name
(Borrower)
Find the distinct customers who are either borrowers or depositors.
count-distinct
customer-name
(Borrower Depositor)
The aggregate functions can be applied on sub groups of the rows in the table rather than
on all of the rows of table using the denoted by symbol( ).
For example we want to find the total salary of all the full time employees branch wise. It
can be specified as follows:
branch-name
(Fulltime-works)
Fulltime-works
The result of aggregate function with grouping specified above will be:
employee-name branch-name salary
Ram Sadar 30000
Shyam Sanjay Place 20000
Rehman Dayalbagh 40000
Suleman Sadar 25000
branch-name sum of salary
Sadar 55000
Sanjay Place 20000
Dayalbagh 40000
Group1: branch name = sadar
Group2: branch name = sanjay place
Group3: branch name = Dayalbagh
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-13
Structured Query Language (SQL)
Introduction
Commercial database systems use more user friendly language to specify the queries.
SQL is the most influential commercially marketed product language.
Other commercially used languages are QBE, Quel, and Datalog.
Basic Structure
The basic structure of an SQL consists of three clauses: select, from and where.
select: it corresponds to the projection operation of relational algebra. Used to list the
attributes desired in the result.
from: corresponds to the Cartesian product operation of relational algebra. Used to list
the relations to be scanned in the evaluation of the expression
where: corresponds to the selection predicate of the relational algebra. It consists of a
predicate involving attributes of the relations that appear in the from clause.
A typical SQL query has the form:
select A
1
, A
2
,, A
n
fromr
1
, r
2
,, r
m
where P
o A
i
represents an attribute
o r
j
represents a relation
o P is a predicate
o It is equivalent to following relational algebra expression:
o
A
1
,A
2
,,A
n
(
P
(r
1
r
2
r
m
))
[Note: The words marked in dark in this text work as keywords in SQL language. For example
select, from and where in the above paragraph are shown in bold font to indicate that
they are keywords]
Select Clause
Let us see some simple queries and use of select clause to express them in SQL.
Find the names of all branches in the Loan relation
select branch-name
from Loan
By default the select clause includes duplicate values. If we want to force the elimination
of duplicates the distinct keyword is used as follows:
select distinct branch-name
from Loan
The all key word can be used to specify explicitly that duplicates are not removed. Even
if we not use all it means the same so we dont require all to use in select clause.
select all branch-name
from Loan
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
The asterisk * can be used to denote all attributes. The following SQL statement will
select and all the attributes of Loan.
select *
from Loan
The arithmetic expressions involving operators, +, -, *, and / are also allowed in select
clause. The following statement will return the amount multiplied by 100 for the rows in
Loan table.
select branch-name, loan-number, amount * 100
from Loan
Where Clause
Find all loan numbers for loans made at Sadar branch with loan amounts greater than
Rs 1200.
select loan-number
from Loan
where branch-name= Sadar and amount > 1200
where clause uses uses logival connectives and, or, and not
operands of the logical connectives can be expressions involving the comparison
operators <, <=, >, >=, =, and < >.
between can be used to simplify the comparisons
select loan-number
from Loan
where amount between 90000 and 100000
From Clause
The from clause by itself defines a Cartesian product of the relations in the clause.
When an attribute is present in more than one relation they can be referred as relation-
name.attribute-name to avoid the ambiguity.
For all customers who have loan from the bank, find their names and loan numbers
select distinct customer-name, Borrower.loan-number
from Borrower, Loan
where Borrower.loan-number = Loan.loan-number
The Rename Operation
Used for renaming both relations both relations and attributes in SQL
Use as clause: old-name as new-name
Find the names and loan numbers of the customers who have a loan at the Sadar
branch.
select distinct customer-name, borrower.loan-number as loan-id
from Borrower, Loan
where Borrower.loan-number = Loan.loan-number and
branch-name = Sadar
we can now refer the loan-number instead by the name loan-id.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
For all customers who have a loan from the bank, find their names and loan-numbers
select distinct customer-name, T.loan-number
from Borrower as T, Loan as S
where T.loan-number = S.loan-number
Find the names of all branches that have assets greater than at least one branch located in
Mathura.
select distinct T.branch-name
from branch as T, branch as S
where T.assets > S.assets and S.branch-city = Mathura
String Operation
Two special characters are used for pattern matching in strings:
o Percent ( % ) : The % character matches any substring
o Underscore( _ ): The _ character matches any character
%Mandi: will match with the strings ending with Mandi viz. Raja Ki mandi,
Peepal Mandi
_ _ _ matches any string of three characters.
Find the names of all customers whose street address includes the substring Main
select customer-name
from Customer
where customer-street like %Main%
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-14
Set Operations
union, intersect and except operations are set operations available in SQL.
Relations participating in any of the set operation must be compatible; i.e. they must have
the same set of attributes.
Union Operation:
o Find all customers having a loan, an account, or both at the bank
(select customer-name
fromDepositor )
union
(select customer-name
from Borrower )
It will automatically eliminate duplicates.
o If we want to retain duplicates union all can be used
(select customer-name
fromDepositor )
union all
(select customer-name
from Borrower )
Intersect Operation
o Find all customers who have both an account and a loan at the bank
(select customer-name
fromDepositor )
intersect
(select customer-name
from Borrower )
o If we want to retail all the duplicates
(select customer-name
fromDepositor )
intersect all
(select customer-name
from Borrower )
Except Opeartion
o Find all customers who have an account but no loan at the bank
(select customer-name
fromDepositor )
except
(select customer-name
from Borrower )
o If we want to retain the duplicates:
(select customer-name
fromDepositor )
except all
(select customer-name
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
from Borrower )
Aggregate Functions
Aggregate functions are those functions which take a collection of values as input and
return a single value.
SQL offers 5 built in aggregate functions-
o Average: avg
o Minimum:min
o Maximum:max
o Total: sum
o Count:count
The input to sum and avg must be a collection of numbers but others may have
collections of non-numeric data types as input as well
Find the average account balance at the Sadar branch
select avg(balance)
from Account
where branch-name= Sadar
The result will be a table which contains single cell (one row and one column) having
numerical value corresponding to average balance of all account at sadar branch.
group by clause is used to form groups, tuples with the same value on all attributes in
the group by clause are placed in one group.
Find the average account balance at each branch
select branch-name, avg(balance)
from Account
group by branch-name
By default the aggregate functions include the duplicates.
distinct keyword is used to eliminate duplicates in an aggregate functions:
Find the number of depositors for each branch
select branch-name, count(distinct customer-name)
from Depositor, Account
where Depositor.account-number = Account.account-number
group by branch-name
having clause is used to state condition that applies to groups rather than tuples.
Find the average account balance at each branch where average account balance is more
than Rs. 1200
select branch-name, avg(balance)
from Account
group by branch-name
having avg(balance) > 1200
Count the number of tuples in Customer table
select count(*)
from Customer
SQL doesnt allow distinct with count(*)
When where and having are both present in a statement where is applied before having.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Nested Subqueries
A subquery is a select-from-where expression that is nested within another query.
Set Membership
o The in and not in connectives are used for this type of subquery.
o Find all customers who have both a loan and an account at the bank, this query
can be written using nested subquery form as follows
select distinct customer-name
from Borrower
where customer-name in(select customer-name
fromDepositor )
o Select the names of customers who have a loan at the bank, and whose names are
neither Smith nor Jones
select distinct customer-name
from Borrower
where customer-name not in(Smith, Jones)
Set Comparison
o Find the names of all branches that have assets greater than those of at least one
branch located in Mathura
select branch-name
from Branch
where asstets > some (select assets
from Branch
where branch-city = Mathura )
o Apart from > some others comparison could be < some , <= some , >= some ,
= some , < > some.
o Find the names of all branches that have assets greater than that of each branch
located in Mathura
select branch-name
from Branch
where asstets > all (select assets
from Branch
where branch-city = Mathura )
o Apart from > all others comparison could be < all , <= all , >= all , = all ,
< >all.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-15
Views
? In SQL create view command is used to define a view as follows:
create view v as <query expression>
where <query expression> is any legal query expression and v is the view name.
? The view consisting of branch names and the names of customers who have either an
account or a loan at the branch. This can be defined as follows:
create view All-customer as
(select branch-name, customer-name
from Depositor, Account
where Depositor.account-number=account.account-number)
union
(select branch-name, customer-name
from Borrower, Loan
where Borrower.loan-number = Loan.loan-number)
? The attributes names may be specified explicitly within a set of round bracket after the
name of view.
? The view names may be used as relations in subsequent queries. Using the view All-
customer find all customers of Sadar branch
select customer-name
from All-customer
where branch-name= Sadar
? A create-view clause creates a view definition in the database which stays until a
command - drop view view-name - is executed.
Modification of Database
? Deletion
o In SQL we can delete only whole tuple and not the values on any particular
attributes. The command is as follows:
delete from r
where P.
where P is a predicate and r is a relation.
o delete command operates on only one relation at a time. Examples are as follows:
o Delete all tuples from the Loan relation
delete from Loan
o Delete all of the Smiths account records
delete from Depositor
where customer-name = Smith
o Delete all loans with loan amounts between Rs 1300 and Rs 1500.
delete from Loan
where amount between 1300 and 1500
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
o Delete the records of all accounts with balances below the average at the bank
delete from Account
where balance < ( select avg(balance)
from Account)
? Insertion
o In SQL we either specify a tuple to be inserted or write a query whose result is a
set of tuples to be inserted. Examples are as follows:
o Insert an account of account number A-9732 at the Sadar branch having balance
of Rs 1200
insert into Account
values(Sadar, A-9732, 1200)
the values are specified in the order in which the corresponding attributes are
listed in the relation schema.
o SQL allows the attributes to be specified as part of the insert statement
insert into Account(account-number, branch-name, balance)
values(A-9732, Sadar, 1200)
insert into Account(branch-name, account-number, balance)
values(Sadar, A-9732, 1200)
o Provide for all loan customers of the Sadar branch a new Rs 200 saving account
for each loan account they have. Where loan-number serve as the account number
for these accounts.
insert into Account
select branch-name, loan-number, 200
from Loan
where branch-name = Sadar
? Updates
o Used to change a value in a tuple without changing all values in the tuple.
o Suppose that annual interest payments are being made, and all balances are to be
increased by 5 percent.
update Account
set balance = balance * 1.05
o Suppose that accounts with balances over Rs10000 receive 6 percent interest,
whereas all others receive 5 percent.
update Account
set balance = balance * 1.06
where balance > 10000
update Account
set balance = balance * 1.05
where balance <= 10000
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Data Definition Language
? Data Types in SQL
o char(n): fixed length character string, length n.
o varchar(n): variable length character string, maximum length n.
o int: an integer.
o smallint: a small integer.
o numeric(p,d): fixed point number, p digits( plus a sign), and d of the p digits are
to right of the decimal point.
o real, double precision: floating point and double precision numbers.
o float(n): a floating point number, precision at least n digits.
o date: calendar date; four digits for year, two for month and two for day of month.
o time: time of day n hours minutes and seconds.
? Domains can be defined as
create domain person-name char(20).
the domain name person-name can be used to define the type of an attribute just like
built-in domain.
? Schema Definition in SQL
o create table command is used to define relations.
create table r (A
1
D
1,
A
2
D
2
,, A
n
D
n
,
<integrity constraint
1
>,
,
<integrity constraint
k
>)
where r is relation name, each A
i
is the name of attribute, D
i
is the domain type of
values of A
i
. Several types of integrity constraints are available to define in SQL.
o Integrity Constraints which are allowed in SQL are
primary key(A
j1,
A
j2,,
A
jm
)
and
check(P) where P is the predicate.
o drop table command is used to remove relations from database.
o alter table command is used to add attributes to an existing relation
alter table r add A D
it will add attribute A of domain type D in relation r.
alter table r drop A
it will remove the attribute A of relation r.
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-16
Integrity Constraints
Integrity Constraints guard against accidental damage to the database.
Integrity constraints are predicates pertaining to the database.
Domain Constraints:
o Predicates defined on the domains are Domain constraints.
o Simplest Domain constraints are defined by defining standard data types of the
attributes like Integer, Double, Float, etc.
o We can define domains by create domain clause also we can define the
constraints on such domains as follows:
create domain hourly-wage numeric(5,2)
constraint wage-value-test check(value >= 4.00)
So we can use hourly-wage as data type for any attribute where DBMS will
automatically allow only values greater than or equal to 4.00.
o Other examples for defining Domain constraints are as follows:
create domain account-number char(10)
constraint account-number-null-test check(value not null)
create domain account-type char(10)
constraint account-tyope-test
check(value in ( Checking, Saving))
By using the later domain of two above the DBMS will allow only values for any
attribute having type as account-type i.e. Checking and Saving.
Referential Integrity:
o Foreign Key: If two table R and S are related to each other, K1 and K2 are
primary keys of the two relations also K1 is one of the attribute in S. Suppose we
want that every row in S must have a corresponding row in R, then we define the
K1 in S as foreign key. Example in our original database of library we had a table
for relation BORROWEDBY, containing two fields Card No. and Acc. No. .
Every row of BORROWEDBY relation must have corresponding row in USER
Table having same Card No. and a row in BOOK table having same Acc. No..
Then we will define the Card No. and Acc. No. in BORROWEDBY relation as
foreign keys.
o In other way we can say that every row of BORROWEDBY relation must refer to
some row in BOOK and also in USER tables.
o Such referential requirement in one table to another table is called Referential
Integrity.
o Referential Integrity constraints are defined by defining some of the attributes in a
table, which forms primary key of some other table, as foreign key.
Functional Dependencies
o Suppose in a relation having schema R, R and R. A functional
dependency holds on R if, in any table having schema R, for every two rows
r1 and r2 the values of attributes are same in r1 and r2 then values of attributes
are also same.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
o Consider for example the table as follows
Seq
A B C D
1
a
1
b
1
c
1
d
1
2
a
1
b
2
c
1
d
2
3
a
2
b
2
c
2
d
2
4
a
2
b
3
c
2
d
3
5
a
3
b
3
c
2
d
4
Check if AC Holds, find pair of rows where value of A is same
row 1 and 2, value of A is same and C is also same
row 3 and 4, Value of A is same and C is also same
No other two rows having same value on A, So AC holds.
Check if CA Holds, find pair of rows where value of C is same
row 1 and 2, value of C is same and A is also same
row 3 and 4, value of C is same and A is also same
row 4 and 5, value of C is same but A is not same, So CA doesnt hold.
We can prove ABD also holds, find pair of rows where value of A and B
are both same
No row where A and B both are same, So ABD holds
o If K is a super key of a relation R then it means functional dependency KR
holds and vice versa.
o Armstrongs Rules: Suppose there is a given relation R and a set of functional
dependencies F that holds on R. Then these rules can be used to derive all of the
other functional dependencies which are logically implied from the given relation
R and functional dependencies F.
Reflexivity rule: if is a set of attributes and , then holds.
Augmentation rule: if holds and is a set of attributes, then
holds.
Transitivity rule: if holds and holds, then holds.
o Additional rules are also formed to simplify deriving new functional dependencies
since applying Armstrongs rules is a lengthy and tiresome task. Although we can
generate all the functional dependencies using only Armstrongs rule.
Union rule: if holds and holds, then holds.
Decomposition rule. if holds, then holds and holds.
Pseudotransitivity rule. If holds and holds, then
holds.
o Closure of Functional Dependencies: Suppose the given set of functional
dependencies is F for a given relation schema R. When we apply various rules
stated above and generate all of the possible newer functional dependencies. Then
the set containing all these newer functional dependencies and the given set of
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
functional dependencies F is called the closure of functional dependencies and is
denoted as F
+
.
o Consider schema R=( A, B, C, G, H, I ) and the set of functional dependencies F
containing following functional dependencies.
AB
AC
CGH
CGI
BH
Find other functional dependencies that can be derived using various rules
given above
Examples are as follows-
AH can be derived using functional dependencies 1 and 5 and
transitivity rule.
CGHI can be derived using functional dependencies 3 and 4 and union
rule.
AGI can be derived using 2 and 4 and Pseudotransitivity.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-17
Normal Forms
Some of the undesirable properties that a bad database design may have
o Repetition of information
o Inability to represent certain information
o Incapability to maintain integrity of data
The normal forms of relational database theory provide criteria for determining a table's
degree of vulnerability to logical inconsistencies and anomalies.
The higher the normal form applicable to a table, the less vulnerable it is to
inconsistencies and anomalies.
Each table has a "highest normal form" (HNF): by definition, a table always meets the
requirements of its HNF and of all normal forms lower than its HNF; also by definition, a
table fails to meet the requirements of any normal form higher than its HNF.
Generally known hierarchy of normal forms is as follows First Normal Form(1NF),
Second Normal Form(2NF), Third Normal Form(3NF), Fourth Normal Form(4NF), Fifth
Normal Form(5NF).
We will discuss only up to 3NF of above hierarchy and another normal form Boyce-Codd
Normal Form(BCNF) in this course.
First Normal Form
According to Date's definition of 1NF, a table is in 1NF if and only if it is "isomorphic to
some relation", which means, specifically, that it satisfies the following five conditions:
1. There's no top-to-bottom ordering to the rows.
2. There's no left-to-right ordering to the columns.
3. There are no duplicate rows.
4. Every row-and-column intersection contains exactly one value from the
applicable domain (and nothing else).
5. All columns are regular [i.e. rows have no hidden components such as row IDs,
object IDs, or hidden timestamps].
Examples of tables (or views) that would not meet this definition of 1NF are:
o A table that lacks a unique key. Such a table would be able to accommodate
duplicate rows, in violation of condition 3.
o A view whose definition mandates that results be returned in a particular order, so
that the row-ordering is an intrinsic and meaningful aspect of the view. This
violates condition 1. The tuples in true relations are not ordered with respect to
each other.
o A table which is having at least one nullable attribute. A nullable attribute would
be in violation of condition 4, which requires every field to contain exactly one
value from its column's domain. It should be noted, however, that this aspect of
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
condition 4 is controversial. It marks an important departure from Codd's later
vision of the relational model, which made explicit provision for nulls.
Codd states that the "values in the domains on which each relation is defined are required
to be atomic with respect to the DBMS." Codd defines an atomic value as one that
"cannot be decomposed into smaller pieces by the DBMS (excluding certain special
functions)." Meaning a field should not be divided into parts with more than one kind of
data in it such that what one part means to the DBMS depends on another part of the
same field.
Suppose a novice designer wish to record the names and telephone numbers of
customers. He defines a customer table which looks like this:
Customer
Customer ID First Name Surname
Telephone
Number
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659
789 Maria Fernandez 555-808-9633
The designer then becomes aware of a requirement to record multiple telephone
numbers for some customers. He reasons that the simplest way of doing this is to
allow the "Telephone Number" field in any given record to contain more than one
value:
Assuming, however, that the Telephone Number column is defined on some Telephone
Number-like domain (e.g. the domain of strings 12 characters in length), the
Customer
ID
First Name Surname
Telephone
Number
123 Robert Ingram 555-861-2025
456 Jane Wright
555-403-1659
555-776-4100
789 Maria Fernandez 555-808-9633
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
representation above is not in 1NF. 1NF (and, for that matter, the RDBMS) prevents a
single field from containing more than one value from its column's domain.
Repeating groups across columns: The designer might attempt to get around this
restriction by defining multiple Telephone Number columns:
Customer
ID
First
Name
Surname Tel. No. 1 Tel. No. 2 Tel. No. 3
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659 555-776-4100 555-403-1659
789 Maria Fernandez 555-808-9633
This representation, however, makes use of nullable columns, and therefore does not
conform to Date's definition of 1NF. Even if the view is taken that nullable columns are
allowed, the design is not in keeping with the spirit of 1NF.Tel. No. 1, Tel. No. 2., and
Tel. No. 3. share exactly the same domain and exactly the same meaning; the splitting of
Telephone Number into three headings is artificial and causes logical problems. These
problems include:
o Difficulty in querying the table. Answering such questions as "Which
customers have telephone number X?" and "Which pairs of customers share a
telephone number?" is awkward.
o Inability to enforce uniqueness of Customer-to-Telephone Number links
through the RDBMS. Customer 789 might mistakenly be given a Tel. No. 2
value that is exactly the same as her Tel. No. 1 value.
o Restriction of the number of telephone numbers per customer to three. If a
customer with four telephone numbers comes along, we are constrained to
record only three and leave the fourth unrecorded. This means that the
database design is imposing constraints on the business process, rather than
(as should ideally be the case) vice-versa.
Repeating groups within columns: The designer might, alternatively, retain the
single Telephone Number column but alter its domain, making it a string of sufficient
length to accommodate multiple telephone numbers:
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
This design is consistent with 1NF according to Dates definition but not according to
Codds definition. It presents several design issues. The Telephone Number heading
becomes semantically woolly, as it can now represent either a telephone number, a list of
telephone numbers, or indeed anything at all. A query such as "Which pairs of customers
share a telephone number?" is more difficult to formulate, given the necessity to cater for
lists of telephone numbers as well as individual telephone numbers. Meaningful
constraints on telephone numbers are also very difficult to define in the RDBMS with this
design.
A design that complies with 1NF:A design that is unambiguously in 1NF makes
use of two tables: a Customer Name table and a Customer Telephone Number table.
Customer Name Customer Telephone
Customer
ID
First
Name
Surname
123 Robert Ingram
456 Jane Wright
789 Maria Fernandez
Repeating groups of telephone numbers do not occur in this design. Instead, each
Customer-to-Telephone Number link appears on its own record.
It is worth noting that this design meets the additional requirements for second
and third normal form (3NF).
Customer
ID
First
Name
Surname
Telephone
Numbers
123 Robert Ingram 555-861-2025
456 Jane Wright
555-403-1659,
555-776-4100
789 Maria Fernandez 555-808-9633
Customer
ID
Telephone
Number
123 555-861-2025
456 555-403-1659
456 555-776-4100
789 555-808-9633
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-18
Second Normal Form
2NF was originally defined by E.F. Codd in 1971.
A 1NF table is in 2NF if and only if, given any candidate key K and any attribute A
that is not a constituent of a candidate key, A depends upon the whole of K rather
than just a part of it
A 1NF table is in 2NF if and only if all its non-prime attributes are functionally
dependent on the whole of every candidate key. (A non-prime attribute is one that
does not belong to any candidate key.)
Note that when a 1NF table has no composite candidate keys (candidate keys
consisting of more than one attribute), the table is automatically in 2NF.
Consider a table describing employees' skills:
Employees' Skills
Employee Skill
Current
Work
Location
Jones Typing
114
Main
Street
Jones Shorthand
114
Main
Street
Jones Whittling
114
Main
Street
Bravo
Light
Cleaning
73
Industrial
Way
Ellis Alchemy
73
Industrial
Way
Ellis Flying
73
Industrial
Way
Harrison
Light
Cleaning
73
Industrial
Way
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Neither {Employee} nor {Skill} is a candidate key for the table. This is because a
given Employee might need to appear more than once (he might have multiple
Skills), and a given Skill might need to appear more than once (it might be
possessed by multiple Employees). Only the composite key {Employee, Skill}
qualifies as a candidate key for the table.
The remaining attribute, Current Work Location, is dependent on only part of the
candidate key, namely Employee. Therefore the table is not in 2NF. Note the
redundancy in the way Current Work Locations are represented: we are told three
times that Jones works at 114 Main Street, and twice that Ellis works at 73
Industrial Way. This redundancy makes the table vulnerable to update anomalies:
it is, for example, possible to update Jones' work location on his "Typing" and
"Shorthand" records and not update his "Whittling" record. The resulting data
would imply contradictory answers to the question "What is Jones' current work
location?"
A 2NF alternative to this design would represent the same information in two tables:
an "Employees" table with candidate key {Employee}, and an "Employees' Skills"
table with candidate key {Employee, Skill}:
Employees Employees Skills
Employee Current Work Location
Jones 114 Main Street
Bravo 73 Industrial Way
Ellis 73 Industrial Way
Harrison 73 Industrial Way
Neither of these tables can suffer from update anomalies.
Not all 2NF tables are free from update anomalies, however. An example of a 2NF
table which suffers from update anomalies is:
Tournament Winners
Tournament Year Winner
Winner Date of
Birth
Des Moines Masters 1998 Chip Masterson 14 March 1977
Employee Skill
Jones Typing
Jones Shorthand
Jones Whittling
Bravo Light Cleaning
Ellis Alchemy
Ellis Flying
Harrison Light Cleaning
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977
Even though Winner and Winner Date of Birth are determined by the whole key
{Tournament / Year} and not part of it, particular Winner / Winner Date of Birth
combinations are shown redundantly on multiple records. This leads to an update
anomaly: if updates are not carried out consistently, a particular winner could be
shown as having two different dates of birth.
The underlying problem is the transitive dependency to which the Winner Date of
Birth attribute is subject. Winner Date of Birth actually depends on Winner,
which in turn depends on the key Tournament / Year.
This problem is addressed by third normal form (3NF)
Note: In addition to the primary key, the table may contain other candidate keys; it is
necessary to establish that no non-prime attributes have part-key dependencies on any
of these candidate keys.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-19
Third Normal Form:
3NF as defined by E.F. Codd in 1971 is - a table is in 3NF if and only if both of the
following conditions hold:
o The relation R (table) is in second normal form (2NF)
o Every non-prime attribute of R is non-transitively dependent (i.e. directly
dependent) on every candidate key of R.
o Note:
A non-prime attribute of R is an attribute that does not belong to any
candidate key of R.
A transitive dependency is a functional dependency in which X Z (X
determines Z) indirectly, because X Y and Y Z (where it is not the
case that Y X).
A 3NF definition, equivalent to Codd's given by Carlo Zaniolo in 1982, states that a table
is in 3NF if and only if, for each of its functional dependencies X A, at least one of the
following conditions holds:
o X contains A (that is, X A is trivial functional dependency), or
o X is a superkey, or
o Each attribute in X-A is a prime attribute (i.e., it is contained within a candidate
key)
Zaniolo's definition gives a clear sense of the difference between 3NF and the more
stringent Boyce-Codd normal form (BCNF). BCNF simply eliminates the third
alternative ("X-A has only prime attribute").
Difference between 2NF and 3NF can be stated as: non-key attributes be dependent on
"the whole key" ensures that a table is in 2NF; while that non-key attributes be dependent
on "nothing but the key" ensures that the table is in 3NF.
Example of table given above :
Tournament Winners
Tournament Year Winner Winner Date of Birth
Des Moines Masters 1998 Chip Masterson 14 March 1977
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977
This table is in 2NF but not in 3NF. The breach of 3NF occurs because the non-prime
attribute Winner Date of Birth is transitively dependent on the candidate key
{Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of
Birth is functionally dependent on Winner makes the table vulnerable to logical
inconsistencies, as there is nothing to stop the same person from being shown with
different dates of birth on different records.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
In order to express the same facts without violating 3NF, it is necessary to split the table
into two:
Tournament Winners Player Dates of Birth
Tournament Year Winner
Des Moines Masters 1998 Chip Masterson
Indiana Invitational 1998 Al Fredrickson
Cleveland Open 1999 Bob Albertson
Des Moines Masters 1999 Al Fredrickson
Indiana Invitational 1999 Chip Masterson
Boyce-Codd Normal Form:
It is a slightly stronger version of the third normal form (3NF). A table is in Boyce-Codd
normal form if and only if for every one of its non-trivial [dependencies] X Y, X is a
superkeythat is, X is either a candidate key or a superset thereof.
Note the above set of tables Tournament Winners and Player Dates of Birth shown as
in 3NF are also in BCNF
Only in rare cases does a 3NF table not meet the requirements of BCNF. A 3NF table
which does not have multiple overlapping candidate keys is guaranteed to be in BCNF
An example of a 3NF table that does not meet BCNF is
Today's Court Bookings
Court Start Time End Time Rate Type
1 09:30 10:30 SAVER
1 11:00 12:00 SAVER
1 14:00 15:30 STANDARD
2 10:00 11:30 PREMIUM-B
2 11:30 13:30 PREMIUM-B
2 15:00 16:30 PREMIUM-A
There are two courts available and there are four distinct rate types:
SAVER, for Court 1 bookings made by members
STANDARD, for Court 1 bookings made by non-members
PREMIUM-A, for Court 2 bookings made by members
PREMIUM-B, for Court 2 bookings made by non-members
So, Rate Type Court is only non-trivial functional dependency that holds.
o We can observe that the table's candidate keys are:
{Court, Start Time}
Player Date of Birth
Chip Masterson 14 March 1977
Al Fredrickson 21 July 1975
Bob Albertson 28 September 1968
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
{Court, End Time}
{Rate Type, Start Time}
{Rate Type, End Time}
o In the Today's Court Bookings table, there are no non-prime attributes: that is, all
attributes belong to candidate keys. Therefore the table adheres to both 2NF and
3NF
o The table does not adhere to BCNF because in the dependency Rate Type
Court, the determining attribute (Rate Type) is not a super key.
The design can be amended so that it meets BCNF as follows:
Rate Types Todays Bookings
Rate Type Court Member Flag
SAVER 1 Yes
STANDARD 1 No
PREMIUM-A 2 Yes
PREMIUM-B 2 No
The candidate keys for the Rate Types table are {Rate Type} and {Court, Member Flag};
the candidate keys for the Today's Bookings table are {Rate Type, Start Time} and {Rate
Type, End Time}. Both tables are in BCNF.
Rate Type Start Time End Time
SAVER 09:30 10:30
SAVER 11:00 12:00
STANDARD 14:00 15:30
PREMIUM-B 10:00 11:30
PREMIUM-B 11:30 13:30
PREMIUM-A 15:00 16:30
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-20
Consider the following table:
Lending
branch-name branch-city assets customer-name loan-number amount
Sadar Agra 200000 Ram L-12 12000
Sanjay-place Agra 100000 Ram L-13 13000
This table stores the information regarding loans. This table has following problems:
Since every branch is going to have several loans, the table will have one row for each
loan taken from a branch all of which will have same value for the columns branch-name,
branch-city and assets, repetition of data.
Updating the branch-city or assets of a particular branch will require updating each row
of this table and hence the operation will be costly.
If we miss any row without updating then there will be more than one value for a branch
city or assets of a branch, which means breaching the data integrity.
If there is a branch having no loans then we will not have any entry in this table and we
will not be able represent the complete information.
Decomposition
The above problem can be solved by decomposing the above table. The set of relations
R
1
, R
2
,R
n
is a decomposition of relation R if R = R
1
R
2
R
n
. It should be
noted that every pair R
i
and R
i+1
of this set should have at least one common attribute so
that they can be combined back again using join operation.
But all decompositions of this table will not be free from problem.
Consider for example if we form two new tables out of our Lending table as follows
Branch-customer-schema = (branch-name, branch-city, assets, customer name)
Customer-loan-schema = (customer-name, loan-number, amount)
Then the resulting tables with data will be as follows:
Branch-customer
branch-name branch-city assets customer-name
Sadar Agra 200000 Ram
Sanjay-place Agra 100000 Ram
Customer-loan
customer-name loan-number amount
Ram L-12 12000
Ram L-13 13000
Now suppose to know the branch for loan L-12 we try to form join of these two we will
a table as follows:
Branch-customer Customer-loan =
branch-name branch-city assets customer-name loan-number amount
Sadar Agra 200000 Ram L-12 12000
Sadar Agra 200000 Ram L-13 13000
Sanjay-place Agra 100000 Ram L-12 12000
Sanjay-place Agra 100000 Ram L-13 13000
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
According to this join both of the loans are taken from both of the branches. This is an
example of information loss. This occurred because the choice of Column to be kept
common in two tables after decomposition is wrong.
Lossless-Join Decomposition: A decomposition { R
1
, R
2
,R
n
} of relation schema R is
lossless join decomposition if for all legal relations r on schema R,
r =
R
1
(r)
R
1
(r)
R
n
(r)
In other words after decomposition, when we join all of the decomposed tables with data
it should result in the original table with data as was before decomposition.
Otherwise it is called Lossy-join decomposition.
Dependency preservation: This is another desirable property of a decomposition.
Suppose it is given that a set F of functional dependencies holds on any relation based on
schema R. Then set of functional dependencies that holds on any relation subschema R
1
is F
1
that contains all the functional dependencies of F which contains attributes of only
R
1
. So if decomposition of R is { R
1
, R
2
,R
n
} such that corresponding functional
dependencies which holds on them are { F
1
, F
2
,F
n
} then following should be true.
F
+
= {F
1
F
2
F
n
}
+
.
Such a decomposition is called dependency preserving decomposition.
For example:
Consider the schema R = {A, B, C, D} such that following functional dependency holds
on it F = {AB, A BC, C D}.
Now suppose the decomposition of this R is R
1
= {A,B} and R
2
= {B,C,D}, so the
functional dependencies which holds on R
1
are F
1
= {AB} (Note: F
1
should contain all
the functional dependencies in F which have only attributes of R
1
) and those on R
2
are F
2
={CD}. If we union F
1
F
2
is {AB, C D} which doesnt contain the A BC , so
it is not a dependency preserving decomposition.
If we decompose R into these relation schemas R
1
={A,B,C} and R
2
={C,D} then
F
1
={AB, A BC} and F
2
={CD} so F
1
F
2
is {AB, A BC, C D}.
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture-21
Normalization Using Functional Dependency
Lossless-Join Decomposition using FD:
o Let R is relation schema and F is a set of functional dependency on R. Let R
1
and
R
2
form a decomposition of R. This decomposition is lossless join decomposition
if at least one of the following functional dependency is in F
+
:
R
1
R
2
R
1
R
1
R
2
R
2
o Example: Lending-schema=(branch-name, branch-city, assets, customer-name,
loan-number, amount) the FD that holds on this schema are given as
branch-name assets branch-city
loan-number amount branch-name
so the decomposition of it into two schema as follows:
Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)
is a lossless join decomposition because-
Branch-schema Loan-info-schema = branch-name
and we have an FD branch-name assets branch-city, applying augmentation
rule to it, this FD is equivalent to branch-name branch-name assets branch-
city i.e. branch-name Branch-schema.
Third Normal Form Using FD:
o Let R is a relation having F as the minimal set of functional dependencies that
holds on R.
Then do the following:
1. Initially have an empty set of relations.
2. for each FD in F, , i=1
Add a relation R
i
=( ,) if no other relation contains , , Increase
i by one
3. After adding all such relations add another relation R
i
= ( any candidate
key of R) if no other relation is containing a candidate key.
Boyce-Codd Normal Form using FD:
1. Let R
i
be relation i.e. not in BCNF
2. And, let is the FD that holds on but R
i
doesnt hold on (i.e. is not a
super key of R
i
)
3. Replace relation R
i
by two relations (, ) and (R
i
- ).
4. Now check again all the relations present with all the FDs that holds on them and
Go back to step 1.
o Example:
Consider: Lending-schema=(branch-name, branch-city, assets, customer-
name, loan-number, amount) the FD that holds on this schema are given as
1. branch-name assets branch-city
2. loan-number amount branch-name
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
We can see that Lending-schema is not in BCNF. Also we see that in FD
branch-name assets branch-city, branch-name is not superkey of
Lending-schema. So new relations is a set as follows:
Branch-schema=(branch-name, branch-city, assets)
branch-name assets branch-city
Loan-info-schema = (branch-name, customer-name, loan-number,
amount)
loan-number amount branch-name
Again in the new set of relations we see Loan-info-schema is not in BCNF
as loan-number is not a super key of Loan-info-schema. Again we
decompose it and the set of relations are
Branch-schema=(branch-name, branch-city, assets)
branch-name assets branch-city
Loan-schema = (branch-name, loan-number, amount)
loan-number amount branch-name
Borrower-schema = (customer-name, loan-number)
Now all of the three relations are in BCNF so we do not have to
decompose any more.
BCNF may not satisfy the dependency preservation criteria.
o In some cases, a non-BCNF table cannot be decomposed into tables that satisfy
BCNF and preserve the dependencies that held in the original table
o For example, a set of functional dependencies {AB C, C B} cannot be
represented by a BCNF schema.
o Unlike the first three normal forms, BCNF is not always achievable.
o Consider the following non-BCNF table whose functional dependencies follow
the {AB C, C B} pattern:
Nearest Shop
Person Shop Type Nearest Shop
Davidson Optician Eagle Eye
Davidson Hairdresser Snippets
Wright Bookshop Merlin Books
Fuller Bakery Doughy's
Fuller Hairdresser Sweeney Todd's
Fuller Optician Eagle Eye
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
For each Person / Shop Type combination, the table tells us which shop of this
type is geographically nearest to the person's home. We assume for simplicity that
a single shop cannot be of more than one type.
The candidate keys of the table are:
{Person, Shop Type}
{Person, Nearest Shop}
Because all three attributes are prime attributes (i.e. belong to candidate keys), the
table is in 3NF. The table is not in BCNF, however, as the Shop Type attribute is
functionally dependent on a non-superkey: Nearest Shop.
Shop Near Person Shop
Person Shop
Davidson Eagle Eye
Davidson Snippets
Wright Merlin Books
Fuller Doughy's
Fuller Sweeney Todd's
Fuller Eagle Eye
The "Shop Near Person" table has a candidate key of {Person, Shop}, and the
"Shop" table has a candidate key of {Shop}. Unfortunately, although this design
adheres to BCNF, it is unacceptable on different grounds: it allows us to record
multiple shops of the same type against the same person. In other words, its
candidate keys do not guarantee that the functional dependency {Person, Shop
Type} {Shop} will be respected.
Shop Shop Type
Eagle Eye Optician
Snippets Hairdresser
Merlin Books Bookshop
Doughy's Bakery
Sweeney Todd's Hairdresser
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture 22
Multivalued Dependencies
Let R be a relation schema, and X and Y be disjoint subsets of R (i.e., X R, Y R,
XY= ), and Z = R- XY.A relation r(R) satisfies XY if for any two tuples t
1
and t
2
,
o t
1
(X)=t
2
(X), then there exist t
3
in r such that
o t
3
(X)=t
1
(X), t
3
(Y)=t
1
(Y), t
3
(Z)=t
2
(Z).
o By symmetry, there exist t4 in r such that
o t
4
(X)=t
1
(X), t
4
(Y)=t
2
(Y), t
4
(Z)=t
1
(Z).
X Y Z
t1 x1 y1 z1
t2 x1 y2 z2
t3 x1 y1 z2
t4 x1 y2 z1
The MVD X Y says that the relationship between X and Y is independent of the
relationship between X and R-Y
For example consider the table Employee:
Employee-name Project-name Dependant-name
Smith X John
Smith Y Ann
Smith X Ann
Smith Y John
o MVDs Employee-name Project-name and Employee-name Dependant-name
hold in the relation
o The employee named Smith works on projects X and Y, and has two dependents
John and Ann.
o If we store only the first two tuples in the relation, it would incorrectly show the
associations among attributes
o If we have MVDs in a relation, we may have to repeat values redundantly in the
tuples. In the Employee relation, values X and Y of Project-name are repeated
with each value of Dependant-name--- clearly undesirable
o Problem: Employee schema is in BCNF because no FDs hold for it
o Trivial MVD: If MVD X Y is satisfied by all relations whose schemas include X
and Y, it is called trivial MVD.
XY is trivial whenever Y X or XY=R
o If a relation r fails to satisfy a given MVD, a relation r that satisfies the MVD can
be constructed by adding tuples to r.
MVD is called "tuple generating dependency"
compare it with FD: need to delete tuples to make the relation to satisfy a
given FD
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
o MVD can be used in two ways
test relations to determine whether they are legal under a given set of FDs
and MVDs
specify constraints on a set of relations
Let D: a set of FDs and MVDs then D
+
: the closure of D is the set of all FDs and MVDs
logically implied by D.
D
+
can be computed using the following set of sound and complete rules
1. reflexivity: if Y X then XY
2. augmentation: if X Y then WX Y
3. transitivity: if XY and YZ then X Z
4. complementation: if XY then XR-XY
5. MV augmentation: if XY and W R, V W,then WXVY
6. MV transitivity: if X Y and YZ then XZ-Y
7. replication: if X Y then XY
8. coalescence: if XY and ZY, WR, WY= , WZ, then XZ
Note: The first three rules are Armstrongs axioms.
Fourth Normal Form(4NF):
A relation scheme R is in 4NF w.r.t. D, if for every non-trivial MVD XY in D+, X is a
superkey for R
4NF vs BCNF
o 4NF is different from BCNF only in the use of D (FD + MVD) instead of F (FDs)
o every 4NF schemas are also in BCNF.
By replication rule, XY implies XY.
o If R is not in BCNF, there exists a non-trivial FD XY where X is not a superkey
--- R cannot be in 4NF
For example: Employee (Employee-name, Project-name, Dependant-name) is not in 4NF,
since
o Employee-namePproject-name but Employee-name is not a key.
o Decompose into Emp-proj (E-n, P-n) and Emp-dep (E-n, D-n) do bring the tables
in 4NF
For example: Borrow (Loan#, C-name, Street, C-city) is in BCNF, but not in 4NF,
because C-nameLoan# is a non-trivial MVD, where C-name is not a key in this
schema.
The decomposition -- R1=(C-name, Loan#), R2=(C-name, Street, C-city)brings them
in 4NF
Benefits of Fourth Normal Form
o Reduced number of tuples
o No anomalies for insert/delete/update
Comparing FD and MVD
o if we have (a1,b1,c1,d1) r and (a1,b2,c2,d2) r
AB implies b1=b2
AB implies (a1,b1,c2,d2) r and (a1,b2,c1,d1) r
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
Lecture 23
Join Dependency and Fifth Normal form(Project Join Normal Form):
The normal forms discussed so far required that the given relation R if not in the given
normal form be decomposed in two relations to meet the requirements of the normal
form. In some rare cases, a relation can have problems like redundant information and
update anomalies because of it but cannot be decomposed in two relations to remove the
problems. In such cases it may be possible to decompose the relation in three or more
relations using the 5NF.
The fifth normal form deals with join-dependencies which is a generalisation of the
MVD. The aim of fifth normal form is to have relations that cannot be decomposed
further. A relation in 5NF cannot be constructed from several smaller relations.
A relation R satisfies join dependency *(R
1
, R
2
, ..., R
n
) if and only if R is equal to the join
of R
1
, R
2
, ..., R
n
where each R
i
is a subset of the set of attributes of R
A relation R is in 5NF (or project-join normal form, PJNF) if for all join dependencies of
the form *(R
1
, R
2
, ..., R
n
), where each R
i
is a subset of the set of attributes of R and
R = R
1
R
2
...R
n
, at least one of the following holds.
o *(R
1
, R
2
, ..., R
n
) is a trivial join-dependency (i.e., one of R
i
is R)
o Every R
i
is a super key for R.
An example of 5NF can be provided by the example below that deals with departments,
subjects and students.
Department Subject Student
Comp. Sc. CP1000 John Smith
Mathematics MA1000 John Smith
Comp. Sc. CP2000 Arun Kumar
Comp. Sc. CP3000 Reena Rani
Physics PH1000 Raymond Chew
Chemistry CH2000 Albert Garcia
o The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and
CP3000 which are taken by a variety of students. No student takes all the subjects
and no subject has all students enrolled in it and therefore all three fields are
needed to represent the information.
o The above relation does not show MVDs since the attributes subject and student
are not independent; they are related to each other and the pairings have
significant information in them. The relation can therefore not be decomposed in
two relations
(dept, subject), and (dept, student)
[email protected]
Lecture Notes For DBMS and Data Mining and Data Warehousing
Department of Electrical and Electronics By: Sulabh Bansal
without losing some important information.
o The relation can however be decomposed in the following three relations
(dept, subject), and
(dept, student)
(subject, student)
and now it can be shown that this decomposition is lossless
Consider the Loan-Info-Schema discussed earlier. Suppose it is given that following join
dependency holds on the Loan-Info-Schema-
*((loan-number,branch-name), (loan-number, customer-name), (loan-number,amount))
Then it is not in 5
th
normal form as all of these relation schema doesnt represent the
super keys so we should decompose it into three relations as given by the join
dependency i.e. we should have following three relation schemas in place of given Loan-
Info-Schema:
o (loan-number, branch-name),
o (loan-number, customer-name), and
o (loan-number, amount)
[email protected]