
Relational Database Design: Normalization

The document discusses guidelines for designing relational databases and achieving normal forms. It covers criteria for good relation schemas including avoiding update, insert, and delete anomalies. The document also discusses problems with null values and spurious tuples that can arise from poor database designs. Normalization is presented as a process to decompose relations into smaller, well-designed relations based on analyzing functional dependencies.

Uploaded by

Souvik

Relational Database Design

Normalization

1
Informal Design Guidelines for Relational
Databases
• What is relational database design?
The grouping of attributes to form "good" relation schemas

•  What are the criteria for "good" base relations?


– Semantics of the attributes
– Problems with update anomalies
– Null values in tuples
– Spurious tuples

2
Criteria for “good” relations
• GUIDELINE 1 (semantics of attributes): Informally, each tuple
in a relation should represent one entity or relationship instance.
(Applies to individual relations and their attributes).

• Attributes of different entities (EMPLOYEEs, DEPARTMENTs,


PROJECTs) should not be mixed in the same relation

• Only foreign keys should be used to refer to other entities

• Entity and relationship attributes should be kept apart as much


as possible.

• Mixing attributes of multiple entities may cause problems

• Information stored redundantly wastes storage

3
EXAMPLE OF AN UPDATE ANOMALY
Consider the relation:
EMP_PROJ ( Emp#, Proj#, Ename, Pname, No_hours)

• Update Anomaly: Changing the name of project number P1 from “Billing”


to “Customer-Accounting” may cause this update to be made for all 100
employees working on project P1.

• Insert Anomaly: Cannot insert a project unless an employee is
assigned to it.
Conversely, cannot insert an employee unless he/she is assigned to
a project.

• Delete Anomaly: When a project is deleted, it will result in deleting
all the employees who work on that project. Alternatively, if an
employee is the sole employee on a project, deleting that employee
would result in deleting the corresponding project.

4
Problems with update anomalies

• GUIDELINE 2 (update anomalies): Design a schema that does not suffer


from the insertion, deletion and update anomalies. If there are any present,
then note them so that applications can be made to take them into account

• GUIDELINE 3 (Null values): Relations should be designed such


that their tuples will have as few NULL values as possible
•  Attributes that are NULL frequently could be placed in separate
relations (with the primary key)
•  Reasons for nulls:
– attribute not applicable or invalid
– attribute value unknown (may exist)
– value known to exist, but unavailable

5
Spurious Tuples

• Bad designs for a relational database may result in erroneous results for
certain JOIN operations
• The "lossless join" property is used to guarantee meaningful results for
join operations
• GUIDELINE 4: The relations should be designed to satisfy the lossless
join condition. No spurious tuples should be generated by doing a
natural-join of any relations.

There are two important properties of decompositions:


(a) the non-additive (lossless) property of the corresponding join
(b) preservation of the functional dependencies.

Note that property (a) is extremely important and cannot be sacrificed.


Property (b) is less stringent and may be sacrificed.

6
Banking Schema
branch (branch_name, branch_city, assets)

customer (customer_name, customer_street, customer_city)

account (account_number, branch_name, balance)

loan (loan_number, branch_name, amount)

depositor (customer_name, account_number)

borrower (customer_name, loan_number)

7
Combine Schemas?
• Suppose we combine borrower and loan to get
bor_loan = (customer_id, loan_number, amount )
• Result is possible repetition of information (L-100 in example below)

8
A Combined Schema Without Repetition
• Consider combining loan_branch and loan
loan_amt_br = (loan_number, amount, branch_name)
• No repetition (as suggested by example below)

9
Overall Database Design Process
• We have assumed schema R is given
– R could have been generated when converting E-R
diagram to a set of tables.
– R could have been a single relation containing all
attributes that are of interest (called universal
relation).
– Normalization breaks R into smaller relations.
– R could have been the result of some ad hoc design of
relations, which we then test/convert to normal form.

10
ER Model and Normalization
• When an E-R diagram is carefully designed, identifying all entities
correctly, the tables generated from the E-R diagram should not need
further normalization.

• However, in a real (imperfect) design, there can be functional


dependencies from non-key attributes of an entity to other attributes of
the entity
– Example: an employee entity with attributes department_number
and department_address, and a functional dependency
department_number -> department_address
– Good design would have made department an entity

• Functional dependencies from non-key attributes of a relationship set
are possible, but rare; most relationships are binary

11
What About Smaller Schemas?
• Suppose we had started with bor_loan. How would we know to split up
(decompose) it into borrower and loan?

• In bor_loan, because loan_number is not a candidate key, the amount of a loan may
have to be repeated. This indicates the need to decompose bor_loan.

12
What About Smaller Schemas?
• Not all decompositions are good.
Suppose we decompose employee (employee_id,
employee_name, telephone_number, start_date) into
employee1 = (employee_id, employee_name)
employee2 = (employee_name, telephone_number,
start_date)

• The next slide shows how we lose information -- we


cannot reconstruct the original employee relation -- and
so, this is a lossy decomposition.

13
A Lossy Decomposition
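
The figure for this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of why the employee1/employee2 split loses information. The two sample employees (both named "Kim") and the helper names `project` and `natural_join` are assumptions made for illustration, not data from the original slide.

```python
def project(rows, attrs):
    """Project rows onto attrs; duplicates collapse because the result is a set."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rows}


def natural_join(left, right):
    """Natural join of two sets of rows on their shared attribute names."""
    joined = set()
    for l_row in left:
        for r_row in right:
            l, r = dict(l_row), dict(r_row)
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                joined.add(frozenset({**l, **r}.items()))
    return joined


# Hypothetical data: two different employees who happen to share a name.
employee = {
    frozenset({"employee_id": 1, "employee_name": "Kim",
               "telephone_number": "555-0100",
               "start_date": "2020-01-01"}.items()),
    frozenset({"employee_id": 2, "employee_name": "Kim",
               "telephone_number": "555-0199",
               "start_date": "2021-06-15"}.items()),
}

employee1 = project(employee, {"employee_id", "employee_name"})
employee2 = project(employee, {"employee_name", "telephone_number", "start_date"})

rejoined = natural_join(employee1, employee2)
spurious = rejoined - employee  # tuples that were never in the original

print(len(employee), len(rejoined), len(spurious))  # 2 4 2
```

The join returns more tuples than the original relation, yet we can no longer tell which phone number belongs to which employee_id: the extra tuples are the spurious ones.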

14
Relationships of Normal Forms
1NF ⊃ 2NF ⊃ 3NF ⊃ BCNF ⊃ 4NF ⊃ 5NF ⊃ DKNF
(nested diagram: every relation in a higher normal form is also in all lower ones)

15
Normalization
Unnormalized relation
  -> remove repeating groups ->
Normalized relation (1NF)
  -> remove partial dependencies ->
2NF
  -> remove transitive dependencies ->
3NF
  -> remove remaining anomalies resulting from FDs ->
Boyce/Codd NF
  -> remove other dependencies ->
(higher normal forms)
16
First Normal Form
• Domain is atomic if its elements are considered to be
indivisible units
– Examples of non-atomic domains:
• Set of names, composite attributes
• Identification numbers like CS101 that can be
broken up into parts
• A relational schema R is in first normal form if the
domains of all attributes of R are atomic
• Non-atomic values complicate storage and encourage
redundant (repeated) storage of data
• We assume all relations are in first normal form

17
First Normal Form (Cont’d)
• Atomicity is actually a property of how the elements of
the domain are used.
– Example: Strings would normally be considered
indivisible
– Suppose that students are given roll numbers which
are strings of the form CS0012 or EE1127
– If the first two characters are extracted to find the
department, the domain of roll numbers is not atomic.

18
Functional Dependencies
• Normalization theory is based on the concepts of normal forms.
• A relational table is said to be in a particular normal form if it satisfies a certain set of
constraints.
• FDs and keys are used to define normal forms for relations

• A set of attributes X functionally determines a set of attributes Y if the value of X
determines a unique value for Y
• X -> Y holds if whenever two tuples have the same value for X, they must have the same
value for Y

• For any two tuples t1 and t2 in any relation instance r(R): If t1[X]=t2[X], then t1[Y]=t2[Y]
• X -> Y in R specifies a constraint on all relation instances r(R)
• FDs are derived from the real-world constraints on the attributes

19
Examples of FD constraints
• social security number determines employee name SSN -> ENAME
• project number determines project name and location
PNUMBER -> {PNAME, PLOCATION}
• employee ssn and project number determines the hours per week that
the employee works on the project
{SSN, PNUMBER} -> HOURS

• Example: Consider r(A, B) with the following instance of r:

A  B
1  4
1  5
3  7

• On this instance, A -> B does NOT hold, but B -> A does hold.
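
The tuple-based definition of X -> Y can be checked mechanically on a single instance. The sketch below is only an illustration: `fd_holds` is a hypothetical helper name, and a passing check on one instance does not prove the FD holds on the schema, since an FD must hold on every legal instance.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs is satisfied by this particular instance."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # Two tuples agreeing on lhs must also agree on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


# The instance r(A, B) from the example above.
r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]

print(fd_holds(r, ["A"], ["B"]))  # False: A = 1 appears with B = 4 and B = 5
print(fd_holds(r, ["B"], ["A"]))  # True
```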

20
Examples of FD constraints
• An FD is a property of the attributes in the schema R
• The constraint must hold on every relation instance r(R)
• If K is a key of R, then K functionally determines all attributes in R
(since we never have two distinct tuples with t1[K]=t2[K])

• K is a superkey for relation schema R if and only if K -> R

• K is a candidate key for R if and only if
– K -> R, and
– for no α ⊂ K, α -> R
• Consider the schema:
bor_loan = (customer_id, loan_number, amount).
We expect this functional dependency to hold:
loan_number -> amount
but would not expect the following to hold:
amount -> customer_name

21
Functional Dependencies (Cont.)

• A functional dependency is trivial if it is satisfied by all instances of a


relation
– Example:
• customer_name, loan_number -> customer_name
• customer_name -> customer_name
– In general, α -> β is trivial if β ⊆ α

22
Inference Rules for FDs
• Given a set of FDs F, we can infer additional FDs that hold whenever
the FDs in F hold

• Armstrong's inference rules:


IR1. (Reflexive) If Y subset-of X, then X -> Y
IR2. (Augmentation) If X -> Y, then XZ -> YZ
(Notation: XZ stands for X U Z)
IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z

•  These rules are


– sound (generate only functional dependencies that
actually hold) and
– complete (generate all functional dependencies that
hold).
23
Inference Rules for FDs

Some additional inference rules that are useful:


(Decomposition) If X -> YZ, then X -> Y and X -> Z
(Union) If X -> Y and X -> Z, then X -> YZ
(Pseudotransitivity) If X -> Y and WY -> Z, then WX -> Z

•  The last three inference rules, as well as any


other inference rules, can be deduced from IR1,
IR2, and IR3 (completeness property)

24
Inference Rules for FDs

• Closure of a set F of FDs is the set F+ of all FDs


that can be inferred from F
• F+ is a superset of F.

• Closure of a set of attributes X with respect to F


is the set X + of all attributes that are functionally
determined by X

• X + can be calculated by repeatedly applying


IR1, IR2, IR3 using the FDs in F
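
In practice X+ is usually computed with a simple fixpoint loop rather than by applying the three rules one at a time. The sketch below is one common way to do it, not necessarily the procedure the lecturer had in mind. It assumes single-letter attribute names, so a string such as "CG" stands for the set {C, G}, and it uses the set F from this deck's F+ example.

```python
def closure(attrs, fds):
    """Return X+: every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already covered, the right side follows.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# F for R = (A, B, C, G, H, I); each pair is (left side, right side).
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(sorted(closure("AG", F)))  # ['A', 'B', 'C', 'G', 'H', 'I']
print(sorted(closure("A", F)))   # ['A', 'B', 'C', 'H']
```

Since (AG)+ contains every attribute of R, AG is a superkey; A alone is not.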
25
Example of F+
• R = (A, B, C, G, H, I)
F = { A -> B
      A -> C
      CG -> H
      CG -> I
      B -> H }
• some members of F+
– A -> H
• by transitivity from A -> B and B -> H
– AG -> I
• by augmenting A -> C with G, to get AG -> CG
and then transitivity with CG -> I
– CG -> HI
• by augmenting CG -> I to infer CG -> CGI,
and augmenting CG -> H to infer CGI -> HI,
and then transitivity

26
Procedure for Computing F+
• To compute the closure of a set of functional dependencies F:

F+ = F
repeat
    for each functional dependency f in F+
        apply reflexivity and augmentation rules on f
        add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
        if f1 and f2 can be combined using transitivity
            then add the resulting functional dependency to F+
until F+ does not change any further

NOTE: We shall see an alternative procedure for this task later

27
Closure of Functional Dependencies (Cont.)

• We can further simplify manual
computation of F+ by using the following
additional rules.
– If α -> β holds and α -> γ holds, then α -> βγ
holds (union)
– If α -> βγ holds, then α -> β holds and α -> γ
holds (decomposition)
– If α -> β holds and γβ -> δ holds, then αγ -> δ
holds (pseudotransitivity)
28
Second Normal Form (2NF)
• Uses the concepts of FDs, primary key
Definitions:
• Prime attribute - attribute that is member of the
primary key K
• Full functional dependency - a FD Y -> Z where
removal of any attribute from Y means the FD does
not hold any more

Examples:
- {SSN, PNUMBER} -> HOURS is a full FD since
neither SSN -> HOURS nor PNUMBER -> HOURS holds
- {SSN, PNUMBER} -> ENAME is not a full FD (it is called a
partial dependency) since SSN -> ENAME also holds
29
Second Normal Form (2NF)

• A relation schema R is in second normal


form (2NF) if every non-prime attribute A
in R is fully functionally dependent on the
primary key

• R can be decomposed into 2NF relations


via the process of 2NF normalization

30
Normalization
An Example : A company obtains parts from a number of suppliers. Each
supplier is located in one city. A city can have more than one supplier located
there and each city has a status code associated with it. Each supplier may
provide many parts.

The company creates a simple relational table to store this information:

FIRST (s#, status, city, p#, qty)


s#      Supplier identification number
status  Status code assigned to city
city    City where supplier is located
p#      Part number of part supplied
qty     Quantity of parts supplied to date
Composite primary key is (s#, p#)

31
Normalization
• FIRST NORMAL FORM –1NF

s# city status p# qty


s1 London 20 p1 300
s1 London 20 p2 100
s1 London 20 p3 200
s1 London 20 p4 100
s2 Paris 10 p1 250
s2 Paris 10 p3 100
s3 Tokyo 30 p2 300
s3 Tokyo 30 p4 200
32
Normalization
SECOND NORMAL FORM – 2NF

FIRST is in 1NF but not in 2NF because status and city are
functionally dependent only on the column s# of the composite key
(s#, p#).

PARTS
s#  p#  qty
s1  p1  300
s1  p2  100
s1  p3  200
s1  p4  100
s2  p1  250
s2  p3  100
s3  p2  300
s3  p4  200

SECOND
s#  city    status
s1  London  20
s2  Paris   10
s3  Tokyo   30

33
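
The split of FIRST into PARTS and SECOND can be reproduced with two projections. This Python sketch uses the sample rows from the FIRST table in this deck; the variable names are illustrative only. It also checks that joining the two pieces on s# gives back exactly the original rows.

```python
# Rows of FIRST as (s#, city, status, p#, qty), copied from the 1NF table.
FIRST = [
    ("s1", "London", 20, "p1", 300),
    ("s1", "London", 20, "p2", 100),
    ("s1", "London", 20, "p3", 200),
    ("s1", "London", 20, "p4", 100),
    ("s2", "Paris", 10, "p1", 250),
    ("s2", "Paris", 10, "p3", 100),
    ("s3", "Tokyo", 30, "p2", 300),
    ("s3", "Tokyo", 30, "p4", 200),
]

# Projection onto (s#, p#, qty): everything that depends on the full key.
parts = {(s, p, qty) for s, _city, _status, p, qty in FIRST}

# Projection onto (s#, city, status): everything that depends on s# alone.
second = {(s, city, status) for s, city, status, _p, _qty in FIRST}

# Natural join on s# to rebuild the original relation.
rejoined = {
    (s, city, status, p, qty)
    for s, p, qty in parts
    for s2, city, status in second
    if s == s2
}

print(len(parts), len(second), rejoined == set(FIRST))  # 8 3 True
```

Each supplier's city and status is now stored once instead of once per part supplied.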
Goals of Normalization
• Let R be a relation scheme with a set F of functional
dependencies.
• Decide whether a relation scheme R is in “good” form.
• In the case that a relation scheme R is not in “good” form,
decompose it into a set of relation schemes {R1, R2, ..., Rn} such
that
– each relation scheme is in good form
– the decomposition is a lossless-join decomposition
– Preferably, the decomposition should be dependency
preserving.

34
Lossless-join Decomposition
• For the case of R = (R1, R2), we require that for all possible
relations r on schema R

r = π_R1(r) ⋈ π_R2(r)

• A decomposition of R into R1 and R2 is lossless join if and
only if at least one of the following dependencies is in F+:

– R1 ∩ R2 -> R1
– R1 ∩ R2 -> R2
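
The two-relation test above reduces to one attribute-closure computation: the decomposition is lossless when the closure of R1 ∩ R2 covers R1 or R2. The sketch below is a minimal illustration. The function names are hypothetical, and the FD employee_id -> (all other attributes) used for the employee example is an assumption, since the deck does not list the FDs of employee.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless_binary(r1, r2, fds):
    """True if (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 follows from fds."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure


# bor_loan split, with the FD loan_number -> amount from the deck.
loan_fds = [({"loan_number"}, {"amount"})]
print(is_lossless_binary({"loan_number", "amount"},
                         {"customer_id", "loan_number"}, loan_fds))  # True

# employee split; the FD below is an assumed key dependency.
emp_fds = [({"employee_id"},
            {"employee_name", "telephone_number", "start_date"})]
print(is_lossless_binary({"employee_id", "employee_name"},
                         {"employee_name", "telephone_number", "start_date"},
                         emp_fds))  # False
```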

35
Example of Relation Decomposition

36
Dependency Preservation Decomposition

• Definition: A decomposition is dependency preserving if each FD specified in F either appears
directly in one of the relations in the decomposition, or
can be inferred from FDs that appear in some relation.

37
Dependency Preserving Example
• Consider relation ABCD, with FD’s :
• A ->B, B ->C, C ->D
• Decompose into two relations: ABC and
CD.
• ABC supports the FD’s A->B, B->C.
• CD supports the FD C->D.
• All the original dependencies are preserved.
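
Dependency preservation can be tested without computing F+ explicitly. The sketch below uses the standard iterative test: for each FD X -> Y, grow X using only what each relation of the decomposition can "see", then check whether Y was reached. Single-letter attribute names are assumed, and the second split (AB, CD) is a hypothetical contrast case, not one from the slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserves_dependencies(fds, decomposition):
    """True if every FD in fds is implied by the FDs projected onto the parts."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                # Attributes derivable from result using only relation ri.
                gained = closure(result & set(ri), fds) & set(ri)
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True


F = [("A", "B"), ("B", "C"), ("C", "D")]
print(preserves_dependencies(F, ["ABC", "CD"]))  # True
print(preserves_dependencies(F, ["AB", "CD"]))   # False: B -> C is lost
```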

38
Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set F of
functional dependencies if for all functional dependencies
in F+ of the form

α -> β

where α ⊆ R and β ⊆ R, at least one of the following holds:

• α -> β is trivial (i.e., β ⊆ α)
• α is a superkey for R

Example schema not in BCNF:

bor_loan = (customer_id, loan_number, amount)

because loan_number -> amount holds on bor_loan but loan_number is
not a superkey
39
Decomposing a Schema into BCNF
• Suppose we have a schema R and a non-trivial dependency
α -> β that causes a violation of BCNF.
We decompose R into:
• (α ∪ β)
• (R – (β – α))
• In our example,
– α = loan_number
– β = amount
and bor_loan is replaced by
– (α ∪ β) = (loan_number, amount)
– (R – (β – α)) = (customer_id, loan_number)
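
One decomposition step can be sketched in Python as follows. The helper names are hypothetical. The sketch checks only the FDs given in F, which is enough to detect a violation on the original schema R; testing the resulting sub-schemas properly would require projecting F onto them, which this sketch does not attempt.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violation(schema, fds):
    """Return the first (alpha, beta) in fds that violates BCNF, or None."""
    for lhs, rhs in fds:
        alpha, beta = set(lhs), set(rhs)
        if beta <= alpha:
            continue  # trivial dependency
        if not set(schema) <= closure(alpha, fds):
            return alpha, beta  # alpha is not a superkey
    return None


def decompose_once(schema, fds):
    """Apply one BCNF step: R becomes (alpha ∪ beta) and (R - (beta - alpha))."""
    violation = bcnf_violation(schema, fds)
    if violation is None:
        return [set(schema)]
    alpha, beta = violation
    return [alpha | beta, set(schema) - (beta - alpha)]


bor_loan = {"customer_id", "loan_number", "amount"}
F = [({"loan_number"}, {"amount"})]
print(decompose_once(bor_loan, F))
```

For bor_loan this yields the two schemas named on the slide: (loan_number, amount) and (customer_id, loan_number).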

40
BCNF and Dependency Preservation
• Constraints, including functional dependencies, are costly to check
in practice unless they relate to only one relation
• If it is sufficient to test only those dependencies on each individual
relation of a decomposition in order to ensure that all functional
dependencies hold, then that decomposition is dependency
preserving.
• Because it is not always possible to achieve both BCNF and
dependency preservation, we consider a weaker normal form,
known as third normal form.

41
Third Normal Form: Motivation
• There are some situations where
– BCNF is not dependency preserving, and
– efficient checking for FD violation on updates is important
• Solution: define a weaker normal form, called Third
Normal Form (3NF)
– Allows some redundancy (with resultant problems; we will
see examples later)
– But functional dependencies can be checked on individual
relations without computing a join.
– There is always a lossless-join, dependency-preserving
decomposition into 3NF.

42
Third Normal Form
• A relation schema R is in third normal form (3NF) if for all
α -> β in F+
at least one of the following holds:
– α -> β is trivial (i.e., β ⊆ α)
– α is a superkey for R
– Each attribute A in β – α is contained in a candidate key for R.
(NOTE: each attribute may be in a different candidate key)
• If a relation is in BCNF it is in 3NF (since in BCNF one of the first
two conditions above must hold).
• Third condition is a minimal relaxation of BCNF to ensure
dependency preservation (will see why later).
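
The three conditions can be checked by brute force on small schemas. The sketch below enumerates candidate keys by trying attribute subsets in increasing size, which is exponential and only meant for classroom-sized examples. It tests the FDs in F rather than all of F+, which is sufficient for the 3NF test. The helper names are hypothetical, and the sample FDs are those of the SUPPLIER(s#, city, status) relation used in this deck.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    schema = set(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            k = set(combo)
            if any(key <= k for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(k, fds) >= schema:
                keys.append(k)
    return keys


def is_3nf(schema, fds):
    """Check the three 3NF conditions for every FD in fds."""
    prime = set().union(*candidate_keys(schema, fds))
    for lhs, rhs in fds:
        alpha, beta = set(lhs), set(rhs)
        if beta <= alpha:
            continue  # trivial
        if closure(alpha, fds) >= set(schema):
            continue  # alpha is a superkey
        if not (beta - alpha) <= prime:
            return False  # a non-prime attribute depends on a non-superkey
    return True


fds = [({"s#"}, {"city"}), ({"city"}, {"status"})]
print(is_3nf({"s#", "city", "status"}, fds))  # False: city -> status
print(is_3nf({"s#", "city"}, [({"s#"}, {"city"})]))  # True
```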

43
Normalization
• THIRD NORMAL FORM – 3NF

A relational table is in third normal form (3NF) if it is already in 2NF and every
non-key column is non-transitively dependent upon its primary key.

In other words, all non-key attributes are functionally dependent only
upon the primary key.

SUPPLIER
s#  city    status
s1  London  20
s2  Paris   10
s3  Tokyo   30
s4  Paris   10

The table SUPPLIER is in 2NF but not in 3NF because it contains a transitive
dependency:
SUPPLIER.s# -> SUPPLIER.city
SUPPLIER.city -> SUPPLIER.status
SUPPLIER.s# -> SUPPLIER.status

44
Normalization
The transformation of SUPPLIER into 3NF

SUPPLIER
s#  city
s1  London
s2  Paris
s3  Tokyo
s4  Paris
s5  London

CITY_STATUS
city    status
London  20
Paris   10
Tokyo   30
Rome    50

45
Comparison of BCNF and 3NF
• It is always possible to decompose a relation into a set of relations
that are in 3NF such that:
– the decomposition is lossless
– the dependencies are preserved
• It is always possible to decompose a relation into a set of relations
that are in BCNF such that:
– the decomposition is lossless
– but it may not be possible to preserve dependencies.

46
How good is BCNF?
• There are database schemas in BCNF that do not seem to be
sufficiently normalized
• Consider a database
classes (course, teacher, book )

such that (c, t, b) ∈ classes means that t is qualified to teach c, and
b is a required textbook for c
• The database is supposed to list for each course the set of teachers
any one of which can be the course’s instructor, and the set of
books, all of which are required for the course (no matter who
teaches it).

48
How good is BCNF? (Cont.)
course teacher book
database Avi DB Concepts
database Avi Ullman
database Hank DB Concepts
database Hank Ullman
database Sudarshan DB Concepts
database Sudarshan Ullman
operating systems Avi OS Concepts
operating systems Avi Stallings
operating systems Pete OS Concepts
operating systems Pete Stallings
classes
• There are no non-trivial functional dependencies and therefore the
relation is in BCNF
• Insertion anomalies – i.e., if Marilyn is a new teacher that can teach
database, two tuples need to be inserted
(database, Marilyn, DB Concepts)
(database, Marilyn, Ullman)

49
How good is BCNF? (Cont.)
• Therefore, it is better to decompose classes into:

teaches
course             teacher
database           Avi
database           Hank
database           Sudarshan
operating systems  Avi
operating systems  Pete

text
course             book
database           DB Concepts
database           Ullman
operating systems  OS Concepts
operating systems  Stallings

This suggests the need for higher normal forms, such as Fourth
Normal Form (4NF), which we shall see later.
50
Fourth Normal Form

• A relation schema R is in 4NF if and only if, whenever there
exist subsets A and B of the attributes of R such that the
nontrivial MVD A ->> B is satisfied, then all attributes of R are also
functionally dependent on A.

• If a relation is in 4NF it is in BCNF and all MVDs in R are in


fact FDs out of the candidate key.

51
Multi Valued Dependency & 4NF

ENAME  PNAME  DNAME
Smith  X      Research
Smith  Y      Sales
Smith  X      Sales
Smith  Y      Research

Decomposed into:

ENAME  PNAME
Smith  X
Smith  Y

ENAME  DNAME
Smith  Research
Smith  Sales

52
Further Normal Forms
The advanced forms of normalization are:
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF or PJNF)
• Domain-Key Normal Form (DKNF)

• Join dependencies generalize multivalued dependencies
– lead to project-join normal form (PJ/NF) (also called fifth normal
form)
• A class of even more general constraints leads to a normal form called
domain-key normal form.
• Problem with these generalized constraints: they are hard to reason with,
and no sound and complete set of inference rules exists.
• Hence rarely used

53
Multivalued Dependencies (MVDs)
• Let R be a relation schema and let α ⊆ R and β ⊆ R. The
multivalued dependency
α ->> β
holds on R if in any legal relation r(R), for all pairs of tuples
t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4
in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R – β] = t2[R – β]
t4[β] = t2[β]
t4[R – β] = t1[R – β]
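
The definition translates almost directly into a check on one relation instance. In the sketch below, `mvd_holds` is a hypothetical helper: for every ordered pair (t1, t2) that agrees on α it builds the required t3 and looks it up; t4 is covered when the same pair is visited in the opposite order. The sample rows are the ENAME / PNAME / DNAME instance from the "Multi Valued Dependency & 4NF" slide.

```python
def mvd_holds(rows, alpha, beta):
    """True if alpha ->> beta is satisfied by this particular instance."""
    relation = {frozenset(r.items()) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in alpha):
                # t3 takes its beta values from t1 and everything else from t2.
                t3 = {a: (t1[a] if a in beta else t2[a]) for a in t2}
                if frozenset(t3.items()) not in relation:
                    return False
    return True


emp = [
    {"ENAME": "Smith", "PNAME": "X", "DNAME": "Research"},
    {"ENAME": "Smith", "PNAME": "Y", "DNAME": "Sales"},
    {"ENAME": "Smith", "PNAME": "X", "DNAME": "Sales"},
    {"ENAME": "Smith", "PNAME": "Y", "DNAME": "Research"},
]

print(mvd_holds(emp, {"ENAME"}, {"PNAME"}))      # True
print(mvd_holds(emp[:3], {"ENAME"}, {"PNAME"}))  # False: (Smith, Y, Research) is missing
```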

54
Join Dependency
• A join dependency (JD), denoted by
JD(R1, R2, …, Rn), specified on relation
schema R, specifies a constraint on the
states r of R:
• ⋈(π_R1(r), π_R2(r), …, π_Rn(r)) = r
• A join dependency, i.e. a multiway
decomposition, leads to the fifth normal
form (5NF)
55
5NF
• An MVD is a special case of a JD with n = 2.
• JD(R1, R2) ⇒ MVD (R1 ∩ R2) ->> (R1 – R2)
MVD (R1 ∩ R2) ->> (R2 – R1)
• A JD is trivial if one of the Ri is R.
• The 5NF is also called project-join normal
form (PJNF).

56
Join Dependency & 5NF
5NF: A relation schema R is in 5NF, or project-join
normal form (PJNF), with respect to a set F of functional,
multivalued and join dependencies if, for every
nontrivial join dependency JD(R1, R2, …, Rn) in the closure of
F, every Ri is a superkey of R.

57
Join Dependency & 5NF
SNAME PARTNAME PROJNAME
Smith Bolt ProjX
Smith Nut ProjY
Brown Bolt ProjY
John Nut ProjZ
Brown Nail ProjX

There is no MVD, so the relation is in 4NF; JD({SNAME, PARTNAME}, {PARTNAME, PROJNAME},
{SNAME, PROJNAME}) is not lossless, so there is no 5NF decomposition.

58
Join Dependency & 5NF
SNAME  PARTNAME  PROJNAME
Smith  Bolt      ProjX
Smith  Nut       ProjY
Brown  Bolt      ProjY
John   Nut       ProjZ
Brown  Nail      ProjX

Assume additional constraint: whenever a supplier s supplies
p, a project j uses p and s supplies at least one part to j,
then the supplier s will also be supplying p to j.

Smith supplies parts {Bolt, Nut}
Smith supplies projects {ProjX, ProjY}
ProjX uses parts {Bolt, Nail}
ProjY uses parts {Nut, Bolt}
So Smith must supply ProjX with {Bolt} and ProjY with {Nut, Bolt}

59
Join Dependency & 5NF

SNAME PARTNAME PROJNAME


Smith Bolt ProjX
Smith Nut ProjY
Brown Bolt ProjY
John Nut ProjZ
Brown Nail ProjX
Smith Bolt ProjY
Brown Bolt ProjX

JD(R1 = {SNAME, PARTNAME}, R2 = {PARTNAME, PROJNAME}, R3 = {SNAME,
PROJNAME}) is valid, so each of R1, R2 and R3 is in 5NF
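
Whether a join dependency holds on a given instance can be checked by projecting onto each component and joining the projections back together. The Python sketch below does this for the supply relation from these slides, once with the original five tuples and once with the two extra tuples added. The helper names (`rows_of`, `project`, `natural_join`, `jd_holds`) are hypothetical.

```python
from functools import reduce

ATTRS = ("SNAME", "PARTNAME", "PROJNAME")


def rows_of(tuples):
    """Turn plain tuples into rows: frozensets of (attribute, value) pairs."""
    return {frozenset(zip(ATTRS, t)) for t in tuples}


def project(rows, attrs):
    """Project rows onto attrs; duplicates collapse because the result is a set."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rows}


def natural_join(left, right):
    """Natural join of two sets of rows on their shared attribute names."""
    joined = set()
    for l_row in left:
        for r_row in right:
            l, r = dict(l_row), dict(r_row)
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                joined.add(frozenset({**l, **r}.items()))
    return joined


def jd_holds(rows, components):
    """True if joining the projections onto the components gives back rows."""
    joined = reduce(natural_join, [project(rows, c) for c in components])
    return joined == rows


five = [
    ("Smith", "Bolt", "ProjX"),
    ("Smith", "Nut", "ProjY"),
    ("Brown", "Bolt", "ProjY"),
    ("John", "Nut", "ProjZ"),
    ("Brown", "Nail", "ProjX"),
]
seven = five + [("Smith", "Bolt", "ProjY"), ("Brown", "Bolt", "ProjX")]

JD = [{"SNAME", "PARTNAME"}, {"PARTNAME", "PROJNAME"}, {"SNAME", "PROJNAME"}]

print(jd_holds(rows_of(five), JD))   # False: the join adds two spurious tuples
print(jd_holds(rows_of(seven), JD))  # True
```

The two spurious tuples produced from the five-tuple instance are exactly the two tuples added in the seven-tuple instance, which is why the JD holds there.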

60
Inclusion Dependencies
• Used to define certain inter-relation constraints.
– FK (referential integrity) constraints cannot be specified
as functional or multivalued dependencies
– The inclusion dependency R.X < S.Y between
• the set of attributes X of R and the set of attributes
Y of S
• specifies that at any specific time π_X(r(R)) ⊆
π_Y(s(S))
• X and Y must have the same number of attributes.
• The domains for each pair of corresponding attributes
should be compatible.
61
Example
• Department.dmgrssn < Employee.ssn
• Employee.dnumber < Department.dnumber
• Both of the above represent referential integrity
constraints.
• Employee.ssn < Person.ssn [specialization]

62
Domain-Key Normal Form (DKNF)
• DKNF (the ultimate normal form, theoretically)
• A relation R is said to be in DKNF if all
constraints and dependencies that should hold
on the relation can be enforced simply by
enforcing the domain constraints and key
constraints on the relation.
• E.g., car(name, vn#), manufacture(vn#, country)
• If the car is a Toyota or Lexus and the manufacturer
country is Japan, then the first character of vn# is “J”, etc…
• Value…, ….

63
