Relational Database Design: Normalization
Informal Design Guidelines for Relational Databases
• What is relational database design?
The grouping of attributes to form "good" relation schemas
Criteria for “good” relations
• GUIDELINE 1 (semantics of attributes): Informally, each tuple
in a relation should represent one entity or relationship instance.
(Applies to individual relations and their attributes).
EXAMPLE OF AN UPDATE ANOMALY
Consider the relation:
EMP_PROJ ( Emp#, Proj#, Ename, Pname, No_hours)
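A minimal sketch of the update anomaly in EMP_PROJ. The rows and the helper names are hypothetical (the slide's data is not shown); the point is only that Pname is stored once per (Emp#, Proj#) pair, so changing it in one tuple leaves the relation inconsistent.

```python
# Hypothetical EMP_PROJ rows: (Emp#, Proj#, Ename, Pname, No_hours).
emp_proj = [
    (1, 10, "Ann", "Billing", 12),
    (2, 10, "Bob", "Billing", 20),
    (1, 20, "Ann", "Audit", 8),
]


def rename_project_one_row(rows, emp, proj, new_name):
    """Change Pname in a single tuple only, as a careless update might."""
    return [
        (e, p, en, new_name if (e, p) == (emp, proj) else pn, h)
        for (e, p, en, pn, h) in rows
    ]


def project_names(rows):
    """Map each Proj# to the set of project names stored for it."""
    names = {}
    for (_, p, _, pn, _) in rows:
        names.setdefault(p, set()).add(pn)
    return names


updated = rename_project_one_row(emp_proj, 1, 10, "Invoicing")
# Projects that now carry more than one name are inconsistent.
inconsistent = {p: n for p, n in project_names(updated).items() if len(n) > 1}
print(inconsistent)
```

Storing Pname in a separate relation keyed by Proj# would make the rename a single-tuple update.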
Spurious Tuples
• Bad designs for a relational database may result in erroneous results for
certain JOIN operations
• The "lossless join" property is used to guarantee meaningful results for
join operations
• GUIDELINE 4: The relations should be designed to satisfy the lossless
join condition. No spurious tuples should be generated by doing a
natural-join of any relations.
Banking Schema
branch (branch_name, branch_city, assets)
Combine Schemas?
• Suppose we combine borrower and loan to get
bor_loan = (customer_id, loan_number, amount)
• Result is possible repetition of information (L-100 in example below)
A Combined Schema Without Repetition
• Consider combining loan_branch and loan
loan_amt_br = (loan_number, amount, branch_name)
• No repetition (as suggested by example below)
Overall Database Design Process
• We have assumed schema R is given
– R could have been generated when converting E-R
diagram to a set of tables.
– R could have been a single relation containing all
attributes that are of interest (called universal
relation).
– Normalization breaks R into smaller relations.
– R could have been the result of some ad hoc design of
relations, which we then test/convert to normal form.
ER Model and Normalization
• When an E-R diagram is carefully designed, identifying all entities
correctly, the tables generated from the E-R diagram should not need
further normalization.
What About Smaller Schemas?
• Suppose we had started with bor_loan. How would we know to split up
(decompose) it into borrower and loan?
• In bor_loan, because loan_number is not a candidate key, the amount of a loan may
have to be repeated. This indicates the need to decompose bor_loan.
What About Smaller Schemas?
• Not all decompositions are good.
Suppose we decompose employee (employee_id,
employee_name, telephone_number, start_date) into
employee1 = (employee_id, employee_name)
employee2 = (employee_name, telephone_number,
start_date)
A Lossy Decomposition
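The employee1 / employee2 split can be tried on a small instance. This is a sketch with hypothetical rows and helper names: two employees share a name, so joining the projections on employee_name produces spurious tuples.

```python
def project(rows, attrs):
    """Projection onto attrs, with duplicate tuples removed."""
    unique = {tuple((a, r[a]) for a in attrs) for r in rows}
    return [dict(t) for t in unique]


def natural_join(left, right):
    """Natural join: combine rows that agree on all shared attributes."""
    out = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                out.append({**l, **r})
    return out


def as_set(rows):
    """Order-independent view of a list of row dicts."""
    return {tuple(sorted(r.items())) for r in rows}


# Hypothetical instance: two different employees with the same name.
employee = [
    {"employee_id": 1, "employee_name": "Kim",
     "telephone_number": "555-0101", "start_date": "2020-01-06"},
    {"employee_id": 2, "employee_name": "Kim",
     "telephone_number": "555-0102", "start_date": "2021-03-15"},
]

employee1 = project(employee, ["employee_id", "employee_name"])
employee2 = project(employee, ["employee_name", "telephone_number", "start_date"])
rejoined = natural_join(employee1, employee2)
spurious = as_set(rejoined) - as_set(employee)

print(len(employee), len(rejoined), len(spurious))  # 2 4 2
```

The join returns every original tuple plus two that never existed, so the original relation cannot be recovered from employee1 and employee2.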
Relationships of Normal Forms
(Figure: nested normal forms, from outermost to innermost: 1NF, 2NF, 3NF/BCNF, 4NF, 5NF, DKNF.)
Normalization
• Start: unnormalized relation
• Remove repeating groups: normalized relation (1NF)
• Remove partial dependencies: 2NF
• Remove transitive dependencies: 3NF
• Remove remaining anomalies resulting from FDs: Boyce/Codd NF
• Remove other dependencies
First Normal Form
• Domain is atomic if its elements are considered to be
indivisible units
– Examples of non-atomic domains:
• Set of names, composite attributes
• Identification numbers like CS101 that can be
broken up into parts
• A relational schema R is in first normal form if the
domains of all attributes of R are atomic
• Non-atomic values complicate storage and encourage
redundant (repeated) storage of data
• We assume all relations are in first normal form
First Normal Form (Cont’d)
• Atomicity is actually a property of how the elements of
the domain are used.
– Example: Strings would normally be considered
indivisible
– Suppose that students are given roll numbers which
are strings of the form CS0012 or EE1127
– If the first two characters are extracted to find the
department, the domain of roll numbers is not atomic.
Functional Dependencies
• Normalization theory is based on the concepts of normal forms.
• A relational table is said to be in a particular normal form if it satisfies a certain set of
constraints.
• FDs and keys are used to define normal forms for relations
• X -> Y (X functionally determines Y) holds if, for any two tuples t1 and t2 in any relation instance r(R): if t1[X]=t2[X], then t1[Y]=t2[Y]
• X -> Y in R specifies a constraint on all relation instances r(R)
• FDs are derived from the real-world constraints on the attributes
Examples of FD constraints
• Social security number determines employee name
SSN -> ENAME
• Project number determines project name and location
PNUMBER -> {PNAME, PLOCATION}
• Employee SSN and project number determine the hours per week that
the employee works on the project
{SSN, PNUMBER} -> HOURS
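The tuple-pair definition of X -> Y can be checked against a given instance. The sketch below uses hypothetical rows and a hypothetical helper name. Note that an instance can only refute an FD; an FD is a constraint on all instances, so one instance satisfying it does not prove it.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, else returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical instance using the attribute names from the examples above.
works_on = [
    {"SSN": "111", "ENAME": "Ann", "PNUMBER": 1, "HOURS": 10},
    {"SSN": "111", "ENAME": "Ann", "PNUMBER": 2, "HOURS": 5},
    {"SSN": "222", "ENAME": "Bob", "PNUMBER": 1, "HOURS": 20},
]

print(satisfies_fd(works_on, ["SSN"], ["ENAME"]))             # True
print(satisfies_fd(works_on, ["SSN", "PNUMBER"], ["HOURS"]))  # True
print(satisfies_fd(works_on, ["SSN"], ["HOURS"]))             # False
```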
Examples of FD constraints
• An FD is a property of the attributes in the schema R
• The constraint must hold on every relation instance r(R)
• If K is a key of R, then K functionally determines all attributes in R
(since we never have two distinct tuples with t1[K]=t2[K])
Functional Dependencies (Cont.)
Inference Rules for FDs
• Given a set of FDs F, we can infer additional FDs that hold whenever
the FDs in F hold
Inference Rules for FDs
F+ = F
repeat
    for each functional dependency f in F+
        apply reflexivity and augmentation rules on f
        add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
        if f1 and f2 can be combined using transitivity
            then add the resulting functional dependency to F+
until F+ does not change any further
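Enumerating F+ as in the procedure above grows exponentially with the number of attributes. A common alternative, swapped in here and not taken from the slides, is to compute the closure of an attribute set under F and use it to test whether a single FD is in F+. The sketch below assumes FDs are given as (lhs, rhs) pairs of sets; the set F is hypothetical.

```python
def attribute_closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def implies(fds, lhs, rhs):
    """True if lhs -> rhs is in F+ (follows from fds)."""
    return rhs <= attribute_closure(lhs, fds)


# Hypothetical F over attributes A, B, C, D.
F = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C"}, {"D"})]

print(sorted(attribute_closure({"A"}, F)))  # ['A', 'B', 'C', 'D']
print(implies(F, {"A"}, {"D"}))             # True (by transitivity)
print(implies(F, {"D"}, {"A"}))             # False
```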
Closure of Functional Dependencies (Cont.)
Normalization
Example: A company obtains parts from a number of suppliers. Each
supplier is located in one city. A city can have more than one supplier located
there and each city has a status code associated with it. Each supplier may
provide many parts.
Normalization
• FIRST NORMAL FORM – 1NF
Lossless-join Decomposition
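• For the case of R = (R1, R2), we require that for all possible
relations r on schema R
r = Π_R1(r) ⋈ Π_R2(r)
• A decomposition of R into R1 and R2 is a lossless join if at least one of the
following dependencies is in F+:
– R1 ∩ R2 -> R1
– R1 ∩ R2 -> R2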
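The two-relation test can be sketched with an attribute closure: the decomposition passes if the closure of the common attributes covers R1 or covers R2. The FD used below (employee_id determines every other attribute) is an assumption made for illustration, as are the function names.

```python
def attribute_closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def is_lossless_binary(r1, r2, fds):
    """Test whether (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 follows from fds."""
    closure = attribute_closure(r1 & r2, fds)
    return r1 <= closure or r2 <= closure


# Assumed FD: employee_id determines all other employee attributes.
fds = [({"employee_id"},
        {"employee_name", "telephone_number", "start_date"})]

# Split on employee_name (the earlier employee1 / employee2 decomposition).
by_name = is_lossless_binary(
    {"employee_id", "employee_name"},
    {"employee_name", "telephone_number", "start_date"},
    fds,
)
# Hypothetical alternative: keep employee_id in both relations.
by_id = is_lossless_binary(
    {"employee_id", "employee_name"},
    {"employee_id", "telephone_number", "start_date"},
    fds,
)
print(by_name, by_id)  # False True
```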
Example of Relation Decomposition
Dependency Preservation Decomposition
Dependency Preserving Example
• Consider relation ABCD, with FDs:
A -> B, B -> C, C -> D
• Decompose into two relations: ABC and CD.
• ABC supports the FDs A -> B, B -> C.
• CD supports the FD C -> D.
• All the original dependencies are preserved.
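The ABCD example can be checked mechanically. The sketch below uses the standard textbook test, which is not stated on the slide: for each FD X -> Y, grow X using only closures restricted to one relation of the decomposition at a time, and check that Y is reached. The second decomposition (AB, AC, AD) is a hypothetical counterexample.

```python
def attribute_closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def preserves_dependencies(fds, decomposition):
    """True if every FD in fds can be enforced on the individual relations."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                # What result ∩ Ri determines, restricted to attributes of Ri.
                gained = attribute_closure(result & ri, fds) & ri
                if not gained <= result:
                    result |= gained
                    changed = True
        if not rhs <= result:
            return False
    return True


F = [(set("A"), set("B")), (set("B"), set("C")), (set("C"), set("D"))]

print(preserves_dependencies(F, [set("ABC"), set("CD")]))             # True
print(preserves_dependencies(F, [set("AB"), set("AC"), set("AD")]))   # False
```

In the second decomposition, B -> C cannot be checked inside any single relation, because no relation contains both B and C.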
Boyce-Codd Normal Form
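A relation schema R is in BCNF with respect to a set F of
functional dependencies if for all functional dependencies
in F+ of the form α -> β, where α ⊆ R and β ⊆ R, at least one of the following holds:
– α -> β is trivial (i.e., β ⊆ α)
– α is a superkey for R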
BCNF and Dependency Preservation
• Constraints, including functional dependencies, are costly to check
in practice unless they relate to only one relation
• If it is sufficient to test only those dependencies on each individual
relation of a decomposition in order to ensure that all functional
dependencies hold, then that decomposition is dependency
preserving.
• Because it is not always possible to achieve both BCNF and
dependency preservation, we consider a weaker normal form,
known as third normal form.
Third Normal Form: Motivation
• There are some situations where
– BCNF is not dependency preserving, and
– efficient checking for FD violation on updates is important
• Solution: define a weaker normal form, called Third
Normal Form (3NF)
– Allows some redundancy (with resultant problems; we will
see examples later)
– But functional dependencies can be checked on individual
relations without computing a join.
– There is always a lossless-join, dependency-preserving
decomposition into 3NF.
Third Normal Form
• A relation schema R is in third normal form (3NF) if for all
α -> β in F+
at least one of the following holds:
– α -> β is trivial (i.e., β ⊆ α)
– α is a superkey for R
– Each attribute A in β – α is contained in a candidate key for R.
(NOTE: each attribute may be in a different candidate key)
• If a relation is in BCNF it is in 3NF (since in BCNF one of the first
two conditions above must hold).
• Third condition is a minimal relaxation of BCNF to ensure
dependency preservation (will see why later).
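The three conditions can be turned into a small checker. This is a sketch under stated assumptions: candidate keys are found by brute force over attribute subsets (fine only for small schemas), and only the FDs given in F are tested rather than all of F+, which is the usual shortcut for testing the schema R itself. Both example schemas and all function names are illustrative.

```python
from itertools import combinations


def attribute_closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            if any(k <= attrs for k in keys):
                continue  # contains a smaller key, so not minimal
            if attribute_closure(attrs, fds) >= schema:
                keys.append(attrs)
    return keys


def violations(schema, fds):
    """Return (FDs violating BCNF, FDs violating 3NF)."""
    prime = set().union(*candidate_keys(schema, fds))
    bcnf, third = [], []
    for lhs, rhs in fds:
        if rhs <= lhs or attribute_closure(lhs, fds) >= schema:
            continue  # trivial, or left side is a superkey
        bcnf.append((lhs, rhs))
        if not (rhs - lhs) <= prime:  # third condition fails
            third.append((lhs, rhs))
    return bcnf, third


# Illustrative schema (J, K, L) with JK -> L and L -> K: 3NF but not BCNF.
jkl = violations({"J", "K", "L"}, [({"J", "K"}, {"L"}), ({"L"}, {"K"})])
# Supplier-style schema with s# -> city and city -> status: not even 3NF.
sup = violations({"s#", "city", "status"},
                 [({"s#"}, {"city"}), ({"city"}, {"status"})])
print(jkl)
print(sup)
```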
Normalization
• THIRD NORMAL FORM – 3NF
A relational table is in third normal form (3NF) if it is already in 2NF and every
non-key column is non-transitively dependent on its primary key.
The transformation of SUPPLIER into 3NF:
s#   city
s1   London
s2   Paris
s3   Tokyo
s4   Paris
s5   London
city     status
London   20
Paris    10
Tokyo    30
Rome     50
Comparison of BCNF and 3NF
• It is always possible to decompose a relation into a set of relations
that are in 3NF such that:
– the decomposition is lossless
– the dependencies are preserved
• It is always possible to decompose a relation into a set of relations
that are in BCNF such that:
– the decomposition is lossless
– it may not be possible to preserve dependencies.
How good is BCNF?
• There are database schemas in BCNF that do not seem to be
sufficiently normalized
• Consider a database
classes (course, teacher, book)
How good is BCNF? (Cont.)
classes
course             teacher    book
database           Avi        DB Concepts
database           Avi        Ullman
database           Hank       DB Concepts
database           Hank       Ullman
database           Sudarshan  DB Concepts
database           Sudarshan  Ullman
operating systems  Avi        OS Concepts
operating systems  Avi        Stallings
operating systems  Pete       OS Concepts
operating systems  Pete       Stallings
• There are no non-trivial functional dependencies and therefore the
relation is in BCNF
• Insertion anomalies – i.e., if Marilyn is a new teacher that can teach
database, two tuples need to be inserted
(database, Marilyn, DB Concepts)
(database, Marilyn, Ullman)
How good is BCNF? (Cont.)
• Therefore, it is better to decompose classes into:
teaches
course             teacher
database           Avi
database           Hank
database           Sudarshan
operating systems  Avi
operating systems  Pete
text
course             book
database           DB Concepts
database           Ullman
operating systems  OS Concepts
operating systems  Stallings
This suggests the need for higher normal forms, such as Fourth
Normal Form (4NF), which we shall see later.
Fourth Normal Form
Multi Valued Dependency & 4NF
Smith Research
Smith Sales
Further Normal Forms
The advanced forms of normalization are:
Fourth Normal Form (4NF)
Multivalued Dependencies (MVDs)
• Let R be a relation schema and let α ⊆ R and β ⊆ R. The
multivalued dependency
α ->> β
holds on R if in any legal relation r(R), for all pairs of tuples
t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4
in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R – β] = t2[R – β]
t4[β] = t2[β]
t4[R – β] = t1[R – β]
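The definition can be checked directly on the classes relation shown earlier. In this sketch (function name and representation are assumptions) only t3 is searched for explicitly: because every ordered pair (t1, t2) is visited, the tuple t4 for one pair is the t3 of the swapped pair.

```python
def satisfies_mvd(rows, attrs, alpha, beta):
    """Check alpha ->> beta on a set of tuples with attribute order attrs."""
    index = {a: i for i, a in enumerate(attrs)}
    rest = [a for a in attrs if a not in beta]  # R - beta

    def pick(t, names):
        return tuple(t[index[a]] for a in names)

    for t1 in rows:
        for t2 in rows:
            if pick(t1, alpha) != pick(t2, alpha):
                continue
            # Need t3 with t3[beta] = t1[beta] and t3[R - beta] = t2[R - beta].
            if not any(pick(t3, beta) == pick(t1, beta)
                       and pick(t3, rest) == pick(t2, rest)
                       for t3 in rows):
                return False
    return True


ATTRS = ("course", "teacher", "book")
classes = {
    ("database", "Avi", "DB Concepts"), ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"), ("database", "Hank", "Ullman"),
    ("database", "Sudarshan", "DB Concepts"), ("database", "Sudarshan", "Ullman"),
    ("operating systems", "Avi", "OS Concepts"),
    ("operating systems", "Avi", "Stallings"),
    ("operating systems", "Pete", "OS Concepts"),
    ("operating systems", "Pete", "Stallings"),
}

print(satisfies_mvd(classes, ATTRS, ["course"], ["teacher"]))  # True
# Dropping one tuple breaks the teacher/book independence.
broken = classes - {("database", "Hank", "Ullman")}
print(satisfies_mvd(broken, ATTRS, ["course"], ["teacher"]))   # False
```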
Join Dependency
• A join dependency (JD), denoted by
JD(R1, R2, …, Rn), specified on relation
schema R, specifies a constraint on the
states r of R.
• Every legal state r of R must satisfy
Π_R1(r) ⋈ Π_R2(r) ⋈ … ⋈ Π_Rn(r) = r
• A join dependency describes a multiway
decomposition and leads to the fifth normal
form (5NF)
5NF
• An MVD is a special case of a JD with n=2.
• JD(R1, R2) implies the MVD (R1 ∩ R2) ->> (R1 – R2)
or, by symmetry, (R1 ∩ R2) ->> (R2 – R1)
• A JD is trivial if one of the Ri is R.
• The 5NF is also called project-join normal
form (PJNF).
Join Dependency & 5NF
5NF: A relation schema R is in 5NF, or project-join
normal form (PJNF), with respect to a set F of functional,
multivalued and join dependencies if, for every
nontrivial join dependency JD(R1, R2, …, Rn) in the closure of
F, every Ri is a superkey of R.
Join Dependency & 5NF
SNAME PARTNAME PROJNAME
Smith Bolt ProjX
Smith Nut ProjY
Brown Bolt ProjY
John Nut ProjZ
Brown Nail ProjX
Join Dependency & 5NF
SNAME PARTNAME PROJNAME
Smith Bolt ProjX
Smith Nut ProjY
Brown Bolt ProjY
John Nut ProjZ
Brown Nail ProjX
• Assume additional constraint: whenever a supplier s supplies
p, a project j uses p and s supplies at least one part to j,
then the supplier s will also be supplying p to j.
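The constraint is a join dependency over the three binary projections (SNAME, PARTNAME), (SNAME, PROJNAME) and (PARTNAME, PROJNAME). A sketch, with the relation name supply and the function name chosen here for illustration: joining the three projections of the five tuples shows which tuples the constraint forces into the relation.

```python
# The five (SNAME, PARTNAME, PROJNAME) tuples from the slide.
supply = {
    ("Smith", "Bolt", "ProjX"),
    ("Smith", "Nut", "ProjY"),
    ("Brown", "Bolt", "ProjY"),
    ("John", "Nut", "ProjZ"),
    ("Brown", "Nail", "ProjX"),
}


def join_of_projections(rows):
    """Natural join of the three two-attribute projections of rows."""
    supplier_part = {(s, p) for s, p, j in rows}
    supplier_proj = {(s, j) for s, p, j in rows}
    part_proj = {(p, j) for s, p, j in rows}
    return {
        (s, p, j)
        for (s, p) in supplier_part
        for (s2, j) in supplier_proj
        if s2 == s and (p, j) in part_proj
    }


# Tuples implied by the constraint but absent from the instance.
missing = join_of_projections(supply) - supply
print(sorted(missing))
# [('Brown', 'Bolt', 'ProjX'), ('Smith', 'Bolt', 'ProjY')]

# Once they are added, the instance equals the join of its projections.
completed = supply | missing
print(join_of_projections(completed) == completed)  # True
```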
Join Dependency & 5NF
Inclusion Dependencies
• Inclusion dependencies define certain inter-relation constraints.
– Foreign key (referential integrity) constraints cannot be specified
as functional or multivalued dependencies
– The inclusion dependency R.X < S.Y between
the set of attributes X of R and the set of attributes
Y of S specifies that at any specific time
Π_X(r(R)) ⊆ Π_Y(s(S))
– X and Y must have the same number of attributes.
– The domains of each pair of corresponding attributes
should be compatible.
Example
• Department.dmgrssn < Employee.ssn
• Employee.dnumber < Department.dnumber
• Both of the above represent referential integrity
constraints.
• Employee.ssn < Person.ssn [specialization]
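An inclusion dependency R.X < S.Y can be checked on given instances by comparing projections. The rows and the function name below are hypothetical; the attribute names follow the examples above.

```python
def satisfies_inclusion(r_rows, x_attrs, s_rows, y_attrs):
    """True if the projection of r_rows on X is a subset of s_rows on Y."""
    if len(x_attrs) != len(y_attrs):
        raise ValueError("X and Y must have the same number of attributes")
    left = {tuple(row[a] for a in x_attrs) for row in r_rows}
    right = {tuple(row[a] for a in y_attrs) for row in s_rows}
    return left <= right


# Hypothetical instances.
employee = [
    {"ssn": "111", "dnumber": 1},
    {"ssn": "222", "dnumber": 1},
    {"ssn": "333", "dnumber": 2},
]
department = [
    {"dnumber": 1, "dmgrssn": "111"},
    {"dnumber": 2, "dmgrssn": "333"},
]

# Department.dmgrssn < Employee.ssn
print(satisfies_inclusion(department, ["dmgrssn"], employee, ["ssn"]))      # True
# Employee.dnumber < Department.dnumber
print(satisfies_inclusion(employee, ["dnumber"], department, ["dnumber"]))  # True
# The reverse direction does not hold: employee 222 manages no department.
print(satisfies_inclusion(employee, ["ssn"], department, ["dmgrssn"]))      # False
```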
Domain-Key Normal Form (DKNF)
• DKNF (the ultimate normal form, theoretically)
• A relation R is said to be in DKNF if all
constraints and dependencies that should hold
on the relation can be enforced simply by
enforcing the domain constraints and key
constraints on the relation.
• E.g., car(name, vn#), manufacture(vn#, country)
• If the car is Toyota or Lexus and the manufacturer
country is Japan, then the first character is “J”, etc…
• Value…, ….