Session 5 Normalization
Academic year 2023-2024
• If there are two employees with the same name, we cannot reconstruct the original
employee relation => this is a lossy decomposition!
Lossy decomposition
Lossless decomposition
• Decomposition of R = (A, B, C)
R1 = (A, B) R2 = (B, C)
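A minimal sketch of what "lossless" means in practice: project an instance of R = (A, B, C) onto R1 and R2, join the projections back, and compare with the original. The helper names (`project`, `natural_join`, `is_lossless_on`) and both instances are hypothetical illustrations, not part of the slides. The check covers one instance only, whereas losslessness is a property of every legal instance.

```python
from itertools import product

def project(relation, attrs):
    """Project a list of dict rows onto attrs, removing duplicate tuples."""
    seen = {tuple(row[a] for a in attrs) for row in relation}
    return [dict(zip(attrs, values)) for values in seen]

def natural_join(r1, r2):
    """Natural join of two lists of dict rows on their shared attributes."""
    joined = []
    for t1, t2 in product(r1, r2):
        shared = t1.keys() & t2.keys()
        if all(t1[a] == t2[a] for a in shared):
            joined.append({**t1, **t2})
    return joined

def is_lossless_on(relation, attrs1, attrs2):
    """True if joining the two projections gives back exactly `relation`."""
    rejoined = natural_join(project(relation, attrs1), project(relation, attrs2))
    as_set = lambda rows: {tuple(sorted(row.items())) for row in rows}
    return as_set(rejoined) == as_set(relation)

# Hypothetical instance in which each B value has a single C value.
r = [
    {"A": "a1", "B": 1, "C": "x"},
    {"A": "a2", "B": 1, "C": "x"},
    {"A": "a3", "B": 2, "C": "y"},
]
print(is_lossless_on(r, ["A", "B"], ["B", "C"]))  # True

# Hypothetical instance in which the same B value appears with two C values:
# the join produces extra (spurious) tuples, so information is lost.
s = [
    {"A": "a1", "B": 1, "C": "x"},
    {"A": "a2", "B": 1, "C": "y"},
]
print(is_lossless_on(s, ["A", "B"], ["B", "C"]))  # False
```

The second instance mirrors the employee example: when the shared attribute does not identify the rest of a tuple, the join cannot tell which halves belonged together.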
Normalization
• Decide whether a particular relation R is in “good” form.
• The main objective of normalization is to reduce redundancy and improve data
integrity, ensuring that information is stored in a logical and efficient manner.
• If a relation R is not in “good” form, decompose it into a set of relations
{R1, R2, ..., Rn} such that
• Each relation is in good form
• The decomposition is a lossless decomposition
• Based on:
• Functional dependencies
• Multivalued dependencies
Normalization based on
Functional Dependencies
Functional dependencies
• There are usually a variety of constraints (rules) on the data in the real world.
• For example, constraints in a university database include:
• Students and instructors are uniquely identified by their ID.
• Each student and instructor has only one name.
• Each instructor and student is (primarily) associated with only one department.
• Each department has only one value for its budget, and only one associated
building.
• Legal instance: An instance of a relation that satisfies all such real-world constraints
• A legal instance of a DB is one where all the relation instances are legal instances
• Functional dependencies require that the value for a certain set of attributes determines
uniquely the value for another set of attributes. This is expressed as "X determines Y" or "X -> Y".
• A functional dependency is a generalization of the notion of a key!
Functional dependencies
• Let R be a relation schema
α ⊆ R and β ⊆ R
• The functional dependency
α → β
holds on R if and only if for any legal relation r(R), whenever any two tuples t1 and t2
of r agree on the attributes α, they also agree on the attributes β. That is,
t1[α] = t2[α] ⇒ t1[β] = t2[β]
• Example: consider r(A, B) with the following instance:
A B
1 4
1 5
3 7
On this instance, B → A holds; A → B does NOT hold
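The definition can be checked mechanically on a single instance. The sketch below assumes the three rows above are the columns A and B of a relation r(A, B); `fd_holds` is a hypothetical helper name. Passing this check on one instance does not prove that the dependency holds on the schema, since the definition quantifies over all legal instances.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether lhs -> rhs is satisfied by this particular instance."""
    seen = {}  # maps each lhs value to the first rhs value seen with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True

# The instance from the slide, assuming the columns are named A and B.
r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]

print(fd_holds(r, ["B"], ["A"]))  # True
print(fd_holds(r, ["A"], ["B"]))  # False: A = 1 appears with B = 4 and B = 5
```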
Closure of a set of Functional dependencies
• Given a set F of functional dependencies, there are certain other functional
dependencies that are logically implied by F.
• If A → B and B → C, then we can infer that A → C
• Some members of F+
• A → H
• by transitivity from A → B and B → H
• AG → I
• by augmenting A → C with G, to get AG → CG
and then transitivity with CG → I
• CG → HI
• by augmenting CG → I to infer CG → CGI,
and augmenting CG → H to infer CGI → HI, and then transitivity
Keys and Functional dependencies
• K is a superkey for relation schema R if and only if K → R
• K is a candidate key for R if and only if
• K → R, and
• for no α ⊂ K, α → R
• The key difference between a superkey and a candidate key is that a superkey may
contain additional attributes that are not necessary for unique identification!
• Functional dependencies allow us to express constraints that cannot be expressed
using superkeys. Consider the schema:
in_dep (ID, name, salary, dept_name, building, budget ).
We expect these functional dependencies to hold:
dept_name → building
ID → building
but would not expect the following to hold:
dept_name → salary
• If testing a functional dependency can be done by considering just one relation, then
the cost of testing this constraint is low
Repetition of information
Need to use null values (e.g., to represent the relationship l2, k2
where there is no corresponding value for J)
BCNF vs 3NF
• Advantages to 3NF over BCNF
• It is always possible to obtain a 3NF design without sacrificing losslessness or
dependency preservation.
• Disadvantages to 3NF
• We may have to use null values to represent some of the possible meaningful
relationships among data items.
• There is the problem of repetition of information.
How good is BCNF
• There are database schemas in BCNF that do not seem to be sufficiently normalized
• Consider a relation
inst_info (ID, child_name, phone)
• where an instructor may have more than one phone and can have multiple children
• Instance of inst_info
• There are no non-trivial functional dependencies and therefore the relation is in BCNF
• Insertion anomalies – i.e., if we add a phone 981-992-3443 to 99999, we need to add
two tuples
(99999, David, 981-992-3443)
(99999, William, 981-992-3443)
Higher NF
• Based on the previous example
• inst_phone:
We need higher normal forms, such as Fourth Normal Form (4NF)!
Normalization based on
Multivalued Dependencies
Multivalued Dependencies
• Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency
α →→ β
holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α]
= t2[α], there exist tuples t3 and t4 in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R – β] = t2[R – β]
t4[β] = t2[β]
t4[R – β] = t1[R – β]
Multivalued Dependencies
• Let R be a relation schema with a set of attributes that are partitioned into 3
nonempty subsets.
Y, Z, W
• We say that Y →→ Z (Y multidetermines Z)
if and only if for all possible relations r(R):
if < y1, z1, w1 > ∈ r and < y1, z2, w2 > ∈ r
then
< y1, z1, w2 > ∈ r and < y1, z2, w1 > ∈ r
• Since the behavior of Z and W is identical, it follows that
Y →→ Z if Y →→ W
Multivalued Dependencies. Example
• Suppose inst_child(ID, child_name) and inst_phone(ID, phone_number)
• If we were to combine these schemas to get inst_info(ID, child_name, phone_number)
• Example data:
(99999, David, 512-555-1234)
(99999, David, 512-555-4321)
(99999, William, 512-555-1234)
(99999, William, 512-555-4321)
• ID →→ child_name
ID →→ phone_number
• The above formal definition is supposed to formalize the notion that given a particular
value of Y (ID) it has associated with it a set of values of Z (child_name) and a set of
values of W (phone_number), and these two sets are in some sense independent of
each other.
• Note:
• If Y → Z then Y →→ Z
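The t1..t4 definition can be tested directly on the example data. This is a sketch for one instance only; `mvd_holds` is a hypothetical helper name. Because the loop visits every ordered pair (t1, t2), checking for t3 alone is enough: t4 is the t3 of the swapped pair.

```python
def mvd_holds(rows, attrs, lhs, rhs):
    """Check lhs ->> rhs on one instance using the t1..t4 definition."""
    tuples = set(rows)
    index = {a: i for i, a in enumerate(attrs)}
    for t1 in tuples:
        for t2 in tuples:
            if any(t1[index[a]] != t2[index[a]] for a in lhs):
                continue  # the definition only constrains pairs agreeing on lhs
            # t3 takes the rhs attributes from t1 and everything else from t2.
            t3 = tuple(t1[i] if a in rhs else t2[i] for i, a in enumerate(attrs))
            if t3 not in tuples:
                return False
    return True

attrs = ["ID", "child_name", "phone_number"]
inst_info = [
    (99999, "David", "512-555-1234"),
    (99999, "David", "512-555-4321"),
    (99999, "William", "512-555-1234"),
    (99999, "William", "512-555-4321"),
]

print(mvd_holds(inst_info, attrs, ["ID"], ["child_name"]))    # True
print(mvd_holds(inst_info, attrs, ["ID"], ["phone_number"]))  # True

# Dropping one combination breaks the independence of the two sets.
print(mvd_holds(inst_info[:3], attrs, ["ID"], ["child_name"]))  # False
```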
DB Design process
1NF
• Domain is atomic if its elements are considered to be indivisible units
• Set of names, composite attributes, identification numbers like CS101 that can be
broken up into parts are examples of non-atomic domains
• A relational schema R is in first normal form if the domains of all attributes of R are
atomic
• Non-atomic values complicate storage and encourage redundant storage of data
• Atomicity is actually a property of how the elements of the domain are used.
• Example: Strings would normally be considered indivisible
• Suppose that students are given roll numbers which are strings of the form
CS0012 or EE1127
• If the first two characters are extracted to find the department, the domain of roll
numbers is not atomic.
• Doing so is a bad idea: leads to encoding of information in application program
rather than in the database.
Design goals
• Goal for a relational database design is:
• BCNF.
• Lossless join.
• Dependency preservation.
• If we cannot achieve this, we accept one of
• Lack of dependency preservation
• Redundancy due to use of 3NF
• Interestingly, SQL does not provide a direct way of specifying functional
dependencies other than superkeys.
Can specify FDs using assertions, but they are expensive to test, (and currently not
supported by any of the widely used databases!)
• Even if we had a dependency preserving decomposition, using SQL we would not be
able to efficiently test a functional dependency whose left hand side is not a key.
E-R Model and Normalization
• When an E-R diagram is carefully designed, identifying all entities correctly, the tables
generated from the E-R diagram should not need further normalization.
• However, in a real (imperfect) design, there can be functional dependencies from
non-key attributes of an entity to other attributes of the entity
• Example: an employee entity with
• attributes
department_name and building,
• functional dependency
department_name → building
• Good design would have made department an entity
• Functional dependencies from non-key attributes of a relationship set are possible, but
rare, since most relationships are binary
Denormalization for performance
• May want to use non-normalized schema for performance
• For example, displaying prereqs along with course_id, and title requires join of course
with prereq
• Alternative 1: Use denormalized relation containing attributes of course as well as
prereq with all above attributes
• faster lookup
• extra space and extra execution time for updates
• extra coding work for programmer and possibility of error in extra code
• Alternative 2: use a materialized view defined as the join of course and prereq
• Benefits and drawbacks same as above, except no extra coding work for
programmer and avoids possible errors
Other design issues
• Some aspects of database design are not caught by normalization
• Examples of bad database design, to be avoided:
Instead of earnings (company_id, year, amount ), use
• earnings_2004, earnings_2005, earnings_2006, etc., all on the schema (company_id,
earnings).
• The above are in BCNF, but make querying across years difficult and need a new
table each year
• company_year (company_id, earnings_2004, earnings_2005,
earnings_2006)
• Also in BCNF, but also makes querying across years difficult and requires new
attribute each year.
• Is an example of a crosstab, where values for one attribute become column
names
• Used in spreadsheets and in data analysis tools
Annex
Functional Dependency Theory
Attribute closure
Closure of Attribute Sets
• Given a set of attributes α, define the closure of α under F (denoted by α+) as the set
of attributes that are functionally determined by α under F
• R = (A, B, C, G, H, I)
• F = {A → B
A → C
CG → H
CG → I
B → H}
• (AG)+
1. result = AG
2. result = ABCG (A → C and A → B)
3. result = ABCGH (CG → H and CG ⊆ AGBC)
4. result = ABCGHI (CG → I and CG ⊆ AGBCH)
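The steps above can be sketched as a fixed-point loop. This is a minimal illustration, assuming single-letter attribute names so that a string such as "CG" stands for the set {C, G}; `attribute_closure` is a hypothetical name.

```python
def attribute_closure(attrs, fds):
    """Return the closure of `attrs` under the functional dependencies `fds`.

    `fds` is a list of (lhs, rhs) pairs; each side is a string of
    single-letter attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs when lhs is already inside the result.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print("".join(sorted(attribute_closure("AG", F))))  # ABCGHI
```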
Use of closure of attributes
• Testing for superkey:
• To test if α is a superkey, we compute α+, and check if α+ contains all attributes of
R.
• Testing functional dependencies
• To check if a functional dependency α → β holds (or, in other words, is in F+), just
check if β ⊆ α+.
• That is, we compute α+ by using attribute closure, and then check if it contains β.
• Computing closure of F
• For each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+, we output a functional
dependency γ → S.
• Is AG a candidate key?
1. Is AG a superkey?
1. Does AG → R? That is, is (AG)+ ⊇ R?
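A sketch of both key tests built on attribute closure, using R = (A, B, C, G, H, I) and the set F from the attribute closure example. The function names are hypothetical. For minimality it is enough to drop one attribute at a time, because any superset of a superkey is itself a superkey.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_superkey(attrs, schema, fds):
    """K is a superkey if K+ contains every attribute of the schema."""
    return closure(attrs, fds) >= set(schema)

def is_candidate_key(attrs, schema, fds):
    """K is a candidate key if it is a superkey and no proper subset is."""
    if not is_superkey(attrs, schema, fds):
        return False
    # Removing any single attribute must destroy the superkey property.
    return not any(is_superkey(set(attrs) - {a}, schema, fds) for a in attrs)

R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(is_superkey("AG", R, F))        # True
print(is_candidate_key("AG", R, F))   # True: neither A nor G alone determines R
print(is_candidate_key("ABG", R, F))  # False: a superkey, but not minimal
```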
Canonical cover
• Suppose that we have a set of functional dependencies F on a relation schema.
Whenever a user performs an update on the relation, the database system must
ensure that the update does not violate any functional dependencies; that is, all the
functional dependencies in F are satisfied in the new database state.
• If an update violates any functional dependencies in the set F, the system must roll
back the update.
• We can reduce the effort spent in checking for violations by testing a simplified set of
functional dependencies that has the same closure as the given set.
• This simplified set is termed the canonical cover
• To define canonical cover we must first define extraneous attributes.
• An attribute of a functional dependency in F is extraneous if we can remove it
without changing F+
Extraneous attributes
• Removing an attribute from the left side of a functional dependency could make it a
stronger constraint.
• For example, if we have AB → C and remove B, we get the possibly stronger result
A → C. It may be stronger because A → C logically implies AB → C, but AB → C
does not, on its own, logically imply A → C
• But, depending on what our set F of functional dependencies happens to be, we may
be able to remove B from AB → C safely.
• For example, suppose that
• F = {AB → C, A → D, D → C}
• Then we can show that F logically implies A → C, making B extraneous in AB → C.
Extraneous attributes
• Removing an attribute from the right side of a functional dependency could make it a
weaker constraint.
• For example, if we have AB → CD and remove C, we get the possibly weaker
result AB → D. It may be weaker because using just AB → D, we can no longer
infer AB → C.
• But, depending on what our set F of functional dependencies happens to be, we may
be able to remove C from AB → CD safely.
• For example, suppose that
F = {AB → CD, A → C}
• Then we can show that even after replacing AB → CD by AB → D, we can still infer
AB → C and thus AB → CD.
Extraneous attributes
• An attribute of a functional dependency in F is extraneous if we can remove it without
changing F +
• Consider a set F of functional dependencies and the functional dependency α → β in
F.
• Remove from the left side: Attribute A is extraneous in α if
• A ∈ α and
• F logically implies (F – {α → β}) ∪ {(α – A) → β}.
• Remove from the right side: Attribute A is extraneous in β if
• A ∈ β and
• The set of functional dependencies
(F – {α → β}) ∪ {α → (β – A)} logically implies F.
• Note: implication in the opposite direction is trivial in each of the cases above, since a
“stronger” functional dependency always implies a weaker one
How to determine if an attribute is extraneous
• Let R be a relation schema and let F be a set of functional dependencies that hold
on R. Consider an attribute in the functional dependency α → β.
• To test if attribute A ∈ β is extraneous in β
• Consider the set:
F' = (F – {α → β}) ∪ {α → (β – A)},
• compute α+ under F' and check whether it contains A; if it does, A is extraneous in β
• To test if attribute A ∈ α is extraneous in α
• Let γ = α – {A}. Check if γ → β can be inferred from F.
• Compute γ+ using the dependencies in F
• If γ+ includes all attributes in β, then A is extraneous in α
Extraneous attribute. Example
• Let F = {AB → CD, A → E, E → C}
• To check if C is extraneous in AB → CD, we:
• Compute the attribute closure of AB under F' = {AB → D, A → E, E → C}
• The closure is ABCDE, which includes CD
• This implies that C is extraneous
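Both tests reduce to one attribute closure each. The sketch below assumes single-letter attribute names written as strings; the function names are hypothetical. It reproduces the example above and the earlier left-side example F = {AB → C, A → D, D → C}.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def extraneous_in_rhs(attr, lhs, rhs, fds):
    """attr in rhs is extraneous if lhs+ computed under F' still contains it."""
    f_prime = [fd for fd in fds if fd != (lhs, rhs)]
    f_prime.append((lhs, rhs.replace(attr, "")))  # lhs -> (rhs - attr)
    return attr in closure(lhs, f_prime)

def extraneous_in_lhs(attr, lhs, rhs, fds):
    """attr in lhs is extraneous if (lhs - attr)+ under F contains all of rhs."""
    gamma = lhs.replace(attr, "")
    return set(rhs) <= closure(gamma, fds)

F1 = [("AB", "CD"), ("A", "E"), ("E", "C")]
print(extraneous_in_rhs("C", "AB", "CD", F1))  # True
print(extraneous_in_rhs("D", "AB", "CD", F1))  # False

F2 = [("AB", "C"), ("A", "D"), ("D", "C")]
print(extraneous_in_lhs("B", "AB", "C", F2))   # True
print(extraneous_in_lhs("A", "AB", "C", F2))   # False
```

Note the asymmetry: the right-side test computes the closure under the modified set F', while the left-side test uses F unchanged.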
Canonical cover
• A canonical cover for F is a set of dependencies Fc such that
• F logically implies all dependencies in Fc, and
• Fc logically implies all dependencies in F, and
• No functional dependency in Fc contains an extraneous attribute, and
• Each left side of a functional dependency in Fc is unique. That is, there are no two
dependencies in Fc
• α1 → β1 and α2 → β2 such that
• α1 = α2
Canonical cover. Example
• R = (A, B, C) F = {A → BC
B → C
A → B
AB → C}
• Combine A → BC and A → B into A → BC
• Set is now {A → BC, B → C, AB → C}
• A is extraneous in AB → C
• Check if the result of deleting A from AB → C is implied by the other dependencies
• Yes: in fact, B → C is already present!
• Set is now {A → BC, B → C}
• C is extraneous in A → BC
• Check if A → C is logically implied by A → B and the other dependencies
• Yes: using transitivity on A → B and B → C.
• Can use attribute closure of A in more complex cases
• The canonical cover is: A → B
B → C
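The procedure in this example can be sketched as a loop that alternates the union rule with the removal of one extraneous attribute at a time. This is an illustrative sketch with hypothetical function names, assuming single-letter attributes. A canonical cover is not unique in general, so a different removal order may give a different, equally valid, result.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def merge(fds):
    """Union rule: combine dependencies that share a left side."""
    merged = {}
    for lhs, rhs in fds:
        merged.setdefault(frozenset(lhs), set()).update(rhs)
    # Dependencies whose right side became empty are dropped.
    return [(lhs, frozenset(rhs)) for lhs, rhs in merged.items() if rhs]

def remove_one_extraneous(fds):
    """Return fds with one extraneous attribute removed, or None if none exists."""
    for i, (lhs, rhs) in enumerate(fds):
        others = fds[:i] + fds[i + 1:]
        for a in sorted(lhs):  # left-side test: closure under F itself
            if len(lhs) > 1 and rhs <= closure(lhs - {a}, fds):
                return others + [(lhs - {a}, rhs)]
        for a in sorted(rhs):  # right-side test: closure under F'
            reduced = others + [(lhs, rhs - {a})]
            if a in closure(lhs, reduced):
                return reduced
    return None

def canonical_cover(fds):
    cover = merge(fds)
    while True:
        reduced = remove_one_extraneous(cover)
        if reduced is None:
            return cover
        cover = merge(reduced)

F = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")]
Fc = canonical_cover(F)
print(sorted("".join(sorted(l)) + "->" + "".join(sorted(r)) for l, r in Fc))
# ['A->B', 'B->C']
```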
Dependency Preservation
Dependency preservation
• Let Fi be the set of dependencies in F+ that include only attributes in Ri.
• A decomposition is dependency preserving if (F1 ∪ F2 ∪ … ∪ Fn)+ = F+
• Testing for dependency preservation directly from this definition takes exponential time.
• If a decomposition is NOT dependency preserving then checking updates for violation
of functional dependencies may require computing joins, which is expensive.
• Let F be the set of dependencies on schema R and let R1 .. Rn be a decomposition of R.
• The restriction of F to Ri is the set Fi of all functional dependencies in F + that include
only attributes of Ri.
• Since all functional dependencies in a restriction involve attributes of only one
relation schema, it is possible to test such a dependency for satisfaction by checking
only one relation.
• The definition of restriction uses all dependencies in F +, not just those in F.
• The set of restrictions F1 .. Fn is the set of functional dependencies that can be checked
efficiently.
Dependency preservation. Example
• R = (A, B, C )
F = {A → B
B → C}
Key = {A}
• R is not in BCNF
• Decomposition R1 = (A, B), R2 = (B, C)
• R1 and R2 in BCNF
• Lossless-join decomposition
• Dependency preserving
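Both claims about this decomposition can be checked with attribute closures. The sketch below uses two standard textbook tests that the slides do not spell out, so treat them as assumptions brought in for illustration: a binary decomposition is lossless if R1 ∩ R2 functionally determines R1 or R2, and a dependency α → β is preserved if β is reached by repeatedly closing α within each Ri (which avoids computing F+). Function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    """Lossless if the common attributes determine all of R1 or all of R2."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure

def preserves_dependencies(fds, decomposition):
    """Check every lhs -> rhs in F using only closures restricted to each Ri."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                # Attributes of Ri determined by the part of result inside Ri.
                gained = closure(result & set(ri), fds) & set(ri)
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True

F = [("A", "B"), ("B", "C")]

print(lossless_binary("AB", "BC", F))            # True: B -> BC
print(preserves_dependencies(F, ["AB", "BC"]))   # True

# A hypothetical alternative decomposition for contrast: (A, B) and (A, C)
# is lossless, but B -> C can no longer be checked within a single relation.
print(lossless_binary("AB", "AC", F))            # True: A -> AB
print(preserves_dependencies(F, ["AB", "AC"]))   # False
```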
Multivalued Dependency Theory
Theory of MVDs
• From the definition of multivalued dependency, we can derive the following rule:
• If α → β, then α →→ β: every functional dependency is also a multivalued dependency!
• The closure D+ of D is the set of all functional and multivalued dependencies logically
implied by D.
• We can compute D+ from D, using the formal definitions of functional
dependencies and multivalued dependencies.
• We can manage with such reasoning for very simple multivalued dependencies,
which seem to be most common in practice
• For complex dependencies, it is better to reason about sets of dependencies
using a system of inference rules
• Use of multivalued dependencies :
1. To test relations to determine whether they are legal under a given set of
functional and multivalued dependencies
2. To specify constraints on the set of legal relations.
4NF
• A relation schema R is in 4NF with respect to a set D of functional and multivalued
dependencies if for all multivalued dependencies in D+ of the form α →→ β, where α
⊆ R and β ⊆ R, at least one of the following holds:
• α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
• α is a superkey for schema R
• If a relation is in 4NF it is in BCNF!!
Restriction of MVD
• The restriction of D to Ri is the set Di consisting of
• All functional dependencies in D+ that include only attributes of Ri
• All multivalued dependencies of the form
α →→ (β ∩ Ri)
where α ⊆ Ri and α →→ β is in D+
4NF. Example
• R = (A, B, C, G, H, I) F = {A →→ B
B →→ HI
CG →→ H}
• R is not in 4NF since A →→ B and A is not a superkey for R
• Decomposition
a) R1 = (A, B) (R1 is in 4NF)
b) R2 = (A, C, G, H, I) (R2 is not in 4NF, decompose into R3 and R4)
c) R3 = (C, G, H) (R3 is in 4NF)
d) R4 = (A, C, G, I) (R4 is not in 4NF, decompose into R5 and R6)
• A →→ B and B →→ HI ⇒ A →→ HI (MVD transitivity),
• and hence A →→ I (MVD restriction to R4)
e) R5 = (A, I) (R5 is in 4NF)
f) R6 = (A, C, G) (R6 is in 4NF)