SGN-2506 Introduction To Pattern Recognition Handout
Jussi Tohka
Tampere University of Technology
Institute of Signal Processing
2006
September 1, 2006
Preface
This is an English translation of the lecture notes written originally in Finnish for
the course SGN-2500 Johdatus hahmontunnistukseen. The basis for the original
lecture notes was the course Introduction to Pattern Recognition that I lectured at
the Tampere University of Technology during 2003 and 2004. Especially, the course
in the fall semester of 2003 was based on the book Pattern Classification, 2nd Edition
by Richard Duda, Peter Hart and David Stork. The course has thereafter diverged
from the book, but the order of topics during the course is still much the same as
in the book by Duda et al.
These lecture notes correspond to a four credit point course, which has typically
been lectured during 24 lecture-hours. The driving idea behind these lecture notes
is that the basic material is presented thoroughly. Some additional information
is then presented in a more relaxed manner, for example when a formal treatment
would require too much mathematical background. The aim is to provide
the student with a basic understanding of the foundations of statistical pattern
recognition and a basis for advanced pattern recognition courses. This necessarily
means that the emphasis is on the probabilistic foundations of pattern recognition, and
specific pattern recognition techniques and applications get less attention. This may
naturally bother some engineering students but I think the choice was a necessary
one.
Jussi Tohka
Contents
1 Introduction
7 Classifier Evaluation
7.1 Estimation of the Probability of the Classification Error
7.2 Confusion Matrix
7.3 An Example
Chapter 1
Introduction
The term pattern recognition refers to the task of placing some object to a cor-
rect class based on measurements about the object. Usually this task is to
be performed automatically with the help of a computer. Objects to be recognized,
measurements about the objects, and possible classes can be almost anything in the
world. For this reason, there are very different pattern recognition tasks. A system
that makes measurements about certain objects and thereafter classifies these ob-
jects is called a pattern recognition system. For example, a bottle recycling machine
is a pattern recognition system. The customer inputs his/her bottles (and cans) into
the machine, the machine recognizes the bottles, delivers them into the proper containers,
computes the amount of compensation for the customer and prints a receipt for
the customer. A spam (junk-mail) filter is another example of pattern recognition
systems. A spam filter recognizes automatically junk e-mails and places them in a
different folder (e.g. /dev/null) than the user’s inbox. The list of pattern recognition
systems is almost endless. Pattern recognition has a number of applications ranging
from medicine to speech recognition.
Some pattern recognition tasks are everyday tasks (e.g. speech recognition) and
some pattern recognition tasks are not-so-everyday tasks. However, although some
of these tasks seem trivial for humans, it does not necessarily imply that the related
pattern recognition problems would be easy. For example, it is very difficult
to 'teach' a computer to read hand-written text. A part of the challenge follows
from the fact that the letter 'A' written by one person can look very different from the letter
'A' written by another person. For this reason, it is worthwhile to model the variation
within a class of objects (e.g. hand-written 'A's). To model this variation,
we shall during this course concentrate on statistical pattern recognition,
in which the classes and objects within the classes are modeled statistically. For
the purposes of this course, we can further divide statistical pattern recognition
into two subclasses. Roughly speaking, in one we model the variation within object
classes (generative modeling) and in the other we model the variation between the
object classes (discriminative modeling). If understood broadly, statistical pattern
recognition covers a major part of all pattern recognition applications and systems.
Syntactic pattern recognition forms another class of pattern recognition meth-
ods. During this course, we do not cover syntactic pattern recognition. The basic
idea of syntactic pattern recognition is that the patterns (observations about the
objects to be classified) can always be represented with the help of simpler and
simpler subpatterns leading eventually to atomic patterns which cannot anymore
be decomposed into subpatterns. Pattern recognition is then the study of the atomic
patterns and of the language describing the relations between these atomic patterns. The theory
of formal languages forms the basis of syntactic pattern recognition.
Some scholars distinguish yet another type of pattern recognition: The neural
pattern recognition, which utilizes artificial neural networks to solve pattern recog-
nition problems. However, artificial neural networks can well be included within the
framework of statistical pattern recognition. Artificial neural networks are covered
during the courses ’Pattern recognition’ and ’Neural computation’.
Chapter 2
2.1 Examples
Perhaps the best known material used in the comparisons of pattern classifiers is
the so-called Iris dataset. This dataset was collected by Edgar Anderson in 1935. It
contains sepal width, sepal length, petal width and petal length measurements from
150 irises belonging to three different sub-species. R.A. Fisher made the material
famous by using it as an example in the 1936 paper ’The use of multiple measure-
ments in taxonomic problems', which can be considered the first article about
pattern recognition. This is why the material is usually referred to as Fisher's Iris
Data instead of Anderson's Iris Data. The petal widths and lengths are shown
in Figure 2.1.
Figure 2.1: Petal widths (x-axis) and lengths (y-axis) of Fisher’s irises.
A pattern recognition system typically consists of the following stages:

1. Sensing (measurement);
2. Pre-processing and segmentation;
3. Feature extraction;
4. Classification;
5. Post-processing;
A majority of these stages are very application specific. Take the sensing stage
as an example: the measurements required by speech recognition and finger print
identification are rather different as are those required by OCR and Anderson’s Iris
measurements.
For the classification stage, the situation is different. There exists a rather general
theory of classification, and an array of general algorithms can be formulated
for the classification stage. For this reason, the main emphasis during this course
is on the classifier design. However, the final aim is to design a pattern recognition
system. Therefore it is important to gain an understanding of every stage of the
pattern recognition system, although we can give only a few application-independent tools
for the design of these other stages of a pattern recognition system.
Understanding the basics of probability theory is necessary for understanding
statistical pattern recognition. After all, we aim to derive a classifier that
makes as few errors as possible. The concept of the error is defined statistically,
because there is variation among the feature vectors representing a certain class.
Very far reaching results of the probability theory are not needed in an introductory
pattern recognition course. However, the very basic concepts, results and assump-
tions of the probability theory are immensely useful. The goal of this chapter is to
review the basic concepts of the probability theory.
3.1 Examples
The probability theory was born due to the needs of gambling. Therefore, it is
pertinent to start with elementary examples related to gambling.
What is the probability that a card drawn from a well-shuffled deck of cards is
an ace?
In general, the probability of an event is the number of the favorable outcomes
divided by the total number of all outcomes. In this case, there are 4 favorable
outcomes (aces). The total number of outcomes is the number of cards in the deck,
that is 52. The probability of the event is therefore 4/52 = 1/13.
Consider next rolling two dice at the same time. The total number of outcomes
is 36. What is then the probability that the sum of the two dice is equal to 5? The
number of the favorable outcomes is 4 (what are these?). And the probability is
1/9.
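Such counts are easy to verify by brute-force enumeration. A small Matlab sketch (not needed for the theory, but it confirms the count) lists the favorable outcomes of the dice example:

favorable = 0;
for die1 = 1:6
  for die2 = 1:6
    if die1 + die2 == 5
      favorable = favorable + 1;   % the favorable outcomes are (1,4), (2,3), (3,2) and (4,1)
    end
  end
end
probability = favorable / 36       % = 4/36 = 1/9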
2. P (S) = 1.
Theorem 1 Let E and F be events of the sample space S. The probability measure
P of S has the following properties:
1. P (E c ) = 1 − P (E)
2. P (∅) = 0
4. P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F )
5. If F1, . . . , Fn are mutually exclusive events and S = F1 ∪ · · · ∪ Fn, then P(E) = Σ_{i=1}^n P(E ∩ Fi).
The cdf measures the probability mass of all y that are smaller than x.
Let's once again consider the deck of cards and the probability of drawing an ace.
The RV in this case was the map from the labels of the cards to the set {0, 1} with
the value of the RV equal to 1 in the case of an ace and 0 otherwise. Then FX(x) = 0,
when x < 0, FX (x) = 48/52, when 0 ≤ x < 1, and FX (x) = 1 otherwise.
In other words, a value of the pdf pX (x) is the probability of the event X = x. If
an RV X has a pdf pX , we say that X is distributed according to pX and we denote
X ∼ pX .
If the cdf of a continuous RV X can be written as FX(x) = ∫_{−∞}^x pX(y) dy, the function pX is then the probability density function of X. In Figure 3.1, the pdf
and the cdf of the standard normal distribution are depicted.
Figure 3.1: The pdf and the cdf of the standard normal distribution.
In the continuous case, the value of the pdf pX(x) is not a probability: the value of any
integral over a single point is zero, and therefore it is not meaningful to study the
probability of the event X = x at a single point.
• During this course, it is assumed that the integrals are multivariate Riemann
integrals.
• The integral ∫_{−∞}^{x} pX(y) dy is an abbreviation for

∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} pX(y1, y2, . . . , yd) dy1 · · · dyd,
• And most importantly: you do not have to evaluate any integrals in practical
pattern recognition. However, the basic properties of integrals are useful to
know.
P (E ∩ F ) = P (E)P (F ). (3.4)
If E and F are not independent, they are dependent. The independence of events E
and F means that an occurrence of E does not affect the likelihood of an occurrence
of F in the same experiment. Assume that P(F) ≠ 0. The conditional probability
of E relative to F ⊆ S is denoted by P(E|F) and defined as

P(E|F) = P(E ∩ F)/P(F).   (3.5)
This refers to the probability of the event E provided that the event F has already
occurred. In other words, we know that the result of a trial was included in the
event F .
If E and F are independent, then
P(E|F) = P(E ∩ F)/P(F) = P(E)P(F)/P(F) = P(E).

Conversely, if P(E|F) = P(E), then

P(E ∩ F) = P(E)P(F),   (3.6)
and events E and F are independent. The point is that an alternative definition of
independence can be given via conditional probabilities. Note also that the concept
of independence is different from the concept of mutual exclusivity. Indeed, if the
events E and F are mutually exclusive, then P (E|F ) = P (F |E) = 0, and E and F
are dependent.
Definition (3.4) generalizes in a straightforward manner: Events E1 , . . . , En are
independent if
P (E1 ∩ · · · ∩ En ) = P (E1 ) · · · P (En ). (3.7)
Assume that we have more than two events, say E, F and G. They are pairwise
independent if E and F are independent and F and G are independent and E and G
are independent. However, it does not follow from pairwise independence that
E, F and G would be independent. (A classic counterexample: toss two fair coins and let E = {the first toss is heads}, F = {the second toss is heads}, and G = {exactly one toss is heads}; these events are pairwise independent, but P(E ∩ F ∩ G) = 0 ≠ P(E)P(F)P(G) = 1/8.)
refers to the probability density of X1 receiving the value x1 , X2 receiving the value
x2 and so forth in a trial. The pdfs pXi (xi ) are called the marginal densities of X.
These are obtained from the joint density by integration. The joint and marginal
distributions have all the properties of pdfs.
The random variables X1, X2, . . . , Xn are independent if and only if their joint density factors into the product of the marginal densities, i.e. pX1,...,Xn(x1, . . . , xn) = pX1(x1) · · · pXn(xn)
for all x1, . . . , xn. That is to say that the results of the (sub)experiments modeled by
independent random variables are not dependent on each other. The independence
of the two (or more) RVs can be defined with the help of the events related to the
RVs. This definition is equivalent to the definition by the pdfs.
It is natural to attach the concept of conditional probability to random variables.
We have RVs X and Y . A new RV modeling the probability of X assuming that
we already know that Y = y is denoted by X|y. It is called the conditional random
variable of X given y. The RV X|y has a density function defined by
pX|Y(x|y) = pX,Y(x, y)/pY(y)   (3.9)
for all x, y.
P (E ∩ F ) = P (F )P (E|F ).
For this we have assumed that P (F ) > 0. If also P (E) > 0, then
P (E ∩ F ) = P (E)P (F |E).
And furthermore

P(F|E) = P(F)P(E|F)/P(E).
This is the Bayes rule. A different formulation of the rule is obtained via the point
5 in the Theorem 1. If the events F1 , . . . , Fn are mutually exclusive and the sample
space S = F1 ∪ · · · ∪ Fn , then for all k it holds that
P(Fk|E) = P(Fk)P(E|Fk)/P(E) = P(Fk)P(E|Fk) / Σ_{i=1}^n P(Fi)P(E|Fi).
We will state these results and the corresponding ones for the RVs in a theorem:
Assume that X and Y are RVs related to the same experiment. Then
3. pX|Y(x|y) = pX(x)pY|X(y|x)/pY(y);

4. pX|Y(x|y) = pX(x)pY|X(y|x) / ∫ pX(x)pY|X(y|x) dx.
Note that the points 3 and 4 hold even if one of the RVs was discrete and the
other was continuous. Obviously, it can be necessary to replace the integral by a
sum.
The normal distribution has some interesting properties that make it special
among the distributions of continuous random variables. Some of them are worth
knowing for this course. In what follows, we denote the components of µ by µi
and the components of Σ by σij . If X ∼ N (µ, Σ) then
The Bayesian decision theory offers a solid foundation for classifier design. It tells
us how to design the optimal classifier when the statistical properties of the classi-
fication problem are known. The theory is a formalization of some common-sense
principles, but it offers a good basis for classifier design.
About notation. In what follows we are a bit more sloppy with the notation
than previously. For example, pdfs are not anymore indexed with the symbol of the
random variable they correspond to.
The prior probability P (ωi ) defines what percentage of all feature vectors belong
to the class ωi . The class conditional pdf p(x|ωi ), as it is clear from the notation,
defines the pdf of the feature vectors belonging to ωi . This is the same as the density
of X given that the class is ωi . To be precise: The marginal distribution of ω is
known and the distributions of X given ωi are known for all i. The general laws of
probability obviously hold. For example,
Σ_{i=1}^c P(ωi) = 1.
the decision regions define the decision rule: The classifier can be represented by
its decision regions. However, especially for large d, the representation by decision
regions is not convenient. On the other hand, the representation by decision regions often
offers additional intuition on how the classifier works.
where p(x|α(x)), P (α(x)) are known for every x as it was defined in the previous
section. Note that
p(x|α(x))P (α(x)) = p(x, α(x)).
That is, the classification error of α is equal to the probability of the complement
of the events {(x, α(x))}.
With the knowledge of decision regions, we can rewrite the classification error as
E(α) = Σ_{i=1}^c ∫_{Ri} [p(x) − p(x|ωi)P(ωi)] dx = 1 − Σ_{i=1}^c ∫_{Ri} p(x|ωi)P(ωi) dx,   (4.5)
Notation arg maxx f (x) means the value of the argument x that yields the maximum
value for the function f . For example, if f (x) = −(x − 1)2 , then arg maxx f (x) = 1
(and max f (x) = 0). In other words,
the Bayes classifier selects the most probable class when the observed
feature vector is x.
The posterior probability P(ωi|x) is evaluated based on the Bayes rule, i.e. P(ωi|x) = p(x|ωi)P(ωi)/p(x). Note additionally that p(x) is equal for all classes and p(x) ≥ 0 for all
x, and hence we can multiply the right hand side of (4.6) by p(x) without changing the classifier.
The Bayes classifier can be now rewritten as
If two or more classes have the same posterior probability given x, we can freely
choose between them. In practice, the Bayes classification of x is performed by
computing p(x|ωi )P (ωi ) for each class ωi , i = 1, . . . , c and assigning x to the class
with the maximum p(x|ωi )P (ωi ).
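As an illustration, the rule takes only a few lines of Matlab. The class conditional pdfs and priors below are hypothetical placeholders; in practice they come from the modeling assumptions or the estimates discussed in later chapters.

priors = [0.4 0.6];                                % P(omega_1), P(omega_2)
condpdf = {@(x) mvnpdf(x, [0 0], eye(2)), ...      % p(x|omega_1), hypothetical
           @(x) mvnpdf(x, [2 1], 2*eye(2))};       % p(x|omega_2), hypothetical
x = [1.0 0.5];                                     % feature vector to be classified
g = zeros(1, numel(priors));
for i = 1:numel(priors)
  g(i) = condpdf{i}(x) * priors(i);                % p(x|omega_i) P(omega_i)
end
[gmax, omega] = max(g);                            % x is assigned to class omega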
By its definition the Bayes classifier minimizes the conditional error E(α(x)|x) =
1 − P (α(x)|x) for all x. Because of this and basic properties of integrals, the Bayes
classifier minimizes also the classification error:
E(αBayes) ≤ E(α) for any decision rule α.
The classification error E(αBayes ) of the Bayes classifier is called the Bayes error.
It is the smallest possible classification error for a fixed classification problem. The
Bayes error is

E(αBayes) = 1 − ∫_F max_i [p(x|ωi)P(ωi)] dx,   (4.8)

where F denotes the whole feature space.
Finally, note that the definition of the Bayes classifier does not require the as-
sumption that the class conditional pdfs are Gaussian distributed. The class condi-
tional pdfs can be any proper pdfs.
The Bayes minimum risk classifier chooses the action with the minimum
conditional risk.
The Bayes minimum risk classifier is the optimal in this more general setting: it
is the decision rule that minimizes the total risk given by
Rtotal(α) = ∫ R(α(x)|x) p(x) dx.   (4.10)
The Bayes classifier of the previous section is obtained when the action αi is the
classification to the class ωi and the loss function is zero-one loss:
λ(αi|ωj) = 0, if i = j,
λ(αi|ωj) = 1, if i ≠ j.
The number of actions a can be different from the number of classes c. This
is useful e.g. when it is preferable that the pattern recognition system is able to
separate those feature vectors that cannot be reliably classified.
An example about spam filtering illustrates the differences between the minimum
error and the minimum risk classification. An incoming e-mail is either a normal
(potentially important) e-mail (ω1 ) or a junk mail (ω2 ). We have two actions: α1
(keep the mail) and α2 (put the mail to /dev/null). Because losing a normal e-mail
is on average three times more painful than getting a junk mail into the inbox, we
select a loss function
λ(α1|ω1) = 0,  λ(α1|ω2) = 1,
λ(α2|ω1) = 3,  λ(α2|ω2) = 0.
(You may disagree about the loss function.) In addition, we know that P (ω1 ) =
0.4, P (ω2 ) = 0.6. Now, an e-mail has been received and its feature vector is x. Based
on the feature vector, we have computed p(x|ω1 ) = 0.35, p(x|ω2 ) = 0.65.
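The decision is reached by comparing the conditional risks R(α1|x) and R(α2|x). A sketch of the computation with the numbers above:

P = [0.4 0.6];                  % P(omega_1), P(omega_2)
p = [0.35 0.65];                % p(x|omega_1), p(x|omega_2)
lambda = [0 1; 3 0];            % lambda(alpha_i|omega_j), rows index the actions
post = p .* P / sum(p .* P);    % posteriors P(omega_j|x), approximately [0.26 0.74]
R = lambda * post';             % conditional risks R(alpha_1|x) = 0.74, R(alpha_2|x) = 0.79

Since R(α1|x) < R(α2|x), the minimum risk classifier keeps the e-mail, whereas the minimum error classifier of the previous section would have picked ω2 (junk) and deleted it, because p(x|ω2)P(ω2) > p(x|ω1)P(ω1).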
Figure 4.2: Class conditional densities (top) and P (ω1 |x) and P (ω2 |x) and decision
regions (bottom).
where µ ∈ Rd and positive definite d × d matrix Σ are the parameters of the distri-
bution.
We assume that the class conditional pdfs are normally distributed and prior
probabilities can be arbitrary. We denote the parameters for the class conditional
pdf for ωi by µi , Σi i.e. p(x|ωi ) = pnormal (x|µi , Σi ).
We begin with the Bayes classifier defined in Eq. (4.7). This gives the discriminant functions
gi (x) = pnormal (x|µi , Σi )P (ωi ). (4.13)
By replacing these with their logarithms (c.f. Theorem 6) and substituting the
normal densities, we obtain²

gi(x) = −(1/2)(x − µi)ᵀΣi⁻¹(x − µi) − (d/2) ln 2π − (1/2) ln det(Σi) + ln P(ωi).   (4.14)
We distinguish three distinct cases:
1. Σi = σ 2 I, where σ 2 is a scalar and I is the identity matrix;
2. Σi = Σ, i.e. all classes have equal covariance matrices;
3. Σi is arbitrary.
Case 1
Dropping the constants from the right hand side of Eq. (4.14) yields
gi(x) = −(1/2)(x − µi)ᵀ(σ²I)⁻¹(x − µi) + ln P(ωi) = −||x − µi||²/(2σ²) + ln P(ωi),   (4.15)
The symbol || · || denotes the Euclidean norm, i.e.
||x − µi ||2 = (x − µi )T (x − µi ).
Expanding the norm yields
gi(x) = −(1/(2σ²))(xᵀx − 2µiᵀx + µiᵀµi) + ln P(ωi).   (4.16)
The term xT x is equal for all classes so it can be dropped. This gives
gi(x) = (1/σ²)(µiᵀx − (1/2)µiᵀµi) + ln P(ωi),   (4.17)
which is a linear discriminant function with
wi = µi/σ²

and

wi0 = −µiᵀµi/(2σ²) + ln P(ωi).
Decision regions in the case 1 are illustrated in Figure 4.3 which is figure 2.10
from Duda, Hart and Stork.
An important special case of the discriminant functions (4.15) is obtained when
P(ωi) = 1/c for all i. Then x is assigned to the class whose mean vector is closest to
x. This is called the minimum distance classifier, and it is a special case of a linear classifier.
² Note that the discriminant functions gi(x) in (4.13) and (4.14) are not the same functions. gi(x)
is just a general symbol for the discriminant functions of the same classifier. However, a more
prudent notation would be a lot messier.
Case 2
We obtain a linear classifier even if the features are dependent, provided that the covariance
matrices of all classes are equal (i.e. Σi = Σ). It can be represented by the
discriminant functions
gi (x) = wiT x + wi0 , (4.18)
where
wi = Σ−1 µi
and
wi0 = −(1/2)µiᵀΣ⁻¹µi + ln P(ωi).
Decision regions in the case 2 are illustrated in Figure 4.4 which is figure 2.11
from Duda, Hart and Stork.
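To make the formulas concrete, a short Matlab sketch computes the weights and classifies a test point. The means, the common covariance matrix and the priors below are hypothetical values standing in for the known or estimated quantities.

mu = {[0; 0], [3; 1]};                                % class mean vectors (hypothetical)
Sigma = [2 0.5; 0.5 1];                               % common covariance matrix
P = [0.5 0.5];                                        % prior probabilities
x = [1.5; 0.2];                                       % feature vector to be classified
g = zeros(1, 2);
for i = 1:2
  w  = Sigma \ mu{i};                                 % w_i = Sigma^{-1} mu_i
  w0 = -0.5 * mu{i}' * (Sigma \ mu{i}) + log(P(i));   % threshold weight w_i0
  g(i) = w' * x + w0;                                 % discriminant function (4.18)
end
[gmax, omega] = max(g);                               % x is assigned to class omega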
Case 3
Now no additional assumptions about the class conditional pdfs are made. In this
case, the discriminant functions
gi(x) = −(1/2)(x − µi)ᵀΣi⁻¹(x − µi) − (d/2) ln 2π − (1/2) ln det(Σi) + ln P(ωi)

cannot be simplified much further. Only the constant term (d/2) ln 2π can be dropped.
The discriminant functions are no longer linear but quadratic. They have much more
complicated decision regions than the linear classifiers of the two previous cases.
Now, decision surfaces are also quadratic and the decision regions do not have to be
even connected sets.
Decision regions in the case 3 are illustrated in Figure 4.5 which is Figure 2.14
from Duda, Hart and Stork.
where qij , j = 1, . . . , d are parameters for the class conditional density of the class
ωi. The distribution can be interpreted so that the jth feature answers 'yes'
to the question asked with probability qij. (Compare to the coin tossing
experiment.) It is still worth emphasizing that the features are independent of
each other, i.e. the answer to the jth question does not depend on the answers to
the other questions. For a 2-class pattern recognition problem, if q1j > q2j the value
of the object’s jth feature is 1 more often if the object belongs to the class ω1 .
We now show that the Bayes classifier is linear for this problem. For the Bayes
classifier

αBayes(x) = arg max_{ωi, i=1,...,c} P(ωi|x),
Replacing the right hand side by its logarithm and simplifying gives

gi(x) = Σ_{j=1}^d [xj ln qij + (1 − xj) ln(1 − qij)] + ln P(ωi).   (4.21)

This is a linear discriminant function, gi(x) = Σ_{j=1}^d wij xj + wi0,
where
wij = ln( qij / (1 − qij) ),
and

wi0 = Σ_{j=1}^d ln(1 − qij) + ln P(ωi).
We will end this section (and chapter) by studying how the value of the clas-
sification error behaves when we add more and more independent features. This
is somewhat more pleasant in the discrete case, where we do not need to evaluate
integrals numerically. However, in order to make the computations feasible for
large d, we must make some additional simplifying assumptions. Here the results
are more important than their derivation and we will keep derivations as brief as
possible.
The classification error is
E(αBayes) = 1 − Σ_x max_i P(x|ωi)P(ωi),   (4.22)
where x is summed over all d-bit binary vectors. Again, the greater the probability
of the most probable class the smaller the classification error. Consider the 2-class
case. We model the classes using Binomial densities, that is qij = qi for all features
j and
P(x|ωi) = P(k|qi) = (d!/(k!(d − k)!)) qi^k (1 − qi)^{d−k},  i = 1, 2,
where k is the number of 1s in the feature vector x. Let us still assume that the
prior probabilities are equal to 0.5 and q1 > q2 . The class of the feature vector is
decided by how many 1s it contains. The Bayes classifier is linear and it classifies x
to ω1 if the number of 1s in x exceeds a certain threshold t. The threshold t can be
computed from (4.21). The classification error is
E(αBayes) = 1 − [ Σ_{k=0}^t (d!/(k!(d − k)!)) 0.5 q2^k (1 − q2)^{d−k} + Σ_{k=t+1}^d (d!/(k!(d − k)!)) 0.5 q1^k (1 − q1)^{d−k} ].
Note that we sum over all k and not over all x. (There are many fewer k than there
are x.)
The classification error with respect to d when q1 = 0.3, q2 = 0.2 is plotted
in Figure 4.6. Note that the classification error diminishes when the number of
features d grows. This is generally true for the Bayes classifiers. However, this does
not mean that other, non-optimal classifiers share this property. For example, if we
would have (for some reason) selected t = ⌊d/2⌋ (which does not lead to the Bayes
classifier), then we would obtain a classification error that grows towards 0.5
with d. (⌊d/2⌋ refers to the greatest integer that is less than or equal to d/2.)
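The error plotted in Figure 4.6 can be computed directly from Eq. (4.22), without solving for the threshold t explicitly. A sketch (binopdf belongs to the Statistics Toolbox):

q1 = 0.3; q2 = 0.2;                        % class conditional parameters
dmax = 100;
E = zeros(1, dmax);
for d = 1:dmax
  k = 0:d;                                 % possible numbers of 1s in the feature vector
  p1 = binopdf(k, d, q1);                  % P(k|q1)
  p2 = binopdf(k, d, q2);                  % P(k|q2)
  E(d) = 1 - sum(max(0.5*p1, 0.5*p2));     % Eq. (4.22) with equal priors
end
plot(1:dmax, E); xlabel('number of features d'); ylabel('classification error');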
Figure 4.6: Classification error when class conditional pdfs are Binomial.
Chapter 5
On the other hand, the prior probabilities cannot be deduced based on the separate
sampling, and in this case, it is most reasonable to assume that they are known.
The estimation of the class conditional pdfs is more difficult: we now need to estimate
a function p(x|ωi) based on a finite number of training samples. The next sections
are dedicated to different approaches for estimating the class conditional pdfs.
Di = {xi1 , . . . , xini }.
Hence

θ̂ = arg max_θ Π_{i=1}^n p(xi|θ).   (5.2)
2. Solve for the zeros of the gradient and also search for all the other critical points of
the (log-)likelihood function. (For example, the points where at least one of the partial
derivatives of the function is not defined are critical, in addition to the points
where the gradient is zero.)
3. Evaluate the (log)-likelihood function at the critical points and select the crit-
ical point with the highest likelihood value as the estimate.
A word of warning: the ML-estimate does not necessarily exist for all distribu-
tions, for example the likelihood function could grow without limit. However, the
ML-estimates exist for simple parametric families.
p(x|θ) = (1 / ((2π)^{d/2} √det(Σ))) exp[ −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) ].
Consider first the mean vector, i.e. θ = µ. The covariance matrix Σ is assumed
to be known and fixed. (We will see that fixing the covariance will not affect the
estimate for mean.) The log-likelihood is
l(θ) = Σ_{i=1}^n ln p(xi|µ) = −Σ_{i=1}^n 0.5( ln[(2π)^d det(Σ)] + (xi − µ)ᵀΣ⁻¹(xi − µ) ).
Its gradient is

∇l(µ) = Σ_{i=1}^n Σ⁻¹(xi − µ).

Setting the gradient to zero yields the estimate

µ̂ = (1/n) Σ_{i=1}^n xi.   (5.4)
It is easy to see that this is (the only) local maximum of the log-likelihood function,
and therefore it is the ML-estimate. Note that in this case, the ML-estimate is the
sample mean of D. As we already hinted, the exact knowledge of the covariance was
not necessary for obtaining the ML-estimate for the mean. Hence, the result holds
even if both µ and Σ were unknown.
The derivation of the ML-estimate for the covariance matrix is more complicated.
Hence, we will skip it and just present the result of the derivation. The ML -estimate
is

Σ̂ = (1/n) Σ_{i=1}^n (xi − µ̂)(xi − µ̂)ᵀ,   (5.5)
where µ̂ is the ML estimate for µ defined in Eq. (5.4).
A numeric example:
mu = 6.2000
46.8000
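In practice, the ML estimates (5.4) and (5.5) are computed directly from the data matrix. A minimal sketch, with hypothetical training samples stored as the rows of X:

X = [2.1 4.0; 1.8 3.5; 2.5 4.4; 2.0 3.9; 2.3 4.2];   % hypothetical training samples, one per row
n = size(X, 1);
mu_hat = mean(X)';                                   % ML estimate of the mean, Eq. (5.4)
Xc = X - repmat(mu_hat', n, 1);                      % centered samples
Sigma_hat = (Xc' * Xc) / n;                          % ML estimate of the covariance, Eq. (5.5)

In Matlab, Sigma_hat coincides with cov(X, 1), i.e. the covariance normalized by n instead of n − 1.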
When n approaches infinity, the ML-estimate is the best unbiased estimate in the
sense that the variance of the ML-estimate is the smallest possible. In addition,
the ML-estimate is asymptotically consistent. This means that the ML-estimate is
arbitrarily close to the true parameter value with the probability of 1 when n tends
to infinity.
Almost all optimality properties of the ML-estimate are so called large sample
properties. If n is small, little can be said about the ML-estimates in general.
Sigma1 =
0.0764 0.0634 0.0170 0.0078
0.0634 0.0849 0.0155 0.0148
0.0170 0.0155 0.0105 0.0040
0.0078 0.0148 0.0040 0.0056
And that is it! We can now classify e.g. the feature vector x = [ 5.9 4.2 3.0 1.5 ]T
by computing the values of the discriminant functions (4.13) for all three classes.
The classifier assigns x to ω2 , or in other words, the flower in question is iris
versicolor.
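A sketch of this computation in Matlab, assuming that the Iris data is available as in Section 7.3 (classes ω1, ω2 and ω3 occupying rows 1-50, 51-100 and 101-150 of the matrix meas, in the same feature order as x) and that the priors are equal:

load fisheriris                          % requires the Statistics Toolbox
x = [5.9 4.2 3.0 1.5];                   % feature vector to be classified
g = zeros(1, 3);
for i = 1:3
  Xi = meas((i-1)*50+1 : i*50, :);       % training samples of class omega_i
  mui = mean(Xi);                        % ML estimate of the mean
  Sigmai = cov(Xi, 1);                   % ML estimate of the covariance
  g(i) = mvnpdf(x, mui, Sigmai) / 3;     % discriminant (4.13) with equal priors
end
[gmax, omega] = max(g);                  % the class assigned to the flower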
5.3.1 Histogram
Histograms are the simplest approach to density estimation. The feature space is
divided into m equal sized cells or bins Bi. (See Figure 5.1.) Then, the number ni
of training samples falling into each cell Bi, i = 1, . . . , m, is computed. The density
estimate is
p̂(x) = ni/(nV),  when x ∈ Bi,   (5.6)
where V is the volume of the cell. (All the cells have equal volume and an index is
not needed.)
The histogram estimate - with a definition as narrow as this one - is not a
good way to estimate densities, especially so when there are many (continuous-type)
features. Firstly, the density estimates are discontinuous. Secondly, a good
size for the cells is hard to select.
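A minimal sketch of a one-dimensional histogram estimate of the form (5.6), with hypothetical data and cell size:

x = randn(1, 200);               % hypothetical 1-d training samples
h = 0.5;                         % cell width, so the volume V = h
edges = -4:h:4;                  % boundaries of the cells B_i
counts = histc(x, edges);        % n_i, the number of samples in each cell
n = numel(x);
p_hat = counts / (n * h);        % density estimate (5.6)
bar(edges, p_hat, 'histc');      % piecewise constant density estimate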
We still need the probability that k from n training samples fall into the set B.
The training samples are independent, and each of them is in the set B with the
probability P . Hence, we can apply the binomial density to the situation. That is,
there are exactly k samples in the set B with the probability
Pk = (n!/(k!(n − k)!)) P^k (1 − P)^{n−k},   (5.8)

where n!/(k!(n − k)!) is the number of different subsets of D of size k.
Based on the Binomial theorem, we obtain the expected value of k. It is
E[k] = Σ_{k=0}^n k Pk = nP Σ_{k=1}^n ((n−1)!/((k−1)!(n−k)!)) P^{k−1} (1 − P)^{n−k} = nP.   (5.9)
(We have skipped some algebra.) Now, we substitute E[k] by the observed number
k̂ of training samples falling into B. We obtain an estimate for P :
P̂ = k̂/n. (5.10)
where V is the volume of B. This gives the estimate (or an interpretation for the
estimate)
p̂(x) = k̂/(nV) ≈ ∫_B p(x) dx / ∫_B dx.   (5.11)
We can draw the following conclusions:
• The obtained density estimate is a space averaged version of the true density.
The smaller the volume V the more accurate the estimate is. However, if n is
fixed, diminishing V will lead sooner or later to B which does not contain any
training samples and the density estimate will become useless.
• On a more general level, two facts need to be accepted: 1) The density estimate is
always a space averaged version of the true density. 2) The estimate k̂ is only
an estimate of nP when n is finite.
The principal question is how to select B and V . Let us assume that we have
unlimited number of training data. We can then approach this problem in the
following manner: To estimate p(x), we form a series of cells B1 , B2 , . . . containing
x. The cell B1 is used when we have just one training sample, the cell B2 is used
when we have two training samples, and so forth. The volume of Bn is denoted by
Vn and the number of training samples falling into Bn is denoted by kn . The density
estimate based on these is
pn(x) = kn/(nVn).
If we want that

lim_{n→∞} pn(x) = p(x),

the following three conditions must hold:
1. limn→∞ Vn = 0
2. limn→∞ kn = ∞
3. limn→∞ kn /n = 0.
The conditions may seem rather abstract but they are in fact natural. The first
condition assures us that the space averaged pn (x) will converge to p(x) if there
are enough training samples. The second one assures us that the frequency ratio
kn /n will be a good enough estimate for P . The third one states that the number
of samples falling to a region Bn is always a negligibly small portion of the total
number of samples. This is required if pn (x) is to converge at all.
How to guarantee the conditions in the limit of unlimited number of training
data? There are two alternatives. We may fix Vn and estimate kn in the best way
possible. Or we can fix kn and estimate Vn. The precise definition of Bn is not
necessary in either alternative (though it may be implied).
The optimality results about the density estimators are necessarily of asymptotic
nature. Rosenblatt showed already in 1956 that in the continuous case density
estimates based on finite n are necessarily biased.
The above estimate differs from the histogram estimate defined in Section 5.3.1 in that we do not
define the cells a priori. Rather, the cells are defined only after observing x, and they
can also intersect each other for different x.
The estimate pn (x) is an average of values of the window function at different points.
(See figure 5.2.) Typically the window function has its maximum at the origin and
its values become smaller when we move further away from the origin. Then each
training sample contributes to the estimate in accordance with its distance from
x. The normal density ϕ(u) = (1/(2π)^{d/2}) exp[−0.5 uᵀu] is the most widely applied
window function. Usually Vn = hn^d, but the geometric interpretation of Vn as the
volume of a hypercube is not completely valid anymore. We will discuss the selection
of window length hn in section 5.4.4.
Sometimes Parzen estimates are called kernel estimates. The Parzen estimates
were introduced by Emanuel Parzen in 1962. However, Rosenblatt studied similar
estimates already in 1956.
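A sketch of a one-dimensional Parzen estimate with the normal window function; the training data and the window length hn below are hypothetical:

data = [randn(1,100), 3 + 0.5*randn(1,60)];   % hypothetical training samples
hn = 0.4;                                     % window length
xgrid = linspace(-4, 6, 200);                 % points where the density is estimated
n = numel(data);
phat = zeros(size(xgrid));
for j = 1:n
  u = (xgrid - data(j)) / hn;
  phat = phat + exp(-0.5*u.^2) / sqrt(2*pi);  % normal window function phi(u)
end
phat = phat / (n * hn);                       % p_n(x) = (1/n) sum_j (1/h_n) phi((x - x_j)/h_n)
plot(xgrid, phat);

Each training sample contributes a small 'bump' to the estimate, and the window length hn controls how smooth the resulting density is.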
4. Vn → 0 when n → ∞.
5. nVn → ∞ when n → ∞.
In addition to the assumptions about the window function and the window length,
we needed to make assumptions about the true density. The requirements concerning
the window function are natural and e.g. the normal density is in accordance with
them.
For pattern recognition, these optimality/convergence properties are important,
because with these properties we are able to show that the classification error of
a classifier based on Parzen windows tends to the Bayes error when n approaches
infinity. This holds also more generally: When the density estimates converge (in
a certain sense) to the true class conditional pdfs, and the prior probabilities are
properly selected, then the error of the resulting classifier tends to the Bayes error.
We do not go into detail with the conditions for the convergence, since their proper
treatment would require too much effort.
on the estimates. The hard part is that it is not possible to estimate p(·|ωi ) for
all possible feature vectors x. For each test point x to be classified, we must first
compute the Parzen estimate of the each class conditional density, and only after
that we can classify the test point. The Parzen-classifier can be written simply as
αParzen(x) = arg max_{ωi, i=1,...,c} P̂(ωi) (1/ni) Σ_{j=1}^{ni} (1/hi^d) ϕ((x − xij)/hi),   (5.15)
where P̂ (ωi ) are the estimates for prior probabilities, xi1 , . . . , xini are training sam-
ples from ωi , ϕ is the window function (usually the same window function is used
for all classes), and hi is the window width for the class ωi .
As can be seen from Eq. (5.15), Parzen classifiers demand much more computation
than the classifiers based on ML-estimation. Every classification with
Parzen classifiers requires n evaluations of a pdf, where n is the total number of
training samples.
Note that by substituting P̂ (ωi ) = ni /n in Eq. (5.15), the Parzen classifier can
be rewritten as
αParzen(x) = arg max_{ωi, i=1,...,c} (1/n) Σ_{j=1}^{ni} (1/hi^d) ϕ((x − xij)/hi).
where the last equality is due to our normalization assumptions. Notation a(x) ∝
b(x) means that a(x) = C · b(x) for some constant C and for all x.
This posterior probability estimate is maximal for the class with the highest ki and
x is assigned to that class. This is the k nearest neighbor (KNN) classification rule.
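A sketch of the KNN rule with Euclidean distances; the training data, the labels and k below are hypothetical placeholders:

Xtrain = [randn(20,2); 2.5 + randn(20,2)];                  % hypothetical training samples, one per row
labels = [ones(20,1); 2*ones(20,1)];                        % their classes
x = [1.2 1.0];                                              % test point
k = 5;
d2 = sum((Xtrain - repmat(x, size(Xtrain,1), 1)).^2, 2);    % squared Euclidean distances to x
[d2sorted, order] = sort(d2);
nearest = labels(order(1:k));                               % classes of the k nearest neighbours
ki = histc(nearest, 1:2);                                   % k_i for each class
[kmax, omega] = max(ki);                                    % x is assigned to the class with the largest k_i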
The upper bound is reached when the true class conditional pdfs are identical for
all classes.
The rather positive error bound above is a theoretical one. In real life, we
do not have an infinite number of training samples, and there are no corresponding
results for the case when we have a finite number of training samples. However, the
NN classifier is very simple, and in practice it often works well. We explain this
heuristically. Denote the nearest neighbor of the test point x by x∗ . Let the class of
x∗ be ω ∗ . Because we assume that the training samples have been correctly classified,
ω ∗ maximizes the posterior P (ωi |x∗ ) with respect to classes ωi , i = 1, . . . , c with high
probability. Then, if x is close to x∗ , ω ∗ maximizes also the posterior P (ωi |x) with
respect to classes ωi , i = 1, . . . , c with high probability if the posteriors are smooth
enough.
We finally consider the geometry of the decision regions of the NN rule. The
decision region for the class ωi is formed by the cells whose points are closer to some
training sample in Di than to any other training sample. This kind of partition of the
feature space is termed a Voronoi diagram. See Figure 5.3 due to Duda, Hart and
Stork.
In the two-class case, the following error estimate holds in the case of an unlimited
number of training samples:

E(αBayes) ≤ E(αknn) ≤ E(αBayes) + √( 2E(αnn)/k ).
Also, from this error estimate it follows that the KNN rule is optimal when k → ∞.
In real life, the situation is once again different: Especially when the feature space is
high-dimensional, the required number of training samples grows very rapidly with
k.
The KNN classifier and the NN classifier suffer from the same practical problem
as the Parzen classifier. The computational complexity of classification grows lin-
early with the number of training samples, because the distance from each training
sample to the test point must be computed before the classification is possible. This
may be a problem if the classifier needs to be very fast. Let us note, however, that
there exist several techniques to speed up KNN classifiers.
5.5.4 Metrics
We have assumed until now that the distance between the points a and b in the
feature space (feature vectors) is measured with the Euclidean metric:
L(a, b) = ||a − b|| = √( Σ_{i=1}^d (ai − bi)² ).
However, many other distance measures or metrics can be applied. The selection
of a metric obviously affects the KNN classification results. In particular, the scaling
of the features has a fundamental effect on the classification results, although this
transformation merely amounts to a different choice of units for the features.
Formally, a metric is a function L(·, ·) from Rd × Rd to R. For all vectors a, b, c,
a metric L must satisfy
1. L(a, b) ≥ 0

2. L(a, b) = 0 if and only if a = b

3. L(a, b) = L(b, a)

4. L(a, b) ≤ L(a, c) + L(c, b)
The only one of these properties which we occasionally want to sacrifice is the
property 2. Other properties are important for pattern recognition applications.
Some examples of useful metrics:
• Minkowski metrics:
Lm(a, b) = ( Σ_{i=1}^d |ai − bi|^m )^{1/m}.
• L-infinity metric
L∞(a, b) = max_i |ai − bi|.
• Mahalanobis distance

LMahalanobis,C(a, b) = √( (a − b)ᵀ C⁻¹ (a − b) ),

where C is a positive definite d × d matrix. (A small computational sketch of these metrics is given below.)
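A small numerical sketch of these distance measures; the vectors and the matrix C are hypothetical:

a = [1; 2; 3];  b = [2; 0; 4];
m = 3;
Lm    = sum(abs(a - b).^m)^(1/m);          % Minkowski metric L_m
Linf  = max(abs(a - b));                   % L-infinity metric
C     = [2 0.3 0; 0.3 1 0.2; 0 0.2 1.5];   % a positive definite matrix
Lmaha = sqrt((a - b)' * (C \ (a - b)));    % Mahalanobis distance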
Before studying the sources of the test error, we recall what inputs the different
classifiers required for their design. First, all supervised classifiers need training
samples. Second, the classifiers based on the ML-estimation need knowledge about
the parametric families of the class conditional pdfs. Parzen classifiers need to be
given the window functions and window lengths. For KNN classifiers, we need to
select the parameter k.
The Bayes error is a necessary error source of every pattern classification prob-
lem. This error can never be reduced without changing the classification problem.
If we take into account the whole pattern recognition system, we can change the
classification problem e.g. by adding an extra feature. This can possibly reduce the
Bayes error.
The second source of classification error with practical classifiers is that the
probabilistic model for the problem is incorrect. This is termed model error. In
ML-based classification, these are due to having an incorrect (parametric) model for
the class conditional densities, i.e. the data from the class ωi is not distributed according
to any of the densities p(·|ωi , θi ). In the non-parametric case, we interpret a bad
selection of window functions, window lengths, or the parameter k as model errors.
The model error is more important with the parametric ML-based classifiers than
with the non-parametric classifiers. This type of error can be reduced by selecting
a better probabilistic model for the problem. Again, this is rarely straightforward.
The third and final error source is the estimation error. This error is due to having
only a finite number of training samples. This error is more pronounced in the
non-parametric case where more training samples are required than with ML-based
classifiers. In the ML-estimation, the larger the parametric family the lesser the
model error but the greater the estimation error, i.e. there is a trade-off between
the model and estimation errors. The estimation error can be reduced by adding
more training samples.
In general, if we add features to the classifier, we also increase the estimation
error. Further, the required number of training samples grows exponentially with
the number of features in the non-parametric case. Hence, adding one more feature
to the pattern recognition system might not always be a good idea. The phenomenon
is often termed the curse of dimensionality.
Chapter 6
6.1 Introduction
To this point, we have designed classifiers by estimating the class conditional pdfs
and prior probabilities based on training data. Based on the estimated class condi-
tional pdfs and prior probabilities, we have then derived the Bayes classifier. This
way the classifier minimizes the classification error when the estimated pdfs and
priors are close to the true ones. The problem was that the estimation problems
are challenging, and with a finite number of training samples, it is impossible to
guarantee the quality of the estimates.
In this Chapter, we consider another way to derive the classifier. We assume that
we know the parametric forms of the discriminant functions in a similar way as with
the ML estimation. The estimation will be formulated as a problem of minimizing a
criterion function - again much in the same way as with the ML technique. However,
now we aim directly at finding the discriminant functions without first estimating
the class conditional pdfs. An example illustrates positive features of this type of
classifier design: Consider a two-category classification problem, where the class
conditional pdfs are normal densities with equal covariances. Then, as stated in
Section 4.6, the Bayes classifier is linear, and it can be represented with the following
discriminant function:

g(x) = wᵀx + w0,

where
w = Σ−1 (µ1 − µ2 )
and
w0 = −(1/2)(µ1 + µ2)ᵀΣ⁻¹(µ1 − µ2) + ln P(ω1) − ln P(ω2).
For deriving this discriminant function, it was necessary to estimate 2d + (d² + d)/2 + 1
parameter values. The first term is due to the two mean vectors, the second is
due to the single (symmetric) covariance matrix, and the third one is due to the
priors which sum to one. Hence, it might be reasonable to estimate directly the
discriminant function, which has only d + 1 parameter values: d of them from
the weight vector w, and one of them from the threshold weight w0 . In addition,
the parameterization of the discriminant functions may sometimes be more natural
than the parameterization of the class conditional pdfs.
The estimation of w and w0 will be performed by minimizing a criterion function
as it was already mentioned. The criterion can be e.g. the number of misclassified
training samples but more often it is some related function. Note, however, that
in this case we will lose the direct contact with the test error, which is a much more
important quantity than the training error.
for all i 6= j. (If gi (x) = gj (x) we can use some arbitrary rule such as classifying
x to ωi if i < j. As with the Bayes classifier, the selection of such a rule bears no
importance.)
A discriminant function is said to be linear if it can be written as

gi(x) = wiᵀx + wi0 = Σ_{j=1}^d wij xj + wi0,
where wi = [wi1, . . . , wid]ᵀ is the weight vector and the scalar wi0 is the threshold
weight. A classifier relying only on linear discriminant functions is called
a linear classifier or a linear machine.
Moreover, we can also characterize the decision regions of a linear classifier: they are convex.
Figure 6.2: Only the left-most of above sets is convex and the others are not convex.
To prove the convexity of the decision regions of linear classifiers, consider two
points r1 , r2 belonging to the decision region Ri . Because these points belong to Ri ,
it holds that
gi (r1 ) > gj (r1 ), gi (r2 ) > gj (r2 ) (6.2)
for all j 6= i. Now, consider the point (1 − λ)r1 + λr2 , where λ ∈ [0, 1]. Due to
linearity of the discriminant functions
gi ((1 − λ)r1 + λr2 ) = wiT ((1 − λ)r1 + λr2 ) + wi0 > gj ((1 − λ)r1 + λr2 ). (6.3)
This means that the decision regions are convex. It is an exercise to prove that (6.3)
follows from (6.2).
The convexity of the decision regions naturally limits the number of classification
tasks which can effectively be solved by linear machines.
Di = {xi1, . . . , xini},  i = 1, 2.

If there exists a linear discriminant function g such that

g(x1j) > 0 for all j = 1, . . . , n1, and g(x2j) < 0

for all j = 1, . . . , n2, we say that the training sets/samples D1 and D2 are linearly
separable.
Two-class linear classifiers and the above condition can be presented in a more
compact form if we make two modifications to the notation:
1) We form the augmented feature vector y based on x:
y = [1, x1 , . . . , xd ]T .
Similarly, the augmented weight vector is obtained from w and w0 :
a = [w0 , w1 , . . . , wd ]T = [w0 , wT ]T .
Linear discriminant functions can now be written as
g(x) = wT x + w0 = aT y = g(y).
2) We denote by yij the augmented feature vector generated from the training
sample xij . In the 2-class case, we reduce the two training sets D1 , D2 to a single
training set by simply replacing the training samples from ω2 by their negatives.
This is because

aᵀy2j < 0 ⇔ aᵀ(−y2j) > 0.
Hence, we can forget about ω2 , replace its training samples by their negatives, and
attach them to the training set for ω1 . We denote the resulting training set by
D = {y1 , . . . , yn1 +n2 }. Note that ’replacement by negatives’ must be performed
expressly for augmented feature vectors. (Why?).
As an example, consider the training samples
D1 = {[1, 1]T , [2, 2]T , [2, 1]T }
and
D2 = {[1, −1]T , [1, −2]T , [2, −2]T }.
The modified training set is then
D = {[1, 1, 1]T , [1, 2, 2]T , [1, 2, 1]T , [−1, −1, 1]T , [−1, −1, 2]T , [−1, −2, 2]T }.
Given that these two modifications have been performed, we can re-define linear
separability: The training set (samples) D is linearly separable if there exists an
augmented weight vector a such that
aT yj > 0 (6.4)
for all y1 , . . . , yn1 +n2 .
If we assume that the training samples are linearly separable, then we can try
to solve the system of inequalities (6.4) to design the linear classifier. One possible
solution method is introduced in the next section. However, a solution - if exists -
is not unique. 1) If we have a solution vector a which is a solution to the system
(6.4), then we can always multiply this vector a positive constant to obtain another
solution. (Remember that the weight vectors were normals to the decision surfaces).
2) It is possible to have multiple decision surfaces that separate the training sets.
Then, we have many (genuinely different) augmented weight vectors that solve the
system (6.4).
aT yj > 0
The function Jp is termed the Perceptron criterion. In other words, we sum the inner
products of the weight vector a and those training samples that are incorrectly
classified; the Perceptron criterion is the negative of that sum.
The value Jp(a) is always greater than or equal to zero. The Perceptron criterion
obtains its minimum (that is zero) when a solves the system of inequalities (6.4).
(There is also a degenerate case. When a = 0, Jp (a) = 0. However, this obviously is
not the solution we are searching for.) To sum up, we have converted the problem
of solving inequalities into an optimization problem.
The criterion −Σ_{j: aᵀyj < 0}(−1), that is, the number of misclassified training samples, would also fulfill the above requirements. However,
this criterion function is not continuous which would lead to numerical difficulties.
Instead, Jp is continuous.
Perceptron algorithm
Set t ← 0. Initialize a(0), η, ε.
while −Σ_{j: a(t)ᵀyj < 0} a(t)ᵀyj > ε do
a(t + 1) = a(t) + η Σ_{j: a(t)ᵀyj < 0} yj.
Set t ← t + 1.
end while
Return a(t).
Two parameters - in addition to the initial value a(0) - need to be set: the
stopping condition ε and the step length η. Sometimes the step length is called the
learning rate. If the training samples indeed are linearly separable, we can set ε to zero.
The reason for this is that the Perceptron algorithm converges to the solution in
a finite number of steps as long as the training samples are linearly separable. If
the training samples are not linearly separable, the Perceptron algorithm does not
converge and it is not a good choice in the linearly non-separable case. The step
length can be set to 1.
We illustrate the Perceptron algorithm by continuing the example of Section 6.3
and designing a linear classifier for those training samples. We select a(0) = [1, 1, 1]T .
In the first iteration, samples y4, y5, and y6 are incorrectly classified. Hence,

a(1) = a(0) + η(y4 + y5 + y6) = [−2, −3, 6]T .

In the second iteration, only y3 is incorrectly classified, and

a(2) = a(1) + ηy3 = [−1, −1, 7]T .

This is also the final result since all the training samples are now correctly classified.
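The iteration is easy to carry out in Matlab. A sketch using the modified training set D of Section 6.3 (the vectors yj are the columns of Y), with η = 1 and with samples satisfying a(t)ᵀyj ≤ 0 treated as misclassified, as in the example above:

Y = [ 1  1  1 -1 -1 -1;
      1  2  2 -1 -1 -2;
      1  2  1  1  2  2 ];           % columns are y_1, ..., y_6
a = [1; 1; 1];                      % a(0)
eta = 1;
while any(Y' * a <= 0)              % some training sample is misclassified
  mis = (Y' * a <= 0);              % misclassified training samples
  a = a + eta * sum(Y(:, mis), 2);  % batch Perceptron update
end
a                                   % a separating weight vector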
Di = {yi1 , . . . , yini }, i = 1, . . . , c.
We say that these training sets are linearly separable if there exists a linear clas-
sifier which classifies all the training samples correctly. The task is to find such
(augmented) weight vectors a1, . . . , ac such that always when y ∈ Di, it holds that aiᵀy > ajᵀy for all j ≠ i.
As in the 2-class case, this algorithm works only if the training samples are
linearly separable.
Note that we do not need to assume that Y is non-singular or square. The
selection of the vector 1 is based on the relation with the Bayes classifier. For more
information on this, see section 5.8.3 in Duda, Hart and Stork.
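If the procedure in question is the minimum squared-error approach of Duda, Hart and Stork (Section 5.8), the weight vector is obtained by solving Ya ≈ 1 in the least-squares sense with the pseudoinverse. A sketch, using the modified training samples of Section 6.3 as the rows of Y purely as a hypothetical example:

Y = [ 1  1  1;
      1  2  2;
      1  2  1;
     -1 -1  1;
     -1 -1  2;
     -1 -2  2 ];              % rows are the modified augmented training samples
b = ones(size(Y,1), 1);       % the vector 1
a = pinv(Y) * b;              % least-squares solution of Y a = b
% a is then used as the weight vector of the linear classifier g(y) = a' * y.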
Chapter 7
Classifier Evaluation
into two sets Dtraining and Dtest . We use Dtraining for the training of the classifier
and Dtest is solely used to estimate the test error. This method is called the holdout
method. It is a good method when we have a large number of training samples, and
hence a modest decrease in the number of training samples does not essentially
decrease the quality of the classifier. Note that the division into sets Dtraining and
Dtest must be randomized. For example, selecting Dtest = D1 is not a good idea.
In the case that we have only a few training samples, it is best to use some
cross-validation method to estimate the test error. The simplest of these is
the leave-one-out method. In the leave-one-out method, we drop a single training
sample from the training set and design a classifier based on the other training
samples. Then, it is studied whether the dropped sample is classified correctly or
not by the designed classifier. This is repeated for all n training samples. The test
is the number of test samples that are assigned to the class ωi but whose correct
class would be ωj. The order of the indices i, j can also be selected differently, and it
varies from one reference to another. Hence, it pays to be attentive. The
confusion matrix can be estimated by the same methods as the test error.
Based on the confusion matrix, the classification (test) error can be obtained as
Ê(α) = ( Σ_{i=1}^c Σ_{j=1}^c sij − Σ_{i=1}^c sii ) / Σ_{i=1}^c Σ_{j=1}^c sij.
Based on the confusion matrix, also other important quantities for classifier design
can be computed.
7.3 An Example
Let us illustrate the leave-one-out-method in the case of Fisher’s Iris dataset. We
study the classification based on ML-estimation, and we assume that the class condi-
tional pdfs are Gaussian. The data was collected by separate sampling and therefore
we assume that the prior probabilities for all classes are equal. This data can be
loaded into Matlab by the load fisheriris
command. (At least, if the Statistics Toolbox is available.) There should now be
variables ’meas’ and ’species’ in Matlab. The variable ’meas’ contains the feature
vectors from 150 Irises and the variable ’species’ contains the class of each feature
vector. The feature vectors numbered from 1 to 50 belong to the class ω1 (iris
setosa), the feature vectors numbered from 51 to 100 belong to the class ω2 (iris
versicolor), and the feature vectors numbered from 101 to 150 belong to the class ω3
(iris virginica). This simplifies the implementation of the leave-one-out method in
Matlab. The following piece of Matlab-code estimates the confusion matrix by the
leave-one-out method:
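(A sketch of such a listing is given below; the exact code may differ, but it relies on the same ingredients commented on afterwards. Rows 1-50, 51-100 and 101-150 of meas contain the three classes, as noted above.)

load fisheriris
s = zeros(3,3);                              % confusion matrix
for i = 1:150
  trueclass = ceil(i/50);                    % correct class of the left-out sample
  g = zeros(1,3);
  for j = 1:3
    % training samples of class j with sample i left out;
    % one of the index ranges may be empty, e.g. meas(56:50,:)
    X = [meas((j-1)*50+1 : min(i-1, j*50), :); ...
         meas(max(i+1, (j-1)*50+1) : j*50, :)];
    muj = mean(X);                           % ML estimate of the mean
    Sigmaj = cov(X, 1);                      % ML estimate of the covariance
    g(j) = mvnpdf(meas(i,:), muj, Sigmaj);   % equal priors, so P(omega_j) can be omitted
  end
  [gmax, assigned] = max(g);
  s(assigned, trueclass) = s(assigned, trueclass) + 1;
end
s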
Above, 'ceil(x)' returns the least integer that is larger than or equal to the real number
x. We have also used some Matlab-specific programming tricks: for example,
meas(56:50,:) is an empty matrix. About ’mvnpdf’, Matlab’s help states the
following
Y = MVNPDF(X,MU,SIGMA) returns the density of the multivariate normal
s =
49 2 2
1 41 6
0 7 42
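From this confusion matrix, the test error estimate of Section 7.2 follows directly:

s = [49 2 2; 1 41 6; 0 7 42];
E_hat = (sum(s(:)) - trace(s)) / sum(s(:))   % = (150 - 132)/150 = 0.12

That is, the estimated classification error of the classifier is 12 %.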
8.1 Introduction
Until now, we have assumed that we have a set of correctly labeled training sam-
ples for training the classifier. These procedures using training samples are part
of supervised classification (see Section 2.4). However, in some circumstances it is
necessary to resort to unsupervised classification, which is also termed clustering.
These procedures do not use labeled training data. Instead, we have a collection of
unlabeled samples, and we try to classify them based only on the features in data.
In other words: there is no explicit teacher. Clearly, clustering is a more difficult
problem than supervised classification. However, there are certain situations where
clustering is either useful or necessary. These include:
1. The collection and classification of training data can be costly and time con-
suming. Therefore, it can be impossible to collect a training set. Also, when
there are very many training samples, it can be that all of these cannot be
hand-labeled. Then, it is useful to train a supervised classifier with a small
portion of the training data and then use clustering procedures to fine-tune the
classifier based on the large, unclassified dataset.
2. For data mining, it can be useful to search for natural clusters/groupings
among the data, and then recognize the clusters.
3. The properties of feature vectors can change over time. Then, supervised
classification is not reasonable, because sooner or later the test feature vectors
would have completely different properties than the training data had. For
example, this problem is commonplace in the processing of medical images.
4. The clustering can be useful when searching for good parametric families for
the class conditional densities for the supervised classification.
The above situations are only examples of the situations where clustering is worthwhile;
there are many more situations requiring clustering.
D = {x1 , . . . , xn }.
The task is to place each of these feature vectors into one of the c classes. In other
words, the task is to find c sets D1 , . . . , Dc so that
∪_{i=1}^c Di = D
and
Dj ∩ Di = ∅
for all i 6= j.
Naturally, we need some further assumptions for the problem to be sensible. (An
arbitrary division of the feature vectors between the different classes is not likely to be
useful.) Here, we assume that we can measure the similarity of any two feature
vectors somehow. Then, the task is to maximize the similarity of feature vectors
within a class. For this, we need to be able to define how similar are the feature
vectors belonging to some set Di .
In the following sections, we will introduce two alternatives for defining how
similar a certain set of feature vectors is. We have already seen their counterparts in
supervised classification. The first one, k-means clustering, is reminiscent of minimum
distance classification. The second one, the expectation maximization algorithm, is
rooted in ML estimation and finite mixture models.
k-means algorithm
Initialize µ1(0), . . . , µc(0), set t ← 0.
repeat
Classify each xi, i = 1, . . . , n, to the class Dj(t) whose mean vector µj(t) is the nearest to xi.
for k = 1 to c do
Update the mean vector µk(t + 1) = (1/|Dk(t)|) Σ_{x∈Dk(t)} x.
end for
Set t ← t + 1.
until the clustering did not change
Return D1(t − 1), . . . , Dc(t − 1).
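A compact Matlab sketch of the algorithm; the data matrix X and the number of clusters c below are hypothetical, and the mean vectors are initialized by picking c samples at random:

X = [randn(50,2); 4 + randn(50,2)];          % hypothetical feature vectors, one per row
c = 2;
n = size(X,1);
mu = X(randperm(n, c), :);                   % initial mean vectors
labels = zeros(n,1);
changed = true;
while changed
  % classify each sample to the cluster with the nearest mean vector
  d2 = zeros(n, c);
  for k = 1:c
    d2(:,k) = sum((X - repmat(mu(k,:), n, 1)).^2, 2);
  end
  [d2min, newlabels] = min(d2, [], 2);
  changed = any(newlabels ~= labels);
  labels = newlabels;
  % update the mean vectors
  for k = 1:c
    if any(labels == k)
      mu(k,:) = mean(X(labels == k, :), 1);
    end
  end
end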
Figure 8.2: Examples of problematic clustering tasks for k-means. Top row: The
scale of features is different (note the scaling of the axis). Bottom row: There are
many more feature vectors in one class than in the other. ’True classifications’ are
shown on left and k-means results are on right.
in the unsupervised manner (without any training samples) when the parametric
families of the class conditional densities are known. We will assume that:
1. The class conditional densities are modeled by normal densities, i.e. p(x|ωj) =
pnormal(x|µj, Σj) for all j = 1, . . . , c.
2. Parameters µ1 , . . . , µc , Σ1 , . . . , Σc are not known.
3. Priors P (ω1 ), . . . , P (ωc ) are not known.
4. There are no training samples: i.e. the classes of feature vectors x1 , . . . , xn are
not known, and the task is to classify the feature vectors.
We assume that feature vectors are occurrences of independent random variables.
Each random variable is distributed according to some class conditional density but
we do not know according to which one. We can, however, write the probability
density that xi is observed and it belongs to some class. Firstly,

p(xi) = Σ_{j=1}^c p(xi, ωj),
because xi has to belong to exactly one of the classes ω1 , . . . , ωc . (The left hand side
of the above equation is a marginal density.) By combining these, we obtain
p(xi) = Σ_{j=1}^c p(xi|ωj)P(ωj).   (8.2)
This is a mixture density. The priors P (ωi ) in mixture densities are called mixing
parameters, and the class conditional densities are called component densities.
We assumed that the component densities are normal densities. Hence, we obtain
the parametric mixture density
p(xi|θ) = Σ_{j=1}^c pnormal(xi|µj, Σj)P(ωj),   (8.3)
Figure 8.3: Finite mixture models. On top component densities are shown. On bot-
tom the resulting mixture density when component densities and mixing parameters
are as shown in the top panel.
A problem with FMMs is that they are not always identifiable. This means that
two different parameter vectors θ1 and θ2 may result in exactly the same mixture
density, i.e.
p(x|θ1 ) = p(x|θ2 )
for all x. However, the classifier derived based on the parameter vector θ1 can still
be different from the classifier based on θ2. Therefore, this is a serious
problem, because the parameter vectors cannot be distinguished from each other
based on ML estimation. Switching problems are special cases of identifiability
problems. If we switch the classes ωi and ωj , we end up with exactly the same
mixture density as long as component densities for ωi and ωj belong to the same
parametric family. Therefore, the classes must be named (or given interpretation)
after the classification. This is problematic, if we are after a completely unsupervised
classifier.
8.5 EM Algorithm
Π_{i=1}^n p(xi|θ) = Π_{i=1}^n Σ_{j=1}^c pnormal(xi|µj, Σj)P(ωj).
Maximizing this with respect to the parameter vector θ results in the ML estimate
θ̂. Once again, it is more convenient to deal with the log-likelihood function, i.e.
θ̂ = arg max_θ Σ_{i=1}^n ln[ Σ_{j=1}^c pnormal(xi|µj, Σj)P(ωj) ].   (8.4)
EM algorithm
4. Set t ← t + 1.
In most cases, the EM algorithm finds a local maximum of the likelihood function.
However, there are usually several local maxima, and hence the final result
depends strongly on the initialization.
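A sketch of the EM iteration for the normal mixture (8.3) written directly in Matlab. The data, the number of components, the initialization and the stopping rule below are hypothetical; the update formulas are the standard ones for Gaussian mixture models.

X = [randn(100,2); 3 + randn(80,2)];             % hypothetical data, one sample per row
[n, d] = size(X);
c = 2;
P = ones(1,c) / c;                               % mixing parameters P(omega_j)
mu = X(randperm(n, c), :);                       % component means
Sigma = repmat(eye(d), [1 1 c]);                 % component covariances
for t = 1:100                                    % fixed number of iterations for simplicity
  % E step: posterior probabilities W(i,j) = P(omega_j | x_i, theta(t))
  W = zeros(n, c);
  for j = 1:c
    W(:,j) = P(j) * mvnpdf(X, mu(j,:), Sigma(:,:,j));
  end
  W = W ./ repmat(sum(W, 2), 1, c);
  % M step: update the parameters
  for j = 1:c
    nj = sum(W(:,j));
    P(j) = nj / n;
    mu(j,:) = W(:,j)' * X / nj;
    Xc = X - repmat(mu(j,:), n, 1);
    Sigma(:,:,j) = (Xc' * (Xc .* repmat(W(:,j), 1, d))) / nj;
  end
end
% After the iteration, x_i can be classified to the class omega_j maximizing W(i,j).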