Classification
An Example of Classification
• “Sorting incoming fish on a conveyor according to species using optical sensing.”
(Figure: the two species to be separated – sea bass and salmon.)
– Some properties that could possibly be used to distinguish between the two types of fish (i.e., candidate features) are:
• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc.
Feature space
• Input samples, when represented by their features, become points in the feature space.
(Figure: samples plotted in a 2-D feature space with axes F1 and F2; for the fish example, lightness vs. width.)
• Pattern
• Feature
• Feature vector
• Feature space
• Classification
• Decision Boundary
• Decision Region
• Discriminant function
• Hyperplanes and Hypersurfaces
• Learning
• Supervised and unsupervised
• Error
• Noise
• PDF
• Bayes’ Rule
• Parametric and Non-parametric approaches
Decision region and Decision Boundary
• Our goal in pattern recognition is to reach an optimal decision rule that categorizes the incoming data into their respective categories.
Supervised or Unsupervised
• Linear
• Quadratic
• Piecewise
• Non-parametric
Parametric Decision making (Statistical) - Supervised
P(wi): prior probability for class wi;  P(X): unconditional probability of the feature vector X.
Take an example. Two-class problem: cold (C) and not-cold (C′). The feature is fever (f).
Then, using Bayes’ theorem, the probability that a person has a cold, given that she (or he) has a fever, is:
P(C | f) = P(f | C) P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2
Not convinced that it works?
Let us take an example with values to verify.
Total population = 1000. Thus, people having a cold = 10. People having both fever and cold = 4. Thus, people having only a cold = 10 − 4 = 6.
People having a fever (with and without a cold) = 0.02 × 1000 = 20. People having a fever without a cold = 20 − 4 = 16 (may use this later).
(Venn diagram: the cold group C and the fever group f overlap in the 4 people having both.)
P(C) = 0.01;  P(f) = 0.02;  P(f | C) = 4/10 = 0.4.
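A quick numeric check of this example in Python (the counts are exactly those listed above):

# Bayes' rule check for the cold/fever example (counts from the slide).
total = 1000
cold = 10            # so P(C)  = 0.01
fever = 20           # so P(f)  = 0.02
both = 4             # so P(f|C) = 4/10 = 0.4

p_c = cold / total
p_f = fever / total
p_f_given_c = both / cold

# Bayes' rule: P(C|f) = P(f|C) P(C) / P(f)
p_c_given_f = p_f_given_c * p_c / p_f
print(p_c_given_f)      # 0.2
print(both / fever)     # direct count: 4 of the 20 feverish people have a cold -> 0.2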
Thus:
P(wi | X) = P(X | wi) P(wi) / [ P(w1) P(X | w1) + P(w2) P(X | w2) + … + P(wk) P(X | wk) ]
With our last example:
u_ij = 1 if x_i ∈ G_j, and 0 otherwise,
where G_j, j = 1, 2, …, k, represent the k clusters.
If n is the number of known patterns and c the desired number of clusters, the k-means algorithm is:
Begin
  initialize n, c, μ1, μ2, …, μc (randomly selected)
  do
    1. classify the n samples according to the nearest μi
    2. recompute each μi
  until no change in μi
  return μ1, μ2, …, μc
End
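A minimal NumPy sketch of this loop, assuming squared-Euclidean assignments and termination when the means stop changing (function and parameter names are illustrative):

import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """X: (n, d) array of samples; c: desired number of clusters."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]   # randomly selected initial means
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # 1. classify the n samples according to the nearest mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
        labels = d2.argmin(axis=1)
        # 2. recompute each mean (keep the old one if a cluster goes empty)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):      # until no change in the means
            break
        mu = new_mu
    return mu, labels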
Distance measures used:
Euclidean, Manhattan, Minkowski, Canberra, Hamming, chessboard/checkerboard and maximum (Chebyshev), cosine similarity, Spearman correlation.
Classification Stage
• The samples have to be assigned to clusters so as to minimize the cost function:
J = Σ_{i=1..c} J_i = Σ_{i=1..c} Σ_{k: x_k ∈ G_i} ||x_k − μ_i||²
• This is the squared Euclidean distance of the samples from their cluster centers; summed over all clusters, it should be a minimum.
• The classification of a point x_k is done by:
u_ik = 1 if ||x_k − μ_i||² ≤ ||x_k − μ_j||² for all j ≠ i;  0 otherwise.
Re-computing the Means
• The means are recomputed according to:
μ_i = (1/|G_i|) Σ_{k: x_k ∈ G_i} x_k
• Disadvantages
• What happens when there is overlap between classes, i.e. a point is equally close to two cluster centers? The algorithm will not terminate.
• The terminating condition is therefore modified to “the change in the cost function (computed at the end of the classification step) is below some threshold”, rather than exactly zero.
An Example
• The number of clusters is two in this case, but there is still some overlap.
Back to the study of Gaussians
(Figure: iso-contours of 2-D Gaussians. Diagonal covariance with σ_x > σ_y and ρ_xy = 0: asymmetric, axis-aligned. Non-diagonal covariance with σ_x = σ_y and ρ_xy < 0 or ρ_xy > 0: oriented.)
(Figure panels: 1-NN and ND classification of the same data.)
With Bayes, use σ, Σ, μ to define the DB (decision boundary).
Read about: t-SNE plot, Parzen window.
Decision Regions and Boundaries
Determine the decision region (in Rd) into which X falls, and
assign X to this class.
(Figure: decision regions R1 and R2 around the class means μ1 and μ2.)
g_1(X) = ||X − μ1||;  g_2(X) = ||X − μ2||
Minimizing ||X − μ_i||², and dropping the class-independent X^T X term, gives
g_i(X) = ω_i^T X + ω_i0, where ω_i = μ_i and ω_i0 = −μ_i^T μ_i / 2.
This gives the simplest linear discriminant function, or correlation detector.
The perceptron (ANN) built to form the linear discriminant function
(Figure: inputs x1, x2, …, xd with weights w1, w2, …, wd and bias w_i0, feeding the output node.)
O(X) = Σ_i w_i x_i + w_i0
(In 2-D, for example, G = MX − Y + C, i.e. the line Y = MX + C.)
The decision region boundaries are determined by solving G_i(X) = G_j(X), i ≠ j.
The Mahalanobis distance (quadratic term) spawns a number of different surfaces, depending on Σ^{-1}. It is basically a vector distance using a Σ^{-1} norm, denoted:
||X − μ_i||²_{Σ_i^{-1}}
Make the case of Bayes’ rule more general for class assignment.
Earlier we had assumed that:
g_i(X) = P(C_i | X), assuming P(C_i) = P(C_j), ∀ i, j; i ≠ j.
Now,
G_i(X) = log[ P(C_i | X) · P(X) ] = log[ P(X | C_i) ] + log[ P(C_i) ]
For Gaussian class-conditional densities:
G_i(X) = log[ 1 / √( det(Σ_i) (2π)^d ) ] − (1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) + log[ P(C_i) ]
       = −(1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) − (d/2) log(2π) − (1/2) log|Σ_i| + log[ P(C_i) ]
       = −(1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) − (1/2) log|Σ_i| + log[ P(C_i) ]    (neglecting the constant term)
Simpler case: Σ_i = σ²I. Eliminating the class-independent bias, we have:
G_i(X) = −(1/(2σ²)) (X − μ_i)^T (X − μ_i) + log[ P(C_i) ]
The loci of constant G_i are hyper-spheres, centered at the class means.
More on this later on…
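A small sketch of evaluating these Gaussian discriminants (the two class means, covariances, priors and the query point below are made-up numbers, purely for illustration):

import numpy as np

def gaussian_discriminant(X, mu, Sigma, prior):
    """G_i(X) = -1/2 (X-mu)^T Sigma^{-1} (X-mu) - 1/2 log|Sigma| + log P(C_i)."""
    diff = X - mu
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * diff @ Sinv @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustration with two made-up 2-D classes:
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])
X = np.array([1.0, 2.0])
g1 = gaussian_discriminant(X, mu1, S1, prior=0.5)
g2 = gaussian_discriminant(X, mu2, S2, prior=0.5)
print("assign to class", 1 if g1 > g2 else 2)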
If Σ is a diagonal matrix, with equal/unequal σ_ii²:
Σ = diag(σ_1², σ_2², …, σ_d²)   and   Σ^{-1} = diag(1/σ_1², 1/σ_2², …, 1/σ_d²)
Considering the discriminant function:
G_i(X) = −(1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) − (1/2) log|Σ_i| + log[ P(C_i) ]
this now yields a weighted-distance classifier: depending on the covariance terms (more spread/scatter or not), we put more emphasis on some feature-vector components than on others.
d² = (X − μ_i)^T Σ^{-1} (X − μ_i) = X^T Σ^{-1} X − 2 μ_i^T Σ^{-1} X + μ_i^T Σ^{-1} μ_i
Since Σ_i = Σ for all classes (and Σ^{-1} is symmetric), the quadratic term X^T Σ^{-1} X is class-independent and can be dropped, giving
G_i(X) = (Σ^{-1} μ_i)^T X − (1/2) μ_i^T Σ^{-1} μ_i
Thus, G_i(X) = ω_i^T X + ω_i0,
where ω_i = Σ^{-1} μ_i and ω_i0 = −(1/2) μ_i^T Σ^{-1} μ_i.
Thus the decision surfaces are hyperplanes and decision
boundaries will also be linear (use Gi(X) = Gj(X), as done earlier)
This plane partitions the space into two mutually exclusive regions, say R_p and R_n. The assignment of the vector X to the +ve side, the −ve side, or along H, can be implemented by:
ω^T X − d  > 0 if X ∈ R_p;   = 0 if X ∈ H;   < 0 if X ∈ R_n.
A relook at the linear discriminant function g(X):
g(X) = ω^T X − d
(Figure: pattern/feature space (x1, x2) with the hyperplane H, its +ve side R_p and −ve side R_n.)
The orientation of H is determined by ω; the location of H is determined by d.
Pattern/feature space vs. weight space
The complementary role of a sample in the parametric (weight) space:
(Figure: in the pattern/feature space (x1, x2), the weight vector ω defines the hyperplane X^T W = 0 separating classes C1 and C2; in the weight space (w1, w2), each sample X_k defines a hyperplane W^T X = 0.)
Training sets: T1 = [X1, X2];  T2 = [X3, X4].
(Figure: weight space (w1, w2) with the constraint hyperplanes from the samples X1, …, X4. With T1 = [X1, X2] and T2 = [X3, X4], the constraints g(T1) > 0 and g(T2) < 0 define the SOLUTION SPACE.)
LMS learning law in BPNN or FFNN models
(Figure: perceptron with inputs x1, x2, …, weights w1, w2, …, and output O(X).)
Read about: the perceptron vs. the multi-layer feedforward network.
The perceptron weight-update rule:
W_{k+1} = W_k + η_k X_k   if X_k^T W_k ≤ 0
W_{k+1} = W_k             if X_k^T W_k > 0
For the two-class case:
W_{k+1} = W_k + η_k X_k   if X_k ∈ C1 and X_k^T W_k ≤ 0
W_{k+1} = W_k − η_k X_k   if X_k ∈ C0 and X_k^T W_k ≥ 0
(Figure: trajectory of W_k in weight space across the sample hyperplanes W^T X_k = 0, with training sets T1 = [X1, X2] and T2 = [X3, X4].)
η_k decreases with each iteration:
W_{k+1} = W_k + η_k X_k   if X_k ∈ C1 and X_k^T W_k ≤ 0
W_{k+1} = W_k − η_k X_k   if X_k ∈ C0 and X_k^T W_k ≥ 0
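A minimal sketch of this perceptron update rule; the samples are assumed to be augmented with a bias component, and the decreasing schedule η_k = η0/(k+1) is an assumed choice:

import numpy as np

def perceptron_train(X, y, eta0=1.0, epochs=50):
    """X: (n, d) samples (augmented with a bias 1); y: labels in {0, 1} for C0/C1."""
    W = np.zeros(X.shape[1])
    k = 0
    for _ in range(epochs):
        for xk, yk in zip(X, y):
            eta = eta0 / (k + 1)          # eta_k decreases with each iteration (assumed schedule)
            s = xk @ W
            if yk == 1 and s <= 0:        # X_k in C1 but on the wrong side
                W = W + eta * xk
            elif yk == 0 and s >= 0:      # X_k in C0 but on the wrong side
                W = W - eta * xk
            k += 1
    return W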
In the case of an FFNN, the objective is to minimize the error term:
e_k = d_k − s_k = d_k − X_k^T W_k
(Figure: gradient step in weight space from W_k to W_{k+1} = W_k + ΔW_k.)
ξ_k = (1/2)[d_k − X_k^T W_k]², whose expectation is
ξ = E[d_k²]/2 − P^T W + (1/2) W^T R W,
where P^T = E[d_k X_k^T], and
R = E[X_k X_k^T], the correlation matrix of the augmented input X_k = [1, x_1^k, …, x_n^k]^T.
Thus,
∇ξ = (∂ξ/∂w_0, ∂ξ/∂w_1, …, ∂ξ/∂w_n)^T = −P + R W,
and hence
Ŵ = R^{-1} P.
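A sketch of this closed-form solution Ŵ = R^{-1}P, estimated from made-up sample data (X is augmented with a leading 1 for the bias weight w0):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # augmented inputs [1, x1, ..., xd]
w_true = np.array([0.5, 1.0, -2.0, 3.0])
dk = X @ w_true + 0.1 * rng.normal(size=n)                   # desired outputs

R = (X.T @ X) / n          # sample estimate of E[X_k X_k^T]
P = (X.T @ dk) / n         # sample estimate of E[d_k X_k]
W_hat = np.linalg.solve(R, P)   # W = R^{-1} P
print(W_hat)               # close to w_true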
Effect of class Priors – revisiting DBs in a more general case.
P(w_i | X) = P(X | w_i) P(w_i) / P(X),
with Gaussian class-conditional densities
p(X | w_i) = [ 1 / √( det(Σ) (2π)^d ) ] exp[ −(1/2)(X − μ)^T Σ^{-1} (X − μ) ]
For Σ_i = σ²I:
g_i(X) = −(1/(2σ²)) [ X^T X − 2 μ_i^T X + μ_i^T μ_i ] + ln P(w_i)
Thus, g_i(X) = ω_i^T X + ω_i0,
where ω_i = μ_i / σ² and ω_i0 = −μ_i^T μ_i / (2σ²) + ln P(w_i).
The decision boundary between classes ω_k and ω_l is W^T (X − X_0) = 0, where W = μ_k − μ_l and
X_0 = (1/2)(μ_k + μ_l) − σ² [ ln( P(ω_k) / P(ω_l) ) / ||μ_k − μ_l||² ] (μ_k − μ_l).
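A small sketch computing this boundary point X_0 for two made-up classes with Σ = σ²I and unequal priors:

import numpy as np

sigma2 = 1.0
mu_k, mu_l = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
P_k, P_l = 0.7, 0.3

W = mu_k - mu_l
diff = mu_k - mu_l
X0 = 0.5 * (mu_k + mu_l) - sigma2 * np.log(P_k / P_l) / (diff @ diff) * diff
# The boundary W^T (X - X0) = 0 shifts away from the more probable class (here, class k):
print(X0)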
CASE – B. Arbitrary Σ, but identical for all classes.
g_i(X) = −(1/2) (X − μ_i)^T Σ^{-1} (X − μ_i) + ln P(w_i)
Removing the class-invariant quadratic term:
g_i(X) = −(1/2) μ_i^T Σ^{-1} μ_i + (Σ^{-1} μ_i)^T X + ln P(w_i)
Thus, g_i(X) = ω_i^T X + ω_i0,
where ω_i = Σ^{-1} μ_i and ω_i0 = −(1/2) μ_i^T Σ^{-1} μ_i + ln P(w_i).
The linear DB is thus: g_k(X) = g_l(X), k ≠ l,
which is: (ω_k − ω_l)^T X + (ω_k0 − ω_l0) = 0.
Thus, W = Σ^{-1} (μ_k − μ_l).
The normal to the DB, W, is thus the transformed vector joining the two means; the transformation matrix is the symmetric Σ^{-1}. The DB is therefore normal to a tilted (rotated) version of the vector joining the two means.
(Figure: the case σ_1 > σ_2 – the components (μ_k1 − μ_l1)/σ_1 and (μ_k2 − μ_l2)/σ_2 of the tilted normal, shown near μ_l.)
Thus the linear DB is: W^T (X − X_0) = 0, where W = ω_k − ω_l, with ω_i = Σ^{-1} μ_i.
Thus, W = Σ^{-1} (μ_k − μ_l).
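A sketch of this Case-B (shared covariance) linear discriminant; the means, covariance, priors, and query point are made-up numbers for illustration:

import numpy as np

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])        # shared covariance (made-up)
mu = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
prior = [0.5, 0.5]

Sinv = np.linalg.inv(Sigma)
omega  = [Sinv @ m for m in mu]                           # omega_i  = Sigma^{-1} mu_i
omega0 = [-0.5 * m @ Sinv @ m + np.log(p)                 # omega_i0 = -1/2 mu^T Sigma^{-1} mu + ln P
          for m, p in zip(mu, prior)]

def g(i, X):
    return omega[i] @ X + omega0[i]

X = np.array([1.0, 0.5])
print("class", 1 if g(0, X) > g(1, X) else 2)
# Normal to the decision boundary:
W = Sinv @ (mu[0] - mu[1])                                # W = Sigma^{-1} (mu_k - mu_l)
print(W)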
Special case (diagonal Σ with diagonal entries σ_1, σ_2):
W_D = [ (μ_k1 − μ_l1)/σ_1 ,  (μ_k2 − μ_l2)/σ_2 ]^T
(Figure: effect of increasing σ_2 and decreasing σ_1 on the tilt of the DB.)
(Figures: diagonal Σ in all cases; diagonal elements of Σ both equal to 1.0 in all cases. Note that point P is actually closer, in the Euclidean sense, to the mean of the orange class.)
Thus, g_i(X) = X^T W_i X + ω_i^T X + ω_i0,
where W_i = −(1/2) Σ_i^{-1},
ω_i = Σ_i^{-1} μ_i, and
ω_i0 = −(1/2) μ_i^T Σ_i^{-1} μ_i − (1/2) ln|Σ_i| + ln P(w_i).
Σ_1^{-1} = [ 2 0 ; 0 1/2 ];   Σ_2^{-1} = [ 1/2 0 ; 0 1/2 ];   Assume P(w1) = P(w2) = 0.5.
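A sketch of the resulting Case-C quadratic discriminants using these Σ^{-1} values; the class means below are not given on the slide and are assumed only for illustration:

import numpy as np

Sinv = [np.array([[2.0, 0.0], [0.0, 0.5]]),      # Sigma_1^{-1} from the slide
        np.array([[0.5, 0.0], [0.0, 0.5]])]      # Sigma_2^{-1} from the slide
prior = [0.5, 0.5]
mu = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]   # assumed means (not on the slide)

def g(i, X):
    Wi = -0.5 * Sinv[i]
    wi = Sinv[i] @ mu[i]
    # note |Sigma_i| = 1 / |Sigma_i^{-1}|, so -1/2 ln|Sigma_i| = +1/2 ln|Sigma_i^{-1}|
    wi0 = (-0.5 * mu[i] @ Sinv[i] @ mu[i]
           + 0.5 * np.log(np.linalg.det(Sinv[i]))
           + np.log(prior[i]))
    return X @ Wi @ X + wi @ X + wi0

X = np.array([1.0, 1.0])
print("class", 1 if g(0, X) > g(1, X) else 2)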
Σ_{i=1}^{d} w_ii x_i² + Σ_{i=1}^{d−1} Σ_{j=i+1}^{d} w_ij x_i x_j + Σ_{i=1}^{d} w_i x_i + w_0 = 0    …(1)
In two dimensions:
w_11 x_1² + w_22 x_2² + w_12 x_1 x_2 + w_1 x_1 + w_2 x_2 + w_0 = 0    …(2)
Special cases of equation (2):
w_11 x_1² + w_22 x_2² + w_12 x_1 x_2 + w_1 x_1 + w_2 x_2 + w_0 = 0
Case 1:
w11 = w22 = w12 = 0; Eqn. (2) defines a line.
Case 2:
w11 = w22 = K; w12 = 0; defines a circle.
Case 3:
w11 = w22 = 1; w12 = w1 = w2 = 0; defines a circle whose center is at the origin.
Case 4:
w11 = w22 = 0; defines a bilinear constraint.
Case 5:
w11 = w12 = w2 = 0; defines a parabola with a specific orientation.
Case 6:
w11 ≠ 0, w22 ≠ 0, w11 ≠ w22 ; w12 = w1 = w2 = 0
defines a simple ellipse.
Σ_{i=1}^{d} w_ii x_i² + Σ_{i=1}^{d−1} Σ_{j=i+1}^{d} w_ij x_i x_j + Σ_{i=1}^{d} w_i x_i + w_0 = 0    …(1)
Number of free parameters: 2d + 1 + d(d−1)/2 = (d+1)(d+2)/2.
d² = (X − μ_i)^T Σ_i^{-1} (X − μ_i) = X^T Σ_i^{-1} X − 2 μ_i^T Σ_i^{-1} X + μ_i^T Σ_i^{-1} μ_i
Example of linearization:
g(X) = x_2 − x_1² − 3 x_1 + 6 = 0
Introduce the new feature x_3 = x_1². Then
g(X) = x_2 − x_3 − 3 x_1 + 6 = W^T X + w_0,
where X = [x_1, x_2, x_3]^T, W = [−3, 1, −1]^T and w_0 = 6.
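A quick numeric check of this linearization at a few arbitrary points:

import numpy as np

def g_original(x1, x2):
    return x2 - x1**2 - 3*x1 + 6

W, w0 = np.array([-3.0, 1.0, -1.0]), 6.0

for x1, x2 in [(0.0, 1.0), (2.0, -5.0), (-1.5, 4.0)]:
    X = np.array([x1, x2, x1**2])        # augmented feature x3 = x1^2
    assert np.isclose(g_original(x1, x2), W @ X + w0)
print("linear form W^T X + w0 matches g(X) at the test points")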
CASE – C. Arbitrary Σ_i, all parameters class-dependent (contd.)
g_i(X) = −(1/2) (X − μ_i)^T Σ_i^{-1} (X − μ_i) − (1/2) ln|Σ_i| + ln P(w_i)
Thus, g_i(X) = X^T W_i X + ω_i^T X + ω_i0,
where W_i = −(1/2) Σ_i^{-1},
ω_i = Σ_i^{-1} μ_i, and ω_i0 = −(1/2) μ_i^T Σ_i^{-1} μ_i − (1/2) ln|Σ_i| + ln P(w_i).
(Figure panels, Case-C examples:
Panel 1: σ_1^x = σ_2^y; σ_1^y = σ_2^x; ρ_1 = ρ_2 = 0; μ_1^x < μ_2^x; μ_1^y = μ_2^y.
Panel 2: σ_1^x = σ_2^y; σ_1^y = σ_2^x; ρ_1 = ρ_2 = 0; μ_1^x = μ_2^x ± C; μ_1^y = μ_2^y ∓ C.)
Principal Component Analysis
Unsupervised learning
Demonstration of the KL Transform
(Figure: a 2-D data cloud with the first and second eigenvectors overlaid.)
(Two further example figures of the KL transform.)
Goal of PCA:
Find an orthonormal matrix W^T, where Y = W^T X, such that COV(Y) ≡ (1/(n−1)) Y Y^T is diagonalized.
Hence, the singular values {σ_i : i ≥ r + 1} are zero, and the corresponding singular vectors span the left and right null spaces.
In practical applications matrices are often contaminated by
errors, and the effective rank of a matrix can be smaller than its exact
rank r.
In this case, the matrix can be well approximated by including
only those singular vectors which correspond to singular values of a
significant magnitude. Hence, it is often desirable to compute a
reduced version of the SVD, as:
Samples: x1 = [−1, 1, 2]^T;  x2 = [−2, 3, 1]^T;  x3 = [4, 0, 3]^T;
X = [ −1 −2 4 ; 1 3 0 ; 2 1 3 ]   (columns are the samples)
3-D problem, with N = 3.
Mean of the samples: μ = [1/3, 4/3, 2]^T.
Centered samples: x̃1 = [−4/3, −1/3, 0]^T;  x̃2 = [−7/3, 5/3, −1]^T;  x̃3 = [11/3, −4/3, 1]^T.
Method – 1 (easiest):
X̃ = [ −4/3 −7/3 11/3 ; −1/3 5/3 −4/3 ; 0 −1 1 ]
COVAR = (X̃ X̃^T)/2 = (1/2) [ 62/3 −25/3 6 ; −25/3 14/3 −3 ; 6 −3 2 ]
Method – 2 (PCA definition):
S_T = (1/(N−1)) Σ_{k=1}^{N} (x_k − μ)(x_k − μ)^T
With the centered samples x̃1, x̃2, x̃3 as above:
C1 = x̃1 x̃1^T = [ 1.7778 0.4444 0 ; 0.4444 0.1111 0 ; 0 0 0 ]
C2 = x̃2 x̃2^T = [ 5.4444 −3.8889 2.3333 ; −3.8889 2.7778 −1.6667 ; 2.3333 −1.6667 1.0000 ]
C3 = x̃3 x̃3^T = [ 13.4444 −4.8889 3.6667 ; −4.8889 1.7778 −1.3333 ; 3.6667 −1.3333 1.0000 ]
SigmaC = C1 + C2 + C3 = [ 20.6667 −8.3333 6.0000 ; −8.3333 4.6667 −3.0000 ; 6.0000 −3.0000 2.0000 ]
COVAR = SigmaC/2 = [ 10.3333 −4.1667 3.0000 ; −4.1667 2.3333 −1.5000 ; 3.0000 −1.5000 1.0000 ]
Next, do SVD to get the vectors.
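A sketch reproducing this worked example with NumPy; the singular values should come out approximately as 13.0404, 0.6263, and 0:

import numpy as np

X = np.array([[-1.0, -2.0, 4.0],
              [ 1.0,  3.0, 0.0],
              [ 2.0,  1.0, 3.0]])      # columns are the samples x1, x2, x3

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                            # centered data (x_tilde)
S = (Xc @ Xc.T) / (X.shape[1] - 1)     # covariance (divide by N-1 = 2)

U, s, Vt = np.linalg.svd(S)
print(np.round(s, 4))                  # approx [13.0404, 0.6263, 0.0]
print(np.round(U, 3))                  # columns: principal directions (eigenvectors)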
For a face image with N samples and dimension d (= w*h, very large), we have:
S = X̃ X̃^T / 2 = (1/2) [ 62/3 −25/3 6 ; −25/3 14/3 −3 ; 6 −3 2 ] = [ 10.3333 −4.1667 3.0000 ; −4.1667 2.3333 −1.5000 ; 3.0000 −1.5000 1.0000 ]
SVD of S gives the singular values 13.0404, 0.6263, 0.0000 (the U and V matrices are not reproduced here).
Samples: x1 = [−3, −3]^T; x2 = [−2, −2]^T; x3 = [−1, −1]^T; x4 = [4, 4]^T; x5 = [5, 5]^T; x6 = [6, 7]^T.
2-D problem (d = 2), with N = 6.
X = [ −3 −2 −1 4 5 6 ; −3 −2 −1 4 5 7 ]
Each column is an observation (sample) and each row a variable (dimension).
U:
[-0.482 0.076 -0.707 -0.511 0. ]
[-0.482 0.076 0.707 -0.511 0. ]
[ 0.413 -0.365 -0. -0.443 0.707]
[ 0.413 -0.365 -0. -0.443 -0.707]
[ 0.44 0.85 -0. -0.289 0. ]
D:
[20.611 0.308 0.167 0.137 0. ]
VT:
[-0.482 -0.482 0.413 0.413 0.44 ]
[ 0.076 0.076 -0.365 -0.365 0.85 ]
[-0.707 0.707 -0. -0. -0. ]
[-0.511 -0.511 -0.443 -0.443 -0.289]
[ 0. -0. 0.707 -0.707 -0. ]
X:
[5 5 0 0 1]
[4 5 1 1 0]
[5 4 1 1 0]
[0 0 4 4 4]
[0 0 5 5 5]
[1 1 4 4 4]
Covariance Matrix for XT:
[ 5.36  4.16  4.16 -4.48 -5.60 -3.36]
[ 4.16  3.76  3.56 -3.68 -4.60 -2.76]
[ 4.16  3.56  3.76 -3.68 -4.60 -2.76]
[-4.48 -3.68 -3.68  3.84  4.80  2.88]
[-5.60 -4.60 -4.60  4.80  6.00  3.60]
[-3.36 -2.76 -2.76  2.88  3.60  2.16]
SVD applied on Covariance Matrix of XT:
U:
[-0.462 0.669 -0. -0.486 0.31 0.087]
[-0.383 -0.518 0.707 -0.243 0.155 0.043]
[-0.383 -0.518 -0.707 -0.243 0.155 0.043]
[ 0.397 -0.071 0. -0.289 0.492 -0.715]
[ 0.497 -0.088 -0. -0.72 -0.3 0.37 ]
[ 0.298 -0.053 0. 0.21 0.723 0.584]
D:
[24.292 0.388 0.2 0. 0. 0. ]
VT:
[-0.462 -0.383 -0.383 0.397 0.497 0.298]
[ 0.669 -0.518 -0.518 -0.071 -0.088 -0.053]
[-0. 0.707 -0.707 0. -0. 0. ]
[-0.462 -0.231 -0.231 -0.253 -0.74 0.261]
[-0.328 -0.164 -0.164 -0.608 0.298 -0.616]
[-0.135 -0.067 -0.067 0.635 -0.33 -0.679]
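A sketch reproducing this output with NumPy; note that matching the slide's values requires the covariance to be normalized by N (bias=True), which is inferred from the numbers:

import numpy as np

X = np.array([[5, 5, 0, 0, 1],
              [4, 5, 1, 1, 0],
              [5, 4, 1, 1, 0],
              [0, 0, 4, 4, 4],
              [0, 0, 5, 5, 5],
              [1, 1, 4, 4, 4]], dtype=float)

# Covariance matrix "for X^T": the 6 rows are treated as variables,
# normalized by N (bias=True) to match the values on the slide.
C = np.cov(X, bias=True)
print(np.round(C, 2))                 # first entry 5.36, etc.

U, D, Vt = np.linalg.svd(C)
print(np.round(D, 3))                 # approx [24.292, 0.388, 0.2, 0, 0, 0]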
Scatter Matrices and Separability criteria
Scatter matrices used to formulate criteria of class
separability:
Within-class scatter matrix: it shows the scatter of samples around their respective class expected vectors:
S_W = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)^T
Separability criteria based on scatter matrices include:
J_3 = tr(S_1) − μ (tr(S_2) − c);   J_4 = tr(S_1) / tr(S_2)
Linear Discriminant Analysis
• Learning set is labeled – supervised learning
• Class-specific method, in the sense that it tries to ‘shape’ the scatter in order to make it more reliable for classification.
Linear Discriminant Analysis
• If S_W is nonsingular, W_opt is chosen to satisfy
W_opt = arg max_W  |W^T S_B W| / |W^T S_W W|
The columns of W_opt are the generalized eigenvectors satisfying S_B w_i = λ_i S_W w_i.
• There are at most (c − 1) non-zero eigenvalues, so the upper bound on m is (c − 1).
Linear Discriminant Analysis
S_W is singular most of the time; its rank is at most N − c.
Solution – use an alternative criterion:
W_pca = arg max_W |W^T S_T W|
W_fld = arg max_W  |W^T W_pca^T S_B W_pca W| / |W^T W_pca^T S_W W_pca W|
Demonstration for LDA
Hand-worked EXAMPLE:
Data points (x): 1 2 3 5 4 6 8 −2 −1 1 3 4 2 5
Data points (y): 1 2 3 4 5 6 7  3  4 5 6 7 8 9
Class:           1 1 1 1 1 1 1  2  2 2 2 2 2 2
Sw = [ 10.6122 8.5714 ; 8.5714 8.0000 ]
Sb = [ 20.6429 −17.00 ; −17.00 14.00 ]
inv(Sw) · Sb = [ 27.20 −22.40 ; −31.268 25.75 ]
Perform eigendecomposition on the above.
Eigenvectors: [ −0.7719 0.6357 ; 0.6357 0.7719 ]
Two variants of the example (with Sb computed differently):
Variant 1: Sw = [ 10.6122 8.5714 ; 8.5714 8.0000 ];  Sb = [ 20.6429 −17.00 ; −17.00 14.00 ]
Eigenvalues of Sw^{-1} Sb: 53.687, 0.   Eigenvectors: [ −0.7719 0.6357 ; 0.6357 0.7719 ]
Variant 2: Sw = [ 10.6122 8.5714 ; 8.5714 8.0000 ];  Sb = [ 203.143 −95.00 ; −95.00 87.50 ]
Eigenvalues of Sw^{-1} Sb: 297.83, 0.   Eigenvectors: [ −0.7355 −0.6775 ; 0.6775 0.7355 ]
After linear projection, using LDA:
Same EXAMPLE for LDA, with C = 3:
Data points (x): 1 2 3 5 4 6 8 −2 −1 1 3 4 2 5
Data points (y): 1 2 3 4 5 6 7  3  4 5 6 7 8 9
Class:           1 1 1 2 2 3 3  1  1 1 2 2 3 3
Sw = [ 8.0764 −2.125 ; −2.125 4.1667 ]
Sb = [ 56.845 52.50 ; 52.50 50.00 ]
inv(Sw) · Sb = [ 11.958 11.155 ; 18.7 17.69 ]
Perform eigendecomposition on the above.
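A sketch reproducing the C = 3 scatter matrices; the normalizations used below (per-class covariances divided by N_i for Sw, and Σ N_i(μ_i − μ)(μ_i − μ)^T for Sb) are inferred from the slide's numbers:

import numpy as np

x = np.array([1, 2, 3, 5, 4, 6, 8, -2, -1, 1, 3, 4, 2, 5], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 7,  3,  4, 5, 6, 7, 8, 9], dtype=float)
cls = np.array([1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 2, 2, 3, 3])
X = np.stack([x, y], axis=1)               # (14, 2)

mu_all = X.mean(axis=0)
Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for c in np.unique(cls):
    Xc = X[cls == c]
    mu_c = Xc.mean(axis=0)
    Sw += np.cov(Xc.T, bias=True)          # per-class covariance, normalized by N_i
    d = (mu_c - mu_all)[:, None]
    Sb += len(Xc) * (d @ d.T)              # N_i (mu_i - mu)(mu_i - mu)^T

print(np.round(Sw, 4))                     # approx [[8.0764, -2.125], [-2.125, 4.1667]]
print(np.round(Sb, 3))                     # approx [[56.845, 52.5], [52.5, 50.0]]
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
print(np.round(evals, 3))                  # the LDA projection directions are the eigenvectors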
• Reinforcement learning
• Generalization capabilities
• Evolutionary Computations
• Genetic algorithms
• Pervasive computing
• Neural dynamics
• Bishop – PR
Golub, G.H., and Van Loan, C.F. (1989) Matrix Computations, 2nd
ed. (Baltimore: Johns Hopkins University Press).