Classification
An Example of Classification
• “Sorting incoming fish on a conveyor according to species using optical sensing.”
(Figure: the two species to be separated – sea bass and salmon.)
– Some properties that could possibly be used to distinguish between the two types of fish (i.e., candidate features) are:
• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc.
Feature space
• Input samples, when represented by their features, become points in the feature space.
(Figure: samples plotted in a 2-D feature space with axes F1 and F2; for the fish example, lightness vs. width.)
• Pattern
• Feature
• Feature vector
• Feature space
• Classification
• Decision Boundary
• Decision Region
• Discriminant function
• Hyperplanes and Hypersurfaces
• Learning
• Supervised and unsupervised
• Error
• Noise
• PDF
• Bayes’ Rule
• Parametric and Non-parametric approaches
Decision region and Decision Boundary
• Our goal in pattern recognition is to reach an optimal decision rule that categorizes the incoming data into their respective categories.
Supervised or Unsupervised
• Linear
• Quadratic
• Piecewise
• Non-parametric
Parametric Decision making (Statistical) - Supervised
P(wi): prior probability for class wi;  P(X): unconditional probability of the feature vector X.
Take an example. Two-class problem: cold (C) and not-cold (C′). The feature is fever (f).
Then, using Bayes’ theorem, the probability that a person has a cold, given that she (or he) has a fever, is:
P(C | f) = P(f | C) P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2
Not convinced that it works?
Let us take an example with values to verify.
Total population = 1000. Thus, people having a cold = 10. People having both fever and cold = 4. Thus, people having only a cold = 10 − 4 = 6.
People having a fever (with and without a cold) = 0.02 × 1000 = 20. People having a fever without a cold = 20 − 4 = 16 (may use this later).
(Venn diagram: the cold group C and the fever group f overlap in the 4 people having both.)
P(C) = 0.01;  P(f) = 0.02;  P(f | C) = 4/10 = 0.4.
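A quick numeric check of this example in Python (the counts are exactly those listed above):

# Bayes' rule check for the cold/fever example (counts from the slide).
total = 1000
cold = 10            # so P(C)  = 0.01
fever = 20           # so P(f)  = 0.02
both = 4             # so P(f|C) = 4/10 = 0.4

p_c = cold / total
p_f = fever / total
p_f_given_c = both / cold

# Bayes' rule: P(C|f) = P(f|C) P(C) / P(f)
p_c_given_f = p_f_given_c * p_c / p_f
print(p_c_given_f)      # 0.2
print(both / fever)     # direct count: 4 of the 20 feverish people have a cold -> 0.2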
Thus:
P(wi | X) = P(X | wi) P(wi) / [ P(w1) P(X | w1) + P(w2) P(X | w2) + … + P(wk) P(X | wk) ]
With our last example:
u_ij = 1 if x_i ∈ G_j, and 0 otherwise,
where G_j, j = 1, 2, …, k, represent the k clusters.
If n is the number of known patterns and c the desired number of clusters, the k-means algorithm is:
Begin
  initialize n, c, μ1, μ2, …, μc (randomly selected)
  do
    1. classify the n samples according to the nearest μi
    2. recompute each μi
  until no change in μi
  return μ1, μ2, …, μc
End
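A minimal NumPy sketch of this loop, assuming squared-Euclidean assignments and termination when the means stop changing (function and parameter names are illustrative):

import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """X: (n, d) array of samples; c: desired number of clusters."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]   # randomly selected initial means
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # 1. classify the n samples according to the nearest mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
        labels = d2.argmin(axis=1)
        # 2. recompute each mean (keep the old one if a cluster goes empty)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):      # until no change in the means
            break
        mu = new_mu
    return mu, labels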
Distance measures used:
Euclidean, Manhattan, Minkowski, Canberra, Hamming, chessboard/checkerboard and maximum (Chebyshev), cosine similarity, Spearman correlation.
Classification Stage
• The samples have to be assigned to clusters so as to minimize the cost function:
J = Σ_{i=1..c} J_i = Σ_{i=1..c} Σ_{k: x_k ∈ G_i} ||x_k − μ_i||²
• This is the squared Euclidean distance of the samples from their cluster centers; summed over all clusters, it should be a minimum.
• The classification of a point x_k is done by:
u_ik = 1 if ||x_k − μ_i||² ≤ ||x_k − μ_j||² for all j ≠ i;  0 otherwise.
Re-computing the Means
• The means are recomputed according to:
μ_i = (1/|G_i|) Σ_{k: x_k ∈ G_i} x_k
• Disadvantages
• What happens when there is overlap between classes, i.e. a point is equally close to two cluster centers? The algorithm will not terminate.
• The terminating condition is therefore modified to “the change in the cost function (computed at the end of the classification step) is below some threshold”, rather than exactly zero.
An Example
• The number of clusters is two in this case, but there is still some overlap.
Back to the study of Gaussians
(Figure: iso-contours of 2-D Gaussians. Diagonal covariance with σ_x > σ_y and ρ_xy = 0: asymmetric, axis-aligned. Non-diagonal covariance with σ_x = σ_y and ρ_xy < 0 or ρ_xy > 0: oriented.)
(Figure panels: 1-NN and ND classification of the same data.)
With Bayes, use σ, Σ, μ to define the DB (decision boundary).
Read about: t-SNE plot, Parzen window.
Decision Regions and Boundaries
Determine the decision region (in Rd) into which X falls, and
assign X to this class.
(Figure: decision regions R1 and R2 around the class means μ1 and μ2.)
g_1(X) = ||X − μ1||;  g_2(X) = ||X − μ2||
Minimizing ||X − μ_i||², and dropping the class-independent X^T X term, gives
g_i(X) = ω_i^T X + ω_i0, where ω_i = μ_i and ω_i0 = −μ_i^T μ_i / 2.
This gives the simplest linear discriminant function, or correlation detector.
The perceptron (ANN) built to form the linear discriminant function
(Figure: inputs x1, x2, …, xd with weights w1, w2, …, wd and bias w_i0, feeding the output node.)
O(X) = Σ_i w_i x_i + w_i0
(In 2-D, for example, G = MX − Y + C, i.e. the line Y = MX + C.)
The decision region boundaries are determined by solving G_i(X) = G_j(X), i ≠ j.
The Mahalanobis distance (quadratic term) spawns a number of different surfaces, depending on Σ^{-1}. It is basically a vector distance using a Σ^{-1} norm, denoted:
||X − μ_i||²_{Σ_i^{-1}}
Make the case of Bayes’ rule more general for class assignment.
Earlier we had assumed that:
g_i(X) = P(C_i | X), assuming P(C_i) = P(C_j), ∀ i, j; i ≠ j.
Now,
G_i(X) = log[ P(C_i | X) · P(X) ] = log[ P(X | C_i) ] + log[ P(C_i) ]
For Gaussian class-conditional densities:
G_i(X) = log[ 1 / √( det(Σ_i) (2π)^d ) ] − (1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) + log[ P(C_i) ]
       = −(1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) − (d/2) log(2π) − (1/2) log|Σ_i| + log[ P(C_i) ]
       = −(1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) − (1/2) log|Σ_i| + log[ P(C_i) ]    (neglecting the constant term)
Simpler case: Σ_i = σ²I. Eliminating the class-independent bias, we have:
G_i(X) = −(1/(2σ²)) (X − μ_i)^T (X − μ_i) + log[ P(C_i) ]
The loci of constant G_i are hyper-spheres, centered at the class means.
More on this later on…
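A small sketch of evaluating these Gaussian discriminants (the two class means, covariances, priors and the query point below are made-up numbers, purely for illustration):

import numpy as np

def gaussian_discriminant(X, mu, Sigma, prior):
    """G_i(X) = -1/2 (X-mu)^T Sigma^{-1} (X-mu) - 1/2 log|Sigma| + log P(C_i)."""
    diff = X - mu
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * diff @ Sinv @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustration with two made-up 2-D classes:
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])
X = np.array([1.0, 2.0])
g1 = gaussian_discriminant(X, mu1, S1, prior=0.5)
g2 = gaussian_discriminant(X, mu2, S2, prior=0.5)
print("assign to class", 1 if g1 > g2 else 2)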
If Σ is a diagonal matrix, with equal/unequal σ_ii²:
Σ = diag(σ_1², σ_2², …, σ_d²)   and   Σ^{-1} = diag(1/σ_1², 1/σ_2², …, 1/σ_d²)
Considering the discriminant function:
G_i(X) = −(1/2)(X − μ_i)^T Σ_i^{-1} (X − μ_i) − (1/2) log|Σ_i| + log[ P(C_i) ]
this now yields a weighted-distance classifier: depending on the covariance terms (more spread/scatter or not), we put more emphasis on some feature-vector components than on others.
d² = (X − μ_i)^T Σ^{-1} (X − μ_i) = X^T Σ^{-1} X − 2 μ_i^T Σ^{-1} X + μ_i^T Σ^{-1} μ_i
Since Σ_i = Σ for all classes (and Σ^{-1} is symmetric), the quadratic term X^T Σ^{-1} X is class-independent and can be dropped, giving
G_i(X) = (Σ^{-1} μ_i)^T X − (1/2) μ_i^T Σ^{-1} μ_i
Thus, G_i(X) = ω_i^T X + ω_i0,
where ω_i = Σ^{-1} μ_i and ω_i0 = −(1/2) μ_i^T Σ^{-1} μ_i.
Thus the decision surfaces are hyperplanes and decision
boundaries will also be linear (use Gi(X) = Gj(X), as done earlier)
This plane partitions the space into two mutually exclusive regions, say R_p and R_n. The assignment of the vector X to the +ve side, the −ve side, or along H, can be implemented by:
ω^T X − d  > 0 if X ∈ R_p;   = 0 if X ∈ H;   < 0 if X ∈ R_n.
A relook at the linear discriminant function g(X):
g(X) = ω^T X − d
(Figure: pattern/feature space (x1, x2) with the hyperplane H, its +ve side R_p and −ve side R_n.)
The orientation of H is determined by ω; the location of H is determined by d.
Pattern/feature space vs. weight space
The complementary role of a sample in the parametric (weight) space:
(Figure: in the pattern/feature space (x1, x2), the weight vector ω defines the hyperplane X^T W = 0 separating classes C1 and C2; in the weight space (w1, w2), each sample X_k defines a hyperplane W^T X = 0.)
Training sets: T1 = [X1, X2];  T2 = [X3, X4].
(Figure: weight space (w1, w2) with the constraint hyperplanes from the samples X1, …, X4. With T1 = [X1, X2] and T2 = [X3, X4], the constraints g(T1) > 0 and g(T2) < 0 define the SOLUTION SPACE.)
LMS learning law in BPNN or FFNN models
(Figure: perceptron with inputs x1, x2, …, weights w1, w2, …, and output O(X).)
Read about: the perceptron vs. the multi-layer feedforward network.
The perceptron weight-update rule:
W_{k+1} = W_k + η_k X_k   if X_k^T W_k ≤ 0
W_{k+1} = W_k             if X_k^T W_k > 0
For the two-class case:
W_{k+1} = W_k + η_k X_k   if X_k ∈ C1 and X_k^T W_k ≤ 0
W_{k+1} = W_k − η_k X_k   if X_k ∈ C0 and X_k^T W_k ≥ 0
(Figure: trajectory of W_k in weight space across the sample hyperplanes W^T X_k = 0, with training sets T1 = [X1, X2] and T2 = [X3, X4].)
η_k decreases with each iteration:
W_{k+1} = W_k + η_k X_k   if X_k ∈ C1 and X_k^T W_k ≤ 0
W_{k+1} = W_k − η_k X_k   if X_k ∈ C0 and X_k^T W_k ≥ 0
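A minimal sketch of this perceptron update rule; the samples are assumed to be augmented with a bias component, and the decreasing schedule η_k = η0/(k+1) is an assumed choice:

import numpy as np

def perceptron_train(X, y, eta0=1.0, epochs=50):
    """X: (n, d) samples (augmented with a bias 1); y: labels in {0, 1} for C0/C1."""
    W = np.zeros(X.shape[1])
    k = 0
    for _ in range(epochs):
        for xk, yk in zip(X, y):
            eta = eta0 / (k + 1)          # eta_k decreases with each iteration (assumed schedule)
            s = xk @ W
            if yk == 1 and s <= 0:        # X_k in C1 but on the wrong side
                W = W + eta * xk
            elif yk == 0 and s >= 0:      # X_k in C0 but on the wrong side
                W = W - eta * xk
            k += 1
    return W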
In the case of an FFNN, the objective is to minimize the error term:
e_k = d_k − s_k = d_k − X_k^T W_k
(Figure: gradient step in weight space from W_k to W_{k+1} = W_k + ΔW_k.)
ξ_k = (1/2)[d_k − X_k^T W_k]², whose expectation is
ξ = E[d_k²]/2 − P^T W + (1/2) W^T R W,
where P^T = E[d_k X_k^T], and
R = E[X_k X_k^T], the correlation matrix of the augmented input X_k = [1, x_1^k, …, x_n^k]^T.
Thus,
∇ξ = (∂ξ/∂w_0, ∂ξ/∂w_1, …, ∂ξ/∂w_n)^T = −P + R W,
and hence
Ŵ = R^{-1} P.
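A sketch of this closed-form solution Ŵ = R^{-1}P, estimated from made-up sample data (X is augmented with a leading 1 for the bias weight w0):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # augmented inputs [1, x1, ..., xd]
w_true = np.array([0.5, 1.0, -2.0, 3.0])
dk = X @ w_true + 0.1 * rng.normal(size=n)                   # desired outputs

R = (X.T @ X) / n          # sample estimate of E[X_k X_k^T]
P = (X.T @ dk) / n         # sample estimate of E[d_k X_k]
W_hat = np.linalg.solve(R, P)   # W = R^{-1} P
print(W_hat)               # close to w_true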
Effect of class Priors – revisiting DBs in a more general case.
P(w_i | X) = P(X | w_i) P(w_i) / P(X),
with Gaussian class-conditional densities
p(X | w_i) = [ 1 / √( det(Σ) (2π)^d ) ] exp[ −(1/2)(X − μ)^T Σ^{-1} (X − μ) ]
For Σ_i = σ²I:
g_i(X) = −(1/(2σ²)) [ X^T X − 2 μ_i^T X + μ_i^T μ_i ] + ln P(w_i)
Thus, g_i(X) = ω_i^T X + ω_i0,
where ω_i = μ_i / σ² and ω_i0 = −μ_i^T μ_i / (2σ²) + ln P(w_i).
The decision boundary between classes ω_k and ω_l is W^T (X − X_0) = 0, where W = μ_k − μ_l and
X_0 = (1/2)(μ_k + μ_l) − σ² [ ln( P(ω_k) / P(ω_l) ) / ||μ_k − μ_l||² ] (μ_k − μ_l).
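A small sketch computing this boundary point X_0 for two made-up classes with Σ = σ²I and unequal priors:

import numpy as np

sigma2 = 1.0
mu_k, mu_l = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
P_k, P_l = 0.7, 0.3

W = mu_k - mu_l
diff = mu_k - mu_l
X0 = 0.5 * (mu_k + mu_l) - sigma2 * np.log(P_k / P_l) / (diff @ diff) * diff
# The boundary W^T (X - X0) = 0 shifts away from the more probable class (here, class k):
print(X0)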
CASE – B. Arbitrary Σ, but identical for all classes.
g_i(X) = −(1/2) (X − μ_i)^T Σ^{-1} (X − μ_i) + ln P(w_i)
Removing the class-invariant quadratic term:
g_i(X) = −(1/2) μ_i^T Σ^{-1} μ_i + (Σ^{-1} μ_i)^T X + ln P(w_i)
Thus, g_i(X) = ω_i^T X + ω_i0,
where ω_i = Σ^{-1} μ_i and ω_i0 = −(1/2) μ_i^T Σ^{-1} μ_i + ln P(w_i).
The linear DB is thus: g_k(X) = g_l(X), k ≠ l,
which is: (ω_k − ω_l)^T X + (ω_k0 − ω_l0) = 0.
Thus, W = Σ^{-1} (μ_k − μ_l).
The normal to the DB, W, is thus the transformed vector joining the two means; the transformation matrix is the symmetric Σ^{-1}. The DB is therefore normal to a tilted (rotated) version of the vector joining the two means.
(Figure: the case σ_1 > σ_2 – the components (μ_k1 − μ_l1)/σ_1 and (μ_k2 − μ_l2)/σ_2 of the tilted normal, shown near μ_l.)
Thus the linear DB is: W^T (X − X_0) = 0, where W = ω_k − ω_l, with ω_i = Σ^{-1} μ_i.
Thus, W = Σ^{-1} (μ_k − μ_l).
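A sketch of this Case-B (shared covariance) linear discriminant; the means, covariance, priors, and query point are made-up numbers for illustration:

import numpy as np

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])        # shared covariance (made-up)
mu = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
prior = [0.5, 0.5]

Sinv = np.linalg.inv(Sigma)
omega  = [Sinv @ m for m in mu]                           # omega_i  = Sigma^{-1} mu_i
omega0 = [-0.5 * m @ Sinv @ m + np.log(p)                 # omega_i0 = -1/2 mu^T Sigma^{-1} mu + ln P
          for m, p in zip(mu, prior)]

def g(i, X):
    return omega[i] @ X + omega0[i]

X = np.array([1.0, 0.5])
print("class", 1 if g(0, X) > g(1, X) else 2)
# Normal to the decision boundary:
W = Sinv @ (mu[0] - mu[1])                                # W = Sigma^{-1} (mu_k - mu_l)
print(W)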
Special case (diagonal Σ with diagonal entries σ_1, σ_2):
W_D = [ (μ_k1 − μ_l1)/σ_1 ,  (μ_k2 − μ_l2)/σ_2 ]^T
(Figure: effect of increasing σ_2 and decreasing σ_1 on the tilt of the DB.)
(Figures: diagonal Σ in all cases; diagonal elements of Σ both equal to 1.0 in all cases. Note that point P is actually closer, in the Euclidean sense, to the mean of the orange class.)
Thus, g_i(X) = X^T W_i X + ω_i^T X + ω_i0,
where W_i = −(1/2) Σ_i^{-1},
ω_i = Σ_i^{-1} μ_i, and
ω_i0 = −(1/2) μ_i^T Σ_i^{-1} μ_i − (1/2) ln|Σ_i| + ln P(w_i).
Σ_1^{-1} = [ 2 0 ; 0 1/2 ];   Σ_2^{-1} = [ 1/2 0 ; 0 1/2 ];   Assume P(w1) = P(w2) = 0.5.
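A sketch of the resulting Case-C quadratic discriminants using these Σ^{-1} values; the class means below are not given on the slide and are assumed only for illustration:

import numpy as np

Sinv = [np.array([[2.0, 0.0], [0.0, 0.5]]),      # Sigma_1^{-1} from the slide
        np.array([[0.5, 0.0], [0.0, 0.5]])]      # Sigma_2^{-1} from the slide
prior = [0.5, 0.5]
mu = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]   # assumed means (not on the slide)

def g(i, X):
    Wi = -0.5 * Sinv[i]
    wi = Sinv[i] @ mu[i]
    # note |Sigma_i| = 1 / |Sigma_i^{-1}|, so -1/2 ln|Sigma_i| = +1/2 ln|Sigma_i^{-1}|
    wi0 = (-0.5 * mu[i] @ Sinv[i] @ mu[i]
           + 0.5 * np.log(np.linalg.det(Sinv[i]))
           + np.log(prior[i]))
    return X @ Wi @ X + wi @ X + wi0

X = np.array([1.0, 1.0])
print("class", 1 if g(0, X) > g(1, X) else 2)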
Σ_{i=1}^{d} w_ii x_i² + Σ_{i=1}^{d−1} Σ_{j=i+1}^{d} w_ij x_i x_j + Σ_{i=1}^{d} w_i x_i + w_0 = 0    …(1)
In two dimensions:
w_11 x_1² + w_22 x_2² + w_12 x_1 x_2 + w_1 x_1 + w_2 x_2 + w_0 = 0    …(2)
Special cases of equation (2):
w_11 x_1² + w_22 x_2² + w_12 x_1 x_2 + w_1 x_1 + w_2 x_2 + w_0 = 0
Case 1:
w11 = w22 = w12 = 0; Eqn. (2) defines a line.
Case 2:
w11 = w22 = K; w12 = 0; defines a circle.
Case 3:
w11 = w22 = 1; w12 = w1 = w2 = 0; defines a circle whose center is at the origin.
Case 4:
w11 = w22 = 0; defines a bilinear constraint.
Case 5:
w11 = w12 = w2 = 0; defines a parabola with a specific orientation.
Case 6:
w11 ≠ 0, w22 ≠ 0, w11 ≠ w22 ; w12 = w1 = w2 = 0
defines a simple ellipse.
Σ_{i=1}^{d} w_ii x_i² + Σ_{i=1}^{d−1} Σ_{j=i+1}^{d} w_ij x_i x_j + Σ_{i=1}^{d} w_i x_i + w_0 = 0    …(1)
Number of free parameters: 2d + 1 + d(d−1)/2 = (d+1)(d+2)/2.
d² = (X − μ_i)^T Σ_i^{-1} (X − μ_i) = X^T Σ_i^{-1} X − 2 μ_i^T Σ_i^{-1} X + μ_i^T Σ_i^{-1} μ_i
Example of linearization:
g(X) = x_2 − x_1² − 3 x_1 + 6 = 0
Introduce the new feature x_3 = x_1². Then
g(X) = x_2 − x_3 − 3 x_1 + 6 = W^T X + w_0,
where X = [x_1, x_2, x_3]^T, W = [−3, 1, −1]^T and w_0 = 6.
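A quick numeric check of this linearization at a few arbitrary points:

import numpy as np

def g_original(x1, x2):
    return x2 - x1**2 - 3*x1 + 6

W, w0 = np.array([-3.0, 1.0, -1.0]), 6.0

for x1, x2 in [(0.0, 1.0), (2.0, -5.0), (-1.5, 4.0)]:
    X = np.array([x1, x2, x1**2])        # augmented feature x3 = x1^2
    assert np.isclose(g_original(x1, x2), W @ X + w0)
print("linear form W^T X + w0 matches g(X) at the test points")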
CASE – C. Arbitrary Σ_i, all parameters class-dependent (contd.)
g_i(X) = −(1/2) (X − μ_i)^T Σ_i^{-1} (X − μ_i) − (1/2) ln|Σ_i| + ln P(w_i)
Thus, g_i(X) = X^T W_i X + ω_i^T X + ω_i0,
where W_i = −(1/2) Σ_i^{-1},
ω_i = Σ_i^{-1} μ_i, and ω_i0 = −(1/2) μ_i^T Σ_i^{-1} μ_i − (1/2) ln|Σ_i| + ln P(w_i).
(Figure panels, Case-C examples:
Panel 1: σ_1^x = σ_2^y; σ_1^y = σ_2^x; ρ_1 = ρ_2 = 0; μ_1^x < μ_2^x; μ_1^y = μ_2^y.
Panel 2: σ_1^x = σ_2^y; σ_1^y = σ_2^x; ρ_1 = ρ_2 = 0; μ_1^x = μ_2^x ± C; μ_1^y = μ_2^y ∓ C.)
Principal Component Analysis
Unsupervised learning
Demonstration of the KL Transform
(Figure: a 2-D data cloud with the first and second eigenvectors overlaid.)
(Two further example figures of the KL transform.)
Goal of PCA:
Find an orthonormal matrix W^T, where Y = W^T X, such that COV(Y) ≡ (1/(n−1)) Y Y^T is diagonalized.
Hence, the singular values {σ_i : i ≥ r + 1} are zero, and the corresponding singular vectors span the left and right null spaces.
In practical applications matrices are often contaminated by
errors, and the effective rank of a matrix can be smaller than its exact
rank r.
In this case, the matrix can be well approximated by including
only those singular vectors which correspond to singular values of a
significant magnitude. Hence, it is often desirable to compute a
reduced version of the SVD, as:
Samples: x1 = [−1, 1, 2]^T;  x2 = [−2, 3, 1]^T;  x3 = [4, 0, 3]^T;
X = [ −1 −2 4 ; 1 3 0 ; 2 1 3 ]   (columns are the samples)
3-D problem, with N = 3.
Mean of the samples: μ = [1/3, 4/3, 2]^T.
Centered samples: x̃1 = [−4/3, −1/3, 0]^T;  x̃2 = [−7/3, 5/3, −1]^T;  x̃3 = [11/3, −4/3, 1]^T.
Method – 1 (easiest):
X̃ = [ −4/3 −7/3 11/3 ; −1/3 5/3 −4/3 ; 0 −1 1 ]
COVAR = (X̃ X̃^T)/2 = (1/2) [ 62/3 −25/3 6 ; −25/3 14/3 −3 ; 6 −3 2 ]
Method – 2 (PCA definition):
S_T = (1/(N−1)) Σ_{k=1}^{N} (x_k − μ)(x_k − μ)^T
With the centered samples x̃1, x̃2, x̃3 as above:
C1 = x̃1 x̃1^T = [ 1.7778 0.4444 0 ; 0.4444 0.1111 0 ; 0 0 0 ]
C2 = x̃2 x̃2^T = [ 5.4444 −3.8889 2.3333 ; −3.8889 2.7778 −1.6667 ; 2.3333 −1.6667 1.0000 ]
C3 = x̃3 x̃3^T = [ 13.4444 −4.8889 3.6667 ; −4.8889 1.7778 −1.3333 ; 3.6667 −1.3333 1.0000 ]
SigmaC = C1 + C2 + C3 = [ 20.6667 −8.3333 6.0000 ; −8.3333 4.6667 −3.0000 ; 6.0000 −3.0000 2.0000 ]
COVAR = SigmaC/2 = [ 10.3333 −4.1667 3.0000 ; −4.1667 2.3333 −1.5000 ; 3.0000 −1.5000 1.0000 ]
Next, do SVD to get the vectors.
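A sketch reproducing this worked example with NumPy; the singular values should come out approximately as 13.0404, 0.6263, and 0:

import numpy as np

X = np.array([[-1.0, -2.0, 4.0],
              [ 1.0,  3.0, 0.0],
              [ 2.0,  1.0, 3.0]])      # columns are the samples x1, x2, x3

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                            # centered data (x_tilde)
S = (Xc @ Xc.T) / (X.shape[1] - 1)     # covariance (divide by N-1 = 2)

U, s, Vt = np.linalg.svd(S)
print(np.round(s, 4))                  # approx [13.0404, 0.6263, 0.0]
print(np.round(U, 3))                  # columns: principal directions (eigenvectors)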
For a face image with N samples and dimension d (= w*h, very large), we have:
S = X̃ X̃^T / 2 = (1/2) [ 62/3 −25/3 6 ; −25/3 14/3 −3 ; 6 −3 2 ] = [ 10.3333 −4.1667 3.0000 ; −4.1667 2.3333 −1.5000 ; 3.0000 −1.5000 1.0000 ]
SVD of S gives the singular values 13.0404, 0.6263, 0.0000 (the U and V matrices are not reproduced here).
Samples: x1 = [−3, −3]^T; x2 = [−2, −2]^T; x3 = [−1, −1]^T; x4 = [4, 4]^T; x5 = [5, 5]^T; x6 = [6, 7]^T.
2-D problem (d = 2), with N = 6.
X = [ −3 −2 −1 4 5 6 ; −3 −2 −1 4 5 7 ]
Each column is an observation (sample) and each row a variable (dimension).
U:
[-0.482 0.076 -0.707 -0.511 0. ]
[-0.482 0.076 0.707 -0.511 0. ]
[ 0.413 -0.365 -0. -0.443 0.707]
[ 0.413 -0.365 -0. -0.443 -0.707]
[ 0.44 0.85 -0. -0.289 0. ]
D:
[20.611 0.308 0.167 0.137 0. ]
VT:
[-0.482 -0.482 0.413 0.413 0.44 ]
[ 0.076 0.076 -0.365 -0.365 0.85 ]
[-0.707 0.707 -0. -0. -0. ]
[-0.511 -0.511 -0.443 -0.443 -0.289]
[ 0. -0. 0.707 -0.707 -0. ]
X:
[5 5 0 0 1]
[4 5 1 1 0]
[5 4 1 1 0]
[0 0 4 4 4]
[0 0 5 5 5]
[1 1 4 4 4]
Covariance Matrix for XT:
[ 5.36  4.16  4.16 -4.48 -5.60 -3.36]
[ 4.16  3.76  3.56 -3.68 -4.60 -2.76]
[ 4.16  3.56  3.76 -3.68 -4.60 -2.76]
[-4.48 -3.68 -3.68  3.84  4.80  2.88]
[-5.60 -4.60 -4.60  4.80  6.00  3.60]
[-3.36 -2.76 -2.76  2.88  3.60  2.16]
SVD applied on Covariance Matrix of XT:
U:
[-0.462 0.669 -0. -0.486 0.31 0.087]
[-0.383 -0.518 0.707 -0.243 0.155 0.043]
[-0.383 -0.518 -0.707 -0.243 0.155 0.043]
[ 0.397 -0.071 0. -0.289 0.492 -0.715]
[ 0.497 -0.088 -0. -0.72 -0.3 0.37 ]
[ 0.298 -0.053 0. 0.21 0.723 0.584]
D:
[24.292 0.388 0.2 0. 0. 0. ]
VT:
[-0.462 -0.383 -0.383 0.397 0.497 0.298]
[ 0.669 -0.518 -0.518 -0.071 -0.088 -0.053]
[-0. 0.707 -0.707 0. -0. 0. ]
[-0.462 -0.231 -0.231 -0.253 -0.74 0.261]
[-0.328 -0.164 -0.164 -0.608 0.298 -0.616]
[-0.135 -0.067 -0.067 0.635 -0.33 -0.679]
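A sketch reproducing this output with NumPy; note that matching the slide's values requires the covariance to be normalized by N (bias=True), which is inferred from the numbers:

import numpy as np

X = np.array([[5, 5, 0, 0, 1],
              [4, 5, 1, 1, 0],
              [5, 4, 1, 1, 0],
              [0, 0, 4, 4, 4],
              [0, 0, 5, 5, 5],
              [1, 1, 4, 4, 4]], dtype=float)

# Covariance matrix "for X^T": the 6 rows are treated as variables,
# normalized by N (bias=True) to match the values on the slide.
C = np.cov(X, bias=True)
print(np.round(C, 2))                 # first entry 5.36, etc.

U, D, Vt = np.linalg.svd(C)
print(np.round(D, 3))                 # approx [24.292, 0.388, 0.2, 0, 0, 0]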
Scatter Matrices and Separability criteria
Scatter matrices used to formulate criteria of class
separability:
Within-class scatter matrix: it shows the scatter of samples around their respective class expected vectors:
S_W = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)^T
Separability criteria based on scatter matrices include:
J_3 = tr(S_1) − μ (tr(S_2) − c);   J_4 = tr(S_1) / tr(S_2)
Linear Discriminant Analysis
• Learning set is labeled – supervised learning
• Class-specific method, in the sense that it tries to ‘shape’ the scatter in order to make it more reliable for classification.
Linear Discriminant Analysis
• If S_W is nonsingular, W_opt is chosen to satisfy
W_opt = arg max_W  |W^T S_B W| / |W^T S_W W|
The columns of W_opt are the generalized eigenvectors satisfying S_B w_i = λ_i S_W w_i.
• There are at most (c − 1) non-zero eigenvalues, so the upper bound on m is (c − 1).
Linear Discriminant Analysis
S_W is singular most of the time; its rank is at most N − c.
Solution – use an alternative criterion:
W_pca = arg max_W |W^T S_T W|
W_fld = arg max_W  |W^T W_pca^T S_B W_pca W| / |W^T W_pca^T S_W W_pca W|
Demonstration for LDA
Hand-worked EXAMPLE:
Data points (x): 1 2 3 5 4 6 8 −2 −1 1 3 4 2 5
Data points (y): 1 2 3 4 5 6 7  3  4 5 6 7 8 9
Class:           1 1 1 1 1 1 1  2  2 2 2 2 2 2
Sw = [ 10.6122 8.5714 ; 8.5714 8.0000 ]
Sb = [ 20.6429 −17.00 ; −17.00 14.00 ]
inv(Sw) · Sb = [ 27.20 −22.40 ; −31.268 25.75 ]
Perform eigendecomposition on the above.
Eigenvectors: [ −0.7719 0.6357 ; 0.6357 0.7719 ]
Two variants of the example (with Sb computed differently):
Variant 1: Sw = [ 10.6122 8.5714 ; 8.5714 8.0000 ];  Sb = [ 20.6429 −17.00 ; −17.00 14.00 ]
Eigenvalues of Sw^{-1} Sb: 53.687, 0.   Eigenvectors: [ −0.7719 0.6357 ; 0.6357 0.7719 ]
Variant 2: Sw = [ 10.6122 8.5714 ; 8.5714 8.0000 ];  Sb = [ 203.143 −95.00 ; −95.00 87.50 ]
Eigenvalues of Sw^{-1} Sb: 297.83, 0.   Eigenvectors: [ −0.7355 −0.6775 ; 0.6775 0.7355 ]
After linear projection, using LDA:
Same EXAMPLE for LDA, with C = 3:
Data points (x): 1 2 3 5 4 6 8 −2 −1 1 3 4 2 5
Data points (y): 1 2 3 4 5 6 7  3  4 5 6 7 8 9
Class:           1 1 1 2 2 3 3  1  1 1 2 2 3 3
Sw = [ 8.0764 −2.125 ; −2.125 4.1667 ]
Sb = [ 56.845 52.50 ; 52.50 50.00 ]
inv(Sw) · Sb = [ 11.958 11.155 ; 18.7 17.69 ]
Perform eigendecomposition on the above.
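A sketch reproducing the C = 3 scatter matrices; the normalizations used below (per-class covariances divided by N_i for Sw, and Σ N_i(μ_i − μ)(μ_i − μ)^T for Sb) are inferred from the slide's numbers:

import numpy as np

x = np.array([1, 2, 3, 5, 4, 6, 8, -2, -1, 1, 3, 4, 2, 5], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 7,  3,  4, 5, 6, 7, 8, 9], dtype=float)
cls = np.array([1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 2, 2, 3, 3])
X = np.stack([x, y], axis=1)               # (14, 2)

mu_all = X.mean(axis=0)
Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for c in np.unique(cls):
    Xc = X[cls == c]
    mu_c = Xc.mean(axis=0)
    Sw += np.cov(Xc.T, bias=True)          # per-class covariance, normalized by N_i
    d = (mu_c - mu_all)[:, None]
    Sb += len(Xc) * (d @ d.T)              # N_i (mu_i - mu)(mu_i - mu)^T

print(np.round(Sw, 4))                     # approx [[8.0764, -2.125], [-2.125, 4.1667]]
print(np.round(Sb, 3))                     # approx [[56.845, 52.5], [52.5, 50.0]]
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
print(np.round(evals, 3))                  # the LDA projection directions are the eigenvectors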
• Reinforcement learning
• Generalization capabilities
• Evolutionary Computations
• Genetic algorithms
• Pervasive computing
• Neural dynamics
• Bishop – PR
Golub, G.H., and Van Loan, C.F. (1989) Matrix Computations, 2nd
ed. (Baltimore: Johns Hopkins University Press).