
Pattern Recognition

Pattern recognition is:

1. The name of the journal of the Pattern Recognition Society.

2. A research area in which patterns in data are found, recognized, discovered, …whatever.

3. A catchall phrase that includes
   • classification
   • clustering
   • data mining
   • …
1
Two Schools of Thought
1. Statistical Pattern Recognition

   The data is reduced to vectors of numbers, and statistical techniques are used for the tasks to be performed.

2. Structural Pattern Recognition

   The data is converted to a discrete structure (such as a grammar or a graph), and the techniques used are related to computer science subjects (such as parsing and graph matching).
2
In this course

1. How should objects to be classified be represented?

2. What algorithms can be used for recognition (or matching)?

3. How should learning (training) be done?

3
Classification in Statistical PR
• A class is a set of objects having some important properties in common.

• A feature extractor is a program that inputs the data (image) and extracts features that can be used in classification.

• A classifier is a program that inputs the feature vector and assigns it to one of a set of designated classes or to the “reject” class.

What kinds of object classes?
4
Feature Vector Representation
• X = [x1, x2, … , xn], each xj a real number
• xj may be an object measurement
• xj may be a count of object parts
• Example object representation: [#holes, #strokes, moments, …]
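A minimal sketch of such a representation in Python (the feature values below are invented for illustration, not taken from the text):

import numpy as np

# Hypothetical feature vector for one character image:
# [#holes, #strokes, then a few image moments]
x = np.array([1.0,                 # number of holes (e.g., 'A' has one)
              3.0,                 # number of strokes
              0.42, 0.17, 0.08])   # moment values (made up)

print(x.shape)   # (5,) -- every object is reduced to the same n numbers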

5
Possible features for character recognition

6
Some Terminology
• Classes: a set of m known categories of objects
  (a) might have a known description for each
  (b) might have a set of samples for each
• Reject class: a generic class for objects not in any of the designated known classes
• Classifier: assigns an object to a class based on its features

7
Discriminant functions

• Functions f(x, K) perform some computation on feature vector x
• Knowledge K from training or programming is used
• The final stage determines the class

8
Classification using nearest class mean
• Compute the Euclidean distance between feature vector X and the mean of each class.

• Choose the closest class, if close enough (reject otherwise), as sketched below.
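A minimal sketch of the nearest-mean rule (the class names, training vectors, and reject threshold are invented for illustration):

import numpy as np

def nearest_mean_classify(x, class_means, reject_threshold):
    """Assign x to the class with the closest mean, or reject."""
    best_class, best_dist = None, float("inf")
    for label, mean in class_means.items():
        d = np.linalg.norm(x - mean)          # Euclidean distance to this class mean
        if d < best_dist:
            best_class, best_dist = label, d
    return best_class if best_dist <= reject_threshold else "reject"

# Hypothetical training data: feature vectors grouped by class
train = {"A": np.array([[1.0, 3.0], [1.1, 2.9]]),
         "B": np.array([[2.0, 1.0], [2.2, 0.9]])}
means = {label: vecs.mean(axis=0) for label, vecs in train.items()}

print(nearest_mean_classify(np.array([1.05, 3.0]), means, reject_threshold=1.0))  # -> "A"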

9
Nearest mean might yield poor results with complex structure
• Class 2 has two modes; where is its mean?

• But if the modes are detected, two subclass mean vectors can be used.

10
Scaling coordinates by std dev

We can compute a modified distance from feature vector x to class mean vector xc by scaling by the spread, or standard deviation, σi of class c along each dimension i:

scaled Euclidean distance from x to class mean xc:
d(x, xc) = sqrt( Σi ( (xi − xc,i) / σi )² )
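A small sketch of this scaled distance, assuming per-class means and standard deviations have already been estimated from training samples (the numbers below are made up):

import numpy as np

def scaled_euclidean(x, class_mean, class_std):
    """Distance from x to a class mean, scaling each dimension i by sigma_i."""
    return np.sqrt(np.sum(((x - class_mean) / class_std) ** 2))

# Hypothetical class statistics estimated from training samples
mean_c = np.array([10.0, 2.0])
std_c  = np.array([4.0, 0.5])    # class c spreads much more along dimension 0

x = np.array([14.0, 2.5])
print(scaled_euclidean(x, mean_c, std_c))   # each dimension now counts in "standard deviations"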

11
Nearest Mean
• What’s good about the nearest mean
approach?

12
Nearest Neighbor Classification

• Keep all the training samples in some efficient look-up structure.

• Find the nearest neighbor of the feature vector to be classified and assign it the class of that neighbor, as sketched below.

• Can be extended to K nearest neighbors.
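A brute-force sketch of K-nearest-neighbor classification (a real system would replace the linear scan with an efficient look-up structure such as a k-d tree; the training data here is invented):

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=1):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(train_X - x, axis=1)       # distance to every training sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = ["I", "I", "II", "II"]

print(knn_classify(np.array([0.2, 0.1]), train_X, train_y, k=3))  # -> "I"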

13
Nearest Neighbor
• Pros

• Cons

14
Evaluating Results
• We need a way to measure the performance
of any classification task.
• Binary classifier: Face or not Face
– We can talk about true positives, false positives,
true negatives, false negatives
• Multiway classifier: ‘a’ or ‘b’ or ‘c’ .....
– For each class, what percentage is classified correctly, and what percentage goes to each of the wrong classes
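A small sketch of these counts for a hypothetical face / not-face classifier (the predictions and ground truth are made up); the two rates printed at the end are exactly what the ROC curve on the next slide plots:

def binary_counts(predictions, truths, positive="face"):
    """Count true/false positives and negatives for a binary classifier."""
    tp = sum(p == positive and t == positive for p, t in zip(predictions, truths))
    fp = sum(p == positive and t != positive for p, t in zip(predictions, truths))
    tn = sum(p != positive and t != positive for p, t in zip(predictions, truths))
    fn = sum(p != positive and t == positive for p, t in zip(predictions, truths))
    return tp, fp, tn, fn

# Hypothetical face / not-face predictions and ground truth
preds  = ["face", "face", "not", "face", "not"]
truths = ["face", "not",  "not", "face", "face"]
tp, fp, tn, fn = binary_counts(preds, truths)
print("detection rate:", tp / (tp + fn), "false alarm rate:", fp / (fp + tn))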

15
Receiver Operating Characteristic (ROC) Curve

• Plots correct detection rate versus false alarm rate
• Generally, false alarms go up with attempts to detect higher percentages of known objects

16
An ROC from our work:

17
Confusion matrix shows empirical performance for multiclass problems

Confusion may be unavoidable between some classes, for example, between 9’s and 4’s.
18
Bayesian decision-making
• Classify into the class w that is most likely given the observations X. The following distributions are needed: the prior probability P(w) of each class and the class-conditional distribution P(X | w).

• Then we have, by Bayes rule:

  P(w | X) = P(X | w) P(w) / P(X)

  and we choose the class w that maximizes P(w | X).
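A minimal sketch of this decision rule for one discrete feature (the priors and class-conditional probabilities are invented for illustration; P(X) can be ignored because it is the same for every class):

# Hypothetical discrete example: one binary feature X, two classes.
prior      = {"I": 0.5, "II": 0.5}                      # P(w)
likelihood = {("I", 1): 0.9, ("I", 0): 0.1,             # P(X = x | w)
              ("II", 1): 0.2, ("II", 0): 0.8}

def bayes_classify(x, prior, likelihood):
    """Pick the class w maximizing P(w | X=x), which is proportional to P(X=x | w) P(w)."""
    return max(prior, key=lambda w: likelihood[(w, x)] * prior[w])

print(bayes_classify(1, prior, likelihood))   # -> "I"
print(bayes_classify(0, prior, likelihood))   # -> "II"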

19
Classifiers often used in CV
• Decision Tree Classifiers

• Artificial Neural Net Classifiers

• Bayesian Classifiers and Bayesian Networks (Graphical Models)

• Support Vector Machines

20
Decision Trees
[Figure: a decision tree for character recognition. The root tests #holes (0, 1, or 2); lower nodes test #strokes, moment of inertia (< t or ≥ t), and best axis direction (0, 60, or 90); the leaves are the characters - / 1 x w 0 A 8 B.]
21
Decision Tree Characteristics

1. Training
How do you construct one from training data?
Entropy-based Methods

2. Strengths

Easy to Understand

3. Weaknesses

Overtraining
22
Entropy-Based Automatic Decision
Tree Construction

Training Set S Node 1


x1=(f11,f12,…f1m) What feature
x2=(f21,f22, f2m) should be used?
. What values?
.
xn=(fn1,f22, f2m)

Quinlan suggested information gain in his ID3 system


and later the gain ratio, both based on entropy.

We’ll look at a variant called information content. 23


Entropy
Given a set of training vectors S, if there are c classes,

Entropy(S) = Σ (i=1 to c) −pi log2(pi)

where pi is the proportion of category i examples in S.

If all examples belong to the same category, the entropy is 0 (no discrimination needed).

The more evenly the examples are mixed across the categories, the larger the entropy (and the more discrimination still to be done).

24
Entropy
• Two class problem: class I and class II
• Suppose ½ the set belongs to class I and ½ belongs to class II.
• Then
  entropy = −½ log2 ½ − ½ log2 ½
          = (−½)(−1) − (½)(−1)
          = 1
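A small sketch of the entropy formula, checked against the two-class example above:

import math

def entropy(proportions):
    """Entropy(S) = sum over categories of -p_i * log2(p_i)."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # 1.0, as in the two-class example above
print(entropy([1.0]))        # 0.0, all examples in one category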

25
Information Content
The information content I(C;F) of the class variable C, with possible values {c1, c2, … , cm}, with respect to the feature variable F, with possible values {f1, f2, … , fd}, is defined by:

I(C;F) = Σi Σj P(C=ci, F=fj) log2 [ P(C=ci, F=fj) / ( P(C=ci) P(F=fj) ) ]

• P(C=ci) is the probability of class C having value ci.
• P(F=fj) is the probability of feature F having value fj.
• P(C=ci, F=fj) is the joint probability of class C = ci and feature F = fj.

These are estimated from the frequencies in the training data.


26
Example (from text)

X  Y  Z  |  C
1  1  1  |  I
1  1  0  |  I
0  0  1  |  II
1  0  0  |  II
How would you distinguish class I from class II?
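A small sketch that estimates I(C;F) from the four samples above for each of X, Y, and Z, using the information-content formula from the previous slide:

import math
from collections import Counter

# The four training samples from the example (features X, Y, Z and class C)
samples = [((1, 1, 1), "I"), ((1, 1, 0), "I"), ((0, 0, 1), "II"), ((1, 0, 0), "II")]

def information_content(feature_index):
    """I(C;F) = sum_ij P(ci,fj) * log2( P(ci,fj) / (P(ci) P(fj)) ), estimated from frequencies."""
    n = len(samples)
    joint = Counter((c, f[feature_index]) for f, c in samples)
    p_c   = Counter(c for _, c in samples)
    p_f   = Counter(f[feature_index] for f, _ in samples)
    return sum((cnt / n) * math.log2((cnt / n) / ((p_c[c] / n) * (p_f[fv] / n)))
               for (c, fv), cnt in joint.items())

for i, name in enumerate("XYZ"):
    print(name, round(information_content(i), 3))
# X 0.311, Y 1.0, Z 0.0 -> feature Y separates class I from class II perfectly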

27
Example (cont)

28
Using Information Content

• Start with the root of the decision tree and the whole training set.

• Compute I(C;F) for each feature F.

• Choose the feature F with the highest information content for the root node.

• Create branches for each value f of F.

• On each branch, create a new node with the reduced training set and repeat recursively (see the sketch after this list).
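A compact sketch of this recursive construction (the data representation and tie-breaking are my own choices, not prescribed by the text):

import math
from collections import Counter

def info_content(samples, i):
    """I(C;F) for feature index i, estimated from frequencies in `samples`."""
    n = len(samples)
    joint = Counter((c, f[i]) for f, c in samples)
    p_c, p_f = Counter(c for _, c in samples), Counter(f[i] for f, _ in samples)
    return sum((k / n) * math.log2((k / n) / ((p_c[c] / n) * (p_f[v] / n)))
               for (c, v), k in joint.items())

def build_tree(samples, features):
    classes = {c for _, c in samples}
    if len(classes) == 1 or not features:            # pure node, or nothing left to split on
        return Counter(c for _, c in samples).most_common(1)[0][0]
    best = max(features, key=lambda i: info_content(samples, i))   # highest I(C;F)
    tree = {"feature": best, "branches": {}}
    for value in {f[best] for f, _ in samples}:                    # one branch per value of F
        subset = [(f, c) for f, c in samples if f[best] == value]
        tree["branches"][value] = build_tree(subset, [i for i in features if i != best])
    return tree

samples = [((1, 1, 1), "I"), ((1, 1, 0), "I"), ((0, 0, 1), "II"), ((1, 0, 0), "II")]
print(build_tree(samples, [0, 1, 2]))
# -> a tree that splits on feature index 1 (Y): Y = 1 -> "I", Y = 0 -> "II"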
29
Using Information Content
X  Y  Z  |  C
1  1  1  |  I
1  1  0  |  I
0  0  1  |  II
1  0  0  |  II

• What would the optimal tree look like for features X, Y, and Z and classes I and II in this small example?

30
Artificial Neural Nets
Artificial Neural Nets (ANNs) are networks of
artificial neuron nodes, each of which computes
a simple function.

An ANN has an input layer, an output layer, and “hidden” layers of nodes.

[Figure: a feed-forward network with inputs on the left, layers of hidden nodes in the middle, and outputs on the right.]
31
Node Functions

[Figure: inputs a1 … an feed into neuron i through weights w(1,i) … w(n,i); the neuron produces a single output.]

output = g ( Σj aj * w(j,i) )

Function g is commonly a step function, sign function, or sigmoid function (see text).
32
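A minimal sketch of one such node, using a sigmoid for g (the inputs and weights below are invented):

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def neuron_output(a, w, g=sigmoid):
    """output = g( sum_j a_j * w(j,i) ) for one neuron i."""
    return g(sum(aj * wj for aj, wj in zip(a, w)))

# Hypothetical inputs and weights for a single neuron
a = [0.5, 1.0, -0.25]        # incoming activations a_1 .. a_n
w = [0.8, -0.4, 1.2]         # weights w(1,i) .. w(n,i)
print(neuron_output(a, w))   # the squashed weighted sum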
Neural Net Learning

Neural net learning in general is beyond the scope of this text; only simple feed-forward learning is covered.

The most common method is called back propagation.

We’ve used a free package called NevProp, or just the WEKA machine learning package.

33
Convolutional Neural Nets
• CNNs were invented in the 90s, but they have returned and become very popular in computer vision in the last few years, because they have been used to achieve higher accuracy than competing methods on several benchmark data sets in object recognition.
• They are related to “deep learning”.
• They have multiple layers, some of which perform convolutions instead of having full connections.
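As a rough illustration of what a convolution layer computes (a bare single-channel 2D correlation, not any particular CNN architecture; the image and filter are invented):

import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image; each output value is a local weighted sum."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image  = np.random.rand(8, 8)                             # toy single-channel "image"
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])   # a vertical-edge filter
print(conv2d(image, kernel).shape)                        # (6, 6): the same weights are reused everywhere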
34
Simple CNN

35
Support Vector Machines (SVM)

Support vector machines are learning algorithms that try to find the hyperplane that best separates the differently classified data. They are based on two key ideas:

• Maximum margin hyperplanes

• A kernel “trick”

36
Maximal Margin

[Figure: points from two classes (labeled 0 and 1) separated by a hyperplane, with the margin between the classes marked.]

Find the hyperplane with maximal margin for all the points. This leads to an optimization problem that has a unique solution (it is a convex problem).
37
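A small sketch of fitting a maximal-margin (linear) SVM with scikit-learn, one off-the-shelf solver for this convex problem (the toy data is invented; a very large C approximates the hard-margin case):

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class 0 on the lower left, class 1 on the upper right
X = np.array([[0.0, 0.0], [0.5, 1.0], [0.2, 0.6],
              [3.0, 3.0], [3.5, 2.5], [2.8, 3.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C ~ hard (maximal) margin
clf.fit(X, y)
print(clf.support_vectors_)                     # the points that define the margin
print(clf.predict([[0.4, 0.5], [3.1, 2.9]]))    # -> [0 1]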
Non-separable data

[Figure: points from the two classes (0 and 1) intermixed, so that no hyperplane separates them.]

What can be done if the data cannot be separated with a hyperplane?
38
The kernel trick

The SVM algorithm implicitly maps the original data to a feature space of possibly infinite dimension, in which data that is not separable in the original space becomes separable in the feature space.

[Figure: the kernel trick maps points from the original space R^k, where the two classes are not linearly separable, to a feature space R^n, where they are.]
39
The kernel trick

• What is this space?
• The user defines it.
• It’s usually a dot product space.
• Example: if we have two vectors X and Y, we can work with (X·Y) or exp(X·Y).

40
Example from AI Text

True decision boundary is x1² + x2² < 1.

• For this problem, F(xi) · F(xj) is just (xi · xj)², which is called a kernel function.
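A quick check of that claim for 2-D vectors: with the explicit feature map F(x) = (x1², √2·x1·x2, x2²), the dot product in the feature space equals (xi · xj)² computed in the original space (the vectors below are arbitrary):

import numpy as np

def F(x):
    """Explicit quadratic feature map for 2-D vectors: F(x) . F(y) = (x . y)^2."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

a = np.array([0.7, -1.2])
b = np.array([1.5, 0.4])
print(np.dot(F(a), F(b)))        # computed in the 3-D feature space
print(np.dot(a, b) ** 2)         # same value from the kernel (xi . xj)^2 in the original space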

41
Application

• Sal Ruiz used support vector machines in his work on 3D object recognition.

• He trained classifiers on data representing deformations of a 3D model of a class of objects.

• The classifiers learned what kinds of surface patches are related to key parts of the model (e.g., a snowman’s face).

42
Kernel Function used in our 3D Computer
Vision Work

• k(A, B) = exp(−θAB² / σ²)

• A and B are shape descriptors (big vectors).

• θAB is the angle between these vectors.

• σ² is the “width” of the kernel.
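A sketch of this kernel, assuming the shape descriptors are plain numeric vectors (the random descriptors and σ value below are placeholders):

import numpy as np

def angle_kernel(A, B, sigma):
    """k(A,B) = exp(-theta_AB^2 / sigma^2), where theta_AB is the angle between A and B."""
    cos_theta = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))   # clip guards against round-off
    return np.exp(-(theta ** 2) / sigma ** 2)

# Hypothetical shape descriptors
A = np.random.rand(128)
B = np.random.rand(128)
print(angle_kernel(A, B, sigma=0.5))   # near 1 for similar descriptors, smaller for dissimilar ones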

43
We used SVMs for Insect Recognition

44
EM for Classification
• The EM algorithm was used as a clustering algorithm for image segmentation.

• It can also be used as a classifier, by creating a Gaussian “model” for each class to be learned.
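A minimal sketch of that idea using scikit-learn's GaussianMixture (which is fit with EM) as the per-class Gaussian model; the class names and training data are invented:

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data for two classes
rng = np.random.default_rng(0)
train = {"sky":   rng.normal([0.2, 0.8], 0.05, size=(50, 2)),
         "grass": rng.normal([0.4, 0.3], 0.05, size=(50, 2))}

# Fit one Gaussian "model" per class with EM (one component here; more would give a mixture)
models = {label: GaussianMixture(n_components=1).fit(X) for label, X in train.items()}

def classify(x):
    """Assign x to the class whose Gaussian model gives it the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score_samples(x.reshape(1, -1))[0])

print(classify(np.array([0.22, 0.78])))   # -> "sky"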

45
Summary
• There are multiple kinds of classifiers
developed in machine learning research.
• We can use and have used pretty much all of
them in computer vision classification,
detection, and recognition tasks.

46
