Machine Learning Lectures
Classification stage:
It uses the features to classify or assign the physical object or event into
one of several pre-specified categories or classes
Definition of features:
Features are measurements or properties that can be extracted from
objects or events and can be used to classify these objects or events.
Definition of patterns:
A pattern is an entity that can be given a name, e.g. a fingerprint, a
handwritten word, a human face, or a spoken word.
Feature vectors and feature spaces:
Feature space:
The n-dimensional space defined by the feature vectors is called the
feature space.
Scatter plot:
Objects are represented as points in the feature space. This
representation is called a scatter plot.
Example: recognition of the five characters L, P, O, E, and Q.
The acquired image for each of these characters is a two-dimensional
array of values, where each value is called a pixel (picture element).
For example, the image of the character L is such a 2D array of pixel values.
By inspecting the images (the 2D arrays) of these 5 characters, we can
see that they can be discriminated (recognized) by the
following four features:
- The number of vertical straight lines (V)
- The number of horizontal straight lines (H)
- The number of oblique lines (O)
- The number of curved lines (C)
The following table shows the values of the four features for each
character:
Character V H O C
L 1 1 0 0
P 1 0 0 1
O 0 0 0 1
E 1 3 0 0
Q 0 0 1 1
To recognize these 5 characters using the four features above, we use a
tree-structured classifier that tests the features one at a time.
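A minimal Python sketch of such a tree-structured classifier is given below; the particular order of the feature tests is one possible choice, not necessarily the one used in the lecture.

```python
# Sketch of a tree-structured classifier for the five characters,
# using the feature counts V, H, O and C from the table above.
def classify_character(V, H, O, C):
    if C == 0:                       # no curved lines: the character is L or E
        return "E" if H == 3 else "L"
    if V == 1:                       # one vertical line plus a curve: P
        return "P"
    return "Q" if O == 1 else "O"    # otherwise O or Q (Q has an oblique line)

# Example: the feature vector of 'E' from the table gives
print(classify_character(V=1, H=3, O=0, C=0))   # -> E
```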
A Realistic Pattern Recognition System
Consider the following scenario:
A fish processing plant wants to automate the process of sorting
incoming fish products according to species (salmon or tuna).
The automation system consists of the following (as shown in the figure
of the next slide):
- a conveyor belt for incoming products
- two conveyor belts for sorted products
- a pick-and-place robotic arm
- a vision system with an overhead camera
- a computer to analyze images and control the robot arm
The machine learning system for automating the fish sorting process
consists of the following:
1- Vision system
The vision system consists of an IR sensor and overhead camera. The IR
sensor is used to indicate to the computer that there is a fish in the
sorting area. The camera captures an image as a new fish enters the
sorting area.
2- Preprocessing stage
This stage consists of image processing algorithms for performing the
following:
- Adjustments for average intensity levels
- Segmentation to separate fish from background
3- Feature Extraction stage
In this stage, we compute features that can be used to discriminate
between the two species.
- Suppose we know that, on average, tuna are larger than salmon; then we
can use the length of the fish to discriminate tuna from salmon.
- From the segmented image we estimate the length of the fish.
4- Classification stage
- Collect a set of examples from both species
- Compute the distribution of lengths for both classes (as shown in the
figure of the next slide).
- Determine a decision boundary (threshold) that minimizes the
classification error.
Histogram showing the distribution of length for the two species.
- We estimate the classifier's probability of error and obtain a
discouraging result of about 40%.
- To improve the performance of our machine learning system, we use
an additional feature which is the brightness of the fish.
- We use the two features, length and brightness, and draw the scatter
plot shown in the figure on the next slide.
- We compute a linear decision boundary to separate the two classes and
obtain a classification rate of 95.7% (an illustrative sketch follows the scatter-plot caption below).
Scatter plot showing how the patterns (samples) of the two species are
scattered in the feature space defined by the two features: length and brightness.
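For illustration, a linear decision boundary on the two features can be fitted as in the sketch below. The fish measurements here are made up for the example; the 95.7% figure above refers to the lecture's own data, not to this sketch.

```python
import numpy as np

# Hypothetical [length, brightness] measurements; real values would come
# from the feature-extraction stage of the vision system.
salmon = np.array([[3.1, 0.80], [3.5, 0.90], [2.9, 0.70], [3.3, 0.85]])
tuna   = np.array([[5.2, 0.40], [4.8, 0.50], [5.5, 0.45], [4.9, 0.35]])

X = np.vstack([salmon, tuna])
d = np.array([+1] * len(salmon) + [-1] * len(tuna))   # desired outputs

# Fit D(x) = w0 + w1*length + w2*brightness by least squares.
A = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(A, d, rcond=None)

def classify(length, brightness):
    D = w[0] + w[1] * length + w[2] * brightness
    return "salmon" if D > 0 else "tuna"

print(classify(3.2, 0.80))   # a salmon-like measurement
```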
Statistical Pattern Classification
Example:
Consider the training data shown below, in which the two features are
equally important. Use the single nearest neighbor with Euclidean
distance to classify the unknown sample at x = 0 and y = -120. Scale the
data before applying the distance measurements.

class   A     A     A     A     B     B     B     B
x       2     3     2    -1    -2     3    -1    -3
y     300   100  -100   300  -300  -200   200  -100

Solution:
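One possible way to scale the data and carry out the single-nearest-neighbor search is sketched below. The z-score scaling is an assumption; the exercise only requires that the two features be brought to comparable ranges so that they are equally important.

```python
import numpy as np

# Training data from the table above.
X = np.array([[ 2,  300], [ 3,  100], [ 2, -100], [-1,  300],
              [-2, -300], [ 3, -200], [-1,  200], [-3, -100]], dtype=float)
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
unknown = np.array([0.0, -120.0])

# Scale each feature to zero mean and unit standard deviation so that
# both features carry equal weight in the distance computation.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma
us = (unknown - mu) / sigma

# Single nearest neighbor under Euclidean distance in the scaled space.
dists = np.linalg.norm(Xs - us, axis=1)
print("predicted class:", labels[int(np.argmin(dists))])
```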
Adaptive Decision Boundaries
The Bayesian decision procedures are optimal decision procedures if the
conditional densities of the classes are known. If the densities are not known,
nearest neighbor techniques can be used. However, experimentation may be
required to choose K and the set of reference samples. Classification may be time
consuming if the number of reference samples is large.
An alternative approach is to assume that the functional form of the decision
boundary between each pair of classes is linear; the task is then to find that
linear decision boundary. The following figure explains this approach.
Adaptive decision boundaries (continued)
In this technique, during the training phase, samples are presented to the current form of the
classifier. Whenever a sample is correctly classified, no change is made in the weights, but
when a sample is incorrectly classified, each weight is changed in whichever direction will tend
to correct the output, D.
For example, if D were negative for a particular sample when it should have been positive, the
value of D for that sample should be increased by changing the weights. If feature xi has a
positive value for that sample, then wi should be increased to increase D. However, if xi were
negative for that sample, then wi should be decreased in order to increase D.
The exact technique consists of the following steps:
Adaptive decision boundaries (continued)
1. Initialize the weights w0, w1, …., wM to small random values. Choose
positive constants C and K.
2. Choose a sample x = (x1, x2, …, xM) from the training set.
3. Compute D for the chosen sample: D = w0 + w1 x1 + w2 x2+ … + wM xM.
4. If the sample x is from class A for which D should be positive, then if D is
not positive, replace wi by wi + Cxi, for i = 1, 2, …, M, where C is a
positive constant that controls the step size for weight adjustment. Also
replace w0 by w0 + CK, where K is a positive constant. If D is positive,
then no change should be made in the weights.
Adaptive decision boundaries (continued)
If the sample x is from class B for which D should be negative, then if D is not
negative, replace wi by wi – Cxi, for i= 1, 2, …, M, also replace w0 by w0 – CK.
5. Repeat steps 2 to 4 with each of the samples in the training set. When finished,
run through the entire training data set again. Stop when all the samples are
correctly classified during one complete pass of the entire training set.
An additional stopping rule is also needed since this process would never terminate
if the two classes were not linearly separable. A fixed maximum number of
iterations could be tried, or the algorithm could be terminated when the error rate
ceases to decrease significantly.
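A compact Python sketch of steps 1 to 5 is shown below. The ±1 labels encode "D should be positive for class A" and "D should be negative for class B"; the random seed, initialization range, and maximum number of passes are assumptions.

```python
import numpy as np

def train_adaptive_boundary(samples, labels, C=0.2, K=0.5, max_passes=100):
    """Train D = w0 + w1*x1 + ... + wM*xM; labels are +1 (class A) or -1 (class B)."""
    M = samples.shape[1]
    rng = np.random.default_rng(0)
    w0 = rng.uniform(-0.2, 0.2)              # step 1: small random weights
    w = rng.uniform(-0.2, 0.2, size=M)

    for _ in range(max_passes):              # extra stopping rule: maximum passes
        errors = 0
        for x, t in zip(samples, labels):    # steps 2-3: present a sample, compute D
            D = w0 + w @ x
            if t * D <= 0:                   # step 4: D has the wrong sign (or is 0)
                w0 += C * K * t              # move w0 by +/- CK
                w += C * t * x               # move each wi by +/- C*xi
                errors += 1
        if errors == 0:                      # step 5: stop after one error-free pass
            break
    return w0, w

# Usage with the training samples of the example that follows:
# X = np.array([[2, 10], [3, 8], [5, 2], [50, 25], [65, 30], [35, 40]], dtype=float)
# t = np.array([1, 1, 1, -1, -1, -1])
# w0, w = train_adaptive_boundary(X, t)
```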
Example:
Use the adaptive decision boundary algorithm to find a decision
boundary that classifies patterns from two classes. Each pattern is
represented by two features. The training samples are:
x1   x2   class
 2   10     A
 3    8     A
 5    2     A
50   25     B
65   30     B
35   40     B

Solution:
- The form of the decision boundary is: D = w0 + w1 x1 + w2 x2
- Assume we design D to be positive for class A and negative for class B.
Example (continued):
The steps for finding the weights are as follows:
Step 1: Initialize the weights with small random values:
w0 = 0.1, w1 = 0.2, w2 = -0.15, then D = 0.1 + 0.2 x1 - 0.15 x2
Choose C = 0.2 and K = 0.5.
Step 2: Choose the first sample of the training data, x = (2, 10).
Step 3: Compute D = 0.1 + 0.2 (2) - 0.15 (10) = -1
Step 4: D is negative while it should be positive for this sample, so we change the
weights as follows:
w1 = w1+Cx1 = 0.2 + 0.2(2) = 0.6
w2 = w2 + Cx2 = -0.15 + 0.2 (10) = 1.85
w0 = w0 + CK = 0.1 + 0.2 (0.5) = 0.2, therefore D = 0.2 + 0.6 x1 + 1.85 x2
Example (continued):
Step 2: Choose the training sample x = (3, 8).
Step 3: Compute D = 0.2 + 0.6 x1 + 1.85 x2 = 0.2 + 0.6 (3) + 1.85 (8) = 16.8
Step 4: D is positive for this sample; therefore, no change in the weights should be made.
Nonlinear decision boundaries, which are more complex than linear
ones, can also be found by this adaptive technique. For example, to
create decision regions with general quadratic boundaries, we can
convert the two-dimensional feature vector (x, y) to the five-dimensional
feature vector (u1, u2, u3, u4, u5), where u1 = x, u2 = y, u3 = x^2, u4 = xy,
u5 = y^2. The discriminant function then becomes:
D = w0 + w1 u1 + w2 u2 + w3 u3 + w4 u4 + w5 u5
  = w0 + w1 x + w2 y + w3 x^2 + w4 xy + w5 y^2
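For illustration, the quadratic feature mapping can be written as a small helper; the function name is arbitrary.

```python
# Expand a 2-D feature vector (x, y) to the 5-D quadratic feature vector
# (u1, u2, u3, u4, u5) = (x, y, x^2, x*y, y^2); the linear training rule
# is then applied unchanged in the 5-D space.
def quadratic_features(x, y):
    return [x, y, x**2, x * y, y**2]
```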
Adaptive decision boundaries (continued)
If there are more than two classes, the same algorithm can be used to
find the boundary between each pair of classes. If there are K classes,
there will be K(K-1)/2 decision boundaries between pairs of classes.
Adaptive Discriminant Functions
Another approach to classification when there are more than two classes is
to derive a separate linear discriminant function for each class and to choose
the class whose discriminant function is largest.
If there are N classes and M features, then the set of linear discriminant
functions is:
D1 = w01 + w11 x1 + w21 x2 + … + wM1 xM
D2 = w02 + w12 x1 + w22 x2 + … + wM2 xM
⋮
DN = w0N + w1N x1 + w2N x2 + … + wMN xM
Adaptive Discriminant Functions (continued)
The technique for adapting the weights of these discriminant functions
during training is as follows:
Whenever a sample X is classified as class Cj when it should have been classified as
class Ci, the new weights for the two corresponding discriminant functions (Di and
Dj) are adjusted as follows:
wmi = wmi + C xm,
wmj = wmj – C xm
where m = 1, 2, ……., M and
w0i = w0i + CK, w0j = w0j – CK
No change is made in the discriminant functions for classes other than Ci and Cj.
Adaptive Discriminant Functions (continued)
This procedure has the effect of increasing Di for the class that should
have been chosen and decreasing Dj for the class that was chosen but
should not have been.
There is no reason to change the weights in the other discriminant
functions.
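A sketch of this multi-class training loop is given below; the initialization range, maximum number of passes, and the convention that class labels are the integer indices 0, 1, …, N-1 are assumptions carried over from the two-class algorithm.

```python
import numpy as np

def train_discriminants(samples, labels, n_classes, C=0.2, K=0.5, max_passes=100):
    """One linear discriminant per class; weights[c] = [w0c, w1c, ..., wMc]."""
    M = samples.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.uniform(-0.3, 0.3, size=(n_classes, M + 1))  # small random init

    for _ in range(max_passes):
        mistakes = 0
        for x, ci in zip(samples, labels):            # ci = index of the true class
            D = weights[:, 0] + weights[:, 1:] @ x    # all discriminant values
            cj = int(np.argmax(D))                    # class actually chosen
            if cj != ci:                              # misclassified sample
                weights[ci, 0] += C * K               # raise Di for the true class
                weights[ci, 1:] += C * x
                weights[cj, 0] -= C * K               # lower Dj for the chosen class
                weights[cj, 1:] -= C * x
                mistakes += 1
        if mistakes == 0:
            break
    return weights
```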
Example:
Design an adaptive discriminant functions algorithm to classify patterns
from three classes. Each pattern is represented by two features. The
training samples are as given in the table.
x1   x2   class
 2   10     A
 3    8     A
 5    2     A
50   25     B
65   30     B
35   40     B
20   15     C
25   18     C
15   12     C

Solution:
Since we have three classes, three discriminant functions are required:
D1 = w01 + w11 x1 + w21 x2
D2 = w02 + w12 x1 + w22 x2
D3 = w03 + w13 x1 + w23 x2
Example (continued):
Step 1: Initialize the weights with small random values as follows:
w01 = 0.1, w11 = -0.15, w21 = 0.15, w02 = 0.3, w12 = 0.2, w22 = -0.2,
w03 = 0.1, w13 = 0.25, w23 = 0.15
Then,
D1 = 0.1 - 0.15 x1 + 0.15 x2, D2 = 0.3 + 0.2 x1 -0.2 x2
D3 = 0.1 + 0.25 x1 + 0.15 x2
Also choose C = 0.2 and K = 0.5.
Step 2: Choose a training sample (for example, X = (2, 10), which is from class A).
Step 3: Compute D1, D2, D3:
D1 = 0.1 - 0.15 (2) + 0.15 (10) = 1.3, D2 = 0.3 + 0.2 (2) - 0.2 (10) = -1.3, D3 = 0.1 + 0.25 (2) + 0.15 (10) = 2.1
Step 4: The sample is from class A, so D1 should be the largest; since D3 is the largest
instead, we change the weights of D1 and D3 as follows:
w01 = w01 + CK = 0.1 + 0.2 (0.5) = 0.2 , w11 = w11 + C x1 = - 0.15 + 0.2 (2) = 0.25
w21 = w21 + C x2 = 0.15 + 0.2 (10) = 2.15
w03 = w03 – CK = 0.1 – 0.2 (0.5) = 0, w13 = w13 – C x1 = 0.25 – 0.2 (2) = - 0.15
w23= w23 – C x2 = 0.15 – 0.2 (10) = - 1.85
Minimum Squared Error (MSE) Classification
Ideally, we would like the discriminant function D(Xi) to take a desired value di
(for example, +1 for one class and -1 for the other) for every training sample Xi.
In general, this will not be possible, but by properly choosing the weights w0, w1,
w2, …, wM, the sum of the squared differences between the set of desired values
di and the actual values D(Xi) can be minimized. The sum of squared errors E is:
E = (D(X1) - d1)^2 + (D(X2) - d2)^2 + … + (D(Xi) - di)^2 + … + (D(XV) - dV)^2
The values of the weights that minimize E may be found by computing the partial
derivatives of E with respect to each of the wj, setting each derivative to zero, and
solving for the weights.
Since E is a quadratic function of the wj, the derivatives will be linear in the wj, so
the problem is reduced to solving a set of M+1 linear equations for the M+1
weights w0, w1, …, wM.
When there are N classes, this technique still uses a single set of weights rather than N or N(N-
1)/2 of them as in the two methods described previously for N classes.
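In practice, the M+1 normal equations are usually solved numerically rather than by hand; a minimal sketch (the function name is illustrative) is:

```python
import numpy as np

def mse_weights(samples, desired):
    """Minimize E = sum_i (D(Xi) - di)^2 for D(X) = w0 + w1*x1 + ... + wM*xM.

    Setting the partial derivatives of E to zero gives M+1 linear equations
    (the normal equations); a least-squares solver handles them directly.
    """
    samples = np.asarray(samples, dtype=float)
    desired = np.asarray(desired, dtype=float)
    A = np.hstack([np.ones((len(samples), 1)), samples])   # prepend the 1 for w0
    w, *_ = np.linalg.lstsq(A, desired, rcond=None)
    return w    # [w0, w1, ..., wM]
```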
Example: Design a minimum squared error (MSE) classification technique (find a minimum
squared error discriminant function) using the following training samples.
i   x1   x2   class    d
1    2   10     A      1
2    3    8     A      1
3    5    2     A      1
4   50   25     B     -1
5   65   30     B     -1
6   35   40     B     -1

Solution:
The discriminant function takes the form:
D(Xi) = w0 + w1 xi1 + w2 xi2,   i = 1, 2, …, 6
The sum of squared errors has the form:
E = (D(X1) - d1)^2 + (D(X2) - d2)^2 + (D(X3) - d3)^2 + (D(X4) - d4)^2 + (D(X5) - d5)^2 + (D(X6) - d6)^2
But D(X1) = w0 + 2 w1 + 10 w2, D(X2) = w0 + 3 w1 + 8 w2, D(X3) = w0 + 5 w1 + 2 w2,
D(X4) = w0 + 50 w1 + 25 w2, D(X5) = w0 + 65 w1 + 30 w2, D(X6) = w0 + 35 w1 + 40 w2,
and d1 = 1, d2 = 1, d3 = 1, d4 = -1, d5 = -1, d6 = -1.
Therefore, the sum of squared errors becomes:
E = (w0 + 2 w1 + 10 w2 - 1)^2 + (w0 + 3 w1 + 8 w2 - 1)^2 + (w0 + 5 w1 + 2 w2 - 1)^2 +
(w0 + 50 w1 + 25 w2 + 1)^2 + (w0 + 65 w1 + 30 w2 + 1)^2 + (w0 + 35 w1 + 40 w2 + 1)^2
then
dE/dw0 = . . . = 0, (1)
dE/dw1 = . . . = 0, (2)
dE/dw2 = . . . = 0, (3)
By solving these three equations, we get
w0 = 1.292, w1 = - 0.0218, and w2 = - 0.0371
Thus, the discriminant function is:
D(Xi) = 1.292 – 0.0218 xi1 – 0.0371 xi2
Example: Two samples from class A are located at (4, 4) and (5, 5). Two samples
from class B are located at (2, 0) and (2, 1). We want a linear discriminant function
to equal 1 for members of class A and -1 for members of class B. What set of three
weights minimizes the squared error between the desired and the actual values of
the discriminant function at the four samples? What is D(x, y)? How would a
sample with feature values (3, 2) be classified?
Solution:
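One way to obtain the weights is to solve the same least-squares problem numerically; the sketch below sets up the four samples and prints the weights and the value of D(3, 2), whose sign decides the classification.

```python
import numpy as np

# Class A samples (desired value +1) and class B samples (desired value -1).
X = np.array([[4, 4], [5, 5], [2, 0], [2, 1]], dtype=float)
d = np.array([1, 1, -1, -1], dtype=float)

A = np.hstack([np.ones((4, 1)), X])           # rows of the form [1, x, y]
w, *_ = np.linalg.lstsq(A, d, rcond=None)     # minimizes the squared error
print("w0, w1, w2 =", w)

# D(x, y) = w0 + w1*x + w2*y; (3, 2) is assigned to class A if D(3, 2) > 0,
# otherwise to class B.
print("D(3, 2) =", w[0] + w[1] * 3 + w[2] * 2)
```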
The Sequential MSE Classification Algorithm
The sequential MSE algorithm is an adaptive technique in which samples are presented one at a
time instead of all at once. That is, this procedure minimizes:
E = (D(X) - d)^2
There are two forms for the sequential MSE algorithm:
1- The sequential MSE algorithm with single output
2- The sequential MSE algorithm with multiple outputs
The sequential MSE algorithm uses the steepest descent minimization procedure for adapting
the weights for each sample. Therefore, we will first describe the steepest descent minimization
procedure.
The steepest descent minimization procedure
For a single training sample X with desired output d, the error to be minimized is:
E = ½ (D(X) - d)^2 = ½ (w0 + w1 x1 + w2 x2 - d)^2
The partial derivative of this error function with respect to each weight is:
dE/dw0 = (D(X) - d)
dE/dw1 = (D(X) - d) x1
dE/dw2 = (D(X) - d) x2
The weights are then adjusted using the following steps:
Step 1: Pick small random values for the weights and choose a positive constant C:
w0 = 0.2, w1 = 0.15, w2 = - 0.12, and C = 0.2
therefore D(X) = 0.2 + 0.15 x1 – 0.12 x2
Step 2: Select a training sample and present it to the classifier. Let us take sample 1,
X1 = [2, 10], and compute
D(X1) = 0.2 + 0.15 (2) – 0.12 (10) = - 0.7
Step 3: Adjust the weights, replacing wi by wi - C (D(X1) - d) xi, with d = 1 since X1 is from class A:
w0 = w0 – C (D(x1) – d) = 0.2 – 0.2 (- 0.7 – 1) = 0.2 + 0.34 = 0.54
w1 = w1 – C (D(X1) – d) x1 = 0.15 – 0.2 (- 0.7 – 1) (2) = 0.15 + 0.68 = 0.83
w2 = w2 – C (D(X1) – d) x2 = - 0.12 – 0.2 (- 0.7 – 1) (10) = 3.28
therefore D(X) = 0.54 + 0.83 x1 + 3.28 x2
Step 2: Select a training sample and present it to the classifier. Let us take sample 2,
X2 = [3, 8], and compute
D(X2) = 0.54 + 0.83 (3) + 3.28 (8) = 29.27
Step 3: Update the weights (again with d = 1):
w0 = w0 - C (D - d) = 0.54 - 0.2 (29.27 - 1) = -5.114
w1 = w1 - C (D - d) x1 = 0.83 - 0.2 (29.27 - 1) (3) = -16.132
w2 = w2 - C (D - d) x2 = 3.28 - 0.2 (29.27 - 1) (8) = -41.952
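A compact sketch of the single-output sequential MSE loop described above is given below. The learning rate, number of passes, and random seed are assumptions; the worked example uses C = 0.2, but in practice C should be small relative to the feature magnitudes (or the features should be scaled), otherwise the weights can swing widely from one update to the next, as the values above suggest.

```python
import numpy as np

def sequential_mse(samples, desired, C=0.01, passes=20):
    """Sequential (sample-by-sample) MSE training by steepest descent.

    For each sample X: D = w0 + w.x, then w0 <- w0 - C*(D - d) and
    wi <- wi - C*(D - d)*xi, exactly as in steps 2-3 above.
    """
    samples = np.asarray(samples, dtype=float)
    M = samples.shape[1]
    rng = np.random.default_rng(0)
    w0 = rng.uniform(-0.2, 0.2)              # small random initial weights
    w = rng.uniform(-0.2, 0.2, size=M)

    for _ in range(passes):
        for x, d in zip(samples, desired):   # present samples one at a time
            err = (w0 + w @ x) - d           # D(X) - d
            w0 -= C * err
            w -= C * err * x
    return w0, w
```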