
Course Title: Machine Learning

Machine learning is concerned with how to make computers recognize physical
objects and events: physical objects such as human faces, cars, or chairs, and
events such as human speech. That is, machine learning is a discipline concerned
with the classification of objects or events into categories.
Course objective:
The main objective of the course is to present the various machine
learning algorithms that can allow computers to recognize physical
objects and events.
Course contents:

1. Introduction to machine learning systems


1.a) General block diagram of a machine learning system.
1.b) Features, feature vectors, and classifiers

2. Statistical machine learning algorithms (Bayes classification techniques)
2. a) Bayes decision theory
2. b) Discriminant functions and decision surfaces
2. c) Bayesian classification for normal distributions
3. Nonparametric Classification Techniques
3. a) Nearest Neighbor Classification Techniques
3. b) Adaptive decision boundaries
3. c) Adaptive discriminant functions
3. d) Minimum squared error discriminant functions

4. Nonlinear Classification Techniques


4. a) The two-layer neural network
4. b) The three-layer neural network
4. c) The backpropagation algorithm
Examples of the problems to which machine learning
techniques are applied:
1. Identification of people from fingerprints, handshape, retinal scans,
voice characteristics, and handwriting.
2. Classification of seismic signals for oil and mineral exploration.
3. Classification of electrocardiograms (ECG) into diagnostic categories
of heart disease.
4. Detection of spikes in electroencephalograms (EEG), and other
medical waveform analysis.
5. Automated analysis of medical images obtained from microscopes
or scanners, magnetic resonance images, nuclear medical images, X-
rays, and photographs.
6. Human speech recognition by computers.
General block diagram for machine learning systems
Function of each block:

Sensor: Sensors are generally used to convert physical energy into electrical
energy. In other words, a sensor is a (ideally linear) device that generates an
electrical signal whose variations closely follow the variations or characteristics
of the physical object or event.
Signal conditioning stage:
This stage adjusts the signal level and removes noise or other undesired signal
components. Typical components of this block are amplifiers and filters.

Signal conversion stage:


This stage converts the analog signal into a digital signal; that is, it takes
samples from the analog signal and represents each sample by a proportional
binary number.
Feature Extraction stage:
This stage is for extracting some measurements or properties from the
signal. These measurements or properties are used to represent the
physical object or event and are usually called features.

Classification stage:
This stage uses the features to classify or assign the physical object or event to
one of several pre-specified categories or classes.
Definition of features:
Features are measurements or properties that can be extracted from
objects or events and can be used to classify these objects or events.

Definition of patterns:
A pattern is an entity that can be given a name, e.g. a fingerprint, a
handwritten word, a human face, or a spoken word.
Feature vectors and feature spaces:

Consider a machine learning system implemented to recognize two different spoken
words. Assume each word is represented by a feature vector of two features, and
that we have three examples of each of the two words, so each example is
represented by a two-component feature vector. Assume the feature vectors of these
examples are as follows:
To see how similar the feature vectors of each word are to one another, and how
different the feature vectors of the first word are from those of the second word,
we plot these feature vectors in a two-dimensional space as follows:
Feature Vector:
The combination of features which are used to represent a pattern is
called a feature vector. If the number of features is n, then the feature
vector will be an n-dimensional feature vector.
F = [f1  f2  f3  …  fn]^t

Feature space:
The n-dimensional space defined by the feature vectors is called the
feature space.
Scatter plot:
Objects are represented as points in the feature space. This
representation is called a scatter plot.

What makes a good feature vector?


The quality of a feature vector is related to its ability to discriminate
examples from different classes.
- Examples from the same class should have similar feature values.
- Examples from different classes should have different feature values.
Image Acquisition (Operation of Digital Camera):
A simple machine learning system
Consider the problem of recognizing the characters L, P, O, E, Q which
are available in the optical form (images) as follows:

L P O E Q
The acquired image for each of these characters is a two-dimensional array of
values, where each value is called a pixel (picture element).
For example, the image of the character L will be a 2D array as follows:
By inspecting the images (the 2D arrays) of these 5 characters we can
realize that these images can be discriminated (recognized) by the
following four features:
- The number of vertical straight lines (V)
- The number of horizontal straight lines (H)
- The number of oblique lines (O)
- The number of curved lines (C )
The following table shows the values of the four features for each
character:

Character V H O C
L 1 1 0 0
P 1 0 0 1
O 0 0 0 1
E 1 3 0 0
Q 0 0 1 1
To recognize these 5 characters using the above mentioned four
features, we use the following tree-structured classifier:
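As an illustration only (the exact branching order below is an assumption, not taken from the lecture's figure), a minimal Python sketch of such a tree-structured classifier over the four features V, H, O, and C could look like this:

    def classify_character(v, h, o, c):
        """Classify a character from its stroke counts:
        v = vertical lines, h = horizontal lines, o = oblique lines, c = curved lines."""
        if c == 0:                      # no curved lines: L or E
            return "E" if h == 3 else "L"
        if v == 1:                      # one vertical line plus a curve: P
            return "P"
        return "Q" if o == 1 else "O"   # otherwise O or Q, split by the oblique line

    # Feature values (V, H, O, C) from the table above.
    samples = {"L": (1, 1, 0, 0), "P": (1, 0, 0, 1), "O": (0, 0, 0, 1),
               "E": (1, 3, 0, 0), "Q": (0, 0, 1, 1)}
    for label, feats in samples.items():
        print(label, "->", classify_character(*feats))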
A Realistic Pattern Recognition System
Consider the following scenario:
A fish processing plant wants to automate the process of sorting
incoming fish product according to species (salmon or tuna).
The automation system consists of the following (as shown in the figure
of the next slide):
- a conveyor belt for incoming products
- two conveyor belts for sorted products
- a pick-and-place robotic arm
- a vision system with an overhead camera
- a computer to analyze images and control the robot arm
The machine learning system for automating the fish sorting process
consists of the following:

1- Vision system
The vision system consists of an IR sensor and overhead camera. The IR
sensor is used to indicate to the computer that there is a fish in the
sorting area. The camera captures an image as a new fish enters the
sorting area.
2- Preprocessing stage
This stage consists of image processing algorithms for performing the
following:
- Adjustments for average intensity levels
- Segmentation to separate fish from background
3- Feature Extraction stage
In this stage, we compute the features that can be used to discriminate
these two species.
- Suppose we know that, on average, tuna are larger than salmon; then we can use
the length of the fish to discriminate tuna from salmon.
- From the segmented image we estimate the length of the fish.
4- Classification stage
- Collect a set of examples from both species
- Compute the distribution of lengths for both classes (as shown in the
figure of the next slide).
- Determine a decision boundary (threshold) that minimizes the
classification error.
Histogram showing the distribution of length for the two species.
- We estimate the classifier's probability of error and obtain a discouraging
result of 40%.
- To improve the performance of our machine learning system, we use
an additional feature which is the brightness of the fish.
- We use the two features, length and brightness, and draw the scatter
plot as shown in the following figure of the next slide.
- We compute a linear decision boundary to separate the two classes and obtain a
classification rate of 95.7%.
Scatter plot showing how the patterns (samples) of the two species are scattered
in the feature space using two features: length and brightness.
Statistical Pattern classification

Bayesian Decision Theory


Bayesian decision theory is a fundamental statistical approach to
the problem of pattern classification. Bayesian decision theory refers to
choosing the most likely class, given the value of the feature. To
formulate this Bayesian decision theory, we have to define the following
probabilities:
Prior probability (P(Ci)): It is the probability that a random sample
(pattern) is a member of class Ci . It is called prior probability because it
gives the probability of the class before we know the value of the
feature.
Class-conditional probability (P(x/Ci)): It is the probability of obtaining
feature value x given that the pattern is from class Ci. In other words, it
is the probability that a sample (a pattern) from class Ci will have the
feature value x.
Posterior Probability (P(Ci/x)): It is the probability that a sample belongs
to class Ci, given that it has a feature value x. It is called posterior
probability because it gives the probability of the class after observing
the value of the feature.
The above three probabilities are related by the following formula:
P(Ci/x) = P(x/Ci) P(Ci)/P(x)
This is known as Bayes formula (Bayes theorem).
Knowing the above mentioned probabilities, we can state Bayes decision
theory as follows:
Assign the unknown pattern to the class that has the highest posterior
probability. In other words, we compute the posterior probability of
each class and assign the unknown pattern to the class that has the
highest posterior probability.
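As a minimal sketch of this rule in Python, assume for illustration that each class-conditional density P(x/Ci) is a one-dimensional Gaussian with known mean and variance (the class descriptions below are made up):

    import math

    def gaussian_pdf(x, mean, var):
        # Class-conditional density P(x/Ci) modeled as a one-dimensional Gaussian.
        return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

    def bayes_classify(x, classes):
        # classes: {name: (prior, mean, variance)}. Since P(x) is the same for every
        # class, comparing P(x/Ci) P(Ci) is enough to find the largest posterior.
        scores = {name: gaussian_pdf(x, m, v) * prior
                  for name, (prior, m, v) in classes.items()}
        return max(scores, key=scores.get), scores

    # Hypothetical two-class problem with equal priors.
    classes = {"C1": (0.5, 0.0, 1.0), "C2": (0.5, 2.0, 1.0)}
    label, scores = bayes_classify(1.2, classes)
    print(label, scores)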
Example: The following data are taken from three normally distributed classes.
Each pattern is represented by a single feature x:

Sample #    C1 (x)    C2 (x)    C3 (x)
1           1.58      0.21      -1.54
2           0.67      0.37      5.41
3           1.04      0.18      1.55
4           -1.49     -0.24     1.86
5           -0.41     -1.18     1.68
6           1.39      0.74      3.51
7           1.20      -0.38     1.4
8           -0.92     0.02      0.44
9           0.45      0.44      0.25
10          -0.76     0.46      -0.66

If the three classes have equal prior probabilities, use the Bayes classification
algorithm to classify an unknown pattern with feature value x = 1.8.

Solution:

Use the data given above to compute the mean and variance of each class as follows:
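In code form, a sketch of this computation (assuming a one-dimensional Gaussian model for each class, the maximum-likelihood variance estimate, and equal priors, so that only the likelihoods need to be compared) might look like this:

    import math

    data = {
        "C1": [1.58, 0.67, 1.04, -1.49, -0.41, 1.39, 1.20, -0.92, 0.45, -0.76],
        "C2": [0.21, 0.37, 0.18, -0.24, -1.18, 0.74, -0.38, 0.02, 0.44, 0.46],
        "C3": [-1.54, 5.41, 1.55, 1.86, 1.68, 3.51, 1.40, 0.44, 0.25, -0.66],
    }

    def mean_var(samples):
        m = sum(samples) / len(samples)
        v = sum((s - m) ** 2 for s in samples) / len(samples)   # ML variance estimate
        return m, v

    def likelihood(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    x = 1.8
    scores = {c: likelihood(x, *mean_var(s)) for c, s in data.items()}
    print(scores)
    print("Assign x = 1.8 to", max(scores, key=scores.get))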
Example: Consider the following data which is
taken from three normally distributed classes. The
patterns of each class are represented by three
features as follows:
Sample # C1 C2 C3
x1    x2    x3        x1    x2    x3        x1    x2    x3
1 1.58 2.32 -5.80 0.21 0.03 -2.21 -1.54 1.17 0.64
2 0.67 1.58 -4.78 0.37 0.28 -1.8 5.41 3.45 -1.33
3 1.04 1.01 -3.63 0.18 1.22 0.16 1.55 0.99 2.69
4 -1.49 2.18 -3.39 -0.24 0.93 -1.01 1.86 3.19 1.51
5 -0.41 1.21 -4.73 -1.18 0.39 -0.39 1.68 1.79 -0.87
6 1.39 3.16 2.87 0.74 0.96 -1.16 3.51 -0.22 -1.39
7 1.20 1.40 -1.89 -0.38 1.94 -0.48 1.4 -0.44 0.92
8 -0.92 1.44 -3.22 0.02 0.72 -0.17 0.44 0.83 1.97
9 0.45 1.33 -4.38 0.44 1.31 -0.14 0.25 0.68 -0.99
10 -0.76 0.84 -1.96 0.46 1.49 0.68 -0.66 -0.45 0.08
(a) Use this data to design a Bayes classification technique to classify unknown
samples (patterns) from these three classes.
(b) Use the Bayes classification technique of part (a) to classify an unknown
pattern with feature vector
X = [1.8  -0.56  1.51]^t
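A sketch of how parts (a) and (b) might be carried out in Python, assuming each class is modeled as a three-dimensional Gaussian whose mean vector and covariance matrix are estimated from its ten training samples, with equal priors so that only the (log-)likelihoods need to be compared:

    import numpy as np

    # The ten training samples (x1, x2, x3) of each class, copied from the table.
    C1 = np.array([[1.58, 2.32, -5.80], [0.67, 1.58, -4.78], [1.04, 1.01, -3.63],
                   [-1.49, 2.18, -3.39], [-0.41, 1.21, -4.73], [1.39, 3.16, 2.87],
                   [1.20, 1.40, -1.89], [-0.92, 1.44, -3.22], [0.45, 1.33, -4.38],
                   [-0.76, 0.84, -1.96]])
    C2 = np.array([[0.21, 0.03, -2.21], [0.37, 0.28, -1.80], [0.18, 1.22, 0.16],
                   [-0.24, 0.93, -1.01], [-1.18, 0.39, -0.39], [0.74, 0.96, -1.16],
                   [-0.38, 1.94, -0.48], [0.02, 0.72, -0.17], [0.44, 1.31, -0.14],
                   [0.46, 1.49, 0.68]])
    C3 = np.array([[-1.54, 1.17, 0.64], [5.41, 3.45, -1.33], [1.55, 0.99, 2.69],
                   [1.86, 3.19, 1.51], [1.68, 1.79, -0.87], [3.51, -0.22, -1.39],
                   [1.40, -0.44, 0.92], [0.44, 0.83, 1.97], [0.25, 0.68, -0.99],
                   [-0.66, -0.45, 0.08]])

    def gaussian_log_likelihood(x, samples):
        # Estimate the mean vector and covariance matrix of the class, then
        # evaluate the log of the multivariate normal density at x.
        mu = samples.mean(axis=0)
        cov = np.cov(samples, rowvar=False)
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.solve(cov, diff))

    x = np.array([1.8, -0.56, 1.51])
    scores = {name: gaussian_log_likelihood(x, s)
              for name, s in [("C1", C1), ("C2", C2), ("C3", C3)]}
    print(scores)
    print("Assign X to", max(scores, key=scores.get))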
Nonparametric Classification Techniques
(Nonparametric decision making techniques)
• Statistical decision making (Bayes classification algorithm) assumes that the type
of the density function is known for each class. Only the parameters of the
densities, such as their means and variances, have to be estimated from the
training data before using them to estimate the posterior probabilities of each
class to make a classification decision. This type of classification technique is
referred to as a parametric classifier (parametric decision making).
• In most real problems, the types of density functions are unknown. In this case,
nonparametric classification techniques are needed. These nonparametric
techniques include several nearest neighbor techniques and some methods for
obtaining discriminant functions directly from the data.
Nearest Neighbor classification Techniques
1- The Single Nearest Neighbor Technique
The single nearest neighbor technique bypasses the problem of probability
densities completely and simply classifies an unknown sample as belonging
to the same class as the most similar sample point in the training set of data.
Most similar or "nearest" can be taken to mean the smallest Euclidean
distance, which is the usual distance between two points A = (a1, a2, …, an)
and B = (b1, b2, …, bn), defined by:
D(A, B) = sqrt[ (b1 - a1)^2 + (b2 - a2)^2 + … + (bn - an)^2 ]
Single nearest neighbor technique (continue)
• Although Euclidean distance is probably the most commonly used distance function or
measure of dissimilarity between feature vectors, it is not always the best metric. The
fact that the distances in each dimension are squared before summation places great
emphasis on those features for which the dissimilarity is large.
• A better metric is to use the sum of the absolute differences in each feature as the
overall measure of dissimilarity. This would also save computational time. This distance
metric would then be:
D(A, B) = |b1-a1| + |b2 – a2| + … + |bn – an|
This sum of the absolute distances in each dimension is sometimes called the city block
distance or the Manhattan metric.
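A small Python sketch of the two dissimilarity measures and the single-nearest-neighbor rule built on them (the training data here are placeholders):

    import math

    def euclidean(a, b):
        return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

    def city_block(a, b):
        # Manhattan / city block distance: sum of the absolute differences.
        return sum(abs(bi - ai) for ai, bi in zip(a, b))

    def nearest_neighbor(unknown, training, distance=euclidean):
        # training: list of (feature_vector, class_label) pairs.
        _, label = min(training, key=lambda pair: distance(unknown, pair[0]))
        return label

    training = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"),
                ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]
    print(nearest_neighbor((1.2, 2.1), training))                       # Euclidean
    print(nearest_neighbor((1.2, 2.1), training, distance=city_block))  # city block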
2- The K-Nearest Neighbor (KNN) classification Technique

The KNN technique bases the classification of an unknown sample on the "votes" of K
of its neighbors rather than on only its single nearest neighbor. In this technique,
the estimated class of an unknown sample is chosen to be the class that is most
commonly represented among its K nearest neighbors. The following example clarifies
the technique.
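A minimal sketch of the KNN rule in Python (illustrative only; the data and the value of K are made up, and ties are broken by whichever class Counter lists first):

    import math
    from collections import Counter

    def knn_classify(unknown, training, k=3):
        # training: list of (feature_vector, class_label) pairs. Sort the training
        # samples by Euclidean distance to the unknown sample and take a majority
        # vote among the k closest ones.
        neighbors = sorted(training, key=lambda pair: math.dist(unknown, pair[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    training = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((1.2, 2.4), "A"),
                ((5.0, 8.0), "B"), ((6.0, 9.0), "B"), ((5.5, 8.5), "B")]
    print(knn_classify((2.0, 3.0), training, k=3))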
Scale factors
• If one of the features has a very wide range of possible values compared to the
other features, it will have a very large effect on the total dissimilarity, and the
decisions will be based primarily upon this single feature.
• To overcome this, it is necessary to apply scale factors to the features before
computing the distances. If we want the potential influence of each of the
features to be about equal, the features should be scaled so that each of them
has the same range. It is often better to normalize each feature xi to have a mean
of 0 and a standard deviation of 1. That is, the normalization replaces each
feature xi by zi = (xi - μi)/σi, where μi and σi are the mean and standard
deviation of feature i, before computing the distances.
Example:

Consider the training data shown below. In this data the two features
are equally important. Use the single nearest neighbor with Euclidean
distance to classify the unknown sample at x=0 and y = - 120. Scale the
data before applying the distance measurements.

class A A A A B B B B
x 2 3 2 -1 -2 3 -1 -3
y 300 100 -100 300 -300 -200 200 -100
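A sketch of how this example might be worked in Python: each feature is replaced by its z-score (here using the population standard deviation; the sample standard deviation changes the numbers slightly but not the idea), and the unknown sample is scaled with the same means and standard deviations before the nearest neighbor is found. With these choices the nearest scaled neighbor turns out to be the class A sample at (2, -100):

    import math

    xs = [2, 3, 2, -1, -2, 3, -1, -3]
    ys = [300, 100, -100, 300, -300, -200, 200, -100]
    labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
    unknown = (0, -120)

    def stats(values):
        m = sum(values) / len(values)
        sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))  # population std
        return m, sd

    (mx, sx), (my, sy) = stats(xs), stats(ys)

    def scale(point):
        # Replace each feature by its z-score using the training-set statistics.
        return ((point[0] - mx) / sx, (point[1] - my) / sy)

    scaled_training = [scale(p) for p in zip(xs, ys)]
    scaled_unknown = scale(unknown)

    best = min(range(len(labels)),
               key=lambda i: math.dist(scaled_unknown, scaled_training[i]))
    print("nearest training sample:", (xs[best], ys[best]), "-> class", labels[best])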
Adaptive Decision Boundaries
The Bayesian decision procedures are optimal decision procedures if the
conditional densities of the classes are known. If the densities are not known,
nearest neighbor techniques can be used. However, experimentation may be
required to choose K and the set of reference samples. Classification may be time
consuming if the number of reference samples is large.
An alternative approach is to assume that the functional form of the decision
boundary between each pair of classes is linear and we have to find that linear
decision boundary. The following figure explains this approach.
Adaptive decision boundaries (continue)
In this technique, during the training phase, samples are presented to the current form of the
classifier. Whenever a sample is correctly classified, no change is made in the weights, but
when a sample is incorrectly classified, each weight is changed in whichever direction will tend
to correct the output, D.
For example, if D were negative for a particular sample when it should have been positive, the
value of D for that sample should be increased by changing the weights. If feature xi has a
positive value for that sample, then wi should be increased to increase D. However, if xi were
negative for that sample, then wi should be decreased in order to increase D.
The exact technique consists of the following steps:
Adaptive decision boundaries (continue)

1. Initialize the weights w0, w1, …., wM to small random values. Choose
positive constants C and K.
2. Choose a sample x = (x1, x2, …, xM) from the training set.
3. Compute D for the chosen sample: D = w0 + w1 x1 + w2 x2+ … + wM xM.
4. If the sample x is from class A for which D should be positive, then if D is
not positive, replace wi by wi + Cxi, for i = 1, 2, …, M, where C is a
positive constant that controls the step size for weight adjustment. Also
replace w0 by w0 + CK, where K is a positive constant. If D is positive,
then no change should be made in the weights.
Adaptive decision boundaries (continue)
If the sample x is from class B for which D should be negative, then if D is not
negative, replace wi by wi – Cxi, for i= 1, 2, …, M, also replace w0 by w0 – CK.
5. Repeat steps 2 to 4 with each of the samples in the training set. When finished,
run through the entire training data set again. Stop when all the samples are
correctly classified during one complete pass of the entire training set.
An additional stopping rule is also needed since this process would never terminate
if the two classes were not linearly separable. A fixed maximum number of
iterations could be tried, or the algorithm could be terminated when the error rate
ceases to decrease significantly.
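A sketch of these steps in Python (a perceptron-style loop; the pass limit plays the role of the extra stopping rule just mentioned, and the demonstration runs a single pass with the training data and starting weights of the worked example that follows):

    def train_adaptive_boundary(samples, C=0.2, K=0.5, w=None, max_passes=1000):
        """samples: list of (feature_list, label) pairs, where label 'A' means D
        should be positive and 'B' means D should be negative.
        Returns the weights [w0, w1, ..., wM]."""
        M = len(samples[0][0])
        w = list(w) if w is not None else [0.0] * (M + 1)   # w[0] is the bias weight
        for _ in range(max_passes):
            all_correct = True
            for x, label in samples:
                D = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
                if label == "A" and D <= 0:      # D should be positive but is not
                    w[0] += C * K
                    for i, xi in enumerate(x, start=1):
                        w[i] += C * xi
                    all_correct = False
                elif label == "B" and D >= 0:    # D should be negative but is not
                    w[0] -= C * K
                    for i, xi in enumerate(x, start=1):
                        w[i] -= C * xi
                    all_correct = False
            if all_correct:       # one complete error-free pass: stop
                break
        return w

    samples = [([2, 10], "A"), ([3, 8], "A"), ([5, 2], "A"),
               ([50, 25], "B"), ([65, 30], "B"), ([35, 40], "B")]
    w = train_adaptive_boundary(samples, w=[0.1, 0.2, -0.15], max_passes=1)
    print(w)   # after one pass this should match D = 0.1 - 9.4 x1 - 3.15 x2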
Example:
Use the adaptive decision boundary algorithm to find a decision
boundary that classifies patterns from two classes. Each pattern is
represented by two features. The training samples are:
x1    x2    class
2     10    A
3     8     A
5     2     A
50    25    B
65    30    B
35    40    B

Solution:
- The form of the decision boundary is as follows:
D = w0 + w1 x1 + w2 x2
- Assume we design D to be positive for class A and negative for class B.
Example (continue):
-The following are the steps for finding the weights:
Step1: Initialize the weights with small random values:
w0 = 0.1, w1= 0.2, w2 = - 0.15, then D= 0.1 + 0.2 x1 – 0.15 x2,
Choose C= 0.2 and K = 0.5
Step2 : Choose the first sample of the training data x=(2, 10).
Step3: Compute D= 0.1 + 0.2 (2) – 0.15 (10) = -1
Step4: D is negative while it should be positive for this sample, then we change
weights as follows:
w1 = w1+Cx1 = 0.2 + 0.2(2) = 0.6
w2 = w2 + Cx2 = -0.15 + 0.2 (10) = 1.85
w0 = w0 + CK = 0.1 + 0.2 (0.5) = 0.2 therefore D = 0.2 + 0.6 x1 + 1.85 x2
Example (continue):
Step 2: choose the training sample x= (3 8)
Step3: compute D, D= 0.2 + 0.6 x1 + 1.85 x2 = 0.2 + 0.6(3) + 1.85 (8)
D = 16.8
Step4: D is positive for this sample, therefore, no change in weights should be made.

Step 2: choose the training sample x = (5 2)


step 3: compute D, D = 0.2 + 0.6 (5) + 1.85 (2) = 6.9
Step 4: D is positive, no change in weights.

Step 2: choose the sample x = (50 25)


Step 3: Compute D, D = 0.2 + 0.6 (50) + 1.85 (25) = 76.45
Example (continue):
Step 4: D is positive while it should be negative for this sample, therefore weights are to be
changed as follows:
w1 = w1 – Cx1= 0.6 - 0.2 (50) = - 9.4
w2 = w2 – Cx2 = 1.85 – 0.2 (25) = - 3.15
w0= w0 – CK = 0.2 – 0.2(0.5) = 0.1
then the current form of the decision boundary becomes:
D = 0.1 – 9.4 x1 - 3.15 x2
Repeat the above three steps for the remaining training samples.
Repeat the algorithm again over the whole training data set.
Stop when all the samples are correctly classified or another criterion is met.
Adaptive decision boundaries (continue)

Nonlinear decision boundaries, which are more complex than linear ones, can also
be found by this adaptive technique. For example, to create decision regions with
general quadratic boundaries, we can convert the two-dimensional feature vector
(x, y) into the five-dimensional feature vector (u1, u2, u3, u4, u5), where u1 = x,
u2 = y, u3 = x^2, u4 = xy, u5 = y^2. That is, the discriminant function becomes:
D = w0 + w1 u1 + w2 u2 + w3 u3 + w4 u4 + w5 u5
  = w0 + w1 x + w2 y + w3 x^2 + w4 xy + w5 y^2
Adaptive decision boundaries (continue)
If there are more than two classes, the same algorithm can be used to
find the boundary between each pair of classes. If there are K classes,
there will be K(K-1)/2 decision boundaries between pairs of classes.
Adaptive Discriminant Functions
Another approach to classification when there are more than two classes is
to derive a separate linear discriminant function for each class and to choose
the class whose discriminant function is largest.
If there are N classes and M features, then the set of linear discriminant
functions is:
D1 = w01 + w11 x1 + w21 x2 + … + wM1 xM
D2 = w02 + w12 x1 + w22 x2 + … + wM2 xM
.
.
DN = w0N + w1N x1 + w2N x2 + … + wMN xM
Adaptive Discriminant Functions (continue)
The technique for adapting the weights in this discriminant-function classification
algorithm is as follows:
Whenever a sample X is classified as class Cj when it should have been classified as
class Ci, the new weights for the two corresponding discriminant functions (Di and
Dj) are adjusted as follows:
wmi = wmi + C xm,
wmj = wmj – C xm
where m = 1, 2, ……., M and
w0i = w0i + CK, w0j = w0j – CK
No change is made in the discriminant functions for classes other than Ci and Cj.
Adaptive Discriminant Functions (continue)
This procedure has the effect of increasing Di for the class that should
have been chosen and decreasing Dj for the class that was chosen but
should not have been.
There is no reason to change the weights in the other discriminant
functions.
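A sketch of this multi-class update rule in Python (the training loop and the pass limit mirror the adaptive decision boundary algorithm above; for brevity the weights start at zero here rather than at small random values):

    def train_adaptive_discriminants(samples, classes, C=0.2, K=0.5, max_passes=1000):
        """samples: list of (feature_list, class_name).
        Returns {class_name: [w0, w1, ..., wM]}, one discriminant function per class."""
        M = len(samples[0][0])
        w = {c: [0.0] * (M + 1) for c in classes}
        for _ in range(max_passes):
            mistakes = 0
            for x, true_class in samples:
                # Evaluate every discriminant and choose the class with the largest one.
                D = {c: w[c][0] + sum(wi * xi for wi, xi in zip(w[c][1:], x))
                     for c in classes}
                chosen = max(D, key=D.get)
                if chosen != true_class:
                    mistakes += 1
                    # Increase the discriminant that should have won and decrease
                    # the one wrongly chosen; all others are left unchanged.
                    w[true_class][0] += C * K
                    w[chosen][0] -= C * K
                    for i, xi in enumerate(x, start=1):
                        w[true_class][i] += C * xi
                        w[chosen][i] -= C * xi
            if mistakes == 0:
                break
        return w

    samples = [([2, 10], "A"), ([3, 8], "A"), ([5, 2], "A"),
               ([50, 25], "B"), ([65, 30], "B"), ([35, 40], "B"),
               ([20, 15], "C"), ([25, 18], "C"), ([15, 12], "C")]
    print(train_adaptive_discriminants(samples, ["A", "B", "C"]))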
Example:
Design an adaptive discriminant functions algorithm to classify patterns
from three classes. Each pattern is represented by two features. The
training samples are as given in the following table:

x1    x2    class
2     10    A
3     8     A
5     2     A
50    25    B
65    30    B
35    40    B
20    15    C
25    18    C
15    12    C

Solution:
Since we have three classes, three discriminant functions are required:
D1 = w01 + w11 x1 + w21 x2
D2 = w02 + w12 x1 + w22 x2
D3 = w03 + w13 x1 + w23 x2
Example (continue):
Step 1: Initialize the weights with small random values as follows:
w01 = 0.1, w11 = -0.15, w21 = 0.15, w02 = 0.3, w12 = 0.2, w22 = -0.2,
w03 = 0.1, w13 = 0.25, w23 = 0.15
Then,
D1 = 0.1 - 0.15 x1 + 0.15 x2, D2 = 0.3 + 0.2 x1 -0.2 x2
D3 = 0.1 + 0.25 x1 + 0.15 x2
also choose C=0.2, K=0.5.
Step2: choose a training sample (for example X=(2, 10) which is from class A).
Step3: Compute D1, D2, D3: D1 = 0.1 – 0.15 (2) + 0.15 (10) = 1.3
D2 = 0.3 + 0.2 (2) – 0.2 (10) = -1.3, D3= 0.1 + 0.25 (2) + 0.15 (10)=2.1
Step4: Since the sample is from class A and D3 is the largest, then we change the weights of D1
and D3 as follows:
w01 = w01 + CK = 0.1 + 0.2 (0.5) = 0.2 , w11 = w11 + C x1 = - 0.15 + 0.2 (2) = 0.25
w21 = w21 + C x2 = 0.15 + 0.2 (10) = 2.15

w03 = w03 – CK = 0.1 – 0.2 (0.5) = 0, w13 = w13 – C x1 = 0.25 – 0.2 (2) = - 0.15
w23= w23 – C x2 = 0.15 – 0.2 (10) = - 1.85

The new set of discriminant functions is:
D1 = 0.2 + 0.25 x1 + 2.15 x2
D2 = 0.3 + 0.2 x1 - 0.2 x2
D3 = 0 - 0.15 x1 - 1.85 x2
Repeat from step 2.


Minimum Squared Error (MSE) Classification Technique
In the adaptive decision boundary and adaptive discriminant function techniques, finding the
decision boundary may require many iterations. In addition, the adaptive algorithms terminate
when they find the first set of weights that correctly classifies the training data, which might not
correspond to what constitutes a good decision boundary.
The minimum squared error (MSE) classification procedure does not require iterations, and it may
produce decision boundaries that are more appealing than those produced by the adaptive decision
boundary techniques.
The minimum squared error algorithm uses a single discriminant function, regardless of the
number of classes.
If there are V samples and M features, then there will be V feature vectors:
Xi = (xi1, xi2, …., xiM), i = 1, 2, 3, …, V
Let the true class of Xi be represented by di, which can have any numerical value. We want to
find a set of weights wj, j= 0, 1, 2, …, M for a single linear discriminant function:
D(Xi) = w0 + w1 xi1 + w2 xi2 + … + wM xiM
such that D(Xi) = di for all the training samples.

In general, this will not be possible, but by properly choosing the weights w0, w1,
w2, …, wM, the sum of the squared differences between the set of desired values
di and the actual values D(Xi) can be minimized. The sum of squared errors E is:
E = (D(X1) - d1)^2 + (D(X2) - d2)^2 + … + (D(Xi) - di)^2 + … + (D(XV) - dV)^2
The values of the weights that minimize E may be found by computing the partial
derivatives of E with respect to each of the wj, setting each derivative to zero, and
solving for the weights.
Since E is a quadratic function of the wj, the derivatives will be linear in the wj, so
the problem is reduced to solving a set of M+1 linear equations for the M+1
weights w0, w1, …, wM.
When there are N classes, this technique still uses a single set of weights rather than N or N(N-
1)/2 of them as in the two methods described previously for N classes.
Example: Design a minimum squared error (MSE) classification technique (find a minimum
squared error discriminant function) using the following training samples:

i    x1    x2    class    d
1    2     10    A        1
2    3     8     A        1
3    5     2     A        1
4    50    25    B        -1
5    65    30    B        -1
6    35    40    B        -1

Solution:
The discriminant function takes the form:
D(Xi) = w0 + w1 xi1 + w2 xi2,   i = 1, 2, …, 6
The sum of squared errors has the form:
E = (D(X1) - d1)^2 + (D(X2) - d2)^2 + (D(X3) - d3)^2 + (D(X4) - d4)^2 + (D(X5) - d5)^2 + (D(X6) - d6)^2
But D(X1) = w0 + 2 w1 + 10 w2, D(X2) = w0 + 3 w1 + 8 w2, D(X3) = w0 + 5 w1 + 2 w2,
D(X4) = w0 + 50 w1 + 25 w2, D(X5) = w0 + 65 w1 + 30 w2, D(X6) = w0 + 35 w1 + 40 w2,
and d1 = 1, d2 = 1, d3 = 1, d4 = -1, d5 = -1, d6 = -1
Therefore the sum of squared errors becomes:
E = (w0 + 2 w1 + 10 w2 - 1)^2 + (w0 + 3 w1 + 8 w2 - 1)^2 + (w0 + 5 w1 + 2 w2 - 1)^2 +
    (w0 + 50 w1 + 25 w2 + 1)^2 + (w0 + 65 w1 + 30 w2 + 1)^2 + (w0 + 35 w1 + 40 w2 + 1)^2
then
dE/dw0 = . . . = 0, (1)
dE/dw1 = . . . = 0, (2)
dE/dw2 = . . . = 0, (3)
By solving these three equations, we get
w0 = 1.292, w1 = - 0.0218, and w2 = - 0.0371
Thus, the discriminant function is:
D(Xi) = 1.292 – 0.0218 xi1 – 0.0371 xi2
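The same weights can be obtained numerically. A sketch in Python, posing the problem as a least-squares fit (np.linalg.lstsq) on a design matrix whose first column is all ones for the bias weight; the result should come out close to the values quoted above:

    import numpy as np

    # Training samples (x1, x2) and desired outputs d from the example above.
    X = np.array([[2, 10], [3, 8], [5, 2], [50, 25], [65, 30], [35, 40]], dtype=float)
    d = np.array([1, 1, 1, -1, -1, -1], dtype=float)

    A = np.hstack([np.ones((len(X), 1)), X])       # leading column of ones for w0
    w, *_ = np.linalg.lstsq(A, d, rcond=None)      # minimizes sum (D(Xi) - di)^2
    print("w0, w1, w2 =", w)                       # close to 1.292, -0.0218, -0.0371

    def D(x1, x2):
        return w[0] + w[1] * x1 + w[2] * x2

    # D should be positive for the class A samples and negative for the class B samples.
    print([round(D(x1, x2), 3) for x1, x2 in X])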
Example: Two samples from class A are located at (4, 4) and (5, 5). Two samples
from class B are located at (2, 0) and (2, 1). We want a linear discriminant function
to equal 1 for members of class A and -1 for members of class B. What set of three
weights minimizes the squared error between the desired and the actual values of
the discriminant function at the four samples? What is D(x, y)? How would a
sample with feature values (3, 2) be classified.

Solution:
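A sketch of how this example might be solved numerically, using the same least-squares setup as the previous example (desired value +1 for class A and -1 for class B); with these numbers D(3, 2) comes out negative, so the sample at (3, 2) would be assigned to class B:

    import numpy as np

    # Class A samples at (4, 4) and (5, 5); class B samples at (2, 0) and (2, 1).
    X = np.array([[4, 4], [5, 5], [2, 0], [2, 1]], dtype=float)
    d = np.array([1, 1, -1, -1], dtype=float)

    A = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(A, d, rcond=None)
    print("w0, w1, w2 =", w)       # roughly -1.56, 0.22, 0.33

    def D(x, y):
        return w[0] + w[1] * x + w[2] * y

    print("D(3, 2) =", D(3, 2))    # negative, so (3, 2) is classified as class B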
The Sequential MSE Classification Algorithm
The sequential MSE algorithm is an adaptive technique in which samples are presented one at a
time instead of all at once. That is, this procedure minimizes:
E = (D(X) - d)^2
There are two forms for the sequential MSE algorithm:
1- The sequential MSE algorithm with single output
2- The sequential MSE algorithm with multiple outputs

The sequential MSE algorithm uses the steepest descent minimization procedure for adapting
the weights for each sample. Therefore, we will first describe the steepest descent minimization
procedure.
The steepest descent minimization procedure

To find a local minimum of a function F(w1, w2, …, wM) of M variables, pick a
starting guess w1, w2, …, wM and move a short distance in the direction of
steepest decrease of the function. From this new point, re-compute the direction
of steepest decrease and move a short distance in this new direction. Continue
until you arrive at a local minimum.
The steepest descent algorithm can be stated as follows:
1. Pick a starting guess w1, w2, …., wM and choose a positive constant
C.
2. Compute the partial derivatives dF/dwi, for i = 1, 2, …, M. Replace wi
by wi - C dF/dwi for i = 1, 2, …, M.
3. Repeat step 2 until w1, w2, …., wM cease to change significantly.
Example: Use the steepest descent minimization procedure to find w1
and w2 at which the following function is minimum:
F(w1, w2) = w1^4 + w2^4 + 3 w1^2 w2^2 - 2 w1 - 3 w2 + 4
Solution:
The partial derivatives of the given function are:
dF(w1, w2)/dw1 = 4 w1^3 + 6 w1 w2^2 - 2
dF(w1, w2)/dw2 = 4 w2^3 + 6 w1^2 w2 - 3
wi,new = wi,old - C dF/dwi
If we choose C = 0.1, we obtain the following sequence of iterations:
Iteration    w1       w2       dF/dw1     dF/dw2     F(w1, w2)
1            0.0      0.0      -2.0       -3.0       4.0
2            0.2      0.3      -1.86      -2.82      2.72
3            0.386    0.582    -0.9855    -1.691     1.77
4            0.485    0.751    0.0952     -0.2469    1.548
5            0.475    0.776    0.1441     -0.082     1.543
…
30           0.43     0.807    0.0        0.0        1.539

After 30 iterations, w1 and w2 cease to change significantly; therefore, a local
minimum has been reached.
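A sketch that reproduces this iteration in Python (same function, same starting point (0, 0), and the same constant C = 0.1); the first few printed rows should match the table above:

    def F(w1, w2):
        return w1**4 + w2**4 + 3 * w1**2 * w2**2 - 2 * w1 - 3 * w2 + 4

    def grad(w1, w2):
        # Partial derivatives of F with respect to w1 and w2.
        return (4 * w1**3 + 6 * w1 * w2**2 - 2,
                4 * w2**3 + 6 * w1**2 * w2 - 3)

    C = 0.1
    w1, w2 = 0.0, 0.0
    for iteration in range(1, 31):
        g1, g2 = grad(w1, w2)
        print(f"{iteration:2d}  w1={w1:.3f}  w2={w2:.3f}  "
              f"dF/dw1={g1:+.4f}  dF/dw2={g2:+.4f}  F={F(w1, w2):.3f}")
        # Move a short distance in the direction of steepest decrease.
        w1, w2 = w1 - C * g1, w2 - C * g2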
1-The sequential MSE classification algorithm with single output
We can now describe the sequential MSE algorithm for a single output. Assume
we have L training samples, where each sample has M features x1, x2, …, xM. The
weights are w1, w2, …, wM. We can also define the auxiliary feature x0 = 1 and its
bias weight w0. Because we are updating the weights by considering one sample at a
time, we use the error criterion function:
E = ½ (D(x) - d)^2
where d is the desired or correct output and D(x) = w0 x0 + w1 x1 + w2 x2 + … +
wM xM is the actual output.
If the sample is misclassified, we want to adapt the weights so that D will be closer
to d. The partial derivatives are
dE/dwi = (D - d) xi,   i = 0, 1, 2, …, M
Using the steepest descent algorithm to minimize E gives us the sequential MSE
algorithm as follows:
1- Pick starting weights w0, w1, …, wM and choose a positive constant C.
2- Present samples 1 through L repeatedly to the classifier, cycling back to sample 1
after sample L is encountered. For each sample, compute
D = w0 + w1 x1 + w2 x2 + ….. + wM xM
3- Replace wi by wi – C(D – d) xi for all i.
4- Repeat steps 2 and 3 until the weights cease to change significantly.
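A minimal Python sketch of these four steps. The demonstration data are those of the worked example further below, but note that with feature values this large the constant C must be much smaller than the 0.2 used there for the weights to settle:

    def sequential_mse_single_output(samples, C, w=None, passes=1000):
        """samples: list of (feature_list, desired_output d).
        Returns the weights [w0, w1, ..., wM] of D(x) = w0 + w1 x1 + ... + wM xM."""
        M = len(samples[0][0])
        w = list(w) if w is not None else [0.0] * (M + 1)
        for _ in range(passes):
            for x, d in samples:
                D = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
                # Steepest-descent step on E = 1/2 (D - d)^2:
                # dE/dw0 = (D - d) since x0 = 1, and dE/dwi = (D - d) xi.
                w[0] -= C * (D - d)
                for i, xi in enumerate(x, start=1):
                    w[i] -= C * (D - d) * xi
        return w

    samples = [([2, 10], 1), ([3, 8], 1), ([5, 2], 1),
               ([50, 25], -1), ([65, 30], -1), ([35, 40], -1)]
    # With a small step size and enough passes the weights drift toward the
    # batch MSE solution found in the previous section.
    print(sequential_mse_single_output(samples, C=0.0002, passes=20000))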
2-The sequential MSE algorithm with multiple outputs
In this case, we have for each of the L samples the feature values x0, x1, …, xM
and desired outputs d1, d2, …, dN. The value wij is the weight on input i for
output node j. We use the error criterion function:
E = ½ [ (D1 - d1)^2 + (D2 - d2)^2 + … + (DN - dN)^2 ]
Since Dj = w0j x0 + w1j x1 + … + wij xi + … + wMj xM, then
dE/dwij = (Dj - dj) xi
The sequential MSE algorithm for multiple outputs can be stated as follows:
1- Pick starting weights w0j, w1j, …, wMj and choose a positive constant C.
2. Present samples 1 through L repeatedly to the classifier, cycling back to sample
1 after sample L is encountered. For each sample, compute
Dj = w0j + w1j x1 + …. + wij xi + ….. + wMj xM for nodes j = 1, 2, …., N.
3. Replace wij by wij – C (Dj – dj) xi for all i, and j.
4. Repeat steps 2 and 3 until the weights cease to change significantly.
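A compact sketch of the multiple-output version, in which each output node j has its own weight vector and the same steepest-descent update is applied per node (the demonstration uses a made-up subset of the three-class data from the example that follows, again with a small C):

    def sequential_mse_multi_output(samples, N, C, passes=1000):
        """samples: list of (feature_list, desired_list of length N).
        Returns weights[j] = [w0j, w1j, ..., wMj] for output node j."""
        M = len(samples[0][0])
        weights = [[0.0] * (M + 1) for _ in range(N)]
        for _ in range(passes):
            for x, d in samples:
                for j in range(N):
                    wj = weights[j]
                    Dj = wj[0] + sum(wi * xi for wi, xi in zip(wj[1:], x))
                    # dE/dwij = (Dj - dj) xi, with x0 = 1 for the bias weight.
                    wj[0] -= C * (Dj - d[j])
                    for i, xi in enumerate(x, start=1):
                        wj[i] -= C * (Dj - d[j]) * xi
        return weights

    samples = [([2, 10], [1, 0, 0]), ([5, 2], [1, 0, 0]),
               ([50, 25], [0, 1, 0]), ([35, 40], [0, 1, 0]),
               ([15, 7], [0, 0, 1]), ([22, 15], [0, 0, 1])]
    print(sequential_mse_multi_output(samples, N=3, C=0.0002, passes=5000))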
Example: Design a sequential MSE classification algorithm with single output using
the following training data:

Sample index    x1    x2    class    Desired output (d)
1               2     10    A        1
2               3     8     A        1
3               5     2     A        1
4               50    25    B        -1
5               65    30    B        -1
6               35    40    B        -1

Solution:
The form of the discriminant function is as follows:
D(X) = w0 + w1 x1 + w2 x2
The error function to be minimized is:
E = ½ (D(X) - d)^2
  = ½ (w0 + w1 x1 + w2 x2 - d)^2
The partial derivative of this error function relative to each weight is:
dE/dw0 = (D(X) – d)
dE/dw1 = (D(X) – d) x1
dE/dw2 = (D(X) – d) x2, then designing the algorithm will be using the following
steps:
Step1: pick up small random values for the weights and choose positive constant C:
w0 = 0.2, w1 = 0.15, w2 = - 0.12, and C = 0.2
therefore D(X) = 0.2 + 0.15 x1 – 0.12 x2
Step 2: select a training sample and present it to the classifier. Let us take sample1
X1 = [2 10] and compute
D(X1) = 0.2 + 0.15 (2) – 0.12 (10) = - 0.7
Step 3: adjust weights, replace wi by wi – C (D(X1) – d) xi as follows:
w0 = w0 – C (D(x1) – d) = 0.2 – 0.2 (- 0.7 – 1) = 0.2 + 0.34 = 0.54
w1 = w1 – C (D(X1) – d) x1 = 0.15 – 0.2 (- 0.7 – 1) (2) = 0.15 + 0.68 = 0.83
w2 = w2 – C (D(X1) – d) x2 = - 0.12 – 0.2 (- 0.7 – 1) (10) = 3.28
therefore D(X) = 0.54 + 0.83 x1 + 3.28 x2

Step 2: select a training sample and present it to the classifier. Let us take sample 2,
X2 = [3 8], and compute
D(X2) = 0.54 + 0.83 (3) + 3.28 (8) = 29.27
Step 3: update the weights
w0 = w0 - C (D - d) = 0.54 - 0.2 (29.27 - 1) = - 5.114
w1 = w1 - C (D - d) x1 = 0.83 - 0.2 (29.27 - 1) (3) = - 16.132
w2 = w2 - C (D - d) x2 = 3.28 - 0.2 (29.27 - 1) (8) = - 41.952

Therefore the current form of the discriminant function becomes:
D(X) = - 5.114 - 16.132 x1 - 41.952 x2

Step 2: choose sample X3 = [5 2] and compute
D(X3) = - 5.114 - 16.132 (5) - 41.952 (2) = - 169.678

Step 3: Adjust weights
w0 = w0 - C (D - d) = - 5.114 - 0.2 (- 169.678 - 1) = 29.022
w1 = w1 - C (D - d) x1 = - 16.132 - 0.2 (- 169.678 - 1) (5) = 154.546
w2 = w2 - C (D - d) x2 = - 41.952 - 0.2 (- 169.678 - 1) (2) = 26.319
Therefore D becomes D(X) = 29.022 + 154.546 x1 + 26.319 x2
Repeat steps 2 and 3 until the weights cease to change significantly.

Example: Design a sequential MSE classification algorithm with multiple outputs
using the following training data:

Sample index    x1    x2    class    Desired vector (d)
1               2     10    C1       1 0 0
2               3     8     C1       1 0 0
3               5     2     C1       1 0 0
4               50    25    C2       0 1 0
5               65    30    C2       0 1 0
6               35    40    C2       0 1 0
7               15    7     C3       0 0 1
8               18    9     C3       0 0 1
9               22    15    C3       0 0 1

Solution:
The form of the system is as shown in the next slide.
D1(X) = w01 + w11 x1 + w21 x2
D2(X) = w02 + w12 x1 + w22 x2
D3(X) = w03 + w13 x1 + w23 x2
Step 1: pick small random values for the weights and choose a positive constant C.
w01 = 0.15, w11 = - 0.2, w21 = 0.2
w02= 0.1, w12 = 0.3, w22 = 0.25
w03 = - 0.12, w13 = 0.1, w23 = 0.15
Choose C = 0.2
D1(X) = 0.15 – 0.2 x1 + 0.2 x2
D2(X) = 0.1 + 0.3 x1 + 0.25 x2
D3(X) = - 0.12 + 0.1 x1 + 0.15 x2
Step 2: choose a training sample X1 = [2 10] and compute D1, D2, D3:
D1(X1) = 0.15 - 0.2 (2) + 0.2 (10) = 1.75
D2(X1) = 0.1 + 0.3 (2) + 0.25 (10) = 3.2
D3(X1) = - 0.12 + 0.1 (2) + 0.15 (10) = 1.58
Step 3: adjust weights
w01 = w01 - C (D1 - d1) = 0.15 - 0.2 (1.75 - 1) = 0
w11 = w11 - C (D1 - d1) x1 = - 0.2 - 0.2 (1.75 - 1) (2) = - 0.5
w21 = w21 - C (D1 - d1) x2 = 0.2 - 0.2 (1.75 - 1) (10) = - 1.3
w02 = w02 - C (D2 - d2) = 0.1 - 0.2 (3.2 - 0) = - 0.54
w12 = w12 - C (D2 - d2) x1 = 0.3 - 0.2 (3.2 - 0) (2) = - 0.98
w22 = w22 - C (D2 - d2) x2 = 0.25 - 0.2 (3.2 - 0) (10) = - 6.15

w03 = w03 - C (D3 - d3) = - 0.12 - 0.2 (1.58 - 0) = - 0.436
w13 = w13 - C (D3 - d3) x1 = 0.1 - 0.2 (1.58 - 0) (2) = - 0.532
w23 = w23 - C (D3 - d3) x2 = 0.15 - 0.2 (1.58 - 0) (10) = - 3.01

D1(X) = 0 - 0.5 x1 - 1.3 x2
D2(X) = - 0.54 - 0.98 x1 - 6.15 x2
D3(X) = - 0.436 - 0.532 x1 - 3.01 x2
Repeat steps 2 and 3 until the weights cease to change significantly.
