
ADALINE for Pattern Classification

K. M. Leung
Department of Computer Science and Engineering
Polytechnic School of Engineering, NYU
CS6673

2015.09.29

Abstract
A supervised learning algorithm known as the Widrow-Hoff rule, the Delta rule, or the LMS rule, is introduced to train neural networks to classify patterns into two or more categories.


Simple ADALINE for Pattern Classification

Although the Perceptron learning rule always converges, in fact in a finite number of steps, to a set of weights and biases, provided such a set exists, the set obtained is often not the best in terms of robustness. We discuss here the ADALINE, which stands for Adaptive Linear Neuron, and a learning rule which is capable, at least in principle, of finding such a robust set of weights and biases.
The architecture of the NN for the ADALINE is basically the same as that of the Perceptron, and similarly the ADALINE is capable of classifying patterns into two or more categories. Bipolar neurons are also used.
The ADALINE differs from the Perceptron in the way the NN is trained, and in the form of the transfer function used for the output neurons during training. For the ADALINE, the transfer function is taken to be the identity function during training. After training, however, the transfer function is taken to be the bipolar Heaviside step function when the NN is used to classify input patterns.


Thus the transfer function is

f(y_{in}) = y_{in} \quad \text{during training},

f(y_{in}) = \begin{cases} +1, & \text{if } y_{in} \ge 0, \\ -1, & \text{if } y_{in} < 0, \end{cases} \quad \text{after training}.
We will first consider the case of classification into 2 categories only, so that the NN has only a single output neuron. Extension to the case of multiple categories is treated in a later section.
The total input received by the output neuron is given by

y_{in} = b + \sum_{i=1}^{N} x_i w_i.

Just like Hebb's rule and the Perceptron learning rule, the Delta rule is a supervised learning rule. Thus we assume that we are given a training set

\{\mathbf{s}^{(q)}, t^{(q)}\}, \qquad q = 1, 2, \ldots, Q,

where \mathbf{s}^{(q)} is a training vector and t^{(q)} is its corresponding target output value.

Also like Hebb's rule and the Perceptron rule, one cycles through the training set, presenting the training vectors one at a time to the NN. For the Delta rule, the weights and bias are updated so as to minimize the square of the difference between the NN output and the target value for the particular training vector presented at that step.
Notice that this procedure is not exactly the same as minimizing the overall error between the NN outputs and their corresponding target values over all the training vectors. Doing so would require the solution of a large-scale optimization problem involving N weight components and a single bias.


Multi-Parameter Minimization
To better understand the updating procedure for the weights and bias in the Delta rule, we need to digress and consider the topic of multi-parameter minimization. We assume that E(\mathbf{w}) is a scalar function of a vector argument \mathbf{w}. We want to find the point \mathbf{w} \in \mathbb{R}^n at which E takes on its minimum value.
Suppose we want to find the minimum iteratively, starting with \mathbf{w}(0). The iteration amounts to

\mathbf{w}(k+1) = \mathbf{w}(k) + \Delta\mathbf{w}(k), \qquad k = 0, 1, \ldots.

The question is how the changes \Delta\mathbf{w}(k) in the weight vector should be chosen so that we end up with a lower value of E:

E(\mathbf{w}(k+1)) < E(\mathbf{w}(k)).

For sufficiently small \Delta\mathbf{w}(k), we obtain from Taylor's theorem

E(\mathbf{w}(k+1)) = E(\mathbf{w}(k) + \Delta\mathbf{w}(k)) \approx E(\mathbf{w}(k)) + \mathbf{g}(k) \cdot \Delta\mathbf{w}(k),

where \mathbf{g}(k) = \nabla E(\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}(k)} is the gradient of E(\mathbf{w}) at \mathbf{w}(k).

It is clear that E(\mathbf{w}(k+1)) < E(\mathbf{w}(k)) if \mathbf{g}(k) \cdot \Delta\mathbf{w}(k) < 0. The largest decrease in the value of E(\mathbf{w}) occurs in the direction \Delta\mathbf{w}(k) = -\alpha\,\mathbf{g}(k), provided \alpha is sufficiently small and positive. This direction is called the steepest descent direction, and \alpha, which controls the size of the step, is called the learning rate. Thus, starting from \mathbf{w}(0), the idea is to find a minimum of the function E(\mathbf{w}) iteratively by making successive steps along the local gradient direction, according to

\mathbf{w}(k+1) = \mathbf{w}(k) - \alpha\,\mathbf{g}(k), \qquad k = 0, 1, \ldots.

This method of finding the minimum is known as the steepest descent method.
It is a greedy method, and may converge to a local but not a global minimum of E.
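The iteration above can be sketched in a few lines of Python; the quadratic E and its minimum here are illustrative choices of mine, not taken from the slides:

```python
import numpy as np

def steepest_descent(grad, w0, alpha, steps):
    """Iterate w(k+1) = w(k) - alpha * g(k), starting from w0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Illustrative quadratic: E(w) = (w1 - 3)^2 + (w2 + 1)^2 has its
# minimum at (3, -1), with gradient g(w) = 2 (w - (3, -1)).
grad_E = lambda w: 2.0 * (w - np.array([3.0, -1.0]))

w_min = steepest_descent(grad_E, w0=[0.0, 0.0], alpha=0.1, steps=100)
# w_min ends up very close to (3, -1)
```

Since E is convex here, the greedy method finds the global minimum; on a non-convex surface the same loop may stall at a local minimum.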


Delta Rule
Suppose at the k-th step in the training process the current weight vector and bias are given by \mathbf{w}(k) and b(k), respectively, and the q-th training vector, \mathbf{s}(k) = \mathbf{s}^{(q)}, is presented to the NN. The total input received by the output neuron is

y_{in} = b(k) + \sum_{i=1}^{N} s_i(k) w_i(k).

Since the transfer function is given by the identity function during training, the output of the NN is

y(k) = y_{in} = b(k) + \sum_{i=1}^{N} s_i(k) w_i(k).

However, the target output is t(k) = t^{(q)}, and so if y(k) \ne t(k) there is an error given by y(k) - t(k). This error can be positive or negative.

The Delta rule aims at finding the weights and bias that minimize the square of this error:

E(\mathbf{w}(k)) = (y(k) - t(k))^2 = \left( b(k) + \sum_{i=1}^{N} s_i(k) w_i(k) - t(k) \right)^2.

We can absorb the bias term by introducing an extra input neuron, X_0, whose activation (signal) is always fixed at 1 and whose weight is the bias. Then the square of the error in the k-th step is

E(\mathbf{w}(k)) = \left( \sum_{i=0}^{N} s_i(k) w_i(k) - t(k) \right)^2.

The gradient of this function, \mathbf{g}(k), in a space of dimension N+1 (N weights and 1 bias) has components

g_j(k) = \frac{\partial E(\mathbf{w}(k))}{\partial w_j(k)} = 2 \left( \sum_{i=0}^{N} s_i(k) w_i(k) - t(k) \right) s_j(k).

Using the steepest descent method, we have

\mathbf{w}(k+1) = \mathbf{w}(k) - 2\alpha \left( \sum_{i=0}^{N} s_i(k) w_i(k) - t(k) \right) \mathbf{s}(k).

The i = 1, 2, \ldots, N components of this equation give the updating rule for the weights. The zeroth component gives the updating rule for the bias:

b(k+1) = b(k) - 2\alpha \left( \sum_{i=0}^{N} s_i(k) w_i(k) - t(k) \right).

Notice that in the textbook by Fausett the factors of 2 are missing from these two updating formulas; equivalently, the learning rate there is twice the value here.
We now summarize the Delta rule. To save space, we use vector notation, where vectors are denoted by boldface quantities.


The Delta Rule

1. Set the learning rate \alpha and initialize the weights and bias.
2. Repeat the following steps, while cycling through the training set q = 1, 2, \ldots, Q, until changes in the weights and bias are insignificant.
   1. Set the activations for the input vector: \mathbf{x} = \mathbf{s}^{(q)}.
   2. Compute the total input for the output neuron: y_{in} = \mathbf{x} \cdot \mathbf{w} + b.
   3. Set y = y_{in}.
   4. Update the weights and bias:
      \mathbf{w}^{new} = \mathbf{w}^{old} - 2\alpha (y - t^{(q)}) \mathbf{x},
      b^{new} = b^{old} - 2\alpha (y - t^{(q)}).
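The steps above can be sketched in Python. The array shapes, the stopping test on the size of the updates, and the use of the bipolar AND data (treated later in these slides) as a demonstration are my own choices:

```python
import numpy as np

def delta_rule(S, T, alpha=0.02, tol=1e-6, max_epochs=5000):
    """Delta (Widrow-Hoff / LMS) rule for one bipolar output neuron.

    S: (Q, N) array of training vectors; T: (Q,) bipolar targets.
    The identity transfer function is used during training; iteration
    stops when every update in an epoch is smaller than tol.
    """
    w = np.zeros(S.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        biggest = 0.0
        for x, t in zip(S, T):
            y = b + x @ w                    # identity transfer during training
            w -= 2.0 * alpha * (y - t) * x   # weight update
            b -= 2.0 * alpha * (y - t)       # bias update
            biggest = max(biggest, 2.0 * alpha * abs(y - t))
        if biggest < tol:
            break
    return w, b

def classify(S, w, b):
    """After training, the bipolar Heaviside step replaces the identity."""
    return np.where(S @ w + b >= 0, 1, -1)

# Bipolar AND, with alpha well inside the convergent range:
S = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
T = np.array([1, -1, -1, -1], dtype=float)
w, b = delta_rule(S, T)
```

As the slides note, the per-pattern error need not vanish even at a good solution, so with a fixed \alpha the tolerance test may never trigger and the loop falls back on the epoch cap.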


Notice that for the Delta rule, unlike the Perceptron rule, training does not stop even after all the training vectors have been correctly classified. The algorithm continually attempts to produce a more robust set of weights and bias. Iteration stops only when the changes in the weights and bias are smaller than a preset tolerance.
In general, there is no proof that the Delta rule always converges, or that it yields a set of weights and bias that enables the NN to classify all the training vectors correctly. One also needs to experiment with the size of the learning rate: too small a value may require too many iterations, while too large a value may lead to non-convergence.
Because the identity function is used as the transfer function during training, the error at each step of the training process may never become small, even though an acceptable set of weights and bias may already have been found. In that case the weights will continually change from one iteration to the next, by amounts proportional to \alpha. Therefore in some cases one may want to gradually decrease \alpha toward zero during the iteration, especially when one is close to obtaining the best set of weights and bias. Of course there are many ways in which \alpha can be made to approach zero.

Exact Optimal Choice of Weights and Bias

Actually one can find, at least in principle, a set of weights and bias that performs best for a given training set. To see this, it is better to absorb the bias to simplify the expressions. Mathematically, the problem is to find a vector \mathbf{w} that minimizes the overall squared error (the least mean squares, or LMS):

F(\mathbf{w}) = \frac{1}{Q} \sum_{q=1}^{Q} (y^{(q)} - t^{(q)})^2 = \frac{1}{Q} \sum_{q=1}^{Q} \left( \sum_{i=0}^{N} s_i^{(q)} w_i - t^{(q)} \right)^2.

Since F(\mathbf{w}) is quadratic in the weight components, the solution can be obtained readily, at least formally: we take the partial derivatives of F(\mathbf{w}), set them to zero, and solve the resulting set of equations. Since F(\mathbf{w}) is quadratic, its partial derivatives are linear, and so the resulting equations for the weight components are linear and can therefore be solved.

Taking the partial derivative of F(\mathbf{w}) with respect to the j-th component of the weight vector gives

\frac{\partial F(\mathbf{w})}{\partial w_j} = \frac{2}{Q} \sum_{q=1}^{Q} (y^{(q)} - t^{(q)}) \frac{\partial}{\partial w_j} \sum_{i=0}^{N} s_i^{(q)} w_i = \frac{2}{Q} \sum_{q=1}^{Q} (y^{(q)} - t^{(q)}) s_j^{(q)}
= \frac{2}{Q} \sum_{q=1}^{Q} \left( \sum_{i=0}^{N} s_i^{(q)} w_i - t^{(q)} \right) s_j^{(q)} = 2 \left( \sum_{i=0}^{N} w_i C_{ij} - v_j \right),

where we have defined the correlation matrix C such that

C_{ij} = \frac{1}{Q} \sum_{q=1}^{Q} s_i^{(q)} s_j^{(q)},

and a vector \mathbf{v} having components

v_j = \frac{1}{Q} \sum_{q=1}^{Q} t^{(q)} s_j^{(q)}.


Setting the partial derivatives to zero gives the set of linear equations (written in matrix notation)

\mathbf{w} C = \mathbf{v}.

Notice that the correlation matrix C and the vector \mathbf{v} can easily be computed from the given training set.
Assuming that the correlation matrix is nonsingular, the solution is therefore given by

\mathbf{w} = \mathbf{v} C^{-1},

where C^{-1} is the inverse matrix of C. Notice that the correlation matrix is symmetric and has dimension (N+1) \times (N+1).
Although the exact solution is formally available, computing it this way requires inverting the matrix C or solving a system of linear equations. The computational complexity involved is O((N+1)^3). For most practical problems, N is so large that computing the solution this way is simply not feasible.
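The closed-form solution can be sketched with NumPy. The helper name `lms_exact`, the sample data, and the use of `np.linalg.solve` in place of an explicit inverse are my own illustrative choices:

```python
import numpy as np

def lms_exact(S, T):
    """Exact LMS weights for a single output neuron, bias absorbed.

    Prepends the fixed input s0 = 1 to every training vector, forms
    C_ij = (1/Q) sum_q s_i s_j and v_j = (1/Q) sum_q t s_j, and solves
    w C = v (C is symmetric, so this is the same as C w = v).
    """
    Q = S.shape[0]
    A = np.hstack([np.ones((Q, 1)), S])   # absorb the bias: s0 = 1
    C = A.T @ A / Q                       # (N+1) x (N+1) correlation matrix
    v = T @ A / Q
    w = np.linalg.solve(C, v)             # cheaper and stabler than forming C^{-1}
    return w[0], w[1:]                    # bias, weights

# Sanity check on targets that are exactly linear in the inputs:
# t = 1 + 2 x1 - 3 x2 should be recovered exactly.
S = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
T = 1.0 + 2.0 * S[:, 0] - 3.0 * S[:, 1]
b, w = lms_exact(S, T)
```

For large N this O((N+1)^3) solve is exactly the cost the slides warn about; the iterative Delta rule avoids it.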

Application: Bipolar Logic Function: AND

We use the Delta rule here to train the same NN (the bipolar logic function AND) that we have treated before using different training rules. The training set is given by the following table:

q    s^(q)      t^(q)
1    [ 1  1]     1
2    [ 1 -1]    -1
3    [-1  1]    -1
4    [-1 -1]    -1

We assume that the weights and bias are initially zero, and apply the Delta rule to train the NN. We find that for a learning rate \alpha larger than about 0.3 there is no convergence, as the weight components increase without bound. For \alpha less than 0.3 but larger than 0.16, the weights converge, but to values that fail to classify all the training vectors correctly. The weights converge to values that correctly classify all the training vectors if \alpha is less than about 0.16. They come closer and closer to the most robust set of weights and bias when \alpha is below 0.05.

We also consider here the exact formal solution given in the last section. We absorb the bias by appending a 1 in the leading position of each of the training vectors, so that the training set becomes:

q    s^(q)         t^(q)
1    [1  1  1]      1
2    [1  1 -1]     -1
3    [1 -1  1]     -1
4    [1 -1 -1]     -1

We first compute the correlation matrix

C = \frac{1}{4} \sum_{q=1}^{4} \mathbf{s}^{(q)T} \mathbf{s}^{(q)}
= \frac{1}{4} \left( \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} \begin{bmatrix} 1 & 1 & -1 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & -1 & 1 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} \right).

Thus

C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.

Since C is the identity matrix (the training vectors are as independent of each other as they can be), its inverse is just itself.
Then we compute the vector \mathbf{v}:

\mathbf{v} = \frac{1}{4} \sum_{q=1}^{4} t^{(q)} \mathbf{s}^{(q)}
= \frac{1}{4} \left( \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 1 & -1 \end{bmatrix} - \begin{bmatrix} 1 & -1 & 1 \end{bmatrix} - \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} \right)
= \begin{bmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}.

Therefore we have

\mathbf{w} = \mathbf{v} C^{-1} = \begin{bmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}.


This means that

b = -\frac{1}{2}, \qquad \mathbf{w} = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix},

and so the best decision boundary is given by the line

x_2 = 1 - x_1,

which we know from before is the correct result.
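The hand computation on this and the previous pages can be checked numerically; the NumPy check below is a convenience of mine, not part of the slides:

```python
import numpy as np

# Augmented bipolar-AND training vectors (leading 1 absorbs the bias).
S = np.array([[1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1]], dtype=float)
T = np.array([1, -1, -1, -1], dtype=float)

C = S.T @ S / 4            # correlation matrix: the 3x3 identity here
v = T @ S / 4              # [-1/2, 1/2, 1/2]
w = v @ np.linalg.inv(C)   # w = v C^{-1} = [-1/2, 1/2, 1/2]

# Decision boundary: -1/2 + x1/2 + x2/2 = 0, i.e. x2 = 1 - x1.
```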


NN with Multiple Output Neurons

We now extend our discussion to NNs with multiple output neurons, which are thus capable of classifying input vectors into more than 2 classes. As before, we need to have M neurons in the output layer.

[Figure: A neural network for multi-category classification, with input neurons X_1, ..., X_n fully connected through weights w_ij to output neurons Y_1, ..., Y_m.]



We will absorb the biases as we did before with the Perceptron. Suppose at the k-th step in the training process the current weight matrix and bias vector are given by W(k) and \mathbf{b}(k), respectively, and one of the training vectors, \mathbf{s}(k) = \mathbf{s}^{(q)} for some integer q between 1 and Q, is presented to the NN. The output of neuron Y_j is

y_j(k) = y_{in,j} = \sum_{i=0}^{N} s_i(k) w_{ij}(k).

However, the target is t_j(k) = t_j^{(q)}, and so the error is y_j(k) - t_j(k). Thus we want to find a set of w_{mn} that minimizes the quantity

E(W(k)) = \sum_{j=1}^{M} (y_j(k) - t_j(k))^2 = \sum_{j=1}^{M} \left( \sum_{i=0}^{N} s_i(k) w_{ij}(k) - t_j(k) \right)^2.


In order to do that we need to take the partial derivative of E with respect to w_{mn} for any given m and n. We will use the abbreviated notation \partial_{w_{mn}} to represent the partial derivative \partial/\partial w_{mn}.
Note that

\partial_{w_{mn}} w_{ij} = \delta_{im} \delta_{jn},

where \delta_{ij} is the Kronecker delta defined by

\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}

The reason is that unless i is the same as m and j is the same as n, w_{ij} and w_{mn} refer to two different weights, which are independent of each other, and the partial derivative is then 0. Otherwise they refer to the same weight and the partial derivative is 1.

We take the gradient of this function with respect to w_{mn}:

\partial_{w_{mn}} E(W(k)) = \partial_{w_{mn}} \sum_{j=1}^{M} (y_j(k) - t_j(k))^2 = 2 \sum_{j=1}^{M} (y_j(k) - t_j(k))\, \partial_{w_{mn}} y_j.

Since

\partial_{w_{mn}} w_{ij}(k) = \delta_{im} \delta_{jn},

thus

\partial_{w_{mn}} y_j = \partial_{w_{mn}} \sum_{i=0}^{N} s_i(k) w_{ij}(k) = \sum_{i=0}^{N} s_i(k) \delta_{im} \delta_{jn} = \delta_{jn}\, s_m(k),


and so we have

\partial_{w_{mn}} E(W(k)) = 2 \sum_{j=1}^{M} (y_j(k) - t_j(k))\, \delta_{jn}\, s_m(k) = 2 s_m(k) \left( y_n(k) - t_n(k) \right).

Using the steepest descent method, we have

w_{ij}(k+1) = w_{ij}(k) - 2\alpha\, s_i(k) \left( y_j(k) - t_j(k) \right).

The i = 1, 2, \ldots, N components of this equation give the updating rule for the weights. The i = 0 component gives the updating rule for the biases:

b_j(k+1) = b_j(k) - 2\alpha \left( y_j(k) - t_j(k) \right).


General Delta Rule for Multiple Output Neurons

1. Set the learning rate \alpha and initialize the weights and biases.
2. Repeat the following steps, while cycling through the training set q = 1, 2, \ldots, Q, until changes in the weights and biases are within tolerance.
   1. Set the activations for the input vector: \mathbf{x} = \mathbf{s}^{(q)}.
   2. Compute the total input for the output neurons: \mathbf{y}_{in} = \mathbf{x} W + \mathbf{b}.
   3. Set \mathbf{y} = \mathbf{y}_{in}.
   4. Update the weights and biases:
      W^{new} = W^{old} - 2\alpha\, \mathbf{x}^T (\mathbf{y} - \mathbf{t}^{(q)}),
      \mathbf{b}^{new} = \mathbf{b}^{old} - 2\alpha (\mathbf{y} - \mathbf{t}^{(q)}).
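A Python sketch of the general rule above; the fixed epoch count, the small learning rate, and the four-class demonstration data (two bipolar output neurons encoding four classes) are my own illustrative choices:

```python
import numpy as np

def delta_rule_multi(S, T, alpha=0.005, epochs=4000):
    """General Delta rule for M output neurons.

    S: (Q, N) training vectors; T: (Q, M) bipolar target vectors.
    W has shape (N, M); the identity transfer is used during training.
    """
    Q, N = S.shape
    M = T.shape[1]
    W = np.zeros((N, M))
    b = np.zeros(M)
    for _ in range(epochs):
        for x, t in zip(S, T):
            y = x @ W + b                          # y_in = x W + b, y = y_in
            W -= 2.0 * alpha * np.outer(x, y - t)  # W_new = W_old - 2 alpha x^T (y - t)
            b -= 2.0 * alpha * (y - t)             # b_new = b_old - 2 alpha (y - t)
    return W, b

def classify(S, W, b):
    """Bipolar Heaviside step applied once training is done."""
    return np.where(S @ W + b >= 0, 1, -1)

# Four-class illustration: two bipolar outputs give 2^2 = 4 class codes.
S = np.array([[1, 1], [1, 2], [2, -1], [2, 0],
              [-1, 2], [-2, 1], [-1, -1], [-2, -2]], dtype=float)
T = np.array([[1, 1], [1, 1], [1, -1], [1, -1],
              [-1, 1], [-1, 1], [-1, -1], [-1, -1]], dtype=float)
W, b = delta_rule_multi(S, T)
```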


An Example
We will now treat the same example that we considered before for the Perceptron with multiple output neurons. We use bipolar output neurons and the training set:

(class 1)  s^(1) = [ 1  1],  s^(2) = [ 1  2],  with t^(1) = t^(2) = [ 1  1]
(class 2)  s^(3) = [ 2 -1],  s^(4) = [ 2  0],  with t^(3) = t^(4) = [ 1 -1]
(class 3)  s^(5) = [-1  2],  s^(6) = [-2  1],  with t^(5) = t^(6) = [-1  1]
(class 4)  s^(7) = [-1 -1],  s^(8) = [-2 -2],  with t^(7) = t^(8) = [-1 -1]


It is clear that N = 2, Q = 8, and the number of classes is 4. The number of output neurons is chosen to be M = 2, so that 2^M = 4 classes can be represented.
Our exact calculation of the weights and bias for the case of a single output neuron can be extended to the case of multiple output neurons. One then obtains the following exact results for the weights and biases:

W = \begin{bmatrix} \tfrac{91}{153} & -\tfrac{1}{6} \\[2pt] \tfrac{8}{153} & \tfrac{2}{3} \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} -\tfrac{2}{153} & -\tfrac{1}{6} \end{bmatrix}.

Using these exact results, we can easily see how good or bad our iterative solutions are.
It should be remarked that the most robust set of weights and biases is determined by only a few training vectors, those that lie very close to the decision boundaries. In the Delta rule, however, all training vectors contribute in some way. Therefore the set of weights and biases obtained by the Delta rule is not necessarily the most robust.
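The exact result above can be reproduced by solving \mathbf{w}_j C = \mathbf{v}_j for each output neuron separately; the NumPy check below is my addition, not part of the slides:

```python
import numpy as np

# Training set with the bias absorbed (leading 1 in each vector).
S = np.array([[1,  1,  1], [1,  1,  2], [1,  2, -1], [1,  2,  0],
              [1, -1,  2], [1, -2,  1], [1, -1, -1], [1, -2, -2]], dtype=float)
T = np.array([[ 1,  1], [ 1,  1], [ 1, -1], [ 1, -1],
              [-1,  1], [-1,  1], [-1, -1], [-1, -1]], dtype=float)

C = S.T @ S / 8                    # (N+1) x (N+1) correlation matrix
V = T.T @ S / 8                    # one v-vector per output neuron
Wfull = np.linalg.solve(C, V.T)    # column j solves C w_j = v_j (C symmetric)
b, W = Wfull[0], Wfull[1:]         # b = [-2/153, -1/6],
                                   # W = [[91/153, -1/6], [8/153, 2/3]]
```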

The Delta rule usually gives convergent results if the learning rate is not too large. The resulting set of weights and biases typically leads to correct classification of all the training vectors, provided such a set exists. How close this set is to the best choice depends on the starting weights and biases, the learning rate, and the number of iterations. We find that for this example much better convergence can be obtained if the learning rate at step k is set to \alpha = 1/k.

