Adaline
K. M. Leung
Department of Computer Science and Engineering
Polytechnic School of Engineering, NYU
2015.09.29
Abstract
A supervised learning algorithm known as the Widrow-Hoff rule, the Delta rule, or the LMS rule is introduced to train neural networks to classify patterns into two or more categories.
The output neuron of the Adaline receives a net input equal to the bias plus a weighted sum of its inputs,
$$ y_{in} = b + \sum_{i=1}^{N} x_i w_i . $$
Just like Hebb's rule and the Perceptron learning rule, the Delta rule is also a supervised learning rule. Thus we assume that we are given a training set
$$ \{ \mathbf{s}^{(q)}, t^{(q)} \}, \qquad q = 1, 2, \dots, Q, $$
where $\mathbf{s}^{(q)}$ is a training vector and $t^{(q)}$ is its corresponding targeted output value.
Also like Hebb's rule and the Perceptron rule, one cycles through the training set, presenting the training vectors one at a time to the NN. For the Delta rule, the weights and bias are updated so as to minimize the square of the difference between the net output and the target value for the particular training vector presented at that step.
Notice that this procedure is not exactly the same as minimizing the overall error between the NN outputs and their corresponding target values for all the training vectors. Doing so would require the solution of a large-scale optimization problem involving $N$ weight components and a single bias.
Multi-Parameter Minimization
To better understand the updating procedure for the weights and bias in the Delta rule, we need to digress and consider the topic of multi-parameter minimization. We assume that $E(\mathbf{w})$ is a scalar function of a vector argument $\mathbf{w}$. We want to find the point $\mathbf{w} \in \mathbb{R}^{n}$ at which $E$ takes on its minimum value.
Suppose we want to find the minimum value iteratively, starting with $\mathbf{w}(0)$. The iteration amounts to
$$ \mathbf{w}(k+1) = \mathbf{w}(k) + \Delta\mathbf{w}(k), \qquad k = 0, 1, \dots . $$
The question is how the changes in the weight vector should be chosen so that we end up with a lower value for $E$:
$$ E(\mathbf{w}(k+1)) < E(\mathbf{w}(k)). $$
For sufficiently small $\Delta\mathbf{w}(k)$, we obtain from Taylor's theorem
$$ E(\mathbf{w}(k+1)) = E\big(\mathbf{w}(k) + \Delta\mathbf{w}(k)\big) \approx E(\mathbf{w}(k)) + \mathbf{g}(k) \cdot \Delta\mathbf{w}(k), $$
where $\mathbf{g}(k) = \nabla E(\mathbf{w})\big|_{\mathbf{w} = \mathbf{w}(k)}$ is the gradient of $E(\mathbf{w})$ at $\mathbf{w}(k)$.
It is clear that $E(\mathbf{w}(k+1)) < E(\mathbf{w}(k))$ if $\mathbf{g}(k) \cdot \Delta\mathbf{w}(k) < 0$. The largest decrease in the value of $E(\mathbf{w})$ occurs in the direction $\Delta\mathbf{w}(k) = -\alpha\,\mathbf{g}(k)$, provided $\alpha$ is sufficiently small and positive. This direction is called the steepest-descent direction, and $\alpha$, which controls the size of the step, is called the learning rate. Thus, starting from $\mathbf{w}(0)$, the idea is to find a minimum of the function $E(\mathbf{w})$ iteratively by making successive steps along the local gradient direction, according to
$$ \mathbf{w}(k+1) = \mathbf{w}(k) - \alpha\,\mathbf{g}(k), \qquad k = 0, 1, \dots . $$
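To make the steepest-descent iteration concrete, here is a minimal sketch in Python with NumPy. The quadratic test problem, the learning rate `alpha`, and the stopping tolerance are illustrative choices, not part of the original notes.

```python
import numpy as np

def steepest_descent(grad_E, w0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Minimize E(w) by repeatedly stepping along -grad: w(k+1) = w(k) - alpha * g(k)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_E(w)                        # g(k), the gradient of E at w(k)
        w_new = w - alpha * g                # steepest-descent step
        if np.linalg.norm(w_new - w) < tol:  # stop once the step is negligible
            return w_new
        w = w_new
    return w

# Hypothetical test problem: E(w) = (w - c) . A (w - c), so grad E = 2 A (w - c)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
c = np.array([1.0, -3.0])
quad_grad = lambda w: 2.0 * A @ (w - c)

print(steepest_descent(quad_grad, w0=[0.0, 0.0]))  # approaches the minimizer [1, -3]
```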
Delta Rule
Suppose at the $k$-th step in the training process, the current weight vector and bias are given by $\mathbf{w}(k)$ and $b(k)$, respectively, and the $q$-th training vector, $\mathbf{s}(k) = \mathbf{s}^{(q)}$, is presented to the NN. The total input received by the output neuron is
$$ y_{in} = b(k) + \sum_{i=1}^{N} s_i(k)\, w_i(k). $$
Since the transfer function is given by the identity function during training, the output of the NN is
$$ y(k) = y_{in} = b(k) + \sum_{i=1}^{N} s_i(k)\, w_i(k). $$
However, the target output is $t(k) = t^{(q)}$, and so if $y(k) \neq t(k)$ then there is an error given by $y(k) - t(k)$. This error can be positive or negative.
The Delta rule aims at finding the weights and bias so as to minimize the square of this error,
$$ E(\mathbf{w}(k)) = \big( y(k) - t(k) \big)^{2} = \left( b(k) + \sum_{i=1}^{N} s_i(k)\, w_i(k) - t(k) \right)^{2}. $$
Differentiating $E(\mathbf{w}(k))$ with respect to the weights and the bias and taking a steepest-descent step of size $\alpha$ gives the two updating formulas
$$ \mathbf{w}(k+1) = \mathbf{w}(k) - 2\alpha \big( y(k) - t(k) \big)\, \mathbf{s}(k), $$
$$ b(k+1) = b(k) - 2\alpha \big( y(k) - t(k) \big). $$
Notice that in the textbook by Fausett, the factors of 2 are missing from these two updating formulas. Equivalently, the learning rate there is twice the value used here.
We will now summarize the Delta rule. To save space, we use vector
notation, where vectors are denoted by boldface quantities.
3. Set $y = y_{in}$.
4. Update the weights and bias:
$$ \mathbf{w}^{\mathrm{new}} = \mathbf{w}^{\mathrm{old}} - 2\alpha \big( y - t^{(q)} \big)\, \mathbf{x}, $$
$$ b^{\mathrm{new}} = b^{\mathrm{old}} - 2\alpha \big( y - t^{(q)} \big). $$
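Below is a minimal sketch of this single-neuron Delta-rule loop in Python with NumPy. The function name `train_adaline`, the learning rate, the tolerance, and the sample data are illustrative choices, not part of the original notes.

```python
import numpy as np

def train_adaline(S, t, alpha=0.05, tol=1e-6, max_epochs=1000):
    """Delta-rule (LMS) training of a single Adaline.
    S: (Q, N) array of training vectors, t: (Q,) array of bipolar targets."""
    Q, N = S.shape
    w = np.zeros(N)
    b = 0.0
    for _ in range(max_epochs):
        max_change = 0.0
        for q in range(Q):
            y = b + S[q] @ w                  # identity transfer function during training
            err = y - t[q]                    # signed error y - t^(q)
            dw = -2.0 * alpha * err * S[q]    # w_new = w_old - 2*alpha*(y - t^(q)) * s^(q)
            db = -2.0 * alpha * err           # b_new = b_old - 2*alpha*(y - t^(q))
            w += dw
            b += db
            max_change = max(max_change, np.max(np.abs(dw)), abs(db))
        if max_change < tol:                  # stop when the updates become negligible
            break
    return w, b

# Illustrative run on bipolar AND-style data (essentially the worked example used later)
S = np.array([[ 1,  1], [ 1, -1], [-1,  1], [-1, -1]], dtype=float)
t = np.array([ 1, -1, -1, -1], dtype=float)
w, b = train_adaline(S, t)
print(w, b)   # hovers near w = [0.5, 0.5], b = -0.5
```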
Notice that for the Delta rule, unlike the Perceptron rule, training does not
stop even after all the training vectors have been correctly classified. The
algorithm continuously attempts to produce more robust sets of weights
and bias. Iteration is stopped only when changes in the weights and bias
are smaller than a preset tolerance level.
In general, there is no proof that the Delta rule will always lead to
convergence, or to a set of weights and bias that enable the NN to
correctly classify all the training vectors. One also needs to experiment
with the size of the learning rate. Too small a value may require too many
iterations. Too large a value may lead to non-convergence.
Because the identity function is used as the transfer function during training, the error at each step of the training process may never become small, even though an acceptable set of weights and bias may already have been found. In that case the weights will continually change from one iteration to the next. The amount of change is proportional to $\alpha$. Therefore in some cases one may want to gradually decrease $\alpha$ towards zero during the iteration, especially when one is close to obtaining the best set of weights and bias. Of course there are many ways in which $\alpha$ can be made to approach zero.
Instead of updating the weights after each training vector, one can also ask for the weights that minimize the mean squared error over the entire training set. Absorbing the bias as $w_0$ (with $s_0^{(q)} = 1$), this error is
$$ F(\mathbf{w}) = \frac{1}{Q} \sum_{q=1}^{Q} \big( y^{(q)} - t^{(q)} \big)^{2} = \frac{1}{Q} \sum_{q=1}^{Q} \left( \sum_{i=0}^{N} s_i^{(q)} w_i - t^{(q)} \right)^{2}, $$
where $y^{(q)} = \sum_{i=0}^{N} s_i^{(q)} w_i$ is the output for training vector $\mathbf{s}^{(q)}$.
Taking the partial derivative of $F(\mathbf{w})$ with respect to the $j$-th component of the weight vector gives
$$ \frac{\partial F(\mathbf{w})}{\partial w_j} = \frac{2}{Q} \sum_{q=1}^{Q} \big( y^{(q)} - t^{(q)} \big) \frac{\partial}{\partial w_j} \sum_{i=0}^{N} s_i^{(q)} w_i = \frac{2}{Q} \sum_{q=1}^{Q} \big( y^{(q)} - t^{(q)} \big) s_j^{(q)} $$
$$ = \frac{2}{Q} \sum_{q=1}^{Q} \left( \sum_{i=0}^{N} s_i^{(q)} w_i - t^{(q)} \right) s_j^{(q)} = 2 \left( \sum_{i=0}^{N} w_i C_{ij} - v_j \right), $$
where the correlation matrix $C$ and the vector $\mathbf{v}$ are defined by
$$ C_{ij} = \frac{1}{Q} \sum_{q=1}^{Q} s_i^{(q)} s_j^{(q)}, \qquad v_j = \frac{1}{Q} \sum_{q=1}^{Q} t^{(q)} s_j^{(q)}. $$
Setting the partial derivatives to zero gives the set of linear equations (written in matrix notation)
$$ \mathbf{w}\, C = \mathbf{v}. $$
Notice that the correlation matrix $C$ and the vector $\mathbf{v}$ can be easily computed from the given training set.
Assuming that the correlation matrix is nonsingular, the solution is therefore given by
$$ \mathbf{w} = \mathbf{v}\, C^{-1}, $$
where $C^{-1}$ is the inverse matrix of $C$. Notice that the correlation matrix is symmetric and has dimension $(N+1) \times (N+1)$.
Although the exact solution is formally available, computing it this way requires computing the inverse of the matrix $C$ or solving a system of linear equations. The computational complexity involved is $O\big((N+1)^{3}\big)$. For most practical problems, $N$ is so large that computing the solution this way is really not feasible.
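For moderate $N$, however, the exact solution is easy to compute directly. The sketch below (NumPy) builds $C$ and $\mathbf{v}$ from a training set with the bias absorbed as a leading column of ones and then solves $\mathbf{w}C = \mathbf{v}$; the helper name `lms_exact` is an illustrative choice.

```python
import numpy as np

def lms_exact(S, t):
    """Exact least-squares (LMS) weights for an Adaline, bias absorbed as w[0].
    S: (Q, N) training vectors, t: (Q,) targets.  Returns w of length N + 1."""
    Q = S.shape[0]
    S1 = np.hstack([np.ones((Q, 1)), S])   # prepend s_0 = 1 so the bias becomes w_0
    C = (S1.T @ S1) / Q                    # correlation matrix, (N+1) x (N+1), symmetric
    v = (t @ S1) / Q                       # v_j = (1/Q) sum_q t^(q) s_j^(q)
    return np.linalg.solve(C.T, v)         # solves w C = v without forming C^{-1}
```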
We also consider here the exact formal solution given in the last section. We will absorb the bias by appending a 1 in the leading position of each of the training vectors, so that the training set is

q    s^(q)          t^(q)
1    [1  1  1]       1
2    [1  1 -1]      -1
3    [1 -1  1]      -1
4    [1 -1 -1]      -1
We first compute the correlation matrix
$$ C = \frac{1}{4} \sum_{q=1}^{4} \mathbf{s}^{(q)T} \mathbf{s}^{(q)}
     = \frac{1}{4} \left( \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}
     + \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix} \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}
     + \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & -1 & 1 \end{pmatrix}
     + \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix} \begin{pmatrix} 1 & -1 & -1 \end{pmatrix} \right). $$
Thus
$$ C = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. $$
Since $C$ is the identity matrix (the training vectors are as independent of each other as they can be), its inverse is just itself.
Then we compute the vector $\mathbf{v}$:
$$ \mathbf{v} = \frac{1}{4} \sum_{q=1}^{4} t^{(q)} \mathbf{s}^{(q)}
   = \frac{1}{4} \Big( \big[1\ \ 1\ \ 1\big] - \big[1\ \ 1\ \ {-1}\big] - \big[1\ \ {-1}\ \ 1\big] - \big[1\ \ {-1}\ \ {-1}\big] \Big)
   = \Big[ -\tfrac{1}{2}\ \ \tfrac{1}{2}\ \ \tfrac{1}{2} \Big]. $$
Therefore we have
$$ \mathbf{W} = \mathbf{v}\, C^{-1} = \Big[ -\tfrac{1}{2}\ \ \tfrac{1}{2}\ \ \tfrac{1}{2} \Big]. $$
Extracting the bias (the leading component) and the weights, we have
$$ b = -\tfrac{1}{2}, \qquad \mathbf{W} = \Big[ \tfrac{1}{2}\ \ \tfrac{1}{2} \Big], $$
and so the best decision boundary is given by the line
$$ x_2 = 1 - x_1, $$
which we know from before is the correct result.
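As a quick check, a few lines of NumPy reproduce this worked example (the array names are just for illustration):

```python
import numpy as np

# training vectors with the bias absorbed (leading 1) and their bipolar targets
S1 = np.array([[1,  1,  1],
               [1,  1, -1],
               [1, -1,  1],
               [1, -1, -1]], dtype=float)
t = np.array([1, -1, -1, -1], dtype=float)

C = (S1.T @ S1) / 4           # correlation matrix: the 3 x 3 identity
v = (t @ S1) / 4              # [-1/2, 1/2, 1/2]
W = v @ np.linalg.inv(C)      # W = v C^{-1} = [-1/2, 1/2, 1/2]
print(C, v, W, sep="\n")      # boundary: -1/2 + x1/2 + x2/2 = 0, i.e. x2 = 1 - x1
```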
[Figure: Adaline with multiple output neurons. Input units $X_1, \dots, X_n$ are fully connected through weights $w_{ij}$ to output units $Y_1, \dots, Y_m$, which produce the outputs $y_1, \dots, y_m$.]
We will absorb the biases as we did before with the Perceptron. Suppose at the $k$-th step in the training process, the current weight matrix and bias vector are given by $\mathbf{W}(k)$ and $\mathbf{b}(k)$, respectively, and one of the training vectors, $\mathbf{s}(k) = \mathbf{s}^{(q)}$ for some integer $q$ between 1 and $Q$, is presented to the NN. The output of neuron $Y_j$ is
$$ y_j(k) = y_{in,j} = \sum_{i=0}^{N} s_i(k)\, w_{ij}(k). $$
However, the target is $t_j(k) = t_j^{(q)}$, and so the error is $y_j(k) - t_j(k)$. Thus we want to find a set of $w_{mn}$ that minimizes the quantity
$$ E(\mathbf{W}(k)) = \sum_{j=1}^{M} \left( \sum_{i=0}^{N} s_i(k)\, w_{ij}(k) - t_j(k) \right)^{2}. $$
Note that
$$ \frac{\partial w_{ij}}{\partial w_{mn}} = \delta_{im}\,\delta_{jn}, $$
where $\delta_{ij}$ is the Kronecker delta defined by
$$ \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases} $$
The reason is that if $i$ is not the same as $m$, or $j$ is not the same as $n$, then $w_{ij}$ and $w_{mn}$ refer to two different weights and are therefore independent of each other, so the partial derivative is 0. Otherwise they refer to the same weight and the partial derivative is 1.
Taking the partial derivative of $E(\mathbf{W}(k))$ with respect to $w_{mn}$ gives
$$ \frac{\partial}{\partial w_{mn}} E(\mathbf{W}(k)) = 2 \sum_{j=1}^{M} \big( y_j(k) - t_j(k) \big) \frac{\partial y_j}{\partial w_{mn}} . $$
Since
$$ \frac{\partial w_{ij}(k)}{\partial w_{mn}} = \delta_{i,m}\,\delta_{j,n}, $$
thus
$$ \frac{\partial y_j}{\partial w_{mn}} = \frac{\partial}{\partial w_{mn}} \sum_{i=0}^{N} s_i(k)\, w_{ij}(k) = \sum_{i=0}^{N} s_i(k)\, \delta_{i,m}\,\delta_{j,n} = s_m(k)\,\delta_{j,n}, $$
and so we have
$$ \frac{\partial}{\partial w_{mn}} E(\mathbf{W}(k)) = 2 \sum_{j=1}^{M} \big( y_j(k) - t_j(k) \big) s_m(k)\,\delta_{j,n} = 2 \big( y_n(k) - t_n(k) \big) s_m(k). $$
3. Set $\mathbf{y} = \mathbf{y}_{in}$.
4. Update the weights and biases:
$$ \mathbf{W}^{\mathrm{new}} = \mathbf{W}^{\mathrm{old}} - 2\alpha\, \mathbf{x}^{T} \big( \mathbf{y} - \mathbf{t}^{(q)} \big), $$
$$ \mathbf{b}^{\mathrm{new}} = \mathbf{b}^{\mathrm{old}} - 2\alpha\, \big( \mathbf{y} - \mathbf{t}^{(q)} \big). $$
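A minimal sketch of one such update step in NumPy, using the same conventions as the earlier single-neuron sketch; the names `delta_step_multi`, `W`, `b`, and `alpha` are illustrative.

```python
import numpy as np

def delta_step_multi(W, b, x, t, alpha=0.05):
    """One Delta-rule update for an Adaline layer with M output neurons.
    W: (N, M) weights, b: (M,) biases, x: (N,) training vector, t: (M,) bipolar targets."""
    y = b + x @ W                               # identity transfer function during training
    err = y - t                                 # error vector y - t^(q), shape (M,)
    W_new = W - 2.0 * alpha * np.outer(x, err)  # W_new = W_old - 2*alpha * x^T (y - t^(q))
    b_new = b - 2.0 * alpha * err               # b_new = b_old - 2*alpha * (y - t^(q))
    return W_new, b_new

# Illustrative call with N = 2 inputs and M = 2 output neurons
W, b = delta_step_multi(np.zeros((2, 2)), np.zeros(2),
                        x=np.array([1.0, 1.0]), t=np.array([1.0, 1.0]))
print(W, b)
```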
An Example
We will now treat the same example that we have considered before for the Perceptron with multiple output neurons. We use bipolar output neurons and the training set:

(class 1)  s^(1) = [ 1  1],  s^(2) = [ 1  2],  with t^(1) = t^(2) = [ 1  1]
(class 2)  s^(3) = [ 2 -1],  s^(4) = [ 2  0],  with t^(3) = t^(4) = [ 1 -1]
(class 3)  s^(5) = [-1  2],  s^(6) = [-2  1],  with t^(5) = t^(6) = [-1  1]
(class 4)  s^(7) = [-1 -1],  s^(8) = [-2 -2],  with t^(7) = t^(8) = [-1 -1]
The exact weights and biases for this example can be computed from the correlation matrix $C$ and the vector $\mathbf{v}$ in the same way as in the previous section.
Using these exact results, we can easily see how good or bad our iterative
solutions are.
It should be remarked that the most robust set of weights and biases is
determined only by a few training vectors that lie very close to the decision
boundaries. However in the Delta rule, all training vectors contribute in
some way. Therefore the set of weights and biases obtained by the Delta
rule is not necessarily always the most robust.
The Delta rule usually gives convergent results if the learning rate is not too large. The resulting set of weights and biases typically leads to correct classification of all the training vectors, provided such a set exists. How close this set is to the best choice depends on the starting weights and biases, the learning rate, and the number of iterations. We find that for this example much better convergence can be obtained if the learning rate at step $k$ is set to $\alpha = 1/k$.
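A sketch of how such a decaying schedule could be wired into the training loop for this multi-output case; apart from the schedule $\alpha = 1/k$ taken from the remark above, the function name, epoch count, and array conventions are illustrative choices.

```python
import numpy as np

def train_adaline_decay(S, T, max_epochs=200):
    """Multi-output Delta-rule training with the decaying learning rate alpha = 1/k.
    S: (Q, N) training vectors, T: (Q, M) bipolar targets."""
    Q, N = S.shape
    M = T.shape[1]
    W = np.zeros((N, M))
    b = np.zeros(M)
    k = 0
    for _ in range(max_epochs):
        for q in range(Q):
            k += 1
            alpha = 1.0 / k                         # learning rate shrinks toward zero
            y = b + S[q] @ W                        # identity transfer function
            err = y - T[q]
            W -= 2.0 * alpha * np.outer(S[q], err)  # Delta-rule updates
            b -= 2.0 * alpha * err
    return W, b
```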