Multi Layer Perceptron (Haykin)
[Figure: MLP architecture with an input layer, hidden layers, and an output layer]
A solution for the XOR problem

  x1   x2   x1 XOR x2
  -1   -1      -1
  -1   +1      +1
  +1   -1      +1
  +1   +1      -1

[Figure: two-layer network of sign neurons with inputs x1, x2 and bias inputs +1 implementing XOR]

  \varphi(v) = \begin{cases} +1 & \text{if } v > 0 \\ -1 & \text{if } v \le 0 \end{cases}

\varphi is the sign function.
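The weights in the slide's network figure did not survive extraction, so the sketch below uses one hand-chosen set of weights (an assumption, not the slide's values) for a 2-2-1 network of sign units that reproduces the truth table above.

```python
import numpy as np

def sign(v):
    # Sign activation from the slide: +1 if v > 0, -1 otherwise
    return np.where(v > 0, 1, -1)

def xor_net(x1, x2):
    # 2-2-1 network of sign units on inputs in {-1, +1}
    h1 = sign(x1 + x2 - 1.5)        # hidden unit 1: acts as a logical AND
    h2 = sign(x1 + x2 + 1.5)        # hidden unit 2: acts as a logical OR
    return sign(-h1 + h2 - 1.5)     # output: OR but not AND  ->  XOR

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, int(xor_net(x1, x2)))   # reproduces the truth table
```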
NEURON MODEL
• Sigmoidal function

  \varphi(v_j) = \frac{1}{1 + e^{-a v_j}}

  (increasing a makes the transition steeper)

[Figure: plot of \varphi(v_j) for v_j in [-10, 10]]

• Induced local field: v_j = \sum_{i=0,\dots,m} w_{ji} y_i
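A minimal sketch of this neuron model: the logistic activation with slope a applied to the induced local field v_j = \sum_i w_{ji} y_i, where y_0 = +1 plays the role of the bias input. The example weights and inputs are made up for illustration.

```python
import numpy as np

def phi(v, a=1.0):
    # Logistic (sigmoidal) activation 1 / (1 + exp(-a*v)); larger a gives a steeper transition
    return 1.0 / (1.0 + np.exp(-a * v))

def neuron_output(w, y, a=1.0):
    # Induced local field v_j = sum_i w_ji * y_i, then the activation
    v = np.dot(w, y)
    return phi(v, a)

y = np.array([1.0, 0.3, -0.8])      # y[0] = +1 is the fixed bias input
w = np.array([0.1, 0.5, -0.2])      # example weights w_j0, w_j1, w_j2
print(neuron_output(w, y, a=2.0))
```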
• Back-propagation algorithm
  – Forward step: function signals propagate forward
  – Backward step: error signals propagate backward
• It adjusts the weights of the NN in order to
minimize the average squared error.
Average Squared Error
• Error signal of output neuron j at presentation of the n-th
  training example:

  e_j(n) = d_j(n) - y_j(n)

• Total error energy at time n:

  E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)        (C: set of neurons in the output layer)

• Average squared error, a measure of learning performance:

  E_{AV} = \frac{1}{N} \sum_{n=1}^{N} E(n)          (N: size of the training set)
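A small sketch of these error measures; the desired and actual outputs below are toy values used only for illustration.

```python
import numpy as np

def instantaneous_error(d, y):
    # E(n) = 1/2 * sum_{j in C} e_j(n)^2, with e_j(n) = d_j(n) - y_j(n)
    e = d - y
    return 0.5 * np.sum(e ** 2)

def average_squared_error(D, Y):
    # E_AV = (1/N) * sum_{n=1}^{N} E(n) over the N training examples
    return np.mean([instantaneous_error(d, y) for d, y in zip(D, Y)])

D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # desired outputs, N = 3 examples
Y = np.array([[0.8, 0.1], [0.2, 0.7], [0.9, 0.6]])   # network outputs
print(average_squared_error(D, Y))
```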
Notation
Weight Update Rule
Update rule is based on the gradient descent method:
take a step in the direction yielding the maximum
decrease of E

  \Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}        (step in the direction opposite to the gradient)
Update Rule
• We obtain

  \Delta w_{ji} = \eta \, \delta_j \, y_i

because

  \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}

with  \delta_j = -\frac{\partial E}{\partial v_j}  and  \frac{\partial v_j}{\partial w_{ji}} = y_i
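A one-line sketch of the resulting correction for all weights feeding neuron j; eta, delta_j and the inputs y are placeholder values.

```python
import numpy as np

def weight_correction(eta, delta_j, y):
    # Delta rule for one neuron: delta_w_ji = eta * delta_j * y_i for every input i
    return eta * delta_j * np.asarray(y)

y = np.array([1.0, 0.3, -0.8])                         # inputs to neuron j (y[0] = bias)
print(weight_correction(eta=0.1, delta_j=0.25, y=y))   # correction vector for w_j
```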
Compute local gradient of neuron j
The local gradient is defined as  \delta_j = -\frac{\partial E}{\partial v_j}.
Error e_j of output neuron
• Case 1: j is an output neuron

  e_j = d_j - y_j

Then

  \delta_j = (d_j - y_j) \, \varphi'(v_j)
Local gradient of hidden neuron
• Case 2: j is a hidden neuron
Use the Chain Rule

  \delta_j = -\frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial v_j} = -\frac{\partial E}{\partial y_j} \, \varphi'(v_j)

With  E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n):

  \frac{\partial E}{\partial y_j} = \sum_{k \in C} e_k \frac{\partial e_k}{\partial y_j}
                                  = \sum_{k \in C} e_k \frac{\partial e_k}{\partial v_k} \frac{\partial v_k}{\partial y_j}

From  \frac{\partial e_k}{\partial v_k} = -\varphi'(v_k)  and  \frac{\partial v_k}{\partial y_j} = w_{kj}

we obtain

  \frac{\partial E}{\partial y_j} = -\sum_{k \in C} \delta_k w_{kj}
Local Gradient of hidden neuron j

Hence   \delta_j = \varphi'(v_j) \sum_{k \in C} \delta_k w_{kj}

[Figure: signal-flow graph of back-propagation error signals: the output-neuron errors e_1, ..., e_k, ..., e_m, multiplied by \varphi'(v_1), ..., \varphi'(v_m), flow back through the weights w_{1j}, ..., w_{kj}, ..., w_{mj} to neuron j]
Delta Rule
• Delta rule:  \Delta w_{ji} = \eta \, \delta_j \, y_i
Local Gradient of neurons
For the sigmoidal activation function:

  \delta_j = a y_j [1 - y_j] \sum_k \delta_k w_{kj}        if j is a hidden node

  \delta_j = a y_j [1 - y_j] [d_j - y_j]                   if j is an output node
Backpropagation algorithm
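A minimal sketch of the algorithm assembled from the preceding slides, for an assumed 2-2-1 network of logistic neurons trained in sequential mode; the slope a, the learning rate eta, the weight initialization and the XOR example data are assumptions, not values prescribed by the slides.

```python
import numpy as np

a, eta = 1.0, 0.5                       # assumed sigmoid slope and learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, (2, 3))       # hidden layer: 2 neurons, 2 inputs + bias
W2 = rng.normal(0.0, 0.5, (1, 3))       # output layer: 1 neuron, 2 hidden outputs + bias

def phi(v):
    return 1.0 / (1.0 + np.exp(-a * v))   # logistic activation

def train_step(x, d):
    """One sequential (pattern-by-pattern) back-propagation update on example (x, d)."""
    global W1, W2
    y0 = np.append(1.0, x)              # forward step: input with bias component y_0 = +1
    y1 = phi(W1 @ y0)                   # hidden outputs
    y1b = np.append(1.0, y1)
    y2 = phi(W2 @ y1b)                  # network outputs

    # Backward step: local gradients; a*y*(1-y) is phi'(v) written in terms of y
    delta2 = a * y2 * (1 - y2) * (d - y2)                  # output neurons
    delta1 = a * y1 * (1 - y1) * (W2[:, 1:].T @ delta2)    # hidden neurons

    # Delta rule: w_ji <- w_ji + eta * delta_j * y_i
    W2 += eta * np.outer(delta2, y1b)
    W1 += eta * np.outer(delta1, y0)
    return 0.5 * np.sum((d - y2) ** 2)  # instantaneous error E(n)

# Example usage: a few epochs over the XOR data (targets in {0, 1} to match the logistic range)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0.0], [1.0], [1.0], [0.0]])
for epoch in range(2000):
    for x, d in zip(X, D):
        train_step(x, d)
```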
Summary
Training
Stopping criteria
• Sensible stopping criteria:
– Average squared error change:
Back-prop is considered to have converged
when the absolute rate of change in the
average squared error per epoch is
sufficiently small (typically in the range [0.01, 0.1]).
– Generalization based criterion:
After each epoch the NN is tested for
generalization. If the generalization
performance is adequate then stop.
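A sketch of how the two criteria above might be coded; the error histories are plain Python lists, and the tolerance and patience parameters are assumed values.

```python
def error_change_stop(E_av_history, tol=0.01):
    # Stop when the absolute change in the average squared error per epoch is small enough
    return len(E_av_history) >= 2 and abs(E_av_history[-1] - E_av_history[-2]) < tol

def generalization_stop(val_errors, patience=5):
    # Stop when the validation (generalization) error has not improved for `patience` epochs
    best_epoch = val_errors.index(min(val_errors))
    return len(val_errors) - best_epoch - 1 >= patience
```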
Early stopping
Generalization
• Generalization: NN generalizes well if the I/O
mapping computed by the network is nearly
correct for new data (test set).
• Factors that influence generalization:
– the size of the training set.
– the architecture of the NN.
– the complexity of the problem at hand.
• Overfitting (overtraining): when the NN learns
too many I/O examples it may end up
memorizing the training data.
Expressive capabilities of NN
Boolean functions:
• Every boolean function can be represented by a
  network with a single hidden layer
• but it might require an exponential number of hidden units
Continuous functions:
• Every bounded continuous function can be
  approximated with arbitrarily small error by a
  network with one hidden layer
• Any function can be approximated to arbitrary
  accuracy by a network with two hidden layers
Generalized Delta Rule
• If \eta is small: slow rate of learning
• If \eta is large: large changes of the weights;
  the NN can become unstable (oscillatory)
• Method to overcome the above drawback: include
  a momentum term in the delta rule (generalized
  delta rule):

  \Delta w_{ji}(n) = \alpha \, \Delta w_{ji}(n-1) + \eta \, \delta_j(n) \, y_i(n)

  (\alpha: momentum constant)
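A sketch of the generalized delta rule; eta = 0.1 and alpha = 0.9 are assumed values, and the previous correction is carried along between iterations.

```python
import numpy as np

def momentum_update(prev_dw, delta, y, eta=0.1, alpha=0.9):
    # delta_w(n) = alpha * delta_w(n-1) + eta * delta_j(n) * y_i(n)
    return alpha * prev_dw + eta * np.outer(delta, y)

prev_dw = np.zeros((1, 3))                       # correction from step n-1
dw = momentum_update(prev_dw, delta=np.array([0.2]), y=np.array([1.0, 0.4, -0.6]))
print(dw)
```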
Generalized delta rule
Setting the parameters
• How are the weights initialised?
• How is the learning rate chosen?
• How many hidden layers and how many neurons?
• Which activation function?
• How to preprocess the data?
• How many examples in the training data set?
Some heuristics (1)
• Sequential vs. batch algorithms: the
  sequential mode (pattern by pattern) is
  computationally faster than the batch
  mode (epoch by epoch)
• The sigmoidal function  \varphi(v) = \frac{1}{1 + e^{-a v}}  is nonsymmetric
Some heuristics (3)
Some heuristics (4)
• Target values: target values must be
chosen within the range of the sigmoidal
activation function.
• Otherwise, hidden neurons can be
driven into saturation which slows down
learning
Some heuristics (4)
• For the antisymmetric activation function
  (with limiting values \pm a) it is necessary to
  choose an offset \epsilon for the target values
• For +a:  d_j = a - \epsilon
• For -a:  d_j = -a + \epsilon
• If a = 1.7159 we can set \epsilon = 0.7159; then d_j = \pm 1
Some heuristics (5)
• Inputs normalisation:
  – Each input variable should be preprocessed
    so that its mean value is zero or small
    compared to its standard deviation.
– Input variables should be uncorrelated.
– Decorrelated input variables should be
scaled so their covariances are
approximately equal.
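One way to realize the three steps above is mean removal followed by PCA-based decorrelation and rescaling (whitening); this is a sketch of that choice, not the only possible preprocessing.

```python
import numpy as np

def preprocess_inputs(X):
    """Mean removal, decorrelation, covariance equalization (rows = examples)."""
    Xc = X - X.mean(axis=0)                    # 1) make each variable zero mean
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Xd = Xc @ eigvec                           # 2) decorrelate (project on covariance eigenvectors)
    return Xd / np.sqrt(eigval + 1e-12)        # 3) scale to approximately equal covariances

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.1]])
print(np.cov(preprocess_inputs(X), rowvar=False).round(2))   # close to the identity matrix
```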
Some heuristics (5)
Some heuristics (6)
• Initialisation of weights:
– If synaptic weights are assigned large
initial values neurons are driven into
    saturation. Local gradients then become small
    and learning slows down.
  – If synaptic weights are assigned small
    initial values, the algorithm operates
    around the origin. For the hyperbolic
    tangent activation function the origin is a saddle point.
Some heuristics (6)
• Weights should be initialised so that the
  standard deviation of the induced local
  field v lies in the transition between the
  linear and saturated parts of the activation:

  \sigma_v = 1,  obtained with  \sigma_w = m^{-1/2}

  (m: number of synaptic connections of the neuron)
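A sketch of this heuristic: draw zero-mean initial weights with standard deviation sigma_w = m^{-1/2}.

```python
import numpy as np

def init_weights(n_neurons, m, seed=None):
    # Zero-mean weights with standard deviation m**-0.5,
    # where m is the number of synaptic connections (inputs) per neuron
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=m ** -0.5, size=(n_neurons, m))

W = init_weights(n_neurons=10, m=25, seed=0)
print(W.std())   # roughly 1 / sqrt(25) = 0.2
```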
Some heuristics (7)
• Learning rate:
  – The right value of \eta depends on the application.
    Values between 0.1 and 0.9 have been used in
    many applications.
  – Other heuristics adapt \eta during the training as
    described in previous slides.
Some heuristics (8)
• How many layers and neurons
  – The number of layers and of neurons depends
on the specific task. In practice this issue is
solved by trial and error.
– Two types of adaptive algorithms can be used:
• start from a large network and successively
remove some neurons and links until network
performance degrades.
• begin with a small network and introduce new
neurons until performance is satisfactory.
Some heuristics (9)
Output representation and decision rule
• M-class classification problem:

  Y_{k,j}(x_j) = F_k(x_j),   k = 1, ..., M

[Figure: MLP mapping input x_j to outputs Y_{1,j}, Y_{2,j}, ..., Y_{M,j}]
Data representation
  d_{k,j} = \begin{cases} 1, & x_j \in C_k \\ 0, & x_j \notin C_k \end{cases}

(the desired output vector has a 1 in the k-th element and 0 elsewhere)
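A sketch of this one-of-M target encoding; the class index k is taken as 0-based here.

```python
import numpy as np

def one_hot(k, M):
    # Desired output for an example of class C_k: 1 in the k-th element, 0 elsewhere
    d = np.zeros(M)
    d[k] = 1.0
    return d

print(one_hot(2, 5))   # -> [0. 0. 1. 0. 0.]
```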
MLP and the a posteriori class probability
• A multilayer perceptron classifier
  (using the logistic function)
  approximates the a posteriori class
  probabilities, provided that the size
  of the training set is large enough.
The Bayes rule
• An appropriate output decision rule is
the (approximate) Bayes rule generated
by the a posteriori probability
estimates:
  x \in C_k   if   F_k(x) > F_j(x)   for all   j \neq k

[Figure: vector of network outputs F_1(x), F_2(x), ..., F_M(x)]
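A sketch of this decision rule: assign x to the class whose output F_k(x) is largest.

```python
import numpy as np

def classify(F_x):
    # x belongs to C_k if F_k(x) > F_j(x) for all j != k, i.e. take the argmax of the outputs
    return int(np.argmax(F_x))

print(classify([0.1, 0.7, 0.2]))   # -> 1 (second class)
```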