Multilayer Perceptron (Haykin)

A multilayer perceptron uses a backpropagation algorithm to minimize error during training. It consists of an input layer, hidden layers, and an output layer fully connected by weights. The algorithm calculates error signals that propagate backward from the output to adjust weights and reduce average squared error over multiple epochs until a stopping criterion is reached. Generalization to new data depends on factors like training set size and network architecture.

Multilayer Perceptrons

Neural Networks, Simon Haykin, Prentice-Hall, 3rd edition
Multilayer Perceptrons
Architecture

[Figure: a fully connected feedforward network with an input layer, one or more hidden layers, and an output layer]

2
A solution for the XOR problem

x1   x2   x1 xor x2
-1   -1      -1
-1    1       1
 1   -1       1
 1    1      -1

[Figure: a two-layer network of sign units solving XOR; each unit computes]

    phi(v) = 1 if v > 0, -1 if v <= 0

where phi is the sign function.

3
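The XOR construction above can be sketched in code. The weights below are one hypothetical choice (the slide's exact values are not fully recoverable), but they realize the truth table with the sign activation:

```python
def sign(v):
    """The sign activation from the slide: +1 if v > 0, else -1."""
    return 1 if v > 0 else -1

def xor_net(x1, x2):
    # Hidden layer: two sign units (weights are illustrative, not from the slide).
    h1 = sign(x1 + x2 - 1.0)    # fires (+1) only when both inputs are +1
    h2 = sign(x1 + x2 + 1.0)    # fires (+1) unless both inputs are -1
    # Output unit: +1 exactly when the two inputs differ.
    return sign(h2 - h1 - 1.0)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

A single perceptron cannot separate these four points; the hidden layer makes the classes linearly separable.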
NEURON MODEL

• Sigmoidal function:

    phi(v_j) = 1 / (1 + e^(-a * v_j))

  (increasing a steepens the curve)

• Induced local field of neuron j:

    v_j = sum_{i=0..m} w_ji * y_i

• Most common form of activation function
• As a -> infinity, phi approaches the threshold function
• Differentiable

4
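A minimal sketch of the neuron model above (function names are illustrative):

```python
import math

def induced_field(w, y):
    # v_j = sum_{i=0..m} w_ji * y_i; index 0 is the bias weight with fixed input y_0 = 1
    return sum(wi * yi for wi, yi in zip(w, y))

def sigmoid(v, a=1.0):
    # phi(v) = 1 / (1 + e^(-a*v)); increasing a steepens the curve
    return 1.0 / (1.0 + math.exp(-a * v))

# As a grows, the sigmoid approaches the threshold (sign-like) function
# while remaining differentiable, which is what back-propagation needs.
```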
LEARNING ALGORITHM

• Back-propagation algorithm
  – Forward step: function signals propagate forward through the network.
  – Backward step: error signals propagate backward through the network.
• It adjusts the weights of the NN in order to minimize the average squared error.

5
Average Squared Error

• Error signal of output neuron j at presentation of the n-th training example:

    e_j(n) = d_j(n) - y_j(n)

• Total error energy at time n (C: set of neurons in the output layer):

    E(n) = (1/2) * sum_{j in C} e_j(n)^2

• Average squared error (N: size of the training set):

    E_AV = (1/N) * sum_{n=1..N} E(n)

• Goal: adjust the weights of the NN to minimize E_AV

6
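The two error measures above can be computed directly (a small sketch; names are illustrative):

```python
def instantaneous_error(d, y):
    # E(n) = 1/2 * sum_{j in C} (d_j - y_j)^2 over the output layer C
    return 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, y))

def average_squared_error(targets, outputs):
    # E_AV = (1/N) * sum_{n=1..N} E(n) over the N training examples
    return sum(instantaneous_error(d, y)
               for d, y in zip(targets, outputs)) / len(targets)
```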
Notation

e_j    error at the output of neuron j
y_j    output of neuron j
v_j = sum_{i=0..m} w_ji * y_i    induced local field of neuron j

7
Weight Update Rule

The update rule is based on the gradient descent method: take a step in the direction of steepest decrease of E.

    Delta w_ji = -eta * dE/dw_ji    (a step in the direction opposite to the gradient)

where w_ji is the weight associated to the link from neuron i to neuron j, and eta is the learning rate.

8
Definition of the Local Gradient of neuron j

    delta_j = -dE/dv_j    (local gradient)

We obtain

    delta_j = e_j * phi'(v_j)

because

    -dE/dv_j = -(dE/de_j) * (de_j/dy_j) * (dy_j/dv_j) = -e_j * (-1) * phi'(v_j)

10
Update Rule

• We obtain

    Delta w_ji = eta * delta_j * y_i

because

    -dE/dw_ji = -(dE/dv_j) * (dv_j/dw_ji) = delta_j * y_i

(since dv_j/dw_ji = y_i).

11
Compute the local gradient of neuron j

• The key factor is the calculation of e_j
• There are two cases:
  – Case 1: j is an output neuron
  – Case 2: j is a hidden neuron

12
Error e_j of an output neuron

• Case 1: j is an output neuron

    e_j = d_j - y_j

Then

    delta_j = (d_j - y_j) * phi'(v_j)

13
Local gradient of a hidden neuron

• Case 2: j is a hidden neuron
• The local gradient of neuron j is recursively determined in terms of the local gradients of all neurons to which neuron j is directly connected.

14
Use the Chain Rule

    delta_j = -(dE/dy_j) * (dy_j/dv_j) = -(dE/dy_j) * phi'(v_j)

With

    E(n) = (1/2) * sum_{k in C} e_k(n)^2

we have

    dE/dy_j = sum_{k in C} e_k * de_k/dy_j = sum_{k in C} e_k * (de_k/dv_k) * (dv_k/dy_j)

From

    de_k/dv_k = -phi'(v_k)    and    dv_k/dy_j = w_kj

we obtain

    dE/dy_j = -sum_{k in C} delta_k * w_kj

16
Local Gradient of hidden neuron j

Hence

    delta_j = phi'(v_j) * sum_{k in C} delta_k * w_kj

[Figure: signal-flow graph of the back-propagation error signals reaching neuron j; each neuron k of the next layer feeds back e_k * phi'(v_k) through the weight w_kj]

17
Delta Rule

• Delta rule: Delta w_ji = eta * delta_j * y_i

    delta_j = phi'(v_j) * (d_j - y_j)                     if j is an output node
    delta_j = phi'(v_j) * sum_{k in C} delta_k * w_kj     if j is a hidden node

C: set of neurons in the layer following the one containing j

18
Local Gradient of neurons

For the sigmoidal activation function,

    phi'(v_j) = a * y_j * [1 - y_j],    a > 0

so

    delta_j = a * y_j * [1 - y_j] * (d_j - y_j)               if j is an output node
    delta_j = a * y_j * [1 - y_j] * sum_k delta_k * w_kj      if j is a hidden node

19
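The two delta-rule cases above combine into a single training step. A minimal sketch for a 2-2-1 sigmoidal MLP with a = 1; the layer sizes, learning rate, and helper names are illustrative choices, not values from the slides:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, d, W1, W2, eta=0.5):
    # Forward pass; each neuron's bias is weight index 0 with fixed input 1.
    y0 = [1.0] + list(x)
    y1 = [1.0] + [sigmoid(sum(w * yi for w, yi in zip(row, y0))) for row in W1]
    y2 = [sigmoid(sum(w * yi for w, yi in zip(row, y1))) for row in W2]

    # Case 1 (output nodes): delta_j = y_j * (1 - y_j) * (d_j - y_j).
    d2 = [y * (1 - y) * (dj - y) for y, dj in zip(y2, d)]
    # Case 2 (hidden nodes): delta_j = y_j * (1 - y_j) * sum_k delta_k * w_kj.
    d1 = [y1[j] * (1 - y1[j]) * sum(d2[k] * W2[k][j] for k in range(len(W2)))
          for j in range(1, len(y1))]

    # Delta rule: w_ji <- w_ji + eta * delta_j * y_i.
    for k, row in enumerate(W2):
        for i in range(len(row)):
            row[i] += eta * d2[k] * y1[i]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += eta * d1[j] * y0[i]
    return y2
```

Repeated calls on the same example drive the output toward the target, which is the behaviour the average-squared-error criterion measures.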
Backpropagation algorithm

• Two phases of computation:


– Forward pass: run the NN and compute the error for
each neuron of the output layer.
– Backward pass: start at the output layer, and pass
the errors backwards through the network, layer by
layer, by recursively computing the local gradient of
each neuron.

20
Summary

21
Training

• Sequential mode (on-line, pattern, or stochastic mode):
  – (x(1), d(1)) is presented, a sequence of forward and backward computations is performed, and the weights are updated using the delta rule.
  – The same is done for (x(2), d(2)), ..., (x(N), d(N)).

22
Training

• The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied.
• From one epoch to the next, choose a randomized ordering for selecting the examples in the training set.

23
Stopping criteria

• Sensible stopping criteria:
  – Average squared error change: back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
  – Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop.

24
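One possible reading of the error-change criterion above (relative change per epoch; the tolerance and names are illustrative):

```python
def has_converged(E_prev, E_curr, tol=0.01):
    # Converged when the absolute (relative) change in the average squared
    # error from one epoch to the next is sufficiently small.
    return abs(E_curr - E_prev) / max(E_prev, 1e-12) <= tol
```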
Early stopping

[Figure: training and validation error curves; training is stopped when the validation error starts to increase]

25
Generalization
• Generalization: NN generalizes well if the I/O
mapping computed by the network is nearly
correct for new data (test set).
• Factors that influence generalization:
– the size of the training set.
– the architecture of the NN.
– the complexity of the problem at hand.
• Overfitting (overtraining): when the NN learns
too many I/O examples it may end up
memorizing the training data.
26
Generalization

27
Expressive capabilities of NN

Boolean functions:
• Every boolean function can be represented by a network with a single hidden layer
• but it might require an exponential number of hidden units

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
• Any function can be approximated with arbitrary accuracy by a network with two hidden layers

28
Generalized Delta Rule

• If eta is small: slow rate of learning.
  If eta is large: large changes of the weights; the NN can become unstable (oscillatory).
• Method to overcome this drawback: include a momentum term in the delta rule (generalized delta rule):

    Delta w_ji(n) = alpha * Delta w_ji(n-1) + eta * delta_j(n) * y_i(n)

where alpha is the momentum constant.

29
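A sketch of the momentum update for a single weight (names and default constants are illustrative):

```python
def momentum_update(w, dw_prev, delta, y, eta=0.1, alpha=0.9):
    # Generalized delta rule: Delta w(n) = alpha * Delta w(n-1) + eta * delta * y.
    # The alpha * dw_prev term carries over a fraction of the previous change,
    # accelerating steady descent and damping oscillations.
    dw = alpha * dw_prev + eta * delta * y
    return w + dw, dw
```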
Generalized delta rule

• The momentum accelerates the descent in steady downhill directions.
• The momentum has a stabilizing effect in directions that oscillate in time.

PMR5406 Redes Neurais e Lógica Fuzzy    30
Eta adaptation

Heuristics for accelerating the convergence of the back-prop algorithm through eta adaptation:

• Heuristic 1: every weight should have its own eta.
• Heuristic 2: every eta should be allowed to vary from one iteration to the next.

31
NN DESIGN
• Data representation
• Network Topology
• Network Parameters
• Training
• Validation

32
Setting the parameters

• How are the weights initialised?
• How is the learning rate chosen?
• How many hidden layers and how many neurons?
• Which activation function?
• How to preprocess the data?
• How many examples in the training data set?

33
Some heuristics (1)

• Sequential vs. batch algorithms: the sequential mode (pattern by pattern) is computationally faster than the batch mode (epoch by epoch).

34
Some heuristics (2)
• Maximization of information content:
every training example presented to the
backpropagation algorithm must
maximize the information content.
– The use of an example that results in the
largest training error.
– The use of an example that is radically
different from all those previously used.

35
Some heuristics (3)

• Activation function: the network learns faster with antisymmetric functions than with nonsymmetric functions.

    phi(v) = 1 / (1 + e^(-a*v))    the sigmoidal function is nonsymmetric

    phi(v) = a * tanh(b*v)         the hyperbolic tangent function is antisymmetric

36
Some heuristics (3)

37
Some heuristics (4)

• Target values: target values must be chosen within the range of the sigmoidal activation function.
• Otherwise, hidden neurons can be driven into saturation, which slows down learning.

38
Some heuristics (4)

• For the antisymmetric activation function it is necessary to choose an offset epsilon for the targets:
  – for +a:  d_j = a - epsilon
  – for -a:  d_j = -a + epsilon
• If a = 1.7159 we can set epsilon = 0.7159, so that d = +/-1.

39
Some heuristics (5)

• Input normalisation:
  – Each input variable should be preprocessed so that its mean value is zero, or at least very small when compared to its standard deviation.
  – Input variables should be uncorrelated.
  – Decorrelated input variables should be scaled so that their covariances are approximately equal.

40
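The first normalisation step above can be sketched as follows (a minimal version that only handles mean removal and scaling; decorrelation, e.g. via PCA, would be a separate step):

```python
def normalize_inputs(columns):
    # Shift each input variable (one list per variable) to zero mean and
    # scale it to unit standard deviation.
    out = []
    for col in columns:
        mu = sum(col) / len(col)
        sd = (sum((x - mu) ** 2 for x in col) / len(col)) ** 0.5 or 1.0
        out.append([(x - mu) / sd for x in col])
    return out
```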
Some heuristics (5)

41
Some heuristics (6)

• Initialisation of weights:
  – If the synaptic weights are assigned large initial values, neurons are driven into saturation; local gradients become small, so learning slows down.
  – If the synaptic weights are assigned small initial values, the algorithm operates around the origin. For the hyperbolic tangent activation function the origin is a saddle point.

42
Some heuristics (6)

• Weights should be initialised so that the standard deviation of the induced local field v lies in the transition between the linear and saturated parts of the activation function:

    sigma_v = 1, obtained with sigma_w = m^(-1/2)

where m is the number of synaptic connections (weights) of a neuron.

43
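This initialisation rule can be sketched directly (a minimal version, assuming zero-mean, unit-variance, uncorrelated inputs; the function name is illustrative):

```python
import random

def init_weights(m):
    # Zero-mean Gaussian weights with standard deviation sigma_w = m^(-1/2),
    # where m is the neuron's fan-in, so that the induced local field v has
    # roughly unit standard deviation.
    return [random.gauss(0.0, m ** -0.5) for _ in range(m)]
```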
Some heuristics (7)

• Learning rate:
  – The right value of eta depends on the application. Values between 0.1 and 0.9 have been used in many applications.
  – Other heuristics adapt eta during training, as described in the previous slides.

44
Some heuristics (8)
• How many layers and neurons
– The number of layers and of neurons depend
on the specific task. In practice this issue is
solved by trial and error.
– Two types of adaptive algorithms can be used:
• start from a large network and successively
remove some neurons and links until network
performance degrades.
• begin with a small network and introduce new
neurons until performance is satisfactory.

45
Some heuristics (9)

• How much training data?
  – Rule of thumb: the number of training examples should be at least five to ten times the number of weights of the network.

46
Output representation and decision rule

• M-class classification problem:

    y_k,j(x_j) = F_k(x_j),    k = 1, ..., M

[Figure: an MLP producing outputs y_1,j, y_2,j, ..., y_M,j for input x_j]

47
Data representation

    d_k,j = 1 if x_j belongs to C_k, 0 otherwise

i.e. the desired output vector is (0, ..., 0, 1, 0, ..., 0)^T with a 1 in the k-th element.

48
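A minimal sketch of this one-of-M (one-hot) target encoding:

```python
def one_hot(k, M):
    # Desired output for an example of class k among M classes:
    # a 1 in the k-th element (1-indexed, as on the slide), 0 elsewhere.
    return [1 if i == k else 0 for i in range(1, M + 1)]
```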
MLP and the a posteriori class probability

• A multilayer perceptron classifier (using the logistic function) approximates the a posteriori class probabilities, provided that the size of the training set is large enough.

49
The Bayes rule

• An appropriate output decision rule is the (approximate) Bayes rule generated by the a posteriori probability estimates:

    x belongs to C_k  if  F_k(x) > F_j(x) for all j != k

where

    F(x) = (F_1(x), F_2(x), ..., F_M(x))^T

50
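The decision rule above is an argmax over the network outputs; a minimal sketch (the function name is illustrative):

```python
def classify(F):
    # Approximate Bayes rule: assign x to the class k whose output F_k(x)
    # is largest (classes are 1-indexed as on the slide).
    return 1 + max(range(len(F)), key=F.__getitem__)
```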
