04 Multi-layer Feedforward Networks

Neural Networks and
Fuzzy Systems
Multi-layer Feed forward Networks
Dr. Tamer Ahmed Farrag
Course No.: 803522-3

Course Outline
Part I : Neural Networks (11 weeks)
• Introduction to Machine Learning
• Fundamental Concepts of Artificial Neural Networks
(ANN)
• Single layer Perceptron Classifier
• Multi-layer Feed forward Networks
• Single layer Feedback Networks
• Unsupervised learning
Part II : Fuzzy Systems (4 weeks)
• Fuzzy set theory
• Fuzzy Systems

Outline
• Why do we need Multi-layer Feed forward Networks
(MLFF)?
• Error Function (or Cost Function or Loss function)
• Gradient Descent
• Backpropagation

Why do we need Multi-layer Feed forward
Networks (MLFF)?
• Overcoming the failure of the single layer perceptron in
solving nonlinear problems.
• First Suggestion:
• Divide the problem space into smaller linearly separable
regions
• Use a perceptron for each linearly separable region
• Combine the outputs of the multiple hidden neurons in
a final decision neuron.
[Figure: the problem space divided into Region 1 and Region 2]
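The first suggestion can be sketched with the XOR problem, which a single perceptron cannot solve. This is a minimal illustration, not taken from the slides: the weights and biases are hand-picked (hypothetical) values, and the names `step`, `perceptron`, and `xor_network` are chosen only for this example.

```python
def step(z):
    """Threshold activation of the classic perceptron."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs followed by the step activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    # Hidden neuron 1 handles one linearly separable region: x1 OR x2.
    h1 = perceptron((x1, x2), (1.0, 1.0), -0.5)
    # Hidden neuron 2 handles the other region: NOT (x1 AND x2).
    h2 = perceptron((x1, x2), (-1.0, -1.0), 1.5)
    # The decision neuron combines both hidden outputs: h1 AND h2.
    return perceptron((h1, h2), (1.0, 1.0), -1.5)


results = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in results.items():
    print(inputs, "->", output)
```

Each hidden neuron draws one straight line, and the decision neuron keeps only the points that lie between the two lines.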

Why do we need Multi-layer Feed forward
Networks (MLFF)?
• Second suggestion
• In some cases we need a curved decision boundary, or we are trying to solve
more complicated classification and regression problems.
• So, we need to:
• Add more layers.
• Increase the number of neurons in each layer.
• Use a nonlinear activation function in
the hidden layers.
• So, we need Multi-layer Feed forward Networks (MLFF).

Notation for Multi-Layer Networks
• Dealing with multi-layer networks is easy if a sensible notation is adopted.
• We simply need another label (n) to tell us which layer in the network we
are dealing with.
• Each unit $j$ in layer $n$ receives activations $out_i^{(n-1)} w_{ij}^{(n)}$ from the previous
layer of processing units and sends activations $out_j^{(n)}$ to the next layer of
units.
[Figure: units in layer (0) connected to layer (1) by weights $w_{ij}^{(1)}$ and, in general, layer (n-1) connected to layer (n) by weights $w_{ij}^{(n)}$]

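ANN Representation
(1 input layer + 1 hidden layer + 1 output layer)
For example:
$$z_1^{(1)} = w_{11}^{(1)} x_1 + w_{21}^{(1)} x_2 + w_{31}^{(1)} x_3 + b_1^{(1)}$$
$$a_1^{(1)} = f(z_1^{(1)}) = \sigma(z_1^{(1)})$$
$$z_2^{(2)} = w_{12}^{(2)} a_1^{(1)} + w_{22}^{(2)} a_2^{(1)} + w_{32}^{(2)} a_3^{(1)} + b_2^{(2)}$$
$$y_2 = a_2^{(2)} = f(z_2^{(2)}) = \sigma(z_2^{(2)})$$
In general, for unit $j$ in layer $l$ (the sum runs over the units $i$ of the previous layer):
$$z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$
$$a_j^{(l)} = f(z_j^{(l)}) = \sigma(z_j^{(l)})$$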
[Figure: network with layer (0) inputs $x_i = a_i^{(0)}$; layer (1) hidden units with values $z_1^{(1)}, a_1^{(1)}$ to $z_3^{(1)}, a_3^{(1)}$, reached through weights $w_{11}^{(1)}$ to $w_{33}^{(1)}$; and layer (2) output units with values $z_1^{(2)}, a_1^{(2)}$ and $z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{11}^{(2)}$ to $w_{32}^{(2)}$, producing outputs $y_1$ and $y_2$]
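The layer equations above can be sketched as a forward pass in plain Python. This is a minimal sketch under stated assumptions: the input values, weights, and biases are made-up numbers, `W[i][j]` stores the weight from unit i of the previous layer to unit j (matching the $w_{ij}$ notation), and `layer_forward` is a name chosen only for this example.

```python
import math


def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(a_prev, W, b):
    """Compute z_j = sum_i W[i][j] * a_prev[i] + b[j] and a_j = sigmoid(z_j)."""
    z = [
        sum(W[i][j] * a_prev[i] for i in range(len(a_prev))) + b[j]
        for j in range(len(b))
    ]
    a = [sigmoid(zj) for zj in z]
    return z, a


# Hypothetical numbers for a 3-3-2 network (3 inputs, 3 hidden units, 2 outputs).
x = [1.0, 0.5, -1.0]                      # layer (0): a^(0) = x
W1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6],
      [0.7, 0.8, 0.9]]                    # weights into layer (1)
b1 = [0.0, 0.0, 0.0]
W2 = [[0.1, -0.1],
      [0.2, -0.2],
      [0.3, -0.3]]                        # weights into layer (2)
b2 = [0.0, 0.0]

z1, a1 = layer_forward(x, W1, b1)         # hidden layer
z2, y = layer_forward(a1, W2, b2)         # output layer: y_j = a_j^(2)
print("hidden activations:", a1)
print("outputs:", y)
```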

Gradient Descent
and Backpropagation

Error Function
● How can we evaluate the performance of a neuron?
● We can use an error function (also called a cost function or
loss function) to measure how far off we are from
the expected value.
● Choosing an appropriate error function helps the
learning algorithm reach the best values for the
weights and biases.
● We’ll use the following variables:
○ D to represent the true value (desired value)
○ y to represent the neuron’s prediction

Error Functions
(Cost Function or Loss Function)
• There are many formulas for error functions.
• In this course, we will deal with two error function
formulas.
Sum Squared Error (SSE):
For a single perceptron: $e_{pj} = (y_j - D_j)^2$
$$E_{SSE} = \sum_{j=1}^{n} (y_j - D_j)^2 \qquad (1)$$
Cross Entropy (CE):
$$E_{CE} = -\frac{1}{n} \sum_{j=1}^{n} [D_j \ln(y_j) + (1 - D_j) \ln(1 - y_j)] \qquad (2)$$
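Both error functions can be sketched in a few lines of Python. This is a minimal sketch: the sample predictions and desired values are made up, the cross entropy is written with the conventional leading minus sign so that a lower value means a better fit, and the small `eps` clipping is an added safeguard against `ln(0)`.

```python
import math


def sse(y, d):
    """Sum squared error: sum over j of (y_j - D_j)^2."""
    return sum((yj - dj) ** 2 for yj, dj in zip(y, d))


def cross_entropy(y, d, eps=1e-12):
    """Binary cross entropy averaged over the n outputs."""
    total = 0.0
    for yj, dj in zip(y, d):
        yj = min(max(yj, eps), 1.0 - eps)   # keep ln() away from 0 (assumption)
        total += dj * math.log(yj) + (1.0 - dj) * math.log(1.0 - yj)
    return -total / len(y)


desired = [1.0, 0.0, 1.0]      # D: hypothetical desired values
good = [0.9, 0.1, 0.8]         # y: predictions close to D
bad = [0.4, 0.6, 0.3]          # y: predictions far from D

print("SSE good / bad:", sse(good, desired), sse(bad, desired))
print("CE  good / bad:", cross_entropy(good, desired), cross_entropy(bad, desired))
```

Under both measures the predictions closer to the desired values give the smaller error.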

Why does error occur in an ANN?
• Each weight and bias in the network contributes to
the error.
• To reduce the error we need:
• A cost function or error function to compute the error.
(SSE or CE error function)
• An optimization algorithm to minimize the error
function. (Gradient Descent)
• A learning algorithm to modify the weights and biases to
new values that bring the error down. (Backpropagation)
• Repeat this operation until the best solution is found.

Gradient Descent (in 1 dimension)
• Assume we have an error function E and we need to
use it to update one weight w.
• The figure shows the error function in terms of w.
• Our target is to learn the value of w that produces the
minimum value of E.
How?
[Figure: curve of E plotted against w, with the minimum marked]

Gradient Descent (in 1 dimension)
• In the Gradient Descent algorithm, we use the following
equation to get a better value of w:
$w = w - \alpha \Delta w$ (called the Delta rule)
Where:
$\alpha$ : is the learning rate
$\Delta w$ : can be computed mathematically using the
derivative of E with respect to w ($\frac{dE}{dw}$)
$$w = w - \alpha \frac{dE}{dw} \qquad (3)$$
[Figure: curve of E plotted against w, with the minimum marked]
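The update rule in equation (3) can be sketched on a toy problem. This is a minimal sketch: the error function $E(w) = (w - 3)^2$, the starting point, the learning rate, and the number of steps are all hypothetical choices made so that the minimum (w = 3) is known in advance.

```python
def E(w):
    """Hypothetical error function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def dE_dw(w):
    """Derivative of E with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, alpha=0.1, steps=50):
    """Apply w = w - alpha * dE/dw repeatedly and record each value of w."""
    history = [w]
    for _ in range(steps):
        w = w - alpha * dE_dw(w)
        history.append(w)
    return w, history


w_final, history = gradient_descent(w=0.0)
print("first steps:", [round(v, 3) for v in history[:5]])
print("final w:", w_final, "E(w):", E(w_final))
```

Each step moves w against the slope of E, so the error shrinks at every iteration for this learning rate.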

Gradient Descent (multi dimension)
• In an ANN with many layers and many neurons in each layer, the
error function will be a multi-variable function.
• So, the derivative in equation (3) should be a partial derivative:
$$w_{ij} = w_{ij} - \alpha \frac{\partial E_j}{\partial w_{ij}} \qquad (4)$$
• We write equation (4) in shorthand as:
$$w_{ij} = w_{ij} - \alpha \, \partial w_{ij}$$
• The same process will be used to get the
new bias value:
$$b_j = b_j - \alpha \, \partial b_j$$

Derivative of activation functions
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, with derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
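The sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ can be checked numerically. This is a minimal sketch: `numeric_derivative` is a helper introduced only for this example, and it uses a central difference with a hypothetical step size `h`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z, h=1e-6):
    """Central-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
```

The two columns agree closely, and the derivative is largest at z = 0, where it equals 0.25.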

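Learning Rule in the output layer
using SSE as the error function and sigmoid as the activation
function
$$\frac{\partial E_j}{\partial w_{ij}} = \frac{\partial E_j}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$
Where:
$$E_j = (y_j - D_j)^2$$
(the term of the SSE sum that depends on output $j$)
$$y_j = a_j^{(l)} = f(z_j^{(l)}) = \sigma(z_j^{(l)})$$
$$z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$
From the previous table:
$$\sigma'(z_j^{(l)}) = \sigma(z_j^{(l)}) (1 - \sigma(z_j^{(l)})) = y_j (1 - y_j)$$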

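Learning Rule in the output layer (cont.)
So (How?),
$$\frac{\partial y_j}{\partial z_j} = y_j (1 - y_j)$$
$$\frac{\partial z_j}{\partial w_{ij}} = a_i^{(l-1)}$$
$$\frac{\partial E_j}{\partial y_j} = 2 (y_j - D_j)$$
• Then:
$$\frac{\partial E_j}{\partial w_{ij}} = 2 a_i^{(l-1)} (y_j - D_j) y_j (1 - y_j)$$
$$w_{ij} = w_{ij} - 2 \alpha a_i^{(l-1)} (y_j - D_j) y_j (1 - y_j)$$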
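The output-layer rule $w_{ij} = w_{ij} - 2 \alpha a_i^{(l-1)} (y_j - D_j) y_j (1 - y_j)$ can be sketched for a single sigmoid output neuron. This is a minimal sketch: the activations, weights, desired value, and learning rate are made-up numbers, and the bias update assumes $\partial z_j / \partial b_j = 1$, which follows from the definition of $z_j$ but is not written out on the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_neuron(a_prev, w, b):
    """y = sigmoid(sum_i w_i * a_i + b) for one output unit."""
    return sigmoid(sum(wi * ai for wi, ai in zip(w, a_prev)) + b)


def sse_gradients(a_prev, w, b, d):
    """Chain rule: dE/dw_i = 2 (y - D) * y (1 - y) * a_i, and dE/db = 2 (y - D) * y (1 - y)."""
    y = output_neuron(a_prev, w, b)
    delta = 2.0 * (y - d) * y * (1.0 - y)    # dE/dy * dy/dz
    grad_w = [delta * ai for ai in a_prev]   # dz/dw_i = a_i
    grad_b = delta                           # dz/db = 1 (assumption stated above)
    return grad_w, grad_b


def update(a_prev, w, b, d, alpha):
    """One gradient descent step on the weights and the bias."""
    grad_w, grad_b = sse_gradients(a_prev, w, b, d)
    new_w = [wi - alpha * gi for wi, gi in zip(w, grad_w)]
    return new_w, b - alpha * grad_b


# Hypothetical values: activations of the previous layer, weights, bias, target.
a_prev, w, b, d = [0.5, 0.1, 0.9], [0.2, -0.4, 0.3], 0.1, 1.0
error_before = (output_neuron(a_prev, w, b) - d) ** 2
w_new, b_new = update(a_prev, w, b, d, alpha=0.5)
error_after = (output_neuron(a_prev, w_new, b_new) - d) ** 2
print("error before:", error_before, "error after:", error_after)
```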

Learning Rule in the Hidden layer
• Now we have to determine the appropriate
weight change for an input to hidden weight.
• This is more complicated because it depends on
the error at all of the nodes this weighted
connection can lead to.
• The mathematical proof is outside our scope.

Gradient Descent (Notes)
Note 1:
• The neuron activation function (f) should be a defined
and differentiable function.
Note 2:
• The previous calculation will be repeated for each
weight and for each bias in the ANN.
• So, we need a lot of computational power (what about
deeper networks?)
Note 3:
• Calculating $\partial w_{ij}$ for the hidden layer will be
more difficult (Why?)

Gradient Descent (Notes)
• $\partial w_{ij}$ represents the change in the value of $w_{ij}$
needed to get a better output.
• The equation for $\partial w_{ij}$ depends on the choice
of the error (cost) function and the activation function.
• The Gradient Descent algorithm helps in calculating the
new values of the weights and biases.
• Question: is one iteration (one trial) enough to
get the best values for the weights and biases?
• Answer: No, we need an extended version:
Backpropagation

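How Does Backpropagation Work?
[Figure: a network with layer 0, layer 1, and layer 2 producing output $y$. Forward propagation runs from layer 0 to layer 2 through weights $w_{11}^{(1)}$ to $w_{32}^{(1)}$ and $w_{11}^{(2)}$, $w_{21}^{(2)}$; back propagation runs in the opposite direction and updates each weight, for example:]
$$w_{11}^{(2)} = w_{11}^{(2)} - \alpha \, \partial w_{11}^{(2)}$$
$$w_{11}^{(1)} = w_{11}^{(1)} - \alpha \, \partial w_{11}^{(1)}$$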
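The forward and backward passes can be sketched as a small training loop. This is a minimal sketch under stated assumptions, not the lecture's own implementation: it uses a 2-3-1 sigmoid network on the XOR patterns, SSE as the error function, random initial weights from a fixed seed, and a hypothetical learning rate and epoch count. The hidden-layer step applies the standard chain-rule extension (the output error is passed back through the output weights), which the slides do not derive.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_network(n_in, n_hidden, rng):
    """Random initial weights and biases; W1[i][j] goes from input i to hidden unit j."""
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_in)],
        "b1": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": rng.uniform(-1, 1),
    }


def forward(x, net):
    """Forward propagation: layer 0 -> layer 1 -> layer 2."""
    hidden = [
        sigmoid(sum(net["W1"][i][j] * x[i] for i in range(len(x))) + net["b1"][j])
        for j in range(len(net["b1"]))
    ]
    y = sigmoid(sum(net["W2"][j] * hidden[j] for j in range(len(hidden))) + net["b2"])
    return hidden, y


def backward(x, d, net, alpha):
    """Back propagation: compute the error terms, then update every weight and bias."""
    hidden, y = forward(x, net)
    delta_out = 2.0 * (y - d) * y * (1.0 - y)           # output layer: dE/dz
    delta_hidden = [                                     # hidden layer: error passed back
        delta_out * net["W2"][j] * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    for j in range(len(hidden)):
        net["W2"][j] -= alpha * delta_out * hidden[j]
        net["b1"][j] -= alpha * delta_hidden[j]
        for i in range(len(x)):
            net["W1"][i][j] -= alpha * delta_hidden[j] * x[i]
    net["b2"] -= alpha * delta_out


def total_sse(patterns, net):
    return sum((forward(x, net)[1] - d) ** 2 for x, d in patterns)


patterns = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
net = make_network(2, 3, random.Random(0))
loss_before = total_sse(patterns, net)
for epoch in range(5000):
    for x, d in patterns:
        backward(x, d, net, alpha=0.5)
loss_after = total_sse(patterns, net)
print("SSE before training:", loss_before, "after:", loss_after)
```

One pass over a pattern is not enough; the loop repeats the forward and backward passes many times so the error keeps decreasing.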

Online Learning vs. Offline Learning
• Online: Pattern-by-pattern learning
• Error calculated for each pattern
• Weights updated after each individual pattern
$$\Delta w_{ij} = -\alpha \frac{\partial E_p}{\partial w_{ij}}$$
• Offline: Batch learning
• Error calculated for all patterns
• Weights updated once at the end of each epoch
$$\Delta w_{ij} = -\alpha \sum_p \frac{\partial E_p}{\partial w_{ij}}$$
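The difference between the two update schedules can be sketched with a single weight. This is a minimal sketch: it assumes a linear unit $y = w x$ with per-pattern error $E_p = (w x - D)^2$, and the patterns, learning rate, and epoch count are made-up values.

```python
def grad_p(w, x, d):
    """dE_p/dw for E_p = (w*x - d)^2 with a linear unit y = w*x."""
    return 2.0 * (w * x - d) * x


def online_epoch(w, patterns, alpha):
    """Online learning: update w after each individual pattern."""
    for x, d in patterns:
        w = w - alpha * grad_p(w, x, d)
    return w


def batch_epoch(w, patterns, alpha):
    """Offline (batch) learning: sum the gradients, update once per epoch."""
    total = sum(grad_p(w, x, d) for x, d in patterns)
    return w - alpha * total


patterns = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # hypothetical data with D = 2x
w_online = w_batch = 0.0
for epoch in range(100):
    w_online = online_epoch(w_online, patterns, alpha=0.02)
    w_batch = batch_epoch(w_batch, patterns, alpha=0.02)
print("online:", w_online, "batch:", w_batch)
```

The two schedules take different paths within each epoch, but on this toy data both approach the same weight, w = 2.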

Choosing Appropriate Activation and Cost
Functions
• We already know, from our consideration of single layer networks, what
output activation and cost functions should be used for
particular problem types.
• We have also seen that non-linear hidden unit activations are
needed, such as sigmoids.
• So we can summarize the required network properties:
• Regression/ Function Approximation Problems
• SSE cost function, linear output activations, sigmoid hidden activations
• Classification Problems (2 classes, 1 output)
• CE cost function, sigmoid output and hidden activations
• Classification Problems (multiple-classes, 1 output per class)
• CE cost function, softmax outputs, sigmoid hidden activations
• In each case, application of the gradient descent learning
algorithm (by computing the partial derivatives) leads to
appropriate back-propagation weight update equations.
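For the multiple-class case, the softmax output and its cross entropy cost can be sketched as follows. This is a minimal sketch: the slides name softmax but do not give its formula, so the standard definition $y_k = e^{z_k} / \sum_m e^{z_m}$ is assumed here, together with the multi-class form of cross entropy $-\sum_k D_k \ln(y_k)$. The input values and the one-hot target are made up.

```python
import math


def softmax(z):
    """Turn output-layer sums z_k into probabilities that add up to 1."""
    m = max(z)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def categorical_cross_entropy(y, d, eps=1e-12):
    """Multi-class cross entropy: -sum_k D_k * ln(y_k), with D one-hot."""
    return -sum(dk * math.log(max(yk, eps)) for yk, dk in zip(y, d))


probs = softmax([2.0, 1.0, 0.1])              # hypothetical output-layer sums
loss = categorical_cross_entropy(probs, [1.0, 0.0, 0.0])
print("class probabilities:", probs)
print("cross entropy:", loss)
```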

Overall picture : learning process on ANN

Neural network simulator
• Search the internet to find a simulator and
report on it.
For example:
• https://www.mladdict.com/neural-network-simulator
• http://playground.tensorflow.org/
