Chapter 4 - Multilayer Perceptron
Multilayer Perceptrons
[Figure: a multilayer perceptron with weight matrices $\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{W}^{(3)}$ connecting the input units, the hidden units, and the output units. The labels indicate the input and output of each unit, the weight matrix and bias vector of the $l$-th layer, the $j$-th unit, the vectors of input and output, and the activation function.]
$w_{ij}^{(l)}$ is the weight of the connection between the $i$-th unit of the $(l-1)$-th layer and the $j$-th unit of the $l$-th layer.
$b_i^{(l)}$ is the bias of the $i$-th unit of the $l$-th layer.
$\mathbf{W}$ is the matrix of weights; $\mathbf{b}$ is the vector of biases.
This MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units.
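As a quick illustration, the parameter shapes of this example network can be written down directly (a minimal NumPy sketch; the convention $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ used in the forward-propagation example later in this chapter is assumed):

```python
import numpy as np

# Example MLP: 4 inputs, one hidden layer with 5 units, 3 outputs.
# Assumed convention: z^(l) = W^(l) a^(l-1) + b^(l), so W^(l) has shape
# (units in layer l, units in layer l-1).
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), np.zeros((5, 1))   # input  -> hidden
W2, b2 = rng.standard_normal((3, 5)), np.zeros((3, 1))   # hidden -> output

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)   # 43 trainable parameters in total
```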
Activation function
The output of the $i$-th unit of the $l$-th layer is calculated by:
$$a_i^{(l)} = f\big(\mathbf{w}_i^{(l)}\,\mathbf{a}^{(l-1)} + b_i^{(l)}\big)$$
where $f(\cdot)$ is the nonlinear activation function.
In vector form, the output of all units of the $l$-th layer is calculated by:
$$\mathbf{a}^{(l)} = f\big(\mathbf{W}^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\big)$$
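A minimal NumPy sketch of this per-layer computation (the sigmoid activation and the toy numbers are only assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(W, b, a_prev, f=sigmoid):
    """Compute a^(l) = f(W^(l) a^(l-1) + b^(l)) for one layer."""
    z = W @ a_prev + b        # pre-activation z^(l)
    return f(z)               # output a^(l)

# Toy layer: 3 inputs, 2 units
a_prev = np.array([[0.5], [-1.0], [2.0]])
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
b = np.array([[0.0], [0.1]])
print(layer_forward(W, b, a_prev))
```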
Sign function
Note that the sign function should not be used in an MLP: its derivative is zero almost everywhere, so no gradient can flow through it during training.
Sigmoid function
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoids saturate and kill gradients: a very undesirable property of the sigmoid neuron is that when its activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero.
Tanh function
$$\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} = 2\,\sigma(2x) - 1$$
Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered.
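A short sketch of both activations and their derivatives, which makes the saturation problem easy to see numerically (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) (1 - sigma(x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))        # saturates towards 0 and 1 at the tails
print(sigmoid_grad(x))   # nearly zero at both tails: gradients are "killed"
print(np.tanh(x))        # saturates towards -1 and 1, but is zero-centered
print(tanh_grad(x))      # also nearly zero at the tails
```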
Chain rule
We will say $w$ is a dependent variable, $u$ and $v$ are independent variables, and $x$ and $y$ are intermediate variables.
Since $w$ is a function of $x$ and $y$, it has partial derivatives $\frac{\partial w}{\partial x}$ and $\frac{\partial w}{\partial y}$, and the chain rule gives:
$$\frac{\partial w}{\partial u} = \frac{\partial w}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial w}{\partial y}\frac{\partial y}{\partial u}$$
$$\frac{\partial w}{\partial v} = \frac{\partial w}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial w}{\partial y}\frac{\partial y}{\partial v}$$
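For instance (an illustrative example, not from the lecture), take $w = xy$ with intermediate variables $x = u + v$ and $y = u - v$, so that $w = u^2 - v^2$:
$$\frac{\partial w}{\partial u} = \frac{\partial w}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial w}{\partial y}\frac{\partial y}{\partial u} = y \cdot 1 + x \cdot 1 = 2u, \qquad \frac{\partial w}{\partial v} = y \cdot 1 + x \cdot (-1) = -2v$$
which matches differentiating $w = u^2 - v^2$ directly.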
Loss function
$$J(\mathbf{W}, \mathbf{b}, \mathbf{x}, \mathbf{y}) = \frac{1}{N}\sum_{n=1}^{N}\left\lVert \mathbf{y}_n - \hat{\mathbf{y}}_n \right\rVert^2$$
It is difficult to take the gradient of the loss function with respect to the weight matrices directly.
Backpropagation
Backpropagation computes the gradients $\frac{\partial J}{\partial \mathbf{W}^{(1)}}, \frac{\partial J}{\partial \mathbf{W}^{(2)}}, \ldots, \frac{\partial J}{\partial \mathbf{W}^{(L)}}$ so that gradient descent can be applied.
Consider first the output layer $L$. By the chain rule:
$$\frac{\partial L}{\partial w_{ij}^{(L)}} = \frac{\partial L}{\partial z_j^{(L)}} \frac{\partial z_j^{(L)}}{\partial w_{ij}^{(L)}}$$
Recall $z_j^{(L)} = \sum_i w_{ij}^{(L)} a_i^{(L-1)} + b_j^{(L)}$; then it can be deduced that:
$$\frac{\partial z_j^{(L)}}{\partial w_{ij}^{(L)}} = a_i^{(L-1)}$$
Similarly, for the bias, since $\partial z_j^{(L)} / \partial b_j^{(L)} = 1$:
$$\frac{\partial L}{\partial b_j^{(L)}} = \frac{\partial L}{\partial z_j^{(L)}} \frac{\partial z_j^{(L)}}{\partial b_j^{(L)}} = \frac{\partial L}{\partial z_j^{(L)}} = e_j^{(L)}$$
where $e_j^{(L)}$ denotes $\partial L / \partial z_j^{(L)}$, as defined formally below.
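These output-layer gradient formulas can be checked numerically with finite differences; the tiny one-unit output layer, sigmoid activation, and squared-error loss below are assumptions made only for this sketch:

```python
import numpy as np

# One output unit with two inputs a^(L-1); loss L = (a^(L) - y)^2.
a_prev = np.array([0.3, -0.7])
w, b, y = np.array([0.5, -0.2]), 0.1, 1.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, b):
    return (sigmoid(w @ a_prev + b) - y) ** 2

# Analytic gradients from the formulas above: dL/dw_i = e * a_i^(L-1), dL/db = e.
a = sigmoid(w @ a_prev + b)
e = 2.0 * (a - y) * a * (1.0 - a)     # e = dL/dz = dL/da * da/dz
grad_w, grad_b = e * a_prev, e

# Finite-difference check
eps = 1e-6
num_w = np.array([(loss(w + eps * np.eye(2)[i], b) - loss(w - eps * np.eye(2)[i], b)) / (2 * eps)
                  for i in range(2)])
num_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
print(grad_w, num_w)   # the two should agree to several decimal places
print(grad_b, num_b)
```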
For the $l$-th layer, the gradient at the $j$-th unit is calculated as below:
$$\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial z_j^{(l)}} \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$
Recall that $z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$; then it can be deduced that $\frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}} = a_i^{(l-1)}$.
Define $e_j^{(l)} \triangleq \frac{\partial L}{\partial z_j^{(l)}}$; the gradient of $L$ with respect to the weight is then:
$$\frac{\partial L}{\partial w_{ij}^{(l)}} = e_j^{(l)}\, a_i^{(l-1)}$$
Now consider the term $\frac{\partial L}{\partial z_j^{(l)}}$; we can use the chain rule:
$$\frac{\partial L}{\partial z_j^{(l)}} = \frac{\partial L}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}}$$
where $\frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} = f'\big(z_j^{(l)}\big)$, with $f$ the activation function.
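To make the connection to the vector form used in the summary below explicit, the remaining factor $\frac{\partial L}{\partial a_j^{(l)}}$ can be expanded over the units of layer $l+1$ that $a_j^{(l)}$ feeds into (a short step not spelled out on the slides, using the same indexing convention):
$$\frac{\partial L}{\partial a_j^{(l)}} = \sum_k \frac{\partial L}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial a_j^{(l)}} = \sum_k e_k^{(l+1)} w_{jk}^{(l+1)}, \qquad \text{so} \qquad e_j^{(l)} = \Big(\sum_k w_{jk}^{(l+1)} e_k^{(l+1)}\Big) f'\big(z_j^{(l)}\big)$$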
Then we can rewrite as follows:
$$\frac{\partial L}{\partial b_j^{(l)}} = e_j^{(l)}$$
Summary (unit by unit)
1. Run forward propagation for each input and store the activation values $\mathbf{a}^{(l)}$ of every layer.
2. At each unit of the output layer, calculate $e_j^{(L)} = \frac{\partial L}{\partial z_j^{(L)}}$.
3. The gradients of the output layer are determined as $\frac{\partial L}{\partial \mathbf{W}^{(L)}} = \mathbf{a}^{(L-1)}\,\mathbf{e}^{(L)}$ and $\frac{\partial L}{\partial \mathbf{b}^{(L)}} = \mathbf{e}^{(L)}$.
4. At the $l$-th layer, calculate $\mathbf{e}^{(l)} = \big(\mathbf{W}^{(l+1)}\,\mathbf{e}^{(l+1)}\big) * f'\big(\mathbf{z}^{(l)}\big)$, where $*$ is the element-wise product.
5. The gradients of the $l$-th layer are determined as $\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \mathbf{a}^{(l-1)}\,\mathbf{e}^{(l)}$ and $\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \mathbf{e}^{(l)}$.
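A compact NumPy sketch of this procedure for a single training sample (a sketch only: sigmoid activations, a squared-error loss, and column-vector shapes are assumptions; with the convention $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$, the matrix forms of the formulas above pick up transposes):

```python
import numpy as np

def f(z):                     # activation (sigmoid as an example)
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def forward(x, Ws, bs):
    """Step 1: forward propagation, storing every z^(l) and a^(l)."""
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = f(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backward(y, Ws, zs, activations):
    """Steps 2-5: propagate the error e^(l) backwards and collect gradients."""
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    # Output layer: e^(L) = dL/dz^(L) for the squared-error loss L = ||a^(L) - y||^2
    e = 2.0 * (activations[-1] - y) * f_prime(zs[-1])
    for l in reversed(range(len(Ws))):
        grads_W[l] = e @ activations[l].T            # dL/dW^(l) = e^(l) (a^(l-1))^T
        grads_b[l] = e                               # dL/db^(l) = e^(l)
        if l > 0:
            e = (Ws[l].T @ e) * f_prime(zs[l - 1])   # e^(l) from e^(l+1)
    return grads_W, grads_b
```

Here `x` and `y` are column vectors and `Ws`, `bs` are lists of per-layer weight matrices and bias vectors; calling `forward` and then `backward` for each sample and averaging the gradients gives exactly the quantities needed for a gradient-descent step.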
Example: a two-layer MLP with a ReLU hidden layer and a softmax output layer.
Forward propagation
$$\mathbf{Z}^{(1)} = \mathbf{W}^{(1)} \mathbf{X} + \mathbf{b}^{(1)}$$
$$\mathbf{A}^{(1)} = \max\big(\mathbf{Z}^{(1)}, 0\big)$$
$$\mathbf{Z}^{(2)} = \mathbf{W}^{(2)} \mathbf{A}^{(1)} + \mathbf{b}^{(2)}$$
$$\hat{\mathbf{Y}} = \mathbf{A}^{(2)} = \mathrm{softmax}\big(\mathbf{Z}^{(2)}\big)$$
Loss function
$$J(\mathbf{W}, \mathbf{b}; \mathbf{X}, \mathbf{Y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ji}\,\log\big(\hat{y}_{ji}\big)$$
Backpropagation
$$\mathbf{e}^{(1)} = \big(\mathbf{W}^{(2)}\,\mathbf{e}^{(2)}\big) * f'\big(\mathbf{Z}^{(1)}\big), \qquad \frac{\partial J}{\partial \mathbf{W}^{(1)}} = \mathbf{a}^{(0)}\,\mathbf{e}^{(1)} = \mathbf{X}\,\mathbf{e}^{(1)}$$
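A minimal NumPy sketch of this two-layer example, with the columns of $\mathbf{X}$ and $\mathbf{Y}$ as individual samples (an assumption; the transposes in the code come from the $\mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b}$ convention, and the fact that $\mathbf{e}^{(2)} = (\hat{\mathbf{Y}} - \mathbf{Y})/N$ for softmax combined with cross-entropy is a standard result the slides do not derive):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)      # subtract max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=0, keepdims=True)

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1              # Z^(1) = W^(1) X + b^(1)
    A1 = relu(Z1)                 # A^(1) = max(Z^(1), 0)
    Z2 = W2 @ A1 + b2             # Z^(2) = W^(2) A^(1) + b^(2)
    Yhat = softmax(Z2)            # Y_hat = A^(2) = softmax(Z^(2))
    return Z1, A1, Yhat

def cross_entropy(Y, Yhat):
    N = Y.shape[1]                # columns are samples, rows are classes
    return -np.sum(Y * np.log(Yhat + 1e-12)) / N

def backward(X, Y, Z1, A1, Yhat, W2):
    N = Y.shape[1]
    E2 = (Yhat - Y) / N                        # e^(2) for softmax + cross-entropy
    dW2 = E2 @ A1.T                            # dJ/dW^(2)
    db2 = E2.sum(axis=1, keepdims=True)        # dJ/db^(2)
    E1 = (W2.T @ E2) * (Z1 > 0)                # e^(1) = (W^(2)^T e^(2)) * f'(Z^(1))
    dW1 = E1 @ X.T                             # dJ/dW^(1)
    db1 = E1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```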