Unit versus Multilayer Perceptron
[Figure: two scatter plots of examples from the classes A and B over the features x1 and x2.]
The XOR function defines two sets in R^2 that are not linearly separable:

        x1   x2   XOR   c
  x1     0    0    0    −
  x2     1    0    1    +
  x3     0    1    1    +
  x4     1    1    0    −

[Figure: the four points x1, ..., x4 plotted at the corners of the unit square over x1 and x2, labeled + (x2, x3) and − (x1, x4); the two classes lie diagonally opposite each other.]
Multilayer Perceptron Basics
(1) Overcoming the Linear Separability Restriction

A minimum multilayer perceptron y(x) that can handle the XOR problem:

[Figure: network with the extended input x0 = 1, x1, x2; two hidden threshold units Σ with weights w^h_{10}, w^h_{11}, w^h_{12} and w^h_{20}, w^h_{21}, w^h_{22}; their outputs y^h_1, y^h_2 together with the constant unit y^h_0 = 1 feed one output threshold unit Σ with weights w^o_0, w^o_1, w^o_2, which returns a class in {−, +}.]

$$y(\mathbf{x}) = \operatorname{heaviside}\!\left(W^o \begin{pmatrix} 1 \\ \operatorname{Heaviside}(W^h \mathbf{x}) \end{pmatrix}\right), \qquad
W^h = \begin{pmatrix} -0.5 & -1 & 1 \\ 0.5 & -1 & 1 \end{pmatrix}, \quad
W^o = \begin{pmatrix} 0.5 & 1 & -1 \end{pmatrix}$$

The hidden layer maps the four inputs to (y^h_1, y^h_2): x1 ↦ (0, 1), x2 ↦ (0, 0), x3 ↦ (1, 1), x4 ↦ (0, 1).

[Figure: in the transformed space spanned by y^h_1 and y^h_2 the negative examples x1, x4 (at (0, 1)) and the positive examples x2 (at (0, 0)) and x3 (at (1, 1)) are linearly separable.]
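The stated weights can be checked directly. The following sketch (assuming NumPy; the names W_h, W_o, and heaviside are illustrative, not from the slides) evaluates the model function for the four XOR inputs:

import numpy as np

# Weights from the slide: rows of W_h are the two hidden units, W_o is the output unit.
W_h = np.array([[-0.5, -1.0, 1.0],
                [ 0.5, -1.0, 1.0]])
W_o = np.array([[0.5, 1.0, -1.0]])

def heaviside(z):
    return (z >= 0).astype(float)    # scalar heaviside, applied element-wise (capital-H Heaviside)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    x = np.array([1.0, x1, x2])                          # input extended by x0 = 1
    y_h = heaviside(W_h @ x)                             # hidden layer outputs (y^h_1, y^h_2)
    y = heaviside(W_o @ np.concatenate(([1.0], y_h)))    # hidden vector extended by y^h_0 = 1
    print((x1, x2), y_h, int(y[0]))                      # prints 0, 1, 1, 0: the XOR values

The printed hidden-layer outputs reproduce the mapping x1 ↦ (0, 1), x2 ↦ (0, 0), x3 ↦ (1, 1), x4 ↦ (0, 1) shown in the figure.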
Remarks:

q The first, second, and third layer of the shown multilayer perceptron are called input, hidden, and output layer respectively. Here, in the example, the input layer comprises p+1 = 3 units, the hidden layer contains l+1 = 3 units, and the output layer consists of k = 1 unit.

q Each input unit is connected via a weighted edge to all hidden units (except to the topmost hidden unit, which has a constant input y^h_0 = 1), resulting in six weights, organized as the 2×3 matrix W^h. Each hidden unit is connected via a weighted edge to the output unit, resulting in three weights, organized as the 1×3 matrix W^o.

q The input units perform no computation but only distribute the values x0, x1, x2 to the next layer. The hidden units (again except the topmost unit) and the output unit apply the heaviside function to the sum of their weighted inputs and propagate the result.

The nine weights w = (w^h_{10}, ..., w^h_{22}, w^o_0, w^o_1, w^o_2), organized as W^h and W^o, specify the multilayer perceptron (model function) y(x) completely:

$$y(\mathbf{x}) = \operatorname{heaviside}\!\left(W^o \begin{pmatrix} 1 \\ \operatorname{Heaviside}(W^h \mathbf{x}) \end{pmatrix}\right)$$

q The function Heaviside (with capital H) denotes the extension of the scalar heaviside function to vectors. For z ∈ R^d the function Heaviside(z) is defined as (heaviside(z_1), ..., heaviside(z_d))^T.
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\,\sigma(z)}{dz} = \sigma(z) \cdot (1 - \sigma(z))$$

[Figure: a single unit with inputs x1, ..., xp, weights w1, ..., wp, threshold θ, weighted sum Σ, and output y.]

Computation of the perceptron output y(x) with the sigmoid function σ():

$$y(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}$$

[Figure: the sigmoid function plotted over $\sum_{j=0}^{p} w_j x_j$, ranging from 0 to 1.]

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1}$$

[Figure: the tanh function plotted over $\sum_{j=0}^{p} w_j x_j$, ranging from −1 to 1.]
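As a quick numerical sanity check (a sketch, assuming NumPy; not part of the slides), the derivative identity dσ/dz = σ(z)·(1−σ(z)) can be compared against a finite-difference estimate, and tanh can be related to the sigmoid:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))                      # sigma(z) * (1 - sigma(z))
print(np.max(np.abs(numeric - analytic)))                     # close to zero: the identity holds

# tanh is a rescaled sigmoid: tanh(z) = 2*sigmoid(2z) - 1
print(np.max(np.abs(np.tanh(z) - (2 * sigmoid(2 * z) - 1))))  # ~0 up to float precision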
[Figure: a single unit with extended input x0 = 1, x1, ..., xp and weights w0, ..., wp, shown with three different activation functions:]

q Linear activation  →  Linear regression
q Heaviside activation  →  Perceptron algorithm
q Sigmoid activation  →  Logistic regression

[Figure: networks of such units with one hidden layer and outputs y1, ..., yk:]

q Network with linear units: no decision power beyond a single hyperplane
q Network with heaviside units: nonlinear decision boundaries but no gradient information
q Network with sigmoid units: nonlinear decision boundaries and gradient information
Remarks (limitation of linear units):

q A multilayer perceptron with linear activation functions can be expressed as a single linear function and hence is equivalent in power to a single perceptron only.

q Consider the following exemplary composition of three linear functions as a multilayer perceptron with p input units, two hidden units, and one output unit: y(x) = W^o [W^h x].

The weight matrices are as follows:

$$W^h = \begin{pmatrix} w^h_{11} & \dots & w^h_{1p} \\ w^h_{21} & \dots & w^h_{2p} \end{pmatrix}, \qquad
W^o = \begin{pmatrix} w^o_1 & w^o_2 \end{pmatrix}$$

$$\begin{aligned}
y(\mathbf{x}) = W^o [W^h \mathbf{x}]
&= w^o_1 w^h_{11}\, x_1 + \dots + w^o_1 w^h_{1p}\, x_p + w^o_2 w^h_{21}\, x_1 + \dots + w^o_2 w^h_{2p}\, x_p \\
&= \left(w^o_1 w^h_{11} + w^o_2 w^h_{21}\right) x_1 + \dots + \left(w^o_1 w^h_{1p} + w^o_2 w^h_{2p}\right) x_p \\
&= w_1 x_1 + \dots + w_p x_p = \mathbf{w}^T \mathbf{x}
\end{aligned}$$
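The collapse can also be observed numerically (a sketch, assuming NumPy; sizes and names are illustrative): composing the two linear layers yields exactly the same function as the single matrix product W^o W^h.

import numpy as np

rng = np.random.default_rng(0)
p = 4
W_h = rng.normal(size=(2, p))        # two hidden units, linear activation, no bias
W_o = rng.normal(size=(1, 2))        # one output unit, linear activation

x = rng.normal(size=p)
y_two_layers = W_o @ (W_h @ x)       # y(x) = W^o [W^h x]
w = W_o @ W_h                        # collapsed single weight vector w^T = W^o W^h
print(np.allclose(y_two_layers, w @ x))   # True: same function as a single linear unit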
Setting:

q X is a multiset of feature vectors from an inner product space X, X ⊆ R^p.

Learning task:

q Fit D using a multilayer perceptron y() with a sigmoid activation function.
[Figure: the example set D plotted over x1 and x2, and, below "Separated classes:", the classes separated by the learned multilayer perceptron.]
Multilayer Perceptron Basics
Unrestricted Classification Problems: Illustration

[Figure: the separated classes over x1 and x2, along with a surface plot of the loss L2(w).]
Chapter ML:IV
IV. Neural Networks
q Perceptron Learning
q Multilayer Perceptron Basics
q Multilayer Perceptron with Two Layers
q Multilayer Perceptron at Arbitrary Depth
q Advanced MLPs
q Automatic Gradient Computation
[Figure: a single perceptron with extended input x0 = 1, x1, ..., xp, a summation unit Σ, and output y.]
Multilayer perceptron y(x) with a hidden layer and k-dimensional output layer:

[Figure: network with extended input x0 = 1, x1, ..., xp, hidden units whose weights w^h_{10}, ..., w^h_{lp} form W^h, the constant hidden unit y^h_0 = 1, and output units y1, ..., yk whose weights w^o_{10}, ..., w^o_{kl} form W^o. Here x lies in the extended input space, y^h in the extended feature space, and y in the output space.]

Parameters w:  W^h ∈ R^{l×(p+1)},  W^o ∈ R^{k×(l+1)}
q The shown architecture with k output units allows for the distinction of k classes, either within an exclusive class assignment setting or within a multi-label setting. In the former setting a so-called "softmax layer" can be added subsequent to the output layer to directly return the class label 1, ..., k.

q The non-linear characteristic of the sigmoid function allows for networks that approximate every (computable) function. For this capability only three "active" layers are required, i.e., two layers with hidden units and one layer with output units. Keyword: universal approximator [Kolmogorov theorem, 1957]

q Multilayer perceptrons are also called multilayer networks or (artificial) neural networks, ANN for short.
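One common way to realize such a softmax layer is sketched below (assuming NumPy; the variable names and the example scores are illustrative, not from the slides): the k output values are turned into class probabilities, and the class label is the index of the largest one.

import numpy as np

def softmax(scores):
    z = scores - np.max(scores)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()                   # probabilities over the k classes

scores = np.array([0.3, 2.1, -0.7])      # hypothetical outputs of k = 3 output units
probs = softmax(scores)
label = int(np.argmax(probs)) + 1        # class label in 1, ..., k
print(probs, label)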
W^h ∈ R^{l×(p+1)}, x ∈ R^{p+1}:

$$\sigma\!\left(\begin{pmatrix} w^h_{10} & \dots & w^h_{1p} \\ \vdots & & \vdots \\ w^h_{l0} & \dots & w^h_{lp} \end{pmatrix}
\begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_p \end{pmatrix}\right)
= \begin{pmatrix} y^h_1 \\ \vdots \\ y^h_l \end{pmatrix}$$

W^o ∈ R^{k×(l+1)}, y^h ∈ R^{l+1}, y ∈ R^k:

$$\sigma\!\left(\begin{pmatrix} w^o_{10} & \dots & w^o_{1l} \\ \vdots & & \vdots \\ w^o_{k0} & \dots & w^o_{kl} \end{pmatrix}
\begin{pmatrix} 1 \\ y^h_1 \\ \vdots \\ y^h_l \end{pmatrix}\right)
= \begin{pmatrix} y_1 \\ \vdots \\ y_k \end{pmatrix}$$
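In code, the two matrix-vector products amount to the following forward pass for a single extended input x (a sketch, assuming NumPy; sizes and names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, l, k = 3, 4, 2                                    # illustrative sizes
rng = np.random.default_rng(1)
W_h = rng.normal(scale=0.5, size=(l, p + 1))         # W^h in R^{l x (p+1)}
W_o = rng.normal(scale=0.5, size=(k, l + 1))         # W^o in R^{k x (l+1)}

x = np.concatenate(([1.0], rng.normal(size=p)))      # input extended by x0 = 1
y_h = sigmoid(W_h @ x)                               # hidden layer output, length l
y = sigmoid(W_o @ np.concatenate(([1.0], y_h)))      # output layer, length k; y^h extended by 1
print(y_h.shape, y.shape)                            # (4,) (2,)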
W^h ∈ R^{l×(p+1)}, X ⊂ R^{p+1} (n examples as columns):

$$\sigma\!\left(\begin{pmatrix} w^h_{10} & \dots & w^h_{1p} \\ \vdots & & \vdots \\ w^h_{l0} & \dots & w^h_{lp} \end{pmatrix}
\begin{pmatrix} 1 & \dots & 1 \\ x_{11} & \dots & x_{1n} \\ \vdots & & \vdots \\ x_{p1} & \dots & x_{pn} \end{pmatrix}\right)
= \begin{pmatrix} y^h_{11} & \dots & y^h_{1n} \\ \vdots & & \vdots \\ y^h_{l1} & \dots & y^h_{ln} \end{pmatrix}$$

W^o ∈ R^{k×(l+1)}:

$$\sigma\!\left(\begin{pmatrix} w^o_{10} & \dots & w^o_{1l} \\ \vdots & & \vdots \\ w^o_{k0} & \dots & w^o_{kl} \end{pmatrix}
\begin{pmatrix} 1 & \dots & 1 \\ y^h_{11} & \dots & y^h_{1n} \\ \vdots & & \vdots \\ y^h_{l1} & \dots & y^h_{ln} \end{pmatrix}\right)
= \begin{pmatrix} y_{11} & \dots & y_{1n} \\ \vdots & & \vdots \\ y_{k1} & \dots & y_{kn} \end{pmatrix}$$
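The batch formulation processes all n examples at once by stacking them as columns (a sketch, assuming NumPy; sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, l, k, n = 3, 4, 2, 5
rng = np.random.default_rng(2)
W_h = rng.normal(scale=0.5, size=(l, p + 1))
W_o = rng.normal(scale=0.5, size=(k, l + 1))

X = np.vstack([np.ones((1, n)), rng.normal(size=(p, n))])   # (p+1) x n, first row all ones
Y_h = sigmoid(W_h @ X)                                      # l x n hidden outputs
Y = sigmoid(W_o @ np.vstack([np.ones((1, n)), Y_h]))        # k x n network outputs
print(Y_h.shape, Y.shape)                                   # (4, 5) (2, 5)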
L2(w) usually contains various local minima:

$$\mathbf{y}(\mathbf{x}) = \sigma\!\left(W^o \begin{pmatrix} 1 \\ \sigma(W^h \mathbf{x}) \end{pmatrix}\right), \qquad
L_2(\mathbf{w}) = \frac{1}{2} \cdot \sum_{(\mathbf{x},\,\mathbf{c}) \in D} \sum_{u=1}^{k} \left(c_u - y_u(\mathbf{x})\right)^2$$

[Figure: surface plot of L2(w) showing several local minima.]
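The loss over a data set D follows directly from the batch forward pass (a sketch, assuming NumPy; X holds the extended inputs as columns, C the target vectors c as columns):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss(W_h, W_o, X, C):
    """X: (p+1) x n extended inputs, C: k x n targets."""
    Y_h = sigmoid(W_h @ X)
    Y = sigmoid(W_o @ np.vstack([np.ones((1, X.shape[1])), Y_h]))
    return 0.5 * np.sum((C - Y) ** 2)    # 1/2 * sum over D and over u of (c_u - y_u(x))^2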
q Basically, the computation of the gradient ∇L2(w) is independent of the organization of the weights in the matrices W^h and W^o of a network (model function) y(x). Adopt the following view instead:

To calculate ∇L2(w) one has to calculate each of its components ∂L2(w)/∂w, w ∈ w, since each weight (parameter) has a certain impact on the global loss L2(w) of the network. This impact, as well as the computation of this impact, differs between weights, but it is canonical for all weights of the same layer: observe that each weight w influences "only" its direct and indirect successor nodes, and that the structure of the influenced successor graph is identical for all weights of the same layer.

Hence it is convenient, but not necessary, to process the components of the gradient layer-wise (matrix-wise), as ∇^o L2(w) and ∇^h L2(w) respectively. Moreover, due to the network structure of the model function y(x), only two cases need to be distinguished when deriving the partial derivative ∂L2(w)/∂w of an arbitrary weight w ∈ w: (a) w belongs to the output layer, or (b) w belongs to some hidden layer.

q The derivation of the gradient for the two-layer MLP (and hence the weight update processed in the IGD algorithm) is given in the following, as a special case of the derivation of the gradient for MLPs at arbitrary depth.
W^o = W^o + ∆W^o,
using the ∇^o-gradient of the loss function L2(w) to take the steepest descent:

$$\Delta W^o = -\eta \cdot \nabla^{o} L_2(\mathbf{w})
= -\eta \cdot \begin{pmatrix}
\dfrac{\partial L_2(\mathbf{w})}{\partial w^o_{10}} & \dots & \dfrac{\partial L_2(\mathbf{w})}{\partial w^o_{1l}} \\
\vdots & & \vdots \\
\dfrac{\partial L_2(\mathbf{w})}{\partial w^o_{k0}} & \dots & \dfrac{\partial L_2(\mathbf{w})}{\partial w^o_{kl}}
\end{pmatrix}$$

... [derivation]

$$= \eta \cdot \sum_{D} \underbrace{(\mathbf{c} - \mathbf{y}(\mathbf{x})) \odot \mathbf{y}(\mathbf{x}) \odot (1 - \mathbf{y}(\mathbf{x}))}_{\delta^o} \otimes\; \mathbf{y}^h(\mathbf{x})$$
Multilayer Perceptron with Two Layers
(2) Backpropagation (continued) [mlp arbitrary depth]
W^h = W^h + ∆W^h,
using the ∇^h-gradient of the loss function L2(w) to take the steepest descent:

$$\Delta W^h = -\eta \cdot \nabla^{h} L_2(\mathbf{w})
= -\eta \cdot \begin{pmatrix}
\dfrac{\partial L_2(\mathbf{w})}{\partial w^h_{10}} & \dots & \dfrac{\partial L_2(\mathbf{w})}{\partial w^h_{1p}} \\
\vdots & & \vdots \\
\dfrac{\partial L_2(\mathbf{w})}{\partial w^h_{l0}} & \dots & \dfrac{\partial L_2(\mathbf{w})}{\partial w^h_{lp}}
\end{pmatrix}$$

... [derivation]

$$= \eta \cdot \sum_{D} \underbrace{\left[\, (W^o)^T \delta^o \odot \mathbf{y}^h(\mathbf{x}) \odot (1 - \mathbf{y}^h(\mathbf{x})) \,\right]_{1,\dots,l}}_{\delta^h} \otimes\; \mathbf{x}$$
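For a single example (x, c) the two update formulas translate into a few lines of NumPy (a sketch; ⊙ becomes element-wise '*', ⊗ becomes np.outer, and the projection [·]_{1,...,l} drops the bias component; sizes and names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, l, k, eta = 3, 4, 2, 0.1
rng = np.random.default_rng(3)
W_h = rng.normal(scale=0.5, size=(l, p + 1))
W_o = rng.normal(scale=0.5, size=(k, l + 1))

x = np.concatenate(([1.0], rng.normal(size=p)))      # extended input
c = np.array([1.0, 0.0])                             # target vector

y_h = np.concatenate(([1.0], sigmoid(W_h @ x)))      # extended hidden output y^h
y = sigmoid(W_o @ y_h)                               # network output

delta_o = (c - y) * y * (1 - y)                      # delta^o
delta_h = ((W_o.T @ delta_o) * y_h * (1 - y_h))[1:]  # delta^h, projected to components 1..l
dW_o = eta * np.outer(delta_o, y_h)                  # Delta W^o = eta * (delta^o outer y^h)
dW_h = eta * np.outer(delta_h, x)                    # Delta W^h = eta * (delta^h outer x)
print(dW_o.shape, dW_h.shape)                        # (2, 5) (4, 4)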
Multilayer Perceptron with Two Layers
The IGD Algorithm  [mlp arbitrary depth]

 1.  initialize_random_weights(W^h, W^o), t = 0
 2.  REPEAT
 3.    t = t + 1
 4.    FOREACH (x, c) ∈ D DO
 5.      y^h(x) = (1, σ(W^h x))^T              // forward propagation; x is extended by x0 = 1.
         y(x) = σ(W^o y^h(x))                  //   Model function evaluation.
 6.      δ = c − y(x)                          // Calculation of residual vector.
 7a.     δ^o = δ ⊙ y(x) ⊙ (1 − y(x))           // backpropagation (Steps 7a+7b):
         δ^h = [((W^o)^T δ^o) ⊙ y^h ⊙ (1 − y^h)]_{1,...,l}   //   Calculation of derivative of the loss.
 7b.     ∆W^h = η · (δ^h ⊗ x)
         ∆W^o = η · (δ^o ⊗ y^h(x))
 8.      W^h = W^h + ∆W^h,  W^o = W^o + ∆W^o   // Parameter vector update = one gradient step down.
 9.    ENDDO
10.  UNTIL(convergence(D, y(), t))
11.  return(W^h, W^o)

[Python code]
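Independently of the linked code, a minimal NumPy sketch along the lines of the algorithm could look as follows. The XOR data from the beginning of the section serve as D; the hidden-layer size l = 2, the learning rate η = 0.5, the seed, and the fixed iteration budget in place of a convergence test are illustrative choices, not part of the slide:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: columns of X are the inputs, C holds the targets (k = 1).
X = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
C = np.array([[0, 1, 1, 0]], dtype=float)

p, l, k, eta = 2, 2, 1, 0.5
rng = np.random.default_rng(0)
W_h = rng.normal(scale=1.0, size=(l, p + 1))         # step 1: random initialization
W_o = rng.normal(scale=1.0, size=(k, l + 1))

for t in range(1, 20001):                            # steps 2-3, 10: fixed iteration budget
    for i in rng.permutation(X.shape[1]):            # step 4: FOREACH (x, c) in D
        x = np.concatenate(([1.0], X[:, i]))         # x extended by x0 = 1
        c = C[:, i]
        y_h = np.concatenate(([1.0], sigmoid(W_h @ x)))   # step 5: forward propagation
        y = sigmoid(W_o @ y_h)
        delta = c - y                                # step 6: residual
        delta_o = delta * y * (1 - y)                # step 7a: backpropagation
        delta_h = ((W_o.T @ delta_o) * y_h * (1 - y_h))[1:]
        W_h += eta * np.outer(delta_h, x)            # steps 7b + 8: gradient step
        W_o += eta * np.outer(delta_o, y_h)

Y_h = sigmoid(W_h @ np.vstack([np.ones((1, 4)), X]))
print(np.round(sigmoid(W_o @ np.vstack([np.ones((1, 4)), Y_h])), 2))
# With a suitable initialization the outputs approach [[0, 1, 1, 0]] for the four XOR inputs.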
Remarks:

q The symbol »⊙« denotes the Hadamard product, also known as the element-wise or the Schur product. It is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands, where each element is the product of the respective elements of the two original matrices. [Wikipedia]

q The symbol »⊗« denotes the dyadic product, also called outer product or tensor product. The dyadic product takes two vectors and returns a second order tensor, called a dyadic in this context: v ⊗ w ≡ v w^T. [Wikipedia]

q [W]_{1,...,l} denotes the projection operator, which returns the rows 1 through l of matrix W as a new matrix.

q The update ∆W of a weight matrix can be computed per batch, D, or per instance, (x, c) ∈ D, respectively.
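In NumPy the three operators map to '*', np.outer, and row slicing (a small sketch, not part of the slides):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

hadamard = a * b                  # Hadamard (element-wise) product -> [4. 10. 18.]
dyadic = np.outer(a, b)           # dyadic/outer product, the 3x3 matrix a b^T
W = np.arange(12.0).reshape(4, 3)
projection = W[1:, :]             # [W]_{1,...,l}: drops row 0 (the bias row), keeps rows 1..l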
[Figure: multilayer perceptron of depth d: the extended input x feeds a first hidden layer, followed by further hidden layers with outputs y^{h_1}, ..., y^{h_{d−1}}, and an output layer y^{h_d} ≡ y (∈ output space); x lies in the extended input space.]

Parameters w:  W^{h_1} ∈ R^{l_1×(p+1)}, ...,  W^{h_d} ≡ W^o ∈ R^{k×(l_{d−1}+1)}
$$\nabla L_2(\mathbf{w}) = \left( \frac{\partial L_2(\mathbf{w})}{\partial w^{h_1}_{10}},\; \dots,\; \frac{\partial L_2(\mathbf{w})}{\partial w^{h_1}_{l_1 p}},\; \dots,\; \frac{\partial L_2(\mathbf{w})}{\partial w^{h_d}_{10}},\; \dots,\; \frac{\partial L_2(\mathbf{w})}{\partial w^{h_d}_{k\,l_{d-1}}} \right)^{T}, \quad \text{where } l_s = \text{no.\_rows}(W^{h_s})$$

W^{h_s} = W^{h_s} + ∆W^{h_s},
using the ∇^{h_s}-gradient of the loss function L2(w) to take the steepest descent:

∆W^{h_s} = −η · ∇^{h_s} L2(w)
Multilayer Perceptron at Arbitrary Depth
(2) Backpropagation (continued) [mlp two layers]
... [derivation]

$$\Delta W^{h_s} =
\begin{cases}
\;\eta \cdot \displaystyle\sum_{D}\; \underbrace{(\mathbf{c} - \mathbf{y}(\mathbf{x})) \odot \mathbf{y}(\mathbf{x}) \odot (1 - \mathbf{y}(\mathbf{x}))}_{\delta^{h_d} \,\equiv\, \delta^{o}} \;\otimes\; \mathbf{y}^{h_{d-1}}(\mathbf{x}) & \text{if } s = d \\[3ex]
\;\eta \cdot \displaystyle\sum_{D}\; \underbrace{\left[\, (W^{h_{s+1}})^{T} \delta^{h_{s+1}} \odot \mathbf{y}^{h_s}(\mathbf{x}) \odot (1 - \mathbf{y}^{h_s}(\mathbf{x})) \,\right]_{1,\dots,l_s}}_{\delta^{h_s}} \;\otimes\; \mathbf{y}^{h_{s-1}}(\mathbf{x}) & \text{if } 1 < s < d \\[3ex]
\;\eta \cdot \displaystyle\sum_{D}\; \underbrace{\left[\, (W^{h_2})^{T} \delta^{h_2} \odot \mathbf{y}^{h_1}(\mathbf{x}) \odot (1 - \mathbf{y}^{h_1}(\mathbf{x})) \,\right]_{1,\dots,l_1}}_{\delta^{h_1}} \;\otimes\; \mathbf{x} & \text{if } s = 1
\end{cases}$$

where l_s = no._rows(W^{h_s})
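The three cases collapse into one loop over the layers if the weight matrices are kept in a list, as in the following per-example sketch (assuming NumPy; layer sizes, data, and names are illustrative, and the factor η is omitted):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_updates(Ws, x, c):
    """Ws = [W^{h1}, ..., W^{hd}]; returns the per-layer updates Delta W^{hs} (eta omitted)."""
    ys = [np.concatenate(([1.0], x))]                          # y^{h0} := extended input x
    for W in Ws[:-1]:
        ys.append(np.concatenate(([1.0], sigmoid(W @ ys[-1]))))  # extended hidden outputs
    y = sigmoid(Ws[-1] @ ys[-1])                               # output layer y^{hd} = y

    delta = (c - y) * y * (1 - y)                              # delta^{hd} = delta^o (case s = d)
    updates = [np.outer(delta, ys[-1])]
    for s in range(len(Ws) - 1, 0, -1):                        # cases 1 < s < d and s = 1
        delta = ((Ws[s].T @ delta) * ys[s] * (1 - ys[s]))[1:]  # delta^{hs}, bias component dropped
        updates.append(np.outer(delta, ys[s - 1]))
    return list(reversed(updates))                             # [Delta W^{h1}, ..., Delta W^{hd}]

rng = np.random.default_rng(4)
sizes = [3, 4, 4, 2]                                           # p = 3 inputs, two hidden layers, k = 2
Ws = [rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i] + 1)) for i in range(len(sizes) - 1)]
updates = backprop_updates(Ws, rng.normal(size=3), np.array([1.0, 0.0]))
print([u.shape for u in updates])                              # [(4, 4), (4, 5), (2, 5)]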
$$\frac{\partial L_2(\mathbf{w})}{\partial w^{h_s}_{ij}}
= -\sum_{D} \sum_{u=1}^{k} (c_u - y_u(\mathbf{x})) \cdot \frac{\partial}{\partial w^{h_s}_{ij}}\, y_u(\mathbf{x})$$

$$\overset{(1,2)}{=} -\sum_{D} \sum_{u=1}^{k} \underbrace{(c_u - y_u(\mathbf{x})) \cdot y_u(\mathbf{x}) \cdot (1 - y_u(\mathbf{x}))}_{\delta^{h_d}_u \,\equiv\, \delta^{o}_u} \cdot \frac{\partial}{\partial w^{h_s}_{ij}}\, W^{h_d}_{u*}\, \mathbf{y}^{h_{d-1}}(\mathbf{x})$$

$$\overset{(3)}{=} -\sum_{D} \sum_{u=1}^{k} \delta^{h_d}_u \cdot \frac{\partial}{\partial w^{h_s}_{ij}} \sum_{v=0}^{l_{d-1}} w^{h_d}_{uv} \cdot y^{h_{d-1}}_v(\mathbf{x})$$

$$= -\sum_{D} \delta^{h_d}_i \cdot y^{h_{d-1}}_j(\mathbf{x}) \qquad \text{// for } s = d \text{ only the term with } u = i \text{ and } v = j \text{ contributes}$$

q Partial derivative for a weight in W^{h_{d−1}} (next to output layer), i.e., s = d−1:

$$\frac{\partial}{\partial w^{h_{d-1}}_{ij}} L_2(\mathbf{w})
= -\sum_{D} \sum_{v=1}^{l_{d-1}} \delta^{h_{d-1}}_v \sum_{w=0}^{l_{d-2}} \frac{\partial}{\partial w^{h_{d-1}}_{ij}}\, w^{h_{d-1}}_{vw} \cdot y^{h_{d-2}}_w(\mathbf{x})
\qquad \text{// Only for the term where } v = i \text{ and } w = j \text{ the partial derivative is nonzero.}$$

$$= -\sum_{D} \delta^{h_{d-1}}_i \cdot y^{h_{d-2}}_j(\mathbf{x})$$

Expanding δ^{h_{d−1}}_i, and analogously δ^{h_s}_i for any hidden layer s, gives:

$$= -\sum_{D} \underbrace{(W^{h_{s+1}}_{*i})^{T} \delta^{h_{s+1}} \cdot y^{h_s}_i(\mathbf{x}) \cdot (1 - y^{h_s}_i(\mathbf{x}))}_{\delta^{h_s}_i} \cdot\; y^{h_{s-1}}_j(\mathbf{x})$$

q Plugging the result for ∂L2(w)/∂w^{hs}_{ij} into −η · ∇^{h_s} L2(w) yields the update formula for ∆W^{h_s}. In detail:

(2) Chain rule with dσ(z)/dz = σ(z)·(1 − σ(z)), where σ(z) := y_u(x) and z = W^{h_d}_{u*} y^{h_{d−1}}(x):

$$\frac{\partial}{\partial w^{h_s}_{ij}}\, y_u(\mathbf{x}) \equiv \frac{\partial}{\partial w^{h_s}_{ij}}\, \sigma\!\left(W^{h_d}_{u*}\, \mathbf{y}^{h_{d-1}}(\mathbf{x})\right) = \frac{\partial}{\partial w^{h_s}_{ij}}\, \sigma(z) = y_u(\mathbf{x}) \cdot (1 - y_u(\mathbf{x})) \cdot \frac{\partial}{\partial w^{h_s}_{ij}}\, W^{h_d}_{u*}\, \mathbf{y}^{h_{d-1}}(\mathbf{x})$$

Note that in the partial derivative expression the symbol x is a constant, while w^{hs}_{ij} is the variable whose effect on the change of the loss L2 (at input x) is computed.

(3) $W^{h_d}_{u*}\, \mathbf{y}^{h_{d-1}}(\mathbf{x}) = w^{h_d}_{u0} \cdot y^{h_{d-1}}_0(\mathbf{x}) + \dots + w^{h_d}_{uj} \cdot y^{h_{d-1}}_j(\mathbf{x}) + \dots + w^{h_d}_{u\,l_{d-1}} \cdot y^{h_{d-1}}_{l_{d-1}}(\mathbf{x})$, where l_{d−1} = no._rows(W^{h_{d−1}}).

(4) Rearrange sums to reflect the nested dependencies that develop naturally from the backpropagation. We now can define δ^{h_{d−1}}_v in layer d−1 as a function of δ^{h_d} (layer d).

(5) $\sum_{u=1}^{k} \delta^{h_d}_u \cdot w^{h_d}_{uv} = (W^{h_d}_{*v})^{T} \delta^{h_d}$ (scalar product).
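The derivation can be sanity-checked numerically: the backpropagated partial derivative ∂L2(w)/∂w^h_{ij} should match a finite-difference estimate of the loss change. A sketch for the two-layer case, assuming NumPy, with illustrative sizes and a single example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W_h, W_o, x, c):
    y_h = np.concatenate(([1.0], sigmoid(W_h @ x)))
    y = sigmoid(W_o @ y_h)
    return 0.5 * np.sum((c - y) ** 2)

rng = np.random.default_rng(5)
W_h = rng.normal(scale=0.5, size=(3, 4))      # l = 3, p = 3
W_o = rng.normal(scale=0.5, size=(2, 4))      # k = 2
x = np.concatenate(([1.0], rng.normal(size=3)))
c = np.array([1.0, 0.0])

# Backpropagated partial derivatives: dL2/dW^h = -(delta^h outer x)
y_h = np.concatenate(([1.0], sigmoid(W_h @ x)))
y = sigmoid(W_o @ y_h)
delta_o = (c - y) * y * (1 - y)
delta_h = ((W_o.T @ delta_o) * y_h * (1 - y_h))[1:]
grad_backprop = -np.outer(delta_h, x)

# Finite-difference estimate for one entry w^h_{ij}
i, j, eps = 1, 2, 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
grad_numeric = (loss(W_plus, W_o, x, c) - loss(W_minus, W_o, x, c)) / (2 * eps)
print(grad_backprop[i, j], grad_numeric)      # the two values agree up to numerical precision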
[Figure: section of a deep multilayer perceptron with layers y^{h_{s−1}}, y^{h_s}, y^{h_{s+1}}, ..., y^{h_{d−1}}, y^{h_d} ≡ y (∈ output space) and weight matrices W^{h_s}, W^{h_{s+1}}, ..., W^{h_d} ≡ W^o; the constant units y^{h_{s+1}}_0 = 1, ..., y^{h_{d−1}}_0 = 1 are marked. The upper part highlights a weight w^{h_d}_{ij} connecting y^{h_{d−1}}_j to the output y_i(x), the lower part a weight w^{h_s}_{ij} connecting y^{h_{s−1}}_j to y^{h_s}_i.]
q ∇^h L2(w) is a special case of the s-layer case, and we obtain δ^h from δ^{h_s} by applying the following identities: W^{h_{s+1}} = W^o, δ^{h_{s+1}} = δ^{h_d} = δ^o, y^{h_s} = y^h, and l_s = l.