
Chapter ML:IV

IV. Neural Networks


q Perceptron Learning
q Multilayer Perceptron Basics
q Multilayer Perceptron with Two Layers
q Multilayer Perceptron at Arbitrary Depth
q Advanced MLPs
q Automatic Gradient Computation

ML:IV-53 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics

Definition 1 (Linear Separability)


Two sets of feature vectors, X0, X1, sampled from a p-dimensional feature space X,
are called linearly separable if p+1 real numbers, w0, w1, . . . , wp, exist such that the
following conditions holds:
Pp
1. ∀x ∈ X0: j=0 wj xj < 0
Pp
2. ∀x ∈ X1: j=0 wj xj ≥ 0

ML:IV-54 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics

Definition 1 (Linear Separability)

Two sets of feature vectors, X0, X1, sampled from a p-dimensional feature space X,
are called linearly separable if p+1 real numbers, w0, w1, . . . , wp, exist such that the
following conditions hold:

1. ∀x ∈ X0:  ∑_{j=0}^{p} wj xj < 0

2. ∀x ∈ X1:  ∑_{j=0}^{p} wj xj ≥ 0

[Figure: two scatter plots of classes A and B in the (x1, x2) plane; left: linearly
separable, right: not linearly separable]


ML:IV-55 Neural Networks © STEIN/VÖLSKE 2024

Multilayer Perceptron Basics
Linear Separability (continued)

The XOR function defines two sets in R2 that are not linearly separable:

  point   x1   x2   XOR   c
  x1       0    0    0    −
  x2       1    0    1    +
  x3       0    1    1    +
  x4       1    1    0    −

[Figure: the four points x1, . . . , x4 in the (x1, x2) unit square; x2 and x3 are labeled +,
x1 and x4 are labeled −; no single hyperplane separates the two classes]

→ Specification of several hyperplanes.


→ Layered combination of several perceptrons: the multilayer perceptron.

ML:IV-57 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics
(1) Overcoming the Linear Separability Restriction

A minimum multilayer perceptron y(x) that can handle the XOR problem:
[Figure: network with inputs x0 = 1, x1, x2, two hidden units with weights w^h_{10}, w^h_{11}, w^h_{12}
and w^h_{20}, w^h_{21}, w^h_{22}, a constant hidden unit y^h_0 = 1, and one output unit with weights
w^o_0, w^o_1, w^o_2 that yields the class label in {−, +}]

    y(x) = heaviside( W^o Heaviside( W^h x ) ),   with x extended by x0 = 1, and

    W^h = [ −0.5  −1  1 ]         W^o = [ 0.5  1  −1 ]
          [  0.5  −1  1 ]

[Figure: in the space of the hidden unit outputs (y^h_1, y^h_2) the four XOR points become
linearly separable: x1 and x4 are mapped to the same point, labeled −, while x2 and x3
are mapped to points labeled +]
ML:IV-63 Neural Networks © STEIN/VÖLSKE 2024
Remarks:
q The first, second, and third layer of the shown multilayer perceptron are called input, hidden,
  and output layer respectively. Here, in the example, the input layer is comprised of
  p+1 = 3 units, the hidden layer contains l+1 = 3 units, and the output layer consists of k = 1 unit.

q Each input unit is connected via a weighted edge to all hidden units (except to the topmost
  hidden unit, which has a constant input y^h_0 = 1), resulting in six weights, organized as the
  2×3 matrix W^h. Each hidden unit is connected via a weighted edge to the output unit,
  resulting in three weights, organized as the 1×3 matrix W^o.

q The input units perform no computation but only distribute the values x0, x1, x2 to the next
  layer. The hidden units (again except the topmost unit) and the output unit apply the
  heaviside function to the sum of their weighted inputs and propagate the result.
  The nine weights w = (w^h_{10}, . . . , w^h_{22}, w^o_0, w^o_1, w^o_2), organized as W^h and W^o, specify the
  multilayer perceptron (model function) y(x) completely:
  y(x) = heaviside(W^o Heaviside(W^h x)), where x is extended by x0 = 1.

q The function Heaviside (with capital H) denotes the extension of the scalar heaviside function
  to vectors. For z ∈ R^d the function Heaviside(z) is defined as (heaviside(z1), . . . , heaviside(zd))^T.
ML:IV-64 Neural Networks © STEIN/VÖLSKE 2024
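
A minimal NumPy sketch (not part of the original slides) that evaluates this minimum network for the four XOR inputs, using exactly the weight matrices W^h and W^o given above:

    import numpy as np

    # Weight matrices of the minimal XOR network from the slide.
    W_h = np.array([[-0.5, -1.0, 1.0],
                    [ 0.5, -1.0, 1.0]])   # hidden layer, 2 x (p+1)
    W_o = np.array([[ 0.5,  1.0, -1.0]])  # output layer, 1 x (l+1)

    def heaviside(z):
        # heaviside(z) = 1 if z >= 0, else 0, applied element-wise
        return (z >= 0).astype(float)

    def y(x1, x2):
        x  = np.array([1.0, x1, x2])                       # input extended by x0 = 1
        yh = np.concatenate(([1.0], heaviside(W_h @ x)))   # hidden output, extended by y^h_0 = 1
        return heaviside(W_o @ yh)[0]

    for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print((x1, x2), "->", y(x1, x2))   # prints 0.0, 1.0, 1.0, 0.0, i.e., XOR

The hidden layer maps x1 = (0,0) and x4 = (1,1) to the same point of the hidden space, which is why a single output unit can then separate the two classes.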


Remarks (history) :
q The multilayer perceptron was presented by Rumelhart and McClelland in 1986. Earlier, but
  largely unnoticed, was similar work by Werbos [1974] and Parker [1982].
q Compared to a single perceptron, the multilayer perceptron poses a significantly more
  challenging training (= learning) problem, which requires continuous (and non-linear)
  threshold functions along with sophisticated learning strategies.
q Marvin Minsky and Seymour Papert used the XOR problem in 1969 to show the limitations of
  single perceptrons. Moreover, they assumed that extensions of the perceptron architecture
  (such as the multilayer perceptron) would be similarly limited as a single perceptron, a fatal
  mistake that brought research in this field to a halt for 17 years. [Berkeley]

[Marvin Minsky: MIT Media Lab, Wikipedia]

ML:IV-65 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics
(2) Overcoming the Non-Differentiability Restriction

The sigmoid function σ() as threshold function:

    σ(z) = 1 / (1 + e^{−z}),        dσ(z)/dz = σ(z) · (1 − σ(z))

→ A perceptron with a non-linear and differentiable threshold function:

[Figure: perceptron with inputs x0 = 1, x1, . . . , xp, weights w0 = −θ, w1, . . . , wp, a
summation unit Σ, and a sigmoid threshold centered at 0 producing the output y
(cf. the perceptron with the heaviside threshold)]

ML:IV-66 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics
(2) Overcoming the Non-Differentiability Restriction (continued)

Computation of the perceptron output y(x) with the sigmoid function σ() :

    y(x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

    [Plot: σ(∑_{j=0}^{p} wj xj), an S-shaped curve rising from 0 to 1]

An alternative to the sigmoid function is the tanh() function:

    tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) = (e^{2z} − 1) / (e^{2z} + 1)

    [Plot: tanh(∑_{j=0}^{p} wj xj), an S-shaped curve rising from −1 to 1]
ML:IV-67 Neural Networks © STEIN/VÖLSKE 2024
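
A quick NumPy check (an addition, not from the slides) that the two forms of tanh given above coincide, and that σ maps into (0, 1) while tanh maps into (−1, 1):

    import numpy as np

    z = np.linspace(-4, 4, 9)
    sigma  = 1 / (1 + np.exp(-z))                            # σ(z)
    tanh_a = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
    tanh_b = (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)       # second form from the slide
    print(np.allclose(tanh_a, tanh_b), np.allclose(tanh_a, np.tanh(z)))   # True True
    print(sigma.min() > 0, sigma.max() < 1, tanh_a.min() > -1, tanh_a.max() < 1)   # all True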


Remarks:
q Employing a nonlinear function as threshold function in the perceptron, such as sigmoid or
  heaviside, is a prerequisite to synthesize complex nonlinear functions via layered
  composition.
q Note that a single perceptron with sigmoid activation is identical to the logistic regression
  model function.
q The derivative of σ() has a canonical form. It plays a central role in the computation of the
  gradient of the loss function in multilayer perceptrons. Derivation:

    dσ(z)/dz = d/dz [ 1 / (1 + e^{−z}) ] = d/dz (1 + e^{−z})^{−1}
             = −1 · (1 + e^{−z})^{−2} · e^{−z} · (−1)
             = σ(z) · σ(z) · e^{−z}
             = σ(z) · σ(z) · (1 + e^{−z} − 1)
             = σ(z) · σ(z) · (σ(z)^{−1} − 1)
             = σ(z) · (1 − σ(z))

ML:IV-68 Neural Networks © STEIN/VÖLSKE 2024
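
The canonical form can also be checked numerically; a small sketch (not from the slides) comparing the analytic derivative σ(z)·(1 − σ(z)) with a central finite difference:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-5, 5, 11)
    eps = 1e-6
    numeric  = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference slope
    analytic = sigmoid(z) * (1 - sigmoid(z))                       # σ(z)·(1 − σ(z))
    print(np.max(np.abs(numeric - analytic)))                      # tiny: the two agree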



Multilayer Perceptron Basics
(2) Overcoming the Non-Differentiability Restriction (continued)

A single unit with inputs x0 = 1, x1, . . . , xp and weights w0, . . . , wp realizes a known
model function, depending on the activation applied to Σ wj xj:

  Linear activation     →  Linear regression
  Heaviside activation  →  Perceptron algorithm
  Sigmoid activation    →  Logistic regression

[Figure: three copies of the same single-unit network, differing only in their activation
function]
ML:IV-71 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics
(2) Overcoming the Non-Differentiability Restriction (continued)

  Network with linear units     →  No decision power beyond a single hyperplane
  Network with heaviside units  →  Nonlinear decision boundaries, but no gradient information
  Network with sigmoid units    →  Nonlinear decision boundaries and gradient information

[Figure: three multilayer networks with outputs y1, . . . , yk, differing only in their unit type]
ML:IV-72 Neural Networks © STEIN/VÖLSKE 2024
Remarks (limitation of linear thresholds) :
q A multilayer perceptron with linear threshold functions can be expressed as a single linear
  function and hence offers no more expressive power than a single perceptron.
q Consider the following exemplary composition of three linear functions as a multilayer
  perceptron with p input units, two hidden units, and one output unit: y(x) = W^o [W^h x].
  The weight matrices are

    W^h = [ w^h_{11} . . . w^h_{1p} ]        W^o = [ w^o_1  w^o_2 ]
          [ w^h_{21} . . . w^h_{2p} ]

  A straightforward derivation then yields:

    y(x) = W^o [W^h x] = [ w^o_1  w^o_2 ] · ( w^h_{11} x1 + . . . + w^h_{1p} xp ,  w^h_{21} x1 + . . . + w^h_{2p} xp )^T

         = w^o_1 w^h_{11} x1 + . . . + w^o_1 w^h_{1p} xp + w^o_2 w^h_{21} x1 + . . . + w^o_2 w^h_{2p} xp

         = (w^o_1 w^h_{11} + w^o_2 w^h_{21}) x1 + . . . + (w^o_1 w^h_{1p} + w^o_2 w^h_{2p}) xp

         = w1 x1 + . . . + wp xp = w^T x
ML:IV-73 Neural Networks © STEIN/VÖLSKE 2024
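
The same collapse can be observed numerically; a small NumPy sketch (dimensions and weights are arbitrary placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 4
    W_h = rng.normal(size=(2, p))    # hidden layer with linear activation
    W_o = rng.normal(size=(1, 2))    # output layer with linear activation
    x   = rng.normal(size=p)

    two_layer = W_o @ (W_h @ x)      # y(x) = W^o [W^h x]
    single    = (W_o @ W_h) @ x      # the equivalent single linear map w^T x
    print(np.allclose(two_layer, single))   # True: no gain over a single linear unit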


Multilayer Perceptron Basics
Unrestricted Classification Problems

Setting:
q X is a multiset of feature vectors from an inner product space X, X ⊆ R^p.

q C = {0, 1}^k is the set of all multiclass labelings for k classes.

q D = {(x1, c1), . . . , (xn, cn)} ⊆ X × C is a multiset of examples.

Learning task:
q Fit D using a multilayer perceptron y() with a sigmoid activation function.

ML:IV-74 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron Basics
Unrestricted Classification Problems: Illustration

Two-class classification problem:

[Figure: scatter plot of a two-class data set in the (x1, x2) plane]

Separated classes:

[Figure: the same data set with the two classes separated by a nonlinear decision
boundary, shown together with the fitted model surface over the (x1, x2) plane]  [loss L2 (w)]
ML:IV-76 Neural Networks © STEIN/VÖLSKE 2024
Chapter ML:IV
IV. Neural Networks
q Perceptron Learning
q Multilayer Perceptron Basics
q Multilayer Perceptron with Two Layers
q Multilayer Perceptron at Arbitrary Depth
q Advanced MLPs
q Automatic Gradient Computation

ML:IV-77 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron with Two Layers
Network Architecture

A single perceptron y(x):

[Figure: inputs x0 = 1, x1, . . . , xp feeding a single summation-and-threshold unit that
outputs y]

ML:IV-78 Neural Networks © STEIN/VÖLSKE 2024



Multilayer Perceptron with Two Layers
Network Architecture

Multilayer perceptron y(x) with a hidden layer and k-dimensional output layer:

[Figure: input units x0 = 1, x1, . . . , xp (extended input space), fully connected via the
weights w^h_{10}, . . . , w^h_{lp} to the hidden units y^h_0 = 1, y^h_1, . . . , y^h_l (extended feature space),
which are fully connected via the weights w^o_{10}, . . . , w^o_{kl} to the output units y1, . . . , yk
(output space)]

Parameters w:   W^h ∈ R^{l×(p+1)},   W^o ∈ R^{k×(l+1)}
ML:IV-81 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron with Two Layers
(1) Forward Propagation [mlp arbitrary depth]

Multilayer perceptron y(x) with a hidden layer and k-dimensional output layer:

[Figure: the two-layer network architecture from the previous slide, with parameters
W^h ∈ R^{l×(p+1)} and W^o ∈ R^{k×(l+1)}]

Model function evaluation (= forward propagation) :

    y(x) = σ( W^o y^h(x) ) = σ( W^o σ( W^h x ) ),

where x and y^h(x) are extended by x0 = 1 and y^h_0 = 1 respectively.
ML:IV-82 Neural Networks © STEIN/VÖLSKE 2024


Remarks:
q Each input unit is connected to the hidden units 1, . . . , l, resulting in l·(p+1) weights,
  organized as matrix W^h ∈ R^{l×(p+1)}. Each hidden unit is connected to the output units 1, . . . , k,
  resulting in k·(l+1) weights, organized as matrix W^o ∈ R^{k×(l+1)}.

q The hidden units and the output unit(s) apply the (vectorial) sigmoid function, σ, to the sum
  of their weighted inputs and propagate the result as y^h and y respectively. For z ∈ R^d the
  vectorial sigmoid function σ(z) is defined as (σ(z1), . . . , σ(zd))^T.

q The parameter vector w = (w^h_{10}, . . . , w^h_{lp}, w^o_{10}, . . . , w^o_{kl}), organized as matrices W^h and W^o,
  specifies the multilayer perceptron (model function) y(x) completely:
  y(x) = σ(W^o σ(W^h x)), where x is extended by x0 = 1 and y^h(x) by y^h_0 = 1.

q The shown architecture with k output units allows for the distinction of k classes, either within
  an exclusive class assignment setting or within a multi-label setting. In the former setting a
  so-called "softmax layer" can be added subsequent to the output layer to directly return the
  class label 1, . . . , k.

q The non-linear characteristic of the sigmoid function allows for networks that approximate
  every (computable) function. For this capability only three "active" layers are required, i.e.,
  two layers with hidden units and one layer with output units. Keyword: universal approximator
  [Kolmogorov theorem, 1957]

q Multilayer perceptrons are also called multilayer networks or (artificial) neural networks, ANN
  for short.
ML:IV-83 Neural Networks © STEIN/VÖLSKE 2024



Multilayer Perceptron with Two Layers
(1) Forward Propagation (continued) [network architecture]

(a) Propagate x from input to hidden layer: [IGDMLP2 algorithm, Line 5]

    With W^h ∈ R^{l×(p+1)} and the extended input x = (1, x1, . . . , xp)^T ∈ R^{p+1} :

        σ( W^h x ) = (y^h_1, . . . , y^h_l)^T

(b) Propagate y^h from hidden to output layer: [IGDMLP2 algorithm, Line 5]

    With W^o ∈ R^{k×(l+1)} and the extended hidden output y^h = (1, y^h_1, . . . , y^h_l)^T ∈ R^{l+1} :

        σ( W^o y^h ) = (y1, . . . , yk)^T ∈ R^k
ML:IV-85 Neural Networks © STEIN/VÖLSKE 2024
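
A minimal NumPy sketch of the two propagation steps (a) and (b) for a single example (not part of the original slides); the dimensions p, l, k and the random weights are placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    p, l, k = 3, 4, 2                      # input, hidden, and output dimensions (placeholders)
    rng = np.random.default_rng(0)
    W_h = rng.normal(size=(l, p + 1))      # W^h ∈ R^{l×(p+1)}
    W_o = rng.normal(size=(k, l + 1))      # W^o ∈ R^{k×(l+1)}

    x  = np.concatenate(([1.0], rng.normal(size=p)))   # (a) input extended by x0 = 1
    yh = np.concatenate(([1.0], sigmoid(W_h @ x)))     #     hidden output, extended by y^h_0 = 1
    y  = sigmoid(W_o @ yh)                             # (b) network output, y ∈ R^k
    print(y)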


Multilayer Perceptron with Two Layers
(1) Forward Propagation: Batch Mode [network architecture]

(a) Propagate all inputs from input to hidden layer: [IGDMLP2 algorithm, Line 5]

    With W^h ∈ R^{l×(p+1)} and the n extended inputs collected column-wise in a
    (p+1)×n matrix X whose first row is all ones:

        σ( W^h X ) = Y^h,   with entries y^h_{11}, . . . , y^h_{ln}

(b) Propagate Y^h from hidden to output layer: [IGDMLP2 algorithm, Line 5]

    With W^o ∈ R^{k×(l+1)} and Y^h extended by a first row of ones:

        σ( W^o Y^h ) = Y,   with entries y_{11}, . . . , y_{kn}
ML:IV-86 Neural Networks © STEIN/VÖLSKE 2024
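
The same two steps in batch mode, with all n examples stored column-wise (again a sketch with placeholder sizes):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    p, l, k, n = 3, 4, 2, 5
    rng = np.random.default_rng(0)
    W_h = rng.normal(size=(l, p + 1))
    W_o = rng.normal(size=(k, l + 1))

    X   = np.vstack([np.ones(n), rng.normal(size=(p, n))])   # (p+1) x n, first row all ones
    Y_h = np.vstack([np.ones(n), sigmoid(W_h @ X)])          # (l+1) x n, extended hidden outputs
    Y   = sigmoid(W_o @ Y_h)                                 # k x n: one output column per example
    print(Y.shape)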



Multilayer Perceptron with Two Layers
(2) Backpropagation [linear regression] [mlp arbitrary depth]

The considered multilayer perceptron y(x):

[Figure: the two-layer network architecture, with parameters W^h ∈ R^{l×(p+1)} and
W^o ∈ R^{k×(l+1)}]

Calculation of derivatives (= backpropagation) wrt. the global squared loss:

    L2(w) = 1/2 · RSS(w) = 1/2 · ∑_{u=1}^{k} ∑_{(x,c)∈D} (cu − yu(x))²
ML:IV-88 Neural Networks © STEIN/VÖLSKE 2024

Multilayer Perceptron with Two Layers
(2) Backpropagation (continued)

L2(w) usually contains various local minima:

[Plot: the loss surface L2(w) over two of the weights, w^h_{10} and w^h_{31}, showing several
local minima]  [model function y(x)]

    y(x) = σ( W^o σ( W^h x ) )

    L2(w) = 1/2 · ∑_{u=1}^{k} ∑_{(x,c)∈D} (cu − yu(x))²

    ∇L2(w) = ( ∂L2(w)/∂w^o_{10}, . . . , ∂L2(w)/∂w^o_{kl}, ∂L2(w)/∂w^h_{10}, . . . , ∂L2(w)/∂w^h_{lp} )^T

(a) Gradient in direction of W^o, written as matrix:

    ∇oL2(w) ≡ [ ∂L2(w)/∂w^o_{ij} ],   i = 1, . . . , k,   j = 0, . . . , l

(b) Gradient in direction of W^h, written as matrix:

    ∇hL2(w) ≡ [ ∂L2(w)/∂w^h_{ij} ],   i = 1, . . . , l,   j = 0, . . . , p
ML:IV-91 Neural Networks © STEIN/VÖLSKE 2024
Remarks:
q “Backpropagation” is short for “backward propagation of errors”. Backpropagation is a
method of calculating the derivatives (the gradient).

q Basically, the computation of the gradient ∇L2 (w) is independent of the organization of the
weights in matrices W h and W o of a network (model function) y(x). Adopt the following view
instead:
To calculate ∇L2 (w) one has to calculate each of its components ∂L2 (w)/∂w, w ∈ w, since
each weight (parameter) has a certain impact on the global loss L2 (w) of the network. This
impact, as well as its computation, differs between weights, but it is
canonical for all weights of the same layer: observe that each weight w influences
“only” its direct and indirect successor nodes, and that the structure of the influenced
successor graph is identical for all weights of the same layer.
Hence it is convenient, but not necessary, to process the components of the gradient
layer-wise (matrix-wise), as ∇o L2 (w) and ∇h L2 (w) respectively. Even more, due to the
network structure of the model function y(x) only two cases need to be distinguished when
deriving the partial derivative ∂L2 (w)/∂w of an arbitrary weight w ∈ w : (a) w belongs to the
output layer, or (b) w belongs to some hidden layer.

q The derivation of the gradient for the two-layer MLP (and hence the weight update processed
in the IGD algorithm) is given in the following, as special case of the derivation of the gradient
for MLPs at arbitrary depth.

ML:IV-92 Neural Networks © STEIN/VÖLSKE 2024



Multilayer Perceptron with Two Layers
(2) Backpropagation (continued) [linear regression] [mlp arbitrary depth]

(a) Update of weight matrix W^o : [IGDMLP2 algorithm, Lines 7+8]

    W^o = W^o + ∆W^o,

using the ∇o-gradient of the loss function L2(w) to take the steepest descent:

    ∆W^o = −η · ∇oL2(w)

          = −η · [ ∂L2(w)/∂w^o_{ij} ],   i = 1, . . . , k,   j = 0, . . . , l

    ... [derivation]

          = η · ∑_{(x,c)∈D} δ^o ⊗ y^h(x),   where δ^o = (c − y(x)) ⊙ y(x) ⊙ (1 − y(x))
ML:IV-94 Neural Networks © STEIN/VÖLSKE 2024
Multilayer Perceptron with Two Layers
(2) Backpropagation (continued) [mlp arbitrary depth]

(b) Update of weight matrix W^h : [IGDMLP2 algorithm, Lines 7+8]

    W^h = W^h + ∆W^h,

using the ∇h-gradient of the loss function L2(w) to take the steepest descent:

    ∆W^h = −η · ∇hL2(w)

          = −η · [ ∂L2(w)/∂w^h_{ij} ],   i = 1, . . . , l,   j = 0, . . . , p

    ... [derivation]

          = η · ∑_{(x,c)∈D} δ^h ⊗ x,   where δ^h = [ ((W^o)^T δ^o) ⊙ y^h(x) ⊙ (1 − y^h(x)) ]_{1,...,l}
ML:IV-95 Neural Networks © STEIN/VÖLSKE 2024
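
A sketch of how the two sums over D translate into code (not from the slides): '*' plays the role of the Hadamard product ⊙ and np.outer the role of the dyadic product ⊗; the data set, the sizes, and the learning rate are placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    p, l, k, eta = 2, 3, 1, 0.1
    W_h = rng.normal(size=(l, p + 1))
    W_o = rng.normal(size=(k, l + 1))
    D   = [(rng.normal(size=p), rng.integers(0, 2, size=k).astype(float)) for _ in range(5)]

    dW_o = np.zeros_like(W_o)
    dW_h = np.zeros_like(W_h)
    for x, c in D:
        x_ext = np.concatenate(([1.0], x))
        y_h   = np.concatenate(([1.0], sigmoid(W_h @ x_ext)))
        y     = sigmoid(W_o @ y_h)
        delta_o = (c - y) * y * (1 - y)                          # δ^o
        delta_h = ((W_o.T @ delta_o) * y_h * (1 - y_h))[1:]      # δ^h, projected to rows 1, ..., l
        dW_o += eta * np.outer(delta_o, y_h)                     # accumulates η · Σ_D δ^o ⊗ y^h
        dW_h += eta * np.outer(delta_h, x_ext)                   # accumulates η · Σ_D δ^h ⊗ x

    W_o += dW_o   # ∆W^o: one batch gradient step
    W_h += dW_h   # ∆W^h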
Multilayer Perceptron with Two Layers
The IGD Algorithm

Algorithm: IGDMLP2   Incremental Gradient Descent for the two-layer MLP.

Input:   D          Multiset of examples (x, c) with x ∈ R^p, c ∈ {0, 1}^k.
         η          Learning rate, a small positive constant.
Output:  W^h, W^o   Weights of the l·(p+1) hidden and k·(l+1) output layer units. (= hypothesis)

 1. initialize_random_weights(W^h, W^o),  t = 0
 2. REPEAT
 3.   t = t + 1
 4.   FOREACH (x, c) ∈ D DO
 5.     y^h(x) = σ(W^h x)                 // forward propagation = model function evaluation;
        y(x)   = σ(W^o y^h(x))            //   x is extended by x0 = 1, y^h(x) by y^h_0 = 1
 6.     δ = c − y(x)                      // calculation of the residual vector
 7a.    δ^o = δ ⊙ y(x) ⊙ (1 − y(x))                               // backpropagation (Steps 7a+7b)
        δ^h = [((W^o)^T δ^o) ⊙ y^h(x) ⊙ (1 − y^h(x))]_{1,...,l}   //   = derivative of the loss
 7b.    ∆W^h = η · (δ^h ⊗ x)
        ∆W^o = η · (δ^o ⊗ y^h(x))
 8.     W^h = W^h + ∆W^h,   W^o = W^o + ∆W^o      // parameter update ≙ one gradient step down
 9.   ENDDO
10. UNTIL(convergence(D, y(), t))
11. return(W^h, W^o)                                                           [Python code]
ML:IV-96 Neural Networks © STEIN/VÖLSKE 2024
Remarks:
q The symbol »⊙« denotes the Hadamard product, also known as the element-wise or the
  Schur product. It is a binary operation that takes two matrices of the same dimensions and
  produces another matrix of the same dimensions, where each element is the product of the
  respective elements of the two original matrices. [Wikipedia]
q The symbol »⊗« denotes the dyadic product, also called outer product or tensor product.
  The dyadic product takes two vectors and returns a second order tensor, called a dyadic in
  this context: v ⊗ w ≡ v w^T. [Wikipedia]
q [ · ]_{1,...,l} denotes the projection operator, which returns rows 1 through l of a matrix (or the
  components 1 through l of a vector) as a new matrix (vector).
q The weight matrix update ∆W can be computed per batch, D (as in the ∇-gradient formulas),
  or per instance, (x, c) ∈ D (as in the IGD algorithm), respectively.
ML:IV-102 Neural Networks © STEIN/VÖLSKE 2024
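
A compact NumPy rendition of IGDMLP2 (a sketch, not the [Python code] linked from the slides): '*' stands for the Hadamard product ⊙, np.outer for the dyadic product ⊗, and a fixed epoch budget replaces the convergence test. Whether the small XOR network below reaches a perfect fit depends on the random initialization, since the loss surface has local minima.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def igd_mlp2(D, l, k, eta=0.5, epochs=10000, seed=0):
        rng = np.random.default_rng(seed)
        p = len(D[0][0])                                # input dimension
        W_h = rng.uniform(-0.5, 0.5, size=(l, p + 1))   # hidden weights, l x (p+1)
        W_o = rng.uniform(-0.5, 0.5, size=(k, l + 1))   # output weights, k x (l+1)

        for t in range(epochs):                         # REPEAT ... UNTIL: fixed budget here
            for x, c in D:                              # FOREACH (x, c) in D
                x_ext = np.concatenate(([1.0], x))                        # x extended by x0 = 1
                y_h   = np.concatenate(([1.0], sigmoid(W_h @ x_ext)))     # extended hidden output
                y     = sigmoid(W_o @ y_h)                                # forward propagation

                delta   = c - y                                           # residual vector
                delta_o = delta * y * (1 - y)                             # δ^o
                delta_h = ((W_o.T @ delta_o) * y_h * (1 - y_h))[1:]       # δ^h = [...]_{1,...,l}

                W_h += eta * np.outer(delta_h, x_ext)                     # ∆W^h = η · (δ^h ⊗ x)
                W_o += eta * np.outer(delta_o, y_h)                       # ∆W^o = η · (δ^o ⊗ y^h)
        return W_h, W_o

    # Example: learn XOR with l = 2 hidden units and k = 1 output unit.
    D = [(np.array([0., 0.]), np.array([0.])),
         (np.array([1., 0.]), np.array([1.])),
         (np.array([0., 1.]), np.array([1.])),
         (np.array([1., 1.]), np.array([0.]))]
    W_h, W_o = igd_mlp2(D, l=2, k=1)
    for x, c in D:
        y_h = np.concatenate(([1.0], sigmoid(W_h @ np.concatenate(([1.0], x)))))
        print(x, c, np.round(sigmoid(W_o @ y_h), 2))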


Chapter ML:IV
IV. Neural Networks
q Perceptron Learning
q Multilayer Perceptron Basics
q Multilayer Perceptron with Two Layers
q Multilayer Perceptron at Arbitrary Depth
q Advanced MLPs
q Automatic Gradient Computation

ML:IV-103 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron at Arbitrary Depth
Network Architecture [mlp two layers]

Multilayer perceptron y(x) with d layers and k-dimensional output:

[Figure: input units x0 = 1, x1, . . . , xp (extended input space), followed by d−1 hidden
layers with outputs y^{h1}, . . . , y^{h_{d−1}} (each extended by a constant unit y^{hs}_0 = 1), and an
output layer with outputs y^{hd} ≡ y = (y1, . . . , yk) (output space)]

Parameters w:   W^{h1} ∈ R^{l1×(p+1)},  . . . ,  W^{hd} ≡ W^o ∈ R^{k×(l_{d−1}+1)}
ML:IV-104 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron at Arbitrary Depth
(1) Forward Propagation [mlp two layers]

Multilayer perceptron y(x) with d layers and k-dimensional output:

[Figure: the d-layer network architecture from the previous slide, with parameters
W^{h1} ∈ R^{l1×(p+1)}, . . . , W^{hd} ≡ W^o ∈ R^{k×(l_{d−1}+1)}]

Model function evaluation (= forward propagation) :

    y^{hd}(x) ≡ y(x) = σ( W^{hd} y^{h_{d−1}}(x) ) = . . . = σ( W^{hd} σ( . . . σ( W^{h1} x ) . . . ) ),

where x and each y^{hs}(x) are extended by a leading 1.
ML:IV-105 Neural Networks © STEIN/VÖLSKE 2024
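
A sketch of this nested evaluation as a loop over the weight matrices (not from the slides); the layer widths and random weights are placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(Ws, x):
        # Ws = [W^{h1}, ..., W^{hd}]; returns y^{hd}(x) ≡ y(x).
        y = np.concatenate(([1.0], x))          # extend the input by x0 = 1
        for s, W in enumerate(Ws):
            y = sigmoid(W @ y)
            if s < len(Ws) - 1:                 # hidden outputs are extended by y^{hs}_0 = 1
                y = np.concatenate(([1.0], y))
        return y

    # Example: p = 3 inputs, hidden widths 4 and 3, k = 2 outputs (placeholder sizes).
    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(4, 3 + 1)),
          rng.normal(size=(3, 4 + 1)),
          rng.normal(size=(2, 3 + 1))]
    print(forward(Ws, rng.normal(size=3)))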


Multilayer Perceptron at Arbitrary Depth
(2) Backpropagation [mlp two layers]

The considered multilayer perceptron y(x):

[Figure: the d-layer network architecture, with parameters W^{h1} ∈ R^{l1×(p+1)}, . . . ,
W^{hd} ≡ W^o ∈ R^{k×(l_{d−1}+1)}]

Calculation of derivatives (= backpropagation) wrt. the global squared loss:

    L2(w) = 1/2 · RSS(w) = 1/2 · ∑_{u=1}^{k} ∑_{(x,c)∈D} (cu − yu(x))²
ML:IV-106 Neural Networks © STEIN/VÖLSKE 2024

Multilayer Perceptron at Arbitrary Depth
(2) Backpropagation (continued) [mlp two layers]

    ∇L2(w) = ( ∂L2(w)/∂w^{h1}_{10}, . . . , ∂L2(w)/∂w^{h1}_{l1 p}, . . . , ∂L2(w)/∂w^{hd}_{10}, . . . , ∂L2(w)/∂w^{hd}_{k l_{d−1}} )^T,
    where ls = no._rows(W^{hs})

Update of weight matrix W^{hs}, 1 ≤ s ≤ d : [IGDMLPd algorithm, Lines 7+8]

    W^{hs} = W^{hs} + ∆W^{hs},

using the ∇hs-gradient of the loss function L2(w) to take the steepest descent:

    ∆W^{hs} = −η · ∇hs L2(w)

            = −η · [ ∂L2(w)/∂w^{hs}_{ij} ],   i = 1, . . . , ls,   j = 0, . . . , l_{s−1},
              where ls = no._rows(W^{hs}),  y^{h0} ≡ x,  and  y^{hd} ≡ y

                                                                   (continued on the next slide)
ML:IV-109 Neural Networks © STEIN/VÖLSKE 2024
Multilayer Perceptron at Arbitrary Depth
(2) Backpropagation (continued) [mlp two layers]

... [derivation]

    ∆W^{hs} = η · ∑_{(x,c)∈D} δ^{hs} ⊗ y^{h_{s−1}}(x),   where y^{h0}(x) ≡ x and

      δ^{hd} ≡ δ^o = (c − y(x)) ⊙ y(x) ⊙ (1 − y(x))                                        (case s = d)

      δ^{hs} = [ ((W^{h_{s+1}})^T δ^{h_{s+1}}) ⊙ y^{hs}(x) ⊙ (1 − y^{hs}(x)) ]_{1,...,ls}   (case 1 ≤ s < d)

    with ls = no._rows(W^{hs})
ML:IV-110 Neural Networks © STEIN/VÖLSKE 2024


Multilayer Perceptron at Arbitrary Depth
The IGD Algorithm

Algorithm: IGDMLPd   Incremental Gradient Descent for the d-layer MLP.

Input:   D                        Multiset of examples (x, c) with x ∈ R^p, c ∈ {0, 1}^k.
         η                        Learning rate, a small positive constant.
Output:  W^{h1}, . . . , W^{hd}   Weight matrices of the d layers. (= hypothesis)

 1. FOR s = 1 TO d DO initialize_random_weights(W^{hs}) ENDDO,  t = 0
 2. REPEAT
 3.   t = t + 1
 4.   FOREACH (x, c) ∈ D DO
 5.     y^{h1}(x) = σ(W^{h1} x)                                          // forward propagation = model
        FOR s = 2 TO d−1 DO y^{hs}(x) = σ(W^{hs} y^{h_{s−1}}(x)) ENDDO   //   function evaluation; x and the
        y(x) = σ(W^{hd} y^{h_{d−1}}(x))                                  //   y^{hs}(x) are extended by a leading 1
 6.     δ = c − y(x)                                                     // calculation of the residual vector
 7a.    δ^{hd} = δ ⊙ y(x) ⊙ (1 − y(x))                                   // backpropagation (Steps 7a+7b)
        FOR s = d−1 DOWNTO 1 DO
          δ^{hs} = [((W^{h_{s+1}})^T δ^{h_{s+1}}) ⊙ y^{hs}(x) ⊙ (1 − y^{hs}(x))]_{1,...,ls}
        ENDDO                                                            //   = derivative of the loss
 7b.    ∆W^{h1} = η · (δ^{h1} ⊗ x)
        FOR s = 2 TO d DO ∆W^{hs} = η · (δ^{hs} ⊗ y^{h_{s−1}}(x)) ENDDO
 8.     FOR s = 1 TO d DO W^{hs} = W^{hs} + ∆W^{hs} ENDDO                // parameter update ≙ one gradient step down
 9.   ENDDO
10. UNTIL(convergence(D, y(), t))
11. return(W^{h1}, . . . , W^{hd})                                                         [Python code]
ML:IV-111 Neural Networks © STEIN/VÖLSKE 2024
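
A NumPy sketch of IGDMLPd under the same conventions as the two-layer version above (again not the [Python code] linked from the slides); layer_sizes, η, and the epoch budget are placeholder choices:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def igd_mlpd(D, layer_sizes, eta=0.5, epochs=5000, seed=0):
        # layer_sizes = [p, l1, ..., l_{d-1}, k]; W^{hs} has shape l_s x (l_{s-1} + 1).
        rng = np.random.default_rng(seed)
        Ws = [rng.uniform(-0.5, 0.5, size=(m, n + 1))
              for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
        d = len(Ws)

        for t in range(epochs):                     # fixed budget instead of convergence(D, y(), t)
            for x, c in D:
                # Forward propagation; keep the extended outputs y^{h0} ≡ x, y^{h1}, ..., y^{hd}.
                ys = [np.concatenate(([1.0], x))]
                for s, W in enumerate(Ws):
                    out = sigmoid(W @ ys[-1])
                    ys.append(out if s == d - 1 else np.concatenate(([1.0], out)))
                y = ys[-1]

                # Backpropagation of the deltas, output layer first.
                deltas = [None] * d
                deltas[d - 1] = (c - y) * y * (1 - y)                    # δ^{hd}
                for s in range(d - 2, -1, -1):                           # s = d−1 DOWNTO 1 (0-based)
                    yh = ys[s + 1]
                    deltas[s] = ((Ws[s + 1].T @ deltas[s + 1]) * yh * (1 - yh))[1:]  # [·]_{1,...,ls}

                # Weight updates ∆W^{hs} = η · (δ^{hs} ⊗ y^{h_{s−1}}).
                for s in range(d):
                    Ws[s] += eta * np.outer(deltas[s], ys[s])
        return Ws

For instance, igd_mlpd(D, [2, 3, 3, 1]) would train a three-layer network on the XOR multiset D from the IGDMLP2 sketch above; as before, reaching a perfect fit depends on the initialization.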
Remarks (derivation of ∇hs L2(w)) :
q Partial derivative for a weight in a weight matrix W^{hs}, 1 ≤ s ≤ d :

    ∂L2(w)/∂w^{hs}_{ij}
      = ∂/∂w^{hs}_{ij} [ 1/2 · ∑_{(x,c)∈D} ∑_{u=1}^{k} (cu − yu(x))² ]
      = 1/2 · ∑_D ∑_{u=1}^{k} ∂/∂w^{hs}_{ij} (cu − yu(x))²
      = − ∑_D ∑_{u=1}^{k} (cu − yu(x)) · ∂yu(x)/∂w^{hs}_{ij}
      = − ∑_D ∑_{u=1}^{k} (cu − yu(x)) · yu(x) · (1 − yu(x)) · ∂( W^{hd}_{u∗} y^{h_{d−1}}(x) )/∂w^{hs}_{ij}        (1,2)
      = − ∑_D ∑_{u=1}^{k} δ^{hd}_u · ∂/∂w^{hs}_{ij} ∑_{v=0}^{l_{d−1}} w^{hd}_{uv} · y^{h_{d−1}}_v(x),              (3)

    with δ^{hd}_u ≡ δ^o_u = (cu − yu(x)) · yu(x) · (1 − yu(x)).

q Partial derivative for a weight in W^{hd} (output layer), i.e., s = d : only for the term where
  u = i and v = j is the partial derivative nonzero (see the illustration), hence

    ∂L2(w)/∂w^{hd}_{ij} = − ∑_D δ^{hd}_i · y^{h_{d−1}}_j(x)
ML:IV-117 Neural Networks © STEIN/VÖLSKE 2024


Remarks (derivation of ∇hs L2(w)) : (continued)
q Partial derivative for a weight in a weight matrix W^{hs}, s ≤ d−1 : every component of
  y^{h_{d−1}}(x) except y^{h_{d−1}}_0 depends on w^{hs}_{ij} (see the illustration), hence

    ∂L2(w)/∂w^{hs}_{ij}
      = − ∑_D ∑_{u=1}^{k} δ^{hd}_u · ∑_{v=0}^{l_{d−1}} ∂/∂w^{hs}_{ij} ( w^{hd}_{uv} · y^{h_{d−1}}_v(x) )
      = − ∑_D ∑_{u=1}^{k} δ^{hd}_u · ∑_{v=1}^{l_{d−1}} w^{hd}_{uv} · y^{h_{d−1}}_v(x) · (1 − y^{h_{d−1}}_v(x)) · ∂( W^{h_{d−1}}_{v∗} y^{h_{d−2}}(x) )/∂w^{hs}_{ij}   (1,2)
      = − ∑_D ∑_{v=1}^{l_{d−1}} ( ∑_{u=1}^{k} δ^{hd}_u · w^{hd}_{uv} ) · y^{h_{d−1}}_v(x) · (1 − y^{h_{d−1}}_v(x)) · ∂( W^{h_{d−1}}_{v∗} y^{h_{d−2}}(x) )/∂w^{hs}_{ij}   (4)
      = − ∑_D ∑_{v=1}^{l_{d−1}} δ^{h_{d−1}}_v · ∂/∂w^{hs}_{ij} ∑_{w=0}^{l_{d−2}} w^{h_{d−1}}_{vw} · y^{h_{d−2}}_w(x),                                               (5),(3)

    with δ^{h_{d−1}}_v = ((W^{hd}_{∗v})^T δ^{hd}) · y^{h_{d−1}}_v(x) · (1 − y^{h_{d−1}}_v(x)).

q Partial derivative for a weight in W^{h_{d−1}} (next to the output layer), i.e., s = d−1 : only for
  the term where v = i and w = j is the partial derivative nonzero, hence

    ∂L2(w)/∂w^{h_{d−1}}_{ij} = − ∑_D δ^{h_{d−1}}_i · y^{h_{d−2}}_j(x)
ML:IV-118 Neural Networks © STEIN/VÖLSKE 2024


Remarks (derivation of ∇hs L2(w)) : (continued)
q Instead of writing out the recursion further, i.e., considering a weight matrix W^{hs}, s ≤ d−2, we
  substitute s for d−1 (similarly: s+1 for d) to derive the general backpropagation rule:

    ∂L2(w)/∂w^{hs}_{ij} = − ∑_D δ^{hs}_i · y^{h_{s−1}}_j(x)
                        = − ∑_D ( (W^{h_{s+1}}_{∗i})^T δ^{h_{s+1}} ) · y^{hs}_i(x) · (1 − y^{hs}_i(x)) · y^{h_{s−1}}_j(x),

  where δ^{hs}_i is expanded based on the definition of δ^{h_{d−1}}_v.

q Plugging the result for ∂L2(w)/∂w^{hs}_{ij} into −η · [ . . . ] yields the update formula for ∆W^{hs}. In detail:

  – For updating the output matrix, W^{hd} ≡ W^o, we compute

      δ^{hd} = (c − y(x)) ⊙ y(x) ⊙ (1 − y(x))

  – For updating a matrix W^{hs}, 1 ≤ s < d, we compute

      δ^{hs} = [ ((W^{h_{s+1}})^T δ^{h_{s+1}}) ⊙ y^{hs}(x) ⊙ (1 − y^{hs}(x)) ]_{1,...,ls},

    where W^{h_{s+1}} ∈ R^{l_{s+1}×(ls+1)}, δ^{h_{s+1}} ∈ R^{l_{s+1}}, y^{hs} ∈ R^{ls+1}, and y^{h0}(x) ≡ x.
ML:IV-119 Neural Networks © STEIN/VÖLSKE 2024
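
The derived rule can be validated with a numerical gradient check; a sketch (not from the slides) for a small two-layer network with placeholder sizes, comparing −δ^{h1}_i · y^{h0}_j against central finite differences of L2:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(Ws, x):
        ys = [np.concatenate(([1.0], x))]
        for s, W in enumerate(Ws):
            out = sigmoid(W @ ys[-1])
            ys.append(out if s == len(Ws) - 1 else np.concatenate(([1.0], out)))
        return ys

    def loss(Ws, x, c):
        return 0.5 * np.sum((c - forward(Ws, x)[-1]) ** 2)

    rng = np.random.default_rng(1)
    Ws = [rng.normal(size=(3, 3)), rng.normal(size=(2, 4))]    # p = 2, l1 = 3, k = 2
    x, c = rng.normal(size=2), np.array([0.0, 1.0])

    # Backpropagated gradient of L2 wrt. W^{h1} for this single example.
    ys = forward(Ws, x)
    y  = ys[-1]
    d2 = (c - y) * y * (1 - y)                                 # δ^{h2} ≡ δ^o
    d1 = ((Ws[1].T @ d2) * ys[1] * (1 - ys[1]))[1:]            # δ^{h1}
    grad_bp = -np.outer(d1, ys[0])                             # ∂L2/∂w^{h1}_{ij} = −δ^{h1}_i · y^{h0}_j

    # Numerical gradient of the same entries.
    grad_num = np.zeros_like(Ws[0])
    eps = 1e-6
    for i in range(Ws[0].shape[0]):
        for j in range(Ws[0].shape[1]):
            Wp = [W.copy() for W in Ws]; Wp[0][i, j] += eps
            Wm = [W.copy() for W in Ws]; Wm[0][i, j] -= eps
            grad_num[i, j] = (loss(Wp, x, c) - loss(Wm, x, c)) / (2 * eps)

    print(np.max(np.abs(grad_bp - grad_num)))   # close to zero: backprop matches the numerical gradient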


Remarks (derivation of ∇hs L2(w)) : (continued)
q Hints:

  (1) yu(x) = σ( W^{hd} y^{h_{d−1}}(x) )_u = σ( W^{hd}_{u∗} y^{h_{d−1}}(x) )

  (2) Chain rule with dσ(z)/dz = σ(z) · (1 − σ(z)), where σ(z) := yu(x) and z = W^{hd}_{u∗} y^{h_{d−1}}(x) :

        ∂yu(x)/∂w^{hs}_{ij} ≡ ∂σ(z)/∂w^{hs}_{ij} = yu(x) · (1 − yu(x)) · ∂( W^{hd}_{u∗} y^{h_{d−1}}(x) )/∂w^{hs}_{ij}

      Note that in the partial derivative expression the symbol x is a constant, while w^{hs}_{ij} is the
      variable whose effect on the change of the loss L2 (at input x) is computed.

  (3) W^{hd}_{u∗} y^{h_{d−1}}(x) = w^{hd}_{u0} · y^{h_{d−1}}_0(x) + . . . + w^{hd}_{uj} · y^{h_{d−1}}_j(x) + . . . + w^{hd}_{u l_{d−1}} · y^{h_{d−1}}_{l_{d−1}}(x),
      where l_{d−1} = no._rows(W^{h_{d−1}}).

  (4) Rearrange sums to reflect the nested dependencies that develop naturally from the
      backpropagation. We can now define δ^{h_{d−1}}_v in layer d−1 as a function of δ^{hd} (layer d).

  (5) ∑_{u=1}^{k} δ^{hd}_u · w^{hd}_{uv} = (W^{hd}_{∗v})^T δ^{hd}   (scalar product).
ML:IV-120 Neural Networks © STEIN/VÖLSKE 2024


Remarks (derivation of ∇hs L2(w)) : (continued)
q y(x) as a function of some w^{hs}_{ij}, either in the output layer W^{hd} ≡ W^o or in some middle
  layer W^{hs}. To calculate the partial derivative of yu(x) with respect to w^{hs}_{ij}, determine those
  terms in yu(x) that depend on w^{hs}_{ij} (highlighted in the figure). All other terms are in the role
  of constants.

  [Figure: the d-layer network with the layers y^{h_{s−1}}, y^{hs}, y^{h_{s+1}}, . . . , y^{h_{d−1}}, y^{hd} ≡ y; one
  version highlights a weight w^{hd}_{ij} of the output layer and its path to yi(x), the other a weight
  w^{hs}_{ij} of a middle layer and all paths through which it influences yu(x)]

    yu(x) = σ( W^{hd} σ( . . . σ( W^{h_{s+1}} σ( W^{hs} y^{h_{s−1}}(x) ) ) . . . ) )_u,

  where the nested terms correspond, from the inside out, to y^{hs}(x), y^{h_{s+1}}(x), . . . , y^{h_{d−1}}(x),
  and the outermost σ(·) to y^{hd}(x) ≡ y(x).

q Compare the above illustration to the multilayer perceptron network architecture.
ML:IV-121 Neural Networks © STEIN/VÖLSKE 2024
Remarks (derivation of ∇o L2(w) and ∇h L2(w) for the two-layer MLP, i.e., d = 2) :

q ∇o L2(w) ≡ ∇hd L2(w), and hence δ^o ≡ δ^{hd}.

q ∇h L2(w) is a special case of the general layer-s case, and we obtain δ^h from δ^{hs} by applying
  the following identities: W^{h_{s+1}} = W^o, δ^{h_{s+1}} = δ^{hd} = δ^o, y^{hs} = y^h, and ls = l.
ML:IV-123 Neural Networks © STEIN/VÖLSKE 2024
