
Class 4: Backpropagation

The document discusses backpropagation, a method for training neural networks by adjusting weights based on the error rate from previous iterations. It covers the structure of artificial neurons, activation functions, and the application of gradient descent to minimize loss during training. Additionally, it explains the process of forward and backward passes in both single and multi-layer perceptrons.

Learning by backpropagation

Deisy Chaves
Office 10, 4th Floor, Building B13
Contents
• Learning by backpropagation
  • Perceptron (one neuron)
  • Multilayer perceptron (MLP)
• Accelerating learning


Artificial neuron
• Input features, $x_i$ ($i = 1, \dots, n$)
• Weights, $w_i$
• Bias, $b$
• Weighted sum (linear function): $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$
• Activation function, $\sigma$: $\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$
• Diagram: inputs → weights → weighted sum ($\sum$, plus bias) → activation → output $\hat{y}$
Activation function
• Step function (labelled "sign function" on the slide): $\sigma(z) = 1$ if $z \geq 0$, and $\sigma(z) = 0$ if $z < 0$
• Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
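The weighted sum and the two activation functions above can be sketched in a few lines of Python. This is only an illustration: the function names and the input, weight and bias values are hypothetical and do not come from the slides.

```python
import math

def step(z):
    """Threshold activation from the slide: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs, weights and bias, chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

y_sigmoid = neuron(x, w, b)      # z = 0.5*1 - 0.25*2 + 0 = 0, sigmoid(0) = 0.5
y_step = neuron(x, w, b, step)   # z = 0 >= 0, so the step output is 1
```

Swapping the activation changes only the last stage of the neuron; the weighted sum is the same in both cases.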
Backpropagation
• Introduced in the 1970s
• Method for fine-tuning the weights (training) of a neural network (perceptron, MLP, …) with respect to the error rate obtained in the previous iteration or epoch
Backpropagation
• The error (loss function) of the model output compared to the true values is calculated to evaluate the model
• Gradient descent is used during the backward pass to update the model parameters (weights and biases) and reduce the loss
• The cost function is the average of the individual losses across the entire training dataset (it aggregates the error over all samples)
• Animation: https://machinelearningknowledge.ai/wp-content/uploads/2019/10/Backpropagation.gif
Backpropagation: Gradient descent
• The gradient of a function at a specific point is the direction of steepest ascent
• To find the minimum, we have to move in the opposite direction (the negative of the gradient), which is the direction of steepest descent
• Figure: loss function (L) vs weight (w). Source: https://datamapu.com/posts/deep_learning/intro_dl/
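To make "moving against the gradient" concrete, here is a minimal sketch on a one-parameter loss. The loss $L(w) = (w - 3)^2$, the starting point and the number of steps are hypothetical choices for illustration only; they are not part of the slides.

```python
def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss: dL/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    # Step in the direction opposite to the gradient (steepest descent).
    w = w - learning_rate * grad(w)
```

Each step shrinks the distance to the minimum by a constant factor here ($1 - 2 \cdot 0.1 = 0.8$), so `w` approaches 3. On a real network loss the behaviour depends on the shape of the loss surface and on the learning rate.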
Backpropagation: one neuron
• Training data: the dataset has a single one-dimensional sample with one input and one output, $x = 0.5$ and label $y = 1.5$
• Activation function: sigmoid, $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
• Learning rate = 0.1
• Loss or cost function: sum of squared errors, $L(y, \hat{y}) = \dfrac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• The training set has only one sample ($n = 1$), so $L(y, \hat{y}) = \dfrac{1}{2} (y - \hat{y})^2$
Backpropagation: one neuron
• Training sample ($x = 0.5$, $y = 1.5$), learning rate = 0.1
• Input feature, $x$
• Weight, $w$ (initial $w = 0.3$)
• Bias, $b$ (initial $b = 0.1$)
• Weighted sum (linear function): $z(x) = w \cdot x + b$
• Activation function: $a = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
• Output: $\hat{y} = a = \sigma(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{1}{1 + e^{-(wx + b)}}$
• Diagram: input $x$ → weight $w$ → weighted sum $z$ (plus bias $b$) → activation $a = \sigma(z)$ → output $\hat{y}$
Backpropagation: one neuron
• Forward pass
  $\hat{y} = a = \sigma(z) = \sigma(w \cdot x + b) = \dfrac{1}{1 + e^{-(wx + b)}} = 0.56$
  $L(y, \hat{y}) = \dfrac{1}{2} (y - \hat{y})^2 = \dfrac{1}{2} (1.5 - 0.56)^2 = 0.44$
• Training sample ($x = 0.5$, $y = 1.5$)
• Weight, $w$ (initial $w = 0.3$)
• Bias, $b$ (initial $b = 0.1$)
• Learning rate = 0.1
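The forward pass can be checked numerically with a short Python sketch. It uses the values given on the slide; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the slide.
x, y = 0.5, 1.5     # training sample
w, b = 0.3, 0.1     # initial weight and bias

z = w * x + b                    # weighted sum: 0.3 * 0.5 + 0.1 = 0.25
y_hat = sigmoid(z)               # neuron output
loss = 0.5 * (y - y_hat) ** 2    # squared-error loss for a single sample

print(round(y_hat, 2), round(loss, 2))  # 0.56 0.44
```

The rounded results agree with the values on the slide ($\hat{y} = 0.56$, $L = 0.44$).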
Backpropagation: one neuron
• Backward pass (gradient descent)
• Training sample ($x = 0.5$, $y = 1.5$)
• Weight, $w$ (initial $w = 0.3$)
• Bias, $b$ (initial $b = 0.1$)
• Learning rate = 0.1
• Forward pass: $\hat{y} = 0.56$
• Derivative rule (equations shown as figures on the slides)
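The equations for the backward pass were figures and are not in the extracted text. As a hedged reconstruction from the definitions above, the chain rule gives $\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial z} \cdot \dfrac{\partial z}{\partial w} = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot x$ and $\dfrac{\partial L}{\partial b} = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y})$, and each parameter is then moved against its gradient, scaled by the learning rate. A minimal Python sketch of one such update follows; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the slides.
x, y = 0.5, 1.5
w, b = 0.3, 0.1
learning_rate = 0.1

# Forward pass.
z = w * x + b
y_hat = sigmoid(z)

# Backward pass: chain rule, factor by factor.
dL_dyhat = -(y - y_hat)            # derivative of 1/2 (y - y_hat)^2
dyhat_dz = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
delta = dL_dyhat * dyhat_dz        # dL/dz
grad_w = delta * x                 # dz/dw = x
grad_b = delta                     # dz/db = 1

# Gradient descent update.
w_new = w - learning_rate * grad_w
b_new = b - learning_rate * grad_b

print(round(w_new, 4), round(b_new, 4))  # 0.3115 0.1231
```

Both gradients are negative because the prediction is below the target, so the update increases `w` and `b`.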
Multilayer Perceptron (MLP)
• A neural network that contains more than one computational layer
• The additional intermediate layers (between input and output) are called hidden layers because the computations they perform are not visible to the user
• Figure: fully connected neural network


Backpropagation: two neurons
• Training sample ($x = 0.5$, $y = 1.5$), learning rate = 0.1
• Input feature, $x$
• Weights, $w$ (initial $w^{(1)} = 0.3$, $w^{(2)} = 0.2$)
• Biases, $b$ (initial $b^{(1)} = 0.1$, $b^{(2)} = 0.4$)
• Weighted sum (linear function): $z(x) = w \cdot x + b$
• Activation function: $a = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
• Output: $\hat{y} = a = \sigma(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{1}{1 + e^{-(wx + b)}}$
Backpropagation: two neurons
• Forward pass
  $\hat{y} = a^{(3)} = \sigma(z^{(3)}) = 0.625$
  $L(y, \hat{y}) = \dfrac{1}{2} (y - \hat{y})^2 = \dfrac{1}{2} (1.5 - 0.625)^2 = 0.38$
• Training sample ($x = 0.5$, $y = 1.5$)
• Weights, $w$ (initial $w^{(1)} = 0.3$, $w^{(2)} = 0.2$)
• Biases, $b$ (initial $b^{(1)} = 0.1$, $b^{(2)} = 0.4$)
• Learning rate = 0.1
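The intermediate steps of this forward pass were figures on the slides. The sketch below assumes the two neurons are chained (the first neuron feeds the second) and that layers are numbered so that the input is layer 1, which is consistent with the slide writing the output as $a^{(3)}$. Under that assumption the numbers on the slide are reproduced; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the slides.
x, y = 0.5, 1.5
w1, b1 = 0.3, 0.1    # w^(1), b^(1)
w2, b2 = 0.2, 0.4    # w^(2), b^(2)

# Assumed layer numbering: layer 1 is the input, layer 3 is the output.
a1 = x
z2 = w1 * a1 + b1    # weighted sum of the first neuron
a2 = sigmoid(z2)     # activation of the first neuron
z3 = w2 * a2 + b2    # weighted sum of the second neuron
y_hat = sigmoid(z3)  # a^(3), the network output

loss = 0.5 * (y - y_hat) ** 2

print(round(y_hat, 3), round(loss, 2))  # 0.625 0.38
```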
Backpropagation: two neurons
• Backward pass: we need to update the model parameters
• Training sample ($x = 0.5$, $y = 1.5$)
• Weights, $w$ (initial $w^{(1)} = 0.3$, $w^{(2)} = 0.2$)
• Biases, $b$ (initial $b^{(1)} = 0.1$, $b^{(2)} = 0.4$)
• Learning rate = 0.1
• Forward pass: $\hat{y} = 0.625$
• The calculations for $w^{(2)}$ and $b^{(2)}$ are similar to the ones in the first example (derivative rule)
• Calculations for $w^{(1)}$ and $b^{(1)}$: individual derivatives
• Update $w^{(1)}$, $b^{(1)}$, $w^{(2)}$ and $b^{(2)}$
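The individual derivatives were figures on the slides, so the following is a reconstruction rather than the original derivation. It assumes the two neurons are chained, with the input counted as layer 1. The gradients for $w^{(2)}$ and $b^{(2)}$ follow the one-neuron case with $a^{(2)}$ in place of $x$; for $w^{(1)}$ and $b^{(1)}$ the chain rule passes through the second neuron, which adds the factors $w^{(2)}$ and $a^{(2)}(1 - a^{(2)})$. Variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the slides.
x, y = 0.5, 1.5
w1, b1 = 0.3, 0.1
w2, b2 = 0.2, 0.4
learning_rate = 0.1

# Forward pass (assumed chain: x -> neuron 1 -> neuron 2 -> y_hat).
a2 = sigmoid(w1 * x + b1)
y_hat = sigmoid(w2 * a2 + b2)

# Backward pass, output neuron: same pattern as the one-neuron example.
delta3 = -(y - y_hat) * y_hat * (1.0 - y_hat)   # dL/dz^(3)
grad_w2 = delta3 * a2
grad_b2 = delta3

# Backward pass, first neuron: propagate delta3 back through w^(2)
# and through the sigmoid of the first neuron.
delta2 = delta3 * w2 * a2 * (1.0 - a2)          # dL/dz^(2)
grad_w1 = delta2 * x
grad_b1 = delta2

# Update all four parameters, using gradients computed with the old values.
w1_new = w1 - learning_rate * grad_w1
b1_new = b1 - learning_rate * grad_b1
w2_new = w2 - learning_rate * grad_w2
b2_new = b2 - learning_rate * grad_b2
```

All gradients are computed before any parameter is changed, so `delta2` uses the original `w2` rather than the updated one.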


Backpropagation: Two Neurons in a layer
• Training sample ($x = 0.5$, $y = 1.5$)
• Learning rate = 0.1
• Input feature, $x$
• Weights, $w$ (initial $w^{(1)} = 0.3$, $w^{(2)} = 0.2$, …)
• Biases, $b$ (initial $b^{(1)} = [b_1^{(1)}, b_2^{(1)}]$, $b^{(2)} = [b_1^{(2)}, b_2^{(2)}]$)
• Weighted sum (linear function): $z(x) = w \cdot x + b$
• Activation function: $a = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
• Output: $\hat{y} = a = \sigma(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{1}{1 + e^{-(wx + b)}}$
Backpropagation: Two Neurons in a layer

• Forward pass
In this case, consider the sum of the two
neurons in the layer
Backpropagation: Two Neurons in a layer

• Backward pass
In this case, it is necessary to update the model parameters
Backpropagation: Two Neurons in a layer

• Backward pass
For the parameters of the first layer
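The equations for this example were figures and the slide elides most of the initial values, so the sketch below rests on stated assumptions: one input, a hidden layer with two sigmoid neurons, and a single sigmoid output neuron with one bias that sums the weighted activations of both hidden neurons. Every weight and bias value in the code is hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x):
    """Assumed architecture: x -> two hidden sigmoid neurons -> one sigmoid output."""
    hidden = [sigmoid(p["w1"][j] * x + p["b1"][j]) for j in range(2)]
    # The output neuron sums the weighted activations of both hidden neurons.
    z_out = sum(p["w2"][j] * hidden[j] for j in range(2)) + p["b2"]
    return hidden, sigmoid(z_out)

def loss(p, x, y):
    return 0.5 * (y - forward(p, x)[1]) ** 2

def gradients(p, x, y):
    hidden, y_hat = forward(p, x)
    delta_out = -(y - y_hat) * y_hat * (1.0 - y_hat)
    # Each hidden neuron receives the output error through its own weight.
    delta_hidden = [delta_out * p["w2"][j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(2)]
    return {
        "w2": [delta_out * h for h in hidden],
        "b2": delta_out,
        "w1": [d * x for d in delta_hidden],   # parameters of the first layer
        "b1": list(delta_hidden),
    }

def gradient_step(p, g, learning_rate):
    return {
        "w1": [w - learning_rate * gw for w, gw in zip(p["w1"], g["w1"])],
        "b1": [b - learning_rate * gb for b, gb in zip(p["b1"], g["b1"])],
        "w2": [w - learning_rate * gw for w, gw in zip(p["w2"], g["w2"])],
        "b2": p["b2"] - learning_rate * g["b2"],
    }

# Hypothetical initial parameters (not taken from the slides).
params = {"w1": [0.3, -0.2], "b1": [0.1, 0.05], "w2": [0.2, 0.4], "b2": 0.4}
x, y, learning_rate = 0.5, 1.5, 0.1

loss_before = loss(params, x, y)
grads = gradients(params, x, y)
new_params = gradient_step(params, grads, learning_rate)
loss_after = loss(new_params, x, y)
```

The only structural change from the chained two-neuron case is that the output error is split across the two hidden neurons, each through its own outgoing weight.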
Example: implementation of backpropagation
Example adapted from:
• https://github.com/CodigoMaquina/code/blob/main/machine_learning_python/backpropagation_paso_a_paso.ipynb
• https://www.youtube.com/watch?v=iOsR-EC9z6I
Backpropagation: limitations & challenges
• Data quality: poor data quality, including noise, incompleteness, or bias, can lead to inaccurate models, because backpropagation learns exactly what it is given
• Training duration: backpropagation often requires extensive training time, which can be impractical when dealing with large networks
• Matrix-based complexity: the matrix operations in backpropagation scale with the network size, which increases the computational demand
Accelerating learning: Gradient Descent
• Gradient descent updates parameters by moving against the gradient of the loss function.
• All the training data is considered for a single step: the average of the gradients of all the training examples (the mean gradient) is used to update the parameters. One epoch therefore corresponds to a single gradient descent step.
• Use: small-scale models with well-behaved loss functions.
Accelerating learning: Stochastic Gradient Descent (SGD)
• SGD updates the model parameters based on a single data point, offering
faster convergence but more variance.
• Parameters are updated after processing each data point.
• Use: Large datasets with noisy updates.
Accelerating learning: Mini-Batch SGD
• Mini-Batch Gradient Descent strikes a balance by updating parameters
using a small batch of data points at a time.
• This speeds up the training while reducing variance compared to pure SGD.
• Use: Large datasets where full gradient descent is too slow.
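The three variants differ only in how many samples contribute to each update. The sketch below makes that explicit with a single `batch_size` parameter. It is an illustration under assumptions: the model (a line $y = w x + b$), the noise-free data, the learning rate and the function name `train` are all hypothetical and not taken from the slides.

```python
import random

def train(data, batch_size, learning_rate=0.5, epochs=200, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error.

    batch_size == len(data) -> gradient descent (one step per epoch)
    batch_size == 1         -> stochastic gradient descent
    anything in between     -> mini-batch SGD
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Mean gradient of 0.5 * (w*x + b - y)**2 over the batch.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

w_gd, b_gd = train(data, batch_size=len(data))   # gradient descent
w_sgd, b_sgd = train(data, batch_size=1)         # SGD
w_mb, b_mb = train(data, batch_size=4)           # mini-batch SGD
```

On this small, noise-free problem all three settings recover roughly the same line. The trade-off described above (cost per step against variance of the updates) matters on larger and noisier datasets.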

• Figure: SGD vs Mini-Batch SGD
Accelerating learning
• Figure: MSE vs epochs for GD, SGD and Mini-Batch SGD


Accelerating learning: SGD + Momentum
• Momentum is a method to accelerate learning with SGD
• It introduces a new variable $v$ (velocity): the direction and speed at which the parameters move as learning progresses
• The velocity accumulates the previous gradients
• What is the role of the momentum coefficient?
  • If it is larger, the current update is more affected by the previous gradients
  • It is usually set to a high value, between 0.8 and 0.9
Accelerating learning: SGD + Momentum
• SGD with Momentum adds a momentum term (v) to the gradient to smooth the
update direction, helping the optimizer avoid local minima and speed up convergence.
It’s useful when the loss function has steep or shallow regions.
• Use: Tasks requiring fast convergence and stabilization of noisy gradients.
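The update formula on the slide did not survive extraction. One common formulation, assumed here, keeps a velocity $v \leftarrow \beta v - \eta \nabla L(\theta)$ and then moves the parameters by $\theta \leftarrow \theta + v$, where $\beta$ is the momentum coefficient and $\eta$ the learning rate (both symbols are assumptions). The quadratic loss and all numeric values in this sketch are hypothetical.

```python
def grad(theta):
    """Gradient of a hypothetical loss L = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return [theta[0], 25.0 * theta[1]]

def momentum_step(theta, velocity, learning_rate=0.03, beta=0.9):
    """One update with momentum: the velocity accumulates past gradients."""
    g = grad(theta)
    velocity = [beta * v - learning_rate * gi for v, gi in zip(velocity, g)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity

theta = [1.0, 1.0]       # arbitrary starting point
velocity = [0.0, 0.0]    # no accumulated gradient yet

for _ in range(300):
    theta, velocity = momentum_step(theta, velocity)

final_loss = 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)
```

With `beta = 0` the step reduces to plain gradient descent. A larger `beta` gives previous gradients more weight in the current update, which matches the role described on the slide.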
Accelerating learning: SGD + Momentum
• Figure: SGD vs SGD + Momentum
Accelerating learning: SGD + Better Momentum (Nesterov)
• Nesterov proposed another approach to compute momentum
  • First take a step in the direction of the accumulated gradient
  • Then calculate the gradient and make a correction
• Figure: SGD + Momentum vs SGD + Nesterov
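The two Nesterov steps can be sketched as follows. This uses one common formulation, assumed here because the slide gives no formula: evaluate the gradient at the look-ahead point $\theta + \beta v$, then update the velocity and the parameters. The symbols $\beta$ and $\eta$, the quadratic loss and all numeric values are hypothetical.

```python
def grad(theta):
    """Gradient of a hypothetical loss L = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return [theta[0], 25.0 * theta[1]]

def nesterov_step(theta, velocity, learning_rate=0.03, beta=0.9):
    # 1) Take a provisional step in the direction of the accumulated gradient.
    lookahead = [t + beta * v for t, v in zip(theta, velocity)]
    # 2) Compute the gradient there and use it to correct the velocity.
    g = grad(lookahead)
    velocity = [beta * v - learning_rate * gi for v, gi in zip(velocity, g)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity

theta = [1.0, 1.0]       # arbitrary starting point
velocity = [0.0, 0.0]

for _ in range(300):
    theta, velocity = nesterov_step(theta, velocity)

final_loss = 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)
```

The only difference from the standard momentum update is where the gradient is evaluated: at the look-ahead point instead of at the current parameters.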


Readings
Martín del Brío, B. & Sanz Molina, A. Redes neuronales y sistemas borrosos.
• Chapter 2.5: El perceptrón multicapa (the multilayer perceptron)
Credits
• Slides adapted from the post “Backpropagation Step by Step”, available at https://datamapu.com/posts/deep_learning/backpropagation/
• Slides adapted from Lecture 6, “Optimization for Deep Neural Networks”, CMSC 35246: Deep Learning
