
Module 6 - Deep Learning

Simple feed forward networks – Computation graph for deep learning – Convolution neural networks – Learning algorithms – Generalization – Recurrent Neural Networks – Deep reinforcement learning
Neural Networks
Origin of Neural Networks – AI [since the 1940s]:
"Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains." [Wikipedia]
Neural Networks and the Brain
• brain
– set of interconnected modules
– performs information processing operations
• sensory input analysis
• memory storage and retrieval
• reasoning
• feelings
• consciousness

• neurons
– basic computational elements
– heavily interconnected with other neurons
[Russell & Norvig, 1995]
Neuron Diagram
• soma
– cell body
• dendrites
– incoming branches
• axon
– outgoing branch
• synapse
– junction between a dendrite and an axon from another neuron
[Russell & Norvig, 1995]


Neural Networks and the Brain (Cont.)
The human brain incorporates nearly 10 billion neurons and 60 trillion connections between them.

Our brain can be considered as a highly complex, non-linear and parallel information-processing system.

Learning is a fundamental and essential characteristic of biological neural networks.
Analogy between biological and artificial neural networks

Biological Neural Network    Artificial Neural Network
Soma                         Neuron / Node
Dendrite                     Input
Axon                         Output
Synapse                      Weight
Also a good reference on the history of Neural Networks:
“A brief history of Neural Nets and Deep Learning” by A. Kurenkov
McCulloch-Pitts Neuron (M-P Neuron)

An M-P neuron is a simplified version of a biological neuron, receiving binary inputs (0 or 1) and generating a binary output.
These inputs are weighted equally (all weights set to 1), and the neuron's output is determined by applying a weighted sum to its inputs and comparing it to a threshold.
If the weighted sum is greater than or equal to the threshold, the neuron outputs 1; otherwise, it outputs 0.

The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.
McCulloch-Pitts Neuron (M-P Neuron)

AND: An AND function neuron would only fire when ALL the inputs are ON, i.e., g(x) ≥ 3 here.
OR: An OR function neuron would fire if ANY of the inputs is ON, i.e., g(x) ≥ 1 here.
A small Python check of both gates follows the truth tables below.
McCulloch-Pitts Neuron (M-P Neuron)

AND GATE

x1 x2 y
0 0 0
0 1 0
1 0 0
1 1 1
McCulloch-Pitts Neuron (M-P Neuron)

OR GATE

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 1
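
As a quick check, an M-P neuron can be written in a few lines of Python and verified against the two truth tables above (a sketch; for two inputs, the AND threshold is 2 and the OR threshold is 1):

def mp_neuron(inputs, threshold):
    # McCulloch-Pitts neuron: all weights are 1; output is 1 iff the
    # sum of the binary inputs reaches the threshold
    return 1 if sum(inputs) >= threshold else 0

# AND fires only when ALL inputs are ON -> threshold = number of inputs
# OR fires when ANY input is ON -> threshold = 1
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", mp_neuron([x1, x2], 2), "OR:", mp_neuron([x1, x2], 1))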
McCulloch-Pitts Neuron (M-P Neuron)
Two types of input

An input is known as an 'inhibitory input' if the weight associated with it is negative, and as an 'excitatory input' if the weight associated with it is positive.

Inhibitory inputs have an absolute veto power over any excitatory inputs.
McCulloch-Pitts Neuron (M-P Neuron)
The disadvantages of MCP (McCulloch-Pitts) neurons include:

1. Binary Output Limitation: MCP neurons produce only binary outputs (0 or 1), which limits their ability to represent complex patterns or continuous data.

2. Fixed Weights: In MCP neurons, all inputs are equally weighted (usually set to 1), which may not reflect the varying importance of different inputs in real-world scenarios.

3. Lack of Learning: MCP neurons do not have mechanisms for learning or adjusting their weights based on experience or training data, making them unsuitable for tasks requiring adaptation or optimization.

4. Inability to Handle Non-Linearities: Due to their linear threshold function, MCP neurons struggle with tasks that involve non-linear relationships or complex decision boundaries.

5. Limited Complexity: The simplicity of MCP neurons limits their ability to model sophisticated behaviors or cognitive processes found in biological neural networks.

6. Scalability: MCP neurons may not scale well to handle large amounts of data or complex networks, as their binary nature and fixed weights can lead to computational inefficiency.
Perceptron
The perceptron consists of 4 parts.

Input values or One input layer: The input layer of the perceptron is made of artificial input neurons and takes the initial data into the system for further processing.

Weights and Bias:
Weight: It represents the strength of the connection between units. If the weight from node 1 to node 2 is larger, then neuron 1 has a greater influence on neuron 2.
Bias: It is the same as the intercept added in a linear equation. It is an additional parameter whose task is to shift the output produced from the weighted sum of the inputs to the neuron.

Net sum: It calculates the total weighted sum of the inputs.

Activation Function: Whether a neuron is activated or not is determined by an activation function, which takes the weighted sum plus the bias and produces the result.
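
To make the four parts concrete, here is a minimal sketch of a single perceptron's forward computation; the weights, bias, and threshold are illustrative values, not taken from the slides:

import numpy as np

def perceptron_forward(x, w, b, threshold=0.0):
    # Net sum: weighted inputs plus bias; step activation against the threshold
    net_sum = np.dot(w, x) + b
    return 1 if net_sum >= threshold else 0

x = np.array([1.0, 0.0])   # illustrative input
w = np.array([0.6, 0.4])   # illustrative weights
print(perceptron_forward(x, w, b=-0.5))  # -> 1, since 0.6 - 0.5 >= 0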
Perceptron vs McCulloch-Pitts Neuron

M-P Neuron: binary inputs, the weights are equal, and the threshold value is predefined.
Perceptron: the weights, including the threshold, can be learned, and the inputs can be real values.
Modern Neural Networks – Data Science (from Machine Learning) [since the 1990s but mostly after 2006]

"Until 2006 we didn't know how to train neural networks to surpass more traditional approaches, except for a few specialized problems. What changed in 2006 was the discovery of techniques for learning in so-called deep neural networks."
Activation Functions
1. Linear Activation Function
2. Non-linear Activation Functions

Linear Activation Function
Equation: f(x) = x
Range: (-infinity to infinity)
It doesn't help with the complexity of the usual data that is fed to neural networks.

Non-linear Activation Functions
These make it easy for the model to generalize or adapt to a variety of data and to differentiate between the outputs.
Activation Functions
The main terminologies needed to understand nonlinear functions are:

Derivative or Differential: Change in the y-axis w.r.t. change in the x-axis. It is also known as the slope.

Monotonic function: A function which is either entirely non-increasing or non-decreasing.
A function is monotonic if its first derivative (which need not be continuous) does not change sign.
Common Activation Functions

• Step(x) = 1 if x >= t, else 0
• Sign(x) = +1 if x >= 0, else -1
• Sigmoid(x) = 1/(1 + e^(-x))
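
These three functions translate directly into Python (a sketch; the step threshold t is a free parameter):

import numpy as np

def step(x, t=0.0):
    return np.where(x >= t, 1, 0)      # 1 if x >= t, else 0

def sign(x):
    return np.where(x >= 0, 1, -1)     # +1 if x >= 0, else -1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes input into (0, 1)

print(step(np.array([-1.0, 2.0])))   # [0 1]
print(sign(np.array([-1.0, 2.0])))   # [-1  1]
print(sigmoid(np.array([0.0])))      # [0.5]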
Sigmoid or Logistic Activation Function

The softmax function is a more generalized logistic activation function which is used for multiclass classification.

The function is differentiable. That means we can find the slope of the sigmoid curve at any point.

The function is monotonic, but the function's derivative is not.
Tanh or Hyperbolic Tangent Activation Function
The range of the tanh function is (-1 to 1). tanh is also sigmoidal (s-shaped).

The advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero in the tanh graph.

The function is differentiable.

The function is monotonic while its derivative is not monotonic.

The tanh function is mainly used for classification between two classes.

Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
ReLU (Rectified Linear Unit) Activation Function
The ReLU is the most used activation function in the world right now, since it is used in almost all convolutional neural networks and deep learning models.

f(z) is zero when z is less than zero, and f(z) is equal to z when z is above or equal to zero: the ReLU is half rectified (from the bottom).

Range: [0 to infinity)

The function and its derivative are both monotonic.

But the issue is that all the negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly. Any negative input given to the ReLU activation function turns the value into zero immediately in the graph, which in turn affects the resulting graph by not mapping the negative values appropriately.
Leaky ReLU

f(z) = z when z > 0 and f(z) = a·z when z ≤ 0. The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.

When a is not fixed at 0.01, it is called Randomized ReLU.

Therefore the range of the Leaky ReLU is (-infinity to infinity).

Both Leaky and Randomized ReLU functions are monotonic in nature. Their derivatives are also monotonic in nature.
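
A minimal numpy sketch of ReLU and Leaky ReLU (the slope a = 0.01 follows the text above):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs, identity otherwise

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)   # small slope a instead of a hard zero

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(leaky_relu(z))  # [-0.02  0.    3.  ]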
Why is derivative/differentiation used?

When updating the curve, to know in which direction and how much to change or update the curve depending upon the slope.

That is why we use differentiation in almost every part of Machine Learning and Deep Learning.
Types of perceptron
Single Layer Perceptron
Multi-Layer Perceptron

Single layer: A single layer perceptron can learn only linearly separable patterns.

Multilayer: A multilayer perceptron has two or more layers and a greater processing power. It is mainly similar to a single-layer perceptron model but has more hidden layers.

A single layer perceptron (SLP) is a feed-forward network based on a threshold transfer function. SLP is the simplest type of artificial neural network and can only classify linearly separable cases with a binary target (1, 0).
The single layer perceptron does not have a priori knowledge, so the initial weights are assigned randomly.
SLP sums all the weighted inputs, and if the sum is above the threshold (some predetermined value), SLP is said to be activated (output = 1).

The input values are presented to the perceptron, and if the predicted output is the same as the desired output, then the performance is considered satisfactory and no changes to the weights are made.
If the output does not match the desired output, then the weights need to be changed to reduce the error.
AND GATE PROBLEM – Single Layer Perceptron

Assume all the initial weights are 0.9.

Record 1 (x1 = 0, x2 = 0): Σ = x1·w1 + x2·w2 = 0 × 0.9 + 0 × 0.9 = 0 → predicted output = 0 (correct)

Record 2 (x1 = 0, x2 = 1): Σ = x1·w1 + x2·w2 = 0 × 0.9 + 1 × 0.9 = 0.9

AND GATE PROBLEM – Single Layer Perceptron

Record 2: Σ = 0 × 0.9 + 1 × 0.9 = 0.9 → predicted output = 1

Error in prediction: the actual value is 0 but the perceptron predicted 1.
AND GATE PROBLEM – Single Layer Perceptron

Error () = Actual -Predicted  = 0-1 = -1

Update the weight


AND GATE PROBLEM – Single Layer Perceptron

Error () = Actual -Predicted


 = 0-1 = -1 Update the weight

= *(actual – predicted) *
 = 0.5 Assume
= 0.5 *(0 – 1) * 0 = 0.9 No change in weight

= *(actual – predicted) *

= 0.5 *(0 – 1) * 1 = 0.9 – 0.5 = 0.4 New weight updated


3RD RECORD (x1 = 1, x2 = 0)

Σ = x1·w1 + x2·w2 = 1 × 0.9 + 0 × 0.4 = 0.9 → predicted output = 1

Error in prediction (actual = 0). Update the weights with η = 0.5:

w1 = 0.9 + 0.5 × (0 − 1) × 1 = 0.9 − 0.5 = 0.4 (new weight updated)

w2 = 0.4 + 0.5 × (0 − 1) × 0 = 0.4 (no change in weight)


4TH RECORD (x1 = 1, x2 = 1)

Σ = 1 × 0.4 + 1 × 0.4 = 0.8 → predicted output = 1 (actual = 1: correct, no weight change)

Epoch 1

x1  x2  y  ŷ   Initial w1  Initial w2  Updated w1      Updated w2
0   0   0  0   0.9         0.9         No change       No change
0   1   0  1   0.9         0.9         No change       0.4 (updated)
1   0   0  1   0.9         0.4         0.4 (updated)   No change
1   1   1  1   0.4         0.4         No change       No change
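
The hand-worked epoch can be reproduced with a short Python sketch. The initial weights (0.9) and learning rate (0.5) follow the slides; the firing threshold is not stated there, so 0.5 is assumed (any value in (0.4, 0.8] gives the same predictions as the table above):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # AND gate inputs
y = np.array([0, 0, 0, 1])                      # AND gate targets

w = np.array([0.9, 0.9])   # initial weights from the worked example
eta = 0.5                  # learning rate, as in the slides
threshold = 0.5            # assumed (not given in the slides)

for epoch in range(10):
    errors = 0
    for xi, target in zip(X, y):
        predicted = 1 if np.dot(w, xi) >= threshold else 0  # step activation
        error = target - predicted
        if error != 0:
            w = w + eta * error * xi   # perceptron update rule
            errors += 1
    print(f"Epoch {epoch + 1}: weights = {w}")
    if errors == 0:
        break                          # converged: every record classified correctly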
AND GATE PROBLEM – Single Layer Perceptron
Because SLP is a linear classifier, if the cases are not linearly separable the learning process will never reach a point where all the cases are classified properly.
The most famous example of the inability of the perceptron to solve problems with linearly non-separable cases is the XOR problem.

However, a multi-layer perceptron using the backpropagation algorithm can successfully classify the XOR data.
Multi-layer feed-forward Artificial Neural Network

The inner layers for deeper processing of the inputs are known as hidden layers. The hidden layers are not directly exposed to the input or the output. This architecture is known as a Multilayer Perceptron (MLP).
Sigmoid neurons
Introducing sigmoid neurons, where the output function is much smoother than the step function.

We no longer see a sharp transition around the threshold -w0.

Also, the output y is no longer binary but a real value between 0 and 1, which can be interpreted as a probability.

Instead of a like/dislike decision we get the probability of liking the movie.
Perceptron Vs Sigmoid neurons

Perceptron: not smooth, not continuous (at w0), not differentiable.
Sigmoid neuron: smooth, continuous, differentiable.
A typical Supervised Machine Learning Setup
Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable.

What does "cannot deal with" mean?

What would happen if we use a perceptron model to classify this data?

Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications.
From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error.
A typical Supervised Machine Learning Setup

Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function.

Parameters: In all the above cases, w is a parameter which needs to be learned from the data.

Learning algorithm: An algorithm for learning the parameters (w) of the model (for example, the perceptron learning algorithm, gradient descent, etc.).
A typical Supervised Machine Learning Setup
Consider our movie example.

The learning algorithm should aim to find a w which minimizes the squared error between y and ŷ.
Learning Parameters: (Infeasible) guess work
Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function.

σ stands for the sigmoid function (the logistic function in this case).

Consider a very simplified version of the model having just 1 input: f(x) = σ(wx + b) = 1 / (1 + e^(−(wx + b)))
Learning Parameters: (Infeasible) guess work
What does it mean to train the network?

Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9).

At the end of training we expect to find w*, b* such that:

f(0.5) → 0.2 and f(2.5) → 0.9
Learning Parameters: (Infeasible) guess work
Can we try to find such a w* and b* manually?

Let us try a random guess (say, w = 0.5, b = 0).

Clearly not good, but how bad is it?
Learning Parameters: (Infeasible) guess work

With some guess work and intuition we were able to find the right values for w and b.
Learning Parameters: (Infeasible) guess work
Let us look at the geometric interpretation of our "guess work" algorithm in terms of this error surface.
What's Next

There is a more efficient and principled way of doing the weight calculation:

Learning Parameters: Gradient Descent


Understanding the Mathematics behind Gradient Descent
Agile is a pretty well-known term in the software development process.
The basic idea behind it is simple: build something quickly ➡️ get it out there ➡️ get some feedback ➡️ make changes depending upon the feedback ➡️ repeat the process.
The goal is to get the product near the user and let feedback guide you toward the best possible product with the least error.
Also, the steps taken for improvement need to be small and should constantly involve the user.
The idea of "start with a solution as soon as possible, measure and iterate as frequently as possible" is Gradient Descent under the hood.
Gradient Descent

At step n, the weights of the neural network are all modified by the product of the hyperparameter α (the learning rate) times the gradient of the cost function, computed with those weights. If the gradient is positive, then we decrease the weights; and conversely, if the gradient is negative, then we increase them. (Following the gradient uphill instead would be Gradient Ascent.)

Neural networks have very complex loss surfaces and finding the optimum is difficult.
Understanding the Mathematics behind Gradient Descent
Objective
The gradient descent algorithm is an iterative process that takes us to the minimum of a function.

The formula below sums up the entire Gradient Descent algorithm in a single line.
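
For parameters θ, cost function J, and learning rate α, the standard single-step update (a sketch of the rule the slide refers to) is:

θ_new = θ_old − α · dJ/dθ

i.e., take a step of size α against the gradient.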
Understanding the Mathematics behind Gradient Descent
A Machine Learning Model
Consider a bunch of data points in a 2D space. Assume that the data is related to the height and weight of a group of students. Draw an arbitrary line in space that passes through some of these data points.

We are trying to predict some relationship between these quantities so that we can predict the weight of some new students afterward.

This is essentially a simple example of a supervised Machine Learning technique.
Understanding the Mathematics behind Gradient Descent
Predictions
Given a known set of inputs and their corresponding outputs, a machine learning model tries to make some predictions for a new set of inputs.

The Error would be the difference between the two predictions (the model's output and the desired output). This relates to the idea of a Cost function or Loss function.

Understanding the Mathematics behind Gradient Descent
Cost Function
A Cost Function/Loss Function evaluates the performance of our Machine Learning algorithm.
The Loss function computes the error for a single training example, while the Cost function is the average of the loss functions over all the training examples.
Let's say there are a total of 'N' points in the dataset, and for all those 'N' data points, we want to minimize the error.

So the Cost function would be the total squared error.

The goal of any Learning Algorithm is to minimize the Cost Function.
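
For a line ŷi = m·xi + b, one standard form of this cost consistent with the text is:

Cost(m, b) = (1/N) · Σ (yi − (m·xi + b))², summed over i = 1..N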
Understanding the Mathematics behind Gradient Descent
How do we minimize any function?

Suppose the cost function is of the form Y = X².

To minimize the function, we need to find the value of X that produces the lowest value of Y, which is the red dot in the figure.
It is pretty easy to locate the minima here since it is a 2D graph, but this may not always be the case, especially in higher dimensions.

We devise an algorithm to locate the minima, and that algorithm is called Gradient Descent.
Understanding the Mathematics behind Gradient Descent
Consider that you are walking along the graph below, and you are currently at the 'green' dot. You aim to reach the minimum, i.e., the 'red' dot, but from your position, you are unable to view it.
Possible actions would be:
You might go upward or downward.
Having decided which way to go, you might take a bigger step or a smaller step to reach your destination.

Essentially, there are two things that you should know to reach the minima: which way to go and how big a step to take.
Understanding the Mathematics behind Gradient Descent
The Minimum Value
From the tangent at the green point, we know that if we are moving upwards, we are moving away from the minima, and vice versa.

Also, the tangent gives us a sense of the steepness of the slope.

The slope at the blue point is less steep than that at the green point, which means it will take much smaller steps to reach the minimum from the blue point than from the green point.
Understanding the Mathematics behind Gradient Descent
Mathematical Interpretation of the Cost Function
• Let us now put all these learnings into a mathematical formula.
• In the equation y = mX + b, 'm' and 'b' are its parameters.
• During the training process, there will be a small change in their values.
• Let that small change be denoted by δ.
• The values of the parameters will be updated as m = m − δm and b = b − δb, respectively.
• Our aim here is to find those values of m and b in y = mX + b for which the error is minimum, i.e., values that minimize the cost function.

The idea is that by being able to compute the derivative/slope of the function, we can find the minimum of the function.
Understanding the Mathematics behind Gradient Descent
The Learning Rate
The size of the steps taken to reach the minimum or bottom is called the Learning Rate.

Derivatives

We use derivatives to decide whether to increase or decrease the weights so as to increase or decrease the objective function.

Two concepts from calculus are needed:
Chain Rule
Power Rule
Understanding the Mathematics behind Gradient Descent
Calculating Gradient Descent
We apply these rules of calculus to our original equation and find the derivative of the Cost Function w.r.t. both 'm' and 'b'.
Calculate the gradient of the Error w.r.t. both m and b:

m¹, b¹ = next position parameters; m⁰, b⁰ = current position parameters
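
Written out for the mean squared error cost above (a sketch; L denotes the learning rate):

∂E/∂m = (−2/N) · Σ xi · (yi − (m·xi + b))
∂E/∂b = (−2/N) · Σ (yi − (m·xi + b))

m¹ = m⁰ − L · ∂E/∂m
b¹ = b⁰ − L · ∂E/∂b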
Example
Find the local minima of the function y = (x+5)² starting from the point x = 3.
Step 1: Initialize x = 3. Then, find the gradient of the function, dy/dx = 2*(x+5). Learning rate → 0.01.

https://gist.github.com/rohanjoseph93/ecbbb9fb1715d5c248bcad0a7d3bffd2#file-gradient_descent-ipynb
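
A minimal Python sketch of this iteration (the stopping criterion and iteration cap are assumed, in the spirit of the linked gist):

cur_x = 3.0              # starting point
rate = 0.01              # learning rate
precision = 1e-6         # stop when the step becomes smaller than this
max_iters = 10000

df = lambda x: 2 * (x + 5)   # dy/dx for y = (x + 5)^2

for i in range(max_iters):
    prev_x = cur_x
    cur_x = cur_x - rate * df(prev_x)   # step against the gradient
    if abs(cur_x - prev_x) < precision:
        break

print("Local minimum occurs at x =", round(cur_x, 5))   # approaches -5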



Multi-layer Perceptron - Backpropagation algorithm
A multi-layer perceptron (MLP) has the same structure as a single layer perceptron, with one or more hidden layers.
The backpropagation algorithm consists of two phases:
• the forward phase, where the activations are propagated from the input to the output layer, and
• the backward phase, where the error between the observed actual value and the requested nominal value in the output layer is propagated backwards in order to modify the weights and bias values.
What is a Feed Forward Neural Network?
The most fundamental kind of neural network, in which input data travels in only one direction, passing through artificial neural nodes and leaving through output nodes. Input and output layers are always present, while hidden layers may or may not be present. Based on this, they are further divided into single-layered and multi-layered feed-forward neural networks.
Neural networks can modify their weights during training based on a rule known as the delta rule, which allows them to compare their outputs to the expected values.
The complexity of the representable function grows with the number of layers. Signals cannot spread backward; they can only go forward, and during this forward pass the weights are unchanged. Weights are applied to the inputs before being passed to an activation function.
Forward propagation
Propagate the inputs by summing all the weighted inputs and then computing the outputs using the sigmoid threshold.
Backward Propagation
Propagates the errors backward by apportioning them to each unit according to the amount of the error the unit is responsible for.
Back-Propagation in Multilayer Feedforward Neural Networks
Back-propagation refers to the method used during network training. More specifically, back-propagation refers to a simple method for calculating the gradient of the network, that is, the first derivative of the error with respect to the weights in the network.
The primary objective of network training is to estimate an appropriate set of network weights based upon a training dataset.
Many ways have been researched for estimating these weights, but they all involve minimizing some error function. The commonly used error function is the sum-of-squared errors:
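
In one common form (targets t, network outputs o, summed over output units k and training cases n; a standard definition consistent with the text):

E = (1/2) · Σn Σk (t_nk − o_nk)²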

Training uses one of several possible optimization methods to minimize this error term. Some of the more common are: steepest descent / gradient descent, quasi-Newton, conjugate gradient, and many various modifications of these optimization routines.
Back-Propagation in Multilayer Feedforward Neural Networks

Back-propagation is a method for calculating the first derivative of the error function with respect to each network weight.
Back-Propagation in Multilayer Feedforward Neural Networks
Example of a Feed Forward Neural Network

x1, x2 are the inputs
w1, w2 are the coefficient weights for each input
y is the output
z is the weighted input
b is the bias
σ(z) is the activation function, which represents the sigmoid function
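
Putting these variables together, the single-neuron computation is:

z = w1·x1 + w2·x2 + b
y = σ(z) = 1 / (1 + e^(−z))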
Variables and Parameters

A neural network classifying a 4×3 pixel picture.

Binary Classification of a Picture Using 1 Unit

Let's say we put the following picture as input. The weights are set as random values (0 to 1) and the bias is set as 0.

The inputs of the input layer, that is, the inputs of the hidden layer, would be as follows.
Binary Classification of a Picture Using 1 Unit
Calculate using all of the units of the neural network, and compute the final output of the whole network. Initialize the weights with random values (0~1) and the bias as 0.

Input layer → Hidden layer

Hidden layer → Output layer

Since y ≥ 0.5, y is close to 1; this means the number on the picture is classified as 1.

In other words, the parameters were aimlessly set up in the first place and happened to give an accurate result. When learning, the neural network adjusts its own parameters to give a more accurate classification result. To do this, an algorithm called "Backpropagation" is used to calculate how much update is needed for each weight.
Overall picture
What is the derivative of the logistic sigmoid function?
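The answer is the standard identity that makes the sigmoid so convenient for backpropagation:

σ(z) = 1 / (1 + e^(−z))  ⇒  σ'(z) = σ(z) · (1 − σ(z))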
Chain rule
As put by George F. Simmons: "If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man."
Basic structure
Backpropagation Example
Use a neural network with two inputs, two hidden neurons, and two output neurons. Additionally, the hidden and output neurons will each include a bias.

The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs. The initial weights and the biases are shown in the figure.

Example: a single training set: given inputs 0.05 and 0.10, we want the neural network to output 0.01 and 0.99.

Backpropagation Example
The Forward Pass

See what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10.
To do this we'll feed those inputs forward through the network.

To figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (we use the logistic function), then repeat the process with the output layer neurons.
The Forward Pass
Backpropagation Example

Calculating the Total Error

The target output for o1 is 0.01 but the neural network outputs 0.75136507, so the error is E_o1 = ½ · (target_o1 − out_o1)².

Repeating this process for o2 (remembering that the target is 0.99), the total error for the neural network is the sum of these errors: E_total = E_o1 + E_o2.
Backpropagation Example
The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.
Output Layer
Consider w5: we want to know how much a change in w5 affects the total error, by applying the chain rule.
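
Concretely, the chain rule factors this sensitivity into three local derivatives, and the weight is then updated with learning rate η (the standard form for this construction):

∂E_total/∂w5 = (∂E_total/∂out_o1) · (∂out_o1/∂net_o1) · (∂net_o1/∂w5)

w5 ← w5 − η · ∂E_total/∂w5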


Backpropagation Example
The Backwards Pass – Output Layer

First, how much does the total error change with respect to the output?
Backpropagation Example
The Backwards Pass – Output Layer

how much does the output of o1 change with respect to its total net input?
Backpropagation Example
The Backwards Pass – Output Layer

Finally, how much does the total net input of o1 change with respect to w5?

Update the weight.

We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons.
Backpropagation Example
Hidden Layer: continue the backwards pass by calculating new values for w1, w2, w3, and w4.
We use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore error) of multiple output neurons.

out_h1 affects both out_o1 and out_o2.
Backpropagation Example
Hidden Layer: continue the backwards pass by calculating new values for w1, w2, w3, and w4.

Calculate the partial derivative of the total net input to h1 with respect to w1.
Backpropagation Example
Hidden Layer

Repeating this for w2, w3, and w4
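
To tie the whole worked example together, here is a compact numpy sketch of one training loop for this 2-2-2 network. The inputs (0.05, 0.10) and targets (0.01, 0.99) follow the slides; the specific initial weights and biases from the figure are not reproduced here, so random values stand in for them, and the learning rate of 0.5 is an assumption in the spirit of the standard version of this example:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# 2-2-2 network; random initial parameters stand in for the figure's values
W1 = rng.uniform(size=(2, 2))   # input -> hidden weights
b1 = rng.uniform(size=2)        # hidden bias
W2 = rng.uniform(size=(2, 2))   # hidden -> output weights
b2 = rng.uniform(size=2)        # output bias

x = np.array([0.05, 0.10])      # inputs from the worked example
t = np.array([0.01, 0.99])      # target outputs
eta = 0.5                       # learning rate (assumed)

for step in range(10000):
    # ----- forward pass -----
    out_h = sigmoid(W1 @ x + b1)        # hidden activations
    out_o = sigmoid(W2 @ out_h + b2)    # network outputs
    E_total = 0.5 * np.sum((t - out_o) ** 2)   # sum of squared errors

    # ----- backward pass (chain rule) -----
    # delta_o = dE/dout_o * dout_o/dnet_o
    delta_o = (out_o - t) * out_o * (1 - out_o)
    # each hidden neuron contributes to both output errors
    delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)

    # gradient of E w.r.t. each weight, then a gradient descent update
    W2 -= eta * np.outer(delta_o, out_h)
    b2 -= eta * delta_o
    W1 -= eta * np.outer(delta_h, x)
    b1 -= eta * delta_h

print(f"final outputs: {out_o}, error: {E_total:.6f}")   # outputs approach the targets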


Backpropagation Example

https://www.javatpoint.com/pytorch-backpropagation-process-in-deep-neural-network
https://www.youtube.com/watch?v=wqPt3qjB6uA&list=PLLeO8f6PhlKYLwtebJZzCue0AWW7VVSn4&index=5

https://www.youtube.com/watch?v=QflXxNfMCKo

https://www.youtube.com/watch?v=QflXxNfMCKo&t=629s

https://www.youtube.com/watch?v=FglxznJkGPA

https://www.youtube.com/watch?v=1UmGvau4zm0
