Backpropagation
Sargur Srihari
Topics in Backpropagation
1. Forward Propagation
2. Loss Function and Gradient Descent
3. Computing derivatives using chain rule
4. Computational graph for backpropagation
5. Backprop algorithm
6. The Jacobian matrix
K output activations:
$$a_k = \sum_{i=1}^{M} w_{ki}^{(2)} x_i + w_{k0}^{(2)}, \qquad k = 1, \ldots, K$$
Second-layer weight vectors, one per output unit:
$$W_1^{(2)} = \begin{bmatrix} W_{11} & W_{12} & W_{13} \end{bmatrix}^T, \quad W_2^{(2)} = \begin{bmatrix} W_{21} & W_{22} & W_{23} \end{bmatrix}^T, \quad W_3^{(2)} = \begin{bmatrix} W_{31} & W_{32} & W_{33} \end{bmatrix}^T$$
Loss Function
The overall error is the average of the per-sample losses:
$$E = \frac{1}{N} \sum_{i=1}^{N} E_i, \qquad E_i = L\big(f(x^{(i)}, w),\, t_i\big)$$
[Figure: the forward pass maps input x to output y and loss E_i plus regularizer R(W); the backward pass computes the gradient of E_i + R]
Gradient Descent
• Goal: determine the weights w from a labeled set of training samples
• The learning procedure has two stages:
  1. Evaluate the derivatives of the loss, ∇E(w), with respect to the weights w_1, ..., w_T
  2. Use the derivative vector to compute adjustments to the weights
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)}), \qquad \nabla E(w) = \begin{bmatrix} \dfrac{\partial E}{\partial w_0} \\ \dfrac{\partial E}{\partial w_1} \\ \vdots \\ \dfrac{\partial E}{\partial w_T} \end{bmatrix}$$
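A minimal MATLAB sketch of this update rule on a toy quadratic error; the gradient function, learning rate and iteration count below are illustrative, not from the slides:

eta = 0.1;                        % learning rate
w = ones(3, 1);                   % initial weight vector
gradFun = @(w) 2 * w;             % gradient of the example error E(w) = w'*w
for tau = 1 : 100
    w = w - eta * gradFun(w);     % w(tau+1) = w(tau) - eta * grad E(w(tau))
end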
Composite Function
[Figure: an input weight w feeds z = f(w); two paths, Path 1 with p = g(z) and Path 2 with q = h(z), combine at the output o = k(p, q)]
$$o = k(p, q) = k\big(g(f(w)),\, h(f(w))\big)$$
Computational Graph
An example graph for $e = u + v$, where $u = a \cdot b$, $v = c \cdot t$, and $t = \log d$.
[Figure: computational graph annotated with the local derivatives, e.g. $\partial e/\partial u = 1$, $\partial e/\partial v = 1$, $\partial u/\partial a = b$, $\partial u/\partial b = a$, $\partial v/\partial c = t = \log d$, $\partial v/\partial t = c$]
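Since the numeric annotations on the figure are not fully recoverable, here is a sketch of the forward and backward passes on this graph, using arbitrary example values:

% Forward pass on the graph e = u + v, u = a*b, v = c*t, t = log(d)
a = 3; b = 2; c = 1; d = 5;               % example values (assumed)
t = log(d);  u = a * b;  v = c * t;  e = u + v;
% Backward pass: chain rule along each path from e to the leaves
de_du = 1;  de_dv = 1;                    % e = u + v
de_da = de_du * b;                        % du/da = b
de_db = de_du * a;                        % du/db = a
de_dc = de_dv * t;                        % dv/dc = t = log(d)
de_dt = de_dv * c;                        % dv/dt = c
de_dd = de_dt * (1 / d);                  % dt/dd = 1/d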
Derivatives of f = (x + y)z with respect to x, y, z
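A short worked version, introducing the intermediate variable q = x + y:
$$q = x + y, \quad f = qz \;\Rightarrow\; \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = z \cdot 1 = z, \quad \frac{\partial f}{\partial y} = z, \quad \frac{\partial f}{\partial z} = q = x + y$$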
For $f(x) = e^{\sin(x^2)}$, with inner function $h(x) = x^2$ and $g = \sin$:
$$\frac{dh}{dx} = 2x \qquad \text{because } h(x) = x^2 \text{ and its derivative is } 2x$$
• Therefore
$$\frac{df}{dx} = e^{g(h(x))} \cdot \cos h(x) \cdot 2x = e^{\sin x^2} \cdot \cos x^2 \cdot 2x$$
• In each of these cases we treat the inner function as a single variable and differentiate it as such
2. Another way to view it: $f(x) = e^{\sin(x^2)}$
• Create temporary variables $u = \sin v$ and $v = x^2$; then $f(u) = e^u$, with the corresponding computational graph (figure on the slide)
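A small numerical check of this chain-rule result, comparing the analytic derivative with a central-difference estimate (a sketch; the test point x0 is arbitrary):

f  = @(x) exp(sin(x.^2));                        % f(x) = e^{sin(x^2)}
df = @(x) exp(sin(x.^2)) .* cos(x.^2) .* 2.*x;   % chain-rule derivative
x0 = 0.7;  eps_ = 1e-6;
numeric = (f(x0 + eps_) - f(x0 - eps_)) / (2 * eps_);   % central difference
fprintf('analytic %.6f   numeric %.6f\n', df(x0), numeric);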
Simple Model (Multiple Linear Regression)
• Outputs $y_k$ are linear combinations of inputs $x_i$:
$$y_k = \sum_i w_{ki} x_i$$
• Error function for a particular input $x_n$ is
$$E_n = \frac{1}{2} \sum_k \left( y_{nk} - t_{nk} \right)^2$$
where the summation is over all K outputs and $y_{nk} = y_k(x_n, w)$
• For a particular input x and weight w, the squared error is $\frac{1}{2}\big(y(x, w) - t\big)^2$
Forward Propagation
$$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)$$
• $z_i$ is the activation of a unit (or input) that sends a connection to unit j, and $w_{ji}$ is the weight associated with that connection
• The output is transformed by a nonlinear activation function, $z_j = h(a_j)$
• The variable $z_i$ can be an input, and unit j could be an output
• For each input $x_n$ in the training set, we calculate the activations of all hidden and output units by applying the above equations
• This process is called forward propagation; a minimal sketch follows below
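A minimal forward-propagation sketch for a network with one hidden layer, assuming tanh hidden units and linear outputs; the sizes and weight matrices below are illustrative:

D = 3;  M = 4;  K = 2;                         % input, hidden and output dimensions (example)
W1 = 0.1 * randn(M, D);  b1 = zeros(M, 1);     % first-layer weights and biases
W2 = 0.1 * randn(K, M);  b2 = zeros(K, 1);     % second-layer weights and biases
x  = randn(D, 1);                              % a single input vector
aj = W1 * x + b1;                              % a_j = sum_i w_ji z_i (here z_i = x_i)
zj = tanh(aj);                                 % z_j = h(a_j)
y  = W2 * zj + b2;                             % output activations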
Summarizing Evaluation of the Derivative ∂E_n/∂w_ji
• By the chain rule for partial derivatives:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}$$
• Define $\delta_j \equiv \dfrac{\partial E_n}{\partial a_j}$. Since $a_j = \sum_i w_{ji} z_i$, we have $\dfrac{\partial a_j}{\partial w_{ji}} = z_i$
• Substituting, we get
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$
• Thus the required derivative is obtained by multiplying:
  1. the value of δ for the unit at the output end of the weight
  2. the value of z for the unit at the input end of the weight
• We need to figure out how to calculate δj for each unit of the network
• For output units: $\delta_j = y_j - t_j$. If $E = \frac{1}{2}\sum_j (y_j - t_j)^2$ and $y_j = a_j = \sum_i w_{ji} z_i$ (regression), then $\delta_j = \dfrac{\partial E}{\partial a_j} = y_j - t_j$
• For hidden units, we again make use of the chain rule of derivatives to determine $\delta_j \equiv \dfrac{\partial E_n}{\partial a_j}$
With $a_j = \sum_i w_{ji} z_i$ and $z_j = h(a_j)$, the backpropagation formula for hidden units is
$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$
• The value of δ for a particular hidden unit is obtained by propagating the δ s backward from units higher up in the network
Backpropagation algorithm (continued):
3. Backpropagate the δ s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ to obtain δj for each hidden unit
4. Use $\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ to evaluate the required derivatives
A Simple Example
• Two-layer network
• Standard sum-of-squared error:
$$E_n = \frac{1}{2} \sum_k \left( y_k - t_k \right)^2$$
where $y_k$ is the activation of output unit k and $t_k$ is the corresponding target
• Output units: linear activation functions, i.e., multiple regression: $y_k = a_k$
• Hidden units have the sigmoidal activation function $h(a) = \tanh(a)$, where
$$\tanh(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}$$
which has a simple form for its derivative: $h'(a) = 1 - h(a)^2$
• Forward propagation:
$$z_j = \tanh(a_j), \qquad y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$$
• Output differences:
$$\delta_k = y_k - t_k$$
• Backward propagation (δ s for hidden units): using $\delta_j = h'(a_j)\sum_k w_{kj}\delta_k$ with $h'(a) = 1 - h(a)^2$,
$$\delta_j = \left( 1 - z_j^2 \right) \sum_{k=1}^{K} w_{kj} \delta_k$$
• Batch method:
$$\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}$$
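Putting these formulas together, a sketch of one forward and backward pass for this simple example (tanh hidden units, linear outputs, sum-of-squares error); the dimensions, weights, input and target below are illustrative:

D = 3;  M = 4;  K = 2;                           % example dimensions
W1 = 0.1 * randn(M, D);  W2 = 0.1 * randn(K, M); % example weight matrices
x = randn(D, 1);  t = randn(K, 1);               % one input and its target
% Forward propagation
a = W1 * x;  z = tanh(a);  y = W2 * z;
% Output differences: delta_k = y_k - t_k
dk = y - t;
% Backward propagation: delta_j = (1 - z_j^2) * sum_k w_kj * delta_k
dj = (1 - z.^2) .* (W2' * dk);
% Error derivatives
dE_dW2 = dk * z';                                % dEn/dw_kj^(2) = delta_k * z_j
dE_dW1 = dj * x';                                % dEn/dw_ji^(1) = delta_j * x_i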
Numerical Example
• Network with D = 3 inputs, M = 2 hidden units (z1, z2), K = 1 output (y1), and N = 1 training sample
• Hidden activations: $z_j = \sigma(a_j)$; output: $y_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j$
• Errors: $\delta_j = \sigma'(a_j) \sum_k w_{kj} \delta_k$
• Error derivatives:
$$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j$$
Initializations

% This pseudo-code illustrates implementing a several-layer neural
% network. You need to fill in the missing part to adapt the program to
% your own use. You may have to correct minor mistakes in the program.

% Layer sizes: input dimension, three hidden layers of 100 units, 2 outputs
s{1} = size(train_x, 1);
s{2} = 100;
s{3} = 100;
s{4} = 100;
s{5} = 2;

Performance Evaluation

% Do some predictions to know the performance
%% prepare for the data
a{1} = test_x;
% forward propagation
for i = 2 : numOfHiddenLayer + 1
    % This is essentially doing W{i-1}*a{i-1}+b{i-1}, but since they
    % have different dimensionalities, this addition is not allowed in
    % MATLAB. Another way to do it is to use repmat.
    a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
end

% Here we calculate the sum-of-squares error as the loss function
loss = sum(sum((test_y - a{numOfHiddenLayer + 1}).^2)) / size(test_x, 2);

% Count no. of misclassifications so that we can compare it
% with other classification methods.
% If we let max return two values, the first one represents the max
% value and the second one represents the corresponding index. Since we
% care only about the class the model chooses, we drop the max value
% (using ~ to take its place) and keep the index.
[~, ind_] = max(a{numOfHiddenLayer + 1});
[~, ind]  = max(test_y);
test_wrong = sum( ind_ ~= ind ) / size(test_x, 2) * 100;

% Calculate training error
bs = 2000;                          % minibatch size
nb = size(train_x, 2) / bs;         % no. of mini-batches
train_error = 0;
% Here we go through all the mini-batches
for ll = 1 : nb
    % Use submatrices to pick out mini-batches
    a{1} = train_x(:, (ll-1)*bs+1 : ll*bs );
    yy   = train_y(:, (ll-1)*bs+1 : ll*bs );
    for i = 2 : numOfHiddenLayer + 1
        a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
    end
    train_error = train_error + sum(sum((yy - a{numOfHiddenLayer + 1}).^2));
end
train_error = train_error / size(train_x, 2);

losses       = [losses loss];
test_wrongs  = [test_wrongs, test_wrong];
train_errors = [train_errors train_error];
end
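The backward pass and weight update are the part this pseudo-code leaves for you to fill in. One possible completion in the same style (cell arrays W{i}, b{i}, a{i}; assuming sigm is the logistic sigmoid, so its derivative at an activation a is a.*(1-a)); this is an assumed sketch, not the course's official solution:

% Hypothetical completion: backward pass for the sum-of-squares loss
eta = 0.1;                                          % learning rate (assumed)
L = numOfHiddenLayer + 1;
d{L} = (a{L} - yy) .* a{L} .* (1 - a{L});           % output-layer deltas
for i = L-1 : -1 : 2
    d{i} = (W{i}' * d{i+1}) .* a{i} .* (1 - a{i});  % backpropagate the deltas
end
% Gradient-descent update, averaged over the mini-batch
for i = 1 : L-1
    W{i} = W{i} - eta * (d{i+1} * a{i}') / bs;
    b{i} = b{i} - eta * sum(d{i+1}, 2) / bs;
end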
Efficiency of Backpropagation
• Computational efficiency is the main aspect of backprop
• The number of operations needed to compute the derivatives of the error function scales with the total number W of weights and biases
• A single evaluation of the error function for a single input requires O(W) operations (for large W)
• This is in contrast to O(W²) for numerical differentiation
• As seen next
Numerical Differentiation
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon) \qquad \text{where } \epsilon \ll 1$$
• Accuracy is improved by making ε smaller, until round-off problems arise
• Accuracy can be improved further by using central differences:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2)$$
• This is O(W²): each of the W weights requires its own perturbed evaluation of the error function, and each evaluation costs O(W)
• Useful to check whether software for backprop has been correctly implemented (for some test cases); a gradient-check sketch follows
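A sketch of such a check for a single weight, comparing the backprop derivative with the central-difference estimate; the small tanh network, input and target below are illustrative:

W1 = 0.1 * randn(4, 3);  W2 = 0.1 * randn(2, 4);     % example two-layer network
x = randn(3, 1);  t = randn(2, 1);
En = @(W1) 0.5 * sum((W2 * tanh(W1 * x) - t).^2);    % error as a function of W1
% Backprop derivative for one weight w_ji (here j = 2, i = 1)
z = tanh(W1 * x);  dk = W2 * z - t;
dj = (1 - z.^2) .* (W2' * dk);
backprop = dj(2) * x(1);
% Central-difference estimate
eps_ = 1e-6;  Wp = W1;  Wm = W1;
Wp(2,1) = Wp(2,1) + eps_;  Wm(2,1) = Wm(2,1) - eps_;
numeric = (En(Wp) - En(Wm)) / (2 * eps_);
fprintf('backprop %.8f   numeric %.8f\n', backprop, numeric);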
Summary of Backpropagation
• Derivatives of error function wrt weights are obtained by
propagating errors backward
• It is more efficient than numerical differentiation
• It can also be used for other computations
• As seen next for Jacobian
The Jacobian Matrix
$$J_{ki} = \frac{\partial y_k}{\partial x_i}$$
By the chain rule, the Jacobian links derivatives through a module:
$$\frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_j} \frac{\partial z_j}{\partial w}$$
The Jacobian can also be evaluated numerically using central differences:
$$\frac{\partial y_k}{\partial x_i} = \frac{y_k(x_i + \epsilon) - y_k(x_i - \epsilon)}{2\epsilon} + O(\epsilon^2)$$
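A sketch of evaluating a network's Jacobian by central differences, which can serve as a check on a backprop-based computation; the network and weights below are illustrative:

W1 = randn(2, 3);  W2 = randn(2, 2);          % example network: 3 inputs, 2 outputs
f = @(x) W2 * tanh(W1 * x);                   % y = f(x)
x0 = randn(3, 1);  eps_ = 1e-5;
K = numel(f(x0));  D = numel(x0);
J = zeros(K, D);                              % J_ki = dy_k / dx_i
for i = 1 : D
    e = zeros(D, 1);  e(i) = eps_;
    J(:, i) = (f(x0 + e) - f(x0 - e)) / (2 * eps_);
end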
Summary
• Neural network learning requires learning the weights from samples, which involves two steps:
  • Determine the derivatives of the error with respect to the weights
  • Adjust the weights using the derivatives
• Backpropagation is a general term for computing derivatives
  • Evaluate δk for all output units (using δk = yk - tk for regression)
  • Backpropagate the δk s to obtain δj for each hidden unit
  • The product of the δ s with the activations at the unit provides the derivatives for that weight
• Backpropagation is also useful to compute a Jacobian matrix with several inputs and outputs
  • Jacobian matrices are useful to determine the effects of different inputs