Backpropagation
Sargur Srihari
Topics in Backpropagation
1. Forward Propagation
2. Loss Function and Gradient Descent
3. Computing derivatives using chain rule
4. Computational graph for backpropagation
5. Backprop algorithm
6. The Jacobian matrix
K output activations:
$$a_k = \sum_{i=1}^{M} w_{ki}^{(2)} x_i + w_{k0}^{(2)}, \qquad k = 1, \ldots, K$$
Second-layer weight vectors, one per output unit:
$$W_1^{(2)} = \begin{bmatrix} W_{11} & W_{12} & W_{13} \end{bmatrix}^T, \quad W_2^{(2)} = \begin{bmatrix} W_{21} & W_{22} & W_{23} \end{bmatrix}^T, \quad W_3^{(2)} = \begin{bmatrix} W_{31} & W_{32} & W_{33} \end{bmatrix}^T$$
Loss Function
The overall error is the average of the per-sample losses:
$$E = \frac{1}{N} \sum_{i=1}^{N} E_i, \qquad E_i = L\big(f(x^{(i)}, w),\, t_i\big)$$
[Figure: the forward pass maps input x to output y and loss E_i plus regularizer R(W); the backward pass computes the gradient of E_i + R]
Gradient Descent
• Goal: determine the weights w from a labeled set of training samples
• The learning procedure has two stages:
  1. Evaluate the derivatives of the loss, ∇E(w), with respect to the weights w_1, ..., w_T
  2. Use the derivative vector to compute adjustments to the weights
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)}), \qquad \nabla E(w) = \begin{bmatrix} \dfrac{\partial E}{\partial w_0} \\ \dfrac{\partial E}{\partial w_1} \\ \vdots \\ \dfrac{\partial E}{\partial w_T} \end{bmatrix}$$
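A minimal MATLAB sketch of this update rule on a toy quadratic error; the gradient function, learning rate and iteration count below are illustrative, not from the slides:

eta = 0.1;                        % learning rate
w = ones(3, 1);                   % initial weight vector
gradFun = @(w) 2 * w;             % gradient of the example error E(w) = w'*w
for tau = 1 : 100
    w = w - eta * gradFun(w);     % w(tau+1) = w(tau) - eta * grad E(w(tau))
end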
Composite Function
[Figure: an input weight w feeds z = f(w); two paths, Path 1 with p = g(z) and Path 2 with q = h(z), combine at the output o = k(p, q)]
$$o = k(p, q) = k\big(g(f(w)),\, h(f(w))\big)$$
Computational Graph
An example graph for $e = u + v$, where $u = a \cdot b$, $v = c \cdot t$, and $t = \log d$.
[Figure: computational graph annotated with the local derivatives, e.g. $\partial e/\partial u = 1$, $\partial e/\partial v = 1$, $\partial u/\partial a = b$, $\partial u/\partial b = a$, $\partial v/\partial c = t = \log d$, $\partial v/\partial t = c$]
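Since the numeric annotations on the figure are not fully recoverable, here is a sketch of the forward and backward passes on this graph, using arbitrary example values:

% Forward pass on the graph e = u + v, u = a*b, v = c*t, t = log(d)
a = 3; b = 2; c = 1; d = 5;               % example values (assumed)
t = log(d);  u = a * b;  v = c * t;  e = u + v;
% Backward pass: chain rule along each path from e to the leaves
de_du = 1;  de_dv = 1;                    % e = u + v
de_da = de_du * b;                        % du/da = b
de_db = de_du * a;                        % du/db = a
de_dc = de_dv * t;                        % dv/dc = t = log(d)
de_dt = de_dv * c;                        % dv/dt = c
de_dd = de_dt * (1 / d);                  % dt/dd = 1/d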
Derivatives of f = (x + y)z with respect to x, y, z
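A short worked version, introducing the intermediate variable q = x + y:
$$q = x + y, \quad f = qz \;\Rightarrow\; \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = z \cdot 1 = z, \quad \frac{\partial f}{\partial y} = z, \quad \frac{\partial f}{\partial z} = q = x + y$$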
For $f(x) = e^{\sin(x^2)}$, with inner function $h(x) = x^2$ and $g = \sin$:
$$\frac{dh}{dx} = 2x \qquad \text{because } h(x) = x^2 \text{ and its derivative is } 2x$$
• Therefore
$$\frac{df}{dx} = e^{g(h(x))} \cdot \cos h(x) \cdot 2x = e^{\sin x^2} \cdot \cos x^2 \cdot 2x$$
• In each of these cases we treat the inner function as a single variable and differentiate it as such
2. Another way to view it: $f(x) = e^{\sin(x^2)}$
• Create temporary variables $u = \sin v$ and $v = x^2$; then $f(u) = e^u$, with the corresponding computational graph (figure on the slide)
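A small numerical check of this chain-rule result, comparing the analytic derivative with a central-difference estimate (a sketch; the test point x0 is arbitrary):

f  = @(x) exp(sin(x.^2));                        % f(x) = e^{sin(x^2)}
df = @(x) exp(sin(x.^2)) .* cos(x.^2) .* 2.*x;   % chain-rule derivative
x0 = 0.7;  eps_ = 1e-6;
numeric = (f(x0 + eps_) - f(x0 - eps_)) / (2 * eps_);   % central difference
fprintf('analytic %.6f   numeric %.6f\n', df(x0), numeric);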
Simple Model (Multiple Linear Regression)
• Outputs $y_k$ are linear combinations of inputs $x_i$:
$$y_k = \sum_i w_{ki} x_i$$
• Error function for a particular input $x_n$ is
$$E_n = \frac{1}{2} \sum_k \left( y_{nk} - t_{nk} \right)^2$$
where the summation is over all K outputs and $y_{nk} = y_k(x_n, w)$
• For a particular input x and weight w, the squared error is $\frac{1}{2}\big(y(x, w) - t\big)^2$
Forward Propagation
$$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)$$
• $z_i$ is the activation of a unit (or input) that sends a connection to unit j, and $w_{ji}$ is the weight associated with that connection
• The output is transformed by a nonlinear activation function, $z_j = h(a_j)$
• The variable $z_i$ can be an input, and unit j could be an output
• For each input $x_n$ in the training set, we calculate the activations of all hidden and output units by applying the above equations
• This process is called forward propagation; a minimal sketch follows below
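A minimal forward-propagation sketch for a network with one hidden layer, assuming tanh hidden units and linear outputs; the sizes and weight matrices below are illustrative:

D = 3;  M = 4;  K = 2;                         % input, hidden and output dimensions (example)
W1 = 0.1 * randn(M, D);  b1 = zeros(M, 1);     % first-layer weights and biases
W2 = 0.1 * randn(K, M);  b2 = zeros(K, 1);     % second-layer weights and biases
x  = randn(D, 1);                              % a single input vector
aj = W1 * x + b1;                              % a_j = sum_i w_ji z_i (here z_i = x_i)
zj = tanh(aj);                                 % z_j = h(a_j)
y  = W2 * zj + b2;                             % output activations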
Summarizing Evaluation of the Derivative ∂E_n/∂w_ji
• By the chain rule for partial derivatives:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}$$
• Define $\delta_j \equiv \dfrac{\partial E_n}{\partial a_j}$. Since $a_j = \sum_i w_{ji} z_i$, we have $\dfrac{\partial a_j}{\partial w_{ji}} = z_i$
• Substituting, we get
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$
• Thus the required derivative is obtained by multiplying:
  1. the value of δ for the unit at the output end of the weight
  2. the value of z for the unit at the input end of the weight
• We need to figure out how to calculate δj for each unit of the network
• For output units: $\delta_j = y_j - t_j$. If $E = \frac{1}{2}\sum_j (y_j - t_j)^2$ and $y_j = a_j = \sum_i w_{ji} z_i$ (regression), then $\delta_j = \dfrac{\partial E}{\partial a_j} = y_j - t_j$
• For hidden units, we again make use of the chain rule of derivatives to determine $\delta_j \equiv \dfrac{\partial E_n}{\partial a_j}$
With $a_j = \sum_i w_{ji} z_i$ and $z_j = h(a_j)$, the backpropagation formula for hidden units is
$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$
• The value of δ for a particular hidden unit is obtained by propagating the δ s backward from units higher up in the network
Backpropagation algorithm (continued):
3. Backpropagate the δ s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ to obtain δj for each hidden unit
4. Use $\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ to evaluate the required derivatives
A Simple Example
• Two-layer network
• Standard sum-of-squared error:
$$E_n = \frac{1}{2} \sum_k \left( y_k - t_k \right)^2$$
where $y_k$ is the activation of output unit k and $t_k$ is the corresponding target
• Output units: linear activation functions, i.e., multiple regression: $y_k = a_k$
• Hidden units have the sigmoidal activation function $h(a) = \tanh(a)$, where
$$\tanh(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}$$
which has a simple form for its derivative: $h'(a) = 1 - h(a)^2$
• Forward propagation:
$$z_j = \tanh(a_j), \qquad y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$$
• Output differences:
$$\delta_k = y_k - t_k$$
• Backward propagation (δ s for hidden units): using $\delta_j = h'(a_j)\sum_k w_{kj}\delta_k$ with $h'(a) = 1 - h(a)^2$,
$$\delta_j = \left( 1 - z_j^2 \right) \sum_{k=1}^{K} w_{kj} \delta_k$$
• Batch method:
$$\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}$$
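Putting these formulas together, a sketch of one forward and backward pass for this simple example (tanh hidden units, linear outputs, sum-of-squares error); the dimensions, weights, input and target below are illustrative:

D = 3;  M = 4;  K = 2;                           % example dimensions
W1 = 0.1 * randn(M, D);  W2 = 0.1 * randn(K, M); % example weight matrices
x = randn(D, 1);  t = randn(K, 1);               % one input and its target
% Forward propagation
a = W1 * x;  z = tanh(a);  y = W2 * z;
% Output differences: delta_k = y_k - t_k
dk = y - t;
% Backward propagation: delta_j = (1 - z_j^2) * sum_k w_kj * delta_k
dj = (1 - z.^2) .* (W2' * dk);
% Error derivatives
dE_dW2 = dk * z';                                % dEn/dw_kj^(2) = delta_k * z_j
dE_dW1 = dj * x';                                % dEn/dw_ji^(1) = delta_j * x_i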
Numerical Example
• Network with D = 3 inputs, M = 2 hidden units (z1, z2), K = 1 output (y1), and N = 1 training sample
• Hidden activations: $z_j = \sigma(a_j)$; output: $y_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j$
• Errors: $\delta_j = \sigma'(a_j) \sum_k w_{kj} \delta_k$
• Error derivatives:
$$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j$$
Initializations

% This pseudo-code illustrates implementing a several-layer neural
% network. You need to fill in the missing part to adapt the program to
% your own use. You may have to correct minor mistakes in the program.

% Layer sizes: input dimension, three hidden layers of 100 units, 2 outputs
s{1} = size(train_x, 1);
s{2} = 100;
s{3} = 100;
s{4} = 100;
s{5} = 2;

Performance Evaluation

% Do some predictions to know the performance
%% prepare for the data
a{1} = test_x;
% forward propagation
for i = 2 : numOfHiddenLayer + 1
    % This is essentially doing W{i-1}*a{i-1}+b{i-1}, but since they
    % have different dimensionalities, this addition is not allowed in
    % MATLAB. Another way to do it is to use repmat.
    a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
end

% Here we calculate the sum-of-squares error as the loss function
loss = sum(sum((test_y - a{numOfHiddenLayer + 1}).^2)) / size(test_x, 2);

% Count no. of misclassifications so that we can compare it
% with other classification methods.
% If we let max return two values, the first one represents the max
% value and the second one represents the corresponding index. Since we
% care only about the class the model chooses, we drop the max value
% (using ~ to take its place) and keep the index.
[~, ind_] = max(a{numOfHiddenLayer + 1});
[~, ind]  = max(test_y);
test_wrong = sum( ind_ ~= ind ) / size(test_x, 2) * 100;

% Calculate training error
bs = 2000;                          % minibatch size
nb = size(train_x, 2) / bs;         % no. of mini-batches
train_error = 0;
% Here we go through all the mini-batches
for ll = 1 : nb
    % Use submatrices to pick out mini-batches
    a{1} = train_x(:, (ll-1)*bs+1 : ll*bs );
    yy   = train_y(:, (ll-1)*bs+1 : ll*bs );
    for i = 2 : numOfHiddenLayer + 1
        a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
    end
    train_error = train_error + sum(sum((yy - a{numOfHiddenLayer + 1}).^2));
end
train_error = train_error / size(train_x, 2);

losses       = [losses loss];
test_wrongs  = [test_wrongs, test_wrong];
train_errors = [train_errors train_error];
end
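The backward pass and weight update are the part this pseudo-code leaves for you to fill in. One possible completion in the same style (cell arrays W{i}, b{i}, a{i}; assuming sigm is the logistic sigmoid, so its derivative at an activation a is a.*(1-a)); this is an assumed sketch, not the course's official solution:

% Hypothetical completion: backward pass for the sum-of-squares loss
eta = 0.1;                                          % learning rate (assumed)
L = numOfHiddenLayer + 1;
d{L} = (a{L} - yy) .* a{L} .* (1 - a{L});           % output-layer deltas
for i = L-1 : -1 : 2
    d{i} = (W{i}' * d{i+1}) .* a{i} .* (1 - a{i});  % backpropagate the deltas
end
% Gradient-descent update, averaged over the mini-batch
for i = 1 : L-1
    W{i} = W{i} - eta * (d{i+1} * a{i}') / bs;
    b{i} = b{i} - eta * sum(d{i+1}, 2) / bs;
end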
Efficiency of Backpropagation
• Computational efficiency is the main aspect of backprop
• The number of operations needed to compute the derivatives of the error function scales with the total number W of weights and biases
• A single evaluation of the error function for a single input requires O(W) operations (for large W)
• This is in contrast to O(W²) for numerical differentiation
• As seen next
Numerical Differentiation
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon) \qquad \text{where } \epsilon \ll 1$$
• Accuracy is improved by making ε smaller, until round-off problems arise
• Accuracy can be improved further by using central differences:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2)$$
• This is O(W²): each of the W weights requires its own perturbed evaluation of the error function, and each evaluation costs O(W)
• Useful to check whether software for backprop has been correctly implemented (for some test cases); a gradient-check sketch follows
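A sketch of such a check for a single weight, comparing the backprop derivative with the central-difference estimate; the small tanh network, input and target below are illustrative:

W1 = 0.1 * randn(4, 3);  W2 = 0.1 * randn(2, 4);     % example two-layer network
x = randn(3, 1);  t = randn(2, 1);
En = @(W1) 0.5 * sum((W2 * tanh(W1 * x) - t).^2);    % error as a function of W1
% Backprop derivative for one weight w_ji (here j = 2, i = 1)
z = tanh(W1 * x);  dk = W2 * z - t;
dj = (1 - z.^2) .* (W2' * dk);
backprop = dj(2) * x(1);
% Central-difference estimate
eps_ = 1e-6;  Wp = W1;  Wm = W1;
Wp(2,1) = Wp(2,1) + eps_;  Wm(2,1) = Wm(2,1) - eps_;
numeric = (En(Wp) - En(Wm)) / (2 * eps_);
fprintf('backprop %.8f   numeric %.8f\n', backprop, numeric);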
Summary of Backpropagation
• Derivatives of error function wrt weights are obtained by
propagating errors backward
• It is more efficient than numerical differentiation
• It can also be used for other computations
• As seen next for Jacobian
The Jacobian Matrix
$$J_{ki} = \frac{\partial y_k}{\partial x_i}$$
By the chain rule, the Jacobian links derivatives through a module:
$$\frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_j} \frac{\partial z_j}{\partial w}$$
The Jacobian can also be evaluated numerically using central differences:
$$\frac{\partial y_k}{\partial x_i} = \frac{y_k(x_i + \epsilon) - y_k(x_i - \epsilon)}{2\epsilon} + O(\epsilon^2)$$
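A sketch of evaluating a network's Jacobian by central differences, which can serve as a check on a backprop-based computation; the network and weights below are illustrative:

W1 = randn(2, 3);  W2 = randn(2, 2);          % example network: 3 inputs, 2 outputs
f = @(x) W2 * tanh(W1 * x);                   % y = f(x)
x0 = randn(3, 1);  eps_ = 1e-5;
K = numel(f(x0));  D = numel(x0);
J = zeros(K, D);                              % J_ki = dy_k / dx_i
for i = 1 : D
    e = zeros(D, 1);  e(i) = eps_;
    J(:, i) = (f(x0 + e) - f(x0 - e)) / (2 * eps_);
end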
Summary
• Neural network learning requires learning the weights from samples, which involves two steps:
  • Determine the derivatives of the error with respect to the weights
  • Adjust the weights using the derivatives
• Backpropagation is a general term for computing derivatives
  • Evaluate δk for all output units (using δk = yk - tk for regression)
  • Backpropagate the δk s to obtain δj for each hidden unit
  • The product of the δ s with the activations at the unit provides the derivatives for that weight
• Backpropagation is also useful to compute a Jacobian matrix with several inputs and outputs
  • Jacobian matrices are useful to determine the effects of different inputs