A Gentle Introduction To Backpropagation
Contents

1 Why is this article being written?
3 Backpropagation
4 Easy as 1-2-3
  4.1 Boxing
  4.2 Sensitivity
  4.3 Weight updates
8 LEGAL NOTICE
Neural networks have fascinated me ever since I became aware of them in the 1990s. I was initially drawn to the hypnotizing array of connections with which they are often depicted. In the last decade, deep neural networks have dominated pattern recognition, often replacing other algorithms in applications.
© Numeric Insight, Inc. All rights reserved. See LEGAL NOTICE at the end of this article.
Backpropagation
In fact, the error minimization problem that must be solved to train a neural network eluded a practical solution for decades, until D. E. Rumelhart, G. E. Hinton, and R. J. Williams (drawing inspiration from other researchers) demonstrated a technique they called backpropagation and made it widely known (Nature 323, 533–536, 9 October 1986). It is essentially by building upon their method that others have since ventured to program neural networks with 60 million weights, with astounding results.
According to Bernard Widrow, now Professor Emeritus at Stanford University and one of the pioneers of neural networks, "The basic concepts of backpropagation are easily grasped. Unfortunately, these simple ideas are often obscured by relatively intricate notation, so formal derivations of the backpropagation rule are often tedious. This is indeed unfortunate because the backpropagation rule is one of the most elegant applications of calculus that I have known."
Easy as 1-2-3
Once you appreciate the fact that, in order to train a neural network, you need
to somehow calculate the partial derivatives of the error with respect to weights,
backpropagation can be easily and qualitatively derived by reducing it to three
core concepts. It also helps immensely to keep the notation intuitive and easy
to connect to the concept being symbolized.
4.1 Boxing
Since training the neural network is all about minimizing the training error, the
first step in the derivation involves tacking on an extra computational block to
calculate the error between the actual output {o1 , o2 , o3 , o4 } and a known target
{t1 , t2 , t3 , t4 }. This is shown as a triangular block in Figure 2. For now, let us
think of the output and the target as known and fixed entities. Although we
need not concern ourselves with the exact formula to compute the error, I offer
the familiar sum-of-squares error as an example
e = (o1 − t1)² + (o2 − t2)² + (o3 − t3)² + (o4 − t4)²
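For readers who like to see formulas as code, the error block can be written in a few lines of Python. This is a sketch of my own; the article itself does not prescribe an implementation, and the example values are chosen arbitrarily for illustration.

```python
# Sum-of-squares error between the actual output {o1, ..., o4}
# and the known target {t1, ..., t4}.
def sum_of_squares_error(o, t):
    return sum((oi - ti) ** 2 for oi, ti in zip(o, t))

o = [0.8, 0.1, 0.05, 0.05]  # actual output of the network
t = [1.0, 0.0, 0.0, 0.0]    # known target
e = sum_of_squares_error(o, t)  # ≈ 0.055
```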
Next, we choose one of the layers (say Layer 3) and enclose that layer and
all following layers (including the error calculating block) in a box, as shown
in gray in Figure 2. Keep in mind that this is just one of several nested boxes
we can construct in order to compartmentalize the network. Let us resolve not to worry about anything going on inside the box, but simply think of the box as a single unit characterized by its overall input and output.
4.2 Sensitivity
Figure 3: This diagram shows the boxed portion of the neural network of Figure 2. By hiding details of internal connections within the box, this scheme
allows us to think about the broad relationship between the input and the output pertaining to this box.
Continuing with our specific example, let us call the input to the preceding box the preceding input, {p1, p2, p3, p4}. It follows quite logically that the sensitivity of the preceding box (which we will naturally denote as {∂e/∂p1, ∂e/∂p2, ∂e/∂p3, ∂e/∂p4}) must be related to the sensitivity of the current box and the extra neural network elements making up the difference between the two nested boxes. The extra elements are the very vital nonlinear activation function units, summing junctions and weights.
Figure 4 (top) shows the current box and the extra elements that must be added to construct the preceding box. For clarity, all the elements not relevant to the calculation of the first component of sensitivity (∂e/∂p1) have been grayed out. Look closely at the top and bottom parts of Figure 4 to understand how the sensitivity of the preceding box can easily be derived from first principles. Specifically, the bottom part of Figure 4 provides insight into how ∂e/∂p1 can be computed by allowing the input component p1 to change by a small quantity Δp1 and following the resulting changes in the network. Notes: (i) The notation A′(p1) has been used for the derivative of the activation function evaluated at p1. (ii) For clarity, not all changes in signals have been explicitly labeled. Those that are not labeled can easily be determined since they all follow an obvious pattern.
This is indeed a deep result. (And one which we have managed to arrive
at without recourse to tedious calculus and numbered equations.) By repeated
application, this result allows us to work backwards and calculate the sensitivity
of the error to changes in the input at every layer.
The algorithm gets the name backpropagation because the sensitivities are propagated backwards through the network, from the error block toward the input layer.
Figure 4: The drawing on the top shows the current box and the extra elements that must be added to construct the preceding box. The drawing on the bottom provides insight into how the sensitivity component ∂e/∂p1 can be computed by allowing the input component p1 to change by a small quantity Δp1 and following the resulting changes in the network.
Figure 5: Similar to Figure 4, the drawing on the top shows the current box and the extra elements that must be added to construct the preceding box. The drawing on the bottom provides insight into how the partial derivative ∂e/∂w11 (used to guide weight refinement) can be computed by allowing the weight w11 to change by a small quantity Δw11 and following the resulting changes in the network. Using this intuition, ∂e/∂w11 can be computed as A(p1) c1.
A starting point is all we need to completely calculate all the sensitivity terms
throughout the neural network. To do this, we consider the error computing
block itself as the first box. For this box, the input is {o1 , o2 , o3 , o4 }, and the
output is e as given in the sum-of-squares error formula we have seen before.
Simple calculus gives us the components of the sensitivity of the error computing
block
{2(o1 − t1), 2(o2 − t2), 2(o3 − t3), 2(o4 − t4)}
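This formula is easy to verify numerically by nudging one output component at a time, in the same spirit as the small-change argument used above. The sketch below is mine, not from the article; the values are arbitrary.

```python
# Sum-of-squares error, as defined earlier in the article.
def error(o, t):
    return sum((oi - ti) ** 2 for oi, ti in zip(o, t))

o = [0.8, 0.1, 0.05, 0.05]
t = [1.0, 0.0, 0.0, 0.0]
eps = 1e-6
for i in range(4):
    nudged = list(o)
    nudged[i] += eps                      # let o_i change by a small quantity
    numeric = (error(nudged, t) - error(o, t)) / eps
    analytic = 2 * (o[i] - t[i])          # the sensitivity component above
    assert abs(numeric - analytic) < 1e-4
```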
4.3 Weight updates
At this point, the last section writes itself. Following the same strategy outlined
in the previous figure, look at Figure 5 to intuitively understand how the error
changes in response to a small change in one of the weights, say w11 . Once
again in these figures, details of connections not immediately relevant to this
calculation have been grayed out. The much sought after partial derivative
of error with respect to the specific weight w11 is easily seen to be A(p1 )c1 .
Generalizing on this, the textbook formula to compute the partial derivative of the error with respect to any weight is easily seen to be

∂e/∂wij = A(pi) cj
Now that we have a formula to calculate the partial derivative of error with
respect to every weight in the network, we can proceed to iteratively refine the
weights and minimize the error using the method of steepest descent.
In the most popular version of backpropagation, called stochastic backpropagation, the weights are initially set to small random values and the training set
is randomly polled to pick out a single input-target pair. The input is passed
through the network to compute internal signals (like A(p1) and A′(p1) shown
in Figures 4 and 5) and the output vector. Once this is done, all the information
needed to initiate backpropagation becomes available. The partial derivatives
of error with respect to the weights can be computed, and the weights can be
refined with intent to reduce the error. The process is iterated using another
randomly chosen input-target pair.
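Put together, this procedure can be sketched for a toy two-layer network as follows. This is my own minimal rendering, not the article's code: tanh hidden units, a linear output layer, no bias inputs, and made-up training data are all simplifying assumptions of mine.

```python
import math
import random

def forward(x, w1, w2):
    # Hidden layer: tanh of weighted sums; output layer: plain weighted sums.
    h = [math.tanh(sum(xi * w1[i][j] for i, xi in enumerate(x)))
         for j in range(len(w1[0]))]
    o = [sum(h[i] * w2[i][j] for i in range(len(h)))
         for j in range(len(w2[0]))]
    return h, o

def train(pairs, n_in, n_hid, n_out, rate=0.1, steps=5000, seed=0):
    rng = random.Random(seed)
    # Weights initially set to small random values.
    w1 = [[rng.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_in)]
    w2 = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hid)]
    for _ in range(steps):
        x, t = rng.choice(pairs)        # randomly poll a single input-target pair
        h, o = forward(x, w1, w2)
        d_out = [2 * (o[j] - t[j]) for j in range(n_out)]  # error-block sensitivity
        d_hid = [(1 - h[i] ** 2) *      # tanh' written in terms of the stored signal
                 sum(w2[i][j] * d_out[j] for j in range(n_out))
                 for i in range(n_hid)]
        for i in range(n_hid):          # refine weights by steepest descent
            for j in range(n_out):
                w2[i][j] -= rate * h[i] * d_out[j]
        for i in range(n_in):
            for j in range(n_hid):
                w1[i][j] -= rate * x[i] * d_hid[j]
    return w1, w2

# Toy training set: map each basis vector to itself (made-up data).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
w1, w2 = train(pairs, n_in=2, n_hid=3, n_out=2)
```

On this toy problem the iteration drives the sum-of-squares error close to zero within a few thousand input-target presentations.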
I am in awe of the miraculous feat of cognition that led early neural network researchers to the backpropagation algorithm. They clearly had the ability to see patterns and make elegant groupings which ultimately made it possible to train huge networks. Their work not only resulted in the neural network applications we use today, but has also inspired a host of other related algorithms which depend on error minimization.
Although this algorithm has been presented here as a single established
method, it should be regarded as a framework. In my experience, appreciating
how an algorithm is derived leads to insight which makes it possible to explore
beneficial variations. The process of designing robust and efficient scientific
algorithms frequently leads us to regard established frameworks as substrates
upon which to build new and better algorithms.
L          Number of layers.
l          Layer index.
N^l        Number of input components going into layer l.
I_i^l      ith component of the input going into layer l. Note that i = {0, 1, 2, ..., N^l}. See Figure 6 to understand why i starts from 0.
N^(L+1)    Number of output components coming out of layer L (i.e. the last layer, or output layer).
O_j        jth component of the output coming out of layer L. Note that j = {1, 2, 3, ..., N^(L+1)}.
A()        The activation function. For example, A(I_3^2) = tanh(I_3^2).
A′()       The derivative of the activation function. For example, A′(I_3^2) = 1 − tanh²(I_3^2).
w_ij^l     The weight element connecting the ith input to the jth output in layer l. Note the ranges l = {1, 2, 3, ..., L}, i = {0, 1, 2, ..., N^l}, and j = {1, 2, 3, ..., N^(l+1)}.
δ_i^l      ith component of the sensitivity measured at the input to layer l. Note that i = {1, 2, 3, ..., N^l}.
t_j        jth component of the target corresponding to the chosen input. Note that j = {1, 2, 3, ..., N^(L+1)}.
g_ij^l     The partial derivative of the error e with respect to the weight w_ij^l. Note that we have chosen the symbol g to suggest gradient.
r          The learning rate.

Training Set

I          Training set input. Note that I is a matrix whose rows are selected randomly during training to extract vectors of the form {I_1^1, I_2^1, I_3^1, ..., I_(N^1)^1} that feed into the network.
T          Training set target. Note that at every instance of selection of a row within I, the corresponding row of T is selected to extract a vector of the form {t_1, t_2, t_3, ..., t_(N^(L+1))}. It is this vector that is compared with the actual output of the network during training iterations.

Superscripts are used to index layers as in, for example, I_2^3. This is an intuitive notation and it should be clear from context if a quantity is actually being raised to a power.
6.1
The overall goal of the algorithm is to attempt to find weights such that, for every input vector in the training set, the neural network yields an output vector
closely matching the prescribed target vector.
Step 1  Initialize the weights and set the learning rate and the stopping criteria.
Step 2  Randomly choose an input and the corresponding target.
Step 3  Compute the input to each layer and the output of the final layer.
Step 4  Compute the sensitivity components.
Step 5  Compute the gradient components and update the weights.
Step 6  Check against the stopping criteria. Exit and return the weights or loop back to Step 2.

6.2
1. Initialize the weights w_ij^l to small random values between −1 and +1. Set the learning rate and stopping criteria. Try the following settings for initial experimentation:

   r = 10^−2
   gMin = 10^−3
   k = 0
   kMax = 10^5

2. Randomly choose (with replacement) an input vector {I_1^1, I_2^1, I_3^1, ..., I_(N^1)^1} from I and the corresponding target {t_1, t_2, t_3, ..., t_(N^(L+1))} from T.
3. Compute the input to each layer, {I_1^l, I_2^l, I_3^l, ..., I_j^l, ..., I_(N^l)^l}, and the output {O_1, O_2, O_3, ..., O_j, ..., O_(N^(L+1))} coming out of layer L.

   I_j^2 = Σ_{i=0}^{N^1} I_i^1 w_ij^1        for j = 1, 2, 3, ..., N^2.

   I_j^l = Σ_{i=0}^{N^(l−1)} A(I_i^(l−1)) w_ij^(l−1)        for l = 3, 4, 5, ..., L; for j = 1, 2, 3, ..., N^l.

   O_j = Σ_{i=0}^{N^L} A(I_i^L) w_ij^L        for j = 1, 2, 3, ..., N^(L+1).
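As a sketch of how these equations translate into code (my own rendering; for brevity it omits the i = 0 bias inputs of Figure 6, and assumes A = tanh):

```python
import math

def forward_pass(x, weights, act=math.tanh):
    # weights[l-1][i][j] plays the role of w_ij^l. Layer 1 feeds its raw
    # inputs I_i^1 into the first weighted sums; every later layer applies
    # the activation A before its weighted sums; the final weighted sums
    # are the outputs O_j.
    signal = x
    for l, w in enumerate(weights):
        inp = signal if l == 0 else [act(s) for s in signal]
        signal = [sum(inp[i] * w[i][j] for i in range(len(inp)))
                  for j in range(len(w[0]))]
    return signal
```

With a single identity weight matrix the network simply reproduces its input, which makes a handy sanity check.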
4. Compute the sensitivity components.

   δ_i^(L+1) = 2(O_i − t_i)        for i = 1, 2, 3, ..., N^(L+1).

   δ_i^l = A′(I_i^l) Σ_{j=1}^{N^(l+1)} w_ij^l δ_j^(l+1)        for l = L, L−1, L−2, ..., 2; for i = 1, 2, 3, ..., N^l.
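These two equations amount to only a few lines of code. As before, the sketch below is my own, with A taken to be tanh so that A′(v) = 1 − tanh²(v):

```python
import math

def output_sensitivity(o, t):
    # delta_i^(L+1) = 2 (O_i - t_i)
    return [2 * (oi - ti) for oi, ti in zip(o, t)]

def layer_sensitivity(I_l, w_l, delta_next):
    # delta_i^l = A'(I_i^l) * sum_j w_ij^l * delta_j^(l+1)
    return [(1 - math.tanh(I_l[i]) ** 2) *
            sum(w_l[i][j] * delta_next[j] for j in range(len(delta_next)))
            for i in range(len(I_l))]
```

Applied repeatedly from l = L down to l = 2, layer_sensitivity works the sensitivities backwards through the network, exactly as the name backpropagation suggests.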
5. Compute the gradient components g_ij^l (partial derivatives of error with respect to weights) and update the weights w_ij^l:

   g_ij^l = ∂e/∂w_ij^l = A(I_i^l) δ_j^(l+1)        for l = 1, 2, 3, ..., L; for i = 0, 1, 2, ..., N^l; for j = 1, 2, 3, ..., N^(l+1).

   w_ij^l = w_ij^l − r g_ij^l
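In code, this step is two nested loops per layer. This is my own sketch; a_l stands for the activated signals A(I_i^l) of the layer, and the bias handling of the article is not modeled.

```python
def update_layer(w_l, a_l, delta_next, rate):
    # g_ij^l = A(I_i^l) * delta_j^(l+1);  w_ij^l <- w_ij^l - r * g_ij^l.
    for i in range(len(w_l)):
        for j in range(len(w_l[0])):
            g = a_l[i] * delta_next[j]    # gradient component
            w_l[i][j] -= rate * g         # steepest-descent refinement
    return w_l

w = update_layer([[1.0]], [0.5], [2.0], 0.1)  # 1.0 - 0.1 * 0.5 * 2.0
```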
LEGAL NOTICE
This material is the property of Numeric Insight, Inc and has been provided for information
purposes only. You are permitted to print and download this article only for your personal
reference, and only provided that: no documents or related graphics are modified or copied
in any way; no graphics from this article or the referring websites www.numericinsight.com
and www.numericinsight.blogspot.com are used separately from accompanying text; and the
status of Numeric Insight, Inc (and that of any identified contributors) as the author of this
material is always acknowledged.
This legal notice applies to the entire contents of this article and the referring websites
www.numericinsight.com and www.numericinsight.blogspot.com. By using or visiting this
article you accept and agree to the terms of this legal notice. If you do not accept these terms,
please do not use this material.
You understand and agree that Numeric Insight, Inc and any of its principals or affiliates
shall in no event be liable for any direct, indirect, incidental, consequential, or exemplary
damages (including, but not limited to, equipment malfunction, loss of use, data, or profits;
or business interruption) however caused and on any theory of liability, whether in contract,
strict liability, or tort (including negligence or otherwise) arising in any way (including due to
redistribution of software, algorithm, solution or information without including this notice)
out of the use of this software, algorithm, solution or information. You understand and agree
that your use of this software, algorithm, solution or information is entirely at your own risk.
Redistributions of and references to this software, algorithm, solution or information must
retain this notice.
The material in this article and the referring websites www.numericinsight.com and
www.numericinsight.blogspot.com is provided as is, without any conditions, warranties or
other terms of any kind. Numeric Insight, Inc and its respective partners, officers, members,
employees and advisers shall not be liable for any direct or indirect or consequential loss or
damage, nor for any loss of revenue, profit, business, data, goodwill or anticipated savings
suffered by any person as a result of reliance on a statement or omission on, or the use of, this
material.
Numeric Insight, Inc will not be liable for any loss or damage caused by a distributed
denial-of-service attack, viruses or other technologically harmful material that may infect
your computer equipment, computer programs, data or other proprietary material due to
your use of the associated websites or to your downloading of any material posted on it, or
on any website linked to it.
Numeric Insight, Inc may revise the content of this article, including this legal notice, at
any time. This entire notice must be included with any subsequent copies of this material.