
CHAPTER 8

Neural Networks

Unless you live under a rock with no internet access, you've been hearing a lot about "neural networks." Now that we have several useful machine-learning concepts (hypothesis classes, classification, regression, gradient descent, regularization, etc.) we are well equipped to understand neural networks in detail.

This is, in some sense, the "third wave" of neural nets. The basic idea is founded on the 1943 model of neurons of McCulloch and Pitts and the learning ideas of Hebb. There was a great deal of excitement, but not a lot of practical success: there were good training methods (e.g., perceptron) for linear functions, and interesting examples of non-linear functions, but no good way to train non-linear functions from data. Interest died out for a while, but was re-kindled in the 1980s when several people came up with a way to train neural networks with "back-propagation," which is a particular style of implementing gradient descent, which we will study here. (As with many good ideas in science, the basic idea for how to train non-linear neural networks with gradient descent was independently developed by more than one researcher.) By the mid-90s, the enthusiasm waned again, because although we could train non-linear networks, the training tended to be slow and was plagued by a problem of getting stuck in local optima. Support vector machines (SVMs) (regularization of high-dimensional hypotheses by seeking to maximize the margin) and kernel methods (an efficient and beautiful way of using feature transformations to non-linearly transform data into a higher-dimensional space) provided reliable learning methods with guaranteed convergence and no local optima.

However, during the SVM enthusiasm, several groups kept working on neural networks, and their work, in combination with an increase in available data and computation, has made them rise again. They have become much more reliable and capable, and are now the method of choice in many applications. There are many, many variations of neural networks, which we can't even begin to survey. (The number increases daily, as may be seen on arxiv.org.) We will study the core "feed-forward" networks with "back-propagation" training, and then, in later chapters, address some of the major advances beyond this core.
We can view neural networks from several different perspectives:

View 1: An application of stochastic gradient descent for classification and regression with a potentially very rich hypothesis class.

View 2: A brain-inspired network of neuron-like computing elements that learn distributed representations.

View 3: A method for building applications that make predictions based on huge amounts of data in very complex domains.


We will mostly take view 1, with the understanding that the techniques we develop will enable the applications in view 3. View 2 was a major motivation for the early development of neural networks, but the techniques we will study do not seem to actually account for the biological learning processes in brains. (Some prominent researchers are, in fact, working hard to find analogues of these methods in the brain.)
1 Basic element
The basic element of a neural network is a "neuron," pictured schematically below. We will also sometimes refer to a neuron as a "unit" or "node."

[Figure: a single neuron. The inputs x_1, ..., x_m are weighted by w_1, ..., w_m and summed together with the offset w_0 to give the pre-activation z; the activation function f(·) is then applied to produce the output a.]

It is a non-linear function of an input vector x ∈ R^m to a single output value a ∈ R. It is parameterized by a vector of weights (w_1, ..., w_m) ∈ R^m and an offset or threshold w_0 ∈ R. (Sorry for changing our notation here. We were using d as the dimension of the input, but we are trying to be consistent here with many other accounts of neural networks. It is impossible to be consistent with all of them, though; there are many different ways of telling this story.) In order for the neuron to be non-linear, we also specify an activation function f : R → R, which can be the identity (f(x) = x), but can also be any other function, though we will only be able to work with it if it is differentiable.

The function represented by the neuron is expressed as:

$$a = f(z) = f\left(\sum_{j=1}^{m} x_j w_j + w_0\right) = f(w^T x + w_0)\;.$$

Before thinking about a whole network, we can consider how to train a single unit. Given a loss function L(guess, actual) and a dataset {(x^(1), y^(1)), ..., (x^(n), y^(n))}, we can do (stochastic) gradient descent, adjusting the weights w, w_0 to minimize

$$J(w, w_0) = \sum_{i} L\left(\mathrm{NN}(x^{(i)}; w, w_0),\, y^{(i)}\right)\;,$$

where NN is the output of our neural net for a given input. (This should remind you of our θ and θ_0 for linear models.)


We have already studied two special cases of the neuron: linear logistic classifiers (LLC)
with NLL loss and regressors with quadratic loss! The activation function for the LLC is
f(x) = σ(x) and for linear regression it is simply f(x) = x.
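To make this concrete, here is a minimal numpy sketch of a single sigmoid unit trained by SGD on squared loss. The function names, and the particular pairing of sigmoid activation with squared loss, are our own choices for illustration rather than anything prescribed by the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, w0):
    # a = f(w^T x + w0)
    return sigmoid(w @ x + w0)

def sgd_step(x, y, w, w0, eta=0.1):
    a = neuron_output(x, w, w0)
    # For squared loss L(a, y) = (a - y)^2 with a sigmoid activation,
    # dL/dz = 2 (a - y) * sigma'(z) = 2 (a - y) * a * (1 - a).
    dLdz = 2.0 * (a - y) * a * (1.0 - a)
    w_new = w - eta * dLdz * x      # dz/dw = x
    w0_new = w0 - eta * dLdz        # dz/dw0 = 1
    return w_new, w0_new

Starting from small random w and w0 = 0, calling sgd_step repeatedly on randomly chosen training pairs (x, y) carries out the single-unit SGD described above.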
Study Question: Just for a single neuron, imagine, for some reason, that we decide to use activation function f(z) = e^z and loss function L(g, a) = (g − a)^2. Derive a gradient descent update for w and w_0.

2 Networks
Now, we'll put multiple neurons together into a network. A neural network in general takes in an input x ∈ R^m and generates an output a ∈ R^n. It is constructed out of multiple neurons; the inputs of each neuron might be elements of x and/or outputs of other neurons.
The outputs are generated by n output units.
In this chapter, we will only consider feed-forward networks. In a feed-forward network,
you can think of the network as defining a function-call graph that is acyclic: that is, the
input to a neuron can never depend on that neuron’s output. Data flows, one way, from
the inputs to the outputs, and the function computed by the network is just a composition
of the functions computed by the individual neurons.
Although the graph structure of a neural network can really be anything (as long as it
satisfies the feed-forward constraint), for simplicity in software and analysis, we usually
organize them into layers. A layer is a group of neurons that are essentially “in parallel”:
their inputs are outputs of neurons in the previous layer, and their outputs are the input to
the neurons in the next layer. We’ll start by describing a single layer, and then go on to the
case of multiple layers.

2.1 Single layer


A layer is a set of units that, as we have just described, are not connected to each other. The layer is called fully connected if, as in the diagram below, the inputs to each unit in the layer are the same (i.e., x_1, x_2, ..., x_m in this case). A layer has input x ∈ R^m and output (also known as activation) a ∈ R^n.

[Figure: a fully connected layer. Each of the n units receives all of the inputs x_1, ..., x_m, computes a weighted sum followed by f, and produces one of the outputs a_1, ..., a_n; the weights and offsets of the whole layer are collected into W and W_0.]

Since each unit has a vector of weights and a single offset, we can think of the weights of the whole layer as a matrix, W, and the collection of all the offsets as a vector W_0. If we have m inputs, n units, and n outputs, then

• W is an m × n matrix,

• W_0 is an n × 1 column vector,

• X, the input, is an m × 1 column vector,

• Z = W^T X + W_0, the pre-activation, is an n × 1 column vector,

• A, the activation, is an n × 1 column vector,

and the output vector is

$$A = f(Z) = f(W^T X + W_0)\;.$$

The activation function f is applied element-wise to the pre-activation values Z.
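As a quick illustration (the function names here are ours, not from the text), a fully connected layer is a couple of lines of numpy:

import numpy as np

def layer_forward(X, W, W0, f):
    # X: m x 1 input, W: m x n weights, W0: n x 1 offsets, f applied element-wise
    Z = W.T @ X + W0        # pre-activation, n x 1
    A = f(Z)                # activation, n x 1
    return A

# Example: m = 3 inputs feeding n = 2 tanh units.
X = np.array([[1.0], [2.0], [-1.0]])
W = np.random.normal(size=(3, 2))
W0 = np.zeros((2, 1))
A = layer_forward(X, W, W0, np.tanh)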


What can we do with a single layer? We have already seen single-layer networks, in the form of linear separators and linear regressors. All we can do with a single layer is make a linear hypothesis. (We have used a step or sigmoid function to transform the linear output value for classification, but it's important to be clear that the resulting separator is still linear.) The whole reason for moving to neural networks is to move in the direction of non-linear hypotheses. To do this, we will have to consider multiple layers, where we can view the last layer as still being a linear classifier or regressor, but where we interpret the previous layers as learning a non-linear feature transformation φ(x), rather than having us hand-specify it.
2.2 Many layers
A single neural network generally combines multiple layers, most typically by feeding the outputs of one layer into the inputs of another layer.

We have to start by establishing some nomenclature. We will use l to name a layer, and let m^l be the number of inputs to the layer and n^l be the number of outputs from the layer. Then, W^l and W_0^l are of shape m^l × n^l and n^l × 1, respectively. Let f^l be the activation function of layer l. (It is technically possible to have different activation functions within the same layer, but, again, for convenience in specification and implementation, we generally have the same activation function within a layer.) Then, the pre-activation outputs are the n^l × 1 vector

$$Z^l = {W^l}^T A^{l-1} + W_0^l$$

and the activation outputs are simply the n^l × 1 vector

$$A^l = f^l(Z^l)\;.$$

Here's a diagram of a many-layered network, with two blocks for each layer, one representing the linear part of the operation and one representing the non-linear activation function. We will use this structural decomposition to organize our algorithmic thinking and implementation.

[Figure: X = A^0 → (W^1, W_0^1) → Z^1 → f^1 → A^1 → (W^2, W_0^2) → Z^2 → f^2 → A^2 → ··· → A^(L−1) → (W^L, W_0^L) → Z^L → f^L → A^L, grouped into layer 1, layer 2, ..., layer L.]
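Because the network is just a composition of these per-layer operations, the whole forward pass can be written as a short loop; here is a minimal numpy sketch (names are ours), with the weights stored as lists indexed by layer:

import numpy as np

def forward(x, Ws, W0s, fs):
    # Ws[l]: m^l x n^l, W0s[l]: n^l x 1, fs[l]: activation function of layer l
    A = x                               # A^0 = X
    for W, W0, f in zip(Ws, W0s, fs):
        Z = W.T @ A + W0                # Z^l = (W^l)^T A^(l-1) + W0^l
        A = f(Z)                        # A^l = f^l(Z^l)
    return A                            # A^L, the output of the network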

3 Choices of activation function


There are many possible choices for the activation function. We will start by thinking about
whether it’s really necessary to have an f at all.
What happens if we let f be the identity? Then, in a network with L layers (we'll leave out W_0 for simplicity, but keeping it wouldn't change the form of this argument),

$$A^L = {W^L}^T A^{L-1} = {W^L}^T {W^{L-1}}^T \cdots {W^1}^T X\;.$$

So, multiplying out the weight matrices, we find that

$$A^L = W^{\text{total}} X\;,$$

which is a linear function of X! Having all those layers did not change the representational capacity of the network: the non-linearity of the activation function is crucial.
Study Question: Convince yourself that any function representable by any number
of linear layers (where f is the identity function) can be represented by a single layer.
Now that we are convinced we need a non-linear activation, let’s examine a few com-
mon choices.


Step function

$$\mathrm{step}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{otherwise} \end{cases}$$

Rectified linear unit

$$\mathrm{ReLU}(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise} \end{cases} \;=\; \max(0, z)$$

Sigmoid function Also known as a logistic function; it can be interpreted as a probability, because for any value of z the output is in [0, 1]:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Hyperbolic tangent Always in the range [−1, 1]:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Softmax function Takes a whole vector Z ∈ R^n and generates as output a vector A ∈ [0, 1]^n with the property that $\sum_{i=1}^{n} A_i = 1$, which means we can interpret it as a probability distribution over n items:

$$\mathrm{softmax}(z) = \begin{bmatrix} \exp(z_1)/\sum_i \exp(z_i) \\ \vdots \\ \exp(z_n)/\sum_i \exp(z_i) \end{bmatrix}$$

[Figure: plots of step(z) and ReLU(z) for z in [−2, 2], and of σ(z) and tanh(z) for z in [−4, 4].]
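All of these are one-liners in numpy; here is a minimal sketch (function names are ours):

import numpy as np

def step(z):
    return np.where(z < 0, 0.0, 1.0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtracting the max is a common trick for numerical stability;
    # it does not change the output.
    e = np.exp(z - np.max(z))
    return e / e.sum()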


The original idea for neural networks involved using the step function as an activation, but because the derivative is discontinuous, we won't be able to use gradient-descent methods to tune the weights in a network with step functions, so we won't consider them further. They have been replaced, in a sense, by the sigmoid, relu, and tanh activation functions.
Study Question: Consider sigmoid, relu, and tanh activations. Which one is most
like a step function? Is there an additional parameter you could add to a sigmoid
that would make it be more like a step function?

Study Question: What is the derivative of the relu function? Are there some values
of the input for which the derivative vanishes?
ReLUs are especially common in internal ("hidden") layers; sigmoid activations are common for the output for binary classification, and softmax for multi-class classification (see section 6 for an explanation).

4 Error back-propagation
We will train neural networks using gradient descent methods. It’s possible to use batch
gradient descent, in which we sum up the gradient over all the points (as in section 2 of
chapter 6) or stochastic gradient descent (SGD), in which we take a small step with respect
to the gradient considering a single point at a time (as in section 4 of chapter 6).
Our notation is going to get pretty hairy pretty quickly. To keep it as simple as we can,
we’ll focus on computing the contribution of one data point x(i) to the gradient of the loss
with respect to the weights, for SGD; you can simply sum up these gradients over all the
data points if you wish to do batch descent.
So, to do SGD for a training example (x, y), we need to compute ∇_W Loss(NN(x; W), y), where W represents all weights W^l, W_0^l in all the layers l = (1, ..., L). This seems terrifying, but is actually quite easy to do using the chain rule. (Remember the chain rule! If a = f(b) and b = g(c) (so that a = f(g(c))), then da/dc = (da/db) · (db/dc) = f'(b) g'(c) = f'(g(c)) g'(c).)

Remember that we are always computing the gradient of the loss function with respect to the weights for a particular value of (x, y). That tells us how much we want to change the weights, in order to reduce the loss incurred on this particular training example.

First, let's see how the loss depends on the weights in the final layer, W^L. Remembering that our output is A^L, and using the shorthand loss to stand for Loss(NN(x; W), y), which is equal to Loss(A^L, y), and finally that A^L = f^L(Z^L) and Z^L = {W^L}^T A^{L-1}, we can use the chain rule:

$$\frac{\partial \text{loss}}{\partial W^L} = \underbrace{\frac{\partial \text{loss}}{\partial A^L}}_{\text{depends on loss function}} \cdot \underbrace{\frac{\partial A^L}{\partial Z^L}}_{{f^L}'} \cdot \underbrace{\frac{\partial Z^L}{\partial W^L}}_{A^{L-1}}\;.$$

(It might reasonably bother you that ∂Z^L/∂W^L = A^{L−1}. We're somehow thinking about the derivative of a vector with respect to a matrix, which seems like it might need to be a three-dimensional thing. But note that ∂Z^L/∂W^L is really ∂({W^L}^T A^{L-1})/∂W^L, and it seems okay in at least an informal sense that it's A^{L−1}.)

To actually get the dimensions to match, we need to write this a bit more carefully, and note that it is true for any l, including l = L:

$$\underbrace{\frac{\partial \text{loss}}{\partial W^l}}_{m^l \times n^l} = \underbrace{A^{l-1}}_{m^l \times 1}\; \underbrace{\left(\frac{\partial \text{loss}}{\partial Z^l}\right)^T}_{1 \times n^l} \qquad (8.1)$$

Yay! So, in order to find the gradient of the loss with respect to the weights in the other layers of the network, we just need to be able to find ∂loss/∂Z^l.

If we repeatedly apply the chain rule, we get this expression for the gradient of the loss with respect to the pre-activation in the first layer:

$$\frac{\partial \text{loss}}{\partial Z^1} = \frac{\partial \text{loss}}{\partial A^L} \cdot \frac{\partial A^L}{\partial Z^L} \cdot \frac{\partial Z^L}{\partial A^{L-1}} \cdot \frac{\partial A^{L-1}}{\partial Z^{L-1}} \cdots \frac{\partial A^2}{\partial Z^2} \cdot \frac{\partial Z^2}{\partial A^1} \cdot \frac{\partial A^1}{\partial Z^1}\;. \qquad (8.2)$$

This derivation was informal, to show you the general structure of the computation. In fact, to get the dimensions to all work out, we just have to write it backwards! Let's first understand more about these quantities:

• ∂loss/∂A^L is n^L × 1 and depends on the particular loss function you are using.

• ∂Z^l/∂A^(l−1) is m^l × n^l and is just W^l (you can verify this by computing a single entry ∂Z_i^l/∂A_j^(l−1)).

• ∂A^l/∂Z^l is n^l × n^l. It's a little tricky to think about. Each element a_i^l = f^l(z_i^l). This means that ∂a_i^l/∂z_j^l = 0 whenever i ≠ j. So, the off-diagonal elements of ∂A^l/∂Z^l are all 0, and the diagonal elements are ∂a_i^l/∂z_i^l = {f^l}'(z_i^l).

Now, we can rewrite equation 8.2 so that the quantities match up as

$$\frac{\partial \text{loss}}{\partial Z^l} = \frac{\partial A^l}{\partial Z^l} \cdot W^{l+1} \cdot \frac{\partial A^{l+1}}{\partial Z^{l+1}} \cdot \ldots \cdot W^{L-1} \cdot \frac{\partial A^{L-1}}{\partial Z^{L-1}} \cdot W^L \cdot \frac{\partial A^L}{\partial Z^L} \cdot \frac{\partial \text{loss}}{\partial A^L}\;. \qquad (8.3)$$

Using equation 8.3 to compute ∂loss/∂Z^l, combined with equation 8.1, lets us find the gradient of the loss with respect to any of the weight matrices.
Study Question: Apply the same reasoning to find the gradients of loss with respect
to W0l .
This general process is called error back-propagation. The idea is that we first do a forward pass to compute all the a and z values at all the layers, and finally the actual loss on this example. Then, we can work backward and compute the gradient of the loss with respect to the weights in each layer, starting at layer L and going back to layer 1.

(I like to think of this as "blame propagation". You can think of loss as how mad we are about the prediction that the network just made. Then ∂loss/∂A^L is how much we blame A^L for the loss. The last module has to take in ∂loss/∂A^L and compute ∂loss/∂Z^L, which is how much we blame Z^L for the loss. The next module (working backwards) takes in ∂loss/∂Z^L and computes ∂loss/∂A^(L−1). So every module is accepting its blame for the loss, computing how much of it to allocate to each of its inputs, and passing the blame back to them.)

[Figure: the chain X = A^0 → (W^1, W_0^1) → Z^1 → f^1 → A^1 → (W^2, W_0^2) → Z^2 → f^2 → A^2 → ··· → A^(L−1) → (W^L, W_0^L) → Z^L → f^L → A^L → Loss (together with y), annotated with the backward-flowing quantities ∂loss/∂Z^1, ∂loss/∂A^1, ∂loss/∂Z^2, ∂loss/∂A^2, ..., ∂loss/∂A^(L−1), ∂loss/∂Z^L, ∂loss/∂A^L.]

If we view our neural network as a sequential composition of modules (in our work so far, it has been an alternation between a linear transformation with a weight matrix, and a component-wise application of a non-linear activation function), then we can define a simple API for a module that will let us compute the forward and backward passes, as well as do the necessary weight updates for gradient descent. Each module has to provide the following "methods." We are already using letters a, x, y, z with particular meanings, so here we will use u as the vector input to the module and v as the vector output:

• forward: u → v

• backward: u, v, ∂L/∂v → ∂L/∂u

• weight grad: u, ∂L/∂v → ∂L/∂W (only needed for modules that have weights W)

In homework we will ask you to implement these modules for neural network components, and then use them to construct a network and train it as described in the next section.
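To make the module API concrete, here is a minimal numpy sketch of two such modules, one linear and one tanh; the class and method names are our own, and your homework's interface may differ in details.

import numpy as np

class Linear:
    def __init__(self, m, n):
        self.W = np.random.normal(0.0, 1.0 / m, (m, n))   # Gaussian(0, 1/m) init, m x n
        self.W0 = np.zeros((n, 1))                         # n x 1

    def forward(self, A):                  # u -> v
        self.A = A                         # cache the input for the backward pass
        return self.W.T @ A + self.W0      # Z = W^T A + W0

    def backward(self, dLdZ):              # dL/dv -> dL/du, plus weight gradients
        self.dLdW = self.A @ dLdZ.T        # equation 8.1: A^(l-1) (dloss/dZ^l)^T
        self.dLdW0 = dLdZ
        return self.W @ dLdZ               # dloss/dA^(l-1) = W^l dloss/dZ^l

    def sgd_step(self, eta):
        self.W -= eta * self.dLdW
        self.W0 -= eta * self.dLdW0

class Tanh:
    def forward(self, Z):
        self.A = np.tanh(Z)
        return self.A

    def backward(self, dLdA):
        return dLdA * (1.0 - self.A ** 2)  # dA/dZ is diagonal, entries 1 - tanh^2(z)

A network is then a list of such modules; the forward pass calls forward on each module in order, and the backward pass calls backward in reverse order, exactly as in the "blame propagation" picture above.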


5 Training
Here we go! Here’s how to do stochastic gradient descent training on a feed-forward neural
network. After this pseudo-code, we motivate the choice of initialization in lines 2 and 3.
The actual computation of the gradient values (e.g. ∂loss/∂AL ) is not directly defined in
this code, because we want to make the structure of the computation clear.
Study Question: What is ∂Zl /∂W l ?

Study Question: Which terms in the code below depend on fL ?

SGD-NEURAL-NET(D_n, T, L, (m^1, ..., m^L), (f^1, ..., f^L))

1   for l = 1 to L
2       W_ij^l ∼ Gaussian(0, 1/m^l)
3       W_0j^l ∼ Gaussian(0, 1)
4   for t = 1 to T
5       i = random sample from {1, ..., n}
6       A^0 = x^(i)
7       // forward pass to compute the output A^L
8       for l = 1 to L
9           Z^l = (W^l)^T A^(l−1) + W_0^l
10          A^l = f^l(Z^l)
11      loss = Loss(A^L, y^(i))
12      for l = L to 1:
13          // error back-propagation
14          ∂loss/∂A^l = if l < L then ∂loss/∂Z^(l+1) · ∂Z^(l+1)/∂A^l else ∂loss/∂A^L
15          ∂loss/∂Z^l = ∂loss/∂A^l · ∂A^l/∂Z^l
16          // compute gradient with respect to weights
17          ∂loss/∂W^l = ∂loss/∂Z^l · ∂Z^l/∂W^l
18          ∂loss/∂W_0^l = ∂loss/∂Z^l · ∂Z^l/∂W_0^l
19          // stochastic gradient descent update
20          W^l = W^l − η(t) · ∂loss/∂W^l
21          W_0^l = W_0^l − η(t) · ∂loss/∂W_0^l

Initializing W is important; if you do it badly there is a good chance the neural network training won't work well. First, it is important to initialize the weights to random values. We want different parts of the network to tend to "address" different aspects of the problem; if they all start at the same weights, the symmetry will often keep the values from moving in useful directions. Second, many of our activation functions have (near) zero slope when the pre-activation z values have large magnitude, so we generally want to keep the initial weights small, so that we will be in a situation where the gradients are non-zero and gradient descent has some useful signal about which way to go.

One good general-purpose strategy is to choose each weight at random from a Gaussian (normal) distribution with mean 0 and standard deviation (1/m) where m is the number of inputs to the unit.
Study Question: If the input x to this unit is a vector of 1's, what would the expected pre-activation z value be with these initial weights?
We write this choice (where ∼ means "is drawn randomly from the distribution") as

$$W_{ij}^l \sim \text{Gaussian}\left(0, \frac{1}{m^l}\right)\;.$$
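As an illustration of how the pieces of the pseudocode fit together, here is a standalone numpy sketch of the same loop for a network of tanh layers with a linear last layer and squared loss. All names are ours; we use a constant step size instead of η(t), initialize the offsets to zero for simplicity rather than following line 3 exactly, and read the second argument of Gaussian(·, ·) as the standard deviation, per the discussion above.

import numpy as np

def sgd_neural_net(X, Y, layer_sizes, T, eta=0.01):
    # X: n x m^0 data matrix, Y: n x m^L targets, layer_sizes = (m^0, m^1, ..., m^L)
    rng = np.random.default_rng(0)
    Ws = [rng.normal(0.0, 1.0 / m, (m, n_out))             # lines 2-3: random initialization
          for m, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    W0s = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]
    L = len(Ws)
    for t in range(T):
        i = rng.integers(len(X))                            # line 5: pick one example
        A = [X[i].reshape(-1, 1)]                           # A^0
        Zs = []
        for l in range(L):                                  # lines 8-10: forward pass
            Zs.append(Ws[l].T @ A[-1] + W0s[l])
            A.append(np.tanh(Zs[-1]) if l < L - 1 else Zs[-1])
        dLdA = 2.0 * (A[-1] - Y[i].reshape(-1, 1))          # squared-loss gradient wrt A^L
        for l in reversed(range(L)):                        # lines 12-21: backward pass
            dAdZ = 1.0 if l == L - 1 else (1.0 - np.tanh(Zs[l]) ** 2)
            dLdZ = dLdA * dAdZ
            dLdA = Ws[l] @ dLdZ                             # blame passed to A^(l-1)
            Ws[l] -= eta * (A[l] @ dLdZ.T)                  # equation 8.1
            W0s[l] -= eta * dLdZ
    return Ws, W0s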


It will often turn out (especially for fancier activations and loss functions) that computing ∂loss/∂Z^L is easier than computing ∂loss/∂A^L and ∂A^L/∂Z^L. So, we may instead ask for an implementation of a loss function to provide a backward method that computes ∂loss/∂Z^L directly.
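As an example of why this can be convenient, here is a minimal sketch of a loss "module" for a softmax last layer with the multi-class NLL loss of section 6.2. It uses the standard identity (not derived in these notes) that for this pairing ∂loss/∂Z^L = A^L − y, which is much simpler than forming ∂loss/∂A^L and ∂A^L/∂Z^L separately; the class and method names are ours.

import numpy as np

class SoftmaxNLL:
    def forward(self, ZL, y):
        # ZL: K x 1 pre-activations of the last layer; y: one-hot K x 1 label.
        e = np.exp(ZL - np.max(ZL))
        self.AL = e / e.sum()                  # softmax output
        self.y = y
        return -float(y.T @ np.log(self.AL))   # NLLM loss value

    def backward(self):
        return self.AL - self.y                # dloss/dZ^L, computed directly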

6 Loss functions and activation functions


Different loss functions make different assumptions about the range of values they will receive as input and, as we have seen, different activation functions will produce output values in different ranges. When you are designing a neural network, it's important to make these things fit together well. In particular, we will think about matching loss functions with the activation function in the last layer, f^L. Here is a table of loss functions and activations that make sense for them:

Loss      f^L
squared   linear
hinge     linear
NLL       sigmoid
NLLM      softmax

6.1 Two-class classification and log likelihood


For classification, the natural loss function is 0-1 loss, but we have already discussed the fact that it's very inconvenient for gradient-based learning because its derivative is discontinuous.
We have also explored negative log likelihood (NLL) in chapter 5. It is nice and smooth,
and extends nicely to multiple classes as we will see below.
Hinge loss gives us another way, for binary classification problems, to make a smoother
objective, penalizing the margins of the labeled points relative to the separator. The hinge
loss is defined to be

Lh (guess, actual) = max(1 − guess · actual, 0) ,

when actual ∈ {+1, −1}. It has the property that, if the sign of guess is the same as the sign
of actual and the magnitude of guess is greater than 1, then the loss is 0.
It is trying to enforce not only that the guess have the correct sign, but also that it should
be some distance away from the separator. Using hinge loss, together with a squared-
norm regularizer, actually forces the learning process to try to find a separator that has the
maximum margin relative to the data set. This optimization set-up is called a support vector
machine, and was popular before the renaissance of neural networks and gradient descent,
because it has a quadratic form that makes it particularly easy to optimize.

6.2 Multi-class classification and log likelihood


We can extend the idea of NLL directly to multi-class classification with K classes, where the training label is represented with the one-hot vector y = [y_1, ..., y_K]^T, where y_k = 1 if the example is of class k. Assume that our network uses softmax as the activation function in the last layer, so that the output is a = [a_1, ..., a_K]^T, which represents a probability distribution over the K possible classes. Then, the probability that our network predicts the correct class for this example is $\prod_{k=1}^{K} a_k^{y_k}$ and the log of the probability that it is correct is $\sum_{k=1}^{K} y_k \log a_k$, so

$$L_{nllm}(\text{guess}, \text{actual}) = -\sum_{k=1}^{K} \text{actual}_k \cdot \log(\text{guess}_k)\;.$$

We'll call this NLLM for negative log likelihood multiclass.


Study Question: Show that Lnllm for K = 2 is the same as Lnll .


7 Optimizing neural network parameters


Because neural networks are just parametric functions, we can optimize loss with respect to
the parameters using standard gradient-descent software, but we can take advantage of the
structure of the loss function and the hypothesis class to improve optimization. As we have
seen, the modular function-composition structure of a neural network hypothesis makes it
easy to organize the computation of the gradient. As we have also seen earlier, the structure
of the loss function as a sum over terms, one per training data point, allows us to consider
stochastic gradient methods. In this section we’ll consider some alternative strategies for
organizing training, and also for making it easier to handle the step-size parameter.

7.1 Batches
Assume that we have an objective of the form

$$J(W) = \sum_{i=1}^{n} L(h(x^{(i)}; W), y^{(i)})\;,$$

where h is the function computed by a neural network, and W stands for all the weight
matrices and vectors in the network.
When we perform batch gradient descent, we use the update rule

W := W − η∇W J(W) ,

which is equivalent to

$$W := W - \eta \sum_{i=1}^{n} \nabla_W L(h(x^{(i)}; W), y^{(i)})\;.$$

So, we sum up the gradient of loss at each training point, with respect to W, and then take
a step in the negative direction of the gradient.
In stochastic gradient descent, we repeatedly pick a point (x(i) , y(i) ) at random from the
data set, and execute a weight update on that point alone:

W := W − η∇W L(h(x(i) ; W), y(i) ) .

As long as we pick points uniformly at random from the data set, and decrease η at an
appropriate rate, we are guaranteed, with high probability, to converge to at least a local
optimum.
These two methods have offsetting virtues. The batch method takes steps in the exact
gradient direction but requires a lot of computation before even a single step can be taken,
especially if the data set is large. The stochastic method begins moving right away, and can
sometimes make very good progress before looking at even a substantial fraction of the
whole data set, but if there is a lot of variability in the data, it might require a very small η
to effectively average over the individual steps moving in “competing” directions.
An effective strategy is to “average” between batch and stochastic gradient descent by
using mini-batches. For a mini-batch of size k, we select k distinct data points uniformly
at random from the data set and do the update based just on their contributions to the
gradient
$$W := W - \eta \sum_{i=1}^{k} \nabla_W L(h(x^{(i)}; W), y^{(i)})\;.$$

Most neural network software packages are set up to do mini-batches.


Study Question: For what value of k is mini-batch gradient descent equivalent to


stochastic gradient descent? To batch gradient descent?
Picking k unique data points at random from a large data set is potentially computationally difficult. An alternative strategy, if you have an efficient procedure for randomly shuffling the data set (or randomly shuffling a list of indices into the data set), is to operate in a loop, roughly as follows:

MINI-BATCH-SGD(NN, data, k)

1   n = length(data)
2   while not done:
3       RANDOM-SHUFFLE(data)
4       for i = 1 to n/k
5           BATCH-GRADIENT-UPDATE(NN, data[(i − 1)k : ik])
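In Python, that loop is just a shuffle of indices followed by slicing; here is a minimal sketch (names are ours, and batch_gradient_update stands in for one gradient step on a batch):

import numpy as np

def mini_batch_sgd(nn, data, k, epochs, batch_gradient_update):
    n = len(data)
    idx = np.arange(n)
    for _ in range(epochs):
        np.random.shuffle(idx)                         # random shuffle of the indices
        for i in range(n // k):
            batch = [data[j] for j in idx[i * k:(i + 1) * k]]
            batch_gradient_update(nn, batch)           # one update on this mini-batch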

7.2 Adaptive step-size


Picking a value for η is difficult and time-consuming. If it’s too small, then convergence is
slow and if it’s too large, then we risk divergence or slow convergence due to oscillation.
This problem is even more pronounced in stochastic or mini-batch mode, because we know
we need to decrease the step size for the formal guarantees to hold.
It's also true that, within a single neural network, we may well want to have different step sizes. As our networks become deep (with increasing numbers of layers) we can find that the magnitude of the gradient of the loss with respect to the weights in the last layer, ∂loss/∂W^L, may be substantially different from the gradient of the loss with respect to the weights in the first layer, ∂loss/∂W^1. If you look carefully at equation 8.3, you can see that the output gradient is multiplied by all the weight matrices of the network and is "fed back" through all the derivatives of all the activation functions. This can lead to a problem of exploding or vanishing gradients, in which the back-propagated gradient is much too big or small to be used in an update rule with the same step size.

So, we'll consider having an independent step-size parameter for each weight, and updating it based on a local view of how the gradient updates have been going. (This section is very strongly influenced by Sebastian Ruder's excellent blog posts on the topic: ruder.io/optimizing-gradient-descent.)

7.2.1 Running averages

We'll start by looking at the notion of a running average. It's a computational strategy for estimating a possibly weighted average of a sequence of data. Let our data sequence be a_1, a_2, ...; then we define a sequence of running average values, A_0, A_1, A_2, ... using the equations

A_0 = 0
A_t = γ_t A_{t−1} + (1 − γ_t) a_t

where γ_t ∈ (0, 1). If γ_t is a constant, then this is a moving average, in which

$$A_T = \gamma A_{T-1} + (1-\gamma) a_T = \gamma\left(\gamma A_{T-2} + (1-\gamma) a_{T-1}\right) + (1-\gamma) a_T = \sum_{t=0}^{T} \gamma^{T-t}(1-\gamma)\, a_t\;.$$

So, you can see that inputs a_t closer to the end of the sequence have more effect on A_T than early inputs.


If, instead, we set γt = (t − 1)/t, then we get the actual average.


Study Question: Prove to yourself that the previous assertion holds.

7.2.2 Momentum
Now, we can use methods that are a bit like running averages to describe strategies for
computing η. The simplest method is momentum, in which we try to “average” recent
gradient updates, so that if they have been bouncing back and forth in some direction, we
take out that component of the motion. For momentum, we have

V_0 = 0
V_t = γ V_{t−1} + η ∇_W J(W_{t−1})
W_t = W_{t−1} − V_t

This doesn't quite look like an adaptive step size. But what we can see is that, if we let η = η′(1 − γ), then the rule looks exactly like doing an update with step size η′ on a moving average of the gradients with parameter γ:

M_0 = 0
M_t = γ M_{t−1} + (1 − γ) ∇_W J(W_{t−1})
W_t = W_{t−1} − η′ M_t

Study Question: Prove to yourself that these formulations are equivalent.


We will find that Vt will be bigger in dimensions that consistently have the same sign
for ∇W and smaller for those that don’t. Of course we now have two parameters to set (η
and γ), but the hope is that the algorithm will perform better overall, so it will be worth
trying to find good values for them. Often γ is set to be something like 0.9.

[Figure: the red arrows show the update after one step of mini-batch gradient descent with momentum; the blue points show the direction of the gradient with respect to the mini-batch at each step. Momentum smooths the path taken towards the local minimum and leads to faster convergence.]

Study Question: If you set γ = 0.1, would momentum have more of an effect or less
of an effect than if you set it to 0.9?
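Here is a minimal numpy sketch of the momentum update (the function and variable names are ours); starting from V = np.zeros_like(W) and calling it once per gradient evaluation reproduces the first formulation above:

import numpy as np

def momentum_step(W, V, grad, eta=0.01, gamma=0.9):
    V = gamma * V + eta * grad     # V_t = gamma V_{t-1} + eta grad J(W_{t-1})
    W = W - V                      # W_t = W_{t-1} - V_t
    return W, V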


7.2.3 Adadelta
Another useful idea is this: we would like to take larger steps in parts of the space where J(W) is nearly flat (because there's no risk of taking too big a step due to the gradient being large) and smaller steps when it is steep. We'll apply this idea to each weight independently, and end up with a method called adadelta, which is a variant on adagrad (for adaptive gradient). Even though our weights are indexed by layer, input unit and output unit, for simplicity here, just let W_j be any weight in the network (we will do the same thing for all of them).

g_{t,j} = ∇_W J(W_{t−1})_j
G_{t,j} = γ G_{t−1,j} + (1 − γ) g_{t,j}²
$$W_{t,j} = W_{t-1,j} - \frac{\eta}{\sqrt{G_{t,j} + \epsilon}}\, g_{t,j}$$

The sequence G_{t,j} is a moving average of the square of the jth component of the gradient. We square it in order to be insensitive to the sign—we want to know whether the magnitude is big or small. Then, we perform a gradient update to weight j, but divide the step size by √(G_{t,j} + ε), which is larger when the surface is steeper in direction j at point W_{t−1} in weight space; this means that the step size will be smaller when it's steep and larger when it's flat.
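Here is a minimal numpy sketch of this per-weight update (names are ours); W, G, and grad are arrays of the same shape, and G starts at zero:

import numpy as np

def adadelta_step(W, G, grad, eta=0.01, gamma=0.9, eps=1e-8):
    G = gamma * G + (1.0 - gamma) * grad ** 2      # moving average of the squared gradient
    W = W - eta / np.sqrt(G + eps) * grad          # smaller steps where the gradient has been large
    return W, G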

7.2.4 Adam
Adam has become the default method of managing step sizes in neural networks. (Although, interestingly, it may actually violate the convergence conditions of SGD: arxiv.org/abs/1705.08292.) It combines the ideas of momentum and adadelta. We start by writing moving averages of the gradient and squared gradient, which reflect estimates of the mean and variance of the gradient for weight j:

g_{t,j} = ∇_W J(W_{t−1})_j
m_{t,j} = B_1 m_{t−1,j} + (1 − B_1) g_{t,j}
v_{t,j} = B_2 v_{t−1,j} + (1 − B_2) g_{t,j}² .

A problem with these estimates is that, if we initialize m_0 = v_0 = 0, they will always be biased (slightly too small). So we will correct for that bias by defining

$$\hat{m}_{t,j} = \frac{m_{t,j}}{1 - B_1^t} \qquad \hat{v}_{t,j} = \frac{v_{t,j}}{1 - B_2^t} \qquad W_{t,j} = W_{t-1,j} - \frac{\eta}{\sqrt{\hat{v}_{t,j} + \epsilon}}\, \hat{m}_{t,j}\;.$$

Note that B_1^t is B_1 raised to the power t, and likewise for B_2^t. To justify these corrections, note that if we were to expand m_{t,j} in terms of m_{0,j} and g_{0,j}, g_{1,j}, ..., g_{t,j}, the coefficients would sum to 1. However, the coefficient behind m_{0,j} is B_1^t and since m_{0,j} = 0, the sum of the coefficients of the non-zero terms is 1 − B_1^t, hence the correction. The same justification holds for v_{t,j}.

Now, our update for weight j has a step size that takes the steepness into account, as in adadelta, but also tends to move in the same direction, as in momentum. The authors of this method propose setting B_1 = 0.9, B_2 = 0.999, ε = 10^{−8}. Although we now have even more parameters, Adam is not highly sensitive to their values (small changes do not have a huge effect on the result).
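Here is a minimal numpy sketch of the Adam update (names are ours); W, m, v, and grad are arrays of the same shape, m and v start at zero, and t counts iterations from 1:

import numpy as np

def adam_step(W, m, v, grad, t, eta=0.001, B1=0.9, B2=0.999, eps=1e-8):
    m = B1 * m + (1.0 - B1) * grad           # moving average of the gradient
    v = B2 * v + (1.0 - B2) * grad ** 2      # moving average of the squared gradient
    m_hat = m / (1.0 - B1 ** t)              # bias corrections
    v_hat = v / (1.0 - B2 ** t)
    W = W - eta / np.sqrt(v_hat + eps) * m_hat
    return W, m, v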


Study Question: Define m̂j directly as a moving average of gt,j . What is the decay
(γ parameter)?
Even though we now have a step-size for each weight, and we have to update various quantities on each iteration of gradient descent, it's relatively easy to implement by maintaining a matrix for each quantity (m_t^l, v_t^l, g_t^l, (g_t^2)^l) in each layer of the network.

8 Regularization
So far, we have only considered optimizing loss on the training data as our objective for
neural network training. But, as we have discussed before, there is a risk of overfitting if
we do this. The pragmatic fact is that, in current deep neural networks, which tend to be
very large and to be trained with a large amount of data, overfitting is not a huge problem.
This runs counter to our current theoretical understanding and the study of this question
is a hot area of research. Nonetheless, there are several strategies for regularizing a neural
network, and they can sometimes be important.

8.1 Methods related to ridge regression


One group of strategies can, interestingly, be shown to have similar effects: early stopping, weight decay, and adding noise to the training data. (This result is due to Bishop, described in his textbook and here: doi.org/10.1162/neco.1995.7.1.108.)

Early stopping is the easiest to implement and is in fairly common use. The idea is to train on your training set, but at every epoch (pass through the whole training set, or possibly more frequently), evaluate the loss of the current W on a validation set. It will generally be the case that the loss on the training set goes down fairly consistently with each iteration; the loss on the validation set will initially decrease, but then begin to increase again. Once you see that the validation loss is systematically increasing, you can stop training and return the weights that had the lowest validation error.
Another common strategy is to simply penalize the norm of all the weights, as we did in ridge regression. This method is known as weight decay, because when we take the gradient of the objective

$$J(W) = \sum_{i=1}^{n} \text{Loss}(\text{NN}(x^{(i)}), y^{(i)}; W) + \lambda \|W\|^2$$

we end up with an update of the form

$$W_t = W_{t-1} - \eta\left(\nabla_W \text{Loss}(\text{NN}(x^{(i)}), y^{(i)}; W_{t-1}) + \lambda W_{t-1}\right) = W_{t-1}(1 - \lambda\eta) - \eta\left(\nabla_W \text{Loss}(\text{NN}(x^{(i)}), y^{(i)}; W_{t-1})\right)\;.$$

This rule has the form of first "decaying" W_{t−1} by a factor of (1 − λη) and then taking a gradient step.
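In code, the decay-then-step form is one line; here is a minimal sketch (names are ours, and the arguments can be numpy arrays):

def weight_decay_step(W, grad_loss, eta=0.01, lam=0.001):
    # Equivalent forms: W - eta*(grad_loss + lam*W)  ==  W*(1 - lam*eta) - eta*grad_loss
    return W * (1.0 - lam * eta) - eta * grad_loss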
Finally, the same effect can be achieved by perturbing the x(i) values of the training data
by adding a small amount of zero-mean normally distributed noise before each gradient
computation. It makes intuitive sense that it would be more difficult for the network to
overfit to particular training data if they are changed slightly on each training step.

8.2 Dropout
Dropout is a regularization method that was designed to work with deep neural networks.
The idea behind it is, rather than perturbing the data every time we train, we’ll perturb the
network! We'll do this by randomly, on each training step, selecting a set of units in each layer and prohibiting them from participating. Thus, all of the units will have to take a
kind of “collective” responsibility for getting the answer right, and will not be able to rely
on any small subset of the weights to do all the necessary computation. This tends also to
make the network more robust to data perturbations.
During the training phase, for each training example, for each unit, randomly with probability p temporarily set a_j^l := 0. There will be no contribution to the output and no gradient update for the associated unit.
Study Question: Be sure you understand why, when using SGD, setting an activa-
tion value to 0 will cause that unit’s weights not to be updated on that iteration.
When we are done training and want to use the network to make predictions, we mul-
tiply all weights by p to achieve the same average activation levels.
Implementing dropout is easy! In the forward pass during training, we let

$$a^l = f(z^l) * d^l$$

where ∗ denotes component-wise product and d^l is a vector of 0's and 1's drawn randomly with probability p. The backwards pass depends on a^l, so we do not need to make any further changes to the algorithm.
It is common to set p to 0.5, but this is something one might experiment with to get
good results on your problem and data.
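Here is a minimal numpy sketch of that forward pass (names are ours). The notes use p both for the dropping step and for the test-time rescaling; in this sketch we read p as the probability that a unit is kept, which is the reading under which multiplying by p at prediction time matches the average training-time activation:

import numpy as np

def dropout_forward(z, f, p=0.5, training=True):
    a = f(z)
    if training:
        d = (np.random.random(a.shape) < p).astype(a.dtype)   # 0/1 mask, 1 with probability p
        return a * d            # component-wise product a^l = f(z^l) * d^l
    return p * a                # at prediction time, scale by p instead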

8.3 Batch Normalization


A more modern alternative to dropout, which tends to achieve better performance, is batch normalization. (For more details see arxiv.org/abs/1502.03167.) It was originally developed to address a problem of covariate shift: that is, if you consider the second layer of a two-layer neural network, the distribution of its input values is changing over time as the first layer's weights change. Learning when the input distribution is changing is extra difficult: you have to change your weights to improve your predictions, but also just to compensate for a change in your inputs (imagine, for instance, that the magnitude of the inputs to your layer is increasing over time—then your weights will have to decrease, just to keep your predictions the same).

So, when training with mini-batches, the idea is to standardize the input values for each mini-batch, just in the way that we did it in section 2.3 of chapter 4, subtracting off the mean and dividing by the standard deviation of each input dimension. This means that the scale of the inputs to each layer remains the same, no matter how the weights in previous layers change. However, this somewhat complicates matters, because the computation of the weight updates will need to take into account that we are performing this transformation. In the modular view, batch normalization can be seen as a module that is applied to z^l, interposed after the product with W^l and before input to f^l.

Batch normalization ends up having a regularizing effect for similar reasons that adding noise and dropout do: each mini-batch of data ends up being mildly perturbed, which prevents the network from exploiting very particular values of the data points.

Let's think of the batch-norm layer as taking Z^l as input and producing an output Ẑ^l as output. But now, instead of thinking of Z^l as an n^l × 1 vector, we have to explicitly think about handling a mini-batch of data of size K, all at once, so Z^l will be n^l × K, and so will the output Ẑ^l.

Our first step will be to compute the batchwise mean and standard deviation. Let µ^l be the n^l × 1 vector where

$$\mu_i^l = \frac{1}{K} \sum_{j=1}^{K} Z_{ij}^l\;,$$


and let σ^l be the n^l × 1 vector where

$$\sigma_i^l = \sqrt{\frac{1}{K} \sum_{j=1}^{K} \left(Z_{ij}^l - \mu_i^l\right)^2}\;.$$

The basic normalized version of our data would be a matrix, element (i, j) of which is

$$\overline{Z}_{ij}^l = \frac{Z_{ij}^l - \mu_i^l}{\sigma_i^l + \epsilon}\;,$$

where ε is a very small constant to guard against division by zero. However, if we let these be our Ẑ^l values, we really are forcing something too strong on our data—our goal was to normalize across the data batch, but not necessarily force the output values to have exactly mean 0 and standard deviation 1. So, we will give the layer the "opportunity" to shift and scale the outputs by adding new weights to the layer. These weights are G^l and B^l, each of which is an n^l × 1 vector. Using the weights, we define the final output to be

$$\widehat{Z}_{ij}^l = G_i^l \overline{Z}_{ij}^l + B_i^l\;.$$

That's the forward pass. Whew!
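Here is a minimal numpy sketch of that forward pass for one layer (names are ours); Z is n^l × K, with one column per example in the mini-batch, and G and B are the n^l × 1 learned weights:

import numpy as np

def batchnorm_forward(Z, G, B, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)        # batchwise mean, n x 1
    sigma = Z.std(axis=1, keepdims=True)      # batchwise standard deviation, n x 1
    Zbar = (Z - mu) / (sigma + eps)           # normalized values
    Zhat = G * Zbar + B                       # shifted and scaled output
    return Zhat, Zbar, mu, sigma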


Now, for the backward pass, we have to do two things: given ∂L/∂Ẑ^l,

• Compute ∂L/∂Z^l for back-propagation, and

• Compute ∂L/∂G^l and ∂L/∂B^l for gradient updates of the weights in this layer.

Schematically,

$$\frac{\partial L}{\partial B} = \frac{\partial L}{\partial \widehat{Z}} \frac{\partial \widehat{Z}}{\partial B}\;.$$
It's hard to think about these derivatives in matrix terms, so we'll see how it works for the components. B_i contributes to Ẑ_ij for all data points j in the batch. So

$$\frac{\partial L}{\partial B_i} = \sum_j \frac{\partial L}{\partial \widehat{Z}_{ij}} \frac{\partial \widehat{Z}_{ij}}{\partial B_i} = \sum_j \frac{\partial L}{\partial \widehat{Z}_{ij}}\;.$$

Similarly, G_i contributes to Ẑ_ij for all data points j in the batch. So

$$\frac{\partial L}{\partial G_i} = \sum_j \frac{\partial L}{\partial \widehat{Z}_{ij}} \frac{\partial \widehat{Z}_{ij}}{\partial G_i} = \sum_j \frac{\partial L}{\partial \widehat{Z}_{ij}} \overline{Z}_{ij}\;.$$

Now, let's figure out how to do backprop. We can start schematically:

$$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial \widehat{Z}} \frac{\partial \widehat{Z}}{\partial Z}\;.$$

And because dependencies only exist across the batch, but not across the unit outputs,

$$\frac{\partial L}{\partial Z_{ij}} = \sum_{k=1}^{K} \frac{\partial L}{\partial \widehat{Z}_{ik}} \frac{\partial \widehat{Z}_{ik}}{\partial Z_{ij}}\;.$$


The next step is to note that

$$\frac{\partial \widehat{Z}_{ik}}{\partial Z_{ij}} = \frac{\partial \widehat{Z}_{ik}}{\partial \overline{Z}_{ik}} \cdot \frac{\partial \overline{Z}_{ik}}{\partial Z_{ij}} = G_i \cdot \frac{\partial \overline{Z}_{ik}}{\partial Z_{ij}}\;,$$

and that

$$\frac{\partial \overline{Z}_{ik}}{\partial Z_{ij}} = \left(\delta_{jk} - \frac{\partial \mu_i}{\partial Z_{ij}}\right)\frac{1}{\sigma_i} - \frac{Z_{ik} - \mu_i}{\sigma_i^2}\cdot\frac{\partial \sigma_i}{\partial Z_{ij}}\;,$$

where δ_jk = 1 if j = k and δ_jk = 0 otherwise. Getting close! We need two more small parts:

$$\frac{\partial \mu_i}{\partial Z_{ij}} = \frac{1}{K} \qquad \frac{\partial \sigma_i}{\partial Z_{ij}} = \frac{Z_{ij} - \mu_i}{K \sigma_i}$$

Putting the whole crazy thing together, we get

$$\frac{\partial L}{\partial Z_{ij}} = \sum_{k=1}^{K} \frac{\partial L}{\partial \widehat{Z}_{ik}}\, G_i\, \frac{1}{K\sigma_i}\left(\delta_{jk} K - 1 - \frac{(Z_{ik} - \mu_i)(Z_{ij} - \mu_i)}{\sigma_i^2}\right)$$
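Putting the three gradients into code, here is a minimal numpy sketch of the backward pass (names are ours); it vectorizes the sum over k in the formula above:

import numpy as np

def batchnorm_backward(dLdZhat, Z, G, mu, sigma, eps=1e-8):
    K = Z.shape[1]
    Zc = Z - mu                                            # Z_ij - mu_i
    Zbar = Zc / (sigma + eps)
    dLdB = dLdZhat.sum(axis=1, keepdims=True)              # sum_j dL/dZhat_ij
    dLdG = (dLdZhat * Zbar).sum(axis=1, keepdims=True)     # sum_j dL/dZhat_ij * Zbar_ij
    dLdZ = G / (K * sigma) * (
        K * dLdZhat
        - dLdZhat.sum(axis=1, keepdims=True)
        - Zc / sigma ** 2 * (dLdZhat * Zc).sum(axis=1, keepdims=True))
    return dLdZ, dLdG, dLdB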
