L07 Optimization
Roger Grosse
1 Introduction
Now that we’ve seen how to compute derivatives of the cost function with
respect to model parameters, what do we do with those derivatives? In this
lecture, we’re going to take a step back and look at optimization problems
more generally. We’ve briefly discussed gradient descent and used it to train
some models, but what exactly is the gradient, and why is it a good idea to
move opposite it? We also introduce stochastic gradient descent, a way of
obtaining noisy gradient estimates from a small subset of the data.
Using modern neural network libraries, it is easy to implement the back-
prop algorithm so that it correctly computes the gradient. It’s not always
so easy to get it to work well. In this lecture, we’ll make a list of things that
can go drastically wrong in neural net training, and talk about how we can
spot them. This includes: learning rates that are too large or too small,
symmetries, dead or saturated units, and badly conditioned curvature. We
discuss tricks to ameliorate all of these problems. In general, debugging a
learning algorithm is like debugging any other complex piece of software:
if something goes wrong, you need to make hypotheses about what might
have happened, and look for evidence or design experiments to test those
hypotheses. This requires a thorough understanding of the principles of
optimization. (Understanding the principles of neural nets and being able to diagnose failure modes are what distinguishes someone who's finished CSC421 from someone who's merely worked through the TensorFlow tutorial.)

Our style of thinking in this lecture will be very different from that in the last several lectures. When we discussed backprop, we looked at the gradient computations algebraically: we derived mathematical equations for computing all the derivatives. We also looked at the computations implementationally, seeing how to implement them efficiently (e.g. by vectorizing the computations), and designing an automatic differentiation system which
the computations), and designing an automatic differentiation system which
separated the backprop algorithm itself from the design of a network archi-
tecture. In this lecture, we’ll look at gradient descent geometrically: we’ll
reason qualitatively about optimization problems and about the behavior
of gradient descent, without thinking about how the gradients are actually
computed. I.e., we abstract away the gradient computation. One of the
most important skills to develop as a computer scientist is the ability to
move between different levels of abstraction, and to figure out which level
is most appropriate for the problem at hand.
• Know why stochastic gradient descent can be faster than batch gradi-
ent descent, and understand the tradeoffs in choosing the mini-batch
size.
• Know what effect the learning rate has on the training process. Why
can it be advantageous to decay the learning rate over time?
• Be able to recognize and diagnose the following problems that can come up in neural net training:
– slow progress
– instability
– fluctuations
– dead or saturated units
– symmetries
– badly conditioned curvature
Figure 1: (a) Cost surface for an optimization problem with two local min-
ima, one of which is the global minimum. (b) Cartoon plot of a one-
dimensional optimization problem, and the gradient descent iterates start-
ing from two different initializations, in two different basins of attraction.
denoted ∇θ E. This is the direction which goes directly uphill, i.e. the di-
rection which increases the cost the fastest relative to the distance moved.
We can’t determine the magnitude of the gradient from the contour plot,
but it is easy to determine its direction: the gradient is always orthogonal
(perpendicular) to the level sets. This gives an easy way to draw it on a
contour plot (e.g. see Figure 2(a)). Algebraically, the gradient is simply the
vector of partial derivatives of the cost function:

∇θ E = ∂E/∂θ = (∂E/∂θ1, . . . , ∂E/∂θM)⊤ (1)

(In this context, E is taken as a function of the parameters, not of the loss L. Therefore, the partial derivatives correspond to the values wij, bi, etc., computed from backpropagation.)
The fact that the vector of partial derivatives gives the steepest ascent
direction is far from obvious; you would see the derivation in a multivariable
calculus class, but here we will take it for granted.
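To make this concrete, here is a tiny self-contained sketch (the quadratic cost E and its gradient are made-up toy examples, not anything from the course) verifying that moving along the gradient increases the cost:

    import numpy as np

    def E(theta):
        # toy cost: E(theta) = theta_1^2 + 2 * theta_2^2
        return theta[0]**2 + 2.0 * theta[1]**2

    def grad_E(theta):
        # vector of partial derivatives (dE/dtheta_1, dE/dtheta_2)
        return np.array([2.0 * theta[0], 4.0 * theta[1]])

    theta = np.array([1.0, -0.5])
    g = grad_E(theta)
    print(g)                                   # [ 2. -2.]
    # taking a small step along the gradient increases the cost (steepest ascent)
    print(E(theta + 0.01 * g) > E(theta))      # True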
The gradient descent update rule (which we’ve already seen multiple
times) can be written in terms of the gradient:
θ ← θ − α∇θ E, (2)
where α is the scalar-valued learning rate. This shows directly that gra-
dient descent moves opposite the gradient, or in the direction of steepest
descent. Too large a learning rate can cause instability, whereas too small
a learning rate can cause slow progress. In general, the learning rate is one
of the most important hyperparameters of a learning algorithm, so it’s very
important to tune it, i.e. look for a good value. (Most commonly, one tries a bunch of values and picks the one which works the best.) Recall that hyperparameters are parameters which aren't part of the model and which aren't tuned with gradient descent.
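For illustration, a minimal sketch of the batch gradient descent loop on the same made-up quadratic cost, with the learning rate α as the value we would tune:

    import numpy as np

    def grad_E(theta):
        # gradient of the toy cost E(theta) = theta_1^2 + 2 * theta_2^2
        return np.array([2.0 * theta[0], 4.0 * theta[1]])

    alpha = 0.1                        # learning rate (the hyperparameter to tune)
    theta = np.array([1.0, -0.5])      # arbitrary initialization

    for t in range(100):
        theta = theta - alpha * grad_E(theta)   # move opposite the gradient

    print(theta)                       # approaches the minimum at [0, 0]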
For completeness, it's worth mentioning one more possible feature of a
cost function, namely a saddle point, shown in Figure 2(b). This is a point
where the gradient is zero, but which isn’t a local minimum because the cost
increases in some directions and decreases in others. If we’re exactly on a
saddle point, gradient descent won’t go anywhere because the gradient is
zero.
If we use this formula directly, we must visit every training example to com-
pute the gradient. This is known as batch training, since we’re treating
the entire training set as a batch. But this can be very time-consuming, and
it’s also unnecessary: we can get a stochastic estimate of the gradient from
a single training example. In stochastic gradient descent (SGD), we
pick a training example, and update the parameters opposite the gradient
for that example:

θ ← θ − α∇θ En. (7)

(This is identical to the gradient descent update rule, except that E is replaced with En.)
SGD is able to make a lot of progress even before the whole training set has
been visited. A lot of datasets are so large that it can take hours or longer
to make a single pass over the training set; in such cases, batch training is
impractical, and we need to use a stochastic algorithm.
In practice, we don’t compute the gradient on a single example, but
rather average it over a batch of B training examples known as a mini-
batch. Typical mini-batch sizes are on the order of 100. Why mini-batches?
Observe that the number of operations required to compute the gradient for
a mini-batch is linear in the size of the mini-batch (since mathematically, the
gradient for each training example is a separate computation). Therefore, if
all operations were equally expensive, one would always prefer to use B = 1.
In practice, there are two important reasons to use B > 1:
• Operations on mini-batches can be vectorized by writing them in
terms of matrix operations. This reduces the interpreter overhead,
and makes use of efficient and carefully tuned linear algebra libraries.
(In previous lectures, we already derived vectorized forms of batch gradient descent. The same formulas can be applied in mini-batch mode.)
• Most large neural networks are trained on GPUs or some other architecture which enables a high degree of parallelism. There is much more parallelism to exploit when B is large, since the gradients can be computed independently for each training example.
On the flip side, we don’t want to make B too large, because then it takes
too long to compute the gradients. In the extreme case where B = N , we
get batch gradient descent. (The activations for large mini-batches may also
be too large to store in memory.)
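Putting the pieces together, here is a sketch of mini-batch SGD on a made-up linear regression problem; the data, the sizes, and the choice B = 100 are all illustrative rather than prescriptive:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, B = 1000, 5, 100            # dataset size, input dimension, mini-batch size
    X = rng.normal(size=(N, D))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=N)

    w = np.zeros(D)
    alpha = 0.1

    for epoch in range(10):
        perm = rng.permutation(N)                    # visit examples in random order
        for i in range(0, N, B):
            idx = perm[i:i + B]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient of the squared error
            w -= alpha * grad                        # SGD update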
4.1 Incorrect gradient computations
If your computed gradients are wrong, then all bets are off. If you’re lucky,
the training will fail completely, and you’ll notice that something is wrong.
If you’re unlucky, it will sort of work, but it will also somehow be broken.
This is much more common than you might expect: it’s not unusual for
an incorrectly implemented learning algorithm to perform reasonably well.
But it will perform a bit worse than it should; furthermore, it will make it
harder to tune, since some of the diagnostics might give misleading results
if the gradients are wrong. Therefore, it’s completely useless to do anything
else until you’re sure the gradients are correct.
Fortunately, it’s possible to be confident in the correctness of the gra-
dients. We’ve already covered finite difference methods, which are pretty
reliable (see the lecture “Training a Classifier”). If you’re using one of
the major neural net frameworks, you’re pretty safe, because the gradients
are being computed automatically by a system which has been thoroughly
tested. For the rest of this discussion, we’ll assume the gradient computa-
tion is correctly implemented.
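For reference, a finite difference check along the lines of the one in "Training a Classifier" might look like the following sketch (the function name and tolerance are my own choices):

    import numpy as np

    def check_grad(E, grad_E, theta, eps=1e-6, tol=1e-4):
        # compare each analytic partial derivative against a central finite difference
        g = grad_E(theta)
        for i in range(len(theta)):
            e_i = np.zeros_like(theta)
            e_i[i] = eps
            fd = (E(theta + e_i) - E(theta - e_i)) / (2.0 * eps)
            assert abs(fd - g[i]) < tol, "coordinate %d: %g vs %g" % (i, fd, g[i])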
4.3 Symmetries
Suppose we initialize all the weights and biases of a neural network to zero.
All the hidden activations will be identical, and you can check by inspection
(see the lecture on backprop) that all the weights feeding into a given hid-
den unit will have identical derivatives. Therefore, these weights will have
identical values in the next step, and so on. With nothing to distinguish
different hidden units, no learning will occur. This phenomenon is perhaps
the most important example of a saddle point in neural net training.
Fortunately, the problem is easy to deal with, using any sort of sym-
metry breaking. Once two hidden units compute slightly different things,
they will probably get a gradient signal driving them even farther apart.
(Think of this in terms of the saddle point picture; if you’re exactly on
the saddle point, you get zero gradient, but if you’re slightly to one side,
Figure 3: (a) Slow progress due to a small learning rate. (b) Instability
due to a large learning rate. (c) Oscillations due to a large learning rate.
you’ll move away from it, which gives you a larger gradient, and so on.) In
practice, we typically initialize all the weights randomly.
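To see the symmetry concretely, here is a small made-up example (a one-hidden-layer net with a squared-error loss; none of the details come from the notes): with zero initialization, every row of the input-to-hidden weight gradient is identical, so the hidden units can never become different from one another, whereas a small random initialization breaks the tie.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # a single input with 4 features
    t = 1.0                           # target

    def forward_backward(W1, w2):
        h = np.tanh(W1 @ x)           # hidden activations (3 units)
        y = w2 @ h                    # scalar output
        dy = y - t                    # derivative of the squared-error loss
        dW1 = np.outer(dy * w2 * (1 - h**2), x)   # gradient for input-to-hidden weights
        return dW1

    # zero initialization: every hidden unit computes the same thing, so every
    # row of dW1 is identical (here, all zeros) and the units never diverge
    print(forward_backward(np.zeros((3, 4)), np.zeros(3)))

    # small random initialization breaks the symmetry: the rows now differ
    print(forward_backward(0.01 * rng.normal(size=(3, 4)),
                           0.01 * rng.normal(size=3)))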
the dynamics are essentially those described above. (The potential energy
is the height of the surface.)
We can simulate these dynamics with the following update rule, known
as gradient descent with momentum. (Momentum can be used with
either the batch version or with SGD.)
p ← µp − α∇θ En (8)
θ ← θ + p (9)
Just as with ordinary SGD, there is a learning rate α. There is also another
parameter µ, called the momentum parameter, satisfying 0 ≤ µ ≤ 1.
It determines the timescale on which momentum decays. In terms of the
physical analogy, it determines the amount of friction (with µ = 1 being
frictionless). As usual, it's useful to think about the edge cases:
• If µ = 0, the update reduces to ordinary (stochastic) gradient descent.
• If µ = 1, there is no friction, so the velocity never decays; the iterates can build up too much speed and overshoot.
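A minimal sketch of the update rule in equations (8) and (9), written as a step function that carries the velocity p across iterations (the toy gradient is only for illustration):

    import numpy as np

    def momentum_step(theta, p, grad, alpha=0.01, mu=0.9):
        # one step of gradient descent with momentum, equations (8)-(9)
        p = mu * p - alpha * grad     # friction (mu) decays the old velocity
        theta = theta + p             # move by the velocity, not the raw gradient
        return theta, p

    # usage: p starts at zero and is carried from one update to the next
    theta = np.array([1.0, -0.5])
    p = np.zeros_like(theta)
    for _ in range(100):
        grad = np.array([2.0 * theta[0], 4.0 * theta[1]])   # toy gradient from earlier
        theta, p = momentum_step(theta, p, grad)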
4.6 Fluctuations
All of the problems we’ve discussed so far occur both in batch training and
in SGD. But in SGD, we have the further problem that the gradients are
stochastic; even if they point in the right direction on average, individual
stochastic gradients are noisy and may even increase the cost function. The
effect of this noise is to push the parameters in a random direction, causing
them to fluctuate. Note the difference between oscillations and fluctua-
tions: oscillations are a systematic effect caused by the cost surface itself,
whereas fluctuations are an effect of the stochasticity in the gradients.
Fluctuations often show up as fluctuations in the cost function, and can
be seen in the training curves. One solution to fluctuations is to decrease
the learning rate; however, this can slow down the progress too much. It’s
actually fine to have fluctuations during training, since the parameters are
still moving in the right direction “on average.”
A better approach to deal with fluctuations is learning rate decay.
My favorite approach is to keep the learning rate relatively high throughout
training, but then at the very end, to decay it using an exponential schedule,
i.e.
αt = α0 e−t/τ , (10)
where α0 is the initial learning rate, t is the iteration count, τ is the decay
timescale, and t = 0 corresponds to the start of the decay.
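Here is one way such a schedule might be implemented; the values of α0 and τ, and the iteration at which the decay begins, are made-up placeholders (as emphasized below, the decay shouldn't start until late in training):

    import numpy as np

    alpha0 = 0.1          # learning rate used for most of training
    tau = 200.0           # decay timescale (a hyperparameter)
    decay_start = 5000    # iteration at which the decay begins (hypothetical)

    def learning_rate(step):
        # exponential decay schedule from equation (10), applied only near the end
        if step < decay_start:
            return alpha0
        t = step - decay_start        # t = 0 corresponds to the start of the decay
        return alpha0 * np.exp(-t / tau)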
I should emphasize that we don’t begin the decay until late in training,
when the parameters are already pretty good “on average” and we merely
have a high cost because of fluctuations. Once you start decaying α, progress slows down drastically. If you decay α too early, you may get a sudden improvement in the cost from reducing fluctuations, at the cost of failure to converge in the long term. This phenomenon is illustrated in Figure 4.

Figure 4: If you decay the learning rate too soon, you'll get a sudden drop in the loss as a result of reducing fluctuations, but the algorithm will stop making progress towards the optimum, leading to slower convergence in the long run. This is a big problem in practice, and we haven't figured out any good ways to detect if this is happening.
Another neat trick for dealing with fluctuations is iterate averaging.
Separate from the training process, we keep an exponential moving av-
erage θ̃ of the iterates, as follows:
θ̃ ← (1 − 1/τ) θ̃ + (1/τ) θ. (11)
τ is a hyperparameter called the timescale. Iterate averaging doesn’t
change the training algorithm itself at all, but when we apply or evalu-
ate the network, we use θ̃ rather than θ. In practice, iterate averaging can
give a huge performance boost by reducing the fluctuations.
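A minimal sketch of this moving average (equation (11)); the function name and the timescale are illustrative only:

    import numpy as np

    def update_average(theta_avg, theta, tau=100.0):
        # exponential moving average of the iterates, equation (11)
        if theta_avg is None:                   # initialize with the first iterate
            return theta.copy()
        return (1.0 - 1.0 / tau) * theta_avg + (1.0 / tau) * theta

    # training itself is unchanged; theta_avg is maintained on the side and is
    # what we use when evaluating or deploying the network:
    #     theta_avg = update_average(theta_avg, theta)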
Figure 5: The Rosenbrock function, a function which is commonly used as
an optimization benchmark and demonstrates badly conditioned curvature
(i.e. a ravine).
conditioned curvature: batch normalization and Adam. We won’t cover
them properly, but the original papers are very readable, in case you’re cu-
rious.2 Batch normalization normalizes the activations of each layer of a
network to have zero mean and unit variance. This can help significantly
for the reason outlined above. (It can also attenuate the problem of satu-
rated units.) Adam separately adapts the learning rate of each individual
parameter, in order to correct for differences in curvature along individual
coordinate directions.
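For concreteness, here is a rough sketch of the Adam update in the spirit of Kingma and Ba (2015), using the commonly quoted default constants; it is not meant as a faithful reimplementation of any particular library:

    import numpy as np

    def adam_step(theta, m, v, grad, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # m and v start as zero arrays of the same shape as theta; t counts updates from 1
        m = beta1 * m + (1 - beta1) * grad         # running mean of the gradient
        v = beta2 * v + (1 - beta2) * grad**2      # running mean of the squared gradient
        m_hat = m / (1 - beta1**t)                 # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step size
        return theta, m, v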
4.9 Recap
Here is a table to summarize all the pitfalls, diagnostics, and workarounds
that we’ve covered:
² D. P. Kingma and J. L. Ba, 2015. Adam: a method for stochastic optimization. ICLR.
S. Ioffe and C. Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML.