
Lecture 7: Optimization

Roger Grosse

1 Introduction
Now that we’ve seen how to compute derivatives of the cost function with
respect to model parameters, what do we do with those derivatives? In this
lecture, we’re going to take a step back and look at optimization problems
more generally. We’ve briefly discussed gradient descent and used it to train
some models, but what exactly is the gradient, and why is it a good idea to
move opposite it? We also introduce stochastic gradient descent, a way of
obtaining noisy gradient estimates from a small subset of the data.
Using modern neural network libraries, it is easy to implement the back-
prop algorithm so that it correctly computes the gradient. It’s not always
so easy to get it to work well. In this lecture, we’ll make a list of things that
can go drastically wrong in neural net training, and talk about how we can
spot them. This includes: learning rates that are too large or too small,
symmetries, dead or saturated units, and badly conditioned curvature. We
discuss tricks to ameliorate all of these problems. In general, debugging a
learning algorithm is like debugging any other complex piece of software:
if something goes wrong, you need to make hypotheses about what might
have happened, and look for evidence or design experiments to test those
hypotheses. This requires a thorough understanding of the principles of
optimization. [Margin note: Understanding the principles of neural nets and being able to diagnose failure modes are what distinguishes someone who's finished CSC421 from someone who's merely worked through the TensorFlow tutorial.]
Our style of thinking in this lecture will be very different from that in the last several lectures. When we discussed backprop, we looked at the gradient computations algebraically: we derived mathematical equations for computing all the derivatives. We also looked at the computations implementationally, seeing how to implement them efficiently (e.g. by vectorizing the computations), and designing an automatic differentiation system which separated the backprop algorithm itself from the design of a network architecture. In this lecture, we'll look at gradient descent geometrically: we'll reason qualitatively about optimization problems and about the behavior of gradient descent, without thinking about how the gradients are actually computed. I.e., we abstract away the gradient computation. One of the most important skills to develop as a computer scientist is the ability to move between different levels of abstraction, and to figure out which level is most appropriate for the problem at hand.

1.1 Learning goals


• Be able to interpret visualizations of cross-sections of an error surface.

• Know what the gradient is and how to draw it geometrically.

• Know why stochastic gradient descent can be faster than batch gradi-
ent descent, and understand the tradeoffs in choosing the mini-batch
size.

• Know what effect the learning rate has on the training process. Why
can it be advantageous to decay the learning rate over time?

• Be aware of various potential failure modes of gradient descent. How might you diagnose each one, and how would you solve the problem if it occurs?

– slow progress
– instability
– fluctuations
– dead or saturated units
– symmetries
– badly conditioned curvature

• Understand why momentum can be advantageous.

2 Visualizing gradient descent


When we train a neural network, we’re trying to minimize some cost function
E, which is a function of the network’s parameters, which we’ll denote with
the vector θ. In general, θ would contain all of the network’s weights and
biases, and perhaps a few other parameters, but for the most part, we’re not
going to think about what the elements of θ represent in this lecture. We’re
going to think about optimization problems in the abstract. In general, the
cost function will be the sum of losses over the training examples; it may
also include a regularization term (which we’ll discuss in the next lecture).
But for the most part, we’re not going to think about the particulars of the
cost function.
In order to think qualitatively about optimization problems, we’ll need
some ways to visualize them. Suppose θ consists of two weights, w1 and w2 .
One way to visualize E is to draw the cost surface, as in Figure 1(a); this is
an example of a surface plot. This particular cost function has two local
minima, or points which minimize the cost within a small neighborhood.
One of these local optima is also a global optimum, a point which achieves
the minimum cost over all values of θ. In the context of optimization, local and global minima are also referred to as local and global optima. [Margin note: A function can have multiple global minima if there are multiple points that achieve the minimum cost. Technically speaking, global optima are also local optima, but informally when we refer to "local optima," we usually mean the ones which aren't global optima.]
Surface plots can be hard to interpret, so we're only going to use them when we absolutely have to. Instead, we'll primarily rely on two other visualizations. First, suppose we have a one-dimensional optimization problem, i.e. θ consists of a single weight w. We can visualize this by plotting E as a
function of w, as in Figure 1(b). This figure also shows the gradient descent
iterates (i.e. the points the algorithm visits) starting from two different
initializations. One of these sequences converges to the global optimum,
and the other one converges to the other local optimum. In general, gra-
dient descent greedily tries to move downhill; by historical accident, it is referred to as a hill-climbing algorithm. It's important to choose a good initialization, because we'd like to converge to the global optimum, or at least a good local optimum. The set of weights which lead to a given local optimum is known as its basin of attraction.

Figure 1: (a) Cost surface for an optimization problem with two local minima, one of which is the global minimum. (b) Cartoon plot of a one-dimensional optimization problem, and the gradient descent iterates starting from two different initializations, in two different basins of attraction.

Figure 2: (a) Contour plot of a cost function. (b) A saddle point.
Figure 1(a) also shows a different feature of the cost function, known as
a plateau (plural = plateaux). This is a region where the function is flat, or
nearly flat, i.e. the derivative is zero or very close to zero. Gradient descent
can perform very badly on plateaux, because the parameters change very
slowly. In neural net training, plateaux are generally a bigger problem than local optima: while most local optima tend to be good enough in practice, plateaux can cause the training to get stuck on a very bad solution. [Margin note: Remember when we observed that the gradient of 0–1 loss is zero almost everywhere? That's an example of a plateau.]
Figure 2(a) shows a different visualization of a two-dimensional opti-
mization problem: a contour plot. Here, the axes correspond to w1 and
w2 ; this means we’re visualizing weight space (just like we did in our lecture
on linear classifiers). Each of the contours represents a level set, or set of
parameters where the cost takes a particular value. [Margin note: Check your understanding: how can you (approximately) see local optima on a contour plot? How do you tell which one is the global optimum?]
One of the most important things we can visualize on a contour plot is the gradient, or the direction of steepest ascent, of the cost function, denoted ∇θ E. This is the direction which goes directly uphill, i.e. the di-
rection which increases the cost the fastest relative to the distance moved.
We can’t determine the magnitude of the gradient from the contour plot,
but it is easy to determine its direction: the gradient is always orthogonal
(perpendicular) to the level sets. This gives an easy way to draw it on a
contour plot (e.g. see Figure 2(a)). Algebraically, the gradient is simply the
vector of partial derivatives of the cost function:

∇θ E = ∂E/∂θ = (∂E/∂θ1 , . . . , ∂E/∂θM )⊤      (1)

[Margin note: In this context, E is taken as a function of the parameters, not of the loss L. Therefore, the partial derivatives correspond to the values w̄ij , b̄i , etc., computed from backpropagation.]

The fact that the vector of partial derivatives gives the steepest ascent
direction is far from obvious; you would see the derivation in a multivariable
calculus class, but here we will take it for granted.
The gradient descent update rule (which we’ve already seen multiple
times) can be written in terms of the gradient:

θ ← θ − α∇θ E, (2)

where α is the scalar-valued learning rate. This shows directly that gra-
dient descent moves opposite the gradient, or in the direction of steepest
descent. Too large a learning rate can cause instability, whereas too small
a learning rate can cause slow progress. In general, the learning rate is one
of the most important hyperparameters of a learning algorithm, so it’s very
important to tune it, i.e. look for a good value. (Most commonly, one tries a bunch of values and picks the one which works the best.) [Margin note: Recall that hyperparameters are parameters which aren't part of the model and which aren't tuned with gradient descent.]
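To make the update rule (2) concrete, here is a minimal sketch of gradient descent in NumPy. The function grad_E is a stand-in for whatever computes ∇θ E at the current parameters (e.g. backprop or an autodiff library), and the quadratic example at the end is purely illustrative.

    import numpy as np

    def gradient_descent(grad_E, theta0, alpha=0.1, num_steps=100):
        """Repeatedly step opposite the gradient: theta <- theta - alpha * grad_E(theta)."""
        theta = np.array(theta0, dtype=float)
        for _ in range(num_steps):
            theta = theta - alpha * grad_E(theta)
        return theta

    # Illustrative example: minimize E(theta) = theta_1^2 + 10 * theta_2^2.
    grad_E = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
    print(gradient_descent(grad_E, [3.0, 2.0], alpha=0.05))   # close to [0, 0]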
For completeness, it's worth mentioning one more possible feature of a
cost function, namely a saddle point, shown in Figure 2(b). This is a point
where the gradient is zero, but which isn’t a local minimum because the cost
increases in some directions and decreases in others. If we’re exactly on a
saddle point, gradient descent won’t go anywhere because the gradient is
zero.

3 Stochastic gradient descent


In machine learning, our cost function generally consists of the average
of costs for individual training examples. By linearity of derivatives, the
gradient is the average of the gradients for individual examples:
E = (1/N) ∑n En                           (3)
  = (1/N) ∑n L(y^(n) , ŷ^(n) )            (4)

∇θ E = ∇θ (1/N) ∑n En                     (5)
     = (1/N) ∑n ∇θ En                     (6)

where ∑n denotes the sum over the N training examples.

If we use this formula directly, we must visit every training example to com-
pute the gradient. This is known as batch training, since we’re treating
the entire training set as a batch. But this can be very time-consuming, and
it’s also unnecessary: we can get a stochastic estimate of the gradient from
a single training example. In stochastic gradient descent (SGD), we
pick a training example, and update the parameters opposite the gradient
for that example:

θ ← θ − α∇θ En .      (7)

[Margin note: This is identical to the gradient descent update rule, except that E is replaced with En.]
SGD is able to make a lot of progress even before the whole training set has
been visited. A lot of datasets are so large that it can take hours or longer
to make a single pass over the training set; in such cases, batch training is
impractical, and we need to use a stochastic algorithm.
In practice, we don’t compute the gradient on a single example, but
rather average it over a batch of B training examples known as a mini-
batch. Typical mini-batch sizes are on the order of 100. Why mini-batches?
Observe that the number of operations required to compute the gradient for
a mini-batch is linear in the size of the mini-batch (since mathematically, the
gradient for each training example is a separate computation). Therefore, if
all operations were equally expensive, one would always prefer to use B = 1.
In practice, there are two important reasons to use B > 1:
• Operations on mini-batches can be vectorized by writing them in terms of matrix operations. This reduces the interpreter overhead, and makes use of efficient and carefully tuned linear algebra libraries. [Margin note: In previous lectures, we already derived vectorized forms of batch gradient descent. The same formulas can be applied in mini-batch mode.]

• Most large neural networks are trained on GPUs or some other architecture which enables a high degree of parallelism. There is much more parallelism to exploit when B is large, since the gradients can be computed independently for each training example.
On the flip side, we don’t want to make B too large, because then it takes
too long to compute the gradients. In the extreme case where B = N , we
get batch gradient descent. (The activations for large mini-batches may also
be too large to store in memory.)
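As a concrete illustration, here is a minimal sketch of mini-batch SGD in NumPy. The function grad_loss, which returns the gradient averaged over a mini-batch, and the linear-regression example at the bottom are both illustrative stand-ins.

    import numpy as np

    def sgd(grad_loss, theta0, X, y, alpha=0.1, batch_size=100, num_epochs=10):
        """Mini-batch SGD: each update uses the average gradient over B examples."""
        theta = np.array(theta0, dtype=float)
        N = X.shape[0]
        for _ in range(num_epochs):
            perm = np.random.permutation(N)            # visit the data in random order
            for start in range(0, N, batch_size):
                idx = perm[start:start + batch_size]
                theta = theta - alpha * grad_loss(theta, X[idx], y[idx])
        return theta

    # Illustrative example: linear regression with squared error.
    X = np.random.randn(1000, 3)
    y = X @ np.array([1.0, -2.0, 0.5])
    grad_loss = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
    print(sgd(grad_loss, np.zeros(3), X, y))           # close to [1, -2, 0.5]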

4 Problems, diagnostics, and workarounds


Now we get to the most important part of this lecture: debugging gradient
descent training. When you first learned to program, whenever something
didn’t work, you might have looked through your code line by line to try
and spot the mistake. This might have worked for 10-line programs, but
it probably became unworkable for more complex programs. Line-by-line
inspection doesn’t work very well in machine learning either — not just
because the programs are complicated, but also because most of the prob-
lems we’re going to talk about can occur even for a correctly implemented
training algorithm. E.g., if the problem is that you set the learning rate too
small, you’re not going to be able to deduce this by looking at your code,
since you don’t know ahead of time what the right learning rate is.
Let’s make a list of various things that can go wrong, and how to diag-
nose and fix them.

4.1 Incorrect gradient computations
If your computed gradients are wrong, then all bets are off. If you’re lucky,
the training will fail completely, and you’ll notice that something is wrong.
If you’re unlucky, it will sort of work, but it will also somehow be broken.
This is much more common than you might expect: it’s not unusual for
an incorrectly implemented learning algorithm to perform reasonably well.
But it will perform a bit worse than it should; furthermore, it will be harder to tune, since some of the diagnostics might give misleading results
if the gradients are wrong. Therefore, it’s completely useless to do anything
else until you’re sure the gradients are correct.
Fortunately, it’s possible to be confident in the correctness of the gra-
dients. We’ve already covered finite difference methods, which are pretty
reliable (see the lecture “Training a Classifier”). If you’re using one of
the major neural net frameworks, you’re pretty safe, because the gradients
are being computed automatically by a system which has been thoroughly
tested. For the rest of this discussion, we’ll assume the gradient computa-
tion is correctly implemented.
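For reference, here is a minimal sketch of a finite-difference check: perturb each parameter by a small ε and compare the centered difference to the analytic gradient. The cost function and gradient below are illustrative.

    import numpy as np

    def check_gradient(E, grad_E, theta, eps=1e-6, tol=1e-4):
        """Compare the analytic gradient against centered finite differences."""
        analytic = grad_E(theta)
        numeric = np.zeros_like(theta)
        for i in range(len(theta)):
            step = np.zeros_like(theta)
            step[i] = eps
            numeric[i] = (E(theta + step) - E(theta - step)) / (2 * eps)
        return np.max(np.abs(analytic - numeric)) < tol

    # Illustrative example: E(theta) = sum(theta^2) has gradient 2 * theta.
    E = lambda th: np.sum(th ** 2)
    grad_E = lambda th: 2 * th
    print(check_gradient(E, grad_E, np.array([1.0, -2.0, 3.0])))   # True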

4.2 Local optima


We’re trying to minimize the cost function, and one of the ways we can fail
to do this is if we get stuck in a local optimum. Actually, that formulation
isn’t quite precise, since we rarely converge exactly to any optimum (local
or global) when training neural nets. A more precise statement would be,
we might wind up in a bad basin of attraction, and therefore not achieve as
low a cost as we would be able to in the best basin of attraction.
In general, it’s very hard to diagnose if you’re in a bad basin of attrac-
tion. In many areas of machine learning, one tries to ameliorate the issue
using random restarts: initialize the training from several random loca-
tions, run the training procedure from each one, and pick whichever result
has the lowest cost. This is sometimes done in neural net training, but
more often we just ignore the problem. In practice, the local optima are
usually fine, so we think about training in terms of converging faster to a
local optimum, rather than finding the global optimum.

4.3 Symmetries
Suppose we initialize all the weights and biases of a neural network to zero.
All the hidden activations will be identical, and you can check by inspection
(see the lecture on backprop) that all the weights feeding into a given hid-
den unit will have identical derivatives. Therefore, these weights will have
identical values in the next step, and so on. With nothing to distinguish
different hidden units, no learning will occur. This phenomenon is perhaps
the most important example of a saddle point in neural net training.
Fortunately, the problem is easy to deal with, using any sort of sym-
metry breaking. Once two hidden units compute slightly different things,
they will probably get a gradient signal driving them even farther apart.
(Think of this in terms of the saddle point picture; if you’re exactly on
the saddle point, you get zero gradient, but if you’re slightly to one side,

you'll move away from it, which gives you a larger gradient, and so on.) In practice, we typically initialize all the weights randomly.

Figure 3: (a) Slow progress due to a small learning rate. (b) Instability due to a large learning rate. (c) Oscillations due to a large learning rate.
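The zero-initialization symmetry described above is easy to see numerically: with a zero initialization, every hidden unit computes exactly the same activation (and, by the argument above, receives the same gradient), whereas a small random initialization breaks the tie. A minimal sketch, with an illustrative layer size:

    import numpy as np

    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    x = np.random.randn(4)                        # one (illustrative) input vector

    # Zero initialization: all hidden units compute the same thing, so their
    # gradients are identical and they can never become different.
    W_zero, b_zero = np.zeros((3, 4)), np.zeros(3)
    print(sigmoid(W_zero @ x + b_zero))           # [0.5 0.5 0.5]

    # Small random initialization breaks the symmetry.
    W_rand, b_rand = 0.01 * np.random.randn(3, 4), np.zeros(3)
    print(sigmoid(W_rand @ x + b_rand))           # three slightly different values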

4.4 Slow progress


If the learning rate is too small, gradient descent makes very slow progress,
as shown in Figure 3(a). When you plot the training curve, this may show
up as a cost which decreases very slowly, but at an approximately linear
rate. If you see this happening, then try increasing the learning rate.

4.5 Instability and oscillations


Conversely, if the learning rate is too large, the gradient descent step will
overshoot. In some cases, it will overshoot so much that the gradient gets
larger, a situation known as instability. If this repeats itself, the parameter
values and gradient can quickly blow up; this is visualized in Figure 3(b).
In the training curve, the cost may appear to suddenly shoot up to infinity.
If this is happening, you should decrease the learning rate.
If the learning rate is too large, yet not enough to cause instability, you
might get oscillations, as shown in Figure 3(c). While the phenomenon
might seem easy to spot based on this picture, it’s actually pretty hard in
practice — keep in mind that weight space is very high-dimensional, and it
might not be obvious in which direction to look for oscillations. Also note
that oscillations in weight space don’t necessarily lead to oscillations in the
training curve.
Since we can’t detect oscillations, we simply try to tune the learning rate,
finding the best value we can. Typically, we do this using a grid search over
values spaced approximately by factors of 3, i.e. {0.3, 0.1, 0.03, . . . , 0.0001}.
The learning rate is one of the most important parameters, and one of the
hardest to choose a good value for a priori, so it is usually worth tuning it
carefully.
As it happens, there’s one more idea which can dampen oscillations
while also speeding up training: momentum. The physical intuition is as
follows: the parameter vector θ is treated as a particle which is moving
through a field whose potential energy function is the cost E. The gradient
does not determine the velocity of the particle (as it would in SGD), but
rather the acceleration. As a rough intuition, imagine you’ve built a surface
in 3-D corresponding to a 2-D cost function, and you start a frictionless ball
rolling from somewhere on that surface. If the surface is sufficiently flat,

the dynamics are essentially those described above. (The potential energy
is the height of the surface.)
We can simulate these dynamics with the following update rule, known
as gradient descent with momentum. (Momentum can be used with
either the batch version or with SGD.)

p ← µp − α∇θ En      (8)
θ ← θ + p            (9)

Just as with ordinary SGD, there is a learning rate α. There is also another
parameter µ, called the momentum parameter, satisfying 0 ≤ µ ≤ 1.
It determines the timescale on which momentum decays. In terms of the
physical analogy, it determines the amount of friction (with µ = 1 being
frictionless). As usual, it’s useful to think about the edge cases:

• µ = 0 yields standard gradient descent.

• µ = 1 is frictionless, so momentum never decays. This is problematic because of conservation of energy. We would like to minimize the cost function, but whenever the particle gets near the optimum, it has low potential energy, and hence high kinetic energy, so it doesn't stay there very long. We need µ < 1 in order for the energy to decay.

In practice, µ = 0.9 is a reasonable value. Momentum sometimes helps a lot, and it hardly ever hurts, so using momentum is standard practice.
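A minimal sketch of the momentum update, equations (8) and (9); here grad is a stand-in for whatever returns the (mini-batch or full) gradient at the current parameters, and the quadratic example is illustrative.

    import numpy as np

    def sgd_momentum(grad, theta0, alpha=0.01, mu=0.9, num_steps=1000):
        """Momentum update: p <- mu*p - alpha*grad(theta); theta <- theta + p."""
        theta = np.array(theta0, dtype=float)
        p = np.zeros_like(theta)                  # the "velocity"
        for _ in range(num_steps):
            p = mu * p - alpha * grad(theta)
            theta = theta + p
        return theta

    # Illustrative example: badly conditioned quadratic, E = 0.5*(theta_1^2 + 100*theta_2^2).
    grad = lambda th: np.array([th[0], 100.0 * th[1]])
    print(sgd_momentum(grad, [1.0, 1.0]))         # close to [0, 0]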

4.6 Fluctuations
All of the problems we’ve discussed so far occur both in batch training and
in SGD. But in SGD, we have the further problem that the gradients are
stochastic; even if they point in the right direction on average, individual
stochastic gradients are noisy and may even increase the cost function. The
effect of this noise is to push the parameters in a random direction, causing
them to fluctuate. Note the difference between oscillations and fluctua-
tions: oscillations are a systematic effect caused by the cost surface itself,
whereas fluctuations are an effect of the stochasticity in the gradients.
Fluctuations often show up as fluctuations in the cost function, and can
be seen in the training curves. One solution to fluctuations is to decrease
the learning rate; however, this can slow down the progress too much. It’s
actually fine to have fluctuations during training, since the parameters are
still moving in the right direction “on average.”
A better approach to deal with fluctuations is learning rate decay.
My favorite approach is to keep the learning rate relatively high throughout
training, but then at the very end, to decay it using an exponential schedule,
i.e.
αt = α0 e^(−t/τ) ,      (10)
where α0 is the initial learning rate, t is the iteration count, τ is the decay
timescale, and t = 0 corresponds to the start of the decay.
I should emphasize that we don’t begin the decay until late in training,
when the parameters are already pretty good “on average” and we merely
have a high cost because of fluctuations. Once you start decaying α, progress

slows down drastically. If you decay α too early, you may get a sudden improvement in the cost from reducing fluctuations, at the cost of failure to converge in the long term. This phenomenon is illustrated in Figure 4.

Figure 4: If you decay the learning rate too soon, you'll get a sudden drop in the loss as a result of reducing fluctuations, but the algorithm will stop making progress towards the optimum, leading to slower convergence in the long run. This is a big problem in practice, and we haven't figured out any good ways to detect if this is happening.
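A minimal sketch of the decay schedule in equation (10): hold α at α0 until some step, then decay it exponentially with timescale τ. The name decay_start and the particular numbers are illustrative choices, not part of the lecture.

    import numpy as np

    def learning_rate(t, alpha0=0.1, decay_start=8000, tau=500):
        """Constant learning rate, then exponential decay starting at decay_start."""
        if t < decay_start:
            return alpha0
        return alpha0 * np.exp(-(t - decay_start) / tau)

    # alpha stays at 0.1 until step 8000, then shrinks by a factor of e every tau steps.
    print(learning_rate(0), learning_rate(8000), learning_rate(9000))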
Another neat trick for dealing with fluctuations is iterate averaging.
Separate from the training process, we keep an exponential moving av-
erage θ̃ of the iterates, as follows:

θ̃ ← (1 − 1/τ ) θ̃ + (1/τ ) θ.      (11)
τ is a hyperparameter called the timescale. Iterate averaging doesn’t
change the training algorithm itself at all, but when we apply or evalu-
ate the network, we use θ̃ rather than θ. In practice, iterate averaging can
give a huge performance boost by reducing the fluctuations.
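A minimal sketch of iterate averaging, equation (11): keep an exponential moving average of the parameters alongside training and use the average at evaluation time. The noisy iterates below are simulated for illustration.

    import numpy as np

    def update_average(theta_avg, theta, tau=100.0):
        """Exponential moving average of the iterates with timescale tau."""
        return (1 - 1 / tau) * theta_avg + (1 / tau) * theta

    # Simulated SGD iterates fluctuating around [1, -2].
    theta_avg = np.zeros(2)
    for t in range(1000):
        theta = np.array([1.0, -2.0]) + 0.1 * np.random.randn(2)
        theta_avg = update_average(theta_avg, theta)
    print(theta_avg)        # close to [1, -2], with much smaller fluctuations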

4.7 Dead and saturated units


Another tricky problem is that of saturated units, i.e. units whose ac-
tivations are nearly always near the ends of their dynamic range (i.e. the
range of possible values). An important special case is that of dead units,
units whose activations are always very close to zero. To understand why
saturated units are problematic, we need to revisit one of the equations we
derived for backprop. Suppose hi = φ(zi ), where φ is a sigmoidal nonlin-
earity (such as the logistic function). Then:
z̄i = h̄i dhi /dzi = h̄i φ′(zi ).      (12)

If hi is near the edge of its dynamic range, then φ′(zi ) is very small. (Think about why this is the case.) Therefore, z̄i is also very small, and no gradient signal will pass through this node in the computation graph. In particular, all the weights that feed into zi will get no gradient signal:

w̄ij = z̄i xj ≈ 0      (13)
b̄i = z̄i ≈ 0.         (14)
If the incoming weights and bias don’t change, then this unit can stay
saturated for a long time. In terms of our visualizations, this situation
corresponds to a plateau.

Figure 5: The Rosenbrock function, a function which is commonly used as
an optimization benchmark and demonstrates badly conditioned curvature
(i.e. a ravine).

Diagnosing saturated units is simple: just look at a histogram of the average activations, and make sure they're not concentrated at the endpoints.
Preventing saturated units is pretty hard, but there are some tricks that
help. One trick is to carefully choose the scale of the random initialization
of the weights so that the activations are in the middle of their dynamic
range. One such trick is the "Xavier initialization", named after one of its inventors.[1]
Another way to avoid saturation is to use an activation function which
doesn’t saturate. Linear activation functions would fit the bill, but unfor-
tunately we saw that deep linear networks aren’t any more powerful than
shallow ones. Instead, consider rectified linear units (ReLUs), which
have the activation function

φ(z) = { z   if z ≥ 0
       { 0   if z < 0.      (15)

ReLU units don't saturate for positive z, which is convenient. Unfortunately, they can die if z is consistently negative, so it helps to initialize the biases to a small positive value (such as 0.1).
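A minimal sketch of the diagnostic and the two tricks together: scale the initial weights so the pre-activations stay in a moderate range (a Xavier-style 1/sqrt(n_in) scale), give ReLU units a small positive bias, and look at the distribution of average activations to spot dead or saturated units. The layer sizes and inputs are illustrative.

    import numpy as np

    relu = lambda z: np.maximum(z, 0)
    n_in, n_hid = 100, 50
    X = np.random.randn(1000, n_in)                    # a batch of (illustrative) inputs

    # Xavier-style scale keeps pre-activations on the order of 1;
    # a small positive bias makes ReLU units less likely to start out dead.
    W = np.random.randn(n_in, n_hid) / np.sqrt(n_in)
    b = 0.1 * np.ones(n_hid)

    h = relu(X @ W + b)
    mean_act = h.mean(axis=0)                          # average activation of each unit

    # Diagnostic: units whose average activation is ~0 are dead (no gradient signal).
    print("fraction of dead units:", np.mean(mean_act < 1e-3))
    print(np.histogram(mean_act, bins=10)[0])          # a rough activation histogram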

4.8 Badly conditioned curvature


All of the problems we’ve discussed so far are fairly specific things that can
be attenuated using simple tricks. But there’s one more problem that’s
fundamentally very hard to deal with: badly conditioned curvature. Let’s
unpack this. Intuitively, curvature refers to how fast the function curves
upwards when you move in a given direction. In directions of high curvature,
you want to take a small step, because you can overshoot very quickly. In directions of low curvature, you want to take a large step, because there's a long distance you need to travel. [Margin note: The curvature and its conditioning are formalized in terms of the eigenvalues of the matrix of second derivatives of E, but we won't go into that here.] But what actually happens in gradient descent is precisely the opposite: it likes to take large steps in high curvature directions and small steps in low curvature directions. If the curvature is very different in different directions, we say the curvature is badly conditioned. An example is shown in Figure 5. A region with badly conditioned curvature is sometimes called a ravine, because of what it looks like in a surface plot of the cost function. Think about the effect this has on optimization. You need to set α small enough that you don't get oscillations or instability in the high curvature directions. But if α is small, then progress will be very slow in the low curvature directions.

[1] X. Glorot and Y. Bengio, 2010. Understanding the difficulty of training deep feedforward neural networks. AISTATS.

Figure 6: Unnormalized data can lead to badly conditioned curvature. (a) The two inputs have vastly different scales. Changing w1 has a much bigger effect on the model's predictions than changing w2, so the cost function curves more rapidly along that dimension. (b) The two inputs are offset by about the same amount. Changing the weights in a direction that preserves w1 + w2 has little effect on the predictions, while changing w1 + w2 has a much larger effect.
In practice, neural network training is very badly conditioned. This is
likely a big part of why modern neural nets can take weeks to train. Much
effort has been spent researching second-order optimization methods, alter-
natives to SGD which attempt to correct for the curvature. Unfortunately,
these methods are complicated and pretty hard to tune (in the context of
neural nets), so SGD is still the go-to algorithm, and we just have to live
with badly conditioned curvature.
However, we can at least try to eliminate particular egregious instances
of badly conditioned curvature. One way in which badly conditioned cur-
vature can arise is if the inputs have very different scales or are off-center.
See Figure 6 for examples of this effect in linear regression problems. Such
examples could arise because inputs represent arbitrary units, such as feet
or years. This framing almost immediately suggests a workaround: nor-
malize the inputs so that they have zero mean and unit variance. I.e., take

x̃j = (xj − µj ) / σj ,      (16)

where µj = E[xj ] and σj² = Var(xj ). [Margin note: It is perhaps less intuitive why having the means far from zero causes badly conditioned curvature, but rest assured this is an important effect, and worth combating.]
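A minimal sketch of this normalization applied column-wise to a design matrix; the two mismatched input scales are illustrative.

    import numpy as np

    def normalize_inputs(X, eps=1e-8):
        """Shift each input dimension to zero mean and scale it to unit variance."""
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        return (X - mu) / (sigma + eps)                # eps guards against constant columns

    # Illustrative example: one input with a large offset and scale, one small.
    X = np.column_stack([5000 + 10 * np.random.randn(100),
                         0.5 * np.random.randn(100)])
    X_tilde = normalize_inputs(X)
    print(X_tilde.mean(axis=0), X_tilde.std(axis=0))   # approx [0, 0] and [1, 1]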
It's worth mentioning two very popular algorithms which help with badly conditioned curvature: batch normalization and Adam. We won't cover them properly, but the original papers are very readable, in case you're curious.[2] Batch normalization normalizes the activations of each layer of a
network to have zero mean and unit variance. This can help significantly
for the reason outlined above. (It can also attenuate the problem of satu-
rated units.) Adam separately adapts the learning rate of each individual
parameter, in order to correct for differences in curvature along individual
coordinate directions.

4.9 Recap
Here is a table to summarize all the pitfalls, diagnostics, and workarounds
that we’ve covered:

Problem                       | Diagnostics                    | Workarounds
incorrect gradients           | finite differences             | fix them, or use an autodiff package
local optima                  | (hard)                         | random restarts
symmetries                    | visualize W                    | initialize W randomly
slow progress                 | slow, linear training curve    | increase α
instability                   | cost increases                 | decrease α
oscillations                  | fluctuations in training curve | decrease α; momentum
fluctuations                  | fluctuations in training curve | decay α; iterate averaging
dead/saturated units          | activation histograms          | initial scale of W; ReLU
badly conditioned curvature   | (hard)                         | normalization; momentum; Adam; second-order opt.

[2] D. P. Kingma and J. L. Ba, 2015. Adam: a method for stochastic optimization. ICLR.
S. Ioffe and C. Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML.
