Neural Networks for Machine Learning
Lecture 6a
Overview of mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Reminder: The error surface for a linear neuron
• The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
  – For a linear neuron with a squared error, it is a quadratic bowl.
  – Vertical cross-sections are parabolas.
  – Horizontal cross-sections are ellipses.
• For multi-layer, non-linear nets the error surface is much more complicated.
  – But locally, a piece of a quadratic bowl is usually a very good approximation.
[Figure: a quadratic bowl with a vertical error axis E and horizontal weight axes w1 and w2]
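To see why the surface is a bowl, write out the squared error of a single linear neuron over the training cases (this is the standard expression for the error the slide refers to):

$$E(\mathbf{w}) = \tfrac{1}{2}\sum_{n}\bigl(t_n - \mathbf{w}^{\top}\mathbf{x}_n\bigr)^{2}
= \text{const} \;-\; \Bigl(\sum_n t_n\mathbf{x}_n\Bigr)^{\!\top}\mathbf{w} \;+\; \tfrac{1}{2}\,\mathbf{w}^{\top}\Bigl(\sum_{n}\mathbf{x}_n\mathbf{x}_n^{\top}\Bigr)\mathbf{w}$$

E is quadratic in the weights, so fixing all weights but one gives a parabola, and slicing at a fixed error gives an ellipse whose shape is set by $\sum_n \mathbf{x}_n\mathbf{x}_n^{\top}$.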
Convergence speed of full batch learning when the error surface is a quadratic bowl
• Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
  – The gradient is big in the direction in which we only want to travel a small distance.
  – The gradient is small in the direction in which we want to travel a large distance.
• Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.
How the learning goes wrong
• If the learning rate is big, the weights slosh to and fro across the ravine.
  – If the learning rate is too big, this oscillation diverges.
• What we would like to achieve:
  – Move quickly in directions with small but consistent gradients.
  – Move slowly in directions with big but inconsistent gradients.
[Figure: an elongated ravine in the error surface (axes E and w), with the weight trajectory oscillating across it]
Stochastic gradient descent
• If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
  – So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
  – The extreme version of this approach updates the weights after each case. It's called "online" learning.
• Mini-batches are usually better than online.
  – Less computation is used updating the weights.
  – Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs (see the sketch below).
• Mini-batches need to be balanced for classes.
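A rough numpy illustration of the matrix-matrix point above: the activities of a layer for a whole mini-batch can be computed with one multiply instead of a loop over cases. The layer and batch sizes here are made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 100, 784, 500        # made-up sizes for illustration
X = rng.standard_normal((batch_size, n_in))    # one mini-batch of input cases
W = rng.standard_normal((n_in, n_out)) * 0.01  # weight matrix
b = np.zeros(n_out)

# One matrix-matrix multiply gives the activations for all cases at once.
# On a GPU this is far more efficient than looping over the cases one by one.
A = X @ W + b

# The equivalent per-case ("online") computation:
A_online = np.stack([x @ W + b for x in X])
assert np.allclose(A, A_online)
```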
Two types of learning algorithm
• If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
  – The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
  – Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of adaptation.
• For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
  – The mini-batches may need to be quite big when adapting fancy methods.
  – Big mini-batches are more computationally efficient.
A basic mini-batch gradient descent algorithm
• Guess an initial learning rate.
  – If the error keeps getting worse or oscillates wildly, reduce the learning rate.
  – If the error is falling fairly consistently but slowly, increase the learning rate.
• Write a simple program to automate this way of adjusting the learning rate (a sketch follows below).
• Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
  – This removes fluctuations in the final weights caused by the variations between mini-batches.
• Turn down the learning rate when the error stops decreasing.
  – Use the error on a separate validation set.
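A minimal sketch of the kind of simple program the slide has in mind, applied to a single global learning rate; the thresholds and adjustment factors are arbitrary illustrative choices, not values from the lecture.

```python
def adjust_learning_rate(lr, errors):
    """Crude heuristic for a global learning rate: shrink it when the error is
    getting worse or oscillating wildly, grow it when the error is falling
    consistently but slowly. Thresholds and factors are arbitrary choices."""
    if len(errors) < 3:
        return lr
    e0, e1, e2 = errors[-3:]
    getting_worse = e2 > e1
    oscillating = (e1 - e0) * (e2 - e1) < 0 and abs(e2 - e1) > 0.1 * abs(e1)
    slow_fall = e0 > e1 > e2 and (e0 - e2) < 1e-3 * abs(e0)
    if getting_worse or oscillating:
        return lr * 0.5
    if slow_fall:
        return lr * 1.1
    return lr
```

Towards the end of training one would simply turn the rate down (e.g. multiply it by a fixed factor) once the error on a separate validation set stops decreasing.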
Neural Networks for Machine Learning
Lecture 6b
A bag of tricks for mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Be careful about turning down the learning rate
• Turning down the learning rate reduces the random fluctuations in the error caused by the different gradients on different mini-batches.
  – So we get a quick win.
  – But then we get slower learning.
• Don't turn down the learning rate too soon!
[Figure: training error vs. epoch; the curve drops sharply at the point marked "reduce learning rate", then flattens out]
Initializing the weights
• If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
  – So they can never learn to be different features.
  – We break symmetry by initializing the weights to have small random values.
• If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
  – We generally want smaller incoming weights when the fan-in is big, so scale the initial weights by 1/sqrt(fan-in), as in the sketch below.
• We can also scale the learning rate the same way.
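A sketch of this initialization: small random weights whose scale shrinks with the square root of the fan-in. The base scale of 1.0 is an arbitrary choice.

```python
import numpy as np

def init_weights(fan_in, fan_out, scale=1.0, seed=0):
    """Break symmetry with small random values, scaled by 1/sqrt(fan_in) so
    that units with a big fan-in get proportionally smaller incoming weights."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((fan_in, fan_out)) * scale / np.sqrt(fan_in)
```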
Shifting the inputs
• When using steepest descent, shifting the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
• The hyperbolic tangent (which is 2*logistic - 1) produces hidden activations that are roughly zero mean.
  – In this respect it's better than the logistic.
[Figure: a linear neuron with weights w1, w2; colour indicates the training case. The cases (101, 101) → 2 and (101, 99) → 0 give a very elongated elliptical error surface, while the shifted cases (1, 1) → 2 and (1, -1) → 0 give a circular one.]
Scaling the inputs
• When using steepest descent, scaling the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has unit variance over the whole training set (see the sketch below).
[Figure: a linear neuron with weights w1, w2; colour indicates the weight axis. The cases (0.1, 10) → 2 and (0.1, -10) → 0 give a very elongated elliptical error surface, while the rescaled cases (1, 1) → 2 and (1, -1) → 0 give a circular one.]
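A combined sketch of the shifting and scaling recipes from the last two slides: give each input component zero mean and unit variance, using statistics computed on the training set only.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Shift each input component to zero mean and scale it to unit variance;
    the mean and standard deviation come from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps   # eps guards against constant components
    return (X_train - mean) / std, (X_test - mean) / std
```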
A more thorough method: Decorrelate the input components
• For a linear neuron, we get a big win by decorrelating each component of the
input from the other input components.
• There are several different ways to decorrelate inputs. A reasonable method is
to use Principal Components Analysis.
– Drop the principal components with the smallest eigenvalues.
• This achieves some dimensionality reduction.
– Divide the remaining principal components by the square roots of their
eigenvalues. For a linear neuron, this converts an axis-aligned elliptical
error surface into a circular one.
• For a circular error surface, the gradient points straight towards the minimum.
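A numpy sketch of this recipe: rotate onto the principal components, drop those with the smallest eigenvalues, and divide each remaining component by the square root of its eigenvalue. How many components to keep is left to the user.

```python
import numpy as np

def pca_whiten(X, n_components, eps=1e-8):
    """Decorrelate the input components with PCA, keep the leading components,
    and rescale each one by 1/sqrt(eigenvalue)."""
    Xc = X - X.mean(axis=0)                      # zero-mean the data first
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]
    projected = Xc @ eigvecs[:, keep]            # decorrelated coordinates
    return projected / np.sqrt(eigvals[keep] + eps)
```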
Common problems that occur in multilayer networks
• If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
  – The error derivatives for the hidden units will all become tiny and the error will not decrease.
  – This is usually a plateau, but people often mistake it for a local minimum.
• In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of the time it should be a 1.
  – The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
  – This is another plateau that looks like a local minimum.
Four ways to speed up mini-batch learning
• Use momentum (Lecture 6c).
• Use a separate, adaptive learning rate for each connection (Lecture 6d).
• Use rmsprop: divide the gradient by a running average of its recent magnitude (Lecture 6e).
• Take a fancy full-batch method from the optimization literature (e.g. conjugate gradient, LBFGS) and adapt it to work for neural nets and mini-batches.
Neural Networks for Machine Learning
Lecture 6c
The momentum method
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
The intuition behind the momentum method
[Figure: brown vector = jump, red vector = correction, green vector = accumulated gradient]
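The momentum equations themselves do not survive in this transcript, so here is a hedged sketch of both the standard update and the jump-then-correct variant that the figure caption refers to (the same jump/correction split discussed later under Nesterov momentum). The learning rate and momentum coefficient are placeholder values.

```python
def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Standard momentum: the gradient changes the velocity,
    and the velocity changes the weights."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    """Jump-then-correct momentum: first jump in the direction of the
    accumulated gradient (velocity), then measure the gradient at the point
    you jumped to and apply a correction."""
    w_jump = w + mu * v                # brown vector: the jump
    correction = -lr * grad(w_jump)    # red vector: the correction
    v = mu * v + correction            # green vector: accumulated gradient
    return w + v, v
```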
Neural Networks for Machine Learning
Lecture 6d
A separate, adaptive learning rate for each connection
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
The intuition behind separate adaptive learning rates
• In a multilayer net, the appropriate learning rates can vary widely between weights:
  – The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small. Gradients can get very small in the early layers of very deep nets.
  – The fan-in of a unit determines the size of the "overshoot" effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error. The fan-in often varies widely between layers.
• So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
One way to determine the individual learning rates
• Start with a local gain $g_{ij}$ of 1 for every weight and use it to scale the update:
  $$\Delta w_{ij} = -\,\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$
• Increase the local gain if the gradient for that weight does not change sign:
  $$\text{if } \left(\frac{\partial E}{\partial w_{ij}}(t)\,\frac{\partial E}{\partial w_{ij}}(t-1)\right) > 0
  \quad\text{then } g_{ij}(t) = g_{ij}(t-1) + 0.05
  \quad\text{else } g_{ij}(t) = g_{ij}(t-1) \times 0.95$$
• Use small additive increases and multiplicative decreases (for mini-batch learning; see the sketch below).
  – This ensures that big gains decay rapidly when oscillations start.
  – If the gradient is totally random, the gain will hover around 1: we increase by $+\delta$ half the time and decrease by a factor of $1-\delta$ half the time.
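A minimal numpy sketch of this gain scheme, using the additive increase of 0.05 and the multiplicative decrease factor of 0.95 from the slide; everything is elementwise over arrays of weights.

```python
import numpy as np

def adaptive_gain_step(w, gains, grad, prev_grad, lr=0.01,
                       delta=0.05, decay=0.95):
    """Per-weight gains: additive increase when the gradient keeps its sign,
    multiplicative decrease when it flips (elementwise on numpy arrays)."""
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + delta, gains * decay)
    w = w - lr * gains * grad        # delta_w = -epsilon * g_ij * dE/dw_ij
    return w, gains
```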
Tricks for making adaptive learning rates work better
• Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].
• Use full batch learning or big mini-batches.
  – This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
• Adaptive learning rates can be combined with momentum.
  – Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
• Adaptive learning rates only deal with axis-aligned effects.
  – Momentum does not care about the alignment of the axes.
Neural Networks for Machine Learning
Lecture 6e
rmsprop: Divide the gradient by a running average of its recent magnitude
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
rprop: Using only the sign of the gradient
• The magnitude of the gradient can be very different for different weights and can change during learning.
  – This makes it hard to choose a single global learning rate.
• For full batch learning, we can deal with this variation by only using the sign of the gradient.
  – The weight updates are all of the same magnitude.
  – This escapes from plateaus with tiny gradients quickly.
• rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight (see the sketch below).
  – Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
  – Otherwise decrease the step size multiplicatively (e.g. times 0.5).
  – Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice).
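A sketch of full-batch rprop as described above, using the multiplicative factors 1.2 and 0.5 and the step-size limits (between one millionth and 50) mentioned in the slide.

```python
import numpy as np

def rprop_step(w, step, prev_grad, grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """Full-batch rprop: adapt a separate step size for each weight from the
    agreement in sign of its last two gradients, then move each weight by its
    step size in the direction opposite to the current gradient's sign."""
    same_sign = grad * prev_grad > 0
    step = np.clip(np.where(same_sign, step * up, step * down),
                   step_min, step_max)
    w = w - np.sign(grad) * step
    return w, step
```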
Why rprop does not work with mini-batches
• The idea behind stochastic gradient descent is that when the learning rate is small, it effectively averages the gradients over successive mini-batches.
  – Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
  – We want this weight to stay roughly where it is.
• rprop would step the weight nine times in one direction and once in the other, each time by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
  – So the weight would move a long way (see the arithmetic below).
• Is there a way to combine:
  – the robustness of rprop,
  – the efficiency of mini-batches, and
  – the effective averaging of gradients over mini-batches?
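Writing out the arithmetic behind this example (nine gradients of +0.1 and one of -0.9):

$$\text{average gradient} \;=\; \tfrac{1}{10}\bigl(9 \times 0.1 + (-0.9)\bigr) \;=\; 0,$$

so small-learning-rate SGD leaves the weight essentially where it is, whereas rprop takes ten steps of roughly equal size $\Delta$, nine in one direction and one in the other, giving a net drift of about $9\Delta - \Delta = 8\Delta$.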
rmsprop: A mini-batch version of rprop
• rprop is equivalent to using the gradient, but also dividing by the size of the gradient.
  – The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
• rmsprop: Keep a moving average of the squared gradient for each weight:
  $$\text{MeanSquare}(w, t) = 0.9\,\text{MeanSquare}(w, t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$
• Dividing the gradient by $\sqrt{\text{MeanSquare}(w, t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).
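A minimal numpy sketch of this update, using the 0.9/0.1 averaging constants from the slide; the small eps added to the divisor is a common numerical safeguard rather than something from the lecture.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, eps=1e-8):
    """rmsprop: keep a moving average of the squared gradient for each weight
    and divide the gradient by the square root of that average."""
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```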
Further developments of rmsprop
• Combining rmsprop with standard momentum
– Momentum does not help as much as it normally does. Needs more
investigation.
• Combining rmsprop with Nesterov momentum (Sutskever 2012)
– It works best if the RMS of the recent gradients is used to divide the correction
rather than the jump in the direction of accumulated corrections.
• Combining rmsprop with adaptive learning rates for each connection
– Needs more investigation.
• Other methods related to rmsprop
– Yann LeCun’s group has a fancy version in “No more pesky learning rates”
Summary of learning methods for neural networks
• For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
  – Conjugate gradient, LBFGS ...
  – Adaptive learning rates, rprop ...
• For big, redundant datasets use mini-batches.
  – Try gradient descent with momentum.
  – Try rmsprop (with momentum?).
  – Try LeCun's latest recipe.
• Why there is no simple recipe:
  – Neural nets differ a lot: very deep nets (especially ones with narrow bottlenecks), recurrent nets, wide shallow nets.
  – Tasks differ a lot: some require very accurate weights, some don't; some have many very rare cases (e.g. words).