DL Class2
•Choosing a proper learning rate can be difficult. A learning rate that is too
small leads to painfully slow convergence, while a learning rate that is too
large can hinder convergence, causing the loss function to fluctuate
around the minimum or even to diverge.
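The effect of the learning rate can be seen on a toy problem. The sketch below is illustrative and not from the slides: it minimizes f(w) = w² with plain gradient descent, and the specific learning rates and step counts are arbitrary choices to show slow convergence, good convergence, and divergence.

```python
# Toy 1-D example: minimize f(w) = w**2 with plain gradient descent.
# The learning rates and step counts are illustrative, not from the slides.

def gradient_descent(lr, steps, w0=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w        # f'(w) = 2w
        w = w - lr * grad
    return w

small = gradient_descent(lr=0.001, steps=100)  # barely moves toward 0
good = gradient_descent(lr=0.1, steps=100)     # converges quickly
large = gradient_descent(lr=1.5, steps=100)    # |w| doubles each step: divergence
```

With lr=1.5 the update is w ← −2w, so the iterates alternate in sign while growing in magnitude, which is exactly the divergent fluctuation described above.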
[Figure: Stochastic Gradient Descent. True gradients shown in blue, minibatch gradients in red; axes W_1 and W_2, starting from the original W.]
We are trying to optimize a cost function whose contours look like this
(red dot marks the minimum).
Gradient descent will take a lot of steps, slowly oscillating towards the
minimum.
On the vertical axis, we want the learning to be slower (we don’t want
the oscillations), but along the horizontal axis, we want faster learning,
i.e. we want to aggressively move from left to right toward that minimum.
Momentum
SGD has trouble navigating ravines, i.e. areas where the surface curves
much more steeply in one dimension than in another, which are common
around local optima. In these scenarios, SGD oscillates across the slopes
of the ravine while only making hesitant progress along the bottom towards
the local optimum.
The gradient descent with momentum algorithm borrows an idea from physics.
It does this by adding a fraction γ of the update vector of the past time step
to the current update vector:

v_t = γ·v_{t−1} + η·∇_θ J(θ)
θ = θ − v_t
Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on
the way (until it reaches its terminal velocity if there is air resistance,
i.e. γ<1).
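A minimal sketch of gradient descent with momentum on an elongated quadratic, the kind of ravine-shaped surface described above. The function f(w1, w2) = 0.5·(w1² + 10·w2²) and the hyperparameter values are illustrative choices, not from the slides.

```python
# Momentum update: v_t = gamma*v_{t-1} + eta*grad, theta = theta - v_t,
# on an elongated quadratic f(w1, w2) = 0.5*(w1**2 + 10*w2**2).
# eta and gamma are illustrative values.

def momentum_gd(eta=0.05, gamma=0.9, steps=200, w=(5.0, 5.0)):
    w1, w2 = w
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = w1, 10 * w2        # gradient of f
        v1 = gamma * v1 + eta * g1  # accumulate a fraction of past updates
        v2 = gamma * v2 + eta * g2
        w1 -= v1                    # move by the accumulated velocity
        w2 -= v2
    return w1, w2
```

The velocity terms v1, v2 play the role of the rolling ball: repeated gradients in the same direction build up speed, while oscillating gradients partially cancel.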
Momentum based gradient descent
Momentum (magenta) vs. Gradient Descent (cyan) on a surface with a global minimum
(the left well) and local minimum (the right well)
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Momentum based gradient descent
SGD vs. Momentum: notice momentum overshooting the target, but overall
getting to the minimum much faster than vanilla SGD.
Nesterov accelerated gradient descent
• However, a ball that rolls down a hill, blindly following the slope, is
highly unsatisfactory.
• We'd like to have a smarter ball, a ball that has a notion of where it
is going so that it knows to slow down before the hill slopes up
again.
• We know that we will use our momentum term γvt−1 to move the
parameters θ.
Instead of evaluating the gradient at the current position (red circle), we know
that our momentum is about to carry us to the tip of the green arrow. With Nesterov
momentum, we therefore evaluate the gradient at this “look-ahead” position:

v_t = γ·v_{t−1} + η·∇_θ J(θ − γ·v_{t−1})
θ = θ − v_t
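The look-ahead idea can be sketched in a few lines. As above, f(w) = 0.5·w² and the hyperparameters are illustrative; the only change from plain momentum is where the gradient is evaluated.

```python
# Nesterov accelerated gradient on f(w) = 0.5 * w**2: the gradient is
# evaluated at the look-ahead point w - gamma*v, not at the current w.
# eta, gamma, and the step count are illustrative values.

def nesterov_gd(eta=0.1, gamma=0.9, steps=100, w=5.0):
    v = 0.0
    for _ in range(steps):
        lookahead = w - gamma * v  # where momentum is about to carry us
        grad = lookahead           # f'(w) = w, evaluated at the look-ahead point
        v = gamma * v + eta * grad
        w = w - v
    return w
```

Because the gradient is taken where the ball is headed rather than where it currently is, the update can slow down before it overshoots, which is the “smarter ball” behaviour described above.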
AdaGrad learning algorithms
• The idea behind Adagrad is to use a different learning rate for each
parameter, adapted at every iteration.
• The reason different learning rates are needed is that the learning
rate for the parameters of sparse features needs to be higher compared
to that of dense-feature parameters, because sparse features occur
less frequently.
AdaGrad learning algorithms
In the Adagrad optimizer equation, the learning rate has been modified in such a
way that it will automatically decrease, because the summation of the previous
squared gradients always keeps increasing after every time step:

θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}

where G_t accumulates the squares of past gradients for parameter i and ε
prevents division by zero.
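A minimal single-parameter AdaGrad sketch showing this shrinking effective step size. The gradient sequence and the values of η and ε are illustrative defaults, not taken from the slides.

```python
import math

# Single-parameter AdaGrad sketch; eta and eps are illustrative defaults.

def adagrad(grads, eta=0.5, eps=1e-8, w=1.0):
    """Apply AdaGrad updates for a sequence of pre-computed gradients."""
    G = 0.0                                  # running sum of squared gradients
    step_sizes = []
    for g in grads:
        G += g * g                           # G grows monotonically
        step = eta / math.sqrt(G + eps) * g  # effective rate: eta / sqrt(G + eps)
        w -= step
        step_sizes.append(abs(step))
    return w, step_sizes

# With a constant gradient, each step is smaller than the last:
# eta/sqrt(1), eta/sqrt(2), eta/sqrt(3), ...
_, sizes = adagrad([1.0, 1.0, 1.0, 1.0])
```

Since G only ever grows, the effective learning rate η/√(G + ε) is strictly decreasing, which is exactly the automatic decay described above.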
AdaGrad learning algorithms
Therefore, we can increase our learning rate, and our algorithm can take
larger steps in the horizontal direction, converging faster.
The following slide shows how the gradients are calculated for RMSprop and
for gradient descent with momentum.
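The contrast between the two update rules can be sketched side by side. This is a generic illustration, not the slide's own figure: momentum smooths the gradient itself with an exponential average, while RMSprop keeps an exponential moving average of the *squared* gradient and uses it to rescale the step. All hyperparameter values are illustrative.

```python
import math

# Momentum vs. RMSprop single-parameter update rules.
# eta, gamma, beta, and eps are illustrative hyperparameters.

def momentum_step(w, v, g, eta=0.01, gamma=0.9):
    v = gamma * v + eta * g             # smooth the gradient direction
    return w - v, v

def rmsprop_step(w, s, g, eta=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g * g   # moving average of squared gradients
    return w - eta * g / math.sqrt(s + eps), s
```

Dividing by √s shrinks steps along directions with large, oscillating gradients (the vertical axis of the ravine) and enlarges steps along directions with small, consistent gradients (the horizontal axis), which is why a larger base learning rate becomes safe.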