Chapter 2: Gradient Descent Algorithm
The learning rate is a hyperparameter that determines the size of the step
taken in each weight update. A small learning rate results in slow
convergence, while a large learning rate can overshoot the minimum and
oscillate around it. It is important to choose a learning rate that balances
the speed of convergence against the stability of the optimization.
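As a minimal sketch of the update rule (the quadratic loss, starting point, and learning-rate values below are illustrative assumptions, not taken from the text), the effect of the learning rate on convergence can be seen directly:

```python
# Minimal gradient descent sketch; the loss (w - 3)**2 and the settings
# below are illustrative, not from the text.
def grad(w):
    return 2 * (w - 3)                  # derivative of (w - 3)**2

def gradient_descent(w0, learning_rate, n_steps):
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)  # the weight-update step
    return w

print(gradient_descent(w0=0.0, learning_rate=0.1, n_steps=50))  # converges to ~3.0
print(gradient_descent(w0=0.0, learning_rate=1.1, n_steps=50))  # overshoots; the error grows each step
```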
Variants of Gradient Descent
In batch gradient descent, the gradient of the loss function with respect to
the weights is computed over the entire training dataset, and the weights are
updated once per pass over the data. This provides an accurate estimate of the
gradient, but it can be computationally expensive for large datasets.
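A minimal sketch of batch gradient descent on a linear regression problem with a squared-error loss; the synthetic data and settings are illustrative assumptions, not from the text:

```python
import numpy as np

# Batch gradient descent: one update per pass over the ENTIRE dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
learning_rate = 0.1
for epoch in range(200):
    # gradient of the mean squared error over the whole training set
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * grad                  # single update per epoch
print(w)                                       # close to true_w
```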
In SGD, the gradient of the loss function with respect to the weights is
computed using a single training example, and the weights are updated after
each example. SGD has a lower computational cost per iteration than batch
gradient descent, but the updates are noisier and it may oscillate around the
minimum rather than converging to it exactly.
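For comparison, a sketch of SGD on the same kind of synthetic regression problem (again, the data and settings are illustrative assumptions):

```python
import numpy as np

# Stochastic gradient descent: one update per training example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
learning_rate = 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):          # visit examples in random order
        xi, yi = X[i], y[i]
        grad = 2 * xi * (xi @ w - yi)          # gradient from a SINGLE example
        w -= learning_rate * grad              # one update per example
print(w)                                       # noisy, but close to true_w
```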
Momentum:
In momentum-based gradient descent, an exponentially decaying average of past
gradients (a velocity term) is added to each update, which damps oscillations
and speeds up progress along directions where the gradient is consistent; see
the sketch below.
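A minimal sketch of the momentum update; the quadratic function, beta = 0.9, and the other settings are illustrative assumptions, not from the text:

```python
# Momentum sketch: a velocity term accumulates an exponentially decaying
# average of past gradients; the function and settings are illustrative.
def grad(w):
    return 2 * (w - 3)                  # derivative of (w - 3)**2

w, velocity = 0.0, 0.0
learning_rate, beta = 0.1, 0.9          # beta controls how much history is kept
for _ in range(200):
    velocity = beta * velocity + grad(w)
    w -= learning_rate * velocity       # step along the smoothed direction
print(w)                                # converges to ~3.0
```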
If the learning rate is too big, the algorithm may not converge to the optimal
point (it jumps around it) or may even diverge completely.
For this function, with a learning rate of 0.1 and a starting point of x = 9,
we can easily calculate each step by hand. Let's do it for the first 3 steps
(see the sketch below):
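The function referenced above is not reproduced in this excerpt; as an illustration, assuming f(x) = x² (so f'(x) = 2x), the first three steps from x = 9 with a learning rate of 0.1 work out as follows:

```python
# Hand-checkable sketch assuming f(x) = x**2, so f'(x) = 2x; the actual
# function used in the original example is not shown in this excerpt.
x, learning_rate = 9.0, 0.1
for step in range(1, 4):
    x = x - learning_rate * 2 * x       # x_{n+1} = x_n - 0.1 * f'(x_n)
    print(step, x)                      # step 1: 7.2, step 2: 5.76, step 3: 4.608
```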
The animation below shows the steps taken by the GD algorithm for learning
rates of 0.1 and 0.8. As you can see, for the smaller learning rate the steps
get gradually smaller as the algorithm approaches the minimum. For the bigger
learning rate, it jumps from one side of the minimum to the other before
converging.
First 10 steps taken by GD for small and big learning rates; Image by author
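The animation itself is not included here; a small sketch that reproduces the same comparison numerically, again assuming f(x) = x² as the illustrative function (an assumption, since the original function is not shown):

```python
# Numerical version of the comparison above, assuming f(x) = x**2.
for learning_rate in (0.1, 0.8):
    x = 9.0
    trajectory = [x]
    for _ in range(10):
        x = x - learning_rate * 2 * x   # gradient step on f(x) = x**2
        trajectory.append(x)
    print(learning_rate, [round(v, 3) for v in trajectory])
# lr = 0.1: steps shrink smoothly toward 0
# lr = 0.8: x flips sign each step (9, -5.4, 3.24, ...) before settling near 0
```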
https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/