Chapter 2: Gradient Descent Algorithm

Gradient descent is an optimization algorithm used to minimize the loss function in machine learning models by iteratively adjusting weights in the direction of the negative gradient. Variants include Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, and Momentum, each addressing different computational challenges and stability issues. The learning rate is a crucial hyperparameter that influences convergence speed and stability, with careful selection necessary to avoid overshooting or slow convergence.

Gradient Descent algorithm

Gradient descent is a powerful optimization algorithm used to minimize the loss function in a machine learning model. It’s a popular choice for a variety of algorithms, including linear regression, logistic regression, and neural networks. In this article, we’ll cover what gradient descent is, how it works, and several variants of the algorithm that are designed to address different challenges and provide optimizations for different use cases.

What is Gradient Descent?


Gradient descent is an optimization algorithm that is used to minimize the
loss function in a machine learning model. The goal of gradient descent is to
find the set of weights (or coefficients) that minimize the loss function. The
algorithm works by iteratively adjusting the weights in the direction of the
steepest decrease in the loss function.

How does Gradient Descent Work?


The basic idea of gradient descent is to start with an initial set of weights and
update them in the direction of the negative gradient of the loss function. The
gradient is a vector of partial derivatives that represents the rate of change of
the loss function with respect to the weights. By updating the weights in the
direction of the negative gradient, the algorithm moves towards a minimum of
the loss function.

The learning rate is a hyperparameter that determines the size of the step
taken in the weight update. A small learning rate results in a slow
convergence, while a large learning rate can lead to overshooting the
minimum and oscillating around the minimum. It’s important to choose an
appropriate learning rate that balances the speed of convergence and the
stability of the optimization.
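
In code, a single update step follows directly from this description. The sketch below is purely illustrative (the variable names and the use of NumPy are assumptions, not taken from the original text):

import numpy as np

# One gradient descent update: move the weights a small step in the
# direction of the negative gradient of the loss.
def update_step(weights, gradient, learning_rate):
    return weights - learning_rate * gradient

w = np.array([1.0, -2.0])        # current weights
g = np.array([0.5, -1.0])        # gradient of the loss at w
print(update_step(w, g, 0.1))    # -> [ 0.95 -1.9 ]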
Variants of Gradient Descent

1) Batch Gradient Descent:

In batch gradient descent, the gradient of the loss function is computed with
respect to the weights for the entire training dataset, and the weights are
updated after each iteration. This provides a more accurate estimate of the
gradient, but it can be computationally expensive for large datasets.
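
As a rough sketch of batch gradient descent for a linear model with a mean-squared-error loss (the model and the loss are assumptions made for illustration; the text does not prescribe them):

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_iters=1000):
    # Batch GD: the gradient is computed over the ENTIRE training set
    # before each single weight update.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = X @ w - y
        grad = 2.0 / len(y) * (X.T @ residuals)   # full-dataset gradient of the MSE loss
        w -= lr * grad                            # one update per full pass over the data
    return w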

2) Stochastic Gradient Descent (SGD):

In SGD, the gradient of the loss function is computed with respect to a single
training example, and the weights are updated after each example. SGD has
a lower computational cost per iteration compared to batch gradient descent,
but it can be less stable and may not converge to the optimal solution.
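
A corresponding sketch of SGD under the same illustrative linear-model assumptions, where every single example triggers an update:

import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10):
    # SGD: compute the gradient from ONE example and update immediately.
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):   # visit examples in random order
            error = X[i] @ w - y[i]
            grad = 2.0 * error * X[i]             # single-example gradient
            w -= lr * grad
    return w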

3) Mini-Batch Gradient Descent:

Mini-batch gradient descent is a compromise between batch gradient descent and SGD. The gradient of the loss function is computed with respect to a small randomly selected subset of the training examples (called a mini-batch), and the weights are updated after each mini-batch. Mini-batch gradient descent provides a balance between the stability of batch gradient descent and the computational efficiency of SGD.
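
And a sketch of the mini-batch variant under the same assumptions, with the batch size as an extra hyperparameter:

import numpy as np

def mini_batch_gradient_descent(X, y, lr=0.01, batch_size=32, n_epochs=10):
    # Mini-batch GD: one update per small, randomly selected subset of examples.
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residuals = X[batch] @ w - y[batch]
            grad = 2.0 / len(batch) * (X[batch].T @ residuals)
            w -= lr * grad
    return w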

4) Momentum:

Momentum is a variant of gradient descent that incorporates information from the previous weight updates to help the algorithm converge more quickly to the optimal solution. Momentum adds a term to the weight update that is proportional to the running average of the past gradients, allowing the algorithm to move more quickly in the direction of the optimal solution.
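
A minimal sketch of momentum on top of plain gradient descent (the coefficient name beta and its default of 0.9 are common choices assumed here, not values given in the text):

import numpy as np

def gradient_descent_momentum(grad_fn, w0, lr=0.01, beta=0.9, n_iters=1000):
    # Momentum: keep an exponentially decaying sum of past gradients
    # (the "velocity") and step along it instead of the raw gradient.
    w = np.asarray(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(n_iters):
        velocity = beta * velocity + grad_fn(w)
        w = w - lr * velocity
    return w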
Gradient Descent
The Gradient Descent algorithm iteratively calculates the next point using the gradient at the current position, scales it (by a learning rate) and subtracts the obtained value from the current position (it makes a step). It subtracts the value because we want to minimise the function (to maximise it we would add). This process can be written as:

pₙ₊₁ = pₙ − η ∇f(pₙ)

where pₙ is the current position and ∇f(pₙ) is the gradient of the function at that position.

There’s an important parameter η which scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and has a strong influence on performance.

• The smaller the learning rate, the longer GD takes to converge, and it may reach the maximum number of iterations before reaching the optimum point.

• If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.

In summary, the Gradient Descent method’s steps are:

1. choose a starting point (initialisation)

2. calculate the gradient at this point

3. make a scaled step in the opposite direction to the gradient (objective: minimise)

4. repeat points 2 and 3 until one of the criteria is met:

• maximum number of iterations reached

• step size is smaller than the tolerance (due to scaling or a small gradient).
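
A minimal Python sketch of such a routine is given below; the original article’s code is not reproduced here, so the function name, signature and defaults are assumptions chosen to match the description that follows.

import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    # Follow the negative gradient from a starting point until the step
    # becomes smaller than the tolerance or max_iter is reached.
    x = start
    history = [x]                         # points visited so far
    for _ in range(max_iter):
        step = learn_rate * gradient(x)   # scaled step
        if np.all(np.abs(step) < tol):    # stopping criterion on the step size
            break
        x = x - step                      # move opposite to the gradient
        history.append(x)
    return x, history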
This function takes 5 parameters:

1. starting point - in our case, we define it manually, but in practice it is often a random initialisation

2. gradient function - has to be specified beforehand

3. learning rate - scaling factor for step sizes

4. maximum number of iterations

5. tolerance to conditionally stop the algorithm (in this case the default value is 0.01)

Example 1 — a quadratic function

Let’s take a simple quadratic function defined as:

Because it is a univariate function, its gradient function is:

For this function, by taking a learning rate of 0.1 and a starting point at x = 9, we can easily calculate each step by hand. Let’s do it for the first 3 steps:
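
The original function and the hand calculation are not reproduced here; purely as an illustration, suppose the quadratic were f(x) = x², so that f'(x) = 2x. With a learning rate of 0.1 and a starting point x = 9, the first three steps would be:

x₁ = 9 − 0.1 · (2 · 9) = 9 − 1.8 = 7.2
x₂ = 7.2 − 0.1 · (2 · 7.2) = 7.2 − 1.44 = 5.76
x₃ = 5.76 − 0.1 · (2 · 5.76) = 5.76 − 1.152 = 4.608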
The animation below shows steps taken by the GD algorithm for learning rates of 0.1 and 0.8. As you can see, for the smaller learning rate the steps get gradually smaller as the algorithm approaches the minimum. For the bigger learning rate, it jumps from one side to the other before converging.

First 10 steps taken by GD for small and big learning rates; Image by author

Trajectories, the number of iterations and the final converged result (within tolerance) for various learning rates are shown below:

https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/
