Lec05-1-Gradient Descent-Detailed
What is Gradient?
1. It is the rate of change of one variable with respect to another.
2. A gradient measures how much the output of a function
changes if you change the inputs a little bit.
3. In mathematical terms, it is known as the slope of a function.
4. A gradient is nothing but a derivative that describes the effect
on the function's output of a small variation in its inputs.
5. In machine learning, a gradient is the derivative of a function
that has more than one input variable.
6. The gradient simply measures the change in error with
respect to a change in each of the weights.
Gradient and Partial Derivatives
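The figure for this slide is not reproduced here. As a minimal numerical sketch (the function f and the probe point are made-up examples, not from the slides), the gradient of a multi-input function is the vector of its partial derivatives, each of which can be estimated by a finite difference:

```python
# Numerical sketch: for f(x, y) = x**2 + 3*y, the gradient is the vector of
# partial derivatives (df/dx, df/dy) = (2x, 3). Here we estimate it with
# central finite differences. f is a made-up example function.

def f(x, y):
    return x**2 + 3*y

def numerical_gradient(func, x, y, h=1e-6):
    """Central-difference estimate of (df/dx, df/dy) at the point (x, y)."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return dfdx, dfdy

dfdx, dfdy = numerical_gradient(f, 2.0, 1.0)
print(dfdx, dfdy)  # close to the analytic values (4, 3)
```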
What is a Cost Function?
• It is a function that measures the performance of a model for any given data.
• The cost function quantifies the error between predicted values and expected values
and presents it in the form of a single real number.
In general, after making a hypothesis with initial
parameters, we calculate the cost function;
with the goal of reducing it, we
modify the parameters by using the gradient
descent algorithm over the given data.
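The slide's formula is not reproduced here. As a sketch, one common cost function for regression is the mean squared error; the data values below are made up for illustration:

```python
# Mean squared error (MSE), a common regression cost function:
#   J = (1/m) * sum((y_pred - y_true)**2)
# It collapses all per-example errors into a single real number.

def mse(y_pred, y_true):
    m = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / m

print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # (0.25 + 0.25 + 0) / 3 ~ 0.1667
```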
Cost function Optimization
• The cost function represents the discrepancy between the
Predicted and the Actual output of the model.
• The goal of Gradient Descent is to find the set of parameters that
minimizes this discrepancy and improves the model’s
performance.
• Gradient descent is an optimization algorithm used in machine/deep learning to
minimize the cost function by iteratively adjusting parameters in
the direction of the negative gradient (opposite to the
gradient of the function at the current point), aiming to find the
optimal set of parameters.
Gradient Descent
• The algorithm's objective is to identify model parameters, such as weights
and biases, that reduce model error on the training data.
• In linear regression, it finds the weights and biases; in deep learning,
backward propagation uses the same method.
• It adjusts parameters iteratively to drive a given function toward a local minimum.
How does Gradient Descent work?
• The algorithm operates by calculating the gradient of the cost function,
which indicates the direction and magnitude of the steepest ascent.
• However, since the objective is to minimize the cost function, gradient
descent moves in the opposite direction of the gradient, known as the
negative gradient direction.
• By iteratively updating the model’s parameters in the negative gradient
direction, gradient descent gradually converges towards the optimal
set of parameters that yields the lowest cost.
• The learning rate (a hyper-parameter) determines the step size taken
in each iteration, influencing the speed and stability of convergence.
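The loop described above can be sketched in a few lines; the objective function, starting point, and learning rate below are illustrative choices, not from the slides:

```python
# Minimal gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.
# Each iteration steps opposite to the gradient f'(x) = 2*(x - 3).

def gradient(x):
    return 2 * (x - 3)

x = 0.0                # starting point
learning_rate = 0.1    # step size (hyper-parameter)
for _ in range(100):
    x = x - learning_rate * gradient(x)   # move in the negative gradient direction

print(round(x, 4))  # converges close to 3.0
```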
Applications
• Gradient descent can be applied to various machine/deep learning
algorithms including:
• Linear regression
• Logistic regression
• Neural networks, and
• Support vector machines
• It provides a general framework for optimizing models by iteratively
refining their parameters based on the cost function.
When is Gradient Descent used in ML/DL?
• The learning happens during the backpropagation while training the
neural network-based model.
• When we say that we are training the model, it is gradient descent
behind the scenes that trains it.
Example
• Let’s say you are playing a game
where the players are at the top of
a mountain, and they are asked to
reach the lowest point of the
mountain. Additionally, they are
blindfolded. So, what approach do
you think would make you reach
the bottom?
• Take a moment to think about this
before heading forward. The best way is to feel the ground and find where the land descends.
From that position, take a step in the descending direction and iterate this
process until we reach the lowest point.
Example
• Gradient descent is an iterative optimization
algorithm for finding the local minimum of a
function.
• To find a local minimum of a function using
gradient descent, we must take steps proportional to
the negative of the gradient (moving against the
gradient) of the function at the current point.
• If we instead take steps proportional to the positive of the
gradient (moving with the gradient), we will
approach a local maximum of the function; that
procedure is called gradient ascent.
Gradient Descent Working
The goal of the gradient descent algorithm is to
minimize the given function (say cost
function).
To achieve this goal, it performs two steps
iteratively:
1. Compute the gradient (slope), the first-
order derivative of the function at that point
2. Make a step (move) in the direction
opposite to the gradient (the direction in
which the slope increases) from the current point, by
alpha times the gradient at that point
Alpha is called Learning rate – a tuning parameter in the
optimization process. It decides the length of the steps.
Consideration for Learning Rate
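The figure for this slide is not reproduced here. As an illustrative sketch (all values are made-up choices), the same quadratic objective reacts very differently to three learning rates:

```python
# Effect of the learning rate on gradient descent for f(x) = x**2,
# whose gradient is f'(x) = 2*x. Each update multiplies x by
# (1 - 2*learning_rate), so the rate controls speed and stability.

def run(learning_rate, steps=20, x0=1.0):
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

small = run(0.01)   # converges, but slowly: x shrinks by factor 0.98 per step
good  = run(0.1)    # converges quickly: factor 0.8 per step
big   = run(1.1)    # overshoots and diverges: factor -1.2 per step, |x| grows
print(abs(small), abs(good), abs(big))
```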
Plotting the Gradient Descent
• 2-D plot: when we have a single parameter
(theta), we can plot the
dependent variable cost on the y-
axis and theta on the x-axis.
• 3-D plot: if there are two parameters, we
can use a 3-D plot, with cost
on one axis and the two
parameters (thetas) along the
other two axes.
Gradient Descent- Computing
Derivatives
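The derivation on this slide is not reproduced here. Assuming the model is simple linear regression y_hat = w*x + b with an MSE cost (an assumption for illustration; the data points are made up), the partial derivatives can be sketched as:

```python
# Partial derivatives of the MSE cost J(w, b) = (1/m) * sum((w*x + b - y)**2)
# for a linear model y_hat = w*x + b:
#   dJ/dw = (2/m) * sum((w*x + b - y) * x)
#   dJ/db = (2/m) * sum( w*x + b - y)

def mse_gradients(w, b, xs, ys):
    m = len(xs)
    dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / m
    db = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / m
    return dw, db

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # data exactly fits y = 2x
dw, db = mse_gradients(2.0, 0.0, xs, ys)
print(dw, db)  # both 0 at the true parameters, since the fit is exact
```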
Plotting the Gradient Descent
It can also be visualized by
using a contour.
• A contour is a line drawn on
a topographic map to
indicate ground elevation or
depression.
• A contour interval is the
vertical distance or
difference in elevation
between contour lines.
Plotting the Gradient Descent
• It can also be visualized by using a
contour
• This shows a 3-D plot in two
dimensions with parameters along
both axes and the response as a
contour.
• The value of the response increases
away from the center and is the
same along each ring.
• The response is directly proportional to
the distance of a point from the center
(along a given direction).
Challenges of Gradient Descent
While gradient descent is a powerful optimization algorithm, it can also present some challenges that can affect its
performance. Some of these challenges include:
1. Local Optima: Gradient descent can converge to local optima instead of the global optimum, especially if the cost
function has multiple peaks and valleys.
2. Learning Rate Selection: The choice of learning rate can significantly impact the performance of gradient
descent. If the learning rate is too high, the algorithm may overshoot the minimum, and if it is too low, the
algorithm may take too long to converge.
3. Overfitting: Gradient descent can overfit the training data if the model is too complex or the learning rate is too
high. This can lead to poor generalization performance on new data.
4. Convergence Rate: The convergence rate of gradient descent can be slow for large datasets or high-dimensional
spaces, which can make the algorithm computationally expensive.
5. Saddle Points: In high-dimensional spaces, the cost function can have saddle points, where the gradient
vanishes without being a minimum; these can cause gradient descent to stall on a plateau instead of converging to a minimum.
• To overcome these challenges, several variations of gradient descent have been developed, such as adaptive
learning rate methods, momentum-based methods, and second-order methods. Additionally, choosing the right
regularization method, model architecture, and hyper-parameters can also help improve the performance of
gradient descent.
Challenges of Gradient Descent
• Efficient implementation of gradient descent is essential for achieving
good performance in machine learning tasks.
• The choice of the learning rate and the number of iterations can
significantly impact the performance of the algorithm.
• There are different variations of gradient descent, including:
1. Batch Gradient Descent
2. Stochastic Gradient Descent, and
3. Mini-batch Gradient Descent
Batch gradient descent is suitable for small datasets, while stochastic gradient
descent is more suitable for large datasets. Mini-batch gradient descent is a good
compromise between the two and is often used in practice.
• Due to their own benefits and drawbacks, their choice depends on
the problem at hand and the size of the dataset.
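The three variants differ mainly in how many examples feed each update. As a structural sketch (the toy objective, dataset, and learning rate below are illustrative, not from the slides):

```python
import random

# The three variants on a toy 1-D problem: minimize the average of
# (x - d)**2 over data points d, whose minimum is at their mean (2.5).
# Per-point gradient: 2 * (x - d).

data = [1.0, 2.0, 3.0, 4.0]
lr = 0.05

def batch_step(x):
    # one update uses the average gradient over ALL examples
    g = sum(2 * (x - d) for d in data) / len(data)
    return x - lr * g

def sgd_step(x):
    # one update uses a SINGLE randomly chosen example
    d = random.choice(data)
    return x - lr * 2 * (x - d)

def minibatch_step(x, batch_size=2):
    # one update uses a small random subset of examples
    batch = random.sample(data, batch_size)
    g = sum(2 * (x - d) for d in batch) / batch_size
    return x - lr * g

x = 0.0
for _ in range(200):
    x = batch_step(x)
print(round(x, 3))  # approaches the mean of the data, 2.5
```

Batch steps follow a smooth path; stochastic and mini-batch steps are noisy but far cheaper per update on large datasets.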
Gradient Descent Types
1. Batch Gradient Descent
• In Batch Gradient Descent, all the training data is taken into
consideration to take a single step.
• It calculates the average gradient of the cost function over all the
training examples and updates the parameters in the opposite
direction using that mean gradient.
• So, just one step of gradient descent is taken downhill in every epoch.
• Batch gradient descent guarantees convergence to the global
minimum for convex cost functions (and to a local minimum otherwise),
but can be computationally expensive and slow for large datasets.
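The one-step-per-epoch behaviour described above can be sketched for simple linear regression; the dataset and hyper-parameters are made-up choices:

```python
# Batch gradient descent for simple linear regression y_hat = w*x + b.
# Every epoch computes the average MSE gradient over the WHOLE training
# set, then takes a single downhill step.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1, no noise
w, b = 0.0, 0.0
lr = 0.05
m = len(xs)

for epoch in range(5000):
    # average gradients of the MSE cost over all m examples
    dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / m
    db = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / m
    w -= lr * dw                   # one step per epoch
    b -= lr * db

print(round(w, 3), round(b, 3))  # close to the true parameters (2, 1)
```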
Gradient Descent Types
1. Batch Gradient Descent
• Batch Gradient Descent is great for
convex or relatively smooth error
manifolds. In this case, we move
somewhat directly towards an
optimum solution.
• The graph of cost vs epochs is also
quite smooth because we are
averaging over all the gradients of
training data for a single step. The
cost keeps on decreasing over the
epochs.
Gradient Descent Types
1. Batch Gradient Descent Issue:
In Batch Gradient Descent, we consider all the examples for every single step of
gradient descent, which becomes computationally expensive when the dataset is large.
Local Minima-Details
• Not trivial, but there are techniques that can help to avoid local minima in
gradient descent.
• In machine learning, local minima and global minima are two important
concepts related to the optimization of loss functions.
• A loss function is a function that measures the error between a model’s
predictions and the ground truth. The goal of machine learning is to find a
model that minimizes the loss function.
• A local minimum is a point in the parameter space where the loss function
is minimized in a local neighborhood.
• A global minimum is a point in the parameter space where the loss
function is minimized globally.
Self Reading
How To Deal With Local Minima-
Details
• It is not usually possible to find the global minimum of a loss function analytically.
• Instead, machine learning algorithms use iterative optimization methods to find a local
minimum.
• One common iterative optimization method is gradient descent.
• Gradient descent starts at a random point in the parameter space and then iteratively
updates the parameters in the direction of the negative gradient of the loss function.
• The negative gradient points in the direction of the steepest descent, so gradient
descent will eventually converge to a local minimum.
• However, there is no guarantee that the local minimum found by gradient descent is the
global minimum.
• In fact, it is possible that gradient descent will get stuck in a local minimum that is not
the global minimum.
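A small demonstration of this, using a made-up quartic function (not from the slides): two different starting points lead gradient descent to two different minima, only one of which is global.

```python
# f(x) = x**4 - 2*x**2 - 0.5*x has a shallow local minimum near x = -0.9
# and a deeper, global minimum near x = +1.1.
# Its gradient is f'(x) = 4*x**3 - 4*x - 0.5.

def f(x):
    return x**4 - 2 * x**2 - 0.5 * x

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x = x - lr * (4 * x**3 - 4 * x - 0.5)
    return x

left  = descend(-1.0)   # starts in the left valley -> stuck at the local minimum
right = descend(1.0)    # starts in the right valley -> reaches the global minimum
print(round(left, 3), round(right, 3))
print(f(left) > f(right))  # the left minimum is not as deep as the right one
```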
Self Reading
Mathematical Example
• Let f(x) be a loss function.
• A local minimum of f(x) is a point x* such that f(x*) ≤ f(x) for all x in a
neighborhood of x*.
• A global minimum of f(x) is a point x* such that f(x*) ≤ f(x) for all x in
the domain of f(x).
• Here is an example of a loss function with two local minima (at x = 1
and x = -1) and one global minimum.
• A convex function, by contrast, has only one minimum,
which is therefore the global minimum.
Self/Further Reading for PhD Students
How to Identify Local Minima
“The global minimum is the best possible solution, but it’s not always easy to find it.” -- Andrew Ng
In L1, we have: