Gradient Descent

Dr. Muhammad Nadeem Ashraf


Outline
• Background/ What is a Cost Function?
• What is Gradient Descent?
• How Does Gradient Descent Work?
• Types of Gradient Descent
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• Plotting the Gradient Descent Algorithm
• Alpha – The Learning Rate
• Local Minima
• Code Implementation of Gradient Descent in Python
• Challenges of Gradient Descent
• End Notes
• Frequently Asked Questions
History
• Gradient descent was originally proposed by Cauchy in 1847.
• It is also known as steepest descent.
• Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function.
Background
• Imagine you're lost in a dense forest with no map or compass. What do you do?
• You follow the path of steepest descent, taking steps in the downhill direction that decreases your elevation and brings you closer to your destination.
 Similarly, Gradient Descent is the go-to algorithm for navigating the
complex landscape of machine learning.
 It helps models find the optimal set of parameters by iteratively
adjusting them in the opposite direction of the gradient.
Positive and Negative Slopes

What is Gradient?
1. It is the rate of change of one variable with respect to another.
2. A gradient measures how much the output of a function changes if you change the inputs a little bit.
3. In mathematical terms, it is known as the slope of a function.
4. A gradient is nothing but a derivative that describes the effect on the function's output of a small variation in its inputs.
5. In machine learning, a gradient is the derivative of a function that has more than one input variable.
6. The gradient simply measures the change in error with respect to a change in each of the weights.
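As a quick illustration of points 2 and 4, a gradient can be approximated numerically by nudging each input a little and watching the output. A minimal Python sketch (the function f, the point, and the step size h are illustrative assumptions, not from the slides):

# Finite-difference approximation of a gradient.
def numerical_gradient(f, x, h=1e-5):
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h                       # change one input a little bit
        grad.append((f(x_plus) - f(x)) / h)  # rate of change in that direction
    return grad

# Example: f(w0, w1) = w0**2 + 3*w1 has gradient (2*w0, 3)
f = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_gradient(f, [1.0, 2.0]))  # approximately [2.0, 3.0]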
Gradient and Partial Derivatives
What is a Cost Function?
• It is a function that measures the performance of a model for any given data.
• The cost function quantifies the error between predicted values and expected values and presents it in the form of a single real number.
• After making a hypothesis with initial parameters, we calculate the cost function; with the goal of reducing it, we modify the parameters by using the gradient descent algorithm over the given data. Here's the mathematical representation of it:
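For a hypothesis h_θ evaluated on m training examples (x^(i), y^(i)), a commonly used concrete form is the mean squared error:

J(θ) = (1 / 2m) * Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )²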
Cost Function Optimization
• The cost function represents the discrepancy between the predicted and the actual output of the model.
• The goal of gradient descent is to find the set of parameters that minimizes this discrepancy and improves the model's performance.
• It is an optimization algorithm used in machine/deep learning to minimize the cost function by iteratively adjusting parameters in the direction of the negative gradient (i.e., moving opposite to the gradient of the function at the current point), aiming to find the optimal set of parameters.
Gradient Descent
• The algorithm's objective is to identify model parameters, such as weights and biases, that reduce the model's error on the training data.
• In linear regression, it finds the weights and biases; in deep learning, backward propagation uses the same method.
• It adjusts the parameters iteratively to minimize a given function down to a local minimum.
How does Gradient Descent work?
• The algorithm operates by calculating the gradient of the cost function,
which indicates the direction and magnitude of the steepest ascent.
• However, since the objective is to minimize the cost function, gradient
descent moves in the opposite direction of the gradient, known as the
negative gradient direction.
• By iteratively updating the model’s parameters in the negative gradient
direction, gradient descent gradually converges towards the optimal
set of parameters that yields the lowest cost.
• The learning rate (a hyper-parameter) determines the step size taken
in each iteration, influencing the speed and stability of convergence.
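Putting these points together, the per-parameter update applied at each iteration can be written as (with α the learning rate and J the cost function):

θ_j := θ_j − α * ∂J(θ) / ∂θ_j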
Applications
• Gradient descent can be applied to various machine/deep learning
algorithms including:
• Linear regression
• Logistic regression
• Neural networks, and
• Support vector machines
• It provides a general framework for optimizing models by iteratively
refining their parameters based on the cost function.
When is Gradient Descent used in ML/DL?
• The learning happens during backpropagation while training a neural network-based model.
• When we say that we are training the model, it's gradient descent behind the scenes that trains it.
Example
• Let’s say you are playing a game
where the players are at the top of
a mountain, and they are asked to
reach the lowest point of the
mountain. Additionally, they are
blindfolded. So, what approach do
you think would make you reach
the lake?
• Take a moment to think about this before heading forward.
• The best way is to observe the ground and find where the land descends. From that position, take a step in the descending direction and iterate this process until we reach the lowest point.
Example
• Gradient descent is an iterative optimization
algorithm for finding the local minimum of a
function.
• To find the local minimum of a function using
gradient descent, we must take steps proportional to
the negative of the gradient (move away from the
gradient) of the function at the current point.
• If we take steps proportional to the positive of the
gradient (moving towards the gradient), we will
approach a local maximum of the function, and the
procedure is called Gradient Ascent.
Gradient Descent Working
The goal of the gradient descent algorithm is to minimize the given function (say, the cost function).
To achieve this goal, it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at the current point.
2. Take a step (move) in the direction opposite to the gradient, i.e., opposite to the direction in which the slope increases, moving from the current point by alpha times the gradient at that point.
Alpha is called the learning rate, a tuning parameter in the optimization process. It decides the length of the steps.
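These two steps translate directly into code. A minimal Python sketch (the example function, its gradient, the starting point, and the hyperparameter values are all illustrative assumptions, not from the slides):

# Two-step loop: compute the gradient, then move opposite to it
# by alpha times the gradient. All concrete values are illustrative.
def gradient_descent(grad, start, alpha=0.1, n_iters=100, tol=1e-6):
    x = start
    for _ in range(n_iters):
        step = alpha * grad(x)   # alpha times the gradient at this point
        if abs(step) < tol:      # stop once the steps become tiny
            break
        x -= step                # step opposite to the gradient
    return x

# Example: minimize f(x) = (x - 3)**2, whose gradient is 2*(x - 3)
print(gradient_descent(grad=lambda x: 2 * (x - 3), start=0.0))  # approx. 3.0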
Consideration for Learning Rate
Plotting the Gradient Descent
• 2-D: When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and theta on the x-axis. [2-D plot]
• 3-D: If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes. [3-D plot]
Gradient Descent: Computing Derivatives
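For the squared-error cost above with a linear hypothesis h_θ(x) = θ₀ + θ₁·x, the partial derivatives take the standard form:

∂J/∂θ₀ = (1/m) * Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )
∂J/∂θ₁ = (1/m) * Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) * x^(i)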
Plotting the Gradient Descent
It can also be visualized by
using a contour.
• A contour is a line drawn on
a topographic map to
indicate ground elevation or
depression.
• A contour interval is the
vertical distance or
difference in elevation
between contour lines.
Plotting the Gradient Descent
• It can also be visualized by using a contour plot.
• This shows a 3-D plot in two dimensions, with the parameters along both axes and the response as a contour.
• The value of the response increases away from the center and is the same everywhere along a given ring.
• The response is directly proportional to the distance of a point from the center (along a given direction).
Challenges of Gradient Descent
While gradient descent is a powerful optimization algorithm, it can also present some challenges that can affect its
performance. Some of these challenges include:
1. Local Optima: Gradient descent can converge to local optima instead of the global optimum, especially if the cost
function has multiple peaks and valleys.
2. Learning Rate Selection: The choice of learning rate can significantly impact the performance of gradient
descent. If the learning rate is too high, the algorithm may overshoot the minimum, and if it is too low, the
algorithm may take too long to converge.
3. Overfitting: Gradient descent can overfit the training data if the model is too complex or the learning rate is too
high. This can lead to poor generalization performance on new data.
4. Convergence Rate: The convergence rate of gradient descent can be slow for large datasets or high-dimensional
spaces, which can make the algorithm computationally expensive.
5. Saddle Points: In high-dimensional spaces, the gradient of the cost function can have saddle points, which can
cause gradient descent to get stuck in a plateau instead of converging to a minimum.
• To overcome these challenges, several variations of gradient descent have been developed, such as adaptive
learning rate methods, momentum-based methods, and second-order methods. Additionally, choosing the right
regularization method, model architecture, and hyper-parameters can also help improve the performance of
gradient descent.
Challenges of Gradient Descent
• Efficient implementation of gradient descent is essential for achieving
good performance in machine learning tasks.
• The choice of the learning rate and the number of iterations can
significantly impact the performance of the algorithm.
• There are different variations of gradient descent, including:
1. Batch Gradient Descent
2. Stochastic Gradient Descent, and
3. Mini-batch Gradient Descent
Batch gradient descent is suitable for small datasets, while stochastic gradient descent is more suitable for large datasets. Mini-batch gradient descent is a good compromise between the two and is often used in practice.
• Each has its own benefits and drawbacks, so the choice depends on the problem at hand and the size of the dataset.
Gradient Descent Types
1. Batch Gradient Descent
• In Batch Gradient Descent, all the training data is taken into consideration to take a single step.
• It calculates the average gradient of the cost function over all the training examples and updates the parameters in the opposite direction using that mean gradient.
• So, just one step of gradient descent is taken downhill in every epoch.
• Batch gradient descent guarantees convergence to the global minimum for convex cost functions (and to a local minimum otherwise), but it can be computationally expensive and slow for large datasets.
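A minimal vectorized sketch of one such run for linear regression (NumPy; the synthetic data and hyperparameters are illustrative assumptions):

import numpy as np

# Batch GD: each step averages the gradient over ALL m examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # m = 100 examples, 2 features
y = X @ np.array([2.0, -1.0]) + 0.5    # synthetic targets

w, b, alpha = np.zeros(2), 0.0, 0.1
for epoch in range(200):
    err = (X @ w + b) - y              # predictions minus targets
    w -= alpha * (X.T @ err) / len(y)  # mean gradient over the whole batch
    b -= alpha * err.mean()            # one downhill step per epoch
print(w, b)  # approaches [2.0, -1.0] and 0.5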
Gradient Descent Types
1. Batch Gradient Descent
• Batch Gradient Descent is great for
convex or relatively smooth error
manifolds. In this case, we move
somewhat directly towards an
optimum solution.
• The graph of cost vs epochs is also
quite smooth because we are
averaging over all the gradients of
training data for a single step. The
cost keeps on decreasing over the
epochs.
Gradient Descent Types
1. Batch Gradient Descent Issue:
In Batch Gradient Descent, we consider all the examples for every step of gradient descent.
But what if our dataset is very huge?
 Deep learning models require more and more data.
 The more the data, the better the chances of the model being good.
Example:
• Suppose our dataset has 5 million examples; then, just to take one step, the model will have to calculate the gradients of all 5 million examples.
• This does not seem an efficient way.
• To tackle this problem, we have Stochastic Gradient Descent.
Gradient Descent Types
2. Stochastic Gradient Descent
• In Stochastic Gradient Descent (SGD), we consider just one example at a time
to take a single step. We do the following steps in one epoch for SGD:
1. Take an example randomly
2. Feed it to the Neural Network
3. Calculate the gradient of the cost function for that example
4. Use the gradient we calculated in Step 3 to update the parameters (weights and Bias)
5. Repeat steps 1–4 for all the examples in the training dataset
• Stochastic gradient descent is computationally efficient and can converge
faster than batch gradient descent.
• However, it can be noisy and may not converge to the global minimum.
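A minimal sketch of these five steps for the same kind of linear model (the data and hyperparameters are illustrative assumptions):

import numpy as np

# SGD: the parameters are updated after EVERY single example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b, alpha = np.zeros(2), 0.0, 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):  # step 1: pick an example at random
        err = (X[i] @ w + b) - y[i]    # steps 2-3: gradient for one example
        w -= alpha * err * X[i]        # step 4: update weights and bias
        b -= alpha * err
print(w, b)  # noisy, but close to [2.0, -1.0] and 0.5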
Local Minima
• The cost function may have many minimum points.
• Gradient descent may settle in any one of these minima, depending on the initial point (i.e., the initial parameters, theta) and the learning rate.
• Therefore, the optimization may converge to different points for different starting points and learning rates.
Gradient Descent Types
2. Stochastic Gradient Descent Issues:
• Since we are considering just one example at a time, the cost will fluctuate over the training examples, and it will not necessarily decrease.
• But in the long run, you will see the cost decreasing with fluctuations. [Figure: Cost vs Epochs in SGD [1]]
• Because the cost fluctuates so much, it will never reach the minimum, but it will keep dancing around it.
Gradient Descent Types
3. Mini-Batch Gradient Descent
• Mini-batch gradient descent updates the model’s parameters using the
gradient of a small subset of the training set, known as a mini-batch.
• It calculates the average gradient of the cost function for the mini-
batch and updates the parameters in the opposite direction.
• Mini-batch gradient descent combines the advantages of both batch
and stochastic gradient descent and is the most commonly used
method in practice.
• It is computationally efficient and less noisy than stochastic gradient
descent, while still being able to converge to a good solution.
Gradient Descent Types
3. Mini-Batch Gradient Descent Implementation
• Neither the whole dataset is used at once, nor a single example at a time.
• We use a batch of a fixed number of training examples, smaller than the actual dataset, and call it a mini-batch.
• Doing this helps us achieve the advantages of both of the former variants.
• After creating the mini-batches of fixed size, we do the following steps in one epoch (see the sketch after this list):
1. Pick a mini-batch
2. Feed it to the Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in Step 3 to update the parameters (weights and bias)
5. Repeat steps 1–4 for the mini-batches we created
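A minimal vectorized sketch of these steps (again with illustrative data, batch size, and hyperparameters):

import numpy as np

# Mini-batch GD: each update averages the gradient over a small batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b, alpha, batch = np.zeros(2), 0.0, 0.05, 16
for epoch in range(50):
    order = rng.permutation(len(y))
    for s in range(0, len(y), batch):
        idx = order[s:s + batch]                  # step 1: pick a mini-batch
        err = (X[idx] @ w + b) - y[idx]           # steps 2-3: mean gradient
        w -= alpha * (X[idx].T @ err) / len(idx)  # step 4: vectorized update
        b -= alpha * err.mean()
print(w, b)  # approaches [2.0, -1.0] and 0.5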
Gradient Descent Types
3. Mini-Batch Gradient Descent Issues:
• Just like SGD, the average cost over the epochs in mini-batch gradient descent fluctuates, because we are averaging over a small number of examples at a time.
• So, when we are using mini-batch gradient descent, we update our parameters frequently and can also use a vectorized implementation for faster computation.
Self Reading $$

Local Minima-Details
• Not trivial, but there are techniques that can help to avoid local minima in
gradient descent.
• In machine learning, local minima and global minima are two important
concepts related to the optimization of loss functions.
• A loss function is a function that measures the error between a model’s
predictions and the ground truth. The goal of machine learning is to find a
model that minimizes the loss function.
• A local minimum is a point in the parameter space where the loss function
is minimized in a local neighborhood.
• A global minimum is a point in the parameter space where the loss
function is minimized globally.
Self Reading $$
How To Deal With Local Minima-
Details
• It is not usually possible to find the global minimum of a loss function analytically.
• Instead, machine learning algorithms use iterative optimization methods to find a local
minimum.
• One common iterative optimization method is gradient descent.
• Gradient descent starts at a random point in the parameter space and then iteratively
updates the parameters in the direction of the negative gradient of the loss function.
• The negative gradient points in the direction of the steepest descent, so gradient
descent will eventually converge to a local minimum.
• However, there is no guarantee that the local minimum found by gradient descent is the
global minimum.
• In fact, it is possible that gradient descent will get stuck in a local minimum that is not
the global minimum.
Self Reading $$

Mathematical Example
• Let f(x) be a loss function.
• A local minimum of f(x) is a point x* such that f(x*) ≤ f(x) for all x in a neighborhood of x*.
• A global minimum of f(x) is a point x* such that f(x*) ≤ f(x) for all x in the domain of f(x).
• As an example, picture a loss function with two local minima (at x = 1 and x = -1), only one of which is the global minimum.
• By contrast, a convex function such as f(x) = x² has only one (global) minimum, because its slope is always increasing.
Self/Further Reading $$ for PhD Students
How to Identify Local Minima
“The global minimum is the best possible solution, but it’s not always easy to find it.” -- Andrew Ng
There are a few ways to know if your model is stuck in a local minimum:
• Look at the loss function
• If the loss function is not decreasing after a certain number of iterations, it is likely that the model is stuck in a local minimum.
• Look at the model parameters
• If the model parameters are not changing after a certain number of iterations, it is likely that the model is stuck in a local minimum.
How To Avoid Local Minima
“Local minima are a common problem, but there are ways to avoid them.” — Michael Nielsen
If you think that your model is stuck in a local minimum, you can try
one of the following:
• Change the learning rate: A smaller learning rate may help the model
to escape from the local minimum.
• Use a different optimization algorithm: A different optimization
algorithm, such as SGD or momentum, may be more effective at
avoiding local minima.
• Add regularization: Regularization can help to prevent the model
from overfitting the training data and can make it less likely to get
stuck in a local minimum.
How To Avoid Local Minima-Details
• Random restarts
• Stochastic gradient descent (SGD) algorithm
• Momentum algorithm
• Nesterov momentum
• Simulated annealing
• Bayesian optimization
• Ensemble learning
• Regularization technique
How To Avoid Local Minima-Details
• Random restarts: This involves randomly re-initializing the optimization algorithm and starting over. This can help to escape local minima by starting the algorithm in a different location.
• Stochastic gradient descent (SGD): SGD works by randomly sampling a subset of the data points at each iteration, and then using the gradient of those data points to update the model parameters. This helps to prevent the algorithm from getting stuck in a local minimum, because it is constantly exploring different parts of the loss landscape.
How To Avoid Local Minima-Details
• Adding Momentum: Momentum works by adding a fraction of the previous update (the velocity) to the current gradient step. This helps to smooth out the path of the algorithm and makes it less likely to get stuck in a local minimum.
How To Avoid Local Minima-Details
Adding Momentum: With the momentum update, the parameter vector builds up velocity in any direction that has a consistent gradient.
• The simplest form of update changes the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function).
• Assuming a vector of parameters w and the gradient dw:
# Simplest (vanilla) update has the form:
w += - learning_rate * dw
where learning_rate is a hyperparameter, a fixed constant.
• The momentum update almost always enjoys better convergence rates on deep networks:
# Momentum update:
v = mu * v - learning_rate * dw  # integrate velocity
w += v                           # integrate position
Here we see the introduction of a variable v that is initialized at zero, and an additional hyperparameter (mu). As an unfortunate misnomer, this variable is referred to in optimization as momentum (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction.
How To Avoid Local Minima-Details
• Nesterov momentum: It is a slightly different version of the momentum update that has recently been gaining popularity.
• It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.
• The core idea is that when the current parameter vector is at some position w, then, looking at the momentum update above, the momentum term alone (i.e., ignoring the second term with the gradient) is about to nudge the parameter vector by mu * v.
• Therefore, if we are about to compute the gradient, we can treat the future approximate position w + mu * v as a "lookahead"; this is a point in the vicinity of where we are soon going to end up.
• Hence, it makes sense to compute the gradient at w + mu * v instead of at the "old/stale" position w.
How To Avoid Local Minima-Details
• Nesterov momentum: However, in practice people prefer to express the update to look as similar as possible to vanilla SGD or to the previous momentum update. This can be achieved by manipulating the update above with the variable transform w_ahead = w + mu * v, and then expressing the update in terms of w_ahead instead of w. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of w_ahead (but renaming it back to w) then become as shown below.
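A sketch of the standard form this takes, in the same fragment style as the momentum update above (dw is the gradient evaluated at the stored, "ahead" parameters):

# Nesterov momentum after the w_ahead -> w renaming:
v_prev = v                        # back this up
v = mu * v - learning_rate * dw   # velocity update stays the same
w += -mu * v_prev + (1 + mu) * v  # position update changes form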

For further reading:
• Advances in Optimizing Recurrent Networks by Yoshua Bengio, Section 3.5.
• Ilya Sutskever's thesis (pdf) contains a longer exposition of the topic in Section 7.2.
How To Avoid Local Minima-Details $$
• Simulated annealing: This technique uses controlled randomness to escape local minima.
• It is a probabilistic technique for approximating the global optimum of a given function.
• Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem.
• For problems with large numbers of local optima, simulated annealing can find the global optimum.
How To Avoid Local Minima-Details $$
• Bayesian optimization: This technique builds a probabilistic model of the loss function and uses it to choose where to evaluate next, which can help escape local minima.
How To Avoid Local Minima-Details $$
• Ensemble learning: This technique trains several models (e.g., from different initializations) and combines them, so that no single model getting stuck in a local minimum dominates the result.
How To Avoid Local Minima-Details
Animations that may help your intuitions about the learning process dynamics.
• Contours of a loss surface and the time evolution of different optimization algorithms.
• Notice the "overshooting" behavior of momentum-based methods, which makes the optimization look like a ball rolling down the hill.
How To Avoid Local Minima-Details
Animations that may help your intuitions about the learning process dynamics.
• A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down).
• Notice that SGD has a very hard time breaking symmetry and gets stuck at the top.
• Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSprop proceed. Images credit: Alec Radford.
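The "denominator term" mentioned above refers to the RMSprop update, which in the same fragment style as the earlier updates (NumPy assumed; decay_rate and eps are its standard hyperparameters) looks roughly like:

# RMSprop: a moving average of squared gradients divides the step,
# which raises the effective learning rate where gradients are tiny.
cache = decay_rate * cache + (1 - decay_rate) * dw ** 2
w += -learning_rate * dw / (np.sqrt(cache) + eps)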
How To Avoid Local Minima-Details $$
Regularization is used in machine and deep learning to prevent overfitting and improve the generalization
performance of a model.
• Regularization techniques: Regularization works by adding a penalty to
the loss function that is proportional to the size of the model parameters.
Cost function = Loss (say, binary cross entropy) + Regularization term
• This helps to prevent the model from overfitting the training data and can
make it less likely to get stuck in a local minimum.
• Different Regularization Techniques in Deep Learning
• L2 & L1 regularization
• Dropout
• Data Augmentation
• Early stopping
Why Regularization?
Regularization $$
• Regularization techniques: L2 & L1 regularization
Cost function = Loss (say, binary cross-entropy) + Regularization term
• However, the regularization term differs between L1 and L2.
• In L2, the penalty is the sum of the squared weights, commonly written as:
Cost function = Loss + (λ / 2m) * Σ ||w||²
• Here, lambda (λ) is the regularization parameter. It is the hyper-parameter whose value is optimized for better results.
• L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero).
Regularization $$
• Regularization techniques: L2 & L1 regularization
In L1, the penalty is the sum of the absolute values of the weights, commonly written as:
Cost function = Loss + (λ / 2m) * Σ ||w||
In this, we penalize the absolute value of the weights.
Unlike L2, the weights may be reduced exactly to zero here.
Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
Further Reading $$:
• A Gentle Introduction to Dropout for Regularizing Deep Neural Networks, by Jason Brownlee
• Dropout Regularization in Deep Learning Models with Keras, by Jason Brownlee
Regularization $$
• Regularization techniques: Dropout
• Dropout: a simple way to prevent neural networks from overfitting.
• It produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
• It can also be thought of as an ensemble technique in machine learning; a minimal sketch follows.
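As a minimal sketch of how dropout typically appears in a model definition (assumes TensorFlow/Keras is installed; the layer sizes and dropout rate are illustrative):

import tensorflow as tf

# Dropout randomly zeroes a fraction of activations during training only,
# discouraging co-adaptation between units and reducing overfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # drop 50% of activations each step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")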
Regularization $$
• Regularization techniques: Data Augmentation
Regularization
• Regularization techniques: Early Stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training set aside as a validation set.
When we see that the performance on the validation set is getting worse, we immediately stop training the model.
This is known as early stopping; a sketch of the loop follows.
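A minimal sketch of the early-stopping loop (train_one_epoch and validation_loss are hypothetical placeholders for training and for evaluating the held-out validation set; patience is a common extra knob):

# Stop as soon as validation loss fails to improve for `patience` epochs.
def fit_with_early_stopping(model, patience=3, max_epochs=100):
    best, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)        # hypothetical training step
        val = validation_loss(model)  # hypothetical validation check
        if val < best:
            best, since_best = val, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                 # validation got worse: stop early
    return model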
Conclusion
• Gradient descent is a powerful optimization algorithm used to minimize the cost function of a model by iteratively adjusting its parameters in the opposite direction of the gradient.
• While it has several variations and advantages, there are also some challenges associated with gradient descent that need to be addressed.
