Gradient Descent

Dr. Muhammad Nadeem Ashraf


Outline
• Background/ What is a Cost Function?
• What is Gradient Descent?
• How Does Gradient Descent Work?
• Types of Gradient Descent
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• Plotting the Gradient Descent Algorithm
• Alpha – The Learning Rate
• Local Minima
• Code Implementation of Gradient Descent in Python
• Challenges of Gradient Descent
• End Notes
• Frequently Asked Questions
History
• Gradient descent was originally proposed by Cauchy in 1847.
• It is also known as steepest descent.
• Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function.
Background
• Imagine you're lost in a dense forest with no map or compass. What do you do?
• You follow the path of steepest descent, taking steps in the downhill direction that decreases your elevation and brings you closer to your destination.
 Similarly, Gradient Descent is the go-to algorithm for navigating the
complex landscape of machine learning.
 It helps models find the optimal set of parameters by iteratively
adjusting them in the opposite direction of the gradient.
Positive and Negative Slopes

What is Gradient?
1. It is the rate of change of one variable with respect to another.
2. A gradient measures how much the output of a function changes if you change the inputs a little bit.
3. In mathematical terms, it is known as the slope of a function.
4. A gradient is nothing but a derivative that describes the effect on the function's output of a small variation in its inputs.
5. In machine learning, a gradient is the derivative of a function that has more than one input variable.
6. The gradient simply measures the change in error with respect to a change in each of the weights.
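As a quick illustration of points 2 and 4, a gradient can be approximated numerically by nudging each input a little and watching the output. A minimal Python sketch (the function f, the point, and the step size h are illustrative assumptions, not from the slides):

# Finite-difference approximation of a gradient.
def numerical_gradient(f, x, h=1e-5):
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h                       # change one input a little bit
        grad.append((f(x_plus) - f(x)) / h)  # rate of change in that direction
    return grad

# Example: f(w0, w1) = w0**2 + 3*w1 has gradient (2*w0, 3)
f = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_gradient(f, [1.0, 2.0]))  # approximately [2.0, 3.0]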
Gradient and Partial Derivatives
What is a Cost Function?
• It is a function that measures the performance of a model for any given data.
• The cost function quantifies the error between predicted values and expected values and presents it in the form of a single real number.
• After making a hypothesis with initial parameters, we calculate the cost function; with the goal of reducing it, we modify the parameters by using the gradient descent algorithm over the given data. Here's the mathematical representation of it:
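For a hypothesis h_θ evaluated on m training examples (x^(i), y^(i)), a commonly used concrete form is the mean squared error:

J(θ) = (1 / 2m) * Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )²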
Cost Function Optimization
• The cost function represents the discrepancy between the predicted and the actual output of the model.
• The goal of gradient descent is to find the set of parameters that minimizes this discrepancy and improves the model's performance.
• It is an optimization algorithm used in machine/deep learning to minimize the cost function by iteratively adjusting parameters in the direction of the negative gradient (i.e., moving opposite to the gradient of the function at the current point), aiming to find the optimal set of parameters.
Gradient Descent
• The algorithm's objective is to identify model parameters, such as weights and biases, that reduce the model's error on the training data.
• In linear regression, it finds the weights and biases; in deep learning, backward propagation uses the same method.
• It adjusts the parameters iteratively to minimize a given function down to a local minimum.
How does Gradient Descent work?
• The algorithm operates by calculating the gradient of the cost function,
which indicates the direction and magnitude of the steepest ascent.
• However, since the objective is to minimize the cost function, gradient
descent moves in the opposite direction of the gradient, known as the
negative gradient direction.
• By iteratively updating the model’s parameters in the negative gradient
direction, gradient descent gradually converges towards the optimal
set of parameters that yields the lowest cost.
• The learning rate (a hyper-parameter) determines the step size taken
in each iteration, influencing the speed and stability of convergence.
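Putting these points together, the per-parameter update applied at each iteration can be written as (with α the learning rate and J the cost function):

θ_j := θ_j − α * ∂J(θ) / ∂θ_j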
Applications
• Gradient descent can be applied to various machine/deep learning
algorithms including:
• Linear regression
• Logistic regression
• Neural networks, and
• Support vector machines
• It provides a general framework for optimizing models by iteratively
refining their parameters based on the cost function.
When is Gradient Descent used in ML/DL?
• The learning happens during backpropagation while training a neural network-based model.
• When we say that we are training the model, it's gradient descent behind the scenes that trains it.
Example
• Let’s say you are playing a game
where the players are at the top of
a mountain, and they are asked to
reach the lowest point of the
mountain. Additionally, they are
blindfolded. So, what approach do
you think would make you reach
the lake?
• Take a moment to think about this before heading forward.
• The best way is to observe the ground and find where the land descends. From that position, take a step in the descending direction and iterate this process until we reach the lowest point.
Example
• Gradient descent is an iterative optimization
algorithm for finding the local minimum of a
function.
• To find the local minimum of a function using
gradient descent, we must take steps proportional to
the negative of the gradient (move away from the
gradient) of the function at the current point.
• If we take steps proportional to the positive of the
gradient (moving towards the gradient), we will
approach a local maximum of the function, and the
procedure is called Gradient Ascent.
Gradient Descent Working
The goal of the gradient descent algorithm is to minimize the given function (say, the cost function).
To achieve this goal, it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at the current point.
2. Take a step (move) in the direction opposite to the gradient, i.e., opposite to the direction in which the slope increases, moving from the current point by alpha times the gradient at that point.
Alpha is called the learning rate, a tuning parameter in the optimization process. It decides the length of the steps.
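These two steps translate directly into code. A minimal Python sketch (the example function, its gradient, the starting point, and the hyperparameter values are all illustrative assumptions, not from the slides):

# Two-step loop: compute the gradient, then move opposite to it
# by alpha times the gradient. All concrete values are illustrative.
def gradient_descent(grad, start, alpha=0.1, n_iters=100, tol=1e-6):
    x = start
    for _ in range(n_iters):
        step = alpha * grad(x)   # alpha times the gradient at this point
        if abs(step) < tol:      # stop once the steps become tiny
            break
        x -= step                # step opposite to the gradient
    return x

# Example: minimize f(x) = (x - 3)**2, whose gradient is 2*(x - 3)
print(gradient_descent(grad=lambda x: 2 * (x - 3), start=0.0))  # approx. 3.0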
Consideration for Learning Rate
Plotting the Gradient Descent
• 2-D: When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and theta on the x-axis. [2-D plot]
• 3-D: If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes. [3-D plot]
Gradient Descent: Computing Derivatives
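For the squared-error cost above with a linear hypothesis h_θ(x) = θ₀ + θ₁·x, the partial derivatives take the standard form:

∂J/∂θ₀ = (1/m) * Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )
∂J/∂θ₁ = (1/m) * Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) * x^(i)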
Plotting the Gradient Descent
It can also be visualized by
using a contour.
• A contour is a line drawn on
a topographic map to
indicate ground elevation or
depression.
• A contour interval is the
vertical distance or
difference in elevation
between contour lines.
Plotting the Gradient Descent
• It can also be visualized by using a contour plot.
• This shows a 3-D plot in two dimensions, with the parameters along both axes and the response as a contour.
• The value of the response increases away from the center and is the same everywhere along a given ring.
• The response is directly proportional to the distance of a point from the center (along a given direction).
Challenges of Gradient Descent
While gradient descent is a powerful optimization algorithm, it can also present some challenges that can affect its
performance. Some of these challenges include:
1. Local Optima: Gradient descent can converge to local optima instead of the global optimum, especially if the cost
function has multiple peaks and valleys.
2. Learning Rate Selection: The choice of learning rate can significantly impact the performance of gradient
descent. If the learning rate is too high, the algorithm may overshoot the minimum, and if it is too low, the
algorithm may take too long to converge.
3. Overfitting: Gradient descent can overfit the training data if the model is too complex or the learning rate is too
high. This can lead to poor generalization performance on new data.
4. Convergence Rate: The convergence rate of gradient descent can be slow for large datasets or high-dimensional
spaces, which can make the algorithm computationally expensive.
5. Saddle Points: In high-dimensional spaces, the gradient of the cost function can have saddle points, which can
cause gradient descent to get stuck in a plateau instead of converging to a minimum.
• To overcome these challenges, several variations of gradient descent have been developed, such as adaptive
learning rate methods, momentum-based methods, and second-order methods. Additionally, choosing the right
regularization method, model architecture, and hyper-parameters can also help improve the performance of
gradient descent.
Challenges of Gradient Descent
• Efficient implementation of gradient descent is essential for achieving
good performance in machine learning tasks.
• The choice of the learning rate and the number of iterations can
significantly impact the performance of the algorithm.
• There are different variations of gradient descent, including:
1. Batch Gradient Descent
2. Stochastic Gradient Descent, and
3. Mini-batch Gradient Descent
Batch gradient descent is suitable for small datasets, while stochastic gradient descent is more suitable for large datasets. Mini-batch gradient descent is a good compromise between the two and is often used in practice.
• Each has its own benefits and drawbacks, so the choice depends on the problem at hand and the size of the dataset.
Gradient Descent Types
1. Batch Gradient Descent
• In Batch Gradient Descent, all the training data is taken into consideration to take a single step.
• It calculates the average gradient of the cost function over all the training examples and updates the parameters in the opposite direction using that mean gradient.
• So, just one step of gradient descent is taken downhill in every epoch.
• Batch gradient descent guarantees convergence to the global minimum for convex cost functions (and to a local minimum otherwise), but it can be computationally expensive and slow for large datasets.
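A minimal vectorized sketch of one such run for linear regression (NumPy; the synthetic data and hyperparameters are illustrative assumptions):

import numpy as np

# Batch GD: each step averages the gradient over ALL m examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # m = 100 examples, 2 features
y = X @ np.array([2.0, -1.0]) + 0.5    # synthetic targets

w, b, alpha = np.zeros(2), 0.0, 0.1
for epoch in range(200):
    err = (X @ w + b) - y              # predictions minus targets
    w -= alpha * (X.T @ err) / len(y)  # mean gradient over the whole batch
    b -= alpha * err.mean()            # one downhill step per epoch
print(w, b)  # approaches [2.0, -1.0] and 0.5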
Gradient Descent Types
1. Batch Gradient Descent
• Batch Gradient Descent is great for
convex or relatively smooth error
manifolds. In this case, we move
somewhat directly towards an
optimum solution.
• The graph of cost vs epochs is also
quite smooth because we are
averaging over all the gradients of
training data for a single step. The
cost keeps on decreasing over the
epochs.
Gradient Descent Types
1. Batch Gradient Descent Issue:
In Batch Gradient Descent, we consider all the examples for every step of gradient descent.
But what if our dataset is very huge?
 Deep learning models require more and more data.
 The more the data, the better the chances of the model being good.
Example:
• Suppose our dataset has 5 million examples; then, just to take one step, the model will have to calculate the gradients of all 5 million examples.
• This does not seem an efficient way.
• To tackle this problem, we have Stochastic Gradient Descent.
Gradient Descent Types
2. Stochastic Gradient Descent
• In Stochastic Gradient Descent (SGD), we consider just one example at a time
to take a single step. We do the following steps in one epoch for SGD:
1. Take an example randomly
2. Feed it to the Neural Network
3. Calculate the gradient of the cost function for that example
4. Use the gradient we calculated in Step 3 to update the parameters (weights and Bias)
5. Repeat steps 1–4 for all the examples in the training dataset
• Stochastic gradient descent is computationally efficient and can converge
faster than batch gradient descent.
• However, it can be noisy and may not converge to the global minimum.
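A minimal sketch of these five steps for the same kind of linear model (the data and hyperparameters are illustrative assumptions):

import numpy as np

# SGD: the parameters are updated after EVERY single example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b, alpha = np.zeros(2), 0.0, 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):  # step 1: pick an example at random
        err = (X[i] @ w + b) - y[i]    # steps 2-3: gradient for one example
        w -= alpha * err * X[i]        # step 4: update weights and bias
        b -= alpha * err
print(w, b)  # noisy, but close to [2.0, -1.0] and 0.5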
Local Minima
• The cost function may have many minimum points.
• Gradient descent may settle in any one of these minima, depending on the initial point (i.e., the initial parameters, theta) and the learning rate.
• Therefore, the optimization may converge to different points for different starting points and learning rates.
Gradient Descent Types
2. Stochastic Gradient Descent Issues:
• Since we are considering just one example at a time, the cost will fluctuate over the training examples, and it will not necessarily decrease.
• But in the long run, you will see the cost decreasing with fluctuations. [Figure: Cost vs Epochs in SGD [1]]
• Because the cost fluctuates so much, it will never reach the minimum, but it will keep dancing around it.
Gradient Descent Types
3. Mini-Batch Gradient Descent
• Mini-batch gradient descent updates the model’s parameters using the
gradient of a small subset of the training set, known as a mini-batch.
• It calculates the average gradient of the cost function for the mini-
batch and updates the parameters in the opposite direction.
• Mini-batch gradient descent combines the advantages of both batch
and stochastic gradient descent and is the most commonly used
method in practice.
• It is computationally efficient and less noisy than stochastic gradient
descent, while still being able to converge to a good solution.
Gradient Descent Types
3. Mini-Batch Gradient Descent Implementation
• Neither the whole dataset is used at once, nor a single example at a time.
• We use a batch of a fixed number of training examples, smaller than the actual dataset, and call it a mini-batch.
• Doing this helps us achieve the advantages of both of the former variants.
• After creating the mini-batches of fixed size, we do the following steps in one epoch (see the sketch after this list):
1. Pick a mini-batch
2. Feed it to the Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in Step 3 to update the parameters (weights and bias)
5. Repeat steps 1–4 for the mini-batches we created
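A minimal vectorized sketch of these steps (again with illustrative data, batch size, and hyperparameters):

import numpy as np

# Mini-batch GD: each update averages the gradient over a small batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b, alpha, batch = np.zeros(2), 0.0, 0.05, 16
for epoch in range(50):
    order = rng.permutation(len(y))
    for s in range(0, len(y), batch):
        idx = order[s:s + batch]                  # step 1: pick a mini-batch
        err = (X[idx] @ w + b) - y[idx]           # steps 2-3: mean gradient
        w -= alpha * (X[idx].T @ err) / len(idx)  # step 4: vectorized update
        b -= alpha * err.mean()
print(w, b)  # approaches [2.0, -1.0] and 0.5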
Gradient Descent Types
3. Mini-Batch Gradient Descent Issues:
• Just like SGD, the average cost over the epochs in mini-batch gradient descent fluctuates, because we are averaging over a small number of examples at a time.
• So, when we are using mini-batch gradient descent, we update our parameters frequently and can also use a vectorized implementation for faster computation.
Self Reading $$

Local Minima-Details
• Not trivial, but there are techniques that can help to avoid local minima in
gradient descent.
• In machine learning, local minima and global minima are two important
concepts related to the optimization of loss functions.
• A loss function is a function that measures the error between a model’s
predictions and the ground truth. The goal of machine learning is to find a
model that minimizes the loss function.
• A local minimum is a point in the parameter space where the loss function
is minimized in a local neighborhood.
• A global minimum is a point in the parameter space where the loss
function is minimized globally.
Self Reading $$
How To Deal With Local Minima-
Details
• It is not usually possible to find the global minimum of a loss function analytically.
• Instead, machine learning algorithms use iterative optimization methods to find a local
minimum.
• One common iterative optimization method is gradient descent.
• Gradient descent starts at a random point in the parameter space and then iteratively
updates the parameters in the direction of the negative gradient of the loss function.
• The negative gradient points in the direction of the steepest descent, so gradient
descent will eventually converge to a local minimum.
• However, there is no guarantee that the local minimum found by gradient descent is the
global minimum.
• In fact, it is possible that gradient descent will get stuck in a local minimum that is not
the global minimum.
Self Reading $$

Mathematical Example
• Let f(x) be a loss function.
• A local minimum of f(x) is a point x* such that f(x*) ≤ f(x) for all x in a neighborhood of x*.
• A global minimum of f(x) is a point x* such that f(x*) ≤ f(x) for all x in the domain of f(x).
• As an example, picture a loss function with two local minima (at x = 1 and x = -1), only one of which is the global minimum.
• By contrast, a convex function such as f(x) = x² has only one (global) minimum, because its slope is always increasing.
Self/Further Reading $$ for PhD Students
How to Identify Local Minima
“The global minimum is the best possible solution, but it’s not always easy to find it.” -- Andrew Ng
There are a few ways to know if your model is stuck in a local minimum:
• Look at the loss function
• If the loss function is not decreasing after a certain number of iterations, it is likely that the model is stuck in a local minimum.
• Look at the model parameters
• If the model parameters are not changing after a certain number of iterations, it is likely that the model is stuck in a local minimum.
How To Avoid Local Minima
“Local minima are a common problem, but there are ways to avoid them.” — Michael Nielsen
If you think that your model is stuck in a local minimum, you can try
one of the following:
• Change the learning rate: A smaller learning rate may help the model
to escape from the local minimum.
• Use a different optimization algorithm: A different optimization
algorithm, such as SGD or momentum, may be more effective at
avoiding local minima.
• Add regularization: Regularization can help to prevent the model
from overfitting the training data and can make it less likely to get
stuck in a local minimum.
How To Avoid Local Minima-Details
• Random restarts
• Stochastic gradient descent (SGD) algorithm
• Momentum algorithm
• Nesterov momentum
• Simulated annealing
• Bayesian optimization
• Ensemble learning
• Regularization technique
How To Avoid Local Minima-Details
• Random restarts: This involves randomly re-initializing the optimization algorithm and starting over. This can help to escape local minima by starting the algorithm in a different location.
• Stochastic gradient descent (SGD): SGD works by randomly sampling a subset of the data points at each iteration, and then using the gradient of those data points to update the model parameters. This helps to prevent the algorithm from getting stuck in a local minimum, because it is constantly exploring different parts of the loss landscape.
How To Avoid Local Minima-Details
• Adding Momentum: Momentum works by adding a fraction of the previous update (the velocity) to the current gradient step. This helps to smooth out the path of the algorithm and makes it less likely to get stuck in a local minimum.
How To Avoid Local Minima-Details
Adding Momentum: With the momentum update, the parameter vector builds up velocity in any direction that has a consistent gradient.
• The simplest form of update changes the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function).
• Assuming a vector of parameters w and the gradient dw:
# Simplest (vanilla) update has the form:
w += - learning_rate * dw
where learning_rate is a hyperparameter, a fixed constant.
• The momentum update almost always enjoys better convergence rates on deep networks:
# Momentum update:
v = mu * v - learning_rate * dw  # integrate velocity
w += v                           # integrate position
Here we see the introduction of a variable v that is initialized at zero, and an additional hyperparameter (mu). As an unfortunate misnomer, this variable is referred to in optimization as momentum (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction.
How To Avoid Local Minima-Details
• Nesterov momentum: It is a slightly different version of the momentum update that has recently been gaining popularity.
• It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.
• The core idea is that when the current parameter vector is at some position w, then, looking at the momentum update above, the momentum term alone (i.e., ignoring the second term with the gradient) is about to nudge the parameter vector by mu * v.
• Therefore, if we are about to compute the gradient, we can treat the future approximate position w + mu * v as a "lookahead"; this is a point in the vicinity of where we are soon going to end up.
• Hence, it makes sense to compute the gradient at w + mu * v instead of at the "old/stale" position w.
How To Avoid Local Minima-Details
• Nesterov momentum: However, in practice people prefer to express the update to look as similar as possible to vanilla SGD or to the previous momentum update. This can be achieved by manipulating the update above with the variable transform w_ahead = w + mu * v, and then expressing the update in terms of w_ahead instead of w. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of w_ahead (but renaming it back to w) then become as shown below.
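A sketch of the standard form this takes, in the same fragment style as the momentum update above (dw is the gradient evaluated at the stored, "ahead" parameters):

# Nesterov momentum after the w_ahead -> w renaming:
v_prev = v                        # back this up
v = mu * v - learning_rate * dw   # velocity update stays the same
w += -mu * v_prev + (1 + mu) * v  # position update changes form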

For further reading:
• Advances in Optimizing Recurrent Networks by Yoshua Bengio, Section 3.5.
• Ilya Sutskever's thesis (pdf) contains a longer exposition of the topic in Section 7.2.
How To Avoid Local Minima-Details $$
• Simulated annealing: This technique uses controlled randomness to escape local minima.
• It is a probabilistic technique for approximating the global optimum of a given function.
• Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem.
• For problems with large numbers of local optima, simulated annealing can find the global optimum.
How To Avoid Local Minima-Details $$
• Bayesian optimization: This technique builds a probabilistic model of the loss function and uses it to choose where to evaluate next, which can help escape local minima.
How To Avoid Local Minima-Details $$
• Ensemble learning: This technique trains several models (e.g., from different initializations) and combines them, so that no single model getting stuck in a local minimum dominates the result.
How To Avoid Local Minima-Details
Animations that may help your intuitions about the learning process dynamics.
• Contours of a loss surface and the time evolution of different optimization algorithms.
• Notice the "overshooting" behavior of momentum-based methods, which makes the optimization look like a ball rolling down the hill.
How To Avoid Local Minima-Details
Animations that may help your intuitions about the learning process dynamics.
• A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down).
• Notice that SGD has a very hard time breaking symmetry and gets stuck at the top.
• Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSprop proceed. Images credit: Alec Radford.
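The "denominator term" mentioned above refers to the RMSprop update, which in the same fragment style as the earlier updates (NumPy assumed; decay_rate and eps are its standard hyperparameters) looks roughly like:

# RMSprop: a moving average of squared gradients divides the step,
# which raises the effective learning rate where gradients are tiny.
cache = decay_rate * cache + (1 - decay_rate) * dw ** 2
w += -learning_rate * dw / (np.sqrt(cache) + eps)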
How To Avoid Local Minima-Details $$
Regularization is used in machine and deep learning to prevent overfitting and improve the generalization
performance of a model.
• Regularization techniques: Regularization works by adding a penalty to
the loss function that is proportional to the size of the model parameters.
Cost function = Loss (say, binary cross entropy) + Regularization term
• This helps to prevent the model from overfitting the training data and can
make it less likely to get stuck in a local minimum.
• Different Regularization Techniques in Deep Learning
• L2 & L1 regularization
• Dropout
• Data Augmentation
• Early stopping
Why Regularization?
Regularization $$
• Regularization techniques: L2 & L1 regularization
Cost function = Loss (say, binary cross-entropy) + Regularization term
• However, the regularization term differs between L1 and L2.
• In L2, the penalty is the sum of the squared weights, commonly written as:
Cost function = Loss + (λ / 2m) * Σ ||w||²
• Here, lambda (λ) is the regularization parameter. It is the hyper-parameter whose value is optimized for better results.
• L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero).
Regularization $$
• Regularization techniques: L2 & L1 regularization
In L1, the penalty is the sum of the absolute values of the weights, commonly written as:
Cost function = Loss + (λ / 2m) * Σ ||w||
In this, we penalize the absolute value of the weights.
Unlike L2, the weights may be reduced exactly to zero here.
Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
Further Reading $$:
• A Gentle Introduction to Dropout for Regularizing Deep Neural Networks, by Jason Brownlee
• Dropout Regularization in Deep Learning Models with Keras, by Jason Brownlee
Regularization $$
• Regularization techniques: Dropout
• Dropout: a simple way to prevent neural networks from overfitting.
• It produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
• It can also be thought of as an ensemble technique in machine learning; a minimal sketch follows.
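As a minimal sketch of how dropout typically appears in a model definition (assumes TensorFlow/Keras is installed; the layer sizes and dropout rate are illustrative):

import tensorflow as tf

# Dropout randomly zeroes a fraction of activations during training only,
# discouraging co-adaptation between units and reducing overfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # drop 50% of activations each step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")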
Regularization $$
• Regularization techniques: Data Augmentation
Regularization
• Regularization techniques: Early Stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training set aside as a validation set.
When we see that the performance on the validation set is getting worse, we immediately stop training the model.
This is known as early stopping; a sketch of the loop follows.
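A minimal sketch of the early-stopping loop (train_one_epoch and validation_loss are hypothetical placeholders for training and for evaluating the held-out validation set; patience is a common extra knob):

# Stop as soon as validation loss fails to improve for `patience` epochs.
def fit_with_early_stopping(model, patience=3, max_epochs=100):
    best, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)        # hypothetical training step
        val = validation_loss(model)  # hypothetical validation check
        if val < best:
            best, since_best = val, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                 # validation got worse: stop early
    return model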
Conclusion
• Gradient descent is a powerful optimization algorithm used to minimize the cost function of a model by iteratively adjusting its parameters in the opposite direction of the gradient.
• While it has several variations and advantages, there are also some challenges associated with gradient descent that need to be addressed.
