DL Class2
•Choosing a proper learning rate can be difficult. A learning rate that is too
small leads to painfully slow convergence, while a learning rate that is too
large can hinder convergence, causing the loss function to fluctuate
around the minimum or even to diverge.
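The effect of the learning rate can be seen on a toy problem. The sketch below is illustrative and not from the slides: it minimizes f(w) = w² with plain gradient descent, and the specific learning rates and step counts are arbitrary choices to show slow convergence, good convergence, and divergence.

```python
# Toy 1-D example: minimize f(w) = w**2 with plain gradient descent.
# The learning rates and step counts are illustrative, not from the slides.

def gradient_descent(lr, steps, w0=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w        # f'(w) = 2w
        w = w - lr * grad
    return w

small = gradient_descent(lr=0.001, steps=100)  # barely moves toward 0
good = gradient_descent(lr=0.1, steps=100)     # converges quickly
large = gradient_descent(lr=1.5, steps=100)    # |w| doubles each step: divergence
```

With lr=1.5 the update is w ← −2w, so the iterates alternate in sign while growing in magnitude, which is exactly the divergent fluctuation described above.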
[Figure: Stochastic Gradient Descent. True gradients shown in blue, minibatch gradients in red; axes W_1 and W_2, starting from the original W.]
We are trying to optimize a cost function whose contours look like this
(red dot marks the minimum).
Gradient descent will take a lot of steps, slowly oscillating towards the
minimum.
On the vertical axis, we want the learning to be slower (we don’t want
the oscillations), but along the horizontal axis, we want faster learning,
i.e. we want to aggressively move from left to right toward that minimum.
Momentum
SGD has trouble navigating ravines, i.e. areas where the surface curves
much more steeply in one dimension than in another, which are common
around local optima. In these scenarios, SGD oscillates across the slopes
of the ravine while only making hesitant progress along the bottom towards
the local optimum.
The gradient descent with momentum algorithm borrows an idea from physics.
It does this by adding a fraction γ of the update vector of the past time step
to the current update vector:

v_t = γ·v_{t−1} + η·∇_θ J(θ)
θ = θ − v_t
Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on
the way (until it reaches its terminal velocity if there is air resistance,
i.e. γ<1).
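A minimal sketch of gradient descent with momentum on an elongated quadratic, the kind of ravine-shaped surface described above. The function f(w1, w2) = 0.5·(w1² + 10·w2²) and the hyperparameter values are illustrative choices, not from the slides.

```python
# Momentum update: v_t = gamma*v_{t-1} + eta*grad, theta = theta - v_t,
# on an elongated quadratic f(w1, w2) = 0.5*(w1**2 + 10*w2**2).
# eta and gamma are illustrative values.

def momentum_gd(eta=0.05, gamma=0.9, steps=200, w=(5.0, 5.0)):
    w1, w2 = w
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = w1, 10 * w2        # gradient of f
        v1 = gamma * v1 + eta * g1  # accumulate a fraction of past updates
        v2 = gamma * v2 + eta * g2
        w1 -= v1                    # move by the accumulated velocity
        w2 -= v2
    return w1, w2
```

The velocity terms v1, v2 play the role of the rolling ball: repeated gradients in the same direction build up speed, while oscillating gradients partially cancel.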
Momentum based gradient descent
Momentum (magenta) vs. Gradient Descent (cyan) on a surface with a global minimum
(the left well) and local minimum (the right well)
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Momentum based gradient descent
SGD vs. Momentum: notice momentum overshooting the target, but overall
getting to the minimum much faster than vanilla SGD.
Nesterov accelerated gradient descent
• However, a ball that rolls down a hill, blindly following the slope, is
highly unsatisfactory.
• We'd like to have a smarter ball, a ball that has a notion of where it
is going so that it knows to slow down before the hill slopes up
again.
• We know that we will use our momentum term γvt−1 to move the
parameters θ.
Instead of evaluating the gradient at the current position (red circle), we know
that our momentum is about to carry us to the tip of the green arrow. With Nesterov
momentum, we therefore evaluate the gradient at this “look-ahead” position:

v_t = γ·v_{t−1} + η·∇_θ J(θ − γ·v_{t−1})
θ = θ − v_t
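The look-ahead idea can be sketched in a few lines. As above, f(w) = 0.5·w² and the hyperparameters are illustrative; the only change from plain momentum is where the gradient is evaluated.

```python
# Nesterov accelerated gradient on f(w) = 0.5 * w**2: the gradient is
# evaluated at the look-ahead point w - gamma*v, not at the current w.
# eta, gamma, and the step count are illustrative values.

def nesterov_gd(eta=0.1, gamma=0.9, steps=100, w=5.0):
    v = 0.0
    for _ in range(steps):
        lookahead = w - gamma * v  # where momentum is about to carry us
        grad = lookahead           # f'(w) = w, evaluated at the look-ahead point
        v = gamma * v + eta * grad
        w = w - v
    return w
```

Because the gradient is taken where the ball is headed rather than where it currently is, the update can slow down before it overshoots, which is the “smarter ball” behaviour described above.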
AdaGrad learning algorithms
• The idea behind Adagrad is to use a different learning rate for each
parameter, adapted at every iteration.
• The reason different learning rates are needed is that the learning
rate for the parameters of sparse features needs to be higher compared
to that of dense-feature parameters, because sparse features occur
less frequently.
AdaGrad learning algorithms
In the Adagrad optimizer equation, the learning rate has been modified in such a
way that it will automatically decrease, because the summation of the previous
squared gradients always keeps increasing after every time step:

θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}

where G_t accumulates the squares of past gradients for parameter i and ε
prevents division by zero.
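A minimal single-parameter AdaGrad sketch showing this shrinking effective step size. The gradient sequence and the values of η and ε are illustrative defaults, not taken from the slides.

```python
import math

# Single-parameter AdaGrad sketch; eta and eps are illustrative defaults.

def adagrad(grads, eta=0.5, eps=1e-8, w=1.0):
    """Apply AdaGrad updates for a sequence of pre-computed gradients."""
    G = 0.0                                  # running sum of squared gradients
    step_sizes = []
    for g in grads:
        G += g * g                           # G grows monotonically
        step = eta / math.sqrt(G + eps) * g  # effective rate: eta / sqrt(G + eps)
        w -= step
        step_sizes.append(abs(step))
    return w, step_sizes

# With a constant gradient, each step is smaller than the last:
# eta/sqrt(1), eta/sqrt(2), eta/sqrt(3), ...
_, sizes = adagrad([1.0, 1.0, 1.0, 1.0])
```

Since G only ever grows, the effective learning rate η/√(G + ε) is strictly decreasing, which is exactly the automatic decay described above.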
AdaGrad learning algorithms
Therefore, we can increase our learning rate, and our algorithm can take
larger steps in the horizontal direction, converging faster.
The following slide shows how the gradients are calculated for RMSprop and
for gradient descent with momentum.
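The contrast between the two update rules can be sketched side by side. This is a generic illustration, not the slide's own figure: momentum smooths the gradient itself with an exponential average, while RMSprop keeps an exponential moving average of the *squared* gradient and uses it to rescale the step. All hyperparameter values are illustrative.

```python
import math

# Momentum vs. RMSprop single-parameter update rules.
# eta, gamma, beta, and eps are illustrative hyperparameters.

def momentum_step(w, v, g, eta=0.01, gamma=0.9):
    v = gamma * v + eta * g             # smooth the gradient direction
    return w - v, v

def rmsprop_step(w, s, g, eta=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g * g   # moving average of squared gradients
    return w - eta * g / math.sqrt(s + eps), s
```

Dividing by √s shrinks steps along directions with large, oscillating gradients (the vertical axis of the ravine) and enlarges steps along directions with small, consistent gradients (the horizontal axis), which is why a larger base learning rate becomes safe.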