Lecture 8.5

INFO 557 FA24 002 - Neural Networks
Instructor: Dr. Liang Zhang
TAs: Jiacheng Zhang, Ruoyao Wang
College of Information Science
University of Arizona
Quiz at Tophat (Join: 436056)
Use the web app or mobile app to answer:

Which of the following optimization algorithms takes velocity into account when taking a gradient descent step?

A. SGD
B. AdaGrad
C. RMSProp
D. Adam

With your group, come to consensus on the correct answer, and discuss what is right or wrong about each of the answers.


Quiz at Tophat (Join: 436056)
Use the web app or mobile app to answer:

Which of the following optimization algorithms takes velocity into account when taking a gradient descent step?

A. SGD
B. AdaGrad
C. RMSProp
D. Adam

With your group, come to consensus on the correct answer, and discuss what is right or wrong about each of the answers.


AdaGrad
For each training step:
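A sketch of the standard AdaGrad step (notation assumed: $g$ is the minibatch gradient, $r$ the running sum of squared gradients, $\epsilon$ the global learning rate, $\delta$ a small constant for numerical stability; the slide's exact formulation, e.g. where $\delta$ sits, may differ slightly):

$$r \leftarrow r + g \odot g, \qquad \Delta\theta = -\frac{\epsilon}{\delta + \sqrt{r}} \odot g, \qquad \theta \leftarrow \theta + \Delta\theta$$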

Properties:
+ Not very sensitive to initial learning rate
+ Dimensions with large partial derivatives get a large decrease in effective learning rate
+ Dimensions with small partial derivatives get a small decrease in effective learning rate
- Aggressive, monotonically decreasing learning rate
RMSProp
For each training step:
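A sketch of the standard RMSProp step, assuming the same notation plus a decay rate $\rho$ that controls how quickly old squared gradients are forgotten:

$$r \leftarrow \rho\, r + (1 - \rho)\, g \odot g, \qquad \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g, \qquad \theta \leftarrow \theta + \Delta\theta$$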

Properties:
+ Discards history from the extreme past
+ Less aggressive than AdaGrad
- Has 1 extra hyperparameter, 𝜌
Adam
For each training step:
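A sketch of the standard Adam step at time $t$, assuming first and second moment estimates $s$ and $r$ with decay rates $\rho_1$ and $\rho_2$, and bias-corrected estimates $\hat{s}$, $\hat{r}$:

$$s \leftarrow \rho_1 s + (1 - \rho_1)\, g, \qquad r \leftarrow \rho_2 r + (1 - \rho_2)\, g \odot g$$
$$\hat{s} = \frac{s}{1 - \rho_1^{\,t}}, \qquad \hat{r} = \frac{r}{1 - \rho_2^{\,t}}, \qquad \Delta\theta = -\epsilon\, \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}, \qquad \theta \leftarrow \theta + \Delta\theta$$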

Properties:
+ Incorporates momentum
+ Bias-corrects its moment estimates, so it is less biased than RMSProp early in training
- Has 2 extra hyperparameters, 𝜌1 and 𝜌2
Group Activity
What would the paths of regular SGD, AdaGrad, RMSProp, and Adam look like on
the surface below? Redder is higher, bluer is lower, circle is the start, and star is
the minimum.
Group Activity
● SGD: Noisy, zigzagging, potential overshooting, slower convergence.
● AdaGrad: Large early steps, very cautious later, slower near the end.
● RMSProp: Smooth path, adjusts well to the surface, relatively stable.
● Adam: Fast, adaptive, smooth, and efficient convergence.
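To make these behaviors concrete, here is a minimal NumPy sketch that traces each update rule on a toy elongated bowl f(x, y) = x^2 + 10 y^2. The bowl, starting point, step counts, and learning rates are assumptions standing in for the slide's surface, which is not reproduced here.

import numpy as np

# Gradient of a toy anisotropic bowl f(x, y) = x^2 + 10*y^2, used as a
# stand-in for the slide's surface (assumption; the real surface is not given).
def grad(p):
    x, y = p
    return np.array([2.0 * x, 20.0 * y])

def trace(step_fn, start=(-4.0, 2.0), steps=100):
    """Follow an optimizer's update rule from `start`, returning visited points."""
    p = np.array(start, dtype=float)
    state = {}
    path = [p.copy()]
    for t in range(1, steps + 1):
        p = p + step_fn(grad(p), state, t)
        path.append(p.copy())
    return np.array(path)

# Plain gradient descent (no minibatch noise, so the zigzag the slide
# describes for SGD comes only from the elongated bowl here).
def sgd(g, state, t, lr=0.08):
    return -lr * g

def adagrad(g, state, t, lr=0.5, eps=1e-8):
    state["r"] = state.get("r", 0.0) + g * g          # accumulate squared gradients
    return -lr * g / (np.sqrt(state["r"]) + eps)

def rmsprop(g, state, t, lr=0.1, rho=0.9, eps=1e-8):
    state["r"] = rho * state.get("r", 0.0) + (1 - rho) * g * g   # decaying average
    return -lr * g / (np.sqrt(state["r"]) + eps)

def adam(g, state, t, lr=0.3, rho1=0.9, rho2=0.999, eps=1e-8):
    state["s"] = rho1 * state.get("s", 0.0) + (1 - rho1) * g      # first moment
    state["r"] = rho2 * state.get("r", 0.0) + (1 - rho2) * g * g  # second moment
    s_hat = state["s"] / (1 - rho1 ** t)                          # bias correction
    r_hat = state["r"] / (1 - rho2 ** t)
    return -lr * s_hat / (np.sqrt(r_hat) + eps)

for name, step_fn in [("SGD", sgd), ("AdaGrad", adagrad),
                      ("RMSProp", rmsprop), ("Adam", adam)]:
    print(f"{name:8s} ends at {trace(step_fn)[-1].round(3)}")

Plotting the returned paths over contour lines of the bowl reproduces the qualitative picture above: plain gradient descent oscillates across the narrow direction, while the adaptive methods take more even steps per dimension.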
Selecting a learning algorithm
Good luck! Try a few and see what happens.
Tom Schaul, Ioannis Antonoglou, and David Silver. Unit tests for stochastic
optimization. ICLR 2014.
● Tries algorithms on simple cost function shapes
● Adaptive learning rates robust, but no clear winner
Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin
Recht. The marginal value of adaptive gradient methods in machine learning.
NeurIPS 2017.
● Tries algorithms on image and language tasks
● AdaGrad, RMSProp, and Adam all have worse generalization error than SGD
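In practice, "try a few and see what happens" often amounts to swapping the optimizer object while keeping the rest of the training loop fixed. A minimal PyTorch sketch along those lines; the data, model, step count, and learning rates below are placeholder assumptions, not tuned settings.

import torch
import torch.nn as nn

# Placeholder regression data; substitute your own task (assumption).
X, y = torch.randn(256, 20), torch.randn(256, 1)
loss_fn = nn.MSELoss()

def make_model():
    torch.manual_seed(0)                      # same initialization for every run
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

# Candidate optimizers; the learning rates are illustrative, not tuned.
candidates = {
    "SGD":     lambda params: torch.optim.SGD(params, lr=0.01),
    "AdaGrad": lambda params: torch.optim.Adagrad(params, lr=0.01),
    "RMSProp": lambda params: torch.optim.RMSprop(params, lr=0.001),
    "Adam":    lambda params: torch.optim.Adam(params, lr=0.001),
}

for name, make_opt in candidates.items():
    model = make_model()
    opt = make_opt(model.parameters())
    for step in range(200):                   # full-batch steps for brevity
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    # Training loss only; per Wilson et al. (2017), also compare held-out error.
    print(f"{name:8s} final training loss: {loss.item():.4f}")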
