Lecture 8.5
Neural Networks
Instructor: Dr. Liang Zhang
TAs: Jiacheng Zhang, Ruoyao Wang
College of Information Science
University of Arizona
Quiz at Tophat (Join: 436056)
Use the web app or mobile app to answer:
A SGD
B AdaGrad
C RMSProp
D Adam
With your group, come to consensus on the correct answer, and discuss
AdaGrad
Properties:
+ Not very sensitive to initial learning rate
+ Parameters with large partial derivatives get a large learning rate decrease
+ Parameters with small partial derivatives get a small learning rate decrease
- Aggressive, monotonically decreasing learning rate
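For reference, a sketch of the standard AdaGrad per-step update behind these properties, in assumed notation: g is the minibatch gradient, ε the global learning rate, δ a small stabilizing constant, and r the running sum of squared gradients.
\[
r \leftarrow r + g \odot g, \qquad
\theta \leftarrow \theta - \frac{\epsilon}{\delta + \sqrt{r}} \odot g
\]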
RMSProp
For each training step:
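A sketch of the standard RMSProp update, in assumed notation: g is the minibatch gradient, ε the learning rate, ρ the decay rate, δ a small stabilizing constant, and r the decaying average of squared gradients.
\[
r \leftarrow \rho\, r + (1 - \rho)\, g \odot g, \qquad
\theta \leftarrow \theta - \frac{\epsilon}{\sqrt{\delta + r}} \odot g
\]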
Properties:
+ Discards history from the extreme past
+ Less aggressive than AdaGrad
- Has 1 extra hyperparameter, 𝜌
Adam
For each training step:
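A sketch of the standard Adam update, in assumed notation: g is the minibatch gradient, ε the learning rate, ρ1 and ρ2 the moment decay rates, s and r the first- and second-moment estimates, t the step count, and δ a small stabilizing constant.
\[
\begin{aligned}
s &\leftarrow \rho_1 s + (1 - \rho_1)\, g, & r &\leftarrow \rho_2 r + (1 - \rho_2)\, g \odot g,\\
\hat{s} &= \frac{s}{1 - \rho_1^{\,t}}, & \hat{r} &= \frac{r}{1 - \rho_2^{\,t}},\\
\theta &\leftarrow \theta - \epsilon\, \frac{\hat{s}}{\sqrt{\hat{r}} + \delta} & &
\end{aligned}
\]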
Properties:
+ Incorporates momentum
+ Bias-corrected moment estimates (less biased than RMSProp early in training)
- Has 2 extra hyperparameters, 𝜌1 and 𝜌2
Group Activity
What would the paths of regular SGD, AdaGrad, RMSProp, and Adam look like on
the surface below? Redder is higher, bluer is lower, circle is the start, and star is
the minimum.
Group Activity
● SGD: Noisy, zigzagging, potential overshooting, slower convergence.
● AdaGrad: Large early steps, very cautious later, slower near the end.
● RMSProp: Smooth path, adjusts well to the surface, relatively stable.
● Adam: Fast, adaptive, smooth, and efficient convergence.
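These qualitative differences can be reproduced with a small simulation. The sketch below is illustrative only and makes assumptions not on the slide: an elongated quadratic bowl f(x, y) = x^2 + 10y^2 stands in for the plotted surface, Gaussian noise mimics minibatch gradients, and all four methods share one hand-picked learning rate.

import numpy as np

def grad(theta):
    # Gradient of the stand-in surface f(x, y) = x**2 + 10*y**2 (minimum at the origin).
    x, y = theta
    return np.array([2.0 * x, 20.0 * y])

def run(method, theta0, steps=200, eps=0.1, rho=0.9, rho1=0.9, rho2=0.999, delta=1e-8):
    # Follow one optimizer from theta0 and return its trajectory.
    rng = np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    r = np.zeros_like(theta)   # accumulated / decayed squared gradients
    s = np.zeros_like(theta)   # first-moment estimate (Adam only)
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = grad(theta) + rng.normal(scale=0.5, size=2)  # noise mimics minibatch gradients
        if method == "SGD":
            theta -= eps * g
        elif method == "AdaGrad":
            r += g * g
            theta -= eps * g / (delta + np.sqrt(r))
        elif method == "RMSProp":
            r = rho * r + (1 - rho) * g * g
            theta -= eps * g / np.sqrt(delta + r)
        elif method == "Adam":
            s = rho1 * s + (1 - rho1) * g
            r = rho2 * r + (1 - rho2) * g * g
            s_hat = s / (1 - rho1 ** t)   # bias-corrected first moment
            r_hat = r / (1 - rho2 ** t)   # bias-corrected second moment
            theta -= eps * s_hat / (np.sqrt(r_hat) + delta)
        path.append(theta.copy())
    return np.array(path)

for m in ["SGD", "AdaGrad", "RMSProp", "Adam"]:
    path = run(m, theta0=(-4.0, 2.0))
    print(f"{m:>8}: final point {path[-1].round(2)}, distance to minimum {np.linalg.norm(path[-1]):.2f}")

Plotting each returned path over the contours of f reproduces the qualitative picture in the bullets above.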
Selecting a learning algorithm
Good luck! Try a few and see what happens.
Tom Schaul, Ioannis Antonoglou, and David Silver. Unit tests for stochastic
optimization. ICLR 2014.
● Tries algorithms on simple cost function shapes
● Adaptive learning rates robust, but no clear winner
Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin
Recht. The marginal value of adaptive gradient methods in machine learning.
NeurIPS 2017.
● Tries algorithms on image and language tasks
● AdaGrad, RMSProp, and Adam all have worse generalization error than SGD