[Fall 2024] Deep Learning 2
(Figure: gradient descent trajectories with no momentum vs. with momentum; imagine dropping a marble at the start and how it would descend)
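A minimal sketch of the momentum update the figure is illustrating, in plain Python/NumPy (the function name and hyperparameter values are our own illustrative choices, not from the slides):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=1e-2, beta=0.9):
    """One SGD-with-momentum update: track a weighted average of past gradients."""
    velocity = beta * velocity + (1 - beta) * grad   # the marble keeps some of its speed
    w = w - lr * velocity                            # step along the accumulated direction
    return w, velocity
```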
RMSProp
● Instead of keeping a weighted average of gradients, we are going to keep a
weighted average of squared gradient components
● We are going to divide our true gradient update by the square root of this
weighted average of squared gradients
○ Takeaway/intuition is on the next slide
● The first equation ends up looking very similar to momentum, since we are just calculating a weighted average (albeit of a different quantity)
(Equation annotations: square our gradient update, scale it, and add it to our old weighted average that has been scaled down; the weighted average of squared gradient components is computed the same way as momentum. Note that g^2 is the variable's name, a reference to the tracking of squared gradients instead of regular gradients.)
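A minimal sketch of one RMSProp step as described above, in plain Python/NumPy (the function name, hyperparameter values, and the small stability constant `eps` are our own illustrative choices):

```python
import numpy as np

def rmsprop_step(w, grad, g2, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp update: track a weighted average of squared gradient components."""
    g2 = beta * g2 + (1 - beta) * grad**2       # same form as momentum, but of grad**2
    w = w - lr * grad / (np.sqrt(g2) + eps)     # divide the true update by sqrt(g2)
    return w, g2
```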
Concept Check
2) …the best weights themselves.
Answers
Momentum prevents us from getting stuck in local minima, RMSProp prevents our gradients from being too small or too large, and Adam has the benefits of both.
Normalization
● Normalized data tends to train faster, because our loss surfaces look a lot more “normal”
○ When our data is normalized, the first layer weights are all on the same order of magnitude, so our gradient steps are also roughly on the same order of magnitude in all directions
● We have no control over whether our activations stay normalized through the network… how can we make sure that it’s possible for our activations to be normalized?
(Figure: with no normalization, W2 is always much more important than W1; with normalization, both features can play an equal role)
Takeaway (skip connections): this, in theory, allows us to build arbitrarily deep networks, since the blocks can now easily learn the identity function or very small updates
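The standard way to keep activations normalized inside the network is batchnorm, as the recap below notes. A minimal PyTorch sketch, with layer sizes that are our own illustrative choices:

```python
import torch
import torch.nn as nn

# Batchnorm after a non-final linear layer keeps that layer's activations
# roughly zero-mean / unit-variance across the batch during training.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),   # normalize the 32 hidden activations
    nn.ReLU(),
    nn.Linear(32, 1),     # final layer: no batchnorm here
)

x = torch.randn(8, 16)    # a batch of 8 examples
out = model(x)
```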
Recap
● All of these are just additions to your deep learning toolkit that are frequently used in modern deep learning; here are the big-picture takeaways for each (a minimal sketch combining several of them follows this list)
● Adam: better optimization out of the box than anything else
○ Takes momentum’s and RMSProp’s benefits and combines them
● Batchnorm: You can add it after any non-final layer for better learning
○ Gives better behaved gradients by allowing the network to have normalized activations
● Ensembling: If you have the compute, this can give you more performance on your dataset
○ “Wisdom of the group”
● Dropout: Budget ensembling
○ Increases regularization by forcing each layer to learn the same thing in different ways using
different features from the previous layer, “wisdom of the group” still applies here
● Skip connections: MORE LAYERS
○ Enables you to go deeper since it is trivial to learn the identity layer, and learning small updates to
the previous activation is less complex
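A minimal sketch pulling several of these pieces together in PyTorch; the architecture, sizes, and hyperparameters are our own illustrative choices, not from the slides: a small residual block with batchnorm and dropout, trained for one step with Adam.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> batchnorm -> ReLU -> dropout, wrapped in a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.bn = nn.BatchNorm1d(dim)
        self.drop = nn.Dropout(p=0.5)

    def forward(self, x):
        out = self.drop(torch.relu(self.bn(self.fc(x))))
        return x + out                      # skip connection: a small update to the input

model = nn.Sequential(ResidualBlock(32), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # momentum + RMSProp benefits

x, y = torch.randn(8, 32), torch.randn(8, 1)                # dummy data, one training step
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```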
PyTorch
Backprop is hard…
● Backprop is hard to implement, but needed to make deep learning feasible
○ This is why we need something to do it automatically for us
● PyTorch will automatically build arbitrary computational graphs for us and can perform backpropagation through them automatically, so we don’t have to muck about doing any actual math
○ Praise be
PyTorch: How to Approach
● If you understand Numpy, PyTorch will be a breeze
○ To a user, PyTorch looks and behaves like numpy
○ Instead of np arrays, torch has things called “tensors” that act the same way
■ Except they generate a computational graph in the background as you go along
● At a higher level, PyTorch also has some built-in functions and classes for things like activations, layers, etc.
● What PyTorch can’t do:
○ Symbolic differentiation
● What PyTorch can do:
○ Take the partial derivatives of one value (a loss perhaps…) with respect to some parameter
evaluated at the parameter’s current value
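A minimal sketch of that idea, assuming a made-up scalar parameter `w` and a tiny squared-error “loss” (the names and values are ours, purely illustrative): PyTorch does not hand us a symbolic derivative, but it does give us the numeric partial derivative evaluated at `w`’s current value.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a parameter with current value 2.0
loss = (w * 3.0 - 1.0) ** 2                 # a tiny "loss" built from w

loss.backward()                             # backprop through the recorded graph
print(w.grad)                               # d(loss)/dw at w=2.0 -> 2*(3*2-1)*3 = 30.0
```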
Intros / Demos
● Torch really just looks like Numpy at the lowest level
○ You can see it looks pretty much exactly the same, except
instead of having arrays, we have things called tensors
● We can do all the normal operations that we would do
on an array, and things behave exactly the same way
○ We can add, subtract, square
○ We can reshape, etc
● We can see the size of the tensor by inspecting the
.shape attribute
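A small sketch of the NumPy-like usage the demo shows (the specific values are our own):

```python
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # like an np.array, but a torch tensor

print(t + 1)         # elementwise add, just like NumPy
print(t ** 2)        # elementwise square
print(t.reshape(4))  # reshape works the same way
print(t.shape)       # torch.Size([2, 2])
```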
Intros / Demos
Define a new tensor (in this case, think of it as a
parameter) with current value = 5. We make sure it
is a float, which is required in order to differentiate
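That step of the demo presumably looks something like the following sketch (the variable name is our own guess):

```python
import torch

# A float tensor with current value 5.0; requires_grad=True tells torch to
# track operations on it so we can differentiate with respect to it later.
x = torch.tensor(5.0, requires_grad=True)
```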
http://tinyurl.com/fa24-dl4cv
Contributors
● Slides by Jake Austin and Harshika Jalan
● Edited by Aryan Jain