
Deep Learning 2

By: ML@B Edu Team


Announcements
● Quiz 1 due tonight
● Quiz 2 due next Monday
○ Covers content from today’s (Deep Learning 2) and Wednesday’s (Deep Learning 3) lectures
● Homework 1 due next Monday
● Office Hours will be held every Thursday 3-4 PM, at Cory 531
What questions are there from the first deep learning lecture?

Outline
● Warning
● Deep Learning 1 Review
● Math Review
● Backpropagation
● Modern Deep Learning
○ Optimizers
○ Batch-Norm
○ Ensembling
○ Dropout
○ Skip Connections
● PyTorch
Warning
Warning for these slides…
● For the first part of lecture, we are going to discuss the backpropagation
algorithm
○ If you are lost, that’s totally fine, we just need you to take away a broad idea of what a
computational graph is, what backprop is doing and why we do it
● The last half of the lecture, however, will be CRITICAL to modern deep learning
systems
○ Even here, there will be a lot of math. If you don’t understand the math, that’s fine, just pay
attention to the bullet points labeled “takeaway” and the bolded text, these will get you the
intuition you need to use these tools
Deep Learning 1 Review
Recap of last time
1. Our neural network is just a function of the scalar
parameters in our weight matrices and biases
2. We can take the gradient (direction of steepest ascent) of
our loss function w.r.t. the parameters, since loss is a
scalar function with vector inputs
3. We update our parameters along the negative gradient
direction (by a step size called the learning rate) to
decrease the loss — this is called gradient descent
Math Review
Computational Graphs
● Say we have some function e(c, d), but c
and d are functions of other variables. We
have c(a, b) and d(b)
● We can write how these functions depend
on each other as a tree
● We call this a computational graph
because it tells us how to compute the final
value e from leaf nodes (inputs) a and b
● Each node in this tree is a function of the
incoming nodes
Computational Graphs and the Chain Rule
● If we want to calculate derivatives of an input
with respect to the output, we need to use the
multivariable chain rule
○ Sum over all unique paths from the input to the output
○ For each path, multiply all partial derivatives of each
output node with respect to the corresponding input
node
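
As a concrete illustration of this path-summing rule, here is a minimal sketch in Python. The specific functions (c = a·b, d = b², e = c + d) and the values of a and b are made up for this sketch; the graph structure matches the example above.

```python
def e_of(a, b):
    c = a * b        # c(a, b)
    d = b ** 2       # d(b)
    return c + d     # e(c, d)

a, b = 2.0, 3.0

# Multivariable chain rule: sum over the two paths from b to e.
#   path b -> c -> e contributes (de/dc) * (dc/db) = 1 * a
#   path b -> d -> e contributes (de/dd) * (dd/db) = 1 * 2b
de_db = 1.0 * a + 1.0 * (2 * b)

# Sanity check with a finite difference.
eps = 1e-6
numeric = (e_of(a, b + eps) - e_of(a, b - eps)) / (2 * eps)
print(de_db, numeric)   # both are approximately 8.0
```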
Backpropagation
Backpropagation
● Here is an example of a computational graph of a toy neural network’s MSE loss
on a single training example
○ This neural network has only one neuron per layer, making inputs and outputs scalars
● Our objective with gradient descent is to calculate the partial derivative of the
output with respect to w1, w2, w3, b1, b2 and b3… but we don’t want to do 6x
the computation… how can we do this?

Backpropagation
● We can write out the chain rule for all of these partial derivatives, and see if there is anywhere that we can optimize and save ourselves some compute
● Note: We will be writing out a lot of partial derivatives… each one is being evaluated under the current training example and the current parameters, there just wasn't enough space to write it
● In this example, x, y, b1, b2, b3, w1, w2, w3 are scalars
● Don't get hung up on what each term's value is, it's not important
● Side note: on the forward pass, we calculate and save things like the partial of z_3 with respect to a_3, so that we can use it here later

Scalar Computation Graph Example
● In this example, x, y, b1, b2, b3, w1, w2, w3 are scalars
● Writing out the chain rule for each parameter, we can see that we've calculated the same intermediate values multiple times
Backpropagation
● Rather than calculating these values again with repeated multiplication, let's just save and reuse them
○ This saves a lot of redundant calculations for deep neural networks
● We will simply work from the end of the network to the front, caching values that we need as we go along
● Note: All the partials here are being evaluated for the current data and parameters… there wasn't enough space to include this notation
1) First we calculate the update for W3, caching the red term
2) Then we use the red term to calculate the blue value before calculating the update for W2
3) This pattern of using the last computation to save redundant multiplications on the next update continues
Backpropagation
● Now imagine that W_i are weight matrices, B_i are bias vectors, X is an input
vector, and we replace any scalar multiplication with matrix multiplication
○ Critically, our loss L is still a scalar
● You don’t need to understand what the derivative of L with respect to Zi looks like,
but just imagine that it exists — we can implement the same caching idea from
the scalar case for a real neural network as well
○ Caching saves us from performing redundant matrix multiplications, which can be very expensive
Takeaways
● The ONLY thing you need to take away from backprop is that it is a fast method
of getting all of the partial derivatives needed for gradient descent, removing
redundant (matrix) multiplications
○ We do this by working from the end of the computational graph to the front, caching any
computation used in calculating the previous partial derivatives
○ By working from the end of the graph to the front, we can handle much more complex
computational graphs quickly and efficiently
● Modern auto-differentiation software like PyTorch will keep track of the graph
and calculate our gradients with backprop
○ It can handle arbitrarily large computational graphs (our toy example was very simple in comparison
to real deep learning systems out there)
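
A minimal sketch of this caching idea on a toy scalar network with one neuron per layer. The sigmoid activation, squared-error loss, and parameter values below are illustrative choices, not the slide's exact numbers.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy network: z_i = w_i * a_{i-1} + b_i, a_i = sigmoid(z_i), loss L = (a_3 - y)^2
w = [0.5, -0.3, 0.8]
b = [0.1, 0.2, -0.1]
x, y = 1.0, 0.0

# Forward pass: save the activations we will need on the backward pass.
a = [x]
z = []
for wi, bi in zip(w, b):
    z.append(wi * a[-1] + bi)
    a.append(sigmoid(z[-1]))
loss = (a[-1] - y) ** 2

# Backward pass: "delta" caches dL/dz_i and is reused for every earlier layer,
# instead of re-multiplying the whole chain of partials from scratch each time.
delta = 2 * (a[-1] - y) * a[-1] * (1 - a[-1])        # dL/dz_3
grads_w, grads_b = [0.0] * 3, [0.0] * 3
for i in reversed(range(3)):
    grads_w[i] = delta * a[i]                        # dL/dw_{i+1}
    grads_b[i] = delta                               # dL/db_{i+1}
    if i > 0:
        delta = delta * w[i] * a[i] * (1 - a[i])     # propagate the cache to dL/dz_i

print(grads_w, grads_b)
```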
Modern Deep Learning
Optimization
● Can we do better than vanilla gradient descent? Yes
● We’re going to talk about 3 optimizers: Momentum, RMSProp, and Adam
● Not understanding the math is ok, just make sure you understand the
intuition/takeaway bullet point from each method

Here is the vanilla GD update rule for reference


Momentum
● We'd hate for a small random local optimum or random flat points to halt our descent
○ Take inspiration from an actual ball rolling down a hill, which will have momentum
○ Takeaway: Update our parameters with a weighted average of past gradients
● Technically this update isn't the real gradient anymore, but it works well for avoiding local optima
● Update rule: scale down the previous weighted average, add in the current (scaled-down) gradient, then perform the gradient update step with this average:
○ g_t = β · g_{t−1} + (1 − β) · ∇θ L(θ_t)
○ θ_{t+1} = θ_t − η · g_t
● g is our weighted average of the previous gradients
● t is the time step / t-th gradient step
● beta is some hyperparameter in the range [0, 1] that we choose; it controls the strength of the weighted average
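
A minimal sketch of the momentum update on a made-up quadratic loss. The target values, beta, and learning rate below are illustrative, not prescribed by the slides.

```python
import numpy as np

target = np.array([3.0, -2.0])   # made-up quadratic loss L(theta) = ||theta - target||^2
theta = np.zeros(2)
g = np.zeros(2)                  # g: weighted average of past gradients
beta, lr = 0.9, 0.1              # beta in [0, 1] controls the strength of the averaging

for t in range(200):
    grad = 2 * (theta - target)        # gradient of the loss at the current parameters
    g = beta * g + (1 - beta) * grad   # scale down the old average, add in the scaled current gradient
    theta = theta - lr * g             # gradient step using the averaged gradient

print(theta)   # approaches [3, -2]
```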
Momentum Visualized
● No Momentum
● With Momentum (imagine dropping a marble at the start and how it would descend)
RMSProp
● Instead of keeping a weighted average of gradients, we are going to keep a weighted average of squared gradient components
● We are going to divide our true gradient update by the square root of this weighted average of squared gradients
○ Takeaway/intuition is on the next slide
● The first equation ends up looking very similar to momentum, since we are just calculating a weighted average (albeit of a different quantity): square our gradient, scale it, and add it to our old weighted average that has been scaled down
○ g²_t = β · g²_{t−1} + (1 − β) · (∇θ L(θ_t))²
○ Note: g² is the variable's name, a reference to the tracking of squared gradients instead of regular gradients
● We then divide our update by the square root of the weighted average:
○ θ_{t+1} = θ_t − η · ∇θ L(θ_t) / (√(g²_t) + ε)
○ epsilon (ε): some really small value to make sure we don't divide by zero
RMSProp
● Case 1: The gradients have been really small in the past
○ Our moving average of squared gradients will be even tinier
○ The square root of this moving average will be a really small number, and dividing by it should increase the size of the final gradient update
● Case 2: Our gradients have been really big in the past
○ Our moving average of squared gradients will be huge
○ The square root of this moving average will be a really large number, and dividing by it should decrease the size of the final update step
● Takeaway: this helps combat the issue that gradients can vary in size, causing us to either get stuck from small gradients or blow past our mark with large gradients. RMSProp makes sure our steps never get too big or too small!
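
A minimal sketch of the RMSProp update on the same kind of made-up quadratic loss; the hyperparameter values are illustrative.

```python
import numpy as np

target = np.array([3.0, -2.0])   # made-up quadratic loss L(theta) = ||theta - target||^2
theta = np.zeros(2)
sq_avg = np.zeros(2)             # weighted average of squared gradient components
beta, lr, eps = 0.9, 0.05, 1e-8

for t in range(500):
    grad = 2 * (theta - target)
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2       # square, scale, add to the old average
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)   # divide the step by the root of the average

print(theta)   # approaches [3, -2] (up to small oscillations on the order of lr)
```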
Adam
● Adam is the best optimizer… period… just
default to using Adam with default
hyperparameters
● We will simply combine momentum and
RMSProp
○ We will keep 2 moving averages: one for the
gradients and one for the squared gradients
● Takeaway: It is a combination of
momentum and RMSProp, getting the
benefits of both, and it works out of the
box (default parameters) better than
almost anything else
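
In practice you rarely implement Adam by hand: PyTorch ships it as torch.optim.Adam. A minimal sketch of the usual training-step pattern with default hyperparameters follows; the tiny linear model and random data are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                            # any model works here
optimizer = torch.optim.Adam(model.parameters())   # defaults: lr=1e-3, betas=(0.9, 0.999)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()   # clear old gradients
loss.backward()         # backprop fills .grad for every parameter
optimizer.step()        # Adam update: momentum + RMSProp-style scaling
```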
Concept Check Questions
1) What is the role of the optimizer in deep learning? Does it change what the best set of weights for a model is?
2) List the primary benefits of momentum, RMSProp, and Adam.
Concept Check Answers
1) The optimizer determines the best solution that we are capable of finding, but has no impact on the best weights themselves.
2) Momentum prevents us from getting stuck in local minima, RMSProp prevents our gradient steps from being too small or too large, and Adam has the benefits of both.
Normalization
● Normalized data tends to train faster, due to the fact that our loss surfaces look a lot more "normal"
○ When our data is normalized, the first layer weights are all on the same order of magnitude, so our gradient steps are also roughly on the same order of magnitude in all directions
● We have no control over whether our activations stay normalized through the network… how can we make sure that it's possible for our activations to be normalized just like our input data was?
● No normalization: W2 is always much more important than W1
● With normalization: both features can play an equal role
Batch Normalization
● For a batch of neuron activations after a layer, normalize each neuron's output independently over the batch
● We then rescale these activations by some learned parameters, allowing the network to learn how to re-weight its own features if it finds that helpful
● Takeaway: We provide more explicit structure for the network to learn normalized activations, since normalized features have nicer gradients
● We tend to just throw batchnorm in either before or after any non-final layer activations, it works really well!
Notation:
● a: a batch of activations in our network
● a': the normalized activations, a' = (a − mean) / √(variance + ε), where the mean and variance are computed over the batch for each neuron
● gamma: a vector of learned parameters
● beta: a vector of learned parameters
● epsilon (ε): small value to prevent us from dividing by zero
● Output: gamma * a' + beta, where the * denotes element-wise multiplication
BatchNorm Notes
● Note: at test time, when we are working with single examples, instead of the batch mean
and std. dev. calculations, we use a weighted average of our most recent
means and std. deviations from during training
● Reminder: gamma and beta are learned parameters, meaning we add them to the
list of things we update in gradient descent
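
A rough sketch of what batchnorm computes on a batch of activations, side by side with PyTorch's built-in nn.BatchNorm1d; the shapes and values are illustrative.

```python
import torch
import torch.nn as nn

a = torch.randn(32, 16)                      # batch of 32 examples, 16 features

# Manual version of the normalization described above.
mean = a.mean(dim=0)                         # per-feature mean over the batch
var = a.var(dim=0, unbiased=False)           # per-feature variance over the batch
eps = 1e-5
a_hat = (a - mean) / torch.sqrt(var + eps)   # normalized activations
gamma = torch.ones(16)                       # learned rescale (initialized to 1)
beta = torch.zeros(16)                       # learned shift (initialized to 0)
out = gamma * a_hat + beta                   # element-wise rescale and shift

# Built-in layer: training mode uses batch statistics and updates running stats,
# eval mode uses the running mean/variance instead of the current batch's.
bn = nn.BatchNorm1d(16)
bn.train()
out_train = bn(a)
bn.eval()
out_eval = bn(a)
```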
BatchNorm Concept Check
● Does batchnorm make our model more "expressive" (i.e. does it allow our model
to learn or approximate new functions that we wouldn't have been able to before)?
● Answer: NO
○ We could in theory adjust the weights and biases of a linear layer to get the same output without
using batchnorm, but the objective of batchnorm isn’t to get different outputs (batchnorm doesn’t
make our model more expressive)
○ The point is to get nicer gradients for our weights and biases, which this does by allowing the
network to learn normalized activations (or choose to weight certain activations more if it is
advantageous to do so)
LayerNorm
● Exactly like batchnorm, except instead of normalizing over the batch's statistics,
we normalize over each individual training example's features independently
● We can still use parameters to rescale and re-bias the data in the same way we did
for batchnorm
● There are many ways to normalize your activations, the goal is to provide nice
gradients
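
A small sketch contrasting the normalization axes (the tensor shape is illustrative): batchnorm takes statistics per feature over the batch dimension, layernorm takes them per example over the feature dimension.

```python
import torch

a = torch.randn(32, 16)   # (batch, features)

# BatchNorm-style: statistics per feature, computed over the batch dimension.
bn_out = (a - a.mean(dim=0)) / (a.std(dim=0, unbiased=False) + 1e-5)

# LayerNorm-style: statistics per example, computed over the feature dimension.
ln_out = (a - a.mean(dim=1, keepdim=True)) / (a.std(dim=1, unbiased=False, keepdim=True) + 1e-5)
```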
Normalization Concept Check
● We know that with batch gradient descent, normally our gradients are simply the
mean of the gradients of our training examples. Is this true of layernorm and
batchnorm?
● Answer: NO
○ Layernorm: each training example is processed independently, so batch gradient descent is the
same as averaging the individual gradients
○ Batchnorm: the value of the activation after batchnorm depends on batch statistics and if you break
up the batch into smaller sub-batches or look at the training examples in isolation, the values in the
forward pass will change
■ When using batchnorm, gradients will not be the same if you accumulate them via taking the
mean of gradients of individual samples
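
A quick sketch that checks this claim empirically; the tiny model and random data are made up purely for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 1))
x, y = torch.randn(16, 8), torch.randn(16, 1)

def grad_of_first_weight(xb, yb):
    # Gradient of the mean-squared-error loss w.r.t. the first layer's weights.
    model.zero_grad()
    nn.functional.mse_loss(model(xb), yb).backward()
    return model[0].weight.grad.clone()

full = grad_of_first_weight(x, y)
halves = 0.5 * (grad_of_first_weight(x[:8], y[:8]) + grad_of_first_weight(x[8:], y[8:]))

# Without batchnorm these would match exactly; with batchnorm the forward pass
# depends on which examples share a batch, so this is almost certainly False.
print(torch.allclose(full, halves))
```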
Ensembling
● Our model is doing the best it can… is there any way to squeeze more
performance out?
● Make many different models (parameters initialized randomly each time) and see
what the entire group of models thinks
○ Wisdom of the group
○ Can average their predictions for regression tasks or take majority vote for classification tasks
○ Can use many different model configurations and ensemble them
● This can often get you better performance, but obviously requires a lot more
compute, as you have to train many models instead of just one; it also takes that
much more time to evaluate at test time
● To avoid each model learning the same thing: omit a random portion of the
training dataset for each model’s training
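
A minimal sketch of prediction averaging for a regression ensemble; the architecture is illustrative and the per-model training loop is omitted.

```python
import torch
import torch.nn as nn

# Several independently (randomly) initialized copies of the same architecture.
models = [nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1)) for _ in range(5)]
# ... train each model here, e.g. each on a random subset of the training data ...

x = torch.randn(4, 8)
with torch.no_grad():
    preds = torch.stack([m(x) for m in models])   # (num_models, batch, 1)
    ensemble_pred = preds.mean(dim=0)             # regression: average the predictions
```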
Dropout
● Overfitting is a problem
● Randomly zero out a new set of neurons during each batch
○ This means the network will have to learn to use different combinations of neurons
○ It will have to learn the same function in a number of ways, since each neuron can’t rely on any
previous neuron not being zeroed out
○ This relates to the concept of ensembling
■ Takeaway: since we have to learn the same thing multiple ways, we get the “wisdom of
the group” effect that ensembling was giving us
● At test time, we don’t zero anything
○ We just need to make sure to scale down all our activations by the right amount, otherwise the sum
of values coming into each neuron will be much larger in magnitude than during training
● Takeaway: dropout has a regularizing effect by making it harder to rely on (or
overfit to) specific features
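
A small sketch of dropout behavior in training vs. evaluation mode using PyTorch's nn.Dropout. Note that PyTorch uses "inverted" dropout: it scales the surviving activations up by 1/(1−p) during training rather than scaling down at test time, which achieves the same effect described above.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
a = torch.ones(1, 8)

drop.train()
print(drop(a))   # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2
drop.eval()
print(drop(a))   # identity at test time: nothing is zeroed
```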
Skip Connections
● We simply add our activations from a previous layer to
the activations of the current layer
○ Normally, the output of each block (before the final activation) is F(x), where F is one or more linear layers (with activation functions in between)
○ Instead, now we will output F(x) + x before we activate
● You can choose to skip arbitrary layers before adding an
activation back again, just make sure to add before you
go through an activation function
○ These sections of the network with a skip connection are
generally called “blocks”
Skip Connections
● Say, halfway through a normal network, the activations
are informative enough to classify the inputs well, but
our chosen network still has more layers after that
(potentially adding more noise unless we carefully
choose our weights)
● It happens to be trivial to have our block spit out exactly
what it took in (just set all the weights to zero), called
the identity function
○ If our weights are zero, i.e. F(x) = 0, then the output of the entire
block just becomes 0 + x = x

Takeaway: this, in theory, allows us to build arbitrarily deep networks since the
blocks can now easily learn the identity function or very small updates
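
A minimal sketch of a residual block in PyTorch; the layer sizes and ReLU choice are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.fc2(self.act(self.fc1(x)))   # F(x)
        return self.act(out + x)                # add the skip connection, then activate

block = ResidualBlock(16)
x = torch.randn(4, 16)
print(block(x).shape)   # torch.Size([4, 16])
```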
Recap
● All of these are just additions to your deep learning toolkit that are frequently used in modern deep
learning, here are the big picture takeaways for each
● Adam: better optimization out of the box than anything else
○ Takes momentum’s and RMSProp’s benefits and combines them
● Batchnorm: You can add it after any non-final layer for better learning
○ Gives better behaved gradients by allowing the network to have normalized activations
● Ensembling: If you have the compute, this can give you more performance on your dataset
○ “Wisdom of the group”
● Dropout: Budget ensembling
○ Increases regularization by forcing each layer to learn the same thing in different ways using
different features from the previous layer, “wisdom of the group” still applies here
● Skip connections: MORE LAYERS
○ Enables you to go deeper since it is trivial to learn the identity layer, and learning small updates to
the previous activation is less complex
PyTorch
Backprop is hard…
● Backprop is hard to implement, but needed to make deep learning feasible
○ This is why we need something to do it automatically for us
● PyTorch will automatically make arbitrary computational graphs for us and can
perform the backpropagation for it automatically, so we don’t have to muck about
doing any actual math
○ Praise be
PyTorch: How to Approach
● If you understand Numpy, PyTorch will be a breeze
○ To a user, PyTorch looks and behaves like numpy
○ Instead of np arrays, torch has things called “tensors” that act the same way
■ Except they generate a computational graph in the background as you go along
● At a higher level, PyTorch also has some built-in functions and classes for
things like activations, layers, etc.
● What PyTorch can’t do:
○ Symbolic differentiation
● What PyTorch can do:
○ Take the partial derivatives of one value (a loss perhaps…) with respect to some parameter
evaluated at the parameter’s current value
Intros / Demos
● Torch really just looks like Numpy at the lowest level
○ You can see it looks pretty much exactly the same, except
instead of having arrays, we have things called tensors
● We can do all the normal operations that we would do
on an array, and things behave exactly the same way
○ We can add, subtract, square
○ We can reshape, etc
● We can see the size of the tensor by inspecting the
.shape attribute
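
A few lines showing the NumPy-like feel; the values are arbitrary.

```python
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.shape)          # torch.Size([2, 2])
print(t + 1)            # element-wise add
print((t ** 2).sum())   # square and reduce
print(t.reshape(4))     # reshape, just like NumPy
```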
Intros / Demos
● Define a new tensor (in this case, think of it as a parameter) with current value = 5. We make sure it is a float, which is required in order to differentiate
● By setting requires_grad = True, we mandate that torch save everything it could need to include this variable in the computational graph
● Define just some regular function (we can write down its symbolic derivative by hand, but PyTorch doesn't know the symbolic derivative)
● Evaluate the function on our tensor
● Calling .backward() on a scalar tensor will tell PyTorch to use the computational graph to calculate the partial derivatives of this value with respect to ALL parameters requiring a gradient that are in our computational graph, evaluated at the parameters' current values
● We can inspect the .grad attribute of the tensor to see the partial derivative of our output with respect to our input tensor evaluated at the current value, which we initialized to 5; this numeric value is what PyTorch knows and calculates
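
A sketch reconstructing the demo described above. The function f below is a made-up example, since the slide's exact function isn't reproduced here.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)   # float tensor, tracked in the graph

def f(t):
    return t ** 2 + 3 * t                   # "just some regular function" (illustrative)

y = f(x)          # forward pass builds the computational graph
y.backward()      # backprop fills .grad for every tensor with requires_grad=True
print(x.grad)     # df/dx evaluated at x = 5, i.e. 2*5 + 3 = 13
```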
Lecture Attendance

http://tinyurl.com/fa24-dl4cv
Contributors
● Slides by Jake Austin and Harshika Jalan
● Edited by Aryan Jain
