
Deep Learning 2

By: ML@B Edu Team


Announcements
● Quiz 1 due tonight
● Quiz 2 due next Monday
○ Covers content from today’s (Deep Learning 2) and Wednesday’s (Deep Learning 3) lectures
● Homework 1 due next Monday
● Office Hours will be held every Thursday 3-4 PM, at Cory 531
What questions are there from the first deep learning lecture?

Outline
● Warning
● Deep Learning 1 Review
● Math Review
● Backpropagation
● Modern Deep Learning
○ Optimizers
○ Batch-Norm
○ Ensembling
○ Dropout
○ Skip Connections
● PyTorch
Warning
Warning for these slides…
● For the first part of lecture, we are going to discuss the backpropagation
algorithm
○ If you are lost, that’s totally fine, we just need you to take away a broad idea of what a
computational graph is, what backprop is doing and why we do it
● The last half of the lecture, however, will be CRITICAL to modern deep learning
systems
○ Even here, there will be a lot of math. If you don’t understand the math, that’s fine, just pay
attention to the bullet points labeled “takeaway” and the bolded text, these will get you the
intuition you need to use these tools
Deep Learning 1 Review
Recap of last time
1. Our neural network is just a function of the scalar
parameters in our weight matrices and biases
2. We can take the gradient (direction of steepest ascent) of
our loss function w.r.t. the parameters, since loss is a
scalar function with vector inputs
3. We update our parameters along the negative gradient
direction (by a step size called the learning rate) to
decrease the loss — this is called gradient descent
Math Review
Computational Graphs
● Say we have some function e(c, d), but c
and d are functions of other variables. We
have c(a, b) and d(b)
● We can write how these functions depend
on each other as a tree
● We call this a computational graph
because it tells us how to compute the final
value e from leaf nodes (inputs) a and b
● Each node in this tree is a function of the
incoming nodes
Computational Graphs and the Chain Rule
● If we want to calculate derivatives of an input
with respect to the output, we need to use the
multivariable chain rule
○ Sum over all unique paths from the input to the output
○ For each path, multiply all partial derivatives of each
output node with respect to the corresponding input
node
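
As a concrete illustration of this path-summing rule, here is a minimal sketch in Python. The specific functions (c = a·b, d = b², e = c + d) and the values of a and b are made up for this sketch; the graph structure matches the example above.

```python
def e_of(a, b):
    c = a * b        # c(a, b)
    d = b ** 2       # d(b)
    return c + d     # e(c, d)

a, b = 2.0, 3.0

# Multivariable chain rule: sum over the two paths from b to e.
#   path b -> c -> e contributes (de/dc) * (dc/db) = 1 * a
#   path b -> d -> e contributes (de/dd) * (dd/db) = 1 * 2b
de_db = 1.0 * a + 1.0 * (2 * b)

# Sanity check with a finite difference.
eps = 1e-6
numeric = (e_of(a, b + eps) - e_of(a, b - eps)) / (2 * eps)
print(de_db, numeric)   # both are approximately 8.0
```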
Backpropagation
Backpropagation
● Here is an example of a computational graph of a toy neural network’s MSE loss
on a single training example
○ This neural network has only one neuron per layer, making inputs and outputs scalars
● Our objective with gradient descent is to calculate the partial derivative of the
output with respect to w1, w2, w3, b1, b2 and b3… but we don’t want to do 6x
the computation… how can we do this?

Backpropagation
● We can write out the chain rule for all of these partial derivatives, and see if there is anywhere that we can optimize and save ourselves some compute
● Note: We will be writing out a lot of partial derivatives… each one is being evaluated under the current training example and the current parameters, there just wasn't enough space to write it
● In this example, x, y, b1, b2, b3, w1, w2, w3 are scalars
● Don't get hung up on what each term's value is, it's not important
● Side note: on the forward pass, we calculate and save things like the partial of z_3 with respect to a_3, so that we can use it here later

Scalar Computation Graph Example
● In this example, x, y, b1, b2, b3, w1, w2, w3 are scalars
● Writing out the chain rule for each parameter, we can see that we've calculated the same intermediate values multiple times
Backpropagation
● Rather than calculating these values again with repeated multiplication, let's just save and reuse them
○ This saves a lot of redundant calculations for deep neural networks
● We will simply work from the end of the network to the front, caching values that we need as we go along
● Note: All the partials here are being evaluated for the current data and parameters… there wasn't enough space to include this notation
1) First we calculate the update for W3, caching the red term
2) Then we use the red term to calculate the blue value before calculating the update for W2
3) This pattern of using the last computation to save redundant multiplications on the next update continues
Backpropagation
● Now imagine that W_i are weight matrices, B_i are bias vectors, X is an input
vector, and we replace any scalar multiplication with matrix multiplication
○ Critically, our loss L is still a scalar
● You don’t need to understand what the derivative of L with respect to Zi looks like,
but just imagine that it exists — we can implement the same caching idea from
the scalar case for a real neural network as well
○ Caching saves us from performing redundant matrix multiplications, which can be very expensive
Takeaways
● The ONLY thing you need to take away from backprop is that it is a fast method
of getting all of the partial derivatives needed for gradient descent, removing
redundant (matrix) multiplications
○ We do this by working from the end of the computational graph to the front, caching any
computation used in calculating the previous partial derivatives
○ By working from the end of the graph to the front, we can handle much more complex
computational graphs quickly and efficiently
● Modern auto-differentiation software like PyTorch will keep track of the graph
and calculate our gradients with backprop
○ It can handle arbitrarily large computational graphs (our toy example was very simple in comparison
to real deep learning systems out there)
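
A minimal sketch of this caching idea on a toy scalar network with one neuron per layer. The sigmoid activation, squared-error loss, and parameter values below are illustrative choices, not the slide's exact numbers.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy network: z_i = w_i * a_{i-1} + b_i, a_i = sigmoid(z_i), loss L = (a_3 - y)^2
w = [0.5, -0.3, 0.8]
b = [0.1, 0.2, -0.1]
x, y = 1.0, 0.0

# Forward pass: save the activations we will need on the backward pass.
a = [x]
z = []
for wi, bi in zip(w, b):
    z.append(wi * a[-1] + bi)
    a.append(sigmoid(z[-1]))
loss = (a[-1] - y) ** 2

# Backward pass: "delta" caches dL/dz_i and is reused for every earlier layer,
# instead of re-multiplying the whole chain of partials from scratch each time.
delta = 2 * (a[-1] - y) * a[-1] * (1 - a[-1])        # dL/dz_3
grads_w, grads_b = [0.0] * 3, [0.0] * 3
for i in reversed(range(3)):
    grads_w[i] = delta * a[i]                        # dL/dw_{i+1}
    grads_b[i] = delta                               # dL/db_{i+1}
    if i > 0:
        delta = delta * w[i] * a[i] * (1 - a[i])     # propagate the cache to dL/dz_i

print(grads_w, grads_b)
```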
Modern Deep Learning
Optimization
● Can we do better than vanilla gradient descent? Yes
● We’re going to talk about 3 optimizers: Momentum, RMSProp, and Adam
● Not understanding the math is ok, just make sure you understand the
intuition/takeaway bullet point from each method

Here is the vanilla GD update rule for reference


Momentum
● We'd hate for a small random local optimum or random flat points to halt our descent
○ Take inspiration from an actual ball rolling down a hill, which will have momentum
○ Takeaway: Update our parameters with a weighted average of past gradients
● Technically this update isn't the real gradient anymore, but it works well for avoiding local optima
● Update rule: scale down the previous weighted average, add in the current (scaled-down) gradient, then perform the gradient update step with this average:
○ g_t = β · g_{t−1} + (1 − β) · ∇θ L(θ_t)
○ θ_{t+1} = θ_t − η · g_t
● g is our weighted average of the previous gradients
● t is the time step / t-th gradient step
● beta is some hyperparameter in the range [0, 1] that we choose; it controls the strength of the weighted average
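
A minimal sketch of the momentum update on a made-up quadratic loss. The target values, beta, and learning rate below are illustrative, not prescribed by the slides.

```python
import numpy as np

target = np.array([3.0, -2.0])   # made-up quadratic loss L(theta) = ||theta - target||^2
theta = np.zeros(2)
g = np.zeros(2)                  # g: weighted average of past gradients
beta, lr = 0.9, 0.1              # beta in [0, 1] controls the strength of the averaging

for t in range(200):
    grad = 2 * (theta - target)        # gradient of the loss at the current parameters
    g = beta * g + (1 - beta) * grad   # scale down the old average, add in the scaled current gradient
    theta = theta - lr * g             # gradient step using the averaged gradient

print(theta)   # approaches [3, -2]
```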
Momentum Visualized
● No Momentum
● With Momentum (imagine dropping a marble at the start and how it would descend)
RMSProp
● Instead of keeping a weighted average of gradients, we are going to keep a weighted average of squared gradient components
● We are going to divide our true gradient update by the square root of this weighted average of squared gradients
○ Takeaway/intuition is on the next slide
● The first equation ends up looking very similar to momentum, since we are just calculating a weighted average (albeit of a different quantity): square our gradient, scale it, and add it to our old weighted average that has been scaled down
○ g²_t = β · g²_{t−1} + (1 − β) · (∇θ L(θ_t))²
○ Note: g² is the variable's name, a reference to the tracking of squared gradients instead of regular gradients
● We then divide our update by the square root of the weighted average:
○ θ_{t+1} = θ_t − η · ∇θ L(θ_t) / (√(g²_t) + ε)
○ epsilon (ε): some really small value to make sure we don't divide by zero
RMSProp
● Case 1: The gradients have been really small in the past
○ Our moving average of squared gradients will be even tinier
○ The square root of this moving average will be a really small number, and dividing by it should increase the size of the final gradient update
● Case 2: Our gradients have been really big in the past
○ Our moving average of squared gradients will be huge
○ The square root of this moving average will be a really large number, and dividing by it should decrease the size of the final update step
● Takeaway: this helps combat the issue that gradients can vary in size, causing us to either get stuck from small gradients or blow past our mark with large gradients. RMSProp makes sure our steps never get too big or too small!
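
A minimal sketch of the RMSProp update on the same kind of made-up quadratic loss; the hyperparameter values are illustrative.

```python
import numpy as np

target = np.array([3.0, -2.0])   # made-up quadratic loss L(theta) = ||theta - target||^2
theta = np.zeros(2)
sq_avg = np.zeros(2)             # weighted average of squared gradient components
beta, lr, eps = 0.9, 0.05, 1e-8

for t in range(500):
    grad = 2 * (theta - target)
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2       # square, scale, add to the old average
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)   # divide the step by the root of the average

print(theta)   # approaches [3, -2] (up to small oscillations on the order of lr)
```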
Adam
● Adam is the best optimizer… period… just
default to using Adam with default
hyperparameters
● We will simply combine momentum and
RMSProp
○ We will keep 2 moving averages: one for the
gradients and one for the squared gradients
● Takeaway: It is a combination of
momentum and RMSProp, getting the
benefits of both, and it works out of the
box (default parameters) better than
almost anything else
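
In practice you rarely implement Adam by hand: PyTorch ships it as torch.optim.Adam. A minimal sketch of the usual training-step pattern with default hyperparameters follows; the tiny linear model and random data are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                            # any model works here
optimizer = torch.optim.Adam(model.parameters())   # defaults: lr=1e-3, betas=(0.9, 0.999)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()   # clear old gradients
loss.backward()         # backprop fills .grad for every parameter
optimizer.step()        # Adam update: momentum + RMSProp-style scaling
```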
Concept Check Questions
1) What is the role of the optimizer in deep learning? Does it change what the best set of weights for a model is?
2) List the primary benefits of momentum, RMSProp, and Adam.
Concept Check Answers
1) The optimizer determines the best solution that we are capable of finding, but has no impact on the best weights themselves.
2) Momentum prevents us from getting stuck in local minima, RMSProp prevents our gradient steps from being too small or too large, and Adam has the benefits of both.
Normalization
● Normalized data tends to train faster, due to the fact that our loss surfaces look a lot more "normal"
○ When our data is normalized, the first layer weights are all on the same order of magnitude, so our gradient steps are also roughly on the same order of magnitude in all directions
● We have no control over whether our activations stay normalized through the network… how can we make sure that it's possible for our activations to be normalized just like our input data was?
● No normalization: W2 is always much more important than W1
● With normalization: both features can play an equal role
Batch Normalization
● For a batch of neuron activations after a layer, normalize each neuron's output independently over the batch
● We then rescale these activations by some learned parameters, allowing the network to learn how to re-weight its own features if it finds that helpful
● Takeaway: We provide more explicit structure for the network to learn normalized activations, since normalized features have nicer gradients
● We tend to just throw batchnorm in either before or after any non-final layer activations, it works really well!
Notation:
● a: a batch of activations in our network
● a': the normalized activations, a' = (a − mean) / √(variance + ε), where the mean and variance are computed over the batch for each neuron
● gamma: a vector of learned parameters
● beta: a vector of learned parameters
● epsilon (ε): small value to prevent us from dividing by zero
● Output: gamma * a' + beta, where the * denotes element-wise multiplication
BatchNorm Notes
● Note: at test time, when we are working with single examples, instead of the batch mean
and std. dev. calculations, we use a weighted average of our most recent
means and std. deviations from during training
● Reminder: gamma and beta are learned parameters, meaning we add them to the
list of things we update in gradient descent
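
A rough sketch of what batchnorm computes on a batch of activations, side by side with PyTorch's built-in nn.BatchNorm1d; the shapes and values are illustrative.

```python
import torch
import torch.nn as nn

a = torch.randn(32, 16)                      # batch of 32 examples, 16 features

# Manual version of the normalization described above.
mean = a.mean(dim=0)                         # per-feature mean over the batch
var = a.var(dim=0, unbiased=False)           # per-feature variance over the batch
eps = 1e-5
a_hat = (a - mean) / torch.sqrt(var + eps)   # normalized activations
gamma = torch.ones(16)                       # learned rescale (initialized to 1)
beta = torch.zeros(16)                       # learned shift (initialized to 0)
out = gamma * a_hat + beta                   # element-wise rescale and shift

# Built-in layer: training mode uses batch statistics and updates running stats,
# eval mode uses the running mean/variance instead of the current batch's.
bn = nn.BatchNorm1d(16)
bn.train()
out_train = bn(a)
bn.eval()
out_eval = bn(a)
```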
BatchNorm Concept Check
● Does batchnorm make our model more "expressive" (i.e. does it allow our model
to learn or approximate new functions that we wouldn't have been able to before)?
● Answer: NO
○ We could in theory adjust the weights and biases of a linear layer to get the same output without
using batchnorm, but the objective of batchnorm isn’t to get different outputs (batchnorm doesn’t
make our model more expressive)
○ The point is to get nicer gradients for our weights and biases, which this does by allowing the
network to learn normalized activations (or choose to weight certain activations more if it is
advantageous to do so)
LayerNorm
● Exactly like batchnorm, except instead of normalizing over the batch's statistics,
we normalize over each individual training example's features independently
● We can still use parameters to rescale and re-bias the data in the same way we did
for batchnorm
● There are many ways to normalize your activations, the goal is to provide nice
gradients
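
A small sketch contrasting the normalization axes (the tensor shape is illustrative): batchnorm takes statistics per feature over the batch dimension, layernorm takes them per example over the feature dimension.

```python
import torch

a = torch.randn(32, 16)   # (batch, features)

# BatchNorm-style: statistics per feature, computed over the batch dimension.
bn_out = (a - a.mean(dim=0)) / (a.std(dim=0, unbiased=False) + 1e-5)

# LayerNorm-style: statistics per example, computed over the feature dimension.
ln_out = (a - a.mean(dim=1, keepdim=True)) / (a.std(dim=1, unbiased=False, keepdim=True) + 1e-5)
```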
Normalization Concept Check
● We know that with batch gradient descent, normally our gradients are simply the
mean of the gradients of our training examples. Is this true of layernorm and
batchnorm?
● Answer: NO
○ Layernorm: each training example is processed independently, so batch gradient descent is the
same as averaging the individual gradients
○ Batchnorm: the value of the activation after batchnorm depends on batch statistics and if you break
up the batch into smaller sub-batches or look at the training examples in isolation, the values in the
forward pass will change
■ When using batchnorm, gradients will not be the same if you accumulate them via taking the
mean of gradients of individual samples
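
A quick sketch that checks this claim empirically; the tiny model and random data are made up purely for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 1))
x, y = torch.randn(16, 8), torch.randn(16, 1)

def grad_of_first_weight(xb, yb):
    # Gradient of the mean-squared-error loss w.r.t. the first layer's weights.
    model.zero_grad()
    nn.functional.mse_loss(model(xb), yb).backward()
    return model[0].weight.grad.clone()

full = grad_of_first_weight(x, y)
halves = 0.5 * (grad_of_first_weight(x[:8], y[:8]) + grad_of_first_weight(x[8:], y[8:]))

# Without batchnorm these would match exactly; with batchnorm the forward pass
# depends on which examples share a batch, so this is almost certainly False.
print(torch.allclose(full, halves))
```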
Ensembling
● Our model is doing the best it can… is there any way to squeeze more
performance out?
● Make many different models (parameters initialized randomly each time) and see
what the entire group of models thinks
○ Wisdom of the group
○ Can average their predictions for regression tasks or take majority vote for classification tasks
○ Can use many different model configurations and ensemble them
● This can often get you better performance, but obviously requires a lot more
compute, as you have to train many models instead of just one; it also takes that
much more time to evaluate at test time
● To avoid each model learning the same thing: omit a random portion of the
training dataset for each model’s training
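
A minimal sketch of prediction averaging for a regression ensemble; the architecture is illustrative and the per-model training loop is omitted.

```python
import torch
import torch.nn as nn

# Several independently (randomly) initialized copies of the same architecture.
models = [nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1)) for _ in range(5)]
# ... train each model here, e.g. each on a random subset of the training data ...

x = torch.randn(4, 8)
with torch.no_grad():
    preds = torch.stack([m(x) for m in models])   # (num_models, batch, 1)
    ensemble_pred = preds.mean(dim=0)             # regression: average the predictions
```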
Dropout
● Overfitting is a problem
● Randomly zero out a new set of neurons during each batch
○ This means the network will have to learn to use different combinations of neurons
○ It will have to learn the same function in a number of ways, since each neuron can’t rely on any
previous neuron not being zeroed out
○ This relates to the concept of ensembling
■ Takeaway: since we have to learn the same thing multiple ways, we get the “wisdom of
the group” effect that ensembling was giving us
● At test time, we don’t zero anything
○ We just need to make sure to scale down all our activations by the right amount, otherwise the sum
of values coming into each neuron will be much larger in magnitude than during training
● Takeaway: dropout has a regularizing effect by making it harder to rely on (or
overfit to) specific features
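
A small sketch of dropout behavior in training vs. evaluation mode using PyTorch's nn.Dropout. Note that PyTorch uses "inverted" dropout: it scales the surviving activations up by 1/(1−p) during training rather than scaling down at test time, which achieves the same effect described above.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
a = torch.ones(1, 8)

drop.train()
print(drop(a))   # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2
drop.eval()
print(drop(a))   # identity at test time: nothing is zeroed
```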
Skip Connections
● We simply add our activations from a previous layer to
the activations of the current layer
○ Normally, the output of each block (before the final activation) is F(x), where F is one or more linear layers (with activation functions in between)
○ Instead, now we will output F(x) + x before we activate
● You can choose to skip arbitrary layers before adding an
activation back again, just make sure to add before you
go through an activation function
○ These sections of the network with a skip connection are
generally called “blocks”
Skip Connections
● Say, halfway through a normal network, the activations
are informative enough to classify the inputs well, but
our chosen network still has more layers after that
(potentially adding more noise unless we carefully
choose our weights)
● It happens to be trivial to have our block spit out exactly
what it took in (just set all the weights to zero), called
the identity function
○ If our weights are zero, i.e. F(x) = 0, then the output of the entire
block just becomes 0 + x = x

Takeaway: this, in theory, allows us to build arbitrarily deep networks since the
blocks can now easily learn the identity function or very small updates
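
A minimal sketch of a residual block in PyTorch; the layer sizes and ReLU choice are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.fc2(self.act(self.fc1(x)))   # F(x)
        return self.act(out + x)                # add the skip connection, then activate

block = ResidualBlock(16)
x = torch.randn(4, 16)
print(block(x).shape)   # torch.Size([4, 16])
```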
Recap
● All of these are just additions to your deep learning toolkit that are frequently used in modern deep
learning, here are the big picture takeaways for each
● Adam: better optimization out of the box than anything else
○ Takes momentum’s and RMSProp’s benefits and combines them
● Batchnorm: You can add it after any non-final layer for better learning
○ Gives better behaved gradients by allowing the network to have normalized activations
● Ensembling: If you have the compute, this can give you more performance on your dataset
○ “Wisdom of the group”
● Dropout: Budget ensembling
○ Increases regularization by forcing each layer to learn the same thing in different ways using
different features from the previous layer, “wisdom of the group” still applies here
● Skip connections: MORE LAYERS
○ Enables you to go deeper since it is trivial to learn the identity layer, and learning small updates to
the previous activation is less complex
PyTorch
Backprop is hard…
● Backprop is hard to implement, but needed to make deep learning feasible
○ This is why we need something to do it automatically for us
● PyTorch will automatically make arbitrary computational graphs for us and can
perform the backpropagation for it automatically, so we don’t have to muck about
doing any actual math
○ Praise be
PyTorch: How to Approach
● If you understand Numpy, PyTorch will be a breeze
○ To a user, PyTorch looks and behaves like numpy
○ Instead of np arrays, torch has things called “tensors” that act the same way
■ Except they generate a computational graph in the background as you go along
● At a higher level, PyTorch also has some built-in functions and classes for
things like activations, layers, etc.
● What PyTorch can’t do:
○ Symbolic differentiation
● What PyTorch can do:
○ Take the partial derivatives of one value (a loss perhaps…) with respect to some parameter
evaluated at the parameter’s current value
Intros / Demos
● Torch really just looks like Numpy at the lowest level
○ You can see it looks pretty much exactly the same, except
instead of having arrays, we have things called tensors
● We can do all the normal operations that we would do
on an array, and things behave exactly the same way
○ We can add, subtract, square
○ We can reshape, etc
● We can see the size of the tensor by inspecting the
.shape attribute
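
A few lines showing the NumPy-like feel; the values are arbitrary.

```python
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.shape)          # torch.Size([2, 2])
print(t + 1)            # element-wise add
print((t ** 2).sum())   # square and reduce
print(t.reshape(4))     # reshape, just like NumPy
```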
Intros / Demos
● Define a new tensor (in this case, think of it as a parameter) with current value = 5. We make sure it is a float, which is required in order to differentiate
● By setting requires_grad = True, we mandate that torch save everything it could need to include this variable in the computational graph
● Define just some regular function (we can write down its symbolic derivative by hand, but PyTorch doesn't know the symbolic derivative)
● Evaluate the function on our tensor
● Calling .backward() on a scalar tensor will tell PyTorch to use the computational graph to calculate the partial derivatives of this value with respect to ALL parameters requiring a gradient that are in our computational graph, evaluated at the parameters' current values
● We can inspect the .grad attribute of the tensor to see the partial derivative of our output with respect to our input tensor evaluated at the current value, which we initialized to 5; this numeric value is what PyTorch knows and calculates
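
A sketch reconstructing the demo described above. The function f below is a made-up example, since the slide's exact function isn't reproduced here.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)   # float tensor, tracked in the graph

def f(t):
    return t ** 2 + 3 * t                   # "just some regular function" (illustrative)

y = f(x)          # forward pass builds the computational graph
y.backward()      # backprop fills .grad for every tensor with requires_grad=True
print(x.grad)     # df/dx evaluated at x = 5, i.e. 2*5 + 3 = 13
```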
Lecture Attendance

http://tinyurl.com/fa24-dl4cv
Contributors
● Slides by Jake Austin and Harshika Jalan
● Edited by Aryan Jain
