Neural Networks

The document explains the structure and functioning of linear classifiers and neural networks, highlighting their components such as input, model, and loss function. It discusses how neurons in a neural network operate similarly to linear classifiers, with an emphasis on the importance of non-linear activation functions and the concept of layers in deep learning. Additionally, it covers the backpropagation algorithm for training neural networks, detailing the forward and backward passes for gradient computation.

Neural Network View of a Linear Classifier
A linear classifier can be broken down into:
⬣ Input
⬣ A function of the input
⬣ A loss function

It’s all just one function that can be decomposed into building blocks

X  →  u = w⋅x  →  p = 1/(1 + e^(−u))  →  L = −log(p)
Input    Model                            Loss Function
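
The same pipeline can be written out directly; below is a minimal PyTorch sketch of my own (the sizes and values are arbitrary), mirroring the diagram's variables x, w, u, p, L:

import torch

x = torch.randn(5)                      # input X (5 features; size chosen for the example)
w = torch.randn(5, requires_grad=True)  # model weights
u = torch.dot(w, x)                     # model: u = w ⋅ x
p = torch.sigmoid(u)                    # p = 1 / (1 + e^(−u))
L = -torch.log(p)                       # loss: L = −log(p)
L.backward()                            # gradients of L w.r.t. w, used later for training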

What Does a Linear Classifier Consist of?


A simple neural network has a similar structure to our linear classifier:
⬣ A neuron takes input (firings) from other neurons (→ the input to the linear classifier)
⬣ The inputs are summed in a weighted manner (→ weighted sum)
⬣ Learning occurs through modification of the weights
⬣ If it receives enough input, it "fires" (i.e., when the weighted sum plus the bias crosses a threshold)
[Figure: a biological neuron (dendrites, cell body, axon, synapse) next to the artificial neuron it inspired, which computes f(Σᵢ wᵢxᵢ + b) with activation function f. Figures adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Origins of the Term Neural Network


As we did before, the output of a neuron can be modulated by a non-linear function (e.g. the sigmoid):

    σ(x) = 1 / (1 + e^(−x))

[Figure: the sigmoid activation function plotted over x ∈ [−10, 10], ranging from 0 to 1]

[Figure: the same biological/artificial neuron diagram as above, with activation function f applied to Σᵢ wᵢxᵢ + b. Figures adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Adding Non-Linearities
We can have multiple neurons connected to the same input

⬣ Corresponds to a multi-class classifier
⬣ Each output node outputs the score for a class

    f(x, W) = σ(Wx + b),   where  W = [ w₁₁ w₁₂ ⋯ w₁ₘ ]   and  b = [ b₁ ]
                                      [ w₂₁ w₂₂ ⋯ w₂ₘ ]            [ b₂ ]
                                      [ w₃₁ w₃₂ ⋯ w₃ₘ ]            [ b₃ ]

⬣ Often called a fully connected layer
⬣ Also called a linear projection layer

[Figure: input layer fully connected to output layer. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]
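
A minimal sketch of this layer (my own example, not from the slides; m = 4 inputs and 3 output classes are assumed):

import torch

m = 4                              # input dimension (assumed for the example)
W = torch.randn(3, m)              # one row of weights per output class
b = torch.randn(3)                 # one bias per output class
x = torch.randn(m)                 # a single input vector

scores = torch.sigmoid(W @ x + b)  # f(x, W) = σ(Wx + b), one score per class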

Connecting Many Neurons


⬣ Each input/output is a neuron (node)
⬣ A linear classifier is called a fully connected layer
⬣ Connections are represented as edges
⬣ The output of a particular neuron is referred to as its activation
⬣ This will be expanded as we view computation in a neural network as a graph

[Figure: input layer and output layer drawn as nodes connected by edges. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Neural Network Terminology


We can stack multiple layers together
⬣ The input to the second layer is the output of the first layer

This is called a 2-layered neural network (the input is not counted as a layer)

Because the middle layer is neither input nor output, and we don't know what its values represent, we call it a hidden layer
⬣ We will see that hidden layers end up learning effective features

This increases the representational power of the function!
⬣ Two-layered networks can represent any continuous function

[Figure: input layer, hidden layer, output layer. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Connecting Many Layers


The same two-layered neural network corresponds to adding another weight matrix
⬣ We will prefer the linear algebra view, but use some terminology from neural networks (and biology)

    f(x, W₁, W₂) = σ(W₂ σ(W₁x))

[Figure: x is transformed by W₁ into the hidden layer, then by W₂ into the output layer. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]
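
A minimal sketch of this two-layer network (my own; the input, hidden, and output sizes are arbitrary assumptions):

import torch

d, h, c = 8, 16, 3                # input, hidden, and output sizes (assumed)
W1 = torch.randn(h, d)
W2 = torch.randn(c, h)
x = torch.randn(d)

hidden = torch.sigmoid(W1 @ x)    # first layer: σ(W₁x)
out = torch.sigmoid(W2 @ hidden)  # f(x, W₁, W₂) = σ(W₂ σ(W₁x))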

The Linear Algebra View


Large (deep) networks can be built by adding more and more layers

Three-layered neural networks can represent any function
⬣ The number of nodes could grow unreasonably (exponentially or worse) with respect to the complexity of the function

We will show them without edges:

[Figure: input layer, hidden layer 1, hidden layer 2, output layer, drawn first with edges and then as blocks without edges. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Adding More Layers!


Computation Graphs
Functions can be made arbitrarily complex (subject to memory and computational limits), e.g.:

    f(x, W) = σ(W₅ σ(W₄ σ(W₃ σ(W₂ σ(W₁x)))))

We can use any type of differentiable function (layer) we want!
⬣ At the end, add the loss function

Composition can have some structure

[Figure: a chain of layers ending in a Loss Function block]

Adding Even More Layers


The world is compositional! We want our model to reflect this

⬣ VISION:  pixels → edge → texton → motif → part → object
⬣ SPEECH:  sample → spectral band → formant → motif → phone → word
⬣ NLP:     character → word → NP/VP/.. → clause → sentence → story

There is empirical and theoretical evidence that this makes learning complex functions easier

Note that prior state-of-the-art engineered features often had this compositionality as well
⬣ Pixels → edges → object parts → objects

Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun

Compositionality
⬣ We are learning complex models with a significant number of parameters (millions or billions)
⬣ How do we compute the gradients of the loss (at the end) with respect to internal parameters?
⬣ Intuitively, we want to understand how small changes in a weight deep inside the network propagate to affect the loss function at the end

[Figure: a chain of layers ending in a Loss Function block, with the question ∂L/∂wᵢ = ? posed for an internal weight]

Computing Gradients in Complex Functions


To develop a general algorithm for this, we will view the function as a computation graph

⬣ The graph can be any directed acyclic graph (DAG)
⬣ Modules must be differentiable to support gradient computations for gradient descent
⬣ A training algorithm will then process this graph, one module at a time

Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun

A General Framework
f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)

[Computation graph: x₁ feeds into ln and ×, x₂ feeds into × and sin; the ln and × results are added, and the sin result is subtracted]
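
A minimal sketch (my own, not from the slides) of building this graph and letting reverse-mode autodiff compute the gradients with PyTorch:

import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

f = torch.log(x1) + x1 * x2 - torch.sin(x2)  # f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)
f.backward()                                 # reverse-mode automatic differentiation

print(x1.grad)  # ∂f/∂x₁ = 1/x₁ + x₂
print(x2.grad)  # ∂f/∂x₂ = x₁ − cos(x₂)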

Example
−log( 1 / (1 + e^(−w⋅x)) )

u = w⋅x  →  p = 1/(1 + e^(−u))  →  L = −log(p)

Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun

Machine Learning Example


Backpropagation
Given this computation graph, the training algorithm will:
⬣ Calculate the current model's outputs (called the forward pass)
⬣ Calculate the gradients for each module (called the backward pass)

The backward pass is a recursive algorithm that:
⬣ Starts at the loss function, where we know how to calculate the gradients
⬣ Progresses back through the modules
⬣ Ends at the input layer, where we do not need gradients (no parameters)

This algorithm is called backpropagation

[Figure: a generic module with input h^(ℓ−1), output h^ℓ, and parameters W. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]

Overview of Training


Step 1: Compute Loss on Mini-Batch: Forward Pass

[Figure: the mini-batch flows forward through Layer 1 → Layer 2 → Layer 3 to the loss. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]

Note that we must store the intermediate outputs of all layers!
⬣ This is because we will need them to compute the gradients (the gradient equations will have terms with the output values in them)

Neural Network Training


In the backward pass, we seek to calculate the gradients of the loss with respect to the module's parameters
⬣ Assume that we have the gradient of the loss with respect to the module's outputs (given to us by the upstream module)
⬣ We will also pass back the gradient of the loss with respect to the module's inputs
    ⬣ This is not required for updating the module's weights, but it passes the gradients back to the previous module

Problem:
⬣ We can compute the local gradients: { ∂h^ℓ/∂h^(ℓ−1), ∂h^ℓ/∂W }
⬣ We are given: ∂L/∂h^ℓ
⬣ Compute: { ∂L/∂h^(ℓ−1), ∂L/∂W }

[Figure: a module with incoming gradient ∂L/∂h^ℓ, outgoing gradient ∂L/∂h^(ℓ−1), and parameter gradient ∂L/∂W. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]

Backward Pass Computations


⬣ We can compute the local gradients: { ∂h^ℓ/∂h^(ℓ−1), ∂h^ℓ/∂W }

⬣ This is just the derivative of our function with respect to its parameters and inputs!

Example: If h^ℓ = W h^(ℓ−1)

    then  ∂h^ℓ/∂h^(ℓ−1) = W

    and   ∂h^ℓ/∂W = h^(ℓ−1),T

Computing the Local Gradients: Example


⬣ We want to compute: { ∂L/∂h^(ℓ−1), ∂L/∂W }

[Figure: a chain of modules ending in the Loss, with the gradients ∂L/∂h^(ℓ−1) and ∂L/∂h^ℓ flowing backward and ∂L/∂W computed at each module]

⬣ We will use the chain rule to do this:

    Chain Rule:  ∂z/∂x = (∂z/∂y) · (∂y/∂x)

Computing the Gradients of Loss


⬣ We will use the chain rule to compute: { ∂L/∂h^(ℓ−1), ∂L/∂W }

⬣ Gradient of loss w.r.t. inputs:   ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ)(∂h^ℓ/∂h^(ℓ−1)),
    where ∂L/∂h^ℓ is given by the upstream module (the upstream gradient)

⬣ Gradient of loss w.r.t. weights:  ∂L/∂W = (∂L/∂h^ℓ)(∂h^ℓ/∂W)

[Figure: a module with incoming gradient ∂L/∂h^ℓ, outgoing gradient ∂L/∂h^(ℓ−1), and parameter gradient ∂L/∂W]
Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun
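
As a sketch of how a single module could implement these two chain-rule products (my own illustration, assuming the linear module h^ℓ = W h^(ℓ−1) from the earlier example): it stores its input during the forward pass and, given the upstream gradient ∂L/∂h^ℓ, returns both ∂L/∂h^(ℓ−1) and ∂L/∂W.

import torch

class LinearModule:
    """Minimal linear module: forward computes h_l = W @ h_prev; backward applies
    the chain rule to the upstream gradient dL/dh_l (a 1 x out_dim row vector)."""
    def __init__(self, out_dim, in_dim):
        self.W = torch.randn(out_dim, in_dim)

    def forward(self, h_prev):
        self.h_prev = h_prev                         # stored: needed in the backward pass
        return self.W @ h_prev

    def backward(self, dL_dh):
        dL_dh_prev = dL_dh @ self.W                  # dL/dh_prev = (dL/dh_l)(∂h_l/∂h_prev)
        dL_dW = dL_dh.t() @ self.h_prev.view(1, -1)  # dL/dW: row i is (dL/dh_l)_i * h_prev^T
        return dL_dh_prev, dL_dW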

Computing the Gradients of Loss




Step 1: Compute Loss on Mini-Batch: Forward Pass
Step 2: Compute Gradients wrt parameters: Backward Pass
Step 3: Use gradient to update all parameters at the end

    wᵢ = wᵢ − α ∂L/∂wᵢ

Backpropagation is the application of gradient descent to a computation graph via the chain rule!

[Figure: Layer 1 → Layer 2 → Layer 3 with forward and backward passes. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]
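
A minimal sketch of these three steps as one training iteration in PyTorch (the model, fake mini-batch, and learning rate are placeholders of my own, not from the slides):

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
loss_fn = torch.nn.CrossEntropyLoss()
alpha = 0.01                                            # learning rate

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # one mini-batch of fake data

loss = loss_fn(model(x), y)                             # Step 1: forward pass, compute loss
loss.backward()                                         # Step 2: backward pass, compute ∂L/∂w
with torch.no_grad():
    for w in model.parameters():                        # Step 3: w ← w − α ∂L/∂w
        w -= alpha * w.grad
        w.grad = None                                   # clear gradients for the next iteration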

Neural Network Training


Backpropagation and Automatic Differentiation
Backpropagation does not really spell out how to
efficiently carry out the necessary computations
But the idea can be applied to any directed
acyclic graph (DAG)
⬣ Graph represents an ordering constraining
which paths must be calculated first
Given an ordering, we can then iterate from the
last module backwards, applying the chain rule
⬣ We will store, for each node, its gradient
outputs for efficient computation
This is called reverse-mode automatic
differentiation

A General Framework
Computation = Graph
⬣ Input = Data + Parameters
⬣ Output = Loss
⬣ Scheduling = Topological ordering

Auto-Diff
⬣ A family of algorithms for implementing the chain rule on computation graphs

Deep Learning = Differentiable Programming


f(x₁, x₂) = x₁x₂ + sin(x₂)

We want to find the partial derivative of the output f with respect to all intermediate variables
⬣ Assign intermediate variables: a₁ = sin(x₂), a₂ = x₁x₂, a₃ = a₁ + a₂ = f
⬣ Simplify notation: denote the bar of a variable as āᵢ = ∂f/∂aᵢ
⬣ Start at the end and move backward

[Computation graph: x₂ → sin → a₁; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃ = f]

Example
f(x₁, x₂) = x₁x₂ + sin(x₂)

    ā₃ = ∂f/∂a₃ = 1

    ā₁ = ∂f/∂a₁ = (∂f/∂a₃)(∂a₃/∂a₁) = (∂f/∂a₃) ∂(a₁ + a₂)/∂a₁ = (∂f/∂a₃) · 1 = ā₃

    ā₂ = ∂f/∂a₂ = (∂f/∂a₃)(∂a₃/∂a₂) = ā₃

    x̄₂ (Path 1, through sin):  x̄₂^P1 = (∂f/∂a₁)(∂a₁/∂x₂) = ā₁ cos(x₂)

    x̄₂ (Path 2, through ×):    x̄₂^P2 = (∂f/∂a₂)(∂a₂/∂x₂) = (∂f/∂a₂) ∂(x₁x₂)/∂x₂ = ā₂ x₁

    x̄₁ = (∂f/∂a₂)(∂a₂/∂x₁) = ā₂ x₂

Gradients from multiple paths are summed: x̄₂ = x̄₂^P1 + x̄₂^P2

[Computation graph: x₂ → sin → a₁; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃ = f]
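
A minimal sketch (my own) of this backward sweep written out by hand and checked against PyTorch autograd (the input values 2.0 and 5.0 are arbitrary):

import math
import torch

x1, x2 = 2.0, 5.0
a1, a2 = math.sin(x2), x1 * x2
a3 = a1 + a2                                  # forward pass: f = x₁x₂ + sin(x₂)

a3_bar = 1.0                                  # ∂f/∂a₃
a1_bar = a3_bar * 1.0                         # addition passes the gradient through
a2_bar = a3_bar * 1.0
x2_bar = a1_bar * math.cos(x2) + a2_bar * x1  # two paths into x₂, summed
x1_bar = a2_bar * x2

t1 = torch.tensor(x1, requires_grad=True)     # check against autograd
t2 = torch.tensor(x2, requires_grad=True)
(t1 * t2 + torch.sin(t2)).backward()
print(x1_bar, t1.grad.item())                 # both 5.0
print(x2_bar, t2.grad.item())                 # both cos(5) + 2 ≈ 2.2837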

Example
f(x₁, x₂) = x₁x₂ + sin(x₂)

    ā₁ = ∂f/∂a₁ = (∂f/∂a₃) ∂(a₁ + a₂)/∂a₁ = (∂f/∂a₃) · 1 = ā₃

    ā₂ = ∂f/∂a₂ = (∂f/∂a₃)(∂a₃/∂a₂) = ā₃

The addition operation distributes gradients along all paths!

[Computation graph: x₂ → sin → a₁; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃]

Patterns of Gradient Flow: Addition


f(x₁, x₂) = x₁x₂ + sin(x₂)

The multiplication operation is a gradient switcher (it multiplies the incoming gradient by the value of the other term):

    x̄₂ = (∂f/∂a₂)(∂a₂/∂x₂) = (∂f/∂a₂) ∂(x₁x₂)/∂x₂ = ā₂ x₁

    x̄₁ = (∂f/∂a₂)(∂a₂/∂x₁) = ā₂ x₂

[Computation graph: x₁, x₂ → × → a₂; a₁, a₂ → + → a₃]

Patterns of Gradient Flow: Multiplication


There are several other patterns as well, e.g.:

The max operation selects which path to push the gradients through
⬣ The gradient flows along the path that was "selected" to be the max
⬣ This information must be recorded in the forward pass

[Figure: max(5, 1) = 5; in the backward pass, the gradient flows only to the input that held the value 5, and the other input receives zero gradient]

The flow of gradients is one of the most important aspects of deep neural networks
⬣ If gradients do not flow backwards properly, learning slows or stops!
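
A small check of this gating behavior (my own example) using autograd:

import torch

a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

torch.maximum(a, b).backward()  # max "selects" a in the forward pass
print(a.grad, b.grad)           # gradient 1.0 flows to a, 0.0 to b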

Patterns of Gradient Flow: Other


⬣ The key idea is to explicitly store the computation graph in memory, together with the corresponding gradient functions
⬣ Nodes are broken down into basic primitive computations (addition, multiplication, log, etc.) for which the corresponding derivative is known

    Example: x̄₂ = (∂f/∂a₁)(∂a₁/∂x₂) = ā₁ cos(x₂)

[Computation graph: x₂ → sin → a₁, with the stored gradient function cos; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃]

Computational Implementation
Note that we can also do forward-mode automatic differentiation

⬣ Start from the inputs and propagate gradients forward
⬣ The complexity is proportional to the input size
⬣ However, in most cases our inputs (images) are large and our outputs (loss) are small

[Computation graph annotated with forward-mode derivatives: ẇ₁ = cos(x₁)ẋ₁, ẇ₂ = ẋ₁x₂ + x₁ẋ₂, ẇ₃ = ẇ₁ + ẇ₂, starting from the input derivatives ẋ₁ and ẋ₂]
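
A minimal sketch of forward-mode AD with dual numbers (my own illustration, not from the slides), propagating a derivative alongside each value:

import math

class Dual:
    """A value v together with its derivative d with respect to one chosen input."""
    def __init__(self, v, d):
        self.v, self.d = v, d
    def __add__(self, o):
        return Dual(self.v + o.v, self.d + o.d)
    def __mul__(self, o):
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)  # product rule

def sin(x):
    return Dual(math.sin(x.v), math.cos(x.v) * x.d)             # chain rule

# f(x1, x2) = x1*x2 + sin(x2); derivative w.r.t. x2, so x2 carries derivative 1
x1, x2 = Dual(2.0, 0.0), Dual(5.0, 1.0)
f = x1 * x2 + sin(x2)
print(f.d)  # ∂f/∂x2 = x1 + cos(x2) ≈ 2.2837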

Automatic Differentiation
A graph is created on the fly:

import torch
from torch.autograd import Variable

x = Variable(torch.randn(1, 20))
prev_h = Variable(torch.randn(1, 20))
W_h = Variable(torch.randn(20, 20))
W_x = Variable(torch.randn(20, 20))

i2h = torch.mm(W_x, x.t())       # input-to-hidden term
h2h = torch.mm(W_h, prev_h.t())  # hidden-to-hidden term
next_h = i2h + h2h               # the Add node in the graph

[Figure: the dynamically built graph: W_x, x → MM → i2h; W_h, h → MM → h2h; i2h, h2h → Add → next_h]

Computation Graphs in PyTorch


Back-propagation uses the dynamically built graph:

import torch
from torch.autograd import Variable

x = Variable(torch.randn(1, 20))
prev_h = Variable(torch.randn(1, 20))
W_h = Variable(torch.randn(20, 20), requires_grad=True)  # track gradients for the weights
W_x = Variable(torch.randn(20, 20), requires_grad=True)

i2h = torch.mm(W_x, x.t())
h2h = torch.mm(W_h, prev_h.t())
next_h = i2h + h2h
next_h = next_h.tanh()              # the Tanh node added to the graph

next_h.backward(torch.ones(20, 1))  # upstream gradient with the same shape as next_h

[Figure: the same graph as above with a Tanh node before next_h. From pytorch.org]

Computation Graphs in PyTorch


⬣ Computation graphs are not limited to mathematical functions!
⬣ They can have control flow (if statements, loops), and we can backpropagate through algorithms!
⬣ This can be done dynamically, so that gradients are computed, then nodes are added, and the process repeats
⬣ This is called differentiable programming

[Figure: program space, contrasting Software 1.0 (hand-written programs of growing complexity) with Software 2.0 (programs found by optimization). Adapted from figure by Andrej Karpathy]

Power of Automatic Differentiation


Computation Graph Example for Logistic Regression
⬣ Input: x ∈ R^D
⬣ Binary label: y ∈ {−1, +1}
⬣ Parameters: w ∈ R^D
⬣ Output prediction: p(y = 1 | x) = 1 / (1 + e^(−wᵀx))
⬣ Loss: L = ½‖w‖² − λ log p(y | x)

Note that the weighting can equivalently be placed on the regularization term (which is more standard)!

[Figure: the log loss −log p plotted against wᵀx·y. Adapted from slide by Marc'Aurelio Ranzato]

Linear Classifier: Logistic Regression


We have discussed computation graphs for generic functions

A machine learning function (input → model → loss function) is also a computation graph:

    −log( 1 / (1 + e^(−wᵀx)) )

    u = wᵀx  →  p = 1/(1 + e^(−u))  →  L = −log(p)

We can use the gradients computed by backprop/automatic differentiation to update the weights!

Neural Network Computation Graph


u = wᵀx  →  p = 1/(1 + e^(−u))  →  L = −log(p),    where p = σ(wᵀx) and σ(x) = 1/(1 + e^(−x))

Working backward through the graph:

    L̄ = 1

    p̄ = ∂L/∂p = −1/p

    ū = ∂L/∂u = (∂L/∂p)(∂p/∂u) = p̄ σ(wᵀx)(1 − σ(wᵀx))

    w̄ = ∂L/∂w = (∂L/∂u)(∂u/∂w) = ū xᵀ

We can also combine the terms to see the whole path from L to w together:

    w̄ = (∂L/∂p)(∂p/∂u)(∂u/∂w) = −(1/σ(wᵀx)) σ(wᵀx)(1 − σ(wᵀx)) xᵀ = −(1 − σ(wᵀx)) xᵀ

This effectively shows the gradient flow along the path from L to w.

Automatic differentiation:
⬣ Carries out this procedure for us on arbitrary graphs
⬣ Knows the derivatives of primitive functions
⬣ As a result, we just define the (forward) functions and don't even need to specify the gradient (backward) functions!
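
A minimal sketch (my own) that checks the closed-form gradient −(1 − σ(wᵀx))xᵀ against autograd:

import torch

D = 4
w = torch.randn(D, requires_grad=True)
x = torch.randn(D)

u = torch.dot(w, x)
p = torch.sigmoid(u)
L = -torch.log(p)
L.backward()

manual = -(1 - p.detach()) * x         # closed-form w̄ = −(1 − σ(wᵀx)) xᵀ
print(torch.allclose(w.grad, manual))  # True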

Example Gradient Computations


Vectorization and Jacobians of Simple Layers
The chain rule can be computed as a series of scalar, vector, and matrix linear algebra operations

    u = wᵀx  →  p = 1/(1 + e^(−u))  →  L = −log(p)      (u, p, L are 1×1; x is d×1)

    w̄ = −(1/σ(wᵀx)) · σ(wᵀx)(1 − σ(wᵀx)) · xᵀ
         (1×1)         (1×1)                 (1×d)    →   w̄ is 1×d

These operations are extremely efficient in graphics processing units (GPUs)

Vectorized Computations
[Figure: a module with input h^(ℓ−1), output h^ℓ, and parameters W]

Forward function of a fully connected (FC) layer:

    h^ℓ = W h^(ℓ−1)       (row i of W is wᵢᵀ)

    dimensions:  (|h^ℓ| × 1) = (|h^ℓ| × |h^(ℓ−1)|)(|h^(ℓ−1)| × 1)

Fully Connected (FC) Layer: Forward Function


Local gradients:

    ∂h^ℓ/∂h^(ℓ−1) = W

    ∂h^ℓᵢ/∂wᵢ = h^(ℓ−1),T       (the other elements are zero)

Note that doing this on the full W matrix would result in a Jacobian tensor!
But it is sparse – each output is affected only by the corresponding weight row.

Chain rule for the loss gradients:

    ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ)(∂h^ℓ/∂h^(ℓ−1))         ∂L/∂wᵢ = (∂L/∂h^ℓ)(∂h^ℓ/∂wᵢ)

    dimensions:  (1×|h^(ℓ−1)|) = (1×|h^ℓ|)(|h^ℓ|×|h^(ℓ−1)|) in both cases
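
As a small numerical check of these expressions (my own sketch, not part of the slides):

import torch

n_out, n_in = 3, 5
W = torch.randn(n_out, n_in, requires_grad=True)
h_prev = torch.randn(n_in, requires_grad=True)

h = W @ h_prev                 # forward: h^ℓ = W h^(ℓ−1)
upstream = torch.randn(n_out)  # stand-in for the upstream gradient ∂L/∂h^ℓ
h.backward(upstream)

print(torch.allclose(h_prev.grad, upstream @ W))                                   # ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ) W
print(torch.allclose(W.grad, upstream.view(-1, 1) * h_prev.detach().view(1, -1)))  # row i is (∂L/∂h^ℓ)ᵢ h^(ℓ−1),T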

Fully Connected (FC) Layer


We can employ any differentiable (or piecewise differentiable) function

A common choice is the Rectified Linear Unit (ReLU):
⬣ Provides non-linearity but better gradient flow than the sigmoid
⬣ Performed element-wise

    h^ℓ = max(0, h^(ℓ−1))

How many parameters for this layer?

[Figure: the ReLU max(0, x) and the logistic function plotted over x ∈ [−2, 2]]

Rectified Linear Unit (ReLU)


The full Jacobian of the ReLU layer is large (output dim × input dim, i.e. |h^ℓ| × |h^(ℓ−1)|)
⬣ But again it is sparse
⬣ Only the diagonal values are non-zero, because the operation is element-wise
⬣ An output value is affected only by the corresponding input value

The max function funnels gradients through the selected max
⬣ The gradient will be zero if the input is <= 0

[Figure: a module with input h^(ℓ−1) and output h^ℓ]

    Forward:   h^ℓ = max(0, h^(ℓ−1))

    Backward:  ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ)(∂h^ℓ/∂h^(ℓ−1))

               ∂h^ℓ/∂h^(ℓ−1) = 1 if h^(ℓ−1) > 0, 0 otherwise   (element-wise, on the diagonal)
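
A minimal sketch (my own) of this forward/backward pair, using an element-wise mask instead of building the full (sparse, diagonal) Jacobian:

import torch

def relu_forward(h_prev):
    return torch.clamp(h_prev, min=0)  # h^ℓ = max(0, h^(ℓ−1))

def relu_backward(dL_dh, h_prev):
    mask = (h_prev > 0).float()        # 1 where the input was positive, 0 otherwise
    return dL_dh * mask                # diagonal Jacobian applied element-wise

h_prev = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
torch.relu(h_prev).backward(torch.ones(3))            # autograd with an upstream gradient of ones
print(h_prev.grad)                                    # tensor([0., 1., 1.])
print(relu_backward(torch.ones(3), h_prev.detach()))  # matches: tensor([0., 1., 1.])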

Jacobian of ReLU
