Neural Networks

The document explains the structure and functioning of linear classifiers and neural networks, highlighting their components such as input, model, and loss function. It discusses how neurons in a neural network operate similarly to linear classifiers, with an emphasis on the importance of non-linear activation functions and the concept of layers in deep learning. Additionally, it covers the backpropagation algorithm for training neural networks, detailing the forward and backward passes for gradient computation.

Neural Network View of a Linear Classifier
A linear classifier can be broken down into:
⬣ Input
⬣ A function of the input
⬣ A loss function

It’s all just one function that can be decomposed into building blocks

X  →  u = w⋅x  →  p = 1/(1 + e^(−u))  →  L = −log(p)
Input    Model                            Loss Function
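
The same pipeline can be written out directly; below is a minimal PyTorch sketch of my own (the sizes and values are arbitrary), mirroring the diagram's variables x, w, u, p, L:

import torch

x = torch.randn(5)                      # input X (5 features; size chosen for the example)
w = torch.randn(5, requires_grad=True)  # model weights
u = torch.dot(w, x)                     # model: u = w ⋅ x
p = torch.sigmoid(u)                    # p = 1 / (1 + e^(−u))
L = -torch.log(p)                       # loss: L = −log(p)
L.backward()                            # gradients of L w.r.t. w, used later for training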

What Does a Linear Classifier Consist of?


A simple neural network has a similar structure to our linear classifier:
⬣ A neuron takes input (firings) from other neurons (→ the input to the linear classifier)
⬣ The inputs are summed in a weighted manner (→ weighted sum)
⬣ Learning occurs through modification of the weights
⬣ If it receives enough input, it "fires" (i.e., when the weighted sum plus the bias crosses a threshold)
[Figure: a biological neuron (dendrites, cell body, axon, synapse) next to the artificial neuron it inspired, which computes f(Σᵢ wᵢxᵢ + b) with activation function f. Figures adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Origins of the Term Neural Network


As we did before, the output of a neuron can be modulated by a non-linear function (e.g. the sigmoid):

    σ(x) = 1 / (1 + e^(−x))

[Figure: the sigmoid activation function plotted over x ∈ [−10, 10], ranging from 0 to 1]

[Figure: the same biological/artificial neuron diagram as above, with activation function f applied to Σᵢ wᵢxᵢ + b. Figures adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Adding Non-Linearities
We can have multiple neurons connected to the same input

⬣ Corresponds to a multi-class classifier
⬣ Each output node outputs the score for a class

    f(x, W) = σ(Wx + b),   where  W = [ w₁₁ w₁₂ ⋯ w₁ₘ ]   and  b = [ b₁ ]
                                      [ w₂₁ w₂₂ ⋯ w₂ₘ ]            [ b₂ ]
                                      [ w₃₁ w₃₂ ⋯ w₃ₘ ]            [ b₃ ]

⬣ Often called a fully connected layer
⬣ Also called a linear projection layer

[Figure: input layer fully connected to output layer. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]
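
A minimal sketch of this layer (my own example, not from the slides; m = 4 inputs and 3 output classes are assumed):

import torch

m = 4                              # input dimension (assumed for the example)
W = torch.randn(3, m)              # one row of weights per output class
b = torch.randn(3)                 # one bias per output class
x = torch.randn(m)                 # a single input vector

scores = torch.sigmoid(W @ x + b)  # f(x, W) = σ(Wx + b), one score per class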

Connecting Many Neurons


⬣ Each input/output is a neuron (node)
⬣ A linear classifier is called a fully connected layer
⬣ Connections are represented as edges
⬣ The output of a particular neuron is referred to as its activation
⬣ This will be expanded as we view computation in a neural network as a graph

[Figure: input layer and output layer drawn as nodes connected by edges. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Neural Network Terminology


We can stack multiple layers together
⬣ The input to the second layer is the output of the first layer

This is called a 2-layered neural network (the input is not counted as a layer)

Because the middle layer is neither input nor output, and we don't know what its values represent, we call it a hidden layer
⬣ We will see that hidden layers end up learning effective features

This increases the representational power of the function!
⬣ Two-layered networks can represent any continuous function

[Figure: input layer, hidden layer, output layer. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Connecting Many Layers


The same two-layered neural network corresponds to adding another weight matrix
⬣ We will prefer the linear algebra view, but use some terminology from neural networks (and biology)

    f(x, W₁, W₂) = σ(W₂ σ(W₁x))

[Figure: x is transformed by W₁ into the hidden layer, then by W₂ into the output layer. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]
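
A minimal sketch of this two-layer network (my own; the input, hidden, and output sizes are arbitrary assumptions):

import torch

d, h, c = 8, 16, 3                # input, hidden, and output sizes (assumed)
W1 = torch.randn(h, d)
W2 = torch.randn(c, h)
x = torch.randn(d)

hidden = torch.sigmoid(W1 @ x)    # first layer: σ(W₁x)
out = torch.sigmoid(W2 @ hidden)  # f(x, W₁, W₂) = σ(W₂ σ(W₁x))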

The Linear Algebra View


Large (deep) networks can be built by adding more and more layers

Three-layered neural networks can represent any function
⬣ The number of nodes could grow unreasonably (exponentially or worse) with respect to the complexity of the function

We will show them without edges:

[Figure: input layer, hidden layer 1, hidden layer 2, output layer, drawn first with edges and then as blocks without edges. Figure adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Adding More Layers!


Computation Graphs
Functions can be made arbitrarily complex (subject to memory and computational limits), e.g.:

    f(x, W) = σ(W₅ σ(W₄ σ(W₃ σ(W₂ σ(W₁x)))))

We can use any type of differentiable function (layer) we want!
⬣ At the end, add the loss function

Composition can have some structure

[Figure: a chain of layers ending in a Loss Function block]

Adding Even More Layers


The world is compositional! We want our model to reflect this

⬣ VISION:  pixels → edge → texton → motif → part → object
⬣ SPEECH:  sample → spectral band → formant → motif → phone → word
⬣ NLP:     character → word → NP/VP/.. → clause → sentence → story

There is empirical and theoretical evidence that this makes learning complex functions easier

Note that prior state-of-the-art engineered features often had this compositionality as well
⬣ Pixels → edges → object parts → objects

Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun

Compositionality
⬣ We are learning complex models with a significant number of parameters (millions or billions)
⬣ How do we compute the gradients of the loss (at the end) with respect to internal parameters?
⬣ Intuitively, we want to understand how small changes in a weight deep inside the network propagate to affect the loss function at the end

[Figure: a chain of layers ending in a Loss Function block, with the question ∂L/∂wᵢ = ? posed for an internal weight]

Computing Gradients in Complex Functions


To develop a general algorithm for this, we will view the function as a computation graph

⬣ The graph can be any directed acyclic graph (DAG)
⬣ Modules must be differentiable to support gradient computations for gradient descent
⬣ A training algorithm will then process this graph, one module at a time

Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun

A General Framework
f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)

[Computation graph: x₁ feeds into ln and ×, x₂ feeds into × and sin; the ln and × results are added, and the sin result is subtracted]
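
A minimal sketch (my own, not from the slides) of building this graph and letting reverse-mode autodiff compute the gradients with PyTorch:

import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

f = torch.log(x1) + x1 * x2 - torch.sin(x2)  # f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)
f.backward()                                 # reverse-mode automatic differentiation

print(x1.grad)  # ∂f/∂x₁ = 1/x₁ + x₂
print(x2.grad)  # ∂f/∂x₂ = x₁ − cos(x₂)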

Example
−log( 1 / (1 + e^(−w⋅x)) )

u = w⋅x  →  p = 1/(1 + e^(−u))  →  L = −log(p)

Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun

Machine Learning Example


Backpropagation
Given this computation graph, the training algorithm will:
⬣ Calculate the current model's outputs (called the forward pass)
⬣ Calculate the gradients for each module (called the backward pass)

The backward pass is a recursive algorithm that:
⬣ Starts at the loss function, where we know how to calculate the gradients
⬣ Progresses back through the modules
⬣ Ends at the input layer, where we do not need gradients (no parameters)

This algorithm is called backpropagation

[Figure: a generic module with input h^(ℓ−1), output h^ℓ, and parameters W. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]

Overview of Training


Step 1: Compute Loss on Mini-Batch: Forward Pass

[Figure: the mini-batch flows forward through Layer 1 → Layer 2 → Layer 3 to the loss. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]

Note that we must store the intermediate outputs of all layers!
⬣ This is because we will need them to compute the gradients (the gradient equations will have terms with the output values in them)

Neural Network Training


In the backward pass, we seek to calculate the gradients of the loss with respect to the module's parameters
⬣ Assume that we have the gradient of the loss with respect to the module's outputs (given to us by the upstream module)
⬣ We will also pass back the gradient of the loss with respect to the module's inputs
    ⬣ This is not required for updating the module's weights, but it passes the gradients back to the previous module

Problem:
⬣ We can compute the local gradients: { ∂h^ℓ/∂h^(ℓ−1), ∂h^ℓ/∂W }
⬣ We are given: ∂L/∂h^ℓ
⬣ Compute: { ∂L/∂h^(ℓ−1), ∂L/∂W }

[Figure: a module with incoming gradient ∂L/∂h^ℓ, outgoing gradient ∂L/∂h^(ℓ−1), and parameter gradient ∂L/∂W. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]

Backward Pass Computations


⬣ We can compute the local gradients: { ∂h^ℓ/∂h^(ℓ−1), ∂h^ℓ/∂W }

⬣ This is just the derivative of our function with respect to its parameters and inputs!

Example: If h^ℓ = W h^(ℓ−1)

    then  ∂h^ℓ/∂h^(ℓ−1) = W

    and   ∂h^ℓ/∂W = h^(ℓ−1),T

Computing the Local Gradients: Example


⬣ We want to compute: { ∂L/∂h^(ℓ−1), ∂L/∂W }

[Figure: a chain of modules ending in the Loss, with the gradients ∂L/∂h^(ℓ−1) and ∂L/∂h^ℓ flowing backward and ∂L/∂W computed at each module]

⬣ We will use the chain rule to do this:

    Chain Rule:  ∂z/∂x = (∂z/∂y) · (∂y/∂x)

Computing the Gradients of Loss


⬣ We will use the chain rule to compute: { ∂L/∂h^(ℓ−1), ∂L/∂W }

⬣ Gradient of loss w.r.t. inputs:   ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ)(∂h^ℓ/∂h^(ℓ−1)),
    where ∂L/∂h^ℓ is given by the upstream module (the upstream gradient)

⬣ Gradient of loss w.r.t. weights:  ∂L/∂W = (∂L/∂h^ℓ)(∂h^ℓ/∂W)

[Figure: a module with incoming gradient ∂L/∂h^ℓ, outgoing gradient ∂L/∂h^(ℓ−1), and parameter gradient ∂L/∂W]
Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun
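
As a sketch of how a single module could implement these two chain-rule products (my own illustration, assuming the linear module h^ℓ = W h^(ℓ−1) from the earlier example): it stores its input during the forward pass and, given the upstream gradient ∂L/∂h^ℓ, returns both ∂L/∂h^(ℓ−1) and ∂L/∂W.

import torch

class LinearModule:
    """Minimal linear module: forward computes h_l = W @ h_prev; backward applies
    the chain rule to the upstream gradient dL/dh_l (a 1 x out_dim row vector)."""
    def __init__(self, out_dim, in_dim):
        self.W = torch.randn(out_dim, in_dim)

    def forward(self, h_prev):
        self.h_prev = h_prev                         # stored: needed in the backward pass
        return self.W @ h_prev

    def backward(self, dL_dh):
        dL_dh_prev = dL_dh @ self.W                  # dL/dh_prev = (dL/dh_l)(∂h_l/∂h_prev)
        dL_dW = dL_dh.t() @ self.h_prev.view(1, -1)  # dL/dW: row i is (dL/dh_l)_i * h_prev^T
        return dL_dh_prev, dL_dW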

Computing the Gradients of Loss




Step 1: Compute Loss on Mini-Batch: Forward Pass
Step 2: Compute Gradients wrt parameters: Backward Pass
Step 3: Use gradient to update all parameters at the end

    wᵢ = wᵢ − α ∂L/∂wᵢ

Backpropagation is the application of gradient descent to a computation graph via the chain rule!

[Figure: Layer 1 → Layer 2 → Layer 3 with forward and backward passes. Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun]
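
A minimal sketch of these three steps as one training iteration in PyTorch (the model, fake mini-batch, and learning rate are placeholders of my own, not from the slides):

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
loss_fn = torch.nn.CrossEntropyLoss()
alpha = 0.01                                            # learning rate

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # one mini-batch of fake data

loss = loss_fn(model(x), y)                             # Step 1: forward pass, compute loss
loss.backward()                                         # Step 2: backward pass, compute ∂L/∂w
with torch.no_grad():
    for w in model.parameters():                        # Step 3: w ← w − α ∂L/∂w
        w -= alpha * w.grad
        w.grad = None                                   # clear gradients for the next iteration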

Neural Network Training


Backpropagation and Automatic Differentiation
Backpropagation does not really spell out how to
efficiently carry out the necessary computations
But the idea can be applied to any directed
acyclic graph (DAG)
⬣ Graph represents an ordering constraining
which paths must be calculated first
Given an ordering, we can then iterate from the
last module backwards, applying the chain rule
⬣ We will store, for each node, its gradient
outputs for efficient computation
This is called reverse-mode automatic
differentiation

A General Framework
Computation = Graph
⬣ Input = Data + Parameters
⬣ Output = Loss
⬣ Scheduling = Topological ordering

Auto-Diff
⬣ A family of algorithms for implementing the chain rule on computation graphs

Deep Learning = Differentiable Programming


f(x₁, x₂) = x₁x₂ + sin(x₂)

We want to find the partial derivative of the output f with respect to all intermediate variables
⬣ Assign intermediate variables: a₁ = sin(x₂), a₂ = x₁x₂, a₃ = a₁ + a₂ = f
⬣ Simplify notation: denote the bar of a variable as āᵢ = ∂f/∂aᵢ
⬣ Start at the end and move backward

[Computation graph: x₂ → sin → a₁; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃ = f]

Example
f(x₁, x₂) = x₁x₂ + sin(x₂)

    ā₃ = ∂f/∂a₃ = 1

    ā₁ = ∂f/∂a₁ = (∂f/∂a₃)(∂a₃/∂a₁) = (∂f/∂a₃) ∂(a₁ + a₂)/∂a₁ = (∂f/∂a₃) · 1 = ā₃

    ā₂ = ∂f/∂a₂ = (∂f/∂a₃)(∂a₃/∂a₂) = ā₃

    x̄₂ (Path 1, through sin):  x̄₂^P1 = (∂f/∂a₁)(∂a₁/∂x₂) = ā₁ cos(x₂)

    x̄₂ (Path 2, through ×):    x̄₂^P2 = (∂f/∂a₂)(∂a₂/∂x₂) = (∂f/∂a₂) ∂(x₁x₂)/∂x₂ = ā₂ x₁

    x̄₁ = (∂f/∂a₂)(∂a₂/∂x₁) = ā₂ x₂

Gradients from multiple paths are summed: x̄₂ = x̄₂^P1 + x̄₂^P2

[Computation graph: x₂ → sin → a₁; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃ = f]
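
A minimal sketch (my own) of this backward sweep written out by hand and checked against PyTorch autograd (the input values 2.0 and 5.0 are arbitrary):

import math
import torch

x1, x2 = 2.0, 5.0
a1, a2 = math.sin(x2), x1 * x2
a3 = a1 + a2                                  # forward pass: f = x₁x₂ + sin(x₂)

a3_bar = 1.0                                  # ∂f/∂a₃
a1_bar = a3_bar * 1.0                         # addition passes the gradient through
a2_bar = a3_bar * 1.0
x2_bar = a1_bar * math.cos(x2) + a2_bar * x1  # two paths into x₂, summed
x1_bar = a2_bar * x2

t1 = torch.tensor(x1, requires_grad=True)     # check against autograd
t2 = torch.tensor(x2, requires_grad=True)
(t1 * t2 + torch.sin(t2)).backward()
print(x1_bar, t1.grad.item())                 # both 5.0
print(x2_bar, t2.grad.item())                 # both cos(5) + 2 ≈ 2.2837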

Example
f(x₁, x₂) = x₁x₂ + sin(x₂)

    ā₁ = ∂f/∂a₁ = (∂f/∂a₃) ∂(a₁ + a₂)/∂a₁ = (∂f/∂a₃) · 1 = ā₃

    ā₂ = ∂f/∂a₂ = (∂f/∂a₃)(∂a₃/∂a₂) = ā₃

The addition operation distributes gradients along all paths!

[Computation graph: x₂ → sin → a₁; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃]

Patterns of Gradient Flow: Addition


f(x₁, x₂) = x₁x₂ + sin(x₂)

The multiplication operation is a gradient switcher (it multiplies the incoming gradient by the value of the other term):

    x̄₂ = (∂f/∂a₂)(∂a₂/∂x₂) = (∂f/∂a₂) ∂(x₁x₂)/∂x₂ = ā₂ x₁

    x̄₁ = (∂f/∂a₂)(∂a₂/∂x₁) = ā₂ x₂

[Computation graph: x₁, x₂ → × → a₂; a₁, a₂ → + → a₃]

Patterns of Gradient Flow: Multiplication


There are several other patterns as well, e.g.:

The max operation selects which path to push the gradients through
⬣ The gradient flows along the path that was "selected" to be the max
⬣ This information must be recorded in the forward pass

[Figure: max(5, 1) = 5; in the backward pass, the gradient flows only to the input that held the value 5, and the other input receives zero gradient]

The flow of gradients is one of the most important aspects of deep neural networks
⬣ If gradients do not flow backwards properly, learning slows or stops!
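
A small check of this gating behavior (my own example) using autograd:

import torch

a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

torch.maximum(a, b).backward()  # max "selects" a in the forward pass
print(a.grad, b.grad)           # gradient 1.0 flows to a, 0.0 to b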

Patterns of Gradient Flow: Other


⬣ The key idea is to explicitly store the computation graph in memory, together with the corresponding gradient functions
⬣ Nodes are broken down into basic primitive computations (addition, multiplication, log, etc.) for which the corresponding derivative is known

    Example: x̄₂ = (∂f/∂a₁)(∂a₁/∂x₂) = ā₁ cos(x₂)

[Computation graph: x₂ → sin → a₁, with the stored gradient function cos; x₁, x₂ → × → a₂; a₁, a₂ → + → a₃]

Computational Implementation
Note that we can also do forward-mode automatic differentiation

⬣ Start from the inputs and propagate gradients forward
⬣ The complexity is proportional to the input size
⬣ However, in most cases our inputs (images) are large and our outputs (loss) are small

[Computation graph annotated with forward-mode derivatives: ẇ₁ = cos(x₁)ẋ₁, ẇ₂ = ẋ₁x₂ + x₁ẋ₂, ẇ₃ = ẇ₁ + ẇ₂, starting from the input derivatives ẋ₁ and ẋ₂]
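
A minimal sketch of forward-mode AD with dual numbers (my own illustration, not from the slides), propagating a derivative alongside each value:

import math

class Dual:
    """A value v together with its derivative d with respect to one chosen input."""
    def __init__(self, v, d):
        self.v, self.d = v, d
    def __add__(self, o):
        return Dual(self.v + o.v, self.d + o.d)
    def __mul__(self, o):
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)  # product rule

def sin(x):
    return Dual(math.sin(x.v), math.cos(x.v) * x.d)             # chain rule

# f(x1, x2) = x1*x2 + sin(x2); derivative w.r.t. x2, so x2 carries derivative 1
x1, x2 = Dual(2.0, 0.0), Dual(5.0, 1.0)
f = x1 * x2 + sin(x2)
print(f.d)  # ∂f/∂x2 = x1 + cos(x2) ≈ 2.2837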

Automatic Differentiation
A graph is created on the fly:

import torch
from torch.autograd import Variable

x = Variable(torch.randn(1, 20))
prev_h = Variable(torch.randn(1, 20))
W_h = Variable(torch.randn(20, 20))
W_x = Variable(torch.randn(20, 20))

i2h = torch.mm(W_x, x.t())       # input-to-hidden term
h2h = torch.mm(W_h, prev_h.t())  # hidden-to-hidden term
next_h = i2h + h2h               # the Add node in the graph

[Figure: the dynamically built graph: W_x, x → MM → i2h; W_h, h → MM → h2h; i2h, h2h → Add → next_h]

Computation Graphs in PyTorch


Back-propagation uses the dynamically built graph:

import torch
from torch.autograd import Variable

x = Variable(torch.randn(1, 20))
prev_h = Variable(torch.randn(1, 20))
W_h = Variable(torch.randn(20, 20), requires_grad=True)  # track gradients for the weights
W_x = Variable(torch.randn(20, 20), requires_grad=True)

i2h = torch.mm(W_x, x.t())
h2h = torch.mm(W_h, prev_h.t())
next_h = i2h + h2h
next_h = next_h.tanh()              # the Tanh node added to the graph

next_h.backward(torch.ones(20, 1))  # upstream gradient with the same shape as next_h

[Figure: the same graph as above with a Tanh node before next_h. From pytorch.org]

Computation Graphs in PyTorch


⬣ Computation graphs are not limited to mathematical functions!
⬣ They can have control flow (if statements, loops), and we can backpropagate through algorithms!
⬣ This can be done dynamically, so that gradients are computed, then nodes are added, and the process repeats
⬣ This is called differentiable programming

[Figure: program space, contrasting Software 1.0 (hand-written programs of growing complexity) with Software 2.0 (programs found by optimization). Adapted from figure by Andrej Karpathy]

Power of Automatic Differentiation


Computation Graph Example for Logistic Regression
⬣ Input: x ∈ R^D
⬣ Binary label: y ∈ {−1, +1}
⬣ Parameters: w ∈ R^D
⬣ Output prediction: p(y = 1 | x) = 1 / (1 + e^(−wᵀx))
⬣ Loss: L = ½‖w‖² − λ log p(y | x)

Note that the weighting can equivalently be placed on the regularization term (which is more standard)!

[Figure: the log loss −log p plotted against wᵀx·y. Adapted from slide by Marc'Aurelio Ranzato]

Linear Classifier: Logistic Regression


We have discussed computation graphs for generic functions

A machine learning function (input → model → loss function) is also a computation graph:

    −log( 1 / (1 + e^(−wᵀx)) )

    u = wᵀx  →  p = 1/(1 + e^(−u))  →  L = −log(p)

We can use the gradients computed by backprop/automatic differentiation to update the weights!

Neural Network Computation Graph


u = wᵀx  →  p = 1/(1 + e^(−u))  →  L = −log(p),    where p = σ(wᵀx) and σ(x) = 1/(1 + e^(−x))

Working backward through the graph:

    L̄ = 1

    p̄ = ∂L/∂p = −1/p

    ū = ∂L/∂u = (∂L/∂p)(∂p/∂u) = p̄ σ(wᵀx)(1 − σ(wᵀx))

    w̄ = ∂L/∂w = (∂L/∂u)(∂u/∂w) = ū xᵀ

We can also combine the terms to see the whole path from L to w together:

    w̄ = (∂L/∂p)(∂p/∂u)(∂u/∂w) = −(1/σ(wᵀx)) σ(wᵀx)(1 − σ(wᵀx)) xᵀ = −(1 − σ(wᵀx)) xᵀ

This effectively shows the gradient flow along the path from L to w.

Automatic differentiation:
⬣ Carries out this procedure for us on arbitrary graphs
⬣ Knows the derivatives of primitive functions
⬣ As a result, we just define the (forward) functions and don't even need to specify the gradient (backward) functions!
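
A minimal sketch (my own) that checks the closed-form gradient −(1 − σ(wᵀx))xᵀ against autograd:

import torch

D = 4
w = torch.randn(D, requires_grad=True)
x = torch.randn(D)

u = torch.dot(w, x)
p = torch.sigmoid(u)
L = -torch.log(p)
L.backward()

manual = -(1 - p.detach()) * x         # closed-form w̄ = −(1 − σ(wᵀx)) xᵀ
print(torch.allclose(w.grad, manual))  # True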

Example Gradient Computations


Vectorization and Jacobians of Simple Layers
The chain rule can be computed as a series of scalar, vector, and matrix linear algebra operations

    u = wᵀx  →  p = 1/(1 + e^(−u))  →  L = −log(p)      (u, p, L are 1×1; x is d×1)

    w̄ = −(1/σ(wᵀx)) · σ(wᵀx)(1 − σ(wᵀx)) · xᵀ
         (1×1)         (1×1)                 (1×d)    →   w̄ is 1×d

These operations are extremely efficient in graphics processing units (GPUs)

Vectorized Computations
[Figure: a module with input h^(ℓ−1), output h^ℓ, and parameters W]

Forward function of a fully connected (FC) layer:

    h^ℓ = W h^(ℓ−1)       (row i of W is wᵢᵀ)

    dimensions:  (|h^ℓ| × 1) = (|h^ℓ| × |h^(ℓ−1)|)(|h^(ℓ−1)| × 1)

Fully Connected (FC) Layer: Forward Function


Local gradients:

    ∂h^ℓ/∂h^(ℓ−1) = W

    ∂h^ℓᵢ/∂wᵢ = h^(ℓ−1),T       (the other elements are zero)

Note that doing this on the full W matrix would result in a Jacobian tensor!
But it is sparse – each output is affected only by the corresponding weight row.

Chain rule for the loss gradients:

    ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ)(∂h^ℓ/∂h^(ℓ−1))         ∂L/∂wᵢ = (∂L/∂h^ℓ)(∂h^ℓ/∂wᵢ)

    dimensions:  (1×|h^(ℓ−1)|) = (1×|h^ℓ|)(|h^ℓ|×|h^(ℓ−1)|) in both cases
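
As a small numerical check of these expressions (my own sketch, not part of the slides):

import torch

n_out, n_in = 3, 5
W = torch.randn(n_out, n_in, requires_grad=True)
h_prev = torch.randn(n_in, requires_grad=True)

h = W @ h_prev                 # forward: h^ℓ = W h^(ℓ−1)
upstream = torch.randn(n_out)  # stand-in for the upstream gradient ∂L/∂h^ℓ
h.backward(upstream)

print(torch.allclose(h_prev.grad, upstream @ W))                                   # ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ) W
print(torch.allclose(W.grad, upstream.view(-1, 1) * h_prev.detach().view(1, -1)))  # row i is (∂L/∂h^ℓ)ᵢ h^(ℓ−1),T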

Fully Connected (FC) Layer


We can employ any differentiable (or piecewise differentiable) function

A common choice is the Rectified Linear Unit (ReLU):
⬣ Provides non-linearity but better gradient flow than the sigmoid
⬣ Performed element-wise

    h^ℓ = max(0, h^(ℓ−1))

How many parameters for this layer?

[Figure: the ReLU max(0, x) and the logistic function plotted over x ∈ [−2, 2]]

Rectified Linear Unit (ReLU)


The full Jacobian of the ReLU layer is large (output dim × input dim, i.e. |h^ℓ| × |h^(ℓ−1)|)
⬣ But again it is sparse
⬣ Only the diagonal values are non-zero, because the operation is element-wise
⬣ An output value is affected only by the corresponding input value

The max function funnels gradients through the selected max
⬣ The gradient will be zero if the input is <= 0

[Figure: a module with input h^(ℓ−1) and output h^ℓ]

    Forward:   h^ℓ = max(0, h^(ℓ−1))

    Backward:  ∂L/∂h^(ℓ−1) = (∂L/∂h^ℓ)(∂h^ℓ/∂h^(ℓ−1))

               ∂h^ℓ/∂h^(ℓ−1) = 1 if h^(ℓ−1) > 0, 0 otherwise   (element-wise, on the diagonal)
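
A minimal sketch (my own) of this forward/backward pair, using an element-wise mask instead of building the full (sparse, diagonal) Jacobian:

import torch

def relu_forward(h_prev):
    return torch.clamp(h_prev, min=0)  # h^ℓ = max(0, h^(ℓ−1))

def relu_backward(dL_dh, h_prev):
    mask = (h_prev > 0).float()        # 1 where the input was positive, 0 otherwise
    return dL_dh * mask                # diagonal Jacobian applied element-wise

h_prev = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
torch.relu(h_prev).backward(torch.ones(3))            # autograd with an upstream gradient of ones
print(h_prev.grad)                                    # tensor([0., 1., 1.])
print(relu_backward(torch.ones(3), h_prev.detach()))  # matches: tensor([0., 1., 1.])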

Jacobian of ReLU
