Neural Networks
Neural Network View of a Linear Classifier
A linear classifier can be broken down into:
⬣ Input
⬣ A function of the input
⬣ A loss function
It’s all just one function that can be decomposed into building blocks
Input → Model → Loss Function:
$X \;\rightarrow\; u = w \cdot x \;\rightarrow\; p = \dfrac{1}{1 + e^{-u}} \;\rightarrow\; L = -\log p$
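As a concrete illustration of this decomposition, here is a minimal NumPy sketch of the same pipeline; the function name and the example numbers are ours, chosen only to mirror the slide's notation:

```python
import numpy as np

def forward(w, x):
    """Forward pass of the linear classifier pipeline: u -> p -> L."""
    u = np.dot(w, x)              # model: u = w . x
    p = 1.0 / (1.0 + np.exp(-u))  # sigmoid: p = 1 / (1 + e^{-u})
    L = -np.log(p)                # loss: L = -log p
    return u, p, L

# Example usage with arbitrary (made-up) numbers
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(forward(w, x))
```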
Adding Non-Linearities
We can have multiple neurons
connected to the same input
[Figure: a fully connected network with an input layer, hidden layer 1, hidden layer 2, and an output layer feeding into the loss function. Adapted from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n.]
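A hedged sketch of such a multi-layer forward pass, assuming a ReLU non-linearity (introduced later in this section) and illustrative layer sizes:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, W1, W2, W3):
    """Two hidden layers followed by an output layer."""
    h1 = relu(W1 @ x)   # hidden layer 1
    h2 = relu(W2 @ h1)  # hidden layer 2
    out = W3 @ h2       # output layer (scores fed to the loss)
    return out

# Illustrative shapes: 4 inputs -> 8 -> 8 -> 3 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2, W3 = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
print(mlp_forward(x, W1, W2, W3))
```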
Compositionality
A compositional hierarchy in language: character → word → NP/VP → clause → sentence → story
⬣ We are learning complex models with a significant number of parameters (millions or billions)
⬣ How do we compute the gradients of the loss (at the end) with respect to internal parameters?
⬣ Intuitively, we want to understand how small changes to a weight deep inside the network propagate to affect the loss function at the end (see the finite-difference sketch below)
[Figure: the loss function at the end of the network; for a weight $w_i$ deep inside, we want $\dfrac{\partial L}{\partial w_i} = \,?$]
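Before introducing backpropagation, one way to probe how a small change to an internal weight affects the loss is a finite-difference estimate. The sketch below is ours, using the log-loss pipeline from the earlier slides as a stand-in loss; it is not how gradients are computed in practice.

```python
import numpy as np

def numerical_grad(loss_fn, w, eps=1e-6):
    """Finite-difference estimate of dL/dw_i for every weight w_i."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Stand-in loss: L = -log(sigmoid(w . x)) on a fixed (made-up) example
x = np.array([2.0, 1.0])
loss = lambda w: -np.log(1.0 / (1.0 + np.exp(-w @ x)))
print(numerical_grad(loss, np.array([0.5, -1.0])))
```

Each parameter needs its own pair of forward passes, which is far too expensive for millions or billions of parameters; this is exactly what backpropagation avoids.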
A General Framework
$f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)$
[Figure: computation graph for $f$, with $\ln$, $\times$, and $\sin$ nodes feeding $+$ and $-$ nodes that connect the inputs $x_1$ and $x_2$ to the output.]
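Evaluating the graph is just running its nodes in topological order; a minimal sketch (function name ours):

```python
import math

def f_graph(x1, x2):
    """Evaluate f(x1, x2) = ln(x1) + x1*x2 - sin(x2) node by node,
    following a topological ordering of the computation graph."""
    a = math.log(x1)   # ln node
    b = x1 * x2        # * node
    c = math.sin(x2)   # sin node
    d = a + b          # + node
    return d - c       # - node

print(f_graph(2.0, 0.5))  # same value as evaluating the formula directly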
Example
$L = -\log\!\left(\dfrac{1}{1 + e^{-w \cdot x}}\right)$
decomposed as $u = w \cdot x$, $\;p = \dfrac{1}{1 + e^{-u}}$, $\;L = -\log p$
Overview of Training
Step 1: Compute Loss on Mini-Batch: Forward Pass
Example: If $h^{\ell} = W h^{\ell-1}$,
then $\dfrac{\partial h^{\ell}}{\partial h^{\ell-1}} = W$
and $\dfrac{\partial h^{\ell}}{\partial W} = h^{\ell-1,T}$
⬣ Gradient of loss w.r.t. inputs: $\dfrac{\partial L}{\partial h^{\ell-1}} = \dfrac{\partial L}{\partial h^{\ell}} \dfrac{\partial h^{\ell}}{\partial h^{\ell-1}}$, where $\dfrac{\partial L}{\partial h^{\ell}}$ is given by the upstream module (the upstream gradient)
⬣ Gradient of loss w.r.t. weights: $\dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial h^{\ell}} \dfrac{\partial h^{\ell}}{\partial W}$
[Figure: a module in the middle of the network receives the upstream gradient $\partial L / \partial h^{\ell}$ flowing back from the loss and produces $\partial L / \partial h^{\ell-1}$ for the previous module and $\partial L / \partial W$ for its own weights; a code sketch of such a module follows the figure credit below.]
Adapted from figure by Marc'Aurelio Ranzato, Yann LeCun
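To make the module view concrete, here is a hedged sketch of a linear layer as a forward/backward module. The class name, the caching of the input, and the column-vector convention are our illustrative choices, not a specific framework's API.

```python
import numpy as np

class Linear:
    """Linear module h_out = W @ h_in, with the gradients from the slide."""
    def __init__(self, W):
        self.W = W

    def forward(self, h_in):
        self.h_in = h_in                         # cache input for backward
        return self.W @ h_in

    def backward(self, dL_dout):
        # dL/dh_in = (dL/dh_out) * (dh_out/dh_in), i.e. W^T times the upstream gradient
        dL_din = self.W.T @ dL_dout
        # dL/dW = upstream gradient (outer product with) h_in^T
        self.dL_dW = np.outer(dL_dout, self.h_in)
        return dL_din

layer = Linear(np.random.randn(3, 4))
h_out = layer.forward(np.random.randn(4))
dL_dh_in = layer.backward(np.ones(3))            # pretend upstream gradient
print(h_out.shape, dL_dh_in.shape, layer.dL_dW.shape)
```

Each module only needs its local derivatives and the upstream gradient, which is what lets an arbitrary graph of such modules be trained end to end.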
A General Framework
Computation = Graph
⬣ Input = Data + Parameters
⬣ Output = Loss
⬣ Scheduling = Topological ordering
Auto-Diff
⬣ A family of algorithms for implementing the chain rule on computation graphs
Simplified notation: denote the bar of a variable as $\bar{a} = \dfrac{\partial f}{\partial a}$
Example
$f(x_1, x_2) = x_1 x_2 + \sin(x_2)$
$\bar{a}_3 = \dfrac{\partial f}{\partial a_3} = 1$
Path 1 (through $\sin$): $\;\bar{x}_2^{P1} = \dfrac{\partial f}{\partial a_1}\dfrac{\partial a_1}{\partial x_2} = \bar{a}_1 \cos(x_2)$
Path 2 (through $\times$): $\;\bar{x}_2^{P2} = \dfrac{\partial f}{\partial a_2}\dfrac{\partial a_2}{\partial x_2} = \dfrac{\partial f}{\partial a_2}\dfrac{\partial (x_1 x_2)}{\partial x_2} = \bar{a}_2 x_1$
$\bar{x}_1 = \dfrac{\partial f}{\partial a_2}\dfrac{\partial a_2}{\partial x_1} = \bar{a}_2 x_2$
Gradients from multiple paths are summed.
Example (continued)
$f(x_1, x_2) = x_1 x_2 + \sin(x_2)$
Summing the contributions from both paths gives the full gradients:
$\bar{x}_2 = \bar{a}_1 \cos(x_2) + \bar{a}_2 x_1, \qquad \bar{x}_1 = \bar{a}_2 x_2$
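The same reverse-mode computation written out by hand, as a minimal sketch (variable names follow the slide's $a_i$ and bar notation):

```python
import math

def f_and_grad(x1, x2):
    """Reverse-mode AD by hand for f = x1*x2 + sin(x2)."""
    # Forward pass, in topological order
    a1 = math.sin(x2)
    a2 = x1 * x2
    a3 = a1 + a2                       # a3 = f

    # Backward pass: bar(a) = df/da
    a3_bar = 1.0
    a1_bar = a3_bar * 1.0              # d(a1 + a2)/da1 = 1
    a2_bar = a3_bar * 1.0              # d(a1 + a2)/da2 = 1
    x1_bar = a2_bar * x2               # single path, through *
    x2_bar = a1_bar * math.cos(x2) + a2_bar * x1   # two paths, summed
    return a3, (x1_bar, x2_bar)

print(f_and_grad(2.0, 0.5))
```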
Computational Implementation
Note that we can also do forward-mode automatic differentiation: start from the inputs and propagate derivatives forward, e.g.
$\dot{w}_1 = \cos(x_1)\,\dot{x}_1, \qquad \dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2, \qquad \dot{w}_3 = \dot{w}_1 + \dot{w}_2$
Complexity is proportional to the input size.
[Figure: the computation graph with $\sin$, $\times$, and $+$ nodes, derivatives flowing from the inputs toward the output.]
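Forward mode can be implemented by propagating (value, tangent) pairs through each node. This is a hedged sketch consistent with the tangent equations above; the exact function composed here is our illustrative choice:

```python
import math

def forward_mode(x1, x2, dx1, dx2):
    """Propagate (value, tangent) pairs from the inputs to the output.
    The tangents (dx1, dx2) pick which directional derivative we get."""
    w1, dw1 = math.sin(x1), math.cos(x1) * dx1       # sin node
    w2, dw2 = x1 * x2,      dx1 * x2 + x1 * dx2      # * node (product rule)
    w3, dw3 = w1 + w2,      dw1 + dw2                # + node
    return w3, dw3

# One pass per input direction, hence cost proportional to input size:
print(forward_mode(2.0, 0.5, 1.0, 0.0))   # derivative w.r.t. x1
print(forward_mode(2.0, 0.5, 0.0, 1.0))   # derivative w.r.t. x2
```

Because each input direction needs its own pass, reverse mode is preferred when the output is a single scalar loss and the parameters are many.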
Automatic Differentiation
⬣ A graph is created on the fly
⬣ Differentiable programming
[Figure: a graph built on the fly from operations on $W_h$, $h$, $W_x$, and $x$. Adapted from figure by Andrej Karpathy.]
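PyTorch's autograd is one example of this define-by-run style; a minimal sketch mirroring the $W_h$, $h$, $W_x$, $x$ figure (assuming PyTorch is installed):

```python
import torch

# Leaf tensors; requires_grad=True tells autograd to record operations on them
W_h = torch.randn(3, 3, requires_grad=True)
W_x = torch.randn(3, 3, requires_grad=True)
h = torch.randn(3, 1)
x = torch.randn(3, 1)

# The graph is built on the fly as these operations execute
h_next = torch.tanh(W_h @ h + W_x @ x)
loss = h_next.sum()

loss.backward()                      # reverse-mode AD over the recorded graph
print(W_h.grad.shape, W_x.grad.shape)
```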
Log Loss
$\bar{u} = \dfrac{\partial L}{\partial u} = \dfrac{\partial L}{\partial p}\dfrac{\partial p}{\partial u} = \bar{p}\,\sigma(w^T x)\bigl(1 - \sigma(w^T x)\bigr)$
$\bar{w} = \dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial u}\dfrac{\partial u}{\partial w} = \bar{u}\,x^T$
We can do this in a combined way to see all the terms together; this effectively shows the gradient flow along the path from $L$ to $w$:
$\bar{w} = \dfrac{\partial L}{\partial p}\dfrac{\partial p}{\partial u}\dfrac{\partial u}{\partial w} = -\dfrac{1}{\sigma(w^T x)}\,\sigma(w^T x)\bigl(1 - \sigma(w^T x)\bigr)\,x^T = -\bigl(1 - \sigma(w^T x)\bigr)x^T$
The chained factors have shapes $1 \times 1$, $1 \times 1$, and $1 \times d$ (with $x$ being $d \times 1$), so the resulting gradient $\bar{w}$ is $1 \times d$.
Automatic differentiation:
⬣ Carries out this procedure for us on arbitrary graphs
⬣ Knows the derivatives of primitive functions
⬣ As a result, we just define these (forward) functions and don't even need to specify the gradient (backward) functions!
⬣ Extremely efficient on graphics processing units (GPUs)
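A hedged NumPy sketch that walks this chain backward one module at a time and checks it against the combined expression $-(1 - \sigma(w^T x))\,x^T$ (shapes chosen to match the slide):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_loss_backward(w, x):
    """Backprop through L = -log(sigmoid(w^T x)), one module at a time.
    w is 1 x d and x is d x 1, so each gradient shape matches the slide."""
    u = w @ x                          # 1 x 1
    p = sigmoid(u)                     # 1 x 1
    p_bar = -1.0 / p                   # dL/dp, 1 x 1
    u_bar = p_bar * p * (1.0 - p)      # dL/du = dL/dp * dp/du, 1 x 1
    w_bar = u_bar * x.T                # dL/dw = dL/du * du/dw, 1 x d
    return w_bar

w = np.array([[0.5, -1.0]])            # 1 x d
x = np.array([[2.0], [1.0]])           # d x 1
print(log_loss_backward(w, x))
print(-(1.0 - sigmoid(w @ x)) * x.T)   # matches -(1 - sigma(w^T x)) x^T
```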
Vectorized Computations
Input $h^{\ell-1}$ → Function (parameters $W$) → Output $h^{\ell}$
Linear layer: $h^{\ell} = W h^{\ell-1}$, where each row of $W$ is a weight vector $w_i^T$
Rectified Linear Unit: $h^{\ell} = \max(0, h^{\ell-1})$
⬣ Performed element-wise
How many parameters for this layer?
Jacobian of ReLU
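A short sketch of these vectorized modules and of the Jacobian the last heading asks about: the ReLU's Jacobian is a diagonal matrix with ones where the input is positive and zeros elsewhere (the example values below are made up):

```python
import numpy as np

def linear(W, h_prev):
    """Linear layer h = W @ h_prev; W contributes out_dim * in_dim parameters."""
    return W @ h_prev

def relu(h_prev):
    """ReLU applied element-wise; this layer has no parameters."""
    return np.maximum(0.0, h_prev)

def relu_jacobian(h_prev):
    """Jacobian of ReLU w.r.t. its input: diagonal indicator matrix."""
    return np.diag((h_prev > 0).astype(float))

h_prev = np.array([0.6, -0.5, 5.0])
W = np.random.randn(2, 3)                # 2 * 3 = 6 parameters
print(linear(W, h_prev))
print(relu(h_prev))
print(relu_jacobian(h_prev))
```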