This document discusses backpropagation and gradient descent algorithms for training deep learning models. It begins with an overview of backpropagation for computing gradients to minimize a loss function. It then provides a pictorial illustration of gradient descent by representing neural networks as real circuits. It explains how to compute gradients through multiple layers and nodes with activation functions and weights. It concludes by discussing stochastic gradient descent and mini-batch gradient descent for optimizing deep learning models on large datasets.

Deep Learning Basics

Lecture 2: Backpropagation
Princeton University COS 495
Instructor: Yingyu Liang
How to train the dragon?


[Figure: a feedforward network; the input x passes through hidden layers h1, h2, …, hL to produce the output y]
How to get the expected output

[Figure: the input x is fed through the network to produce f_θ(x); compared with the target y, the current loss l(x; θ) ≠ 0]

Loss of the system: l(x; θ) = l(f_θ, x, y)
How to get the expected output
Find a direction d so that l(x; θ + d) ≈ 0

[Figure: moving the parameters from θ to θ + d drives the loss l(x; θ + d) toward 0]
How to get the expected output
How to find d: for a small scalar ε, l(x; θ + εv) ≈ l(x; θ) + ∇l(x; θ) · (εv)

[Figure: the first-order approximation suggests a direction d with l(x; θ + d) ≈ 0]
How to get the expected output
Conclusion: move θ along −∇l(x; θ) for a small amount

[Figure: the update θ + d with d along −∇l(x; θ) lowers the loss l(x; θ + d)]
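A tiny numerical check of this conclusion, as a sketch: the one-dimensional quadratic loss below is an assumption for illustration, not something from the lecture. A small step along −∇l lowers the loss, exactly as the first-order argument predicts.

def l(theta):
    return (theta - 3.0) ** 2          # assumed toy loss with minimum at theta = 3

def grad_l(theta):
    return 2.0 * (theta - 3.0)

theta, eps = 0.0, 0.01
print(l(theta), l(theta - eps * grad_l(theta)))   # the second value is smaller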
Neural Networks as real circuits
Pictorial illustration of gradient descent
Gradient
• Gradient of the loss is simple
• E.g., l(f_θ, x, y) = (f_θ(x) − y)² / 2
• ∂l/∂θ = (f_θ(x) − y) · ∂f_θ(x)/∂θ
• Key part: gradient of the hypothesis
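A short sketch of this computation in Python; the linear hypothesis f_θ(x) = θ·x and the example values are assumptions, not from the slides. It compares the chain-rule gradient against a finite-difference check.

def f(theta, x):
    return theta * x          # hypothesis (assumed linear for illustration)

def loss(theta, x, y):
    return 0.5 * (f(theta, x) - y) ** 2

def grad_loss(theta, x, y):
    # chain rule: dl/dtheta = (f_theta(x) - y) * df/dtheta, with df/dtheta = x here
    return (f(theta, x) - y) * x

theta, x, y = 0.7, 2.0, 1.0
eps = 1e-6
numeric = (loss(theta + eps, x, y) - loss(theta - eps, x, y)) / (2 * eps)
print(grad_loss(theta, x, y), numeric)   # the two values should agree closely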
Open the box: real circuit
Single neuron

[Figure: a single "−" node takes inputs x1 and x2 and outputs f]

Function: f = x1 − x2
Single neuron

[Figure: the same "−" node, with the local gradients 1 (toward x1) and −1 (toward x2) written on the edges]

Function: f = x1 − x2
Gradient: ∂f/∂x1 = 1, ∂f/∂x2 = −1
Two neurons
[Figure: a "+" node computes x2 = x3 + x4, which feeds the "−" node together with x1 to give f]

Function: f = x1 − x2 = x1 − (x3 + x4)
Two neurons
[Figure: the same circuit, with the local gradients ∂x2/∂x3 = 1 and ∂x2/∂x4 = 1 marked at the "+" node and 1, −1 at the "−" node]

Function: f = x1 − x2 = x1 − (x3 + x4)
Gradient: ∂x2/∂x3 = 1, ∂x2/∂x4 = 1. What about ∂f/∂x3?
Two neurons
[Figure: the backward pass; the upstream gradient −1 at x2 is multiplied by the local gradients, so the edges into x3 and x4 both carry −1]

Function: f = x1 − x2 = x1 − (x3 + x4)
Gradient: ∂f/∂x3 = (∂f/∂x2)(∂x2/∂x3) = −1
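A minimal sketch of this two-neuron circuit as code (the example input values are assumed): a forward pass that evaluates each node, then a backward pass that multiplies each local gradient by the upstream gradient.

def forward(x1, x3, x4):
    x2 = x3 + x4              # "+" node
    f = x1 - x2               # "-" node
    return f, x2

def backward():
    # local gradients of each node (constants here, so no inputs are needed)
    df_dx1, df_dx2 = 1.0, -1.0        # from f = x1 - x2
    dx2_dx3, dx2_dx4 = 1.0, 1.0       # from x2 = x3 + x4
    # chain rule: multiply each local gradient by the upstream gradient df/dx2
    df_dx3 = df_dx2 * dx2_dx3         # = -1
    df_dx4 = df_dx2 * dx2_dx4         # = -1
    return df_dx1, df_dx3, df_dx4

print(forward(5.0, 2.0, 1.0))    # (2.0, 3.0)
print(backward())                # (1.0, -1.0, -1.0)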
Multiple input
[Figure: the "+" node now has three inputs x3, x5, x4, so x2 = x3 + x5 + x4; the local gradient ∂x2/∂x5 = 1 is marked on the new edge]

Function: f = x1 − x2 = x1 − (x3 + x5 + x4)
Gradient: ∂x2/∂x5 = 1
Multiple input
[Figure: the backward pass through the three-input "+" node; the upstream gradient −1 is passed unchanged to x3, x5, and x4]

Function: f = x1 − x2 = x1 − (x3 + x5 + x4)
Gradient: ∂f/∂x5 = (∂f/∂x2)(∂x2/∂x5) = −1
Weights on the edges
[Figure: the inputs into the "+" node now carry weights, so x2 = w3·x3 + w4·x4]

Function: f = x1 − x2 = x1 − (w3·x3 + w4·x4)
Weights on the edges
[Figure: the same circuit, with the weights w3 and w4 drawn as inputs alongside x3 and x4]

Function: f = x1 − x2 = x1 − (w3·x3 + w4·x4)
Weights on the edges
[Figure: the backward pass; the gradients flowing to the weights are −x3 (for w3) and −x4 (for w4)]

Function: f = x1 − x2 = x1 − (w3·x3 + w4·x4)
Gradient: ∂f/∂w3 = (∂f/∂x2)(∂x2/∂w3) = −1 × x3 = −x3
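A small sketch of the weighted circuit (the numeric example values are assumed), showing that the gradient reaching w3 is the upstream gradient −1 times the local gradient x3.

def grad_weights(x1, x3, x4, w3, w4):
    x2 = w3 * x3 + w4 * x4            # weighted "+" node
    f = x1 - x2                       # "-" node
    df_dx2 = -1.0                     # upstream gradient at x2
    df_dw3 = df_dx2 * x3              # = -x3
    df_dw4 = df_dx2 * x4              # = -x4
    return f, df_dw3, df_dw4

print(grad_weights(x1=1.0, x3=2.0, x4=3.0, w3=0.5, w4=-0.5))   # (1.5, -2.0, -3.0)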
Activation
[Figure: an activation σ is applied to the weighted sum before it reaches the "−" node, so x2 = σ(w3·x3 + w4·x4)]

Function: f = x1 − x2 = x1 − σ(w3·x3 + w4·x4)
Activation
[Figure: the pre-activation value is labeled net2 on the edge into σ]

Function: f = x1 − x2 = x1 − σ(w3·x3 + w4·x4)
Let net2 = w3·x3 + w4·x4
Activation
[Figure: the local gradients ∂net2/∂w3 = x3 and ∂x2/∂net2 = σ′(net2) marked on the circuit]

Function: f = x1 − x2 = x1 − σ(w3·x3 + w4·x4)
Gradient: ∂f/∂w3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂w3) = −1 × σ′(net2) × x3 = −σ′(net2)·x3
Activation
[Figure: the backward pass; the gradient flowing into net2 is −σ′(net2) and the gradient reaching w3 is −σ′(net2)·x3]

Function: f = x1 − x2 = x1 − σ(w3·x3 + w4·x4)
Gradient: ∂f/∂w3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂w3) = −1 × σ′(net2) × x3 = −σ′(net2)·x3
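A sketch of the same computation with an activation; σ is assumed here to be the logistic sigmoid, and the input and weight values are illustrative.

import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_w3(x1, x3, x4, w3, w4):
    net2 = w3 * x3 + w4 * x4
    x2 = sigma(net2)
    f = x1 - x2
    dsigma = sigma(net2) * (1.0 - sigma(net2))   # sigma'(net2) for the logistic sigmoid
    df_dw3 = -1.0 * dsigma * x3                  # chain rule: upstream * local gradients
    return f, df_dw3

print(grad_w3(x1=1.0, x3=2.0, x4=3.0, w3=0.1, w4=0.2))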
Multiple paths
[Figure: x3 now feeds two paths; it enters the "+" node (together with x5) that produces x1, and it also enters the weighted sum net2 that goes through σ to x2]

Function: f = x1 − x2 = (x3 + x5) − σ(w3·x3 + w4·x4)
Multiple paths
[Figure: the same two-path circuit, before the backward pass]

Function: f = x1 − x2 = (x3 + x5) − σ(w3·x3 + w4·x4)
Multiple paths
[Figure: the backward pass; the gradients from both paths arrive at x3 and are added]

Function: f = x1 − x2 = (x3 + x5) − σ(w3·x3 + w4·x4)

Gradient: ∂f/∂x3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂x3) + (∂f/∂x1)(∂x1/∂x3) = −1 × σ′(net2) × w3 + 1 × 1 = −σ′(net2)·w3 + 1
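A sketch of the two-path gradient (σ assumed logistic, values illustrative): the contributions of the two paths to ∂f/∂x3 are computed separately and then summed.

import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_x3(x3, x4, x5, w3, w4):
    net2 = w3 * x3 + w4 * x4
    x1 = x3 + x5
    f = x1 - sigma(net2)
    dsig = sigma(net2) * (1.0 - sigma(net2))
    path_through_x1 = 1.0                    # f depends on x1 = x3 + x5 directly
    path_through_x2 = -1.0 * dsig * w3       # f also depends on x3 through net2 and sigma
    return f, path_through_x1 + path_through_x2

print(grad_x3(x3=2.0, x4=3.0, x5=0.5, w3=0.1, w4=0.2))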
Summary
• Forward to compute 𝑓
• Backward to compute the gradients

[Figure: an example circuit; inputs x1 and x2 feed the pre-activations net11 and net21, each passed through σ to give h11 and h12, which are added to produce f]
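A minimal forward/backward sketch of this pattern, not the lecture's code: it assumes each pre-activation is a weighted sum of both inputs and that σ is the logistic sigmoid, and all weight values are illustrative.

import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x1, x2, w11, w12, w21, w22):
    # forward pass: compute and cache every intermediate value
    net11 = w11 * x1 + w12 * x2        # assumed: weighted sum of both inputs
    net21 = w21 * x1 + w22 * x2
    h11, h12 = sigma(net11), sigma(net21)
    f = h11 + h12

    # backward pass: reuse the cached values, one chain-rule step per node
    df_dh11 = df_dh12 = 1.0                      # from f = h11 + h12
    df_dnet11 = df_dh11 * h11 * (1.0 - h11)      # sigma'(net11) = h11 * (1 - h11)
    df_dnet21 = df_dh12 * h12 * (1.0 - h12)
    grads = {"w11": df_dnet11 * x1, "w12": df_dnet11 * x2,
             "w21": df_dnet21 * x1, "w22": df_dnet21 * x2}
    return f, grads

print(forward_backward(1.0, -2.0, 0.5, 0.1, -0.3, 0.8))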
Math form
Gradient descent
• Minimize the loss L̂(θ), where the hypothesis is parametrized by θ

• Gradient descent
  • Initialize θ_0
  • θ_{t+1} = θ_t − η_t ∇L̂(θ_t)
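A minimal gradient-descent loop as a sketch; the least-squares loss, the fixed step size, and the synthetic data are assumptions, not from the slides.

import numpy as np

def gradient_descent(X, y, eta=0.1, steps=100):
    theta = np.zeros(X.shape[1])                  # initialize theta_0
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)     # gradient of the average squared loss
        theta = theta - eta * grad                # theta_{t+1} = theta_t - eta * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
print(gradient_descent(X, y))   # should approach theta_true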
Stochastic gradient descent (SGD)
• Suppose data points arrive one by one

• L̂(θ) = (1/n) Σ_{t=1}^{n} l(θ, x_t, y_t), but we only know l(θ, x_t, y_t) at time t

• Idea: simply do what you can based on local information


• Initialize θ_0
• θ_{t+1} = θ_t − η_t ∇l(θ_t, x_t, y_t)
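An SGD sketch under the same assumed least-squares setup: each update uses only the single example seen at that step.

import numpy as np

def sgd(X, y, eta=0.05, epochs=5):
    theta = np.zeros(X.shape[1])                  # initialize theta_0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):                # examples arrive one by one
            grad_t = (x_t @ theta - y_t) * x_t    # gradient of l(theta, x_t, y_t)
            theta = theta - eta * grad_t          # theta_{t+1} = theta_t - eta_t * grad_t
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
print(sgd(X, y))   # should be close to theta_true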
Mini-batch
• Instead of one data point, work with a small batch of b points:
  (x_{tb+1}, y_{tb+1}), …, (x_{tb+b}, y_{tb+b})

• Update rule:
  θ_{t+1} = θ_t − η_t ∇[ (1/b) Σ_{1≤i≤b} l(θ_t, x_{tb+i}, y_{tb+i}) ]

• Typical batch size: 𝑏 = 128
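A mini-batch sketch under the same assumed setup: each update averages the gradient over a batch of b examples, with the slide's typical b = 128; the data is synthetic.

import numpy as np

def minibatch_sgd(X, y, eta=0.1, b=128, epochs=20):
    theta = np.zeros(X.shape[1])                              # initialize theta_0
    for _ in range(epochs):
        for start in range(0, len(y), b):
            Xb, yb = X[start:start + b], y[start:start + b]   # batch t
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)         # (1/b) sum of per-example grads
            theta = theta - eta * grad                        # update rule from the slide
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
print(minibatch_sgd(X, y))   # should be close to theta_true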
