Chapter 6 - Backpropagation
Notes based on CS231n (Stanford University) and EECS 498-007 / 598-005 (University of Michigan), with permission from Justin Johnson.
[Figure: a computational graph mapping an input image and the weights to the loss Li. Backpropagation computes the gradient of this loss with respect to the weights by walking the graph backwards.]
Simple example: f(x, y, z) = (x + y) z. Introduce the intermediate variable q = x + y, so f = qz.

1. Forward pass: Compute outputs
   q = x + y
   f = qz

2. Backward pass: Compute derivatives
   Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
   Reading directly off f = qz: ∂f/∂q = z and ∂f/∂z = q. The inputs x and y affect f only through q, which is where the chain rule comes in.
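A minimal sketch of this example in plain Python, assuming the sample point x = -2, y = 5, z = -4 (the values are illustrative, not from the notes):

    # Forward pass: compute outputs
    x, y, z = -2.0, 5.0, -4.0    # sample inputs (assumed)
    q = x + y                    # q = 3
    f = q * z                    # f = qz = -12

    # Backward pass: compute derivatives, starting from df/df = 1
    df_dq = z                    # local gradient of f = qz w.r.t. q
    df_dz = q                    # local gradient of f = qz w.r.t. z
    df_dx = df_dq * 1.0          # chain rule: dq/dx = 1
    df_dy = df_dq * 1.0          # chain rule: dq/dy = 1

    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0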
Chain rule: x affects f only through q, so

∂f/∂x = (∂q/∂x) · (∂f/∂q)

Downstream Gradient = Local Gradient * Upstream Gradient

Here ∂f/∂q is the upstream gradient arriving from later in the graph, ∂q/∂x is the local gradient of the node itself, and their product ∂f/∂x is the downstream gradient passed on toward the inputs.
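Hand-derived gradients like these can be sanity-checked with a centered finite-difference approximation (a sketch; the step size h and the sample point are assumptions):

    def f(x, y, z):
        return (x + y) * z

    h = 1e-5
    x, y, z = -2.0, 5.0, -4.0
    num_dx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)
    num_dy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)
    num_dz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)
    print(num_dx, num_dy, num_dz)   # ~ -4.0 -4.0 3.0, matching the chain rule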
Another Example

f(w, x) = 1 / (1 + e^-(w0 x0 + w1 x1 + w2))

with inputs w0 = 2.00, x0 = -1.00, w1 = -3.00, x1 = -2.00, w2 = -3.00.

Forward pass: Compute outputs
w0 x0 = -2.00 and w1 x1 = 6.00; their sum is 4.00, and adding w2 gives 1.00. The remaining chain *(-1), exp, +1, 1/x then produces -1.00, 0.37, 1.37, 0.73.

Backward pass: Compute gradients
Base case: the gradient at the output is ∂f/∂f = 1.00. Every other gradient is [local] * [upstream]:
- 1/x gate: local gradient -1/x^2 = -1/1.37^2 = -0.53; downstream = -0.53 * 1.00 = -0.53
- +1 gate: local gradient 1; downstream stays -0.53
- exp gate: local gradient e^-1.00 = 0.37; downstream = 0.37 * (-0.53) = -0.20
- *(-1) gate: local gradient -1; downstream = 0.20
- add gates: local gradient 1 on every input, so 0.20 is routed unchanged to w2 and to both products
- multiply gates: ∂(xy)/∂x = y and ∂(xy)/∂y = x, giving
  grad w0 = x0 * 0.20 = -0.20, grad x0 = w0 * 0.20 = 0.40,
  grad w1 = x1 * 0.20 = -0.40, grad x1 = w1 * 0.20 = -0.60, grad w2 = 0.20
Sigmoid local gradient: for σ(x) = 1 / (1 + e^-x),

dσ/dx = e^-x / (1 + e^-x)^2 = (1 - σ(x)) σ(x)

The last four gates above jointly compute a sigmoid, so they can be collapsed into a single sigmoid gate whose backward step is one multiplication: (1 - 0.73) * 0.73 = 0.20.
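A sketch of that shortcut in Python (the helper names are illustrative): because the local gradient is expressible through the gate's own output, the backward step needs only the cached forward value.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_backward(upstream, out):
        # local gradient of the sigmoid, written via its output: (1 - out) * out
        return upstream * (1.0 - out) * out

    out = sigmoid(1.0)                  # 0.73, as in the example
    print(sigmoid_backward(1.0, out))   # ~0.20, matching the four-gate chain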
Backward pass: Compute grads

grad_L = 1.0                      # base case: dL/dL = 1
grad_s3 = grad_L * (1 - L) * L    # sigmoid gate: local gradient (1 - L) * L
grad_w2 = grad_s3                 # add gate: routes the gradient unchanged
grad_s2 = grad_s3                 # add gate
grad_s0 = grad_s2                 # add gate
grad_s1 = grad_s2                 # add gate
grad_w1 = grad_s1 * x1            # multiply gate: d(w1 x1)/dw1 = x1
grad_x1 = grad_s1 * w1            # multiply gate: d(w1 x1)/dx1 = w1
grad_w0 = grad_s0 * x0            # multiply gate: d(w0 x0)/dw0 = x0
grad_x0 = grad_s0 * w0            # multiply gate: d(w0 x0)/dx0 = w0
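The notes show only the backward pass; a matching forward pass, reconstructed to fit the variable names above and the example's input values (a sketch, not code from the course):

    import math

    w0, x0 = 2.0, -1.0               # inputs from the worked example
    w1, x1 = -3.0, -2.0
    w2 = -3.0

    # Forward pass: compute outputs
    s0 = w0 * x0                     # -2.00
    s1 = w1 * x1                     #  6.00
    s2 = s0 + s1                     #  4.00
    s3 = s2 + w2                     #  1.00
    L = 1.0 / (1.0 + math.exp(-s3))  #  0.73

Running the backward pass above on these values reproduces grad_w0 = -0.20, grad_x0 = 0.40, grad_w1 = -0.40, grad_x1 = -0.60, grad_w2 = 0.20.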
[Figure: a single node f with inputs x, y and output z. Forward pass: Compute outputs. Backward pass: the node receives the upstream gradient at its output z, multiplies it by the local gradient for each input, and sends the resulting downstream gradients to x and y.]
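This node-level picture suggests a modular implementation in which every gate exposes a forward and a backward method, caching in forward whatever backward will need (a sketch of the pattern; the class and method names are illustrative):

    class MultiplyGate:
        def forward(self, x, y):
            self.x, self.y = x, y        # cache inputs for the backward pass
            return x * y

        def backward(self, upstream):
            # downstream = local * upstream, once per input
            dx = self.y * upstream       # d(xy)/dx = y
            dy = self.x * upstream       # d(xy)/dy = x
            return dx, dy

    gate = MultiplyGate()
    z = gate.forward(-2.0, 3.0)          # forward pass: compute output
    dx, dy = gate.backward(1.0)          # backward pass: compute gradients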
Example: a fully-connected two-layer network.
Input: x (3072) -> W1 -> Hidden layer: h (100) -> W2 -> Output: s (10)
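A numpy sketch of one forward/backward pass through such a network. The ReLU nonlinearity and the upstream gradient grad_s are assumptions for illustration; in practice grad_s would come from a loss on the scores:

    import numpy as np

    rng = np.random.default_rng(0)
    x  = rng.standard_normal(3072)               # input: 3072
    W1 = 0.01 * rng.standard_normal((100, 3072))
    W2 = 0.01 * rng.standard_normal((10, 100))

    # Forward pass: compute outputs
    a = W1 @ x                  # pre-activations of the hidden layer, (100,)
    h = np.maximum(0.0, a)      # hidden layer: 100 (ReLU assumed)
    s = W2 @ h                  # output scores: 10

    # Backward pass: compute gradients
    grad_s  = np.ones_like(s)   # placeholder upstream gradient (assumed)
    grad_W2 = np.outer(grad_s, h)
    grad_h  = W2.T @ grad_s
    grad_a  = grad_h * (a > 0)  # ReLU local gradient: 1 where a > 0, else 0
    grad_W1 = np.outer(grad_a, x)
    grad_x  = W1.T @ grad_a     # gradient with respect to the input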