
Section 3: Gradient Descent & Backpropagation

Practice Problems

Problem 1. Computation Graph Review

Let's assume we have a simple function f(x, y, z) = (x + y)z. We can break this up into the equations q = x + y and f(x, y, z) = qz. Using this simplified notation, we can also represent the function as a computation graph:

Now let's assume that we are evaluating this function at x = -2, y = 5, and z = -4. In addition, let the value of the upstream gradient (the gradient of the loss with respect to our function, ∂L/∂f) equal 1. These are filled out for you in the computation graph.

Solve for the following values, both symbolically (without plugging in specific values of x, y, z) and evaluated at x = -2, y = 5, z = -4, and ∂L/∂f = 1:

Symbolically:                         Evaluated:

1. ∂f/∂q =                            ∂f/∂q =

2. ∂q/∂x =                            ∂q/∂x =

3. ∂q/∂y =                            ∂q/∂y =

4. ∂f/∂z =                            ∂f/∂z =

5. ∂f/∂x =                            ∂f/∂x =

6. ∂f/∂y =                            ∂f/∂y =

Problem 2. Computation Graphs on Steroids

Now let's perform backpropagation through a single neuron of a neural network with a sigmoid activation. Specifically, we will define the pre-activation z = w0 x0 + w1 x1 + w2 and the activation value α = σ(z) = 1 / (1 + e^(−z)). The computation graph is visualized below:

In the graph we've filled out the forward activations, on the top of the lines, as well as the upstream gradient (the gradient of the loss with respect to our neuron, ∂L/∂α). Use this information to compute the rest of the gradients (labelled with question marks) throughout the graph.

Hint: A calculator may be helpful here.
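
If you'd rather script the arithmetic than reach for a calculator, here is a minimal sketch that runs the forward pass and estimates each gradient by central differences. The input values below are placeholders, not the ones from the figure; substitute the numbers shown in your computation graph.

    import math

    def neuron(w0, x0, w1, x1, w2):
        """Forward pass: sigmoid(w0*x0 + w1*x1 + w2)."""
        z = w0 * x0 + w1 * x1 + w2
        return 1.0 / (1.0 + math.exp(-z))

    # Hypothetical inputs -- substitute the values from your computation graph.
    vals = dict(w0=2.0, x0=-1.0, w1=-3.0, x1=-2.0, w2=-3.0)
    dL_dalpha = 1.0  # upstream gradient given in the figure

    print("alpha =", neuron(**vals))

    # Central-difference estimate of dL/d(param) for each input, to check the
    # numbers you fill in on the graph (no symbolic derivatives needed).
    eps = 1e-6
    for name in vals:
        hi = dict(vals, **{name: vals[name] + eps})
        lo = dict(vals, **{name: vals[name] - eps})
        grad = dL_dalpha * (neuron(**hi) - neuron(**lo)) / (2 * eps)
        print(f"dL/d{name} ~ {grad:.4f}")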

Finally, report the symbolic gradients with respect to the input parameters x0, x1, w0, w1, w2:

1. ∂α/∂x0 =

2. ∂α/∂w0 =

3. ∂α/∂x1 =

4. ∂α/∂w1 =

5. ∂α/∂w2 =

Problem 3. Backpropagation Basics: Dimensions & Derivatives

Let's assume we have a two-layer neural network, as defined below:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) Σ_{i=1}^{m} L^(i)

Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our hidden layer; that is, z1's shape is Da1 × 1.

1. What are the shapes of W1, b1, W2, b2? If we were vectorizing this network across multiple examples, what would the shapes of the weights/biases be instead? What would the shapes of X and Y be?

2. What is ∂L^(i)/∂ŷ^(i)? Refer to this result as δ1^(i). Using this result, what is ∂J/∂ŷ?

3. What is ∂ŷ^(i)/∂z2? Refer to this result as δ2^(i).

Equations reproduced below for the remaining parts of the question:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) Σ_{i=1}^{m} L^(i)

Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our hidden layer; that is, z1's shape is Da1 × 1.

4. What is ∂z2/∂a1? Refer to this result as δ3^(i).

5. What is ∂a1/∂z1? Refer to this result as δ4^(i).

6. What is ∂z1/∂W1? Refer to this result as δ5^(i).

7. What is ∂J/∂W1? It may help to reuse work from the previous parts. Hint: Be careful with the shapes!

Problem 4. Bonus!

Apart from simple mathematical operations like multiplication or exponentiation, and piecewise operations like the max used in ReLU activations, we can also perform more complex operations in our neural networks. For this question, we'll explore the sort operation in hopes of better understanding how to backpropagate gradients through a sort. This is applicable in a variety of real-world use cases, including differentiable non-maximum suppression for object detection networks.

For each of the following parts, assume you are given an input vector x ∈ R^n and some upstream gradient vector ∂L/∂F, and you want to calculate ∂L/∂x, where F is a function of x that also returns a vector. You may assume all values in x are distinct. Note that x0 is the first component of the vector x (x = [x0, x1, ..., x_{n-1}]).
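
A useful habit for all three parts: check your hand-derived ∂L/∂x against a finite-difference estimate. The helper below is a small sketch we introduce for that purpose (backprop_check is our own name, not a library function), and the worked example uses an elementwise square so it does not give away any of the parts below.

    import numpy as np

    def backprop_check(F, backward, x, dL_dF, eps=1e-6):
        """Compare a hand-derived backward(x, dL_dF) against finite differences.

        F maps a 1-D array x to a 1-D array F(x); backward returns your claimed
        dL/dx given x and the upstream gradient dL/dF.
        """
        analytic = backward(x, dL_dF)
        numeric = np.zeros_like(x)
        for i in range(x.size):
            bump = np.zeros_like(x)
            bump[i] = eps
            # dL/dx_i = sum_j (dL/dF_j)(dF_j/dx_i), estimated by central differences
            numeric[i] = dL_dF @ (F(x + bump) - F(x - bump)) / (2 * eps)
        return analytic, numeric

    # Worked example with an elementwise square (not one of the parts below):
    x = np.array([3.0, -1.0, 2.0])
    dL_dF = np.array([0.5, -2.0, 1.0])
    analytic, numeric = backprop_check(lambda v: v ** 2, lambda v, g: 2 * v * g,
                                       x, dL_dF)
    print(np.allclose(analytic, numeric))  # True if the derivation matches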

1. F(x) = x0 * x

2. F(x) = sort(x)

3. F(x) = x0 * sort(x)

Section 3 Solutions

Problem 1. Computation Graph Review

1. ∂f/∂q = z = −4
2. ∂q/∂x = 1
3. ∂q/∂y = 1
4. ∂f/∂z = q = x + y = 3
5. ∂f/∂x = z * 1 = z = −4
6. ∂f/∂y = z * 1 = z = −4
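
As a quick sanity check, the chain rule above can be run in a few lines using the values given in the problem (with ∂L/∂f = 1, these are also the gradients of the loss):

    # Forward and backward pass for f(x, y, z) = (x + y) * z at the given values.
    x, y, z = -2.0, 5.0, -4.0

    q = x + y              # 3
    f = q * z              # -12

    df_dq = z              # -4
    df_dz = q              # 3
    dq_dx = dq_dy = 1.0
    df_dx = df_dq * dq_dx  # -4 (chain rule through q)
    df_dy = df_dq * dq_dy  # -4
    print(q, f, df_dq, df_dz, df_dx, df_dy)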

Problem 2. Computation Graphs on Steroids

1. ∂α/∂x0 = σ(z)(1 − σ(z)) * w0

2. ∂α/∂w0 = σ(z)(1 − σ(z)) * x0

3. ∂α/∂x1 = σ(z)(1 − σ(z)) * w1

4. ∂α/∂w1 = σ(z)(1 − σ(z)) * x1

5. ∂α/∂w2 = σ(z)(1 − σ(z))
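
All five gradients share the factor σ(z)(1 − σ(z)), the derivative of the sigmoid. Here is a minimal spot check of that pattern against a central-difference estimate (the input values are arbitrary, not the ones from the figure):

    import math

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    # Arbitrary values for a spot check
    w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
    z = w0 * x0 + w1 * x1 + w2
    s = sigmoid(z)

    # Analytic gradient d(alpha)/d(w0) = sigma(z)(1 - sigma(z)) * x0,
    # compared against a central-difference estimate.
    analytic = s * (1 - s) * x0
    eps = 1e-6
    numeric = (sigmoid((w0 + eps) * x0 + w1 * x1 + w2)
               - sigmoid((w0 - eps) * x0 + w1 * x1 + w2)) / (2 * eps)
    print(analytic, numeric)  # the two agree to roughly 1e-10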

Problem 3. Backpropagation Basics: Dimensions & Derivatives

1. W1 ∈ R^(Da1 × Dx), b1 ∈ R^(Da1 × 1), W2 ∈ R^(1 × Da1), b2 ∈ R^(1 × 1). The shapes of the weights/biases would be the same after vectorizing. After vectorizing across multiple examples, X ∈ R^(Dx × m) and Y ∈ R^(1 × m).
2. δ1^(i) = y^(i)/ŷ^(i) − (1 − y^(i))/(1 − ŷ^(i)).  ∂J/∂ŷ = −(1/m) Σ_i δ1^(i)
3. δ2^(i) = σ(z2)(1 − σ(z2))

4. δ3^(i) = W2

5. δ4^(i) = 0 if z1 < 0, 1 if z1 >= 0 (applied element-wise to z1)

6. δ5^(i) = (x^(i))^T

7. ∂J/∂W1 = −(1/m) Σ_i δ1^(i) * δ2^(i) * ((δ3^(i))^T ∘ δ4^(i)) * δ5^(i) (a shape-checked sketch in code follows below)
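
To make the shape bookkeeping concrete, here is a minimal single-example sketch (m = 1, with small arbitrary dimensions) that chains the δ terms above into ∂J/∂W1 and spot-checks one entry against a finite-difference estimate. The variable names mirror the solution; the sizes and random values are our own choices, not anything from course starter code.

    import numpy as np

    rng = np.random.default_rng(0)
    Dx, Da1, m = 4, 3, 1                       # tiny sizes, single example (m = 1)

    x = rng.normal(size=(Dx, 1))               # x^(i), shape (Dx, 1)
    y = 1.0                                    # y^(i), a scalar label
    W1 = rng.normal(size=(Da1, Dx)); b1 = rng.normal(size=(Da1, 1))
    W2 = rng.normal(size=(1, Da1));  b2 = rng.normal(size=(1, 1))

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def cost(W1):
        z1 = W1 @ x + b1
        a1 = np.maximum(z1, 0)                 # ReLU
        yhat = sigmoid(W2 @ a1 + b2)[0, 0]
        return -(1 / m) * (y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

    # Forward pass
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0)
    yhat = sigmoid(W2 @ a1 + b2)[0, 0]

    # Backward pass, chaining the deltas from the solutions above
    delta1 = y / yhat - (1 - y) / (1 - yhat)   # dL/dyhat,  scalar
    delta2 = yhat * (1 - yhat)                 # dyhat/dz2, scalar
    delta3 = W2                                # dz2/da1,   shape (1, Da1)
    delta4 = (z1 >= 0).astype(float)           # da1/dz1,   shape (Da1, 1)
    delta5 = x.T                               # dz1/dW1 factor, shape (1, Dx)
    dJ_dW1 = -(1 / m) * delta1 * delta2 * (delta3.T * delta4) @ delta5  # (Da1, Dx)

    # Spot-check one entry against a central-difference estimate
    i, j, eps = 1, 2, 1e-6
    bump = np.zeros_like(W1); bump[i, j] = eps
    numeric = (cost(W1 + bump) - cost(W1 - bump)) / (2 * eps)
    print(dJ_dW1.shape, np.isclose(dJ_dW1[i, j], numeric))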

Problem 4. Bonus!

1. As an example, say x = [x0, x1, x2]. Then F(x) = [x0 * x0, x0 * x1, x0 * x2], and ∂L/∂x will be a vector. For component i, where i is not 0, it is ∂L/∂F_i * x0. For the 0th component, it is 2 * x0 * ∂L/∂F_0 + Σ_{i≠0} ∂L/∂F_i * x_i.
2. Sorting will simply reroute the gradients. As an example, say x = [x0, x1, x2], we have upstream gradients ∂L/∂F = [∂_0, ∂_1, ∂_2], and F(x) = [x1, x2, x0]. Then ∂L/∂x = [∂_2, ∂_0, ∂_1] (move the gradients so as to reverse the permutation from x to F(x)).

3. This can be viewed as a computation graph where the multiplication happens first and then the sorting happens. As such, it simply requires rerouting the gradients to account for the sort as in #2, and then applying the product rule from #1 to account for multiplying by x0; a short sketch in code follows below.
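
The rerouting in parts 2 and 3 is easy to express with argsort; here is a minimal sketch (with arbitrary example values) that mirrors the reasoning above:

    import numpy as np

    # Part 2: backprop through F(x) = sort(x) just reroutes the upstream gradient.
    x = np.array([0.3, -1.2, 2.5, 0.7])
    dL_dF = np.array([10.0, 20.0, 30.0, 40.0])     # arbitrary upstream gradient

    order = np.argsort(x)          # position j of sort(x) is x[order[j]]
    dL_dx = np.zeros_like(x)
    dL_dx[order] = dL_dF           # send each upstream component back to its source
    print(dL_dx)                   # [20. 10. 40. 30.]

    # Part 3: F(x) = x[0] * sort(x). Reroute through the sort as above, then apply
    # the product rule: every component of sort(x) is scaled by x[0], and x[0]
    # also receives the gradient that flows through the scaling factor itself.
    s = np.sort(x)
    dL_dx3 = np.zeros_like(x)
    dL_dx3[order] = dL_dF * x[0]   # gradient flowing through sort(x)
    dL_dx3[0] += dL_dF @ s         # gradient flowing through the x[0] factor
    print(dL_dx3)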
