Backpropagation Exercises
Practice Problems
Let's assume we have a simple function f(x, y, z) = (x + y)z. We can break this up into the
equations q = x + y and f(x, y, z) = qz. Using this simplified notation, we can also represent
this equation as a computation graph:
Now let's assume that we are evaluating this function at x = -2, y = 5, and z = -4. In addition let the
value of the upstream gradient (gradient of the loss with respect to our function, ∂ L/∂f ) equal 1.
These are filled out for you in the computation graph.
Solve for the following values, both symbolically (without plugging in specific values of x/y/z), and
evaluated at x = -2, y = 5, z = -4, and ∂L/∂f = 1:
Symbolically:                         Evaluated:
1. ∂f/∂q =                            ∂f/∂q =
2. ∂q/∂x =                            ∂q/∂x =
3. ∂q/∂y =                            ∂q/∂y =
4. ∂f/∂z =                            ∂f/∂z =
5. ∂f/∂x =                            ∂f/∂x =
6. ∂f/∂y =                            ∂f/∂y =
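If you want to check your answers numerically, the following is a minimal Python sketch (the helper name numerical_grad is just for illustration, not part of the exercise) that estimates each gradient with central finite differences at the given point. Since the upstream gradient is 1, these estimates can be compared directly against your symbolic answers.

```python
# Minimal sketch: estimate the Problem 1 gradients at x = -2, y = 5, z = -4
# with central finite differences, so you can compare against your symbolic answers.

def f(x, y, z):
    q = x + y          # intermediate node q = x + y
    return q * z       # f = q * z

def numerical_grad(fn, args, idx, eps=1e-6):
    """Central finite-difference estimate of d(fn)/d(args[idx])."""
    hi = list(args); hi[idx] += eps
    lo = list(args); lo[idx] -= eps
    return (fn(*hi) - fn(*lo)) / (2 * eps)

point = (-2.0, 5.0, -4.0)
for name, idx in [("x", 0), ("y", 1), ("z", 2)]:
    print(f"df/d{name} ~= {numerical_grad(f, point, idx):.4f}")
```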
Now let's perform backpropagation through a single neuron of a neural network with a sigmoid
activation. Specifically, we will define the pre-activation z = w0 x0 + w1 x1 + w2 and we will define
the activation value α = σ(z) = 1 / (1 + e^(−z)). The computation graph is visualized below:
In the graph we've filled out the forward activations, on the top of the lines, as well as the
upstream gradient (gradient of the loss with respect to our neuron, ∂ L/∂α ). Use this information
to compute the rest of the gradients (labelled with question marks) throughout the graph.
Finally, report the symbolic gradients with respect to the input parameters x0, x1, w0, w1, w2:
1. ∂α/∂x0 =
2. ∂α/∂w0 =
3. ∂α/∂x1 =
4. ∂α/∂w1 =
5. ∂α/∂w2 =
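As a point of comparison once you have worked through the graph by hand, here is a minimal sketch of a forward and backward pass through such a neuron, processed one node at a time. The numeric inputs are placeholders, so substitute the values shown in the worksheet's graph.

```python
import math

# Minimal sketch: forward and backward pass through one sigmoid neuron,
# z = w0*x0 + w1*x1 + w2, alpha = sigmoid(z). The numbers below are
# placeholders; substitute the values from the worksheet's computation graph.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, one node of the computation graph at a time.
a = w0 * x0                          # multiply node
b = w1 * x1                          # multiply node
z = a + b + w2                       # add node (pre-activation)
alpha = 1.0 / (1.0 + math.exp(-z))   # sigmoid node

# Backward pass with upstream gradient dL/dalpha = 1.
dalpha = 1.0
dz = dalpha * alpha * (1 - alpha)    # sigmoid local gradient: sigma(z) * (1 - sigma(z))
da, db, dw2 = dz, dz, dz             # an add node routes the gradient through unchanged
dw0, dx0 = da * x0, da * w0          # a multiply node swaps its inputs
dw1, dx1 = db * x1, db * w1
print(dx0, dw0, dx1, dw1, dw2)
```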
Now consider a simple two-layer network defined by the following equations:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) ∑_{i=1}^{m} L^(i)
Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single
output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our
hidden layer; that is, z1's shape is Da1 × 1.
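To make the shapes concrete, here is a minimal NumPy sketch of the forward pass for a single example; the sizes Dx = 4 and Da1 = 3 and the random initialization are arbitrary choices for illustration, not part of the problem.

```python
import numpy as np

# Minimal sketch of the forward pass for one example. The sizes and the
# random initialization are arbitrary, chosen only to make shapes concrete.
rng = np.random.default_rng(0)
Dx, Da1 = 4, 3

x_i = rng.normal(size=(Dx, 1))         # x^(i): Dx x 1
y_i = 1.0                              # y^(i): scalar label

W1 = rng.normal(size=(Da1, Dx)); b1 = np.zeros((Da1, 1))
W2 = rng.normal(size=(1, Da1));  b2 = np.zeros((1, 1))

z1 = W1 @ x_i + b1                     # Da1 x 1
a1 = np.maximum(z1, 0)                 # ReLU, Da1 x 1
z2 = W2 @ a1 + b2                      # 1 x 1
y_hat = 1.0 / (1.0 + np.exp(-z2))      # sigmoid, 1 x 1

L_i = y_i * np.log(y_hat) + (1 - y_i) * np.log(1 - y_hat)   # per-example term of J
print(z1.shape, a1.shape, z2.shape, y_hat.shape, L_i.shape)
```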
2. What is ∂L^(i)/∂ŷ^(i)? Refer to this result as δ1^(i). Using this result, what is ∂J/∂ŷ^(i)?
3. What is ∂ŷ^(i)/∂z2? Refer to this result as δ2^(i).
7. What is ∂J/∂W1? It may help to reuse work from the previous parts. Hint: Be careful with
the shapes!
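One way to sanity-check an answer to this part is a finite-difference probe of J with respect to individual entries of W1. This is a rough sketch only, reusing the arbitrary shapes from the forward-pass snippet above and assuming a single example (m = 1).

```python
import numpy as np

# Rough sketch: finite-difference estimate of dJ/dW1 for a single example (m = 1),
# with arbitrary shapes chosen only for illustration.
def J_of(W1, W2, b1, b2, x_i, y_i):
    z1 = W1 @ x_i + b1
    a1 = np.maximum(z1, 0)
    y_hat = 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))
    L_i = y_i * np.log(y_hat) + (1 - y_i) * np.log(1 - y_hat)
    return -L_i.item()                 # J = -(1/m) * sum of L^(i), with m = 1

rng = np.random.default_rng(0)
Dx, Da1 = 4, 3
x_i = rng.normal(size=(Dx, 1)); y_i = 1.0
W1 = rng.normal(size=(Da1, Dx)); b1 = np.zeros((Da1, 1))
W2 = rng.normal(size=(1, Da1));  b2 = np.zeros((1, 1))

eps = 1e-6
dJ_dW1 = np.zeros_like(W1)
for r in range(Da1):
    for c in range(Dx):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[r, c] += eps
        Wm[r, c] -= eps
        dJ_dW1[r, c] = (J_of(Wp, W2, b1, b2, x_i, y_i) -
                        J_of(Wm, W2, b1, b2, x_i, y_i)) / (2 * eps)

print(dJ_dW1)                          # compare entrywise against your analytic answer
```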
Apart from simple mathematical operations like multiplication or exponentiation, and piecewise
operations like the max used in ReLU activations, we can also perform more complex operations
in our neural networks. For this question, we'll explore the sort operation in the hope of better
understanding how to backpropagate gradients through a sort. This is applicable in a variety of
real-world use cases, including differentiable non-max suppression for object detection
networks.
For each of the following parts, assume you are given an input vector x ∈ R^n and some
upstream gradient vector ∂L/∂F, and you want to calculate ∂L/∂x, where F is a function of x
that also returns a vector. You may assume all values in x are distinct. Note that x0 is the first
component of the vector x: x = [x0, x1, ..., xn−1].
1. F(x) = x0 * x
2. F(x) = sort(x)
3. F(x) = sort(x0 * x)
Solutions

Problem 1:
1. ∂f/∂q = z = −4
2. ∂q/∂x = 1
3. ∂q/∂y = 1
4. ∂f/∂z = q = x + y = 3
5. ∂f/∂x = z * 1 = z = −4
6. ∂f/∂y = z * 1 = z = −4
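A few lines of Python reproduce these values directly from the chain rule at the given point:

```python
# Quick check of the Problem 1 solution values via the chain rule,
# evaluated at x = -2, y = 5, z = -4 with upstream gradient dL/df = 1.
x, y, z = -2.0, 5.0, -4.0

q = x + y                   # forward: q = 3
df_dq, df_dz = z, q         # local gradients of f = q * z
dq_dx, dq_dy = 1.0, 1.0     # local gradients of q = x + y

df_dx = df_dq * dq_dx       # chain rule: z * 1 = -4
df_dy = df_dq * dq_dy       # chain rule: z * 1 = -4
print(df_dq, dq_dx, dq_dy, df_dz, df_dx, df_dy)   # -4.0 1.0 1.0 3.0 -4.0 -4.0
```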
Problem 3:
4. δ3^(i) = W2^T
6. δ5^(i) = x^(i)
7. δ6^(i) = ∂J/∂W1 = −(1/m) ∑_i δ1^(i) * δ2^(i) * (δ3^(i) ∘ δ4^(i)) * (δ5^(i))^T
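To convince yourself that the shapes in the expression for part 7 line up, the short sketch below assembles the δ terms for a single example; it assumes (an assumption, since that part of the solution is not shown above) that δ4^(i) is the elementwise ReLU derivative, 1 where z1 > 0 and 0 elsewhere.

```python
import numpy as np

# Shape check for the dJ/dW1 expression, for one example (m = 1).
# Assumption: delta4 is the elementwise ReLU derivative (1 where z1 > 0, else 0).
rng = np.random.default_rng(0)
Dx, Da1 = 4, 3
x_i = rng.normal(size=(Dx, 1)); y_i = 1.0
W1 = rng.normal(size=(Da1, Dx)); b1 = np.zeros((Da1, 1))
W2 = rng.normal(size=(1, Da1));  b2 = np.zeros((1, 1))

z1 = W1 @ x_i + b1
a1 = np.maximum(z1, 0)
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))

delta1 = y_i / y_hat - (1 - y_i) / (1 - y_hat)    # dL/dy_hat (1 x 1)
delta2 = y_hat * (1 - y_hat)                      # dy_hat/dz2 (1 x 1)
delta3 = W2.T                                     # Da1 x 1
delta4 = (z1 > 0).astype(float)                   # Da1 x 1 (assumed ReLU derivative)
delta5 = x_i                                      # Dx x 1

dJ_dW1 = -(delta1 * delta2) * (delta3 * delta4) @ delta5.T   # Da1 x Dx, with m = 1
print(dJ_dW1.shape)   # should equal W1.shape == (Da1, Dx)
```

With the same seed and draw order as the finite-difference sketch earlier, the entries should also agree with that numerical estimate up to finite-difference error.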
Problem 4. Bonus!
3. This can be viewed as a computation graph where the multiplication happens first and
then the sorting happens. As such, it simply requires rerouting the gradients to account
for the sort, as in #2, and then applying the multiplication rule from #1 to account for
multiplying by x0.
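To make the rerouting concrete, here is a minimal sketch for part 3, F(x) = sort(x0 * x), with arbitrary example values; np.argsort recovers the permutation the sort applied, the upstream gradient is routed back through it, and then the product rule handles the multiplication by x0.

```python
import numpy as np

# Minimal sketch of the backward pass for F(x) = sort(x0 * x), combining the
# gradient rerouting for sort (#2) with the product rule for x0 * x (#1).
x = np.array([3.0, -1.0, 2.0, 0.5])        # arbitrary example input
dL_dF = np.array([0.1, 0.2, 0.3, 0.4])     # arbitrary upstream gradient dL/dF

y = x[0] * x                 # multiplication node: y_k = x0 * x_k
perm = np.argsort(y)         # sort node: F_j = y[perm[j]]
F = y[perm]

# Backward through the sort: each upstream entry is routed back, unchanged,
# to the position it came from.
dL_dy = np.empty_like(dL_dF)
dL_dy[perm] = dL_dF

# Backward through the multiplication by x0: each y_k depends on x_k through
# the factor x0, and every y_k also depends on x0 itself.
dL_dx = x[0] * dL_dy
dL_dx[0] += dL_dy @ x
print(F, dL_dx)
```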