DSAIT4115 Deep Reinforcement Learning Exercise Sheet 0: Solutions
The following exercises do not have to be submitted as homework, but may be helpful to practice the required math and to prepare for the exam. Some questions are from old exams and include the rubric that was used. You do not have to submit these questions and will not receive points for them.
Solution For f(x, y) = x² + y² and g(x, y) = x² − y², the point a = (0, 0) is a critical point of both functions:
$$\nabla f(a) = (2x,\, 2y)\big|_{(x,y)=a} = (0,\, 0) \qquad \text{and} \qquad \nabla g(a) = (2x,\, -2y)\big|_{(x,y)=a} = (0,\, 0)\,.$$
To classify a, we inspect the Hessians:
$$(H_f)(a) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \;\Rightarrow\; (2-\lambda)^2 \overset{!}{=} 0\,,$$
i.e. all eigenvalues (2 and 2) are real and positive ⇒ H_f is positive definite. Thus, a is a minimum of f.
$$(H_g)(a) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix} \;\Rightarrow\; (2-\lambda)(-2-\lambda) \overset{!}{=} 0\,,$$
which has one positive and one negative eigenvalue (2 and −2) ⇒ H_g is neither positive nor negative definite. Therefore a is a saddle point, but not an extremum, of g.
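The classification by Hessian eigenvalues can be checked numerically; a minimal numpy sketch (not part of the original exercise):

```python
import numpy as np

# Hessians of f(x, y) = x^2 + y^2 and g(x, y) = x^2 - y^2 at a = (0, 0)
H_f = np.array([[2.0, 0.0], [0.0, 2.0]])
H_g = np.array([[2.0, 0.0], [0.0, -2.0]])

for name, H in [("f", H_f), ("g", H_g)]:
    eig = np.linalg.eigvalsh(H)  # symmetric matrix -> real eigenvalues
    if np.all(eig > 0):
        verdict = "positive definite -> minimum"
    elif np.all(eig < 0):
        verdict = "negative definite -> maximum"
    else:
        verdict = "indefinite -> saddle point"
    print(f"H_{name}: eigenvalues {eig} -> {verdict}")
```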
Solution
(a) For p(x) = ½ sin(x) on [0, π] (and p(x) = 0 otherwise) to be a probability density, it is required that (i) p(x) ≥ 0 ∀x ∈ ℝ, which is fulfilled here, and that (ii) p is normalized appropriately:
$$\int_{\mathbb{R}} p(x)\,dx = \frac{1}{2}\int_0^{\pi}\sin(x)\,dx = \frac{1}{2}\big[-\cos(x)\big]_0^{\pi} = 1\,.$$
(b) To calculate the expected value, we use integration by parts, i.e., for any functions f and g:
$$\int_a^b f g' \,dx = \big[f g\big]_a^b - \int_a^b f' g \,dx\,.$$
This yields
$$\mu := \mathbb{E}_p[x] = \frac{1}{2}\int_0^\pi x \sin(x)\,dx = -\frac{1}{2}\big[x \cos(x)\big]_0^\pi + \underbrace{\frac{1}{2}\int_0^\pi \cos(x)\,dx}_{=0} = \frac{\pi}{2}\,.$$
For the second moment, integrating by parts twice gives
$$\mathbb{E}_p[x^2] = \frac{1}{2}\int_0^\pi x^2 \sin(x)\,dx = -\frac{1}{2}\big[x^2 \cos(x)\big]_0^\pi + \int_0^\pi x \cos(x)\,dx = \frac{\pi^2}{2} + k$$
with
$$k = \int_0^\pi x \cos(x)\,dx = \big[x \sin(x)\big]_0^\pi - \int_0^\pi \sin(x)\,dx = 0 + \big[\cos(x)\big]_0^\pi = -2$$
and therefore
$$\mathbb{E}_p[x^2] = \frac{\pi^2}{2} - 2\,,$$
yielding
$$\mathbb{E}_p[(x-\mu)^2] = \mathbb{E}_p[x^2] - \mu^2 = \frac{\pi^2}{2} - 2 - \frac{\pi^2}{4} = \frac{\pi^2}{4} - 2\,.$$
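Both moments can be verified numerically; a small sketch, assuming the density p(x) = ½ sin(x) on [0, π] as above:

```python
import numpy as np
from scipy.integrate import quad

p = lambda x: 0.5 * np.sin(x)  # density on [0, pi], zero elsewhere

norm, _ = quad(p, 0, np.pi)                     # should be 1
mean, _ = quad(lambda x: x * p(x), 0, np.pi)    # should be pi/2
ex2, _ = quad(lambda x: x**2 * p(x), 0, np.pi)  # should be pi^2/2 - 2
var = ex2 - mean**2                             # should be pi^2/4 - 2

print(norm, mean, np.pi / 2)
print(var, np.pi**2 / 4 - 2)
```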
Prove that the variance of the empirical mean $f_n := \frac{1}{n}\sum_{i=1}^n x_i$, based on n samples x_i ∈ ℝ drawn i.i.d. from the Gaussian distribution N(μ, σ²), is $\mathbb{V}[f_n] = \frac{\sigma^2}{n}$, without using the fact that the variance of a sum of independent variables is the sum of the variables' variances.
Solution The major insights are that E[x_i] = μ ∀i, that E[x_i x_j] = E[x_i] E[x_j] for i ≠ j due to i.i.d. sampling, and that E[(x_i − μ)²] = σ².
$$\mathbb{V}[f_n] = \mathbb{E}\Big[\Big(\frac{1}{n}\sum_{i=1}^n x_i - \mu\Big)^2\Big] = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}[(x_i-\mu)(x_j-\mu)]$$
$$= \frac{1}{n^2}\sum_{i\neq j} \underbrace{\mathbb{E}[x_i-\mu]}_{0}\,\underbrace{\mathbb{E}[x_j-\mu]}_{0} \;+\; \frac{1}{n^2}\sum_{i=1}^n \underbrace{\mathbb{E}[(x_i-\mu)^2]}_{\sigma^2} = \frac{\sigma^2}{n}\,.$$
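A Monte Carlo sanity check of V[f_n] = σ²/n; a sketch with arbitrarily chosen μ, σ, and n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, repeats = 3.0, 2.0, 50, 100_000

# draw `repeats` data sets of n i.i.d. samples and compute each empirical mean
means = rng.normal(mu, sigma, size=(repeats, n)).mean(axis=1)

print(means.var())   # empirical variance of f_n
print(sigma**2 / n)  # analytical value sigma^2 / n = 0.08
```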
Rubric:
• 1 point for the correct definition of the variance V
• 1 point for using E[x_i] = μ
• 1 point for the use of independent samples
• 1 point for the use of the definition of σ²
• 1 point for putting it together correctly
• −½ point for minor mistakes (e.g. E[x_i x_j] = 0 for i ≠ j)
• but no point loss for forgetting little things like one or two ± (sign) mistakes
Let {x_i}_{i=1}^n be a data set that is drawn i.i.d. from the Gaussian distribution x_i ∼ N(μ, σ²). Let further $\hat\mu := \frac{1}{n}\sum_{i=1}^n x_i$ denote the empirical mean and $\hat\sigma^2 := \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2$ the equivalent empirical variance. Prove analytically that μ̂ is unbiased, i.e. E[μ̂] = μ, and that σ̂² is biased, i.e. E[σ̂²] ≠ σ².
Bonus-question: Can you derive an unbiased estimator for the empirical variance?
Hint: If x_i and x_j are drawn i.i.d. from N(μ, σ²), then ∀i: E[x_i] = μ, E[x_i²] = μ² + σ², and E[x_i x_j] = E[x_i] E[x_j] = μ² for i ≠ j.
Solution Unbiasedness of μ̂ follows directly from the linearity of the expectation: $\mathbb{E}[\hat\mu] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i] = \mu$.
Proving that σ̂² is biased is more involved, as σ̂² contains the empirical mean μ̂:
$$\mathbb{E}[\hat\sigma^2] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[(x_i-\hat\mu)^2\big] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i^2] - 2\,\frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i\hat\mu] + \mathbb{E}[\hat\mu^2]$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i^2] - 2\,\frac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[x_i \frac{1}{n}\sum_{j=1}^n x_j\Big] + \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n x_i\, \frac{1}{n}\sum_{j=1}^n x_j\Big]$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i^2] - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}[x_i x_j]\; \underbrace{-\,\mu^2 + \mu^2}_{0}$$
$$= \frac{1}{n}\sum_{i=1}^n \underbrace{\mathbb{E}[(x_i-\mu)^2]}_{\sigma^2} - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \underbrace{\mathbb{E}[(x_i-\mu)(x_j-\mu)]}_{\sigma^2 \text{ if } i=j \text{ else } 0}$$
$$= \sigma^2 - \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2\,,$$
where we used E[(x_i − μ)(x_j − μ)] = E[x_i x_j] − E[x_i]μ − E[x_j]μ + μ² = E[x_i x_j] − μ², because E[x_i] = μ.
Bonus-question: Note that σ̂² would be unbiased if we multiplied it by $\frac{n}{n-1}$, and we can therefore define the unbiased empirical estimate of the variance as
$$\hat{\hat\sigma}{}^2 := \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat\mu)^2\,.$$
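The bias and its n/(n−1) correction can be illustrated empirically; a minimal sketch (numpy's ddof parameter implements exactly this correction):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, repeats = 0.0, 1.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(repeats, n))
biased = samples.var(axis=1, ddof=0).mean()    # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

print(biased, (n - 1) / n * sigma**2)  # ~ 0.8 = (n-1)/n * sigma^2
print(unbiased, sigma**2)              # ~ 1.0
```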
This question is designed to practice the use of Kronecker-delta functions and to become more familiar with (discrete) probabilities. You are given 3 dice, a D6, a D8 and a D10, where Dx refers to an x-sided fair die whose x sides are uniquely numbered 1 to x and are each rolled with the exact same probability.
(a) Prove analytically that the probability that the D6 is among the highest (including equal) numbers
when all 3 dice are rolled together is roughly ρ ≈ 19%.
(b) Prove analytically that the probability that the D8 rolls among the highest is ρ′ ≈ 38%.
(c) Prove analytically that the probability that the D10 rolls among the highest is ρ′′ ≈ 58%.
Hint: You can solve the question however you want, but you are encouraged to use Kronecker-deltas, e.g. δ(i > 5) is 1 if i > 5 and 0 otherwise. You will find that this can simplify complex sums enormously. If you do so, you can use the equalities
$$\sum_{i=1}^n i \overset{(1)}{=} \frac{n^2+n}{2} \qquad \text{and} \qquad \sum_{i=1}^n i^2 \overset{(2)}{=} \frac{n(n+1)(2n+1)}{6}\,.$$
Bonus-question: Why don't the above numbers sum up to 1?
Solution The three dice are statistically independent, and each die has the probability $p_x(i) = \frac{1}{x}$ of outcome 1 ≤ i ≤ x. The probability of a Dx rolling higher than or equal to a Dy is therefore:
$$p(i \geq j \mid i \sim p_x, j \sim p_y) = \frac{1}{xy}\sum_{i=1}^{x}\sum_{j=1}^{y} \delta(i \geq j)\,.$$
Note that if two conditions must both be true, one can simply multiply the Kronecker-delta functions.
(a) The probability ρ of the D6 rolling at least as high as both the D8 and the D10 is thus:
$$\rho = p(i \geq j \wedge i \geq k \mid i \sim p_6, j \sim p_8, k \sim p_{10}) = \frac{1}{6\cdot8\cdot10}\sum_{i=1}^{6}\underbrace{\sum_{j=1}^{8}\delta(i\geq j)}_{i}\,\underbrace{\sum_{k=1}^{10}\delta(i\geq k)}_{i}$$
$$= \frac{1}{480}\sum_{i=1}^{6} i^2 \overset{(2)}{=} \frac{1}{480}\,\frac{6(6+1)(12+1)}{6} = \frac{91}{480} \approx 19\%\,.$$
(b) The major difference is that $\sum_{j=1}^{6}\delta(i\geq j)$ cannot get larger than 6, even if i > 6:
$$\rho' = p(i \geq j \wedge i \geq k \mid i \sim p_8, j \sim p_6, k \sim p_{10}) = \frac{1}{6\cdot8\cdot10}\sum_{i=1}^{8}\sum_{j=1}^{6}\sum_{k=1}^{10}\delta(i\geq j)\,\delta(i\geq k)$$
$$= \frac{1}{6\cdot8\cdot10}\sum_{i=1}^{8}\underbrace{\sum_{j=1}^{6}\delta(i\geq j)}_{\min(i,6)}\,\underbrace{\sum_{k=1}^{10}\delta(i\geq k)}_{i}$$
$$= \frac{1}{6\cdot8\cdot10}\Big(\sum_{i=1}^{6} i^2 + \sum_{i=7}^{8} 6i\Big) \overset{(2)}{=} \frac{1}{6\cdot8\cdot10}\Big(\frac{6\cdot7\cdot13}{6} + 6(7+8)\Big) = \frac{91+90}{480} \approx 38\%\,.$$
Bonus-question: Because conditions like δ(i ≥ j) and δ(j ≥ i) overlap in the case i = j. To get a probability distribution over disjoint outcomes, one would have to consider the cases "D6 is highest, D8 is highest, D10 is highest, D6 and D8 are highest, D8 and D10 are highest, D6 and D10 are highest, and finally all 3 dice are equal (and thus highest)". The probabilities of these cases would sum up to 1.
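Part (c) is not worked out above, but it follows the same pattern with factors min(i, 6) and min(i, 8). All three probabilities can be checked by brute-force enumeration over the 6·8·10 = 480 equally likely outcomes; a short sketch:

```python
from itertools import product

def p_among_highest(own, others):
    """Probability that a die with `own` sides rolls >= all dice in `others`."""
    sides = [own] + others
    total = hits = 0
    for rolls in product(*(range(1, s + 1) for s in sides)):
        total += 1
        hits += all(rolls[0] >= r for r in rolls[1:])  # Kronecker-delta style
    return hits / total

print(p_among_highest(6, [8, 10]))   # 91/480  ~ 0.19
print(p_among_highest(8, [6, 10]))   # 181/480 ~ 0.38
print(p_among_highest(10, [6, 8]))   # 277/480 ~ 0.58
```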
Implement the MNIST classification example from the lecture slides. Make sure you get the correct
deep CNN model architecture from Lecture 2 (p.18).
(a) Train the model f_θ : ℝ^{28×28} → ℝ^{10} from the lecture slides with a cross-entropy loss for 10 epochs. Plot the average train/test losses during each epoch (y-axis) over all epochs (x-axis). Do the same with the average train/test accuracies during each epoch. Try to program as modularly as possible, as you will re-use the code later.
(b) Change your optimization criterion to a mean-squared-error loss between the same model architecture f_θ : ℝ^{28×28} → ℝ^{10} you used in (a) and a one-hot encoding (h_i ∈ ℝ^{10}, h_{ij} = 1 iff j = y_i, otherwise h_{ij} = 0) of the labels y_i:
$$L := \frac{1}{n}\sum_{i=1}^n \big\| f_\theta(x_i) - h_i \big\|^2$$
Plot the same plots as in (a). Try to reuse as much of your old code as possible, e.g., by defining the criterion (which is now different) as an external function that can be overwritten.
(c) Define a new architecture f'_θ : ℝ^{28×28} → ℝ that is exactly the same as above, but with only one output neuron instead of 10. Train it with a regression mean-squared-error loss between the model output and the scalar class identifier:
$$L' := \frac{1}{n}\sum_{i=1}^n \big( f'_\theta(x_i) - y_i \big)^2$$
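A possible modular skeleton for (a)-(c) in PyTorch, as a sketch only: the CNN below is a stand-in, not the architecture from Lecture 2 (p.18), and the criterion is passed in as a function so that parts (b) and (c) can reuse the same training loop:

```python
import torch, torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_model(out_dim=10):  # placeholder CNN; substitute the Lecture 2 (p.18) architecture
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 7 * 7, out_dim))

def ce_criterion(out, y):           # (a) cross-entropy
    return nn.functional.cross_entropy(out, y)

def mse_onehot_criterion(out, y):   # (b) MSE against one-hot labels
    return nn.functional.mse_loss(out, nn.functional.one_hot(y, 10).float())

def mse_scalar_criterion(out, y):   # (c) regression on the scalar class identifier
    return nn.functional.mse_loss(out.squeeze(1), y.float())

def run_epoch(model, loader, criterion, opt=None):
    total_loss, correct, count = 0.0, 0, 0
    for x, y in loader:
        out = model(x)
        loss = criterion(out, y)
        if opt is not None:
            opt.zero_grad(); loss.backward(); opt.step()
        total_loss += loss.item() * len(y)
        pred = out.argmax(1) if out.shape[1] > 1 else out.squeeze(1).round().long()
        correct += (pred == y).sum().item()
        count += len(y)
    return total_loss / count, correct / count

tf = transforms.ToTensor()
train = DataLoader(datasets.MNIST(".", train=True, download=True, transform=tf), 128, shuffle=True)
test = DataLoader(datasets.MNIST(".", train=False, transform=tf), 256)

model = make_model(10)  # for (c): make_model(1) with mse_scalar_criterion
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
    tr = run_epoch(model, train, ce_criterion, opt)
    with torch.no_grad():
        te = run_epoch(model, test, ce_criterion)
    print(epoch, tr, te)  # collect these for the loss/accuracy plots
```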
Let {y_t}_{t=1}^∞ be an infinite training set drawn i.i.d. from the Gaussian distribution N(μ, σ²). At time t, the online estimate f_t of the average over the training set, starting at f_0, is defined as
$$f_t := (1-\alpha)\,f_{t-1} + \alpha\, y_t\,, \qquad 0 < \alpha < 1\,.$$
(a) Show that for small t the online estimate is biased, i.e., E[f_t] ≠ μ.
(b) Prove that in the limit t → ∞ the online estimate is unbiased, i.e., E[f_t] = μ.
(c) Prove that in the limit t → ∞ the variance of the online estimate is $\mathbb{E}[f_t^2] - \mathbb{E}[f_t]^2 = \frac{\alpha\,\sigma^2}{2-\alpha}$.
Hint: You can use the geometric series $\sum_{k=0}^{t-1} r^k = \frac{1-r^t}{1-r}$, ∀|r| < 1.
Bonus-question: Prove that for the decaying learning rate $\alpha_t = \frac{\alpha}{1-(1-\alpha)^t}$ it holds that $\lim_{\alpha\to 0} \alpha_t = \frac{1}{t}$.
Hint: You can also use the binomial identity $(x+y)^t = \sum_{k=0}^{t} \binom{t}{k} x^k y^{t-k}$.
Solution
(a) Note that by recursion $f_t = (1-\alpha)^t f_0 + \sum_{i=0}^{t-1} \alpha(1-\alpha)^i y_{t-i}$. Therefore
$$\mathbb{E}[f_t] = (1-\alpha)^t f_0 + \alpha \underbrace{\sum_{i=0}^{t-1} (1-\alpha)^i}_{\frac{1-(1-\alpha)^t}{1-(1-\alpha)}}\, \underbrace{\mathbb{E}[y_{t-i}]}_{\mu} = \mu + (1-\alpha)^t (f_0 - \mu) \neq \mu\,.$$
(b) As |1 − α| < 1, the bias term vanishes in the limit: $\lim_{t\to\infty}\mathbb{E}[f_t] = \mu + \lim_{t\to\infty}(1-\alpha)^t (f_0-\mu) = \mu$.
(c) Note that due to the assumption of i.i.d. sampling, $\mathbb{E}[y_i y_j] = \mu^2 + \sigma^2 \delta_{ij}$.
$$\lim_{t\to\infty} \mathbb{E}[f_t^2] = \lim_{t\to\infty} \mathbb{E}\Big[\Big((1-\alpha)^t f_0 + \sum_{i=0}^{t-1}\alpha(1-\alpha)^i y_{t-i}\Big)^2\Big]$$
$$= \lim_{t\to\infty} \mathbb{E}\Big[\big((1-\alpha)^t f_0\big)^2 + 2(1-\alpha)^t f_0 \sum_{i=0}^{t-1}\alpha(1-\alpha)^i y_{t-i} + \sum_{i=0}^{t-1}\alpha(1-\alpha)^i y_{t-i}\, \sum_{j=0}^{t-1}\alpha(1-\alpha)^j y_{t-j}\Big]$$
$$= \lim_{t\to\infty}\underbrace{\big((1-\alpha)^t f_0\big)^2}_{\to 0} + \lim_{t\to\infty}\underbrace{2\alpha f_0 \mu\, \frac{(1-\alpha)^t - (1-\alpha)^{2t}}{1-(1-\alpha)}}_{\to 0} + \lim_{t\to\infty}\mathbb{E}\Big[\sum_{i=0}^{t-1}\alpha(1-\alpha)^i y_{t-i}\, \sum_{j=0}^{t-1}\alpha(1-\alpha)^j y_{t-j}\Big]$$
$$= \alpha^2 \lim_{t\to\infty}\sum_{i,j=0}^{t-1}(1-\alpha)^{i+j}\,\mathbb{E}\big[y_{t-i}\, y_{t-j}\big]$$
$$= \mu^2 \underbrace{\Big(\alpha\sum_{i=0}^{\infty}(1-\alpha)^i\Big)^2}_{=1} + \alpha^2\sigma^2 \lim_{t\to\infty}\sum_{i=0}^{t-1}\big((1-\alpha)^2\big)^i$$
$$= \mu^2 + \frac{\alpha^2\sigma^2}{1-(1-\alpha)^2} = \mu^2 + \frac{\alpha\,\sigma^2}{2-\alpha}\,.$$
Together with (b), this yields $\lim_{t\to\infty} \mathbb{E}[f_t^2] - \mathbb{E}[f_t]^2 = \mu^2 + \frac{\alpha\sigma^2}{2-\alpha} - \mu^2 = \frac{\alpha\,\sigma^2}{2-\alpha}$.
Bonus-question: One can use the binomial identity to reformulate the $(1-\alpha)^t$:
$$\lim_{\alpha\to 0} \alpha_t = \lim_{\alpha\to 0} \frac{\alpha}{1-(1-\alpha)^t} = \lim_{\alpha\to 0}\Big[\alpha^{-1} - \alpha^{-1}(1-\alpha)^t\Big]^{-1}$$
$$= \lim_{\alpha\to 0}\Big[\alpha^{-1} - \sum_{k=0}^{t}\binom{t}{k}(-1)^{t-k}\underbrace{\alpha^{t-k}\,\alpha^{-1}}_{\alpha^{t-k-1}}\Big]^{-1}$$
$$= \lim_{\alpha\to 0}\Big[\alpha^{-1} - \sum_{k=0}^{t-2}\binom{t}{k}(-1)^{t-k}\alpha^{t-k-1} + \underbrace{\binom{t}{t-1}}_{t}\,\underbrace{\alpha^{0}}_{1} - \alpha^{-1}\Big]^{-1}$$
$$= \lim_{\alpha\to 0}\Big[t - \sum_{k=0}^{t-2}\binom{t}{k}(-1)^{t-k}\alpha^{t-k-1}\Big]^{-1} = \frac{1}{t}\,.$$
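The limits in (a)-(c) can be checked with a short simulation of the recursion f_t = (1−α)f_{t−1} + αy_t; a sketch with arbitrarily chosen μ, σ, α, and f_0:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, alpha, f0 = 2.0, 1.5, 0.1, 10.0
T, repeats = 2000, 20_000

f = np.full(repeats, f0)
for _ in range(T):  # f_t = (1 - alpha) f_{t-1} + alpha y_t, run in parallel
    f = (1 - alpha) * f + alpha * rng.normal(mu, sigma, size=repeats)

print(f.mean(), mu)                             # bias vanishes as t -> infinity
print(f.var(), alpha * sigma**2 / (2 - alpha))  # asymptotic variance
```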
Let {x_i}_{i=1}^n ⊂ ℝ^m denote a set of training samples and {y_i}_{i=1}^n ⊂ ℝ the set of corresponding training labels. We will use the mean squared loss $L := \frac{1}{n}\sum_i (f(x_i) - y_i)^2$ to learn a function f(x_i) ≈ y_i, ∀i.
(a) Derive the analytical solution for parameter a ∈ Rm of a linear function f (x) := a⊤ x.
(b) We will now augment the training data by adding i.i.d. noise ϵi ∼ N (0, σ 2 ) ∈ R to the training
labels, i.e. ỹi := yi + ϵi . Show that this does not change the analytical solution of the expected loss
E[L].
(c) Let f denote the function that minimizes L without label-noise, and let f˜ denote the function that
minimizes L with a random noise ϵi added to labels yi (but not the solution of the expected loss
E[L]). Derive the analytical variance E[(f˜(x) − f (x))2 ] of the noisy solution f˜.
(d) We will now augment the training data by adding i.i.d. noise ϵi ∼ N (0, Σ) ∈ Rm to the training
samples: x̃i = xi + ϵi . Derive the analytical solution for parameter a ∈ Rm that minimizes the
expected loss E[L].
Bonus-question: Which popular regularization method is equivalent to (d) and what problem is solved?
Hint: Summarize all training samples into matrix X = [x1 , . . . , xn ]⊤ ∈ Rn×m , all training labels into
vector y = [y1 , . . . , yn ]⊤ ∈ Rn , and denote the noisy versions ỹ ∈ Rn and X̃ ∈ Rn×m .
Solution
(a) Setting the gradient to zero, $\nabla_a L = \frac{2}{n}\sum_i (a^\top x_i - y_i)\,x_i = \frac{2}{n} X^\top X a - \frac{2}{n} X^\top y \overset{!}{=} 0$, allows us to derive the analytic solution for a if the matrix X^⊤X is invertible: $a \overset{!}{=} (X^\top X)^{-1} X^\top y$.
(b) Using the result from (a): $\mathbb{E}[\nabla_a L] = \frac{2}{n} X^\top X a - \frac{2}{n} X^\top \mathbb{E}[\tilde y] = \frac{2}{n} X^\top X a - \frac{2}{n} X^\top y$, because E[ỹ] = y + E[ϵ] = y due to the zero-mean noise vector ϵ := [ϵ_1, ..., ϵ_n]^⊤. The analytical solution of the expected loss is therefore unchanged.
(c) First note that $\mathbb{E}[(\tilde f(x) - f(x))^2] = \mathbb{E}[\tilde f^2(x)] - 2f(x)\,\mathbb{E}[\tilde f(x)] + f^2(x) = \mathbb{E}[\tilde f^2(x)] - f^2(x)$, because E[ỹ] = y. Due to the i.i.d. noise we have $\mathbb{E}[\tilde y \tilde y^\top] = y y^\top + y\,\mathbb{E}[\epsilon^\top] + \mathbb{E}[\epsilon]\,y^\top + \mathbb{E}[\epsilon\epsilon^\top] = y y^\top + \sigma^2 I$, and $\mathbb{E}[\tilde f^2(x)] = x^\top(X^\top X)^{-1} X^\top \mathbb{E}[\tilde y\tilde y^\top]\, X (X^\top X)^{-1} x = f^2(x) + \sigma^2 x^\top(X^\top X)^{-1} x$. The variance is therefore $\mathbb{E}[(\tilde f(x) - f(x))^2] = \sigma^2 x^\top(X^\top X)^{-1} x$.
(d) With x̃_i = x_i + ϵ_i we have $\mathbb{E}[\tilde X^\top \tilde X] = X^\top X + n\Sigma$, since the cross terms vanish for zero-mean noise and $\sum_i \mathbb{E}[\epsilon_i\epsilon_i^\top] = n\Sigma$, while $\mathbb{E}[\tilde X^\top]\, y = X^\top y$. Setting $\mathbb{E}[\nabla_a L] = \frac{2}{n}\big(X^\top X + n\Sigma\big) a - \frac{2}{n} X^\top y \overset{!}{=} 0$ yields $a = (X^\top X + n\Sigma)^{-1} X^\top y$.
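A numerical sketch of (a) and (c) with synthetic data (dimensions and noise level chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 200, 3, 0.5
X = rng.normal(size=(n, m))
a_true = rng.normal(size=m)
y = X @ a_true

# (a) closed-form solution, cross-checked against lstsq
a = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(a, np.linalg.lstsq(X, y, rcond=None)[0]))

# (c) variance of the prediction under label noise at a query point x
x = rng.normal(size=m)
preds = []
for _ in range(20_000):
    y_noisy = y + rng.normal(0, sigma, size=n)           # y_tilde = y + eps
    a_tilde = np.linalg.solve(X.T @ X, X.T @ y_noisy)
    preds.append(x @ a_tilde)
print(np.var(preds), sigma**2 * x @ np.linalg.inv(X.T @ X) @ x)
```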
Bonus-question: The popular L2 regularization, also called weight decay, adds the term λ∥a∥² to the loss and yields the analytical solution $a \overset{!}{=} (X^\top X + n\lambda I)^{-1} X^\top y$, which is also called ridge regression. This regularization guarantees that the matrix X^⊤X + nλI is invertible for all λ > 0 and yields smoother functions. For Σ = λI, which corresponds to noising each input dimension independently with variance λ, the two solutions are the same, which indicates that noising the input smooths the learned function!
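A minimal illustration of the invertibility claim, using deliberately collinear features so that X^⊤X is singular while X^⊤X + nλI is not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 100, 0.1
x1 = rng.normal(size=n)
X = np.stack([x1, 2 * x1], axis=1)     # second feature is collinear -> X^T X singular
y = x1 + rng.normal(0, 0.1, size=n)

print(np.linalg.matrix_rank(X.T @ X))  # rank 1 < m = 2, not invertible
a_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(2), X.T @ y)
print(a_ridge)                         # ridge solution exists despite collinearity
```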
https://xkcd.com/2343