
DSAIT4115 Deep Reinforcement Learning Exercise Sheet 0

Wendelin Böhmer <[email protected]> voluntary exercises

Math and machine learning primer


Voluntary exercises

The following exercises do not have to be submitted as homework, but they may be helpful for practicing the required math and preparing for the exam. Some questions are taken from old exams and include the rubric that was used. You will not receive points for these questions.

E0.1: Taylor expansion (voluntary)



For the function $\sqrt{1+x}$, write down the Taylor series around $x_0 = 0$ up to 3rd order.

Solution: Approximating $f(x)$ via Taylor expansion at $x_0$:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n$$

i.e. for an expansion around $x_0 = 0$:

$$f(x) \approx f(0) + f'(0)\,x + \tfrac{1}{2} f''(0)\,x^2 + \tfrac{1}{6} f'''(0)\,x^3 + O(x^4)$$

with

$$f'(x) = \tfrac{1}{2}(x+1)^{-1/2} \;\rightarrow\; f'(0) = 1/2$$
$$f''(x) = -\tfrac{1}{4}(x+1)^{-3/2} \;\rightarrow\; f''(0) = -1/4$$
$$f'''(x) = \tfrac{3}{8}(x+1)^{-5/2} \;\rightarrow\; f'''(0) = 3/8$$

the Taylor expansion of $\sqrt{1+x} = (1+x)^{1/2}$ around $x_0 = 0$ reads

$$\sqrt{1+x} \approx 1 + \tfrac{1}{2}x - \tfrac{1}{8}x^2 + \tfrac{1}{16}x^3 + \ldots$$
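
A quick numerical sanity check of the cubic approximation (a sketch, not part of the original sheet):

import math

def sqrt_taylor3(x):
    # 3rd-order Taylor expansion of sqrt(1 + x) around x0 = 0
    return 1 + x / 2 - x**2 / 8 + x**3 / 16

for x in (0.01, 0.1, 0.5):
    exact = math.sqrt(1 + x)
    approx = sqrt_taylor3(x)
    print(f"x={x}: exact={exact:.6f} taylor={approx:.6f} error={abs(exact - approx):.2e}")

The error grows with $|x|$, as expected for a truncated Taylor series.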

E0.2: Critical points (voluntary)

Consider the two functions

$$f(x, y) := c + x^2 + y^2$$
$$g(x, y) := c + x^2 - y^2\,,$$

where $c \in \mathbb{R}$ is a constant.


(a) Show that $a = (0, 0)$ is a critical point of both functions.

(b) Check for $f$ and for $g$ whether $a$ is a minimum, maximum, or saddle point using the Hessian matrix.
Hint: A matrix is positive (negative) definite if and only if all its eigenvalues are positive (negative).

Solution:

$$\nabla f(a) = (2x, 2y)\big|_{(x,y)=a} = (0, 0)$$

and

$$\nabla g(a) = (2x, -2y)\big|_{(x,y)=a} = (0, 0)$$

$\Rightarrow$ The necessary condition for extrema (vanishing gradient) is fulfilled at $a$.

Checking for extrema: $a$ is a minimum if the Hessian matrix $H$ is positive definite (all eigenvalues $> 0$) and a maximum if $H$ is negative definite (all eigenvalues $< 0$). $\rightarrow$ characteristic polynomial, i.e. $\det[H - \lambda I]$:

$$(H_f)(a) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \;\Rightarrow\; (2 - \lambda)^2 \overset{!}{=} 0$$

i.e. all eigenvalues ($2$ and $2$) are real and positive $\Rightarrow$ $H_f$ is positive definite. Thus, $a$ is a minimum of $f$.

$$(H_g)(a) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix} \;\Rightarrow\; (2 - \lambda)(-2 - \lambda) \overset{!}{=} 0$$

One positive and one negative eigenvalue ($2$ and $-2$) $\Rightarrow$ $H_g$ is neither positive nor negative definite. Therefore $a$ is a saddle point but not an extremum of $g$.
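
The Hessian classification can be checked numerically (a sketch, not part of the original solution):

import numpy as np

# Hessians of f and g at a = (0, 0); the constant c drops out of all second derivatives
hessians = {"f": np.array([[2.0, 0.0], [0.0, 2.0]]),
            "g": np.array([[2.0, 0.0], [0.0, -2.0]])}

for name, H in hessians.items():
    eig = np.linalg.eigvalsh(H)  # symmetric matrix -> real eigenvalues
    if np.all(eig > 0):
        kind = "minimum"
    elif np.all(eig < 0):
        kind = "maximum"
    else:
        kind = "saddle point"
    print(f"{name}: eigenvalues {eig} -> {kind}")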

E0.3: Distributions and expected values (voluntary)

Let $x \in \mathbb{R}$ be a random variable with probability density $p : \mathbb{R} \to \mathbb{R}$ with:

$$p(x) = \begin{cases} c \cdot \sin(x), & x \in [0, \pi] \\ 0, & \text{elsewhere} \end{cases}$$

(a) Determine the parameter $c \in \mathbb{R}$ such that $p(x)$ is indeed a probability density.
(b) Determine the expected value $\mu := \mathbb{E}_p[x]$.
(c) Determine the variance of $x$, $\mathbb{E}_p[(x - \mu)^2]$.

Solution:
(a) For $p$ to be a probability density it is required that (i) $p(x) \geq 0\ \forall x \in \mathbb{R}$, which is fulfilled here. Furthermore, (ii) $p$ must be normalized appropriately:

$$\int_{\mathbb{R}} p(x)\,dx = 1$$

Therefore, we get for the unknown constant $c$:

$$c \int_0^{\pi} \sin(x)\,dx = c\,\big[-\cos(x)\big]_0^{\pi} = 2c \overset{!}{=} 1 \;\rightarrow\; c = 1/2$$


(b) To calculate the expected value, we use integration by parts, i.e., for any functions $f$ and $g$:

$$\int_a^b f g'\,dx = (fg)\Big|_a^b - \int_a^b f' g\,dx$$

$$\mu := \mathbb{E}_p[x] = \tfrac{1}{2}\int_0^{\pi} x \sin(x)\,dx = -\tfrac{1}{2}\, x \cos(x)\Big|_0^{\pi} + \tfrac{1}{2}\underbrace{\int_0^{\pi} \cos(x)\,dx}_{=0} = \pi/2$$

(c) To calculate the variance, we proceed in the same way:

$$\mathbb{E}_p[x^2] = \tfrac{1}{2}\int_0^{\pi} x^2 \sin(x)\,dx = \underbrace{-\tfrac{1}{2}\, x^2 \cos(x)\Big|_0^{\pi}}_{=\frac{\pi^2}{2}} + \underbrace{\int_0^{\pi} x \cos(x)\,dx}_{=k}$$

with

$$k = x \sin(x)\Big|_0^{\pi} - \int_0^{\pi} \sin(x)\,dx = 0 + \cos(x)\Big|_0^{\pi} = 0 - 2$$

and therefore

$$\mathbb{E}_p[x^2] = \tfrac{\pi^2}{2} - 2\,,$$

yielding

$$\mathbb{E}_p[(x - \mu)^2] = \mathbb{E}_p[x^2] - \mu^2 = \tfrac{\pi^2}{2} - 2 - \tfrac{\pi^2}{4} = \tfrac{\pi^2}{4} - 2\,.$$
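
All three results can be verified numerically, e.g. with scipy's quadrature (a sketch, not part of the original sheet):

import numpy as np
from scipy.integrate import quad

p = lambda x: 0.5 * np.sin(x)  # the density on [0, pi] with c = 1/2

norm, _ = quad(p, 0, np.pi)                     # should be 1
mu, _ = quad(lambda x: x * p(x), 0, np.pi)      # should be pi/2
ex2, _ = quad(lambda x: x**2 * p(x), 0, np.pi)  # should be pi^2/2 - 2

print(norm, mu, np.pi / 2)
print(ex2 - mu**2, np.pi**2 / 4 - 2)  # variance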

E0.4: Variance of the empirical mean (old exam question) (voluntary)

Prove that the variance of the empirical mean $f_n := \frac{1}{n}\sum_{i=1}^n x_i$, based on $n$ samples $x_i \in \mathbb{R}$ drawn i.i.d. from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, is $\mathbb{V}[f_n] = \frac{\sigma^2}{n}$, without using the fact that the variance of a sum of independent variables is the sum of the variables' variances.

Solution: The major insights are that $\mathbb{E}[x_i] = \mu\,, \forall i$, that $\mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\,\mathbb{E}[x_j]$ if $i \neq j$ due to i.i.d. sampling, and that $\mathbb{E}[(x_i - \mu)^2] = \sigma^2$.

$$\mathbb{V}[f_n] = \mathbb{E}\Big[\Big(\tfrac{1}{n}\sum_{i=1}^n x_i - \mu\Big)^2\Big] = \tfrac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}\big[(x_i - \mu)(x_j - \mu)\big]$$

$$= \tfrac{1}{n^2} \sum_{i \neq j} \underbrace{\mathbb{E}[x_i - \mu]}_{0}\, \underbrace{\mathbb{E}[x_j - \mu]}_{0} \;+\; \tfrac{1}{n^2} \sum_{i=1}^n \underbrace{\mathbb{E}\big[(x_i - \mu)^2\big]}_{\sigma^2} \;=\; \tfrac{\sigma^2}{n}\,.$$
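
A quick Monte-Carlo check of $\mathbb{V}[f_n] = \sigma^2/n$ (a sketch, not part of the original solution):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 1.0, 2.0, 10, 200_000

# empirical mean of n i.i.d. Gaussian samples, repeated many times
means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
print(means.var(), sigma**2 / n)  # both approximately 0.4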

Rubric:
• 1 point for the correct definition of the variance $\mathbb{V}$
• 1 point for using $\mathbb{E}[x_i] = \mu$
• 1 point for the use of independent samples
• 1 point for the use of the definition of $\sigma^2$
• 1 point for putting it correctly together
• $-\tfrac{1}{2}$ point for minor mistakes (e.g. $\mathbb{E}[x_i x_j] = 0$ for $i \neq j$)
• but no point loss for forgetting little things like one or two $\pm$ mistakes


E0.5: Unbiased variance estimate (voluntary)

Let $\{x_i\}_{i=1}^n$ be a data set that is drawn i.i.d. from the Gaussian distribution $x_i \sim \mathcal{N}(\mu, \sigma^2)$. Let further $\hat\mu := \frac{1}{n}\sum_{i=1}^n x_i$ denote the empirical mean and $\hat\sigma^2 := \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2$ the equivalent empirical variance. Prove analytically that $\hat\mu$ is unbiased, i.e. $\mathbb{E}[\hat\mu] = \mu$, and that $\hat\sigma^2$ is biased, i.e. $\mathbb{E}[\hat\sigma^2] \neq \sigma^2$.
Bonus-question: Can you derive an unbiased estimator for the empirical variance?
Hint: If $x_i$ and $x_j$ are drawn i.i.d. from $\mathcal{N}(\mu, \sigma^2)$, then it holds $\forall i$:

$$\mathbb{E}[x_i] = \mu\,, \quad \mathbb{E}\big[(x_i - \mu)^2\big] = \sigma^2 \quad \text{and} \quad \mathbb{E}\big[(x_i - \mu)(x_j - \mu)\big] = 0 \text{ if } i \neq j\,.$$

Solution: We prove that $\hat\mu$ is bias-free simply by using its definition:

$$\mathbb{E}[\hat\mu] = \mathbb{E}\Big[\tfrac{1}{n}\sum_{i=1}^n x_i\Big] = \tfrac{1}{n}\sum_{i=1}^n \underbrace{\mathbb{E}[x_i]}_{\mu} = \mu\,.$$

Proving that $\hat\sigma^2$ is biased is more involved, as $\hat\sigma^2$ contains the empirical mean $\hat\mu$:

$$\mathbb{E}[\hat\sigma^2] = \tfrac{1}{n}\sum_{i=1}^n \mathbb{E}\big[(x_i - \hat\mu)^2\big] = \tfrac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i^2] - 2\,\tfrac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i \hat\mu] + \mathbb{E}[\hat\mu^2]$$

$$= \tfrac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i^2] - 2\,\tfrac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[x_i\, \tfrac{1}{n}\sum_{j=1}^n x_j\Big] + \mathbb{E}\Big[\tfrac{1}{n}\sum_{i=1}^n x_i\; \tfrac{1}{n}\sum_{j=1}^n x_j\Big]$$

$$= \tfrac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i^2] - \tfrac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}[x_i x_j] \underbrace{-\,\mu^2 + \mu^2}_{0}$$

$$= \tfrac{1}{n}\sum_{i=1}^n \underbrace{\mathbb{E}\big[(x_i - \mu)^2\big]}_{\sigma^2} - \tfrac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \underbrace{\mathbb{E}\big[(x_i - \mu)(x_j - \mu)\big]}_{\sigma^2 \text{ if } i=j \text{ else } 0}$$

$$= \sigma^2 - \tfrac{1}{n}\sigma^2 = \tfrac{n-1}{n}\sigma^2\,,$$

where we used $\mathbb{E}[(x_i - \mu)(x_j - \mu)] = \mathbb{E}[x_i x_j] - \mathbb{E}[x_i]\mu - \mathbb{E}[x_j]\mu + \mu^2 = \mathbb{E}[x_i x_j] - \mu^2$, because $\mathbb{E}[x_i] = \mu$.
Bonus-question: Note that $\hat\sigma^2$ would be unbiased if we multiplied it by $\frac{n}{n-1}$, and we can therefore define the unbiased empirical estimate of the variance as

$$\hat{\hat\sigma}^2 := \tfrac{1}{n-1}\sum_{i=1}^n (x_i - \hat\mu)^2\,.$$
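
The bias factor $\frac{n-1}{n}$ is easy to see empirically (a sketch, not part of the original sheet):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 1.0, 5, 100_000

x = rng.normal(mu, sigma, size=(trials, n))
biased = x.var(axis=1, ddof=0).mean()    # divides by n
unbiased = x.var(axis=1, ddof=1).mean()  # divides by n - 1 (Bessel's correction)
print(biased, (n - 1) / n * sigma**2)    # both approximately 0.8
print(unbiased, sigma**2)                # both approximately 1.0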

E0.6: Maximum dice (voluntary)

This question is designed to practice the use of Kronecker-delta functions and to become more familiar with (discrete) probabilities. You are given 3 dice, a D6, a D8 and a D10, where Dx refers to a fair x-sided die on which each of the x sides is numbered uniquely from 1 to x and rolled with the exact same probability.
(a) Prove analytically that the probability that the D6 is among the highest (including equal) numbers when all 3 dice are rolled together is roughly $\rho \approx 19\%$.
(b) Prove analytically that the probability that the D8 rolls among the highest is $\rho' \approx 38\%$.
(c) Prove analytically that the probability that the D10 rolls among the highest is $\rho'' \approx 58\%$.


Hint: You can solve the question however you want, but you are encouraged to use Kronecker-deltas, e.g. $\delta(i > 5)$ is 1 if $i > 5$ and 0 otherwise. You will find that this can simplify complex sums enormously. If you do so, you can use the equalities $\sum_{i=1}^n i \overset{(1)}{=} \frac{n^2+n}{2}$ and $\sum_{i=1}^n i^2 \overset{(2)}{=} \frac{n(n+1)(2n+1)}{6}$.
Bonus-question: Why don't the above numbers sum up to 1?

Solution: The three dice are statistically independent and have the probability $p_x(i) = \frac{1}{x}$ of outcome $1 \leq i \leq x$. The probability of a Dx rolling higher than or equal to a Dy is therefore:

$$p(i \geq j \mid i \sim p_x, j \sim p_y) = \tfrac{1}{xy} \sum_{i=1}^{x}\sum_{j=1}^{y} \delta(i \geq j)\,.$$

Note that if two conditions must be true, one can simply multiply the Kronecker-delta functions.
(a) The probability $\rho$ of the D6 rolling at least as high as the D8 and the D10 is thus:

$$\rho = p(i \geq j \wedge i \geq k \mid i \sim p_6, j \sim p_8, k \sim p_{10}) = \tfrac{1}{6 \cdot 8 \cdot 10} \sum_{i=1}^{6} \underbrace{\sum_{j=1}^{8} \delta(i \geq j)}_{i}\, \underbrace{\sum_{k=1}^{10} \delta(i \geq k)}_{i}$$

$$= \tfrac{1}{480} \sum_{i=1}^{6} i^2 \overset{(2)}{=} \tfrac{1}{480}\, \tfrac{6(6+1)(12+1)}{6} = \tfrac{91}{480} \approx 19\%\,.$$
i=1

(b) The major difference is that $\sum_{j=1}^{6} \delta(i \geq j)$ cannot get larger than 6, even if $i > 6$:

$$\rho' = p(i \geq j \wedge i \geq k \mid i \sim p_8, j \sim p_6, k \sim p_{10}) = \tfrac{1}{6 \cdot 8 \cdot 10} \sum_{i=1}^{8}\sum_{j=1}^{6}\sum_{k=1}^{10} \delta(i \geq j)\, \delta(i \geq k)$$

$$= \tfrac{1}{6 \cdot 8 \cdot 10} \sum_{i=1}^{8} \underbrace{\sum_{j=1}^{6} \delta(i \geq j)}_{\min(i,6)}\, \underbrace{\sum_{k=1}^{10} \delta(i \geq k)}_{i} = \tfrac{1}{6 \cdot 8 \cdot 10} \sum_{i=1}^{8} i\, \min(i, 6)$$

$$= \tfrac{1}{6 \cdot 8 \cdot 10} \Big( \sum_{i=1}^{6} i^2 + \sum_{i=7}^{8} 6i \Big) \overset{(2)}{=} \tfrac{1}{6 \cdot 8 \cdot 10} \Big( \tfrac{6 \cdot 7 \cdot 13}{6} + 6(7 + 8) \Big) = \tfrac{91 + 90}{480} \approx 38\%\,.$$

(c) Similarly for the D10:

$$\rho'' = p(i \geq j \wedge i \geq k \mid i \sim p_{10}, j \sim p_6, k \sim p_8) = \tfrac{1}{6 \cdot 8 \cdot 10} \sum_{i=1}^{10} \underbrace{\sum_{j=1}^{6} \delta(i \geq j)}_{\min(i,6)}\, \underbrace{\sum_{k=1}^{8} \delta(i \geq k)}_{\min(i,8)}$$

$$= \tfrac{1}{480} \Big( \sum_{i=1}^{6} i^2 + \sum_{i=7}^{8} 6i + \sum_{i=9}^{10} 6 \cdot 8 \Big) = \tfrac{91 + 90 + 96}{480} \approx 58\%\,.$$

Bonus-question: Because conditions like $\delta(i \geq j)$ and $\delta(j \geq i)$ overlap in the case $\delta(i = j)$. To get a probability distribution over disjoint outcomes, one would have to consider the cases "D6 is highest, D8 is highest, D10 is highest, D6 and D8 are highest, D8 and D10 are highest, D6 and D10 are highest, and finally all 3 dice are equal (and thus highest)". The probabilities over these cases would sum up to 1.
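
All three probabilities can also be checked by brute-force enumeration of the $6 \cdot 8 \cdot 10 = 480$ equally likely outcomes (a sketch, not part of the original sheet):

from itertools import product

outcomes = list(product(range(1, 7), range(1, 9), range(1, 11)))

# count how often each die rolls at least as high as the other two
for idx, name in enumerate(["D6", "D8", "D10"]):
    hits = sum(1 for roll in outcomes if roll[idx] == max(roll))
    print(f"{name}: {hits}/480 = {hits / 480:.1%}")
# expected: 91/480 ~ 19%, 181/480 ~ 38%, 277/480 ~ 58%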

E0.7: Implement MNIST classification (voluntary)

Implement the MNIST classification example from the lecture slides. Make sure you get the correct
deep CNN model architecture from Lecture 2 (p.18).
(a) Train the model $f_\theta : \mathbb{R}^{28\times 28} \to \mathbb{R}^{10}$ from the lecture slides with a cross-entropy loss for 10 epochs. Plot the average train/test losses during each epoch (y-axis) over all epochs (x-axis). Do the same with the average train/test accuracies during each epoch. Try to program as modularly as possible, as you will re-use the code later.
(b) Change your optimization criterion to a mean-squared-error loss between the same model architecture $f_\theta : \mathbb{R}^{28\times 28} \to \mathbb{R}^{10}$ you used in (a) and a one-hot encoding ($h_i \in \mathbb{R}^{10}$, $h_{ij} = 1$ iff $j = y_i$, otherwise $h_{ij} = 0$) of the labels $y_i$:

$$L := \tfrac{1}{n}\sum_{i=1}^n \big( f_\theta(x_i) - h_i \big)^2$$

Plot the same plots as in (a). Try to reuse as much of your old code as possible, e.g., by defining the criterion (which is now different) as an external function that can be overwritten.
(c) Define a new architecture $f'_\theta : \mathbb{R}^{28\times 28} \to \mathbb{R}$, that is exactly the same as above, but with only one output neuron instead of 10. Train it with a regression mean-squared-error loss between the model output and the scalar class identifier:

$$L' := \tfrac{1}{n}\sum_{i=1}^n \big( f'_\theta(x_i) - y_i \big)^2$$

Plot the same plots as in (a), but for 50 epochs.


(d) Learning in (c) should be significantly slower, in terms of accuracy gain per epoch, than in (a) and (b). Use a transformation of your model output (which can be implemented in the functions that compute the criterion and the accuracy, or as an extra module) as $f''_\theta(x_i) := \alpha f'_\theta(x_i) + \beta$, with $\alpha = \beta = 4.5$. Plot the same plots as in (c). Does the learning behavior change? Why?
Bonus-question: Can you come up with an alternative approach to (d) that has the same speed-up effect?
Hint: Evaluate your test loss and accuracy before every training run to make sure the accuracy is defined correctly (it should be around 0.1 for a model without training). This means that you will always have one more test measurement than training epochs.
Hint: Try to reuse as much of your old code as possible, e.g., by defining the criterion and the accuracy (which will change for some questions) as external functions that can be overwritten later.

Solution: A sample implementation can be found in the accompanying Jupyter Notebook.


Bonus-question: Interestingly, a larger learning rate does not help, as RMSProp seems to automatically compensate for it. However, transforming the output is mathematically equivalent to scaling the last model layer. Both the linear weights model[-1].weight *= 4.5 and the bias model[-1].bias[0] = 4.5 of the torch.nn.Linear layer need to be adjusted. Note that PyTorch does not allow you to modify parameters in-place, as this can break auto-differentiation when done during a forward pass. The context torch.no_grad() allows you to circumvent this safeguard. This context is generally helpful whenever no gradient computation is necessary, as it frees all intermediately computed tensors and can save a lot of memory when used correctly.
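
A minimal sketch of this last-layer rescaling, assuming the model is a torch.nn.Sequential whose final module is the torch.nn.Linear output layer (the exact architecture follows the lecture slides and is not reproduced here):

import torch

def scale_output_layer(model: torch.nn.Sequential, alpha: float = 4.5, beta: float = 4.5):
    # Folds f''(x) = alpha * f'(x) + beta directly into the last linear layer:
    # f'(x) = W z + b becomes alpha * W z + (alpha * b + beta).
    with torch.no_grad():  # in-place parameter edits must not be tracked by autograd
        model[-1].weight *= alpha
        model[-1].bias.mul_(alpha).add_(beta)
    return model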

E0.8: Mean and variance of online estimates (voluntary)

Let $\{y_t\}_{t=1}^{\infty}$ be an infinite training set drawn i.i.d. from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. At time $t$, the online estimate $f_t$ of the average over the training set, starting at $f_0$, is defined as

$$f_t = f_{t-1} + \alpha\,(y_t - f_{t-1})\,, \qquad 0 < \alpha < 1\,.$$

(a) Show that for small $t$ the online estimate is biased, i.e., $\mathbb{E}[f_t] \neq \mu$.


(b) Prove that in the limit $t \to \infty$ the online estimate is unbiased, i.e., $\mathbb{E}[f_t] = \mu$.
(c) Prove that in the limit $t \to \infty$ the variance of the online estimate is $\mathbb{E}[f_t^2] - \mathbb{E}[f_t]^2 = \frac{\alpha\,\sigma^2}{2-\alpha}$.
Hint: You can use the geometric series $\sum_{k=0}^{t-1} r^k = \frac{1-r^t}{1-r}\,, \forall\, |r| < 1$.
Bonus-question: Prove that for the decaying learning rate $\alpha_t = \frac{\alpha}{1-(1-\alpha)^t}$ holds $\lim_{\alpha\to 0} \alpha_t = \frac{1}{t}$.
Hint: You can also use the binomial identity $(x + y)^t = \sum_{k=0}^{t} \binom{t}{k}\, x^k y^{t-k}$.

Solution:
(a) Note that by recursion $f_t = (1-\alpha)^t f_0 + \sum_{i=0}^{t-1} \alpha\,(1-\alpha)^i\, y_{t-i}$.

$$\mathbb{E}[f_t] = (1-\alpha)^t f_0 + \alpha \underbrace{\sum_{i=0}^{t-1} (1-\alpha)^i}_{\frac{1-(1-\alpha)^t}{1-(1-\alpha)}}\, \underbrace{\mathbb{E}[y_{t-i}]}_{\mu} = \mu + (1-\alpha)^t (f_0 - \mu) \neq \mu\,.$$

(b) In the limit of (a), the term $(1-\alpha)^t$ goes to 0:

$$\lim_{t\to\infty} \mathbb{E}[f_t] = \mu + \lim_{t\to\infty} \underbrace{(1-\alpha)^t}_{0}\, (f_0 - \mu) = \mu\,.$$

(c) Note that due to the assumption of i.i.d. sampling $\mathbb{E}[y_i y_j] = \mu^2 + \sigma^2 \delta_{ij}$.

$$\lim_{t\to\infty} \mathbb{E}[f_t^2] = \lim_{t\to\infty} \mathbb{E}\Big[\Big((1-\alpha)^t f_0 + \sum_{i=0}^{t-1} \alpha(1-\alpha)^i y_{t-i}\Big)^2\Big]$$

$$= \lim_{t\to\infty} \mathbb{E}\Big[\big((1-\alpha)^t f_0\big)^2 + 2(1-\alpha)^t f_0 \sum_{i=0}^{t-1} \alpha(1-\alpha)^i y_{t-i} + \sum_{i=0}^{t-1} \alpha(1-\alpha)^i y_{t-i} \sum_{j=0}^{t-1} \alpha(1-\alpha)^j y_{t-j}\Big]$$

$$= \lim_{t\to\infty} \underbrace{\big((1-\alpha)^t f_0\big)^2}_{\to 0} + \lim_{t\to\infty} \underbrace{2\alpha f_0 \mu\, \tfrac{(1-\alpha)^t - (1-\alpha)^{2t}}{1-(1-\alpha)}}_{\to 0} + \lim_{t\to\infty} \mathbb{E}\Big[\sum_{i=0}^{t-1} \alpha(1-\alpha)^i y_{t-i} \sum_{j=0}^{t-1} \alpha(1-\alpha)^j y_{t-j}\Big]$$

$$= \alpha^2 \lim_{t\to\infty} \sum_{i,j=0}^{t-1} (1-\alpha)^{i+j}\, \mathbb{E}[y_{t-i}\, y_{t-j}]$$

$$= \mu^2 \underbrace{\Big(\alpha \sum_{i=0}^{\infty} (1-\alpha)^i\Big)^2}_{=1} + \alpha^2 \sigma^2 \lim_{t\to\infty} \sum_{i=0}^{t-1} \big((1-\alpha)^2\big)^i$$

$$= \mu^2 + \frac{\alpha^2 \sigma^2}{1-(1-\alpha)^2} = \mu^2 + \frac{\alpha\,\sigma^2}{2-\alpha}\,.$$

Subtracting $\lim_{t\to\infty} \mathbb{E}[f_t]^2 = \mu^2$ yields the variance.
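
A small simulation of the online estimate confirms the limit variance $\frac{\alpha\,\sigma^2}{2-\alpha}$ (a sketch, not part of the original sheet):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, alpha, t_max, runs = 1.0, 2.0, 0.1, 500, 20_000

f = np.zeros(runs)  # f_0 = 0 for all parallel runs
for _ in range(t_max):
    y = rng.normal(mu, sigma, size=runs)
    f += alpha * (y - f)  # f_t = f_{t-1} + alpha * (y_t - f_{t-1})

print(f.mean(), mu)                             # unbiased in the limit
print(f.var(), alpha / (2 - alpha) * sigma**2)  # both approximately 0.21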


Bonus-question: One can use the binomial identity to reformulate the $(1-\alpha)^t$:

$$\lim_{\alpha\to 0} \alpha_t = \lim_{\alpha\to 0} \tfrac{\alpha}{1-(1-\alpha)^t} = \lim_{\alpha\to 0} \big[\alpha^{-1} - \alpha^{-1}(1-\alpha)^t\big]^{-1}$$

$$= \lim_{\alpha\to 0} \Big[\alpha^{-1} - \sum_{k=0}^{t} \binom{t}{k} (-1)^{t-k}\, \underbrace{\alpha^{t-k}\, \alpha^{-1}}_{\alpha^{t-k-1}}\Big]^{-1}$$

$$= \lim_{\alpha\to 0} \Big[\alpha^{-1} - \sum_{k=0}^{t-2} \binom{t}{k} (-1)^{t-k}\, \alpha^{t-k-1} + \underbrace{\binom{t}{t-1}}_{t}\, \underbrace{\alpha^0}_{1} - \alpha^{-1}\Big]^{-1}$$

$$= \lim_{\alpha\to 0} \Big[t - \sum_{k=0}^{t-2} \binom{t}{k} (-1)^{t-k}\, \alpha^{t-k-1}\Big]^{-1} = \frac{1}{t}\,.$$

Since $t - k - 1 \geq 1$ for all $k \leq t-2$, every term of the remaining sum vanishes as $\alpha \to 0$.

E0.9: Noise in linear functions (voluntary)

Let $\{x_i\}_{i=1}^n \subset \mathbb{R}^m$ denote a set of training samples and $\{y_i\}_{i=1}^n \subset \mathbb{R}$ the set of corresponding training labels. We will use the mean squared loss $L := \frac{1}{n}\sum_i (f(x_i) - y_i)^2$ to learn a function $f(x_i) \approx y_i\,, \forall i$.

(a) Derive the analytical solution for the parameter $a \in \mathbb{R}^m$ of a linear function $f(x) := a^\top x$.
(b) We will now augment the training data by adding i.i.d. noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2) \in \mathbb{R}$ to the training labels, i.e. $\tilde y_i := y_i + \epsilon_i$. Show that this does not change the analytical solution of the expected loss $\mathbb{E}[L]$.
(c) Let $f$ denote the function that minimizes $L$ without label noise, and let $\tilde f$ denote the function that minimizes $L$ with random noise $\epsilon_i$ added to the labels $y_i$ (but not the solution of the expected loss $\mathbb{E}[L]$). Derive the analytical variance $\mathbb{E}[(\tilde f(x) - f(x))^2]$ of the noisy solution $\tilde f$.
(d) We will now augment the training data by adding i.i.d. noise $\epsilon_i \sim \mathcal{N}(0, \Sigma) \in \mathbb{R}^m$ to the training samples: $\tilde x_i = x_i + \epsilon_i$. Derive the analytical solution for the parameter $a \in \mathbb{R}^m$ that minimizes the expected loss $\mathbb{E}[L]$.

Bonus-question: Which popular regularization method is equivalent to (d) and what problem does it solve?
Hint: Summarize all training samples into the matrix $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n\times m}$, all training labels into the vector $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$, and denote the noisy versions $\tilde y \in \mathbb{R}^n$ and $\tilde X \in \mathbb{R}^{n\times m}$.

Solution:
(a) Setting the gradient to zero, $\nabla_a L = \frac{2}{n}\sum_i (a^\top x_i - y_i)\, x_i = \frac{2}{n} X^\top X a - \frac{2}{n} X^\top y \overset{!}{=} 0$, allows us to derive the analytic solution for $a$ if the matrix $X^\top X$ is invertible: $a \overset{!}{=} (X^\top X)^{-1} X^\top y$.

(b) Using the result from (a): $\mathbb{E}[\nabla_a L] = \frac{2}{n} X^\top X a - \frac{2}{n} X^\top \mathbb{E}[\tilde y] = \frac{2}{n} X^\top X a - \frac{2}{n} X^\top y$, because $\mathbb{E}[\tilde y] = y + \mathbb{E}[\epsilon] = y$ due to the zero-mean noise vector $\epsilon := [\epsilon_1, \ldots, \epsilon_n]^\top$.

(c) First note that $\mathbb{E}[(\tilde f(x) - f(x))^2] = \mathbb{E}[\tilde f^2(x)] - 2 f(x)\, \mathbb{E}[\tilde f(x)] + f^2(x) = \mathbb{E}[\tilde f^2(x)] - f^2(x)$, because $\mathbb{E}[\tilde y] = y$. Due to i.i.d. noise we have $\mathbb{E}[\tilde y \tilde y^\top] = y y^\top + y\, \mathbb{E}[\epsilon^\top] + \mathbb{E}[\epsilon]\, y^\top + \mathbb{E}[\epsilon \epsilon^\top] = y y^\top + \sigma^2 I$, and $\mathbb{E}[\tilde f^2(x)] = x^\top (X^\top X)^{-1} X^\top\, \mathbb{E}[\tilde y \tilde y^\top]\, X (X^\top X)^{-1} x = f^2(x) + \sigma^2\, x^\top (X^\top X)^{-1} x$. The variance is therefore $\mathbb{E}[(\tilde f(x) - f(x))^2] = \sigma^2\, x^\top (X^\top X)^{-1} x$.


(d) Let $E := [\epsilon_1, \ldots, \epsilon_n]^\top$, where we know from the definition of all $\epsilon_i$ that $\mathbb{E}[E] = 0 \in \mathbb{R}^{n\times m}$ and $\mathbb{E}[E^\top E] = \mathbb{E}[\sum_i \epsilon_i \epsilon_i^\top] = n\Sigma$. Gradients pass through sums (and therefore expectations): $\nabla_a \mathbb{E}[L] = \frac{2}{n}\, \mathbb{E}[\tilde X^\top \tilde X]\, a - \frac{2}{n}\, \mathbb{E}[\tilde X^\top]\, y = 2\big(\frac{1}{n} X^\top X + \Sigma\big)\, a - \frac{2}{n} X^\top y \overset{!}{=} 0$. Now the optimal solution for the parameter vector is $a \overset{!}{=} (X^\top X + n\Sigma)^{-1} X^\top y$.

Bonus-question: The popular L2 regularization, also called weight decay, adds the term $\lambda \|a\|^2$ to the loss and yields the analytical solution $a \overset{!}{=} (X^\top X + n\lambda I)^{-1} X^\top y$, which is also called ridge regression. This regularization guarantees that the matrix $X^\top X + n\lambda I$ is invertible for all $\lambda > 0$ and yields smoother functions. For $\Sigma = \lambda I$, which corresponds to noising each input dimension independently with variance $\lambda$, the two solutions are the same, which indicates that noising the input smooths the learned function!
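
A small numpy experiment illustrating this equivalence: ordinary least squares on many noisy copies of the inputs converges to the ridge solution (a sketch with made-up data, not part of the original sheet):

import numpy as np

rng = np.random.default_rng(0)
n, m, lam, K = 200, 3, 0.1, 2000

X = rng.normal(size=(n, m))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# ridge regression: a = (X^T X + n*lambda*I)^{-1} X^T y
a_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(m), X.T @ y)

# least squares on K noisy copies of the inputs (Sigma = lambda * I)
X_aug = np.vstack([X + rng.normal(scale=np.sqrt(lam), size=X.shape) for _ in range(K)])
a_noisy = np.linalg.lstsq(X_aug, np.tile(y, K), rcond=None)[0]

print(a_ridge)
print(a_noisy)  # approximately equal for large K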

https://xkcd.com/2343
