
DSAIT4115 Deep Reinforcement Learning Exercise Sheet 0

Wendelin Böhmer <[email protected]> voluntary exercises

Math and machine learning primer


Voluntary exercises

The following exercises do not have to be submitted as homework, but may be helpful to practice
the required math and prepare for the exam. Some questions are from old exams and include the
rubric that was used. You do not have to submit these questions and will not receive points for them.
Solutions are available on Brightspace.

E0.1: Taylor expansion (voluntary)



For the function √(1 + x), write down the Taylor series around x₀ = 0 up to 3rd order.

E0.2: Critical points (voluntary)

Consider the two functions


f(x, y) := c + x² + y²
g(x, y) := c + x² − y²,

where c ∈ R is a constant.
(a) Show that a = (0, 0) is a critical point of both functions.
(b) Check for f and for g whether a is a minimum, maximum, or saddle point using the Hessian matrix.
Hint: A matrix is positive (negative) definite if and only if all its eigenvalues are positive (negative).

E0.3: Distributions and expected values (voluntary)

Let x ∈ R be a random variable with probability density p : R → R with:



p(x) = c · sin(x),  if x ∈ [0, π]
p(x) = 0,           elsewhere

(a) Determine the parameter c ∈ R such that p(x) is indeed a probability density.
(b) Determine the expected value µ := E_p[x].
(c) Determine the variance of x, E_p[(x − µ)²].


E0.4: Variance of the empirical mean (old exam question) (voluntary)

Prove that the variance of the empirical mean f_n := (1/n) ∑_{i=1}^n x_i, based on n samples x_i ∈ R drawn
i.i.d. from the Gaussian distribution N(µ, σ²), is V[f_n] = σ²/n, without using the fact that the variance of a
sum of independent variables is the sum of the variables' variances.

E0.5: Unbiased variance estimate (voluntary)

Let {x_i}_{i=1}^n be a data set that is drawn i.i.d. from the Gaussian distribution x_i ∼ N(µ, σ²). Let further
µ̂ := (1/n) ∑_{i=1}^n x_i denote the empirical mean and σ̂² := (1/n) ∑_{i=1}^n (x_i − µ̂)² the corresponding
empirical variance. Prove analytically that µ̂ is unbiased, i.e. E[µ̂] = µ, and that σ̂² is biased, i.e. E[σ̂²] ≠ σ².
Bonus-question: Can you derive an unbiased estimator for the empirical variance?
Hint: If x_i and x_j are drawn i.i.d. from N(µ, σ²), then the following hold ∀ i:

E[x_i] = µ ,  E[(x_i − µ)²] = σ²  and  E[(x_i − µ)(x_j − µ)] = 0 if i ≠ j .
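
A quick numerical experiment can complement the analytical proof. The following sketch (Python with numpy assumed; not part of the original sheet, and all constants are arbitrary choices for illustration) estimates E[σ̂²] over many repeated draws:

    # Sanity check for E0.5: the empirical variance with the 1/n normalizer is biased.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 0.0, 1.0, 5, 100_000

    x = rng.normal(mu, sigma, size=(trials, n))
    var_hat = x.var(axis=1, ddof=0)  # sigma_hat^2, i.e. (1/n) sum (x_i - mu_hat)^2

    print("mean of sigma_hat^2:", var_hat.mean())  # noticeably below sigma^2
    print("sigma^2            :", sigma**2)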

E0.6: Maximum dice (voluntary)

This question is designed to practice the use of Kronecker-delta functions and become more familiar
with (discrete) probabilities. You are given 3 dice, a D6, a D8 and a D10, where Dx refers to an x-sided
fair die whose x sides are uniquely numbered 1 to x and are each rolled with the same probability.
(a) Prove analytically that the probability that the D6 is among the highest (including equal) numbers
when all 3 dice are rolled together is roughly ρ ≈ 19%.
(b) Prove analytically that the probability that the D8 rolls among the highest is ρ′ ≈ 38%.
(c) Prove analytically that the probability that the D10 rolls among the highest is ρ′′ ≈ 58%.
Hint: You can solve the question however you want, but you are encouraged to use Kronecker-deltas,
e.g. δ(i > 5) is 1 if i > 5 and 0 otherwise. You will find that this can simplify complex sums
enormously. If you do so, you can use the equalities ∑_{i=1}^n i = (n² + n)/2 and
∑_{i=1}^n i² = n(n + 1)(2n + 1)/6.
Bonus-question: Why don’t the above numbers sum up to 1?
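
Since the question asks for analytical proofs, a brute-force enumeration can serve as a sanity check of your results. The sketch below (Python assumed; not part of the original sheet) enumerates all 6 · 8 · 10 equally likely outcomes:

    # Sanity check for E0.6: enumerate all 6*8*10 equally likely outcomes.
    from itertools import product

    rolls = list(product(range(1, 7), range(1, 9), range(1, 11)))

    for name, k in [("D6", 0), ("D8", 1), ("D10", 2)]:
        # fraction of outcomes where die k is among the highest (ties count)
        p = sum(r[k] == max(r) for r in rolls) / len(rolls)
        print(name, round(p, 4))  # roughly 0.19, 0.38, 0.58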

E0.7: Implement MNIST classification (voluntary)

Implement the MNIST classification example from the lecture slides. Make sure you get the correct
deep CNN model architecture from Lecture 2 (p.18).
(a) Train the model fθ : R^{28×28} → R^{10} from the lecture slides with a cross-entropy loss for 10 epochs.
Plot the average train/test losses during each epoch (y-axis) over all epochs (x-axis). Do the same
with the average train/test accuracies during each epoch. Try to program as modularly as possible, as
you will re-use the code later.


(b) Change your optimization criterion to a mean-squared-error loss between the same model architecture
fθ : R^{28×28} → R^{10} you used in (a) and a one-hot encoding (h_i ∈ R^{10}, h_{ij} = 1 iff j = y_i,
otherwise h_{ij} = 0) of the labels y_i:

L := (1/n) ∑_{i=1}^n (fθ(x_i) − h_i)²

Plot the same plots as in (a). Try to reuse as much of your old code as possible, e.g., by defining
the criterion (which is now different) as an external function that can be overwritten.
(c) Define a new architecture fθ′ : R^{28×28} → R, that is exactly the same as above, but with only one
output neuron instead of 10. Train it with a regression mean-squared-error loss between the model
output and the scalar class identifier:

L′ := (1/n) ∑_{i=1}^n (fθ′(x_i) − y_i)²

Plot the same plots as in (a), but for 50 epochs.


(d) Learning in (c) should be significantly slower, in terms of accuracy gain per epoch, than in (a) and
(b). Use a transformation of your model output (which can be implemented in the functions that
compute the criterion and the accuracy, or as an extra module), fθ′′(x_i) := α fθ′(x_i) + β, with
α = β = 4.5. Plot the same plots as in (c). Does the learning behavior change? Why?
Bonus-question: Can you come up with an alternative approach to (d) that has the same speed-up effect?
Hint: Evaluate your test loss and accuracy before every training run to make sure the accuracy is defined
correctly (it should be around 0.1 for an untrained model). This means that you will always have
one more test measurement than train measurements.
Hint: Try to reuse as much of your old code as possible, e.g., by defining the criterion and the accuracy
(which will change for some questions) as external functions that can be overwritten later.
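
To illustrate the modular structure the hints suggest, here is a minimal training-loop sketch (PyTorch assumed; not part of the original sheet). The CNN inside make_model is a placeholder, not the Lecture 2 (p.18) architecture, which you should substitute; the criterion is passed in as a function so it can be swapped for parts (b)-(d):

    # Hypothetical sketch of a modular MNIST training loop (PyTorch assumed).
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def make_model(n_out=10):
        # placeholder CNN: replace with the architecture from Lecture 2 (p.18)
        return nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 7 * 7, n_out),
        )

    def cross_entropy_criterion(out, y):
        # swap this function for the MSE criteria of parts (b) and (c)
        return nn.functional.cross_entropy(out, y)

    def accuracy(out, y):
        # swap as well for part (c), where the output is a scalar
        return (out.argmax(dim=1) == y).float().mean().item()

    def run_epoch(model, loader, criterion, opt=None):
        # one pass over the data; trains only if an optimizer is given
        losses, accs = [], []
        for x, y in loader:
            out = model(x)
            loss = criterion(out, y)
            if opt is not None:
                opt.zero_grad()
                loss.backward()
                opt.step()
            losses.append(loss.item())
            accs.append(accuracy(out, y))
        return sum(losses) / len(losses), sum(accs) / len(accs)

    if __name__ == "__main__":
        tfm = transforms.ToTensor()
        train = datasets.MNIST("data", train=True, download=True, transform=tfm)
        test = datasets.MNIST("data", train=False, download=True, transform=tfm)
        train_loader = DataLoader(train, batch_size=128, shuffle=True)
        test_loader = DataLoader(test, batch_size=256)
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for epoch in range(10):
            tr = run_epoch(model, train_loader, cross_entropy_criterion, opt)
            with torch.no_grad():
                te = run_epoch(model, test_loader, cross_entropy_criterion)
            print(epoch, tr, te)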

E0.8: Mean and variance of online estimates (voluntary)

Let {y_t}_{t=1}^∞ be an infinite training set drawn i.i.d. from the Gaussian distribution N(µ, σ²). At time t, the
online estimate f_t of the average over the training set, starting at f_0, is defined as

f_t = f_{t−1} + α (y_t − f_{t−1}) ,   0 < α < 1.

(a) Show that for small t the online estimate is biased, i.e., E[f_t] ≠ µ .
(b) Prove that in the limit t → ∞ the online estimate is unbiased, i.e., lim_{t→∞} E[f_t] = µ .
(c) Prove that in the limit t → ∞ the variance of the online estimate is E[f_t²] − E[f_t]² = α σ² / (2 − α) .
Hint: You can use the geometric series ∑_{k=0}^{t−1} r^k = (1 − r^t)/(1 − r), ∀ |r| < 1 .
k=0

Bonus-question: Prove that for the decaying learning rate α_t = α / (1 − (1 − α)^t), it holds that lim_{α→0} α_t = 1/t .

Hint: You can also use the binomial identity (x + y)^t = ∑_{k=0}^t (t choose k) x^k y^{t−k} .
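
Before proving (c), a simulation can check the limiting variance. This sketch (numpy assumed; not part of the original sheet) runs many independent online estimates in parallel and compares their empirical variance to α σ²/(2 − α):

    # Monte Carlo check of E0.8(c): limiting variance of the online estimate.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, alpha = 1.0, 2.0, 0.1
    runs, steps = 10_000, 2_000  # many runs, long enough to reach the limit

    f = np.zeros(runs)  # f_0 = 0 for every run
    for _ in range(steps):
        y = rng.normal(mu, sigma, size=runs)
        f += alpha * (y - f)  # f_t = f_{t-1} + alpha (y_t - f_{t-1})

    print("empirical variance     :", f.var())
    print("alpha*sigma^2/(2-alpha):", alpha * sigma**2 / (2 - alpha))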


E0.9: Noise in linear functions (voluntary)

Let {x_i}_{i=1}^n ⊂ R^m denote a set of training samples and {y_i}_{i=1}^n ⊂ R the set of corresponding training
labels. We will use the mean squared loss L := (1/n) ∑_i (f(x_i) − y_i)² to learn a function f(x_i) ≈ y_i, ∀ i.

(a) Derive the analytical solution for the parameter a ∈ R^m of a linear function f(x) := a^⊤ x.
(b) We will now augment the training data by adding i.i.d. noise ϵ_i ∼ N(0, σ²) ∈ R to the training
labels, i.e. ỹ_i := y_i + ϵ_i. Show that this does not change the analytical solution of the expected loss
E[L].
(c) Let f denote the function that minimizes L without label noise, and let f̃ denote the function that
minimizes L with random noise ϵ_i added to the labels y_i (but not the solution of the expected loss
E[L]). Derive the analytical variance E[(f̃(x) − f(x))²] of the noisy solution f̃.
(d) We will now augment the training data by adding i.i.d. noise ϵ_i ∼ N(0, Σ) ∈ R^m to the training
samples: x̃_i = x_i + ϵ_i. Derive the analytical solution for the parameter a ∈ R^m that minimizes the
expected loss E[L].

Bonus-question: Which popular regularization method is equivalent to (d), and what problem does it solve?
Hint: Summarize all training samples into the matrix X = [x_1, . . . , x_n]^⊤ ∈ R^{n×m}, all training labels into
the vector y = [y_1, . . . , y_n]^⊤ ∈ R^n, and denote the noisy versions ỹ ∈ R^n and X̃ ∈ R^{n×m}.
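
Once you have derived the closed form for (a), a numerical cross-check is straightforward. The sketch below (numpy assumed; not part of the original sheet) compares the familiar normal-equations solution against numpy's built-in least-squares solver on synthetic data:

    # Numerical check for E0.9(a): closed-form least squares vs. numpy's solver.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 200, 5
    X = rng.normal(size=(n, m))                # rows are training samples x_i
    a_true = rng.normal(size=m)
    y = X @ a_true + 0.1 * rng.normal(size=n)  # labels with a little noise

    # normal-equations solution a = (X^T X)^{-1} X^T y
    a_closed = np.linalg.solve(X.T @ X, X.T @ y)
    a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(a_closed, a_lstsq))  # True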

https://xkcd.com/2343
