DGM 2023 Endterm Solution
Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.
Exam: CIT4230003 / Endterm
Date: Monday 31st July, 2023
Examiner: Prof. Dr. Stephan Günnemann
Time: 13:30 – 14:30
Working instructions
• This exam consists of 12 pages with a total of 4 problems.
Please make sure now that you received a complete copy of the exam.
• Allowed resources:
– one A4 sheet of handwritten notes, two sides.
• There is scratch paper at the end of the exam (after problem 4).
• Write your answers only in the provided solution boxes or the scratch paper.
• If you solve a task on the scratch paper, clearly reference it in the main solution box.
• For problems that say “Justify your answer” you only get points if you provide a valid explanation.
• For problems that say “Derive” you only get points if you provide a valid mathematical derivation.
• For problems that say “Prove” you only get points if you provide a valid mathematical proof.
• If a problem does not say “Justify your answer”, “Derive” or “Prove”, it is sufficient to only provide the
correct answer.
Problem 1 Normalizing flows (5 credits)
In this task, we will focus on the reverse parametrization for normalizing flows on $\mathbb{R}^d$.

a) The transformation is defined as

$$A = \frac{1}{2}\, a^T a, \qquad z = \sigma(A x).$$
Please state whether this transformation leads to a valid normalizing flow. Justify your answer accordingly.
No, this transformation is not invertible. $A$ trivially does not have full rank, so $\det A = 0$ and the linear map $x \mapsto Ax$ is not injective. Therefore, the overall transformation is non-invertible and does not define a valid normalizing flow.
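A quick NumPy sketch (not part of the official solution) can illustrate this; it assumes $a$ is a single row vector, so that $a^T a$ is a rank-one outer product:

```python
# Minimal numerical sketch (illustration only): for a random row vector a,
# A = 0.5 * a^T a has rank one, so x -> sigmoid(A x) cannot be injective.
import numpy as np

rng = np.random.default_rng(0)
d = 4
a = rng.normal(size=(1, d))          # row vector, so a.T @ a is d x d
A = 0.5 * a.T @ a

print(np.linalg.matrix_rank(A))      # 1  (< d, so A is singular)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x1 = rng.normal(size=d)
# Add any vector from the null space of A: the output does not change.
null_dir = np.linalg.svd(A)[2][-1]   # right singular vector with singular value 0
x2 = x1 + 3.0 * null_dir

print(np.allclose(sigmoid(A @ x1), sigmoid(A @ x2)))  # True, yet x1 != x2
```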
b) Now, let $x \in \mathbb{R}^3$ and the transformation is defined as follows:

$$z_1 = \tfrac{1}{2}(x_3 + x_2)^3, \qquad z_2 = x_1^4\, x_2 + x_3, \qquad z_3 = e^{x_3}.$$
Please state whether this transformation leads to a valid normalizing flow. Justify your answer accordingly.
No, this is not a valid transformation. To disprove the bijectivity of the transformation, we can find a counterexample: for any assignment of $x$ with $x_1 \neq 0$, the different assignment $(-x_1, x_2, x_3)$ maps to the same $z$, because $x_1$ only enters through $x_1^4$. Hence the transformation is not injective.
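A quick numerical check of the counterexample (illustration only, using the reconstructed transformation above):

```python
# Minimal numerical sketch: two distinct inputs that the transformation of
# part b) maps to the same output, so it is not injective.
import numpy as np

def f(x):
    x1, x2, x3 = x
    return np.array([0.5 * (x3 + x2) ** 3,   # z1
                     x1 ** 4 * x2 + x3,      # z2
                     np.exp(x3)])            # z3

x_a = np.array([1.3, -0.7, 0.2])
x_b = np.array([-1.3, -0.7, 0.2])            # flip the sign of x1

print(np.allclose(f(x_a), f(x_b)))           # True, although x_a != x_b
```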
c) Lastly, let's assume you are given a transformation $f : \mathbb{R} \to \mathbb{R}$, where we know that the Jacobian determinant of its inverse is equal to 1. How does this affect the normalizing flow?
Please use the change of variables formula and a possible parametrization of $f^{-1}$ to explain.
A flow whose Jacobian determinant is equal to 1 is volume-preserving, i.e., the change of variables formula reduces to $p_2(x) = p_1(f^{-1}(x)) \cdot 1$. In one dimension, a Jacobian determinant of one means $(f^{-1})'(x) = 1$ everywhere, so a possible parametrization is $f^{-1}(x) = x + b$ with $b \in \mathbb{R}$, which includes the identity map and any translation. Such a flow is therefore not expressive: it can only translate the base distribution $p_1(z)$ and cannot change its shape.
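A small sketch (illustration only) that confirms the translation behaviour empirically, assuming the parametrization $f^{-1}(x) = x + b$:

```python
# Minimal sketch: with f^{-1}(x) = x + b the flow merely translates the base
# distribution, exactly as p2(x) = p1(f^{-1}(x)) * 1 predicts.
import numpy as np
from scipy.stats import norm

b = 2.0
z = np.random.default_rng(0).normal(size=200_000)   # samples from p1 = N(0, 1)
x = z - b                                            # f(z) = z - b, so f^{-1}(x) = x + b

# Empirical density of x at a few points vs. the predicted density p1(x + b)
grid = np.array([-3.0, -2.0, -1.0, 0.0])
hist, edges = np.histogram(x, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(grid, centers, hist)
predicted = norm.pdf(grid + b)                       # p1(f^{-1}(x))

print(np.round(empirical, 2))
print(np.round(predicted, 2))                        # close to the empirical values
```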
Problem 2 Variational Inference & Variational Autoencoder (9 credits)
We want to draw samples from a log-normal distribution log N (µ, σ 2 ), where µ, σ ∈ R, with reparametrization.
The probability density function of the log-normal distribution is defined as:
$$q_{\mu,\sigma^2}(z) = \begin{cases} \dfrac{1}{z\,\sigma\sqrt{2\pi}} \exp\!\left(-\dfrac{(\ln z - \mu)^2}{2\sigma^2}\right) & \text{if } z > 0 \\[2mm] 0 & \text{otherwise} \end{cases}$$

Its cumulative density function is given as:

$$Q_{\mu,\sigma^2}(a) = \Pr(z \leq a) = \int_{-\infty}^{a} q_{\mu,\sigma^2}(z)\, dz = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{\ln a - \mu}{\sigma\sqrt{2}}\right)\right)$$

Recall that the error function $\operatorname{erf}(z)$ is an invertible function that is defined as $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z \exp(-t^2)\, dt$.
a) Suppose you have access to an algorithm that produces samples $\epsilon$ from a standard normal distribution $\mathcal{N}(0, 1)$. Find a deterministic transformation $T : \mathbb{R} \to \mathbb{R}_{>0}$ that transforms a sample $\epsilon \sim \mathcal{N}(0, 1)$ into a sample from the log-normal distribution $\log\mathcal{N}(0, 1)$.

Hint: The cumulative density function of a normal distribution $\mathcal{N}(\mu, \sigma^2)$ is given as:

$$F_{\mu,\sigma^2}(a) = \Pr(z \leq a) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{a - \mu}{\sigma\sqrt{2}}\right)\right)$$
We want to find $T$ such that $\Pr(T(\epsilon) \leq a) = Q_{0,1}(a)$. Since $\operatorname{erf}(z)$ is invertible, we just have to match the arguments of the error function:

$$\Pr(T(\epsilon) \leq a) = \Pr(\epsilon \leq T^{-1}(a)) = F_{0,1}(T^{-1}(a)) \overset{!}{=} Q_{0,1}(a) \;\Rightarrow\; \frac{T^{-1}(a)}{\sqrt{2}} = \frac{\ln a}{\sqrt{2}} \;\Rightarrow\; T(\epsilon) = \exp(\epsilon)$$
b) Now suppose you have access to an algorithm that produces samples $z$ from a log-normal distribution $\log\mathcal{N}(0, 1)$. Find a deterministic transformation $M_{\mu,\sigma^2} : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ that transforms a sample $z \sim \log\mathcal{N}(0, 1)$ into a sample from the log-normal distribution $\log\mathcal{N}(\mu, \sigma^2)$.

We want to find $M_{\mu,\sigma^2}$ such that $\Pr(M_{\mu,\sigma^2}(z) \leq a) = Q_{\mu,\sigma^2}(a)$.
Similar to before, we match the arguments of the error function:

$$\ln M_{\mu,\sigma^2}^{-1}(a) = \frac{\ln a - \mu}{\sigma}$$
$$\sigma \ln M_{\mu,\sigma^2}^{-1}(a) + \mu = \ln a$$
$$M_{\mu,\sigma^2}^{-1}(a)^{\sigma} \exp(\mu) = a$$
$$\Rightarrow\; M_{\mu,\sigma^2}(z) = z^{\sigma} \exp(\mu)$$
c) Now suppose you have access to an algorithm that produces samples $\epsilon$ from a standard normal distribution $\mathcal{N}(0, 1)$. Find a deterministic transformation $C_{\mu,\sigma^2} : \mathbb{R} \to \mathbb{R}_{>0}$ that transforms a sample $\epsilon \sim \mathcal{N}(0, 1)$ into a sample from the log-normal distribution $\log\mathcal{N}(\mu, \sigma^2)$.
$$C_{\mu,\sigma^2}(\epsilon) = (M_{\mu,\sigma^2} \circ T)(\epsilon) = M_{\mu,\sigma^2}(T(\epsilon)) = \exp(\epsilon)^{\sigma} \exp(\mu) = \exp(\sigma\epsilon + \mu)$$
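A short sanity check (illustration only) that the reparametrization produces samples matching $\log\mathcal{N}(\mu, \sigma^2)$; scipy's `lognorm` with `s=sigma, scale=exp(mu)` serves as the reference:

```python
# Minimal sketch: the reparametrization from c) should produce samples whose
# empirical moments match log N(mu, sigma^2).
import numpy as np
from scipy.stats import lognorm

mu, sigma = 0.4, 0.7
rng = np.random.default_rng(0)

eps = rng.normal(size=500_000)          # eps ~ N(0, 1)
samples = np.exp(sigma * eps + mu)      # C_{mu,sigma^2}(eps) = exp(sigma * eps + mu)

# scipy parametrizes the log-normal via s = sigma and scale = exp(mu)
ref = lognorm(s=sigma, scale=np.exp(mu))
print(np.round([samples.mean(), ref.mean()], 3))   # both ~ exp(mu + sigma^2 / 2)
print(np.round([samples.var(),  ref.var()],  3))
```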
d) We want to model the distribution of data samples $p(x)$ using a Variational Autoencoder. Recall that this assumes a latent variable structure $p(x, z) = p(x \mid z)\, p(z)$, and we need to model the distribution $p_\theta(x \mid z)$ and the variational distribution $q_\phi(z)$ respectively. We learn the parameters of our model by optimizing the ELBO:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_\phi(z)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z) \,\|\, p(z)\right)$$

The KL-divergence $\mathrm{KL}(q_\phi(z) \,\|\, p(z))$ is defined as:

$$\mathrm{KL}\!\left(q_\phi(z) \,\|\, p(z)\right) = \int_{\mathbb{R}} q_\phi(z) \log \frac{q_\phi(z)}{p(z)}\, dz \tag{2.1}$$

Assume the prior is the log-normal distribution $p(z) = \log\mathcal{N}(0, 1)$ and the variational distribution is parametrized as a normal distribution $q_\phi(z) = \mathcal{N}(\mu, \sigma^2)$. What problem arises when optimizing the ELBO, and how can you resolve it?
Since $p(z)$ will be zero for $z \leq 0$ while $q_\phi(z) > 0$, the KL-divergence term diverges to $\infty$, which prevents gradient-based optimization.
If we instead parametrize $q_\phi(z) = \log\mathcal{N}(\mu, \sigma^2)$, both arguments of the KL-divergence have the same support, ensuring finite values. By employing the reparametrization scheme of c), we can backpropagate through sampling from $q_\phi(z)$.
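A minimal PyTorch sketch (illustration only) of how the reparametrization from c) lets gradients flow into the variational parameters:

```python
# Minimal sketch of a reparametrized log-normal sample: gradients flow into
# mu and log_sigma, so the ELBO can be optimized with gradient descent.
import torch

mu = torch.tensor(0.3, requires_grad=True)
log_sigma = torch.tensor(-0.5, requires_grad=True)

eps = torch.randn(1024)                         # eps ~ N(0, 1), no gradient needed
z = torch.exp(log_sigma.exp() * eps + mu)       # z ~ log N(mu, sigma^2), differentiable

# Any downstream loss (here just a placeholder) can be backpropagated.
loss = z.mean()
loss.backward()
print(mu.grad, log_sigma.grad)                  # finite, non-zero gradients
```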
Problem 3 Generative Adversarial Networks (8 credits)
For $\pi = \frac{1}{2}$, GANs are trained by optimizing the model parameters $\theta$ according to

$$\min_\theta \max_\phi \;\; \underbrace{\tfrac{1}{2}\, \mathbb{E}_{p^*(x)}\!\left[\log D_\phi(x)\right]}_{E_1} + \underbrace{\tfrac{1}{2}\, \mathbb{E}_{p(z)}\!\left[\log\!\left(1 - D_\phi(f_\theta(z))\right)\right]}_{E_2}.$$
a) Explain what the terms $E_1$ and $E_2$ reward, and why discriminator and generator are adversaries.
• E1 : Rewards the discriminator for recognizing samples from the data distribution
• E2 : Rewards the discriminator for rejecting samples from the generated distribution and the
generator for fooling the discriminator
• Discriminator and generator are adversaries because they optimize the same objective in opposite directions
b) Show that the loss

$$L = \max_\phi \;\; \tfrac{1}{2}\, \mathbb{E}_{p^*(x)}\!\left[\log D_\phi(x)\right] + \tfrac{1}{2}\, \mathbb{E}_{p(z)}\!\left[\log\!\left(1 - D_\phi(f_\theta(z))\right)\right]$$

from the GAN objective is equivalent to the Jensen-Shannon divergence (JSD) between the data distribution $p^*$ and the learned, generated distribution $p_\theta$, i.e.

$$L = \mathrm{JSD}(p^*, p_\theta) + c$$

for some constant $c \in \mathbb{R}$ that does not depend on $p^*$ or $\theta$. The JSD between two probability densities $p$ and $q$ is defined as

$$\mathrm{JSD}(p, q) = \tfrac{1}{2}\left(\mathrm{KL}\!\left(p \,\big\|\, \tfrac{1}{2}(p + q)\right) + \mathrm{KL}\!\left(q \,\big\|\, \tfrac{1}{2}(p + q)\right)\right)$$

Hint: For GANs, it holds for functions $h$ that

$$\mathbb{E}_{p(z)}\!\left[h(f_\theta(z))\right] = \mathbb{E}_{p_\theta(x)}\!\left[h(x)\right].$$
The inner maximization over $\phi$ is attained by the optimal discriminator

$$D_\phi^*(x) = \frac{p^*(x)}{p^*(x) + p_\theta(x)}.$$

Plugging $D_\phi^*$ into the loss and using the hint to rewrite the expectation over $p(z)$ as an expectation over $p_\theta(x)$:

$$L = \tfrac{1}{2}\, \mathbb{E}_{p^*(x)}\!\left[\log \frac{p^*(x)}{p^*(x) + p_\theta(x)}\right] + \tfrac{1}{2}\, \mathbb{E}_{p_\theta(x)}\!\left[\log\!\left(1 - \frac{p^*(x)}{p^*(x) + p_\theta(x)}\right)\right] \tag{3.3}$$

$$= \tfrac{1}{2}\, \mathbb{E}_{p^*(x)}\!\left[\log \frac{p^*(x)}{p^*(x) + p_\theta(x)}\right] + \tfrac{1}{2}\, \mathbb{E}_{p_\theta(x)}\!\left[\log \frac{p_\theta(x)}{p^*(x) + p_\theta(x)}\right] \tag{3.4}$$

$$= \tfrac{1}{2}\, \mathbb{E}_{p^*(x)}\!\left[\log \frac{p^*(x)}{\frac{p^*(x) + p_\theta(x)}{2}}\right] + \tfrac{1}{2}\, \mathbb{E}_{p_\theta(x)}\!\left[\log \frac{p_\theta(x)}{\frac{p^*(x) + p_\theta(x)}{2}}\right] - \log(2) \tag{3.5}$$

$$= \tfrac{1}{2}\, \mathrm{KL}\!\left(p^* \,\Big\|\, \frac{p^* + p_\theta}{2}\right) + \tfrac{1}{2}\, \mathrm{KL}\!\left(p_\theta \,\Big\|\, \frac{p^* + p_\theta}{2}\right) - \log(2) \tag{3.6}$$

$$= \mathrm{JSD}(p^*, p_\theta) - \log(2),$$

which is the JSD up to the constant $c = -\log(2)$ that depends neither on $p^*$ nor on $\theta$.
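A quick numerical check (illustration only) on two discrete toy distributions that the loss with the optimal discriminator indeed equals $\mathrm{JSD}(p^*, p_\theta) - \log 2$:

```python
# Minimal numerical sketch: for two discrete toy distributions, the loss with
# the optimal discriminator equals JSD - log 2.
import numpy as np

p_star = np.array([0.1, 0.2, 0.3, 0.4])      # "data" distribution
p_theta = np.array([0.25, 0.25, 0.25, 0.25]) # "generated" distribution
m = 0.5 * (p_star + p_theta)

kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * (kl(p_star, m) + kl(p_theta, m))

d_opt = p_star / (p_star + p_theta)          # optimal discriminator
loss = 0.5 * np.sum(p_star * np.log(d_opt)) + 0.5 * np.sum(p_theta * np.log(1 - d_opt))

print(np.isclose(loss, jsd - np.log(2)))     # True
```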
Problem 4 Denoising Diffusion (6 credits)
Consider a denoising diffusion model with $N$ diffusion steps and the usual forward parametrization $q_{\phi(x_0)}$ and reverse process $p_\theta$.

$$\alpha_n = 1 - \beta_n \qquad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i \qquad \tilde{\beta}_n = \frac{1 - \bar{\alpha}_{n-1}}{1 - \bar{\alpha}_n}\, \beta_n$$

$$q_{\phi(x_0)}(z_n) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_n}\, x_0,\; (1 - \bar{\alpha}_n)\, I\right)$$

$$q_{\phi(x_0)}(z_{n-1} \mid z_n) = \mathcal{N}\!\left(\frac{\sqrt{\alpha_n}\,(1 - \bar{\alpha}_{n-1})}{1 - \bar{\alpha}_n}\, z_n + \frac{\sqrt{\bar{\alpha}_{n-1}}\, \beta_n}{1 - \bar{\alpha}_n}\, x_0,\; \tilde{\beta}_n I\right)$$

$$x_0 = \frac{z_n - \sqrt{1 - \bar{\alpha}_n}\; \epsilon_\theta(z_n, n)}{\sqrt{\bar{\alpha}_n}}$$
a) Why do we optimize the ELBO instead of the data log-likelihood?
The data log-likelihood $\log p(x)$ requires us to marginalize out the latent variables $z_1, \ldots, z_N$, which is intractable.
b) Why does model training fail if $\beta_n > 1$?
If $\beta_n > 1$, then $\alpha_n = 1 - \beta_n < 0$, so the product $\bar{\alpha}_{n'}$ contains at least one negative factor for $n' \geq n$ and can become negative (its sign may even oscillate). We would then have to take the square root of a negative number, $\sqrt{\bar{\alpha}_{n'}}$, during training when sampling $z_{n'} \sim q_{\phi(x_0)}(z_{n'})$.
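A tiny numerical illustration (not part of the official solution) of how a single $\beta_n > 1$ breaks the forward process:

```python
# Minimal sketch: a single beta_n > 1 makes alpha_bar negative, so
# sqrt(alpha_bar) in the forward process is undefined.
import numpy as np

betas = np.array([0.1, 0.2, 1.5, 0.3])   # beta_3 > 1
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

print(alpha_bar)                          # becomes negative from step 3 on
print(np.sqrt(alpha_bar))                 # nan entries -> training fails
```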
c) Why does the model fail if $\beta_n = 1$?

If $\beta_n = 1$, we get $\alpha_n = 1 - \beta_n = 0$ and therefore $\bar{\alpha}_{n'} = 0$ for $n' \geq n$. In training this would mean that all information from $x_0$ would be lost from the $n$-th step on. During sampling, we would also have to divide by $\sqrt{\bar{\alpha}_n} = 0$ when recovering $x_0$, which is undefined.
d) Which of these beta schedules are invalid? Justify your answer.

1. $\beta_n = \sin\!\left(\frac{n}{N}\right)$
2. $\beta_n = 1 - \frac{1}{n}$
3. $\beta_n = \log_e\!\left(1 + \frac{n}{N}\right)$
4. $\beta_n = -\cos\!\left(\frac{\pi n}{N}\right)$
Schedule 2 is invalid because $\beta_1 = 1 - \frac{1}{1} = 0$. Schedule 4 is invalid because $\beta_n \leq 0$ for $n \leq \frac{N}{2}$, e.g. $\beta_1 = -\cos\!\left(\frac{\pi}{N}\right) < 0$.
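A small script (illustration only) that evaluates the four schedules for, e.g., $N = 10$ and flags values of $\beta_n$ outside $(0, 1)$:

```python
# Minimal sketch: evaluate the four schedules and flag any beta_n outside the
# valid range (0, 1), consistent with parts b) and c).
import numpy as np

N = 10
n = np.arange(1, N + 1)

schedules = {
    "1: sin(n/N)":      np.sin(n / N),
    "2: 1 - 1/n":       1.0 - 1.0 / n,
    "3: log(1 + n/N)":  np.log(1.0 + n / N),
    "4: -cos(pi*n/N)":  -np.cos(np.pi * n / N),
}

for name, betas in schedules.items():
    valid = np.all((betas > 0) & (betas < 1))
    print(f"{name}: valid={valid}")
# Schedule 2 fails (beta_1 = 0) and schedule 4 fails (beta_n <= 0 for n <= N/2).
```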
Additional space for solutions – clearly mark the (sub)problem your answers are related to and strike out invalid solutions.