DGM 2023 Endterm Solution

Data Analytics & Machine Learning
Informatics
Technical University of Munich

Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

Advanced Machine Learning: Deep Generative Models

Exam: CIT4230003 / Endterm
Date: Monday 31st July, 2023
Examiner: Prof. Dr. Stephan Günnemann
Time: 13:30 – 14:30
Working instructions

• This exam consists of 12 pages with a total of 4 problems. Please make sure now that you received a complete copy of the exam.
• The total amount of achievable credits in this exam is 28 credits.
• Detaching pages from the exam is prohibited.
• Allowed resources: one A4 sheet of handwritten notes, two sides.
• No other material (e.g. books, cell phones, calculators) is allowed!
• Physically turn off all electronic devices, put them into your bag and close the bag.
• There is scratch paper at the end of the exam (after problem 4).
• Write your answers only in the provided solution boxes or the scratch paper.
• If you solve a task on the scratch paper, clearly reference it in the main solution box.
• All sheets (including scratch paper) have to be returned at the end.
• Only use a black or a blue pen (no pencils, red or green pens!)
• For problems that say “Justify your answer” you only get points if you provide a valid explanation.
• For problems that say “Derive” you only get points if you provide a valid mathematical derivation.
• For problems that say “Prove” you only get points if you provide a valid mathematical proof.
• If a problem does not say “Justify your answer”, “Derive” or “Prove”, it is sufficient to only provide the correct answer.

Left room from to / Early submission at

– Page 1 / 12 –
Problem 1 Normalizing flows (5 credits)

This task focuses on the reverse parametrization for normalizing flows on $\mathbb{R}^d$.

a) Let $x \in \mathbb{R}^2$ and define the transformation as follows:

$$A = \frac{1}{2} a^\top a, \qquad z = \sigma(A x),$$

where $a \in \mathbb{R}^{1 \times 2}$ and $\sigma$ is the element-wise sigmoid activation.

Please state whether this transformation leads to a valid normalizing flow. Justify your answer accordingly.

No, this transformation is not invertible: $A = \frac{1}{2} a^\top a$ is an outer product and therefore has rank one, so $\det A = 0$ and the linear map $x \mapsto Ax$ is non-invertible.
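The rank argument can be checked numerically; a minimal sketch (the entries of $a$ are arbitrary example values):

```python
import numpy as np

a = np.array([[2.0, -3.0]])      # a in R^{1x2}; example values are arbitrary
A = 0.5 * a.T @ a                # A = (1/2) a^T a, a 2x2 outer product

rank = np.linalg.matrix_rank(A)  # outer products have rank 1
det = np.linalg.det(A)           # so the determinant is 0 and A is singular
print(rank, det)
```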

b) Now, let $x \in \mathbb{R}^3$ and define the transformation as follows:

$$z_1 = \frac{1}{2}(x_3 + x_2)^3, \qquad z_2 = x_1^4 x_2 + x_3, \qquad z_3 = e^{x_3}.$$

Please state whether this transformation leads to a valid normalizing flow. Justify your answer accordingly.

No, this is not a valid transformation. To disprove bijectivity, we can give a counterexample: since $x_1$ enters only through $x_1^4$, replacing $x_1$ with $-x_1$ in any assignment of $x$ yields a different input that maps to the same $z$.
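The counterexample is easy to verify numerically; a minimal sketch with arbitrary input values:

```python
import numpy as np

def transform(x1, x2, x3):
    # The transformation from part b), written out directly.
    z1 = 0.5 * (x3 + x2) ** 3
    z2 = x1 ** 4 * x2 + x3
    z3 = np.exp(x3)
    return np.array([z1, z2, z3])

# Two distinct inputs that differ only in the sign of x1 ...
z_a = transform(1.5, -0.7, 0.3)
z_b = transform(-1.5, -0.7, 0.3)

# ... map to the same output, so the map is not injective.
print(np.allclose(z_a, z_b))  # True
```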



c) Lastly, assume you are given a transformation $f : \mathbb{R} \to \mathbb{R}$ whose inverse has a Jacobian determinant equal to 1. How does this affect the normalizing flow? Please use the change of variables formula and a possible parametrization of $f^{-1}$ to explain.

A flow whose Jacobian determinant is equal to 1 is volume-preserving: the change of variables formula reduces to $p_2(x) = p_1(f^{-1}(x)) \cdot 1$. Such a flow cannot reshape the base density, so it is not expressive enough to model distributions whose shape differs from $p_1(z)$.
A trivial example is $f^{-1}(x) = x + b$, where $b \in \mathbb{R}$, which includes the identity map and any translation.
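A quick numerical illustration of the volume-preserving case, using the translation parametrization above with a standard-normal base density (the shift value is arbitrary):

```python
import numpy as np

def p1(z):
    # Standard normal base density p1(z).
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

b = 1.7                      # arbitrary translation parameter
f_inv = lambda x: x + b      # |d f_inv / dx| = 1, i.e. volume preserving

# Change of variables: p2(x) = p1(f_inv(x)) * |det J| = p1(x + b) * 1.
xs = np.linspace(-5.0, 5.0, 1001)
p2 = p1(f_inv(xs))

mode = xs[np.argmax(p2)]             # the mode moved from 0 to -b ...
mass = (p2 * (xs[1] - xs[0])).sum()  # ... but the mass and shape are unchanged
print(mode, mass)
```

The transformed density is just a shifted copy of the base density: same peak height, same total mass.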


Problem 2 Variational Inference & Variational Autoencoder (9 credits)

We want to draw samples from a log-normal distribution $\log\mathcal{N}(\mu, \sigma^2)$, where $\mu, \sigma \in \mathbb{R}$, with reparametrization. The probability density function of the log-normal distribution is defined as:

$$q_{\mu,\sigma^2}(z) = \begin{cases} \frac{1}{z\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln z - \mu)^2}{2\sigma^2}\right) & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$

Its cumulative density function is given as:

$$Q_{\mu,\sigma^2}(a) = \Pr(z \leq a) = \int_{-\infty}^{a} q_{\mu,\sigma^2}(z)\,dz = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln a - \mu}{\sigma\sqrt{2}}\right)\right]$$

Recall that the error function $\operatorname{erf}(z)$ is an invertible function that is defined as $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z \exp(-t^2)\,dt$.

a) Suppose you have access to an algorithm that produces samples $\epsilon$ from a standard normal distribution $\mathcal{N}(0,1)$. Find a deterministic transformation $T : \mathbb{R} \to \mathbb{R}_{>0}$ that transforms a sample $\epsilon \sim \mathcal{N}(0,1)$ into a sample from the log-normal distribution $\log\mathcal{N}(0,1)$.

Hint: The cumulative density function of a normal distribution $\mathcal{N}(\mu, \sigma^2)$ is given as:

$$F_{\mu,\sigma^2}(a) = \Pr(z \leq a) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{a - \mu}{\sigma\sqrt{2}}\right)\right]$$

We want to find $T$ such that $\Pr(T(\epsilon) \leq a) = Q_{0,1}(a)$:

$$\Pr(T(\epsilon) \leq a) = \Pr(\epsilon \leq T^{-1}(a)) = F_{0,1}(T^{-1}(a)) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{T^{-1}(a)}{\sqrt{2}}\right)\right] \overset{!}{=} Q_{0,1}(a) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln a}{\sqrt{2}}\right)\right]$$

Since $\operatorname{erf}(z)$ is invertible, we just have to match the arguments of the error function:

$$T^{-1}(a) = \ln(a) \quad \Rightarrow \quad T(\epsilon) = \exp(\epsilon)$$
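A quick Monte-Carlo sanity check of $T(\epsilon) = \exp(\epsilon)$ (sample size and tolerances are arbitrary): the mean of $\log\mathcal{N}(0,1)$ is $e^{1/2}$ and its median is $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)   # eps ~ N(0, 1)
z = np.exp(eps)                      # T(eps) = exp(eps) ~ logN(0, 1)

# Known statistics of logN(0, 1): mean = e^{1/2} ≈ 1.6487, median = 1.
print(z.mean())
print(np.median(z))
```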



b) Now suppose you have access to an algorithm that produces samples $z$ from a log-normal distribution $\log\mathcal{N}(0,1)$. Find a deterministic transformation $M_{\mu,\sigma^2} : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ that transforms a sample $z \sim \log\mathcal{N}(0,1)$ into a sample from the log-normal distribution $\log\mathcal{N}(\mu, \sigma^2)$.

We want to find $M_{\mu,\sigma^2}$ such that $\Pr(M_{\mu,\sigma^2}(z) \leq a) = Q_{\mu,\sigma^2}(a)$:

$$\Pr(M_{\mu,\sigma^2}(z) \leq a) = \Pr(z \leq M_{\mu,\sigma^2}^{-1}(a)) = Q_{0,1}(M_{\mu,\sigma^2}^{-1}(a)) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln M_{\mu,\sigma^2}^{-1}(a)}{\sqrt{2}}\right)\right] \overset{!}{=} Q_{\mu,\sigma^2}(a) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln a - \mu}{\sigma\sqrt{2}}\right)\right]$$

Similar to before, we match the arguments of the error function:

$$\ln M_{\mu,\sigma^2}^{-1}(a) = \frac{\ln a - \mu}{\sigma} \;\Leftrightarrow\; \sigma \ln M_{\mu,\sigma^2}^{-1}(a) + \mu = \ln a \;\Leftrightarrow\; M_{\mu,\sigma^2}^{-1}(a)^{\sigma} \exp(\mu) = a \;\Rightarrow\; M_{\mu,\sigma^2}(z) = z^{\sigma} \exp(\mu)$$
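The result $M_{\mu,\sigma^2}(z) = z^\sigma \exp(\mu)$ can be sanity-checked by noting that $\ln M_{\mu,\sigma^2}(z) = \sigma \ln z + \mu \sim \mathcal{N}(\mu, \sigma^2)$; a minimal sketch with arbitrary $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.5                       # arbitrary target parameters

z = np.exp(rng.standard_normal(200_000))   # z ~ logN(0, 1)
z_new = z ** sigma * np.exp(mu)            # M(z) = z^sigma * exp(mu)

# ln(z_new) = sigma * ln(z) + mu should be N(mu, sigma^2).
log_z = np.log(z_new)
print(log_z.mean(), log_z.std())
```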

c) Now suppose you have access to an algorithm that produces samples $\epsilon$ from a standard normal distribution $\mathcal{N}(0,1)$. Find a deterministic transformation $C_{\mu,\sigma^2} : \mathbb{R} \to \mathbb{R}_{>0}$ that transforms a sample $\epsilon \sim \mathcal{N}(0,1)$ into a sample from the log-normal distribution $\log\mathcal{N}(\mu, \sigma^2)$.

Hint: Use the results from the previous subproblems.

We simply compose the transformation $T$, which provides samples from $\log\mathcal{N}(0,1)$, with $M_{\mu,\sigma^2}$, which transforms them into samples from $\log\mathcal{N}(\mu, \sigma^2)$:

$$C_{\mu,\sigma^2}(\epsilon) = (M_{\mu,\sigma^2} \circ T)(\epsilon) = M_{\mu,\sigma^2}(T(\epsilon)) = \exp(\epsilon)^{\sigma} \exp(\mu) = \exp(\sigma\epsilon + \mu)$$
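The closed form can be checked against the explicit composition; a minimal sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.5                     # arbitrary parameters

T = lambda e: np.exp(e)                  # part a): N(0,1) -> logN(0,1)
M = lambda z: z ** sigma * np.exp(mu)    # part b): logN(0,1) -> logN(mu, sigma^2)
C = lambda e: np.exp(sigma * e + mu)     # part c): the composed closed form

eps = rng.standard_normal(1_000)
# The composition M(T(eps)) and the closed form agree elementwise.
print(np.allclose(M(T(eps)), C(eps)))  # True
```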


d) We want to model the distribution of data samples $p(x)$ using a Variational Autoencoder. Recall that this assumes a latent variable structure $p(x, z) = p(x|z)p(z)$ and we need to model the distribution $p_\theta(x|z)$ and the variational distribution $q_\phi(z)$ respectively. We learn the parameters of our model by optimizing the ELBO:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_\phi(z)}\left[\log p_\theta(x|z)\right] - \operatorname{KL}\left(q_\phi(z) \,\|\, p(z)\right)$$

Here, KL is the Kullback-Leibler divergence $\operatorname{KL}(p \| q) = \int p(z) \log \frac{p(z)}{q(z)}\,dz$. For simplicity, assume that the latent variable $z$ is scalar.

Instead of assuming a standard normal prior $p(z) = \mathcal{N}(0,1)$ on the latent variable $z$, we want to employ a log-normal prior $p(z) = \log\mathcal{N}(0,1)$. Argue why parametrizing $q_\phi(z)$ as a normal distribution $\mathcal{N}(\mu, \sigma^2)$ is not a practical idea. Furthermore, propose an alternative suitable parametrization and briefly outline how we can backpropagate through sampling from $q_\phi(z)$.

Hint: You may refer to the procedure of c), even if you could not derive it.

The KL-divergence $\operatorname{KL}(q_\phi(z) \,\|\, p(z))$ is defined as:

$$\operatorname{KL}\left(q_\phi(z) \,\|\, p(z)\right) = \int_{\mathbb{R}} q_\phi(z) \log \frac{q_\phi(z)}{p(z)}\,dz \qquad (2.1)$$

Since $p(z)$ is zero for $z \leq 0$ while $q_\phi(z) > 0$ there, the KL-divergence term diverges to $\infty$, which prevents gradient-based optimization.
If we instead parametrize $q_\phi(z) = \log\mathcal{N}(\mu, \sigma^2)$, both arguments of the KL-divergence have the same support, ensuring finite values. By employing the reparametrization scheme of c), $z = \exp(\sigma\epsilon + \mu)$ with $\epsilon \sim \mathcal{N}(0,1)$, we can backpropagate through sampling from $q_\phi(z)$.
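A finite-difference sketch of why the reparametrization makes the sample differentiable in $(\mu, \sigma)$: with $z = \exp(\sigma\epsilon + \mu)$, we have $\partial z / \partial\mu = z$ and $\partial z / \partial\sigma = \epsilon z$ (the fixed values below are arbitrary):

```python
import numpy as np

mu, sigma, eps = 0.3, 0.7, -1.2     # arbitrary fixed values
h = 1e-6                            # finite-difference step

z = np.exp(sigma * eps + mu)

# Central differences of z with respect to mu and sigma.
dz_dmu = (np.exp(sigma * eps + (mu + h)) - np.exp(sigma * eps + (mu - h))) / (2 * h)
dz_dsigma = (np.exp((sigma + h) * eps + mu) - np.exp((sigma - h) * eps + mu)) / (2 * h)

# Analytic gradients: dz/dmu = z, dz/dsigma = eps * z.
print(dz_dmu, z)
print(dz_dsigma, eps * z)
```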

Problem 3 Generative Adversarial Networks (8 credits)

For $\pi = \frac{1}{2}$, GANs are trained by optimizing the model parameters $\theta$ according to

$$\min_\theta \max_\phi \underbrace{\frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log D_\phi(x)\right]}_{E_1} + \underbrace{\frac{1}{2}\, \mathbb{E}_{p(z)}\left[\log(1 - D_\phi(f_\theta(z)))\right]}_{E_2}.$$

a) Based on this training objective, explain in one sentence each

• the meaning of the first expected value $E_1$,
• the meaning of the second expected value $E_2$,
• and what is adversarial about this formulation.

• $E_1$: Rewards the discriminator for recognizing samples from the data distribution.
• $E_2$: Rewards the discriminator for rejecting samples from the generated distribution, and the generator for fooling the discriminator.
• Discriminator and generator are adversaries because they optimize the same objective in opposite directions.
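A minimal numerical sketch of the two expectations, with the discriminator replaced by a placeholder score function (all names, distributions, and values here are illustrative assumptions, not the exam's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x):
    # Placeholder discriminator: sigmoid of a fixed linear score (illustrative only).
    return 1.0 / (1.0 + np.exp(-2.0 * x))

x_real = rng.normal(1.0, 1.0, size=10_000)    # stand-in samples from p*(x)
x_fake = rng.normal(-1.0, 1.0, size=10_000)   # stand-in samples from p_theta(x)

# E1 rewards recognizing real data; E2 rewards rejecting generated data.
E1 = 0.5 * np.mean(np.log(discriminator(x_real)))
E2 = 0.5 * np.mean(np.log(1.0 - discriminator(x_fake)))
print(E1 + E2)   # the discriminator maximizes this; the generator minimizes it
```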

b) Show that the loss

$$L = \max_\phi \frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log D_\phi(x)\right] + \frac{1}{2}\, \mathbb{E}_{p(z)}\left[\log(1 - D_\phi(f_\theta(z)))\right]$$

from the GAN objective is equivalent to the Jensen-Shannon divergence (JSD) between the data distribution $p^*$ and the learned, generated distribution $p_\theta$, i.e.

$$L = \operatorname{JSD}(p^*, p_\theta) + c$$

for some constant $c \in \mathbb{R}$ that does not depend on $p^*$ or $\theta$. The JSD between two probability densities $p$ and $q$ is defined as

$$\operatorname{JSD}(p, q) = \frac{1}{2}\left[\operatorname{KL}\left(p \,\|\, \tfrac{1}{2}(p+q)\right) + \operatorname{KL}\left(q \,\|\, \tfrac{1}{2}(p+q)\right)\right]$$

where KL is the Kullback-Leibler (KL) divergence $\operatorname{KL}(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{p}{q}\right]$.

Hint: Remember the general form of the optimal discriminator.
Hint: For GANs, it holds for functions $h$ that $\mathbb{E}_{p(z)}[h(f_\theta(z))] = \mathbb{E}_{p_\theta(x)}[h(x)]$.

The optimal discriminator is given by

$$D_\phi^*(x) = \frac{p^*(x)}{p^*(x) + p_\theta(x)}.$$

$$\begin{aligned}
L &= \max_\phi \frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log D_\phi(x)\right] + \frac{1}{2}\, \mathbb{E}_{p(z)}\left[\log(1 - D_\phi(f_\theta(z)))\right] &(3.1)\\
&= \max_\phi \frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log D_\phi(x)\right] + \frac{1}{2}\, \mathbb{E}_{p_\theta(x)}\left[\log(1 - D_\phi(x))\right] &(3.2)
\end{aligned}$$

Now plug in the optimal discriminator:

$$\begin{aligned}
&= \frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log \frac{p^*(x)}{p^*(x) + p_\theta(x)}\right] + \frac{1}{2}\, \mathbb{E}_{p_\theta(x)}\left[\log\left(1 - \frac{p^*(x)}{p^*(x) + p_\theta(x)}\right)\right] &(3.3)\\
&= \frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log \frac{p^*(x)}{p^*(x) + p_\theta(x)}\right] + \frac{1}{2}\, \mathbb{E}_{p_\theta(x)}\left[\log \frac{p_\theta(x)}{p^*(x) + p_\theta(x)}\right] &(3.4)\\
&= \frac{1}{2}\, \mathbb{E}_{p^*(x)}\left[\log \frac{p^*(x)}{\frac{p^*(x) + p_\theta(x)}{2}}\right] + \frac{1}{2}\, \mathbb{E}_{p_\theta(x)}\left[\log \frac{p_\theta(x)}{\frac{p^*(x) + p_\theta(x)}{2}}\right] - \log(2) &(3.5)\\
&= \frac{1}{2}\, \operatorname{KL}\left(p^* \,\Big\|\, \frac{p^* + p_\theta}{2}\right) + \frac{1}{2}\, \operatorname{KL}\left(p_\theta \,\Big\|\, \frac{p^* + p_\theta}{2}\right) - \log(2) &(3.6)\\
&= \operatorname{JSD}(p^*, p_\theta) - \log(2) &(3.7)
\end{aligned}$$
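The identity $L = \operatorname{JSD}(p^*, p_\theta) - \log 2$ can be verified numerically for discrete distributions, where the expectations become sums (the two example distributions are arbitrary):

```python
import numpy as np

p_star = np.array([0.2, 0.5, 0.3])   # arbitrary "data" distribution
p_theta = np.array([0.4, 0.4, 0.2])  # arbitrary "generator" distribution

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

m = 0.5 * (p_star + p_theta)
jsd = 0.5 * (kl(p_star, m) + kl(p_theta, m))

# Loss with the optimal discriminator D*(x) = p*(x) / (p*(x) + p_theta(x)).
d_opt = p_star / (p_star + p_theta)
L = 0.5 * np.sum(p_star * np.log(d_opt)) + 0.5 * np.sum(p_theta * np.log(1.0 - d_opt))

print(np.isclose(L, jsd - np.log(2)))  # True
```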


Sa

Problem 4 Denoising Diffusion (6 credits)

Consider a denoising diffusion model with $N$ diffusion steps and the usual forward parametrization $q_{\phi(x_0)}$ and reverse process $p_\theta$.

$$\alpha_n = 1 - \beta_n \qquad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i \qquad \tilde{\beta}_n = \frac{1 - \bar{\alpha}_{n-1}}{1 - \bar{\alpha}_n}\, \beta_n$$

$$q_{\phi(x_0)}(z_n) = \mathcal{N}\left(\sqrt{\bar{\alpha}_n}\, x_0,\ (1 - \bar{\alpha}_n)\, I\right)$$

$$q_{\phi(x_0)}(z_{n-1} \mid z_n) = \mathcal{N}\left(\frac{\sqrt{\alpha_n}\,(1 - \bar{\alpha}_{n-1})}{1 - \bar{\alpha}_n}\, z_n + \frac{\sqrt{\bar{\alpha}_{n-1}}\, \beta_n}{1 - \bar{\alpha}_n}\, x_0,\ \tilde{\beta}_n I\right)$$

$$x_0 = \frac{z_n - \sqrt{1 - \bar{\alpha}_n}\, \epsilon_\theta(z_n, n)}{\sqrt{\bar{\alpha}_n}}$$

a) Why do we optimize the ELBO instead of the data log-likelihood?

The data log-likelihood $\log p(x)$ requires us to marginalize out the latent variables $z_1, \ldots, z_N$, which is intractable.
b) Why does model training fail if $\beta_n > 1$?

If $\beta_n > 1$, then $\alpha_n = 1 - \beta_n < 0$. With at least one negative factor, the product $\bar{\alpha}_n$ becomes negative for some $n$, and we would have to take the square root of a negative number during training when sampling $z_n \sim q_{\phi(x_0)}(z_n)$.
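A tiny sketch of what goes wrong, with an arbitrary schedule in which one $\beta_n$ exceeds 1:

```python
import numpy as np

betas = np.array([0.1, 0.2, 1.5, 0.3])   # arbitrary schedule; beta_3 > 1
alphas = 1.0 - betas                      # the third entry is negative
alpha_bar = np.cumprod(alphas)            # running product turns negative

# sqrt(alpha_bar) is needed to sample z_n ~ q(z_n); it is NaN once alpha_bar < 0.
with np.errstate(invalid="ignore"):
    scale = np.sqrt(alpha_bar)

print(alpha_bar)   # negative from the third step on
print(scale)       # NaN where alpha_bar < 0
```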

c) Why does model training fail if $\beta_n = 1$?

If $\beta_n = 1$, we get $\alpha_n = 1 - \beta_n = 0$ and therefore $\bar{\alpha}_{n'} = 0$ for all $n' \geq n$. In training this would mean that all information from $x_0$ is lost from the $n$-th step on. During sampling, we would also have to divide by $\sqrt{\bar{\alpha}_n} = 0$ when estimating $x_0$ from $z_n$.

d) Which of these beta schedules are invalid? Justify your answer.

1. $\beta_n = \sin\left(\frac{n}{N}\right)$

2. $\beta_n = 1 - \frac{1}{n}$

3. $\beta_n = \log_e\left(1 + \frac{n}{N}\right)$

4. $\beta_n = -\cos\left(\frac{\pi n}{N}\right)$

Schedule 2 is invalid because $\beta_1 = 1 - \frac{1}{1} = 0$, so the first step adds no noise and $1 - \bar{\alpha}_1 = 0$ causes a division by zero. Schedule 4 is invalid because $\beta_n \leq 0$ for $n \leq N/2$.

Additional space for solutions. Clearly mark the (sub)problem your answers are related to and strike out invalid solutions.
