Variational Autoencoder from scratch
Umar Jamil - https://github.com/hkproj/vae-from-scratch-notes
Video: https://youtu.be/iwEzwTTalbg
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/legalcode. Not for commercial use.
Reference: Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
What is an Autoencoder?
[Diagram: X (input) → Encoder → Z (the code) → Decoder → X’ (reconstructed input). Each input is mapped to a code vector such as [1.2, 3.65, …], [1.6, 6.00, …], [10.1, 9.0, …], [2.5, 7.0, …]; the values shown are random and have no meaning.]
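A minimal sketch of this architecture in PyTorch (the layer sizes and dimensions are illustrative choices, not taken from the slides):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=16):
        super().__init__()
        # Encoder: compress the input X into a small code Z
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        # Decoder: reconstruct X' from the code Z
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # the code
        x_hat = self.decoder(z)    # the reconstructed input
        return x_hat, z

model = AutoEncoder()
x = torch.randn(8, 784)            # a batch of 8 flattened inputs (illustrative data)
x_hat, z = model(x)
```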
Analogy with file compression
[Diagram: zebra.jpg (input) → ZIP → zebra.zip → UNZIP → zebra.jpg (reconstructed input).]
What makes a good Autoencoder?
• The code should be as small as possible, that is, the dimension of the Z vector should be as small as possible.
• The reconstructed input should be as close as possible to the original input (see the loss sketch below).
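As a sketch of how the second criterion is enforced in practice, a plain autoencoder is trained by penalizing the distance between X and X’, for example with an MSE loss. This continues the hypothetical model and x from the sketch above:

```python
import torch.nn.functional as F

x_hat, z = model(x)           # forward pass of the autoencoder sketched above
loss = F.mse_loss(x_hat, x)   # penalize the distance between reconstruction and input
loss.backward()               # gradients later used to update encoder and decoder weights
```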
What’s the problem with Autoencoders?
The code learned by the model makes no sense: the model can assign any vector to each input, without the numbers in the vector representing any pattern. The model doesn’t capture any semantic relationship between the data points.
[Diagram: X (input) → Encoder → Z (code) → Decoder → X’ (reconstructed input).]
Introducing the Variational Autoencoder
The variational autoencoder, instead of learning a code, learns a “latent space”. The latent space represents the parameters of a (multivariate) distribution.
[Diagram: X (input) → Encoder → latent space (Z) → Decoder → X’ (reconstructed input).]
Sampling the latent space
Just like using Python to generate a random number between 1 and 100 means sampling from a uniform (pseudo)random distribution between 1 and 100, in the same way we can sample from the latent space to obtain a random vector, give it to the decoder, and generate new data.
[Diagram: X (input) → Encoder → latent space → a sampled Z, e.g. [8.67, 12.8564, 0.44875, 874.22, …] → Decoder → X’ (reconstructed input).]
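A sketch of what generation looks like in code, assuming a trained VAE decoder (here a hypothetical decoder module) and an illustrative latent dimension of 16:

```python
import torch

z = torch.randn(1, 16)    # sample a random vector from the latent space, z ~ N(0, I)
x_new = decoder(z)        # a hypothetical trained decoder turns the sample into new data
```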
Why is it called latent space?
[Diagram: Z is the latent (hidden) variable, with distribution parameters 𝜇 and 𝜎²; X is the observable variable. Example vectors: [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …].]
Plato’s allegory of the cave
[Diagram: Plato’s allegory of the cave. The observable variable (example vectors [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]) is contrasted with the latent (hidden) variable that produces it.]
Pep talk
1. The VAE is the most important component of Stable Diffusion models. Concepts like the ELBO also come up in Stable Diffusion.
2. In 2023 you shouldn’t be memorizing things without understanding: ChatGPT can do that faster and better than any human being. You need to be human to compete with a machine; you can’t compete with a machine by acting like one.
3. You should try to learn how things work not only out of curiosity, but because that’s the true engine of innovation and creativity.
4. Math is fun.
Math Concepts
Expectation of a random variable: $E[x] = \int x\, f(x)\, dx$, where $f$ is the probability density of $x$.
Chain rule of probability: $P(x, y) = P(x \mid y)\, P(y)$
Bayes’ Theorem: $P(x \mid y) = \dfrac{P(y \mid x)\, P(x)}{P(y)}$
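A quick sanity check of the expectation formula by sampling (the distribution N(3, 2²) is an arbitrary illustrative choice):

```python
import torch

x = 3.0 + 2.0 * torch.randn(1_000_000)   # samples from N(3, 2^2)
print(x.mean())                           # ≈ 3, matching E[x] = ∫ x f(x) dx
```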
Kullback-Leibler Divergence
$D_{KL}\left(P \,\|\, Q\right) = \int p(x) \log \dfrac{p(x)}{q(x)}\, dx$
Properties:
• Not symmetric.
• Always ≥ 0.
• It is equal to 0 if and only if $P = Q$.
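A small numerical illustration of these properties, using the closed-form KL divergence between two 1-D Gaussians (the specific distributions are arbitrary examples):

```python
import torch

def kl_gauss(mu1, var1, mu2, var2):
    # Closed-form D_KL( N(mu1, var1) || N(mu2, var2) ) for 1-D Gaussians
    return 0.5 * (torch.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

p = (torch.tensor(0.0), torch.tensor(1.0))   # P = N(0, 1)
q = (torch.tensor(1.0), torch.tensor(2.0))   # Q = N(1, 2)
print(kl_gauss(*p, *q))   # ≈ 0.35: always >= 0
print(kl_gauss(*q, *p))   # ≈ 0.65: a different value, KL is not symmetric
print(kl_gauss(*p, *p))   # 0: equal to 0 if and only if P = Q
```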
Let’s define our model
We can define the likelihood of our data as the marginalization of the joint probability over the latent variable:

$p(\boldsymbol{x}) = \int p(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z}$

This is intractable, because we would need to evaluate the integral over all possible latent variables $\boldsymbol{z}$.

… or we can use the chain rule of probability:

$p(\boldsymbol{x}) = \dfrac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})}$

But we don’t have a ground truth $p(\boldsymbol{z} \mid \boldsymbol{x})$ … which is also what we’re trying to find!

An intractable problem is a problem that can be solved in theory (e.g. given large but finite resources, especially time), but for which in practice any solution takes too many resources to be useful.
[Diagram: graphical model with Z, the latent (hidden) variable with parameters 𝜇 and 𝜎², and X, the observable variable.]
A chicken and egg problem
$p(\boldsymbol{x}) = \dfrac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})}$ and $p(\boldsymbol{z} \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{x})}$

In order to have a tractable $p(\boldsymbol{x})$ we need a tractable $p(\boldsymbol{z} \mid \boldsymbol{x})$.
In order to have a tractable $p(\boldsymbol{z} \mid \boldsymbol{x})$ we need a tractable $p(\boldsymbol{x})$.
Can we find a surrogate?
$p_\theta(\boldsymbol{z} \mid \boldsymbol{x}) \approx q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})$

On the left is our true posterior (which we can’t evaluate due to its intractability), parametrized by $\theta$; on the right, an approximate posterior, parametrized by $\varphi$.
What can we infer?
$\log p_\theta(\boldsymbol{x}) = E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right] + D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_\theta(\boldsymbol{z} \mid \boldsymbol{x})\right)$

The first term is the ELBO (Evidence Lower Bound); the second term is a KL divergence and is therefore always ≥ 0.

Analogy: Total Compensation = Base Salary + Bonus. Since the Bonus is ≥ 0, Total Compensation ≥ Base Salary.

We can therefore deduce the following:

$\log p_\theta(\boldsymbol{x}) \geq E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$
ELBO in detail
$\log p_\theta(\boldsymbol{x}) \geq E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$

$= E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$ (chain rule of probability)

$= E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right] + E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p(\boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$ (split the expectation)

$= E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right] - D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_\theta(\boldsymbol{z})\right)$ (definition of KL divergence)

Maximizing the ELBO means:
1. Maximizing the first term: maximizing the reconstruction likelihood of the decoder.
2. Minimizing the second term: minimizing the distance between the learned distribution and the prior belief we have over the latent variable.

Analogy: Profit = Revenue - Costs, so maximizing profit means maximizing revenue while minimizing costs.

Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
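As a sketch of how these two terms show up in code: assuming a standard normal prior p(z) = N(0, I), a diagonal Gaussian q_φ, and an MSE reconstruction term as a stand-in for the decoder's log-likelihood, with x, x_hat, mu and log_var coming from a hypothetical VAE forward pass:

```python
import torch
import torch.nn.functional as F

recon = F.mse_loss(x_hat, x, reduction="sum")                   # stands in for -E[log p_theta(x|z)]
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # D_KL(q_phi(z|x) || N(0, I)) in closed form
loss = recon + kl                                               # minimizing this maximizes the ELBO
```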
Maximizing the ELBO: A little introduction to estimators
• When we have a function we want to maximize, we usually take the gradient and adjust the weights of the model so that they move along the gradient direction.
• When we have a function we want to minimize, we usually take the gradient, and adjust the weights of the model so that they move against the gradient direction.
Stochastic Gradient Descent
SGD is stochastic because we choose the minibatch randomly from our dataset and we then average the loss over the minibatch.
$L(\theta, \varphi; \boldsymbol{x}) = E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right] = E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right] - D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_\theta(\boldsymbol{z})\right)$
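A minibatch training loop, as a minimal sketch: model, dataloader and vae_loss are hypothetical names standing for a VAE, a PyTorch DataLoader over the dataset, and the negative-ELBO loss shown earlier; the optimizer and learning rate are illustrative choices.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for x, _ in dataloader:                     # random minibatches drawn from the dataset
    x_hat, mu, log_var = model(x)
    loss = vae_loss(x, x_hat, mu, log_var)  # negative ELBO, averaged over the minibatch
    optimizer.zero_grad()
    loss.backward()                         # gradient of the loss...
    optimizer.step()                        # ...used to move the weights against it
```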
How to maximize the ELBO?
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Equation image: the SCORE estimator of the gradient of the ELBO.]
This estimator is unbiased, meaning that even if at every step it may not be equal to the true expectation, on average it converges to it. But since it is stochastic, it also has a variance, and that variance happens to be too high for practical use. Plus, we can’t run backpropagation through it!
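To make the variance issue concrete, here is a tiny toy illustration of a score-function estimator (not the VAE objective itself): we estimate the gradient of $E_{z \sim N(\mu, 1)}[z^2]$ with respect to $\mu$, whose true value is $2\mu$.

```python
import torch

mu = torch.tensor(1.5)
z = mu + torch.randn(1_000)     # samples z ~ N(mu, 1)
f = z ** 2                      # the function whose expectation we care about
score = z - mu                  # d/dmu of log N(z; mu, 1)
grad_est = (f * score).mean()   # score estimator: unbiased, but noisy
print(grad_est, 2 * mu)         # estimate ≈ 3.0, but it changes visibly from run to run
```

Re-running the snippet gives noticeably different estimates, which is the high variance mentioned above; the reparameterization trick introduced next gives a lower-variance, backpropagation-friendly alternative.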
We need a new estimator!
The reparameterization trick
[Diagram: the same graphical model (Z the latent (hidden) variable with parameters 𝜇 and 𝜎², X the observable variable, example vectors [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]), now with the randomness moved into a separate stochastic node 𝜖.]
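A minimal sketch of the trick in isolation: the randomness lives entirely in the external noise node 𝜖, so z becomes a deterministic, differentiable function of 𝜇 and 𝜎 (the numbers below are arbitrary):

```python
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)
sigma = torch.tensor([1.0, 2.0], requires_grad=True)
eps = torch.randn_like(mu)     # stochastic node: eps ~ N(0, I), nothing to learn here
z = mu + sigma * eps           # z ~ N(mu, sigma^2), expressed through eps
z.sum().backward()             # gradients now flow back through mu and sigma
print(mu.grad, sigma.grad)     # dz/dmu = [1, 1], dz/dsigma = eps
```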
Running backpropagation on the reparametrized model
[Diagram: gradients from the loss function flow back through 𝜇 and 𝜎², since the stochastic node 𝜖 sits outside the backpropagation path.]
Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
Example network
[Diagram: X (input) → Encoder → 𝜇 and log(𝜎²) (the latent space) → Z, sampled with 𝜖 ∼ N(0, I) → Decoder → X’ (reconstructed input).]
We prefer learning log(𝜎²) because it can be negative, so the model isn’t forced to produce only positive values for it. 𝜖 is sampled with PyTorch’s torch.randn_like() function, which returns standard normal noise with the same shape as its input tensor.
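Putting the example network together as a sketch (layer sizes are illustrative; only the structure follows the slide):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)   # unconstrained output, can be negative
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I), shaped like mu
        z = mu + torch.exp(0.5 * log_var) * eps     # sigma = exp(log(sigma^2) / 2)
        x_hat = self.dec(z)
        return x_hat, mu, log_var
```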
Show me the loss already!
[Diagram: X (input) → Encoder → latent space → Z, sampled with 𝜖 ∼ N(0, I) → Decoder → X’ (reconstructed input), annotated with the loss.]
MLP = Multi Layer Perceptron
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
How to derive the loss function?
https://stats.stackexchange.com/questions/318748/deriving-the-kl-divergence-loss-for-vaes
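For reference, the result that derivation arrives at, for a diagonal Gaussian $q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and a standard normal prior $p(\boldsymbol{z}) = \mathcal{N}(0, I)$, is:

$D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right) = \dfrac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)$

This is the KL term used in the loss sketch shown earlier.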
Thanks for watching!
Don’t forget to subscribe for more amazing content on AI and Machine Learning!