Variational Autoencoder from scratch
Umar Jamil - https://github.com/hkproj/vae-from-scratch-notes
Video: https://youtu.be/iwEzwTTalbg
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/legalcode. Not for commercial use.
Reference: Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
What is an Autoencoder?
[Diagram: X (input) → Encoder → Z (the code) → Decoder → X’ (reconstructed input). Each input is mapped to a code vector such as [1.2, 3.65, …], [1.6, 6.00, …], [10.1, 9.0, …], [2.5, 7.0, …]; the values shown are random and have no meaning.]
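A minimal sketch of this architecture in PyTorch (the layer sizes and dimensions are illustrative choices, not taken from the slides):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=16):
        super().__init__()
        # Encoder: compress the input X into a small code Z
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        # Decoder: reconstruct X' from the code Z
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # the code
        x_hat = self.decoder(z)    # the reconstructed input
        return x_hat, z

model = AutoEncoder()
x = torch.randn(8, 784)            # a batch of 8 flattened inputs (illustrative data)
x_hat, z = model(x)
```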
Analogy with file compression
[Diagram: zebra.jpg (input) → ZIP → zebra.zip → UNZIP → zebra.jpg (reconstructed input).]
What makes a good Autoencoder?
• The code should be as small as possible, that is, the dimension of the Z vector should be as small as possible.
• The reconstructed input should be as close as possible to the original input (see the loss sketch below).
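As a sketch of how the second criterion is enforced in practice, a plain autoencoder is trained by penalizing the distance between X and X’, for example with an MSE loss. This continues the hypothetical model and x from the sketch above:

```python
import torch.nn.functional as F

x_hat, z = model(x)           # forward pass of the autoencoder sketched above
loss = F.mse_loss(x_hat, x)   # penalize the distance between reconstruction and input
loss.backward()               # gradients later used to update encoder and decoder weights
```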
What’s the problem with Autoencoders?
The code learned by the model makes no sense: the model can assign any vector to each input, without the numbers in the vector representing any pattern. The model doesn’t capture any semantic relationship between the data points.
[Diagram: X (input) → Encoder → Z (code) → Decoder → X’ (reconstructed input).]
Introducing the Variational Autoencoder
The variational autoencoder, instead of learning a code, learns a “latent space”. The latent space represents the parameters of a (multivariate) distribution.
[Diagram: X (input) → Encoder → latent space (Z) → Decoder → X’ (reconstructed input).]
Sampling the latent space
Just like using Python to generate a random number between 1 and 100 means sampling from a uniform (pseudo)random distribution between 1 and 100, in the same way we can sample from the latent space to obtain a random vector, give it to the decoder, and generate new data.
[Diagram: X (input) → Encoder → latent space → a sampled Z, e.g. [8.67, 12.8564, 0.44875, 874.22, …] → Decoder → X’ (reconstructed input).]
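A sketch of what generation looks like in code, assuming a trained VAE decoder (here a hypothetical decoder module) and an illustrative latent dimension of 16:

```python
import torch

z = torch.randn(1, 16)    # sample a random vector from the latent space, z ~ N(0, I)
x_new = decoder(z)        # a hypothetical trained decoder turns the sample into new data
```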
Why is it called latent space?
[Diagram: Z is the latent (hidden) variable, with distribution parameters 𝜇 and 𝜎²; X is the observable variable. Example vectors: [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …].]
Plato’s allegory of the cave
[Diagram: Plato’s allegory of the cave. The observable variable (example vectors [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]) is contrasted with the latent (hidden) variable that produces it.]
Pep talk
1. The VAE is the most important component of Stable Diffusion models. Concepts like the ELBO also come up in Stable Diffusion.
2. In 2023 you shouldn’t be memorizing things without understanding: ChatGPT can do that faster and better than any human being. You need to be human to compete with a machine; you can’t compete with a machine by acting like one.
3. You should try to learn how things work not only out of curiosity, but because that’s the true engine of innovation and creativity.
4. Math is fun.
Math Concepts
Expectation of a random variable: $E[x] = \int x\, f(x)\, dx$, where $f$ is the probability density of $x$.
Chain rule of probability: $P(x, y) = P(x \mid y)\, P(y)$
Bayes’ Theorem: $P(x \mid y) = \dfrac{P(y \mid x)\, P(x)}{P(y)}$
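A quick sanity check of the expectation formula by sampling (the distribution N(3, 2²) is an arbitrary illustrative choice):

```python
import torch

x = 3.0 + 2.0 * torch.randn(1_000_000)   # samples from N(3, 2^2)
print(x.mean())                           # ≈ 3, matching E[x] = ∫ x f(x) dx
```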
Kullback-Leibler Divergence
$D_{KL}\left(P \,\|\, Q\right) = \int p(x) \log \dfrac{p(x)}{q(x)}\, dx$
Properties:
• Not symmetric.
• Always ≥ 0.
• It is equal to 0 if and only if $P = Q$.
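A small numerical illustration of these properties, using the closed-form KL divergence between two 1-D Gaussians (the specific distributions are arbitrary examples):

```python
import torch

def kl_gauss(mu1, var1, mu2, var2):
    # Closed-form D_KL( N(mu1, var1) || N(mu2, var2) ) for 1-D Gaussians
    return 0.5 * (torch.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

p = (torch.tensor(0.0), torch.tensor(1.0))   # P = N(0, 1)
q = (torch.tensor(1.0), torch.tensor(2.0))   # Q = N(1, 2)
print(kl_gauss(*p, *q))   # ≈ 0.35: always >= 0
print(kl_gauss(*q, *p))   # ≈ 0.65: a different value, KL is not symmetric
print(kl_gauss(*p, *p))   # 0: equal to 0 if and only if P = Q
```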
Let’s define our model
We can define the likelihood of our data as the marginalization of the joint probability over the latent variable:

$p(\boldsymbol{x}) = \int p(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z}$

This is intractable, because we would need to evaluate the integral over all possible latent variables $\boldsymbol{z}$.

… or we can use the chain rule of probability:

$p(\boldsymbol{x}) = \dfrac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})}$

But we don’t have a ground truth $p(\boldsymbol{z} \mid \boldsymbol{x})$ … which is also what we’re trying to find!

An intractable problem is a problem that can be solved in theory (e.g. given large but finite resources, especially time), but for which in practice any solution takes too many resources to be useful.
[Diagram: graphical model with Z, the latent (hidden) variable with parameters 𝜇 and 𝜎², and X, the observable variable.]
A chicken and egg problem
$p(\boldsymbol{x}) = \dfrac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})}$ and $p(\boldsymbol{z} \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{x})}$

In order to have a tractable $p(\boldsymbol{x})$ we need a tractable $p(\boldsymbol{z} \mid \boldsymbol{x})$.
In order to have a tractable $p(\boldsymbol{z} \mid \boldsymbol{x})$ we need a tractable $p(\boldsymbol{x})$.
Can we find a surrogate?
$p_\theta(\boldsymbol{z} \mid \boldsymbol{x}) \approx q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})$

On the left is our true posterior (which we can’t evaluate due to its intractability), parametrized by $\theta$; on the right, an approximate posterior, parametrized by $\varphi$.
What can we infer?
$\log p_\theta(\boldsymbol{x}) = E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right] + D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_\theta(\boldsymbol{z} \mid \boldsymbol{x})\right)$

The first term is the ELBO (Evidence Lower Bound); the second term is a KL divergence and is therefore always ≥ 0.

Analogy: Total Compensation = Base Salary + Bonus. Since the Bonus is ≥ 0, Total Compensation ≥ Base Salary.

We can therefore deduce the following:

$\log p_\theta(\boldsymbol{x}) \geq E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$
ELBO in detail
$\log p_\theta(\boldsymbol{x}) \geq E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$

$= E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$ (chain rule of probability)

$= E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right] + E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p(\boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$ (split the expectation)

$= E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right] - D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_\theta(\boldsymbol{z})\right)$ (definition of KL divergence)

Maximizing the ELBO means:
1. Maximizing the first term: maximizing the reconstruction likelihood of the decoder.
2. Minimizing the second term: minimizing the distance between the learned distribution and the prior belief we have over the latent variable.

Analogy: Profit = Revenue - Costs, so maximizing profit means maximizing revenue while minimizing costs.

Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
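As a sketch of how these two terms show up in code: assuming a standard normal prior p(z) = N(0, I), a diagonal Gaussian q_φ, and an MSE reconstruction term as a stand-in for the decoder's log-likelihood, with x, x_hat, mu and log_var coming from a hypothetical VAE forward pass:

```python
import torch
import torch.nn.functional as F

recon = F.mse_loss(x_hat, x, reduction="sum")                   # stands in for -E[log p_theta(x|z)]
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # D_KL(q_phi(z|x) || N(0, I)) in closed form
loss = recon + kl                                               # minimizing this maximizes the ELBO
```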
Maximizing the ELBO: A little introduction to estimators
• When we have a function we want to maximize, we usually take the gradient and adjust the weights of the model so that they move along the gradient direction.
• When we have a function we want to minimize, we usually take the gradient, and adjust the weights of the model so that they move against the gradient direction.
Stochastic Gradient Descent
SGD is stochastic because we choose the minibatch randomly from our dataset and we then average the loss over the minibatch.
$L(\theta, \varphi; \boldsymbol{x}) = E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log \dfrac{p_\theta(\boldsymbol{x}, \boldsymbol{z})}{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\right] = E_{q_\varphi(\boldsymbol{z} \mid \boldsymbol{x})}\!\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right] - D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p_\theta(\boldsymbol{z})\right)$
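A minibatch training loop, as a minimal sketch: model, dataloader and vae_loss are hypothetical names standing for a VAE, a PyTorch DataLoader over the dataset, and the negative-ELBO loss shown earlier; the optimizer and learning rate are illustrative choices.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for x, _ in dataloader:                     # random minibatches drawn from the dataset
    x_hat, mu, log_var = model(x)
    loss = vae_loss(x, x_hat, mu, log_var)  # negative ELBO, averaged over the minibatch
    optimizer.zero_grad()
    loss.backward()                         # gradient of the loss...
    optimizer.step()                        # ...used to move the weights against it
```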
How to maximize the ELBO?
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Equation image: the SCORE estimator of the gradient of the ELBO.]
This estimator is unbiased, meaning that even if at every step it may not be equal to the true expectation, on average it converges to it. But since it is stochastic, it also has a variance, and that variance happens to be too high for practical use. Plus, we can’t run backpropagation through it!
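To make the variance issue concrete, here is a tiny toy illustration of a score-function estimator (not the VAE objective itself): we estimate the gradient of $E_{z \sim N(\mu, 1)}[z^2]$ with respect to $\mu$, whose true value is $2\mu$.

```python
import torch

mu = torch.tensor(1.5)
z = mu + torch.randn(1_000)     # samples z ~ N(mu, 1)
f = z ** 2                      # the function whose expectation we care about
score = z - mu                  # d/dmu of log N(z; mu, 1)
grad_est = (f * score).mean()   # score estimator: unbiased, but noisy
print(grad_est, 2 * mu)         # estimate ≈ 3.0, but it changes visibly from run to run
```

Re-running the snippet gives noticeably different estimates, which is the high variance mentioned above; the reparameterization trick introduced next gives a lower-variance, backpropagation-friendly alternative.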
We need a new estimator!
The reparameterization trick
[Diagram: the same graphical model (Z the latent (hidden) variable with parameters 𝜇 and 𝜎², X the observable variable, example vectors [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]), now with the randomness moved into a separate stochastic node 𝜖.]
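A minimal sketch of the trick in isolation: the randomness lives entirely in the external noise node 𝜖, so z becomes a deterministic, differentiable function of 𝜇 and 𝜎 (the numbers below are arbitrary):

```python
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)
sigma = torch.tensor([1.0, 2.0], requires_grad=True)
eps = torch.randn_like(mu)     # stochastic node: eps ~ N(0, I), nothing to learn here
z = mu + sigma * eps           # z ~ N(mu, sigma^2), expressed through eps
z.sum().backward()             # gradients now flow back through mu and sigma
print(mu.grad, sigma.grad)     # dz/dmu = [1, 1], dz/dsigma = eps
```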
Running backpropagation on the reparametrized model
[Diagram: gradients from the loss function flow back through 𝜇 and 𝜎², since the stochastic node 𝜖 sits outside the backpropagation path.]
Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
Example network
[Diagram: X (input) → Encoder → 𝜇 and log(𝜎²) (the latent space) → Z, sampled with 𝜖 ∼ N(0, I) → Decoder → X’ (reconstructed input).]
We prefer learning log(𝜎²) because it can be negative, so the model isn’t forced to produce only positive values for it. 𝜖 is sampled with PyTorch’s torch.randn_like() function, which returns standard normal noise with the same shape as its input tensor.
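Putting the example network together as a sketch (layer sizes are illustrative; only the structure follows the slide):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)   # unconstrained output, can be negative
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I), shaped like mu
        z = mu + torch.exp(0.5 * log_var) * eps     # sigma = exp(log(sigma^2) / 2)
        x_hat = self.dec(z)
        return x_hat, mu, log_var
```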
Show me the loss already!
[Diagram: X (input) → Encoder → latent space → Z, sampled with 𝜖 ∼ N(0, I) → Decoder → X’ (reconstructed input), annotated with the loss.]
MLP = Multi Layer Perceptron
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
How to derive the loss function?
https://stats.stackexchange.com/questions/318748/deriving-the-kl-divergence-loss-for-vaes
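For reference, the result that derivation arrives at, for a diagonal Gaussian $q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and a standard normal prior $p(\boldsymbol{z}) = \mathcal{N}(0, I)$, is:

$D_{KL}\!\left(q_\varphi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right) = \dfrac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)$

This is the KL term used in the loss sketch shown earlier.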
Thanks for watching!
Don’t forget to subscribe for more amazing content on AI and Machine Learning!