Umar Jamil - https://github.com/hkproj/vae-from-scratch-notes
Variational Autoencoder from scratch
Umar Jamil
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):
https://creativecommons.org/licenses/by-nc/4.0/legalcode
Video: https://youtu.be/iwEzwTTalbg
Not for commercial use
Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and
Trends® in Machine Learning, 12(4), pp.307-392.
What is an Autoencoder?
[Diagram] X (Input) → Encoder → Z (Code) → Decoder → X’ (Reconstructed Input).
Example codes: [1.2, 3.65, …], [1.6, 6.00, …], [10.1, 9.0, …], [2.5, 7.0, …]
* The values are random and have no meaning
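A minimal sketch of such an autoencoder in PyTorch (the layer sizes and the MSE reconstruction objective are my own assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, code_dim: int = 16):
        super().__init__()
        # Encoder: compresses the input X into a small code Z
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        # Decoder: reconstructs X' from the code Z
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.rand(32, 784)                   # a dummy minibatch
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction error to minimize
```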
Analogy with file compression
[Diagram] X (Input: zebra.jpg) → ZIP → zebra.zip → UNZIP → X’ (Reconstructed Input: zebra.jpg).
What makes a good Autoencoder?
• The code should be as small as possible, that is, the dimension of the Z vector should be as small as possible.
• The reconstructed input should be as close as possible to the original input.
What’s the problem with Autoencoders?
The code learned by the model makes no sense: the model can assign any vector to an input, without the numbers in that vector following any pattern. In other words, the model doesn’t capture any semantic relationship between the data.
[Diagram] X (Input) → Encoder → Z (Code) → Decoder → X’ (Reconstructed Input).
Introducing the Variational Autoencoder
The variational autoencoder, instead of learning a code, learns a “latent space”. The latent space represents the parameters of a (multivariate) distribution.
[Diagram] X (Input) → Encoder → Z (Latent Space) → Decoder → X’ (Reconstructed Input).
Sampling the latent space
[Diagram] X (Input) → Encoder → Latent Space → sampled Z, e.g. [8.67, 12.8564, 0.44875, 874.22, …] → Decoder → X’ (Reconstructed Input).
Just like when you use Python to generate a random number between 1 and 100, you’re sampling from a uniform (pseudo)random distribution between 1 and 100. In the same way, we can sample from the latent space to generate a random vector, give it to the decoder and generate new data.
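A sketch of this generation step, assuming a trained decoder and a standard-normal prior over the latent space (both assumptions on my part, consistent with the model introduced later):

```python
import torch
import torch.nn as nn

latent_dim = 20
# Stand-in decoder; in practice this would be the trained decoder of the VAE.
decoder = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(), nn.Linear(400, 784), nn.Sigmoid())

z = torch.randn(1, latent_dim)   # sample a random vector from the latent space
x_new = decoder(z)               # decode it into a new data point (e.g. an image)
```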
Why is it called latent space?
[Diagram] Graphical model with a latent (hidden) variable Z, described by parameters 𝜇 and 𝜎², and an observable variable X, e.g. vectors like [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]
Plato’s allegory of the cave
[Diagram: Plato’s cave] The observable variable corresponds to what we can observe (e.g. [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]); the latent (hidden) variable corresponds to what produces it.
Pep talk
1. The VAE is the most important component of Stable Diffusion models. Concepts like the ELBO also come up in Stable Diffusion.
2. In 2023 you shouldn’t be memorizing things without understanding; ChatGPT can do that faster and better than any human being. You need to be human to compete with a machine: you can’t compete with a machine by acting like one.
3. You should try to learn how things work not only out of curiosity, but because that’s the true engine of innovation and creativity.
4. Math is fun.
Math Concepts
Expectation of a random variable: $E[x] = \int x\,p(x)\,dx$; more generally, for a function of a random variable, $E_{x \sim p(x)}[f(x)] = \int f(x)\,p(x)\,dx$
Chain rule of probability: $P(x, y) = P(x \mid y)\,P(y)$
Bayes’ Theorem: $P(x \mid y) = \dfrac{P(y \mid x)\,P(x)}{P(y)}$
Kullback-Leibler Divergence
$D_{KL}(P \,\|\, Q) = \int p(x) \log \dfrac{p(x)}{q(x)}\,dx$
Properties:
• Not symmetric.
• Always ≥ 0
• It is equal to 0 if and only if 𝑃 = 𝑄
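A quick numeric check of these properties on two small discrete distributions (my own toy example; for discrete P and Q the integral becomes a sum):

```python
import torch

def kl(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x))
    return torch.sum(p * torch.log(p / q))

p = torch.tensor([0.1, 0.4, 0.5])
q = torch.tensor([0.3, 0.3, 0.4])

print(kl(p, q), kl(q, p))   # two different values, both >= 0: not symmetric
print(kl(p, p))             # 0: it is zero if and only if P = Q
```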
Let’s define our model
We can define the likelihood of our data as the marginalization of the joint probability over the latent variable:

$p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z})\,d\mathbf{z}$ — intractable, because we would need to evaluate this integral over all possible values of the latent variable Z.

… or we can use the chain rule of probability:

$p(\mathbf{x}) = \dfrac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}$ — but we don’t have a ground truth $p(\mathbf{z} \mid \mathbf{x})$… which is also what we’re trying to find!

Intractable problem: a problem that can be solved in theory (e.g. given large but finite resources, especially time), but for which in practice any solution takes too many resources to be useful.

[Diagram] Graphical model with latent variable Z (parameters 𝜇, 𝜎²) and observable variable X, as before.
A chicken and egg problem
$p(\mathbf{x}) = \dfrac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}$   $p(\mathbf{z} \mid \mathbf{x}) = \dfrac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})}$

In order to have a tractable $p(\mathbf{x})$ we need a tractable $p(\mathbf{z} \mid \mathbf{x})$.
In order to have a tractable $p(\mathbf{z} \mid \mathbf{x})$ we need a tractable $p(\mathbf{x})$.
Can we find a surrogate?
$p_\theta(\mathbf{z} \mid \mathbf{x}) \approx q_\varphi(\mathbf{z} \mid \mathbf{x})$

$p_\theta(\mathbf{z} \mid \mathbf{x})$: our true posterior (that we can’t evaluate due to its intractability), parametrized by 𝜃.
$q_\varphi(\mathbf{z} \mid \mathbf{x})$: an approximate posterior, parametrized by 𝜑.
Let’s do some maths…
$\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{x})$

$= \log p_\theta(\mathbf{x}) \int q_\varphi(\mathbf{z} \mid \mathbf{x})\,d\mathbf{z}$   (multiply by 1)

$= \int \log p_\theta(\mathbf{x})\, q_\varphi(\mathbf{z} \mid \mathbf{x})\,d\mathbf{z}$   (bring inside the integral)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]$   (definition of expectation)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right]$   (apply the equation $p_\theta(\mathbf{x}) = \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z} \mid \mathbf{x})}$)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})\, q_\varphi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})\, q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$   (multiply by 1)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] + E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{q_\varphi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right]$   (split the expectation)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] + D_{KL}\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right)$   (definition of KL divergence; the KL term is ≥ 0)
What can we infer?
$\log p_\theta(\mathbf{x}) = \underbrace{E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]}_{\text{ELBO}} + \underbrace{D_{KL}\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right)}_{\geq\, 0}$

Analogy: Total Compensation = Base Salary + Bonus, with Bonus ≥ 0, therefore Total Compensation ≥ Base Salary.

We can for sure deduce the following:

$\log p_\theta(\mathbf{x}) \geq E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$

ELBO = Evidence Lower Bound
ELBO in detail
$\log p_\theta(\mathbf{x}) \geq E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x} \mid \mathbf{z})\,p(\mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$   (chain rule of probability)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] + E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p(\mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$   (split the expectation)

$= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{KL}\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right)$   (definition of KL divergence)

Maximizing the ELBO means:
1. Maximizing the first term: maximizing the reconstruction likelihood of the decoder.
2. Minimizing the second term: minimizing the distance between the learned distribution and the prior belief we have over the latent variable.
(Analogy: Profit = Revenue − Costs.)

Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
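For reference, a standard result (from Kingma & Welling, not derived on this slide): when $q_\varphi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2 I)$ and the prior is $p(\mathbf{z}) = \mathcal{N}(0, I)$, the KL term has a closed form:

$D_{KL}\left(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2 I) \,\|\, \mathcal{N}(0, I)\right) = \dfrac{1}{2} \sum_{j=1}^{J} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)$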
Maximizing the ELBO: A little introduction to estimators
• When we have a function we want to maximize, we usually take the gradient and adjust the weights of the model so that they move along the gradient direction.
• When we have a function we want to minimize, we usually take the gradient, and adjust the weights of the model so that they move against the gradient direction.
Stochastic Gradient Descent: SGD is stochastic because we choose the minibatch randomly from our dataset and we then average the loss over the minibatch.
$L(\theta, \varphi, \mathbf{x}) = E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] = E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{KL}\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right)$
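A toy sketch of the gradient convention described above (my own example, not from the slides): maximizing an objective with a gradient-descent optimizer means minimizing its negative.

```python
import torch

theta = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

def objective(theta: torch.Tensor) -> torch.Tensor:
    # Stand-in for the ELBO of a minibatch: a concave toy function with maximum at [1, -2].
    return -(theta - torch.tensor([1.0, -2.0])).pow(2).sum()

for _ in range(200):
    loss = -objective(theta)   # minimizing -objective moves the weights along the objective's gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(theta)                   # converges towards [1.0, -2.0]
```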
How to maximize the ELBO?
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[Formula image: the SCORE estimator of the gradient of the ELBO.]
This estimator is unbiased, meaning that even if at every step it may not be equal to the true expectation, on average it converges to it. But since it is stochastic, it also has a variance, and that variance happens to be too high for practical use. Plus, we can’t run backpropagation through it!
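To make the variance issue concrete, here is a toy comparison (my own example, not from the slides): estimating $\nabla_\mu E_{z \sim \mathcal{N}(\mu, 1)}[z^2]$ with a score-function estimator and with the reparameterization trick introduced next. Both are unbiased, but the score estimator’s samples are far more spread out.

```python
import torch

mu = torch.tensor(1.5)
n = 10_000                               # number of Monte Carlo samples
# True gradient: d/dmu E[z^2] = d/dmu (mu^2 + 1) = 2*mu = 3.0

# 1) Score-function estimator: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
z = mu + torch.randn(n)
score_grads = z**2 * (z - mu)

# 2) Reparameterization: z = mu + eps with eps ~ N(0, 1), so d/dmu z^2 = 2*z
eps = torch.randn(n)
reparam_grads = 2 * (mu + eps)

print(f"score:   mean {score_grads.mean().item():.2f}, std {score_grads.std().item():.2f}")
print(f"reparam: mean {reparam_grads.mean().item():.2f}, std {reparam_grads.std().item():.2f}")
```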
We need a new estimator!
The reparameterization trick
[Diagram] The graphical model as before: latent variable Z (parameters 𝜇, 𝜎²) and observable variable X. With the reparameterization trick, the stochastic node is moved to an auxiliary noise variable 𝜖, so that Z is computed deterministically from 𝜇, 𝜎² and 𝜖.
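A minimal sketch of the trick in PyTorch (the function name and the use of log 𝜎² are my own choices, consistent with the network shown later):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # Instead of sampling z ~ N(mu, sigma^2) directly (a stochastic node we can't
    # backprop through), sample eps ~ N(0, I) and compute z deterministically.
    std = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), outside the gradient path
    return mu + std * eps            # gradients flow through mu and log_var

mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                   # backpropagation now reaches mu and log_var
```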
Running backpropagation on the reparametrized model
[Diagram from Kingma & Welling (2019): in the reparametrized model, the gradients of the loss function flow back through the deterministic nodes 𝜇 and 𝜎² to the encoder, while the randomness is confined to the noise node 𝜖.]
Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
A new estimator!
$L(\theta, \varphi, \mathbf{x}) = E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] = E_{p(\epsilon)}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$   (ELBO)

where $\epsilon \sim p(\epsilon)$ and $\mathbf{z} = g(\varphi, \mathbf{x}, \epsilon)$.

A single sample of 𝜖 gives the estimator:

$E_{p(\epsilon)}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] \simeq \tilde{L}(\theta, \varphi, \mathbf{x}) = \log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}$

Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Is the new estimator unbiased?
With $\tilde{L}(\theta, \varphi, \mathbf{x}) = \log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}$ and $L(\theta, \varphi, \mathbf{x}) = E_{p(\epsilon)}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]$:

$E_{p(\epsilon)}\left[\nabla_{\theta,\varphi}\,\tilde{L}(\theta, \varphi, \mathbf{x})\right] = E_{p(\epsilon)}\left[\nabla_{\theta,\varphi}\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] = \nabla_{\theta,\varphi}\left(E_{p(\epsilon)}\left[\log \dfrac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]\right) = \nabla_{\theta,\varphi}\,L(\theta, \varphi, \mathbf{x})$

On average, the gradient of the single-sample estimator equals the gradient of the ELBO: the estimator is unbiased.
Example network
[Diagram] X (Input) → Encoder → 𝜇 and log(𝜎²) (Latent Space) → Z, sampled with 𝜖 ∼ 𝑁(0, 𝐼) → Decoder → X’ (Reconstructed Input).

We prefer learning log(𝜎²) because it can be negative, so the model doesn’t need to be forced to produce only positive values for it.

𝜖 is sampled using the torch.randn_like() function (which takes a tensor and returns standard-normal noise of the same shape).
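A minimal sketch of this example network in PyTorch (the layer sizes are my own assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim: int = 784, hidden_dim: int = 400, latent_dim: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)        # mu
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)   # log(sigma^2)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),   # outputs in [0, 1]
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * eps     # reparameterization trick
        return self.decoder(z), mu, log_var
```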
Show me the loss already!
[Diagram] X → Encoder → Latent Space → Z (sampled with 𝜖 ∼ 𝑁(0, 𝐼)) → Decoder → X’, annotated with the loss from the paper. MLP = Multi Layer Perceptron.
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
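A sketch of that loss for the VAE class defined above (my own assumptions: a Bernoulli/BCE reconstruction term, common for images with pixels in [0, 1], and the closed-form Gaussian KL term shown earlier); minimizing it maximizes the ELBO:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat: torch.Tensor, x: torch.Tensor, mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # Maximizing E_q[log p(x|z)]  <->  minimizing the reconstruction loss
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # D_KL( N(mu, sigma^2 I) || N(0, I) ), closed form
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var)
    return recon + kl   # = -ELBO (up to constants), to be minimized

# Usage sketch (model and optimizer are assumed to exist):
# x_hat, mu, log_var = model(x)
# loss = vae_loss(x_hat, x, mu, log_var)
# loss.backward(); optimizer.step()
```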
How to derive the loss function?
https://stats.stackexchange.com/questions/318748/deriving-the-kl-divergence-loss-for-vaes
Thanks for watching!
Don’t forget to subscribe for more amazing content on AI and Machine Learning!