The document discusses generative models and summarizes three popular types: PixelRNN/CNN, variational autoencoders (VAEs), and generative adversarial networks (GANs). PixelRNN/CNN are fully visible belief networks that use a neural network to model the probability of each pixel given all previous pixels, explicitly defining a tractable data distribution. VAEs introduce a latent representation to define an explicit but intractable density, and are trained by maximizing a variational lower bound on the likelihood. GANs are implicit density models that train a generator and a discriminator in an adversarial game to produce samples from the data distribution.
2. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Administrative
2
● A3 is out. Due May 25.
● Milestone was due May 10th.
○ Read the website page for milestone requirements.
○ You need to finish data preprocessing and have initial results by then.
● Midterm and A2 grades will be out this week
3. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Supervised vs Unsupervised Learning
3
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification,
regression, object detection,
semantic segmentation, image
captioning, etc.
4. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Supervised vs Unsupervised Learning
4
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification,
regression, object detection,
semantic segmentation, image
captioning, etc.
Cat
Classification
This image is CC0 public domain
5. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Supervised vs Unsupervised Learning
5
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification,
regression, object detection,
semantic segmentation, image
captioning, etc.
Image captioning
A cat sitting on a suitcase on the floor
Caption generated using neuraltalk2
Image is CC0 Public domain.
6. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Supervised vs Unsupervised Learning
6
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification,
regression, object detection,
semantic segmentation, image
captioning, etc.
DOG, DOG, CAT
This image is CC0 public domain
Object Detection
7. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Supervised vs Unsupervised Learning
7
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification,
regression, object detection,
semantic segmentation, image
captioning, etc.
Semantic Segmentation
GRASS, CAT,
TREE, SKY
8. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
8
Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying
hidden structure of the data
Examples: Clustering,
dimensionality reduction, feature
learning, density estimation, etc.
Supervised vs Unsupervised Learning
9. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
9
Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying
hidden structure of the data
Examples: Clustering,
dimensionality reduction, density
estimation, etc.
Supervised vs Unsupervised Learning
K-means clustering
This image is CC0 public domain
10. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
10
Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying
hidden structure of the data
Examples: Clustering,
dimensionality reduction, density
estimation, etc.
Supervised vs Unsupervised Learning
Principal Component Analysis
(Dimensionality reduction)
This image from Matthias Scholz
is CC0 public domain
(3-d → 2-d)
11. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
11
Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying
hidden structure of the data
Examples: Clustering,
dimensionality reduction, density
estimation, etc.
Supervised vs Unsupervised Learning
2-d density estimation
2-d density images left and right
are CC0 public domain
1-d density estimation
Figure copyright Ian Goodfellow, 2016. Reproduced with permission.
Modeling p(x)
12. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying
hidden structure of the data
Examples: Clustering,
dimensionality reduction, density
estimation, etc.
12
Supervised vs Unsupervised Learning
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification,
regression, object detection,
semantic segmentation, image
captioning, etc.
13. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generative Modeling
13
Training data ~ p_data(x)
Given training data, generate new samples from the same distribution
Objectives:
1. Learn p_model(x) that approximates p_data(x)
2. Sample new x from p_model(x)
(Diagram: learning p_model(x) from data, then sampling new x from it)
14. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generative Modeling
14
Training data ~ p_data(x)
Given training data, generate new samples from the same distribution
Formulate as density estimation problems:
- Explicit density estimation: explicitly define and solve for p_model(x)
- Implicit density estimation: learn a model that can sample from p_model(x) without explicitly defining it.
15. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Why Generative Models?
15
- Realistic samples for artwork, super-resolution, colorization, etc.
- Learn useful features for downstream tasks such as classification.
- Getting insights from high-dimensional data (physics, medical imaging, etc.)
- Modeling physical world for simulation and planning (robotics and
reinforcement learning applications)
- Many more ...
Figures from L-R are copyright: (1) Alec Radford et al. 2016; (2) Phillip Isola et al. 2017, reproduced with the authors' permission; (3) BAIR Blog.
16. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Taxonomy of Generative Models
16
Generative models
├── Explicit density
│   ├── Tractable density: Fully Visible Belief Nets
│   │   (NADE, MADE, PixelRNN/CNN, NICE / RealNVP, Glow, Ffjord)
│   └── Approximate density
│       ├── Variational: Variational Autoencoder
│       └── Markov Chain: Boltzmann Machine
└── Implicit density
    ├── Markov Chain: GSN
    └── Direct: GAN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
17. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Taxonomy of Generative Models
17
Generative models
├── Explicit density
│   ├── Tractable density: Fully Visible Belief Nets
│   │   (NADE, MADE, PixelRNN/CNN, NICE / RealNVP, Glow, Ffjord)
│   └── Approximate density
│       ├── Variational: Variational Autoencoder
│       └── Markov Chain: Boltzmann Machine
└── Implicit density
    ├── Markov Chain: GSN
    └── Direct: GAN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
Today: discuss the 3 most popular types of generative models
18. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
18
PixelRNN and PixelCNN
(A very brief overview)
19. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
19
Fully visible belief network (FVBN)
Explicit density model
Likelihood of image x = joint likelihood of each pixel in the image:
p(x) = p(x_1, x_2, ..., x_n)
20. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
20
Fully visible belief network (FVBN)
Explicit density model
Use chain rule to decompose likelihood of an image x into a product of 1-d distributions:
p(x) = ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i-1})
(left side: likelihood of image x; each factor: probability of the i'th pixel value given all previous pixels)
Then maximize likelihood of training data
21. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Then maximize likelihood of training data
Fully visible belief network (FVBN)
Explicit density model
Use chain rule to decompose likelihood of an image x into a product of 1-d distributions:
p(x) = ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i-1})
(left side: likelihood of image x; each factor: probability of the i'th pixel value given all previous pixels)
Complex distribution over pixel values => express it using a neural network!
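To make the factorization concrete, here is a minimal sketch of evaluating log p(x) as a sum of per-pixel conditionals. The single linear layer stands in for the PixelRNN/CNN, and all names and sizes are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

n_pixels, n_values = 28 * 28, 256     # e.g. a flattened MNIST-sized image
net = nn.Linear(n_pixels, n_values)   # stand-in for the autoregressive network

def log_likelihood(x):
    """x: (batch, n_pixels) integer pixel values in [0, 255]."""
    total = torch.zeros(x.shape[0])
    for i in range(n_pixels):
        context = torch.zeros(x.shape[0], n_pixels)
        context[:, :i] = x[:, :i].float() / 255.0        # pixel i sees only x_1..x_{i-1}
        log_probs = torch.log_softmax(net(context), -1)  # log p(x_i | x_<i)
        total = total + log_probs.gather(1, x[:, i:i+1]).squeeze(1)
    return total                                         # sum_i log p(x_i | x_<i)

x = torch.randint(0, 256, (4, n_pixels))
print(log_likelihood(x).shape)        # torch.Size([4]): one log-likelihood per image
```

Training would maximize this quantity (equivalently, minimize the summed per-pixel cross-entropy) over the training set.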
23. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
PixelRNN
23
Generate image pixels starting from corner
Dependency on previous pixels modeled
using an RNN (LSTM)
[van den Oord et al. 2016]
24. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
PixelRNN
24
Generate image pixels starting from corner
Dependency on previous pixels modeled
using an RNN (LSTM)
[van den Oord et al. 2016]
25. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
PixelRNN
25
Generate image pixels starting from corner
Dependency on previous pixels modeled
using an RNN (LSTM)
[van den Oord et al. 2016]
26. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
PixelRNN
26
Generate image pixels starting from corner
Dependency on previous pixels modeled
using an RNN (LSTM)
[van den Oord et al. 2016]
Drawback: sequential generation is slow
in both training and inference!
27. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
PixelCNN
27
[van den Oord et al. 2016]
Still generate image pixels starting from
corner
Dependency on previous pixels now
modeled using a CNN over context region
(masked convolution)
Figure copyright van den Oord et al., 2016. Reproduced with permission.
28. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
PixelCNN
28
[van den Oord et al. 2016]
Still generate image pixels starting from
corner
Dependency on previous pixels now
modeled using a CNN over context region
(masked convolution)
Figure copyright van den Oord et al., 2016. Reproduced with permission.
Training is faster than PixelRNN
(can parallelize convolutions since context region
values known from training images)
Generation is still slow:
For a 32x32 image, we need 1024 sequential forward passes of the network to generate a single image
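The masked convolution itself is simple to sketch: zero out the kernel weights that would let a pixel see itself or any "future" pixel. This is a minimal single-channel version of the idea from van den Oord et al. 2016; mask types A/B and channel ordering are simplified away:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv whose receptive field covers only pixels above / to the left."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2:] = 0      # current pixel and everything to its right
        mask[kH // 2 + 1:, :] = 0        # every row below the current pixel
        self.register_buffer("mask", mask[None, None])  # shape (1, 1, kH, kW)

    def forward(self, x):
        self.weight.data *= self.mask    # zero "future" weights before convolving
        return super().forward(x)

conv = MaskedConv2d(1, 16, kernel_size=5, padding=2)
features = conv(torch.randn(8, 1, 32, 32))  # per-pixel features from context only
```

During training all context pixels come from the real image, so every position's conditional can be computed in one parallel pass; at generation time the per-pixel passes remain sequential.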
29. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generation Samples
29
Figures copyright Aaron van den Oord et al., 2016. Reproduced with permission.
32x32 CIFAR-10 32x32 ImageNet
30. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
30
PixelRNN and PixelCNN
Improving PixelCNN performance
- Gated convolutional layers
- Short-cut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
- Etc…
See
- van den Oord et al., NIPS 2016
- Salimans et al. 2017
(PixelCNN++)
Pros:
- Can explicitly compute likelihood
p(x)
- Easy to optimize
- Good samples
Con:
- Sequential generation => slow
31. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Taxonomy of Generative Models
31
Generative models
Explicit density Implicit density
Direct
Tractable density Approximate density
Markov Chain
Variational Markov Chain
Variational Autoencoder Boltzmann Machine
GSN
GAN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
Fully Visible Belief Nets
- NADE
- MADE
- PixelRNN/CNN
- NICE / RealNVP
- Glow
- Ffjord
33. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
33
PixelRNN/CNNs define tractable density function, optimize likelihood of training data:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i-1})
So far...
34. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
So far...
34
Variational Autoencoders (VAEs) define intractable density function with latent z:
p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Cannot optimize directly, derive and optimize lower bound on likelihood instead
No dependencies among pixels, can generate all pixels at the same time!
PixelRNN/CNNs define tractable density function, optimize likelihood of training data:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i-1})
35. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
So far...
35
Variational Autoencoders (VAEs) define intractable density function with latent z:
p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Cannot optimize directly, derive and optimize lower bound on likelihood instead
No dependencies among pixels, can generate all pixels at the same time!
Why latent z?
PixelRNN/CNNs define tractable density function, optimize likelihood of training data:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i-1})
36. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
36
Unsupervised approach for learning a lower-dimensional feature representation
from unlabeled training data
Encoder
Input data
Features
Decoder
37. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
37
Input data
Features
Unsupervised approach for learning a lower-dimensional feature representation
from unlabeled training data
z usually smaller than x
(dimensionality reduction)
Q: Why dimensionality
reduction?
Decoder
Encoder
38. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
38
Input data
Features
Unsupervised approach for learning a lower-dimensional feature representation
from unlabeled training data
z usually smaller than x
(dimensionality reduction)
Decoder
Encoder
Q: Why dimensionality
reduction?
A: Want features to
capture meaningful
factors of variation in
data
39. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
39
Encoder
Input data
Features
How to learn this feature
representation?
Train such that features
can be used to
reconstruct original data
“Autoencoding” -
encoding input itself
Decoder
Reconstructed
input data
Reconstructed data
Encoder: 4-layer conv
Decoder: 4-layer upconv
Input data
40. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
40
Encoder
Input data
Features
Decoder
Reconstructed data
Input data
Encoder: 4-layer conv
Decoder: 4-layer upconv
L2 loss function: ||x - x̂||²
Train such that features
can be used to
reconstruct original data
Doesn’t use labels!
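As a concrete (hypothetical) version of this slide's setup, here is a tiny conv encoder / upconv decoder trained with the L2 reconstruction loss; layer counts and sizes are illustrative, not the lecture's exact architecture:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),            # 32x32 -> 16x16
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),           # 16x16 -> 8x8
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 16x16
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # 16x16 -> 32x32
)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(8, 1, 32, 32)        # a minibatch of images; no labels anywhere
x_hat = decoder(encoder(x))          # reconstruct input from features z
loss = ((x - x_hat) ** 2).mean()     # L2 loss: ||x - x_hat||^2
opt.zero_grad(); loss.backward(); opt.step()
```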
41. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
41
Encoder
Input data
Features
Decoder
Reconstructed
input data
After training,
throw away decoder
42. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
42
Encoder
Input data
Features
Classifier
Predicted Label
Fine-tune
encoder
jointly with
classifier
Loss function
(Softmax, etc)
Encoder can be
used to initialize a
supervised model
plane
dog deer
bird
truck
Train for final task
(sometimes with
small data)
Transfer from large, unlabeled
dataset to small, labeled dataset.
43. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Some background first: Autoencoders
43
Encoder
Input data
Features
Decoder
Reconstructed
input data
Autoencoders can reconstruct
data, and can learn features to
initialize a supervised model
Features capture factors of
variation in training data.
But we can’t generate new
images from an autoencoder
because we don’t know the
space of z.
How do we make autoencoder a
generative model?
44. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
44
Variational Autoencoders
Probabilistic spin on autoencoders - will let us sample from the model to generate data!
45. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
45
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Assume training data is generated from the distribution of unobserved (latent)
representation z
Probabilistic spin on autoencoders - will let us sample from the model to generate data!
Sample from
true conditional
46. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
46
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Assume training data is generated from the distribution of unobserved (latent)
representation z
Probabilistic spin on autoencoders - will let us sample from the model to generate data!
Sample from
true conditional
Intuition (remember from autoencoders!):
x is an image, z is latent factors used to
generate x: attributes, orientation, etc.
47. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
47
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
48. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
48
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
How should we represent this model?
49. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
49
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
How should we represent this model?
Choose prior p(z) to be simple, e.g.
Gaussian. Reasonable for latent attributes,
e.g. pose, how much smile.
50. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
50
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
How should we represent this model?
Choose prior p(z) to be simple, e.g.
Gaussian. Reasonable for latent attributes,
e.g. pose, how much smile.
Conditional p(x|z) is complex (generates
image) => represent with neural network
Decoder
network
51. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
51
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
How to train the model?
Decoder
network
52. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
52
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
How to train the model?
Learn model parameters to maximize likelihood
of training data
Decoder
network
53. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
53
Sample from
true prior
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders
Sample from
true conditional
We want to estimate the true parameters
of this generative model given training data x.
How to train the model?
Learn model parameters to maximize likelihood
of training data
Q: What is the problem with this?
Intractable!
Decoder
network
54. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
54
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders: Intractability
Data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
57. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
57
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders: Intractability
Data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Intractable to compute p(x|z) for every z! ❌ (prior p_θ(z) ✔ and decoder output p_θ(x|z) ✔ are fine)
58. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
58
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders: Intractability
Data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Intractable to compute p(x|z) for every z! ❌ (prior p_θ(z) ✔ and decoder output p_θ(x|z) ✔ are fine)
Monte Carlo estimation is too high variance
59. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
59
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders: Intractability
Data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz — intractable ❌ (prior ✔ and conditional ✔ are fine)
Posterior density: p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x) — also intractable, due to the intractable data likelihood in the denominator
60. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
60
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Variational Autoencoders: Intractability
Data likelihood: p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Posterior density also intractable: p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x)
Solution: In addition to modeling p_θ(x|z), learn q_ϕ(z|x) that approximates the true posterior p_θ(z|x).
We will see that the approximate posterior allows us to derive a lower bound on the data likelihood that is tractable, which we can optimize.
Variational inference approximates the unknown posterior distribution from only the observed data x.
62. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
62
Variational Autoencoders
Taking expectation wrt. z
(using encoder network) will
come in handy later
66. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
66
Variational Autoencoders
The expectation wrt. z (using
encoder network) let us write
nice KL terms
67. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
67
Variational Autoencoders
This KL term (between Gaussians for encoder and z prior) has nice closed-form solution!
p_θ(z|x) intractable (saw earlier), can't compute this KL term :( But we know KL divergence always >= 0.
Decoder network gives p_θ(x|z), can compute estimate of this term through sampling (need some trick to differentiate through sampling).
68. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
68
Variational Autoencoders
We want to maximize the data likelihood
This KL term (between Gaussians for encoder and z prior) has nice closed-form solution!
p_θ(z|x) intractable (saw earlier), can't compute this KL term :( But we know KL divergence always >= 0.
Decoder network gives p_θ(x|z), can compute estimate of this term through sampling.
69. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
69
Variational Autoencoders
Tractable lower bound which we can take gradient of and optimize! (p_θ(x|z) differentiable, KL term differentiable)
We want to maximize the data likelihood
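The bound on these slides appeared as an image; for reference, the standard decomposition from Kingma and Welling is:

$$
\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \;-\; D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big)}_{\text{tractable ELBO } \mathcal{L}(x;\theta,\phi)} \;+\; \underbrace{D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)}_{\ge 0,\ \text{intractable}} \;\ge\; \mathcal{L}(x;\theta,\phi)
$$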
70. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
70
Variational Autoencoders
Tractable lower bound which we can take gradient of and optimize! (p_θ(x|z) differentiable, KL term differentiable)
Decoder:
reconstruct
the input data
Encoder:
make approximate
posterior distribution
close to prior
71. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
71
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
72. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
72
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
Let’s look at computing the KL
divergence between the estimated
posterior and the prior given some data
73. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
73
Encoder network
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
74. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
74
Encoder network
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
Make approximate
posterior distribution
close to prior
Have analytical solution
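The analytical solution referred to here (shown as an image on the slide) is the closed-form KL between the diagonal-Gaussian posterior and the standard-normal prior:

$$
D_{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) \;=\; \frac{1}{2} \sum_{j=1}^{J} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
$$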
75. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
75
Encoder network
Sample z from
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
Make approximate
posterior distribution
close to prior
Not part of the computation graph!
76. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
76
Encoder network
Sample z from
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
Reparameterization trick to make sampling differentiable:
Sample ε ~ N(0, I), then set z = μ_{z|x} + σ_{z|x} ⊙ ε
77. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
77
Encoder network
Sample z from
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
Reparameterization trick to make sampling differentiable:
Sample ε ~ N(0, I) — the input to the graph; z = μ_{z|x} + σ_{z|x} ⊙ ε is part of the computation graph
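In code, the trick is a couple of lines; a sketch with illustrative shapes (an encoder would normally produce mu and log_var):

```python
import torch

mu = torch.zeros(8, 32)                   # encoder output (illustrative)
log_var = torch.zeros(8, 32)              # encoder output (illustrative)
eps = torch.randn_like(mu)                # input to the graph, not part of it
z = mu + torch.exp(0.5 * log_var) * eps   # differentiable w.r.t. mu and log_var
```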
78. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
78
Encoder network
Decoder network
Sample z from
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
79. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
79
Encoder network
Decoder network
Sample z from
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
Maximize likelihood of original
input being reconstructed
80. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
80
Encoder network
Decoder network
Sample z from
Input Data
Variational Autoencoders
Putting it all together: maximizing the
likelihood lower bound
For every minibatch of input
data: compute this forward
pass, and then backprop!
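Putting the whole forward pass and backprop into one minimal sketch (diagonal-Gaussian encoder, standard-normal prior, and the decoder likelihood reduced to an L2-style reconstruction term; all networks and shapes are illustrative):

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 32
enc = nn.Linear(x_dim, 2 * z_dim)    # outputs [mu, log_var]
dec = nn.Linear(z_dim, x_dim)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, x_dim)                                  # minibatch of inputs
mu, log_var = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample
x_hat = dec(z)

recon = ((x - x_hat) ** 2).sum(dim=-1)    # stands in for -E[log p(x|z)]
kl = 0.5 * (mu**2 + log_var.exp() - log_var - 1).sum(dim=-1)  # closed-form KL
loss = (recon + kl).mean()                # negative lower bound, minimized
opt.zero_grad(); loss.backward(); opt.step()
```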
81. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
81
Variational Autoencoders: Generating Data!
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Sample from
true prior
Sample from
true conditional
Decoder
network
Our assumption about data generation
process
82. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
82
Variational Autoencoders: Generating Data!
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Sample from
true prior
Sample from
true conditional
Decoder
network
Our assumption about data generation
process
Decoder network
Sample z from
Sample x|z from
Now given a trained VAE:
use decoder network & sample z from prior!
83. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
83
Decoder network
Sample z from
Sample x|z from
Variational Autoencoders: Generating Data!
Use decoder network. Now sample z from prior!
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
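Generation then needs only a few lines, reusing dec and z_dim from the training sketch above:

```python
import torch

with torch.no_grad():
    z = torch.randn(16, z_dim)   # sample z from the prior N(0, I)
    x_new = dec(z)               # sample x|z via the decoder network
```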
84. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
84
Decoder network
Sample z from
Sample x|z from
Variational Autoencoders: Generating Data!
Use decoder network. Now sample z from prior! Data manifold for 2-d z
Vary z1
Vary z2
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
85. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
85
Variational Autoencoders: Generating Data!
Vary z1
Vary z2
Degree of smile
Head pose
Diagonal prior on z => independent latent variables
Different dimensions of z encode interpretable factors of variation
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
86. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
86
Variational Autoencoders: Generating Data!
Vary z1
Vary z2
Degree of smile
Head pose
Diagonal prior on z => independent latent variables
Different dimensions of z encode interpretable factors of variation
Also good feature representation that can be computed using q_ϕ(z|x)!
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
87. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
87
Variational Autoencoders: Generating Data!
32x32 CIFAR-10
Labeled Faces in the Wild
Figures copyright (L) Durk Kingma et al. 2016; (R) Anders Larsen et al. 2017. Reproduced with permission.
88. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Variational Autoencoders
88
Probabilistic spin to traditional autoencoders => allows generating data
Defines an intractable density => derive and optimize a (variational) lower bound
Pros:
- Principled approach to generative models
- Interpretable latent space.
- Allows inference of q(z|x), can be useful feature representation for other tasks
Cons:
- Maximizes lower bound of likelihood: okay, but not as good evaluation as
PixelRNN/PixelCNN
- Samples blurrier and lower quality compared to state-of-the-art (GANs)
Active areas of research:
- More flexible approximations, e.g. richer approximate posterior instead of diagonal
Gaussian, e.g., Gaussian Mixture Models (GMMs), Categorical Distributions.
- Learning disentangled representations.
89. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Taxonomy of Generative Models
89
Generative models
├── Explicit density
│   ├── Tractable density: Fully Visible Belief Nets
│   │   (NADE, MADE, PixelRNN/CNN, NICE / RealNVP, Glow, Ffjord)
│   └── Approximate density
│       ├── Variational: Variational Autoencoder
│       └── Markov Chain: Boltzmann Machine
└── Implicit density
    ├── Markov Chain: GSN
    └── Direct: GAN
Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
91. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
So far...
91
VAEs define intractable density function with latent z:
p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Cannot optimize directly, derive and optimize lower bound on likelihood instead
PixelRNN/CNNs define tractable density function, optimize likelihood of training data:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i-1})
92. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
So far...
VAEs define intractable density function with latent z:
p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Cannot optimize directly, derive and optimize lower bound on likelihood instead
92
What if we give up on explicitly modeling density, and just want ability to sample?
PixelRNN/CNNs define tractable density function, optimize likelihood of training data:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i-1})
93. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
So far...
VAEs define intractable density function with latent z:
p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz
Cannot optimize directly, derive and optimize lower bound on likelihood instead
93
What if we give up on explicitly modeling density, and just want ability to sample?
GANs: not modeling any explicit density function!
PixelRNN/CNNs define tractable density function, optimize likelihood of training data:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i-1})
94. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generative Adversarial Networks
94
Ian Goodfellow et al., “Generative
Adversarial Nets”, NIPS 2014
Problem: Want to sample from complex, high-dimensional training distribution. No direct
way to do this!
Solution: Sample from a simple distribution we can easily sample from, e.g. random noise.
Learn transformation to training distribution.
95. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Problem: Want to sample from complex, high-dimensional training distribution. No direct
way to do this!
Solution: Sample from a simple distribution we can easily sample from, e.g. random noise.
Learn transformation to training distribution.
Generative Adversarial Networks
95
Ian Goodfellow et al., “Generative
Adversarial Nets”, NIPS 2014
z
Input: Random noise
Generator
Network
Output: Sample from
training distribution
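A generator in this sense is just a network mapping noise to data space; a minimal fully connected sketch (real GANs typically use transposed convolutions, and all sizes here are illustrative):

```python
import torch
import torch.nn as nn

z_dim = 100
G = nn.Sequential(
    nn.Linear(z_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),   # a flattened 28x28 "image" in [-1, 1]
)
z = torch.randn(8, z_dim)             # input: random noise
fake = G(z)                           # output: attempted sample from p_data
```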
96. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Problem: Want to sample from complex, high-dimensional training distribution. No direct
way to do this!
Solution: Sample from a simple distribution we can easily sample from, e.g. random noise.
Learn transformation to training distribution.
Generative Adversarial Networks
96
z
Input: Random noise
Generator
Network
Output: Sample from
training distribution
Ian Goodfellow et al., “Generative
Adversarial Nets”, NIPS 2014
But we don’t know which
sample z maps to which
training image -> can’t
learn by reconstructing
training images
97. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Problem: Want to sample from complex, high-dimensional training distribution. No direct
way to do this!
Solution: Sample from a simple distribution we can easily sample from, e.g. random noise.
Learn transformation to training distribution.
Generative Adversarial Networks
97
z
Input: Random noise
Generator
Network
Output: Sample from
training distribution
Ian Goodfellow et al., “Generative
Adversarial Nets”, NIPS 2014
But we don’t know which
sample z maps to which
training image -> can’t
learn by reconstructing
training images
Objective: generated
images should look “real”
98. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Problem: Want to sample from complex, high-dimensional training distribution. No direct
way to do this!
Solution: Sample from a simple distribution we can easily sample from, e.g. random noise.
Learn transformation to training distribution.
Generative Adversarial Networks
98
z
Input: Random noise
Generator
Network
Output: Sample from
training distribution
Ian Goodfellow et al., “Generative
Adversarial Nets”, NIPS 2014
But we don’t know which
sample z maps to which
training image -> can’t
learn by reconstructing
training images
Discriminator
Network
Real?
Fake?
Solution: Use a discriminator
network to tell whether the
generate image is within data
distribution (“real”) or not
gradient
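To make the noise-to-sample mapping concrete, here is a minimal, hypothetical PyTorch generator (the MLP architecture and sizes are illustrative assumptions, not the paper's):

```python
import torch
import torch.nn as nn

# Hypothetical generator: transforms random noise z into a 28x28 image.
noise_dim = 100
generator = nn.Sequential(
    nn.Linear(noise_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 28 * 28),
    nn.Tanh(),  # outputs in [-1, 1], matching normalized training images
)

z = torch.randn(64, noise_dim)            # a batch of random noise vectors
fake = generator(z).view(64, 1, 28, 28)   # samples from the learned distribution
```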
99. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Training GANs: Two-player game
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
Discriminator network: tries to distinguish between real and fake images.
Generator network: tries to fool the discriminator by generating real-looking images.
Random noise z -> Generator Network -> fake images (from the generator); real images come from the training set. The Discriminator Network classifies each image as real or fake, and this decision is the learning signal for both the generator and the discriminator.
Fake and real images copyright Emily Denton et al. 2015. Reproduced with permission.
Train jointly in a minimax game. Minimax objective function:
$$\min_{\theta_g} \max_{\theta_d} \left[ \mathbb{E}_{x \sim p_{\text{data}}} \log D_{\theta_d}(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big) \right]$$
Here $D_{\theta_d}(x)$ is the discriminator output for real data x, and $D_{\theta_d}(G_{\theta_g}(z))$ is the discriminator output for generated fake data G(z); the discriminator outputs a likelihood in (0,1) of an image being real.
- The discriminator ($\theta_d$) wants to maximize the objective such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake).
- The generator ($\theta_g$) wants to minimize the objective such that D(G(z)) is close to 1 (the discriminator is fooled into thinking the generated G(z) is real).
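In code, these two objectives are usually implemented as cross-entropy-style losses on the discriminator's outputs. A minimal PyTorch sketch of the losses implied by the minimax objective (function and variable names are illustrative assumptions, not from the paper):

```python
import torch

def discriminator_loss(d_real, d_fake):
    # Gradient *ascent* on  log D(x) + log(1 - D(G(z)))  is implemented as
    # gradient descent on its negation. d_real / d_fake are probabilities in (0, 1).
    return -(torch.log(d_real) + torch.log(1.0 - d_fake)).mean()

def generator_minimax_loss(d_fake):
    # The generator's original (saturating) objective: minimize log(1 - D(G(z))).
    return torch.log(1.0 - d_fake).mean()
```

In practice one works with raw logits and `torch.nn.functional.binary_cross_entropy_with_logits` for numerical stability; the sketch above keeps the direct form to match the equation.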
107. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Training GANs: Two-player game
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
Minimax objective function as above. Alternate between:
1. Gradient ascent on the discriminator:
$$\max_{\theta_d} \left[ \mathbb{E}_{x \sim p_{\text{data}}} \log D_{\theta_d}(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big) \right]$$
2. Gradient descent on the generator:
$$\min_{\theta_g} \mathbb{E}_{z \sim p(z)} \log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)$$
In practice, optimizing this generator objective does not work well! When a sample is likely fake, we want to learn from it to improve the generator (move right along the D(G(z)) axis), but the gradient of log(1 - D(G(z))) in that region is relatively flat. The gradient signal is therefore dominated by the region where the sample is already good.
110. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Training GANs: Two-player game
Minimax objective function as above. Alternate between:
1. Gradient ascent on the discriminator (as before).
2. Instead: gradient ascent on the generator with a different objective:
$$\max_{\theta_g} \mathbb{E}_{z \sim p(z)} \log D_{\theta_d}(G_{\theta_g}(z))$$
Instead of minimizing the likelihood of the discriminator being correct, we now maximize the likelihood of the discriminator being wrong. This has the same goal of fooling the discriminator, but now there is a high gradient signal where samples are bad (D(G(z)) near 0) and a low gradient signal where samples are already good, so it works much better. This is the standard in practice.
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
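A quick calculation (not on the slide, but standard) makes the difference concrete. Writing $d = D(G(z))$, the two generator losses have derivatives

$$\frac{\partial}{\partial d}\,\log(1-d) = -\frac{1}{1-d}, \qquad \frac{\partial}{\partial d}\,\big(-\log d\big) = -\frac{1}{d}.$$

When the discriminator confidently rejects a sample ($d \approx 0$), the first gradient is small in magnitude (about $-1$) while the second blows up, so the non-saturating loss keeps pushing bad samples to improve.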
111. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Training GANs: Two-player game
Putting it together: GAN training algorithm. In each iteration, run k updates of the discriminator, then one update of the generator (see the sketch below).
Some find k = 1 more stable, others use k > 1; there is no best rule. Follow-up work (e.g., Wasserstein GAN, BEGAN) alleviates this problem and gives better stability!
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
Arjovsky et al., “Wasserstein GAN”, arXiv preprint arXiv:1701.07875 (2017)
Berthelot et al., “BEGAN: Boundary Equilibrium Generative Adversarial Networks”, arXiv preprint arXiv:1703.10717 (2017)
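A minimal sketch of that alternating loop in PyTorch, using the non-saturating generator loss from above (the names G, D, noise_dim, and all hyperparameters are illustrative assumptions; D is assumed to output raw logits of shape (batch, 1)):

```python
import torch
import torch.nn.functional as F

def train_gan(G, D, dataloader, opt_g, opt_d, noise_dim=100, k=1, epochs=10):
    for _ in range(epochs):
        for real, _ in dataloader:
            ones = torch.ones(real.size(0), 1)
            zeros = torch.zeros(real.size(0), 1)
            # k steps of gradient ascent on the discriminator.
            # (The original algorithm samples a fresh real minibatch for each of
            # the k steps; this sketch reuses one batch for simplicity.)
            for _ in range(k):
                z = torch.randn(real.size(0), noise_dim)  # reshape if G expects a 1x1 spatial map
                fake = G(z).detach()  # don't backprop into the generator here
                d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
                          + F.binary_cross_entropy_with_logits(D(fake), zeros))
                opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # One step on the generator: maximize log D(G(z)).
            z = torch.randn(real.size(0), noise_dim)
            g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```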
113. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Training GANs: Two-player game
Generator network: tries to fool the discriminator by generating real-looking images.
Discriminator network: tries to distinguish between real and fake images.
After training, use the generator network alone to generate new images from random noise z.
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
Fake and real images copyright Emily Denton et al. 2015. Reproduced with permission.
114. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generative Adversarial Nets
Generated samples (and generated CIFAR-10 samples), each shown with the nearest neighbor from the training set for comparison.
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
Figures copyright Ian Goodfellow et al., 2014. Reproduced with permission.
116. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generative Adversarial Nets: Convolutional Architectures
Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
Generator is an upsampling network with fractionally-strided convolutions
Discriminator is a convolutional network
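A minimal DCGAN-style generator sketch in PyTorch (channel counts and depths are illustrative assumptions loosely in the spirit of Radford et al., not an exact reproduction). Each transposed (fractionally-strided) convolution doubles the spatial resolution:

```python
import torch
import torch.nn as nn

# DCGAN-style generator: project 100-d noise to a 4x4 feature map, then
# upsample 4x4 -> 8x8 -> 16x16 -> 32x32 with transposed convolutions.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 4x4 -> 8x8
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 8x8 -> 16x16
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Tanh(),
)

z = torch.randn(8, 100, 1, 1)   # noise as a 1x1 spatial map
images = generator(z)           # shape (8, 3, 32, 32)
```

The discriminator is essentially the mirror image, built from strided Conv2d layers that downsample back to a single real/fake score.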
117. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Radford et al., ICLR 2016
Samples from the model look much better!
Generative Adversarial Nets: Convolutional Architectures
118. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Radford et al., ICLR 2016
Interpolating between random points in latent space
Generative Adversarial Nets: Convolutional Architectures
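Interpolation is easy to reproduce in code: linearly blend two latent vectors and decode each intermediate point. A minimal sketch assuming the generator sketched above (linear interpolation is the simplest choice; spherical interpolation is also common):

```python
import torch

# Decode a sequence of latents that morph from z0 to z1.
z0 = torch.randn(1, 100, 1, 1)
z1 = torch.randn(1, 100, 1, 1)
frames = [generator((1 - t) * z0 + t * z1) for t in torch.linspace(0, 1, 9)]
```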
119. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Generative Adversarial Nets: Interpretable Vector Math
Radford et al., ICLR 2016
Take samples from the model, average the z vectors that produced each concept, and do arithmetic on the averages:
Smiling woman - Neutral woman + Neutral man = Smiling man
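In code, this is just arithmetic on averaged latent vectors. A minimal sketch assuming the generator above and three small sets of z vectors whose samples were observed to show each attribute (all placeholders are hypothetical):

```python
import torch

# Placeholders for latent vectors collected by inspecting samples.
z_smiling_woman = torch.randn(16, 100, 1, 1)
z_neutral_woman = torch.randn(16, 100, 1, 1)
z_neutral_man = torch.randn(16, 100, 1, 1)

# Average each group, then do vector arithmetic in latent space.
z_new = (z_smiling_woman.mean(0, keepdim=True)
         - z_neutral_woman.mean(0, keepdim=True)
         + z_neutral_man.mean(0, keepdim=True))
smiling_man = generator(z_new)   # ideally decodes to a smiling man
```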
122. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Glasses man - No glasses man + No glasses woman = Woman with glasses
Radford et al., ICLR 2016
Generative Adversarial Nets: Interpretable Vector Math
123. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
2017: Explosion of GANs
“The GAN Zoo”: https://github.com/hindupuravinash/the-gan-zoo
See also https://github.com/soumith/ganhacks for tips and tricks for training GANs
124. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
2017: Explosion of GANs
Better training and generation: LSGAN (Mao et al., 2017), Wasserstein GAN (Arjovsky et al., 2017), Improved Wasserstein GAN (Gulrajani et al., 2017), Progressive GAN (Karras et al., 2018).
125. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
2017: Explosion of GANs
Many GAN applications:
- Source -> target domain transfer: CycleGAN (Zhu et al., 2017)
- Image-to-image translation: pix2pix (Isola et al., 2017); many examples at https://phillipi.github.io/pix2pix/
- Text -> image synthesis (Reed et al., 2017)
126. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
2019: BigGAN
Brock et al., 2019
127. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Scene graphs to GANs
Specifying exactly what kind of image you want to generate: the explicit structure in scene graphs provides better image generation for complex scenes.
Johnson et al., “Image Generation from Scene Graphs”, CVPR 2018
Figures copyright 2019. Reproduced with permission.
128. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
HYPE: Human eYe Perceptual Evaluations
hype.stanford.edu
Zhou, Gordon, Krishna et al. HYPE: Human eYe Perceptual Evaluations, NeurIPS 2019
Figures copyright 2019. Reproduced with permission.
129. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Summary: GANs
GANs don't work with an explicit density function. Instead, they take a game-theoretic approach: learn to generate from the training distribution through a two-player game.
Pros:
- Beautiful, state-of-the-art samples!
Cons:
- Trickier / more unstable to train
- Can't solve inference queries such as p(x) or p(z|x)
Active areas of research:
- Better loss functions and more stable training (Wasserstein GAN, LSGAN, many others)
- Conditional GANs and GANs for all kinds of applications
130. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Summary
Autoregressive models: PixelRNN, PixelCNN
Van den Oord et al., “Conditional Image Generation with PixelCNN Decoders”, NIPS 2016
Variational Autoencoders
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Generative Adversarial Networks (GANs)
Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
131. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 13 - May 12, 2022
Useful Resources on Generative Models
CS 236: Deep Generative Models (Stanford)
CS 294-158 Deep Unsupervised Learning (Berkeley)