06 DLEA Generative Models
Deep learning and generative models – Prof. A. Prati, Prof. T. Fontanini, University of Parma
Supervised vs Unsupervised
Supervised Learning Unsupervised Learning
Discriminative models vs generative models
Is this an apple? vs What is an apple?
Example: classify an animal as cat or dog
• Discriminative Model: find a decision boundary that separates cats and dogs. Then, check on which side of the decision boundary the new animal falls.
• Generative Model: build models of what cats and dogs are like, respectively. Then, match the new animal against the cat model and the dog model.
Discriminative models vs generative models
More formally…
• Discriminative model: conditional probability distribution $P(y \mid x)$. A discriminative model tells you how likely a label $y$ applies to an instance of input data $x$.
Some Generative Tasks
Creating art
Some Generative Tasks
Generate artificial images (like human faces)
Autoencoders
Learning a low-dimensional feature representation
Reducing the Dimensionality of Data
Autoencoders are composed of two sections: an Encoder and a Decoder.
• Encoder: transforms the high-dimensional data into a low-dimensional code $z$ called the latent space (a nonlinear generalization of PCA).
• Decoder: reconstructs the data from the code.
Training an Autoencoder
• The objective is to minimize the discrepancy between the original data and its reconstruction:

$$\mathcal{L}_{rec} = \| x - AE(x) \|^2$$

where $x$ is the input data and $AE(x)$ is the autoencoder output.
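A minimal sketch of this objective in PyTorch; the layer sizes and the stand-in data are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: code z -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)          # stand-in batch of flattened images
x_hat = model(x)
loss = ((x - x_hat) ** 2).mean()  # L_rec = ||x - AE(x)||^2
opt.zero_grad(); loss.backward(); opt.step()
```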
Compression is not Lossless
• Ideally, we would like an autoencoder to perfectly reconstruct the input.
• Nevertheless, the latent space represents a “bottleneck”: the smaller the latent space, the more information is lost.
Visualizing the Latent Space
Applications of Autoencoders
• Feature Extraction (Image Retrieval):
  • The latent space is used to find the most similar images in the dataset.
  • Compression removes “useless” data (such as the hat in the example).
• Denoising:
  • Corrupted data is used as input and the autoencoder's role is to recover the original data.
  • Compression extracts the “meaningful” data which represents the distribution, removing useless data like noise.
Generate New Data?
Problem
• Autoencoders map an input $x$ into a latent vector $z$, which represents a point in the latent space, not a real distribution → lack of interpretable structure.
• We are still not able to generate new data.
Why?
• Kind of expected: we did not enforce any organization of the latent space.
• It is a different way of seeing the overfitting problem:
  • It is possible to generate a sample, given $x$.
  • It is possible to generate a sample by interpolating two latent vectors.
  • It is not possible to generate if we deviate from that.
Variational Autoencoders
Introducing Stochastic Variational Inference
What is a Variational Autoencoder (VAE)?
A VAE is an autoencoder optimized so that its latent space can be used in a generative process.
Math Notation
• $X = \{x^{(i)}\}_{i=1}^{N}$ is the dataset consisting of $N$ samples of some variable $x$.
• $z$ is an unobserved continuous random variable from which the data is generated.
• $q_\phi(z \mid x)$ is a probabilistic encoder that, given a datapoint $x$, produces a distribution (e.g. a Gaussian) over the possible values of the code $z$ from which the datapoint $x$ could have been generated.
• $p_\theta(x \mid z)$ is a probabilistic decoder that, given a code $z$, produces a distribution over the possible corresponding values of $x$.
Compute $z$ from $x$ (using probabilities)
We want to calculate $p(z \mid x)$.
If we apply Bayes' rule:

$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} = \frac{p(x, z)}{p(x)}$$
KL divergence
KL divergence is a measure of the difference between two probability distributions. Therefore, we can use it to ensure $q(z \mid x)$ is similar to $p(z \mid x)$, by minimizing:

$$\min \; KL(q(z \mid x) \parallel p(z \mid x))$$

Remember: $p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} = \frac{p(x, z)}{p(x)}$

$$KL(q(z \mid x) \parallel p(z \mid x)) = -\sum_z q(z \mid x) \log \frac{p(z \mid x)}{q(z \mid x)}$$

$$= -\sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)\,p(x)} = -\sum_z q(z \mid x) \left[ \log \frac{p(x, z)}{q(z \mid x)} - \log p(x) \right]$$

$$= -\sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)} + \log p(x) \underbrace{\sum_z q(z \mid x)}_{=1}$$
Variational Lower Bound

$$\underbrace{\log p(x)}_{\text{a constant}} = \underbrace{KL(q(z \mid x) \parallel p(z \mid x))}_{\text{we wanted to minimize this}} + \underbrace{\sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)}}_{\mathcal{L}}$$

Instead of minimizing the KL, we can maximize $\mathcal{L}$.
$\mathcal{L}$ is called the Variational Lower Bound. This is because the KL is always positive, therefore $\mathcal{L} \leq \log p(x)$, and when $\mathcal{L}$ is maximized, $\log p(x)$ is pushed up as well.
Final Objective: Derivation from the Lower Bound

$$\mathcal{L} = \sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)} = \sum_z q(z \mid x) \log \frac{p(x \mid z)\,p(z)}{q(z \mid x)}$$

$$= \sum_z q(z \mid x) \left[ \log p(x \mid z) + \log \frac{p(z)}{q(z \mid x)} \right]$$

$$= \underbrace{\sum_z q(z \mid x) \log p(x \mid z)}_{\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]} \;+\; \underbrace{\sum_z q(z \mid x) \log \frac{p(z)}{q(z \mid x)}}_{-KL(q(z \mid x) \,\parallel\, p(z))}$$
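For the common choices $q(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ and $p(z) = \mathcal{N}(0, I)$, the KL term above has a well-known closed form (a standard result, not derived on these slides):

$$KL(q(z \mid x) \parallel p(z)) = -\frac{1}{2}\sum_{j}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$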
Turn probabilities into neural networks
Train a VAE
• First, the input is encoded as a distribution over the latent space.
• Second, a point from the latent space is sampled from that distribution.
• Third, the sampled point is decoded and the reconstruction error can be computed.
• Finally, the reconstruction error is backpropagated through the network.
• Problem: the sampling step is not differentiable, so the error cannot be backpropagated through it as-is.
Solution: Reparametrization Trick
• $z \sim \mathcal{N}(\mu, \sigma^2) \;\rightarrow\; z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$
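A minimal VAE sketch in PyTorch, showing the reparametrization trick and the two terms of the (negative) variational lower bound; the layer sizes and the Gaussian decoder are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # predicts mu
        self.to_logvar = nn.Linear(256, latent_dim)  # predicts log(sigma^2)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and sigma instead of the sampling.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

x = torch.randn(64, 784)                    # stand-in batch of inputs
x_hat, mu, logvar = VAE()(x)
rec = ((x - x_hat) ** 2).sum(dim=1).mean()  # reconstruction term
kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
loss = rec + kl                             # negative ELBO
```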
Final VAE architecture
[Diagram: input $x$ → encoder → $\mu$, $\sigma$; $z = \mu + \sigma \cdot \epsilon$ with noise $\epsilon$; $z$ → decoder → reconstruction $\hat{x}$]
VAE applications
• Image generation and interpolation
Latent Perturbation
Each latent variable controls a different interpretable factor of variation.
Entanglement
• A problem is that many factors of variation might be entangled:
  • E.g. eye color and hair color might be correlated; they are captured “jointly”.
Entanglement
• Learning disentangled representations is still a challenging hot topic.
Generative Adversarial Networks (GANs)
Estimating Generative Models via an Adversarial Process
Stop modelling the distribution
• Idea: modelling a distribution $p_{model}$ can be difficult or intractable → learn only a tractable sample generation process.
• These are called implicit generative models. GANs fall into this category.
How do GANs work
Generative adversarial networks are based on a game, in the sense of game theory, between two machine learning models, typically implemented using neural networks.
The generator creates samples (fake images) that are intended to come from the same distribution as the training data.
The discriminator examines samples to determine whether they are real or fake.
During training, the generator tries to fool the discriminator into thinking the generated samples are real.
How GANs work: the Generator
The generator defines $p_{model}(x)$ implicitly: the generator is not necessarily able to evaluate the density function $p_{model}$; instead, it is able to draw samples from the distribution $p_{model}$.
The generator is defined by a prior distribution $p(z)$ over a vector $z$ that serves as input to the generator function $G(z; \theta^{(G)})$, where $\theta^{(G)}$ is a set of learnable parameters defining the generator's strategy in the game.
The input vector $z$ can be thought of as a source of randomness in an otherwise deterministic system, analogous to the seed of a pseudorandom number generator.
The prior distribution $p(z)$ is typically a relatively unstructured distribution, such as a high-dimensional Gaussian. Samples $z$ from this distribution are then just noise. The main role of the generator is to learn the function $G(z)$ that transforms such unstructured noise into realistic samples.
How GANs work: the Discriminator
The discriminator examines samples $x$ and returns some estimate $D(x; \theta^{(D)})$ of whether $x$ is real (drawn from the training distribution) or fake (drawn from $p_{model}$ by running the generator).
How GANs work: the Cost (loss)
Each player incurs a cost: $J^{(G)}(\theta^{(G)}, \theta^{(D)})$ for the generator and $J^{(D)}(\theta^{(G)}, \theta^{(D)})$ for the discriminator. Each player attempts to minimize its own cost.
The discriminator's cost encourages it to correctly classify data as real or fake, while the generator's cost encourages it to generate samples that the discriminator incorrectly classifies as real.
Many different specific formulations of these costs are possible (all performing roughly the same).
In the original version of GANs, $J^{(D)}$ was defined to be the negative log-likelihood that the discriminator assigns to the real-vs-fake labels given the input to the discriminator. In other words, the discriminator is trained just like a regular binary classifier.
The Generator Cost
The original work on GANs offered two versions of the cost for the generator.
• Min-max GAN loss: the generator tries to minimize the following function while the discriminator tries to maximize it:

$$\mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$$

The generator can't directly affect the $\log D(x)$ term, so, for the generator, minimizing the loss is equivalent to minimizing $\log(1 - D(G(z)))$.
• Non-Saturating GAN loss: a subtle variation of the standard loss function, where the generator maximizes the log of the discriminator probability, $\log D(G(z))$, i.e. it minimizes $-\log D(G(z))$.
This change is inspired by framing the problem from a different perspective, where the generator seeks to maximize the probability of images being detected as real, instead of minimizing the probability of an image being detected as fake.
This avoids generator saturation through a more stable weight update mechanism.
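A sketch of the two generator losses in PyTorch; `D` and `G` are assumed to be a discriminator outputting probabilities in (0, 1) and a generator, both hypothetical modules:

```python
import torch

def generator_loss(D, G, z, non_saturating=True):
    """Generator loss for a batch of noise vectors z."""
    fake_prob = D(G(z))  # discriminator's belief that the fakes are real
    if non_saturating:
        # Non-saturating loss: minimize -log D(G(z)).
        # Gradients are strongest exactly when D rejects the fakes.
        return -torch.log(fake_prob + 1e-8).mean()
    # Min-max loss: minimize log(1 - D(G(z))).
    # Saturates (vanishing gradients) when D(G(z)) is close to 0.
    return torch.log(1.0 - fake_prob + 1e-8).mean()
```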
GAN architecture
Training a GAN: Pseudocode
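A minimal sketch of the standard alternating training scheme from the original GAN paper; module and loader names are illustrative, and D is assumed to output one probability per image:

```python
import torch

def train_gan(G, D, data_loader, opt_G, opt_D, latent_dim, epochs=1):
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for real in data_loader:
            b = real.size(0)
            ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
            # --- Discriminator step: classify real vs. fake ---
            fake = G(torch.randn(b, latent_dim)).detach()  # no grad into G
            loss_D = bce(D(real), ones) + bce(D(fake), zeros)
            opt_D.zero_grad(); loss_D.backward(); opt_D.step()
            # --- Generator step: fool D (non-saturating loss) ---
            fake = G(torch.randn(b, latent_dim))
            loss_G = bce(D(fake), ones)  # minimize -log D(G(z))
            opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```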
Some Preliminary Results
From the paper: Goodfellow, Ian, et al. "Generative adversarial networks." Advances in Neural Information Processing Systems 27 (NIPS 2014).
Conditional Generative Adversarial Nets (cGANs)
GANs are very difficult to control: given the input $z$, there is no way to know what the resulting output will look like, because of the random sampling over the latent space.
The conditional version of generative adversarial nets was introduced to solve this problem by simply feeding additional information into both the generator and the discriminator.
In the first version this information was the class label $y$ (but other types of data can be used too, like images, text, etc.).
Network Architecture
Some Results
DCGAN – Deep Convolutional GAN
Paper: Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
Objective: bridge the gap between the success of CNNs for supervised learning (classification) and unsupervised learning (GANs).
DCGAN Model Architecture Improvements
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
DCGAN Generator Architecture
Using the DCGAN Discriminator as a Feature Extractor
One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.
To evaluate the quality of the representations learned by DCGANs for supervised tasks, the model is trained on Imagenet-1k and then the discriminator's convolutional features from all layers are used, max-pooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated into a single feature vector, and a regularized linear L2-SVM classifier is trained on top of them.
This achieves 82.8% accuracy on CIFAR-10. Since DCGAN was never trained on CIFAR-10, this experiment also demonstrates the domain robustness of the learned features.
Common Problems in GANs
Vanishing gradients: if the discriminator is too good, then generator training can fail due to vanishing gradients, because the generator will fail to fool the discriminator. In effect, an optimal discriminator doesn't provide enough information for the generator to make progress.
A solution: Wasserstein GAN
• The Wasserstein GAN, or WGAN for short, was introduced by Martin Arjovsky et al. in their 2017 paper [1].
[1] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. PMLR, 2017.
A solution: Wasserstein GAN
• The standard GAN loss is a special form of the Jensen–Shannon divergence (derived from the KL divergence) to compare two distributions. The Wasserstein distance (also known as Earth mover's distance, EMD) is a valid alternative because it provides a score of the similarity between two distributions (which are, in our case, the distribution of images generated by G and the distribution of real images):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]$$

• The first part of the equation represents the real data, while the second half represents the generated data. The discriminator aims to maximize the distance between the real data and the generated data, because it wants to successfully distinguish them. The generator aims to minimize the distance between the real data and the generated data, because it wants the generated data to be as real as possible.
A solution: Wasserstein GAN
• This distance is not followed by a sigmoid, so it does not saturate and returns a raw score in an unbounded range; for this reason, the WGAN discriminator is called a critic.
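A sketch of the WGAN losses in PyTorch; `critic` and `G` are hypothetical modules, and the weight-clipping constant 0.01 follows the paper's default for enforcing the Lipschitz constraint:

```python
import torch

def critic_loss(critic, G, real, z):
    # Critic maximizes E[D(x)] - E[D(G(z))]; we minimize the negation.
    return -(critic(real).mean() - critic(G(z).detach()).mean())

def generator_loss(critic, G, z):
    # Generator minimizes -E[D(G(z))], pushing fake scores up.
    return -critic(G(z)).mean()

def clip_weights(critic, c=0.01):
    # Crude enforcement of the Lipschitz constraint required by the
    # Wasserstein formulation (the approach used in the original paper).
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```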
StyleGAN – Adaptive Instance Normalization
StyleGAN makes use of adaptive instance normalization (AdaIN), which is commonly used when performing style transfer (i.e. applying the style of a target image to a source image).
AdaIN allows arbitrary style transfer in real time by aligning the mean and variance of the content features with those of the style features.
AdaIN receives a content input $x$ and a style input $y$, and simply aligns the channel-wise mean and variance of $x$ to match those of $y$:

$$AdaIN(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$

From the paper: Huang, Xun, and Serge Belongie. "Arbitrary style transfer in real-time with adaptive instance normalization." (2017).
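A direct PyTorch reading of the AdaIN formula (a sketch, not StyleGAN's exact implementation):

```python
import torch

def adain(x, y, eps=1e-5):
    """AdaIN(x, y): align channel-wise mean/std of content x to style y.
    x, y: tensors of shape (N, C, H, W)."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)          # per-channel stats
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps  # eps avoids div by 0
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True)
    # sigma(y) * (x - mu(x)) / sigma(x) + mu(y)
    return sigma_y * (x - mu_x) / sigma_x + mu_y
```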
StyleGAN results
Style Mixing with StyleGAN
During training, a given percentage of images are generated using two random latent codes instead of one. In order to do so, the model simply switches from one latent code to the other (an operation referred to as style mixing) at a randomly selected point in the generator.
This regularization technique prevents the network from assuming that adjacent styles are correlated.
Truncation trick in StyleGAN
One of the challenges in generative models is dealing with areas that are poorly represented in the training data. The generator isn't able to learn them and create images that resemble them (and instead creates bad-looking images). To avoid generating poor images, StyleGAN truncates the intermediate vector $w$, forcing it to stay close to the “average” intermediate vector.
After training the model, an “average” $w_{avg}$ is produced by selecting many random inputs, generating their intermediate vectors with the mapping network, and calculating the mean of these vectors. When generating new images, instead of using the mapping network output directly, $w$ is transformed into $w_{new} = w_{avg} + \psi(w - w_{avg})$, where the value of $\psi$ defines how far the image can be from the “average” image (and how diverse the output can be).
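A sketch of the truncation trick; `mapping` is a hypothetical trained mapping network from $z$ to $w$, and the $\psi$ default is illustrative:

```python
import torch

def compute_w_avg(mapping, latent_dim, n_samples=10_000):
    # Estimate the "average" intermediate vector from many random inputs.
    z = torch.randn(n_samples, latent_dim)
    return mapping(z).mean(dim=0, keepdim=True)

def truncate(w, w_avg, psi=0.7):
    # w_new = w_avg + psi * (w - w_avg): psi < 1 pulls w toward the
    # average, trading output diversity for image quality.
    return w_avg + psi * (w - w_avg)
```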
Image-to-Image Translation
• When, in a cGAN, the condition is an image.
• Definition: image-to-image translation refers to a constrained synthesis task of automatically transforming an input image into a synthetic image, or mapping an input image to the desired output image.
Example of Generator Architecture for Im2Im
Goal: mapping a high-resolution input grid (an image) to a high-resolution output grid.
Solution: an encoder-decoder network. The input is passed through a series of layers that progressively downsample (encoder), until a bottleneck layer, at which point the process is reversed (decoder).
Paired vs Unpaired Datasets
• Paired dataset: for each input sample we have the corresponding ground truth (example: edges → shoes).
• Unpaired dataset: for each input sample we DO NOT have the corresponding ground truth (example: photo → painting).
Pix2Pix
Fundamental paper for Im2Im: Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
First work that investigated conditional adversarial networks as a general-purpose solution (not tailored to a specific Im2Im task) to image-to-image translation problems.
Objective: learn a mapping from an observed image $x$ and a random noise vector $z$ to $y$: $G : \{x, z\} \rightarrow y$. Pix2Pix works with paired datasets.
Pix2Pix – Training
The model is trained combining a conditional adversarial loss $\mathcal{L}_{cGAN}$ and a pixel loss $\mathcal{L}_{L1}$.
$\mathcal{L}_{cGAN}$ serves the purpose of obtaining a generator G capable of producing outputs that cannot be distinguished from “real” images by an adversarially trained discriminator D, which is in turn trained with $\mathcal{L}_{cGAN}$ to do as well as possible at detecting the generator's “fakes”:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$$

$\mathcal{L}_{L1}$ is used to push the output of G near the ground truth:

$$\mathcal{L}_{L1} = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1]$$

The two losses are combined with a hyperparameter $\lambda$, typically set to 10:

$$\mathcal{L} = \mathcal{L}_{cGAN} + \lambda\,\mathcal{L}_{L1}$$
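A sketch of the combined Pix2Pix generator objective in PyTorch; `D` is assumed to be a conditional discriminator returning probabilities, `G` a generator taking the input image and noise, and the non-saturating adversarial term is a common substitution for the min-max form:

```python
import torch

def pix2pix_generator_loss(D, G, x, y, z, lam=10.0, eps=1e-8):
    fake = G(x, z)
    # Adversarial term: fool the conditional discriminator D(x, fake).
    adv = -torch.log(D(x, fake) + eps).mean()
    # L1 term: keep the output close to the ground truth y.
    l1 = (y - fake).abs().mean()
    return adv + lam * l1  # L = L_cGAN + lambda * L_L1
```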
Pix2Pix – Generator Model
In the problem tackled by Pix2Pix, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
Pix2Pix – Discriminator Model (PatchGAN)
The $\mathcal{L}_{L1}$ loss fails to encourage high-frequency crispness, but accurately captures the low frequencies.
This motivates restricting the GAN discriminator to only model high-frequency structure, relying on the L1 term to force low-frequency correctness. In order to model high frequencies, it is sufficient to restrict the attention to the structure in local image patches.
Therefore, the discriminator architecture is designed to only penalize structure at the scale of patches. This discriminator tries to classify whether each N × N patch (70 × 70 in the original paper) in an image is real or fake. The discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of D.
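A minimal PatchGAN-style discriminator sketch in PyTorch; the layer widths are illustrative, and the receptive field of the final convolution plays the role of the N × N patch:

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: one real/fake score per patch,
    averaged into a single output, instead of one global score."""
    def __init__(self, in_channels=6):  # condition + image, concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 1, 4, padding=1),  # grid of patch scores
        )

    def forward(self, x):
        # Average the per-patch responses to get the final output of D.
        return self.net(x).mean(dim=(1, 2, 3))
```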
Pix2Pix – Results
CycleGAN – Unpaired Im2Im
First work on unpaired image-to-image translation: Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
CycleGAN – Architecture
Goal: learn a mapping $G : A \rightarrow B$ such that the distribution of images from $G(A)$ is indistinguishable from the distribution $B$, using an adversarial loss.
CycleGAN – Cycle Consistency Loss
It forces the output of generator G to be reversed back to the original input using generator F (and the other way around):

$$\mathbb{E}_a[\| a - F(G(a)) \|_1] + \mathbb{E}_b[\| b - G(F(b)) \|_1]$$
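A sketch of the cycle consistency term, assuming two generators `G` (A→B) and `F` (B→A) and batches of images `a`, `b` as PyTorch tensors:

```python
def cycle_consistency_loss(G, F, a, b):
    # a -> G(a) -> F(G(a)) should recover a, and symmetrically for b.
    loss_a = (a - F(G(a))).abs().mean()  # L1 over the A-side cycle
    loss_b = (b - G(F(b))).abs().mean()  # L1 over the B-side cycle
    return loss_a + loss_b
```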
CycleGAN – Some more details
The two G-D couples in CycleGAN are pretty similar to those of Pix2Pix: the generator is a state-of-the-art convolutional generator (e.g. ResNet-based), while the discriminator is a PatchGAN.
Additionally, a buffer of 50 generated images is used to update the discriminator models, instead of the freshly generated images in the current minibatch.
The models are trained with Adam and a small learning rate for 100 epochs, then for a further 100 epochs with learning rate decay. The models are updated after each image, i.e. with a batch size of 1.
Multi-Domain Image-to-Image Translation
• Goal: develop a system that is able to learn multiple mapping functions, one for each domain (example: multiple face attributes).
• Problem: following the CycleGAN pattern, having $k$ domains would result in using $k(k-1)$ generators.
StarGAN
• Multi-domain image-to-image generation.
• It uses only one generator network to generate multiple domains.
• G is conditioned on the target domain $y$ (defined by a one-hot vector).
• D produces probability distributions over sources (real/fake) and domain labels.
StarGAN – Loss Functions
Adversarial Loss: to make the generated images indistinguishable from real images, an adversarial loss is adopted:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))]$$

where $c$ is the target domain and $D_{src}$ is the adversarial output of the discriminator. The generator G tries to minimize this objective, while the discriminator D tries to maximize it.
Domain Classification Loss: for a given input image $x$ and a target domain label $c$, the goal is to translate $x$ into an output image $y$ which is properly classified to the target domain $c$. For this reason, an auxiliary classifier $D_{cls}$ is added on top of the discriminator D and trained on real images:

$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c' \mid x)]$$

where $D_{cls}(c' \mid x)$ represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image $x$ to its corresponding original domain $c'$. On the other hand, the loss function for the domain classification of fake images is defined as:

$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c \mid G(x, c))]$$

In other words, G tries to minimize this objective to generate images that can be classified as the target domain $c$.
StarGAN – Loss Functions
Reconstruction Loss: by minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to the correct target domain. However, minimizing those losses does not guarantee that translated images preserve the content of their input images while changing only the domain-related part of the inputs. To alleviate this problem, a cycle consistency loss is applied to the generator, defined as:

$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\| x - G(G(x, c), c') \|_1]$$

where G takes in the translated image $G(x, c)$ and the original domain label $c'$ as input and tries to reconstruct the original image $x$. Note that the same generator is used twice: first to translate an original image into an image in the target domain, and then to reconstruct the original image from the translated image.
StarGAN – Training
Overall losses:

$$\mathcal{L}_D = -\mathcal{L}_{GAN} + \lambda_{cls}\,\mathcal{L}_{cls}^{r}$$

$$\mathcal{L}_G = \mathcal{L}_{GAN} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{rec}\,\mathcal{L}_{rec}$$
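A sketch of both overall objectives in PyTorch; `D` is assumed to return a pair (real/fake probability, domain-label logits), `G` to take integer domain labels, and the loss weights are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x, c, c_orig, lam_cls=1.0, lam_rec=10.0, eps=1e-8):
    """c: target domain labels; c_orig: original domain labels of x."""
    fake = G(x, c)
    # --- Discriminator: L_D = -L_GAN + lambda_cls * L_cls^r ---
    src_real, cls_real = D(x)
    src_fake, _ = D(fake.detach())
    l_gan_d = torch.log(src_real + eps).mean() \
              + torch.log(1 - src_fake + eps).mean()
    loss_D = -l_gan_d + lam_cls * F.cross_entropy(cls_real, c_orig)
    # --- Generator: L_G = L_GAN + lambda_cls * L_cls^f + lambda_rec * L_rec ---
    src_fake_g, cls_fake_g = D(fake)
    l_gan_g = torch.log(1 - src_fake_g + eps).mean()  # only term G affects
    l_rec = (x - G(fake, c_orig)).abs().mean()        # translate back
    loss_G = l_gan_g + lam_cls * F.cross_entropy(cls_fake_g, c) + lam_rec * l_rec
    return loss_D, loss_G
```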
Few-Shot Image-to-Image Translation (FUNIT)
The training set consists of images of various object classes (source classes). The model is trained to translate images between these source object classes.
During inference (deployment), very few images of the target class are shown to the trained model; these are sufficient to translate images of source classes to analogous images of the target class, even though the model has never seen a single image from the target class during training.
Note that the FUNIT generator takes two inputs: 1) a content image and 2) a set of target class images. It aims to generate a translation of the input image that resembles images of the target class.
FUNIT Framework
The FUNIT framework consists of a conditional image generator G and a multi-task adversarial discriminator D.
The generator G simultaneously takes a content image $x$ and a set of K class images $\{y_1, \ldots, y_K\}$ as input and produces the output image $\bar{x}$ via $\bar{x} = G(x, \{y_1, \ldots, y_K\})$.
The content image belongs to object class $c_x$, while each of the K class images belongs to object class $c_y$. In general, K is a small number and $c_x$ is different from $c_y$.
FUNIT Generator
The few-shot image generator G consists of a content encoder $E_x$, a class encoder $E_y$, and a decoder $F_x$.
The content encoder maps the input content image $x$ to a content latent code $z_x$, which is a spatial feature map. On the other side, the class encoder maps each of the K individual class images $\{y_1, \ldots, y_K\}$ to an intermediate latent vector and then computes the mean of the intermediate latent vectors to obtain the final class latent code $z_y$.
FUNIT Generator
The decoder consists of several adaptive instance normalization (AdaIN) residual blocks followed by a couple of upscale convolutional layers. The AdaIN residual block is a residual block using AdaIN as the normalization layer.
For each sample, AdaIN first normalizes the activations of the sample in each channel to have zero mean and unit variance. It then scales the activations using a learned affine transformation consisting of a set of scalars and biases. Note that the affine transformation is spatially invariant and hence can only be used to obtain global appearance information.
FUNIT Discriminator
The discriminator D is trained to solve multiple adversarial classification tasks simultaneously. Each task is a binary classification task determining whether an input image is a real image of the source class or a translation output coming from G. As there are $|\mathbb{S}|$ source classes, D produces $|\mathbb{S}|$ outputs.
FUNIT Losses
FUNIT is trained by solving a minimax optimization problem, as usual:

$$\min_D \max_G \; \mathcal{L}_{GAN}(D, G) + \lambda_R \mathcal{L}_R(G) + \lambda_F \mathcal{L}_{FM}(G)$$

Adversarial Loss:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_x[\log D^{c_x}(x)] + \mathbb{E}_{x,\{y_1,\ldots,y_K\}}[\log(1 - D^{c_y}(\bar{x}))]$$

The superscript attached to D denotes the object class; the loss is computed using only the corresponding binary prediction score of the class.
Content reconstruction loss: when using the same image for both the input content image and the input class image (in this case K = 1), the loss encourages G to generate an output image identical to the input:

$$\mathcal{L}_R(G) = \mathbb{E}_x[\| x - G(x, \{x\}) \|]$$
FUNIT Losses
Feature matching loss: first, a feature extractor, referred to as $D_f$, is constructed by removing the last (prediction) layer from D. Then, $D_f$ is used to extract features from the translation output $\bar{x}$ and the class images $y_k$, minimizing:

$$\mathcal{L}_F(G) = \mathbb{E}_{x,\{y_1,\ldots,y_K\}}\left[ \left\| D_f(\bar{x}) - \sum_k \frac{D_f(y_k)}{K} \right\| \right]$$
FUNIT Results
StarGAN v2
From the paper: Choi, Yunjey, et al. "StarGAN v2: Diverse image synthesis for multiple domains." (2020).
StarGAN v2 – Losses
During training, a latent code $z \in \mathcal{Z}$ and a target domain $\tilde{y} \in \mathcal{Y}$ are randomly sampled, and a target style code $\tilde{s} = F_{\tilde{y}}(z)$ is generated.
Adversarial loss:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x,y}[\log D_y(x)] + \mathbb{E}_{x,\tilde{y},z}[\log(1 - D_{\tilde{y}}(G(x, \tilde{s})))]$$

The mapping network $F$ learns to provide the style code $\tilde{s}$ that is likely in the target domain $\tilde{y}$, and the generator $G$ learns to utilize $\tilde{s}$ and generate an image $G(x, \tilde{s})$ that is indistinguishable from real images of the domain $\tilde{y}$.
StarGAN v2 – Losses
Style Reconstruction: used in order to enforce the generator $G$ to utilize the style code $\tilde{s}$ when generating the image $G(x, \tilde{s})$:

$$\mathcal{L}_{sty} = \mathbb{E}_{x,\tilde{y},z}[\| \tilde{s} - E_{\tilde{y}}(G(x, \tilde{s})) \|]$$

This is similar to other approaches which employ multiple encoders to learn a mapping from an image to its latent code. The notable difference is that here a single encoder $E$ is trained to encourage diverse outputs for multiple domains. At test time, the learned encoder $E$ allows $G$ to transform an input image, reflecting the style of a reference image.
StarGAN v2 – Losses
Style diversification: employed to further enable the generator $G$ to produce diverse images (this loss is maximized):

$$\mathcal{L}_{ds} = \mathbb{E}_{x,\tilde{y},z_1,z_2}[\| G(x, \tilde{s}_1) - G(x, \tilde{s}_2) \|]$$

The target style codes $\tilde{s}_1$ and $\tilde{s}_2$ are produced by $F$ conditioned on two random latent codes $z_1$ and $z_2$.
Cycle Consistency loss: used to guarantee that the generated image $G(x, \tilde{s})$ properly preserves the domain-invariant characteristics (e.g. pose) of its input image $x$:

$$\mathcal{L}_{cyc} = \mathbb{E}_{x,y,\tilde{y},z}[\| x - G(G(x, \tilde{s}), \hat{s}) \|]$$

where $\hat{s} = E_y(x)$ is the estimated style code of the input image $x$.
StarGAN v2 – Results
Just the Tip of the Iceberg
A vast literature on generative models exists. New architectures are currently being developed:
• Vision Transformers
• Diffusion models