06 DLEA Generative Models
Deep learning and generative models – Prof. A. Prati, Prof. T. Fontanini, University of Parma
Supervised vs Unsupervised
Supervised Learning Unsupervised Learning
Discriminative models vs generative models
Is this an apple? vs What is an apple?
Example: classify an animal as cat or dog
• Discriminative Model: find a decision boundary that separates cats and dogs. Then, check on which side of the decision boundary the new animal falls.
• Generative Model: build models of what cats and dogs are like, respectively. Then, match the new animal against the cat model and the dog model.
Discriminative models vs generative models
More formally…
• Discriminative model: conditional probability distribution $P(y \mid x)$. A discriminative model tells you how likely a label $y$ applies to an instance of input data $x$.
Some Generative Tasks
Creating art
Some Generative Tasks
Generate artificial images (like human faces)
Autoencoders
Learning a low-dimensional feature representation
Reducing the Dimensionality of Data
Autoencoders are composed of two sections: an Encoder and a Decoder.
• Encoder: transforms the high-dimensional data into a low-dimensional code $z$ called the latent space (a nonlinear generalization of PCA).
• Decoder: reconstructs the data from the code.
Training an Autoencoder
• The objective is to minimize the discrepancy between the original data and its reconstruction:

$$\mathcal{L}_{rec} = \| x - AE(x) \|^2$$

where $x$ is the input data and $AE(x)$ is the autoencoder output.
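A minimal sketch of this objective in PyTorch; the layer sizes and the stand-in data are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: code z -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)          # stand-in batch of flattened images
x_hat = model(x)
loss = ((x - x_hat) ** 2).mean()  # L_rec = ||x - AE(x)||^2
opt.zero_grad(); loss.backward(); opt.step()
```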
Compression is not Lossless
• Ideally, we would like an autoencoder to perfectly reconstruct the input.
• Nevertheless, the latent space represents a “bottleneck”: the smaller the latent space, the more information is lost.
Visualizing the Latent Space
Applications of Autoencoders
• Feature Extraction (Image Retrieval):
  • The latent space is used to find the most similar images in the dataset.
  • Compression removes “useless” data (such as the hat in the example).
• Denoising:
  • Corrupted data is used as input and the autoencoder's role is to recover the original data.
  • Compression extracts the “meaningful” data which represents the distribution, removing useless data like noise.
Generate New Data?
Problem
• Autoencoders map an input $x$ into a latent vector $z$, which represents a point in the latent space, not a real distribution → lack of interpretable structure.
• We are still not able to generate new data.
Why?
• Kind of expected: we did not enforce any organization of the latent space.
• It is a different way of seeing the overfitting problem:
  • It is possible to generate a sample, given $x$.
  • It is possible to generate a sample by interpolating two latent vectors.
  • It is not possible to generate if we deviate from that.
Variational Autoencoders
Introducing Stochastic Variational Inference
What is a Variational Autoencoder (VAE)?
A VAE is an autoencoder optimized so that its latent space can be used in a generative process.
Math Notation
• $X = \{x^{(i)}\}_{i=1}^{N}$ is the dataset consisting of $N$ samples of some variable $x$.
• $z$ is an unobserved continuous random variable from which the data is generated.
• $q_\phi(z \mid x)$ is a probabilistic encoder that, given a datapoint $x$, produces a distribution (e.g. a Gaussian) over the possible values of the code $z$ from which the datapoint $x$ could have been generated.
• $p_\theta(x \mid z)$ is a probabilistic decoder that, given a code $z$, produces a distribution over the possible corresponding values of $x$.
Compute $z$ from $x$ (using probabilities)
We want to calculate $p(z \mid x)$.
If we apply Bayes' rule:

$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} = \frac{p(x, z)}{p(x)}$$
KL divergence
KL divergence is a measure of the difference between two probability distributions. Therefore, we can use it to ensure $q(z \mid x)$ is similar to $p(z \mid x)$, by minimizing:

$$\min \; KL(q(z \mid x) \parallel p(z \mid x))$$

Remember: $p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} = \frac{p(x, z)}{p(x)}$

$$KL(q(z \mid x) \parallel p(z \mid x)) = -\sum_z q(z \mid x) \log \frac{p(z \mid x)}{q(z \mid x)}$$

$$= -\sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)\,p(x)} = -\sum_z q(z \mid x) \left[ \log \frac{p(x, z)}{q(z \mid x)} - \log p(x) \right]$$

$$= -\sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)} + \log p(x) \underbrace{\sum_z q(z \mid x)}_{=1}$$
Variational Lower Bound

$$\underbrace{\log p(x)}_{\text{a constant}} = \underbrace{KL(q(z \mid x) \parallel p(z \mid x))}_{\text{we wanted to minimize this}} + \underbrace{\sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)}}_{\mathcal{L}}$$

Instead of minimizing the KL, we can maximize $\mathcal{L}$.
$\mathcal{L}$ is called the Variational Lower Bound. This is because the KL is always positive, therefore $\mathcal{L} \leq \log p(x)$, and when $\mathcal{L}$ is maximized, $\log p(x)$ is pushed up as well.
Final Objective: Derivation from the Lower Bound

$$\mathcal{L} = \sum_z q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)} = \sum_z q(z \mid x) \log \frac{p(x \mid z)\,p(z)}{q(z \mid x)}$$

$$= \sum_z q(z \mid x) \left[ \log p(x \mid z) + \log \frac{p(z)}{q(z \mid x)} \right]$$

$$= \underbrace{\sum_z q(z \mid x) \log p(x \mid z)}_{\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]} \;+\; \underbrace{\sum_z q(z \mid x) \log \frac{p(z)}{q(z \mid x)}}_{-KL(q(z \mid x) \,\parallel\, p(z))}$$
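For the common choices $q(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ and $p(z) = \mathcal{N}(0, I)$, the KL term above has a well-known closed form (a standard result, not derived on these slides):

$$KL(q(z \mid x) \parallel p(z)) = -\frac{1}{2}\sum_{j}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$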
Turn probabilities into neural networks
Train a VAE
• First, the input is encoded as a distribution over the latent space.
• Second, a point from the latent space is sampled from that distribution.
• Third, the sampled point is decoded and the reconstruction error can be computed.
• Finally, the reconstruction error is backpropagated through the network.
• Problem: the sampling step is not differentiable, so the error cannot be backpropagated through it as-is.
Solution: Reparametrization Trick
• $z \sim \mathcal{N}(\mu, \sigma^2) \;\rightarrow\; z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$
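A minimal VAE sketch in PyTorch, showing the reparametrization trick and the two terms of the (negative) variational lower bound; the layer sizes and the Gaussian decoder are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # predicts mu
        self.to_logvar = nn.Linear(256, latent_dim)  # predicts log(sigma^2)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and sigma instead of the sampling.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

x = torch.randn(64, 784)                    # stand-in batch of inputs
x_hat, mu, logvar = VAE()(x)
rec = ((x - x_hat) ** 2).sum(dim=1).mean()  # reconstruction term
kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
loss = rec + kl                             # negative ELBO
```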
Final VAE architecture
[Diagram: input $x$ → encoder → $\mu$, $\sigma$; $z = \mu + \sigma \cdot \epsilon$ with noise $\epsilon$; $z$ → decoder → reconstruction $\hat{x}$]
VAE applications
• Image generation and interpolation
Latent Perturbation
Each latent variable controls a different interpretable factor of variation.
Entanglement
• A problem is that many factors of variation might be entangled:
  • E.g. eye color and hair color might be correlated; they are captured “jointly”.
Entanglement
• Learning disentangled representations is still a challenging hot topic.
Generative Adversarial Networks (GANs)
Estimating Generative Models via an Adversarial Process
Stop modelling the distribution
• Idea: modelling a distribution $p_{model}$ can be difficult or intractable → learn only a tractable sample generation process.
• These are called implicit generative models. GANs fall into this category.
How do GANs work
Generative adversarial networks are based on a game, in the sense of game theory, between two machine learning models, typically implemented using neural networks.
The generator creates samples (fake images) that are intended to come from the same distribution as the training data.
The discriminator examines samples to determine whether they are real or fake.
During training, the generator tries to fool the discriminator into thinking the generated samples are real.
How GANs work: the Generator
The generator defines $p_{model}(x)$ implicitly: the generator is not necessarily able to evaluate the density function $p_{model}$; instead, it is able to draw samples from the distribution $p_{model}$.
The generator is defined by a prior distribution $p(z)$ over a vector $z$ that serves as input to the generator function $G(z; \theta^{(G)})$, where $\theta^{(G)}$ is a set of learnable parameters defining the generator's strategy in the game.
The input vector $z$ can be thought of as a source of randomness in an otherwise deterministic system, analogous to the seed of a pseudorandom number generator.
The prior distribution $p(z)$ is typically a relatively unstructured distribution, such as a high-dimensional Gaussian. Samples $z$ from this distribution are then just noise. The main role of the generator is to learn the function $G(z)$ that transforms such unstructured noise into realistic samples.
How GANs work: the Discriminator
The discriminator examines samples $x$ and returns some estimate $D(x; \theta^{(D)})$ of whether $x$ is real (drawn from the training distribution) or fake (drawn from $p_{model}$ by running the generator).
How GANs work: the Cost (loss)
Each player incurs a cost: $J^{(G)}(\theta^{(G)}, \theta^{(D)})$ for the generator and $J^{(D)}(\theta^{(G)}, \theta^{(D)})$ for the discriminator. Each player attempts to minimize its own cost.
The discriminator's cost encourages it to correctly classify data as real or fake, while the generator's cost encourages it to generate samples that the discriminator incorrectly classifies as real.
Many different specific formulations of these costs are possible (all performing roughly the same).
In the original version of GANs, $J^{(D)}$ was defined to be the negative log-likelihood that the discriminator assigns to the real-vs-fake labels given the input to the discriminator. In other words, the discriminator is trained just like a regular binary classifier.
The Generator Cost
The original work on GANs offered two versions of the cost for the generator.
• Min-max GAN loss: the generator tries to minimize the following function while the discriminator tries to maximize it:

$$\mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$$

The generator can't directly affect the $\log D(x)$ term, so, for the generator, minimizing the loss is equivalent to minimizing $\log(1 - D(G(z)))$.
• Non-Saturating GAN loss: a subtle variation of the standard loss function, where the generator maximizes the log of the discriminator probability, $\log D(G(z))$, i.e. it minimizes $-\log D(G(z))$.
This change is inspired by framing the problem from a different perspective, where the generator seeks to maximize the probability of images being detected as real, instead of minimizing the probability of an image being detected as fake.
This avoids generator saturation through a more stable weight update mechanism.
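A sketch of the two generator losses in PyTorch; `D` and `G` are assumed to be a discriminator outputting probabilities in (0, 1) and a generator, both hypothetical modules:

```python
import torch

def generator_loss(D, G, z, non_saturating=True):
    """Generator loss for a batch of noise vectors z."""
    fake_prob = D(G(z))  # discriminator's belief that the fakes are real
    if non_saturating:
        # Non-saturating loss: minimize -log D(G(z)).
        # Gradients are strongest exactly when D rejects the fakes.
        return -torch.log(fake_prob + 1e-8).mean()
    # Min-max loss: minimize log(1 - D(G(z))).
    # Saturates (vanishing gradients) when D(G(z)) is close to 0.
    return torch.log(1.0 - fake_prob + 1e-8).mean()
```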
GAN architecture
Training a GAN: Pseudocode
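A minimal sketch of the standard alternating training scheme from the original GAN paper; module and loader names are illustrative, and D is assumed to output one probability per image:

```python
import torch

def train_gan(G, D, data_loader, opt_G, opt_D, latent_dim, epochs=1):
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for real in data_loader:
            b = real.size(0)
            ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
            # --- Discriminator step: classify real vs. fake ---
            fake = G(torch.randn(b, latent_dim)).detach()  # no grad into G
            loss_D = bce(D(real), ones) + bce(D(fake), zeros)
            opt_D.zero_grad(); loss_D.backward(); opt_D.step()
            # --- Generator step: fool D (non-saturating loss) ---
            fake = G(torch.randn(b, latent_dim))
            loss_G = bce(D(fake), ones)  # minimize -log D(G(z))
            opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```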
Some Preliminary Results
From the paper: Goodfellow, Ian, et al. "Generative adversarial networks." Advances in Neural Information Processing Systems 27 (NIPS 2014).
Conditional Generative Adversarial Nets (cGANs)
GANs are very difficult to control: given the input $z$, there is no way to know what the resulting output will look like, because of the random sampling over the latent space.
The conditional version of generative adversarial nets was introduced to solve this problem by simply feeding additional information into both the generator and the discriminator.
In the first version this information was the class label $y$ (but other types of data can be used too, like images, text, etc.).
Network Architecture
Some Results
DCGAN – Deep Convolutional GAN
Paper: Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
Objective: bridge the gap between the success of CNNs for supervised learning (classification) and unsupervised learning (GANs).
DCGAN Model Architecture Improvements
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
DCGAN Generator Architecture
Using the DCGAN Discriminator as a Feature Extractor
One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.
To evaluate the quality of the representations learned by DCGANs for supervised tasks, the model is trained on Imagenet-1k and then the discriminator's convolutional features from all layers are used, max-pooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated into a single feature vector, and a regularized linear L2-SVM classifier is trained on top of them.
This achieves 82.8% accuracy on CIFAR-10. Since DCGAN was never trained on CIFAR-10, this experiment also demonstrates the domain robustness of the learned features.
Common Problems in GANs
Vanishing gradients: if the discriminator is too good, then generator training can fail due to vanishing gradients, because the generator will fail to fool the discriminator. In effect, an optimal discriminator doesn't provide enough information for the generator to make progress.
A solution: Wasserstein GAN
• The Wasserstein GAN, or WGAN for short, was introduced by Martin Arjovsky et al. in their 2017 paper [1].
[1] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. PMLR, 2017.
A solution: Wasserstein GAN
• The standard GAN loss is a special form of the Jensen–Shannon divergence (derived from the KL divergence) to compare two distributions. The Wasserstein distance (also known as Earth mover's distance, EMD) is a valid alternative because it provides a score of the similarity between two distributions (which are, in our case, the distribution of images generated by G and the distribution of real images):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]$$

• The first part of the equation represents the real data, while the second half represents the generated data. The discriminator aims to maximize the distance between the real data and the generated data, because it wants to successfully distinguish them. The generator aims to minimize the distance between the real data and the generated data, because it wants the generated data to be as real as possible.
A solution: Wasserstein GAN
• This distance is not followed by a sigmoid, so it does not saturate and returns a raw score in an unbounded range; for this reason, the WGAN discriminator is called a critic.
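A sketch of the WGAN losses in PyTorch; `critic` and `G` are hypothetical modules, and the weight-clipping constant 0.01 follows the paper's default for enforcing the Lipschitz constraint:

```python
import torch

def critic_loss(critic, G, real, z):
    # Critic maximizes E[D(x)] - E[D(G(z))]; we minimize the negation.
    return -(critic(real).mean() - critic(G(z).detach()).mean())

def generator_loss(critic, G, z):
    # Generator minimizes -E[D(G(z))], pushing fake scores up.
    return -critic(G(z)).mean()

def clip_weights(critic, c=0.01):
    # Crude enforcement of the Lipschitz constraint required by the
    # Wasserstein formulation (the approach used in the original paper).
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```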
StyleGAN – Adaptive Instance Normalization
StyleGAN makes use of adaptive instance normalization (AdaIN), which is commonly used when performing style transfer (i.e. applying the style of a target image to a source image).
AdaIN allows arbitrary style transfer in real time by aligning the mean and variance of the content features with those of the style features.
AdaIN receives a content input $x$ and a style input $y$, and simply aligns the channel-wise mean and variance of $x$ to match those of $y$:

$$AdaIN(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$

From the paper: Huang, Xun, and Serge Belongie. "Arbitrary style transfer in real-time with adaptive instance normalization." (2017).
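A direct PyTorch reading of the AdaIN formula (a sketch, not StyleGAN's exact implementation):

```python
import torch

def adain(x, y, eps=1e-5):
    """AdaIN(x, y): align channel-wise mean/std of content x to style y.
    x, y: tensors of shape (N, C, H, W)."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)          # per-channel stats
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps  # eps avoids div by 0
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True)
    # sigma(y) * (x - mu(x)) / sigma(x) + mu(y)
    return sigma_y * (x - mu_x) / sigma_x + mu_y
```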
StyleGAN results
Style Mixing with StyleGAN
During training, a given percentage of images are generated using two random latent codes instead of one. In order to do so, the model simply switches from one latent code to the other (an operation referred to as style mixing) at a randomly selected point in the generator.
This regularization technique prevents the network from assuming that adjacent styles are correlated.
Truncation trick in StyleGAN
One of the challenges in generative models is dealing with areas that are poorly represented in the training data. The generator isn't able to learn them and create images that resemble them (and instead creates bad-looking images). To avoid generating poor images, StyleGAN truncates the intermediate vector $w$, forcing it to stay close to the “average” intermediate vector.
After training the model, an “average” $w_{avg}$ is produced by selecting many random inputs, generating their intermediate vectors with the mapping network, and calculating the mean of these vectors. When generating new images, instead of using the mapping network output directly, $w$ is transformed into $w_{new} = w_{avg} + \psi(w - w_{avg})$, where the value of $\psi$ defines how far the image can be from the “average” image (and how diverse the output can be).
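A sketch of the truncation trick; `mapping` is a hypothetical trained mapping network from $z$ to $w$, and the $\psi$ default is illustrative:

```python
import torch

def compute_w_avg(mapping, latent_dim, n_samples=10_000):
    # Estimate the "average" intermediate vector from many random inputs.
    z = torch.randn(n_samples, latent_dim)
    return mapping(z).mean(dim=0, keepdim=True)

def truncate(w, w_avg, psi=0.7):
    # w_new = w_avg + psi * (w - w_avg): psi < 1 pulls w toward the
    # average, trading output diversity for image quality.
    return w_avg + psi * (w - w_avg)
```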
Image-to-Image Translation
• When, in a cGAN, the condition is an image.
• Definition: image-to-image translation refers to a constrained synthesis task of automatically transforming an input image into a synthetic image, or mapping an input image to the desired output image.
Example of Generator Architecture for Im2Im
Goal: mapping a high-resolution input grid (an image) to a high-resolution output grid.
Solution: an encoder-decoder network. The input is passed through a series of layers that progressively downsample (encoder), until a bottleneck layer, at which point the process is reversed (decoder).
Paired vs Unpaired Datasets
• Paired dataset: for each input sample we have the corresponding ground truth (example: edges → shoes).
• Unpaired dataset: for each input sample we DO NOT have the corresponding ground truth (example: photo → painting).
Pix2Pix
Fundamental paper for Im2Im: Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
First work that investigated conditional adversarial networks as a general-purpose solution (not tailored to a specific Im2Im task) to image-to-image translation problems.
Objective: learn a mapping from an observed image $x$ and a random noise vector $z$ to $y$: $G : \{x, z\} \rightarrow y$. Pix2Pix works with paired datasets.
Pix2Pix – Training
The model is trained combining a conditional adversarial loss $\mathcal{L}_{cGAN}$ and a pixel loss $\mathcal{L}_{L1}$.
$\mathcal{L}_{cGAN}$ serves the purpose of obtaining a generator G capable of producing outputs that cannot be distinguished from “real” images by an adversarially trained discriminator D, which is in turn trained with $\mathcal{L}_{cGAN}$ to do as well as possible at detecting the generator's “fakes”:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$$

$\mathcal{L}_{L1}$ is used to push the output of G near the ground truth:

$$\mathcal{L}_{L1} = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1]$$

The two losses are combined with a hyperparameter $\lambda$, typically set to 10:

$$\mathcal{L} = \mathcal{L}_{cGAN} + \lambda\,\mathcal{L}_{L1}$$
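A sketch of the combined Pix2Pix generator objective in PyTorch; `D` is assumed to be a conditional discriminator returning probabilities, `G` a generator taking the input image and noise, and the non-saturating adversarial term is a common substitution for the min-max form:

```python
import torch

def pix2pix_generator_loss(D, G, x, y, z, lam=10.0, eps=1e-8):
    fake = G(x, z)
    # Adversarial term: fool the conditional discriminator D(x, fake).
    adv = -torch.log(D(x, fake) + eps).mean()
    # L1 term: keep the output close to the ground truth y.
    l1 = (y - fake).abs().mean()
    return adv + lam * l1  # L = L_cGAN + lambda * L_L1
```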
Pix2Pix – Generator Model
In the problem tackled by Pix2Pix, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
Pix2Pix – Discriminator Model (PatchGAN)
The $\mathcal{L}_{L1}$ loss fails to encourage high-frequency crispness, but accurately captures the low frequencies.
This motivates restricting the GAN discriminator to only model high-frequency structure, relying on the L1 term to force low-frequency correctness. In order to model high frequencies, it is sufficient to restrict the attention to the structure in local image patches.
Therefore, the discriminator architecture is designed to only penalize structure at the scale of patches. This discriminator tries to classify whether each N × N patch (70 × 70 in the original paper) in an image is real or fake. The discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of D.
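A minimal PatchGAN-style discriminator sketch in PyTorch; the layer widths are illustrative, and the receptive field of the final convolution plays the role of the N × N patch:

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: one real/fake score per patch,
    averaged into a single output, instead of one global score."""
    def __init__(self, in_channels=6):  # condition + image, concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 1, 4, padding=1),  # grid of patch scores
        )

    def forward(self, x):
        # Average the per-patch responses to get the final output of D.
        return self.net(x).mean(dim=(1, 2, 3))
```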
Pix2Pix – Results
CycleGAN – Unpaired Im2Im
First work on unpaired image-to-image translation: Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
CycleGAN – Architecture
Goal: learn a mapping $G : A \rightarrow B$ such that the distribution of images from $G(A)$ is indistinguishable from the distribution $B$, using an adversarial loss.
CycleGAN – Cycle Consistency Loss
It forces the output of generator G to be reversed back to the original input using generator F (and the other way around):

$$\mathbb{E}_a[\| a - F(G(a)) \|_1] + \mathbb{E}_b[\| b - G(F(b)) \|_1]$$
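A sketch of the cycle consistency term, assuming two generators `G` (A→B) and `F` (B→A) and batches of images `a`, `b` as PyTorch tensors:

```python
def cycle_consistency_loss(G, F, a, b):
    # a -> G(a) -> F(G(a)) should recover a, and symmetrically for b.
    loss_a = (a - F(G(a))).abs().mean()  # L1 over the A-side cycle
    loss_b = (b - G(F(b))).abs().mean()  # L1 over the B-side cycle
    return loss_a + loss_b
```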
CycleGAN – Some more details
The two G-D couples in CycleGAN are pretty similar to those of Pix2Pix: the generator is a state-of-the-art convolutional generator (e.g. ResNet-based), while the discriminator is a PatchGAN.
Additionally, a buffer of 50 generated images is used to update the discriminator models, instead of the freshly generated images in the current minibatch.
The models are trained with Adam and a small learning rate for 100 epochs, then for a further 100 epochs with learning rate decay. The models are updated after each image, i.e. with a batch size of 1.
Multi-Domain Image-to-Image Translation
• Goal: develop a system that is able to learn multiple mapping functions, one for each domain (example: multiple face attributes).
• Problem: following the CycleGAN pattern, having $k$ domains would result in using $k(k-1)$ generators.
StarGAN
• Multi-domain image-to-image generation.
• It uses only one generator network to generate multiple domains.
• G is conditioned on the target domain $y$ (defined by a one-hot vector).
• D produces probability distributions over sources (real/fake) and domain labels.
StarGAN – Loss Functions
Adversarial Loss: to make the generated images indistinguishable from real images, an adversarial loss is adopted:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))]$$

where $c$ is the target domain and $D_{src}$ is the adversarial output of the discriminator. The generator G tries to minimize this objective, while the discriminator D tries to maximize it.
Domain Classification Loss: for a given input image $x$ and a target domain label $c$, the goal is to translate $x$ into an output image $y$ which is properly classified to the target domain $c$. For this reason, an auxiliary classifier $D_{cls}$ is added on top of the discriminator D and trained on real images:

$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c' \mid x)]$$

where $D_{cls}(c' \mid x)$ represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image $x$ to its corresponding original domain $c'$. On the other hand, the loss function for the domain classification of fake images is defined as:

$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c \mid G(x, c))]$$

In other words, G tries to minimize this objective to generate images that can be classified as the target domain $c$.
StarGAN – Loss Functions
Reconstruction Loss: by minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to the correct target domain. However, minimizing those losses does not guarantee that translated images preserve the content of their input images while changing only the domain-related part of the inputs. To alleviate this problem, a cycle consistency loss is applied to the generator, defined as:

$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\| x - G(G(x, c), c') \|_1]$$

where G takes in the translated image $G(x, c)$ and the original domain label $c'$ as input and tries to reconstruct the original image $x$. Note that the same generator is used twice: first to translate an original image into an image in the target domain, and then to reconstruct the original image from the translated image.
StarGAN – Training
Overall losses:

$$\mathcal{L}_D = -\mathcal{L}_{GAN} + \lambda_{cls}\,\mathcal{L}_{cls}^{r}$$

$$\mathcal{L}_G = \mathcal{L}_{GAN} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{rec}\,\mathcal{L}_{rec}$$
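A sketch of both overall objectives in PyTorch; `D` is assumed to return a pair (real/fake probability, domain-label logits), `G` to take integer domain labels, and the loss weights are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x, c, c_orig, lam_cls=1.0, lam_rec=10.0, eps=1e-8):
    """c: target domain labels; c_orig: original domain labels of x."""
    fake = G(x, c)
    # --- Discriminator: L_D = -L_GAN + lambda_cls * L_cls^r ---
    src_real, cls_real = D(x)
    src_fake, _ = D(fake.detach())
    l_gan_d = torch.log(src_real + eps).mean() \
              + torch.log(1 - src_fake + eps).mean()
    loss_D = -l_gan_d + lam_cls * F.cross_entropy(cls_real, c_orig)
    # --- Generator: L_G = L_GAN + lambda_cls * L_cls^f + lambda_rec * L_rec ---
    src_fake_g, cls_fake_g = D(fake)
    l_gan_g = torch.log(1 - src_fake_g + eps).mean()  # only term G affects
    l_rec = (x - G(fake, c_orig)).abs().mean()        # translate back
    loss_G = l_gan_g + lam_cls * F.cross_entropy(cls_fake_g, c) + lam_rec * l_rec
    return loss_D, loss_G
```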
Few-Shot Image-to-Image Translation (FUNIT)
The training set consists of images of various object classes (source classes). The model is trained to translate images between these source object classes.
During inference (deployment), very few images of the target class are shown to the trained model; these are sufficient to translate images of source classes to analogous images of the target class, even though the model has never seen a single image from the target class during training.
Note that the FUNIT generator takes two inputs: 1) a content image and 2) a set of target class images. It aims to generate a translation of the input image that resembles images of the target class.
FUNIT Framework
The FUNIT framework consists of a conditional image generator G and a multi-task adversarial discriminator D.
The generator G simultaneously takes a content image $x$ and a set of K class images $\{y_1, \ldots, y_K\}$ as input and produces the output image $\bar{x}$ via $\bar{x} = G(x, \{y_1, \ldots, y_K\})$.
The content image belongs to object class $c_x$, while each of the K class images belongs to object class $c_y$. In general, K is a small number and $c_x$ is different from $c_y$.
FUNIT Generator
The few-shot image generator G consists of a content encoder $E_x$, a class encoder $E_y$, and a decoder $F_x$.
The content encoder maps the input content image $x$ to a content latent code $z_x$, which is a spatial feature map. On the other side, the class encoder maps each of the K individual class images $\{y_1, \ldots, y_K\}$ to an intermediate latent vector and then computes the mean of the intermediate latent vectors to obtain the final class latent code $z_y$.
FUNIT Generator
The decoder consists of several adaptive instance normalization (AdaIN) residual blocks followed by a couple of upscale convolutional layers. The AdaIN residual block is a residual block using AdaIN as the normalization layer.
For each sample, AdaIN first normalizes the activations of the sample in each channel to have zero mean and unit variance. It then scales the activations using a learned affine transformation consisting of a set of scalars and biases. Note that the affine transformation is spatially invariant and hence can only be used to obtain global appearance information.
FUNIT Discriminator
The discriminator D is trained to solve multiple adversarial classification tasks simultaneously. Each task is a binary classification task determining whether an input image is a real image of the source class or a translation output coming from G. As there are $|\mathbb{S}|$ source classes, D produces $|\mathbb{S}|$ outputs.
FUNIT Losses
FUNIT is trained by solving a minimax optimization problem, as usual:

$$\min_D \max_G \; \mathcal{L}_{GAN}(D, G) + \lambda_R \mathcal{L}_R(G) + \lambda_F \mathcal{L}_{FM}(G)$$

Adversarial Loss:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_x[\log D^{c_x}(x)] + \mathbb{E}_{x,\{y_1,\ldots,y_K\}}[\log(1 - D^{c_y}(\bar{x}))]$$

The superscript attached to D denotes the object class; the loss is computed using only the corresponding binary prediction score of the class.
Content reconstruction loss: when using the same image for both the input content image and the input class image (in this case K = 1), the loss encourages G to generate an output image identical to the input:

$$\mathcal{L}_R(G) = \mathbb{E}_x[\| x - G(x, \{x\}) \|]$$
FUNIT Losses
Feature matching loss: first, a feature extractor, referred to as $D_f$, is constructed by removing the last (prediction) layer from D. Then, $D_f$ is used to extract features from the translation output $\bar{x}$ and the class images $y_k$, minimizing:

$$\mathcal{L}_F(G) = \mathbb{E}_{x,\{y_1,\ldots,y_K\}}\left[ \left\| D_f(\bar{x}) - \sum_k \frac{D_f(y_k)}{K} \right\| \right]$$
FUNIT Results
StarGAN v2
From the paper: Choi, Yunjey, et al. "StarGAN v2: Diverse image synthesis for multiple domains." (2020).
StarGAN v2 – Losses
During training, a latent code $z \in \mathcal{Z}$ and a target domain $\tilde{y} \in \mathcal{Y}$ are randomly sampled, and a target style code $\tilde{s} = F_{\tilde{y}}(z)$ is generated.
Adversarial loss:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x,y}[\log D_y(x)] + \mathbb{E}_{x,\tilde{y},z}[\log(1 - D_{\tilde{y}}(G(x, \tilde{s})))]$$

The mapping network $F$ learns to provide the style code $\tilde{s}$ that is likely in the target domain $\tilde{y}$, and the generator $G$ learns to utilize $\tilde{s}$ and generate an image $G(x, \tilde{s})$ that is indistinguishable from real images of the domain $\tilde{y}$.
StarGAN v2 – Losses
Style Reconstruction: used in order to enforce the generator $G$ to utilize the style code $\tilde{s}$ when generating the image $G(x, \tilde{s})$:

$$\mathcal{L}_{sty} = \mathbb{E}_{x,\tilde{y},z}[\| \tilde{s} - E_{\tilde{y}}(G(x, \tilde{s})) \|]$$

This is similar to other approaches which employ multiple encoders to learn a mapping from an image to its latent code. The notable difference is that here a single encoder $E$ is trained to encourage diverse outputs for multiple domains. At test time, the learned encoder $E$ allows $G$ to transform an input image, reflecting the style of a reference image.
StarGAN v2 – Losses
Style diversification: employed to further enable the generator $G$ to produce diverse images (this loss is maximized):

$$\mathcal{L}_{ds} = \mathbb{E}_{x,\tilde{y},z_1,z_2}[\| G(x, \tilde{s}_1) - G(x, \tilde{s}_2) \|]$$

The target style codes $\tilde{s}_1$ and $\tilde{s}_2$ are produced by $F$ conditioned on two random latent codes $z_1$ and $z_2$.
Cycle Consistency loss: used to guarantee that the generated image $G(x, \tilde{s})$ properly preserves the domain-invariant characteristics (e.g. pose) of its input image $x$:

$$\mathcal{L}_{cyc} = \mathbb{E}_{x,y,\tilde{y},z}[\| x - G(G(x, \tilde{s}), \hat{s}) \|]$$

where $\hat{s} = E_y(x)$ is the estimated style code of the input image $x$.
StarGAN v2 – Results
Just the Tip of the Iceberg
A vast literature on generative models exists. New architectures are currently being developed:
• Vision Transformers
• Diffusion models