Master Thesis
Deep learning has excelled in numerous computer vision applications, but such models still have a limited capacity to combine data from different domains. Because of this, performance can be negatively impacted by even small deviations from the training data. Domain shift, the difference between observed and expected data, is crucial to study in robotics, since a slight change in an algorithm's behaviour can put a physically operating robot at risk. One potential strategy for modifying existing data to better match what an algorithm will encounter during operation is to use generative methods, which aim to produce data conditioned on a given input. In this study, we investigate various Generative Adversarial Networks and vector quantization methods capable of generating real-world-like scenarios from computer-generated images.
Contents
1 Introduction
1.1 Outline of thesis
2 Background
2.1 Synthetic vs Real Images
2.2 Machine Learning
2.2.1 Supervised and Unsupervised Learning
2.3 Deep Learning
2.3.1 Artificial Neuron
2.3.2 Multilayer Perceptron
2.3.3 Convolutional Neural Network
2.3.4 Regularization Techniques for CNNs
2.3.5 Optimization of Artificial Neural Networks
2.3.6 Latent Space
2.3.7 ResNet
2.3.8 State of the Art
3 Literature Review
4 Concept and Implementations
5 Experiments and Results
6 Conclusion
Bibliography
A Appendix
1. Introduction
The human brain has the capacity to perceive objects in three dimensions, which gives it tremendous potential for broad environmental understanding. In addition to recognizing an item's shape, form, density, and weight, the brain also understands how to interact with the item and manipulate it so that it may be moved or used to carry out tasks. For humans, it is easy to perceive changes in the environment, and even when we view something painted by an artist, the brain can adjust how we see an item. Our capacity to recognize objects visually is relatively effortless [Rensink 00]. Despite the enormous variation in appearance that each thing produces in our eyes, humans are able to identify and categorize objects in a fraction of a second from among tens of thousands of possibilities. The perceptual and cognitive functions of the brain let us learn and recognize an object's nature even if we have seen it only once, and use that memory to identify it in a new and unique environment [Van Dyck 21]. Replicating these human capabilities is quite challenging, as the perceived objects can be hidden or blurred and look different from different viewpoints, leaving machines with a poor sense of their environment in general.
Although deep learning achieves outstanding performance, such models are highly data specific and may fail even under subtle deviations from the training distribution. Furthermore, real-world conditions, with variations in illumination and in the environment in which the model is deployed, may differ from the training and testing data, adding complexity and degrading performance. Beyond this, the kind of preprocessed data we use to train models adds to the existing problems. For instance, models trained on the ImageNet data, where images are preprocessed to be free of noise and blur and aligned to the centre, often achieve better performance than is attainable in real life. In addition, data scarcity for rarely observed conditions such as forest areas and off-road scenes must also be considered. These challenges have commonly been addressed with large amounts of simulated data. Synthetic data is a crucial strategy for addressing data problems, either by employing cutting-edge data manipulation techniques to provide fresh and varied training examples or by creating the desired simulated data from scratch. Even if the simulated domain provides an endless supply of data, the issue of domain shift between it and the real world still exists. The complexity of reality cannot be fully captured in simulation, hence an algorithm developed on simulated data may not perform well when applied to real situations [Shrivastava 17, Theiss 22]. To assist models' transition from synthetic training sets to real test sets, one must make the synthetic data as realistic as feasible and develop methods for doing so. Consequently, domain adaptation emerges as a central theme in synthetic data research.
Several techniques have been designed to decrease the domain gap between data distributions. One of the most widely studied techniques for image-to-image translation is the Generative Adversarial Network (GAN) [Denton 15, Salimans 16]. In addition, various style transfer methods have emerged to address this issue. The conditional contexts in GANs and style transfer often make them learn straightforward recolouring or regional transformations without comprehending the actual target distribution [Wang 18]. Recent studies [Esser 21] have demonstrated the usefulness of the vector quantization (VQ) approach as a latent representation for generative models [Enis Cetin 06, Wu 19]. This work utilizes the VQ-VAE [Van Den Oord 17], GAN [Goodfellow 20] and transformer [Dosovitskiy 20] architectures to bring the latent representation of synthetic images closer to the real latent representation. Further, we obtain a synthetic-to-real image translation using the transformed embedding space of synthetic images together with the pretrained codebook and decoder.
2. Background
We elucidate the foundational concepts employed in this work, starting with a brief description of the different types of images. We then explain basic machine learning concepts, artificial neural networks, and convolutional neural networks. Further, we shed light on state-of-the-art models, ranging from Generative Adversarial Networks to Transformers, which are prominent for image generation.
Many machine learning models require large, diverse, annotated datasets to perform their tasks reliably. However, gathering this kind of data is expensive and time-consuming to process. One alternative is synthetic data. In contrast to real data collected from actual natural events, synthetic data is generated by a computer using algorithmic methods to mimic the properties and structure of the relevant real-world data. Synthetic data points resemble real-world events but do not reflect them exactly. Synthetic data provides unlimited, affordable and high-quality data for training machine learning models. However, things are a little more complicated in reality: synthetic data often fails to generalize well to real scenarios, and inconsistencies arise when replicating the complexities of the original datasets [Alkhalifah 21]. As CNNs are more biased towards texture than towards the object's shape [Geirhos 18], training a model with synthetic data may introduce such biases, because the texture of the contents in a synthetic image differs from the actual instance. Unlike humans, CNNs focus on texture cues and detect an elephant-textured cat as an elephant, as shown in figure 2.1. To effectively use synthetic data as a source for machine learning models, it must have a statistical distribution similar to that of a real dataset. The synthetic data points should be indistinguishable from real ones, and there should also be considerable variation between individual synthetic data points. As it is difficult to capture every possible real-world situation, such as forest environments or defence scenarios, we rely on synthetic data, which can be extracted from computer games [Richter 16a].
Figure 2.1: Classification based on shape and texture by ResNet-50, adopted from [Geirhos 18]. (a) shows texture cues from an elephant, (b) has a cat as content and (c) is a combination of the cat shape with the elephant texture.
a = f (n) = f (w · p + b) (2.1)
Figure 2.2: Schematic diagram of Single Neuron with activation function adapted from
[Demuth 14]
As shown in figure 2.2, each neuron receives m inputs p ∈ Rᵐ from external sources or neighbouring nodes, each of which is assigned a weight w ∈ Rᵐ. A further bias term b is added to the weighted sum of these inputs, giving the number n. Applying an activation function to n produces the output in equation 2.1, which is written in vector form. A wide range of linear and non-linear mathematical functions, for example ReLU, LeakyReLU, and Swish, are available as the activation function f. The Rectified Linear Unit (ReLU) activation, a piece-wise linear function f(n) = max(0, n), is often used in current networks.
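As a concrete illustration of equation 2.1, the following minimal PyTorch sketch implements a single neuron with a ReLU activation; the input size, weights and bias are arbitrary values chosen for demonstration, not values used elsewhere in this work.

```python
import torch

torch.manual_seed(0)

m = 4                      # number of inputs (arbitrary for this sketch)
p = torch.randn(m)         # input vector p
w = torch.randn(m)         # weight vector w
b = torch.tensor(0.5)      # bias b

n = torch.dot(w, p) + b    # weighted sum plus bias, n = w . p + b (equation 2.1)
a = torch.relu(n)          # ReLU activation f(n) = max(0, n)
print(a)
```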
y = f₂(W₂ · f₁(W₁ · p + b₁) + b₂) (2.2)
An MLP may approximate any real, continuous function if it has at least one hidden layer. However, no universal rule specifies the necessary number of neurons and layers or the optimisation method [Hornik 89]. Nowadays, practically every published design has many layers of neurons due to advancements in computing power. Since deep learning is a very active research area, several new designs are suggested daily. CNNs are cutting-edge techniques for problems like speech recognition, image identification, or computer vision in general [Krizhevsky 17, Hinton 12]. We continue our discussion of CNNs, another crucial deep learning topic, in the following section.
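A minimal PyTorch sketch of the two-layer MLP in equation 2.2 is given below; the input, hidden and output sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two-layer perceptron: y = f2(W2 * f1(W1 * p + b1) + b2)
mlp = nn.Sequential(
    nn.Linear(4, 8),   # W1, b1 (input size 4 and hidden size 8 are arbitrary)
    nn.ReLU(),         # f1
    nn.Linear(8, 1),   # W2, b2
    nn.Sigmoid(),      # f2
)

p = torch.randn(1, 4)  # a single input vector
y = mlp(p)
print(y)
```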
Figure 2.3: Schematic diagram of Multi-layer Perceptron adapted from [Demuth 14]
Sparse interactions eliminate the requirement of fully connected neural networks, where each neuron is coupled to every neuron in the previous layer. Instead, a kernel K smaller than the input connects only a tiny subset of the input in the vicinity of the kernel centre to the convolution's output. The size of the kernel area is referred to as the kernel size. In addition to the kernel size, the stride and the total number of feature maps are further crucial hyperparameters. When moving the kernel across the input, the stride indicates the step size to be used [LeCun 15]. Padding can be used to preserve a specific output size. The number of feature maps indicates how often the convolution is applied to the input; one feature map is produced as output for each convolution of the input. The feature maps, or output layers, correspond to the input image's edges and textures. As the depth of the convolution layers increases, the extracted features tend to shift from generic to semantically rich features. Sharing parameters makes convolution operations storage-efficient and enables the reuse of weights for various input areas.
In contrast to conventional fully connected NNs, CNNs do not assign a separate weight to each neuron connection. Instead, within each convolution operation, the kernel's weights remain constant. As a result, every region of the input image receives the identical kernel weight matrix. Equivariance to translation refers to the fact that when the input of a convolution operation is translated, so is its output. More specifically, f and g are equivariant if f(g(x)) = g(f(x)). In the case of 2D images, this means that if there is a shift in the input image, there will also be a shift in the output image. For a given input, the convolution layer output size is determined by formula 2.3, where o represents the output size, i represents the input picture dimensions, k represents the kernel size, s represents the stride, and p represents the padding. The pictorial representation is given in figure 2.4. Padding helps obtain the desired output size and enables the kernel to consider edges. In some network topologies, like the autoencoder, it is required to obtain images from the feature map by using upsampling or transposed convolution techniques. These are covered in further detail in the following chapters.
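Formula 2.3 itself does not appear in this excerpt; assuming the standard convolution output-size relation implied by the variable definitions above, it reads

o = ⌊(i − k + 2p) / s⌋ + 1 (2.3)

For example, an input of width i = 256 with k = 3, s = 1 and p = 1 yields o = ⌊(256 − 3 + 2)/1⌋ + 1 = 256, i.e. the spatial size is preserved.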
activation functions do not saturate, which offers faster and more reliable convergence [Ioffe 15]. Unlike BN, layer normalization and instance normalization are quite similar to each other. However, instance normalization normalizes across each channel of each training example rather than across the input features of a training sample.
Continuous Space
Consider the data x ∈ Rn , which is generated from the lower dimension z ∈ Rm (m < n),
by the formula
x = A · z + v (2.6)
In equation 2.6, v is independent n-dimensional Gaussian noise. The lower-dimensional representation is not available to us; we can only access the raw data. Principal Component Analysis (PCA), a traditional unsupervised approach, tries to find this hidden representation. Our latent spaces are intended to identify non-linear representations, whereas PCA [Kurita 19] focuses on linear representations. These hidden representations can be regarded as compressed versions of a large dataset.
Discrete Spaces
So far in this section, we have dealt with latent representations as continuous distributions. However, several types of real data, such as the language representations detailed in [Van Den Oord 17], are discrete, which motivates the transition of hidden representations from continuous to discrete space. This concept is also leveraged by the VQ-VAE discussed below.
2.3.7 ResNet
Convolutional neural networks are known to benefit from depth, which makes sense because a larger parameter space gives models more freedom to adapt to any environment. However, it has been shown that performance declines after a certain depth, owing to the vanishing gradient problem [Hochreiter 98]. This is because, when the network is too deep, the gradients of the loss function shrink towards zero after repeated applications of the chain rule. As discussed in section 2.3.5, Batch Normalization (BN) helps reduce the likelihood of vanishing gradients but does not prevent them entirely [Basodi 20].
The problem of training deep networks has been alleviated by ResNet models, which comprise residual blocks. The core concept of residual blocks is skip connections, which skip a few layers and provide a direct connection, as shown in figure 2.5. Employing these residual connections significantly increases model accuracy because they allow the gradient to flow more easily into the lower layers. The skip connection is expressed in equation 2.7 for a previous layer.
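Equation 2.7 is not reproduced in this excerpt; assuming the standard residual formulation, the skip connection for a block with input x and residual mapping F (the stacked layers that are skipped) can be written as

y = F(x) + x (2.7)

so that the gradient of y with respect to x always contains an identity term, which is what lets it flow easily into the lower layers.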
Auto Encoders
An autoencoder is the most basic type of neural network used in unsupervised learning, as it trains without explicit labelling. However, since autoencoders create their own labels from the training data, they may also be considered self-supervised. Moreover, an autoencoder identifies non-linear hidden representations of the input data [Bank 20]. It consists of an encoder network z = f_θ(x) with parameters θ and a decoder network x̂ = g_ϕ(z) with parameters ϕ, where x is the input, z is the latent vector and x̂ is the reconstruction from the compressed representation. The mean squared error (MSE) between the input x and the reconstruction x̂ is used as the reconstruction loss L.
arg min_{θ,ϕ} L_Rec(x, x̂) = arg min_{θ,ϕ} L_Rec(x, g_ϕ(f_θ(x))) (2.8)
Equation 2.8 describes the entire autoencoder network. Mathematically, the error can be expressed as ||x − x̂||₂² or as log p(x|z). The hidden part of the network is also termed the bottleneck, as it condenses all the data into lower dimensions. z must always be smaller than x to force the encoder to identify the key characteristics.
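The following is a minimal PyTorch sketch of the autoencoder described above, trained with an MSE reconstruction loss; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # encoder f_theta: compresses x into the latent vector z (the bottleneck)
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # decoder g_phi: reconstructs x_hat from z
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.randn(16, 784)                   # a batch of flattened inputs
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)    # reconstruction loss L_Rec (equation 2.8)
loss.backward()
```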
VAE
Variational autoencoders (VAEs), introduced by Kingma et al. [Kingma 19], are one of the standard approaches in unsupervised learning. The structure of the bottleneck representation is where autoencoders and VAEs differ significantly. In the latent space, data points with similar semantic properties should cluster together, while those with dissimilar semantic properties should be dispersed. Furthermore, the bulk of the data distribution should not extend towards infinity and should instead occupy a compact region of the hidden space. The primary problem with standard autoencoders is that the latent space can extend up to infinity, which lets the model memorise individual data points. Variational autoencoders solve this issue by imposing a probabilistic prior on the latent space. A VAE can be viewed both as a neural network and as a probabilistic model; we briefly consider each viewpoint. As shown in figure 2.6, a VAE is structured almost exactly like an autoencoder, with an encoder q_θ(z|x) that outputs a Gaussian probability distribution and a decoder p_ϕ(x|z) that takes a sampled z obtained via the reparameterization trick in equation 2.9.
z = µ + ϵσ (2.9)
In equation 2.9, µ is the mean, σ is the standard deviation and ϵ is a random value drawn from the prior distribution Normal(0, 1). The reconstruction loss is computed from the log likelihood log p_ϕ(x|z). The loss lᵢ of every data point xᵢ can be split as in equation 2.10.
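Equation 2.10 is likewise not reproduced in this excerpt; assuming the standard VAE objective, the per-sample loss most likely combines the reconstruction term with a KL regularizer towards the prior:

lᵢ = −E_{z∼q_θ(z|xᵢ)}[log p_ϕ(xᵢ|z)] + KL(q_θ(z|xᵢ) ‖ p(z)) (2.10)

where p(z) is the Normal(0, 1) prior mentioned above.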
VQ-VAE
With the addition of a discrete codebook component, VQ-VAE [Van Den Oord 17] extends the conventional variational autoencoder. The major differences between VQ-VAE and a standard VAE are that the continuous latent space is replaced with a discrete one and that the prior is learned instead of being kept fixed [Van Den Oord 17].
Figure 2.7: Schematic diagram of VQ-VAE adapted from [Van Den Oord 17].
As shown in figure 2.7, the encoder produces a latent vector from an input image and compares it to each embedding vector in the codebook to determine which vector is closest in Euclidean distance, z_q = argmin_i ||z_e(x) − e_i||₂, where z_e(x) is the encoder's output vector and e_i is the i-th embedding vector in the codebook. The corresponding quantized codebook vector z_q is then sent to the decoder for image reconstruction. As the argmin operation is not differentiable, for training purposes we copy the decoder gradient ∇_z L directly to the encoder: the gradient is treated as an identity (set to 1) between the selected codebook vector and the encoder output, and as zero for all other codebook vectors.
The codebook is also learned via gradient descent, like the encoder and decoder; learning codebook vectors that align with the encoder output works in both directions. The loss function consists of a reconstruction loss, a codebook alignment loss, and a codebook commitment loss, as shown in equations 2.12 and 2.13. The reconstruction loss is the mean squared error between the original and the reconstructed image. The codebook alignment loss pulls the selected embedding vector towards the encoder output, with a stop gradient placed on the encoder output. Conversely, the codebook commitment loss places a stop gradient on the codebook vector to make the encoder output commit to the nearest codebook vector. The weight of the commitment loss is scaled by a hyperparameter β. In our current work, we employ VQ-VAE with different hyperparameters, numbers of embedding vectors, and embedding dimensions.
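A minimal PyTorch sketch of the vector-quantization step and the loss terms described above (nearest-neighbour codebook lookup, straight-through gradient copy, alignment and commitment losses) is given below; the codebook size, embedding dimension and β are illustrative assumptions rather than the settings used in this work.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: encoder output of shape (N, D); codebook: embeddings of shape (K, D)."""
    # Euclidean distance between each encoder vector and each codebook vector
    dist = torch.cdist(z_e, codebook)            # (N, K)
    indices = dist.argmin(dim=1)                 # nearest codebook entry per vector
    z_q = codebook[indices]                      # quantized vectors

    # Codebook alignment loss: move selected embeddings towards sg[z_e]
    align_loss = F.mse_loss(z_q, z_e.detach())
    # Commitment loss: make the encoder commit to sg[z_q], scaled by beta
    commit_loss = beta * F.mse_loss(z_e, z_q.detach())

    # Straight-through estimator: copy decoder gradients directly to the encoder
    z_q = z_e + (z_q - z_e).detach()
    return z_q, align_loss + commit_loss

codebook = torch.randn(512, 256, requires_grad=True)   # K = 512 entries of dimension 256
z_e = torch.randn(64, 256, requires_grad=True)         # stand-in encoder output vectors
z_q, vq_loss = vector_quantize(z_e, codebook)
```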
Generative Adversarial Networks
As shown in figure 2.8, a generative model G produces an image x̂ = G(z) from a sample z drawn from a Gaussian distribution as random input noise. The generated and original images are then provided to the discriminator network D, which distinguishes between fake and real images.
L_GAN = min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] (2.14)
D(x) represents the discriminator's estimate of the probability that a real sample is real, and D(G(z)) represents its estimate of the probability that a fake sample is real. The objective function in equation 2.14 is a minimax function. The first term reflects the discriminator's predictions on real data, and the second its predictions on fake data. The discriminator wants D(x) to be large, because that represents high confidence that a real sample is real, and it wants D(G(z)) to be as small as possible. Since the generator intends to fool the discriminator, it wants this value to be as large as possible, which results in an adversarial game. The expectation E denotes the average of the predictions when original or noise-generated data is fed to the discriminator. GANs have shown promising results for image generation, although optimizing this objective is difficult and computationally expensive.
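A minimal PyTorch sketch of one optimization step of the minimax game in equation 2.14, using the binary cross-entropy formulation, is shown below; the generator and discriminator here are placeholder fully connected networks, not the architectures used later in this work.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.rand(32, 784)    # stand-in for a batch of real samples
z = torch.randn(32, 64)         # random input noise

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))
opt_d.zero_grad()
d_loss = bce(D(x_real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# Generator step: fool the discriminator, i.e. make D(G(z)) large
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_g.step()
```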
CycleGAN
The Cycle Generative Adversarial Network (CycleGAN) is a method for training a deep convolutional neural network for image-to-image translation [Zhu 17]. Using an unpaired dataset, the network learns the mapping between input and output images. These models are trained in an unsupervised manner, using a set of images from the source domain X and the target domain Y that are not required to correspond to each other. CycleGAN has been used in various applications, including the translation of seasons, object transformation, style fusion, and the creation of pictures from paintings. The authors offer a technique that can learn the distinctive characteristics of one image collection and determine how these characteristics could be transferred to a second image collection, all without any paired examples. CycleGAN is an architectural extension of the GAN that trains two generator models and two discriminator models simultaneously. As shown in figure 2.9, the first generator G takes input samples from the first domain X and generates output samples in the target domain Y, while the second generator F takes input images from the target domain Y and produces images for the first domain X. For each generator, a discriminator is included that attempts to distinguish synthetic from real samples. The generator models are then updated based on how credible the generated images are. This extension with the adversarial loss alone could be adequate to produce reasonable visuals in each domain, but not faithful translations of the input images. Cycle consistency and identity loss are further refinements of the architecture that CycleGAN takes advantage of.
Figure 2.9: Schematic diagram of CycleGAN. G and F are generators and D_X, D_Y are discriminators.
It is proposed that an image produced by the first generator may be used as the input for the second generator, whose output should resemble the original image. The opposite also holds: an output from the second generator may be provided as input to the first generator, and its output should match the second generator's input. The absolute pixel-level difference between the original and generated images is used to impose this constraint.
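A minimal sketch of how this cycle-consistency constraint can be expressed as an L1 penalty in PyTorch is given below; G and F stand for the two generators and are assumed here to be arbitrary image-to-image networks, with identity modules used purely to make the call signature runnable.

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G, F_net, x, y, lam=10.0):
    """Forward cycle: x -> G(x) -> F(G(x)) should match x.
    Backward cycle: y -> F(y) -> G(F(y)) should match y."""
    l1 = nn.L1Loss()
    forward_cycle = l1(F_net(G(x)), x)
    backward_cycle = l1(G(F_net(y)), y)
    return lam * (forward_cycle + backward_cycle)

# Placeholder "generators" only to demonstrate usage
G = nn.Identity()
F_net = nn.Identity()
x = torch.randn(4, 3, 256, 256)   # synthetic-domain batch
y = torch.randn(4, 3, 256, 256)   # real-domain batch
loss = cycle_consistency_loss(G, F_net, x, y)
```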
VQ-GAN
Vector Quantized Generative Adversarial Network (VQ-GAN) [Esser 21] extends VQ-VAE
by adding a discriminator network, which tries to identify real and reconstructed images.
We use VQ-GAN with and without the transformer model from the paper [Esser 21] in
our work.
Similar to [Van Den Oord 17], VQ-GAN training uses the decoder's reconstructed images, which are supplied to the discriminator along with real inputs. For improved visual quality, the adversarial loss of the discriminator component is added to VQ-GAN, and the L_rec of equation 2.13 is replaced with a perceptual loss. The loss function L_VQGAN is described in equation 2.15:
L_VQGAN = arg min_{E,G,Z} max_D E_{x∼p(x)}[L_VQ(E, G, Z) + λ L_GAN({E, G, Z}, D)] (2.15)
Transformers
Vision Transformers have recently developed into a competitive alternative to convolutional neural networks (CNNs) and are now employed extensively, often at the state of the art, in a variety of image recognition applications. Transformers show great promise as a general learning approach that can be applied to many data modalities, with recent advances in computer vision reaching state-of-the-art accuracy with improved parameter efficiency. Transformer models have shown remarkable performance in Natural Language Processing (NLP) [Devlin 18], pushing the boundaries of the field and breaking several records in tasks such as machine translation, conversational chat bots, and more effective search engines. This inspired researchers to apply the same ideas to computer vision tasks; notable examples are BEiT [Bao 21] and Swin Transformers [Liu 21]. Similar to the series of word embeddings used when applying transformers to text-related tasks, the ViT model represents an input image as a set of image patches, modelling the feature map as a sequence of tokens, with each token representing an embedding of a particular image patch. Unlike [Vaswani 17], vision transformers make use of the encoder only.
The figure illustrates the architecture of a vision transformer. The parallelization power of transformers allows them to be fed large amounts of data with a broad context window. Before feeding the encoder, the first step is to split the image into patches, as illustrated in figure 2.11; these are provided as a sequence of linear embeddings to the transformer. The image patches play the same role as tokens (i.e. words) in an NLP application. We could also feed in the raw pixel values instead of dividing the image into patches; however, the attention mechanism then suffers. As it stands, every input must be compared with every other input, so a 256 × 256 pixel image yields 256² = 65,536 tokens, and the attention computation grows quadratically in that number. This is for a single attention layer, and transformers contain several of them, so it would be a computational nightmare. To embed the created patches as patch embeddings, we feed them into a linear projection layer to produce vectors. These are learnable parameters; during training, the model updates them with values that better help with the assigned task. Unlike LSTMs, which take the embeddings sequentially in a designated order and therefore know which word came first, transformers take up all the embeddings at once. This is a big plus that makes transformers much faster, but the downside is that they lose the ordering information of the embeddings and also lack the notions of equivariance and invariance, unlike CNNs. To resolve this issue, the positional embedding of [Vaswani 17] uses wave frequencies as a clever way to capture position information.
For the classification task, similar to [Bao 21], a learnable class token embedding is appended to the series of patch embeddings with their positional encodings. The final-layer feature vector corresponding to this class embedding is used by the classification head (i.e. an MLP) for classification. The transformer encoder maps the input sequence of embedded patch representations to a sequence of continuous representations that holds the learned information for the entire sequence. The encoder is a stack of multi-head self-attention (MHSA) layers, multi-layer perceptron (MLP) layers, layer norms, and optional residual connections. It uses a particular attention mechanism known as self-attention, which enables the model to relate each input token to every other token in the sequence and computes an attention score for each pair. Each token is then set to the weighted mean of all input elements, where the attention scores determine the weights. Such a process modifies each token in the input sequence by scanning the whole sequence, determining which tokens are most important, and updating each token's representation according to those tokens. The attention employed in most transformers differs slightly from this description: transformers frequently employ a "multi-headed" form of self-attention, in which several separate self-attention modules operate in parallel and their outputs are concatenated.
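A minimal PyTorch sketch of the scaled dot-product self-attention underlying the multi-head mechanism described above is shown below; the embedding dimension and sequence length are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # joint projection to queries, keys, values
        self.dim = dim

    def forward(self, tokens):               # tokens: (batch, seq_len, dim)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        # attention scores relate every token to every other token
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.dim)
        weights = scores.softmax(dim=-1)
        # each output token is a weighted mean of the value vectors
        return weights @ v

attn = SelfAttention()
patches = torch.randn(2, 17, 256)   # e.g. 16 patch embeddings plus a class token
out = attn(patches)
```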
3. Literature Review
This section discusses related work on image-to-image translation between different domains using VQ-VAE, GANs and transformers. [Goodfellow 16] assumes that the observed data can be represented as independent random variables drawn from an identical distribution. In practice, however, this assumption must be seriously questioned because data are still scarce in many fields. We still want to employ training datasets for broad use cases, even if they may have been gathered on a different continent or with different camera configurations, and the models must therefore anticipate distributional shifts when performing in their target context. Since data annotation remains costly, it would be ideal to forgo the expensive human element entirely and switch to the creation of automated, synthetic data. Yet even the most remarkable photorealistic rendering techniques on the market are unlikely to match the original distribution exactly. A core requirement of robotics is to adapt to new and unknown situations, and collecting data for every different environment is challenging and often impossible, which motivates us to use synthetic data.
Isola et al. proposed Pix2Pix [Isola 17], a conditional model that can translate an image from one domain to another through paired training. Later, Pix2PixHD [Wang 18] emerged, building on [Isola 17] and producing high-resolution translated images. Although these approaches performed well and represented considerable advancements over the state of the art, one key issue with these paired image-to-image translation methods is the paired dataset itself, a dataset in which each image already has its translated counterpart. Due to this pairing restriction between input and output, such datasets are difficult, costly, or even impossible to collect.
Domain adaptation helps us to overcome the bias between the training and real-world data distributions. The authors of CycleGAN [Zhu 17] extended the idea of the vanilla GAN [Goodfellow 20] by adding two generators and two discriminators. Furthermore, they make use of two additional loss functions, the forward and backward cycle consistency losses, in addition to the normal adversarial loss, to further regularize the mapping between the distribution of the input images (X, e.g. computer-generated images) and the desired output distribution (Y, e.g. CityScapes or KITTI images). The CycleGAN architecture is described in more detail in section 2.3.8.
In [Isola 17], Isola et al. studied conditional adversarial networks on a variety of datasets for a broad range of applications, from image-to-image translation to image segmentation. Their proposed network learns a loss function for the mapping instead of just learning the mapping between the original and translated image, which removes the need to tweak parameters for different adaptations. The generator and discriminator architectures use convolution followed by BatchNorm and ReLU, adopted from [Ioffe 15]. They also make use of skip connections from the 'U-Net' to communicate low-level information between the layers. A PatchGAN is used as the discriminator architecture, which can be regarded as a form of texture loss. Several datasets, such as Cityscapes [Cordts 16] and Google Maps, have been used to test the proposed method on semantic labelling and on converting maps to aerial photos, respectively. Evaluation methods include showing generated and original images to workers on Amazon Mechanical Turk (AMT) as well as metrics similar to the Inception Score [Salimans 16] and the FCN score [Long 15]. The paper further observes that PixelGANs are efficient at colourization, whereas PatchGANs produce sharper images.
[Liu 17] proposed a Coupled-GAN-based technique [Liu 16] for unsupervised image-to-image translation. The authors make a shared-latent-space assumption: two corresponding images in separate domains can be mapped to the same latent representation in a shared latent space. Based on this, they proposed a framework combining VAEs and GANs, and additionally embedded weight-sharing constraints to relate the VAEs. Later, [Kazemi 18] addressed a problem of [Liu 17] in modelling domain-variant information by removing the shared latent space and jointly learning a domain-specific space and a domain-invariant space, since such methods frequently fail to model domain-specific information that is not represented in the target domain.
[Xie 20] addresses the challenge of identifying cross-domain relationships from unpaired data. The proposed method learns to identify connections across several domains using generative adversarial networks. Furthermore, this network successfully performs style transfer from one domain to another while using the discovered relations to maintain important characteristics such as orientation and face identity. Similar to [Xie 20], [Yi 17] introduces a dual-GAN technique that allows image translators to be trained using two sets of unlabeled images from two domains. In the proposed design, the primal GAN learns to translate images from domain U to domain V, while the dual GAN learns to invert the task. Both are regularized by a cycle consistency constraint, which promotes bidirectional image translation with structured output. Although these GANs produce realistic visual results for image-to-image translation tasks, they are less suited to domain adaptation settings, where corruptions of the image content commonly appear in the translated images. The issue of content distortions was addressed by [Huang 18] and [Zhang 18], who used extra segmentation branches to incorporate semantic information into the translation.
In [Xie 20], Xinpeng et al. address the issue of content distortions, which CycleGAN-based methods have not fully solved. Though CycleGAN performs well at image-to-image translation, it fails to preserve image content. They also work on eliminating the need for pixel-level annotations, which were previously used extensively to overcome object preservation problems. The proposed method adds a self-supervised multi-tasking network to the regular adversarial and cycle consistency losses. With this additional network, the authors aim to capture the contents of the image along with domain knowledge. They experimented on three datasets, including medical images, which are subject to highly varying imaging conditions.
To date, many papers have performed image-to-image translation with generative models. However, these generative models tend to learn colourization or simple regional translation instead of comprehending the desired target distribution. To address this, the authors of [Chen 22] employed the vector quantization method for image-to-image translation. This is made possible by using VQ as the intermediate embedding space, with its low-dimensional image representation. Furthermore, they proposed a multi-functional model for unconditional image generation with the possibility of style translation. The method has an encoder that extracts domain-invariant features, along with two style encoders that extract domain-specific features. The generator then forms images that combine content and style, whereas the discriminator tries to determine whether an image is fake or real.
4. Concept and Implementations
The earlier chapters described the principles and the current state-of-the-art generative models for domain adaptation, ranging from GAN-based approaches to vector quantized approaches. This work focuses on learning representations for image-to-image translation that preserve the similarity relations of the data space. The goals of the study are to reduce the number of data points required for deep learning algorithms to produce usable results and to make such results reliable in a variety of situations.
This work is based on VQ-GAN, which is built on top of the VQ-VAE, as these methods are good at learning meaningful latent representations of the data space and at image generation. One of the advantages of using these types of models is that they can avoid posterior collapse by integrating the notion of vector quantization. The vector quantizer acts as a regularizer that forces a restricted code space onto the encoder's output and organizes this output into vectors. The term "quantization" in VQ refers to how related vectors are represented by the same index.
4.1 Datasets
This section describes the properties and characteristics of the training and test data and shows exemplary images from each dataset in figure 4.1. A dataset should be relevant to the target domain in order to provide transferable signals and information, while also capturing a substantial amount of variation to avoid overfitting. However, annotating large-scale datasets at the pixel level is very expensive and cumbersome, as it requires immense human effort. Sandbox video games strive to portray vast settings that frequently match their real-world equivalents; they often cover a wide variety of architectural styles and artificial environments that resemble the human world.
GTAV
[Richter 16b] put forward an algorithm using G-buffers to extract pixel-level semantic labels from one of the most famous open-world games, Grand Theft Auto V (GTAV). The dataset extracted by [Richter 16b] consists of 254064 frames in total at 1920 × 1810 resolution, captured under different environmental and climatic settings.
CityScapes
CityScapes [Cordts 15] is a well-known, publicly available dataset for computer vision tasks, taken from video sequences recorded in fifty different cities in Germany and France. The dataset comprises 5000 high-quality images with pixel-wise annotations and 20000 frames with coarse annotations. All images have a resolution of 1024 × 2048. The main focus of this dataset is to make models understand urban scenes at the pixel and instance level. As the CityScapes images come from the real world, they were chosen as the target domain in this work.
Pixel values are scaled using min-max normalization into the range [0, 1]. Later, a horizontal flip is applied to the images to improve the models' generalizability.
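A sketch of this preprocessing using torchvision transforms is shown below; the min-max scaling is applied per image, and the flip probability of 0.5 is an assumption, as it is not stated here.

```python
import torch
from torchvision import transforms

def min_max_normalize(img: torch.Tensor) -> torch.Tensor:
    """Scale an image tensor into the range [0, 1]."""
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

preprocess = transforms.Compose([
    transforms.ToTensor(),                     # HWC image -> CHW float tensor
    transforms.Lambda(min_max_normalize),      # min-max normalization to [0, 1]
    transforms.RandomHorizontalFlip(p=0.5),    # horizontal flip for generalizability
])
```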
4.3 Implementation
Deep learning research depends heavily on frameworks that provide the building blocks for model architectures. Two of the most popular deep learning frameworks are PyTorch [Paszke 19] and TensorFlow [Abadi 16]. In our study, we use PyTorch as the designated framework, as it is the most commonly adopted in academia. Here, we provide the motivation and implementation details of the baseline model as well as the extensions and special features of the proposed methods.
The fundamental goal of this research is to make use of a discrete latent representation space that helps us add high-quality features to the synthetic inputs in order to generate more realistic output. Furthermore, we must preserve the source geometry and semantic labels of the translations while transforming images from one domain to another, such as from computer-generated images to real images.
To achieve this, we first pre-train the entire VQ-GAN model on a real dataset to obtain
the actual codebook latent distribution and its corresponding decoder. Then, we freeze
the codebook and decoder weights and employ the different encoder modules to map
the synthetic latent representation to the real latent space representation. We later use
discriminators and transformers to bring the latent synthetic distribution closer to the
real distribution.
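A sketch of this setup is given below, with placeholder modules standing in for the pre-trained real codebook and decoder, the synthetic encoder and the latent discriminator (the real architectures are described in the following paragraphs); only the idea of freezing the pre-trained parts and optimizing the remaining modules is shown.

```python
import torch
import torch.nn as nn

# Placeholders for C_real, D_real, E_syn and the latent discriminator.
codebook_real = nn.Embedding(512, 256)                   # frozen codebook C_real
decoder_real = nn.Sequential(nn.Linear(256, 3 * 256))    # frozen decoder D_real (placeholder)
E_syn = nn.Sequential(nn.Linear(3 * 256, 256))           # trainable synthetic encoder (placeholder)
latent_disc = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

def freeze(module: nn.Module):
    """Disable gradient updates so pre-trained weights stay fixed."""
    for p in module.parameters():
        p.requires_grad = False

freeze(codebook_real)
freeze(decoder_real)

# Only the synthetic encoder and the latent discriminator receive gradient updates.
optimizer = torch.optim.Adam(
    list(E_syn.parameters()) + list(latent_disc.parameters()), lr=2e-4
)
```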
Table 4.1: Discriminator layers used for conditioning on real images, consisting of 2D convolutional layers, the LeakyReLU activation function and batch normalization.
These layers are repeated in the same order, forming a complete residual block. An attention block has a GroupNorm layer followed by four convolutional layers with 1 × 1 kernel size and stride 1. Further, there are four upsampling blocks, each having a conv2d layer (k = 3 × 3, s = 1, p = 1) and two residual blocks. The number of features is then reduced with a single-stride convolutional layer with a 3 × 3 window. Finally, the image is provided to the discriminator for classification into real or fake, using the combination of convolution layers, activation functions and batch normalizations shown in table 4.1.
After training, the codebook and decoder, which have learned real representations, are frozen for further use in the following steps.
Initially, the real images are fed to the encoder E_real, which was already trained as described in 4.2, to obtain the real latent representation. We then take another encoder, E_syn, to obtain a latent distribution from the synthetic images. These real and synthetic representations are fed to the discriminator, which distinguishes between real and computer-generated representations. Here, the E_real output is used to regularize the E_syn predictions. Once E_syn has learned to generate real-like latent codes, we use it in the inference phase to create images with the already frozen codebook C_real and decoder D_real from 4.2 above. The architectures of both encoders are identical to the encoder of 4.2. The discriminator has several layers of 2D convolutions, batch normalization and the Leaky ReLU activation function, as shown in table 4.2.
Regarding the architectural details, images of size 256 × 256 with RGB channels are fed to E_syn, which outputs a latent representation of dimension 16 × 16 with 256 features. These latents are then provided to the generator model, as shown in 4.6. The first layer is a convolution with kernel size 3 × 3, single padding and single stride, followed by two convolutional layers with filter size 3 × 3, stride 2 and padding 1. All the convolutional layers are followed by batch normalization and LeakyReLU. Next come five residual blocks, followed by upsampling blocks that restore the input size. The generator's output is subsequently fed to the discriminator for classification into real or fake. The other generator and discriminator follow a similar procedure.
Table 4.2: Discriminator layers having 2D convolutional layers, LeakyReLU activation function and batch normalization.
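A sketch of a discriminator built from the ingredients listed in table 4.2 (2D convolutions, batch normalization, LeakyReLU) is shown below; the exact channel counts and number of blocks are assumptions for illustration, not the configuration reported in the table.

```python
import torch
import torch.nn as nn

def disc_block(in_ch, out_ch):
    # Conv -> BatchNorm -> LeakyReLU, the repeating pattern of the discriminator
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

discriminator = nn.Sequential(
    disc_block(3, 64),
    disc_block(64, 128),
    disc_block(128, 256),
    nn.Conv2d(256, 1, kernel_size=3, stride=1, padding=1),   # patch-wise real/fake logits
)

x = torch.randn(1, 3, 256, 256)
logits = discriminator(x)   # shape (1, 1, 32, 32)
```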
Subsequently, the output is concatenated with a class token at the beginning of each sequence, leading to a flattened tensor of 17 tokens with 256 dimensions each. The positional embedding of each patch is then added, and the result is sent to the transformer encoder. We use only the transformer encoder for the classification task. The encoder has recurring layers of MLP and MHSA. Layer normalization (LN) is applied before the self-attention block and the MLP block, and every block is followed by a residual connection. Finally, the classification head takes the class token and outputs the true/false prediction.
5. Experiments and Results
In this section, we explain the experimental settings for the investigated methods and discuss the results obtained. Later, the data generated from synthetic images is tested to check its suitability for object detection models.
Figure 5.1: Original and reconstructed images. The top row shows the original images and the bottom row the reconstructed images.
Figure 5.1 shows the original images and the reconstructed images from the model. The reconstructed images are blurry, and some parts are distorted. One reason can be the resizing of the images: there is a loss of information when 2048 × 1024 resolution images are resized to 256 × 256.
Figure 5.2: Original and reconstructed image patches. The top row shows the original patches and the bottom row the reconstructed patches.
Figure 5.3: Original and reconstructed images after patch-wise training. The top row shows the original images and the bottom row the reconstructed images.
Further, to improve the reconstruction quality, instead of resizing the whole image, we randomly chose patches of dimension 256 × 256 from each image. In this way, we can preserve the information needed for better reconstruction. As can be observed in figure 5.2, the tiny details are also reconstructed compared to the first model: bicycle spokes and small pieces of paper on the ground are clearly reconstructed well with the help of patches. Figure 5.3 shows the original and reconstructed images from the patch-wise training. All experiments were conducted with batch size 4, learning rate 0.0002 and the ADAM optimizer for 500 epochs. The reconstruction loss and VQ loss are the metrics used for selecting the codebook and decoder. Similarly, we trained on synthetic images both with patches and with resized images.
To tackle this issue, we used a pretraining strategy that helps the model learn the generic information in the images. Additionally, this encourages the encoder to perform better, and the results are more reasonable, as shown in figure 5.4.
After hyperparameter tuning of, for example, the batch size and learning rate, we lowered the FID from 143.57 to 67.67, indicating that the experiments were moving in the right direction. From figure 5.4, we can observe that the roads, footpaths and building textures are adapted to the real domain, but reconstructed objects far away remain ambiguous. A precise translation of an image can be seen in figure 5.6: almost all the content is transferred from synthetic to natural, such as street poles and road markings. The half-visible car in the synthetic image is fully restored in the generated image, which may be due to codebook learning. We also implemented the model using the Wasserstein loss [Martin 17] with weight clipping for a further improvement in image generation quality. Figure 5.7 shows a clear improvement in texture, which resembles the CityScapes dataset more closely; side paths and nearby vehicles are generated in a realistic fashion. In figure A.2, we can observe that nearby objects are reconstructed well, whereas objects farther away either cannot be generated or are blurry when generated. We can also see that generating data from unknown latent representations is difficult: in figures A.3 and A.4, the colour of the sky (sunset, sunrise) in the synthetic images has affected the reconstructions, and the models are unable to replicate it, also distorting nearby pixels.
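A minimal sketch of the Wasserstein critic update with weight clipping used above is shown below; the clip value of 0.01 and the RMSprop learning rate follow the original WGAN formulation [Martin 17], and the latent dimensions are stand-ins rather than our exact training configuration.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(256, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

real_latents = torch.randn(32, 256)   # stand-ins for real latent codes
fake_latents = torch.randn(32, 256)   # stand-ins for synthetic latent codes

# Wasserstein critic loss: maximize E[critic(real)] - E[critic(fake)]
opt_c.zero_grad()
loss_c = -(critic(real_latents).mean() - critic(fake_latents).mean())
loss_c.backward()
opt_c.step()

# Weight clipping keeps the critic approximately 1-Lipschitz
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)
```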
Additionally, an experiment was conducted by changing the input given to the discriminator while training the synthetic encoder in the VQ-GAN model. Instead of providing synthetic data to the discriminator, we provide real data, in order to map the translation between the domains. We observed slight differences, as shown in figure 5.5: the road texture and trees are more realistic.
Figure 5.4: Images generated after training using discriminators. Real-like image reconstruction
using pretrained encoder on synthetic data.
Figure 5.5: Images generated using VQ-GAN by feeding real images to the discriminator. From
left to right, first two images are synthetic, third and fourth are generated images respectively.
Figure 5.7: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right is the generated image.
Figure 5.8: Images generated after training using CycleGAN. From left to right, the first two images are synthetic and the third and fourth are the corresponding generated images.
We tried to use CycleGAN to avoid the inconsistencies that occurred when using only a discriminator. Several experiments were conducted to utilize the cycle consistency loss, as explained in section 2.3.8. However, we were unable to extract helpful information for image reconstruction. We therefore tweaked the model by feeding G_syn with synthetic images instead of their embeddings, and the resulting outputs are shown in figure 5.8. Though these are also not up to the mark, the model tries to generate most of the image with road texture. We use two generators that force each other to generate real latents from synthetic images and vice versa. This might have led the model to learn the real latent distribution, resulting in a BMW symbol being generated at the centre of the image, a feature that is predominant in the CityScapes dataset.
We utilize two pretrained VQ-GAN architectures, trained on real and synthetic images respectively, as indicated in 5.1. In this experiment, we consider the latent space of the synthetic image encoder to generate real-like images. The encoder output is given to the frozen codebook and decoder obtained from training on real scenarios. In figure 5.9, we can observe the realism of the elements. For example, the texture of the vehicles in the GTAV dataset is glossy, whereas in the CityScapes dataset it is not; in our generated images, the cars are not as shiny as in the synthetic data, indicating that our codebook has learnt naturalistic features. The stony structures are not well generated because our real dataset does not contain them; instead of rocks, the model generates building- or wall-like structures.
The KID and FID scores of the subjectively better performing models have been calculated and are shown in table 5.1. In this table, "cross-domain encoder" means using an encoder pretrained on the synthetic data and feeding its latent representations to the codebook and decoder, which are pretrained on real data. The baseline in the table represents the FID and KID similarity scores between the CityScapes dataset and the synthetic dataset (GTAV). The perceptual scores obtained on the generated images are as follows.
Figure 5.9: Image generated by using cross-domain encoders. The left image is synthetic and the right is the generated image.
Table 5.1: Summary of perceptual metrics. Here FID is the Fréchet Inception Distance and KID is the Kernel Inception Distance; lower values indicate better model performance.
5.2.5 Discussion
Several intriguing observations emerged throughout the experiments. Discriminator-based methods are superior to transformer-based methods in our current setting. Additionally, pre-training the models on synthetic images helped the encoders produce latent representations appropriate for image reconstruction. Image generation with cross-domain encoders showed better performance, as indicated by the lower FID scores; not only quantitatively but also subjectively, these images are better at expressing real perception. The approach of using a discriminator to differentiate latent spaces, in the same fashion as we use it for images, did not work as expected. This may be due to the size of the latent spaces: the discriminator further reduces the latent space size, leaving it with few features to learn from. The GANs with Wasserstein loss showed a significant improvement in the scores as well as subjectively. However, this model still has limitations when reconstructing image content that is completely unavailable in the real data distribution.
We could not get the expected results from CycleGAN and transformers, though they generally perform well on vision tasks.
6. Conclusion
This thesis addresses the data limitation problem in certain environmental scenarios. Using image-to-image translation, we would like to generate the data required for remote robotic applications, as data collection is quite difficult in these regions. In addition, synthetic data can be used for generalized tasks; however, for downstream tasks the performance drops, since CNNs rely heavily on textures. By reducing the gap between the domain distributions, we can leverage synthetic data for real-world deployments.
Overall, we obtained visually better results when the discriminator was used to differentiate real and fake latents. With further hyperparameter tuning and by utilizing the Wasserstein loss, we achieved significant improvements, and the quantitative measurements confirm this. In most of the implementations, real texture has been adapted to the synthetic images. It was also observed that codebook learning played a certain role in image generation; as a result, there is heavy distortion when the model encounters unseen data that was not present in the real sets.
As future work, the discriminator can be tuned further for better performance, and the investigation of CycleGANs could be continued. Finally, instead of using transformers only as discriminators, transformer encoders could be employed for this task.
Bibliography
[Abadi 16] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: a system for large-scale machine learning”, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016, pp. 265–283.
[Bao 21] H. Bao, L. Dong, F. Wei, “Beit: Bert pre-training of image transformers”, arXiv
preprint arXiv:2106.08254, 2021.
[Chen 22] Y.-J. Chen, S.-I. Cheng, W.-C. Chiu, H.-Y. Tseng, H.-Y. Lee, “Vector Quantized
Image-to-Image Translation”, in European Conference on Computer Vision, Springer.
2022, pp. 440–456.
[Denton 15] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a Laplacian pyramid of adversarial networks”, Advances in Neural Information Processing Systems, vol. 28, 2015.
[Devlin 18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, “Bert: Pre-training of deep
bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805,
2018.
[Enis Cetin 06] A. Enis Cetin, O. Nezih Gerek, “Vector Quantization”, Wiley Encyclopedia
of Biomedical Engineering, 2006.
[Goodfellow 16] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016.
[Hinton 12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,
V. Vanhoucke, P. Nguyen, T. N. Sainath et al, “Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups”, IEEE Signal
processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
[Hochreiter 98] S. Hochreiter, “The vanishing gradient problem during learning recurrent
neural nets and problem solutions”, International Journal of Uncertainty, Fuzziness
and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[Hoffman 18] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros,
T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation”, in International
conference on machine learning, Pmlr. 2018, pp. 1989–1998.
[Huang 18] S. Huang, C. Lin, S. Chen, Y. Wu, P. Hsu, S. Lai, “Cross Domain Adaptation
with GAN-Based Data Augmentation”, Proceedings of the Lecture Notes in Computer
Science: Computer Vision—ECCV, 2018.
[Ioffe 15] S. Ioffe, C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift”, in International conference on machine learning,
PMLR. 2015, pp. 448–456.
[Isola 17] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, “Image-to-image translation with
conditional adversarial networks”, in Proceedings of the IEEE conference on computer
vision and pattern recognition. 2017, pp. 1125–1134.
[Kurita 19] T. Kurita, “Principal component analysis (PCA)”, Computer Vision: A Ref-
erence Guide, pp. 1–4, 2019.
[LeCun 15] Y. LeCun, Y. Bengio, G. Hinton, “Deep learning”, nature, vol. 521, no. 7553,
pp. 436–444, 2015.
[Liu 16] M.-Y. Liu, O. Tuzel, “Coupled generative adversarial networks”, Advances in
neural information processing systems, vol. 29, 2016.
[Liu 17] M.-Y. Liu, T. Breuel, J. Kautz, “Unsupervised image-to-image translation net-
works”, Advances in neural information processing systems, vol. 30, 2017.
[Liu 21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, “Swin transformer:
Hierarchical vision transformer using shifted windows”, in Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 10012–10022.
[Long 15] J. Long, E. Shelhamer, T. Darrell, “Fully convolutional networks for semantic
segmentation”, in Proceedings of the IEEE conference on computer vision and pattern
recognition. 2015, pp. 3431–3440.
[Martin 17] A. Martin, C. Soumith, L. Bottou, “Wasserstein GAN”, arXiv preprint arXiv:1701.07875, 2017.
[Ren 22] J. Ren, Q. Zheng, Y. Zhao, X. Xu, C. Li, “DLFormer: Discrete Latent Transformer
for Video Inpainting”, in Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2022, pp. 3511–3520.
[Richter 16a] S. R. Richter, V. Vineet, S. Roth, V. Koltun, “Playing for data: Ground
truth from computer games”, in European conference on computer vision, Springer.
2016, pp. 102–118.
[Richter 16b] S. R. Richter, V. Vineet, S. Roth, V. Koltun, “Playing for Data: Ground
Truth from Computer Games”, in European Conference on Computer Vision (ECCV), ser.
LNCS, B. Leibe, J. Matas, N. Sebe, M. Welling, Eds., vol. 9906. Springer International
Publishing, 2016, pp. 102–118.
[Theiss 22] J. Theiss, J. Leverett, D. Kim, A. Prakash, “Unpaired Image Translation via
Vector Symbolic Architectures”, in European Conference on Computer Vision, Springer.
2022, pp. 17–32.
[Van Den Oord 17] A. Van Den Oord, O. Vinyals et al, “Neural discrete representation
learning”, Advances in neural information processing systems, vol. 30, 2017.
[Van Dyck 21] L. E. Van Dyck, R. Kwitt, S. J. Denzler, W. R. Gruber, “Comparing Object
Recognition in Humans and Deep Convolutional Neural Networks—An Eye Tracking
Study”, Frontiers in Neuroscience, vol. 15, p. 750639, 2021.
[Wang 18] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, “High-
resolution image synthesis and semantic manipulation with conditional gans”, in Pro-
ceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp.
8798–8807.
[Wu 19] H. Wu, M. Flierl, “Learning product codebooks using vector-quantized autoen-
coders for image retrieval”, in 2019 IEEE Global Conference on Signal and Information
Processing (GlobalSIP), IEEE. 2019, pp. 1–5.
[Xie 20] X. Xie, J. Chen, Y. Li, L. Shen, K. Ma, Y. Zheng, “Self-supervised cyclegan
for object-preserving image-to-image domain adaptation”, in European Conference on
Computer Vision, Springer. 2020, pp. 498–513.
[Yi 17] Z. Yi, H. Zhang, P. Tan, M. Gong, “Dualgan: Unsupervised dual learning for
image-to-image translation”, in Proceedings of the IEEE international conference on
computer vision. 2017, pp. 2849–2857.
[Zhang 18] Z. Zhang, L. Yang, Y. Zheng, “Translating and segmenting multimodal medical
volumes with cycle-and shape-consistency generative adversarial network”, in Proceedings
of the IEEE conference on computer vision and pattern Recognition. 2018, pp. 9242–9251.
[Zhu 17] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, “Unpaired image-to-image translation
using cycle-consistent adversarial networks”, in Proceedings of the IEEE international
conference on computer vision. 2017, pp. 2223–2232.
A. Appendix
Figure A.1: Images generated after training with discriminators. From left to right, the first two images are synthetic and the third and fourth are the corresponding generated images. The two generated images on the right show mode collapse.
Figure A.2: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right is the generated image.
Figure A.3: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right is the generated image.
Figure A.4: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right is the generated image.