
Robotics Research Lab

Department of Computer Science


Technische Universität Kaiserslautern

Master Thesis

Image-to-Image translation using


Quantized Representations

Harshavardhan Reddy Moravapalli

January 18, 2023


Master Thesis

Image-to-Image translation using


Quantized Representations

Robotics Research Lab


Department of Computer Science
Technische Universität Kaiserslautern

Harshavardhan Reddy Moravapalli

Day of issue : 18.07.2022


Day of release : 18.01.2023

First Reviewer : Prof. Dr. Karsten Berns


Supervisor : M.Sc. Jakub Pawlak

Hereby I declare that I have composed the Master Thesis at hand independently. The sources and aids used have been marked in the text and are listed exhaustively in the bibliography.

January 18, 2023 – Kaiserslautern

(Harshavardhan Reddy Moravapalli)


Abstract

Deep learning has excelled in numerous computer vision applications, but such models still have a limited capacity to combine data from different domains. As a result, performance can be negatively impacted by even small changes in the training data. Domain shift, the difference between observed and expected data, is a crucial issue in robotics, as a slight change in the data an algorithm receives can put a physically operating robot at risk. One potential strategy is to modify existing data so that it better matches what an algorithm will encounter during operation, using generative methods that produce data conditioned on an input. In this study, we investigate various Generative Adversarial Networks and Vector Quantization methods capable of generating real-world scenarios from computer-generated images.
Contents

1 Introduction 1
1.1 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background 3
2.1 Synthetic vs. Real Images . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Supervised and Unsupervised learning . . . . . . . . . . . . . . . . 4
2.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Artificial Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.2 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.3 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . 6
2.3.4 Regularization techniques for CNNs . . . . . . . . . . . . . . . . . . 7
2.3.5 Optimization of Artificial Neural Networks . . . . . . . . . . . . . . 8
2.3.6 Latent Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.7 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.8 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Literature Review 17

4 Concept and Implementations 21


4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Pre-processing step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.1 Pre-training of model . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.2 Generation of images using discriminators . . . . . . . . . . . . . . 24
4.3.3 Generation of images using CycleGAN . . . . . . . . . . . . . . . . 26
4.3.4 Generation of images using Transformers . . . . . . . . . . . . . . . 28
4.4 Perceptual metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Experiments and Results 31


5.1 Pre-training of model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.1 Data Generation using Discriminators . . . . . . . . . . . . . . . . 33
5.2.2 Data Generation using CycleGAN and Transformers . . . . . . . . 36
5.2.3 Data Generation using cross domain encoder . . . . . . . . . . . . . 36
5.2.4 Perceptual Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6 Conclusion 39

Bibliography 41

A Appendix 47
List of Figures

2.1 Classification based on shape and texture . . . . . . . . . . . . . . . . . . . 4


2.2 Schematic diagram of Single Neuron . . . . . . . . . . . . . . . . . . . . . 5
2.3 Schematic diagram of Multi-layer Perceptron . . . . . . . . . . . . . . . . . 6
2.4 Representation of parameters in Convolutional Layers . . . . . . . . . . . . 7
2.5 Representation of ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Schematic diagram of VAE . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Schematic diagram of VQ-VAE . . . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Schematic diagram of GAN . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.9 Schematic diagram of CycleGAN . . . . . . . . . . . . . . . . . . . . . . . 14
2.10 Schematic diagram of VQ-GAN . . . . . . . . . . . . . . . . . . . . . . . . 14
2.11 Schematic diagram of Transformer . . . . . . . . . . . . . . . . . . . . . . . 16

4.1 Exemplary images from dataset . . . . . . . . . . . . . . . . . . . . . . . . 22


4.2 Schematic diagram of implementation of real image reconstruction . . . . . 24
4.3 Architecture of VQ-GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Schematic diagram of implementation using discriminators . . . . . . . . . 26
4.5 Schematic diagram of implementation using CycleGAN . . . . . . . . . . . 27
4.6 Architecture of generator in CycleGAN . . . . . . . . . . . . . . . . . . . . 28
4.7 Schematic diagram of implementation using transformers . . . . . . . . . . 29

5.1 Original and reconstructed images . . . . . . . . . . . . . . . . . . . . . . . 31


5.2 Original and reconstructed image patches . . . . . . . . . . . . . . . . . . . 32
5.3 Original and reconstructed images after patch-wise training . . . . . . . . . 32
5.4 Images generated after training using discriminators . . . . . . . . . . . . . 34
5.5 Images generated using VQ-GAN mode . . . . . . . . . . . . . . . . . . . . 34
5.6 Images generated after training using discriminators . . . . . . . . . . . . . 35
5.7 Images generated after training using discriminators with Wasserstein . . . . 35
5.8 Images generated after training using cycleGAN . . . . . . . . . . . . . . . 36

5.9 Image generated by using cross domain encoders . . . . . . . . . . . . . . . 37

A.1 Images generated after training on discriminators . . . . . . . . . . . . . . 47


A.2 Images generated after training using discriminators with Wasserstein . . . . 48
A.3 Images generated after training using discriminators with Wasserstein . . . . 48
A.4 Images generated after training using discriminators with Wasserstein . . . . 49
List of Tables

4.1 Discriminator layers for VQ-GAN . . . . . . . . . . . . . . . . . . . . . . . 24


4.2 Discriminator layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.1 Summary of table with perceptual metrics . . . . . . . . . . . . . . . . . . 37


1. Introduction

The human brain has the capacity to perceive objects in three dimensions, which gives it tremendous potential for broad environmental understanding. In addition to recognizing an item's shape, form, density, and weight, the brain also understands how to interact with the item and manipulate it so that it can be moved or used to carry out tasks. For humans, it is easy to perceive changes in the environment, and even when we view something painted by an artist, the brain can adapt how we see an item. Our capacity to recognize objects visually feels effortless [Rensink 00]. Despite the enormous variation in appearance that each thing produces in our eyes, humans are able to identify and categorize objects in a fraction of a second from among tens of thousands of possibilities. The perceptual and cognitive functions of the brain allow us to learn the nature of an object even if we have seen it only once and to use that memory to identify it in a new, real environment [Van Dyck 21]. Replicating these human capabilities is quite challenging, as perceived objects can be hidden or blurred and differ across viewpoints, leaving machines with a poor sense of their environment in general.

Though deep learning models achieve outstanding performance, they are highly data specific and do not perform well even under subtle deviations from the training distribution. Furthermore, real-world conditions such as variations in illumination and in the environment in which a model is deployed may differ from the training and testing data, adding complexity and lowering performance. Beyond this, the kind of processed data used to train models adds to the problem. For instance, models trained on ImageNet, where images are preprocessed to be free of noise and blur and aligned to the centre, can achieve good benchmark performance that does not carry over to real life. In addition, data scarcity for rarely captured conditions such as forest areas and off-road scenes must also be considered. These challenges have been addressed with large amounts of generated simulation data. Synthetic data is a crucial strategy for tackling data problems, either by employing data manipulation techniques to provide fresh and varied training examples or by creating the desired simulated data from scratch. Yet even if the simulated domain provides an endless supply of data, the issue of domain shift between it and the real world remains. The complexity of reality cannot be fully captured in simulation, hence an algorithm developed on simulated data may not perform well when applied to real situations [Shrivastava 17, Theiss 22]. To assist models in transferring from synthetic training sets to real test sets, one must make the synthetic data as realistic as feasible and develop methods for doing so. Consequently, domain adaptation emerges as a central theme in synthetic data research.

Several techniques have been designed to decrease the domain gap between data distributions. One of the most widely studied techniques for image-to-image translation is the Generative Adversarial Network (GAN) [Denton 15, Salimans 16]. In addition, various style transfer methods have emerged to address this issue. However, the conditional context in GANs and style transfer methods frequently makes them learn straightforward recolouring or regional transformations without comprehending the actual target distribution [Wang 18]. Recent studies [Esser 21] have demonstrated the usefulness of vector quantization (VQ) as a latent representation for generative models [Enis Cetin 06, Wu 19]. This work utilizes the VQ-VAE [Van Den Oord 17], GAN [Goodfellow 20] and transformer [Dosovitskiy 20] architectures to bring the latent representation of synthetic images closer to the real latent representation. Further, we obtain synthetic-to-real image translation using the transformed embedding space of synthetic images together with the pretrained codebook and decoder.

1.1 Outline of thesis


The structure of the thesis is as follows. Chapter 2 starts with the background of synthetic and real images and their limitations. We then explain the theoretical concepts of machine learning and deep learning. Finally, we shed light on some significant state-of-the-art models. Chapter 3 provides a literature review and an overview of recent work. In chapter 4, we elucidate the approaches and the datasets used in the current work. We present the results and discussion in chapter 5 and end the thesis with a conclusion in chapter 6.
2. Background

We elucidate the foundational concepts employed in the current work, starting with a brief description of the types of images involved. We then explain basic machine learning concepts, artificial neural networks, and convolutional neural networks. Finally, we shed light on state-of-the-art models ranging from Generative Adversarial Networks to Transformers, which are prominent for image generation.

2.1 Synthetic vs. Real Images

Many machine learning models require large, diverse, annotated datasets to perform their tasks reliably. Gathering this kind of data is expensive and time-consuming to process. One alternative is synthetic data. In contrast to real data collected from actual natural events, synthetic data is generated by a computer using algorithmic methods to mimic the properties and structure of critical real-world data. Synthetic data points resemble real-world events but do not reflect them exactly. Synthetic data provides an unlimited, affordable and high-quality supply for training machine learning models. However, things are a little more complicated in reality: synthetic data often fails to generalize to real scenarios and shows inconsistencies when replicating the complexities of the original datasets [Alkhalifah 21]. As CNNs are biased more towards texture than towards an object's shape [Geirhos 18], training a model with synthetic data may amplify these biases, because the texture of the contents in a synthetic image differs from the actual instance. Unlike humans, CNNs focus on texture cues and detect an elephant-textured cat as an elephant, as shown in figure 2.1. To effectively use synthetic data as a source for machine learning models, it must have a statistical distribution similar to that of a real dataset. The synthetic data points should be indistinguishable from real ones, and there should also be considerable variation between individual synthetic data points. As it is difficult to capture every possible real-world situation, such as forest environments or defence scenarios, we rely on synthetic data, which can be extracted from computer games [Richter 16a].

Figure 2.1: Classification based on shape and texture by ResNet-50, adapted from [Geirhos 18]. (a) shows texture cues from an elephant, (b) has a cat as content, and (c) is a combination of the cat shape with elephant texture.

2.2 Machine Learning


Machine learning is a branch of artificial intelligence that applies algorithms to identify the underlying relationships between data and information. A machine learning algorithm performs a specific task by learning from data without being given detailed instructions. [Mitchell 97] expresses the concept of learning nicely: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." We briefly discuss the foundations of machine learning in this section, covering supervised and unsupervised learning, before describing multilayer perceptrons and convolutional neural networks (CNNs) following [Goodfellow 16].

2.2.1 Supervised and Unsupervised learning


Supervised learning techniques draw connections between data and a particular label. These algorithms utilize a training dataset with corresponding labels to develop a machine learning model that predicts labels for new data. For supervised learning in the context of image-to-image translation, paired samples (i.e. synthetic and real images with the same content but a domain difference) are required, and these paired images must correspond to each other. In contrast, unsupervised or self-supervised learning does not require annotated images for training. It relies entirely on the features of each sample in the dataset and is therefore advantageous for the domain transfer task, removing the tedious collection of paired images. Unsupervised learning tries to obtain meaningful features without the use of labels.

2.3 Deep Learning


Machine learning is a subfield of artificial intelligence, and deep learning is one of its subfields. This section starts with the basic components of an artificial neuron and progresses to sophisticated CNNs.

2.3.1 Artificial Neuron


In a neural network (NN), the artificial neuron is the main computational component. It resembles the biological neuron, which fires when information is received and passes the information on to connected neurons. Although much of the human brain is not yet understood, the artificial neuron can be considered analogous to a human neuron [Demuth 14].

a = f (n) = f (w · p + b) (2.1)

Figure 2.2: Schematic diagram of Single Neuron with activation function adapted from
[Demuth 14]

As shown in figure 2.2, each neuron receives m inputs p ∈ Rᵐ from external sources or neighbouring nodes, each of which is assigned a weight w ∈ Rᵐ. A bias component b is added to the weighted sum of these inputs to produce the net input n. Applying an activation function to n produces the output in equation 2.1, written in vector form. A wide range of linear and non-linear functions, for example ReLU, LeakyReLU, and Swish, are available as the activation function f. The Rectified Linear Unit (ReLU) activation, a piece-wise linear function f(n) = max(0, n), is often used in current networks.
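To make equation 2.1 concrete, here is a minimal sketch of a single neuron in PyTorch (the framework used later in this work); the input, weight and bias values are arbitrary examples:

```python
import torch

def neuron(p, w, b):
    """Single artificial neuron of eq. 2.1: a = f(w · p + b), here with ReLU as f."""
    n = torch.dot(w, p) + b   # weighted sum of the inputs plus bias
    return torch.relu(n)      # piece-wise linear activation

p = torch.tensor([0.5, -1.0, 2.0])   # m = 3 inputs
w = torch.tensor([0.1, 0.4, -0.2])   # one weight per input
b = torch.tensor(0.05)
a = neuron(p, w, b)
```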

2.3.2 Multilayer Perceptron

A multilayer perceptron (MLP) combines several layers of neurons. The first layer, which takes the input, is called the input layer, while the last layer, which produces the output, is called the output layer. The layers in between are called hidden layers. Every layer has its own weight matrix W, net inputs, biases, and outputs, as indicated in figure 2.3; for a two-layer network the output is given by equation 2.2. MLPs form the core of deep learning and are considered universal approximators [Hornik 89].

y = f²(W² f¹(W¹ · p + b¹) + b²)    (2.2)
An MLP can approximate any real, continuous function if it has at least one hidden layer. However, no universal rule specifies the necessary number of neurons and layers or the optimisation method [Hornik 89]. Nowadays, practically every published design has many layers of neurons due to advancements in computing power. Since deep learning is a very active research area, new designs are proposed constantly. CNNs are state-of-the-art techniques for problems such as speech recognition, image identification, and computer vision in general [Krizhevsky 17, Hinton 12]. We continue our discussion with CNNs, another crucial deep-learning topic, in the following section.

Figure 2.3: Schematic diagram of Multi-layer Perceptron adapted from [Demuth 14]

2.3.3 Convolutional Neural Network


Convolutional neural networks are the foundation of many computer vision techniques, including image recognition, detection, etc. According to the definition in [Goodfellow 16], CNNs are a particular kind of neural network for processing input with a grid-like topology, using the mathematical operation of convolution. Given the pace of deep learning research, with new architectures published regularly, it is challenging to single out an ideal architecture; however, CNN designs are based on the structural elements described below. Depending on the architecture, different kernel sizes can be used to extract features. In the present work we use 1×1, 3×3, and 5×5 kernels.

o = (i + 2p − k)/s + 1    (2.3)

Sparse interactions eliminate the requirement of fully connected neural networks, where each neuron is coupled to every neuron in the previous layer. Instead, a kernel K smaller than the input connects only a small subset of the input in the vicinity of the kernel centre to the convolution's output. The size of this kernel area is referred to as the kernel size. In addition to the kernel size, the stride and the total number of feature maps are further crucial hyperparameters. When moving the kernel across the input, the stride indicates the step size to be used [LeCun 15]. Padding can be used to preserve a specific output size. The number of feature maps indicates how often the convolution is applied to the input; one feature map is produced as output for each convolution of the input. The feature maps, or output layers, correspond to the edges and textures of the input image. As the depth of the convolutional layers increases, the extracted features shift from generic to semantically rich. Sharing parameters makes convolution storage-efficient and enables the reuse of weights across different input areas.

Figure 2.4: Representation of parameters in Convolutional Layers

In contrast to conventional fully connected NNs, CNNs do not assign a separate weight to each neuron connection. Instead, within each convolution operation, the kernel's weights remain constant. As a result, every region of the input image is processed with the identical kernel weight matrix. Equivariance to translation refers to the fact that when the input of a convolution operation is translated, so is its output; more specifically, f and g are equivariant if f(g(x)) = g(f(x)). In the case of 2D images, this means that a shift in the input image produces a corresponding shift in the output. For a given input, the convolution layer output size is determined by formula 2.3, where o is the output size, i the input image dimension, k the kernel size, s the stride, and p the padding; the pictorial representation is given in figure 2.4. Padding helps obtain the desired output size by enabling the kernel to consider the edges. In some network topologies, such as autoencoders, images must be recovered from the feature maps by applying upsampling or transposed convolution techniques. These are covered in further detail in the following chapters.
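As a small illustration of formula 2.3, the helper below computes the output size of a convolution; the example values are arbitrary and only meant to show the arithmetic:

```python
def conv_output_size(i: int, k: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size of a convolution, o = (i + 2p - k) / s + 1 (eq. 2.3)."""
    return (i + 2 * p - k) // s + 1

# A 256x256 input with a 3x3 kernel, stride 2 and padding 1 yields a 128x128 feature map.
assert conv_output_size(256, k=3, s=2, p=1) == 128
```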

2.3.4 Regularization techniques for CNNs


This section introduces regularization techniques that benefit deep NNs and CNNs. Dropout is a powerful and flexible regularization technique that can be applied to many different models [Srivastava 14]. Here, a network's neurons are stochastically dropped during training at a specific rate, known as the drop rate. Batch normalization is one of the regularization methods used in modern CNN designs. It addresses the internal covariate shift, a general problem of deep networks with several hidden layers [Ioffe 15]. The gradients in networks with several hidden layers describe each parameter's update under the assumption that all other parameters stay fixed, but in reality all parameters are updated simultaneously [Goodfellow 16]. As a result, each layer's input distribution changes throughout training, which is known as the internal covariate shift. Batch normalization mitigates this phenomenon: each layer's inputs are normalized with respect to a mini-batch. As a result, the nonlinear activation functions do not saturate, which offers faster and more reliable convergence [Ioffe 15]. Layer normalization and instance normalization are closely related to BN; instance normalization, however, normalizes each channel within each training example rather than across the examples of a mini-batch.
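A minimal PyTorch sketch of how the regularizers above are typically combined in a CNN block; the channel counts and drop rate are illustrative choices, not the configuration used in this work:

```python
import torch.nn as nn

# Convolution followed by batch normalization, ReLU and dropout.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),    # normalizes activations over the mini-batch
    nn.ReLU(),
    nn.Dropout2d(p=0.2),   # stochastically drops whole feature maps at a 20% rate
)
```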

2.3.5 Optimization of Artificial Neural Networks


The main objective of any NN model is to approximate a function f* that describes the proper input-to-output mapping. To this end, a mapping function y = f(x; θ) is defined, whose trainable parameters θ are the weights and biases of the NN. An optimization algorithm modifies these parameters θ during training to approximate f* by minimizing a cost function. Gradient-based techniques are most frequently employed in current NNs to improve the network parameters. For example, the MSE is a commonly used cost function J(θ) for regression tasks. In gradient-based optimization, the gradient of J(θ) is calculated with respect to each parameter in θ.

J(θ) = ½ ||y − f(x, θ)||²    (2.4)

The parameters θ are then updated using the computed gradients as follows.

θ(t+1) = θ(t) − α_lr · ∇J(θ)|θ(t)    (2.5)
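The sketch below performs one update according to equations 2.4 and 2.5 using PyTorch autograd; the toy linear model and data are only an illustration:

```python
import torch

def sgd_step(f, theta, x, y, lr=0.01):
    """One gradient-descent step on the MSE cost, eqs. 2.4 and 2.5."""
    cost = 0.5 * torch.sum((y - f(x, theta)) ** 2)        # J(theta), eq. 2.4
    grad, = torch.autograd.grad(cost, theta)              # gradient of J w.r.t. theta
    return (theta - lr * grad).detach().requires_grad_()  # update rule of eq. 2.5

theta = torch.tensor([0.5], requires_grad=True)
x, y = torch.tensor([1.0, 2.0]), torch.tensor([2.0, 4.0])
theta = sgd_step(lambda x, t: x * t, theta, x, y)         # one step towards theta = 2
```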

2.3.6 Latent Space


Most high-dimensional data can be represented in a lower-dimensional form that is still sufficient for generating images. Latent space is the term used for these hidden representations of the distribution of the raw input data. Continuous representations are employed by most architectures, while recent research has concentrated on discrete latent spaces.

Continuous Space
Consider data x ∈ Rⁿ that is generated from a lower-dimensional z ∈ Rᵐ (m < n) by the formula

x = A · z + v    (2.6)

In equation 2.6, v is independent n-dimensional Gaussian noise. The lower-dimensional representation is not available to us; we can only access the raw data. Principal Component Analysis (PCA), a traditional unsupervised approach, tries to find this hidden representation. Our latent spaces are intended to identify non-linear representations, whereas PCA [Kurita 19] focuses on linear representations. These hidden representations can be regarded as compressed versions of a large dataset.

Discrete Spaces
So far in this section we have dealt with latent representations as continuous distributions. However, several types of real data, such as the language representations discussed in [Van Den Oord 17], are inherently discrete, which motivates the transition of hidden representations from continuous to discrete space. This concept is also leveraged by Transformers, a recent breakthrough in computer vision and natural language processing [Ren 22]. [Van Den Oord 17] implemented this idea by replacing the continuous sample space with discrete codebooks, compressing the information bottleneck and enforcing a regularization effect. With this compressed data, sample-specific information can be removed while more semantic information is retained. In the following section, we further describe the use of the codebook that holds the discrete latent representation.

2.3.7 ResNet
According to researchers, convolutional neural networks benefit from depth, which makes sense because a larger parameter space gives models more freedom to adapt to the task at hand. However, it has been shown that performance declines beyond a certain depth, due to the vanishing gradient problem [Hochreiter 98]. When the network is too deep, the gradients of the loss function shrink towards zero after repeated applications of the chain rule. As discussed in section 2.3.4, Batch Normalization (BN) helps to reduce the probability of this occurring, but does not prevent the gradients from vanishing [Basodi 20].

H(x) = f(x) + x    (2.7)

The problem of training deep networks has been addressed by ResNet models, which are composed of residual blocks. The core concept of residual blocks is the skip connection, which skips a few layers through a direct connection, as shown in figure 2.5. Employing these residual connections significantly increases model accuracy because they allow the gradient to flow more easily into the lower layers. The skip connection is expressed in equation 2.7, where x is the output of a previous layer.

Figure 2.5: Representation of ResNet adapted from [Targ 16]
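A minimal PyTorch sketch of a residual block implementing equation 2.7; the layer choices and channel count are illustrative, not the exact blocks used later in this work:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """H(x) = f(x) + x (eq. 2.7): two convolutions plus a skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return F.relu(self.f(x) + x)   # the skip connection adds the input back
```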

2.3.8 State of the Art


In this section, we dive into several state-of-the-art architectures that are relevant to our
study.

Auto Encoders
An autoencoder is the most basic type of neural network used in unsupervised learning, as it trains without explicit labelling. Since autoencoders create their own targets from the training data, they may also be considered self-supervised. They identify non-linear hidden representations of the input data [Bank 20]. An autoencoder consists of an encoder network z = f_θ(x) with parameters θ and a decoder network x̂ = g_ϕ(z) with parameters ϕ, where x is the input, z is the latent vector and x̂ is the reconstruction from the compressed representation. The mean squared error (MSE) between the input x and the reconstruction x̂ is used as the reconstruction loss L.

arg min_{θ,ϕ} L_Rec(x, x̂) = arg min_{θ,ϕ} L_Rec(x, g_ϕ(f_θ(x)))    (2.8)

Equation 2.8 describes the entire autoencoder network. Mathematically, the error can be expressed as ||x − x̂||²₂ or as log p(x|z). The hidden part of the network is also termed the bottleneck, as it condenses all the data into lower dimensions. z must always be smaller than x to force the encoder to identify the key characteristics.
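A minimal sketch of an autoencoder and the reconstruction loss of equation 2.8; the fully connected layers and sizes are illustrative only:

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))   # z = f_theta(x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))    # x_hat = g_phi(z)

    def forward(self, x):
        z = self.encoder(x)        # bottleneck representation
        return self.decoder(z)

loss_fn = nn.MSELoss()             # L_Rec(x, x_hat) of eq. 2.8
```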

VAE
Variational autoencoders, introduced by Kingma et al. [Kingma 19], are one of the standard approaches in the field of unsupervised learning. The structure of the bottleneck representation is where autoencoders and VAEs differ significantly. In the latent space, data points with similar semantic properties should cluster together, while those with dissimilar semantic properties should be dispersed. Furthermore, the bulk of the data distribution should not extend to infinity but instead occupy a limited region of the latent space. The primary problem with standard autoencoders is that the latent space can extend up to infinity, allowing the model to memorise individual data points. Variational autoencoders solve this issue by imposing a probabilistic prior on the latent space.

Figure 2.6: Schematic diagram of VAE.

A VAE can be viewed both as a neural network and as a probabilistic model; we briefly touch on each viewpoint. As shown in figure 2.6, a VAE is structured almost exactly like an autoencoder, with an encoder q_θ(z|x) that outputs a Gaussian probability distribution, a decoder p_ϕ(x|z) that takes a sampled z obtained via the reparameterization trick in formula 2.9, and the latent variables z. The reparameterization trick is a method to compute gradients of a function that contains a random variable, which enables backpropagation from the decoder to the encoder.

z = µ + ϵσ    (2.9)

In equation 2.9, µ is the mean, σ is the standard deviation, and ϵ is a random value drawn from the prior distribution Normal(0, 1). The reconstruction loss is computed with the log-likelihood log p_ϕ(x|z). The loss l_i for every data point x_i can be split as in 2.10.

l_i = −E_{z∼q_θ(z|x_i)}[log p_ϕ(x_i|z)] + KL[q_θ(z|x_i) || p(z)]    (2.10)

L_VAE = L_Rec + L_KL    (2.11)

The first term in equations 2.10 and 2.11, L_Rec, is the reconstruction loss, which drives the network towards better reconstructions: the higher the reconstruction error, the poorer the reconstructions, so the model learns to reduce it. The second term is the Kullback-Leibler divergence between the encoder distribution q_θ(z|x) and the prior distribution p(z) = N(0, 1), with mean zero and standard deviation one. If the encoder produces z values that are not consistent with the standard Gaussian distribution, the loss function incurs a penalty in the form of the KL divergence. This L_KL term ensures that the learned latent distribution is also Gaussian.
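A minimal PyTorch sketch of the reparameterization trick (eq. 2.9) and the loss of equations 2.10 and 2.11; here the reconstruction term uses MSE in place of the exact log-likelihood:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    """z = mu + eps * sigma with eps ~ N(0, 1), eq. 2.9."""
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * log_var)

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus KL divergence against N(0, 1), eqs. 2.10/2.11."""
    rec = F.mse_loss(x_hat, x, reduction="sum")                       # L_Rec
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())    # L_KL
    return rec + kl
```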

VQ-VAE
With the addition of a discrete codebook component, VQ-VAE [Van Den Oord 17] extends the conventional variational autoencoder. The major differences between VQ-VAE and a standard VAE are that the continuous latent space is replaced by a discrete one and that the prior is learned instead of being kept constant [Van Den Oord 17].

Figure 2.7: Schematic diagram of VQ-VAE adapted from [Van Den Oord 17].

As shown in figure 2.7, the encoder produces a latent vector from an input image, which is compared to each embedding vector in the codebook to determine the closest one based on the Euclidean distance, z_q = argmin_i ||z_e(x) − e_i||₂, where z_e(x) is the encoder's output vector and e_i is the i-th embedding vector in the codebook. The corresponding quantized codebook vector z_q is then sent to the decoder for image reconstruction. As the argmin operation is not differentiable, the decoder gradient ∇_z L is copied directly to the encoder during training: the gradient is passed unchanged through the selected codebook vector to the encoder and is zero for all other vectors.

L_VQ = L_Rec + L_alignment + L_commitment    (2.12)

L_VQ = log p(x|q(x)) + ||sg[z_e(x)] − e||²₂ + β||z_e(x) − sg[e]||²₂    (2.13)

The codebook is also learned by gradient descent, like the encoder and decoder; codebook vectors and encoder outputs are pulled towards each other in both directions. The loss function consists of a reconstruction loss, a codebook alignment loss, and a codebook commitment loss, as shown in 2.12 and 2.13. The reconstruction loss is the mean squared error between the original and reconstructed image. The codebook alignment loss pulls the selected embedding vector towards the encoder output, with a stop gradient (sg) on the encoder output. Conversely, the codebook commitment loss places a stop gradient on the codebook vector to make the encoder output commit to the nearest codebook vector. The weight of the commitment loss is scaled by a hyperparameter β. In our current work, we employed VQ-VAE with different hyperparameters, numbers of embedding vectors, and embedding dimensions.
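A minimal sketch of the quantization step with the straight-through gradient and the two codebook losses described above; the tensor shapes are assumptions for illustration:

```python
import torch

def quantize(z_e, codebook):
    """z_e: encoder outputs (N, D); codebook: embedding vectors (K, D)."""
    dist = torch.cdist(z_e, codebook)        # pairwise Euclidean distances (N, K)
    idx = dist.argmin(dim=1)                 # index of the closest embedding
    z_q = codebook[idx]                      # quantized vectors (N, D)

    # Straight-through estimator: the forward pass uses z_q, the backward pass
    # copies the decoder gradient to the encoder as if quantization were the identity.
    z_q_st = z_e + (z_q - z_e).detach()

    alignment = torch.mean((z_e.detach() - z_q) ** 2)    # pulls codebook towards encoder
    commitment = torch.mean((z_e - z_q.detach()) ** 2)   # pulls encoder towards codebook
    return z_q_st, alignment, commitment
```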

Generative Adversarial Networks

One of the recent innovations in the field of AI is the Generative Adversarial Network. GANs are deep-learning-based generative models capable of creating new instances that mimic the training data. Their potential is vast, as they can model nearly any data distribution from any domain, for example images, music or speech; in a sense, GANs are robot artists, and their output is impressive. Goodfellow first introduced the GAN architecture in [Goodfellow 20]. To train a generative model with GANs, the job is conveniently posed as a supervised learning problem with two submodels: a generator model that produces new examples and a discriminator model that tries to classify samples as either real or fake. This makes the generator and discriminator opponents in a strictly competitive game against each other.

As shown in figure 2.8, the generative model G produces an image x̂ = G(z) from samples z drawn from a Gaussian distribution as random input noise. The generated and original images are then provided to the discriminator network D, which distinguishes between fake and real images.

L_GAN = min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (2.14)

Figure 2.8: Schematic diagram of GAN.

D(x) is the discriminator's estimate of the probability that a real sample is real, and D(G(z)) is its estimate of the probability that a fake sample is real. The objective function in equation 2.14 is a minimax function. The first term evaluates the discriminator's predictions on real data, and the second its predictions on fake data. The discriminator wants D(x) to be large, since that represents high confidence that a real sample is real, and it wants D(G(z)) to be as small as possible. The generator, intending to fool the discriminator, wants this value to be as large as possible, which results in an adversarial game. The expectation E denotes the average of the predictions when real or generated data is fed to the discriminator. GANs show promising results for image generation, although optimizing this objective is complex and computationally expensive.
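A compact sketch of one adversarial training step for the objective in equation 2.14; the discriminator is assumed to end in a sigmoid so that it outputs probabilities, and the noise dimension is an arbitrary example:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=128):
    """One adversarial update of eq. 2.14 (sketch)."""
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)

    # Discriminator step: push D(x) towards 1 for real samples and D(G(z)) towards 0.
    d_real, d_fake = D(real), D(fake.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(G(z)) towards 1, i.e. try to fool the discriminator.
    d_fake = D(fake)
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```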

CycleGAN
The Cycle-Consistent Generative Adversarial Network (CycleGAN) is a method for training a deep convolutional neural network for image-to-image translation [Zhu 17]. Using an unpaired dataset, the network learns the mapping between input and output images. These models are trained in an unsupervised fashion, utilizing a set of images from a source domain X and a target domain Y that are not required to correspond to each other. CycleGAN has been used in various applications, including season translation, object transformation, style transfer, and the creation of photographs from paintings. The authors offer a technique that can learn the unique characteristics of one image collection and determine how these characteristics might be transferred to a second image collection, all without any paired examples. CycleGAN is an extension of the GAN architecture that simultaneously trains two generator models and two discriminator models. As shown in figure 2.9, the first generator G takes input samples from the first domain X and generates output samples in the target domain Y, while the second generator F takes input images from the target domain Y and produces images in the first domain X. For each generator, a discriminator is included that tries to distinguish between synthetic and real samples. The generator models are then updated based on how credible the generated images are. This extension using adversarial loss alone could be adequate to produce reasonable images in each domain, but not translations of the input images. Cycle consistency and identity loss are further advancements of the architecture that CycleGAN takes advantage of.

Figure 2.9: Schematic diagram of CycleGAN. G, F are generators and D_X, D_Y are discriminators.

It is proposed that an image produced by the first generator may be used as the input to the second generator, whose output should resemble the original image. The opposite also holds: an output of the second generator may be fed to the first generator, whose output should then match the second generator's input. The absolute pixel-level difference between the original and generated images is used to impose this constraint.
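A minimal sketch of the cycle-consistency term built from this absolute pixel-level difference; the weighting factor is a commonly used value, not necessarily the one used in this work:

```python
import torch

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    """G maps X -> Y and F maps Y -> X; both cycles should return to the input."""
    forward_cycle = torch.mean(torch.abs(F(G(x)) - x))    # x -> G(x) -> F(G(x)) ≈ x
    backward_cycle = torch.mean(torch.abs(G(F(y)) - y))   # y -> F(y) -> G(F(y)) ≈ y
    return lam * (forward_cycle + backward_cycle)
```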

VQ-GAN
The Vector Quantized Generative Adversarial Network (VQ-GAN) [Esser 21] extends VQ-VAE by adding a discriminator network, which tries to tell real images from reconstructed ones. We use VQ-GAN with and without the transformer model from [Esser 21] in our work.

Figure 2.10: Schematic diagram of VQ-GAN.

Similar to [Van Den Oord 17], VQ-GAN training uses the decoder's reconstructed images, which are supplied to the discriminator along with the real input. For improved visual quality, the adversarial loss of the discriminator component is added to the objective, and the L_Rec term of 2.13 is replaced with a perceptual loss. The loss function L_VQGAN is described in 2.15,

L_VQGAN = arg min_{E,G,Z} max_D E_{x∼p(x)}[L_VQ(E, G, Z) + λ L_GAN({E, G, Z}, D)]    (2.15)

where E, Z, G, D are the encoder, codebook, decoder and discriminator, respectively. The weight λ, which scales the discriminator loss term, is estimated by equation 2.16 from the perceptual reconstruction loss and the gradients with respect to the last decoder layer G_L.

λ = ∇_{G_L}[L_Rec] / (∇_{G_L}[L_GAN] + δ)    (2.16)
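A sketch of how the adaptive weight of equation 2.16 can be computed from gradient norms at the last decoder layer; the clamping range is an assumption borrowed from common implementations, not a detail stated above:

```python
import torch

def adaptive_lambda(rec_loss, gan_loss, last_layer_weight, delta=1e-6):
    """lambda = ||grad_GL(L_Rec)|| / (||grad_GL(L_GAN)|| + delta), eq. 2.16."""
    rec_grad = torch.autograd.grad(rec_loss, last_layer_weight, retain_graph=True)[0]
    gan_grad = torch.autograd.grad(gan_loss, last_layer_weight, retain_graph=True)[0]
    lam = rec_grad.norm() / (gan_grad.norm() + delta)
    return torch.clamp(lam, 0.0, 1e4).detach()   # keep the weight in a sane range
```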

Transformers
Vision Transformers have developed into a competitive substitute for convolutional neural networks (CNNs): they are currently state of the art in computer vision and are employed extensively in a variety of image recognition applications. Transformers show great promise as a general learning approach that can be applied to a variety of data modalities, with recent advances in computer vision achieving state-of-the-art accuracy with improved parameter efficiency. Transformer models first showed incredible performance in Natural Language Processing (NLP) [Devlin 18], pushing the boundaries of the field and breaking several records in tasks such as machine translation, conversational chat bots, and more effective search engines. This inspired researchers to apply the same ideas to computer vision tasks; notable examples are BEiT [Bao 21] and SWIN-Transformers [Liu 21]. Analogous to the series of word embeddings used when applying transformers to text-related tasks, the ViT model represents an input image as a set of image patches, modelling the feature map as a sequence of tokens, with each token representing the embedding of a particular image patch. Unlike [Vaswani 17], vision transformers make use of the encoder only.

Figure 2.11 illustrates the architecture of a vision transformer. The parallelization power of transformers allows them to be fed with large amounts of data and a broad context window. Before feeding the encoder, the first step is to split the image into patches, as illustrated in figure 2.11, which are provided as a sequence of linear embeddings to the transformer. These image patches play the same role as tokens (i.e. words) in an NLP application. We could also feed in raw pixel values instead of patches, but then the attention mechanism suffers: every input must be compared to every other input, so for a 256 × 256 pixel image each of the 256² pixel tokens would have to be compared with every other one. This is only a single attention layer, and transformers contain several of them, which would be a computational nightmare. To embed the created patches as patch embeddings, we feed them into a linear projection layer that produces vectors; its parameters are learnable, and during training the model updates them with values that better help with the assigned task. Unlike LSTMs, which take the embeddings sequentially in a designated order and therefore know which word came first, transformers take up all embeddings at once. This is a big plus that makes transformers much faster, but the downside is that they lose the ordering information of the embeddings and also lack the notions of equivariance and invariance, unlike CNNs. To resolve this issue, [Vaswani 17] came up with the clever idea of positional embeddings that use wave frequencies to capture position information. For classification tasks, similar to [Bao 21], a learnable class token embedding is appended to the series of position-embedded patches, and the final-layer feature vector corresponding to this class embedding is used by the classification head (i.e. an MLP).

Figure 2.11: Schematic diagram of Transformer adopted from [Dosovitskiy 20]

The transformer encoder maps the input sequence of embedded patch representations to a sequence of continuous representations that holds the learned information for the entire sequence. It is a stack of multi-head self-attention layers, multi-layer perceptron (MLP) layers, and layer norms, with or without residual connections. The encoder uses a particular attention mechanism known as self-attention. Self-attention enables the model to relate each input token to every other token in the sequence and computes an attention score for each pair. The current token is then set to the weighted mean of all input elements, where the attention scores determine the weights. Such a process modifies each token in the input sequence by scanning the whole sequence, determining which tokens are most important, and updating each token's representation according to the most important ones. The type of attention employed in the majority of transformers differs slightly from this explanation: transformers frequently employ a "multi-headed" style of self-attention, in which several separate self-attention modules run simultaneously and their outputs are concatenated.
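A minimal sketch of the patch-plus-position embedding step described above; the image size, patch size and embedding dimension are common ViT defaults used purely for illustration:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting patches and projecting them linearly.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                                    # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim)
        return tokens + self.pos_embed                       # add learnable position information
```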
3. Literature Review

This chapter discusses related work on image-to-image translation between domains using VQ-VAE, GANs and transformers. [Goodfellow 16] assumes that observed data can be represented as independent and identically distributed random variables. In practice, this assumption must be seriously questioned, because data remain scarce in many fields. We still want to employ training datasets for broad use cases even if they were gathered on a different continent or with different camera configurations, and models must anticipate distributional shifts when deployed in their target context. Since data annotation remains costly, it would be ideal to forgo the expensive human element entirely and switch to the creation of automated, synthetic data. However, even the most remarkable photorealistic rendering techniques available today are unlikely to match the original distribution exactly. A primary concern of robotics is adapting to new and unknown situations, and collecting data for every possible environment is challenging or impossible, which motivates the use of synthetic data.

Isola et al. proposed Pix2Pix [Isola 17], a conditional model that can translate an image from one domain to another through paired training. Later, Pix2PixHD [Wang 18] emerged, building on [Isola 17] and producing high-resolution translated images. Although these approaches represented considerable advances over the state of the art, one key issue with paired image-to-image translation methods is the paired dataset itself, a dataset in which each image already has its translated counterpart. Due to this pairing restriction between input and output, such datasets are difficult, costly, or even impossible to collect.

Domain adaptation helps to overcome the bias between the training and real-world data distributions. The authors of CycleGAN [Zhu 17] extended the idea of the vanilla GAN [Goodfellow 20] by adding two generators and two discriminators, as described in section 2.3.8. Furthermore, they make use of two additional loss functions, a forward and a backward cycle consistency loss, on top of the normal adversarial loss to further regularize the mapping between the distribution of the input images (X, e.g. computer-generated images) and the desired output distribution (Y, e.g. Cityscapes or KITTI images). The CycleGAN architecture performs well for colour or texture transformations; however, it is not particularly effective for geometric transformations. A common problem for GAN-based methods is the semantic alignment of the source and target domains. [Hoffman 18] is an adversarial unpaired adaptation algorithm that extends the CycleGAN method to align the domains while preserving semantics at both the feature and pixel levels with the help of cycle and semantic consistency.

In [Isola 17], Isola et al. studied conditional adversarial networks on a variety of datasets for a broad range of applications, from image-to-image translation to image segmentation. Their proposed network learns a loss function for the mapping instead of just learning the mapping between the original and translated image, which eases the need to tweak parameters for different adaptations. The generator and discriminator architectures use convolution followed by BatchNorm and ReLU, adopted from [Ioffe 15]. They also make use of skip connections from U-Net to communicate low-level information between the layers. A PatchGAN discriminator architecture is used, which can be considered similar to a texture loss. Several datasets, such as Cityscapes [Cordts 16] and Google Maps, have been used to test the proposed methods on semantic labelling and on converting maps to aerial photos, respectively. Evaluation methods include presenting generated and original images to workers on Amazon Mechanical Turk (AMT) as well as metrics similar to the Inception score [Salimans 16] and the FCN-score [Long 15]. The paper further observes that PixelGANs are efficient at colourization, whereas PatchGANs produce sharper images.

[Liu 17] proposed a Coupled-GAN-based technique [Liu 16] for unsupervised image-to-image translation. The authors make a shared-latent-space assumption: two corresponding images in separate domains can be mapped to the same latent representation in a shared latent space. Based on this, they proposed a framework combining VAEs and GANs, and embedded weight-sharing constraints to relate the VAEs. Later, [Kazemi 18] addressed the problem in [Liu 17] of modelling domain-variant information, since such methods frequently fail to model domain-specific content not represented in the target domain, by removing the shared latent space and jointly learning a domain-specific space and a domain-invariant space.

[Xie 20] addresses the challenge of identifying cross-domain relationships from unpaired data. The proposed method learns to identify connections across several domains using generative adversarial networks. Using the discovered relations, the network successfully performs style transfer from one domain to another while maintaining important characteristics such as orientation and face identity. Similar to [Xie 20], [Yi 17] proposes a dual-GAN technique that allows image translators to be trained using two sets of unlabelled images from two domains. In the proposed design, the primal GAN learns to translate images from domain U to domain V, while the dual GAN learns to invert the task. Both algorithms use a cycle consistency constraint to promote bidirectional image translation with regularized structural output. Although these GANs produce realistic visual results for image-to-image translation tasks, they are poorly suited to domain adaptation, where corruptions of image content commonly appear in translated images. The issue of content distortion was addressed by [Huang 18] and [Zhang 18], who used extra segmentation branches to include semantic data in the generators, forcing CycleGAN to translate images with the content in mind.

In [Xie 20], Xie et al. address the issue of content distortion, which CycleGAN methods have not fully solved: although CycleGAN performs well at image-to-image translation, it fails to preserve image content. They also work on eliminating the need for pixel-level annotations, which had been used extensively to overcome object-preservation problems. The proposed method adds a self-supervised multi-task network alongside the regular adversarial and cycle consistency losses; through this additional network, the authors try to capture the contents of the image along with domain knowledge. They experiment on three datasets, including medical images, which are prone to varying imaging conditions.

The authors of [Shrivastava 17] proposed a simulated+unsupervised learning method that reduces the gap between the synthetic and real distributions by utilizing unlabeled real data. The objective is to develop a GAN-based model that enhances the realism of a simulator's output while maintaining the simulator's annotation information. To maintain annotations, eliminate artefacts, and stabilize training, the model adopts a self-regularization term that penalizes significant differences between synthetic and refined pictures, a local adversarial loss that adds realism, and a discriminator update that uses a history of refined images to deal with stability issues.

To date, many papers have performed image-to-image translation with generative models. However, these generative models tend to learn colourization or simple local translations instead of comprehending the true target distribution. To address this, the authors of [Chen 22] employed vector quantization to obtain image-to-image translation; using VQ as the intermediate embedding space, with its low-dimensional image representation, makes this possible. Furthermore, they proposed a multi-functional model for unconditional image generation with the possibility of style translation. The method has an encoder that extracts domain-invariant features and two additional style encoders that extract domain-specific features. The generator then forms images that combine content and style, whereas the discriminator tries to determine whether the image is fake or real.
4. Concept and Implementations

The earlier chapters described the principles and current state-of-the-art generative models for domain adaptation, ranging from GAN-based approaches to vector-quantized approaches. This work focuses on learning representations for image-to-image translation that preserve the similarity relations of the data space. The goals of the study are to reduce the number of data points required for deep learning algorithms to produce usable results and to make such results reliable in a variety of situations.

This work is based on VQ-GAN, built on top of VQ-VAE, as these methods are good at learning meaningful latent representations of the data space and at image generation. One advantage of these models is that they can avoid posterior collapse by integrating the notion of vector quantization. The vector quantizer acts as a regularizer that forces a restricted code space onto the encoder's output and organizes this output into vectors. The term "quantization" in VQ refers to the fact that related vectors are represented by the same index.

4.1 Datasets
This section describes the properties and characteristics of the training and test data and shows exemplary images from each dataset in figure 4.1. A dataset should be relevant to the target domain in order to provide transferable signals and information, while also capturing a substantial amount of variation to avoid overfitting. However, annotating large-scale datasets at the pixel level is very expensive and cumbersome, as it takes immense human effort. Sandbox video games strive to portray a vast setting that frequently matches its real-world equivalent, covering a wide variety of architecture and artificial environments that resemble the human world.

GTAV
[Richter 16b] put forward an algorithm using G-buffers to extract pixel-level semantically labelled images from one of the most famous open-world games, Grand Theft Auto V (GTAV). The dataset extracted by [Richter 16b] consists of 254064 frames in total at 1920 × 1810 resolution, captured under different environmental and climatic settings.

Figure 4.1: Exemplary images from the datasets. (a) Synthetic (GTAV) image, (b) Real (Cityscapes) image.

CityScapes

CityScapes [Cordts 15] is one of the best-known publicly available datasets for computer vision tasks, taken from video sequences recorded in fifty different cities in Germany and France. The dataset comprises 5000 high-quality frames with pixel-wise annotations and 20000 frames with coarse annotations. All images have a resolution of 1024 × 2048. The main focus of this dataset is to make models understand urban scenes at the pixel and instance level. As CityScapes images come from the real world, they were chosen as the target domain in the current work.

4.2 Pre-processing step


Before supplying the samples to the neural networks, each image underwent several
preprocessing steps. First, to conserve computational resources, images are resized
to the desired dimensions of 512 × 512 or 256 × 256. Since resizing images can
lead to a loss of information, patches of the required size (512 × 512 or 256 × 256) are
alternatively selected from the image and fed to the neural network. Additionally, the images are normalized

using min-max normalization into the range [0, 1]. Finally, horizontal flipping is applied to the
images to improve the models’ generalizability.
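
As an illustration, a minimal preprocessing pipeline of this kind could be written with torchvision transforms as sketched below; the crop size, the flip probability and the use of ToTensor for the [0, 1] scaling are assumptions for illustration rather than the exact configuration used in this work.

```python
import torchvision.transforms as T

# Assumed sketch of the preprocessing pipeline described above: random 256 x 256
# patches, horizontal flipping and scaling of pixel values into [0, 1].
preprocess = T.Compose([
    T.RandomCrop(256),              # take a patch instead of resizing the whole image
    T.RandomHorizontalFlip(p=0.5),  # augmentation to improve generalizability
    T.ToTensor(),                   # converts the image to a tensor in the range [0, 1]
])

# Usage: tensor = preprocess(pil_image)
```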

4.3 Implementation
Deep learning research is highly dependent on frameworks that provide building blocks
for model architectures. Two of the most popular deep learning frameworks are
PyTorch [Paszke 19] and TensorFlow [Abadi 16]. In our study, we use PyTorch as the
designated framework, as it is the most commonly adopted framework in academia. Here, we provide
the motivation and implementation details of the baseline model as well as the extensions and
special features of the proposed methods.

The fundamental goal of this research is to make use of a discrete latent representation
space that helps us add high-quality features to the synthetic inputs in order to generate more
realistic output. Furthermore, we must preserve the source geometry and the semantic labels of
the translations while transforming images from one domain to another, such as from computer-
generated images to real images.

To achieve this, we first pre-train the entire VQ-GAN model on a real dataset to obtain
the real codebook latent distribution and its corresponding decoder. Then, we freeze
the codebook and decoder weights and employ different encoder modules to map
the synthetic latent representation to the real latent space representation. We later use
discriminators and transformers to bring the synthetic latent distribution closer to the
real distribution.
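
A minimal PyTorch sketch of this freezing step is given below; the submodule names (encoder, codebook, decoder) and the learning rate are placeholders for illustration and do not necessarily match the actual implementation.

```python
import torch
import torch.nn as nn

def prepare_for_adaptation(vqgan_real: nn.Module, lr: float = 2e-4):
    """Sketch: freeze the codebook and decoder of a pretrained VQ-GAN and return
    an optimizer over the (synthetic) encoder only. Submodule names are assumed."""
    for module in (vqgan_real.codebook, vqgan_real.decoder):
        for p in module.parameters():
            p.requires_grad = False            # keep the learned real representations fixed
    # Only the encoder parameters are optimized in the following adaptation steps.
    return torch.optim.Adam(vqgan_real.encoder.parameters(), lr=lr)
```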

4.3.1 Pre-training of model


As mentioned above, we have chosen VQ-GAN as the baseline architecture to pretrain our
model with real images, as shown in figure 4.2.
The VQ-GAN architecture is adopted from the base paper [Esser 21] published by Esser et al.
The network consists of many convolution layers as well as residual and attention blocks, as
illustrated in figure 4.3. The first block is a convolution layer with window size 3 × 3, single
stride and single padding. It is followed by four downsampling blocks, each with two residual
blocks accompanied by a convolution layer with filter size 3 × 3 and stride 1. Afterwards, a
combination of attention and residual blocks is repeated before the features are fed to a
pre-vq-convolution layer with k = 3 × 3 and s = 1. The resulting tensor is fed to the codebook,
where the vectors are reshaped to the format batch, height, width, channel and flattened to
obtain the mapping to the closest codebook vector. The obtained 1024 codebook vectors
with embedding size 256 are then fed to a single-stride post-vq-convolution layer with
filter size 1 × 1 and padding 1, which increases the number of feature maps to 512.
After quantization, there is another single-stride post-quantization convolutional layer
with window size 3 × 3 and padding 1, followed by an attention block sandwiched between
two residual blocks. The residual block consists of a Group Norm layer, a normalization layer
in which the channels are divided into groups and the features are normalized within each group. It is
followed by a Swish activation layer f (x) = x · sigmoid(x) and a 2d convolutional layer
with kernel size 3 × 3 and single padding and stride. These three layers are repeated in the
same order, forming a complete residual block.
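
As an illustration of the residual block just described, a simplified sketch is given below; the number of groups in the Group Norm layer is an assumption (the channel count must be divisible by it), and shortcut convolutions for channel changes are omitted.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the GroupNorm -> Swish -> Conv pattern, repeated twice, with a skip connection."""
    def __init__(self, channels: int, groups: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(groups, channels),
            nn.SiLU(),                                                   # Swish: x * sigmoid(x)
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(groups, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)                                         # residual connection
```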

Figure 4.2: Schematic diagram of implementation of real image reconstruction.

Type of Layer    Kernel    Stride    Padding


Conv2d 4×4 2 1
LeakyReLU - - -
Conv2d 4×4 2 1
BatchNorm2d - - -
LeakyReLU - - -
Conv2d 4×4 2 1
BatchNorm2d - - -
LeakyReLU - - -
Conv2d 4×4 1 1
BatchNorm2d - - -
LeakyReLU - - -
Conv2d 4×4 1 1

Table 4.1: Discriminator layers used for conditioning on real images having 2d Convolutional
Layers, LeakyReLU activation function and Batch Normalization.

An attention block has a Group Norm layer followed by four convolutional layers with 1 × 1
kernel size and stride 1. Further, there are four upsampling blocks, each having a conv2d layer
(k = 3 × 3, s = 1, p = 1) and two residual blocks. The number of features is then reduced with a
single-stride convolutional layer with window size 3 × 3. Finally, the reconstructed image is
provided to the discriminator for classification into real and fake, using a mixture of
convolution layers, activation functions and batch normalization, as shown in table 4.1.
After training, the codebook and decoder, which have learned real representations, are
frozen for further usage in the following steps.
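
To make the quantization step described in this subsection concrete, the following is a simplified sketch of the nearest-codebook lookup, assuming a codebook of 1024 entries with embedding size 256; the straight-through gradient estimator and the commitment loss are omitted, so this is illustrative rather than the exact code of the implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal sketch: maps each encoder output vector to its closest codebook entry."""
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                     # z: (batch, dim, height, width)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)           # (b*h*w, dim)
        dist = torch.cdist(flat, self.codebook.weight)        # distance to every codebook vector
        idx = dist.argmin(dim=1)                              # index of the closest code
        z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        return z_q, idx                                       # quantized latents and code indices
```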

4.3.2 Generation of images using discriminators


In the training phase of this implementation, we make use of both natural and synthetic
images to generate a real-like latent distribution for the synthetic images. Figure 4.4
shows both the training and the inference phase of this implementation.

Figure 4.3: Architecture of VQ-GAN.



Figure 4.4: Schematic diagram of implementation using discriminators.

Initially, the real images are fed to the encoder Ereal , which was already trained during the
pre-training step (figure 4.2), to obtain the real latent representation. We then take another
encoder, Esyn , to obtain the latent distribution of the synthetic images. These real and
synthetic representations are fed to the discriminator, which discriminates between real and
computer-generated representations. Here, the output of Ereal is used to regularize the
predictions of Esyn . Once Esyn has learned to generate a real-like latent space, we employ it
in the inference phase to create images using the frozen codebook Creal and decoder Dreal
from the pre-training step. The architecture of both encoders is identical to the encoder used
during pre-training. The discriminator consists of several 2d convolutional layers, batch
normalization and the Leaky ReLU activation function, as shown in table 4.2.
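
A condensed sketch of this adversarial training on latent representations is given below; the names E_real, E_syn and D_latent correspond to Ereal , Esyn and the latent discriminator, but are placeholders, and the non-saturating GAN loss is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(real_imgs, syn_imgs, E_real, E_syn, D_latent, opt_E, opt_D):
    """Sketch of one adversarial update on latent representations.
    E_real is frozen; E_syn and D_latent are trainable (placeholder names)."""
    with torch.no_grad():
        z_real = E_real(real_imgs)                 # target latent distribution
    z_syn = E_syn(syn_imgs)                        # latents to be pushed towards the real ones

    # Discriminator update: real latents -> 1, synthetic latents -> 0.
    logits_real = D_latent(z_real)
    logits_fake = D_latent(z_syn.detach())
    d_loss = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) \
           + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Encoder update: try to fool the discriminator.
    logits_syn = D_latent(z_syn)
    g_loss = F.binary_cross_entropy_with_logits(logits_syn, torch.ones_like(logits_syn))
    opt_E.zero_grad(); g_loss.backward(); opt_E.step()
    return d_loss.item(), g_loss.item()
```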

4.3.3 Generation of images using CycleGAN


In the CycleGAN approach, illustrated in figure 4.5, we use two generators, Greal and Gsyn , and
two discriminators, Dreal and Dsyn . First, we obtain the synthetic latent space from the encoder,
which is already pre-trained on the GTAV dataset. Then, Greal generates a latent space that
represents the real distribution, while the second generator, Gsyn , tries to construct the
synthetic latent space from the real distribution. Each generator is paired with a
discriminator (Dreal , Dsyn ). To identify the latent distribution generated by Greal as real
or fake, we give the generated input and the real input to the discriminator Dreal . Similarly,
Dsyn is applied to the synthetic latents. During the inference phase, we feed computer-generated
images to the optimized generator, which has learnt to create a real-like latent
distribution. The resulting hidden space is then given to the codebook and decoder from the
pre-training step to obtain real images.

Regarding the architectural details, an image of size 256 × 256 with RGB channels is
fed to Esyn , which outputs a latent representation of dimension 16 × 16 with 256 features.
These latents are then provided to the generator model shown in figure 4.6.

Type of Layer    Kernel    Stride    Padding


Conv2d 3×3 2 1
LeakyReLU - - -
Conv2d 3×3 2 1
BatchNorm2d - - -
LeakyReLU - - -
Conv2d 3×3 2 1
BatchNorm2d - - -
LeakyReLU - - -
Conv2d 3×3 1 1
BatchNorm2d - - -
LeakyReLU - - -
Conv2d 3×3 1 1

Table 4.2: Discriminator layers having 2d Convolutional Layers, LeakyReLU activation function
and Batch Normalization.

Figure 4.5: Schematic diagram of implementation using CycleGAN.



Figure 4.6: Architecture of generator in cycleGAN.

The first layer of the generator is a convolution with kernel size 3 × 3, single padding and
single stride. It is followed by two convolutional layers with filter size 3 × 3, stride 2 and
single padding. All convolutional layers are followed by Batch Normalization and LeakyReLU.
These are followed by five residual blocks, and upsampling blocks are then used to match
the output size to the input size. The generator’s output is subsequently fed to the discriminator
for classification into real or fake. The other generator and discriminator follow a
similar procedure.
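
As a sketch of the training objective, the cycle-consistency term on the latent representations could be written as follows; the L1 formulation and the weighting factor lambda_cyc are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(z_syn, z_real, G_real, G_syn, lambda_cyc: float = 10.0):
    """Sketch of the latent-space cycle loss: translating forth and back
    should reproduce the original latent representation."""
    z_syn_rec = G_syn(G_real(z_syn))     # synthetic -> real-like -> synthetic
    z_real_rec = G_real(G_syn(z_real))   # real -> synthetic-like -> real
    return lambda_cyc * (F.l1_loss(z_syn_rec, z_syn) + F.l1_loss(z_real_rec, z_real))
```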

4.3.4 Generation of images using Transformers

In this implementation, we employ a transformer instead of the discriminator from section
4.3.2, while the entire training and evaluation procedure remains the same as in that section.
Here, we therefore describe only the transformer model, as it is the implementation’s only
differing element.
The outputs of the real and synthetic image encoders, of size 16 × 16 with 256 features, are
fed into a convolutional patch embedding layer with a patch size and stride of 4.
The embedding layer outputs a 4 × 4 grid of 16 patch tokens, each with dimension 256.

Subsequently, a class token is prepended to the sequence of patch tokens, leading to a
flattened sequence of 17 tokens with 256 dimensions each. A positional embedding is then
added to each token before the sequence is passed to the transformer encoder. Since this is a
classification task, only the transformer encoder is used. The encoder consists of alternating
multi-head self-attention (MHSA) and MLP blocks. Layer normalization (LN) is applied before
the self-attention block and the MLP block, and residual connections are added around every
block. Finally, the classification head takes the class token and outputs the true/false prediction.
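
A compact sketch of such a transformer classifier over latent patches is shown below, using PyTorch's built-in transformer encoder; the depth, number of heads and MLP width are assumptions for illustration and not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class LatentTransformerClassifier(nn.Module):
    """Sketch: classifies 16x16x256 latent maps as real or fake via a class token."""
    def __init__(self, dim=256, patch=4, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(dim, dim, kernel_size=patch, stride=patch)  # 16x16 -> 4x4
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 17, dim))       # 16 patches + class token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)                                # real/fake logit

    def forward(self, z):                                            # z: (batch, 256, 16, 16)
        x = self.patch_embed(z).flatten(2).transpose(1, 2)           # (batch, 16, 256)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed              # (batch, 17, 256)
        x = self.encoder(x)
        return self.head(x[:, 0])                                    # prediction from class token
```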

4.4 Perceptual metrics


Here we briefly explain the quantitative metrics chosen to evaluate the generated
images.

Figure 4.7: Schematic diagram of implementation using transformers.

Fréchet Inception Distance


The Fréchet Inception Distance (FID) is a performance metric used to measure the distance
between the feature vectors of generated and real images. The score represents the
statistical similarity between the two groups using features of the Inception v3 model, an image
classification network. A perfect score of 0.0 indicates that the two sets of
images are identical; lower values show that the two groups of images are more
similar or have more in common statistically.
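
Formally, if the Inception features of the real and generated images are modelled as Gaussians with means $\mu_r$, $\mu_g$ and covariances $\Sigma_r$, $\Sigma_g$, the FID is commonly written as

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right).
\]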

Kernel Inception Distance


The Kernel Inception Distance (KID) is another metric that measures the dissimilarity of
images. Like FID, KID determines a distance in feature space, but it computes the
dissimilarity by drawing independent samples from the two probability distributions of real
and generated data (it is based on the squared maximum mean discrepancy between Inception
features). The lower the value, the closer the two datasets, meaning that the generated
images are more proximate to the real images.
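
Both metrics can be computed, for example, with the torchmetrics package (which relies on the torch-fidelity backend), assuming the images are provided as uint8 tensors in the range [0, 255]; the batch sizes and the KID subset size below are assumptions chosen for illustration.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)   # subset_size must not exceed the sample count

# real_batch, fake_batch: uint8 tensors of shape (N, 3, H, W); random data used as a stand-in.
real_batch = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

for metric in (fid, kid):
    metric.update(real_batch, real=True)
    metric.update(fake_batch, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item())
```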
5. Experiments and Results

In this chapter, we explain the experimental settings for the investigated methods and discuss
the obtained results. Later, the data generated from the synthetic images is tested
to check its appropriateness for object detection models.

5.1 Pre-training of model


To obtain the real latent codebook representation of real images, we employed the VQ-
GAN model, which has shown promising results in [Esser 21]. As a discrete codebook can
represent complex image characteristics, we extend its usage to our work. We
used the Cityscapes dataset with images of size 2048 × 1024.

Figure 5.1: Original and reconstructed images. The top row shows the original images and the bottom row the reconstructed images.

Figure 5.1 shows the original images and the reconstructed images from the model. The
reconstructed images are blurry, and some parts are distorted. One of the reasons can be
the resizing of the images: information is lost when 2048 × 1024 resolution images are
resized to 256 × 256.

Figure 5.2: Original and reconstructed image patches. The top row shows the original patches and the bottom row the reconstructed patches.

Figure 5.3: Original and reconstructed images after patch-wise training. The top row shows the original images and the bottom row the reconstructed images.

Further, to improve the reconstruction quality, instead of resizing the whole image, we
randomly chose patches of size 256 × 256 from each image. In this way, we can
preserve the information for better reconstruction. As we can observe in figure 5.2,
tiny details are also reconstructed compared to the first model. It is
clearly observed that bicycle spokes and small pieces of paper on the ground are reconstructed well

with the help of patches. Figure 5.3 shows the original and reconstructed images from
the patch-wise training. All experiments were conducted with batch size 4, a learning
rate of 0.0002 and the Adam optimizer for 500 epochs. The reconstruction loss and the VQ loss
are the metrics for selecting the codebook and decoder. Similarly, we trained on synthetic
images both with patches and with resized images.

5.2 Data Generation


In this section, we display and discuss the images generated after training with the
investigated methods. Furthermore, perceptual metrics for some of the models are
presented.

5.2.1 Data Generation using Discriminators


In this experiment, we investigated discriminators to map the latent features between real
and synthetic images. To achieve the target latent distribution, we used an encoder network
already trained on the real images. Figure A.1 shows the images generated from the latent
space of the synthetic encoder. The figure clearly shows that the model generates only a few
sets of outputs, which are focused on tree, road and sky-like textures. In the output images,
the buildings are replaced by trees. This phenomenon is known as mode collapse, where the
generator gets stuck in a local minimum and produces only a limited variety of outputs.

To tackle this issue, we made use of pretraining, which helps the model learn the
generic information in the images. Additionally, this encourages the encoder to perform
better, and the resulting outputs are more reasonable, as shown in figure 5.4.

After hyperparameter tuning (e.g., batch size and learning rate), we lowered the FID from 143.57
to 67.67, indicating that the experiments were on the right path. From figure 5.4, we can observe that the
roads, footpaths and building textures are adapted to the real domain, but reconstructed
objects that are far away remain ambiguous. A precise translation of an image can be seen in
figure 5.6: almost all the content, such as street poles and road markings, is transferred from
synthetic to natural. The half-visible car in the synthetic image is fully restored in the real
image, which can be attributed to codebook learning. We also implemented the model
using the Wasserstein loss [Martin 17] with weight clipping for a further improvement in image
generation quality. Figure 5.7 shows a clear improvement in texture, which resembles the
CityScapes dataset more closely: side paths and nearby vehicles are generated in a
realistic fashion. In figure A.2, we can observe that nearby objects are reconstructed
well, whereas objects farther away are either missing or blurry when generated. We
can also conclude that generating data from unknown latent representations is difficult:
in figures A.3 and A.4, the colour of the sky (sunset, sunrise) in the synthetic images has
affected the reconstructions. The models are unable to replicate it and also distort nearby
pixels.
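
For reference, a minimal sketch of the Wasserstein critic update with weight clipping, in the spirit of [Martin 17], is given below; the clipping threshold and the function interface are assumptions for illustration rather than the exact code used in these experiments.

```python
import torch

CLIP_VALUE = 0.01  # assumed clipping threshold, as suggested in the original WGAN paper

def critic_step(critic, opt_critic, z_real, z_fake):
    """Sketch: Wasserstein critic update on latent representations with weight clipping."""
    # Maximize critic(real) - critic(fake), i.e. minimize its negative.
    loss = -(critic(z_real).mean() - critic(z_fake.detach()).mean())
    opt_critic.zero_grad()
    loss.backward()
    opt_critic.step()
    # Clip the critic weights to enforce the Lipschitz constraint.
    for p in critic.parameters():
        p.data.clamp_(-CLIP_VALUE, CLIP_VALUE)
    return loss.item()
```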
Additionally, an experiment was conducted in which the input given to the discriminator was
changed while training the synthetic encoder in the VQ-GAN model. Instead of providing
synthetic data to the discriminator, we provide real data in order to map the translation
between the domains. We observed slight differences, as shown in figure 5.5: the road texture
and the trees are more realistic.

Figure 5.4: Images generated after training using discriminators. Real-like image reconstruction
using pretrained encoder on synthetic data.

Figure 5.5: Images generated using VQ-GAN by feeding real images to the discriminator. From
left to right, first two images are synthetic, third and fourth are generated images respectively.

Figure 5.6: Images generated after training using discriminators.

Figure 5.7: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right one is the generated image.

Figure 5.8: Images generated after training using CycleGAN. From left to right, the first two images are synthetic, and the third and fourth are the corresponding generated images.

5.2.2 Data Generation using CycleGAN and Transformers

We tried to use CycleGAN to avoid the inconsistencies that occurred when using only a
discriminator. Several experiments were conducted to utilize the cycle consistency loss,
as explained in 2.3.8. However, we were unable to extract helpful information for image
reconstruction. Therefore, we tweaked the model by feeding Gsyn synthetic images instead of
their embeddings; the resulting outputs are shown in figure 5.8. Although these are also
not up to the mark, the model tries to generate most of the image with road texture. We are
using two generators that force each other to generate real latents from synthetic images and vice
versa. This might have led the model to learn real latents, resulting in the generation of a BMW
symbol at the centre of the image, which is predominant in the CityScapes dataset.

5.2.3 Data Generation using cross domain encoder

We utilize the two VQ-GAN architectures pretrained on real and synthetic images respectively,
as indicated in section 5.1. In this experiment, we consider the latent space of the synthetic image
encoder to generate real-like images. The encoder output is given to the frozen codebook
and decoder obtained from training on real scenarios. From figure 5.9, we can observe
the realism of the elements. For example, the texture of the vehicles in the GTAV dataset
is glossy, whereas in the CityScapes dataset it is not. In our generated images, we
can observe that the cars are not as shiny as in the synthetic data, indicating that our codebook has
learnt naturalistic features. The stony structures are not well generated because our real dataset
does not contain them; instead of rocks, the model generates building- or wall-like structures.

5.2.4 Perceptual Metrics

The KID and FID scores of the subjectively better-performing models have been calculated and
are shown in table 5.1. In this table, ’cross domain encoder’ means using an encoder
pretrained on the synthetic data and feeding its latent representations to the codebook
and decoder, which are pretrained on real data. The baseline in the table represents the FID
and KID similarity scores between the CityScapes dataset and the synthetic dataset (GTAV). The
perceptual scores obtained on the generated images are as follows.

Figure 5.9: Image generated using cross-domain encoders. The left image is synthetic and the right one is the generated image.

Name of model FID KID


Baseline 76.92 0.07
Discriminators 67.67 0.06
Discriminators with wasserstein 62.23 0.06
Cross Domain encoder 52.83 0.04

Table 5.1: Summary of the perceptual metrics. Here, FID is the Fréchet Inception Distance and
KID is the Kernel Inception Distance; the lower the value, the better the performance of the model.

5.2.5 Discussion
Several intriguing observations appeared throughout the experiments. Discriminator-based methods are
superior to transformer-based methods in our current setting. Additionally, pre-training
the models with synthetic images has made it easier for the encoders to produce latent
representations suitable for image reconstruction. Image generation with the cross-domain
encoders has shown the best performance, as indicated by the lower FID scores. Not only
quantitatively but also subjectively, these images are better at expressing real perception.
The approach of using a discriminator to differentiate latent spaces, in the same fashion as it
is used for images, did not work as expected. This can be due to the size of the latent spaces:
the discriminator further reduces the latent space size, leaving it with few features to learn
from. The GANs with the Wasserstein loss showed a significant improvement in the scores and
also subjectively. However, this model still has a few limitations in reconstructing images whose
content is completely unavailable in the real data distribution.

We could not obtain the expected results from CycleGAN and transformers, even though they
generally perform well on vision tasks.
6. Conclusion

This thesis addresses the data limitation problem in certain environmental scenarios.
Using image-to-image translation, we would like to generate the data required for remote
robotic applications, as data collection is quite difficult in these regions. In addition,
synthetic data can be used for generalized tasks; however, for downstream tasks the
performance drops, as CNNs work largely based on textures. By reducing the gap between the
domain distributions, we can leverage the synthetic data for real-time implementations.

This thesis mainly focuses on learning representations for image-to-image generation.


This approach includes the investigation of vector-quantized representations along with
generative methods. Our models ranged from basic discriminators to CycleGANs.
We also made a few attempts to understand the usability of transformers for this
particular task. Finally, we used the encoder trained on synthetic images to generate
images.

Overall, we obtained visually better results when the discriminator was used to differentiate real
and fake latents. With further hyperparameter tuning and by utilizing the Wasserstein
loss, we achieved significant results, and the quantitative measurements show
the same trend. In most of the implementations, the real texture has been adapted to the synthetic
images. It was also observed that codebook learning played a certain role in image
generation. As a result, there is heavy distortion when the model encounters unseen
data that was not present in the real sets.

As further work, the discriminator can be tuned additionally for better performance, and the
investigation of CycleGANs could be continued. Finally, instead of just using transformers
as discriminators, transformer encoders could be employed directly for this task.
Bibliography

[Abadi 16] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghe-
mawat, G. Irving, M. Isard et al, “{TensorFlow}: a system for {Large-Scale} machine
learning”, in 12th USENIX symposium on operating systems design and implementation
(OSDI 16). 2016, pp. 265–283.

[Alkhalifah 21] T. Alkhalifah, H. Wang, O. Ovcharenko, “MLReal: Bridging the gap


between training on synthetic data and real data applications in machine learning”, in
82nd EAGE Annual Conference & Exhibition, vol. 2021, no. 1, European Association of
Geoscientists & Engineers. 2021, pp. 1–5.

[Bank 20] D. Bank, N. Koenigstein, R. Giryes, “Autoencoders”, arXiv preprint


arXiv:2003.05991, 2020.

[Bao 21] H. Bao, L. Dong, F. Wei, “Beit: Bert pre-training of image transformers”, arXiv
preprint arXiv:2106.08254, 2021.

[Basodi 20] S. Basodi, C. Ji, H. Zhang, Y. Pan, “Gradient amplification: An efficient


way to train deep neural networks”, Big Data Mining and Analytics, vol. 3, no. 3, pp.
196–207, 2020.

[Chen 22] Y.-J. Chen, S.-I. Cheng, W.-C. Chiu, H.-Y. Tseng, H.-Y. Lee, “Vector Quantized
Image-to-Image Translation”, in European Conference on Computer Vision, Springer.
2022, pp. 440–456.

[Cordts 15] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson,


U. Franke, S. Roth, B. Schiele, “The Cityscapes Dataset”, in CVPR Workshop on The
Future of Datasets in Vision. 2015.

[Cordts 16] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson,


U. Franke, S. Roth, B. Schiele, “The cityscapes dataset for semantic urban scene
understanding”, in Proceedings of the IEEE conference on computer vision and pattern
recognition. 2016, pp. 3213–3223.

[Demuth 14] H. B. Demuth, M. H. Beale, O. De Jess, M. T. Hagan, Neural network design,


Martin Hagan, 2014.

[Denton 15] E. L. Denton, S. Chintala, R. Fergus et al, “Deep generative image models
using a Laplacian pyramid of adversarial networks”, Advances in neural
information processing systems, vol. 28, 2015.

[Devlin 18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, “Bert: Pre-training of deep
bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805,
2018.

[Dosovitskiy 20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Un-


terthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al, “An image is worth 16x16
words: Transformers for image recognition at scale”, arXiv preprint arXiv:2010.11929,
2020.

[Enis Cetin 06] A. Enis Cetin, O. Nezih Gerek, “Vector Quantization”, Wiley Encyclopedia
of Biomedical Engineering, 2006.

[Esser 21] P. Esser, R. Rombach, B. Ommer, “Taming transformers for high-resolution


image synthesis”, in Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. 2021, pp. 12873–12883.

[Geirhos 18] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, W. Bren-


del, “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves
accuracy and robustness”, arXiv preprint arXiv:1811.12231, 2018.

[Goodfellow 16] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016.

[Goodfellow 20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,


S. Ozair, A. Courville, Y. Bengio, “Generative adversarial networks”, Communications
of the ACM, vol. 63, no. 11, pp. 139–144, 2020.

[Hinton 12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,
V. Vanhoucke, P. Nguyen, T. N. Sainath et al, “Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups”, IEEE Signal
processing magazine, vol. 29, no. 6, pp. 82–97, 2012.

[Hochreiter 98] S. Hochreiter, “The vanishing gradient problem during learning recurrent
neural nets and problem solutions”, International Journal of Uncertainty, Fuzziness
and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.

[Hoffman 18] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros,
T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation”, in International
conference on machine learning, Pmlr. 2018, pp. 1989–1998.

[Hornik 89] K. Hornik, M. Stinchcombe, H. White, “Multilayer feedforward networks are


universal approximators”, Neural networks, vol. 2, no. 5, pp. 359–366, 1989.

[Huang 18] S. Huang, C. Lin, S. Chen, Y. Wu, P. Hsu, S. Lai, “Cross Domain Adaptation
with GAN-Based Data Augmentation”, Proceedings of the Lecture Notes in Computer
Science: Computer Vision—ECCV, 2018.

[Ioffe 15] S. Ioffe, C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift”, in International conference on machine learning,
PMLR. 2015, pp. 448–456.

[Isola 17] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, “Image-to-image translation with
conditional adversarial networks”, in Proceedings of the IEEE conference on computer
vision and pattern recognition. 2017, pp. 1125–1134.

[Kazemi 18] H. Kazemi, S. Soleymani, F. Taherkhani, S. Iranmanesh, N. Nasrabadi,


“Unsupervised image-to-image translation using domain-specific variational information
bound”, Advances in neural information processing systems, vol. 31, 2018.

[Kingma 19] D. P. Kingma, M. Welling et al, “An introduction to variational autoencoders”,


Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.

[Krizhevsky 17] A. Krizhevsky, I. Sutskever, G. E. Hinton, “Imagenet classification with


deep convolutional neural networks”, Communications of the ACM, vol. 60, no. 6, pp.
84–90, 2017.

[Kurita 19] T. Kurita, “Principal component analysis (PCA)”, Computer Vision: A Ref-
erence Guide, pp. 1–4, 2019.

[LeCun 15] Y. LeCun, Y. Bengio, G. Hinton, “Deep learning”, nature, vol. 521, no. 7553,
pp. 436–444, 2015.

[Liu 16] M.-Y. Liu, O. Tuzel, “Coupled generative adversarial networks”, Advances in
neural information processing systems, vol. 29, 2016.

[Liu 17] M.-Y. Liu, T. Breuel, J. Kautz, “Unsupervised image-to-image translation net-
works”, Advances in neural information processing systems, vol. 30, 2017.

[Liu 21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, “Swin transformer:
Hierarchical vision transformer using shifted windows”, in Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 10012–10022.

[Long 15] J. Long, E. Shelhamer, T. Darrell, “Fully convolutional networks for semantic
segmentation”, in Proceedings of the IEEE conference on computer vision and pattern
recognition. 2015, pp. 3431–3440.

[Martin 17] A. Martin, C. Soumith, L. Bottou, “Wasserstein GAN”, arXiv preprint
arXiv:1701.07875, 2017.

[Mitchell 97] T. M. Mitchell, Machine learning, McGraw-Hill New York, 1997, vol. 1, no. 9.

[Paszke 19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen,


Z. Lin, N. Gimelshein, L. Antiga et al, “Pytorch: An imperative style, high-performance
deep learning library”, Advances in neural information processing systems, vol. 32, 2019.

[Ren 22] J. Ren, Q. Zheng, Y. Zhao, X. Xu, C. Li, “DLFormer: Discrete Latent Transformer
for Video Inpainting”, in Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2022, pp. 3511–3520.

[Rensink 00] R. A. Rensink, “The dynamic representation of scenes”, Visual cognition,


vol. 7, no. 1–3, pp. 17–42, 2000.

[Richter 16a] S. R. Richter, V. Vineet, S. Roth, V. Koltun, “Playing for data: Ground
truth from computer games”, in European conference on computer vision, Springer.
2016, pp. 102–118.

[Richter 16b] S. R. Richter, V. Vineet, S. Roth, V. Koltun, “Playing for Data: Ground
Truth from Computer Games”, in European Conference on Computer Vision (ECCV), ser.
LNCS, B. Leibe, J. Matas, N. Sebe, M. Welling, Eds., vol. 9906. Springer International
Publishing, 2016, pp. 102–118.

[Salimans 16] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen,


“Improved techniques for training gans”, Advances in neural information processing
systems, vol. 29, 2016.

[Shrivastava 17] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb,


“Learning from simulated and unsupervised images through adversarial training”, in
Proceedings of the IEEE conference on computer vision and pattern recognition. 2017,
pp. 2107–2116.

[Srivastava 14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,


“Dropout: a simple way to prevent neural networks from overfitting”, The journal of
machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[Targ 16] S. Targ, D. Almeida, K. Lyman, “Resnet in resnet: Generalizing residual


architectures”, arXiv preprint arXiv:1603.08029, 2016.

[Theiss 22] J. Theiss, J. Leverett, D. Kim, A. Prakash, “Unpaired Image Translation via
Vector Symbolic Architectures”, in European Conference on Computer Vision, Springer.
2022, pp. 17–32.

[Van Den Oord 17] A. Van Den Oord, O. Vinyals et al, “Neural discrete representation
learning”, Advances in neural information processing systems, vol. 30, 2017.

[Van Dyck 21] L. E. Van Dyck, R. Kwitt, S. J. Denzler, W. R. Gruber, “Comparing Object
Recognition in Humans and Deep Convolutional Neural Networks—An Eye Tracking
Study”, Frontiers in Neuroscience, vol. 15, p. 750639, 2021.

[Vaswani 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,


Ł. Kaiser, I. Polosukhin, “Attention is all you need”, Advances in neural information
processing systems, vol. 30, 2017.

[Wang 18] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, “High-
resolution image synthesis and semantic manipulation with conditional gans”, in Pro-
ceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp.
8798–8807.

[Wu 19] H. Wu, M. Flierl, “Learning product codebooks using vector-quantized autoen-
coders for image retrieval”, in 2019 IEEE Global Conference on Signal and Information
Processing (GlobalSIP), IEEE. 2019, pp. 1–5.

[Xie 20] X. Xie, J. Chen, Y. Li, L. Shen, K. Ma, Y. Zheng, “Self-supervised cyclegan
for object-preserving image-to-image domain adaptation”, in European Conference on
Computer Vision, Springer. 2020, pp. 498–513.

[Yi 17] Z. Yi, H. Zhang, P. Tan, M. Gong, “Dualgan: Unsupervised dual learning for
image-to-image translation”, in Proceedings of the IEEE international conference on
computer vision. 2017, pp. 2849–2857.

[Zhang 18] Z. Zhang, L. Yang, Y. Zheng, “Translating and segmenting multimodal medical
volumes with cycle-and shape-consistency generative adversarial network”, in Proceedings
of the IEEE conference on computer vision and pattern Recognition. 2018, pp. 9242–9251.

[Zhu 17] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, “Unpaired image-to-image translation
using cycle-consistent adversarial networks”, in Proceedings of the IEEE international
conference on computer vision. 2017, pp. 2223–2232.
A. Appendix

Figure A.1: Images generated after training using discriminators. From left to right, the first two images are synthetic, and the third and fourth are the corresponding generated images. The two reconstructed images on the right show mode collapse.

Figure A.2: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right one is the generated image.

Figure A.3: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right one is the generated image.

Figure A.4: Images generated after training using discriminators with Wasserstein loss. The left image is synthetic and the right one is the generated image.
