
Major Project Report

On

CREATING ART FROM EXISTING IMAGES USING DEEP NEURAL


NETWORK MODELS

Submitted to

Amity University Uttar Pradesh

In partial fulfilment of the requirements for the award of the degree

of

Bachelor of Technology

In

Computer Science and Engineering

By

KARMANYA MENDIRATTA (A2305218077)


VAIBHAV GARG (A2305218094)
SHAIL VERMA (A2305218126)
Under the guidance of

Faculty Guide

Dr. SUBHASH CHAND GUPTA

Department of Computer Science & Engineering

Amity School of Engineering & Technology

Amity University Uttar Pradesh


DECLARATION

We, Karmanya Mendiratta, Vaibhav Garg, and Shail Verma, students of B.Tech (CSE), hereby
declare that the project titled “CREATING ART FROM EXISTING IMAGES USING
DEEP NEURAL NETWORK MODELS”, which is submitted by us to Amity School of
Engineering and Technology, Amity University Uttar Pradesh, Noida, in partial fulfilment
of the requirements for the award of the degree of B.Tech (CSE), has not previously
formed the basis for the award of any degree, diploma, or other similar title or recognition.
We further declare that we have gone through the project guidelines, including the policy
on health and safety and the policy on plagiarism.

Signature(s) of Project Team


Date:

CERTIFICATE

On the basis of the declaration submitted by KARMANYA MENDIRATTA, VAIBHAV
GARG, and SHAIL VERMA, students of B.Tech (CSE), I hereby certify that the project titled
“Creating Art from existing images using Deep Neural Network models”, which is
submitted to the Department of Computer Science and Engineering, Amity School of
Engineering and Technology, Amity University Uttar Pradesh, Noida, in partial fulfilment
of the requirements for the award of the degree of Bachelor of Technology in Computer Science
and Engineering, is an original contribution to existing knowledge and a faithful record of
the work carried out by them under my guidance and supervision.

To the best of my knowledge this work has not been submitted in part or full for any Degree
or Diploma to this University or elsewhere.

Noida
Date:

Dr. Subhash Chand Gupta


Associate Professor
Department of Computer Science and Engineering
Amity School of Engineering and Technology
Amity University Uttar Pradesh, Noida

ACKNOWLEDGEMENT

We take this opportunity to express our profound sense of gratitude and respect to all those
who helped us throughout our project. This report owes much to the dedication and
technical competence of everyone who contributed to it; it would have been almost
impossible to complete this project without their support. We would like to
thank Dr. Sanjeev Thakur, HOD of the CSE Department, and Amity University for
providing us with the opportunity to do this project. We would also like to thank our faculty
guide, Dr. Subhash Chand Gupta, whose constant support and guidance resulted in the
successful completion of this project. Finally, we thank our batch mates and
families, who helped us throughout this time and motivated us to complete the project.

Karmanya Mendiratta (A2305218077)


Vaibhav Garg (A2305218094)
Shail Verma (A2305218126)

TABLE OF CONTENTS

S. NO. TOPIC PAGE NO.

1. Declaration i

2. Certificate ii

3. Acknowledgement iii

4. Abstract vii

5. Chapter 1: Introduction 1

6. Chapter 2: Literature Review 3

7. Chapter 3: Methodology 34

8. Chapter 4: Technologies Used 37

9. Chapter 5: Results 41

10. Chapter 6: Conclusion 59

11. Chapter 7: Future Scope 61

12. References 62

LIST OF TABLES

Table No. Title of Table Page No.

2.1 Comparison of GAN models based on FID 32

LIST OF FIGURES

Figure No. Caption of Figure Page No.

2.1 AI vs ML vs Deep Learning 4
2.2 Deep Neural Network Architecture 6
2.3 Results obtained by Neural Style Transfer Using VGG Network 9
2.4 VGG-19 Architecture 10
2.5 GAN Architecture 12
2.6 Example of a paired and unpaired Dataset 18
2.7 Architecture of CycleGAN for converting a Zebra into a Horse 19
2.8 ProGAN Architecture 21
2.9 StyleGAN Architecture 22
2.10 The Generator's Adaptive Instance Normalization 23
2.11 Comparison between StyleGAN and StyleGAN2 architecture 25
2.12 cGAN Architecture 28
2.13 AttnGAN Architecture 28
2.14 CLIP Architecture 29
2.15 Architecture of StyleGAN+CLIP 30
2.16 Graph comparing FID Scores of GAN models 33
4.1 Logo of Python 37
4.2 Google Colab Icon 39
4.3 Google Drive Icon 40
5.1 Results Obtained for Neural Style Transfer 41
5.2 Input data for VGG Model 42
5.3 Image generated after training the model 42
5.4 Input data for VGG Model 43
5.5 Image generated after training the model 43
5.6 Input data for VGG Model 44
5.7 Image generated after training the model 44
5.8 Input data for VGG Model 45
5.9 Image generated after training the model 45
5.10 Input data for VGG Model 46
5.11 Image generated after training the model 46
5.12 Generators after training the CycleGAN model on the Horse2Zebra dataset for 50 epochs 47
5.13 Discriminator after training the CycleGAN model on the Horse2Zebra dataset for 50 epochs 48
5.14 Results obtained on test data after training the CycleGAN model on the Horse2Zebra dataset for 50 epochs 48
5.15 Generators after training the CycleGAN model on the Monet2Photo dataset for 65 epochs 49
5.16 Discriminator after training the CycleGAN model on the Monet2Photo dataset for 65 epochs 50
5.17 Results obtained on test data after training the CycleGAN model on the Monet2Photo dataset for 65 epochs 50-51
5.18 Initial generated images before training of the model 52
5.19 Input data for the StyleGAN2 model 52
5.20 Generated images after 2 hours of training 53
5.21 Generated images after 8 hours of training 53
5.22 Generated images after 14 hours of training 54
5.23 Generated images after 20 hours of training 54
5.24 Generated images after training the model for 48 hours 55
5.25 Generated images after training the model for 82 hours 55
5.26 Result obtained when “City during a rainy night” was given as prompt 56
5.27 Result obtained when “A cold and rainy night” was given as prompt 56
5.28 Result obtained when “Fantasy Kingdom Deviantart” was given as prompt 57
5.29 Result obtained when “Cyberpunk city ArtstationHQ” was given as prompt 57
5.30 Result obtained when “Underwater Castle ArtstationHQ” was given as prompt 58

ABSTRACT
The innovations and advancements in the field of AI have motivated researchers to explore
its application in several disciplines over the last few years. As a result, the field of AI art
has emerged. This project examines the advancements and breakthroughs in the field of
AI art, as well as how deep neural networks are being used to create art. Two aspects of
AI art are analysed: first, how AI is used for art analysis, and second, how AI is used for
creative purposes to generate artworks. In the context of AI-related research on art
understanding, we present a comprehensive study of machine learning tasks such as
classification, object detection, and automatic painter identification based on a painter's
style. In the context of AI's role in creating art, a comprehensive study of techniques such
as Convolutional Neural Networks, Neural Style Transfer, and Generative Adversarial
Networks is presented.

CHAPTER 1

INTRODUCTION

Undoubtedly, one of the most prominent fields of the present day is Artificial
Intelligence (AI). Artificial Intelligence is a broad branch of Computer Science that deals
with the creation and development of intelligent or smart machines able to perform
tasks and make decisions that would otherwise require human intelligence. These machines
can learn from their experiences, draw conclusions from the information they
have, and adjust to new, unseen input data in order to make logical decisions and perform
human-like tasks.

At first, Artificial Intelligence was viewed purely as a research field, and computer
scientists carried out extensive research on AI and on Machine Learning (ML), which is a
subfield of AI. Based on this research, many practical applications of AI and ML were
identified, and the field has progressed to the point where these techniques are used in
almost every area of study, from Computer Science to Medicine. Thus, over the past two to
three decades, AI has evolved from a field of study and research into a practical technology
with widespread commercial use, appearing everywhere from chess-playing computers to
self-driving cars.

Another sub-field of AI which has seen a great deal of progress and research is Deep Learning.
Deep Learning is a sub-field of Machine Learning that essentially consists of neural
networks with many layers. These neural networks are designed to behave in a way similar
to the human brain, and Deep Learning is therefore used to solve complex problems, for
which neural networks often give very good results. Owing to the complex nature of neural
networks and the many useful applications of deep learning, significant research has been
carried out in the field, and different types of neural networks have been developed to
tackle different kinds of complex problems.

Thus, this extensive research in AI has prompted researchers to critically analyse and
discuss how AI and machine learning can be applied in more unconventional ways and
to look at the field from a newer, more creative perspective. One such area where AI is
being used is the creation and understanding of art. As a result, a new kind of art named
artificial intelligence art (AiArt) has emerged: a creative activity that combines art
with technology by using AI as the core medium to create and to express thoughts and
emotions. This can be done using Computer Vision. Computer Vision is an application of
AI in which machine learning and deep learning are used to make observations and draw
important inferences from input images, videos, or other visual inputs, and to take
actions or make recommendations based on that information.

Art can be described as a way to compose a complex interplay between the content and style
of an image to show and demonstrate different ideas and concepts visually. Artists have
many different ideas and many different ways of representing them, so various
art pieces differ in style and complexity depending on the artist who created them.
Due to the uniqueness, creativity, and complexity involved in the process of creating art, it
was long considered impossible for machines to mimic this process and create their own art.
However, developments in AI have made this possible. ML and DL are used to
understand and recognise the various styles of existing art images so that a model can learn
its own style and generate its own art.

In this project, the advancements made in the field of AI art are studied, and the various
algorithms and methods used in the field are discussed, explored, and compared through a
review of the existing literature on AI art.

CHAPTER 2

LITERATURE REVIEW

Machine learning studies how a machine may learn to accomplish a task without being
explicitly programmed to do so. It is commonly used to predict outcomes based on existing
data or to categorise data into several labels. The study and development of machine learning
have advanced rapidly in the past two to three decades, from study and research to a practical
technology in widespread commercial use. Within the field of artificial intelligence (AI),
ML is very popular and has been widely used to build software addressing automation
problems such as natural language processing, computer vision, object detection, speech
recognition, and classification. Many developers of AI applications now recognise that it
is much easier to train a system on available data than to program the algorithm manually
by anticipating the desired response. The impact of machine learning is also widespread in
the computing industry and in data-intensive industries such as consumer services,
troubleshooting of complex systems, and supply chain control. It has a wide range of
influence in the empirical sciences, from biology to cosmology to the social sciences, because
ML techniques have been developed to analyse experimental data in novel ways.

A significant number of ML algorithms have been developed to cover the wide variety of data
and problem types found in machine learning applications. Conceptually, a machine
learning algorithm can be seen as a search through a large space of candidate programs,
guided by training data; the experience gained from the training data is used to optimise
the performance of the program. Machine learning algorithms vary widely, partly because
of the way they represent candidate programs (for example, decision trees, mathematical
functions, and general programming languages), and partly because of the way they search
the program space (for example, optimization algorithms with well-understood
convergence properties).

As a research field, machine learning lies at the intersection of computer science, statistics,
data science, and many other disciplines concerned with automatic improvement over
time and with making inferences and decisions under uncertainty. Related research fields
include evolutionary research, adaptive management theory, educational practice research,
neurobiology, organisational behaviour, and economics. Although communication with
these other fields has increased in the past decade, we are only beginning to exploit the
potential synergies and the variety of formalisms and experimental methods used across
these fields for studying systems that improve with experience [1].

In contrast, traditional machine-learning approaches were limited in their capacity to
analyse natural data in its raw form. Constructing a feature extractor that converts raw data
into a feature vector which the learning subsystem, usually a classifier, can use to discover
or categorise patterns in the input requires meticulous engineering and deep domain
knowledge. This motivated advances in the field of Deep Learning. Deep
Learning is a subfield of Machine Learning that uses multiple levels of
representation learning, obtained by composing simple but non-linear modules that each
transform the representation at one level (starting with the raw input) into a
representation at a higher, slightly more abstract level. Very complex functions can be
learned by combining enough of these transformations. Deep learning methods use neural
networks with many layers to perform these complex transformations [2].

Fig 2.1: AI vs ML vs Deep Learning

2.1 Artificial Neural Networks

Artificial neural networks (also simply known as neural networks) are pattern-oriented
models that can be used to solve classification and time-series problems. Because neural
networks are non-parametric, models can be created without prior knowledge of the data
distribution or of possible interaction effects between variables, which most parametric
statistical methods require. The processing performed by neural networks is loosely similar
to the way the human brain operates, and neural networks are therefore often considered to
give very good results on such problems. A neural network consists of various nodes, where
each node is analogous to a neuron in the brain. Each node is connected to nodes in the next
layer through weights, which are calculated during the execution of the machine learning
algorithm [3].

The basic architecture of a neural network consists of three types of neuron layers: input,
hidden, and output layers. In feed-forward networks, data moves in the forward direction
from the input units to the output units. A value is calculated for each node based on the
values of the nodes in the previous layer, and this process of calculating and storing
intermediate values from the input layer to the output layer is known as forward
propagation. Feed-forward networks do not contain any feedback connections, so data
flows only in the forward direction, whereas recurrent networks do contain feedback
connections. Backpropagation is a method for calculating the gradients of the neural
network parameters, in which the network is traversed in backward order from the output
layer to the input layer. A neural network must be built in such a way that when a set of
inputs is applied, the desired set of outputs is produced. The strength of the connections
can be determined using a variety of approaches. One method is to use a priori information
to set the weights explicitly. Another is to train the neural network by presenting it with
training patterns and allowing it to adjust its weights according to some learning rule [4].
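
As a small illustration of forward propagation (a sketch for this report, not the project code), the following Python snippet pushes one input vector through a hidden layer and an output layer; the layer sizes and the sigmoid activation are arbitrary choices for the example.

import numpy as np

def sigmoid(z):
    # non-linear activation applied at each node
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input layer: 3 features
W1 = rng.normal(size=(4, 3))    # weights connecting the input layer to 4 hidden nodes
b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4))    # weights connecting the hidden layer to 2 output nodes
b2 = np.zeros(2)

h = sigmoid(W1 @ x + b1)        # forward propagation through the hidden layer
y = sigmoid(W2 @ h + b2)        # output layer activations
print(y)

In a trained network these weights would be adjusted according to a learning rule such as backpropagation, as described above.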

Fig 2.2: Deep Neural Network Architecture[5]

2.2 Convolution Neural Network

The Convolutional Neural Network (CNN) is one of the most widely used deep neural
networks. Convolution can be described as a linear mathematical operation performed
between matrices. A CNN is built from several types of layers, including convolutional,
non-linearity, pooling, and fully connected layers; pooling and non-linearity layers do not
have parameters, whereas convolutional and fully connected layers do [6]. An important
characteristic of a CNN is that it learns its own features and filters, which saves the time
and resources normally spent on manual feature engineering and data pre-processing. A
CNN may require some hand-engineering of features at the beginning, but as training
progresses it adapts, learns new features, and develops its own filters; the CNN thus
continues to improve as the data grows. A CNN works well for data represented as grid
structures, and hence for pattern recognition and image classification problems, since an
image is simply a matrix of pixel values. Because the data in an image forms a grid, the
CNN can capture the spatial and temporal dependencies of the image by applying various
filters, and can therefore be trained to understand and finely characterise the input image.
The role of a CNN is thus to reduce the image into a form that is simpler to process without
sacrificing the features that are critical to obtaining an accurate result; because the image
is reduced to a simpler form, the approach also scales easily to very large datasets.

Thus, the layers of a CNN work as follows (a minimal code sketch is given after the list):
• Convolution layer: This layer computes the dot product between the weights of each
neuron and the region of the input image to which it is connected.
• Non-linearity layer: The non-linearity is used to adjust or cut off the generated output;
this layer is applied in order to saturate or limit the output [6].
• Pooling layer: The pooling layer downsamples the spatial dimensions of the image,
reducing the amount of computation to be performed and the number of features to learn.
• Fully connected output layer: The fully connected output layer is the classifier layer
and gives the final output by classifying the images into the required categories.
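
The following minimal sketch assembles the four layer types listed above into a tiny classifier. PyTorch is used here purely for illustration; the layer sizes, channel counts, and input resolution are arbitrary example values rather than the project's actual configuration.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # one convolution + non-linearity + pooling block followed by a fully connected classifier
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # convolution layer
        self.relu = nn.ReLU()                                    # non-linearity layer
        self.pool = nn.MaxPool2d(2)                              # pooling layer
        self.fc = nn.Linear(16 * 16 * 16, num_classes)           # fully connected output layer

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))
        return self.fc(torch.flatten(x, 1))

logits = SmallCNN()(torch.randn(1, 3, 32, 32))  # e.g. one 32 x 32 RGB image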

Since CNN is widely used for image classification, there has been a significant amount of
research using CNN in the field of AI art as well. Crowley et al. (2014) [7] demonstrated
in their research paper that object classifiers, learnt using CNN features computed from
various natural image sources, can retrieve paintings containing these objects. They were
not only able to detect the objects, but were also able to make annotations on the objects to
show how these objects in the paintings have evolved over time.

After this paper, various advancements were made in object detection, and related
problems such as identifying the positions of objects, recognising faces in paintings, and
analysing the gender of faces in paintings were also addressed. Researchers thus realised
that not only objects but also the style of an image could be detected using CNNs, and these
advancements further supported the rising interest in AI art. Eva Cetinic et al. (2013) [8]
proposed an approach for automated classification of paintings by artist. In their model, the
individual style of an artist is recognised by analysing specific components of a painting
that distinguish the work of one individual from the works of others. Once the style of
an artist has been recognised, their paintings can be classified automatically.

2.3 Neural Style Transfer

Neural Style Transfer builds on texture synthesis methods that use summary statistics such
as histograms, for example the Portilla and Simoncelli approach. For the image analogies
problem, NST can be characterised as histogram-based texture synthesis with convolutional
neural network (CNN) features. The original work employed a VGG-19 architecture
pre-trained on the ImageNet dataset for object recognition.

Neural style transfer is a technique for blending two images, a content image and a style
reference image (such as a famous painter's work), so that the output image looks like
the content image but is "painted" in the manner of the style reference image. This is
accomplished by matching, in the output image, the content statistics of the content image
and the style statistics of the style reference image. These statistics are extracted from the
images using a convolutional network.

2.3.1 VGG-19

The research paper by Gatys et al. [9][10] was one of the first to introduce Neural
Style Transfer, and it triggered the rapid use and development of AI in the field of art. In
the paper, the authors present an artificial neural system built on VGG-19 [11],
a CNN that rivals human performance on a common visual object recognition
benchmark. The model makes use of the feature space provided by the 16 convolutional
and 5 pooling layers of the 19-layer VGG network. The algorithm's main idea is to
iteratively optimise an image so that it matches specified CNN feature distributions,
capturing both the content information of the photograph and the style information of the
artwork. The approach separates image content from style, allowing the content of any
image to be recast in the style of any other image. This is demonstrated by generating
novel, visually appealing images that combine the style of several well-known paintings
with the content of an arbitrarily chosen photograph. The feature responses of
high-performing deep neural networks trained on object recognition are used to derive
neural representations of an image's content and style. This work was the first to show how
such image features can be used to separate content from style in natural images.

Fig 2.3: Results obtained by Neural Style Transfer Using VGG Network[9]

2.3.1.1 VGG-19 Architecture

• Input: The VGGNet accepts images with a size of 224 X 224 pixels. To keep the image
input size consistent for the ImageNet competition, the model's authors cropped out the
centre 224 X 224 patch of each image. The only pre-processing performed is to subtract
from each pixel the mean RGB value computed on the training dataset.
• Convolution layers: VGG's convolutional layers use a small receptive field (3 X 3),
the smallest size that still captures up/down and left/right movement. There are also 1
X 1 convolution filters that perform a linear transformation of the input. Each convolution
is followed by a ReLU unit, an important innovation from AlexNet that cuts down training
time. The rectified linear unit (ReLU) activation function is a piecewise linear function that
outputs the input if it is positive and zero otherwise. The convolutional stride is set to
1 pixel, and the spatial padding of the convolutional layer input is set to 1 pixel for the
3 X 3 convolution layers, so that the spatial resolution is preserved after convolution.
Spatial pooling is then performed by five max-pooling layers, which follow some (but not
all) of the convolution layers. Max-pooling is performed over a 2 X 2 pixel window with
stride 2.
• Hidden layers: The VGG network's hidden layers all use ReLU. Local Response
Normalization (LRN) is generally not used in VGG, since it increases memory usage and
training time without improving accuracy.
• Fully connected layers: The architecture consists of a stack of convolutional layers
of varying depth in the different configurations, followed by three fully connected (FC)
layers: the first two FC layers each have 4096 channels, while the third performs 1000-way
classification and hence has 1000 channels, one for each class. The final layer is the
soft-max layer. The fully connected layers are configured in the same way in all the
networks [12].

Fig 2.4: VGG-19 Architecture [13]

2.3.1.2 Content Loss

It aids in the identification of similarities between the produced picture and the content
image. Higher layers of the model, intuitively, focus more on the characteristics contained
in the picture, i.e., the total content of the image.
The Euclidean distance between the corresponding intermediate higher-level feature
representations of the input picture (x) and the content image (p) at layer l is used to
determine content loss.
It's natural for a model to generate distinct feature maps at higher layers when different
objects are present.
This allows us to conclude that photos with comparable content should have similar
activations in higher layers.
L_content(p, x, l) = (1/2) Σ_i,j (F_ij^l − P_ij^l)^2        (2.1)

2.3.1.3 Style Loss

Content loss is fundamentally distinct from style loss. The comparison of intermediate
features of two images to get style loss is not possible. So here we have introduced a new
term known as Gram Matrices. The gram matrix is a method of interpreting style
information in a picture by displaying the overall distribution of characteristics in a specific
layer. It is the level of correlation between feature maps in a specific layer that is assessed.

Style loss is calculated by the distance between the gram matrices (or, in other terms, style
representation) of the generated image and the style reference image.
The contribution of each layer in the style information is calculated by the below formula:

E_l = (1 / (4 N_l^2 M_l^2)) Σ_i,j (G_ij^l − A_ij^l)^2        (2.2)

Now, the total style loss can be expressed as:

L_style(a, x) = Σ_{l ∈ L} w_l E_l        (2.3)

where the contribution of each layer to the style loss is given by the weighting factor w_l [13].
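
The content and style losses of Eqs. (2.1)-(2.3) can be sketched in code as follows. This is an illustrative sketch only: F_l, P_l, and A_l are assumed to be feature maps of shape (channels, height, width) extracted from a pre-trained VGG network (the extraction itself is not shown), and PyTorch is used purely for illustration.

import torch

def content_loss(F_l, P_l):
    # Eq. (2.1): squared error between generated-image and content-image features at layer l
    return 0.5 * torch.sum((F_l - P_l) ** 2)

def gram_matrix(feat):
    # feat: feature maps of shape (N_l channels, height, width); M_l = height * width
    n_l = feat.shape[0]
    flat = feat.reshape(n_l, -1)
    return flat @ flat.t()

def layer_style_loss(F_l, A_l):
    # Eq. (2.2): compares the Gram matrices of the generated image and the style image at layer l
    n_l, m_l = F_l.shape[0], F_l.shape[1] * F_l.shape[2]
    G, A = gram_matrix(F_l), gram_matrix(A_l)
    return torch.sum((G - A) ** 2) / (4 * n_l ** 2 * m_l ** 2)

# Eq. (2.3): the total style loss is a weighted sum of the per-layer terms, e.g.
# L_style = sum(w_l * layer_style_loss(F[l], A[l]) for l in chosen_layers)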

2.4 Generative Adversarial Networks (GAN)

Another revolutionary innovation in the field of AI art was the invention of Generative
Adversarial Networks (GANs). Introduced by Goodfellow et al., GANs constitute a
significant milestone in the effort to use machines to create new visual content.
A GAN's main mechanism is to train two "competing" models, a discriminator and a
generator, both of which are typically implemented as neural networks. The generator's
purpose is to produce realistic images by capturing the distribution of the real examples in
the input sample, while the discriminator classifies generated images as fake or real [14].
Because training is designed as a minimax optimization problem, the process terminates at
a point that is a minimum with respect to the generator and a maximum with respect to the
discriminator. Implementations of this framework have produced outstanding results in
creating plausible artificial variants of real images for many kinds of image content [14].
GANs quickly rose to prominence as one of the most important areas of research in AI, with
several advanced and domain-specific versions of the original design emerging, e.g.
CycleGAN [15], StyleGAN [16] and BigGAN [17].

Fig 2.5: GAN Architecture [18]

GAN models give impressive results and are able to generate very realistic images;
however, they often do not learn the way one would expect. In conventional models,
convergence on the training dataset is sought and is observed as the minimisation of the
loss function on that dataset. In a GAN, convergence instead signals the end of the
two-player game between the generator and the discriminator, and the goal is to find a
balance between the generator and discriminator losses. To gain a better understanding of
how GANs work, we examine below the variants of their loss functions.

2.4.1 GAN Loss Functions

2.4.1.1 Minimax Loss


The typical GAN loss function, often known as the min-max loss, was initially described
by Ian Goodfellow et al. [14]. This function is minimised by the generator and maximised
by the discriminator. The formulation is effective when viewed as a min-max game, but in
practice it saturates for the generator: when the generator cannot keep up with the
discriminator, it frequently stops learning.

E_x[log(D(x))] + E_z[log(1 − D(G(z)))]        (2.4)

In this function:
• D(x) is the discriminator's estimate of the probability that the real data instance x is
real.
• E_x is the expected value over all real data instances.
• G(z) is the generator's output when given noise z.
• D(G(z)) is the discriminator's estimate of the probability that a fake instance is real.
• E_z is the expected value over the generator's random inputs (in effect, the expected value
over all generated fake instances G(z)).
• The formula arises from the cross-entropy between the real and generated
distributions.
Since the generator cannot directly alter the function's log(D(x)) term, minimising the loss

is equivalent to minimising log (1 - D(G(z))).

The standard GAN loss function can be split into two parts, the discriminator loss and the
generator loss; these are discussed below, and a short code sketch of both follows the list:

• Discriminator loss: While it is being trained, the discriminator classifies both genuine
data and fake data produced by the generator. It is penalised for misclassifying a real
instance as fake or a fake instance (made by the generator) as real, which corresponds to
maximising the loss function above.
• Generator loss: As it is being trained, the generator samples random noise and produces
an output from it. The output is then passed to the discriminator, which classifies it as
"real" or "fake" based on its ability to distinguish the two. The generator loss is then
determined from the discriminator's classification: the generator is rewarded if it succeeds
in fooling the discriminator and penalised if it fails [19].
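
A brief sketch of the two losses described above, written with PyTorch purely for illustration. The discriminator outputs d_real = D(x) and d_fake = D(G(z)) are assumed to be probabilities in (0, 1); maximising Eq. (2.4) for the discriminator is implemented here as minimising the equivalent binary cross-entropy.

import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # penalises misclassifying real instances as fake and fake instances as real
    real_term = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake_term = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real_term + fake_term

def generator_loss_minimax(d_fake):
    # original min-max form: the generator minimises log(1 - D(G(z)))
    return torch.mean(torch.log(1.0 - d_fake))

def generator_loss_modified(d_fake):
    # modified form (Section 2.4.1.2): the generator instead maximises log(D(G(z))),
    # implemented as minimising the binary cross-entropy against "real" labels
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))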

2.4.1.2 Modified Minimax Loss


According to the original GAN paper, the above minimax loss function can cause the GAN
to get stuck in the early phases of training, when the discriminator's job is still quite easy.
The paper therefore advises changing the generator loss so that the generator strives to
maximise log(D(G(z))) [14].

2.4.1.3 Wasserstein Loss


This loss function comes from a variant of the GAN model known as the "Wasserstein
GAN" or "WGAN", in which the discriminator does not actually classify instances; it
outputs a number for each instance. Because this output is not constrained to lie between
0 and 1, a threshold of 0.5 cannot be used to decide whether an instance is real or fake.
Discriminator training simply attempts to make the output for real instances larger than
the output for fake instances. Because it does not output a probability of real versus fake,
the WGAN discriminator is referred to as a "critic" rather than a "discriminator". This
difference has theoretical significance, but for practical purposes we can treat it as an
acknowledgement that the inputs to the loss functions do not have to represent probabilities.
The loss functions themselves are surprisingly straightforward (a short code sketch follows the list):
i. Critic Loss: D(x) - D(G(z))
This function is maximised by the discriminator. To put it another way, it aims to
maximise the difference between its output on genuine and false instances.
ii. Generator Loss: D(G(z))
This function is optimised by the generator. It aims to maximise the discriminator's
output for its fake instances, in other words.
In these functions:
• D(x) is the critic's output for a real instance.
• G(z) is the generator's output when given noise z.
• D(G(z)) is the critic's output for a fake instance.
• The output of critic D does not have to be between 1 and 0.
• The formulas derive from the earth mover distance between the real and generated
distributions [20]
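
The critic and generator losses listed above translate almost directly into code. The sketch below (PyTorch, for illustration only) omits the additional constraint that WGAN places on the critic, such as weight clipping or a gradient penalty.

import torch

def critic_loss(critic_real, critic_fake):
    # the critic maximises D(x) - D(G(z)); negating it gives a loss to minimise
    return -(torch.mean(critic_real) - torch.mean(critic_fake))

def wgan_generator_loss(critic_fake):
    # the generator maximises D(G(z)), i.e. minimises its negative
    return -torch.mean(critic_fake)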

2.4.2 Common Problems In GAN

2.4.2.1 Vanishing Gradients


This situation occurs when the discriminator performs much better than the generator: the
updates that reach the generator become uninformative or vanish completely. This
phenomenon is known as vanishing gradients. One possible explanation is that the generator
is penalised so strongly that the post-activation values reach saturation and the gradient
vanishes. Attempts have been made to remedy this problem by using the Wasserstein loss
and the modified minimax loss [21].

2.4.2.2 Mode Collapse


Typically, you'll want your GAN to generate a wide range of outputs. You might want a
distinct face for each random input to your face generator, for example. If a generator
provides a very believable result, however, the generator may learn to exclusively produce
that output. In fact, the generator is continually looking for the one output that the
discriminator finds most believable.

If the generator starts producing the same output (or a small collection of outputs) over and
over again, the discriminator's best strategy is to learn to always reject that output.
However, if the next generation of the discriminator gets stuck in a local minimum and
fails to find this optimal strategy, the next generator iteration can find the output that is
most plausible to the current discriminator far too easily. Each generator iteration then
over-optimises for a specific discriminator, and the discriminator never learns to escape the
trap. As a result, the generator alternates between a limited number of output types. This
type of GAN failure is called mode collapse. Attempts have been made to remedy this
problem by using the Wasserstein loss and by designing GAN variants that are trained in
ways that reduce mode collapse; these GANs are discussed later in the report [21].

2.4.2.3 Failure to Converge


As two networks are being trained at the same time, GAN convergence was one of the first,
and possibly one of the most difficult challenges to solve since its inception. In most
circumstances, the ideal situation in which both networks stabilise and produce consistent
results is difficult to achieve. One explanation for this issue is that while the generator
improves with subsequent epochs, the discriminator deteriorates because the discriminator
is unable to distinguish between the real and fake image. The discriminator has a 50 percent
accuracy if the generator succeeds every time, akin to flipping a coin. This puts the GAN's
overall convergence in jeopardy. To tackle this problem, researchers have used various
forms of regularisation to improve GAN convergence, including adding noise to the
discriminator inputs and penalising the discriminator weights [21].

2.5 Types of GANs

2.5.1 CycleGAN

Image-to-image translation entails creating a new synthetic version of a given image with
a specific change, such as turning a summer landscape into a winter scene. A large collection
of matched examples is usually required to train a model for image-to-image translation.
Such datasets, for example images of artworks by long-dead painters, can be difficult and
expensive to create, if not impossible in some situations. CycleGAN is a technique for
training image-to-image translation models automatically without paired samples. The
models are trained in a largely unsupervised way using a set of images from the source
domain and a set from the target domain that are not connected in any way.

CycleGAN was proposed by Zhu et al. In their paper, the authors, in the absence of any
paired training examples, offered a system that learns to capture the specific properties of
one image collection and works out how these properties can be transferred to another
image collection. A key finding of the paper was that the CycleGAN model can be used to
transfer the style of images. Unlike Neural Style Transfer, however, CycleGAN learns
to emulate the style of a complete collection of artworks (e.g. Van Gogh) rather than a
single piece of art (e.g. Starry Night). The authors were also able to reverse the
process, i.e. convert a painting into a realistic image, using the CycleGAN model.
CycleGAN can learn the special characteristics of one image collection and translate them
into another image collection without any paired training examples. The developers of the
model first compared it against existing methods for unpaired image-to-image translation
on paired datasets where input-output pairs exist. The importance of the cycle consistency
loss and the adversarial loss was then investigated, and the full method was compared
against several ablated variants. Finally, they demonstrated the applicability of the
technique in a variety of applications that do not require paired data [15]. One such
application where CycleGAN has been used is FaceApp, which shows how a person might
look at different ages [22].

2.5.1.1 Architecture
A CycleGAN, in contrast to a traditional GAN, is made up of two GANs, giving it
a total of two generators and two discriminators. In CycleGAN, the problem is modelled
as an image reconstruction problem. We begin by taking an input image (x) and translating
it into the target domain using the generator G. Then, using the generator F, we reverse the
process and map the translated image back to a reconstructed version of the original image.
The mean squared error loss between the original and reconstructed images is then
computed. The most crucial aspect of CycleGAN is that it can perform image translation
on unpaired data, where there is no direct correspondence between the input and output
images.

Fig 2.6: Example of a paired and unpaired Dataset [15]
A paired dataset is one in which each input image has a counterpart in the dataset whose
style is to be incorporated into the other set of images: as shown in the example above, for
every image in set x_i there is a corresponding image with the same content in a different
style in set y_i. For an unpaired dataset, such a complete correspondence is not necessary;
as shown in the example above, a set of arbitrary photographs and a separate set of
paintings can be used to implement CycleGAN.
Cycle consistency is a supplementary extension to the architecture used by the CycleGAN.
The notion is that an image output by the first generator can be utilised as input to the
second generator, and the output of the second generator should match the original image.
The opposite is also true: an output from the second generator can be provided as input to
the first generator, with the result matching the input to the second generator. Cycle
consistency is a machine translation concept that states that a phrase translated from
English to French should translate back to English and be identical to the original phrase.
The opposite should also be true [23]. Given two sets of images, for example, horses and
zebras, one generator transforms horses into zebras and the other transforms zebras into
horses. The discriminators are present during the training phase to determine if images
computed by the generators are authentic or not. With the feedback of their respective
discriminators, the generators improve their performance during this process. In the case of
CycleGAN, each generator additionally receives feedback from the other. This feedback
ensures that an image produced by a generator is cycle consistent, meaning that applying
both generators to an image sequentially should yield an image similar to the original [24].
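
A sketch of the cycle-consistency term described above, assuming G and F are the two generator networks (hypothetical callables here) and using the mean squared reconstruction error mentioned in Section 2.5.1.1; PyTorch is used purely for illustration.

import torch

def cycle_consistency_loss(real_x, G, F_gen):
    # G: generator mapping domain X to domain Y; F_gen: generator mapping Y back to X
    fake_y = G(real_x)               # translate the image into the target domain
    reconstructed_x = F_gen(fake_y)  # translate it back into the source domain
    # reconstruction error between the original image and the cycled image
    return torch.mean((reconstructed_x - real_x) ** 2)

# The full CycleGAN objective adds the adversarial losses of both GANs and the
# symmetric cycle term that starts from an image in domain Y.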

Fig 2.7: Architecture of CycleGAN for converting a Zebra into a Horse [24]

2.5.1.2 Applications
Other than transforming a horse into a zebra, CycleGAN can be used in a variety of ways.
It is a very adaptable model, so much so that it can turn an apple into an orange. Admittedly
not the most useful example, but here are a few others:
• Using blueprints, creating a realistic picture of what a structure would look like.
• Visualising how an area might appear in each season.
• Changing artwork into real-life images.
• Creating a realistic picture of a suspect's face from a police sketch.
• Enhancing photo elements to make them appear more professional [23].

2.5.2 ProGAN

Although GANs were very powerful and were giving very good results, researchers still
found it troublesome to generate high-quality large images (e.g. 1024 x 1024) until 2017,
when Karras et al. [25] first tackled the problem with the Progressive GAN (ProGAN).
The key idea of this model is to train the generator and discriminator progressively:
rather than trying to train all layers of the generator and discriminator at once, as is typical,
the team built their GAN one layer at a time to handle increasingly higher-resolution
versions of the images. To accomplish this, they first reduced the resolution of their
training photos to a very low starting point (only 4 x 4 pixels). To generate images at this
low resolution, they developed a generator with only a few layers and a discriminator with
a mirrored architecture. Because these networks were so small, they learned just the
large-scale features evident in the extremely blurred images and trained quickly.

They added another layer to the Generator and Discriminator after the initial layers finished
training, boosting the output resolution to 8x8. The preceding layers' trained weights were
preserved but not locked, and the new layer was progressively faded in to help stabilise the
transition. The training continued until the GAN could synthesise convincing images once
more, this time at the new 8x8 resolution. They continued to add layers, double the
resolution, and train in this manner until they reached the desired output size [25].

Thus, by progressively increasing the resolution, the network continuously learns a much
simpler piece of the overall problem; this incremental learning process greatly stabilises
the training of the model and also reduces the chances of mode collapse.
The low-to-high resolution trend also encourages progressively growing networks to
prioritise high-level structure (patterns evident even in the most blurred forms of the image)
above details. This increases the final image's quality by lowering the chances that the
network will get some high-level structure completely wrong. Gradually increasing the
network size is also more computationally efficient than the more traditional approach of
initialising all the layers at once. Fewer layers are faster to train because they contain fewer
parameters. Since all but the final set of training iterations are performed with a subset of
the eventual layers, impressive efficiency gains are obtained. Karras et al. discovered that,
depending on the output resolution, their ProGAN trained about 2–6 times faster than a
corresponding traditional GAN.

Therefore, ProGANs were able to generate high-quality images; however, their capacity to
control specific characteristics of the generated image is limited, as is the case with most
models. To put it another way, the features are entangled, so changing the input, even
slightly, frequently affects numerous features at once [25].

Fig 2.8: ProGAN Architecture [26]

2.5.3 StyleGAN

Karras et al. proposed a style-based GAN architecture (StyleGAN), which offers an
upgraded version of ProGAN's image generator, with a focus on the architecture of the
generator network. The study proposes an alternative generator design for GANs which,
according to the authors' findings, is superior to the standard GAN generator architecture.
This architecture achieves an automatically learned, unsupervised separation of high-level
features and stochastic variation in the generated images. As a result, StyleGAN
dramatically alters the generator's design [16]. The generator does not accept a point from
the latent space directly as input; instead, two additional sources of randomness are
employed to build a synthetic image: an independent mapping network and noise layers.

2.5.3.1 Mapping Network


The purpose of the Mapping Network is to encode the input vector into an intermediate
vector whose various constituents govern various visual characteristics. This is a difficult
operation since the input vector's ability to regulate visual characteristics is limited because
it must follow the training data's probability distribution. If the dataset has more
photographs of people with black hair, for example, more input values will be translated to
that feature.

As a result, a phenomenon known as feature entanglement prevents the model from cleanly
mapping parts of the input vector to individual features. By employing another neural
network, however, the model can produce a vector that does not have to follow the training
data distribution, which lessens the correlation between features.

The Mapping Network is made up of eight fully connected layers, with an output vector w
that is the same size as the input layer (512 x 1). The latent space vector is sent through this
8-layer mapping transformation, followed by an 18-layer synthesis network whose layers
produce images ranging from 4 x 4 up to 1024 x 1024. A further convolution layer is used
to output an RGB image from the final layer.
The output of the mapping network is a vector that defines the styles, which are merged at
each point in the generator model via a new layer called adaptive instance normalisation
[16].

Fig 2.9: StyleGAN Architecture [27]

2.5.3.2 Adaptive Instance Normalization (AdaIN)
The AdaIN (Adaptive Instance Normalization) [28] layer transmits the Mapping Network's
encoded information w into the generated image. Each resolution level of the Synthesis
Network has a module that defines the visual expression of the features at that level:
• Each channel of the convolution layer output is first normalised, to ensure that the
scaling and shifting of step 3 have the desired impact.
• Another fully connected layer (designated as A) transforms the intermediate vector into
a scale and a bias for each channel.
• Each channel of the convolution output is scaled and shifted by these vectors, defining
the importance of each filter in the convolution. This fine-tuning converts the information
in w into a visual representation.
Using distinct style vectors at different places in the synthesis network makes it possible to
adjust the styles of the final image at various levels of detail [27]. A simplified sketch of
the AdaIN operation is given below.
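
The following sketch (PyTorch, for illustration only) normalises each channel of the convolution output and then applies the style-dependent scale and bias that the affine transformation A derives from w.

import torch

def adain(x, style_scale, style_bias, eps=1e-5):
    # x: convolution output of shape (batch, channels, height, width)
    # style_scale / style_bias: per-channel vectors produced from w by the learned affine layer "A"
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - mean) / std                  # per-channel instance normalisation
    scale = style_scale.view(1, -1, 1, 1)
    bias = style_bias.view(1, -1, 1, 1)
    return scale * normalized + bias               # style-dependent scale and shift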

Fig 2.10: The Generators Adaptive Instance Normalization [27]

2.5.3.3 Style Mixing


Another important feature of StyleGAN is Style Mixing. In each level of the synthesis
network, the StyleGAN generator employs the intermediate vector, which may cause the
network to learn that levels are correlated. The model takes two input vectors at random
and constructs an intermediate vector for them to reduce correlation. It then uses the first
to train some of the levels before switching (at random) to the second to train the remaining

levels. The random switch prevents the network from learning and relying on a level
correlation. Though it doesn't increase model performance across all datasets, this notion
has an intriguing side effect: it can merge numerous images in a coherent manner. The
model creates two images, A and B, which are then combined using low-level features
from A and the rest of the information from B [16].

2.5.3.4 Addition of Noise


Each convolutional layer in the synthesis network produces a block of activation maps as
its output. Prior to the AdaIN processes, Gaussian noise is applied to each of these
activation maps. For each block, a different sample of noise is generated and evaluated
using per-layer scaling factors.
At a given level of detail, this noise is employed to introduce style-level variance.
Similarly, to the preceding example of varying style, the authors altered the usage of noise
at different degrees of detail in the model (e.g. fine, middle, coarse). As a result, noise
provides control over the generation of detail, ranging from broader structure when noise
is used in coarse layers to fine detail generation when noise is added to layers closer to the
network's output [16].

2.5.4 StyleGAN2

Modifications to both the architecture and the training methods of StyleGAN were
proposed by Karras et al. in a follow-up paper. In order to improve conditioning in the
mapping from latent codes to images, they revised the generator normalisation, revisited
progressive growing, and regularised the generator. The result improves the image quality
generated by StyleGAN and is known as StyleGAN2. Firstly, StyleGAN2 uses weight
demodulation, a new normalisation technique that replaces adaptive instance
normalisation. Weight demodulation fixes the droplet-like artefacts that were observed in
StyleGAN-generated images from around 64 x 64 resolution. Adaptive Instance
Normalization scales and shifts intermediate activations, similar to other normalisation
layers such as Batch Norm; Instance Norm, however, normalises each image individually,
whereas Batch Norm uses mean and variance parameters obtained from batch statistics.
Adaptive Instance Norm uses different scale and shift parameters to align different sections
of the source data with different regions of the feature map (either within each feature map
or by grouping features channel-wise by spatial location). Weight demodulation removes
the scale and shift parameters from the sequential calculation pipeline and instead bakes
the scaling into the convolutional layer parameters; the shifting of values (done with y in
AdaIN) is effectively taken over by the noise map.
Secondly, an improved training method is developed that achieves the same purpose as
progressive growing - training starts with low-resolution images and then gradually shifts
focus to higher and higher resolutions - without modifying the network topology during
training. New types of regularisation are also proposed, such as lazy regularisation and
path length regularisation [29].
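
A simplified sketch of weight demodulation (PyTorch, illustrative only): the per-channel style scale is folded into the convolution kernel, and the kernel is then rescaled so that each output feature map keeps a unit expected magnitude, replacing the explicit AdaIN scaling of activations.

import torch

def demodulate_weights(weight, style, eps=1e-8):
    # weight: convolution kernel of shape (out_channels, in_channels, k, k)
    # style: per-input-channel scale derived from the latent code w
    w = weight * style.view(1, -1, 1, 1)                                # modulate
    sigma = torch.rsqrt((w ** 2).sum(dim=(1, 2, 3), keepdim=True) + eps)
    return w * sigma                                                    # demodulate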

Fig 2.11: Comparison between StyleGAN and StyleGAN2 architecture [29]

2.5.5 BigGAN

Brock et al. proposed BigGAN, which builds on the Self-Attention GAN (SAGAN) model
[30]. They found that GANs trained to model natural images of various categories benefit
greatly from scaling up, both in terms of quality and sample diversity. As a consequence,
the model achieves a new level of performance among ImageNet GAN models, vastly
outperforming the previous state of the art. They also offered an examination of the training
behaviour of large-scale GANs, characterised their stability in terms of the singular values
of their weights, and addressed the relationship between stability and performance [17].

In BigGAN, the authors implemented two architectural adjustments to boost scalability
while also increasing conditioning by using orthogonal regularisation on the generator. The
orthogonal regularisation applied to the generator makes the model accessible to the
"truncation technique," which allows for fine control of the fidelity-variation trade-offs by
truncating the latent space. In terms of stability, the authors identified and characterised
instabilities particular to large-scale GANs, then devised strategies to mitigate the
instabilities – but at a substantial performance cost. BigGAN achieves an Inception Score
(IS) of 166.3 when trained on the ImageNet dataset at 128 X 128 resolution, an improvement
of more than 100 percent over the previous state-of-the-art (SotA) score of 52.52.
In addition, the Frechet Inception Distance (FID) score has been improved from 18.65 to
9.6. BigGAN outperformed the previous SotA on ImageNet at 256 X 256 and 512 X 512
resolutions, in addition to its performance improvement at 128 X 128 resolutions. [31]

2.5.6 Artificial Intelligence Creative Adversarial Network (AICAN)

To push the creative content generation capabilities of GAN technology even further,
Elgammal et al. introduced AICAN, the artificial intelligence creative adversarial network.
They suggest in their study that training a GAN model on paintings would merely teach it
how to create pictures that look like existing art, and that this, like the Neural Style Transfer
technique, would not produce anything truly artistic or innovative. The model they propose
generates art in a more creative manner by learning the style of existing paintings and then
deviating from the learned style [32]. In their paper, the authors propose changing the
optimization criterion to allow the network to develop new art by increasing divergence
from established styles while remaining within the overall distribution of art. This deviation
helps ensure that the generated artwork appears original and differs from existing artistic
styles. Through a series of exhibits and trials, the authors also showed that viewers were
frequently unable to distinguish between AICAN-generated images and artworks made by
human artists. Thus, the results generated by the AICAN model are novel and often
indistinguishable from human-made artwork [32].

2.5.7 Conditional GAN (cGAN)

The conditional generative adversarial network, or cGAN for short, is a form of GAN in
which the generator model generates images conditionally. A conditional setup is used in
cGANs, meaning that auxiliary information (such as class labels or data from other
modalities) is fed to both the generator and the discriminator. As a result, by being fed
varied contextual information, the model can learn a multi-modal mapping from inputs to
outputs. Sharing this additional information gives two advantages:
• Convergence will occur more quickly; even the distribution of the fake images will have
some sort of pattern.
• At test time, the generator's output can be controlled by providing a label for the image
to be generated.
Assume a GAN is trained on handwritten digits (MNIST images). Normally, there is no
influence over the images the generator produces; in other words, a specific digit cannot be
requested from the generator. This is where cGANs come into play, as they let us add an
additional input of one-hot-encoded image labels. This additional input directs the
generator as to which image to generate [33].
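
Conditioning the generator on a label can be as simple as concatenating a one-hot-encoded label to the noise vector, as in the MNIST example above. The sketch below (PyTorch, illustrative only; the latent dimension of 100 is an arbitrary choice) shows how a specific digit could be requested.

import torch

def condition_input(noise, labels, num_classes=10):
    # noise: (batch, latent_dim) random vectors z; labels: (batch,) integer class labels
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    # the generator receives the label alongside z, so a specific class can be requested
    return torch.cat([noise, one_hot], dim=1)

z = torch.randn(4, 100)
digits = torch.tensor([3, 3, 7, 7])            # ask for two 3s and two 7s
generator_input = condition_input(z, digits)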

In their paper, Matsumura et al. propose a method of tile art image generation that makes
use of machine learning methods and is based on the conditional GAN. In a conditional
GAN, a source image is provided as an additional input to the GAN. In the proposed
method, with the help of the conditional GAN, the authors directly produce a tile art image
from the provided image [34]. The tile art image is generated using forward propagation.
The suggested approach's network architecture is based on the pix2pix method with
conditional GANs [35].

Fig 2.12: cGAN Architecture [33]

2.5.8 Attentional GAN (AttnGAN)

Xu et al. introduced the Attentional GAN (commonly known as AttnGAN), which enables multi-stage, attention-driven refinement for fine-grained text-to-image generation. Using an attentional generative network, AttnGAN synthesises fine-grained details in distinct sub-regions of the image by paying attention to the essential words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed in the paper for training the generator by computing a fine-grained image–text matching loss. Visualizing AttnGAN's attention layers allows for a more complete analysis. It is the first time that a layered attentional GAN has been shown to be capable of automatically selecting the condition at the word level for producing distinct parts of the image [36].

Fig 2.13: AttnGAN Architecture [36]

2.6 Text Guided Art Generation

2.6.1 Contrastive Language Image Pre-training(CLIP)

In January 2021, OpenAI published Contrastive Language–Image Pre-training (CLIP), which was presented in the paper Learning Transferable Visual Models From Natural Language Supervision. Natural language supervision is proposed as a way to improve the generality and robustness of deep learning models for image classification tasks. The resulting models achieve state-of-the-art performance on numerous benchmarks in a zero-shot setting, which is very remarkable.
CLIP's basic idea is to use enormous volumes of image data taken from the Internet, together with the associated captions, to jointly pre-train a neural language model and an image classification model. The "Text Encoder" represents the language model, whereas the "Image Encoder" represents the image classification model [37].

Fig 2.14: CLIP Architecture [38]


2.6.2 StyleGAN+CLIP

The concept is simple: to generate a picture, we start with random values for the w latent vectors of StyleGAN. The resulting image, along with an arbitrary text prompt, is passed to CLIP. CLIP gives the image a score based on how well it represents the prompt's content. This score is used to update w, which generates another picture, and so on, until we conclude that the generated image is sufficiently similar to the prompt. The values of w are updated using gradient descent and backpropagation, as if they were weights in a neural network [39].
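A simplified sketch of this loop is shown below. Here G is a placeholder for a pre-trained StyleGAN generator that maps a latent tensor directly to an image, and clip_score is a placeholder for a CLIP-based similarity function such as the one sketched earlier; neither corresponds exactly to the code of [39].

```python
import torch

def clip_guided_generation(G, clip_score, prompt, w_dim=512, num_iters=300, lr=0.05):
    """Optimize StyleGAN latents so the generated image matches a text prompt."""
    # Start from random latent values and treat them like trainable weights
    w = torch.randn(1, w_dim, requires_grad=True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(num_iters):
        image = G(w)                       # synthesize an image from the current latent
        loss = -clip_score(image, prompt)  # higher CLIP similarity -> lower loss
        optimizer.zero_grad()
        loss.backward()                    # gradients flow back through CLIP and G into w
        optimizer.step()                   # update w as if it were a network weight

    return G(w.detach())
```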

The architecture of StyleGAN+CLIP is as follows:

Fig 2.15: Architecture of StyleGAN+CLIP [39]

2.7 Generative Adversarial Networks Evaluation

The generator model of a GAN is trained using a second model called a discriminator that
learns to classify images as real or fake, unlike conventional deep learning neural network
models that are trained with a loss function until convergence. To ensure equilibrium, both
the generator and discriminator models are trained concurrently. As a result, there is no
objective loss function used to train GAN generator models, and there is no way to
objectively assess the training progress and the relative or absolute quality of the model
based on loss alone.

Therefore, it is difficult to quantify and compare the performance of GAN models using conventional metrics. Instead, metrics such as the Fréchet Inception Distance (FID) and the Inception Score (IS) are generally used to compare GAN models.

2.7.1 Manual GAN Evaluation

Many GAN practitioners rely on manual examination of images produced by a generator model to evaluate GAN generators. This entails using the generator model to generate a batch of synthetic images, then assessing the quality and diversity of the images with respect to the target domain. This assessment is generally carried out by the researcher or practitioner themselves. Human visual assessment of samples is one of the most prevalent and intuitive methods of evaluating GANs.
Although inspecting GANs manually can be the most basic form of model evaluation, it
has several drawbacks, including:
• It is subjective, including prejudices of the reviewer concerning the model, its
configuration, and the project purpose.
• It necessitates understanding of what is and is not feasible for the target domain.
• It is constrained by the number of photographs that can be evaluated in an acceptable
amount of time.
Using human vision to evaluate the quality of generated images is costly and time-consuming, skewed, difficult to replicate, and does not reflect the real capacity of models. The subjectivity almost always results in unbalanced model selection and nitpicking, and it is therefore not advised for definitive model selection on non-trivial projects. As a result, better and more sophisticated GAN evaluation metrics came into the picture; they are discussed below [40].

2.7.2 Inception Score (IS)

The Inception Score (IS) is a metric for assessing the quality of generated images, particularly synthetic images produced by GAN models. The Inception Score employs a pre-trained deep learning neural network model for image classification to classify the generated images. The model is used to classify a large number of generated images, predicting for each image the probability of it belonging to each class. The Inception Score is computed by combining these predictions. The goal of the score is to capture two characteristics of a set of generated images:
• The quality of the generated images
• The diversity of the generated images

The Inception Score has a lowest possible value of 1.0 and a highest possible value equal to the number of classes supported by the classification model. A higher Inception Score indicates better performance of the GAN [41].
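For completeness, the Inception Score proposed in [41] is usually written as the exponentiated Kullback–Leibler divergence between the conditional label distribution p(y|x) predicted for a generated image x and the marginal distribution p(y) over all generated images:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\big[\,D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\,\big]\Big)$$

A sharp conditional distribution (confident, high-quality images) and a broad marginal distribution (diverse images) both push the score up.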

2.7.3 Fréchet Inception Distance (FID)

This is one of the most widely used metrics for comparing real and generated images in
terms of feature distance. The Fréchet Distance is a measure of similarity between curves
that takes the location and ordering of the points along the curves into account. It can also
be used to calculate the difference between two distributions.
In the context of computer vision, and GAN evaluation in particular, the feature distance is computed using an Inception model pre-trained on the ImageNet dataset. The score is called the "Fréchet Inception Distance" because it uses activations from the Inception model to summarise each image.

Mathematically, the FID is the Fréchet distance between two multivariate normal distributions fitted to the embeddings, and is given by:

$$\mathrm{FID} = \lVert \mu_x - \mu_y \rVert^2 + \mathrm{Tr}\!\left(\Sigma_x + \Sigma_y - 2\left(\Sigma_x \Sigma_y\right)^{1/2}\right) \tag{2.5}$$

where X and Y are the real and fake embeddings (activations from the Inception model), assumed to follow two multivariate normal distributions. μ_x and μ_y are the means of X and Y, Tr is the trace of the matrix, and Σ_x and Σ_y are the covariance matrices of the embeddings. A lower FID value means better image quality and diversity [42].
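A minimal sketch of how equation (2.5) can be evaluated once the Inception activations of real and generated images are available (assuming two NumPy arrays of activations; this is not the exact implementation used by the cited works):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_acts, fake_acts):
    """FID between two sets of Inception activations of shape (num_images, feature_dim)."""
    mu_x, mu_y = real_acts.mean(axis=0), fake_acts.mean(axis=0)
    sigma_x = np.cov(real_acts, rowvar=False)
    sigma_y = np.cov(fake_acts, rowvar=False)

    diff = mu_x - mu_y
    covmean = sqrtm(sigma_x @ sigma_y)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):         # numerical noise can leave tiny imaginary parts
        covmean = covmean.real

    return diff @ diff + np.trace(sigma_x + sigma_y - 2.0 * covmean)
```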

In this section, the various GAN models discussed in this report are compared based on their FID scores.

Table 2.1: Comparison of GAN models based on FID

Models            FID      Dataset
CycleGAN [43]     61.5     Horse->Zebra (ImageNet)
Pix2Pix [43]      24.2     Edges->Shoes
StyleGAN [16]     4.43     FFHQ
StyleGAN2 [29]    2.84     FFHQ
BigGAN [17]       8.7      ImageNet
SAGAN [30]        18.65    ImageNet
It is difficult to compare the results of the GANs, as the functions these GANs perform, the outputs they generate and the datasets they have been trained on are different. On the basis of FID alone, StyleGAN2 has the best FID score of 2.84 when trained on the FFHQ dataset. Among the models trained on ImageNet, the most commonly used dataset, the best FID score was obtained by BigGAN.

Fig 2.16: Graph comparing FID Scores of GAN models

CHAPTER 3

METHODOLOGY

The goal of this project is to address and answer some questions related to the subject of AI Art (creating art using deep neural networks) through research on the topic. To answer these questions, a qualitative research approach was used. A qualitative approach is one in which other people's views and understanding of the issue are used to gain insight into the topic. This type of research helps the researcher develop their own ideas and helps the reader increase their understanding of the topic.

3.1 Implementation of VGG-19

The dataset used in the training of the model is the ImageNet dataset. ImageNet is a massive
collection of annotated images used in computer vision research. The dataset was created
with the intention of serving as a resource to aid in the study and development of better
computer vision technologies.
In this project, the following steps were used for training the VGG-19 model (a minimal feature-extraction sketch follows the list):
• The VGG-19 model is set up.
• The input image is passed through the successive convolution layers with different filters.
• The responses of the different filters are obtained as feature maps.
• The output feature maps are computed.
• The required data (activations) are extracted from the model.
• The activation function is applied.
• Down-sampling (pooling) layers are used to retain the main characteristics of the sample and reduce the number of parameters.
• A softmax regression classifier is used to achieve the final classification [44].
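As a minimal sketch of the feature-extraction step described above (not our full training code; the layer indices and the pretrained flag follow the standard torchvision VGG-19 layout and are assumptions):

```python
import torch
import torchvision.models as models

# Load a VGG-19 pre-trained on ImageNet and use it only as a fixed feature extractor
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_features(image, layer_ids=(0, 5, 10, 19, 28)):
    """Collect the activations (feature maps) of selected convolutional layers."""
    feats = []
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            feats.append(x)
    return feats

# A random 224 x 224 RGB tensor stands in for a preprocessed content or style image
features = extract_features(torch.randn(1, 3, 224, 224))
print([f.shape for f in features])
```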

3.2 Implementation of CycleGAN

We took advantage of an already existing model, the pix2pix model, with the addition of a loss function for training our CycleGAN model. Our model is trained on two different datasets. First, we used the unpaired Horse2Zebra dataset, which contains a total of 2,661 images; 2,401 of these images were used for training the model and 260 images were used for testing. For the second dataset, we used unpaired training data containing Monet paintings and photographs. This dataset contains 8,231 images in total, of which 7,359 and 872 images are split into training and test sets respectively [45]. In order to avoid overfitting, image augmentation techniques like random jittering and mirroring are applied on the training data in both cases. In CycleGAN, two generators and two discriminators are trained simultaneously, instead of one of each. One generator receives additional feedback from the other generator throughout the training phase. This feedback ensures that an image produced by a generator is cycle consistent, which means that applying both generators to an image sequentially should result in a similar image. The discriminators, as in ordinary GANs, are then used to determine whether the images generated by the matching generators are real or not. With the feedback of their respective discriminators, the generators improve their performance during this process [46]. We calculate the cycle-consistency loss in order to achieve unpaired image-to-image translation. Cycle consistency makes sure that the created image and the original input are close to each other; hence, the lower the value of the cycle-consistency loss, the better the model. For our model, we used a modified U-Net generator, and the generator and discriminator losses from the pix2pix model. For the first dataset, the model was trained for about 50 epochs, which took about 12 hours on Google Colab. For the second dataset, the model was trained for 70 epochs, which took about 16 hours on Google Colab. The results were obtained by running the model on the test datasets and are presented in the results section.
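As an illustration, the cycle-consistency term described above can be written as a simple mean absolute error between the original image and the image obtained after passing it through both generators (a sketch in TensorFlow, following the structure of the public pix2pix/CycleGAN tutorials; the generator names are placeholders, not our exact training code):

```python
import tensorflow as tf

LAMBDA = 10  # relative weight of the cycle-consistency term

def cycle_consistency_loss(real_image, cycled_image):
    """Mean absolute error between the original and the twice-translated image."""
    return LAMBDA * tf.reduce_mean(tf.abs(real_image - cycled_image))

# Training sketch for one direction (horse -> zebra -> horse):
#   fake_zebra   = generator_g(real_horse)   # X -> Y
#   cycled_horse = generator_f(fake_zebra)   # Y -> X
#   loss         = cycle_consistency_loss(real_horse, cycled_horse)
```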

3.3 Implementation of StyleGAN

In this project, StyleGAN2 with Adaptive Discriminator Augmentation (ADA) [47] has
been used to generate artistic images. StyleGAN2-ADA is an improved version of
StyleGAN that is notable for retaining the excellent quality results of previous works while
also greatly reducing the number of training images required. It's also faster to compute
than its predecessors. This advancement makes training high-quality output GANs much
more practical. Instead of having to create or obtain a dataset with tens of thousands of photos, it is now possible to get good results with just a few thousand or even less. For this
project, the Pytorch implementation of StyleGAN2-ADA [48] has been used as the Pytorch
implementation is slightly faster and less GPU intensive in comparison to the TensorFlow
implementation. WikiArt dataset has been used to train this model. The dataset contains
10,912 images of different types of artworks obtained from WikiArt [49]. All the images were of different dimensions, hence they have been resized to 1024 × 1024 resolution for uniformity before training the model. In order to reduce time and computation, this
project also employs transfer learning from a previously trained model. Transfer learning
is a training strategy that uses a previously trained version of the model developed on a
similar dataset to reduce training time. Rather than starting again, the model resumes
training with the previously trained model. This is especially useful for larger models that
require weeks of training in costly infrastructure [50]. It has also been noted that the images
formed by the model with transfer learning appear to be more polished than images created
by models that have been traditionally trained. In this project, the StyleGAN2-ADA model trained on the FFHQ faces dataset at 1024 pixel resolution has been used, and the model was then further trained on the WikiArt dataset, which also contains images at 1024 pixel resolution. The model was trained for 160 ticks on 640 kimg, which took about 82 hours of training on Google Colab Pro. The results obtained by the model, and how they evolved over time, are presented in the results section of the report.
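As a small sketch of the dataset preparation step mentioned above (resizing every WikiArt image to a uniform 1024 × 1024 resolution) using Pillow; the folder names are placeholders:

```python
from pathlib import Path
from PIL import Image

src, dst = Path("wikiart_raw"), Path("wikiart_1024")
dst.mkdir(exist_ok=True)

for path in src.glob("*.jpg"):
    img = Image.open(path).convert("RGB")          # normalise greyscale/alpha images
    img = img.resize((1024, 1024), Image.LANCZOS)  # high-quality resampling filter
    img.save(dst / path.name)
```

A plain resize ignores the aspect ratio; center-cropping to a square before resizing is a common alternative when distortion is a concern.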

3.4 Implementation of Text Guided Art Generation

For this project, CLIP along with StyleGAN has been used to perform text guided art
generation. The StyleGAN model generates an image and then CLIP measures how well
the image matches the text. Then, the model uses the feedback obtained from the CLIP
model to generate more “accurate” images. This iteration will be done many times until the
CLIP score becomes high enough and the generated image matches the text. Thus, the
model takes a prompt as input and then generates an art image that resembles the prompt.
StyleGAN2 model trained on the WikiArt dataset is used with CLIP for this project. The
model combined with CLIP is run for 300 optimization iterations for each prompt. Since the generated image depends on the given prompt, a lot of prompt engineering can also be done to obtain more specific and relevant results.

CHAPTER 4

TECHNOLOGIES USED

4.1 Python 3

Python is a high-level programming language, which means that much of its syntax is easily readable by humans. In order for a machine to understand the code, Python makes use of an interpreter. Python is also a dynamically typed language, which makes programming in Python easier because the programmer does not have to declare variable types; they are inferred by the interpreter depending on how they are used.

Fig 4.1: Logo of Python [51]

There are many features that make Python an excellent choice for machine learning: it is simple to code in, and it has a very vast collection of libraries (for example scikit-learn, NumPy, TensorFlow) which contain functions for nearly every task a programmer might need to perform, so development time is reduced and, in turn, productivity rises. Some of the libraries used in this project are discussed below (a small example showing a few of them working together follows the list):

• NumPy: This library allows us to work with multi-dimensional arrays as well as
matrices. It has inbuilt functions for most of the operations regarding matrices like
transpose and many mathematical functions like square root, sin, cos, etc.
• PyTorch: PyTorch is an open source machine learning (ML) framework based on the
Torch library and the Python programming language. It is one of the most popular deep
learning research platforms. The framework is designed to accelerate the transition
from research prototyping to implementation. PyTorch, like NumPy, works with
tensors, which are accelerated by graphics processing units (GPU). Tensors are
multidimensional data structures that may be modified and operated on through APIs.
Over 200 different mathematical operations are supported by the PyTorch framework.
• TensorFlow: TensorFlow is a Python library which is used for fast numerical computation. It is also used for creating deep learning models. This library is maintained by Google.
• Pillow: The Python Imaging Library enhances your Python interpreter's image
processing capabilities. This library supports a wide range of file formats, has an
efficient internal representation, and can do some image processing. The core image
library is built to provide quick access to data in a few basic pixel formats. It should
serve as a good foundation for an image processing tool in general.
• Time: The Python time module offers a variety of ways to express time in code,
including objects, integers, and texts. It also has features other than expressing time,
such as waiting during code execution and calculating code efficiency.
• Matplotlib: Matplotlib is a Python library that aids in data visualisation, analysis, and
interpretation using graphical, pictorial representations that may be replicated using the
matplotlib package. Matplotlib is a visualisation package that allows you to create
static, animated, and interactive visualisations.
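A small, self-contained example of a few of these libraries working together (the file name is a placeholder):

```python
import time
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

start = time.time()
img = np.asarray(Image.open("sample_art.png"))  # Pillow loads, NumPy exposes the pixels
print("Image shape:", img.shape)                # e.g. (height, width, 3)

plt.imshow(img)                                 # Matplotlib displays the image
plt.axis("off")
plt.show()
print(f"Elapsed: {time.time() - start:.2f} s")  # the time module measures execution time
```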

4.2 Google Colab

In terms of AI research, Google is active. Google spent years developing TensorFlow, an AI framework, and Collaboratory, a development platform. TensorFlow is now open-source, and Google has made Collaboratory free to use since 2017. Google Colab, or simply Colab, is the new name for Collaboratory.
The utilisation of GPU is another appealing feature that Google provides to developers.
Colab is a free application that supports GPU. Its software might become a standard in
academia for teaching machine learning and data science because of its public availability.
It might also have the long-term goal of establishing a client base for Google Cloud APIs,
which are sold on a per-use basis.

Fig 4.2: Google Colab Icon [52]

As a programmer, you may use Google Colab to do the following:

• Write and run Python code
• Create, upload, and share notebooks
• Import/export notebooks from Google Drive
• Import/publish notebooks from GitHub
• Import data from other sources, such as Kaggle
• Use PyTorch, TensorFlow, Keras, and OpenCV together
• Use cloud computing with a free GPU

Google Colab is a fantastic tool for learning Python and efficiently building machine
learning models. Even remotely, team members may exchange and edit notes
simultaneously. The notebooks can also be shared with the public by publishing them on
GitHub. Many prominent machine learning libraries are supported by Colab, including
PyTorch, TensorFlow, Keras, and OpenCV. The current limitation is that it does not

support R or Scala. Sessions and their size are likewise limited. These are minor sacrifices
to undertake when weighed against the advantages.

4.3 Google Drive

Google Drive is a cloud-based file storage and syncing service created by Google. Users
can use Google Drive to store data in the cloud (on Google's servers), sync files between
devices, and share files. Google Drive provides offline programmes for Windows and
macOS PCs, as well as Android and iOS smartphones and tablets, in addition to a web
interface. Google Drive provides customers with 15 GB of free storage with Google One.
Google One also provides 100 GB, 200 GB, and 2 TB storage space via optional premium
plans.

Fig 4.3: Google Drive Icon [53]

In our project, we used Google Drive to store and retrieve our models whenever a model had been training for a certain period of time and the Colab session was interrupted by a runtime disconnect. We also stored the datasets for a few of our models in Google Drive, and we used it to store our generated results as well.
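For reference, mounting Google Drive inside a Colab notebook, so that checkpoints, datasets and results can be read and written like local files, is typically done as follows (the folder paths are placeholders):

```python
from google.colab import drive

# Make Google Drive available inside the Colab runtime
drive.mount("/content/drive")

# Placeholder paths used to persist checkpoints and generated images
CHECKPOINT_DIR = "/content/drive/MyDrive/major_project/checkpoints"
RESULTS_DIR = "/content/drive/MyDrive/major_project/results"
```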

CHAPTER 5

RESULTS

5.1 Neural Style Transfer


The following results were obtained by training the model on the ImageNet dataset. The model was trained on Google Colab. Here, two images, a content image and a style reference image, are blended in such a way that the output image appears to be like the content image but is painted in the style of the reference image.

Style Content Output

Fig 5.1: Results Obtained for Neural Style Transfer

Fig 5.2: Input data for VGG Model

Fig 5.3: Image generated after training the model

The following results were obtained after training the model further. These results are better than the previously generated images.

Fig 5.4: Input data for VGG Model

Fig 5.5: Image generated after training the model

Here we have taken a content image of the Rome Colosseum and a style image of The Muse, a painting by Pablo Picasso. The generated image appears to be like the content image but is painted in the style of the reference image.

Fig 5.6: Input data for VGG Model

Fig 5.7: Image generated after training the model

Here we have taken a content image of Elon Musk and a style image of an artist's artwork. The generated image appears to be like the content image but is painted in the style of the reference image.

Fig 5.8: Input data for VGG Model

Fig 5.9: Image generated after training the model

Here we have taken a content image of the actor Robert Pattinson and a style image of Girl with a Mandolin, an analytic cubist painting by Pablo Picasso. The generated image appears to be like the content image but is painted in the style of the reference image.

Fig 5.10: Input data for VGG Model

Fig 5.11: Image generated after training the model

5.2 CycleGAN
5.2.1 Horse to Zebra Image-to-Image Translation

The following results were obtained by training the CycleGAN model on the Horse2Zebra dataset from TensorFlow, which was divided into train and test sets. The model was trained on Google Colab for 50 epochs, which took around 12 hours.

Fig 5.12: Generators after Training the CycleGAN Model on Horse2Zebra Dataset for 50
epochs

Fig 5.13: Discriminator after training the CycleGAN Model on Horse2Zebra
Dataset for 50 epochs

This is what the discriminator output looked like after training the model for 50 epochs. It started correctly differentiating between real and fake images.

Original Image Image Generated after 50 epochs

Original Image Image Generated after 50 epochs


Fig 5.14: Results obtained on Test Data after Training the Cycle GAN model
Horse2Zebra Dataset for 50 epochs.

We provide an input image (original image) from the test dataset to the model trained for 50 epochs, and the image generated with the help of the generator is presented here. The model successfully converts an image of a horse into an image of a zebra and vice versa.
5.2.2 Monet Painting to Photograph Image-to-Image Translation

The following results were obtained by training the CycleGAN model on the Monet2Photo dataset from TensorFlow, which was divided into train and test sets. The model was trained on Google Colab for 65 epochs, which took around 17 hours.

Fig 5.15: Generators after Training the CycleGAN Model on Monet2Photo Dataset for 65
epochs
Fig 5.16: Discriminator after training the Model Monet2Photo Dataset for 65 epochs

This is what the discriminator output looked like after training the model for 65 epochs on the Monet2Photo dataset. It started correctly differentiating between real and fake images.

Original Image Image Generated after 65 epochs

Original Image Image Generated after 65 epochs

Original Image Image Generated after 65 epochs

Original Image Image Generated after 65 epochs


Fig 5.17: Results obtained on Test Data after Training the Cycle GAN model
Monet2Photo Dataset for 65 epochs.

We provided an input image (original image) from the test dataset to the model trained for 65 epochs, and the image generated with the help of the generator is presented here. The model successfully converts a Monet painting into a photograph and vice versa.

5.3 StyleGAN
Since transfer learning is used in this project for training the model, the images generated initially are of faces, as the starting point is the StyleGAN2 model trained on the FFHQ dataset.

Fig 5.18: Initial generated images before training of model

We then give the WikiArt dataset images as input to this model so that the model starts training on the required art data and generating art images. A sample of the dataset is shown in the image below.

Fig 5.19: Input data for StyleGAN2 model

Now we start training the model on our dataset and observe how the generated images change. After 2 hours of training the model, it was observed that the generated images had slowly started changing into artwork and appeared like a fusion of the artworks and the face images. The figures below show how the generated images change with time.

Fig 5.20: Generated Images after 2 hours of training

Fig 5.21: Generated Images after 8 hours of training

Fig 5.22: Generated Images after 14 hours of training

Fig 5.23: Generated Images after 20 hours of training

Fig 5.24: Generated images after training the model for 48 hours

Fig 5.25: Generated images after training the model for 82 hours

5.4 StyleGAN+CLIP
The figures below show the results obtained when a prompt is given to CLIP and how CLIP
guides StyleGAN to generate the most appropriate image based on the prompt.

Initial generated image Image generated after 300 iterations


Fig 5.26: Result obtained when “City during a rainy night” was given as prompt

Initial generated image Image generated after 300 iterations


Fig 5.27: Result obtained when “A cold and rainy night” was given as prompt

Initial generated image Image generated after 300 iterations
Fig 5.28: Result obtained when “Fantasy Kingdom Deviantart” was given as prompt

Initial generated image Image generated after 300 iterations


Fig 5.29: Result obtained when “Cyberpunk city ArtstationHQ” was given as prompt

Initial generated image Image generated after 300 iterations
Fig 5.30: Result obtained when “Underwater Castle ArtstationHQ” was given as prompt

CHAPTER 6

CONCLUSION

A brief review of the current status and progress of creating art using deep neural networks was presented in this report to show the various areas of research it spans. A brief summary of machine learning and artificial intelligence was presented, and the relevance of Convolutional Neural Networks and Generative Adversarial Networks in the field of AI Art was discussed. The applications of AI Art and the various techniques, such as Neural Style Transfer, that are used in AI Art were also discussed and reviewed. The development of new GANs like CycleGAN, StyleGAN, etc., and how they are being used in AI Art as well as other fields of science and industry, is another cause for optimism.

Based on the research done and papers read, we can conclude that CycleGAN is thus far
the best model that has been able to achieve Neural Style Transfer. It outperforms the
conventional VGG-19 framework and is able to transfer the entire style of an artist instead
of the style of a single artwork. In the case of AI generated Art, various models like
AICAN, BigGAN, Conditional GAN, etc have been used however it is difficult to compare
the results of the GANs as the functions that these GANs perform, the output generated by
them and the datasets that they have been trained on is different. Based on the FID alone
the StyleGAN2 model generated the best results which were original and hard to
distinguish from human created art.

Based on the results of our research, we trained VGG-19 as well as CycleGAN models for
achieving Neural Style Transfer. Using VGG-19, we were able to showcase how the style
and content of images were mixing together to form artwork. Using CycleGAN, we were
able to generate an image of a Zebra using an image of a horse without having a paired
dataset. A few more applications of CycleGAN might include converting a sketch of a
suspect into a real life photograph or how a place might appear during a particular season
etc. Further, we were also able to generate the content images in the style of Monet’s
paintings thus showcasing that unlike VGG-19, CycleGAN is able to transfer the entire
style of an artist instead of just the style of a single painting.

We also trained the StyleGAN2 model on the WikiArt dataset for generating original
artwork. Using StyleGAN2, we were able to showcase how the model makes use of transfer
learning to change how it generates images and demonstrated how the face images that
were being originally generated slowly changed into artworks as the model trained on the
WikiArt dataset. Using StyleGAN2, we were able to generate completely AI made
artworks which were realistic and appeared similar to human made artworks. We observed
that the artworks being generated by the StyleGAN2 model were random and we wanted
more specific results. Thus, we used our StyleGAN2 model and combined it with CLIP to
perform text-guided art generation. We observed that the CLIP model was able to guide the image generation of StyleGAN2 very well, and we were able to obtain more relevant results which closely matched the prompt given to the model.

In conclusion, it can be said that the future of AI Art is very bright, and strongly connected
with the evolution and development of GANs. It is a field with a lot of potential but requires
a large amount of research and technological advancements as it is still in its early stages.

CHAPTER 7

FUTURE SCOPE

As the field of AI Art is still in its early stages, a large amount of research and technological
advancements can still be made in this field. Further refinement to the existing GAN
models can be made to further improve the results. Additionally, more types of GANs can
be developed with further research in this field. Significant advancements in the
development of multimodal generative models, such as models that can create images from
text, have recently been made. Art production and creativity will almost certainly be
influenced by technological advances in this direction. Because the concept of
multimodality is intrinsic to many art forms and has long played a significant part in the
creative process, models that can convert input from diverse modalities into a joint
semantic space constitute an appealing tool for artistic experimentation. Thus, further
research in multimodal generative models can also lead to significant developments in the
field of AI Art.

Due to the recent global pandemic, there has been a rapid shift of attention towards digital showrooms and online platforms, which has further contributed to the already rising interest in blockchain technologies and crypto art; these can have a huge impact and can significantly transform the art market. Interest in NFTs, especially digital-art-based NFTs, is on the rise, and AI-generated art can also be sold in this market.

REFERENCES
[1] Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and
prospects. Science, 349(6245), 255-260.
[2] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. nature, 521(7553), 436-
444.
[3] Walczak, S. (2018). Artificial neural networks. In Encyclopedia of Information Science
and Technology, Fourth Edition (pp. 120-131). IGI Global.
[4] Abraham, A. (2005). Artificial neural networks. Handbook of measuring system design.
[5] https://www.ibm.com/cloud/learn/neural-networks
[6] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017, August). Understanding of a
convolutional neural network. In 2017 International Conference on Engineering and
Technology (ICET) (pp. 1-6). IEEE.
[7] Crowley, E. J., & Zisserman, A. (2014, September). In search of art. In European
conference on computer vision (pp. 54-70). Springer, Cham.
[8] Cetinic, E., & Grgic, S. (2013, September). Automated painter recognition based on
image feature extraction. In Proceedings ELMAR-2013 (pp. 19-22). IEEE.
[9] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic
style. arXiv preprint arXiv:1508.06576.
[10] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using
convolutional neural networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 2414-2423).
[11] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556.
[12] https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/
[13] https://www.analyticsvidhya.com/blog/2021/06/build-vgg-net-from-scratch-with-
python/
[14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... &
Bengio, Y. (2014). Generative adversarial nets. Advances in neural information
processing systems, 27.

[15] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image
translation using cycle-consistent adversarial networks. In Proceedings of the IEEE
international conference on computer vision (pp. 2223-2232).
[16] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 4401-4410).
[17] Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high
fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
[18] https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-
adversarial-networks-gans-7a2264a81394
[19] https://neptune.ai/blog/gan-loss-functions
[20] Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative
adversarial networks. In International conference on machine learning (pp. 214-223).
PMLR.
[21] https://developers.google.com/machine-learning/gan/problems
[22] https://www.faceapp.com
[23] https://machinelearningmastery.com/what-is-cyclegan/
[24] https://blog.jaysinha.me/train-your-first-cyclegan-for-image-to-image-translation/
[25] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
[26] https://towardsdatascience.com/progan-how-nvidia-generated-images-of-
unprecedented-quality-51c98ec2cbd2
[27] https://towardsdatascience.com/explained-a-style-based-generator-architecture-for-
gans-generating-and-tuning-realistic-6cb2be0f431
[28] Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive
instance normalization. In Proceedings of the IEEE international conference on
computer vision (pp. 1501-1510).
[29] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing
and improving the image quality of stylegan. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 8110-8119).

[30] Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019, May). Self-attention
generative adversarial networks. In International conference on machine learning (pp.
7354-7363). PMLR.
[31] https://medium.com/syncedreview/biggan-a-new-state-of-the-art-in-image-synthesis-
cf2ec5694024
[32] Elgammal, A., Liu, B., Elhoseiny, M., & Mazzone, M. (2017). Can: Creative adversarial
networks, generating "art" by learning about styles and deviating from style norms.
arXiv preprint arXiv:1706.07068.
[33] https://www.educative.io/edpresso/what-is-a-conditional-gan-cgan
[34] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with
conditional adversarial networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 1125-1134).
[35] Matsumura, N., Tokura, H., Kuroda, Y., Ito, Y., & Nakano, K. (2018, November). Tile
art image generation using conditional generative adversarial networks. In 2018 Sixth
International Symposium on Computing and Networking Workshops
(CANDARW) (pp. 209-215). IEEE.
[36] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018).
AttnGAN: Fine-grained text to image generation with attentional generative adversarial
networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 1316-1324).
[37] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever,
I. (2021, July). Learning transferable visual models from natural language supervision.
In International Conference on Machine Learning (pp. 8748-8763). PMLR.
[38] https://openai.com/blog/clip/
[39] https://towardsdatascience.com/generating-images-from-prompts-using-clip-and-
stylegan-1f9ed495ddda
[40] https://machinelearningmastery.com/how-to-evaluate-generative-adversarial-networks/
[41] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X.
(2016). Improved techniques for training gans. Advances in neural information
processing systems, 29.

[42] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans
trained by a two time-scale update rule converge to a local nash equilibrium. Advances
in neural information processing systems, 30.
[43] Li, M., Lin, J., Ding, Y., Liu, Z., Zhu, J. Y., & Han, S. (2020). Gan compression:
Efficient architectures for interactive conditional gans. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 5284-5294).
[44] Yang, Z. (2021). Classification of picture art style based on VGGNET. In Journal of
Physics: Conference Series (Vol. 1774, No. 1, p. 012043). IOP Publishing.
[45] https://www.tensorflow.org/datasets/catalog/cycle_gan
[46] http://cs230.stanford.edu/projects_fall_2020/reports/55792990.pdf
[47] Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., & Aila, T. (2020). Training
generative adversarial networks with limited data. Advances in Neural Information
Processing Systems, 33, 12104-12114.
[48] https://github.com/NVlabs/stylegan2-ada-pytorch
[49] https://www.kaggle.com/c/painter-by-numbers
[50] Bozinovski, S. (2020). Reminder of the first paper on transfer learning in neural
networks, 1976. Informatica, 44(3).
[51] https://commons.wikimedia.org/wiki/File:Python-logo-notext.svg
[52] https://colab.research.google.com/
[53] https://tech.hindustantimes.com/how-to/organize-your-files-in-google-drive-on-
iphone-computer-android-here-is-how-to-71641109305422.html

APPENDIX

ABSTRACT
The innovations and advancements in the field of AI have motivated researchers to look into
the application of AI in several disciplines during the last few years. As a result, the field
of AI Art has emerged. This paper examines the advancements and breakthroughs in the
field of AI Art, as well as how deep neural networks are being used to create art. In this
paper, an analysis of two aspects of AI art has been done: first, how AI is being used for art analysis, and second, how AI is being used for creative purposes and generating artworks. In the context of AI-related research for art understanding, we present a comprehensive study of various machine learning tasks like classification, object detection, automatic painter detection based on painters' style, etc. In the context of the role of AI in creating art, a comprehensive study of various techniques such as Convolutional Neural Networks, Neural Style Transfer and Generative Adversarial Networks has been presented.
1. INTRODUCTION

Undoubtedly, one of the most trending fields of today’s day and age is Artificial
Intelligence (AI). Artificial Intelligence is a broad branch of Computer Science which deals
with the creation and development of intelligent or smart machines that are able to perform
tasks and make decisions which would require human intelligence. These smart machines
are able to learn from their experiences, draw conclusions based on the information they
have and adjust to new unseen input data to make logical decisions and perform human-
like tasks.

At first, Artificial Intelligence was only viewed as a research field and lots of computer
scientists did extensive research on AI and ML, which is a subfield of AI. Based on this
research, a lot of practical applications of AI and ML were recognized and hence the field
has progressed a lot and now, they are being used in almost all fields of study ranging from
Computer Science to Medicine. Thus, in the past two-three decades, the field of AI has
evolved from a field of study and research to a practical technology which has widespread
commercial use and is being used everywhere from chess-playing computers to self-driving
cars.

Another sub-field of AI which has seen a lot of progress and research is Deep Learning.
Deep Learning is a sub-field of Machine Learning which essentially consists of Neural
Networks which have various layers. These neural networks are designed to behave
similarly to the human brain, and so Deep Learning is used to solve complex problems and
hence neural networks are usually considered to give the best possible outcomes to these
complex problems. Due to the complex nature of Neural Networks and the many useful
applications of deep learning, significant research has been carried out in the field of deep
learning and different types of neural networks have been developed to tackle various types
of complex problems.

Thus, this extensive research in AI has prompted researchers to critically analyze and
discuss how AI and Machine learning can be applied in a more unconventional manner and
to look at the field from a newer, more creative perspective. One such field where AI is
being used is for the creation and understanding of art. As a result, a new kind of art named
artificial intelligence art (AiArt) has emerged, which is a creative activity that combines
art with technology by using AI as the core medium to create, express thoughts and
emotions. This can be done using Computer Vision. Computer Vision is an application of
AI in which Machine Learning and Deep learning is used to make observations and
important inferences from the input images, videos or any other visual inputs and to take
action or make recommendations based on that information.

Art can be described as a way to compose a complex interplay between content and style
of an image to show and demonstrate different ideas and concepts visually. Artists have
many different ideas and many different ways of representing these ideas. Thus, various
art pieces differ in style and complexity based on the artist that has created the art piece.
Due to the uniqueness, creativity and complexity involved in the process of creating art, it
was considered impossible for machines to mimic this process and create their own art.
However, developments in AI have made this process possible. ML and DL is used to
understand and recognize the various styles of these art images in order to create their own
style and generate their own art.

In this review paper, a study of the advancements made in the field of AI Art will be done, and the various algorithms and methods that are used in the field will be discussed, explored and compared by reviewing the already existing literature available on AI Art.
2. LITERATURE REVIEW

Machine learning studies how a machine may learn to accomplish something without being
specifically taught to do so. It is commonly used to anticipate outcomes based on existing
data or to categorise data into several labels. The study and development of machine learning have progressed rapidly in the past two to three decades, from study and research to a practical
technology in widespread commercial use. Within the field of artificial intelligence (AI),
ML is considered very popular and has been highly utilized for creation of software for
addressing various automation problems like natural language processing, computer
vision, object detection, speech recognition, classification problem, etc. Many developers
involved in development of AI applications now recognise that it’s much easier to train a
system on already available data than to manually program the algorithm by anticipating
the desired response. The impact of machine learning is also common in the computing
industry and various industries related to data-intensive topics, such as consumer services,
troubleshooting of complex systems, and supply chain control. It has a wide range of
influence in empirical sciences from biology to cosmology to social sciences, because ML
techniques have been developed to analyse powerful experimental data in novel ways.

A significant number of ML algorithms have been developed to cover a vast variety of data
and problem types represented in various machine learning problems. Conceptually, a
machine learning algorithm can be seen as a search in a large number of candidate program
spaces, which is guided by a training data. This experience from the training data is used
to optimize the performance of the program. Machine learning algorithms vary widely,
partly because of the way they represent candidate programs (for example, decision trees,
mathematical functions, and general programming languages), and partly because of the
way they search the programming space (for example, optimization algorithms with easy-
to-understand convergence).

As a research field, machine learning is at the intersection of computer science, statistics, data science and many other disciplines that are related to automatic improvement over
time. These disciplines are also used to make conclusions and decisions under uncertainty.
Some related research fields include Evolutionary research, adaptive management theory,
educational practice research, neurobiology, organizational behaviour, and economics.
Although communication with these other fields has increased in the past decade, we are
just beginning to use the potential synergies and the various formalisms and experimental
methods used in these many fields to study systems that improve with experience.[1]

In contrast, traditional machine-learning approaches were limited in their capacity to analyse natural data in its raw form. Creating a feature extractor that converted raw data
into a feature vector that the learning subsystem, usually a classifier, could use to discover
or categorise patterns in the input required meticulous engineering and deep domain
knowledge. As a result, advances in the field of Deep Learning have been made. Deep
Learning is a subfield of Machine Learning which makes use of multiple-level
representation-learning approaches acquired by building basic but non-linear modules that
successively change the representation at one level (beginning with the raw input) into a
representation at a higher, slightly more abstract level. Very complex functions can be
learned by combining enough of these modifications. Deep learning methods utilize Neural
Networks with various layers to perform these complex transformations and functions [2].

Fig 1: AI vs ML vs Deep Learning


2.1 Artificial Neural Networks

Artificial neural networks (also simply known as neural networks) are a pattern-oriented approach that can be used to solve classification and time-series problems. Due to
the nonparametric nature of neural networks, models can be created without prior
knowledge of the data distribution or effects caused between variables due to possible
interactions, which is required by most parametric statistical methods. The process used in
neural networks is similar to the way the human brain operates and hence neural networks
are usually considered to give the best possible outcome. A neural network consists of
various nodes where each node is similar to neurons present in our brain. Each node is
connected to the next node through weights which are calculated during the execution of
the machine learning algorithm [3].

The basic architecture of neural networks consists of 3 types of neuron layers: input,
hidden, and output layers. In feed-forward networks, the data moves in a feed-forward
direction from the input units to the output units. A value is calculated for each node based
on the value of the previous node and this process of calculation and storing of variables
from the input layer to the output layer is known as forward propagation. Feed-forward
network do not contain any feedback connections and hence data flows only in the forward
direction. Recurrent networks contain feedback connections and thus allow
backpropagation which is a method for calculating the gradient of neural network
parameters and in the method the network is traversed in the backwards order from output
layer to the input layer. A neural network must be built up in such a way that when a set of
inputs is applied, the desired set of outputs is produced. The strength of the connections
can be determined using a variety of approaches. One method is to use a priori information
to explicitly set the weights. Another method is to train the neural network by giving it
training patterns and allowing it to adjust its weights based on some learning rule [4].
Fig 2: Deep Neural Network Architecture

2.2 Convolution Neural Network

The Convolutional Neural Network (CNN) is one of the most widely used deep neural
networks. Convolution can be described as a mathematical linear action performed
between matrices. Convolutional layer, non-linearity layer, pooling layer, and fully-
connected layer are some of the layers of CNN. Pooling and non-linearity layers do not
have parameters, whereas convolutional and fully connected layers do [5]. An important
characteristic of CNN is that it is able to pre-process data by itself thus saving time and
resources in the data pre-processing part. The CNN might require a bit of hand engineering
of the features in the beginning but as the machine learning algorithm progresses, the CNN
is able to adapt and learn these new features and develop its own filters. Thus, the CNN is
continuously evolving with growth in data. A CNN works well for data that is represented
as grid structures, hence it works well for pattern recognition and image classification
problems as an image is nothing but a matrix of pixel values. Since the data present in an
image is in the form of a grid structure, by applying various filters the CNN is able to
capture the spatial and temporal dependencies of the image and thus can be trained to
understand and finely characterise the input image. Thus, the role of a CNN is to reduce
the image into a form that is simpler to process without sacrificing features that are critical
to obtaining an accurate result and since the CNN is reducing the image into a simpler
form, it can easily be scaled to very large datasets as well.
Thus, the working of a CNN is as follows:
• Convolution layer: This layer computes the dot product between the weights of the
neuron and the region of the input image that are related.
• Non-Linearity Layer: The non-linearity can be used to adjust or cut-off the generated
output. This layer is applied in order to saturate the output or limiting the generated
output. [5]
• Pooling Layer: The pooling layer will down sample the spatial dimensions of the
image. Thus, it reduces the amount of computation to be performed and the number of
features to learn.
• Fully Connected Output layer: The fully connected output layer is the classifier layer
and gives us the final output by classifying the images into the required categories.

Since CNN is widely used for image classification, there has been a significant amount of
research using CNN in the field of AI art as well. Crowley et al. (2014) [6] demonstrated
in their research paper that object classifiers, learnt using CNN features computed from
various natural image sources, can retrieve paintings containing these objects. They were
not only able to detect the objects, but were also able to make annotations on the objects to
show how these objects in the paintings have evolved over time.

After this paper, Various advancements were made in object detection and various different
problems like identifying positions of objects, face recognition in paintings, analysing
gender of faces in paintings, etc were also addressed. Researchers thus realised that not
only objects, but the style of the image could also be detected using CNNs and these
advancements further supported the rising interest in AI art. Eva Cetinic et al. (2013) [7]
proposed an approach for automated classification of paintings by artist. In their model, the
individual style of an artist is recognized by analysing specific components of a painting
which distinguishes the work of an individual from the works of others. Once the style of
the artists has been recognized, their various paintings can be automatically classified.
The research paper by Gatys et al. (2015) [8][9] was one of the first papers to introduce Neural Style Transfer, which triggered the rapid use and development of AI in the field of art. In this paper, the authors present an artificial neural system which makes use of VGG-
19 [10]. This work is the first to show how image attributes may be used to separate content
from style in natural images.

Fig 3: Results obtained by Neural Style Transfer Using VGG Network[10]

Another such revolutionary innovation in the field of AI art was the invention of Generative
Adversarial Networks (GAN). GANs were introduced by Goodfellow et al. (2014) [11]
and constitutes a significant milestone in the effort to use machines to create new visual
content. A GAN's main mechanism is to train two "competing" models: a generator and a
discriminator, which are commonly implemented as neural networks. The generator's
purpose is to generate realistic images by capturing the distribution of actual examples in
the input sample, while the discriminator is trained to categorise generated images as fake
and real images from the original sample as real. The optimization process terminates at a
saddle point that is a minimum in reference to the generator and a maximum in relation to
the discriminator, as it is designed as a minimax optimization problem. This framework's
implementation produced outstanding results in terms of creating plausible false variants
of actual images for many sorts of image content. GAN quickly rose to prominence as one
of the most important areas of artificial intelligence research, with several advanced and
domain-specific versions of the original design emerging., e.g. CycleGAN [12], StyleGAN
[13] or BigGAN [14].

Fig 4: GAN Architecture

CycleGAN was proposed by Zhu et al. (2017) [12] In this paper, the authors present a
system that can learn to capture the special characteristics of one image collection and
figure out how these characteristics could be translated into the other image collection, all
in the absence of any paired training examples. The key finding of this paper was that the
CycleGAN model can be used to transfer style of the image. Unlike Neural Style Transfer,
however, the CycleGAN learns to emulate the style of a complete collection of artworks
(e.g., Van Gogh), rather than a single piece of art (e.g. Starry Night). The authors were also
able to do the reverse of the process, i.e. converting a painting to a real image by making
use of the CycleGAN model.

Karras et al. (2019) [13] proposed a style-based GAN architecture (StyleGAN). An alternative generator architecture for generative adversarial networks is proposed in the
paper. Based on results obtained in the paper, it is becoming clear that the traditional GAN
generator architecture is in every way inferior to a style-based design. The new architecture
leads to an automatically learned, unsupervised separation of high-level attributes and
stochastic variation in the generated images, and it enables intuitive, scale-specific control
of the synthesis. Thus, the style GAN changes the architecture of the generator
significantly.

Fig 5: Style GAN Component Diagram [15]

Modifications to both the architecture of the model and the methods of training in
StyleGAN were proposed by Karras et al. (2020) [16] in their paper. In order to improve
conditioning in the mapping from latent codes to pictures, they revised the generator
normalisation, revisited progressive growing, and regularised the generator. Thus, this
improves the image quality generated by StyleGAN and is known as StyleGAN2.

Brock et al. (2018) [14] proposed BigGAN, which is based on the
concept of the Self-Attention GAN model (SAGAN) [17]. They demonstrated that GANs
trained to model natural images of multiple categories highly benefit from scaling up, both
in terms of fidelity and variety of the generated samples. As a result, the model sets a new
level of performance among ImageNet GAN models, improving on the state of the art by
a large margin. They have also presented an analysis of the training behaviour of large
scale GANs, characterized their stability in terms of the singular values of their weights,
and discussed the interplay between stability and performance.

To enhance the GAN technology's creative content generation capabilities even further,
Elgammal et al. [18] have introduced AICAN, an artificial intelligence creative adversarial
network. They suggest in their study that if a GAN model is trained on paintings, it will
only learn how to make images that look like existing art, and that this, like the Neural
Style Transfer technique, would not produce anything truly artistic or innovative. The
model they have proposed generates art in a more creative manner by learning the style of
paintings and then deviating away from the learned style to make the artwork creative.

In their paper, Matsumura et al. (2018) [19] have proposed a tile art image generation
method which makes use of machine learning and is based on the conditional GAN. In a
conditional GAN, a source image is received as an additional input to the GAN. In the
proposed method, the authors are directly producing a tile art image from an input image
using a conditional GAN. The tile art image is generated using forward propagation. The
architecture of the network in the proposed method is modelled on the pix2pix method with
the conditional generative adversarial networks [20].
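
As a rough illustration of this conditioning, the sketch below replaces the pix2pix U-Net generator and PatchGAN discriminator with single-convolution stand-ins; it only shows how the source image is fed to the generator and how the discriminator judges (source, target) pairs concatenated along the channel axis.

```python
import torch
import torch.nn as nn

# Stand-in networks; real pix2pix uses a U-Net generator and a PatchGAN discriminator.
generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)
discriminator = nn.Conv2d(6, 1, kernel_size=4, stride=2)   # takes a 6-channel (pair) input

source = torch.randn(1, 3, 256, 256)    # input photograph (the condition)
target = torch.randn(1, 3, 256, 256)    # ground-truth tile art image
generated = generator(source)            # tile art image produced by forward propagation

# The condition is concatenated with the (real or generated) output along the channel
# axis, so the discriminator classifies pairs rather than isolated images.
real_pair = torch.cat([source, target], dim=1)
fake_pair = torch.cat([source, generated], dim=1)
d_real, d_fake = discriminator(real_pair), discriminator(fake_pair)
```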

Xu et al. (2018) [21] have proposed an Attentional Generative Adversarial Network
(AttnGAN) that allows multi-stage, attention-driven refinement for fine-grained text-to-
image generation. By paying attention to the important words in the natural language
description, the AttnGAN can synthesise fine-grained details in different sub-regions of
the image using a novel attentional generative network. In addition, for training
the generator, a deep attentional multimodal similarity model is presented to compute a
fine-grained image-text matching loss. Visualising the AttnGAN's attention layers allows
for a more detailed analysis of the model. This is the first time a layered attentional GAN
has been shown to be capable of automatically selecting word-level conditions for
producing distinct parts of the image.
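
The word-level attention at the heart of this refinement can be illustrated with a small sketch. The tensor shapes and the plain dot-product similarity below are simplifying assumptions chosen for clarity; they do not reproduce the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

# Each image sub-region attends over the word features and receives a
# word-context vector that guides the refinement of that region.
batch, dim, num_words, num_regions = 2, 256, 12, 64
word_feats = torch.randn(batch, dim, num_words)      # word embeddings from a text encoder
region_feats = torch.randn(batch, dim, num_regions)  # image sub-region features

# Similarity between every region and every word, softmax-normalised over words.
scores = torch.bmm(region_feats.transpose(1, 2), word_feats)   # (batch, regions, words)
attn = F.softmax(scores, dim=-1)

# Word-context vector per region: attention-weighted sum of the word features.
context = torch.bmm(word_feats, attn.transpose(1, 2))          # (batch, dim, regions)
```
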
3. METHODOLOGY

The goal of this research paper was to address and answer some questions related to the
subject of AI Art (Creating Art using Deep Neural Networks) through research on the topic.
To answer these questions, a qualitative approach of research was used in the research
paper. A qualitative approach is a type of research approach in which other people's views
and understanding of the issue are used to gain insight into the topic. This particular type of
research helps the researcher to develop their own ideas and helps the reader to increase
their understanding of the topic.

In the case of this research paper, previously published research papers were used to obtain
information and collect data related to the topic. Previously published research papers
provided a more technical and in-depth view of the topic and helped the researcher to
understand the different views of the scientists and researchers working on the topic.
The studies and research of other scientists help the reader to get an idea of the current
state of the topic and to gain insight into it. This approach has helped us to obtain answers
to many of the questions raised above and is an efficient method of obtaining information.

This particular method of research was decided upon as it was felt to be a more efficient
and reliable way of conducting the research. The information obtained through the said method
is based on proper experimentation and research and is very factual and reliable. Also, the
information obtained is more thorough, concise, technical and provides a clear view of the
subject. It also helps to understand the progress of the topic and how its understanding and
techniques have changed over the years and how the subject has evolved. It also provides
many different views of the same topic and helps the reader to look at the topic from various
different perspectives and provides a complete overview of the topic.

Based on the papers read and the research conducted, various types of deep neural network
models like VGG and types of GANs were identified that were successfully able to create
AI Art. The working of these models is discussed in this part of the paper.
The first model that was used to perform Neural Style Transfer was the VGG model [10].
It is a CNN that rivals human performance on a common visual object recognition
benchmark test. This model uses the feature space provided by the 16 convolutional and 5
pooling layers of the 19-layer VGG network. The key idea behind the algorithm is to
iteratively optimise an image with the objective of matching desired CNN feature
distributions, which involve both the photo's content information and the artwork's style
information. The concept separates image content from style, allowing any image's content
to be recast in any other image's style. This is demonstrated by the generation of innovative,
beautiful visuals that combine the style of numerous well-known paintings with the content
of a randomly chosen photograph. In particular, the feature responses of high-performing
deep neural networks trained on object recognition are used to create neural representations
for the content and style of an image. [9]
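
A condensed sketch of this iterative optimisation is given below, assuming the pre-trained VGG-19 shipped with torchvision. The selected layers, the iteration count and the style weight are illustrative assumptions rather than the exact settings used in the original paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor used only to measure content and style.
features = vgg19(pretrained=True).features.eval()
for p in features.parameters():
    p.requires_grad_(False)

def extract(x, layers=(0, 5, 10, 19, 28)):
    """Collect activations from a few convolutional layers (illustrative choice)."""
    outs = []
    for i, layer in enumerate(features):
        x = layer(x)
        if i in layers:
            outs.append(x)
    return outs

def gram(feat):
    # Gram matrix of a feature map: the style representation of that layer.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

content_img = torch.rand(1, 3, 256, 256)   # stand-in photograph
style_img = torch.rand(1, 3, 256, 256)     # stand-in artwork
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([generated], lr=0.02)

for step in range(200):
    gen_f, con_f, sty_f = extract(generated), extract(content_img), extract(style_img)
    content_loss = F.mse_loss(gen_f[-1], con_f[-1])   # match deep content features
    style_loss = sum(F.mse_loss(gram(g), gram(s)) for g, s in zip(gen_f, sty_f))
    loss = content_loss + 1e4 * style_loss            # style weight is a tunable assumption
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Here the generated image itself is the trainable parameter, which is why every new style transfer requires running the optimisation from scratch rather than a single forward pass.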

CycleGAN is a type of GAN that has been used to achieve Neural Style Transfer.
CycleGAN is able to learn the special characteristics of one image collection and to
translate them into another image collection without having any paired training examples. The
developers of this model first compared it to existing methods for unpaired image-to-image
translation using paired datasets with input-output pairs. The relevance of the cycle
consistency loss and the adversarial loss was then investigated, and their full method was
compared against several ablated variants. Finally, they demonstrated the applicability of their
technique in a variety of applications that do not require paired data [12]. One such
application where CycleGAN has been used is FaceApp in which human faces are
transformed into different age groups.[22]
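
The cycle-consistency constraint that makes this unpaired training possible can be written very compactly; in the sketch below the two generators G: X -> Y and F: Y -> X are replaced by single-convolution stand-ins purely for illustration.

```python
import torch
import torch.nn as nn

# Stand-in generators; the real CycleGAN generators are deep residual networks.
G = nn.Conv2d(3, 3, kernel_size=3, padding=1)    # X -> Y (e.g. photo -> painting)
F_ = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # Y -> X (e.g. painting -> photo)
l1 = nn.L1Loss()

real_x = torch.randn(1, 3, 256, 256)   # sample from the photo collection
real_y = torch.randn(1, 3, 256, 256)   # sample from the painting collection

# Translating forward and back should reproduce the original image in each direction.
cycle_loss = l1(F_(G(real_x)), real_x) + l1(G(F_(real_y)), real_y)
# The full objective adds adversarial losses for G and F against two discriminators.
```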

In StyleGAN [13], the generator no longer takes a point from the latent space as input;
instead, there are two new sources of randomness used to generate a synthetic image: a
standalone mapping network and noise layers. The latent space vector is passed through a
mapping network comprising 8 fully connected layers, whereas the synthesis
network comprises 18 layers, with resolutions growing progressively from 4 x 4 to 1024 x
1024. The output layer produces an RGB image through a separate convolution layer. The
output of the mapping network is a vector that defines the styles, and this vector is
integrated at each point in the generator model via a new layer called adaptive instance
normalization. The use of this style vector gives control over the style of the generated
image [23].
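
A simplified version of the adaptive instance normalization step is sketched below. In the actual generator the per-channel scale and bias come from learned affine transformations of the mapping network's output; here they are random stand-ins.

```python
import torch

def adain(features, style_scale, style_bias, eps=1e-5):
    # Normalise each channel of the feature map, then re-style it with the
    # scale and bias derived from the style vector.
    mean = features.mean(dim=(2, 3), keepdim=True)
    std = features.std(dim=(2, 3), keepdim=True) + eps
    return style_scale * (features - mean) / std + style_bias

features = torch.randn(1, 512, 8, 8)     # intermediate synthesis-network activations
w = torch.randn(1, 512)                  # style vector from the mapping network (stand-in)
scale = w.view(1, 512, 1, 1) + 1.0       # stand-ins for the learned affine outputs
bias = torch.zeros(1, 512, 1, 1)
styled = adain(features, scale, bias)
```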

In BigGAN [14], the authors implemented two architectural adjustments to boost
scalability while also increasing conditioning by using orthogonal regularisation on the
generator. The orthogonal regularisation applied to the generator makes the model
accessible to the "truncation technique," which allows for fine control of the fidelity-
variation trade-offs by truncating the latent space. In terms of stability, the authors
identified and characterised instabilities particular to large-scale GANs, then devised
strategies to mitigate the instabilities – but at a substantial performance cost. BigGAN can
achieve an Inception Score (IS) of 166.3 when trained on the ImageNet dataset at 128 X
128 resolution, which is a more than 100 percent improvement over the previous state of
the art (SotA) record of 52.52. In addition, the Frechet Inception Distance (FID) score has
been improved from 18.65 to 9.6. BigGAN outperformed the previous SotA on ImageNet
at 256 X 256 and 512 X 512 resolutions, in addition to its performance improvement at
128 X 128 resolutions. [24]
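
The truncation trick mentioned above amounts to restricting the magnitude of the entries of the sampled latent vector; a minimal sketch with an illustrative threshold is shown below.

```python
import torch

def truncated_normal(shape, threshold=0.5):
    # Resample any latent entries whose magnitude exceeds the threshold,
    # trading sample variety for fidelity.
    z = torch.randn(shape)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))

z = truncated_normal((4, 128))   # a truncated latent batch for higher-fidelity samples
```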

AICAN is a model that can be used to generate art in a more creative manner. In their
paper, the authors propose changing the optimization criterion to allow the network to
develop new art by increasing divergence from established styles while remaining within
the art distribution. This deviation ensures that the generated artwork seems original and is
different from existing artistic styles. Authors of the AICAN system also demonstrated that
viewers were frequently unable to distinguish between AICAN-generated images and
artworks created by a human artist through a series of exhibitions and experiments. Thus,
we can state that the results generated by the AICAN model are novel and are
indistinguishable from human made artwork.[18]
4. COMPARATIVE ANALYSIS

The generator model of a GAN is trained using a second model called a discriminator that
learns to classify images as real or fake, unlike conventional deep learning neural network
models that are trained with a loss function until convergence. To ensure equilibrium, both
the generator and discriminator models are trained concurrently. As a result, there is no
objective loss function used to train GAN generator models, and there is no way to
objectively assess the training progress and the relative or absolute quality of the model
based on loss alone.

Therefore, it becomes really difficult to quantify and compare the performance of GAN
models using conventional metrics. Thus, to compare the performance of GAN models,
metrics like Fréchet Inception Distance (FID) and Inception Score (IS) are generally used.

Inception Score (IS)

The Inception Score (IS) is a metric used for assessing the quality of generated images,
particularly synthetic images produced by GAN models. To classify the generated images,
the inception score employs a pre-trained deep learning neural network model for image
classification. The model is used to classify a large number of generated images, predicting
for each image the probability of it belonging to each class. The inception score is obtained
by combining these predictions. The goal of the score is to capture two characteristics of a set
of created images:
• The quality of the generated images
• The diversity of the generated images

The inception score has a lowest value of 1.0 and a highest value equal to the number of
classes supported by the classification model. A higher Inception Score indicates better
performance of the GAN. [25]
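
Given the class-probability predictions for a set of generated images, the score itself is a simple computation, as the sketch below illustrates with random stand-in predictions in place of real Inception-v3 outputs.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # IS = exp( mean over images of KL( p(y|x) || p(y) ) )
    marginal = probs.mean(axis=0, keepdims=True)                 # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Example: 1000 "generated images" classified over 10 classes (random stand-ins).
logits = np.random.rand(1000, 10)
probs = logits / logits.sum(axis=1, keepdims=True)
print(inception_score(probs))   # close to 1.0 here; confident and diverse predictions score higher
```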

Fréchet Inception Distance (FID)

This is one of the most widely used metrics for comparing real and generated images in
terms of feature distance. The Fréchet Distance is a measure of similarity between curves
that takes the location and ordering of the points along the curves into account. It can also
be used to calculate the difference between two distributions.
In the context of computer vision, particularly GAN evaluation, this feature distance is
computed using an Inception model pre-trained on the ImageNet dataset. The score is called
"Fréchet Inception Distance" because it uses activations from the Inception model to
summarise each image.

Mathematically, the distance between the two multivariate normal distributions is calculated
by making use of the Fréchet Distance. The FID between two multivariate normal
distributions is given by:

FID = \|\mu_x - \mu_y\|^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2\,(\Sigma_x \Sigma_y)^{1/2}\right)        (1)

where X and Y are the real and fake embeddings (activations from the Inception model),
assumed to follow two multivariate normal distributions; μ_x and μ_y are the means of X
and Y; Tr is the trace of the matrix; and Σ_x and Σ_y are the covariance matrices of the
embeddings. A lower FID value means better image quality and diversity. [26]
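
Equation (1) translates directly into code. The sketch below uses random stand-in activations with a reduced feature dimension; in practice the activations are the 2048-dimensional pool3 features of an Inception network evaluated on the real and generated images.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(act_real, act_fake):
    # Fit a Gaussian to each set of activations and apply Equation (1).
    mu_x, mu_y = act_real.mean(axis=0), act_fake.mean(axis=0)
    sigma_x = np.cov(act_real, rowvar=False)
    sigma_y = np.cov(act_fake, rowvar=False)
    covmean = sqrtm(sigma_x @ sigma_y)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(sigma_x + sigma_y - 2.0 * covmean))

act_real = np.random.randn(500, 64)         # stand-in activations for real images
act_fake = np.random.randn(500, 64) + 0.1   # a small shift so the two sets differ
print(frechet_inception_distance(act_real, act_fake))   # lower is better
```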

For this comparative study, we will be comparing the GANs that have been discussed in
this paper on the basis of the FID score of each model.

Models            FID      Dataset
CycleGAN [27]     61.5     Horse -> Zebra (ImageNet)
Pix2Pix [27]      24.2     Edges -> Shoes
StyleGAN [13]     4.43     FFHQ
StyleGAN2 [16]    2.84     FFHQ
BigGAN [14]       8.7      ImageNet
SAGAN [17]        18.65    ImageNet

Table 1: Comparison of GAN models based on FID

It is difficult to compare the results of the GANs directly, as the functions that these GANs
perform, the outputs they generate and the datasets that they have been trained on are different.
On the basis of FID alone, we can say that StyleGAN2 has the best FID score of 2.84 when
trained on the FFHQ dataset. Out of the models trained on the ImageNet dataset, which is
the most commonly used dataset, the best FID score was obtained by BigGAN.

Fig 6: Graph comparing FID Scores of GAN models


5. CONCLUSION

A brief review of the current status and progress of creating art using deep neural
networks was presented in this research paper to show the various areas of research it
spans. A brief summary of Machine Learning and Artificial Intelligence was presented,
and the relevance of Convolutional Neural Networks and Generative Adversarial Networks
in the field of AI Art was discussed. The applications of AI Art and the various techniques
used in it, such as Neural Style Transfer, were also discussed and reviewed. The continued
development of GANs such as CycleGAN and StyleGAN, and their use in AI Art as well as
in other fields of science and industry, is another reason for optimism.

Based on the research done and the papers read, we can conclude that CycleGAN is thus far
the best model that has been used to achieve Neural Style Transfer. It outperforms the
conventional VGG-based framework and is able to transfer the entire style of an artist
instead of the style of a single artwork. In the case of AI-generated art, various models like
StyleGAN, BigGAN and Conditional GAN have been used; however, the AICAN model
generated the best results, which were original and hard to distinguish from human-created
art.

In conclusion, it can be said that the future of AI Art is very bright, and strongly connected
with the evolution and development of GANs. It is a field with a lot of potential but requires
a large amount of research and technological advancements as it is still in its early stages.
6. FUTURE WORKS

As the field of AI Art is still in its early stages, a large amount of research and technological
advancements can still be made in this field. Further refinement to the existing GAN
models can be made to further improve the results. Additionally, more types of GANs can
be developed with further research in this field. Significant advancements in the
development of multimodal generative models, such as models that can create images from
text, have recently been made. Art production and creativity will almost certainly be
influenced by technological advances in this direction. Because the concept of
multimodality is intrinsic to many art forms and has long played a significant part in the
creative process, models that can convert input from diverse modalities into a joint
semantic space constitute an appealing tool for artistic experimentation. Thus, further
research in multimodal generative models can also lead to significant developments in the
field of AI Art.

Due to the current global pandemic, there has been a rapid shift of attention towards digital
showrooms and online platforms, which has further contributed to the already rising interest
in blockchain technologies and crypto art; these can have a huge impact on, and significantly
transform, the art market. The interest in NFTs, especially digital-art-based NFTs, is on the
rise, and AI-generated art can also be sold in this market.
