Group 27: Creating Art from Existing Images Using Deep Neural Network Models
Submitted to Amity School of Engineering and Technology, Amity University Uttar Pradesh, Noida, in partial fulfilment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering
By Karmanya Mendiratta, Vaibhav Garg and Shail Verma
Faculty Guide: Dr. Subhash Chand Gupta
DECLARATION
We, Karmanya Mendiratta, Vaibhav Garg and Shail Verma, students of B. Tech (CSE), hereby declare that the project titled “CREATING ART FROM EXISTING IMAGES USING DEEP NEURAL NETWORK MODELS”, which is submitted by us to Amity School of Engineering and Technology, Amity University Uttar Pradesh, Noida, in partial fulfilment of the requirements for the award of the degree of B. Tech (CSE), has not previously formed the basis for the award of any degree, diploma or other similar title or recognition.
We hereby declare that we have gone through the project guidelines, including the policy on health and safety, the policy on plagiarism, etc.
CERTIFICATE
To the best of my knowledge this work has not been submitted in part or full for any Degree
or Diploma to this University or elsewhere.
Noida
Date:
ACKNOWLEDGEMENT
We take this opportunity to express our profound sense of gratitude and respect to all those who helped us throughout this project. This report is a testament to the intense effort and technical competence of everyone who has contributed to it; it would have been almost impossible to complete this project without their support. We would like to thank Dr. Sanjeev Thakur, HOD of the CSE Department, Amity University, for providing us with the opportunity to do this project. We would also like to thank our faculty guide, Dr. Subhash Chand Gupta, whose constant support and guidance resulted in the successful completion of this project. We would also like to thank our batch mates and families, who helped us throughout this time and motivated us to complete the project.
TABLE OF CONTENTS
1. Declaration i
2. Certificate ii
3. Acknowledgement iii
4. Abstract vii
5. Chapter 1: Introduction 1
6. Chapter 2: Literature Review 4
7. Chapter 3: Methodology 34
8. Chapter 4: Technologies Used 37
9. Chapter 5: Results 41
12. References 62
LIST OF TABLES
LIST OF FIGURES
5.8 Input data for VGG Model 45
5.9 Image generated after training the model 45
5.10 Input data for VGG Model 46
5.11 Image generated after training the model 46
5.12 Generators after training the CycleGAN model on Horse2Zebra dataset for 50 epochs 47
5.13 Discriminator after training the CycleGAN model on Horse2Zebra dataset for 50 epochs 48
5.14 Results obtained on test data after training the CycleGAN model on Horse2Zebra dataset for 50 epochs 48
5.15 Generators after training the CycleGAN model on Monet2Photo dataset for 65 epochs 49
5.16 Discriminator after training the CycleGAN model on Monet2Photo dataset for 65 epochs 50
5.17 Results obtained on test data after training the CycleGAN model on Monet2Photo dataset for 65 epochs 50-51
5.18 Initial generated images before training of model 52
5.19 Input data for StyleGAN2 model 52
5.20 Generated images after 2 hours of training 53
5.21 Generated images after 8 hours of training 53
5.22 Generated images after 14 hours of training 54
5.23 Generated images after 20 hours of training 54
5.24 Generated images after training the model for 48 hours 55
5.25 Generated images after training the model for 82 hours 55
5.26 Result obtained when “City during a rainy night” was given as prompt 56
5.27 Result obtained when “A cold and rainy night” was given as prompt 56
5.28 Result obtained when “Fantasy Kingdom Deviantart” was given as prompt 57
5.29 Result obtained when “Cyberpunk city ArtstationHQ” was given as prompt 57
5.30 Result obtained when “Underwater Castle ArtstationHQ” was given as prompt 58
ABSTRACT
The innovations and advancements in the field of AI have motivated researchers to look into the application of AI in several disciplines during the last few years. As a result, the field of AI Art has emerged. This project examines the advancements and breakthroughs in the field of AI Art, as well as how deep neural networks are being used to create art. Two aspects of AI Art are analysed: firstly, how AI is being used for art analysis, and secondly, how AI is being used for creative purposes and for generating artworks. In the context of AI-related research for art understanding, we present a comprehensive study of various machine learning tasks such as classification, object detection, and automatic painter identification based on painters’ styles. In the context of the role of AI in creating art, a comprehensive study of techniques such as Convolutional Neural Networks, Neural Style Transfer and Generative Adversarial Networks is presented.
CHAPTER 1
INTRODUCTION
Undoubtedly, one of the most trending fields of today’s day and age is Artificial
Intelligence (AI). Artificial Intelligence is a broad branch of Computer Science which deals
with the creation and development of intelligent or smart machines that are able to perform
tasks and make decisions which would require human intelligence. These smart machines
are able to learn from their experiences, draw conclusions based on the information they
have and adjust to new unseen input data to make logical decisions and perform human-
like tasks.
At first, Artificial Intelligence was viewed mainly as a research field, and many computer scientists did extensive research on AI and on Machine Learning (ML), a subfield of AI. Based on this research, a lot of practical applications of AI and ML were recognised, and the field has since progressed so far that these techniques are now used in almost all fields of study, from Computer Science to Medicine. Thus, in the past two to three decades, AI has evolved from a field of study and research into a practical technology with widespread commercial use, powering everything from chess-playing computers to self-driving cars.
Another sub-field of AI which has seen a lot of progress and research is Deep Learning. Deep Learning is a sub-field of Machine Learning which essentially consists of neural networks with many layers. These neural networks are designed to behave similarly to the human brain; Deep Learning is therefore used to solve complex problems, and neural networks are usually considered to give the best possible outcomes for such problems. Due to the complex nature of neural networks and the many useful applications of deep learning, significant research has been carried out in the field, and different types of neural networks have been developed to tackle various types of complex problems.
Thus, this extensive research in AI has prompted researchers to critically analyse and discuss how AI and Machine Learning can be applied in more unconventional ways, and to look at the field from a newer, more creative perspective. One such area where AI is being used is the creation and understanding of art. As a result, a new kind of art, named artificial intelligence art (AI Art), has emerged: a creative activity that combines art with technology by using AI as the core medium to create and to express thoughts and emotions. This can be done using Computer Vision, an application of AI in which Machine Learning and Deep Learning are used to make observations and draw important inferences from input images, videos or other visual inputs, and to take actions or make recommendations based on that information.
Art can be described as a way of composing a complex interplay between the content and the style of an image in order to demonstrate different ideas and concepts visually. Artists have many different ideas and many different ways of representing them; various art pieces therefore differ in style and complexity depending on the artist who created them. Due to the uniqueness, creativity and complexity involved in the process of creating art, it was long considered impossible for machines to mimic this process and create their own art. However, developments in AI have made this possible: ML and DL are used to understand and recognise the styles of existing art images in order to create a new style and generate new art.
In this project, a study of the advancements made in the field of AI Art is carried out, and the various algorithms and methods used in the field are discussed, explored and compared, based on research into the existing literature on AI Art.
CHAPTER 2
LITERATURE REVIEW
Machine learning studies how a machine may learn to accomplish something without being specifically taught to do so. It is commonly used to anticipate outcomes based on existing data or to categorise data into several labels. The study and development of machine learning have increased rapidly in the past two to three decades, taking it from study and research to a practical technology in widespread commercial use. Within the field of artificial intelligence (AI), ML is considered very popular and has been heavily utilised in the creation of software addressing various automation problems such as natural language processing, computer vision, object detection, speech recognition and classification. Many developers of AI applications now recognise that it is much easier to train a system on already available data than to manually program the algorithm by anticipating the desired response. The impact of machine learning is also widespread in the computing industry and in various data-intensive industries, such as consumer services, troubleshooting of complex systems, and supply-chain control. It has a wide range of influence in the empirical sciences, from biology to cosmology to the social sciences, because ML techniques have been developed to analyse large experimental datasets in novel ways.
A significant number of ML algorithms have been developed to cover the vast variety of data and problem types represented in various machine learning problems. Conceptually, a machine learning algorithm can be seen as a search through a large space of candidate programs, guided by training data. This experience from the training data is used to optimise the performance of the program. Machine learning algorithms vary widely, partly because of the way they represent candidate programs (for example, decision trees, mathematical functions, and general programming languages), and partly because of the way they search the program space (for example, optimisation algorithms with well-understood convergence properties).
2.1 Artificial Neural Networks
Artificial neural networks (also simply known as neural networks) are a pattern-oriented modelling technique that can be used to solve classification and time-series problems. Due to the non-parametric nature of neural networks, models can be created without prior knowledge of the data distribution or of possible interaction effects between variables, which most parametric statistical methods require. The process used in neural networks is similar to the way the human brain operates, and hence neural networks are usually considered to give the best possible outcomes. A neural network consists of various nodes, where each node is similar to a neuron in our brain. Each node is connected to the next through weights, which are calculated during the execution of the machine learning algorithm [3].
The basic architecture of neural networks consists of three types of neuron layers: the input, hidden, and output layers. In feed-forward networks, data moves in a forward direction from the input units to the output units. A value is calculated for each node based on the values of the nodes in the previous layer, and this process of calculating and storing intermediate variables from the input layer to the output layer is known as forward propagation. Feed-forward networks do not contain any feedback connections, and hence data flows only in the forward direction. Recurrent networks, in contrast, do contain feedback connections. Training such networks typically relies on backpropagation, a method for calculating the gradients of the neural network parameters in which the network is traversed in reverse order, from the output layer back to the input layer. A neural network must be set up in such a way that when a set of inputs is applied, the desired set of outputs is produced. The strength of the connections can be determined using a variety of approaches. One method is to use a priori information to set the weights explicitly. Another is to train the neural network by giving it training patterns and allowing it to adjust its weights based on some learning rule [4].
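As an illustration of forward propagation, the following is a minimal sketch (not taken from this report) of a small feed-forward network written in NumPy. The layer sizes, random weights and sigmoid activation are arbitrary choices made only for the example.

import numpy as np

def sigmoid(z):
    # Element-wise sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

# A tiny network: 4 inputs -> 5 hidden units -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # hidden -> output weights and biases

def forward(x):
    # Forward propagation: each layer's values depend only on the previous layer
    h = sigmoid(W1 @ x + b1)   # hidden layer activations
    y = sigmoid(W2 @ h + b2)   # output layer activations
    return y

print(forward(np.array([0.1, 0.2, 0.3, 0.4])))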
Fig 2.2: Deep Neural Network Architecture[5]
The Convolutional Neural Network (CNN) is one of the most widely used deep neural networks. Convolution can be described as a mathematical linear operation performed between matrices. A CNN is composed of several types of layers, including convolutional, non-linearity, pooling and fully-connected layers. The pooling and non-linearity layers have no parameters, whereas the convolutional and fully connected layers do [6]. An important characteristic of CNNs is that they learn their own features, which saves time and resources in the data pre-processing stage. A CNN might require a little hand engineering of features in the beginning, but as training progresses it adapts, learns new features and develops its own filters; the CNN therefore continuously evolves as the data grows. A CNN works well for data represented as a grid structure, and hence for pattern recognition and image classification problems, since an image is nothing but a matrix of pixel values. Because the data in an image has this grid structure, the CNN can capture the spatial and temporal dependencies of the image by applying various filters, and can thus be trained to understand and finely characterise the input image. The role of a CNN is therefore to reduce the image into a form that is simpler to process without sacrificing the features that are critical to obtaining an accurate result; since the CNN reduces the image into a simpler form, it can easily be scaled to very large datasets as well.
Thus, the working of a CNN is as follows (a short code sketch follows this list):
• Convolution layer: This layer computes the dot product between the weights of the neurons and the related regions of the input image.
• Non-linearity layer: The non-linearity is applied to adjust or cut off the generated output, i.e. to saturate or limit it [6].
• Pooling layer: The pooling layer down-samples the spatial dimensions of the image, reducing the amount of computation to be performed and the number of features to learn.
• Fully connected output layer: The fully connected output layer is the classifier layer; it gives the final output by classifying the images into the required categories.
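The following is a minimal PyTorch sketch of the four layer types listed above. The channel counts, kernel size, input resolution and number of classes are illustrative assumptions, not values taken from this report.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # Minimal CNN mirroring the four layer types described above
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # convolution layer
        self.relu = nn.ReLU()                                    # non-linearity layer
        self.pool = nn.MaxPool2d(2)                              # pooling layer (halves height and width)
        self.fc = nn.Linear(16 * 16 * 16, num_classes)           # fully connected classifier layer

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))   # 3 x 32 x 32 -> 16 x 16 x 16
        x = x.flatten(1)                          # flatten for the classifier
        return self.fc(x)

model = TinyCNN()
logits = model(torch.randn(1, 3, 32, 32))   # one random 32 x 32 RGB image
print(logits.shape)                          # torch.Size([1, 10])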
Since CNN is widely used for image classification, there has been a significant amount of
research using CNN in the field of AI art as well. Crowley et al. (2014) [7] demonstrated
in their research paper that object classifiers, learnt using CNN features computed from
various natural image sources, can retrieve paintings containing these objects. They were
not only able to detect the objects, but were also able to make annotations on the objects to
show how these objects in the paintings have evolved over time.
After this paper, various advancements were made in object detection, and related problems such as identifying the positions of objects, recognising faces in paintings, and analysing the gender of faces in paintings were also addressed. Researchers thus realised that not only the objects, but also the style of an image could be detected using CNNs, and these advancements further supported the rising interest in AI art. Eva Cetinic et al. (2013) [8] proposed an approach for the automated classification of paintings by artist. In their model, the individual style of an artist is recognised by analysing specific components of a painting which distinguish the work of one individual from the works of others. Once the style of an artist has been recognised, their various paintings can be automatically classified.
2.3 Neural Style Transfer
Neural Style Transfer builds on texture synthesis methods that use summary statistics such as histograms, for example the Portilla and Simoncelli approach. For the image analogies problem, NST may be characterised as histogram-based texture synthesis using convolutional neural network (CNN) features. The original work employed a VGG-19 architecture that had been pre-trained for object recognition on the ImageNet dataset.
Neural style transfer is a technique for blending two images, a content picture and a style reference image (such as a famous painter's work), so that the output image looks like the content image but is "painted" in the manner of the style reference image. This is accomplished by matching the content statistics of the content picture and the style statistics of the style reference image in the output image. A convolutional network is used to extract these statistics from the pictures.
2.3.1 VGG-19
The research paper by Gatys et al. [9][10] was one of the first to introduce Neural Style Transfer, and it triggered the rapid use and development of AI in the field of art. In the paper, the authors present an artificial neural system which makes use of VGG-19 [11], a CNN that rivals human performance on a common visual object recognition benchmark. The model makes use of the feature space supplied by the 16 convolutional and 5 pooling layers of the 19-layer VGG network. The algorithm's main idea is to iteratively optimise a picture with the goal of matching specified CNN feature distributions, covering both the content information of the photograph and the style information of the artwork. The approach separates image content from style, allowing any image's content to be recast in any other image's style. This is demonstrated by the generation of novel, visually appealing images that combine the style of several well-known paintings with the content of an arbitrarily chosen photograph. The feature responses of high-performing deep neural networks trained on object recognition are used to derive the neural representations of an image's content and style. This work was the first to show how such image features can be used to separate content from style in natural images.
Fig 2.3: Results obtained by Neural Style Transfer Using VGG Network[9]
• Input: The VGGNet accepts images with a size of 224 x 224 pixels. To keep the image input size consistent for the ImageNet competition, the model's authors cropped out the central 224 x 224 patch of each image. The only pre-processing performed is to subtract from each pixel the mean RGB value computed over the training dataset.
• Convolution layers: VGG's convolutional layers use a small receptive field (3 x 3), the smallest size that still captures up/down and left/right movement. There are also 1 x 1 convolution filters that perform a linear transformation of the input. Each convolution is followed by a ReLU unit, a significant AlexNet innovation that cuts down training time. The rectified linear unit (ReLU) activation function is a piecewise linear function that outputs the input if it is positive and zero otherwise. The convolution stride is fixed to 1 pixel, and the spatial padding of the convolutional layer input is set to 1 pixel for the 3 x 3 convolution layers, so that the spatial resolution is preserved after convolution. Spatial pooling is then carried out by five max-pooling layers, which follow some (but not all) of the convolution layers. Max-pooling is performed over a 2 x 2 pixel window with stride 2.
• Hidden layers: All of the VGG network's hidden layers use ReLU. Local Response Normalisation (LRN) is generally not used in VGG, since it increases memory consumption and training time without improving overall accuracy.
• Fully connected layers: The architecture consists of a stack of convolutional layers of varying depth in the different configurations, followed by three fully-connected (FC) layers: the first two FC layers have 4096 channels each, while the third performs 1000-way classification and hence has 1000 channels, one for each class. The final layer is the soft-max layer. The fully connected layers are configured identically in all of the networks [12]. A short feature-extraction sketch follows this list.
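As a rough illustration of how feature responses can be read out of a pre-trained VGG-19 (the feature space that Neural Style Transfer relies on), the sketch below uses the torchvision implementation. The chosen layer indices and the weights argument are assumptions that may need adjusting for a particular torchvision version.

import torch
from torchvision import models

# VGG-19 pre-trained on ImageNet; weights are downloaded on first use
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# Indices of a few convolutional layers whose feature maps are commonly used
# as content/style representations (an assumed choice, not from the report)
layers_of_interest = {0: "conv1_1", 5: "conv2_1", 10: "conv3_1", 19: "conv4_1", 21: "conv4_2"}

def extract_features(image):
    # Run a 1 x 3 x 224 x 224 image through VGG-19 and collect intermediate activations
    feats, x = {}, image
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in layers_of_interest:
                feats[layers_of_interest[idx]] = x
    return feats

features = extract_features(torch.randn(1, 3, 224, 224))
for name, f in features.items():
    print(name, tuple(f.shape))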
2.3.1.2 Content Loss
It aids in the identification of similarities between the produced picture and the content
image. Higher layers of the model, intuitively, focus more on the characteristics contained
in the picture, i.e., the total content of the image.
The Euclidean distance between the corresponding intermediate higher-level feature
representations of the input picture (x) and the content image (p) at layer l is used to
determine content loss.
It's natural for a model to generate distinct feature maps at higher layers when different
objects are present.
This allows us to conclude that photos with comparable content should have similar
activations in higher layers.
$L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F^{l}_{ij} - P^{l}_{ij}\right)^{2}$   (2.1)
Content loss is fundamentally distinct from style loss: style loss cannot be obtained by directly comparing the intermediate features of two images. For this purpose a new term, the Gram matrix, is introduced. The Gram matrix is a way of capturing the style information in a picture by describing the overall distribution of features in a specific layer; what is measured is the degree of correlation between the feature maps of that layer. Style loss is then calculated as the distance between the Gram matrices (in other words, the style representations) of the generated image and the style reference image.
The contribution of each layer to the style information is calculated by the formula below:

$E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left(G^{l}_{ij} - A^{l}_{ij}\right)^{2}$   (2.2)

The total style loss is the weighted sum of these layer contributions, where the contribution of each layer l is weighted by a factor $w_l$ [13].
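A short PyTorch sketch of Eq. (2.1) and Eq. (2.2) is given below. The feature-map shapes are arbitrary, and in a full style-transfer implementation these per-layer terms would be summed over the chosen layers with the weights w_l.

import torch

def content_loss(F, P):
    # Eq. (2.1): half the squared Euclidean distance between the feature maps
    # of the generated image (F) and the content image (P) at one layer
    return 0.5 * torch.sum((F - P) ** 2)

def gram_matrix(F):
    # Gram matrix of a 1 x C x H x W feature map: correlations between channels
    _, c, h, w = F.shape
    feats = F.view(c, h * w)
    return feats @ feats.t()

def layer_style_loss(F, A):
    # Eq. (2.2): squared difference between the Gram matrices of the generated
    # image and the style image, normalised by 4 * N_l^2 * M_l^2
    _, c, h, w = F.shape            # N_l = c feature maps, M_l = h * w positions
    G, A_gram = gram_matrix(F), gram_matrix(A)
    return torch.sum((G - A_gram) ** 2) / (4.0 * (c ** 2) * ((h * w) ** 2))

# Example with random feature maps of the same shape
F = torch.randn(1, 64, 32, 32)
print(content_loss(F, torch.randn_like(F)).item())
print(layer_style_loss(F, torch.randn_like(F)).item())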
2.4 Generative Adversarial Networks (GAN)
Another revolutionary innovation in the field of AI art was the invention of Generative Adversarial Networks (GANs). Introduced by Goodfellow et al. in their paper, GANs constitute a significant milestone in the effort to use machines to create new visual content. A GAN's main mechanism is to train two "competing" models, a discriminator and a generator, both of which are frequently implemented as neural networks. The generator's purpose is to generate realistic images by capturing the distribution of the actual examples in the input sample, while the discriminator categorises generated images as fake or real [14]. The process terminates at a point that is a minimum with respect to the generator and a maximum with respect to the discriminator, as it is designed as a minimax optimisation problem. Implementations of this framework have produced outstanding results in creating plausible fake variants of real images for many sorts of image content [14]. GANs quickly rose to prominence as one of the most important areas of research in AI, with several advanced and domain-specific versions of the original design emerging, e.g. CycleGAN [15], StyleGAN [16] or BigGAN [17].
The GAN model gives great results and is able to generate very realistic images; however, GANs often do not learn the way one might expect. Typically, convergence of a model on the training dataset is sought, observed as the minimisation of the loss function on that dataset. In a GAN, by contrast, convergence signals the end of the two-player game between the generator and the discriminator; the goal is instead to find a balance between the generator and discriminator losses. We will look into GANs and the variants of their loss functions in order to gain a better understanding of how they work.
The standard GAN loss is the following minimax value function:

$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$

In this function:
• D(x) is the discriminator's estimate of the probability that a real data instance x is real.
• Ex is the expected value over all real data instances.
• G(z) is the generator's output when given noise z.
• D(G(z)) is the discriminator's estimate of the probability that a fake instance is real.
• Ez is the expected value over the generator's random inputs (in effect, the expected value over all generated fake instances G(z)).
• The formula derives from the cross-entropy between the real and generated distributions.
Since the generator cannot directly affect the log(D(x)) term in the function, minimising the loss for the generator is equivalent to minimising log(1 - D(G(z))). The standard GAN loss function can be split into two further parts, the discriminator loss and the generator loss, which are discussed below:
• Discriminator loss: While it is being trained, the discriminator classifies both genuine data and fake data produced by the generator. It is penalised for misclassifying a real instance as fake or a fake instance (made by the generator) as real, and the corresponding objective is maximised with respect to the discriminator.
• Generator loss: While it is being trained, the generator samples random noise and produces an output from it. The output is then passed to the discriminator, which classifies it as "real" or "fake" based on its ability to distinguish one from the other. The generator loss is then determined from the discriminator's classification: the generator is rewarded if it succeeds in fooling the discriminator and penalised if it fails [19]. A short code sketch of these two losses is given below.
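The sketch below illustrates these two losses using binary cross-entropy in PyTorch. It uses the commonly applied non-saturating generator loss rather than the exact minimax form, and the tensor shapes are arbitrary.

import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    # The discriminator is penalised for calling real images fake or fake images real
    real_loss = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real_loss + fake_loss

def generator_loss(d_fake_logits):
    # The generator is rewarded when the discriminator labels its samples as real
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

# Dummy discriminator outputs for a batch of 8 real and 8 generated images
d_real = torch.randn(8, 1)
d_fake = torch.randn(8, 1)
print(discriminator_loss(d_real, d_fake).item(), generator_loss(d_fake).item())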
If the generator starts producing the same output (or a small set of outputs) over and over again, the discriminator's best strategy is to learn to always reject that output. However, if the next generation of the discriminator gets stuck in a local minimum and fails to find the optimal strategy, the next generator iteration can find the most plausible output for the current discriminator far too easily. Each generator iteration then over-optimises for a particular discriminator, and the discriminator never learns to escape the trap. As a result, the generator alternates between a limited number of output types. This type of GAN failure is called mode collapse. Attempts have been made to remedy this problem by making use of the Wasserstein loss and by designing different types of GANs that train in ways that reduce mode collapse; these GANs are discussed later in the report [21].
2.5.1 CycleGAN
Image-to-image translation entails creating a new synthetic version of a given image with a specific change, such as changing a summer landscape into a winter scene. A large collection of paired examples is usually required to train a model for image-to-image translation. Such datasets, for example images of artworks by long-dead painters, can be difficult and expensive to create, if not impossible in some situations. CycleGAN is a technique for automatically training image-to-image translation models without paired samples. The models are trained in an unsupervised way using a set of images from the source and target domains that are not related in any way.
CycleGAN was proposed by Zhu et al. In their paper, the authors offered a system that, in the absence of any paired training examples, learns to capture the specific properties of one image collection and works out how these traits can be transferred to another image collection. The key finding of the paper was that the CycleGAN model can be used to transfer the style of an image. Unlike Neural Style Transfer, however, CycleGAN learns to emulate the style of a complete collection of artworks (e.g. Van Gogh), rather than a single piece of art (e.g. Starry Night). The authors were also able to perform the reverse of the process, i.e. converting a painting into a realistic photograph, using the same CycleGAN model. CycleGAN is able to learn the special characteristics of one image collection and translate them into another image collection without any paired training examples. The developers of the model first compared it to existing methods for unpaired image-to-image translation, using paired datasets with input-output pairs. The relevance of the cycle consistency loss and the adversarial loss was then investigated, and the full method was compared to several variants. Finally, they demonstrated the applicability of their technique in a variety of applications that do not require paired data [15]. One such application where CycleGAN has been used is FaceApp, which shows how a person might look at different ages [22].
2.5.1.1 Architecture
A CycleGAN, in contrast to a traditional GAN, is made up of two GANs, giving it a total of two generators and two discriminators. In CycleGAN, the problem is modelled as an image reconstruction problem. We begin by taking an input image (x) and converting it to a translated image using the generator G. Then, using the generator F, we reverse this process from the translated image back to a reconstruction of the original image. A cycle-consistency loss between the original and reconstructed images is then computed (the original paper uses an L1, i.e. mean absolute error, loss). The most crucial aspect of CycleGAN is that it can perform image translation on unpaired data, where there is no relation between the input and output images.
Fig 2.6: Example of a paired and unpaired Dataset [15]
A paired dataset is one where each input image has a counterpart in the dataset whose style needs to be incorporated into the other set of images. As shown in the example above, for every image in set xi, a corresponding image of the same scene in a different style is present in set yi. For an unpaired dataset, it is not necessary to have such a complete set of corresponding images. As shown in the example above, a dataset of arbitrary photographs and a dataset of arbitrary paintings can be used to implement CycleGAN.
Cycle consistency is a supplementary extension to the architecture used by the CycleGAN.
The notion is that an image output by the first generator can be utilised as input to the
second generator, and the output of the second generator should match the original image.
The opposite is also true: an output from the second generator can be provided as input to
the first generator, with the result matching the input to the second generator. Cycle
consistency is a machine translation concept that states that a phrase translated from
English to French should translate back to English and be identical to the original phrase.
The opposite should also be true [23]. Given two sets of images, for example, horses and
zebras, one generator transforms horses into zebras and the other transforms zebras into
horses. The discriminators are present during the training phase to determine if images
computed by generators are authentic or not. With the feedback of their respective
discriminators, generators can improve their performance in this process. In the instance of
CycleGAN, one generator receives additional feedback from the other. This feedback
ensures that an image formed by a generator is cycle consistent, which means that applying
both generators to an image sequentially should result in a similar image [24].
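A minimal sketch of the cycle-consistency idea is shown below. The two generators are replaced by identity functions purely to make the example runnable, and the weighting factor lam is an assumed value.

import torch

def cycle_consistency_loss(real_x, real_y, G, F_gen, lam=10.0):
    # x -> G(x) -> F(G(x)) should reconstruct x, and y -> F(y) -> G(F(y)) should
    # reconstruct y; the CycleGAN paper penalises the reconstruction with an L1 loss
    cycled_x = F_gen(G(real_x))
    cycled_y = G(F_gen(real_y))
    loss = torch.mean(torch.abs(cycled_x - real_x)) + torch.mean(torch.abs(cycled_y - real_y))
    return lam * loss

# Identity "generators" just to show the call pattern; real generators are CNNs
G = lambda img: img
F_gen = lambda img: img
x, y = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
print(cycle_consistency_loss(x, y, G, F_gen).item())   # 0.0 for identity generators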
Fig 2.7: Architecture of CycleGAN for converting a Zebra into a Horse [24]
2.5.1.2 Applications
Other than transforming a horse into a zebra, CycleGAN may be used in a variety of ways. It is a very adaptable model, so much so that it can even turn an apple into an orange; perhaps not the most useful example, but here are a few others:
• Using blueprints, creating a realistic picture of what a structure would look like.
• Visualising how an area might appear in each season.
• Changing artwork into real-life images.
• Based on a police sketch, creating a realistic picture of a suspect's face.
• Enhancing photo elements to make them appear more professional [23].
2.5.2 ProGAN
Although GANs were very powerful and gave very good results, researchers still found it troublesome to generate high-quality large images (e.g. 1024 x 1024) until 2017, when Karras et al. [25] first tackled the problem with the Progressive GAN (ProGAN). The key idea of this model was to train the generator and discriminator progressively: rather than trying to train all layers of the generator and discriminator at once, as is typical, the team built their GAN one layer at a time to handle increasingly higher-resolution versions of the images. To accomplish this, they first reduced the resolution of their training photos to a very low starting point (only 4 x 4 pixels). To generate images at this low resolution, they developed a generator with only a few layers and a discriminator with a mirrored architecture. Because these networks were so small, they learned only the large-scale features evident in the extremely blurred images and trained quickly.
They added another layer to the Generator and Discriminator after the initial layers finished
training, boosting the output resolution to 8x8. The preceding layers' trained weights were
preserved but not locked, and the new layer was progressively faded in to help stabilise the
transition. The training continued until the GAN could synthesise convincing images once
more, this time at the new 8x8 resolution. They continued to add layers, double the
resolution, and train in this manner until they reached the desired output size [25].
Thus, by progressively increasing the resolution, the network continuously learns a much simpler piece of the overall problem; this incremental learning process greatly stabilises the training of the model and also reduces the chances of mode collapse.
The low-to-high resolution trend also encourages progressively growing networks to
prioritise high-level structure (patterns evident even in the most blurred forms of the image)
above details. This increases the final image's quality by lowering the chances that the
network will get some high-level structure completely wrong. Gradually increasing the
network size is also more computationally efficient than the more traditional approach of
initialising all the layers at once. Fewer layers are faster to train because they contain fewer
parameters. Since all but the final set of training iterations are performed with a subset of
the eventual layers, impressive efficiency gains are obtained. Karras et al. discovered that,
depending on the output resolution, their ProGAN trained about 2–6 times faster than a
corresponding traditional GAN.
Therefore, ProGANs were able to generate high-quality images; however, their capacity to modify specific characteristics of the generated image is limited, as is the case with most models. To put it another way, the features are entangled, so changing the input, even a little, frequently affects numerous features at once [25].
Fig 2.8: ProGAN Architecture [26]
2.5.3 StyleGAN
Karras et al. proposed a style-based GAN architecture (StyleGAN) which offers an upgraded version of ProGAN's image generator, with a focus on the architecture of the generator network. The study proposes an alternative generator design for GANs which, according to the authors' findings, is better than the standard GAN generator architecture in every way. The architecture achieves an automatically learned, unsupervised separation of high-level features and stochastic variation in the produced images. As a result, the style-based design dramatically alters the generator [16]. The generator does not accept a point from the latent space directly as input; instead, two additional sources of randomness are employed to build a synthetic image: an independent mapping network and noise layers.
When the latent vector is used directly, a phenomenon known as feature entanglement prevents the model from mapping individual sections of the input (items in the vector) to individual features. By employing another neural network, however, the model can build an intermediate vector that does not have to follow the training data distribution, which lessens the correlation between characteristics.
The Mapping Network is made up of eight fully connected layers, with an output vector w that is the same size as the input layer (512 x 1). The latent vector is sent through this 8-layer mapping network, followed by an 18-layer synthesis network whose levels produce images ranging from 4 x 4 up to 1024 x 1024 resolution; an additional convolution layer converts the output of the final layer into an RGB picture.
The output of the mapping network is a vector that defines the styles, which are merged at each point in the generator model via a new layer called adaptive instance normalisation [16].
2.5.3.2 Adaptive Instance Normalization (AdaIN)
The AdaIN (Adaptive Instance Normalisation) [28] layer transmits the information encoded in the Mapping Network's output w into the generated image. Each resolution level of the Synthesis Network has a module that defines the visual expression of the characteristics at that level:
• Each channel of the convolution layer output is first normalised, to ensure that the scaling and shifting of step 3 have the desired impact.
• Another fully-connected layer (designated as A) transforms the intermediate vector w into a scale and a bias for each channel.
• Each channel of the convolution output is scaled and shifted by these vectors, defining the importance of each filter in the convolution. This fine-tuning converts the information from w into a visual representation.
Using distinct style vectors at different places in the synthesis network allows the styles of the final image to be adjusted at various levels of detail [27]. A small sketch of the AdaIN operation is given below.
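The following is a small, self-contained sketch of the AdaIN operation described above. The channel count is arbitrary, and in StyleGAN the scale and bias would come from the learned affine layer A applied to w rather than from random tensors.

import torch

def adain(conv_out, style_scale, style_bias, eps=1e-5):
    # Normalise each channel of the convolution output, then scale and shift it
    # with per-channel values derived from the intermediate vector w
    mean = conv_out.mean(dim=(2, 3), keepdim=True)       # per-image, per-channel mean
    std = conv_out.std(dim=(2, 3), keepdim=True) + eps   # per-image, per-channel std
    normalised = (conv_out - mean) / std
    return style_scale * normalised + style_bias

x = torch.randn(1, 512, 8, 8)          # activations at one synthesis-network level
scale = torch.randn(1, 512, 1, 1)      # stand-in for the per-channel style scale
bias = torch.randn(1, 512, 1, 1)       # stand-in for the per-channel style bias
print(adain(x, scale, bias).shape)     # torch.Size([1, 512, 8, 8])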
During training, a second latent code can also be switched in at a randomly chosen level of the synthesis network. The random switch prevents the network from learning and relying on a correlation between levels. Though it does not increase model performance across all datasets, this notion has an intriguing side effect: it can merge numerous images in a coherent manner. The model creates two images, A and B, and then combines them by using the low-level features from A and the rest of the information from B [16].
2.5.4 StyleGAN2
Karras et al. proposed modifications to both the architecture of the model and the training methods of StyleGAN in their follow-up paper. In order to improve conditioning in the mapping from latent codes to pictures, they revised the generator normalisation, revisited progressive growing, and regularised the generator. These changes improve the image quality generated by StyleGAN, and the resulting model is known as StyleGAN2. Firstly, StyleGAN2 uses weight demodulation, a new normalisation technique that replaces adaptive instance normalisation. Weight demodulation fixes the droplet-like artefacts that were observed in StyleGAN-generated images starting at 64 x 64 resolution. Adaptive Instance Normalisation scales and shifts intermediate activations, similar to other normalisation layers such as Batch Norm. Batch Norm uses learned mean and variance parameters obtained from batch statistics, whereas Instance Norm normalises each image individually rather than across a batch. Adaptive Instance Norm uses different scale and shift parameters to align different sections of the source data with different regions of the feature map (either within each feature map or by grouping feature channels by spatial location). Weight demodulation removes the scale and shift parameters from the sequential calculation pipeline and instead bakes the scaling into the convolutional layer parameters; the shifting of values (done with y in AdaIN) appears to be taken over by the noise map.
Secondly, an improved training method is developed that achieves the same purpose as progressive growing (training starts with low-resolution images and then gradually shifts focus to higher and higher resolutions) while not modifying the network topology during training. New types of regularisation are also proposed, such as lazy regularisation and path length regularisation [29].
2.5.5 BigGAN
Brock et al. proposed BigGAN, which builds on the Self-Attention GAN (SAGAN) model [30]. They found that GANs trained to model natural images of various categories benefit greatly from scaling up, both in terms of quality and sample diversity. As a consequence, the model achieves a new level of performance among ImageNet GAN models, vastly outperforming the previous state of the art. They also offered an examination of the training behaviour of large-scale GANs, characterised their stability in terms of the singular values of their weights, and addressed the relationship between stability and performance [17].
In BigGAN, the authors implemented two architectural adjustments to boost scalability, while also improving conditioning by using orthogonal regularisation on the generator. The orthogonal regularisation applied to the generator makes the model amenable to the "truncation trick", which allows fine control of the trade-off between sample fidelity and variety by truncating the latent space. In terms of stability, the authors identified and characterised instabilities particular to large-scale GANs and devised strategies to mitigate them, though at a substantial performance cost. BigGAN achieves an Inception Score (IS) of 166.3 when trained on the ImageNet dataset at 128 x 128 resolution, more than a 100 percent improvement over the previous state-of-the-art (SotA) score of 52.52. In addition, the Fréchet Inception Distance (FID) score was improved from 18.65 to 9.6. BigGAN also outperformed the previous SotA on ImageNet at 256 x 256 and 512 x 512 resolutions, in addition to its performance improvement at 128 x 128 resolution [31].
2.5.6 AICAN
To push the GAN technology's creative content generation capabilities even further, Elgammal et al. introduced AICAN, the artificial intelligence creative adversarial network. They suggest in their study that training a GAN model on paintings would merely teach it how to create pictures that look like existing art, and that this, like the Neural Style Transfer technique, would not produce anything truly artistic or innovative. The model they propose generates art in a more creative manner by learning the style of existing paintings and then deviating from the learned styles to make the generated artwork creative [32]. In their paper, the authors propose changing the optimisation criterion so that the network creates new art by increasing divergence from established styles while remaining within the overall distribution of art. This deviation ensures that the generated artwork appears original and differs from existing artistic styles. Through a series of exhibitions and trials, the authors also showed that viewers were frequently unable to distinguish between AICAN-generated images and artworks made by human artists. Thus, the results generated by the AICAN model are novel and largely indistinguishable from human-made artwork [32].
2.5.7 Conditional GAN (cGAN)
Conditional generative adversarial network, or cGAN for short, is a form of GAN that uses
a generator model to conditionally generate images. A conditional setup is used in cGANs,
which means auxiliary information (such as class labels or data) is fed to both the generator
and discriminator from other modalities. As a result, by being fed varied contextual
information, the ideal model can learn multi-modal mapping from inputs to outputs. We
get two advantages by sharing additional information:
• Convergence will occur more quickly. Even the random distribution of the fake images
will have some sort of pattern.
• At test time, you can control the generator's output by providing a label for the image
you want to generate.
Assume you train a GAN on handwritten digits (MNIST images). Normally, you have no influence over the images the generator produces; in other words, you cannot request a specific digit from the generator. This is where cGANs come into play, as they let us add an additional input layer of one-hot-encoded image labels. This additional layer directs the generator as to which image to generate [33].
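A toy sketch of this conditioning idea is given below. The network sizes, the use of an embedding layer and the MNIST-like output dimension are assumptions made for illustration, not details from the cited papers.

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    # Toy cGAN generator: the class label is embedded and concatenated with the
    # noise vector, so the output can be controlled at test time
    def __init__(self, noise_dim=100, num_classes=10, img_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(num_classes, num_classes)   # label -> dense vector
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        cond = torch.cat([z, self.embed(labels)], dim=1)   # auxiliary information
        return self.net(cond)

gen = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.tensor([3, 3, 7, 7])     # ask for specific digits
print(gen(z, labels).shape)             # torch.Size([4, 784])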
In their paper, Matsumura et al. proposed a method for tile art image generation which makes use of machine learning methods and is based on a conditional GAN. In a conditional GAN, a source image is provided as an additional input to the GAN. In the proposed method, with the help of the conditional GAN, the authors directly produce a tile art image from the provided image [34]. The tile art image is generated using forward propagation. The proposed approach's network architecture is based on the pix2pix method with conditional GANs [35].
Fig 2.12: cGAN Architecture [33]
Xu et al. introduced the Attentional GAN (commonly known as AttnGAN), which enables multi-stage, attention-driven refinement for the generation of fine-grained images from text. Using an attentional generative network, AttnGAN synthesises fine-grained details in distinct regions of the image by paying attention to the relevant words in the natural-language description. In addition, a deep attentional multimodal similarity model is proposed in the paper for training the generator to compute a fine-grained image-text matching loss. Visualising the AttnGAN's attention layers allows for a more complete analysis: it was shown for the first time that a layered attentional GAN is capable of automatically selecting word-level conditions for generating different parts of the image [36].
2.6 Text Guided Art Generation
The concept is simple: to generate a picture, we start with random values for the w latent vectors of StyleGAN. The resulting image, along with an arbitrary text prompt, is given to CLIP, a model that scores how well an image matches a text description. CLIP assigns the image a score based on how well it represents the prompt's content. This value is used to update w, which generates another picture, and so on, until we conclude that the generated image is sufficiently similar to the prompt. The values of w are updated using gradient descent and backpropagation, as if they were weights in a neural network [39].
The StyleGAN+CLIP pipeline therefore couples a pre-trained StyleGAN generator with the CLIP image-text similarity model in an optimisation loop over the latent vector; a sketch of this loop is given below.
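The stylegan_synthesis and clip_similarity callables in the sketch below are hypothetical stand-ins for the pre-trained StyleGAN generator and the CLIP similarity score; in a real pipeline they would wrap the actual models.

import torch

def optimise_latent(stylegan_synthesis, clip_similarity, prompt, steps=300, lr=0.05):
    w = torch.randn(1, 512, requires_grad=True)     # start from random latent values
    optimiser = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = stylegan_synthesis(w)               # generate a candidate image
        loss = -clip_similarity(image, prompt)      # higher similarity -> lower loss
        optimiser.zero_grad()
        loss.backward()                             # backpropagate into w, not the networks
        optimiser.step()
    return w

# Dummy callables so the sketch runs end to end; replace with the real models
dummy_synthesis = lambda w: w.view(1, -1)
dummy_clip = lambda img, prompt: -((img - 1.0) ** 2).mean()
w_final = optimise_latent(dummy_synthesis, dummy_clip, "City during a rainy night")
print(w_final.shape)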
The generator model of a GAN is trained using a second model called a discriminator that
learns to classify images as real or fake, unlike conventional deep learning neural network
models that are trained with a loss function until convergence. To ensure equilibrium, both
the generator and discriminator models are trained concurrently. As a result, there is no
objective loss function used to train GAN generator models, and there is no way to
objectively assess the training progress and the relative or absolute quality of the model
based on loss alone.
Therefore, it becomes really difficult to quantify and compare the performance of GAN
models using conventional metrics. Thus, to compare the performance of GAN models,
metrics like Fréchet Inception Distance (FID) and Inception Score (IS) are generally used.
Human visual assessment of generated samples is one of the most prevalent and intuitive methods of evaluating GANs; this assessment is generally carried out by the researcher or practitioner themselves, judging whether the samples look plausible for the target domain.
Although inspecting GANs manually can be the most basic form of model evaluation, it
has several drawbacks, including:
• It is subjective, including prejudices of the reviewer concerning the model, its
configuration, and the project purpose.
• It necessitates understanding of what is and is not feasible for the target domain.
• It is constrained by the number of photographs that can be evaluated in an acceptable
amount of time.
Using human vision to evaluate the quality of generated images is costly and time-consuming, biased, difficult to replicate, and does not reflect the real capacity of the models. The subjectivity almost always results in unbalanced model selection and nitpicking, so manual inspection is not advised for definitive model selection on non-trivial projects. As a result, better and more sophisticated GAN evaluation metrics came into the picture; these are discussed below [40].
The Inception Score (IS) is a metric for assessing the quality of generated images, particularly synthetic images produced by GAN models. The Inception Score uses a pre-trained deep-learning image-classification network to classify the generated images. The model is used to classify a large number of generated images, and for each image the probability of belonging to each class is predicted. The Inception Score is computed by combining these predictions. The score aims to capture two characteristics of a set of generated images:
• the quality of the generated images
• the diversity of the generated images
The Inception Score has a lowest value of 1.0 and a highest value equal to the number of classes supported by the classification model. A higher Inception Score indicates better performance of the GAN [41].
2.7.3 Fréchet Inception Distance (FID)
This is one of the most widely used metrics for comparing real and generated images in
terms of feature distance. The Fréchet Distance is a measure of similarity between curves
that takes the location and ordering of the points along the curves into account. It can also
be used to calculate the difference between two distributions.
Feature distance and the Inception model pre-trained on the Imagenet dataset are employed
in the context of computer vision, particularly GAN evaluation. The score is called
"Fréchet Inception Distance" because it uses activations from the Inception model to
summarise each image.
The FID is computed as

$FID = \lVert \mu_x - \mu_y \rVert^{2} + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2\left(\Sigma_x \Sigma_y\right)^{1/2}\right)$

where X and Y are the real and fake embeddings (activations from the Inception model), assumed to be two multivariate normal distributions, $\mu_x$ and $\mu_y$ are the means of X and Y, Tr is the trace of the matrix, and $\Sigma_x$ and $\Sigma_y$ are the covariance matrices of X and Y. A lower FID value means better image quality and diversity [42].
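A small NumPy/SciPy sketch of this formula is given below, operating on pre-computed Inception activations. The feature dimension and the random stand-in data are assumptions made only so the example runs.

import numpy as np
from scipy import linalg

def frechet_inception_distance(acts_real, acts_fake):
    # FID between two sets of Inception activations (rows = images, columns = features)
    mu_x, mu_y = acts_real.mean(axis=0), acts_fake.mean(axis=0)
    sigma_x = np.cov(acts_real, rowvar=False)
    sigma_y = np.cov(acts_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_y, disp=False)   # matrix square root
    covmean = covmean.real                                     # discard tiny imaginary parts
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(sigma_x + sigma_y - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))                  # stand-in for Inception features
fake = rng.normal(loc=0.3, size=(500, 64))
print(frechet_inception_distance(real, fake))      # lower is better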
In this section, comparison of the various GAN models discussed in this report has been
done based on their FID scores.
StyleGAN [16]: FID 4.43 (FFHQ dataset)
It is difficult to compare the results of these GANs directly, as the functions they perform, the outputs they generate and the datasets they have been trained on are different. On the basis of FID alone, StyleGAN2 has the best FID score of 2.84 when trained on the FFHQ dataset. Of the models trained on the ImageNet dataset, which is the most commonly used dataset, the best FID score was obtained by BigGAN.
CHAPTER 3
METHODOLOGY
The goal of this project is to address and answer some questions related to the subject of AI Art (creating art using deep neural networks) through research on the topic. To answer these questions, a qualitative research approach was used. A qualitative approach is a type of research in which other people's views and understanding of the issue are used to gain insight into the topic. This type of research helps the researchers develop their own ideas and helps the reader increase their understanding of the topic.
The dataset used in the training of the model is the ImageNet dataset. ImageNet is a massive
collection of annotated images used in computer vision research. The dataset was created
with the intention of serving as a resource to aid in the study and development of better
computer vision technologies.
In this project, the following steps were used for training the VGG-19 model (a short sketch follows the list):
• The VGG-19 model is established.
• The image is passed through different convolution kernels (the convolution operation).
• Feature responses with different characteristics are obtained.
• The output feature maps are computed.
• The data we need is extracted from the model.
• The activation function is applied.
• The down-sampling (pooling) layer is used to retain the main characteristics of the sample and reduce the number of parameters.
• A softmax regression classifier is used to perform the final classification [44].
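The following sketch loosely mirrors the listed pipeline by placing a new softmax classifier on top of a frozen, pre-trained VGG-19 from torchvision; the number of target classes and the weights argument are assumptions made for illustration.

import torch
import torch.nn as nn
from torchvision import models

# Pre-trained VGG-19 backbone; only the final classifier layer is replaced and trained
backbone = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False                       # keep convolution and pooling weights fixed

num_classes = 10                                  # hypothetical number of target classes
backbone.classifier[6] = nn.Linear(4096, num_classes)   # new final (trainable) layer

x = torch.randn(2, 3, 224, 224)                   # two dummy 224 x 224 RGB images
logits = backbone(x)
probs = torch.softmax(logits, dim=1)              # softmax regression over the classes
print(probs.shape)                                # torch.Size([2, 10])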
We took advantage of an already existing model, the pix2pix model, with the addition of a loss function, for training our CycleGAN model. Our model is trained on two different datasets. Firstly, we used the unpaired dataset of horse and zebra images, which contains a total of 2,661 images; 2,401 of these images were used for training the model and 260 images were used for testing. For the second dataset, we used unpaired training data containing paintings by Monet and photographs. This dataset contains 8,231 images in total, of which 7,359 and 872 images are split into training and test sets respectively [45]. In order to avoid overfitting, image augmentation techniques such as random jittering and mirroring are applied to the training data in both cases (a short sketch of this augmentation is given after this paragraph). In CycleGAN, two generators and two discriminators are trained simultaneously, instead of one of each. One generator receives additional feedback from the other generator throughout the training phase. This feedback ensures that an image formed by a generator is cycle consistent, meaning that applying both generators to an image sequentially should result in a similar image. The discriminators, as in ordinary GANs, are then used to determine whether the images generated by the matching generators are real or not. With the feedback of their respective discriminators, the generators can improve their performance during this process [46]. We calculate the cycle consistency loss in order to achieve unpaired image-to-image translation. Cycle consistency makes sure that the generated image and the original input are close to each other; hence, the lower the value of the cycle consistency loss, the better the model. For our model, we used a modified U-Net generator, and the generator and discriminator losses from the pix2pix model are used. For the first dataset, the model was trained for about 50 epochs, which took about 12 hours on Google Colab. For the second dataset, the model was trained for 70 epochs, which took about 16 hours on Google Colab. The results are obtained by running the model on the test dataset and are presented in the results section.
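A short TensorFlow sketch of the random jittering and mirroring augmentation is given below. The resize-to-286-then-crop-to-256 recipe is the one commonly used with the pix2pix/CycleGAN tutorials and is an assumption, not a value stated in this report.

import tensorflow as tf

def random_jitter(image):
    # Augmentation used to reduce overfitting: enlarge, randomly crop back to
    # 256 x 256 (random jittering) and randomly mirror the image
    image = tf.image.resize(image, [286, 286],
                            method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    image = tf.image.random_crop(image, size=[256, 256, 3])
    image = tf.image.random_flip_left_right(image)
    return image

dummy = tf.random.uniform([256, 256, 3])
print(random_jitter(dummy).shape)   # (256, 256, 3)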
In this project, StyleGAN2 with Adaptive Discriminator Augmentation (ADA) [47] has been used to generate artistic images. StyleGAN2-ADA is an improved version of StyleGAN that is notable for retaining the excellent quality of previous works while greatly reducing the number of training images required. It is also faster to compute than its predecessors. This advancement makes training high-quality GANs much more practical: instead of having to create or obtain a dataset with tens of thousands of photos, it is now possible to get good results with just a few thousand images or even fewer. For this project, the PyTorch implementation of StyleGAN2-ADA [48] has been used, as the PyTorch implementation is slightly faster and less GPU-intensive than the TensorFlow implementation. The WikiArt dataset has been used to train this model. The dataset contains 10,912 images of different types of artworks obtained from WikiArt [49]. The images were all of different dimensions, so they have been resized to 1024 x 1024 resolution for uniformity before training the model (a small pre-processing sketch is given after this paragraph). In order to reduce time and computation, this project also employs transfer learning from a previously trained model. Transfer learning is a training strategy that uses a previously trained version of the model, developed on a similar dataset, to reduce training time. Rather than starting from scratch, the model resumes training from the previously trained weights. This is especially useful for larger models that require weeks of training on costly infrastructure [50]. It has also been noted that images produced by a model trained with transfer learning appear more polished than images created by models trained from scratch. In this project, the StyleGAN2-ADA model trained on the FFHQ faces dataset at 1024-pixel resolution has been used as the starting point, and the model was then further trained on the WikiArt dataset, which also contains images at 1024-pixel resolution. The model was trained for 160 ticks on 640 kimg, which took about 82 hours of training on Google Colab Pro. The results obtained by the model, and how they evolved over time, are presented in the results section of the report.
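The following Pillow sketch illustrates the kind of resizing pre-processing described above; the folder names are hypothetical placeholders, not the actual paths used in the project.

from pathlib import Path
from PIL import Image

SRC = Path("wikiart_raw")          # hypothetical input folder
DST = Path("wikiart_1024")         # hypothetical output folder
DST.mkdir(exist_ok=True)

# Resize every image to the uniform 1024 x 1024 resolution expected by the model
for img_path in SRC.glob("*.jpg"):
    with Image.open(img_path) as img:
        img.convert("RGB").resize((1024, 1024), Image.LANCZOS).save(DST / img_path.name)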
For this project, CLIP along with StyleGAN has been used to perform text guided art
generation. The StyleGAN model generates an image and then CLIP measures how well
the image matches the text. Then, the model uses the feedback obtained from the CLIP
model to generate more “accurate” images. This iteration will be done many times until the
CLIP score becomes high enough and the generated image matches the text. Thus, the
model takes a prompt as input and then generates an art image that resembles the prompt.
The StyleGAN2 model trained on the WikiArt dataset is used with CLIP for this project. The model combined with CLIP is run for 300 optimisation iterations for each prompt. Since the generated image depends on the given prompt, a good deal of prompt engineering can also be done to obtain more specific and relevant results.
CHAPTER 4
TECHNOLOGIES USED
4.1 Python 3
Python is a high-level programming language, which means that much of its syntax is easily understandable by humans. In order for a machine to understand the code, it makes use of an interpreter. Python is also a dynamically typed language, which makes programming in Python easier as the programmer does not have to declare variable types because they are inferred by the interpreter depending on their use.
There are many features that make Python an excellent choice for machine learning: it is simple to code, and it has a very vast collection of libraries (for example scikit-learn, NumPy, TensorFlow) which contain functions for nearly every task a programmer might need to perform, so development time is reduced and productivity rises. Some of the libraries used in this project are discussed below:
• NumPy: This library allows us to work with multi-dimensional arrays as well as
matrices. It has inbuilt functions for most of the operations regarding matrices like
transpose and many mathematical functions like square root, sin, cos, etc.
• PyTorch: PyTorch is an open source machine learning (ML) framework based on the
Torch library and the Python programming language. It is one of the most popular deep
learning research platforms. The framework is designed to accelerate the transition
from research prototyping to implementation. PyTorch, like NumPy, works with
tensors, which are accelerated by graphics processing units (GPU). Tensors are
multidimensional data structures that may be modified and operated on through APIs.
Over 200 different mathematical operations are supported by the PyTorch framework.
• TensorFlow: TensorFlow is a Python library which is used for rapid numerical processing. It is also used for creating deep learning models. This library is maintained by Google.
• Pillow: The Python Imaging Library enhances your Python interpreter's image
processing capabilities. This library supports a wide range of file formats, has an
efficient internal representation, and can do some image processing. The core image
library is built to provide quick access to data in a few basic pixel formats. It should
serve as a good foundation for an image processing tool in general.
• Time: The Python time module offers a variety of ways to represent time in code, including objects, numbers, and strings. It also has features beyond expressing time, such as waiting during code execution and measuring code efficiency.
• Matplotlib: Matplotlib is a Python library that aids in data visualisation, analysis, and interpretation through graphical and pictorial representations. It allows you to create static, animated, and interactive visualisations.
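As a small, purely illustrative sketch of how a few of these libraries fit together in practice (the values used here are arbitrary):

```python
import time
import numpy as np
import torch
import matplotlib.pyplot as plt

start = time.time()                               # time: measure how long a piece of code takes

a = np.array([[1.0, 2.0], [3.0, 4.0]])            # NumPy: multi-dimensional arrays
print(a.T, np.sqrt(a))                            # transpose and element-wise square root

t = torch.from_numpy(a)                           # PyTorch: tensors, similar to NumPy arrays
if torch.cuda.is_available():
    t = t.to("cuda")                              # tensors can be moved to the GPU for acceleration
print(t @ t)                                      # matrix multiplication on the tensor

plt.plot(np.sin(np.linspace(0, 2 * np.pi, 100)))  # Matplotlib: a simple static visualisation
plt.savefig("sine.png")

print(f"elapsed: {time.time() - start:.2f} s")    # code efficiency measured with the time module
```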
Google Colab is a fantastic tool for learning Python and efficiently building machine learning models. Team members may exchange and edit notebooks simultaneously, even remotely. The notebooks can also be shared with the public by publishing them on GitHub. Many prominent machine learning libraries are supported by Colab, including PyTorch, TensorFlow, Keras, and OpenCV. A current limitation is that it does not support R or Scala. Sessions and their size are likewise limited. These are minor sacrifices when weighed against the advantages.
Google Drive is a cloud-based file storage and syncing service created by Google. Users
can use Google Drive to store data in the cloud (on Google's servers), sync files between
devices, and share files. Google Drive provides offline programmes for Windows and
macOS PCs, as well as Android and iOS smartphones and tablets, in addition to a web
interface. Google Drive provides customers with 15 GB of free storage with Google One.
Google One also provides 100 GB, 200 GB, and 2 TB storage space via optional premium
plans.
In our project, we used Google Drive to store and retrieve our models that had been trained for a certain period of time whenever Colab got disconnected due to a runtime disconnect. We also stored the datasets for a few of our models in Google Drive, and we used it to store our generated results as well.
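A minimal sketch of how a partially trained model could be saved to and restored from Google Drive inside a Colab notebook is given below; the checkpoint path is an assumption, and generator stands for an already constructed PyTorch model.

```python
import torch
from google.colab import drive

drive.mount('/content/drive')   # makes Google Drive available inside the Colab runtime

# Assumed checkpoint path inside Drive; 'generator' stands for an already constructed model
CKPT = '/content/drive/MyDrive/art_project/cyclegan_generator.pt'

# Save the partially trained generator so training can resume after a runtime disconnect
torch.save(generator.state_dict(), CKPT)

# ...later, in a fresh Colab session, restore the weights and continue training
generator.load_state_dict(torch.load(CKPT))
```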
CHAPTER 5
RESULTS
Fig 5.2: Input data for VGG Model
The following results were obtained after training the model further. The results obtained are better compared to the previously generated images.
In this example, we have taken a content image of the Rome Colosseum and a style image of The Muse, a painting by Pablo Picasso. The generated image resembles the content image but is painted in the manner of the style reference image.
In this example, we have taken a content image of Elon Musk and a style image of an artist's artwork. The generated image resembles the content image but is painted in the manner of the style reference image.
In this example, we have taken a content image of the actor Robert Pattinson and a style image of Girl with a Mandolin, an analytic cubist painting by Pablo Picasso. The generated image resembles the content image but is painted in the manner of the style reference image.
5.2 CycleGAN
5.2.1 Horse to Zebra Image-to-Image Translation
Fig 5.12: Generators after Training the CycleGAN Model on Horse2Zebra Dataset for 50
epochs
Fig 5.13: Discriminator after training the CycleGAN Model on Horse2Zebra
Dataset for 50 epochs
This is how the discriminator looked after training the model for 50 epochs. It started differentiating between real and fake images correctly.
We provided an input image (original image) from the training dataset to the model trained for 50 epochs, and the image generated with the help of the generator is presented here. The model successfully converts an image of a horse into an image of a zebra and vice versa.
5.2.2 Monet Painting to Photograph Image-to-Image Translation
The following results were obtained by training the model on the Monet2Photo TensorFlow dataset, which was divided into train and test sets. The model was trained on Google Colab for 65 epochs, which took around 17 hours.
Fig 5.15: Generators after Training the CycleGAN Model on Monet2Photo Dataset for 65
epochs
Fig 5.16: Discriminator after training the Model Monet2Photo Dataset for 65 epochs
This is how the discriminator looked after training the model for 65 epochs on the Monet2Photo dataset. It started differentiating between real and fake images correctly.
Original Image Image Generated after 65 epochs
We provided an input image (original image) from the training dataset to the model trained for 65 epochs, and the image generated with the help of the generator is presented here. The model successfully converts a Monet painting into a photograph and vice versa.
5.3 StyleGAN
Since Transfer Learning is being used in this project for the training of the model, the
images generated initially are of faces as the model being used is the StyleGAN model
trained on the FFHQ dataset.
Fig 5.18: Initial generated images before training of model
We will be giving the WikiArt dataset images as input to this model so that the model starts training on the required art data and begins generating art images. A sample of the dataset is shown in the image below.
Now we will start training the model on our dataset and observe how the generated images change. After 2 hours of training the model, it was observed that the generated images had slowly started changing into artwork and appeared like a fusion of the artworks and the face images. The figures below show how the generated images change with time.
Fig 5.22: Generated Images after 14 hours of training
Fig 5.24: Generated images after training the model for 48 hours
Fig 5.25: Generated images after training the model for 82 hours
5.4 StyleGAN+CLIP
The figures below show the results obtained when a prompt is given to CLIP and how CLIP
guides StyleGAN to generate the most appropriate image based on the prompt.
Initial generated image Image generated after 300 iterations
Fig 5.28: Result obtained when “Fantasy Kingdom Deviantart” was given as prompt
Initial generated image Image generated after 300 iterations
Fig 5.30: Result obtained when “Underwater Castle ArtstationHQ” was given as prompt
CHAPTER 6
CONCLUSION
A brief review of the current status and progress of creating art using deep neural networks was covered in this report to show the various areas of research it takes part in. A brief summary of Machine Learning and Artificial Intelligence was presented, and the relevance of Convolutional Neural Networks and Generative Adversarial Networks in the field of AI Art was discussed. The applications of AI Art and the various techniques like Neural Style Transfer that are used in AI Art were also discussed and reviewed. The recent development of GANs like CycleGAN, StyleGAN, etc., and how they are being used in AI Art as well as other fields of science and industry, is another factor for optimism.
Based on the research done and papers read, we can conclude that CycleGAN is thus far the best model that has been able to achieve Neural Style Transfer. It outperforms the conventional VGG-19 framework and is able to transfer the entire style of an artist instead of the style of a single artwork. In the case of AI-generated art, various models like AICAN, BigGAN, Conditional GAN, etc. have been used; however, it is difficult to compare the results of these GANs as the functions they perform, the output they generate and the datasets they have been trained on are different. Based on the FID alone, the StyleGAN2 model generated the best results, which were original and hard to distinguish from human-created art.
Based on the results of our research, we trained VGG-19 as well as CycleGAN models for achieving Neural Style Transfer. Using VGG-19, we were able to showcase how the style and content of images mix together to form artwork. Using CycleGAN, we were able to generate an image of a zebra from an image of a horse without having a paired dataset. A few more applications of CycleGAN might include converting a sketch of a suspect into a real-life photograph or showing how a place might appear during a particular season. Further, we were also able to generate the content images in the style of Monet's paintings, thus showcasing that unlike VGG-19, CycleGAN is able to transfer the entire style of an artist instead of just the style of a single painting.
We also trained the StyleGAN2 model on the WikiArt dataset for generating original artwork. Using StyleGAN2, we were able to showcase how the model makes use of transfer learning to change how it generates images, and we demonstrated how the face images that were originally being generated slowly changed into artworks as the model trained on the WikiArt dataset. Using StyleGAN2, we were able to generate completely AI-made artworks which were realistic and appeared similar to human-made artworks. We observed that the artworks being generated by the StyleGAN2 model were random, and we wanted more specific results. Thus, we used our StyleGAN2 model and combined it with CLIP to perform text-guided art generation. We observed that the CLIP model was able to guide the image generation of StyleGAN2 very well, and we were able to obtain more relevant results which closely matched the prompt given to the model.
In conclusion, it can be said that the future of AI Art is very bright, and strongly connected
with the evolution and development of GANs. It is a field with a lot of potential but requires
a large amount of research and technological advancements as it is still in its early stages.
CHAPTER 7
FUTURE SCOPE
As the field of AI Art is still in its early stages, a large amount of research and technological
advancements can still be made in this field. Further refinement to the existing GAN
models can be made to further improve the results. Additionally, more types of GANs can
be developed with further research in this field. Significant advancements in the
development of multimodal generative models, such as models that can create images from
text, have recently been made. Art production and creativity will almost certainly be
influenced by technological advances in this direction. Because the concept of
multimodality is intrinsic to many art forms and has long played a significant part in the
creative process, models that can convert input from diverse modalities into a joint
semantic space constitute an appealing tool for artistic experimentation. Thus, further
research in multimodal generative models can also lead to significant developments in the
field of AI Art.
Due to the current global pandemic, there has been a rapid shift of attention towards digital
showrooms and online platforms which has further contributed to the already rising interest
in blockchain technologies and crypto art which can have a huge impact and can
significantly transform the art market. The interest in NFTs, especially digital art based
NFTs is on the rise and AI generated Art can also be sold in this market.
REFERENCES
[1] Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and
prospects. Science, 349(6245), 255-260.
[2] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[3] Walczak, S. (2018). Artificial neural networks. In Encyclopedia of Information Science
and Technology, Fourth Edition (pp. 120-131). IGI Global.
[4] Abraham, A. (2005). Artificial neural networks. Handbook of measuring system design.
[5] https://www.ibm.com/cloud/learn/neural-networks
[6] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017, August). Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET) (pp. 1-6). IEEE.
[7] Crowley, E. J., & Zisserman, A. (2014, September). In search of art. In European
conference on computer vision (pp. 54-70). Springer, Cham.
[8] Cetinic, E., & Grgic, S. (2013, September). Automated painter recognition based on
image feature extraction. In Proceedings ELMAR-2013 (pp. 19-22). IEEE.
[9] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic
style. arXiv preprint arXiv:1508.06576.
[10] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using
convolutional neural networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 2414-2423).
[11] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556.
[12] https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/
[13] https://www.analyticsvidhya.com/blog/2021/06/build-vgg-net-from-scratch-with-
python/
[14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... &
Bengio, Y. (2014). Generative adversarial nets. Advances in neural information
processing systems, 27.
[15] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image
translation using cycle-consistent adversarial networks. In Proceedings of the IEEE
international conference on computer vision (pp. 2223-2232).
[16] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 4401-4410).
[17] Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high
fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
[18] https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-
adversarial-networks-gans-7a2264a81394
[19] https://neptune.ai/blog/gan-loss-functions
[20] Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative
adversarial networks. In International conference on machine learning (pp. 214-223).
PMLR.
[21] https://developers.google.com/machine-learning/gan/problems
[22] https://www.faceapp.com
[23] https://machinelearningmastery.com/what-is-cyclegan/
[24] https://blog.jaysinha.me/train-your-first-cyclegan-for-image-to-image-translation/
[25] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
[26] https://towardsdatascience.com/progan-how-nvidia-generated-images-of-
unprecedented-quality-51c98ec2cbd2
[27] https://towardsdatascience.com/explained-a-style-based-generator-architecture-for-
gans-generating-and-tuning-realistic-6cb2be0f431
[28] Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive
instance normalization. In Proceedings of the IEEE international conference on
computer vision (pp. 1501-1510).
[29] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing
and improving the image quality of stylegan. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 8110-8119).
[30] Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019, May). Self-attention
generative adversarial networks. In International conference on machine learning (pp.
7354-7363). PMLR.
[31] https://medium.com/syncedreview/biggan-a-new-state-of-the-art-in-image-synthesis-
cf2ec5694024
[32] Elgammal, A., Liu, B., Elhoseiny, M., & Mazzone, M. (2017). CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068.
[33] https://www.educative.io/edpresso/what-is-a-conditional-gan-cgan
[34] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with
conditional adversarial networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 1125-1134).
[35] Matsumura, N., Tokura, H., Kuroda, Y., Ito, Y., & Nakano, K. (2018, November). Tile
art image generation using conditional generative adversarial networks. In 2018 Sixth
International Symposium on Computing and Networking Workshops
(CANDARW) (pp. 209-215). IEEE.
[36] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018).
AttnGAN: Fine-grained text to image generation with attentional generative adversarial
networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 1316-1324).
[37] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever,
I. (2021, July). Learning transferable visual models from natural language supervision.
In International Conference on Machine Learning (pp. 8748-8763). PMLR.
[38] https://openai.com/blog/clip/
[39] https://towardsdatascience.com/generating-images-from-prompts-using-clip-and-
stylegan-1f9ed495ddda
[40] https://machinelearningmastery.com/how-to-evaluate-generative-adversarial-networks/
[41] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X.
(2016). Improved techniques for training gans. Advances in neural information
processing systems, 29.
[42] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans
trained by a two time-scale update rule converge to a local nash equilibrium. Advances
in neural information processing systems, 30.
[43] Li, M., Lin, J., Ding, Y., Liu, Z., Zhu, J. Y., & Han, S. (2020). Gan compression:
Efficient architectures for interactive conditional gans. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 5284-5294).
[44] Yang, Z. (2021). Classification of picture art style based on VGGNET. In Journal of
Physics: Conference Series (Vol. 1774, No. 1, p. 012043). IOP Publishing.
[45] https://www.tensorflow.org/datasets/catalog/cycle_gan
[46] http://cs230.stanford.edu/projects_fall_2020/reports/55792990.pdf
[47] Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., & Aila, T. (2020). Training
generative adversarial networks with limited data. Advances in Neural Information
Processing Systems, 33, 12104-12114.
[48] https://github.com/NVlabs/stylegan2-ada-pytorch
[49] https://www.kaggle.com/c/painter-by-numbers
[50] Bozinovski, S. (2020). Reminder of the first paper on transfer learning in neural
networks, 1976. Informatica, 44(3).
[51] https://commons.wikimedia.org/wiki/File:Python-logo-notext.svg
[52] https://colab.research.google.com/
[53] https://tech.hindustantimes.com/how-to/organize-your-files-in-google-drive-on-
iphone-computer-android-here-is-how-to-71641109305422.html
APPENDIX
ABSTRACT
The innovations and advancements in the field of AI have motivated researchers to look into the application of AI in several disciplines during the last few years. As a result, the field of AI Art has emerged. This paper examines the advancements and breakthroughs in the field of AI Art, as well as how deep neural networks are being used to create art. In this paper, an analysis of two aspects of AI Art has been done: firstly, how AI is being used for art analysis, and secondly, how AI is being used for creative purposes and generating artworks. In the context of AI-related research for art understanding, we present a comprehensive study on various Machine Learning tasks like classification, object detection, automatic painter detection based on painters' style, etc. In the context of the role of AI in creating art, a comprehensive study on various techniques such as Convolutional Neural Networks, Neural Style Transfer and Generative Adversarial Networks has been presented.
1. INTRODUCTION
Undoubtedly, one of the most trending fields of today’s day and age is Artificial
Intelligence (AI). Artificial Intelligence is a broad branch of Computer Science which deals
with the creation and development of intelligent or smart machines that are able to perform
tasks and make decisions which would require human intelligence. These smart machines
are able to learn from their experiences, draw conclusions based on the information they
have and adjust to new unseen input data to make logical decisions and perform human-
like tasks.
At first, Artificial Intelligence was only viewed as a research field and lots of computer
scientists did extensive research on AI and ML, which is a subfield of AI. Based on this
research, a lot of practical applications of AI and ML were recognized and hence the field
has progressed a lot and now, they are being used in almost all fields of study ranging from
Computer Science to Medicine. Thus, in the past two-three decades, the field of AI has
evolved from a field of study and research to a practical technology which has widespread
commercial use and is being used everywhere from chess-playing computers to self-driving
cars.
Another sub-field of AI which has seen a lot of progress and research is Deep Learning.
Deep Learning is a sub-field of Machine Learning which essentially consists of Neural
Networks which have various layers. These neural networks are designed to behave
similarly to the human brain, and so Deep Learning is used to solve complex problems and
hence neural networks are usually considered to give the best possible outcomes to these
complex problems. Due to the complex nature of Neural Networks and the many useful
applications of deep learning, significant research has been carried out in the field of deep
learning and different types of neural networks have been developed to tackle various types
of complex problems.
Thus, this extensive research in AI has prompted researchers to critically analyze and
discuss how AI and Machine learning can be applied in a more unconventional manner and
to look at the field from a newer, more creative perspective. One such field where AI is
being used is for the creation and understanding of art. As a result, a new kind of art named
artificial intelligence art (AiArt) has emerged, which is a creative activity that combines
art with technology by using AI as the core medium to create, express thoughts and
emotions. This can be done using Computer Vision. Computer Vision is an application of AI in which Machine Learning and Deep Learning are used to make observations and important inferences from input images, videos or other visual inputs, and to take action or make recommendations based on that information.
Art can be described as a way to compose a complex interplay between content and style
of an image to show and demonstrate different ideas and concepts visually. Artists have
many different ideas and many different ways of representing these ideas. Thus, various
art pieces differ in style and complexity based on the artist that has created the art piece.
Due to the uniqueness, creativity and complexity involved in the process of creating art, it
was considered impossible for machines to mimic this process and create their own art.
However, developments in AI have made this process possible. ML and DL are used to understand and recognize the various styles of these art images in order for models to create their own style and generate their own art.
In this review paper, a study of the advancements made in the field of AI Art will be carried out, and the various algorithms and methods used in the field will be discussed, explored and compared based on the existing literature available on AI Art.
2. LITERATURE REVIEW
Machine learning studies how a machine may learn to accomplish something without being
specifically taught to do so. It is commonly used to anticipate outcomes based on existing
data or to categorise data into several labels. The study and development of machine learning has increased rapidly in the past two to three decades, growing from a field of study and research to a practical technology in widespread commercial use. Within the field of artificial intelligence (AI),
ML is considered very popular and has been highly utilized for creation of software for
addressing various automation problems like natural language processing, computer
vision, object detection, speech recognition, classification problem, etc. Many developers
involved in development of AI applications now recognise that it’s much easier to train a
system on already available data than to manually program the algorithm by anticipating
the desired response. The impact of machine learning is also common in the computing
industry and various industries related to data-intensive topics, such as consumer services,
troubleshooting of complex systems, and supply chain control. It has a wide range of
influence in empirical sciences from biology to cosmology to social sciences, because ML
techniques have been developed to analyse powerful experimental data in novel ways.
A significant number of ML algorithms have been developed to cover a vast variety of data
and problem types represented in various machine learning problems. Conceptually, a machine learning algorithm can be seen as a search through a large space of candidate programs, guided by training data. This experience from the training data is used to optimize the performance of the program. Machine learning algorithms vary widely,
partly because of the way they represent candidate programs (for example, decision trees,
mathematical functions, and general programming languages), and partly because of the
way they search the programming space (for example, optimization algorithms with easy-
to-understand convergence).
Artificial neural networks (also simply known as neural networks) are pattern-oriented models that can be used to solve classification and time-series problems. Due to the nonparametric nature of neural networks, models can be created without prior knowledge of the data distribution or of effects caused by possible interactions between variables, which is required by most parametric statistical methods. The process used in neural networks is similar to the way the human brain operates, and hence neural networks are usually considered to give the best possible outcome. A neural network consists of various nodes where each node is similar to a neuron present in our brain. Each node is connected to the next node through weights which are calculated during the execution of the machine learning algorithm [3].
The basic architecture of neural networks consists of three types of neuron layers: input, hidden, and output layers. In feed-forward networks, the data moves in a feed-forward direction from the input units to the output units. A value is calculated for each node based on the values of the previous nodes, and this process of calculating and storing variables from the input layer to the output layer is known as forward propagation. Feed-forward networks do not contain any feedback connections, and hence data flows only in the forward direction. Recurrent networks contain feedback connections and thus allow backpropagation, which is a method for calculating the gradients of the neural network parameters by traversing the network in the backward order from the output layer to the input layer. A neural network must be set up in such a way that when a set of inputs is applied, the desired set of outputs is produced. The strength of the connections can be determined using a variety of approaches. One method is to use a priori information to explicitly set the weights. Another method is to train the neural network by giving it training patterns and allowing it to adjust its weights based on some learning rule [4].
Fig 2: Deep Neural Network Architecture
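As a concrete, minimal illustration of the feed-forward computation and weight adjustment described above (a sketch only, with arbitrary layer sizes and random data):

```python
import torch
import torch.nn as nn

# A tiny feed-forward network: input layer -> hidden layer -> output layer
model = nn.Sequential(
    nn.Linear(4, 8),   # weights connecting the input units to the hidden units
    nn.ReLU(),         # non-linearity applied at the hidden layer
    nn.Linear(8, 2),   # weights connecting the hidden units to the output units
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 4)           # a batch of 16 random example inputs
y = torch.randint(0, 2, (16,))   # arbitrary target classes

out = model(x)                   # forward propagation: values flow from input to output
loss = loss_fn(out, y)
loss.backward()                  # backpropagation: gradients flow from output back to input
optimizer.step()                 # the weights are adjusted according to the learning rule
```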
The Convolutional Neural Network (CNN) is one of the most widely used deep neural
networks. Convolution can be described as a mathematical linear operation performed between matrices. The convolutional layer, non-linearity layer, pooling layer, and fully connected layer are some of the layers of a CNN. Pooling and non-linearity layers do not have parameters, whereas convolutional and fully connected layers do [5]. An important
characteristic of CNN is that it is able to pre-process data by itself thus saving time and
resources in the data pre-processing part. The CNN might require a bit of hand engineering
of the features in the beginning but as the machine learning algorithm progresses, the CNN
is able to adapt and learn these new features and develop its own filters. Thus, the CNN is
continuously evolving with growth in data. A CNN works well for data that is represented
as grid structures, hence it works well for pattern recognition and image classification
problems as an image is nothing but a matrix of pixel values. Since the data present in an
image is in the form of a grid structure, by applying various filters the CNN is able to
capture the spatial and temporal dependencies of the image and thus can be trained to
understand and finely characterise the input image. Thus, the role of a CNN is to reduce
the image into a form that is simpler to process without sacrificing features that are critical
to obtaining an accurate result and since the CNN is reducing the image into a simpler
form, it can easily be scaled to very large datasets as well.
Thus, the working of a CNN is as follows:
• Convolution layer: This layer computes the dot product between the weights of the
neuron and the region of the input image that are related.
• Non-Linearity Layer: The non-linearity can be used to adjust or cut off the generated output. This layer is applied in order to saturate or limit the generated output. [5]
• Pooling Layer: The pooling layer will down sample the spatial dimensions of the
image. Thus, it reduces the amount of computation to be performed and the number of
features to learn.
• Fully Connected Output layer: The fully connected output layer is the classifier layer
and gives us the final output by classifying the images into the required categories.
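A minimal sketch of a CNN that stacks the four layer types listed above is given below; the layer sizes are arbitrary and chosen only for illustration.

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer: dot products between filters and image regions
    nn.ReLU(),                                   # non-linearity layer: saturates / limits the output
    nn.MaxPool2d(2),                             # pooling layer: down-samples the spatial dimensions
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # fully connected output layer: classifies into 10 categories
)
# For a 3 x 32 x 32 input image, the pooled feature map is 16 x 16 x 16, which fixes the Linear layer size.
```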
Since CNN is widely used for image classification, there has been a significant amount of
research using CNN in the field of AI art as well. Crowley et al. (2014) [6] demonstrated
in their research paper that object classifiers, learnt using CNN features computed from
various natural image sources, can retrieve paintings containing these objects. They were
not only able to detect the objects, but were also able to make annotations on the objects to
show how these objects in the paintings have evolved over time.
After this paper, various advancements were made in object detection, and various problems like identifying positions of objects, face recognition in paintings, analysing the gender of faces in paintings, etc. were also addressed. Researchers thus realised that not only objects, but also the style of an image could be detected using CNNs, and these advancements further supported the rising interest in AI art. Cetinic et al. (2013) [7]
proposed an approach for automated classification of paintings by artist. In their model, the
individual style of an artist is recognized by analysing specific components of a painting
which distinguishes the work of an individual from the works of others. Once the style of
the artists has been recognized, their various paintings can be automatically classified.
The research paper by Gatys et al. (2015) [8] [9] was one of the first papers that introduced Neural Style Transfer, which triggered the rapid use and development of AI in the field of art. In this paper, the authors present an artificial neural system which makes use of VGG-19 [10]. This work is the first to show how image attributes may be used to separate content from style in natural images.
Another such revolutionary innovation in the field of AI art was the invention of Generative Adversarial Networks (GANs). GANs were introduced by Goodfellow et al. (2014) [11] and constitute a significant milestone in the effort to use machines to create new visual content. A GAN's main mechanism is to train two "competing" models: a generator and a discriminator, which are commonly implemented as neural networks. The generator's purpose is to generate realistic images by capturing the distribution of actual examples in the input sample, while the discriminator is trained to categorise generated images as fake and images from the original sample as real. The optimization process terminates at a saddle point that is a minimum with respect to the generator and a maximum with respect to the discriminator, as it is designed as a minimax optimization problem. This framework's implementation produced outstanding results in terms of creating plausible false variants of actual images for many sorts of image content. GANs quickly rose to prominence as one of the most important areas of artificial intelligence research, with several advanced and domain-specific versions of the original design emerging, e.g. CycleGAN [12], StyleGAN [13] or BigGAN [14].
CycleGAN was proposed by Zhu et al. (2017) [12]. In this paper, the authors present a system that can learn to capture the special characteristics of one image collection and figure out how these characteristics could be translated into another image collection, all in the absence of any paired training examples. The key finding of this paper was that the CycleGAN model can be used to transfer the style of an image. Unlike Neural Style Transfer, however, CycleGAN learns to emulate the style of a complete collection of artworks (e.g., Van Gogh), rather than a single piece of art (e.g., Starry Night). The authors were also able to do the reverse of this process, i.e., converting a painting into a real image by making use of the CycleGAN model.
Modifications to both the architecture of the model and the methods of training StyleGAN were proposed by Karras, T. et al. (2020) [16] in their paper. In order to improve conditioning in the mapping from latent codes to pictures, they revised the generator normalisation, revisited progressive growing, and regularised the generator. This improves the image quality generated by StyleGAN, and the resulting model is known as StyleGAN2.
Brock, A. et al. (2018) [14] proposed the concept of BigGAN, which is based on the Self-Attention GAN model (SAGAN) [17]. They demonstrated that GANs trained to model natural images of multiple categories benefit highly from scaling up, both in terms of fidelity and variety of the generated samples. As a result, the model sets a new level of performance among ImageNet GAN models, improving on the state of the art by a large margin. They also presented an analysis of the training behaviour of large-scale GANs, characterized their stability in terms of the singular values of their weights, and discussed the interplay between stability and performance.
To enhance the GAN technology's creative content generation capabilities even further, Elgammal et al. [18] introduced AICAN, the artificial intelligence creative adversarial network. They suggest in their study that if a GAN model is trained on paintings, it will only learn how to make images that look like existing art, and that this, like the Neural Style Transfer technique, would not produce anything truly artistic or innovative. The model they propose generates art in a more creative manner by learning the style of paintings and then deviating away from the learned style to make the artwork creative.
In their paper, Matsumura, N. et al. (2018) [19] propose a tile art image generation method which makes use of machine learning methods and is based on the conditional GAN. In a conditional GAN, a source image is received as an additional input to the GAN. In the proposed method, the authors directly produce a tile art image from an input image using a conditional GAN. The tile art image is generated using forward propagation. The architecture of the network in the proposed method is modelled on the pix2pix method with conditional generative adversarial networks [20].
Xu, T. et al. (2018) [21] proposed an Attentional Generative Adversarial Network (AttnGAN) that allows multi-stage attention-driven refinement for fine-grained text-to-image generation. By paying attention to the important words in the natural language description, the AttnGAN can synthesise fine-grained details in different sub-regions of the image using a novel attentional generative network. In addition to this, for training the generator, a deep attentional multimodal similarity model is presented to compute a fine-grained image-text matching loss. Visualizing the AttnGAN's attention layers allows for a more complete study. It is the first time that the layered attentional GAN has been shown to be capable of selecting the condition at the word level for producing distinct parts of the image automatically.
3. METHODOLOGY
The goal of this research paper was to address and answer some questions related to the
subject of AI Art (Creating Art using Deep Neural Networks) through research on the topic.
To answer these questions, a qualitative approach of research was used in the research
paper. A qualitative approach is a type of research approach in which other people's views and understanding of the issue are used to gain insight into the topic. This particular type of
research helps the researcher to develop their own ideas and helps the reader to increase
their understanding of the topic.
In the case of this research paper, previously published research papers were used to obtain
information and collect data related to the topic. Previously published research papers
provided a more technical and in-depth view of the topic and helped us understand the different views of the scientists and researchers related to the topic. The studies and research of other scientists help the reader to get an idea of the current situation of the topic and to gain insight into it. This has helped us obtain answers to many of the questions listed above and is an efficient method of obtaining information.
This particular method of research was decided upon as it was felt to be a more efficient and reliable method of research. The information obtained through the said method
is based on proper experimentation and research and is very factual and reliable. Also, the
information obtained is more thorough, concise, technical and provides a clear view of the
subject. It also helps to understand the progress of the topic and how its understanding and
techniques have changed over the years and how the subject has evolved. It also provides
many different views of the same topic and helps the reader to look at the topic from various
different perspectives and provides a complete overview of the topic.
Based on the papers read and the research conducted, various types of deep neural network
models like VGG and types of GANs were identified that were successfully able to create
AIArt. The working of these models has been discussed in this part of the paper.
The first model that was used to perform Neural Style Transfer was the VGG model [10]. It is a CNN that rivals human performance on a common visual object recognition benchmark test. This model uses the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG network. The key idea behind the algorithm is to iteratively optimise an image with the objective of matching desired CNN feature distributions, involving both the photo's content information and the artwork's style information. The concept separates image content from style, allowing any image's content to be recast in the style of any other image. This is demonstrated by the generation of innovative, beautiful visuals that combine the style of numerous well-known paintings with the content of a randomly chosen photograph. In particular, the feature responses of high-performing deep neural networks trained on object detection are used to create neural representations for the content and style of an image. [9]
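The style representation in [9] is based on Gram matrices of the VGG feature maps. A minimal sketch of how the content and style losses could be computed from such feature maps is shown below; the tensor shapes are assumptions for illustration.

```python
import torch

def gram_matrix(features):
    # features: (channels, height, width) feature map taken from one VGG layer
    c, h, w = features.shape
    f = features.view(c, h * w)
    return f @ f.t() / (c * h * w)        # correlations between feature channels

def content_loss(gen_features, content_features):
    # match the feature responses of the generated image to those of the content photograph
    return torch.mean((gen_features - content_features) ** 2)

def style_loss(gen_features, style_features):
    # match the Gram matrices of the generated image to those of the style artwork
    return torch.mean((gram_matrix(gen_features) - gram_matrix(style_features)) ** 2)
```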
CycleGAN is a type of GAN that has been used to achieve Neural Style Transfer. CycleGAN is able to learn the special characteristics of one image collection and is able to translate them into another image collection without having any paired training examples. The developers of this model first compared it to current methods for unpaired image-to-image translation using paired datasets with input-output pairs. The relevance of the cycle consistency loss and the adversarial loss was then investigated, and their full method was compared to numerous variants. Finally, they proved the applicability of their technique in a variety of applications that do not require paired data [12]. One such application where CycleGAN has been used is FaceApp, in which human faces are transformed into different age groups [22].
In StyleGAN [13], the generator no longer takes a point from the latent space as input; instead, there are two new sources of randomness used to generate a synthetic image: a standalone mapping network and noise layers. The latent space vector is passed through a mapping transformation comprising 8 fully connected layers, whereas the synthesis network comprises 18 layers, which progressively produce images from 4 x 4 up to 1024 x 1024 resolution. The output layer outputs an RGB image through a separate convolution layer. The output of the mapping network is a vector that defines the styles, which is integrated at each point in the generator model via a new layer called adaptive instance normalization. The use of this style vector gives control over the style of the generated image [23].
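A minimal sketch of the adaptive instance normalization operation mentioned above, assuming the per-channel scale and bias have already been derived from the style vector, might look as follows.

```python
import torch

def adaptive_instance_norm(x, style_scale, style_bias, eps=1e-5):
    # x: (batch, channels, height, width) feature maps inside the synthesis network
    # style_scale / style_bias: (batch, channels) values derived from the style vector
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - mean) / std                 # instance normalisation per channel
    return style_scale[..., None, None] * normalized + style_bias[..., None, None]
```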
AICAN is a model that can be used to generate art in a more creative manner. In their
paper, the authors propose changing the optimization criterion to allow the network to
develop new art by increasing divergence from established styles while remaining within
the art distribution. This deviation ensures that the generated artwork seems original and is
different from existing artistic styles. Authors of the AICAN system also demonstrated that
viewers were frequently unable to distinguish between AICAN-generated images and
artworks created by a human artist through a series of exhibitions and experiments. Thus,
we can state that the results generated by the AICAN model are novel and are
indistinguishable from human made artwork.[18]
4. COMPARATIVE ANALYSIS
The generator model of a GAN is trained using a second model called a discriminator that
learns to classify images as real or fake, unlike conventional deep learning neural network
models that are trained with a loss function until convergence. To ensure equilibrium, both
the generator and discriminator models are trained concurrently. As a result, there is no
objective loss function used to train GAN generator models, and there is no way to
objectively assess the training progress and the relative or absolute quality of the model
based on loss alone.
Therefore, it becomes really difficult to quantify and compare the performance of GAN
models using conventional metrics. Thus, to compare the performance of GAN models,
metrics like Fréchet Inception Distance (FID) and Inception Score (IS) are generally used.
The Inception Score (IS) is a metric used for assessing the quality of generated images, particularly synthetic images produced by GAN models. To classify the generated images, the Inception Score employs a pre-trained deep learning neural network model for image classification. The model is used to classify a large number of generated images, and the probability of each image belonging to each class is predicted. The Inception Score is created by combining these predictions. The goal of the score is to capture two characteristics of a set of generated images:
• The quality of the generated images
• The diversity of the generated images
The Inception Score has a lowest value of 1.0 and a highest value equal to the number of classes supported by the classification model. A higher Inception Score indicates better performance of the GAN. [25]
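A minimal sketch of how the Inception Score could be computed from the predicted class probabilities (assuming the Inception classifier has already been run on the generated images) is shown below.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (num_images, num_classes) class probabilities predicted by the
    # pre-trained Inception classifier for each generated image
    marginal = probs.mean(axis=0)                                # p(y): reflects the diversity of the set
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))  # KL(p(y|x) || p(y)) for each image
    return float(np.exp(kl.sum(axis=1).mean()))                  # higher is better; the minimum is 1.0
```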
The Fréchet Inception Distance (FID) is one of the most widely used metrics for comparing real and generated images in terms of feature distance. The Fréchet Distance is a measure of similarity between curves that takes the location and ordering of the points along the curves into account. It can also be used to calculate the difference between two distributions. In the context of computer vision, particularly GAN evaluation, feature distances and the Inception model pre-trained on the ImageNet dataset are employed. The score is called "Fréchet Inception Distance" because it uses activations from the Inception model to summarise each image.
FID = ‖μx − μy‖² + Tr(Σx + Σy − 2(Σx Σy)^(1/2))      (1)
where X and Y are the real and fake embeddings (activations from the Inception model), assumed to follow two multivariate normal distributions, μx and μy are the mean vectors of the embeddings X and Y, Tr is the trace of the matrix, and Σx and Σy are the covariance matrices of the embeddings. A lower FID value means better image quality and diversity. [26]
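A minimal sketch of how Equation (1) could be computed from the Inception activations of the real and generated images is given below; it assumes the activations have already been extracted.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    # real_feats / fake_feats: (num_images, feature_dim) Inception activations
    mu_x, mu_y = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_x = np.cov(real_feats, rowvar=False)
    sigma_y = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(sigma_x @ sigma_y)      # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                     # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(sigma_x + sigma_y - 2 * covmean))
```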
For this comparative study, we compare the GANs that have been discussed in this paper based on the FID scores of each model. It is difficult to compare the results of the GANs as the functions that these GANs perform, the output generated by them and the datasets that they have been trained on are different. On the basis of FID alone, we can say that StyleGAN2 has the best FID score of 2.84 when trained on the FFHQ dataset. Out of the models trained on the ImageNet dataset, which is the most commonly used dataset, the best FID score was obtained by BigGAN.
5. CONCLUSION
A brief review of the current status and progress of creating art using deep neural networks was covered in this research paper to show the various areas of research it takes part in. A brief summary of Machine Learning and Artificial Intelligence was presented, and the relevance of Convolutional Neural Networks and Generative Adversarial Networks in the field of AI Art was discussed. The applications of AI Art and the various techniques like Neural Style Transfer that are used in AI Art were also discussed and reviewed. The recent development of GANs like CycleGAN, StyleGAN, etc., and how they are being used in AI Art as well as other fields of science and industry, is another factor for optimism.
Based on the research done and papers read, we can conclude that CycleGAN is thus far the best model that has been able to achieve Neural Style Transfer. It outperforms the conventional VGG framework and is able to transfer the entire style of an artist instead of the style of a single artwork. In the case of AI-generated art, various models like StyleGAN, BigGAN, Conditional GAN, etc. have been used; however, the AICAN model generated the best results, which were original and hard to distinguish from human-created art.
In conclusion, it can be said that the future of AI Art is very bright, and strongly connected
with the evolution and development of GANs. It is a field with a lot of potential but requires
a large amount of research and technological advancements as it is still in its early stages.
6. FUTURE WORKS
As the field of AI Art is still in its early stages, a large amount of research and technological
advancements can still be made in this field. Further refinement to the existing GAN
models can be made to further improve the results. Additionally, more types of GANs can
be developed with further research in this field. Significant advancements in the
development of multimodal generative models, such as models that can create images from
text, have recently been made. Art production and creativity will almost certainly be
influenced by technological advances in this direction. Because the concept of
multimodality is intrinsic to many art forms and has long played a significant part in the
creative process, models that can convert input from diverse modalities into a joint
semantic space constitute an appealing tool for artistic experimentation. Thus, further
research in multimodal generative models can also lead to significant developments in the
field of AI Art.
Due to the current global pandemic, there has been a rapid shift of attention towards digital
showrooms and online platforms which has further contributed to the already rising interest
in blockchain technologies and crypto art which can have a huge impact and can
significantly transform the art market. The interest in NFTs, especially digital art based
NFTs is on the rise and AI generated Art can also be sold in this market.