
Received August 26, 2021, accepted September 11, 2021, date of publication September 15, 2021, date of current version September 30, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3112996

Neural Style Transfer: A Critical Review


AKHIL SINGH1, VAIBHAV JAISWAL1, GAURAV JOSHI1, ADITH SANJEEVE1, SHILPA GITE1,2, AND KETAN KOTECHA2,3


1 Department of Computer Science and Information Technology, Symbiosis Institute of Technology, Symbiosis International (Deemed) University,
Pune 412115, India
2 Symbiosis Centre of Applied AI (SCAAI), Symbiosis International (Deemed) University, Pune 412115, India
3 Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune 412115, India

Corresponding authors: Shilpa Gite ([email protected]) and Ketan Kotecha ([email protected])


This work was supported by Symbiosis International (Deemed University), Pune, India.

ABSTRACT Neural Style Transfer (NST) is a class of software algorithms that allows us to transform scenes and change or edit the environment of a piece of media with the help of a neural network. NST finds use in image and video editing software, allowing image stylization based on a general model, unlike traditional methods. This has made NST a trending topic in the entertainment industry, as professional editors and media producers can create media faster while also offering the general public recreational use. In this paper, the current progress in Neural Style Transfer, with all related aspects such as still images and videos, is presented critically. The authors looked at the different architectures used and compared their advantages and limitations. Multiple literature reviews focus on either Neural Style Transfer of images or cover Generative Adversarial Networks (GANs) that generate video. To the authors' knowledge, this is the only research article that looks at image and video style transfer together, particularly on mobile devices with high potential usage. This article also reviews the challenges faced in applying video neural style transfer in real-time on mobile devices and presents research gaps with future research directions. NST, a fascinating deep learning application, has considerable research and application potential in the coming years.

INDEX TERMS Style transfer, video style transfer, mobile, convolutional neural networks, generative
adversarial networks.
I. INTRODUCTION
Since their conception, videos have been considered a popular multimedia tool for various functions like education, entertainment, communication, etc. Videos have become more and more popular as the effort needed to make them keeps dropping thanks to advancements in cameras and, more particularly, mobile cameras. Today, an average user uses mobile devices to capture videos rather than expensive dedicated setups [1]. On the other hand, entertainment producers use dedicated hardware and editing tools to create picturesque scenes with the help of Computer Generated Imagery (CGI) software like [2] and [3].
There have been multiple resources, approaches, improvements, and implementations since the first Generative Adversarial Network was presented by Goodfellow et al. [7]. As of now, NST is extremely popular and widely used to edit images to create a host of effects (e.g., the Prisma app) (Gatys et al. [13]) (Liu et al. [18]). Recently, developments have been observed that use NST for video style transfer (Ruder et al. [31]), (Huang et al. [32]). This has significant applications, such as in entertainment, to directly transform a scene or parts of it, which usually takes hours of manual work and supervision. It can also be used for recreational purposes, fusing with Augmented Reality to create a virtual world modeled after the real one (Dudzik et al. [33]).
Generative Adversarial Networks (GANs) have often been used to produce or synthesize data since their conception (Goodfellow et al. [7]). This makes GANs a potential candidate for generating images/videos given a set of inputs that control their structure and texture. The paper focuses on Generative Adversarial Network (GAN) developments and summarizes the advancements made to date (up to April 2021). It also describes basic techniques currently being used to transform videos and then moves on to the NST-based techniques. To understand the developments and make comparisons, all categorized papers are reviewed in four parts, as shown in Fig. 1. Each part has different objectives and key takeaways, such as advantages, limitations, research gaps, and future scope.
The papers selected for review were found using the Scopus and Web of Science databases with search terms such


FIGURE 1. An overview and categorization of the papers studied in this review.

as ‘‘video neural style transfer,’’ ‘‘real-time video neural style transfer,’’ ‘‘generative adversarial networks,’’ ‘‘video neural style transfer on mobile devices,’’ and ‘‘video style transfer improvement.’’ Papers were shortlisted if their code implementations were publicly available (on GitHub or similar services) and based on the quality of the videos they generate (as shown in their demonstrations/README).
There is currently no benchmark dataset for Neural Style Transfer. MS-COCO and Cityscapes are the two datasets most frequently utilized in the experiments within the papers reviewed.


TABLE 1. A short summary of review papers and key contributions.

These datasets are primarily used for object detection and recognition. Still, they may also be used as content photos for training various models, since the two datasets contain around 330K and 25K images, respectively. The style datasets were either scraped from online sources such as Danbooru, Safebooru, and Videvo.net or created according to the requirements of the problem statement, and were in the range of 1k to 2k images.
While reviewing the literature, a few research gaps such as platform-related, dataset-related, and architecture-related deficiencies were identified. Hardware limitations are the primary cause of platform-related gaps. In the absence of a benchmark dataset and benchmark metrics, there exist data-related research gaps. Lastly, architecture-related gaps concern how model parameters change based on the dataset needs. These gaps are further discussed in detail in section VIII.
As presented in Table 1, there are a total of 3 review papers available in the NST domain. Out of those 3, only paper [4] can be considered a comprehensive review. The novelty of our paper lies in reviewing the latest papers up to Aug 2021 and covering all related facets of NST. There are four major sections to the paper. The first part covers the basics of GANs, their types, and how they work; the second part covers the contemporary architectures of GANs with NST and how they work; the third part covers the improvements that can be made to GANs while applying NST, such as deep photo style transfer; and the fourth part is about how NST can be used along with GAN architectures on a real-time basis.
Highlights of this literature review are listed below:
• Qualitative analysis of the latest GAN architecture models, along with their advantages and limitations, is discussed.
• A summary and in-depth analysis of neural style transfer for both images and video are given, emphasizing mobile devices.
• The most relevant research papers on Neural Style Transfer were explicitly identified with a focus on real-time NST, which narrows down the research in video style transfer.
• Research gaps and future research directions are also discussed in a detailed study of the challenges in applying video neural style transfer in real-time on mobile devices.
Fig. 1 shows the papers reviewed in this research study and their categorization as per the paper's flow.

II. GENERATIVE ADVERSARIAL NETWORKS OVERVIEW
The first part deals with papers that define the basics of most GANs. These papers are essentially the backbone, as most other articles follow their pathway by improving upon or making amendments to them. GANs generate data based on previously learned patterns and regularities as the model finds these patterns. Deep learning suits generative models as they can effectively recognize patterns in input data.

A. GENERATIVE ADVERSARIAL NETWORKS
[7] explores a framework, new at the time, for estimating generative models via an adversarial process in which two models are trained: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that an image came from the training examples rather than from G. The training procedure for G is to maximize the probability of D making a mistake. This arrangement resembles a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 0.5 everywhere. For the situation where G and D are represented by multilayer perceptrons, the entire system can be trained with backpropagation. The technique used here is to maximize the probability of assigning the correct label to both the training examples and the samples from G, while simultaneously training G to minimize log(1 − D(G(z))):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (1)
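To make the objective in (1) concrete, the following PyTorch sketch shows one alternating optimization step in the spirit of [7]. It assumes the discriminator outputs a probability (e.g., ends with a sigmoid) and uses the non-saturating generator update that is common in practice; the networks, optimizers, and latent size are placeholders, not taken from the reviewed paper.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One alternating update for the minimax objective in Eq. (1)."""
    z = torch.randn(real.size(0), z_dim, device=real.device)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    with torch.no_grad():
        fake = G(z)                      # samples from G, detached from G's graph
    d_real, d_fake = D(real), D(fake)
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_d.backward(); opt_D.step()

    # Generator step: non-saturating variant, i.e. maximize log D(G(z))
    d_fake = D(G(z))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_g.backward(); opt_G.step()
    return loss_d.item(), loss_g.item()
```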


TABLE 2. FIDs in FFHQ for networks trained on different percentages of training examples with mixing regularization enabled. (Karras et al. [8]).
TABLE 3. Separability scores and perceptual path lengths (lower is better) in FFHQ for different generator architectures. (Karras et al. [8]).
TABLE 4. The effect of a mapping network in FFHQ. (Karras et al. [8]).

B. STYLE BASED GENERATOR ADVERSARIAL NETWORKS
Generator improvement has seen less attention than the Discriminator. To enhance the picture quality produced by the Generator, [8] introduced a generator based on the style transfer literature. In this model, the Generator is trained with the Progressive GAN setup of Karras as a baseline. The following are the details of the model:
1. Traditionally, the Generator is provided with a latent code through the input of the first layer of the feed-forward network. In the new approach, this input is omitted altogether and the network starts from a learned constant.
2. Given a latent code z in the latent input space Z, a non-linear mapping network f: Z → W first generates w ∈ W.
3. After the mapping is done, learned affine transformations specialize w into styles y = (y_s, y_b), which operate after each convolution layer of the generative network and control the normalization of the generative network G.
4. The normalization technique used is adaptive instance normalization (AdaIN); equation (2) for it is:
AdaIN(x_i, y) = y_{s,i} (x_i − µ(x_i)) / σ(x_i) + y_{b,i}   (2)
Here, to compute the AdaIN of x_i and y, the distance between x_i and its mean µ(x_i) is first found and divided by the standard deviation of x_i, σ(x_i); the value is then scaled by y_{s,i} and biased by y_{b,i}.
5. The Generator is also given direct noise input, which allows it to generate stochastic detail. The noise input is uncorrelated noise generated as a single-channel image, and dedicated noise input is given to each layer of the synthesis network.
6. Using learned per-feature scaling factors, the noise image is first broadcast to all feature maps and then added to the output of the corresponding convolution layer.
The above changes in the Generator lead to the following observations and improvements:
• A 20% improvement in FID over the traditional Generator.
• It becomes possible to control image synthesis by modifying the styles at different scales. The mapping network and affine transformations draw samples for each style from a learned distribution, and the generative network produces new images based on a collection of styles. The effect of each style is localized, meaning that only some portion of the image is influenced by changing a small part of the Style.
• Mixing regularization is used, in which a given number of images is generated during training using two random latent codes. Precisely, two codes z1, z2 are run through the mapping network, and the resulting w1, w2 control the Style so that w1 is applied before and w2 after a crossover point. This approach prevents the network from assuming that adjacent styles are correlated.
• After each convolution, the architecture adds per-pixel noise, so the noise only affects the stochastic aspects, leaving the overall function and high-level aspects intact.
• Global effects such as illumination, etc., were seen to be coherently regulated, whereas the noise was applied to each pixel separately, making it ideally suited only for stochastic variation. If the network tried to control, e.g., the pose with the noise, the Discriminator would penalize the resulting spatial inconsistency. This way, without explicit guidance, the network learns to use the global and local channels properly. The perceptual path length is lowered by using the style-based Generator, as shown in the results below.
• It is shown that increasing the mapping network's depth enhances both image quality and separability.
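A minimal PyTorch sketch of the AdaIN operation in (2). The NCHW tensor layout and the epsilon for numerical stability are assumptions for illustration; in the style-based generator, the scale y_s and bias y_b would come from the learned affine transformation of w.

```python
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """Adaptive instance normalization, Eq. (2).

    x:   content features of shape (N, C, H, W)
    y_s: per-channel style scale of shape (N, C, 1, 1)
    y_b: per-channel style bias of shape (N, C, 1, 1)
    """
    mu = x.mean(dim=(2, 3), keepdim=True)           # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps   # per-sample, per-channel std
    return y_s * (x - mu) / sigma + y_b             # normalize, then scale and bias
```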
C. DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS
Unsupervised learning using Convolutional Neural Networks (CNNs) has seen less attention than supervised learning and its adoption in computer vision applications. To bridge this gap, [9] introduced Deep Convolutional GANs (DCGANs).


FIGURE 2. (a) CycleGAN model schematic showing the two generators and discriminators along with the image domains. (b) The transforms used to
compute forward cyclical consistency loss. (c) The transforms used to compute backward cyclical consistency loss.

In this method, the discriminators are trained for the image classification task, and the generators support vector arithmetic in their latent space, which allows more control over the generated images.
The following guidelines are proposed to create stable convolutional GANs:
1. Use strided convolutions in the Discriminator and fractionally-strided convolutions in the Generator, enabling the model to learn its own upsampling and downsampling.
2. Use batch normalization in both the generators and the discriminators. This prevents the generators from mode collapse.
3. Remove dense hidden layers. Connect the highest convolutional features to the input and output of both parts, which showed promising results. For the Discriminator, flatten and feed the last convolutional layer into a sigmoid output.
4. For the Generator, use ReLU activation and Tanh (only in the final layer).
5. In the Discriminator, use LeakyReLU in all layers.
These architectural changes result in stable training and a model capable of handling high resolutions.
Testing on the CIFAR-10 and Street View House Numbers (SVHN) datasets confirmed the impressive performance of DCGANs. However, it still falls short of Exemplar CNNs [10]. Another point to improve upon is that, even with fewer feature maps in the Discriminator, it has a more extensive feature vector size, increasing training cost at higher resolutions.
D. CYCLE CONSISTENT ADVERSARIAL NETWORKS
Unlike Deep Convolutional GANs, CycleGANs allow image translation on unpaired data. [11] achieve this with the concept of ‘‘Cyclical Consistency,’’ meaning that if two Generators, ‘‘G’’ and ‘‘F,’’ are trained to be inverses of each other, then virtually F(G(X)) ≈ X. [11] introduce a second generator that takes the outputs of the first one and tries to reproduce the actual input image. By training two GANs whose generators perform inverses of each other, [11] decouples the translation's style and structural aspects (one model handles the style transfer while the other enforces structure). A key takeaway is that these models do not need paired data to train due to this structure. The loss function is thus modified to:

L(G, D_Y) = E_{y∼p(y)}[(D_Y(y) − 1)²] + E_{x∼p(x)}[D_Y(G(x))²]   (3)
L(F, D_X) = E_{x∼p(x)}[(D_X(x) − 1)²] + E_{y∼p(y)}[D_X(F(y))²]   (4)
L_cyc(G, F) = E_{x∼p(x)}[‖F(G(x)) − x‖] + E_{y∼p(y)}[‖G(F(y)) − y‖]   (5)
L = L(G, D_Y) + L(F, D_X) + λ · L_cyc(G, F)   (6)

where X and Y are two image domains, G is a generator transforming an image from domain X to Y, and F is a generator transforming an image from domain Y to X. D_Y is the discriminator concerning G (it identifies real/generated images in the Y domain), and D_X is the discriminator concerning F (it identifies real/generated images in the X domain). G(x) is the image generated by G on an input image x such that x ∈ X, and F(y) is the image generated by F on an input image y such that y ∈ Y. Thus, Equations (3) and (4) compute the adversarial losses for the two GANs, while Equation (5) computes the cyclical consistency loss by comparing the input images x and y to their remapped/generated versions, F(G(x)) and G(F(y)), respectively. Equation (6) describes the total loss of CycleGAN, combining the adversarial and cyclical losses. The transformations are diagrammed by [11] in Figure 2.
The generators use a ResNet-based architecture and a few encoder-decoder layers, while the discriminators use a PatchGAN architecture to focus on local structural details. The results show that CycleGANs perform exceptionally well on all test metrics barring the Pix2Pix model. The model's limitation is that it fails whenever an image sampled from a different distribution is input.
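The sketch below evaluates the CycleGAN objective (3)-(6) as written above, using least-squares adversarial terms and L1 cycle terms. The generators G, F and discriminators D_X, D_Y are assumed to be defined elsewhere; in actual training, the generator and discriminator updates use different pieces of these terms, whereas the sketch simply computes the combined expressions.

```python
import torch

def cyclegan_losses(G, F_, D_X, D_Y, x, y, lam=10.0):
    """Adversarial losses (3)-(4), cycle loss (5), and total loss (6)."""
    fake_y, fake_x = G(x), F_(y)

    # Least-squares adversarial losses for the two GANs
    loss_G_adv = ((D_Y(y) - 1) ** 2).mean() + (D_Y(fake_y) ** 2).mean()      # Eq. (3)
    loss_F_adv = ((D_X(x) - 1) ** 2).mean() + (D_X(fake_x) ** 2).mean()      # Eq. (4)

    # Cycle consistency: F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y
    loss_cyc = (F_(fake_y) - x).abs().mean() + (G(fake_x) - y).abs().mean()  # Eq. (5)

    return loss_G_adv + loss_F_adv + lam * loss_cyc                          # Eq. (6)
```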
1) OBSERVATIONS
In summary, [7] defines a basic GAN with its objective function and training procedure. However, it is unconditional (so one cannot obtain precise results) and uncontrollable (one cannot control


the individual features used for generation). [8] builds upon this by modifying the Generator to allow control over disentangled features. [9] improves the architecture by introducing a deep convolutional neural network. [11] focuses on unpaired image style transfer achieved with the help of ‘‘cyclical consistency.’’
FIGURE 3. Output of CycleGANs. Left-most images are inputs, middle images are the corresponding style transfer from the first generator G. Right-most images are the reconstruction of inputs by the second generator F. (Zhu et al. [11]).
FIGURE 4. Encoder-decoder vs U-Net architecture. (Isola et al. [12]).
FIGURE 5. Introducing U-Net allows higher quality of generated images. (Isola et al. [12]).

III. GENERATIVE ADVERSARIAL NETWORKS IN NEURAL STYLE TRANSFER
Now the architectures prevalent and in use for Neural Style Transfer (NST) are discussed. These papers propose new architectures and employ new methods. NST first appeared in Gatys et al. [13]. The approach takes a content image and applies the textures of the given Style. NST then gained momentum as many works followed, increasing the quality of the generated images or generating them faster than Gatys et al. [13]. These efficiency and/or effect improvements paved the way for faster image editing (e.g., Adobe image stylization) or recreational use (e.g., the Prisma app).

A. CONDITIONAL ADVERSARIAL NETWORKS FOR STYLE TRANSFER
Conditional GANs (cGANs) introduce image-to-image translation and a loss function that allows the models to be trained. This removes the need for hand-engineered loss functions or mapping functions. [12] aims to create a common framework that predicts a particular set of pixels based on another given set of pixels. Instead of treating the output space as ‘‘unconditional’’ on the input image, cGANs use a structured loss function that considers the structural differences between the input and generated images. Optimizing this loss function allows the generated images to be structurally related to, or ‘‘conditioned’’ on, the input image. The Generator has an architecture based on U-Net, whereas the Discriminator has a PatchGAN-based architecture. The PatchGAN architecture is shown to be useful as it penalizes local structural differences. The effect of locality or ‘‘patch size’’ is also studied. The loss function is given as:

L_cGAN = E_{x,y}[log(D(x, y))] + E_{x,z}[log(1 − D(x, G(x, z)))]   (7)

where G and D are the Generator and Discriminator networks, x and y are the content and style images, respectively, and z is a random noise vector that gets learned to produce the mapping G: {x, z} → y. The Discriminator is now also fed ‘‘x,’’ the input image. In addition, an L1 distance term is added to make the generated images closer to the ground truth and avoid blurred images:

L_L1(G) = E_{x,y,z}[‖y − G(x, z)‖]   (8)

Thus, the final objective is given as:

L_t = L_cGAN(G, D) + λ · L_L1(G)   (9)

where G and D are the Generator and Discriminator networks, L_cGAN is the conditional loss given in (7), and L_L1(G) is the L1 loss of the generator as given in (8). The total loss is L_t, and λ is a weight used to alter the importance of the L1 loss in the total loss.


FIGURE 6. Style transfer algorithm. (Gatys et al. [13]).

Noise is provided via dropout and not as an explicit input, as the models ignored the latter. In addition, the U-Net architecture introduces skip connections, which allow low-level details to be transferred easily between the input and output images. Meanwhile, PatchGANs, as discriminators, focus more on localized information. Another significant advantage is that PatchGANs can work with a smaller subset of pixels at a time, decreasing the number of parameters, computation, and time required for discriminator predictions.
It is seen that having a small patch size causes a loss of spatial features (the structure of the image) while retaining useful spectral features (colorful images). As the patch size transitions towards larger values, a balance of spatial and spectral features produces a crisp image. However, increasing the patch size beyond this ‘‘balance point’’ causes a lower quality image to be generated. Another plus for PatchGAN is that the Discriminator can be applied to large images. The cases where the model fails to perform well are:
1. Sparse input images (images with shallow structural details).
2. Unusual inputs (inputs which are not like the training data).
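A sketch of the pix2pix-style objective (7)-(9). It assumes the discriminator outputs probabilities for (input, output) pairs; the noise z is, as noted above, realized through dropout in practice, and the λ value here is only illustrative. The sketch evaluates the total objective as written; in training, D maximizes (7) while G minimizes the combination in (9).

```python
import torch

def cgan_objective(G, D, x, y, z, lam=100.0):
    """Conditional GAN loss (7) plus L1 term (8), combined as in (9).

    x: input/content image, y: target image, z: noise tensor."""
    fake = G(x, z)

    d_real = D(x, y)      # discriminator sees the input image paired with the target
    d_fake = D(x, fake)   # ... and paired with the generated image

    loss_cgan = torch.log(d_real + 1e-8).mean() + \
                torch.log(1.0 - d_fake + 1e-8).mean()   # Eq. (7)
    loss_l1 = (y - fake).abs().mean()                    # Eq. (8)
    return loss_cgan + lam * loss_l1                     # Eq. (9)
```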
B. IMAGE STYLE TRANSFER USING CNN
It is difficult to render an image's semantic content in a different style since the image lacks representations that explicitly provide semantic information. To solve the limitation of using only low-level image characteristics of the target image, [13] presents a neural algorithm of Artistic Style that can isolate and recombine the content and the style texture of images and generate new images using those Styles. The image representations used here are derived from a Convolutional Neural Network optimized for object recognition, which explicitly provides high-level image information. Overall, the approach combines CNN-based parametric texture models with a method to invert their image representations. The method used is:
• The standardized version of the 19-layer VGG network includes 16 convolutional and five pooling layers.
• By scaling the weights, the network was normalized such that the mean activation of each convolutional filter over images and positions was equal to one.
• Image synthesis was done using average pooling, as it was seen to provide a better result.
• For content representation:
◦ One can perform gradient descent on a white noise picture to find another image that matches the feature responses of the content image.
Let p⃗ be the original image, x⃗ the generated image, P^l the feature representation of p⃗ in layer l, and F^l the feature representation of x⃗ in layer l.
The squared-error loss between the two feature representations is defined as:
L_content(p⃗, x⃗, l) = (1/2) Σ_{i,j} (F^l_{ij} − P^l_{ij})²   (10)
The derivative of this loss w.r.t. the activations in layer l equals:
∂L_content/∂F^l_{ij} = (F^l − P^l)_{ij} if F^l_{ij} > 0, and 0 if F^l_{ij} < 0   (11)

• Style representation:
◦ To acquire a style representation of an input picture, a feature space is utilized to capture textural information. The feature space is built on top of the filter responses of each model layer and comprises the correlations between the different responses, where the expectation is taken over the spatial extent of the feature maps.
◦ The feature correlations are given by the Gram matrix G^l ∈ R^{N_l × N_l}, where G^l_{ij} is the inner product between the vectorized feature maps i and j in layer l:
G^l_{ij} = Σ_k F^l_{ik} F^l_{jk}   (12)
◦ The contribution of a single layer to the style loss is:
E_l = (1/(4 N_l² M_l²)) Σ_{i,j} (G^l_{ij} − A^l_{ij})²   (13)
◦ Total style loss:
L_style(a⃗, x⃗) = Σ_{l=0}^{L} w_l E_l   (14)
◦ The derivative of E_l w.r.t. the activations in layer l can be computed analytically:
∂E_l/∂F^l_{ij} = (1/(N_l² M_l²)) [(F^l)^T (G^l − A^l)]_{ji} if F^l_{ij} > 0, and 0 if F^l_{ij} < 0   (15)
• Style transfer:
◦ The loss function jointly minimizes the distance of the feature representations of a white noise image from those of the two input images (content and Style):
L_total(p⃗, a⃗, x⃗) = α L_content(p⃗, x⃗) + β L_style(a⃗, x⃗)   (16)
The style image was always resized to the size of the content image before computing its feature representations, to keep them of comparable size.
The results seen for the suggested image style transfer are:
• Both the content and the Style of an image are easily separable in a CNN, and both representations can be manipulated independently to produce new, meaningful images.
• It was shown which of the many layers used in image synthesis best fit the content and style representations.
• The picture is cleaner if the matching is done up to higher layers; initializing the gradient descent with noise leads to the generation of an arbitrary number of new images.
• The algorithm allows photo-realistic style transfer; an example can be seen in Fig. 7.
FIGURE 7. Photorealistic style transfer. (Gatys et al. [13]).
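The content loss (10), Gram matrices (12), style loss (13)-(14), and total objective (16) can be written compactly as below. The feature tensors are assumed to come from a fixed network such as VGG-19, and the layer weights and the α, β values are illustrative rather than taken from the review.

```python
import torch

def gram(feat):
    """Gram matrix (12): inner products between vectorised feature maps."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2))           # shape (n, c, c)

def content_loss(F_l, P_l):
    """Squared-error content loss (10) between generated and content features."""
    return 0.5 * ((F_l - P_l) ** 2).sum()

def style_loss(feats_x, feats_a, weights):
    """Style loss (13)-(14): weighted Gram-matrix mismatch summed over layers."""
    loss = 0.0
    for F_l, A_l, w_l in zip(feats_x, feats_a, weights):
        n, c, h, w = F_l.shape
        loss = loss + w_l * ((gram(F_l) - gram(A_l)) ** 2).sum() \
               / (4 * c ** 2 * (h * w) ** 2)
    return loss

def total_loss(content_feat_x, content_feat_p, feats_x, feats_a, weights,
               alpha=1.0, beta=1e3):
    """Total objective (16): alpha * content term + beta * style term."""
    return alpha * content_loss(content_feat_x, content_feat_p) + \
           beta * style_loss(feats_x, feats_a, weights)
```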
C. TOWARDS THE AUTOMATIC ANIME CHARACTERS CREATION
The most common problem in generating faces is that some features get distorted and blurred. [14] addresses this problem in both the data and model aspects. [14] provides three contributions for generating anime faces:
1. A GAN model based on the DRAGAN architecture.
2. A suitable, clean anime facial dataset comprising high-quality images collected from Getchu (a Japanese game-selling website).
3. An approach to train the GAN from untagged images.
Tags are assigned to the dataset using Illustration2Vec (a CNN-based tag estimation tool). This tool can detect and tag 512 different types of attributes. After the tags are set, 34 tags suitable for the task at hand are selected. In this way, any untagged dataset can be processed and prepared, which vastly widens the possible data collection sources.
The model architecture is based on DRAGAN, proposed by Kodali et al. [15]. DRAGAN has the lowest computation cost among the GAN variants considered and is much faster to train. The generator architecture is shown in Fig. 8 and is based on a modified version of SRResNet [16]. It consists of 16 residual blocks and three feature upscaling blocks. The discriminator architecture is depicted in Fig. 9. It has 11 residual blocks and a dense layer that acts as an attribute classifier. The proposed model was compared with a standard DRAGAN model with a DCGAN Generator based on Fréchet Inception Distance (FID) scores. Table 5 shows that the proposed model has lower average FID scores, identifying it as the better model.
Figure 10 shows samples generated from the model. These samples are clear, sharp, and show good diversity.
The only drawback of this model is that it cannot handle super-resolution. It is observed that high-resolution images generated using this model have undesirable artifacts, which make the results messy.


FIGURE 8. Generator architecture. (Jin et al. [14]).

FIGURE 9. Discriminator architecture. (Jin et al. [14]).

TABLE 5. FID of the proposed model and the baseline model. (Jin et al. [14]).
FIGURE 10. Generated samples. (Jin et al. [14]).

D. CartoonGAN: GENERATIVE ADVERSARIAL NETWORKS FOR PHOTO CARTOONIZATION
[17] proposes a solution to convert real-world scenery images into cartoon-style images. The unique characteristics, smooth shading, and textures of cartoon-style images pose significant challenges to existing methods based on texture-based loss functions. [17] proposes CartoonGAN, a new GAN framework that can take unpaired images for training to tackle this problem. The CartoonGAN architecture is shown in Fig. 11. Generator G comprises one flat convolution block followed by two down-convolution blocks that perform compression as well as encoding of the input image. The content and manifold part is made up of eight residual blocks. Finally, two up-convolution blocks and a convolution layer create the cartoon-style output images. Discriminator D consists of flat layers followed by two strided convolution blocks to reduce the resolution and encode features. The final layers are made up of a feature construction block with convolution layers to obtain a classification.
The overall loss has two parts, an adversarial loss and a content loss, described as
L(G, D) = L_adv(G, D) + ω L_con(G, D)   (17)


FIGURE 11. Proposed generator and discriminator architecture. (Chen et al. [17]).
Here ω is the weight by which the amount of content retained from the input can be controlled.
The adversarial loss L_adv(G, D) is an edge-promoting loss defined as:
L_adv(G, D) = E_{c_i∼S_data(c)}[log D(c_i)] + E_{e_j∼S_data(e)}[log(1 − D(e_j))] + E_{p_k∼S_data(p)}[log(1 − D(G(p_k)))]   (18)
Here, the Generator G outputs a generated image G(p_k) for each photo p_k in the photo manifold p, e_j is a cartoon image without precise edges, and c_i is the corresponding actual cartoon image. D(c_i), D(e_j), and D(G(p_k)) are the probabilities of the discriminator D assigning the correct labels to the actual cartoon image, the cartoon image without clear edges, and the generated image, respectively.
The content loss L_con(G, D), which uses the feature maps of a pre-trained VGG network, is defined by:
L_con(G, D) = E_{p_i∼S_data(p)}[‖VGG_l(G(p_i)) − VGG_l(p_i)‖_1]   (19)
Here, l refers to the feature maps of a specific VGG layer.
Along with the model, an initialization phase is proposed to improve the GAN model's convergence. In this phase, the Generator is trained only with the semantic content loss (19) and learns to reconstruct only the content of the input images. The training data is unpaired, consisting of real-world and cartoon images that are all resized to 256 × 256. There are 5,402 real-world training images. The cartoon images comprise Makoto Shinkai (4,573), Mamoru Hosoda (4,212), Miyazaki Hayao (3,617), and Paprika (2,302) style images.
As Fig. 12 shows, outputs from CartoonGAN are compared with NST [13] and CycleGAN [11] outputs trained on the same dataset. The figure demonstrates the inability of NST and CycleGAN to handle cartoon style well. NST using only style imagery cannot capture the Style thoroughly because the local regions are styled differently, which leads to inconsistent artifacts. Similarly, results from CycleGAN are also unable to understand and depict the cartoon style appropriately. Without the identity loss it is unable to preserve the input image content, and even with the identity loss the results are unsatisfactory. The results clearly show that CartoonGAN effectively transforms real-world scenery images into cartoon style efficiently and with high quality. It performs much better than the other top stylization methods.

E. ARTSY-GAN: A STYLE TRANSFER SYSTEM
[18] introduces a novel method for GAN-based style transfer termed Artsy-GAN. The problem with current approaches, such as CycleGAN, is the slow training of these models due to their complexity. Another disadvantage is that the source of randomness is limited to the input images. [18] proposes three ways to tackle these problems:
1. Using a perceptual loss instead of a reconstruction loss to improve training speed and quality.
2. Using chroma sub-sampling to process the images, which improves inference/prediction speed and makes the model compact by reducing its size.
3. Improving the diversity of the generated output by appending noise to the Generator's input and pairing it with a loss function that forces it to develop a variety of details for the same image.
Fig. 13 shows the model architecture of the Generator. The input is a 3-channel color image (RGB) with noise added to each channel. The Generator has three branches, each of which receives the same input but produces a different output image channel; these are converted back into RGB by a module


FIGURE 12. Generated output comparison of CartoonGAN, CycleGAN and NST. (Chen et al. [17]).

FIGURE 13. Architecture of generator. (Liu et al. [18]).

at the end of the network. The discriminator architecture is the same as CycleGAN's, using 70 × 70 PatchGANs [11], [17], [18].
The objective loss function is made up of three types of losses and is defined as
L_FULL = L_GAN + α L_PERCEPTUAL + β L_DIVERSITY   (20)
where α and β control the significance of the losses.
Here, the loss functions are:
1. An adversarial loss L_GAN for matching the distributions of the domains. It is defined as
L_GAN = E_{x∼p_data(x)}[(D(G(x, z)) − L_real)²]   (21)
where L_real is the label of real data, z is a noise tensor, G(x, z) is an image produced by generator G, and D is the discriminator.
2. A diversity loss L_DIVERSITY to improve diversity in the generated/output images, defined as
L_DIVERSITY = −(1/N) Σ_{i=1}^{N} 1 / (mean_{j≠i} ‖g(z_i) − g(z_j)‖ + ε)   (22)
where N is the number of input noises as well as the number of outputs.
3. A perceptual loss L_PERCEPTUAL to overcome the unconstrained problem by keeping the object and content in the output, which can be described as:
L_PERCEPTUAL = (1/(C_j H_j W_j)) ‖φ_j(x) − φ_j(G(x, z))‖²   (23)
where φ_j(x) is the output of the j-th layer of the feature encoder network φ for image x. If the j-th layer is a


FIGURE 14. Results compared with Cycle-GAN. (Liu et al. [18]).

TABLE 6. Comparison of FID of Artsy-GAN and CycleGAN. (Liu et al. [18]).
TABLE 7. Processing time comparison on a Tesla M40 GPU. (Liu et al. [18]).
convolutional layer, then φ_j(x) will be a feature map of size C_j × H_j × W_j.
A comparison of Artsy-GAN with CycleGAN is made based on FID, processing time, and diversity of the generated images. Table 6 shows that Artsy-GAN has lower FID scores across all the styles for which both models were trained.
Table 7 shows that Artsy-GAN is 9.33% faster than CycleGAN at the lowest resolution considered (640 × 480) and up to 74.96% faster at the highest resolution (1960 × 1080). As the resolution increases, the difference in processing times increases, proving that Artsy-GAN is much faster and well-suited for higher resolution images.
Fig. 14 shows that CycleGAN output images are very similar for the same input image, with shallow diversity even after adding noise to the input images, whereas Artsy-GAN output images vary significantly, confirming its high diversity. Overall, the proposed Artsy-GAN is a better, faster, and more diverse method for style transfer, which easily outperforms other state-of-the-art methods, as depicted by the results. The proposed perceptual loss can also be used for different stylings, such as oil paintings with vibrant textures.

F. DEPTH AWARE STYLE TRANSFER
After a style transfer has been rendered using a different picture style, the depth of the content picture is not reproduced. It is seen that traditional remedies, like additional regularization terms in the optimization of the loss function, are either computationally ineffective or require a differently trained neural style network. The AdaIN approach of Huang et al. [32] enables effective arbitrary style transfer to the content image, but the depth map of the content image cannot be replicated. [20] proposed an extension to the AdaIN method that preserves the depth map by applying variable stylization strength. The comparison is shown in Fig. 15.
The technique is depth-aware AdaIN (DA-AdaIN), which works with varied strength: closer areas are stylized less, whereas distant regions representing the background are stylized more. Plain AdaIN applies the Style evenly to the content image according to the following stylization:
Î = g(AdaIN(f(I_c), f(I_s)))   (24)
where
• I_c is the content image,
• I_s is the style image,
• f(·) is an encoder,


FIGURE 15. Comparison between the proposed DA-AdaIN and AdaIN methods. (Kitov et al. [20]).
FIGURE 16. Result of style transfer based on the depth contrast parameter β, with ε = 0. (Kitov et al. [20]).
FIGURE 17. Result of style transfer based on the proximity offset parameter ε, with β = 20. (Kitov et al. [20]).
• g(·) is a decoder trained for appropriate stylization with the encoder,
• AdaIN(x, y) is a variant of instance normalization.
The proposed extension is:
• Add Style with varied strength, based on proximity to the camera, in different areas of the content image.
• Closer places belong to the foreground and must be preserved, so they are stylized less; remote areas are considered background and are stylized more. The overall stylization strength can be regulated by a hyperparameter α ∈ [0, 1] in the standard strength-controlled stylizer:
Î = g(α f(I_c) + (1 − α) AdaIN(f(I_c), f(I_s)))   (25)
• Here f(I_c) is the unaltered content encoder representation, whereas AdaIN(f(I_c), f(I_s)) is the completely stylized encoder representation. To manage spatially varying strength, the modified formula can be used:
Î = g(P ⊙ f(I_c) + (1 − P) ⊙ AdaIN(f(I_c), f(I_s)))   (26)
• where P ∈ R^{H_c × W_c} is a stylization strength map and ⊙ denotes element-wise multiplication repeated for each channel at each spatial position of the content encoder representation:
{P ⊙ F}_{cij} = P_{ij} F_{cij}   (27)
• The algorithm has two hyperparameters:
◦ β > 0 controls the prominence of the proximity map around its mean value.
◦ ε ∈ [0, 1] controls the minimal offset of the image regions from the camera.
Image results based on different hyperparameter values are shown in Fig. 16 and Fig. 17.
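A sketch of the spatially varying blend in (26)-(27). The strength map P would be derived from a depth estimate of the content image shaped by β and ε, a step only summarized here since the exact mapping is not reproduced above; tensor shapes and the epsilon are assumptions for illustration.

```python
import torch

def depth_aware_adain(f_c, f_s, P, eps=1e-5):
    """Blend content features with AdaIN-stylised features using a
    per-pixel stylisation-strength map P, Eqs. (26)-(27).

    f_c, f_s: encoder features of content/style images, shape (N, C, H, W)
    P:        strength map, shape (N, 1, H, W); 1 keeps content (close regions),
              0 keeps the fully stylised features (distant regions)
    """
    mu_c, sd_c = f_c.mean((2, 3), keepdim=True), f_c.std((2, 3), keepdim=True) + eps
    mu_s, sd_s = f_s.mean((2, 3), keepdim=True), f_s.std((2, 3), keepdim=True) + eps
    stylised = sd_s * (f_c - mu_c) / sd_c + mu_s    # plain AdaIN, as in Eq. (24)
    return P * f_c + (1.0 - P) * stylised           # spatial blend, Eq. (26)
```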


G. StyleBank: AN EXPLICIT REPRESENTATION FOR NEURAL IMAGE STYLE TRANSFER
StyleBank is made of many convolutional filter banks, each of which explicitly represents one Style for neural image style transfer. To convert a picture to a particular style, the corresponding filter bank is applied on top of the intermediate feature embedding produced by a single auto-encoder. The StyleBank and the auto-encoder are learned concurrently, with the auto-encoder encoding no style information thanks to the flexibility provided by the explicit filter bank representation. Additionally, it supports incremental learning: a new image style is added by learning a new filter bank while keeping the auto-encoder unchanged. The explicit style representation and the adaptable network design enable styles to be combined at both the picture and region levels.
To investigate an explicit representation for Style, [21] revisit traditional texton (the essential element of texture) mapping methods, in which mapping a texton to a target location is equivalent to a convolution between the texton and a Delta function (indicating the sampling positions) in the image space.
In response, [21] offer StyleBank, a collection of different convolution filter banks, each reflecting a distinct style. To convert a picture to a specific style, the matching filter bank is convolved with the intermediate feature embedding generated by a single auto-encoder, which decomposes the original picture into several feature maps.
In comparison to previously published neural style transfer networks, the proposed network is novel in the following ways:
• This method offers an explicit representation of styles. After learning, the network can isolate styles from content.
• This technique enables region-based style transfer due to the explicit style representation. This is not possible with existing neural style transfer networks, but it is possible with classical texture transfer.
• This method enables concurrent training of many styles with a single auto-encoder and progressive learning of a new style without modifying the auto-encoder.
FIGURE 18. The network model. (Chen et al. [17]).
[21] construct a feed-forward network based on a simple image auto-encoder (Figure 18), which converts the input picture (i.e., the content image) to the feature space via the encoder subnetwork.

H. TWO-STAGE COLOR INK PAINTING STYLE TRANSFER VIA CNN
[22] proposes an approach to transfer flower photographs into color ink paintings. Unlike a common neural style transfer technique, the paper presents a way to imitate the creation of a color ink painting. It can be viewed as two specific steps: edge marking and image colorization. Rather than utilizing edge detection algorithms, a line drawing is adopted, exploiting the CNN-based neural style transfer technique to obtain it. For image colorization, the GAN-based neural style transfer strategy is utilized.
The framework comprises two components: a line extraction model and an image colorization model. The line extraction model converts the flower photograph's content into a line drawing through the mapping x1 = f1(content). The image colorization model colorizes the line drawing x1 to give the output y through the mapping y = g(·). In this methodology, f1(content) is expected to be an approximately invertible mapping, so an approximate color representation can also be acquired from content ≈ f1^{−1}(x1). Thus, both the line drawing and an approximate color representation of the content image are captured by the mapping f1(content). With paired data, a conditional GAN trained in a supervised way may combine the content image with a user-specified Style. Hence, stylized photos from the generator fool the discriminator, yet meet the requirements of the color ink painting tone.

1) AS REFERENCED BEFORE, THERE ARE TWO PRIMARY MODELS
First, the Line Extraction model, which removes most lines of blossoms and leaves in the content pictures. The line extraction model is utilized to characterize the loss functions by measuring differences in content and style between features extracted from images. In the training stage, the flower image is passed to the image colorization model, and an output picture x1 is created accordingly. The line extraction model is fixed during the training stage, and its output features are used in the image loss functions.
L = λ_c L_c(F(x_content), F(x1)) + λ_s L_s(G(x_s), G(x1))   (28)
In equation (28), L_c(·) is the Euclidean distance between the content representations of the content image and the stylized image, L_s(·) is the squared Frobenius norm of the difference between the Gram matrices of the style image and the stylized image, and F and G are the feature transformation functions. Secondly, the Image Colorization Network, which is instantiated with either a Conditional GAN or a DualGAN, is used to experiment and check which one gives the better output. When the Generator and Discriminator are conditioned on additional data, a stricter model is learned. As shown in Fig. 19, the line drawing is used by both the colorization and line extraction models. Since the line drawing can be observed, the Discriminator can see how the Generator transforms the input line drawing into a suitable photo. In this manner, the Discriminator tends to be more reliable at separating the generated photos from the real ones.
In DualGAN, an unsupervised learning framework learns to translate images from domain X to those in domain Y and also learns to invert the task. In this case, as shown in Fig. 21, two image sets from two domains, namely a line drawing set (domain X) and a color ink painting set (domain Y), are fed into two groups of GANs. Generator GA first transforms a line drawing x1 from domain X into a stylized painting image y; y is then immediately turned back into a regenerated line image x1. Meanwhile, generator GB converts the color ink painting style in a stylized line image x1 back into a reconstructed color ink painting. The L1 distance is used to measure the reconstruction error, which is added to the GAN objective. Hence, the generators learn to produce images with perceptual realism.

2) OBSERVATIONS
A summary of the contributions is presented in Table 8. One peculiar limitation seen is that the models tend to fail on higher resolution images.

IV. ADVANCEMENT PAPERS
This set of papers presents advancements to current architectures. These advancements allow different types of control over the Style Transfer by improving color control, stability, spatial control, and other vital aspects which enhance the quality of the generated images.

A. PERCEPTUAL FACTOR CONTROL IN NEURAL STYLE TRANSFER
[23] presents an extension to the existing methods by proposing spatial, color, and scale control over a generated image's features. By breaking down the perceptual factors into these features, more appealing images can be generated that avoid common pitfalls. Finally, [23] shows a method to incorporate this control into already existing processes. The identification of perceptual factors is the key to producing higher-quality


FIGURE 19. Line extraction model architecture. (Zheng and Zhang [22]).

TABLE 8. A short summary of architecture-based papers and their key contributions.

FIGURE 20. The organization model of image colorization model. (Zheng and Zhang [22]).

images. Spatial control implies controlling which region of the style image is applied to each region of the content image. This helps as different regions have different styling, and mapping them incorrectly can cause visual artifacts. The first method to do this uses guidance-based Gram matrices, where each image is provided with a spatial guidance channel indicating which region of the Style should be applied where. This involves computing a spatially guided feature map for R regions and L layers as:
F_l^r(x)_{[:,i]} = T_l^r ◦ F_l(x)_{[:,i]}   (29)
where ◦ denotes element-wise multiplication. The guided Gram matrix can then be defined as:
G_l^r(x) = F_l^r(x)^T F_l^r(x)   (30)


FIGURE 21. Model design of the complete network. (Zheng and Zhang [22]).

Furthermore, the contribution to the loss function is given as:
E_l = (1/(4 N_l²)) Σ_{r=1}^{R} λ_r Σ_{ij} ((G_l^r(x̂) − G_l^r(x_S))_{ij})²   (31)
where N_l is the number of feature maps in layer ‘‘l,’’ and G_l^r(x̂) and G_l^r(x_S) are the guided Gram matrices generated as per Equations (29) and (30) for the generated image x̂ and the input style image x_S. λ_r is the weighting factor that controls the stylization strength in the corresponding region r.
An alternative approach focuses on stacking the guidance matrices with the feature maps directly. This is more efficient than the previous approach but, as noted, comes at the cost of texture quality. The second factor addressed in [23] is color control, which is independent of geometric shapes or textures. Color control is beneficial in situations where retaining the image's colors is essential (e.g., photo-realistic style transfers). [23] present two approaches to deal with this:
1. Luminance-only transfer: Style Transfer is only performed on the luminance channel. This is done by extracting the style and content luminance channels and producing an output luminance channel that is then combined with the original content colors to create the generated image.
2. Color histogram matching: In this method, the style image's colors are transformed such that their mean and covariance match the content image's mean and covariance using a linear transform.
Each of them has its pros and cons. For instance, luminance-only transfer preserves the content colors, but this comes at the expense of losing the dependencies between luminance and color. Color matching may maintain these dependencies, but it depends on the transform, which can be rather tricky to find.
Scale control allows us to pick separate styles at different scales, where the Style at a given scale is the spread of image texture over an area of that size. [23] propose creating fresh style pictures from two separate photos by combining a fine- and a coarse-scale picture. This is handy when it comes to Style Transfer on high-resolution images. Given a high-resolution content and style image, the output is achieved by downsampling to the desired resolution. This output is upsampled and used as the initialization for the original images. This technique requires fewer iterations for optimization and filters low-level noise as well. The method can be iterated to generate very high-resolution images.
As seen in Fig. 22, the method works well to get a high-resolution image comparable to the one that does not use it. However, the ‘‘CTF’’ (coarse-to-fine) model requires fewer iterations and is seen to have less noise.

B. STABILITY IMPROVEMENTS IN NEURAL STYLE TRANSFER
The latest image style transfer methods can be grouped into two groups. The first is the optimization approach, which solves a particular optimization problem for each generated image. These results are outstanding but take some time to produce each picture. The second is feed-forward approaches, which provide solutions to these problems and are usable for real-time synthesis but tend to give unstable results. [24] introduces a new method for stabilizing feed-forward style transfer methods for video stylization using a recurrent network trained with a temporal consistency loss. In this method, the network tries to minimize the summation of three losses. The combined loss is defined as
L(W, c_{1:T}, s) = Σ_{t=1}^{T} (λ_c L_C(p_t, c_t) + λ_s L_s(p_t, s) + λ_t L_t(p_{t−1}, p_t))   (32)
Here λ_c, λ_s, and λ_t are used to assign importance to each loss term.
The three losses are as follows:
1. Content loss L_c, which is defined as
L_c(p, c) = Σ_{j∈C} (1/(C_j H_j W_j)) ‖φ_j(p) − φ_j(c)‖²_2   (33)

Here, φ_j(x) is the j-th layer network activation, of shape C_j × H_j × W_j, for image x.
2. Style reconstruction loss L_s, defined as
L_s(p, s) = Σ_{j∈S} (1/(C_j · H_j · W_j)) ‖G(φ_j(p)) − G(φ_j(s))‖²_F   (34)
Here, G(φ_j(x)) is a C_j × C_j Gram matrix of the layer-j activations.
3. Temporal consistency loss L_t, defined as
L_t(p_{t−1}, p_t) = (1/(H W)) ‖m_t ⊙ p_{t−1} − m_t ⊙ p̃_t‖²_F   (35)
Here, m(h, w) ∈ [0, 1] is 0 in regions of occlusion and at motion boundaries, ⊙ indicates element-wise multiplication, and H, W are the height and width of the input frame. The style and content losses encourage the high-level features of the content image to be mapped to features of the Style. The temporal consistency loss prevents drastic variations in the output between time steps. The content image and the previous frame are fed as input to the network. At each step, the output of the network is passed as input to the next step.
of the shape Cj ∗ Hj ∗ Wj for image x. Fig. 24 shows the results for translation and blurring distor-
2. Style reconstruction loss Lsis defined as tions of images. An image patch is taken and distorted then
SSIM is computed between both the original and distorted
X 1 2 patch. Both are then stylized, and SSIM is calculated for
Ls (p, s) = G(φj (p)) − G(φj (s)) F
j∈s Cj · HJ · Wj the styled original and styled distorted patch. The proposed
(34) method is compared with the Real-Time baseline model on
all styles. The results prove that this method is significantly
Here, G φj (x) is a Cj ∗ Ci gram matrix for layer j

more robust at controlling distortions.
activations Table 9 shows the results of the comparison done based on
3. Temporal consistency loss Lt defined as speed. This method matches the Real-Time baseline in terms
of speed and is three times faster than the Optimbaseline [24].
1 Fig. 25 shows a pair-wise comparison of stylized frame
Lt (pt−1 , pt ) = ||mt 2ρt−1 −mt 2 p̃t ||2F (35) output. PSNR/SSIM values are shown for each example pair.
HW
VOLUME 9, 2021 131599
A. Singh et al.: NST: Critical Review

FIGURE 25. Pair-wise stylized image SSIM comparison. (Gupta et al. [24]).

TABLE 9. Speed comparison. (Gupta et al. [24]).
FIGURE 26. On the output image (c), the undesirable style image color overlay is evident. (Gatys et al. [13]).
This method produces frames similar to the Optim baseline [24]. Still, compared with the real-time baseline, the frames produced are better and temporally consistent for unstable styles like Rain Princess and Metzinger.
There are two problems with this method:
1. Occasionally, one object can block others in the result, which is undesirable.
2. Shower-door artifacts appear in the generated image.

C. PRESERVING COLOR IN NEURAL ARTISTIC STYLE TRANSFER
Though there have been many papers on style transfer, there has been a shortcoming: the algorithms transfer the colors of the original painting to the output painting, which can alter the appearance in undesirable ways. [25] describes a simple linear method for retaining colors after style transfer, extending the neural artistic style transfer algorithm. One of the problems, as said before, is that while the output after style transfer replicates the Style of brushstrokes, geometric shapes, and painterly structures displayed in the style picture, it also undesirably duplicates the color distribution of the style picture.
Two different methods for preserving the colors of the content image are color histogram matching and luminance-only transfer.
1. Color histogram matching:
1. Consider S, the style image, and C, the input (content) image. The style image's colors are transformed to match the input image's colors, producing S', a new style image that replaces S as the input to the NST algorithm. One choice that has to be made is the color transfer procedure.
2. Each pixel is transformed as:
x_S' ← A x_S + b   (36)
where A is a 3 × 3 matrix, b is a 3D vector, and x_i = (R, G, B)^T.
3. This transformation is chosen such that the mean and covariance of the RGB values of the new style image (S') match those of the content image (C), i.e., µ_S' = µ_C and Σ_S' = Σ_C.
4. The values of A and b in equation (36), based on the condition mentioned above, are:
b = µ_C − A µ_S,   A Σ_S A^T = Σ_C   (37)
5. There are many different solutions for A which satisfy these constraints:


1. The first variant uses the Cholesky decomposition:
A_chol = L_C L_S^{−1}
where Σ = L L^T is the Cholesky decomposition of Σ.
2. The second variant is the 3D color matching formulation:
A_IA = Σ_C^{1/2} Σ_S^{−1/2}
3. It is seen that transferring the color histogram before style transfer gives better outcomes than the alternative, in which neural style transfer is computed from the original inputs S and C and the output T is afterwards color-matched to C, creating another output T'.
4. The algorithm also reduces the competition between the reconstruction of the content image and the simultaneous matching of the texture details from the style image.
FIGURE 27. Result of Cholesky and image analogies color transfer. (Gatys et al. [13]).
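A NumPy sketch of the linear color transform (36)-(37) using the Cholesky variant A_chol = L_C L_S^{-1} described above. The array layout and the small regularizer added to the covariances are assumptions for illustration.

```python
import numpy as np

def match_color_cholesky(style, content, eps=1e-5):
    """Transform style-image colors so their mean/covariance match the content
    image, Eqs. (36)-(37), with A chosen via Cholesky factors.

    style, content: float arrays of shape (H, W, 3), values in [0, 1]."""
    xs = style.reshape(-1, 3)
    xc = content.reshape(-1, 3)

    mu_s, mu_c = xs.mean(0), xc.mean(0)
    cov_s = np.cov(xs, rowvar=False) + eps * np.eye(3)
    cov_c = np.cov(xc, rowvar=False) + eps * np.eye(3)

    L_s, L_c = np.linalg.cholesky(cov_s), np.linalg.cholesky(cov_c)
    A = L_c @ np.linalg.inv(L_s)      # A_chol, so that A Sigma_S A^T = Sigma_C
    b = mu_c - A @ mu_s               # Eq. (37)

    xs_new = xs @ A.T + b             # Eq. (36): x_S' = A x_S + b
    return np.clip(xs_new.reshape(style.shape), 0.0, 1.0)
```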
5. Luminance-only transfer:
• Visual perception is much more susceptible to changes in luminance than to changes in color.
• Luminance channels L_S and L_C are first derived from the style and content images; the NST algorithm is applied to them, yielding a luminance image L_T.
• Using the YIQ color space, the input picture's color information (the I and Q channels) is merged with L_T to generate the resulting image.
• A significant mismatch between the luminance histograms of the style and content images should be balanced before the style is transferred; each style image's luminance pixel is updated as:
L_S' = (σ_C / σ_S)(L_S − µ_S) + µ_C   (38)
where µ_S and µ_C are the mean luminances and σ_S and σ_C are the standard deviations.
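Both color-preservation schemes above reduce to a few lines of array arithmetic once the image statistics are available. The following NumPy sketch is only an illustration of Eqs. (36)–(38) under our own function names and conventions, not the reference implementation of [25]: the Cholesky variant recolors the style image before it is fed to any NST algorithm, and the luminance route matches luminance statistics and then recombines a stylized luminance channel with the content image's chroma.

```python
import numpy as np

def match_color_cholesky(style, content):
    """Recolor the style image S so that mu_S' = mu_C and Sigma_S' = Sigma_C (Eqs. 36-37)."""
    s = style.reshape(-1, 3)
    c = content.reshape(-1, 3)
    mu_s, mu_c = s.mean(0), c.mean(0)
    cov_s = np.cov(s, rowvar=False) + 1e-5 * np.eye(3)   # small ridge for numerical stability
    cov_c = np.cov(c, rowvar=False) + 1e-5 * np.eye(3)
    L_s, L_c = np.linalg.cholesky(cov_s), np.linalg.cholesky(cov_c)
    A = L_c @ np.linalg.inv(L_s)                          # A_chol = L_C L_S^{-1}
    b = mu_c - A @ mu_s                                   # b = mu_C - A mu_S
    recolored = s @ A.T + b                               # x_S' = A x_S + b, applied per pixel
    return np.clip(recolored, 0.0, 1.0).reshape(style.shape)

def match_luminance_stats(style_lum, content_lum):
    """Eq. (38): L_S' = (sigma_C / sigma_S)(L_S - mu_S) + mu_C."""
    return (content_lum.std() / style_lum.std()) * (style_lum - style_lum.mean()) + content_lum.mean()

def recombine_luminance(stylized_lum, content):
    """Merge a stylized luminance channel with the content image's I and Q (YIQ) chroma."""
    rgb2yiq = np.array([[0.299, 0.587, 0.114],
                        [0.596, -0.274, -0.322],
                        [0.211, -0.523, 0.312]])
    yiq = content @ rgb2yiq.T        # per-pixel RGB -> YIQ
    yiq[..., 0] = stylized_lum       # replace Y, keep I and Q from the content image
    return np.clip(yiq @ np.linalg.inv(rgb2yiq).T, 0.0, 1.0)
```

In the histogram-matching route, the recolored style image S' simply replaces S as the input to the NST algorithm, so the transfer network itself is left unchanged.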

D. DEEP PHOTO STYLE TRANSFER


A deep learning approach to photographic style transfer is presented in [26]; it handles a wide variety of image content while faithfully transferring the given style. One of its contributions is to suppress painting-like distortions by preventing spatial distortions and constraining the transfer operation to the color space. Another key contribution is a solution to the challenge posed by the difference in content between the given and reference pictures, which otherwise ends in undesirable transfers between unrelated regions. The algorithm takes two pictures: an input picture, commonly a stock photo, and a stylized, retouched reference picture, the reference style picture. The proposed approach adds a photorealism regularization term to the objective function during the optimization, constraining the reconstructed picture to be represented by locally affine color transformations of the input in order to prevent distortions.

FIGURE 28. Working of luminance-based style transfer with color histogram. (Gatys et al. [13]).

1) PHOTOREALISM REGULARIZATION
[26] describes how to regularize this optimization approach to maintain the structure of the original image and generate photo-realistic results. The idea is to express this constraint on the transformation performed to the input image rather than on the output image directly. The topic of characterizing the space of photo-realistic photos remains unresolved. [26] did not need to solve it; instead, it utilized the fact that the input was already photo-realistic. The goal is to protect images from losing this attribute during the transfer by including a term that penalizes image distortions. The answer is to find an image transform that is locally affine in color space, i.e., a function that translates the input RGB values onto their output counterparts for each output patch.

L_TOTAL = Σ_{l=1}^{L} α_l L_C^l + Γ Σ_{l=1}^{L} β_l L_{s+}^l + λ L_m   (39)

L is the number of convolutional layers, and l is the l-th convolutional layer of the network. Weight Γ controls the style loss. Weights α_l and β_l are layer preference parameters. Weight λ is used to control the photorealism regularization.


L_C^l, L_{s+}^l, and L_m are the content, augmented style, and photorealism regularization losses, respectively. Fig. 29 shows how users can control the transfer results simply by providing semantic masks. This use case allows artistic applications and makes it possible to handle extreme cases for which semantic labeling is not supported, e.g., transferring a fireball directly onto a perfume bottle.

FIGURE 29. Manual segmentation enables diverse tasks, for example, transferring a fireball (b) to a perfume bottle (a) to create a fire-illuminated look (c), or swapping the texture between different apples (d, e). (Luan et al. [26]).

FIGURE 30. Failures caused by mismatching. (Luan et al. [26]).
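The total objective in Eq. (39) is a weighted sum, and the photorealism term is a quadratic form in the output's color channels built from the Matting Laplacian of the input photo. The PyTorch sketch below is a simplified illustration of that combination; the Laplacian is assumed to be precomputed and is passed in as a dense tensor only for brevity, and the weight values are placeholders rather than the settings of [26].

```python
import torch

def photorealism_regularizer(output, matting_laplacian):
    """L_m: sum over color channels of V_c[O]^T M_I V_c[O].

    output: (H, W, 3) stylized image; matting_laplacian: (H*W, H*W) matrix of the input photo.
    """
    loss = output.new_zeros(())
    for c in range(output.shape[-1]):
        v = output[..., c].reshape(-1, 1)                 # vectorized color channel
        loss = loss + (v.t() @ (matting_laplacian @ v)).squeeze()
    return loss

def deep_photo_total_loss(content_losses, style_losses, l_m, alphas, betas,
                          Gamma=1e2, lam=1e4):
    """Eq. (39): L_total = sum_l alpha_l L_C^l + Gamma * sum_l beta_l L_{s+}^l + lambda * L_m."""
    l_c = sum(a * l for a, l in zip(alphas, content_losses))
    l_s = sum(b * l for b, l in zip(betas, style_losses))
    return l_c + Gamma * l_s + lam * l_m
```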

2) AUGMENTED STYLE LOSS WITH SEMANTIC SEGMENTATION
The style term is limited by the fact that the Gram matrix is computed over the whole picture. Because a Gram matrix determines its constituent vectors only up to an isometry, it implicitly stores the precise distribution of neural responses, limiting its capacity to adapt to changes in semantic context and causing ''spillovers.'' The masks are added as extra channels to the input picture, and they enhance the neural style method by concatenating the segmentation channels and updating the style loss. [26] also found that the segmentation does not need to be pixel precise, because the regularization ultimately constrains the output. Fig. 30 shows instances of failure caused by mismatching; these can be fixed using manual segmentation. A minimal sketch of such a masked style loss is given below.

FIGURE 31. Flow of segmented style transfer. (Makow et al. [28]).
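The masked (augmented) style loss can be sketched directly from this description: one Gram term per semantic channel, with the masks resized to each feature map's resolution. The PyTorch fragment below is our own minimal illustration, not the implementation of [26]:

```python
import torch

def gram(feat):
    """Gram matrix of a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (c * h * w)

def augmented_style_loss(out_feats, style_feats, out_masks, style_masks):
    """Sum over layers and semantic channels of squared Gram differences.

    out_feats/style_feats: lists of (C, H, W) feature maps from the loss network.
    out_masks/style_masks: lists of (K, H, W) segmentation masks, already resized
    to each layer's resolution (K semantic classes).
    """
    loss = 0.0
    for fo, fs, mo, ms in zip(out_feats, style_feats, out_masks, style_masks):
        for k in range(mo.shape[0]):
            loss = loss + ((gram(fo * mo[k]) - gram(fs * ms[k])) ** 2).sum()
    return loss
```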
E. GauGAN: SEMANTIC IMAGE SYNTHESIS WITH SPATIALLY ADAPTIVE NORMALIZATION
Conditional image synthesis refers to the task of creating photo-realistic pictures conditioned on some input data. [27] addresses a particular conditional image synthesis task: converting a semantic segmentation mask into a photo-realistic picture. This framework has a broad scope of uses, for example, content generation and image editing. A network built simply by stacking convolutional, normalization, and nonlinearity layers is, at best, suboptimal, because its normalization layers tend to ''wash away'' the information contained in the input semantic masks. To address the issue, [27] proposes spatially adaptive normalization. This conditional normalization layer modulates the activations using the semantic input layout through a spatially adaptive, learned transformation and can properly propagate the semantic information throughout the network.


TABLE 10. Remarks on part 3.
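The modulation just described—normalize, then scale and shift the activations with parameters predicted per pixel from the segmentation map—fits in a short block. The following is a simplified PyTorch sketch of a SPADE layer written only for illustration; the hidden width, kernel sizes, and the choice of parameter-free batch normalization are our assumptions rather than the exact design of [27].

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Minimal spatially adaptive normalization block."""

    def __init__(self, num_channels, num_classes, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_channels, affine=False)      # parameter-free normalization
        self.shared = nn.Sequential(nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_channels, 3, padding=1)  # per-pixel scale
        self.beta = nn.Conv2d(hidden, num_channels, 3, padding=1)   # per-pixel bias

    def forward(self, x, segmap):
        # x: (N, C, H, W) activations; segmap: (N, K, H0, W0) one-hot semantic layout.
        seg = F.interpolate(segmap, size=x.shape[-2:], mode='nearest')
        h = self.shared(seg)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```

Because the scale and bias vary spatially with the layout, the semantic information survives normalization instead of being washed away.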

1) SPADE GENERATOR
With SPADE, there is no compelling reason to feed the segmentation map to the Generator's first layer, since the learned modulation parameters have already encoded enough information about the label layout. The Generator's encoder, which is commonly used in recent designs, can therefore be discarded, which results in a more lightweight network. Similarly to existing class-conditional generators, the new Generator can accept a random vector as input, enabling a simple and natural way to perform multi-modal synthesis. Interestingly, the segmentation mask in the SPADE Generator is processed through spatially adaptive modulation without normalization; only activations from the previous layer are normalized. Hence, the SPADE generator can better preserve semantic information: it enjoys the benefit of normalization without losing the semantic input information.

2) MULTI-MODAL SYNTHESIS
Using a random vector as the Generator's input, the design provides a natural technique for multi-modal synthesis. To be specific, one can add an encoder that maps a picture into a random vector, which the Generator then processes. The encoder and generator form a variational autoencoder, in which the encoder attempts to capture the style of the image, while the Generator combines the encoded style and the segmentation mask information, by means of SPADE, to reconstruct the original picture. Moreover, the encoder serves as a style guidance network at test time to capture the style of target pictures.
First, [27] considers two kinds of inputs to the Generator: random noise or downsampled segmentation maps. Second, the type of parameter-free normalization layer applied before the modulation parameters is varied. Next, the size of the convolutional kernel acting on the label map is varied; a kernel size of 1×1 is found to hurt performance, likely because it prevents the use of the label's context. Finally, the capacity of the generator network is adjusted by changing the number of convolutional filters.

F. EXPLORING STYLE TRANSFER
In recent times, NST algorithms have improved significantly on tasks such as image segmentation and replicating the content image into different images using styles. In [28], several new extensions and improvements to the original neural style transfer were explored, such as altering the original loss function to achieve multiple style transfers while preserving color, and semantically segmented style transfer. Gatys' approach includes a pre-trained feed-forward network that performs a forward-pass ''image transformation'' on the input image before inputting it to style transformations, which can be done in real-time video applications.
Method:
• The baseline taken was fast neural style transfer, consisting of two components: an image transformation network F_W and a loss function ϕ.
• The overall combined loss function, the final objective, is given as:
W* = arg min_W E_{x,{y_i}} [ Σ_i λ_i l_i(f_W(x), y_i) ]   (40)
where W are the weights, x is the image to be transformed, and y_i is a style image.


FIGURE 32. A scene from the test Sintel dataset, the style image used, and the outputs obtained from various methods. The highlighted regions are the ones with prominent differences. The error images show the temporal inconsistency, which is prominent in the third approach. (Ruder et al. [31]).

• Image Transformation Network: Color images used are 3 × 256 × 256 in shape.
◦ Downscaling: done by a convolutional layer with stride 2.
◦ Upscaling: done by a convolutional layer with stride 1/2.
◦ This method provides the computational benefit of operating in lower-dimensional spaces.
• Perceptual Losses:
◦ Feature Reconstruction loss: encourages the output image ŷ to have feature representations similar to those the loss network ϕ computes.
l_feat^{ϕ,j}(ŷ, y) = (1/(C_j H_j W_j)) ||ϕ_j(ŷ) − ϕ_j(y)||²₂   (41)
◦ Style Reconstruction loss: penalizes style differences such as colors and textures. Firstly, the Gram matrix is defined:
G_j^ϕ(x)_{c,c'} = (1/(C_j H_j W_j)) Σ_{h=1}^{H_j} Σ_{w=1}^{W_j} ϕ_j(x)_{h,w,c} ϕ_j(x)_{h,w,c'}   (42)
The style loss is the squared Frobenius norm of the difference between the Gram matrices of the generated and target images:
l_style^{ϕ,j}(ŷ, y) = ||G_j^ϕ(ŷ) − G_j^ϕ(y)||²_F   (43)
Minimizing the style reconstruction loss results in generating an image that preserves the stylistic features, but not the spatial characteristics, of the target.
• Simple Loss functions:
◦ Pixel loss: the normalized distance between the output ŷ and the target y:
l_pixel(ŷ, y) = (1/(CHW)) ||ŷ − y||²₂   (44)
◦ Total Variation Regularization: used for maintaining spatial smoothness.
◦ Multiple Style Transfer: an extension of vanilla neural style transfer that allows multiple style images to be transferred to a single content image.
◦ Requires a simple modification to the style loss function:
l_multi = Σ_{i=1}^{n} w_i l_style_i^{ϕ,J}(ŷ, y_i)   (45)
◦ This allows the style layers and weights to be chosen flexibly and independently for each style image.
◦ Allows us to generate images that readily blend the styles of multiple images.
◦ Trained with the Adam optimizer.
◦ When forced to blend multiple styles, it leads to a larger style loss than a single style image.
◦ Color Preserving Style Transfer: uses the luminance-only transfer, which works very well and requires only a simple transformation after a typical style transfer algorithm.


FIGURE 33. The ReCoNet architecture [33] for Video neural Style Transfer (presented by Gao et al.). It, Ft and Ot are input image, encoded
feature map, and generated image at time ‘‘t’’. A Frame is made up of these three objects. The previous frame is compared with the current
frame to compute ‘‘temporal loss’’ which results in better dependencies between two consecutive frames. (Ruder et al. [31]).

◦ Semantically Segmented Style transfer: clusters parts of the input image that belong to the same object class. It first generates a mask of shape H × W for the input, indicating for each pixel location whether or not to apply gradient descent there. All the above extensions make it possible to achieve real-time video processing applications. A minimal sketch of the perceptual losses above is given below.
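The perceptual losses of Eqs. (41)–(45) follow directly from their definitions. The PyTorch sketch below is illustrative only—the layer choices, weights, and the name `phi` for the pretrained loss network are placeholders, not the configuration used in [28]:

```python
import torch

def gram_matrix(feat):
    """Eq. (42): Gram matrix of a (N, C, H, W) feature batch."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def feature_loss(phi_yhat, phi_y):
    """Eq. (41): feature reconstruction loss at one layer."""
    return ((phi_yhat - phi_y) ** 2).mean()

def style_loss(phi_yhat, phi_y):
    """Eq. (43): squared Frobenius distance between Gram matrices."""
    return ((gram_matrix(phi_yhat) - gram_matrix(phi_y)) ** 2).sum()

def multi_style_loss(out_layers, styles_layers, weights):
    """Eq. (45): weighted blend of several style targets over the chosen layers."""
    total = 0.0
    for w, style_layers in zip(weights, styles_layers):
        total = total + w * sum(style_loss(a, b) for a, b in zip(out_layers, style_layers))
    return total

def total_variation(yhat):
    """TV regularizer encouraging spatial smoothness of the output image."""
    return ((yhat[..., 1:, :] - yhat[..., :-1, :]).abs().mean()
            + (yhat[..., :, 1:] - yhat[..., :, :-1]).abs().mean())
```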
G. AUTOMATIC IMAGE COLORIZATION USING GANs
[29] discusses how GANs can automate an image's colorization process without changing the picture's configuration. [29] used a Conditional GAN to achieve the result. While earlier architectural approaches were based on fully connected networks, [29] used layers of convolutions in the Generator. [29] also used a technique similar to expanding encoder networks and compressing decoder networks to reduce the memory dependency of training. The Generator takes in the greyscale image and then downsamples it; it is compressed as it passes through, and these operations are repeated four times, resulting in a matrix. In the expansion phase, it gets upsampled. Batch normalization and the use of leaky ReLU help in better training and performance of the GAN. The Discriminator starts from the greyscale image and the predicted image to form the color image. The activation functions used to stabilize the last layers of the Generator and the Discriminator are the tanh and Sigmoid activation functions, respectively. Another method used here is the Adam optimizer for the learning rate, together with strided convolutions, which improve the training performance depending on the convolution layers' invariances. Convergence failure was experienced on various occasions and was settled by changing optimizers, increasing learning rates, changing kernel sizes, and introducing batch normalization.

1) OBSERVATIONS
This part focuses mainly on achieving better quality images from GANs by improving color accuracy. [23], [25], and [28] propose Spatial and Color Control in parallel, which allows the use of multiple styles and preserves the content image's color for generating more photo-realistic images. By constraining the transform and adding a custom energy term, [26] provides a versatile model that handles various input images. [27] introduces ''spatially adaptive normalization,'' which assists in synthesizing photo-realistic images. A key feature provided by [23] is Scale Control, which allows us to mix coarse and fine attributes of two different styles. This method helps with training on high-resolution images and is highly scalable in that regard. [24] is solely focused on video Neural Style Transfer and introduces temporal consistency between frames to allow dependency between adjacent frames.

V. APPLICATION-BASED PAPERS
This section looks at the approaches, challenges, and limitations of Neural Style Transfer for videos on mobile phones. A few architectures are proposed based on their performance on mobile devices.

A. ARTISTIC STYLE TRANSFER FOR VIDEOS
[31] presents the application of image style transfer to a complete video. A few additions are made regarding initializations and loss functions to suit the video input, allowing stable stylized videos even with a high degree of motion. In addition, it processes each frame individually and adds temporal constraints that penalize deviation along point trajectories. [31] also propose two more extensions:
Long-term motion estimates allow consistency over a more considerable period in regions with occlusion.


TABLE 11. Different methods tested on multiple sequences with their temporal consistency errors. (Ruder et al. [31]).

A multi-pass algorithm is used to reduce the artifacts at the image boundaries. The algorithm considers forward and backward optical flow, resulting in a better-quality video. [31] propose using the previous frame to initialize the optimizer for the current frame. This allows similar parts of the frame to be rendered, whereas the changed parts are rebuilt. However, the technique has flaws when used on videos, as moving objects are not initialized properly. To address this, [25] consider the optical flow by warping the previous output:

x'^(i+1) = ω_i^{i+1}(x^(i))   (46)

where ω_i^{i+1} warps the input stylized frame x^(i) using the optical flow information derived from content frames g^(i) and g^(i+1). [31] use the DeepFlow and EpicFlow optical flow estimation algorithms to do so. The next addition is the use of temporal consistency losses to penalize adjacent-frame inconsistencies. To do so, they detect the disoccluded regions by comparing the forward and backward flows. The temporal loss then penalizes deviation between the generated image and the flow-consistent parts of the warped image. This is done with the help of a per-pixel weight map ''a'' that specifies per-pixel weightage depending on disocclusion and motion boundaries.

L_temporal(x, ω, a) = (1/D) Σ_{k=1}^{D} a_k · (x_k − ω_k)²   (47)

Thus, the short-term loss function is given as:

L_shortterm(g^(i), c, x^(i)) = α L_content(g^(i), x^(i)) + β L_style(c, x^(i)) + γ L_temporal(x^(i), ω_{i−1}^{i}(x^(i−1)), a^(i−1,i))   (48)

This is further extended to achieve longer-term consistency by incorporating the data from multiple previous frames rather than just one frame:

L_longterm(g^(i), c, x^(i)) = α L_content(g^(i), x^(i)) + β L_style(c, x^(i)) + γ Σ_{j∈J: i−j≥1} L_temporal(x^(i), ω_{i−j}^{i}(x^(i−j)), a_long^(i−j,i))   (49)

The weights a_long^(i−j,i) are computed as follows:

a_long^(i−j,i) = max( a^(i−j,i) − Σ_{k∈J: i−k>i−j} a^(i−k,i), 0 )   (50)

This means investigating past frames until a consistent correspondence is obtained. The advantage of this is that each pixel is associated with the nearest such frame, and the optical flow computed over temporally closer images has a smaller error; thus, it results in better videos. [31] handle the problem of strong motion using a multi-pass algorithm. The video is processed bi-directionally in multiple passes. By alternating the direction of the optical flow, firmer consistency is achieved. Initially, every frame is processed independently based on random initializations. The frames are then blended with the warped non-disoccluded parts of previous frames, on which the optimization algorithm is run for some iterations. Next, the forward and backward passes are alternated. The frame initializations for the forward and backward passes are given as:

x^(i)(j) = x^(i)(j−1)   if i = 1;   otherwise
x^(i)(j) = δ a^(i−1,i) ∘ ω_{i−1}^{i}(x^(i−1)(j)) + (δ̄ 1 + δ ā^(i−1,i)) ∘ x^(i)(j−1)   (51)

x'^(i)(j) = x^(i)(j−1)   if i = N_frames;   otherwise
x'^(i)(j) = δ a^(i+1,i) ∘ ω_{i+1}^{i}(x^(i+1)(j)) + (δ̄ 1 + δ ā^(i+1,i)) ∘ x^(i)(j−1)   (52)

The optical flow computation takes roughly 3 minutes per frame at a resolution of 1024 × 436, which is done with the help of parallel flow computation on the CPU, while the style transfer occurs on the GPU. The short-term consistency results on the Sintel dataset are presented in Table 11, where multiple approaches' errors are compared across different videos. The long-term consistency results are more qualitative; they are thus presented in the form of supplementary videos.
Ltemporal x(i) , ωi−j x(i−j) , along
X
+γ i
j∈J :i−j≥1 The work looks at the possibility of making video-style trans-
(49) fers using a feed-forward network. Differentiated and direct

131606 VOLUME 9, 2021


A. Singh et al.: NST: Critical Review

In contrast to directly applying an existing image style transfer procedure to videos frame by frame, the proposed method uses the trained network to yield temporally consistent stylized frames. In contrast to previous video style transfer methods, which rely on optimization on the fly, the referenced technique runs in real time while at the same time producing strong visual results.
The stylizing network accepts one frame as input and produces its stylized output. The loss network, pre-trained on the ImageNet classification task, first extracts the features of the stylized output frames and computes the losses used to train the stylizing network. During the training cycle, the stylizing network and the loss network are connected, and the loss network's spatial loss is utilized to train the stylizing network. With sufficient training, the stylizing network, taking one single frame as input, has encoded the temporal coherence picked up from a video dataset and is therefore able to produce temporally consistent stylized video frames.

1) STYLIZING NETWORK
The stylizing network is responsible for transforming a single video frame into a stylized one. After three convolutional blocks, the feature map's resolution is reduced to a quarter of the input. Five residual blocks then follow, leading to a fast synthesis. Compared with existing feed-forward networks for image style transfer, a significant advantage of this network is that it uses fewer channels to reduce the model size, which ends up incurring only a slight loss in stylization quality.

2) LOSS NETWORK
The essential features of the content frame, the stylized frame, and the style image need to be extracted for the spatial and temporal loss calculations used to train the stylizing network. VGG-19 is employed in this article as the loss network, as it represents image content and style images acceptably well. Two kinds of losses can be found in the model: Spatial Loss and Temporal Loss.

C. REAL-TIME VIDEO NEURAL STYLE TRANSFER ON MOBILE DEVICES
[33] presents a solution to two problems of video style transfer:
1. The difficulty of usage by non-experts.
2. Hardware limitations.
They present an app that can perform neural style transfer on videos at over 25 FPS. They also discuss performance concerning iOS-based devices, where they test an iPhone 6s and an iPhone 11 Pro. Limitations for Android devices are also discussed. The solution includes:
1. A real-time application of NST on mobile devices.
2. Existing solutions to temporal coherence.
The traditional approach of applying a convolution-based image generator per frame causes ''temporal inconsistency,'' i.e., unrelated frames causing flicker artifacts. [19] tries to solve this problem; however, their model has time-consuming computations. [33] use Gao et al.'s lightweight feed-forward network. White bubbles are seen in the images; however, these are caused by instance normalization and can be removed using filter response normalization, although no implementations exist for mobile devices. Other issues include faded colors. The model is trained in two stages: first on style and content losses and a total-variation regularization term,

L(t) = γ L_content + ρ L_style + τ L_tv   (53)

and second on achieving temporal consistency,

L(t − 1, t) = Σ_{i∈{t−1,t}} [ γ L_content(i) + ρ L_style(i) + τ L_tv(i) ] + λ_f L_temp,f(t − 1, t) + λ_o L_temp,o(t − 1, t)   (54)

L_temp,f and L_temp,o are the feature-based and output-based temporal losses presented in Gao's paper. The main idea is to use the optical flow between adjacent frames. The models do not need this information at inference time, effectively making them faster, since dense optical flow estimation is computationally expensive. On the other hand, introducing temporal coherence weakens the style transfer. Regarding Android vs. iPhone implementations, Apple has had better support since 2018's A12 chip and the CoreML library, allowing the effective use of dedicated NPUs. However, conversions between PyTorch and TensorFlow result in additional layers causing a 30–40% FPS drop. Furthermore, many libraries are yet to provide full mobile GPU operation support; thus, due to the lack of standardization, Android implementations are rare. [33] also compare two iPhones (6s and 11 Pro) with different model sizes and resolutions and chart their FPS.
Fig. 35 shows that the model mentioned above can output around 13 FPS at 480p on an iPhone 11 Pro with half a million parameters. This indicates that video NST on mobile devices still needs many improvements. The coarse-to-fine stylization presented in [24] can probably be applied to increase the resolution of the generated images.

D. MULTI-STYLE GENERATIVE NETWORK FOR REAL-TIME TRANSFER
[34] finds it challenging to capture comprehensive styles with scale-integrated modeling; detailed stroke modeling with scale-based style integration is difficult to achieve. A novel MSG-Net technique is therefore presented that achieves real-time control of the brush size: resizing the style image adjusts the brush's relative size with respect to the input images. A finer representation of image style requires a 2D method.
A. Singh et al.: NST: Critical Review

FIGURE 34. A diagram of the proposed model. It includes two components: a stylizing network and a loss network. Black, green, and red rectangles represent an input frame, an output frame, and a given style picture, respectively. (Huang et al. [32]).

FIGURE 35. Performance achieved per configuration in terms of Frames Per Second (FPS) vs. number of parameters in the model, charted for two mobile devices at two resolution levels. (Dudzik et al. [33]).

The model is based on the following works:
• Relation to Pyramid Matching: an early method was developed for texture synthesis using multi-scale image pyramids, where manipulating a white-noise image to match feature statistics could lead to realistic synthesis; this inspired the use of feature statistics here.
◦ This method uses a similar feed-forward network, but it takes advantage of the benefits of deep learning networks without putting the computational cost into the training process.
• Relation to Fusion Layer: the CoMatch Layer uses both content and style as input, and hence separates style from content.
• Content and Style Representation:
◦ The image texture or style can be represented as the distribution of the features by use of the Gram Matrix:
G(F^i(x)) = Σ_{h=1}^{H_i} Σ_{w=1}^{W_i} F^i_{h,w}(x) F^i_{h,w}(x)^T   (55)
◦ The Gram Matrix is order-less and describes the feature distributions.
• CoMatch Layer: explicitly matches the second-order feature statistics based on the given style.
◦ Ŷ^i is a solution that holds the semantic information of the content image and matches the texture from the style image:
Ŷ^i = argmin_{Ŷ^i} ||Ŷ^i − F^i(x_c)||²_F + α ||G(Ŷ^i) − G(F^i(x_s))||²_F   (56)
◦ To balance the contributions of the target's style and content, the α parameter is used; α allows a change of weightage for the style loss.
◦ The minimization above can be solved with an iterative technique; however, it is not practicable to do so in real time, nor does it keep the model differentiable.


FIGURE 36. An overview of MSG-Net. (Zhang et al. [34]).

FIGURE 38. Spatial control result. (Zhang et al. [34]).


FIGURE 37. Extended architecture. (Zhang et al. [34]).

◦ The target style feature map is instead tuned using the following approximation:
Ŷ^i = Φ^{−1}( Φ(F^i(x_c))^T W G(F^i(x_s)) )^T   (57)
◦ The layer is differentiable, can be introduced into an existing generative network, and can learn directly from the loss function without supervision.
• Multi-style Generative Network (MSG-Net): this method introduces matching the feature statistics explicitly at runtime.
◦ A Siamese network and the encoder of the transformation network share their weights; it picks up the feature statistics from the style image and outputs its Gram matrices.
◦ Matches the features of the style image at multiple levels with the content image using CoMatch.
◦ Upsampled convolution: upsampling followed by a convolutional layer, compared with fractionally strided convolution; the computational complexity and number of parameters are precisely four times larger for this approach, and in this way the network avoids upsampling artifacts.
◦ Upsampled Residual block: the original architecture is extended with an upsampling version of the fractionally strided convolution, as shown in Fig. 37.
◦ Brush Stroke Size Control: the network is conditioned to learn different brush stroke sizes with different style image sizes. Users can choose the brush stroke size after training.
◦ The employment of weighted ReLU layers and the normalization process improves the quality of the generated pictures and resists shifts in picture contrast.
◦ The loss to be minimized is:
Ŵ_G = argmin_{W_G} E_{x_c,x_s} { λ_c ||F^c(G(x_c, x_s)) − F^c(x_c)||²_F + λ_s Σ_{i=1}^{K} ||G(F^i(G(x_c, x_s))) − G(F^i(x_s))||²_F + λ_TV ℓ_TV(G(x_c, x_s)) }   (58)

The speed and size of models are crucial for mobile apps and cloud services. These are shown in Table 12.
• MSG-Net is shown to be faster due to its own encoder in place of a pretrained VGG network.
• Model Scalability: it is noted that there is no loss in quality as the number of styles rises, on a real-time basis.
• Fig. 38 shows the spatial control using this model.

1) OBSERVATIONS
Moreover, [32] extends [19] and [24] and adds optical flow estimation based on multiple frames to improve temporal consistency.


[34] introduces a CoMatch Layer that matches second-order feature statistics with target styles. [33] focuses on implementation on mobile devices and compares the performance of video style transfer models [33] of varying sizes and input resolutions on two devices. It is observed that achieving reasonable frame rates at high resolutions is difficult, given the lack of GPU usage on mobile devices.

TABLE 12. Comparison between different models' architectures based on model size and speed. (Zhang et al. [34]).

VI. NST EVALUATION METRICS
Evaluation metrics for NST can be challenging because of the variety of GAN models. However, accuracy, Fréchet Inception Distance (FID), Intersection-over-Union (IoU), time, perceptual path length, and warping error are the most often utilized metrics for the models constructed in the publications evaluated [36].
• The accuracy was used to measure the relative depth of the predicted images. It was also used to evaluate predicted feature maps, where the higher the accuracy, the more accurate the feature maps.
• The Fréchet Inception Distance (FID) approximates the real and fake feature distributions with two Gaussian distributions. It then computes the Fréchet distance (Wasserstein-2 distance) between the two Gaussian distributions and uses the result to determine the model's quality.
• A few papers use the Intersection-over-Union (IoU) metric to determine the accuracy of segmentation and detection in object classification and localization.
• The perceptual path length quantifies the difference between consecutive images (VGG16 embeddings). It determines whether the image changes along the shortest perceptual path in the latent space where fake images are introduced.
• The warping error is the difference between the warped and real subsequent frames. The warping error value is a good metric for determining the smoothness of a video, since it is an efficient technique to monitor video stability over many frames. A sketch of this metric is given below.
stability with many frames. A benchmark dataset could make testing, evaluating, and
understanding the model’s performance standardized.
VII. POSSIBLE FUTURE APPLICATIONS OF NST Another point observed is that some articles create their
Apart from various exciting image transformation use cases, datasets and apply different transforms to data, which
NST can be extended in a few more application areas such as: can distort the image’s structure, leading to the genera-
• Movies: NST can change the scenes captured in movies tion of artifacts.
using representational objects instead of green screens b. Lack of a good benchmark metric: It is observed
and tedious editing [37], [38]. and discussed above that many papers turn to Amazon

131610 VOLUME 9, 2021


A. Singh et al.: NST: Critical Review


M-Turks (a service that offers manual labor) to inspect the quality of the images generated. Photorealism is usually inspected manually and thus could be a place to add a metric. However, this can be difficult, as photorealism is subjective and might change depending on context. In addition, as discussed previously, whereas there are metrics such as Intersection over Union or accuracy, they rely on ''comparing'' two similar images. This can be particularly challenging to use, as one needs some ''ground truth'' to compare to, and paired samples can be tricky to obtain.
• Model Architectures: It is seen that many of the models cannot handle super-resolution very well. The scalability of models in terms of the resolution of generated images is thus another concern. Apart from this, most data available or pieced together is usually unpaired, meaning the content and style images do not have the same structural composition.

IX. CONCLUSION AND FUTURE SCOPE
NST, one of the exhilarating AI applications adopted for artistic use of photos and videos, has started capturing the attention of GAN researchers in the last few years. This review consisted of a comprehensive study of GANs and video NST, divided into four parts. Initially, the working of GANs was explained, along with recent developments in the different types of models for NST on mobile devices, like CartoonGAN, Artsy-GAN, etc. Unpaired images can be used for training GANs using CycleGANs. Furthermore, adding ''temporal losses'' allows consistency between adjacently generated frames, as seen across multiple architectures. Then came the GAN improvement papers, explaining how Spatial, Color, and Scale control can allow better image generation. Lastly, how NST can be applied on mobile devices in real time using GANs was explained.
However, real-time NST on mobile devices with a reasonable frame rate is still relatively difficult to achieve. As time progresses, low-power devices and devices with a smaller footprint will perform and handle large-scale computation better. This will be an exciting avenue to investigate, considering NST can be used in Augmented Reality. Non-iterative video NST is a good topic for future research since it can considerably reduce the time required to process videos. Since NST has vast potential, its research is expected to grow exponentially in the coming years.

REFERENCES
[1] The Smartphone vs. the Camera Industry. Accessed: Apr. 20, 2021. [Online]. Available: https://photographylife.com/smartphone-vs-camera-industry/amp
[2] Adobe Premiere Pro. Accessed: Apr. 20, 2021. [Online]. Available: https://www.adobe.com/in/products/premiere/movie-and-film-editing.html
[3] DaVinci Resolve. Accessed: Apr. 20, 2021. [Online]. Available: https://www.blackmagicdesign.com/in/products/davinciresolve
[4] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song, ''Neural style transfer: A review,'' IEEE Trans. Vis. Comput. Graphics, vol. 26, no. 11, pp. 3365–3385, Nov. 2020.
[5] H. Li, A Literature Review of Neural Style Transfer. Princeton, NJ, USA: Princeton Univ. Technical Report, 2019.
[6] J. Li, Q. Wang, H. Chen, J. An, and S. Li, ''A review on neural style transfer,'' J. Phys., Conf. Ser., vol. 1651, Nov. 2020, Art. no. 012156.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial networks,'' 2014, arXiv:1406.2661. [Online]. Available: http://arxiv.org/abs/1406.2661
cently generated frames as seen over multiple architectures. org/abs/1406.2661


[8] T. Karras, S. Laine, and T. Aila, ''A style-based generator architecture for generative adversarial networks,'' 2018, arXiv:1812.04948. [Online]. Available: http://arxiv.org/abs/1812.04948
[9] A. Radford, L. Metz, and S. Chintala, ''Unsupervised representation learning with deep convolutional generative adversarial networks,'' 2015, arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511.06434
[10] A. Dosovitskiy, P. Fischer, J. Springenberg, M. Riedmiller, and T. Brox, ''Discriminative unsupervised feature learning with exemplar convolutional neural networks,'' in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014.
[11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, ''Unpaired image-to-image translation using cycle-consistent adversarial networks,'' 2017, arXiv:1703.10593. [Online]. Available: http://arxiv.org/abs/1703.10593
[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ''Image-to-image translation with conditional adversarial networks,'' 2016, arXiv:1611.07004. [Online]. Available: http://arxiv.org/abs/1611.07004
[13] L. A. Gatys, A. S. Ecker, and M. Bethge, ''Image style transfer using convolutional neural networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2414–2423.
[14] Y. Jin, J. Zhang, M. Li, Y. Tian, H. Zhu, and Z. Fang, ''Towards the automatic anime characters creation with generative adversarial networks,'' 2017, arXiv:1708.05509. [Online]. Available: http://arxiv.org/abs/1708.05509
[15] N. Kodali, J. Abernethy, J. Hays, and Z. Kira, ''On convergence and stability of GANs,'' 2017, arXiv:1705.07215. [Online]. Available: http://arxiv.org/abs/1705.07215
[16] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, ''Photo-realistic single image super-resolution using a generative adversarial network,'' 2016, arXiv:1609.04802. [Online]. Available: http://arxiv.org/abs/1609.04802
[17] Y. Chen, Y.-K. Lai, and Y.-J. Liu, ''CartoonGAN: Generative adversarial networks for photo cartoonization,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 9465–9474.
[18] H. Liu, P. N. Michelini, and D. Zhu, ''Artsy-GAN: A style transfer system with improved quality, diversity and performance,'' in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 79–84.
[19] C. Li and M. Wang, ''Precomputed real-time texture synthesis with Markovian generative adversarial network,'' in Proc. ECCV, 2016, pp. 702–716.
[20] V. Kitov, K. Kozlovtsev, and M. Mishustina, ''Depth-aware arbitrary style transfer using instance normalization,'' 2019, arXiv:1906.01123. [Online]. Available: http://arxiv.org/abs/1906.01123
[21] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, ''StyleBank: An explicit representation for neural image style transfer,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1897–1906.
[22] C. Zheng and Y. Zhang, ''Two-stage color ink painting style transfer via convolution neural network,'' in Proc. 15th Int. Symp. Pervas. Syst., Algorithms Netw. (I-SPAN), Oct. 2018, pp. 193–200.
[23] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, ''Controlling perceptual factors in neural style transfer,'' 2016, arXiv:1611.07865. [Online]. Available: http://arxiv.org/abs/1611.07865
[24] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei, ''Characterizing and improving stability in neural style transfer,'' 2017, arXiv:1705.02092. [Online]. Available: http://arxiv.org/abs/1705.02092
[25] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman, ''Preserving color in neural artistic style transfer,'' 2016, arXiv:1606.05897. [Online]. Available: http://arxiv.org/abs/1606.05897
[26] F. Luan, S. Paris, E. Shechtman, and K. Bala, ''Deep photo style transfer,'' 2017, arXiv:1703.07511. [Online]. Available: http://arxiv.org/abs/1703.07511
[27] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, ''GauGAN: Semantic image synthesis with spatially adaptive normalization,'' in Proc. ACM SIGGRAPH Real-Time Live, Jul. 2019, p. 1, doi: 10.1145/3306305.3332370.
[28] N. Makow and P. Hernandez, ''Exploring style transfer: Extensions to neural style transfer,'' Stanford Univ., Stanford, CA, USA, Tech. Rep. 2017-428, 2017.
[29] R. Dhir, M. Ashok, S. Gite, and K. Kotecha, ''Automatic image colorization using GANs,'' in Soft Computing and its Engineering Applications (Communications in Computer and Information Science), vol. 1374, K. K. Patel, D. Garg, A. Patel, and P. Lingras, Eds. Singapore: Springer, 2020, pp. 15–26, doi: 10.1007/978-981-16-0708-0_2.
[30] R. Dhir, M. Ashok, and S. Gite, ''An overview of advances in image colorization using computer vision and deep learning techniques,'' Rev. Comput. Eng. Res., vol. 7, no. 2, pp. 86–95, 2020. [Online]. Available: http://www.conscientiabeam.com/journal/76/abstract/6190, doi: 10.18488/journal.76.2020.72.86.95.
[31] M. Ruder, A. Dosovitskiy, and T. Brox, ''Artistic style transfer for videos,'' 2016, arXiv:1604.08610. [Online]. Available: http://arxiv.org/abs/1604.08610
[32] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu, ''Real-time neural style transfer for videos,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 783–791.
[33] W. Dudzik and D. Kosowski, ''Kunster—AR art video maker—Real time video neural style transfer on mobile devices,'' 2020, arXiv:2005.03415. [Online]. Available: http://arxiv.org/abs/2005.03415
[34] H. Zhang and K. Dana, ''Multi-style generative network for real-time transfer,'' 2017, arXiv:1703.06953. [Online]. Available: http://arxiv.org/abs/1703.06953
[35] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua, ''Coherent online video style transfer,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1105–1114.
[36] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger, ''An empirical study on evaluation metrics of generative adversarial networks,'' 2018, arXiv:1806.07755. [Online]. Available: http://arxiv.org/abs/1806.07755
[37] B. Joshi, K. Stewart, and D. Shapiro, ''Bringing impressionism to life with neural style transfer in come swim,'' 2017, arXiv:1701.04928. [Online]. Available: http://arxiv.org/abs/1701.04928
[38] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, ''Stereoscopic neural style transfer,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6654–6663.
[39] Z. Hao, A. Mallya, S. Belongie, and M.-Y. Liu, ''GANcraft: Unsupervised 3D neural rendering of minecraft worlds,'' 2021, arXiv:2104.07659. [Online]. Available: http://arxiv.org/abs/2104.07659
[40] W. Jiang, S. Liu, C. Gao, J. Cao, R. He, J. Feng, and S. Yan, ''PSGAN: Pose and expression robust spatial-aware GAN for customizable makeup transfer,'' 2019, arXiv:1909.06956. [Online]. Available: http://arxiv.org/abs/1909.06956
[41] T. Nguyen, A. Tran, and M. Hoai, ''Lipstick ain't enough: Beyond color matching for in-the-wild makeup transfer,'' 2021, arXiv:2104.01867. [Online]. Available: http://arxiv.org/abs/2104.01867
[42] O. Texler, D. Futschik, M. Kučera, O. Jamrička, Š. Sochorová, M. Chai, S. Tulyakov, and D. Sýkora, ''Interactive video stylization using few-shot patch-based training,'' ACM Trans. Graph., vol. 39, no. 4, p. 73, 2020.
[43] J. Song and J. Chul Ye, ''Federated CycleGAN for privacy-preserving image-to-image translation,'' 2021, arXiv:2106.09246. [Online]. Available: http://arxiv.org/abs/2106.09246
[44] A. Li, C. Wu, Y. Chen, and B. Ni, ''MVStylizer: An efficient edge-assisted video photo-realistic style transfer system for mobile phones,'' in Proc. 21st Int. Symp. Theory, Algorithmic Found., Protocol Design Mobile Netw. Mobile Comput. New York, NY, USA: Association for Computing Machinery, 2020, pp. 31–40, doi: 10.1145/3397166.3409140.
[45] A. Junginger, M. Hanselmann, T. Strauss, S. Boblest, J. Buchner, and H. Ulmer, ''Unpaired high-resolution and scalable style transfer using generative adversarial networks,'' 2018, arXiv:1810.05724. [Online]. Available: http://arxiv.org/abs/1810.05724
[46] Y. Deng, F. Tang, X. Pan, W. Dong, C. Ma, and C. Xu, ''StyTr2: Unbiased image style transfer with transformers,'' 2021, arXiv:2105.14576. [Online]. Available: https://arxiv.org/abs/2105.14576
[47] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, ''Controlling perceptual factors in neural style transfer,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3985–3993.
[48] Computer Vision—ECCV 2018 Workshops. New York, NY, USA: Springer, 2019.
[49] Computer Vision—ECCV 2016. New York, NY, USA: Springer, 2016.
[50] C. Zhou, Z. Gu, Y. Gao, and J. Wang, ''An improved style transfer algorithm using feedforward neural network for real-time image conversion,'' Sustainability, vol. 11, no. 20, p. 5673, Oct. 2019.
[51] T. Karras, S. Laine, and T. Aila, ''A style-based generator architecture for generative adversarial networks,'' IEEE Trans. Pattern Anal. Mach. Intell., early access, Feb. 2, 2020, doi: 10.1109/TPAMI.2020.2970919.
[52] C. Gao, D. Gu, F. Zhang, and Y. Yu, ''ReCoNet: Real-time coherent video style transfer network,'' 2018, arXiv:1807.01197. [Online]. Available: http://arxiv.org/abs/1807.01197
[53] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, ''Visualizing and understanding generative adversarial networks extended abstract,'' 2019, arXiv:1901.09887. [Online]. Available: http://arxiv.org/abs/1901.09887


AKHIL SINGH is currently pursuing the B.Tech. degree in computer science with the Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune. His research interests include machine learning, deep learning, generative adversarial networks, and explainable artificial intelligence.

VAIBHAV JAISWAL is currently pursuing the B.Tech. degree in computer science with the Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune. His research interests include deep learning and generative adversarial networks.

GAURAV JOSHI is currently pursuing the B.Tech. degree in computer science with the Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune. His research interests include deep learning, generative adversarial networks, mixed reality, and game development.

ADITH SANJEEVE is currently pursuing the B.Tech. degree in computer science with the Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune. His research interests include machine learning, deep learning, generative adversarial networks, and computer vision.

SHILPA GITE received the Ph.D. degree in deep learning for assistive driving in semi-autonomous vehicles from Symbiosis International (Deemed) University, Pune, India, in 2019. She is currently working as an Associate Professor with the Computer Science Department, Symbiosis Institute of Technology, Pune. She is also working as an Associate Faculty at the Symbiosis Centre of Applied AI (SCAAI). She is also guiding Ph.D. students in biomedical imaging, self-driving cars, and natural language processing areas. She has around 13 years of teaching experience. She has published more than 60 research articles in international journals and 25 Scopus-indexed international conferences. Her research interests include deep learning, machine learning, medical imaging, and computer vision. She was a recipient of the Best Paper Award at the 11th IEMERA Conference held virtually at Imperial College, London, in October 2020.

KETAN KOTECHA received the M.Tech. and Ph.D. degrees from IIT Bombay. He is currently holding the positions as the Head of the Symbiosis Centre for Applied AI (SCAAI), the Director of the Symbiosis Institute of Technology, the CEO of the Symbiosis Centre for Entrepreneurship and Innovation (SCEI), and the Dean of the Faculty of Engineering, Symbiosis International (Deemed) University. He has expertise and experience in cutting-edge research and projects in AI and deep learning for the last 25 years. He has published more than 100 articles in a number of excellent peer-reviewed journals on various topics ranging from cutting-edge AI, education policies, and teaching-learning practices, to AI for all. He has published three patents and delivered keynote speeches at various national and international forums, including at the Machine Intelligence Laboratory, USA, IIT Bombay under a World Bank Project, the International Indian Science Festival organized by the Department of Science and Technology, Government of India, and many more.
Dr. Kotecha was a recipient of two SPARC projects worth INR 166 Lakhs from MHRD Government of India in AI, in collaboration with Arizona State University, USA, and the University of Queensland, Australia. He was also the recipient of numerous prestigious awards, like the Erasmus+ Faculty Mobility Grant to Poland, the DUO-India Professors Fellowship for research in responsible AI in collaboration with Brunel University, U.K., the LEAP Grant at Cambridge University, U.K., the UKIERI Grant with Aston University, U.K., and a Grant from the Royal Academy of Engineering, U.K., under the Newton Bhabha Fund. He is currently an Academic Editor and an Associate Editor of the IEEE ACCESS journal.
