Regularization Methods For Generative Adversarial Networks: An Overview of Recent Studies
Abstract
Despite its short history, Generative Adversarial Network (GAN) has been extensively studied and
used for various tasks, including its original purpose, i.e., synthetic sample generation. However,
applying GAN to different data types with diverse neural network architectures has been hindered by its training instability, in which the model easily diverges. This notoriously difficult training of GANs is well known and has been addressed in many studies. Consequently, in order to stabilize the training of GANs, numerous regularization methods have been proposed in recent years. This paper
reviews the regularization methods that have been recently introduced, most of which have been
published in the last three years. Specifically, we focus on general methods that can be commonly
used regardless of neural network architectures. To explore the latest research trends in the
regularization for GANs, the methods are classified into several groups by their operation principles,
and the differences between the methods are analyzed. Furthermore, to provide practical knowledge
of using these methods, we investigate popular methods that have been frequently employed in state-
of-the-art GANs. In addition, we discuss the limitations in existing methods and propose future
research directions.
1. Introduction
Generative Adversarial Network (GAN) is one of the most rapidly developed deep learning methods in recent years. Although the initial GAN model was introduced only recently compared to other conventional deep learning methods, such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), GAN is now extensively used
in various practical applications, including image synthesis [5,33,34,79,82], image inpainting [78], video generation [67], data
augmentation [6,44], and style transfer [3,12,54].
Distinct from the other conventional deep learning methods, GAN is a generative model whose objective is to learn distributions instead of data points as targets. Therefore, the output of GAN models is generally a distribution as well. By using
the Monte Carlo method over the distributions, GAN can be used for synthetic sample generation since the data point extracted
by the Monte Carlo method corresponds to a synthetic sample that mimics features of real samples.
However, it is well known that the training of GAN is challenging compared to that of other deep learning methods since the
training is commonly unstable, and weight parameters of GAN easily diverge due to its adversarial training process [21]. These
incorrectly trained GANs generally produce identical samples regardless of the input noise; such a training failure is called the mode collapse problem [77].
Thus, it is critical to stabilize the training of GANs in order to induce them to learn target distributions precisely.
Since the convergence of GAN has been proved under the condition that the discriminator in GANs is optimal [20], in recent
years, a large number of studies have proposed regularization methods that aim at making the discriminator stable. Also, it has
been demonstrated that Lipschitz continuity of the discriminator with regard to sample spaces is a key condition for a stable
GAN training [2]; consequently, numerous methods to enforce the discriminator to satisfy such a condition have been proposed.
In this paper, we review recent regularization methods for the training of GANs. The most recent methods are investigated, and we classify them into several groups according to their operation principles in order to explore research trends in GAN regularization. Furthermore, we examine popular methods that are frequently employed in state-of-the-art GAN models.
While several studies have been conducted to review GAN models, they have commonly focused on different architectures of
GAN models [24,52], GAN losses [70], or practical applications using GANs [9,76]. Instead, this paper aims at investigating
the regularization methods that can straightforwardly be integrated with various GANs; by applying the appropriate regularization methods presented in this paper, GANs can thus be used with diverse neural network architectures in
many research domains. For this reason, we exclude the regularization methods for specific variants of GANs since they can
hardly be used for different architectures of GAN models; we only focus on the methods that can be used universally.
In addition, in this paper, we discuss common trends in regularization methods as well as limitations of the methods and then
propose future research directions. Therefore, this paper helps researchers to gain a better understanding of regularization
methods in GANs, from various aspects, such as real applications of the methods and developments of novel regularization
methods.
2. Background
2.1 Generative adversarial network
A conventional generative adversarial network is composed of two neural network modules, i.e., a generator and a discriminator.
The generator can be interpreted as a function that produces high-dimensional samples from low-dimensional feature vectors 𝑧 ∈ ℝ𝑘 as its inputs, i.e., 𝐺(∙): ℝ𝑘 → ℝ𝑝, where 𝑝 denotes the dimension of a real sample; thus, the target of the training of the generator is to find the latent feature vectors constituting a dataset, and to allocate the features to its inputs. To achieve this goal, GAN uses an adversarial learning process with the discriminator.
The discriminator is designed to distinguish between real samples and fake samples produced by the generator. Therefore, the discriminator corresponds to a function whose output is one-dimensional, i.e., 𝐷(∙): ℝ𝑝 → ℝ. The initial GAN applies the sigmoid function together with the cross-entropy loss to the output of the discriminator so that the output represents the probability that a sample is real rather than generated.
The training of the generator and the discriminator is performed in a competitive manner, in which the generator and the
discriminator play a game. In the beginning, a randomly initialized generator produces noise-like samples, which generally
consist of random values. Therefore, the discriminator can be easily trained to distinguish between these fake samples and real
samples. The training of the ordinary discriminator is performed with the following cross-entropy loss:

$$\mathcal{L}_D := -\mathbb{E}[\log D(x)] - \mathbb{E}[\log(1 - D(G(z)))] \qquad ( 1 )$$

where 𝑥 denotes a real sample and 𝑧 denotes a latent input vector of the generator.
Then, the generator can learn from such learning of the discriminator by targeting to deceive the discriminator; gradients can be backpropagated from the output of the discriminator since all components from the input of the generator to the output of the discriminator are connected through neural network structures. The training loss of the generator in the initial GAN can be represented as follows:

$$\mathcal{L}_G := \mathbb{E}[\log(1 - D(G(z)))] \qquad ( 2 )$$
By such evolving, adversarial learning between the generator and the discriminator, latent features of a dataset can be allocated to the input variables of the generator (Figure 1(A)). After sufficient iterations of such learning, the generator can produce synthetic but realistic samples by applying the Monte Carlo method to the input 𝑍 (Figure 1(B)).
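For illustration, one iteration of this adversarial training can be sketched in PyTorch as follows; the networks, optimizers, and latent dimension are placeholders rather than components of any particular study, and the discriminator is assumed to end with a sigmoid so that losses ( 1 ) and ( 2 ) apply directly.

```python
import torch

def train_step(G, D, opt_G, opt_D, x_real, k=128):
    """One adversarial training step with the initial GAN losses (1) and (2).
    G, D, and their optimizers are assumed to be defined elsewhere; D is
    assumed to output sigmoid probabilities in (0, 1)."""
    z = torch.randn(x_real.size(0), k)        # latent vectors for the generator

    # Discriminator update: minimize the cross-entropy loss (1).
    d_real = D(x_real)
    d_fake = D(G(z).detach())                 # detach: do not update G here
    loss_D = -(torch.log(d_real) + torch.log(1.0 - d_fake)).mean()
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update: deceive the discriminator, loss (2).
    loss_G = torch.log(1.0 - D(G(z))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```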
However, it has been known that GAN training is not straightforward since the target of the generator is produced by the other neural network structure, i.e., the discriminator, during the training. Therefore, the target constantly fluctuates, making the convergence of GANs challenging.
Figure 1. The framework of GANs. (A) The training process; (B) Sample generation after the training. G represents the
generator; D represents the discriminator. The Monte Carlo method is used over the latent vector Z.
While the convergence of GANs has been proved, an optimal discriminator is required for the proof [20]; in practice, however, the discriminator is neither optimal nor fixed during the GAN training. Due to such a discrepancy between the theoretical proof and the actual training, numerous studies have attempted to address this issue.
A discriminator 𝐷 is Lipschitz continuous if it satisfies

$$|D(x_1) - D(x_2)| \le K \|x_1 - x_2\| \qquad ( 3 )$$

where 𝐾 ≥ 0 is a real constant called the Lipschitz constant, and ∀𝑥1 , 𝑥2 ∈ ℝ𝑝 . Therefore, Lipschitz continuity implies that the absolute values of the gradients of the discriminator with respect to the sample space must be within a certain range, i.e., bounded by 𝐾, since ( 3 ) can be rearranged as

$$\frac{|D(x_1) - D(x_2)|}{\|x_1 - x_2\|} \le K \qquad ( 4 )$$
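The gradient interpretation of ( 4 ) can be checked numerically. The following sketch, assuming a hypothetical PyTorch discriminator D, computes the per-sample norm of the discriminator's gradients with respect to the sample space via automatic differentiation; for a 𝐾-Lipschitz discriminator, the returned values are bounded by 𝐾.

```python
import torch

def discriminator_grad_norms(D, x):
    """Per-sample gradient norms of D with respect to the sample space.
    A K-Lipschitz discriminator keeps these values below K; see (4)."""
    x = x.clone().requires_grad_(True)
    out = D(x).sum()                      # scalar output so grad() yields dD/dx
    (g,) = torch.autograd.grad(out, x)
    return g.flatten(1).norm(2, dim=1)    # one gradient norm per sample
```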
Figure 2. Graphical representations of Lipschitz continuity. (A) Lipschitz continuity with respect to neural network;
(B) Lipschitz continuity with respect to a sample space. In (B), a surface of the outputs of the discriminator is illustrated
with regard to a two-dimensional sample space with variables, Xa and Xb. Satisfying Lipschitz continuity corresponds to
smoothing the surface in order to make the gradients sufficiently small.
Figure 3. Variants of GANs and the scope of this study. The research scope is highlighted in orange. This study addresses
regularization methods that are universally applied to various GANs, regardless of their architectures. Thus, regularization
methods using special architectures are excluded in this study, due to their nonuniversality.
Several studies have modified the architectures of GANs and reported improved training stability; however, it is hard to attribute such improvements to certain regularization effects since there is little evidence that the modifications in architectures and regularization are related. Although several studies have employed auto-encoder structures for the discriminator [5,84] and demonstrated superior performance in generating synthetic images, it is likewise unclear whether such performance results from a certain regularization effect. Thus, architectural methods are excluded from the research scope due to their nonuniversality and uncertain regularization effects.
In this paper, we categorize existing regularization methods into five groups, according to their operating principles: 1) Gradient
penalties; 2) Weight normalizations; 3) Imbalanced training; 4) Normalizations of outputs of layers; 5) Modified losses and
targets of GANs. Then, in each group, we analyze the approach and present the most recent methods. The scope of the study is
illustrated in Figure 3. We describe each group and corresponding methods in the following sections.
Gradient penalty methods augment the ordinary discriminator loss with an auxiliary term that penalizes the gradients:

$$\mathcal{L}_{D+GP} := \mathcal{L}_D + \lambda \cdot \mathcal{L}_{GP}(\tilde{x}, W) \qquad ( 5 )$$

where 𝓛𝐷 denotes the ordinary loss; 𝓛𝐺𝑃 is the auxiliary loss function of each gradient penalty method; 𝜆 is a setting parameter that modulates the penalty; 𝑥̃ represents a sample obtained from a target sample space; 𝑊 indicates the weight matrices of the discriminator. In this manner, each gradient penalty method varies by using a different 𝓛𝐺𝑃 and 𝑥̃. The differences are summarized in Table 1.
Gradient Penalty (GP) [21] initially introduced a constraint that penalizes the gradients.
Table 1. Gradient penalty methods with auxiliary penalty functions and sampling algorithms. 𝓛𝐺𝑃 and 𝑥̃ represent the penalty function and the sampling algorithm, respectively. 𝛼 is a parameter drawn from a uniform distribution, which varies in each iteration. ‖𝐷‖𝐿𝑖𝑝 denotes the Lipschitz norm of the discriminator.

| Reference | Year | 𝓛𝐺𝑃 | 𝑥̃ | Lipschitz continuity |
|---|---|---|---|---|
| Gulrajani et al. [21] | 2017 | 𝔼[(‖∇𝑥̃ 𝑊‖2 − 1)²] | (1 − 𝛼)𝑥 + 𝛼𝑥̂ | ‖𝐷‖𝐿𝑖𝑝 → 1 |
In GP, the discriminator is trained targeting ‖𝐷‖𝐿𝑖𝑝 → 1, where ‖𝐷‖𝐿𝑖𝑝 denotes the left-hand term in ( 4 ). To satisfy such a condition, GP uses an 𝐿2 penalty as 𝓛𝐺𝑃, in which gradients whose norms deviate from one are penalized.
Also, GP introduces a process to obtain 𝑥̃; an interpolation between a real sample and a generated sample is used, which can be obtained as follows:

$$\tilde{x} := (1 - \alpha)x + \alpha\hat{x} \qquad ( 6 )$$

where 𝛼 indicates a parameter sampled from 𝑈𝑛𝑖𝑓(0,1) in each training iteration; 𝑥 denotes a real sample; 𝑥̂ represents a sample produced by the generator.
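A common implementation combines ( 5 ) and ( 6 ) in a few lines. The following is a minimal PyTorch sketch, assuming a discriminator D with unbounded outputs and using 𝜆 = 10 only as an illustrative setting; the variants in Table 1 differ mainly in the penalty term (for instance, MaxGP penalizes the maximum gradient norm instead) and in how 𝑥̃ is sampled.

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    """GP of Gulrajani et al. [21]: penalize deviations of the gradient norm
    from one at points interpolated between real and fake samples, (6)."""
    alpha = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)))  # U(0,1) per sample
    x_tilde = ((1 - alpha) * x_real + alpha * x_fake.detach()).requires_grad_(True)
    (grads,) = torch.autograd.grad(D(x_tilde).sum(), x_tilde, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()  # added to the ordinary loss, (5)
```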
Other gradient penalty methods correspond to variants of GP in general, where similar approaches are employed for 𝓛𝐺𝑃 .
Petzka et al. [53] argued that GP does not directly enforce Lipschitz continuity since ‖𝐷‖𝐿𝑖𝑝 converges to around one, and
gradients having small values of ‖𝐷‖𝐿𝑖𝑝 are trained to be one. Thus, they proposed a maximum threshold for the initial GP,
called Lipschitz Penalty (LP), in order to handle this problem. Consequently, in LP, the gradients can be constrained to be under
one.
Zhou et al. [86] introduced Max Gradient Penalty (MaxGP), which uses the maximum value of the gradients as the penalty. Based on the claim that the maximum of the gradients is equivalent to the Lipschitz constant 𝐾, proposed in Adler and Lunz [1], MaxGP penalizes the maximum value, i.e., 𝓛𝐺𝑃 ≔ max{‖∇𝑥̃ 𝑊‖2}. Therefore, since the gradients are directly penalized without a certain threshold, they converge to zero, which is distinct from LP.
Recently, Thanh-Tung et al. [64] proposed another gradient penalty method with the same convergence target as MaxGP, called 0-GP. In contrast to MaxGP, 0-GP uses an average value over all weights, making the weights strictly regularized to satisfy the convergence. However, although 0-GP was compared in the study to several similar methods with the same convergence target but different sampling algorithms, MaxGP and 0-GP themselves were not compared; thus, it is difficult to conclude that such a strict regularization is effective.
The aforementioned methods commonly use the same sampling process as that of GP, in which an interpolated value between a real sample and a generated sample is employed. In contrast, several methods introduce different sampling processes. For instance, Kodali et al. [30] introduced 𝑥 + 𝜖 as 𝑥̃, which adds small noise to real samples. Similarly, Mescheder et al. [46] employed both real samples and fake samples as 𝑥̃. They claimed that it is not necessary to satisfy Lipschitz continuity over the whole sample space: since the discriminator is trained with real samples and generated samples, it is sufficient to satisfy Lipschitz continuity around the training space, i.e., 𝑥 + 𝜖 and 𝑥̂ + 𝜖. From this viewpoint, the sampling algorithm used in GP and similar methods restricts the training of the discriminator in an excessive manner because it tries to constrain the whole interpolated space between real samples and generated samples.
Table 2. Weight penalty and normalization methods. 𝓛𝑊 represents the penalty function. 𝑊̃ denotes the normalized weight and the algorithm used to obtain it. 𝜎 stands for the largest singular value of a matrix.

| Reference | Year | Method | 𝓛𝑊 | 𝑊̃ |
|---|---|---|---|---|
| Arjovsky et al. [2] | 2017 | Normalization | - | 𝑊̃ ≔ 𝑐𝑙𝑖𝑝(𝑊, [−0.01, 0.01]) |
| Brock et al. [7] | 2017 | Penalty | ‖𝑊𝑊𝑇 − 𝐼‖𝐹² | - |
| Miyato et al. [47] | 2018 | Normalization | - | 𝑊̃ ≔ 𝑊 ⁄𝜎(𝑊) |
| Zhou et al. [85] | 2018 | Penalty | ‖𝑊‖∞ | - |
| Brock et al. [8] | 2019 | Penalty | ‖𝑊𝑊𝑇 ⊙ (𝟏 − 𝐼)‖𝐹² | - |
| Kurach et al. [31] | 2019 | Penalty | ‖𝑊‖2 | - |
| Liu et al. [39] | 2019 | Normalization | - | 𝑊̃ ≔ 𝑊 ⁄𝜎(𝑊) + ∇𝑊 ⁄𝜎(𝑊) |
| Zhang et al. [83] | 2019 | Normalization | - | 𝑊̃ ≔ 𝑊 ⁄√(‖𝑊‖1 ‖𝑊‖∞) |
Orthogonal Regularization (OR), proposed by Brock et al. [7], is a representative weight penalty method that encourages the weight matrices to be orthogonal:

$$\mathcal{L}_W := \sum \|WW^T - I\|_F^2 \qquad ( 7 )$$

in which the non-diagonal elements of 𝑊𝑊𝑇 converge to zero. However, such a method also restricts the 𝐿2 norm of the matrix, thereby making the GAN training challenging, as claimed in [47]. Thus, Brock et al. [8] proposed a novel OR method that does not directly limit the 𝐿2 norm, which can be represented as

$$\mathcal{L}_W := \|WW^T \odot (\mathbf{1} - I)\|_F^2 \qquad ( 8 )$$

where ⊙ denotes the Hadamard product, and 𝟏 represents a matrix of which all elements are one.
Distinct from the weight penalty methods, weight normalization methods do not use an additional loss. Instead, they constantly update the weights during the training process, or compute gradients with respect to the normalized weights and then backpropagate those gradients.
Spectral Normalization (SN) [47] is one of the most widely used weight normalization methods; it introduces the spectral norm of the weight matrices into the GAN training. The spectral norm is the matrix norm induced by the 𝐿2 vector norm and is equal to the largest singular value of a matrix. Using the claim that the largest singular value is related to the Lipschitz constant, as verified in [77], the backpropagation of SN is performed with spectrally normalized weight matrices, as follows:

$$W_{k+1} := W_k - \beta \cdot \nabla_W \mathcal{L}_D\big(D_{SN(W_k)}\big) \qquad ( 9 )$$

where 𝑘 is an index of iteration; 𝛽 is a learning rate; 𝐷𝑆𝑁(𝑊) indicates a discriminator with spectrally normalized weight matrices. Such a process signifies that the spectral norm is employed only for the gradient calculation, and the weight matrices themselves are not changed in SN.
Furthermore, SN uses the power iteration method to calculate the spectral norm, since a conventional optimization method can hardly be used: gradients cannot be computed through such a method. In the power iteration method, the spectral norm is approximated with multiplications of vectors and a weight matrix, thereby making it possible to compute gradients:

$$SN(W) \cong \frac{W}{u^T W v} \qquad ( 10 )$$

where 𝑊 ∈ ℝ𝑛×𝑚 ; 𝑢 ∈ ℝ𝑛 ; 𝑣 ∈ ℝ𝑚 ; and 𝑢𝑇 𝑊𝑣 signifies the approximated spectral norm. The vectors 𝑢 and 𝑣 are randomly initialized at first and then updated through the power iteration method.
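A minimal sketch of this power iteration is given below; PyTorch also provides a maintained implementation as torch.nn.utils.spectral_norm, which persists 𝑢 and 𝑣 across training iterations as described.

```python
import torch
import torch.nn.functional as F

def spectrally_normalize(W, n_iters=1):
    """Approximate sigma(W) by the power iteration of (10) and return SN(W).
    u and v are randomly initialized here; in practice they persist across
    training iterations, so a single iteration per update suffices."""
    u = torch.randn(W.size(0))
    v = torch.randn(W.size(1))
    for _ in range(n_iters):
        v = F.normalize(W.t() @ u, dim=0)
        u = F.normalize(W @ v, dim=0)
    sigma = u @ W @ v          # approximated largest singular value, u^T W v
    return W / sigma           # spectrally normalized weight matrix
```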
A variant of SN has been introduced in Liu et al. [39], in which the gradients of weight parameters are additionally spectrally
penalized with the original SN. They investigated the concept of spectral collapse in the study, and claimed that such a method,
called Spectral Regularization (SR), can tackle the spectral collapse. Evaluated with image datasets, SR demonstrated superior
regularization performance, compared to SN.
Similarly, Zhang et al. [83] further explored the use of the spectral norm for the GAN training and proposed the Spectral Bounding (SB) method. SB aims to restrict an upper bound of the spectral norm, which can be computed by

$$\sigma(W) \le \sqrt{\|W\|_1 \|W\|_\infty} \qquad ( 11 )$$

so that the weights are normalized as 𝑊̃ ≔ 𝑊 ⁄√(‖𝑊‖1 ‖𝑊‖∞). Furthermore, in the study, it was reported that SB outperformed SN in experiments with two image datasets.
In addition, the weight clipping method, which can be interpreted as a weight normalization method, was used to train the initial Wasserstein GAN (WGAN) [2], which has been extensively studied thereafter. The weight clipping method encourages the discriminator to satisfy Lipschitz continuity by regulating the values of the weight matrices to be within a certain small range.
Table 3. Algorithms and applications for imbalanced training in GANs.

| Reference | Year | Approach | Description |
|---|---|---|---|
| Goodfellow et al. [20] | 2014 | Multiple updates | They proposed a multiple update algorithm with a parameter k for the discriminator; however, k was set to one in their experiments. |
| Arjovsky et al. [2] | 2017 | Multiple updates | k was set to five. |
| Heusel et al. [22] | 2017 | Learning rate | They advocated using imbalanced learning rates (TTUR). |
| Brock et al. [8] | 2019 | Multiple updates & Learning rate | Both algorithms were simultaneously used, in which double updates and a double learning rate were adopted for the discriminator. |
In the study, they clipped the weight values to a fixed range of [−0.01, 0.01]. However, such an approach significantly hinders the GAN training because the weights cannot be trained beyond the clipped range, which became the motivation of GP.
To sum up, the weight normalization methods have been employed to satisfy Lipschitz continuity, while the weight penalty methods have been introduced for the orthogonality of weight matrices or the general stability of the GAN training. The weight normalization methods have been reported to commonly enhance the sample generation performance of GANs [2,47,83]. In contrast, a weight penalty method demonstrated inferior performance compared to an ordinary GAN without the method [31], and a few weight penalty methods can be used only with other GAN training techniques, such as the truncation trick [8] and orthogonal initialization [7,60]. Since Lipschitz continuity is an essential condition for the stability of the GAN training, as described in the previous section, the methods that properly address this condition have demonstrated excellent performance in general.
Table 4. Normalization methods for outputs of a layer in GANs.

| Method | Proposed for neural networks | Proposed and evaluated with GANs |
|---|---|---|
| Batch normalization | Ioffe and Szegedy [26] | Kurach et al. [31]; Miyato et al. [47] |
| Layer normalization | Ba et al. [4] | Kurach et al. [31]; Miyato et al. [47] |
| Weight normalization | Salimans and Kingma [59] | Miyato et al. [47] |
| Instance normalization | Ulyanov et al. [68] | Karras et al. [28] |
| Group normalization | Wu and He [74] | - |
| Conditional batch normalization | Dumoulin et al. [16] | Miyato et al. [47]; Zhang et al. [80] |
| Self-modulation | - | Chen et al. [11] |
Since the balance between the discriminator and the generator is critical, many GANs deliberately train the discriminator more than the generator. For example, in WGAN [2], the discriminator uses the multiple update algorithm because the weight clipping method hinders the training; the discriminator of WGAN was trained five times per generator update. Likewise, such imbalanced training has been conventionally used in various other GANs [8,21,47].
Heusel et al. [22] investigated this problem from a different perspective and argued that distinct learning rates for the discriminator and generator can solve it. They claimed that a local Nash equilibrium can be reached even if the learning rates of the discriminator and generator differ from each other, and proved this claim in depth. Hence, instead of using the multiple update algorithm for the discriminator, a learning rate higher than that of the generator can be used, which reduces the computational time. They named this algorithm the Two Time-scale Update Rule (TTUR).
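Both approaches in Table 3 are straightforward to implement, as the following self-contained PyTorch sketch shows; the toy networks, five discriminator updates, and the learning rates of 4e-4 and 1e-4 are illustrative settings in the spirit of [2] and [22], not prescribed values.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; the architectures are placeholders.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 64))
D = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

# TTUR: a higher learning rate for the discriminator than for the generator.
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))

n_critic = 5  # multiple update algorithm: k discriminator steps per generator step
for step in range(100):
    for _ in range(n_critic):
        x_real = torch.randn(8, 64)                   # stand-in for a real mini-batch
        x_fake = G(torch.randn(8, 16)).detach()
        loss_D = D(x_fake).mean() - D(x_real).mean()  # WGAN-style loss, see ( 16 )
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    loss_G = -D(G(torch.randn(8, 16))).mean()         # WGAN-style loss, see ( 17 )
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```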
Overall, because balanced training between the discriminator and generator is essential, two simple approaches that enforce imbalanced training have been used to handle this issue: multiple updates and imbalanced learning rates. The descriptions of these approaches are shown in Table 3. The two approaches are intuitive and straightforward since they directly train the discriminator more intensively than the generator in order to maintain the balance. In addition, it has been verified in many studies that such simple methods operate properly, even when the methods are used simultaneously [8]. However, the number of multiple updates and the values of the different learning rates vary according to GAN architectures, experiments, and datasets; how to select these parameters remains to be further studied, while existing studies have commonly relied on trial and error.
Normalization methods for the outputs of a layer, such as Batch Normalization (BN) [26] and Layer Normalization (LN) [4], commonly standardize the outputs in the following form:

$$h_N = \gamma \odot \frac{h - \mu}{\sigma} + b \qquad ( 12 )$$

where ℎ denotes the outputs of a layer; 𝛾 and 𝑏 are learnable scale and shift parameters; 𝜇 and 𝜎 are the average and standard deviation of ℎ, respectively. The difference between the normalization methods lies in how 𝜇 and 𝜎 are obtained; for instance, they are calculated within the mini-batch in BN, or within the nodes of a layer in LN.
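For instance, for a feature map of shape (batch, channels, height, width), the statistics of ( 12 ) can be computed over the following axes; this is a rough sketch with a toy tensor, omitting the learnable 𝛾 and 𝑏.

```python
import torch

h = torch.randn(8, 64, 32, 32)  # (batch, channels, height, width), toy tensor

# Batch normalization: statistics per channel, across the mini-batch.
mu_bn = h.mean(dim=(0, 2, 3), keepdim=True)
sd_bn = h.std(dim=(0, 2, 3), keepdim=True)

# Layer normalization: statistics per sample, across all nodes in the layer.
mu_ln = h.mean(dim=(1, 2, 3), keepdim=True)
sd_ln = h.std(dim=(1, 2, 3), keepdim=True)

h_bn = (h - mu_bn) / sd_bn      # gamma and b of (12) omitted for brevity
h_ln = (h - mu_ln) / sd_ln
```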
There have been several approaches to adopt these normalization methods for the discriminator in GANs as well [15,56]. BN, which is the most conventional normalization method in general neural networks, was evaluated in Lucic et al. [41]. As a result, it was demonstrated that the conventional BN is not valid for the discriminator in GANs; the performance decreased when BN was used.
In contrast, other conventional methods used in typical neural networks, including LN and Weight Normalization (WN), were investigated in WGAN with GP [21] and further evaluated in Miyato et al. [47] and Kurach et al. [31]. The methods were assessed for the discriminator in GANs and showed superior performance compared to ordinary GANs. Specifically, in both studies, it was verified that LN enhances sample generation performance in general, regardless of parameter settings.
However, for the generator, BN has been commonly used and has successfully demonstrated its effectiveness [47]. Furthermore, conditional BN (cBN) as well as Adaptive IN (AdaIN) [14,16,19,25] have become the dominant methods for providing conditional information to the generator. Consequently, in recent GANs, it is common to use BN only for the generator and not for the discriminator [8,28].
As another variant of BN, Chen et al. [11] proposed Self-Modulation (Self-Mod) method for GANs. While ordinary BN uses
learnable scale and shift parameters, in Self-Mod, they modified these parameters to be input-dependent. Therefore, the
learnable parameters take 𝑧 as their inputs; then the dependencies are trained through neural network structures, which can be
represented as
$$h_N = \gamma(z) \odot \frac{h - \mu}{\sigma} + b(z) \qquad ( 13 )$$

$$\gamma(z) := W_\gamma^{(1)} \cdot ReLU\big(W_\gamma^{(2)} z + \beta_\gamma\big) \qquad ( 14 )$$

$$b(z) := W_b^{(1)} \cdot ReLU\big(W_b^{(2)} z + \beta_b\big) \qquad ( 15 )$$
where 𝑊 denotes a weight matrix; 𝛽 represents a bias parameter of the neural network structures; the activation function of
the neural network structures is set to Rectified Linear Unit (ReLU).
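A sketch of ( 13 )–( 15 ) as a PyTorch module is shown below; the hidden width of the two-layer networks is an arbitrary assumption, and the module is illustrative rather than the reference implementation of [11].

```python
import torch
import torch.nn as nn

class SelfModBN(nn.Module):
    """Self-Modulation [11]: BN whose scale and shift depend on z, (13)-(15)."""
    def __init__(self, num_features, z_dim, hidden=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)  # (h - mu) / sigma only
        self.gamma = nn.Sequential(  # gamma(z) as in (14)
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_features, bias=False))
        self.beta = nn.Sequential(   # b(z) as in (15)
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_features, bias=False))

    def forward(self, h, z):
        g = self.gamma(z).unsqueeze(-1).unsqueeze(-1)  # reshape to (B, C, 1, 1)
        b = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        return g * self.bn(h) + b                      # (13)
```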
Loss functions that handle the Lipschitz continuity problem were initially explored by Arjovsky et al. [2] and Qi [55] at around the same time. Arjovsky et al. [2] introduced the Earth-Mover (EM) distance to solve this problem and named the method WGAN. While the same problem was addressed, Qi [55] proposed a GAN model using a feature-wise distance, called Loss-Sensitive GAN (LSGAN), where the distance is obtained through pre-trained networks. It was also verified that WGAN is a special case of the Generalized LSGAN (GLSGAN).
In WGAN, the discriminator and generator are trained with the following loss functions:
𝓛𝐷 ≔ 𝔼[𝐷(𝐺(𝑧))] − 𝔼[𝐷(𝑥)], ( 16 )
𝓛𝐺 ≔ −𝔼[𝐷(𝐺(𝑧))]. ( 17 )
While the same loss function for the generator, i.e., ( 17 ), is used in LSGAN, a distinct loss function for the discriminator is
introduced, which can be calculated as follows:
𝓛𝐷 ≔ −𝜔 ⋅ (∆(𝑋, 𝐺(𝑧)) + 𝔼[𝐷(𝑥)] − 𝔼[𝐷(𝐺(𝑧))]) − 𝔼[𝐷(𝑥)], ( 18 )
where 𝜔 denotes a positive balancing parameter; ∆(𝑎, 𝑏) is a feature-wise distance whose features are obtained from pre-trained networks, such as the Inception network [62].
Recently, an improved loss function for the discriminator, called the hinge loss, was proposed and employed in various GANs [38,65]. Although the loss function for the generator is the same as in WGAN and LSGAN, the discriminator is further regularized with the hinge loss, which can be computed as follows:

$$\mathcal{L}_D := \mathbb{E}[\max(0,\, 1 - D(x))] + \mathbb{E}[\max(0,\, 1 + D(G(z)))] \qquad ( 19 )$$

which can be interpreted as a restricted Wasserstein distance in which only less trained samples are optimized in the discriminator, with output thresholds of 1 and −1.
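The hinge loss can be written directly; in the sketch below, d_real and d_fake denote unbounded discriminator outputs, and the generator keeps the WGAN-style loss ( 17 ).

```python
import torch
import torch.nn.functional as F

def hinge_loss_D(d_real, d_fake):
    """Hinge loss (19): samples already scored beyond the thresholds of 1 and -1
    contribute zero gradient, i.e., they are no longer optimized."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_loss_G(d_fake):
    """Same generator loss as WGAN and LSGAN, (17)."""
    return -d_fake.mean()
```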
In a similar manner, Ni et al. [49] proposed a regulated loss for the discriminator, in which generated samples are gradually introduced. While the same loss as that of WGAN is used for the generator, the discriminator does not learn from generated samples at first and then gradually learns them, which can be represented as

$$\mathcal{L}_D := k_t \cdot \mathbb{E}[D(G(z))] - \mathbb{E}[D(x)] \qquad ( 20 )$$

where 𝑘𝑡 is a parameter that gradually increases during the training.
The consistency of generated samples and of 𝐷(𝑥) with respect to the inputs of the generator and the discriminator, respectively, has also been explored. Gan et al. [18] proposed a method to smooth the differences in generated samples with regard to the generator inputs, i.e., 𝑧. The motivation of the method is straightforward: two generated samples, i.e., 𝐺(𝑧1) and 𝐺(𝑧2), should be similar if the difference between 𝑧1 and 𝑧2 is small. To maintain such consistency, they penalized the generator loss, with a balancing parameter 𝛼, by a term that keeps 𝑘 ≔ ‖𝐺(𝑧 + 𝜀) − 𝐺(𝑧)‖2 ⁄‖𝜀‖2 within thresholds 𝛿𝑚𝑖𝑛 and 𝛿𝑚𝑎𝑥.

Similarly, Zhang et al. [81] proposed a regularization method for the consistency of the discriminator with respect to data augmentation. They argued that the outputs of the discriminator for two samples, i.e., 𝐷(𝑥1) and 𝐷(𝑥2), should be similar if the features of 𝑥1 and 𝑥2 are almost the same. Therefore, they set 𝑥2 ≔ 𝑇(𝑥1), where 𝑇(⋅) is a data augmentation method, such as cropping or rotating. Then, a penalization loss, called Consistency Regularization (CR), is added to the hinge loss, i.e., 𝓛𝐷+𝐶𝑅 ≔ 𝓛𝐷 + 𝛼 ⋅ 𝓛𝐶𝑅, where 𝓛𝐷 is the same as ( 19 ); and

$$\mathcal{L}_{CR} := \mathbb{E}\big[\|D(x_1) - D(T(x_1))\|^2\big] \qquad ( 22 )$$
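A sketch of CR under these definitions is given below, where the augmentation 𝑇(⋅) is arbitrarily chosen as a horizontal flip and 𝛼 is the balancing parameter.

```python
import torch

def consistency_regularization(D, x_real, alpha=10.0):
    """CR of Zhang et al. [81]: the discriminator outputs for a sample and
    its augmented counterpart T(x) should be similar, (22)."""
    x_aug = torch.flip(x_real, dims=[-1])           # T(x): horizontal flip, e.g.
    penalty = ((D(x_real) - D(x_aug)) ** 2).mean()  # L_CR
    return alpha * penalty                          # added to the hinge loss (19)
```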
Another regularization method, using dropout [61], was proposed by Wei et al. [73], where the consistency under dropout is maintained. Dropout layers are inserted into the discriminator; then, the discriminator outputs of two stochastic forward passes should be similar to satisfy Lipschitz continuity. Consequently, such a characteristic is implemented by applying dropout to the same input twice. Hence, the Consistency Term (CT) proposed in the study can be calculated as follows:
Table 5. State-of-the-art GANs and regularization methods used in the studies. Recent studies that have been cited more than 500 times according to Google Scholar [88] were investigated. MU stands for the multiple update algorithm; HL stands for the hinge loss. A semicolon indicates that both methods were simultaneously used. A forward slash indicates that two methods were selectively used depending on the experiment. The other abbreviations can be found in each section of this paper.

| Reference | Year | Task | Gradient penalty | Weight regularization | Imbalanced training | Layer output normalization | Regularized loss |
|---|---|---|---|---|---|---|---|
| Wang et al. [71] | 2018 | Style transfer | None | None | None | IN | None |
| Hoffman et al. [23] | 2018 | Style transfer | None | None | None | BN | None |
| Karras et al. [27] | 2018 | Image generation | GP | None | MU | BN; LN | WGAN |
| Choi et al. [12] | 2018 | Style transfer | GP | None | MU | IN | WGAN |
| Miyato et al. [47] | 2018 | Image generation | GP/None | SN | MU | BN/cBN | HL |
| Zhang et al. [80] | 2019 | Image generation | None | SN | TTUR | cBN | HL |
| Brock et al. [8] | 2019 | Image generation | GP | SN; WN | MU; TTUR | cBN | HL |
| Karras et al. [28] | 2019 | Image generation | GP | None | None | AdaIN | WGAN |
$$\mathcal{L}_{CT} := \sum_i \lambda_i \cdot \big|D_1^{(i)}(x) - D_2^{(i)}(x)\big| \qquad ( 23 )$$

where $D_k^{(i)}$ denotes the output of the $i$-th layer of the discriminator under a random dropout indexed by $k$; 𝜆𝑖 is a weight parameter of the layer, which was set to 1 for the last layer and 0.2 for the penultimate layer. This CT is used in addition to the ordinary discriminator loss, in a similar manner to the other studies.
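Because dropout is stochastic, two forward passes of the same input through a discriminator containing dropout layers yield the two outputs in ( 23 ). The sketch below assumes a discriminator with a hypothetical return_features flag exposing its penultimate layer.

```python
import torch

def consistency_term(D, x_real, lambdas=(1.0, 0.2)):
    """CT of Wei et al. [73]: two stochastic forward passes (dropout active)
    of the same input should produce similar outputs, (23)."""
    out1, feat1 = D(x_real, return_features=True)  # hypothetical flag exposing
    out2, feat2 = D(x_real, return_features=True)  # the penultimate features
    return (lambdas[0] * (out1 - out2).abs().mean()
            + lambdas[1] * (feat1 - feat2).abs().mean())
```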
The selected studies and the analysis results are shown in Table 5. The studies can be categorized into two classes according to their tasks: image generation and style transfer. Image generation is the original purpose of the initial GAN [20], which uses noise vectors as the inputs of the generator and produces synthetic images from the noise. Style transfer using GANs has been studied widely since the development of Cycle-consistent GAN (CycleGAN) [87], which takes unpaired samples as its inputs and then trains with the GAN loss as well as cycle-consistency losses obtained through the two generators in CycleGAN.
As a result, the first gradient penalty method, i.e., GP, remains the dominant method for penalizing gradients, although a variety of other gradient penalty methods have recently been proposed, as described in Table 1. In addition, for weight regularization in very recent models, SN has been widely used since its development [47]. The multiple update algorithm and TTUR are both used in recent state-of-the-art GANs; specifically, Brock et al. [8] performed a large-scale study of imbalanced training and reported that the best performance was obtained when both methods were used simultaneously. All models have employed layer output normalization methods; however, recent models have used them only for the generator. While the initial GAN loss, which is represented as None in Table 5, the WGAN loss, and the hinge loss are all utilized in the selected GANs, the hinge loss is the most common in the latest models.
5. Limitations and Future Directions
Although numerous studies have explored regularization in GANs from various aspects, several limitations remain to be further investigated. In this section, we discuss such limitations and propose future research directions to tackle them.
While GP is the dominant method for penalizing gradients, it has several limitations. First, GP can hardly satisfy Lipschitz continuity in a direct manner: the gradients converge to around one, whereas they must remain under a certain threshold to satisfy Lipschitz continuity. This issue also makes the discriminator difficult to converge because a discriminator having small gradients below one is forced by GP to have large gradients near one. Second, GP uses values interpolated between real samples and generated samples; however, such interpolated values can represent neither the whole sample space nor the sample space between real and fake samples. Specifically, the sample space between real and fake samples should be characterized with regard to features instead of pixels. Third, in the first place, it is sufficient for the discriminator to satisfy Lipschitz continuity only near the real and fake samples, since the discriminator is trained with these samples, not with the interpolated values or the whole sample space.
Following the recent studies using ‖𝐷‖𝐿𝑖𝑝 → 0, future work should investigate gradient penalty methods that keep the gradients under a certain threshold. By the Lagrange multiplier method [57], the gradients can be constrained to a certain range even if the auxiliary penalty loss converges to zero; in such a case, the penalty parameter 𝜆 determines the Lipschitz constant. This relationship between the Lagrange multiplier method and the penalty methods using ‖𝐷‖𝐿𝑖𝑝 → 0 should be further investigated as well.
Also, as previously described, it is sufficient to satisfy Lipschitz continuity near the real samples and generated samples. In other words, the discriminator should be locally Lipschitz continuous instead of strictly satisfying the condition everywhere. While most of the existing methods use pixel-wise interpolated values, it appears more natural to simultaneously penalize the gradients with respect to both 𝑥 + 𝜖 and 𝐺(𝑧) + 𝜖.
Furthermore, while GP aims at 1-Lipschitz continuity of the discriminator, the Lipschitz constant of one is an arbitrary choice. The Lipschitz constant can be set to any number, and we conjecture that it is highly related to the learning rates and the values of the weight parameters. For instance, Karras et al. [27] evaluated various Lipschitz constants with their model and reported a significantly better result when the Lipschitz constant was set to 750. Studies regarding optimal Lipschitz constants should be conducted as well.
In a similar manner, SN also aims at 1-Lipschitz continuity of the discriminator by dividing the weight parameters by the maximum singular value, i.e., 𝑊̃ ≔ 𝑊 ⁄𝜎(𝑊). However, such an operation can be relaxed to satisfy Lipschitz continuity with another constant, by

$$\tilde{W} := K \cdot W / \sigma(W) \qquad ( 24 )$$
The GAN training in recent studies is generally conducted with the multiple update algorithm and TTUR using Adam optimization [29], which is a conventional method for training neural networks. While such training has demonstrated fine performance, optimization methods specific to GANs are desirable due to their unique training process, in which the target of the generator is produced by the discriminator, i.e., another neural network. Several existing methods have explored this problem [45,48,50] and proposed optimization methods that use the gradients of the counterpart network. These approaches should be further investigated.
Recently, Chu et al. [13] argued that a smooth activation function is required in the discriminator to stabilize the GAN training. While existing studies have focused mainly on the approaches presented in this paper, such as those based on weight parameters and Lipschitz continuity, this aspect based on the smoothness of activation functions should be investigated further.
6. Conclusion
In this paper, we reviewed and classified regularization methods for GANs. The existing methods were categorized into five
classes according to their operation principles; then we analyzed each group. While numerous methods have been proposed in
recent years, we found that several limitations still remain in these methods. We also proposed future research directions to
tackle these problems. We believe that this study can help researchers to select appropriate regularization methods for specific
neural network architectures and datasets, and to gain a better understanding of existing methods in order to develop novel
regularization methods that handle the limitations in existing methods.
References
[1] J. Adler and S. Lunz, (2018), "Banach Wasserstein GAN," In: Advances in Neural Information Processing Systems
(NeurIPS), Montréal, Canada.
[2] M. Arjovsky, S. Chintala, and L. Bottou, (2017), "Wasserstein generative adversarial networks," In: International
Conference on Machine Learning (ICML), Sydney, Australia.
[3] K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, K. Nikolaou, S. Gatidis, and B. Yang, (2020), "MedGAN:
Medical image translation using GANs," Computerized Medical Imaging and Graphics, vol. 79, p. 101684.
[4] J. L. Ba, J. R. Kiros, and G. E. Hinton, (2016), "Layer normalization," In: Advances in Neural Information Processing
Systems (NeurIPS), Barcelona, Spain.
[5] D. Berthelot, T. Schumm, and L. Metz, (2017), "BEGAN: Boundary equilibrium generative adversarial networks," In:
ArXiv preprint arXiv:1703.10717.
[6] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and
D. Rueckert, (2018), "GAN augmentation: Augmenting training data using generative adversarial networks," ArXiv
preprint arXiv:1810.10863.
[7] A. Brock, T. Lim, J. M. Ritchie, and N. J. Weston, (2017), "Neural photo editing with introspective adversarial
networks," In: International Conference on Learning Representations (ICLR), Toulon, France.
[8] A. Brock, J. Donahue, and K. Simonyan, (2019), "Large scale GAN training for high fidelity natural image synthesis,"
In: International Conference on Learning Representations (ICLR), New Orleans, LA.
[9] Y.-J. Cao, L.-L. Jia, Y.-X. Chen, N. Lin, C. Yang, B. Zhang, Z. Liu, X.-X. Li, and H.-H. Dai, (2018), "Recent advances
of generative adversarial networks in computer vision," IEEE Access, vol. 7, pp. 14985-15006.
[10] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, (2017), "Mode regularized generative adversarial networks," In:
International Conference on Learning Representations (ICLR), Toulon, France.
[11] T. Chen, M. Lucic, N. Houlsby, and S. Gelly, (2019), "On self modulation for generative adversarial networks," In:
International Conference on Learning Representations (ICLR), New Orleans, LA.
[12] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, (2018), "StarGAN: Unified generative adversarial networks
for multi-domain image-to-image translation," In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Salt Lake City, UT.
[13] C. Chu, K. Minami, and K. Fukumizu, (2020), "Smoothness and stability in GANs," In: International Conference on
Learning Representations (ICLR), Addis Ababa, Ethiopia.
[14] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville, (2017), "Modulating early visual
processing by language," In: Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA.
[15] E. L. Denton, S. Chintala, and R. Fergus, (2015), "Deep generative image models using a Laplacian pyramid of
adversarial networks," In: Advances in Neural Information Processing Systems (NeurIPS), Montréal, Canada
[16] V. Dumoulin, J. Shlens, and M. Kudlur, (2017), "A learned representation for artistic style," In: International
Conference on Learning Representations (ICLR), Toulon, France.
[17] W. Fedus, I. Goodfellow, and A. M. Dai, (2018), "MaskGAN: Better text generation via filling in the _," In:
International Conference on Learning Representations (ICLR), Vancouver, Canada.
[18] Y. Gan, K. Liu, M. Ye, and Y. Qian, (2019), "Generative adversarial networks with augmentation and penalty,"
Neurocomputing, vol. 360, pp. 52-60.
[19] G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens, (2017), "Exploring the structure of a real-time, arbitrary
neural artistic stylization network," ArXiv preprint arXiv:1705.06830.
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, (2014),
"Generative adversarial nets," In: Advances in Neural Information Processing Systems (NeurIPS), Montréal, Canada
[21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, (2017), "Improved training of Wasserstein
GANs," In: Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA.
[22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, (2017), "GANs trained by a two time-scale
update rule converge to a local nash equilibrium," In: Advances in Neural Information Processing Systems (NeurIPS),
Long Beach, CA.
[23] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, (2018), "CyCADA: Cycle-
consistent adversarial domain adaptation," In: International Conference on Machine Learning (ICML), Stockholm,
Sweden.
[24] Y. Hong, U. Hwang, J. Yoo, and S. Yoon, (2019), "How generative adversarial networks and their variants work: An
overview," ACM Computing Surveys (CSUR), vol. 52, no. 1, pp. 1-43.
[25] X. Huang and S. Belongie, (2017), "Arbitrary style transfer in real-time with adaptive instance normalization," In:
IEEE International Conference on Computer Vision (ICCV), Venice, Italy.
[26] S. Ioffe and C. Szegedy, (2015), "Batch normalization: Accelerating deep network training by reducing internal
covariate shift," In: International Conference on Machine Learning (ICML), Lille, France.
[27] T. Karras, T. Aila, S. Laine, and J. Lehtinen, (2018), "Progressive growing of GANs for improved quality, stability,
and variation," In: International Conference on Learning Representations (ICLR), Vancouver, Canada.
[28] T. Karras, S. Laine, and T. Aila, (2019), "A style-based generator architecture for generative adversarial networks," In:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA.
[29] D. P. Kingma and J. Ba, (2015), "Adam: A method for stochastic optimization," In: International Conference on
Learning Representations (ICLR), San Diego, CA.
[30] N. Kodali, J. Abernethy, J. Hays, and Z. Kira, (2017), "On convergence and stability of GANs," ArXiv Preprint
arXiv:1705.07215.
[31] K. Kurach, M. Lučić, X. Zhai, M. Michalski, and S. Gelly, (2019), "A large-scale study on regularization and
normalization in GANs," In: International Conference on Machine Learning (ICML), Long Beach, CA.
[32] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang,
(2017), "Photo-realistic single image super-resolution using a generative adversarial network," In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI.
[33] M. Lee and J. Seok, (2019), "Controllable generative adversarial network," IEEE Access, vol. 7, pp. 28158-28169.
[34] M. Lee and J. Seok, (2020), "Score-guided generative adversarial networks," ArXiv preprint arXiv:2004.04396.
[35] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, (2017), "Perceptual generative adversarial networks for small object
detection," In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI.
[36] S. Li and W. Deng, (2020), "Deep facial expression recognition: A survey," IEEE Transactions on Affective Computing.
[37] Y. Li, N. Xiao, and W. Ouyang, (2019), "Improved generative adversarial networks with reconstruction loss,"
Neurocomputing, vol. 323, pp. 363-372.
[38] J. H. Lim and J. C. Ye, (2017), "Geometric GAN," ArXiv preprint arXiv:1705.02894.
[39] K. Liu, W. Tang, F. Zhou, and G. Qiu, (2019), "Spectral regularization for combating mode collapse in GANs," In:
IEEE International Conference on Computer Vision (ICCV), Seoul, Korea.
[40] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, (2020), "Deep learning for generic object
detection: A survey," International Journal of Computer Vision, vol. 128, no. 2, pp. 261-318.
[41] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, (2018), "Are GANs created equal? a large-scale study,"
In: Advances in Neural Information Processing Systems (NeurIPS), Montréal, Canada.
[42] P. Luo, X. Wang, W. Shao, and Z. Peng, (2018), "Towards understanding regularization in batch normalization," In:
International Conference on Learning Representations (ICLR), Vancouver, Canada.
[43] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, (2017), "Least squares generative adversarial
networks," In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy.
[44] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, (2018), "BAGAN: Data augmentation with balancing
gan," ArXiv preprint arXiv:1803.09655.
[45] L. Mescheder, S. Nowozin, and A. Geiger, (2017), "The numerics of GANs," In: Advances in Neural Information
Processing Systems (NeurIPS), Long Beach, CA.
[46] L. Mescheder, A. Geiger, and S. Nowozin, (2018), "Which training methods for GANs do actually converge?," In:
International Conference on Machine Learning (ICML), Stockholm, Sweden.
[47] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, (2018), "Spectral normalization for generative adversarial
networks," In: International Conference on Learning Representations (ICLR), Vancouver, Canada.
[48] V. Nagarajan and J. Z. Kolter, (2017), "Gradient descent GAN optimization is locally stable," In: Advances in Neural
Information Processing Systems (NeurIPS), Long Beach, CA.
[49] Y. Ni, D. Song, X. Zhang, H. Wu, and L. Liao, (2018), "CAGAN: Consistent adversarial training enhanced GANs,"
In: International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden.
[50] W. Nie and A. Patel, (2019), "Towards a better understanding and regularization of gan training dynamics," ArXiv
preprint arxiv:1806.09235.
[51] S. Nowozin, B. Cseke, and R. Tomioka, (2016), "f-GAN: Training generative neural samplers using variational
divergence minimization," In: Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain.
[52] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, and Y. Zheng, (2019), "Recent progress on generative adversarial networks
(GANs): A survey," IEEE Access, vol. 7, pp. 36322-36333.
[53] H. Petzka, A. Fischer, and D. Lukovnicov, (2018), "On the regularization of Wasserstein GANs," In: International
Conference on Learning Representations (ICLR), Vancouver, Canada.
[54] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, (2018), "GANimation: Anatomically-
aware facial animation from a single image," In: European Conference on Computer Vision (ECCV), Munich,
Germany.
[55] G.-J. Qi, (2020), "Loss-sensitive generative adversarial networks on Lipschitz densities," International Journal of
Computer Vision, vol. 128, pp. 1118–1140.
[56] A. Radford, L. Metz, and S. Chintala, (2015), "Unsupervised representation learning with deep convolutional
generative adversarial networks," ArXiv preprint arXiv:1511.06434.
[57] R. T. Rockafellar, (1993), "Lagrange multipliers and optimality," SIAM review, vol. 35, no. 2, pp. 183-238.
[58] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed, (2017), "Variational approaches for auto-
encoding generative adversarial networks," ArXiv preprint arXiv:1706.04987.
[59] T. Salimans and D. P. Kingma, (2016), "Weight normalization: A simple reparameterization to accelerate training of
deep neural networks," In: Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain.
[60] A. M. Saxe, J. L. Mcclelland, and S. Ganguli, (2014), "Exact solutions to the nonlinear dynamics of learning in deep
linear neural network," In: International Conference on Learning Representations (ICLR), Banff, Canada.
[61] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, (2014), "Dropout: a simple way to
prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958.
[62] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, (2016), "Rethinking the Inception architecture for
computer vision," In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV.
[63] C. Tao, L. Chen, R. Henao, J. Feng, and L. C. Duke, (2018), "Chi-square generative adversarial network," In:
International Conference on Machine Learning (ICML), Stockholm, Sweden.
[64] H. Thanh-Tung, S. Venkatesh, and T. Tran, (2019), "Improving generalization and stability of generative adversarial
networks," In: International Conference on Learning Representations (ICLR), New Orleans, LA.
[65] D. Tran, R. Ranganath, and D. Blei, (2017), "Hierarchical implicit models and likelihood-free variational inference,"
In: Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA.
[66] N.-T. Tran, T.-A. Bui, and N.-M. Cheung, (2018), "Dist-gan: An improved GAN using distance constraints," In:
European Conference on Computer Vision (ECCV), Munich, Germany.
[67] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, (2018), "MoCoGAN: Decomposing motion and content for video
generation," In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT.
[68] D. Ulyanov, A. Vedaldi, and V. Lempitsky, (2016), "Instance normalization: The missing ingredient for fast
stylization," ArXiv preprint arXiv:1607.08022.
[69] D. Ulyanov, A. Vedaldi, and V. Lempitsky, (2018), "It takes (only) two: Adversarial generator-encoder networks," In:
AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA.
[70] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, (2017), "Generative adversarial networks: introduction
and outlook," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 588-598.
[71] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, (2018), "High-resolution image synthesis and
semantic manipulation with conditional GANs," In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Salt Lake City, UT.
[72] Z. Wang, J. Chen, and S. C. Hoi, (2020), "Deep learning for image super-resolution: A survey," IEEE Transactions on
Pattern Analysis and Machine Intelligence.
[73] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang, (2018), "Improving the improved training of Wasserstein GANs: A
consistency term and its dual effect," In: International Conference on Learning Representations (ICLR), Vancouver,
Canada.
[74] Y. Wu and K. He, (2018), "Group normalization," In: European Conference on Computer Vision (ECCV), Munich,
Germany.
[75] L. Yanchun and X. Nanfeng, (2019), "Generative adversarial networks based on denoising and reconstruction
regularization," In: IEEE International Conference on High Performance Computing and Communications; IEEE
International Conference on Smart City; IEEE International Conference on Data Science and Systems
(HPCC/SmartCity/DSS), Zhangjiajie, China.
[76] X. Yi, E. Walia, and P. Babyn, (2019), "Generative adversarial network in medical imaging: A review," Medical image
analysis, p. 101552.
[77] Y. Yoshida and T. Miyato, (2017), "Spectral norm regularization for improving the generalizability of deep learning,"
ArXiv preprint arXiv:1705.10941.
[78] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, (2018), "Generative image inpainting with contextual
attention," In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT.
[79] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, (2018), "StackGAN++: Realistic image
synthesis with stacked generative adversarial networks," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 41, no. 8, pp. 1947-1962.
[80] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, (2019), "Self-attention generative adversarial networks," In:
International Conference on Machine Learning (ICML), Long Beach, CA.
[81] H. Zhang, Z. Zhang, A. Odena, and H. Lee, (2020), "Consistency regularization for generative adversarial networks,"
In: International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia.
[82] Z. Zhang, Y. Xie, and L. Yang, (2018), "Photographic text-to-image synthesis with a hierarchically-nested adversarial
network," In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT.
[83] Z. Zhang, Y. Zeng, L. Bai, Y. Hu, M. Wu, S. Wang, and E. R. Hancock, (2019), "Spectral bounding: strictly satisfying
the 1-Lipschitz property for generative adversarial networks," Pattern Recognition, p. 107179.
[84] J. Zhao, M. Mathieu, and Y. LeCun, (2017), "Energy-based generative adversarial networks," In: International Conference on Learning Representations (ICLR), Toulon, France.
[85] C. Zhou, J. Zhang, and J. Liu, (2018), "Lp-WGAN: Using Lp-norm normalization to stabilize Wasserstein generative
adversarial networks," Knowledge-Based Systems, vol. 161, pp. 415-424.
[86] Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu, and Z. Zhang, (2019), "Lipschitz generative adversarial
nets," In: International Conference on Machine Learning (ICML), Long Beach, CA.
[87] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, (2017), "Unpaired image-to-image translation using cycle-consistent
adversarial networks," In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy.
[88] Google Scholar. Available: https://scholar.google.com/