
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis

Ming Tao1  Hao Tang2  Fei Wu1  Xiaoyuan Jing3  Bing-Kun Bao1*  Changsheng Xu4,5,6
1 Nanjing University of Posts and Telecommunications   2 CVL, ETH Zürich   3 Wuhan University
4 Peng Cheng Laboratory   5 University of Chinese Academy of Sciences   6 NLPR, Institute of Automation, CAS
* Corresponding Author
[email protected]

Abstract

Synthesizing high-quality realistic images from text descriptions is a challenging task. Existing text-to-image Generative Adversarial Networks generally employ a stacked architecture as the backbone, yet three flaws remain. First, the stacked architecture introduces entanglements between generators of different image scales. Second, existing studies prefer to apply and fix extra networks in adversarial learning for text-image semantic consistency, which limits the supervision capability of these networks. Third, the cross-modal attention-based text-image fusion widely adopted by previous works can only be applied at a few specific image scales because of its computational cost. To address these issues, we propose a simpler but more effective Deep Fusion Generative Adversarial Network (DF-GAN). Specifically, we propose: (i) a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators, (ii) a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output, which enhances the text-image semantic consistency without introducing extra networks, and (iii) a novel deep text-image fusion block, which deepens the fusion process to make a full fusion between text and visual features. Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images, and it achieves better performance on widely used datasets. Code is available at https://github.com/tobran/DF-GAN.

Figure 1. (a) Existing text-to-image models stack multiple generators to generate high-resolution images. (b) Our proposed DF-GAN generates high-quality images directly and fuses the text and image features deeply by our deep text-image fusion blocks.

1. Introduction

The last few years have witnessed the great success of Generative Adversarial Networks (GANs) [8] for a variety of applications [4, 27, 48]. Among them, text-to-image synthesis is one of the most important applications of GANs. It aims to generate realistic and text-consistent images from given natural language descriptions. Due to its practical value, text-to-image synthesis has become an active research area recently [3, 9, 13, 19–21, 32, 33, 35, 51, 53, 60].

Two major challenges for text-to-image synthesis are the authenticity of the generated image and the semantic consistency between the given text and the generated image. Due to the instability of the GAN model, most recent models adopt the stacked architecture [56, 57] as the backbone to generate high-resolution images. They employ cross-modal attention to fuse text and image features [37, 50, 56, 57, 60] and then introduce the DAMSM network [50], cycle consistency [33], or a Siamese network [51] to ensure text-image semantic consistency via extra networks.

Although impressive results have been presented by previous works [9, 19, 21, 32, 33, 51, 60], three problems remain. First, the stacked architecture [56] introduces entanglements between different generators, which makes the final refined images look like a simple combination of a fuzzy shape and some details. As shown in Figure 1(a), the final refined image has a fuzzy shape synthesized by G0, coarse attributes (e.g., eye and beak) synthesized by G1, and fine-grained details (e.g., eye reflection) added by G2. The final synthesized image looks like a simple combination of visual features from different image scales.
Second, existing studies usually fix the extra networks [33, 50] during adversarial training, making these networks easily fooled by the generator into synthesizing adversarial features [30, 52], thereby weakening their supervision power on semantic consistency. Third, cross-modal attention [50] cannot make full use of the text information: it can only be applied twice, on the 64×64 and 128×128 image features, due to its high computational cost. This limits the effectiveness of the text-image fusion process and makes the model hard to extend to higher-resolution image synthesis.

To address the above issues, we propose a novel text-to-image generation method named Deep Fusion Generative Adversarial Network (DF-GAN). For the first issue, we replace the stacked backbone with a one-stage backbone. It is composed of hinge loss [54] and residual networks [11], which stabilizes the GAN training process so that high-resolution images can be synthesized directly. Since there is only one generator in the one-stage backbone, it avoids the entanglements between different generators.

For the second issue, we design a Target-Aware Discriminator composed of Matching-Aware Gradient Penalty (MA-GP) and One-Way Output to enhance the text-image semantic consistency. MA-GP is a regularization strategy on the discriminator. It pushes the gradient of the discriminator on target data (real and text-matching images) towards zero. Thereby, MA-GP constructs a smooth loss surface at real and matching data points, which further promotes the generator to synthesize text-matching images. Moreover, considering that the previous Two-Way Output slows down the convergence of the generator under MA-GP, we replace it with a more effective One-Way Output.

For the third issue, we propose a Deep text-image Fusion Block (DFBlock) to fuse the text information into image features more effectively. The DFBlock consists of several Affine Transformations [31]. The Affine Transformation is a lightweight module that manipulates the visual feature maps through channel-wise scaling and shifting operations. Stacking multiple DFBlocks at all image scales deepens the text-image fusion process and makes a full fusion between text and visual features.

Overall, our contributions can be summarized as follows:

• We propose a novel one-stage text-to-image backbone that can synthesize high-resolution images directly without entanglements between different generators.

• We propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty (MA-GP) and One-Way Output. It significantly enhances the text-image semantic consistency without introducing extra networks.

• We propose a novel Deep text-image Fusion Block (DFBlock), which fully fuses text and visual features more effectively and deeply.

• Extensive qualitative and quantitative experiments on two challenging datasets demonstrate that the proposed DF-GAN outperforms existing state-of-the-art text-to-image models.

2. Related Work

Generative Adversarial Networks (GANs) [8] are an attractive framework that can mimic complex real-world distributions by solving a min-max optimization problem between a generator and a discriminator [16, 17, 43, 54]. For instance, Reed et al. first applied the conditional GAN to generate plausible images from text descriptions [37, 38]. StackGAN [56, 57] generates high-resolution images by stacking multiple generators and discriminators and provides the text information to the generator by concatenating text vectors with the input noise. Next, AttnGAN [50] introduces the cross-modal attention mechanism to help the generator synthesize images with more details. MirrorGAN [33] regenerates text descriptions from generated images for text-image semantic consistency [59]. SD-GAN [51] employs the Siamese structure [45, 46] to distill the semantic commons from texts for image generation consistency. DM-GAN [60] introduces the Memory Network [10, 49] to refine fuzzy image contents when the initial images are not well generated in the stacked architecture. Recently, some large transformer-based text-to-image methods [7, 24, 35] have shown excellent performance on complex image synthesis. They tokenize the images and take the image tokens and word tokens for auto-regressive training with a unidirectional Transformer [2, 34].

Our DF-GAN differs from previous methods in three ways. First, it generates high-resolution images directly with a one-stage backbone. Second, it adopts a Target-Aware Discriminator to enhance text-image semantic consistency without introducing extra networks. Third, it fuses text and image features more deeply and effectively through a sequence of DFBlocks. Compared with previous models, our DF-GAN is much simpler but more effective in synthesizing realistic and text-matching images.
Figure 2. The architecture of the proposed DF-GAN for text-to-image synthesis. DF-GAN generates high-resolution images directly by one pair of generator and discriminator and fuses the text information and visual feature maps through multiple Deep text-image Fusion Blocks (DFBlock) in UPBlocks. Armed with Matching-Aware Gradient Penalty (MA-GP) and One-Way Output, our model can synthesize more realistic and text-matching images.

3. The Proposed DF-GAN

In this paper, we propose a simple model for text-to-image synthesis named Deep Fusion GAN (DF-GAN). To synthesize more realistic and text-matching images, we propose: (i) a novel one-stage text-to-image backbone that can synthesize high-resolution images directly without visual feature entanglements, (ii) a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty (MA-GP) and One-Way Output, which enhances the text-image semantic consistency without introducing extra networks, and (iii) a novel Deep text-image Fusion Block (DFBlock), which more fully fuses text and visual features.

3.1. Model Overview

The proposed DF-GAN is composed of a generator, a discriminator, and a pre-trained text encoder, as shown in Figure 2. The generator has two inputs: a sentence vector encoded by the text encoder and a noise vector sampled from a Gaussian distribution to ensure the diversity of the generated images. The noise vector is first fed into a fully connected layer and reshaped. We then apply a series of UPBlocks to upsample the image features. The UPBlock is composed of an upsample layer, a residual block, and DFBlocks that fuse the text and image features during the image generation process. Finally, a convolution layer converts the image features into images.

The discriminator converts images into image features through a series of DownBlocks. The sentence vector is then replicated and concatenated with the image features, and an adversarial loss is predicted to evaluate the visual realism and semantic consistency of the inputs. By distinguishing generated images from real samples, the discriminator promotes the generator to synthesize images with higher quality and text-image semantic consistency.

The text encoder is a bi-directional Long Short-Term Memory (LSTM) [41] that extracts semantic vectors from the text description. We directly use the pre-trained model provided by AttnGAN [50].
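To make the data flow above concrete, the following PyTorch-style skeleton sketches the generator path (noise, fully connected layer, reshape, UPBlocks, convolution to RGB). It is an illustration under stated assumptions rather than the released implementation: the UPBlock here is reduced to an upsampling layer plus a convolution, whereas the actual UPBlock also contains a residual branch and the DFBlocks introduced in Sec. 3.4, and all channel widths and layer counts are placeholders.

```python
import torch
import torch.nn as nn

class UPBlock(nn.Module):
    """Simplified stand-in: upsample + conv. The real UPBlock additionally fuses the
    sentence vector through DFBlocks and uses a residual connection (Sec. 3.4)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x, sent_emb):
        # sent_emb would be consumed by the DFBlocks in the full model
        return self.conv(self.up(x))

class Generator(nn.Module):
    def __init__(self, noise_dim=100, ch=64):
        super().__init__()
        widths = [8 * ch, 8 * ch, 8 * ch, 4 * ch, 2 * ch, ch, ch]  # illustrative widths
        self.fc = nn.Linear(noise_dim, widths[0] * 4 * 4)          # noise -> 4x4 feature map
        self.blocks = nn.ModuleList(
            UPBlock(widths[i], widths[i + 1]) for i in range(6)    # 4x4 -> 256x256
        )
        self.to_rgb = nn.Sequential(
            nn.LeakyReLU(0.2), nn.Conv2d(widths[-1], 3, 3, 1, 1), nn.Tanh()
        )

    def forward(self, z, sent_emb):
        h = self.fc(z).view(z.size(0), -1, 4, 4)
        for block in self.blocks:
            h = block(h, sent_emb)
        return self.to_rgb(h)

# usage sketch: img = Generator()(torch.randn(1, 100), torch.randn(1, 256))
```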
3.2. One-Stage Text-to-Image Backbone

Due to the instability of the GAN model, previous text-to-image GANs usually employ a stacked architecture [56, 57] to generate high-resolution images from low-resolution ones. However, the stacked architecture introduces entanglements between different generators, and it makes the final refined images look like a simple combination of fuzzy shape and some details (see Figure 1(a)).

Inspired by recent studies on unconditional image generation [23, 54], we propose a one-stage text-to-image backbone that can synthesize high-resolution images directly with a single pair of generator and discriminator. We employ the hinge loss [23] to stabilize the adversarial training process. Since there is only one generator in the one-stage backbone, it avoids the entanglements between different generators. As the single generator in our one-stage framework needs to synthesize high-resolution images from noise vectors directly, it must contain more layers than previous generators in stacked architectures. To train these layers effectively, we introduce residual networks [11] to stabilize the training of the deeper network. The formulation of our one-stage method with hinge loss [23] is as follows:

L_D = - E_{x~P_r} [min(0, -1 + D(x, e))]
      - (1/2) E_{G(z)~P_g} [min(0, -1 - D(G(z), e))]
      - (1/2) E_{x~P_mis} [min(0, -1 - D(x, e))]                    (1)
L_G = - E_{G(z)~P_g} [D(G(z), e)]

where z is the noise vector sampled from the Gaussian distribution, e is the sentence vector, and P_g, P_r, P_mis denote the synthetic data distribution, the real data distribution, and the mismatching data distribution, respectively.
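To make Eq. (1) concrete, the following PyTorch-style sketch computes the two hinge objectives from raw discriminator scores. It assumes D(x, e) returns an unbounded scalar per (image, sentence) pair and uses the identity -min(0, -1 + s) = relu(1 - s); it is an illustration, not the released training code.

```python
import torch

def d_hinge_loss(d_real, d_fake, d_mismatch):
    """Discriminator hinge loss of Eq. (1).

    d_real:     D(x, e) on real images with matching text
    d_fake:     D(G(z), e) on generated images with their text
    d_mismatch: D(x, e) on real images with mismatching text
    """
    loss_real = torch.relu(1.0 - d_real).mean()            # -E[min(0, -1 + D(x, e))]
    loss_fake = 0.5 * torch.relu(1.0 + d_fake).mean()      # -(1/2) E[min(0, -1 - D(G(z), e))]
    loss_mis = 0.5 * torch.relu(1.0 + d_mismatch).mean()   # -(1/2) E[min(0, -1 - D(x_mis, e))]
    return loss_real + loss_fake + loss_mis

def g_hinge_loss(d_fake):
    """Generator loss of Eq. (1): -E[D(G(z), e)]."""
    return -d_fake.mean()
```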
Figure 3. (a) A comparison of the loss landscape before and after applying the gradient penalty. The gradient penalty smooths the discriminator loss surface, which is helpful for generator convergence. (b) A diagram of MA-GP. The gradient penalty should be applied at the (real, match) data point.

3.3. Target-Aware Discriminator

In this section, we detail the proposed Target-Aware Discriminator, which is composed of Matching-Aware Gradient Penalty (MA-GP) and One-Way Output. The Target-Aware Discriminator promotes the generator to synthesize more realistic and text-image semantic-consistent images.

3.3.1 Matching-Aware Gradient Penalty

The Matching-Aware zero-centered Gradient Penalty (MA-GP) is our newly designed strategy to enhance text-image semantic consistency. In this subsection, we first show the unconditional gradient penalty [28] from a novel and clear perspective, then extend it to our MA-GP for the text-to-image generation task.

As shown in Figure 3(a), in unconditional image generation, the target data (real images) correspond to a low discriminator loss, whereas the synthetic images correspond to a high discriminator loss. The hinge loss limits the range of the discriminator loss between -1 and 1. The gradient penalty on real data reduces the gradient at the real data point and in its vicinity. The surface of the loss function around the real data point is then smoothed, which helps the synthetic data point converge to the real data point.

Based on the above analysis, we find that a gradient penalty on the target data constructs a better loss landscape that helps the generator converge. We then carry this view over to text-to-image generation. As shown in Figure 3(b), in text-to-image generation, the discriminator observes four kinds of inputs: synthetic images with matching text (fake, match), synthetic images with mismatched text (fake, mismatch), real images with matching text (real, match), and real images with mismatched text (real, mismatch). For text-visual semantic consistency, we want to apply the gradient penalty on the text-matching real data, the target of text-to-image synthesis. Therefore, in MA-GP, the gradient penalty should be applied on real images with matching text. The whole formulation of our model with MA-GP is as follows:

L_D = - E_{x~P_r} [min(0, -1 + D(x, e))]
      - (1/2) E_{G(z)~P_g} [min(0, -1 - D(G(z), e))]
      - (1/2) E_{x~P_mis} [min(0, -1 - D(x, e))]
      + k E_{x~P_r} [(||∇_x D(x, e)|| + ||∇_e D(x, e)||)^p]          (2)
L_G = - E_{G(z)~P_g} [D(G(z), e)]

where k and p are two hyper-parameters that balance the effectiveness of the gradient penalty.

By using the MA-GP loss as a regularization on the discriminator, our model can better converge to the text-matching real data and therefore synthesize more text-matching images. Besides, since the discriminator is jointly trained in our network, it prevents the generator from synthesizing adversarial features of a fixed extra network. Moreover, since MA-GP does not incorporate any extra networks for text-image consistency and the gradients are already computed in the back-propagation process, the only computation introduced by our proposed MA-GP is the gradient summation, which is computationally friendlier than extra networks.
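Since the MA-GP term in Eq. (2) only needs the gradients of D at the (real, match) pairs, it can be computed with one extra autograd call. The sketch below is an illustrative PyTorch implementation under the assumption that D(x, e) returns one scalar score per sample; the hyper-parameters k and p are passed in explicitly because this section does not fix their values.

```python
import torch

def ma_gp(D, real_imgs, sent_emb, k, p):
    """Matching-Aware zero-centered Gradient Penalty of Eq. (2).

    The penalty is applied only on real images paired with matching text,
    and is taken w.r.t. both the image and the sentence embedding.
    """
    imgs = real_imgs.detach().requires_grad_(True)
    embs = sent_emb.detach().requires_grad_(True)
    scores = D(imgs, embs)

    grad_img, grad_emb = torch.autograd.grad(
        outputs=scores.sum(), inputs=(imgs, embs), create_graph=True
    )
    # per-sample L2 norms of both gradients
    norm_img = grad_img.flatten(1).norm(2, dim=1)
    norm_emb = grad_emb.flatten(1).norm(2, dim=1)
    return k * ((norm_img + norm_emb) ** p).mean()
```

The returned value is added to the discriminator hinge loss before its backward pass.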
Figure 4. Comparison between the Two-Way Output and our One-Way Output. (a) The Two-Way Output predicts a conditional loss and an unconditional loss and sums them up as the final adversarial loss. (b) Our One-Way Output predicts the whole adversarial loss directly.

3.3.2 One-Way Output

In previous text-to-image GANs [50, 56, 57], the image features extracted by the discriminator are usually used in two ways (Figure 4(a)): one determines whether the image is real or fake, and the other concatenates the image feature and the sentence vector to evaluate text-image semantic consistency. Correspondingly, an unconditional loss and a conditional loss are computed in these models.

However, the Two-Way Output weakens the effectiveness of MA-GP and slows down the convergence of the generator. Concretely, as depicted in Figure 3(b), the conditional loss gives a gradient α pointing to the real and matching inputs after back propagation, while the unconditional loss gives a gradient β pointing only to the real images. The direction of the final gradient, which simply sums up α and β, does not point to the real and matching data points as expected. Since the target of the generator is to synthesize real and text-matching images, this deviated final gradient cannot fully achieve text-image semantic consistency and slows down the convergence of the generator.

Therefore, we propose the One-Way Output for text-to-image synthesis. As shown in Figure 4(b), our discriminator concatenates the image feature and the sentence vector, then outputs only one adversarial loss through two convolution layers. Through the One-Way Output, the single gradient points to the target data points (real and match) directly, which optimizes and accelerates the convergence of the generator.

By combining the MA-GP and the One-Way Output, our Target-Aware Discriminator can guide the generator to synthesize more realistic and text-matching images.
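A minimal sketch of such a One-Way Output head is shown below. It assumes the DownBlocks have already reduced the image to a B×C×4×4 feature map and that the sentence vector is replicated spatially before concatenation; the channel sizes and kernel shapes are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class OneWayOutput(nn.Module):
    """Single adversarial score from concatenated image features and sentence vector."""
    def __init__(self, feat_ch=512, cond_dim=256):
        super().__init__()
        self.joint = nn.Sequential(
            # first convolution mixes visual and textual channels
            nn.Conv2d(feat_ch + cond_dim, feat_ch, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            # second convolution collapses the 4x4 map to one scalar score per sample
            nn.Conv2d(feat_ch, 1, kernel_size=4, stride=1, padding=0),
        )

    def forward(self, img_feat, sent_emb):
        b, _, h, w = img_feat.shape
        cond = sent_emb.view(b, -1, 1, 1).expand(-1, -1, h, w)  # replicate sentence vector
        return self.joint(torch.cat([img_feat, cond], dim=1)).view(b)

# usage sketch: score = OneWayOutput()(torch.randn(2, 512, 4, 4), torch.randn(2, 256))
```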
3.4. Efficient Text-Image Fusion

To fuse text and image features efficiently, we propose a novel Deep text-image Fusion Block (DFBlock). Compared with previous text-image fusion modules, our DFBlock deepens the text-image fusion process to make a full text-image fusion.

As shown in Figure 2, the generator of our DF-GAN consists of 7 UPBlocks, and a UPBlock contains two text-image fusion blocks. To fully utilize the text information in fusion, we propose the Deep text-image Fusion Block (DFBlock), which stacks multiple Affine Transformations and ReLU layers in the fusion block. For the Affine transformation, as shown in Figure 5(c), we adopt two MLPs (multilayer perceptrons) to predict the language-conditioned channel-wise scaling parameters γ and shifting parameters θ from the sentence vector e, respectively:

γ = MLP_1(e),    θ = MLP_2(e).                    (3)

For a given input feature map X ∈ R^{B×C×H×W}, we first conduct the channel-wise scaling operation on X with the scaling parameter γ, then apply the channel-wise shifting operation with the shifting parameter θ. Such a process can be expressed as follows:

AFF(x_i | e) = γ_i · x_i + θ_i,                    (4)

where AFF denotes the Affine Transformation, x_i is the ith channel of the visual feature maps, e is the sentence vector, and γ_i and θ_i are the scaling and shifting parameters for the ith channel of the visual feature maps.

The Affine layer expands the conditional representation space of the generator. However, the Affine transformation is a linear transformation for each channel, which limits the effectiveness of the text-image fusion process. We therefore add a ReLU layer between two Affine layers, which brings nonlinearity into the fusion process. It enlarges the conditional representation space compared with only one Affine layer, and a larger representation space helps the generator map different images to different representations according to the text descriptions.

Our DFBlock is partly inspired by Conditional Batch Normalization (CBN) [5] and Adaptive Instance Normalization (AdaIN) [14, 16], which contain the Affine transformation. However, both CBN and AdaIN employ normalization layers [15, 44] that transform the feature maps into a normal distribution. This has the opposite effect to the Affine Transformation, which is expected to increase the distance between different samples, and it is therefore unhelpful for the conditional generation process. To this end, we remove the normalization process. Furthermore, our DFBlock deepens the text-image fusion process: we stack multiple Affine layers and add a ReLU layer between them. This promotes the diversity of visual features and enlarges the representation space to represent different visual features according to different text descriptions.
Figure 5. (a) A typical UPBlock in the generator network. The UPBlock upsamples the image features and fuses text and image features with two fusion blocks. (b) The DFBlock consists of two Affine layers, two ReLU activation layers, and a convolution layer. (c) An illustration of the Affine Transformation. (d) Comparison between (d.1) the generator with cross-modal attention [50, 60] and (d.2) our generator with DFBlock.

With the deepening of the fusion process, the DFBlock brings two main benefits for text-to-image generation. First, it enables the generator to exploit the text information more fully when fusing text and image features. Second, deepening the fusion process enlarges the representation space of the fusion module, which is beneficial for generating semantically consistent images from different text descriptions.

Furthermore, compared with previous text-to-image GANs [50, 56, 57, 60], the proposed DFBlock frees our model from the image-scale limitation when fusing the text and image features, because existing text-to-image GANs generally employ the cross-modal attention mechanism, which suffers a rapid growth in computational cost as the image size increases.
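To make the fusion module concrete, the sketch below implements the Affine transformation of Eqs. (3) and (4) and a DFBlock following Figure 5(b) (Affine, ReLU, Affine, ReLU, convolution). The MLP depth and hidden size are assumptions made for illustration, not the released configuration.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Channel-wise scaling and shifting conditioned on the sentence vector e (Eqs. 3-4)."""
    def __init__(self, cond_dim, num_channels, hidden_dim=256):
        super().__init__()
        # two MLPs predict gamma and theta from the sentence vector
        self.gamma_mlp = nn.Sequential(nn.Linear(cond_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, num_channels))
        self.theta_mlp = nn.Sequential(nn.Linear(cond_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, num_channels))

    def forward(self, x, e):                                   # x: B x C x H x W, e: B x cond_dim
        gamma = self.gamma_mlp(e).unsqueeze(-1).unsqueeze(-1)  # B x C x 1 x 1
        theta = self.theta_mlp(e).unsqueeze(-1).unsqueeze(-1)
        return gamma * x + theta                               # AFF(x_i | e) = gamma_i * x_i + theta_i

class DFBlock(nn.Module):
    """Deep fusion block of Figure 5(b): Affine -> ReLU -> Affine -> ReLU -> Conv."""
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.affine1 = Affine(cond_dim, channels)
        self.affine2 = Affine(cond_dim, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x, e):
        h = torch.relu(self.affine1(x, e))
        h = torch.relu(self.affine2(h, e))
        return self.conv(h)

# usage sketch: fused = DFBlock(cond_dim=256, channels=512)(torch.randn(2, 512, 8, 8), torch.randn(2, 256))
```

A UPBlock (Figure 5(a)) would then combine an upsampling layer with two such fusion blocks and a residual connection.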
4. Experiments

In this section, we first introduce the datasets, training details, and evaluation metrics used in our experiments, and then evaluate DF-GAN and its variants quantitatively and qualitatively.

Datasets. We follow previous work [33, 50, 51, 56, 57, 60] and evaluate the proposed model on two challenging datasets, i.e., CUB bird [47] and COCO [25]. The CUB dataset contains 11,788 images belonging to 200 bird species; each bird image has ten language descriptions. The COCO dataset contains 80k images for training and 40k images for testing; each image in this dataset has five language descriptions.

Training Details. We optimize our network using Adam [18] with β1 = 0.0 and β2 = 0.9. The learning rate is set to 0.0001 for the generator and 0.0004 for the discriminator according to the Two Timescale Update Rule (TTUR) [12].
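These settings translate directly into two Adam optimizers with different learning rates (TTUR). The snippet below is a minimal sketch in which netG and netD merely stand in for the generator and discriminator modules.

```python
import torch
import torch.nn as nn

netG = nn.Linear(100, 100)  # placeholder for the DF-GAN generator
netD = nn.Linear(100, 100)  # placeholder for the DF-GAN discriminator

# Two Timescale Update Rule (TTUR): the discriminator uses a 4x larger learning rate.
optimizer_G = torch.optim.Adam(netG.parameters(), lr=0.0001, betas=(0.0, 0.9))
optimizer_D = torch.optim.Adam(netD.parameters(), lr=0.0004, betas=(0.0, 0.9))
```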
Evaluation Details. Following previous works [50, 60], we choose the Inception Score (IS) [40] and Fréchet Inception Distance (FID) [12] to evaluate the performance of our network. Specifically, IS computes the Kullback-Leibler (KL) divergence between a conditional distribution and a marginal distribution. A higher IS means higher quality of the generated images, with each image clearly belonging to a specific class. FID [12] computes the Fréchet distance between the distribution of the synthetic images and that of real-world images in the feature space of a pre-trained Inception v3 network. Contrary to IS, more realistic images yield a lower FID. To compute both IS and FID, each model generates 30,000 images (256×256 resolution) from text descriptions randomly selected from the test dataset.

As stated in recent works [21, 58], the IS cannot evaluate image quality well on the COCO dataset, which also holds for our proposed method. Moreover, we find that some GAN-based models [50, 60] achieve significantly higher IS than Transformer-based large text-to-image models [7, 35] on the COCO dataset, although the visual quality of their synthesized images is obviously lower than that of the Transformer-based models [7, 35]. Thus, we do not compare IS on the COCO dataset. In contrast, FID is more robust and aligns with human qualitative evaluation on the COCO dataset.

Moreover, we evaluate the number of parameters (NoP) to compare the model size with current methods.
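For reference, the sketch below shows the standard way FID is computed once Inception-v3 features of real and generated images are available: both feature sets are modeled as Gaussians and the Fréchet distance between them is evaluated. This is the common formulation, not code taken from the paper, and the feature-extraction step is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of Inception-v3 features, each of shape (N, 2048)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    # Frechet distance between N(mu_r, cov_r) and N(mu_f, cov_f)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```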
4.1. Quantitative Evaluation

We compare the proposed method with several state-of-the-art methods, including StackGAN [56], StackGAN++ [57], AttnGAN [50], MirrorGAN [33], SD-GAN [51], and DM-GAN [60], which have achieved remarkable success in text-to-image synthesis by using stacked structures. We also compare with more recent models [22, 26, 39, 55]. It should be pointed out that these recent models always use extra knowledge or supervision. For example, CPGAN [22] uses the extra pretrained YOLO-V3 [36], XMC-GAN [55] uses the extra pretrained VGG-19 [42] and Bert [6], DAE-GAN [39] uses extra NLTK POS tagging and manually designed rules for different datasets, and TIME [26] uses extra 2-D positional encoding.

Table 1. The results of IS, FID, and NoP compared with the state-of-the-art methods on the test set of CUB and COCO.

Model             CUB IS ↑   CUB FID ↓   COCO FID ↓   NoP ↓
StackGAN [56]     3.70       -           -            -
StackGAN++ [57]   3.84       -           -            -
AttnGAN [50]      4.36       23.98       35.49        230M
MirrorGAN [33]    4.56       18.34       34.71        -
SD-GAN [51]       4.67       -           -            -
DM-GAN [60]       4.75       16.09       32.64        46M
CPGAN [22]        -          -           55.80        318M
XMC-GAN [55]      -          -           9.30         166M
DAE-GAN [39]      4.42       15.19       28.12        98M
TIME [26]         4.91       14.30       31.14        120M
DF-GAN (Ours)     5.10       14.81       19.32        19M

As shown in Table 1, compared with other leading models, our DF-GAN has a significantly smaller number of parameters (NoP) but still achieves competitive performance. Compared with AttnGAN [50], which employs cross-modal attention to fuse text and image features, our DF-GAN improves the IS metric from 4.36 to 5.10 and decreases the FID metric from 23.98 to 14.81 on the CUB dataset, and it decreases FID from 35.49 to 19.32 on the COCO dataset. Compared with MirrorGAN [33] and SD-GAN [51], which employ cycle consistency and a Siamese network to ensure text-image semantic consistency, our DF-GAN improves IS from 4.56 and 4.67, respectively, to 5.10 on the CUB dataset. Compared with DM-GAN [60], which introduces a Memory Network to refine fuzzy image contents, our model also improves IS from 4.75 to 5.10 and decreases FID from 16.09 to 14.81 on CUB, and it also decreases FID from 32.64 to 19.32 on COCO. Moreover, compared with recent models that introduce extra knowledge, our DF-GAN still achieves competitive performance. The quantitative comparisons prove that our model is much simpler but more effective.

Figure 6. Examples of images synthesized by AttnGAN [50], DM-GAN [60], and our proposed DF-GAN, conditioned on text descriptions from the test sets of the COCO and CUB datasets.

4.2. Qualitative Evaluation

We also compare the visualization results synthesized by AttnGAN [50], DM-GAN [60], and the proposed DF-GAN. It can be seen that images synthesized by AttnGAN [50] and DM-GAN [60] in Figure 6 look like a simple combination of fuzzy shape and some visual details (1st, 3rd, 5th, 7th, and 8th columns). As shown in the 5th, 7th, and 8th columns, the birds synthesized by AttnGAN [50] and DM-GAN [60] contain wrong shapes. Moreover, the images synthesized by our DF-GAN have better object shapes and realistic fine-grained details (e.g., 1st, 3rd, 7th, and 8th columns). Besides, the posture of the bird in our DF-GAN results is also more natural (e.g., 7th and 8th columns).

Comparing the text-image semantic consistency with other models, we find that our DF-GAN can also capture more fine-grained details in the text descriptions. For example, as shown in the 1st, 2nd, and 6th columns in Figure 6, other models cannot synthesize well the "holding ski poles", "train track", and "a black stripe by its eyes" described in the text, but the proposed DF-GAN synthesizes them more correctly.

4.3. Ablation Study

In this section, we conduct ablation studies on the test set of the CUB dataset to verify the effectiveness of each component in the proposed DF-GAN.
The components include the One-Stage text-to-image Backbone (OS-B), Matching-Aware Gradient Penalty (MA-GP), One-Way Output (OW-O), and Deep text-image Fusion Block (DFBlock). We also compare our Target-Aware Discriminator with the Deep Attentional Multimodal Similarity Model (DAMSM), an extra network widely employed in current models [50, 51, 60]. We first evaluate the effectiveness of OS-B, MA-GP, and OW-O. We conducted a user study to evaluate text-image semantic consistency (SC): we asked ten users to score 100 randomly synthesized images with their text descriptions, with scores ranging from 1 (worst) to 5 (best). The results on the CUB dataset are shown in Table 2.

Table 2. The performance of different components of our model on the test set of CUB.

Architecture              IS ↑    FID ↓   SC ↑
Baseline                  3.96    51.34   -
OS-B                      4.11    43.45   1.46
OS-B w/ DAMSM             4.28    36.72   1.79
OS-B w/ MA-GP             4.46    32.52   3.55
OS-B w/ MA-GP w/ OW-O     4.57    23.16   4.61

Baseline. Our baseline employs the stacked framework and Two-Way Output with the same adversarial loss as StackGAN [56]. In the baseline, the sentence vector is naively concatenated to the input noise and the intermediate feature maps.

Effect of One-Stage Backbone. Our proposed OS-B improves IS from 3.96 to 4.11 and decreases FID from 51.34 to 43.45. The results demonstrate that our one-stage backbone is more effective than the stacked architecture.

Effect of MA-GP. Armed with MA-GP, the model further improves IS to 4.46 and SC to 3.55, and decreases FID to 32.52 significantly. This demonstrates that the proposed MA-GP can promote the generator to synthesize more realistic and text-image semantic-consistent images.

Effect of One-Way Output. The proposed OW-O further improves IS from 4.46 to 4.57 and SC from 3.55 to 4.61, and decreases FID from 32.52 to 23.16. This also demonstrates that the One-Way Output is more effective than a Two-Way Output in the text-to-image generation task.

Effect of Target-Aware Discriminator. Compared with DAMSM, our proposed Target-Aware Discriminator composed of MA-GP and OW-O improves IS from 4.28 to 4.57 and SC from 1.79 to 4.61, and decreases FID from 36.72 to 23.16. The results demonstrate that our Target-Aware Discriminator is superior to extra networks.

Effect of DFBlock. We compare our DFBlock with CBN [1, 5, 29], AdaIN [16], and AFFBlock. The AFFBlock employs one Affine Transformation layer to fuse text and image features. MA-GP GAN is the model that employs the One-Stage text-to-image Backbone, Matching-Aware Gradient Penalty, and One-Way Output. From the results in Table 3, we find that, compared with the other fusion methods, concatenation cannot efficiently fuse text and image features. The comparison among CBN, AdaIN, and AFFBlock proves that normalization is not essential in the fusion block, and removing normalization even slightly improves the results. The comparison between DFBlock and AFFBlock demonstrates the effectiveness of deepening the text-image fusion process. In sum, the comparison results prove the effectiveness of our proposed DFBlock.

Table 3. The performance of MA-GP GAN with different modules on the test set of CUB.

Architecture                    IS ↑    FID ↓
MA-GP GAN w/ Concat             4.57    23.16
MA-GP GAN w/ CBN                4.81    18.56
MA-GP GAN w/ AdaIN              4.85    17.52
MA-GP GAN w/ AFFBLK             4.87    17.43
MA-GP GAN w/ DFBLK (DF-GAN)     5.10    14.81

4.4. Limitations

Although DF-GAN shows superiority in text-to-image synthesis, some limitations must be taken into consideration in future studies. First, our model only uses sentence-level text information, which limits the ability of fine-grained visual feature synthesis. Second, introducing pre-trained large language models [6, 34] to provide additional knowledge may further improve the performance. We will try to address these limitations in our future work.

5. Conclusion and Future Work

In this paper, we propose a novel DF-GAN for the text-to-image generation task. We present a one-stage text-to-image backbone that can synthesize high-resolution images directly without entanglements between different generators. We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty (MA-GP) and One-Way Output, which further enhances the text-image semantic consistency without introducing extra networks. Besides, we introduce a novel Deep text-image Fusion Block (DFBlock), which fully fuses text and image features more effectively and deeply. Extensive experimental results demonstrate that our proposed DF-GAN significantly outperforms current state-of-the-art models on the CUB dataset and the more challenging COCO dataset.

Acknowledgment

This work was supported by the National Key Research and Development Project (No. 2020AAA0106200), the National Nature Science Foundation of China under Grants (No. 61936005, 61872424, 62076139, 62176069 and 61933013), the Natural Science Foundation of Jiangsu Province (Grants No. BK20200037 and BK20210595), and the Open Research Project of Zhejiang Lab (No. 2021KF0AB05).
References

[1] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
[2] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[3] Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, and Dapeng Tao. Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10911–10920, 2020.
[4] Wen-Huang Cheng, Sijie Song, Chieh-Yun Chen, Shintami Chusnul Hidayati, and Jiaying Liu. Fashion meets computer vision: A survey. ACM Computing Surveys (CSUR), 54(4):1–41, 2021.
[5] Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pages 6594–6604, 2017.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[7] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290, 2021.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[9] Yuchuan Gou, Qiancheng Wu, Minghao Li, Bo Gong, and Mei Han. Segattngan: Text to image generation with segmentation attention. arXiv preprint arXiv:2005.12444, 2020.
[10] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural turing machine with continuous and discrete addressing schemes. Neural Computation, 30(4):857–884, 2018.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[13] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7986–7994, 2018.
[14] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[17] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[19] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. In Advances in Neural Information Processing Systems, pages 2065–2075, 2019.
[20] Ruifan Li, Ning Wang, Fangxiang Feng, Guangwei Zhang, and Xiaojie Wang. Exploring global and local linguistic representation for text-to-image synthesis. IEEE Transactions on Multimedia, 2020.
[21] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12174–12182, 2019.
[22] Jiadong Liang, Wenjie Pei, and Feng Lu. Cpgan: Content-parsing generative adversarial networks for text-to-image synthesis. In European Conference on Computer Vision, pages 491–508. Springer, 2020.
[23] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
[24] Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, et al. M6: A chinese multimodal pretrainer. arXiv preprint arXiv:2103.00823, 2021.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[26] Bingchen Liu, Kunpeng Song, Yizhe Zhu, Gerard de Melo, and Ahmed Elgammal. Time: Text and image mutual-translation adversarial networks. arXiv preprint arXiv:2005.13192, 2020.
[27] Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, and Arun Mallya. Generative adversarial networks for image and video synthesis: Algorithms and applications. Proceedings of the IEEE, 109(5):839–862, 2021.
[28] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3481–3490, 2018.
[29] Takeru Miyato and Masanori Koyama. cgans with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
[30] Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. 2021.
[31] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[32] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Learn, imagine and create: Text-to-image generation from prior knowledge. In Advances in Neural Information Processing Systems, pages 887–897, 2019.
[33] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
[34] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[35] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
[36] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[37] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, pages 1060–1069, 2016.
[38] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.
[39] Shulan Ruan, Yong Zhang, Kun Zhang, Yanbo Fan, Fan Tang, Qi Liu, and Enhong Chen. Dae-gan: Dynamic aspect-aware gan for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13960–13969, 2021.
[40] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[41] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[43] Hao Tang, Song Bai, Li Zhang, Philip HS Torr, and Nicu Sebe. Xinggan for person image generation. In ECCV, 2020.
[44] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[45] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision, pages 791–808. Springer, 2016.
[46] Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. A siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision, pages 135–153. Springer, 2016.
[47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[48] Zhihao Wang, Jian Chen, and Steven CH Hoi. Deep learning for image super-resolution: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3365–3387, 2020.
[49] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In International Conference on Learning Representations, 2015.
[50] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
[51] Guojun Yin, Bin Liu, Lu Sheng, Nenghai Yu, Xiaogang Wang, and Jing Shao. Semantics disentangling for text-to-image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2327–2336, 2019.
[52] Fangchao Yu, Li Wang, Xianjin Fang, and Youwen Zhang. The defense of adversarial example with conditional generative adversarial networks. Security and Communication Networks, 2020, 2020.
[53] Mingkuan Yuan and Yuxin Peng. Ckd: Cross-task knowledge distillation for text-to-image synthesis. IEEE Transactions on Multimedia, 2019.
[54] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363. PMLR, 2019.
[55] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021.
[56] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[57] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE TPAMI, 41(8):1947–1962, 2018.
[58] Zhenxing Zhang and Lambert Schomaker. Dtgan: Dual attention generative adversarial networks for text-to-image generation. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
[59] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[60] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.
