
StyleSwin: Transformer-based GAN for High-resolution Image Generation

Bowen Zhang¹* Shuyang Gu¹ Bo Zhang²† Jianmin Bao² Dong Chen² Fang Wen² Yong Wang¹ Baining Guo²

¹University of Science and Technology of China   ²Microsoft Research Asia

*Author did this work during his internship at Microsoft Research Asia. †Corresponding author.

Figure 1. Image samples generated by our StyleSwin on FFHQ 1024 × 1024 and LSUN Church 256 × 256 respectively.

Abstract

Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated on-par ability with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike a balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts the Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention, which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and the fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., 1024 × 1024. StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024, and achieves on-par performance on FFHQ-1024, proving the promise of using transformers for high-resolution image generation. The code and pretrained models are available at https://github.com/microsoft/StyleSwin.

1. Introduction

The state of image generative modeling has seen dramatic advancement in recent years, among which generative adversarial networks (GANs) [14, 41] offer arguably the most compelling quality on synthesizing high-resolution images. While early attempts focus on stabilizing the training dynamics via proper regularization [15, 16, 36, 46, 47] or adversarial loss designs [2, 25, 39, 45], remarkable performance leaps in recent prominent works are mainly attributed to architectural modifications that aim for stronger modeling capacity, such as adopting self-attention [66], aggressive model scaling [4], or style-based generators [29, 30]. Recently, drawn by the broad success of transformers in discriminative models [11, 32, 43], a few works [24, 37, 62, 67] attempt to use pure transformers to build generative networks in the hope that the increased expressivity and the ability to model long-range dependencies can benefit the generation of complex images, yet high-quality image generation, especially at high resolutions, remains challenging.

This paper aims to explore the key ingredients when using transformers to constitute a competitive GAN for high-resolution image generation. The first obstacle is to tame the quadratic computational cost so that the network is scalable to high resolutions, e.g., 1024 × 1024. We propose to leverage Swin transformers [43] as the basic building block since the window-based local attention strikes a balance between computational efficiency and modeling capacity. As such, we could take advantage of the increased expressivity to characterize all the image scales, as opposed to reducing to point-wise multi-layer perceptrons (MLPs) for higher scales [67], and the synthesis is scalable to high resolution, e.g., 1024 × 1024, with delicate details. Besides, the local attention introduces a locality inductive bias, so there is no need for the generator to relearn the regularity of images from scratch. These merits make a simple transformer network substantially outperform the convolutional baseline.

In order to compete with the state of the art, we further propose three instrumental architectural adaptations. First, we strengthen the generative model capacity by employing the local attention in a style-based architecture [29], during which we empirically compare various style injection approaches for our transformer GAN. Second, we propose double attention in order to enlarge the limited receptive field brought by the local attention, where each layer attends to both the local and the shifted windows, effectively improving the generator capacity without much computational overhead. Moreover, we notice that Conv-based GANs implicitly utilize zero padding to infer the absolute positions, a crucial clue for generation, yet such a feature is missing in window-based transformers. We propose to fix this by introducing sinusoidal positional encoding [52] to each layer such that absolute positions can be leveraged for image synthesis. Equipped with the above techniques, the proposed network, dubbed StyleSwin, starts to show advantageous generation quality at 256 × 256 resolution.

Nonetheless, we observe blocking artifacts when synthesizing high-resolution images. We conjecture that these disturbing artifacts arise because computing the attention independently in a block-wise manner breaks the spatial coherency. That is, while proven successful in discriminative tasks [43, 56], the block-wise attention requires special treatment when applied in synthesis networks. To tackle these blocking artifacts, we empirically investigate various solutions, among which we find that a wavelet discriminator [13] examining the artifacts in the spectral domain could effectively suppress them, making our transformer-based GAN yield visually pleasing outputs.

The proposed StyleSwin achieves state-of-the-art quality on multiple established benchmarks, e.g., FFHQ, CelebA-HQ, and LSUN Church. In particular, our approach shows compelling quality on high-resolution image synthesis (Figure 1), achieving competitive quantitative performance relative to the leading ConvNet-based methods without complex training strategies. On CelebA-HQ 1024, our approach achieves an FID of 4.43, outperforming all the prior works including StyleGAN [29]; whereas on FFHQ-1024, we obtain an FID of 5.07, approaching the performance of StyleGAN2 [30].

2. Related Work

High-resolution image generation. Image generative modeling has improved rapidly in the past decade [14, 19, 34, 35, 41, 55]. Among various solutions, generative adversarial networks (GANs) offer competitive generation quality. While early methods [2, 47, 49] focus on stabilizing the adversarial training, recent prominent works [4, 28–30] rely on designing architectures with enhanced capacity, which considerably improves generation quality. However, contemporary GAN-based methods adopt convolutional backbones, which are now deemed to be inferior to transformers in terms of modeling capacity. In this paper, we are interested in applying the emerging vision transformers to GANs for high-resolution image generation.

Vision transformers. The recent success of transformers [5, 57] in NLP tasks inspires the research of vision transformers. The seminal work ViT [11] proposes a pure transformer-based architecture for image classification and demonstrates the great potential of transformers for vision tasks. Later, transformers dominate the benchmarks in a broad range of discriminative tasks [10, 17, 43, 53, 56, 59, 60, 64]. However, the self-attention in transformer blocks brings quadratic computational complexity, which limits its application to high-resolution inputs. A few recent works [10, 43, 56] tackle this problem by proposing to compute self-attention in local windows, so that linear computational complexity can be achieved. Moreover, the hierarchical architecture makes them suitable to serve as general-purpose backbones.

Transformer-based GANs. Recently, the research community has begun to explore using transformers for generative tasks in the hope that the increased expressivity can benefit the generation of complex images. One natural way is to use transformers to synthesize pixels in an auto-regressive manner [6, 12], but the slow inference speed limits their practical usage. Recently, a few works [24, 37, 62, 67] attempt to propose transformer-based GANs, yet most of these methods only support synthesis up to 256 × 256 resolution. Notably, HiT [67] successfully generates 1024 × 1024 images at the cost of reducing to MLPs in its high-resolution stages, hence unable to synthesize high-fidelity details as its Conv-based counterpart [29] does. In comparison, our StyleSwin can synthesize fine structures using transformers, leading to comparable quality to the leading ConvNets on high-resolution synthesis.
Figure 2. The architectures we investigate. (a) The baseline architecture is comprised of a series of transformer blocks applied hierarchically. (b) The proposed StyleSwin adopts a style-based architecture, where the style codes derived from the latent code z modulate the feature maps of transformer blocks through style injection. (c) The proposed double attention enlarges the receptive field of transformer blocks by using split heads attending to the local and the shifted windows respectively.

3. Method

3.1. Transformer-based GAN architecture

We start from a simple generator architecture, as shown in Figure 2(a), which receives a latent variable z ∼ N(0, I) as input and gradually upsamples the feature maps through a cascade of transformer blocks.

Due to the quadratic computational complexity, it is unaffordable to compute full attention on high-resolution feature maps. We believe that local attention is a good way to achieve a trade-off between computational efficiency and modeling capacity. We adopt the Swin transformer [43] as the basic building block, which computes multi-head self-attention (MSA) [57] locally in non-overlapping windows. To advocate information interaction across adjacent windows, the Swin transformer uses shifted window partitioning in alternating blocks. Specifically, given the input feature map x^l ∈ R^{H×W×C} of layer l, the consecutive Swin blocks operate as follows:

\hat{x}^{l} = \text{W-MSA}(\text{LN}(x^{l})) + x^{l}, \qquad x^{l+1} = \text{MLP}(\text{LN}(\hat{x}^{l})) + \hat{x}^{l} \qquad \text{(regular window)},
\hat{x}^{l+1} = \text{SW-MSA}(\text{LN}(x^{l+1})) + x^{l+1}, \qquad x^{l+2} = \text{MLP}(\text{LN}(\hat{x}^{l+1})) + \hat{x}^{l+1} \qquad \text{(shifted window)},    (1)

where W-MSA and SW-MSA denote the window-based multi-head self-attention under the regular and shifted window partitioning respectively, and LN stands for layer normalization. Since such block-wise attention induces linear computational complexity relative to the image size, the network is scalable to high-resolution generation, where the fine structures can be modeled by these capable transformers as well.
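To make Eq. (1) concrete, the following is a minimal PyTorch sketch of one regular-window block followed by one shifted-window block. It is an illustrative re-implementation under simplifying assumptions rather than the released code: the attention is single-head, the relative position bias is omitted, and the helper names window_partition / window_reverse are ours.

```python
import torch
import torch.nn as nn

def window_partition(x, k):
    # (B, H, W, C) -> (B * num_windows, k*k, C)
    B, H, W, C = x.shape
    x = x.reshape(B, H // k, k, W // k, k, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, k * k, C)

def window_reverse(win, k, H, W):
    # inverse of window_partition
    B = win.shape[0] // ((H // k) * (W // k))
    x = win.reshape(B, H // k, W // k, k, k, -1).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, -1)

class WindowMSA(nn.Module):
    """Single-head window attention; shift = k // 2 gives the SW-MSA variant."""
    def __init__(self, dim, window_size, shift=0):
        super().__init__()
        self.k, self.shift = window_size, shift
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, H, W, C)
        B, H, W, C = x.shape
        if self.shift:                             # cyclic shift for SW-MSA
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.k)
        q, kk, v = self.qkv(win).chunk(3, dim=-1)
        attn = (q @ kk.transpose(-2, -1)) * C ** -0.5
        out = self.proj(attn.softmax(dim=-1) @ v)
        out = window_reverse(out, self.k, H, W)
        if self.shift:                             # undo the cyclic shift
            out = torch.roll(out, (self.shift, self.shift), dims=(1, 2))
        return out

class SwinBlock(nn.Module):
    """One block of Eq. (1): (S)W-MSA followed by an MLP, both with residuals."""
    def __init__(self, dim, window_size=8, shift=0):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = WindowMSA(dim, window_size, shift)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

# Consecutive regular / shifted blocks, as in Eq. (1).
blocks = nn.Sequential(SwinBlock(64, 8, shift=0), SwinBlock(64, 8, shift=4))
feat = torch.randn(2, 32, 32, 64)                  # (B, H, W, C) feature map
print(blocks(feat).shape)                          # torch.Size([2, 32, 32, 64])
```

The cyclic torch.roll realizes the shifted partitioning; stacking the two blocks reproduces the regular/shifted alternation of Eq. (1).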

Since the discriminator severely affects the stability of adversarial training, we opt to use a Conv-based discriminator directly from [29]. In our experiments, we find that simply replacing the convolutions with transformer blocks under this baseline architecture already yields more stable training due to the improved model capacity. However, such a naive architecture cannot make our transformer-based GAN compete with the state of the art, so we make further studies, which we introduce as follows.

Style injection. We first strengthen the model capability by adapting the generator to a style-based architecture [29, 30], as shown in Figure 2(b). We learn a non-linear mapping f : Z → W to map the latent code z from the Z space to the W space, which specifies the styles that are injected into the main synthesis network. We investigate the following style injection approaches:
• AdaNorm modulates the statistics (i.e., mean and variance) of feature maps after normalization. We study multiple normalization variants, including instance normalization (IN) [54], batch normalization (BN) [21], layer normalization (LN) [3], and the recently proposed RMSNorm [65]. Since RMSNorm removes the mean-centering of LN, we only predict the variance from the W code.

• Modulated MLP: Instead of modulating feature maps, one can also modulate the weights of linear layers. Specifically, we rescale the channel-wise weight magnitude of the feed-forward network (FFN) within transformer blocks. According to [30], such style injection admits faster speed than AdaNorm.

• Cross-attention: Motivated by the decoder transformer [57], we explore a transformer-specific style injection in which the transformers additionally attend to style tokens derived from the W space. The effectiveness of this cross-attention is also validated in [67].

Style injection methods    FID ↓
Baseline                   15.03
AdaIN                      6.34
AdaLN                      6.95
AdaBN                      > 100
AdaRMSNorm                 7.43
Modulated MLP              7.09
Cross attention            6.59

Table 1. Comparison of different style injection methods on FFHQ-256. The style injection methods considerably improve the FID, among which AdaIN leads to the best generation quality.

Table 1 shows that all the above style injection methods significantly boost the generative modeling capacity, except that the training with AdaBN does not converge because the batch size is compromised for high-resolution synthesis. In comparison, AdaNorm brings more sufficient style injection, possibly because the network can take advantage of the style information twice, in both the attention block and the FFN, whereas the modulated MLP and cross-attention make use of the style information only once. We did not further study the hybrid of modulated MLP and cross-attention due to efficiency considerations. Furthermore, compared to AdaBN and AdaLN, AdaIN offers finer and more sufficient feature modulation, as feature maps are normalized and modulated independently, so we choose AdaIN by default for our following experiments.
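For illustration, here is a minimal sketch of AdaIN-style injection on a token sequence: a learned affine layer (the "A" in Figure 2(b)) maps the style code w to a per-channel scale and bias that modulate the instance-normalized features. The layer sizes and module layout are assumptions for the sketch, not the exact StyleSwin modules.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """The style code w predicts a per-channel scale and bias that are applied
    after instance normalization (a simplified sketch)."""
    def __init__(self, channels, w_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(w_dim, 2 * channels)       # A: w -> (gamma, beta)

    def forward(self, x, w):              # x: (B, L, C) tokens, w: (B, w_dim)
        gamma, beta = self.affine(w).chunk(2, dim=-1)      # (B, C) each
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)   # normalize each channel over tokens
        return (1 + gamma.unsqueeze(1)) * x + beta.unsqueeze(1)

# Mapping network f: Z -> W followed by style injection on a token sequence.
mapping = nn.Sequential(nn.Linear(512, 512), nn.LeakyReLU(0.2),
                        nn.Linear(512, 512))
adain = AdaIN(channels=256, w_dim=512)
tokens = torch.randn(4, 64, 256)          # an 8x8 feature map flattened to 64 tokens
w = mapping(torch.randn(4, 512))          # style code in the W space
print(adain(tokens, w).shape)             # torch.Size([4, 64, 256])
```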
Double attention. Using local attention, nonetheless, sacrifices the ability to model long-range dependencies, which is pivotal to capturing geometry [4, 66]. Let the window size used by the Swin block be κ × κ; then, due to the shifted window strategy, the receptive field increases by κ in each dimension with one more Swin block. Suppose we use Swin blocks to process a 64 × 64 feature map and by default choose κ = 8; then it takes 64/κ = 8 transformer blocks to span the entire feature map.

In order to achieve an enlarged receptive field, we propose double attention, which allows a single transformer block to simultaneously attend to the context of the local and the shifted windows. As illustrated in Figure 2(c), we split the h attention heads into two groups: the first half of the heads perform the regular window attention whereas the second half compute the shifted window attention, and both of their results are concatenated to form the output. Specifically, we denote with x_w and x_{sw} the non-overlapping patches under the regular and shifted window partitioning respectively, i.e., x_w, x_{sw} ∈ R^{(HW/κ²)×κ×κ×C}; then the double attention is formulated as

\text{Double-Attention} = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^{O},    (2)

where W^O ∈ R^{C×C} is the projection matrix used to mix the heads to the output. The attention heads in Equation 2 can be computed as

\text{head}_i = \begin{cases} \text{Attn}(x_{w} W_i^{Q},\, x_{w} W_i^{K},\, x_{w} W_i^{V}) & i \leq \lfloor h/2 \rfloor, \\ \text{Attn}(x_{sw} W_i^{Q},\, x_{sw} W_i^{K},\, x_{sw} W_i^{V}) & i > \lfloor h/2 \rfloor, \end{cases}    (3)

where W_i^Q, W_i^K, W_i^V ∈ R^{C×(C/h)} are the query, key and value projection matrices for the i-th head respectively. One can derive that the receptive field in each dimension increases by 2.5κ with one additional double attention block, which allows capturing larger context more efficiently. Still, for a 64 × 64 input, it now takes 4 transformer blocks to cover the entire feature map.
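The following sketch illustrates Eq. (2)-(3) under our own simplifications: the first h/2 heads share one query/key/value projection and attend within regular windows, the other h/2 within shifted windows, and the concatenated outputs are mixed by W^O. Helper names and sizes are illustrative, and the relative position bias is again omitted.

```python
import torch
import torch.nn as nn

def windows(x, k, shift):
    """Partition (B, H, W, C) into k*k windows, after an optional cyclic shift."""
    B, H, W, C = x.shape
    if shift:
        x = torch.roll(x, (-shift, -shift), dims=(1, 2))
    x = x.reshape(B, H // k, k, W // k, k, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, k * k, C)

def merge(win, k, shift, B, H, W):
    """Inverse of `windows`: scatter window tokens back onto the feature map."""
    x = win.reshape(B, H // k, W // k, k, k, -1).permute(0, 1, 3, 2, 4, 5)
    x = x.reshape(B, H, W, -1)
    return torch.roll(x, (shift, shift), dims=(1, 2)) if shift else x

class HalfHeads(nn.Module):
    """h/2 heads attending within regular (shift=0) or shifted windows."""
    def __init__(self, dim, heads, window_size, shift):
        super().__init__()
        self.k, self.shift, self.nh = window_size, shift, heads // 2
        self.hd = dim // heads                      # per-head channels C/h
        self.qkv = nn.Linear(dim, 3 * self.nh * self.hd)

    def forward(self, x):
        B, H, W, C = x.shape
        win = windows(x, self.k, self.shift)        # (B*nW, k*k, C)
        q, k, v = self.qkv(win).chunk(3, dim=-1)    # each (B*nW, k*k, C/2)
        def split(t):                               # -> (B*nW, nh, k*k, hd)
            return t.reshape(t.shape[0], t.shape[1], self.nh, self.hd).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.hd ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2)
        out = out.reshape(win.shape[0], win.shape[1], -1)
        return merge(out, self.k, self.shift, B, H, W)

class DoubleAttention(nn.Module):
    """Concatenate the regular-window and shifted-window head groups, then mix
    them with the output projection W^O, as in Eq. (2)."""
    def __init__(self, dim, heads=8, window_size=8):
        super().__init__()
        self.regular = HalfHeads(dim, heads, window_size, shift=0)
        self.shifted = HalfHeads(dim, heads, window_size, shift=window_size // 2)
        self.proj = nn.Linear(dim, dim)             # W^O

    def forward(self, x):                           # x: (B, H, W, C)
        return self.proj(torch.cat([self.regular(x), self.shifted(x)], dim=-1))

attn = DoubleAttention(dim=64, heads=8, window_size=8)
print(attn(torch.randn(2, 32, 32, 64)).shape)       # torch.Size([2, 32, 32, 64])
```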
Local-global positional encoding. The relative positional encoding (RPE) adopted by the default Swin blocks encodes the relative position of pixels and has proven crucial for discriminative tasks [9, 43]. Theoretically, a multi-head local attention layer with RPE can express any convolutional layer of window-sized kernels [8, 38]. However, when substituting the convolutional layers with transformers that use RPE, one thing is rarely noticed: ConvNets can infer the absolute positions by leveraging the clues from the zero paddings [22, 31], yet such a feature is missing in Swin blocks using RPE. On the other hand, it is essential to let the generator be aware of the absolute positions, because the synthesis of a specific component, e.g., a mouth, highly depends on its spatial coordinate [1, 40].
In view of this, we propose to introduce sinusoidal position encoding [7, 57, 61] (SPE) on each scale, as shown in Figure 2(b). Specifically, after the scale upsampling, the feature maps are added with the following encoding:

[\underbrace{\sin(\omega_0 i),\, \cos(\omega_0 i),\, \cdots}_{\text{horizontal dimension}},\ \underbrace{\sin(\omega_0 j),\, \cos(\omega_0 j),\, \cdots}_{\text{vertical dimension}}] \in \mathbb{R}^{C},    (4)

where ω_k = 1/10000^{2k} and (i, j) denotes the 2D location. We use SPE rather than a learnable absolute positional encoding [11] because SPE admits translation invariance [58]. In practice, we make the best of RPE and SPE by employing them together: the RPE applied within each transformer block offers the relative positions within the local context, whereas the SPE introduced on each scale informs the global position.
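A sketch of the 2D sinusoidal encoding of Eq. (4) is shown below; half of the channels encode the horizontal coordinate i and half the vertical coordinate j. The frequency normalization used here (the standard Transformer schedule over the available channel pairs) is an assumption of the sketch, since the exact schedule is not restated in this text.

```python
import torch

def sinusoidal_pe_2d(height, width, channels):
    """2D sinusoidal positional encoding in the spirit of Eq. (4)."""
    assert channels % 4 == 0
    quarter = channels // 4                               # sin/cos pairs per axis
    k = torch.arange(quarter, dtype=torch.float32)
    # Assumed frequency schedule (standard Transformer-style normalization).
    omega = 1.0 / torch.pow(10000.0, k / quarter)
    i = torch.arange(width, dtype=torch.float32)          # horizontal coordinate
    j = torch.arange(height, dtype=torch.float32)         # vertical coordinate
    hor = i[:, None] * omega[None, :]                     # (W, quarter)
    ver = j[:, None] * omega[None, :]                     # (H, quarter)
    pe = torch.zeros(height, width, channels)
    pe[..., 0 * quarter:1 * quarter] = torch.sin(hor)[None, :, :]
    pe[..., 1 * quarter:2 * quarter] = torch.cos(hor)[None, :, :]
    pe[..., 2 * quarter:3 * quarter] = torch.sin(ver)[:, None, :]
    pe[..., 3 * quarter:4 * quarter] = torch.cos(ver)[:, None, :]
    return pe                                             # (H, W, C), added to features

feat = torch.randn(2, 16, 16, 64)                         # upsampled feature map
feat = feat + sinusoidal_pe_2d(16, 16, 64)                # broadcast over the batch
print(feat.shape)                                         # torch.Size([2, 16, 16, 64])
```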
3.2. Blocking artifacts in high-resolution synthesis

While achieving state-of-the-art quality on synthesizing 256 × 256 images with the above architecture, directly applying it to higher-resolution synthesis, e.g., 1024 × 1024, brings blocking artifacts, as shown in Figure 3, which severely affect the perceptual quality. Note that these are by no means the checkerboard artifacts caused by transposed convolutions [48], as we use bilinear upsampling followed by anti-aliasing filters as in [29].

We conjecture that the blocking artifacts are caused by the transformers. To verify this, we remove the attention operators starting from 64 × 64 and employ only MLPs to characterize the high-frequency details. This time we obtain artifact-free results. Furthermore, we find that these artifacts exhibit periodic patterns with a strong correlation with the window size of the local attention. Hence, we are certain it is the window-wise processing that breaks the spatial coherency and causes the blocking artifacts. To simplify, one can consider the 1D case in Figure 4, where attention is computed locally in strided windows. For a continuous signal, the window-wise local attention is likely to produce a discontinuous output because the values within the same window tend to become uniform after the softmax operation, so the outputs of neighboring windows appear rather distinct. The 2D case is analogous to the JPEG compression artifacts caused by the block-wise encoding [42].

Figure 3. Blocking artifacts become obvious at 1024 × 1024 resolution. These artifacts correlate with the window size of the local attentions.

Figure 4. A 1D example illustrates that the window-wise local attention causes blocking artifacts. (a) Input continuous signal along with partitioning windows. (b) Output discontinuous signal after window-wise attention. For simplicity, we adopt one attention head with random projection matrices.
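The 1D argument can be reproduced with a toy experiment of our own (not the paper's script): apply single-head softmax attention with random projection matrices independently within windows of a smooth signal, then compare the output steps at window borders with those inside windows.

```python
import math
import torch

torch.manual_seed(0)
n, win, dim = 64, 8, 16
t = torch.linspace(0, 2 * math.pi, n)
signal = torch.sin(t).unsqueeze(-1).repeat(1, dim)        # smooth 1D input, (n, dim)

Wq, Wk, Wv = (torch.randn(dim, dim) for _ in range(3))    # random projection matrices

out = torch.empty_like(signal)
for s in range(0, n, win):                                # attention per window
    x = signal[s:s + win]
    attn = torch.softmax((x @ Wq) @ (x @ Wk).T / dim ** 0.5, dim=-1)
    out[s:s + win] = attn @ (x @ Wv)

# Steps that cross a window border are typically much larger than steps within a window.
diffs = (out[1:, 0] - out[:-1, 0]).abs()
boundary = diffs[win - 1::win].mean()
interior = diffs[[i for i in range(n - 1) if (i + 1) % win != 0]].mean()
print(f"mean |step| at boundaries: {boundary.item():.3f} vs inside windows: {interior.item():.3f}")
```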
3.3. Artifact suppression

In the following, we discuss a few solutions to suppress the blocking artifacts.

Artifact-free generator. We first attempt to reduce the artifacts by improving the generator.

• Token sharing. Blocking artifacts arise because there is an abrupt change of the keys and values used by the attention computation across distinct windows, so we propose to make windows share tokens in a way like HaloNet [56]. However, artifacts are still noticeable since there always exist tokens exclusive to specific windows.

• Sliding window attention. Theoretically, sliding window attention [20] should lead to artifact-free results. Note that training the generator with sliding attention is too costly, so we only adopt the sliding window for inference.

• Reduce to MLPs on fine scales. Just as in [67], one can remove self-attention and purely rely on point-wise MLPs for fine-structure synthesis, at the cost of sacrificing the ability to model high-frequency details.

Artifact-suppression discriminator. Indeed, we observe blocking artifacts in the early training phase at 256 × 256 resolution, but they gradually fade out as training proceeds. In other words, although the window-based attention is prone to produce artifacts, the generator does have the capability to offer an artifact-free solution. The artifacts plague the high-resolution synthesis because the discriminator fails to examine the high-frequency details. This enlightens us to resort to stronger discriminators for artifact suppression.

• Patch discriminator [23] possesses a limited receptive field and can be employed to specifically penalize the local structures. Experiments show partial suppression of the blocking artifacts using a patch discriminator.
• Total variation annealing. To advocate smooth outputs, we apply a large total variation loss at the beginning of training, aiming to suppress the network's tendency to artifacts. The loss weight is then linearly decayed to zero towards the end of training. Though the artifacts can be completely removed, such a handcrafted constraint favors over-smoothed results and inevitably affects the distribution matching for high-frequency details.

• Wavelet discriminator. As shown in Figure 5, the periodic artifact pattern can be easily distinguished in the spectral domain. Inspired by this, we resort to a wavelet discriminator [13] to complement our spatial discriminator, and we illustrate its architecture in Figure 6. The discriminator hierarchically downsamples the input image and on each scale examines the frequency discrepancy relative to real images after discrete wavelet decomposition. Such a wavelet discriminator works remarkably well in combating the blocking artifacts. Meanwhile, it does not bring any side effects on distribution matching, effectively guiding the generator to produce rich details.
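For reference, below is a minimal sketch of the single-level Haar discrete wavelet transform that such a discriminator can operate on; the actual SWAGAN-style wavelet discriminator [13] is more elaborate, so treat this only as an illustration of the frequency decomposition.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One level of the Haar DWT: split an image (B, C, H, W) into the four
    subbands LL, LH, HL, HH, each at half resolution (a minimal sketch)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh])               # (4, 2, 2)
    B, C, H, W = x.shape
    weight = kernels.to(x).unsqueeze(1).repeat(C, 1, 1, 1)  # (4C, 1, 2, 2)
    return F.conv2d(x, weight, stride=2, groups=C)         # (B, 4C, H/2, W/2)

img = torch.randn(2, 3, 256, 256)                          # fake or real image
bands = haar_dwt(img)
print(bands.shape)                                         # torch.Size([2, 12, 128, 128])
# A wavelet-based discriminator can concatenate such subbands at each scale and
# check whether the high-frequency bands of generated images match the real ones.
```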
Figure 5. The Fourier spectrum of blocking artifacts. (a) Images with blocking artifacts. (b) The artifacts with periodic patterns can be clearly distinguished in the spectrum. (c) The spectrum of artifact-free images derived from the sliding window inference.

Figure 6. The wavelet discriminator suppresses the artifacts by examining the wavelet spectrum of the multi-scale input.

Table 2 compares the above artifact suppression methods, showing that there are four approaches that can totally remove the visual artifacts. However, sliding window inference suffers from the train-test gap, whereas MLPs fail to synthesize fine details on high-resolution stages, both of them leading to a higher FID score. On the other hand, the total variation annealing still deteriorates the FID. In comparison, the wavelet discriminator achieves the lowest FID score and yields the most visually pleasing results.

Solutions                    FID ↓    Remove artifacts?
Window-based attention       8.39     ✗
Sliding window inference     12.08    ✓
Token sharing                8.95     ✗
MLPs after 64 × 64           12.69    ✓
Patch discriminator          7.73     ✗
Total variation annealing    12.79    ✓
Wavelet discriminator        5.07     ✓

Table 2. Comparison of the artifact suppression solutions on FFHQ-1024.

4. Experiments

4.1. Experiment setup

Datasets. We validate our StyleSwin on the following datasets: CelebA-HQ [27], LSUN Church [63], and FFHQ [29]. CelebA-HQ is a high-quality version of the CelebA dataset [44], which contains 30,000 human face images of 1024 × 1024 resolution. FFHQ [29] is a commonly used dataset for high-resolution image generation. It contains 70,000 high-quality human face images with more variation of age, ethnicity and background, and has better coverage of accessories such as eyeglasses, sunglasses, hats, etc. We synthesize images on FFHQ and CelebA-HQ at both 256 × 256 and 1024 × 1024 resolutions. LSUN Church [63] contains around 126,000 church images in diverse architecture styles, on which we conduct experiments at 256 × 256 resolution.

Evaluation protocol. We adopt the Fréchet Inception Distance (FID) [18] as the quantitative metric, which measures the distribution discrepancy between generated images and real ones. Lower FID scores indicate better generation quality. For the FFHQ [29] and LSUN Church [63] datasets, we randomly sample 50,000 images from the original datasets as validation sets and calculate the FID between the validation sets and 50,000 generated images. For CelebA-HQ [27], we calculate the FID between 30,000 generated images and all the training samples.
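As a reminder of what the metric computes, the following sketch evaluates the FID from feature statistics, FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}); in practice the statistics come from Inception features of the real and generated image sets, whereas the toy features below are random stand-ins.

```python
import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet Inception Distance from precomputed feature statistics."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy statistics standing in for Inception-feature means/covariances.
rng = np.random.default_rng(0)
feats_r = rng.normal(size=(1000, 64))
feats_g = rng.normal(loc=0.1, size=(1000, 64))
score = fid(feats_r.mean(0), np.cov(feats_r, rowvar=False),
            feats_g.mean(0), np.cov(feats_g, rowvar=False))
print(f"FID: {score:.2f}")
```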
Methods            FFHQ     CelebA-HQ    LSUN Church
StyleGAN2 [30]     3.62     -            3.86
PG-GAN [27]        -        8.03         6.42
U-Net GAN [50]     7.63     -            -
INR-GAN [51]       9.57     -            5.09
MSG-GAN [26]       -        -            5.20
CIPS [1]           4.38     -            2.92
TransGAN [24]      -        9.60∗        8.94
VQGAN [12]         11.40    10.70        -
HiT-B [67]         2.95∗    3.39∗        -
StyleSwin          2.81∗    3.25∗        2.95

Table 3. Comparison of state-of-the-art unconditional image generation methods on FFHQ, CelebA-HQ and LSUN Church at 256 × 256 resolution in terms of FID score (lower is better). The superscript (∗) indicates that bCR is applied during training.

Methods              FFHQ-1024    CelebA-HQ 1024
StyleGAN¹ [29, 30]   4.41         5.06
COCO-GAN             -            9.49
PG-GAN [27]          -            7.30
MSG-GAN [26]         5.80         6.37
INR-GAN [51]         16.32        -
CIPS [1]             10.07        -
HiT-B [67]           6.37         8.83
StyleSwin            5.07         4.43

Table 4. Comparison of state-of-the-art unconditional image generation methods on FFHQ and CelebA-HQ at 1024 × 1024 resolution in terms of FID score (lower is better). ¹We report the FID score of StyleGAN2 on FFHQ-1024 and that of StyleGAN on CelebA-HQ 1024. For fair comparison, we report results of StyleGAN2 without style-mixing and path regularization.
4.2. Implementation details

During training we use the Adam solver [33] with β1 = 0.0, β2 = 0.99. Following TTUR [18], we set imbalanced learning rates, 5e−5 and 2e−4, for the generator and the discriminator respectively. We train our model using the standard non-saturating GAN loss with R1 gradient penalty [30] and stabilize the adversarial training by applying spectral normalization [47] on the discriminator. By default, we report all the results with the wavelet discriminator shown in Figure 6. Using 8 32GB V100 GPUs, we are able to fit 32 images in one training batch at 256 × 256 resolution, and the batch size reduces to 16 at 1024 × 1024 resolution. For fair comparison with prior works, we report the FID with balanced consistency regularization (bCR) [68] on the FFHQ-256 and CelebA-HQ 256 datasets with the loss weight λ_real = λ_fake = 10. Similar to [67], we do not observe a performance gain using bCR at higher resolutions. Note that we do not adopt complex training strategies, such as path length and mixing regularizations [29], as we wish to conduct studies on neat network architectures.
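The optimization recipe above can be summarized in the following sketch with placeholder generator and discriminator modules: Adam with β1 = 0, β2 = 0.99, TTUR learning rates of 5e−5 / 2e−4, the non-saturating loss in softplus form, spectral normalization on the discriminator, and an R1 penalty whose weight here is a placeholder value rather than one taken from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn, optim

# Placeholder networks standing in for the StyleSwin generator / discriminator.
G = nn.Sequential(nn.Linear(512, 3 * 16 * 16))
D = nn.Sequential(nn.utils.spectral_norm(nn.Linear(3 * 16 * 16, 1)))

# TTUR: imbalanced learning rates for G and D, Adam with beta1=0, beta2=0.99.
opt_g = optim.Adam(G.parameters(), lr=5e-5, betas=(0.0, 0.99))
opt_d = optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.99))

def d_step(real, z, r1_weight=10.0):        # r1_weight is a placeholder value
    """Discriminator step: non-saturating GAN loss plus the R1 gradient penalty."""
    real = real.requires_grad_(True)
    logits_real, logits_fake = D(real), D(G(z).detach())
    loss = F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()
    grad, = torch.autograd.grad(logits_real.sum(), real, create_graph=True)
    loss = loss + 0.5 * r1_weight * grad.pow(2).sum(dim=1).mean()
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()

def g_step(z):
    """Generator step: non-saturating loss -log D(G(z)) in softplus form."""
    loss = F.softplus(-D(G(z))).mean()
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()

real = torch.randn(8, 3 * 16 * 16)          # stand-in batch of real images
z = torch.randn(8, 512)
print(d_step(real, z), g_step(z))
```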
4.3. Main results

Quantitative results. We compare with state-of-the-art Conv-based GANs as well as the recent transformer-based methods. As shown in Table 3, our StyleSwin achieves state-of-the-art FID scores on all the 256 × 256 benchmarks. In particular, on both the FFHQ and LSUN Church datasets, StyleSwin outperforms StyleGAN2 [30]. Besides the impressive results at 256 × 256 resolution, the proposed StyleSwin shows a strong capability on high-resolution image generation. As shown in Table 4, we evaluate models on FFHQ and CelebA-HQ at 1024 × 1024 resolution, where the proposed StyleSwin also demonstrates state-of-the-art performance. Notably, we obtain a record FID score of 4.43 on the CelebA-HQ 1024 dataset while considerably closing the gap with the leading approach StyleGAN2, without involving complex training strategies or additional regularization. Also, StyleSwin outperforms the transformer-based approach HiT by a large margin at 1024 × 1024 resolution, proving that self-attention on high-resolution stages is beneficial to high-fidelity detail synthesis.

Qualitative results. Figure 7 shows image samples generated by StyleSwin on FFHQ and CelebA-HQ at 1024 × 1024 resolution. Our StyleSwin shows compelling quality in synthesizing diverse images of different ages, backgrounds and viewpoints at the resolution of 1024 × 1024. On top of face modeling, we show generation results on LSUN Church in Figure 8, showing that StyleSwin is capable of modeling complex scene structures. Both the coherency of the global geometry and the high-fidelity details prove the advantages of using transformers across all the resolutions.
Figure 7. Image samples generated by our StyleSwin on (a) FFHQ 1024 × 1024 and (b) CelebA-HQ 1024 × 1024.

Figure 8. Image samples generated by our StyleSwin on LSUN Church 256 × 256.

Ablation study. To validate the effectiveness of the proposed components, we conduct ablation studies in Table 5. Compared with the baseline architecture, we observe a significant FID improvement thanks to the enhanced model capacity brought by the style injection. The double attention makes each layer leverage a larger context at one time and further reduces the FID score. The wavelet discriminator brings a large FID improvement because it effectively suppresses the blocking artifacts and meanwhile brings stronger supervision for high frequencies. In our experiments, we observe faster adversarial training when adopting the wavelet discriminator. Further, introducing sinusoidal positional encoding (SPE) on each generation scale effectively reduces the FID. Employing a larger model brings a slight improvement, and it seems that the model capacity of the current transformer structure is not the bottleneck. From Table 5 we see that bCR considerably improves the FID by 2.69, which coincides with the recent findings [24, 37, 67] that data augmentation is still vital in transformer-based GANs, since transformers are data-hungry and prone to overfitting. However, we do not observe its effectiveness at higher resolutions, e.g., 1024 × 1024, and we leave regularization schemes for high-resolution synthesis to future work.

Model Configuration          FID ↓
A. Swin baseline             15.03
B. + Style injection         8.40
C. + Double attention        7.86
D. + Wavelet discriminator   6.34
E. + SPE                     5.76
F. + Larger model            5.50
G. + bCR                     2.81

Table 5. Ablation study conducted on FFHQ-256. Starting from the baseline architecture, we prove the effectiveness of each proposed component.

Parameters and Throughput. In Table 6, we compare the number of model parameters and FLOPs with StyleGAN2 for 1024 × 1024 synthesis. Although our approach has a larger model size, it achieves lower FLOPs than StyleGAN2, which means the method achieves competitive generation quality with less theoretical computational cost.

Methods          #params    FLOPs
StyleGAN2 [30]   30.37M     74.27B
StyleSwin        40.86M     50.90B

Table 6. Comparison of the network parameters and FLOPs with StyleGAN2.

5. Conclusion

We propose StyleSwin, a transformer-based GAN for high-resolution image generation. The use of local attention is efficient to compute while retaining most of the modeling capability, since the receptive field is largely compensated by double attention. Besides, we find that one key feature is missing in transformer-based GANs: the generator is not aware of the positions of the patches under synthesis, so we introduce SPE for global positioning. Thanks to the increased expressivity, the proposed StyleSwin consistently outperforms the leading Conv-based approaches on 256 × 256 datasets. To solve the blocking artifacts in high-resolution synthesis, we propose to penalize the spectral discrepancy with a wavelet discriminator [13]. Ultimately, the proposed StyleSwin offers compelling quality at the resolution of 1024 × 1024, which, for the first time, approaches the best-performing ConvNets. We hope our work incentivizes more studies on utilizing the capable transformers in generative modeling.
References

[1] Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis, 2020.
[2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, abs/1701.07875, 2017.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. ArXiv, abs/1809.11096, 2019.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
[7] Jooyoung Choi, Jungbeom Lee, Yonghyun Jeong, and Sungroh Yoon. Toward spatially unbiased generative models. arXiv preprint arXiv:2108.01285, 2021.
[8] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. ArXiv, abs/1911.03584, 2020.
[9] Zihang Dai, Hanxiao Liu, Quoc V. Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.
[10] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin transformer: A general vision transformer backbone with cross-shaped windows, 2021.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2021.
[12] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
[13] Rinon Gal, Dana Cohen, Amit Bermano, and Daniel Cohen-Or. SWAGAN: A style-based wavelet-driven generative model, 2021.
[14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[15] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. GIQA: Generated image quality assessment. In European Conference on Computer Vision, pages 369–385. Springer, 2020.
[16] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
[17] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.
[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018.
[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
[20] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition, 2019.
[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[22] Md Amirul Islam, Sen Jia, and Neil D. B. Bruce. How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248, 2020.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
[24] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. TransGAN: Two transformers can make one strong GAN. arXiv preprint arXiv:2102.07074, 2021.
[25] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. ArXiv, abs/1807.00734, 2019.
[26] Animesh Karnewar and Oliver Wang. MSG-GAN: Multi-scale gradients for generative adversarial networks. arXiv preprint arXiv:1903.06048, 2019.
[27] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[28] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. arXiv preprint arXiv:2106.12423, 2021.
[29] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, 2019.
[30] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8107–8116, 2020.
[31] Osman Semih Kayhan and Jan C. van Gemert. On translation invariance in CNNs: Convolutional layers can exploit absolute spatial location. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14274–14285, 2020.
[32] Salman Hameed Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ArXiv, abs/2101.01169, 2021.
[33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[34] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
[35] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[36] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study on regularization and normalization in GANs. In ICML, 2019.
[37] Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. ViTGAN: Training GANs with vision transformers. ArXiv, abs/2107.04589, 2021.
[38] Shanda Li, Xiangning Chen, Di He, and Cho-Jui Hsieh. Can vision transformers perform convolution? ArXiv, abs/2111.01353, 2021.
[39] Jae Hyun Lim and J. C. Ye. Geometric GAN. ArXiv, abs/1705.02894, 2017.
[40] Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. COCO-GAN: Generation by parts via conditional coordinating. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4512–4521, 2019.
[41] Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, and Arun Mallya. Generative adversarial networks for image and video synthesis: Algorithms and applications. arXiv preprint arXiv:2008.02793, 2020.
[42] Shizhong Liu and Alan C. Bovik. Efficient DCT-domain blind measurement and reduction of blocking artifacts. IEEE Transactions on Circuits and Systems for Video Technology, 12(12):1139–1149, 2002.
[43] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
[44] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[45] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821, 2017.
[46] Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
[47] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ArXiv, abs/1802.05957, 2018.
[48] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
[49] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[50] Edgar Schonfeld, Bernt Schiele, and Anna Khoreva. A U-Net based discriminator for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8207–8216, 2020.
[51] Ivan Skorokhodov, Savva Ignatyev, and Mohamed Elhoseiny. Adversarial generation of continuous images, 2021.
[52] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
[53] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[54] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[55] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[56] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12894–12904, 2021.
[57] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
[58] Benyou Wang, Lifeng Shang, Christina Lioma, Xin Jiang, Hao Yang, Qun Liu, and Jakob Grue Simonsen. On position embeddings in BERT. In International Conference on Learning Representations, 2020.
[59] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[60] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.
[61] Rui Xu, Xintao Wang, Kai Chen, Bolei Zhou, and Chen Change Loy. Positional encoding as spatial inductive bias in GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13569–13578, 2021.
[62] Rui Xu, Xiangyu Xu, Kai Chen, Bolei Zhou, and Chen Change Loy. STransGAN: An empirical study on transformer in GANs. ArXiv, abs/2110.13107, 2021.
[63] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[64] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.
[65] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.
[66] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
[67] Long Zhao, Zizhao Zhang, Ting Chen, Dimitris N. Metaxas, and Hang Zhang. Improved transformer for high-resolution GANs. ArXiv, abs/2106.07631, 2021.
[68] Zhengli Zhao, Sameer Singh, Honglak Lee, Zizhao Zhang, Augustus Odena, and Han Zhang. Improved consistency regularization for GANs, 2020.
