MangaGAN: Unpaired Photo-to-Manga Translation Based on the Methodology of Manga Drawing
Figure 1: Left: the combination of manga faces we generated and body components we collected from the popular manga work Bleach [27], which shows a unified style with a strong attraction. Right: the input frontal face photos and our results, where our method effectively endows the outputs with both facial similarity and the target manga style.
ABSTRACT

Manga is a world-popular comic form that originated in Japan, which typically employs black-and-white stroke lines and geometric exaggeration to describe humans' appearances, poses, and actions. In this paper, we propose MangaGAN, the first method based on Generative Adversarial Networks (GANs) for unpaired photo-to-manga translation. Inspired by how experienced manga artists draw manga, MangaGAN generates the geometric features of a manga face with a designed GAN model and delicately translates each facial region into the manga domain with a tailored multi-GANs architecture. For training MangaGAN, we construct a new dataset collected from a popular manga work, containing manga facial features, landmarks, bodies, and so on. Moreover, to produce high-quality manga faces, we further propose a structural smoothing loss that smooths stroke lines and avoids noisy pixels, and a similarity preserving module that improves the similarity between the photo and manga domains. Extensive experiments show that MangaGAN can produce high-quality manga faces that preserve both facial similarity and a popular manga style, and it outperforms other related state-of-the-art methods.

1 INTRODUCTION

Manga, which originated in Japan, is a worldwide popular comic form of drawing on serialized pages to present long stories. Typical manga is printed in black-and-white (as shown in Fig. 1, left) and employs abstract stroke lines and geometric exaggeration to describe humans' appearances, poses, and actions. Professional manga artists usually build up personalized drawing styles during their careers, and their styles are hard for other peers to imitate. Meanwhile, drawing manga is a time-consuming process, and even a professional manga artist requires several hours to finish one page of high-quality work.

As an efficient approach to assist with manga drawing, automatically translating a face photo to manga with an attractive style is much desired. This task can be described as image translation, a hot topic in the computer vision field. In recent years, deep-learning-based image translation has made significant progress and derived a series of systematic methods. Among the examples are the Neural Style Transfer (NST) methods (e.g., [4, 11, 21, 29]), which use tailored CNNs and objective functions to stylize images, and the Generative Adversarial Network (GAN) [12] based methods (e.g., [19, 36, 65]), which work well for mapping paired or unpaired images from the original domain to the stylized domain.

Although these excellent works have achieved good performance in their applications, they have difficulty generating high-quality manga due to the following four challenges. First, in the manga domain, humans' faces are abstract, colorless, geometrically exaggerated, and far from those in the photo domain; the facial correspondences between the two domains are hard for networks to match. Second, the style of manga is represented more by the structure of stroke lines, the face shape, and the facial features' sizes and locations. Meanwhile, for different facial features, manga artists always use different drawing styles and locate them with another personalized skill. These independent features (i.e., appearance, location, size, style) can hardly be extracted and summarized by a single network simultaneously. Third, a generated manga has to faithfully resemble the input photo to keep the identity of the user without compromising the abstract manga style; it is challenging to achieve both with high performance. Fourth, training data for manga are difficult to collect. Manga artists often use local storyboards to show stories, which makes it difficult to find clear and complete manga faces, owing to factors such as coverage by hair or shadow, segmentation by storyboards, low resolution, and so on. Therefore, related state-of-the-art methods of image stylization (e.g., [11, 19, 29, 35, 53, 60, 65]) are not able to produce the desired manga results (comparison results are shown in Figures 11 and 12).

To address these challenges, we present MangaGAN, the first GAN-based method for translating frontal face photos to the manga domain while preserving the attractive style of the popular manga work Bleach [27]. We observed that an experienced manga artist generally takes the following steps when drawing manga: first outlining the exaggerated face and locating the geometric distributions of facial features, and then fine-drawing each of them. MangaGAN follows this process and employs a multi-GANs architecture to translate different facial features, and it maps their geometric features with another designed GAN model. Moreover, to obtain high-quality results in an unsupervised manner, we present a Similarity Preserving (SP) module to improve the similarity between the photo and manga domains, and we leverage a structural smoothing loss to avoid artifacts.

To summarize, our main contributions are three-fold:
• We propose MangaGAN, the first GAN-based method for unpaired photo-to-manga translation. It can produce attractive manga faces while preserving both facial similarity and a popular manga style. MangaGAN uses a novel network architecture that simulates the drawing process of manga artists: it generates the exaggerated geometric features of faces with a designed GAN model and delicately translates each facial region with a tailored multi-GANs architecture.
• We propose a similarity preserving module that effectively improves performance on preserving both facial similarity and manga style. We also propose a structural smoothing loss that encourages results with smooth stroke lines and fewer messy pixels.
• We construct a new dataset called MangaGAN-BL (containing manga facial features, landmarks, bodies, etc.), collected from the world-popular manga work Bleach. Each sample has been manually processed by cropping, angle-correction, and repairing of disturbing elements (e.g., hair covering, shadows). MangaGAN-BL will be released for academic use.

2 RELATED WORK

Recent literature suggests two main directions with the ability to generate manga-like results: neural style transfer and GAN-based cross-domain translation.

2.1 Neural style transfer

The goal of neural style transfer (NST) is to transfer the style of an art image to another content target image. Inspired by the progress of CNNs, Gatys et al. [11] proposed the pioneering NST work by utilizing a CNN's power to extract abstract features and the style-capturing ability of Gram matrices [10]. Then, Li and Wand [29] used Markov Random Fields (MRFs) to encode styles and presented an MRF-based method (CNNMRF) for image stylization. Afterward, various follow-up works were presented to improve performance in terms of visual quality [13, 20, 35, 39, 48, 63], generation speed [4, 17, 21, 33, 50], and multimedia extensions [3, 5, 15, 54, 59].

Although these methods work well for translating images into some typical artistic styles, e.g., oil painting and watercolor, they are not good at producing black-and-white manga with exaggerated geometry and discrete stroke lines, since they tend to translate the texture and color features of a target style while preserving the structure of the content image.

2.2 GAN-based cross-domain translation

Many GAN-based cross-domain translation methods work well for image stylization, where the goal is to learn a mapping from a source domain to a stylized domain. A series of works based on GANs [12] have been presented and applied to image stylization. Pix2Pix [19] first presented a unified framework for image-to-image translation based on conditional GANs [40]. BicycleGAN [66] extends it to multi-modal translation. Methods including CycleGAN [65], DualGAN [61], DiscoGAN [23], UNIT [36], DTN [55], etc. are presented for unpaired one-to-one translation. MUNIT [18], StarGAN [7], etc. are presented for unpaired many-to-many translation.

The methods mentioned above succeed in translation tasks that are mainly characterized by color or texture changes only (e.g., summer to winter, and apples to oranges). For photo-to-manga translation, they fail to capture the correspondences between the two domains due to the abstract structure, colorless appearance, and geometric deformation of manga drawing.

Besides the above two main directions, there are also some works specially designed for creating artistic facial images. They employ techniques of non-photorealistic rendering (NPR), data-driven synthesis, computer graphics, etc., and have achieved much progress in many typical art forms, e.g., caricature and cartoon [2, 6, 16, 31, 47, 52, 58, 64], and portrait and sketching [1, 8, 34, 43, 45, 46, 49, 56, 57, 60, 62]. However, none of them involve the generation of manga faces.
Figure 2: Overall pipeline of MangaGAN. Inspired by the prior knowledge of manga drawing, MangaGAN consists of two branches: one branch learns the geometric mapping with a Geometric Transformation Network (GTN); the other branch learns the appearance mapping with an Appearance Transformation Network (ATN). At the end, a Synthesis Module is designed to fuse them and produce the final manga face.
Let P and M denote the face photo domain and the manga domain respectively, where no pairing exists between them. Given an input photo p ∈ P, our MangaGAN learns a mapping Ψ : P → M that transfers p to a sample m = Ψ(p), m ∈ M, while endowing m with both the manga style and facial similarity.

As shown in Figure 2(f), our method is inspired by the prior knowledge of how experienced manga artists draw manga: they first outline the exaggerated face and locate the geometric distributions of the facial features, and finally do the fine-drawing. Accordingly, MangaGAN consists of two branches: one branch learns a geometric mapping Ψ_geo with a Geometric Transformation Network (GTN) N_geo, which is adopted to translate the facial geometry from P to M [Figure 2(d)]; the other branch learns an appearance mapping Ψ_app with an Appearance Transformation Network (ATN) N_app [Figure 2(b)], which is used to produce the components of all facial features. At the end, a Synthesis Module is designed to fuse the facial geometry and all components, ending up with the output manga m ∈ M [Figure 2(e)]. We detail the ATN, the GTN, and the Synthesis Module in Section 3.2, Section 3.3, and Section 3.4 respectively.
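To make the two-branch design concrete, the following is a minimal structural sketch of how the pieces described above could be wired together at inference time; the module classes, method names, and the region/landmark extraction inputs are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MangaGANPipeline(nn.Module):
    """Sketch of the two-branch MangaGAN pipeline: ATN for appearance, GTN for geometry."""
    def __init__(self, atn_generators: nn.ModuleDict, gtn: nn.Module, synthesizer):
        super().__init__()
        # atn_generators: one forward generator per facial region,
        # e.g. {'eye': g_eye, 'nose': g_nose, 'mouth': g_mouth, 'hair': g_hair}.
        self.atn = atn_generators
        self.gtn = gtn                  # maps photo landmarks to manga landmarks
        self.synthesizer = synthesizer  # fuses translated regions onto the manga geometry

    def forward(self, regions, landmarks):
        # regions: dict of cropped facial-region tensors; landmarks: photo landmark tensor.
        manga_regions = {name: self.atn[name](crop) for name, crop in regions.items()}
        manga_landmarks = self.gtn(landmarks)
        return self.synthesizer(manga_regions, manga_landmarks)
```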
Figure 3: The ATN is a network with a multi-GANs architecture consisting of four local GANs, designed to translate each facial region respectively. Moreover, we tailor different training strategies and encoders to improve their performance.

3.2 Appearance transformation network

As shown in Figure 3, the ATN N_app is a network with a multi-GANs architecture that includes a set of four local GANs, N_app = {N^eye, N^nose, N^mouth, N^hair}, where N^eye, N^nose, N^mouth, and N^hair are respectively trained to translate the facial regions of the eye, nose, mouth, and hair from the input p ∈ P to the output m ∈ M.

3.2.1 Translating regions of eyes and mouths. Eyes and mouths are the critical components of manga faces but are the hardest parts to translate, since they are the most noticed, are error sensitive, and vary with different facial expressions. For N^eye and N^mouth, to better map the unpaired data, we couple each forward mapping with a reverse mapping, inspired by the network architecture of CycleGAN [65]. Accordingly, the baseline architecture of N^δ (δ ∈ {eye, mouth}) includes the forward/backward generators G_M^δ / G_P^δ and the corresponding discriminators D_P^δ / D_M^δ. G_M^δ learns the mapping Ψ_app^δ : p^δ → m̂^δ, and G_P^δ learns the reverse mapping Ψ_app^δ′ : m^δ → p̂^δ, where m̂^δ and p̂^δ are the generated fake samples; the discriminators D_P^δ / D_M^δ learn to distinguish the real samples p^δ / m^δ from the fake samples p̂^δ / m̂^δ. Our generators G_P^δ and G_M^δ use a ResNet with 6 blocks [14], and D_P^δ and D_M^δ use the Markovian discriminator of 70 × 70 PatchGANs [19, 28, 30].
We adopt the stable least-squares loss [38] instead of the negative log-likelihood objective [12] as our adversarial loss L_adv, defined as
$$\mathcal{L}^{\delta}_{adv}(G^{\delta}_{M}, D^{\delta}_{M}) = \mathbb{E}_{m^{\delta}\sim M^{\delta}}\big[(D^{\delta}_{M}(m^{\delta})-1)^{2}\big] + \mathbb{E}_{p^{\delta}\sim P^{\delta}}\big[D^{\delta}_{M}(G^{\delta}_{M}(p^{\delta}))^{2}\big], \quad (1)$$
while L_adv^δ(G_P^δ, D_P^δ) is defined in a similar manner.

L_cyc is the cycle-consistency loss [65], which is used to constrain the mapping solution between the input and output domains, defined as
$$\mathcal{L}^{\delta}_{cyc}(G^{\delta}_{P}, G^{\delta}_{M}) = \mathbb{E}_{p^{\delta}\sim P^{\delta}}\big[\|G^{\delta}_{P}(G^{\delta}_{M}(p^{\delta}))-p^{\delta}\|_{1}\big] + \mathbb{E}_{m^{\delta}\sim M^{\delta}}\big[\|G^{\delta}_{M}(G^{\delta}_{P}(m^{\delta}))-m^{\delta}\|_{1}\big]. \quad (2)$$
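To make the adversarial and cycle-consistency terms concrete, the following PyTorch-style sketch shows one way Eqs. (1) and (2) could be computed for a single region δ; the generator/discriminator modules and the `real_photo`/`real_manga` batch names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def lsgan_adversarial_loss(D_M, G_M, real_manga, real_photo):
    # Eq. (1): least-squares GAN loss for the forward mapping P -> M.
    # D_M should score real manga patches as 1 and generated ones as 0.
    # In training, D_M minimizes this term, while G_M is updated to push D_M(fake) toward 1.
    fake_manga = G_M(real_photo)
    loss_real = torch.mean((D_M(real_manga) - 1.0) ** 2)
    loss_fake = torch.mean(D_M(fake_manga) ** 2)
    return loss_real + loss_fake

def cycle_consistency_loss(G_P, G_M, real_photo, real_manga):
    # Eq. (2): L1 reconstruction after a round trip through both generators.
    photo_cycled = G_P(G_M(real_photo))   # P -> M -> P
    manga_cycled = G_M(G_P(real_manga))   # M -> P -> M
    return F.l1_loss(photo_cycled, real_photo) + F.l1_loss(manga_cycled, real_manga)
```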
However, we find that the baseline architectures of N^eye and N^mouth with L_adv and L_cyc still fail to preserve the similarity between the two domains. Specifically, for the eye and mouth regions, the baseline always produces messy results, since the networks are almost unable to match colored photos to the discrete black lines of manga. Therefore, we make the following three improvements to optimize their performance.

First, we design a Similarity Preserving (SP) module with an SP loss L_SP to enhance the similarity. Second, we train an encoder E_eye that extracts the main backbone of p^eye into a binary result, used as the input of N^eye, and an encoder E_mouth that encodes p^mouth into binary edge-lines, used to guide the shape of the manga mouth (training details of the two encoders are described in Section 7.5.1). Third, a structural smoothing loss L_SS is designed to encourage the networks to produce manga with smooth stroke lines, defined as
$$\mathcal{L}_{SS}(G^{\delta}_{P}, G^{\delta}_{M}) = \frac{1}{\sqrt{2\pi}\,\sigma}\Big[\sum_{j\in\{1,2,\dots,N\}}\exp\Big(\!-\frac{(G^{\delta}_{P}(m^{\delta})_{j}-\mu)^{2}}{2\sigma^{2}}\Big) + \sum_{k\in\{1,2,\dots,N\}}\exp\Big(\!-\frac{(G^{\delta}_{M}(p^{\delta})_{k}-\mu)^{2}}{2\sigma^{2}}\Big)\Big], \quad (3)$$
where L_SS is based on a Gaussian model with μ = 255/2, and G_P^δ(m^δ)_j or G_M^δ(p^δ)_k is the j-th or k-th pixel of G_P^δ(m^δ) or G_M^δ(p^δ). The underlying idea is that producing unnecessary gray areas distracts from and messes up the manga results, since manga mainly consists of black and white stroke lines. Thus, we give a pixel a smaller loss when its gray value is closer to black (0) or white (255), which smooths the gradient edges of black stroke lines and produces clean results.
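A minimal PyTorch sketch of the structural smoothing term in Eq. (3) is given below; it penalizes mid-gray pixels with a Gaussian bump centered at μ = 255/2, pushing outputs toward pure black or white strokes. The value σ = 40 and the assumption that generator outputs lie in [0, 255] are illustrative choices, not values given in the paper.

```python
import math
import torch

def structural_smoothing_loss(fake_photo, fake_manga, sigma=40.0):
    # Eq. (3): Gaussian penalty centered at mid-gray (mu = 255/2).
    # Pixels near 0 (black) or 255 (white) incur almost no loss, so the
    # generators are encouraged to produce clean, binary-looking strokes.
    mu = 255.0 / 2.0
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)

    def gaussian_penalty(img):
        return torch.exp(-((img - mu) ** 2) / (2.0 * sigma ** 2)).sum()

    return coeff * (gaussian_penalty(fake_photo) + gaussian_penalty(fake_manga))
```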
Similarity Preserving Module. The main idea of the SP module is that keeping two images similar at a lower resolution gives them similar spatial distributions but different pixel details when they are up-sampled to a higher resolution. As shown in Figure 4(a), we append two SP modules, one on the forward mapping and one on the backward mapping of N^δ. The SP module leverages a pre-trained network φ that we designed to extract feature maps in different latent spaces and resolutions. The architecture of φ is shown in Figure 4(b); it uses only a few convolutional layers, since we consider the correspondences of the encoded features to be relatively clear. For the forward mapping Ψ_app^δ : m̂^δ = G_M^δ(p^δ), we input p^δ and G_M^δ(p^δ) to the SP module and optimize G_M^δ by minimizing the loss function L_SP(G_M^δ, p^δ) defined as
$$\mathcal{L}_{SP}(G^{\delta}_{M}, p^{\delta}) = \sum_{i\in\phi}\lambda_{i}\,\mathcal{L}^{\phi,i}_{feat}\!\left[f^{\phi}_{i}(p^{\delta}),\, f^{\phi}_{i}(G^{\delta}_{M}(p^{\delta}))\right] + \lambda_{I}\,\mathcal{L}^{I}_{pixel}\!\left[p^{\delta},\, G^{\delta}_{M}(p^{\delta})\right], \quad (4)$$
where λ_i and λ_I control the relative importance of each objective, and L_pixel^I and L_feat^{φ,i} are used to keep the pixel-wise and feature-wise similarities respectively. L_feat^{φ,i} and L_pixel^I are defined as
$$\mathcal{L}^{\phi,i}_{feat}\!\left[f^{\phi}_{i}(p^{\delta}), f^{\phi}_{i}(G^{\delta}_{M}(p^{\delta}))\right] = \left\|f^{\phi}_{i}(p^{\delta}) - f^{\phi}_{i}(G^{\delta}_{M}(p^{\delta}))\right\|_{2}^{2}, \qquad \mathcal{L}^{I}_{pixel}\!\left[p^{\delta}, G^{\delta}_{M}(p^{\delta})\right] = \left\|p^{\delta} - G^{\delta}_{M}(p^{\delta})\right\|_{2}^{2}, \quad (5)$$
where f_i^φ(x) is the feature map extracted from the i-th layer of the network φ when x is the input. Note that we only extract feature maps after pooling layers.

Figure 4: (a) We append two SP modules on both the forward and backward mappings. (b) The SP module extracts feature maps at different resolutions and measures the similarities between two inputs in different latent spaces.

Combining Eqs. (1)-(5), the full objective for learning the appearance mappings of N^δ (δ ∈ {eye, mouth}) is
$$\mathcal{L}^{\delta}_{app} = \mathcal{L}_{adv}(G^{\delta}_{M}, D^{\delta}_{M}) + \mathcal{L}_{adv}(G^{\delta}_{P}, D^{\delta}_{P}) + \alpha_{1}\mathcal{L}_{cyc}(G^{\delta}_{P}, G^{\delta}_{M}) + \alpha_{2}\mathcal{L}_{SP}(G^{\delta}_{M}, p^{\delta}) + \alpha_{3}\mathcal{L}_{SP}(G^{\delta}_{P}, m^{\delta}) + \alpha_{4}\mathcal{L}_{SS}(G^{\delta}_{M}, G^{\delta}_{P}), \quad (6)$$
where α_1 to α_4 are used to balance the multiple objectives.

3.2.2 Translating regions of nose and hair. Noses are insignificant in manga faces, since almost all characters have a similar nose in the target manga style. Therefore, N^nose adopts a generating method instead of a translating one, following the architecture of progressive growing GANs [22], which can produce a large number of high-quality results similar to the training data. As shown in Figure 3(d), we first train a variational autoencoder [26] to encode the nose region of the input photo into a feature vector, then use the vector as a seed to generate a default manga nose; we also allow users to change it according to their preferences.

N^hair employs a pre-trained generator of APDrawingGAN [60] that can produce binary portrait hair with a style similar to manga.
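The following sketch illustrates one way the SP loss of Eqs. (4)-(5) could be implemented with a small pooling-based feature extractor. The 5×5 kernels and the stage widths (64, 64, 128, 256, 512) follow the layer labels sketched in Figure 4(b), while the pooling type, the single-channel input, and the λ weights are assumptions for illustration, not the exact configuration of the paper's pre-trained φ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPFeatureExtractor(nn.Module):
    """Small convolutional network phi; feature maps are taken after each pooling layer."""
    def __init__(self, in_channels=1, widths=(64, 64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=5, stride=1, padding=2),
                nn.ReLU(inplace=True),
                nn.AvgPool2d(kernel_size=2, stride=2),  # halves the resolution at each stage
            ))
            prev = w

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # multi-resolution feature maps f_i(x)
        return feats

def sp_loss(phi, photo, fake_manga, lambdas=(1.0, 1.0, 1.0, 1.0, 1.0), lambda_pixel=1.0):
    # Eq. (4): weighted feature-wise squared-L2 terms over the pooled scales
    # plus a pixel-wise squared-L2 term (Eq. 5).
    feats_p = phi(photo)
    feats_m = phi(fake_manga)
    loss = lambda_pixel * F.mse_loss(photo, fake_manga, reduction='sum')
    for lam, f_p, f_m in zip(lambdas, feats_p, feats_m):
        loss = loss + lam * F.mse_loss(f_p, f_m, reduction='sum')
    return loss
```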
Figure 5: The pipeline of GTN. (a) To improve the variety of facial collocation mode, GTN divides geometric information …
When translating facial landmarks, one issue is that the collocation mode of facial features constrains the variety of results. For example, people with the same face shape may have different sizes or locations of eyes, nose, or mouth. However, a GAN may generate them in a fixed or similar collocation mode when it is trained on the landmarks of global faces. Accordingly, as shown in Figure 5, we divide the geometric features into three attributes (the face shape, and the facial features' locations and sizes) and employ three sub-GANs N_sha, N_loc, and N_siz.

The loss of N_loc, and the losses of N_siz and N_sha, are represented in a similar manner. The objective function L_geo used to optimize the GTN is
$$\begin{aligned}
\mathcal{L}_{geo} ={}& \mathcal{L}^{L_P}_{adv_{loc}} + \mathcal{L}^{L_M}_{adv_{loc}} + \beta_{1}\mathcal{L}_{cyc_{loc}} + \beta_{2}\big(\mathcal{L}^{L_P}_{cha_{loc}} + \mathcal{L}^{L_M}_{cha_{loc}}\big)\\
&+ \mathcal{L}^{L_P}_{adv_{siz}} + \mathcal{L}^{L_M}_{adv_{siz}} + \beta_{3}\mathcal{L}_{cyc_{siz}} + \beta_{4}\big(\mathcal{L}^{L_P}_{cha_{siz}} + \mathcal{L}^{L_M}_{cha_{siz}}\big)\\
&+ \mathcal{L}^{L_P}_{adv_{sha}} + \mathcal{L}^{L_M}_{adv_{sha}} + \beta_{5}\mathcal{L}_{cyc_{sha}} + \beta_{6}\big(\mathcal{L}^{L_P}_{cha_{sha}} + \mathcal{L}^{L_M}_{cha_{sha}}\big),
\end{aligned} \quad (8)$$
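Below is a sketch of how the GTN objective in Eq. (8) could be assembled from the three landmark sub-GANs; the per-sub-GAN loss values and the β weights are placeholders, since the definitions of the adversarial, cycle, and characteristic losses for the landmark domains are not included in this excerpt.

```python
# Hypothetical assembly of the GTN objective (Eq. 8) from its three sub-GANs.
# Each sub-GAN (location, size, shape) is assumed to expose its adversarial,
# cycle-consistency, and characteristic loss terms for both landmark domains.

def gtn_objective(sub_gans, betas):
    """sub_gans: dict with keys 'loc', 'siz', 'sha'; each value provides
    'adv_LP', 'adv_LM', 'cyc', 'cha_LP', 'cha_LM' as scalar tensors.
    betas: dict mapping 'loc'/'siz'/'sha' to (beta_cyc, beta_cha)."""
    total = 0.0
    for name in ('loc', 'siz', 'sha'):
        g = sub_gans[name]
        beta_cyc, beta_cha = betas[name]
        total = total + g['adv_LP'] + g['adv_LM'] \
                      + beta_cyc * g['cyc'] \
                      + beta_cha * (g['cha_LP'] + g['cha_LM'])
    return total
```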
Figure 8: Comparison results for eye and mouth regions under different improvements. Without our improvements, the network produces poor manga results with messy regions and artifacts, and it even fails to capture the correspondences between inputs and outputs.
Unlike them, our method can effectively make the output similar in appearance to the target manga (e.g., exaggerated eyelids, smooth eyebrows, simplified mouths), as shown in Figure 12(h)(i).

Figure 10: (a) Samples of eye regions. (b) Samples of mouth regions (red lines indicate landmarks). Our method can effectively preserve the shape of eyebrows (red arrows), eyes, and mouths, and it further abstracts them into the manga style.

5 DISCUSSION

The performance on preserving manga style. Most state-of-the-art methods are prone to translating the color or texture of the artistic image and ignore the translation of geometric abstraction. As shown in Figures 11 and 12, the stylized faces they generate are similar to the input photos with only color or texture changes, which makes them look more like realistic sketches or portraits than abstract manga. Unlike them, we extend the translation to the structure of stroke lines and the geometric abstraction of facial features (e.g., simplified eyes and mouths, beautified facial proportions), which makes our results more like works drawn by the manga artist.

The performance on preserving user identity. We generate the manga face guided by the input photo; however, manga characters are typically fictitious, simplified, idealized, and much unlike real people. Specifically, manga faces are usually designed to have optimum proportions, and the facial features are simplified to several black lines [Figure 12(i)]. Therefore, excessive similarity between the output and the input would make the output unlike a manga. To generate typical and clean manga faces, we even remove the detailed textures and beautify the proportions of facial features, which compromises the performance on preserving the user identity. Accordingly, it is reasonable that there are some dissimilarities between the output manga face and the input facial photo.

More evaluations. To subjectively evaluate the performance of our method on preserving manga style, preserving user identity, and visual attractiveness, we conduct a series of user studies in Section 2 of the supplementary materials. Moreover, we also show more experimental results and generated manga faces in Section 5 of our supplementary materials.
Figure 11: Comparison results with NST methods, including Gatys [11], Fast NST [21], SCNST [20], Deep Image Analogy [35], CNNMRF [29], and Headshot Portrait [53]. For a fair comparison, we employ three different manga faces (one of which is our result) as the style targets to stylize each input photo respectively.
Figure 12: Comparison results with cross-domain translation methods. (a) Input photo. (b) ROI of the input photo. (c)-(h) Results of CycleGAN [65], UNIT [36], Im2Pencil [32], APDrawingGAN [60], and our method, respectively. (i) Some typical face samples in the target manga work [27]. We observe that our method can effectively preserve the manga style of (i), e.g., exaggerated eyelids, smooth eyebrows, and simplified mouths. More generated samples are shown in Figures 8 and 9 of our Supplemental Material.
N^nose employs a generating method instead of a translating one, which follows the architecture of progressive growing GANs [22]. The network architecture of N^nose is illustrated in Table 2. MangaGAN-BL can be downloaded via the Google Drive link: https://drive.google.com/drive/folders/1viLG8fbT4lVXAwrYBOxVLoJrS2ZTUC3o?usp=sharing.

7.7 Generated Samples

In Table 3, we show some generated samples of paired eye regions. For one person with different facial expressions, our method successfully preserves the similarities of the manga eyes, and the appearances of the manga eyes adaptively change with the facial expressions as well; for different people, our method can effectively preserve the shape of eyebrows and eyes, and it further abstracts them into the manga style.

Some generated samples of manga noses are shown in Figure 20. Moreover, in Figure 18 and Figure 19, we show some high-resolution manga faces generated for males and females.

7.8 Dataset

Our dataset MangaGAN-BL is collected from the world-popular manga work Bleach [27]. It contains manga facial features comprising 448 eyes, 109 noses, 179 mouths, and 106 frontal views of manga faces whose landmarks have been marked manually. Moreover, each sample of MangaGAN-BL is normalized to 256×256 and optimized by cropping, angle-correction, and repairing of disturbing elements (e.g., covering by hair, glasses, or shadows).
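As a rough illustration of the normalization described above, the snippet below crops a face region, corrects in-plane rotation from two eye landmarks, and resizes the result to 256×256; the landmark source, the crop box, and the grayscale conversion are assumptions for illustration, not the paper's released preprocessing code.

```python
import math
from PIL import Image

def normalize_face(image_path, left_eye, right_eye, box, out_size=256):
    """Angle-correct and crop a face sample to out_size x out_size.

    left_eye, right_eye: (x, y) landmark coordinates in the original image.
    box: (left, top, right, bottom) crop rectangle around the face.
    Both are assumed to come from an external landmark annotation or detector.
    """
    img = Image.open(image_path).convert('L')  # manga samples are grayscale

    # Rotate so the eye line becomes horizontal (angle correction).
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    img = img.rotate(angle, center=center, resample=Image.BILINEAR)

    # Crop the face box and resize to the training resolution.
    face = img.crop(box).resize((out_size, out_size), Image.BILINEAR)
    return face
```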
7.9 Failure Cases

Although our method can generate attractive manga faces in many cases, the network still produces some typical failure cases. As shown in Figure 21, when the input eyes are close to the hair, part of the hair area will be selected into the input image, which results in some artifacts in the generated manga. These failure cases are caused by the incomplete content of our dataset. For example, our data for training manga eyes only include clean eye regions, so the model cannot adapt to some serious interference elements (e.g., hair, glasses).
Figure 16: Ablation experiment of our improvements on eye regions. From left to right: input face photos, results of the encoder E_eye, our results, results after removing the structural smoothing loss L_SS, results after removing the SP module, and results after removing E_eye.
Table 3: Some samples of eye regions in input photos and generated mangas.
Figure 17: Upper: comparison results with NST methods, including Gatys [11], Fast NST [21], Deep Image Analogy [35], and CNNMRF [29]. Bottom: comparison results with GAN-based one-to-one translation methods, including CycleGAN [65] and UNIT [36].
Figure 18: Samples of input photos and generated manga faces.

Figure 19: Samples of input photos and generated manga faces.

Figure 20: Samples of generated manga noses.

Figure 21: Typical failure cases of our method. When the input eyes are close to the hair, part of the hair area may be selected into the input image, which results in some artifacts in the generated manga.

REFERENCES
[1] Itamar Berger, Ariel Shamir, Moshe Mahler, Elizabeth Carter, and Jessica Hodgins. Style and abstraction in portrait sketching. ACM Transactions on Graphics (TOG), 32(4):55, 2013.
[2] Kaidi Cao, Jing Liao, and Lu Yuan. CariGANs: unpaired photo-to-caricature translation. In SIGGRAPH Asia 2018 Technical Papers, page 244. ACM, 2018.
[3] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the International Conference on Computer Vision, 2017.
[4] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. StyleBank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1906, 2017.
[5] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6654–6663, 2018.
[6] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. CartoonGAN: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9465–9474, 2018.
[7] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[8] Jakub Fišer, Ondřej Jamriška, Michal Lukáč, Eli Shechtman, Paul Asente, Jingwan Lu, and Daniel Sýkora. StyLit: illumination-guided example-based stylization of 3D renderings. ACM Transactions on Graphics (TOG), 35(4):92, 2016.
[9] Frederick N Fritsch and Ralph E Carlson. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238–246, 1980.
[10] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[11] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[13] Shuyang Gu, Congliang Chen, Jing Liao, and Lu Yuan. Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8222–8231, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. Real-time neural style transfer for videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 783–791, 2017.
[16] Junhong Huang, Mingkui Tan, Yuguang Yan, Chunmei Qing, Qingyao Wu, and Zhuliang Yu. Cartoon-to-photo facial translation with generative adversarial networks. In Asian Conference on Machine Learning, pages 566–581, 2018.
[17] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[18] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976. IEEE, 2017.
[20] Yongcheng Jing, Yang Liu, Yezhou Yang, Zunlei Feng, Yizhou Yu, Dacheng Tao, and Mingli Song. Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV), pages 238–254, 2018.
[21] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, pages 694–711, 2016.
[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[23] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1857–1865. JMLR.org, 2017.
[24] Davis E King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] Tite Kubo. Bleach. Weekly Jump, 2001–2016.
[28] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[29] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[30] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[31] Wenbin Li, Wei Xiong, Haofu Liao, Jing Huo, Yang Gao, and Jiebo Luo. CariGAN: Caricature generation through weakly paired adversarial learning. arXiv preprint arXiv:1811.00445, 2018.
[32] Yijun Li, Chen Fang, Aaron Hertzmann, Eli Shechtman, and Ming-Hsuan Yang. Im2Pencil: Controllable pencil illustration from photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1534, 2019.
[33] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.
[34] Dongxue Liang, Kyoungju Park, and Przemyslaw Krompiec. Facial feature model for a portrait video stylization. Symmetry, 10(10):442, 2018.
[35] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. ACM Transactions on Graphics (TOG), 36(4):120, 2017.
[36] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[37] Debbie S Ma, Joshua Correll, and Bernd Wittenbrink. The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4):1122–1135, 2015.
[38] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
[39] Yifang Men, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. A common framework for interactive texture transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6353–6362, 2018.
[40] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[41] Umar Riaz Muhammad, Michele Svanera, Riccardo Leonardi, and Sergio Benini. Hair detection, segmentation, and hairstyle classification in the wild. Image and Vision Computing, 2018.
[42] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[43] Chunlei Peng, Xinbo Gao, Nannan Wang, Dacheng Tao, Xuelong Li, and Jie Li. Multiple representations-based face sketch–photo synthesis. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2201–2215, 2015.
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[45] Paul L Rosin and Yu-Kun Lai. Non-photorealistic rendering of portraits. In Proceedings of the Workshop on Computational Aesthetics, pages 159–170. Eurographics Association, 2015.
[46] Paul L Rosin, David Mould, Itamar Berger, John P Collomosse, Yu-Kun Lai, Chuan Li, Hua Li, Ariel Shamir, Michael Wand, Tinghuai Wang, et al. Benchmarking non-photorealistic rendering of portraits. In NPAR, pages 11–1, 2017.
[47] Takafumi Saito and Tokiichiro Takahashi. Comprehensible rendering of 3-D shapes. In ACM SIGGRAPH Computer Graphics, volume 24, pages 197–206. ACM, 1990.
[48] Ahmed Selim, Mohamed Elgharib, and Linda Doyle. Painting style transfer for head portraits using convolutional neural networks. pages 129:1–129:18, 2016.
[49] Ahmed Selim, Mohamed Elgharib, and Linda Doyle. Painting style transfer for head portraits using convolutional neural networks. ACM Transactions on Graphics (TOG), 35(4):129, 2016.
[50] Falong Shen, Shuicheng Yan, and Gang Zeng. Neural style transfer via meta networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8061–8069, 2018.
[51] Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, volume 35, pages 93–102. Wiley Online Library, 2016.
[52] Yichun Shi, Debayan Deb, and Anil K Jain. WarpGAN: Automatic caricature generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10762–10771, 2019.
[53] YiChang Shih, Sylvain Paris, Connelly Barnes, William T Freeman, and Frédo Durand. Style transfer for headshot portraits. ACM Transactions on Graphics (TOG), 33(4):148, 2014.
[54] Hao Su, Jianwei Niu, Xuefeng Liu, Qingfeng Li, Ji Wan, Mingliang Xu, and Tao Ren. An end-to-end method for producing scanning-robust stylized QR codes. arXiv preprint arXiv:2011.07815, 2020.
[55] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[56] Lidan Wang, Vishwanath Sindagi, and Vishal Patel. High-quality facial photo-sketch synthesis using multi-adversarial networks. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 83–90. IEEE, 2018.
[57] Nannan Wang, Xinbo Gao, Leiyu Sun, and Jie Li. Bayesian face sketch synthesis. IEEE Transactions on Image Processing, 26(3):1264–1274, 2017.
[58] Holger Winnemöller, Sven C Olsen, and Bruce Gooch. Real-time video abstraction. In ACM Transactions on Graphics (TOG), volume 25, pages 1221–1226. ACM, 2006.
[59] Mingliang Xu, Hao Su, Yafei Li, Xi Li, Jing Liao, Jianwei Niu, Pei Lv, and Bing Zhou. Stylized aesthetic QR code. IEEE Transactions on Multimedia, 2019.
[60] Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L Rosin. APDrawingGAN: Generating artistic portrait drawings from face photos with hierarchical GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10743–10752, 2019.
[61] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849–2857, 2017.
[62] Shengchuan Zhang, Xinbo Gao, Nannan Wang, Jie Li, and Mingjin Zhang. Face sketch synthesis via sparse representation-based greedy search. IEEE Transactions on Image Processing, 24(8):2466–2477, 2015.
[63] Yexun Zhang, Ya Zhang, and Wenbin Cai. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447–8455, 2018.
[64] Yong Zhang, Weiming Dong, Chongyang Ma, Xing Mei, Ke Li, Feiyue Huang, Bao-Gang Hu, and Oliver Deussen. Data-driven synthesis of cartoon faces using different styles. IEEE Transactions on Image Processing, 26(1):464–478, 2016.
[65] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[66] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.