MangaGAN: Unpaired Photo-to-Manga Translation Based on the Methodology of Manga Drawing
Figure 1: Left: the combination of manga faces we generated and body components we collected from the popular manga work Bleach [27], which shows a unified style with a strong attraction. Right: the input frontal face photos and our results, where our method effectively endows the outputs with both facial similarity and the target manga style.
ABSTRACT

Manga is a world-popular comic form that originated in Japan, which typically employs black-and-white stroke lines and geometric exaggeration to describe humans' appearances, poses, and actions. In this paper, we propose MangaGAN, the first method based on Generative Adversarial Networks (GANs) for unpaired photo-to-manga translation. Inspired by how experienced manga artists draw manga, MangaGAN generates the geometric features of a manga face with a designed GAN model and delicately translates each facial region into the manga domain with a tailored multi-GANs architecture. For training MangaGAN, we construct a new dataset collected from a popular manga work, containing manga facial features, landmarks, bodies, and so on. Moreover, to produce high-quality manga faces, we further propose a structural smoothing loss that smooths stroke lines and avoids noisy pixels, and a similarity preserving module that improves the similarity between the photo and manga domains. Extensive experiments show that MangaGAN can produce high-quality manga faces that preserve both facial similarity and a popular manga style, and it outperforms other related state-of-the-art methods.

1 INTRODUCTION

Manga, which originated in Japan, is a worldwide popular comic form of drawing on serialized pages to present long stories. Typical manga is printed in black-and-white (as shown in Fig. 1, left) and employs abstract stroke lines and geometric exaggeration to describe humans' appearances, poses, and actions. Professional manga artists usually build up personalized drawing styles during their careers, and their styles are hard for other peers to imitate. Meanwhile, drawing manga is a time-consuming process, and even a professional manga artist requires several hours to finish one page of high-quality work.

As an efficient approach to assist with manga drawing, automatically translating a face photo to manga with an attractive style is much desired. This task can be described as image translation, a hot topic in the computer vision field. In recent years, deep-learning-based image translation has made significant progress and derived a series of systematic methods. Among the examples are the Neural Style Transfer (NST) methods (e.g., [4, 11, 21, 29]), which use tailored CNNs and objective functions to stylize images, and the Generative Adversarial Network (GAN) [12] based methods (e.g., [19, 36, 65]), which work well for mapping paired or unpaired images from the original domain to the stylized domain.

Although these excellent works have achieved good performance in their applications, they have difficulty generating high-quality manga due to the following four challenges. First, in the manga domain, humans' faces are abstract, colorless, geometrically exaggerated, and far from those in the photo domain; the facial correspondences between the two domains are hard for networks to match. Second, the style of manga is represented more by the structure of stroke lines, the face shape, and the facial features' sizes and locations. Meanwhile, for different facial features, manga artists always use different drawing styles and locate them with another personalized skill. These independent features (i.e., appearance, location, size, style) can hardly be extracted and summarized by a single network simultaneously. Third, a generated manga has to faithfully resemble the input photo to keep the identity of the user without compromising the abstract manga style; it is challenging to achieve both with high performance. Fourth, training data for manga are difficult to collect. Manga artists often use local storyboards to show stories, which makes it difficult to find clear and complete manga faces, owing to factors such as coverage by hair or shadow, segmentation by storyboards, low resolution, and so on. Therefore, related state-of-the-art methods of image stylization (e.g., [11, 19, 29, 35, 53, 60, 65]) are not able to produce the desired manga results (comparison results are shown in Figures 11 and 12).

To address these challenges, we present MangaGAN, the first GAN-based method for translating frontal face photos to the manga domain while preserving the attractive style of the popular manga work Bleach [27]. We observed that an experienced manga artist generally takes the following steps when drawing manga: first outlining the exaggerated face and locating the geometric distributions of facial features, and then fine-drawing each of them. MangaGAN follows this process and employs a multi-GANs architecture to translate different facial features, and it maps their geometric features with another designed GAN model. Moreover, to obtain high-quality results in an unsupervised manner, we present a Similarity Preserving (SP) module to improve the similarity between the photo and manga domains, and we leverage a structural smoothing loss to avoid artifacts.

To summarize, our main contributions are three-fold:
• We propose MangaGAN, the first GAN-based method for unpaired photo-to-manga translation. It can produce attractive manga faces while preserving both facial similarity and a popular manga style. MangaGAN uses a novel network architecture that simulates the drawing process of manga artists: it generates the exaggerated geometric features of faces with a designed GAN model and delicately translates each facial region with a tailored multi-GANs architecture.
• We propose a similarity preserving module that effectively improves performance on preserving both facial similarity and manga style. We also propose a structural smoothing loss that encourages results with smooth stroke lines and fewer messy pixels.
• We construct a new dataset called MangaGAN-BL (containing manga facial features, landmarks, bodies, etc.), collected from the world-popular manga work Bleach. Each sample has been manually processed by cropping, angle-correction, and repairing of disturbing elements (e.g., hair covering, shadows). MangaGAN-BL will be released for academic use.

2 RELATED WORK

Recent literature suggests two main directions with the ability to generate manga-like results: neural style transfer and GAN-based cross-domain translation.

2.1 Neural style transfer

The goal of neural style transfer (NST) is to transfer the style of an art image to another content target image. Inspired by the progress of CNNs, Gatys et al. [11] proposed the pioneering NST work by utilizing a CNN's power to extract abstract features and the style-capturing ability of Gram matrices [10]. Then, Li and Wand [29] used Markov Random Fields (MRFs) to encode styles and presented an MRF-based method (CNNMRF) for image stylization. Afterward, various follow-up works were presented to improve performance in terms of visual quality [13, 20, 35, 39, 48, 63], generation speed [4, 17, 21, 33, 50], and multimedia extensions [3, 5, 15, 54, 59].

Although these methods work well for translating images into some typical artistic styles, e.g., oil painting and watercolor, they are not good at producing black-and-white manga with exaggerated geometry and discrete stroke lines, since they tend to translate the texture and color features of a target style while preserving the structure of the content image.

2.2 GAN-based cross-domain translation

Many GAN-based cross-domain translation methods work well for image stylization, where the goal is to learn a mapping from a source domain to a stylized domain. A series of works based on GANs [12] have been presented and applied to image stylization. Pix2Pix [19] first presented a unified framework for image-to-image translation based on conditional GANs [40]. BicycleGAN [66] extends it to multi-modal translation. Methods including CycleGAN [65], DualGAN [61], DiscoGAN [23], UNIT [36], DTN [55], etc. are presented for unpaired one-to-one translation. MUNIT [18], StarGAN [7], etc. are presented for unpaired many-to-many translation.

The methods mentioned above succeed in translation tasks that are mainly characterized by color or texture changes only (e.g., summer to winter, and apples to oranges). For photo-to-manga translation, they fail to capture the correspondences between the two domains due to the abstract structure, colorless appearance, and geometric deformation of manga drawing.

Besides the above two main directions, there are also some works specially designed for creating artistic facial images. They employ techniques of non-photorealistic rendering (NPR), data-driven synthesis, computer graphics, etc., and have achieved much progress in many typical art forms, e.g., caricature and cartoon [2, 6, 16, 31, 47, 52, 58, 64], and portrait and sketching [1, 8, 34, 43, 45, 46, 49, 56, 57, 60, 62]. However, none of them involve the generation of manga faces.
Figure 2: Overall pipeline of MangaGAN. Inspired by the prior knowledge of manga drawing, MangaGAN consists of two branches: one branch learns the geometric mapping with a Geometric Transformation Network (GTN); the other branch learns the appearance mapping with an Appearance Transformation Network (ATN). At the end, a Synthesis Module is designed to fuse them and produce the final manga face.
Let P and M denote the face photo domain and the manga domain respectively, where no pairing exists between them. Given an input photo p ∈ P, our MangaGAN learns a mapping Ψ : P → M that transfers p to a sample m = Ψ(p), m ∈ M, while endowing m with both the manga style and facial similarity.

As shown in Figure 2(f), our method is inspired by the prior knowledge of how experienced manga artists draw manga: they first outline the exaggerated face and locate the geometric distributions of the facial features, and finally do the fine-drawing. Accordingly, MangaGAN consists of two branches: one branch learns a geometric mapping Ψ_geo with a Geometric Transformation Network (GTN) N_geo, which is adopted to translate the facial geometry from P to M [Figure 2(d)]; the other branch learns an appearance mapping Ψ_app with an Appearance Transformation Network (ATN) N_app [Figure 2(b)], which is used to produce the components of all facial features. At the end, a Synthesis Module is designed to fuse the facial geometry and all components, ending up with the output manga m ∈ M [Figure 2(e)]. We detail the ATN, the GTN, and the Synthesis Module in Section 3.2, Section 3.3, and Section 3.4 respectively.
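To make the two-branch design concrete, the following is a minimal structural sketch of how the pieces described above could be wired together at inference time; the module classes, method names, and the region/landmark extraction inputs are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MangaGANPipeline(nn.Module):
    """Sketch of the two-branch MangaGAN pipeline: ATN for appearance, GTN for geometry."""
    def __init__(self, atn_generators: nn.ModuleDict, gtn: nn.Module, synthesizer):
        super().__init__()
        # atn_generators: one forward generator per facial region,
        # e.g. {'eye': g_eye, 'nose': g_nose, 'mouth': g_mouth, 'hair': g_hair}.
        self.atn = atn_generators
        self.gtn = gtn                  # maps photo landmarks to manga landmarks
        self.synthesizer = synthesizer  # fuses translated regions onto the manga geometry

    def forward(self, regions, landmarks):
        # regions: dict of cropped facial-region tensors; landmarks: photo landmark tensor.
        manga_regions = {name: self.atn[name](crop) for name, crop in regions.items()}
        manga_landmarks = self.gtn(landmarks)
        return self.synthesizer(manga_regions, manga_landmarks)
```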
Figure 3: The ATN is a network with a multi-GANs architecture consisting of four local GANs, designed to translate each facial region respectively. Moreover, we tailor different training strategies and encoders to improve their performance.

3.2 Appearance transformation network

As shown in Figure 3, the ATN N_app is a network with a multi-GANs architecture that includes a set of four local GANs, N_app = {N^eye, N^nose, N^mouth, N^hair}, where N^eye, N^nose, N^mouth, and N^hair are respectively trained to translate the facial regions of the eye, nose, mouth, and hair from the input p ∈ P to the output m ∈ M.

3.2.1 Translating regions of eyes and mouths. Eyes and mouths are the critical components of manga faces but are the hardest parts to translate, since they are the most noticed, are error sensitive, and vary with different facial expressions. For N^eye and N^mouth, to better map the unpaired data, we couple each forward mapping with a reverse mapping, inspired by the network architecture of CycleGAN [65]. Accordingly, the baseline architecture of N^δ (δ ∈ {eye, mouth}) includes the forward/backward generators G_M^δ / G_P^δ and the corresponding discriminators D_P^δ / D_M^δ. G_M^δ learns the mapping Ψ_app^δ : p^δ → m̂^δ, and G_P^δ learns the reverse mapping Ψ_app^δ′ : m^δ → p̂^δ, where m̂^δ and p̂^δ are the generated fake samples; the discriminators D_P^δ / D_M^δ learn to distinguish the real samples p^δ / m^δ from the fake samples p̂^δ / m̂^δ. Our generators G_P^δ and G_M^δ use a ResNet with 6 blocks [14], and D_P^δ and D_M^δ use the Markovian discriminator of 70 × 70 PatchGANs [19, 28, 30].
We adopt the stable least-squares loss [38] instead of the negative log-likelihood objective [12] as our adversarial loss L_adv, defined as
$$\mathcal{L}^{\delta}_{adv}(G^{\delta}_{M}, D^{\delta}_{M}) = \mathbb{E}_{m^{\delta}\sim M^{\delta}}\big[(D^{\delta}_{M}(m^{\delta})-1)^{2}\big] + \mathbb{E}_{p^{\delta}\sim P^{\delta}}\big[D^{\delta}_{M}(G^{\delta}_{M}(p^{\delta}))^{2}\big], \quad (1)$$
while L_adv^δ(G_P^δ, D_P^δ) is defined in a similar manner.

L_cyc is the cycle-consistency loss [65], which is used to constrain the mapping solution between the input and output domains, defined as
$$\mathcal{L}^{\delta}_{cyc}(G^{\delta}_{P}, G^{\delta}_{M}) = \mathbb{E}_{p^{\delta}\sim P^{\delta}}\big[\|G^{\delta}_{P}(G^{\delta}_{M}(p^{\delta}))-p^{\delta}\|_{1}\big] + \mathbb{E}_{m^{\delta}\sim M^{\delta}}\big[\|G^{\delta}_{M}(G^{\delta}_{P}(m^{\delta}))-m^{\delta}\|_{1}\big]. \quad (2)$$
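To make the adversarial and cycle-consistency terms concrete, the following PyTorch-style sketch shows one way Eqs. (1) and (2) could be computed for a single region δ; the generator/discriminator modules and the `real_photo`/`real_manga` batch names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def lsgan_adversarial_loss(D_M, G_M, real_manga, real_photo):
    # Eq. (1): least-squares GAN loss for the forward mapping P -> M.
    # D_M should score real manga patches as 1 and generated ones as 0.
    # In training, D_M minimizes this term, while G_M is updated to push D_M(fake) toward 1.
    fake_manga = G_M(real_photo)
    loss_real = torch.mean((D_M(real_manga) - 1.0) ** 2)
    loss_fake = torch.mean(D_M(fake_manga) ** 2)
    return loss_real + loss_fake

def cycle_consistency_loss(G_P, G_M, real_photo, real_manga):
    # Eq. (2): L1 reconstruction after a round trip through both generators.
    photo_cycled = G_P(G_M(real_photo))   # P -> M -> P
    manga_cycled = G_M(G_P(real_manga))   # M -> P -> M
    return F.l1_loss(photo_cycled, real_photo) + F.l1_loss(manga_cycled, real_manga)
```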
However, we find that the baseline architectures of N^eye and N^mouth with L_adv and L_cyc still fail to preserve the similarity between the two domains. Specifically, for the eye and mouth regions, the baseline always produces messy results, since the networks are almost unable to match colored photos to the discrete black lines of manga. Therefore, we make the following three improvements to optimize their performance.

First, we design a Similarity Preserving (SP) module with an SP loss L_SP to enhance the similarity. Second, we train an encoder E_eye that extracts the main backbone of p^eye into a binary result, used as the input of N^eye, and an encoder E_mouth that encodes p^mouth into binary edge-lines, used to guide the shape of the manga mouth (training details of the two encoders are described in Section 7.5.1). Third, a structural smoothing loss L_SS is designed to encourage the networks to produce manga with smooth stroke lines, defined as
$$\mathcal{L}_{SS}(G^{\delta}_{P}, G^{\delta}_{M}) = \frac{1}{\sqrt{2\pi}\,\sigma}\Big[\sum_{j\in\{1,2,\dots,N\}}\exp\Big(\!-\frac{(G^{\delta}_{P}(m^{\delta})_{j}-\mu)^{2}}{2\sigma^{2}}\Big) + \sum_{k\in\{1,2,\dots,N\}}\exp\Big(\!-\frac{(G^{\delta}_{M}(p^{\delta})_{k}-\mu)^{2}}{2\sigma^{2}}\Big)\Big], \quad (3)$$
where L_SS is based on a Gaussian model with μ = 255/2, and G_P^δ(m^δ)_j or G_M^δ(p^δ)_k is the j-th or k-th pixel of G_P^δ(m^δ) or G_M^δ(p^δ). The underlying idea is that producing unnecessary gray areas distracts from and messes up the manga results, since manga mainly consists of black and white stroke lines. Thus, we give a pixel a smaller loss when its gray value is closer to black (0) or white (255), which smooths the gradient edges of black stroke lines and produces clean results.
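A minimal PyTorch sketch of the structural smoothing term in Eq. (3) is given below; it penalizes mid-gray pixels with a Gaussian bump centered at μ = 255/2, pushing outputs toward pure black or white strokes. The value σ = 40 and the assumption that generator outputs lie in [0, 255] are illustrative choices, not values given in the paper.

```python
import math
import torch

def structural_smoothing_loss(fake_photo, fake_manga, sigma=40.0):
    # Eq. (3): Gaussian penalty centered at mid-gray (mu = 255/2).
    # Pixels near 0 (black) or 255 (white) incur almost no loss, so the
    # generators are encouraged to produce clean, binary-looking strokes.
    mu = 255.0 / 2.0
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)

    def gaussian_penalty(img):
        return torch.exp(-((img - mu) ** 2) / (2.0 * sigma ** 2)).sum()

    return coeff * (gaussian_penalty(fake_photo) + gaussian_penalty(fake_manga))
```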
Similarity Preserving Module. The main idea of the SP module is that keeping two images similar at a lower resolution gives them similar spatial distributions but different pixel details when they are up-sampled to a higher resolution. As shown in Figure 4(a), we append two SP modules, one on the forward mapping and one on the backward mapping of N^δ. The SP module leverages a pre-trained network φ that we designed to extract feature maps in different latent spaces and resolutions. The architecture of φ is shown in Figure 4(b); it uses only a few convolutional layers, since we consider the correspondences of the encoded features to be relatively clear. For the forward mapping Ψ_app^δ : m̂^δ = G_M^δ(p^δ), we input p^δ and G_M^δ(p^δ) to the SP module and optimize G_M^δ by minimizing the loss function L_SP(G_M^δ, p^δ) defined as
$$\mathcal{L}_{SP}(G^{\delta}_{M}, p^{\delta}) = \sum_{i\in\phi}\lambda_{i}\,\mathcal{L}^{\phi,i}_{feat}\!\left[f^{\phi}_{i}(p^{\delta}),\, f^{\phi}_{i}(G^{\delta}_{M}(p^{\delta}))\right] + \lambda_{I}\,\mathcal{L}^{I}_{pixel}\!\left[p^{\delta},\, G^{\delta}_{M}(p^{\delta})\right], \quad (4)$$
where λ_i and λ_I control the relative importance of each objective, and L_pixel^I and L_feat^{φ,i} are used to keep the pixel-wise and feature-wise similarities respectively. L_feat^{φ,i} and L_pixel^I are defined as
$$\mathcal{L}^{\phi,i}_{feat}\!\left[f^{\phi}_{i}(p^{\delta}), f^{\phi}_{i}(G^{\delta}_{M}(p^{\delta}))\right] = \left\|f^{\phi}_{i}(p^{\delta}) - f^{\phi}_{i}(G^{\delta}_{M}(p^{\delta}))\right\|_{2}^{2}, \qquad \mathcal{L}^{I}_{pixel}\!\left[p^{\delta}, G^{\delta}_{M}(p^{\delta})\right] = \left\|p^{\delta} - G^{\delta}_{M}(p^{\delta})\right\|_{2}^{2}, \quad (5)$$
where f_i^φ(x) is the feature map extracted from the i-th layer of the network φ when x is the input. Note that we only extract feature maps after pooling layers.

Figure 4: (a) We append two SP modules on both the forward and backward mappings. (b) The SP module extracts feature maps at different resolutions and measures the similarities between two inputs in different latent spaces.

Combining Eqs. (1)-(5), the full objective for learning the appearance mappings of N^δ (δ ∈ {eye, mouth}) is
$$\mathcal{L}^{\delta}_{app} = \mathcal{L}_{adv}(G^{\delta}_{M}, D^{\delta}_{M}) + \mathcal{L}_{adv}(G^{\delta}_{P}, D^{\delta}_{P}) + \alpha_{1}\mathcal{L}_{cyc}(G^{\delta}_{P}, G^{\delta}_{M}) + \alpha_{2}\mathcal{L}_{SP}(G^{\delta}_{M}, p^{\delta}) + \alpha_{3}\mathcal{L}_{SP}(G^{\delta}_{P}, m^{\delta}) + \alpha_{4}\mathcal{L}_{SS}(G^{\delta}_{M}, G^{\delta}_{P}), \quad (6)$$
where α_1 to α_4 are used to balance the multiple objectives.

3.2.2 Translating regions of nose and hair. Noses are insignificant in manga faces, since almost all characters have a similar nose in the target manga style. Therefore, N^nose adopts a generating method instead of a translating one, following the architecture of progressive growing GANs [22], which can produce a large number of high-quality results similar to the training data. As shown in Figure 3(d), we first train a variational autoencoder [26] to encode the nose region of the input photo into a feature vector, then use the vector as a seed to generate a default manga nose; we also allow users to change it according to their preferences.

N^hair employs a pre-trained generator of APDrawingGAN [60] that can produce binary portrait hair with a style similar to manga.
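The following sketch illustrates one way the SP loss of Eqs. (4)-(5) could be implemented with a small pooling-based feature extractor. The 5×5 kernels and the stage widths (64, 64, 128, 256, 512) follow the layer labels sketched in Figure 4(b), while the pooling type, the single-channel input, and the λ weights are assumptions for illustration, not the exact configuration of the paper's pre-trained φ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPFeatureExtractor(nn.Module):
    """Small convolutional network phi; feature maps are taken after each pooling layer."""
    def __init__(self, in_channels=1, widths=(64, 64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=5, stride=1, padding=2),
                nn.ReLU(inplace=True),
                nn.AvgPool2d(kernel_size=2, stride=2),  # halves the resolution at each stage
            ))
            prev = w

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # multi-resolution feature maps f_i(x)
        return feats

def sp_loss(phi, photo, fake_manga, lambdas=(1.0, 1.0, 1.0, 1.0, 1.0), lambda_pixel=1.0):
    # Eq. (4): weighted feature-wise squared-L2 terms over the pooled scales
    # plus a pixel-wise squared-L2 term (Eq. 5).
    feats_p = phi(photo)
    feats_m = phi(fake_manga)
    loss = lambda_pixel * F.mse_loss(photo, fake_manga, reduction='sum')
    for lam, f_p, f_m in zip(lambdas, feats_p, feats_m):
        loss = loss + lam * F.mse_loss(f_p, f_m, reduction='sum')
    return loss
```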
Figure 5: The pipeline of GTN. (a) To improve the variety of facial collocation mode, GTN divides geometric information …
When translating facial landmarks, one issue is that the collocation mode of facial features constrains the variety of results. For example, people with the same face shape may have different sizes or locations of eyes, nose, or mouth. However, a GAN may generate them in a fixed or similar collocation mode when it is trained on the landmarks of global faces. Accordingly, as shown in Figure 5, we divide the geometric features into three attributes (the face shape, and the facial features' locations and sizes) and employ three sub-GANs N_sha, N_loc, and N_siz.

The loss of N_loc, and the losses of N_siz and N_sha, are represented in a similar manner. The objective function L_geo used to optimize the GTN is
$$\begin{aligned}
\mathcal{L}_{geo} ={}& \mathcal{L}^{L_P}_{adv_{loc}} + \mathcal{L}^{L_M}_{adv_{loc}} + \beta_{1}\mathcal{L}_{cyc_{loc}} + \beta_{2}\big(\mathcal{L}^{L_P}_{cha_{loc}} + \mathcal{L}^{L_M}_{cha_{loc}}\big)\\
&+ \mathcal{L}^{L_P}_{adv_{siz}} + \mathcal{L}^{L_M}_{adv_{siz}} + \beta_{3}\mathcal{L}_{cyc_{siz}} + \beta_{4}\big(\mathcal{L}^{L_P}_{cha_{siz}} + \mathcal{L}^{L_M}_{cha_{siz}}\big)\\
&+ \mathcal{L}^{L_P}_{adv_{sha}} + \mathcal{L}^{L_M}_{adv_{sha}} + \beta_{5}\mathcal{L}_{cyc_{sha}} + \beta_{6}\big(\mathcal{L}^{L_P}_{cha_{sha}} + \mathcal{L}^{L_M}_{cha_{sha}}\big),
\end{aligned} \quad (8)$$
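Below is a sketch of how the GTN objective in Eq. (8) could be assembled from the three landmark sub-GANs; the per-sub-GAN loss values and the β weights are placeholders, since the definitions of the adversarial, cycle, and characteristic losses for the landmark domains are not included in this excerpt.

```python
# Hypothetical assembly of the GTN objective (Eq. 8) from its three sub-GANs.
# Each sub-GAN (location, size, shape) is assumed to expose its adversarial,
# cycle-consistency, and characteristic loss terms for both landmark domains.

def gtn_objective(sub_gans, betas):
    """sub_gans: dict with keys 'loc', 'siz', 'sha'; each value provides
    'adv_LP', 'adv_LM', 'cyc', 'cha_LP', 'cha_LM' as scalar tensors.
    betas: dict mapping 'loc'/'siz'/'sha' to (beta_cyc, beta_cha)."""
    total = 0.0
    for name in ('loc', 'siz', 'sha'):
        g = sub_gans[name]
        beta_cyc, beta_cha = betas[name]
        total = total + g['adv_LP'] + g['adv_LM'] \
                      + beta_cyc * g['cyc'] \
                      + beta_cha * (g['cha_LP'] + g['cha_LM'])
    return total
```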
Figure 8: Comparison results for eye and mouth regions under different improvements. Without our improvements, the network produces poor manga results with messy regions and artifacts, and it even fails to capture the correspondences between inputs and outputs.
Unlike them, our method can effectively make the output similar in appearance to the target manga (e.g., exaggerated eyelids, smooth eyebrows, simplified mouths), as shown in Figure 12(h)(i).

Figure 10: (a) Samples of eye regions. (b) Samples of mouth regions (red lines indicate landmarks). Our method can effectively preserve the shape of eyebrows (red arrows), eyes, and mouths, and it further abstracts them into the manga style.

5 DISCUSSION

The performance on preserving manga style. Most state-of-the-art methods are prone to translating the color or texture of the artistic image and ignore the translation of geometric abstraction. As shown in Figures 11 and 12, the stylized faces they generate are similar to the input photos with only color or texture changes, which makes them look more like realistic sketches or portraits than abstract manga. Unlike them, we extend the translation to the structure of stroke lines and the geometric abstraction of facial features (e.g., simplified eyes and mouths, beautified facial proportions), which makes our results more like works drawn by the manga artist.

The performance on preserving user identity. We generate the manga face guided by the input photo; however, manga characters are typically fictitious, simplified, idealized, and much unlike real people. Specifically, manga faces are usually designed to have optimum proportions, and the facial features are simplified to several black lines [Figure 12(i)]. Therefore, excessive similarity between the output and the input would make the output unlike a manga. To generate typical and clean manga faces, we even remove the detailed textures and beautify the proportions of facial features, which compromises the performance on preserving the user identity. Accordingly, it is reasonable that there are some dissimilarities between the output manga face and the input facial photo.

More evaluations. To subjectively evaluate the performance of our method on preserving manga style, preserving user identity, and visual attractiveness, we conduct a series of user studies in Section 2 of the supplementary materials. Moreover, we also show more experimental results and generated manga faces in Section 5 of our supplementary materials.
Figure 11: Comparison results with NST methods, including Gatys [11], Fast NST [21], SCNST [20], Deep Image Analogy [35], CNNMRF [29], and Headshot Portrait [53]. For a fair comparison, we employ three different manga faces (one of which is our result) as the style targets to stylize each input photo respectively.
Figure 12: Comparison results with cross-domain translation methods. (a) Input photo. (b) ROI of the input photo. (c)-(h) Results of CycleGAN [65], UNIT [36], Im2Pencil [32], APDrawingGAN [60], and our method, respectively. (i) Some typical face samples in the target manga work [27]. We observe that our method can effectively preserve the manga style of (i), e.g., exaggerated eyelids, smooth eyebrows, and simplified mouths. More generated samples are shown in Figures 8 and 9 of our Supplemental Material.
N^nose employs a generating method instead of a translating one, which follows the architecture of progressive growing GANs [22]. The network architecture of N^nose is illustrated in Table 2. MangaGAN-BL can be downloaded via the Google Drive link: https://drive.google.com/drive/folders/1viLG8fbT4lVXAwrYBOxVLoJrS2ZTUC3o?usp=sharing.

7.7 Generated Samples

In Table 3, we show some generated samples of paired eye regions. For one person with different facial expressions, our method successfully preserves the similarities of the manga eyes, and the appearances of the manga eyes adaptively change with the facial expressions as well; for different people, our method can effectively preserve the shape of eyebrows and eyes, and it further abstracts them into the manga style.

Some generated samples of manga noses are shown in Figure 20. Moreover, in Figure 18 and Figure 19, we show some high-resolution manga faces generated for males and females.

7.8 Dataset

Our dataset MangaGAN-BL is collected from the world-popular manga work Bleach [27]. It contains manga facial features comprising 448 eyes, 109 noses, 179 mouths, and 106 frontal views of manga faces whose landmarks have been marked manually. Moreover, each sample of MangaGAN-BL is normalized to 256×256 and optimized by cropping, angle-correction, and repairing of disturbing elements (e.g., covering by hair, glasses, or shadows).
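As a rough illustration of the normalization described above, the snippet below crops a face region, corrects in-plane rotation from two eye landmarks, and resizes the result to 256×256; the landmark source, the crop box, and the grayscale conversion are assumptions for illustration, not the paper's released preprocessing code.

```python
import math
from PIL import Image

def normalize_face(image_path, left_eye, right_eye, box, out_size=256):
    """Angle-correct and crop a face sample to out_size x out_size.

    left_eye, right_eye: (x, y) landmark coordinates in the original image.
    box: (left, top, right, bottom) crop rectangle around the face.
    Both are assumed to come from an external landmark annotation or detector.
    """
    img = Image.open(image_path).convert('L')  # manga samples are grayscale

    # Rotate so the eye line becomes horizontal (angle correction).
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    img = img.rotate(angle, center=center, resample=Image.BILINEAR)

    # Crop the face box and resize to the training resolution.
    face = img.crop(box).resize((out_size, out_size), Image.BILINEAR)
    return face
```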
7.9 Failure Cases

Although our method can generate attractive manga faces in many cases, the network still produces some typical failure cases. As shown in Figure 21, when the input eyes are close to the hair, part of the hair area will be selected into the input image, which results in some artifacts in the generated manga. These failure cases are caused by the incomplete content of our dataset. For example, our data for training manga eyes only include clean eye regions, so the model cannot adapt to some serious interference elements (e.g., hair, glasses).
Figure 16: Ablation experiment of our improvements on eye regions. From left to right: input face photos, results of the encoder E_eye, our results, results after removing the structural smoothing loss L_SS, results after removing the SP module, and results after removing E_eye.
Table 3: Some samples of eye regions in input photos and generated mangas.
Figure 17: Upper: comparison results with NST methods, including Gatys [11], Fast NST [21], Deep Image Analogy [35], and CNNMRF [29]. Bottom: comparison results with GAN-based one-to-one translation methods, including CycleGAN [65] and UNIT [36].
Figure 18: Samples of input photos and generated manga faces.

Figure 19: Samples of input photos and generated manga faces.

Figure 20: Samples of generated manga noses.

Figure 21: Typical failure cases of our method. When the input eyes are close to the hair, part of the hair area may be selected into the input image, which results in some artifacts in the generated manga.

REFERENCES
[1] Itamar Berger, Ariel Shamir, Moshe Mahler, Elizabeth Carter, and Jessica Hodgins. Style and abstraction in portrait sketching. ACM Transactions on Graphics (TOG), 32(4):55, 2013.
[2] Kaidi Cao, Jing Liao, and Lu Yuan. CariGANs: unpaired photo-to-caricature translation. In SIGGRAPH Asia 2018 Technical Papers, page 244. ACM, 2018.
[3] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the International Conference on Computer Vision, 2017.
[4] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. StyleBank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1906, 2017.
[5] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6654–6663, 2018.
[6] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. CartoonGAN: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9465–9474, 2018.
[7] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[8] Jakub Fišer, Ondřej Jamriška, Michal Lukáč, Eli Shechtman, Paul Asente, Jingwan Lu, and Daniel Sýkora. StyLit: illumination-guided example-based stylization of 3D renderings. ACM Transactions on Graphics (TOG), 35(4):92, 2016.
[9] Frederick N Fritsch and Ralph E Carlson. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238–246, 1980.
[10] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[11] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[13] Shuyang Gu, Congliang Chen, Jing Liao, and Lu Yuan. Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8222–8231, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. Real-time neural style transfer for videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 783–791, 2017.
[16] Junhong Huang, Mingkui Tan, Yuguang Yan, Chunmei Qing, Qingyao Wu, and Zhuliang Yu. Cartoon-to-photo facial translation with generative adversarial networks. In Asian Conference on Machine Learning, pages 566–581, 2018.
[17] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[18] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976. IEEE, 2017.
[20] Yongcheng Jing, Yang Liu, Yezhou Yang, Zunlei Feng, Yizhou Yu, Dacheng Tao, and Mingli Song. Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV), pages 238–254, 2018.
[21] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, pages 694–711, 2016.
[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[23] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1857–1865. JMLR.org, 2017.
[24] Davis E King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] Tite Kubo. Bleach. Weekly Jump, 2001–2016.
[28] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[29] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[30] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[31] Wenbin Li, Wei Xiong, Haofu Liao, Jing Huo, Yang Gao, and Jiebo Luo. CariGAN: Caricature generation through weakly paired adversarial learning. arXiv preprint arXiv:1811.00445, 2018.
[32] Yijun Li, Chen Fang, Aaron Hertzmann, Eli Shechtman, and Ming-Hsuan Yang. Im2Pencil: Controllable pencil illustration from photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1534, 2019.
[33] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.
[34] Dongxue Liang, Kyoungju Park, and Przemyslaw Krompiec. Facial feature model for a portrait video stylization. Symmetry, 10(10):442, 2018.
[35] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. ACM Transactions on Graphics (TOG), 36(4):120, 2017.
[36] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[37] Debbie S Ma, Joshua Correll, and Bernd Wittenbrink. The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4):1122–1135, 2015.
[38] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
[39] Yifang Men, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. A common framework for interactive texture transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6353–6362, 2018.
[40] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[41] Umar Riaz Muhammad, Michele Svanera, Riccardo Leonardi, and Sergio Benini. Hair detection, segmentation, and hairstyle classification in the wild. Image and Vision Computing, 2018.
[42] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[43] Chunlei Peng, Xinbo Gao, Nannan Wang, Dacheng Tao, Xuelong Li, and Jie Li. Multiple representations-based face sketch–photo synthesis. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2201–2215, 2015.
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[45] Paul L Rosin and Yu-Kun Lai. Non-photorealistic rendering of portraits. In Proceedings of the Workshop on Computational Aesthetics, pages 159–170. Eurographics Association, 2015.
[46] Paul L Rosin, David Mould, Itamar Berger, John P Collomosse, Yu-Kun Lai, Chuan Li, Hua Li, Ariel Shamir, Michael Wand, Tinghuai Wang, et al. Benchmarking non-photorealistic rendering of portraits. In NPAR, pages 11–1, 2017.
[47] Takafumi Saito and Tokiichiro Takahashi. Comprehensible rendering of 3-D shapes. In ACM SIGGRAPH Computer Graphics, volume 24, pages 197–206. ACM, 1990.
[48] Ahmed Selim, Mohamed Elgharib, and Linda Doyle. Painting style transfer for head portraits using convolutional neural networks. pages 129:1–129:18, 2016.
[49] Ahmed Selim, Mohamed Elgharib, and Linda Doyle. Painting style transfer for head portraits using convolutional neural networks. ACM Transactions on Graphics (TOG), 35(4):129, 2016.
[50] Falong Shen, Shuicheng Yan, and Gang Zeng. Neural style transfer via meta networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8061–8069, 2018.
[51] Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, volume 35, pages 93–102. Wiley Online Library, 2016.
[52] Yichun Shi, Debayan Deb, and Anil K Jain. WarpGAN: Automatic caricature generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10762–10771, 2019.
[53] YiChang Shih, Sylvain Paris, Connelly Barnes, William T Freeman, and Frédo Durand. Style transfer for headshot portraits. ACM Transactions on Graphics (TOG), 33(4):148, 2014.
[54] Hao Su, Jianwei Niu, Xuefeng Liu, Qingfeng Li, Ji Wan, Mingliang Xu, and Tao Ren. An end-to-end method for producing scanning-robust stylized QR codes. arXiv preprint arXiv:2011.07815, 2020.
[55] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[56] Lidan Wang, Vishwanath Sindagi, and Vishal Patel. High-quality facial photo-sketch synthesis using multi-adversarial networks. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 83–90. IEEE, 2018.
[57] Nannan Wang, Xinbo Gao, Leiyu Sun, and Jie Li. Bayesian face sketch synthesis. IEEE Transactions on Image Processing, 26(3):1264–1274, 2017.
[58] Holger Winnemöller, Sven C Olsen, and Bruce Gooch. Real-time video abstraction. In ACM Transactions on Graphics (TOG), volume 25, pages 1221–1226. ACM, 2006.
[59] Mingliang Xu, Hao Su, Yafei Li, Xi Li, Jing Liao, Jianwei Niu, Pei Lv, and Bing Zhou. Stylized aesthetic QR code. IEEE Transactions on Multimedia, 2019.
[60] Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L Rosin. APDrawingGAN: Generating artistic portrait drawings from face photos with hierarchical GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10743–10752, 2019.
[61] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849–2857, 2017.
[62] Shengchuan Zhang, Xinbo Gao, Nannan Wang, Jie Li, and Mingjin Zhang. Face sketch synthesis via sparse representation-based greedy search. IEEE Transactions on Image Processing, 24(8):2466–2477, 2015.
[63] Yexun Zhang, Ya Zhang, and Wenbin Cai. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447–8455, 2018.
[64] Yong Zhang, Weiming Dong, Chongyang Ma, Xing Mei, Ke Li, Feiyue Huang, Bao-Gang Hu, and Oliver Deussen. Data-driven synthesis of cartoon faces using different styles. IEEE Transactions on Image Processing, 26(1):464–478, 2016.
[65] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[66] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.