AgileAvatar: Stylized 3D Avatar Creation via Cascaded Domain Bridging

Shen Sang (ByteDance, Mountain View, USA)
Tiancheng Zhi (ByteDance, Mountain View, USA)
Guoxian Song (ByteDance, Mountain View, USA)
Minghao Liu (UC Santa Cruz, Santa Cruz, USA)
Chunpong Lai (ByteDance, Mountain View, USA)
Figure 1: (a) Given a front-facing user image as input, (b) our method progressively bridges the domain gap between real faces and 3D avatars through three stages: (b.1) the stylization stage performs an image-space translation to generate a stylized portrait while normalizing expressions; (b.2) the parameterization stage uses a learned model to find avatar parameters which match the results of stylization; (b.3) the conversion stage searches for a valid avatar vector matching the parameterization that can be rendered by the graphics engine. (c) The output is a user-editable 3D model which can be animated and used in various applications, for example personalized emoji. ©H JACQUOT and Montclair Film.
ABSTRACT
Stylized 3D avatars have become increasingly prominent in our modern life. Creating these avatars manually usually involves laborious selection and adjustment of continuous and discrete parameters and is time-consuming for average users. Self-supervised approaches to automatically create 3D avatars from user selfies promise high quality with little annotation cost but fall short in application to stylized avatars due to a large style domain gap. We propose a novel self-supervised learning framework to create high-quality stylized 3D avatars with a mix of continuous and discrete parameters. Our cascaded domain bridging framework first leverages a modified portrait stylization approach to translate input selfies into stylized avatar renderings as the targets for desired 3D avatars. Next, we find the best parameters of the avatars to match the stylized avatar renderings through a differentiable imitator we train to mimic the avatar graphics engine. To ensure we can effectively optimize the discrete parameters, we adopt a cascaded relaxation-and-search pipeline. We use a human preference study to evaluate how well our method preserves user identity compared to previous work as well as manual creation. Our results achieve much higher preference scores than previous work and close to those of manual creation. We also provide an ablation study to justify the design choices in our pipeline.
Figure 2: Pipeline. Our framework consists of three modules: Portrait Stylization for image-space real-to-stylized domain crossing, Self-supervised Avatar Parameterization for recovering a relaxed avatar vector from the stylization latent code, and Avatar Vector Conversion for discretizing the predicted relaxed avatar vector into a strict avatar vector that can be consumed directly by the graphics engine. ©NGÁO STUDIO.
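To make the data flow of Fig. 2 easier to follow, here is a minimal sketch of the three-stage cascade written as plain Python functions. The stage interfaces, the avatar-vector length, and the layout of the discrete blocks are illustrative assumptions for this example, not the actual implementation.

```python
import torch

# Hypothetical stage interfaces; shapes follow the paper's description
# (an 18x512 W+ latent code; an avatar vector mixing continuous and discrete parts).
# None of these stubs are the released implementation.

def stylize(user_image: torch.Tensor) -> torch.Tensor:
    """Stage 1 (Portrait Stylization): encode the selfie into a W+ latent code.
    A real system would run the transfer-learned StyleGAN encoder/decoder here."""
    return torch.randn(18, 512)

def parameterize(w_plus: torch.Tensor) -> torch.Tensor:
    """Stage 2 (Self-supervised Avatar Parameterization): map the latent code to a
    relaxed avatar vector (continuous values plus soft discrete choices)."""
    return torch.rand(64)  # 64 is an assumed total length

def convert_to_strict(relaxed: torch.Tensor,
                      discrete_blocks=((32, 48), (48, 64))) -> torch.Tensor:
    """Stage 3 (Avatar Vector Conversion): snap each soft discrete block to one-hot
    so the graphics engine can consume the vector directly."""
    strict = relaxed.clone()
    for start, end in discrete_blocks:          # assumed layout of the discrete blocks
        block = torch.zeros(end - start)
        block[int(relaxed[start:end].argmax())] = 1.0
        strict[start:end] = block
    return strict

selfie = torch.zeros(3, 256, 256)               # placeholder input image
avatar_vector = convert_to_strict(parameterize(stylize(selfie)))
print(avatar_vector.shape)                      # torch.Size([64])
```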
Game Avatars: Commercial products such as Zepeto and ReadyPlayer use a graphics engine to render cartoon avatars from user selfies. While no detailed description of their methods exists, we suspect these commercial methods are supervised with a large amount of manual annotations, something this paper seeks to avoid.

Creating semi-realistic 3D avatars has also been explored [Cao et al. 2016; Hu et al. 2017; Ichim et al. 2015; Luo et al. 2021]. Most relevant to our framework, Shi et al. [2019] proposed an algorithm to search for the optimal avatar parameters by comparing the input image directly to the rendered avatar. Follow-up work improves efficiency [Shi et al. 2020] and uses the photograph's texture to make the avatar match more closely [Lin et al. 2021]. These efforts seek to create a similar-looking avatar, while this paper seeks to create a highly stylized avatar with a large domain gap.

Portrait Stylization: Many methods for non-photorealistic stylization of 2D images exist. Gatys et al. [2016] proposed neural style transfer, matching features at different levels of CNNs. Image-to-image models focus on the translation of images from a source to a target domain, either with paired data supervision [Isola et al. 2017] or without [Park et al. 2020; Zhu et al. 2017]. Recent developments in GAN inversion [Richardson et al. 2021; Tov et al. 2021] and interpolation [Pinkney and Adler 2020] make it possible to achieve high-quality cross-domain stylization [Cao et al. 2018; Song et al. 2021; Zhu et al. 2021]. The end results of these methods are in 2D pixel space and directly inspire the first stage of our pipeline.

3 PROPOSED APPROACH
Our cascaded avatar creation framework consists of three stages: Portrait Stylization (Sec. 3.1), Self-supervised Avatar Parameterization (Sec. 3.2), and Avatar Vector Conversion (Sec. 3.3). A diagram of their relationship is shown in Fig. 2. Portrait Stylization transforms a real user image into a stylized avatar image, keeping as much personal identity (glasses, hair, colors, etc.) as possible, while simultaneously normalizing the face to look closer to an avatar rendering. Next, the Self-supervised Avatar Parameterization module regresses a relaxed avatar vector from the stylization latent code via an MLP-based Mapper. Finally, the Avatar Vector Conversion module discretizes part of the relaxed avatar vector to meet the requirements of the graphics engine using an appearance-based search.

3.1 Portrait Stylization
Portrait Stylization transforms user images into stylized images close to our target domain. This stage of our pipeline occurs entirely within the 2D image domain. We adopt an encoder-decoder framework for the stylization task. A novel transfer learning approach is applied to a StyleGAN model [Karras et al. 2020], including W+ space transfer learning, a normalized style exemplar set, and a loss function that supports these modifications.

W+ space transfer learning: We perform transfer learning directly in the W+ space, unlike previous methods [Gal et al. 2021; Song et al. 2021] where stylization transfer learning is done in the more entangled Z/Z+ space. The W+ space is more disentangled and can preserve more personal identity features. However, this design change introduces a challenge: we need to model a distribution prior W of the W+ space, as it is a highly irregular space [Wulff and Torralba 2020] and cannot be directly sampled like the Z/Z+ space (a standard Gaussian distribution). We achieve this by inverting a large dataset of real face images into W+ embeddings via a pre-trained image encoder [Tov et al. 2021], and then sampling latent codes from that prior. Fig. 3 provides an example of better-preserved personalization: notice that our method preserves glasses, which are lost in the comparison method.
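As a concrete illustration of the paragraph above, the following sketch builds an empirical W+ prior by inverting a set of real face images and then samples codes from it. It assumes a pretrained image-to-W+ encoder (such as e4e) is available as a callable; the dummy encoder at the bottom only stands in so the snippet runs, and none of this is the authors' released code.

```python
import torch

@torch.no_grad()
def build_wplus_prior(encoder, face_images: torch.Tensor, batch_size: int = 32) -> torch.Tensor:
    """Invert a large set of real face images into W+ codes of shape (N, 18, 512).

    `encoder` is assumed to be a pretrained image-to-W+ encoder (e.g. e4e);
    here it is simply any callable mapping (B, 3, H, W) -> (B, 18, 512).
    """
    codes = []
    for start in range(0, face_images.shape[0], batch_size):
        batch = face_images[start:start + batch_size]
        codes.append(encoder(batch))
    return torch.cat(codes, dim=0)

def sample_wplus(prior_codes: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Draw latent codes from the empirical W+ prior instead of N(0, I) in Z space."""
    idx = torch.randint(0, prior_codes.shape[0], (num_samples,))
    return prior_codes[idx]

# Usage with a dummy encoder standing in for a real e4e model:
dummy_encoder = lambda x: torch.zeros(x.shape[0], 18, 512)
prior = build_wplus_prior(dummy_encoder, torch.zeros(100, 3, 256, 256))
w = sample_wplus(prior, num_samples=8)   # (8, 18, 512) codes for transfer learning
```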
Figure 3: Portrait stylization results. Compared with a state-of-the-art stylization method, AgileGAN [Song et al. 2021], our stylization does a better job at preserving the user's personal identity (e.g. glasses are preserved), and simultaneously normalizes the expressions (e.g. the mouth is closed) for easier fitting in the downstream pipeline. ©Greg Mooney and Sebastiaan ter Burg.

Figure 4: Avatar Parameterization produces errors in final predictions if discrete types are enforced during training, such as the hair and beard types in this example. Relaxing the discrete constraint allows easier optimization and thus better predictions which match the stylization target more closely.
Normalized Style Exemplar Set: Our stylization method seeks to ignore pose and expression and produce a normalized image. In contrast, existing methods are optimized to preserve source-to-target similarities literally, transferring specific facial expressions, head poses, and lighting conditions directly from user photos into target stylized images. This is not desirable for our later avatar parameterization stage, as we are trying to extract only the core personal identity features. In order to produce normalized stylizations, we limit the rendered exemplars provided during transfer learning to contain only neutral poses, expressions and illumination to ensure a good normalization. Fig. 3 provides an example of a smiling face: the comparison method preserves the smile, while our method successfully provides only the normalized core identity.

Loss: Our loss contains non-standard terms to support the needs of our pipeline. The target output stylization is not exactly aligned with the input due to pose normalization. Therefore, the commonly used perceptual loss [Zhang et al. 2018] cannot be applied directly in decoder training. We instead use a novel segmented color loss. The full objective comprises three loss terms to fine-tune the generator $G_\phi$. Let $G_{\phi_o}$ and $G_{\phi_t}$ be the model before and after fine-tuning. We introduce a color matching loss at a semantic level. Specifically, we leverage two face segmentation models from BiSeNet [Yu et al. 2018], pre-trained on real and stylized data separately, to match the color of semantic regions. Let $\mathcal{S} = \{hair, skin\}$ be the classes taken into consideration, and $B_k(I)$ ($k \in \mathcal{S}$) be the mean color of pixels belonging to class $k$ in image $I$. $B_k^{real}$ and $B_k^{style}$ denote the real and stylized segmentation models separately. The semantic color matching loss $\mathcal{L}_{sem}$ penalizes the difference between $B_k^{real}$ computed on the input and $B_k^{style}$ computed on the stylized output for each class $k \in \mathcal{S}$ (Eq. (1)).

We also keep an adversarial loss built on the StyleGAN2 discriminator [Karras et al. 2020]:

$\mathcal{L}_{adv} = \mathbb{E}_{y \sim \mathcal{Y}}[\min(0, -1 + D(y))] + \mathbb{E}_{w \sim \mathcal{W}}[\min(0, -1 - D(G_{\phi_t}(w)))]$ (2)

Also, to improve training stability and prevent artifacts, we use R1 regularization [Mescheder et al. 2018] for the discriminator: $\mathcal{L}_{R1} = \frac{\gamma}{2}\,\mathbb{E}_{y \sim \mathcal{Y}}[\lVert \nabla D(y) \rVert^2]$, where we set $\gamma = 10$ empirically.

Finally, the generator and discriminators are jointly trained to optimize the combined objective $\min_\phi \max_D \mathcal{L}_{stylize}$, where

$\mathcal{L}_{stylize} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{sem}\mathcal{L}_{sem} + \lambda_{R1}\mathcal{L}_{R1}$ (3)

$\lambda_{adv} = 1$, $\lambda_{sem} = 12$, and $\lambda_{R1} = 5$ are constant weights set empirically. Please see Appendix A for more details.
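For clarity, here is a hedged PyTorch sketch of the terms in Eqs. (2)-(3), written as the usual separate generator/discriminator updates of the min-max objective. `D` stands for any discriminator module, the segmentation masks are assumed to come from the two BiSeNet models, and the exact form of the semantic color loss (Eq. (1)) and of the generator's adversarial term are our own illustrative choices, not the released code.

```python
import torch
import torch.nn.functional as F

def adversarial_hinge_loss_d(D, real, fake):
    """Negative of Eq. (2): minimizing this maximizes the hinge objective over D."""
    return -(torch.mean(torch.clamp(D(real) - 1, max=0))
             + torch.mean(torch.clamp(-D(fake) - 1, max=0)))

def adversarial_loss_g(D, fake):
    """Generator term paired with the hinge discriminator (one common choice)."""
    return -torch.mean(D(fake))

def r1_penalty(D, real, gamma: float = 10.0):
    """R1 regularization: (gamma / 2) * E[ ||grad_x D(x)||^2 ] on real images."""
    real = real.detach().requires_grad_(True)
    scores = D(real).sum()
    (grad,) = torch.autograd.grad(scores, real, create_graph=True)
    return 0.5 * gamma * grad.flatten(1).pow(2).sum(dim=1).mean()

def semantic_color_loss(real_img, styl_img, real_masks, styl_masks):
    """Match the mean color of each semantic class (e.g. hair, skin) between the
    real input and its stylized output; masks are (B, K, H, W) tensors in {0, 1}."""
    loss = 0.0
    for k in range(real_masks.shape[1]):
        m_r = real_masks[:, k:k + 1]
        m_s = styl_masks[:, k:k + 1]
        mean_r = (real_img * m_r).sum((2, 3)) / m_r.sum((2, 3)).clamp(min=1.0)
        mean_s = (styl_img * m_s).sum((2, 3)) / m_s.sum((2, 3)).clamp(min=1.0)
        loss = loss + F.l1_loss(mean_r, mean_s)
    return loss

# Weights follow the paper; R1 (weight LAMBDA_R1) is added on the discriminator side,
# e.g. loss_d = adversarial_hinge_loss_d(D, real, fake) + LAMBDA_R1 * r1_penalty(D, real).
LAMBDA_ADV, LAMBDA_SEM, LAMBDA_R1 = 1.0, 12.0, 5.0

def stylize_loss_g(D, real_img, fake_img, real_masks, fake_masks):
    return (LAMBDA_ADV * adversarial_loss_g(D, fake_img)
            + LAMBDA_SEM * semantic_color_loss(real_img, fake_img, real_masks, fake_masks))
```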
3.2 Self-supervised Avatar Parameterization
Avatar Parameterization finds a set of parameters for the rendering engine which produces an avatar matching the stylized portrait as closely as possible. We call the module which finds the parameters the mapper. To facilitate training the mapper, we use a differentiable neural rendering engine we call the imitator.

A particular avatar is defined by an avatar vector with both continuous and discrete parameters. Continuous parameters primarily control placement and size, for example eye size, eye rotation, mouth position, and head width. Discrete parameters select individual assets and textures such as hair types, beard types, and skin tone textures. All parameters are concatenated into a vector, with discrete parameters represented as one-hot vectors.
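The description above, together with the relaxed-versus-strict distinction in Fig. 2 and Fig. 4, can be made concrete with a small sketch of the two vector forms. Parameter names and block sizes are made-up placeholders; only the structure (continuous values concatenated with per-type one-hot or softmax blocks) follows the text.

```python
import torch
import torch.nn.functional as F

# Illustrative avatar-vector layout: continuous controls followed by one block per
# discrete asset type. The real engine's parameter list is not public.
CONTINUOUS = ["eye_size", "eye_rotation", "mouth_position", "head_width"]
DISCRETE = {"hair_type": 12, "beard_type": 6, "skin_tone": 8}

def pack_strict(cont: torch.Tensor, choices: dict) -> torch.Tensor:
    """A strict avatar vector: continuous params plus exact one-hot selections."""
    parts = [cont]
    for name, n in DISCRETE.items():
        parts.append(F.one_hot(torch.tensor(choices[name]), n).float())
    return torch.cat(parts)

def pack_relaxed(cont: torch.Tensor, logits: dict) -> torch.Tensor:
    """A relaxed avatar vector: the one-hot blocks are replaced by soft scores,
    which keeps the vector differentiable during mapper training."""
    parts = [cont]
    for name, n in DISCRETE.items():
        parts.append(torch.softmax(logits[name], dim=-1))
    return torch.cat(parts)

strict = pack_strict(torch.tensor([0.5, 0.1, 0.3, 0.7]),
                     {"hair_type": 3, "beard_type": 0, "skin_tone": 5})
relaxed = pack_relaxed(torch.tensor([0.5, 0.1, 0.3, 0.7]),
                       {k: torch.randn(n) for k, n in DISCRETE.items()})
print(strict.shape, relaxed.shape)   # both torch.Size([30])
```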
Table 1: Numerical results from two user studies. Our method is judged to produce better avatars than the baseline methods, approaching the quality of manual work. Attribute evaluation: judge whether a specific attribute of the created avatar matches the human image. Matching: choose the correct one out of four avatars which matches the human image.

Method             beard type   face shape   brow type   hair color   hair style   skin tone   Matching task
F2P [2019]         0.36         0.46         0.22        0.21         0.12         0.36        0.67
CNN                0.17         0.54         0.22        0.46         0.30         0.50        0.57
Stylization+CNN    0.45         0.69         0.38        0.57         0.43         0.66        0.82
Ours               0.82         0.94         0.88        0.82         0.72         0.82        0.92
Manual             0.94         0.97         0.85        0.90         0.86         0.94        0.96
Figure 7: Results comparison. (a) Given an input image, (b) our method produces an avatar in the target cartoon style that looks similar to the user. (c) A CNN trained on synthetic data produces an incorrect beard, hair style, and glasses on real image inputs due to the significant domain gap. (d) Applying the CNN instead to the results of stylization reduces the domain gap and thus improves results; however, significant errors remain. (e) F2P [Shi et al. 2019], a baseline method intended to produce semi-realistic avatars, does not consider the domain gap and thus produces poor results when used with stylized avatars. (f) Manual results were created by expert-trained users. Our results approximate the quality obtainable through manual creation. ©Sebastiaan ter Burg, NIGP, YayA Lee and S Pakhrin.
…our observations that the overall method is robust to the precise selection of loss, but that the additional terms help in some cases.

5 LIMITATIONS
We observe two main limitations of our method. First, our method occasionally produces wrong predictions for assets covering a small area, because their contribution to the loss is small and gets ignored. The eye color in Fig. 9 (a) is an example of this difficulty. Redesigning the loss function might resolve this problem. Second, lighting is not fully normalized in the stylization stage, leading to incorrect skin tone estimates when there are strong shadows, as shown in Fig. 9 (b). This problem could potentially be addressed by incorporating intrinsic decomposition into the pipeline. In addition to the limitations of our method, we experience a loss of ethnicity in the final results, which is mainly introduced by the graphics engine, as also evidenced by the manually-created results. This issue could be addressed by improving the diversity of the avatar system.

6 CONCLUSION
In summary, we present a self-supervised stylized avatar auto-creation method with cascaded domain crossing. Our method demonstrates that the gap between the real image domain and the target avatar domain can be progressively bridged with a three-stage pipeline: portrait stylization, self-supervised avatar parameterization, and avatar vector conversion. Each stage is carefully designed and cannot simply be removed. Experimental results show that our approach produces high-quality, attractive 3D avatars with personal identities preserved. In the future, we will extend the proposed pipeline to other domains, such as cubism and caricature avatars.
Figure 8: We ablate by removing the stylization stage, as well as by replacing our stylization with a state-of-the-art method. In each case the final renderings from the graphics engine are shown. (a) Fitting directly on a user image results in an avatar that lacks attractiveness. (b) Replacing our stylization with AgileGAN [Song et al. 2021] suffers from missing personal information such as glasses, and from artifacts where smiles are misinterpreted as heavy lips or a mustache. (c) Our stylization retains personal features like glasses and generates visually appealing results in spite of expressions. ©Chang-Ching Su and Luca Boldrini.

Figure 9: Limitations: (a) failure on a parameter (eye color) affecting a small number of pixels; (b) incorrect skin tone prediction caused by shadows. ©Daniel Åberg and Peter Bright.

REFERENCES
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-quality single-shot capture of facial geometry. In ACM SIGGRAPH 2010 Papers. 1–9.
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. 187–194.
Hongrui Cai, Yudong Guo, Zhuang Peng, and Juyong Zhang. 2021. Landmark detection and 3D face reconstruction for caricature using a nonlinear parametric model. Graphical Models 115 (2021), 101103.
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics 35, 4 (2016).
Kaidi Cao, Jing Liao, and Lu Yuan. 2018. CariGANs: Unpaired Photo-to-Caricature Translation.
Zhixiang Chen and Tae-Kyun Kim. 2021. Learning Feature Aggregation for Deep 3D Morphable Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13164–13173.
Pengyu Cheng, Chang Liu, Chunyuan Li, Dinghan Shen, Ricardo Henao, and Lawrence Carin. 2018. Straight-through estimator as projected Wasserstein gradient flow. In Neural Information Processing Systems (NeurIPS) Workshop.
Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019a. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019b. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
Bernhard Egger, William AP Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhoefer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, et al. 2020. 3D morphable face models—past, present, and future. ACM Transactions on Graphics (TOG) 39, 5 (2020), 1–38.
Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. 2021. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946 [cs.CV]
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2414–2423.
Xiaoguang Han, Chang Gao, and Yizhou Yu. 2017. DeepSketch2Face: A deep learning based sketching system for 3D face and caricature modeling. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
Xiaoguang Han, Kangcheng Hou, Dong Du, Yuda Qiu, Shuguang Cui, Kun Zhou, and Yizhou Yu. 2018. CaricatureShop: Personalized and photorealistic caricature sketching. IEEE Transactions on Visualization and Computer Graphics 26, 7 (2018), 2349–2361.
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (TOG) 36, 6 (2017), 1–14.
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–14.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1125–1134.
Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8110–8119.
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. 2019. Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms. In INTERSPEECH. 2350–2354.
Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2020. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Thomas Lewiner, Thales Vieira, Dimas Martínez, Adelailson Peixoto, Vinícius Mello, and Luiz Velho. 2011. Interactive 3D caricature from harmonic exaggeration. Computers & Graphics 35, 3 (2011), 586–595.
Song Li, Songzhi Su, Juncong Lin, Guorong Cai, and Li Sun. 2021. Deep 3D caricature face generation with identity and structure consistency. Neurocomputing 454 (2021), 178–188.
Jiangke Lin, Yi Yuan, and Zhengxia Zou. 2021. MeInGame: Create a Game Character Face from a Single Portrait. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 311–319.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision. 3730–3738.
Huiwen Luo, Koki Nagano, Han-Wei Kung, Qingguo Xu, Zejian Wang, Lingyu Wei, Liwen Hu, and Hao Li. 2021. Normalized Avatar Synthesis Using StyleGAN and Perceptual Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11662–11672.
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge? In International Conference on Machine Learning. PMLR, 3481–3490.
Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. 2020. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision. Springer, 319–345.
Weilong Peng, Zhiyong Feng, Chao Xu, and Yong Su. 2017. Parametric T-spline face morphable model for detailed fitting in shape subspace. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6139–6147.
Justin NM Pinkney and Doron Adler. 2020. Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020).
Yuda Qiu, Xiaojie Xu, Lingteng Qiu, Yan Pan, Yushuang Wu, Weikai Chen, and Xiaoguang Han. 2021. 3DCaricShop: A dataset and a baseline method for single-view 3D caricature face reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10236–10245.
Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in Style: A StyleGAN Encoder for Image-to-Image Translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
Tianyang Shi, Yi Yuan, Changjie Fan, Zhengxia Zou, Zhenwei Shi, and Yong Liu. 2019. Face-to-parameter translation for game character auto-creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 161–170.
Tianyang Shi, Zhengxia Zuo, Yi Yuan, and Changjie Fan. 2020. Fast and Robust Face-to-Parameter Translation for Game Character Auto-Creation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1733–1740.
Guoxian Song, Linjie Luo, Jing Liu, Wan-Chun Ma, Chunpong Lai, Chuanxia Zheng, and Tat-Jen Cham. 2021. AgileGAN: Stylizing portraits by inversion-consistent transfer learning. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–13.
Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. 2021. Designing an encoder for StyleGAN image manipulation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–14.
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017).
Roberto C Cavalcante Vieira, Creto A Vidal, and Joaquim Bento Cavalcante-Neto. 2013. Three-dimensional face caricaturing by anthropometric distortions. In 2013 XXVI Conference on Graphics, Patterns and Images. IEEE, 163–170.
Qianyi Wu, Juyong Zhang, Yu-Kun Lai, Jianmin Zheng, and Jianfei Cai. 2018. Alive caricature from 2D to 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7336–7345.
Jonas Wulff and Antonio Torralba. 2020. Improving Inversion and Generation Diversity in StyleGAN using a Gaussianized Latent Space. In Conference on Neural Information Processing Systems.
Sicheng Xu, Jiaolong Yang, Dong Chen, Fang Wen, Yu Deng, Yunde Jia, and Xin Tong. 2020. Deep 3D portrait from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7710–7720.
Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. 2020. FaceScape: A Large-Scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Zipeng Ye, Mengfei Xia, Yanan Sun, Ran Yi, Minjing Yu, Juyong Zhang, Yu-Kun Lai, and Yong-Jin Liu. 2021. 3D-CariGAN: An end-to-end solution to 3D caricature generation from normal face photos. IEEE Transactions on Visualization and Computer Graphics (2021).
Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 325–341.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. 2021. Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks. arXiv:2110.08398 [cs.CV]
Michael Zollhöfer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick Pérez, Marc Stamminger, Matthias Nießner, and Christian Theobalt. 2018. State of the art on monocular 3D face reconstruction, tracking, and applications. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 523–550.

A PORTRAIT STYLIZATION DETAILS
Segmentation Models: The avatar segmentation model is trained using 20k randomly sampled avatar vectors with neutral pose, expression and illumination. For real image segmentation, we used an open-source pre-trained BiSeNet model [Yu et al. 2018] (https://github.com/zllrunning/face-parsing.PyTorch).

Distribution Prior W: To sample the W+ distribution prior, we invert the CelebA dataset [Liu et al. 2015] into the W+ space using a pre-trained e4e encoder [Tov et al. 2021].

Normalized Style Exemplar Set Y: For training the stylized generator $G_{\phi_t}$, we synthetically rendered a diverse set of 150 avatar images with normalized facial expressions.

B AVATAR PARAMETERIZATION DETAILS
B.1 Imitator
To train our module in a self-supervised way, we plug a differentiable neural renderer (i.e., the imitator) into our learning framework. As mentioned in the main paper, the imitator can take a relaxed avatar vector as input, although the imitator itself is trained with strict avatar vectors. Whether the input is a relaxed or a strict avatar vector, it produces a valid rendering. In this way, we can supervise the training in image space without any ground truth for the parameters. Due to the differentiability of the imitator, the parameterization stage can be trained with gradient descent. To achieve high-fidelity rendering quality, we leverage the StyleGAN2 generator [Karras et al. 2019] as our backbone, which is capable of generating high-quality renderings matching the graphics engine. The imitator consists of an encoder $E_i$ implemented as an MLP and a generator $G_i$ adopted from StyleGAN2. The encoder translates an input avatar vector to a latent code $w+$. The generator then produces a high-quality image given the latent code.

Training: In order to fully utilize the image generation capability of StyleGAN2, we propose to train the imitator in two steps: 1) we first train a StyleGAN2 from scratch with random rendering samples generated by our graphics engine to obtain a high-quality image generator, without any labels or conditions; then 2) we train the encoder and the generator together with images and corresponding labels, resulting in a conditional generator. Given an avatar vector $v$, a target image $I_{gt}$, and the generated image $I_{gen} = G_i(E_i(v))$, we use the following loss combination for the second training step:

$\mathcal{L}_{imitator} = \lambda_1 \lVert I_{gen} - I_{gt} \rVert_1 + \lambda_2 \mathcal{L}_{lpips} + \lambda_3 \mathcal{L}_{id}$ (6)

where the first term is an L1 loss, which encourages less blurring than L2. In addition, $\mathcal{L}_{lpips}$ is the LPIPS loss adopted from [Zhang et al. 2018],

$\mathcal{L}_{lpips} = \lVert F(I_1) - F(I_2) \rVert_2$ (7)

where $F$ denotes the perceptual feature extractor. $\mathcal{L}_{id}$ is the identity loss, which measures the cosine similarity between two faces, built upon a pre-trained ArcFace [Deng et al. 2019a] face recognition network $R$:

$\mathcal{L}_{id} = 1 - \cos(R(I_1), R(I_2))$ (8)

We set $\lambda_1 = 1.0$, $\lambda_2 = 0.8$, $\lambda_3 = 1.0$ empirically.

Interpolation property: Fig. 10 provides an example of the interpolation property of the imitator, which enables relaxed optimization over the discrete parameters.

Implementation: To train the imitator, we randomly generate 100,000 images and corresponding parameters. Note that although random sampling leads to strange avatars, our imitator can generate images matching the graphics engine well by seeing plenty of samples in the parameter space. Please refer to our supplementary video for a side-by-side comparison.

We train StyleGAN2 using the official source code (https://github.com/NVlabs/stylegan2-ada-pytorch) with images of size 256 × 256 × 3; thus the latent code $w+$ has a shape of 14 × 512. We build the encoder $E_i$ with 14 individual small MLPs, each responsible for mapping from the input vector to one latent style. Given the pretrained generator, we train the encoder and simultaneously finetune the generator with Adam [Kingma and Ba 2015]. We set the initial learning rate to 0.01 and decay it by 0.5 every two epochs. In our experiments, it takes around 20 epochs to converge.
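As a compact illustration of Eqs. (6)-(8) and the second training step, the sketch below combines the three terms and wraps them in a single optimization step. `lpips_fn` and `arcface` are stand-ins for the pretrained LPIPS and ArcFace networks, and `encoder`/`generator` are any torch modules mapping v → w+ → image; this is an assumption-laden sketch, not the released training code.

```python
import torch
import torch.nn.functional as F

def imitator_loss(gen_img, gt_img, lpips_fn, arcface,
                  l1_w=1.0, lpips_w=0.8, id_w=1.0):
    """Eq. (6): L1 + LPIPS + identity (1 - cosine similarity of ArcFace embeddings).

    `lpips_fn(a, b)` is assumed to return a perceptual distance per batch element and
    `arcface(img)` an identity embedding; both stand in for the pretrained networks.
    """
    l1 = F.l1_loss(gen_img, gt_img)
    lpips = lpips_fn(gen_img, gt_img).mean()
    id_loss = 1.0 - F.cosine_similarity(arcface(gen_img), arcface(gt_img), dim=-1).mean()
    return l1_w * l1 + lpips_w * lpips + id_w * id_loss

def imitator_step(encoder, generator, optimizer, avatar_vec, gt_render, lpips_fn, arcface):
    """One training step of the conditional imitator (encoder E_i + generator G_i),
    assuming (avatar_vector, ground-truth render) pairs sampled from the graphics engine."""
    optimizer.zero_grad()
    gen = generator(encoder(avatar_vec))      # v -> w+ -> image
    loss = imitator_loss(gen, gt_render, lpips_fn, arcface)
    loss.backward()
    optimizer.step()
    return loss.item()
```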
Figure 10: Interpolation of avatar vectors. The neural rendering imitator, which temporarily replaces the traditional graphics engine, is differentiable, allowing the relaxation of the strict constraint on discrete types. Linear interpolation between two avatar vectors results in the gradual disappearance of the beard and the gradual growth of the hair.

B.2 Mapper
We use CelebA-HQ [Lee et al. 2020] and FFHQ [Karras et al. 2019] as our training data. To collect a high-quality dataset for training, we use the Azure Face API (https://azure.microsoft.com/en-us/services/cognitive-services/face) to analyze the facial attributes and keep only facial images that meet our requirements:
1) within a limited pose range (yaw < 8°, pitch < 8°, roll < 5°);
2) without headwear;
3) without extreme expressions;
4) without any occlusions.
Finally, we collect 21,522 images in total for mapper training.

The input is an 18 × 512 latent code taken from the Stylization module. Each one of the 18 layers of the latent code is passed to an individual MLP. The output features are then concatenated together. After that, we apply two MLP heads to generate continuous and discrete parameters separately.

We apply a scaling before the softmax function for the discrete parameters:

$S(x)_k = \frac{e^{\beta x_k}}{\sum_{i=1}^{N} e^{\beta x_i}}, \quad k = 1, \dots, N$ (9)

where $\beta > 1$ is a coefficient that performs non-maximum suppression over types that contribute less than the dominant ones, and $N$ is the number of discrete types. During training, we gradually increase the coefficient $\beta$ to perform an easy-to-hard training by decreasing the smoothness. Empirically, we increase $\beta$ by 1 for each epoch. We train the mapper for 20 epochs.
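To make the mapper architecture and Eq. (9) concrete, here is a small sketch: one MLP per W+ layer, concatenated features, a continuous head, and per-type discrete heads whose logits pass through the β-scaled softmax. Hidden sizes, output sizes, and the sigmoid on the continuous head are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Mapper(nn.Module):
    """Sketch of the mapper described above: one small MLP per W+ layer, features
    concatenated, then separate heads for continuous and discrete parameters."""
    def __init__(self, n_layers=18, dim=512, hidden=256, n_cont=16, discrete_sizes=(12, 6, 8)):
        super().__init__()
        self.per_layer = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU()) for _ in range(n_layers))
        feat = n_layers * hidden
        self.cont_head = nn.Sequential(nn.Linear(feat, hidden), nn.ReLU(), nn.Linear(hidden, n_cont))
        self.disc_heads = nn.ModuleList(nn.Linear(feat, k) for k in discrete_sizes)

    def forward(self, w_plus, beta: float = 1.0):
        # w_plus: (B, 18, 512); each layer gets its own MLP, then features are concatenated.
        feats = torch.cat([mlp(w_plus[:, i]) for i, mlp in enumerate(self.per_layer)], dim=-1)
        cont = torch.sigmoid(self.cont_head(feats))                  # assumed [0, 1] range
        disc = [torch.softmax(beta * head(feats), dim=-1)            # Eq. (9), sharpened by beta
                for head in self.disc_heads]
        return cont, disc

# Easy-to-hard schedule from the text: beta grows by 1 per epoch (e.g. beta = 1 + epoch).
mapper = Mapper()
cont, disc = mapper(torch.randn(2, 18, 512), beta=3.0)
print(cont.shape, [d.shape for d in disc])
```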