
AgileAvatar: Stylized 3D Avatar Creation via Cascaded Domain Bridging

Shen Sang (ByteDance, Mountain View, USA), Tiancheng Zhi (ByteDance, Mountain View, USA), Guoxian Song (ByteDance, Mountain View, USA), Minghao Liu (UC Santa Cruz, Santa Cruz, USA), Chunpong Lai (ByteDance, Mountain View, USA), Jing Liu (ByteDance, Mountain View, USA), Xiang Wen (ByteDance, Hangzhou, China), James Davis (UC Santa Cruz, Santa Cruz, USA), Linjie Luo (ByteDance, Mountain View, USA)

arXiv:2211.07818v1 [cs.CV] 15 Nov 2022

(a) Input   (b) Our method: cascaded domain bridging (from left to right): (b.1) Stylization, (b.2) Parameterization, (b.3) Conversion   (c) Application

Figure 1: (a) Given a front-facing user image as input, (b) our method progressively bridges the domain gap between real faces and 3D avatars through three stages: (b.1) The stylization stage performs an image-space translation to generate a stylized portrait while normalizing expressions. (b.2) The parameterization stage uses a learned model to find avatar parameters which match the results of stylization. (b.3) The conversion stage searches for a valid avatar vector matching the parameterization that can be rendered by the graphics engine. (c) The output is a user-editable 3D model which can be animated and applied to various applications, for example personalized emoji. ©H JACQUOT and Montclair Film.

ABSTRACT
Stylized 3D avatars have become increasingly prominent in our modern life. Creating these avatars manually usually involves laborious selection and adjustment of continuous and discrete parameters and is time-consuming for average users. Self-supervised approaches to automatically create 3D avatars from user selfies promise high quality with little annotation cost but fall short in application to stylized avatars due to a large style domain gap. We propose a novel self-supervised learning framework to create high-quality stylized 3D avatars with a mix of continuous and discrete parameters. Our cascaded domain bridging framework first leverages a modified portrait stylization approach to translate input selfies into stylized avatar renderings as the targets for desired 3D avatars. Next, we find the best parameters of the avatars to match the stylized avatar renderings through a differentiable imitator we train to mimic the avatar graphics engine. To ensure we can effectively optimize the discrete parameters, we adopt a cascaded relaxation-and-search pipeline. We use a human preference study to evaluate how well our method preserves user identity compared to previous work as well as manual creation. Our results achieve much higher preference scores than previous work and close to those of manual creation. We also provide an ablation study to justify the design choices in our pipeline.

Authors' email addresses: Shen Sang: [email protected]; Tiancheng Zhi: [email protected]; Guoxian Song: [email protected]; Minghao Liu: [email protected]; Chunpong Lai: [email protected]; Jing Liu: [email protected]; Xiang Wen: [email protected]; James Davis: [email protected]; Linjie Luo: [email protected].

CCS CONCEPTS
• Computing methodologies → Non-photorealistic rendering.

KEYWORDS
Avatar Creation, Human Stylization

1 INTRODUCTION
An attractive and animatable 3D avatar is an important entry point to the digital world, which has become increasingly prominent in modern life for socialization, shopping, gaming, etc. A good avatar should be both personalized (reflecting the person's unique appearance) and good-looking. Many popular avatar systems, such as Zepeto¹ and ReadyPlayer², adopt cartoonized and stylized designs for their playfulness and appeal to users. However, creating an avatar manually usually involves laborious selections and adjustments from a swarm of art assets, which is both time-consuming and difficult for average users with no prior experience.

In this paper, we study automatic creation of stylized 3D avatars from a single front-facing selfie image. To be specific, given a selfie image, our algorithm predicts an avatar vector as the complete configuration for a graphics engine to generate a 3D avatar and render avatar images from predefined 3D assets. The avatar vector consists of parameters specific to the predefined assets, which can be either continuous (e.g. head length) or discrete (e.g. hair types).

A naive solution is to annotate a set of selfie images and train a model to predict the avatar vector via supervised learning. However, large-scale annotations are needed to handle a large range of assets (usually in the hundreds). To alleviate the annotation cost, self-supervised methods [Shi et al. 2019, 2020] have been proposed that train a differentiable imitator mimicking the renderings of the graphics engine to automatically match the rendered avatar image with the selfie image using various identity and semantic segmentation losses. While these methods proved effective for creating semi-realistic avatars close to the user's identity, they fall short in application to stylized avatars since the style domain gap between selfie images and stylized avatars is too large (see Fig. 7).

Our main technical challenges are twofold: (1) the large domain gap between user selfie images and stylized avatars, and (2) the complex optimization of a mix of continuous and discrete parameters in the avatar vector. To address these challenges, we formulate a cascaded framework which progressively bridges the domain gap while ensuring optimization convergence on both continuous and discrete parameters. Our novel framework consists of three stages: Portrait Stylization, Self-supervised Avatar Parameterization, and Avatar Vector Conversion. Fig. 1 shows the domain gap gradually bridged across the three stages, while the identity information (hair style, skin tone, glasses, etc.) is maintained throughout the pipeline.

First, the Portrait Stylization stage focuses on 2D real-to-stylized visual appearance domain crossing. This stage translates the input selfie image to a stylized avatar rendering and remains in image space. Naively applying existing stylization methods [Pinkney and Adler 2020; Song et al. 2021] for translation will retain factors such as expression, which would unnecessarily complicate later stages of our pipeline. Thus, we create a modified variant of AgileGAN [Song et al. 2021] to ensure uniformity in expression while preserving user identity.

Next, the Self-Supervised Avatar Parameterization stage focuses on crossing from the image pixel domain to the avatar vector domain. We observed that strictly enforcing parameter discreteness causes optimization to fail to converge. To address this, we use a relaxed formulation called a relaxed avatar vector, in which discrete parameters are encoded as continuous one-hot vectors. To enable differentiability in training, we trained an imitator in a similar spirit to F2P [Shi et al. 2019] to mimic the behavior of the non-differentiable engine.

Finally, the Avatar Vector Conversion stage focuses on domain crossing from the relaxed avatar vector space to the strict avatar vector space, where all the discrete parameters are one-hot vectors. The strict avatar vector can then be used by the graphics engine to create final avatars and for rendering. We employ a novel search process that leads to better results than direct quantization.

To evaluate our results, we use a human preference study to evaluate how well our method preserves personal identity relative to baseline methods, including F2P [2019], as well as manual creation. Our results achieve much higher scores than baseline methods and close to those of manual creation. We also provide an ablation study to justify the design choices in our pipeline.

In summary, our technical contributions are:
• A novel self-supervised learning framework to create high-quality stylized 3D avatars with a mix of continuous and discrete parameters;
• A novel approach to cross the large style domain gap in stylized 3D avatar creation using portrait stylization;
• A cascaded relaxation-and-search pipeline that solves the convergence issue in discrete avatar parameter optimization.

2 RELATED WORK
3D Face Reconstruction: Photorealistic 3D face reconstruction from images has been studied extensively for many years. Extremely high quality models can be obtained using gantries with multiple cameras followed by a stereo or photogrammetry reconstruction [Beeler et al. 2010; Yang et al. 2020]. When only a single image is available, researchers leverage a parameterized 3D morphable model to reconstruct realistic 3D faces [Blanz and Vetter 1999; Chen and Kim 2021; Deng et al. 2019b; Peng et al. 2017; Xu et al. 2020]. Excellent surveys [Egger et al. 2020; Zollhöfer et al. 2018] exist providing great insights in this direction. These methods focus on an accurate reconstruction of the real human, and the model parameters often lack physical meaning. In contrast, our work focuses on cross-domain creation of a stylized avatar which has parameters with direct meaning to casual users.

3D Caricature: Non-photorealistic 3D face reconstruction has also received interest recently, a popular style being caricature. Qiu et al. [2021] created a dataset of 3D caricature models for reconstructing meshes from caricature images. Some works generate caricature meshes by exaggerating or deforming real face meshes, with [Cai et al. 2021; Wu et al. 2018] or without [Lewiner et al. 2011; Vieira et al. 2013] caricature image input. Sketches can be used to guide the creation [Han et al. 2017, 2018]. Recent works [Li et al. 2021; Ye et al. 2021] use GANs to generate 3D caricatures given real images. However, these methods are designed for reconstructing caricature meshes and/or textures, while we focus on cartoonish avatars constrained by parameters with semantic meaning.

¹ https://zepeto.me/
² https://readyplayer.me/

[Figure 2 diagram: User Image → Encoder → Latent Code (W+) → Decoder → Stylized image (Portrait Stylization, Sec. 3.1); Latent Code → Mapper → Relaxed Avatar Vector → Differentiable Imitator (Self-Supervised Avatar Parameterization, Sec. 3.2); Relaxed Avatar Vector → Appearance-Based Searching → Strict Avatar Vector → Graphics Engine (Avatar Vector Conversion, Sec. 3.3); each stage is trained with its own loss.]

Figure 2: Pipeline. Our framework consists of three modules: Portrait Stylization for image-space real-to-stylized domain crossing, Self-supervised Avatar Parametrization for recovering a relaxed avatar vector from the stylization latent code, and Avatar Vector Conversion for discretizing the predicted relaxed avatar vector into a strict avatar vector that can be taken by the graphics engine directly. ©NGÁO STUDIO.

Game Avatars: Commercial products such as Zepeto and ReadyPlayer use a graphics engine to render cartoon avatars from user selfies. While no detailed description of their methods exists, we suspect these commercial methods are supervised with a large amount of manual annotations, something this paper seeks to avoid.
Creating semi-realistic 3D avatars has also been explored [Cao et al. 2016; Hu et al. 2017; Ichim et al. 2015; Luo et al. 2021]. Most relevant to our framework, Shi et al. [2019] proposed an algorithm to search for the optimal avatar parameters by comparing the input image directly to the rendered avatar. Follow-up work improves efficiency [Shi et al. 2020] and seeks to use the photograph's texture to make the avatar match more closely [Lin et al. 2021]. These efforts seek to create a similar-looking avatar, while this paper seeks to create a highly stylized avatar with a large domain gap.

Portrait Stylization: Many methods for non-photorealistic stylization of 2D images exist. Gatys et al. [2016] proposed neural style transfer, matching features at different levels of CNNs. Image-to-image models focus on the translation of images from a source to a target domain, either with paired data supervision [Isola et al. 2017] or without [Park et al. 2020; Zhu et al. 2017]. Recent developments in GAN inversion [Richardson et al. 2021; Tov et al. 2021] and interpolation [Pinkney and Adler 2020] make it possible to achieve high quality cross-domain stylization [Cao et al. 2018; Song et al. 2021; Zhu et al. 2021]. The end results of these methods are in 2D pixel space and directly inspire the first stage of our pipeline.

3 PROPOSED APPROACH
Our cascaded avatar creation framework consists of three stages: Portrait Stylization (Sec. 3.1), Self-supervised Avatar Parameterization (Sec. 3.2), and Avatar Vector Conversion (Sec. 3.3). A diagram of their relationship is shown in Fig. 2. Portrait Stylization transforms a real user image into a stylized avatar image, keeping as much personal identity (glasses, hair, colors, etc.) as possible, while simultaneously normalizing the face to look closer to an avatar rendering. Next, the Self-supervised Avatar Parameterization module regresses a relaxed avatar vector from the stylization latent code via an MLP-based Mapper. Finally, the Avatar Vector Conversion module discretizes part of the relaxed avatar vector to meet the requirements of the graphics engine using an appearance-based search.

3.1 Portrait Stylization
Portrait Stylization transforms user images into stylized images close to our target domain. This stage of our pipeline occurs entirely within the 2D image domain. We adopt an encoder-decoder framework for the stylization task. A novel transfer learning approach is applied to a StyleGAN model [Karras et al. 2020], including W+ space transfer learning, using a normalized style exemplar set, and a loss function that supports these modifications.

W+ space transfer learning: We perform transfer learning directly from the W+ space, unlike previous methods [Gal et al. 2021; Song et al. 2021] where stylization transfer learning is done in the more entangled Z/Z+ space. The W+ space is more disentangled and can preserve more personal identity features. However, this design change introduces a challenge. We need to model a distribution prior W of the W+ space, as it is a highly irregular space [Wulff and Torralba 2020] and cannot be directly sampled like the Z/Z+ space (standard Gaussian distribution). We achieve this by inverting a large dataset of real face images into W+ embeddings via a pre-trained image encoder [Tov et al. 2021], and then sampling the latent codes from that prior. Fig. 3 provides one example of better preserved personalization. Notice that our method preserves glasses, which are lost in the comparison method.

Normalized Style Exemplar Set: Our stylization method seeks to ignore pose and expression and produce a normalized image.

(a) Input   (b) AgileGAN   (c) Our Stylization
Figure 3: Portrait stylization results. Compared with a state-of-the-art stylization method, AgileGAN [Song et al. 2021], our stylization does a better job at preserving the user's personal identity (e.g. glasses are preserved) while simultaneously normalizing the expressions (e.g. mouth is closed) for easier fitting in the downstream pipeline. ©Greg Mooney and Sebastiaan ter Burg.

(a) Stylization   (b) Strict   (c) Relaxed
Figure 4: Avatar Parameterization produces errors in final predictions if discrete types are enforced during training, such as hair and beard types in this example. Relaxing the discrete constraint allows easier optimization and thus better predictions which match the stylization target more closely.
In contrast, existing methods are optimized to preserve source-to-target similarities literally, transferring specific facial expressions, head poses, and lighting conditions directly from user photos into the target stylized images. This is not desirable for our later avatar parameterization stage, as we are trying to extract the core personal identity features only. In order to produce normalized stylizations, we limit the rendered exemplars provided during transfer learning to contain only neutral poses, expressions and illumination to ensure a good normalization. Fig. 3 provides an example of a smiling face. The comparison method preserves the smile, while our method successfully provides only the normalized core identity.

Loss: Our loss contains non-standard terms to support the needs of our pipeline. The target output stylization is not exactly aligned with the input due to pose normalization. Therefore, the commonly used perceptual loss [Zhang et al. 2018] cannot be applied directly in decoder training. We instead use a novel segmented color loss.
The full objective comprises three loss terms to fine-tune the generator G_φ. Let G_φo and G_φt be the model before and after fine-tuning. We introduce a color matching loss at a semantic level. Specifically, we leverage two face segmentation models from BiSeNet [Yu et al. 2018], pre-trained on real and stylized data separately, to match the color of semantic regions. Let S = {hair, skin} be the classes taken into consideration, and B_k(I) (k ∈ S) be the mean color of pixels belonging to class k in image I. B_real^k and B_style^k represent the real and stylized models separately. The semantic color matching loss is:

  L_sem = E_{w~W} [ Σ_{k∈S} ‖ B_real^k(G_φo(w)) − B_style^k(G_φt(w)) ‖² ]    (1)

An adversarial loss is used to match the distribution of the translated images to the target stylized set distribution Y, where D is the StyleGAN2 discriminator [Karras et al. 2020]:

  L_adv = E_{y~Y} [min(0, −1 + D(y))] + E_{w~W} [min(0, −1 − D(G_φt(w)))]    (2)

Also, to improve training stability and prevent artifacts, we use R1 regularization [Mescheder et al. 2018] for the discriminator: L_R1 = (γ/2) E_{y~Y} [‖∇D(y)‖²], where we set γ = 10 empirically.
Finally, the generator and discriminator are jointly trained to optimize the combined objective min_φ max_D L_stylize, where

  L_stylize = λ_adv L_adv + λ_sem L_sem + λ_R1 L_R1    (3)

λ_adv = 1, λ_sem = 12, λ_R1 = 5 are constant weights set empirically. Please see appendix A for more details.

3.2 Self-supervised Avatar Parameterization
Avatar Parameterization finds a set of parameters for the rendering engine which produces an avatar matching the stylized portrait as closely as possible. We call the module which finds parameters the mapper. To facilitate training the mapper, we use a differentiable neural rendering engine we call the imitator.
A particular avatar is defined by an avatar vector with both continuous and discrete parameters. Continuous parameters are used to control primarily placement and size, for example eye size, eye rotation, mouth position, and head width. Discrete parameters are used to set individual assets and textures such as hair types, beard types, and skin tone textures. All parameters are concatenated into a vector, with discrete parameters represented as one-hot vectors.

Mapper Training: The Mapper takes the results of portrait stylization as input and outputs an avatar vector which defines a similar looking avatar. Rather than using the stylized image itself as input, we use the latent code w+ derived from the stylization encoder, since it is a more compact representation and contains facial semantic styles from coarse to fine [Karras et al. 2019].

The Mapper is built as an MLP, and trained using a Mapper Loss which measures the similarity between the stylized image, I_style, and the imitator output, I_imitate. This loss function contains several terms to measure the global and local similarity.
To preserve global appearance, we incorporate identity loss L_id measuring the cosine similarity between two faces built upon a pretrained ArcFace [Deng et al. 2019a] face recognition network R: L_id = 1 − cos(R(I_style), R(I_imitate)). For a more fine-grained similarity measurement, LPIPS loss [Zhang et al. 2018] is adopted: L_lpips = ‖F(I_style) − F(I_imitate)‖₂, where F denotes the perceptual feature extractor. Additionally, we use a color matching loss to obtain more faithful colors for the skin and hair region:

  L_color = Σ_{k∈S} ‖ B_style^k(I_style) − B_style^k(I_imitate) ‖²    (4)

The final loss function is:

  L_mapper = λ_id L_id + λ_lpips L_lpips + λ_color L_color    (5)

where λ_id = 0.4, λ_lpips = 0.8, λ_color = 0.8 are set empirically. We empirically choose the best loss terms to provide good results. An ablation study of these terms is provided in the results section.
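For concreteness, the sketch below shows one way the mapper objective in Eq. (5) could be assembled in PyTorch. It is not the authors' code: the ArcFace embedder and the stylized-domain segmentation model are stand-ins for components the paper assumes, the class indices are hypothetical, and the LPIPS term uses the public `lpips` package as a convenient implementation of the perceptual metric.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; perceptual metric of Zhang et al. 2018

lpips_fn = lpips.LPIPS(net="vgg")  # plays the role of the feature extractor F
HAIR, SKIN = 1, 2                  # hypothetical segmentation class indices

def mean_color(img, seg, cls):
    """Mean RGB color of pixels labeled `cls`; img: (B,3,H,W), seg: (B,H,W)."""
    mask = (seg == cls).unsqueeze(1).float()
    return (img * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1.0)

def mapper_loss(img_style, img_imitate, arcface, seg_model,
                lambda_id=0.4, lambda_lpips=0.8, lambda_color=0.8):
    """Sketch of Eq. (5): weighted sum of identity, LPIPS and color terms.
    `arcface` returns a face embedding; `seg_model` a stylized-domain class map."""
    # Identity term: 1 - cosine similarity of face embeddings
    loss_id = 1.0 - F.cosine_similarity(arcface(img_style),
                                        arcface(img_imitate), dim=1).mean()
    # Perceptual term (lpips expects images roughly in [-1, 1])
    loss_lpips = lpips_fn(img_style, img_imitate).mean()
    # Color term of Eq. (4): match mean hair/skin colors of the two renderings
    seg_s, seg_i = seg_model(img_style), seg_model(img_imitate)
    loss_color = sum(((mean_color(img_style, seg_s, c)
                       - mean_color(img_imitate, seg_i, c)) ** 2).sum(dim=1).mean()
                     for c in (HAIR, SKIN))
    return lambda_id * loss_id + lambda_lpips * loss_lpips + lambda_color * loss_color
```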
Differentiable Imitator: The imitator is a neural renderer trained to replicate the output of the graphics engine as closely as possible given an input avatar vector. The imitator has the important property of differentiability, making it suitable for inclusion in an optimization framework. We leverage an existing neural model [Karras et al. 2019] as the backbone generator, which is capable of generating high quality avatar renderings. We train it with synthetic avatar data in a supervised manner. See appendix B.1 for details.

Discrete Parameters: Solving for discrete parameters is challenging because of unstable convergence. Some methods handle this via quantization during optimization [Bengio et al. 2013; Cheng et al. 2018; Jang et al. 2016; Van Den Oord et al. 2017]. However, we found that quantization after optimization, which relaxes the discrete constraint during training and re-applies it as postprocessing, is more effective for our task. Below we describe the relaxed optimization, and in Sec. 3.3 we present the quantization method.
Our solution to training discrete parameters in the mapper makes use of the imitator's interpolation property. When mixing two avatar vectors, the imitator still produces a valid rendering. That is, given the one-hot encodings v1 and v2 of two hair or beard types, their linear interpolation v_mix = (1 − α)·v1 + α·v2 (α ∈ [0, 1]) produces a valid result. Please see appendix B.1 for details. Thus, when training the mapper we do not strictly enforce discrete parameters, and instead apply a softmax function to the final activation of the mapper to allow a continuous optimization space while still discouraging mixtures of too many asset types.
We compare our relaxed training with a strict training method performing quantization during optimization. In the forward pass, it quantizes the softmax result by picking the entry with maximum probability. In the backward pass, it back-propagates unaltered gradients in a straight-through way [Bengio et al. 2013]. In Fig. 4, our method produces a much closer match to the stylization results.
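The contrast between the relaxed head and the strict straight-through baseline can be illustrated with a short sketch; the function names and the placement of the discrete block are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def relaxed_discrete_head(logits):
    """Relaxed formulation: a softmax keeps the discrete block of the avatar
    vector inside the probability simplex, so the differentiable imitator can
    still render the (possibly mixed) asset selection."""
    return F.softmax(logits, dim=-1)

def straight_through_head(logits):
    """Strict baseline: hard one-hot in the forward pass, unaltered gradients
    in the backward pass (straight-through estimator)."""
    probs = F.softmax(logits, dim=-1)
    hard = F.one_hot(probs.argmax(dim=-1), probs.shape[-1]).type_as(probs)
    return hard + probs - probs.detach()

def mix_assets(v1, v2, alpha):
    """Interpolation property used by the relaxed training: blending two
    one-hot asset codes still yields a vector the imitator can render."""
    return (1.0 - alpha) * v1 + alpha * v2
```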
(a) Relaxed Result   (b) Directly Classify   (c) Our Conversion
Figure 5: Avatar Vector Conversion is necessary to convert the relaxed result produced during parameterization into discrete types suitable for the graphics engine. Direct classification often fails to select the best type. Our conversion selects the best match to the relaxed type by searching through all available discrete types. Notice in this example that the skin tone and hair type are much closer using our method.

3.3 Avatar Vector Conversion
The graphics engine requires discrete inputs for attributes such as hair and glasses. However, the mapper module in Avatar Parameterization produces continuous values. One straightforward approach for discretization is to pick the type with the highest probability given the softmax result. However, we observe that this approach does not achieve good results, especially when dealing with multi-class attributes (e.g. 45 hair types). The challenge is that the solution space is under-constrained. Medium length hair can be achieved by selecting the medium length hair type, or by mixing between short and long hair types. In the latter case, simply selecting the highest probability of short or long hair is clearly not optimal.
We discretize the relaxed avatar vector via searching over all candidates from the asset list for each attribute, while fixing all other parameters. Using the image result from the imitator I_imitate as the target, we use the loss function from Eq. 5 as an objective to measure the similarity between I_imitate and the candidate result I_cand. By minimizing the objective, we can find the best solution for each attribute. The selections for each attribute are combined to create the avatar vector used for graphics rendering and animation. Fig. 5 provides a comparison of direct classification and our method. Note that direct classification makes incorrect choices for hair type and skin color, while ours closely matches the reference image.
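A minimal sketch of this appearance-based search is shown below. The attribute layout, the `imitator` call signature, and the loss callable are assumptions made for illustration; the idea is simply an exhaustive per-attribute search scored by the Eq. (5) objective.

```python
import torch

def convert_attribute(relaxed_vec, attr_slice, num_types,
                      imitator, target_img, loss_fn):
    """Try every discrete candidate for one attribute (all other parameters
    fixed) and keep the one whose rendering best matches the target, where the
    target is the imitator rendering of the relaxed avatar vector.
    `attr_slice` is a Python slice marking this attribute's one-hot block."""
    best_idx, best_loss = 0, float("inf")
    for idx in range(num_types):
        candidate = relaxed_vec.clone()
        candidate[attr_slice] = 0.0
        candidate[attr_slice.start + idx] = 1.0   # strict one-hot for this attribute
        with torch.no_grad():
            candidate_img = imitator(candidate.unsqueeze(0))
        loss = loss_fn(target_img, candidate_img).item()
        if loss < best_loss:
            best_idx, best_loss = idx, loss
    return best_idx
```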
4 EXPERIMENTAL ANALYSIS
Cascaded Domain Bridging: To illustrate the effect of each stage in the proposed three-stage pipeline, the intermediate results are visualized in Fig. 6. Notice how the three stages progressively bridge the domain gap between real images and stylized avatars. To measure how close the intermediate results are in comparison to the target avatar domain, we use the perceptual metric FID [Kilgour et al. 2019]. Notice that the FID becomes lower after each stage, demonstrating the gradual reduction of the domain gap.
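The paper does not give its evaluation code; for reference, the snippet below is only the standard Fréchet-distance computation from pooled feature activations (e.g. Inception features of two image sets), which is how FID numbers like those in Fig. 6 are typically obtained.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}) for two sets of
    feature vectors, each shaped (num_images, feature_dim)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):       # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```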

FID = 236.8   FID = 38.7   FID = 17.9   In domain
(a) Input   (b) Stylized   (c) Parameterized   (d) Converted
Figure 6: Progressive domain crossing. (b) At the portrait stylization stage, the images may still contain characteristics outside the domain of a graphics avatar, such as hair shape and non-frontal pose. (c) At the parameterization stage, the images are within the target domain, but may contain mixtures of components. (d) Finally, after vector conversion the output is a strict avatar vector which can be rendered by the graphics engine. Using FID as a measure of image distribution similarity, notice that each step brings us closer to the final target avatar domain. ©Marcin Wichary, TechCrunch and Vanity Productions.

Table 1: Numerical results from two user studies. Our method is judged to produce better avatars than the baseline methods, approaching the quality of manual work. Attribute evaluation: judge whether a specific attribute of the created avatar matches the human image. Matching: choose the correct one out of four avatars which matches the human image.

                     Attribute Evaluation                                   Match
                     beard   face    brow    hair    hair    skin          Task
                     type    shape   type    color   style   tone
  F2P [2019]         0.36    0.46    0.22    0.21    0.12    0.36          0.67
  CNN                0.17    0.54    0.22    0.46    0.30    0.50          0.57
  Stylization+CNN    0.45    0.69    0.38    0.57    0.43    0.66          0.82
  Ours               0.82    0.94    0.88    0.82    0.72    0.82          0.92
  Manual             0.94    0.97    0.85    0.90    0.86    0.94          0.96

Table 2: Ablation study for mapper training losses. Users picked the best matching avatar from the six candidates produced by loss combinations. The scores show the fraction of each combination picked. L_LPIPS is the most significant component, while L_id and L_color also improve the results.

  ID      LPIPS    ID+LPIPS   ID+Color   LPIPS+Color   ID+LPIPS+Color
  9.3%    17.8%    18.7%      14.8%      19.1%         20.3%

Visual Comparison with Baseline Methods: We compare the proposed method against a number of baselines, shown in Fig. 7. CNN is a naive supervised method using rendered avatar images to train a CNN [Sandler et al. 2018] to fit ground truth parameters. The CNN is then applied on the segmented head region of the input image. The domain gap causes the CNN to make poor predictions. Our stylization + CNN narrows the domain gap by applying the CNN to our stylized results. This noticeably improves predictions, however errors in hair and face coloration remain. Since the CNN is only trained on synthetic data, it cannot regress the parameters properly due to the domain gap between training and test data, even for stylized images. F2P [2019] is a self-supervised optimization-based method designed for semi-realistic avatars. This method fails to do well, likely because it naively aligns the segmentation of real faces and the avatar faces, without considering the domain gap. Manual results were created by expert-trained users. Given a real face, the users were asked to build an avatar that preserves personal identity while demonstrating high attractiveness based on their own judgement. Visually, our method shows a quality similar to manual creation, demonstrating the utility of our method.

Numerical Comparison with Baseline Methods: To evaluate results numerically we rely on judgements made by human observers recruited through Amazon Mechanical Turk³. We conduct two user studies for quantitative evaluation: Attribute Evaluation and Matching. We perform attribute evaluation to assess whether users believe that specific identity attributes such as hair color and style match the source photograph, using a yes/no selection. 330 opinions were collected for each of 6 attributes. Table 1 shows the results, indicating that our method retains photograph identity better than the baseline. In the matching task, we evaluate whether an avatar retains personal identity overall. Four random and diverse images were used to create avatars, and the subject must choose which is the correct match to a specific photograph. A total of 990 judgements were collected. Avatars created with our method were identified correctly significantly more often than baseline methods, approaching the level of manually created avatars.

Portrait Stylization Ablation: To study the impact of Portrait Stylization on the complete avatar creation pipeline we compare three options, shown in Fig. 8. No stylization removes this stage entirely and uses the real image as input to the parameterization loss calculation. Without stylization, the parameterization module tries to match the real image with the target stylized avatar, leading to poor visual quality. AgileGAN [Song et al. 2021] is a state-of-the-art stylization method. It provides stylization and thus improves the final avatar attractiveness compared to no stylization. However, it cannot remove the impact of expressions and does not handle glasses well. In Row 1 (b), the smile expression is explained as a big mouth in the fitting stage, and personal information like glasses is not preserved in Row 2 (b). Our method addresses these issues and achieves better results in both visual quality and personal identity.

Mapper Losses Ablation: To study the importance of including all losses while training the Mapper, we generate results using different permutations of the loss terms (identity, LPIPS, color). We then collected 990 user judgements from Amazon Mechanical Turk to select the best matching results to the input image among the six permutation results. Table 2 shows the fraction of each option selected.

³ https://www.mturk.com/

(a) Input   (b) Ours   (c) CNN   (d) Stylization + CNN   (e) F2P [2019]   (f) Manual
Figure 7: Results comparison. (a) Given an input image, (b) our method produces an avatar in the target cartoon style that looks similar to the user. (c) A CNN trained on synthetic data produces an incorrect beard, hair style, and glasses on real image inputs due to the significant domain gap. (d) Applying the CNN instead to the results of stylization reduces the domain gap and thus improves results, however significant errors remain. (e) F2P [Shi et al. 2019], a baseline method intended to produce semi-realistic avatars, does not consider the domain gap and thus produces poor results when used with stylized avatars. (f) Manual results were created by expert-trained users. Our results approximate the quality obtainable through manual creation. ©Sebastiaan ter Burg, NIGP, YayA Lee and S Pakhrin.

The full set of losses achieves the best score by a small margin, matching our observations that the overall method is robust to the precise selection of loss, but that the additional terms help in some cases.

5 LIMITATIONS
We observe two main limitations to our method. First, our method occasionally produces wrong predictions on assets covering a small area, because their contribution to the loss is small and gets ignored. The eye color in Fig. 9 (a) is an example of this difficulty. Redesigning the loss function might resolve this problem. Second, lighting is not fully normalized in the stylization stage, leading to incorrect skin tone estimates when there are strong shadows, as shown in Fig. 9 (b). This problem could potentially be addressed by incorporating intrinsic decomposition into the pipeline. In addition to the limitations of our method, we experience a loss of ethnicity in the final results, which is mainly introduced by the graphics engine, as also evidenced by the manually-created results. This issue could be addressed by improving the diversity of the avatar system.

6 CONCLUSION
In summary, we present a self-supervised stylized avatar auto-creation method with cascaded domain crossing. Our method demonstrates that the gap between the real image domain and the target avatar domain can be progressively bridged with a three-stage pipeline: portrait stylization, self-supervised avatar parameterization, and avatar vector conversion. Each stage is carefully designed and cannot be simply removed. Experimental results show that our approach produces high quality, attractive 3D avatars with personal identities preserved. In the future, we will extend the proposed pipeline to other domains, such as cubism and caricature avatars.

Input   (a) No stylization   (b) AgileGAN   (c) Ours
Figure 8: We ablate by removing the stylization stage, as well as replacing our stylization with a state-of-the-art method. In each case the final renderings from the graphics engine are shown. (a) Fitting directly on a user image results in an avatar that lacks attractiveness. (b) Replacing our stylization with AgileGAN [2021] suffers from missing personal information such as glasses, and from artifacts where smiles are misinterpreted as heavy lips or a mustache. (c) Our stylization retains personal features like glasses, and generates visually appealing results in spite of expressions. ©Chang-Ching Su and Luca Boldrini.

(a) Limitation - Small areas   (b) Limitation - Shadows
Figure 9: Limitations: (a) failure on a parameter (eye color) affecting a small number of pixels. (b) incorrect skin tone prediction caused by shadows. ©Daniel Åberg and Peter Bright.

REFERENCES
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-quality single-shot capture of facial geometry. In ACM SIGGRAPH 2010 papers. 1–9.
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. 187–194.
Hongrui Cai, Yudong Guo, Zhuang Peng, and Juyong Zhang. 2021. Landmark detection and 3D face reconstruction for caricature using a nonlinear parametric model. Graphical Models 115 (2021), 101103.
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics 35, 4 (2016).
Kaidi Cao, Jing Liao, and Lu Yuan. 2018. CariGANs: Unpaired Photo-to-Caricature Translation.
Zhixiang Chen and Tae-Kyun Kim. 2021. Learning Feature Aggregation for Deep 3D Morphable Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13164–13173.
Pengyu Cheng, Chang Liu, Chunyuan Li, Dinghan Shen, Ricardo Henao, and Lawrence Carin. 2018. Straight-through estimator as projected Wasserstein gradient flow. In Neural Information Processing Systems (NeurIPS) Workshop.
Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019a. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019b. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
Bernhard Egger, William AP Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhoefer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, et al. 2020. 3D morphable face models—past, present, and future. ACM Transactions on Graphics (TOG) 39, 5 (2020), 1–38.
Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. 2021. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946 [cs.CV]
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2414–2423.
Xiaoguang Han, Chang Gao, and Yizhou Yu. 2017. DeepSketch2Face: a deep learning based sketching system for 3D face and caricature modeling. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
Xiaoguang Han, Kangcheng Hou, Dong Du, Yuda Qiu, Shuguang Cui, Kun Zhou, and Yizhou Yu. 2018. CaricatureShop: Personalized and photorealistic caricature sketching. IEEE Transactions on Visualization and Computer Graphics 26, 7 (2018), 2349–2361.
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (TOG) 36, 6 (2017), 1–14.
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–14.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1125–1134.
Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8110–8119.
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. 2019. Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms. In INTERSPEECH. 2350–2354.
Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2020. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Thomas Lewiner, Thales Vieira, Dimas Martínez, Adelailson Peixoto, Vinícius Mello, and Luiz Velho. 2011. Interactive 3D caricature from harmonic exaggeration. Computers & Graphics 35, 3 (2011), 586–595.
Song Li, Songzhi Su, Juncong Lin, Guorong Cai, and Li Sun. 2021. Deep 3D caricature face generation with identity and structure consistency. Neurocomputing 454 (2021), 178–188.
Jiangke Lin, Yi Yuan, and Zhengxia Zou. 2021. MeInGame: Create a Game Character Face from a Single Portrait. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 311–319.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision. 3730–3738.
Huiwen Luo, Koki Nagano, Han-Wei Kung, Qingguo Xu, Zejian Wang, Lingyu Wei, Liwen Hu, and Hao Li. 2021. Normalized Avatar Synthesis Using StyleGAN and Perceptual Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11662–11672.
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge?. In International Conference on Machine Learning. PMLR, 3481–3490.
Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. 2020. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision. Springer, 319–345.
Weilong Peng, Zhiyong Feng, Chao Xu, and Yong Su. 2017. Parametric t-spline face morphable model for detailed fitting in shape subspace. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6139–6147.
Justin NM Pinkney and Doron Adler. 2020. Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020).

Yuda Qiu, Xiaojie Xu, Lingteng Qiu, Yan Pan, Yushuang Wu, Weikai Chen, and Xiaoguang Han. 2021. 3DCaricShop: A dataset and a baseline method for single-view 3D caricature face reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10236–10245.
Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
Tianyang Shi, Yi Yuan, Changjie Fan, Zhengxia Zou, Zhenwei Shi, and Yong Liu. 2019. Face-to-parameter translation for game character auto-creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 161–170.
Tianyang Shi, Zhengxia Zuo, Yi Yuan, and Changjie Fan. 2020. Fast and Robust Face-to-Parameter Translation for Game Character Auto-Creation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1733–1740.
Guoxian Song, Linjie Luo, Jing Liu, Wan-Chun Ma, Chunpong Lai, Chuanxia Zheng, and Tat-Jen Cham. 2021. AgileGAN: stylizing portraits by inversion-consistent transfer learning. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–13.
Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. 2021. Designing an encoder for StyleGAN image manipulation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–14.
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017).
Roberto C Cavalcante Vieira, Creto A Vidal, and Joaquim Bento Cavalcante-Neto. 2013. Three-dimensional face caricaturing by anthropometric distortions. In 2013 XXVI Conference on Graphics, Patterns and Images. IEEE, 163–170.
Qianyi Wu, Juyong Zhang, Yu-Kun Lai, Jianmin Zheng, and Jianfei Cai. 2018. Alive caricature from 2D to 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7336–7345.
Jonas Wulff and Antonio Torralba. 2020. Improving Inversion and Generation Diversity in StyleGAN using a Gaussianized Latent Space. In Conference on Neural Information Processing Systems.
Sicheng Xu, Jiaolong Yang, Dong Chen, Fang Wen, Yu Deng, Yunde Jia, and Xin Tong. 2020. Deep 3D portrait from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7710–7720.
Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. 2020. FaceScape: A Large-Scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Zipeng Ye, Mengfei Xia, Yanan Sun, Ran Yi, Minjing Yu, Juyong Zhang, Yu-Kun Lai, and Yong-Jin Liu. 2021. 3D-CariGAN: an end-to-end solution to 3D caricature generation from normal face photos. IEEE Transactions on Visualization and Computer Graphics (2021).
Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 325–341.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. 2021. Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks. arXiv:2110.08398 [cs.CV]
Michael Zollhöfer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick Pérez, Marc Stamminger, Matthias Nießner, and Christian Theobalt. 2018. State of the art on monocular 3D face reconstruction, tracking, and applications. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 523–550.

A PORTRAIT STYLIZATION DETAILS
Segmentation Models: The avatar segmentation model is trained using 20k randomly sampled avatar vectors with neutral pose, expression and illumination. For real image segmentation, we used an open-source pre-trained BiSeNet module⁴ [Yu et al. 2018].

Distribution Prior W: To sample the W+ distribution prior, we invert the CelebA dataset [Liu et al. 2015] into W+ space using a pre-trained e4e encoder [Tov et al. 2021].

Normalized Style Exemplar Set Y: For training the stylized generator G_φt, we synthetically rendered a diverse set of 150 avatar images with normalized facial expressions.

⁴ https://github.com/zllrunning/face-parsing.PyTorch

B AVATAR PARAMETERIZATION DETAILS
B.1 Imitator
To train our module in a self-supervised way, we plug a differentiable neural renderer (i.e. the imitator) into our learning framework. As we mentioned in the main paper, the imitator can take a relaxed avatar vector as input, although the imitator itself is trained with strict avatar vectors. Whether the input is a relaxed or a strict avatar vector, it can produce a valid rendering. In this way, we can supervise the training in image space without any ground truth for the parameters. Due to the differentiability of the imitator, the parameterization stage can be trained with gradient descent. To achieve high fidelity rendering quality, we leverage the StyleGAN2 generator [Karras et al. 2019] as our backbone, which is capable of generating high quality renderings matching the graphics engine. The imitator consists of an encoder E_i implemented as an MLP and a generator G_i adopted from StyleGAN2. The encoder translates an input avatar vector to a latent code w+. The generator then produces a high-quality image given the latent code.

Training: In order to fully utilize the image generation capability of StyleGAN2, we propose to train the imitator in two steps: 1) we first train a StyleGAN2 from scratch with random rendering samples generated by our graphics engine to obtain a high-quality image generator, without any labels or conditions; then 2) we train the encoder and the generator together with images and corresponding labels, resulting in a conditional generator. Given an avatar vector v, a target image I_gt, and the generated image I_gen = G_i(E_i(v)), we use the following loss function combination to perform the second training step:

  L_imitator = λ_1 ‖I_gen − I_gt‖₁ + λ_2 L_lpips + λ_3 L_id    (6)

where the first term is an L1 loss, which encourages less blurring than L2. In addition, L_lpips is the LPIPS loss adopted from [Zhang et al. 2018],

  L_lpips = ‖F(I_1) − F(I_2)‖₂    (7)

where F denotes the perceptual feature extractor. L_id is the identity loss which measures the cosine similarity between two faces, built upon a pretrained ArcFace [Deng et al. 2019a] face recognition network R,

  L_id = 1 − cos(R(I_1), R(I_2))    (8)

We set λ_1 = 1.0, λ_2 = 0.8, λ_3 = 1.0 empirically.
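To make the encoder/generator split concrete, here is a small sketch of an imitator encoder in the spirit of the description above: one small MLP per style layer, producing a w+ code for a StyleGAN2-style generator. The hidden width and activation are illustrative choices, not values given in the paper.

```python
import torch
import torch.nn as nn

class ImitatorEncoder(nn.Module):
    """E_i: maps an avatar vector (continuous values plus one-hot blocks,
    possibly relaxed) to a w+ code of shape (num_styles, style_dim)."""
    def __init__(self, avatar_dim, num_styles=14, style_dim=512, hidden=256):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(avatar_dim, hidden),
                          nn.LeakyReLU(0.2),
                          nn.Linear(hidden, style_dim))
            for _ in range(num_styles)    # one small MLP per style layer
        ])

    def forward(self, avatar_vec):                      # (B, avatar_dim)
        styles = [head(avatar_vec) for head in self.heads]
        return torch.stack(styles, dim=1)               # (B, num_styles, style_dim)

# The w+ code is fed to a StyleGAN2 generator G_i; both are finetuned jointly
# under Eq. (6) after the generator is pretrained on random engine renderings.
```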
Interpolation property: Fig. 10 provides an example of the interpolation property of the imitator which enables relaxed optimization over the discrete parameters.

Implementation: To train the imitator, we randomly generate 100,000 images and corresponding parameters. Note that although random sampling leads to strange avatars, our imitator can generate images matching the graphics engine well by seeing plenty of samples in the parameter space. Please refer to our supplementary video for a side-by-side comparison.

Figure 10: Interpolation of avatar vectors. The neural rendering imitator which temporarily replaces the traditional graphics engine is differentiable, allowing the relaxation of the strict constraint on discrete types. Linear interpolation between two avatar vectors results in the gradual disappearance of the beard and the gradual growth of the hair.

We train StyleGAN2 using the official source code⁵ with images of size 256 × 256 × 3, thus the latent code w+ has a shape of 14 × 512. We build the encoder E_i with 14 individual small MLPs, each responsible for mapping from the input vector to one latent style. Given the pretrained generator, we train the encoder and simultaneously finetune the generator with Adam [Kingma and Ba 2015]. We set the initial learning rate to 0.01 and decay it by 0.5 every two epochs. In our experiments, it takes around 20 epochs to converge.

B.2 Mapper
We use CelebA-HQ [Lee et al. 2020] and FFHQ [Karras et al. 2019] as our training data. To collect a high quality dataset for training, we use the Azure Face API⁶ to analyze the facial attributes and keep only facial images that meet our requirements:
1) within a limited pose range (yaw < 8°, pitch < 8°, roll < 5°)
2) without headwear
3) without extreme expressions
4) without any occlusions
Finally, we collect 21,522 images in total for mapper training.
The input is an 18 × 512 latent code taken from the Stylization module. Each of the 18 layers of the latent code is passed to an individual MLP. The output features are then concatenated together. After that, we apply two MLP heads to generate the continuous and discrete parameters separately.
We apply a scaling before the softmax function for the discrete parameters:

  S(x)_k = e^(β·x_k) / Σ_{i=1}^{N} e^(β·x_i),   k = 1, ..., N    (9)

where β > 1 is a coefficient that performs non-maximum suppression over types that contribute less than the dominant ones, and N is the number of discrete types. During training, we gradually increase the coefficient β to perform an easy-to-hard training by decreasing the smoothness. Empirically, we increase β by 1 each epoch. We train the mapper for 20 epochs.
2015]. We set the initial learning as 0.01 and decay it by 0.5 each two
ally increase the coefficient 𝛽 to perform an easy-to-hard training
epochs. In our experiments, it takes around 20 epochs to converge.
by decreasing the smoothness. Empirically, we increase 𝛽 by 1 for
each epoch. We train the mapper for 20 epochs.
⁵ https://github.com/NVlabs/stylegan2-ada-pytorch
⁶ https://azure.microsoft.com/en-us/services/cognitive-services/face
