3DAvatarGAN: Bridging Domains for Personalized Editable Avatars (CVPR 2023)
Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, Menglei Chai, Aliaksandr Siarohin
Figure 1. Editable 3D avatars. We present 3DAvatarGAN, a 3D GAN able to produce and edit personalized 3D avatars from a single photograph (real or generated). Our method distills information from a 2D-GAN trained on artistic 2D datasets such as Caricatures, Pixar toons, Cartoons, and Comics, and requires no camera annotations.
is feasible with datasets containing objects with highly consistent geometry, enabling a 3D-GAN to learn a distribution of shapes and textures. In contrast, artistically stylized datasets [25, 65] have arbitrary exaggerations of both geometry and texture; for example, the nose, cheeks, and eyes can be arbitrarily drawn, depending on the style of the artist as well as on the features of the subject, see Fig. 1. Training a 3D-GAN on such data becomes problematic due to the challenge of learning such an arbitrary distribution of geometry and texture. In our experiments (Sec. 5.1), 3D-GANs [10] generate flat geometry and essentially become 2D-GANs. A natural question arises: can a 3D-GAN synthesize consistent novel views of images belonging to artistically stylized domains, such as the ones in Fig. 1?

In this work, we propose a domain-adaptation framework that allows us to answer the question positively. Specifically, we fine-tune a pre-trained 3D-GAN using a 2D-GAN trained on a target domain. Despite being well explored for 2D-GANs [25, 65], existing domain adaptation techniques are not directly applicable to 3D-GANs, due to the nature of 3D data and the characteristics of 3D generators.

The geometry and texture of stylized 2D datasets can be arbitrarily exaggerated depending on the context, artist, and production requirements. Due to this, no reliable way to estimate camera parameters for each image exists, whether using an off-the-shelf pose detector [72] or a manual labeling effort. To enable the training of 3D-GANs on such challenging datasets, we propose three contributions. (1) An optimization-based method to align distributions of camera parameters between domains. (2) Texture, depth, and geometry regularizations to avoid degenerate, flat solutions and ensure high visual quality; furthermore, we redesign the discriminator training to make it compatible with our task. We then propose (3) a Thin Plate Spline (TPS) 3D deformation module operating on a tri-plane representation to allow for certain large and sometimes extreme geometric deformations, which are so typical in artistic domains.

The proposed adaptation framework enables the training of 3D-GANs on complex and challenging artistic data. The previous success of domain adaptation in 2D-GANs unleashed a number of exciting applications in the content creation area [25, 65]. Given a single image, such methods first find a latent code corresponding to it using GAN inversion, followed by latent editing producing the desired effect in the image space. Compared to 2D-GANs, the latent space of 3D-GANs is more entangled, making it more challenging to link the latent spaces between domains and rendering the existing inversion and editing techniques not directly applicable. Hence, we take a step further and explore the use of our approach for 3D artistic avatar generation and editing. Our final contribution to enable such applications is (4) a new inversion method for coupled 3D-GANs.

In summary, the proposed domain-adaptation framework allows us to train 3D-GANs on challenging artistic datasets with exaggerated geometry and texture. We call our method 3DAvatarGAN as it, for the first time, offers generation, editing, and animation of personalized, stylized, artistic avatars obtained from a single image. Our results (see Sec. 5.2) show the high-quality 3D avatars possible with our method compared to naive fine-tuning.

2. Related Work

GANs and Semantic Image Editing. Generative Adversarial Networks (GANs) [19, 47] are one popular type of generative model, especially for smaller high-quality datasets such as FFHQ [32], AFHQ [14], and LSUN objects [67]. For these datasets, StyleGAN [28, 30, 32] can be considered the current state-of-the-art GAN [27, 28, 30, 32, 33]. The disentangled latent space learned by StyleGAN has been shown to exhibit semantic properties conducive to semantic image editing [1, 3, 16, 22, 36, 44, 51, 56, 62]. CLIP [46]-based image editing [2, 17, 44] and domain transfer [15, 70] are another set of works enabled by StyleGAN.

GAN Inversion. Algorithms to project existing images into a GAN latent space are a prerequisite for GAN-based image editing. There are mainly two types of methods to enable such a projection: optimization-based methods [1, 13, 57, 71] and encoder-based methods [5, 7, 48, 58, 69]. On top of both streams of methods, the generator weights can be further modified after obtaining initial inversion results [49].

Learning 3D-GANs with 2D Data. Previously, some approaches attempted to extract 3D structure from pre-trained 2D-GANs [42, 52]. Recently, inspired by Neural Radiance Fields (NeRF) [9, 37, 43, 68], novel GAN architectures have been proposed to combine implicit or explicit 3D representations with neural rendering techniques [11, 12, 20, 39-41, 50, 53, 55, 63, 64]. In our work, we build on EG3D [11], which has current state-of-the-art results for human faces trained on the FFHQ dataset.

Avatars and GANs. To generate new results in an artistic domain (e.g., anime or cartoons), a promising technique is to fine-tune an existing GAN pre-trained on photographs, e.g., [45, 54, 60]. Data augmentation and freezing the lower layers of the discriminator are useful tools when fine-tuning a 2D-GAN [28, 38]. One branch of methods [18, 44, 70] investigates domain adaptation when only a few examples or only text descriptions are available, while others focus on matching the distribution of artistic datasets with diverse shapes and styles; our work also falls in this domain. Among previous efforts, StyleCariGAN [25] proposes invertible modules in the generator to train and generate caricatures from real images. DualStyleGAN [65] learns two mapping networks in StyleGAN to control the style and structure of the new domain. Some works are trained on 3D data or require heavy labeling/engineering [21, 26, 66] and use 3D morphable models to map 2D images of
caricatures to 3D models. However, such models fail to model the hair, teeth, neck, and clothes, and suffer in texture quality. In this work, we are the first to tackle the problem of domain adaptation of 3D-GANs and to produce fully controllable 3D avatars. We employ 2D-to-3D domain adaptation and distillation and make use of synthetic 2D data from StyleCariGAN [25] and DualStyleGAN [65].

Figure 2. Comparison with naive fine-tuning. Comparison of generated 3D avatars with a naively fine-tuned generator Gbase (left sub-figures) versus our generator Gt (right sub-figures). The corresponding sub-figures show comparisons in terms of texture quality (top two rows) and geometry (bottom two rows). See Sec. 5.1 for details.

3. Domain Adaptation for 3D-GANs

The goal of domain adaptation for 3D-GANs is to adapt (both texture and geometry) to a particular style defined by a 2D dataset (Caricature, Anime, Pixar toons, Comic, and Cartoons [24, 25, 65] in our case). In contrast to 2D-StyleGAN-based fine-tuning methods that are conceptually simpler [29, 45], fine-tuning a 3D-GAN on 2D data introduces challenges in addition to domain differences, especially in maintaining the texture quality while preserving the geometry. Moreover, for these datasets, there is no explicit shape and camera information. We define the domain adaptation task as follows: given a prior 3D-GAN, i.e., EG3D (Gs) of the source domain (Ts), we aim to produce a 3D Avatar GAN (Gt) of the target domain (Tt) while maintaining the semantic, style, and geometric properties of Gs, and at the same time preserving the identity of the subject between the domains (Ts ↔ Tt). Refer to Fig. 4 in the supplementary for the pipeline figure. We represent G2D as a teacher 2D-GAN used for knowledge distillation, fine-tuned on the above datasets. Note that as Tt is not assumed to contain camera parameter annotations, the training scheme must suppress artifacts such as low-quality texture under different views and flat geometry (see Fig. 2). In the following, we discuss the details of our method.

3.1. How to align the cameras?

Selecting appropriate ranges for camera parameters is of paramount importance for high-fidelity geometry and texture detail. Typically, such parameters are empirically estimated, directly computed from the dataset using an off-the-shelf pose detector [10], or learned during training [8]. In the domains we aim to bridge, such as caricatures for which a 3D model may not even exist, directly estimating the camera distribution is problematic and, hence, is not assumed by our method. Instead, we find it essential to ensure that the camera parameter distribution is consistent across the source and target domains. For the target domain, we use StyleGAN2 trained on FFHQ, fine-tuned on artistic datasets [25, 65]. Assuming that the intrinsic parameters of all the cameras are the same, we aim to match the distribution of extrinsic camera parameters of Gs and G2D and train our final Gt using it (see the illustration in Fig. 2 of the supplementary materials). To this end, we define an optimization-based method to match the sought distributions. The first step is to identify a canonical pose image in G2D, where the yaw, pitch, and roll parameters are zero. According to Karras et al. [31], the image corresponding to the mean latent code satisfies this property. Let θ, φ be the camera Euler angles in a spherical coordinate system, r, c be the radius of the sphere and the camera look-at point, and M be a function that converts these parameters into the camera-to-world matrix. Let Is(w, θ, φ, c, r) = Gs(w, M(θ, φ, c, r)) and I2D(w) = G2D(w) represent an arbitrary image generated by Gs and G2D, respectively, given the w code variable. Let kd be the face key-points detected by the detector Kd [72]; then

(c′, r′) := arg min_(c,r) Lkd(Is(w′avg, 0, 0, c, r), I2D(wavg)),  (1)

where Lkd(I1, I2) = ∥kd(I1) − kd(I2)∥1, and wavg and w′avg are the mean w latent codes of G2D and Gs, respectively. In our results, r′ is determined to be 2.7 and c′ is approximately [0.0, 0.05, 0.17]. The next step is to determine a safe range of the θ and φ parameters. Following prior works, StyleFlow [3] and FreeStyleGAN [35] (see Fig. 5 of the paper), we set these parameters as θ′ ∈ [−0.45, 0.45] and φ′ ∈ [−0.35, 0.35] in radians.

3.2. What loss functions and regularizers to use?

Next, although the camera systems are aligned, the given dataset may not stem from a consistent 3D model, e.g., in the case of caricatures or cartoons. This entices the generator Gt to converge to an easier degenerate solution with flat geometry. Hence, to benefit from the geometric prior of Gs, another important step is to design the loss functions and regularizers for a selected set of parameters to update in Gt. Next, we discuss these design choices:
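Before detailing these choices, the camera-alignment step of Sec. 3.1, Eq. (1), can be summarized in a short, hedged sketch. The generators Gs and G2D, the keypoint detector kd, and the camera-matrix builder M are assumed interfaces here, not the released 3DAvatarGAN implementation.

```python
# Hedged sketch of the camera-alignment optimization in Eq. (1).
import torch

def align_cameras(G_s, G_2d, M, kd, w_avg_s, w_avg_2d, steps=500, lr=1e-2):
    """Optimize the look-at point c and radius r so that face keypoints of the
    canonical-pose render of G_s match those of the mean-latent image of G_2d."""
    target = kd(G_2d(w_avg_2d)).detach()          # kd(I_2D(w_avg))
    c = torch.zeros(3, requires_grad=True)        # camera look-at point
    r = torch.tensor(2.0, requires_grad=True)     # sphere radius (initial guess)
    opt = torch.optim.Adam([c, r], lr=lr)
    for _ in range(steps):
        cam = M(theta=0.0, phi=0.0, c=c, r=r)     # canonical yaw/pitch/roll = 0
        pred = kd(G_s(w_avg_s, cam))              # kd(I_s(w'_avg, 0, 0, c, r))
        loss = (pred - target).abs().sum()        # L1 keypoint loss L_kd
        opt.zero_grad()
        loss.backward()
        opt.step()
    return c.detach(), r.detach()
```

In the paper's setting, this optimization ends up around r′ ≈ 2.7 and c′ ≈ [0.0, 0.05, 0.17] (Sec. 3.1).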
Figure 3. Domain adaptation. Domain adaptation results of images from source domain Ts (top row in each sub-figure) to target domain
Tt . Rows two to five show corresponding 3D avatar results from different viewpoints.
Loss Functions. To ensure texture quality and diversity, we resort to the adversarial loss used to fine-tune GANs as our main loss function. We use the standard non-saturating loss to train the generator and discriminator networks used in EG3D [11]. We also perform lazy density regularization to ensure consistency of the density values in the final fine-tuned model Gt.

Texture Regularization. Since the texture can be entangled with the geometry information, determining which layers to update is important. To make use of the fine-style information encoded in later layers, it is essential to update the tRGB layer parameters (outputting tri-plane features) before the neural rendering stage. tRGB are convolutional layers that transform feature maps to 3 channels at each resolution (96 channels in the tri-planes). Moreover, since the network has to adapt to the color distribution of Tt, it is essential to update the decoder (MLP layers) of the neural rendering pipeline as well. Given the EG3D architecture, we also update the super-resolution layer parameters to ensure the coherency between the low-resolution and high-resolution outputs seen by the discriminator D.

Geometry Regularization. In order to allow the network to learn the structure distribution of Tt and at the same time ensure properties of the W and S latent spaces are preserved, we update the earlier layers with regularization. This also encourages the latent spaces of Ts and Tt to be easily linked. Essentially, we update the deviation parameter ∆s from the s activations of the S space [62]. The s activations are predicted by A(w), where A is the learned affine function in EG3D. The s activations scale the kernels of a particular layer. In order to preserve the identity as well as the geometry, such that the optimization of ∆s does not deviate too far away from the original domain Ts, we introduce a regularizer given by

R(∆s) := ∥∆s∥1.  (2)

Note that we apply the R(∆s) regularization in a lazy manner, i.e., with density regularization. Interestingly, after training, we can interpolate between the s and s + ∆s parameters to interpolate between the geometries of samples in Ts and Tt (see Fig. 5).

Depth Regularization. Next, we observe that even though the above design choice produces better geometry for Tt, some samples from Gt can still lead to flatter geometry, and it is hard to detect these cases. We found that the problem is related to the relative depth of the background to the foreground. To circumvent this problem, we use an additional regularization where we encourage the average background depth of Gt to be similar to Gs. Let Sb be a face background segmentation network [34]. We first compute the average background depth of the samples given by Gs. This average depth is given by

ad := (1/M) Σ_{n=1}^{M} (1/Nn) ∥Dn ⊙ Sb(In)∥²F.  (3)
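As a concrete illustration of Eq. (3), here is a hedged sketch of computing the background-depth statistic ad over generator samples; the sampling function and the background segmenter Sb are assumed interfaces, not the paper's released code. The resulting statistic is reused by the depth regularizer defined next.

```python
# Hedged sketch of the background-depth statistic a_d in Eq. (3).
import torch

@torch.no_grad()
def average_background_depth(sample_image_and_depth, S_b, num_samples=64):
    """a_d = (1/M) * sum_n (1/N_n) * ||D_n (Hadamard) S_b(I_n)||_F^2 over M samples."""
    total = 0.0
    for _ in range(num_samples):
        image, depth = sample_image_and_depth()   # I_n, D_n rendered by G_s
        mask = S_b(image)                         # S_b(I_n): 1 on background pixels
        n_bg = mask.sum().clamp(min=1)            # N_n, number of background pixels
        total = total + (depth * mask).pow(2).sum() / n_bg
    return total / num_samples                    # a_d, compared against G_t samples in Eq. (4)
```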
Here, Dn is the depth map of the image In sampled from Gs, ⊙ represents the Hadamard product, M is the number of sampled images, and Nn is the number of background pixels in In. Finally, the regularization is defined as:

R(D) := ∥ad · J − (Dt ⊙ Sb(It))∥F,  (4)

where Dt is the depth map of the image It sampled from Gt and J is the matrix of ones having the same spatial dimensions as Dt.

Figure 4. 3D avatars from real images. Projection of real images on the 3D avatar generators.

3.3. What discriminator to use?

Given that the data in Ts and Tt is not paired and Tt is not assumed to contain camera parameter annotations, the choice of the discriminator (D) used for this task is also a critical design choice. Essentially, we use the unconditional version of the dual discriminator proposed in EG3D, and hence, we do not condition the discriminator on the camera information. As a result, during training, Gt generates images with arbitrary pose using M(θ′, φ′, c′, r′), and the discriminator discriminates these images against arbitrary images from Tt. We train the discriminator from scratch and, in order to adapt Ts → Tt, we use the StyleGAN-ADA [28] training scheme and R1 regularization.

3.4. How to incorporate larger geometric deformations between domains?

While the regularizers are used to limit the geometric changes when adapting from Ts to Tt, modeling large geometric deformations, e.g., in the caricature dataset, is another challenge. One choice to edit the geometry is to use the properties of the tri-plane features learned by EG3D. We start out by analyzing these three planes in Gs. We observe that the frontal plane encodes most of the information required to render the final image. To quantify this, we sample images and depth maps from Gs and swap the front and the other planes between two random images. Then we compare the difference in RGB values of the images and the Chamfer distance of the depth maps. When swapping the frontal tri-planes, the final images are completely swapped, and the Chamfer distance changes by 80-90%, matching the swapped image's depth map. In the case of the other two planes, the RGB image is not much affected and the Chamfer distance of the depth maps changes by only 20-30% in most cases.

Given this analysis, we focus on manipulating the 2D front-plane features to learn additional deformations or exaggerations. We learn a TPS (Thin Plate Spline) [61] network on top of the front plane. Our TPS network is conditioned both on the front-plane features as well as the W space to enable multiple transformations. The architecture of the module is similar to the standard StyleGAN2 layer, with an MLP appended at the end to predict the control points that transform the features. Hence, as a byproduct, we also enable 3D-geometry editing guided by the learned latent space. We train this module separately after Gt has been trained. We find that joint training is unstable due to exploding gradients arising from the large domain gap between Ts and Tt in the initial stages. Formally, we define this transformation as:

T(w, f) := ∆c,  (5)

where w is the latent code, f is the front plane, and c are the control points.

Let cI be the initial control points producing an identity transformation, (c1, c2) be the control points corresponding to front planes (f1, f2) sampled using W codes (w1, w2), respectively, and (c′1, c′2) be the points with (w1, w2) swapped in the TPS module. To regularize and encourage the module to learn different deformations, we have

R(T1) := α Σ_{n=1}^{2} ∥cI − cn∥1 − β ∥c1 − c2∥1 − σ ∥c′1 − c′2∥1.  (6)

We use the initial control point regularization to prevent large deviations in the control points, which would otherwise explode. Additionally, to learn extreme exaggerations in Tt and, 'in expectation', conform to the target distribution of the dataset, we add an additional loss term. Let S(I) be the soft-argmax output of the face segmentation network [34] given an image I; assuming that S generalizes to caricatures, then

R(T2) := ∥S(Gt(w)) − S(It)∥1.  (7)
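To make Eqs. (6)-(7) concrete, here is a minimal, hedged sketch of how such control-point regularizers could be computed. The TPS module T, the segmentation network S, the generator Gt, and the weights alpha, beta, sigma are assumed interfaces and illustrative values, not the paper's released implementation.

```python
# Hedged sketch of the TPS control-point regularizers in Eqs. (6)-(7).
import torch

def tps_regularizers(T, S, G_t, c_identity, w1, w2, f1, f2, target_image,
                     alpha=1.0, beta=0.1, sigma=0.1):
    c1 = c_identity + T(w1, f1)            # control points for (w1, f1)
    c2 = c_identity + T(w2, f2)            # control points for (w2, f2)
    c1_swap = c_identity + T(w2, f1)       # w codes swapped inside the module
    c2_swap = c_identity + T(w1, f2)
    # Eq. (6): stay close to the identity transform, while pushing different and
    # swapped latent codes toward different deformations.
    r_t1 = (alpha * ((c_identity - c1).abs().sum() + (c_identity - c2).abs().sum())
            - beta * (c1 - c2).abs().sum()
            - sigma * (c1_swap - c2_swap).abs().sum())
    # Eq. (7): match the face-segmentation layout of the generated avatar to a
    # target-domain example, so exaggerations follow the dataset distribution.
    r_t2 = (S(G_t(w1)) - S(target_image)).abs().sum()
    return r_t1, r_t2
```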
GANSpace [22], StyleSpace [62], etc., and geometric edits using TPS (Sec. 3.4) and ∆s interpolation (Sec. 3.2). To perform video editing, we design an encoder for EG3D based on e4e [58] to encode videos and transfer the edits from Gs to Gt based on the w codes [4, 6, 59]. We leave a more fine-grained approach for video processing as future work.
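A hedged sketch of this w-code-based transfer is shown below; the EG3D-based e4e-style encoder E, the generators, and the edit direction are assumed interfaces, and the specific call signatures are illustrative only.

```python
# Hedged sketch of transferring an edit from the source generator G_s to the
# adapted generator G_t through a shared w code, as done per video frame.
import torch

@torch.no_grad()
def transfer_edit(E, G_s, G_t, frame, camera, edit_direction, strength=1.0):
    w = E(frame)                              # invert the frame into the W space of G_s
    w_edited = w + strength * edit_direction  # apply a semantic edit direction
    source_view = G_s(w_edited, camera)       # edited result in the source domain
    avatar_view = G_t(w_edited, camera)       # same code rendered by the 3D avatar GAN
    return source_view, avatar_view
```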
Figure 6. Deformations using TPS. Geometric edits using our proposed TPS (Thin Plate Spline) module learned on the frontal tri-plane
features. Each sub-figure shows a 3D avatar and three examples of TPS deformations sampled from the learned 3D deformation space.
formation in the earlier layers instead of being camera view-dependent. To quantify this, since pose information may not be available for some domains (e.g., cartoons), we compute the R(T2) scores between corresponding images in the domains Ts (Gs) and Tt (Gt and Gbase). Note that these scores are computed without the TPS module. Our scores are lower in all three metrics, hence validating that our method avoids the degenerate solution and preserves the geometric distribution of the prior. For a discussion of the TPS module and ablations, refer to the supplementary materials.

Identity Preservation. The identity preservation score is another important evaluation to check the quality of latent space linking between Gs and Gt. In Table 3, we compute the attribute loss (BCE loss) between the domains Ts and Tt using the attribute classifiers [24, 25]. Note that our method is able to preserve the identity better across the domains.

Figure 7. Local edits. Local edits performed on the 3D avatars using the S space.

5.2. Qualitative Results

For qualitative results, we show the results of the domain adaptation, as well as the personalized edits (geometric and semantic), performed on the resultant 3D avatars. First, in order to show the quality of domain adaptation, identity preservation, and geometric consistency, in Fig. 3, we show results from Gs and corresponding results from the 3D avatar generator Gt trained on the Caricature, Pixar toon, Cartoon, and Comic domains. Next, in order to show that the method generalizes to real images, we use the method described in
Figure 8. 3D avatar animation. Animation of 3D avatars generated using a driving video encoded in source domain Ts and applied to
samples in target domain Tt . The top row shows the driving video and the subsequent rows show generated animations using a random
Caricature or Pixar toon. The head pose is changed in each frame of the generated animation to show 3D consistency.
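As a final illustration, the attribute-based identity-preservation score discussed in Sec. 5.1 can be sketched as a BCE loss between attribute predictions for corresponding samples of Gs and Gt; the attribute classifier and generator interfaces below are assumptions for illustration, not the paper's evaluation code.

```python
# Hedged sketch of the attribute-consistency (identity-preservation) score.
import torch
import torch.nn.functional as F

@torch.no_grad()
def attribute_consistency(classifier, G_s, G_t, ws, camera):
    losses = []
    for w in ws:
        p_src = torch.sigmoid(classifier(G_s(w, camera)))  # attribute probabilities in T_s
        p_tgt = torch.sigmoid(classifier(G_t(w, camera)))  # attribute probabilities in T_t
        losses.append(F.binary_cross_entropy(p_tgt, p_src))
    return torch.stack(losses).mean()                      # lower = identity better preserved
```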
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432-4441, Seoul, Korea, 2019. IEEE.
[2] Rameen Abdal, Peihao Zhu, John Femiani, Niloy Mitra, and Peter Wonka. CLIP2StyleGAN: Unsupervised extraction of StyleGAN edit directions. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH '22, New York, NY, USA, 2022. Association for Computing Machinery.
[3] Rameen Abdal, Peihao Zhu, Niloy J. Mitra, and Peter Wonka. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Trans. Graph., 40(3), May 2021.
[4] Rameen Abdal, Peihao Zhu, Niloy J. Mitra, and Peter Wonka. Video2StyleGAN: Disentangling local and global variations in a video, 2022.
[5] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. ReStyle: A residual-based StyleGAN encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021.
[6] Yuval Alaluf, Or Patashnik, Zongze Wu, Asif Zamir, Eli Shechtman, Dani Lischinski, and Daniel Cohen-Or. Third time's the charm? Image and video editing with StyleGAN3. CoRR, abs/2201.13433, 2022.
[7] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H. Bermano. HyperStyle: StyleGAN inversion with hypernetworks for real image editing. CoRR, abs/2111.15666, 2021.
[8] Anonymous. 3D generation on ImageNet. In Open Review, 2023.
[9] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855-5864, 2021.
[10] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In arXiv, 2021.
[11] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks, 2021.
[12] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799-5809, 2021.
[13] Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, and Ming-Hsuan Yang. InOut: Diverse image outpainting via GAN inversion. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[14] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[15] Min Jin Chong and David A. Forsyth. JoJoGAN: One shot face stylization. CoRR, abs/2112.11641, 2021.
[16] Min Jin Chong, Hsin-Ying Lee, and David Forsyth. StyleGAN of all trades: Image manipulation with only pretrained StyleGAN. arXiv preprint arXiv:2111.01619, 2021.
[17] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946, 2021.
[18] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators, 2021.
[19] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[20] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D aware generator for high-resolution image synthesis. In International Conference on Learning Representations, 2022.
[21] Fangzhou Han, Shuquan Ye, Mingming He, Menglei Chai, and Jing Liao. Exemplar-based 3D portrait stylization. IEEE Transactions on Visualization and Computer Graphics, 2021.
[22] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering interpretable GAN controls. arXiv preprint arXiv:2004.02546, 2020.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[24] Jing Huo, Wenbin Li, Yinghuan Shi, Yang Gao, and Hujun Yin. WebCaricature: A benchmark for caricature recognition. In British Machine Vision Conference, 2018.
[25] Wonjong Jang, Gwangjin Ju, Yucheol Jung, Jiaolong Yang, Xin Tong, and Seungyong Lee. StyleCariGAN: Caricature generation via StyleGAN feature map modulation. 40(4), 2021.
[26] Yucheol Jung, Wonjong Jang, Soongjin Kim, Jiaolong Yang, Xin Tong, and Seungyong Lee. Deep deformable 3D caricatures with learned shape control. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings. ACM, August 2022.
[27] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation, 2017.
[28] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
[29] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020.
[30] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks, 2021.
[31] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019.
[32] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4217-4228, Dec. 2021.
[33] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
[34] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. MaskGAN: Towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[35] Thomas Leimkühler and George Drettakis. FreeStyleGAN: Free-view editable portrait rendering with the camera manifold. 40(6), 2021.
[36] Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, and Ming-Hsuan Yang. InfinityGAN: Towards infinite-pixel image synthesis. In International Conference on Learning Representations (ICLR), 2022.
[37] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405-421. Springer, 2020.
[38] Sangwoo Mo, Minsu Cho, and Jinwoo Shin. Freeze the discriminator: A simple baseline for fine-tuning GANs, 2020.
[39] Michael Niemeyer and Andreas Geiger. CAMPARI: Camera-aware decomposed generative neural radiance fields. In 2021 International Conference on 3D Vision (3DV), pages 951-961. IEEE, 2021.
[40] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453-11464, 2021.
[41] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. StyleSDF: High-resolution 3D-consistent image and geometry generation. arXiv preprint arXiv:2112.11427, 2021.
[42] Xingang Pan, Bo Dai, Ziwei Liu, Chen Change Loy, and Ping Luo. Do 2D GANs know 3D shape? Unsupervised 3D shape reconstruction from 2D image GANs. arXiv preprint arXiv:2011.00844, 2020.
[43] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. arXiv preprint arXiv:2011.12948, 2020.
[44] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery, 2021.
[45] Justin N. M. Pinkney and Doron Adler. Resolution dependent GAN interpolation for controllable image synthesis between domains, 2020.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021.
[47] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.
[48] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: A StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.
[49] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
[50] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[51] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. InterFaceGAN: Interpreting the disentangled face representation learned by GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[52] Yichun Shi, Divyansh Aggarwal, and Anil K. Jain. Lifting 2D StyleGAN for 3D-aware face generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6258-6266, 2021.
[53] Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3D generation on ImageNet. In International Conference on Learning Representations (ICLR), 2023.
[54] Guoxian Song, Linjie Luo, Jing Liu, Wan-Chun Ma, Chunpong Lai, Chuanxia Zheng, and Tat-Jen Cham. AgileGAN: Stylizing portraits by inversion-consistent transfer learning. ACM Trans. Graph., 40(4), July 2021.
[55] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. IDE-3D: Interactive disentangled editing for high-resolution 3D-aware portrait synthesis, 2022.
[56] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. StyleRig: Rigging StyleGAN for 3D control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6142-6151, 2020.
[57] Ayush Tewari, Mohamed Elgharib, Mallikarjun B R, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zöllhofer, and Christian Theobalt. PIE: Portrait image embedding for semantic control. Volume 39, December 2020.
[58] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. arXiv preprint arXiv:2102.02766, 2021.
[59] Rotem Tzaban, Ron Mokady, Rinon Gal, Amit H. Bermano, and Daniel Cohen-Or. Stitch it in time: GAN-based facial editing of real videos. CoRR, abs/2201.08361, 2022.
[60] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Cross-domain and disentangled face manipulation with 3D guidance. IEEE Transactions on Visualization and Computer Graphics, 2022.
[61] WarBean. tps-stn-pytorch. https://github.com/WarBean/tps_stn_pytorch.
[62] Zongze Wu, Dani Lischinski, and Eli Shechtman. StyleSpace analysis: Disentangled controls for StyleGAN image generation. arXiv preprint arXiv:2011.12799, 2020.
[63] Yinghao Xu, Menglei Chai, Zifan Shi, Sida Peng, Ivan Skorokhodov, Aliaksandr Siarohin, Ceyuan Yang, Yujun Shen, Hsin-Ying Lee, Bolei Zhou, et al. DisCoScene: Spatially disentangled generative radiance fields for controllable 3D-aware scene synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
[64] Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, and Bolei Zhou. 3D-aware image synthesis via learning structural and textural representations. arXiv preprint arXiv:2112.10759, 2021.
[65] Shuai Yang, Liming Jiang, Ziwei Liu, and Chen Change Loy. Pastiche master: Exemplar-based high-resolution portrait style transfer. In CVPR, 2022.
[66] Zipeng Ye, Mengfei Xia, Yanan Sun, Ran Yi, Minjing Yu, Juyong Zhang, Yu-Kun Lai, and Yong-Jin Liu. 3D-CariGAN: An end-to-end solution to 3D caricature generation from normal face photos. IEEE Transactions on Visualization and Computer Graphics, pages 1-1, 2021.
[67] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[68] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[69] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. In European Conference on Computer Vision, pages 592-608. Springer, 2020.
[70] Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. Mind the gap: Domain gap control for single shot domain adaptation for generative adversarial networks. In International Conference on Learning Representations, 2022.
[71] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and Peter Wonka. Improved StyleGAN embedding: Where are the good latents?, 2020.
[72] zllrunning. face-parsing.PyTorch. https://github.com/zllrunning/face-parsing.PyTorch.