Generative AI Summary
Abstract—As information exists in various modalities in the real world, effective interaction and fusion of multimodal information play a key role in the creation and perception of multimodal data in computer vision and deep learning research. With its superb power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Instead of providing explicit guidance for network training, multimodal guidance offers an intuitive and flexible means for
image synthesis and editing. On the other hand, this field is also facing several challenges in alignment of multimodal features,
synthesis of high-resolution images, faithful evaluation metrics, etc. In this survey, we comprehensively contextualize the recent
advances in multimodal image synthesis and editing and formulate taxonomies according to data modalities and model types. We start
with an introduction to different guidance modalities in image synthesis and editing, and then describe multimodal image synthesis and
editing approaches extensively according to their model types. After that, we describe benchmark datasets and evaluation metrics as
well as corresponding experimental results. Finally, we provide insights about the current research challenges and possible directions
for future research. A project associated with this survey is available at https://github.com/fnzhan/Generative-AI.
Index Terms—Multimodality, Image Synthesis & Editing, NeRFs, Diffusion Models, GANs, Autoregressive Models.
1 INTRODUCTION
Fig. 1. Illustration of multimodal image synthesis and editing. Typical guidance types include visual information (e.g., semantic maps, scene layouts, sketch maps, normal maps, depth, keypoints, Hough lines, scribbles, and Canny edges), text prompts (2D, 3D, and video), audio signals, scene graphs, brain signals, and mouse tracks. The samples are from [2], [8]–[16].
by Transformer suggest a possible route for autoregressive models [41] in MISE by accommodating the long-range dependency of sequences. Notably, both multimodal guidance and images can be represented in a common form of discrete tokens. For instance, texts can be naturally denoted by token sequences; audio and visual guidance, including images, can be represented as token sequences [42]. With such a unified discrete representation, the correlation between multimodal guidance and images can be well accommodated via Transformer-based autoregressive models, which have pushed the boundary of MISE significantly [2], [43], [44].

Most aforementioned methods work for 2D images regardless of the 3D essence of the real world. With the recent advance of neural rendering, especially Neural Radiance Fields (NeRF) [4], 3D-aware image synthesis and editing have attracted increasing attention from the community. Distinct from synthesis and editing on 2D images, 3D-aware MISE poses a bigger challenge due to the lack of multi-view data and the requirement of multi-view consistency during synthesis and editing. As a remedy, pre-trained 2D foundation models (e.g., CLIP [45] and Stable Diffusion [46]) can be employed to drive the NeRF optimization for view synthesis and editing [11], [47]. Besides, generative models like GANs and diffusion models can be combined with NeRF to train 3D-aware generative models on 2D images, where MISE can be performed by developing conditional NeRFs or inverting NeRFs [48], [49].

The contributions of this survey can be summarized in the following aspects:
• This survey covers extensive literature with regard to multimodal image synthesis and editing with a rational and structured framework.
• We provide a foundation of different types of guidance modality underlying multimodal image synthesis and editing tasks and elaborate the specifics of encoding approaches associated with the guidance modalities.
• We develop a taxonomy of the recent approaches according to the essential models and highlight the major strengths and weaknesses of existing models.
• This survey provides an overview of various datasets and evaluation metrics in multimodal image synthesis and editing, and critically evaluates the performance of contemporary methods.
• We summarize the open challenges in the current research and share our humble opinions on promising areas and directions for future research.

The remainder of this survey is organized as follows. Section 2 presents the modality foundations of MISE. Section 3 provides a comprehensive overview and description of MISE methods with detailed pipelines. Section 4 reviews the common datasets and evaluation metrics, with experimental results of typical methods. In Section 5, we discuss the main challenges and future research directions for MISE. Some social impact analysis and concluding remarks are drawn in Section 6 and Section 7, respectively.

2 MODALITY FOUNDATIONS
Each source or form of information can be called a modality. For example, people have the senses of touch, hearing, sight, and smell; the medium of information includes voice, video, text, etc.; and data are recorded by various sensors such as radar, infrared, and accelerometers. In terms of image synthesis and editing, we group the modality guidance into visual guidance, text guidance, audio guidance, and other modality guidance. A detailed description of each modality guidance together with related processing methods is presented in the following subsections.

2.1 Visual Guidance
Visual guidance has drawn widespread interest in the field of MISE due to its inherent capacity to convey spatial and structural details. Notably, it encapsulates specific image properties in pixel space, thereby offering an exceptional degree of control. This property of visual guidance facilitates interactive manipulation and precise handling during image synthesis, which can be crucial for achieving desired outcomes. As a pixel-level guidance, it can be seamlessly integrated into the image generation process, underscoring its versatility and extensive use in various image synthesis contexts. Common types of visual guidance encompass segmentation maps [5], [6], keypoints [89]–[91], sketches & edges & scribbles [51], [92]–[99], and scene layouts [100]–[104], as illustrated in Fig. 1. Besides, several studies investigate image synthesis conditioned on depth maps [2], [8], normal maps [8], trace maps [105], etc. The visual guidance can be obtained by employing pre-trained models (e.g., segmentation models, depth predictors, pose predictors), applying algorithms (e.g., Canny edges, Hough lines), or relying on manual effort (e.g., manual annotation, human scribbles). By modifying the visual guidance elements, like semantic maps, we can directly repurpose image synthesis techniques for various image editing tasks [106], [107], demonstrating the versatile applicability of visual guidance in the domain of MISE.

Visual Guidance Encoding. These visual cues, represented in 2D pixel space, can be interpreted as specific types of images, thereby permitting their direct encoding via numerous image encoding strategies such as Convolutional Neural Networks (CNNs) and Transformers. As the encoded features spatially align with image features, they can be smoothly integrated into networks via naive concatenation, SPADE [6], cross-attention mechanisms [46], etc.

2.2 Text Guidance
Compared with visual guidance, text guidance provides a more versatile and flexible way to express and describe visual concepts. This is because text can capture a wide range of ideas and details that may not be easily communicated through other means. Text descriptions can be ambiguous and open to interpretation, which is both a challenge and an opportunity. It is a challenge because it can lead to a wide array of possible images that accurately represent the text, making it harder to predict the outcome. However, it is also an opportunity because it allows for greater creativity and diversity in the resulting images. The text-to-image synthesis task [53], [108]–[110] aims to produce clear, photo-realistic images with high semantic relevance to the corresponding text guidance. Notably, text and images are different types of data, which makes it difficult to learn an accurate and reliable mapping from one to the other. Techniques for integrating text guidance, such as representation learning, play a crucial role in text-guided image synthesis and editing.

Text Guidance Encoding. Learning a faithful representation from a text description is a non-trivial task. There are a number of traditional text representations, such as Word2Vec [111] and Bag-of-Words [112]. With the prevalence of deep neural networks, Recurrent Neural Networks (RNNs) [109] and LSTMs [54] are widely adopted to encode texts as features [55]. With the development of pre-trained models in the natural language processing field, several studies [113], [114] also explore performing text encoding by leveraging large-scale pre-trained language models such as BERT [115]. Remarkably, with a large number of image-text pairs for training, Contrastive Language-Image Pre-training (CLIP) [45] yields informative text embeddings by learning the alignment of images and the corresponding captions, and has been widely adopted for text encoding.
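To make the text-encoding discussion concrete, below is a minimal sketch of extracting CLIP text embeddings that could serve as guidance features. It assumes the HuggingFace transformers package and the public openai/clip-vit-base-patch32 checkpoint; the encoder and prompts are illustrative, not the setup of any particular surveyed method.

```python
# Minimal sketch: encoding text prompts into guidance embeddings with a CLIP text encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

prompts = ["a bird spreading wings", "a woman wearing sunglasses"]
tokens = tokenizer(prompts, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**tokens)

per_token = out.last_hidden_state   # (batch, 77, 512) sequence features, e.g., for cross-attention
pooled = out.pooler_output          # (batch, 512) global text embedding, e.g., for concatenation
```

The pooled vector is convenient for global conditioning, while the per-token sequence is what cross-attention-based generators typically consume.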
2.3 Audio Guidance
Unlike text and visual guidance, audio guidance provides temporal information which can be utilized for generating dynamic or sequential visual content. The relationship between audio signals and images [116]–[118] is often more abstract compared to text or visual guidance. For instance, audio associated with certain actions or environments may suggest but not explicitly define visual content [119]; sound can carry emotional tone and nuanced context that isn't always clear in text or visual inputs. Thus, audio-guided MISE offers an interesting challenge of interpreting audio signals into visual content. This involves understanding and modeling the complex correlations between sound and visual elements, which has been explored in talking-face generation [57], [59], [60], [120], whose goal is to create realistic animations of a face speaking given an audio input.

Audio Guidance Encoding. An audio sequence can be generated from given videos, where a deep convolutional network is employed to extract features from video screenshots, followed by an LSTM [121] to generate the corresponding audio waveform.

2.4 Other Modality Guidance
Several other types of guidance have also been investigated to guide multimodal image synthesis and editing.

Scene Graph. Scene graphs represent scenes as directed graphs, where nodes are objects and edges give relationships between objects. Image generation conditioned on scene graphs allows reasoning about explicit object relationships and synthesizing faithful images with complex scene relationships. The guided scene graph can be encoded through a graph convolution network [125] which predicts object bounding boxes to yield a scene layout. For instance, Vo et al. [126] propose to predict relation units between objects, which are converted to a visual layout via a convolutional LSTM [127].

Brain Signal. Treating brain signals as a modality to synthesize or reconstruct visual images offers an exciting way to understand brain activity and facilitate brain-computer interfaces. Recently, several studies explore generating images from functional magnetic resonance imaging (fMRI). For example, Fang et al. [128] decode shape and semantic representations from the visual cortex, and then fuse them to generate images via GAN; Lin et al. [129] propose to map fMRI signals into the latent space of a pretrained StyleGAN to enable conditional generation; Takagi and Nishimoto [130] quantitatively interpret each component in a pretrained LDM [46] by mapping them into distinct brain regions.

Mouse Track. To achieve precise and flexible manipulation of image content, the mouse track [16] has recently emerged as a remarkable guidance in MISE. Specifically, users can select a set of 'handle points' and 'target points' within an image by simply clicking the mouse. The objective is to edit the image by steering these handle points to their respective target points. This innovative approach of mouse track guidance enables an image to be deformed with an impressive level of accuracy, and facilitates manipulation of various attributes such as pose, shape, and expression across a range of categories. The point motion can be integrated to supervise the editing via a pre-trained transformer based on optical flow [131], [132] or a shifted patch loss on the generator features [16].

3 METHODS
We broadly categorize the methods for MISE into five categories: GAN-based methods (Sec. 3.1), autoregressive methods (Sec. 3.3), diffusion-based methods (Sec. 3.2), NeRF-based methods (Sec. 3.4), and other methods (Sec. 3.5). We briefly summarize the strengths and weaknesses of the four main methods with representative references as shown in Table 1. In this section, we first discuss the GAN-based methods, which generally rely on GANs and their inversion. We then discuss the prevailing diffusion-based methods and autoregressive methods comprehensively. After that, we introduce NeRF for the challenging task of 3D-aware MISE. Later, we present several other methods for image synthesis and editing under the context of multimodal guidance. Finally, we compare and discuss the strengths and weaknesses of different generation architectures.

TABLE 1
The strengths and weaknesses of different model types for MISE tasks. Representative MISE works are also listed as references.

Fig. 2. Illustration of the conditional GAN framework with different condition incorporation mechanisms.

3.1 GAN-based Methods
GAN-based methods have been widely adopted for various MISE tasks by either developing conditional GANs (Sec. 3.1.1) or leveraging pre-trained unconditional GANs (Sec. 3.1.2). For conditional GANs, the multimodal condition can be directly incorporated into the generator to guide the generation process. For pre-trained unconditional GANs, GAN inversion is usually employed to perform various MISE tasks by operating on latent codes in latent spaces.

3.1.1 Conditional GANs
Conditional Generative Adversarial Networks (CGANs) [18] are extensions of the popular GAN architecture which allow for image generation with specific characteristics or attributes. The key idea behind CGANs is to condition the generation process on additional information, such as multimodal guidance in MISE tasks. This is achieved by feeding the additional information into both the generator and discriminator networks as extra guidance. The generator then learns to generate samples that not only fool the discriminator but also match the specified conditional information. In recent years, a range of designs have significantly boosted the performance of CGANs for MISE [110].1
1. Please refer to [110] for a detailed review of GAN-based text-to-image generation.

Condition Incorporation. To steer the generation process, it is necessary to incorporate multimodal conditions into the network effectively, as shown in Fig. 2. Generally, multimodal guidance can be uniformly encoded as 1-D features which can be concatenated with the features in networks [18], [59], [109]. For visual guidance that is spatially aligned with the target image, the condition can be directly encoded as 2D features which provide accurate spatial guidance for generation or editing [5]. However, the encoded 2D features struggle to capture complex scene structural relationships between the guidance and real images when there are very different views or severe deformations. Under such circumstances, an attention module can be employed to align the guidance with the target image as in [133]–[135]. Moreover, naively encoding the visual guidance with deep networks is suboptimal as part of the guidance information tends to be lost in normalization layers. Thus, a spatially-adaptive de-normalization (SPADE) [6] is introduced to inject the guided feature effectively, which is further extended to a semantic region-adaptive normalization [136] to achieve region-wise condition incorporation. Besides, by assessing the similarity between generated images and conditions, an attentional incorporation mechanism [54], [137]–[139] can be employed to direct the generator's attention to particular image regions during generation, which is particularly advantageous when dealing with complex conditional information, such as texts. Notably, complex conditions can also be mapped to an intermediary representation which facilitates more faithful image generation, e.g., an audio clip can be mapped to facial landmarks [58], [120] or 3DMM parameters [140] for talking-face generation. For sequential conditions such as audio [20], [59], [120], [123], [141]–[144], a recurrent condition incorporation mechanism is also widely adopted to account for temporal dependency such that smooth transitions can be achieved in sequential conditions.
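As a concrete illustration of the spatially-adaptive normalization idea mentioned above, the following PyTorch sketch modulates generator features with a spatially aligned condition map. Channel sizes and the single shared convolution are illustrative simplifications, not the exact SPADE [6] configuration.

```python
# Minimal sketch of spatially-adaptive (de-)normalization for injecting a spatially
# aligned visual condition (e.g., a semantic map) into generator features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    def __init__(self, feat_channels: int, cond_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, feat, cond):
        # Resize the condition map to the feature resolution, then predict per-pixel
        # scale (gamma) and shift (beta) that modulate the normalized features.
        cond = F.interpolate(cond, size=feat.shape[-2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(feat) * (1 + self.to_gamma(h)) + self.to_beta(h)

# Usage: modulate 256-channel generator features with a 20-class semantic map (dummy tensors).
layer = SpatiallyAdaptiveNorm(256, 20)
out = layer(torch.randn(1, 256, 32, 32), torch.randn(1, 20, 256, 256))
```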
Model Structure. Conditional generation of high-resolution images with fine details is challenging and computationally expensive for GANs. Coarse-to-fine structures [50], [53], [99], [108], [146] help address these issues by gradually refining the generated images or features from low resolutions to high resolutions. By generating coarse images or features first and then refining them, the generator network can focus on capturing the overall structure of the image before moving on to the fine details, which leads to more efficient training and higher generation quality. Not only generators, many discriminator networks [50], [147] also operate at multiple levels of resolution to efficiently differentiate high-resolution images and avoid potential overfitting. On the other hand, as a scene can be depicted with diverse linguistic expressions, generating images with consistent semantics regardless of the expression variants presents a significant challenge. Multiple pieces of research employ a siamese structure with two generation branches to facilitate the semantic alignment. With a pair of conditions for the two branches, a contrastive loss can be adopted to minimize the distance between positive pairs (two text prompts describing the same scene) and maximize the distance between negative pairs (two prompts describing different scenes) [137], [148], [149]. Besides, an intra-domain transformation loss [150] can also be employed in the siamese structure to preserve key characteristics during generation. Beyond the above structures, a cycle structure has also been explored in a series of conditional GANs to preserve key information in the generation process. Specifically, some research [55], [151]–[154] explores passing the generated images through an inverse network to yield the conditional input, which imposes a cycle-consistency on the conditional input. The inverse network varies for different conditional inputs, e.g., image captioning models [55], [155] for text guidance, and generation networks for visual guidance.

Loss Design. Beyond the inherent adversarial loss in GANs, various other loss terms have been explored to achieve high-fidelity generation or faithful conditional generation. For conditional input that is spatially aligned with the ground-truth image, it has been proved that a perceptual loss [156] is able to boost the generation quality significantly [157], by minimizing the distance of perceptual features between generated images and the ground truth. Besides, associated with the cycle structure described previously, a cycle-consistency loss [151] is duly imposed to enforce condition consistency. However, the cycle-consistency loss is too restrictive for conditional generation as it assumes a bijective relationship between two domains. Thus, some efforts [150], [158], [159] have been devoted to exploring one-way translation to bypass the bijection constraint of cycle-consistency. With the emergence of contrastive learning, several studies explore maximizing the mutual information of positive pairs via noise contrastive estimation [160] for the preservation of contents in unpaired image generation from visual guidance [161], [162] or text-to-image generation [163]. Except for the contrastive loss, a triplet loss has also been employed to improve the condition consistency for cross-modal guidance like texts [148].

3.1.2 Inversion of Unconditional GAN
Large-scale GANs [145], [164] have achieved remarkable progress in unconditional image synthesis with high resolution and high fidelity. With a pre-trained GAN model, a series of studies explore inverting a given image back into the latent space of the GAN, which is termed GAN inversion [30].2 Specifically, a pre-trained GAN learns a mapping from latent codes to real images, while GAN inversion maps images back to latent codes, which is achieved by feeding the latent code into the pre-trained GAN to reconstruct the image through optimization. Typically, the reconstruction metrics are based on ℓ1, ℓ2, perceptual [156] loss, or LPIPS [165]. Certain constraints on face identity [166] or latent codes [31] could also be included during optimization. With the obtained latent codes, we can faithfully reconstruct the original image and conduct realistic image manipulation in the latent space. In terms of MISE, cross-modal image manipulation can be achieved by manipulating or generating latent codes according to the guidance from other modalities.
2. Please refer to [30] for a comprehensive review of GAN inversion.

Explicit Cross-modal Alignment. One direction of leveraging the guidance from other modalities is to map the embeddings of images and cross-modal inputs (e.g., semantic maps, texts) into a common embedding space [28], [167], as shown in Fig. 3 (a). For example, TediGAN [28] trains an encoder for each modality to extract the embeddings and applies a similarity loss to map them into the latent space. Afterwards, latent manipulation (e.g., latent mixing [28]) can be performed to edit the image latent codes toward the embeddings of other modalities and achieve cross-modal image manipulation. However, mapping multimodal data into a common space is non-trivial due to the heterogeneity across different modalities, which can result in inferior and unfaithful image generation.

Implicit Cross-modal Supervision. Instead of explicitly projecting the guidance modality into the latent space, another line of research aims to guide the synthesis or editing by defining a consistency loss between the generation results and the guiding modality. For instance, Jiang et al. [168] propose to optimize image latent codes through a pre-trained fine-grained attribute predictor, which can examine the consistency of the edited image and the text description. However, the attribute predictor is specifically designed for face editing with fine-grained attribute annotations, making it hard to generalize to other scenarios. A recently released large-scale pretrained model, Contrastive Language-Image Pre-training (CLIP) [45], has demonstrated great potential in multimodal synthesis and manipulation [29], [43]; it learns joint vision-language representations from over 400M text-image pairs via contrastive learning. On the strength of the powerful pre-trained CLIP, Bau et al. [169] define a CLIP-based semantic consistency loss to optimize latent codes inside an inpainting region to align the recovered content with the given text. Similarly, StyleCLIP [29] and StyleMC [170] employ the cosine similarity between CLIP representations to supervise the text-guided manipulation, as illustrated in Fig. 3 (b). A known issue of the standard CLIP loss is the adversarial solution [171], where the model tends to fool the CLIP classifier by adding meaningless pixel-level perturbations to the image. To this end, Liu et al. propose the AugCLIP score [171] to robustify the standard CLIP score; StyleGAN-NADA [172] presents a directional CLIP loss to align the CLIP-space directions between the source and target text-image pairs. It also directly finetunes the pretrained generative model with text conditions for domain adaptation. Moreover, Yu et al. [173] introduce a CLIP-based contrastive loss for robust optimization and counterfactual image manipulation.
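The implicit cross-modal supervision described above can be summarized by a short optimization loop. In the sketch below, generator, clip_image_embed, and clip_text_embed are hypothetical placeholders for a pre-trained GAN and CLIP encoders returning L2-normalized embeddings; it is a simplified illustration rather than the exact StyleCLIP [29] recipe.

```python
# Minimal sketch of text-driven latent-code optimization with a CLIP consistency loss.
import torch

def edit_latent(generator, clip_image_embed, clip_text_embed, w_init, prompt,
                steps: int = 200, lr: float = 0.05, lambda_reg: float = 0.5):
    w = w_init.clone().requires_grad_(True)       # inverted latent code of the source image
    target = clip_text_embed(prompt).detach()     # (1, D) embedding of the editing prompt
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = generator(w)                      # render the current edit
        clip_loss = 1.0 - torch.cosine_similarity(clip_image_embed(image), target).mean()
        latent_reg = (w - w_init).pow(2).mean()   # stay close to the original latent code
        loss = clip_loss + lambda_reg * latent_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```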
Fig. 3. The architectures of GAN inversion for MISE, including (a) cross-modal alignment [28] and (b) cross-modal supervision [29]. Cross-modal alignment embeds both images and conditions into the latent space of a GAN (e.g., StyleGAN [145]), aiming to pull their embeddings closer; image and condition embeddings can then be mixed to perform multimodal image generation or editing. Cross-modal supervision inverts the source image into a latent code and trains a mapper network to produce residuals that are added to the latent code to yield the target code, from which a pre-trained StyleGAN generates an image assessed by the CLIP and identity losses. The figure is reproduced based on [28] and [29].

3.2 Diffusion-based Methods
Recently, diffusion models such as denoising diffusion probabilistic models (DDPMs) [3], [174] have achieved great successes in generative image modeling [3], [22]–[24]. DDPMs are a type of latent variable model that consists of a forward diffusion process and a reverse diffusion process. The forward process is a Markov chain where noise is gradually added to the data when sequentially sampling the latent variables $x_t$ for $t = 1, \cdots, T$. Each step in the forward process is a Gaussian transition $q(x_t | x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$, where $\{\beta_t\}_{t=0}^{T}$ are a fixed or learned variance schedule. The reverse process $q(x_{t-1} | x_t)$ is parameterized by another Gaussian transition $p(x_{t-1} | x_t) := \mathcal{N}(x_{t-1}; \mu(x_t), \sigma_t^2 I)$. $\mu(x_t)$ can be decomposed into a linear combination of $x_t$ and a noise approximation model $\epsilon_\theta(x_t, t)$ that can be learned through optimization. After training $\epsilon_\theta(x, t)$, the sampling process of DDPM can be achieved by following the reverse diffusion process.
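A minimal sketch of the forward (noising) process and the standard noise-prediction objective is given below, assuming a linear β schedule and a placeholder denoising network eps_model(x_t, t); real systems differ in schedules, parameterizations, and loss weighting.

```python
# Minimal sketch of the DDPM forward process and the noise-prediction training loss.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear variance schedule {beta_t}
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, noise):
    # Closed form of the forward chain: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    ab = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

def ddpm_loss(eps_model, x0):
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return torch.nn.functional.mse_loss(eps_model(x_t, t), noise)
```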
Song et al. [22] propose an alternative non-Markovian noising process that has the same forward marginals as DDPM but allows using different samplers by changing the variance of the noise. Especially, by setting the noise to 0, which gives the DDIM sampling process [22], the sampling process becomes deterministic, enabling full inversion of the latent variables into the original images with significantly fewer steps [21], [22]. Notably, the latest work [21] has demonstrated even higher image synthesis quality compared to variational autoencoders (VAEs) [175], flow models [176], [177], autoregressive models [178], [179], and GANs [1], [145]. To achieve image generation and editing conditioned on provided guidance, leveraging pre-trained models [32] (by a guidance function or fine-tuning) and training conditional models from scratch [46] are both extensively studied in the literature. A downside of the guidance function method lies in the requirement of an additional guidance model, which leads to a complicated training pipeline. Recently, Ho et al. [27] achieve compelling results without a separate guidance model by using a form of guidance that interpolates between predictions from a diffusion model with and without labels. GLIDE [180] compares the CLIP-guided diffusion model and the conditional diffusion model on the text-to-image synthesis task, and concludes that training a conditional diffusion model yields better generation performance.

3.2.1 Conditional Diffusion Models
To launch the MISE tasks, a conditional diffusion model can be formulated by directly integrating the condition information into the denoising process. Recently, the performance of conditional diffusion models has been significantly pushed forward by a series of designs.

Condition Incorporation. As a common framework, a condition-specific encoder is usually employed to project the multimodal condition into embedding vectors, which are further incorporated into the model as shown in Fig. 4. The condition-specific encoder can be learned along with the model or directly borrowed from pre-trained models. Typically, CLIP is a common choice for text embedding, as adopted in DALL-E 2 [25]. Besides, generic large language models (e.g., T5 [182]) pre-trained on text corpora also show remarkable effectiveness at encoding text for image synthesis, as validated in Imagen [10]. With the condition embedding, diverse mechanisms can be adopted to incorporate it into diffusion models. Specifically, the condition embedding can be naively concatenated or added to the diffusion timestep embedding [21], [183]. In LDM [46], the condition embedding is mapped to the intermediate layers of the diffusion model via a cross-attention mechanism. Imagen [10] further compares mean pooling and attention pooling with the cross-attention mechanism and observes that both pooling mechanisms perform significantly worse. To fully leverage the conditional information for semantic image synthesis, Wang et al. [61] propose to incorporate visual guidance via a spatially-adaptive normalization, which improves both the quality and semantic coherence of generated images. Instead of incorporating the condition to train diffusion models from scratch, ControlNet [8] aims to incorporate the condition into a pre-trained diffusion model for controllable generation. To preserve the production-ready weights of pre-trained models for fast convergence, a 'zero convolution' is designed to incorporate the guidance, where the convolution weights are gradually learned from zeros to optimized parameters.

Fig. 4. Overall framework of the conditional diffusion model. With a certain model for latent representation, the diffusion process models the latent space by reversing a forward diffusion process conditioned on certain guidance (e.g., semantic map, depth map, and texts). The image is reproduced based on [46].

Latent Diffusion. To enable diffusion model training on limited computational resources while retaining quality and flexibility, several works explore conducting the diffusion process in learned latent spaces [46], as shown in Fig. 4. Typically, an autoencoding model can be employed to learn a latent space that is perceptually equivalent to the image space. On the other hand, the learned latent spaces may be accompanied by undesired high variance, which highlights the need for latent space regularization. As a common choice, a KL divergence can be applied to regularize the latent space towards a standard normal distribution. Alternatively, vector quantization can also be applied for regularization via a VQGAN [2] variant with an absorbed quantization layer as in [46]. Besides, VQGAN can directly learn a discrete latent space (the quantization layer is not absorbed), which can be modeled by a discrete diffusion process as in VQ-Diffusion [184]. Tang et al. [185] further improve VQ-Diffusion by introducing a high-quality inference strategy to alleviate the joint distribution issue.

Model Architecture. Ho et al. [3] introduced a U-Net architecture for diffusion models, which can incorporate the inductive bias of CNNs into the diffusion process. This U-Net architecture is further improved by a series of designs, including attention configuration [21], residual blocks for upsampling and downsampling activations [23], and adaptive group normalization [21]. Although the U-Net structure is widely adopted in SOTA diffusion models, Chahal [186] shows that a Transformer-based LDM [46] can yield comparable performance to a U-Net-based LDM [46], accompanied with a natural multimodal condition incorporation via multi-head attention. Nevertheless, such a Transformer architecture is more favored under the setting of a discrete latent space as in [184], [187]. On the other hand, instead of directly generating final images, DALL-E 2 [25] proposes a two-stage structure by producing intermediate image embeddings from text in the CLIP latent space. Then, the image embeddings are applied to condition a diffusion model to generate final images, which allows improving the diversity of generated images [25]. Besides, some other architectures are also explored, including a compositional architecture [188] which generates an image by composing a set of diffusion models, a multi-diffusion architecture [189] which is composed of multiple diffusion processes with shared parameters or constraints, a retrieval-based diffusion model [190] which alleviates the high computational cost, etc.

3.2.2 Pre-trained Diffusion Models
Rather than expensively re-training diffusion models, another line of research resorts to guiding the denoising process with proper supervision, or finetuning the model at a lower cost, as shown in Fig. 5.

Fig. 5. Typical frameworks of pre-trained diffusion models for MISE tasks, including the guidance function method and the fine-tuning method. The figure is reproduced based on [32] and [181].

Guidance Function Method. As an early exploration, Dhariwal et al. [21] augment pre-trained diffusion models with classifier guidance, which can be extended to achieve conditional generation with various guidance. Specifically, the reverse process $p(x_{t-1}|x_t)$ with guidance can be rewritten as $p(x_{t-1}|x_t, y)$ where $y$ is the provided guidance. Following the derivation in [21], the final diffusion sampling process can be rewritten as:

$x_{t-1} = \mu(x_t) + \sigma_t^2 \nabla_{x_t} \log p(y|x_t) + \sigma_t \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$   (1)

$F(x_t, y) = \log p(y|x_t)$ (dubbed the guidance function) indicates the consistency between $x_t$ and the guidance $y$, which can be formulated by a certain similarity metric [32] such as cosine similarity or L2 distance. As the similarity is usually computed in the feature space, a pre-trained CLIP can be adopted as the image encoder and condition encoder for text guidance, as shown in Fig. 5 (a). However, the image encoder will take noisy images as input while CLIP is trained on clean images. Thus, a self-supervised fine-tuning of CLIP can be performed to force an alignment between features extracted from clean and noised images as in [32]. To control the generation consistency with the guidance, a parameter $\gamma$ can be introduced to scale the guidance gradients as below:

$x_{t-1} = \mu(x_t) + \sigma_t^2 \gamma \nabla_{x_t} \log p(y|x_t) + \sigma_t \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$   (2)

Apparently, the model will focus more on the modes of the guidance with a larger gradient scale $\gamma$. As a result, $\gamma$ is positively correlated with the generation consistency (with the guidance), while it is negatively correlated with the generation diversity [21]. Besides, to achieve local guidance for image editing, a blended diffusion mechanism [191] can be employed by spatially blending the noised image with the locally guided diffusion latent at progressive noise levels.
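The guidance-function sampling of Eq. (2) can be sketched as a single guided reverse step, where mu_model, sigma_t, and guidance_log_prob (e.g., a CLIP similarity playing the role of log p(y|x_t)) are placeholders supplied by the user; this is an illustrative simplification of classifier/CLIP guidance [21], [32].

```python
# Minimal sketch of one guided reverse diffusion step following Eq. (2): the gradient
# of a guidance function shifts the predicted mean, scaled by gamma.
import torch

def guided_step(mu_model, sigma_t, guidance_log_prob, x_t, y, gamma: float = 3.0):
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_p = guidance_log_prob(x_in, y)                 # consistency between x_t and guidance y
        grad = torch.autograd.grad(log_p.sum(), x_in)[0]   # gradient w.r.t. the noisy image
    mean = mu_model(x_t) + (sigma_t ** 2) * gamma * grad   # shifted posterior mean
    return mean + sigma_t * torch.randn_like(x_t)          # sample x_{t-1}
```

Raising gamma trades diversity for consistency with the guidance, mirroring the discussion above.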
Fine-tuning Method. In terms of fine-tuning, MISE can be achieved by modifying the latent code or adapting the pre-trained diffusion models, as shown in Fig. 5 (b). To adapt unconditional pre-trained models for text-guided editing, the input image is first converted to the latent space via the forward diffusion process. The diffusion model on the reverse path is then fine-tuned to generate images driven by the target text and the CLIP loss [33]. For pre-trained conditional models (typically conditioned on texts), similar to GAN inversion, a text latent embedding or a diffusion model can be fine-tuned to faithfully reconstruct a few images (or objects) [35], [36]. Then the obtained text embedding or fine-tuned model can be applied to generate the same object in novel contexts. However, these methods [35], [36] usually drastically change the layout of the original images. Observing that the crux of the relationship between the image spatial layout and each word lies in the cross-attention layers, Prompt-to-Prompt [34] proposes to preserve some content from the original image by manipulating the cross-attention maps. Alternatively, taking advantage of the step-by-step diffusion sampling process, a model fine-tuned for image reconstruction can be utilized to provide score guidance for content and structure preservation at the early stage of the denoising process [192]. A similar approach is adopted in [181] by fine-tuning the diffusion model and optimizing the text embedding via image reconstruction, which allows preserving contents via text embedding interpolation.

3.3 Autoregressive Methods
Fueled by the advance of GPT [38] in natural language modeling, autoregressive models have been successfully applied to image generation [39] by treating the flattened image sequences as discrete tokens. The plausibility of the generated images demonstrates that autoregressive models are able to accommodate the spatial relationships between pixels and high-level attributes. Compared with CNNs, Transformer models naturally support various multimodal inputs in a unified manner, and a series of studies have been proposed to explore multimodal image synthesis with Transformer-based autoregressive models [2], [44], [69], [194]. Overall, the pipeline of autoregressive models for MISE consists of a vector quantization [42], [195] stage to yield a unified discrete representation and achieve data compression, and an autoregressive modeling stage which establishes the dependency between discrete tokens in a raster-scan order, as illustrated in Fig. 6.

3.3.1 Vector Quantization
Directly treating all image pixels as a sequence for autoregressive modeling with a Transformer is expensive in terms of memory consumption, as the self-attention mechanism in the Transformer incurs a quadratic memory cost. Thus, a compressed and discrete representation of the image is essential for autoregressive image synthesis and editing. A k-means method to cluster RGB pixel values has been adopted in [39] to reduce the input dimensionality. However, k-means clustering only reduces the dimensionality while the sequence length is still unchanged. Thus, the autoregressive model still cannot be scaled to higher resolutions, due to the quadratically increasing cost in sequence length. To this end, the Vector Quantised VAE (VQ-VAE) [42] is adopted to learn a discrete and compressed image representation. VQ-VAE consists of an encoder, a feature quantizer, and a decoder. The image is fed into the encoder to learn a continuous representation, which is quantized via the feature quantizer by assigning the feature to the nearest codebook entry. Then the decoder reconstructs the original image from the quantized feature, driving the model to learn a faithful discrete image representation. As assigning a codebook entry is not differentiable, a reparameterization trick [42], [196] is usually adopted to approximate the gradient. Targeting the learning of superior discrete image representations, a series of efforts [2], [197], [198] have been devoted to improving VQ-VAE in terms of loss function, model architecture, codebook utilization, and learning regularization.
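A minimal sketch of the quantization step at the heart of VQ-VAE is shown below: nearest-codebook lookup with a straight-through gradient estimator and the usual codebook/commitment losses. Codebook size, feature dimension, and the loss weighting are illustrative assumptions.

```python
# Minimal sketch of vector quantization with a straight-through gradient estimator.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                    # z: (B, N, dim) encoder features
        flat = z.reshape(-1, z.shape[-1])
        dist = torch.cdist(flat, self.codebook.weight)       # distances to every codebook entry
        idx = dist.argmin(dim=-1).view(z.shape[:-1])         # nearest entries -> discrete tokens
        z_q = self.codebook(idx)                             # quantized features
        # Commitment / codebook losses pull encoder outputs and codebook entries together.
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()                         # straight-through: copy gradients to z
        return z_q, idx, loss
```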
Fig. 6. Typical framework of autoregressive methods for MISE tasks. A quantization stage is first performed to learn a discrete and compressed representation by reconstructing the original image or condition (e.g., semantic map) faithfully via VQ-GAN [2], [42], followed by an autoregressive modeling stage to capture the dependency of the discrete sequence. The image is reproduced based on [2] and [193].

Loss Function. To achieve desirable perceptual quality for reconstructed images, an adversarial loss and a perceptual loss [156], [199], [200] (with a pre-trained VGG) can be incorporated for image reconstruction. With the extra adversarial loss and perceptual loss, the image quality is clearly improved compared with the original pixel loss, as validated in [2]. Except for the pre-trained VGG for computing the perceptual loss, a vision Transformer [201] from self-supervised learning [115], [202] has also proved to work well for calculating the perceptual loss. Besides, to emphasize reconstruction quality in certain regions, a feature-matching loss can be employed over the activations of certain pre-trained models, e.g., a face-embedding network [203] which can improve the reconstruction quality of the face region.

Network Architecture. A convolutional neural network is the common structure to learn the discrete image representation in VQ-VAE. Recently, Yu et al. [197] replace the convolution-based structure with a Vision Transformer (ViT) [204], which is shown to be less constrained by the inductive priors imposed by convolutions and is able to yield better computational efficiency with higher reconstruction quality. With the emergence of diffusion models, a diffusion-based decoder [205] has also been explored to learn discrete image representations with superior reconstruction quality. On the other hand, a multi-scale quantization structure is proved to promote the generation performance by including both low-level pixels and high-level tokens [206] or hierarchical latent codes [207]. To further reduce the computational costs, a residual quantization [208] can be employed to recursively quantize the image as a stacked map of discrete tokens.

Codebook Utilization. The vanilla VQ-VAE with the argmin operation (to get the nearest codebook entry) suffers from severe codebook collapse, e.g., only a few codebook entries are effectively utilized for quantization [209]. To alleviate the codebook collapse, vq-wav2vec [210] introduces Gumbel-Softmax [211] to replace argmin for quantization. The Gumbel-Softmax allows sampling discrete representations in a differentiable way through the straight-through gradient estimator [196], which boosts the codebook utilization significantly. ViT-VQGAN [197] also presents a factorized code architecture which introduces a linear projection from the encoder output to a low-dimensional latent variable space for code index lookup and boosts the codebook usage substantially.

Learning Regularization. Recent work [198] validates that the vanilla VQ-VAE does not satisfy translation equivariance during quantization, resulting in degraded performance for text-to-image generation. A simple but effective TE-VQGAN [198] is thus proposed to achieve translation equivariance by regularizing orthogonality in the codebook embeddings. To regularize the latent structure of heterogeneous domain data in conditional generation, Zhan et al. [193] design an Integrated Quantization VAE to penalize the inter-domain discrepancy with intra-domain variations.

3.3.2 Autoregressive Modeling
Autoregressive (AR) modeling is a representative paradigm to accommodate sequence dependencies, complying with the chain rule of probability. The probability of each token in the sequence is conditioned on all previous predictions, yielding a joint distribution of sequences as the product of conditional distributions: $p(x) = \prod_{t=1}^{n} p(x_t | x_1, x_2, \cdots, x_{t-1}) = \prod_{t=1}^{n} p(x_t | x_{<t})$. During inference, each token is predicted autoregressively in a raster-scan order. Notably, a sliding-window strategy [2] can be employed to reduce the cost during inference by only utilizing the predictions within a local window. A top-k sampling strategy is adopted to randomly sample from the k most likely next tokens, which naturally enables diverse sampling results. The predicted tokens are then concatenated with the previous sequence as conditions for the prediction of the next token. This process repeats iteratively until all the tokens are sampled. Autoregressive models for image synthesis have become increasingly popular due to their ability to generate high-quality, realistic images with a high level of detail. In MISE tasks, autoregressive models generate images pixel-by-pixel based on a conditional probability distribution that takes into account both the previously generated pixels and the given conditioning information, which allows the models to capture the complex dependencies required to yield visually consistent images. In recent years, autoregressive models for MISE have been largely fueled by a series of designs to be introduced below.
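The raster-scan decoding with top-k sampling described above can be sketched as follows, with transformer a placeholder that returns next-token logits for a (condition + image) token sequence; details such as key-value caching and sliding windows are omitted.

```python
# Minimal sketch of conditional autoregressive decoding over discrete tokens with top-k sampling.
import torch

def sample_image_tokens(transformer, cond_tokens, num_image_tokens: int, k: int = 100):
    seq = cond_tokens                                        # condition sequence as the prefix
    for _ in range(num_image_tokens):                        # raster-scan order, one token at a time
        logits = transformer(seq)[:, -1, :]                  # logits for the next position
        topk_vals, topk_idx = logits.topk(k, dim=-1)         # keep the k most likely tokens
        probs = torch.softmax(topk_vals, dim=-1)
        next_tok = topk_idx.gather(-1, torch.multinomial(probs, 1))
        seq = torch.cat([seq, next_tok], dim=1)              # feed the prediction back as context
    return seq[:, cond_tokens.shape[1]:]                     # image tokens for the VQ decoder
```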
Network Architecture. Early autoregressive models for image generation usually adopt PixelCNN [213], which struggles to model long-term relationships within an image due to the limited receptive field. With the prevalence of the Transformer [37], Transformer-based autoregressive models [214] emerge with an enhanced receptive field which allows sequentially predicting each pixel conditioned on previous prediction results. To explore the limits of autoregressive text-to-image synthesis, Parti [71] scales the parameter size of the Transformer up to 20B, yielding consistent improvements in terms of image quality and text-image alignment. Instead of unidirectionally modeling from condition to image, a bi-directional architecture is also explored in text-to-image synthesis [215], [216], which generates both diverse captions and images.

Bidirectional Context. On the other hand, previous methods incorporate image context in a raster-scan order by attending only to previous generation results. This strategy is unidirectional and suffers from sequential bias as it disregards much context information until autoregression is nearly complete. It also ignores much contextual information at different scales as it only processes the image on a single scale. Grounded in the above observations, ImageBART [194] presents a coarse-to-fine approach in a unified framework that addresses the unidirectional bias of autoregressive modeling and the corresponding exposure bias. Specifically, a diffusion process is applied to successively eliminate information, yielding a hierarchy of representations which is further compressed via a multinomial diffusion process [174], [217]. By modeling the Markovian transition autoregressively while attending to the preceding hierarchical state, crucial global context can be leveraged for each individual autoregressive step. As an alternative, bidirectional Transformers are also widely explored to incorporate bidirectional context, accompanied with a Masked Visual Token Modeling (MVTM) [218] or Masked Language Modeling (MLM) [219], [220] mechanism.

Self-Attention Mechanism. To handle languages, images, and videos in different tasks in a unified manner, NUWA [44] presents a 3D Transformer framework with a unified 3D Nearby Self-Attention (3DNA) which not only reduces the complexity of full attention but also shows superior performance. With a focus on semantic image editing at high resolution, ASSET [221] proposes to sparsify the Transformer's attention matrix at high resolutions, guided by dense attention at lower resolutions, leading to reduced computational cost.

3.4 NeRF-based Methods
A neural field [222] is a field that is parameterized fully or in part by a neural network. As a special case of neural fields, Neural Radiance Fields (NeRF) [4] achieve impressive performance for novel view synthesis by parameterizing the color and density of a 3D scene with neural fields. Specifically, a fully-connected neural network is adopted in NeRF, taking a spatial location (x, y, z) with the corresponding viewing direction (θ, ϕ) as input, and the volume density with the corresponding emitted radiance as output. To render 2D images from the implicit 3D representation, differentiable volume rendering is performed with a numerical integrator [4] to approximate the intractable volumetric projection integral. Powered by NeRF for 3D scene representation, 3D-aware MISE can be achieved with per-scene NeRF or generative NeRF frameworks.
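To make the volume-rendering step concrete, below is a minimal sketch of the quadrature NeRF uses to composite a pixel color along a ray; field is a placeholder for the fully-connected network, and stratified/hierarchical sampling and positional encoding are omitted.

```python
# Minimal sketch of NeRF volume rendering: alpha-compositing colors along a ray.
import torch

def render_ray(field, origin, direction, near=2.0, far=6.0, n_samples=64):
    t = torch.linspace(near, far, n_samples)                     # depths of point samples
    points = origin + t[:, None] * direction                     # (n_samples, 3) positions on the ray
    density, rgb = field(points, direction.expand_as(points))    # density: (n_samples,), rgb: (n_samples, 3)
    delta = torch.full((n_samples,), (far - near) / n_samples)   # spacing between samples
    alpha = 1.0 - torch.exp(-density * delta)                    # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                       # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)                    # composited pixel color
```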
Fig. 7. The frameworks of (a) per-scene NeRF with pre-trained models and (b) generative (GAN-based) NeRF for 3D-aware MISE. The image is adapted from [11], [212].

3.4.1 Per-scene NeRF
Consistent with the original NeRF model, a per-scene NeRF aims to optimize and represent a single scene supervised by images or certain pre-trained models.

Image Supervision. With paired guidance and corresponding view images, a NeRF can be naively trained conditioned on the guidance to achieve MISE. For instance, AD-NeRF [13] achieves high-fidelity talking-head synthesis by training neural radiance fields on a video sequence with the audio track of one target person. Instead of bridging audio inputs and video outputs based on intermediate representations, AD-NeRF directly feeds the audio features into an implicit function to yield a dynamic NeRF, which is further exploited to synthesize high-fidelity talking-face videos accompanied by the audio via volume rendering. However, the paired condition-image data and multiview images are usually unavailable or costly to acquire, which hinders the broad application of this method.

Pre-trained Model Supervision. Instead of relying on multiview images or paired data, certain pre-trained models can be adopted to optimize NeRFs from scratch, as shown in Fig. 7 (a). For instance, pre-trained CLIP can be leveraged to achieve text-driven 3D-aware image synthesis [223], by optimizing a NeRF to render multi-view images that score highly with a target text description according to the CLIP model. A similar CLIP-based approach is also adopted in AvatarCLIP [224] to achieve zero-shot text-driven 3D avatar generation and animation. Recently, with the prosperity of diffusion models, pre-trained 2D diffusion models show great potential to drive the generation of high-fidelity 3D scenes for diverse text prompts, as in DreamFusion [11]. Specifically, based on probability density distillation, a 2D diffusion model can serve as a generative prior for the optimization of a randomly-initialized 3D neural field via gradient descent such that its 2D renderings yield a high score with the target condition. Following this line of research, Magic3D [47] further proposes to optimize a textured 3D mesh model with an efficient differentiable renderer [212], [225] interacting with a pre-trained latent diffusion model. On the other hand, optimizing NeRF with pre-trained models is an under-constrained process, which highlights the need for certain prior knowledge or regularizations. It has been proved that geometric priors including sparsity regularization and scene bounds [223] improve the generation fidelity significantly. Besides, to mitigate the ambiguous geometry from a single viewpoint, random lighting directions can be applied to shade a scene to reveal the geometric details [11]. To prevent normal vectors from improperly facing backwards from the camera, an orientation loss proposed in Ref-NeRF [226] can be employed to impose a penalty.

3.4.2 Generative NeRF
Distinct from per-scene optimization NeRFs which work for a single scene, generative NeRFs are capable of generalizing to different scenes by integrating NeRF with generative models. In a generative NeRF, a scene is specified by a latent code in the corresponding latent space. GRAF [227] is the first to introduce a GAN framework for the generative training of radiance fields by employing a multi-scale patch-based discriminator. A lot of effort has recently been devoted to improving generative NeRFs, e.g., GIRAFFE [228] for introducing volume rendering at the feature level and separating the object instances in a controllable way; Pi-GAN [229] for the FiLM-based conditioning scheme [230] with a SIREN architecture [231]; StyleNeRF [232] for the integration of a style-based generator to achieve high-resolution image synthesis; EG3D [233] for incorporating an efficient tri-plane 3D representation. Fueled by these advancements, 3D-aware MISE can be well performed following the pipeline of conditional generative NeRF or generative NeRF inversion.

Conditional NeRF. In a conditional generative NeRF, a scene is specified by the combination of 3D positions and given conditions, as shown in Fig. 7 (b). The condition can be integrated to condition the NeRF following the integration strategies in GANs or diffusion models. For instance, a pre-trained CLIP model is employed in [234] to extract the conditional visual and text features to condition a NeRF. Similarly, pix2pix3D [49] encodes certain visual guidance (and a random code) to generate triplanes for scene representation, while it renders the image and a pixel-aligned label map simultaneously to enable interactive 3D cross-view editing.

NeRF Inversion. In light of recent advances in generative NeRFs for 3D-aware image synthesis, some work explores the inversion of generative NeRFs for 3D-aware MISE. As a generative NeRF (GAN-based) is accompanied by a latent space, the conditional guidance for MISE can be naively mapped into the latent space to enable conditional 3D-aware generation [235]. However, this method struggles for image generation & editing with local control. Some recent work proposes to train a 3D-semantic-aware generative NeRF [48], [236] that produces spatially-aligned images and semantic masks concurrently with two branches. These aligned semantic masks can be used to perform local editing of the 3D volume via NeRF inversion. On the other hand, the inversion of a generative NeRF is challenging due to the inclusion of the camera pose. Thus, a hybrid inversion strategy [232] can be applied in practice by combining encoder-based and optimization-based inversion, where the encoder predicts a camera pose and a coarse style code which is further refined through inverse optimization. To enable flexible and faithful 3D-aware MISE, some pre-trained models like CLIP can also be introduced in NeRF inversion. For instance, to achieve 3D-aware manipulation from a text prompt, CLIP-NeRF [237] optimizes latent codes towards the targeted manipulation driven by a CLIP-based matching loss as described in StyleCLIP [29].

3.5 Other Methods
Except for the above-mentioned methods, there have been several endeavors dedicated to the MISE task, exploring diverse research paths.

2D MISE without Generative Models. Instead of relying on generative models, a series of alternative methods have been explored for multimodal editing of 2D images. For instance, CLVA [255] manipulates the style of a content image through text prompts by comparing contrastive pairs of content image and style instruction to achieve mutual relativeness. However, CLVA is constrained as it requires style images accompanied with the text prompts during training. Instead, CLIPstyler [256] leverages a pre-trained CLIP model to achieve text-guided style transfer by training a lightweight network which transforms a content image to follow the text condition. As an extension to video, Loeschcke et al. [257] harness the power of CLIP to stylize the object in a video according to two target texts.

3D-aware MISE without NeRF. Except for NeRF, there are alternative methods that can be leveraged for 3D-aware MISE. Typically, classical 3D representations such as meshes can also be employed to replace NeRF for 3D-aware MISE [258], [259]. Specifically, aiming for style transfer of 3D scenes, Mu et al. [260] propose to learn geometry-aware content features from a point cloud representation of the scene, followed by point-to-pixel adaptive attention normalization (AdaAttN) to transfer the style of a given image. Besides, a popular line of research adapts GANs for 3D-aware generation by conditioning on camera parameters [261], introducing an intermediate 3D shape [262], incorporating a depth prior [263], and adopting 3D rigid-body transformation with projection [264].

3.6 Comparison and Discussion
All generation methods possess their own strengths and weaknesses. GAN-based methods can achieve high-fidelity image synthesis in terms of FID and Inception Score and also have fast inference speed, while GANs are notorious for unstable training and are prone to mode collapse. Moreover, it has been shown that GANs focus more on fidelity rather than capturing the diversity of the training data distribution compared with likelihood-based models like diffusion models and autoregressive models [21]. Besides, GANs usually adopt a CNN architecture (although the Transformer structure is explored in some studies [265]–[267]), which makes them struggle to handle multimodal data in a unified manner and generalize to new MISE tasks.
TABLE 2
Annotation types in popular datasets for MISE. Note that only currently available annotations are labeled with checkmarks, although some off-the-shelf
models (e.g., segmentation models, edge detectors, image caption models) can be employed to annotate the corresponding datasets. Part of the
information is retrieved from [238]. B-Box denotes bounding box.
Datasets Samples Semantic Map Keypoint Sketch B-Box Depth Attribute Text Audio Scene Graph
ADE20K [239] 27,574 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
COCO [240] 328,000 ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗
COCO-Stuff [241] 164,000 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
PSG [242] 48,749 ✓ ✗ ✗ ✓ ✗ ✗ ✓ ✗ ✓
Cityscapes [243] 25,000 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
CelebA [244] 202,599 ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗
CelebA-HQ [146] 30,000 ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗
CelebAMask-HQ [7] 30,000 ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗
CelebA-Dialog [168] 202,599 ✗ ✗ ✗ ✗ ✗ ✓ ✓ ✗ ✗
MM-CelebA-HQ [28] 30,000 ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗
DeepFashion [245] 800,000 ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗
DeepFashion-MM [187] 44,096 ✓ ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗
Chictopia10K [246] 14,400 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
NYU Depth [247] 1,449 ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
Stanford’s Cars [248] 16,185 ✗ ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✗
Oxford-102 [249] 8,189 ✗ ✗ ✗ ✗ ✗ ✓ ✓ ✗ ✗
CUB-200 [250] 11,788 ✓ ✗ ✗ ✓ ✗ ✓ ✓ ✗ ✗
LAION-5B [251] 5.85 billion ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗
Visual Genome [252] 101,174 ✗ ✗ ✗ ✓ ✗ ✓ ✓ ✗ ✓
VoxCeleb [253] 148,642 ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗
LRS [254] 144,482 ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✓ ✗
With the wide adoption of the Transformer backbone, autoregressive models can handle different MISE tasks in a unified manner. However, due to the token-by-token autoregressive prediction, autoregressive models suffer from slow inference speed, which is also a bottleneck of diffusion models as they require a large number of diffusion steps. Currently, autoregressive models and diffusion models are more favored in SOTA methods compared with GANs, especially for text-to-image synthesis.

Autoregressive models and diffusion models are likelihood-based generative models which are equipped with a stationary training objective and good training stability. The comparison of generative modeling capability between autoregressive and diffusion models is still inconclusive. DALL-E 2 [25] shows that diffusion models are slightly better than autoregressive models in modeling the diffusion prior. However, the recent work Parti [71], which adopts an autoregressive structure, presents superior performance over the SOTA diffusion-based method (i.e., Imagen). On the other hand, the exploration of these two different families of generative models may open exciting opportunities to combine the merits of the two powerful models.
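Concretely, both families optimize a fixed, likelihood-style objective, which underlies their training stability: an autoregressive model maximizes the exact log-likelihood of a token sequence, while a diffusion model (in the common DDPM parameterization) minimizes a denoising objective derived from a variational bound,

\[ \log p_\theta(x) = \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}), \qquad \mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{x_0, t, \epsilon}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \,\big], \]

where $x_{<i}$ denotes the preceding tokens and $\epsilon_\theta$ predicts the noise added at diffusion step $t$.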
Different from the above generation methods, which mainly work on 2D images and place few requirements on the training datasets, NeRF-based methods handle 3D scene geometry and thus have relatively high requirements for training data. For example, per-scene optimization NeRFs require multiview images or video sequences with pose annotations, while generative NeRFs require the scene geometry of the dataset to be simple. Thus, the application of NeRF in high-fidelity MISE is still quite constrained. Nevertheless, the 3D-aware modeling of the real world with NeRF opens a new door for future MISE research, broadening the horizons for potential advancements.

Besides, state-of-the-art methods tend to combine different generative models to yield superior performance. For example, Taming Transformer [2] incorporates VQ-GAN and autoregressive modeling to achieve high-resolution image synthesis; StyleNeRF [232] combines NeRF with GAN to enable image synthesis with both high fidelity and 3D-awareness; ImageBART [194] combines the autoregressive formulation with a multinomial diffusion process to incorporate a coarse-to-fine hierarchy of context information; X-LXMERT [268] integrates GAN into a cross-modality representation framework to achieve text-guided image generation.

4 EXPERIMENTAL EVALUATION

4.1 Datasets
Datasets are the core of image synthesis and editing tasks. To give an overall picture of the datasets in MISE, we tabulate the detailed annotation types of popular datasets in Table 2. Notably, ADE20K [239], COCO-Stuff [241], and Cityscapes [243] are common benchmark datasets for semantic image synthesis; Oxford-102 Flowers [249], CUB-200 Birds [250], and COCO [240] are widely adopted in text-to-image synthesis; VoxCeleb2 [281] and Lip Reading in the Wild (LRW) [282] are usually used as benchmarks for talking face generation. Please refer to the supplementary material for more details of the widely adopted datasets in different MISE tasks.

4.2 Evaluation Metrics
Precise evaluation metrics are of great importance in driving the progress of research. On the other hand, the evaluation of MISE tasks is challenging, as multiple attributes account for a fine generation result and the notion of image evaluation is often subjective. To achieve faithful evaluation, comprehensive metrics are adopted to evaluate MISE tasks from multiple aspects. Specifically, Inception Score (IS) [283] and FID [284] are general metrics for image quality evaluation, while LPIPS [165] is a common metric to evaluate image diversity. These metrics can be applied across different generation tasks.
Fig. 8. Image synthesis from the combination of different types of guidance: semantic map + sketch map, semantic map + texts, scene layout + texts, sketch map + texts, semantic map + keypoints + texts, depth + texts, normal map + texts, HED + texts, depth + keypoints + texts, keypoints + texts, Canny edge + texts, and scribbles + texts. The samples are from [8], [62], [269].
TABLE 3
Quantitative comparison with existing methods on segmentation-to-image synthesis. Part of the results are retrieved from [61].
In terms of the alignment between generated images and conditions, the evaluation metrics are usually designed for specific generation tasks, e.g., mIoU and mAP for semantic image synthesis; R-precision [54], Captioning Metrics [285], and Semantic Object Accuracy (SOA) [274] for text-to-image generation; and Landmark Distance (LMD) and audio-lip synchronization (Sync) [286] for talking face generation.

As a general image quality metric, the advantage of IS is its simplicity, and it can be applied to a wide range of image generation models. However, IS has been criticized for its lack of robustness and its sensitivity to noise. It also struggles to detect overfitted generation (i.e., the model memorizes the training set) and to measure intra-domain variation (i.e., the model only produces one good sample). FID is more robust than IS and can better capture the overall quality of the generated images; however, it assumes a Gaussian distribution for image features, which is not always valid. For diversity metrics like LPIPS, the quality of the generated images is not considered, which means unrealistic generation could still lead to a good diversity score. Alignment metrics provide quantitative evaluations of generation alignment, but most of them are subject to various issues, including insensitivity to temporal or overall coherence for SOA and CPBD, dataset or pre-trained model bias for R-precision, mIoU & mAP, and audio-lip synchronization, and ambiguous alignment for Captioning Metrics. Please refer to the supplementary material for more details of the corresponding evaluation metrics. Overall, any single evaluation metric should be applied in conjunction with other metrics for a comprehensive and faithful analysis of model performance.
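For reference, the two general-purpose quality metrics take the standard forms

\[ \mathrm{IS} = \exp\!\big(\mathbb{E}_{x \sim p_g}\big[\, D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, p(y)\big) \big]\big), \qquad \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big), \]

where $p(y \mid x)$ is the label posterior of a pre-trained Inception classifier for a generated image $x$, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features of real and generated images, respectively; the Gaussian assumption mentioned above enters through the latter.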
4.3 Experimental Results
To showcase the capability and effectiveness of MISE in a tangible manner, we visualize synthesized images conditioned on combinations of diverse guidance types as shown in Fig. 8. Please refer to the supplementary material for more visualizations. Furthermore, we provide a quantitative comparison of the image synthesis performance exhibited by various models. This assessment considers distinct types of guidance, including visual, text, and audio, which are discussed in the following sections.

4.3.1 Visual Guidance
For visual guidance, we mainly conduct the comparison on semantic image synthesis, as there are numerous methods available for benchmarking. As shown in Table 3, the experimental comparison is conducted on four challenging datasets: ADE20K [239], ADE20K-outdoors [239], COCO-Stuff [241], and Cityscapes [243], following the setting of [6]. The evaluation is performed with FID, LPIPS, and mIoU. Specifically, mIoU assesses the alignment between the generated image and the ground-truth segmentation via a pre-trained semantic segmentation network. Pre-trained UperNet101 [287], multi-scale DRN-D-105 [288], and DeepLabV2 [289] are adopted for Cityscapes, ADE20K & ADE20K-outdoors, and COCO-Stuff, respectively.
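A minimal sketch of this mIoU protocol is shown below, assuming the generated images have already been segmented by the corresponding pre-trained network; array shapes and class IDs are illustrative.

import numpy as np

def mean_iou(pred_masks, gt_masks, num_classes):
    """mIoU between segmentation maps predicted on generated images (pred_masks)
    and the ground-truth layouts used as input conditions (gt_masks)."""
    ious = []
    for c in range(num_classes):
        pred_c = (pred_masks == c)
        gt_c = (gt_masks == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))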
TABLE 4
Text-to-image generation performance on the COCO dataset. † denotes results obtained by using the corresponding open-source code. In the original table, rows in grey and cyan denote Transformer-based and diffusion-based methods, respectively; the remaining rows are GAN-based methods. Part of the results are retrieved from [110].
Methods IS ↑ FID ↓ R-Prec. ↑
Real Images [274] 34.88 6.09 68.58
StackGAN [53] 8.450 74.05 -
StackGAN++ [108] 8.300 81.59 -
AttnGAN [54] 25.89 35.20 85.47
MirrorGAN [55] 26.47 - 74.52
AttnGAN+OP [274] 24.76 33.35 82.44
OP-GAN [274] 27.88 24.70 89.01
SEGAN [137] 27.86 32.28 -
ControlGAN [138] 24.06 - 82.43
DM-GAN [139] 30.49 32.64 88.56
DM-GAN [139]† 32.43 24.24 92.23
Obj-GAN [275] 27.37 25.64 91.05
Obj-GAN [275]† 27.32 24.70 91.91
TVBi-GAN [276] 31.01 31.97 -
Wang et al. [277] 29.03 16.28 82.70
Rombach et al. [278] 34.70 30.63 -
CPGAN [279] 52.73 - 93.59
Pavllo et al. [114] - 19.65 -
XMC-GAN [163] 30.45 9.330 -
LAFITE [26] 32.34 8.120 -
CogView [69] 18.20 27.10 -
CogView2 [280] 22.40 24.10 -
DALL-E [43] 17.90 27.50 -
NUWA [44] 27.20 12.90 -
DiVAE [205] - 11.53 -
Make-A-Scene [72] - 11.84 -
Parti [71] - 7.230 -
VQ-Diffusion [184] - 13.86 -
LDM [46] 30.29 12.63 -
GLIDE [180] - 12.24 -
DALL-E 2 [25] - 10.39 -
Imagen [10] - 7.270 -

TABLE 5
Audio-guided image editing (talking-head) performance on LRW [282] and VoxCeleb2 [281] under three metrics. ⋆ denotes that the model is trained for subject-specific talking-head generation. Part of the results are retrieved from [59]. Each row reports SSIM ↑, LMD ↓, and Sync ↑, first on LRW [282] and then on VoxCeleb2 [281].
Methods SSIM ↑ LMD ↓ Sync ↑ SSIM ↑ LMD ↓ Sync ↑
ATVG [120] 0.810 5.25 4.1 0.826 6.49 4.3
Wav2Lip [57] 0.862 5.73 6.9 0.846 12.26 4.5
MakeItTalk [58] 0.796 7.13 3.1 0.817 31.44 2.8
Rhythmic Head [141] - - - 0.779 14.76 3.8
PC-AVS [59] 0.861 3.93 6.4 0.886 6.88 5.9
GC-AVT [290] - - - 0.710 3.03 5.3
EAMM [291] 0.740 2.08 5.5 - - -
SyncTalkFace [60] 0.893 1.25 - - - -
DIRFA [292] - 3.16 6.4 - 4.45 5.8
AVCT⋆ [293] - - - - 0.25 7.0
Ground Truth 1.000 0.00 6.5 1.000 0.00 5.9
As shown in Table 3, the diffusion-based method (i.e., SDM [61]) achieves superior generation quality and diversity as evaluated by FID and LPIPS, and yields comparable semantic consistency as evaluated by mIoU compared with GAN-based methods. Although the comparison may not be entirely fair since the model sizes differ, the diffusion-based method still demonstrates its powerful modeling capability for semantic image synthesis. Despite its large model size, the autoregressive method Taming Transformer [2] does not show a clear advantage over other methods. We conjecture that this is because Taming Transformer [2] is a versatile framework for various conditional generation tasks without a specific design for semantic image synthesis, while the other methods in Table 3 mainly focus on this task. Notably, autoregressive and diffusion methods inherently support diverse conditional generation results, while GAN-based methods usually require additional modules (e.g., a VAE [175]) or dedicated designs to achieve diverse generation.

4.3.2 Text Guidance
We benchmark text-to-image generation methods on the COCO dataset as tabulated in Table 4 (the results are extracted from the relevant papers). As shown in Table 4, GAN-based, autoregressive, and diffusion-based methods can all achieve SOTA performance in terms of FID, e.g., 8.12 for the GAN-based method LAFITE [26], 7.23 for the autoregressive method Parti [71], and 7.27 for the diffusion-based method Imagen [10]. However, autoregressive and diffusion-based methods are still preferred in recent SOTA work, thanks to their stationary training objective and good scalability [21].

4.3.3 Audio Guidance
In terms of audio-guided image synthesis and editing, we conduct a quantitative comparison on the task of audio-driven talking face generation, which has been widely explored in the literature. Notably, the current development of talking face generation mainly relies on GANs, while autoregressive or diffusion-based methods for talking face generation remain under-explored. The quantitative results of talking face generation on the LRW [282] and VoxCeleb2 [281] datasets are shown in Table 5.
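As an illustration, the LMD metric reported in Table 5 reduces to an average Euclidean distance between facial landmarks detected on generated and ground-truth frames. The sketch below assumes the landmarks (e.g., 68 2D points per frame) have already been extracted by an off-the-shelf detector.

import numpy as np

def landmark_distance(pred_landmarks, gt_landmarks):
    """LMD: mean distance between corresponding facial landmarks.

    pred_landmarks, gt_landmarks: arrays of shape (num_frames, num_points, 2).
    """
    per_point = np.linalg.norm(pred_landmarks - gt_landmarks, axis=-1)
    return float(per_point.mean())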
5 OPEN CHALLENGES & DISCUSSION
Though MISE has made notable progress and achieved superior performance in recent years, several challenges remain for future exploration. In this section, we overview the typical challenges, share our humble opinions on possible solutions, and highlight future research directions.

5.1 Towards Large-Scale Multi-Modality Datasets
As current datasets mainly provide annotations in a single modality (e.g., visual guidance), most existing methods focus on image synthesis and editing conditioned on guidance from a single modality (e.g., text-to-image synthesis, semantic image synthesis). However, humans possess the capability of creating visual content with guidance from multiple modalities concurrently. Aiming to mimic this human intelligence, multimodal inputs are expected to be fused and leveraged jointly in image generation. Recently, Make-A-Scene [72] explores including semantic segmentation tokens in autoregressive modeling to achieve better quality in image synthesis; ControlNet [8] incorporates various visual conditions into Stable Diffusion (for text-to-image generation) to achieve controllable generation;
with MM-CelebA-HQ [28], COCO [240], and COCO-Stuff [241] as the training set, PoE-GAN [269] achieves image generation conditioned on multiple modalities including segmentation, sketch, image, and text. However, the sizes of MM-CelebA-HQ [28], COCO [240], and COCO-Stuff [241] are still far from narrowing the gap with real-world distributions. Therefore, to encompass a broad range of modalities in image generation, there is a need for a large-scale dataset equipped with annotations spanning a wide spectrum of modalities, such as semantic segmentation, text descriptions, and scene graphs. One potential approach to assembling such a dataset could be utilizing pre-trained models for different tasks to generate the requisite annotations. For instance, a segmentation model could be used to create semantic maps, and a detection model could be employed to annotate bounding boxes. Additionally, synthetic data could provide another feasible alternative, given its inherent advantage of readily providing a multitude of annotations.
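A minimal sketch of such a pseudo-annotation pipeline is given below. The segmenter, detector, and captioner callables are placeholders for whatever off-the-shelf models one chooses, not an existing toolchain.

def annotate_multimodal(images, segmenter, detector, captioner):
    """Pseudo-label a raw image collection with several guidance modalities."""
    dataset = []
    for image in images:
        dataset.append({
            "image": image,
            "semantic_map": segmenter(image),  # e.g., per-pixel class labels
            "boxes": detector(image),          # e.g., list of (label, x1, y1, x2, y2)
            "caption": captioner(image),       # e.g., a text description
        })
    return dataset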
5.2 Towards Faithful Evaluation Metrics
Accurate yet faithful evaluation is of great significance for the development of MISE and is still an open problem. Leveraging pre-trained models to conduct evaluations (e.g., FID) is constrained by the pre-training datasets, which tend to pose a discrepancy with the target datasets. User studies recruit human subjects to assess the synthesized images directly, which is however often resource-intensive in terms of time and cost.

With the advance of multimodal pre-training, CLIP [45] has been used to measure the similarity between texts and generated images, which however does not correlate well with human preferences. To inherit both the powerful representations of pre-trained models and the human preferences captured by crowd-sourcing studies, fine-tuning pre-trained CLIP with human preference datasets [294], [295] will be a promising direction for designing MISE evaluation metrics.
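A minimal sketch of the CLIP-based alignment score that such metrics build on is shown below, assuming image and text embeddings have already been extracted with a CLIP-style encoder; human-preference fine-tuning would reuse the same similarity under a ranking or regression loss.

import torch.nn.functional as F

def clip_alignment_score(image_emb, text_emb):
    """Cosine similarity between CLIP-style image and text embeddings.

    image_emb, text_emb: tensors of shape (N, D) for N image/prompt pairs.
    Higher is better; preference fine-tuning would recalibrate this score.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return (image_emb * text_emb).sum(dim=-1)  # per-pair alignment scores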
5.3 Towards 3D-Aware MISE
NeRF models the 3D geometry of the real world. With the incorporation of generative models, generative NeRF is notably appealing for MISE as it is associated with a latent space. Current generative NeRF models (e.g., StyleNeRF, EG3D) have enabled modeling scenes with simple geometry (e.g., faces, cars) from a collection of unposed 2D images, much like the training of unconditional GANs (e.g., StyleGAN). Powered by these efforts, several 3D-aware MISE tasks have been explored, e.g., text-to-NeRF [234] and semantic-to-NeRF [235]. However, current generative NeRFs still struggle on datasets with complex geometry variation, e.g., DeepFashion [245] and ImageNet [299].

Relying only on generative models to learn complex scene geometry from unposed 2D images is indeed intractable and challenging. A possible solution is to provide more prior knowledge of the scene, e.g., obtaining prior geometry with off-the-shelf models [300] or providing skeleton priors for generative human modeling. Notably, the power of prior knowledge has been explored in some recent studies of 3D-aware tasks [260], [300], [301]. Another possible approach is to provide more supervision, e.g., creating a large dataset with multiview annotations or geometry information. Once 3D-aware generative modeling succeeds on complex natural scenes, some interesting multimodal applications will become possible, e.g., a 3D version of DALL-E.

6 SOCIAL IMPACTS
As related to the hot concept of AI-Generated Content (AIGC), MISE has gained considerable attention in recent years. The rapid advancements in MISE offer unprecedented generation realism and editing possibilities, which have influenced and will continue to influence our society in both positive and potentially negative ways. In this section, we discuss the correlation between MISE and AIGC, and analyze the potential social impacts of MISE.
[41] K. Gregor et al. Deep autoregressive networks. In ICML, 2014. [78] A. Mikaeili et al. Sked: Sketch-guided text-based 3d editing.
[42] A. v. d. Oord et al. Neural discrete representation learning. arXiv:2303.10735, 2023.
arXiv:1711.00937, 2017. [79] C. Bao et al. Sine: Semantic-driven image-based nerf editing with
[43] A. Ramesh et al. DALL·E: Creating images from text. Technical prior-guided editing field. In CVPR, 2023.
report, OpenAI, 2021. [80] S. Weder et al. Removing objects from neural radiance fields. In
[44] C. Wu et al. NÜwa: Visual synthesis pre-training for neural visual CVPR, 2023.
world creation. arXiv:2111.12417, 2021. [81] D. Xu et al. Sinnerf: Training neural radiance fields on complex
[45] A. Radford et al. Learning transferable visual models from scenes from a single image. In ECCV, 2022.
natural language supervision. arXiv:2103.00020, 2021. [82] J. Hyung et al. Local 3d editing via 3d distillation of clip
[46] R. Rombach et al. High-resolution image synthesis with latent knowledge. In CVPR, 2023.
diffusion models. In CVPR, 2022. [83] C. Wang et al. Clip-nerf: Text-and-image driven manipulation of
[47] C.-H. Lin et al. Magic3d: High-resolution text-to-3d content neural radiance fields. In CVPR, 2022.
creation. arXiv:2211.10440, 2022. [84] Z. Wang et al. Prolificdreamer: High-fidelity and di-
[48] J. Sun et al. Ide-3d: Interactive disentangled editing for high- verse text-to-3d generation with variational score distillation.
resolution 3d-aware portrait synthesis. arXiv:2205.15517, 2022. arXiv:2305.16213, 2023.
[49] K. Deng et al. 3d-aware conditional image synthesis. In CVPR, [85] Z. Ye et al. Geneface: Generalized and high-fidelity audio-driven
2023. 3d talking face synthesis. arXiv:2301.13430, 2023.
[50] T.-C. Wang et al. High-resolution image synthesis and semantic [86] Z. Ye et al. Geneface++: Generalized and stable real-time audio-
manipulation with conditional gans. In CVPR, 2018. driven 3d talking face generation. arXiv:2305.00787, 2023.
[51] H.-Y. Lee et al. Diverse image-to-image translation via disentan- [87] S. Shen et al. Learning dynamic facial radiance fields for few-shot
gled representations. In ECCV, 2018. talking head synthesis. In ECCV, 2022.
[52] V. Sushko et al. You only need adversarial supervision for [88] X. Liu et al. Semantic-aware implicit neural audio-driven video
semantic image synthesis. arXiv:2012.04781, 2020. portrait generation. In ECCV, 2022.
[53] H. Zhang et al. StackGAN: Text to photo-realistic image synthesis [89] L. Ma et al. Pose guided person image generation.
with stacked generative adversarial networks. In ICCV, 2017. arXiv:1705.09368, 2017.
[54] T. Xu et al. Attngan: Fine-grained text to image generation with [90] Y. Men et al. Controllable person image synthesis with attribute-
attentional generative adversarial networks. In CVPR, 2018. decomposed gan. In CVPR, 2020.
[55] T. Qiao et al. Mirrorgan: Learning text-to-image generation by [91] C. Zhang et al. Deep monocular 3d human pose estimation via
redescription. In CVPR, 2019. cascaded dimension-lifting. arXiv:2104.03520, 2021.
[56] M. Kang et al. Scaling up gans for text-to-image synthesis. In [92] J.-Y. Zhu et al. Toward multimodal image-to-image translation.
CVPR, 2023. In NeurIPS, 2017.
[57] K. Prajwal et al. A lip sync expert is all you need for speech to [93] C. Gao et al. Sketchycoco: Image generation from freehand scene
lip generation in the wild. In MM, 2020. sketches. In CVPR, 2020.
[58] Y. Zhou et al. Makelttalk: speaker-aware talking-head animation. [94] W. Chen and J. Hays. Sketchygan: Towards diverse and realistic
TOG, 2020. sketch to image synthesis. In CVPR, 2018.
[59] H. Zhou et al. Pose-controllable talking face generation by [95] S.-Y. Chen et al. Deepfacedrawing: Deep generation of face
implicitly modularized audio-visual representation. In CVPR, images from sketches. TOG, 2020.
2021. [96] M. Zhu et al. A deep collaborative framework for face photo–
[60] S. J. Park et al. Synctalkface: Talking face generation with precise sketch synthesis. TNNLS, 2019.
lip-syncing via audio-lip memory. In AAAI, 2022. [97] M. Zhu et al. Learning deep patch representation for probabilistic
[61] W. Wang et al. Semantic image synthesis via diffusion models. graphical model-based face sketch synthesis. IJCV, 2021.
arXiv:2207.00050, 2022. [98] M. Zhu et al. Knowledge distillation for face photo–sketch
[62] C. Qin et al. Unicontrol: A unified diffusion model for control- synthesis. TNNLS, 2020.
lable visual generation in the wild. arXiv:2305.11147, 2023. [99] Z. Li et al. Staged sketch-to-image synthesis via semi-supervised
[63] T. Brooks et al. Instructpix2pix: Learning to follow image editing generative adversarial networks. TMM, 2020.
instructions. In CVPR, 2023. [100] W. Sun and T. Wu. Image synthesis from reconfigurable layout
[64] S. Shen et al. Difftalk: Crafting diffusion models for generalized and style. In ICCV, 2019.
audio-driven portraits animation. In CVPR, 2023. [101] B. Zhao et al. Image generation from layout. In CVPR, 2019.
[65] J. Tseng et al. Edge: Editable dance generation from music. In [102] Y. Li et al. Bachgan: High-resolution image synthesis from salient
CVPR, 2023. object layout. In CVPR, 2020.
[66] L. Ruan et al. Mm-diffusion: Learning multi-modal diffusion [103] Z. Li et al. Image synthesis from layout with locality-aware mask
models for joint audio and video generation. In Proceedings of the adaption. In ICCV, 2021.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [104] S. Frolov et al. Attrlostgan: Attribute controlled image synthesis
pp. 10219–10228, 2023. from reconfigurable layout and style. arXiv:2103.13722, 2021.
[67] L. Yang et al. Diffusion-based scene graph to image generation [105] J. Y. Koh et al. Text-to-image generation grounded by fine-grained
with masked contrastive pre-training. arXiv:2211.11138, 2022. user attention. In WACV, 2021.
[68] S. Kim et al. Instaformer: Instance-aware image-to-image trans- [106] F. Zhan et al. Bi-level feature alignment for versatile image
lation with transformer. In CVPR, 2022. translation and manipulation. In ECCV, 2022.
[69] M. Ding et al. Cogview: Mastering text-to-image generation via [107] H. Zheng et al. Semantic layout manipulation with high-
transformers. arXiv:2105.13290, 2021. resolution sparse attention. TPAMI, 2022.
[70] A. Ramesh et al. Zero-shot text-to-image generation. [108] H. Zhang et al. Stackgan++: Realistic image synthesis with
arXiv:2102.12092, 2021. stacked generative adversarial networks. TPAMI, 2018.
[71] J. Yu et al. Scaling autoregressive models for content-rich text-to- [109] S. Reed et al. Generative adversarial text to image synthesis. In
image generation. arXiv:2206.10789, 2022. ICML, 2016.
[72] O. Gafni et al. Make-a-scene: Scene-based text-to-image genera- [110] S. Frolov et al. Adversarial text-to-image synthesis: A review.
tion with human priors. arXiv:2203.13131, 2022. NN, 2021.
[73] H. Chang et al. Muse: Text-to-image generation via masked [111] T. Mikolov et al. Distributed representations of words and
generative transformers. arXiv:2301.00704, 2023. phrases and their compositionality. In NeurIPS, 2013.
[74] Y. Lu et al. Live speech portraits: real-time photorealistic talking- [112] Z. S. Harris. Distributional structure. Word, 1954.
head animation. TOG, 2021. [113] T. Wang et al. Faces à la carte: Text-to-face generation via attribute
[75] R. Li et al. Ai choreographer: Music conditioned 3d dance disentanglement. In WACV, 2021.
generation with aist++. In ICCV, 2021. [114] D. Pavllo et al. Controlling style and semantics in weakly-
[76] L. Siyao et al. Bailando: 3d dance generation by actor-critic gpt supervised image generation. In ECCV, 2020.
with choreographic memory. In CVPR, 2022. [115] J. Devlin et al. Bert: Pre-training of deep bidirectional transform-
[77] Y. Yin et al. Or-nerf: Object removing from 3d scenes ers for language understanding. arXiv:1810.04805, 2018.
guided by multiview segmentation with neural radiance fields. [116] D. Harwath and J. R. Glass. Learning word-like units from joint
arXiv:2305.10503, 2023. audio-visual analysis. arXiv:1701.07481, 2017.
[117] D. Harwath et al. Vision as an interlingua: Learning multilingual [155] A. Nguyen et al. Plug & play generative networks: Conditional
semantic embeddings of untranscribed speech. In ICASSP, 2018. iterative generation of images in latent space. In CVPR, 2017.
[118] J. Li et al. Direct speech-to-image translation. JSTSP, 2020. [156] J. Johnson et al. Perceptual losses for real-time style transfer and
[119] Y. Aytar et al. Soundnet: Learning sound representations from super-resolution. In ECCV, 2016.
unlabeled video. NeurIPS, 2016. [157] C. Wang et al. Perceptual adversarial networks for image-to-
[120] L. Chen et al. Hierarchical cross-modal talking face generation image transformation. TIP, 2018.
with dynamic pixel-wise loss. In CVPR, 2019. [158] S. Benaim and L. Wolf. One-sided unsupervised domain map-
[121] S. Hochreiter and J. Schmidhuber. Long short-term memory. ping. NeurIPS, 2017.
Neural computation, 1997. [159] H. Fu et al. Geometry-consistent generative adversarial networks
[122] A. Owens et al. Visually indicated sounds. In CVPR, 2016. for one-sided unsupervised domain mapping. In CVPR, 2019.
[123] Y. Song et al. Talking face generation by conditional recurrent [160] A. v. d. Oord et al. Representation learning with contrastive
adversarial network. arXiv:1804.04786, 2018. predictive coding. arXiv:1807.03748, 2018.
[124] P. Ekman et al. Facial action coding system (facs) a human face. [161] T. Park et al. Contrastive learning for unpaired image-to-image
Salt Lake City, 2002. translation. In ECCV, 2020.
[125] J. Johnson et al. Image generation from scene graphs. In CVPR, [162] A. Andonian et al. Contrastive feature loss for image prediction.
2018. In ICCV, 2021.
[126] D. M. Vo and A. Sugimoto. Visual-relation conscious image [163] H. Zhang et al. Cross-modal contrastive learning for text-to-
generation from structured-text. In ECCV, 2020. image generation. In CVPR, 2021.
[127] X. Shi et al. Convolutional lstm network: A machine learning [164] A. Brock et al. Large scale gan training for high fidelity natural
approach for precipitation nowcasting. NeurIPS, 28, 2015. image synthesis. arXiv:1809.11096, 2018.
[128] T. Fang et al. Reconstructing perceptive images from brain [165] R. Zhang et al. The unreasonable effectiveness of deep features
activity by shape-semantic gan. NeurIPS, 2020. as a perceptual metric. In CVPR, 2018.
[129] S. Lin et al. Mind reader: Reconstructing complex images from [166] E. Richardson et al. Encoding in style: a stylegan encoder for
brain activities. arXiv:2210.01769, 2022. image-to-image translation. arXiv:2008.00951, 2020.
[130] Y. Takagi and S. Nishimoto. High-resolution image reconstruc- [167] H. Wang et al. Cycle-consistent inverse gan for text-to-image
tion with latent diffusion models from human brain activity. synthesis. In MM, 2021.
bioRxiv, 2022. [168] Y. Jiang et al. Talk-to-edit: Fine-grained facial editing via dialog.
[131] G. Yang and D. Ramanan. Upgrading optical flow to 3d scene In ICCV, 2021.
flow through optical expansion. In CVPR, 2020. [169] D. Bau et al. Paint by word. arXiv:2103.10951, 2021.
[132] Y. Endo. User-controllable latent transformer for stylegan image [170] U. Kocasari et al. Stylemc: Multi-channel based fast text-guided
layout editing. In Computer Graphics Forum, 2022. image generation and manipulation. arXiv:2112.08493, 2021.
[133] H. Tang et al. Multi-channel attention selection gan with cas- [171] X. Liu et al. Fusedream: Training-free text-to-image generation
caded semantic guidance for cross-view image translation. In with improved clip+ gan space optimization. arXiv:2112.01573,
CVPR, 2019. 2021.
[134] P. Zhang et al. Cross-domain correspondence learning for [172] R. Gal et al. Stylegan-nada: Clip-guided domain adaptation of
exemplar-based image translation. In CVPR, 2020. image generators. arXiv:2108.00946, 2021.
[135] F. Zhan et al. Unbalanced feature transport for exemplar-based [173] Y. Yu et al. Towards counterfactual image manipulation via clip.
image translation. In CVPR, 2021. arXiv:2207.02812, 2022.
[136] P. Zhu et al. Sean: Image synthesis with semantic region-adaptive [174] J. Sohl-Dickstein et al. Deep unsupervised learning using
normalization. In CVPR, 2020. nonequilibrium thermodynamics. In ICML, 2015.
[137] H. Tan et al. Semantics-enhanced adversarial nets for text-to- [175] D. P. Kingma and M. Welling. Auto-encoding variational bayes.
image synthesis. In ICCV, 2019. arXiv:1312.6114, 2013.
[138] B. Li et al. Controllable text-to-image generation. [176] D. Rezende and S. Mohamed. Variational inference with normal-
arXiv:1909.07083, 2019. izing flows. In ICML, 2015.
[139] M. Zhu et al. Dm-gan: Dynamic memory generative adversarial [177] L. Dinh et al. Density estimation using real nvp. arXiv:1605.08803,
networks for text-to-image synthesis. In CVPR, 2019. 2016.
[140] V. Blanz et al. A morphable model for the synthesis of 3d faces. [178] J. Menick and N. Kalchbrenner. Generating high fidelity images
In Siggraph, 1999. with subscale pixel networks and multidimensional upscaling.
[141] L. Chen et al. Talking-head generation with rhythmic head arXiv:1812.01608, 2018.
motion. In ECCV, 2020. [179] A. Van Oord et al. Pixel recurrent neural networks. In ICML,
[142] H. Zhou et al. Talking face generation by adversarially disentan- 2016.
gled audio-visual representation. In AAAI, 2019. [180] A. Nichol et al. Glide: Towards photorealistic image generation
[143] S. Suwajanakorn et al. Synthesizing obama: learning lip sync and editing with text-guided diffusion models. arXiv:2112.10741,
from audio. TOG, 2017. 2021.
[144] S. Wang et al. One-shot talking face generation from single- [181] B. Kawar et al. Imagic: Text-based real image editing with
speaker audio-visual correlation learning. arXiv:2112.02749, 2021. diffusion models. arXiv:2210.09276, 2022.
[145] T. Karras et al. A style-based generator architecture for generative [182] C. Raffel et al. Exploring the limits of transfer learning with a
adversarial networks. In CVPR, 2019. unified text-to-text transformer. JMLR, 2020.
[146] T. Karras et al. Progressive growing of gans for improved quality, [183] J. Ho et al. Cascaded diffusion models for high fidelity image
stability, and variation. arXiv:1710.10196, 2017. generation. JMLR, 2022.
[147] Z. Zhang et al. Photographic text-to-image synthesis with a [184] S. Gu et al. Vector quantized diffusion model for text-to-image
hierarchically-nested adversarial network. In CVPR, 2018. synthesis. arXiv:2111.14822, 2021.
[148] G. Yin et al. Semantics disentangling for text-to-image genera- [185] Z. Tang et al. Improved vector quantized diffusion models.
tion. In CVPR, 2019. arXiv:2205.16007, 2022.
[149] M. Cha et al. Adversarial learning of semantic relevance in text [186] P. Chahal. Exploring transformer backbones for image diffusion
to image synthesis. In AAAI, 2019. models. arXiv:2212.14678, 2022.
[150] M. Amodio and S. Krishnaswamy. Travelgan: Image-to-image [187] Y. Jiang et al. Text2human: Text-driven controllable human image
translation by transformation vector learning. In CVPR, 2019. generation. TOG, 2022.
[151] J.-Y. Zhu et al. Unpaired image-to-image translation using cycle- [188] N. Liu et al. Compositional visual generation with composable
consistent adversarial networks. In ICCV, 2017. diffusion models. arXiv:2206.01714, 2022.
[152] H. Tang et al. Cycle in cycle generative adversarial networks for [189] O. Bar-Tal et al. Multidiffusion: Fusing diffusion paths for
keypoint-guided image generation. In MM, 2019. controlled image generation. arXiv:2302.08113, 2023.
[153] Q. Lao et al. Dual adversarial inference for text-to-image synthe- [190] A. Blattmann et al. Retrieval-augmented diffusion models.
sis. In ICCV, 2019. arXiv:2204.11824, 2022.
[154] Z. Chen and Y. Luo. Cycle-consistent diverse image synthesis [191] O. Avrahami et al. Blended diffusion for text-driven editing of
from natural language. In ICMEW. IEEE, 2019. natural images. arXiv:2111.14818, 2021.
[192] Z. Zhang et al. Sine: Single image editing with text-to-image [229] E. R. Chan et al. pi-gan: Periodic implicit generative adversarial
diffusion models. arXiv:2212.04489, 2022. networks for 3d-aware image synthesis. In CVPR, 2021.
[193] F. Zhan et al. Auto-regressive image synthesis with integrated [230] E. Perez et al. Film: Visual reasoning with a general conditioning
quantization. In ECCV, 2022. layer. In AAAI, volume 32, 2018.
[194] P. Esser et al. Imagebart: Bidirectional context with multinomial [231] V. Sitzmann et al. Implicit neural representations with periodic
diffusion for autoregressive image synthesis. In NeurIPS, 2021. activation functions. NeurIPS, 2020.
[195] J. T. Rolfe. Discrete variational autoencoders. arXiv:1609.02200, [232] J. Gu et al. Stylenerf: A style-based 3d-aware generator for high-
2016. resolution image synthesis. In ICLR, 2022.
[196] Y. Bengio et al. Estimating or propagating gradients through [233] E. R. Chan et al. Efficient geometry-aware 3d generative adver-
stochastic neurons for conditional computation. arXiv:1308.3432, sarial networks. In CVPR, 2022.
2013. [234] K. Jo et al. Cg-nerf: Conditional generative neural radiance fields.
[197] J. Yu et al. Vector-quantized image modeling with improved arXiv:2112.03517, 2021.
vqgan. arXiv:2110.04627, 2021. [235] Y. Chen et al. Sem2nerf: Converting single-view semantic masks
[198] W. Shin et al. Translation-equivariant image quantizer for bi- to neural radiance fields. arXiv:2203.10821, 2022.
directional image-text generation. arXiv:2112.00384, 2021. [236] J. Sun et al. Fenerf: Face editing in neural radiance fields. In
[199] A. Lamb et al. Discriminative regularization for generative CVPR, 2022.
models. arXiv:1602.03220, 2016. [237] C. Wang et al. Clip-nerf: Text-and-image driven manipulation of
[200] A. B. L. Larsen et al. Autoencoding beyond pixels using a learned neural radiance fields. arXiv:2112.05139, 2021.
similarity metric. In ICML, 2016. [238] Y. Xue et al. Deep image synthesis from intuitive user input: A
[201] X. Dong et al. Peco: Perceptual codebook for bert pre-training of review and perspectives. CVM, 2022.
vision transformers. arXiv:2111.12710, 2021. [239] B. Zhou et al. Scene parsing through ade20k dataset. In CVPR,
[202] H. Bao et al. Beit: Bert pre-training of image transformers. 2017.
arXiv:2106.08254, 2021. [240] T.-Y. Lin et al. Microsoft coco: Common objects in context. In
[203] Q. Cao et al. Vggface2: A dataset for recognising faces across ECCV, 2014.
pose and age. In FG, 2018. [241] H. Caesar et al. Coco-stuff: Thing and stuff classes in context. In
[204] A. Dosovitskiy et al. An image is worth 16x16 words: Transform- CVPR, 2018.
ers for image recognition at scale, 2020. [242] J. Yang et al. Panoptic scene graph generation. arXiv:2207.11247,
[205] J. Shi et al. Divae: Photorealistic images synthesis with denoising 2022.
diffusion decoder. arXiv:2206.00386, 2022. [243] M. Cordts et al. The cityscapes dataset for semantic urban scene
[206] M. Ni et al. NÜwa-lip: Language guided image inpainting with understanding. In CVPR, 2016.
defect-free vqgan. arXiv:2202.05009, 2022. [244] Z. Liu et al. Deep learning face attributes in the wild. In ICCV,
[207] A. Razavi et al. Generating diverse high-fidelity images with 2015.
vq-vae-2. In NeurISP, 2019. [245] Z. Liu et al. Deepfashion: Powering robust clothes recognition
[208] D. Lee et al. Autoregressive image generation using residual and retrieval with rich annotations. In CVPR, 2016.
quantization. In CVPR, 2022. [246] X. Liang et al. Deep human parsing with active template regres-
[209] J. Zhang et al. Regularized vector quantization for tokenized sion. TPAMI, 2015.
image synthesis. In CVPR, 2023. [247] N. Silberman and R. Fergus. Indoor scene segmentation using a
[210] A. Baevski et al. vq-wav2vec: Self-supervised learning of discrete structured light sensor. In ICCVW. IEEE, 2011.
speech representations. arXiv:1910.05453, 2019. [248] J. Krause et al. 3d object representations for fine-grained catego-
[211] E. Jang et al. Categorical reparameterization with gumbel- rization. In ICCVW, 2013.
softmax. arXiv:1611.01144, 2016. [249] M.-E. Nilsback and A. Zisserman. Automated flower classifica-
[212] J. Gao et al. Get3d: A generative model of high quality 3d tion over a large number of classes. In ICVGIP, 2008.
textured shapes learned from images. NeurIPS, 2022. [250] P. Welinder et al. Caltech-ucsd birds 200. California Institute of
[213] A. Van den Oord et al. Conditional image generation with Technology, 2010.
pixelcnn decoders. In NeurIPS, 2016. [251] C. Schuhmann et al. Laion-5b: An open large-scale dataset for
[214] N. Parmar et al. Image transformer. In ICML, 2018. training next generation image-text models. In NeurIPS Datasets
[215] Y. Huang et al. A picture is worth a thousand words: A unified and Benchmarks Track, 2022.
system for diverse captions and rich images generation. In MM, [252] R. Krishna et al. Visual genome: Connecting language and vision
2021. using crowdsourced dense image annotations. IJCV, 2017.
[216] Y. Huang et al. Unifying multimodal transformer for bi- [253] A. Nagrani et al. Voxceleb: a large-scale speaker identification
directional image and text generation. In MM, 2021. dataset. arXiv:1706.08612, 2017.
[217] E. Hoogeboom et al. Argmax flows and multinomial diffusion: [254] J. Son Chung et al. Lip reading sentences in the wild. In CVPR,
Towards non-autoregressive language models. arXiv:2102.05379, 2017.
2021. [255] T.-J. Fu et al. Language-driven image style transfer.
[218] H. Chang et al. Maskgit: Masked generative image transformer. arXiv:2106.00178, 2021.
In CVPR, 2022. [256] G. Kwon and J. C. Ye. Clipstyler: Image style transfer with a
[219] Z. Zhang et al. M6-ufc: Unifying multi-modal controls for single text condition. arXiv:2112.00374, 2021.
conditional image synthesis. arXiv:2105.14211, 2021. [257] S. Loeschcke et al. Text-driven stylization of video objects.
[220] Y. Yu et al. Diverse image inpainting with bidirectional and arXiv:2206.12396, 2022.
autoregressive transformers. In MM, 2021. [258] O. Michel et al. Text2mesh: Text-driven neural stylization for
[221] D. Liu et al. Asset: autoregressive semantic scene editing with meshes. In CVPR, 2022.
transformers at high resolutions. TOG, 2022. [259] N. Khalid et al. Text to mesh without 3d supervision using limit
[222] Y. Xie et al. Neural fields in visual computing and beyond. In subdivision. arXiv:2203.13333, 2022.
CGF. Wiley Online Library, 2022. [260] F. Mu et al. 3d photo stylization: Learning to generate stylized
[223] A. Jain et al. Zero-shot text-guided object generation with dream novel views from a single image. In CVPR, 2022.
fields. In CVPR, 2022. [261] A. Noguchi and T. Harada. Rgbd-gan: Unsupervised 3d repre-
[224] F. Hong et al. Avatarclip: Zero-shot text-driven generation and sentation learning from natural image datasets via rgbd image
animation of 3d avatars. TOG, 2022. synthesis. arXiv:1909.12573, 2019.
[225] T. Shen et al. Deep marching tetrahedra: a hybrid representation [262] J.-Y. Zhu et al. Visual object networks: Image generation with
for high-resolution 3d shape synthesis. NeurIPS, 2021. disentangled 3d representations. NeurIPS, 2018.
[226] D. Verbin et al. Ref-nerf: Structured view-dependent appearance [263] Z. Shi et al. 3d-aware indoor scene synthesis with depth priors.
for neural radiance fields. In CVPR, 2022. In ECCV, 2022.
[227] K. Schwarz et al. Graf: Generative radiance fields for 3d-aware [264] T. Nguyen-Phuoc et al. Hologan: Unsupervised learning of 3d
image synthesis. NeurIPS, 33, 2020. representations from natural images. In ICCV, 2019.
[228] M. Niemeyer and A. Geiger. Giraffe: Representing scenes as [265] Y. Jiang et al. Transgan: Two pure transformers can make one
compositional generative neural feature fields. In CVPR, 2021. strong gan, and that can scale up. NeurIPS, 2021.
[266] J. Park and Y. Kim. Styleformer: Transformer based generative Fangneng Zhan is a postdoctoral researcher at Max Planck Institute
adversarial networks with style vector. In CVPR, 2022. for Informatics. He received the Ph.D. degree in Computer Science
[267] D. A. Hudson and C. L. Zitnick. Generative adversarial trans- & Engineering from Nanyang Technological University. His research
formers. ICML, 2021. interests include generative models and neural rendering. He serves
[268] J. Cho et al. X-lxmert: Paint, caption and answer questions with as a reviewer or program committee member for top journals and
multi-modal transformers. arXiv:2009.11278, 2020. conferences including TPAMI, ICLR, ICML, NeurIPS, CVPR, ICCV.
[269] X. Huang et al. Multimodal conditional image synthesis with
product-of-experts gans. arXiv:2112.05130, 2021. Yingchen Yu is currently pursuing the Ph.D. degree at School of
[270] Z. Tan et al. Efficient semantic image synthesis via class-adaptive Computer Science and Engineering, Nanyang Technological University
normalization. TPAMI, 2021. under Alibaba Talent Programme. His research interests are image
[271] X. Liu et al. Learning to predict layout-to-image conditional synthesis and manipulation.
convolutions for semantic image synthesis. arXiv:1910.06809,
2019. Rongliang Wu received the Ph.D. degree from School of Computer
[272] Z. Zhu et al. Semantically multi-modal image synthesis. In CVPR, Science and Engineering, Nanyang Technological University. His re-
2020. search interests include computer vision and deep learning, specifically
[273] Z. Tan et al. Diverse semantic image synthesis via probability for facial expression analysis and generation.
distribution modeling. In CVPR, 2021.
[274] T. Hinz et al. Semantic object accuracy for generative text-to- Jiahui Zhang is currently pursuing the Ph.D. degree at School of Com-
image synthesis. arXiv:1910.13321, 2019. puter Science and Engineering, Nanyang Technological University. His
research interests include computer vision and machine learning.
[275] W. Li et al. Object-driven text-to-image synthesis via adversarial
training. In CVPR, 2019.
Shijian Lu is an Associate Professor in the School of Computer Science
[276] Z. Wang et al. Text to image synthesis with bidirectional genera-
and Engineering, Nanyang Technological University. He received his
tive adversarial network. In ICME, 2020.
PhD in Electrical and Computer Engineering from the National Univer-
[277] M. Wang et al. End-to-end text-to-image synthesis with spatial
sity of Singapore. His research interests include computer vision and
constrains. TIST, 2020.
deep learning. He has published more than 100 internationally refereed
[278] R. Rombach et al. Network-to-network translation with condi- journal and conference papers. Dr Lu is currently an Associate Editor for
tional invertible neural networks. arXiv:2005.13580, 2020. the journals of Pattern Recognition and Neurocomputing.
[279] J. Liang et al. Cpgan: Content-parsing generative adversarial
networks for text-to-image synthesis. In ECCV, 2020. Lingjie Liu is the Aravind K. Joshi Assistant Professor in the Depart-
[280] M. Ding et al. Cogview2: Faster and better text-to-image genera- ment of Computer and Information Science at the University of Penn-
tion via hierarchical transformers. arXiv:2204.14217, 2022. sylvania. Before that, she was a Lise Meitner postdoctoral researcher
[281] J. S. Chung et al. Voxceleb2: Deep speaker recognition. in the Visual Computing and AI Department at Max Planck Institute for
arXiv:1806.05622, 2018. Informatics. She obtained her Ph.D. degree from the University of Hong
[282] J. S. Chung and A. Zisserman. Lip reading in the wild. In ACCV, Kong in 2019. Her research interests are Neural Scene Representa-
2016. tions, Neural Rendering, Human Performance Modeling and Capture,
[283] T. Salimans et al. Improved techniques for training gans. and 3D Reconstruction.
NeurIPS, 2016.
[284] M. Heusel et al. Gans trained by a two time-scale update rule Adam Kortylewski is a research group leader at the University of
converge to a local nash equilibrium. In NeurIPS, 2017. Freiburg and the Max Planck Institute for Informatics where he leads
[285] S. Hong et al. Inferring semantic layout for hierarchical text-to- the Generative Vision and Robust Learning lab. Before that he was a
image synthesis. In CVPR, 2018. postdoc at Johns Hopkins University with Alan Yuille for three years.
[286] J. S. Chung and A. Zisserman. Out of time: automated lip sync He obtained his PhD from the University of Basel with Thomas Vetter.
in the wild. In ACCV, 2016. His research focuses understanding the principles that enable artificial
[287] T. Xiao et al. Unified perceptual parsing for scene understanding. intelligence systems to reliably perceive our world through images.
In ECCV, 2018. Adam was awarded the prestigious Emmy Noether Grant (2022) of
[288] F. Yu et al. Dilated residual networks. In CVPR, 2017. the German Science Foundation for exceptionally qualified early career
[289] L.-C. Chen et al. Semantic image segmentation with deep convo- researchers.
lutional nets and fully connected crfs. arXiv:1412.7062, 2014.
[290] B. Liang et al. Expressive talking head generation with granular Christian Theobalt is a Professor of Computer Science and the director
audio-visual control. In CVPR, 2022. of the department “Visual Computing and Artificial Intelligence” at the
[291] X. Ji et al. Eamm: One-shot emotional talking face via audio- Max Planck Institute for Informatics, Germany. He is also a professor at
based emotion-aware motion model. arXiv:2205.15278, 2022. Saarland University. His research lies on the boundary between Com-
puter Vision and Computer Graphics. Christian received several awards,
[292] R. Wu et al. Audio-driven talking face generation with diverse
for instance the Otto Hahn Medal of the Max Planck Society (2007),
yet realistic facial animations. arXiv:2304.08945, 2023.
the EUROGRAPHICS Young Researcher Award (2009), the German
[293] S. Wang et al. One-shot talking face generation from single-
Pattern Recognition Award (2012), an ERC Starting Grant (2013), an
speaker audio-visual correlation learning. In AAAI, 2022.
ERC Consolidator Grant (2017), and the Eurographics Outstanding
[294] X. Wu et al. Human preference score v2: A solid benchmark Technical Contributions Award (2020). In 2015, he was elected one of
for evaluating human preferences of text-to-image synthesis. Germany’s top 40 innovators under 40 by the magazine Capital.
arXiv:2306.09341, 2023.
[295] Y. Kirstain et al. Pick-a-pic: An open dataset of user preferences Eric Xing (Fellow, IEEE) received the Ph.D. degree in computer science
for text-to-image generation. arXiv:2305.01569, 2023. from the University of California at Berkeley, Berkeley, CA, USA, in
[296] V. Jayaram and J. Thickstun. Parallel and flexible sampling from 2004. He is currently a Professor of machine learning with the School
autoregressive models via langevin dynamics. In ICML, 2021. of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
[297] T. Dockhorn et al. Score-based generative modeling with His principal research interests lie in the development of machine
critically-damped langevin diffusion. arXiv:2112.07068, 2021. learning and statistical methodology, especially for solving problems
[298] Y. Song et al. Consistency models. In ICML, 2023. involving automated learning, reasoning, and decision-making in high-
[299] J. Deng et al. ImageNet: A large-scale hierarchical image dimensional, multimodal, and dynamic possible worlds in social and
database. In CVPR, 2009. biological systems. Dr. Xing is a member of the DARPA Information
[300] I. Skorokhodov et al. 3d generation on imagenet. In ICLR, 2023. Science and Technology (ISAT) Advisory Group and the Program Chair
[301] Q. Xu et al. Point-nerf: Point-based neural radiance fields. In of the International Conference on Machine Learning (ICML) 2014. He is
CVPR, 2022. also an Associate Editor of The Annals of Applied Statistics (AOAS), the
[302] J. Bailey. The tools of generative art, from flash to neural Journal of American Statistical Association (JASA), the IEEE Transac-
networks. Art in America, 2020. tions on Pattern Analysis and Machine Intelligence (T-PAMI), and PLOS
[303] Y. Mirsky and W. Lee. The creation and detection of deepfakes: Computational Biology and an Action Editor of the Machine Learning
A survey. CSUR, 2021. Journal (MLJ) and the Journal of Machine Learning Research (JMLR).