Multi-Modal Generative AI Survey
Abstract—Multi-modal generative AI has received increasing attention in both academia and industry. Particularly, two dominant families of techniques are: i) the multi-modal large language model … modeling, while Sora is a multi-modal video generation model with diffusion denoising modeling. As such, there naturally arises a question: "Is it possible to establish a unified multi-modal generative model?" …
[Figure: taxonomy of multi-modal generative AI — related techniques (VAE, GAN, DDPM, SDE, latent diffusion model); model design (U-Net/Transformer architectures; AdaLN, cross-attention, and in-context conditioning for modality interaction); text-to-image diffusion models (GLIDE, Imagen, DALL-E, Stable Diffusion, etc.); and multi-modal understanding tasks such as conversation and reasoning, with benchmarks including WebVidQA, TGIF, EgoQA, CLEVR, VisualMRC, NExT-QA, and CLEVRER.]

…
p(w) = ∏_{i=1}^{n} p_{θ_L}(w_i | w_{<i}),   (1)
where θ_L denotes the parameters of the LLM, which is generally composed of several layers of transformers [5]. Note that the LLM can only receive text tokens as its input; thus, the next important problem for MLLM is how to enable the LLM to understand visual information. To tackle this problem, most existing works [4], [6], [7] try to align the LLM with visual encoders from vision-language pretraining tasks, such as CLIP [8]. More recently, there have been some attempts [3] to directly transform images into discrete visual tokens, so that the text and visual tokens can be handled by the auto-regressive LLM together. Next, we will introduce preliminaries about vision-language pretraining and visual tokenizers.
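To make the idea of feeding mixed text and visual tokens to an auto-regressive LLM concrete, the PyTorch-style sketch below places discrete visual tokens in the same vocabulary as text tokens (with an offset) and applies the standard next-token cross-entropy implied by Eq. (1). The vocabulary sizes and the offset scheme are illustrative assumptions, not the recipe of any particular MLLM.

import torch
import torch.nn.functional as F

# Illustrative sizes: a shared vocabulary holding both text tokens and
# discrete visual tokens (codebook indices shifted past the text vocabulary).
TEXT_VOCAB, VISUAL_VOCAB = 32000, 8192
VOCAB_SIZE = TEXT_VOCAB + VISUAL_VOCAB

def mixed_sequence(text_ids: torch.Tensor, visual_ids: torch.Tensor) -> torch.Tensor:
    # Concatenate text tokens and offset visual tokens into one sequence.
    return torch.cat([text_ids, visual_ids + TEXT_VOCAB], dim=-1)

def next_token_loss(llm, token_ids: torch.Tensor) -> torch.Tensor:
    # Eq. (1): maximizing sum_i log p(w_i | w_<i) equals minimizing cross-entropy
    # between shifted targets and the causal LM's per-position logits.
    logits = llm(token_ids[:, :-1])            # (B, L-1, VOCAB_SIZE)
    targets = token_ids[:, 1:]                 # next-token targets
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))

Here llm stands for any causal transformer that returns per-position logits; early-fusion models differ mainly in how the visual token ids are produced, which is the role of the visual tokenizers discussed below.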
2) Vision-Language Pretraining: We provide different paradigms of vision-language pretraining in Fig. 2. The success of BERT [9] in natural language processing (NLP) brings the "pretrain-finetune" paradigm into the mainstream and drives the development of pretraining in the multi-modal domain. Vision-language pretraining (VLP) aims to learn multi-modal representations from large-scale image-text data and thereby help models to understand visual information.

[Fig. 2: different paradigms of vision-language pretraining — (a) the BERT-like architecture, (b) the two-tower architecture, (c) the fused architecture that aligns unimodal features with a contrastive loss before a multi-modal encoder, and (d) the encoder-decoder architecture; the paradigms are trained with image-text matching, masked language modeling, image-text contrastive, and language modeling losses.]

Before the Vision Transformer (ViT) [12] emerges, some approaches [13]–[16] rely on a frozen object detector to extract region features as shown in Fig. 2(a), but this method is computationally expensive—up to 50 times more than a BERT-base-like model—and the performance of VLP models may be constrained by the frozen object detector. Pixel-BERT [17] attempts to address this by replacing the frozen object detector with a trainable ResNet [18], but its downstream performance only matches object-detector-based VLP models when using a very heavy ResNeXt-152. The introduction of ViT allows ViLT [19] to adopt a simpler visual embedding approach: linear projection operating on image patches, which significantly improves inference speed with only a minor performance trade-off.

On a different path, CLIP [8] and ALIGN [20] employ separate transformer encoders for each modality, a design commonly referred to as a two-tower structure, as shown in Fig. 2(b). They perform pretraining on massive amounts of noisy web data using a contrastive loss, aligning image and text embeddings in a shared embedding space. Despite their impressive zero-shot performance on image-text retrieval, these models lack the ability to capture the more complex interactions between image and text necessary for tasks like visual question answering.

ALBEF [21] unifies these two architectures. As shown in Fig. 2(c), ALBEF initially uses separate unimodal encoders for each modality and performs cross-attention fusion between image and text within a BERT-like multi-modal encoder. At the same time, the unimodal embeddings are aligned through a contrastive loss before fusion. This approach leads to strong unimodal and multi-modal representations, delivering superior performance on both retrieval and reasoning tasks.

As a significant cornerstone of MLLM, BLIP [22] builds upon ALBEF with two key improvements. From a model perspective, it introduces an additional transformer decoder as shown in Fig. 2(d), enabling not only image-text understanding but also image-to-text generation (image captioning), which paves the way for the influential MLLM BLIP-2 [23]. From a data perspective, it proposes a new dataset bootstrapping method, Captioning and Filtering (CapFilt). After training a BLIP model on noisy image-text pairs, this model generates captions for images in the dataset and filters out noisy captions from both the original and generated texts. This approach produces a cleaner dataset for training stronger VLP models and provides valuable insights for future MLLM dataset generation.

3) Visual Tokenizer: On the one hand, a naive way to transform images into a series of tokens is to split each image into a series of patches, and then map each patch to a continuous embedding with a linear projection, as adopted in Fuyu [24]. On the other hand, inspired by language models where each word is tokenized by a discrete tokenizer, a series of works also transform images into discrete tokens. Typical visual tokenizers include the VQ-VAEs [25], [26] and VQGANs [27], [28], whose overall framework is shown in Fig. 3.

Fig. 3. Illustration of the framework of the visual tokenizers: an encoder E compresses the image x into compressed visual tokens via a codebook Z, and a decoder D reconstructs the image x̂.

We will begin our discussion with VQ-VAE. Basically, VQ-VAE works like an auto-encoder, which has an encoder E(·) and a decoder D(·). Given an image x, VQ-VAE first encodes it with the encoder E(·) into a lower-dimension continuous vector E(x). Then, the lower-dimension vector is discretized with a codebook Z = {z_k}_{k=1}^{K}. The codebook is similar to the word embedding table in NLP, where K has a similar meaning to the vocabulary size, and each z_k ∈ R^{n_c} represents a visual prototype that is similar to a word embedding. With the encoded vector E(x) and the codebook Z, we can obtain the discrete value of the image by finding the nearest neighbor of E(x) in Z as follows:

Discrete(E(x)) = z_q,   q = argmin_k ||E(x) − z_k||.   (2)

After we obtain the discrete code z_q, we can use it to reconstruct the image with the decoder: x̂ = D(z_q). The training objective of VQ-VAE is as follows:

L = ||x − D(z_q)||_2^2 + ||sg[E(x)] − z_q||_2^2 + β||sg[z_q] − E(x)||_2^2,   (3)

where the first term means the reconstructed image should be close to the input image. However, since z_q is obtained by a nearest-neighbor lookup, it has no gradient. Therefore, in the second term, z_q should be close to the encoded image E(x), where sg[·] means stopping the gradient. Similarly, to optimize the encoder E, we need to make E(x) close to z_q, as shown in the third term. Note that when optimizing the codebook (the second term), [25] adopts exponential moving average updates. After training with this objective, we obtain a way to transform an image into discrete tokens. Compared to VQ-VAEs, VQGAN [27], [28] utilizes a GAN perceptual loss to replace the L2 reconstruction loss, which helps to learn a rich codebook. We use a simple example to illustrate the process of tokenization. If we have an input image of size H × W × 3, after the encoder E we obtain a lower-dimension vector E(x) of size h × w × n_c, where h < H, w < W, and n_c is the dimension of the code. This means we obtain h × w vectors of dimension n_c, and for each vector we find its nearest neighbor in the codebook for discretization, so that we finally obtain a discrete sequence of length h × w to represent the image.

Remark. On the one hand, VQGAN and VQ-VAE can be used as visual tokenizers to transform an image into discrete tokens, which enables the image to be received by LLMs for visual understanding. On the other hand, they can be used to compress an image into a lower-dimensional space, which motivates the well-known latent diffusion model (LDM) [29].
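As a concrete reference for Eqs. (2) and (3), the following is a minimal PyTorch sketch of the quantization step; the codebook size, code dimension, and the use of a straight-through estimator (rather than the EMA codebook update mentioned above) are illustrative choices, and the encoder/decoder are left abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    # Nearest-neighbor quantization with the codebook and commitment losses of Eq. (3).
    def __init__(self, K: int = 8192, n_c: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(K, n_c)        # Z = {z_k}, k = 1..K
        self.beta = beta

    def forward(self, e):                            # e = E(x), shape (B, h*w, n_c)
        # Eq. (2): q = argmin_k ||E(x) - z_k||
        d = (e.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)   # (B, h*w, K)
        q = d.argmin(dim=-1)                         # discrete token ids
        z_q = self.codebook(q)                       # quantized vectors
        # Second and third terms of Eq. (3); sg[.] corresponds to .detach()
        codebook_loss = F.mse_loss(z_q, e.detach())
        commit_loss = F.mse_loss(e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to the encoder output
        z_q = e + (z_q - e).detach()
        return z_q, q, codebook_loss + self.beta * commit_loss

# The first term of Eq. (3) is the reconstruction loss ||x - D(z_q)||_2^2,
# computed with whatever convolutional encoder E and decoder D are used.

The returned ids q form the discrete visual token sequence of length h × w that early-fusion MLLMs consume.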
Fig. 4. Two branches of multi-modal large language model architectures, including (i) the alignment architecture, which aligns pretrained vision models with the LLM, and (ii) the early-fusion architecture, which receives mixed visual and text tokens and relies on auto-regressive modeling for multi-modal understanding.

… an alignment module, such as a projector [4] or Q-Former [23], is used to align the image embedding with the LLM space. To train the alignment module, some text-image or text-video pairs are required as input to the model. A typical way to align is to make the LLM output the caption of an image given the image embedding.

In contrast, as shown on the right of Fig. 4, the early-fusion architecture [3], [30] does not rely on a pretrained vision model to obtain the semantics of the input image. Instead, similar to NLP where each word is mapped to a token, the early-fusion architecture maps each visual input into visual tokens through a visual tokenizer. Then a multi-modal auto-regressive language model receives the mixed text and visual tokens and outputs the user's desired answers.

Remark. (i) The advantage of the alignment architecture is that it can utilize the pretrained knowledge of the vision encoder and LLM. The vision-language pretraining enables the output of the vision encoder to have semantic meanings, and the only thing that needs training is the alignment module, which makes this paradigm resource-friendly. (Sometimes other modules are also made learnable for better performance.) However, its ability is also limited by the pretrained vision encoder and LLM; e.g., the pretrained CLIP vision encoder often struggles with multiple objects, making MLLMs based on CLIP inherit this limitation. (ii) In contrast, the early-fusion architecture may have higher potential, because all its parameters are trained from scratch. However, training from scratch makes the early-fusion architecture face two challenges: (a) how to train a strong visual tokenizer and (b) the greater resources needed to train the multi-modal auto-regressive model. First, since the visual tokenization process involves compression and discretization, there is inevitably visual information loss, and how to train a tokenizer that preserves rich visual information remains a challenging problem. Second, the visual tokenizers are generally trained with an image reconstruction objective, which is more a pixel-level task than a semantic-level task; this requires the downstream multi-modal LLM to have the additional ability to learn semantic meanings from pixel-level information, compared to the original LLM, which only needs to understand semantic tokens. Therefore, the multi-modal LLM requires much more data for training.

Next, with the overall architecture in mind, we will introduce recent advances in image large language models and video large language models.

C. Image Large Language Models

Many works equip the LLM with the capability to understand images, such as the pioneering work Frozen [31]. We will follow the MLLM architecture section and elaborate on the latest advancements of image LLMs.

1) Alignment-Architecture Image LLM. This architecture treats the image input as an additional extension. The vision encoders are usually frozen, and the alignment modules and LLM are tuned based on various strategies to align the multi-modal content and instructions.

a) Vision Encoder. It is a module that extracts crucial information from images. Common generic vision encoders include ResNet [32], the CLIP-ViT encoder [8], and ImageBind [33]. ResNet and CLIP are pretrained on image-text data, while ImageBind aligns the embeddings of six modalities into a common space, allowing its vision encoder to encode richer information. However, generic vision encoders suffer from information loss due to their limited pretraining tasks, and some works attempt to learn tailored vision encoders. Because generic features are not designed for accurate object understanding, VCoder [34] improves vision encoders by introducing depth and segmentation control inputs to promote accurate object perception. Because CLIP features lack the lexical semantics of word tokens tailored for LLMs, SPAE [35] and V2T Tokenizer [36] encode images into lexical tokens guided by LLM codebooks within autoencoders, helping to extract both semantic concepts and appearance details.

b) Alignment Module. This module, also named projector, adapter, etc., aims to mitigate the gap between image features and lexical word tokens and to further fuse the two modalities. LLaVA [37] adopts a simple but effective linear projection to convert image features into the word token embedding space and then concatenates image tokens and word tokens. Such alignment only involves transforming the image features, limits interaction with text, and is not flexible in the number of visual tokens. The Resampler [38] technique maps varying-size features to a fixed number of tokens. BLIP-2 [23], MiniGPT-4 [39], and Qwen-VL [40] employ a Q-Former [23] before linear projections to reduce the number of tokens. The Q-Former incorporates text semantics and models the interaction between image features and text inputs with learnable queries to extract the visual content most useful for the LLM.
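As a minimal illustration of the simplest alignment module above (the LLaVA-style linear projection), the sketch below maps frozen vision features into the LLM embedding space and prepends them to the word embeddings; the dimensions and function names are illustrative assumptions rather than settings from any cited work.

import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    # Map frozen vision-encoder features into the LLM's word-embedding space.
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats):                  # (B, N_patches, vision_dim)
        return self.proj(image_feats)                # (B, N_patches, llm_dim)

def build_llm_inputs(projector, image_feats, word_embeds):
    # word_embeds: (B, L_text, llm_dim) taken from the LLM's embedding table.
    visual_tokens = projector(image_feats)
    return torch.cat([visual_tokens, word_embeds], dim=1)   # (B, N_patches + L_text, llm_dim)

Only the projector is trained in the basic recipe, while the vision encoder and the LLM remain frozen; Q-Former-style modules replace the linear layer with a small transformer driven by learnable queries.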
However, despite the shorter sequence length, locality preservation is damaged in these projectors. Honeybee [41] proposes a Locality-enhanced Projector, which contains a C-Abstractor and a D-Abstractor to enhance spatial understanding while maintaining token flexibility. Besides, efficiency is vital for alignment modules; TokenPacker [42] adopts a coarse-to-fine scheme to further promote efficiency while maintaining finer details. The above discusses the transformation of visual tokens, while in most works the visual tokens are directly concatenated to word tokens and the LLM architecture is not modified. Several works instead progressively inject image content into the LLM architecture to enhance alignment. Flamingo [38] inserts gated XATTN-DENSE layers between LM blocks. ImageBind-LLM [43] adds gated image features to word tokens in each LLM layer. LLaMA-Adapter [44] adds a visual projection to adapters and adopts zero-init attention to fuse visual adapters and word tokens in the last L layers.

2) Early-fusion Architecture Image LLM. The alignment architecture utilizes the power of off-the-shelf LLMs and requires less computation, but pretrained vision encoders suffer from information loss and inherit inductive biases because of the gap between their limited pretraining tasks and the real demands of image LLMs, such as supporting flexible resolutions. Therefore, as shown in Fig. 4, another line of work aims to train a multi-modal LLM from scratch, where both images and text words are converted into a series of tokens.

The pioneering work Fuyu [24] adopts linear projections on image patches in spatial order and trains a transformer decoder taking the visual and word token sequence as input. Despite limited performance, it reveals a new technical fashion. Google follows this fashion: its Gemini [30] processes interleaved images and other modalities from the beginning. Chameleon [45] trains an image tokenizer that encodes a 512×512 image into 1024 discrete tokens from a codebook of size 8192 and trains a BPE tokenizer [46] for both modalities. The recent Show-o [47] unifies multi-modal understanding and generation; it trains a lookup-free tokenizer on around 35M images, maintains a codebook of size 8192, and encodes images of 256×256 resolution into 16×16 discrete tokens. The early-fusion architecture requires much more computation and is more difficult to converge, leaving challenges for future exploration.

3) Challenges in Image LLM. (a) One challenge is fine-grained visual concept understanding. More tokens help encode more detailed information but may cause redundant computation. Chat-UniVi [48] proposes dynamic visual tokens to allocate more computation to important details. An important part of fine-grained understanding is spatial awareness of object concepts. AnyRef [49] applies RoIAlign to encode regions and designs a segment encoder-decoder to learn segmentation from the image LLM's token outputs, which is similar to OMG-LLaVA [50], which generates pixel- and object-centric visual tokens before projection and decodes segmentation tokens from the LLM's output with OMG-Seg. Different from segmentation supervision, VisionLLM [51] and Virtron [52] use text supervision, such as bounding-box and polygon descriptions, via flexible instruction tuning. Fine-granularity modeling also offers some explainability for the LLM. (b) As with LLMs, the other challenge comes from hallucination. Hallucination involves errors in objects, attributes, and relations, in the form of judgments or descriptions [53]. Some works [54], [55] try to reduce biases in training data, while others mitigate hallucination by improving model components such as vision encoders [56], [57] or fusion mechanisms [56], [58]. Human feedback [59] also plays an important role in reducing hallucination.

D. Video Large Language Models

Following the success of Image LLMs, researchers have started exploring the training of Video LLMs [60]. Typically, videos are viewed as sequences of image frames (some Video LLMs incorporate other modalities like audio or speech), so Video LLMs have a higher computational complexity compared to Image LLMs. The challenge of collecting high-quality video datasets further complicates the training process, making early-fusion architectures computationally prohibitive. As a result, almost all existing Video LLMs adopt the alignment architecture.

1) Alignment-Architecture Video LLM. The video LLM architecture is similar to that of Image LLMs with alignment architectures. By sampling a fixed number of frames or using a fixed frames-per-second (FPS) rate, videos are reduced to a limited set of images. The visual embeddings of each image are then extracted using a visual encoder. These features are sequentially concatenated in the order of the frames and connected to the LLM via an alignment module. In earlier works, VideoChat [61] utilizes a Q-Former structure as the alignment module, while VideoLLaMA [62] introduces an audio encoder and an audio Q-Former to handle audio signals. Video-ChatGPT [63] takes a different approach by average-pooling each frame's patch embeddings along the spatial and temporal dimensions before using a linear layer as the alignment module. Training Video LLMs also follows an "alignment then instruction tuning" strategy. While additional GPT-annotated or human-annotated video datasets are collected, image datasets can also be leveraged by treating images as single-frame videos.

Recent successful efforts focus on improving performance by refining the alignment module and scaling up the model and dataset sizes. For instance, VideoLLaMA2 [64] improves the alignment module to model the connections across temporal and spatial dimensions. It also gathers datasets for tasks such as captioning, classification, and question answering. LLaVA-NeXT-Video [65] and LLaVA-OneVision [7] introduce the AnyRes technology [66], which serves as a flexible visual representation framework adaptable to both multi-image and video representation. Additionally, some Video LLMs, like MiniCPM-V [67] and VILA-1.5 [68], also support multi-image and video input, showcasing strong performance across various benchmarks.

2) Challenges and Limitations in Video LLM. Compared to Image LLMs, Video LLMs face two unique challenges. The first is understanding videos at a finer granularity, specifically the comprehension of video segments and the relationships between these segments. The second is understanding long-form videos, such as movies, within the limited context length of LLMs.
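The context-length constraint can be made concrete with a short sketch of the standard pipeline described above: frames are sampled, encoded individually, and their tokens are concatenated in temporal order, so the sequence length grows linearly with video duration. The per-frame token count and the 8K context size below are illustrative assumptions, not the settings of any specific model.

import torch

def sample_frames(video: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    # Uniformly sample frames from a (T, C, H, W) video tensor.
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    return video[idx]

def video_to_llm_tokens(frames, image_encoder, projector):
    # Encode each sampled frame and concatenate the tokens in temporal order.
    per_frame = [projector(image_encoder(f.unsqueeze(0))) for f in frames]
    return torch.cat(per_frame, dim=1)     # (1, num_frames * tokens_per_frame, d)

# Illustrative budget: at 1 FPS with 32 tokens per frame, an 8K-token context
# holds roughly 8192 // 32 = 256 frames, i.e., only a few minutes of video.

This linear growth is exactly what the long-video techniques discussed next (memory consolidation, fewer tokens per frame, longer-context LLMs) try to mitigate.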
For segment-level video understanding, VTimeLLM [6] transforms the temporal video grounding and dense video captioning tasks into a sequence-to-sequence format. After alignment training, it introduces an additional boundary perception training, leveraging large-scale multi-event video-text …

… results in 8K tokens, which reaches the maximum context length of most LLMs. However, this represents less than 5 minutes of video at a sampling rate of 1 FPS. Therefore, more efficient representations are necessary for processing long-form videos like movies. MovieChat [73] introduces a memory consolidation mechanism that merges similar image tokens once the token limit is reached. LWM [74] and LongVA [75] handle long video inputs by using LLMs with larger context lengths and more efficient attention mechanisms. Some methods [6], [69], [76] reduce the number of tokens per frame, representing each frame with only 1 or 2 tokens on average. Other approaches [77], [78] convert long-form videos into a text corpus using image captioning and employ LLMs as agents to search for specific answers within the text corpus.

Despite the advancements in Video LLMs, nearly all existing models rely on sampling frames and encoding them individually through image encoders. This approach may be favored for several reasons: image encoders are less computationally intensive than video encoders, they offer better alignment with textual data, and they facilitate unification with Image LLMs. However, this methodology comes with a significant limitation. Specifically, the process of sampling frames can lead to the complete loss of information that occurs between sampled frames. As a result, these models fail to capture the continuous motion and trajectories of objects, which are essential for understanding dynamic scenes and activities within a video.

Now we have discussed the multi-modal large language model for visual understanding. Next, we will discuss another important topic of multi-modal generative AI, i.e., multi-modal diffusion models for visual generation.

III. MULTI-MODAL DIFFUSION FOR GENERATION

In this section, before the discussion on diffusion models, we first introduce some preliminaries, including earlier generative models such as GANs and VAEs, and then diffusion probabilistic modeling; we also present their overall frameworks in Fig. 5. After that, we present the widely adopted latent diffusion model [29] and discuss some advanced diffusion text-to-image and text-to-video models.

Fig. 5. The comparison of basic architectures of GANs, VAEs, and diffusion models.

A. Preliminaries

1) Generative Adversarial Networks: The generative adversarial network (GAN) [79] is one of the earliest neural architectures to generate visual contents such as images [80]–[82] and videos [83]–[86].

The main idea of GANs lies in two networks: a generator G and a discriminator D. Specifically, G tries to generate visual contents from a noise z, and D is trained to distinguish between the real ground-truth visual contents x and the generated results G(z). Typically, these two networks are trained against each other. The whole training process is a min-max game in which we expect the generator to produce results realistic enough to fool an increasingly strong discriminator. The two networks are mutually reinforcing, so the training objective is as follows:

min_G max_D E_{x∼p_x}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))],   (4)

where z is sampled from p_z, which is usually a normal distribution, and x is a sample from the real data distribution p_x.

The generator and the discriminator differ across tasks and have been adapted to process multi-modal data in different ways. For example, in video generation tasks, TGANs-C [84] proposes a novel GAN architecture with 3D spatial-temporal convolutions and utilizes a discriminator to determine whether the video matches the text caption rather than only comparing it against the ground-truth video. IRC-GAN [86] introduces a novel approach based on mutual-information introspection, leveraging mutual information to semantically assess consistency, thereby aligning the generated video with the textual content.
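A minimal alternating-update sketch of the min-max objective in Eq. (4) is given below; the generator and discriminator are placeholders for any architecture (e.g., the 3D convolutional ones used for video), the discriminator is assumed to output one logit per sample, and the non-saturating generator loss is used as a common practical substitute for the literal min-max form.

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_real, z_dim: int = 128):
    # One alternating update of Eq. (4) in binary cross-entropy form.
    B = x_real.size(0)
    device = x_real.device
    z = torch.randn(B, z_dim, device=device)
    ones, zeros = torch.ones(B, 1, device=device), torch.zeros(B, 1, device=device)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), ones) + \
             F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros)
    d_loss.backward()
    opt_D.step()

    # Generator step: non-saturating variant, maximize log D(G(z))
    opt_G.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()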
2) Variational AutoEncoder: The Variational AutoEncoder (VAE) [87] is another typical generative model. Unlike GANs, autoencoders have an encoder-decoder architecture that uses an encoder E to map the visual content x to a latent code z = E(x) and a decoder D to reconstruct the data x̂ = D(z) ≈ x. However, ordinary autoencoders place no constraints on the latent space, which makes them overfit the dataset easily. To solve this problem, VAEs regularize the latent space and sample z from a distribution p_θ, typically a Gaussian distribution, where θ denotes the parameters of the encoder-decoder model. As the distribution p_θ is unknown, VAE utilizes a recognition model ϕ, which serves as a variational approximation q_ϕ to approximate p_θ, and trains them jointly. The training objective is:

L(θ, ϕ; x) = −D_KL(q_ϕ(z|x) || p_θ(z)) + E_{q_ϕ(z|x)}[log p_θ(x|z)],   (5)

where D_KL denotes the Kullback-Leibler divergence. ϕ can be formulated as a differentiable estimator using the reparameterization trick.

To better generate visual content, many efforts [85], [88], [89] have been made based on VAE. Sync-DRAW [88] introduces a novel architecture that combines VAE with a recurrent attention mechanism to create a unique temporally dependent sequence of frames. Despite the successful introduction of VAEs, they still face a significant issue where the model ignores the information in the latent space and relies solely on a powerful decoder to reconstruct the data, a phenomenon known as "posterior collapse". To address this problem, VQ-VAE [89] utilizes discrete encoding to learn the prior and employs vector quantization to prevent the latents from becoming uninformative. [85] leverages the strengths of both GANs and VAEs: it introduces a VAE model to capture static information such as the background color or the layout of objects, and utilizes a GAN model to obtain dynamic motion information based on the captured information and the text input.

Compared to GANs and VAEs, a newer branch of generative models, diffusion models [29], [90], [91], has become dominant in many tasks such as text-to-image and text-to-video generation.

3) Diffusion Probabilistic Modeling: We briefly introduce diffusion probabilistic modeling from two mainstream perspectives, i.e., the denoising diffusion probabilistic model (DDPM) and the stochastic differential equation (SDE). The core idea of the diffusion process is to model the relation between the real data distribution q(x_0) and a random Gaussian distribution q(x_T).

DDPM: The DDPM includes a forward and a backward process. In the forward process, given a real data sample x_0, a Markov process adds more and more random Gaussian noise to the sample as follows:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I),   t = 0, 1, · · · , T,   (6)

where t is the time step, T is usually large so that x_T is close to Gaussian noise, and β_t is a parameter that controls the noise schedule. Conversely, to achieve generation from random noise, the backward process of DDPM learns the following distribution:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ_θ(x_t, t), Σ_θ(x_t, t)),   (7)

where a neural network parameterized by θ is designed to predict the less noisy image x_{t−1}. Then, with this denoising network θ, we can denoise from a random noise x_T step by step until we get a clean data sample x_0, which could be an image, a video, etc.

SDE: The SDE view describes the trajectories from x_0 to x_T with the following stochastic differential equation:

dx = f(x, t) dt + g(t) dw,   (8)

where f(·) is the drift coefficient, g(·) is the diffusion coefficient, and w is a standard Wiener process. Then, during the backward process, we can use the following equation to denoise:

dx = [f(x, t) − g(t)² ∇_x log q_t(x)] dt + g(t) dw̄,   (9)

where ∇_x log q_t(x) is the score, and a model θ is used to predict the score.

Mathematically, the SDE and DDPM formulations are equivalent and provide two different views of the diffusion process. During diffusion model training, the following objective is generally adopted:

min_θ E_{x_0, ϵ, t}[w_t ||ϵ − ϵ_θ(x_t, t)||_2^2],   (10)

where ϵ is the randomly sampled noise, x_t is the noisy image, and ϵ_θ is the neural network that predicts the noise. Intuitively, when we can predict the noise, we can predict a cleaner image by subtracting the noise as in DDPM, and we can likewise predict the score in SDE. w_t is the weighting schedule for different time steps.

Remark. GANs, VAEs, and diffusion models are all generative models. Compared to GANs, the diffusion model has explicit probabilistic modeling. Also, the diffusion model only needs to train a denoising network ϵ_θ. In contrast, a GAN needs to train both the generator and the discriminator, which is less stable. Similarly, VAE-based models also need to train an encoder and a decoder. Moreover, from the perspective of data augmentation, considering that during training we denoise each image at T different time steps, we obtain T variants of each image. These augmented images help the denoising network better model the data distribution p_θ(x_0), resulting in better generation results.

4) Latent Diffusion Model: As shown in Eq. (6) and Eq. (7), the denoising process of diffusion models is conducted on the pixels of each image in an iterative way, which results in high computational cost, especially when the generated image is high-resolution. To tackle this problem, the latent diffusion model (LDM) [29] proposes to conduct the diffusion process in the latent space instead of the pixel space. The framework comparison between the pixel-level diffusion model and LDM is shown in Fig. 6. To reduce the computational cost, LDM utilizes the encoder of VQGAN [27] to compress the image into the latent space, z = E(x), which has a much lower dimension than the original image. Then, the diffusion process in Eq. (6) and Eq. (7) is conducted in the latent space. The training objective in Eq. (10) is likewise applied to the latent code z_t instead of the image x_t as follows:

min_θ E_{z_0=E(x_0), ϵ, t}[w_t ||ϵ − ϵ_θ(z_t, t, c)||_2^2].   (11)

Note that there is an additional input c to the denoising network for conditional generation; e.g., for the text-to-image generation task, c could be the representation of the text prompt [92]. c could also be other conditions, such as layout [93]–[96], semantic maps [97], [98], or the source image in image-to-image translation tasks [97]. Since most computation, including training and iterative inference, is conducted in the lower-dimension latent space, the LDM exhibits high efficiency. Therefore, most text-to-image and text-to-video models adopt the LDM structure.
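The training objective in Eq. (10) (and its latent, conditional form in Eq. (11)) reduces to a few lines once the noise schedule is fixed; in the sketch below, the denoiser, the cumulative schedule ᾱ_t, and the optional condition are placeholders, and the weighting w_t is set to 1 for simplicity.

import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, alphas_cumprod, cond=None):
    # Eq. (10)/(11): predict the noise added at a randomly sampled timestep t.
    # x0 may be a pixel image or a latent z0 = E(x0) for an LDM.
    B = x0.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (B,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps    # closed form of the forward process in Eq. (6)
    eps_hat = eps_model(x_t, t, cond) if cond is not None else eps_model(x_t, t)
    return F.mse_loss(eps_hat, eps)                         # w_t = 1

Sampling then runs the learned denoiser iteratively from pure noise, in pixel space for a standard DDPM or in latent space (followed by the VQGAN decoder) for an LDM.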
Fig. 6. Comparison between pixel-level diffusion models and latent diffusion models.

Fig. 7. Comparison between U-Net-based diffusion models and Transformer-based diffusion models.
B. Text-to-Image Generation

1) Text-to-Image Diffusion Model: As mentioned in the preliminaries, diffusion models can be broadly categorized into two branches: pixel-based and latent-based [99]. In the early development stage, the denoising process is typically applied directly in the pixel space. For instance, GLIDE [100] is a pioneering work in photorealistic image generation with text guidance, using a 3.5 billion parameter diffusion model that employs a text encoder to condition on natural language descriptions. GLIDE also explores the use of CLIP guidance and classifier-free guidance in diffusion models, finding that classifier-free guidance produces higher-quality images. Besides, Imagen [101] follows GLIDE and adopts classifier-free guidance for its pixel-based diffusion model. The key difference between the two is that GLIDE trains the text encoder and the diffusion model together on text-image pairs, while Imagen utilizes pretrained and frozen large transformer language models, leveraging their strong text understanding capabilities to enhance sample fidelity and image-text alignment.

However, directly operating in pixel space requires substantial computational resources, which leads to the appearance of latent-based diffusion models. A milestone in this area is Stable Diffusion [102], which introduces the concept of the latent diffusion model to strike a near-optimal balance between complexity reduction and detail preservation. It incorporates a pretrained VQGAN to compress images from pixel space into a semantic latent space. Compared to pixel-based diffusion methods, Stable Diffusion not only achieves competitive performance across multiple image generation tasks but also significantly reduces both training and inference costs. Another notable example of a latent-based model is DALL-E2 [103], which combines a CLIP model and a diffusion model to enable zero-shot text-guided image generation. DALL-E2 consists of a CLIP image encoder and a diffusion decoder that inverts the encoder, allowing for explicit generation of image representations. This approach improves image diversity while maintaining photorealism and caption similarity.

GLIDE [100], Imagen [101], Stable Diffusion [102], and DALL-E2 [103] are all pioneering works that represent different technological pathways in the field of text-to-image generation. Despite their differences, some common trends have emerged in their development. First, latent-based diffusion methods have become increasingly prevalent due to their advantages in conserving computational resources and generating high-quality images. Second, compared to classifier guidance [104], classifier-free guidance [105] is widely adopted in these works, where the label in a class-conditional diffusion model is replaced with a null label at a fixed probability during training. Third, U-Net traditionally serves as the backbone of the diffusion model, facilitating denoising and the gradual generation of high-quality images.

Despite its advantages in high-resolution image generation, U-Net's specific structures, such as ResBlocks and convolutional operations, limit its scalability. In contrast, Transformers, which are better suited to handle larger-scale data and tasks, are emerging as strong contenders to U-Net. The Diffusion Transformer (DiT) [106] represents a class of diffusion models that replaces the commonly used U-Net backbone with a transformer backbone, as shown in Fig. 7. This approach is supported by empirical findings suggesting that the U-Net inductive bias is not crucial to the performance of diffusion models. Additionally, utilizing a transformer backbone enables the diffusion model to leverage the best practices of transformers, such as architectural designs and training paradigms, along with their good properties like scalability, robustness, and efficiency. Specifically, DiT adheres to the foundation of the latent diffusion model (LDM) framework and emulates the design of the Vision Transformer (ViT) by introducing a comprehensive DiT design space, including patch size, transformer block architecture, and model size. The first layer of DiT, termed patchify, converts the spatial input into a sequence of tokens by linearly embedding each patch. Following the patchify step, the input tokens are processed through a sequence of transformer blocks that incorporate conditioning such as time and label. The proposed transformer designs include the adaptive layer norm (adaLN) block, the cross-attention block, and the in-context conditioning block. After the final block, a transformer decoder translates the image tokens into output predictions. DiT is available in four configurations, DiT-S, DiT-B, DiT-L, and DiT-XL, ranging from 0.3 to 118.6 Gflops. The difference between U-Net-based and Transformer-based diffusion models is illustrated in Fig. 7.

The three distinct transformer blocks are the core modules of DiT, representing different ways to interact with multi-modal information, including images, timestep, and conditions. Their designs are inspired by the standard ViT block design but incorporate small yet significant modifications. As illustrated in Fig. 8, these blocks differ in how the image latent interacts with the conditioning information. The adaLN block follows the adaptive normalization layers in GANs, replacing the standard normalization layers in transformer blocks. The scale and shift parameters of this block are determined by the sum of the embedding vectors of the timestep and condition. This block adds the least Gflops to the model.
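The adaLN conditioning just described can be sketched in a few lines: the timestep and condition embeddings are summed and mapped to a per-channel shift, scale, and (in the adaLN-Zero variant) a gate that is zero-initialized so that each block starts as an identity mapping. The layer names and the single shift/scale/gate triple below are illustrative simplifications of the full DiT block.

import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    # Produce shift/scale/gate from the summed timestep + condition embedding.
    def __init__(self, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_mod = nn.Linear(hidden, 3 * hidden)
        nn.init.zeros_(self.to_mod.weight)    # adaLN-Zero: the block starts as a no-op
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, t_emb, c_emb):       # x: (B, L, hidden) image tokens
        shift, scale, gate = self.to_mod(t_emb + c_emb).chunk(3, dim=-1)
        x_mod = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x_mod, gate                    # gate multiplies the block's residual output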
Fig. 8. Comparison between different DiT blocks from [106]: the DiT block with adaLN-Zero, the DiT block with cross-attention, and the DiT block with in-context conditioning.
Fig. 9. Model-based, tuning-based, and training-free controllable text-to-image generation.

The cross-attention block introduces an additional multi-head cross-attention layer, serving as the interaction module between the image latent and the timestep and condition. This block adds the most Gflops to the model. The in-context conditioning block treats the tokens from the timestep and condition in the same way as image tokens, concatenating them along the sequence dimension. This block introduces a moderate amount of Gflops.

Following the development of DiT [106], a growing number of works are exploring variants of diffusion transformers with improved performance. For instance, CrossDiT [107] combines the adaLN-zero DiT block and the cross-attention DiT block. It simplifies adaLN-zero layers to adaLN-single layers by removing label conditioning and using only time conditioning for scale and shift control, and it incorporates text embeddings from T5 [108] into the multi-head cross-attention layer. Another notable variant is MM-DiT [109], which integrates the adaLN-zero DiT block and the in-context conditioning DiT block. This model uses text embeddings from CLIP and timesteps to condition the network, employs two separate sets of weights for the image and condition modalities, and concatenates image and condition for the attention operation. Empirical experiments show that both CrossDiT and MM-DiT outperform the vanilla DiT in terms of validation loss, CLIP score, and FID.

The designs of diffusion transformer variants are distinct from each other, but they basically derive from the three core architectures proposed by DiT: the adaLN-zero block, the cross-attention block, and the in-context conditioning block. Currently, MM-DiT, which combines the adaLN-zero block with in-context conditioning, represents the state-of-the-art architecture. Its advantage lies in training the text modality iteratively alongside the diffusion process in an in-context manner rather than keeping it frozen, which produces a more diverse semantic space.

2) Controllable Generation with Diffusion Model: Despite the success of diffusion models in generating photorealistic images, this text-to-image technique falls short of fully meeting increasing and diverse user requirements, such as fine-grained control or specific customization. For instance, creating a portrait of an ordinary individual based solely on a name and physical description is beyond the capabilities of current text-to-image diffusion models. As a result, ongoing efforts are improving diffusion models to allow for more precise control through additional conditions beyond text.

Compared to original text-based image generation, controllable generation involves introducing additional conditions c_add without compromising the original text condition c_text. From the perspective of control mechanisms, related methods can be categorized into three classes: model-based, tuning-based, and training-free [110], as illustrated in Fig. 9. Model-based methods incorporate extra models to encode the additional conditions and integrate them into the diffusion process. For instance, InstantBooth [111] employs a patch encoder and a concept encoder to encode personalized sample images and incorporates adapter layers into the U-Net for condition interaction. These extra models, including encoders and adapters, are trainable, while other components of the model remain frozen. Tuning-based methods do not require extra models but instead fine-tune certain parts of the original diffusion model to adapt to specific conditions. For example, Textual Inversion [112] fine-tunes the text encoder, while DreamBooth [113] fine-tunes the U-Net. In these tuning-based methods, Parameter-Efficient Fine-Tuning (PEFT) techniques are often employed to replace traditional fine-tuning, thereby reducing computational resources. Training-free methods eliminate the need for any training or fine-tuning process, instead controlling generation by leveraging the intrinsic capabilities of the U-Net structure, such as attention. For example, StyleAligned [114] achieves consistent style generation by employing minimal "attention sharing" during the diffusion process, where all images share self-attention with the reference image.
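To make the tuning-based route concrete, the sketch below follows the spirit of Textual Inversion: the diffusion model is kept frozen and only the embedding row of a new placeholder token (e.g., "S*") is optimized with the usual diffusion loss on a few subject images. All names and the gradient-masking trick are illustrative; actual implementations differ in detail.

import torch

def personalize_token(token_embedding, diffusion_loss_fn, subject_batches,
                      placeholder_id: int, steps: int = 1000, lr: float = 5e-3):
    # Optimize only the embedding row of the new placeholder token.
    weight = token_embedding.weight            # (vocab_size, dim); the rest of the model stays frozen
    weight.requires_grad_(True)
    opt = torch.optim.Adam([weight], lr=lr)
    for _, batch in zip(range(steps), subject_batches):
        loss = diffusion_loss_fn(batch)        # e.g., the ε-prediction loss sketched earlier
        opt.zero_grad()
        loss.backward()
        mask = torch.zeros_like(weight.grad)   # keep only the placeholder row's gradient
        mask[placeholder_id] = 1.0
        weight.grad.mul_(mask)
        opt.step()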
Each of these three method classes has its own strengths and weaknesses. Model-based methods introduce additional models and require a tuning process, generally consuming the most computational resources. However, once the encoders and adapters are fully trained, they can be easily adapted to different conditions. Tuning-based methods save computational resources by not incorporating extra models, but they are limited to adapting to a single specific condition per fine-tuning. Training-free methods do not require any extra models or fine-tuning time, but they are restricted to controlling only a limited range of conditions, such as layout or style.

Controllable generation involves customizing various aspects of an image, such as subject, layout, and style [110]. Among these, the primary control condition is the subject of an image, known as subject-driven generation [112], [113], [115], [116]. For instance, when a user describes an image with a phrase like "a dog is running on the beach", the dog may not be an arbitrary one but a specific, familiar dog. To achieve this, sample images of the dog are provided as an additional condition, and an uncommon word, such as "[V]", "sks", or "S*", is assigned to the description to represent the specific subject. A related but more specialized area is person-driven generation [117]–[119], which focuses on maintaining a consistent human identity, depicting a person with different expressions, postures, and actions. Compared to subject-driven generation, layout- and style-driven generation focus on the overall composition of the image. Layout conditions [93]–[96] control the relative positions of different subjects and the background, while style conditions [114], [120], [121] determine the artistic style of the image, such as oil painting, black-and-white, or line art. Additionally, other novel conditions, such as sound, brain signals, and semantic maps, are being explored to control text-to-image generation, offering new and subtle ways to influence the mood and perception of an image.

Beyond addressing specific conditions, many studies explore complex control involving multiple conditions. For example, generating multiple customized subjects performing user-defined actions is particularly challenging because the model might confuse or forget the specified conditions. Thus, multiple-condition control involves more than simply combining various specific conditions; it requires their interaction in a well-designed manner.

Existing methods for controlling generation with multiple conditions can be categorized as follows. Joint training methods [117], [122]–[124] rely on multi-condition encoders and specialized training strategies to manage diverse conditions simultaneously. Continual learning methods [125]–[127] incorporate strategies from the field of continual learning to effectively handle conditions that arise sequentially. Weight fusion methods [115], [128]–[131] assign weights to all conditions and blend these weights cohesively to ensure comprehensive control over all conditions. Attention-based integration methods [128], [129] modify the attention map to adaptively position and prioritize different conditions. Guidance composition methods [132]–[134] integrate the independent denoising results of each condition to achieve a unified output.

C. Text-to-Video Generation

1) Text-to-Video Diffusion Models: Due to the success of diffusion models in text-to-image tasks, many researchers have introduced temporal information into diffusion models and utilized their capability of generating high-quality images to build text-to-video models.

The most intuitive approach to utilizing the text-to-image model is modifying the self-attention mechanism, which yields a text-to-video model without any additional parameters. Text2Video-Zero [135] is one of the pioneering works. Rather than randomly initializing the latents of all frames independently, Text2Video-Zero only samples the latent code z_T^1 of the first frame and applies ∆t DDIM backward steps to obtain z_{T′}^1. After that, Text2Video-Zero determines the global scene and camera motion direction, proposes a warping function W_k to get all F frames from z_{T′}^1 to z_{T′}^F, and then performs a DDPM forward process to get the initial latents. To keep the consistency among different frames, Text2Video-Zero proposes cross-frame attention, which uses keys and values from the first frame to generate the images. Latent-Shift [136] is another representative method. It proposes a novel Temporal-Shift module that splits the latents along the channel dimension and shifts the split channels along the temporal dimension to keep the consistency of all frames. These methods make full use of powerful pretrained text-to-image models and can generate videos with much higher resolution and quality than traditional text-to-video methods using GANs and VAEs. However, rather than capturing, training on, and understanding temporal information, these methods are more akin to injecting expert knowledge that exploits temporal information from a human perspective. Thus, these methods enjoy high generation efficiency, but the generated videos still struggle with motion smoothness, dynamic degree, and video consistency.

To solve these problems, another kind of approach [137]–[141] not only inherits the architecture of T2I models but also introduces novel modules or modifies the original structure to learn temporal information. VDM [137] is one of the earliest works that transferred the T2I model to solve T2V tasks. VDM proposes a 3D U-Net that modifies the diffusion architecture by changing each 2D spatial convolutional layer into a 3D convolution. After that, for each spatial attention block, VDM inserts a temporal attention block that performs attention over all frames, with relative position embeddings to distinguish the ordering of frames. Make-A-Video [138] proposes a pseudo-3D convolutional and attention layer, which consists of a spatial 2D convolutional layer followed by a temporal 1D convolutional layer. Compared to 3D convolution, this approach is much more efficient while facilitating information sharing between the spatial and temporal axes. To more flexibly apply the capabilities of the T2I model, such as the customization and style-transfer abilities brought by LoRA, AnimateDiff [141] keeps the original architecture and only inserts a motion module after each pretrained layer. The motion module consists of an input projection layer, several temporal self-attention layers, and an output projection layer. To avoid harming the original capabilities of T2I models, AnimateDiff zero-initializes the output projection layer.

As the attention-based architecture is more suitable for capturing long-range contextual relationships, some methods [142], [143] adopt a DiT-based model to generate videos. Latte [142] utilizes a video transformer as the backbone and employs a VAE to encode videos into features, which are used to extract tokens. Currently, compared to U-Net-based methods, DiT-based methods can scale to larger datasets and parameter counts, hence yielding relatively better performance.
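Several of the works above inflate a pretrained T2I backbone by factorizing spatio-temporal operations; the sketch below shows the pseudo-3D convolution idea (a 2D spatial convolution over each frame followed by a 1D temporal convolution at each spatial location), with illustrative shapes and no claim to match any specific implementation.

import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    # Spatial 2D conv per frame followed by a temporal 1D conv per spatial location.
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, k, padding=k // 2)
        self.temporal = nn.Conv1d(channels, channels, k, padding=k // 2)

    def forward(self, x):                        # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        y = self.spatial(x.transpose(1, 2).reshape(B * T, C, H, W))
        y = y.reshape(B, T, C, H, W)
        # Temporal mixing: fold space into the batch dimension, convolve over T.
        y = y.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
        y = self.temporal(y)
        y = y.reshape(B, H, W, C, T).permute(0, 3, 4, 1, 2)   # back to (B, C, T, H, W)
        return y

In AnimateDiff-style designs, the newly inserted temporal layers are additionally zero-initialized at their output so that the pretrained spatial behavior is preserved at the start of training.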
However, this also implies a higher consumption of computational resources. DiT-based methods are commonly adopted in some outstanding applications within the industry.

At this point, the basic text-to-video models have been constructed based on the text-to-image model. However, there are still two problems. The first is that these methods can only control video generation through text, and it is usually difficult to describe all aspects of the desired video in text; how to better control video generation is thus an important issue. The second is that, limited by the scale of model parameters and GPU memory, most videos generated by these methods are in the range of 16-24 frames, which makes it difficult to satisfy the needs of real-life users for visual content. Next, we will analyze these two issues and discuss some related works.

2) Controllable Generation with Diffusion Model: For controllable generation, the key challenge is how to choose suitable conditioning information and how to utilize this information fully.

An intuitive way to do this is to use existing videos, which can be regarded as video editing [144]–[148]. StableVideo [145] first introduces a pretrained model to split the foreground and background and edits them separately. To better maintain the consistency of subjects, StableVideo proposes an inter-frame propagation mechanism that utilizes layered representations to keep information consistent between different frames. Rerender-A-Video [147] proposes a novel two-stage editing approach. In the first stage, Rerender-A-Video identifies the keyframes of the reference video and edits them according to the given prompt. To ensure effectiveness, Rerender-A-Video introduces a pretrained image diffusion model with hierarchical cross-frame constraints. The second stage utilizes the edited keyframes to perform overall video editing through temporal-aware patch matching and frame blending. FateZero [148] makes full use of the information provided by the attention maps during inversion. On the one hand, they encapsulate a wealth of layout and motion information from the original video. On the other hand, a novel blending mask can be derived from the cross-attention maps. These masks indicate the information that influences the subject requiring editing, thereby minimizing semantic loss during subject editing.

It can be found that these methods usually utilize two aspects of the reference video. One is the overall layout information of the video, including the positions of objects, their motion trajectories, etc. The other is the attributes of the subjects requiring editing, with appropriate extraction and adjustment. This also implies that not all information in the video is useful in every situation; for example, the color or shape of an object often proves to be disruptive when we intend to edit it. With such considerations, certain approaches first pre-extract conditional information through auxiliary networks and then feed this preprocessed information into the generative model, aiming for improved controllability over video generation.

Inspired by controllable generation methods used in text-to-image tasks, ControlNet [149] is introduced to text-to-video generation [150]–[154]. It utilizes the information extracted from each frame of a video, such as the skeleton, depth map, and optical flow, to generate videos that satisfy the provided text prompt. Control-A-Video [150] utilizes a ControlNet to control the generation process with different types of conditional information. Besides, to improve consistency, Control-A-Video proposes a novel residual-based noise initialization strategy that introduces a motion prior into the diffusion process. VideoControlNet [152] proposes a motion-guided method that uses an auxiliary model to predict the optical flow between keyframes. After generating the keyframes, VideoControlNet utilizes a motion-guided B-frame interpolation module to generate the remaining frames. In contrast to the aforementioned methods, SparseCtrl [153] takes into account the potential quality degradation that arises from using noisy latents as inputs in a traditional ControlNet. Therefore, SparseCtrl proposes a novel sparse condition encoder with a sparse mask, eliminating the noisy sample input and exclusively accepting condition information. Other methods [155]–[157] encode condition information into embeddings and employ attention mechanisms to achieve controllable generation. Follow-Your-Pose [155] proposes a two-stage training strategy. In the first stage, Follow-Your-Pose trains a pose encoder to translate frames of key points into specific embeddings. The second stage introduces a temporal self-attention and a cross-frame self-attention to keep consistency. During inference, Follow-Your-Pose mixes the embeddings of poses and the latents to control the video through the key points.

However, whether directly utilizing videos as control conditions or extracting crucial information from them, the aforementioned approaches heavily rely on the inherent structure of the reference videos being highly consistent with the desired generated videos. This limitation constrains the diversity of controllable video generation unless we possess an infinite video library and exceptionally powerful video retrieval methods to cater to all the whimsical imaginings of users. Hence, other works also explore the use of simpler and more readily accessible control conditions to exert finer control over specific aspects of videos.

Some methods attempt to control videos by emulating how people shoot movies in real life, such as controlling the layouts [158]–[161], adjusting camera views [141], [162]–[164], and setting "actors" [165]–[168]. Compared to previous control methods, the major advantage of these approaches lies in the simplicity of their condition information, such as sequences of bounding boxes, a viewpoint with a direction, or images of objects. Simultaneously, they can effectively represent a specific set of significant attributes of a video. This eliminates the need to invest substantial effort in searching for a suitable reference video before generating the video. Users can obtain control information through their own understanding, either via a GUI for dragging and dropping or in numerous other ways. This not only ensures controllability but also significantly enhances the diversity of video generation. Next, we will briefly introduce several works along these lines.
seen as a series of bounding boxes. To use this DSL information for generating videos that satisfy the desired layouts, LVD introduces a DSL-guided video generator. It designs an energy function to assess the degree of overlap between the generated objects and the required bounding boxes, and influences the object positions during denoising by minimizing the energy function through back-propagation. CameraCtrl [163] proposes a novel plug-and-play module that can control the camera trajectory for text-to-video generation. It trains a camera encoder that can output multi-scale camera representations, which are then utilized by the temporal attention layers of U-Net to control the video generation process. DisenStudio [166] addresses the challenge of customized multi-subject text-to-video generation in the real world where each subject has only a few reference images available. It proposes a disentangled spatial control approach to associate each subject with the desired action. Besides, DisenStudio proposes a novel multi-subject co-occurrence tuning and masked single-subject tuning to keep the visual attributes of given subjects, and a multi-subject motion-preserved tuning method to maintain the temporal motion-generation ability. Kaleido [169] integrates such condition information by encoding different control conditions into tokens, enabling more flexible multi-condition controllable generation. However, employing control conditions beyond text inevitably leads to potential conflicts among multiple conditions, resulting in a decline in the quality of the generated videos. This problem could serve as a challenging future direction.
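To illustrate the kind of energy-guided layout control described for LVD above, here is a minimal sketch under our own simplifying assumptions (toy tensors, a hand-written energy, and a single gradient step); it is not LVD's actual code, only the general mechanism of steering the latent by back-propagating a box-overlap energy during denoising.

```python
import torch

def layout_energy(cross_attn, boxes):
    """cross_attn: (N, H, W) attention maps, one per object phrase.
    boxes: list of N boxes (x0, y0, x1, y1) in [0, 1] coordinates.
    The energy is low when each object's attention mass lies inside its box."""
    N, H, W = cross_attn.shape
    energy = 0.0
    for a, (x0, y0, x1, y1) in zip(cross_attn, boxes):
        mask = torch.zeros(H, W)
        mask[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)] = 1.0
        inside = (a * mask).sum() / (a.sum() + 1e-8)
        energy = energy + (1.0 - inside)          # penalize attention mass outside the box
    return energy

# one guidance step on the latent during denoising (toy tensors)
latent = torch.randn(4, 64, 64, requires_grad=True)
cross_attn = torch.rand(2, 64, 64) * latent.mean()   # stands in for attention computed from `latent`
e = layout_energy(cross_attn, [(0.1, 0.1, 0.5, 0.5), (0.6, 0.4, 0.9, 0.9)])
e.backward()
with torch.no_grad():
    latent -= 0.1 * latent.grad                       # nudge objects toward their target boxes
```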
3) Long Video Generation: Another challenge of the diffusion model is generating longer videos. Some approaches [150], [153], [170], [171] leverage the controllable generation methods mentioned before, splitting the whole video into several smaller video chunks and generating them in an auto-regressive manner. Typically, they use the final frames of the preceding chunk as a reference and generate the next chunk so that the overlapping parts between chunks are kept exactly the same, thereby guaranteeing consistency and smoothness across different chunks. Rather than using additional modules to fully control the overlapping frames between chunks, FreeNoise [172] accomplishes the generation of long videos by performing specific operations on the latents of overlapping frames between different chunks. FreeNoise no longer initializes independent noise for all frames but rearranges the noise sequences to achieve long-range correlations, and applies temporal attention through window-based fusion. The limitation of this kind of method is that it often suffers concept drift as the video becomes longer. Additionally, it often fails to generate new backgrounds or videos with high dynamic degrees.

Another way to generate longer videos is to rely on larger datasets and model parameters. Early works pretrain U-Net based diffusion models [173]–[176] which can only generate 1-second or 2-second videos. More recently, by scaling the DiT architecture for video generation, several works [2], [143], [177]–[181] can generate videos up to 1 minute, with high resolution and smoothness. Despite their success, the length of the generated video cannot be arbitrarily long because of computational restrictions. A possible way to generate longer videos is to train a multi-modal video generation model, which can receive the last frame or the last several frames of the previous video as input, and also the text prompt as input, to generate the next video clip.

Fig. 10. Possible unified multi-modal understanding and generation frameworks with different probabilistic modeling methods. (Left: the Auto-Regressive (AR) Model, where a text tokenizer and a visual tokenizer feed a multi-modal AR model trained with AR regularization; right: the Joint AR and Diffusion Model, realized either by connecting a multi-modal LLM to a diffusion model or by feeding a multi-modal input processor into multi-modal transformers trained with both AR and diffusion regularization.)

IV. UNIFIED FRAMEWORK

Till now, we have discussed both the multi-modal large language models and the multi-modal diffusion models, where the former works well for multi-modal understanding and the latter exhibits powerful ability in visual generation. Then there arises a natural question: could we have a unified model that can simultaneously work well for multi-modal understanding and generation? Next, we will discuss this trending problem from the following two perspectives: (i) the probabilistic modeling method, and (ii) the model architecture.

A. Probabilistic Modeling: Auto-regressive or Diffusion?

The success of multi-modal large language models has clearly shown the great power of auto-regressive modeling for multi-modal understanding and text generation, so we believe the auto-regressive method should be included. Then, the next question is how we enable the model with visual generation ability. Based on existing works in Sec. II and Sec. III, we provide the possible methods in Fig. 10, where we present the auto-regressive model and the joint auto-regressive and diffusion model. Next, we will elaborate on them in detail.

1) Auto-regressive Model: Although diffusion models have become dominant in visual generation, there are still some recent attempts [3], [182]–[184] on generating visual content in an auto-regressive manner. These works first map the input images and text into discrete tokens respectively. Particularly, the images are discretized with visual tokenizers such as VQGAN or VQ-VAE. Then the mixed text and visual tokens are fed into a multi-modal auto-regressive model. After that, the model outputs mixed text and visual tokens, where special tokens such as <soi> and <eoi> are used to indicate the start and the end of the image tokens. The generated text tokens deliver how the model understands the input multi-modal information, and the visual tokens are sent to the decoder of the VQ-VAE or VQGAN to reconstruct images. Therefore, the AR model can be used for both understanding and visual generation.
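The token layout sketched below illustrates how such an AR model can interleave text and discrete visual tokens with special markers; the ids SOI, EOI, and VISUAL_OFFSET, and the helper names, are illustrative assumptions rather than any specific model's vocabulary.

```python
import torch

# hypothetical special-token ids; real systems reserve such ids in a shared vocabulary
SOI, EOI = 50001, 50002          # start / end of image tokens
VISUAL_OFFSET = 50100            # visual codebook ids are shifted behind the text vocabulary

def build_sequence(text_ids, visual_ids):
    """Interleave text tokens and discrete visual tokens (e.g., from a VQGAN / VQ-VAE
    encoder) into one sequence for a causal multi-modal AR model."""
    visual = [VISUAL_OFFSET + v for v in visual_ids]
    return torch.tensor(text_ids + [SOI] + visual + [EOI])

def split_generated(sequence):
    """Route generated ids: text ids are detokenized as the textual answer, and ids
    between <soi> and <eoi> are mapped back to codebook indices for the VQ decoder."""
    seq = sequence.tolist()
    if SOI in seq and EOI in seq:
        s, e = seq.index(SOI), seq.index(EOI)
        text_part = seq[:s] + seq[e + 1:]
        visual_part = [t - VISUAL_OFFSET for t in seq[s + 1:e]]
    else:
        text_part, visual_part = seq, []
    return text_part, visual_part

seq = build_sequence(text_ids=[11, 57, 302], visual_ids=[5, 9, 1023])
print(split_generated(seq))
```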
Remark. Despite these efforts, the auto-regressive method still faces two important problems. One is that it relies upon the ability of the visual tokenizer, which needs to compress all the visual information concisely. The current codebook of
the tokenizer is obtained through the image reconstruction objective, which contains more pixel-level information instead of semantics, making multi-modal understanding harder without a large amount of data to train the multi-modal AR model. Additionally, discrete tokens inevitably lose some visual information, which may fail for some finer-grained understanding tasks or the visual generation task. The other problem is that the auto-regressive way basically means a causal structure and causal attention, where we use the former tokens to predict future tokens. However, this is not so suitable for image generation because, given an image, it is hard for us to choose which visual token should be put in the beginning and which visual token should be put in the end. Therefore, a recent work [185] tries to use the next-scale prediction paradigm to generate images, where the lower-resolution images are regarded as former tokens to predict higher-resolution images. However, its scaling ability is still not verified in multi-modal understanding and generation.
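As a rough illustration of the next-scale idea, the sketch below generates token maps coarse-to-fine, conditioning each scale on all coarser ones; the stand-in model and the scale schedule are our own assumptions, and a real system would use a trained transformer (and a multi-scale quantizer) in place of the stand-in.

```python
import torch

def next_scale_generation(model, scales=(1, 2, 4, 8), vocab=4096):
    """Coarse-to-fine generation in the spirit of next-scale prediction: all tokens of
    scale s_k are sampled in parallel, conditioned on the token maps of all coarser
    scales. `model(context, s)` stands in for a transformer returning (s*s, vocab) logits."""
    context = []                                   # token maps of coarser scales
    for s in scales:
        logits = model(context, s)                 # (s*s, vocab)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        context.append(tokens.view(s, s))
    return context                                 # list of token maps, finest last

# stand-in model: uniform logits (a real model would condition on `context`)
dummy = lambda context, s: torch.zeros(s * s, 4096)
maps = next_scale_generation(dummy)
print([m.shape for m in maps])
```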
2) Joint AR and Diffusion Model: Considering the impressive visual generation ability of the diffusion model, a more natural way for unified multi-modal understanding and generation is to combine the AR and diffusion models. In Fig. 10, we present two kinds of possible frameworks.

The first one is that we have a pretrained diffusion model for visual generation and a multi-modal LLM for multi-modal understanding. Then we connect these two parts and we can obtain a unified model. As for how to connect these two parts, many existing works [186]–[188] directly use the LLM as the controller and the diffusion model as a tool for visual generation, which is a common paradigm in tool learning. Although works like tool learning can enable the models with visual generation abilities, they easily suffer generation failure when meeting multi-modal generation conditions. For example, when we want to generate "a specific girl (described with a given image) and a specific dog (described with a given image) playing on the grass", the tools available are only SOTA text-to-image models, and they will fail to guarantee that the specific girl and the specific dog occur in the generated image. In fact, there are many conditions that cannot be described with only text, and this kind of tool-learning method will fail. To tackle the problem, a more advanced way is to train a learnable connector [189]–[192], which aligns the diffusion model and the multi-modal LLM in the same space, similar to the training paradigm of the alignment module in MLLM. The alignment process enables the diffusion model to receive the LLM output multi-modal embeddings as conditions instead of pure text descriptions, thus achieving multi-modal generation. However, this paradigm inherits the limitations of the alignment architecture: since the multi-modal LLM and the diffusion model are pretrained separately, the performance of the unified model will be limited by each component. Additionally, from an intuitive perspective, multi-modal understanding and multi-modal generation should not be independent tasks, but two related tasks that could share knowledge. For example, when generating a picture of a girl riding a horse, the model should definitely understand the concepts "girl", "horse" and "riding". Therefore, although it is resource-friendly to obtain a unified multi-modal generation and understanding model through connecting pretrained models, its ability is limited by independent modeling.
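A learnable connector of this kind can be as simple as a set of query tokens that cross-attend to the LLM's output states and are projected into the diffusion model's conditioning space. The module below is a minimal sketch under assumed dimensions; it is not the design of any particular cited work.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Minimal learnable connector: learned query tokens cross-attend to the LLM's
    output hidden states and are projected into the conditioning space expected by
    the diffusion model's cross-attention (all dimensions here are illustrative)."""
    def __init__(self, llm_dim=4096, cond_dim=768, num_queries=77):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_hidden):                               # (B, L, llm_dim)
        q = self.queries.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
        fused, _ = self.attn(q, llm_hidden, llm_hidden)          # queries read the LLM states
        return self.proj(fused)                                  # (B, num_queries, cond_dim)

# toy usage: multi-modal LLM states in, diffusion conditioning out
connector = Connector()
cond = connector(torch.randn(2, 128, 4096))
print(cond.shape)  # torch.Size([2, 77, 768])
```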
The second possible model is a unified multi-modal transformer framework, as shown in Fig. 10, where we do not rely on two pretrained models but try to use a single model trained with both diffusion and auto-regressive regularizations. The multi-modal input processor will first transform the multi-modal data into sequences that can be received by the transformers. Then the multi-modal transformer will try to learn the multi-modal knowledge for both understanding and generation. The diffusion regularization is used to guide visual generation, and the AR regularization is used to guide text generation. Note that this is a transformer-like model but not necessarily an LLM. This is because, when using transformers to generate visual content, the full-attention mechanism is usually adopted. In contrast, the attention mechanism adopted by LLMs is causal and uni-directional. Therefore, an adaptive or mixed attention mechanism might be designed. This perspective is verified in the very recent works Transfusion [193] and Show-o [194]. The difference between Transfusion and Show-o mainly lies in the diffusion part: Transfusion adopts continuous diffusion that is similar to current visual diffusion models, while Show-o adopts masked generative modeling [195], which can be regarded as a discrete diffusion regularization. Therefore, Show-o still relies on a pixel-level visual tokenizer for image generation but might trade off some understanding ability. Additionally, these two works are primary attempts at combining auto-regressive and diffusion modeling methods in a single transformer-like model. There still exist several open problems regarding what the model architecture should be like, such as the multi-modal input processor or the transformer-like model, which we will discuss next.
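One way to realize such an adaptive attention scheme is to build a mask that is causal over text tokens but bidirectional within each image span, as in the minimal sketch below (the segment interface and the boolean-mask convention are our own assumptions).

```python
import torch

def mixed_attention_mask(segments):
    """Build an attention mask for a unified transformer: text tokens attend causally,
    while tokens inside an image span attend bidirectionally to the whole span (and to
    everything before it). `segments` is a list of ("text" | "image", length) pairs.
    Returns a boolean (L, L) mask where True means "may attend"."""
    L = sum(n for _, n in segments)
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool))    # causal by default
    start = 0
    for kind, n in segments:
        if kind == "image":
            mask[start:start + n, start:start + n] = True     # full attention within the image block
        start += n
    return mask

# e.g., a prompt of 4 text tokens, a 9-token image, then 3 more text tokens
m = mixed_attention_mask([("text", 4), ("image", 9), ("text", 3)])
print(m.int())
```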
B. Model Architecture

Compared to previous MLLMs or diffusion models that only focus on one task, i.e., generation or understanding, the unified model itself should support multiple objectives. When it comes to understanding, the model should have the ability of conceptual abstraction and associative reasoning. In contrast, when it comes to visual generation, besides the overall concepts and their relations, pixel-level details are also important. Therefore, the unified model architecture design might be different from that of previous single-objective models. Next, we mainly discuss the possible architectures of the multi-modal input processor and the multi-modal transformers.

1) Multi-modal input processor: To tackle the multi-modal input text and images, two possible input processors are presented in Fig. 11. Text is consistently tackled by a text tokenizer. However, there are some differences in the visual input. In (a) of Fig. 11, we show the visual processor adopted by most existing works, where a single visual encoder is used to process the images. Considering that the visual tokens should support the pixel-level visual generation task, existing works [3], [193], [194] generally adopt single pixel-level (or patch-level) visual tokens. The pixel-level tokens bring challenges to the multi-modal transformer, requiring it not
Fig. 11. Possible frameworks of the multi-modal input processor for unified multi-modal generation and understanding models. (Panel (a), Single Visual Encoder: a text tokenizer and a single visual encoder produce text tokens and visual tokens from the text and the image. Panel (b), Semantic-Pixel Visual Encoders: a text tokenizer, a visual semantic encoder, and a visual pixel encoder produce text tokens, visual semantic tokens, and visual pixel tokens.)

combined with each other and result in more architectures, and now there are few attempts at the unified model design, and we believe the discussion above will inspire a lot of future works.

V. DATASETS

After discussing the multi-modal understanding and generation models, multi-modal text-image and text-video datasets are also important to implement multi-modal generative AI [196]. In this section, we will review the literature on the datasets for multi-modal generative AI training. Based on the differences in data types, we divide the datasets into three categories: caption, conversation, and reasoning. In addition, many multimodal large models choose to collect the aforementioned types of data for integration and construct their own datasets. Therefore, we call these datasets the integration datasets.

TABLE I: COMMON DATASETS
C. Reasoning Datasets

The above two types of datasets mainly focus on the visual content itself, normally lacking in-depth reasoning questions. Meanwhile, the reasoning datasets focus on enhancing MLLMs for diverse reasoning capacities, which normally require a step-by-step reasoning process following rigorous logic. These include spatial reasoning (CLEVR [216]), reading comprehension (VisualMRC [217]), temporal reasoning (NExT-QA [218]), and spatiotemporal reasoning (CLEVRER [219]).

D. Integration Datasets

Due to the strong generalization ability of multimodal large language models, their training data is not limited to only one task of caption, conversation, or reasoning, but requires comprehensive pretraining for both simple and complex visual modal tasks. Therefore, many multimodal large model works often do not use a single visual task dataset, but select subsets of several datasets from each category mentioned above for integration and adjustment, forming instruction training data that employs both image and video data for different visual modal tasks. For visual instruction tuning, LLaVA [37] is the first MLLM method to leverage text-only GPT-4 [222] to expand existing bounding-box and caption datasets such as MSCOCO [198] into multi-modal instruction-following data. In addition, Liu et al. proposed LLaVA-Instruct built on a subset of the CC-3M dataset, which contains 58k conversations, 23k detailed descriptions, and 77k complex reasoning samples. Following the development of visual instruction tuning, many video large language models such as Video-LLaVA [220], VideoChat2 [221] and VideoLLaMa2 [64] are also proposed, utilizing the combination of captioning, conversation, and reasoning datasets under both text-image and text-video modalities.
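The sketch below shows, under assumed field names, how a caption-plus-bounding-box record can be folded into a single instruction-following sample of the kind used for such integration; in the real LLaVA pipeline a text-only GPT-4 writes the responses from these inputs, whereas here we simply concatenate them to show the data format.

```python
import json

def caption_to_instruction(image_id, caption, boxes):
    """Turn a caption + bounding-box annotation (e.g., an MSCOCO-style record) into a
    single-turn instruction-following sample. Field names here are illustrative."""
    context = caption + " Objects: " + "; ".join(
        f"{name} at [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]"
        for name, (x0, y0, x1, y1) in boxes
    )
    return {
        "image": image_id,
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the image in detail."},
            {"from": "gpt", "value": context},
        ],
    }

sample = caption_to_instruction(
    "000123.jpg",
    "A girl is riding a horse on the grass.",
    [("girl", (0.32, 0.10, 0.58, 0.85)), ("horse", (0.20, 0.30, 0.80, 0.95))],
)
print(json.dumps(sample, indent=2))
```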
VI. FUTURE DIRECTIONS

In this section, we explore challenging and promising future directions for multi-modal generative AI from the following perspectives.

A. Unified GenAI of Video Generation and Understanding

In Section IV, we primarily discussed unified generative AI for image generation and understanding. Naturally, the next step is to extend this unified model to videos. Among the three architectures introduced in Fig. 10, connecting the MLLM and the video diffusion model with a connector [223], [224] can be achieved similarly to the approach used for images. However, the other two methods face significant challenges: the increased computational demands due to longer sequences and the difficulty in learning spatiotemporal cues.

For instance, in an auto-regressive model, encoding individual video frames separately using a 2D visual tokenizer fails to capture the essential temporal motion information. VideoPoet [225], which employs a 3D video tokenizer [226], encodes a 17-frame video (spanning 2.125 seconds) into 1280 tokens, limiting its ability to generate longer videos. Video-LaVIT [227] introduces an efficient video representation by decomposing video into keyframes and temporal motions, training separate tokenizers for each, which significantly improves computational efficiency. However, the training cost is still too high to scale to web-scale video data.

Similarly, using a single model trained with both diffusion and auto-regressive regularizations also encounters these two challenges, and the problem of modeling causal attention and spatiotemporal attention within the same model remains unexplored. We hope that future research will dedicate more effort to advancing unified generative AI for video generation and understanding.

B. Unified Generation and Understanding Benchmark

Despite some pioneering work on studying unified generation and understanding models [193], [194], their evaluations of these tasks are completely separated in a non-unified way. For instance, these studies use specific benchmarks for understanding tasks, such as Flickr30k [228] and VQAv2 [207], while relying on different benchmarks for generation tasks, such as MSCOCO [198] and GenEval [229]. Compared to this separated evaluation, a unified benchmark offers the advantage of unified metrics and rankings, providing a more comprehensive assessment of model performance across both tasks. However, designing such a benchmark is challenging, as it requires a vast amount of visual data with human annotations in various forms, including labels, rankings, and natural language descriptions. More importantly, the evaluation should ideally reflect the mutual promotion between generation and understanding. In summary, the challenges for creating a unified benchmark are threefold: (i) Dataset construction: The
visual data should be representative, diverse, and abundant, with high-quality annotations for multiple tasks. (ii) Ranking criteria: Models should be ranked based on a combination of understanding and generation metrics, ensuring a balanced evaluation of both capabilities. (iii) Ground truth for mutual promotion: The benchmark should include datasets or tasks that effectively demonstrate how generation and understanding enhance each other. Despite these challenges, developing such a benchmark is crucial for advancing the field, making it a promising area for future research efforts.

C. Multi-Modal Graph GenAI

Graph is a flexible representation paradigm, capable of modeling both naturally occurring network instances, e.g., protein and molecular structures, and the relations between entities across diverse modalities, e.g., multi-modal knowledge graphs. This part discusses the future directions of multi-modal graph GenAI from the following two perspectives:

• Leveraging multi-modal information to generate graph instances. Current multi-modal research predominantly focuses on modalities with regular structures, such as texts (sequences) and images (grids). However, many real-world instances within various modalities exhibit highly irregular structures, including proteins [230], molecules [231], scene graphs [232], etc. Understanding and generating graphs across these modalities represents a promising direction for future research. For instance, [233] explored text-to-graph generation by leveraging the domain knowledge of large language models, and [234] explored text-to-molecular-graph generation by integrating graph, image, and text information. However, there are several challenges for multi-modal graph generation: 1) Understanding structures. Given the high degree of irregularity in graphs, aligning them with other modalities poses significant difficulties. 2) Generating structures. While mainstream approaches utilize auto-regressive methods for generating discrete sequence information and diffusion models for continuous grid information, the complexity of graph structures may necessitate alternative modeling techniques for generation.

• Leveraging graph relations to help multi-modal generation. Current multi-modal methodologies often assume that data from different modalities are independent, whereas there can be strong intrinsic relationships between modalities [235], [236]. For example, the word, voice, and image of a bird are more closely related to each other than they are to those of other species. Leveraging these multi-modal associations to form a graph that aids in understanding and generation is a promising avenue for future research. For instance, [237] explores combining multiple data modalities through cross-modal dependencies and geometric relationships to develop multimodal architectures that can process diverse datasets, such as image-intensive, knowledge-grounded, and language-intensive models, while [238] captures intricate relationships between multiple modalities using graphs to enhance pretrained language models with multimodal context for generative tasks. Several challenges remain in this area: 1) The feature spaces of different modalities are heterogeneous, and thus, aligning them into a unified space within a multi-modal graph presents significant challenges. 2) The links between instances in different modalities can be heterophilous, e.g., the sounds of black and white cats may be very similar, but their visual appearances differ greatly, leading to varying degrees of similarity for the links across modalities within the multi-modal graph. 3) There may be substantial biases among different modalities, such as text and images dominating due to their ease of collection via the internet, while other modalities like voice and tactile sense are much more difficult to collect.

Despite these challenges, multi-modal graph GenAI holds significant potential applications. For instance, generating molecular graphs from text can facilitate scientists in rapidly creating and editing molecular compounds with desired properties through natural language interactions, thereby accelerating the drug discovery process. Additionally, leveraging multi-modal graphs allows GenAI systems to reference entities associated with different modalities, thereby enhancing their ability to make cross-modal associations. We hope that this discussion provides some inspiration for future research and development efforts in multi-modal graph GenAI.

D. LightWeight Multi-Modal GenAI

We discuss the future directions of lightweight multi-modal GenAI from three perspectives. i) For multi-modal generation (dominated by diffusion models), lightweight techniques face challenges from sampling steps, network architecture, and tasks. The iterative sampling process is a critical limitation of diffusion models, bringing high computational expenditure and constraining real-time applications. Although substantial works (e.g., distillation [239], consistency models [240], [241] and flow matching [242], [243]) engage in few-step (e.g., 4-step) or single-step sampling, fewer-step sampling generally causes remarkable quality degradation. Many tasks requiring high quality (e.g., [244], [245]) still adopt multi-step sampling. Therefore, improving few-step sampling is a significant and prospective future direction. Besides, the massive network architecture of diffusion models also contributes to the issue of high computational costs, and this tends to become more severe as model sizes increase rapidly. Previous methods try to obtain lightweight networks by compression techniques such as quantization [246]–[248], pruning [249], feature caching [250], [251], and neural architecture search [252], [253]. Although they have achieved remarkable success, their designs are mostly tailored for the setting of multi-step sampling and cannot be applied, or perform poorly, in few-step sampling. Therefore, exploring sampling-step-agnostic compression methods is an important future direction. Moreover, previous compression methods mainly focus on UNet-based models. A lot of literature [106], [109] indicates that DiT [106] might be a better architecture; to match this advancement, compression methods need to attach more importance to DiT-based architectures. Last but not least, previous compression methods mainly focus on class-conditional or text-to-image generation tasks but rarely
engage in other popular and more expensive tasks such as video generation. Exploring effective compression methods for these tasks should be meaningful. ii) For multi-modal understanding (MLLM), there is a mass of studies on lightweight MLLMs [254], such as vision token compression [220], [255] and efficient structures (e.g., MoE [256] and Mamba [257]). However, conventional but powerful compression methods including quantization and pruning are largely unexplored for MLLMs. Both diffusion models [247] and LLMs [258] have gained successful compression from quantization and pruning, so we believe exploring these methods for MLLMs is a promising direction. iii) Recently, researchers [45], [47] have started exploring the unified framework of multimodal understanding and generation, which is quite a novel and intriguing topic. These unified models also typically have a large number of parameters, thus raising the need for compression. According to previous experience, the research on lightweight technology has always lagged behind the development of the model itself. Developing effective lightweight methods for unified understanding and generation models can be a new track.
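As a minimal illustration of why quantization is attractive for these large models, the snippet below applies off-the-shelf dynamic int8 quantization to the linear layers of a stand-in transformer block; this is the simplest possible recipe and far from the dedicated post-training quantization methods cited above.

```python
import torch
import torch.nn as nn

# a stand-in feed-forward block; real MLLM / DiT blocks are similar in spirit
block = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
)

# post-training dynamic quantization: weights of nn.Linear layers are stored in int8
# and dequantized on the fly, cutting memory and (on CPU) compute for the linear layers
quantized = torch.ao.quantization.quantize_dynamic(
    block, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized(x).shape)
```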
E. Multi-Modal GenAI in Dynamic Environment

The multi-modal generative models discussed in this paper mostly do not interact with the dynamic physical world. In the future, multi-modal generative AI should behave like humans: it should perceive the multi-modal environment, reason and plan based on the perception and its state, take action, and improve itself. A very related topic is multi-modal embodied AI [259], [260], where multi-modal large language models are used as the controller. However, existing embodied AI systems are all parameter-fixed after deployment, limiting their abilities in dynamic environments, where the environment may change and new concepts may arise. If the new concepts are out of the distribution of the pretrained multi-modal generative model, existing works will fail to take the right action. Therefore, future works need to explore an automatic way of deciding when to update the model parameters and which part of the model parameters to update, e.g., the vision or the language modules, just as indicated in self-directed machine learning [261]. A possible way is to connect the multi-modal generative AI with an online cloud; when the error rate of the model reaches some threshold, it requests a parameter update from the cloud. Then the cloud will automatically collect the corresponding data, decide which part of the model parameters to update, and pass the parameters to the model with some efficient techniques such as LoRA [262]. Also, when updating the parameters, continual learning [263] or other optimization problems should also be considered.

VII. CONCLUSION

In this paper, a comprehensive review is provided for multi-modal generative AI. In the second and third sections, we review the multi-modal large language models for multi-modal understanding and the multi-modal diffusion models for visual generation. Related techniques are presented, and the pros and cons of different frameworks are discussed. Furthermore, our paper also sheds light on the cutting-edge topic of the unified multi-modal understanding and generation model, from the perspectives of probabilistic modeling methods and model architectures. We finally highlight some interesting and challenging future directions and hope this paper can contribute to the ongoing advancements of multi-modal generative AI.

REFERENCES

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[2] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh, "Video generation models as world simulators," 2024. [Online]. Available: https://openai.com/research/video-generation-models-as-world-simulators
[3] C. Team, "Chameleon: Mixed-modal early-fusion foundation models," arXiv preprint arXiv:2405.09818, 2024.
[4] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," Advances in neural information processing systems, vol. 36, 2024.
[5] A. Vaswani, "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017.
[6] B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, "Vtimellm: Empower llm to grasp video moments," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 271–14 280.
[7] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li, "Llava-onevision: Easy visual task transfer," arXiv preprint arXiv:2408.03326, 2024.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
[11] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "Vqa: Visual question answering," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[13] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, "Uniter: Universal image-text representation learning," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 2020, pp. 104–120.
[14] H. Tan and M. Bansal, "Lxmert: Learning cross-modality encoder representations from transformers," arXiv preprint arXiv:1908.07490, 2019.
[15] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., "Oscar: Object-semantics aligned pre-training for vision-language tasks," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 121–137.
[16] J. Lu, D. Batra, D. Parikh, and S. Lee, "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," Advances in neural information processing systems, vol. 32, 2019.
[17] Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu, "Pixel-bert: Aligning image pixels with text by deep multi-modal transformers," arXiv preprint arXiv:2004.00849, 2020.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[19] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer [41] J. Cha, W. Kang, J. Mun, and B. Roh, “Honeybee: Locality-enhanced
without convolution or region supervision,” in International Conference projector for multimodal llm,” in Proceedings of the IEEE/CVF Confer-
on Machine Learning. PMLR, 2021, pp. 5583–5594. ence on Computer Vision and Pattern Recognition, 2024, pp. 13 817–
[20] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. 13 827.
Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language [42] W. Li, Y. Yuan, J. Liu, D. Tang, S. Wang, J. Zhu, and L. Zhang,
representation learning with noisy text supervision,” in International “Tokenpacker: Efficient visual projector for multimodal llm,” arXiv
conference on machine learning. PMLR, 2021, pp. 4904–4916. preprint arXiv:2407.02392, 2024.
[21] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, [43] J. Han, R. Zhang, W. Shao, P. Gao, P. Xu, H. Xiao, K. Zhang, C. Liu,
“Align before fuse: Vision and language representation learning with S. Wen, Z. Guo et al., “Imagebind-llm: Multi-modality instruction
momentum distillation,” Advances in neural information processing tuning,” arXiv preprint arXiv:2309.03905, 2023.
systems, vol. 34, pp. 9694–9705, 2021. [44] R. Zhang, J. Han, C. Liu, A. Zhou, P. Lu, Y. Qiao, H. Li, and P. Gao,
[22] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image “Llama-adapter: Efficient fine-tuning of large language models with
pre-training for unified vision-language understanding and generation,” zero-initialized attention,” in The Twelfth International Conference on
in International Conference on Machine Learning. PMLR, 2022, pp. Learning Representations, 2024.
12 888–12 900. [45] C. Team, “Chameleon: Mixed-modal early-fusion foundation models,”
[23] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- arXiv preprint arXiv:2405.09818, 2024.
image pre-training with frozen image encoders and large language [46] R. Sennrich, “Neural machine translation of rare words with subword
models,” arXiv preprint arXiv:2301.12597, 2023. units,” arXiv preprint arXiv:1508.07909, 2015.
[24] A. AI, “Fuyu-8b: A unified multimodal agent for image and text [47] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu,
understanding,” https://www.adept.ai/blog/fuyu-8b, 2023. Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer
[25] A. Razavi, A. Van den Oord, and O. Vinyals, “Generating diverse to unify multimodal understanding and generation,” arXiv preprint
high-fidelity images with vq-vae-2,” Advances in neural information arXiv:2408.12528, 2024.
processing systems, vol. 32, 2019. [48] P. Jin, R. Takanobu, W. Zhang, X. Cao, and L. Yuan, “Chat-univi:
[26] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “Videogpt: Video gener- Unified visual representation empowers large language models with
ation using vq-vae and transformers,” arXiv preprint arXiv:2104.10157, image and video understanding,” in Proceedings of the IEEE/CVF
2021. Conference on Computer Vision and Pattern Recognition, 2024, pp.
[27] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for 13 700–13 710.
high-resolution image synthesis,” in Proceedings of the IEEE/CVF [49] J. He, Y. Wang, L. Wang, H. Lu, J.-Y. He, J.-P. Lan, B. Luo, and
conference on computer vision and pattern recognition, 2021, pp. X. Xie, “Multi-modal instruction tuned llms with fine-grained visual
12 873–12 883. perception,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2024, pp. 13 980–13 990.
[28] J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu,
J. Baldridge, and Y. Wu, “Vector-quantized image modeling with [50] T. Zhang, X. Li, H. Fei, H. Yuan, S. Wu, S. Ji, C. C. Loy, and S. Yan,
improved vqgan,” arXiv preprint arXiv:2110.04627, 2021. “Omg-llava: Bridging image-level, object-level, pixel-level reasoning
and understanding,” arXiv preprint arXiv:2406.19389, 2024.
[29] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer,
[51] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu,
“High-resolution image synthesis with latent diffusion models,” in
J. Zhou, Y. Qiao et al., “Visionllm: Large language model is also
Proceedings of the IEEE/CVF conference on computer vision and
an open-ended decoder for vision-centric tasks,” Advances in Neural
pattern recognition, 2022, pp. 10 684–10 695.
Information Processing Systems, vol. 36, 2024.
[30] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
[52] H. Fei, S. Wu, H. Zhang, T.-S. Chua, and S. Yan, “Vitron: A uni-
J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly
fied pixel-level vision llm for understanding, generating, segmenting,
capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
editing,” 2024.
[31] M. Tsimpoukelli, J. L. Menick, S. Cabi, S. Eslami, O. Vinyals, and [53] H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou,
F. Hill, “Multimodal few-shot learning with frozen language models,” R. Li, and W. Peng, “A survey on hallucination in large vision-language
Advances in Neural Information Processing Systems, vol. 34, pp. 200– models,” arXiv preprint arXiv:2402.00253, 2024.
212, 2021.
[54] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F.
[32] A. Brock, S. De, S. L. Smith, and K. Simonyan, “High-performance Chang, and Y. Yang, “Ferret: Refer and ground anything anywhere at
large-scale image recognition without normalization,” in International any granularity,” arXiv preprint arXiv:2310.07704, 2023.
conference on machine learning. PMLR, 2021, pp. 1059–1071. [55] A. Gunjal, J. Yin, and E. Bas, “Detecting and preventing hallucinations
[33] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, in large vision language models,” in Proceedings of the AAAI Confer-
and I. Misra, “Imagebind: One embedding space to bind them all,” ence on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 18 135–18 143.
in Proceedings of the IEEE/CVF Conference on Computer Vision and [56] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong,
Pattern Recognition, 2023, pp. 15 180–15 190. Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation
[34] J. Jain, J. Yang, and H. Shi, “Vcoder: Versatile vision encoders for models and aligning for generic visual-linguistic tasks,” in Proceed-
multimodal large language models,” in Proceedings of the IEEE/CVF ings of the IEEE/CVF Conference on Computer Vision and Pattern
Conference on Computer Vision and Pattern Recognition, 2024, pp. Recognition, 2024, pp. 24 185–24 198.
27 992–28 002. [57] Y. Zhao, Z. Li, Z. Jin, F. Zhang, H. Zhao, C. Dou, Z. Tao, X. Xu, and
[35] L. Yu, Y. Cheng, Z. Wang, V. Kumar, W. Macherey, Y. Huang, D. Liu, “Enhancing the spatial awareness capability of multi-modal
D. Ross, I. Essa, Y. Bisk, M.-H. Yang et al., “Spae: Semantic pyramid large language model,” arXiv preprint arXiv:2310.20357, 2023.
autoencoder for multimodal generation with frozen llms,” Advances in [58] C. Jiang, H. Xu, M. Dong, J. Chen, W. Ye, M. Yan, Q. Ye, J. Zhang,
Neural Information Processing Systems, vol. 36, 2024. F. Huang, and S. Zhang, “Hallucination augmented contrastive learning
[36] L. Zhu, F. Wei, and Y. Lu, “Beyond text: Frozen large language models for multimodal large language model,” in Proceedings of the IEEE/CVF
in visual signal comprehension,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp.
Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 036–27 046.
27 047–27 057. [59] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss,
[37] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize
Advances in neural information processing systems, vol. 36, 2024. with human feedback,” Advances in Neural Information Processing
[38] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, Systems, vol. 33, pp. 3008–3021, 2020.
K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: [60] Y. Tang, J. Bi, S. Xu, L. Song, S. Liang, T. Wang, D. Zhang, J. An,
a visual language model for few-shot learning,” Advances in neural J. Lin, R. Zhu, A. Vosoughi, C. Huang, Z. Zhang, F. Zheng, J. Zhang,
information processing systems, vol. 35, pp. 23 716–23 736, 2022. P. Luo, J. Luo, and C. Xu, “Video understanding with large language
[39] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: En- models: A survey,” arXiv preprint arXiv:2312.17432, 2023.
hancing vision-language understanding with advanced large language [61] L. KunChang, H. Yinan, W. Yi, L. Yizhuo, W. Wenhai, P. Luo, W. Yali,
models,” arXiv preprint arXiv:2304.10592, 2023. W. Limin, and Q. Yu, “Videochat: Chat-centric video understanding,”
[40] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, arXiv preprint arXiv:2305.06355, 2023.
and J. Zhou, “Qwen-vl: A frontier large vision-language model with [62] H. Zhang, X. Li, and L. Bing, “Video-llama: An instruction-tuned
versatile abilities,” arXiv preprint arXiv:2308.12966, 2023. audio-visual language model for video understanding,” arXiv preprint
arXiv:2306.02858, 2023. [Online]. Available: https://arxiv.org/abs/ [84] Y. Pan, Z. Qiu, T. Yao, H. Li, and T. Mei, “To create what you tell:
2306.02858 Generating videos from captions,” in Proceedings of the 25th ACM
[63] S. K. Muhammad Maaz, Hanoona Rasheed and F. Khan, “Video- international conference on Multimedia, 2017, pp. 1789–1798.
chatgpt: Towards detailed video understanding via large vision and [85] Y. Li, M. Min, D. Shen, D. Carlson, and L. Carin, “Video genera-
language models,” ArXiv 2306.05424, 2023. tion from text,” in Proceedings of the AAAI conference on artificial
[64] Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, intelligence, vol. 32, no. 1, 2018.
Y. Zhu, W. Zhang, Z. Luo, D. Zhao, and L. Bing, “Videollama [86] K. Deng, T. Fei, X. Huang, and Y. Peng, “Irc-gan: Introspective
2: Advancing spatial-temporal modeling and audio understanding recurrent convolutional gan for text-to-video generation.” in IJCAI,
in video-llms,” arXiv preprint arXiv:2406.07476, 2024. [Online]. 2019, pp. 2216–2222.
Available: https://arxiv.org/abs/2406.07476 [87] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in
[65] Y. Zhang, B. Li, h. Liu, Y. j. Lee, L. Gui, D. Fu, J. Feng, International Conference on Learning Representations, 2014.
Z. Liu, and C. Li, “Llava-next: A strong zero-shot video [88] G. Mittal, T. Marwah, and V. N. Balasubramanian, “Sync-draw: Auto-
understanding model,” April 2024. [Online]. Available: https: matic video generation using deep recurrent attentive architectures,” in
//llava-vl.github.io/blog/2024-04-30-llava-next-video/ Proceedings of the 25th ACM international conference on Multimedia,
[66] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. 2017, pp. 1096–1104.
Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” [89] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation
January 2024. [Online]. Available: https://llava-vl.github.io/blog/ learning,” Advances in neural information processing systems, vol. 30,
2024-01-30-llava-next/ 2017.
[67] Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, [90] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic
W. Zhao, Z. He et al., “Minicpm-v: A gpt-4v level mllm on your models,” Advances in neural information processing systems, vol. 33,
phone,” arXiv preprint arXiv:2408.01800, 2024. pp. 6840–6851, 2020.
[68] J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, [91] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit mod-
M. Shoeybi, and S. Han, “Vila: On pre-training for visual language els,” arXiv preprint arXiv:2010.02502, 2020.
models,” 2023. [92] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
[69] S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive “Generative adversarial text to image synthesis,” in International con-
multimodal large language model for long video understanding,” in ference on machine learning. PMLR, 2016, pp. 1060–1069.
Proceedings of the IEEE/CVF Conference on Computer Vision and [93] Y. He, R. Salakhutdinov, and J. Z. Kolter, “Localized text-to-
Pattern Recognition, 2024, pp. 14 313–14 323. image generation for free via cross attention control,” arXiv preprint
[70] C. Zhang, T. Lu, M. M. Islam, Z. Wang, S. Yu, M. Bansal, and arXiv:2306.14636, 2023.
G. Bertasius, “A simple llm framework for long-range video question- [94] O. Avrahami, T. Hayes, O. Gafni, S. Gupta, Y. Taigman, D. Parikh,
answering,” arXiv preprint arXiv:2312.17235, 2023. D. Lischinski, O. Fried, and X. Yin, “Spatext: Spatio-textual repre-
[71] H. Chen, X. Wang, H. Chen, Z. Song, J. Jia, and W. Zhu, “Grounding- sentation for controllable image generation,” in Proceedings of the
prompter: Prompting llm with multimodal information for temporal IEEE/CVF Conference on Computer Vision and Pattern Recognition,
sentence grounding in long videos,” arXiv preprint arXiv:2312.17117, 2023, pp. 18 370–18 380.
2023. [95] J. Cheng, X. Liang, X. Shi, T. He, T. Xiao, and M. Li, “Layoutdiffuse:
[72] W. Feng, X. Wang, H. Chen, Z. Zhang, Z. Song, Y. Zhou, and W. Zhu, Adapting foundational diffusion models for layout-to-image genera-
“Llm4vg: Large language models evaluation for video grounding,” tion,” arXiv preprint arXiv:2302.08908, 2023.
arXiv preprint arXiv:2312.14206, 2023. [96] G. Zheng, X. Zhou, X. Li, Z. Qi, Y. Shan, and X. Li, “Layoutdiffusion:
[73] E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, Controllable diffusion model for layout-to-image generation,” in Pro-
X. Guo, T. Ye, Y. Zhang et al., “Moviechat: From dense token to ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
sparse memory for long video understanding,” in Proceedings of the Recognition, 2023, pp. 22 490–22 499.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, [97] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image
2024, pp. 18 221–18 232. translation with conditional adversarial networks,” in Proceedings of
[74] H. Liu, W. Yan, M. Zaharia, and P. Abbeel, “World model on million- the IEEE conference on computer vision and pattern recognition, 2017,
length video and language with blockwise ringattention,” arXiv preprint pp. 1125–1134.
arXiv:2402.08268, 2024. [98] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image
[75] P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, synthesis with spatially-adaptive normalization,” in Proceedings of the
H. Tan, C. Li, and Z. Liu, “Long context transfer from language to IEEE/CVF conference on computer vision and pattern recognition,
vision,” arXiv preprint arXiv:2406.16852, 2024. 2019, pp. 2337–2346.
[76] Y. Li, C. Wang, and J. Jia, “Llama-vid: An image is worth 2 tokens in [99] C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to-
large language models,” arXiv preprint arXiv:2311.17043, 2023. image diffusion models in generative ai: A survey,” arXiv preprint
[77] Z. Wang, S. Yu, E. Stengel-Eskin, J. Yoon, F. Cheng, G. Bertasius, arXiv:2303.07909, 2023.
and M. Bansal, “Videotree: Adaptive tree-based video representation [100] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew,
for llm reasoning on long videos,” arXiv preprint arXiv:2405.19209, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image gen-
2024. eration and editing with text-guided diffusion models,” arXiv preprint
[78] X. Wang, Y. Zhang, O. Zohar, and S. Yeung-Levy, “Videoagent: Long- arXiv:2112.10741, 2021.
form video understanding with large language model as agent,” arXiv [101] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton,
preprint arXiv:2403.10517, 2024. K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al.,
[79] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- “Photorealistic text-to-image diffusion models with deep language
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial understanding,” Advances in neural information processing systems,
networks,” 2014. [Online]. Available: https://arxiv.org/abs/1406.2661 vol. 35, pp. 36 479–36 494, 2022.
[80] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, “Cvae-gan: fine-grained [102] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer,
image generation through asymmetric training,” in Proceedings of the “High-resolution image synthesis with latent diffusion models,” in
IEEE international conference on computer vision, 2017, pp. 2745– Proceedings of the IEEE/CVF conference on computer vision and
2754. pattern recognition, 2022, pp. 10 684–10 695.
[81] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and [103] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchi-
T. Park, “Scaling up gans for text-to-image synthesis,” in Proceedings cal text-conditional image generation with clip latents,” arXiv preprint
of the IEEE/CVF Conference on Computer Vision and Pattern Recog- arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
nition, 2023, pp. 10 124–10 134. [104] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image
[82] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, synthesis,” Advances in neural information processing systems, vol. 34,
“Analyzing and improving the image quality of stylegan,” in Pro- pp. 8780–8794, 2021.
ceedings of the IEEE/CVF conference on computer vision and pattern [105] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv
recognition, 2020, pp. 8110–8119. preprint arXiv:2207.12598, 2022.
[83] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with [106] W. Peebles and S. Xie, “Scalable diffusion models with transformers,”
scene dynamics,” Advances in neural information processing systems, in Proceedings of the IEEE/CVF International Conference on Com-
vol. 29, 2016. puter Vision, 2023, pp. 4195–4205.
[107] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, multiple subjects,” in Proceedings of the 37th International Conference
P. Luo, H. Lu et al., “Pixart-alpha: Fast training of diffusion trans- on Neural Information Processing Systems, 2023, pp. 57 500–57 519.
former for photorealistic text-to-image synthesis,” arXiv preprint [129] Y. Gu, X. Wang, J. Z. Wu, Y. Shi, Y. Chen, Z. Fan, W. Xiao,
arXiv:2310.00426, 2023. R. Zhao, S. Chang, W. Wu et al., “Mix-of-show: Decentralized low-
[108] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, rank adaptation for multi-concept customization of diffusion models,”
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning Advances in Neural Information Processing Systems, vol. 36, 2024.
with a unified text-to-text transformer,” Journal of machine learning [130] V. Shah, N. Ruiz, F. Cole, E. Lu, S. Lazebnik, Y. Li, and V. Jampani,
research, vol. 21, no. 140, pp. 1–67, 2020. “Ziplora: Any subject in any style by effectively merging loras,” arXiv
[109] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Muller, H. Saini, preprint arXiv:2311.13600, 2023.
Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified [131] R. Po, G. Yang, K. Aberman, and G. Wetzstein, “Orthogonal adaptation
flow transformers for high-resolution image synthesis,” in Forty-first for modular customization of diffusion models,” in Proceedings of the
International Conference on Machine Learning, 2024. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[110] P. Cao, F. Zhou, Q. Song, and L. Yang, “Controllable genera- 2024, pp. 7964–7973.
tion with text-to-image diffusion models: A survey,” arXiv preprint [132] L. Wang, G. Shen, W. Ge, G. Chen, Y. Li, and Y.-c. Chen, “Decompose
arXiv:2403.04279, 2024. and realign: Tackling condition misalignment in text-to-image diffusion
[111] J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “Instantbooth: Personalized models,” arXiv preprint arXiv:2306.14408, 2023.
text-to-image generation without test-time finetuning,” in Proceedings
[133] Y. Wang, W. Zhang, J. Zheng, and C. Jin, “High-fidelity person-
of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
centric subject-to-image synthesis,” in Proceedings of the IEEE/CVF
nition, 2024, pp. 8543–8552.
Conference on Computer Vision and Pattern Recognition, 2024, pp.
[112] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano,
7675–7684.
G. Chechik, and D. Cohen-Or, “An image is worth one word: Personal-
izing text-to-image generation using textual inversion,” arXiv preprint [134] P. Cao, L. Yang, F. Zhou, T. Huang, and Q. Song, “Concept-
arXiv:2208.01618, 2022. centric personalization with large-scale diffusion priors,” arXiv preprint