OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
Junke Wang1,2, Yi Jiang3♠, Zehuan Yuan3, Binyue Peng3, Zuxuan Wu1,2†♠, Yu-Gang Jiang
1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
2 Shanghai Collaborative Innovation Center on Intelligent Visual Computing, 3 Bytedance Inc.
Abstract
A tokenizer, serving as a translator that maps intricate visual data into a compact latent space, lies at the core of visual generative models. Observing that existing tokenizers are tailored to either image or video inputs, this paper presents
OmniTokenizer, a transformer-based tokenizer for joint image and video tokeniza-
tion. OmniTokenizer is designed with a spatial-temporal decoupled architecture,
which integrates window and causal attention for spatial and temporal modeling.
To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop its spatial encoding capacity, and then jointly trained on image and video data at multiple resolutions to learn temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and shows that their synergy can be realized.
Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art
(SOTA) reconstruction performance on various image and video datasets, e.g., 1.11
reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating
the previous SOTA methods by 13% and 26%, respectively. We further show that, when integrated with OmniTokenizer, both language model-based approaches and diffusion models achieve advanced visual synthesis performance, underscoring the superiority and versatility of our method.
1 Introduction
The development of generative models [25, 52, 14, 17, 10, 39] has been one of the most exhilarating advances in artificial intelligence, offering the potential to revolutionize the way we generate visual content. In recent years, two dominant paradigms of visual generation have emerged: language model-based methods [52, 12, 64, 46] and diffusion models [17, 43]. The
former exploits the superior sequence modeling capability of language models (LMs) [34, 35, 50]
for visual generation by formulating it as a next-token prediction process, while the latter gradually
transforms noise into coherent visual structures through a carefully crafted reverse diffusion process.
Core to both approaches is the tokenizer, which translates visual signals into latent representations,
with LM tokenizers, also known as VQVAE, discretizing inputs into sequences of latent codes [12,
62, 64], and diffusion tokenizers, i.e., VAE, modeling their probability distributions within a latent
space [25, 39]. Analogous to the role of the lexicon in a written language, tokenizers for visual
synthesis dictate the upper bound of the generative models, thus attracting increasing attention in the
community [12, 61, 19].
• We introduce OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization.
For the first time, OmniTokenizer employs a shared framework and weights to handle both types of
visual data.
• We propose a progressive training strategy that begins with image pre-training at a fixed resolution and then transitions to image-video joint training at multiple resolutions. Such an approach capitalizes on the synergy between image and video data, enabling OmniTokenizer to achieve better performance than training on images or videos alone.
• We conduct extensive experiments across various datasets like ImageNet, CelebA-HQ, FFHQ,
UCF-101, and Kinetics-600. The results showcase the state-of-the-art reconstruction performance
of OmniTokenizer on both image and video datasets. Furthermore, equipped with OmniTok-
enizer, both language model-based generative models and diffusion models could achieve superior
generation results.
2 Related Work
Language models have emerged as powerful contenders in the visual generation field, drawing
inspiration from their unparalleled success in natural language processing [34, 35, 49, 50] and visual
understanding [11, 5, 47, 57, 55]. These methods [12, 7, 13, 64] recast visual synthesis as a sequence
prediction problem, similar to constructing sentences in human language.
Figure 1: Architecture of OmniTokenizer, which consists of patch embedding layers and separate
spatial-temporal attention blocks. To obtain the latent representations, OmniTokenizer-VQVAE looks
up a codebook to quantize the encoder embeddings, while OmniTokenizer-VAE samples from a
Gaussian distribution. We omit the decoder and only show the tokenization process.
Depending on whether the tokens are predicted sequentially or in parallel, LM-based methods can
be further categorized into autoregressive models [12, 63] and non-autoregressive models [7, 65].
Autoregressive (AR) models have been the initial foray into visual generation, utilizing the inherent
sequential nature of language models to generate images [62, 63] and videos [61, 13] in a step-wise
fashion. These models, such as DALL-E [37] and its preceding variants, typically work by predicting
one token at a time and are characterized by their high-quality outputs and precise control over the
generation process. VAR [46] redefines the autoregressive learning framework on images as a coarse-to-fine "next-scale prediction" paradigm. Non-autoregressive (Non-AR) models, on the other hand, have
been developed to allow for a faster generation process by predicting multiple tokens independently
and in parallel. Models like MaskGIT [7] leverage this parallelism to significantly reduce generation
time while maintaining high fidelity in synthesized images. Non-AR approaches have also demonstrated promise in video generation, exemplified by the MAGVIT series [64, 65]. Both AR and
non-AR methods have significantly advanced the field of visual generation, offering novel methods
to synthesize high-quality images and videos.
Diffusion models [17, 31, 3, 60] represent an alternative avenue for visual generation, benefiting from their probabilistic nature, which iteratively denoises a random signal into structured images or
videos. These models stand out for their flexibility in generating visual outputs that not only exhibit
coherent global structures but are also rich with intricate textures [30, 32]. Unlike language models
that discretize visual inputs as latent codes, diffusion models directly generate visual samples in
continuous pixel space [43, 10]. While effective, this approach demands significant computational
resources given the high dimensionality of visual data.
Latent diffusion models (LDMs) [39] mitigate these issues by compressing the
high-dimensional visual data into latent space with a pretrained Variational Autoencoder (VAE) [25,
39]. LDM preserves the desirable properties of pixel-space diffusion models, such as high-quality
image synthesis and the ability to incorporate conditional information, while drastically reducing
the training and sampling overhead. Since then, successive LDMs [69, 33, 32, 28] have continued to push visual generation toward higher quality, larger resolutions, and more complex scenes.
3 Methodology
3.1 Joint Image and Video Tokenization
We aim to enable image and video tokenization in a unified framework and achieve mutual benefits
between them. To accomplish this, we employ a transformer-based architecture with decoupled
spatial and temporal blocks (Sec. 3.1.1). Complementing this, we also propose a progressive training
strategy consisting of two consecutive stages to learn the visual encoding in an incremental way
(Sec. 3.1.2). The overall framework of our method is illustrated in Figure 1.
[Figure: Comparison of tokenization paradigms. Existing image tokenizers are trained from scratch on images (e.g., VQGAN, ViT-VQGAN), while video tokenizers are trained from scratch on videos (e.g., TATS) or initialized from an image tokenizer (e.g., MAGVITv2); OmniTokenizer instead serves as a single tokenizer for both images and videos.]
Encoder and Decoder. To have better compatibility with image and video inputs, we adopt a
spatial-temporal factorized encoder consisting of separate spatial and temporal blocks. In the spatial
dimension, window attention [27] is employed for its superior local aggregation capability and efficiency, while in the temporal dimension, causal attention is used to align with the autoregressive visual generation in the second stage. Next, the latent code z can be obtained by looking up a codebook [52] for the LM tokenizer (i.e., quantization in VQVAE), or by sampling from a Gaussian distribution for the diffusion tokenizer.
The architecture of the decoder is symmetric to that of the encoder. Finally, we map the spatial-temporal
tokens to the pixel space with two linear projection layers without any activation function.
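To make the decoupled design concrete, the following is a minimal PyTorch sketch of one spatial-temporal block, assuming the latent tokens are laid out as (batch, frames, height, width, channels); the module names, normalization placement, and window partitioning details are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative decoupled block: window attention over space, causal attention over time."""

    def __init__(self, dim=512, heads=8, window=8):
        super().__init__()
        self.window = window
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) latent tokens; H and W are assumed divisible by the window size.
        B, T, H, W, C = x.shape
        w = self.window

        # Spatial window attention: tokens attend only within non-overlapping w x w windows of each frame.
        s = x.reshape(B * T, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        s = s.reshape(-1, w * w, C)
        n = self.norm_s(s)
        s = s + self.spatial_attn(n, n, n, need_weights=False)[0]
        s = s.reshape(B * T, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = s.reshape(B, T, H, W, C)

        # Temporal causal attention: each spatial position attends only to the current and earlier frames.
        t = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        n = self.norm_t(t)
        t = t + self.temporal_attn(n, n, n, attn_mask=causal, need_weights=False)[0]
        return t.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

# Example: a 17-frame, 256x256 clip compressed to a 5x32x32 latent grid (shapes illustrative).
block = SpatialTemporalBlock(dim=512, heads=8, window=8)
out = block(torch.randn(1, 5, 32, 32, 512))   # -> torch.Size([1, 5, 32, 32, 512])

In the configuration reported in Sec. 4, the encoder uses 4 such spatial layers and 4 such temporal layers, and the decoder mirrors this structure.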
Table 1: Reconstruction FID on ImageNet validation split, CelebA-HQ, and FFHQ. ∗ denotes models trained with Gumbel-Softmax reparameterization [37]. For our method, the results that are jointly trained with UCF-101 are reported.
Method | Dataset | Lat. shape | Codebook | rFID
ViT-VQGAN [62] | CelebA-HQ | 32 × 32 | 8192 | 4.66
Ours-VQVAE | CelebA-HQ | 32 × 32 | 8192 | 1.93
ViT-VQGAN [62] | FFHQ | 32 × 32 | 8192 | 3.13
Ours-VQVAE | FFHQ | 32 × 32 | 8192 | 1.91

Table 2: Reconstruction FVD on UCF-101 and Moments-in-Time val. split. ∗ denotes training the image tokenizer with video loss.
Method | Type | UCF | MiT
MaskGIT [7] | Img | 240 | -
VQGAN [12] | Img | 299 | 306
ViT-VQGAN [62] | Img | - | 167
ViT-VQGAN∗ [62] | Img | - | 173
where sg denotes the stop-gradient operation, λ1 and λ2 are balancing hyperparameters, and E and z_q represent the encoder of OmniTokenizer and the codebook vectors, respectively. Factorized codes and ℓ2-normalized codes [62] are also used to boost codebook usage.
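For reference, the VQ objective that these symbols describe would, in its standard form, read as follows; the placement of λ1 and λ2 over the codebook and commitment terms is an assumption here rather than the paper's exact formulation:

L_VQ = λ1 ||sg[E(x)] − z_q||² + λ2 ||sg[z_q] − E(x)||²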
KL fine-tuning. After the VQ training, we further fine-tune our model as a diffusion tokenizer (i.e., OmniTokenizer-VAE) by replacing the above L_VQ with a Kullback-Leibler (KL) loss:

L_KL = λ3 D_KL(Q(z|x) || P(z)),    (2)

where P(z) is a Gaussian distribution and Q(z|x) represents the inferred posterior of the latent code given the observed input.
Besides L_VQ or L_KL, both VQ training and KL fine-tuning also employ an L2 reconstruction loss L_recon and a GAN loss L_GAN.
As mentioned in Sec. 3.1.2, after the progressive training and KL fine-tuning, we obtain two tokenizers: OmniTokenizer-VQVAE and OmniTokenizer-VAE, which encode the visual inputs into latent codes from a discrete codebook or into a continuous latent space, respectively. With these tokenizers, we further train language models or diffusion models for visual generation.
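The two latent heads can be sketched roughly as below, assuming a shared encoder embedding of dimension 8 and a codebook of size 8192 as in Sec. 4; the class names, the straight-through estimator, and the posterior parameterization are illustrative assumptions, and the factorized/ℓ2-normalized codebook tricks are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VQHead(nn.Module):
    """Discrete head: nearest-codebook lookup with a straight-through gradient."""
    def __init__(self, codebook_size=8192, dim=8):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z_e):                                   # z_e: (N, dim) encoder embeddings
        d = torch.cdist(z_e, self.codebook.weight)            # distances to every codebook vector
        idx = d.argmin(dim=-1)                                 # nearest code indices
        z_q = self.codebook(idx)
        commit = F.mse_loss(z_e, z_q.detach()) + F.mse_loss(z_q, z_e.detach())  # codebook + commitment terms
        z_q = z_e + (z_q - z_e).detach()                       # straight-through estimator
        return z_q, idx, commit

class KLHead(nn.Module):
    """Continuous head: Gaussian posterior with reparameterized sampling."""
    def __init__(self, dim=8):
        super().__init__()
        self.to_stats = nn.Linear(dim, 2 * dim)

    def forward(self, z_e):
        mu, logvar = self.to_stats(z_e).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sample from Q(z|x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl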
Language model-based generation approaches formulate visual synthesis as a token prediction problem. Specifically, after OmniTokenizer-VQVAE tokenizes image or video inputs into a sequence of discrete latent codes, we first flatten them in raster order [8, 12] to obtain the code indices
y. Then a transformer language model [34] is trained with a cross-entropy loss between the predicted tokens ŷ and the target tokens y to maximize the log-likelihood:
maximize Σ_{i=1}^{L} log P(ŷ_i | c, y_{1:i−1}; θ).    (3)
where c represents the condition (e.g., the class label for class-conditional image and video generation), θ denotes the learnable parameters of the language model, and P and L denote the softmax probability and the length of y, respectively. During inference, we predict each token according to the model likelihood.
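A minimal sketch of this training step is given below, assuming the condition is injected as a single prepended class token and that language_model is a placeholder for any decoder-only transformer returning logits over the code vocabulary (all names here are illustrative):

import torch
import torch.nn.functional as F

def lm_training_step(language_model, codes, class_token):
    """One autoregressive training step over raster-ordered latent codes (cf. Eq. 3).

    codes:       (B, L) long tensor of indices from OmniTokenizer-VQVAE, flattened in raster order.
    class_token: (B,)   long tensor with the condition mapped to a dedicated token id.
    """
    # Shift targets right and prepend the condition, so token i is predicted from c and y_{1:i-1}.
    inputs = torch.cat([class_token[:, None], codes[:, :-1]], dim=1)
    logits = language_model(inputs)                                   # (B, L, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), codes.reshape(-1))

At inference time, tokens would be sampled one at a time from P(ŷ_i | c, y_{1:i−1}) and decoded back to pixels by the OmniTokenizer-VQVAE decoder.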
Latent diffusion models (LDMs) [39] perform the diffusion process in the latent space to enable high-
quality image synthesis with improved computational efficiency. Specifically, with the 2D latent
representation from OmniTokenizer-VAE, the diffusion process gradually applies Gaussian noise to
the latent code to generate a perturbed sample, while the denoising process trains a diffusion model
to predict the noise that has been added. During inference, the trained diffusion model generates a coherent visual sample from noise by iteratively reversing the noising process.
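The corresponding training step can be sketched as follows, assuming an ε-prediction objective in the OmniTokenizer-VAE latent space; the denoiser call (e.g., a DiT-style network) and the noise schedule are placeholders rather than the exact setup:

import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, z0, class_label, alphas_cumprod):
    """One denoising training step on clean latents z0 (epsilon-prediction objective).

    z0:             (B, C, H, W) latents produced by OmniTokenizer-VAE.
    alphas_cumprod: (T,) cumulative product of a noise schedule (e.g., linear betas).
    """
    B = z0.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (B,), device=z0.device)   # random timesteps
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise                  # forward noising q(z_t | z_0)
    pred = denoiser(zt, t, class_label)                                    # predict the added noise
    return F.mse_loss(pred, noise)

Sampling then starts from pure Gaussian noise and iteratively applies the reverse update, after which the OmniTokenizer-VAE decoder maps the final latent back to pixels.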
4 Experiments
Datasets. We evaluate the visual tokenization performance of OmniTokenizer on both image and
video datasets, including ImageNet [9], CelebA-HQ [21], FFHQ [22], Kinetics [23, 6], UCF-101 [44],
Table 3: Comparisons of class-conditional results on ImageNet 256×256 using language models. "↓" ("↑") indicates lower (higher) is better. Metrics include Fréchet inception distance (FID) and inception score (IS). NAR and AR: non-autoregressive and autoregressive. ∗: taken from MaskGIT [7].
Type | Method | #Param | FID↓ | IS↑
AR | VQGAN∗ [12] | 227M | 18.65 | 80.4
AR | RQ-Transformer [26] | 488M | 15.72 | 86.8
AR | Ours | 227M | 10.13 | 94.5
AR | VQVAE-2∗ [38] | 13.5B | 31.11 | ∼45
AR | VQGAN [12] | 1.4B | 15.78 | 74.3
AR | RQ-Transformer [26] | 821M | 13.11 | 104.3
AR | ViT-VQGAN [62] | 650M | 8.81 | 110.8
AR | Ours | 650M | 7.45 | 146.7

Table 4: Comparisons of class-conditional generation results on UCF-101 and frame prediction results on Kinetics-600. Fréchet video distance (FVD↓) is reported.
Type | Method | #Param | UCF | K600
NAR | Phenaki [53] | 227M | - | 36.4
NAR | MAGVIT [64] | 306M | 76 | 9.9
NAR | MAGVITv2 [65] | 307M | 58 | 4.3
AR | LVT [36] | 50M | - | 224.7
AR | ViTrans [59] | 373M | - | 170.0
AR | CogVideo [19] | 9.4B | 626 | 109.2
AR | ViVQVAE [54] | NA | - | 64.3
AR | TATS [13] | 321M | 332 | -
AR | Ours | 227M | 314 | 34.2
AR | Ours | 650M | 191 | 32.9
Moments-in-Time (MiT) [29], and Something-Something v2 (SSV2) [15]. We adopt a subset of the
above datasets for visual generation to compare with previous works [12, 62, 53, 13].
Implementation Details. OmniTokenizer adopts a decoupled spatial-temporal architecture consisting
of 4 window attention-based spatial layers (window size = 8) and 4 causal attention-based temporal
layers. The hidden dimension is 512 and the latent dimension is 8, following ViT-VQGAN [62].
λ1 , λ2 , and λ3 are set to 1, 1, 1e-6, respectively. As mentioned in Sec. 3.1.2, the training of
OmniTokenizer follows a progressive training strategy, where both stages last 500K iterations. The
learning rate is warmed up to 1e-3 and decayed to 0 using a cosine scheduler. Adam [24] is employed
for optimization (β1 = 0.9 and β2 = 0.99). During the image training stage, we train the model with a
fixed image resolution of 256×256. For the joint training stage, we feed the model image and video data alternately, with a video sequence length of 17 frames. The spatial resolutions
are randomly chosen from 128, 192, 256, 320, and 384. Only random horizontal flip is adopted for
data augmentation. We train our model using 8 NVIDIA A100 GPUs for 2 weeks. Unless otherwise
stated, the results reported in this paper are jointly trained on ImageNet and UCF-101.
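For reference, the recipe above corresponds roughly to the following configuration; the key names and scheduler wiring are illustrative assumptions (the warmup length, in particular, is not specified in the text):

# Hypothetical configuration mirroring the training recipe described above; key names are illustrative.
train_config = dict(
    architecture=dict(spatial_layers=4, temporal_layers=4, window_size=8,
                      hidden_dim=512, latent_dim=8),
    loss_weights=dict(lambda1=1.0, lambda2=1.0, lambda3=1e-6),
    optimizer=dict(name="adam", betas=(0.9, 0.99), peak_lr=1e-3,
                   schedule="warmup then cosine decay to 0"),
    stage1=dict(data="image", resolution=256, iterations=500_000),
    stage2=dict(data="image + video", video_frames=17,
                resolutions=[128, 192, 256, 320, 384], iterations=500_000),
    augmentation=["random horizontal flip"],
)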
We evaluate both language models and diffusion models for visual generation with OmniTokenizer as the tokenizer. The configuration for the language model follows VQGAN [12], and for a fair comparison
with previous methods, we also scale up the model size by increasing the hidden dimension to
1535, following ViT-VQGAN [62]. The training of image and video diffusion transformers follows
DiT [32] and Latte [28], respectively.
Table 5: Class-conditional results on ImageNet 256×256 using GAN and diffusion models.
Method | FID↓ | IS↑ | Prec↑ | Rec↑
BigGAN [4] | 6.95 | 171.4 | 0.87 | 0.28
StyleGAN-XL [40] | 2.30 | 265.12 | 0.78 | 0.53
ADM [10] | 10.94 | 100.98 | 0.69 | 0.63
LDM-4 | 10.56 | 103.49 | 0.71 | 0.62
CDM [18] | 4.88 | 158.71 | - | -
DiT-XL/2 [32] | 9.62 | 121.50 | 0.67 | 0.67
DiT-XL/2-CFG [32] | 2.27 | 278.24 | 0.83 | 0.57
Ours-DiT-XL/2 | 12.25 | 109.94 | 0.73 | 0.64
Ours-DiT-XL/2-CFG | 3.48 | 244.23 | 0.89 | 0.52

Table 6: Comparisons of unconditional results on UCF-101 256×256 using GAN and diffusion models.
Method | Lat. Comp. | FVD↓
MoCoGAN [51] | - | 2886.9
VideoGPT [61] | 4×4×4 | 2880.6
MoCoGAN-HD [48] | - | 1729.6
DIGAN [67] | - | 1630.2
StyleGAN-V [42] | - | 1431.0
PVDM [66] | 1×4×4 | 1141.9
MoStGAN-V [41] | - | 1380.3
Latte [28] | 1×8×8 | 478.0
Ours-Latte | 4×8×8 | 209.2
model surpasses existing autoregressive image generation methods by significant margins. Remarkably, with a model comprising only 227M parameters, we achieve 10.13 FID and 94.5 IS,
outperforming VQGAN [12] by 32% and 25%, respectively. Upon scaling up to a larger model with
650M parameters, the FID is further reduced to 7.45.
In the domain of video generation, as illustrated in Table 4, our model beats the previous state-of-the-art autoregressive model, TATS [13], for class-conditional video generation on UCF-101 with a much lower FVD (283 vs. 314). Moreover, for frame prediction tasks on the Kinetics-600 dataset,
our model not only achieves the best performance compared to other autoregressive models but also
surpasses Phenaki [53], a non-autoregressive method.
Table 7: Comparison of rFID on ImageNet and rFVD on various video datasets.
Method | ImageNet 256 | K600 128 | K600 256 | UCF 128 | UCF 256 | MiT 128 | MiT 256 | SSV2 128 | SSV2 256
1 Ours-Image (Fix) | 1.28 | - | - | - | - | - | - | - | -
2 Ours-Image (Multi) | 1.44 | - | - | - | - | - | - | - | -
3 Ours-Video (Fix) | - | 211.51 | 48.89 | 214.83 | 118.52 | 211.07 | 64.47 | 162.53 | 22.82
4 Ours-Video (Multi) | - | 194.51 | 54.89 | 211.83 | 114.52 | 238.07 | 26.47 | 193.35 | 38.82
5 Ours-Joint (Fix) | 1.35 | 113.51 | 26.89 | 186.83 | 62.52 | 140.07 | 21.47 | 108.35 | 20.82
6 Ours-Joint (Multi) | 1.11 | 84.38 | 25.97 | 107.80 | 42.35 | 59.47 | 19.87 | 84.78 | 20.30
Latent Dimension and Compression Rate. Figure 3 shows the reconstruction FID with different
compression rates and latent dimensions. We can observe that increasing the compression rate always
hurts the reconstruction performance since more information is lost during the encoding process.
Moreover, latent dimension = 8 leads to the best trade-off between rFID and codebook usage.
4.5 Visualizations
Visual Reconstruction. We visualize the reconstruction results of OmniTokenizer, VQGAN [12], and TATS [13] in Figure 4. Our method performs significantly better than the baselines on face and text reconstruction, which are typically regarded as the most challenging cases.
Figure 4: Image and video reconstruction results of VQGAN [12], TATS [13], and our method.
Class-conditional Image and Video Generation. The class-conditional generation results are shown
in Figure 5-8. Our model could synthesize visually coherent and contextually accurate images and
videos, showcasing the strengths of OmniTokenizer in facilitating generative tasks.
Figure 5: Class-conditional ImageNet generation results using language models, with OmniTokenizer-
VQVAE as tokenizer.
Figure 8: Unconditional UCF-101 generation using diffusion models (and OmniTokenizer-VAE).
Figure 9: Visualization of the frame prediction results by OmniTokenizer. The frames marked in red
are given during inference, while the following frames are generated.
This paper presented OmniTokenizer, a transformer-based tokenizer for joint image-video tokenization. OmniTokenizer adopts a spatial-temporal decoupled architecture, employing window and causal attention in the spatial and temporal dimensions, respectively. To realize the synergy between image and video data, we proposed a progressive training strategy that starts with image training at a fixed resolution to acquire spatial encoding capability and then incorporates video data for multi-
resolution joint training to learn temporal modeling. Extensive experimental results substantiate
the state-of-the-art performance of OmniTokenizer in visual reconstruction tasks. Further, when
equipped with OmniTokenizer, both language model-based methods and diffusion models could
achieve superior visual generation results.
Previous literature [20, 16, 68, 46, 45] has revealed that the performance of transformer models improves significantly as the model size increases, a phenomenon known as the scaling law. In the future, we will explore scaling up the model capacity of OmniTokenizer for more advanced tokenization performance.
References
[1] S. AI. Stable diffusion v1-4. https://huggingface.co/CompVis/stable-diffusion-v1-4, 2022.
[2] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video
understanding? In ICML, 2021.
[3] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English,
V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large
datasets. arXiv preprint arXiv:2311.15127, 2023.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image
synthesis. In ICLR, 2019.
[5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end
object detection with transformers. In ECCV, 2020.
[6] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about
kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
[7] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image
transformer. In CVPR, 2022.
[8] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever. Generative pretraining
from pixels. In ICML, 2020.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical
image database. In CVPR, 2009.
[10] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16
words: Transformers for image recognition at scale. In ICLR, 2021.
[12] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis.
In CVPR, 2021.
[13] S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video
generation with time-agnostic vqgan and time-sensitive transformer. In ECCV, 2022.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio. Generative adversarial networks. Communications of the ACM, 2020.
[15] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel,
I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for
learning and evaluating visual common sense. In ICCV, 2017.
[16] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhari-
wal, S. Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint
arXiv:2010.14701, 2020.
[17] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[18] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion
models for high fidelity image generation. JMLR, 2022.
[19] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for
text-to-video generation via transformers. In ICLR, 2023.
[20] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad-
ford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361, 2020.
[21] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality,
stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[22] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial
networks. In CVPR, 2019.
[23] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola,
T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint
arXiv:1705.06950, 2017.
[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
[25] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
[26] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han. Autoregressive image generation using
residual quantization. In CVPR, 2022.
[27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer:
Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[28] X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion
transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
[29] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan,
D. Gutfreund, C. Vondrick, et al. Moments in time dataset: one million videos for event
understanding. TPAMI, 2019.
[30] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and
M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion
models. PMLR, 2022.
[31] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In ICML,
2021.
[32] W. Peebles and S. Xie. Scalable diffusion models with transformers. In CVPR, 2023.
[33] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach.
Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint
arXiv:2307.01952, 2023.
[34] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding
by generative pre-training. OpenAI Blog, 2018.
[35] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI Blog, 2019.
[36] R. Rakhimov, D. Volkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer.
arXiv preprint arXiv:2006.10704, 2020.
[37] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever.
Zero-shot text-to-image generation. In ICML, 2021.
[38] A. Razavi, A. Van den Oord, and O. Vinyals. Generating diverse high-fidelity images with
vq-vae-2. In NeurIPS, 2019.
[39] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis
with latent diffusion models. In CVPR, 2022.
[40] A. Sauer, K. Schwarz, and A. Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In
ACM SIGGRAPH, 2022.
[41] X. Shen, X. Li, and M. Elhoseiny. Mostgan-v: Video generation with temporal motion styles.
In CVPR, 2023.
[42] I. Skorokhodov, S. Tulyakov, and M. Elhoseiny. Stylegan-v: A continuous video generator with
the price, image quality and perks of stylegan2. In CVPR, 2022.
[43] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In ICLR, 2021.
[44] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from
videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[45] P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats
diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
[46] K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable
image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
[47] R. Tian, Z. Wu, Q. Dai, H. Hu, Y. Qiao, and Y.-G. Jiang. Resformer: Scaling vits with
multi-resolution training. In CVPR, 2023.
[48] Y. Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov. A good
image generator is what you need for high-resolution video synthesis. In ICLR, 2021.
[49] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971, 2023.
[50] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv
preprint arXiv:2307.09288, 2023.
[51] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for
video generation. In CVPR, 2018.
[52] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
[53] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro,
J. Kunze, and D. Erhan. Phenaki: Variable length video generation from open domain textual
descriptions. In ICLR, 2022.
[54] J. Walker, A. Razavi, and A. v. d. Oord. Predicting video with vqvae. arXiv preprint
arXiv:2103.01950, 2021.
[55] J. Wang, D. Chen, C. Luo, B. He, L. Yuan, Z. Wu, and Y.-G. Jiang. Omnivid: A generative
framework for universal video understanding. In CVPR, 2024.
[56] J. Wang, D. Chen, Z. Wu, C. Luo, L. Zhou, Y. Zhao, Y. Xie, C. Liu, Y.-G. Jiang, and L. Yuan.
Omnivl: One foundation model for image-language and video-language tasks. NeurIPS, 2022.
[57] J. Wang, Z. Wu, J. Chen, X. Han, A. Shrivastava, S.-N. Lim, and Y.-G. Jiang. Objectformer for
image manipulation detection and localization. In CVPR, 2022.
[58] R. Wang, D. Chen, Z. Wu, Y. Chen, X. Dai, M. Liu, Y.-G. Jiang, L. Zhou, and L. Yuan. Bevt:
Bert pretraining of video transformers. In CVPR, 2022.
[59] D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. In ICLR,
2020.
[60] Z. Xing, Q. Dai, H. Hu, Z. Wu, and Y.-G. Jiang. Simda: Simple diffusion adapter for efficient
video generation. In CVPR, 2024.
[61] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and
transformers. arXiv preprint arXiv:2104.10157, 2021.
[62] J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu.
Vector-quantized image modeling with improved vqgan. In ICLR, 2022.
[63] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan,
et al. Scaling autoregressive models for content-rich text-to-image generation. In ICLR, 2024.
[64] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang,
Y. Hao, I. Essa, et al. Magvit: Masked generative video transformer. In CVPR, 2023.
[65] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta,
X. Gu, A. G. Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual
generation. In ICLR, 2024.
[66] S. Yu, K. Sohn, S. Kim, and J. Shin. Video probabilistic diffusion models in projected latent
space. In CVPR, 2023.
[67] S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J.-W. Ha, and J. Shin. Generating videos with dynamics-
aware implicit generative adversarial networks. In ICLR, 2022.
[68] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In CVPR, 2022.
[69] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion
models. In CVPR, 2023.