Vision-Language Pre-Training
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao
Microsoft Corporation
{zhgan,linjli,chunyl,lijuanw,zliu,jfgao}@microsoft.com
Abstract
This paper surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these ap-
proaches into three categories: (i) VLP for image-text tasks, such as image cap-
tioning, image-text retrieval, visual question answering, and visual grounding; (ii)
VLP for core computer vision tasks, such as (open-set) image classification, ob-
ject detection, and segmentation; and (iii) VLP for video-text tasks, such as video
captioning, video-text retrieval, and video question answering. For each cate-
gory, we present a comprehensive review of state-of-the-art methods, and discuss
the progress that has been made and challenges still being faced, using specific
systems and models as case studies. In addition, for each category, we discuss
advanced topics being actively explored in the research community, such as big
foundation models, unified modeling, in-context few-shot learning, knowledge,
robustness, and computer vision in the wild, to name a few.
♠ Zhe Gan and Jianfeng Gao initiated the project. Zhe Gan and Linjie Li took lead in the writing of
Chapter 1. Linjie Li and Jianfeng Gao took lead in the writing of Chapter 2. Zhe Gan further took lead in the
writing of Chapter 3 and 7. Chunyuan Li took lead in the writing of Chapter 4. Linjie Li further took lead in
the writing of Chapter 5. Lijuan Wang and Zicheng Liu took lead in the writing of Chapter 6. All the authors
provided project advice, and contributed to paper editing and proofreading.
Contents

1 Introduction
1.1 Who Should Read this Paper?
1.2 Vision-and-Language: What Kinds of Problems?
1.3 The Transition From Task-Specific Methods to Large-Scale Pre-training
1.4 What is a Good VLP Model From an Overall Perspective?
1.5 Related Materials: Slide Decks and Pre-recorded Talks
...
3.5.5 Robustness and Probing Analysis
3.5.6 VL for Language, Model Compression, Multilingual VLP, and Beyond
3.6 Text-to-Image Generation
3.6.1 VQ-token-based Auto-regressive Methods
3.6.2 Diffusion-based Methods
...
6 VL Systems in Industry
6.1 VL in Commercial Systems
6.2 Issues in VL Model Deployment
Chapter 1
Introduction
Humans perceive the world through many channels, such as images viewed by the eyes, or voices
heard by the ears. Though any individual channel might be incomplete or noisy, humans can natu-
rally align and fuse information collected from multiple channels in order to grasp the key concepts
needed for a better understanding of the world.
One of the core aspirations in AI is to develop algorithms that endow computers with an ability
to effectively learn from multimodal (or, multi-channel) data, akin to the sights and sounds that
humans obtain from vision and language to make sense of the world around them. For example,
computers could mimic this ability by searching for the images most relevant to a text query
(or vice versa), and by describing the content of an image using natural language.
Vision-and-Language (VL), a popular research area that sits at the nexus of Computer Vision and
Natural Language Processing (NLP), aims to achieve this goal. Inspired by the great success of
language model pre-training in NLP (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019d),
T5 (Raffel et al., 2020), and GPT-3 (Brown et al., 2020)), Vision-Language Pre-training (VLP)
has recently attracted rapidly growing attention from both communities. With the promise of learning
universal, transferable visual and vision-language representations, VLP has become an increasingly
central training paradigm for modern VL research.
Recently, there have been several related survey papers on VLP. Zhang et al. (2020a) focused on task-specific
VL methods before the era of pre-training, and provided a concise discussion of VLP models. Du
et al. (2022) and Li et al. (2022e) focused on VLP, but mainly on image-text tasks, without touching on
video-text tasks. Ruan and Jin (2022) focused on VLP for video-text tasks. Chen et al. (2022a)
reviewed VLP methods for both image-text and video-text tasks, but their discussion is not in depth.
The contributions of this survey paper are summarized as follows.
• We provide a comprehensive survey on modern VLP, not only covering its successful applications
to traditional image-text and video-text tasks (e.g., image/video captioning, retrieval, and ques-
tion answering), but also showing its great potential for core computer vision tasks (e.g., image
classification, object detection and segmentation).
• We provide in-depth discussions on advanced topics at the frontier of VLP, ranging from big
foundation models, unified modeling, in-context few-shot learning, knowledge-enhanced VLP,
multilingual VLP, model robustness, model compression, to computer vision in the wild.
• We picture the landscape of VL systems developed in research communities and released to the public,
demonstrating via case studies the progress we have made and the challenges we are facing.
[Figure 1.1 appears here: an overview of the structure of the core chapters, ranging from early task-specific methods (e.g., VQA, image captioning with simple fusion) to advanced topics (e.g., efficient adaptation and multilingual VLP).]
This paper presents the ideas and insights needed to understand modern VLP methods, and serves as a valuable resource for
students, researchers, engineers, and practitioners who are interested in large-scale pre-training for
VL representation learning and its applications in computer vision and multimodal tasks. The paper
is structured as follows.
• Chapter 1 introduces the landscape of VL research, and presents a historical view on the transition
of VL research from task-specific methods to large-scale pre-training.
• Chapter 2 introduces early task-specific VL methods for visual question answering, image caption-
ing, and image-text retrieval, which serve as the foundation to understand modern VLP methods.
• Chapter 3 describes VLP methods for image-text tasks, such as image captioning, image-text
retrieval, visual question answering, and visual grounding.
• Chapter 4 describes VLP methods for core computer vision tasks, including (open-vocabulary)
image classification, object detection and segmentation.
• Chapter 5 describes VLP methods for video-text tasks, such as video captioning, video-text re-
trieval, and video question answering.
• Chapter 6 briefly reviews VL systems developed in industry and the challenges to deploy these
VL systems in real-world settings.
• Chapter 7 concludes the paper and discusses research trends.
Relations between core chapters. Chapters 2-5 are the core chapters of this survey paper. An
overview of the structure of these chapters is provided in Figure 1.1. As the wave of VLP started
with image-text tasks, we first provide a comprehensive review of the transition from early task-
specific methods (Chapter 2) to the most recent VLP methods (Chapter 3) with image-text inputs. In
Chapter 4, we discuss how core computer vision tasks can be viewed as image-text tasks with
open-vocabulary predictions, when powered by contrastively pre-trained image-text models (such
as CLIP (Radford et al., 2021)), and further enable computer vision in the wild (Li et al., 2022b).
Extending image-text tasks to more modalities, we present how VLP methods can serve more appli-
cations with video-text inputs in Chapter 5.
How to read the paper. Different readers have different backgrounds, and may read this paper
with different purposes. Here, we provide some guidance.
• Each chapter is mostly self-contained. If you have a clear goal and a clear research direction
that you want to focus on, then just jump to the corresponding chapter. For example, if you are
interested in video-language pre-training, then you can directly jump to Chapter 5.
• If you are a beginner in the VLP field, and are interested in getting a glimpse of the cutting-edge
research on VLP, we highly suggest reading the whole paper chapter by chapter, as the
paper provides a comprehensive literature review that helps you understand the VLP landscape.
• If you already have rich experience in VLP and are very familiar with the literature, feel free to
jump to specific chapters you want to read. In particular, we include in each chapter a dedicated
section to discuss advanced topics. For example, in Section 3.5, we have discussed big foundation
models, unified image-text modeling, in-context few-shot learning, knowledge, robustness and
probing analysis, etc.
Figure 1.2: Illustration of representative tasks from three categories of VL problems covered in this
paper: image-text tasks, vision tasks as VL problems, and video-text tasks.
– VQA and visual reasoning. As extensions to visual question answering, researchers have de-
veloped datasets for visual reasoning (Hudson and Manning, 2019b; Suhr et al., 2019), visual
commonsense reasoning (Zellers et al., 2019), visual dialog (Das et al., 2017), knowledge-based
VQA (Marino et al., 2019), scene-text-based VQA (Singh et al., 2019), etc. The answers required
in these tasks can be open-ended free-form texts, or selected from multiple choices.
– Image captioning. In addition to the setting where short single-sentence generation is re-
quired (Lin et al., 2014), researchers have also developed datasets for image paragraph cap-
tioning (Krause et al., 2017), scene-text-based image captioning (Sidorov et al., 2020), visual
storytelling (Huang et al., 2016), and so on.
– Image-text retrieval. Popular image-text retrieval datasets are based on image captioning
datasets (Chen et al., 2015; Plummer et al., 2015). AI models are required to retrieve the most
relevant text (or image) from a large corpus, given the image (or text) query.
– Visual grounding. Instead of text outputs, referring expression comprehension and phrase
grounding (Yu et al., 2016; Plummer et al., 2015) require bounding box outputs, where the
model needs to predict the bounding box corresponding to the input text query.
– Text-to-image generation. This can be considered the dual task of image captioning, where
the system is required to create a high-fidelity image based on the text input. A brief discussion
on this task is provided in Section 3.6.
• Computer Vision Tasks as VL Problems. Image classification, object detection, and segmen-
tation (highlighted with pink in Figure 1.2) are core visual recognition tasks in computer vision.
Traditionally, these tasks are considered pure vision problems. With the advent of CLIP (Radford
et al., 2021) and ALIGN (Jia et al., 2021), researchers have realized that language supervision can
play an important role in computer vision tasks. First, the use of noisy image-text data crawled
from the web allows large-scale pre-training of vision encoders from scratch. Second, instead of treat-
ing the supervision signals (e.g., class labels) as one-hot vectors, we take the semantic meaning
behind the labels into consideration and cast these computer vision tasks as VL problems. This
perspective generalizes the traditional closed-set classification or detection models to recognizing
unseen concepts in real-world applications, such as open-vocabulary object detection.
• Video-Text Tasks. Besides static images, videos are another important type of visual modality.
Naturally, all aforementioned image-text tasks have their video-text counterparts, such as video
captioning, retrieval, and question answering (highlighted with green in Figure 1.2). The unique-
ness of video inputs, in comparison to images, requires an AI system to not only capture spatial
information within a single video frame, but also capture the inherent temporal dependencies
among video frames.
[Figure 1.3 plots VQAv2 test-std accuracy (roughly 65 to 85) against time (2017/8 to 2022/8) for representative models, from task-specific models such as BUTD, Counter, Pythia, BAN, ReGAT and MCAN, to VLP models such as VisualBERT, ViLBERT, VL-BERT, LXMERT, OSCAR, UNITER, ERNIE-ViL, VILLA, ALBEF, VinVL, SimVLM, Florence, OFA, VLMo, mPLUG, GIT2, Flamingo, CoCa, BEiT-3 and PaLI.]
Figure 1.3: The transition from task-specific methods to large-scale pre-training, using the VQA
task as a case study. Each time there was a transition, we observe a big performance lift,
e.g., from MCAN (Yu et al., 2019c) to UNITER (Chen et al., 2020d), and from ALBEF (Li et al.,
2021a) to SimVLM (Wang et al., 2022k). Methods before August 2017 are not drawn; only some
representative VLP works are shown to avoid making the figure too crowded.
While this paper provides a comprehensive survey of VLP, some of the important VL topics are
not discussed. For example, Vision-Language Navigation (VLN) (Anderson et al., 2018b), another
emerging topic at the intersection of VL research and embodied AI, is not covered in this paper.
1.4 What is a Good VLP Model From an Overall Perspective?
While VLP is an emerging field with many exciting new papers appearing, it remains unclear what
north star we are pursuing as a community. We provide our perspective on this direction. We
believe a good VLP model should:
• Achieve good performance on a wide range of downstream tasks. The task coverage can be
considered at two levels of granularity. First, the problem types are broad: for example, one model
can perform image-text tasks such as VQA, image captioning and text-to-image generation
(Chapter 3), core computer vision tasks such as image classification, object detection and segmentation
(Chapter 4), and video-text tasks such as video QA and captioning (Chapter 5). Second, for
each problem type, there is a broad coverage of datasets that represent different use scenarios. For
example, Li et al. (2022b) present 20 image classification datasets and 35 object detection datasets
to illustrate various scenarios in the wild.
• Adapt to new tasks with minimal cost. The adaptation cost needs to be low when deploying a
VLP model to a new task. Various efficiency metrics can be considered to measure the adaptation
cost, including inference speed, GPU usage for further model weight update, the number of train-
ing samples, and the number of trainable parameters. This area is not yet well defined, and there
have been some early efforts. For example, Li et al. (2022b) provide a definition by decomposing
the adaptation cost into sample-efficiency and parameter-efficiency.
To summarize, the north star of a good VLP model is a single unified model with fixed model weights
(or, with inexpensive finetuning) that performs well on all the tasks above. This is an ambitious goal
that the community is collectively working towards. Developing a central benchmark is itself an
open research problem. We advocate for considering the following factors when benchmarking
VLP models: the coverage of tasks, the performance on these tasks, and the cost of adaptation.
1.5 Related Materials: Slide Decks and Pre-recorded Talks
• Chapter 2:
– CVPR 2020 Tutorial: VQA and visual reasoning (YouTube, Bilibili)
– CVPR 2020 Tutorial: Image captioning (YouTube, Bilibili)
• Chapter 3:
– CVPR 2022 Tutorial: Overview of Image-Text Pre-training (YouTube, Bilibili)
– CVPR 2022 Tutorial: Unified Image-Text Modeling (YouTube, Bilibili)
– CVPR 2022 Tutorial: Advanced Topics in Image-Text Pre-training (YouTube, Bilibili)
– CVPR 2021 Tutorial: Representations and Training Strategies for VLP (YouTube)
– CVPR 2021 Tutorial: Robustness, Efficiency and Extensions for VLP (YouTube)
– CVPR 2020 Tutorial: Self-supervised Image-Text Learning (YouTube, Bilibili)
• Chapter 4:
– CVPR 2022 Tutorial: VLP for Image Classification (YouTube, Bilibili)
– CVPR 2022 Tutorial: VLP for Object Detection (YouTube, Bilibili)
– CVPR 2022 Tutorial: Benchmarks for Computer Vision in the Wild (YouTube, Bilibili)
• Chapter 5:
– CVPR 2022 Tutorial: Overview of Video-Text Pre-training (YouTube, Bilibili)
– CVPR 2022 Tutorial: Learning from Multi-channel Videos: Methods and Bench-
marks (YouTube, Bilibili)
– CVPR 2022 Tutorial: Advanced Topics in Video-Text Pre-training (YouTube, Bilibili)
– CVPR 2021 Tutorial: Video-and-Language Pre-training (YouTube)
Chapter 2
In Section 2.1, we first introduce major vision-language (VL) tasks and the benchmarks that are
commonly used in the research community. We group these tasks into two categories. VL under-
standing tasks, such as image-text retrieval and visual question answering (VQA), require a VL
model to select the output from a given list of candidates. VL generation tasks, such as image cap-
tioning, require a VL model to generate the output. In Section 2.2, we take VQA as an example to
present the VL models developed prior to the era of large-scale VLP. Early VL models typically take
a pipeline approach. First, the image features are extracted by a pre-trained visual encoder. The tex-
tual features are computed using a text encoder. Then, the cross-modal representations are obtained,
by performing multimodal fusion on top of these features, for the final prediction. One of the major
research focuses is on the attention design for multimodal fusion, which we use to categorize these
models and to reflect how task-specific models evolve over time. We show that early VL models
eventually evolve into a Transformer-based architecture (e.g., MCAN (Yu et al., 2019c)), which is
similar to some early VLP models (e.g., LXMERT (Tan and Bansal, 2019) and ViLBERT (Lu et al.,
2019)), as to be discussed in detail in Chapter 3. In Section 2.3, we review additional research topics
for the development of early VL models, including bilinear pooling, compositional visual reasoning,
and visual grounding.
Image-text retrieval can be categorized into two sub-tasks: (i) text-to-image retrieval,
which retrieves a relevant image given an input text query (illustrated in Figure 2.1), and (ii) image-
to-text retrieval, which retrieves a textual description that can be grounded in the image query.
Figure 2.1: Illustration of representative vision-language tasks with image-text inputs: (i) image-
text retrieval; (ii) visual question answering and visual reasoning; and (iii) image captioning with a
single-sentence caption, or a more descriptive paragraph of captions.
In both cases, the model needs to match the query to its relevant instances from a relatively large
database (e.g., 1000-5000 images for a typical text-to-image retrieval task). Recall@K (K=1, 5,
10) is used as the evaluation metric. Popular datasets include COCO (Chen et al., 2015) and
Flickr30K (Plummer et al., 2015). Sun et al. (2021) propose to combine the training, validation and
test sets of each dataset to form a larger candidate pool that can mimic a real-world text-to-image
retrieval scenario which usually involves hundreds of thousands of images, and evaluate models in
terms of both retrieval accuracy and inference speed.
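To make the metric concrete, the following is a small Python sketch of Recall@K computed from a query-candidate similarity matrix; the toy scores and the assumption of a single ground-truth item per query are illustrative only (COCO, for instance, has five captions per image).

```python
import numpy as np

def recall_at_k(sim, gt_index, ks=(1, 5, 10)):
    """Recall@K for text-to-image retrieval: sim is a (num_queries, num_images)
    similarity matrix and gt_index[i] is the index of the ground-truth image for
    query i (illustrative; real benchmarks may allow multiple positives)."""
    ranks = np.argsort(-sim, axis=1)                                # candidates sorted by score
    hit_rank = (ranks == np.asarray(gt_index)[:, None]).argmax(axis=1)  # rank of the ground truth
    return {k: float((hit_rank < k).mean()) for k in ks}

# Example: 3 text queries scored against 4 candidate images.
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.8, 0.1, 0.7],
                [0.1, 0.6, 0.2, 0.5]])
print(recall_at_k(sim, gt_index=[0, 3, 2]))   # -> R@1 ~ 0.33, R@5 = R@10 = 1.0
```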
Visual Question Answering (VQA) (Antol et al., 2015) is one of the most prominent VL tasks stud-
ied in the research community. Given an image-question pair, VQA requires the model to provide
a correct answer to the question based on the image. There are two typical settings: (i) multiple-
choice, where a small set of answer choices (e.g., 4/5 answer choices) are provided, together with the
image-question pair; and (ii) open-ended, where the answer can be free-form that is not limited to
any pre-defined answer candidates. However, to simplify the VQA tasks, most studies (Antol et al.,
2015; Anderson et al., 2018a; Yu et al., 2019c) treat both multiple-choice and open-ended VQA as
classification problems. Specifically, the most frequent answers from the training set are selected to
build an answer candidate set under the open-ended setting. For example, the second version of the VQA
dataset, dubbed VQAv2 (Goyal et al., 2017b), contains approximately 3,000 answers, which can be
used to form the list of candidates for all questions. As the VQA dataset contains 10 ground-truth
answers per image-question pair, VQA score (Antol et al., 2015) is used to evaluate model per-
formance. VQA score is defined as follows, considering the consensus among human annotators.
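Concretely, for a candidate answer $a$ to a given question, the score from Antol et al. (2015) is

$\text{Acc}(a) = \min\left(\frac{\#\{\text{humans that provided answer } a\}}{3},\; 1\right),$

i.e., an answer is considered fully correct if at least three of the ten annotators gave it (in practice, the score is averaged over subsets of annotators).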
Figure 2.2: Illustration of a general framework for task-specific VQA models. In most cases, image
features are extracted offline, with no gradient update to the visual encoder during model training.
Image captioning is to generate a free-form textual caption for a given image. Captioning perfor-
mance is usually evaluated on standard text generation metrics based on n-gram overlap, such as
BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004) and
CIDEr (Vedantam et al., 2015). In addition, semantic content matching metrics, such as SPICE (An-
derson et al., 2016), are used to measure the similarity between model-generated text and references
by extracting explicit semantic information units from text beyond n-grams.
As shown in Figure 2.1, two kinds of captions are proposed for the image captioning task. Pop-
ular datasets, mostly designed with single-sentence captions, include COCO (Chen et al., 2015),
TextCaps (Sidorov et al., 2020), NoCaps (Agrawal et al., 2019) and VizWiz-Captions (Gurari et al.,
2020). There have been fewer efforts (Krause et al., 2017) on building datasets with more descriptive,
multi-sentence captions. On the modeling side, most work (Farhadi et al., 2010; Kulkarni et al.,
2013; Fang et al., 2015; Anderson et al., 2018a) focuses on the single-sentence captioning task.
Overview. Given an image-question pair, a VQA model first extracts visual features $v = \{v_1, \cdots, v_M\}$ via a visual encoder, and encodes the question input via a text encoder into text features $w = \{w_1, \cdots, w_N\}$. Here, $N$ can be the number of words in the question, or $N = 1$ if a global textual representation is computed for the question. $M$ is the number of visual features for an image, which can be the number of image regions (e.g., $M \in [10, 100]$), or the number of grids (e.g., $M = 14 \times 14$), depending on the specific vision encoder being used. Likewise, $M = 1$ when a
global image representation is extracted. The text and visual features are then fed into a multimodal
fusion module to produce cross-modal representations, which are then fed into a task-specific output
layer (e.g., a classifier for the VQA task) to predict the answer. An illustration of this framework is
shown in Figure 2.2.
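To make this generic framework concrete, the following is a minimal PyTorch sketch of the pipeline in Figure 2.2. The feature dimensions, the 3,000-way answer classifier, and the simple fusion-by-concatenation module are illustrative choices, not any specific published model; image features are assumed to be pre-extracted offline.

```python
import torch
import torch.nn as nn

class TaskSpecificVQAModel(nn.Module):
    """Illustrative sketch of the generic VQA pipeline in Figure 2.2."""
    def __init__(self, d_v=2048, d_t=512, d_h=1024, num_answers=3000):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(d_v + d_t, d_h), nn.ReLU())  # multimodal fusion module
        self.classifier = nn.Linear(d_h, num_answers)                    # task-specific output layer

    def forward(self, v, w):
        # v: (batch, M, d_v) pre-extracted region/grid features; w: (batch, d_t) global question feature
        v_pooled = v.mean(dim=1)                               # simple fusion: average regions ...
        fused = self.fuse(torch.cat([v_pooled, w], dim=-1))    # ... then concatenate and project
        return self.classifier(fused)                          # logits over the answer candidates
```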
Visual Encoder. Most early VL methods (Antol et al., 2015; Anderson et al., 2018a; Yu et al.,
2019c) adopt a two-stage training pipeline, where visual features are first extracted from a pre-
trained visual encoder. There are two types of visual encoders: (i) a plain convolutional neural
network (CNN), and (ii) an object detector (OD).
• CNN. Inspired by the success of CNN on image classification, early methods adopt CNN
models (e.g., VGGNet (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al., 2012),
GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2016)) pre-trained on ImageNet (Deng
et al., 2009) to extract visual features. The very first VQA model (Antol et al., 2015) experiments
with global visual features from the last fully connected layer of VGGNet, which has been inher-
ited by the immediate follow-up works (Gao et al., 2015; Ren et al., 2015a; Ma et al., 2016). To
retain spatial information in the original images, researchers (Yang et al., 2016; Zhu et al., 2016;
Andreas et al., 2016b; Jabri et al., 2016) use grid features from earlier layers of pre-trained CNN
models. Grid features represent the input image by a uniform grid of equally sized and shaped
neural receptive fields, hence contain more local information than the holistic entire-image repre-
sentation captured by the global visual features.
• OD. In contrast to the uniform grids, object detectors produce a set of salient image regions of
varying size and aspect ratio. Region features are the pooled convolutional features extracted per
region proposal. Shih et al. (2016) is the first work to exploit region features for VQA, where
the regions are located using edges (Zitnick and Dollár, 2014). The most widely used OD model
for VL research is a Faster R-CNN (Ren et al., 2015b) pre-trained on the Visual Genome (VG)
dataset (Krishna et al., 2017c) from BUTD (Anderson et al., 2018a).
Discussion: from grids to regions, and back again. As discussed above, early explorations in
VQA models (Gao et al., 2015; Yang et al., 2016; Jabri et al., 2016) have witnessed the transition
from holistic global visual features to grid features with a CNN visual encoder. Popularized by
regional bottom-up features (Anderson et al., 2018a), OD models soon came to dominate the design
of the visual encoder. Region features have become the de facto standard for VL tasks like VQA and
image captioning in many follow-up works (Teney et al., 2018; Gao et al., 2019b; Li et al., 2019d;
Yu et al., 2019c). However, Jiang et al. (2020) argue that compared to the “format” of features
(i.e., region vs. grids), the semantic content that visual features represent is more critical for their
effectiveness. Grid features, extracted from the CNN backbone of an OD model trained on the same
data as bottom-up features, can be equally performant, but with better efficiency, and can be more
easily end-to-end finetuned than region features.
Text Encoder. The input question is first tokenized into a sequence of words, and then encoded
via a text encoder. Depending on how we view textual input, different neural models can be used
for text encoding.
• Bag-of-Words (BoW). BoW-based methods (e.g., Antol et al., 2015; Yu et al., 2015; Jabri et al.,
2016; Shih et al., 2016) independently encode each word in the input question, without considering
dependencies between neighboring words. The sum or average of word embeddings (learned
from scratch or extracted from the pre-trained word2vec (Mikolov et al., 2013a)) are taken as the
representation of the input question.
• Recurrent Neural Networks (RNN). RNN-based methods (e.g., Ren et al., 2015a; Malinowski
et al., 2015; Fukui et al., 2016; Anderson et al., 2018a; Teney et al., 2018) intend to cap-
ture word dependencies and text structures. The input words are one-hot encoded and passed
through a word embedding layer (e.g., learned from scratch or extracted from word2vec or ini-
tialized/concatenated with GloVe (Pennington et al., 2014)). These word embeddings are further
processed by an RNN-based text encoder (e.g., LSTM (Hochreiter and Schmidhuber, 1997) or
GRU (Cho et al., 2014)) to obtain the representation of the question.
Figure 2.3: Early VL models developed over time. We mainly focus on the VQA task, and include
methods for (i) inter-modality attention design for multimodal alignment (e.g., SAN (Yang et al.,
2016) and BAN (Kim et al., 2018)), (ii) intra-modality attention design for relational reasoning
(e.g., Relation Network (Santoro et al., 2017) and ReGAT (Li et al., 2019d)), (iii) bilinear pooling
for better fusion (e.g., MCB (Fukui et al., 2016) and MFB (Yu et al., 2017)), (iv) the use of both
inter- and intra-modality attention (e.g., MCAN (Yu et al., 2019c)), and (v) neural module networks
for compositional visual reasoning (Andreas et al., 2016b). We also briefly include methods for
image captioning and image-text retrieval. As there exists a vast body of literature on this topic,
only some representative works are shown.
• Transformer. Inspired by the success of Transformers (Vaswani et al., 2017) (e.g., BERT (Devlin
et al., 2019)) with large-scale pre-training in NLP, researchers have used pre-trained BERT to
extract question representations. This method has been integrated into several winning ensemble
entries of the VQA Challenge (Yu et al., 2019b; Liu et al., 2019a).
In addition to those discussed above, other text encoders, such as the CNN-based text en-
coder (Ma et al., 2016) that is trained to recognize patterns in text (such as key phrases), have
also been explored. A recent survey is Minaee et al. (2021).
Multimodal Fusion Module. Multimodal fusion aims at modeling interactions between visual
features and text features. The design of multimodal fusion modules has always been the major
topic in VL research, especially for task-specific VL models. We start the review with simple fusion
methods (such as concatenation), followed by some of the most popular attention-based methods,
which demonstrate how task-specific VL models evolve over time. For methods that are not based
on attention, such as bilinear pooling (Fukui et al., 2016), we defer the discussion to Section 2.3.
• Simple fusion without attention. Image and text features are fused via element-wise product
or sum, or concatenation (Antol et al., 2015; Jabri et al., 2016). More sophisticated designs re-
fine the fused image-text features via LSTM (Malinowski et al., 2015) or multimodal residual
networks (Kim et al., 2016).
• Inter-modality attention. Inter-modality attention methods (e.g., Yang et al., 2016; Lu et al.,
2016; Nguyen and Okatani, 2018) aim to capture multimodal alignment between image and text
inputs. Compared to simple fusion, attention models construct a more informative VL-joint repre-
sentation since higher weights are put on the image regions that are more useful to solve the task.
There are many works along this direction. We name a few below. Stacked Attention Network
(SAN) (Yang et al., 2016) is the first that verifies the effectiveness of inter-modality attention in
VQA, with question as query to attend image features. Lu et al. (2016) argue that attention on
text is equally important as that on image, and develop a co-attention method to jointly perform
question-guided image attention and image-guided text attention. BAN (Kim et al., 2018) extends
the idea of co-attention into bilinear attention, which considers every pair of question words and
image regions. Stacking multiple inter-modality attention layers can also be viewed as a way to
perform multi-step reasoning (Yang et al., 2016; Gan et al., 2019), where the attention distribution
is refined layer by layer to focus on regions that are more relevant to the question.
• Intra-modality attention. Intra-modality attention methods aim to perform relational reasoning
over image regions or question words. Considering the relations between object regions in image
and dependencies between words in question, VQA performance can be improved by building
graph-structured representations (Santoro et al., 2017; Hu et al., 2019). For the question, a graph
built with words as nodes can be obtained through dependency parsing (Teney et al., 2017).
Figure 2.4: Overview of the BUTD model for VQA. Gray numbers indicate the dimensions of
the vector representations between layers. Yellow elements use learned parameters. Figure credit:
Anderson et al. (2018a).
For the image, the graph with object regions as nodes can be built by leveraging external knowledge
(e.g., scene graphs) and rule-based priors (e.g., estimating the relative positions of two objects
with bounding box coordinates) (Li et al., 2019d). Alternatively, one can also start with a fully-
connected graph, and dynamically prune and refine the connections between nodes during model
training (Norcliffe-Brown et al., 2018; Cadene et al., 2019a).
• Transformer. Image (question) understanding can be achieved by not only attending to the other
modality (through inter-modality attention), but also the related regions (other words) from the
current modality (via intra-modality attention) (Gao et al., 2019b). Based on the scaled dot-
product attention in Transformer (Vaswani et al., 2017), MCAN (Yu et al., 2019c) uses the
self-attention unit for intra-modal interactions (i.e., region-to-region or word-to-word) and the
guided attention unit for dense inter-modal interactions (e.g., word-to-region). MCAN also adopts
an encoder-decoder Transformer architecture, where the encoder with multiple layers of self-
attention learns the self-attended question features, and the decoder uses the resulting question
features to learn the attended image features with a stack of self-attention (on image features
only) followed by guided-attention (with question feature as query to attend on image features).
Task-specific Output Layer. The cross-modal representations computed by the multimodal fu-
sion module are fed to a task-specific output layer to generate model predictions. As VQA is usually
modeled as a classification problem, the output layer is a classifier that consists of a fully-connected
layer or a multi-layer perceptron followed by a softmax layer, to predict the answer.
Trends in VQA Models. Now, we summarize the trends of model architecture designs in the VQA
literature, component by component. Figure 2.3 lists some early VL models developed over time.
• Visual features evolve in 4 stages: (i) global image features with a holistic view of the entire
image; (ii) grid features that preserve local and spatial information with a uniform grid; (iii)
region features extracted from more salient object-centric image regions; and (iv) back to grid
features that can capture similar semantics when trained with object detection objective.
• Textual features evolve in 3 stages: (i) bag-of-words that encodes each word independently; (ii)
RNNs capturing word dependencies and text structures; and (iii) more powerful text representa-
tions with a pre-trained Transformer.
• Multimodal fusion methods evolve in 4 stages: (i) simple fusion without attention; (ii) inter-
modality attention methods that model multimodal alignment between image and text inputs; (iii)
intra-modality attention methods that capture uni-modal relations; and (iv) Transformer-based
models that combine inter-modality and intra-modality attention.
BUTD with Top-down Attention. Given an image-question pair, regional bottom-up features $v = \{v_1, \cdots, v_M\}$ ($M$ is the number of regions; we use $M$ instead of the $k$ in Figure 2.4 to keep the notation consistent throughout the paper) are first extracted from an OD-based visual
encoder, and the question feature w is obtained with a word embedding layer followed by a GRU as
the text encoder. Note that the question feature is a global textual representation with a single vector
of dimension 512 as specified in Figure 2.4.
BUTD adopts inter-modality attention to attend the query question feature to each image region. Formally, the attention weight $a_i$ on each region $v_i$ is computed by an attention model $f_{att}$ and normalized with a softmax operation:

$e_i = f_{att}(v_i, w) = w_a^{\top} f_a([v_i, w]), \qquad a_i = \frac{\exp(e_i)}{\sum_{j=1}^{M} \exp(e_j)}, \tag{2.2}$

where $w_a$ is a learnable parameter vector and $f_a$ is a gated tanh layer. Once the attention weights are computed, the attended visual representation $\hat{v}$ is obtained via a weighted sum over $v$:

$\hat{v} = \sum_{i=1}^{M} a_i v_i. \tag{2.3}$
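A minimal PyTorch sketch of Equations (2.2)-(2.3) is given below; the gated tanh implementation of $f_a$ and the feature dimensions are illustrative rather than the exact BUTD hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Illustrative sketch of top-down attention (Equations 2.2-2.3)."""
    def __init__(self, d_v=2048, d_w=512, d_h=512):
        super().__init__()
        # f_a: gated tanh layer applied to the concatenation [v_i, w]
        self.fc = nn.Linear(d_v + d_w, d_h)
        self.gate = nn.Linear(d_v + d_w, d_h)
        self.w_a = nn.Linear(d_h, 1, bias=False)        # learnable vector w_a

    def forward(self, v, w):
        # v: (batch, M, d_v) region features; w: (batch, d_w) question feature
        w_tiled = w.unsqueeze(1).expand(-1, v.size(1), -1)
        x = torch.cat([v, w_tiled], dim=-1)
        f_a = torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))  # gated tanh layer
        e = self.w_a(f_a).squeeze(-1)                   # e_i = w_a^T f_a([v_i, w])
        a = F.softmax(e, dim=-1)                        # Eq. (2.2)
        v_hat = (a.unsqueeze(-1) * v).sum(dim=1)        # Eq. (2.3): weighted sum
        return v_hat, a
```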
Figure 2.5: Overview of scaled dot-product attention (left), multi-head attention (middle) and Transformer layer (right). Figure credit: Vaswani et al. (2017).

Transformer with Multi-head Scaled Dot-Product Attention. The top-down attention intro-
duced in BUTD is simple, in two aspects. On one hand, it is inter-modality attention only, while
more advanced models (Gao et al., 2019b; Yu et al., 2019c) combine both inter-modality and intra-
modality attention to learn better cross-modal representations. On the other hand, the attention mechanism is simple in that only question-to-region attention is used. Furthermore, the attention weights are learned with a single learnable parameter vector $w_a$ (which is usually referred to as single-head attention in the literature). Of late, modern attention-based models (Li et al., 2019d; Gao et al., 2019b; Yu et al., 2019c) closely follow the Transformer (Vaswani et al., 2017) to adopt scaled dot-product attention, usually with multiple heads. As the Transformer architecture has become the basis for VLP (and also a basic concept in the following chapters), we briefly review the multi-head scaled dot-product
attention and the vanilla Transformer layer (shown in Figure 2.5).
• Multi-head scaled dot-product attention. With three sets of feature vectors as inputs, query $Q$, key $K$ and value $V$, scaled dot-product attention is defined as

$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \tag{2.5}$
where $d_k$ is the feature dimension of $Q$ and $K$. To extend it to multi-head attention (illustrated in the center of Figure 2.5), the queries, keys and values can be linearly projected $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values, the attention is performed in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected to produce the final values. Compared to single-head attention, multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
The scaled dot-product attention mechanism can be adopted for both inter-modality and intra-
modality attention, depending on the inputs. For example, word-to-region attention (inter-
modality) can be realized by using question features w as query and visual features v as key and
value. When we set the query, key, value as the features from the same modality, it is considered
as intra-modality attention.
• Transformer layer. As shown in the rightmost panel of Figure 2.5, a Transformer layer has two sub-layers: (i) a multi-head attention layer, and (ii) a simple, position-wise fully connected feed-forward layer. A residual connection is added around each of the two sub-layers, followed by layer normalization. This Transformer layer is the building block of modern VLP models; a minimal code sketch of both the attention operation and a Transformer layer is given below.
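The sketch below implements scaled dot-product attention (Equation 2.5) and a post-norm Transformer layer in PyTorch; the hyper-parameters are illustrative, and `nn.MultiheadAttention` is used for the multi-head projection logic.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k: (..., len_q, d_k) and (..., len_k, d_k); v: (..., len_k, d_v)  -- Eq. (2.5)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

class TransformerLayer(nn.Module):
    """Illustrative sketch of one (post-norm) Transformer layer, as in Figure 2.5 (right)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, context=None):
        # Intra-modality (self-) attention when context is None; inter-modality
        # (guided/cross-) attention when context (the other modality) is given.
        kv = x if context is None else context
        h, _ = self.attn(x, kv, kv)
        x = self.norm1(x + h)                  # residual connection + layer norm
        return self.norm2(x + self.ffn(x))     # position-wise feed-forward sub-layer
```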
In this subsection, we briefly review model architectures for image captioning and image-text re-
trieval, where similar trends to VQA models are observed.
Image Captioning. Early captioning models before deep learning use a modular architecture (e.g.,
Farhadi et al., 2010; Kulkarni et al., 2013; Fang et al., 2015), consisting of modules developed sep-
arately for detecting objects or concepts in images and generating captions using rules or machine
learned models, respectively. Inspired by the Seq2Seq learning framework for machine transla-
tion (Sutskever et al., 2014; Bahdanau et al., 2015), image captioning models nowadays adopt the
encoder-decoder architecture. Specifically, a visual encoder is used to extract visual features and a
text decoder generates a caption based on the visual features. To make the text decoder better ex-
ploit rich information in visual features, different multimodal fusion methods have been explored
with or without attention.
Case study: We first use the seminal "Show, Attend and Tell" model (Xu et al., 2015) as an example to review how a captioning model works. Grid features $v = \{v_1, \cdots, v_M\}$ ($M = 14 \times 14$ is the number of grids) are first extracted from a CNN-based visual encoder. An LSTM is used as the text decoder to produce a caption by generating one word at every time step, conditioned on (i) the context vector $z_t$ at the current time $t$, indicating the relevant part of the image input; (ii) the current hidden state $h_t$ of the LSTM; and (iii) the previously generated words $\hat{y}_{1:t-1}$. Here, we describe how the context vector is produced via attention. Similar to Equation 2.2, an attention model $f_{att}$ followed by softmax normalization is adopted to compute the attention weight $a_{ti}$ for the $i$-th visual feature $v_i$ at time $t$. However, in the case of image captioning, instead of conditioning on the question feature $w$, the attention weights are conditioned on the previous hidden state $h_{t-1}$ of the LSTM. Specifically, Xu et al. (2015) have explored two alternative mechanisms for $f_{att}$, for which we refer the reader to the original paper for more details. After obtaining the attention weights, the context vector $z_t$ is computed via a weighted sum of all visual features. That is,

$z_t = \sum_{i=1}^{M} a_{ti} v_i. \tag{2.7}$
The output word probability at time $t$ can then be calculated via an output layer $f_o$, conditioned on the context vector $z_t$, the hidden state $h_t$, and the previously generated word.
Figure 2.6: Overview of the seminal “Show, Attend, and Tell” model for image captioning. Figure
credit: Xu et al. (2015).
During training, given the ground-truth caption sequence $y_{1:T}$, the following cross-entropy loss is minimized:

$\mathcal{L}_{XE}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(y_t \mid v, y_{1:t-1}). \tag{2.9}$
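The following compact PyTorch sketch ties Equations (2.7)-(2.9) together: an attention-conditioned LSTM decoder trained with teacher forcing and the cross-entropy loss. Layer sizes and the single-layer attention model are illustrative simplifications in the spirit of "Show, Attend and Tell", not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCaptioner(nn.Module):
    """Illustrative sketch of an attention-based LSTM captioning decoder."""
    def __init__(self, vocab_size, d_v=512, d_h=512, d_emb=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.att = nn.Linear(d_v + d_h, 1)               # f_att conditioned on h_{t-1}
        self.lstm = nn.LSTMCell(d_emb + d_v, d_h)
        self.output = nn.Linear(d_h + d_v, vocab_size)   # output layer f_o

    def forward(self, v, captions):
        # v: (batch, M, d_v) grid features; captions: (batch, T) ground-truth token ids
        B, M, _ = v.shape
        h = c = v.new_zeros(B, self.lstm.hidden_size)
        loss = 0.0
        for t in range(captions.size(1) - 1):
            # attention weights a_t conditioned on the previous hidden state h_{t-1}
            e = self.att(torch.cat([v, h.unsqueeze(1).expand(-1, M, -1)], -1)).squeeze(-1)
            z = (F.softmax(e, dim=-1).unsqueeze(-1) * v).sum(1)   # context vector, Eq. (2.7)
            h, c = self.lstm(torch.cat([self.embed(captions[:, t]), z], -1), (h, c))
            logits = self.output(torch.cat([h, z], -1))
            # cross-entropy term of Eq. (2.9), with teacher forcing
            loss = loss + F.cross_entropy(logits, captions[:, t + 1])
        return loss
```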
• Visual encoder. Early studies (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015) adopt a CNN
model as the image encoder to extract global visual features, and then quickly move to grid fea-
tures (Xu et al., 2015; Yao et al., 2017). Later, region features extracted from an OD-based visual
encoder became the default choice, since BUTD (Anderson et al., 2018a) showed bottom-up
features to be much more effective for image captioning. Once again, Jiang et al. (2020) also defend
the use of grid features for both VQA and image captioning. More recently, fully Transformer-
based captioning models (Wang et al., 2022i; Fang et al., 2022b) are built on top of grid features
extracted from a Transformer-based visual encoder (e.g., Swin Transformer (Liu et al., 2021c)).
• Text decoder. RNN-based methods are widely adopted (Mao et al., 2014; Donahue et al., 2015;
Pan et al., 2020b) before the emergence of the Transformer. A CNN-based decoder has also been explored in Aneja et al. (2018), showing on-par performance while being easier to train (e.g., better training efficiency, less prone to vanishing gradients) compared with the prominent LSTM design. Of late, Transformer-based decoders (Herdade et al., 2019; Li et al., 2019b; Cornia et al., 2020; Luo et al., 2021b) have become the most popular design choice.
• Multimodal fusion. Early models without attention directly input the global visual features to
the text decoder, either as the initial hidden state (Xu et al., 2015; Vinyals et al., 2015; Karpathy and
Fei-Fei, 2015) or at each step of the LSTM decoder (Mao et al., 2014; Donahue et al., 2015).
Similar to the use of attention models in VQA, the encoder-decoder image captioning models
are enhanced by incorporating inter-modality attention mechanism in the decoder (e.g., Xu et al.,
2015; Lu et al., 2017; Huang et al., 2019), so that the caption can be generated based on the im-
age regions/grids and concepts of interest. Intra-modality attention (You et al., 2016; Yao et al.,
2019; Yang et al., 2019a) has also been explored for captioning, mostly focusing on modeling ob-
ject relational reasoning. For example, Yao et al. (2018) employ a graph convolutional network
to integrate both semantic and spatial object relationships into the visual encoder. Herdade et al.
(2019) build an object relation Transformer to explicitly incorporate information about the spatial
relationship between input objects through geometric attention.
Image-Text Retrieval.
• Visual encoder. We observe a similar transition with a plain CNN model, from global image
features (Kiros et al., 2014; Socher et al., 2014; Wang et al., 2016; Klein et al., 2015) to grid
features (Huang et al., 2017; Nam et al., 2017). Even before the first adoption of bottom-up
features (Anderson et al., 2018a) in Lee et al. (2018), region features had been used to model
finer-grained alignment between image and text. For example, Karpathy and Fei-Fei (2015) ex-
tract region features with R-CNN (Girshick et al., 2014); Plummer et al. (2015) leverage Edge-
Box (Zitnick and Dollár, 2014) to generate region proposals; and Niu et al. (2017) further combine
region features with global image features.
• Text encoder. Researchers have explored (i) BoW-based methods (Klein et al., 2015; Wang
et al., 2016) by independently computing word embeddings; (ii) RNN-based architecture, such
as LSTM (Kiros et al., 2014; Socher et al., 2014) and GRU (Faghri et al., 2017); and (iii) CNN-
based architecture (Zheng et al., 2020).
• Multimodal fusion. There have been studies that focus on projecting global visual features and
global text features into a common “visual-semantic” space (Kiros et al., 2014; Socher et al., 2014;
Wang et al., 2016), where multimodal fusion is realized by a simple dot product. Another paradigm
of approaches examines finer-grained alignment between regions in the image and words
in the texts. The first attempt (Karpathy and Fei-Fei, 2015) adopts inner product to fuse each
word-region pair, and sum the similarity between aligned word and region pairs as the image-text
similarity. The adoption of attention greatly enhances the performance of local-level matching
methods. Many works (Huang et al., 2017; Nam et al., 2017; Liu et al., 2019b; Zhang et al.,
2020b) are devoted to designing better inter-modality attention. Perhaps the most prominent ex-
ample is SCAN (Lee et al., 2018), with cross-attention to not only use text as the query to attend
to image regions, but also use the image query to attend to words. Intra-modality attention mech-
anisms are also incorporated to enhance image/text representations. Image representations can be
refined by a position-focused attention module (Wang et al., 2019d) or by structured reasoning over ob-
ject relationships with graph neural networks (Li et al., 2019c). Extending to text representations,
Chen and Luo (2020) design a word attention module and an object attention module to compute
the self-attention weights of words and objects. Liu et al. (2020a) and Diao et al. (2021) apply
graph neural networks to both image and text inputs.
In this section, we review additional research topics for the development of early VL models, in-
cluding bilinear pooling, compositional visual reasoning, and visual grounding.
Advanced attention design is a main theme for early VL research. Besides this, instead of simple
concatenation and element-wise product for fusion, another line of work (Fukui et al., 2016; Kim
et al., 2017; Yu et al., 2017) aims to develop better methods for bilinear pooling, i.e., how to fuse
two vectors into a better representation.
Specifically, Fukui et al. (2016) proposed Multimodal Compact Bilinear (MCB) pooling, which was
also the winning solution of the 2016 VQA Challenge. However, the feature after the Fourier transform is very
high-dimensional, which makes MCB computationally expensive. Kim et al. (2017) proposed a
simple Hadamard product for low-rank bilinear pooling, and Yu et al. (2017) proposed Multimodal
Factorized Bilinear (MFB) pooling. Other more advanced pooling methods include MUTAN (Ben-
Younes et al., 2017) and BLOCK (Ben-Younes et al., 2019), for example. In Perez et al. (2018),
the authors developed FiLM, a feature-wise linear modulation operator similar to conditional batch
normalization, i.e., a general conditioning layer to inject language information (e.g., a question) into
the image backbone (e.g., a convolutional neural network).
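As a concrete illustration, the following is a minimal sketch of low-rank bilinear (Hadamard-product) fusion in the spirit of Kim et al. (2017) and MFB (Yu et al., 2017); the dimensions, pooling factor, and normalization details are illustrative rather than the published configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearFusion(nn.Module):
    """Illustrative sketch of low-rank bilinear (Hadamard-product) fusion."""
    def __init__(self, d_v=2048, d_q=512, d_out=1000, pool=5):
        super().__init__()
        self.U = nn.Linear(d_v, d_out * pool)   # low-rank projection of the image feature
        self.V = nn.Linear(d_q, d_out * pool)   # low-rank projection of the question feature
        self.pool = pool

    def forward(self, v, q):
        # Element-wise (Hadamard) product in the expanded low-rank space ...
        joint = self.U(v) * self.V(q)
        # ... followed by sum pooling over groups of `pool` dimensions (as in MFB).
        joint = joint.view(joint.size(0), -1, self.pool).sum(-1)
        # Power and L2 normalization, commonly used to stabilize training.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=-1)
```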
This line of work is orthogonal to attention design, and the two are typically used together to enhance each other. However, in the era of VLP, all these bilinear pooling and attention designs have largely been replaced by, or have converged to, the Transformer design.
2.3.2 Compositional Visual Reasoning
Besides designing better attention methods to achieve stronger performance on standard VL tasks,
such as VQA and image captioning, there are studies on compositional visual reasoning that re-
quires a model to learn a strong compositional generalization capability, i.e., understanding and
answering compositional questions without seeing similar semantic compositions before. Below,
we briefly review Neural Module Network (NMN) (Andreas et al., 2016a,b) that aims to perform
such complex reasoning tasks. For evaluation, methods are typically tested on a diagnostic visual
reasoning dataset called CLEVR (Johnson et al., 2017a), and a real-world visual reasoning dataset
called GQA (Hudson and Manning, 2019b).
In order to answer a question about an image, NMN uses a set of pre-defined functions and explicitly
encodes each function into a shallow neural network called a module. These modules are composed
dynamically to build an instance-specific network for each input question. By first parsing the
question into a program, and then executing the program via dynamically composing an instance-
specific network, NMN excels in interpretability and compositionality by design, as each module is
designed to accomplish a specific skill, and multiple modules can be combined to perform a new
task during inference.
Since NMN involves two steps, program synthesis and program execution, the original neural
module network (Andreas et al., 2016b) cannot be trained end-to-end. N2NMN (Hu et al., 2017) and
IEP (Johnson et al., 2017b) have successfully made NMN end-to-end trainable via reinforce-
ment learning. Stack-NMN (Hu et al., 2018) makes a soft layout selection so that the whole model
is fully differentiable. Neural-Symbolic VQA (Yi et al., 2018; Mao et al., 2019; Vedantam et al.,
2019) performs symbolic reasoning by encoding images into scene graphs. Chen et al. (2021b)
propose Meta Module Network, where only a general-purpose meta module is used for program
execution recurrently. This meta module is able to take in function recipes and morph them into di-
verse instance modules dynamically. The instance modules are then woven into an execution graph
for complex visual reasoning, inheriting the explainability and compositionality of NMN.
In addition to neural module networks, compositional attention networks (Hudson and Manning,
2018) and MuRel (Cadene et al., 2019a) have been proposed to realize multi-hop reasoning on com-
plex questions. However, due to the pure attention design, these models are less interpretable. Also,
the Neural State Machine (NSM) (Hudson and Manning, 2019a) has been proposed, which first predicts a probabilistic scene graph and then performs multi-hop reasoning over the graph for answer prediction, where the scene graph serves as a strong prior for the model.
In the recent VLP literature (Tan and Bansal, 2019; Chen et al., 2020d; Li et al., 2021a; Dou et al.,
2022b), most methods use large-scale, Transformer-based monolithic networks. The research on
compositional visual reasoning and neural module networks has become less popular. But we believe
that compositional generalization is an important topic worthy of further investigation even in the
new era of large-scale pre-training.
Now, we briefly discuss the visual grounding (VG) task. Different from the VL tasks introduced
in Section 2.1, VG requires a model to ground a text query to the relevant object in the image
and predict its bounding box coordinates. Below, we briefly review the popular benchmarks and
representative task-specific models for VG.
Task and Benchmark. Two types of VG tasks have been proposed in the literature: phrase grounding and
referring expression comprehension.
• Phrase grounding is introduced with the Flickr30K Entities dataset (Plummer et al., 2015), in
which multiple entities (phrases) in a sentence for an image are mapped to the boxes on the image
to indicate the correspondences between them (Figure 2.7a). The task is to predict a bounding
box for each entity. Recall@K is used to evaluate model performance and a predicted box for
a given entity is considered correct if the intersection over union (IoU) between the predicted and
ground-truth bounding boxes is greater than or equal to 0.5.
• Referring expression comprehension is to localize the object in the input image that is re-
ferred to by an expression in text and return a bounding box around the object (Figure 2.7b).
(a) Phrase grounding. (b) Referring expression comprehension.
Figure 2.7: Visualization of two visual grounding tasks.
Three well-established datasets for this task are RefCOCO, RefCOCO+ (Yu et al., 2016) and RefCOCOg (Mao et al., 2016). Similarly, a prediction is counted as a true positive if the IoU is at least 0.5 (a small sketch of the IoU computation is given after this list). Accuracy is used as the evaluation metric.
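For reference, the IoU criterion used by both tasks can be computed as in the small sketch below; boxes are in (x1, y1, x2, y2) format and the example coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box counts as correct if IoU with the ground truth is >= 0.5.
print(iou((10, 10, 60, 60), (30, 30, 80, 80)) >= 0.5)   # False (IoU ~ 0.22)
```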
Task-specific VG Models. Early VL models for the VG task can be generally grouped into two categories. One is two-stage methods (Nagaraja et al., 2016; Kim et al., 2018), which first generate object regions and then perform region-text matching via multimodal fusion to ground the
query/referring expression. The region proposals are generated using either unsupervised meth-
ods (Plummer et al., 2018; Wang et al., 2018) or a pre-trained object detector (Yu et al., 2018a;
Zhang et al., 2018b). The other is one-stage models with end-to-end training (Chen et al., 2018;
Liao et al., 2020), where the bounding box proposal generation is guided by the text/phrase query.
For example, Yang et al. (2019b) fuse a text query’s embedding (a single vector representation) into
the YOLOv3 object detector (Redmon and Farhadi, 2018). The method is later improved by using a
recursive sub-query construction framework to reason between image and query for multiple rounds
and reduce the referring ambiguity step by step (Yang et al., 2020). Lately, Deng et al. (2021) em-
pirically show that complex fusion modules can be replaced by a simple stack of Transformer encoder
layers to achieve higher performance.
Chapter 3
Visual question answering (VQA) (Antol et al., 2015), image captioning (Vinyals et al., 2015) and
image-text retrieval (Lin et al., 2014; Plummer et al., 2015) are arguably the three most widely stud-
ied image-text tasks in the literature. They require an AI system to comprehend both the input image
and text contents. Inspired by the great success of language model pre-training (Devlin et al., 2019;
Liu et al., 2019d; Raffel et al., 2020; Brown et al., 2020; He et al., 2021), coupled with the unification
of architectures used in the NLP and computer vision communities (Dosovitskiy et al., 2021; Carion
et al., 2020), there has been a surging research interest in developing VLP methods for image-text
tasks (Tan and Bansal, 2019; Chen et al., 2020d; Li et al., 2020e; Zhang et al., 2021b; Kim et al.,
2021). Specifically, large amounts of image-caption pairs are fed into a model that consumes both
images and text to pre-train representations that encode rich multimodal knowledge and are helpful
for downstream tasks. In this chapter, we present a systematic review of this new emerging train-
ing paradigm. Specifically, in Section 3.1, we provide an overview of representative VLP models,
and divide them into several categories. In Section 3.2, we describe the Transformer-based model
architectures for VLP, and dissect the model designs along multiple dimensions including image
encoder, text encoder, multimodal fusion, etc. In Sections 3.3 and 3.4, we introduce the commonly
used pre-training objectives and pre-training datasets, respectively. In Section 3.5, we present a list
of advanced research topics, including foundation models, multimodal few-shot learning, unified
VL modeling, knowledge for VLP, robustness evaluation, model compression and so on. Lastly, in
Section 3.6, we provide a brief discussion on text-to-image generation, another important image-text
task that has received rapidly growing attention in the community.
• For dual encoder, images and text are encoded separately, and modality interaction is only han-
dled by a simple cosine similarity of the image and text feature vectors. This model architecture
is effective for image retrieval tasks, and when scaled up, can be used to learn a strong image
encoder from scratch via large-scale contrastive pre-training, as demonstrated by CLIP (Radford
et al., 2021) and ALIGN (Jia et al., 2021). However, due to the lack of deep multimodal fusion,
CLIP performs poorly on VQA and visual reasoning tasks.
• For fusion encoder, besides the use of an image encoder and a text encoder, additional Trans-
former layers (Vaswani et al., 2017) are typically employed to model the deep interaction be-
tween image and text representations. Prominent examples include UNITER (Chen et al., 2020d),
VinVL (Zhang et al., 2021b), SimVLM (Wang et al., 2022k), and METER (Dou et al., 2022b).
This fusion-encoder architecture achieves superior performance on the VQA and image caption-
ing tasks, but can be very inefficient when applied to image retrieval, as it requires encoding all
the possible image-text pairs (matched or not) to compute similarity scores for ranking. Recent
work, such as ALBEF (Li et al., 2021a), UFO (Wang et al., 2021a), and VLMo (Wang et al.,
2021c), has also shown that it is possible to encapsulate both the dual encoder and fusion encoder
design into one framework, so that the model is suitable for fast image retrieval, but at the same
time can also be used for the VQA and image captioning tasks.

Model                             | Vision Enc. | Text Enc. | Fusion       | Decoder | Pre-training Objectives
ViLBERT (Lu et al., 2019)         | OD+Xformer  | Xformer   | Co-attn.     | ✗†      | MLM+ITM+MIM
LXMERT (Tan and Bansal, 2019)     | OD+Xformer  | Xformer   | Co-attn.     | ✗†      | MLM+ITM+MIM+VQA
VisualBERT (Li et al., 2019e)     | OD          | Emb.      | Merged-attn. | ✗†      | MLM+ITM
VL-BERT (Su et al., 2019)         | OD          | Emb.      | Merged-attn. | ✗†      | MLM+MIM
UNITER (Chen et al., 2020d)       | OD          | Emb.      | Merged-attn. | ✗†      | MLM+ITM+MIM+WRA
OSCAR (Li et al., 2020e)          | OD          | Emb.      | Merged-attn. | ✗†      | MLM+ITM
VILLA (Gan et al., 2020)          | OD          | Emb.      | Merged-attn. | ✗†      | MLM+ITM+MIM+WRA
VinVL (Zhang et al., 2021b)       | OD          | Emb.      | Merged-attn. | ✗†      | MLM+ITM
UNIMO (Li et al., 2021e)          | OD          | Emb.      | Merged-attn. | ✗†      | MLM+ITM+MIM+ITC
VL-T5 (Cho et al., 2021)          | OD          | Emb.      | Merged-attn. | ✓       | MLM+ITM+VQA+GC
PixelBERT (Huang et al., 2020)    | CNN         | Emb.      | Merged-attn. | ✗†      | MLM+ITM
SOHO (Huang et al., 2021b)        | CNN         | Emb.      | Merged-attn. | ✗†      | MLM+ITM+MIM
CLIP-ViL (Shen et al., 2022b)     | CNN         | Emb.      | Merged-attn. | ✗†      | MLM+ITM+VQA
SimVLM (Wang et al., 2022k)       | CNN         | Emb.      | Merged-attn. | ✓       | PrefixLM
MDETR (Kamath et al., 2021)       | CNN         | Xformer   | Merged-attn. | ✓       | OD+TP+CA
UniTAB (Yang et al., 2021c)       | CNN         | Xformer   | Merged-attn. | ✓       | Seq2Seq
OFA (Wang et al., 2022f)          | CNN         | Emb.      | Merged-attn. | ✓       | Seq2Seq
Flamingo (Alayrac et al., 2022)   | CNN         | Emb.      | Cross-attn.  | ✗†      | LM
ViLT (Kim et al., 2021)           | Patch Emb.  | Emb.      | Merged-attn. | ✗†      | MLM+ITM
Visual Parsing (Xue et al., 2021) | Xformer     | Emb.      | Merged-attn. | ✗†      | MLM+ITM+MIM
GIT (Wang et al., 2022d)          | Xformer     | Emb.      | Merged-attn. | ✗†      | LM
VLMo (Wang et al., 2021c)         | Xformer     | Emb.      | Merged-attn. | ✗†      | MLM+ITM+ITC
BEiT-3 (Wang et al., 2022g)       | Xformer     | Emb.      | Merged-attn. | ✗†      | MLM+MIM+MVLM
ALBEF (Li et al., 2021a)          | Xformer     | Xformer   | Cross-attn.  | ✗†      | MLM+ITM+ITC
BLIP (Li et al., 2022f)           | Xformer     | Xformer   | Cross-attn.  | ✗†      | LM+ITM+ITC
CoCa (Yu et al., 2022a)           | Xformer     | Xformer   | Cross-attn.  | ✗†      | LM+ITC
METER (Dou et al., 2022b)         | Xformer     | Xformer   | Co-attn.     | ✗†      | MLM+ITM
FIBER (Dou et al., 2022a)         | Xformer     | Xformer   | Co-attn.     | ✗†      | LM+ITM+ITC
CLIP (Radford et al., 2021)       | CNN/Xformer | Xformer   | None         | ✗       | ITC
ALIGN (Jia et al., 2021)          | CNN         | Xformer   | None         | ✗       | ITC
Table 3.1: Glossary of representative VLP models. OD: object detector. Xformer: transformer.
Emb.: embedding. MLM/MIM: masked language/image modeling. ITM: image-text matching.
ITC: image-text contrastive learning. WRA: word-region alignment. TP: token prediction. CA:
contrastive alignment. GC: grounding+captioning. (†) In many cases (e.g., Flamingo (Alayrac et al.,
2022), CoCa (Yu et al., 2022a), and GIT (Wang et al., 2022d)), the multimodal fusion module itself
is also directly called (or serves as) the text decoder.
In this chapter, we mainly focus on the review of VLP methods based on the fusion-encoder ar-
chitecture, while postponing the detailed discussion of dual-encoder models to Chapter 4. Among
fusion-encoder methods, we further divide them into two categories based on whether the model can
be pre-trained end-to-end. This categorization also roughly reflects how VLP methods have evolved
over time. Specifically, most early VLP methods (Tan and Bansal, 2019; Su et al., 2019; Chen
et al., 2020d; Li et al., 2020e; Zhang et al., 2021b) adopt a two-stage pre-training pipeline, where
image region features are first extracted from a pre-trained object detector. More recently, end-to-
end pre-training methods (Huang et al., 2020; Kim et al., 2021; Li et al., 2021a) have become popular,
where image features are extracted from either convolutional neural networks (CNNs) (He et al.,
2016), vision Transformers (ViTs) (Dosovitskiy et al., 2021), or directly from image patch embed-
dings, and the model gradients can be back-propagated into the vision backbone for end-to-end
training. End-to-end VLP methods have achieved new state-of-the-art results on all the major VL tasks.
• OD-based VLP Models. Early methods use pre-trained object detectors (ODs) to extract visual
features. Among them, ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) use co-
attention for multimodal fusion, where two Transformers are applied respectively to region and
text features, and another Transformer fuses the representations of the two modalities in a later
stage. On the other hand, VisualBERT (Li et al., 2019e), Unicoder-VL (Li et al., 2020a), VL-
BERT (Su et al., 2019), and UNITER (Chen et al., 2020d) use a merged attention fusion module
that feeds both region and text features into a single Transformer. The comparison between merged
attention and co-attention is detailed in Section 3.2. OSCAR (Li et al., 2020e) feeds additional
image tags into the Transformer model, while VinVL (Zhang et al., 2021b) uses a stronger pre-
trained OD for feature extraction, and demonstrates state-of-the-art performance across VL tasks.
On the one hand, region features are object-level and semantically rich; on the other hand, extracting
region features can be time-consuming, and the pre-trained object detectors are usually kept frozen
during pre-training, which may limit the capacity of VLP models.

Figure 3.1: VLP models developed for image-text tasks over time (due to space constraints, only
some representative works are shown): ViLBERT (Aug. 6th, 2019), LXMERT (Aug. 20th, 2019),
UNITER (Sep. 25th, 2019), PixelBERT (Apr. 2nd, 2020), VinVL (Jan. 2nd, 2021), VL-T5 (Feb. 4th,
2021), ViLT (Feb. 5th, 2021), ALIGN (Feb. 11th, 2021), CLIP (Feb. 26th, 2021), MDETR (Apr. 26th,
2021), ALBEF (Jul. 16th, 2021), SimVLM (Aug. 24th, 2021), METER and VLMo (Nov. 3rd, 2021),
OFA (Feb. 7th, 2022), Flamingo (Apr. 29th, 2022), CoCa (May 4th, 2022), GIT (May 27th, 2022),
BEiT-3 (Aug. 22nd, 2022), and PaLI (Sep. 14th, 2022).
• End-to-End VLP Models. Researchers have tried different ways to pre-train VL models in an
end-to-end fashion. Specifically, we further divide them into two subcategories, based on how
they encode images.
– CNN-based Grid Features. PixelBERT (Huang et al., 2020) and CLIP-ViL (Shen et al.,
2022b) feed grid features from CNNs and text directly into a Transformer. SOHO (Huang
et al., 2021b) first discretizes grid features using a learned vision dictionary, and then feeds the
discretized features into their cross-modal module. While using grid features directly can be
efficient, inconsistent optimizers are typically used for CNN and Transformer. For example,
PixelBERT (Huang et al., 2020) and CLIP-ViL (Shen et al., 2022b) use AdamW (Loshchilov
and Hutter, 2018) for Transformer and SGD for CNN.
– ViT-based Patch Features. Vision Transformers (ViTs) have been an increasingly active re-
search topic in computer vision, motivating researchers to develop ViT-based VLP models.
Among them, ViLT (Kim et al., 2021) directly feeds image patch features and text token em-
beddings into a pre-trained ViT model, and then pre-trains the model on image-text datasets.
ViTCAP (Fang et al., 2022b) further extends ViLT for image captioning tasks. This has also
led to follow-up works such as UFO (Wang et al., 2021a) and VLMo (Wang et al., 2021c),
where UFO (Wang et al., 2021a) uses the same Transformer to perform image/text encoding
and multimodal fusion all together, while in VLMo (Wang et al., 2021c), additional mixture-of-
modality-experts layers are included. Besides this, Visual Parsing (Xue et al., 2021), ALBEF (Li
et al., 2021a), METER (Dou et al., 2022b), BLIP (Li et al., 2022f), X-VLM (Zeng et al., 2022b)
and FIBER (Dou et al., 2022a) all use ViT as their image encoder (e.g., plain ViT and Swin
Transformer (Liu et al., 2021c)), and design different objectives for model pre-training.
We present a glossary of representative VLP models in Table 3.1, where models are dissected along
multiple dimensions. In Figure 3.1, we show how these VLP models evolve along time.
Research Progress Driven by VLP. Now, we use the VQA task as a case study to illustrate the
research progress driven by large-scale VLP (see Figure 3.2).
• From August 2017 to August 2019, many task-specific methods were developed, ranging from
the use of object-centric visual features and advanced attention mechanism designs to object rela-
tional modeling and the use of Transformers. The corresponding VQA accuracy was boosted
from ≈66% to ≈71%.
• From August 2019 to August 2021, vision-language pre-training (VLP) became the main-
stream. It started with OD-based VLP models, boosting the VQA accuracy from ≈71% to
≈78%; end-to-end VLP methods based on convolutional networks and vision Transformers then
came to dominate the field.
• From August 2021 to August 2022, we have witnessed a boom of big multimodal foundation
models, e.g., SimVLM (Wang et al., 2022k), Florence (Yuan et al., 2021), Flamingo (Alayrac
et al., 2022), CoCa (Yu et al., 2022a), GIT (Wang et al., 2022d), and BEiT-3 (Wang et al., 2022g).
When these models are scaled up in terms of both model size and pre-training dataset size, the
VQA performance is further boosted from ≈80% to ≈84%.
[The figure plots VQAv2 test-std accuracy (y-axis, from ≈65% to ≈84%) against time (x-axis, 2017/8
to 2022/8) for representative models, ranging from BUTD, Counter, Pythia, BAN, ReGAT, and MCAN,
through OD-based and end-to-end VLP models such as VisualBERT, ViLBERT, LXMERT, VL-BERT,
UNITER, OSCAR, VILLA, ERNIE-ViL, UNIMO, PixelBERT, SOHO, ViLT, CLIP-ViL, Visual Parsing,
VinVL, ALBEF, and BLIP, up to big multimodal foundation models such as SimVLM, Florence, VLMo,
AliceMind, OFA, Flamingo, CoCa, mPLUG, GIT2, BEiT-3, and PaLI.]
Figure 3.2: Research progress driven by large-scale VLP, using the VQA task as a case study. From
August 2017 to August 2019, many task-specific methods have been developed. Since August
2019, OD-based VLP models have become popular. Later on, due to the emergence of vision Trans-
former (Dosovitskiy et al., 2021), end-to-end VLP models have become the mainstream. During the
past year, we have witnessed a boom of big multimodal foundation models, e.g., SimVLM (Wang
et al., 2022k), Florence (Yuan et al., 2021), Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a),
GIT (Wang et al., 2022d), and BEiT-3 (Wang et al., 2022g).
Vision Encoder. As discussed in Section 3.1, there are three types of vision encoders: (i) an object
detector (OD), (ii) a plain CNN, and (iii) a vision Transformer.
• OD. The most widely used object detector for VL research is the Faster R-CNN (Ren et al., 2015b)
pre-trained on the Visual Genome (VG) dataset (Krishna et al., 2017c) as in BUTD (Anderson
et al., 2018a). In VinVL (Zhang et al., 2021b), a stronger OD model based on the ResNeXt-152 C4
architecture is pre-trained on multiple public OD datasets (including COCO (Chen et al., 2015),
OpenImages (Kuznetsova et al., 2020), Objects365 (Shao et al., 2019) and VG), and significant
performance boost is observed across a wide range of VL tasks by using this stronger OD model.
Additional care is taken to encode the location information of image regions, which is typically
represented as a 7-dimensional vector.1 Both visual and location features are then fed through
a fully-connected (FC) layer, to be projected into the same embedding space. The final embedding
for each region is obtained by summing up the two FC outputs and then passing the result through a
layer normalization layer (see the code sketch at the end of this vision encoder discussion).
1: [x1, y1, x2, y2, w, h, w*h] (normalized left/top/right/bottom coordinates, width, height, and area).
[Figure 3.3: The general framework of VLP models for image-text tasks. An image encoder (OD, CNN,
ViT, or patch embedding) and a text encoder (word embedding, BERT, RoBERTa, etc.) feed a multimodal
fusion module (merged attention, co-attention, or dot-product), optionally followed by a decoder. The
model is pre-trained with objectives such as masked language modeling, masked image modeling,
image-text matching, and image-text contrastive learning, and is then transferred to downstream tasks
such as visual question answering, image captioning, image-text retrieval, and phrase grounding. The
running example input is an image paired with the caption "A dog lying on the grass next to a frisbee".]
• CNN. In PixelBERT (Huang et al., 2020) and SOHO (Huang et al., 2021b), ResNet-50, ResNet-
101 and ResNeXt-152 pre-trained on ImageNet classification are adopted. In CLIP-ViL (Shen
et al., 2022b), ResNet-50, ResNet-101, and ResNet-50x4 pre-trained by CLIP (Radford et al.,
2021) are used. In SimVLM (Wang et al., 2022k), the authors use the first three blocks (excluding the
Conv stem) of ResNet-101 and ResNet-152 for their base and large models, respectively, and a
larger variant of ResNet-152 with more channels for the huge model. Typically, it is observed that
a stronger CNN backbone results in stronger downstream performance.
• ViT. Following Dosovitskiy et al. (2021), an image is first split into image patches, which are then
flattened into vectors and linearly projected to obtain patch embeddings. A learnable special token
[CLS] embedding is also prepended to the sequence. These patch embeddings are summed with
learnable 1D position embeddings (and optionally an image-type embedding), and then fed into a
multi-layer Transformer to obtain the final output image features. Different ViT vari-
ants have been studied for VLP, such as plain ViT (Dosovitskiy et al., 2021), DeiT (Touvron et al.,
2021), BEiT (Bao et al., 2022a), Swin Transformer (Liu et al., 2021c), and CLIP-ViT (Radford
et al., 2021), to name a few.
In a nutshell, no matter what vision encoder is used, the input image is represented as a set of feature
vectors v = {v 1 , · · · , v M }.
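To make the above concrete, below is a minimal PyTorch sketch of (i) the OD-style region embedding
described earlier, which projects the region features (2048 dimensions here, a typical Faster R-CNN
feature size assumed for illustration) and the 7-dimensional location vector of footnote 1 into the same
space before summation and layer normalization, and (ii) a ViT-style patch embedding that turns an
image into the sequence of vectors v = {v1, · · · , vM}. Module names, dimensions, and hyper-parameters
are illustrative assumptions, not the exact implementations of the cited models.

    import torch
    import torch.nn as nn

    def box_location_features(boxes, img_w, img_h):
        # boxes: (M, 4) tensor of [x1, y1, x2, y2] in pixels. Returns the 7-d vector
        # [x1, y1, x2, y2, w, h, w*h], with coordinates normalized by the image size.
        x1, y1, x2, y2 = boxes.unbind(-1)
        w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
        return torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, w, h, w * h], dim=-1)

    class RegionEmbedding(nn.Module):
        # OD-style region embedding: project visual and location features into the same
        # space, sum the two FC outputs, and apply layer normalization.
        def __init__(self, visual_dim=2048, hidden_dim=768):
            super().__init__()
            self.visual_fc = nn.Linear(visual_dim, hidden_dim)
            self.loc_fc = nn.Linear(7, hidden_dim)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, region_feats, loc_feats):      # (M, visual_dim), (M, 7)
            return self.norm(self.visual_fc(region_feats) + self.loc_fc(loc_feats))

    class PatchEmbedding(nn.Module):
        # ViT-style patch embedding: split the image into patches, linearly project them,
        # prepend a learnable [CLS] token, and add learnable 1D position embeddings.
        def __init__(self, img_size=224, patch_size=16, in_chans=3, hidden_dim=768):
            super().__init__()
            num_patches = (img_size // patch_size) ** 2
            self.proj = nn.Conv2d(in_chans, hidden_dim, kernel_size=patch_size, stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))

        def forward(self, images):                        # images: (B, 3, img_size, img_size)
            x = self.proj(images).flatten(2).transpose(1, 2)    # (B, M, hidden_dim)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            return torch.cat([cls, x], dim=1) + self.pos_embed  # (B, M+1, hidden_dim)

In both cases, the resulting sequence plays the role of v = {v1, · · · , vM} (plus the [CLS] vector in the
ViT case), and is then fed into the vision Transformer layers and/or the multimodal fusion module.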
Text Encoder. Following BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019d), VLP
models (Tan and Bansal, 2019; Li et al., 2019e; Lu et al., 2019; Su et al., 2019; Chen et al., 2020d;
Li et al., 2020e) first segment the input sentence into a sequence of subwords (Sennrich et al., 2016),
and then insert two special tokens at the beginning and the end of the sentence to generate the
input text sequence. After we obtain the text embeddings, existing works either feed them directly
to the multimodal fusion module (Li et al., 2019e; Chen et al., 2020d), or to several text-specific
layers (Tan and Bansal, 2019; Lu et al., 2019) before the fusion. For the former, the fusion module
is typically initialized with BERT, so the roles of text encoding and multimodal fusion are
entangled and absorbed into a single BERT model; in this case, we consider the text encoder to be
simply the word embedding layer.
Language model (LM) pre-training has demonstrated impressive performance across tasks and dif-
ferent pre-trained LMs have been proposed. In METER (Dou et al., 2022b), the authors have studied
the use of BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019d), ELECTRA (Clark et al., 2020),
ALBERT (Lan et al., 2020), and DeBERTa (He et al., 2021) for text encoding. In Flamingo (Alayrac
et al., 2022), a huge pre-trained LM with 70B parameters (Hoffmann et al., 2022) is used as the
text encoder, and kept frozen during the VLP process for multimodal few-shot learning. In a nut-
shell, no matter what text encoder is used, the input text is represented as a set of feature vectors
w = {w1 , · · · , wN }.
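As a concrete illustration of this pipeline, the snippet below uses the Hugging Face transformers
library (an assumed dependency; any BERT-style tokenizer behaves similarly) to segment a caption into
subwords with the two special tokens, and to look up the corresponding embeddings w = {w1, · · · , wN}.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    text_encoder = AutoModel.from_pretrained("bert-base-uncased")

    # Subword segmentation; [CLS] and [SEP] are inserted at the beginning and the end.
    inputs = tokenizer("a dog lying on the grass next to a frisbee", return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
    # e.g., ['[CLS]', 'a', 'dog', 'lying', ..., '[SEP]'] (exact subword split depends on the vocabulary)

    with torch.no_grad():
        # When the fusion module is itself initialized from BERT, only this word embedding
        # lookup acts as the "text encoder"; otherwise, the full Transformer output
        # text_encoder(**inputs).last_hidden_state is used as w = {w_1, ..., w_N}.
        w = text_encoder.embeddings.word_embeddings(inputs["input_ids"])   # (1, N, 768)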
Multimodal Fusion. For dual encoders like CLIP (Radford et al., 2021) and ALIGN (Jia et al.,
2021), fusion is performed via a dot-product between the two global image and text feature vectors. A
fusion encoder, in contrast, takes both v = {v1, · · · , vM} and w = {w1, · · · , wN} as input, and learns con-
textualized multimodal representations denoted as ṽ = {ṽ 1 , · · · , ṽ M } and w̃ = {w̃1 , · · · , w̃N }.
There are mainly two types of fusion modules, namely, merged attention and co-attention (Hen-
dricks et al., 2021), shown in Figure 3.4. Specifically,
Figure 3.4: Co-attention and merged attention design for multimodal fusion.
• In a merged attention module, the text and visual features are simply concatenated together, and
then fed into a single Transformer block. This design has been used in many previous works, such
as VisualBERT (Li et al., 2019e), Unicoder-VL (Li et al., 2020a), VLP (Zhou et al., 2020b), VL-
BERT (Su et al., 2019), UNITER (Chen et al., 2020d), OSCAR (Li et al., 2020e), VinVL (Zhang
et al., 2021b), ViLT (Kim et al., 2021), GIT (Wang et al., 2022d), etc.
• In a co-attention module, on the other hand, the text and visual features are fed into different
Transformer blocks independently, and techniques such as cross-attention are used to enable cross-
modal interaction. This design has been used in LXMERT (Tan and Bansal, 2019), ViLBERT (Lu
et al., 2019), ERNIE-ViL (Yu et al., 2021), METER (Dou et al., 2022b), etc. Also, in many
models, only image-to-text cross-attention modules are used, such as ALBEF (Li et al., 2021a),
BLIP (Li et al., 2022f), CoCa (Yu et al., 2022a), and Flamingo (Alayrac et al., 2022).
For region-based VLP models, as shown in Bugliarello et al. (2021), the merged attention and co-
attention models can achieve comparable performance. Yet, the merged attention module is more
parameter-efficient, as the same set of parameters is used for both modalities. For end-to-end VLP
models, as shown in METER (Dou et al., 2022b), co-attention performs better. However, there is
no conclusive answer as to which design is better, and it remains largely an empirical choice in model design.
In mPLUG (Li et al., 2022c), a combination of merged attention and co-attention is used for multi-
modal fusion; while in BLIP (Li et al., 2022f) and FIBER (Dou et al., 2022a), fusion is performed
via simply inserting cross-attention modules inside the image and text backbones, which can be
more lightweight and efficient. In MLP-ViL (Nie et al., 2021), the authors study the use of MLP
architectures for multimodal fusion.
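To summarize the two designs in code, below is a minimal single-layer sketch of each (residual
connections, layer normalization, and feed-forward sub-layers are omitted for brevity; dimensions and
class names are illustrative assumptions).

    import torch
    import torch.nn as nn

    class MergedAttentionLayer(nn.Module):
        # Merged attention: concatenate text and visual features and run one self-attention
        # block over the joint sequence (the same parameters serve both modalities).
        def __init__(self, dim=768, heads=12):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, w, v):                       # w: (B, N, D), v: (B, M, D)
            x = torch.cat([w, v], dim=1)               # (B, N + M, D)
            out, _ = self.self_attn(x, x, x)
            return out[:, :w.size(1)], out[:, w.size(1):]    # contextualized text / image features

    class CoAttentionLayer(nn.Module):
        # Co-attention: separate blocks per modality, with cross-attention so that each
        # modality's queries attend to the other modality's keys and values.
        def __init__(self, dim=768, heads=12):
            super().__init__()
            self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.text_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, w, v):
            w_sa, _ = self.text_self(w, w, w)
            v_sa, _ = self.img_self(v, v, v)
            w_out, _ = self.text_cross(w_sa, v_sa, v_sa)   # text queries attend to image
            v_out, _ = self.img_cross(v_sa, w_sa, w_sa)    # image queries attend to text
            return w_out, v_out

The parameter-efficiency argument above is visible directly in the sketch: the merged-attention layer
owns a single attention block shared by both modalities, whereas the co-attention layer duplicates the
attention parameters per modality.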
Discussion: unified modeling with a shared backbone. Transformer has now become a universal
computation engine (Lu et al., 2022b). In UFO (Wang et al., 2021a), the authors have tried to
use the same shared Transformer backbone for image/text encoding and multimodal fusion. In
MS-CLIP (You et al., 2022) and VATT (Akbari et al., 2021), the same shared backbone is used
for contrastive pre-training across multiple modalities. In VLMo (Wang et al., 2021c), additional
mixture-of-modality-experts layers are further added, while the same self-attention layers are shared
for image/text encoding and multimodal fusion. This mixture-of-expert design has achieved strong
performance across multiple VL tasks.
Encoder-Only vs. Encoder-Decoder. Most VLP models adopt the encoder-only architecture,
where the cross-modal representations are directly fed into an MLP-based output layer to generate
the final outputs. This encoder-only design naturally fits VL understanding tasks such as VQA and
visual reasoning. When used for image captioning, the same encoder acts as a decoder to generate
the output captions token by token by using a causal mask.
Recently, inspired by T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) in the NLP lit-
erature, VL-T5 (Cho et al., 2021), SimVLM (Wang et al., 2022k), UniTAB (Yang et al., 2021c),
OFA (Wang et al., 2022f) and DaVinci (Diao et al., 2022), on the other hand, advocate the use of a
Transformer-based encoder-decoder architecture, where the cross-modal representations are first fed
into a decoder and then to an output layer. In these models, the decoder attends to both the encoder
representations and the previously generated tokens, producing the outputs autoregressively. The
use of an encoder-decoder architecture can enable the unification of various image-text tasks and
zero-shot/few-shot learning of VLP models (see Section 3.5.3 for more detailed discussions), and
is also a natural fit for generation tasks. In MDETR (Kamath et al., 2021), the authors also adopt
an encoder-decoder architecture, but the decoder is designed to generate bounding boxes in paral-
lel, following the seminal work of DETR (Carion et al., 2020). An illustrative comparison between
encoder-only and encoder-decoder architectures is provided in Figure 3.5.
Now, we introduce how to design pre-training tasks. We will first review masked language modeling
and image-text matching, which have been used extensively in almost every VLP model. Then, we
will shift our focus to image-text contrastive learning and various masked image modeling tasks.
Masked Language Modeling (MLM). The MLM objective is first introduced in language model
pre-training (e.g., Devlin et al., 2019; Liu et al., 2019d). In VLP, MLM with image-text pairs has
also proven to be useful. Denote the mask indices as m ∈ N^M. In MLM, given an image-text pair,
we randomly mask out the input words with a probability of 15%, and replace the masked ones w̃m
with the special token [MASK]. The goal is to predict these masked tokens based on their surrounding
words w̃\m and the paired image ṽ, by minimizing the negative log-likelihood:

LMLM (θ) = −E(w̃,ṽ)∼D log Pθ (w̃m | w̃\m , ṽ), (3.1)

where θ denotes the trainable parameters. Each pair (w̃, ṽ) is sampled from the whole training set
D. There are several MLM variants used in VLP. Specifically,
• Seq-MLM: In order to adapt the pre-trained model for image captioning, it is observed (Zhou
et al., 2020b; Wang et al., 2021a) that adding a seq2seq causal mask during pre-training is bene-
ficial. That is, in Seq-MLM, the model can only use its preceding context to predict the masked
token, which is consistent with the way the model performs image captioning during inference.
• LM: Direct language modeling is used in BLIP (Li et al., 2022f) and CoCa (Yu et al., 2022a) for
VLP. The model predicts the caption given an image token-by-token autoregressively.
• Prefix-LM: Using the encoder-decoder framework as in SimVLM (Wang et al., 2022k), a Pre-
fixLM pre-training objective is proposed, where a sentence is first split into two parts, and the
bi-directional attention is enabled on the prefix sequence and the input image, while a causal
attention mask is adopted on the remaining tokens.
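The practical difference between these variants largely comes down to the attention mask applied to the
text tokens. A minimal sketch (boolean masks where True means "may attend"; the convention and
helper name are illustrative assumptions):

    import torch

    def text_attention_mask(num_tokens, mode="bidirectional", prefix_len=0):
        # Returns a (num_tokens, num_tokens) boolean mask; entry (i, j) is True if
        # text token i may attend to text token j.
        full = torch.ones(num_tokens, num_tokens, dtype=torch.bool)
        causal = torch.tril(full)                  # token i sees tokens j <= i
        if mode == "bidirectional":                # plain MLM: all text tokens visible
            return full
        if mode == "causal":                       # Seq-MLM / LM: preceding context only
            return causal
        if mode == "prefix":                       # PrefixLM: bidirectional on the prefix,
            mask = causal.clone()                  # causal on the remaining tokens
            mask[:prefix_len, :prefix_len] = True
            return mask
        raise ValueError(mode)

    # Image tokens are typically fully visible to every text token, so the final mask
    # concatenates an all-True block for the image with the text mask above.
    print(text_attention_mask(5, mode="prefix", prefix_len=2))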
Image-Text Matching (ITM). In ITM, the model is given a batch of matched and mismatched
image-caption pairs, together with a binary label y ∈ {0, 1} indicating whether the sampled
image-caption pair is matched. Specifically, denote the output matching score
as sθ (w̃, ṽ); we apply the binary cross-entropy loss for optimization:
LITM (θ) = −E(w̃,ṽ)∼D [y log sθ (w̃, ṽ) + (1 − y) log(1 − sθ (w̃, ṽ))] . (3.2)
Besides randomly sampling a negative image-text pair, harder negative pairs can also be mined from
an image-text contrastive loss introduced below, which has been shown to be effective in improving
the downstream performance, as reported in ALBEF (Li et al., 2021a), VLMo (Wang et al., 2021c),
and FIBER (Dou et al., 2022a).
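A minimal sketch of the ITM head and the loss of Eq. (3.2), together with one common hard-negative
mining strategy (sampling negatives in proportion to the ITC similarities, in the style of ALBEF; the
helper names are illustrative assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ITMHead(nn.Module):
        # Binary classifier on top of the fused representation (e.g., the [CLS] output).
        def __init__(self, dim=768):
            super().__init__()
            self.fc = nn.Linear(dim, 1)

        def forward(self, cls_feat):               # cls_feat: (B, D)
            return torch.sigmoid(self.fc(cls_feat)).squeeze(-1)    # s_theta(w~, v~) in [0, 1]

    def itm_loss(scores, labels):
        # scores: (B,) matching probabilities; labels: (B,) with 1 = matched, 0 = mismatched.
        return F.binary_cross_entropy(scores, labels.float())

    def sample_hard_negative_texts(sim_i2t):
        # For each image, sample a non-matching caption index with probability proportional
        # to its ITC similarity, so harder negatives are picked more often.
        probs = F.softmax(sim_i2t, dim=1).clone()
        probs.fill_diagonal_(0)                    # exclude the matched caption
        return torch.multinomial(probs, 1).squeeze(-1)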
Image-Text Contrastive Learning (ITC). Early VLP models, such as UNITER (Chen et al.,
2020d) and VinVL (Zhang et al., 2021b), do not use ITC for pre-training (one exception is Light-
ningDOT (Sun et al., 2021)). Though the ITC loss is widely studied before VLP (Frome et al.,
2013), in the context of end-to-end VLP, it is mostly popularized by CLIP (Radford et al., 2021) and
ALIGN (Jia et al., 2021) to pre-train a dual encoder. Later on, it is also used to pre-train a fusion
encoder as in ALBEF (Li et al., 2021a). Note that this ITC loss is used on top of the outputs of
image and text encoders, before multimodal fusion (i.e., the use of w and v, instead of w̃ and ṽ).
Specifically, given a batch of N image-text pairs, ITC aims to predict the N matched pairs from
all the N^2 possible image-text pairs. With a slight abuse of notation, let {vi}_{i=1}^N and {wi}_{i=1}^N
denote respectively the normalized image vectors and text vectors in a training batch. To compute
image-to-text and text-to-image similarities, we have:

s^{i2t}_{i,j} = v_i^T w_j ,    s^{t2i}_{i,j} = w_i^T v_j ,    (3.3)

L^{i2t}_{ITC}(θ) = − (1/N) Σ_{i=1}^{N} log [ exp(s^{i2t}_{i,i}/σ) / Σ_{j=1}^{N} exp(s^{i2t}_{i,j}/σ) ] ,
L^{t2i}_{ITC}(θ) = − (1/N) Σ_{i=1}^{N} log [ exp(s^{t2i}_{i,i}/σ) / Σ_{j=1}^{N} exp(s^{t2i}_{i,j}/σ) ] ,    (3.4)

where σ is a (learnable) temperature parameter.
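Equations (3.3)-(3.4) translate almost line by line into code. A minimal sketch (the temperature value
and the averaging of the two directions are common choices rather than a fixed prescription):

    import torch
    import torch.nn.functional as F

    def itc_loss(v, w, sigma=0.07):
        # v, w: (N, D) image / text vectors from the two unimodal encoders (before fusion).
        v = F.normalize(v, dim=-1)
        w = F.normalize(w, dim=-1)
        sim_i2t = v @ w.t() / sigma                # s^{i2t}_{i,j} of Eq. (3.3)
        sim_t2i = w @ v.t() / sigma                # s^{t2i}_{i,j}
        targets = torch.arange(v.size(0), device=v.device)   # matched pairs lie on the diagonal
        loss_i2t = F.cross_entropy(sim_i2t, targets)          # first term of Eq. (3.4)
        loss_t2i = F.cross_entropy(sim_t2i, targets)          # second term of Eq. (3.4)
        return 0.5 * (loss_i2t + loss_t2i)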
Masked Image Modeling (MIM). Similar to the MLM objective, researchers have studied var-
ious masked image modeling (MIM) tasks for pre-training. Specifically, the model is trained to
reconstruct the masked patches or regions ṽm given the remaining visible patches or regions ṽ\m
and all the words w̃ as
LMIM (θ) = −E(w̃,ṽ)∼D log Pθ (ṽm | ṽ\m , w̃) . (3.5)
The designs of MIM can be divided into two categories.
• For OD-based VLP models, e.g., LXMERT (Tan and Bansal, 2019) and UNITER (Chen et al.,
2020d), some of the input regions are randomly masked (i.e., the visual features of the masked
regions are replaced by zeros), and the model is trained to regress the original region features via
minimizing the mean squared error loss. Researchers (Tan and Bansal, 2019; Lu et al., 2019; Chen
et al., 2020d) have also tried to first generate object labels for each region using a pre-trained object
detector, which can contain high-level semantic information, and the model is trained to predict
the object labels for the masked regions instead of the original region features.
• For end-to-end VLP models, e.g., ViLT (Kim et al., 2021) and METER (Dou et al., 2022b),
researchers have investigated the use of masked patch regression/classification for masked image
modeling. Specifically,
– For MIM with discrete VQ tokens, inspired by BEiT (Bao et al., 2022a), discrete VQ tokens
are first extracted for the input patches, and the model is then trained to reconstruct the discrete
tokens. Specifically, the VQ-VAE (van den Oord et al., 2017) model in DALL-E (Ramesh
et al., 2021) is first used to tokenize each image into a sequence of discrete tokens. Each
image is resized so that the number of patches is equal to the number of tokens, and thus each
patch corresponds to a discrete token. Then, we randomly mask 15% of the patches and feed
the masked image patches to the model as before, but now the model is trained to predict the
discrete tokens instead of the masked patches.
– For MIM with in-batch negatives, by imitating MLM which uses a text vocabulary, the model
is trained to reconstruct input patches by using a dynamic vocabulary constructed with in-
batch negatives. Concretely, at each training step, we sample a batch of image-caption pairs
{⟨v^k, w^k⟩}_{k=1}^B, where B is the batch size. We treat all the patches in {v^k}_{k=1}^B as candidate
patches. We mask 15% of the input patches, and for each masked patch, the model needs to select
the original patch from this candidate set. The model is trained to maximize the corresponding
probability, similar to noise contrastive estimation (Gutmann and Hyvärinen, 2010).
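A minimal sketch of the discrete-token variant (the frozen image tokenizer, e.g., a VQ-VAE, is only
referenced through its output ids here; the helper names and masking scheme are illustrative
assumptions):

    import torch
    import torch.nn.functional as F

    def random_patch_mask(batch_size, num_patches, mask_ratio=0.15):
        # Randomly mask roughly mask_ratio of the patches in each image.
        scores = torch.rand(batch_size, num_patches)
        k = max(1, int(mask_ratio * num_patches))
        thresh = scores.topk(k, dim=1).values[:, -1:]
        return scores >= thresh                     # (B, M) boolean mask, ~k True per row

    def mim_discrete_token_loss(patch_logits, vq_token_ids, mask):
        # patch_logits: (B, M, V) model predictions over the visual vocabulary of size V.
        # vq_token_ids: (B, M) discrete tokens from the frozen image tokenizer
        #               (one token per patch after resizing the image accordingly).
        # mask:         (B, M) boolean, True for the masked patches.
        return F.cross_entropy(patch_logits[mask], vq_token_ids[mask])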
Notably, recent state-of-the-art VLP models (e.g., VinVL (Zhang et al., 2021b), ALBEF (Li et al.,
2021a), VLMo (Wang et al., 2021c)) do not apply MIM during pre-training, and in ViLT (Kim et al.,
2021) and METER (Dou et al., 2022b), the authors also demonstrate that MIM is not helpful for
downstream performance. However, there are also recent works that adopt masked vision-language
modeling (as in MaskVLM (Kwon et al., 2022) and VL-BEiT (Bao et al., 2022b)), which try to
randomly mask patches/tokens while keeping the other modality intact.
Other Pre-training Tasks. Besides these typical pre-training tasks introduced above, researchers
have also investigated other possibilities. For example,
• UNITER (Chen et al., 2020d) proposes a word-region alignment objective that tries to align the
image and text features using optimal transport (Xie et al., 2020; Chen et al., 2019, 2020a).
• In E2E-VLP (Xu et al., 2021c), MDETR (Kamath et al., 2021), GLIP (Li et al., 2022h), and X-
VLM (Zeng et al., 2022b), bounding box prediction from object detection and phrase grounding
is directly used as a fine-grained pre-training task.
Case Study. Until now, we have introduced the general model architecture and popular pre-
training tasks in the image-text literature. To provide the readers with more concrete examples,
we select four representative models as case studies, including (i) UNITER (Chen et al., 2020d), an
OD-based image-text model; (ii) ViLT (Kim et al., 2021), a minimal end-to-end image-text model
that builds upon vision Transformer; (iii) ALBEF (Li et al., 2021a), an end-to-end image-text model
that uses both contrastive and generative objectives for pre-training, and (iv) SimVLM (Wang et al.,
2022k), the first large-scale pre-trained encoder-decoder image-text model with simple PrefixLM as
the pre-training objective. Below, we briefly review their architectures and pre-training tasks.
• UNITER. The architecture of UNITER is shown in Figure 3.6a. The image is encoded by an
offline pre-trained OD model to extract regional features. Together with positional embeddings,
these image features are then concatenated with word embeddings from the input text, followed by
several Transformer layers for multimodal fusion. The model is pre-trained via commonly used
tasks including masked language modeling, image-text matching, and masked region modeling.
          | COCO | VG   | CC3M | SBU  | Total
#Images   | 113K | 108K | 3.1M | 875K | 4.2M
#Captions | 567K | 5.4M | 3.1M | 875K | 10M
Table 3.2: Statistics of the pre-training datasets used in a typical academic setting.
The authors also provide a word-region alignment loss via the use of optimal transport. The
multimodal Transformer is initialized via the pre-trained BERT-base or BERT-large model.
• ViLT. Figure 3.6b illustrates the model architecture of ViLT, which is the simplest image-text
model one can imagine. The image is divided into patches, and encoded via patch embeddings,
and the text is encoded via word token embeddings. These features are concatenated and sent
to a Transformer, which is initialized via supervised pre-trained plain vision Transformer on Im-
ageNet22k. Pre-training was performed via masked language modeling, image-text matching,
matched patch modeling, and word-patch alignment.
• ALBEF. As shown in Figure 3.6c, ALBEF adopts a general VLP architecture which has also
been extensively studied in METER (Dou et al., 2022b). Specifically, a vision Transformer is
used to encode the image, the first 6 layers of a BERT model are used to encode the text, followed
by multimodal fusion via the last 6 layers of the BERT model. The key innovation lies in the
use of contrastive objectives during pre-training, which were popularized by CLIP but had not been
used for fusion-encoder-based image-text models. By incorporating the contrastive loss into pre-
training, fast image-text retrieval via simple dot-product of two feature vectors can be achieved,
while VQA and image captioning tasks that require deep multimodal fusion can also be tackled
via the top fusion layers.
• SimVLM. Lastly, we briefly discuss SimVLM, as shown in Figure 3.6d. CLIP and ALIGN are
the first two large-scale pre-trained dual encoders, which can be only applied to (zero-shot) image
classification and retrieval, while SimVLM is the first large-scale pre-trained encoder-decoder
model that can be used for tasks that require deep multimodal fusion. Furthermore, the pre-
training objective has been simplified as a single PrefixLM loss. The model shows great promise
for training big image-text models; a more detailed discussion is deferred to Section 3.5.1.
Industrial Setting. In what follows, we briefly describe some of the web-crawled image-text datasets
used in the industrial setting.
• The dataset used in CLIP (Radford et al., 2021) consists of 400 million image-text pairs, which
is built upon a set of 500,000 queries. The queries include all the words occurring at least 100
times in the English version of Wikipedia and are augmented with bi-grams. The image-text pairs
are searched such that the text includes one of the queries. The final results are also balanced to
include up to 20,000 image-text pairs per query.

Figure 3.7: Examples of what the web-crawled image-text datasets look like, with alt-texts such as
(1) "Massive pylon", (2) "Basalt rock", (3) "Bay horse", (4) "Burdock leaf", and (5) "Karate fight".
Figure credit: LEMON (Hu et al., 2022).
• The dataset used in ALIGN (Jia et al., 2021) has 1.8 billion image-text pairs. Later works such
as SimVLM (Wang et al., 2022k) and CoCa (Yu et al., 2022a) also use this dataset. The data
collection pipeline is similar to that used in Conceptual Captions (Sharma et al., 2018; Changpinyo
et al., 2021), but most cleaning steps are relaxed. Only some rule-based filters are applied, such
as image size, alt-text frequencies, and rare words.
• The Wikipedia-based Image-Text Dataset (WIT) (Srinivasan et al., 2021) is composed of 11.5
million unique images and 37.6 million texts. Different from the aforementioned datasets, it
features multilingual texts across 108 languages. The images and texts are collected from the
Wikipedia content pages. It provides texts from multiple sources, such as reference, attribution
and alt-texts, and texts in different languages for the same image.
• WenLan (Huo et al., 2021) consists of 30 million image-text pairs. The web-collected pairs have
gone through an elaborate cleaning process. For each data source, topic models are used to extract
topic words, and the topic distribution is analyzed to select desired contents.
• LAION-400M/5B (Schuhmann et al., 2021) contain 400 million and 5 billion image-text pairs,
respectively, and have recently been released to the public. Instead of applying human-designed
heuristics for data cleaning, these datasets rely on the CLIP (Radford et al., 2021) model to filter
image-text pairs: the cosine similarity between the image and text embeddings is computed, and
pairs with a score below a threshold of 0.3 are discarded (see the sketch after this list).
• RedCaps (Desai et al., 2021) comprises 12 million image-text pairs from 350 subreddits. It
contains everyday things that users like to share on social media, e.g., hobbies and pets. Captions
often contain specific and fine-grained descriptions.
• The dataset used in Florence (Yuan et al., 2021) and GIT (Wang et al., 2022d) contains 800
million image-text pairs, which include ALT200M introduced in LEMON (Hu et al., 2022). This
dataset has been scaled up to include 12 billion web-crawled image-text pairs.
The datasets such as CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), WIT (Srini-
vasan et al., 2021), RedCaps (Desai et al., 2021) and LAION-400M/5B (Schuhmann et al., 2021)
are released to the public with the image URLs and associated meta files. Other datasets are proprietary.
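As an illustration of CLIP-score filtering in the style of LAION, the sketch below uses the Hugging Face
transformers interface to CLIP (the checkpoint name, preprocessing calls, and threshold handling are
assumptions about implementation details, not LAION's actual pipeline, which processes the data at a
much larger scale).

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def keep_pair(image: Image.Image, alt_text: str, threshold: float = 0.3) -> bool:
        # Keep an image-text pair only if the CLIP cosine similarity between the image
        # and its alt-text is at least the threshold (0.3 in LAION-400M).
        inputs = processor(text=[alt_text], images=image, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"])
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        return (img_emb * txt_emb).sum(dim=-1).item() >= threshold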
Now, we provide some visual examples in Figure 3.7 to show what the web-crawled datasets look
like (examples are from ALT200M (Hu et al., 2022)). While some of the alt attributes are descriptive
sentences that can serve as good training targets, e.g., Figure 3.7 (7), (8), (9), it is noted that some
texts are not semantically well-formed, e.g., Figure 3.7 (10). Some texts are very short phrases
containing only 2 to 4 words, e.g., Figure 3.7 (1) - (6). Some texts do not precisely describe the
image content, but refer to external knowledge or information. For example, Figure 3.7 (12) shows
a woman pointing at the camera, but the text is “I got you”. The text stream in Figure 3.7 (11) is
likely to be extracted from news. The quality of the textual data does present some challenges for the
model to learn from noisy supervision. However, there are indeed a large variety of (fine-grained)
visual objects present in the images and texts, such as burdock leaf, karate, mahjong, and great
blue heron. Compared to human-annotated datasets, these web-collected data provide much richer
training resources, especially for long-tailed concepts.
Data Guidance. Collecting massive image-text pairs at ever larger scales has driven the development
of foundation models in VLP (see Section 3.5.1 for discussions about big models). However, these
large, mostly uncurated, web-scraped datasets are usually collected with little oversight. A recent
data audit (Birhane et al., 2021) of large-scale datasets (e.g., LAION-400M) uncovered a wide range
of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes
(e.g., stereotypical representations of people described as lawyers, flight attendants, homemakers,
etc.). One should always keep in mind during model development that training models on this data
risks reflecting or even scaling up the underlying problems. In addition, it is crucial to follow respon-
sible AI practices (Mitchell et al., 2019; Gebru et al., 2021; Pushkarna et al., 2022) to transparently
document and share information about datasets and models.
• Most big VLP models are obtained via either contrastive pre-training or generative pre-training, or
a combination of both. An illustration of how these big models look like is provided in Figure 3.8.
The use of ITC enables fast image-text retrieval and open-set image classification, while the use
of MLM or LM after the fusion module powers multimodal understanding tasks such as image
captioning and VQA.
• The current big VLP models typically contain roughly 1B parameters, pre-trained over roughly
1B-10B image-text pairs.
Model | Image Enc. | Text Enc.† | Fusion† | Total Size | PT Dataset Size | PT Tasks
CLIP ViT-L/14 (Radford et al., 2021) | 302M | 123M | 0 | 425M | 400M | ITC
ALIGN (Jia et al., 2021) | 480M | 340M | 0 | 820M | 1.8B | ITC
Florence (Yuan et al., 2021) | 637M | 256M | 0 | 893M | 900M | ITC
SimVLM-huge (Wang et al., 2022k) | 300M | 39M | 600M | 939M | 1.8B | PrefixLM
METER-huge (Dou et al., 2022b) | 637M | 125M | 220M | 982M | 900M+20M^1 | MLM+ITM
LEMON (Hu et al., 2022) | 147M^2 | 39M | 636M | 822M | 200M | MLM
Flamingo (Alayrac et al., 2022) | 200M | 70B | 10B | 80.2B | 2.1B+27M^3 | LM
GIT (Wang et al., 2022d) | 637M | 40M | 70M | 747M | 800M | LM
GIT2 (Wang et al., 2022d) | 4.8B | 40M | 260M | 5.1B | 12.9B | LM
CoCa (Yu et al., 2022a) | 1B | 477M | 623M | 2.1B | 1.8B+3B^4 | ITC+LM
BEiT-3 (Wang et al., 2022g) | 692M^5 | 692M^5 | 52M^5 | 1.9B | 21M+14M^6 | MIM+MLM+MVLM
PaLI (Chen et al., 2022e) | 3.9B | 40M | 13B | 16.9B | 1.6B | LM+VQA^7+OCR+OD
Table 3.3: A summary of recent big VLP models in terms of model size, pre-training dataset
size, and pre-training tasks. Note, that some of the numbers shown in this table are based on our
best estimate. 1 : 20M image-text pairs are used for VLP, while 900M data is used to pre-train the
Florence image encoder. 2 : This is the model size of an object detector as used in VinVL (Zhang
et al., 2021b). 3 : 2.1B image-text data plus 27M video-text data. 4 : 1.8B image-text data plus 3B
image-tag data before filtering. 5 : shared attention blocks contain another 317M parameters. 6 : 21M
image-text pairs plus 14M images from ImageNet-21K (additional 160GB documents are omitted
here). 7 : A complete set of pre-training tasks for PaLI (Chen et al., 2022e) include LM, PrefixLM,
VQA, VQG, OCR, and OD. † : In our context, a module that takes both image and text features as
input is considered as the fusion module, and a module that only takes text as input is considered
as text encoder. Sometimes, the fusion module is called a text decoder in the literature, such as in
SimVLM (Wang et al., 2022k), Flamingo (Alayrac et al., 2022), and GIT (Wang et al., 2022d). ITC:
image-text contrastive loss. ITM: image-text matching. MLM/LM: (masked) language modeling.
MIM: masked image modeling. MVLM: masked vision-language modeling.
• Flamingo (Alayrac et al., 2022) adopts a large frozen language model (70B in size) to keep
the in-context few-shot learning capability inherited from the pre-trained language model, while
GIT (Wang et al., 2022d) adopts a large contrastively pre-trained image encoder instead, with a
relatively small text decoder.
• Both Flamingo (Alayrac et al., 2022) and GIT (Wang et al., 2022d) first pre-train an image encoder
via contrastive learning, and then perform generative pre-training. However, the image encoder
and text decoder are both kept frozen in Flamingo (Alayrac et al., 2022); while the text decoder is
randomly initialized in GIT (Wang et al., 2022d), and the image encoder is not kept frozen during
the generative pre-training phase.
• Instead of performing contrastive and generative pre-training separately, CoCa (Yu et al., 2022a)
performs a joint contrastive and generative pre-training in one stage.
• By using only masked data modeling and a multi-way Transformer design, BEiT-3 (Wang et al.,
2022g) achieves state-of-the-art performance on VQA and other VL tasks.
While achieving state-of-the-art performance via full-model finetuning is encouraging, it is even more
desirable to train a model that can quickly adapt to different downstream tasks given only a few in-context
examples. In the context of language model pre-training, such a capability has been demonstrated in
GPT-3 (Brown et al., 2020) via large-scale pre-training on massive text corpora. Inspired by this,
researchers have also started to investigate multimodal in-context few-shot learning. Below, we
mainly discuss three pieces of work: Frozen (Tsimpoukelli et al., 2021), PICa (Yang et al., 2022d),
and Flamingo (Alayrac et al., 2022).
• Frozen (Tsimpoukelli et al., 2021) is the pioneering work on this topic. It shows that by using
a large frozen language model and learning an image encoder to align the embedding space of
images and text via a simple image captioning task, strong in-context few-shot learning perfor-
mance can be obtained. However, an image is encoded using only two global vectors, which are
not sufficient to capture all the visual information in the image. Further, the frozen language model is
only 7B in size, which may not be large enough.

Figure 3.8: Illustration of what the recent big VLP models look like. (a) Contrastive pre-training,
including models like CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), Florence (Yuan et al.,
2021), BASIC (Pham et al., 2021), etc. (b) Generative pre-training, including models like GIT (Wang
et al., 2022d) and Flamingo (Alayrac et al., 2022). LEMON (Hu et al., 2022) and most previous
OD-based VLP models also adopt this model architecture, but use additional pre-training losses
such as MLM and ITM. (c) Joint contrastive and generative pre-training, such as CoCa (Yu et al.,
2022a). METER (Dou et al., 2022b) also uses this model architecture, but is pre-trained with MLM
and ITM instead. Base-sized models such as ALBEF (Li et al., 2021a) and FIBER (Dou et al.,
2022a) also adopt both ITC and MLM losses. (d) Generative pre-training with an encoder-decoder
architecture, including models like SimVLM (Wang et al., 2022k) and PaLI (Chen et al., 2022e).
(e) VL-BEiT (Bao et al., 2022b) and BEiT-3 (Wang et al., 2022g) perform unified masked data
modeling with a multi-way Transformer design.
• In order to retain the strong in-context few-shot learning capability of the 175B-sized GPT-
3 (Brown et al., 2020), PICa (Yang et al., 2022d) proposes to prompt GPT-3 via the use of image
captions for multimodal few-shot learning, since GPT-3 can only read text but not images. With
such a simple approach, 4-shot prompting can already outperform supervised SoTA on the chal-
lenging OK-VQA benchmark that requires external knowledge to answer a question about an
input image correctly. However, its performance improvement on the VQAv2 dataset is limited,
since captions cannot capture every detail of an image, and fine-grained visual information can
be lost. Recently, in a similar spirit, VidIL (Wang et al., 2022j) is proposed to perform few-shot
video-language learning via inheriting the in-context learning capability from GPT-3 as well.
• To address the above challenges, Flamingo (Alayrac et al., 2022) proposes to use both a con-
trastively pre-trained frozen image encoder and a large frozen language model, and insert gated
cross-attention modules to bridge these two frozen models. By large-scale pre-training and using
a 70B-sized frozen language model, SoTA in-context few-shot learning results are reported.
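To make the gated cross-attention idea from the Flamingo bullet above concrete, below is a heavily
simplified sketch (Flamingo's actual block also gates a feed-forward layer and resamples the visual
features with a Perceiver-style module; names and sizes here are illustrative assumptions):

    import torch
    import torch.nn as nn

    class GatedCrossAttentionBlock(nn.Module):
        # Inserted between frozen language-model layers: text hidden states attend to the
        # visual features, and a tanh gate initialized at zero makes the block an identity
        # mapping at the start of training, preserving the frozen LM's behavior.
        def __init__(self, dim=768, heads=12):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))        # tanh(0) = 0 -> identity at init

        def forward(self, text_hidden, visual_feats):       # (B, N, D), (B, M, D)
            attended, _ = self.cross_attn(text_hidden, visual_feats, visual_feats)
            return text_hidden + torch.tanh(self.gate) * attended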
Besides relying on large language models, researchers have also explored other approaches for few-
shot learning. In FewVLM (Jin et al., 2022), the authors propose to train a VL-T5-like model (Cho
et al., 2021) with PrefixLM and MLM, and find that PrefixLM is helpful for zero/few-shot image
captioning, while MLM is good for zero/few-shot VQA. In TAP-C (Song et al., 2022), the authors
show that CLIP (Radford et al., 2021) can be a few-shot learner for VQA and visual entailment
tasks. For VQA, the authors propose to reformulate it as an image-text retrieval task; for visual
entailment, caption-hypothesis (text-text) pairs are used in training, while image-hypothesis
(image-text) pairs are used at inference.

Figure 3.9: The spectrum of image-text tasks that researchers have tried to unify. (a) Closed-set
classification, such as VQA, image-text retrieval, NLVR2, etc. (b) Open-ended text generation, such
as image captioning, visual storytelling, and open-ended VQA. (c) Box/mask localization, such as
phrase grounding, referring expression comprehension/segmentation, and grounded captioning. (d)
Pixel prediction, such as text-to-image generation, text-based image editing, etc. Figure credit:
Zhengyuan Yang’s CVPR 2022 tutorial slides on unified image-text modeling.
Zero-shot Image Captioning. A crucial benefit of training big VLP models is the potential of
achieving zero-shot generalization. In image-text tasks, while zero-shot retrieval can be readily
powered by the use of contrastive loss during pre-training, zero-shot image captioning has been
rarely evaluated, largely because the zero-shot performance is poor when the model is pre-trained
on web-scale noisy image-text pairs. Quantitative evaluation of zero-shot captioning is provided
in SimVLM (Wang et al., 2022k) and FewVLM (Jin et al., 2022), and qualitative visual examples
are provided in LEMON (Hu et al., 2022) and CM3 (Aghajanyan et al., 2022). Zero-shot image
captioning can also be achieved via the use of CLIP and GPT-2 together, as discussed in MAGIC (Su
et al., 2022) and ZeroCap (Tewel et al., 2022).
As shown in Figure 3.9, image-text tasks can be roughly divided into four categories: (i) Closed-set
classification, such as VQA, image-text retrieval, and visual reasoning; (ii) Open-ended text genera-
tion, such as image captioning, visual storytelling, and free-form open-ended VQA; (iii) Box/mask
localization, such as phrase grounding, referring expression comprehension/segmentation, and
grounded captioning; and (iv) Pixel prediction, such as text-to-image generation and text-based
image editing. How to design a unified image-text model that can support all these downstream
tasks becomes an increasingly important topic. We provide a brief summary of current attempts
towards this goal below.
• Unifying image-text tasks as text generation. Borrowing ideas from T5 (Raffel et al., 2020) and
BART (Lewis et al., 2020a), VL-T5 (Cho et al., 2021) proposes to use a sequence-to-sequence
(seq2seq) encoder-decoder framework to unify different VL tasks as text generation, so that dif-
ferent tasks can be directly supported without introducing task-specific heads. Since pre-trained
object detectors are used to (pre-)extract bounding boxes and the corresponding regional features,
the box prediction task in phrase grounding and referring expression comprehension becomes a
region index classification problem. However, the fact that the model cannot be end-to-end pre-
trained results in sub-optimal downstream performance. SimVLM (Wang et al., 2022k) proposes
a simple end-to-end seq2seq learning framework, and considers VQA as a text generation task as
in VL-T5, and performs large-scale pre-training.
• Unifying text generation and box prediction as language modeling. The approaches above
have unified certain image-text tasks (e.g., VQA, visual reasoning and image captioning) as text
generation. However, bounding box coordinates cannot be directly predicted. By quantizing
bounding box coordinates as discrete tokens, Pix2Seq (Chen et al., 2022c) and Pix2SeqV2 (Chen
et al., 2022d) propose to treat object detection (OD) as a language modeling task using a seq2seq
framework. Inspired by this, in UniTAB (Yang et al., 2021c), the authors have tried to unify text
generation and bounding box prediction into a single Transformer encoder-decoder architecture
via representing each bounding box using a set of discrete tokens, which enables UniTAB to
approach different VL tasks with a single set of parameters, generate desired text and box outputs
together, and meanwhile detect the alignments between words and boxes (see the coordinate-quantization
sketch after this list).
• Unifying text generation and image generation as language modeling. Through the use of
VQ-VAE (van den Oord et al., 2017; Razavi et al., 2019), images can also be represented as a
sequence of discrete image tokens. Therefore, image generation can be naturally regarded as a
language modeling task. Recent works, such as Taming Transformer (Esser et al., 2021b), DALL-
E (Ramesh et al., 2021), and Parti (Yu et al., 2022b), have shown that this approach can generate
high-quality realistic images. Inspired by this, recent work shows that image generation and text
generation (e.g., image captioning) can be unified, such as ERNIE-ViLG (Zhang et al., 2021a),
L-Verse (Kim et al., 2022), and DU-VLG (Huang et al., 2022). Furthermore, DaVinci (Diao
et al., 2022) combines a prefix image modeling task and a prefix language modeling (as used
in SimVLM (Wang et al., 2022k)) for pre-training. Aghajanyan et al. (2022) introduce CM3,
a causally masked generative model pre-trained over a large corpus of structured multi-modal
documents that can contain both text and image tokens (from a pre-trained VQVAE-GAN). After
pre-training, the authors show that the model can generate images unconditionally, conditioned on
text, and learn to perform image captioning in a zero-shot setting.
• Unifying text generation, box prediction and image generation all together. In OFA (Wang
et al., 2022e), the authors propose to unify text generation, box prediction, and image generation
all together, by combining the ideas of Pix2Seq (Chen et al., 2022c) and VQ-VAE (van den Oord
et al., 2017). Using the same idea, Unified-IO (Lu et al., 2022a) further supports modalities as
diverse as images, masks, key points, boxes, and text, and tasks as varied as depth estimation,
inpainting, semantic segmentation, captioning, and reading comprehension. However, the perfor-
mance of Unified-IO on downstream tasks is not satisfactory at its current stage.
• Unifying localization and VL understanding. Serializing bounding boxes as token sequences al-
lows the design of a unified model to tackle all tasks without introducing task-specific heads. This
is appealing. However, the downstream object detection (OD) performance is either not evaluated,
or still lagging behind the state of the art by a large margin. There is another line of work that tries
to unify localization and VL understanding but still uses additional OD heads to output bound-
ing boxes. Prominent examples include GPV-1 (Gupta et al., 2022a), MDETR (Kamath et al.,
2021), UniT (Hu and Singh, 2021), GLIPv2 (Zhang et al., 2022b), and FIBER (Dou et al., 2022a).
Specifically, GPV-1 (Gupta et al., 2022a) and GPV-2 (Kamath et al., 2022) advocate the concept
of general-purpose vision systems. MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022h)
propose to unify object detection and phrase grounding for grounded pre-training, which further
inspires GLIPv2 (Zhang et al., 2022b) to unify localization and VL understanding. FIBER (Dou
et al., 2022a) provides another solution to tackle both localization and VL understanding tasks,
by designing a new fusion-in-the-backbone architecture, and a new pre-training strategy, i.e., first
performing coarse-grained pre-training on image-text data, followed by fine-grained pre-training
on image-text-box data.
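As a concrete illustration of the box-as-tokens idea from the “Unifying text generation and box
prediction” bullet above, the sketch below quantizes box coordinates into a small coordinate vocabulary
that a sequence decoder can emit alongside text tokens (the number of bins and the helper names are
illustrative assumptions).

    def box_to_tokens(box, img_w, img_h, num_bins=1000):
        # Quantize [x1, y1, x2, y2] (in pixels) into discrete coordinate tokens.
        x1, y1, x2, y2 = box
        norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
        return [min(int(c * num_bins), num_bins - 1) for c in norm]

    def tokens_to_box(tokens, img_w, img_h, num_bins=1000):
        # Inverse mapping: recover approximate pixel coordinates from the tokens.
        x1, y1, x2, y2 = [(t + 0.5) / num_bins for t in tokens]
        return [x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h]

    # Example: on a 224x224 image, a box becomes four indices in the coordinate vocabulary.
    print(box_to_tokens([10, 20, 150, 200], 224, 224))      # [44, 89, 669, 892]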
Besides unifying different tasks within one framework, there are also works on designing a unified
Transformer. For example, UFO (Wang et al., 2021a) develops a unified Transformer that can be
flexibly used as dual encoder and fusion encoder. VLMo (Wang et al., 2021c) proposes to further
introduce additional modality-specific experts, and its scaled-up version BEiT-3 (Wang et al., 2022g)
has recently achieved state-of-the-art results on VQA and other VL tasks.
3.5.4 Knowledge
We mainly focus on knowledge-requiring VQA tasks that require external knowledge in addition to
the image content to answer a question correctly. Below, we divide the discussion into three parts.
• Datasets. The earliest explicit knowledge-based VQA datasets are KB-VQA (Wang et al., 2017b)
and FVQA (Wang et al., 2017a). However, the knowledge required in these datasets is retained
in the same knowledge graphs that are used to generate the dataset. KVQA (Shah et al., 2019b)
is based on images in Wikipedia articles. OK-VQA (Marino et al., 2019) is a recent popular
VQA dataset that requires external, open-domain knowledge to answer a question given an in-
put image. More recently, WebQA (Chang et al., 2022) is collected using web queries, and A-
OKVQA (Schwenk et al., 2022) is a crowdsourced dataset composed of a diverse set of questions
requiring a broader base of commonsense and world knowledge to answer.
• Knowledge sources. There are two categories of knowledge sources: (i) explicit structured sym-
bolic knowledge bases such as Wikipedia, ConceptNet, WordNet, and Google images; and (ii)
implicit unstructured knowledge bases, i.e., large-scale pre-trained language models such as GPT-
3 (Brown et al., 2020), where rich encyclopedia and commonsense knowledge has been encoded.
• Methods. Most studies followed a two-step approach to tackle the knowledge-based VQA tasks,
i.e., first retrieve knowledge from external resources, and then reason over the selected knowl-
edge, the input image, and question for answer prediction. Below, we mainly discuss meth-
ods designed for OK-VQA. Specifically, Shevchenko et al. (2021) propose to build a knowl-
edge base with knowledge embeddings, and then inject these knowledge embeddings into VLP
models. KRISP (Marino et al., 2021) proposes to retrieve the implicit knowledge stored in pre-
trained language models as a supplementary knowledge resource to the structured knowledge
base. MAVEx (Wu et al., 2022c) presents an answer validation approach to make better use of
the noisy retrieved knowledge. More recently, PICa (Yang et al., 2022d) shows that by prompting
GPT-3 via the use of image captioning and in-context few-shot learning, state-of-the-art results can
be obtained. This approach has been further enhanced in KAT (Gui et al., 2022) by additionally
retrieving knowledge from explicit knowledge bases.
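A toy illustration of caption-based prompting in the spirit of PICa (the prompt template below is a
simplification and an assumption; PICa additionally feeds object tags and carefully selects and orders
the in-context examples):

    def build_vqa_prompt(caption, question, in_context_examples):
        # in_context_examples: list of (caption, question, answer) triplets. The image is
        # conveyed to the text-only language model via its caption.
        prompt = "Please answer the question according to the context.\n\n"
        for ctx_caption, ctx_question, ctx_answer in in_context_examples:
            prompt += f"Context: {ctx_caption}\nQuestion: {ctx_question}\nAnswer: {ctx_answer}\n\n"
        prompt += f"Context: {caption}\nQuestion: {question}\nAnswer:"
        return prompt

    examples = [("a man riding a wave on a surfboard", "what sport is this?", "surfing")]
    print(build_vqa_prompt("a dog lying on the grass next to a frisbee",
                           "what game might the dog have been playing?", examples))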
Besides knowledge-based VQA that explicitly requires external knowledge to solve the tasks, there
also exist models such as ERNIE-ViL (Yu et al., 2021) and ROSITA (Cui et al., 2021) that use knowl-
edge encoded inside the scene graphs to improve performance on standard VL tasks (e.g., VQAv2
and image-text retrieval). By pre-training on large-scale image-text data, the recent GIT work (Wang
et al., 2022d) shows that rich multimodal knowledge about the visual world has been encoded in the
model weights, and the pre-trained model can readily recognize scene text, tables/charts, food, logos,
landmarks, characters, products, etc., and output this knowledge in natural language format when
finetuned on the TextCaps dataset (Sidorov et al., 2020). A related survey on knowledge-intensive
NLP tasks is Yin et al. (2022).
In the majority of the VLP literature, models are evaluated on standard benchmarks such as
VQAv2 (Goyal et al., 2017b), image captioning, NLVR2 (Suhr et al., 2019), visual entailment (Xie
et al., 2019), image-text retrieval, referring expression comprehension (Yu et al., 2018a), etc. These
benchmarks have driven tremendous progress in the field (e.g., see Figure 3.2), and some big VLP
models have even surpassed human performance on some of these tasks. Although this progress is
meaningful and exciting, we should not focus solely on topping the leaderboard, and should avoid
both over-claiming and under-claiming the capabilities learned by the models (as will be detailed in the
discussion below). To date, it remains unclear how robust these pre-trained models are. In what fol-
lows, we review popular approaches to robustness analysis along multiple dimensions: (i) diagnostic
tests; (ii) challenging sets that test out-of-distribution (OOD) generalization; (iii) human-adversarial
attacks; and (iv) probing analysis.
Diagnostic Tests. Diagnostic tests aim to verify one specific capability or one specific type
of robustness of VLP models. For example, Li et al. (2020c) have conducted a host of thor-
ough evaluations of OD-based VLP models, including (i) robustness against linguistic variation
via VQA-Rephrasings (Shah et al., 2019a); (ii) robustness against logical reasoning via VQA-
LOL (Gokhale et al., 2020); and (iii) robustness against visual content manipulation via IV-VQA
and CV-VQA (Agarwal et al., 2020). CLEVR (Johnson et al., 2017a) is a diagnostic dataset for
testing compositional visual reasoning. GQA (Hudson and Manning, 2019b) provides large-scale
rule-based questions from ground-truth scene graphs of real-world images to test VQA model’s abil-
ity on positional reasoning and relational reasoning. Winoground (Thrush et al., 2022) is a carefully
curated dataset to probe VLP models’ visio-linguistic compositionality on an image-text matching
task. Furthermore, Parcalabescu et al. (2020) propose to test VL models on counting tasks. The Vi-
sual Commonsense Tests (ViComTe) dataset (Zhang et al., 2022a) is created to test to what degree
unimodal (language-only) and multimodal (image and language) models capture a broad range of
visually salient attributes. VALSE (Parcalabescu et al., 2021) is proposed to test VLP models cen-
tered on linguistic phenomena. CARET (Jimenez et al., 2022) is proposed to systematically measure
consistency and robustness of modern VQA models through six fine-grained capability tests.
Human-Adversarial Attacks. To build a benchmark that can organically evolve over time, Li
et al. (2021b); Sheng et al. (2021) introduce Adversarial VQA datasets that are collected iteratively
via an adversarial human-and-model-in-the-loop procedure (Nie et al., 2020). Interestingly, they
find that during dataset collection, non-expert annotators can easily attack modern VLP models
successfully. These VLP models also achieve far worse performance on the new benchmark than
on the standard VQAv2 dataset. More recently, Bitton et al. (2022) introduce WinoGAViL, an
online game to collect VL associations, used as a dynamic benchmark to evaluate state-of-the-art
VLP models. On one hand, these benchmarks are valuable as they successfully demonstrate the
weaknesses of the SoTA VLP models, and shed new light on robustness studies in the community.
On the other hand, we also need to be careful not to under-claim the capabilities learned by the
models, as these datasets are specially collected to fool these models.
Probing Analysis. Besides testing VLP models on various benchmarks for robustness analysis,
there also exists a line of work that aims to probe and understand what has been learned in the
VLP models (Cao et al., 2020; Li et al., 2020d; Salin et al., 2022), such as cross-modal input ab-
lation test (Frank et al., 2021), verb understanding (Hendricks and Nematzadeh, 2021), bias anal-
ysis (Srinivasan and Bisk, 2022), the decoupling of the role of data, attention, and losses in VLP
models (Hendricks et al., 2021), to name a few.
VL for Language. With the advent of VLP models like CLIP (Radford et al., 2021) and
ALIGN (Jia et al., 2021), it has now been widely accepted that image-text data can be used to
learn strong image encoders from scratch, and enable zero-shot image classification capabilities. On
the other hand, human language is grounded in visual knowledge like colors, sizes, and shapes. A
natural question to ask is whether image-text data can also help learn better language representa-
tions. Vokenization (Tan and Bansal, 2020) and its follow-up work iACE (Lu et al., 2022c) propose
to ground language tokens to token-related images (vokens) to enrich learned language representa-
tions. In VidLanKD (Tang et al., 2021b), the authors show that it is beneficial to use video-distilled
knowledge transfer to improve language understanding tasks that involve world knowledge, physi-
cal reasoning, and temporal reasoning. Similarly, VaLM (Wang et al., 2022h) proposes to visually augment text tokens with relevant images retrieved via CLIP (Radford et al., 2021), and uses a
visual knowledge fusion layer to enable multimodal grounded language modeling. VaLM shows
substantial gains on object color and size reasoning, when compared with a text-only baseline.
Figure 3.10: Advanced topics in VLP for image-text tasks, covering big models, unified modeling, few-shot learning, model compression, VL for language, knowledge, robustness, probing analysis, and multilingual VLP. In each topic, we only list a few representative works due to limited space. If a model appears in one topic, it is not shown again in another topic to avoid repetition; for example, Flamingo (Alayrac et al., 2022) is shown under few-shot learning and therefore not under big models.
Model Compression. MiniVLM replaces the heavy object detector used in BUTD (Anderson et al., 2018a) with a lightweight one, saving computation costs without decreasing performance much. DistillVLM (Fang et al., 2021) proposes to perform knowledge distillation for
VLP. In the VL lottery ticket paper (Gan et al., 2022), the authors study the over-parameterization
of VLP models via the lens of lottery ticket hypothesis.
Efficient Adaptation. In the NLP literature (Liu et al., 2021a), techniques such as
adapter (Houlsby et al., 2019), prompt/prefix tuning (Li and Liang, 2021; Lester et al., 2021), multi-
task learning (Liu et al., 2019c), and LoRA (Hu et al., 2021) have been proposed for parameter-
efficient adaptation of large language models for downstream tasks. By adding a few adapter layers
into the Transformer backbone, or adding some learnable continuous prompt vectors, while freez-
ing the pre-trained language model backbone, comparable or even better performance can be ob-
tained in downstream tasks. The same idea has also been investigated in VL-Adapter (Sung et al.,
2022b), where a few adapter and prompt-tuning variants have been carefully tested. More recently,
Sung et al. (2022a) propose ladder side-tuning for parameter and memory efficient transfer learning,
which has also been applied to image-text tasks.
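As a concrete illustration of this idea, the following is a minimal PyTorch sketch of a Houlsby-style bottleneck adapter inserted into a frozen backbone, so that only the adapter parameters are trained. The hidden/bottleneck sizes and the freezing helper are illustrative assumptions, not tied to any specific VLP codebase.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby et al., 2019): down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual connection around the bottleneck

def freeze_backbone_except_adapters(model):
    # Freeze all pre-trained weights; only modules whose name contains "adapter" stay trainable.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

In practice, one adapter is typically inserted after the attention and feed-forward sub-layers of each Transformer block, so the newly trained parameters amount to only a few percent of the backbone.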
Multilingual VLP. Most VLP studies focus on English-only VL benchmarks, which leaves mul-
tilingual VLP a relatively less explored territory. To enable VLP models to support multiple lan-
guages, UC2 (Zhou et al., 2021) and M3P (Ni et al., 2021) propose to add multilingual text encoders,
and show how to use both English-only and multilingual data for joint pre-training. In MURAL (Jain
et al., 2021), the authors pre-train a dual encoder that solves both image-text matching and transla-
tion pair matching tasks. By incorporating billions of translation pairs, MURAL (Jain et al., 2021)
extends ALIGN (Jia et al., 2021) to multilingual scenarios. More recently, in CCLM (Zeng et al.,
40
2022c), the authors introduce cross-view language modeling that unifies cross-lingual cross-modal
pre-training using the ALBEF (Li et al., 2021a) model architecture, and claim that CCLM is the
first multi-lingual multi-modal model that surpasses the translate-test performance of representative
English VL models by zero-shot cross-lingual transfer. The most recent model is PaLI (Chen et al.,
2022e), which can be considered a multilingual version of SimVLM (Wang et al., 2022k).
Unsupervised VLP. Inspired by unsupervised machine translation, in Li et al. (2021d), the au-
thors investigate whether a strong VLP model can be learned without parallel image-text data. They
propose to conduct masked-modeling-based pre-training on text-only and image-only data, and in-
troduce object tags detected by an OD model to serve as anchor points to bridge the two modalities.
Zhou et al. (2022c) suggest that using tags alone is not sufficient, and propose to first construct a
weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular
alignment pre-training tasks to bridge the gap between the two modalities. On one hand, exploring
VLP under an unsupervised setting looks appealing and is a valid research problem; on the other
hand, paired image-text data is actually not very difficult to collect and scale up, and there already
exist many such large-scale image-text datasets as discussed in Section 3.4, suggesting that unsu-
pervised VLP may not be an urgent problem in practice. It would be interesting to investigate how
image-only and text-only data can help improve the downstream performance of a model pre-trained
on image-text data.
Socratic Models. Large foundation models are ubiquitous nowadays. Different models store dif-
ferent forms of knowledge across different domains. What if these large foundation models have
a way to communicate and cooperate with each other? Will composing different foundation mod-
els in a zero-shot or few-shot manner enable new capabilities? To answer this, Zeng et al. (2022a)
propose the concept of Socratic Models, which use language as the representation by which inter-
domain foundation models can jointly be used for inference. Several models belong to this category.
For example, PICa (Yang et al., 2022d) uses the cooperation of VinVL (a SoTA image captioning
model) (Zhang et al., 2021b) and GPT-3 for few-shot knowledge-based VQA. MAGIC (Su et al.,
2022) uses a CLIP-induced score to regularize the language generation of GPT-2 so that the zero-
shot generated caption is semantically related to the given image. BEST (Xie et al., 2022) uses the
cooperation of Florence (Yuan et al., 2021) and GPT-3 for visual storytelling and image paragraph
captioning. Wang et al. (2022j) propose the cooperation of CLIP, BLIP (Li et al., 2022f), and GPT-3
for few-shot video-language learning. Flamingo (Alayrac et al., 2022) uses a frozen image encoder
and a big frozen language decoder, builds the connection between them via inserting cross-attention
blocks, and performs large-scale pre-training.
More Applications. Besides VLP for standard VL tasks, VLP has also been applied to tackle (i)
TextVQA (Singh et al., 2019) and TextCaps (Sidorov et al., 2020) tasks that require an AI system to
comprehend scene text in order to perform VQA and captioning, such as TAP (Yang et al., 2021d)
and LaTr (Biten et al., 2022); (ii) visual dialog (Das et al., 2017) that requires an AI system to
chat about an input image, such as VisDial-BERT (Murahari et al., 2020) and VD-BERT (Wang
et al., 2020b); (iii) fashion-domain tasks, such as Kaleido-BERT (Zhuge et al., 2021) and Fashion-
VLP (Goenka et al., 2022); and (iv) vision-language navigation (VLN), such as PREVALENT (Hao
et al., 2020) and VLN-BERT (Hong et al., 2021), to name a few. A detailed literature review on
VLN can be found in Gu et al. (2022b).
Before VLP. As the pioneering work in T2I generation, Mansimov et al. (2016) shows that a recurrent variational auto-encoder can generate novel visual scenes conditioned on image captions; however, the quality of the generated images is not satisfactory. Research on T2I generation was then greatly advanced by the prosperity of generative adversarial networks (GANs). Reed et al. (2016) extended conditional GANs to T2I generation, which was shown to work on restricted datasets (e.g., Oxford-102 Flowers and CUB-200 Birds) with relatively small image resolutions (64×64). In
recent years, this field has made remarkable progress thanks to improved multimodal encoding (e.g., StackGAN (Zhang et al., 2017), StackGAN++ (Zhang et al., 2018c)), novel attention mechanisms (e.g., AttnGAN (Xu et al., 2018), SEGAN (Tan et al., 2019), ControlGAN (Li et al., 2019a)), the use of cycle structure (e.g., MirrorGAN (Qiao et al., 2019)), etc.

[Figure 3.11: Illustration of text-to-image generation methods: (a) VQ-token-based auto-regressive models such as DALL·E and Parti, which predict discrete visual tokens that are decoded into a 256x256 image by a VQGAN decoder and upsampled to 1024x1024 by a super-resolution/diffusion module; (b) diffusion-based models such as DALL·E 2 and Imagen, which generate a 256x256 image from text embeddings (optionally through a prior model) via a reverse-diffusion process, followed by super-resolution to 1024x1024.]
To extend the success of GANs to the limited-data regime, it is common to use pre-training, i.e., initializing the optimization process with GAN models pre-trained on some large datasets (Grigoryev et al., 2022). However, most GAN-based pre-training is conducted on image datasets only and does not leverage the image-text pairs used for vision-language pre-training (VLP). A notable exception is recent work that uses the CLIP model in GAN-based methods, such as LAFITE (Zhou et al., 2022f), the first work to train T2I generation models without using text data explicitly.
In the Context of VLP. While GAN-based methods are still popular for image synthesis, a paradigm shift is underway for T2I generation. In the context of VLP, we classify these methods into two
categories: (i) VQ-token-based auto-regressive methods (e.g., DALL-E (Ramesh et al., 2021) and
Parti (Yu et al., 2022b)), and (ii) diffusion-based methods (e.g., DALL-E 2 (Ramesh et al., 2022) and
Imagen (Saharia et al., 2022)). An illustration of these methods is provided in Figure 3.11. Below,
we provide a brief review on these recent works.
Discrete Token Representation. In 2017, VQ-VAE (van den Oord et al., 2017) was proposed, which provides a simple yet powerful generative model that learns discrete representations for high-quality image reconstruction. Later on, VQ-VAE-2 (Razavi et al., 2019) showed that high-fidelity and high-resolution images can be generated. With the prevalence of the Transformer model (Vaswani et al., 2017), which has achieved impressive improvements in domains such as language modeling (Devlin et al., 2019) and image generative pre-training (Chen et al., 2020b), the modeling of VQ token sequences is also naturally handled by Transformers (Esser et al., 2021b).
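To sketch how such VQ-token-based modeling is used for T2I generation (detailed in the following paragraphs), the snippet below concatenates text tokens with the discrete visual tokens produced by a VQ-VAE/VQGAN encoder and trains a decoder-only Transformer with next-token prediction; the tokenizer, vocabulary sizes, and Transformer are placeholders, and the exact loss masking varies across systems.

```python
import torch
import torch.nn.functional as F

def t2i_autoregressive_step(text_tokens, image, vq_encoder, transformer, text_vocab_size):
    # text_tokens: (B, N_text) integer ids; image: (B, 3, H, W).
    image_tokens = vq_encoder(image) + text_vocab_size   # (B, N_img) codebook ids, shifted into a shared vocab
    seq = torch.cat([text_tokens, image_tokens], dim=1)  # one flat sequence: text first, then image tokens

    logits = transformer(seq[:, :-1])                    # predict token t+1 from tokens <= t
    # For simplicity, the loss here covers both text and image positions.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
    return loss

# At inference time, image tokens are sampled auto-regressively given the text tokens, then mapped
# back to pixels by the VQGAN decoder (optionally followed by a super-resolution module).
```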
Figure 3.12: Auto-regressive and diffusion-based text-to-image/video models developed over time, from DALL-E and NUWA to GLIDE, DALL-E 2, Imagen, NUWA-Infinity, Make-A-Video, and Phenaki. Only some representative works are shown.
NUWA-Infinity further supports the generation of images of arbitrary aspect ratio. Parti (Yu et al., 2022b) adopts a similar Transformer-based encoder-decoder architecture, trains the model at scale, and demonstrates impressive image generation results. Make-A-Scene (Gafni et al., 2022) uses a segmentation map (either provided or generated) as additional input to further aid the image generation process. Other examples include CogView (Ding et al., 2021) and CogView2 (Ding et al., 2022a), which are also similar in spirit to DALL-E (Ramesh et al., 2021).
Bi-Directional Image-Text Generation. ERNIE-ViLG (Zhang et al., 2021a), L-Verse (Kim et al.,
2022) and OFA (Wang et al., 2022f) demonstrate that large-scale generative joint pre-training for
both text and image tokens (from VQ-VAE) can be finetuned on diverse downstream tasks, such as
style learning (domain-specific text-to-image), super-resolution (image-to-image), image captioning
(image-to-text), and even text-image retrieval, etc.
Continuous Diffusion. Recently, diffusion models such as denoising diffusion probabilistic mod-
els (DDPM) (Ho et al., 2020) have achieved great successes in image generation tasks. Recent
works (Dhariwal and Nichol, 2021) have demonstrated even higher quality image synthesis com-
pared to VQ-token-based models and GANs. Furthermore, a recent denoising diffusion implicit
model (DDIM) (Song et al., 2021) further accelerates the sampling procedure and enables nearly
perfect inversion. We refer the readers to Yang et al. (2022c) for a comprehensive survey of diffu-
sion models.
To extend diffusion-based methods to T2I generation, GLIDE (Nichol et al., 2021) adopts continuous diffusion, compares CLIP guidance and classifier-free guidance in diffusion models, and concludes that a 3.5-billion-parameter diffusion model with classifier-free guidance outperforms DALL-E in terms of human evaluation. More recently, DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022) and Stable Diffusion (a scaled-up version of Latent Diffusion (Rombach et al., 2022)) have pushed this line of work to a new level, especially due to the open-source efforts around Stable Diffusion. Instead of performing diffusion in the pixel space as in DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022), the Latent Diffusion model (Rombach et al., 2022) performs diffusion in a continuous latent space.
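A minimal sketch of the classifier-free guidance step used by GLIDE-style diffusion samplers is given below: the denoising network is evaluated with and without the text condition, and the two noise estimates are extrapolated with a guidance weight w. Here, eps_model, the embeddings, and the weight value are placeholder assumptions.

```python
import torch

@torch.no_grad()
def classifier_free_guided_eps(eps_model, x_t, t, text_emb, null_emb, w=7.5):
    eps_cond = eps_model(x_t, t, text_emb)      # noise prediction conditioned on the caption
    eps_uncond = eps_model(x_t, t, null_emb)    # noise prediction with an empty/null caption
    return eps_uncond + w * (eps_cond - eps_uncond)

# The guided noise estimate is then plugged into the standard DDPM/DDIM update to obtain x_{t-1}.
```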
Text-to-Video Generation. The field is progressing at a rapid pace. Not content with text-to-image generation, recent works such as Make-A-Video (Singer et al., 2022), Imagen Video (Ho et al., 2022), and Phenaki (Villegas et al., 2022) have lifted the quality of text-to-video generation to a new level.
Chapter 4
Computer vision has become ubiquitous in our society, with applications in visual search, image
understanding, mapping, medicine, and self-driving cars. Core to many of these applications are
visual recognition tasks such as image classification and object detection. The primary goal of these
tasks is to assign a semantically meaningful concept to the visual instance such as images or regions.
Traditional computer vision systems are trained to predict a fixed set of predetermined concepts,
such as the image class labels on ImageNet (Deng et al., 2009)/JFT300M, the object categories on
COCO (Lin et al., 2014), and so on. Although close-to-human performance has been reported on
these tasks, the restricted form of a closed set of concepts limits models’ generality and usability,
since additional labeled data is needed to specify semantic concepts that are unseen in training data.
In this chapter, we describe how recent advances in VLP tackle the core visual recognition problems.
Section 4.1 provides an overview of the rationale behind the paradigm shift. This is exemplified by the three
vision problems, including image classification in Section 4.2, object detection in Section 4.3 and
image segmentation in Section 4.4. Section 4.5 outlines the trend of computer vision in the wild,
and Section 4.6 summarizes the chapter with a discussion on advanced topics.
4.1 Overview
Recent state-of-the-art computer vision systems are trained from free-form natural language supervi-
sion, ranging from simple object category names to descriptive captions. These language-augmented
visual models have shown strong transfer ability. We believe that two following factors contribute
to the paradigm shift.
(1) Open-set recognition is enabled due to the problem reformulation from classification to retrieval.
    The traditional classification formulation defines and learns a fixed set of embedding vectors, each
    of which represents an object category. It is infeasible for such models to predict and transfer beyond
    this closed set of concepts. The alternative is to cast image classification as an image-to-text
    retrieval task, where one searches for the concepts matched to an image (or regions in the image).
    Parametric models such as neural networks are employed to encode both images and language
    (concepts) and perform dense retrieval to match an image with its relevant concepts.
(2) Model generality and usability are improved since the form of language supervision allows a wide
range of visual concepts to be represented. The fixed set of visual concepts is an over-simplified
representation of visual concepts, due to the compactness requirement in a classification head.
In contrast, the newly introduced text encoder in the retrieval formulation is capable of dealing
with a much larger concept pool. Natural language is semantically richer than any set of concept
labels (e.g., object categories). The text sequence form of language also allows representing external knowledge (e.g., from WordNet and Wikipedia) in the same format as image captions and concept labels, further boosting the concept coverage.
In this chapter, we illustrate the paradigm shift by presenting case studies on three prominent com-
puter vision tasks, image classification (IC), object detection (OD), and segmentation. We review
UniCL (Yang et al., 2022b), CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021) for IC, ViLD (Gu
et al., 2022d), RegionCLIP (Zhong et al., 2022), GLIP (Li et al., 2022h) for OD, and LSeg (Li et al., 2022a), OpenSeg (Ghiasi et al., 2022), DenseCLIP (Rao et al., 2021) for image segmentation.

[Figure 4.1: Contrastive vision-language learning models for core vision tasks developed over time, from CLIP (Jan. 2021) and ViLD to FILIP, Florence, GLIP, OpenSeg, Detic, OV-DETR, K-Lite, and GLIPv2 (June 2022).]

Table 4.1: Glossary of representative VLP models for core vision tasks. For data scale, we report # image-text pairs, including both image-label and image-caption data. IC: image classification. OD: object detection. LocNar: Localized Narratives. Golden-G is the mixed golden ground-truth grounding data processed in MDETR (Kamath et al., 2021). ITC: image-text contrastive learning. WRA: word-region alignment. TP: token prediction. SSL: self-supervised learning.
We present a glossary of representative VLP models in Table 4.1, where models are described along
multiple dimensions. In Figure 4.1, we show how these VLP models evolve along time. This line of
research equips computer vision models with the capability of open-set visual recognition, opening
the possibilities of building generalizable computer vision systems with a strong task-level transfer
ability, and thus paving the way towards Computer Vision in the Wild (CVinW) (Li et al., 2022b).
[Figure 4.2: An example image-text pair and the image-text matching target matrix for a batch of four image-text pairs (dog, dog, cat, parrots).]
Here, given an image x and a language description t, with image encoder f_θ and text encoder f_φ, ũ = f_θ(x) and ṽ = f_φ(t) denote the vector representations of the entire image and sentence, respectively. For the i-th image x_i and the j-th language description t_j in a batch B, we normalize their feature vectors onto a hyper-sphere via u_i = f_θ(x_i)/‖f_θ(x_i)‖ and v_j = f_φ(t_j)/‖f_φ(t_j)‖, and their similarity is calculated as s_ij = u_i^⊤ v_j. Figure 4.2 shows an example image-text pair and a batch of four image-text pairs.
UniCL. A bidirectional supervised contrastive learning objective is defined based on the matching between images and language descriptions (Yang et al., 2022b):

$$\min_{\{\theta,\phi\}} \mathcal{L}_{\mathrm{UniCL}} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}, \quad \text{with} \tag{4.1}$$

$$\mathcal{L}_{i2t} = -\sum_{i\in\mathcal{B}} \frac{1}{|\mathcal{P}(i)|} \sum_{k\in\mathcal{P}(i)} \log \frac{\exp(\tau\, u_i^{\top} v_k)}{\sum_{j\in\mathcal{B}} \exp(\tau\, u_i^{\top} v_j)}, \tag{4.2}$$

$$\mathcal{L}_{t2i} = -\sum_{j\in\mathcal{B}} \frac{1}{|\mathcal{Q}(j)|} \sum_{k\in\mathcal{Q}(j)} \log \frac{\exp(\tau\, u_k^{\top} v_j)}{\sum_{i\in\mathcal{B}} \exp(\tau\, u_i^{\top} v_j)}, \tag{4.3}$$

where P(i) denotes the set of positive language descriptions for image i in the batch, and Q(j) denotes the set of positive images for description j.
CLIP/ALIGN. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) assume that there are only one-to-one mappings between an image and its paired caption in a batch, i.e., P(i) = {i} and Q(j) = {j}. The CLIP training objective is

$$\min_{\{\theta,\phi\}} \mathcal{L}_{\mathrm{CLIP}} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}, \quad \text{with} \tag{4.4}$$

$$\mathcal{L}_{i2t} = -\sum_{i\in\mathcal{B}} \log \frac{\exp(\tau\, u_i^{\top} v_i)}{\sum_{j\in\mathcal{B}} \exp(\tau\, u_i^{\top} v_j)}, \tag{4.5}$$

$$\mathcal{L}_{t2i} = -\sum_{j\in\mathcal{B}} \log \frac{\exp(\tau\, u_j^{\top} v_j)}{\sum_{i\in\mathcal{B}} \exp(\tau\, u_i^{\top} v_j)}. \tag{4.6}$$

For the example in Figure 4.2, CLIP or ALIGN only considers the on-diagonal elements as positive, and all off-diagonal elements as negative. Ideally, CLIP or ALIGN should be applied to image-text pairs without duplication in either modality.
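To make the objectives above concrete, below is a minimal PyTorch sketch of the bidirectional contrastive loss in (4.1)-(4.3), with the CLIP/ALIGN objective in (4.4)-(4.6) recovered by using an identity target matrix. Tensor names and the temperature handling are illustrative assumptions (in many open-source implementations, τ is a learned logit scale rather than a constant).

```python
import torch
import torch.nn.functional as F

def unicl_loss(image_feats, text_feats, targets, tau=100.0):
    # image_feats: (B, d) raw image features f_theta(x); text_feats: (B, d) raw text features f_phi(t).
    # targets: (B, B) binary matrix with targets[i, j] = 1 if image i and text j form a positive pair;
    # for CLIP/ALIGN this is simply the identity matrix.
    u = F.normalize(image_feats, dim=-1)              # u_i in (4.2)-(4.3)
    v = F.normalize(text_feats, dim=-1)               # v_j
    logits = tau * u @ v.t()                          # tau * u_i^T v_j, shape (B, B)

    # Image-to-text term (4.2): average the log-softmax over each image's positive set P(i).
    log_p_i2t = F.log_softmax(logits, dim=1)
    loss_i2t = -((targets * log_p_i2t).sum(1) / targets.sum(1).clamp(min=1)).sum()

    # Text-to-image term (4.3): average over each text's positive set Q(j).
    log_p_t2i = F.log_softmax(logits, dim=0)
    loss_t2i = -((targets * log_p_t2i).sum(0) / targets.sum(0).clamp(min=1)).sum()

    return loss_i2t + loss_t2i

# CLIP/ALIGN as the one-to-one special case.
B, d = 8, 512
img_feats, txt_feats = torch.randn(B, d), torch.randn(B, d)
loss = unicl_loss(img_feats, txt_feats, targets=torch.eye(B))
```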
Connections to Traditional Classification Formulation. Note that L_UniCL in (4.1) is closely related to the standard cross-entropy loss used in supervised image classification problems. Specifically, the image-to-language contrastive term in (4.2) recovers cross-entropy as a special case, when the following three conditions are satisfied. (i) The text encoder f_φ is represented as a simple linear embedding layer W. (ii) The batch size |B| is sufficiently larger than the number of classes K, so that all the class embedding vectors are used in contrastive learning when stochastic sampling is used for training. (iii) τ = 1, and ℓ2 normalization is excluded, so that ũ = u and ṽ = v. In practice, all of these conditions can be easily satisfied, and (4.2) becomes

$$\min_{\{\theta, W\}} \mathcal{L}_{\mathrm{IC}} = -\sum_{i\in\mathcal{B}} \log \frac{\exp(w_{\hat{y}}^{\top} \tilde{u}_i)}{\sum_{k=1}^{K} \exp(w_k^{\top} \tilde{u}_i)}, \tag{4.7}$$
where ŷ is the ground-truth label for the i-th image in the batch.
Other Language-Image Pre-training Methods for IC. Learning a vision backbone from web-
scale image-text pairs is an emerging research topic. There is an increasing number of recent papers aiming to improve the zero-shot/few-shot performance of IC in the wild.
• Improved Contrastive Pre-training Objectives. FILIP (Yao et al., 2022) bootstraps the fine-
grained region-word correspondences. PyramidCLIP (Gao et al., 2022b) constructs an input pyra-
mid with different semantic levels, and aligns two modalities in the form of hierarchy via both intra
and cross-level alignment. Prefix conditioning (Saito et al., 2022) introduces the use of prefixed
prompt to combine image-caption and image-label data, based on the data type. CyCLIP (Goel
et al., 2022) shows that consistent representations can be learned by explicitly symmetrizing the
similarity between the two mismatched image-text pairs (cross-modal consistency), and the simi-
larity between the image-image pair and the text-text pair (in-modal consistency).
• Self-supervised + Contrastive Objectives. DeCLIP (Li et al., 2022j) comprehensively investi-
gates multiple single-modality self-supervision signals in image-text pairs. SLIP (Mu et al., 2021)
studies the integration of image-to-image self-supervised learning and image-to-text contrastive
learning. Masked image/language modeling is also combined with image-to-text contrastive learn-
ing, such as MultiMAE (Bachmann et al., 2022) and M3AE (Geng et al., 2022).
• Frozen Models. LiT (Zhai et al., 2022) introduces the “contrastive-tuning” method, showing that
locking the pre-trained image encoder and tuning the text encoder works best for zero-shot transfer.
Flamingo (Alayrac et al., 2022) leverages pre-trained models from each single modality, and con-
tinues to pre-train the cross-modal module to achieve impressive image classification performance
using in-context learning.
• Scaling. Due to the promising results of web-scale pre-training for computer vision tasks, there is
a trend to explore the scaling success of VLP models. BASIC (Pham et al., 2021) is proposed to
scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size,
model size, and batch size, achieving 85.7% zero-shot accuracy on ImageNet. LIMoE (Mustafa
et al., 2022) is a sparse mixture of experts model capable of language-image multimodal learning.
Pathways Language and Image model (PaLI) (Chen et al., 2022e) finds that joint scaling of the
vision and language components is important. Since existing Transformer language models are
much larger than their vision counterparts, PaLI trains the largest ViT to date to quantify the benefits from even larger-capacity vision models, based on a large multilingual mix of pre-training tasks and a new image-text training set containing 10B images and texts in over 100 languages.
In the literature, there are two different experiment settings to evaluate the open-set image classifi-
cation ability of pre-trained models.
• Class-level Transfer in a Single Domain. The traditional zero-shot transfer evaluation for image
classification has been studied for decades, where a manual split is pre-defined in a given visual
domain, ensuring that evaluation concepts are not observed in training. Examples include Animals
with Attributes (AwA) (Lampert et al., 2013), Caltech-UCSD Birds-200 (CUB) (Wah et al., 2011),
SUN (Patterson and Hays, 2012), aPY (Farhadi et al., 2009), and ZS-ImageNet (Rohrbach et al.,
2011; Fu and Sigal, 2016).
• Task-level Transfer. To demonstrate the strong usability and generality of CLIP, Radford et al.
(2021) directly apply the pre-trained checkpoint to recognize any concepts in around 30 public im-
age classification datasets in the community. Impressive results are reported, though the model has
“never observed” the images from these downstream datasets. This quickly popularized the zero-shot task-transfer evaluation for computer vision foundation models. Many CLIP variants (Li et al., 2022j; Gao et al., 2022b; Yang et al., 2022b) have been proposed, but these works perform evaluation using different downstream datasets, making their results not directly comparable. The recent Image Classification in the Wild (ICinW) benchmark is an attempt to formalize the task-level evaluation with 20 public datasets (Li et al., 2022b).

Figure 4.3: Top: CLIP pre-trains an image encoder and a text encoder to predict which images are paired with which texts in a dataset/batch. This behavior allows us to turn CLIP into a zero-shot classifier. We convert all the classes into captions such as “a photo of a dog” and predict the class of the caption that CLIP estimates best pairs with a given image. Bottom: predictions of zero-shot CLIP classifiers on examples from four datasets. This figure was created in Radford et al. (2021).
Use Cases of Language-Image Models for IC. In Figure 4.3, we illustrate how an image-text contrastive-trained model like CLIP can be used for zero-shot image classification. Given a new IC dataset/task with a set of concept/category names, each concept is converted into a caption by augmenting it with various text templates. The caption is used as a prompt for the text encoder to extract concept representations. The query image is fed into the image encoder to extract the visual representation, which is used to compute the similarity with respect to all concepts. The concept with the highest similarity is the predicted one. In the bottom of Figure 4.3, four cases are illustrated: one is from ImageNet and the others are from ICinW, representing real-world IC scenarios.
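The procedure just described can be sketched as follows. This is a minimal illustration in which image_encoder, text_encoder, and tokenize are placeholders for whatever CLIP-style dual-encoder checkpoint is used, and the prompt templates are illustrative.

```python
import torch
import torch.nn.functional as F

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a close-up photo of a {}."]

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Build one text embedding per class by averaging over prompt templates.
    class_embs = []
    for name in class_names:
        tokens = tokenize([t.format(name) for t in templates])
        emb = F.normalize(text_encoder(tokens), dim=-1).mean(0)
        class_embs.append(F.normalize(emb, dim=-1))
    class_embs = torch.stack(class_embs)                   # (K, d)

    img_emb = F.normalize(image_encoder(image), dim=-1)    # (1, d)
    scores = img_emb @ class_embs.t()                      # cosine similarities to all concepts
    return class_names[scores.argmax(-1).item()]
```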
A typical object detection task contains two sub-tasks. (i) Localization aims to locate the presence
of objects in an image and indicate the position with a bounding box. (ii) Recognition determines
what object categories are present in the region of interest (or bounding box). The recognition task
is similar to the image classification task (Section 4.2), except that classification is performed on the
entire image in IC but on individual regions/boxes in OD. Therefore, by following the reformulation
that converts classification to retrieval as described in Section 4.2, one may improve OD models’
transfer ability for open-set recognition. Specifically, each region/box feature is fed into two pre-
diction heads, i.e., a box classifier and a box regressor, which are trained with the classification loss
Lcls and the localization loss Lloc , respectively:
$$\mathcal{L}_{\mathrm{OD}} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{loc}}. \tag{4.8}$$
In the traditional OD formulation, the box classifier is implemented using a simple linear layer, and
the classification loss Lcls can be written as:
$$O = f_\theta(x), \quad S_{\mathrm{cls}} = O W^{\top}, \quad \mathcal{L}_{\mathrm{cls}} = \mathcal{M}(S_{\mathrm{cls}}; T). \tag{4.9}$$

Here, O ∈ R^{M×d} are the object/region/box features of the input image, W ∈ R^{K×d} is the weight matrix of the box classifier, S_cls ∈ R^{M×K} are the output classification logits, T ∈ {0, 1}^{M×K} is the target, and M(S; T) is the loss measure, e.g., focal loss in one-stage OD models. (M is the number of region/box features, d is the visual feature hidden dimension, K is the number of object classes, and we ignore the bias in the box classifier for simplicity.)
Instead of classifying each region/box into K classes, GLIP (Li et al., 2022h) reformulates OD as a phrase grounding task, by grounding/aligning each region to the K phrases in a text prompt t. The alignment scores S_ground are computed between the regions in image x and the words in the prompt t:

$$O = f_\theta(x), \quad P = f_\phi(t), \quad S_{\mathrm{ground}} = O P^{\top}, \tag{4.10}$$

where P ∈ R^{L×d} are the contextualized word/token features from the language encoder, and L is the length of the language prompt t. P plays a similar role to the weight matrix W in (4.9). The grounding
model, consisting of both the image encoder fθ and the language encoder fφ , is trained end-to-
end by minimizing the loss defined in (4.8) & (4.9), with a simple replacement of the classification
logits Scls in (4.9) with the region-word alignment scores Sground in (4.10). In Figure 4.4, we show
an example of Sground computed for 4 region-word pairs. Note that all the bounding box proposals
used to compute Sground are extracted from one image. The matched pairs are assigned higher scores
than the mismatched pairs.
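A minimal sketch of this grounding reformulation is shown below: the region-word alignment logits S_ground = O P^T simply take the place of the classification logits S_cls = O W^T and are supervised with the same kind of loss measure (focal loss here). All tensors are toy placeholders.

```python
import torch
from torchvision.ops import sigmoid_focal_loss

M, L, d = 100, 32, 256                 # number of regions, prompt tokens, and feature dimension
O = torch.randn(M, d)                  # region/box features from the image encoder f_theta
P = torch.randn(L, d)                  # contextualized token features from the language encoder f_phi
T = torch.zeros(M, L)                  # region-word matching targets
T[0, 5] = 1.0                          # e.g., region 0 is grounded to the token at position 5 ("dog")

S_ground = O @ P.t()                   # (M, L) alignment logits, replacing S_cls = O W^T
loss_cls = sigmoid_focal_loss(S_ground, T, reduction="mean")
```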
By distilling the knowledge from the CLIP/ALIGN model into a two-stage detector, ViLD (Gu et al.,
2022d) and RegionCLIP (Zhong et al., 2022) are proposed for zero-shot object detection.
In a two-stage detector, a separate region proposal network (RPN) with loss Lrpn is used to distin-
guish foreground from background. Since Lrpn does not use semantic information of object classes,
it can be merged into the localization loss Lloc in (4.8). In RegionCLIP, RPN is used to propose
image regions for all images in a batch, producing N image regions in total. The set of image regions is denoted by {r_i}_{i=1}^{N}. Given the proposed regions, the visual representation v_i of region r_i is produced by the visual encoder with a feature pooling method, such as RoIAlign.

Figure 4.5: Top: GLIP pre-trains an image encoder, a text encoder and a fusion module to predict which image box regions are paired with which words/phrases of the text prompt. This behavior allows us to turn GLIP into a zero-shot object detector. We convert all of a dataset’s classes into a caption by concatenation and predict the words/phrases of the caption that GLIP estimates best pair with a given box. Bottom: predictions of the zero-shot GLIP object detector on examples from six datasets in ODinW (Li et al., 2022b). This figure was created in Li et al. (2022h).
RegionCLIP also builds a large pool of candidate concepts for image regions, which are often dif-
ferent from the concepts for full images. These concepts are in the form of natural language, and
encoded into semantic representations {uk }k=1,...,K by a pre-trained text encoder L, where K de-
notes the size of the concept pool.
By leveraging the pre-trained CLIP, the object concept u with the highest matching score is selected
as the pseudo label for each region r, thus constructing positive pairs {u, v}. A similar
contrastive learning framework with an additional distillation loss is used to train the OD models.
Other Language-Image Pre-training Methods for OD. Learning generic open-set object de-
tectors from image-text pairs is an increasingly popular topic. Similar to GLIP, MDETR (Kamath
et al., 2021) reformulates detection as a phrase grounding problem, and uses a single text query
for the whole image. FIBER (Dou et al., 2022a) improves GLIP by (i) using a coarse-to-fine pre-
training pipeline, and (ii) performing fusion in the backbone rather than in the OD head as in GLIP.
OVR-CNN (Zareian et al., 2021) fine-tunes an image-text model to detection on a limited vocabulary
and relies on image-text pre-training for generalization to an open vocabulary setting. Detic (Zhou
et al., 2022e) improves long-tail detection performance with weak supervision by training only the
classification head on the examples where only image-level annotations are available. Other con-
current works include OV-DETR (Zang et al., 2022), X-DETR (Cai et al., 2022), FindIT (Kuo et al.,
2022), PromptDet (Feng et al., 2022), and OWL-ViT (Minderer et al., 2022).
In the literature, there are two different experiment settings to evaluate the open-set object detection
ability of pre-trained OD models.
• Class-level Transfer in a Single Domain. One common zero-shot transfer evaluation for object
detection follows the setting in Zareian et al. (2021), where a manual split is pre-defined for a given
visual domain, ensuring no concept overlap between training and evaluation. For example, on
LVIS (Gupta et al., 2019), 866 frequent and common categories are treated as the base categories
for training, and 337 rare categories are held out as the novel categories for evaluation. On COCO,
there is a split with 48 base categories and 17 novel categories, removing 15 categories without a
synset in the WordNet hierarchy.
• Task-level Transfer. This is an increasingly popular setting, where the pre-trained OD model is
evaluated on multiple datasets in a zero-shot setting. For example, inspired by CLIP, the LVIS-
trained model is evaluated on 3 datasets, including PASCAL VOC, COCO and Objects365 in
ViLD (Gu et al., 2022d). The recent Object Detection in the Wild (ODinW) benchmark generalizes
the task-level evaluation to a more comprehensive regime, 13 datasets initiated in Li et al. (2022h)
and 35 datasets formalized in Li et al. (2022b), respectively.
Use Cases of Language-Image Models for OD. In Figure 4.5, we illustrate how region-phrase matching models like GLIP can be used for zero-shot object detection. Given a new OD dataset/task with a set of concept/category names, all concepts are converted into a caption by concatenation, optionally with some simple user-customized text prompt added. The caption is used as the prompt for the text encoder to extract concept representations. The query image is fed into the image encoder to extract dense visual representations, which are used to compute the similarity with respect to all concepts via a deep fusion module. Similarities above a given threshold yield the predicted results: the box of interest and the matched concept. In the bottom of Figure 4.5, six use cases are illustrated, all of which are from the ODinW benchmark and represent real-world OD scenarios.
Image segmentation involves grouping image pixels and assigning a class label to each pixel of an
image. We use Language-driven Semantic Segmentation (LSeg) (Li et al., 2022a) as an example to
illustrate the image segmentation process, where textual categories and image pixels are embedded
into a common space, and each pixel is assigned to a semantic category.
For any semantic segmentation task with a set of K class labels, the text encoder embeds them into a continuous vector space R^d, producing an embedding matrix for all classes P = [p_1, · · · , p_K] ∈ R^{K×d}. For an image x, an image encoder encodes it into a dense grid representation O ∈ R^{H×W×d}, where H and W specify the spatial size of the feature map. The word-grid similarity tensor is computed as the dot product S_seg = O P^⊤ ∈ R^{(H×W)×K}. In Figure 4.6, we show a simplified example of S_seg computed on 4 word-grid pairs. Note that all the grid features used to compute S_seg are extracted from one image. The matched pairs are assigned higher scores than the mismatched pairs.
For each grid position, we minimize a per-grid softmax with cross-entropy loss (with temperature scaling), as is standard in semantic segmentation. In LSeg, a dense prediction Transformer (Ranftl et al., 2021) is used to decode the features, and a final spatial regularization block cleans up the predictions.
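A minimal sketch of the per-pixel classification just described, with illustrative shapes and temperature value, is given below.

```python
import torch
import torch.nn.functional as F

H, W, d, K = 32, 32, 512, 4
O = torch.randn(H, W, d)                     # dense grid features from the image encoder
P = torch.randn(K, d)                        # class-name embeddings from the text encoder
labels = torch.randint(0, K, (H, W))         # ground-truth class index for each grid position

S_seg = torch.einsum("hwd,kd->hwk", O, P)    # word-grid similarity tensor, (H, W, K)
logits = S_seg.reshape(-1, K) / 0.07         # temperature scaling
loss = F.cross_entropy(logits, labels.reshape(-1))
```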
Due to rich semantics in image-text pair data, there are many other works that use language-image
models for segmentation, as detailed below.
• CLIP-based Segmentation. Many segmentation models directly adapt the pre-trained CLIP to
pixel-level visual recognition tasks, including PhraseCut (Wu et al., 2020), OpenSeg (Ghiasi et al.,
2022), CLIPSeg (Lüddecke and Ecker, 2022), ZS-Seg (Xu et al., 2021d), MaskCLIP (Zhou et al.,
2022a), DenseCLIP (Rao et al., 2021) and MaskCLIP (Ding et al., 2022b). OpenSeg (Ghiasi
et al., 2022) also performs model learning with class agnostic mask annotations for generating
mask proposals.
• Training from scratch. GroupViT (Xu et al., 2022) is a new hierarchical grouping Trans-
former architecture that exploits the global self-attention mechanism of Transformers to partition
input images into progressively larger arbitrary-shaped groups. It is pre-trained with a multi-
label image-text contrastive loss on around 12M image-text pairs. Since GroupViT automatically
groups images into semantically-similar segments, its output can be easily transferred to semantic
segmentation without fine-tuning.
In the above three subsections, we have described how one might extend a closed-set recognition model to perform three open-set recognition tasks: image classification, object detection and segmentation. The solution is to utilize a parametric function such as a neural language model to represent categories, instead of traditional non-parametric representations such as one-hot vector embeddings. Though this endows the model with the functionality of open-set recognition, the model still lacks the power to perform well on a large range of downstream tasks in the wild, where both the visual appearance of input images and the semantics of output categories often vary significantly from one application to another.
In Figure 4.7, we use the definitions in Li et al. (2022b) to compare four settings studied in the computer vision community: the traditional closed-set recognition setting (bottom-left quadrant), the open-set recognition setting (top-left quadrant), the domain adaptation or out-of-distribution setting (bottom-right
quadrant), and the CVinW setting (top-right quadrant). It is clear that CVinW considers variations in
both visual domains and concept domains. In fact, any visual recognition task can be naturally de-
fined using a customized set of concepts and a given visual domain. From this perspective, CVinW
considers task-level transfer, which is beyond the concept/class-level transfer that often appears in
the traditional open-set recognition setting. In Figure 4.8, we use the same image above to illustrate
the difference among these settings.
The goal of developing foundation models for computer vision in the wild is two-fold:
• The ability to transfer to a wide range of new downstream tasks. It means the application sce-
narios of the foundation models are broad. The well-established datasets such as ImageNet and
COCO are two representative closed-set tasks for image classification and object detection, respec-
tively. In real-world settings, both the visual domain and the concept sets can vary significantly,
beyond ImageNet and COCO. The effectiveness of a foundation model is better measured by its
applicability than by its performance on any specific tasks.
Figure 4.7: Illustration of the setting of “Computer Vision in the Wild (CVinW)”, in comparison with other settings. The 2D space is constructed with two dimensions: input image and output concept. The 2D chart is divided into four quadrants, based on the requirements between the model development stage and the model evaluation stage. For the example provided in the standard setting, the natural image with concept “person, sheep, dog” is presented. Figure from Li et al. (2022b).

[Figure 4.8: The same image under different settings: (i) closed-set: [dog, grass, frisbee]; (ii) open-set: [mixed breed, grass, frisbee, leaves, white dog, ground, green ground, red toy, …]; (iii) in-the-wild: new tasks with any customized set of concepts in various visual domains.]
• The adaptation cost of task-transfer is low. One major advantage of pre-trained foundation
models is the promise that they can transfer to downstream tasks effortlessly (or in an inexpensive
manner). It means that model adaptation efficiency is an important factor to measure the usability
of a foundation model. Good foundation models should be deployed with minimum adaptation
effort. To measure the cost of adaptation, Li et al. (2022b) define the adaptation cost in two
orthogonal dimensions: sample-efficiency (measured by the number of training examples), and
parameter-efficiency (measured by the number of trainable parameters). The established datasets
such as ImageNet and COCO do not provide the best evaluation setting for foundation models.
To achieve SoTA performance on these datasets, full-model fine-tuning with full-shot training data is often required, a setting that comes with a high adaptation cost. As a north star, a foundation model
with fixed weights should zero-shot transfer well to many downstream tasks.
The above goal of developing foundation models can be achieved on a range of computer vision tasks
individually or jointly. When achieved individually, the setup is to build one separate foundation
model for each problem. Most VLP models described in this chapter fall into this category. When
achieved jointly, the setup is to build one unified foundation model across all tasks. Computer vision
tasks require image processing at different levels of granularity (image, region, pixels), rendering
the cross-task unification challenging. It remains an appealing open research topic to build one
AI system that can leverage vision-language data at different levels of granularity, seeking the best
trade-off between data scale and semantics-richness.
Figure 4.9: Research topics and papers in VLP for core computer vision tasks, grouped into detection, image classification, segmentation, efficient adaptation, knowledge, robustness, multilingual modeling, and benchmarks; only representative works are shown.
• Open-set Visual Relationship Recognition. The idea of open-set recognition has been extended
to more visual recognition tasks, such as relation detection. Relational Language-Image Pre-
training (RLIP) (Yuan et al., 2022) improves zero-shot, few-shot and fine-tuning Human-Object-
Interaction (HOI) detection performance, and the robustness to learning from noisy annotations.
• Open-set Video Classification. Multi-modal Open-Vocabulary video classification (MOV) (Qian
et al., 2022) is proposed to use the vision encoder from pre-trained text-image models with mini-
mal modifications to encode video, optical flow and audio spectrogram, and design a cross-modal
fusion mechanism to aggregate complementary multi-modal information. X-CLIP (Ni et al., 2022)
adapts the pre-trained text-image models to video recognition. It uses a cross-frame attention
mechanism that explicitly exchanges information across frames, and a video-specific prompting
scheme to leverage video content information for generating discriminative textual prompts.
We refer the readers who are interested in the literature on Computer Vision in the Wild (i.e.,
VLP for core vision tasks) to the up-to-date CVinW reading list at https://github.com/
Computer-Vision-in-the-Wild/CVinW_Readings.
Chapter 5
Videos contain multiple modalities in nature, and have been widely used as a natural testbed for how AI
systems perceive the world. In this chapter, we provide a systematic review on vision-language
pre-training (VLP) for video-text tasks. We start with an introduction to popular video-text tasks in
Section 5.1. In Section 5.2, we review the architecture of a typical video-text model, which consists
of a video encoder, a text encoder and a multimodal fusion module. We divide the representative
video-language models into two categories: (i) dual encoder, where video and text are encoded
separately and a light multimodal fusion layer or operation (e.g., dot product) is used to fuse the
video and text features; and (ii) fusion encoder, where on top of the video encoder and text encoder,
multiple additional Transformer layers are usually adopted to capture deep interactions between
the video and text features. Section 5.3 and Section 5.4 present, respectively, the popular pre-
training tasks adopted in literature and the datasets available for large-scale video-text pre-training.
In Section 5.5, we discuss advanced topics and research trends in video-text pre-training, such as
comprehensive video-text benchmarks and learning from multi-channel videos.
The text-to-video retrieval task is to retrieve a relevant video or video segment given a natural lan-
guage query, from a large video corpus. The task can be further categorized into three types, depending on the setting.
• Video retrieval (VR) (Chen and Dolan, 2011; Xu et al., 2016) aims to retrieve a video from a
large video corpus. In this setting, the text queries are supposed to give an overview description of
a video. Take the example shown in Figure 5.1, “a person plays frisbee with a dog” summarizes
the event happening in the first video. This is analogous to text-to-image retrieval, and Recall@K
(K=1, 5, 10, 100) is used as the evaluation metric.
• Single Video Moment retrieval (SVMR) (Regneri et al., 2013; Krishna et al., 2017a) is to ground
the text query in a specific time interval of a given video. The text query is only relevant to a
specific segment of the whole video. For example, in Figure 5.1, “a dog is running with a fris-
bee in its mouth” can only be grounded in the visual content at t = 3, 4, 5 in the first video.
Again, Recall@K (K=1, 5, 10, 100) is used as the evaluation metric, with a constraint on tem-
poral intersection over union (tIoU) between the ground truth and the predicted proposals (e.g.,
tIoU≥0.5/0.7).
• Video Corpus Moment Retrieval (VCMR) (Lei et al., 2020b; Li et al., 2020b) further extends
the pool of relevant video segments from a single video to a large video corpus. It can be viewed
as the combination of VR and SVMR. An AI model is required to not only retrieve the relevant
video from the video corpus, but also localize the video segment in the retrieved video so that it
  can be described by the text query. For example, given the query “a dog is running with a frisbee in its mouth”, the model needs to correctly match it to the first video and then ground the text query in the video segment from t = 3 to t = 5. Similarly, VCMR is evaluated using Recall@K (K=1, 5, 10, 100) with tIoU≥0.5/0.7.

Figure 5.1: Illustration of representative video-text tasks: (i) text-to-video retrieval, including video retrieval and moment retrieval; (ii) video question answering in both multiple-choice and open-ended settings; and (iii) video captioning with a single-sentence caption or a paragraph of captions.
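A minimal sketch of the temporal IoU used in the moment retrieval metrics above is given below; thresholds such as 0.5 or 0.7 are applied on top of it when computing Recall@K.

```python
def temporal_iou(pred, gt):
    """pred and gt are (start, end) time intervals, e.g., in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted moment counts as correct for Recall@K only if temporal_iou(pred, gt) >= 0.5 (or 0.7).
print(temporal_iou((3.0, 5.0), (4.0, 6.0)))   # 0.333...
```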
Most VLP models (Miech et al., 2019; Bain et al., 2021) are evaluated on VR. Popular VR datasets
include (i) MSVD (Chen and Dolan, 2011), MSRVTT (Xu et al., 2016), LSMDC (Rohrbach et al.,
2015), YouCook2 (Zhou et al., 2018) and VATEX (Wang et al., 2019c) for single-sentence-to-video
retrieval; and (ii) DiDeMo (Hendricks et al., 2017) and ActivityNet Captions (Krishna et al., 2017b)
for paragraph-to-video retrieval. The paragraph-to-video retrieval datasets are transformed from the
datasets collected for the more challenging SVMR or VCMR tasks. In DiDeMo and ActivityNet
Captions, each sentence of a paragraph is annotated with relevant time intervals. More recently,
TVR (Lei et al., 2020b) and How2R (Li et al., 2020b) are proposed to incorporate additional dia-
logue/subtitle information to perform VCMR with multi-channel video inputs.
Given a video-question pair, video question answering (QA) requires an AI model to answer the
question based on the video content. There are two settings, both evaluated by accuracy.
• Multiple-Choice Video QA (Jang et al., 2017): A model needs to identify the correct answer from a fixed, small list of answer candidates (e.g., 4-5 candidates). As the answer is constrained to a finite set, the task is often formulated as classification. Popular datasets include
TGIF-Action, TGIF-Transition (Jang et al., 2017), TVQA (Lei et al., 2018), TVQA+ (Lei et al.,
2020a), How2QA (Li et al., 2020b) and Drama-QA (Choi et al., 2021). In the literature, video-
to-text retrieval tasks with a small number of text candidates are often regarded as a multiple-
choice QA task, such as LSMDC-MC (Torabi et al., 2016) and MSRVTT-MC (Yu et al., 2018b).
More recently, different video reasoning datasets have been proposed, mostly in the format of
multiple-choice QA. Examples include VIOLIN (Liu et al., 2020b) for video-and-language in-
ference, VLEP (Lei et al., 2020c) for future event prediction in videos, NExT-QA (Xiao et al., 2021) to test causal action reasoning, and STAR (Wu et al., 2021) to test four types of situated
reasoning: interaction, sequence, prediction and feasibility.
• Open-ended Video QA (Xu et al., 2017): the correct answer can be free-form, constructed from words in the whole vocabulary. The common practice is to first form a finite answer vocabulary by selecting the most frequent answers from the training split, and formulate the task as classification. Popular datasets in this setting include LSMDC-FiB (Torabi et al., 2016),
  TGIF-Frame (Jang et al., 2017), MSRVTT-QA, MSVD-QA (Xu et al., 2017), ActivityNetQA (Yu et al., 2019a) and iVQA (Yang et al., 2021a).

Figure 5.2: Illustration of a general framework for Transformer-based video-language models.
The task is to generate a natural language description for a given video, which is the only gener-
ation task among the three. The caption is expected to comprehensively describe the content of
the video, including the events or objects of interest, the evolution of the events or object behav-
iors along time, and the relations among them. Most popular benchmarks (Chen and Dolan, 2011;
Xu et al., 2016; Wang et al., 2019c) require generating a single-sentence caption to describe the
overall video content. Although a single sentence may be enough to summarize the event happen-
ing in short videos, descriptions of longer videos are often multiple-sentence paragraphs, as in
the dense captioning benchmark (Krishna et al., 2017b). Recently, multi-modal video captioning
datasets (e.g., TVC (Lei et al., 2020b)) are proposed with captions describing both visual scenes and
dialogues/subtitles in videos. Captioning performance is evaluated using standard text generation
metrics, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-
L (Lin, 2004) and CIDEr (Vedantam et al., 2015).
Overview. Given a pair of text sentence w and video v, a typical video-text model first extracts
a sequence of text features w = {w1 , · · · , wN } and visual features v = {v 1 , · · · , v M } via a text
encoder and a video encoder, respectively. Here, N is the number of tokens in a sentence, and M
is the number of visual features for a video, which can be the number of frames/regions/patches,
depending on the specific vision encoder being used. A multimodal fusion module projects these
features into a shared embedding space to produce cross-modal representations. We broadly divide
video-text models into two categories, based on the design of the multimodal fusion module:
• Dual Encoder, where video and text are encoded separately, and the interaction between video
and text features is modeled using a light-weight operation (e.g., dot product or cosine similarity).
This design is favorable in text-to-video retrieval for fast search (Bain et al., 2021), and also widely
adopted for promoting better video representations via contrastive video-text pre-training (Miech
et al., 2019). However, such shallow cross-modal interactions are not effective enough for video
QA and captioning tasks, as shown in Support-Set (Patrick et al., 2020). Therefore, an additional
text decoder is needed for caption generation.
• Fusion Encoder, where on top of the video encoder and text encoder, additional Transformer
layers (Vaswani et al., 2017) are adopted to capture fine-grained interactions between video and
text features. Preeminent works with deep fusion encoder, such as VideoBERT (Sun et al., 2019a),
UniVL (Luo et al., 2020), ClipBERT (Lei et al., 2021b), and MERLOT (Zellers et al., 2021),
show strong performance on the video QA and captioning tasks. While still achieving competitive
performance on text-to-video retrieval tasks, fusion encoders are computationally more expensive
than dual encoders (Bain et al., 2021).
58
Model | Multimodal Fusion | Vision Encoder | Text Encoder | Decoder | E2E | Pre-training Objectives
VideoBERT (Sun et al., 2019a) | Xformer | 3D CNN | Emb. | ✗ | ✗ | MLM+VTM+MVM
ActBERT (Zhu and Yang, 2020) | Xformer | OD | Emb. | ✗ | ✗ | MLM+VTM+MVM
HERO (Li et al., 2020b) | Xformer | 2D+3D CNN | Emb. | ✗ | ✗ | MLM+VTM+FOM+MFM
UniVL (Luo et al., 2020) | Xformer | 2D+3D CNN+Xformer | Xformer | ✓ | ✗ | VTC+MLM+VTM+MFM+LM
VQA-T (Yang et al., 2021a) | Xformer | 3D CNN | Xformer | ✗ | ✗ | MLM+VTC
TACo (Yang et al., 2021b) | Xformer | 2D+3D CNN | Xformer | ✗ | ✗ | VTC
ClipBERT (Lei et al., 2021b) | Xformer | 2D CNN | Emb. | ✗ | ✓ | MLM+VTM
MERLOT (Zellers et al., 2021) | Xformer | 2D CNN+Xformer | Xformer | ✗ | ✓ | MLM+VTC+FOM
MV-GPT (Seo et al., 2022) | Xformer | Xformer | Emb. | ✓ | ✓ | MLM+LM
LAVENDER (Li et al., 2022g) | Xformer | Xformer | Xformer | ✗ | ✓ | MLM+VTM as MLM
Singularity (Lei et al., 2022a) | Xformer | Xformer | Xformer | ✗ | ✓ | MLM+VTM+VTC
HTM (Miech et al., 2019) | Dot Product | 3D CNN | Word2Vec | ✗ | ✓ | VTC
MIL-NCE (Miech et al., 2020) | Dot Product | 3D CNN | Word2Vec | ✗ | ✓ | VTC
Support Set (Patrick et al., 2020) | Dot Product | 2D+3D CNN+Xformer | Xformer | ✓ | ✗ | VTC+LM
VideoCLIP (Xu et al., 2021b) | Dot Product | 3D CNN+Xformer | Xformer | ✗ | ✗ | VTC
Frozen (Bain et al., 2021) | Dot Product | Xformer | Xformer | ✗ | ✓ | VTC
Table 5.1: Glossary of representative VLP models for video-text tasks. E2E: end-to-end. CNN: convolutional neural networks. OD: object detector. Xformer: transformer. Emb.: embedding. MLM/MFM/MVM: masked language/frame/video modeling. VTM: video-text matching. VTC: video-text contrastive learning. FOM: frame order modeling. LM: language modeling.
The final outputs of a video-text model are either generated directly via an output layer that operates
on the cross-modal representations produced by the multimodal fusion module (for encoder-only
models), or a decoder that is added in between the multimodal fusion module and the output layer
(for encoder-decoder models). An illustration of this framework is shown in Figure 5.2. Table 5.1
summarizes the representative VLP models for video-text tasks, including fusion encoder models
(the upper block) and dual encoder models (the lower block). In Figure 5.3, we further show how
these VLP models evolve along time. Next, we review each component in detail.
Video Encoder. Unlike static images, a video clip consists of a sequence of frames/images that
evolve over time. Hence, the video encoder needs to capture not only spatial information from
each frame, but also temporal dynamics across frames. Over time, the video encoder evolves from
multiple offline feature extractors, to a single video encoder learned in an end-to-end manner. The
change in video encoder also reflects the general trend in VLP for video-text tasks, i.e., from two-
stage pre-training to end-to-end pre-training, similar to image-text models in Chapter 3.
• Multiple offline feature extractors. Early methods (Sun et al., 2019a; Zhu and Yang, 2020;
Li et al., 2020b) use a combination of fixed video feature extractors, such as 2D CNNs pre-
trained for image classification (e.g., ResNet (He et al., 2016)), 3D CNNs pre-trained for action
recognition (e.g., I3D (Carreira and Zisserman, 2017)), and object detection models (e.g., Faster
RCNN (Girshick, 2015)). These video features are further processed to have a similar format
to text inputs or projected into the same high-dimensional space as text representations. For ex-
ample, VideoBERT (Sun et al., 2019a) generates a sequence of “visual tokens” (in analogy to
textual tokens) by applying hierarchical vector quantization to the pre-extracted video features
from S3D (Zhang et al., 2018a) pre-trained on Kinetics (Kay et al., 2017). ActBERT (Zhu and
Yang, 2020) represents a video by combining a sequence of action features from a 3D CNN and
a sequence of regional object features from Faster R-CNN. The learnable embedding of a spe-
cial token ([ACT] for action and [REGION] for object) is then added to the features before being
fed to the multimodal fusion module. HERO (Li et al., 2020b) concatenates 3D Slowfast (Fe-
ichtenhofer et al., 2019) features and 2D ResNet-101 features extracted at the same frame rate as
the video representation. The concatenated video features are projected into a hidden space via
a fully-connected layer, and then a positional embedding, which encodes the temporal order of
input frame features, is added.
• Single video encoder learned in an end-to-end manner. Although models based on pre-
extracted video features achieve strong performance, these fixed features are somewhat discon-
nected with the target video-text tasks/domains. Offline feature extractors are often trained on
pure vision tasks in different domains. To address this issue, researchers try to refine the video
encoder during video-text pre-training (Miech et al., 2020; Lei et al., 2021b; Zellers et al., 2021)
Figure 5.3: VLP models developed for video-text tasks along time. Due to space constraints, only
some representative works are shown.
in an end-to-end (E2E) manner. Instead of using multiple video encoders, which renders excessive
computational demands, a single video encoder is used. For example, MIL-NCE (Miech et al., 2020)
learns video representations from scratch with a randomly initialized I3D (Carreira and Zisser-
man, 2017). In ClipBERT (Lei et al., 2021b), ResNet-50 pre-trained for object detection (Jiang
et al., 2020) along with a temporal mean pooling is used to generate video representations. With
advances in vision Transformers (ViTs), recent E2E models adopt a fully Transformer-based ar-
chitecture. Frozen (Bain et al., 2021) inserts several space-time self-attention blocks into a pre-
trained ViT (Dosovitskiy et al., 2021) to learn a global video representation via contrastive video-
text pre-training. MV-GPT (Seo et al., 2022) and LAVENDER (Li et al., 2022g) directly encode
video inputs via a video vision Transformer (e.g., ViViT (Arnab et al., 2021) and a video Swin
Transformer (Liu et al., 2022)).
Text Encoder. Text inputs are first tokenized into a sequence of tokens to obtain the token em-
beddings. Before the wide adoption of BERT-like models for video-text pre-training, early dual
encoder models (Miech et al., 2019, 2020) utilize the pre-trained word2vec embeddings (Mikolov
et al., 2013a), followed by a max-pooling operation to obtain the overall sentence representation.
Most recent works follow the standard text pre-processing steps of BERT to tokenize a text into a
sequence of WordPieces (Wu et al., 2016), with two special tokens ([CLS] and [SEP]) inserted at
the beginning and the end of the sequence, respectively. A word embedding layer, consisting of to-
ken embedding, position embedding and layer normalization layers, is used to embed these tokens to
vectors in a high-dimensional continuous space. For dual encoder models, these token embeddings are further processed by a deep Transformer network to produce the final text representations (Patrick et al., 2020; Bain et al., 2021; Xu et al., 2021b). For fusion encoder models, they are either directly fed into a multimodal fusion module (Tang et al., 2021c; Xu et al., 2021a), where the word embedding layer is the only text-specific model component, or processed by several Transformer layers for text encoding before multimodal fusion (Yang et al., 2021a,b; Seo et al., 2022).
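A minimal sketch of such a word embedding layer is given below; the BERT-base vocabulary size, hidden dimension, and maximum sequence length are assumed defaults.

import torch
import torch.nn as nn

class WordEmbeddingLayer(nn.Module):
    """Token + position embeddings followed by layer normalization, applied to WordPiece ids
    (with [CLS]/[SEP] already inserted) before any text-specific or fusion Transformer layers."""
    def __init__(self, vocab_size=30522, dim=768, max_len=512, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        # input_ids: (B, N) token ids
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        return self.dropout(self.norm(x))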
Multimodal Fusion. For dual encoder models like HTM (Miech et al., 2019) and MIL-
NCE (Miech et al., 2020), the global video/text representations extracted from video/text encoders
are aligned in a common semantic space via a lightweight inner product. For fusion encoder models,
the most popular design is merged attention (illustrated in Figure 3.4 (b) of Chapter 3.2), where the
text and video features are simply concatenated and then fed into a single Transformer block. In a re-
cent study (Lei et al., 2022a), cross-attention modules are inserted into the top few Transformer layers
between self-attention and feed-forward layers, to enable text features to attend to a variable-length
visual feature sequence. This is similar to co-attention (illustrated in Figure 3.4 (a) of Chapter 3.2), but it is asymmetric in that only video-to-text cross-attention modules are used (see the sketch below).
This illustrative comparison can be directly applied to video-text inputs, simply by replacing the
input image with a sequence of input video frames.
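To make this asymmetric design concrete, the sketch below shows one fusion layer in which text queries attend to a variable-length sequence of video features; the pre-norm placement, dimensions, and layer structure are illustrative assumptions, not the exact implementation of Lei et al. (2022a).

import torch.nn as nn

class TextToVideoCrossAttnLayer(nn.Module):
    """Text self-attention, then cross-attention from text queries to video keys/values
    (information flows from video to text only), then a feed-forward block."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, video):
        # text: (B, N, D) token features; video: (B, M, D) frame/patch features (M can vary)
        t = self.norm1(text)
        x = text + self.self_attn(t, t, t)[0]
        x = x + self.cross_attn(self.norm2(x), video, video)[0]   # text attends to video
        return x + self.ffn(self.norm3(x))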
Video-Text Contrastive Learning (VTC). In VTC, the model aims to learn the correspondence
between video and text. VTC is widely adopted to train the dual-encoder models (Miech et al.,
2019; Bain et al., 2021), where video and text inputs are fused via a lightweight inner product. This
simple dot product is also used to compute the video-to-text and text-to-video similarities in VTC
for dual-encoder models. Specifically, given a batch of N video-text pairs, VTC aims to predict the
N matched pairs from all the $N^2$ possible video-text pairs:
$$ s^{\text{v2t}}_{i,j} = v_i^{\top} w_j\,, \qquad s^{\text{t2v}}_{i,j} = w_i^{\top} v_j\,, \qquad\qquad (5.1) $$
$$ \mathcal{L}^{\text{v2t}}_{\text{VTC}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s^{\text{v2t}}_{i,i}/\sigma)}{\sum_{j=1}^{N}\exp(s^{\text{v2t}}_{i,j}/\sigma)}\,, \qquad \mathcal{L}^{\text{t2v}}_{\text{VTC}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s^{\text{t2v}}_{i,i}/\sigma)}{\sum_{j=1}^{N}\exp(s^{\text{t2v}}_{i,j}/\sigma)}\,, $$
where $\{v_i\}_{i=1}^{N}$ and $\{w_i\}_{i=1}^{N}$ are the normalized video and text vectors in a training batch, $\sigma$ is a learned temperature hyper-parameter, and $\mathcal{L}^{\text{v2t}}_{\text{VTC}}$ and $\mathcal{L}^{\text{t2v}}_{\text{VTC}}$ are the video-to-text and text-to-video contrastive losses, respectively.
The naive formulation of VTC assumes there exist correct alignments between video and text in
pre-training data, which is not always the case. A great challenge in large-scale contrastive pre-
training on existing video-text data is the inherent misalignment between visual frames and speech-
transcribed subtitles. To address the visually misaligned narrations, MIL-NCE (Miech et al., 2020) is
proposed to combine multiple instance learning with contrastive learning to use the weak and noisy
training signals in narrated videos. VideoCLIP (Xu et al., 2021b) constructs temporally overlapped
pairs of video and text clips of varying length, in contrast to fixed length in Miech et al. (2019,
2020), to increase the quality and quantity of the pre-training corpus. In addition, they contrast not only
different clips from the same video, but also harder negatives that are similar to the in-batch clips,
retrieved from other videos.
Moreover, the conventional contrastive learning computes the loss after aggregating all the words
in the text and frames in the video. In TACo (Yang et al., 2021b), the authors propose to make it
token-aware, where the loss is computed using only a subset of words (e.g., nouns and verbs), to
improve the grounding of individual words in the video. TACo combines token-aware VTC with the
naive VTC, applied to the dual-encoder architecture, and further adds a third VTC loss enhanced by
deep multimodal fusion. Specifically, the similarity between video and text input is the multimodal
fusion output for the [CLS] token, computed by the Transformer block operating on top of the dual
encoders, which is exactly the fusion-encoder architecture. To reduce the complexity in computing
the fusion-encoder VTC loss, they adopt a cascade sampling strategy to only sample a small subset
of hard negatives based on the token-aware VTC and the naive VTC loss.
Masked Language Modeling (MLM). MLM is a direct adoption of the one used for lan-
guage model pre-training, except that the inputs are video-text pairs. Formally, the inputs for
MLM include: (i) sub-word tokens from an input sentence w; (ii) the visual inputs (e.g., frame
patches/features) $v$ aligned with $w$; and (iii) mask indices $m \in \mathbb{N}^M$, where $\mathbb{N}$ denotes the natural numbers, $M$ is the number of masked tokens, and $m$ is the set of masked indices. In practice, we randomly mask out input words with a probability of 15%, and replace the masked tokens $w_m$ with the special token
[MASK]. Following BERT, the 15% randomly masked-out words are further decomposed into 10%
random words, 10% unchanged, and 80% [MASK]. The goal is to predict these masked words based
on the observation of their surrounding words w\m and the visual inputs v aligned to the sentence,
by minimizing the negative log-likelihood:
$$ \mathcal{L}_{\text{MLM}}(\theta) = -\mathbb{E}_{D}\, \log P_{\theta}(w_m \mid w_{\backslash m}, v)\,, \qquad\qquad (5.2) $$
where θ denotes trainable parameters. Each pair (w, v) is sampled from the training set D.
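The BERT-style 15% / 80%-10%-10% masking described above can be sketched as follows; the helper name and the use of -100 as the ignored-label value (the PyTorch cross-entropy convention) are implementation choices rather than requirements of any specific VLP model.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_token_ids, mlm_prob=0.15):
    """Select 15% of the (non-special) tokens; of those, 80% -> [MASK], 10% -> random token,
    10% -> unchanged. Returns the corrupted inputs and the MLM labels."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    special = torch.zeros_like(labels, dtype=torch.bool)
    for tid in special_token_ids:                         # never mask [CLS], [SEP], [PAD], ...
        special |= labels == tid
    prob.masked_fill_(special, 0.0)
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                                # loss is only computed on masked positions

    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id                    # 80%: replace with [MASK]
    random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    input_ids[random] = torch.randint(vocab_size, labels.shape)[random]   # 10%: random token
    # the remaining 10% of the masked positions keep their original token
    return input_ids, labels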
Similar to image-text pre-training, video-text pre-training methods also use a variant of MLM, lan-
guage modeling, where captions are generated token-by-token autoregressively, as in UniVL (Luo
et al., 2020) and Support-Set (Patrick et al., 2020). Specific to video-text pre-training, speech-
transcribed texts are usually less formal, with disfluent utterances or repetitive mentions of the key objects.
To avoid masking on un-grounded words, MERLOT (Zellers et al., 2021) provides a simple heuristic
solution to mask words based on the learned attention weights and empirically verifies its advantages
over random masking.
Video-Text Matching (VTM). In VTM, the model is given a batch of positive video-text pairs and
negative video-text pairs, which are constructed by replacing the video/text inputs in positive video-
text pairs. The goal of VTM is to identify positive pairs of videos and texts. VTM is often formulated
as a binary classification task. Specifically, a special token (i.e., [CLS]) is inserted at the beginning
of the input sentence, whose learned vector representation is used as the cross-modal representation
of the input video-text pair. We then feed the model with either a matched or mismatched video-
text pair $\langle v, w \rangle$ with equal probability, and learn a classifier to predict a binary label $y$, indicating
whether the sampled video-text pair is positive or negative. Specifically, denoting the output score
by sθ (w, v), we apply the binary cross-entropy loss for optimization:
$$ \mathcal{L}_{\text{VTM}}(\theta) = -\mathbb{E}_{(w,v)\sim D}\, \big[\, y \log s_{\theta}(w, v) + (1-y)\log(1 - s_{\theta}(w, v)) \,\big]\,. \qquad\qquad (5.3) $$
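A minimal sketch of a VTM head trained with Eq. (5.3) is given below; the hidden dimension and class name are illustrative.

import torch.nn as nn

class VTMHead(nn.Module):
    """Score a (video, text) pair from the fused [CLS] representation and train with
    binary cross-entropy, as in Eq. (5.3)."""
    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, cls_repr, labels):
        # cls_repr: (B, D) cross-modal [CLS] vectors for matched/mismatched pairs
        # labels:   (B,) 1 for positive (matched) pairs, 0 for negatives
        logits = self.classifier(cls_repr).squeeze(-1)
        return self.loss_fn(logits, labels.float())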
Different variations of VTM have been proposed to capture alignments along the temporal dimen-
sion of different levels of granularity. For example, HERO (Li et al., 2020b) considers both the
global alignment (predicting whether a text matches the input video) and local temporal alignment
(retrieving the moment where the text should be localized in the video clip), which is proven effec-
tive for downstream video corpus moment retrieval.
Other Pre-training Tasks. Besides the pre-training tasks discussed above, some attempts have
been made to leverage the unique characteristics of video inputs for self-supervised pre-training.
• Frame Order Modeling (FOM). FOM is proposed to model the chronological order of events
or actions happening in a video. During training, we randomly choose a certain percentage of input frames (or frame features) and scramble their order, and the model is trained to explicitly recover the correct
temporal order. Two variants are explored, including reconstructing absolute temporal order of
these shuffled frames as in HERO (Li et al., 2020b), and predicting the relative order between
each pair of frames as in MERLOT (Zellers et al., 2021). In both works, FOM is applied to the
videos paired with temporally grounded texts, such as subtitles or ASR outputs (a minimal sketch of the absolute-order variant is given after this list).
– FOM with absolute temporal order. At time $t$, denote the video frame inputs as $v_t$ and the temporally grounded sentence as $w_t$. The inputs to FOM are (i) all subtitle sentences $\{w_t\}$; (ii) visual frames $\{v_t\}$; and (iii) the reorder indices $r = \{r_i\}_{i=1}^{R} \in \mathbb{N}^R$, where $R$ is the number of reordered frames, and $r$ is the set of reorder indices. During training, 15% of the frames are randomly selected to be shuffled, and the goal is to reconstruct their original order along the temporal dimension, denoted as $t = \{t_i\}_{i=1}^{R}$, where $t_i \in \{1, ..., N_v\}$. FOM is formulated as a classification problem, where $t$ contains the ground-truth labels of the reordered frames. The final objective is to minimize the negative log-likelihood:
$$ \mathcal{L}_{\text{FOM}}(\theta) = -\mathbb{E}_{D} \sum_{i=1}^{R} \log P_{\theta}([r_i, t_i])\,. \qquad\qquad (5.4) $$
– FOM with relative temporal order. During training, 40% of the time, an integer $n$, indicating the number of frames to be shuffled, is first randomly picked from $[2, T]$, given $T$ input frames. Then, $n$ frames are chosen at random and scrambled. After shuffling, the frames together with the text inputs are fed into the model to learn the joint
video-language representations. For a pair of frames at timesteps $t_i$ and $t_j$ (after the shuffling), we concatenate their hidden states and pass the result through a two-layer MLP, predicting whether $t_i < t_j$ or $t_i > t_j$. Similarly, FOM with relative temporal order can be optimized using a
cross-entropy loss.
• Masked Video Modeling (MVM). As the consecutive frames may contain similar spatial infor-
mation, MVM is introduced to reconstruct high-level semantics or low-level details for a certain
percentage of “masked” visual inputs (i.e., features or patches), given intact video tokens/features
from neighboring frames and the paired textual description. Specifically, the model is trained to
reconstruct the masked patches or features $v_m$ given the remaining visible patches or features $v_{\backslash m}$ and the paired text $w$. That is,
$$ \mathcal{L}_{\text{MVM}}(\theta) = \mathbb{E}_{(w,v)\sim D}\, P_{\theta}(v_m \mid v_{\backslash m}, w)\,. \qquad\qquad (5.5) $$
Similar objectives have been proposed for image-text pre-training (Chen et al., 2020d; Kim et al.,
2021), known as masked image modeling (MIM) as described in Chapter 3.3.
– MVM with in-batch negatives is explored in HERO (Li et al., 2020b), leveraging Noise
Contrastive Estimation loss (Jozefowicz et al., 2016) to supervise the model to identify the
correct frame feature corresponding to the masked frames, compared to all negative distrac-
tors in the same batch.
– MVM with discrete visual tokens, is first introduced in VideoBERT (Sun et al., 2019a).
VideoBERT tokenizes continuous S3D (Zhang et al., 2018a) features extracted from input
video frames into discrete “visual tokens” using hierarchical k-means. These visual tokens
are then used as both the video inputs to the model and the prediction targets for MVM. MVM
is formulated as a classification task that is performed in the same manner as MLM. Similarly,
15% of the input visual tokens are randomly masked and the model is trained to recover these
masked visual tokens. More recently, VIOLET (Fu et al., 2021) draws inspiration from self-supervised learning methods on vision Transformers (Bao et al., 2022a; Tan et al., 2021) and takes advantage of pre-trained DALL-E (Ramesh et al., 2021) to extract discrete visual tokens as the MVM targets. VIOLET randomly masks out the raw input video frame patches, and trains the model to predict the corresponding visual tokens for these masked patches in an end-to-end manner.
– MVM with other visual targets. In addition to discrete visual tokens, Fu et al. (2022) empirically examine 7 other reconstruction targets for MVM, ranging from low-level pixel values and oriented gradients to high-level depth maps, optical flow predictions, and various latent visual features from deep neural networks. Likewise, the raw input video frame patches are randomly masked, and the model training is supervised with an $\ell_1$ loss between the MVM prediction and these continuous visual targets of the masked patches.
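To make the absolute-order variant of FOM concrete, the sketch below classifies each shuffled frame representation into its original timestep; the maximum number of frames and the single linear classifier are illustrative assumptions rather than HERO's exact design.

import torch
import torch.nn as nn

class AbsoluteFOMHead(nn.Module):
    """FOM with absolute temporal order: for each reordered frame, predict its original
    timestep among the possible positions with a cross-entropy loss, as in Eq. (5.4)."""
    def __init__(self, dim=768, max_num_frames=64):
        super().__init__()
        self.order_classifier = nn.Linear(dim, max_num_frames)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, frame_repr, reorder_idx, orig_timesteps):
        # frame_repr: (B, T, D) contextualized frame features after multimodal fusion
        # reorder_idx: (B, R) indices of the R frames that were shuffled
        # orig_timesteps: (B, R) their ground-truth original positions t_i
        idx = reorder_idx.unsqueeze(-1).expand(-1, -1, frame_repr.size(-1))
        shuffled = torch.gather(frame_repr, 1, idx)            # (B, R, D)
        logits = self.order_classifier(shuffled)               # (B, R, max_num_frames)
        return self.loss_fn(logits.flatten(0, 1), orig_timesteps.flatten())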
Case Study. Until now, we have introduced the general model architecture and popular pre-
training tasks in video-text literature. To provide the readers with more concrete examples, we
select three representative models as case studies, including (i) MIL-NCE (Miech et al., 2020), a
dual-encoder model; (ii) UniVL (Luo et al., 2020), a fusion-encoder model that extracts video features offline; and (iii) ClipBERT (Lei et al., 2021b), an end-to-end fusion-encoder model that
directly learns from raw video pixels. We briefly review their architectures and pre-training tasks.
• MIL-NCE. The architecture of MIL-NCE is shown in Figure 5.4a. Video is encoded by a 3D
CNN backbone (e.g., I3D (Carreira and Zisserman, 2017) or S3D (Zhang et al., 2018a)) to extract
3D grid features, which are then globally mean pooled to obtain the global video embedding. The text sentence is encoded with a pre-trained word2vec embedding, followed by a max-pooling operation
to obtain the global text embedding. MIL-NCE is pre-trained with VTC, where the similarity
between video-text pairs is measured by the dot product between the two global embeddings. A simplified sketch of the MIL-NCE objective is given after this list.
• UniVL. Figure 5.4b illustrates the model architecture of UniVL, which contains two single Trans-
former encoders to embed video and text respectively, a cross-modal Transformer to model the
interactions between text and video embeddings, and a Transformer decoder. UniVL follows a
two-stage pipeline. First, an off-the-shelf feature extractor (e.g., S3D or ResNet-152 (He et al.,
2016)) is used to extract video features from densely sampled frames. Then, these video features
along with the accompanying text sentences are fed into UniVL to learn multimodal representa-
tions. UniVL is pre-trained with five tasks: VTC, VTM, MLM, MVM with in-batch negatives, and a language modeling task that trains the decoder to generate captions token-by-token autoregressively.
• ClipBERT. As shown in Figure 5.4c, ClipBERT adopts a fusion encoder architecture. A 2D CNN
followed by a temporal mean pooling layer is used to encode sparsely sampled frames from each
video clip. Text inputs are first encoded with a word embedding layer, and then sent to a multi-
layer Transformer for multimodal fusion, along with video features. In comparison to previous
Figure 5.4: Overview of three representative VLP models for video-text tasks: (a) MIL-NCE (Miech
et al., 2020), (b) UniVL (Luo et al., 2020) and (c) ClipBERT (Lei et al., 2021b). Figures are from
the corresponding papers.
two-stage pipelines (Li et al., 2020b; Luo et al., 2020) that extract video features offline with 3D CNNs from densely sampled frames, the video encoding of ClipBERT is less computationally
heavy, which makes end-to-end optimization feasible during pre-training and finetuning. Clip-
BERT is pre-trained with image-text matching and MLM on image-text datasets, COCO (Chen
et al., 2015) and VG (Krishna et al., 2017d). More discussions about leveraging image-text pairs
to pre-train video-text models can be found in Section 5.5.2.
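The MIL-NCE objective used in the first case study can be sketched as follows; this is a simplified, one-directional version that only keeps the core idea of treating temporally nearby narrations as a bag of candidate positives (the original loss is symmetric and also mines extra negatives).

import torch

def mil_nce_loss(video_emb, text_emb, pos_mask):
    """video_emb: (B, D) clip embeddings; text_emb: (L, D) narration embeddings;
    pos_mask: (B, L) boolean, True where narration j is a candidate positive for clip i.
    All remaining in-batch pairs act as negatives."""
    sim = video_emb @ text_emb.t()                        # (B, L) dot-product similarities
    sim = sim - sim.max(dim=1, keepdim=True).values       # for numerical stability
    exp_sim = sim.exp()
    pos = (exp_sim * pos_mask).sum(dim=1)                 # sum over the bag of candidate positives
    denom = exp_sim.sum(dim=1)                            # positives + in-batch negatives
    return -(pos / denom).log().mean()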
Pre-training Datasets. We next review widely used video-text pre-training corpora.
• The dataset used in VideoBERT (Sun et al., 2019a) contains a set of 312K videos with a total
duration of roughly 966 days. The videos are obtained by extracting publicly available videos
from YouTube with topics related to “cooking” and “recipe”, and then videos longer than 15
minutes are removed. YouTube’s ASR toolkit1 is utilized to get timestamped speech-transcribed
texts from these videos. In the end, 120K videos with texts in English are kept for video-text
pre-training. The remaining videos, although without paired texts in English, can be used for
video-only pre-training tasks.
• HowTo100M (Miech et al., 2019) consists of 1.22M instructional videos from YouTube, covering
human activities such as cooking, hand crafting, personal care, gardening, etc. These videos are
collected by searching YouTube videos with “how to” text queries (e.g., how to paint furniture).
The text queries cover a set of refined “visual tasks” from WikiHow2 , which involve human in-
1 https://developers.google.com/youtube/v3/docs/captions
2 https://www.wikihow.com/
(a) Examples of YouTube video clips, paired with ASR transcripts, sampled from YTTemporal-180M (Zellers
et al., 2021).
(b) Examples of short videos, paired with alt-text descriptions. Figure credit: Frozen (Bain et al., 2021).
Figure 5.5: Visualization of exemplary video-text pre-training data.
teractions with the physical world. The text accompanied with each video is also collected from
YouTube, either written manually by the content creators or auto-generated by an ASR system.
The original long videos are further cut into short clips with an average duration of 4 seconds,
which produces 136M clip-text pairs in total.
• HD-VILA-100M (Xue et al., 2022) features high resolution (720p) videos from YouTube and
consists of 100M video clip and sentence pairs from 3.3 million videos with 371.5K hours in
total. Before the emergence of HD-VILA-100M, previous video-text datasets, including both
large-scale pre-training datasets (Miech et al., 2019) and downstream benchmarks (Xu et al., 2016;
Hendricks et al., 2017), are mostly 240p or 360p. Videos in HD-VILA-100M are collected from
15 popular YouTube categories (e.g., sports, music, autos). During collection, the authors ensure a balanced number of video clips in each category to ease the under-fitting problem. Using the off-the-shelf tool3, the auto-generated subtitles in YouTube videos are split into complete sentences
and aligned to their corresponding clips via Dynamic Time Warping using the timestamp of the
original subtitles. After processing, each pair in HD-VILA-100M consists of a video clip about
13.4 seconds on average and a sentence with 32.5 words on average.
• YTTemporal-180M (Zellers et al., 2021) is derived from 6M YouTube videos, which cover di-
verse domains and topics. The authors first collect a large amount (∼ 27M) of video candidate
ids from YouTube, including instructional videos, lifestyle vlogs of everyday events and some
auto-suggested videos by YouTube with topics like “science” or “home improvement”. Then,
these candidate videos are filtered using YouTube API and some pre-trained computer vision
models. Specifically, they exclude those videos that do not have English ASR tracks, or are over
20 minutes long, or are not visually grounded, or whose thumbnails do not have objects (based
on the predictions of a pre-trained image classification model). Like other YouTube-based pre-
training datasets, the accompanying texts for these videos are produced using ASR tools, and later
processed to add punctuation. The total 6M videos are cut into 180M short clips based on the
predicted punctuation added to the ASR texts, which may suggest a sentence ending. This dataset
is further augmented with the audio modality and scaled up to 1B (in # frame-text-audio triplets),
namely YTTemporal-1B in Zellers et al. (2022).
3 https://github.com/ottokart/punctuator2
Figure 5.6: The evolution of video-text pre-training data along time. The x-axis indicates the year
and the month that each dataset is released. The y-axis is the total video duration in number of
days. The size of the circle indicates the number of video-text pairs in the dataset. We group pop-
ular datasets into (i) YouTube-based datasets; (ii) datasets with short videos and alt-texts; and (iii) the TV-show-based dataset (Lei et al., 2018). YouTube-based datasets include the one used in
VideoBERT (Sun et al., 2019a), HowTo100M (Miech et al., 2019), HD-VILA-100M (Xue et al.,
2022), YTTemporal-180M (Zellers et al., 2021) and YTTemporal-1B (Zellers et al., 2022). Datasets
with short videos and alt-texts cover AutoGIF (Pan et al., 2020a), WebVid-2.5M and WebVid-
10M (Bain et al., 2021).
• WebVid2.5M (Bain et al., 2021) is inspired by the web-crawled image-text dataset Conceptual
Captions (CC3M) (Sharma et al., 2018). Following a similar collection pipeline, a total of 2.5M
text-video pairs were scraped from the same source as CC3M. Although more than 20x smaller
than YouTube-based pre-training datasets, WebVid2.5M is of higher quality and widely adopted: its texts are manually generated captions, mostly well-formed sentences that describe the visual scenes more precisely. This dataset has recently been enlarged to WebVid10M with
10M video-text pairs.
• Auto-captions on GIF (AutoGIF) (Pan et al., 2020a) crawls over 160M GIF videos from commer-
cial GIF websites with text queries constructed by extracting objects, actions and subject-verb-
object triplets from existing image/video benchmarks. The GIF videos can be viewed as videos without an audio channel, and they are usually as short as 3 seconds.
• TV Dataset is first introduced in Lei et al. (2018), in which video clips from 6 popular TV se-
ries across 3 genres (medical dramas, sitcoms and crime shows) are used to collect a downstream
video-language question answering dataset. It consists of 22K video clips from 925 episodes.
Each video clip is 60-90 seconds long, covering long-range scenes with complex character inter-
actions and social/professional activities. The accompanying texts are human-written subtitles,
transcribed from the dialogue/conversation happening in the video.
• Video Source: The TV Dataset (Lei et al., 2018) is sourced from popular TV shows while all
other datasets are crawled from the Internet. It is worth noting that the large-scale datasets (e.g.,
HowTo100M (Miech et al., 2019), HD-VILA-100M (Xue et al., 2022), YTTemporal (Zellers
et al., 2021, 2022)) are mostly based on YouTube videos.
• Accessibility: HowTo100M and WebVid (Bain et al., 2021) are released to the public with raw videos.
Frames extracted at 3 fps are released in the TV dataset, due to copyright concerns. Media URLs
are released in the datasets of YTTemporal, HD-VILA-100M and AutoGIF (Pan et al., 2020a).
• Scale: YouTube-based datasets like the HowTo100M, HD-VILA-100M and YTTemporal datasets
are large-scale, ranging from 100M to 1B video clips, while other datasets, especially WebVid (<10M videos) and the TV dataset (<22K videos), are smaller. The evolution in scale of popular video-text pre-training datasets over time is depicted in Figure 5.6.
Figure 5.7: Advanced topics in VLP for video-text tasks. We gray out works that have been covered
in previous sections.
As images can be considered as a special case of videos, with temporal size 1, researchers (Lei
et al., 2021b; Bain et al., 2021) have explored leveraging image-text data for video-text pre-training.
Popular image-text datasets, introduced in Chapter 3.4, have been recently added to the video-text
pre-training corpora, including COCO (Chen et al., 2015), Visual Genome (VG) (Krishna et al.,
2017c), SBU Captions (Ordonez et al., 2011), Conceptual Captions (CC3M) (Sharma et al., 2018),
and CC12M (Changpinyo et al., 2021).
Apart from the four common pre-training tasks discussed in Section 5.3, other pre-training objec-
tives have been explored to improve the model training on noisy video-text pairs. Tang et al. (2021c)
add automatically-extracted dense region captions from the video frames as auxiliary text input, to
provide informative visual cues to improve the learning of video and language associations. To al-
leviate the temporal misalignment issue, they incorporate an entropy-minimization-based constrained
attention loss to encourage the model to automatically focus on the correct captions from a pool
of candidate ASR captions. Li et al. (2022d) propose a new visually-grounded pre-training task,
prompting entity modeling (PEM), to learn fine-grained region-entity alignment. The prediction
targets for the PEM task are generated by an entity prompter module, trained with contrastive learn-
ing to produce the similarity between a video crop and text prompts instantiated with entity names.
During training, the PEM task asks the model to predict the entity pseudo-labels (i.e., normalized
similarity scores) for randomly-selected video crops. In BridgeFormer (Ge et al., 2022), the authors
exploit the rich semantics of text (i.e., nouns and verbs) to build question-answer pairs to form a
question answering task as a pretext task, with which the model can be trained to capture more re-
gional content and temporal dynamics. Wang et al. (2022c) propose an object-aware Transformer
to leverage bounding boxes and object tags to guide the training process.
The studies of ClipBERT (Lei et al., 2021b) and Frozen (Bain et al., 2021) demonstrate that image-
text pre-training is effective in improving downstream video-text performance. Recent efforts in
image-text modeling have also shown that, when scaled up to hundreds of millions (Radford et al.,
2021) or even billions (Li et al., 2021a) of image-text pairs, image-text models can achieve state-
of-the-art results on various video-text tasks, including text-to-video retrieval (Luo et al., 2021a;
Yuan et al., 2021; Yu et al., 2022a), video question answering (Alayrac et al., 2022), and video
captioning (Tang et al., 2021a; Wang et al., 2022d).
The advantages of transferring image-text models to video-text tasks are twofold. First, leveraging
image-text pre-training or well-pretrained image-text models can potentially save the computational
cost of video-text pre-training. Second, compared to video-text data, large-scale image-text data
are cleaner (i.e., the text description is usually better-aligned to the image content) and are easier
to collect (e.g., there are many more image alt-text pairs than video alt-text pairs available on the internet). However, existing approaches (Yu et al., 2022a; Wang et al., 2022d) adapt image-
text models to video-text tasks by simply concatenating video frames without explicit temporal
modeling. Such frame concatenation can only work on short videos with sparsely sampled frames,
but is ineffective for long videos (Yu et al., 2019a) or more challenging tasks that require temporal
reasoning (Lei et al., 2020b). It is worth exploring more effective ways to transfer image-text models
for more challenging video-text tasks.
Videos are multi-channel in nature: they are composed of visual signals from video frames, language cues from speech-transcribed texts, and audio signals from environmental sounds or background music. However, most video-text models (Sun et al., 2019a; Miech et al., 2019; Zellers
et al., 2021) mainly focus on vision-language modeling with video frames only (i.e., single-channel
videos) to learn video representations and the joint video-language representations. Although video-
text tasks defined on most existing video-text datasets (Xu et al., 2016; Chen and Dolan, 2011) can
be largely solved with single-channel video inputs, we cannot solely rely on these datasets to test
the model’s capability of video-text understanding. Recent efforts in learning from multi-channel
videos have been made for developing new modeling techniques and benchmark datasets.
Multi-channel Video Modeling. MV-GPT (Seo et al., 2022) proposes a pre-training framework for multimodal video captioning with both visual frames and subtitles as
inputs. The proposed MV-GPT model is based on the Transformer architecture and can be end-to-
end trained. MERLOT-Reserve (Zellers et al., 2022) similarly adopts the Transformer architecture,
but takes all three channel inputs (visual frames, subtitles and audio). An important finding of the
MERLOT-Reserve study is that video-text pre-training with audio can help visual commonsense
reasoning (Zellers et al., 2019), an audio-less image-text task.
Multi-channel Video-text Benchmarks. TVQA (Lei et al., 2018), TVQA+ (Lei et al., 2020a),
TVR and TVC (Lei et al., 2020b) are attempts to build video-text datasets with multi-channel video inputs in the TV show domain. Specifically, during data collection, annotators are instructed to
write text descriptions/QA pairs given the context provided in visual frames only, or subtitles only,
or both visual frames and subtitles. Following the same procedure, How2R and How2QA (Li et al.,
2020b) are created to cover the additional domain of instructional videos.
Another problem in model evaluation is that existing video-text models (Xu et al., 2021a; Yang
et al., 2021a; Tang et al., 2021c) are often evaluated on their own choices of downstream datasets,
which makes it hard to compare between models. We expect that a general video-language system
should do well on diverse tasks/domains/datasets, as we have witnessed in the NLP field that pub-
licly accessible large-scale multi-task benchmarks (Wang et al., 2019b,a) can facilitate advances in
modeling. With the above motivation, VALUE (Li et al., 2021c) is a first attempt to build a compre-
hensive benchmark for video and language understanding evaluation. There are four characteristics of the VALUE benchmark. (i) It features multi-channel videos, with video frames and subtitles as video inputs. (ii) The videos in VALUE are collected from diverse video domains, including movies, TV shows, instructional videos, and vlogs. (iii) VALUE includes 11 datasets over 3 representative tasks:
text-to-video retrieval, video question answering and video captioning. (iv) VALUE supports a live
leaderboard to track the advances in video-and-language research.
Contrastive video-text pre-training (Miech et al., 2020; Yang et al., 2021b; Xu et al., 2021b) has
shown promising results on video action recognition (Kuehne et al., 2011; Kay et al., 2017), action
localization (Abu-El-Haija et al., 2016; Zhukov et al., 2019), and action segmentation (Tang et al.,
2019). TAN (Han et al., 2022) enhances the dual-encoder architecture trained with VTC by adding
a temporal alignment network to tackle long-term video understanding. For procedure planning in
instructional videos, Zhao et al. (2022) propose a weakly supervised method of learning models from
natural language instructions in HowTo100M (Miech et al., 2019). Given that contrastive image-text
pre-training (Radford et al., 2021; Li et al., 2021a) is beneficial to learning image representations, a
line of work (Wang et al., 2021b; Ju et al., 2022; Li et al., 2022i) explores prompting CLIP (Radford
et al., 2021) to perform video action recognition. Leveraging VLP for other core video tasks, such
as video object detection (Damen et al., 2018) and video object segmentation (Perazzi et al., 2016),
is an interesting direction to explore in the future.
On the modeling front, Park et al. (2022) identify weaknesses in video-language models through text manipulations, and report that a pre-trained video-language model can be easily fooled,
which indicates that the models may rely on some spurious clues in the training data.
How to design a unified VL model that can support various downstream VL tasks without introduc-
ing task-specific heads is a popular topic in image-text modeling (detailed in Section 3.5). Similar
attempts have been made in video-text understanding along two dimensions.
• One Transformer for all: Transformer (Vaswani et al., 2017) and Transformer-based pre-trained
models (Devlin et al., 2019; Dosovitskiy et al., 2021) have revolutionized a wide range of research
fields, e.g., natural language processing (Devlin et al., 2019; Liu et al., 2019d), computer vi-
sion (Dosovitskiy et al., 2021; Liu et al., 2021c), and speech processing (Baevski et al., 2020; Chen
et al., 2022b). Motivated by this, researchers have tried to further narrow the modeling gaps among
different modalities by using a shared Transformer for video-text modeling. For example, All-in-
one (Wang et al., 2022a) encodes raw video and textual signals into joint representations using a
unified backbone architecture. VATT (Akbari et al., 2021) uses a modality-agnostic Transformer
shared across video, text and audio inputs. Similarly, Uni-Perceiver (Zhu et al., 2022) shares the
backbone weights across various modalities, such as image, video and text. OmniVL (Wang et al.,
2022b) is a universal architecture to support both image-text and video-text tasks. SkillNet (Dai
et al., 2022) uses a sparsely activated Transformer (i.e., a mixture-of-experts (Shazeer et al., 2017)), where different parts of the parameters are specialized to process different modalities, including text, audio, image, video, and code.
• Unifying video-text tasks as text generation. Inspired by the unifying efforts in image-text mod-
eling (Cho et al., 2021; Wang et al., 2022k), LAVENDER (Li et al., 2022g) focuses on integrating
different video-text tasks into a unified format so that a single architecture can be used for all tasks.
Specifically, all pre-training and downstream tasks are reformulated as masked language model-
ing, so that a single task head is used for both pre-training and downstream finetuning, without
introducing additional task-specific heads. With the unified architecture, LAVENDER can sup-
port all downstream tasks with just a set of shared parameter values when multi-task finetuned,
showing strong generalizability on downstream tasks with limited training examples, and enabling
zero-shot prediction on video question answering tasks. However, LAVENDER has several limi-
tations that suggest two future improvements: (i) extensions to fine-grained video-text tasks (e.g.,
video corpus moment retrieval (Lei et al., 2020b)), as current LAVENDER only supports video
retrieval; and (ii) more effective in-context few-shot learning or prompt tuning to better leverage
and improve the generalizability of LAVENDER.
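To illustrate how video QA can be cast as masked language modeling in the LAVENDER style, the sketch below appends a [MASK] token to the question so that the (video-conditioned) MLM head predicts the answer at that position; the tokenizer choice and the single-token-answer assumption are simplifications of this unified formulation.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def video_qa_as_mlm(question: str):
    """Reformulate open-ended video QA as MLM: append a [MASK] to the question text;
    the model fills it in with the answer word, so no task-specific QA head is needed."""
    enc = tokenizer(question + " " + tokenizer.mask_token, return_tensors="pt")
    mask_pos = (enc["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    # feed enc together with the video features; read the predicted answer token at mask_pos
    return enc, mask_pos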
The majority of the literature in video-text understanding focuses on English-only video-text tasks, while we live in a multilingual world. Due to the lack of large-scale non-English video-text
datasets for both pre-training and downstream evaluation, video-text models in non-English lan-
guages are less explored. As initial attempts, Lei et al. (2021a) and Zeng et al. (2022d) have
developed large-scale Chinese video-text datasets. Huang et al. (2021a) crawl multilingual subti-
tles for each video in HowTo100M (Miech et al., 2019) in 9 languages, covering English, German,
French, Russian, Spanish, Swahili, Chinese and Vietnamese, to develop a new multilingual instruc-
tional video dataset (MultiHowTo100M).
Chapter 6
VL Systems in Industry
As the technology of vision-language (VL) learning advances rapidly, more and more companies
are integrating VL capabilities into their products and services. The iPhone can automatically generate image captions, which are read aloud by VoiceOver so that vision-impaired users know what is in the
image. Chrome OS has the capability to generate image captions in 10 different languages for
unlabeled web images. Microsoft offers image captioning as an Azure cloud service, and Microsoft
Office applications (e.g., PowerPoint and Word) use this image captioning service to generate image
descriptions automatically. Seeing AI, which is a mobile app for the blind and low vision community
currently available on iPhone, has a channel to describe a scene using automatically generated image
captions. In addition to image captioning, we believe many other VLP-enabled technologies, such
as open vocabulary image classification and object detection, will be deployed into products and
services in industry.
Figure 6.1: Microsoft PowerPoint automatically generates image descriptions for user-inserted im-
ages. (a) After inserting an image into PowerPoint and right clicking on it, a drop-down menu pops
up. Select “View Alt Text” to automatically generate image description. (b) The generated im-
age description “A cat walking in the snow” is displayed in the text box. A user can also edit the
generated image description.
Figure 6.2: Microsoft Edge browser automatically generates image descriptions (alt text) which are
then read out via a text-to-speech engine of a screen reader.
• Seeing AI. Figure 6.3a shows a screen shot of Seeing AI. The button at the bottom right performs scene description. It calls the image captioning
API (Microsoft Cognitive Service) to automatically generate alt text which is then read out via
a text-to-speech engine. Figure 6.3b shows an image captured by the phone, and the generated
image caption is shown below.
• Facebook. Facebook provides the feature to automatically generate alt text for user-uploaded
images. When we create a post in Facebook and upload an image as shown in Figure 6.4a, a button
“Edit” will pop up. By clicking on the “Edit” button, the system will automatically generate an
image description as shown in Figure 6.4b (“May be an image of fruit” in this example). More
detailed instructions can be found on the webpage.3
• Apple iOS VoiceOver Screen Reader. VoiceOver is the screen reader built into iOS, the operating system on Apple's mobile devices. With VoiceOver, users with visual impairment can use a few simple gestures to hear aloud what sighted users see displayed on the screen.
There are also many VL models hosted in cloud services, detailed below.
3 How do I edit the alternative text for a photo on Facebook?
Figure 6.3: Seeing AI is designed for those who are blind or have low vision. It is currently available
in iPhone App store. (a) A screen shot of Seeing AI where the button at the bottom right invokes
Scene Description feature. (b) The generated image caption for the captured image, which will be
read out via a text-to-speech engine.
Figure 6.4: Facebook automatically generates alt text for images uploaded by users. (a) An image is
uploaded to Facebook to create a post, and an “Edit” button appears. (b) When the “Edit” button is
clicked, the system automatically generates an image description. It also allows the user to manually
enter a corrected or preferred description.
• Microsoft Azure Computer Vision - Cognitive Services. Microsoft Azure Computer Vision is
an AI service that analyzes and extracts rich information from images and videos. The Image
Analysis service extracts many visual features from images, such as objects, faces, adult content,
and auto-generated text descriptions. One can follow the Image Analysis quickstart to try it out. The Spatial Analysis service analyzes the presence and movement of people in a video feed
and produces events that other systems can respond to. One can also install the Spatial Analysis
container to get started.
• Google Cloud Vision AI. Google Cloud Vision AI provides two computer vision products to help
you understand images. AutoML Vision automates the training of your own custom machine
learning models. Simply upload images and train custom image models with AutoML Vision’s
easy-to-use graphical interface; optimize your models for accuracy, latency, and size; and export
them to your application in the cloud or to an array of devices at the edge. Vision API offers
powerful pre-trained machine learning models through REST and RPC APIs. Assign labels to
images and quickly classify them into millions of predefined categories. Detect objects and faces,
read printed and handwritten text, and build valuable metadata into your image catalog.
• Amazon Rekognition. Amazon Rekognition offers pre-trained and customizable computer vision
capabilities to extract information and insights from your images and videos. The AI service
includes key features such as Content moderation, Face compare and search, Face detection and
analysis, Labels, Custom labels, Text detection, Celebrity recognition, Video segment detection,
and Streaming video events detection.
• Alibaba Cloud Image Search. Image Search allows users to search by image based on image
similarities. Image Search uses deep learning and machine vision to capture characteristics of
images and then search for images based on the captured information. Search by Product Images
allows customers to use a product image to search for the same product or similar products in
your self-managed image library. Then, the system returns information about the product images.
Search by General-Purpose Image allows users to use an image to search for images that contain
the same elements or objects from a self-managed image library. The system returns the same or
similar images based on the captured image information.
How are the current prevailing big foundation models changing the industry? Besides the fact that
big foundation models demonstrate superior performance on a variety of downstream vision and
vision-language tasks, more importantly, they are also fundamentally changing the way the industry
collects data, develops models, delivers services, and builds their R&D organizations.
On one hand, as discussed in Section 6.1, it is very encouraging to see that there are already many
VL systems deployed in industry. On the other hand, there are also many factors that one has
to consider when deploying a VL model to real-world applications, including robustness to new
domains, inference cost and latency, fairness, and responsible AI issues. Since VL learning is still a
relatively new field, research on these practical issues is preliminary. As more and more applications
start to deploy VL models, we expect the demand for solutions to these practical issues to become
increasingly strong, which will inspire more research in these areas. In this section, we review
solutions to three fundamental issues regarding deploying VL systems for real-world applications:
domain adaptation, serving cost, and responsible AI.
Domain Adaptation. Images in real-world scenarios are usually unpredictable and exhibit large variations, so the model must be robust and generalize well to new domains. Though there
has been a lot of work on domain adaptation in other computer vision areas such as image classifi-
cation and object detection, there has been little research on domain adaptation for VL models. One
interesting domain is non-natural images such as diagrams, tables, and charts. So far, most VL datasets are composed of natural images; how to handle non-natural images remains largely unaddressed.
Serving Cost. In addition to accuracy, there are also constraints on the inference cost and latency.
How to reduce model size without sacrificing accuracy is an important problem for real-world ap-
plications, and there has been a lot of research devoted to it. For example, Wang et al. (2020a)
developed a small VL model called MiniVLM that reduces model size by 73% and the inference
time cost by 94% while being able to retain 94-97% of the accuracy on multiple VL tasks. Fang
et al. (2021) developed a knowledge distillation technique to compress a Transformer-based large
VL model into a small VL model. Inspired by the lottery ticket hypothesis (Frankle and Carbin,
2019), Gan et al. (2022) found that lottery tickets also exist for VLP models such as UNITER,
LXMERT, and ViLT. They were able to discover “relaxed” winning tickets at 50%-70% sparsity
that maintain 99% of the full accuracy.
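As an illustration of the kind of sparsity involved, the sketch below performs one-shot global magnitude pruning of the Linear layers in a model; lottery-ticket studies such as Gan et al. (2022) use a more elaborate iterative prune-rewind-retrain procedure, so this is only a starting point, not their method.

import torch
import torch.nn as nn

def magnitude_prune_masks(model: nn.Module, sparsity: float = 0.6):
    """Build binary masks that zero out the smallest-magnitude weights so that `sparsity`
    of all Linear-layer weights is removed; masks are re-applied after each optimizer step."""
    weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = max(int(sparsity * scores.numel()), 1)
    threshold = scores.kthvalue(k).values                 # k-th smallest magnitude
    return [(w.detach().abs() > threshold).float() for w in weights]

# During sparse finetuning, e.g., after each optimizer step:
#   for w, m in zip(weights, masks): w.data.mul_(m)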
Fairness and Responsible AI. Fairness and responsible AI is another critical issue when deploy-
ing a VL model. As VL datasets typically contain biases, the trained models usually have biases
as well, which may generate results that appear offensive. As reported by Zhao et al. (2021), the
COCO image captioning dataset (Chen et al., 2015) is heavily skewed towards lighter-skinned and
male individuals. In addition, there are racial terms including racial slurs in the manually annotated
captions. Given the biases in the training data, it is difficult to avoid biases in the trained models.
For example, as reported by Srinivasan and Bisk (2022), the VL-BERT model exhibits gender biases
where it sometimes may prefer to reinforce a stereotype over faithfully describing the visual scene.
There has been research work to reduce biases in VL models. Cadene et al. (2019b) proposed
a learning strategy for VQA to reduce the importance of the most biased examples that can be
correctly classified without looking at the image, thus forcing the model to use both input modalities
instead of relying on statistical regularities between the question and the answer. KV and Mittal
(2020) proposed a technique to reduce the dependency of a VQA model on the language prior by
using a model-agnostic question encoder that utilizes both visual and language modalities equally
while encoding the question. Despite the progress, how to eliminate biases in general VL models is
still an open problem. Since real-world data contains a lot of biases, scalable solutions have yet to
be developed to eliminate data biases during data collection and curation.
Fairness and Responsible AI have also been studied in the context of conversational AI agents (bots).
VL systems and conversational AI agents share many problems. Recently, VL models have also
been incorporated into AI bots for social chat and task completion (e.g., Gao et al., 2019a; Zhou
et al., 2020a; Gao et al., 2022a). The design of AI bots needs to defend against harms and mitigate
potential toxicity and bias – either in training data, introduced through feedback and usage, or simply
inappropriate due to local culture or context (e.g., Breitfeller et al., 2019; Zhang et al., 2018d).
The 10 guidelines described in Microsoft (2018) outline what needs to be considered to develop a
responsible AI bot that meets the challenges in real-world situations.
Chapter 7
Vision-Language Pre-training (VLP) has attracted rapidly growing attention from both the computer
vision and NLP communities, especially due to the emergence of large-scale multimodal foundation
models like CLIP (Radford et al., 2021), DALL-E (Ramesh et al., 2021), CoCa (Yu et al., 2022a),
Flamingo (Alayrac et al., 2022), and Florence (Yuan et al., 2021). In this chapter, we provide a
concise summary of what has been reviewed, and discuss the current research trends.
• Task-specific models. To lay a comprehensive foundation for the introduction of VLP models, we
have discussed many seminal papers before the era of pre-training. One major theme during this
period is the design of various attention mechanisms. We have introduced how the field has been
moving from (i) inter-modality attention design, which aims to capture multimodal alignment and
perform multimodal fusion, to (ii) intra-modality attention design, which aims to capture visual
relations among image regions, e.g., via graph attention networks, to (iii) the convergence to the
Transformer architecture, which models both inter- and intra-modality interactions. Besides the
attention design, we have also briefly discussed topics regarding bilinear pooling methods for
multimodal fusion, neural module networks for compositional visual reasoning, and so on.
• VLP for image-text tasks. We have covered VLP models for image captioning, visual question
answering, image-text retrieval, and visual grounding. A general methodology transition is from
the development of OD-based VLP models (which requires the extraction of image regional fea-
tures first from an offline pre-trained object detector) to the prevailing end-to-end VLP models,
partially due to the popularity of vision Transformer. Early VLP models are only pre-trained on
approximately 4M images (with roughly 10M image-text pairs), while the most recent big VLP
models are already pre-trained on over 10B image-text pairs. We have also discussed many advanced
topics, ranging from unified image-text modeling, few-shot learning, knowledge, robustness, mul-
tilingual VLP, to model compression and efficient adaptation.
• VLP for core vision tasks. We have reviewed VLP models for core vision tasks, including im-
age classification, object detection, and segmentation. These language-augmented visual mod-
els demonstrate a strong zero-shot transfer capability, since they acquire open-set and open-
vocabulary recognition abilities through problem reformulation, casting image classification as
image retrieval or object detection as phrase grounding. Moreover, model generalization is im-
proved as natural language supervision typically contains much richer semantics. We advocate
the concept of computer vision in the wild,1 and encourage the development and evaluation of
future foundation models for this. We have also discussed many advanced topics, ranging from
benchmark, knowledge, robustness, efficient adaptation, to open-set video classification.
1 Computer-Vision-in-the-Wild Readings.
• VLP for video-text tasks. We have discussed VLP models for video-text tasks, including video
retrieval, question answering and captioning. We observe the same research trend as that of image-
text: a transition from the use of offline-extracted video features to end-to-end VLP models via, e.g.,
the use of video Transformers. We have provided an in-depth discussion on the design of model
architectures, pre-training tasks, and the widely used pre-training datasets to date. We have also
covered diverse advanced topics, such as learning from multi-channel videos, the adaptation of
image-text models for video-text tasks, VLP for core video tasks, and unified video-text modeling.
Text-to-Image Generation. In the context of VLP, text-to-image generation methods can be clas-
sified into two categories: (i) VQ-token-based auto-regressive methods, such as DALL-E (Ramesh
et al., 2021), Make-A-Scene (Gafni et al., 2022), NUWA-Infinity (Wu et al., 2022a), and Parti (Yu
et al., 2022b); and (ii) diffusion-based methods, such as DALL-E 2 (Ramesh et al., 2022), Ima-
gen (Saharia et al., 2022), and Stable Diffusion (Rombach et al., 2022). We provide a brief discus-
sion on this important topic in Section 3.6. This field is growing rapidly; we leave a detailed survey
of this topic to future work.
Unified Modeling. In order to build a general-purpose foundation model, we need a unified model
architecture that can be readily scaled up; and when being pre-trained at scale, it can be readily
adapted to various downstream computer vision and vision-language (VL) tasks. From this uni-
fication, we envision that new model capabilities will be unlocked. There are different levels of
unification. For example, unification of different VL understanding tasks can be achieved relatively
easily (e.g., SimVLM (Wang et al., 2022k), GIT (Wang et al., 2022d), CoCa (Yu et al., 2022a)),
while the unification of VL understanding tasks and region-level localization tasks can be much
more challenging (e.g., UniTAB (Yang et al., 2021c), GLIP (Li et al., 2022h), and GLIPv2 (Zhang
et al., 2022b)), not to mention the unification of image generation tasks (e.g., OFA (Wang et al.,
2022e) and Unified-IO (Lu et al., 2022a)). Pix2seqV2 (Chen et al., 2022d) and UViM (Kolesnikov
et al., 2022) also propose unified approaches for computer vision tasks. MetaLM (Hao et al., 2022)
shows that language models can be a general-purpose interface for many diverse tasks. We envision
that more research efforts will be devoted to unified modeling. See Sections 3.5.3 and 5.5.6 for more
detailed discussions.
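As a minimal illustration of what a unified sequence interface looks like, the sketch below casts captioning, VQA, and visual grounding into a single text-in/text-out format, quantizing box coordinates into discrete location tokens. The prompt templates and token format are hypothetical, only loosely inspired by the sequence-interface works cited above.

def box_to_tokens(box, num_bins=1000):
    # Quantize normalized box coordinates (x1, y1, x2, y2) in [0, 1] into
    # discrete location tokens, so localization outputs become plain text.
    return " ".join(f"<loc_{int(round(c * (num_bins - 1)))}>" for c in box)

def build_unified_example(task, image_id, text=None, box=None):
    # Cast heterogeneous VL tasks into one (prompt, target) text pair that a
    # single sequence-to-sequence model could be trained on.
    if task == "captioning":
        return (f"[image:{image_id}] describe the image:", text)
    if task == "vqa":
        question, answer = text
        return (f"[image:{image_id}] question: {question} answer:", answer)
    if task == "grounding":
        return (f"[image:{image_id}] locate: {text}", box_to_tokens(box))
    raise ValueError(f"unknown task: {task}")

# Example usage: all three tasks share the same text-in / text-out format.
print(build_unified_example("captioning", "0001", text="a dog chasing a ball"))
print(build_unified_example("vqa", "0002", text=("what color is the car?", "red")))
print(build_unified_example("grounding", "0003", text="the red car", box=(0.1, 0.2, 0.55, 0.8)))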
Computer Vision in the Wild. How can natural language play a more fundamental role in computer
vision tasks? In Chapter 4, we have shown how visual recognition tasks (e.g., image classification,
object detection, and segmentation) can be cast as VL problems, making the unification of computer
vision and VL tasks possible. Distinct from the traditional closed-set recognition setting, the use of
natural language supervision enables open-set and in-the-wild visual recognition (see Section 4.5 for
a detailed discussion). We envision that more research efforts will be devoted to language-augmented
computer vision models, and that VLP has the potential to become a mainstream and impactful
direction in computer vision research. This also requires better benchmarks for evaluating the
performance of such computer vision foundation models, ranging from zero-shot generalization
(i.e., how the model performs "out of the box"), few-shot evaluation, linear probing, and prompting,
to model finetuning.
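As a concrete example of this reformulation, the sketch below performs zero-shot image classification by retrieving, for a given image, the class prompt whose text embedding is most similar to the image embedding, in the spirit of CLIP-style contrastive models. Here image_encoder and text_encoder are hypothetical stand-ins for any pair of contrastively pre-trained towers.

import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, image, class_names,
                       template="a photo of a {}."):
    # Cast image classification as image-to-text retrieval: score the image
    # against one natural-language prompt per class and pick the best match.
    prompts = [template.format(name) for name in class_names]
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(image), dim=-1)    # shape (1, d)
        txt_emb = F.normalize(text_encoder(prompts), dim=-1)   # shape (C, d)
    logits = img_emb @ txt_emb.t()                             # cosine similarities
    probs = logits.softmax(dim=-1)
    best = probs.argmax(dim=-1).item()
    return class_names[best], probs

Because the classes are specified purely in language, new class names can be swapped in at test time without re-training, which is exactly the open-vocabulary, in-the-wild setting discussed above.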
Model Scaling. In recent years, we have witnessed great successes from scaling up language mod-
els, via training large Transformers from massive amounts of text data. Prominent examples include
T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020), Megatron-Turing (Shoeybi et al., 2019), Chinchilla (Hoffmann et al., 2022), OPT (Zhang et al., 2022c), and PaLM (Chowdhery et al., 2022).
A crucial benefit of scaling is the potential of zero-shot and few-shot generalization. VLP models
have been following a similar trend, with examples including SimVLM (Wang et al., 2022k), Flo-
rence (Yuan et al., 2021), CoCa (Yu et al., 2022a), GIT (Wang et al., 2022d), BEiT-3 (Wang et al.,
2022g), PaLI (Chen et al., 2022e), and Flamingo (Alayrac et al., 2022). However, compared with
the scale of language models, the scaling of VLP models is still in its infancy. We envision that
bigger VL models, especially open-sourced ones (Ilharco et al., 2021), will appear in the near future. It
would also be interesting to investigate the emergent abilities of such big models once they become
available. See Section 3.5.1 for a more detailed discussion.
In-context Few-shot Learning. Can we train a model that quickly adapts to different downstream
tasks with only a few in-context examples? By inheriting this capability from large frozen language
models, Flamingo (Alayrac et al., 2022) has shown that this is possible for tasks with text outputs,
e.g., question answering, captioning, and classification. However, due to the diversity of vision
tasks, we often require the model to produce more than just text sequences. It remains unknown how
in-context few-shot learning can be enabled for complex tasks, such as localization, where bounding
boxes or even pixel-level outputs are needed. See Section 3.5.2 for a more detailed discussion.
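For intuition, the sketch below assembles an interleaved image-text prompt from a handful of support examples plus a query, which is how in-context few-shot learning is typically exposed to a multimodal model with text outputs. The <image> placeholder convention is an assumption for illustration, not the format of any released model.

def build_interleaved_prompt(support_examples, query_image, query_question):
    # Assemble an interleaved few-shot prompt: a handful of
    # (image, question, answer) demonstrations followed by the query, with
    # <image> placeholders marking where visual features are injected.
    segments = []
    for image, question, answer in support_examples:
        segments.append(f"<image> Q: {question} A: {answer}")
    segments.append(f"<image> Q: {query_question} A:")
    images = [img for img, _, _ in support_examples] + [query_image]
    return " ".join(segments), images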
Efficient Adaptation. As the size of VL models grows rapidly, it becomes increasingly important to
develop methods that adapt big VLP models to downstream tasks efficiently. Various parameter-efficient
transfer learning methods have been developed that keep the pre-trained model weights frozen and
tune only a small number of additional parameters, especially for the few-shot setting (see Section 4.6).
As we gain access to more and more large foundation models, this topic becomes timely and urgent.
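A representative example is the bottleneck adapter: a small trainable module inserted into a frozen backbone. The PyTorch sketch below, written under the assumption of Transformer hidden states of dimension dim, shows the adapter itself and a helper that freezes everything else; it is a simplified illustration, not the exact recipe of any method discussed in Section 4.6.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # A small bottleneck adapter (Houlsby et al., 2019-style): down-project,
    # non-linearity, up-project, residual connection. Only these few
    # parameters are trained while the large VLP backbone stays frozen.
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as a (near-)identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def mark_trainable(model, adapter_modules):
    # Freeze every backbone weight, then leave only adapter parameters trainable.
    for p in model.parameters():
        p.requires_grad = False
    for m in adapter_modules:
        for p in m.parameters():
            p.requires_grad = True

Prompt tuning and low-rank updates (e.g., LoRA-style factorized weight deltas) follow the same principle of touching only a small fraction of the parameters.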
Knowledge. On one hand, big foundation models encapsulate abundant multimodal knowledge
about the visual world in their model weights. On the other hand, the knowledge encoded in model
weights can quickly become outdated without timely model updates, as various types of knowledge,
for example, factual knowledge in databases, keep evolving in the real world. One solution is to
enhance pre-trained models with external knowledge. Knowledge-enhanced NLP models (also called
retrieval-augmented methods) have been widely studied for knowledge-intensive NLP tasks (Guu
et al., 2020; Lewis et al., 2020b), while the exploration of knowledge-enhanced vision and multimodal
models is still in its infancy. See Sections 3.5.4 and 4.6 for more detailed discussions.
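As a minimal sketch of the retrieval-augmented idea, the snippet below fetches the top-k entries of an external text memory that are closest to a (multimodal) query embedding, so that up-to-date knowledge can be supplied to the model at inference time; the shared embedding space and the memory format are assumptions for illustration.

import torch

def retrieve_knowledge(query_emb, memory_embs, memory_texts, k=3):
    # Fetch the k external knowledge entries whose embeddings are closest to
    # the query embedding, so they can be appended to the model input instead
    # of being baked into the weights. Embeddings are assumed L2-normalized,
    # with query_emb of shape (1, d) and memory_embs of shape (N, d).
    scores = query_emb @ memory_embs.t()              # (1, N) cosine similarities
    topk = scores.topk(k, dim=-1).indices.squeeze(0)
    return [memory_texts[i] for i in topk.tolist()]

# The retrieved snippets can be refreshed whenever the external knowledge base
# is updated, without re-training the pre-trained model itself.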
Robustness. We typically evaluate models on standard and well-established benchmarks, such
as ImageNet classification, COCO object detection, and VQAv2. On one hand, these benchmarks
have driven tremendous progress in the field, allowing top research teams around the world to
advance the state of the art by building on each other's work. On the other hand, we also need to be
careful not to over-claim model capabilities before carefully designed robustness evaluations are
performed. As discussed in Section 3.5.5, better diagnostic tests and more robust methods should be developed.
Concluding Remarks. The aforementioned research directions are deeply connected, all serving
the same goal of developing a general-purpose, multi-sense AI system. For example, the model
architectures developed in unified modeling could lead to a better solution to computer vision in
the wild, and the techniques developed for model scaling can be used to scale up the unified
model. Further, when the model is significantly scaled up, the capability of in-context few-shot
learning may emerge naturally. With a few light-weight adapters and a few in-context examples,
we envision that a unified multimodal foundation model will be capable of efficiently adapting itself
to different tasks. In addition, external knowledge provides a further source for enhancing
performance. Lastly, in order to deploy these state-of-the-art models in real-world applications, we
also need to improve their robustness and cost-efficiency.
The VLP field is progressing at a rapid speed, with new ideas and methods emerging constantly.
There are many important research topics that are not discussed in this paper, mostly because it is
impossible for our writing to keep pace with research innovations that emerge daily. We feel glad
and blessed to write this paper, as it has been an exciting journey to review
the progress that we have made as a community. We are optimistic about the future of the VLP
field, not only because there are so many new research directions to explore, but also because we
are convinced that connecting the two important fields in AI, NLP and computer vision, is going to
significantly advance the state of the art of AI in the near future.
Acknowledgments
Many people have supported us and provided valuable feedback on the writing of this book. This
book is largely based on our CVPR 2022 tutorial on vision-language pre-training (VLP). We espe-
cially thank Zi-Yi Dou, Jianfeng Wang, Zhengyuan Yang, and Xiaowei Hu for providing valuable
materials on “VLP for image-text tasks”; Jianwei Yang and Pengchuan Zhang for their inputs to
“VLP for core vision tasks”; Chung-Ching Lin and Kevin Lin for tutorials on “VLP for video-text
tasks”; and Chenfei Wu for contributions to “VLP for Text-to-Image Synthesis”. This book is also
partially based on our CVPR 2021 and 2020 tutorials, for which we thank Luowei Zhou, Licheng
Yu, Yu Cheng, Yen-Chun Chen, Jingjing Liu and Xiaodong He for their contributions. We are also
grateful to the anonymous reviewers for their insightful feedback, and Mark de Jongh for making
the publication of this book possible.
Bibliography
Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijaya-
narasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint
arXiv:1609.08675.
Agarwal, V., Shetty, R., and Fritz, M. (2020). Towards causal vqa: Revealing and reducing spurious
correlations by invariant and covariant semantic editing. In CVPR.
Aghajanyan, A., Huang, B., Ross, C., Karpukhin, V., Xu, H., Goyal, N., Okhonko, D., Joshi, M.,
Ghosh, G., Lewis, M., et al. (2022). Cm3: A causal masked multimodal model of the internet.
arXiv preprint arXiv:2201.07520.
Agrawal, A., Batra, D., Parikh, D., and Kembhavi, A. (2018). Don’t just assume; look and answer:
Overcoming priors for visual question answering. In CVPR.
Agrawal, A., Kajić, I., Bugliarello, E., Davoodi, E., Gergely, A., Blunsom, P., and Nematzadeh, A.
(2022). Rethinking evaluation practices in visual question answering: A case study on out-of-
distribution generalization. arXiv preprint arXiv:2205.12191.
Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S.,
and Anderson, P. (2019). nocaps: novel object captioning at scale. In ICCV.
Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., and Gong, B. (2021). Vatt:
Transformers for multimodal self-supervised learning from raw video, audio and text. In NeurIPS.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican,
K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. arXiv
preprint arXiv:2204.14198.
Alayrac, J.-B., Recasens, A., Schneider, R., Arandjelović, R., Ramapuram, J., De Fauw, J., Smaira,
L., Dieleman, S., and Zisserman, A. (2020). Self-supervised multimodal versatile networks. In
NeurIPS.
Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2016). Spice: Semantic propositional
image caption evaluation. In ECCV.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018a).
Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and
Van Den Hengel, A. (2018b). Vision-and-language navigation: Interpreting visually-grounded
navigation instructions in real environments. In CVPR.
Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016a). Learning to compose neural networks
for question answering. In NAACL.
Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016b). Neural module networks. In CVPR.
Aneja, J., Deshpande, A., and Schwing, A. G. (2018). Convolutional image captioning. In CVPR.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). Vqa:
Visual question answering. In ICCV.
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. (2021). Vivit: A video
vision transformer. In ICCV.
Bachmann, R., Mizrahi, D., Atanov, A., and Zamir, A. (2022). Multimae: Multi-modal multi-task
masked autoencoders. arXiv preprint arXiv:2204.01678.
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-
supervised learning of speech representations. In NeurIPS.
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to
align and translate. In ICLR.
Bain, M., Nagrani, A., Varol, G., and Zisserman, A. (2021). Frozen in time: A joint video and image
encoder for end-to-end retrieval. In ICCV.
Banerjee, S. and Lavie, A. (2005). Meteor: An automatic metric for mt evaluation with improved
correlation with human judgments. In ACL workshop on intrinsic and extrinsic evaluation mea-
sures for machine translation and/or summarization.
Bao, H., Dong, L., and Wei, F. (2022a). BEiT: Bert pre-training of image transformers. In ICLR.
Bao, H., Wang, W., Dong, L., and Wei, F. (2022b). Vl-beit: Generative vision-language pretraining.
arXiv preprint arXiv:2206.01127.
Ben-Younes, H., Cadene, R., Cord, M., and Thome, N. (2017). Mutan: Multimodal tucker fusion
for visual question answering. In ICCV.
Ben-Younes, H., Cadene, R., Thome, N., and Cord, M. (2019). Block: Bilinear superdiagonal fusion
for visual question answering and visual relationship detection. In AAAI.
Bianchi, F., Attanasio, G., Pisoni, R., Terragni, S., Sarti, G., and Lakshmi, S. (2021). Contrastive
language-image pre-training for the italian language. arXiv preprint arXiv:2108.08688.
Birhane, A., Prabhu, V. U., and Kahembwe, E. (2021). Multimodal datasets: misogyny, pornogra-
phy, and malignant stereotypes. arXiv preprint arXiv:2110.01963.
Biten, A. F., Litman, R., Xie, Y., Appalaraju, S., and Manmatha, R. (2022). Latr: Layout-aware
transformer for scene-text vqa. In CVPR.
Biten, A. F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Valveny, E., Jawahar, C., and Karatzas, D.
(2019). Scene text visual question answering. In CVPR.
Bitton, Y., Guetta, N. B., Yosef, R., Elovici, Y., Bansal, M., Stanovsky, G., and Schwartz, R. (2022).
Winogavil: Gamified association benchmark to challenge vision-and-language models. arXiv
preprint arXiv:2207.12576.
Breitfeller, L., Ahn, E., Jurgens, D., and Tsvetkov, Y. (2019). Finding microaggressions in the wild:
A case for locating elusive phenomena in social media posts. In EMNLP.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In NeurIPS.
Buch, S., Eyzaguirre, C., Gaidon, A., Wu, J., Fei-Fei, L., and Niebles, J. C. (2022). Revisiting the
"video" in video-language understanding. In CVPR.
Bugliarello, E., Cotterell, R., Okazaki, N., and Elliott, D. (2021). Multimodal pretraining unmasked:
Unifying the vision and language BERTs. TACL.
Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. (2019a). Murel: Multimodal relational
reasoning for visual question answering. In CVPR.
Cadene, R., Dancette, C., Ben-younes, H., Cord, M., and Parikh, D. (2019b). Rubi: Reducing
unimodal biases for visual question answering. In NeurIPS.
Cai, Z., Kwon, G., Ravichandran, A., Bas, E., Tu, Z., Bhotika, R., and Soatto, S. (2022). X-detr: A
versatile architecture for instance-wise vision-language tasks. In ECCV.
Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y.-C., and Liu, J. (2020). Behind the scene: Revealing the
secrets of pre-trained vision-and-language models. In ECCV.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end
object detection with transformers. In ECCV.
Carlsson, F., Eisen, P., Rekathati, F., and Sahlgren, M. (2022). Cross-lingual and multilingual clip.
In Proceedings of the Language Resources and Evaluation Conference.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics
dataset. In CVPR.
Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., and Bisk, Y. (2022). Webqa: Multihop and
multimodal qa. In CVPR.
Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. (2021). Conceptual 12m: Pushing web-scale
image-text pre-training to recognize long-tail visual concepts. In CVPR.
Chen, B., Rouditchenko, A., Duarte, K., Kuehne, H., Thomas, S., Boggust, A., Panda, R., Kings-
bury, B., Feris, R., Harwath, D., et al. (2021a). Multimodal clustering networks for self-supervised
learning from unlabeled videos. In ICCV.
Chen, D. and Dolan, W. (2011). Collecting highly parallel data for paraphrase evaluation. In ACL.
Chen, F., Zhang, D., Han, M., Chen, X., Shi, J., Xu, S., and Xu, B. (2022a). Vlp: A survey on
vision-language pre-training. arXiv preprint arXiv:2202.09061.
Chen, L., Gan, Z., Cheng, Y., Li, L., Carin, L., and Liu, J. (2020a). Graph optimal transport for
cross-domain alignment. In ICML.
Chen, L., Zhang, Y., Zhang, R., Tao, C., Gan, Z., Zhang, H., Li, B., Shen, D., Chen, C., and Carin,
L. (2019). Improving sequence-to-sequence learning via optimal transport. In ICLR.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. (2020b). Generative
pretraining from pixels. In ICML.
Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X.,
et al. (2022b). Wavlm: Large-scale self-supervised pre-training for full stack speech processing.
JSTSP.
Chen, T., Frankle, J., Chang, S., Liu, S., Zhang, Y., Wang, Z., and Carbin, M. (2020c). The lottery
ticket hypothesis for pre-trained bert networks. In NeurIPS.
Chen, T. and Luo, J. (2020). Expressing objects just like words: Recurrent visual embedding for
image-text matching. In AAAI.
Chen, T., Saxena, S., Li, L., Fleet, D. J., and Hinton, G. (2022c). Pix2seq: A language modeling
framework for object detection. In ICLR.
Chen, T., Saxena, S., Li, L., Lin, T.-Y., Fleet, D. J., and Hinton, G. (2022d). A unified sequence
interface for vision tasks. arXiv preprint arXiv:2206.07669.
Chen, W., Gan, Z., Li, L., Cheng, Y., Wang, W., and Liu, J. (2021b). Meta module network for
compositional visual reasoning. In WACV.
Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. (2015). Microsoft
COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Chen, X., Ma, L., Chen, J., Jie, Z., Liu, W., and Luo, J. (2018). Real-time referring expression
comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426.
Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Gryc-
ner, A., Mustafa, B., Beyer, L., et al. (2022e). Pali: A jointly-scaled multilingual language-image
model. arXiv preprint arXiv:2209.06794.
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. (2020d).
UNITER: Universal image-text representation learning. In ECCV.
Cho, J., Lei, J., Tan, H., and Bansal, M. (2021). Unifying vision-and-language tasks via text gener-
ation. In ICML.
Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine
translation. In EMNLP.
Choi, S., On, K.-W., Heo, Y.-J., Seo, A., Jang, Y., Lee, M., and Zhang, B.-T. (2021). Dramaqa:
Character-centered video story understanding with hierarchical qa. In AAAI.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung,
H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways.
arXiv preprint arXiv:2204.02311.
Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). Electra: Pre-training text encoders
as discriminators rather than generators. In ICLR.
Cornia, M., Stefanini, M., Baraldi, L., and Cucchiara, R. (2020). Meshed-memory transformer for
image captioning. In CVPR.
Cui, Y., Yu, Z., Wang, C., Zhao, Z., Zhang, J., Wang, M., and Yu, J. (2021). Rosita: Enhancing
vision-and-language semantic alignments via cross-and intra-modal knowledge integration. In
ACMMM.
Dai, Y., Tang, D., Liu, L., Tan, M., Zhou, C., Wang, J., Feng, Z., Zhang, F., Hu, X., and Shi, S.
(2022). One model, multiple modalities: A sparsely activated approach for text, sound, image,
video and code. arXiv preprint arXiv:2205.06126.
Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D.,
Munro, J., Perrett, T., Price, W., et al. (2018). Scaling egocentric vision: The epic-kitchens
dataset. In ECCV.
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., and Batra, D. (2017).
Visual dialog. In CVPR.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale
hierarchical image database. In CVPR.
Deng, J., Yang, Z., Chen, T., Zhou, W., and Li, H. (2021). Transvg: End-to-end visual grounding
with transformers. In ICCV.
Desai, K., Kaul, G., Aysola, Z., and Johnson, J. (2021). Redcaps: Web-curated image-text data
created by the people, for the people. In NeurIPS, Track on Datasets and Benchmarks.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional
transformers for language understanding. In NAACL.
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat gans on image synthesis. In NeurIPS.
Diao, H., Zhang, Y., Ma, L., and Lu, H. (2021). Similarity reasoning and filtration for image-text
matching. In AAAI.
Diao, S., Zhou, W., Zhang, X., and Wang, J. (2022). Prefix language models are unified modal
learners. arXiv preprint arXiv:2206.07699.
Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., Yang, H.,
et al. (2021). Cogview: Mastering text-to-image generation via transformers. In NeurIPS.
Ding, M., Zheng, W., Hong, W., and Tang, J. (2022a). Cogview2: Faster and better text-to-image
generation via hierarchical transformers. arXiv preprint arXiv:2204.14217.
Ding, Z., Wang, J., and Tu, Z. (2022b). Open-vocabulary panoptic segmentation with maskclip.
arXiv preprint arXiv:2208.08984.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K.,
and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and
description. In CVPR.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani,
M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In ICLR.
Dou, Z.-Y., Kamath, A., Gan, Z., Zhang, P., Wang, J., Li, L., Liu, Z., Liu, C., LeCun, Y., Peng,
N., et al. (2022a). Coarse-to-fine vision-language pre-training with fusion in the backbone. In
NeurIPS.
Dou, Z.-Y., Xu, Y., Gan, Z., Wang, J., Wang, S., Wang, L., Zhu, C., Liu, Z., Zeng, M., et al. (2022b).
An empirical study of training end-to-end vision-and-language transformers. In CVPR.
Du, Y., Liu, Z., Li, J., and Zhao, W. X. (2022). A survey of vision-language pre-trained models. In
IJCAI survey track.
Duan, J., Chen, L., Tran, S., Yang, J., Xu, Y., Zeng, B., and Chilimbi, T. (2022). Multi-modal
alignment using representation codebook. In CVPR.
Esser, P., Rombach, R., Blattmann, A., and Ommer, B. (2021a). Imagebart: Bidirectional context
with multinomial diffusion for autoregressive image synthesis. In NeurIPS.
Esser, P., Rombach, R., and Ommer, B. (2021b). Taming transformers for high-resolution image
synthesis. In CVPR.
Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. (2017). Vse++: Improving visual-semantic
embeddings with hard negatives. arXiv preprint arXiv:1707.05612.
Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., and Schmidt, L. (2022a).
Data determines distributional robustness in contrastive language image pre-training (clip). arXiv
preprint arXiv:2205.01397.
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M.,
Platt, J. C., et al. (2015). From captions to visual concepts and back. In CVPR.
Fang, Z., Wang, J., Hu, X., Liang, L., Gan, Z., Wang, L., Yang, Y., and Liu, Z. (2022b). Injecting
semantic concepts into end-to-end image captioning. In CVPR.
Fang, Z., Wang, J., Hu, X., Wang, L., Yang, Y., and Liu, Z. (2021). Compressing visual-linguistic
model via knowledge distillation. In ICCV.
Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009). Describing objects by their attributes. In
CVPR.
Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth,
D. (2010). Every picture tells a story: Generating sentences from images. In ECCV.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019). Slowfast networks for video recognition.
In ICCV.
Feng, C., Zhong, Y., Jie, Z., Chu, X., Ren, H., Wei, X., Xie, W., and Ma, L. (2022). Promptdet:
Expand your detector vocabulary with uncurated images. In ECCV.
Frank, S., Bugliarello, E., and Elliott, D. (2021). Vision-and-language or vision-for-language? on
cross-modal influence in multimodal transformers. arXiv preprint arXiv:2109.04448.
Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural
networks. In ICML.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. (2013).
Devise: A deep visual-semantic embedding model. In NeurIPS.
Fu, T.-J., Li, L., Gan, Z., Lin, K., Wang, W. Y., Wang, L., and Liu, Z. (2021). Violet:
End-to-end video-language transformers with masked visual-token modeling. arXiv preprint
arXiv:2111.12681.
Fu, T.-J., Li, L., Gan, Z., Lin, K., Wang, W. Y., Wang, L., and Liu, Z. (2022). An empirical
study of end-to-end video-language transformers with masked visual modeling. arXiv preprint
arXiv:2209.01540.
Fu, Y. and Sigal, L. (2016). Semi-supervised vocabulary-informed learning. In CVPR.
Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. (2016). Multimodal
compact bilinear pooling for visual question answering and visual grounding. In EMNLP.
Gabeur, V., Sun, C., Alahari, K., and Schmid, C. (2020). Multi-modal transformer for video retrieval.
In ECCV.
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., and Taigman, Y. (2022). Make-a-scene:
Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131.
Gan, Z., Chen, Y.-C., Li, L., Chen, T., Cheng, Y., Wang, S., and Liu, J. (2022). Playing lottery
tickets with vision and language. In AAAI.
Gan, Z., Chen, Y.-C., Li, L., Zhu, C., Cheng, Y., and Liu, J. (2020). Large-scale adversarial training
for vision-and-language representation learning. In NeurIPS.
Gan, Z., Cheng, Y., Kholy, A. E., Li, L., Liu, J., and Gao, J. (2019). Multi-step reasoning via
recurrent dual attention for visual dialog. In ACL.
Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., and Xu, W. (2015). Are you talking to a machine?
dataset and methods for multilingual image question. In NeurIPS.
Gao, J., Galley, M., Li, L., et al. (2019a). Neural approaches to conversational ai. In Foundations
and Trends® in Information Retrieval.
Gao, J., Xiong, C., Bennett, P., and Craswell, N. (2022a). Neural approaches to conversational
information retrieval. arXiv preprint arXiv:2201.05176.
Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., and Qiao, Y. (2021). CLIP-adapter:
Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544.
Gao, P., Jiang, Z., You, H., Lu, P., Hoi, S. C., Wang, X., and Li, H. (2019b). Dynamic fusion with
intra-and inter-modality attention flow for visual question answering. In CVPR.
Gao, Y., Liu, J., Xu, Z., Zhang, J., Li, K., and Shen, C. (2022b). Pyramidclip: Hierarchical feature
alignment for vision-language model pretraining. arXiv preprint arXiv:2204.14095.
Ge, Y., Ge, Y., Liu, X., Li, D., Shan, Y., Qie, X., and Luo, P. (2022). Bridging video-text retrieval
with multiple choice questions. In CVPR.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Iii, H. D., and Crawford, K.
(2021). Datasheets for datasets. CACM.
Geng, X., Liu, H., Lee, L., Schuurmans, D., Levine, S., and Abbeel, P. (2022). Multimodal masked
autoencoders learn transferable representations. arXiv preprint arXiv:2205.14204.
Ghiasi, G., Gu, X., Cui, Y., and Lin, T.-Y. (2022). Open-vocabulary image segmentation. In ECCV.
Girshick, R. (2015). Fast r-cnn. In ICCV.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate
object detection and semantic segmentation. In CVPR.
Goel, S., Bansal, H., Bhatia, S., Rossi, R. A., Vinay, V., and Grover, A. (2022). Cyclip: Cyclic
contrastive language-image pretraining. arXiv preprint arXiv:2205.14459.
Goenka, S., Zheng, Z., Jaiswal, A., Chada, R., Wu, Y., Hedau, V., and Natarajan, P. (2022). Fash-
ionvlp: Vision language transformer for fashion retrieval with feedback. In CVPR.
Gokhale, T., Banerjee, P., Baral, C., and Yang, Y. (2020). Vqa-lol: Visual question answering under
the lens of logic. In ECCV.
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V.,
Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. (2017a). The "something something" video
database for learning and evaluating visual common sense. In ICCV.
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017b). Making the V in VQA
matter: Elevating the role of image understanding in visual question answering. In CVPR.
Grigoryev, T., Voynov, A., and Babenko, A. (2022). When, why, and which pretrained gans are
useful? In ICLR.
Gu, J., Meng, X., Lu, G., Hou, L., Niu, M., Xu, H., Liang, X., Zhang, W., Jiang, X., and Xu,
C. (2022a). Wukong: 100 million large-scale chinese cross-modal pre-training dataset and a
foundation framework. arXiv preprint arXiv:2202.06767.
Gu, J., Stefani, E., Wu, Q., Thomason, J., and Wang, X. E. (2022b). Vision-and-language navigation:
A survey of tasks, methods, and future directions. arXiv preprint arXiv:2203.12667.
Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., and Guo, B. (2022c). Vector
quantized diffusion model for text-to-image synthesis. In CVPR.
Gu, X., Lin, T.-Y., Kuo, W., and Cui, Y. (2022d). Open-vocabulary object detection via vision and
language knowledge distillation. In ICLR.
Gui, L., Wang, B., Huang, Q., Hauptmann, A., Bisk, Y., and Gao, J. (2022). Kat: A knowledge
augmented transformer for vision-and-language. In NAACL.
Gupta, A., Dollar, P., and Girshick, R. (2019). Lvis: A dataset for large vocabulary instance seg-
mentation. In CVPR.
Gupta, T., Kamath, A., Kembhavi, A., and Hoiem, D. (2022a). Towards general purpose vision
systems: An end-to-end task-agnostic vision-language architecture. In CVPR.
Gupta, T., Marten, R., Kembhavi, A., and Hoiem, D. (2022b). Grit: General robust image task
benchmark. arXiv preprint arXiv:2204.13653.
Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. (2018).
Vizwiz grand challenge: Answering visual questions from blind people. In CVPR.
Gurari, D., Zhao, Y., Zhang, M., and Bhattacharya, N. (2020). Captioning images taken by people
who are blind. In ECCV.
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle
for unnormalized statistical models. In AISTATS.
Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. (2020). Retrieval augmented language model
pre-training. In ICML.
Han, T., Xie, W., and Zisserman, A. (2022). Temporal alignment networks for long-term video. In
CVPR.
Hao, W., Li, C., Li, X., Carin, L., and Gao, J. (2020). Towards learning a generic agent for vision-
and-language navigation via pre-training. In CVPR.
Hao, Y., Song, H., Dong, L., Huang, S., Chi, Z., Wang, W., Ma, S., and Wei, F. (2022). Language
models are general-purpose interfaces. arXiv preprint arXiv:2206.06336.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In
CVPR.
He, P., Liu, X., Gao, J., and Chen, W. (2021). DeBERTa: Decoding-enhanced bert with disentangled
attention. In ICLR.
He, X., Li, C., Zhang, P., Yang, J., and Wang, X. E. (2022). Parameter-efficient fine-tuning for vision
transformers. arXiv preprint arXiv:2203.16329.
Hendricks, L. A., Mellor, J., Schneider, R., Alayrac, J.-B., and Nematzadeh, A. (2021). Decoupling
the role of data, attention, and losses in multimodal transformers. TACL.
Hendricks, L. A. and Nematzadeh, A. (2021). Probing image-language transformers for verb under-
standing. arXiv preprint arXiv:2106.09141.
Hendricks, L. A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., and Russell, B. (2017). Localizing
Moments in Video with Natural Language. In ICCV.
Herdade, S., Kappeler, A., Boakye, K., and Soares, J. (2019). Image captioning: Transforming
objects into words. In NeurIPS.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi,
M., Fleet, D. J., et al. (2022). Imagen video: High definition video generation with diffusion
models. arXiv preprint arXiv:2210.02303.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. In NeurIPS.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L.,
Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). Training compute-optimal large language
models. arXiv preprint arXiv:2203.15556.
Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., and Gould, S. (2021). Vln bert: A recurrent vision-
and-language bert for navigation. In CVPR.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan,
M., and Gelly, S. (2019). Parameter-efficient transfer learning for nlp. In ICML.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora:
Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Hu, R., Andreas, J., Darrell, T., and Saenko, K. (2018). Explainable neural computation via stack
neural module networks. In ECCV.
Hu, R., Andreas, J., Rohrbach, M., Darrell, T., and Saenko, K. (2017). Learning to reason: End-to-
end module networks for visual question answering. In ICCV.
Hu, R., Rohrbach, A., Darrell, T., and Saenko, K. (2019). Language-conditioned graph networks for
relational reasoning. In ICCV.
Hu, R. and Singh, A. (2021). Unit: Multimodal multitask learning with a unified transformer. In
ICCV.
Hu, X., Gan, Z., Wang, J., Yang, Z., Liu, Z., Lu, Y., and Wang, L. (2022). Scaling up vision-language
pre-training for image captioning. In CVPR.
Huang, L., Niu, G., Liu, J., Xiao, X., and Wu, H. (2022). Du-vlg: Unifying vision-and-language
generation via dual sequence-to-sequence pre-training. In Findings of ACL.
Huang, L., Wang, W., Chen, J., and Wei, X.-Y. (2019). Attention on attention for image captioning.
In ICCV.
Huang, P.-Y., Patrick, M., Hu, J., Neubig, G., Metze, F., and Hauptmann, A. (2021a). Multilin-
gual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. In
NAACL.
Huang, T.-H., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., Girshick, R., He, X.,
Kohli, P., Batra, D., et al. (2016). Visual storytelling. In NAACL.
Huang, Y., Wang, W., and Wang, L. (2017). Instance-aware image and sentence matching with
selective multimodal lstm. In CVPR.
Huang, Z., Zeng, Z., Huang, Y., Liu, B., Fu, D., and Fu, J. (2021b). Seeing out of the box: End-to-
end pre-training for vision-language representation learning. In CVPR.
Huang, Z., Zeng, Z., Liu, B., Fu, D., and Fu, J. (2020). Pixel-BERT: Aligning image pixels with
text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
Hudson, D. and Manning, C. D. (2019a). Learning by abstraction: The neural state machine. In
NeurIPS.
Hudson, D. A. and Manning, C. D. (2018). Compositional attention networks for machine reasoning.
In ICLR.
Hudson, D. A. and Manning, C. D. (2019b). Gqa: A new dataset for real-world visual reasoning
and compositional question answering. In CVPR.
Huo, Y., Zhang, M., Liu, G., Lu, H., Gao, Y., Yang, G., Wen, J., Zhang, H., Xu, B., Zheng, W., et al.
(2021). Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arXiv
preprint arXiv:2103.06561.
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V.,
Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt, L. (2021). OpenCLIP.
Jabri, A., Joulin, A., and Maaten, L. v. d. (2016). Revisiting visual question answering baselines. In
ECCV.
Jain, A., Guo, M., Srinivasan, K., Chen, T., Kudugunta, S., Jia, C., Yang, Y., and Baldridge,
J. (2021). Mural: multimodal, multitask retrieval across languages. arXiv preprint
arXiv:2109.05125.
Jang, Y., Song, Y., Yu, Y., Kim, Y., and Kim, G. (2017). TGIF-QA: Toward Spatio-Temporal
Reasoning in Visual Question Answering. In CVPR.
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T.
(2021). Scaling up visual and vision-language representation learning with noisy text supervision.
In ICML.
Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., and Chen, X. (2020). In defense of grid
features for visual question answering. In CVPR.
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q. (2020). Tinybert:
Distilling bert for natural language understanding. In EMNLP.
Jimenez, C. E., Russakovsky, O., and Narasimhan, K. (2022). Carets: A consistency and robustness
evaluative test suite for vqa. arXiv preprint arXiv:2203.07613.
Jin, W., Cheng, Y., Shen, Y., Chen, W., and Ren, X. (2022). A good prompt is worth millions of
parameters? low-resource prompt-based learning for vision-language models. In ACL.
Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R.
(2017a). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning.
In CVPR.
Johnson, J., Hariharan, B., Van Der Maaten, L., Hoffman, J., Fei-Fei, L., Lawrence Zitnick, C., and
Girshick, R. (2017b). Inferring and executing programs for visual reasoning. In ICCV.
Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of
language modeling. arXiv preprint arXiv:1602.02410.
Ju, C., Han, T., Zheng, K., Zhang, Y., and Xie, W. (2022). Prompting visual-language models for
efficient video understanding. In ECCV.
Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D., and Kembhavi, A. (2022). Webly supervised
concept expansion for general purpose vision models. arXiv preprint arXiv:2202.02317.
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., and Carion, N. (2021). Mdetr-modulated
detection for end-to-end multi-modal understanding. In ICCV.
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descrip-
tions. In CVPR.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green,
T., Back, T., Natsev, P., et al. (2017). The kinetics human action video dataset. arXiv preprint
arXiv:1705.06950.
Kervadec, C., Antipov, G., Baccouche, M., and Wolf, C. (2021). Roses are red, violets are blue...
but should vqa expect them to? In CVPR.
Kim, J.-H., Jun, J., and Zhang, B.-T. (2018). Bilinear attention networks. In NeurIPS.
Kim, J.-H., Lee, S.-W., Kwak, D., Heo, M.-O., Kim, J., Ha, J.-W., and Zhang, B.-T. (2016). Multi-
modal residual learning for visual qa. In NeurIPS.
Kim, J.-H., On, K.-W., Lim, W., Kim, J., Ha, J.-W., and Zhang, B.-T. (2017). Hadamard product for
low-rank bilinear pooling. In ICLR.
Kim, T., Song, G., Lee, S., Kim, S., Seo, Y., Lee, S., Kim, S. H., Lee, H., and Bae, K. (2022).
L-verse: Bidirectional generation between image and text. In CVPR.
Kim, W., Son, B., and Kim, I. (2021). ViLT: Vision-and-language transformer without convolution
or region supervision. In ICML.
Kiros, R., Salakhutdinov, R., and Zemel, R. S. (2014). Unifying visual-semantic embeddings with
multimodal neural language models. In NeurIPS deep learning workshop.
Klein, B., Lev, G., Sadeh, G., and Wolf, L. (2015). Associating neural word embeddings with deep
image representations using fisher vectors. In CVPR.
Ko, B. and Gu, G. (2022). Large-scale bilingual language-image contrastive learning. arXiv preprint
arXiv:2203.14463.
Kolesnikov, A., Pinto, A. S., Beyer, L., Zhai, X., Harmsen, J., and Houlsby, N. (2022). Uvim: A uni-
fied modeling approach for vision with learned guiding codes. arXiv preprint arXiv:2205.10337.
Krause, J., Johnson, J., Krishna, R., and Fei-Fei, L. (2017). A hierarchical approach for generating
descriptive image paragraphs. In CVPR.
Krishna, R., Hata, K., Ren, F., Fei-Fei, L., and Carlos Niebles, J. (2017a). Dense-captioning events
in videos. In ICCV.
Krishna, R., Hata, K., Ren, F., Fei-Fei, L., and Niebles, J. C. (2017b). Dense-Captioning Events in
Videos. In ICCV.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li,
L.-J., Shamma, D. A., et al. (2017c). Visual genome: Connecting language and vision using
crowdsourced dense image annotations. IJCV.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li,
L.-J., Shamma, D. A., et al. (2017d). Visual Genome: Connecting language and vision using
crowdsourced dense image annotations. IJCV.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolu-
tional neural networks. In NeurIPS.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). HMDB: a large video database
for human motion recognition. In ICCV.
Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., and Berg, T. L. (2013).
Babytalk: Understanding and generating simple image descriptions. TPAMI.
Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. (2022). Fine-tuning can distort
pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054.
Kuo, W., Bertsch, F., Li, W., Piergiovanni, A., Saffar, M., and Angelova, A. (2022). Findit: Gener-
alized localization with natural language queries. In ECCV.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S.,
Malloci, M., Kolesnikov, A., et al. (2020). The open images dataset v4. IJCV.
KV, G. and Mittal, A. (2020). Reducing language biases in visual question answering with visually-
grounded question encoder. In ECCV.
Kwon, G., Cai, Z., Ravichandran, A., Bas, E., Bhotika, R., and Soatto, S. (2022). Masked vision and
language modeling for multi-modal representation learning. arXiv preprint arXiv:2208.02131.
Lampert, C. H., Nickisch, H., and Harmeling, S. (2013). Attribute-based classification for zero-shot
visual object categorization. TPAMI.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020). Albert: A lite bert
for self-supervised learning of language representations. In ICLR.
Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. (2018). Stacked cross attention for image-text
matching. In ECCV.
Lei, C., Luo, S., Liu, Y., He, W., Wang, J., Wang, G., Tang, H., Miao, C., and Li, H. (2021a).
Understanding chinese video and language via contrastive multimodal pre-training. In ACMMM.
Lei, J., Berg, T. L., and Bansal, M. (2022a). Revealing single frame bias for video-and-language
learning. arXiv preprint arXiv:2206.03428.
Lei, J., Chen, X., Zhang, N., Wang, M., Bansal, M., Berg, T. L., and Yu, L. (2022b). Loo-
pitr: Combining dual and cross encoder architectures for image-text retrieval. arXiv preprint
arXiv:2203.05465.
Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T. L., Bansal, M., and Liu, J. (2021b). Less is more: Clipbert
for video-and-language learning via sparse sampling. In CVPR.
Lei, J., Yu, L., Bansal, M., and Berg, T. L. (2018). Tvqa: Localized, compositional video question
answering. In EMNLP.
Lei, J., Yu, L., Berg, T. L., and Bansal, M. (2020a). Tvqa+: Spatio-temporal grounding for video
question answering. In ACL.
Lei, J., Yu, L., Berg, T. L., and Bansal, M. (2020b). Tvr: A large-scale dataset for video-subtitle
moment retrieval. In ECCV.
Lei, J., Yu, L., Berg, T. L., and Bansal, M. (2020c). What is more likely to happen next? video-and-
language future event prediction. In EMNLP.
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt
tuning. In EMNLP.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettle-
moyer, L. (2020a). Bart: Denoising sequence-to-sequence pre-training for natural language gen-
eration, translation, and comprehension. In ACL.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih,
W.-t., Rocktäschel, T., et al. (2020b). Retrieval-augmented generation for knowledge-intensive
nlp tasks. In NeurIPS.
Li, B., Qi, X., Lukasiewicz, T., and Torr, P. (2019a). Controllable text-to-image generation. In
NeurIPS.
Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., and Ranftl, R. (2022a). Language-driven seman-
tic segmentation. In ICLR.
Li, C., Liu, H., Li, L. H., Zhang, P., Aneja, J., Yang, J., Jin, P., Lee, Y. J., Hu, H., Liu, Z., et al.
(2022b). Elevater: A benchmark and toolkit for evaluating language-augmented visual models.
In NeurIPS, Track on Datasets and Benchmarks.
Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G., Cao, Z., et al. (2022c).
mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv
preprint arXiv:2205.12005.
Li, D., Li, J., Li, H., Niebles, J. C., and Hoi, S. C. (2022d). Align and prompt: Video-and-language
pre-training with entity prompts. In CVPR.
Li, F., Zhang, H., Zhang, Y.-F., Liu, S., Guo, J., Ni, L. M., Zhang, P., and Zhang, L. (2022e).
Vision-language intelligence: Tasks, representation learning, and large models. arXiv preprint
arXiv:2203.01922.
Li, G., Duan, N., Fang, Y., Gong, M., and Jiang, D. (2020a). Unicoder-vl: A universal encoder for
vision and language by cross-modal pre-training. In AAAI.
Li, G., Zhu, L., Liu, P., and Yang, Y. (2019b). Entangled transformer for image captioning. In ICCV.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022f). Blip: Bootstrapping language-image pre-training for
unified vision-language understanding and generation. In ICML.
Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., and Hoi, S. (2021a). Align before fuse:
Vision and language representation learning with momentum distillation. In NeurIPS.
Li, K., Zhang, Y., Li, K., Li, Y., and Fu, Y. (2019c). Visual semantic reasoning for image-text
matching. In ICCV.
Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., and Liu, J. (2020b). Hero: Hierarchical encoder for
video+ language omni-representation pre-training. In EMNLP.
Li, L., Gan, Z., Cheng, Y., and Liu, J. (2019d). Relation-aware graph attention network for visual
question answering. In ICCV.
Li, L., Gan, Z., Lin, K., Lin, C.-C., Liu, Z., Liu, C., and Wang, L. (2022g). Lavender: Unifying
video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160.
Li, L., Gan, Z., and Liu, J. (2020c). A closer look at the robustness of vision-and-language pre-
trained models. arXiv preprint arXiv:2012.08673.
Li, L., Lei, J., Gan, Z., and Liu, J. (2021b). Adversarial vqa: A new benchmark for evaluating the
robustness of vqa models. In ICCV.
Li, L., Lei, J., Gan, Z., Yu, L., Chen, Y.-C., Pillai, R., Cheng, Y., Zhou, L., Wang, X. E., Wang, W. Y.,
et al. (2021c). Value: A multi-task benchmark for video-and-language understanding evaluation.
In NeurIPS, Track on Datasets and Benchmarks.
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. (2019e). VisualBERT: A simple and
performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. (2020d). What does BERT with
vision look at? In ACL.
Li, L. H., You, H., Wang, Z., Zareian, A., Chang, S.-F., and Chang, K.-W. (2021d). Unsupervised
vision-and-language pre-training without parallel images and captions. In NAACL.
Li, L. H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang,
J.-N., et al. (2022h). Grounded language-image pre-training. In CVPR.
Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., and Lu, J. (2022i). Bridge-prompt: Towards
ordinal action understanding in instructional videos. In CVPR.
Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., and Wang, H. (2021e). Unimo: Towards
unified-modal understanding and generation via cross-modal contrastive learning. In ACL.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y.,
and Gao, J. (2020e). Oscar: Object-semantics aligned pre-training for vision-language tasks. In
ECCV.
Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. In
ACL.
Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., and Yan, J. (2022j). Supervision
exists everywhere: A data efficient contrastive language-image pre-training paradigm. In ICLR.
Liao, Y., Liu, S., Li, G., Wang, F., Chen, Y., Qian, C., and Li, B. (2020). A real-time cross-modality
correlation filtering method for referring expression comprehension. In CVPR.
Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L.
(2014). Microsoft coco: Common objects in context. In ECCV.
Liu, B., Huang, Z., Zeng, Z., Chen, Z., and Fu, J. (2019a). Vqa challenge 2019 runner-up talk. VQA
Challenge 2019.
Liu, C., Mao, Z., Liu, A.-A., Zhang, T., Wang, B., and Zhang, Y. (2019b). Focus your attention: A
bidirectional focal attention network for image-text matching. In ACMMM.
Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., and Zhang, Y. (2020a). Graph structured network
for image-text matching. In CVPR.
Liu, J., Chen, W., Cheng, Y., Gan, Z., Yu, L., Yang, Y., and Liu, J. (2020b). Violin: A large-scale
dataset for video-and-language inference. In CVPR.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2021a). Pre-train, prompt, and
predict: A systematic survey of prompting methods in natural language processing. arXiv preprint
arXiv:2107.13586.
Liu, S., Fan, H., Qian, S., Chen, Y., Ding, W., and Wang, Z. (2021b). Hit: Hierarchical transformer
with momentum contrast for video-text retrieval. In ICCV.
Liu, X., He, P., Chen, W., and Gao, J. (2019c). Multi-task deep neural networks for natural language
understanding. In ACL.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and
Stoyanov, V. (2019d). Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021c). Swin transformer:
Hierarchical vision transformer using shifted windows. In ICCV.
Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2022). Video Swin Transformer.
In CVPR.
Loshchilov, I. and Hutter, F. (2018). Decoupled weight decay regularization. In ICLR.
Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks. In NeurIPS.
Lu, J., Clark, C., Zellers, R., Mottaghi, R., and Kembhavi, A. (2022a). Unified-io: A unified model
for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017). Knowing when to look: Adaptive attention via
a visual sentinel for image captioning. In CVPR.
Lu, J., Yang, J., Batra, D., and Parikh, D. (2016). Hierarchical question-image co-attention for visual
question answering. In NeurIPS.
Lu, K., Grover, A., Abbeel, P., and Mordatch, I. (2022b). Frozen pretrained transformers as universal
computation engines. In AAAI.
Lu, Y., Zhu, W., Wang, X. E., Eckstein, M., and Wang, W. Y. (2022c). Imagination-augmented
natural language understanding. In NAACL.
Lüddecke, T. and Ecker, A. (2022). Image segmentation using text and image prompts. In CVPR.
Luo, H., Ji, L., Shi, B., Huang, H., Duan, N., Li, T., Li, J., Bharti, T., and Zhou, M. (2020). Univl:
A unified video and language pre-training model for multimodal understanding and generation.
arXiv preprint arXiv:2002.06353.
Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., and Li, T. (2021a). Clip4clip: An empirical
study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860.
Luo, Y., Ji, J., Sun, X., Cao, L., Wu, Y., Huang, F., Lin, C.-W., and Ji, R. (2021b). Dual-level
collaborative transformer for image captioning. In AAAI.
Ma, L., Lu, Z., and Li, H. (2016). Learning to answer questions from image using convolutional
neural network. In AAAI.
Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask your neurons: A neural-based approach
to answering questions about images. In ICCV.
Mansimov, E., Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2016). Generating images from
captions with attention. In ICLR.
Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., and Wu, J. (2019). The neuro-symbolic concept
learner: Interpreting scenes, words, and sentences from natural supervision. In ICLR.
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., and Murphy, K. (2016). Generation and
comprehension of unambiguous object descriptions. In CVPR.
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. (2014). Deep captioning with
multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.
Marino, K., Chen, X., Parikh, D., Gupta, A., and Rohrbach, M. (2021). Krisp: Integrating implicit
and symbolic knowledge for open-domain knowledge-based vqa. In CVPR.
Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. (2019). Ok-vqa: A visual question answer-
ing benchmark requiring external knowledge. In CVPR.
Microsoft (2018). Responsible bots: 10 guidelines for developers of conversational ai.
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. (2020). End-to-End
Learning of Visual Representations from Uncurated Instructional Videos. In CVPR.
Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., and Sivic, J. (2019). Howto100m:
Learning a text-video embedding by watching hundred million narrated video clips. In ICCV.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representa-
tions in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representa-
tions of words and phrases and their compositionality. In NeurIPS.
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., and Gao, J. (2021). Deep
learning–based text classification: a comprehensive review. ACM Computing Surveys (CSUR).
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Ma-
hendran, A., Arnab, A., Dehghani, M., Shen, Z., et al. (2022). Simple open-vocabulary object
detection with vision transformers. In ECCV.
Mishra, A., Shekhar, S., Singh, A. K., and Chakraborty, A. (2019). Ocr-vqa: Visual question an-
swering by reading text in images. In ICDAR.
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D.,
and Gebru, T. (2019). Model cards for model reporting. In ACM FAccT.
Mu, N., Kirillov, A., Wagner, D., and Xie, S. (2021). Slip: Self-supervision meets language-image
pre-training. arXiv preprint arXiv:2112.12750.
Murahari, V., Batra, D., Parikh, D., and Das, A. (2020). Large-scale pretraining for visual dialog: A
simple state-of-the-art baseline. In ECCV.
Mustafa, B., Riquelme, C., Puigcerver, J., Jenatton, R., and Houlsby, N. (2022). Multimodal
contrastive learning with limoe: the language-image mixture of experts. arXiv preprint
arXiv:2206.02770.
Nagaraja, V. K., Morariu, V. I., and Davis, L. S. (2016). Modeling context between objects for
referring expression understanding. In ECCV.
Nam, H., Ha, J.-W., and Kim, J. (2017). Dual attention networks for multimodal reasoning and
matching. In CVPR.
Nguyen, D.-K. and Okatani, T. (2018). Improved fusion of visual and language representations by
dense symmetric co-attention for visual question answering. In CVPR.
Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., and Ling, H. (2022).
Expanding language-image pretrained models for general video recognition. arXiv preprint
arXiv:2208.02816.
Ni, M., Huang, H., Su, L., Cui, E., Bharti, T., Wang, L., Gao, J., Zhang, D., and Duan, N. (2021).
M3p: Learning universal representations via multitask multilingual multimodal pre-training. In
CVPR.
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen,
M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion
models. arXiv preprint arXiv:2112.10741.
Nie, Y., Li, L., Gan, Z., Wang, S., Zhu, C., Zeng, M., Liu, Z., Bansal, M., and Wang, L.
(2021). Mlp architectures for vision-and-language modeling: An empirical study. arXiv preprint
arXiv:2112.04453.
Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. (2020). Adversarial nli: A
new benchmark for natural language understanding. In ACL.
Niu, Z., Zhou, M., Wang, L., Gao, X., and Hua, G. (2017). Hierarchical multimodal lstm for dense
visual-semantic embedding. In ICCV.
Norcliffe-Brown, W., Vafeias, S., and Parisot, S. (2018). Learning conditioned graph structures for
interpretable visual question answering. In NeurIPS.
Ordonez, V., Kulkarni, G., and Berg, T. (2011). Im2text: Describing images using 1 million cap-
tioned photographs. In NeurIPS.
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., and Mei, T. (2020a). Auto-captions on gif: A large-scale
video-sentence dataset for vision-language pre-training. arXiv preprint arXiv:2007.02375.
Pan, Y., Yao, T., Li, Y., and Mei, T. (2020b). X-linear attention networks for image captioning. In
CVPR.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for automatic evaluation
of machine translation. In ACL.
Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I., and Gatt, A. (2021). Valse: A
task-independent benchmark for vision and language models centered on linguistic phenomena.
arXiv preprint arXiv:2112.07566.
Parcalabescu, L., Gatt, A., Frank, A., and Calixto, I. (2020). Seeing past words: Testing the cross-
modal capabilities of pretrained v&l models on counting tasks. arXiv preprint arXiv:2012.12352.
Park, J. S., Shen, S., Farhadi, A., Darrell, T., Choi, Y., and Rohrbach, A. (2022). Exposing the limits
of video-text models through contrast sets. In NAACL.
Patrick, M., Huang, P.-Y., Asano, Y., Metze, F., Hauptmann, A. G., Henriques, J. F., and Vedaldi, A.
(2020). Support-set bottlenecks for video-text representation learning. In ICLR.
Patterson, G. and Hays, J. (2012). SUN attribute database: Discovering, annotating, and recognizing
scene attributes. In CVPR.
Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representa-
tion. In EMNLP.
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., and Sorkine-Hornung, A.
(2016). A benchmark dataset and evaluation methodology for video object segmentation. In
CVPR.
Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. (2018). Film: Visual reasoning
with a general conditioning layer. In AAAI.
Pham, H., Dai, Z., Ghiasi, G., Liu, H., Yu, A. W., Luong, M.-T., Tan, M., and Le, Q. V. (2021).
Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050.
Plummer, B. A., Kordas, P., Kiapour, M. H., Zheng, S., Piramuthu, R., and Lazebnik, S. (2018).
Conditional image-text embedding networks. In ECCV.
Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S.
(2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-
sentence models. In ICCV.
Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., and Ferrari, V. (2020). Connecting vision
and language with localized narratives. In ECCV.
Pushkarna, M., Zaldivar, A., and Kjartansson, O. (2022). Data cards: Purposeful and transparent
dataset documentation for responsible ai. In ACM FAccT.
Qian, R., Li, Y., Xu, Z., Yang, M.-H., Belongie, S., and Cui, Y. (2022). Multimodal open-vocabulary
video classification via pre-trained vision and language models. arXiv preprint arXiv:2207.07646.
Qiao, T., Zhang, J., Xu, D., and Tao, D. (2019). Mirrorgan: Learning text-to-image generation by
redescription. In CVPR.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A.,
Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language
supervision. In ICML.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J.
(2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional
image generation with clip latents. arXiv preprint arXiv:2204.06125.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I.
(2021). Zero-Shot Text-to-Image Generation. In ICML.
Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021). Vision transformers for dense prediction. In
ICCV.
Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., and Lu, J. (2021).
Denseclip: Language-guided dense prediction with context-aware prompting. arXiv preprint
arXiv:2112.01518.
Razavi, A., Van den Oord, A., and Vinyals, O. (2019). Generating diverse high-fidelity images with
vq-vae-2. In NeurIPS.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative adversarial
text to image synthesis. In ICML.
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez,
M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. (2022). A generalist agent. arXiv preprint
arXiv:2205.06175.
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., and Pinkal, M. (2013). Grounding
action descriptions in videos. TACL.
Ren, M., Kiros, R., and Zemel, R. (2015a). Exploring models and data for image question answering.
In NeurIPS.
Ren, S., He, K., Girshick, R., and Sun, J. (2015b). Faster r-cnn: Towards real-time object detection
with region proposal networks. In NeurIPS.
Rohrbach, A., Rohrbach, M., Tandon, N., and Schiele, B. (2015). A Dataset for Movie Description.
In CVPR.
Rohrbach, M., Stark, M., and Schiele, B. (2011). Evaluating knowledge transfer and zero-shot
learning in a large-scale setting. In CVPR.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image
synthesis with latent diffusion models. In CVPR.
Rouditchenko, A., Boggust, A., Harwath, D., Chen, B., Joshi, D., Thomas, S., Audhkhasi, K.,
Kuehne, H., Panda, R., Feris, R., et al. (2021). Avlnet: Learning audio-visual language represen-
tations from instructional videos. In InterSpeech.
Ruan, L. and Jin, Q. (2022). Survey: Transformer based video-language pre-training. AI Open.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K.,
Mahdavi, S. S., Lopes, R. G., et al. (2022). Photorealistic text-to-image diffusion models with
deep language understanding. arXiv preprint arXiv:2205.11487.
Saito, K., Sohn, K., Zhang, X., Li, C.-L., Lee, C.-Y., Saenko, K., and Pfister, T. (2022). Prefix
conditioning unifies language and label supervision. arXiv preprint arXiv:2206.01125.
Salin, E., Farah, B., Ayache, S., and Favre, B. (2022). Are vision-language transformers learning
multimodal representations? a probing perspective. In AAAI.
Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T.
(2017). A simple neural network module for relational reasoning. In NeurIPS.
Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T.,
Jitsev, J., and Komatsuzaki, A. (2021). Laion-400m: Open dataset of clip-filtered 400 million
image-text pairs. arXiv preprint arXiv:2111.02114.
Schwenk, D., Khandelwal, A., Clark, C., Marino, K., and Mottaghi, R. (2022). A-okvqa: A bench-
mark for visual question answering using world knowledge. arXiv preprint arXiv:2206.01718.
Selvaraju, R. R., Tendulkar, P., Parikh, D., Horvitz, E., Ribeiro, M. T., Nushi, B., and Kamar, E.
(2020). Squinting at vqa models: Introspecting vqa models with sub-questions. In CVPR.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with
subword units. In ACL.
Seo, P. H., Nagrani, A., Arnab, A., and Schmid, C. (2022). End-to-end generative pretraining for
multimodal video captioning. In CVPR.
Shah, M., Chen, X., Rohrbach, M., and Parikh, D. (2019a). Cycle-consistency for robust visual
question answering. In CVPR.
Shah, S., Mishra, A., Yadati, N., and Talukdar, P. P. (2019b). Kvqa: Knowledge-aware visual
question answering. In AAAI.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., and Sun, J. (2019). Objects365: A
large-scale, high-quality dataset for object detection. In ICCV.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018). Conceptual captions: A cleaned, hyper-
nymed, image alt-text dataset for automatic image captioning. In ACL.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outra-
geously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR.
Shen, S., Li, C., Hu, X., Xie, Y., Yang, J., Zhang, P., Rohrbach, A., Gan, Z., Wang, L., Yuan, L.,
et al. (2022a). K-lite: Learning transferable visual models with external knowledge. In NeurIPS.
Shen, S., Li, L. H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.-W., Yao, Z., and Keutzer, K.
(2022b). How much can clip benefit vision-and-language tasks? In ICLR.
Sheng, S., Singh, A., Goswami, V., Magana, J., Thrush, T., Galuba, W., Parikh, D., and Kiela, D.
(2021). Human-adversarial visual question answering. In NeurIPS.
Shevchenko, V., Teney, D., Dick, A., and Hengel, A. v. d. (2021). Reasoning over vision and
language: Exploring the benefits of supplemental knowledge. arXiv preprint arXiv:2101.06013.
Shih, K. J., Singh, S., and Hoiem, D. (2016). Where to look: Focus regions for visual question
answering. In CVPR.
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. (2019). Megatron-
lm: Training multi-billion parameter language models using model parallelism. arXiv preprint
arXiv:1909.08053.
Shonenkov, A., Kuznetsov, A., Dimitrov, D., Shavrina, T., Chesakov, D., Maltseva, A., Fenogenova,
A., Pavlov, I., Emelyanov, A., Markov, S., et al. (2022). Ruclip–new models and experiments: a
technical report. arXiv preprint arXiv:2202.10784.
Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R. S., Harwath, D.,
Glass, J., and Kuehne, H. (2022). Everything at once-multi-modal fusion transformer for video
retrieval. In CVPR.
Sidorov, O., Hu, R., Rohrbach, M., and Singh, A. (2020). Textcaps: A dataset for image captioning
with reading comprehension. In ECCV.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni,
O., et al. (2022). Make-a-video: Text-to-video generation without text-video data. arXiv preprint
arXiv:2209.14792.
Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., and Kiela, D. (2022).
Flava: A foundational language and vision alignment model. In CVPR.
Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M.
(2019). Towards vqa models that can read. In CVPR.
Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., and Ng, A. Y. (2014). Grounded compositional
semantics for finding and describing images with sentences. TACL.
Song, H., Dong, L., Zhang, W.-N., Liu, T., and Wei, F. (2022). Clip models are few-shot learners:
Empirical studies on vqa and visual entailment. In ACL.
Song, J., Meng, C., and Ermon, S. (2021). Denoising diffusion implicit models. In ICLR.
Srinivasan, K., Raman, K., Chen, J., Bendersky, M., and Najork, M. (2021). Wit: Wikipedia-
based image text dataset for multimodal multilingual machine learning. arXiv preprint
arXiv:2103.01913.
Srinivasan, T. and Bisk, Y. (2022). Worst of both worlds: Biases compound in pre-trained vision-
and-language models. In 4th Workshop on Gender Bias in Natural Language Processing, NAACL.
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. (2019). VL-BERT: Pre-training of
generic visual-linguistic representations. In ICLR.
Su, Y., Lan, T., Liu, Y., Liu, F., Yogatama, D., Wang, Y., Kong, L., and Collier, N. (2022). Language
models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655.
Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., and Artzi, Y. (2019). A corpus for reasoning about
natural language grounded in photographs. In ACL.
Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. (2019a). VideoBERT: A joint model
for video and language representation learning. In ICCV.
Sun, S., Chen, Y.-C., Li, L., Wang, S., Fang, Y., and Liu, J. (2021). Lightningdot: Pre-training
visual-semantic embeddings for real-time image-text retrieval. In NAACL.
Sun, S., Cheng, Y., Gan, Z., and Liu, J. (2019b). Patient knowledge distillation for bert model
compression. In EMNLP.
Sun, S., Gan, Z., Cheng, Y., Fang, Y., Wang, S., and Liu, J. (2020). Contrastive distillation on
intermediate representations for language model compression. In EMNLP.
Sung, Y.-L., Cho, J., and Bansal, M. (2022a). Lst: Ladder side-tuning for parameter and memory
efficient transfer learning. In NeurIPS.
Sung, Y.-L., Cho, J., and Bansal, M. (2022b). Vl-adapter: Parameter-efficient transfer learning for
vision-and-language tasks. In CVPR.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
In NeurIPS.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and
Rabinovich, A. (2015). Going deeper with convolutions. In CVPR.
Tan, H. and Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from
transformers. In EMNLP.
Tan, H. and Bansal, M. (2020). Vokenization: Improving language understanding with contextual-
ized, visual-grounded supervision. In EMNLP.
Tan, H., Lei, J., Wolf, T., and Bansal, M. (2021). VIMPAC: Video Pre-Training via Masked Token
Prediction and Contrastive Learning. arXiv preprint arXiv:2106.11250.
Tan, H., Liu, X., Li, X., Zhang, Y., and Yin, B. (2019). Semantics-enhanced adversarial nets for
text-to-image synthesis. In ICCV.
Tang, M., Wang, Z., Liu, Z., Rao, F., Li, D., and Li, X. (2021a). Clip4caption: Clip for video
caption. In ACMMM.
Tang, Y., Ding, D., Rao, Y., Zheng, Y., Zhang, D., Zhao, L., Lu, J., and Zhou, J. (2019). Coin: A
large-scale dataset for comprehensive instructional video analysis. In CVPR.
Tang, Z., Cho, J., Tan, H., and Bansal, M. (2021b). Vidlankd: Improving language understanding
via video-distilled knowledge transfer. In NeurIPS.
Tang, Z., Lei, J., and Bansal, M. (2021c). DeCEMBERT: Learning from noisy instructional videos
via dense captions and entropy minimization. In NAACL.
Teney, D., Anderson, P., He, X., and van den Hengel, A. (2018). Tips and tricks for visual question
answering: Learnings from the 2017 challenge. In CVPR.
Teney, D., Liu, L., and van Den Hengel, A. (2017). Graph-structured representations for visual
question answering. In CVPR.
Tewel, Y., Shalev, Y., Schwartz, I., and Wolf, L. (2022). Zerocap: Zero-shot image-to-text generation
for visual-semantic arithmetic. In CVPR.
Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., and Ross, C. (2022).
Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR.
Tian, C., Wang, W., Zhu, X., Wang, X., Dai, J., and Qiao, Y. (2021). Vl-ltr: Learning
class-wise visual-linguistic representation for long-tailed visual recognition. arXiv preprint
arXiv:2111.13579.
Torabi, A., Tandon, N., and Sigal, L. (2016). Learning Language-Visual Embedding for Movie
Understanding with Natural-Language. arXiv preprint arXiv:1609.08124.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training
data-efficient image transformers & distillation through attention. In ICML.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal
features with 3d convolutional networks. In ICCV.
Tsimpoukelli, M., Menick, J., Cabi, S., Eslami, S., Vinyals, O., and Hill, F. (2021). Multimodal
few-shot learning with frozen language models. In NeurIPS.
van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). Neural discrete representation learning.
In NeurIPS.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polo-
sukhin, I. (2017). Attention is all you need. In NeurIPS.
Vedantam, R., Desai, K., Lee, S., Rohrbach, M., Batra, D., and Parikh, D. (2019). Probabilistic
neural symbolic models for interpretable visual question answering. In ICML.
Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015). Cider: Consensus-based image descrip-
tion evaluation. In CVPR.
Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S.,
Kunze, J., and Erhan, D. (2022). Phenaki: Variable length video generation from open domain
textual description. arXiv preprint arXiv:2210.02399.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption
generator. In CVPR.
Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD birds-
200-2011 dataset.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S.
(2019a). Superglue: A stickier benchmark for general-purpose language understanding systems.
In NeurIPS.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2019b). GLUE: A multi-task
benchmark and analysis platform for natural language understanding. In ICLR.
Wang, A. J., Ge, Y., Yan, R., Ge, Y., Lin, X., Cai, G., Wu, J., Shan, Y., Qie, X., and Shou,
M. Z. (2022a). All in one: Exploring unified video-language pre-training. arXiv preprint
arXiv:2203.07303.
Wang, J., Chen, D., Wu, Z., Luo, C., Zhou, L., Zhao, Y., Xie, Y., Liu, C., Jiang, Y.-G., and Yuan,
L. (2022b). Omnivl: One foundation model for image-language and video-language tasks. In
NeurIPS.
Wang, J., Ge, Y., Cai, G., Yan, R., Lin, X., Shan, Y., Qie, X., and Shou, M. Z. (2022c). Object-aware
video-language pre-training for retrieval. In CVPR.
Wang, J., Hu, X., Gan, Z., Yang, Z., Dai, X., Liu, Z., Lu, Y., and Wang, L. (2021a). Ufo: A unified
transformer for vision-language representation learning. arXiv preprint arXiv:2111.10023.
Wang, J., Hu, X., Zhang, P., Li, X., Wang, L., Zhang, L., Gao, J., and Liu, Z. (2020a). Minivlm: A
smaller and faster vision-language model. arXiv preprint arXiv:2012.06946.
Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., and Wang, L. (2022d). Git: A
generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
Wang, L., Li, Y., Huang, J., and Lazebnik, S. (2018). Learning two-branch neural networks for
image-text matching tasks. TPAMI.
Wang, L., Li, Y., and Lazebnik, S. (2016). Learning deep structure-preserving image-text embed-
dings. In CVPR.
Wang, M., Xing, J., and Liu, Y. (2021b). Actionclip: A new paradigm for video action recognition.
arXiv preprint arXiv:2109.08472.
Wang, P., Wu, Q., Shen, C., Dick, A., and Van Den Hengel, A. (2017a). Fvqa: Fact-based visual
question answering. TPAMI.
Wang, P., Wu, Q., Shen, C., Hengel, A. v. d., and Dick, A. (2017b). Explicit knowledge-based
reasoning for visual question answering. In IJCAI.
Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. (2022e).
Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learn-
ing framework. In ICML.
Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H.
(2022f). Unifying architectures, tasks, and modalities through a simple sequence-to-sequence
learning framework. In ICML.
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K.,
Singhal, S., Som, S., and Wei, F. (2022g). Image as a foreign language: Beit pretraining for all
vision and vision-language tasks. arXiv preprint arXiv:2208.10442.
Wang, W., Bao, H., Dong, L., and Wei, F. (2021c). Vlmo: Unified vision-language pre-training with
mixture-of-modality-experts. arXiv preprint arXiv:2111.02358.
Wang, W., Dong, L., Cheng, H., Song, H., Liu, X., Yan, X., Gao, J., and Wei, F. (2022h). Visually-
augmented language modeling. arXiv preprint arXiv:2205.10178.
Wang, X., Wu, J., Chen, J., Li, L., Wang, Y.-F., and Wang, W. Y. (2019c). VATEX: A Large-Scale,
High-Quality Multilingual Dataset for Video-and-Language Research. In ICCV.
Wang, Y., Joty, S., Lyu, M. R., King, I., Xiong, C., and Hoi, S. C. (2020b). Vd-bert: A unified vision
and dialog transformer with bert. In EMNLP.
Wang, Y., Xu, J., and Sun, Y. (2022i). End-to-end transformer based model for image captioning.
In AAAI.
Wang, Y., Yang, H., Qian, X., Ma, L., Lu, J., Li, B., and Fan, X. (2019d). Position focused attention
network for image-text matching. In IJCAI.
Wang, Z., Li, M., Xu, R., Zhou, L., Lei, J., Lin, X., Wang, S., Yang, Z., Zhu, C., Hoiem, D., et al.
(2022j). Language models with image descriptors are strong few-shot video-language learners.
arXiv preprint arXiv:2205.10747.
Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., and Cao, Y. (2022k). Simvlm: Simple visual
language model pretraining with weak supervision. In ICLR.
Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi,
H., Farhadi, A., Namkoong, H., et al. (2022). Robust fine-tuning of zero-shot models. In CVPR.
Wray, M., Doughty, H., and Damen, D. (2021). On semantic similarity in video retrieval. In CVPR.
Wu, B., Yu, S., Chen, Z., Tenenbaum, J. B., and Gan, C. (2021). Star: A benchmark for situated
reasoning in real-world videos. In NeurIPS.
Wu, C., Liang, J., Hu, X., Gan, Z., Wang, J., Wang, L., Liu, Z., Fang, Y., and Duan, N. (2022a).
Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. arXiv
preprint arXiv:2207.09814.
Wu, C., Liang, J., Ji, L., Yang, F., Fang, Y., Jiang, D., and Duan, N. (2022b). NÜWA: Visual
synthesis pre-training for neural visual world creation. In ECCV.
Wu, C., Lin, Z., Cohen, S., Bui, T., and Maji, S. (2020). Phrasecut: Language-based image segmen-
tation in the wild. In CVPR.
Wu, J., Lu, J., Sabharwal, A., and Mottaghi, R. (2022c). Multi-modal answer validation for
knowledge-based vqa. In AAAI.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao,
Q., Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the gap
between human and machine translation. arXiv preprint arXiv:1609.08144.
Xiao, J., Shang, X., Yao, A., and Chua, T.-S. (2021). Next-qa: Next phase of question-answering to
explaining temporal actions. In CVPR.
Xie, N., Lai, F., Doran, D., and Kadav, A. (2019). Visual entailment: A novel task for fine-grained
image understanding. arXiv preprint arXiv:1901.06706.
Xie, Y., Wang, X., Wang, R., and Zha, H. (2020). A fast proximal point method for computing exact
wasserstein distance. In UAI.
Xie, Y., Zhou, L., Dai, X., Yuan, L., Bach, N., Liu, C., and Zeng, M. (2022). Visual clues:
Bridging vision and language foundations for image paragraph captioning. arXiv preprint
arXiv:2206.01843.
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., and Zhuang, Y. (2017). Video Question
Answering via Gradually Refined Attention over Appearance and Motion. In ACMMM.
Xu, H., Ghosh, G., Huang, P.-Y., Arora, P., Aminzadeh, M., Feichtenhofer, C., Metze, F., and
Zettlemoyer, L. (2021a). Vlm: Task-agnostic video-language model pre-training for video under-
standing. In Findings of ACL.
Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and
Feichtenhofer, C. (2021b). Videoclip: Contrastive pre-training for zero-shot video-text under-
standing. In EMNLP.
Xu, H., Yan, M., Li, C., Bi, B., Huang, S., Xiao, W., and Huang, F. (2021c). E2e-vlp: End-to-end
vision-language pre-training enhanced by visual learning. In ACL.
Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., and Wang, X. (2022). Groupvit:
Semantic segmentation emerges from text supervision. In CVPR.
Xu, J., Mei, T., Yao, T., and Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging
video and language. In CVPR.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015).
Show, attend and tell: Neural image caption generation with visual attention. In ICML.
Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., and Bai, X. (2021d). A simple baseline
for zero-shot semantic segmentation with pre-trained vision-language model. arXiv preprint
arXiv:2112.14757.
Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. (2018). Attngan: Fine-
grained text to image generation with attentional generative adversarial networks. In CVPR.
Xue, H., Hang, T., Zeng, Y., Sun, Y., Liu, B., Yang, H., Fu, J., and Guo, B. (2022). Advancing
high-resolution video-language representation with large-scale video transcriptions. In CVPR.
Xue, H., Huang, Y., Liu, B., Peng, H., Fu, J., Li, H., and Luo, J. (2021). Probing inter-modality:
Visual parsing with self-attention for vision-language pre-training. In NeurIPS.
Yang, A., Miech, A., Sivic, J., Laptev, I., and Schmid, C. (2021a). Just ask: Learning to answer
questions from millions of narrated videos. In ICCV.
Yang, J., Bisk, Y., and Gao, J. (2021b). Taco: Token-aware cascade contrastive learning for video-
text alignment. In ICCV.
Yang, J., Duan, J., Tran, S., Xu, Y., Chanda, S., Chen, L., Zeng, B., Chilimbi, T., and Huang, J.
(2022a). Vision-language pre-training with triple contrastive learning. In CVPR.
Yang, J., Li, C., Zhang, P., Xiao, B., Yuan, L., Liu, C., and Gao, J. (2022b). Unicl: Unified con-
trastive learning in image-text-label space. In CVPR.
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Shao, Y., Zhang, W., Cui, B., and Yang,
M.-H. (2022c). Diffusion models: A comprehensive survey of methods and applications. arXiv
preprint arXiv:2209.00796.
Yang, X., Tang, K., Zhang, H., and Cai, J. (2019a). Auto-encoding scene graphs for image caption-
ing. In CVPR.
Yang, Z., Chen, T., Wang, L., and Luo, J. (2020). Improving one-stage visual grounding by recursive
sub-query construction. In ECCV.
Yang, Z., Gan, Z., Wang, J., Hu, X., Ahmed, F., Liu, Z., Lu, Y., and Wang, L. (2021c). Crossing the
format boundary of text and boxes: Towards unified vision-language modeling. arXiv preprint
arXiv:2111.12085.
Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., and Wang, L. (2022d). An empirical study of
gpt-3 for few-shot knowledge-based vqa. In AAAI.
Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., and Luo, J. (2019b). A fast and accurate one-stage
approach to visual grounding. In ICCV.
Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. (2016). Stacked attention networks for image
question answering. In CVPR.
Yang, Z., Lu, Y., Wang, J., Yin, X., Florencio, D., Wang, L., Zhang, C., Zhang, L., and Luo, J.
(2021d). Tap: Text-aware pre-training for text-vqa and text-caption. In CVPR.
Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. (2022).
Filip: Fine-grained interactive language-image pre-training. In ICLR.
Yao, T., Pan, Y., Li, Y., and Mei, T. (2018). Exploring visual relationship for image captioning. In
ECCV.
Yao, T., Pan, Y., Li, Y., and Mei, T. (2019). Hierarchy parsing for image captioning. In ICCV.
Yao, T., Pan, Y., Li, Y., Qiu, Z., and Mei, T. (2017). Boosting image captioning with attributes. In
ICCV.
Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.-S., and Sun, M. (2021). CPT: Colorful prompt
tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797.
Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenenbaum, J. (2018). Neural-symbolic vqa:
Disentangling reasoning from vision and language understanding. In NeurIPS.
Yin, D., Dong, L., Cheng, H., Liu, X., Chang, K.-W., Wei, F., and Gao, J. (2022). A survey of
knowledge-intensive nlp with pre-trained language models. arXiv preprint arXiv:2202.08772.
You, H., Zhou, L., Xiao, B., Codella, N., Cheng, Y., Xu, R., Chang, S.-F., and Yuan, L. (2022).
Learning visual representation from modality-shared contrastive language-image pre-training. In
ECCV.
You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. (2016). Image captioning with semantic attention.
In CVPR.
Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H., and Wang, H. (2021). Ernie-vil: Knowledge
enhanced vision-language representations through scene graphs. In AAAI.
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. (2022a). Coca: Con-
trastive captioners are image-text foundation models. TMLR.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan,
B. K., et al. (2022b). Scaling autoregressive models for content-rich text-to-image generation.
arXiv preprint arXiv:2206.10789.
Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., and Berg, T. L. (2018a). Mattnet: Modular
attention network for referring expression comprehension. In CVPR.
Yu, L., Park, E., Berg, A. C., and Berg, T. L. (2015). Visual madlibs: Fill in the blank image
generation and question answering. arXiv preprint arXiv:1506.00278.
Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L. (2016). Modeling context in referring
expressions. In ECCV.
Yu, Y., Kim, J., and Kim, G. (2018b). A Joint Sequence Fusion Model for Video Question Answer-
ing and Retrieval. In ECCV.
Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., and Tao, D. (2019a). Activitynet-qa: A dataset
for understanding complex web videos via question answering. In AAAI.
Yu, Z., Yu, J., Cui, Y., and Li, J. (2019b). Vqa challenge 2019 winner talk. VQA Challenge 2019.
Yu, Z., Yu, J., Cui, Y., Tao, D., and Tian, Q. (2019c). Deep modular co-attention networks for visual
question answering. In CVPR.
Yu, Z., Yu, J., Fan, J., and Tao, D. (2017). Multi-modal factorized bilinear pooling with co-attention
learning for visual question answering. In ICCV.
Yuan, H., Jiang, J., Albanie, S., Feng, T., Huang, Z., Ni, D., and Tang, M. (2022). Rlip: Re-
lational language-image pre-training for human-object interaction detection. arXiv preprint
arXiv:2209.01814.
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu,
C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M.,
Zhou, L., and Zhang, P. (2021). Florence: A new foundation model for computer vision. arXiv
preprint arXiv:2111.11432.
Zang, Y., Li, W., Zhou, K., Huang, C., and Loy, C. C. (2022). Open-vocabulary detr with conditional
matching. arXiv preprint arXiv:2203.11876.
Zareian, A., Rosa, K. D., Hu, D. H., and Chang, S.-F. (2021). Open-vocabulary object detection
using captions. In CVPR.
Zellers, R., Bisk, Y., Farhadi, A., and Choi, Y. (2019). From recognition to cognition: Visual
commonsense reasoning. In CVPR.
Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., Kusupati, A., Hessel, J., Farhadi, A., and
Choi, Y. (2022). Merlot reserve: Neural script knowledge through vision and language and sound.
In CVPR.
Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. (2021). Merlot:
Multimodal neural script knowledge models. In NeurIPS.
Zeng, A., Wong, A., Welker, S., Choromanski, K., Tombari, F., Purohit, A., Ryoo, M., Sindhwani,
V., Lee, J., Vanhoucke, V., et al. (2022a). Socratic models: Composing zero-shot multimodal
reasoning with language. arXiv preprint arXiv:2204.00598.
Zeng, Y., Zhang, X., and Li, H. (2022b). Multi-grained vision language pre-training: Aligning texts
with visual concepts. In ICML.
Zeng, Y., Zhou, W., Luo, A., and Zhang, X. (2022c). Cross-view language modeling: Towards
unified cross-lingual cross-modal pre-training. arXiv preprint arXiv:2206.00621.
Zeng, Z., Luo, Y., Liu, Z., Rao, F., Li, D., Guo, W., and Wen, Z. (2022d). Tencent-mvse: A large-
scale benchmark dataset for multi-modal video similarity evaluation. In CVPR.
Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. (2022). Lit:
Zero-shot transfer with locked-image text tuning. In CVPR.
Zhang, C., Van Durme, B., Li, Z., and Stengel-Eskin, E. (2022a). Visual commonsense in pretrained
unimodal and multimodal models. arXiv preprint arXiv:2205.01850.
Zhang, C., Yang, Z., He, X., and Deng, L. (2020a). Multimodal intelligence: Representation learn-
ing, information fusion, and applications. JSTSP.
Zhang, D., Dai, X., Wang, X., and Wang, Y.-F. (2018a). S3d: Single shot multi-span detector via
fully 3d convolutional networks. In BMVC.
Zhang, H., Niu, Y., and Chang, S.-F. (2018b). Grounding referring expressions in images by varia-
tional context. In CVPR.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. N. (2017). Stackgan:
Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. N. (2018c). Stackgan++:
Realistic image synthesis with stacked generative adversarial networks. TPAMI.
Zhang, H., Yin, W., Fang, Y., Li, L., Duan, B., Wu, Z., Sun, Y., Tian, H., Wu, H., and Wang, H.
(2021a). Ernie-vilg: Unified generative pre-training for bidirectional vision-language generation.
arXiv preprint arXiv:2112.15283.
Zhang, H., Zhang, P., Hu, X., Chen, Y.-C., Li, L. H., Dai, X., Wang, L., Yuan, L., Hwang, J.-N., and
Gao, J. (2022b). Glipv2: Unifying localization and vision-language understanding. In ECCV.
Zhang, J., Chang, J. P., Danescu-Niculescu-Mizil, C., Dixon, L., Hua, Y., Thain, N., and Taraborelli,
D. (2018d). Conversations gone awry: Detecting early signs of conversational failure. In ACL.
Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. (2021b). VinVL:
Revisiting visual representations in vision-language models. In CVPR.
Zhang, Q., Lei, Z., Zhang, Z., and Li, S. Z. (2020b). Context-aware attention network for image-text
retrieval. In CVPR.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X.,
Lin, X. V., et al. (2022c). Opt: Open pre-trained transformer language models. arXiv preprint
arXiv:2205.01068.
Zhao, D., Wang, A., and Russakovsky, O. (2021). Understanding and evaluating racial biases in
image captioning. In ECCV.
Zhao, H., Hadji, I., Dvornik, N., Derpanis, K. G., Wildes, R. P., and Jepson, A. D. (2022). P3iv:
Probabilistic procedure planning from instructional videos with weak supervision. In CVPR.
Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M., and Shen, Y.-D. (2020). Dual-path convolu-
tional image-text embeddings with instance loss. ACM TOMM.
Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L. H., Zhou, L., Dai, X., Yuan, L., Li, Y.,
et al. (2022). Regionclip: Region-based language-image pretraining. In CVPR.
Zhou, C., Loy, C. C., and Dai, B. (2022a). Extract free dense labels from clip. In ECCV.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022b). Conditional prompt learning for vision-language
models. In CVPR.
Zhou, L., Gao, J., Li, D., and Shum, H.-Y. (2020a). The design and implementation of xiaoice, an
empathetic social chatbot. Computational Linguistics.
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., and Gao, J. (2020b). Unified vision-language
pre-training for image captioning and vqa. In AAAI.
Zhou, L., Xu, C., and Corso, J. J. (2018). Towards Automatic Learning of Procedures from Web
Instructional Videos. In AAAI.
Zhou, M., Yu, L., Singh, A., Wang, M., Yu, Z., and Zhang, N. (2022c). Unsupervised vision-and-
language pre-training via retrieval-based multi-granular alignment. In CVPR.
Zhou, M., Zhou, L., Wang, S., Cheng, Y., Li, L., Yu, Z., and Liu, J. (2021). Uc2: Universal cross-
lingual cross-modal vision-and-language pre-training. In CVPR.
Zhou, W., Zeng, Y., Diao, S., and Zhang, X. (2022d). Vlue: A multi-task benchmark for evaluating
vision-language models. In ICML.
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., and Misra, I. (2022e). Detecting twenty-thousand
classes using image-level supervision. arXiv preprint arXiv:2201.02605.
Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., and Sun, T. (2022f).
Lafite: Towards language-free training for text-to-image generation. In CVPR.
Zhu, L. and Yang, Y. (2020). Actbert: Learning global-local video-text representations. In CVPR.
Zhu, X., Zhu, J., Li, H., Wu, X., Li, H., Wang, X., and Dai, J. (2022). Uni-perceiver: Pre-training
unified architecture for generic perception for zero-shot and few-shot tasks. In CVPR.
Zhu, Y., Groth, O., Bernstein, M., and Fei-Fei, L. (2016). Visual7w: Grounded question answering
in images. In CVPR.
Zhuge, M., Gao, D., Fan, D.-P., Jin, L., Chen, B., Zhou, H., Qiu, M., and Shao, L. (2021). Kaleido-
bert: Vision-language pre-training on fashion domain. In CVPR.
Zhukov, D., Alayrac, J.-B., Cinbis, R. G., Fouhey, D., Laptev, I., and Sivic, J. (2019). Cross-task
weakly supervised learning from instructional videos. In CVPR.
Zitnick, C. L. and Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In ECCV.