
ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu1,2 and Yi Yang2∗


1 Baidu Research    2 ReLER, University of Technology Sydney
{linchao.zhu,yi.yang}@uts.edu.au

Abstract

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce a TaNgled Transformer block (TNT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information, which enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.

* This work was done when Linchao Zhu visited Baidu Research. Yi Yang is the corresponding author.

1. Introduction

While supervised learning has been successful in a variety of computer vision tasks [17, 9, 38, 29], self-supervised representation learning from unlabeled data has attracted increasing attention in recent years [4, 27]. In self-supervised learning, a model is first pre-trained on a large amount of unlabeled data with a surrogate loss; fine-tuning then specializes the pre-trained model to downstream tasks. Recently, there has been rapid progress in self-supervised representation learning for text [7, 45], where the Bidirectional Encoder Representations from Transformers (BERT) model [7] generalizes remarkably well to many natural language tasks, e.g., question answering [2].

Motivated by BERT's success in self-supervised training, we aim to learn an analogous model for joint video and text modeling. We exploit video-text relations based on narrated instructional videos, where the aligned texts are detected by off-the-shelf automatic speech recognition (ASR) models. These instructional videos serve as a natural source for video-text relationship studies. First, they are vastly available and freely accessible on YouTube and other platforms [26, 33]. Second, the visual frames are aligned with the instructional narrations: the text not only covers the objects in the scene explicitly but also identifies the salient action in the video clip.

To generalize BERT to video-and-language tasks, Sun et al. [33] extended the BERT model by learning from quantized video frame features. The original BERT takes discrete elements as inputs and predicts the corresponding tokens as output. In contrast, visual features are distributed, real-valued representations that cannot be directly categorized into discrete labels for "visual token" prediction. Sun et al. [33] therefore discretized visual features into visual words via clustering, so that the visual tokens can be passed directly to the original BERT model. However, detailed local information, e.g., interacting objects and human actions, may be lost during clustering, which prevents the model from uncovering fine-grained relations between video and text. In this paper, we propose ActBERT to learn a joint video-text representation that uncovers global and local visual clues from paired video sequences and text descriptions. Both the global and the local visual signals interact with the semantic stream mutually. ActBERT leverages rich contextual information and exploits fine-grained relations for video-text joint modeling.

First, ActBERT incorporates global actions, local regional objects and text descriptions in a joint framework. Actions, e.g., "cut", "rotate", "slice", are essential to various video-related downstream tasks. Recognizing human actions demonstrates a model's capacity for motion understanding and complex human intention reasoning, so it can be beneficial to model human actions explicitly during pre-training. Long-term action sequences furthermore offer temporal dependencies about an
instructional task. Though action clues are important, they are largely ignored in previous self-supervised video-text training [33, 26], where actions are treated identically to objects. To model human actions, we first extract verbs from the text descriptions and construct an action classification dataset from the original data. A 3D convolutional network is then trained to predict the action labels, and the features from the optimized network are used as the action embedding. In this way, clip-level actions are represented and the corresponding action label is inserted. Besides global action information, we incorporate local regional information to provide fine-grained visual cues [21, 34, 32, 19, 5]. Object regions provide detailed visual clues about the whole scene, including the regional object feature and the position of the object, and the language model can benefit from this regional information for better language-and-vision alignment.

Second, we introduce a TaNgled Transformer block (TNT) to encode features from three sources, i.e., global actions, local regional objects, and linguistic tokens. Previous studies [21, 34] consider two modalities when designing new transformer layers, i.e., fine-grained object information from images and natural language. Lu et al. [21] introduced a co-attentional transformer layer, where the key-value pairs from one modality are passed to the other modality's attention block to act as the new key-value pairs. In our scenario, however, there are three sources of input. Two of them, i.e., local regional features and linguistic texts, offer detailed descriptions of the event occurring in the clip, while the global action feature provides the human intention over time as well as a straightforward clue for contextual inference. We design a new tangled transformer block for cross-modal feature learning from these three sources. To enhance the interactions between the two visual cues and the linguistic features, we use a separate transformer block [40] to encode each modality. Mutual cross-modal communication is then enhanced with two additional multi-head attention blocks, in which the action feature catalyzes the mutual interactions. With guidance from the action features, we inject visual information into the linguistic transformer and incorporate linguistic information into the visual transformers. The tangled transformer dynamically selects judicious cues from its context to facilitate the target prediction.

Furthermore, we design four surrogate tasks to train ActBERT, i.e., masked language modeling with global and local visual cues, masked action classification, masked object classification and cross-modal matching. The pre-trained ActBERT is transferred to five video-related downstream tasks, i.e., video captioning, action segmentation, text-video clip retrieval, action step localization, and video question answering. We quantitatively show that ActBERT achieves state-of-the-art performance by a clear margin.

2. Related Work

Video and language. There are many existing video-and-language tasks for evaluating a model's capacity for joint video-text representation learning, e.g., video question answering [36, 10, 18, 54], video captioning [46, 52], text-video retrieval [47, 41, 25], and video grounding [50]. In video and language modeling, it can be difficult to learn the relations between ordered video frames and their corresponding descriptions, where video temporal information and the spatio-temporal interactions between multiple objects need to be incorporated. The dominant approach for multi-modal modeling is to leverage Recurrent Neural Networks (RNNs) and their variants, e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), to model sequence relations, e.g., [28, 53]. Zhou et al. [52] leveraged masked transformers in both the encoder and the decoder for dense video captioning. Most of these works are conducted on well-annotated datasets where the descriptions are manually written, which requires considerable human effort. Other works learn video representations from limited annotated data [55]. Video is a natural source for learning cross-modal representations: text descriptions can be generated automatically by off-the-shelf automatic speech recognition (ASR) models, which is more scalable and better suited to real-world deployment. In this paper, we focus on learning a joint video-text representation in a self-supervised way.

Cross-modal pre-training. In the past year, many works extended BERT to model cross-modal data [21, 32, 34, 5, 19, 33]. The recent BERT model for video-text modeling [33] introduces visual words for video frame encoding, where local regional information is largely ignored. The synchronized video-audio signal is also a good test-bed for cross-modal representation learning [3, 15]; however, these works leverage low-level audio signals and only consider the synchronization nature of video data. In this work, we focus on video-text joint representation learning. Our ActBERT leverages multi-source information and achieves remarkable performance on many downstream video-text tasks.

Instructional videos. Learning from instructional videos is challenging due to the data complexity across various tasks [6, 1, 51, 26]. These videos are collected from many domains, e.g., cooking, sports, gardening. Many works also regard the transcriptions generated from instructional videos as a source of supervision [1, 51, 26]. In contrast, we employ ActBERT to explicitly model human actions and local regions in a unified framework. We improve over [26] with more specific relation modeling between videos and their descriptions, and we quantitatively demonstrate that ActBERT is more suitable for unsupervised video-text modeling.
3. Model Architecture

3.1. Preliminary

We first illustrate the original BERT [7] model. BERT [7] pre-trains a language model on large corpora in an unsupervised way. The pre-trained model is found to be flexible and beneficial to a variety of downstream tasks, e.g., question answering [2].

In BERT [7], the input entities are processed by a multi-layer bidirectional transformer [40]. The embeddings of each input are processed with stacked self-attention layers to aggregate contextual features; the attention weights are adaptively generated, and the output features contain contextual information about the original input sequence. In self-attention, the generated features are independent of the input sequence order, which makes the output representation permutation-invariant: the output is not affected when the input sequence is shuffled. A position embedding is therefore commonly added to each input entity to incorporate sequential order clues.

In the original BERT, Devlin et al. introduced two tasks for pre-training. In masked language modeling (MLM), a portion of the input words are randomly masked out and replaced by a special token "[MASK]". The task is to predict the masked words based on the contextual contents, i.e., the unmasked elements that provide relevant cues for the prediction of the masked word.

The other task, Next Sentence Prediction (NSP), models order information between two sentences. Two sentences are sampled from a document, and NSP aims to identify whether the second sentence directly follows the first. The two sentences are concatenated via a token "[SEP]", so that the model is aware that the inputs are separate sentences. The prediction is made upon the output feature of the first token "[CLS]". This is a binary classification problem, and a simple sigmoid classifier is used; a prediction of "1" indicates that the sentences are consecutive, i.e., the second sentence comes right after the first.

3.2. ActBERT

3.2.1 Input Embeddings

There are four types of input elements in ActBERT: actions, image regions, linguistic descriptions and special tokens. Special tokens are used to distinguish different inputs. Each input sequence starts with a special token "[CLS]" and ends with another token "[SEP]". We put the linguistic descriptions after "[CLS]", followed by the action inputs and then the local regional features. We denote the action features as a1, ..., aL, the frame region features as r1, ..., rM, and the sequential text description as w1, ..., wN. The whole sequence is denoted as {[CLS], w1, ..., wN, [SEP], a1, ..., aL, [SEP], r1, ..., rM, [SEP]}. "[SEP]" is also inserted between different sentences, and between regions that come from different clips, which helps the model identify clip boundaries. For each input step, the final embedding consists of four different embeddings: a position embedding, a segment embedding, a token embedding, and a visual feature embedding. We add a few new tokens to distinguish action features and regional object features, and the visual embedding is introduced to carry visual and action information. These embeddings are summed to form the final input feature of ActBERT. We explain them in detail as follows.

Position embedding. Following [7], we incorporate a learnable position embedding for every input in the sequence. Since self-attention does not consider order information, position encoding offers a flexible way to embed a sequence when the sequence order matters. The position embeddings for actions in different clips differ, as the video clips are ordered. For regions extracted from the same frame, we use the same position embedding; to distinguish regions within a frame, we use a spatial position embedding for different spatial positions, described under "Visual (action) embedding".

Segment embedding. We consider multiple video clips for long-term video context modeling. Each video clip (segment) has a corresponding segment embedding. The elements of the same video clip, i.e., action inputs, regional object inputs, and linguistic descriptions, share the same segment embedding.

Token embedding. Each word is embedded with WordPiece embeddings [42] with a 30,000-token vocabulary. In addition to the special tokens mentioned above ("[CLS]", "[MASK]", "[SEP]"), we introduce "[ACT]" and "[REGION]" to represent the action features and the region features extracted from video frames, respectively. Note that all action inputs share the identical token embedding, which reveals the modality of the inputs.
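To make the input layout concrete, here is a minimal PyTorch sketch of how the flattened sequence {[CLS], w1, ..., wN, [SEP], a1, ..., aL, [SEP], r1, ..., rM, [SEP]} and its four summed embeddings might be assembled. Only the 30,000-token WordPiece vocabulary and the 768-d hidden size come from the paper; the module structure, the final LayerNorm, and the single shared visual projection are assumptions of this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ActBERTInputEmbedding(nn.Module):
    """Sketch of the summed input embeddings from Sec. 3.2.1:
    token + position + segment + visual. The input sequence is assumed to be the
    flattened {[CLS], w1..wN, [SEP], a1..aL, [SEP], r1..rM, [SEP]} layout."""

    def __init__(self, hidden=768, vocab=30000, max_pos=512, max_clips=16, vis_dim=2048):
        super().__init__()
        self.token = nn.Embedding(vocab, hidden)        # words plus [CLS]/[SEP]/[MASK]/[ACT]/[REGION]
        self.position = nn.Embedding(max_pos, hidden)   # order of steps in the sequence
        self.segment = nn.Embedding(max_clips, hidden)  # one segment id per video clip
        self.visual = nn.Linear(vis_dim, hidden)        # action / region feature -> hidden (assumed shared)
        self.norm = nn.LayerNorm(hidden)                # assumption; not stated in the paper

    def forward(self, token_ids, pos_ids, seg_ids, vis_feats, vis_mask):
        # vis_feats: (B, T, vis_dim), zero-padded at pure-text positions.
        # vis_mask:  (B, T) float, 1.0 where the step carries an action/region feature.
        emb = self.token(token_ids) + self.position(pos_ids) + self.segment(seg_ids)
        emb = emb + vis_mask.unsqueeze(-1) * self.visual(vis_feats)
        return self.norm(emb)
```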
Visual (action) embedding. We now explain the visual (action) embedding in detail, starting with the procedure to obtain the action embedding. For each video clip, we extract verbs from its corresponding descriptions; for simplicity, we remove clips that do not contain any verbs. We then build a vocabulary from all the extracted verbs, after which each video clip has one or multiple category labels. We train a 3D convolutional neural network on this constructed dataset. The input to the 3D network is a tensor with an additional temporal dimension, and a softmax classifier is placed on top of the convolutional neural network. For clips with multiple labels, we normalize the one-hot label with the ℓ1-norm, so that the scores over all labels sum to 1. After the model is trained, we extract the feature after global average pooling as the action feature. This feature represents the actions that occur in the video clip.

To obtain regional object features, we extract bounding boxes and the corresponding visual features from a pre-trained object detection network. Similar to Lu et al. [21], we utilize a pre-trained Faster R-CNN network [29] to extract the categorical distribution under the COCO vocabulary [20]. The image region features offer detailed visual information for visual and text relation modeling. For each region, the visual feature embedding is the feature vector before the output layer of the pre-trained network. Following [21], we incorporate a spatial position embedding that represents the region location with a 5-D vector, consisting of the four box coordinates and the fraction of the frame covered by the region, i.e., (x1/W, y1/H, x2/W, y2/H, (x2 - x1)(y2 - y1)/(W*H)), where W is the frame width, H is the frame height, and (x1, y1) and (x2, y2) are the top-left and bottom-right coordinates, respectively. This vector is then embedded to match the dimension of the visual feature, and the final regional object feature is the summation of the spatial position embedding and the object detection feature.
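As a small illustration of the 5-D spatial vector described above, the following sketch computes it for one box; embedding this vector to the visual feature dimension is only hinted at in the comments and would use a learned layer that is assumed here, not taken from the paper.

```python
def region_position_vector(box, frame_w, frame_h):
    """5-D spatial descriptor: normalized box corners plus the fraction of the
    frame area covered by the region, i.e.
    (x1/W, y1/H, x2/W, y2/H, (x2-x1)*(y2-y1)/(W*H))."""
    x1, y1, x2, y2 = box
    return [
        x1 / frame_w,
        y1 / frame_h,
        x2 / frame_w,
        y2 / frame_h,
        (x2 - x1) * (y2 - y1) / (frame_w * frame_h),
    ]

# Hypothetical usage: a learned layer (assumed, e.g. nn.Linear(5, vis_dim)) embeds
# this vector to the visual feature dimension, and the final regional object
# feature is the sum of that spatial embedding and the RoI detection feature.
print(region_position_vector((32, 48, 160, 200), frame_w=320, frame_h=240))
# -> [0.1, 0.2, 0.5, 0.8333..., 0.2533...]
```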
3.2.2 Tangled Transformer

We design a TaNgled Transformer (TNT) to better encode three sources of information, i.e., action features, regional object features and linguistic features. Instead of using only one transformer that treats the visual and text features equally, our tangled transformer consists of three transformers, which take the three sources of features, respectively. To enhance the interactions between visual and linguistic features, we propose to inject visual information into the linguistic transformer and to incorporate linguistic information into the visual transformers. With cross-modal interactions, the tangled transformer can dynamically select judicious cues for target prediction.

For simplicity, we denote the intermediate representations at transformer block l as h_w^l = {h_w0^l, ..., h_wN^l}, h_a^l = {h_a0^l, ..., h_aL^l}, and h_r^l = {h_r0^l, ..., h_rM^l}, which are processed by the w-transformer, a-transformer, and r-transformer, respectively (Figure 1). Besides the standard multi-head attention that encodes features from the same modality, we leverage two additional multi-head attention blocks to enhance mutual interactions between the transformer blocks. Specifically, we utilize h_a^l to catalyze mutual interactions. We denote multi-head attention as output = Multihead(Q, K, V), where Q is the query, K is the key, and V is the value; the details of multi-head attention can be found in [40]. We use h_a^l as a query to attend to judicious cues from h_w^l and h_r^l:

    c_w = Multihead(W_q1 h_a^l, W_kw h_w^l, W_vw h_w^l),    (1)
    c_r = Multihead(W_q2 h_a^l, W_kr h_r^l, W_vr h_r^l),    (2)

where the W_** are learnable weights. c_w is the feature blended from the linguistic representations, while c_r is the guided feature from the regional object representations. We then generate a new key-value pair from c_w using a linear layer; this generated key-value pair is stacked with the key-value pairs of the original a-transformer and r-transformer. Similarly, we generate a new key-value pair from c_r, which is stacked with the key-value pairs in the w-transformer. With this form of tangled transformer, visual and linguistic features are further associated.

Note that our tangled transformer differs from the co-attentional transformer block in [21] in several ways. First, the co-attentional transformer block simply passes the keys and values from one modality to the other modality's attention block, without further pre-processing. Second, [21] treats the two modalities equally, while our tangled block utilizes a global cue to guide the selection of local hints from the linguistic and visual features. Third, the keys and values from different modalities replace the original keys and values in [21], while our tangled transformer stacks the new key-value pair with the original one. In this way, both the linguistic and visual features are incorporated during transformer encoding.
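To make the key-value stacking concrete, the following PyTorch sketch implements one tangled step following Eqs. (1)-(2) as reconstructed above. It is only a sketch under assumptions: nn.MultiheadAttention stands in for the transformer blocks (its internal projections play the role of W_q1, W_kw, W_vw, etc.), the residual, normalization, and feed-forward sub-layers of Figure 1 are omitted, and all dimensions are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TangledBlockSketch(nn.Module):
    """Minimal sketch of one tangled transformer step (Eqs. 1-2 plus key-value stacking)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        # Action-guided cross attention (Eq. 1 and Eq. 2).
        self.attn_aw = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ar = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Linear layers that turn the blended features c_w / c_r into new key-value pairs.
        self.kv_from_cw = nn.Linear(dim, 2 * dim)
        self.kv_from_cr = nn.Linear(dim, 2 * dim)
        # Per-modality attention, standing in for the w-/a-/r-transformers.
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_r = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_w, h_a, h_r):
        # Eq. (1): action features query the linguistic stream.
        c_w, _ = self.attn_aw(query=h_a, key=h_w, value=h_w)
        # Eq. (2): action features query the regional stream.
        c_r, _ = self.attn_ar(query=h_a, key=h_r, value=h_r)

        k_w, v_w = self.kv_from_cw(c_w).chunk(2, dim=-1)  # derived from the linguistic side
        k_r, v_r = self.kv_from_cr(c_r).chunk(2, dim=-1)  # derived from the regional side

        # w-transformer: stack the key-value pair derived from c_r with its own keys/values.
        out_w, _ = self.attn_w(h_w, torch.cat([h_w, k_r], 1), torch.cat([h_w, v_r], 1))
        # a- and r-transformers: stack the key-value pair derived from c_w.
        out_a, _ = self.attn_a(h_a, torch.cat([h_a, k_w], 1), torch.cat([h_a, v_w], 1))
        out_r, _ = self.attn_r(h_r, torch.cat([h_r, k_w], 1), torch.cat([h_r, v_w], 1))
        return out_w, out_a, out_r
```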
Figure 1: Our tangled transformer takes three sources of information as inputs, which enhances the interactions between linguistic features and visual features. [The figure shows the w-, a- and r-transformer blocks, each built from multi-head attention, add & norm, and feed-forward layers, mapping h_w^l, h_a^l, h_r^l to h_w^{l+1}, h_a^{l+1}, h_r^{l+1}.]

Figure 2: Our ActBERT framework. We incorporate three sources of information during pre-training, i.e., global actions, local regional objects, and text descriptions. The yellow grid indicates that the action or the region object is masked out. [The figure shows an input such as "[CLS] rotate [MASK] ... [SEP] [ACT] [ACT] [REGION] [REGION] [SEP]" with its position, segment, token and visual (action) embeddings, built from globally stacked frames and local object regions (e.g., "Rotate shrimp balls. Add spinach."), together with the four pre-training heads: cross-modal matching, masked language modeling, masked action (verb) classification, and masked object (noun) classification.]

3.2.3 ActBERT Training

We introduce four tasks for ActBERT pre-training. Our framework is presented in Figure 2. We naturally extend

the Masked Language Modeling task to our cross-modal setting. There are existing extensions for image-and-language pre-training [21, 33] and for video-and-language pre-training [33]; compared to [33], we explicitly model actions and regional information in a unified framework.

Masked Language Modeling with Global and Local Visual Cues. We extend the Masked Language Modeling (MLM) task in BERT to our setting. We leverage visual cues from local regional objects and global actions to uncover the relationships between visual and linguistic entities. As described in Section 3.1, each word in the input sentence is randomly masked with a fixed probability. The task forces the model to learn from contextual descriptions and, at the same time, to extract the relevant visual features that facilitate prediction. When a verb is masked out, the model should exploit the action features for a more accurate prediction; when the description of an object is masked out, local regional features can provide additional contextual information. A strong model therefore needs to align visual and linguistic inputs both locally and globally. The output feature is fed to a softmax classifier over the whole linguistic vocabulary.

Masked Action Classification. Similarly, in Masked Action Classification the action features are masked out, and the task is to predict the masked action label based on the linguistic features and object features. Explicit action prediction is beneficial from two perspectives. First, long-term sequential action cues can be exploited; for example, for a video with the action sequence "get into", "rotate", "add", this task can better exploit the temporal order in which the instructional task is performed. Second, the regional objects and linguistic texts are leveraged for better cross-modality modeling. Note that in Masked Action Classification the goal is to predict the categorical label of the masked-out action feature. This task can enhance the action recognition capability of the pre-trained model, which generalizes to many downstream tasks, e.g., video question answering.

Masked Object Classification. In Masked Object Classification, the regional object features are randomly masked out. We follow [21] and predict a distribution over a fixed vocabulary for the masked-out image region. The target distribution of the masked-out region is the softmax activation obtained by forwarding the region through the same pre-trained detection model used in the feature extraction stage. The KL divergence between the two distributions is minimized.
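A minimal sketch of this masked-object objective, assuming the model head outputs logits over the detector's class vocabulary and the teacher distribution comes from re-running the pre-trained detector on the masked-out region; tensor shapes are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def masked_object_loss(pred_logits, teacher_probs):
    """KL divergence between the model's predicted class distribution and the
    detector's soft target for each masked-out region.
    Shapes (assumed): (num_masked_regions, num_detector_classes)."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    # F.kl_div expects log-probabilities for the input and probabilities for the target.
    return F.kl_div(log_pred, teacher_probs, reduction="batchmean")
```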

Cross-modal matching. Similar to the Next Sentence Prediction (NSP) task, we apply a linear layer on top of the output of the first token "[CLS]", followed by a sigmoid classifier that indicates the relevance score between the linguistic sentence and the visual features. A high score indicates that the text describes the video clip well. The model is optimized with a binary cross-entropy loss. To train this cross-modal matching task, we sample negative video-text pairs from the unlabeled dataset, following [26] for sampling positive and negative pairs.

4. Experiments

In this section, we evaluate ActBERT on multiple downstream video-and-language tasks. We quantitatively evaluate the generalization capability of ActBERT on five challenging tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization.

4.1. ActBERT implementation details

HowTo100M. We pre-train ActBERT on the HowTo100M dataset [26]. HowTo100M is constructed by querying the YouTube API and keeping the top 200 search results per query. The dataset covers a total of 23,611 tasks, e.g., maintenance and repair, animal rescue, and food preparation. It is biased towards actions, with verbs like "go", "make", and "come" being the most frequent; the nouns are also distributed in a long-tailed way, with objects like "water" and "cup" ranked at the top. Each video has a corresponding narration extracted from the video subtitles. As the association between video clips and texts is not manually annotated, the video-text correspondence can sometimes be weak, and there are noisy cases where the actors talk about unrelated things. Though noisy, we found that pre-training on HowTo100M still significantly improves the performance of downstream tasks.

Pre-training details. To construct video-text inputs for ActBERT pre-training, we sample video clips from the HowTo100M dataset. Instead of using only one clip for video-text joint training, we leverage multiple adjacent clips to cover a longer context, which enables ActBERT to model relations across different segments. We sample 10 adjacent video clips and extract the temporally-aligned linguistic tokens to form a video-text pair.

To obtain the local regional features, we use a Faster R-CNN pre-trained on the Visual Genome [16] dataset following [21], with a ResNet-101 backbone [9]. We extract regional features at a frame rate of 1 FPS. Each region feature is RoI-pooled from the convolutional feature map of that region. We set the detection confidence threshold to 0.4, and each frame contains at most five boxes. The transformer and co-attentional transformer blocks in the visual stream have a hidden state size of 1024 and 8 attention heads.

To obtain the action features, we first construct an action classification dataset. We sample frames at 8 FPS, and for each clip we extract the verbs from its text descriptions. We then train a ResNet-3D [39] network with a softmax classification loss, initializing the weights from a model pre-trained on Kinetics [12]; the Kinetics dataset covers 400 actions from YouTube videos, and the 3D convolutional network converges faster when it is pre-trained on Kinetics. The input clip length to ResNet-3D is 32 frames, covering a 4-second video duration, and the spatial size of the input frames is 224×224. The initial learning rate is 0.001 and the batch size is 16. We decay the learning rate by 0.1 at iteration 100,000, and the total number of training iterations is 1,000,000. Other training settings follow [39]. During feature extraction, we sample the central clip and center-crop each frame, and we use the feature after global average pooling as the clip representation.

During ActBERT pre-training, 15% of the input features are randomly masked out. ActBERT has 12 layers of transformer blocks, each with a hidden unit size of 768. We initialize the linguistic transformer with the BERT model pre-trained on the BookCorpus [56] and English Wikipedia; the other two transformers are randomly initialized. The network is optimized with the Adam optimizer with a learning rate of 10^-5. We train the model for five epochs due to the large scale of the data, using four NVIDIA Tesla V100 GPUs.
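For reference, the pre-training settings stated in this subsection can be collected into a single configuration object; the key names below are ours, while the values are the ones reported in the text.

```python
# Pre-training settings from Sec. 4.1, gathered in one place (key names are assumptions).
ACTBERT_PRETRAIN_CONFIG = {
    "dataset": "HowTo100M",
    "adjacent_clips_per_sample": 10,
    "region_detector": "Faster R-CNN (ResNet-101, Visual Genome)",
    "region_fps": 1,
    "region_confidence_threshold": 0.4,
    "max_regions_per_frame": 5,
    "visual_stream_hidden_size": 1024,
    "visual_stream_attention_heads": 8,
    "action_backbone": "ResNet-3D (Kinetics-initialized)",
    "action_clip_length_frames": 32,   # covers about 4 seconds at 8 FPS
    "action_input_size": 224,
    "mask_probability": 0.15,
    "transformer_layers": 12,
    "hidden_size": 768,
    "linguistic_init": "BERT (BookCorpus + English Wikipedia)",
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "epochs": 5,
    "gpus": "4x NVIDIA Tesla V100",
}
```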
Method                  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr
Zhou et al. [52]        7.53    3.84    11.55   27.44    0.38
S3D [43]                6.12    3.24    9.52    26.09    0.31
VideoBERT [33]          6.80    4.04    11.01   27.50    0.49
VideoBERT + S3D [33]    7.59    4.33    11.94   28.80    0.55
ActBERT                 8.66    5.41    13.30   30.56    0.65

Table 1: Video captioning results on YouCook2. We outperform VideoBERT [33] across all the metrics.

Method                     Frame Accuracy (%)
NN-Viterbi [30]            21.17
VGG [31]                   25.79
TCFPN-ISBA [8]             34.30
ActBERT w/o region cues    52.10
ActBERT                    56.95

Table 2: Action segmentation results on COIN.

4.2. Results on video-and-text tasks

We evaluate ActBERT on five downstream tasks, i.e., action step localization, action segmentation, text-video clip retrieval, video captioning, and video question answering, using CrossTask [57], COIN [35], YouCook2 [51], and MSR-VTT [44]. Videos from the test sets of these datasets are removed during pre-training on HowTo100M.

4.2.1 Datasets

CrossTask: We evaluate action step localization on the CrossTask [57] dataset, which contains 83 tasks and 4.7k videos related to cooking, car maintenance, crafting, etc. We use the recall metric described in [57], defined as the number of step assignments that fall into the ground-truth interval divided by the total number of steps in the video. COIN: We evaluate the action segmentation task on the recent COIN [35] dataset, which contains 180 tasks, 11,827 videos and 46,354 annotated segments collected from YouTube. YouCook2: We evaluate text-video clip retrieval and video captioning on YouCook2, a cooking video dataset collected from YouTube that covers a large variety of cooking styles, methods, ingredients and cookware [51]. YouCook2 contains 89 types of recipes and a total of 14k clips described with linguistic texts. Following [26], we evaluate the text-video clip retrieval task on the validation clips of YouCook2. MSR-VTT: We evaluate text-video clip retrieval and video question answering on MSR-VTT. The MSR-VTT dataset [44] is a general video dataset collected from YouTube with text descriptions. For the video question answering task, we evaluate multiple-choice VideoQA following [47]; there are 2,990 questions in total for testing, and each test video is associated with a ground-truth caption, a correct answer, and four mismatched descriptions. For text-video clip retrieval, we use 1,000 text-video pairs for evaluation, following [47].

4.2.2 Video captioning

We compare ActBERT to VideoBERT [33] on the video captioning task. We take the pre-trained action transformer as the video encoder and follow the setup of [52]: video clips from YouCook2 [51] are taken as input, and a transformer decoder is used to decode the videos into captions. We do not use the regional object transformer, for a fair comparison to [33]. Similar to [33], we cross-validate the hyper-parameters on the training set, and we report the standard captioning metrics, i.e., BLEU, METEOR, and ROUGE, on the validation set. The model is optimized with the Adam optimizer for 40k iterations, with an initial learning rate of 1.0 × 10^-3 and a batch size of 128. The results are shown in Table 1. We outperform VideoBERT [33] across all metrics, achieving a 1.36 improvement on METEOR. This demonstrates that our pre-trained transformer learns a better video representation, and it indicates the effectiveness of ActBERT in modeling video sequences by considering both global and local video cues. Our transformer generalizes better to video captioning.

4.2.3 Action segmentation

The action segmentation task in COIN is to assign an action label to a video at the frame level. To apply ActBERT to action segmentation, we fine-tune it by adding a linear classifier on top of the output features for dense frame labeling; the text descriptions are not fed during fine-tuning. The results are shown in Table 2, with baseline numbers from [35]. Notably, ActBERT significantly outperforms the baselines, with more than 20% improvement. This shows that the pre-trained ActBERT can handle purely visual inputs when linguistic descriptions are absent. When we remove the regional information, we observe a performance drop compared to our full model, which shows that detailed local cues are important for this dense frame labeling task.

4.2.4 Action step localization

We evaluate action step localization on CrossTask. To compare fairly to [26], we do not fine-tune on the target dataset. We regard the step action label as the text description and directly feed the text-video pair to ActBERT, taking the prediction for the first token "[CLS]" as the relevance score of the clip belonging to that label, and choosing the action with the maximum relevance score as the final prediction. The results are shown in Table 3. ActBERT outperforms TVJE [26] by a large margin, i.e., an average improvement of 7%, and is even better than the supervised baseline. We also remove the region cues for a fair comparison to [26], since [26] does not use object detection features for video-text matching; the results of "ActBERT w/o region cues" still substantially outperform [26], demonstrating the effectiveness of ActBERT pre-training. Our full ActBERT model further improves performance by 4%, validating that regional information is an important source of detailed local object features for text-and-video matching.

4.2.5 Text-video clip retrieval

We evaluate ActBERT on the task of video clip retrieval with natural language queries: given a linguistic query, the goal is to rank the video clips from a gallery video set. We use the following metrics for evaluation [26]: Recall@1 (R@1), Recall@5 (R@5), Recall@10 (R@10) and the median rank (Median R). We evaluate ActBERT on YouCook2 and MSR-VTT, and follow [26] to conduct the YouCook2 evaluation.
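A short sketch of these retrieval metrics, assuming a square similarity matrix in which entry (i, i) is the ground-truth text-video pair; this pairing convention is an assumption of the sketch, not a detail from the paper.

```python
import numpy as np

def retrieval_metrics(similarity):
    """Recall@K and median rank for text-to-video retrieval.
    similarity: (n_queries, n_videos) score matrix, ground truth on the diagonal."""
    order = np.argsort(-similarity, axis=1)               # best-scoring video first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # 1-based rank of the true match
                      for i in range(similarity.shape[0])])
    return {
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
    }
```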
Method                     Per-task recall on the 18 CrossTask tasks                                               Average
Alayrac et al. [1]         15.6 10.6  7.5 14.2  9.3 11.8 17.3 13.1  6.4 12.9 27.2  9.2 15.7  8.6 16.3 13.0 23.2  7.4    13.3
Zhukov et al. [57]         13.3 18.0 23.4 23.1 16.9 16.5 30.7 21.6  4.6 19.5 35.3 10.0 32.3 13.8 29.5 37.6 43.0 13.3    22.4
Supervised [57]            19.1 25.3 38.0 37.5 25.7 28.2 54.3 25.8 18.3 31.2 47.7 12.0 39.5 23.4 30.9 41.1 53.4 17.3    31.6
TVJE [26]                  33.5 27.1 36.6 37.9 24.1 35.6 32.7 35.1 30.7 28.5 43.2 19.8 34.7 33.6 40.4 41.6 41.9 27.4    33.6
ActBERT w/o region cues    37.4 29.5 39.0 42.2 29.8 37.5 35.5 37.8 33.2 32.8 48.4 25.2 37.4 35.6 42.4 47.0 46.1 30.4    37.1
ActBERT                    41.8 33.6 42.7 46.8 33.4 43.0 40.8 41.8 38.3 37.4 52.5 30.1 41.2 40.4 46.1 51.0 49.7 35.1    41.4

Table 3: Action step localization results on CrossTask [57]. Each column is one of the 18 evaluated tasks (e.g., Make Kimchi Rice, Grill Steak, Jack Up Car, Change Tire, Make Latte, Build Shelves, Make French Toast, Make Pancakes, Make Fish Curry); the last column is the average recall.
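The zero-shot scoring behind Table 3 (Section 4.2.4) can be sketched as follows; actbert_match_score is a hypothetical callable wrapping the pre-trained model's cross-modal matching head ([CLS] output), not an API from the paper.

```python
def localize_step(clip_features, step_texts, actbert_match_score):
    """Zero-shot action step localization: score every candidate step description
    against the clip with the cross-modal matching head and keep the argmax.
    `actbert_match_score` is a hypothetical stand-in for the pre-trained ActBERT."""
    scores = [actbert_match_score(step, clip_features) for step in step_texts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical usage:
# best_step, score = localize_step(clip, task_step_labels, actbert_match_score)
```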

Method               Dataset    R@1   R@5   R@10  Median R
HGLMM [14]           YouCook2   4.6   14.3  21.6  75
TVJE [26]            YouCook2   4.2   13.7  21.5  65
TVJE + FT [26]       YouCook2   8.2   24.5  35.3  24
ActBERT              YouCook2   9.6   26.7  38.0  19
C+LSTM+SA [37]       MSR-VTT    4.2   12.9  19.9  55
VSE-LSTM [13]        MSR-VTT    3.8   12.7  17.1  66
SNUVL [48]           MSR-VTT    3.5   15.9  23.8  44
Kaufman et al. [11]  MSR-VTT    4.7   16.6  24.1  41
CT-SAN [49]          MSR-VTT    4.4   16.6  22.3  35
JSFusion [47]        MSR-VTT    10.2  31.2  43.2  13
TVJE [26]            MSR-VTT    7.5   21.2  29.6  38
ActBERT              MSR-VTT    8.6   23.4  33.1  36

Table 4: Text-video clip retrieval results on YouCook2 and MSR-VTT. "FT" denotes fine-tuning on the training set.

Method                     Accuracy
Text-only BLSTM [22]       32.0
Text-only Human [22]       30.2
GoogleNet-2D + C3D [22]    35.7
Merging-LSTM [23]          34.2
SNUVL [48]                 38.0
CT-SAN [49]                41.9
LR/RL LSTMs [24]           40.9
JSFusion [47]              45.5
ActBERT                    48.6

Table 5: Video question answering (multiple-choice) results on MSR-VTT.

ActBERT significantly outperforms TVJE [26] and the other baselines. TVJE trains a ranking loss on the HowTo100M dataset; the comparison shows that ActBERT is a better pre-training framework for joint video-text representation learning. Notably, our pre-trained model achieves better retrieval performance than the fine-tuned TVJE model ("TVJE +FT") on YouCook2, which demonstrates the superiority of ActBERT in self-supervised video-text representation learning. On MSR-VTT, ActBERT outperforms TVJE by 1.1% on R@1 when no labeled data is accessed. Note that JSFusion [47] is a supervised method that leverages labeled video-text pairs for training.

4.2.6 Video question answering

We evaluate ActBERT on the multiple-choice VideoQA task. We fine-tune the pre-trained ActBERT on the MSR-VTT training set: the video-text pairs are fed to ActBERT, and a linear classifier is applied to the output feature. We train with the Adam optimizer and a small learning rate of 0.0001. At inference time, each candidate answer is paired with the video clip and fed to ActBERT, and the final choice is the candidate with the maximum matching score.
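The multiple-choice inference described above can be sketched as follows. The scoring module is a hypothetical stand-in for the fine-tuned matching head (a linear classifier on the output feature); only the select-the-highest-score logic mirrors the procedure in the text, and the feature dimension is an assumption.

import torch
import torch.nn as nn

class MatchingScorer(nn.Module):
    # Hypothetical stand-in: maps a (video, candidate-answer) feature pair
    # to a single matching score via a linear layer.
    def __init__(self, dim=768):                   # dimension is an assumption
        super().__init__()
        self.classifier = nn.Linear(2 * dim, 1)

    def forward(self, video_feat, text_feat):
        pair = torch.cat([video_feat, text_feat], dim=-1)
        return self.classifier(pair).squeeze(-1)

def answer_multiple_choice(scorer, video_feat, candidate_feats):
    # Score every (video, candidate) pair and pick the highest-scoring candidate.
    scores = torch.stack([scorer(video_feat, c) for c in candidate_feats])
    return int(torch.argmax(scores))

scorer = MatchingScorer()
video_feat = torch.randn(768)
candidates = [torch.randn(768) for _ in range(5)]  # five answer candidates
print(answer_multiple_choice(scorer, video_feat, candidates))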
The results are shown in Table 5. We compare against a wide range of baselines on this task. Without sophisticated joint modeling, ActBERT significantly outperforms JSFusion [47] by 3%, which shows ActBERT's strong generalization ability when pre-trained on a large-scale dataset.

5. Conclusion

In this paper, we introduce ActBERT for joint video-text modeling in a self-supervised way. We directly model both global and local visual cues for fine-grained visual and linguistic relation learning. ActBERT takes three sources of information as input, i.e., global actions, local regional objects, and linguistic descriptions. The novel tangled transformer further enhances the communication between the three sources. Quantitative results on five video-text benchmarks demonstrate the effectiveness of ActBERT. In the future, we will consider evaluating ActBERT on video action recognition and detection. We will also improve ActBERT by designing more powerful modules for video and text modeling.

Acknowledgements. This work is supported by ARC DP200100938.

References

[1] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016. 2, 8
[2] Chris Alberti, Kenton Lee, and Michael Collins. A BERT baseline for the natural questions. arXiv preprint arXiv:1901.08634, 2019. 1, 3
[3] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In ECCV, 2018. 2
[4] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018. 1
[5] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019. 2
[6] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The EPIC-Kitchens dataset. In ECCV, 2018. 2
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 1, 3
[8] Li Ding and Chenliang Xu. Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR, pages 6508–6516, 2018. 7
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 6
[10] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 2
[11] Dotan Kaufman, Gil Levi, Tal Hassner, and Lior Wolf. Temporal tessellation: A unified approach for video analysis. In ICCV, 2017. 8
[12] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 6
[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. 8
[14] Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015. 8
[15] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018. 2
[16] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017. 6
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012. 1
[18] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. TVQA: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018. 2
[19] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019. 2
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 4
[21] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019. 2, 4, 5, 6
[22] Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, and Christopher Pal. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In CVPR, 2017. 8
[23] Amir Mazaheri, Dong Zhang, and Mubarak Shah. Video fill in the blank with merging LSTMs. arXiv preprint arXiv:1610.04062, 2016. 8
[24] Amir Mazaheri, Dong Zhang, and Mubarak Shah. Video fill in the blank using LR/RL LSTMs with spatial-temporal attentions. In ICCV, 2017. 8
[25] Antoine Miech, Ivan Laptev, and Josef Sivic. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516, 2018. 2
[26] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019. 1, 2, 6, 7, 8
[27] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016. 1
[28] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016. 2
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015. 1, 4
[30] Alexander Richard, Hilde Kuehne, Ahsan Iqbal, and Juergen Gall. NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In CVPR, 2018. 7
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 7
[32] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019. 2
[33] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019. 1, 2, 5, 7
[34] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. 2
[35] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019. 6, 7
[36] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. MovieQA: Understanding stories in movies through question-answering. In CVPR, 2016. 2
[37] Atousa Torabi, Niket Tandon, and Leonid Sigal. Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124, 2016. 8
[38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015. 1
[39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018. 6
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2, 3, 4
[41] Xin Wang, Jiawei Wu, Da Zhang, Yu Su, and William Yang Wang. Learning to compose topic-aware mixture of experts for zero-shot video captioning. In AAAI, 2019. 2
[42] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. 3
[43] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018. 7
[44] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016. 6, 7
[45] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019. 1
[46] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In ICCV, pages 4507–4515, 2015. 2
[47] Youngjae Yu, Jongseok Kim, and Gunhee Kim. A joint sequence fusion model for video question answering and retrieval. In ECCV, 2018. 2, 7, 8
[48] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. Video captioning and retrieval models with semantic attention. arXiv preprint arXiv:1610.02947, 2016. 8
[49] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. End-to-end concept word detection for video captioning, retrieval, and question answering. In CVPR, 2017. 8
[50] Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J Corso, and Marcus Rohrbach. Grounded video description. In CVPR, 2019. 2
[51] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 2, 6, 7
[52] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In CVPR, 2018. 2, 7
[53] Linchao Zhu, Zhongwen Xu, and Yi Yang. Bidirectional multirate reconstruction for temporal modeling in videos. In CVPR, 2017. 2
[54] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G Hauptmann. Uncovering the temporal context for video question answering. IJCV, 124(3):409–421, 2017. 2
[55] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In ECCV, 2018. 2
[56] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015. 6
[57] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In CVPR, 2019. 6, 8
