tional task. Though action clues are important, they are largely ignored in previous self-supervised video-text training [33, 26], where actions are treated identically to objects. To model human actions, we first extract verbs from the text descriptions and construct an action classification dataset from the original dataset. A 3D convolutional network is then trained to predict the action labels, and the features from the optimized network are used as the action embedding. In this way, clip-level actions are represented, and the corresponding action label is inserted. Besides global action information, we incorporate local regional information to provide fine-grained visual cues [21, 34, 32, 19, 5]. Object regions provide detailed visual clues about the whole scene, including the regional object features and the positions of the objects. The language model can benefit from the regional information for better language-and-visual alignment.

Second, we introduce a TaNgled Transformer block (TNT) to encode features from three sources, i.e., global actions, local regional objects, and linguistic tokens. Previous studies [21, 34] consider two modalities when designing new transformer layers, i.e., fine-grained object information from images and natural language. Lu et al. [21] introduced a co-attentional transformer layer, where the key-value pairs from one modality are passed to the other modality's attention block to act as the new key-value pairs. However, in our scenario, there are three sources of inputs. Two of the sources, i.e., local regional features and linguistic texts, offer detailed descriptions of the event occurring in the clip. The global action feature provides the human intention over time as well as a straightforward clue for contextual inference. We design a new tangled transformer block for cross-modality feature learning from the three sources. To enhance the interactions between the two visual cues and the linguistic features, we use a separate transformer block [40] to encode each modality. The mutual cross-modal communication is then enhanced with two additional multi-head attention blocks, and the action feature catalyzes these mutual interactions. With guidance from the action features, we inject visual information into the linguistic transformer and incorporate linguistic information into the visual transformers. The tangled transformer dynamically selects judicious cues from its context to facilitate the target prediction.

Furthermore, we design four surrogate tasks to train ActBERT, i.e., masked language modeling with global and local visual cues, masked action classification, masked object classification, and cross-modal matching. The pre-trained ActBERT is transferred to five video-related downstream tasks, i.e., video captioning, action segmentation, text-video clip retrieval, action step localization, and video question answering. We quantitatively show that ActBERT achieves state-of-the-art performance with a clear margin.

2. Related Work

Video and language. There are many existing video-and-language tasks for evaluating a model's capacity for joint video-text representation learning, e.g., video question answering [36, 10, 18, 54], video captioning [46, 52], text-video retrieval [47, 41, 25], and video grounding [50]. In video and language modeling, it can be difficult to learn relations between ordered video frames and their corresponding descriptions, where video temporal information and the spatio-temporal interactions between multiple objects need to be incorporated. The dominant approach to multi-modal modeling is to leverage Recurrent Neural Networks (RNNs) and their variants, e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), to model sequence relations, e.g., [28, 53]. Zhou et al. [52] leveraged masked transformers in both the encoder and the decoder for dense video captioning. Most of these works are conducted on well-annotated datasets where the descriptions are manually generated, requiring considerable human effort. There are other works that learn video representations from limited annotated data [55]. Video data is a natural source for learning cross-modal representations, where the text descriptions are automatically generated by off-the-shelf automatic speech recognition (ASR) models. This is more scalable and better suited to deployment in real-world applications. In this paper, we focus on learning joint video-text representations in a self-supervised way.

Cross-modal pre-training. In the past year, many works extended BERT to model cross-modal data [21, 32, 34, 5, 19, 33]. The recent BERT model for video-text modeling [33] introduces visual words for video frame encoding, where local regional information is largely ignored. The synchronized video-audio signal is also a good test-bed for cross-modal representation learning [3, 15]. However, these works leveraged low-level audio signals and only considered the synchronization nature of video data. In this work, we focus on video-text joint representation learning. Our ActBERT leverages multi-source information and achieves remarkable performance on many downstream video-text tasks.

Instructional videos. Learning from instructional videos is challenging due to the data complexity across various tasks [6, 1, 51, 26]. These videos are collected from many domains, e.g., cooking, sports, and gardening. Many works also regard the transcriptions generated from instructional videos as a source of supervision [1, 51, 26]. However, we employ ActBERT to explicitly model human actions and local regions in a unified framework. We improve on [26] with more specific relation modeling between videos and their descriptions. We quantitatively demonstrate that ActBERT is more suitable for unsupervised video-text modeling.
3. Model Architecture

3.1. Preliminary

We first review the original BERT [7] model. BERT [7] pre-trains a language model on large corpora in an unsupervised way. The pre-trained model is found to be flexible and beneficial to a variety of downstream tasks, e.g., question answering [2].

In BERT [7], the input entities are processed by a multi-layer bidirectional transformer [40]. The embeddings of each input are processed with stacked self-attention layers to aggregate contextual features, where the attention weights are adaptively generated. The output features contain contextual information about the original input sequence. In self-attention, the generated features are independent of the input sequence order, which makes the output representation permutation-invariant, i.e., the output representation is not affected when the input sequence is shuffled. A position embedding is therefore commonly added to each input entity to incorporate sequential order clues.

In the original BERT, Devlin et al. introduced two tasks for pre-training. In the masked language modeling (MLM) task, a portion of the input words are randomly masked out and replaced by a special token "[MASK]". The task is to predict the masked words from the contextual contents, i.e., the unmasked elements that provide relevant cues for the prediction of each masked word.

The other task, Next Sentence Prediction (NSP), models order information between two sentences. Two sentences are sampled from a document, and NSP aims to identify whether the second sentence follows the first sentence in the correct order. The two sentences are concatenated via a token "[SEP]", so that the model is aware that the inputs are separate sentences. The prediction is made upon the output features of the first token "[CLS]". This is a binary classification problem, and a simple sigmoid classifier is used. A prediction of "1" indicates the sentences are consecutive, i.e., the second sentence comes right after the first sentence.
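As a concrete illustration of the masking step described above, the following is a simplified sketch (not the authors' code, and omitting BERT's 80/10/10 replacement scheme) of how input tokens can be randomly replaced by "[MASK]"; the 15% ratio matches the masking probability used later for ActBERT pre-training.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the corrupted sequence and the (position, original token) pairs
    that a masked-language-modeling head would be trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):          # never mask special tokens
            corrupted.append(tok)
        elif rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets.append((i, tok))           # ground truth for the prediction
        else:
            corrupted.append(tok)
    return corrupted, targets

# Example on the tokens of "rotate shrimp balls"
corrupted, targets = mask_tokens(["[CLS]", "rotate", "shrimp", "balls", "[SEP]"], seed=0)
```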
3.2. ActBERT

3.2.1 Input Embeddings

There are four types of input elements in ActBERT: actions, image regions, linguistic descriptions, and special tokens. Special tokens are used to distinguish the different inputs.

Each input sequence starts with a special token "[CLS]" and ends with another token "[SEP]". We put the linguistic descriptions after "[CLS]", followed by the action inputs and then the local regional features. We denote the action features as $a_1, \ldots, a_L$, the frame region features as $r_1, \ldots, r_M$, and the sequential text descriptions as $w_1, \ldots, w_N$. The whole sequence is denoted as $\{[\mathrm{CLS}], w_1, \ldots, w_N, [\mathrm{SEP}], a_1, \ldots, a_L, [\mathrm{SEP}], r_1, \ldots, r_M, [\mathrm{SEP}]\}$. "[SEP]" is also inserted between different sentences, and we insert "[SEP]" between regions that come from different clips, which helps the model identify clip boundaries. For each input step, the final embedding consists of four different embeddings: a position embedding, a segment embedding, a token embedding, and a visual feature embedding. We add a few new tokens to distinguish action features and regional object features. The visual embedding is introduced to carry visual and action information. These embeddings are summed to form the final input feature of ActBERT. We explain them in detail as follows.

Position embedding. Following [7], we incorporate a learnable position embedding for every input in the sequence. Since self-attention does not consider order information, position encoding offers a flexible way to embed a sequence when the sequence order matters. For actions in different clips, the position embeddings are different because the video clips are ordered. For regions extracted from the same frame, we use the same position embedding. To distinguish regions from the same frame, we add a spatial position embedding for the different spatial positions; the details are described in "Visual (action) embedding".

Segment embedding. We consider multiple video clips for long-term video context modeling. Each video clip or video segment has a corresponding segment embedding. The elements, i.e., action inputs, regional object inputs, and linguistic descriptions, share the same segment embedding within the same video clip.

Token embedding. Each word is embedded with WordPiece embeddings [42] with a 30,000-word vocabulary. In addition to the special tokens mentioned above ("[CLS]", "[MASK]", "[SEP]"), we introduce "[ACT]" and "[REGION]" to represent the action features and the region features extracted from video frames, respectively. Note that all action inputs share an identical token embedding, which reveals the modality of the inputs.

Visual (action) embedding. We now explain the visual (action) embedding in detail. We first illustrate the procedure to obtain the action embedding. For each video clip, we extract verbs from its corresponding descriptions. For simplicity, we remove clips that do not contain any verbs. We then build a vocabulary from all the extracted verbs. After verb vocabulary construction, each video clip has one or multiple category labels. We train a 3D convolutional neural network on this constructed dataset. The input to the 3D network is a tensor with an additional temporal dimension, and we place a softmax classifier on top of the convolutional neural network. For clips with multiple labels, we normalize the one-hot label with the $\ell_1$-norm.
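To make the input construction above concrete, here is a minimal sketch of how the four embeddings could be summed at each input position. Dimensions other than the 30,000-word vocabulary and 768 hidden size are illustrative assumptions, the final LayerNorm is an assumption in line with common BERT-style inputs, and this is not the released implementation.

```python
import torch
import torch.nn as nn

class ActBERTInputEmbedding(nn.Module):
    """Sums token, position, segment, and visual (action/region) embeddings."""

    def __init__(self, vocab_size=30000, hidden=768, max_len=256,
                 num_segments=4, visual_dim=2048):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(num_segments, hidden)
        # Projects a raw visual feature (e.g., an RoI or 3D-CNN feature) to the
        # hidden size; linguistic positions simply use a zero visual feature.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.norm = nn.LayerNorm(hidden)   # assumption, common for BERT inputs

    def forward(self, token_ids, positions, segments, visual_feats):
        emb = (self.token(token_ids)
               + self.position(positions)
               + self.segment(segments)
               + self.visual_proj(visual_feats))
        return self.norm(emb)

# token_ids:    (batch, seq) ids for [CLS], words, [ACT], [REGION], [SEP], ...
# visual_feats: (batch, seq, visual_dim), zeros at linguistic positions.
```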
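The action-embedding procedure (verb vocabulary, multi-label clips with $\ell_1$-normalized targets, a 3D CNN with a softmax classifier, and clip features taken after global average pooling, as detailed in Section 4.1) could look roughly like the following sketch. The helper names and the `backbone3d` argument are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def soft_action_target(verb_ids, vocab_size):
    """Clips with multiple verb labels get an l1-normalized (soft) target."""
    target = torch.zeros(vocab_size)
    target[verb_ids] = 1.0
    return target / target.sum()

def action_loss(logits, target):
    """Cross-entropy between the softmax prediction and the soft target."""
    return -(target * F.log_softmax(logits, dim=-1)).sum()

def extract_clip_feature(backbone3d, clip):
    """`backbone3d` is any 3D CNN (e.g., a ResNet-3D) mapping a
    (B, 3, 32, 224, 224) clip tensor to (B, C, t, h, w) feature maps."""
    feat = backbone3d(clip)            # (B, C, t, h, w)
    return feat.mean(dim=(2, 3, 4))    # global average pooling -> (B, C)
```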
3.2.2 Tangled Transformer

[The body of this subsection was lost to extraction damage. The surviving fragments indicate that it defines the spatial encoding of each detection feature in terms of the frame width $W$ and height $H$, and describes the tangled transformer, which entangles a w-transformer for linguistic features, an a-transformer for the actions that occurred in the video clip, and an r-transformer for the regional object features, each built from multi-head attention (Q, K, V), add & norm, and feed-forward layers.]
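Based on the description in the introduction (a separate transformer block per modality, plus two additional multi-head attention blocks through which the action features guide the injection of visual information into the linguistic stream and vice versa), a hedged sketch of one tangled layer could look as follows. The module structure and the way the action features are mixed into the keys/values are illustrative assumptions, not the exact formulation of the paper.

```python
import torch
import torch.nn as nn

class TangledLayer(nn.Module):
    """One sketched tangled layer over three streams: words (w), actions (a), regions (r)."""

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        # Per-modality transformer blocks (multi-head attention + add & norm + FFN).
        self.w_block = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.a_block = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.r_block = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        # Two extra cross-modal attention blocks for mutual communication.
        self.vis_to_lang = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.lang_to_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, h_w, h_a, h_r):
        # Action features are concatenated into both key/value sources so that
        # they guide the cross-modal exchange (a sketch of the idea).
        visual_ctx = torch.cat([h_a, h_r], dim=1)
        lang_ctx = torch.cat([h_a, h_w], dim=1)
        # Inject visual information into the linguistic stream, and vice versa.
        w_in = h_w + self.vis_to_lang(h_w, visual_ctx, visual_ctx)[0]
        r_in = h_r + self.lang_to_vis(h_r, lang_ctx, lang_ctx)[0]
        return self.w_block(w_in), self.a_block(h_a), self.r_block(r_in)
```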
Figure 2: Our ActBERT framework. We incorporate three sources of information during pre-training, i.e., global actions, local regional objects, and text descriptions. The yellow grid indicates that the action or the region object is masked out. (The illustration itself, showing the position, segment, token, and visual (action) embeddings for an input such as "[CLS] rotate [MASK] ... [SEP] [ACT] [ACT] [REGION] [REGION] [SEP]" over the clips "Rotate shrimp balls." and "Add spinach.", is not reproduced here.)

3.2.3 ActBERT Training
We first describe the Masked Language Modeling in our cross-modal setting. There are some existing extensions for image-and-language pre-training [21, 33] and video-and-language pre-training [33]. Compared to [33], we explicitly model actions and regional information in a unified framework.

Masked Language Modeling with Global and Local Visual Cues. We extend the Masked Language Modeling (MLM) task in BERT to our setting. We leverage visual cues from local regional objects and global actions to uncover the relationships between visual and linguistic entities. As described in Section 3.1, each word in the input sentence is randomly masked with a fixed probability. The task forces the model to learn from contextual descriptions and, at the same time, to extract the relevant visual features that facilitate prediction. When a verb is masked out, the model should exploit the action features for a more accurate prediction; when the description of an object is masked out, local regional features can provide more contextual information. Thus, a strong model needs to align visual and linguistic inputs both locally and globally. The output feature is followed by a softmax classifier over the whole linguistic vocabulary.

Masked Action Classification. Similarly, in Masked Action Classification, the action features are masked out. The task is to predict the masked action label based on the linguistic features and object features. Explicit action prediction can be beneficial from two perspectives. First, sequential action cues can be exploited over the long term. For example, for a video with the action sequence "get into", "rotate", "add", this task can better exploit the temporal order information involved in performing the instructional assignment. Second, the regional objects and linguistic texts are leveraged for better cross-modality modeling. Note that in Masked Action Classification, the goal is to predict the categorical label of the masked-out action feature. This task can enhance the action recognition capability of the pre-trained model, which can be further generalized to many downstream tasks, e.g., video question answering.

Masked Object Classification. In Masked Object Classification, the regional object features are randomly masked out. We follow [21] to predict a distribution over a fixed vocabulary for the masked-out image region. The target distribution for the masked-out region is the softmax activation obtained by forwarding the region through the same pre-trained detection model used in the feature extraction stage. The KL divergence between the two distributions is minimized.

Cross-modal matching. Similar to the Next Sentence Prediction (NSP) task, we apply a linear layer on top of the output of the first token "[CLS]", followed by a sigmoid classifier that indicates the relevance score between the linguistic sentences and the visual features. A high score indicates that the text describes the video clips well. The model is optimized via a binary cross-entropy loss. To train this cross-modal matching task, we sample negative video-text pairs from the unlabeled dataset, following [26] for sampling positive and negative pairs.
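A hedged sketch of how the masked object classification and cross-modal matching objectives described above could be computed is given below. Tensor shapes, the head definition, and the task weighting are assumptions; the masked language modeling and masked action classification losses are standard cross-entropy terms and are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_object_loss(region_logits, detector_probs):
    """KL divergence between the model's distribution for a masked-out region
    and the softmax distribution produced by the pre-trained detector (target)."""
    log_pred = F.log_softmax(region_logits, dim=-1)
    return F.kl_div(log_pred, detector_probs, reduction="batchmean")

def cross_modal_matching_loss(cls_output, match_head, labels):
    """Linear layer + sigmoid on the [CLS] output, trained with binary
    cross-entropy; labels are 1 for matched video-text pairs, 0 otherwise."""
    scores = match_head(cls_output).squeeze(-1)   # (batch,)
    return F.binary_cross_entropy_with_logits(scores, labels.float())

# Example head for the matching task (768 is the hidden size of ActBERT).
match_head = nn.Linear(768, 1)
```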
4. Experiments

In this section, we evaluate ActBERT on multiple downstream video-and-language tasks. We quantitatively evaluate the generalization capability of ActBERT on five challenging tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization.

4.1. ActBERT implementation details

HowTo100M. We pre-train ActBERT on the HowTo100M dataset [26]. The HowTo100M dataset is constructed by querying the YouTube API, keeping the top 200 search results per query. The dataset covers a total of 23,611 tasks, e.g., maintenance and repair, animal rescue, and food preparation. The dataset is biased towards actions, with verbs like "go", "make", and "come" being the most frequent. The nouns are also distributed in a long-tailed way, with objects like "water" and "cup" ranked at the top. Each video has a corresponding narration that is extracted from the video subtitles. As the associations between video clips and texts are not manually annotated, the video-text connection can sometimes be weak, and there are cases of noisy correspondence where the actors talk about unrelated things. Though noisy, we found that pre-training on HowTo100M can still significantly improve the performance of downstream tasks.

Pre-training details. To construct video-text inputs for ActBERT pre-training, we sample video clips from the HowTo100M dataset. Instead of using only one clip for video-text joint training, we leverage multiple adjacent clips to cover a longer context, which enables ActBERT to model relations across different segments. We sample 10 adjacent video clips, and the temporally aligned linguistic tokens are extracted to form a video-text pair.

To obtain the local regional features, we use a Faster R-CNN pre-trained on the Visual Genome [16] dataset following [21]. The backbone is ResNet-101 [9]. We extract the regional features at a frame rate of 1 FPS. Each region feature is RoI-pooled from the convolutional feature map of that region. We set the detection confidence threshold to 0.4, and each frame contains at most five boxes. The transformer and co-attentional transformer blocks in the visual stream have a hidden state size of 1024 and 8 attention heads.

To obtain the action features, we first construct an action classification dataset. We sample frames at 8 FPS. For each clip, we extract the verbs from its text descriptions. We then train a ResNet-3D [39] network with a softmax classification loss, initializing the weights of the ResNet-3D model from a model pre-trained on Kinetics [12]. The Kinetics dataset covers 400 actions from YouTube videos, and the 3D convolutional network converges faster when it is pre-trained on Kinetics. The input clip length to ResNet-3D is 32 frames, covering a 4-second video duration, and the spatial size of the input frames is 224×224. The initial learning rate is set to 0.001 and the batch size is 16. We decay the learning rate by 0.1 at iteration 100,000, and the total number of training iterations is 1,000,000. We keep the other training settings unchanged following [39]. During feature extraction, we sample the central clip and center-crop each frame, and we use the feature after global average pooling as the clip representation.

During ActBERT pre-training, 15% of the input features are randomly masked out. ActBERT has 12 layers of transformer blocks, and each transformer block has a hidden size of 768. We initialize the linguistic transformer with the BERT model pre-trained on the BookCorpus [56] and English Wikipedia. The other two transformers are randomly initialized. The network is optimized with the Adam optimizer with a learning rate of $10^{-5}$. We train the model for five epochs due to the large scale of the data, using four NVIDIA Tesla V100 GPUs.

4.2. Results on video-and-text tasks

We evaluate ActBERT on five downstream tasks, i.e., action step localization, action segmentation, text-video clip retrieval, video captioning, and video question answering. We evaluate the five tasks on CrossTask [57], COIN [35], YouCook2 [51], and MSR-VTT [44]. Videos from the test sets of these datasets are removed during pre-training on HowTo100M.

4.2.1 Datasets

CrossTask: We evaluate action step localization on the CrossTask [57] dataset. CrossTask [57] contains 83 tasks and 4.7k videos related to cooking, car maintenance, crafting, etc. We use the recall metric described in [57], defined as the number of step assignments that fall into the ground-truth interval, divided by the total number of steps in the video. COIN: We evaluate the action segmentation task on the recent COIN [35] dataset. COIN [35] contains 180 tasks and 11,827 videos, with 46,354 annotated segments. The videos are collected from YouTube.
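The CrossTask recall metric described above can be computed as in the following minimal sketch. Step assignments and ground-truth intervals are assumed to be given as dictionaries; this illustrates the definition only and is not the official evaluation script.

```python
def crosstask_recall(step_assignments, gt_intervals):
    """Recall = (# predicted step time stamps falling inside the ground-truth
    interval of their step) / (total number of steps in the video).

    step_assignments: dict mapping step index -> predicted time (seconds)
    gt_intervals:     dict mapping step index -> (start, end) in seconds
    """
    hits = 0
    for step, (start, end) in gt_intervals.items():
        t = step_assignments.get(step)
        if t is not None and start <= t <= end:
            hits += 1
    return hits / max(len(gt_intervals), 1)

# Example: 2 of 3 steps fall inside their ground-truth intervals -> recall = 0.667
print(crosstask_recall({0: 12.0, 1: 40.0, 2: 90.0},
                       {0: (10, 20), 1: (35, 45), 2: (50, 60)}))
```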
Table 1: Video captioning results on YouCook2. We outperform VideoBERT [33] across all the metrics.

Method                  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr
Zhou et al. [52]        7.53    3.84    11.55   27.44    0.38
S3D [43]                6.12    3.24    9.52    26.09    0.31
VideoBERT [33]          6.80    4.04    11.01   27.50    0.49
VideoBERT + S3D [33]    7.59    4.33    11.94   28.80    0.55
ActBERT                 8.66    5.41    13.30   30.56    0.65

Table 2: Action segmentation results on COIN.

Method                     Frame Accuracy (%)
NN-Viterbi [30]            21.17
VGG [31]                   25.79
TCFPN-ISBA [8]             34.30
ActBERT w/o region cues    52.10
ActBERT                    56.95
YouCook2: We evaluate text-video clip retrieval and video captioning on YouCook2. YouCook2 is a cooking video dataset collected from YouTube, covering a large variety of cooking styles, methods, ingredients, and cookware [51]. In YouCook2, there are 89 types of recipes and a total of 14k clips described with linguistic texts. Following [26], we evaluate the text-video clip retrieval task on the validation clips of YouCook2. MSR-VTT: We evaluate text-video clip retrieval and video question answering on MSR-VTT. The MSR-VTT dataset [44] is a general video dataset collected from YouTube with text descriptions. For the video question answering task, we evaluate the multiple-choice VideoQA setting following [47]. There are 2,990 questions in total for testing; each test video is associated with a ground-truth caption, a correct answer, and four mismatched descriptions. For text-video clip retrieval, following [47], we use 1,000 text-video pairs for evaluation.

4.2.2 Video captioning

We compare our ActBERT to VideoBERT [33] on the video captioning task. We take the pre-trained action transformer as the video encoder. We follow the setup from [52], which takes the video clips from YouCook2 [51] as input and uses a transformer decoder to decode videos into captions. We do not use the regional object transformer, for a fair comparison to [33]. Similar to [33], we cross-validate the hyper-parameters on the training set. We report the standard evaluation metrics for captioning, i.e., BLEU, METEOR, and ROUGE, on the validation set. The model is optimized with the Adam optimizer for 40k iterations; we set the initial learning rate to $1.0 \times 10^{-3}$ and the batch size to 128. The results are shown in Table 1. We outperform VideoBERT [33] across all metrics, achieving a 1.36 improvement on METEOR. This demonstrates that our pre-trained transformer learns a better video representation, and it indicates the effectiveness of ActBERT in modeling video sequences by considering both global and local video cues. Our transformer generalizes better in video captioning.

4.2.3 Action segmentation

The action segmentation task in COIN is to assign an action label to a video at the frame level. To apply ActBERT to action segmentation, we fine-tune ActBERT by adding a linear classifier upon the output features for dense frame labeling. We do not feed the text descriptions during the fine-tuning process. The results are shown in Table 2; the baseline results are provided by [35]. Notably, ActBERT significantly outperforms the baselines with more than 20% improvement. This shows that the pre-trained ActBERT can deal with visual inputs alone when linguistic descriptions are absent. When we remove the regional information, we observe a performance drop compared to our full model, which shows that detailed local cues are important for the dense frame labeling task.

4.2.4 Action step localization

We evaluate action step localization on CrossTask. To compare fairly to [26], we do not fine-tune on the target dataset. We regard the step action label as the text description and directly feed the text-video pair to ActBERT. We regard the prediction for the first token "[CLS]" as the relevance score of the clip belonging to the label, and we choose the action with the maximum relevance score as the final prediction. The results are shown in Table 3. ActBERT significantly outperforms TVJE [26] by a large margin, i.e., the average improvement is 7%. We even achieve better results than the supervised baseline. We remove the region cues to have a fair comparison to [26], as [26] does not use object detection features for video-and-text matching. The results of "ActBERT w/o region cues" also substantially outperform [26], demonstrating the effectiveness of ActBERT pre-training. Our full ActBERT model further improves performance by 4%. This validates that regional information is an important source of detailed local object features for text-and-video matching.

4.2.5 Text-video clip retrieval

We evaluate ActBERT on the task of video clip retrieval with natural language queries. Given a linguistic query, the task is to rank the video clips from a gallery video set. We use the following metrics for evaluation [26]: Recall@1 (R@1), Recall@5 (R@5), Recall@10 (R@10), and the median rank (Median R). We evaluate ActBERT on YouCook2 and MSR-VTT, following [26] to conduct the YouCook2 evaluation. The results are shown in Table 4.
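The retrieval metrics listed above (Recall@K and median rank over a ranked gallery) can be computed as in this short sketch, assuming each query's similarity scores against all gallery clips are given and that the ground-truth clip for query i sits at column i; it is illustrative only.

```python
import numpy as np

def retrieval_metrics(similarities):
    """similarities: (num_queries, num_clips) score matrix where entry (i, i)
    is the score of the ground-truth clip for query i (a common convention)."""
    ranks = []
    for i, scores in enumerate(similarities):
        order = np.argsort(-scores)                      # best score first
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    ranks = np.asarray(ranks)
    return {
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "Median R": float(np.median(ranks)),
    }
```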
Table 3: Action step localization results on CrossTask (recall, %). Each row reports per-task recall over the 18 CrossTask tasks (Make Kimchi Rice, Pickle Cucumber, Make Banana Ice Cream, Grill Steak, Jack Up Car, Make Jello Shots, Change Tire, Make Lemonade, Add Oil to Car, Make Latte, Build Shelves, Make Taco Salad, Make French Toast, Make Irish Coffee, Make Strawberry Cake, Make Pancakes, Make Meringue, Make Fish Curry; the per-column ordering was lost in extraction), followed by the average in the last column.

Alayrac et al. [1]         15.6 10.6  7.5 14.2  9.3 11.8 17.3 13.1  6.4 12.9 27.2  9.2 15.7  8.6 16.3 13.0 23.2  7.4 | 13.3
Zhukov et al. [57]         13.3 18.0 23.4 23.1 16.9 16.5 30.7 21.6  4.6 19.5 35.3 10.0 32.3 13.8 29.5 37.6 43.0 13.3 | 22.4
Supervised [57]            19.1 25.3 38.0 37.5 25.7 28.2 54.3 25.8 18.3 31.2 47.7 12.0 39.5 23.4 30.9 41.1 53.4 17.3 | 31.6
TVJE [26]                  33.5 27.1 36.6 37.9 24.1 35.6 32.7 35.1 30.7 28.5 43.2 19.8 34.7 33.6 40.4 41.6 41.9 27.4 | 33.6
ActBERT w/o region cues    37.4 29.5 39.0 42.2 29.8 37.5 35.5 37.8 33.2 32.8 48.4 25.2 37.4 35.6 42.4 47.0 46.1 30.4 | 37.1
ActBERT                    41.8 33.6 42.7 46.8 33.4 43.0 40.8 41.8 38.3 37.4 52.5 30.1 41.2 40.4 46.1 51.0 49.7 35.1 | 41.4
References

[1] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016. 2, 8
[2] Chris Alberti, Kenton Lee, and Michael Collins. A bert baseline for the natural questions. arXiv preprint arXiv:1901.08634, 2019. 1, 3
[3] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In ECCV, 2018. 2
[4] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018. 1
[5] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019. 2
[6] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018. 2
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 1, 3
[8] Li Ding and Chenliang Xu. Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR, pages 6508–6516, 2018. 7
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 6
[10] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 2
[11] Dotan Kaufman, Gil Levi, Tal Hassner, and Lior Wolf. Temporal tessellation: A unified approach for video analysis. In ICCV, 2017. 8
[12] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 6
[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. 8
[14] Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. Associating neural word embeddings with deep image representations using fisher vectors. In CVPR, 2015. 8
[15] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018. 2
[16] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017. 6
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 1
[18] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018. 2
[19] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019. 2
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4
[21] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019. 2, 4, 5, 6
[22] Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, and Christopher Pal. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In CVPR, 2017. 8
[23] Amir Mazaheri, Dong Zhang, and Mubarak Shah. Video fill in the blank with merging lstms. arXiv preprint arXiv:1610.04062, 2016. 8
[24] Amir Mazaheri, Dong Zhang, and Mubarak Shah. Video fill in the blank using lr/rl lstms with spatial-temporal attentions. In ICCV, 2017. 8
[25] Antoine Miech, Ivan Laptev, and Josef Sivic. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516, 2018. 2
[26] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019. 1, 2, 6, 7, 8
[27] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016. 1
[28] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016. 2
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015. 1, 4
[30] Alexander Richard, Hilde Kuehne, Ahsan Iqbal, and Juergen Gall. Neuralnetwork-viterbi: A framework for weakly supervised video learning. In CVPR, 2018. 7
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 7
[32] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019. 2
[33] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In ICCV, 2019. 1, 2, 5, 7
[34] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. 2
[35] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019. 6, 7
[36] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In CVPR, 2016. 2
[37] Atousa Torabi, Niket Tandon, and Leonid Sigal. Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124, 2016. 8
[38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015. 1
[39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018. 6
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2, 3, 4
[41] Xin Wang, Jiawei Wu, Da Zhang, Yu Su, and William Yang Wang. Learning to compose topic-aware mixture of experts for zero-shot video captioning. In AAAI, 2019. 2
[42] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. 3
[43] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018. 7
[44] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, 2016. 6, 7
[45] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019. 1
[46] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In ICCV, pages 4507–4515, 2015. 2
[47] Youngjae Yu, Jongseok Kim, and Gunhee Kim. A joint sequence fusion model for video question answering and retrieval. In ECCV, 2018. 2, 7, 8
[48] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. Video captioning and retrieval models with semantic attention. arXiv preprint arXiv:1610.02947, 2016. 8
[49] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. End-to-end concept word detection for video captioning, retrieval, and question answering. In CVPR, 2017. 8
[50] Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J Corso, and Marcus Rohrbach. Grounded video description. In CVPR, 2019. 2
[51] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 2, 6, 7
[52] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In CVPR, 2018. 2, 7
[53] Linchao Zhu, Zhongwen Xu, and Yi Yang. Bidirectional multirate reconstruction for temporal modeling in videos. In CVPR, 2017. 2
[54] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G Hauptmann. Uncovering the temporal context for video question answering. IJCV, 124(3):409–421, 2017. 2
[55] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In ECCV, 2018. 2
[56] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015. 6
[57] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In CVPR, 2019. 6, 8