Multiview Transformers for Video Recognition (Yan et al., CVPR 2022)

Abstract

Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders specialized for each "view", with lateral connections between them to fuse information from different views to each other.

We can use transformer encoders of varying sizes to process each view, and find that it is better (in terms of accuracy/computation trade-offs) to use a smaller encoder (e.g. smaller hidden sizes and fewer layers) to represent the broader view of the video (Fig. 1 left), while an encoder with larger capacity is used to capture the details (Fig. 1 right). This design therefore poses a clear contrast to pyramid-based approaches, where model complexity increases as the spatio-temporal resolution decreases. Our design is verified by our experiments, which show clear advantages over the former approach.

Our proposed method of processing different "views" of the input video is simple and, in contrast to previous work [23], generalizes readily to a variable number of views. This is significant, as our experiments show that accuracy increases as the number of views grows. Although our proposed architecture increases the number of tokens processed by the network according to the number of input views, we show that we can consistently achieve superior accuracy/computation trade-offs compared to the current state of the art [3], across a spectrum of model sizes ranging from "Small" to "Huge". We show empirically that this is because processing more views in parallel enables us to achieve larger accuracy improvements than increasing the depth of the transformer network. We perform thorough ablation studies of our design choices, and achieve state-of-the-art results on six standard video classification datasets. Moreover, we show that these results can be further improved with large-scale pretraining.

Figure 1. Overview of the MTV architecture: each view is tokenized and processed by its own encoder block (with a per-view hidden size and L^(i) layers), with cross-view fusion between the encoders and a final global encoder.

2. Related Work

Evolution of video understanding models. Early works [34, 37, 72] relied on hand-crafted features to encode motion and appearance information. With the emergence of large labelled datasets like ImageNet [16], Convolutional Neural Networks (CNNs) [39] showed their superiority over the classic methods. Since AlexNet [36] won the ImageNet challenge by a large margin, CNNs have been quickly adopted for various vision tasks; their architectures have been refined over many generations [12, 27, 58, 63] and later improved by Neural Architecture Search (NAS) [54, 65, 85]. At the same time, CNNs and RNNs have quickly become the de-facto backbones for video understanding tasks [32, 50, 57]. Since the release of the Kinetics dataset [33], 3D CNNs [10, 24, 68] have gained popularity, and many variants [22, 62, 69, 70, 78] have been developed to improve the speed and accuracy. Convolution operations can only process one local neighborhood at a time, and consequently, transformer blocks [71] have been inserted into CNNs as additional layers to improve modeling of long-range interactions among spatio-temporal features [74, 75]. Although achieving great success in natural language [8, 17, 53], pure transformer architectures had not gained the same popularity in computer vision until Vision Transformers (ViT) [18]. Inspired by ViT, ViViT [3] and TimeSformer [6] were the first two works that successfully adopted a pure transformer architecture for video classification, advancing the state of the art previously set by 3D CNNs.

Multiscale processing in computer vision. "Pyramid" structures [1] are one of the most popular multiscale representations for images and have been key in early computer vision works, where their use has been widespread in multiple domains including feature descriptors [45], feature tracking [7, 46], image compression [9], etc. This idea has also been successfully adopted for modern CNNs [27, 58, 63], where the spatial dimension of the network is gradually reduced while the network "depth" is gradually increased to encode more semantically rich features. This technique has also been used to produce higher-resolution output features for downstream tasks [42, 43, 84]. Multiscale processing is necessary for CNNs because a convolution operation only operates on a sub-region of the input, and a hierarchical structure is required to capture the whole view of the image or video. In theory, such a hierarchy is not required for transformers as each token "attends" to all other positions. In practice, due to the limited amount of training data, applying similar multiscale processing in transformers [11, 21, 44, 73] to reduce the complexity of the model has proven to be effective.

Our model does not follow the pyramid structure but directly takes different views of the video and feeds them into cross-view encoders. As our experiments validate, this alternative multiview architecture has consistently outperformed its single-view counterpart in terms of accuracy/FLOP trade-offs. This is because processing more views in parallel gives us larger accuracy improvements than increasing the depth of the transformer network. Significantly, such improvement persists as we scale the model capacity to over a billion parameters (e.g., our "Huge" model), which has not been shown by the previous pyramid-structured transformers [21, 44, 73]. Conceptually, our method is most comparable to SlowFast [23], where a two-stream CNN is used to process two views of the same video clip (densely sampled and sparsely sampled frames). Instead of sampling the input video at different frame rates, we obtain different views by linearly projecting spatio-temporal "tubelets" [3] of varying sizes for each view. Furthermore, we empirically show that our proposed method outperforms [23] when using transformer backbones.

3. Multiview Transformers for Video
We begin with an overview of the vision transformer, ViT [18], and its extension to video, ViViT [3], which our model is based on, in Sec. 3.1. As shown in Fig. 1, our model constructs different "views" of the input video by extracting tokens from spatio-temporal tubelets of varying dimensions (Sec. 3.2). These tokens are then processed by a multiview transformer, which incorporates lateral connections to efficiently fuse together information from multiple scales (Sec. 3.3).

3.1. Preliminaries: ViT and ViViT

We denote our input video as V ∈ R^{T×H×W×C}. Transformer architectures [71] process inputs by converting them into discrete tokens, which are subsequently processed by multiple transformer layers sequentially.

ViT [18] extracts tokens from images by partitioning an image into non-overlapping patches and linearly projecting them. ViViT [4] extends this to video by extracting N non-overlapping, spatio-temporal "tubes" [3] from the input video, x_1, x_2, ..., x_N ∈ R^{t×h×w×c}, where N = ⌊T/t⌋ × ⌊H/h⌋ × ⌊W/w⌋. Each tube, x_i, is then projected into a token, z_i ∈ R^d, by a linear operator E, as z_i = E x_i. All tokens are then concatenated together to form a sequence, which is prepended with a learnable class token z_cls ∈ R^d [17]. As transformers are permutation invariant, a positional embedding p ∈ R^{(N+1)×d} is also added to this sequence. Therefore, this tokenization process can be denoted as

\mathbf{z}^{0} = [\mathbf{z}_{cls}, \mathbf{E}x_1, \mathbf{E}x_2, \ldots, \mathbf{E}x_N] + \mathbf{p}.   (1)

Note that the linear projection E can also be seen as a 3D convolution with a kernel of size t × h × w and a stride of (t, h, w) in the time, height and width dimensions respectively.

The sequence of tokens z is then processed by a transformer encoder consisting of L layers. Each layer, ℓ, is applied sequentially, and consists of the following operations,

\mathbf{y}^{\ell} = \text{MSA}(\text{LN}(\mathbf{z}^{\ell-1})) + \mathbf{z}^{\ell-1},   (2)
\mathbf{z}^{\ell} = \text{MLP}(\text{LN}(\mathbf{y}^{\ell})) + \mathbf{y}^{\ell},   (3)

where MSA denotes multi-head self-attention [71], LN is layer normalization [5] and MLP consists of two linear projections separated by a GeLU [28] non-linearity.

Finally, a linear classifier, W_out ∈ R^{d×C}, maps the encoded classification token, z_cls^L, to one of C classes.

3.2. Multiview tokenization

In our model, we extract multiple sets of tokens, z^{0,(1)}, z^{0,(2)}, ..., z^{0,(V)}, from the input video. Here, V is the number of views, and thus z^{ℓ,(i)} denotes tokens after ℓ layers of transformer processing for the ith view. We define a view as a video representation expressed by a set of fixed-sized tubelets. A larger view corresponds to a set of larger tubelets (and thus fewer tokens) and a smaller view corresponds to smaller tubelets (and thus more tokens). The 0th layer corresponds to the tokens that are input to the subsequent transformer. As shown in Fig. 1, we tokenize each view using a 3D convolution, as it was the best tokenization method reported by [3]. We can use different convolutional kernels, and different hidden sizes, d^(i), for each view. Note that smaller convolutional kernels correspond to smaller spatio-temporal "tubelets", thus resulting in more tokens to be processed for the ith view. Intuitively, fine-grained motions can be captured by smaller tubelets, whilst larger tubelets capture the slowly-varying semantics of the scene. As each view captures different levels of information, we use transformer encoders of varying capacities for each stream, with lateral connections between them to fuse information, as described in the next section.

3.3. Multiview transformer

After extracting tokens from multiple views, we have Z^0 = [z^{0,(1)}, z^{0,(2)}, ..., z^{0,(V)}] from the input, which are processed with a multiview transformer as shown in Fig. 1. As self-attention has quadratic complexity [71], processing tokens from all views jointly is not computationally feasible for video. As a result, we first use a multiview encoder, comprising separate transformer encoders (each consisting of L^(i) transformer layers) for the tokens of each view, with lateral connections between these encoders to fuse information from each view (Fig. 2). Finally, we extract a token representation from each view, and process these jointly with a final global encoder to produce the final classification token, which we linearly read off to obtain the final classification.

3.3.1 Multiview encoder

Our multiview encoder consists of separate transformer encoders for each view, which are connected by lateral connections to fuse cross-view information. Each transformer layer within the encoders follows the same design as the original transformer of Vaswani et al. [71], except for the fact that we optionally fuse information from other streams within the layer, as described in Sec. 3.3.2. Note that our model is agnostic to the exact type of transformer layer used. Furthermore, within each transformer layer, we compute self-attention only among tokens extracted from the same temporal index, following the Factorised Encoder of [3]. This significantly reduces the computational cost of the model. Furthermore, self-attention along all spatio-temporal tokens is unnecessary, as we fuse information from other views within the multiview encoder, and also because of the subsequent global encoder which aggregates tokens from all streams.

3.3.2 Cross-view fusion

We consider the following three cross-view fusion methods. Note that the hidden dimensions of the tokens, d^(i), can vary between views.
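Before turning to the fusion methods, the per-view tokenization of Secs. 3.1-3.2 can be sketched in a few lines. The PyTorch fragment below is a hedged illustration, not the authors' implementation: the module and argument names, and the example pairing of the smallest hidden size with the largest tubelets, are assumptions of ours. It shows how each view applies a strided 3D convolution playing the role of E in Eq. (1), and how smaller tubelets yield more tokens.

```python
import torch
import torch.nn as nn


class TubeletTokenizer(nn.Module):
    """Sketch of multiview tubelet tokenization (Sec. 3.2).

    Each view i uses a strided 3D convolution (the linear projection E of
    Eq. 1 viewed as a convolution) with its own tubelet size and hidden
    size d_i. Hypothetical names and defaults; not the paper's code.
    """

    def __init__(self, in_channels=3,
                 tubelet_sizes=((8, 16, 16), (4, 16, 16), (2, 16, 16)),
                 hidden_sizes=(192, 384, 768)):  # e.g. Ti, S, B widths
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Conv3d(in_channels, d, kernel_size=k, stride=k)
             for k, d in zip(tubelet_sizes, hidden_sizes)])
        self.cls_tokens = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, 1, d)) for d in hidden_sizes])

    def forward(self, video):
        # video: (B, C, T, H, W)
        tokens_per_view = []
        for embed, cls in zip(self.embeds, self.cls_tokens):
            x = embed(video)                     # (B, d, T/t, H/h, W/w)
            x = x.flatten(2).transpose(1, 2)     # (B, N_i, d): N_i tokens
            cls_tok = cls.expand(x.shape[0], -1, -1)
            x = torch.cat([cls_tok, x], dim=1)   # prepend class token
            # A learnable positional embedding p (Eq. 1) would be added here.
            tokens_per_view.append(x)
        return tokens_per_view                   # smaller tubelets -> more tokens


video = torch.randn(2, 3, 32, 224, 224)          # 32 frames of 224x224
views = TubeletTokenizer()(video)
print([v.shape for v in views])                  # increasing token counts
```

Running the example produces three token sets of increasing length: the 8-frame tubelets give the fewest tokens (the "larger view") and the 2-frame tubelets the most, matching the definition of larger versus smaller views above.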
(a) An example of CVA for fusion. (b) An example of bottleneck tokens for fusion. (c) An example of MLP fusion.
Figure 2. An illustration of our proposed cross-view fusion methods. In all three subfigures, view i (left) refers to a video representation using larger tubelets, and thus fewer input tokens, and view i + 1 (right) corresponds to the representation with smaller tubelets and more input tokens. "+" denotes summation. Tokens extracted from tubelets are colored red and bottleneck tokens are colored blue. MSA is short for Multihead Self-Attention and CVA stands for Cross-View Attention.
Cross-view attention (CVA). A straightforward method of combining information between different views is to perform self-attention jointly on all Σ_i N^(i) tokens, where N^(i) is the number of tokens in the ith view. However, due to the quadratic complexity of self-attention, this is computationally prohibitive for video models, and hence we perform a more efficient alternative.

We sequentially fuse information between all pairs of two adjacent views, i and i + 1, where the views are ordered in terms of increasing numbers of tokens (i.e. N^(i) ≤ N^(i+1)). Concretely, to update the tokens from the larger view, z^(i), we compute attention where the queries are z^(i), and the keys and values are z^(i+1) (the tokens from the smaller view). As the hidden dimensions of the tokens between the two views can be different, we first project the keys and values to the same dimension, as denoted by

\mathbf{z}^{(i)} = \text{CVA}(\mathbf{z}^{(i)}, \mathbf{W}^{\text{proj}} \mathbf{z}^{(i+1)}),   (4)
\text{CVA}(\mathbf{x}, \mathbf{y}) = \text{Softmax}\left(\frac{\mathbf{W}^{Q}\mathbf{x}\,(\mathbf{W}^{K}\mathbf{y})^{\top}}{\sqrt{d_k}}\right)\mathbf{W}^{V}\mathbf{y}.   (5)

Note that W^Q, W^K and W^V are the query-, key- and value-projection matrices used in the attention operation [71]. As shown in Fig. 2a, we also include a residual connection around the cross-view attention operation, and zero-initialize the parameters of this operation, as this helps when using image-pretrained models, as is common practice [3, 6]. Similar studies on cross-stream attention have been done by [11] for images.

Bottleneck tokens. An efficient method of transferring information between tokens from two views, z^(i) and z^(i+1), is by an intermediate set of B bottleneck tokens. Once again, we sequentially fuse information between all pairs of two adjacent views, i + 1 and i, where the views are ordered in terms of increasing numbers of tokens.

In more detail, we initialize a sequence of bottleneck tokens, z_B^(i+1) ∈ R^{B^(i+1) × d^(i+1)}, where B^(i+1) is the number of bottleneck tokens in the (i + 1)th view and B^(i+1) ≪ N^(i+1). As shown in Fig. 2b (where B = 1), the bottleneck tokens from view i + 1, z_B^(i+1), are concatenated to the input tokens of the same view, z^(i+1), and processed with self-attention. This effectively transfers information between all tokens from view i + 1. Thereafter, these tokens, z_B^(i+1), are linearly projected to the depth of view i, and concatenated to z^(i) before performing self-attention again. This process is repeated between each pair of adjacent views as shown in Fig. 2b, and allows us to efficiently transfer information from one view to the next.

As with cross-view attention, we sequentially perform fusion between all pairs of adjacent views, beginning from the view with the largest number of tokens, and proceeding in order of decreasing token numbers. Intuitively, this allows the view with the fewest tokens to aggregate fine-grained information from all subsequent views.

Note that the only parameters introduced into the model from this fusion method are the linear projections of bottleneck tokens from one view to the next, and the bottleneck tokens themselves, which are learned from random initialization. We also note that "bottleneck" tokens have also been used by [31, 49].

MLP fusion. Recall that each transformer encoder layer consists of a multi-head self-attention operation (Eq. 2), followed by an MLP block (Eq. 3). A simple method is to fuse before the MLP block within each encoder layer.

Concretely, as shown in Fig. 2c, tokens from view i + 1, z^(i+1), with hidden dimension d^(i+1) are concatenated with tokens from view i along the hidden dimension. These tokens are then fed into the MLP block of layer i and linearly projected to the depth d^(i). This process is repeated between adjacent views of the network, where, once again, views are ordered by increasing number of tokens per view.
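Of the three fusion methods, cross-view attention is the one adopted for all subsequent experiments (Sec. 4.2), so we make Eqs. (4)-(5) concrete with a minimal single-head PyTorch sketch below. It is an illustration only: the paper uses standard multi-head attention, and the exact placement of the zero-initialised parameters, as well as all layer names, are assumptions of ours.

```python
import math
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Sketch of CVA (Eqs. 4-5): tokens of view i attend to view i+1.

    Queries come from z_i (fewer tokens, hidden size d_i); keys and values
    come from z_next (more tokens, hidden size d_next) after a projection
    to d_i. The output projection is zero-initialised so that the fused
    layer starts as an identity, which helps when initialising from
    image-pretrained weights. Hypothetical single-head sketch.
    """

    def __init__(self, d_i, d_next):
        super().__init__()
        self.proj = nn.Linear(d_next, d_i)   # W_proj: match hidden sizes
        self.w_q = nn.Linear(d_i, d_i, bias=False)
        self.w_k = nn.Linear(d_i, d_i, bias=False)
        self.w_v = nn.Linear(d_i, d_i, bias=False)
        self.w_out = nn.Linear(d_i, d_i)
        nn.init.zeros_(self.w_out.weight)    # zero-init: residual dominates at the start
        nn.init.zeros_(self.w_out.bias)

    def forward(self, z_i, z_next):
        # z_i: (B, N_i, d_i), z_next: (B, N_next, d_next) with N_next > N_i
        y = self.proj(z_next)
        q, k, v = self.w_q(z_i), self.w_k(y), self.w_v(y)
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        return z_i + self.w_out(attn @ v)    # residual connection (Fig. 2a)


cva = CrossViewAttention(d_i=192, d_next=768)
z_i, z_next = torch.randn(2, 785, 192), torch.randn(2, 3137, 768)
print(cva(z_i, z_next).shape)                # torch.Size([2, 785, 192])
```

In the example, the 785-token view (larger tubelets, smaller hidden size) aggregates information from the 3137-token view, which is the direction of information flow described above.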
Fusion locations. We note that it is not necessary to perform cross-view fusion at each layer of the cross-view encoder to transfer information among the different views, since each fusion operation has a global "receptive field" that considers all the tokens from the previous views. Furthermore, it is also possible for the encoders of each individual view to have different depths, meaning that fusion can occur between layer l of view i and layer l′ of view j where l ≠ l′. Therefore, we consider the fusion locations as a design choice on which we perform ablation studies.

3.3.3 Global encoder

Finally, we aggregate the tokens from each of the views with the final global encoder, as shown in Fig. 1, effectively fusing information from all views after the cross-view transformer. We extract the classification token from each view, {z_cls^(i)} for i = 1, ..., V, and process them further with another transformer encoder, following Vaswani et al. [71], that aggregates information from all views. The resulting classification token is then mapped to one of C classification outputs, where C is the number of classes.

4. Experiments

4.1. Experimental setup

Model variants. For the backbone of each view, we consider five ViT variants: "Tiny", "Small", "Base", "Large", and "Huge". Their settings strictly follow the ones defined in BERT [17] and ViT [18, 59], i.e. the number of transformer layers, number of attention heads, and hidden dimensions. See the supplementary for the detailed settings. For convenience, each model variant is denoted with the following abbreviations indicating the backbone size and tubelet length. For example, B/2+S/4+Ti/8 denotes a three-view model, where "Base", "Small", and "Tiny" encoders are used to process tokens from the views with tubelets of sizes 16 × 16 × 2, 16 × 16 × 4, and 16 × 16 × 8, respectively. Note that we omit 16 in our model abbreviations because all our models use 16 × 16 as the spatial tubelet size, except for the "Huge" model, which uses 14 × 14, following ViT [18]. All model variants use the same global encoder, which follows the "Base" architecture, except that the number of heads is set to 8 instead of 12. The reason is that the hidden dimension of the tokens should be divisible by the number of heads for multi-head attention, and the hidden dimensions of all standard transformer architectures (from "Tiny" to "Huge" [18, 59]) are divisible by 8.

Training and inference. We follow the training settings of ViViT reported in the paper and public code [3], unless otherwise stated. Namely, all models are trained on 32 frames with a temporal stride of 2. We train our model using synchronous SGD with a momentum of 0.9, following a cosine learning rate schedule with a linear warm up. The input frame resolution is set to 224 × 224 in both training and inference. We follow [3] and apply the same data augmentation and regularization schemes [13, 29, 64, 82], which were used by [67] to train vision transformers more effectively. During inference, we adopt the standard evaluation protocol of averaging over multiple spatial and temporal crops. The number of crops is given in the results tables. For reproducibility, we include exhaustive details in the supplementary.

Initialization. Following previous works [3, 6, 51], we initialize our model from a corresponding ViT model pretrained on large-scale image datasets [16, 61], obtained from the public code of [18]. The initial tubelet embedding operator, E, and positional embeddings, p, have different shapes in the pretrained model, and we use the same technique as [3] to adapt them to initialize each view of our multiview encoder (Sec. 3.3.1). The final global encoder (Sec. 3.3.3) is randomly initialized.

Datasets. We report the performance of our proposed models on a diverse set of video classification datasets:

Kinetics [33] is a collection of large-scale, high-quality datasets of 10s video clips focusing on human actions. We report results on Kinetics 400, 600, and 700, with 400, 600, and 700 classes, respectively.

Moments in Time [48] is a collection of 800,000 labeled 3-second videos, involving people, animals, objects or natural phenomena, that capture the gist of a dynamic scene.

Epic-Kitchens-100 [15] consists of 90,000 egocentric videos, totaling 100 hours, recorded in kitchens. Each video is labeled with a "noun" and a "verb", and therefore we predict both categories using a single network with two "heads". Three accuracy scores ("noun", "verb", and "action") are commonly reported for this dataset, with action accuracy being the primary metric. The "action" label is formed by selecting the top-scoring noun and verb pair.

Something-Something V2 [26] consists of more than 220,000 short video clips that show humans interacting with everyday objects. Similar objects and backgrounds appear in videos across different classes. Therefore, in contrast to other datasets, this one challenges a model's capability to distinguish classes from motion cues.

4.2. Ablation study

We conduct ablation studies on the Kinetics 400 dataset. In all cases, the largest backbone in the multiview encoder is "Base", for faster experimentation. We report accuracies when averaging predictions across multiple spatio-temporal crops, as is standard practice [3, 6, 10, 23]. In particular, we use 4 × 3 crops, that is, 4 temporal crops with 3 spatial crops for each temporal crop. We used a learning rate of 0.1 for all experiments for 30 epochs, and used no additional regularization, as done by [3].
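The 4 × 3 multi-crop evaluation described above can be sketched as follows. This is a hedged illustration under assumptions of our own: the exact crop construction is not specified in the text, and the `model` callable (mapping a (1, C, T, H, W) clip to class logits) is hypothetical; the paper only states that predictions are averaged over 4 temporal and 3 spatial crops.

```python
import torch


@torch.no_grad()
def multicrop_predict(model, video, num_frames=32, stride=2, crop=224,
                      num_temporal=4, num_spatial=3):
    """Average softmax predictions over temporal x spatial crops (Sec. 4.2).

    video: (C, T, H, W) tensor. The crop placement below is a plausible
    sketch, not the exact cropping used in the paper.
    """
    c, t, h, w = video.shape
    span = num_frames * stride
    probs = []
    for ti in range(num_temporal):
        # Evenly spaced temporal windows, each sampling num_frames with the given stride.
        start = ti * max(t - span, 0) // max(num_temporal - 1, 1)
        frames = video[:, start:start + span:stride][:, :num_frames]
        for si in range(num_spatial):
            # Evenly spaced spatial crops across the frame.
            off_h = si * max(h - crop, 0) // max(num_spatial - 1, 1)
            off_w = si * max(w - crop, 0) // max(num_spatial - 1, 1)
            clip = frames[:, :, off_h:off_h + crop, off_w:off_w + crop]
            logits = model(clip.unsqueeze(0))           # (1, num_classes)
            probs.append(logits.softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)               # averaged prediction
```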
(a) Effects of different model-view assignments.
Model variants | GFLOPs | MParams | Top-1
B/8+Ti/2 | 81 | 161 | 77.3
B/2+Ti/8 | 337 | 221 | 81.3
B/8+S/4+Ti/2 | 202 | 250 | 78.5
B/2+S/4+Ti/8 | 384 | 310 | 81.8
B/4+S/8+Ti/16 | 195 | 314 | 81.1

(b) Effects of the same model applied to different views.
Model variants | GFLOPs | MParams | Top-1
B/4+S/8+Ti/16 | 195 | 314 | 81.1
B/4+B/8+B/16 | 324 | 759 | 81.1
B/2+Ti/8 | 337 | 221 | 81.3
B/2+B/8 | 448 | 465 | 81.5
B/2+S/4+Ti/8 | 384 | 310 | 81.8
B/2+B/4+B/8 | 637 | 751 | 81.7

(c) Comparison of different cross-view fusion methods.
Model variants | Method | GFLOPs | MParams | Top-1
B/4 | N/A | 145 | 173 | 78.3
S/8 | N/A | 20 | 60 | 74.1
Ti/16 | N/A | 3 | 13 | 67.6
B/4+S/8+Ti/16 | Ensemble | 168 | 246 | 77.7
B/4+S/8+Ti/16 | Late fusion | 187 | 306 | 80.6
B/4+S/8+Ti/16 | MLP | 202 | 323 | 80.6
B/4+S/8+Ti/16 | Bottleneck | 188 | 306 | 81.0
B/4+S/8+Ti/16 | CVA | 195 | 314 | 81.1

(d) Comparison to SlowFast multi-resolution method.
Model variants | GFLOPs | MParams | Top-1
SlowFast (transformer backbone):
Slow-only (B) | 79 | 87 | 78.0
Fast-only (Ti) | 63 | 6 | 74.6
SlowFast (B+Ti) | 202 | 105 | 79.7
B/4+Ti/16 (ours) | 168 | 224 | 80.8

(e) Effects of increasing number of views.
Model variants | GFLOPs | Top-1
B/4 | 145 | 78.3
B/4+Ti/16 | 168 | 80.8 (+2.5)
B/4+S/8+Ti/16 | 195 | 81.1 (+2.8)
B/4 (14) | 168 | 78.1 (-0.2)
B/4 (17) | 203 | 78.4 (+0.1)

(f) Effects of applying CVA at different layers.
Fusion layers | GFLOPs | MParams | Top-1
0 | 195 | 314 | 80.96
5 | 195 | 314 | 81.08
11 | 195 | 314 | 81.00
0, 1 | 203 | 323 | 80.91
5, 6 | 203 | 323 | 80.96
10, 11 | 203 | 323 | 80.81
5, 11 | 203 | 323 | 81.14
0, 5, 11 | 210 | 331 | 80.95
Table 1. Ablation studies of our method. (a) Assigning larger models to smaller tubelet sizes achieves the highest accuracy. (b) We apply the same "Base" encoder to all views, and show that there is minimal accuracy difference to the alternatives from (a), but a large increase in computation. (c) A comparison of different cross-view fusion methods shows that Cross-View Attention (CVA) is the best. The "Ensemble" and "Late fusion" baselines are detailed in the text. (d) We compare our approach to the alternate temporal multi-resolution method of [23], implemented in the context of transformers, and show significant improvements. (e) We achieve substantial accuracy improvements by adding more views, and this improvement is larger than that obtained by adding more layers to a single encoder. (f) The optimal fusion layers are at the middle and late stages of the network.
Model-view assignments. Recall that a view is a video representation in terms of tubelets, and that a larger view equates to larger tubelets (and hence fewer transformer tokens), while smaller views correspond to smaller tubelets (and thus more tokens).

We considered two model-view assignment strategies: larger models for larger views (e.g., B/8+Ti/2, where the larger "Base" model is used to encode 16 × 16 × 8 tubelets and the smaller "Tiny" model encodes 16 × 16 × 2 tubelets) and smaller models for larger views (e.g., B/2+Ti/8). Table 1a shows that assigning a larger model to smaller views is superior. For example, B/2+S/4+Ti/8 scores 81.8% while B/8+S/4+Ti/2 only scores 78.5%. One may argue that this is due to the increase in FLOPs, but B/4+S/8+Ti/16 still outperforms B/8+S/4+Ti/2 by a large margin under similar FLOPs. Our explanation is that larger views capture the gist of the scene, which requires less model capacity to learn, while the details of the scene are encapsulated by smaller views, so a larger-capacity model is needed.

Another strategy is to assign the same model to all views. Table 1b shows that in all three examples there is little difference between assigning a "Base" model and assigning a "Small" or "Tiny" model to larger views. This result is surprising yet beneficial, since we can reduce the complexity of the model at almost no cost in accuracy.

What is the best cross-view fusion method? Table 1c shows the comparison of different fusion methods on a three-view model. We use one late-fusion and one ensemble approach as the baselines. "Ensemble" simply sums the probabilities produced from each view, where the models for each view are trained separately. We also tried summing up the logits and majority voting, but both obtained worse results. This method actually decreases the performance compared to the B/4 model, since the "Small" and "Tiny" models do not perform comparably well. "Late fusion" concatenates the final embeddings produced by the transformer encoder of each view, without any cross-view operations, before feeding them into the global encoder. It improves the B/4 model from 78.3% to 80.6%. All of our fusion methods except MLP outperform the baselines, while CVA is the best overall. Based on this observation, we choose CVA as the fusion method for all subsequent experiments. MLP fusion is the worst-performing method of the three, and we think this is because concatenation in the MLP blocks introduces additional channels that have to be randomly initialized, making model optimization more difficult.

Effect of the number of views. Table 1e shows performance on Kinetics-400 as we increase the number of views. With two views we achieve a +2.5% gain in Top-1 accuracy over the baseline B/4 model. As we increase to three views, the improvement widens to 2.8%. Furthermore, we show that such improvement is non-trivial. For example, we also train 14-layer and 17-layer variants of the "Base" model. They have similar FLOPs to our two-view and three-view counterparts, but their performance remains similar to that of the baseline.

Which layers to apply cross-view fusion? Motivated by Tab. 1c, we fix the fusion method to CVA, and vary the locations and number of layers where we apply CVA, when using a three-view B+S+Ti model (each encoder thus has 12 layers), in Tab. 1f. The choices are in the early, mid and late stages of the transformer encoders, and the number of fusion layers is set to one and two. When using one …
Figure 3. Top-1 accuracy versus computation for MTV variants (H/2, L/2, L/4, H/4) and ViViT-FE baselines.

… checkpoint and then are fine-tuned on the target datasets following [3, 21, 51].

Accuracy/computation trade-offs. Figure 3 compares our …
Table 2. Comparisons to the state-of-the-art. For "views", x × y denotes x temporal views and y spatial views. We report the total TFLOPs to process all spatio-temporal views. We use the shorter notation MTV-B, -L, -H to denote the variants B/2+S/4+Ti/8, L/2+B/4+S/8+Ti/16, and H/2+B/4+S/8+Ti/16, respectively. Models use a spatial resolution of 224 × 224, unless explicitly stated by MTV (xp), which refers to a spatial resolution of x × x. Models are pretrained on ImageNet-21K unless explicitly stated in parentheses.

(a) Kinetics 400
Method | Top 1 | Top 5 | Views | TFLOPs
TEA [40] | 76.1 | 92.5 | 10 × 3 | 2.10
TSM-ResNeXt-101 [41] | 76.3 | – | – | –
I3D NL [74] | 77.7 | 93.3 | 10 × 3 | 10.77
VidTR-L [83] | 79.1 | 93.9 | 10 × 3 | 10.53
LGD-3D R101 [52] | 79.4 | 94.4 | – | –
SlowFast R101-NL [23] | 79.8 | 93.9 | 10 × 3 | 7.02
X3D-XXL [22] | 80.4 | 94.6 | 10 × 3 | 5.82
OmniSource [20] | 80.5 | 94.4 | – | –
TimeSformer-L [6] | 80.7 | 94.7 | 1 × 3 | 7.14
MFormer-HR [51] | 81.1 | 95.2 | 10 × 3 | 28.76
MViT-B [21] | 81.2 | 95.1 | 3 × 3 | 4.10
MoViNet-A6 [35] | 81.5 | 95.3 | 1 × 1 | 0.39
ViViT-L FE [3] | 81.7 | 93.8 | 1 × 3 | 11.94
MTV-B | 81.8 | 95.0 | 4 × 3 | 4.79
MTV-B (320p) | 82.4 | 95.2 | 4 × 3 | 11.16
Methods with web-scale pretraining:
VATT-L [2] (HowTo100M) | 82.1 | 95.5 | 4 × 3 | 29.80
ip-CSN-152 [69] (IG) | 82.5 | 95.3 | 10 × 3 | 3.27
R3D-RS (WTS) [19] | 83.5 | – | 10 × 3 | 9.21
OmniSource [20] (IG) | 83.6 | 96.0 | – | –
ViViT-H [3] (JFT) | 84.9 | 95.8 | 4 × 3 | 47.77
TokenLearner-L/10 [55] (JFT) | 85.4 | 96.3 | 4 × 3 | 48.91
Florence [79] (FLD-900M) | 86.5 | 97.3 | 4 × 3 | –
CoVeR (JFT-3B) [81] | 87.2 | – | 1 × 3 | –
MTV-L (JFT) | 84.3 | 96.3 | 4 × 3 | 18.05
MTV-H (JFT) | 85.8 | 96.6 | 4 × 3 | 44.47
MTV-H (WTS) | 89.1 | 98.2 | 4 × 3 | 44.47

(b) Kinetics 600
Method | Top 1 | Top 5
SlowFast R101-NL [23] | 81.8 | 95.1
X3D-XL [22] | 81.9 | 95.5
TimeSformer-L [6] | 82.2 | 95.6
MFormer-HR [51] | 82.7 | 96.1
ViViT-L FE [3] | 82.9 | 94.6
MViT-B [21] | 83.8 | 96.3
MoViNet-A6 [35] | 84.8 | 96.5
MTV-B | 83.6 | 96.1
MTV-B (320p) | 84.0 | 96.2
R3D-RS (WTS) [19] | 84.3 | –
ViViT-H [3] (JFT) | 85.8 | 96.5
TokenLearner-L/10 [55] (JFT) | 86.3 | 97.0
Florence [79] (FLD-900M) | 87.8 | 97.8
CoVeR (JFT-3B) [81] | 87.9 | –
MTV-L (JFT) | 85.4 | 96.7
MTV-H (JFT) | 86.5 | 97.3
MTV-H (WTS) | 89.6 | 98.3

(c) Something-Something v2
Method | Top 1 | Top 5
SlowFast R50 [23, 77] | 61.7 | –
TimeSformer-HR [6] | 62.5 | –
VidTR [83] | 63.0 | –
ViViT-L FE [3] | 65.9 | 89.9
MViT [21] | 67.7 | 90.9
MFormer-L [51] | 68.1 | 91.2
MTV-B | 67.6 | 90.1
MTV-B (320p) | 68.5 | 90.4

(d) Kinetics 700
Method | Top 1 | Top 5
VidTR-L [83] | 70.2 | –
SlowFast R101 [23] | 71.0 | 89.6
MoViNet-A6 [35] | 72.3 | –
MTV-L | 74.0 | 91.3
CoVeR (JFT-3B) [81] | 79.8 | –
MTV-H (JFT) | 78.0 | 93.3
MTV-H (WTS) | 82.2 | 95.7

(e) Epic-Kitchens-100 (Top 1 accuracy)
Method | Action | Verb | Noun
SlowFast [23] | 38.5 | 65.6 | 50.0
ViViT-L FE [3] | 44.0 | 66.4 | 56.8
MFormer-HR [51] | 44.5 | 67.0 | 58.5
MoViNet-A6 [35] | 47.7 | 72.2 | 57.3
MTV-B | 46.7 | 67.8 | 60.5
MTV-B (320p) | 48.6 | 68.0 | 63.1

(f) Moments in Time
Method | Top 1 | Top 5
AssembleNet-101 [56] | 34.3 | 62.7
ViViT-L FE [3] | 38.5 | 64.1
MoViNet-A6 [35] | 40.2 | –
MTV-L | 41.7 | 69.7
VATT-L (HT100M) [2] | 41.1 | 67.7
MTV-H (JFT) | 44.0 | 70.2
MTV-H (WTS) | 45.6 | 74.7
MTV-H (WTS 280p) | 47.2 | 75.7
… 400 and 700, for which the architecture of [35] was not directly optimized.

When using additional JFT-300M pretraining, our "Huge" model outperforms other recent transformer models using the same pretraining dataset [3, 55]. And when we utilize the Weak Textual Supervision (WTS) dataset of [60] for pre-training, we substantially advance the best reported results on Kinetics: on Kinetics 400, we achieve a Top-1 accuracy of 89.1%, which improves upon the previous highest result (CoVeR [81]) by 1.9%. Similarly, on Kinetics 600, we achieve a Top-1 of 89.6%, an absolute improvement of 1.7% over [81]. On Kinetics 700, we achieve 82.2%, which improves even further, by 2.4%, over [81]. We also improve upon R3D-RS [19], which also used WTS pre-training, by 5.6% and 5.3% on Kinetics-400 and -600.

Epic-Kitchens-100. Following the standard protocol [15], we report Top-1 action, verb and noun accuracies, with action accuracy being the primary metric. Our results are averaged over 4 × 1 crops, as additional spatial crops did not help. Both our MTV-B and MTV-B (320p) significantly improve the previous state-of-the-art on noun classes, and MTV-B (320p) achieves a new state-of-the-art of 48.6% on actions. We found that additional data augmentation (detailed in the supplementary) has to be used to achieve good performance (as also observed by [3, 51]), as this is the smallest dataset of all six, with 67,000 training examples.

Something-Something V2. This dataset consists of class labels such as "move to left" and "pointing to right" [26]. As the model needs to explicitly reason about direction, we do not perform random horizontal or vertical flipping as data augmentation on this dataset, as also done by [21]. We improve substantially, by 2.6%, over ViViT-L FE [3], which corresponds to a deeper single-view equivalent of our model, and also improve upon MFormer [51] by 0.4%.

Moments in Time. Our MTV-L model significantly improves over the previous state-of-the-art [35] by 1.5% in Top-1 accuracy. Moreover, our model with ImageNet-21K pretraining even outperforms VATT [2], which was pretrained on HowTo100M [47], a dataset consisting of around 100M video clips. When using WTS pre-training, we improve our accuracy even further, achieving 47.2%.

5. Conclusion

We have presented a simple method for capturing multi-resolution temporal context in transformer architectures, based on processing multiple "views" of the input video in parallel. We have demonstrated that our approach performs better, in terms of accuracy/computation trade-offs, than increasing the depth of current single-view architectures. Furthermore, we have achieved state-of-the-art results on six popular video classification datasets. These results were then further improved with large-scale pretraining [60, 61].
References

[1] Edward H Adelson, Charles H Anderson, James R Bergen, Peter J Burt, and Joan M Ogden. Pyramid methods in image processing. RCA engineer, 29(6):33–41, 1984.
[2] Hassan Akbari, Linagzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In NeurIPS, 2021.
[3] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In ICCV, 2021.
[4] Anurag Arnab, Chen Sun, and Cordelia Schmid. Unified graph structured models for video understanding. In ICCV, 2021.
[5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In arXiv preprint arXiv:1607.06450, 2016.
[6] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.
[7] Jean-Yves Bouguet et al. Pyramidal implementation of the affine Lucas Kanade feature tracker: description of the algorithm. Intel Corporation, 5(1-10):4, 2001.
[8] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
[9] Peter J Burt and Edward H Adelson. The Laplacian pyramid as a compact image code. In Readings in Computer Vision, pages 671–679. Elsevier, 1987.
[10] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[11] Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossViT: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021.
[12] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[13] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In NeurIPS, 2020.
[14] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[15] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for Epic-Kitchens-100. In IJCV, 2021.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[19] Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, and Irwan Bello. Revisiting 3D ResNets for video recognition. In arXiv preprint arXiv:2109.01696, 2021.
[20] Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, and Dahua Lin. Omni-sourced webly-supervised learning for video recognition. In ECCV, 2020.
[21] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In ICCV, 2021.
[22] Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In CVPR, 2020.
[23] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, 2019.
[24] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NeurIPS, 2016.
[25] Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
[26] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[28] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). In arXiv preprint arXiv:1606.08415, 2016.
[29] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[30] Jinggang Huang and David Mumford. Statistics of natural images and models. In CVPR, 1999.
[31] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. In arXiv preprint arXiv:2107.14795, 2021.
[32] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[33] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. In arXiv preprint arXiv:1705.06950, 2017.
[34] Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
[35] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. MoViNets: Mobile video networks for efficient video recognition. In CVPR, 2021.
[36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[37] Ivan Laptev. On space-time interest points. In IJCV, 2005.
[38] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[39] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[40] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. TEA: Temporal excitation and aggregation for action recognition. In CVPR, 2020.
[41] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
[42] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[43] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[44] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[45] David G Lowe. Distinctive image features from scale-invariant keypoints. In IJCV, 2004.
[46] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. In IJCAI, 1981.
[47] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
[48] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in Time dataset: one million videos for event understanding. In PAMI, 2019.
[49] Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. Attention bottlenecks for multimodal fusion. In NeurIPS, 2021.
[50] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[51] Mandela Patrick, Dylan Campbell, Yuki M Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João Henriques, et al. Keeping your eye on the ball: Trajectory attention in video transformers. In NeurIPS, 2021.
[52] Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.
[53] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. In JMLR, 2020.
[54] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
[55] Michael S Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. TokenLearner: What can 8 learned tokens do for images and videos? In NeurIPS, 2021.
[56] Michael S Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. AssembleNet: Searching for multi-stream neural connectivity in video architectures. In ICLR, 2019.
[57] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[59] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. In arXiv preprint arXiv:2106.10270, 2021.
[60] Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, and David A. Ross. Learning video representations from textual web supervision. In arXiv preprint arXiv:2007.14937, 2020.
[61] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[62] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
[63] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[64] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[65] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[66] Antonio Torralba and Aude Oliva. Statistics of natural image categories. Network: Computation in Neural Systems, 2003.
[67] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[68] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[69] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
[70] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[72] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. In IJCV, 2013.
[73] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[74] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[75] Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S Ryoo, Anelia Angelova, Kris M Kitani, and Wei Hua. AttentionNAS: Spatiotemporal attention cell search for video classification. In ECCV, 2020.
[76] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
[77] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. A multigrid method for efficiently training video models. In CVPR, 2020.
[78] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
[79] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. In arXiv preprint arXiv:2111.11432, 2021.
[80] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In arXiv preprint arXiv:2106.04560, 2021.
[81] Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M Dai, Ruoming Pang, and Fei Sha. Co-training transformer with videos and images improves action recognition. In arXiv preprint arXiv:2112.07175, 2021.
[82] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[83] Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. VidTr: Video transformer without convolutions. In ICCV, 2021.
[84] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[85] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.