
TSM: Temporal Shift Module for Efficient Video Understanding

Ji Lin (MIT)    Chuang Gan (MIT-IBM Watson AI Lab)    Song Han (MIT)
[email protected]    [email protected]    [email protected]

Abstract

The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs but maintain 2D CNN complexity. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time, low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves low latencies of 13 ms and 35 ms for online video recognition. The code is available at: https://github.com/mit-han-lab/temporal-shift-module.

Figure 1. Temporal Shift Module (TSM) performs efficient temporal modeling by moving the feature map along the temporal dimension. It is computationally free on top of a 2D convolution, but achieves strong temporal modeling ability. TSM efficiently supports both offline and online video recognition. Bi-directional TSM mingles both past and future frames with the current frame, which is suitable for high-throughput offline video recognition. Uni-directional TSM mingles only the past frame with the current frame, which is suitable for low-latency online video recognition. (a) The original tensor without shift. (b) Offline temporal shift (bi-direction). (c) Online temporal shift (uni-direction).

1. Introduction

Hardware-efficient video understanding is an important step towards real-world deployment, both on the cloud and on the edge. For example, there are over 10^5 hours of video uploaded to YouTube every day to be processed for recommendation and ads ranking; terabytes of sensitive videos in hospitals need to be processed locally on edge devices to protect privacy. All these industry applications require both accurate and efficient video understanding.

Deep learning has become the standard for video understanding over the years [45, 48, 4, 49, 61, 53, 58]. One key difference between video recognition and image recognition is the need for temporal modeling. For example, to distinguish between opening and closing a box, reversing the order gives the opposite result, so temporal modeling is critical.

Existing efficient video understanding approaches directly use 2D CNNs [24, 39, 48, 58]. However, a 2D CNN on individual frames cannot model the temporal information well. 3D CNNs [45, 4] can jointly learn spatial and temporal features, but the computation cost is large, making deployment on edge devices difficult; they cannot be applied to real-time online video recognition. There are works that trade off between temporal modeling and computation, such as post-hoc fusion [13, 9, 58, 7] and mid-level temporal fusion [61, 53, 46]. Such methods sacrifice low-level temporal modeling for efficiency, but much of the useful information is lost during the feature extraction that happens before the temporal fusion.

In this paper, we propose a new perspective for efficient temporal modeling in video understanding by introducing a novel Temporal Shift Module (TSM). Concretely, an activation in a video model can be represented as A ∈ R^{N×C×T×H×W}, where N is the batch size, C is the number of channels, T is the temporal dimension, and H and W are the spatial resolutions. Traditional 2D CNNs operate independently over the dimension T; thus no temporal modeling takes effect (Figure 1a). In contrast, our Temporal Shift Module (TSM) shifts the channels along the temporal dimension, both forward and backward. As shown in Figure 1b, the information from neighboring frames is mingled with the current frame after shifting.
Our intuition is the following: the convolution operation consists of shift and multiply-accumulate. We shift in the time dimension by ±1 and fold the multiply-accumulate from the time dimension to the channel dimension. For real-time online video understanding, future frames cannot be shifted to the present, so we use a uni-directional TSM (Figure 1c) to perform online video understanding.

Despite the zero-computation nature of the shift operation, we empirically find that simply adopting the spatial shift strategy [51] used in image classification introduces two major issues for video understanding: (1) It is not efficient: the shift operation is conceptually zero FLOP but incurs data movement. The additional cost of data movement is non-negligible and results in increased latency. This phenomenon is exacerbated in video networks since they usually have large memory consumption (5D activations). (2) It is not accurate: shifting too many channels in a network significantly hurts the spatial modeling ability and results in performance degradation. To tackle these problems, we make two technical contributions. (1) We use a temporal partial shift strategy: instead of shifting all the channels, we shift only a small portion of the channels for efficient temporal fusion. This strategy significantly cuts down the data movement cost (Figure 2a). (2) We insert TSM inside the residual branch rather than outside, so that the activation of the current frame is preserved, which does not harm the spatial feature learning capability of the 2D CNN backbone.

The contributions of our paper are summarized as follows:

• We provide a new perspective for efficient video model design by temporal shift, which is computationally free but has strong spatio-temporal modeling ability.

• We observe that naive shift achieves neither high efficiency nor high performance. We then propose two technical modifications, partial shift and residual shift, to realize a high-efficiency model design.

• We propose bi-directional TSM for offline video understanding that achieves state-of-the-art performance. It ranked first on the Something-Something leaderboard upon publication.

• We propose uni-directional TSM for online real-time video recognition with strong temporal modeling capacity at low latency on edge devices.

2. Related Work

2.1. Deep Video Recognition

2D CNN. Using a 2D CNN is a straightforward way to conduct video recognition [24, 39, 48, 11, 8, 9, 2]. For example, Simonyan et al. [39] designed a two-stream CNN for RGB input (spatial stream) and optical flow [55] input (temporal stream) respectively. Temporal Segment Networks (TSN) [48] extracted averaged features from strided sampled frames. Such methods are more efficient than their 3D counterparts but cannot infer the temporal order or more complicated temporal relationships.

3D CNN. 3D convolutional neural networks can jointly learn spatio-temporal features. Tran et al. [45] proposed a 3D CNN based on VGG models, named C3D, to learn spatio-temporal features from a frame sequence. Carreira and Zisserman [4] proposed to inflate all the 2D convolution filters in an Inception V1 model [43] into 3D convolutions. However, 3D CNNs are computationally heavy, which makes deployment difficult. They also have more parameters than their 2D counterparts and are thus more prone to over-fitting. In contrast, our TSM has the same spatio-temporal modeling ability as a 3D CNN while enjoying the same computation and parameters as 2D CNNs.

Trade-offs. There have been attempts to trade off expressiveness and computation cost. Lee et al. [27] proposed a motion filter to generate spatio-temporal features from a 2D CNN. Tran et al. [46] and Xie et al. [53] studied mixed 2D and 3D networks, either first using 3D and later 2D (bottom-heavy) or first 2D and later 3D (top-heavy) architectures. ECO [61] also uses a similar top-heavy architecture to achieve a very efficient framework. Another way to save computation is to decompose the 3D convolution into a 2D spatial convolution and a 1D temporal convolution [46, 33, 42]. Mixed 2D-3D CNNs still need to give up either low-level or high-level temporal modeling. Compared to decomposed convolutions, our method completely removes the computation cost of temporal modeling and enjoys better hardware efficiency.

2.2. Temporal Modeling

A direct way to model temporal relationships is to use 3D CNN based methods, as discussed above. Wang et al. [49] proposed a spatio-temporal non-local module to capture long-range dependencies. Wang et al. [50] proposed to represent videos as space-time region graphs. An alternative way to model temporal relationships is to use a 2D CNN with post-hoc fusion [13, 9, 58, 7]. Some works use LSTM [19] to aggregate the 2D CNN features [54, 7, 41, 10, 12]. Attention mechanisms have also proven effective for temporal modeling [37, 28, 32]. Zhou et al. [58] proposed the Temporal Relation Network to learn and reason about temporal dependencies. The former category is computationally heavy, while the latter cannot capture the useful low-level information that is lost during feature extraction. Our method offers an efficient solution at the cost of 2D CNNs, while enabling both low-level and high-level temporal modeling, just like 3D-CNN based methods.
2.3. Efficient Neural Networks

The efficiency of 2D CNNs has been extensively studied. Some works focus on designing efficient models [21, 20, 36, 56]. Recently, neural architecture search [62, 63, 31] has been introduced to find efficient architectures automatically [44, 3]. Another way is to prune, quantize and compress an existing model for efficient deployment [16, 15, 29, 59, 18, 47]. Address shift, which is a hardware-friendly primitive, has also been exploited for compact 2D CNN design on image recognition tasks [51, 57]. Nevertheless, we observe that directly adopting the shift operation for video recognition maintains neither efficiency nor accuracy, due to the complexity of video data.

3. Temporal Shift Module (TSM)

We first explain the intuition behind TSM: data movement and computation can be separated in a convolution. However, we observe that a naive shift operation achieves neither high efficiency nor high performance. To tackle this, we propose two techniques that minimize the data movement and increase the model capacity, which lead to the efficient TSM module.

3.1. Intuition

Let us first consider a normal convolution operation. For brevity, we use a 1-D convolution with a kernel size of 3 as an example. Suppose the weight of the convolution is W = (w1, w2, w3), and the input X is a 1-D vector with infinite length. The convolution operator Y = Conv(W, X) can be written as Y_i = w1 X_{i-1} + w2 X_i + w3 X_{i+1}. We can decouple the convolution into two steps, shift and multiply-accumulate: we shift the input X by -1, 0, +1 and multiply by w1, w2, w3 respectively, which sum up to Y. Formally, the shift operation is:

    X^{-1}_i = X_{i-1},    X^{0}_i = X_i,    X^{+1}_i = X_{i+1}        (1)

and the multiply-accumulate operation is:

    Y = w1 X^{-1} + w2 X^{0} + w3 X^{+1}        (2)

The first step, shift, can be conducted without any multiplication. While the second step is more computationally expensive, our Temporal Shift module merges the multiply-accumulate into the following 2D convolution, so it introduces no extra cost compared to 2D CNN based models.
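For illustration, the decomposition in Eq. (1) and Eq. (2) can be checked numerically with a few lines of PyTorch. The sketch below is our own (not the released implementation); it compares an explicit shift + multiply-accumulate against a zero-padded 1-D convolution.

```python
import torch
import torch.nn.functional as F

def conv1d_as_shift_mac(x, w):
    """Compute Y_i = w1*X_{i-1} + w2*X_i + w3*X_{i+1} as shift + multiply-accumulate."""
    w1, w2, w3 = w
    x_prev = torch.roll(x, shifts=1, dims=0)   # shifted copy holding X_{i-1} at position i
    x_next = torch.roll(x, shifts=-1, dims=0)  # shifted copy holding X_{i+1} at position i
    x_prev[0] = 0.0                            # zero padding at the left boundary
    x_next[-1] = 0.0                           # zero padding at the right boundary
    return w1 * x_prev + w2 * x + w3 * x_next  # multiply-accumulate folds the kernel back in

x = torch.randn(10)
w = (0.2, 0.5, 0.3)
# Reference: F.conv1d computes cross-correlation, so with padding=1 the kernel (w1, w2, w3)
# directly yields w1*x[i-1] + w2*x[i] + w3*x[i+1].
ref = F.conv1d(x.view(1, 1, -1), torch.tensor([w]).view(1, 1, 3), padding=1).view(-1)
assert torch.allclose(conv1d_as_shift_mac(x, w), ref, atol=1e-6)
```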
The proposed Temporal Shift module is described in Figure 1. In Figure 1a, we depict a tensor with C channels and T frames. The features at different time stamps are denoted by different colors in each row. Along the temporal dimension, we shift part of the channels by -1, another part by +1, and leave the rest un-shifted (Figure 1b). For the online video recognition setting, we also provide an online version of TSM (Figure 1c). In the online setting, we cannot access future frames; therefore, we only shift from past frames to future frames in a uni-directional fashion.

Figure 2. (a) Latency overhead of TSM due to data movement, plotted against the shift proportion for a server GPU (P100), a mobile GPU (TX2), and a CPU: naive (full) shift incurs a large overhead. (b) Accuracy of residual vs. in-place TSM against the shift proportion, compared to the 2D baseline: naive shift also gives low accuracy, and residual TSM achieves better performance than in-place shift. We choose 1/4 proportion residual shift as our default setting; it achieves higher accuracy with negligible overhead.

3.2. Naive Shift Does Not Work

Despite the simple philosophy behind the proposed module, we find that directly applying the spatial shift strategy [51] to the temporal dimension provides neither high performance nor efficiency. Specifically, if we shift all or most of the channels, it brings two problems: (1) Worse efficiency due to large data movement. The shift operation requires no computation, but it involves data movement. Data movement increases the memory footprint and inference latency on hardware. Worse still, this effect is exacerbated in video understanding networks due to the large activation size (5D tensors). When using the naive shift strategy that shifts every feature map, we observe a 13.7% increase in CPU latency and a 12.4% increase in GPU latency, making the overall inference slow. (2) Performance degradation due to worse spatial modeling ability. By shifting part of the channels to neighboring frames, the information contained in those channels is no longer accessible to the current frame, which may harm the spatial modeling ability of the 2D CNN backbone. We observe a 2.6% accuracy drop when using the naive shift implementation compared to the 2D CNN baseline (TSN).

3.3. Module Design

To tackle the two problems of the naive shift implementation, we make two technical contributions.

Reducing Data Movement. To study the effect of data movement, we first measured the inference latency of TSM models and the 2D baseline on different hardware devices. We shifted different proportions of the channels and measured the latency. We measured models with a ResNet-50 backbone and 8-frame input using no shift (the 2D baseline), partial shift (1/8, 1/4, 1/2), and all shift (shifting all the channels). The timing was measured on a server GPU (NVIDIA Tesla P100), a mobile GPU (NVIDIA Jetson TX2), and a CPU (Intel Xeon E5-2690). We report the average latency over 1000 runs after 200 warm-up runs. We show the overhead of the shift operation as a percentage of the original 2D CNN inference time in Figure 2a. We observe the same overhead trend across devices. If we shift all the channels, the latency overhead takes up to 13.7% of the inference time on CPU, which is definitely non-negligible during inference. On the other hand, if we only shift a small proportion of the channels, e.g., 1/8, we can limit the latency overhead to only 3%. Therefore, we use the partial shift strategy in our TSM implementation to significantly bring down the memory movement cost.
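The bi-directional partial shift can be written as a few tensor-slicing operations. The PyTorch sketch below follows the description in this section (shift 1/fold_div of the channels in each direction and zero-pad at the clip boundary); the variable names are ours, and the released code may differ in details.

```python
import torch

def temporal_shift(x, n_segment, fold_div=8):
    """Bi-directional temporal partial shift.
    x: activations of shape [N*T, C, H, W], where T = n_segment frames per clip.
    fold_div: shift 1/fold_div of the channels in each temporal direction."""
    nt, c, h, w = x.size()
    n_batch = nt // n_segment
    x = x.view(n_batch, n_segment, c, h, w)

    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift backward: frame t receives channels from t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift forward: frame t receives channels from t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels stay un-shifted
    return out.view(nt, c, h, w)

# Example: a clip of T = 8 frames with 64-channel feature maps.
feat = torch.randn(8, 64, 56, 56)
shifted = temporal_shift(feat, n_segment=8, fold_div=8)   # zero FLOPs, only data movement
```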
Figure 3. Residual shift is better than in-place shift. In-place shift happens before a convolution layer (or a residual block): X → shift → conv → Y. Residual shift fuses temporal information inside a residual branch: the shift and convolution are applied on the branch and added to the identity path. (a) In-place TSM. (b) Residual TSM.

Keeping Spatial Feature Learning Capacity. We need to balance the model capacity for spatial feature learning and temporal feature learning. A straightforward way to apply TSM is to insert it before each convolutional layer or residual block, as illustrated in Figure 3a. We call this implementation in-place shift. It harms the spatial feature learning capability of the backbone model, especially when we shift a large portion of the channels, since the information stored in the shifted channels is lost for the current frame.

To address this issue, we propose a variant of the shift module. Instead of inserting it in-place, we put the TSM inside the residual branch of a residual block. We denote this version of shift as residual shift, as shown in Figure 3b. Residual shift addresses the degraded spatial feature learning problem, as all the information in the original activation is still accessible after the temporal shift through the identity mapping.

To verify our assumption, we compared the performance of in-place shift and residual shift on the Kinetics [25] dataset. We studied the experiments under different shift proportion settings. The results are shown in Figure 2b. We can see that residual shift achieves better performance than in-place shift for all shift proportions. Even if we shift all the channels to neighboring frames, residual shift still achieves better performance than the 2D baseline, thanks to the shortcut connection. Another finding is that the performance is related to the proportion of shifted channels: if the proportion is too small, the temporal reasoning ability may not be enough to handle complicated temporal relationships; if it is too large, the spatial feature learning ability may be hurt. For residual shift, we found that the performance peaks when 1/4 of the channels (1/8 for each direction) are shifted. Therefore, we use this setting for the rest of the paper.
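Residual insertion can be realized by wrapping the first convolution inside each residual block, so the shift only affects the residual branch while the identity path keeps the original activation. The sketch below builds on the temporal_shift function above and shows one plausible way to wire the module into a torchvision ResNet-50; it is an illustration under these assumptions, not necessarily identical to the released code.

```python
import torch.nn as nn
import torchvision

class ShiftedConv(nn.Module):
    """Apply the temporal shift right before an existing 2D convolution
    (here, the first 1x1 conv of a ResNet bottleneck), i.e., residual TSM."""
    def __init__(self, conv, n_segment, fold_div=8):
        super().__init__()
        self.conv = conv
        self.n_segment = n_segment
        self.fold_div = fold_div

    def forward(self, x):
        x = temporal_shift(x, self.n_segment, self.fold_div)
        return self.conv(x)

def make_tsm_resnet50(n_segment=8, fold_div=8, num_classes=400):
    """Convert an ImageNet-pretrained 2D ResNet-50 into a TSM model:
    same parameters and FLOPs, plus a shift inside every residual block.
    num_classes=400 corresponds to Kinetics [25]."""
    net = torchvision.models.resnet50(pretrained=True)
    for layer in [net.layer1, net.layer2, net.layer3, net.layer4]:
        for block in layer:
            block.conv1 = ShiftedConv(block.conv1, n_segment, fold_div)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net
```

In use, a clip of T frames is flattened into the batch dimension ([N, T, 3, H, W] → [N*T, 3, H, W]), processed exactly like a 2D CNN, and the per-frame logits are averaged to form the clip-level prediction, as described in Section 4.1 below.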

4. TSM Video Network

4.1. Offline Models with Bi-directional TSM

We insert bi-directional TSM to build offline video recognition models. Given a video V, we first sample T frames F_1, F_2, ..., F_T from the video. After frame sampling, 2D CNN baselines process each of the frames individually, and the output logits are averaged to give the final prediction. Our proposed TSM model has exactly the same parameters and computation cost as the 2D model. During the inference of the convolution layers, the frames still run independently, just like in 2D CNNs. The difference is that TSM is inserted into each residual block, which enables temporal information fusion at no computation. Each inserted temporal shift module enlarges the temporal receptive field by 2, as if running a convolution with a kernel size of 3 along the temporal dimension. Therefore, our TSM model has a very large temporal receptive field to conduct highly complicated temporal modeling. In this paper, we use ResNet-50 [17] as the backbone unless otherwise specified.

A unique advantage of TSM is that it can easily convert any off-the-shelf 2D CNN model into a pseudo-3D model that can handle both spatial and temporal information, without adding additional computation. Thus the deployment of our framework is hardware friendly: we only need to support the operations of 2D CNNs, which are already well optimized at both the framework level (CuDNN [6], MKL-DNN, TVM [5]) and the hardware level (CPU/GPU/TPU/FPGA).

4.2. Online Models with Uni-directional TSM

Video understanding from online video streams is important in real-life scenarios. Many real-time applications require online video recognition with low latency, such as AR/VR and self-driving. In this section, we show that we can adapt TSM to achieve online video recognition with multi-level temporal fusion.

As shown in Figure 1, offline TSM shifts part of the channels bi-directionally, which requires features from future frames to replace features in the current frame. If we only shift features from previous frames to the current frame, we can achieve online recognition with uni-directional TSM.

Figure 4. Uni-directional TSM for online video recognition: for each incoming frame F_t, part of each layer's feature map is shifted out and cached in memory, and replaces the corresponding part of the next frame's features to produce the per-frame prediction y_t.

The inference graph of uni-directional TSM for online video recognition is shown in Figure 4. During inference, for each frame, we save the first 1/8 of the feature maps of each residual block and cache them in memory. For the next frame, we replace the first 1/8 of the current feature maps with the cached feature maps. We use the combination of 7/8 current feature maps and 1/8 old feature maps to generate the next layer, and repeat. Using uni-directional TSM for online video recognition has several unique advantages:

1. Low-latency inference. For each frame, we only need to replace and cache 1/8 of the features, without incurring any extra computation. Therefore, the latency of giving a per-frame prediction is almost the same as for the 2D CNN baseline. Existing methods like [61] use multiple frames to give one prediction, which may lead to large latency.

2. Low memory consumption. Since we only cache a small portion of the features in memory, the memory consumption is low. For ResNet-50, we only need 0.9 MB of memory cache to store the intermediate features.

3. Multi-level temporal fusion. Most online methods only enable late temporal fusion after feature extraction, like [58], or mid-level temporal fusion [61], while our TSM enables all levels of temporal fusion. Through experiments (Table 2) we find that multi-level temporal fusion is very important for complex temporal modeling.
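The caching scheme above amounts to replacing the bi-directional shift with a one-sided shift plus a small amount of per-layer state. A minimal sketch is given below; it mirrors the description in Section 4.2 (cache 1/fold_div of the channels per shifted layer) but is our own illustration, not the released online demo code.

```python
import torch

class OnlineTemporalShift(torch.nn.Module):
    """Uni-directional TSM for streaming inference: the first 1/fold_div channels of the
    current frame are replaced by the cached channels of the previous frame, and the
    current frame's channels are cached for the next call."""
    def __init__(self, conv, fold_div=8):
        super().__init__()
        self.conv = conv
        self.fold_div = fold_div
        self.cache = None                      # holds [N, C/fold_div, H, W] from the previous frame

    def forward(self, x):                      # x: features of a single frame, [N, C, H, W]
        fold = x.size(1) // self.fold_div
        shifted_out = x[:, :fold].clone()      # channels to pass on to the next frame
        if self.cache is not None:             # first frame has no past, so it keeps its own channels
            x = torch.cat([self.cache, x[:, fold:]], dim=1)
        self.cache = shifted_out               # only ~1/fold_div of the activation is kept as state
        return self.conv(x)

    def reset(self):
        self.cache = None                      # call between independent video streams
```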
5. Experiments

We first show that TSM can significantly improve the performance of 2D CNNs on video recognition while being computationally free and hardware efficient. It further demonstrates state-of-the-art performance on temporal-related datasets, arriving at a much better accuracy-computation Pareto curve. TSM models achieve an order-of-magnitude speed-up in measured GPU throughput compared to the conventional I3D model from [50]. Finally, we leverage uni-directional TSM to conduct low-latency, real-time online prediction for both video recognition and object detection.

5.1. Setups

Training & Testing. We conducted experiments on video action recognition tasks. The training parameters for the Kinetics dataset are: 100 training epochs, initial learning rate 0.01 (decayed by 0.1 at epochs 40 and 80), weight decay 1e-4, batch size 64, and dropout 0.5. For other datasets, we scale the training epochs by half. For most datasets, the model is fine-tuned from ImageNet pre-trained weights; since HMDB-51 [26] and UCF-101 [40] are too small and prone to over-fitting [48], we followed the common practice [48, 49] of fine-tuning from Kinetics [25] pre-trained weights and freezing the Batch Normalization [22] layers. For testing, when pursuing high accuracy, we followed the common setting in [49, 50] of sampling multiple clips per video (10 for Kinetics, 2 for others) and using the full-resolution image with the shorter side at 256 for evaluation, so that we can give a direct comparison; when we consider efficiency (e.g., as in Table 2), we used just 1 clip per video and the center 224×224 crop for evaluation. We keep the same protocol for the methods compared within the same table.

Model. To have an apples-to-apples comparison with the state-of-the-art method [50], we used the same backbone (ResNet-50) on the same dataset (Something-Something-V1 [14]). This dataset focuses on temporal modeling. The difference is that [50] used a 3D ResNet-50, while we used a 2D ResNet-50 backbone to demonstrate efficiency.

Datasets. The Kinetics dataset [25] is a large-scale action recognition dataset with 400 classes. As pointed out in [58, 53], datasets like Something-Something (V1&V2) [14], Charades [38], and Jester [1] focus more on modeling temporal relationships, while UCF101 [40], HMDB51 [26], and Kinetics [25] are less sensitive to temporal relationships. Since TSM focuses on temporal modeling, we mainly focus on datasets with stronger temporal relationships, such as Something-Something. Nevertheless, we also observed strong results on the other datasets and report them.

5.2. Improving 2D CNN Baselines

We can seamlessly inject TSM into a normal 2D CNN and improve its performance on video recognition. In this section, we demonstrate that a 2D CNN baseline can significantly benefit from TSM, with double-digit accuracy improvements. We chose TSN [48] as the 2D CNN baseline. We used the same training and testing protocol for TSN and our TSM; the only difference is with or without TSM.

Comparing Different Datasets. We compare the results on several action recognition datasets in Table 1. The table is split into two parts. The upper part contains the datasets Kinetics [25], UCF101 [40], and HMDB51 [26], where temporal relationships are less important, yet our TSM still consistently outperforms the 2D TSN baseline at no extra computation. In the lower part, we present the results on Something-Something V1 and V2 [14] and Jester [1], which depend heavily on temporal relationships. The 2D CNN baseline cannot achieve good accuracy, but once equipped with TSM, the performance improves by double digits.

Table 1. Our method consistently outperforms 2D counterparts on multiple datasets at zero extra computation (protocol: ResNet-50, 8-frame input, 10 clips for Kinetics, 2 for others, full resolution).

  Dataset                   Model   Acc1   Acc5   ΔAcc1
  Less temporal:
  Kinetics                  TSN     70.6   89.2
                            Ours    74.1   91.2   +3.5
  UCF101                    TSN     91.7   99.2
                            Ours    95.9   99.7   +4.2
  HMDB51                    TSN     64.7   89.9
                            Ours    73.5   94.3   +8.8
  More temporal:
  Something-Something V1    TSN     20.5   47.5
                            Ours    47.3   76.2   +28.0
  Something-Something V2    TSN     30.4   61.0
                            Ours    61.7   87.4   +31.3
  Jester                    TSN     83.9   99.6
                            Ours    97.0   99.9   +11.7

Scaling over Backbones. TSM scales well to backbones of different sizes. We show the Kinetics top-1 accuracy with MobileNet-V2 [36], ResNet-50 [17], ResNeXt-101 [52], and ResNet-50 + Non-local module [49] backbones in Table 3. TSM consistently improves the accuracy over different backbones, even for NL R-50, which already has temporal modeling ability.

Table 3. TSM consistently improves performance over different backbones on the Kinetics dataset.

  Backbone   Mb-V2   R-50   RX-101   NL R-50
  TSN        66.5    70.7   72.4     74.6
  TSM        69.5    74.1   76.3     75.7
  ΔAcc.      +3.0    +3.4   +3.9     +1.1

5.3. Comparison with State-of-the-Arts

TSM not only significantly improves the 2D baseline but also outperforms state-of-the-art methods that rely heavily on 3D convolutions. We compared the performance of our TSM model with state-of-the-art methods on both Something-Something V1 and V2, because these two datasets focus on temporal modeling.

Table 2. Comparing TSM against other methods on the Something-Something dataset (center crop, 1 clip/video unless otherwise specified).

  Model                            Backbone           #Frame      FLOPs/Video  #Param.   Val Top-1  Val Top-5  Test Top-1
  TSN [58]                         BNInception        8           16G          10.7M     19.5       -          -
  TSN (our impl.)                  ResNet-50          8           33G          24.3M     19.7       46.6       -
  TRN-Multiscale [58]              BNInception        8           16G          18.3M     34.4       -          33.6
  TRN-Multiscale (our impl.)       ResNet-50          8           33G          31.8M     38.9       68.1       -
  Two-stream TRN (RGB+Flow) [58]   BNInception        8+8         -            36.6M     42.0       -          40.7
  ECO [61]                         BNIncep+3D Res18   8           32G          47.5M     39.6       -          -
  ECO [61]                         BNIncep+3D Res18   16          64G          47.5M     41.4       -          -
  ECO_En Lite [61]                 BNIncep+3D Res18   92          267G         150M      46.4       -          42.3
  ECO_En Lite (RGB+Flow) [61]      BNIncep+3D Res18   92+92       -            300M      49.5       -          43.9
  I3D from [50]                    3D ResNet-50       32×2 clip   153G¹ ×2     28.0M     41.6       72.2       -
  Non-local I3D from [50]          3D ResNet-50       32×2 clip   168G¹ ×2     35.3M     44.4       76.0       -
  Non-local I3D + GCN [50]         3D ResNet-50+GCN   32×2 clip   303G² ×2     62.2M²    46.1       76.8       45.0
  TSM                              ResNet-50          8           33G          24.3M     45.6       74.2       -
  TSM                              ResNet-50          16          65G          24.3M     47.2       77.1       46.0
  TSM_En                           ResNet-50          24          98G          48.6M     49.7       78.5       -
  TSM (RGB+Flow)                   ResNet-50          16+16       -            48.6M     52.6       81.9       50.7

  ¹ We report the performance of the NL I3D variant described in [50], which differs from the original NL I3D [49]: it uses less temporal dimension pooling to achieve good performance, but also incurs larger computation.
  ² Includes the parameters and FLOPs of the Region Proposal Network.
Something-Something-V1. Something-Something-V1 is a challenging dataset, as an activity cannot be inferred merely from individual frames (e.g., pushing something from right to left). We compare TSM with current state-of-the-art methods in Table 2. We only applied the center crop during testing to ensure efficiency, unless otherwise specified. TSM achieved first place on the leaderboard upon publication.

We first show the results of the 2D based methods TSN [48] and TRN [58]. TSN with different backbones fails to achieve decent performance (<20% Top-1) due to the lack of temporal modeling. For TRN, although late temporal fusion is added after feature extraction, the performance is still significantly lower than state-of-the-art methods, showing the importance of temporal fusion across all levels.

The second section shows the state-of-the-art efficient video understanding framework ECO [61]. ECO uses an early-2D + late-3D architecture that enables medium-level temporal fusion. Compared to ECO, our method achieves better performance at smaller FLOPs. For example, when using 8 frames as input, our TSM achieves 45.6% top-1 accuracy with 33G FLOPs, which is 4.2% higher accuracy than ECO at 1.9× less computation. The ensemble versions of ECO (ECO_En Lite and ECO_En Lite RGB+Flow, using an ensemble of {16, 20, 24, 32}-frame inputs) did achieve competitive results, but the computation and parameters are too large for deployment. Our model is much more efficient: we only used the {8, 16}-frame models for the ensemble (TSM_En), and it achieves better performance using 2.7× less computation and 3.1× fewer parameters.

The third section contains the previous state-of-the-art method, Non-local I3D + GCN [50], which enables all-level temporal fusion. The GCN needs a Region Proposal Network [34] trained on the MSCOCO object detection dataset [30] to generate the bounding boxes, which makes a direct comparison unfair, since external data (MSCOCO) and extra training cost are introduced. We therefore compare TSM to its CNN part, Non-local I3D. Our TSM (8f) achieves 1.2% better accuracy with 10× fewer FLOPs on the validation set compared to the Non-local I3D network. Note that techniques like the Non-local module [49] are orthogonal to our work and could also be added to our framework to further boost performance.

Generalize to Other Modalities. We also show that our proposed method can generalize to other modalities such as optical flow. To extract the optical flow information between frames, we followed [48] and used the TVL1 optical flow algorithm [55] implemented in OpenCV with CUDA. We conducted two-stream experiments on both the Something-Something V1 and V2 datasets, and it consistently improves over the RGB performance: introducing the optical flow branch brings 5.4% and 2.6% top-1 improvement on V1 and V2, respectively.

Something-Something-V2. We also show results on the Something-Something-V2 dataset, a newer release of its previous version. The results compared to other state-of-the-art methods are shown in Table 4. On Something-Something-V2, we achieved state-of-the-art performance while only using RGB input.

Table 4. Results on Something-Something-V2. Our TSM achieves state-of-the-art performance.

  Method                 Val Top-1  Val Top-5  Test Top-1  Test Top-5
  TSN (our impl.)        30.0       60.5       -           -
  MultiScale TRN [58]    48.8       77.6       50.9        79.3
  2-Stream TRN [58]      55.5       83.1       56.2        83.2
  TSM_8F                 59.1       85.6       -           -
  TSM_16F                63.4       88.5       64.3        89.6
  TSM (RGB+Flow)         66.0       90.5       66.6        91.3

Cost vs. Accuracy. Our TSM model achieves very competitive performance while enjoying high efficiency and low computation cost for fast inference. We show the FLOPs for each model in Table 2. Although the GCN itself is light, the method uses a ResNet-50 based Region Proposal Network [34] to extract bounding boxes, whose cost is also included in the chart. Note that the computation cost of optical flow extraction is usually larger than that of the video recognition model itself; therefore, we do not report the FLOPs of two-stream based methods.

We show the accuracy, FLOPs, and parameter-count trade-off in Figure 5. The accuracy is tested on the validation set of Something-Something-V1, and the number of parameters is indicated by the area of the circles. Our TSM based methods have a better Pareto curve than both previous state-of-the-art efficient models (the ECO based models) and high-performance models (the Non-local I3D based models). TSM models are both efficient and accurate and achieve state-of-the-art accuracy at high efficiency: they deliver better performance while consuming 3× less computation than the ECO family. Considering that ECO is already an efficiency-oriented design, our method enjoys highly competitive hardware efficiency.

Figure 5. TSM enjoys a better accuracy-cost trade-off than the I3D family and the ECO family on the Something-Something-V1 [14] dataset (accuracy (%) vs. FLOPs/Video (G); marker area indicates the number of parameters; GCN includes the cost of the ResNet-50 RPN used to generate region proposals).

5.4. Latency and Throughput Speedup

Measured inference latency and throughput are important for large-scale video understanding. TSM has low latency and high throughput. We performed measurements on a single NVIDIA Tesla P100 GPU, using a batch size of 1 for latency measurement and a batch size of 16 for throughput measurement. We made two comparisons:

(1) Compared with the I3D model from [50], our method is faster by an order of magnitude at 1.8% higher accuracy (Table 5). We also compared our method to the state-of-the-art efficient model ECO [61]: our TSM model has 1.75× lower latency (17.4 ms vs. 30.6 ms), 1.7× higher throughput, and achieves 2% better accuracy. ECO has a two-branch (2D+3D) architecture, while TSM only needs the inexpensive 2D backbone.

Table 5. TSM enjoys low GPU inference latency and high throughput. V/s means videos per second; higher is better (measured on an NVIDIA Tesla P100 GPU).

  Model            FLOPs  Param.  Latency   Throughput  Sth. Top-1  Kinetics Top-1
  I3D from [50]    306G   35.3M   165.3 ms  6.1 V/s     41.6%       -
  ECO_16F [61]     64G    47.5M   30.6 ms   45.6 V/s    41.4%       -
  I3D from [49]    33G    29.3M   25.8 ms   42.4 V/s    -           73.3%
  I3D_replace      48G    33.0M   28.0 ms   37.9 V/s    44.9%       -
  TSM_8F           33G    24.3M   17.4 ms   77.4 V/s    45.6%       74.1%
  TSM_16F          65G    24.3M   29.0 ms   39.5 V/s    47.2%       74.7%
(2) We then compared TSM to efficient 3D model designs. One approach is to inflate only the first 1×1 convolution in each block, as in [49], denoted as "I3D from [49]" in the table. Although the FLOPs are similar due to pooling, it suffers from 1.5× higher latency and only 55% of the throughput compared with TSM, with worse accuracy. We speculate the reason is that the TSM model only uses 2D convolutions, which are highly optimized for hardware. To exclude the factor of backbone design, we replace every TSM primitive with a 3×1×1 convolution and denote this model as I3D_replace. It is still much slower than TSM and performs worse.
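The latency and throughput numbers above follow a standard GPU timing protocol (warm-up runs, device synchronization, then averaging over many runs, as in Section 3.3). A minimal measurement harness is sketched below; it is our illustration of that protocol, not the benchmarking script used for the paper.

```python
import time
import torch

def measure_gpu(model, x, n_videos=1, warmup=200, runs=1000):
    """Average forward latency (ms) and throughput (videos/s) on a CUDA device.
    x: a pre-built input batch; for an 8-frame TSM clip this is a [8, 3, 224, 224] tensor."""
    model = model.cuda().eval()
    x = x.cuda()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up runs are excluded from timing
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    elapsed = time.time() - start
    latency_ms = elapsed / runs * 1000
    throughput = n_videos * runs / elapsed
    return latency_ms, throughput

# Example: latency of a hypothetical 8-frame TSM model (batch = one video).
# lat, thr = measure_gpu(tsm_model, torch.randn(8, 3, 224, 224), n_videos=1)
```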
5.5. Online Recognition with TSM

Online vs. Offline. Online TSM models shift the feature maps uni-directionally so that they can give predictions in real time. We compare the performance of offline and online TSM models to show that online TSM can still achieve comparable performance. Following [61], we use the prediction averaged over all frames to compare with offline models, i.e., we compare the performance after observing the whole video. The results are given in Table 6. For less temporal-related datasets like Kinetics, UCF101, and HMDB51, the online models achieve comparable and sometimes even better performance than the offline models. For the more temporal-related Something-Something dataset, the online model performs worse than the offline model by 1.0%. Nevertheless, the performance of the online model is still significantly better than that of the 2D baseline.

We also compare the per-frame prediction latency of the pure 2D backbone (TSN) and our online TSM model. We compile both models with TVM [5] on GPU. Our online TSM model adds less than 0.1 ms of latency overhead per frame while bringing up to 25% accuracy improvement. This demonstrates that online TSM is hardware-efficient for latency-critical real-time applications.

Table 6. Comparing the accuracy of offline TSM and online TSM on different datasets. Online TSM brings negligible latency overhead.

  Model         Latency  Kinetics  UCF101  HMDB51  Something
  TSN           4.7 ms   70.6%     91.7%   64.7%   20.5%
  +Offline TSM  -        74.1%     95.9%   73.5%   47.3%
  +Online TSM   4.8 ms   74.3%     95.5%   73.6%   46.3%

Early Recognition. Early recognition aims to classify the video while observing only a small portion of the frames. It gives a fast response to the input video stream. Here we compare the early video recognition performance on the UCF101 dataset (Figure 6). Compared to ECO, TSM gives much higher accuracy, especially when only a small portion of the frames is observed. For example, when observing only the first 10% of video frames, the TSM model can achieve 90% accuracy, which is 6.6% higher than the best ECO model.

Figure 6. Early recognition on UCF101 (accuracy (%) vs. percentage of the video observed, for TSM and ECO with s = 8, 12, 20). TSM gives high prediction accuracy after observing only a small portion of the video.

Online Object Detection. Real-time online video object detection is an important application in self-driving vehicles, robotics, etc. By injecting our online TSM into the backbone, we can easily take temporal cues into consideration at negligible overhead, so that the model can handle poor object appearance such as motion blur, occlusion, and defocus. We conducted experiments on the R-FCN [23] detector with a ResNet-101 backbone on the ImageNet-VID [35] dataset. We inserted the uni-directional TSM into the backbone while keeping the other settings the same. The results are shown in Table 7. Compared to the 2D baseline R-FCN [23], our online TSM model significantly improves the performance, especially on fast-moving objects, where TSM increases mAP by 4.6%. We also compare to a strong baseline, FGFA [60], which uses optical flow to aggregate temporal information from 21 frames (the past 10 and the future 10) for offline video detection. Compared to FGFA, TSM achieves similar or higher performance while enabling online recognition at much smaller latency. We visualize some video clips in the supplementary material to show that online TSM can leverage temporal consistency to correct mis-predictions.

Table 7. Video detection results on ImageNet-VID.

  Model        Online  Need Flow  Latency  mAP Overall  mAP Slow  mAP Medium  mAP Fast
  R-FCN [23]   Yes     No         1×       74.7         83.6      72.5        51.4
  FGFA [60]    No      Yes        2.5×     75.9         84.0      74.4        55.6
  Online TSM   Yes     No         1×       76.3         83.4      74.8        56.0

Edge Deployment. TSM is mobile-device friendly. We build an online TSM model with a MobileNet-V2 backbone, which achieves 69.5% accuracy on Kinetics. The latency and energy on NVIDIA Jetson Nano and TX2, Raspberry Pi 4B, Samsung Galaxy Note8, and Google Pixel-1 are shown in Table 8. The models are compiled using TVM [5]. Power is measured with a power meter, subtracting the static power. TSM achieves low latency and low power on edge devices.

Table 8. TSM efficiently runs on edge devices with low latency.

  Device        Nano CPU  Nano GPU  TX2 CPU  TX2 GPU  Rasp. Pi  Note8  Pixel1
  Latency (ms)  47.8      13.4      36.4     8.5      69.6      34.5   47.4
  Power (watt)  4.8       4.5       5.6      5.8      3.8       -      -

6. Conclusion

We propose the Temporal Shift Module for hardware-efficient video recognition. It can be inserted into a 2D CNN backbone to enable joint spatial-temporal modeling at no additional cost. The module shifts part of the channels along the temporal dimension to exchange information with neighboring frames. Our framework is both efficient and accurate, enabling low-latency video recognition on edge devices.

Acknowledgments. We thank MIT Quest for Intelligence, MIT-IBM Watson AI Lab, MIT-SenseTime Alliance, Samsung, SONY, AWS, and Google for supporting this research. We thank Oak Ridge National Lab for the Summit supercomputer.
References

[1] The 20BN-Jester dataset V1. https://20bn.com/datasets/jester.
[2] Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In CVPR, pages 3034–3042, 2016.
[3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pages 4724–4733, 2017.
[5] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, pages 578–594, 2018.
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[7] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
[8] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.
[9] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
[10] Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In ECCV, pages 849–866. Springer, 2016.
[11] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G. Hauptmann. DevNet: A deep event network for multimedia event detection and evidence recounting. In CVPR, pages 2568–2577, 2015.
[12] Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR, pages 923–932, 2016.
[13] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
[14] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
[15] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[16] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. In NIPS, pages 1135–1143, 2015.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[18] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, pages 784–800, 2018.
[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[20] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[21] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[23] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. 2016.
[24] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
[25] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[26] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In ICCV, pages 2556–2563. IEEE, 2011.
[27] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, and Nojun Kwak. Motion feature network: Fixed motion filter for action recognition. In ECCV, pages 387–403, 2018.
[28] Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees G. M. Snoek. VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
[29] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In NIPS, pages 2181–2191, 2017.
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[31] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
[32] Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. Attention clusters: Purely attention based local feature integration for video classification. In CVPR, pages 7834–7843, 2018.
[33] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, pages 5534–5542. IEEE, 2017.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
[37] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
[38] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, pages 510–526. Springer, 2016.
[39] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[40] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[41] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, pages 843–852, 2015.
[42] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, pages 4597–4605, 2015.
[43] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[44] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
[45] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.
[46] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018.
[47] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization. arXiv preprint arXiv:1811.08886, 2018.
[48] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36. Springer, 2016.
[49] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 2017.
[50] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. arXiv preprint arXiv:1806.01810, 2018.
[51] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. arXiv preprint arXiv:1711.08141, 2017.
[52] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.
[53] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 305–321, 2018.
[54] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
[55] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.
[56] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
[57] Huasong Zhong, Xianggen Liu, Yihui He, Yuchun Ma, and Kris Kitani. Shift-based primitives for efficient convolutional neural networks. arXiv preprint arXiv:1809.08458, 2018.
[58] Bolei Zhou, Alex Andonian, and Antonio Torralba. Temporal relational reasoning in videos. arXiv preprint arXiv:1711.08496, 2017.
[59] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In ICLR, 2016.
[60] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In ICCV, pages 408–417, 2017.
[61] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: Efficient convolutional network for online video understanding. arXiv preprint arXiv:1804.09066, 2018.
[62] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
[63] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.
