ByteDance AI Lab AVA Challenge 2019 Technical Report
Wei Li1, Zehuan Yuan2, An Zhao2, Jie Shao2, and Changhu Wang2
1 Shanghai Jiao Tong University
2 ByteDance AI Lab
{liweihfyz}@sjtu.edu.cn
{yuanzehuan,zhaoan,shaojie.mail,wangchanghu}@bytedance.com
Abstract

In this technical report, we introduce our solution for the spatio-temporal action localization task (AVA dataset) of the ActivityNet Challenge 2019. To this end, we propose a novel two-stage training algorithm to integrate temporal context spanning multiple seconds. First, we utilize a 3D ConvNet pretrained on the Kinetics dataset to exploit the short-term visual content of video clips centered at key frames. Second, we propose to link the region proposals of several key frames with dynamic programming to form deformable action tubes spanning multiple seconds. In both stages, additional 3D ConvNets are introduced after ROI-pooling to integrate the content of a single clip or of multiple clips into a compact representation for further classification and regression.

                   Frame-mAP
baseline              24.0
multi-sec (T=5)       25.45
multi-sec (T=9)       25.83

Table 1. Comparison results of the baseline model and the corresponding multi-sec models.
1. Our method

In our method, we split the whole training process into two stages: baseline training and multi-second training. The baseline stage exploits short-term visual content, while multi-second training integrates the content of multiple seconds with linked action tubes to exploit long-term information.
1.1. Baseline

We follow the training strategy of Faster R-CNN [3] for end-to-end localization. We train our RPN with a 2D ResNet-50 backbone on key frames to generate off-the-shelf region proposals. We utilize SlowFast-50 [1] pretrained on the Kinetics dataset as our backbone to exploit the visual content of each clip centered at a key frame. Following [1], we input 32 frames with a temporal stride of 2 into the backbone. Also, the spatial stride of res5 is set to 1 to increase the spatial resolution. We rescale the shorter side of each image to 300 pixels due to the GPU memory limit. Following [2], we replicate region proposals along the temporal axis to generate a feature volume V ∈ R^{4×256×7×7} with 2D ROI-pooling. We utilize an additional point-wise 2D convolution layer to reduce the channel dimension to 256. The output feature volume V is further forwarded into an additional 2-layer 3D ConvNet to generate a compact representation. A region proposal is considered positive if its IoU with any ground-truth box of the key frame is higher than 0.5. We choose 40 proposals with a 1:3 ratio of positive to negative proposals.
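For concreteness, the ROI head described above can be written as the following minimal PyTorch-style sketch. The module name, kernel sizes, and head layout are illustrative assumptions rather than our exact implementation; only the overall flow (per-slice 2D ROI-pooling of replicated proposals, a point-wise convolution reducing channels to 256, a 2-layer 3D ConvNet, then classification and regression) follows the description above.

# Hedged sketch of the baseline ROI head (PyTorch-style); names and kernel
# sizes are illustrative, not the exact implementation.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class BaselineRoIHead(nn.Module):
    def __init__(self, in_channels, num_classes, mid_channels=256):
        super().__init__()
        # point-wise 2D convolution that reduces the channel dimension to 256
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        # additional 2-layer 3D ConvNet aggregating the T x 256 x 7 x 7 volume
        self.agg = nn.Sequential(
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.cls = nn.Linear(mid_channels, num_classes)  # action classification
        self.reg = nn.Linear(mid_channels, 4)            # box regression

    def forward(self, feat, boxes, spatial_scale):
        # feat: (1, C, T, H, W) res5 feature map of one clip (spatial stride 1)
        # boxes: (K, 4) key-frame proposals, replicated along the temporal axis
        _, _, t, _, _ = feat.shape
        # batch index 0 because a single clip is processed at a time here
        rois = torch.cat([boxes.new_zeros((len(boxes), 1)), boxes], dim=1)
        pooled = []
        for i in range(t):  # 2D ROI-pooling on every temporal slice
            p = roi_align(feat[:, :, i], rois, output_size=(7, 7),
                          spatial_scale=spatial_scale)   # (K, C, 7, 7)
            pooled.append(self.reduce(p))                # (K, 256, 7, 7)
        v = torch.stack(pooled, dim=2)                   # (K, 256, T, 7, 7)
        v = self.agg(v).flatten(1)                       # (K, 256)
        return self.cls(v), self.reg(v)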
1.2. Multi-sec training

After training our baseline model, we link the region proposals of the key frames of multiple seconds into action tubes with a dynamic programming algorithm.
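The linking step can be implemented as a Viterbi-style dynamic program. The sketch below assumes a common linking objective from the action-tube literature, namely the per-frame detection confidence plus a weighted IoU between boxes of consecutive key frames; the exact scoring function and weight we used may differ, and all function names are illustrative.

# Viterbi-style linking of per-key-frame detections into one action tube.
# The pairwise term (detection score + lam * IoU of consecutive boxes) is an
# assumed, commonly used linking objective.
import numpy as np

def box_iou(a, b):
    # IoU of two boxes in (x1, y1, x2, y2) format
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_tube(boxes_per_frame, scores_per_frame, lam=1.0):
    # boxes_per_frame: list over key frames of (K_t, 4) arrays
    # scores_per_frame: list over key frames of (K_t,) detection confidences
    # returns: index of the selected box at every key frame
    T = len(boxes_per_frame)
    dp = [np.asarray(scores_per_frame[0], dtype=float)]  # best score ending at each box
    back = []
    for t in range(1, T):
        prev, cur = boxes_per_frame[t - 1], boxes_per_frame[t]
        trans = np.array([[box_iou(p, c) for c in cur] for p in prev])
        cand = dp[t - 1][:, None] + lam * trans           # (K_{t-1}, K_t)
        back.append(cand.argmax(axis=0))                  # best predecessor per current box
        dp.append(cand.max(axis=0) + np.asarray(scores_per_frame[t], dtype=float))
    path = [int(dp[-1].argmax())]                         # backtrack the best path
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

Running such a linker over the key frames of T seconds yields, per actor, a deformable action tube whose boxes select the clips used in the aggregation step below.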
Due to memory limits, we precompute res4 feature volumes off-the-shelf and finetune the res5-stage parameters loaded from the baseline model. We uniformly sample 5 clips centered at key frames from the T seconds, use average pooling or max pooling to reduce the temporal dimension of each clip's feature volume to 1, and stack the results into a feature volume Vm ∈ R^{5×256×7×7}. Similarly, an additional 2-layer ConvNet is used to aggregate the stacked feature volume into a compact representation. The comparison results of the baseline model and the multi-sec models are summarized in Table 1.
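A minimal sketch of this aggregation step is shown below: each of the 5 sampled clips contributes a per-proposal feature volume that is pooled over its own temporal dimension, the clips are stacked into Vm, and a small 2-layer ConvNet head collapses Vm into one vector per linked proposal. The kernel sizes, pooling switch, and classifier head are assumptions for illustration.

# Sketch of the multi-sec aggregation head; hyper-parameters are illustrative.
import torch
import torch.nn as nn

class MultiSecHead(nn.Module):
    def __init__(self, num_classes, channels=256, pool="mean"):
        super().__init__()
        self.pool = pool  # "mean" or "max" pooling over each clip's time axis
        # additional 2-layer ConvNet over the stacked (clip, H, W) volume
        self.agg = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.cls = nn.Linear(channels, num_classes)

    def forward(self, clip_feats):
        # clip_feats: list of 5 tensors, each (K, 256, T_clip, 7, 7), one per
        # sampled clip and aligned to the same K linked proposals
        pooled = []
        for f in clip_feats:
            if self.pool == "mean":
                pooled.append(f.mean(dim=2))       # (K, 256, 7, 7)
            else:
                pooled.append(f.max(dim=2).values)
        vm = torch.stack(pooled, dim=2)            # (K, 256, 5, 7, 7), i.e. Vm
        vm = self.agg(vm).flatten(1)               # (K, 256)
        return self.cls(vm)

The mean and max settings of the pooling switch correspond to the average- and max-pooling variants reported as multi-sec (mean) and multi-sec (max) in Table 2.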
1.3. Ensemble with LFB

To better combine long-term information, we ensemble our results with those of LFB [4]. For overlap-
                        Frame-mAP
multi-sec (mean)           25.83
multi-sec (max)            25.3
LFB (single-crop) [4]      26.98
ensemble                   29.4

Table 2. Comparison results of each single model and ensemble model on validation set.
References
[1] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[2] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[4] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.