ByteDance AI Lab AVA Challenge 2019 Technical Report

Wei Li^1, Zehuan Yuan^2, An Zhao^2, Jie Shao^2, and Changhu Wang^2
^1 Shanghai Jiao Tong University
^2 ByteDance AI Lab
liweihfyz@sjtu.edu.cn
{yuanzehuan,zhaoan,shaojie.mail,wangchanghu}@bytedance.com

Abstract

In this technical report, we introduce our solution for spatial-temporal action localization (the AVA dataset) in the ActivityNet Challenge 2019. We propose a novel two-stage training algorithm to integrate temporal context spanning multiple seconds. First, we utilize a 3D ConvNet pretrained on the Kinetics dataset to exploit the short-term visual contents of video clips centered at key frames. Second, we link the region proposals of several key frames with dynamic programming to form deformable action tubes spanning multiple seconds. In both stages, additional 3D ConvNets are introduced after ROI-pooling to integrate the contents of a single clip or of multiple clips into a compact representation for further classification and regression.

1. Our method

We split the whole training process into two stages: baseline training and multi-second training. The baseline stage exploits short-term visual contents, while multi-second training integrates the contents of multiple seconds through linked action tubes to exploit long-term information.
1.1. Baseline

We follow the training strategy of Faster R-CNN [3] for end-to-end localization. We train our RPN with a 2D ResNet-50 backbone on key frames to generate off-the-shelf region proposals. We utilize SlowFast-50 [1] pretrained on the Kinetics dataset as our backbone to exploit the visual contents of each clip centered at a key frame. Following [1], we input 32 frames with a temporal stride of 2 into the backbone. We also set the spatial stride of res5 to 1 to increase the spatial resolution, and we rescale the shorter side of each image to 300 pixels due to the GPU memory limit. Following [2], we replicate region proposals along the temporal axis to generate a feature volume V ∈ R^{4×256×7×7} with 2D ROI-pooling. An additional point-wise 2D convolution layer reduces the channel dimension to 256, and the output feature volume V is then forwarded into an additional 2-layer 3D ConvNet to generate a compact representation. A region proposal is considered positive if its IoU with any ground-truth box of the key frame is higher than 0.5. We choose 40 proposals per key frame, with a 1:3 ratio of positive to negative proposals. A code sketch of this ROI head is given after Table 1.

                    Frame-mAP
baseline            24.0
multi-sec (T=5)     25.45
multi-sec (T=9)     25.83

Table 1. Comparison of the baseline model and the corresponding multi-sec models.
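To make the baseline head concrete, here is a minimal PyTorch sketch of the ROI feature path described above: per-frame 2D ROI-pooling of proposals replicated along time, a point-wise convolution reducing channels to 256, and a 2-layer 3D ConvNet that squeezes the volume into a compact per-proposal feature. The backbone channel count (2304, assuming concatenated SlowFast-50 res5 features), the spatial scale, and the classifier head are illustrative assumptions rather than the report's exact configuration.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ROIHead3D(nn.Module):
    def __init__(self, in_channels=2304, mid_channels=256, num_classes=80):
        super().__init__()
        # Point-wise 2D convolution that reduces the channel dimension to 256.
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        # Additional 2-layer 3D ConvNet over the (256, T, 7, 7) feature volume.
        self.conv3d = nn.Sequential(
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(mid_channels, num_classes)

    def forward(self, features, boxes):
        # features: (B, C, T, H, W) backbone output; boxes: per-image (N_i, 4).
        B, C, T, H, W = features.shape
        # Replicate each proposal along the temporal axis: pool every frame
        # of the clip with the same 2D ROI, yielding an (N, T, C, 7, 7) volume.
        per_frame = [roi_align(features[:, :, t], boxes, output_size=(7, 7),
                               spatial_scale=1 / 16) for t in range(T)]
        v = torch.stack(per_frame, dim=1)                   # (N, T, C, 7, 7)
        n = v.shape[0]
        v = self.reduce(v.flatten(0, 1))                    # (N*T, 256, 7, 7)
        v = v.view(n, T, -1, 7, 7).permute(0, 2, 1, 3, 4)   # (N, 256, T, 7, 7)
        v = self.conv3d(v).mean(dim=(2, 3, 4))              # pool to (N, 256)
        return self.classifier(v)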
1.2. Multi-sec training

After training our baseline model, we link the region proposals of the key frames within a multi-second window into action tubes with a dynamic programming algorithm; a sketch of this linking step closes this subsection. Due to memory limits, we precompute the res4 feature volumes off-the-shelf and only finetune the res5-stage parameters loaded from the baseline model. We uniformly sample 5 clips centered at key frames from a window of T seconds, reduce the temporal dimension of each clip's feature volume to 1 with average pooling or max pooling, and stack the results into a feature volume V_m ∈ R^{5×256×7×7}. As in the baseline, an additional 2-layer 3D ConvNet aggregates the stacked feature volume into a compact representation. The comparison between the baseline model and the multi-sec models is summarized in Table 1.
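The report does not spell out the linking objective, so the following sketch assumes a standard Viterbi-style formulation: each proposal contributes its detection confidence, consecutive proposals contribute an IoU-based link reward weighted by a hypothetical lam, and dynamic programming recovers the highest-scoring deformable tube.

import numpy as np

def iou(a, b):
    # IoU between two boxes in (x1, y1, x2, y2) format.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_tube(boxes, scores, lam=1.0):
    # boxes[t]: (N_t, 4) proposals at key frame t; scores[t]: (N_t,) array.
    # Returns the per-frame proposal indices of the highest-scoring tube.
    T = len(boxes)
    dp = [np.asarray(scores[0], dtype=np.float64)]  # best cumulative scores
    back = []                                       # backpointers per frame
    for t in range(1, T):
        # link[i, j]: reward for connecting proposal i at t-1 to j at t.
        link = np.array([[dp[-1][i] + lam * iou(boxes[t - 1][i], boxes[t][j])
                          for j in range(len(boxes[t]))]
                         for i in range(len(boxes[t - 1]))])
        back.append(link.argmax(axis=0))   # best predecessor for each j
        dp.append(link.max(axis=0) + scores[t])
    # Trace the best path back from the last key frame.
    path = [int(dp[-1].argmax())]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]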
1.3. Ensemble with LFB

To better combine results carrying long-term information, we ensemble our results with LFB [4].

                        Frame-mAP
multi-sec (mean)        25.83
multi-sec (max)         25.3
LFB (single-crop) [4]   26.98
ensemble                29.4

Table 2. Comparison of each single model and the ensemble model on the validation set.

For overlapping boxes of the same class, we average their positions and combine their confidence scores with per-method weights; the weights of the methods sum to 1. We use different pooling strategies to increase the diversity of our trained models. Our results on the AVA validation set are summarized in Table 2, and a sketch of the box-fusion step follows.
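As a rough illustration of this fusion rule, the sketch below pairs overlapping same-class boxes from two result sets, averages their coordinates, and combines their scores with weights summing to 1. It reuses iou() from the linking sketch above; the IoU threshold and the handling of unmatched boxes are assumptions.

def fuse_detections(dets_a, dets_b, w_a=0.5, iou_thr=0.5):
    # dets_*: lists of (box, score, cls) with box = (x1, y1, x2, y2).
    w_b = 1.0 - w_a                        # the method weights sum to 1
    fused, used_b = [], set()
    for box_a, score_a, cls_a in dets_a:
        match = None
        for j, (box_b, score_b, cls_b) in enumerate(dets_b):
            if j in used_b or cls_b != cls_a:
                continue
            if iou(box_a, box_b) >= iou_thr:   # same-class overlapping pair
                match = j
                break
        if match is None:
            # Unmatched box: keep it with its weighted score (an assumption).
            fused.append((box_a, w_a * score_a, cls_a))
        else:
            box_b, score_b, _ = dets_b[match]
            used_b.add(match)
            avg = tuple((x + y) / 2.0 for x, y in zip(box_a, box_b))
            fused.append((avg, w_a * score_a + w_b * score_b, cls_a))
    fused += [(b, w_b * s, c) for j, (b, s, c) in enumerate(dets_b)
              if j not in used_b]
    return fused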

References
[1] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[2] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[4] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
