ByteDance AI Lab AVA Challenge 2019 Technical Report
Wei Li1, Zehuan Yuan2, An Zhao2, Jie Shao2, and Changhu Wang2
1 Shanghai Jiao Tong University
2 ByteDance AI Lab
{liweihfyz}@sjtu.edu.cn
{yuanzehuan,zhaoan,shaojie.mail,wangchanghu}@bytedance.com
Abstract

In this technical report, we introduce our solution for the spatio-temporal action localization task (AVA dataset) of the ActivityNet Challenge 2019. To this end, we propose a novel two-stage training algorithm to integrate temporal context spanning multiple seconds. First, we utilize a 3D ConvNet pretrained on the Kinetics dataset to exploit the short-term visual content of video clips centered at key frames. Second, we propose to link the region proposals of several key frames with dynamic programming to form deformable action tubes spanning multiple seconds. In both stages, additional 3D ConvNets are introduced after ROI-pooling to integrate the content of a single clip or of multiple clips into a compact representation for further classification and regression.

                   Frame-mAP
baseline              24.0
multi-sec (T=5)       25.45
multi-sec (T=9)       25.83

Table 1. Comparison results of the baseline model and the corresponding multi-sec models.
1. Our method

In our method, we split the whole training process into two stages: baseline training and multi-second training. The baseline stage exploits short-term visual content, while multi-second training integrates the content of multiple seconds with linked action tubes to exploit long-term information.
1.1. Baseline

We follow the training strategy of Faster R-CNN [3] for end-to-end localization. We train our RPN with a 2D ResNet-50 backbone on key frames to generate off-the-shelf region proposals. We utilize SlowFast-50 [1] pretrained on the Kinetics dataset as our backbone to exploit the visual content of each clip centered at a key frame. Following [1], we input 32 frames with a temporal stride of 2 into the backbone. Also, the spatial stride of res5 is set to 1 to increase the spatial resolution. We rescale the shorter side of each image to 300 pixels due to the GPU memory limit. Following [2], we replicate region proposals along the temporal axis to generate a feature volume V ∈ R^{4×256×7×7} with 2D ROI-pooling. We utilize an additional point-wise 2D convolution layer to reduce the channel dimension to 256. The output feature volume V is further forwarded into an additional 2-layer 3D ConvNet to generate a compact representation. A region proposal is considered positive if its IoU with any ground-truth box of the key frame is higher than 0.5. We choose 40 proposals with a 1:3 ratio of positive to negative proposals.
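For concreteness, the ROI head described above can be written as the following minimal PyTorch-style sketch. The module name, kernel sizes, and head layout are illustrative assumptions rather than our exact implementation; only the overall flow (per-slice 2D ROI-pooling of replicated proposals, a point-wise convolution reducing channels to 256, a 2-layer 3D ConvNet, then classification and regression) follows the description above.

# Hedged sketch of the baseline ROI head (PyTorch-style); names and kernel
# sizes are illustrative, not the exact implementation.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class BaselineRoIHead(nn.Module):
    def __init__(self, in_channels, num_classes, mid_channels=256):
        super().__init__()
        # point-wise 2D convolution that reduces the channel dimension to 256
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        # additional 2-layer 3D ConvNet aggregating the T x 256 x 7 x 7 volume
        self.agg = nn.Sequential(
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.cls = nn.Linear(mid_channels, num_classes)  # action classification
        self.reg = nn.Linear(mid_channels, 4)            # box regression

    def forward(self, feat, boxes, spatial_scale):
        # feat: (1, C, T, H, W) res5 feature map of one clip (spatial stride 1)
        # boxes: (K, 4) key-frame proposals, replicated along the temporal axis
        _, _, t, _, _ = feat.shape
        # batch index 0 because a single clip is processed at a time here
        rois = torch.cat([boxes.new_zeros((len(boxes), 1)), boxes], dim=1)
        pooled = []
        for i in range(t):  # 2D ROI-pooling on every temporal slice
            p = roi_align(feat[:, :, i], rois, output_size=(7, 7),
                          spatial_scale=spatial_scale)   # (K, C, 7, 7)
            pooled.append(self.reduce(p))                # (K, 256, 7, 7)
        v = torch.stack(pooled, dim=2)                   # (K, 256, T, 7, 7)
        v = self.agg(v).flatten(1)                       # (K, 256)
        return self.cls(v), self.reg(v)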
1.2. Multi-sec training

After training our baseline model, we link the region proposals of the key frames of multiple seconds into action tubes with a dynamic programming algorithm.
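The linking step can be implemented as a Viterbi-style dynamic program. The sketch below assumes a common linking objective from the action-tube literature, namely the per-frame detection confidence plus a weighted IoU between boxes of consecutive key frames; the exact scoring function and weight we used may differ, and all function names are illustrative.

# Viterbi-style linking of per-key-frame detections into one action tube.
# The pairwise term (detection score + lam * IoU of consecutive boxes) is an
# assumed, commonly used linking objective.
import numpy as np

def box_iou(a, b):
    # IoU of two boxes in (x1, y1, x2, y2) format
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_tube(boxes_per_frame, scores_per_frame, lam=1.0):
    # boxes_per_frame: list over key frames of (K_t, 4) arrays
    # scores_per_frame: list over key frames of (K_t,) detection confidences
    # returns: index of the selected box at every key frame
    T = len(boxes_per_frame)
    dp = [np.asarray(scores_per_frame[0], dtype=float)]  # best score ending at each box
    back = []
    for t in range(1, T):
        prev, cur = boxes_per_frame[t - 1], boxes_per_frame[t]
        trans = np.array([[box_iou(p, c) for c in cur] for p in prev])
        cand = dp[t - 1][:, None] + lam * trans           # (K_{t-1}, K_t)
        back.append(cand.argmax(axis=0))                  # best predecessor per current box
        dp.append(cand.max(axis=0) + np.asarray(scores_per_frame[t], dtype=float))
    path = [int(dp[-1].argmax())]                         # backtrack the best path
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

Running such a linker over the key frames of T seconds yields, per actor, a deformable action tube whose boxes select the clips used in the aggregation step below.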
Due to memory limits, we precompute res4 feature volumes off-the-shelf and finetune the res5-stage parameters loaded from the baseline model. We uniformly sample 5 clips centered at key frames from the T seconds, use average pooling or max pooling to reduce the temporal dimension of each clip's feature volume to 1, and stack the results into a feature volume Vm ∈ R^{5×256×7×7}. Similarly, an additional 2-layer ConvNet is used to aggregate the stacked feature volume into a compact representation. The comparison results of the baseline model and the multi-sec models are summarized in Table 1.
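A minimal sketch of this aggregation step is shown below: each of the 5 sampled clips contributes a per-proposal feature volume that is pooled over its own temporal dimension, the clips are stacked into Vm, and a small 2-layer ConvNet head collapses Vm into one vector per linked proposal. The kernel sizes, pooling switch, and classifier head are assumptions for illustration.

# Sketch of the multi-sec aggregation head; hyper-parameters are illustrative.
import torch
import torch.nn as nn

class MultiSecHead(nn.Module):
    def __init__(self, num_classes, channels=256, pool="mean"):
        super().__init__()
        self.pool = pool  # "mean" or "max" pooling over each clip's time axis
        # additional 2-layer ConvNet over the stacked (clip, H, W) volume
        self.agg = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.cls = nn.Linear(channels, num_classes)

    def forward(self, clip_feats):
        # clip_feats: list of 5 tensors, each (K, 256, T_clip, 7, 7), one per
        # sampled clip and aligned to the same K linked proposals
        pooled = []
        for f in clip_feats:
            if self.pool == "mean":
                pooled.append(f.mean(dim=2))       # (K, 256, 7, 7)
            else:
                pooled.append(f.max(dim=2).values)
        vm = torch.stack(pooled, dim=2)            # (K, 256, 5, 7, 7), i.e. Vm
        vm = self.agg(vm).flatten(1)               # (K, 256)
        return self.cls(vm)

The mean and max settings of the pooling switch correspond to the average- and max-pooling variants reported as multi-sec (mean) and multi-sec (max) in Table 2.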
1.3. Ensemble with LFB

To better combine long-term information, we ensemble our results with those of LFB [4]. For overlap-
                        Frame-mAP
multi-sec (mean)           25.83
multi-sec (max)            25.3
LFB (single-crop) [4]      26.98
ensemble                   29.4

Table 2. Comparison results of each single model and ensemble model on validation set.
References
[1] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[2] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[4] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.