Video Tutorial CVPR19
Christoph Feichtenhofer
Facebook AI Research (FAIR)
C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast Networks for Video Recognition. Tech. Report, arXiv, 2018.
Sources: Johansson, G. “Visual perception of biological motion and a model for its analysis.” Perception & Psychophysics. 14(2):201-211. 1973.
→ “Interconnection”, e.g. in the STS area
Sources: “Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli.” Journal of Neurophysiology 65(6), 1991.
“A cortical representation of the local visual environment.” Nature 392(6676):598–601, 1998.
https://en.wikipedia.org/wiki/Two-streams_hypothesis
K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014
[Figure: the two-stream ST-ResNet architecture: an Appearance stream and a Motion stream, each a ResNet (conv1, conv2_x through conv5_x, loss), with cross-stream connections between corresponding layers.]
• ST-ResNet enables hierarchical learning of spacetime features by connecting the appearance and motion channels of a two-stream architecture.
C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proc. CVPR, 2016
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
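The cross-stream connection can be sketched in a few lines. This is my own minimal stand-in (a 1x1 channel-mixing matrix plus ReLU in place of the papers' convolutional transforms), not the authors' code:

```python
import numpy as np

def cross_stream_residual(appearance, motion, w):
    """Residual connection from the motion stream into the appearance
    stream: appearance <- appearance + f(motion). Here f is a 1x1
    channel-mixing matrix w followed by ReLU, a simplified stand-in
    for the convolutional transform used in the papers."""
    injected = np.maximum(motion @ w, 0.0)   # ReLU(1x1 "conv")
    return appearance + injected             # residual fusion

rng = np.random.default_rng(0)
app = rng.standard_normal((4, 8))            # (spatial positions, channels)
mot = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8)) * 0.1
fused = cross_stream_residual(app, mot, w)
print(fused.shape)  # (4, 8)
```

Because the injection is residual, later layers of the appearance stream see both appearance and motion evidence, which is what enables joint spacetime features.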
[Figure: an inflated architecture: residual blocks (res3, res4, res5) applied across frames, combined through temporal convolutions (*) and additions (+), followed by pooling and a fully connected layer.]
• Inflation transforms spatial filters into spatiotemporal ones (3D, or 2D spatial + 1D temporal)
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, 2017.
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
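A minimal numpy sketch of inflation (my own illustration, not code from these papers): the 2D kernel is replicated along a new temporal axis and divided by the temporal extent, so a static (temporally constant) video produces the same response as the original 2D filter.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D spatial kernel (kH, kW) into a 3D spatiotemporal
    kernel (t, kH, kW) by replication, rescaled by 1/t so responses on
    a temporally constant input are unchanged."""
    return np.repeat(w2d[np.newaxis, :, :], t, axis=0) / t

w2d = np.arange(9, dtype=np.float64).reshape(3, 3)
w3d = inflate_2d_to_3d(w2d, t=3)

# Summing the inflated kernel over time recovers the original 2D kernel,
# which is exactly the condition for equal responses on static video.
print(np.allclose(w3d.sum(axis=0), w2d))  # True
```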
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, 2017.
[Figure: visualization setup: appearance and motion inputs are optimized to maximize a chosen channel c (e.g. c = 4) of a fusion-layer feature map (height, width, depth) under the classification loss.]
C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman. What have we learned from deep representations for action recognition? In CVPR, 2018.
C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman. Deep insights into convolutional networks for video recognition. In IJCV, 2019.
[Figure: the same channel-maximization setup, with per-class activation strengths for classes such as Punch, JumpingJack, IceDancing, BenchPress, Archery, MilitaryParade, and BoxingPunchingBag.]
[Figure: maximum activation per class: Billiards, TableTennisShot, SoccerJuggling, Fencing, FieldHockeyPenalty, Basketball, Lunges, BaseballPitch, SoccerPenalty, FloorGymnastics.]
[Figure: last-layer visualizations. One unit maximizes to “Billiards”; another to “CleanAndJerk”, decomposed into appearance, slow motion (e.g. “shaking with bar”), and fast motion (e.g. “push bar”). Prediction: “Head-butting” (Kinetics classification annotation).]
G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In Proc. ECCV, 2010.
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proc. ICCV, 2015.
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, 2017.
X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proc. CVPR, 2018.
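The non-local operation is, in essence, self-attention over all spacetime positions. A simplified numpy sketch (positions flattened into one axis, and omitting the paper's final 1x1 projection W_z for brevity):

```python
import numpy as np

def nonlocal_block(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian non-local operation on x of shape
    (N positions, C channels): every output position aggregates
    features from ALL positions, weighted by pairwise similarity."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    sim = theta @ phi.T                               # (N, N) similarities
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over positions
    return x + attn @ g                               # residual connection

x = np.random.default_rng(0).standard_normal((6, 4))
y = nonlocal_block(x, np.eye(4), np.eye(4), np.eye(4))
print(y.shape)  # (6, 4)
```

The residual form means the block can be inserted into a pretrained backbone without disturbing its behavior: with the output transform initialized to zero, the block is an identity.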
[Figure: a 3D CNN predicts actions from a short clip of 2-4 seconds.]
[Figure: a 3D CNN encodes a short clip; a feature bank operator (FBO) relates it to a Long-Term Feature Bank computed over the full video to predict actions.]
C.-Y. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krähenbühl, and R. Girshick. Long-Term Feature Banks for Detailed Video Understanding. In Proc. CVPR, 2019.
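A rough sketch of the idea (my own simplification, not the paper's FBO implementation): short-clip features attend over a bank of features precomputed across the whole video, and the attended read-out is appended to the clip features.

```python
import numpy as np

def feature_bank_read(short, bank):
    """short: (n, d) features of the current short clip; bank: (m, d)
    features precomputed over the whole video. Each clip feature is
    augmented with a similarity-weighted read from the long-term bank."""
    sim = short @ bank.T / np.sqrt(short.shape[1])    # scaled dot-product
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over bank entries
    return np.concatenate([short, attn @ bank], axis=1)

rng = np.random.default_rng(0)
out = feature_bank_read(rng.standard_normal((2, 8)),   # current clip
                        rng.standard_normal((50, 8)))  # full-video bank
print(out.shape)  # (2, 16)
```

The key property is that the bank covers the full video, so the 2-4 second clip is classified with minutes of temporal context at negligible extra cost.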
[Figure: SlowFast overview: a Slow pathway and a Fast pathway operate on the same T x H,W video at different temporal rates.]
C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast Networks for Video Recognition. Tech. Report, arXiv, 2018.
[Figure: the SlowFast network. The Slow pathway processes T frames with C channels per stage; the Fast pathway processes αT frames with βC channels; lateral connections fuse the Fast pathway into the Slow pathway before the joint prediction. Shown example: “Hand-clap” (AVA detection annotation), with e.g. α = 8 and β = 1/8.]
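Concretely, with typical values (a temporal stride of τ = 16 on a 64-frame clip; variable names are mine), the two pathways sample frames as in this sketch:

```python
import numpy as np

def sample_pathways(num_frames=64, tau=16, alpha=8):
    """Slow pathway samples one frame every tau frames (T = num_frames/tau
    frames); the Fast pathway samples alpha times more densely (alpha*T
    frames), so both see the same clip at different temporal rates."""
    slow = np.arange(0, num_frames, tau)           # T frames
    fast = np.arange(0, num_frames, tau // alpha)  # alpha * T frames
    return slow, fast

slow, fast = sample_pathways()
print(len(slow), len(fast))  # 4 32
```

Because the Fast pathway is kept very thin (βC channels), its 8x higher frame rate adds only a small fraction of the total computation.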
• Dimensions are {T×S², C} for temporal, spatial, and channel sizes
• Strides are {temporal, spatial²}
• The backbone is ResNet-50
• Residual blocks are shown by brackets
• Non-degenerate temporal filters are underlined
• Here the speed ratio is α = 8 and the channel ratio is β = 1/8
• Orange numbers mark the fewer channels of the Fast pathway
• Green numbers mark the higher temporal resolution of the Fast pathway
• No temporal pooling is performed throughout the hierarchy
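The channel arithmetic implied by β = 1/8 can be checked directly; a small sketch (stage widths are the standard ResNet-50 values):

```python
# Slow pathway uses standard ResNet-50 stage widths; the Fast pathway
# scales every width by the channel ratio beta = 1/8 (the "orange" numbers).
slow_channels = [64, 256, 512, 1024, 2048]   # conv1, res2, res3, res4, res5
beta = 1 / 8
fast_channels = [int(c * beta) for c in slow_channels]
print(fast_channels)  # [8, 32, 64, 128, 256]
```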
[Figure: learned conv1 filters of the Fast pathway for varying channel ratios β = 1/4, 1/6, 1/8, 1/16, 1/32, and for different inputs: rgb, grayscale, and time difference (dt).]
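The time-difference (dt) input variant can be sketched as simple frame differencing; this is my own minimal version, not the exact preprocessing used:

```python
import numpy as np

def time_diff(frames):
    """frames: (T, H, W) grayscale video; returns (T-1, H, W) frame
    differences, a cheap, appearance-suppressed motion-like input."""
    return frames[1:] - frames[:-1]

video = np.random.default_rng(0).standard_normal((8, 4, 4))
print(time_diff(video).shape)  # (7, 4, 4)
```

On a static video the differences vanish, which is why this input emphasizes motion over appearance.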
• +5.1% top-1 accuracy at 10% of the FLOPs
• Kinetics-600 has 392k training videos and 30k validation videos in 600 classes
1G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
1Gu et al. AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions, CVPR 2018
[Figure: SlowFast for detection: a region proposal network (RPN) operates on the Slow pathway; features from both pathways are concatenated and pooled with RoIAlign to produce detections.]

SlowFast ablations on AVA action detection: +5.2 mAP.
State-of-the-art comparison on AVA: improvements of +6.4 mAP and +13 mAP, reaching 34.25 mAP.
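At shape level, the detection head's fusion step can be sketched as a channel-wise concatenation of the temporally pooled RoI features from the two pathways (a sketch assuming C = 2048 and β = 1/8; names are mine):

```python
import numpy as np

def fuse_roi_features(slow_roi, fast_roi):
    """Temporally pooled RoIAlign outputs of the two pathways are
    concatenated channel-wise before the classifier.
    slow_roi: (n_rois, C); fast_roi: (n_rois, beta*C)."""
    return np.concatenate([slow_roi, fast_roi], axis=1)

fused = fuse_roi_features(np.zeros((5, 2048)),  # Slow pathway RoI features
                          np.zeros((5, 256)))   # Fast pathway RoI features
print(fused.shape)  # (5, 2304)
```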
Conclusion
• The time axis is a special dimension of video.
• 3D ConvNets treat space and time uniformly.
• SlowFast and Two-Stream networks share motivation from biological studies.
• We investigate an architecture design that focuses on contrasting speeds along the temporal axis.
• The SlowFast architecture achieves state-of-the-art accuracy for video action classification and detection without any (e.g. ImageNet) pretraining.
• Given the mutual benefits of jointly modeling video at different temporal speeds, we hope this concept can foster further research in video analysis.
FAIR is hiring Research Engineers: Menlo Park, CA and Seattle, WA.