Video Tutorial CVPR19
Christoph Feichtenhofer
Facebook AI Research (FAIR)
C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast Networks for Video Recognition. Tech. Report, arXiv, 2018.
Sources: Johansson, G. “Visual perception of biological motion and a model for its analysis.” Perception & Psychophysics. 14(2):201-211. 1973.
→ “Interconnection”, e.g. in the STS area
Sources: “Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli.” Journal of Neurophysiology 65(6), 1991.
“A cortical representation of the local visual environment.” Nature 392(6676):598–601, 1998.
https://en.wikipedia.org/wiki/Two-streams_hypothesis
K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014
[Figure: the two-stream ST-ResNet architecture: an Appearance stream and a Motion stream, each a ResNet (conv1, conv2_x through conv5_x, loss), with cross-stream connections between corresponding layers.]
• ST-ResNet enables hierarchical learning of spacetime features by connecting the appearance and motion channels of a two-stream architecture.
C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proc. CVPR, 2016
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
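The cross-stream connection can be sketched in a few lines. This is my own minimal stand-in (a 1x1 channel-mixing matrix plus ReLU in place of the papers' convolutional transforms), not the authors' code:

```python
import numpy as np

def cross_stream_residual(appearance, motion, w):
    """Residual connection from the motion stream into the appearance
    stream: appearance <- appearance + f(motion). Here f is a 1x1
    channel-mixing matrix w followed by ReLU, a simplified stand-in
    for the convolutional transform used in the papers."""
    injected = np.maximum(motion @ w, 0.0)   # ReLU(1x1 "conv")
    return appearance + injected             # residual fusion

rng = np.random.default_rng(0)
app = rng.standard_normal((4, 8))            # (spatial positions, channels)
mot = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8)) * 0.1
fused = cross_stream_residual(app, mot, w)
print(fused.shape)  # (4, 8)
```

Because the injection is residual, later layers of the appearance stream see both appearance and motion evidence, which is what enables joint spacetime features.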
[Figure: an inflated architecture: residual blocks (res3, res4, res5) applied across frames, combined through temporal convolutions (*) and additions (+), followed by pooling and a fully connected layer.]
• Inflation transforms spatial filters into spatiotemporal ones (3D, or 2D spatial + 1D temporal)
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, 2017.
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
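A minimal numpy sketch of inflation (my own illustration, not code from these papers): the 2D kernel is replicated along a new temporal axis and divided by the temporal extent, so a static (temporally constant) video produces the same response as the original 2D filter.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D spatial kernel (kH, kW) into a 3D spatiotemporal
    kernel (t, kH, kW) by replication, rescaled by 1/t so responses on
    a temporally constant input are unchanged."""
    return np.repeat(w2d[np.newaxis, :, :], t, axis=0) / t

w2d = np.arange(9, dtype=np.float64).reshape(3, 3)
w3d = inflate_2d_to_3d(w2d, t=3)

# Summing the inflated kernel over time recovers the original 2D kernel,
# which is exactly the condition for equal responses on static video.
print(np.allclose(w3d.sum(axis=0), w2d))  # True
```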
C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, 2017.
[Figure: visualization setup: appearance and motion inputs are optimized to maximize a chosen channel c (e.g. c = 4) of a fusion-layer feature map (height, width, depth) under the classification loss.]
C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman. What have we learned from deep representations for action recognition? In CVPR, 2018.
C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman. Deep insights into convolutional networks for video recognition. In IJCV, 2019.
[Figure: the same channel-maximization setup, with per-class activation strengths for classes such as Punch, JumpingJack, IceDancing, BenchPress, Archery, MilitaryParade, and BoxingPunchingBag.]
[Figure: maximum activation per class: Billiards, TableTennisShot, SoccerJuggling, Fencing, FieldHockeyPenalty, Basketball, Lunges, BaseballPitch, SoccerPenalty, FloorGymnastics.]
[Figure: last-layer visualizations. One unit maximizes to “Billiards”; another to “CleanAndJerk”, decomposed into appearance, slow motion (e.g. “shaking with bar”), and fast motion (e.g. “push bar”). Prediction: “Head-butting” (Kinetics classification annotation).]
G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In Proc. ECCV, 2010.
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proc. ICCV, 2015.
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, 2017.
X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proc. CVPR, 2018.
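The non-local operation is, in essence, self-attention over all spacetime positions. A simplified numpy sketch (positions flattened into one axis, and omitting the paper's final 1x1 projection W_z for brevity):

```python
import numpy as np

def nonlocal_block(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian non-local operation on x of shape
    (N positions, C channels): every output position aggregates
    features from ALL positions, weighted by pairwise similarity."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    sim = theta @ phi.T                               # (N, N) similarities
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over positions
    return x + attn @ g                               # residual connection

x = np.random.default_rng(0).standard_normal((6, 4))
y = nonlocal_block(x, np.eye(4), np.eye(4), np.eye(4))
print(y.shape)  # (6, 4)
```

The residual form means the block can be inserted into a pretrained backbone without disturbing its behavior: with the output transform initialized to zero, the block is an identity.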
[Figure: a 3D CNN predicts actions from a short clip of 2-4 seconds.]
[Figure: a 3D CNN encodes a short clip; a feature bank operator (FBO) relates it to a Long-Term Feature Bank computed over the full video to predict actions.]
C.-Y. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krähenbühl, and R. Girshick. Long-Term Feature Banks for Detailed Video Understanding. In Proc. CVPR, 2019.
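A rough sketch of the idea (my own simplification, not the paper's FBO implementation): short-clip features attend over a bank of features precomputed across the whole video, and the attended read-out is appended to the clip features.

```python
import numpy as np

def feature_bank_read(short, bank):
    """short: (n, d) features of the current short clip; bank: (m, d)
    features precomputed over the whole video. Each clip feature is
    augmented with a similarity-weighted read from the long-term bank."""
    sim = short @ bank.T / np.sqrt(short.shape[1])    # scaled dot-product
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over bank entries
    return np.concatenate([short, attn @ bank], axis=1)

rng = np.random.default_rng(0)
out = feature_bank_read(rng.standard_normal((2, 8)),   # current clip
                        rng.standard_normal((50, 8)))  # full-video bank
print(out.shape)  # (2, 16)
```

The key property is that the bank covers the full video, so the 2-4 second clip is classified with minutes of temporal context at negligible extra cost.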
[Figure: SlowFast overview: a Slow pathway and a Fast pathway operate on the same T x H,W video at different temporal rates.]
C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast Networks for Video Recognition. Tech. Report, arXiv, 2018.
[Figure: the SlowFast network. The Slow pathway processes T frames with C channels per stage; the Fast pathway processes αT frames with βC channels; lateral connections fuse the Fast pathway into the Slow pathway before the joint prediction. Shown example: “Hand-clap” (AVA detection annotation), with e.g. α = 8 and β = 1/8.]
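Concretely, with typical values (a temporal stride of τ = 16 on a 64-frame clip; variable names are mine), the two pathways sample frames as in this sketch:

```python
import numpy as np

def sample_pathways(num_frames=64, tau=16, alpha=8):
    """Slow pathway samples one frame every tau frames (T = num_frames/tau
    frames); the Fast pathway samples alpha times more densely (alpha*T
    frames), so both see the same clip at different temporal rates."""
    slow = np.arange(0, num_frames, tau)           # T frames
    fast = np.arange(0, num_frames, tau // alpha)  # alpha * T frames
    return slow, fast

slow, fast = sample_pathways()
print(len(slow), len(fast))  # 4 32
```

Because the Fast pathway is kept very thin (βC channels), its 8x higher frame rate adds only a small fraction of the total computation.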
• Dimensions are {T×S², C} for temporal, spatial, and channel sizes
• Strides are {temporal, spatial²}
• The backbone is ResNet-50
• Residual blocks are shown by brackets
• Non-degenerate temporal filters are underlined
• Here the speed ratio is α = 8 and the channel ratio is β = 1/8
• Orange numbers mark the fewer channels of the Fast pathway
• Green numbers mark the higher temporal resolution of the Fast pathway
• No temporal pooling is performed throughout the hierarchy
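The channel arithmetic implied by β = 1/8 can be checked directly; a small sketch (stage widths are the standard ResNet-50 values):

```python
# Slow pathway uses standard ResNet-50 stage widths; the Fast pathway
# scales every width by the channel ratio beta = 1/8 (the "orange" numbers).
slow_channels = [64, 256, 512, 1024, 2048]   # conv1, res2, res3, res4, res5
beta = 1 / 8
fast_channels = [int(c * beta) for c in slow_channels]
print(fast_channels)  # [8, 32, 64, 128, 256]
```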
[Figure: learned conv1 filters of the Fast pathway for varying channel ratios β = 1/4, 1/6, 1/8, 1/16, 1/32, and for different inputs: rgb, grayscale, and time difference (dt).]
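The time-difference (dt) input variant can be sketched as simple frame differencing; this is my own minimal version, not the exact preprocessing used:

```python
import numpy as np

def time_diff(frames):
    """frames: (T, H, W) grayscale video; returns (T-1, H, W) frame
    differences, a cheap, appearance-suppressed motion-like input."""
    return frames[1:] - frames[:-1]

video = np.random.default_rng(0).standard_normal((8, 4, 4))
print(time_diff(video).shape)  # (7, 4, 4)
```

On a static video the differences vanish, which is why this input emphasizes motion over appearance.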
• +5.1% top-1 accuracy at 10% of the FLOPs
• Kinetics-600 has 392k training videos and 30k validation videos in 600 classes
1G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
1Gu et al. AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions, CVPR 2018
[Figure: SlowFast for detection: a region proposal network (RPN) operates on the Slow pathway; features from both pathways are concatenated and pooled with RoIAlign to produce detections.]

SlowFast ablations on AVA action detection: +5.2 mAP.
State-of-the-art comparison on AVA: improvements of +6.4 mAP and +13 mAP, reaching 34.25 mAP.
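At shape level, the detection head's fusion step can be sketched as a channel-wise concatenation of the temporally pooled RoI features from the two pathways (a sketch assuming C = 2048 and β = 1/8; names are mine):

```python
import numpy as np

def fuse_roi_features(slow_roi, fast_roi):
    """Temporally pooled RoIAlign outputs of the two pathways are
    concatenated channel-wise before the classifier.
    slow_roi: (n_rois, C); fast_roi: (n_rois, beta*C)."""
    return np.concatenate([slow_roi, fast_roi], axis=1)

fused = fuse_roi_features(np.zeros((5, 2048)),  # Slow pathway RoI features
                          np.zeros((5, 256)))   # Fast pathway RoI features
print(fused.shape)  # (5, 2304)
```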
Conclusion
• The time axis is a special dimension of video.
• 3D ConvNets treat space and time uniformly.
• SlowFast and Two-Stream networks share motivation from biological studies.
• We investigate an architecture design that focuses on contrasting speeds along the temporal axis.
• The SlowFast architecture achieves state-of-the-art accuracy for video action classification and detection without any (e.g. ImageNet) pretraining.
• Given the mutual benefits of jointly modeling video at different temporal speeds, we hope this concept can foster further research in video analysis.
FAIR is hiring Research Engineers: Menlo Park, CA and Seattle, WA.