Object-Region Video Transformers
2021.10.27.
KAIST ALIN-LAB
Sangwoo Mo
Goal: Video Recognition
• Understand what is happening in the video (extension of image recognition)
• Action recognition (i.e., classification)
• Spatio-temporal action detection
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Extending the Vision Transformer (ViT), attention is applied over the 𝑇×𝐻𝑊 patch tokens
• Previous works focused on designing efficient attention over the patch tokens
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Naïve approach = Joint Attention (attention over all space-time patches; see the sketch below)
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
𝑆 = 𝐻𝑊
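For reference, a minimal single-head sketch of joint space-time attention in PyTorch (all sizes and variable names are illustrative, not taken from any released code):

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: T frames, S = H*W patches per frame, feature dim D.
T, S, D = 8, 196, 768
x = torch.randn(1, T * S, D)                 # all space-time patch tokens, (B, T*S, D)

Wq, Wk, Wv = torch.nn.Linear(D, D), torch.nn.Linear(D, D), torch.nn.Linear(D, D)
q, k, v = Wq(x), Wk(x), Wv(x)

# Every patch attends to every space-time patch: the attention map is (T*S) x (T*S),
# i.e., quadratic in the total number of patch tokens.
attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
out = attn @ v
```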
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Divided Attention: each patch attends to spatial and temporal patches alternately (see the sketch below)
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
𝑆 = 𝐻𝑊
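A similarly minimal sketch of the divided (factorized) attention, assuming the temporal step runs before the spatial step; the ordering and all sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # standard scaled dot-product attention over the last token axis
    a = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return a @ v

B, T, S, D = 1, 8, 196, 768
x = torch.randn(B, T, S, D)

# Temporal attention: each patch attends only to the same spatial position in other frames.
xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
xt = attend(xt, xt, xt).reshape(B, S, T, D).permute(0, 2, 1, 3)

# Spatial attention: each patch attends to the patches within its own frame.
xs = xt.reshape(B * T, S, D)
out = attend(xs, xs, xs).reshape(B, T, S, D)   # cost ~ T*S*(T+S) instead of (T*S)^2
```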
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Since divided attention (temporally) attends only to the same spatial position across frames,
it does not capture the moving trajectories of objects
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: divides the attention operation into two stages
1. Compute the attention map over all space-time patches (𝑠𝑡 × 𝑠′𝑡′),
then apply spatial pooling to form trajectory features (𝑠𝑡 × 𝑡′)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: divides the attention operation into two stages
2. Apply temporal attention over the trajectory features (one per 𝑠𝑡 query; see the sketch below)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
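A rough single-head sketch of the two stages, following the 𝑠𝑡 × 𝑠′𝑡′ notation above; the second-stage query is a random placeholder here (in the paper it is a learned projection), and all sizes are illustrative:

```python
import torch
import torch.nn.functional as F

B, T, S, D = 1, 8, 196, 64
q = torch.randn(B, T * S, D)     # one query per space-time location (st)
k = torch.randn(B, T * S, D)
v = torch.randn(B, T * S, D)

# Stage 1: attention over all space-time pairs (st x s't'), then pool spatially
# within each frame to get one trajectory token per (query, frame) pair (st x t').
a = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)                 # (B, T*S, T*S)
traj = torch.einsum('bqts,btsd->bqtd', a.view(B, T * S, T, S), v.view(B, T, S, D))

# Stage 2: 1D temporal attention over each query's trajectory tokens.
q2 = torch.randn(B, T * S, 1, D)                                          # placeholder second-stage query
a2 = F.softmax(q2 @ traj.transpose(-2, -1) / D ** 0.5, dim=-1)            # (B, T*S, 1, T)
out = (a2 @ traj).squeeze(2)                                              # (B, T*S, D)
```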
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: divides the attention operation into two stages
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
However, it still does not explicitly model objects!
It only aggregates the effects of all possible spatio-temporal relations
Method: Object-Region Video Transformer (ORViT)
• Idea: attention should be applied at the object level¹, in addition to the patch level
• Each patch attends to all objects and all patches across all time frames²
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. This increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which only slightly increases the computational cost (see the sketch below).
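As a back-of-the-envelope check of footnote 2 (sizes are illustrative, not from the paper):

```python
T, H, W, O = 16, 14, 14, 6            # O = number of tracked objects per frame
patch_tokens = T * H * W               # 3136 patch tokens
object_tokens = T * O                  # 96 extra object tokens
print(patch_tokens, patch_tokens + object_tokens)
# Keys/values grow by only ~3% here; queries remain patch tokens, so the
# attention matrix grows from THW x THW to THW x (THW + TO).
```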
Method: Object-Region Video Transformer (ORViT)
• Idea: attention should be applied at the object level¹, in addition to the patch level
• Each patch attends to all objects and all patches across all time frames²
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions over objects
• Dynamics of objects
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. This increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which only slightly increases the computational cost.
Method: Object-Region Video Transformer (ORViT)
• Idea: attention should be applied at the object level¹, in addition to the patch level
• Each patch attends to all objects and all patches across all time frames²
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions over objects
• Dynamics of objects
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. This increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which only slightly increases the computational cost.
Object-Region Attention
Object-Dynamics Module
Method: Object-Region Attention
• Object-Region Attention computes attention over both patches and objects
• Query: patches / Key & Value: patches + objects
• Object features are obtained by applying RoIAlign (and MaxPool) to the patch features,
with a coordinate embedding given by the sum of MLP(𝐵) and a learnable vector 𝑃
(see the sketch below)
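A minimal single-head sketch of the idea, assuming torchvision's roi_align; mlp_b, P, and all sizes are illustrative placeholders rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

T, H, W, O, D = 2, 14, 14, 3, 64             # illustrative sizes
patches = torch.randn(T, D, H, W)             # per-frame patch features as a feature map

def rand_boxes(n, size):
    # illustrative (x1, y1, x2, y2) boxes in feature-map coordinates
    xy1 = torch.rand(n, 2) * (size - 3)
    return torch.cat([xy1, xy1 + 1 + torch.rand(n, 2) * 2], dim=1)

boxes = [rand_boxes(O, H) for _ in range(T)]  # stands in for detector boxes per frame

# Object tokens: RoIAlign each box, then max-pool the crop to a single D-dim vector.
crops = roi_align(patches, boxes, output_size=(7, 7))   # (T*O, D, 7, 7)
obj = crops.amax(dim=(-2, -1))                           # (T*O, D)

# Coordinate embedding: MLP(B) plus a learnable per-object vector P.
mlp_b = torch.nn.Sequential(torch.nn.Linear(4, D), torch.nn.ReLU(), torch.nn.Linear(D, D))
P = torch.nn.Parameter(torch.zeros(O, D))
obj = obj + mlp_b(torch.cat(boxes)) + P.repeat(T, 1)

# Attention: queries are patch tokens only; keys/values are patches + object tokens.
x = patches.flatten(2).transpose(1, 2).reshape(1, T * H * W, D)
kv = torch.cat([x, obj.unsqueeze(0)], dim=1)             # (1, T*H*W + T*O, D)
attn = F.softmax(x @ kv.transpose(-2, -1) / D ** 0.5, dim=-1)
out = attn @ kv                                          # patch outputs enriched with object context
```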
Method: Object-Dynamics Module
• Object-Dynamics Module computes attention over the object locations
• Query & Key & Value: objects
• The coordinate embedding is given by the sum of MLP(𝐵) and a learnable vector 𝑃
• Then, the dynamics features are spatially expanded by the Box Position Encoder (see the sketch below)
Method: Overall ORViT Block
• Substitute (some of) the attention blocks with ORViT blocks
• It is important to apply the ORViT blocks in the lower layers (see the sketch below)
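Schematically, ORViT blocks replace standard transformer blocks at chosen (preferably early) depths; a toy placement sketch with hypothetical module names:

```python
import torch.nn as nn

# Hypothetical placeholders: StandardBlock / ORViTBlock stand in for the backbone's
# transformer block and its object-aware replacement; only the placement matters here.
class StandardBlock(nn.Identity): pass
class ORViTBlock(nn.Identity): pass

depth, orvit_layers = 12, {2}   # the ablation favors inserting ORViT early (e.g., layer 2)
blocks = nn.ModuleList(
    [ORViTBlock() if i in orvit_layers else StandardBlock() for i in range(depth)]
)
```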
Results: Action Recognition
• ORViT significantly improves the baseline models
• Note that the box quality is important, as shown in (a)
* Detected boxes are used for Diving48 and Epic-Kitchens100. Yet, ORViT still gives an 8% improvement on Diving48.
Results: Compositional Action Recognition
• ORViT is more effective in the following scenarios:¹
• Compositional: class = verb + noun / some test combinations are not in the training set
• Few-shot: train on base classes, and fine-tune on few-shot novel classes
1. Indeed, ORViT better disentangles the objects (nouns) and actions (verbs).
SomethingElse dataset
Results: Spatio-temporal Action Detection
• ORViT also works well for spatio-temporal action detection
• Apply an RoIAlign head on top of the spatio-temporal features
• All models use the same boxes; hence, they differ only in box classification
Results: Ablation Study
• All proposed components contribute to the performance
• It is crucial to apply the ORViT module in the lower layers (layer 2 ≫ layer 12)
• Cf. Trajectory attention performs the best
Results: Attention Maps (CLS)
• ORViT better attends to the salient objects of the video
• ORViT-Mformer consistently attends to the papers (the main objects of the video), while
Mformer attends to the human face (salient for the scene, but not for the whole video)
* Attention map corresponding to the CLS query.
Results: Attention Maps (Objects)
• The attention map of each object visualizes the regions it affects
• Note that the remote controllers attend to their own regions, while the hand has a broader attention map
* Attention map of each object to the patches.
Thank you for listening! 😀
