Object-Region Video Transformers
2021.10.27.
KAIST ALIN-LAB
Sangwoo Mo
Goal: Video Recognition
• Understand what is happening in the video (extension of image recognition)
• Action recognition (i.e., classification)
• Spatio-temporal action detection
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Extending the Vision Transformer (ViT), attention is applied over the 𝑇×𝐻𝑊 patch tokens
• Previous works focused on designing efficient attention over the patch tokens
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Naïve approach = Joint Attention (attention over all space-time patches; see the sketch below)
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
𝑆 = 𝐻𝑊
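For reference, a minimal single-head sketch of joint space-time attention in PyTorch (all sizes and variable names are illustrative, not taken from any released code):

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: T frames, S = H*W patches per frame, feature dim D.
T, S, D = 8, 196, 768
x = torch.randn(1, T * S, D)                 # all space-time patch tokens, (B, T*S, D)

Wq, Wk, Wv = torch.nn.Linear(D, D), torch.nn.Linear(D, D), torch.nn.Linear(D, D)
q, k, v = Wq(x), Wk(x), Wv(x)

# Every patch attends to every space-time patch: the attention map is (T*S) x (T*S),
# i.e., quadratic in the total number of patch tokens.
attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
out = attn @ v
```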
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Divided Attention: each patch attends to spatial and temporal patches alternately (see the sketch below)
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
𝑆 = 𝐻𝑊
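A similarly minimal sketch of the divided (factorized) attention, assuming the temporal step runs before the spatial step; the ordering and all sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # standard scaled dot-product attention over the last token axis
    a = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return a @ v

B, T, S, D = 1, 8, 196, 768
x = torch.randn(B, T, S, D)

# Temporal attention: each patch attends only to the same spatial position in other frames.
xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
xt = attend(xt, xt, xt).reshape(B, S, T, D).permute(0, 2, 1, 3)

# Spatial attention: each patch attends to the patches within its own frame.
xs = xt.reshape(B * T, S, D)
out = attend(xs, xs, xs).reshape(B, T, S, D)   # cost ~ T*S*(T+S) instead of (T*S)^2
```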
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Since divided attention (temporally) attends only to the same spatial position across frames,
it does not capture the moving trajectories of objects
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: divides the attention operation into two stages
1. Compute the attention map over all space-time patches (𝑠𝑡 × 𝑠′𝑡′),
then apply spatial pooling to form trajectory features (𝑠𝑡 × 𝑡′)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: divides the attention operation into two stages
2. Apply temporal attention over the trajectory features (one per 𝑠𝑡 query; see the sketch below)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
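A rough single-head sketch of the two stages, following the 𝑠𝑡 × 𝑠′𝑡′ notation above; the second-stage query is a random placeholder here (in the paper it is a learned projection), and all sizes are illustrative:

```python
import torch
import torch.nn.functional as F

B, T, S, D = 1, 8, 196, 64
q = torch.randn(B, T * S, D)     # one query per space-time location (st)
k = torch.randn(B, T * S, D)
v = torch.randn(B, T * S, D)

# Stage 1: attention over all space-time pairs (st x s't'), then pool spatially
# within each frame to get one trajectory token per (query, frame) pair (st x t').
a = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)                 # (B, T*S, T*S)
traj = torch.einsum('bqts,btsd->bqtd', a.view(B, T * S, T, S), v.view(B, T, S, D))

# Stage 2: 1D temporal attention over each query's trajectory tokens.
q2 = torch.randn(B, T * S, 1, D)                                          # placeholder second-stage query
a2 = F.softmax(q2 @ traj.transpose(-2, -1) / D ** 0.5, dim=-1)            # (B, T*S, 1, T)
out = (a2 @ traj).squeeze(2)                                              # (B, T*S, D)
```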
Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: divides the attention operation into two stages
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
However, it still does not explicitly model objects!
It only aggregates the effects of all possible spatio-temporal relations
Method: Object-Region Video Transformer (ORViT)
• Idea: attention should be applied at the object level¹, in addition to the patch level
• Each patch attends to all objects and all patches across all time frames²
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. This increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which only slightly increases the computational cost (see the sketch below).
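As a back-of-the-envelope check of footnote 2 (sizes are illustrative, not from the paper):

```python
T, H, W, O = 16, 14, 14, 6            # O = number of tracked objects per frame
patch_tokens = T * H * W               # 3136 patch tokens
object_tokens = T * O                  # 96 extra object tokens
print(patch_tokens, patch_tokens + object_tokens)
# Keys/values grow by only ~3% here; queries remain patch tokens, so the
# attention matrix grows from THW x THW to THW x (THW + TO).
```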
Method: Object-Region Video Transformer (ORViT)
• Idea: attention should be applied at the object level¹, in addition to the patch level
• Each patch attends to all objects and all patches across all time frames²
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions over objects
• Dynamics of objects
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. This increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which only slightly increases the computational cost.
Method: Object-Region Video Transformer (ORViT)
• Idea: attention should be applied at the object level¹, in addition to the patch level
• Each patch attends to all objects and all patches across all time frames²
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions over objects
• Dynamics of objects
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. This increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which only slightly increases the computational cost.
Object-Region Attention
Object-Dynamics Module
Method: Object-Region Attention
• Object-Region Attention computes attention over both patches and objects
• Query: patches / Key & Value: patches + objects
• Object features are obtained by applying RoIAlign (and MaxPool) to the patch features,
with a coordinate embedding given by the sum of MLP(𝐵) and a learnable vector 𝑃
(see the sketch below)
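A minimal single-head sketch of the idea, assuming torchvision's roi_align; mlp_b, P, and all sizes are illustrative placeholders rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

T, H, W, O, D = 2, 14, 14, 3, 64             # illustrative sizes
patches = torch.randn(T, D, H, W)             # per-frame patch features as a feature map

def rand_boxes(n, size):
    # illustrative (x1, y1, x2, y2) boxes in feature-map coordinates
    xy1 = torch.rand(n, 2) * (size - 3)
    return torch.cat([xy1, xy1 + 1 + torch.rand(n, 2) * 2], dim=1)

boxes = [rand_boxes(O, H) for _ in range(T)]  # stands in for detector boxes per frame

# Object tokens: RoIAlign each box, then max-pool the crop to a single D-dim vector.
crops = roi_align(patches, boxes, output_size=(7, 7))   # (T*O, D, 7, 7)
obj = crops.amax(dim=(-2, -1))                           # (T*O, D)

# Coordinate embedding: MLP(B) plus a learnable per-object vector P.
mlp_b = torch.nn.Sequential(torch.nn.Linear(4, D), torch.nn.ReLU(), torch.nn.Linear(D, D))
P = torch.nn.Parameter(torch.zeros(O, D))
obj = obj + mlp_b(torch.cat(boxes)) + P.repeat(T, 1)

# Attention: queries are patch tokens only; keys/values are patches + object tokens.
x = patches.flatten(2).transpose(1, 2).reshape(1, T * H * W, D)
kv = torch.cat([x, obj.unsqueeze(0)], dim=1)             # (1, T*H*W + T*O, D)
attn = F.softmax(x @ kv.transpose(-2, -1) / D ** 0.5, dim=-1)
out = attn @ kv                                          # patch outputs enriched with object context
```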
Method: Object-Dynamics Module
• Object-Dynamics Module computes attention over the object locations
• Query & Key & Value: objects
• The coordinate embedding is given by the sum of MLP(𝐵) and a learnable vector 𝑃
• Then, the dynamics features are spatially expanded by the Box Position Encoder (see the sketch below)
Method: Overall ORViT Block
• Substitute (some of) the attention blocks with ORViT blocks
• It is important to apply the ORViT blocks in the lower layers (see the sketch below)
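Schematically, ORViT blocks replace standard transformer blocks at chosen (preferably early) depths; a toy placement sketch with hypothetical module names:

```python
import torch.nn as nn

# Hypothetical placeholders: StandardBlock / ORViTBlock stand in for the backbone's
# transformer block and its object-aware replacement; only the placement matters here.
class StandardBlock(nn.Identity): pass
class ORViTBlock(nn.Identity): pass

depth, orvit_layers = 12, {2}   # the ablation favors inserting ORViT early (e.g., layer 2)
blocks = nn.ModuleList(
    [ORViTBlock() if i in orvit_layers else StandardBlock() for i in range(depth)]
)
```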
Results: Action Recognition
• ORViT significantly improves the baseline models
• Note that the box quality is important, as shown in (a)
* Detected boxes are used for Diving48 and Epic-Kitchens100. Yet, ORViT still gives an 8% improvement on Diving48.
Results: Compositional Action Recognition
• ORViT is more effective in the following scenarios:¹
• Compositional: class = verb + noun / some test combinations are not in the training set
• Few-shot: train on base classes, and fine-tune on few-shot novel classes
1. Indeed, ORViT better disentangles the objects (nouns) and actions (verbs).
SomethingElse dataset
Results: Spatio-temporal Action Detection
• ORViT also works well for spatio-temporal action detection
• Apply an RoIAlign head on top of the spatio-temporal features
• All models use the same boxes; hence, they differ only in box classification
Results: Ablation Study
• All proposed components contribute to the performance
• It is crucial to apply the ORViT module in the lower layers (layer 2 ≫ layer 12)
• Cf. Trajectory attention performs the best
Results: Attention Maps (CLS)
• ORViT better attends to the salient objects of the video
• ORViT-Mformer consistently attends to the papers (the main objects of the video), while
Mformer attends to the human face (salient for the scene, but not for the whole video)
* Attention map corresponding to the CLS query.
Results: Attention Maps (Objects)
• The attention map of each object visualizes the regions it affects
• Note that the remote controllers attend to their own regions, while the hand has a broader attention map
* Attention map of each object to the patches.
Thank you for listening! 😀
