1) The document presents Object-Region Video Transformers (ORViT) for video recognition. ORViT applies attention at both the patch level and the object level.
2) ORViT considers three aspects of objects: the objects themselves, interactions between objects, and object dynamics over time.
3) Experimental results show that ORViT outperforms baseline models on action recognition, compositional action recognition, and spatio-temporal action detection. ORViT captures object-level information and dynamics better than patch-level attention alone.
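
The idea of attending at both the patch and object levels can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pool_object_tokens`, `joint_attention`), the box format, the use of plain averaging to build object tokens, and the omission of learned query/key/value projections are all assumptions made to keep the example short. It only shows how patch tokens might attend over a combined set of patch and object tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_object_tokens(patches, boxes):
    """Average the patch features inside each box (hypothetical helper).

    patches: (T, H, W, D) patch features per frame.
    boxes:   (T, O, 4) integer boxes (y0, x0, y1, x1) in patch-grid units,
             end-exclusive; each box is assumed to be non-empty.
    Returns: (T, O, D), one token per object per frame.
    """
    T, O = boxes.shape[:2]
    D = patches.shape[-1]
    tokens = np.zeros((T, O, D))
    for t in range(T):
        for o in range(O):
            y0, x0, y1, x1 = boxes[t, o]
            tokens[t, o] = patches[t, y0:y1, x0:x1].mean(axis=(0, 1))
    return tokens


def joint_attention(patches, boxes):
    """Patch queries attend over patch tokens plus object tokens.

    Simplification: queries, keys, and values are the raw features
    (no learned projections, a single head).
    """
    T, H, W, D = patches.shape
    patch_tokens = patches.reshape(T * H * W, D)
    object_tokens = pool_object_tokens(patches, boxes).reshape(-1, D)
    # Keys/values cover both levels: patches and objects from all frames.
    keys = np.concatenate([patch_tokens, object_tokens], axis=0)
    scores = patch_tokens @ keys.T / np.sqrt(D)
    weights = softmax(scores, axis=-1)
    out = weights @ keys
    return out.reshape(T, H, W, D), weights


# Toy input: 2 frames, a 4x4 patch grid, 8-dim features, 2 objects per frame.
rng = np.random.default_rng(0)
patches = rng.normal(size=(2, 4, 4, 8))
boxes = np.array([
    [[0, 0, 2, 2], [1, 1, 4, 4]],
    [[0, 1, 2, 3], [2, 2, 4, 4]],
])
out, weights = joint_attention(patches, boxes)
print(out.shape, weights.shape)
```

Because the object tokens from every frame are included among the keys, each patch can attend to objects in other frames as well as its own. That is one simple way to expose object-level and cross-time information to patch-level attention. A fuller model would add learned projections and an explicit treatment of box trajectories.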