Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that reconstruct masked-out patches of the input image (a minimal sketch of this objective follows the list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
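To make the first approach concrete, below is a minimal sketch of a masked-patch prediction objective. The `patchify` helper, the stand-in encoder/decoder modules, and the masking ratio are illustrative assumptions rather than the exact formulation of any particular paper:

```python
import torch
import torch.nn as nn

# Minimal sketch of a masked-patch prediction objective (approach 1 above).
# The encoder/decoder here are stand-in modules, not the architecture of any
# specific paper; patch_size and mask_ratio are illustrative choices.

def patchify(images, patch_size=16):
    """Split (B, C, H, W) images into flattened patches (B, N, patch_dim)."""
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
    return patches

def masked_prediction_loss(images, encoder, decoder, mask_ratio=0.75):
    """Reconstruct randomly masked patches and score only the masked positions."""
    patches = patchify(images)                                 # (B, N, D)
    b, n, _ = patches.shape
    mask = torch.rand(b, n, device=images.device) < mask_ratio
    visible = patches * (~mask).unsqueeze(-1)                  # zero out masked patches
    latent = encoder(visible)                                  # (B, N, hidden)
    recon = decoder(latent)                                    # (B, N, D)
    loss = ((recon - patches) ** 2).mean(dim=-1)               # per-patch MSE
    return (loss * mask).sum() / mask.sum().clamp(min=1)

if __name__ == "__main__":
    # Toy stand-in networks; real work would use a ViT encoder here.
    enc = nn.Sequential(nn.Linear(768, 256), nn.GELU())
    dec = nn.Linear(256, 768)
    imgs = torch.randn(2, 3, 224, 224)
    print(masked_prediction_loss(imgs, enc, dec).item())
```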
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules such as the Perceiver, which aim to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with their weights frozen, showing that they can function as universal computation engines.
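The patch-based idea shared by ViT-style models can be sketched in a few lines: split the image into fixed-size patches, embed each patch as a token, and run a standard Transformer encoder over the token sequence. The dimensions and hyperparameters below are illustrative, not taken from any one of the papers above:

```python
import torch
import torch.nn as nn

class TinyPatchTransformer(nn.Module):
    """Toy patch-embedding + Transformer encoder, in the spirit of ViT-style models."""
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is a common way to implement patch embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos_embed)             # (B, N, dim)

model = TinyPatchTransformer()
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 196, 192])
```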
This document summarizes recent developments in action recognition using deep learning techniques. It discusses early approaches using improved dense trajectories and two-stream convolutional neural networks. It then focuses on advances using 3D convolutional networks, enabled by large video datasets like Kinetics. State-of-the-art results are achieved using inflated 3D convolutional networks and temporal aggregation methods like temporal linear encoding. The document provides an overview of popular datasets and challenges, and concludes with tips on training models at scale.
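As a point of reference for the 3D-convolution approaches mentioned above, the snippet below shows a single 3D convolution block applied to a video clip; the layer sizes are illustrative only and do not reproduce any specific C3D/I3D architecture:

```python
import torch
import torch.nn as nn

# A 3D convolution convolves jointly over space and time, the basic building
# block behind C3D/I3D-style action recognition models.
video = torch.randn(2, 3, 16, 112, 112)         # (batch, channels, frames, H, W)
block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),  # spatio-temporal convolution
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),         # pool spatially, keep temporal length
)
features = block(video)
print(features.shape)  # torch.Size([2, 64, 16, 56, 56])
```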
3. Bibliographic Information
• Title: MetaFormer is Actually What You Need for Vision [1]
• Authors: Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan (a Singapore-based research team)
• URL: https://arxiv.org/abs/2111.11418
Note: all figures and tables in this document without an explicit source citation are taken from the paper above.
30. References
1. Yu, Weihao, et al. "Metaformer is actually what you need for vision." arXiv preprint arXiv:2111.11418 (2021).
2. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
3. Tolstikhin, Ilya, et al. "Mlp-mixer: An all-mlp architecture for vision." arXiv preprint arXiv:2105.01601 (2021).
4. Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." International Conference on Machine Learning. PMLR, 2021.
5. Wang, Wenhai, et al. "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions." arXiv preprint arXiv:2102.12122 (2021).
6. Wightman, Ross, Hugo Touvron, and Hervé Jégou. "Resnet strikes back: An improved training procedure in timm." arXiv preprint arXiv:2110.00476 (2021).
7. Touvron, Hugo, et al. "Resmlp: Feedforward networks for image classification with data-efficient training." arXiv preprint arXiv:2105.03404 (2021).
8. Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." arXiv preprint arXiv:2103.14030 (2021).
9. Liu, Hanxiao, et al. "Pay Attention to MLPs." arXiv preprint arXiv:2105.08050 (2021).
10. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
11. Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE international conference on computer vision. 2017.
12. He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.
13. Kirillov, Alexander, et al. "Panoptic feature pyramid networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
14. Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.