This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to images divided into patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
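To make the patch-based approach concrete, here is a minimal PyTorch sketch, written for this summary rather than taken from any of the papers, of splitting an image into non-overlapping patches and feeding them to a standard Transformer encoder; the class token and positional embeddings that a real ViT adds are omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim): one token per patch

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
tokens = nn.TransformerEncoder(layer, num_layers=2)(patches)   # (2, 196, 768)
```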
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image (a minimal sketch of this objective follows the list).
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
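To make the self-distillation idea concrete, here is a minimal PyTorch sketch of a DINO-style objective: the student's prediction for one view is matched to a centered, sharpened teacher distribution computed from another view. This is an illustrative reconstruction; the function name and toy shapes are mine, and the temperature values are assumptions based on commonly cited defaults.

```python
import torch
import torch.nn.functional as F

def dino_style_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and the student.

    student_logits, teacher_logits: (B, K) projection-head outputs for two views of the same images.
    center: (K,) running mean of teacher outputs, subtracted to help prevent collapse.
    """
    teacher_probs = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# Toy usage; in DINO the teacher is an exponential-moving-average copy of the student.
B, K = 8, 1024
loss = dino_style_loss(torch.randn(B, K), torch.randn(B, K), center=torch.zeros(K))
```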
MLP-Mixer: An all-MLP Architecture for Vision (harmonylab)
Source: Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy: MLP-Mixer: An all-MLP architecture for vision, Advances in Neural Information Processing Systems 34 (2021)
URL: https://arxiv.org/abs/2105.01601
Summary: CNNs and Vision Transformers are currently the popular network families in image processing. This paper proposes MLP-Mixer, an architecture built entirely from multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers, with separate MLPs that mix information across channels and across tokens (spatial locations). On image classification benchmarks, the model attains scores competitive with state-of-the-art models, with comparable pre-training and inference cost.
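As an illustration of the two layer types described above, here is a minimal PyTorch sketch of one Mixer block, a simplification written for this summary rather than the reference implementation: a token-mixing MLP applied across patches and a channel-mixing MLP applied across features, each with layer normalization and a skip connection. The hidden sizes are arbitrary.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One Mixer block: a token-mixing MLP over the patch axis, then a channel-mixing MLP over features."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                                  # x: (B, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)                  # (B, dim, num_patches): mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))            # mix across channels
        return x

out = MixerBlock(num_patches=196, dim=512)(torch.randn(2, 196, 512))   # (2, 196, 512)
```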
[DL輪読会] PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection (Deep Learning JP)
This paper proposes a new method called PV-RCNN for 3D object detection from point clouds. It introduces two key modules: 1) A voxel-to-keypoint scene encoding module that extracts feature vectors for keypoints by combining features from voxel CNNs and point networks. 2) A RoI grid pooling module that computes feature vectors for regions of interest (RoIs) from the keypoint features to refine detections. Experiments on KITTI and Waymo datasets demonstrate that PV-RCNN achieves state-of-the-art performance for 3D object detection from point clouds.
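The RoI grid pooling idea can be sketched in a heavily simplified form: place a regular grid of points inside each RoI and aggregate the features of nearby keypoints into one vector per grid point. The toy snippet below assumes axis-aligned boxes and plain distance-threshold averaging; PV-RCNN itself uses rotated boxes and PointNet-style set abstraction, so treat this as an illustration of the data flow only.

```python
import torch

def roi_grid_pool(roi_centers, roi_sizes, keypoints, keypoint_feats, grid=6, radius=1.0):
    """Simplified RoI grid pooling: build a grid x grid x grid lattice inside each axis-aligned RoI
    and average the features of keypoints within `radius` of each grid point.

    roi_centers: (R, 3)  roi_sizes: (R, 3)  keypoints: (K, 3)  keypoint_feats: (K, C)
    """
    lin = torch.linspace(-0.5, 0.5, grid)
    offsets = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(-1, 3)  # (G, 3)
    grid_pts = roi_centers[:, None, :] + offsets[None] * roi_sizes[:, None, :]                  # (R, G, 3)
    dists = (grid_pts[:, :, None, :] - keypoints[None, None, :, :]).norm(dim=-1)                # (R, G, K)
    mask = (dists < radius).float()
    pooled = mask @ keypoint_feats / mask.sum(-1, keepdim=True).clamp(min=1.0)                  # (R, G, C)
    return pooled.reshape(roi_centers.shape[0], -1)  # flattened grid features, one vector per RoI

feats = roi_grid_pool(torch.rand(4, 3) * 40, torch.rand(4, 3) + 1.0,
                      torch.rand(500, 3) * 40, torch.randn(500, 64))  # (4, 6*6*6*64)
```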
This section introduces three recent computer-vision papers that each drop a long-standard component:
・Dropping ResNets: RepVGG: Making VGG-style ConvNets Great Again (the reparameterization trick is sketched after this list)
・Dropping BatchNorm: High-Performance Large-Scale Image Recognition Without Normalization
・Dropping attention: LambdaNetworks: Modeling Long-Range Interactions Without Attention
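For the RepVGG entry mentioned above, the core trick is structural reparameterization: the training-time block's parallel 3x3 convolution, 1x1 convolution, and identity branch are folded into a single 3x3 convolution for inference. The sketch below demonstrates the folding with BatchNorm fusion omitted for brevity, so it is a simplified illustration rather than the paper's full procedure.

```python
import torch
import torch.nn.functional as F

def reparameterize(w3x3, b3x3, w1x1, b1x1, channels):
    """Fold a parallel 3x3 conv, 1x1 conv, and identity branch into one 3x3 kernel.
    BatchNorm fusion is omitted; RepVGG first folds BN into each branch."""
    # Pad the 1x1 kernel to 3x3 so the two convolutions can be summed.
    w1x1_as_3x3 = F.pad(w1x1, [1, 1, 1, 1])
    # The identity branch is a 3x3 kernel with a 1 at the center of each channel's own filter.
    w_id = torch.zeros_like(w3x3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_as_3x3 + w_id, b3x3 + b1x1

# Check equivalence on random data (stride 1, padding 1, same in/out channels).
C = 8
x = torch.randn(1, C, 16, 16)
w3, b3 = torch.randn(C, C, 3, 3), torch.randn(C)
w1, b1 = torch.randn(C, C, 1, 1), torch.randn(C)
y_multi = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
w_fused, b_fused = reparameterize(w3, b3, w1, b1, C)
y_single = F.conv2d(x, w_fused, b_fused, padding=1)
assert torch.allclose(y_multi, y_single, atol=1e-4)
```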
The task is to detect helmet impacts in NFL games from video and player tracking data. A two-stage pipeline first detects helmets and then classifies each detection as an impact or non-impact. Post-processing applies temporal non-maximum suppression, guided by the tracking results, to reduce false positives. Multiple models are ensembled, and thresholds are tuned on a validation set for best performance.
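A rough sketch of what such temporal non-maximum suppression could look like, assuming each impact detection has already been matched to a tracked player; the field names and the four-frame window below are illustrative assumptions, not the actual solution's values.

```python
from collections import defaultdict

def temporal_nms(detections, window=4):
    """Keep at most one impact per tracked player within `window` frames, preferring higher scores.

    detections: list of dicts with keys 'player_id', 'frame', 'score' (illustrative schema).
    """
    kept = []
    by_player = defaultdict(list)
    for det in detections:
        by_player[det["player_id"]].append(det)
    for dets in by_player.values():
        dets.sort(key=lambda d: d["score"], reverse=True)   # highest-scoring impacts win
        for det in dets:
            if all(abs(det["frame"] - k["frame"]) > window
                   for k in kept if k["player_id"] == det["player_id"]):
                kept.append(det)
    return kept

dets = [{"player_id": "H22", "frame": 100, "score": 0.9},
        {"player_id": "H22", "frame": 102, "score": 0.6},   # suppressed: same player, 2 frames apart
        {"player_id": "V55", "frame": 101, "score": 0.8}]
print(temporal_nms(dets))
```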
3D Perception for Autonomous Driving - Datasets and Algorithms - (Kazuyuki Miyazawa)
This document summarizes several 3D perception datasets and algorithms for autonomous driving. It begins with a brief introduction of the presenter, Kazuyuki Miyazawa of Mobility Technologies Co., Ltd., and then covers popular datasets such as KITTI, ApolloScape, nuScenes, and the Waymo Open Dataset, describing their sensor setups, data formats, and licenses. It also summarizes seminal 3D object detection algorithms, such as PointNet, VoxelNet, and SECOND, that take point cloud data as input.
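As background on how point-cloud input is consumed by these detectors, here is a minimal PointNet-style encoder: a shared per-point MLP followed by permutation-invariant max pooling. This is a toy sketch with the input/feature transforms (T-Nets) and task heads omitted, not the detectors described in the slides.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: shared per-point MLP, then order-invariant max pooling."""
    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, points):                 # points: (B, N, 3)
        per_point = self.mlp(points)           # (B, N, feat_dim)
        return per_point.max(dim=1).values     # (B, feat_dim) global feature

feat = TinyPointNet()(torch.randn(2, 1024, 3))  # (2, 256)
```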
For some reason, several similar papers appeared on arXiv almost simultaneously:
■ 5/4
MLP-Mixer: An all-MLP Architecture for Vision
■ 5/6
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
■ 5/7
ResMLP: Feedforward networks for image classification with data-efficient training
(Bonus)