This document summarizes recent research on applying the self-attention mechanism of Transformers to domains beyond language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to images divided into patches. It also covers more general attention modules, such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
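As a rough illustration of the patch-based idea mentioned above (not code from any of the papers), the snippet below splits an image into fixed-size patches and flattens each patch into a token, which is the input format a ViT-style Transformer consumes. The patch size of 16 and the tensor shapes are illustrative assumptions.

```python
# Illustrative sketch: turn an image into a sequence of flattened patch tokens.
import torch

def image_to_patch_tokens(img, patch=16):
    """img: (B, C, H, W) -> (B, num_patches, C*patch*patch) patch tokens."""
    B, C, H, W = img.shape
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

tokens = image_to_patch_tokens(torch.randn(1, 3, 224, 224))  # -> (1, 196, 768)
```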
[DL Paper Reading Group] Neural Radiance Flow for 4D View Synthesis and Video Processing (NeRFlow), Deep Learning JP
Neural Radiance Flow (NeRFlow) is a method that extends Neural Radiance Fields (NeRF) to model dynamic scenes from video data. NeRFlow simultaneously learns two fields: a radiance field that reconstructs images as in NeRF, and a flow field that models how points in space move over time, supervised by optical flow. This allows it to render novel views at new time points. The model is trained end-to-end by minimizing losses for color reconstruction from volume rendering and for optical flow reconstruction. However, the method requires training a separate model for each scene and does not generalize to unseen scenes.
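A minimal sketch of the two-part training objective described above, assuming simple mean-squared-error terms and a hypothetical weighting factor (the paper's actual losses and regularizers differ in detail):

```python
# Hedged sketch of a NeRFlow-style objective: photometric loss from volume
# rendering plus an optical-flow reconstruction loss. flow_weight is a
# hypothetical hyperparameter, not a value from the paper.
import torch

def nerflow_style_loss(rendered_rgb, gt_rgb, rendered_flow, gt_flow, flow_weight=0.1):
    color_loss = torch.mean((rendered_rgb - gt_rgb) ** 2)   # rendered pixels vs. video frame
    flow_loss = torch.mean((rendered_flow - gt_flow) ** 2)  # predicted flow vs. estimated optical flow
    return color_loss + flow_weight * flow_loss

loss = nerflow_style_loss(torch.rand(64, 3), torch.rand(64, 3), torch.rand(64, 2), torch.rand(64, 2))
```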
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image (a minimal sketch of this idea follows the list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
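The sketch below illustrates the first approach in the list (masked patch prediction) in a simplified form; the module sizes, mask ratio, and pixel-regression target are assumptions for illustration rather than the setup of any specific paper.

```python
# Simplified masked-patch-prediction sketch for a ViT-style encoder:
# mask a random subset of patch tokens and regress the original patch pixels.
import torch
import torch.nn as nn

class MaskedPatchPredictor(nn.Module):
    def __init__(self, patch_dim=768, embed_dim=256, mask_ratio=0.5):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)              # patch embedding
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))    # learned mask token
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.Linear(embed_dim, patch_dim)            # reconstruct raw patch pixels
        self.mask_ratio = mask_ratio

    def forward(self, patches):                                   # patches: (B, N, patch_dim)
        tokens = self.embed(patches)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.decoder(self.encoder(tokens))
        return nn.functional.mse_loss(pred[mask], patches[mask])  # loss on masked positions only

loss = MaskedPatchPredictor()(torch.randn(2, 196, 768))
```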
These are the slides used in the following video on the YouTube nnabla channel.
[Deep Learning Training] Transformer Fundamentals and Applications, Part 1: The Basics of the Transformer
https://youtu.be/Ry_AeJzMzU0?si=YjSaRmhEQhaa43k-
References
・On the Opportunities and Risks of Foundation Models
https://arxiv.org/abs/2108.07258
・Attention Is All You Need
https://arxiv.org/abs/1706.03762
・An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://arxiv.org/abs/2010.11929
・FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/abs/2205.14135
・Gaussian Error Linear Units (GELUs)
https://arxiv.org/abs/1606.08415
・Language Modeling with Gated Convolutional Networks
https://arxiv.org/abs/1612.08083
・GLU Variants Improve Transformer
https://arxiv.org/abs/2002.05202
・Deep Residual Learning for Image Recognition
https://arxiv.org/abs/1512.03385
・Layer Normalization
https://arxiv.org/abs/1607.06450
・Learning Deep Transformer Models for Machine Translation
https://arxiv.org/abs/1906.01787
・Understanding the Difficulty of Training Transformers
https://arxiv.org/abs/2004.08249
・The Annotated Transformer
https://nlp.seas.harvard.edu/2018/04/03/attention.html
・Self-Attention with Relative Position Representations
https://arxiv.org/abs/1803.02155
・RoFormer: Enhanced Transformer with Rotary Position Embedding
https://arxiv.org/abs/2104.09864
These slides were used by Umemoto of our company at an internal technical study session.
They explain the Transformer, an architecture that has attracted much attention in recent years.
"Arithmer Seminar" is weekly held, where professionals from within and outside our company give lectures on their respective expertise.
The slides are made by the lecturer from outside our company, and shared here with his/her permission.
Arithmer Inc. is a mathematics company that originated from the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring advanced new AI systems into solutions across a wide range of fields. Our job is to figure out how to use AI well to make work more efficient and to produce results that benefit people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems can provide solutions to tough, complex issues. At Arithmer we believe it is our job to realize the potential of AI by improving work efficiency and producing more useful results for society.
Presentation material for the Computer Vision Study Group @ Kanto "ECCV Reading Session 2018", held on 2018/10/20.
Yew, Z. J., & Lee, G. H. (2018). 3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration. European Conference on Computer Vision.
This document summarizes a paper titled "DeepI2P: Image-to-Point Cloud Registration via Deep Classification". The paper proposes a method for estimating the camera pose within a point cloud map using a deep learning model. The model first classifies whether points in the point cloud fall within the camera's frustum or image grid. It then performs pose optimization to estimate the camera pose by minimizing the projection error of inlier points onto the image. The method achieves more accurate camera pose estimation compared to existing techniques based on feature matching or depth estimation. It provides a new approach for camera localization using point cloud maps without requiring cross-modal feature learning.
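As a rough, hypothetical sketch of the pose-optimization step described above (not the paper's implementation), the snippet below finds a camera pose that minimizes the reprojection error of classified inlier points; the pinhole intrinsics, the reduced 4-DoF pose parameterization, and the target pixel locations are all illustrative assumptions.

```python
# Hypothetical sketch: optimize a camera pose so inlier map points project
# close to their assigned image locations (reprojection-error minimization).
import numpy as np
from scipy.optimize import least_squares

def project(points_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    return np.stack([fx * points_cam[:, 0] / points_cam[:, 2] + cx,
                     fy * points_cam[:, 1] / points_cam[:, 2] + cy], axis=1)

def pose_residuals(pose, points_map, target_pixels):
    """pose = (tx, ty, tz, yaw): a reduced pose parameterization for illustration."""
    t, yaw = pose[:3], pose[3]
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0, 0.0, 1.0]])
    return (project(points_map @ R.T + t) - target_pixels).ravel()

# Inlier points and their target image locations would come from the classifier.
points = np.random.rand(50, 3) + [0.0, 0.0, 5.0]
targets = np.random.rand(50, 2) * [640, 480]
result = least_squares(pose_residuals, x0=np.zeros(4), args=(points, targets))
```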
Presentation material from the 4th All-Japan Computer Vision Study Group "Paper Reading Session on Recognizing and Understanding People", held on 2020/10/10.
The following two papers were covered:
Harmonious Attention Network for Person Re-identification. (CVPR2018)
Weakly Supervised Person Re-Identification (CVPR2019)
18. PointNet Recap: References
PointNet
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
PointNet++
Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. Conference on Neural Information Processing Systems (NeurIPS).
PointNeXt
Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H. A. A. K., Elhoseiny, M., & Ghanem, B. (2022). PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. Conference on Neural Information Processing Systems (NeurIPS).
38. Transformer Recap: References
Transformer
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
Vision Transformer
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
39. Transformer Recap: References
MLP-Mixer
Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., & Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP Architecture for Vision. Advances in Neural Information Processing Systems (NeurIPS).
MetaFormer (PoolFormer)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). MetaFormer Is Actually What You Need for Vision. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
64. Point Transformer V2
Wu, X., Lao, Y., Jiang, L., Liu, X., & Zhao, H. (2022). Point Transformer V2: Grouped Vector Attention and Partition-based Pooling. Advances in Neural Information Processing Systems (NeurIPS).
Improves performance over Point Transformer by introducing the following (a sketch of grouped vector attention follows this list):
Grouped vector attention
A stronger positional embedding
Partition-based pooling
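A minimal sketch of the grouped-vector-attention idea, heavily simplified from the paper: instead of a single scalar attention weight per neighbor, an MLP on the query-key relation produces one weight per channel group. The neighborhood construction, relation definition, and layer sizes are assumptions for illustration.

```python
# Simplified grouped vector attention over K neighbors of each point.
import torch
import torch.nn as nn

class GroupedVectorAttention(nn.Module):
    def __init__(self, dim=64, groups=8):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.weight_mlp = nn.Linear(dim, groups)       # relation -> one weight per channel group

    def forward(self, x):                              # x: (N, K, dim), features of K neighbors per point
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        rel = q[:, :1] - k                             # relation to the center point (assumed to be index 0)
        w = self.weight_mlp(rel).softmax(dim=1)        # (N, K, groups), normalized over neighbors
        v = v.reshape(*v.shape[:2], self.groups, -1)   # split channels into groups
        out = (w.unsqueeze(-1) * v).sum(dim=1)         # weighted aggregation per group
        return out.flatten(1)                          # (N, dim)

out = GroupedVectorAttention()(torch.randn(10, 16, 64))  # -> (10, 64)
```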
70. PointMixer
Choe, J., Park, C., Rameau, F., Park, J., & Kweon, I. S. (2022). PointMixer: MLP-Mixer for Point Cloud Understanding. European Conference on Computer Vision (ECCV).
To apply MLP-Mixer to sparse, unordered data such as point clouds, the token-mixing part is replaced with a combination of channel mixing and softmax.
Mixing is performed in three patterns: inter-set, intra-set, and hierarchical-set.
Highly efficient.
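A rough sketch of the replacement described above (my simplification, not the authors' code): scores produced by a channel MLP are normalized with a softmax over the points in a set, giving an order-invariant aggregation that also works for variable-size point sets.

```python
# Simplified softmax-based token mixing over an unordered point set.
import torch
import torch.nn as nn

class SoftmaxTokenMixing(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, dim)        # channel-mixing MLP producing per-point scores

    def forward(self, x):                       # x: (num_sets, set_size, dim)
        w = self.score(x).softmax(dim=1)        # normalize over the points in each set
        return (w * x).sum(dim=1)               # order-invariant aggregation per set

pooled = SoftmaxTokenMixing()(torch.randn(4, 24, 32))  # -> (4, 32)
```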
89. PCT: Point Cloud Transformer
Guo, M. H., Cai, J. X., Liu, Z. N., Mu, T. J., Martin, R. R., & Hu, S. M. (2021). PCT: Point Cloud Transformer. Computational Visual Media, 7(2), 187–199.
Point coordinates are converted into features, and, as in a standard Transformer, attention is computed from the dot product of keys and queries and used to weight the values.
Self-attention is computed among all pairs of points.
Offset attention, based on the Laplacian matrix used in graph theory, is introduced to implement permutation-invariant attention.
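A minimal sketch of the offset-attention idea as described above (my reading, not the authors' code): standard dot-product self-attention is computed over all points, and the offset between the input features and the attention output is fed forward, analogous to how a graph Laplacian L = D - A acts on node features. PCT's actual normalization of the attention map differs.

```python
# Simplified offset attention over all points.
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.lbr = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())   # "LBR"-style block (norm omitted)

    def forward(self, x):                                          # x: (B, N, dim) point features
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        attended = torch.softmax(scores, dim=-1) @ self.v(x)       # (B, N, dim)
        return self.lbr(x - attended) + x                          # offset (Laplacian-like) + residual

out = OffsetAttention()(torch.randn(2, 100, 64))
```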
112. Fast Point Transformer
Park, C., Jeong, Y., Cho, M., & Park, J. (2022). Fast Point Transformer. Conference on Computer Vision and Pattern Recognition (CVPR).
Introduces a lightweight self-attention block over local regions.
A voxel-hashing-based architecture makes inference 129x faster than Point Transformer.
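A toy sketch of the voxel-hashing idea (an illustrative assumption, not the paper's implementation): points are bucketed by their quantized coordinates so that self-attention only needs to be computed among points that share a voxel.

```python
# Bucket points by voxel so attention can be restricted to local regions.
import torch
from collections import defaultdict

def voxel_hash_buckets(xyz, voxel_size=0.1):
    """xyz: (N, 3) coordinates -> dict mapping voxel key to the indices of points inside it."""
    keys = torch.floor(xyz / voxel_size).long()      # integer voxel coordinates
    buckets = defaultdict(list)
    for i, key in enumerate(keys.tolist()):
        buckets[tuple(key)].append(i)                # hash on the voxel tuple
    return buckets

buckets = voxel_hash_buckets(torch.rand(1000, 3))
# self-attention would then run independently inside each bucket
```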
122. Point-BERT
Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., & Lu, J. (2022). Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling. Conference on Computer Vision and Pattern Recognition (CVPR).
Builds a pre-trained model for point cloud analysis.
Classification is performed by adding a two-layer MLP head.
Object part segmentation computes per-point labels from the features of several intermediate Transformer layers and the final layer.
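An illustrative sketch of the two heads described above (sizes and layer choices are assumptions, not the paper's configuration): a two-layer MLP for classification, and a per-point segmentation head that concatenates features from several intermediate layers and the final layer.

```python
# Hypothetical classification and part-segmentation heads on top of a point Transformer.
import torch
import torch.nn as nn

embed_dim, num_classes, num_parts = 384, 40, 50

# two-layer MLP classification head
cls_head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

# per-point segmentation from concatenated intermediate + final layer features
seg_head = nn.Linear(embed_dim * 3, num_parts)

def segment(layer_feats):
    """layer_feats: list of (B, N, embed_dim) features from chosen Transformer layers."""
    per_point = torch.cat(layer_feats, dim=-1)       # fuse intermediate and final layers
    return seg_head(per_point)                       # (B, N, num_parts) per-point logits

logits = segment([torch.randn(2, 1024, embed_dim) for _ in range(3)])
```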
149. Self-Positioning Point-based Transformer (SPoTr)
Park, J., Lee, S., Kim, S., Xiong, Y., & Kim, H. J. (2023). Self-Positioning Point-based Transformer for Point Cloud Understanding. Conference on Computer Vision and Pattern Recognition (CVPR).
To reduce resource usage, instead of computing self-attention among all pairs of points, it uses self-positioning points (SP points) that capture global and local features.
By computing local and global cross-attention with the SP points, it achieves SOTA on three benchmarks (SONN, SN-Part, and S3DIS).
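A minimal sketch of the SP-point idea (my own simplification, not the authors' code): a small set of learned query points cross-attends to all input points, so the cost scales with num_sp x N rather than N x N.

```python
# Simplified cross-attention from a few learned self-positioning queries to all points.
import torch
import torch.nn as nn

class SPPointCrossAttention(nn.Module):
    def __init__(self, dim=64, num_sp=32):
        super().__init__()
        self.sp_queries = nn.Parameter(torch.randn(num_sp, dim))   # learned SP-point queries
        self.kv = nn.Linear(dim, dim * 2)

    def forward(self, x):                                          # x: (B, N, dim) point features
        k, v = self.kv(x).chunk(2, dim=-1)
        q = self.sp_queries.unsqueeze(0).expand(x.shape[0], -1, -1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                            # (B, num_sp, dim) SP-point features

sp_feats = SPPointCrossAttention()(torch.randn(2, 2048, 64))       # cost ~ num_sp * N, not N * N
```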