These slides were used by Umemoto of our company at an internal technical study session.
They explain the Transformer, an architecture that has attracted considerable attention in recent years.
"Arithmer Seminar" is held weekly; professionals from within and outside our company give lectures on their respective areas of expertise.
The slides were prepared by the lecturer and are shared here with their permission.
Arithmer Inc. is a mathematics company that grew out of the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring advanced AI systems into solutions across a wide range of fields. Our research in modern mathematics and AI enables us to tackle tough, complex problems, and we see it as our job to use AI effectively to improve work efficiency and to produce results that are useful to people and society.
This document summarizes recent research on applying the self-attention mechanism of Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to images divided into patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
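As a concrete illustration of the patch-based input used by ViT-style models, here is a minimal PyTorch sketch (PyTorch is assumed; the class name and shapes are illustrative, not taken from any specific implementation):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token (ViT-style)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim): a token sequence for a Transformer encoder

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))  # tokens.shape == (1, 196, 768)

In the actual models, this token sequence (plus a class token and position embeddings) is then fed to a standard Transformer encoder.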
SAM (Segment Anything Model) is a promptable segmentation model that can segment objects in images from prompts such as points, bounding boxes, masks, and (experimentally) free-form text. It was trained on the SA-1B dataset of roughly 11 million images and over 1 billion masks, collected with a model-in-the-loop data engine. SAM uses a Transformer-based architecture with an image encoder, a prompt encoder, and a lightweight mask decoder. It achieves strong zero-shot segmentation performance without any fine-tuning on the target datasets.
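For reference, a sketch of how the released segment-anything package is typically used with a point prompt; the checkpoint path is a placeholder, and argument names should be verified against the installed version:

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)

# Prompt with one foreground point; SAM returns several candidate masks with confidence scores.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),   # 1 = foreground point, 0 = background point
    multimask_output=True,
)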
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include (a toy sketch of the masked-prediction idea follows this list):
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
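A toy sketch of the masked-prediction objective from point 1, assuming PyTorch (a simplified illustration, not the exact recipe of any of the papers above):

import torch
import torch.nn as nn

def masked_patch_loss(model, patches, mask_ratio=0.75):
    # patches: (B, N, D) patch tokens; hide a random subset and score the
    # reconstruction only on the hidden positions.
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio   # True = masked
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)      # zeros stand in for a learned [MASK] token
    pred = model(corrupted)                                       # (B, N, D) reconstruction
    return ((pred - patches) ** 2)[mask].mean()                   # MSE on masked tokens only

# Toy usage with a single Transformer encoder layer standing in for a ViT encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=1)
loss = masked_patch_loss(encoder, torch.randn(4, 196, 768))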
This document summarizes a research paper on scaling laws for neural language models (Kaplan et al., 2020). Some key findings of the paper include (the approximate functional forms are sketched after this list):
- Language model performance depends strongly on model scale and weakly on model shape. With enough compute and data, performance scales as a power law of parameters, compute, and data.
- Overfitting is universal, with penalties depending on the ratio of parameters to data.
- Large models are more sample-efficient and can reach the same performance with fewer optimization steps and fewer data points.
- The paper motivated subsequent work by OpenAI on applying scaling laws to other domains like computer vision and developing increasingly large language models like GPT-3.
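For reference, the power-law relationships reported in the paper take roughly the following form, where N is the number of (non-embedding) parameters, D the dataset size in tokens, C the compute budget, and N_c, D_c, C_c and the alpha exponents are empirically fitted constants (paraphrased from memory; exact values are in the paper):

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

and the joint parameter/data law, whose second term captures the overfitting penalty:

L(N, D) \approx \left[ \left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}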
This document summarizes recent developments in action recognition using deep learning techniques. It discusses early approaches using improved dense trajectories and two-stream convolutional neural networks. It then focuses on advances using 3D convolutional networks, enabled by large video datasets like Kinetics. State-of-the-art results are achieved using inflated 3D convolutional networks and temporal aggregation methods like temporal linear encoding. The document provides an overview of popular datasets and challenges and concludes with tips on training models at scale.
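As a minimal illustration of the 3D-convolution building block behind these video models, assuming PyTorch (shapes are illustrative):

import torch
import torch.nn as nn

# A 3D convolution slides its kernel over time as well as space, so a single layer
# already mixes information across neighbouring frames of a clip.
clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv3d(clip)                  # (1, 64, 16, 112, 112)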
23. References (Time Series Forecasting Transformers)
• [Vaswani+, NIPS’17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
• [Li+, NeurIPS’19] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. Wang, and X. Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In NeurIPS, 2019.
• [Zhou+, AAAI’21] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In AAAI, 2021.
• [Kitaev+, ICLR’20] N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
• [Liu+, ICLR’22] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In ICLR, 2022.
• [Wu+, NeurIPS’21] H. Wu, J. Xu, J. Wang, and M. Long. Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting. In NeurIPS, 2021.
• [Zhou+, ICML’22] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In ICML, 2022.
• [Woo+, arXiv’22] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. C. H. Hoi. ETSformer: Exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022.
24. References (Others)
• [Lai+, SIGIR’18] G. Lai, W. Chang, Y. Yang, and H. Liu. Modeling long- and short-term temporal patterns with deep neural networks. In SIGIR, 2018.
• [Salinas+, Int. J. Forecast.’20] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast., vol. 36, no. 3, pp. 1181-1191, 2020.
• [Oreshkin+, ICLR’20] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In ICLR, 2020.
• [Challu+, arXiv’22] C. Challu, K. G. Olivares, B. N. Oreshkin, F. Garza, M. Mergenthaler, and A. Dubrawski. N-HiTS: Neural hierarchical interpolation for time series forecasting. arXiv preprint arXiv:2201.12886, 2022.
• [Ishida+, ICML’20] T. Ishida, I. Yamane, T. Sakai, G. Niu, and M. Sugiyama. Do We Need Zero Training Loss After Achieving Zero Training Error? In ICML, 2020.
• [Li+, NIPS’18] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the Loss Landscape of Neural Nets. In NIPS, 2018.
• [Park+, ICLR’22] N. Park and S. Kim. How do vision transformers work? In ICLR, 2022.
• [Ogasawara+, IJCNN’10] E. Ogasawara, L. C. Martinez, D. de Oliveira, G. Zimbrão, G. L. Pappa, and M. Mattoso. Adaptive Normalization: A novel data normalization approach for non-stationary time series. In IJCNN, Barcelona, Spain, 2010, pp. 1-8, doi: 10.1109/IJCNN.2010.5596746.
• [Passalis+, IEEE TNNLS’20] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis. Deep Adaptive Input Normalization for Time Series Forecasting. IEEE TNNLS, vol. 31, no. 9, pp. 3760-3765, Sept. 2020, doi: 10.1109/TNNLS.2019.2944933.
• [Kim+, ICLR’22] T. Kim, J. Kim, Y. Tae, C. Park, J. Choi, and J. Choo. Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift. In ICLR, 2022.