This document summarizes recent research on applying the self-attention mechanism of Transformers to domains beyond language, such as computer vision. It discusses models that apply self-attention to images, including ViT, DeiT, and T2T, which run Transformers over divided image patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing they can function as universal computation engines.
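As a minimal sketch of the patch-based recipe these vision models share (the module name and all sizes below are illustrative assumptions, not any specific model's configuration): the image is cut into non-overlapping patches, each patch is linearly embedded, and the resulting token sequence is processed by a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """Split an image into patches, embed them, and run a Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch embedding implemented as a strided convolution.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                 # images: (B, 3, H, W)
        x = self.to_patches(images)            # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        return self.encoder(x + self.pos_emb)  # (B, num_patches, dim)

tokens = PatchSelfAttention()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 256])
```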
This is an introduction to the AAAI 2023 paper "Are Transformers Effective for Time Series Forecasting?" and the HuggingFace blog post "Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)".
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image (a minimal sketch follows this list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
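As a rough illustration of the masked-prediction idea (approach 1), here is a minimal sketch; the module name, sizes, and pixel-regression target are illustrative assumptions, not any specific method's exact formulation. Random patch tokens are replaced by a learned mask token, and the model regresses the original patch pixels at the masked positions.

```python
import torch
import torch.nn as nn

class MaskedPatchModel(nn.Module):
    """Minimal masked-patch prediction: mask patch tokens, regress their pixels."""
    def __init__(self, patch_dim=768, dim=256, depth=4, heads=8, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_dim)  # predict raw patch pixels

    def forward(self, patches, mask_ratio=0.5):  # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        mask = torch.rand(B, N, device=patches.device) < mask_ratio  # True = masked
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), tokens)
        pred = self.head(self.encoder(tokens + self.pos_emb))
        # Reconstruction loss only on the masked positions.
        return ((pred - patches) ** 2)[mask].mean()

loss = MaskedPatchModel()(torch.randn(2, 196, 768))
loss.backward()
```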
【DL輪読会】Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Deep Learning JP)
This document discusses a paper on visuomotor policy learning via action diffusion. The paper presents a method for training policies that map camera images directly to actions by formulating the policy as a conditional denoising diffusion process over actions: noise is added to demonstrated actions during training, and the policy learns to recover them conditioned on the visual observations. This lets the policy represent multimodal action distributions instead of collapsing to a single mode. The method can learn policies for complex manipulation tasks directly from pixels by imitation learning on demonstration data.
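As a rough sketch of this "add noise to actions, then denoise" training step, in the style of DDPM-based diffusion policies: the schedule, the ActionDenoiser network, and all shapes here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative denoiser: predicts the noise added to an action, given the
# noisy action, the diffusion timestep, and an observation embedding.
class ActionDenoiser(nn.Module):
    def __init__(self, act_dim=7, obs_dim=128, hidden=256, T=100):
        super().__init__()
        self.t_emb = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, t, obs):
        return self.net(torch.cat([noisy_action, obs, self.t_emb(t)], dim=-1))

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # DDPM noise schedule

def diffusion_policy_loss(model, actions, obs):
    """One training step: noise the demonstrated actions, predict the noise."""
    B = actions.shape[0]
    t = torch.randint(0, T, (B,))
    eps = torch.randn_like(actions)
    ab = alpha_bar[t].unsqueeze(-1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps  # forward diffusion
    return ((model(noisy, t, obs) - eps) ** 2).mean()    # denoising objective

model = ActionDenoiser()
loss = diffusion_policy_loss(model, torch.randn(8, 7), torch.randn(8, 128))
loss.backward()
```

At inference time, the same network would be applied iteratively, starting from Gaussian noise and denoising step by step to produce an action conditioned on the current observation.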
Variational Template Machine for Data-to-Text Generation (harmonylab)
Public URL: https://openreview.net/forum?id=HkejNgBtPB
Source: Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, Lei Li: Variational Template Machine for Data-to-Text Generation. 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 2020.
Summary: This paper proposes the Variational Template Machine (VTM), a Variational Autoencoder (VAE)-based method for the task of generating text from structured table data (data-to-text). Existing encoder-decoder approaches suffer from a lack of diversity in the generated sentences. Based on the claim that templates are key to generating diverse text, the paper proposes a VAE-based method in which templates can be learned. By explicitly separating the latent space into a template space and a content space, the proposed method can generate text that is both accurate and diverse. It also performs semi-supervised learning that exploits not only table-text pair data but also raw text data without accompanying tables.
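A minimal sketch of the core idea of splitting the latent space into a template part and a content part; the TwoLatentVAE name, encoder/decoder internals, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoLatentVAE(nn.Module):
    """VAE with the latent space split into template (z_t) and content (z_c)."""
    def __init__(self, in_dim=512, z_dim=64):
        super().__init__()
        self.enc_template = nn.Linear(in_dim, 2 * z_dim)  # mean and log-variance
        self.enc_content = nn.Linear(in_dim, 2 * z_dim)
        self.decoder = nn.Linear(2 * z_dim, in_dim)       # decodes [z_t; z_c]

    @staticmethod
    def sample(stats):
        # Reparameterization trick: z = mu + sigma * eps.
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + (0.5 * logvar).exp() * torch.randn_like(mu), mu, logvar

    def forward(self, x):
        z_t, mu_t, lv_t = self.sample(self.enc_template(x))
        z_c, mu_c, lv_c = self.sample(self.enc_content(x))
        recon = self.decoder(torch.cat([z_t, z_c], dim=-1))
        # Standard VAE objective: reconstruction + a KL term per latent space.
        kl = lambda mu, lv: 0.5 * (mu.pow(2) + lv.exp() - 1 - lv).sum(-1).mean()
        return ((recon - x) ** 2).mean() + kl(mu_t, lv_t) + kl(mu_c, lv_c)

loss = TwoLatentVAE()(torch.randn(16, 512))
loss.backward()
```

At generation time, resampling z_t while holding z_c fixed would vary the template for the same content, which is the intuition behind the separation.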
6. Key points of Flamingo
- Pretrained models are used with their weights frozen; only the domain-adaptation components between images and text are trained.
- Images and video (i.e., the visual input) are compressed into fixed-size vectors, which increases generality.
- Language model: the 70B-parameter Chinchilla (Hoffmann et al., 2022).
- Vision model: the 435M-parameter NFNet-F6 (Brock et al., 2021).
- XAttn-Dense layers, an architecture original to Flamingo, connect the pretrained language and vision models; these are the components that are trained (a minimal sketch follows this list).
- Inputs: images/video & natural language.
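As a rough illustration of the XAttn-Dense idea, here is a minimal sketch assuming a hypothetical GatedXAttnDense module: language-token hidden states query the visual latents through cross-attention, and tanh gates initialized at zero keep the frozen language model's behavior unchanged at the start of training. Dimensions and layer details are illustrative, not Flamingo's exact configuration.

```python
import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    """Language tokens attend to visual latents; tanh gates start closed (zero)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Gates initialized to zero: the block is an identity function at first,
        # so training starts from the frozen language model's pretrained behavior.
        self.gate_attn = nn.Parameter(torch.zeros(1))
        self.gate_ffw = nn.Parameter(torch.zeros(1))

    def forward(self, text, visual):  # text: (B, T, dim), visual: (B, V, dim)
        attn_out, _ = self.xattn(text, visual, visual)
        text = text + self.gate_attn.tanh() * attn_out
        return text + self.gate_ffw.tanh() * self.ffw(text)

out = GatedXAttnDense()(torch.randn(2, 10, 512), torch.randn(2, 64, 512))
```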
Andrew Brock, Soham De, Samuel L. Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv:2102.06171, 2021.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Eric Noland, Tom Hennigan, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv:2203.15556, 2022.
A Perceiver module compresses the image or video input into a fixed set of latent vectors; this component is also trained, and it is discussed later under related work.
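A minimal sketch of this fixed-size compression, assuming a hypothetical PerceiverResampler module: a fixed number of learned latent queries cross-attend to a variable-length sequence of visual features, so any number of frames maps to the same number of output vectors. Sizes and depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable-length visual sequence into a fixed set of latents."""
    def __init__(self, dim=512, num_latents=64, heads=8, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth))

    def forward(self, visual):  # visual: (B, N, dim), N can vary per input
        B = visual.shape[0]
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        for attn in self.layers:
            out, _ = attn(z, visual, visual)  # latents query the visual features
            z = z + out
        return z                              # always (B, num_latents, dim)

# Frames of any length map to the same fixed-size representation.
z = PerceiverResampler()(torch.randn(2, 300, 512))
print(z.shape)  # torch.Size([2, 64, 512])
```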