2. Goal: understand VQ-VAE at the implementation level
- First author: Aaron van den Oord
- Related papers by the same author:
  - Neural Discrete Representation Learning (NIPS 2017)
  - Generating Diverse High-Fidelity Images with VQ-VAE-2 (NeurIPS 2019)
- Overview:
  - Enables learning of discrete latent variables within the VAE framework and resolves the posterior collapse problem, making high-quality sampling of images, video, and audio possible.
(Figure: 256x256 samples generated by VQ-VAE-2)
3. Proposed method: how VQ-VAE is trained
- 1: Encode an image (e.g., 32x32x3) with a CNN, producing an 8x8xD feature map.
- 2: Replace each 1x1xD vector of the feature map with the nearest of K pre-prepared D-dimensional embedding vectors.
- 3: Decode the replaced 8x8xD tensor and train the model to reconstruct the original image (see the sketch below).
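A minimal sketch of this three-step pipeline in PyTorch. The encoder/decoder architectures, layer sizes, and all variable names here are illustrative assumptions, not the paper's exact networks:

```python
import torch
import torch.nn as nn

K, D = 512, 64  # codebook size and embedding dimension (assumed values)

encoder = nn.Sequential(                    # 32x32x3 -> 8x8xD (illustrative CNN)
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, D, 4, stride=2, padding=1),
)
decoder = nn.Sequential(                    # 8x8xD -> 32x32x3 (illustrative CNN)
    nn.ConvTranspose2d(D, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)
codebook = nn.Embedding(K, D)               # K embedding vectors of dimension D

x = torch.randn(8, 3, 32, 32)               # dummy batch of 32x32x3 images
z_e = encoder(x)                            # step 1: (8, D, 8, 8) feature map
flat = z_e.permute(0, 2, 3, 1).reshape(-1, D)        # one 1x1xD vector per position
idx = torch.cdist(flat, codebook.weight).argmin(1)   # step 2: nearest embedding
z_q = codebook(idx).view(8, 8, 8, D).permute(0, 3, 1, 2)
x_hat = decoder(z_q)                        # step 3: reconstruct the image
```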
6. Proposed method: how VQ-VAE is trained
- What is learned: the encoder/decoder parameters and the K x D embedding vectors.
- sg means stop-gradient, i.e., no gradient is computed through that term.
- For the reconstruction term, the gradient that reaches the embedding vectors is passed straight through to the encoder.
L = log p(x | z_q(x)) + ||sg[z_e(x)] - e||^2 + β ||z_e(x) - sg[e]||^2
(first term: reconstruction error; second term: pulls the embedding vectors toward the encoder vectors; third term: pulls the encoder vectors toward the embedding vectors)
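As a quick toy illustration of what sg does in this loss (not the notebook's code): detaching a tensor blocks the gradient, so each squared-error term updates only one side.

```python
import torch

z_e = torch.tensor([1.0, 2.0], requires_grad=True)  # encoder output z_e(x)
e   = torch.tensor([0.5, 1.5], requires_grad=True)  # embedding vector e

# ||sg[z_e(x)] - e||^2: gradient flows only into the embedding e
((z_e.detach() - e) ** 2).sum().backward()
print(z_e.grad)  # None  (blocked by detach)
print(e.grad)    # tensor([-1., -1.])  = 2 * (e - z_e)

e.grad = None
# β * ||z_e(x) - sg[e]||^2: gradient flows only into the encoder side
beta = 0.25
(beta * ((z_e - e.detach()) ** 2).sum()).backward()
print(z_e.grad)  # tensor([0.2500, 0.2500])  = 2 * beta * (z_e - e)
print(e.grad)    # None  (blocked by detach)
```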
11. Implementation: overall flow
- Notation:
  - B: batch size
  - C: number of channels
  - H: height
  - W: width
  - K: number of embedding vectors
  - D: dimension of the embedding vectors
- This walkthrough covers the implementation up to, but not including, sampling with PixelCNN.
- Reference implementation:
  https://nbviewer.jupyter.org/github/zalandoresearch/pytorch-vq-vae/blob/master/vq-vae.ipynb
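To make the notation concrete, here is how the encoder output is reshaped before quantization, following the referenced notebook's channels-last convention (shapes only; the values are illustrative assumptions):

```python
import torch

B, C, H, W = 32, 64, 8, 8   # encoder output shape; here C equals D
K, D = 512, 64

z_e = torch.randn(B, C, H, W)                  # encoder output, (B, C, H, W)
inputs = z_e.permute(0, 2, 3, 1).contiguous()  # channels last: (B, H, W, D)
flat_input = inputs.view(-1, D)                # (B*H*W, D): one vector per position
print(flat_input.shape)                        # torch.Size([2048, 64])
```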
16. Implementation: the VQ part
- encoding_indices: (B*W*H, 1)
  - the index of the nearest embedding vector by distance
- encodings: (B*W*H, K)
  - create a matrix of zeros
- encodings.scatter_(1, encoding_indices, 1)
  - along axis 1, turn the 0 at each index position into a 1 (making each row one-hot)
- quantized: (B, H, W, D)
  - replace each encoder vector with its embedding vector
(Diagram: encoding_indices → encodings → quantized)
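A sketch of this step in the style of the referenced notebook, reusing `flat_input` and the shapes from the previous sketch (the distance computation uses the usual expanded squared-Euclidean form):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(K, D)  # the K x D codebook

# squared distances ||z - e||^2 = ||z||^2 - 2 z·e + ||e||^2, shape (B*H*W, K)
distances = (flat_input.pow(2).sum(dim=1, keepdim=True)
             - 2 * flat_input @ embedding.weight.t()
             + embedding.weight.pow(2).sum(dim=1))

encoding_indices = distances.argmin(dim=1).unsqueeze(1)        # (B*W*H, 1)
encodings = torch.zeros(encoding_indices.shape[0], K)          # (B*W*H, K) zeros
encodings.scatter_(1, encoding_indices, 1)                     # one-hot rows
quantized = (encodings @ embedding.weight).view(B, H, W, D)    # embedded vectors
```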
17. Implementation: the VQ part
- Compute part of the loss.
  - inputs: z_e(x), quantized: e
  - detach implements the stop-gradient sg
- The line that follows exists to pass the gradient through to the input: replacing the encoder vectors with embedding vectors would otherwise stop the gradient from reaching the input.
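In code, continuing from the previous sketches (`inputs` and `quantized` as defined above; the paper suggests a commitment cost β around 0.25):

```python
import torch.nn.functional as F

beta = 0.25  # commitment cost

# ||sg[z_e(x)] - e||^2: moves the embedding vectors (inputs is detached)
q_latent_loss = F.mse_loss(quantized, inputs.detach())
# β * ||z_e(x) - sg[e]||^2: moves the encoder output (quantized is detached)
e_latent_loss = F.mse_loss(quantized.detach(), inputs)
vq_loss = q_latent_loss + beta * e_latent_loss

# straight-through estimator: the forward value is `quantized`, but the
# gradient of the reconstruction loss flows to `inputs` as if quantization
# were the identity function
quantized = inputs + (quantized - inputs).detach()
```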
18. Implementation: the VQ part
- In particular, the second term of the loss converges faster when updated with an exponential moving average (EMA).
- In general, let z_{i,j} denote the encoded vectors whose nearest embedding is e_i, and n_i their count; then the optimal e_i is simply their mean, e_i = (1/n_i) Σ_j z_{i,j}.
- However, because computation is done on mini-batches, it is better to use a moving average.
- Gamma is a hyperparameter, around 0.99.
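A sketch of the EMA codebook update, following the update rule N_i ← γN_i + (1−γ)n_i, m_i ← γm_i + (1−γ)Σ_j z_{i,j}, e_i ← m_i / N_i. The buffer names are assumptions, and the small epsilon is a simplification of the Laplace smoothing used in the referenced notebook:

```python
import torch

gamma = 0.99  # EMA decay rate

# assumed persistent buffers, initialized elsewhere:
#   ema_cluster_size: (K,)   running estimate of n_i
#   ema_w:            (K, D) running estimate of sum_j z_{i,j}
with torch.no_grad():
    ema_cluster_size = gamma * ema_cluster_size + (1 - gamma) * encodings.sum(0)
    dw = encodings.t() @ flat_input              # per-embedding sum of assigned vectors
    ema_w = gamma * ema_w + (1 - gamma) * dw
    # e_i = m_i / N_i; the epsilon guards against empty clusters
    embedding.weight.copy_(ema_w / (ema_cluster_size.unsqueeze(1) + 1e-5))
```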