[DL Paper Reading] Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
1. DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
Yuki Sato, University of Tsukuba M2
2. Bibliographic information
Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
Xingang Pan (1,2), Ayush Tewari (3), Thomas Leimkühler (1), Lingjie Liu (1,4), Abhimitra Meka (5), Christian Theobalt (1,2)
(1) Max Planck Institute (2) Saarbrücken Research Center (3) MIT (4) University of Pennsylvania (5) Google AR/VR
• Venue: SIGGRAPH 2023
• Project page: https://vcai.mpi-inf.mpg.de/projects/DragGAN/
• Reasons for selecting this paper
➢ It directly optimizes the latent code of a GAN-generated image, so it needs no additional network training and runs in a short time (a rough sketch of this idea follows below)
➢ It enables high-quality image editing through interactive manipulation
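As a rough illustration of the "directly optimize the latent code" idea mentioned above, here is a minimal PyTorch sketch. It is my own simplification, not the authors' implementation: the `generator` that returns an intermediate feature map, the single handle/target point pair, and the plain gradient step are all assumptions. The latent `w` is nudged so that the feature at the handle point moves slightly toward the target point.

```python
# Minimal sketch of point-based latent optimization (assumed PyTorch-style API,
# not the official DragGAN implementation).
import torch
import torch.nn.functional as F

def drag_step(generator, w, handle, target, lr=2e-3):
    """One optimization step that moves the feature at `handle` toward `target`.

    generator(w) is assumed to return (image, feature_map); `handle` and
    `target` are (x, y) pixel coordinates on the feature map.
    """
    w = w.detach().requires_grad_(True)
    _, feat = generator(w)                      # feat: (1, C, H, W)

    # Feature at the current handle point (detached, used as the supervision signal).
    f_handle = feat[:, :, handle[1], handle[0]].detach()

    # A point shifted one unit step from the handle toward the target.
    direction = torch.tensor(target, dtype=torch.float32) - torch.tensor(handle, dtype=torch.float32)
    step = direction / (direction.norm() + 1e-8)
    moved = (torch.tensor(handle, dtype=torch.float32) + step).round().long()

    # Motion-supervision-style loss: the feature at the shifted point should
    # match the feature that was at the handle point.
    f_moved = feat[:, :, moved[1], moved[0]]
    loss = F.l1_loss(f_moved, f_handle)

    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad                        # plain gradient step on the latent code
    return w.detach(), loss.item()
```

Because only the latent code is updated, no additional network has to be trained, which is why each edit can run quickly.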
5. StyleGAN
StyleGAN [1]
• Uses a mapping network to obtain a disentangled intermediate latent code and applies style-based normalization at each resolution, enabling high-resolution image generation with control over fine-grained features (a toy sketch follows after the references)
StyleGAN2 [2]
• Replaces AdaIN with a normalization based on the estimated standard deviation (weight demodulation) and improves the Generator and Discriminator architectures, raising the quality of the generated images
(Figure from [1])
1. Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
2. Karras, Tero, et al. "Analyzing and improving the image quality of StyleGAN." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
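To make the "mapping network + per-resolution style normalization" description above concrete, below is a minimal PyTorch sketch. It is a simplification, not the official code: the 512-dimensional latent, the 8-layer MLP, and the AdaIN layout are assumptions based on the general StyleGAN design. The mapping network turns z into an intermediate latent w, and AdaIN injects w as a per-channel scale and bias at each resolution; StyleGAN2 replaces this normalization with weight demodulation.

```python
# Minimal sketch of a StyleGAN-style mapping network and AdaIN modulation
# (layer sizes and structure simplified; not the official implementation).
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a latent z to a disentangled intermediate latent w with an MLP."""
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class AdaIN(nn.Module):
    """Normalizes features per channel, then applies a style-dependent scale and bias."""
    def __init__(self, w_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.style = nn.Linear(w_dim, channels * 2)

    def forward(self, x, w):
        scale, bias = self.style(w).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + scale[:, :, None, None]) + bias[:, :, None, None]

# Usage: w = MappingNetwork()(torch.randn(1, 512)); AdaIN then modulates each
# resolution's feature map with w, which is what gives per-level style control.
```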
6. Controllability of GANs
Editing the latent vector
• Relies on supervised learning with annotation data or 3D models
• Precise control is difficult, e.g. object positions can only be moved with low accuracy (a toy sketch of this kind of edit follows after the references)
Point-based methods
• Can manipulate image features independently and precisely
• GANWarping [3]: a point-based editing method, but some tasks, such as controlling 3D pose, remain difficult
• UserControllableLT [4]: edits an image by transforming the GAN's latent code from user input, but supports dragging in only one direction per image and cannot edit multiple points in different directions simultaneously
3. Wang, Sheng-Yu, David Bau, and Jun-Yan Zhu. "Rewriting geometric rules of a GAN." ACM Transactions on Graphics (TOG) 41.4 (2022): 1-16.
4. Endo, Yuki. "User-Controllable Latent Transformer for StyleGAN Image Layout Editing." arXiv preprint arXiv:2208.12408 (2022).
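For contrast with the point-based methods above, the latent-vector editing approach typically amounts to moving w along a semantic direction obtained with supervision. The toy sketch below is only an illustration (the `direction` vector and the `generator` call are assumptions, not from the slides); it also hints at why such edits act on the whole image rather than on a chosen point.

```python
# Toy sketch of latent-direction editing (the direction vector is assumed to be
# learned from annotation data or a 3D model, as described above).
import torch

def edit_latent(w, direction, strength=2.0):
    """Move the latent code along a semantic direction (e.g. 'pose' or 'smile').

    The edit shifts the entire latent code, so the whole image changes at once;
    this is why precise, point-level control is difficult with this approach.
    """
    return w + strength * direction / direction.norm()

# image_edited = generator(edit_latent(w, pose_direction))
```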
7. Point tracking
Goal: estimate the motion of corresponding points between consecutive images
• Optical flow estimation between consecutive frames
RAFT [5]
• Extracts per-pixel features, computes their correlations, and estimates flow through iterative refinement with an RNN
PIPs [6]
• Can track arbitrary pixels across multiple frames and infer their trajectories
Both methods require training a separate model for flow prediction (a simplified feature-matching sketch follows after the references)
5. Teed, Zachary, and Jia Deng. "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow." Computer Vision – ECCV 2020. Springer, 2020.
6. Harley, Adam W., Zhaoyuan Fang, and Katerina Fragkiadaki. "Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories." Computer Vision – ECCV 2022. Springer, 2022.
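As a rough picture of what point tracking over feature maps involves, here is a simplified nearest-neighbor matching sketch. RAFT and PIPs additionally train dedicated models; the sketch below skips that and only does a local nearest-neighbor search over precomputed feature maps, so the function and its parameters are assumptions for illustration.

```python
# Minimal sketch of nearest-neighbor point tracking on dense feature maps
# (a simplification; RAFT/PIPs additionally train dedicated flow models).
import torch

def track_point(feat_prev, feat_next, point, radius=5):
    """Find where `point` (x, y) from the previous frame moved to in the next frame.

    feat_prev, feat_next: (C, H, W) feature maps of consecutive frames.
    The search is restricted to a (2*radius+1)^2 window around the old location.
    """
    C, H, W = feat_next.shape
    x, y = point
    f_ref = feat_prev[:, y, x]                       # feature of the tracked point

    x0, x1 = max(0, x - radius), min(W, x + radius + 1)
    y0, y1 = max(0, y - radius), min(H, y + radius + 1)
    window = feat_next[:, y0:y1, x0:x1]              # (C, h, w) local search window

    # L2 distance between the reference feature and every feature in the window.
    dist = (window - f_ref[:, None, None]).pow(2).sum(dim=0)
    idx = dist.argmin()
    dy, dx = divmod(idx.item(), window.shape[2])
    return (x0 + dx, y0 + dy)                        # new (x, y) of the tracked point
```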