[DL輪読会] Recent Advances in Autoencoder-Based Representation Learning (Deep Learning JP)
1. Recent advances in autoencoder-based representation learning include incorporating meta-priors to encourage disentanglement and using rate-distortion and rate-distortion-usefulness tradeoffs to balance compression against reconstruction.
2. Variational autoencoders introduce priors to disentangle latent factors; recent work instead regularizes the aggregated posterior to encourage disentanglement directly.
3. The rate-distortion framework balances the rate of information transmission against reconstruction distortion, while the rate-distortion-usefulness framework additionally accounts for usefulness on downstream tasks.
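As a concrete illustration of the rate-distortion view, the β-VAE-style objective weights a KL (rate) term against a reconstruction (distortion) term. Below is a minimal numpy sketch, assuming a diagonal-Gaussian posterior and a unit-Gaussian prior; the function and variable names are illustrative, not taken from the papers:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Rate-distortion tradeoff: distortion + beta * rate.

    distortion: squared reconstruction error
    rate: KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form
    """
    distortion = np.sum((x - x_recon) ** 2)
    rate = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return distortion + beta * rate

# When the posterior equals the prior (mu=0, log_var=0), the rate term vanishes.
x = np.array([1.0, 2.0])
loss = beta_vae_loss(x, x, mu=np.zeros(2), log_var=np.zeros(2), beta=4.0)
# loss == 0.0: perfect reconstruction and zero KL
```

Raising `beta` above 1 pays more for rate, pushing the posterior toward the prior; this is the knob that trades reconstruction quality for more compressed (and, empirically, more disentangled) codes.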
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules like the Perceiver that aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities through frozen weights, showing they can function as universal computation engines.
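The patch-based input used by ViT-style models can be sketched in a few lines. This is a hedged illustration (not any model's official implementation), assuming an image whose height and width are divisible by the patch size:

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C): the token
    sequence a ViT-style model would linearly project and feed, together
    with position embeddings, into self-attention layers.
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    tokens = (img.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))
    return tokens

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768): 14x14 patches of 16*16*3 values each
```

The 196 patch tokens play the role words play in language models, which is what lets the Transformer architecture transfer almost unchanged to images.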
These slides were used by our colleague Umemoto at an internal technical study session.
They explain the Transformer, an architecture that has been attracting attention in recent years.
The "Arithmer Seminar" is held weekly; professionals from inside and outside our company lecture on their respective areas of expertise.
These slides were made by a lecturer from outside our company and are shared here with their permission.
Arithmer Inc. is a mathematics company that grew out of the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring advanced new AI systems to solutions in a wide range of fields. Our job is to think about how to use AI well to make work more efficient and to produce results that are useful to people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems enables us to provide solutions to tough, complex problems. At Arithmer, we believe it is our job to realize the potential of AI by improving work efficiency and producing results that are more useful to society.
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
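The self-distillation idea behind DINO in item 3 can be sketched as a cross-entropy between a sharpened, centered teacher distribution and the student distribution, with the teacher updated as an exponential moving average of the student. The numpy sketch below is a simplified stand-in for the paper's recipe; all names, temperatures, and the momentum value are illustrative:

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(teacher, student) on one pair of views.

    The teacher output is centered (to discourage collapse onto one
    dimension) and sharpened with a lower temperature; in the real
    method no gradient flows through the teacher.
    """
    p_t = softmax(teacher_logits - center, t_teacher)   # sharp target
    log_p_s = np.log(softmax(student_logits, t_student))
    return -np.sum(p_t * log_p_s, axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights as an exponential moving average of the student's."""
    return momentum * teacher_w + (1 - momentum) * student_w

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))          # batch of 4 views, 8-dim outputs
center = logits.mean(axis=0)              # a running mean in the paper
loss = dino_loss(logits, logits, center)
```

Hybrid methods such as iBOT (item 4) apply essentially this loss at the level of masked patch tokens as well, combining the masked-prediction and self-distillation signals.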
Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos (harmonylab)
Paper introduced
Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos
Source: Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova: Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8001-8008 (2019)
Abstract: Depth prediction from camera images is a necessary task for indoor and outdoor robot navigation. This work uses unsupervised learning to learn depth prediction from video together with the camera's ego-motion (its own movement). It extends a baseline model established in prior work with explicit modeling of individual moving objects and an online refinement procedure for the model. As a result, it substantially improves predictions in scenes containing a lot of object motion.
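In such unsupervised pipelines, the training signal comes from view synthesis: a neighboring frame is warped into the current view using the predicted depth and ego-motion, and a photometric error between the warped and actual frames is minimized. The sketch below assumes the warped frame has already been computed (real implementations need a differentiable bilinear sampler, and typically mix L1 with SSIM); it is illustrative only, not the paper's loss:

```python
import numpy as np

def photometric_loss(target, warped):
    """Photometric consistency between a target frame and a source frame
    warped into the target view via predicted depth and ego-motion.

    A plain mean-L1 term; if depth and ego-motion are predicted
    correctly (and nothing moves or occludes), the warp reconstructs
    the target and this loss goes to zero.
    """
    return np.abs(target - warped).mean()

target = np.ones((8, 8))          # toy frames
warped = np.ones((8, 8)) * 0.9    # imperfect reconstruction
loss = photometric_loss(target, warped)
```

Independently moving objects violate the static-scene assumption behind the warp, which is exactly why this paper's per-object motion modeling helps in dynamic scenes.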
3. Motivation
• Lidar is expensive
• Kinect is weak to ambient light (it is a Time-of-Flight laser sensor)
• Could image-based prediction be used to supplement these sensors? (The details cannot be disclosed here)
Keio University apparently has a group doing similar research:
Depth Interpolation via Smooth Surface Segmentation Using Tangent Planes Based on the Superpixels of a Color Image (2013)
Perhaps the Aoki laboratory?
8. Deep-learning-based methods
• Depth Map Prediction from a Single Image using a Multi-Scale Deep Network (2014)
• First, predict the overall depth (coarse network)
• Then, predict local depth (fine network)
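The coarse-to-fine structure of that 2014 approach can be sketched as: a coarse network predicts a low-resolution global depth map, which is upsampled and then refined by a fine network that adds local detail. The numpy stand-in below uses placeholder functions for the two networks; it shows the data flow only, not the paper's architecture:

```python
import numpy as np

def upsample_nearest(depth, factor):
    """Nearest-neighbor upsampling of an (H, W) coarse depth map."""
    return depth.repeat(factor, axis=0).repeat(factor, axis=1)

def coarse_to_fine(image, coarse_net, fine_net, factor=4):
    """Two-stage prediction: global coarse depth, then local refinement.

    coarse_net: image -> low-resolution depth (global scene layout)
    fine_net:   (image, upsampled coarse depth) -> residual local detail
    """
    coarse = coarse_net(image)                       # (H/f, W/f)
    coarse_up = upsample_nearest(coarse, factor)     # (H, W)
    return coarse_up + fine_net(image, coarse_up)    # refined (H, W)

# Toy placeholders standing in for the two CNNs:
img = np.zeros((8, 8))
coarse_net = lambda im: np.full((2, 2), 5.0)         # flat global guess
fine_net = lambda im, c: np.zeros_like(c)            # no local correction
depth = coarse_to_fine(img, coarse_net, fine_net)
print(depth.shape)  # (8, 8)
```

Splitting global layout from local refinement is the key design choice: the coarse stage sees the whole image and fixes the overall scale, which single-patch local prediction cannot recover on its own.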