This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules like the Perceiver that aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities through frozen weights, showing they can function as universal computation engines.
2. Bibliography
[1] Vincent, P. (2011).
A connection between score matching and denoising autoencoders.
Neural computation, 23(7), 1661-1674.
[2] Song, Y., & Ermon, S. (2019).
Generative modeling by estimating gradients of the data distribution.
Advances in Neural Information Processing Systems, 32.
[3] Ho, J., Jain, A., & Abbeel, P. (2020).
Denoising diffusion probabilistic models.
Advances in Neural Information Processing Systems, 33, 6840-6851.
[4] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020).
Score-based generative modeling through stochastic differential equations.
arXiv preprint arXiv:2011.13456.
[5] Anderson, B. D. (1982).
Reverse-time diffusion equation models.
Stochastic Processes and their Applications, 12(3), 313-326.
[6] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022).
High-resolution image synthesis with latent diffusion models.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).
[7] Song, Y., Shen, L., Xing, L., & Ermon, S. (2021).
Solving inverse problems in medical imaging with score-based generative models.
arXiv preprint arXiv:2111.08005.
Other reference URLs:
What are Diffusion Models?
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Generative Modeling by Estimating Gradients of the Data Distribution
https://yang-song.net/blog/2021/score/
Reason for selection
Interest in understanding the fundamentals of diffusion models and in the latest application examples:
• Understanding the core ideas behind diffusion models
• Recent application examples
10. The challenge of computational cost
Each generation run evaluates the U-Net 1,000 times sequentially, so sampling is computationally heavy.
Architecture
• U-Net based (with various improvements)
• One U-Net evaluation per update x_{t−1} ← x_t
• T = 1000
• All computation in pixel space
Figure source: [4] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
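The pixel-space sampling loop described above can be sketched in a few lines. The following is a minimal NumPy sketch of DDPM-style ancestral sampling in the spirit of [3], not the papers' actual implementation: the `unet` callable, the schedule arrays `alphas`/`alpha_bars`, and the choice of variance are illustrative assumptions. The key point is the sequential loop: one denoiser call per step t, T times in a row.

```python
import numpy as np

def ddpm_sample(unet, alphas, alpha_bars, shape, T=1000, rng=None):
    """Ancestral DDPM-style sampling: T sequential denoiser calls (x_t -> x_{t-1})."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in reversed(range(T)):                   # serial loop: cannot be parallelized
        eps = unet(x, t)                           # predicted noise eps_theta(x_t, t)
        coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])  # posterior mean of x_{t-1}
        if t > 0:                                  # add noise except at the final step
            x = x + np.sqrt(1 - alphas[t]) * rng.standard_normal(shape)
    return x
```

With T = 1000 and a full-resolution U-Net call at every iteration, this loop is the dominant cost of generation, which motivates the latent-space approach on the next slide.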
11. Diffusion-based models in latent space
Figure source: [6] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).
Information compression
• Of the information an image carries (semantic and perceptual), the autoencoder handles the perceptual part
• A loss term that suppresses blur is also used
The AE compresses the image; sampling is then performed in the latent space for efficiency.
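The efficiency gain can be sketched as follows. This is a rough illustration of the latent diffusion pipeline of [6], under assumed shapes: `decoder` and `denoise_step` are hypothetical stand-ins for the autoencoder decoder and the latent-space U-Net step, and the latent channel count is illustrative. With a downsampling factor f = 8, every denoising step touches 8×8 = 64 times fewer spatial positions than pixel-space diffusion, and the decoder runs only once.

```python
import numpy as np

def latent_diffusion_sample(decoder, denoise_step, img_hw=(256, 256),
                            z_channels=4, f=8, T=50, rng=None):
    """Sketch: run the diffusion loop on (H/f) x (W/f) latents, then decode once."""
    rng = rng or np.random.default_rng(0)
    h, w = img_hw[0] // f, img_hw[1] // f
    z = rng.standard_normal((z_channels, h, w))  # noise in the AE's latent space
    for t in reversed(range(T)):
        z = denoise_step(z, t)                   # cheap: 64x fewer spatial positions
    return decoder(z)                            # single decode back to pixel space
```

The design choice is to spend the iterative compute where it is cheap (the compressed latent) and pay the full-resolution cost only once, in the final decode.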
12. Good image generation achieved with 1/8 downsampling
Figure source: [6] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).