7. ②Prior
Model ①: Autoregressive Model (Transformer)
Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf
Model ②: Latent Diffusion Model
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
https://arxiv.org/pdf/2208.01618.pdf
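The regular image embedding and its inverted counterpart (the inverted embedding) are passed to the Decoder as a pair
Choose one of the two
Can track changes in the image distribution with high accuracy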
・Diffusion Models Beat GANs on Image Synthesis https://arxiv.org/pdf/2105.05233.pdf
・GAN Inversion: A Survey https://arxiv.org/pdf/2101.05278.pdf
12. Experiment ④: Comparison of DALL-E2 and GLIDE
GLIDE vs DALL-E2
Photorealism: roughly even
Validity (caption match): GLIDE
Diversity: DALL-E2
[Figure: GLIDE and DALL-E2 samples for the prompt “A green vase filled with red roses sitting on top of table.”]
Vote share in human evaluation (vs GLIDE)
16. Related Papers
• Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E2)
• Denoising Diffusion Probabilistic Models (on the Diffusion Model that was adopted)
• Learning Transferable Visual Models From Natural Language Supervision (proposes the CLIP model)
• How Much Can CLIP Benefit Vision-and-Language Tasks? (evaluation of, and proposals for, using CLIP models in V&L tasks)
• GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (the generative model this paper is compared against; see this paper for the Prior's loss function)
• An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (Latent Diffusion Model)
• LAFITE: Towards Language-Free Training for Text-to-Image Generation (another text-to-image model)
• Zero-Shot Text-to-Image Generation (the original DALL-E model)
• GAN Inversion: A Survey (on inversion)
• Diffusion Models Beat GANs on Image Synthesis (proposes using a Diffusion Model rather than a GAN)
• Classifier-Free Diffusion Guidance (one form of diffusion guidance)
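As a rough illustration of the diffusion guidance mentioned in the last entry, the sketch below mixes a text-conditional and an unconditional noise prediction in the form used by GLIDE: eps_uncond + s * (eps_cond - eps_uncond). The function name, the list-based vectors, and the numeric values are assumptions made for illustration; they are not taken from the slides or from any library.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    guidance_scale > 1 pushes the result further toward the condition.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical noise predictions for a two-dimensional sample.
eps_cond = [1.0, 2.0]
eps_uncond = [0.5, 1.0]

guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0)
```

A larger guidance scale generally trades diversity for closer agreement with the caption, which is the same trade-off that the photorealism, validity, and diversity comparison is about.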
17. Overview of the GLIDE Model
Text
Image
Text Encoder : ADM Model
Diffusion Models Beat GANs on Image Synthesis https://arxiv.org/pdf/2105.05233.pdf
Image Encoder : Transformer
Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf
Noising : Diffusion Model → Noised CLIP → Denoising : Diffusion Model
Decode Image
… Shared with DALL-E2
… Differs from DALL-E2
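The noising and denoising steps in the diagram can be illustrated with the closed-form forward process from Denoising Diffusion Probabilistic Models: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The following is a minimal sketch, assuming a linear beta schedule and a toy 4-dimensional vector in place of a real image or CLIP embedding. All function names are hypothetical, and the denoising step is handed the true noise rather than a trained model's prediction.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, t, alpha_bar, rng):
    """Noising: sample x_t directly from x_0 at timestep t."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return x_t, noise


def recover_x0(x_t, t, alpha_bar, noise):
    """Denoising target: invert the noising formula given the noise.

    A trained model would supply a predicted noise here instead.
    """
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    return [(x - sigma * e) / signal for x, e in zip(x_t, noise)]


rng = random.Random(0)
alpha_bar = cumulative_alpha(linear_beta_schedule(1000))

x0 = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for an image or embedding
x_t, noise = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
x0_recovered = recover_x0(x_t, t=500, alpha_bar=alpha_bar, noise=noise)
```

Because alpha_bar shrinks as t grows, later timesteps keep less of the original signal, which is why the denoising model has to be conditioned on t.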
18. Models Used in the CLIP Encoder
・Image Encoder … Transformer
AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE (2021)
https://arxiv.org/pdf/2010.11929v2.pdf
・Text Encoder … Transformer
Language Models are Unsupervised Multitask Learners
https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
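To show how the outputs of the two encoders are used together, here is a minimal sketch of CLIP-style matching: both embeddings are L2-normalized, compared by cosine similarity, and turned into probabilities with a temperature-scaled softmax. The 3-dimensional embeddings, the function names, and the temperature of 0.07 are assumptions for illustration; real CLIP embeddings come from the trained Image Encoder and Text Encoder and are much larger.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(image_embedding, text_embeddings, temperature=0.07):
    """Probability that each candidate text matches the image."""
    logits = [
        cosine_similarity(image_embedding, text) / temperature
        for text in text_embeddings
    ]
    return softmax(logits)


# Hypothetical embeddings standing in for encoder outputs.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],  # caption that is close to the image
    [0.0, 1.0, 0.0],  # unrelated caption
]

probs = match_probabilities(image_embedding, text_embeddings)
```

The first caption receives almost all of the probability mass because its embedding points in nearly the same direction as the image embedding.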