[DL Reading Group] Hierarchical Text-Conditional Image Generation with CLIP Latents

DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Hierarchical Text-Conditional Image
Generation with CLIP Latents

Bibliographic information
Title :
Hierarchical Text-Conditional Image Generation with CLIP Latents
Authors :
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
Organization :
OpenAI
Venue :
arXiv (April 2022)
URL :
https://arxiv.org/pdf/2204.06125.pdf

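Overview
☑ A paper on DALL-E2, the model that drew attention for being used to generate fictional Pokémon
☑ Combines a CLIP model with diffusion models and applies them to text-to-image generation
☑ Achieves high realism and diversity in the generated images
☑ Presents the characteristics (strengths and weaknesses) of DALL-E2 within the text-to-image field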

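Model structure
➀ ENCODER (CLIP MODEL) … encodes the input image and text into embeddings z_i and z_t, which are compared through their dot product
➁ PRIOR (DIFFUSION MODEL) … generates the CLIP image embedding z_i from the caption and passes it to the decoder; z_i is one half of the bipartite latent representation (z_i, x_T)
➂ DECODER (DIFFUSION MODEL) … outputs the generated image
[Figure labels: text embedding z_t, image embedding z_i, dot product z_i · z_t]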

➀Encoder “CLIP Encoder”
Learning Transferable Visual Models From Natural Language Supervision (2021)
Encodes the image and the text separately and outputs the dot product of the two embeddings
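The dot-product scoring described on this slide can be illustrated with a minimal sketch. It assumes the embeddings are L2-normalized before comparison, so that the dot product equals the cosine similarity. The vectors below are hypothetical toy values, not outputs of a real CLIP encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP embeddings).
z_i = normalize([0.9, 0.1, 0.3])          # image embedding
z_t_match = normalize([0.8, 0.2, 0.4])    # caption that describes the image
z_t_other = normalize([-0.5, 0.9, 0.0])   # unrelated caption

# After normalization the dot product is the cosine similarity.
score_match = dot(z_i, z_t_match)
score_other = dot(z_i, z_t_other)
print(score_match > score_other)
```

A matching image/caption pair should score higher than a mismatched pair. That is the property the contrastive training of CLIP is designed to produce.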

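➀Encoder “CLIP Encoder”: characteristics
➀ Trained on a very large amount of image data
② Performs very well at image encoding
③ On the other hand, it can underperform on downstream V&L tasks, so using it in combination with other models is recommended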
・How Much Can CLIP Benefit Vision-and-Language Tasks? https://arxiv.org/pdf/2107.06383.pdf

②Prior
Model ➀ : Autoregressive Model (Transformer)
Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf
Model ② : Latent Diffusion Model
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
https://arxiv.org/pdf/2208.01618.pdf
The regular image embedding and its inverted counterpart (inverted embedding) are passed to the Decoder as a pair
One of the two models is chosen
Can follow changes in the image distribution with high accuracy
・Diffusion Models Beat GANs on Image Synthesis https://arxiv.org/pdf/2105.05233.pdf
・GAN Inversion: A Survey https://arxiv.org/pdf/2101.05278.pdf

③Decoder
• Model
Diffusion Model
ADMNets (Diffusion Models Beat GANs on Image Synthesis, https://arxiv.org/pdf/2105.05233.pdf)
・Input
(CLIP embedding, inverted CLIP embedding)
・Output
An image generated to match the text

Experiment ➀ Image manipulation
Variations
Interpolation
Text diffs
A photo of a landscape in winter → a photo of a landscape in fall
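The paper describes interpolation and text diffs in terms of spherical interpolation (slerp) between CLIP embeddings. For a text diff, the image embedding is rotated toward the normalized difference of two text embeddings. The following is a minimal sketch of that idea. The function names and the 2-D toy vectors are hypothetical and only illustrate the geometry.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b for t in [0, 1]."""
    cos_omega = max(-1.0, min(1.0, dot(a, b)))
    omega = math.acos(cos_omega)
    s = math.sin(omega)
    if s < 1e-8:
        # Degenerate case (identical or antipodal vectors): no unique arc exists.
        return list(a)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def text_diff_direction(z_t_new, z_t_base):
    """Normalized difference between the new and the base caption embedding."""
    return normalize([n - b for n, b in zip(z_t_new, z_t_base)])

# Interpolation between two (toy) image embeddings.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
halfway = slerp(z_a, z_b, 0.5)

# Text diff: move an image embedding toward the direction "base caption -> new caption".
z_d = text_diff_direction([1.0, 1.0], [1.0, 0.0])
edited = slerp(z_a, z_d, 0.25)
print(halfway, edited)
```

Unlike linear interpolation, slerp keeps intermediate points on the unit sphere, which matches the normalized embeddings that CLIP produces.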

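Experiment ② Observing the CLIP latent space
・Object classification is easily swayed by text that appears in the image
・Yet when the image is manipulated (regenerated), the result is still an image of an apple, not an iPod
Behind these seemingly contradictory results, we can observe how CLIP is behaving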

Experiment ➂
Effect of CLIP encoding on the text-to-image task in DALL-E2
Inputs to the DALL-E2 decoder
For both Text and Image inputs, fidelity is high when the input includes a CLIP embedding

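Experiment ➃ Comparison of DALL-E2 and GLIDE
GLIDE vs DALL-E2
Photorealism : about even
Caption similarity : GLIDE
Diversity : DALL-E2
[Figure: GLIDE and DALL-E2 samples for “A green vase filled with red roses sitting on top of table.”]
Share of votes in human evaluation (vs GLIDE)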

FID score
The FID score is also better than that of all the other generative models compared
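For reference, FID (Fréchet Inception Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are not from the slides: \mu_r, \Sigma_r denote the feature mean and covariance of real images, and \mu_g, \Sigma_g those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.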

Weaknesses of DALL-E2
[Figure: DALL-E2 and GLIDE samples for “a red cube on top of a blue cube”]
DALL-E2 is weak at faithfully reproducing complex sentences

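Summary
☑ Combining CLIP with a Diffusion Model makes high-quality image generation possible (DALL-E2)
☑ Compared with previous SOTA text-to-image models, DALL-E2 maintains photorealism and image quality while leading in the diversity of generated images
☑ On the other hand, it has the weakness of struggling with text that has complex content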

Related papers
• Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E2)
• Denoising Diffusion Probabilistic Models (on the Diffusion Model adopted)
• Learning Transferable Visual Models From Natural Language Supervision (proposes the CLIP model)
• How Much Can CLIP Benefit Vision-and-Language Tasks? (evaluation of and proposals for using the CLIP model in V&L tasks)
• GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (the generative model this paper compares against; see it for the loss function of the Prior)
• An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (Latent Diffusion Model)
• LAFITE: Towards Language-Free Training for Text-to-Image Generation (another text-to-image model)
• Zero-Shot Text-to-Image Generation (the original DALL-E model)
• GAN Inversion: A Survey (on inversion)
• Diffusion Models Beat GANs on Image Synthesis (proposes using Diffusion Models rather than GANs)
• Classifier-Free Diffusion Guidance (one form of diffusion guidance)

Overview of the GLIDE model
Text
Image
Text Encoder : ADM Model
Diffusion Models Beat GANs on Image Synthesis https://arxiv.org/pdf/2105.05233.pdf
Image Encoder : Transformer
Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf
Noising : Diffusion Model Noised CLIP Denoising : Diffusion Model
Decode Image
… shared with DALL-E2
… differs from DALL-E2

Models used in the CLIP Encoder
・Image Encoder … Transformer
AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE (2021)
https://arxiv.org/pdf/2010.11929v2.pdf
・Text Encoder … Transformer
Language Models are Unsupervised Multitask Learners
https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

About the “Diffusion Model”
Noising
Denoising
[Figure labels: pure noise, original image]
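The noising direction in the figure can be sketched with the closed-form forward process from Denoising Diffusion Probabilistic Models, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε. The linear schedule and its endpoints below follow the DDPM paper and are an assumption here, since the slides do not state which schedule DALL-E2 uses. The function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (DDPM-style values, assumed for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + k * step for k in range(num_steps)]

def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one step; also return the noise that was used."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
alpha_bars = cumulative_alpha(linear_beta_schedule(1000))
x0 = [1.0, -1.0, 0.5, 0.0]                             # toy "image" with four pixels
x_early, _ = noise_sample(x0, 0, alpha_bars, rng)      # almost the original image
x_late, _ = noise_sample(x0, 999, alpha_bars, rng)     # almost pure Gaussian noise
print(alpha_bars[0], alpha_bars[-1])
```

As t grows, ᾱ_t shrinks toward 0, so the signal term vanishes and x_t approaches pure noise. The denoising model is trained to reverse this process step by step.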

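Reducing computation
• Previously the dot product of z_i and z_t was learned; the model in this paper instead generates two samples of z_i (image) and keeps the one whose dot product with z_t (text) is larger
→ reduced computation
• The prior is also trained to predict the un-noised z_i (image embedding) directly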
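The two points above can be sketched as follows: choosing between candidate image embeddings by their dot product with z_t, and scoring a direct prediction of the un-noised z_i with a mean squared error. The helper names and the toy vectors are hypothetical, and a real prior would produce the candidates by sampling from the diffusion model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pick_best_candidate(candidates, z_t):
    """Return the candidate image embedding with the largest dot product with z_t."""
    return max(candidates, key=lambda z_i: dot(z_i, z_t))

def mse(pred, target):
    """Mean squared error between a predicted and a target embedding."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

z_t = [0.6, 0.8]                          # toy text embedding
candidates = [[1.0, 0.0], [0.0, 1.0]]     # two sampled z_i candidates (toy values)
best = pick_best_candidate(candidates, z_t)

# Training signal when the model predicts the un-noised z_i directly.
loss = mse([0.5, 1.0], [0.0, 1.0])
print(best, loss)
```

Here the second candidate is kept because its dot product with z_t (0.8) exceeds that of the first (0.6).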
