CLIP

Text-to-image

• Last week we discussed diffusion models.


• Suppose now we want to build a text-to-image system based on diffusion models.
• Thought experiment: suppose a human is given a very noisy image of something
and is asked to denoise it. Then the human is told that it is a photo of a cat on a
table. Will this additional information help the human to denoise it?
• Training data: lots of images with textual captions (x_i, c_i), i = 1,…,N
• x_i is the ith image
• c_i is the corresponding caption
• Obtain fixed-length embeddings (representations) of the captions:
• e_i = g(c_i), i = 1,…,N
• To be discussed: how we do this
• Then condition on e_i, i = 1,…,N, during training
Recall: Diffusion algorithms
Denoising neural network with
conditioning
• Data point (x_0, c).
• Encode c to get the embedding e
• Recall x_t is x_0 with t steps of noise
• Input x_t, t, e into the neural network to obtain an estimate of the noise added at step t.
[Diagram] Inputs x_t, t, e → denoising network ε_θ → output ε̂ = ε_θ(x_t, t, e)

• So now, when we predict the added noise, the prediction depends on the
caption: different captions lead to different predictions.
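A minimal sketch of such a conditional denoiser in PyTorch, assuming flattened images and a simple MLP backbone (a stand-in for the U-Net or transformer used in practice; the class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added at step t, given the noisy image x_t, the step t, and the caption embedding e."""

    def __init__(self, img_dim, emb_dim, hidden_dim=512, T=1000):
        super().__init__()
        self.time_emb = nn.Embedding(T + 1, hidden_dim)   # learned embedding for the step index t in {1,...,T}
        self.cond_proj = nn.Linear(emb_dim, hidden_dim)   # projects the caption embedding e
        self.net = nn.Sequential(
            nn.Linear(img_dim + 2 * hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, img_dim),               # output has the same shape as the added noise
        )

    def forward(self, x_t, t, e):
        # Concatenate the noisy image, the time embedding, and the caption embedding,
        # so the noise prediction depends on the caption.
        h = torch.cat([x_t, self.time_emb(t), self.cond_proj(e)], dim=-1)
        return self.net(h)
```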
Training algorithm with conditioning
Repeat
1. Select (x_0, c) from the dataset of images with captions
2. Obtain the embedding e for c
3. t ~ Uniform({1,…,T})
4. ε ~ N(0, I)
5. Take a gradient step on
• ||ε − ε_θ(√(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, t, e)||²   (i.e., ||ε − ε_θ(x_t, t, e)||², gradient taken with respect to θ; see the sketch below)
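A sketch of this training loop, assuming the ConditionalDenoiser above, a standard DDPM noise schedule, and batches (x_0, e) of flattened images with precomputed caption embeddings e = g(c) (the schedule constants and dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # illustrative linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # ᾱ_t for t = 1..T (stored at index t-1)

model = ConditionalDenoiser(img_dim=784, emb_dim=512)  # dimensions are illustrative
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
dataloader = [(torch.randn(8, 784), torch.randn(8, 512))]  # toy stand-in for (image, caption-embedding) batches

for x0, e in dataloader:
    t = torch.randint(1, T + 1, (x0.shape[0],))        # t ~ Uniform({1,...,T})
    eps = torch.randn_like(x0)                         # ε ~ N(0, I)
    ab = alpha_bar[t - 1].unsqueeze(-1)                # ᾱ_t, broadcast over the image dimension
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps       # x_t = √ᾱ_t x_0 + √(1-ᾱ_t) ε
    loss = F.mse_loss(model(x_t, t, e), eps)           # ||ε - ε_θ(x_t, t, e)||², averaged over the batch
    opt.zero_grad(); loss.backward(); opt.step()       # gradient step on θ
```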
Inference with a prompt
1. User provides prompt c
2. Obtain embedding e for c
3. x_T ~ N(0, I)
4. For t = T, T−1,…,1
5. x_{t−1} = (1/√α_t) ( x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ(x_t, t, e) ) + σ_t z,  with z ~ N(0, I) (z = 0 when t = 1); see the sketch below
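A matching sketch of the conditional sampler, using the same model and schedule as above and the common choice σ_t = √β_t (other variance choices are possible):

```python
import torch

@torch.no_grad()
def sample(model, e, img_dim, T, betas, alpha_bar):
    """Generate one image conditioned on the caption embedding e (shape (1, emb_dim))."""
    alphas = 1.0 - betas
    x = torch.randn(1, img_dim)                          # x_T ~ N(0, I)
    for t in range(T, 0, -1):                            # t = T, T-1, ..., 1
        eps_hat = model(x, torch.tensor([t]), e)         # ε_θ(x_t, t, e): caption-conditioned noise estimate
        a, ab = alphas[t - 1], alpha_bar[t - 1]
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps_hat) / a.sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + betas[t - 1].sqrt() * z               # add fresh noise except at the final step
    return x
```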
Text-to-image: What embedding?
• Possible approach.
• Obtain fixed-size representations for each caption from BERT or GPT. Then
condition diffusion model on those embeddings during training.
• Doesn’t work well. Need representations that align captions to corresponding
images. This way the caption “a red fox in the snow” will have a
representation similar to an image of red foxes.
• Let’s first review “contrastive loss functions”.
Recall cross-entropy loss for image
classification
• M classes of images; class label y ∈ {1,…,M}
• Training data: (xi,yi), i=1,2,…,N
• Neural network p(y|x): probability of predicting class y for input x
• Obtained by applying softmax to logits output.
• Cross-entropy loss function:
• L = −(1/N) Σ_{i=1,…,N} log p(y_i | x_i)
• Maximize the log probability of making the correct predictions,
averaged over the entire training set.
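In code, this loss is simply the average negative log-probability of the correct class (shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)                   # N = 8 examples, M = 10 classes (network outputs)
y = torch.randint(0, 10, (8,))                # correct class labels
log_probs = F.log_softmax(logits, dim=-1)     # log p(y | x) for every class
loss = -log_probs[torch.arange(8), y].mean()  # -(1/N) Σ_i log p(y_i | x_i)
# Equivalent to the built-in: F.cross_entropy(logits, y)
```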
Contrastive-Loss Approach
• First get representations for each datapoint. The representations are
learned using the class labels.
• Let f(x_i) denote the representation for the ith data point
• Want f(x_i) and f(x_j) to be close if x_i and x_j belong to the same class; far
apart if they belong to different classes:
• d_ij = ||f(x_i) − f(x_j)||
• L(x_i, x_j) = d_ij²  if y_i = y_j
• L(x_i, x_j) = max(0, m − d_ij)²  if y_i ≠ y_j
• Encourages d_ij ≥ m if x_i and x_j belong to different classes
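A sketch of this pairwise margin-based contrastive loss, assuming representations z_i = f(x_i) have already been computed for a batch of pairs (the function name and default margin are illustrative):

```python
import torch

def contrastive_loss(z_i, z_j, same_class, m=1.0):
    """Margin contrastive loss for pairs of representations.

    same_class: tensor with 1.0 where x_i and x_j share a label, 0.0 otherwise.
    Pulls same-class pairs together and pushes different-class pairs at least a margin m apart.
    """
    d = torch.norm(z_i - z_j, dim=-1)                     # d_ij = ||f(x_i) - f(x_j)||
    return (same_class * d.pow(2)                         # d_ij² when y_i = y_j
            + (1 - same_class) * torch.clamp(m - d, min=0).pow(2)).mean()  # max(0, m - d_ij)² otherwise
```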
From contrastive loss to
classification
• After minimizing the contrastive loss function, have representations z_i = f(x_i),
i = 1,…,N
• Nearest neighbor: At test time with a new image x, obtain z = f(x) and
find the nearest neighbor z_i. Output the class of z_i, i.e., output y_i.
• K-nearest neighbors: Determine the classes of the K-nearest
neighbors and then vote.
• Linear layer: Add a linear layer W (dimension M×d); train W with the CE
loss, i.e., obtain logits Wz_i, apply softmax, then minimize the CE over W.
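A sketch of the nearest-neighbor rule on the learned representations (the linear-layer alternative is noted in a comment; the names are illustrative):

```python
import torch

def nearest_neighbor_predict(z, z_train, y_train):
    """Classify a test representation z = f(x) by the label of its nearest training representation z_i."""
    dists = torch.cdist(z.unsqueeze(0), z_train)   # distances from z to every z_i, shape (1, N)
    return y_train[dists.argmin()]                 # output the class y_i of the closest z_i

# Linear-layer alternative: compute logits = z_train @ W.T with W of shape (M, d),
# apply softmax, and train only W with the cross-entropy loss (representations stay frozen).
```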
What is CLIP?
• Vision-language model
• CLIP (Contrastive Language-Image Pre-training) is an open-source vision-
language model originally developed by OpenAI.
• The training dataset consists of images along with textual captions.
• Trained on 400 million image-caption pairs collected from a variety of
public sources.
• CLIP jointly trains an image encoder and a text encoder, mapping the
images and textual descriptions into a shared embedding space.
• Images are encoded with a vision model such as a ResNet or a ViT.
• Captions are encoded with a transformer. Neither model is pretrained.
• Contrastive loss: images and their correct captions are pulled together in
the embedding space, while mismatched pairs are pushed apart.
Training CLIP
• Data: (x_i, c_i), i = 1,…,N
• Image encoder: z_i = f(x_i)
• Text encoder: e_i = g(c_i)
• d = embedding dimension (say 512)
• Normalize embeddings so that
||z_i|| = ||e_i|| = 1 for all i
• Cosine similarity between z_i and e_j:
s_ij = z_i · e_j
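A sketch of this setup, assuming an image_encoder (e.g. a ViT) and a text_encoder (a transformer) that map a batch of N images and N tokenized captions to d-dimensional vectors:

```python
import torch
import torch.nn.functional as F

def clip_embeddings(image_encoder, text_encoder, images, captions):
    """Return unit-norm image and text embeddings and the N x N cosine-similarity matrix s."""
    z = F.normalize(image_encoder(images), dim=-1)    # z_i = f(x_i), with ||z_i|| = 1
    e = F.normalize(text_encoder(captions), dim=-1)   # e_i = g(c_i), with ||e_i|| = 1
    s = z @ e.T                                       # s_ij = z_i · e_j  (cosine similarity)
    return z, e, s
```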

Another form of Contrastive Loss:


L(θ) = −(1/N) Σ_{i=1,…,N} log [ exp(s_ii / τ) / Σ_{j=1,…,N} exp(s_ij / τ) ],  where τ > 0 is a temperature

This loss tries to push the representations for x_i and c_i close together
and push x_i and c_j far apart for j ≠ i.
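A sketch of this loss computed from the similarity matrix s above; the temperature τ and the symmetric image-to-caption / caption-to-image averaging follow the CLIP recipe and go beyond the single term written on the slide:

```python
import torch
import torch.nn.functional as F

def clip_loss(s, tau=0.07):
    """Contrastive loss over an N x N similarity matrix s, where s[i, i] is the matched image-caption pair."""
    n = s.shape[0]
    targets = torch.arange(n)
    loss_img = F.cross_entropy(s / tau, targets)    # -(1/N) Σ_i log( exp(s_ii/τ) / Σ_j exp(s_ij/τ) )
    loss_txt = F.cross_entropy(s.T / tau, targets)  # same loss, matching each caption to its image
    return (loss_img + loss_txt) / 2                # CLIP averages the two directions
```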
CLIP applications (after training)
• Text-to-image: Given a prompt c (such as "dog sitting on table"), use
CLIP to obtain the embedding e = g(c). Then use the conditional diffusion
inference algorithm to generate an image that matches the prompt.
• Image captioning: Given an image, obtain a caption for the image.
Image captioning (after training)

• Obtain encoded images from the training set: z_i = f(x_i), i = 1,…,N

• First train the encoders f and g by minimizing the contrastive loss function

• Now use (z_i, c_i), i = 1,…,N, to train a transformer using next-token prediction:

• First input z_i into the transformer (after a projection layer, so that the token has the proper
dimension)
• Then perform next-token prediction on the tokens in c_i using the standard CE loss
• So we are fine-tuning an LLM in the standard way except for conditioning on the
image representations
• Captioning at inference: Given a test image x.
• Obtain z = f(x), the aligned representation for the image.
• Input z into the trained transformer and obtain a caption c for the image
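A sketch of one simple way to condition the transformer on the image: project z and prepend it as a prefix token, as described above (the surrounding transformer, tokenizer, and dimensions are assumed):

```python
import torch
import torch.nn as nn

class CaptionPrefix(nn.Module):
    """Maps a CLIP image embedding z to a prefix token for a decoder-only transformer."""

    def __init__(self, clip_dim, model_dim):
        super().__init__()
        self.proj = nn.Linear(clip_dim, model_dim)    # projection so the token has the proper dimension

    def forward(self, z, caption_token_embs):
        # Prepend the projected image embedding to the caption's token embeddings;
        # the transformer then does next-token prediction on the caption with the standard CE loss.
        prefix = self.proj(z).unsqueeze(1)            # (B, 1, model_dim)
        return torch.cat([prefix, caption_token_embs], dim=1)
```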
Advanced Topics in AI and ML
Goal: Explore contemporary topics you most likely did not see in other
ML classes. Towards expertise in Modern AI.
Credits:
Professor: Keith Ross
Primary Assistant: George Andriopoulos
Secondary Assistants: Zixuan Dong, Safal Shrestha, Minwu Kim, Nour Abdelmoneim
Covered a ton of material!
• Tabular Reinforcement Learning
  • Bandit problem, MDP formulation, Monte Carlo methods, TD methods
• Policy Gradient and its variants
  • Including PPO
• DRL with planning and self play
  • Alpha Zero
  • Monte Carlo Tree Search
• Off-policy DRL
  • DQN and Atari Games
• DRL for Robotic Control
  • Continuous action spaces
  • Actor-Critic Algorithms including DDPG
• Imitation Learning
• Off-line Reinforcement Learning
• Training and inference for LLMs
  • Transformer
  • Self attention
  • Mixture of experts
  • KV caching; Multihead Latent Attention
• ChatGPT
  • Instruction fine tuning
  • RLHF
• Chain of Thought Prompting
• Reasoning in LLMs:
  • RL with Verifiable Rewards (RLVR)
• Efficient fine-tuning of LLMs: LoRA
• Diffusion Models for image generation
• Vision-Language Model: CLIP
