CLIP
• So now, when we predict the added noise, the prediction depends on the caption: different captions lead to different predictions.
Training algorithm with conditioning
Repeat
1. Select (x0, c) from dataset of images with captions
2. Obtain embedding e for c
3. $t \sim \mathrm{Uniform}(\{1,\dots,T\})$
4. $\epsilon \sim \mathcal{N}(0, I)$
5. Take gradient step on
• $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t,\; e\right) \right\|^2$
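A minimal PyTorch sketch of one such training step, assuming a noise-prediction network `eps_model(x_t, t, e)`, a text encoder `text_encoder(c)`, and a precomputed cumulative schedule `alpha_bar` (all hypothetical names):

```python
import torch

def training_step(eps_model, text_encoder, x0, captions, alpha_bar, optimizer):
    B = x0.shape[0]
    e = text_encoder(captions)                         # step 2: caption embeddings
    t = torch.randint(1, alpha_bar.numel() + 1, (B,), device=x0.device)  # step 3: t ~ Uniform{1..T}
    eps = torch.randn_like(x0)                         # step 4: eps ~ N(0, I)
    a_bar = alpha_bar[t - 1].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps # noisy version of x0
    loss = ((eps - eps_model(x_t, t, e)) ** 2).mean()  # step 5: noise-prediction MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```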
Inference with a prompt
1. User provides prompt c
2. Obtain embedding e for c
3. $x_T \sim \mathcal{N}(0, I)$
4. For $t = T, T-1, \dots, 1$
5. $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t, e)\right) + \sigma_t z$, where $z \sim \mathcal{N}(0, I)$ for $t > 1$ and $z = 0$ for $t = 1$
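A matching sketch of the sampling loop, under the same assumptions (the hypothetical `eps_model` plus precomputed `alpha`, `alpha_bar`, `sigma` schedules of length T):

```python
import torch

@torch.no_grad()
def sample(eps_model, e, shape, alpha, alpha_bar, sigma):
    x = torch.randn(shape)                                   # step 3: x_T ~ N(0, I)
    T = alpha.numel()
    for t in range(T, 0, -1):                                # step 4: t = T, T-1, ..., 1
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x, t_batch, e)                   # predicted noise, conditioned on e
        coef = (1 - alpha[t - 1]) / (1 - alpha_bar[t - 1]).sqrt()
        mean = (x - coef * eps_hat) / alpha[t - 1].sqrt()    # step 5: posterior mean
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigma[t - 1] * z
    return x
```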
Text-to-image: What embedding?
• Possible approach:
• Obtain fixed-size representations for each caption from BERT or GPT, then
condition the diffusion model on those embeddings during training.
• This doesn’t work well. We need representations that align captions with their
corresponding images, so that the caption “a red fox in the snow” has a
representation similar to that of an image of a red fox.
• Let’s first review “contrastive loss functions”.
Recall cross-entropy loss for image
classification
• M classes of images; class label $y \in \{1,\dots,M\}$
• Training data: (xi,yi), i=1,2,…,N
• Neural network p(y|x): probability of predicting class y for input x
• Obtained by applying softmax to logits output.
• Cross-entropy loss function:
• $-\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i)$
• Maximize the log probability of making the correct predictions,
averaged over the entire training set.
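For concreteness, a small PyTorch sketch of this loss, computed both by hand (softmax, then the average negative log-probability of the correct class) and with the built-in helper:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)               # N = 4 examples, M = 10 classes (unnormalized scores)
y = torch.tensor([3, 1, 0, 7])            # ground-truth labels y_i

log_p = F.log_softmax(logits, dim=1)      # log p(y | x) for every class
loss = -log_p[torch.arange(4), y].mean()  # -(1/N) sum_i log p(y_i | x_i)
# Equivalent one-liner: F.cross_entropy(logits, y)
```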
Contrastive-Loss Approach
• First get representations for each data point. The representations depend
on the labels.
• Let $f(x_i)$ denote the representation of the $i$th data point.
• Want f(xi) and f(xj) to be close if xi and xj belong to same class; far
apart if they belong to different classes:
• )
• )=
• Encourages m if
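A minimal sketch of this pairwise margin loss in PyTorch, assuming the representations are already computed:

```python
import torch

def contrastive_pair_loss(zi, zj, same_class, margin=1.0):
    d = torch.norm(zi - zj)                       # distance between the two representations
    if same_class:
        return d ** 2                             # pull same-class pairs together
    return torch.clamp(margin - d, min=0) ** 2    # push different-class pairs at least `margin` apart
```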
From contrastive loss to
classification
• After minimizing the contrastive loss, we have representations $z_i = f(x_i)$,
i=1,…,N
• Nearest neighbor: At test time with new image x, obtain z = f(x) and
find nearest neighbor zi. Output class of zi , i.e., output yi.
• K-nearest neighbors: Determine the classes of the K-nearest
neighbors and then vote.
• Linear layer: Add a linear layer W (dimension M×d); train W with CE
loss, i.e., obtain logits $W z_i$, apply softmax, then minimize CE over W.
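A sketch of the three readout options, assuming learned representations `Z` (N×d) with labels `y` and a test representation `z = f(x)`:

```python
import torch

def nearest_neighbor(z, Z, y):
    return y[torch.cdist(z.unsqueeze(0), Z).argmin()]         # label of the closest z_i

def k_nearest_neighbors(z, Z, y, k=5):
    idx = torch.cdist(z.unsqueeze(0), Z).topk(k, largest=False).indices.squeeze(0)
    return y[idx].mode().values                               # majority vote among the K neighbors

def linear_probe_logits(Z, W):
    return Z @ W.T                                            # W is M x d; train W with CE over these logits
```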
What is CLIP?
• Vision-language model
• CLIP (Contrastive Language-Image Pre-training) is an open-source vision-
language model originally developed by OpenAI.
• The training dataset consists of images along with textual captions.
• Trained on 400 million image-caption pairs collected from a variety of
public sources.
• CLIP jointly trains an image encoder and a text encoder, mapping the
images and textual descriptions into a shared embedding space.
• Images are encoded with a vision model such as ResNet or ViT.
• Captions are encoded with a transformer. Neither model is pretrained.
• Contrastive loss: images and their correct captions are pulled together in
the embedding space, while mismatched pairs are pushed apart.
Training CLIP
• Data: $(x_i, c_i)$, i=1,…,N
• Image encoder: $u_i = f(x_i)$
• Text encoder: $v_i = g(c_i)$
• d = embedding dimension (say 512)
• Normalize embeddings so that
$\|u_i\| = \|v_i\| = 1$ for all i
• Cosine similarity between $u_i$ and $v_j$:
$s_{ij} = u_i \cdot v_j$
• Symmetric contrastive (cross-entropy) loss over the similarities, with temperature $\tau$:
$L = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ji}/\tau)}\right]$
This loss pushes the representations of $x_i$ and $c_i$ close together
and pushes $x_i$ and $c_j$ far apart for $j \ne i$.
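A minimal PyTorch sketch of this symmetric contrastive loss, assuming hypothetical `image_encoder` and `text_encoder` modules and a temperature `tau` (learnable in the real CLIP, fixed here for simplicity):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_encoder, text_encoder, images, captions, tau=0.07):
    u = F.normalize(image_encoder(images), dim=1)    # u_i, with ||u_i|| = 1
    v = F.normalize(text_encoder(captions), dim=1)   # v_i, with ||v_i|| = 1
    s = u @ v.T / tau                                # s_ij = cosine similarity, scaled by temperature
    targets = torch.arange(s.shape[0])               # the correct caption for image i is caption i
    loss_i = F.cross_entropy(s, targets)             # images -> captions direction
    loss_t = F.cross_entropy(s.T, targets)           # captions -> images direction
    return (loss_i + loss_t) / 2
```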
CLIP applications (after training)
• Text-to-image: Given a prompt c (such as “dog sitting on table”), use
CLIP to obtain embedding e = g(c). Then use the conditional diffusion
inference algorithm to generate an image that matches the prompt.
• Image captioning: Given an image, obtain a caption for the image.
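A sketch of both applications, assuming trained CLIP encoders `f` (image) and `g` (text) and the conditional sampler sketched earlier; captioning is done here by simple retrieval over a set of candidate captions, which is only one easy option:

```python
import torch
import torch.nn.functional as F

def text_to_image(prompt, g, sample, eps_model, alpha, alpha_bar, sigma):
    e = g([prompt])                                    # CLIP text embedding of the prompt
    return sample(eps_model, e, (1, 3, 64, 64), alpha, alpha_bar, sigma)

def caption_by_retrieval(image, candidate_captions, f, g):
    u = F.normalize(f(image.unsqueeze(0)), dim=1)      # image embedding
    v = F.normalize(g(candidate_captions), dim=1)      # candidate caption embeddings
    scores = (u @ v.T).squeeze(0)                      # cosine similarities
    return candidate_captions[scores.argmax().item()]  # best-matching caption
```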
Image captioning (after training)