CLIP
• So now, when we predict the added noise, the prediction depends on the caption: different captions lead to different predictions.
Training algorithm with conditioning
Repeat
1. Select (x0, c) from dataset of images with captions
2. Obtain embedding e for c
3. $t \sim \mathrm{Uniform}(\{1,\dots,T\})$
4. $\epsilon \sim \mathcal{N}(0, I)$
5. Take gradient step on
• $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t,\; e\right) \right\|^2$
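A minimal PyTorch sketch of one such training step, assuming a noise-prediction network `eps_model(x_t, t, e)`, a text encoder `text_encoder(c)`, and a precomputed cumulative schedule `alpha_bar` (all hypothetical names):

```python
import torch

def training_step(eps_model, text_encoder, x0, captions, alpha_bar, optimizer):
    B = x0.shape[0]
    e = text_encoder(captions)                         # step 2: caption embeddings
    t = torch.randint(1, alpha_bar.numel() + 1, (B,), device=x0.device)  # step 3: t ~ Uniform{1..T}
    eps = torch.randn_like(x0)                         # step 4: eps ~ N(0, I)
    a_bar = alpha_bar[t - 1].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps # noisy version of x0
    loss = ((eps - eps_model(x_t, t, e)) ** 2).mean()  # step 5: noise-prediction MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```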
Inference with a prompt
1. User provides prompt c
2. Obtain embedding e for c
3. $x_T \sim \mathcal{N}(0, I)$
4. For $t = T, T-1, \dots, 1$
5. $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t, e)\right) + \sigma_t z$, where $z \sim \mathcal{N}(0, I)$ for $t > 1$ and $z = 0$ for $t = 1$
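A matching sketch of the sampling loop, under the same assumptions (the hypothetical `eps_model` plus precomputed `alpha`, `alpha_bar`, `sigma` schedules of length T):

```python
import torch

@torch.no_grad()
def sample(eps_model, e, shape, alpha, alpha_bar, sigma):
    x = torch.randn(shape)                                   # step 3: x_T ~ N(0, I)
    T = alpha.numel()
    for t in range(T, 0, -1):                                # step 4: t = T, T-1, ..., 1
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x, t_batch, e)                   # predicted noise, conditioned on e
        coef = (1 - alpha[t - 1]) / (1 - alpha_bar[t - 1]).sqrt()
        mean = (x - coef * eps_hat) / alpha[t - 1].sqrt()    # step 5: posterior mean
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigma[t - 1] * z
    return x
```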
Text-to-image: What embedding?
• Possible approach:
• Obtain fixed-size representations for each caption from BERT or GPT, then
condition the diffusion model on those embeddings during training.
• This doesn’t work well. We need representations that align captions with their
corresponding images, so that the caption “a red fox in the snow” has a
representation similar to that of an image of a red fox.
• Let’s first review “contrastive loss functions”.
Recall cross-entropy loss for image
classification
• M classes of images; class label $y \in \{1,\dots,M\}$
• Training data: (xi,yi), i=1,2,…,N
• Neural network p(y|x): probability of predicting class y for input x
• Obtained by applying softmax to logits output.
• Cross-entropy loss function:
• $-\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i)$
• Maximize the log probability of making the correct predictions,
averaged over the entire training set.
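For concreteness, a small PyTorch sketch of this loss, computed both by hand (softmax, then the average negative log-probability of the correct class) and with the built-in helper:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)               # N = 4 examples, M = 10 classes (unnormalized scores)
y = torch.tensor([3, 1, 0, 7])            # ground-truth labels y_i

log_p = F.log_softmax(logits, dim=1)      # log p(y | x) for every class
loss = -log_p[torch.arange(4), y].mean()  # -(1/N) sum_i log p(y_i | x_i)
# Equivalent one-liner: F.cross_entropy(logits, y)
```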
Contrastive-Loss Approach
• First get representations for each data point. The representations depend
on the labels.
• Let $f(x_i)$ denote the representation of the $i$th data point.
• Want f(xi) and f(xj) to be close if xi and xj belong to same class; far
apart if they belong to different classes:
• )
• )=
• Encourages m if
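A minimal sketch of this pairwise margin loss in PyTorch, assuming the representations are already computed:

```python
import torch

def contrastive_pair_loss(zi, zj, same_class, margin=1.0):
    d = torch.norm(zi - zj)                       # distance between the two representations
    if same_class:
        return d ** 2                             # pull same-class pairs together
    return torch.clamp(margin - d, min=0) ** 2    # push different-class pairs at least `margin` apart
```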
From contrastive loss to
classification
• After minimizing the contrastive loss, we have representations $z_i = f(x_i)$,
i=1,…,N
• Nearest neighbor: At test time with new image x, obtain z = f(x) and
find nearest neighbor zi. Output class of zi , i.e., output yi.
• K-nearest neighbors: Determine the classes of the K-nearest
neighbors and then vote.
• Linear layer: Add a linear layer W (dimension M×d); train W with CE
loss, i.e., obtain logits $W z_i$, apply softmax, then minimize CE over W.
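A sketch of the three readout options, assuming learned representations `Z` (N×d) with labels `y` and a test representation `z = f(x)`:

```python
import torch

def nearest_neighbor(z, Z, y):
    return y[torch.cdist(z.unsqueeze(0), Z).argmin()]         # label of the closest z_i

def k_nearest_neighbors(z, Z, y, k=5):
    idx = torch.cdist(z.unsqueeze(0), Z).topk(k, largest=False).indices.squeeze(0)
    return y[idx].mode().values                               # majority vote among the K neighbors

def linear_probe_logits(Z, W):
    return Z @ W.T                                            # W is M x d; train W with CE over these logits
```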
What is CLIP?
• Vision-language model
• CLIP (Contrastive Language-Image Pre-training) is an open-source vision-
language model originally developed by OpenAI.
• The training dataset consists of images along with textual captions.
• Trained on 400 million image-caption pairs collected from a variety of
public sources.
• CLIP jointly trains an image encoder and a text encoder, mapping the
images and textual descriptions into a shared embedding space.
• Images are encoded with a vision model such as ResNet or ViT.
• Captions are encoded with a transformer. Neither model is pretrained.
• Contrastive loss: images and their correct captions are pulled together in
the embedding space, while mismatched pairs are pushed apart.
Training CLIP
• Data: $(x_i, c_i)$, i=1,…,N
• Image encoder: $u_i = f(x_i)$
• Text encoder: $v_i = g(c_i)$
• d = embedding dimension (say 512)
• Normalize embeddings so that
$\|u_i\| = \|v_i\| = 1$ for all i
• Cosine similarity between $u_i$ and $v_j$:
$s_{ij} = u_i \cdot v_j$
• Symmetric contrastive (cross-entropy) loss over the similarities, with temperature $\tau$:
$L = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ji}/\tau)}\right]$
This loss pushes the representations of $x_i$ and $c_i$ close together
and pushes $x_i$ and $c_j$ far apart for $j \ne i$.
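A minimal PyTorch sketch of this symmetric contrastive loss, assuming hypothetical `image_encoder` and `text_encoder` modules and a temperature `tau` (learnable in the real CLIP, fixed here for simplicity):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_encoder, text_encoder, images, captions, tau=0.07):
    u = F.normalize(image_encoder(images), dim=1)    # u_i, with ||u_i|| = 1
    v = F.normalize(text_encoder(captions), dim=1)   # v_i, with ||v_i|| = 1
    s = u @ v.T / tau                                # s_ij = cosine similarity, scaled by temperature
    targets = torch.arange(s.shape[0])               # the correct caption for image i is caption i
    loss_i = F.cross_entropy(s, targets)             # images -> captions direction
    loss_t = F.cross_entropy(s.T, targets)           # captions -> images direction
    return (loss_i + loss_t) / 2
```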
CLIP applications (after training)
• Text-to-image: Given a prompt c (such as “dog sitting on table”), use
CLIP to obtain embedding e = g(c). Then use the conditional diffusion
inference algorithm to generate an image that matches the prompt.
• Image captioning: Given an image, obtain a caption for the image.
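A sketch of both applications, assuming trained CLIP encoders `f` (image) and `g` (text) and the conditional sampler sketched earlier; captioning is done here by simple retrieval over a set of candidate captions, which is only one easy option:

```python
import torch
import torch.nn.functional as F

def text_to_image(prompt, g, sample, eps_model, alpha, alpha_bar, sigma):
    e = g([prompt])                                    # CLIP text embedding of the prompt
    return sample(eps_model, e, (1, 3, 64, 64), alpha, alpha_bar, sigma)

def caption_by_retrieval(image, candidate_captions, f, g):
    u = F.normalize(f(image.unsqueeze(0)), dim=1)      # image embedding
    v = F.normalize(g(candidate_captions), dim=1)      # candidate caption embeddings
    scores = (u @ v.T).squeeze(0)                      # cosine similarities
    return candidate_captions[scores.argmax().item()]  # best-matching caption
```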
Image captioning (after training)