
Learning Transferable Visual Models From Natural Language Supervision


Alec Radford*, Jong Wook Kim*, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (OpenAI; * equal contribution)
Motivation
1. Current CV systems are trained to predict a fixed set of predetermined object categories (image classification), which limits their generality
2. Learning directly from raw text about images provides a broader source of supervision
3. Pre-training on raw text has already proven scalable in NLP (e.g. BERT, GPT), motivating the same idea for vision
Approach
1. WebImageText (WIT): a new dataset of 400 million (image, text) pairs collected from the internet
a. existing datasets such as MS-COCO and Visual Genome are too small
b. a large dataset is necessary to capture the full range of visual concepts and textual descriptions
that exist in the real world
2. Efficient pre-training method
a. training a model on a dataset of this size would be computationally expensive
b. predictive objectives that try to reconstruct the exact words of each caption (e.g. masked language
modeling, where a subset of input tokens is hidden and predicted from the remaining tokens) are
inefficient for this kind of noisy web data
c. CLIP instead uses a contrastive objective: it only has to predict which text as a whole is paired
with which image (a minimal sketch of this loss follows)
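A runnable PyTorch sketch of the symmetric contrastive loss described above; the function name, the 512-dimensional random features in the toy usage, and the temperature value are illustrative choices rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of aligned (image, text) features.

    image_features: [n, d] projected output of the image encoder
    text_features:  [n, d] projected output of the text encoder
    logit_scale:    learned temperature, stored in log space (scalar tensor)
    """
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # scaled pairwise cosine similarities between every image and every text
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()  # [n, n]
    logits_per_text = logits_per_image.t()

    # the n matching pairs lie on the diagonal of the [n, n] similarity matrix
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)  # image -> text direction
    loss_t = F.cross_entropy(logits_per_text, labels)   # text -> image direction
    return (loss_i + loss_t) / 2

# toy usage: random features stand in for real encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
scale = torch.tensor(2.659)  # ~ log(1/0.07), a common initial temperature
print(clip_contrastive_loss(img, txt, scale))
```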
Approach
3. Natural Language Supervision
a. natural language supervision is easier to scale than standard crowd-sourced labeling and does not
require annotations in a classic "machine learning compatible format"
b. the model can learn passively from the supervision contained in the vast amount of text on the internet

4. Choosing and scaling a model
a. image encoder: ResNet-50 (and scaled variants) or a Vision Transformer (ViT)
b. text encoder: Transformer (a rough sketch of both encoders follows below)
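A rough PyTorch sketch of how the two encoders might be wired to produce embeddings in a shared space; the embedding dimension, vocabulary size, layer counts, and mean-pooling are assumptions for illustration, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn
import torchvision

embed_dim = 512  # assumed size of the shared image-text embedding space

# image encoder: ResNet-50 with its classification head replaced by a projection
image_encoder = torchvision.models.resnet50(weights=None)
image_encoder.fc = nn.Linear(image_encoder.fc.in_features, embed_dim)

# text encoder: token embedding + Transformer encoder + projection
vocab_size, max_len, width = 49408, 77, 512  # placeholder hyperparameters

class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, width)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        self.proj = nn.Linear(width, embed_dim)

    def forward(self, tokens):               # tokens: [batch, max_len] integer ids
        x = self.token_emb(tokens) + self.pos_emb
        x = self.transformer(x)
        return self.proj(x.mean(dim=1))      # mean-pool token features (a simplification)

text_encoder = TextEncoder()
```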
Experiments
1. Zero-shot transfer
a. the ability of a model to perform well on a task it has not been explicitly trained on
b. zero-shot transfer is used here as an evaluation metric for task-learning capability

2. CLIP for zero-shot transfer


a. CLIP is pre-trained to predict whether an image and a text snippet are paired together in its dataset
i. compute the feature embedding of the image and the feature embedding of each possible text
(one per candidate class)
ii. take their dot products -> a similarity score for each (image, text) pair; the highest-scoring
text gives the predicted class (see the sketch after this list)
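A short sketch of this zero-shot procedure using the open-source clip package released with the paper (github.com/openai/CLIP); the image path and candidate class names are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# candidate labels wrapped in a simple prompt (placeholder class names)
class_names = ["dog", "cat", "airplane"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # [1, d]
    text_features = model.encode_text(text)      # [num_classes, d]
    # normalize and take dot products -> one similarity score per class
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```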
Experiments
1. CLIP outperforms Visual N-Grams on ImageNet and performs well on tasks it has not been explicitly
trained on
Experiments
1. prompt engineering
a. standard image classification datasets treat the information naming or describing classes as an
afterthought and annotate images with just a numeric label id (which is later mapped back to a class name)
b. Prompt: construct natural language prompts that guide the model's predictions
i. a template-based approach fills a placeholder with the class name, e.g. "A photo of a {label}.";
multiple templates can also be ensembled, as sketched below
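A small sketch of template-based prompt construction and prompt ensembling (averaging the normalized text embeddings of several prompts per class), again using the open-source clip package; the template list here is a tiny illustrative subset, not the full set used in the paper.

```python
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# a few illustrative templates; the paper ensembles over a much larger set
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the small {}.",
]

def class_embedding(label: str) -> torch.Tensor:
    """Embed every prompt for one label and average the normalized embeddings."""
    tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    mean = emb.mean(dim=0)
    return mean / mean.norm()

# one ensembled text embedding per class, usable as zero-shot classifier weights
zero_shot_weights = torch.stack([class_embedding(c) for c in ["guitar", "piano"]])
```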
Experiments
1. zero-shot performance of CLIP
a. impressive on some tasks, but still quite weak on several kinds of tasks, such as
i. specialized tasks (e.g. satellite image classification)
ii. abstract and systematic tasks (e.g. counting the number of objects in an image)
iii. self-driving related tasks (e.g. classifying the distance to the nearest car)
b. its performance is not yet perfect on these tasks


Representation Learning Capabilities of CLIP
1. representation learning: discovering and extracting useful features or representations from raw data
2. a common way of evaluating representation quality:
a. fit a linear classifier on representations extracted from the frozen model and measure its
performance on various datasets (a minimal sketch follows)
b. this "linear probe" protocol is compared against end-to-end fine-tuning
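A minimal sketch of the linear-probe protocol: extract features from the frozen CLIP image encoder, then fit a logistic-regression classifier on top. The CIFAR-100 dataset, batch size, and regularization strength are illustrative choices.

```python
import numpy as np
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# example dataset; `preprocess` resizes and normalizes images for the encoder
train_set = CIFAR100("./data", download=True, train=True, transform=preprocess)
test_set = CIFAR100("./data", download=True, train=False, transform=preprocess)

def extract_features(dataset):
    """Run the frozen image encoder over a dataset and collect features + labels."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=100):
            feats.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train_x, train_y = extract_features(train_set)
test_x, test_y = extract_features(test_set)

# logistic regression on frozen features = the "linear probe" evaluation
probe = LogisticRegression(C=0.316, max_iter=1000)
probe.fit(train_x, train_y)
print("linear-probe accuracy:", probe.score(test_x, test_y))
```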
Experiments
Comparison to Human Performance
● Dataset: Oxford-IIIT Pets; participants select which of the 37 cat or dog breeds best matches each image
● Zero-shot: the humans were given no examples of the breeds and were asked to label each image
● One-shot / two-shot: the humans were given one (respectively two) sample images of each breed before labeling
