What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification
Sarah Pratt et al., ICCV 2023
Abstract

Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") which are completed with each of the category names. This work introduces a simple method to generate higher accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that contain important discriminating characteristics of the image categories, which allows the model to place greater importance on the corresponding regions of the image when making predictions. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot. Code available at https://github.com/sarahpratt/CuPL.

[Figure 1 panels: (Top, "Standard") an image encoder and a text encoder match an image against prompts built from category names (e.g., "A platypus"). (Bottom, CuPL) the question "What does a platypus look like?" is given to GPT-3, which generates descriptions such as "A platypus looks like a beaver with a duck's bill"; these, together with descriptions of other categories (e.g., "Goldfish are small orange fish with shiny scales", "A spatula is a flat rectangular kitchen utensil with a long handle"), are used as the text-encoder prompts.]

Figure 1. Schematic of the method. (Top) The standard method of a zero-shot open-vocabulary image classification model (e.g., CLIP [42]). (Bottom) Our method of CuPL. First, an LLM generates descriptive captions for given class categories. Next, an open-vocabulary model uses these captions as prompts for performing classification.
[Figure 2: LLM-prompts such as "What does a {lorikeet, marimba, viaduct, papillon} look like?" are given to GPT-3, which generates image-prompts such as:
"A lorikeet is a small to medium-sized parrot with a brightly colored plumage."
"A marimba is a large wooden percussion instrument that looks like a xylophone."
"A viaduct is a bridge composed of several spans supported by piers or pillars."
"A papillon is a small, spaniel-type dog with a long, silky coat and fringed ears."]
The standard approach is to hand write a number of prompt templates [42] (e.g., "a photo of a {}"), compile a natural language label for each category in the dataset, and create a set of prompts for each category by filling in each template with the natural language labels. Then, image embeddings are matched to the nearest set of prompt embeddings and labelled with the category associated with that set of prompts (more details in Section 2).
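As a concrete illustration of this template-filling step, consider the minimal sketch below; the template and label lists are illustrative, not the full sets used in [42].

```python
# Sketch of the standard prompt-ensemble construction: every hand-written
# template is completed with every category's natural-language label.
templates = [
    "a photo of a {}.",
    "a photo of many {}.",
    "a low resolution photo of a {}.",
]
class_labels = ["platypus", "tree frog", "tailed frog"]

# One set of prompts per category.
prompt_sets = {label: [t.format(label) for t in templates] for label in class_labels}
# prompt_sets["platypus"][0] -> "a photo of a platypus."
```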
This method has three major drawbacks. Firstly, each prompt template has to be hand-written, so having twice as many prompts for a category requires twice as much human effort. This can become costly as each new dataset typically has a different set of prompt templates [42].

Secondly, the prompt templates must be general enough to apply to all image categories. For example, a prompt for the ImageNet [13] category "platypus" could only be as specific as "a photo of a {platypus}", and could not be something like "a photo of a {platypus}, a type of aquatic mammal" as that template would no longer be relevant for other image categories. This is limiting, as descriptive details are useful for fine-grained classification. For example, different species of frogs share many of the same characteristics. However, tree frogs can be distinguished by their distinct large eyes. This is a valuable detail for classification but cannot be included in a general template. Therefore, when using these basic templates, the model may not take advantage of this detail in the image, leading to an incorrect categorization as demonstrated in Figure 5.

Lastly, writing high performing prompt templates currently requires prior information about the contents of the dataset. For example, the list of hand-written ImageNet prompts [42] includes "a black and white photo of the {}.", "a low resolution photo of a {}.", and "a toy {}.", all of which demonstrate prior knowledge about the type of representations present in the dataset. This information is not generalizable to other datasets, as ImageNet contains "black and white" and "toy" representations of its categories, but other datasets do not (e.g., FGVC Aircraft [32]).

To overcome these challenges, we propose Customized Prompts via Language models (CuPL). In this algorithm, we couple a large language model (LLM) with a zero-shot open-vocabulary image classification model. We use the LLM to generate prompts for each of the image categories in a dataset. Using an LLM allows us to generate an arbitrary number of prompts with a fixed number of hand-written sentences. Additionally, these prompts are now customized to each category and contain specified visual descriptions while still remaining zero-shot. This allows prompts to contain details about a class which distinguish it from other similar classes. For example, to describe a tree frog, the LLM generates the sentence "A tree frog looks like a small frog with large eyes." This not only describes the category, but specifically mentions the eyes, the feature which distinguishes the tree frog class from the most visually similar classes - other types of frogs. We find that CuPL prompts are rich with these discriminating details and show that the model is able to leverage these details to place more importance on relevant parts of the image when classifying between similar, commonly confused categories (Figure 5).

We find these customized prompts outperform the hand-written templates on 15 zero-shot image classification benchmarks, including a greater than 1 percentage point gain on ImageNet [13] Top-1 accuracy and a greater than 6 percentage point gain on the Describable Textures Dataset [11], with fewer hand-written prompts when compared to the standard method used in [42]. Finally, this method requires no additional training or labeled data for either model.
2. Methods

The CuPL algorithm consists of two steps: (1) generating customized prompts for each of the categories in a given dataset and (2) using these prompts to perform zero-shot image classification.

2.1. Generating Customized Prompts

This step consists of generating prompts using an LLM. For clarity, we distinguish between two different kinds of prompts. The first are the prompts which cue the LLM to generate the descriptions of the dataset categories. These prompts do not describe an object, but rather prompt the description of an object (e.g., "What does a platypus look like?"). We will refer to these as "LLM-prompts".

Secondly, there are the prompts to be matched with images in the zero-shot image classification model. These are the prompts that describe a category (e.g., "A platypus looks like ..."). We call them "image-prompts." In CuPL, these are the output of the LLM, as exemplified in Figure 2.

In this work, we use GPT-3 [5] as our LLM. To generate our image-prompts, we must first construct a number of LLM-prompt templates. While this does require some engineering by hand, it is significantly less than the amount of hand-engineered sentences used in the standard method of creating image-prompt templates for CLIP. For example, in our ImageNet experiments, we construct 5 LLM-prompt templates compared to the 80 image-prompts used by CLIP for zero-shot ImageNet classification.

After constructing these LLM-prompts, we generate 10 different image-prompts for each of the LLM-prompts. This means for ImageNet we use an LLM to generate a total of 50 customized image-prompts for each image category. For each of these, we generate a maximum of 50 tokens, but halt a generation early if it produces a period. Additionally, we generate with a high temperature of 0.99, which encourages more diversity among the 10 generated image-prompts. We also clean each generated sentence by deleting any blank lines and adding a period at the end.
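A minimal sketch of this generation step follows; it assumes the legacy OpenAI completions client and simplified LLM-prompt templates, so the client calls and parameter wiring are illustrative rather than the paper's exact code.

```python
# Sketch of image-prompt generation (Section 2.1). Assumes the pre-1.0
# `openai` Python package and an API key configured in the environment.
import openai

llm_prompt_templates = [
    "Describe what a {} looks like",
    "How can you identify a {}?",
    "What does a {} look like?",
]

def generate_image_prompts(category, n_per_template=10):
    image_prompts = []
    for template in llm_prompt_templates:
        response = openai.Completion.create(
            model="text-davinci-002",   # GPT-3 DaVinci-002
            prompt=template.format(category),
            max_tokens=50,              # generate at most 50 tokens
            temperature=0.99,           # high temperature -> more diverse sentences
            n=n_per_template,           # 10 generations per LLM-prompt
            stop=".",                   # halt a generation early at the first period
        )
        for choice in response["choices"]:
            text = " ".join(choice["text"].split())  # remove blank lines / extra whitespace
            if text:
                image_prompts.append(text + ".")     # add the period back at the end
    return image_prompts

# generate_image_prompts("platypus") -> ~30 customized image-prompts
```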
2.2. Utilizing Customized Prompts

After generating image-prompts for each of the categories, we then perform zero-shot image classification. While there are a number of open-vocabulary models [40, 23, 42, 63], we report our results using CLIP [42] as this is the most popular publicly available open-vocabulary model.

CLIP consists of a text encoder and an image encoder (schematic on the top of Figure 1). In the standard setting, there are a number of hand-written templates which can be completed with the relevant category names (e.g. "A photo of a {}", "A photo of many {}"). To classify the images in a dataset, each of these templates is filled in with a given category name. Then each of these sentences is embedded via the text encoder, and all sentences completed with the same category name are averaged and normalized. This results in n embeddings, where n is the number of categories in the dataset. Each of these n embeddings is the mean of many different sentence embeddings. Then each image in the dataset is embedded using the image encoder. This embedding is compared to each of the n text embeddings using cosine similarity and is labeled with the most similar one.

CuPL requires only a small adjustment from this standard practice. Instead of filling in the hand-written templates for each category, we simply replace these altogether with the sentences output by GPT-3. This means that for CuPL, hand-written templates are only used as input for the LLM, while the prompts for CLIP are entirely generated text. We present two different settings of CuPL (as shown in Table 1), each representing a different trade-off between accuracy and hand-engineering.

1. CuPL (base). This setting uses three hand-written sentences across all 15 examined datasets. We do this by constructing general LLM-prompt templates which are filled in with the category names for each dataset. Our three general templates are as follows:

Describe what a/the {} looks like:
Describe a/the {}:
What are the identifying characteristics of a/the {}?

The blank portion of these templates is either filled in with the category type plus the category name (e.g. "pet" + {} for the Oxford Pets dataset [38] or "aircraft" + {} for FGVC Aircraft [32]) or just the category name for more general datasets like ImageNet [13]. Type specification is necessary because of words that have multiple meanings. For example, "boxer" from the Oxford Pets dataset can also mean a person who boxes, as opposed to a dog breed, so it is necessary to specify "Describe a pet boxer:". Similarly, "Tornado" from the FGVC Aircraft dataset can be a type of aircraft or a type of weather.

2. CuPL (full). In this setting we use different LLM-prompt templates for each dataset, just as [42] uses different image-prompt templates for each dataset. However, we use fewer hand-written templates overall, and the templates contain less dataset-specific information. For this work, each dataset has between 2 and 9 LLM-prompts, which generate between 20 and 90 image-prompts per category (10 generated sentences per LLM-prompt). For ImageNet, we use the following 5 LLM-prompts: (1) "Describe what a(n) {} looks like", (2) "How can you identify a(n) {}?", (3) "What does a(n) {} look like?", (4) "A caption of an image of a(n) {}", (5) "Describe an image from the internet of a(n) {}". Full LLM-prompts for all datasets as well as example image-prompts are given in the Appendix.
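The classification step described above can be sketched with the public CLIP package (github.com/openai/CLIP); this is an illustrative sketch rather than the paper's released code, and `cupl_prompts` (a mapping from class name to its generated image-prompts) is assumed to come from the generation step in Section 2.1.

```python
# Sketch of zero-shot classification with CuPL prompts (Section 2.2).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def build_class_embeddings(cupl_prompts):
    """Embed every image-prompt, then average and normalize per category."""
    class_embs = []
    with torch.no_grad():
        for classname, prompts in cupl_prompts.items():
            tokens = clip.tokenize(prompts, truncate=True).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize each sentence embedding
            emb = emb.mean(dim=0)                        # average over the category's prompts
            class_embs.append(emb / emb.norm())          # re-normalize the mean embedding
    return torch.stack(class_embs)                       # (n_categories, d)

def classify(image_path, class_embs):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = img_emb @ class_embs.T       # cosine similarity to each category embedding
    return sims.argmax(dim=-1).item()   # index of the most similar category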
Dataset          std [42]   # hw   CuPL (base)   Δ std   # hw   CuPL (full)   Δ std   # hw
ImageNet           75.54      80      76.19      +0.65      3      76.69      +1.15      5
DTD                55.20       8      58.90      +3.70      3      61.70      +6.50      6
Stanford Cars      77.53       8      76.49      -1.04      3      77.63      +0.10      9
SUN397             69.31       2      72.74      +3.43      3      73.31      +4.00      3
Food101            93.08       1      93.33      +0.25      3      93.36      +0.28      3
FGVC Aircraft      32.88       2      36.69      +3.81      3      36.11      +3.23      2
Oxford Pets        93.33       1      93.37      +0.04      3      93.81      +0.48      2
Caltech101         93.24      34      93.45      +0.21      3      93.45      +0.21      3
Flowers 102        78.53       1      78.83      +0.30      3      79.67      +1.14      2
UCF101             77.45      48      77.74      +0.29      3      78.36      +0.91      5
Kinetics-700       60.07      28      60.24      +0.17      3      60.63      +0.56      4
RESISC45           71.10      18      68.96      -2.14      3      71.69      +0.59      5
CIFAR-10           95.59      18      95.81      +0.22      3      95.84      +0.25      3
CIFAR-100          78.26      18      78.47      +0.21      3      78.57      +0.31      4
Birdsnap           50.43       1      51.11      +0.63      3      51.11      +0.63      3
mean               73.43              74.15                        74.80
Total # hw                    268                          45                           59
Unique # hw                   175                           3                           45
Table 1. Performance of CuPL prompts compared to the standard, hand-written prompts in CLIP [42] ("std") on 15 zero-shot image classification benchmarks. "Δ std" is the difference from the standard prompts; positive values indicate improvement. In addition to accuracy, we show the number of prompt templates that are hand-written ("# hw") for each dataset using each method, as well as the total and unique number of hand-written templates for each method (the unique number counts a template only once even if it is used for multiple datasets). Note that CuPL (base) uses just three hand-constructed sentences across all datasets compared to 175 in the standard method.
Figure 3. Performance of CuPL as models scale. (Top) ImageNet Top-1 accuracy for various scales of CLIP. CuPL prompts remain consistently better than standard prompts even as we adjust the CLIP model size (ViT-B/32, ViT-B/16, ViT-L/14). GPT-3 model set as DaVinci-002. (Bottom) ImageNet Top-1 accuracy for various scales of GPT-3 (ada, babbage, curie, davinci-002). Larger models produce higher accuracy. CLIP model set as ViT-L/14.

3. Experiments and Results

We first discuss the details of our experimental setup. We next show improvements on a wide range of image classification benchmarks. We then examine the scaling behavior with respect to the model size and report observations.

3.1. Setup

Unless specified otherwise, we use CLIP with a backbone of ViT-L/14 [14] and the GPT-3 DaVinci-002 model. Additionally, in order to perform open-vocabulary image classification, each image category needs a natural language label. This is sometimes provided by the dataset, but not always (e.g. ImageNet categories are described by an id number which can map to multiple synonyms). For this work, we use the same natural language labels specified in [42].

We report our findings on 15 zero-shot image recognition benchmarks: ImageNet [13], Describable Textures Dataset (DTD) [11], Stanford Cars [26], Scene UNderstanding (SUN397) [60], Food101 [4], FGVC Aircraft [32], Oxford Pets [38], Caltech101 [16], Flowers 102 [36], UCF101 [52], Kinetics-700 [8], Remote Sensing Image Scene Classification (RESISC45) [10], CIFAR-10 [27], CIFAR-100 [27], and Birdsnap [2]. For the two video datasets, we extract the middle frame of the video, as is done in Radford et al. [42].
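For the two video benchmarks, the middle-frame extraction mentioned above can be done as in the sketch below; OpenCV is an assumption here, not necessarily the tooling used by the authors.

```python
# Sketch: grab the middle frame of a video for zero-shot classification.
import cv2

def middle_frame(video_path):
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)            # BGR -> RGB for CLIP preprocessing
```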
3.2. Results

Our results for the base prompts setting and the full prompts setting are in Table 1. We present our method's performance on 15 different image classification benchmarks, comparing both the classification accuracy and the number of hand-written sentence templates needed for each method. Note that for the standard method [42], the hand-written sentences refer to the image-prompts, while for CuPL they refer to the LLM-prompts.
                  Standard   CuPL    Menon et al. [33]
Top-1 accuracy      75.54    76.69        75.00

Table 2. Comparison with Menon et al. [33] on Top-1 ImageNet accuracy with ViT-L/14.
We hold the number of LLM-prompts constant at 5 and adjust how many image-prompts we generate per LLM-prompt. We plot the accuracy given the total number of image-prompts (so 10 generated image-prompts per LLM-prompt corresponds to 50 total image-prompts). We see that CuPL begins to outperform the baseline at just 25 image-prompts, well below the 80 image-prompts used in the baseline.

Additional Analysis. In the Appendix, we provide comparisons between CuPL prompts and descriptive prompts generated with definitions of ImageNet classes, as well as with Wikipedia descriptions of ImageNet classes. We find that CuPL prompts outperform both of these baselines. Additionally, we provide results of ensembling CuPL prompts and the baseline hand-written prompts used in [42]. We find that this ensemble outperforms just baseline prompts for all datasets, and outperforms just CuPL prompts for some datasets.

3.4. Shapley Value Analysis

We show that CuPL descriptions allow CLIP to place more importance on image regions that are most relevant for the correct classification. In order to measure the importance of regions in the image, we invoke Shapley values [49], a tool from game theory that has become popular for understanding which input information contributes to a model's final prediction [31, 9]. Shapley values can be computed for any model, and although there are methods designed specifically for vision transformers [12] (the underlying architecture of CLIP), we use a simple model-agnostic calculation [35]. We employ Shapley values to understand the importance of different image regions with CuPL prompts versus baseline prompts, and we find that CuPL places more value on regions that are emphasized in object descriptions, and thus are likely important for obtaining correct classifications. We demonstrate this correlation in two ways: (1) visualizing heatmaps of importance over images, and (2) measuring the importance of segmented image parts annotated by the PartImageNet Dataset [18].

Importance Heatmaps. To understand how CuPL captions lead to a change in importance of different image regions, we calculate the Shapley value of small image patches when using CuPL prompts versus when using baseline prompts. We calculate the Shapley values with respect to a binary classification probability between the correct class and a similar distractor class in order to understand how CuPL corrects these errors. As shown in Figure 5, we examine the important regions of an image of a dog when classifying between two very similar dog categories: a "Schipperke dog" versus a "Groenendael dog". Both of these classes are Belgian dogs that are black with pointy ears. However, they have a few subtle differences, including the typical appearance of their tails. Additionally, we show the important regions of an image when classifying between a "Tree frog" and a "Tailed frog", which also look very similar.

For each binary classification, we show four heatmaps: (1) the regions that contribute to a higher probability of the correct class when using CuPL prompts, (2) the regions that contribute to a higher probability of the incorrect class when using CuPL prompts, (3) the regions that contribute to a higher probability of the correct class when using baseline prompts, and (4) the regions that contribute to a higher probability of the incorrect class when using baseline prompts. Interestingly, we find that not only does CuPL place importance on different regions of the image, but these regions correspond to descriptions in the CuPL prompts. For example, the tail of the dog is very important to the "Schipperke" probability when using CuPL prompts, but not when using baseline prompts, and the tail of the Schipperke dog is described 10 times in the CuPL descriptions of this class. Similarly, we find that the eyes in the image of the frog are much more important when classifying with CuPL than with the baseline, and that the eyes are mentioned 10 times in the CuPL description of a tree frog. We provide more examples of this phenomenon in the Appendix.

Importance of Segmented Parts. In order to understand the correlation between the importance of an image region and its frequency in CuPL prompts on a larger scale, we utilize the PartImageNet Dataset [18]. This dataset contains segmentation maps of the different parts of a class for a subset of ImageNet classes. For example, the dog classes have the parts: 'head', 'body', 'leg' and 'tail'. We use these segmentation maps to obtain the Shapley value for each part of the animal with respect to the final probability of the ground truth class. To understand the effect of changing to CuPL prompts, we calculate the difference between the Shapley values with CuPL prompts and with baseline prompts, and we average across all images in a class. So for each part in each examined class we calculate the following (where SV denotes the Shapley value):

\frac{1}{|\text{class}|} \sum_{\text{image} \in \text{class}} \left[ SV_{\text{CuPL}}(\text{image}, \text{part}) - SV_{\text{base}}(\text{image}, \text{part}) \right]

This gives us a score for how much more important a part of an animal is to CuPL compared to the baseline for classification. Additionally, we quantify how prevalent each body part is in the CuPL descriptions. We do this using the WordNet [34] database to tag each word as part of the 'leg', 'head', etc. More details of this tagging system are given in the Appendix. We present our findings in Figure 6. We find that the parts that are more important to CuPL are highly correlated with the parts that are present in the descriptions of the animals (and thus likely important to the identification of the animal). For example, head-related attributes of the Japanese Spaniel class are frequently mentioned in the descriptions. Additionally, the 'head' in the image is much more important to the final prediction for CuPL than for the baseline. Thus, CuPL is able to extract important information for identifying the animal from the text and incorporate it into classification predictions.
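A minimal sketch of the model-agnostic Shapley estimate used in this analysis is given below; the patch grid, masking value, number of sampled permutations, and the `prob_fn` wiring are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch: permutation-sampling Shapley values for image patches (Section 3.4),
# with respect to a binary probability (correct class vs. a distractor class).
import numpy as np

def binary_prob(img_emb, emb_correct, emb_distractor, temp=100.0):
    """Softmax probability of the correct class against a single distractor,
    given normalized image/text embeddings."""
    logits = temp * np.array([img_emb @ emb_correct, img_emb @ emb_distractor])
    e = np.exp(logits - logits.max())
    return (e / e.sum())[0]

def patch_shapley(image, prob_fn, grid=7, n_perms=200, fill=0.5):
    """Estimate one Shapley value per patch by sampling patch permutations and
    accumulating each patch's marginal effect on prob_fn.
    `image` is a float array (H, W, 3) in [0, 1]; `prob_fn` maps an image to a
    probability (e.g., encode with CLIP, then call binary_prob)."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    values = np.zeros(grid * grid)
    for _ in range(n_perms):
        order = np.random.permutation(grid * grid)
        masked = np.full_like(image, fill)   # start from a fully masked image
        prev = prob_fn(masked)
        for p in order:
            r, c = divmod(p, grid)
            masked[r * h:(r + 1) * h, c * w:(c + 1) * w] = \
                image[r * h:(r + 1) * h, c * w:(c + 1) * w]   # reveal patch p
            cur = prob_fn(masked)
            values[p] += cur - prev          # marginal contribution of patch p
            prev = cur
    return values / n_perms

def part_score(sv_cupl, sv_base):
    """Per-part score from the equation above: mean over a class's images of the
    Shapley value with CuPL prompts minus the value with baseline prompts."""
    return float(np.mean(np.asarray(sv_cupl) - np.asarray(sv_base)))
```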
[Figure 5 examples: for ground-truth class Schipperke, the CuPL prompt "A Schipperke is a small, black Belgian dog with pointy ears and an upright tail." versus the distractor CuPL prompt "A Groenendael dog can be identified by its black coat and erect ears.", and the baseline prompts "A photo of a Schipperke" versus "A photo of a Groenendael dog", with an example Groenendael dog image; for ground-truth class Tree Frog, the CuPL prompt "A tree frog looks like a small frog with large eyes." versus "The tailed frog is a small frog that is found in North America.", and the baseline prompts "A photo of a tree frog" versus "A photo of a tailed frog", with an example tailed frog image.]

Figure 5. CuPL prompts lead the model to focus on semantically important regions of the image. We use Shapley values (Section 3.4) to visualize the importance of each region in a binary classification problem. We examine which parts of an image lead the model to classify it as the correct class versus a commonly confused class. We present the original image (column A), as well as four heatmaps showing which regions raise the probability of the correct class for the CuPL model (column B), the incorrect class for the CuPL model (column C), the correct class for the baseline model (column D), and the incorrect class for the baseline model (column E). Additionally, we show that the regions that are more important to CuPL than to the baseline correspond to regions mentioned in the CuPL prompts (i.e., "tail", which is a commonly mentioned word in Schipperke dog CuPL prompts, and "eyes", which is a common word in tree frog prompts). We also show an example image from the distractor class to demonstrate the level of similarity between these fine-grained classes (column F). Finally, we see that CuPL scores the correct class higher, whereas the baseline scores the incorrect class higher. This series of observations leads us to believe that CuPL is able to correct errors because the descriptive prompts cause the model to weigh semantically important regions more heavily.

4. Related Work

4.1. Natural Language Descriptions for Image Classification

Several prior works use text-based knowledge of image categories to improve classification accuracy. [15] extract visual information from unstructured text descriptions collected from the internet to recognize parts of objects and classify them in a zero-shot way. [45] and [19] use natural language descriptions of bird types to train a multimodal classification model. [21] use hand-collected attribute tags to attend over relevant features in images. [39] extract visual information from Wikipedia descriptions to enable zero-shot bird classification. Additional works [50, 6] show improvements on large datasets (e.g., ImageNet) using external information from databases such as Imagenet-wiki and Wordnet. While these works show the effectiveness of augmenting zero-shot models with descriptive text, all of these prior works rely on external natural language databases for descriptions. This often limits the possible categories that can be classified and can require extensive preprocessing to extract visual descriptions from noisy natural language.

4.2. Generated Text for Downstream Tasks

Recent work has utilized text generated from LLMs in a number of ways. [47] use an LLM to paraphrase existing image captions to use as data augmentation for CLIP.
[Figure 6 panels: Japanese Spaniel, Coucal Bird, Tree Frog, Gila Monster Lizard; each panel compares Text Part Importance with Image Part Importance for tagged parts (e.g., Tag: Legs).]

Figure 6. When specific parts of an animal/object are frequently mentioned in CuPL prompts, the CuPL model places more importance on these parts in the image compared to the baseline model. The PartImageNet dataset [18] provides segmentation maps of ImageNet images broken down into parts. For example, Tree Frog is broken down into the parts: 'head', 'leg', 'body' and 'tail'. We use the WordNet database [34] to tag words in CuPL prompts as belonging to one of these parts. We refer to the number of mentions of the part as the Text Part Importance. We then use the PartImageNet segmentations to compare the Shapley value of each part when using CuPL prompts and baseline prompts, which we call the Image Part Importance. We find a strong correlation between the Text Part Importance and the Image Part Importance, leading to the conclusion that CuPL is able to take advantage of the knowledge contained in the descriptions when making its predictions.
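The WordNet-based tagging behind the Text Part Importance can be sketched as follows; the paper's exact tagging rules are described in its appendix, so the part-holonym heuristic and part list below are assumptions.

```python
# Sketch: tag nouns in CuPL prompts with a body part via WordNet [34].
# Requires `pip install nltk` and `nltk.download("wordnet")`.
from nltk.corpus import wordnet as wn

PARTS = ["head", "leg", "tail", "body"]

def part_tag(word):
    """Return the part a noun belongs to (e.g., 'eye' -> 'head'), else None."""
    word = word.lower().strip(".,!?")
    if word in PARTS:
        return word
    for synset in wn.synsets(word, pos=wn.NOUN):
        # Walk transitive part-holonyms, e.g., eye -> face -> head.
        for holonym in synset.closure(lambda s: s.part_holonyms()):
            for part in PARTS:
                if part in holonym.lemma_names():
                    return part
    return None

def text_part_importance(prompts):
    """Count how often each part (or one of its sub-parts) is mentioned."""
    counts = {p: 0 for p in PARTS}
    for sentence in prompts:
        for word in sentence.split():
            tag = part_tag(word)
            if tag is not None:
                counts[tag] += 1
    return counts
```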
[30] use GPT-3 to generate knowledge on a topic when given a number of demonstrations, which is then used to improve accuracy on common sense reasoning questions. [20] use an LLM to add labels to text to improve text classification accuracy. In [64], the outputs of a GPT-2 model are used to train an encoder on top of a vision model to generate multimodal image representations for a variety of tasks. [53] utilize a language model to perform image captioning by iteratively generating candidate image captions with an LLM and then using feedback from an open-vocabulary model to align them to a given image. Similarly, [62] use GPT-3 along with text descriptions of images for the Visual Question Answering (VQA) task. However, unlike CuPL, these prior works are either purely language tasks (common sense reasoning, text classification) or multimodal with some language component (image captioning, VQA). Most similarly, [33] use an LLM to generate a structured list of attributes which are then reformatted into captions for CLIP. However, this work differs from ours as it does not improve over human-written templates. Additionally, [61] use an LLM to generate a list of natural language attributes for ImageNet classes and then select a subset of these attributes for each class in a few-shot manner. Our work differs from this as we remain in the zero-shot setting.

4.3. Prompt Engineering

Previous efforts have explored methods for obtaining successful natural language prompts. For both open-vocabulary image classification models as well as LLMs, the format of prompts is known to highly affect accuracy [48, 42, 5, 17]. This has led to a large effort to find optimal prompt formats. Proposed methods include crowd-sourcing high performing prompts [1] as well as framing prompts to induce models to give explanations as well as answers [57, 25, 37]. Additional works have proposed learning prompts via gradient-based methods [65, 41, 29, 28, 51], retrieval from a database [46], or reformatting/rephrasing existing prompts [24, 46].

Most relevant to this work are a number of methods for designing optimal prompts for zero-shot image classification with open-vocabulary models. These methods learn prompt formats which yield high accuracy for image classification using either supervised [66, 43] or unsupervised [22] methods. However, unlike these prior works, this work requires no additional training or labeled data.

5. Conclusion

We demonstrate that leveraging knowledge from an LLM can immediately improve zero-shot accuracy on a variety of image classification tasks, with much less hand-engineering effort to craft natural language prompts. Furthermore, prompts can be customized to the desired categories, rather than relying on a general template that applies to all categories. Finally, using prompts generated by LLMs lowers the barrier of prior knowledge about the dataset, which is often required when crafting prompt templates.

Querying an LLM for prompt construction is simple, straightforward and, as our results suggested, immediately beneficial.
The hypothesis that a joint force of LLMs and open-vocabulary models would improve zero-shot image classification is thoroughly tested in this work. We hope these findings serve as a useful tool towards understanding and improving zero-shot image classification, and more generally, the consolidation of model capacities and modalities through natural language.

6. Acknowledgements

We thank Mitchell Wortsman, Gabriel Ilharco, Vivek Ramanujan, Aditya Kusupati, Jason Wei, and Ofir Press for helpful discussions and draft feedback. We also thank Samir Yitzhak Gadre and Alex Fang for their useful experimental suggestions. Finally, we thank our reviewers and meta-reviewers for their time and feedback during the peer review process. This work is in part supported by NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, DARPA W911NF-15-1-0543 and gifts from the Allen Institute for Artificial Intelligence, Google, and Apple.

References

[1] Stephen H. Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Mike Tian-Jian Jiang, and Alexander M. Rush. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022. 8
[2] Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2018, 2014. 4
[3] Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020. 24
[4] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014. 4
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. 3, 8, 20, 21
[6] Sebastian Bujwid and Josephine Sullivan. Large-scale zero-shot image classification from rich and diverse textual descriptions. arXiv preprint arXiv:2103.09669, 2021. 7
[7] Sebastian Bujwid and Josephine Sullivan. Large-scale zero-shot image classification from rich and diverse textual descriptions. arXiv preprint arXiv:2103.09669, 2021. 19
[8] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 4
[9] Hugh Chen, Ian C Covert, Scott M Lundberg, and Su-In Lee. Algorithms to estimate Shapley value feature attributions. arXiv preprint arXiv:2207.07605, 2022. 6
[10] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017. 4
[11] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 4
[12] Ian Covert, Chanwoo Kim, and Su-In Lee. Learning to estimate Shapley values with vision transformers. arXiv preprint arXiv:2206.05282, 2022. 6
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 2, 3, 4, 22
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 4
[15] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and A. Elgammal. Link the head to the "beak": Zero shot learning from noisy text description at part precision. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6288–6297, 2017. 7
[16] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004. 4
[17] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020. 8
[18] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII, pages 128–145. Springer, 2022. 6, 8, 22
[19] Xiangteng He and Yuxin Peng. Fine-grained image classification via combining vision and language. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7332–7340, 2017. 7
[20] Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juan-Zi Li, and Maosong Sun. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. In ACL, 2022. 8
[21] Siteng Huang, Min Zhang, Yachen Kang, and Donglin Wang. Attributes-guided and pure-visual attention alignment for few-shot recognition. In AAAI, 2021. 7
[22] Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022. 8
[23] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 1, 3
[24] Zhengbao Jiang, Frank F. Xu, J. Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020. 8
[25] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. 8
[26] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013. 4
[27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 4
[28] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021. 8
[29] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), abs/2101.00190, 2021. 8
[30] Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. In ACL, 2022. 7
[31] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017. 6
[32] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013. 2, 3, 4
[33] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183, 2022. 5, 8
[34] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995. 6, 8, 19, 22, 24, 25
[35] Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling permutations for Shapley value estimation. Journal of Machine Learning Research, 23(43):1–46, 2022. 6
[36] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008. 4
[37] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. 8
[38] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012. 3, 4
[39] Tzuf Paz-Argaman, Yuval Atzmon, Gal Chechik, and Reut Tsarfaty. Zest: Zero-shot learning from text descriptions using textual similarity and visual summarization. In FINDINGS, 2020. 7
[40] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, et al. Combined scaling for open-vocabulary image classification. arXiv preprint arXiv:2111.10050, 2021. 1, 3
[41] Guanghui Qin and Jas' Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021. 8
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1, 2, 3, 4, 5, 6, 8, 12, 20, 22, 23
[43] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022. 8
[44] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019. 22
[45] Scott E. Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 49–58, 2016. 7
[46] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In NAACL, 2022. 8
[47] Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a caption worth a thousand images? a controlled study for representation learning. arXiv preprint arXiv:2207.07635, 2022. 7
[48] Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In EACL, 2021. 8
[49] Lloyd S Shapley et al. A value for n-person games. 1953. 6
[50] Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, and Jianfeng Gao. K-lite: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222, 2022. 7
[51] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Eliciting knowledge from language models using automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020. 8
[52] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 4
[53] Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier. Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655, 2022. 8
[54] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(11), 2008. 23, 24, 25, 26
[55] Ben Wang and Aran Komatsuzaki. Gpt-j-6b: A 6 billion parameter autoregressive language model, 2021. 20
[56] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019. 22
[57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 8
[58] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 20
[59] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022. 22
[60] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010. 4
[61] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. arXiv preprint arXiv:2211.11158, 2022. 8
[62] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In AAAI, 2022. 8
[63] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022. 1, 3
[64] Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, Jae Sung Park, Ximing Lu, Prithviraj Ammanabrolu, Rowan Zellers, Ronan Le Bras, Gunhee Kim, and Yejin Choi. Multimodal knowledge alignment with reinforcement learning. arXiv preprint arXiv:2205.12630, 2022. 8
[65] Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better few-shot learners. arXiv preprint arXiv:2108.13161, 2021. 8
[66] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022. 8