What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification
Sarah Pratt et al., ICCV 2023
Abstract

Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") which are completed with each of the category names. This work introduces a simple method to generate higher accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that contain important discriminating characteristics of the image categories, which allows the model to place greater importance on the corresponding regions of the image when making predictions. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot. Code available at https://github.com/sarahpratt/CuPL.

[Figure 1 panels: (Top, "Standard") an image encoder and a text encoder match an image against prompts built from category names (e.g., "A platypus"). (Bottom, CuPL) the question "What does a platypus look like?" is given to GPT-3, which generates descriptions such as "A platypus looks like a beaver with a duck's bill"; these, together with descriptions of other categories (e.g., "Goldfish are small orange fish with shiny scales", "A spatula is a flat rectangular kitchen utensil with a long handle"), are used as the text-encoder prompts.]

Figure 1. Schematic of the method. (Top) The standard method of a zero-shot open-vocabulary image classification model (e.g., CLIP [42]). (Bottom) Our method of CuPL. First, an LLM generates descriptive captions for given class categories. Next, an open-vocabulary model uses these captions as prompts for performing classification.
[Figure 2: LLM-prompts such as "What does a {lorikeet, marimba, viaduct, papillon} look like?" are given to GPT-3, which generates image-prompts such as:
"A lorikeet is a small to medium-sized parrot with a brightly colored plumage."
"A marimba is a large wooden percussion instrument that looks like a xylophone."
"A viaduct is a bridge composed of several spans supported by piers or pillars."
"A papillon is a small, spaniel-type dog with a long, silky coat and fringed ears."]
The standard approach is to hand write a number of prompt templates [42] (e.g., "a photo of a {}"), compile a natural language label for each category in the dataset, and create a set of prompts for each category by filling in each template with the natural language labels. Then, image embeddings are matched to the nearest set of prompt embeddings and labelled with the category associated with that set of prompts (more details in Section 2).
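As a concrete illustration of this template-filling step, consider the minimal sketch below; the template and label lists are illustrative, not the full sets used in [42].

```python
# Sketch of the standard prompt-ensemble construction: every hand-written
# template is completed with every category's natural-language label.
templates = [
    "a photo of a {}.",
    "a photo of many {}.",
    "a low resolution photo of a {}.",
]
class_labels = ["platypus", "tree frog", "tailed frog"]

# One set of prompts per category.
prompt_sets = {label: [t.format(label) for t in templates] for label in class_labels}
# prompt_sets["platypus"][0] -> "a photo of a platypus."
```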
This method has three major drawbacks. Firstly, each prompt template has to be hand-written, so having twice as many prompts for a category requires twice as much human effort. This can become costly as each new dataset typically has a different set of prompt templates [42].

Secondly, the prompt templates must be general enough to apply to all image categories. For example, a prompt for the ImageNet [13] category "platypus" could only be as specific as "a photo of a {platypus}", and could not be something like "a photo of a {platypus}, a type of aquatic mammal" as that template would no longer be relevant for other image categories. This is limiting, as descriptive details are useful for fine-grained classification. For example, different species of frogs share many of the same characteristics. However, tree frogs can be distinguished by their distinct large eyes. This is a valuable detail for classification but cannot be included in a general template. Therefore, when using these basic templates, the model may not take advantage of this detail in the image, leading to an incorrect categorization as demonstrated in Figure 5.

Lastly, writing high performing prompt templates currently requires prior information about the contents of the dataset. For example, the list of hand-written ImageNet prompts [42] includes "a black and white photo of the {}.", "a low resolution photo of a {}.", and "a toy {}.", all of which demonstrate prior knowledge about the type of representations present in the dataset. This information is not generalizable to other datasets, as ImageNet contains "black and white" and "toy" representations of its categories, but other datasets do not (e.g., FGVC Aircraft [32]).

To overcome these challenges, we propose Customized Prompts via Language models (CuPL). In this algorithm, we couple a large language model (LLM) with a zero-shot open-vocabulary image classification model. We use the LLM to generate prompts for each of the image categories in a dataset. Using an LLM allows us to generate an arbitrary number of prompts with a fixed number of hand-written sentences. Additionally, these prompts are now customized to each category and contain specified visual descriptions while still remaining zero-shot. This allows prompts to contain details about a class which distinguish it from other similar classes. For example, to describe a tree frog, the LLM generates the sentence "A tree frog looks like a small frog with large eyes." This not only describes the category, but specifically mentions the eyes, the feature which distinguishes the tree frog class from the most visually similar classes - other types of frogs. We find that CuPL prompts are rich with these discriminating details and show that the model is able to leverage these details to place more importance on relevant parts of the image when classifying between similar, commonly confused categories (Figure 5).

We find these customized prompts outperform the hand-written templates on 15 zero-shot image classification benchmarks, including a greater than 1 percentage point gain on ImageNet [13] Top-1 accuracy and a greater than 6 percentage point gain on the Describable Textures Dataset [11], with fewer hand-written prompts when compared to the standard method used in [42]. Finally, this method requires no additional training or labeled data for either model.
2. Methods

The CuPL algorithm consists of two steps: (1) generating customized prompts for each of the categories in a given dataset and (2) using these prompts to perform zero-shot image classification.

2.1. Generating Customized Prompts

This step consists of generating prompts using an LLM. For clarity, we distinguish between two different kinds of prompts. The first are the prompts which cue the LLM to generate the descriptions of the dataset categories. These prompts do not describe an object, but rather prompt the description of an object (e.g., "What does a platypus look like?"). We will refer to these as "LLM-prompts".

Secondly, there are the prompts to be matched with images in the zero-shot image classification model. These are the prompts that describe a category (e.g., "A platypus looks like ..."). We call them "image-prompts." In CuPL, these are the output of the LLM, as exemplified in Figure 2.

In this work, we use GPT-3 [5] as our LLM. To generate our image-prompts, we must first construct a number of LLM-prompt templates. While this does require some engineering by hand, it is significantly less than the amount of hand-engineered sentences used in the standard method of creating image-prompt templates for CLIP. For example, in our ImageNet experiments, we construct 5 LLM-prompt templates compared to the 80 image-prompts used by CLIP for zero-shot ImageNet classification.

After constructing these LLM-prompts, we generate 10 different image-prompts for each of the LLM-prompts. This means for ImageNet we use an LLM to generate a total of 50 customized image-prompts for each image category. For each of these, we generate a maximum of 50 tokens, but halt a generation early if it produces a period. Additionally, we generate with a high temperature of 0.99, which encourages more diversity among the 10 generated image-prompts. We also clean each generated sentence by deleting any blank lines and adding a period at the end.
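A minimal sketch of this generation step follows; it assumes the legacy OpenAI completions client and simplified LLM-prompt templates, so the client calls and parameter wiring are illustrative rather than the paper's exact code.

```python
# Sketch of image-prompt generation (Section 2.1). Assumes the pre-1.0
# `openai` Python package and an API key configured in the environment.
import openai

llm_prompt_templates = [
    "Describe what a {} looks like",
    "How can you identify a {}?",
    "What does a {} look like?",
]

def generate_image_prompts(category, n_per_template=10):
    image_prompts = []
    for template in llm_prompt_templates:
        response = openai.Completion.create(
            model="text-davinci-002",   # GPT-3 DaVinci-002
            prompt=template.format(category),
            max_tokens=50,              # generate at most 50 tokens
            temperature=0.99,           # high temperature -> more diverse sentences
            n=n_per_template,           # 10 generations per LLM-prompt
            stop=".",                   # halt a generation early at the first period
        )
        for choice in response["choices"]:
            text = " ".join(choice["text"].split())  # remove blank lines / extra whitespace
            if text:
                image_prompts.append(text + ".")     # add the period back at the end
    return image_prompts

# generate_image_prompts("platypus") -> ~30 customized image-prompts
```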
2.2. Utilizing Customized Prompts

After generating image-prompts for each of the categories, we then perform zero-shot image classification. While there are a number of open-vocabulary models [40, 23, 42, 63], we report our results using CLIP [42] as this is the most popular publicly available open-vocabulary model.

CLIP consists of a text encoder and an image encoder (schematic on the top of Figure 1). In the standard setting, there are a number of hand-written templates which can be completed with the relevant category names (e.g. "A photo of a {}", "A photo of many {}"). To classify the images in a dataset, each of these templates is filled in with a given category name. Then each of these sentences is embedded via the text encoder, and all sentences completed with the same category name are averaged and normalized. This results in n embeddings, where n is the number of categories in the dataset. Each of these n embeddings is the mean of many different sentence embeddings. Then each image in the dataset is embedded using the image encoder. This embedding is compared to each of the n text embeddings using cosine similarity and is labeled with the most similar one.

CuPL requires only a small adjustment from this standard practice. Instead of filling in the hand-written templates for each category, we simply replace these altogether with the sentences output by GPT-3. This means that for CuPL, hand-written templates are only used as input for the LLM, while the prompts for CLIP are entirely generated text. We present two different settings of CuPL (as shown in Table 1), each representing a different trade-off between accuracy and hand-engineering.

1. CuPL (base). This setting uses three hand-written sentences across all 15 examined datasets. We do this by constructing general LLM-prompt templates which are filled in with the category names for each dataset. Our three general templates are as follows:

Describe what a/the {} looks like:
Describe a/the {}:
What are the identifying characteristics of a/the {}?

The blank portion of these templates is either filled in with the category type plus the category name (e.g. "pet" + {} for the Oxford Pets dataset [38] or "aircraft" + {} for FGVC Aircraft [32]) or just the category name for more general datasets like ImageNet [13]. Type specification is necessary because of words that have multiple meanings. For example, "boxer" from the Oxford Pets dataset can also mean a person who boxes, as opposed to a dog breed, so it is necessary to specify "Describe a pet boxer:". Similarly, "Tornado" from the FGVC Aircraft dataset can be a type of aircraft or a type of weather.

2. CuPL (full). In this setting we use different LLM-prompt templates for each dataset, just as [42] uses different image-prompt templates for each dataset. However, we use fewer hand-written templates overall, and the templates contain less dataset-specific information. For this work, each dataset has between 2 and 9 LLM-prompts, which generate between 20 and 90 image-prompts per category (10 generated sentences per LLM-prompt). For ImageNet, we use the following 5 LLM-prompts: (1) "Describe what a(n) {} looks like", (2) "How can you identify a(n) {}?", (3) "What does a(n) {} look like?", (4) "A caption of an image of a(n) {}", (5) "Describe an image from the internet of a(n) {}". Full LLM-prompts for all datasets as well as example image-prompts are given in the Appendix.
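The classification step described above can be sketched with the public CLIP package (github.com/openai/CLIP); this is an illustrative sketch rather than the paper's released code, and `cupl_prompts` (a mapping from class name to its generated image-prompts) is assumed to come from the generation step in Section 2.1.

```python
# Sketch of zero-shot classification with CuPL prompts (Section 2.2).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def build_class_embeddings(cupl_prompts):
    """Embed every image-prompt, then average and normalize per category."""
    class_embs = []
    with torch.no_grad():
        for classname, prompts in cupl_prompts.items():
            tokens = clip.tokenize(prompts, truncate=True).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize each sentence embedding
            emb = emb.mean(dim=0)                        # average over the category's prompts
            class_embs.append(emb / emb.norm())          # re-normalize the mean embedding
    return torch.stack(class_embs)                       # (n_categories, d)

def classify(image_path, class_embs):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = img_emb @ class_embs.T       # cosine similarity to each category embedding
    return sims.argmax(dim=-1).item()   # index of the most similar category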
Dataset          std [42]   # hw   CuPL (base)   Δ std   # hw   CuPL (full)   Δ std   # hw
ImageNet           75.54      80      76.19      +0.65      3      76.69      +1.15      5
DTD                55.20       8      58.90      +3.70      3      61.70      +6.50      6
Stanford Cars      77.53       8      76.49      -1.04      3      77.63      +0.10      9
SUN397             69.31       2      72.74      +3.43      3      73.31      +4.00      3
Food101            93.08       1      93.33      +0.25      3      93.36      +0.28      3
FGVC Aircraft      32.88       2      36.69      +3.81      3      36.11      +3.23      2
Oxford Pets        93.33       1      93.37      +0.04      3      93.81      +0.48      2
Caltech101         93.24      34      93.45      +0.21      3      93.45      +0.21      3
Flowers 102        78.53       1      78.83      +0.30      3      79.67      +1.14      2
UCF101             77.45      48      77.74      +0.29      3      78.36      +0.91      5
Kinetics-700       60.07      28      60.24      +0.17      3      60.63      +0.56      4
RESISC45           71.10      18      68.96      -2.14      3      71.69      +0.59      5
CIFAR-10           95.59      18      95.81      +0.22      3      95.84      +0.25      3
CIFAR-100          78.26      18      78.47      +0.21      3      78.57      +0.31      4
Birdsnap           50.43       1      51.11      +0.63      3      51.11      +0.63      3
mean               73.43              74.15                        74.80
Total # hw                    268                          45                           59
Unique # hw                   175                           3                           45
Table 1. Performance of CuPL prompts compared to the standard, hand-written prompts in CLIP [42] ("std") on 15 zero-shot image classification benchmarks. "Δ std" is the difference from the standard prompts; positive values indicate improvement. In addition to accuracy, we show the number of prompt templates that are hand-written ("# hw") for each dataset using each method, as well as the total and unique number of hand-written templates for each method (the unique number counts a template only once even if it is used for multiple datasets). Note that CuPL (base) uses just three hand-constructed sentences across all datasets compared to 175 in the standard method.
Figure 3. Performance of CuPL as models scale. (Top) ImageNet Top-1 accuracy for various scales of CLIP. CuPL prompts remain consistently better than standard prompts even as we adjust the CLIP model size (ViT-B/32, ViT-B/16, ViT-L/14). GPT-3 model set as DaVinci-002. (Bottom) ImageNet Top-1 accuracy for various scales of GPT-3 (ada, babbage, curie, davinci-002). Larger models produce higher accuracy. CLIP model set as ViT-L/14.

3. Experiments and Results

We first discuss the details of our experimental setup. We next show improvements on a wide range of image classification benchmarks. We then examine the scaling behavior with respect to the model size and report observations.

3.1. Setup

Unless specified otherwise, we use CLIP with a backbone of ViT-L/14 [14] and the GPT-3 DaVinci-002 model. Additionally, in order to perform open-vocabulary image classification, each image category needs a natural language label. This is sometimes provided by the dataset, but not always (e.g. ImageNet categories are described by an id number which can map to multiple synonyms). For this work, we use the same natural language labels specified in [42].

We report our findings on 15 zero-shot image recognition benchmarks: ImageNet [13], Describable Textures Dataset (DTD) [11], Stanford Cars [26], Scene UNderstanding (SUN397) [60], Food101 [4], FGVC Aircraft [32], Oxford Pets [38], Caltech101 [16], Flowers 102 [36], UCF101 [52], Kinetics-700 [8], Remote Sensing Image Scene Classification (RESISC45) [10], CIFAR-10 [27], CIFAR-100 [27], and Birdsnap [2]. For the two video datasets, we extract the middle frame of the video, as is done in Radford et al. [42].
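For the two video benchmarks, the middle-frame extraction mentioned above can be done as in the sketch below; OpenCV is an assumption here, not necessarily the tooling used by the authors.

```python
# Sketch: grab the middle frame of a video for zero-shot classification.
import cv2

def middle_frame(video_path):
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)            # BGR -> RGB for CLIP preprocessing
```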
3.2. Results

Our results for the base prompts setting and the full prompts setting are in Table 1. We present our method's performance on 15 different image classification benchmarks, comparing both the classification accuracy and the number of hand-written sentence templates needed for each method. Note that for the standard method [42], the hand-written sentences refer to the image-prompts, while for CuPL they refer to the LLM-prompts.
                  Standard   CuPL    Menon et al. [33]
Top-1 accuracy      75.54    76.69        75.00

Table 2. Comparison with Menon et al. [33] on Top-1 ImageNet accuracy with ViT-L/14.
We hold the number of LLM-prompts constant at 5 and adjust how many image-prompts we generate per LLM-prompt. We plot the accuracy given the total number of image-prompts (so 10 generated image-prompts per LLM-prompt corresponds to 50 total image-prompts). We see that CuPL begins to outperform the baseline at just 25 image-prompts, well below the 80 image-prompts used in the baseline.

Additional Analysis. In the Appendix, we provide comparisons between CuPL prompts and descriptive prompts generated with definitions of ImageNet classes, as well as with Wikipedia descriptions of ImageNet classes. We find that CuPL prompts outperform both of these baselines. Additionally, we provide results of ensembling CuPL prompts and the baseline hand-written prompts used in [42]. We find that this ensemble outperforms just baseline prompts for all datasets, and outperforms just CuPL prompts for some datasets.

3.4. Shapley Value Analysis

We show that CuPL descriptions allow CLIP to place more importance on image regions that are most relevant for the correct classification. In order to measure the importance of regions in the image, we invoke Shapley values [49], a tool from game theory that has become popular for understanding which input information contributes to a model's final prediction [31, 9]. Shapley values can be computed for any model, and although there are methods designed specifically for vision transformers [12] (the underlying architecture of CLIP), we use a simple model-agnostic calculation [35]. We employ Shapley values to understand the importance of different image regions with CuPL prompts versus baseline prompts, and we find that CuPL places more value on regions that are emphasized in object descriptions, and thus are likely important for obtaining correct classifications. We demonstrate this correlation in two ways: (1) visualizing heatmaps of importance over images, and (2) measuring the importance of segmented image parts annotated by the PartImageNet Dataset [18].

Importance Heatmaps. To understand how CuPL captions lead to a change in importance of different image regions, we calculate the Shapley value of small image patches when using CuPL prompts versus when using baseline prompts. We calculate the Shapley values with respect to a binary classification probability between the correct class and a similar distractor class in order to understand how CuPL corrects these errors. As shown in Figure 5, we examine the important regions of an image of a dog when classifying between two very similar dog categories: a "Schipperke dog" versus a "Groenendael dog". Both of these classes are Belgian dogs that are black with pointy ears. However, they have a few subtle differences, including the typical appearance of their tails. Additionally, we show the important regions of an image when classifying between a "Tree frog" and a "Tailed frog", which also look very similar.

For each binary classification, we show four heatmaps: (1) the regions that contribute to a higher probability of the correct class when using CuPL prompts, (2) the regions that contribute to a higher probability of the incorrect class when using CuPL prompts, (3) the regions that contribute to a higher probability of the correct class when using baseline prompts, and (4) the regions that contribute to a higher probability of the incorrect class when using baseline prompts. Interestingly, we find that not only does CuPL place importance on different regions of the image, but these regions correspond to descriptions in the CuPL prompts. For example, the tail of the dog is very important to the "Schipperke" probability when using CuPL prompts, but not when using baseline prompts, and the tail of the Schipperke dog is described 10 times in the CuPL descriptions of this class. Similarly, we find that the eyes in the image of the frog are much more important when classifying with CuPL than with the baseline, and that the eyes are mentioned 10 times in the CuPL description of a tree frog. We provide more examples of this phenomenon in the Appendix.

Importance of Segmented Parts. In order to understand the correlation between the importance of an image region and its frequency in CuPL prompts on a larger scale, we utilize the PartImageNet Dataset [18]. This dataset contains segmentation maps of the different parts of a class for a subset of ImageNet classes. For example, the dog classes have the parts: 'head', 'body', 'leg' and 'tail'. We use these segmentation maps to obtain the Shapley value for each part of the animal with respect to the final probability of the ground truth class. To understand the effect of changing to CuPL prompts, we calculate the difference between the Shapley values with CuPL prompts and with baseline prompts, and we average across all images in a class. So for each part in each examined class we calculate the following (where SV denotes the Shapley value):

\frac{1}{|\text{class}|} \sum_{\text{image} \in \text{class}} \left[ SV_{\text{CuPL}}(\text{image}, \text{part}) - SV_{\text{base}}(\text{image}, \text{part}) \right]

This gives us a score for how much more important a part of an animal is to CuPL compared to the baseline for classification. Additionally, we quantify how prevalent each body part is in the CuPL descriptions. We do this using the WordNet [34] database to tag each word as part of the 'leg', 'head', etc. More details of this tagging system are given in the Appendix. We present our findings in Figure 6. We find that the parts that are more important to CuPL are highly correlated with the parts that are present in the descriptions of the animals (and thus likely important to the identification of the animal). For example, head-related attributes of the Japanese Spaniel class are frequently mentioned in the descriptions. Additionally, the 'head' in the image is much more important to the final prediction for CuPL than for the baseline. Thus, CuPL is able to extract important information for identifying the animal from the text and incorporate it into classification predictions.
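A minimal sketch of the model-agnostic Shapley estimate used in this analysis is given below; the patch grid, masking value, number of sampled permutations, and the `prob_fn` wiring are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch: permutation-sampling Shapley values for image patches (Section 3.4),
# with respect to a binary probability (correct class vs. a distractor class).
import numpy as np

def binary_prob(img_emb, emb_correct, emb_distractor, temp=100.0):
    """Softmax probability of the correct class against a single distractor,
    given normalized image/text embeddings."""
    logits = temp * np.array([img_emb @ emb_correct, img_emb @ emb_distractor])
    e = np.exp(logits - logits.max())
    return (e / e.sum())[0]

def patch_shapley(image, prob_fn, grid=7, n_perms=200, fill=0.5):
    """Estimate one Shapley value per patch by sampling patch permutations and
    accumulating each patch's marginal effect on prob_fn.
    `image` is a float array (H, W, 3) in [0, 1]; `prob_fn` maps an image to a
    probability (e.g., encode with CLIP, then call binary_prob)."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    values = np.zeros(grid * grid)
    for _ in range(n_perms):
        order = np.random.permutation(grid * grid)
        masked = np.full_like(image, fill)   # start from a fully masked image
        prev = prob_fn(masked)
        for p in order:
            r, c = divmod(p, grid)
            masked[r * h:(r + 1) * h, c * w:(c + 1) * w] = \
                image[r * h:(r + 1) * h, c * w:(c + 1) * w]   # reveal patch p
            cur = prob_fn(masked)
            values[p] += cur - prev          # marginal contribution of patch p
            prev = cur
    return values / n_perms

def part_score(sv_cupl, sv_base):
    """Per-part score from the equation above: mean over a class's images of the
    Shapley value with CuPL prompts minus the value with baseline prompts."""
    return float(np.mean(np.asarray(sv_cupl) - np.asarray(sv_base)))
```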
[Figure 5 examples: for ground-truth class Schipperke, the CuPL prompt "A Schipperke is a small, black Belgian dog with pointy ears and an upright tail." versus the distractor CuPL prompt "A Groenendael dog can be identified by its black coat and erect ears.", and the baseline prompts "A photo of a Schipperke" versus "A photo of a Groenendael dog", with an example Groenendael dog image; for ground-truth class Tree Frog, the CuPL prompt "A tree frog looks like a small frog with large eyes." versus "The tailed frog is a small frog that is found in North America.", and the baseline prompts "A photo of a tree frog" versus "A photo of a tailed frog", with an example tailed frog image.]

Figure 5. CuPL prompts lead the model to focus on semantically important regions of the image. We use Shapley values (Section 3.4) to visualize the importance of each region in a binary classification problem. We examine which parts of an image lead the model to classify it as the correct class versus a commonly confused class. We present the original image (column A), as well as four heatmaps showing which regions raise the probability of the correct class for the CuPL model (column B), the incorrect class for the CuPL model (column C), the correct class for the baseline model (column D), and the incorrect class for the baseline model (column E). Additionally, we show that the regions that are more important to CuPL than to the baseline correspond to regions mentioned in the CuPL prompts (i.e., "tail", which is a commonly mentioned word in Schipperke dog CuPL prompts, and "eyes", which is a common word in tree frog prompts). We also show an example image from the distractor class to demonstrate the level of similarity between these fine-grained classes (column F). Finally, we see that CuPL scores the correct class higher, whereas the baseline scores the incorrect class higher. This series of observations leads us to believe that CuPL is able to correct errors because the descriptive prompts cause the model to weigh semantically important regions more heavily.

4. Related Work

4.1. Natural Language Descriptions for Image Classification

Several prior works use text-based knowledge of image categories to improve classification accuracy. [15] extract visual information from unstructured text descriptions collected from the internet to recognize parts of objects and classify them in a zero-shot way. [45] and [19] use natural language descriptions of bird types to train a multimodal classification model. [21] use hand-collected attribute tags to attend over relevant features in images. [39] extract visual information from Wikipedia descriptions to enable zero-shot bird classification. Additional works [50, 6] show improvements on large datasets (e.g., ImageNet) using external information from databases such as Imagenet-wiki and Wordnet. While these works show the effectiveness of augmenting zero-shot models with descriptive text, all of these prior works rely on external natural language databases for descriptions. This often limits the possible categories that can be classified and can require extensive preprocessing to extract visual descriptions from noisy natural language.

4.2. Generated Text for Downstream Tasks

Recent work has utilized text generated from LLMs in a number of ways. [47] use an LLM to paraphrase existing image captions to use as data augmentation for CLIP.
[Figure 6 panels: Japanese Spaniel, Coucal Bird, Tree Frog, Gila Monster Lizard; each panel compares Text Part Importance with Image Part Importance for tagged parts (e.g., Tag: Legs).]

Figure 6. When specific parts of an animal/object are frequently mentioned in CuPL prompts, the CuPL model places more importance on these parts in the image compared to the baseline model. The PartImageNet dataset [18] provides segmentation maps of ImageNet images broken down into parts. For example, Tree Frog is broken down into the parts: 'head', 'leg', 'body' and 'tail'. We use the WordNet database [34] to tag words in CuPL prompts as belonging to one of these parts. We refer to the number of mentions of the part as the Text Part Importance. We then use the PartImageNet segmentations to compare the Shapley value of each part when using CuPL prompts and baseline prompts, which we call the Image Part Importance. We find a strong correlation between the Text Part Importance and the Image Part Importance, leading to the conclusion that CuPL is able to take advantage of the knowledge contained in the descriptions when making its predictions.
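The WordNet-based tagging behind the Text Part Importance can be sketched as follows; the paper's exact tagging rules are described in its appendix, so the part-holonym heuristic and part list below are assumptions.

```python
# Sketch: tag nouns in CuPL prompts with a body part via WordNet [34].
# Requires `pip install nltk` and `nltk.download("wordnet")`.
from nltk.corpus import wordnet as wn

PARTS = ["head", "leg", "tail", "body"]

def part_tag(word):
    """Return the part a noun belongs to (e.g., 'eye' -> 'head'), else None."""
    word = word.lower().strip(".,!?")
    if word in PARTS:
        return word
    for synset in wn.synsets(word, pos=wn.NOUN):
        # Walk transitive part-holonyms, e.g., eye -> face -> head.
        for holonym in synset.closure(lambda s: s.part_holonyms()):
            for part in PARTS:
                if part in holonym.lemma_names():
                    return part
    return None

def text_part_importance(prompts):
    """Count how often each part (or one of its sub-parts) is mentioned."""
    counts = {p: 0 for p in PARTS}
    for sentence in prompts:
        for word in sentence.split():
            tag = part_tag(word)
            if tag is not None:
                counts[tag] += 1
    return counts
```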
[30] use GPT-3 to generate knowledge on a topic when given a number of demonstrations, which is then used to improve accuracy on common sense reasoning questions. [20] use an LLM to add labels to text to improve text classification accuracy. In [64], the outputs of a GPT-2 model are used to train an encoder on top of a vision model to generate multimodal image representations for a variety of tasks. [53] utilize a language model to perform image captioning by iteratively generating candidate image captions with an LLM and then using feedback from an open-vocabulary model to align them to a given image. Similarly, [62] use GPT-3 along with text descriptions of images for the Visual Question Answering (VQA) task. However, unlike CuPL, these prior works are either purely language tasks (common sense reasoning, text classification) or multimodal with some language component (image captioning, VQA). Most similarly, [33] use an LLM to generate a structured list of attributes which are then reformatted into captions for CLIP. However, this work differs from ours as it does not improve over human-written templates. Additionally, [61] use an LLM to generate a list of natural language attributes for ImageNet classes and then select a subset of these attributes for each class in a few-shot manner. Our work differs from this as we remain in the zero-shot setting.

4.3. Prompt Engineering

Previous efforts have explored methods for obtaining successful natural language prompts. For both open-vocabulary image classification models as well as LLMs, the format of prompts is known to highly affect accuracy [48, 42, 5, 17]. This has led to a large effort to find optimal prompt formats. Proposed methods include crowd-sourcing high performing prompts [1] as well as framing prompts to induce models to give explanations as well as answers [57, 25, 37]. Additional works have proposed learning prompts via gradient-based methods [65, 41, 29, 28, 51], retrieval from a database [46], or reformatting/rephrasing existing prompts [24, 46].

Most relevant to this work are a number of methods for designing optimal prompts for zero-shot image classification with open-vocabulary models. These methods learn prompt formats which yield high accuracy for image classification using either supervised [66, 43] or unsupervised [22] methods. However, unlike these prior works, this work requires no additional training or labeled data.

5. Conclusion

We demonstrate that leveraging knowledge from an LLM can immediately improve zero-shot accuracy on a variety of image classification tasks, with much less hand-engineering effort to craft natural language prompts. Furthermore, prompts can be customized to the desired categories, rather than relying on a general template that applies to all categories. Finally, using prompts generated by LLMs lowers the barrier of prior knowledge about the dataset, which is often required when crafting prompt templates.

Querying an LLM for prompt construction is simple, straightforward and, as our results suggested, immediately beneficial.
The hypothesis that a joint force of LLMs and open-vocabulary models would improve zero-shot image classification is thoroughly tested in this work. We hope these findings serve as a useful tool towards understanding and improving zero-shot image classification, and more generally, the consolidation of model capacities and modalities through natural language.

6. Acknowledgements

We thank Mitchell Wortsman, Gabriel Ilharco, Vivek Ramanujan, Aditya Kusupati, Jason Wei, and Ofir Press for helpful discussions and draft feedback. We also thank Samir Yitzhak Gadre and Alex Fang for their useful experimental suggestions. Finally, we thank our reviewers and meta-reviewers for their time and feedback during the peer review process. This work is in part supported by NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, DARPA W911NF-15-1-0543 and gifts from the Allen Institute for Artificial Intelligence, Google, and Apple.

References

[1] Stephen H. Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Mike Tian-Jian Jiang, and Alexander M. Rush. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022. 8
[2] Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2018, 2014. 4
[3] Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020. 24
[4] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014. 4
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. 3, 8, 20, 21
[6] Sebastian Bujwid and Josephine Sullivan. Large-scale zero-shot image classification from rich and diverse textual descriptions. arXiv preprint arXiv:2103.09669, 2021. 7
[7] Sebastian Bujwid and Josephine Sullivan. Large-scale zero-shot image classification from rich and diverse textual descriptions. arXiv preprint arXiv:2103.09669, 2021. 19
[8] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 4
[9] Hugh Chen, Ian C Covert, Scott M Lundberg, and Su-In Lee. Algorithms to estimate Shapley value feature attributions. arXiv preprint arXiv:2207.07605, 2022. 6
[10] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017. 4
[11] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 4
[12] Ian Covert, Chanwoo Kim, and Su-In Lee. Learning to estimate Shapley values with vision transformers. arXiv preprint arXiv:2206.05282, 2022. 6
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 2, 3, 4, 22
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 4
[15] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and A. Elgammal. Link the head to the "beak": Zero shot learning from noisy text description at part precision. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6288–6297, 2017. 7
[16] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004. 4
[17] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020. 8
[18] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII, pages 128–145. Springer, 2022. 6, 8, 22
[19] Xiangteng He and Yuxin Peng. Fine-grained image classification via combining vision and language. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7332–7340, 2017. 7
[20] Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juan-Zi Li, and Maosong Sun. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. In ACL, 2022. 8
[21] Siteng Huang, Min Zhang, Yachen Kang, and Donglin Wang. Attributes-guided and pure-visual attention alignment for few-shot recognition. In AAAI, 2021. 7
[22] Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022. 8
[23] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 1, 3
[24] Zhengbao Jiang, Frank F. Xu, J. Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020. 8
[25] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. 8
[26] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013. 4
[27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 4
[28] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021. 8
[29] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), abs/2101.00190, 2021. 8
[30] Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. In ACL, 2022. 7
[31] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017. 6
[32] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013. 2, 3, 4
[33] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183, 2022. 5, 8
[34] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995. 6, 8, 19, 22, 24, 25
[35] Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling permutations for Shapley value estimation. Journal of Machine Learning Research, 23(43):1–46, 2022. 6
[36] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008. 4
[37] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. 8
[38] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012. 3, 4
[39] Tzuf Paz-Argaman, Yuval Atzmon, Gal Chechik, and Reut Tsarfaty. Zest: Zero-shot learning from text descriptions using textual similarity and visual summarization. In FINDINGS, 2020. 7
[40] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, et al. Combined scaling for open-vocabulary image classification. arXiv preprint arXiv:2111.10050, 2021. 1, 3
[41] Guanghui Qin and Jas' Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021. 8
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1, 2, 3, 4, 5, 6, 8, 12, 20, 22, 23
[43] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022. 8
[44] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019. 22
[45] Scott E. Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 49–58, 2016. 7
[46] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In NAACL, 2022. 8
[47] Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a caption worth a thousand images? a controlled study for representation learning. arXiv preprint arXiv:2207.07635, 2022. 7
[48] Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In EACL, 2021. 8
[49] Lloyd S Shapley et al. A value for n-person games. 1953. 6
[50] Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, and Jianfeng Gao. K-lite: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222, 2022. 7
[51] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Eliciting knowledge from language models using automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020. 8
[52] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 4
[53] Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier. Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655, 2022. 8
[54] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(11), 2008. 23, 24, 25, 26
[55] Ben Wang and Aran Komatsuzaki. Gpt-j-6b: A 6 billion parameter autoregressive language model, 2021. 20
[56] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019. 22
[57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 8
[58] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 20
[59] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022. 22
[60] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010. 4
[61] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. arXiv preprint arXiv:2211.11158, 2022. 8
[62] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In AAAI, 2022. 8
[63] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022. 1, 3
[64] Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, Jae Sung Park, Ximing Lu, Prithviraj Ammanabrolu, Rowan Zellers, Ronan Le Bras, Gunhee Kim, and Yejin Choi. Multimodal knowledge alignment with reinforcement learning. arXiv preprint arXiv:2205.12630, 2022. 8
[65] Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better few-shot learners. arXiv preprint arXiv:2108.13161, 2021. 8
[66] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022. 8