
Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion

Jordan Shipard1, Arnold Wiliem1,2, Kien Nguyen Thanh1, Wei Xiang3, and Clinton Fookes1

1 Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT), Queensland University of Technology, Australia
2 Sentient Vision Systems, Australia
3 School of Computing, Engineering and Mathematical Sciences, La Trobe University, Australia
{jordan.shipard@hdr., k.nguyenthanh@, c.fookes@}qut.edu.au, [email protected], [email protected]

arXiv:2302.03298v4 [cs.CV] 17 Apr 2023

Abstract

In this work, we investigate the problem of Model-Agnostic Zero-Shot Classification (MA-ZSC), which refers to training non-specific classification architectures (downstream models) to classify real images without using any real images during training. Recent research has demonstrated that generating synthetic training images using diffusion models provides a potential solution to address MA-ZSC. However, the performance of this approach currently falls short of that achieved by large-scale vision-language models. One possible explanation is a potential significant domain gap between synthetic and real images. Our work offers a fresh perspective on the problem by providing initial insights that MA-ZSC performance can be improved by improving the diversity of images in the generated dataset. We propose a set of modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity, which we refer to as our bag of tricks. Our approach shows notable improvements in various classification architectures, with results comparable to state-of-the-art models such as CLIP. To validate our approach, we conduct experiments on CIFAR10, CIFAR100, and EuroSAT, which is particularly difficult for zero-shot classification due to its satellite image domain. We evaluate our approach with five classification architectures, including ResNet and ViT. Our findings provide initial insights into the problem of MA-ZSC using diffusion models. All code is available at https://github.com/Jordan-HS/Diversity_is_Definitely_Needed.

Figure 1. A diagram of our proposed model-agnostic zero-shot classification method. We first generate a set of diverse training images using our proposed bag of tricks. These generated synthetic images are then used to train a downstream model capable of classifying a set of real images. This method achieves zero-shot performance comparable to CLIP [23]. (Pipeline: (1) use Stable Diffusion to generate synthetic images from class prompts and settings via the bag of tricks; (2) train a downstream model on the synthetic images; (3) classify real images with the downstream model.)

1. Introduction

Data is a critical element for the successful training of deep learning models [3, 24]. However, acquiring and curating a high-quality dataset can be challenging, time-consuming, and expensive [4, 29]. This is especially true for inherently expensive domains, such as remote sensing, where the creation of a dataset requires the use of satellites [32]. Zero-shot learning tackles the data issue by training a model using no data from the downstream task [37]. Generally, zero-shot learning models use specialised architectures [18, 20, 22, 23, 35] or learn auxiliary features to achieve the objective [21]. This limits the array of potential model architectures. On the other hand, the Model-Agnostic Zero-Shot Learning problem aims to address this issue by considering methods that allow any model architecture to perform the zero-shot learning task. In this work, we specifically investigate image classification tasks and, as such, refer to the problem domain as Model-Agnostic Zero-Shot Classification (MA-ZSC). Solving the MA-ZSC problem alleviates the previously mentioned problems of constructing a high-quality dataset without the need for specialised architectures or techniques applied to the model.
One avenue to address the MA-ZSC problem is to generate a synthetic dataset which is then used to train a classification model. A recent work [8] explored the use of generated synthetic images powered by the GLIDE diffusion model [19], primarily for improving the zero-shot and few-shot performance of CLIP [23]; however, they briefly explore training classification models, finding synthetic images to be inferior to real images. This is thought to be due to a domain gap between real and synthetic images [8]. We revisit the issue and find it possible to improve MA-ZSC performance by increasing the diversity of the images in the generated datasets. Increasing diversity in training data has been utilised in various domains, such as computer vision [41] and robotics [33]. For instance, in domain randomisation methods in the robotics domain [33], increasing diversity in the training data helps to reduce the 'reality gap' that separates simulated robotics from experiments on hardware. With enough variability in the simulator, the real world may appear to the model as just another variation. Increasing training data variability is also a common practice when training classification models, by employing data augmentation methods [30]: the variability of the training data is increased simply by performing data augmentation operations such as flipping and rotation. Figure 1 presents our overall method, and we present our results as follows.

Using the latent diffusion model Stable Diffusion [27], we start by generating baseline synthetic datasets using the prompt "an image of a {class}" for CIFAR10 [13], CIFAR100 [13] and EuroSAT [9], which is particularly difficult for zero-shot classification due to its satellite image domain, as noted in [23]. For each dataset, {class} is replaced iteratively with the class labels to generate images belonging to each class. We then train a ResNet50 model [7] on these datasets and obtain zero-shot top-1 accuracies of 60.1%, 29.72% and 36.18%, respectively. For reference, the CLIP zero-shot performance with a ResNet50 backbone achieves 75.6%, 41.6% and 41.1%.

In order to improve the diversity of images in the generated dataset, we propose a set of modifications to the text-to-image generation process, which we refer to as our bag of tricks. Each trick only improves image diversity, with no mitigation of the image domain gap; in fact, one of our tricks (multi-domain) specifically generates out-of-domain examples. After applying the proposed bag of tricks and finding the best tricks for each dataset, we obtain top-1 zero-shot classification accuracies of 81% (↑20.5) on CIFAR10, 45.63% (↑15.91) on CIFAR100 and 42.5% (↑6.41) on EuroSAT. Surprisingly, these results surpass the performance of CLIP-ResNet50 [23].

We list our contributions as follows:
1. Equipped with insights found in our work, we demonstrate the feasibility of addressing the MA-ZSC problem by training classification models on a high-quality Stable Diffusion [27] generated synthetic dataset.
2. We show that improving the diversity of images in a generated synthetic dataset improves MA-ZSC performance.
3. We provide a bag of tricks for improving diversity during latent diffusion image generation.

We continue our paper as follows. In Section 2, we provide a summary of previous works related to zero-shot learning, image generation, and training with synthetic data. We then describe the problem of model-agnostic zero-shot classification and introduce our solution in Section 3, before describing how latent diffusion models generate images in Section 4. In Section 4.2, we investigate how we can improve the quality of our generated synthetic datasets. Following this, we propose our bag of tricks to improve the diversity of our generated synthetic datasets in Section 4.3. Finally, we present our experimental results on CIFAR10 [13], CIFAR100 [13], and EuroSAT [9] in Section 5, demonstrating the impact of our proposed bag of tricks on zero-shot performance across five classification models.

2. Related Work

In this section, we first cover previous works relating to zero-shot learning and the introduction of CLIP, before covering the recent history of image generation and diffusion models. Lastly, we cover previous works where training was done using generated synthetic images.

2.1. Zero-shot Learning and CLIP

Zero-shot learning (ZSL) refers to the ability of a trained model to classify classes it was not trained on [37]. For image classification, this is traditionally achieved by learning auxiliary attributes instead of pre-defined classes [21, 38]. However, a more recent and effective approach is to train an image encoder and a text encoder to learn joint representations, as with CLIP [23]. By encoding an image and the names or descriptors of the target dataset's classes, zero-shot predictions can be made by finding which description's text embedding is most similar to the image embedding. Using this, CLIP achieves zero-shot performance on datasets such as ImageNet [4] (76.2%) comparable to supervised training of high-quality models [5, 7]. Several works look to improve the ZSL performance of CLIP by improving the text descriptions used [18, 20, 22, 35]. In this work, we focus on improving the zero-shot learning performance of non-specific architectures, specifically focusing on image classification. We refer to this problem as Model-Agnostic Zero-Shot Classification (MA-ZSC).
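To make the zero-shot prediction mechanism described above concrete, the following is a minimal, illustrative sketch of CLIP-style zero-shot classification with a ResNet50 backbone. It assumes the open-source CLIP package released with [23]; the image path, prompt template and class list are placeholders for illustration, not anything prescribed by this paper.

```python
# Illustrative sketch: CLIP-style zero-shot prediction over CIFAR10 class names,
# assuming the open-source "clip" package (openai/CLIP) with a ResNet50 backbone.
# The image path and prompt template are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Normalise, then rank classes by the similarity of text and image embeddings.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T

predicted_class = classes[similarity.argmax(dim=-1).item()]
print(predicted_class)
```

The class whose text embedding is most similar to the image embedding is taken as the prediction; this CLIP-specific inference procedure is exactly what the MA-ZSC setting studied in this paper seeks to avoid requiring at test time.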
2.2. Image Generation and Diffusion Models

Text-to-image generative models saw a significant performance improvement with DALL-E 1 [26]. The next generation of generative models, GLIDE [19], Latent Diffusion Models (LDM) [25], DALL-E 2 [27], and Imagen [28], use text encoders paired with diffusion models, in contrast to the discrete variational autoencoder and autoregressive transformer used by DALL-E 1. Diffusion models produce an image by denoising Gaussian noise according to some provided conditioning, such as a text prompt. Latent diffusion models perform the denoising in the latent space, making them more computationally efficient. This work uses a latent diffusion model to generate synthetic images.

2.3. Training with Synthetic Images

Since our research's inception, several papers have investigated utilising diffusion models to extend real datasets for domain adaptation [40], semi-supervised learning [39], or generating data augmentations [1, 34]. In our work, we concentrate solely on training downstream models on entirely synthetic generated datasets. This differs from the two closest works to ours [2, 8]. He et al. [8] study whether synthetic images from a diffusion model, GLIDE [19], can be used to fine-tune CLIP's zero-shot and few-shot performance. With respect to training downstream models on synthetic data, they conclude synthetic images are 5× less data efficient than real images. Additionally, He et al. [8] show synthetic images can effectively pre-train a classifier, on par with ImageNet pre-training. Besnier et al. [2] use a GAN pre-trained on ImageNet [4] and propose strategies for improving the training quality of the generated images. With their improvements, they were able to achieve 88.8% accuracy on ImageNet-10 when training on synthetic images, compared to 88.4% on real images. In our work, we generate training images using a diffusion model trained on a much larger dataset and distribution of images.

3. Model-Agnostic Zero-Shot Classification

Zero-Shot Learning (ZSL), as traditionally defined by [37], aims to learn a classifier f^u(·) : X → U, trained on labelled training instances D_tr belonging to a set of seen classes S (S = {c_i^s | i = 1, ..., N_s}), to classify testing instances X_te belonging to a set of unseen classes U (U = {c_i^u | i = 1, ..., N_u}), where S ∩ U = ∅. One method to address this problem is to describe each unseen class c_i^u with a set of image attributes extracted from the images in the seen classes S [14]. To classify an unseen image, we first extract image attributes from the image. Then the classification is done in the attribute feature space by comparing the image attributes of the unseen image with each unseen class's image attributes. With the advent of vision-language models such as CLIP [23], the image attributes are replaced by natural language text, which is more expressive and can more accurately describe the unseen classes. In the CLIP model, the text features and image features are correlated. Thus, we can classify an unseen image by measuring the dot product between its image features and the text features extracted from each class description.

Whilst the CLIP model has shown impressive zero-shot performance, one still needs to use the CLIP model and its zero-shot methodology. We argue that this limits the applicability of zero-shot classification. For instance, it is non-trivial to deploy the CLIP model on edge devices due to its large model size and complexity. This motivates us to consider the Model-Agnostic Zero-Shot Classification (MA-ZSC) problem. In the MA-ZSC setting, we wish to use any non-specific architecture and methodology to perform the zero-shot classification task. In other words, any classification model and methodology can be used for f^u(·).

To address the MA-ZSC problem, we utilise a Latent Diffusion Model [27] (LDM), which can generate synthetic training images for each unseen class from its textual description. Once the synthetic training dataset, D_syn, is generated, we can then train any downstream classification model. Note that we are not claiming that our work is the first to employ a synthetic image generation strategy to address the MA-ZSC problem. Rather, we show the feasibility of the strategy by combining the LDM with our proposed bag of tricks to generate more diverse synthetic images.
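For reference, the dot-product decision rule described in this section can be written compactly as below. The encoder symbols φ_img and φ_text and the class description d_c are our own notation; the paper states the rule only in prose.

```latex
% CLIP-based zero-shot decision rule for an unseen image x and unseen classes U.
% \phi_{img} and \phi_{text} denote the image and text encoders, and d_c is the
% textual description of class c (notation ours).
f^{u}(x) = \arg\max_{c \in \mathcal{U}} \; \big\langle \phi_{\mathrm{img}}(x), \, \phi_{\mathrm{text}}(d_c) \big\rangle
```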
4. Generating Training Images

In our work, we use a Latent Diffusion Model [27] loaded with Stable Diffusion V1.4 weights to generate synthetic images. Stable Diffusion was trained on a subset of the LAION-5B dataset [29] and generates images of 512×512 pixels, which we resize to the native image size of each dataset. Stable Diffusion uses a frozen CLIP ViT-L/14 text encoder to provide conditioning from text prompts, similar to [27, 28]. Although LDMs can generate images from a number of different conditionings (image, text, semantic map), in our work we only generate images via text prompts. In the following sections, we first describe text-to-image generation in more detail before investigating the potential domain gap between real and synthetic images. Finally, we hypothesise that image diversity is more important for improving zero-shot classification and provide a bag of tricks for improving synthetic image diversity.

Figure 2. Examples of images generated from the prompt "a {caption} of a car", where {caption} is the caption of each subfigure: (a) photo, (b) drawing, (c) painting, (d) poster. Each row shares a common initial Gaussian.

4.1. Text-to-Image

For text-to-image generation, a prompt that describes the desired contents of the image is required to guide the diffusion process. The prompt is projected to an intermediate representation via CLIP's text encoder and then mapped to the intermediate layers of the LDM's denoising UNet via cross-attention [27]. The LDM uses this guidance to diffuse a latent representation of an image starting from Gaussian noise. The resulting latent representation is then decoded back into the pixel domain to produce the final image. Figure 2 shows examples of generated images using different text prompts.

There are three main hyperparameters that control the generation process. DDIM Steps controls the number of steps taken by the Denoising Diffusion Implicit Model [31] in the denoising process. More steps generally result in more realistic and coherent images, while fewer result in more disjointed, surreal images. The images in Figure 2, and all synthetic images used in this work, were generated with 40 DDIM steps. The Unconditional Guidance Scale (UGC) controls the trade-off between the precision with which the generated image matches the provided prompt and generation diversity. This is done by scaling between the jointly trained conditional and unconditional diffusion models [11]. A lower UGC value means less guidance and therefore more diversity, and vice versa. Lastly, there is the Seed from which the initial Gaussian noise is generated, which serves as the starting point for diffusion. The images in each row of Figure 2 were all generated from common seeds, resulting in the cars in each row sharing similar features, such as car shape, position and colour.
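The following sketch shows how the baseline generation described above could be reproduced. It assumes the Hugging Face diffusers library with the public CompVis/stable-diffusion-v1-4 checkpoint and a DDIM scheduler; the paper does not prescribe a particular implementation, and diffusers exposes the unconditional guidance scale as guidance_scale.

```python
# Illustrative sketch (assumptions: Hugging Face diffusers with the public
# CompVis/stable-diffusion-v1-4 checkpoint; not the authors' released code).
# Generates one image per class with 40 DDIM steps, a chosen guidance scale,
# and a fixed seed, mirroring the hyperparameters described above.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

for class_name in classes:
    prompt = f"an image of a {class_name}"           # base-class prompt
    generator = torch.Generator("cuda").manual_seed(0)  # the Seed hyperparameter
    image = pipe(
        prompt,
        num_inference_steps=40,   # DDIM steps
        guidance_scale=7.5,       # unconditional guidance scale (diffusers default)
        generator=generator,
    ).images[0]
    image.save(f"{class_name}.png")
```

Fixing the generator seed across prompts reproduces the shared-initial-Gaussian behaviour shown in each row of Figure 2, while varying it changes the starting noise and hence the generated content.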
4.2. Improving Synthetic Data for Training

Before attempting to construct a high-quality synthetic dataset, we first validate that our synthetic images contain enough semantic features for classification. In a prior work, He et al. [8] conclude there is a domain gap between real and synthetic images generated via their chosen diffusion model, GLIDE [19]. They suggest that reducing this gap is necessary to improve the quality of synthetic images for training purposes. If the domain gap is significant, we expect it to noticeably impact our initial validation experiment. If it does not, however, we then need to investigate what factors do impact the quality of a synthetic dataset.

4.2.1 Validating the Usefulness of Synthetic Images

One approach for validating the quality of synthetic images is the Fréchet Inception Distance (FID) score [10]. Recent generative diffusion models achieve FID scores as low as 7.27 [28] on the MS-COCO dataset [15], with Stable Diffusion (V1.4) achieving a score of 16. For reference, real images overlayed with 25% Gaussian noise result in FID scores of approximately 50 [10]. With this in mind, we conjecture that the low FID scores of recent diffusion models suggest they are all capable of generating realistic images.

In order to examine whether synthetic images contain semantically meaningful features, we take a ResNet50 model [7] with pre-trained ImageNet weights [4] and fine-tune only the classifier head on the real CIFAR10 [6] dataset. We then visualise the feature space of the real images and our generated synthetic CIFAR10 images using a t-SNE plot [36]. If the synthetic images contain vastly different semantic features compared to real images, we expect the model to fail in classifying them. Additionally, we expect the real and synthetic images to occupy different areas of the feature space. However, in Figure 3 we can see the features of the real and synthetic images are intermingled. Furthermore, in Figure 4 we see the classes are closely clustered, with the model correctly classifying 76.61% of the real images and 63.8% of the synthetic images. This initial experiment validates that the synthetic images contain enough meaningful semantic features and therefore should be useful as training images.

Figure 3. t-SNE plot comparing clustering of real and synthetic images. The lack of separation in the feature space suggests that the synthetic images contain semantically meaningful features.

Figure 4. Two t-SNE plots showing the clustered classes for (a) real images and (b) synthetic images. The relative classes are clustered together, demonstrating the ability of the classifier to classify synthetic images despite only being trained on real images.

4.2.2 Investigating Synthetic Image Diversity

Inspired by works in domain randomisation [33, 41] and data augmentation [30], which demonstrate that increasing diversity in the training data improves performance, we investigate whether diversity impacts the quality of a synthetic training dataset. In order to isolate the impact of diversity, we utilise the Real Guidance (RG) technique used in [8], with a slight modification, to minimise the potential domain gap. RG minimises the domain gap by replacing the Gaussian noise used in the diffusion process with a real image overlayed with Gaussian noise. We modify this method by generating images from the interpolated feature representations of real images instead of from the real images themselves. More explicitly, we randomly sample 1% of the images for each class in CIFAR10 [13], totalling 60 images per class. These images were then encoded using CLIP's image encoder to obtain their feature representations. Next, we perform linear interpolation of these representations and use the interpolated feature representation as conditioning for the diffusion image generation process. Figure 5 shows an example of this, where we generate synthetic images of trucks (Fig. 5b) from the feature representations of the original truck images (Fig. 5a). The first row in Figure 5b is generated from only the feature representation of the first image in Figure 5a, the second row from the average of the first and second images, and the third from the average of all three. We can qualitatively observe improved synthetic image diversity from rows one to three. In row four, we generate images from random interpolations between all three original images.

Figure 5. Examples of images generated from the combination of different initial latents: (a) original images; (b) synthetic images from latent interpolation. Row 1 of (b) is from the left-most image of (a). Row 2 is the average of the centre and left-most images. Row 3 is the average of all three images in (a). Row 4 is random combinations of all three images in (a).

We use our 60 initial feature representations to generate images via linear interpolation using two sampling methods. Firstly, we sample a starting feature representation from within the convex hull of all 60 feature representations. Secondly, we sample from only three random feature representations at a time. We then train a ResNet-50 [7] model on these two generated datasets and validate performance on the real CIFAR10 [13] test set. The average of all latent combinations obtains 35.01% top-1 test accuracy, and the three-latent combinations obtain 52.6%. This shows data diversity is important when generating datasets, and improving zero-shot test accuracy is dependent on generating diverse training examples.

In Figure 6, we provide a visual example that more intuitively illustrates the difference between the two linear interpolation sampling methods. This is purely for demonstration purposes and does not represent the actual feature representations. We first generate a number of evenly spaced points (shown in blue) along the circumference of a circle, representing the feature representations of real images. We then use the two sampling methods described earlier to generate linearly interpolated feature representations (shown in orange). Figure 6a shows a significant bias towards the centre of the simulated feature space, resulting in less diverse samples. On the other hand, Figure 6b shows a more uniform and diverse set of samples.

Figure 6. A visual example demonstrating the difference in resulting diversity of sampled latent points from different sampling techniques: (a) sampled linearly interpolated feature representations from between all real feature representations; (b) sampled linearly interpolated feature representations from between only three random real feature representations at a time. Sampling from a linearly interpolated combination of all real feature representations (6a) results in a bias towards the centre of the feature space, whereas sampling from three randomly selected real feature representations (6b) results in a more uniform and diverse sampling.
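As a rough illustration of the two sampling schemes compared above, the sketch below draws convex combinations either over all real feature representations or over only three randomly chosen ones. The Dirichlet draw and the random stand-in features are assumptions for illustration; the paper does not specify how its interpolation weights were sampled.

```python
# Illustrative sketch of the two linear-interpolation sampling schemes compared
# above. The Dirichlet draw is an assumed way to obtain convex-combination
# weights; the feature array is a random stand-in for CLIP image features.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 60 CLIP image-feature vectors of one class (e.g. 512-d).
real_features = rng.normal(size=(60, 512))

def sample_from_all(features: np.ndarray) -> np.ndarray:
    """Convex combination of ALL real features (biased towards the centre)."""
    weights = rng.dirichlet(np.ones(len(features)))
    return weights @ features

def sample_from_three(features: np.ndarray) -> np.ndarray:
    """Convex combination of only three randomly chosen real features."""
    idx = rng.choice(len(features), size=3, replace=False)
    weights = rng.dirichlet(np.ones(3))
    return weights @ features[idx]

# Each call yields one interpolated feature representation that could be used
# to condition the diffusion model, as described in Section 4.2.2.
centre_biased = sample_from_all(real_features)
more_diverse = sample_from_three(real_features)
```

Sampling over all representations concentrates the draws near the mean of the feature cloud, while restricting each draw to three representations spreads the samples more uniformly, matching the behaviour illustrated in Figure 6.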
4.3. Bag of Tricks for Improving Synthetic Diversity

Here we propose the tricks we can utilise to improve the diversity of generated images. We first describe these tricks below and later validate their effectiveness in improving zero-shot classification performance. For all synthetic datasets, we generate the same number of images per class as their real counterparts.

Table 1. Examples of real (top row per dataset) and generated (bottom row per dataset) images for CIFAR10 [13], CIFAR100 [13] and EuroSAT [9]; the generated images are taken from the base class generated dataset. The real images are 32x32 for CIFAR and 64x64 for EuroSAT, while the synthetic images are 512x512. The generated images are downsized to match the real images during training. Example classes shown: CIFAR10 [13]: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck. CIFAR100 [13]: Aquarium Fish, Bicycle, Castle, Dinosaur, Keyboard, Sea, Shark, Television, Tractor, Wolf. EuroSAT [9]: Annual Crop, Forest, Herbaceous Vegetation, Highway, Industrial, Pasture, Permanent Crop, Residential, River, Sea/Lake.

Base Class - This consists of images that were generated using the prompt "an image of a {class}", where {class} is replaced with a class name from the downstream dataset. This represents the naive case for generating images, as this prompt is synonymous with the prompt used for zero-shot prediction in [23]. We only slightly modify the zero-shot prompt from "a photo of" to "an image of", as we use the prompt "a photo of" later in the multi-domain trick.

Class Prompt - We change the prompt from "an image of a {class}" to just "{class}", as images generated from "an image of a {class}" are included in the subset of images generated by the prompt "{class}". Therefore, using just the class name may lead to more diverse outputs.

Multi-Domain - Next, we directly influence the diversity by providing a list of domains with the prompt "a {domain} of a {class}", where the domain is one of ten preset domains (photo, drawing, painting, sketch, collage, poster, digital art image, rock drawing, stick figure, 3D rendering) for the CIFAR datasets. Because EuroSAT requires more domain information to correctly generate a satellite image, we use the prompt "a satellite photo of a {class} in the style of a {domain}", where the domains are (realistic photo, drawing, painting, sketch, 3D rendering). The images in Figure 2 are an example of CIFAR10 multi-domain images.

Random Unconditional Guidance - We use the base class prompt and randomly set the unconditional guidance scale to values between 1 and 5. This generates images that are highly diverse, such as those generated with UGC = 1, as well as images containing stronger features of the target class, such as those with UGC = 5. UGC = 5 was chosen as the upper bound from qualitative inspection of generated images, where we found little difference in the features of synthetic images with values greater than 5.

All Combined - Lastly, we combine all previous tricks into one final dataset. This should result in a more diverse dataset than any individual dataset.
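A minimal sketch of how the prompts and guidance settings for each trick could be assembled is given below. The prompt templates, domain lists and the UGC range of 1 to 5 are taken from the descriptions above; the per-image sampling strategy, the default guidance value and the function names are assumptions for illustration.

```python
# Illustrative sketch of assembling (prompt, guidance scale) pairs for the bag
# of tricks described above. Templates, domain lists, and the UGC range [1, 5]
# come from the text; everything else (default guidance, sampling per image)
# is an assumption.
import random

CIFAR_DOMAINS = ["photo", "drawing", "painting", "sketch", "collage", "poster",
                 "digital art image", "rock drawing", "stick figure", "3D rendering"]
EUROSAT_DOMAINS = ["realistic photo", "drawing", "painting", "sketch", "3D rendering"]
DEFAULT_GUIDANCE = 7.5  # assumed default when a trick does not vary guidance

def base_class(class_name):
    return f"an image of a {class_name}", DEFAULT_GUIDANCE

def class_prompt(class_name):
    return class_name, DEFAULT_GUIDANCE

def multi_domain(class_name, eurosat=False):
    if eurosat:
        domain = random.choice(EUROSAT_DOMAINS)
        return f"a satellite photo of a {class_name} in the style of a {domain}", DEFAULT_GUIDANCE
    return f"a {random.choice(CIFAR_DOMAINS)} of a {class_name}", DEFAULT_GUIDANCE

def random_unconditional_guidance(class_name):
    # Base-class prompt with a random unconditional guidance scale in [1, 5].
    return f"an image of a {class_name}", random.uniform(1.0, 5.0)

TRICKS = [base_class, class_prompt, multi_domain, random_unconditional_guidance]

def all_combined(class_name):
    # "All combined" pools samples drawn from every trick.
    return random.choice(TRICKS)(class_name)

prompt, guidance_scale = all_combined("truck")
# The pair would then be passed to the text-to-image pipeline from Section 4.1.
```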
Method | CIFAR10 [13] | CIFAR100 [13] | EuroSAT [9]
He et al. (ResNet50) [8] | - | 28.74 | -
CLIP-ResNet50 [23] | 75.6 | 41.6 | 41.1
Base Class | 60.5 | 29.72 | 36.18
Anti-aliasing Rescale | 63.84 (↑3.34) | 33.61 (↑3.89) | 34.4 (↓1.78)
Class Prompt | 62.32 (↑1.82) | 26.4 (↓3.32) | -
Multi-Domain | 67.97 (↑7.47) | 32.55 (↑1.96) | 35.68 (↓0.5)
Random Guidance | 72.93 (↑12.43) | 31.19 (↑1.47) | 40.18 (↑4)
All Combined | 81 (↑20.5) | 45.63 (↑15.91) | 39.92 (↑3.74)

Table 2. Zero-shot classification top-1 test accuracy on the CIFAR10, CIFAR100 and EuroSAT datasets from training on different permutations of synthetic datasets with a ResNet50 model. The change (↓, ↑) in top-1 accuracy is measured with respect to the base class.

5. Experiments

In this section, we first discuss our training setup and then present our baseline zero-shot results with ResNet50 [7]. We then apply the bag of tricks when generating our synthetic datasets of CIFAR10 [13], CIFAR100 [13] and EuroSAT [9]. We have chosen the CIFAR datasets due to their widespread use and low resolution, allowing for ease of training. We have also included EuroSAT because of its low resolution and challenging domain for synthetic images. Additionally, it has been shown to be a challenging dataset for zero-shot classification [23]. We show examples of our generated datasets in Table 1.

Following the bag of tricks, we have identified the best tricks for each dataset and tested them on four additional classification architectures: ResNet101 [7], MobileNetV3 [12], ViT [5] and ConvNeXt [16].

5.1. Training Setup

All models were trained from random initialisation for 200 epochs, with a batch size of 128, the AdamW optimiser [17] and cosine annealing learning rate decay. All training used an initial learning rate of 2e-4. MobileNetV3 models used a weight decay of 0.1 for all training, whereas all other models used a weight decay of 0.9 when training on the CIFAR datasets and 0.3 when training on EuroSAT. We found the weight decay hyperparameter of the AdamW optimiser to be important, as it helps reduce overfitting to the synthetic images.
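The sketch below illustrates the training recipe of Section 5.1. The synthetic dataset folder, the choice of ResNet50 and the anti-aliased resize to 32x32 (evaluated separately in Section 5.3) are illustrative assumptions; the hyperparameters (200 epochs, batch size 128, AdamW with learning rate 2e-4, cosine annealing, weight decay 0.9 for CIFAR) follow the text.

```python
# Illustrative sketch of the training setup described in Section 5.1. The
# dataset folder name, model choice, and anti-aliased 32x32 resize are
# assumptions for illustration, not the authors' released code.
import torch
import torchvision
from torchvision import transforms

transform = transforms.Compose([
    # Downscale 512x512 synthetic images to CIFAR resolution with anti-aliasing.
    transforms.Resize((32, 32), antialias=True),
    transforms.ToTensor(),
])
train_set = torchvision.datasets.ImageFolder("synthetic_cifar10", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet50(num_classes=10).cuda()  # random initialisation
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Zero-shot evaluation would then run this trained model directly on the real CIFAR10 test set, which the model never sees during training.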
5.2. Baseline Results

In Table 2, we gather baseline zero-shot classification results of CLIP-ResNet50 from [23] and ResNet50 from [8]. For our own baseline, we use the base class synthetic dataset as described in Section 4.3. Our base class dataset achieves zero-shot accuracies of 60.5%, 29.72% and 36.18% for CIFAR10, CIFAR100 and EuroSAT, respectively. In comparison, CLIP-ResNet50 from [23] achieves zero-shot accuracies of 75.6%, 41.6% and 41.1%, respectively. The 15.1% and 11.88% differences between CLIP's ResNet zero-shot results and our ResNet zero-shot results show that our generated dataset does not currently capture the full diversity of CLIP's knowledge of each class. In theory, we expect that with infinite training examples, we should achieve CLIP's zero-shot accuracy. The baseline results from [8] on CIFAR100 are the most directly comparable to our base class results, as those results are obtained by training a ResNet50 model from scratch on a synthetic version of CIFAR100 generated using GLIDE [8]. An important point to note, however, is that the results in [8] are obtained after improving the prompt quality for generating synthetic datasets, whereas our base class results are already higher without any improvements. This is most likely due to the difference in diffusion models used for the generation of the synthetic datasets. As mentioned, [8] use GLIDE [19], which was trained on a dataset of 250 million image-caption pairs. In contrast, we use Stable Diffusion, which was trained on 2.3 billion image-caption pairs from the LAION-5B dataset [29]. This 9.2× increase in training data appears to result in inherently better generative abilities.

5.3. Implementing the Bag of Tricks

Here we iterate over the bag of tricks, as described in Section 4.3, in an effort to improve diversity and zero-shot classification. Results are shown in Table 2, comparing the accuracy obtained using each trick with a ResNet50 model. Although not a trick, we also test the impact of using anti-aliasing during the rescaling of images from 512×512 pixels to 32×32 for the CIFAR datasets and 64×64 for EuroSAT. We find anti-aliasing significantly benefits the CIFAR datasets but not EuroSAT. Thus, we do not apply anti-aliasing on EuroSAT.

Class Prompt - Using only the class name as the prompt, we see a small improvement in CIFAR10 accuracy, while CIFAR100 accuracy reduces. We suspect the reduction in CIFAR100 performance is due to the generation of incorrect images for classes which can have multiple meanings, such as: 'Apple' generating images of the fruit and of the software company Apple's logo; 'Beetle' generating images of the insect and of the Volkswagen car; 'Orange' mainly generating images of orange items of clothing instead of the fruit; and 'Ray' generating Sun rays, sting rays and men. This is not the case for CIFAR10, where the class names are unambiguous. We do not test this trick on the EuroSAT dataset, as this dataset requires some context in the prompt relating to satellite images.

Multi-Domain - Despite approximately 90% of the images generated under this setting not being realistic, we see the most significant improvement for the CIFAR datasets, with 7.47% and 1.96% improvements on CIFAR10 and CIFAR100, respectively. These images, especially the posters, paintings and drawings, are not within the real CIFAR domain. For EuroSAT, we see a slight reduction of 0.5% compared to the base class. Both the CIFAR and EuroSAT results show that out-of-domain training images are not the main constraint for improving the zero-shot potential of synthetic datasets.

Random Unconditional Guidance - When we directly enforce diversity over precision by setting a random unconditional guidance scale, we see an improvement in zero-shot classification across all datasets. Interestingly, we see the most significant improvements on CIFAR10 and EuroSAT. We conjecture this is because CIFAR10 and EuroSAT contain more training examples per class than CIFAR100. This supports the finding in [8] that synthetic images are less data efficient than real images. Random unconditional guidance resulting in an accuracy improvement further supports our hypothesis that increasing diversity is more important than reducing a domain gap when generating synthetic training images.

All Combined - Finally, we combine all previously generated datasets into one large dataset, in order to test whether combining all the tricks, and further increasing diversity, gives more zero-shot classification improvements. In doing so we obtain our most significant improvements, further supporting our hypothesis. Surprisingly, both CIFAR zero-shot results now surpass the CLIP-ResNet50 zero-shot results [23], showing that our bag of tricks may distil the important signals or features in CLIP's understanding of a concept.

5.4. Model-Agnostic Zero-shot Classification

Using our bag of tricks, we can now endow any model with zero-shot classification capabilities. To demonstrate this, we test the best tricks for each dataset on four additional classification architectures; results are shown in Table 3. For CIFAR10, the all combined dataset is used for the best tricks. CIFAR100 uses all tricks except class prompt. For EuroSAT, only the random unconditional guidance trick improved performance; therefore, to further increase diversity, we generate an additional 2700 images per class, doubling the size of the dataset. We use the ResNet101 [7] architecture in order to test whether simply a deeper ResNet is able to obtain higher zero-shot performance, and we see only slight improvements over ResNet50. When training with the ViT-B model [5], we see reduced performance compared to the ResNet models across all datasets. We conjecture this is due to training from scratch, as ViTs are known to benefit greatly from ImageNet pre-training [5]. Despite this, we still see an improvement in zero-shot performance when training using the best tricks. Lastly, we use MobileNetV3-small [12] and ConvNeXt-small [16] as examples of architectures that have not previously been used for zero-shot classification, demonstrating that our approach applies to any existing model. Again, we see improvements in zero-shot classification across all datasets when applying the best tricks.

Dataset | Model | Base Class | Best Tricks
CIFAR10 [13] | CLIP-ResNet50* [23] | 75.6 | -
CIFAR10 [13] | ResNet50 [7] | 60.5 | 81 (↑20.5)
CIFAR10 [13] | ResNet101 [7] | 60.89 | 81.84 (↑20.95)
CIFAR10 [13] | ViT-B [5] | 42.34 | 75.72 (↑33.38)
CIFAR10 [13] | MobileNetV3-S [12] | 51.05 | 74.38 (↑23.33)
CIFAR10 [13] | ConvNeXt-S [16] | 49.15 | 80.1 (↑30.95)
CIFAR100 [13] | CLIP-ResNet50* [23] | 41.6 | -
CIFAR100 [13] | ResNet50* [8] | 28.74 | -
CIFAR100 [13] | ResNet50 | 29.72 | 45.63 (↑15.91)
CIFAR100 [13] | ResNet101 | 27.66 | 46.63 (↑18.97)
CIFAR100 [13] | ViT-B | 16.38 | 32.38 (↑16)
CIFAR100 [13] | MobileNetV3-S | 17.78 | 39.64 (↑21.86)
CIFAR100 [13] | ConvNeXt-S | 20.93 | 45.14 (↑24.21)
EuroSAT [9] | CLIP-ResNet50* [23] | 41.1 | -
EuroSAT [9] | ResNet50 | 36.18 | 42.59 (↑6.41)
EuroSAT [9] | ResNet101 | 34.73 | 37.31 (↑2.58)
EuroSAT [9] | ViT-B | 19.53 | 21.71 (↑2.18)
EuroSAT [9] | MobileNetV3-S | 34.08 | 39.13 (↑5.05)
EuroSAT [9] | ConvNeXt-S | 18.57 | 20.22 (↑1.65)

Table 3. Top-1 zero-shot accuracy of various classification models on the Base Class and Best Tricks synthetic CIFAR10, CIFAR100 and EuroSAT datasets. Models marked with * are baseline results (not base class results) from the cited papers, similar to Table 2. The change (↑) in Best Tricks top-1 accuracy is relative to the Base Class top-1 accuracy.

6. Conclusion

In conclusion, we investigate the problem of Model-Agnostic Zero-Shot Classification (MA-ZSC), where MA-ZSC aims to train any downstream classification architecture to classify real images without training on any real images. We investigated how to improve the quality of a synthetic dataset for the purpose of training and found diversity in the synthetic images to be an important factor. From this, we then proposed a set of modifications to the text-to-image generation process via diffusion models, named our bag of tricks. This bag of tricks is designed only to improve the diversity of synthetic images, with no mitigation of the potential domain gap, as reported by previous works. Applying the bag of tricks achieves notable improvements across five classification architectures on the CIFAR10, CIFAR100 and EuroSAT datasets. Some architectures even achieve zero-shot classification accuracies comparable to state-of-the-art zero-shot models, such as CLIP. Our findings provide initial insights into the problem of MA-ZSC using diffusion models and open up new avenues for research in this area.

Acknowledgement

This work has been supported by the SmartSat CRC, whose activities are funded by the Australian Government's CRC Program; and partly supported by Sentient Vision Systems. Sentient Vision Systems is one of the leading Australian developers of computer vision and artificial intelligence software solutions for defence and civilian applications.
References

[1] Hritik Bansal and Aditya Grover. Leaving Reality to Imagination: Robust Classification via Generated Datasets. ArXiv, abs/2302.02503, 2023. 3
[2] Victor Besnier, Himalaya Jain, Andrei Bursuc, Matthieu Cord, and Patrick Pérez. This dataset does not exist: training models from generated images. In ICASSP, 2020. 3
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020. 1
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 1, 2, 3, 4
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 2, 7, 8
[6] Haibo He and Edwardo A. Garcia. Learning from Imbalanced Data. TKDE, 2009. 4
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 2, 4, 5, 7, 8
[8] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? In ICLR, 2023. 2, 3, 4, 5, 7, 8
[9] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. JSTARS, 2019. 2, 6, 7, 8
[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017. 4
[11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop on Deep Generative Models and Downstream Applications, 2021. 4
[12] A. Howard, M. Sandler, B. Chen, W. Wang, L. Chen, M. Tan, G. Chu, V. Vasudevan, Y. Zhu, R. Pang, H. Adam, and Q. Le. Searching for MobileNetV3. In ICCV, 2019. 7, 8
[13] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009. 2, 5, 6, 7, 8
[14] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. PAMI, 36(3), 2014. 3
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 4
[16] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A ConvNet for the 2020s. In CVPR, 2022. 7, 8
[17] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 7
[18] Sachit Menon and Carl Vondrick. Visual Classification via Description from Large Language Models. ArXiv, abs/2210.07183, 2022. 1, 2
[19] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2021. 2, 3, 4, 7
[20] Zachary Novack, Saurabh Garg, Julian McAuley, and Zachary C. Lipton. CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets. ArXiv, abs/2302.02551, 2023. 1, 2
[21] Genevieve Patterson and James Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012. 1, 2
[22] Sarah Pratt, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. ArXiv, abs/2209.03320, 2022. 1, 2
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 1, 2, 3, 6, 7, 8
[24] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. 1
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022. 3
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 3
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 4
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv, abs/2205.11487, 2022. 3, 4
[29] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022. 1, 3, 7
[30] Connor Shorten and Taghi M. Khoshgoftaar. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data, 2019. 2, 5
[31] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021. 4
[32] Gencer Sumbul, Marcela Charfuelan, Begüm Demir, and Volker Markl. BigEarthNet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding. In IGARSS, 2019. 1
[33] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE IROS, 2017. 2, 5
[34] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective Data Augmentation With Diffusion Models. ArXiv, abs/2302.07944, 2023. 3
[35] Vishaal Udandarao, Ankush Gupta, and Samuel Albanie. SuS-X: Training-Free Name-Only Transfer of Vision-Language Models. ArXiv, abs/2211.16198, 2023. 1, 2
[36] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008. 4
[37] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. A survey of zero-shot learning: Settings, methods, and applications. TIST, 2019. 1, 2, 3
[38] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017. 2
[39] Zebin You, Yong Zhong, Fan Bao, Jiacheng Sun, Chongxuan Li, and Jun Zhu. Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels. ArXiv, abs/2302.10586, 2023. 3
[40] Jianhao Yuan, Francesco Pinto, Adam Davies, Aarushi Gupta, and Philip Torr. Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations. ArXiv, abs/2212.11237, 2022. 3
[41] Xiangyu Yue, Yang Zhang, Sicheng Zhao, Alberto Sangiovanni-Vincentelli, Kurt Keutzer, and Boqing Gong. Domain Randomization and Pyramid Consistency: Simulation-to-Real Generalization Without Accessing Target Domain Data. In ICCV, 2019. 2, 5
