
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model

Zhicai Wang1*, Longhui Wei2†, Tan Wang3, Heyu Chen1, Yanbin Hao1, Xiang Wang1†, Xiangnan He1, Qi Tian2
1 University of Science and Technology of China, 2 Huawei Inc., 3 Nanyang Technological University

Abstract

Text-to-image (T2I) generative models have recently emerged as a powerful tool, enabling the creation of photo-realistic images and giving rise to a multitude of applications. However, the effective integration of T2I models into fundamental image classification tasks remains an open question. A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models. In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques. Our analysis reveals that these methods struggle to produce images that are both faithful (in terms of foreground objects) and diverse (in terms of background contexts) for domain-specific concepts. To tackle this challenge, we introduce an innovative inter-class data augmentation method known as Diff-Mix¹, which enriches the dataset by performing image translations between classes. Our empirical results demonstrate that Diff-Mix achieves a better balance between faithfulness and diversity, leading to a marked improvement in performance across diverse image classification scenarios, including few-shot, conventional, and long-tail classifications for domain-specific datasets.

Figure 1. Strategies to expand domain-specific datasets for improved classification are varied. Row 1 illustrates vanilla distillation from a pretrained text-to-image (T2I) model, which carries the risk of generating outputs with reduced faithfulness. Intra-class augmentation, depicted in Row 2, tends to yield samples with limited diversity to maintain high fidelity to the original class. Our proposed method, showcased in Rows 3 and 4, adopts an inter-class augmentation strategy. This involves introducing edits to a reference image using images from other classes within the training set, which significantly enriches the dataset with a greater diversity of samples.

1. Introduction

In comparison to GAN-based models [7, 17, 25], contemporary state-of-the-art text-to-image (T2I) diffusion models exhibit enhanced capabilities in producing high-fidelity images [12, 37, 44, 49]. With the remarkable cross-modality alignment capabilities of T2I models, there is significant potential for generative techniques to enhance image classification [2, 4]. For instance, a straightforward approach entails augmenting the existing training dataset with synthetic images generated by feeding categorical textual prompts to a T2I diffusion model. However, upon reviewing prior approaches employing T2I diffusion models for image classification, it becomes evident that the challenge in generative data augmentation for domain-specific datasets is producing samples with both a faithful foreground and a diverse background. Depending on whether a reference image is used in the generative process, we divide these methods into two groups:
• Text-guided knowledge distillation [52, 57] involves generating new images from scratch using category-related prompts to expand the dataset. For off-the-shelf T2I models, such vanilla distillation presumes that these models have comprehensive knowledge of the target domain, which can be problematic for domain-specific datasets. Insufficient domain knowledge easily makes the distillation

* This work was done during an internship at Huawei Inc.
† Xiang Wang and Longhui Wei are the corresponding authors.
¹ https://github.com/Zhicaiwww/Diff-Mix

process less effective. For example, vanilla T2I models struggle to generate images that accurately represent specific bird species based solely on their names (see Row 1 of Fig. 1).
• Generative data augmentation [1, 69] employs generative models to enhance existing images. Da-fusion [58], for instance, translates the source image into multiple edited versions within the same class. This strategy, termed intra-class augmentation, primarily introduces intra-class variations. While intra-class augmentation retains much of the original image's layout and visual details, it results in limited background diversity (see Row 2 of Fig. 1). However, synthetic images with constrained diversity may not sufficiently enhance the model's ability to discern foreground concepts.

Based on these observations, a fundamental question emerges: 'Is it feasible to develop a method that optimizes both the diversity and faithfulness of synthesized data simultaneously?'

In this work, we introduce Diff-Mix, a simple yet effective data augmentation method that harnesses diffusion models to perform inter-class image interpolation, tailored for enhancing domain-specific datasets. The method encompasses two pivotal operations: personalized fine-tuning and inter-class image translation. Personalized fine-tuning [15, 46] is originally designed for customizing T2I models, enabling them to generate user-specific contents or styles. In our case, we implement the technique to tailor the model so that it generates images with faithful foreground concepts. Inter-class image translation in Diff-Mix entails transforming a reference image into an edited version that incorporates prompts from different classes. This translation strategy is designed to retain the original background context while editing the foreground to align with the target concept. For instance, as depicted in the bottom rows of Fig. 1, Diff-Mix can generate images of land birds in diverse settings, such as maritime environments, enriching the dataset with a variety of counterfactual samples.

Unlike previous non-generative augmentation methods, such as Mixup [68] and CutMix [66], Diff-Mix works in a foreground-perceivable inter-class interpolation manner and thus relies on a different mechanism than the non-generative approaches. Our experiments under the conventional classification setting indicate that incorporating both CutMix and Diff-Mix can further enhance performance. Additionally, when compared with other generative approaches under few-shot and long-tail scenarios, we observe consistent performance improvements.

Our contributions can be summarized as follows:
• We pinpoint the critical factors that affect the efficacy of generative data augmentation in domain-specific image classification: namely, faithfulness and diversity.
• We introduce Diff-Mix, a simple yet effective generative data augmentation strategy that leverages fine-tuned diffusion models for inter-class image interpolation.
• We conduct a comparative analysis of Diff-Mix with other distillation-based and intra-class augmentation methods, as well as non-generative approaches, highlighting its unique features and benefits.

2. Related Works

Text-to-image diffusion models. Following pretraining on web-scale data, T2I diffusion models have demonstrated robust capabilities in generating text-controlled images [37, 49, 53, 56]. Their versatility has led to diverse applications, including novel view synthesis [6, 63], concept learning [28, 46], and text-to-video generation [22, 55], among others. Recent advancements [29] also highlight the cross-modality features of such generative models, showcasing their ability to serve as zero-shot classifiers.

Synthetic data for image classification. There are two perspectives on the utilization of synthetic data for image classification: knowledge distillation [2] and data augmentation [5, 51, 54, 62]. From the knowledge distillation perspective, SyntheticData [19] reports significant performance gains in both zero-shot and few-shot settings by leveraging off-the-shelf T2I models to obtain synthetic data. The work of [2] indicates that fine-tuning the T2I model on ImageNet [47] yields improved classification accuracy by narrowing the domain gap. Some works also find that learning from synthetic data offers strong transferability [19, 57] and robustness [4, 30, 67]. From the data augmentation standpoint, Da-fusion [58] achieves stable performance improvements on few-shot datasets by augmenting from reference images. In a related study [3], the use of StyleGAN [25] for generating interpolated images between two different domains has been shown to enhance classifier robustness for out-of-distribution data. Our work shares similarities with AMR [5], which generates realistic novel examples by interpolating between two images using a GAN [16]. The distinction lies in our use of the T2I diffusion model for interpolation, whose noise-adding and denoising characteristics enable a smoother implementation of interpolation.

Non-generative data augmentation. Mixup [68] and CutMix [66] stand out as two prominent non-generative data augmentation methods, serving as effective regularization techniques during training. While Mixup obtains augmented samples through a convex combination of two images, CutMix achieves augmentation by cutting and pasting parts of images. However, both methods are constrained in their ability to produce realistic images. Generative models emerge as a potential solution to alleviate this limitation.

3. Method

3.1. Preliminary

Text-to-image diffusion model. Diffusion models generate images by gradually removing noise from a Gaussian noise source [21]. In a diffusion process with a total of T steps, the forward process, which gradually adds noise, is represented as a Markov chain with a Gaussian transition kernel, q(x_t | x_{t-1}) = N(x_t; √(α_t) x_{t-1}, (1 − α_t) I), where x_t represents the noisy image at step t. The training objective at step t is to predict the noise to reconstruct x_{t-1}. When training a text-conditioned diffusion model, the simplified training objective can be summarized as follows:

    E_{ε,x,c,t} [ ‖ε − ε_θ(x_t, c, t)‖₂² ],    (1)

where ε_θ represents the predicted noise, and c is the encoded text caption associated with the image x.

T2I personalization. T2I personalization aims to personalize a diffusion model for generating specific concepts using a limited number of concept-oriented images [28, 40, 42]. These concepts are typically represented using identifiers (e.g., "[V]"). As a result, we formalize the constructed image–caption set as (x, "photo of a [V]"). Various personalization methods differ in their fine-tuning strategies. For instance, Textual Inversion (TI) [15] makes the identifier learnable while other modules are not fine-tuned, potentially sacrificing some faithfulness in image generation. On the other hand, Dreambooth (DB) [46] fine-tunes the U-Net [45] for more refined personalized generation but faces the challenge of increased computational cost.

Image-to-image translation. Image translation enables image synthesis and editing using a reference image as guidance [24, 64, 71]. Diffusion-based image translation methods can generate fine edits, i.e., subtle modifications, with varying degrees of shift relative to the reference image [8, 35, 59]. Here, we draw inspiration from SDEdit [35] to perform edits on the reference image, where the target image x_tar is translated from a reference image x_ref. During translation, the reverse process does not traverse the full chain but starts from a certain step ⌊sT⌋, where s ∈ [0, 1] controls the insertion position of the reference image with noise, as follows:

    x_{⌊sT⌋} = √(α̃_{⌊sT⌋}) x_ref + √(1 − α̃_{⌊sT⌋}) ε.    (2)

By adjusting the strength parameter s, one can strike a balance between the diversity of the generated images and their faithfulness to the reference image.

Figure 2. Examples of "American Three toed Woodpecker". (a) Real images from the training set. (b–d) Synthetic images generated using different fine-tuned models with the same number of fine-tuning steps: (b) vanilla SD, (c) SD fine-tuned via DB, (d) SD fine-tuned via TI+DB. TI+DB indicates both the text embedding and the U-Net are fine-tuned. TI+DB achieves a more faithful output compared to DB alone (check the head and wing patterns of the birds).

3.2. General Framework

The Diff-Mix pipeline consists of two key steps. Firstly, to produce more faithful images for domain-specific datasets, we propose treating it as a T2I personalization problem and fine-tuning Stable Diffusion (SD). Subsequently, to enhance the diversity of synthetic data beyond the well-fitted training distribution, we employ inter-class image translation. This process produces interpolated images with increased background diversity for each class.

3.3. Fine-tune Diffusion Model

Vanilla distillation tends to be less effective, especially as the number of training shots increases (refer to Sec. 4.1). In order to mitigate the distribution gap, we propose fine-tuning Stable Diffusion in conjunction with currently widely-used T2I personalization strategies.

Dreambooth meets Textual Inversion. Many fine-grained datasets provide terminological names for their categories, like "American Three toed Woodpecker" and "Pileated Woodpecker". We could construct image–text pairs using category-related prompts and fine-tune the denoising network of SD using Eq. 1, which is analogous to Dreambooth. However, we observe that directly incorporating these specialized terms into the text during fine-tuning can impede convergence and hinder the generation of faithful images. We attribute this challenge to the semantic proximity of terminology within a fine-grained domain, where fine-tuning the vision module alone tends to be less effective at distinguishing two similar classes within the same family, like "Woodpecker". Inspired by Textual Inversion [15], we opt to replace the terminological name in the dataset with "[V_i] [metaclass]", where "[V_i]" is a learnable identifier and i varies from 1 to N, representing

the category index. The illustration of our fine-tuning strategy is presented in Fig. 3. The term "[metaclass]" is determined by the theme of the current dataset, such as "bird" for a fine-grained bird dataset. By concurrently fine-tuning the identifier and the U-Net, we empower the model to quickly adapt to the fine-grained domain, allowing it to generate faithful images using the identifier (see the comparison between Row 3 and Row 4 in Fig. 2).

Figure 3. The fine-tuning framework of Diff-Mix operates as follows: initially, we replace the class name with a structured identifier formatted as "[V_i] [metaclass]", thereby sidestepping the need for specific terminological expressions. Next, we engage in joint fine-tuning of these identifiers and the low-rank residues (LoRA) of the U-Net to capture the domain-specific distribution.

Parameter-efficient fine-tuning. In this context, we embrace the parameter-efficient fine-tuning strategy known as LoRA [23]. LoRA distinguishes itself by fine-tuning the residue of low-rank matrices instead of directly fine-tuning the pre-trained weight matrices. To elaborate, consider a weight matrix W ∈ ℝ^{m×n}. The tunable residual matrix ΔW comprises two low-rank matrices, A ∈ ℝ^{m×d} and B ∈ ℝ^{n×d}, defined as ΔW = AB^⊤. As a default configuration, we set the rank d to 10.

3.4. Data Synthesis Using Diffusion Model

In generating pseudo data, three strategies can be used with our fine-tuned diffusion model²: (1) the distillation-based method Diff-Gen, (2) intra-class augmentation Diff-Aug, and (3) inter-class augmentation Diff-Mix.

Diff-Gen and Diff-Aug. For a target class y^i and its textual condition, "photo of a [V_i] [metaclass]", both methods generate synthetic samples annotated with class i. Specifically, Diff-Gen generates samples from scratch by initializing with random Gaussian noise and proceeding through the full reverse diffusion process with T steps. Diff-Gen can produce images aligned with its fine-tuned distribution. In contrast, Diff-Aug sacrifices a portion of diversity and generates images by editing a reference image. Specifically, it randomly samples an image from the intra-class training set and enhances the image through image translation using Eq. 2. The term "intra-class" means that the conditioning prompts are constructed based on the ground-truth categories of images, and such a denoising process tends to introduce less variation, particularly for the foreground concepts (see the top rows of Fig. 1).

Figure 4. Examples of images translated using Diff-Mix and Diff-Aug across various strengths (s = 0.1, 0.3, 0.5, 0.7, 0.9). Diff-Aug employs the same target and reference image classes, typically resulting in subtle modifications. Diff-Mix progressively adjusts the foreground to align with the target class as the translation strength increases, while preserving the background layout from the reference image.

Diff-Mix. Diff-Mix employs the same translation process as Diff-Aug, but the reference image is sampled from the full training set rather than the intra-class set to enable inter-class interpolation. The key difference is that Diff-Mix can generate numerous counterfactual examples, such as a blackbird in the sea (see the fourth row of Fig. 1). This necessitates that downstream models make a more refined differentiation of category attributes, thereby reducing the impact of spurious correlations introduced by variations in the background. Denoting the label of the reference image as y^j, by inserting the reference image into the reverse process with the prompt "photo of a [V_i] [metaclass]", we can obtain interpolated images between the ith and jth categories. By controlling the intensity s, we can precisely

² We use the prefix "Diff-" to denote that the T2I model is fine-tuned and "Real-" to denote the vanilla T2I model.
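The strength-controlled insertion of Eq. 2, which both Diff-Aug and Diff-Mix build on, can be sketched as follows (a minimal NumPy illustration; the function name, the toy noise schedule, and treating α̃ as the cumulative product of per-step α's are our assumptions, not the authors' code):

```python
import numpy as np

def insert_reference(x_ref, alpha_bar, s, rng):
    """Eq. 2: noise the reference image up to step floor(s * T) so the reverse
    process can start there instead of from pure Gaussian noise.
    alpha_bar[t] is the cumulative product of the per-step alphas; the strength
    s in [0, 1] trades faithfulness (small s) against diversity (large s)."""
    T = len(alpha_bar)
    t = min(int(s * T), T - 1)  # insertion step floor(s * T), clamped to T - 1
    eps = rng.standard_normal(x_ref.shape)  # Gaussian noise
    x_t = np.sqrt(alpha_bar[t]) * x_ref + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, t

# Toy schedule with T = 25 steps (matching the paper's sampling budget):
# alpha_bar decays from ~1 (almost no noise) toward 0 (almost pure noise).
alpha_bar = np.cumprod(np.linspace(0.99, 0.80, 25))
```

At s close to 0 the reverse process starts from a lightly noised reference and reproduces it almost exactly; at s close to 1 it starts from near-Gaussian noise, so the text prompt dominates the output.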
manage the interpolation process. When annotating the synthetic image, unlike Mixup and CutMix, we take into account the non-linear nature of diffusion translation. Thus the annotation function is given by

    ỹ = (1 − s^γ) y^j + s^γ y^i,    (3)

where γ is a hyperparameter introducing non-linearity. Our empirical findings indicate that a γ smaller than 1 is favored. Additionally, in low-shot cases, samples with higher confidence in the target class are preferred (see details in Sec. 4.5).

Construct synthetic dataset. To construct the synthetic dataset using Diff-Mix, similar to Da-fusion [58], we adopt a randomized sampling strategy (s ∈ {0.5, 0.7, 0.9}) for the selection of translation strength. While applying the inter-class editing, we observe that Diff-Mix tends to produce more undesirable samples than Diff-Aug. These undesirable samples have incomplete foregrounds, such as fragmented bird bodies, caused by the intrinsic shape and pose differences among classes. To mitigate this, we introduce a simple data-cleaning approach to reduce the proportion of such problematic images. We utilize the large vision-language model CLIP [43] to assess the confidence in the content, serving as the filtering criterion. Further details can be found in the supplementary materials (SMs).

Analysis. We depict the core insight of Diff-Mix in Fig. 5. To eliminate the spurious correlation introduced by x_bg, learning on a synthetic set with randomized x_bg (background) can cut off the spurious correlation, forcing the classification model to infer only from the foreground. The study in Fig. 7 (b) shows that the more diverse the background (i.e., the larger the referable class number), the better the performance on the CUB test set.

Figure 5. A schematic explanation of Diff-Mix's effectiveness using a structural causal model [41]. x_fg is the foreground that determines the real class label, and x_bg denotes the background. x_fg → z → y is the causal path that we focus on, and x_fg ← I → x_bg → z → y is the backdoor path that introduces spurious relations between x_fg and y.

4. Experiments

In this section, we investigate the effectiveness of Diff-Mix on domain-specific datasets. The key questions we aim to address are as follows:

Q1: Can generative inter-class augmentation lead to more significant performance gains in downstream tasks compared to intra-class augmentation methods and distillation-based methods?
Q2: Is improved background diversity the secret weapon of Diff-Mix for enhancing performance?
Q3: How should the fine-tuning strategy and annotation strategy be chosen to boost the performance gains of inter-class augmentation?

To address Q1, we separately discuss few-shot settings in Sec. 4.1, conventional classification in Sec. 4.2, and long-tail classification in Sec. 4.3. Additionally, to answer Q2, we conduct a test of background robustness in Sec. 4.4 and perform an ablation study focusing on the size and diversity of synthetic data in Sec. 4.5. For Q3, we conduct an ablation study to empirically discover effective strategies for deployment in Sec. 4.5.

4.1. Few-shot Classification

Experimental setting. To investigate the impact of different data expansion methods, we conduct few-shot experiments on a domain-specific dataset, Caltech-UCSD Birds (CUB) [61], with shot numbers of 1, 5, 10, and all. For augmentation-based methods, the synthetic dataset is constructed using various translation strengths, specifically s ∈ {0.1, 0.3, ..., 1.0}. We expand the original training set with a multiplier of 5 and cache the synthesized dataset locally for joint training. Real data are replaced by synthetic data proportionally during training, and the replacement probability p is set as 0.1, 0.2, 0.3, and 0.5 for all-shot, 10-shot, 5-shot, and 1-shot classification, respectively. All experiments use ResNet50 with an input resolution of 224 × 224. Additional details can be found in the SMs.

Comparison methods. To unveil the trade-off between faithfulness and diversity resulting from different expansion strategies, we compare Diff-Mix with Diff-Gen and Diff-Aug. Furthermore, we conduct experiments on expansion strategies using vanilla SD: Real-Mix, Real-Gen, and Real-Aug, where 'Real' signifies that SD is not fine-tuned.

Main results. To answer Q1 under the few-shot classification setting, we augment CUB using X-Mix, X-Aug, and X-Gen, where 'X' denotes 'Diff/Real' for simplicity. The results are shown in Fig. 6, and we can observe that:
1. Diff-Mix generally outperforms the intra-class competitor X-Aug and the distillation competitor X-Gen in various few-shot scenarios. It tends to achieve higher gains when the strength s is relatively large, i.e., {0.5, 0.7, 0.9}, where the foreground has been edited to match the target class and the background retains similarities to the reference image.
2. Among the Real-X methods, distillation tends to be more

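The annotation function of Eq. 3 amounts to a one-line soft-label computation. A minimal sketch (the function name and one-hot encoding are our assumptions; γ = 0.5 follows the fixed value used in the paper's experiments):

```python
import numpy as np

def diffmix_label(y_i, y_j, s, gamma=0.5):
    """Eq. 3: y_tilde = (1 - s**gamma) * y_j + s**gamma * y_i, where y_i is the
    one-hot vector of the target class, y_j that of the reference image's class,
    s the translation strength, and gamma < 1 the non-linearity hyperparameter."""
    w = s ** gamma  # label mass assigned to the target class i
    return w * np.asarray(y_i, float) + (1.0 - w) * np.asarray(y_j, float)
```

Because γ < 1 implies s^γ ≥ s, the label mass shifts toward the target class faster than the raw strength, mirroring the observation that the foreground is already edited toward class i at moderate strengths.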
effective than augmentation when the shot number is low, but the trend reverses as the shot number increases (compare Real-Gen with Real-Aug). Real-Gen's samples even prove less effective than those of Real-Aug (s = 1.0) under the all-shot case³. This indicates that the importance of faithfulness in the trade-off between faithfulness and diversity increases with the shot number. Additionally, Real-Mix exhibits consistent and stable improvement over the other two methods.
3. Diff-Gen consistently outperforms Real-Gen under all four scenarios. Notably, Real-Gen's performance declines below that of the baseline as the shot number reaches 10, showcasing the importance of the fine-tuning process, which increases the faithfulness of synthetic samples.

Figure 6. Few-shot classification results on CUB: (a) 1-shot, (b) 5-shot, (c) 10-shot, (d) all-shot.

³ Real-Aug (s = 1.0) remains analogous to the reference image (slightly higher faithfulness compared to Real-Gen) because the discrete forward process cannot approximate the ideal normal distribution within a limited number of steps (T = 25).

4.2. Conventional Classification

Experimental setting. To test whether Diff-Mix can further boost performance in a more challenging setting, i.e., under the all-shot scenario with high input resolution, we conduct conventional classification on five domain-specific datasets: CUB [61], Stanford Cars [27], Oxford Flowers [38], Stanford Dogs [26], and FGVC Aircraft [34]. Two backbones are employed: a pretrained (ImageNet1K [49]) ResNet50 [18] with input resolution 448 × 448, and a pretrained (ImageNet21K) ViT-B/16 [13] with input resolution 384 × 384. Label smoothing [36] is applied across all datasets with a confidence level of 0.9. For all expansion strategies, the expansion multiplier is 5 and the replacement probability p is 0.1. Besides, we use a randomized sampling strategy (s ∈ {0.5, 0.7, 0.9}) and a fixed γ (0.5, specific to Diff-Mix) for all datasets.

Comparison methods. We compare Diff-Mix with (1) Real-Filtering (RF) [19], a variation of Real-Gen that incorporates CLIP filtering; (2) Real-Guidance (RG) [19], which augments the dataset using intra-class image translation at low strength (s = 0.1); (3) Da-fusion [58], a method that solely fine-tunes the identifier to personalize each class and employs a randomized sampling strategy (s ∈ {0.25, 0.5, 0.75, 1.0}); and the non-generative augmentation methods (4) CutMix [66] and (5) Mixup [68].

Main results. We show the classification accuracy for different data expansion strategies in Table 1; our observations can be summarized as follows: (1) Diff-Mix consistently demonstrates stable improvements across the majority of settings. Its average performance gain across the five datasets exceeds that of baselines employing intra-class augmentation methods (RG and Da-fusion), the distillation method (RF), and non-generative data augmentation techniques (CutMix and Mixup). (2) Real-Filtering, analogous to the discussion of Real-Gen, exhibits performance degradation on most datasets due to the distribution gap. (3) The combined use of Diff-Mix and CutMix often yields better performance gains. This is attributed to the distinct enhancement mechanisms of the two methods, i.e., vicinal risk minimization [11, 68] and foreground–background disentanglement. (4) Diff-Mix does not exhibit significant performance improvement on the dog dataset. We attribute this lack of improvement to the complexity of the dog dataset, which often contains multiple subjects in a single image, impeding effective foreground editing (refer to the SMs for visual examples).

Backbone: ResNet50@448
Aug. Method           FT Strategy | CUB    Aircraft  Flower  Car    Dog    | Avg
–                     –           | 86.64  89.09     99.27   94.54  87.48  | 91.40
CutMix [66]           –           | 87.23  89.44     99.25   94.73  87.59  | 91.65 (+0.25)
Mixup [68]            –           | 86.68  89.41     99.40   94.49  87.42  | 91.48 (+0.08)
Real-filtering [19]†  none        | 85.60  88.54     99.09   94.59  87.30  | 91.22 (−0.18)
Real-guidance [19]†   none        | 86.71  89.07     99.25   94.55  87.40  | 91.59 (+0.19)
Da-fusion [58]†       TI          | 86.30  87.64     99.37   94.69  87.33  | 91.07 (−0.58)
Diff-Mix              TI + DB     | 87.16  90.25     99.54   95.12  87.74  | 91.96 (+0.56)
Diff-Mix + CutMix     TI + DB     | 87.56  90.01     99.47   95.21  87.89  | 92.03 (+0.63)

Backbone: ViT-B/16@384
Aug. Method           FT Strategy | CUB    Aircraft  Flower  Car    Dog    | Avg
–                     –           | 89.37  83.50     99.56   94.21  92.06  | 91.74
CutMix [66]           –           | 90.52  83.50     99.64   94.83  92.13  | 92.12 (+0.38)
Mixup [68]            –           | 90.32  84.31     99.73   94.98  92.02  | 92.27 (+0.53)
Real-filtering [19]†  none        | 89.49  83.07     99.36   94.66  91.91  | 91.69 (−0.05)
Real-guidance [19]†   none        | 89.54  83.17     99.59   94.65  92.05  | 91.80 (+0.06)
Da-fusion [58]†       TI          | 89.40  81.88     99.61   94.53  92.07  | 91.50 (−0.24)
Diff-Mix              TI + DB     | 90.05  84.33     99.64   95.09  91.99  | 92.22 (+0.48)
Diff-Mix + CutMix     TI + DB     | 90.35  85.12     99.68   95.26  91.89  | 92.46 (+0.72)

Table 1. Conventional classification on five fine-grained datasets. '†' indicates our reproduced results using SD.

4.3. Long-Tail Classification

Experimental setting. Following the settings of previous long-tail dataset constructions [9, 32, 39], we create two domain-specific long-tail datasets, CUB-LT [50] and Flower-LT. The imbalance factor (IF) controls the exponential distribution of the imbalanced data, where a larger value indicates a more imbalanced distribution. To leverage generative models for long-tail classification, we adopt the approach of SYNAuG [65], which uniformizes the imbalanced real data distribution using synthetic data. Translation strength s (0.7) and γ (0.5) are fixed for both Diff-Mix and Real-Mix.

Comparison methods. We compare Diff-Mix with Real-Mix, Real-Gen, the non-generative CutMix-based oversampling approach CMO [39], and its enhanced variant with two-stage deferred re-weighting [9] (CMO+DRW).

Main results. We present the long-tail classification results for CUB-LT in Table 2 and for Flower-LT in Table 3. The observations are as follows: (1) Generative approaches exhibit superior performance in tackling imbalanced classification compared to CutMix-based methods (CMO and CMO+DRW). (2) Real-Mix surpasses Real-Gen in performance across various imbalance factors on both datasets. This indicates that tail classes can benefit from the enhanced diversity obtained by leveraging the visual context of majority classes. (3) Diff-Mix generally achieves the best performance among the compared strategies, especially in the low-shot case, highlighting the importance of fine-tuning.

Method            | IF=100: Many / Medium / Few / All | IF=50 | IF=10
CE                | 79.11 / 64.28 / 13.48 / 33.65     | 44.82 | 58.13
CMO [39]          | 78.32 / 58.57 / 14.78 / 32.94     | 44.08 | 57.62
CMO + DRW [10]    | 78.97 / 56.36 / 14.66 / 32.57     | 46.43 | 59.25
Real-Gen          | 84.88 / 65.23 / 30.68 / 45.86     | 53.43 | 61.42
Real-Mix (s=0.7)  | 84.63 / 66.34 / 34.44 / 47.75     | 55.67 | 62.27
Diff-Mix (s=0.7)  | 84.07 / 67.79 / 36.55 / 50.35     | 58.19 | 64.48

Table 2. Long-tail classification on CUB-LT.

Method            | IF=100: Many / Medium / Few / All | IF=50 | IF=10
CE                | 99.19 / 94.95 / 58.18 / 80.43     | 90.87 | 95.07
CMO [39]          | 99.25 / 95.19 / 67.45 / 83.95     | 91.43 | 95.19
CMO + DRW [10]    | 99.97 / 95.06 / 67.31 / 84.07     | 92.06 | 95.92
Real-Gen          | 98.64 / 95.55 / 66.10 / 83.56     | 91.84 | 95.22
Real-Mix (s=0.7)  | 99.87 / 96.26 / 68.53 / 85.19     | 92.96 | 96.04
Diff-Mix (s=0.7)  | 99.25 / 96.98 / 78.41 / 89.46     | 93.86 | 96.63

Table 3. Long-tail classification on Flower-LT.

4.4. Background Robustness

In this section, we aim to evaluate Diff-Mix's robustness to background shifts; specifically, whether synthesizing more diverse samples can improve the classification model's generalizability when the background is altered. To achieve this, we utilize an out-of-distribution test set for CUB, namely Waterbird [48]. We then perform inference on the whole Waterbird set using classifiers that have been trained on either the original CUB dataset or its expanded variations. In Table 4, we present the classification accuracies across the four groups and compare Diff-Mix with other intra-class methods (Da-fusion and Diff-Aug) as well as CutMix. We observe that Diff-Mix generally outperforms its counterparts and achieves a significant performance improvement (+6.5%) in the challenging counterfactual group (waterbirds with land backgrounds). It is important to highlight that the background scenes in the Waterbird dataset are novel to CUB, requiring the classification model to have a stronger perceptual capability for the images' foregrounds.

Group              | Base.  CutMix  DA-fusion  Diff-Aug  Diff-Mix
(waterbird, water) | 59.50  62.46   60.90      61.83     63.83
(waterbird, land)  | 56.70  60.12   58.10      60.12     63.24
(landbird, land)   | 73.48  73.39   72.94      73.04     75.64
(landbird, water)  | 73.97  74.72   72.77      73.52     74.36
Avg.               | 70.19  71.23   69.90      70.28     72.47

Table 4. Classification results across four groups in Waterbird [48]. Waterbird is an out-of-distribution dataset for CUB, crafted by segmenting CUB's foregrounds and pasting them into scene images from Places [70]. The constructed dataset can be divided into four groups based on the composition of foregrounds (waterbird and landbird) and backgrounds (water and land).

4.5. Discussion

In this section, we address Q2 by examining the impact of the size and diversity of synthetic data. Furthermore, we perform an ablation study to assess the effects of fine-tuning strategies and training hyperparameters of Diff-Mix, which is aimed at answering Q3. Unless specified otherwise, our discussions are based on experiments conducted using CUB
with ResNet50, where inputs are resized to 224 × 224.

Size and diversity of synthetic data. The relationship between performance gain and the size of synthetic data is depicted in Fig. 7 (a), where a classification model was trained with synthetic data of varying sizes. We observe a monotonically increasing trend as the multiplier for the synthetic dataset ranges from 2 to 10. Ideally, the combination choices of (xi, yj) are on the order of N·|Dtrain| (N = 200 for CUB). Furthermore, we limit the number of referable classes for each target class, which means the number of referable backgrounds decreases, resulting in a synthetic dataset of relatively lower diversity. The results are shown in Fig. 7 (b), and we observe a consistent improvement in performance as the number of referable classes increases. These results consistently underscore the critical role of background diversity introduced by Diff-Mix.

Figure 7. Comparison of results across various (a) synthetic data sizes and (b) numbers of referable classes for each target class.

Impact of fine-tuning strategy. Here we compare three different fine-tuning strategies: TI, DB, and the combined TI+DB. All strategies share the same fine-tuning hyperparameters and training steps (35000). To evaluate the distribution gap, we compute the FID score [20] of synthesized images (Diff-Gen) against the training set. As illustrated in Table 5, we observe that both TI+DB and TI have lower FID scores than DB. This can be attributed to the fact that semantic proximity impedes the convergence process. Additionally, while using TI alone results in a relatively low FID score, the improvements in performance are limited. This limitation stems from TI's inability to accurately reconstruct detailed concept (foreground) information, as it is primarily fine-tuned at the semantic level [28].

Setting  | Metric          | Baseline | TI    | DB    | TI + DB
5-shot   | FID (Diff-Gen)  | -        | 18.26 | 19.55 | 18.43
5-shot   | Acc. (Diff-Mix) | 50.90    | 57.64 | 56.11 | 59.41
All-shot | FID (Diff-Gen)  | -        | 14.13 | 14.64 | 13.99
All-shot | Acc. (Diff-Mix) | 81.60    | 81.86 | 81.99 | 82.85

Table 5. Comparison of distribution gap and classification accuracy across three fine-tuning strategies. TI fine-tunes only the identifier, DB fine-tunes only the U-Net, and TI+DB fine-tunes both.

Annotation function. In this section, we discuss the impact of the choices of the translation strength s and the non-linear factor γ in Eq. 3. As shown in Table 6, we observe that as the translation strength decreases, the optimal value for γ also decreases, which underscores the non-linearity of Diff-Mix. The comparison between the 5-shot and all-shot settings indicates that the model tends to prefer a more diverse synthetic dataset when the number of training shots is small (s = 0.9 for 5-shot, s = 0.7 for all-shot). Besides, a larger confidence in the target class is preferred when the shot number is small (γ = 0.1 for 5-shot, γ = 0.5 for all-shot). A possible explanation is that the all-shot setting is less tolerant towards unrealistic images, as discussed in Section 6. Empirically, we recommend choosing a higher translation strength (s ∈ {0.5, 0.7, 0.9}) and a smaller γ (γ ∈ {0.1, 0.3, 0.5}) as a conservative option.

γ    | 5-shot: s=0.5 / s=0.7 / s=0.9 | All-shot: s=0.5 / s=0.7 / s=0.9
1.5  | -4.50 / -0.31 / +10.30        | -1.08 / +0.92 / +0.90
1.0  | -2.31 / +2.99 / +10.79        | +0.25 / +1.14 / +0.90
0.5  | +2.35 / +8.44 / +11.01        | +0.92 / +1.30 / +0.86
0.3  | +3.94 / +9.41 / +11.15        | +0.97 / +1.24 / +0.69
0.1  | +6.18 / +9.86 / +10.84        | +0.50 / +0.88 / +0.84
0.0  | +5.25 / +9.41 / +11.06        | +0.38 / +0.63 / +0.54

Table 6. Comparison of performance gain across various γ and translation strength s. A lower γ indicates a higher confidence over the target class, e.g., (γ = 0.1, s = 0.7) results in 0.04yi + 0.96yj and (γ = 0.5, s = 0.7) results in 0.16yi + 0.84yj.

5. Conclusion

In this work, we investigate two pivotal aspects, faithfulness and diversity, that are critical for the current state-of-the-art text-to-image generative models to enhance image classification tasks. To achieve a more effective balance between these two aspects, we propose an inter-class augmentation strategy that leverages Stable Diffusion. This method enables generative models to produce a greater diversity of samples by editing images from other classes and shows consistent performance improvement across various classification tasks.

6. Acknowledgement

This research is mainly supported by the National Natural Science Foundation of China (92270114). This work is also partially supported by the National Key R&D Program of China under Grant 2021ZD0112801.
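A quick numeric check of the annotation function discussed in Section 4.5: Eq. 3 itself is not reproduced in this excerpt, so the sketch below assumes the target-class label weight is s**gamma; this assumption is a reconstruction, chosen because it matches both worked examples in Table 6's caption.

```python
def diffmix_label(s, gamma):
    """Soft-label weights (w_src, w_tgt) for source class y_i and target
    class y_j. Assumes Eq. 3 assigns weight s**gamma to the target class;
    this matches the worked examples in Table 6's caption but is not
    necessarily the paper's verbatim formula."""
    w_tgt = s ** gamma
    return 1.0 - w_tgt, w_tgt

# gamma = 0.1, s = 0.7  ->  approx 0.04 * y_i + 0.96 * y_j
# gamma = 0.5, s = 0.7  ->  approx 0.16 * y_i + 0.84 * y_j
```

Under this form, a smaller gamma pushes more label mass onto the target class at the same translation strength, which is the "larger confidence in the target class" trade-off analyzed in Table 6.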
References

[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017. 2
[2] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023. 1, 2
[3] Haoyue Bai, Ceyuan Yang, Yinghao Xu, S-H Gary Chan, and Bolei Zhou. Improving out-of-distribution robustness of classifiers via generative interpolation. arXiv preprint arXiv:2307.12219, 2023. 2
[4] Hritik Bansal and Aditya Grover. Leaving reality to imagination: Robust classification via generated datasets. arXiv preprint arXiv:2302.02503, 2023. 1, 2
[5] Christopher Beckham, Sina Honari, Vikas Verma, Alex M Lamb, Farnoosh Ghadiri, R Devon Hjelm, Yoshua Bengio, and Chris Pal. On adversarial mixup resynthesis. Advances in neural information processing systems, 32, 2019. 2
[6] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, and Fahad Shahbaz Khan. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5968–5976, 2023. 2
[7] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. 1
[8] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023. 3
[9] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019. 6
[10] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019. 7
[11] Olivier Chapelle, Jason Weston, Léon Bottou, and Vladimir Vapnik. Vicinal risk minimization. Advances in neural information processing systems, 13, 2000. 6
[12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021. 1
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 6
[14] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015. 1, 2, 5
[15] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 2, 3
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 2
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 1
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6
[19] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022. 2, 6, 7, 3
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 8
[21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3
[22] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 2
[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 4
[24] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017. 3
[25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 1, 2
[26] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, 2011. 6, 2
[27] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013. 6, 2
[28] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023. 2, 3, 8
[29] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. arXiv preprint arXiv:2303.16203, 2023. 2
[30] Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. Invariant grounding for video question answering. In CVPR, 2022. 2
[31] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. arXiv preprint arXiv:2305.08891, 2023. 1
[32] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2537–2546, 2019. 6
[33] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022. 3
[34] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 6, 2
[35] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021. 3
[36] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019. 6
[37] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 1, 2
[38] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008. 6, 2
[39] Seulki Park, Youngkyu Hong, Byeongho Heo, Sangdoo Yun, and Jin Young Choi. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6887–6896, 2022. 6, 7, 3
[40] Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models. arXiv preprint arXiv:2306.04695, 2023. 3
[41] Judea Pearl. Causal inference in statistics: An overview. 2009. 5
[42] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36, 2024. 3
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 5, 1
[44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 1
[45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 3
[46] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023. 2, 3
[47] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015. 2
[48] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019. 7
[49] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. 1, 2, 6
[50] Dvir Samuel, Yuval Atzmon, and Gal Chechik. From generalized zero-shot learning to long-tail with class descriptors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 286–295, 2021. 6
[51] Veit Sandfort, Ke Yan, Perry J Pickhardt, and Ronald M Summers. Data augmentation using generative adversarial networks (cyclegan) to improve generalizability in ct segmentation tasks. Scientific reports, 9(1):16884, 2019. 2
[52] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it: Learning transferable representations from synthetic imagenet clones. In CVPR 2023–IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 1
[53] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 2
[54] Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, and Clinton Fookes. Boosting zero-shot classification with synthetic data diversity via stable diffusion. arXiv preprint arXiv:2302.03298, 2023. 2
[55] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 2
[56] Shikun Sun, Longhui Wei, Zhicai Wang, Zixuan Wang, Junliang Xing, Jia Jia, and Qi Tian. Inner classifier-free guidance and its taylor expansion for diffusion models. In The Twelfth International Conference on Learning Representations, 2023. 2
[57] Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners. arXiv preprint arXiv:2306.00984, 2023. 1, 2
[58] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944, 2023. 2, 5, 6, 7, 3
[59] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023. 3
[60] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. 3
[61] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. 5, 6, 2
[62] Zhicai Wang, Yanbin Hao, Tingting Mu, Ouxiang Li, Shuo Wang, and Xiangnan He. Bi-directional distribution alignment for transductive zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19893–19902, 2023. 2
[63] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022. 2
[64] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12863–12872, 2021. 3
[65] Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, and Tae-Hyun Oh. Exploiting synthetic data for data imbalance problems: Baselines from a data perspective. arXiv preprint arXiv:2308.00994, 2023. 6
[66] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019. 2, 6, 7
[67] An Zhang, Wenchang Ma, Xiang Wang, and Tat-Seng Chua. Incorporating bias-aware margins into contrastive loss for collaborative filtering. In NeurIPS, 2022. 2
[68] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017. 2, 6, 7
[69] Chenyu Zheng, Guoqiang Wu, and Chongxuan Li. Toward understanding generative data augmentation. Advances in Neural Information Processing Systems, 36, 2024. 2
[70] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017. 7
[71] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017. 3
