
arXiv:2402.04929v3 [cs.CV] 26 Jun 2024

Source-Free Domain Adaptation with Diffusion-Guided Source Data Generation

Shivang Chopra, Georgia Institute of Technology ([email protected])
Suraj Kothawade, University of Texas, Dallas ([email protected])
Houda Aynaou, Georgia Institute of Technology ([email protected])
Aman Chadha∗, Amazon GenAI ([email protected])

Abstract
This paper introduces a novel approach to leverage the generalizability of Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). Our proposed DM-SFDA method involves fine-tuning a pre-trained text-to-image diffusion model to generate source domain images, using features from the target images to guide the diffusion process. Specifically, the pre-trained diffusion model is fine-tuned to generate source samples that minimize entropy and maximize confidence for the pre-trained source model. We then use a diffusion model-based image mixup strategy to bridge the domain gap between the source and target domains. We validate our approach through comprehensive experiments across a range of datasets, including Office-31 [39], Office-Home [48], and VisDA [35]. The results demonstrate significant improvements in SFDA performance, highlighting the potential of diffusion models in generating contextually relevant, domain-specific images.

1 Introduction
Deep Convolutional Neural Networks (CNNs) have demonstrated impressive performance on several visual tasks in recent years. However, their effectiveness hinges on the assumption that the training and test distributions are the same [5]. Consequently, a large drop in performance is generally observed when CNN-based models are deployed in real-world settings with a discrepancy in data distribution [21]. Domain Adaptation (DA) attempts to reduce this disparity to make these
models perform well across multiple domains. Traditional DA approaches that rely on fixed source
data might struggle to keep up with the pace of domain changes. Moreover, the rising prominence of
data privacy regulations has led to a demand for DA techniques that can function effectively without
relying on access to the source training data, a setting generally known as Source Free Domain
Adaptation (SFDA).
Most current state-of-the-art DA methods attain adaptability by aligning the two disparate data distributions within a shared feature space that spans both domains [5]. One way
of achieving this in a source-free manner is to use synthetically generated source data. However,
generating synthetic source data that accurately represents the diversity and complexity of the source
domain can be difficult. Furthermore, if the synthetic data is not of high quality, it might introduce
noise and inconsistencies, negatively impacting the model’s performance on the target domain.
Notably, recent advancements in Diffusion Generative Models (DGMs) [15, 43] have demonstrated
exceptional capabilities in producing diverse and high-quality images. Consequently, this paper aims
to harness the generalizability of the state-of-the-art text-to-image diffusion models to the challenging
task of SFDA.

∗Work does not relate to position at Amazon.
[Figure 1 diagram, four phases: (I) Selective Pseudo Labeling of Target Data; (II) Finetuning Diffusion Model on Target Data via Textual Inversion; (III) Source Data Generation using AlignProp, guided by a classification-confidence reward from the pre-trained source ResNet; (IV) Unsupervised Domain Adaptation.]
Figure 1: Overall training pipeline of the proposed DM-SFDA method. The pipeline starts by selectively pseudo-labeling the target data using the pre-trained source model. This is followed by fine-tuning a pre-trained text-to-image diffusion model on the target images using textual inversion [6]. Subsequently, the pre-trained source model is used to fine-tune this diffusion model via AlignProp [36] to generate source images. Finally, the fine-tuned diffusion models are used to generate intermediate domains between the source and target domains to perform unsupervised domain adaptation.

To address the challenges of data privacy and diversity in the reconstruction process, we present an
innovative framework named Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). An
overview of DM-SFDA is illustrated in Figure 1. The core idea of this approach is to use text-to-image
diffusion models to generate images representative of the source domain based on the target domain
and a pre-trained source network. Essentially, this involves fine-tuning a pre-trained text-to-image
diffusion model to produce source samples that minimize the prediction entropy of the pre-trained source model.
The key contributions of our framework can be summarized as follows:

1. We propose a novel framework that enhances model performance in unseen domains while simultaneously addressing the challenges posed by limited access to source data and the increasing emphasis on data privacy.
2. Our framework harnesses the generalization capabilities of Diffusion Models to improve the diversity and completeness of the reconstructed source data.
3. Through extensive qualitative and quantitative comparisons against several traditional and state-of-the-art baselines, we demonstrate the effectiveness of the proposed pipeline.

The remainder of the paper is organized as follows. In Section 2, we discuss related work in the areas
of DA, SFDA, and DGMs. In Section 3, we revisit the preliminary concepts that form the basis of our
proposed approach. In Section 4, we discuss and formalize the problem definition and present our
methodology. We present experimental results and analysis in Sections 5 and 6. Finally, we discuss
the limitations of our work in Section 7 and conclude with a brief summary in Section 8.

2 Related Work

2.1 Domain Adaptation

DA has its roots in [1], which highlighted the role of good feature representations in successful adaptation across domains. Early works in DA adopted moment matching to align feature distributions between the source and target domains [28, 17, 29, 44, 47]. Subsequent works used adversarial learning-based approaches to tackle the problem of DA [7, 40, 31, 10, 42]. Beyond these, many other techniques [50, 45, 41, 3] have been proposed for the task of DA.

2.2 Unsupervised Domain Adaptation

Unsupervised Domain Adaptation (UDA) is a subtype of DA that aims to transfer knowledge from a
labeled source domain to a different unlabeled target domain [30]. Existing mainstream UDA methods
can be categorized into two main types of methods: those that align the source and target domain
distributions by designing specific metrics [34, 33, 24, 27], and those that learn domain-invariant
feature representations through adversarial learning [49, 9, 32]. However, the success of most of these methods depends on large amounts of source data, which may not be available in many practical scenarios.

2.3 Source-Free Domain Adaptation

SFDA has been considered in the literature as a means of reducing reliance on source data. As
described in [55], the existing SFDA research can generally be categorized into two approaches:
data-centric and model-centric. Model-centric methods employ techniques such as self-training and
self-attention, while data-centric methods include domain-based reconstruction and image-based
information extraction. Our proposed method follows the data-centric perspective to solve the SFDA
task using source domain generation. 3C-GAN [25] is a pioneering work in this area, which uses a Generative Adversarial Network (GAN) to generate target-like images and simultaneously adapts the source pre-trained model. Other works, such as SDDA [22] and CPGA [37], also solve the SFDA task using a similar data-generation approach. More recently, AaD [52] and Co-learn
[57] have also shown state-of-the-art performance across various datasets and tasks.

2.4 DGMs for Domain Adaptation

Recently, there has been a significant shift in the landscape of generative modeling due to DGMs [15,
43], demonstrating impressive capabilities in generating highly realistic text-conditioned images.
DGMs have also attracted growing interest in the DA community, with many recent works using them for input augmentation. A recent example is a text-to-image diffusion model, employed by
[2] to generate target domain images using source domain labels, thereby demonstrating the efficacy
of diffusion models in One-Shot Unsupervised Domain Adaptation (OSUDA). DGMs, when trained
on multiple source domains, have also been instrumental in guiding approximate inference in target
domains, as reported by [11]. In our work, we leverage a recently introduced fine-tuning strategy for
diffusion models called AlignProp [36], to fine-tune the diffusion models using the output probability
of the source model as an objective function.

3 Preliminaries
3.1 Conditional Diffusion Probabilistic Models

The Conditional Denoising Diffusion Probabilistic Model (DDPM) forms the backbone of our source data reconstruction pipeline. This model represents a distribution over data $x_0$, conditioned on a contextual input $c$. The distribution arises from a sequential denoising process that reverses a Markovian forward process $q(x_t \mid x_{t-1})$, which progressively adds Gaussian noise to the data. A forward-process posterior mean predictor $\mu_\theta(x_t, t, c)$ is trained to reverse the forward process for all $t \in \{0, 1, 2, \ldots, T\}$. Training maximizes a variational lower bound on the model log-likelihood, with the objective function

$$\mathcal{L}_{DDPM}(\theta) = \mathbb{E}\big[\, \| \hat{\mu}(x_t, x_0) - \mu_\theta(x_t, t, c) \|^2 \,\big] \tag{1}$$

where $\hat{\mu}$ is a weighted average of $x_0$ and $x_t$.

The sampling process of a diffusion model starts with a sample from the Gaussian distribution $x_T \sim \mathcal{N}(0, 1)$, which is then denoised using the reverse process $p_\theta(x_{t-1} \mid x_t, c)$ to produce a trajectory $\{x_T, x_{T-1}, \ldots, x_0\}$ ending with a sample $x_0$. Here, the sampler uses an isotropic Gaussian reverse process with a fixed time-dependent variance:

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\big(x_{t-1} \mid \mu_\theta(x_t, t, c), \sigma_t^2 I\big) \tag{2}$$
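To make the objective concrete, below is a minimal PyTorch sketch of a conditional DDPM training step in the common ε-prediction form, which is equivalent (up to weighting) to regressing the posterior mean in Eq. (1). The `denoiser` network and the noise schedule here are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, x0, c, alphas_cumprod):
    """One conditional DDPM training step (epsilon-prediction form).

    denoiser(x_t, t, c) predicts the noise injected at step t; minimizing
    this MSE is the standard reparameterization of the posterior-mean
    objective in Eq. (1). alphas_cumprod holds the cumulative products of
    the noise schedule, shape (T,). Assumes 4D image batches (B, C, H, W).
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)         # random timesteps
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)              # \bar{alpha}_t
    noise = torch.randn_like(x0)                            # forward-process noise
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # sample q(x_t | x_0)
    eps_pred = denoiser(x_t, t, c)                          # conditioned on c
    return F.mse_loss(eps_pred, noise)
```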

3.2 Markov Decision Process

A Markov decision process (MDP) serves as a structured representation of sequential decision problems. It is characterized by a tuple $(\mathcal{S}, \mathcal{A}, \rho_0, P, R)$, where $\mathcal{S}$ denotes the set of possible states, $\mathcal{A}$ the available actions, $\rho_0$ the initial state distribution, $P$ the transition kernel, and $R$ the reward function. At each discrete time step $t$, the agent observes a state $s_t$ from the state space $\mathcal{S}$, takes an action $a_t$ from the action space $\mathcal{A}$, receives a reward $R(s_t, a_t)$ in response, and transitions to a new state $s_{t+1}$ drawn from $P(\cdot \mid s_t, a_t)$. The agent's decision-making is guided by a policy $\pi(a \mid s)$ that dictates actions based on states. During its engagement with the MDP, the agent generates trajectories, i.e., sequences of states and actions conventionally written as $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$.
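For concreteness, a minimal generic sketch of an MDP and a trajectory rollout (illustrative only, not specific to this paper's setup):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class MDP:
    rho0: Callable[[], object]              # samples an initial state s0
    P: Callable[[object, object], object]   # transition: (s, a) -> s'
    R: Callable[[object, object], float]    # reward: (s, a) -> r

def rollout(mdp: MDP, policy: Callable, T: int) -> Tuple[List, float]:
    """Generate tau = (s0, a0, ..., sT, aT) and the accumulated reward."""
    s, tau, ret = mdp.rho0(), [], 0.0
    for _ in range(T + 1):
        a = policy(s)
        tau += [s, a]
        ret += mdp.R(s, a)
        s = mdp.P(s, a)
    return tau, ret
```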

4 Proposed Method

4.1 Notations and Problem Definition

Following the notation used in [55], we represent a domain as $\mathcal{D}$. Each domain consists of a dataset $\phi$ and an associated label set $\mathcal{L}$. A dataset comprises an instance set $X = \{x_i\}_{i=1}^{n}$, drawn from a $d$-dimensional marginal distribution $P(X)$, and a label set $Y = \{y_i\}_{i=1}^{c}$, where $n$ represents the total number of samples and $c$ the total number of classes.

The SFDA scenario involves two stages: pre-training and adaptation. During pre-training, a model $\mathcal{M}$ is trained on labeled data from the source domain $\mathcal{D}_S = \{\{X^s, P(X^s), d^s\}, \mathcal{L}^s\}$. The goal of the adaptation phase is then to adapt the pre-trained source model to the unlabeled target data $\phi^t = \{X^t, P(X^t), d^t\}$. The proposed approach assumes a closed-set setting, implying that the label spaces of the source and target domains are identical.

4.2 Method Overview

An overview of the proposed DM-SFDA method for source-free domain adaptation using text-to-image diffusion models is shown in Figure 1. The key idea behind the approach is to leverage the generalizability of state-of-the-art text-to-image diffusion models to tackle the task of SFDA. The training pipeline contains four phases: (I) selective pseudo labeling of target data, (II) fine-tuning a diffusion model on target data, (III) source data generation using AlignProp, and (IV) unsupervised domain adaptation. The first phase selectively pseudo-labels the target data using the pre-trained source model. The second phase fine-tunes a pre-trained diffusion model to learn concepts in the target domain using Textual Inversion [6]. Subsequently, the pre-trained source model is used to fine-tune this diffusion model via AlignProp [36] to generate source images. Finally, a diffusion model-based domain mixup strategy is used to perform unsupervised domain adaptation. Each of these phases is described in detail in the following subsections.

4.3 Phase I: Selective Pseudo Labeling Target Data

The proposed DM-SFDA pipeline requires labeled target data to generate data from the source domain.
Therefore, the initial phase of our proposed pipeline addresses the challenge of unlabeled target data
by selecting reliable labels for the target samples using a selective pseudo-labeling strategy. This
approach is akin to the one proposed in [20]. As shown in [13], prediction confidence and difference
in entropy can be reliable measures of pseudo-label accuracy and estimate different types of domain
shifts. Therefore, prediction confidence and average entropy are used as metrics to assess label
reliability. A binary reliability score $r^i$ is assigned to each sample in the target data, determined by its prediction confidence and prediction uncertainty $g_u^i$, as follows:

$$
\begin{aligned}
g_u^i &= \operatorname{std}\{\operatorname{conf}(\mathcal{M}(x_t))\} \\
T_c &= \frac{1}{B}\sum_{i=1}^{B} \operatorname{conf}(\mathcal{M}(x_t)) \\
T_u &= \frac{1}{B}\sum_{i=1}^{B} g_u^i \\
r^i &= \begin{cases} 1, & \text{if } \operatorname{conf}(\mathcal{M}(x_t)) \geq T_c \text{ and } g_u^i \leq T_u \\ 0, & \text{otherwise} \end{cases}
\end{aligned} \tag{3}
$$

Here, $T_c$ and $T_u$ represent the selection thresholds for confidence and uncertainty. Taking the batch average as the threshold eliminates the need for per-dataset hyper-parameter tuning and makes the selection process highly adaptive. Furthermore, aleatoric uncertainty [18] is used, since it better addresses the concern of domain shift.
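A minimal sketch of this selection rule is shown below. We assume the per-sample uncertainty $g_u^i$ is computed as the standard deviation of confidences over several stochastic forward passes (e.g., with dropout active); the function and variable names are illustrative.

```python
import torch

@torch.no_grad()
def select_reliable_pseudo_labels(model, x, n_passes: int = 8):
    """Implements Eq. (3): keep samples whose confidence is above the batch
    average and whose uncertainty is below it. Assumes model(x) returns
    logits and that repeated stochastic passes give the spread used as the
    uncertainty estimate g_u (an assumption, not the paper's exact recipe).
    """
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_passes)])
    conf_per_pass, _ = probs.max(dim=2)         # (n_passes, B)
    conf = conf_per_pass.mean(dim=0)            # per-sample confidence
    g_u = conf_per_pass.std(dim=0)              # per-sample uncertainty
    T_c, T_u = conf.mean(), g_u.mean()          # batch-average thresholds
    reliable = (conf >= T_c) & (g_u <= T_u)     # binary reliability score r^i
    pseudo_labels = probs.mean(dim=0).argmax(dim=1)
    return pseudo_labels, reliable
```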

4.4 Phase II: Diffusion Model Finetuning on Target Data

The second phase of the proposed DM-SFDA pipeline involves fine-tuning a text-to-image diffusion
model on the target data using LoRA. However, the lack of class labels poses a challenge as there
are no textual cues to guide the diffusion process. To address this, we employ a recently introduced
fine-tuning strategy called Textual Inversion [6]. Using images from the target domain, Textual
Inversion learns to represent objects in the images through new "words" in the embedding space
of a pre-trained text-to-image model. As illustrated in Figure 1, we assign a placeholder string "<class-{idx}>" to each newly learned concept, using the class indices from the selective pseudo labeling done in Phase I. These new identifiers are then used as textual cues to guide the diffusion process in
subsequent phases of the pipeline.
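As an illustration, the sketch below shows how per-class embeddings learned with Textual Inversion could be attached to a Stable Diffusion pipeline and used as prompts. We use `diffusers`' `load_textual_inversion` as a plausible mechanism; the embedding paths are hypothetical.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

num_classes = 31  # e.g., Office-31; class count comes from Phase I pseudo labels
for idx in range(num_classes):
    # Hypothetical path to the embedding learned for pseudo-labeled class idx.
    pipe.load_textual_inversion(f"embeddings/class-{idx}.bin",
                                token=f"<class-{idx}>")

# The placeholder tokens now serve as textual cues in later phases.
images = pipe(prompt="a photo of a <class-3>").images
```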

4.5 Phase III: Source Data Generation using AlignProp

In the third phase, we use the method proposed in [36] to further fine-tune our diffusion model to
generate source-like images. This finetuning is done by transforming the denoising process of a
diffusion model into a differentiable recurrent policy. The iterative denoising process is mapped to
the following single-step MDP:

$$
\begin{aligned}
\mathcal{S} &\triangleq \{(x_T, c) : x_T \sim \mathcal{N}(0, 1)\} \\
\mathcal{A} &\triangleq \{x_0 : x_0 \sim \pi_\theta(\cdot \mid x_T, c),\; x_T \sim \mathcal{N}(0, 1)\} \\
R(x_0) &\triangleq R_\phi(x_0),\; x_0 \in \mathcal{A}
\end{aligned} \tag{4}
$$
Here, $R_\phi$ is the reward function, which depends on the generated images. In our case, we define
the reward function to use the information in the pre-trained source model to guide the diffusion
process to generate source-like images. Inspired by the loss function used in DAFL [4], our reward
function consists of the following three components to extract maximum information from the source
model:

• Confidence Reward: The confidence reward function $R_{conf}$ ensures that higher-confidence predictions are assigned a higher reward:

$$R_{conf} = \frac{1}{B} \sum_{i=1}^{B} \operatorname{conf}(\mathcal{M}(x_t)) \tag{5}$$

• One-Hot Reward: As described in [4], the outputs of the source model should be similar to those on the training data if the input follows the training distribution. Therefore, the one-hot reward function $R_{OH}$ assigns higher rewards to samples that produce one-hot-like predictions:

$$R_{OH} = 1 - \frac{1}{B} \sum_{i} H_{cross}(\mathcal{M}(x_t), y_t) \tag{6}$$

[Figure 2 image: rows show Office-31 (Webcam → Amazon) and Office-Home (Product → Clipart) samples generated at αunet = 0.0, 0.25, 0.5, 0.75, and 1.0, interpolating from the target domain to the source domain.]

Figure 2: Visualization of the diffusion-based domain mixup for Unsupervised Domain Adaptation.

• Batch Norm Statistics (BNS) Reward: As described in [54], to effectively match the low-level and high-level feature maps of the generated images, we can match the Batch Norm statistics of the pre-trained neural network with those induced by the generated images. The BNS reward function assigns a higher reward to samples whose BNS are similar to those of the model:

$$R_{BNS} = -\Big( \sum_l \big\| \mu_l(x_t) - BN_l(\text{running\_mean}) \big\|_2 + \sum_l \big\| \sigma_l(x_t) - BN_l(\text{running\_variance}) \big\|_2 \Big) \tag{7}$$

The final reward function used for fine-tuning the diffusion model is defined as:

$$R = \lambda_A R_{conf} + \lambda_B R_{OH} + \lambda_C R_{BNS} \tag{8}$$
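The following is a sketch of this composite reward, assuming a ResNet-style source model whose BatchNorm layers have forward hooks (registered elsewhere) that store each layer's input batch statistics; the hook attribute names and coefficient values are illustrative, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def composite_reward(source_model, images, bn_layers,
                     lam_a=1.0, lam_b=1.0, lam_c=1.0):
    """Eqs. (5)-(8): confidence, one-hot, and BN-statistics rewards.

    bn_layers are the BatchNorm2d modules of the pre-trained source model;
    we assume hooks have stored each layer's input batch statistics in
    layer.batch_mean / layer.batch_var (hypothetical attribute names).
    """
    logits = source_model(images)
    probs = logits.softmax(dim=1)

    r_conf = probs.max(dim=1).values.mean()               # Eq. (5)

    pseudo = probs.argmax(dim=1)                          # one-hot targets
    r_oh = 1.0 - F.cross_entropy(logits, pseudo)          # Eq. (6)

    r_bns = 0.0                                           # Eq. (7)
    for bn in bn_layers:
        r_bns = r_bns - torch.norm(bn.batch_mean - bn.running_mean)
        r_bns = r_bns - torch.norm(bn.batch_var - bn.running_var)

    return lam_a * r_conf + lam_b * r_oh + lam_c * r_bns  # Eq. (8)
```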


Using the above state, action and reward functions, and the class prompts P, the parameters of the
diffusion model are updated using gradient descent on the following loss function:

1 X
Lalign (θ; P) = − Rϕ (πθ (xt , ci )) (9)
|P| i
c ∈P

The training is done using backpropagation through time on the recurrent policy $\pi$. Furthermore, as described in [36], to reduce the memory overhead, only the LoRA weights are fine-tuned and truncated backpropagation through time (TBTT) is used instead of full backpropagation through time (BPTT). After the AlignProp fine-tuning is complete, the diffusion model produces source images using the "<class-{idx}>" placeholders as guiding prompts.
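A condensed sketch of the resulting update loop follows; the truncation length `K`, the scheduler helpers, and the decode call are hypothetical stand-ins for the actual AlignProp implementation.

```python
import torch

def alignprop_step(pipe, prompts, reward_fn, optimizer,
                   num_steps: int = 50, K: int = 5):
    """One AlignProp-style update: run the denoising chain, keep gradients
    only for the last K steps (truncated BPTT), and backpropagate the
    reward through the LoRA-parameterized denoiser. Sketch only; the helper
    methods on `pipe` (encode_prompts, timesteps, step, decode) are
    illustrative, not a real pipeline API.
    """
    latents = torch.randn(len(prompts), 4, 64, 64, device="cuda")  # x_T
    cond = pipe.encode_prompts(prompts)
    for i, t in enumerate(pipe.timesteps(num_steps)):
        keep_grad = i >= num_steps - K        # truncate BPTT to last K steps
        with torch.set_grad_enabled(keep_grad):
            eps = pipe.unet(latents, t, cond)
            latents = pipe.step(eps, t, latents)
        if not keep_grad:
            latents = latents.detach()
    images = pipe.decode(latents)
    loss = -reward_fn(images).mean()          # Eq. (9): maximize the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```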

4.6 Phase IV: Unsupervised Domain Adaptation

The fourth phase starts by labeling the generated source domain images using the pre-trained source model. Once we have the reconstructed and labeled source domain data, we have effectively converted the initial SFDA problem into a standard Unsupervised Domain Adaptation (UDA) problem. Since all of the diffusion model fine-tuning happens via Low-Rank Adapters, we can use the fine-tuned weight patches to effectively compensate for the domain discrepancy between the source and target domains. This is done by generating multiple intermediate augmented domains by varying the scaling parameter $\alpha_{unet}$ while applying the LoRA patch to the pre-trained model. Figure 2 shows a visualization of this diffusion model-based inter-domain mixup. Subsequently, we use the approach proposed in [33] to train two complementary models on these intermediate domains, which teach each other to bridge the domain gap through confidence-based learning: one model teaches the other using positive pseudo-labels or teaches itself using negative pseudo-labels. Through this confidence-based learning, the two models, with their different characteristics, gradually move closer to the target domain. Furthermore, consistency regularization is used to ensure stable convergence when training both models.
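A sketch of the LoRA-scaling mixup used to synthesize these intermediate domains is shown below; merging a LoRA patch as $W + \alpha \, B A$ is the standard formulation, and the toy shapes are illustrative.

```python
import torch

def apply_lora_scaled(base_weight: torch.Tensor, lora_A: torch.Tensor,
                      lora_B: torch.Tensor, alpha_unet: float) -> torch.Tensor:
    """Merge a LoRA patch into a base weight with scale alpha_unet.

    alpha_unet = 0.0 keeps the target-adapted base weight unchanged, while
    alpha_unet = 1.0 applies the full source-generation patch; intermediate
    values yield the mixup domains visualized in Figure 2.
    """
    return base_weight + alpha_unet * (lora_B @ lora_A)

# Toy usage: a 320x320 UNet weight with a rank-4 LoRA patch.
W = torch.randn(320, 320)
A = torch.randn(4, 320)    # LoRA down-projection
B = torch.randn(320, 4)    # LoRA up-projection
intermediate_weights = [apply_lora_scaled(W, A, B, a)
                        for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```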

| Method | Source-Free | A→D | A→W | D→A | D→W | W→A | W→D | Avg. |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 [14] | ✗ | 68.9 | 68.4 | 62.5 | 96.7 | 60.7 | 99.3 | 76.1 |
| MCC [19] | ✗ | 95.6 | 95.4 | 72.6 | 98.6 | 73.9 | 100.0 | 89.4 |
| GSDA [16] | ✗ | 94.8 | 95.7 | 73.5 | 99.1 | 74.9 | 100.0 | 89.7 |
| SRDC [46] | ✗ | 95.8 | 95.7 | 76.7 | 99.2 | 77.1 | 100.0 | 90.8 |
| FixBi [33] | ✗ | 95.0 | 96.1 | 78.7 | 99.3 | 79.4 | 100.0 | 91.4 |
| CoVi [34] | ✗ | 98.0 | 97.6 | 77.5 | 99.3 | 78.4 | 100.0 | 91.8 |
| ICON [56] | ✗ | 97.0 | 93.3 | 79.4 | 99.2 | 78.3 | 100.0 | 91.2 |
| SHOT [26] | ✓ | 94.0 | 90.1 | 74.7 | 98.4 | 74.3 | 99.9 | 88.6 |
| 3C-GAN [25] | ✓ | 92.7 | 93.7 | 75.3 | 98.5 | 77.8 | 99.8 | 89.6 |
| NRC [51] | ✓ | 96.0 | 90.8 | 75.3 | 99.0 | 75.0 | 100.0 | 89.4 |
| NRC++ [53] | ✓ | 95.9 | 91.2 | 75.5 | 99.1 | 75.0 | 100.0 | 89.5 |
| AaD [52] | ✓ | 96.4 | 92.1 | 75.0 | 99.1 | 76.5 | 100.0 | 89.9 |
| AaD w/ Co-learn [57] | ✓ | 97.6 | 98.7 | 82.1 | 99.3 | 80.1 | 100.0 | 93.0 |
| DM-SFDA | ✓ | 97.7 | 99.0 | 82.7 | 99.3 | 83.5 | 100.0 | 93.7 |

Table 1: Classification accuracy (%) under UDA and SFDA settings on the Office-31 [39] dataset for source-free domain adaptation (ResNet-50).

| Method | Source-Free | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 [14] | ✗ | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| MCC [19] | ✗ | 88.1 | 80.3 | 80.5 | 71.5 | 90.1 | 93.2 | 85.0 | 71.6 | 89.4 | 73.8 | 85.0 | 36.9 | 78.8 |
| GSDA [16] | ✗ | 61.3 | 76.1 | 79.4 | 65.4 | 73.3 | 74.3 | 65.0 | 53.2 | 80.0 | 72.2 | 60.6 | 83.1 | 70.3 |
| SRDC [46] | ✗ | 52.3 | 76.3 | 81.0 | 69.5 | 76.2 | 78.0 | 68.7 | 53.8 | 81.7 | 76.3 | 57.1 | 85.0 | 71.3 |
| FixBi [33] | ✗ | 58.1 | 77.3 | 80.4 | 67.7 | 79.5 | 78.1 | 65.8 | 57.9 | 81.7 | 76.4 | 62.9 | 86.7 | 72.7 |
| CoVi [34] | ✗ | 58.5 | 78.1 | 80.0 | 68.1 | 80.0 | 77.0 | 66.4 | 60.2 | 82.1 | 76.6 | 63.6 | 86.5 | 73.1 |
| ICON [56] | ✗ | 63.3 | 81.3 | 84.5 | 70.3 | 82.1 | 81.0 | 70.3 | 61.8 | 83.7 | 75.6 | 68.6 | 87.3 | 75.8 |
| SHOT [26] | ✓ | 57.1 | 78.1 | 81.5 | 68.0 | 78.2 | 78.1 | 67.4 | 54.9 | 82.2 | 73.3 | 58.8 | 84.3 | 71.8 |
| NRC [51] | ✓ | 57.7 | 80.3 | 82.0 | 68.1 | 79.8 | 78.6 | 65.3 | 56.4 | 83.0 | 71.0 | 58.6 | 85.6 | 72.2 |
| NRC++ [53] | ✓ | 57.8 | 80.4 | 81.6 | 69.0 | 80.3 | 79.5 | 65.6 | 57.0 | 83.2 | 72.3 | 59.6 | 85.7 | 72.5 |
| AaD [52] | ✓ | 59.3 | 79.3 | 82.1 | 68.9 | 79.8 | 79.5 | 67.2 | 57.4 | 83.1 | 72.1 | 58.5 | 85.4 | 72.7 |
| AaD w/ Co-learn [57] | ✓ | 65.1 | 86.0 | 87.0 | 76.8 | 86.3 | 86.5 | 74.4 | 66.1 | 87.7 | 77.9 | 66.1 | 88.4 | 79.0 |
| DM-SFDA | ✓ | 68.5 | 89.6 | 83.3 | 70.0 | 85.8 | 87.4 | 71.3 | 69.6 | 88.2 | 77.8 | 68.5 | 88.7 | 79.5 |

Table 2: Classification performance (%) under UDA and SFDA settings on the Office-Home dataset [48] (ResNet-50 backbone). We report Top-1 accuracy on 12 domain shifts (→) and take the average (Avg.) over them.

5 Experiments and Results

5.1 Datasets
• Office-31: Office-31 [39] is a benchmark image classification dataset that consists of a
limited set of images distributed across 31 categories spanning three domains: Amazon
(2,817 images), DSLR (498 images), and Webcam (795 images).
• Office-Home: Office-Home [48], on the other hand, comprises a more extensive dataset
with a total of 15.5K images from 65 classes, gathered from 4 distinct image domains:
Artistic, Clipart, Product, and Real-world. Our analysis includes 12 transfer tasks for this
dataset.
• VisDA: VisDA [35] encompasses two distinct domains: synthetic and real, each comprising
12 classes. The synthetic domain holds around 150K computer-generated 3D images with
various poses, while the corresponding real domain includes approximately 55K images
captured from the real world.

5.2 Experimental Setup

We implement our approach using PyTorch, with ResNet-50 [14] as the backbone network for the Office-31 [39] and Office-Home [48] datasets and ResNet-101 [14] for the VisDA [35] dataset.
| Method | Source-Free | plane | bike | bus | car | horse | knife | mcycle | person | plant | sktbrd | train | truck | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-101 [14] | ✗ | 55.1 | 53.3 | 61.9 | 59.1 | 80.6 | 17.9 | 79.7 | 31.2 | 81.0 | 26.5 | 73.5 | 8.5 | 52.4 |
| MCC [19] | ✗ | 88.7 | 80.3 | 80.5 | 71.5 | 90.1 | 93.2 | 85.0 | 71.6 | 89.4 | 73.8 | 85.0 | 36.9 | 78.8 |
| GSDA [16] | ✗ | 93.1 | 67.8 | 83.1 | 83.4 | 94.7 | 93.4 | 93.4 | 79.5 | 93.0 | 88.8 | 83.4 | 36.7 | 81.5 |
| FixBi [33] | ✗ | 96.1 | 87.8 | 90.5 | 90.3 | 96.8 | 95.3 | 92.8 | 88.7 | 97.2 | 94.2 | 90.9 | 25.7 | 87.2 |
| CoVi [34] | ✗ | 96.8 | 85.6 | 88.9 | 88.6 | 97.8 | 93.4 | 91.9 | 87.6 | 96.0 | 93.8 | 93.6 | 48.1 | 88.5 |
| ICON [56] | ✗ | 96.6 | 91.8 | 88.0 | 82.0 | 96.8 | 93.3 | 90.0 | 81.0 | 95.2 | 93.6 | 91.0 | 49.5 | 87.4 |
| SHOT [26] | ✓ | 94.3 | 88.5 | 80.1 | 57.3 | 93.1 | 94.9 | 80.7 | 80.3 | 91.5 | 89.1 | 86.3 | 58.2 | 82.9 |
| 3C-GAN [25] | ✓ | 94.8 | 73.4 | 68.8 | 74.8 | 93.1 | 95.4 | 88.6 | 84.7 | 89.1 | 84.7 | 83.5 | 48.1 | 81.6 |
| NRC [51] | ✓ | 96.8 | 91.3 | 82.4 | 62.4 | 96.2 | 95.9 | 86.1 | 80.6 | 94.8 | 94.1 | 90.4 | 59.7 | 85.9 |
| NRC++ [53] | ✓ | 96.8 | 91.9 | 88.2 | 82.8 | 97.1 | 96.2 | 90.0 | 81.1 | 95.2 | 93.8 | 91.1 | 49.6 | 87.8 |
| AaD [52] | ✓ | 97.4 | 90.5 | 80.8 | 76.2 | 97.3 | 96.1 | 89.8 | 82.9 | 95.5 | 93.0 | 92.0 | 64.7 | 88.0 |
| AaD w/ Co-learn [57] | ✓ | 97.6 | 90.2 | 85.0 | 83.1 | 97.1 | 92.1 | 84.9 | 96.8 | 96.8 | 95.1 | 92.2 | 56.8 | 89.1 |
| DM-SFDA | ✓ | 98.1 | 89.8 | 90.6 | 90.5 | 96.8 | 95.2 | 92.2 | 93.4 | 97.8 | 94.4 | 92.4 | 48.8 | 86.3 |

Table 3: Per-class accuracy and mean accuracy (%) on the VisDA-2017 [35] dataset for source-free domain adaptation (ResNet-101).

All experiments for our proposed approach were conducted on an NVIDIA A100 GPU. Other details for specific parts of our pipeline are specified below:

• Diffusion Model Fine-tuning: We use Stable Diffusion v1.4 [38] as the base model for all experiments. The fine-tuning of the diffusion models was done in the Accelerate [12] environment, and memory-efficient attention was enabled using xFormers [23] for all experiments. Furthermore, we used the Low-Rank Adaptation (LoRA) of the Stable Diffusion pipeline.
• AlignProp: For the AlignProp experiments, we employed the Low-Rank Adaptation (LoRA) technique within the framework of the Stable Diffusion pipeline. The training was done with a batch size of 4, and 100 batches were sampled per step. The training procedure encompassed 100 steps, each incorporating two distinct phases: a sampling phase and 10 consecutive inner training epochs dedicated to training the model on the data sampled in the previous phase.
• Unsupervised Domain Adaptation: In the unsupervised domain adaptation phase of our pipeline, we use the experimental setup proposed in [33]. For the UDA approach, we use the generated source data labeled by the pre-trained model together with all unlabeled target data. We use minibatch stochastic gradient descent (SGD) with a momentum of 0.9, an initial learning rate of 0.001, and a weight decay of 0.005. We follow the same learning rate schedule as in [8]; a small sketch of this setup follows below.
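For completeness, a sketch of this optimization setup; the annealing formula follows [8], whose constants (α = 10, β = 0.75) we assume carry over here.

```python
import torch

def make_optimizer(model: torch.nn.Module, lr0: float = 1e-3):
    """SGD setup from Section 5.2: momentum 0.9, weight decay 0.005."""
    return torch.optim.SGD(model.parameters(), lr=lr0,
                           momentum=0.9, weight_decay=0.005)

def annealed_lr(lr0: float, p: float,
                alpha: float = 10.0, beta: float = 0.75) -> float:
    """Schedule of [8]: lr_p = lr0 / (1 + alpha * p) ** beta,
    where p in [0, 1] is the training progress."""
    return lr0 / (1.0 + alpha * p) ** beta
```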

5.3 Results

5.3.1 Results on Office-31


We present the outcomes for the Office-31 dataset in Table 1. By employing the source data produced through our proposed pipeline along with the diffusion model-based mixup strategy, our approach matches or outperforms existing state-of-the-art SFDA approaches on all tasks, achieving an average accuracy of 93.7%, which is 0.7% higher than the current state-of-the-art.

5.3.2 Results on Office-Home


The results for the Office-Home dataset [48] are presented in Table 2. By integrating the source data generated through our proposed pipeline into the proposed UDA mixup methodology, we surpass existing methods, achieving an average accuracy of 79.5%.

5.3.3 Results on VisDA17


We summarize the results for the VisDA dataset [35] in Table 3. Our proposed framework outperforms the source-free baselines on many classes, such as plane, bus, and car, by 0.5%, 2.1%, and 7.4%, respectively. On these classes, the accuracy of our proposed method also surpasses approaches with access to source data, showing the effectiveness of our data generation and domain mixup pipeline.

6 Ablation Study
In this section, we perform an ablation study to understand the contribution of each component in the
proposed pipeline to the overall performance.

6.1 Selective Pseudo Labeling Target Data

Correct pseudo labeling of target data samples is essential to the success of the proposed approach, as it significantly influences the fine-tuning process of the diffusion model. To gauge its impact, we assess the model's performance without selective pseudo labeling, which helps establish the importance of the initial pseudo labels in guiding the subsequent fine-tuning and adaptation phases. As shown in Table 4, a significant drop in performance is observed for the proposed pipeline in the absence of selective pseudo labeling. The primary reason is that inaccuracies in pseudo-label assignment during the initial phase adversely affect all subsequent phases of the pipeline, including data generation and UDA.

| Selective Pseudo Labeling | A→D | A→W | D→A | D→W | W→A | W→D | Avg. |
|---|---|---|---|---|---|---|---|
| ✗ | 67.8 | 68.3 | 60.1 | 95.4 | 60.5 | 98.7 | 75.1 |
| ✓ | 97.7 | 99.0 | 82.7 | 99.3 | 83.5 | 100.0 | 93.7 |

Table 4: Ablation study to investigate effects of selective pseudo-labeling.


6.2 Unsupervised Domain Adaptation

To test the efficacy of our diffusion model-based unsupervised domain adaptation approach, we compare the downstream performance of our proposed approach against existing off-the-shelf UDA approaches. Since we chose a ResNet backbone as the pre-trained source model, we experiment with the current state-of-the-art ResNet-based UDA approaches. As shown in Table 5, our proposed diffusion model-based mixup approach significantly outperforms all other UDA approaches. This shows that our mixup approach generates much better intermediate domains to bridge the large domain gap.

| Method | A→D | A→W | D→A | D→W | W→A | W→D | Avg. |
|---|---|---|---|---|---|---|---|
| DM-SFDA (FixBi [33]) | 93.0 | 93.5 | 77.4 | 98.7 | 78.3 | 99.7 | 90.1 |
| DM-SFDA (CoVi [34]) | 93.6 | 94.0 | 77.0 | 99.0 | 78.0 | 99.9 | 90.2 |
| DM-SFDA (ICON [56]) | 95.2 | 92.9 | 78.6 | 99.2 | 78.2 | 100.0 | 90.7 |
| DM-SFDA | 97.7 | 99.0 | 82.7 | 99.3 | 83.5 | 100.0 | 93.7 |

Table 5: Ablation results of prevalent off-the-shelf UDA methods applied to our generated source and target images, compared to our proposed diffusion model-based mixup approach.

7 Challenges and Limitations


Computational Resources: Training and running the proposed pipeline to generate source-like images is computationally intensive and time-consuming, requiring significant computational resources. This may limit the practicality of the proposed approach for researchers or practitioners with limited access to such resources.
Scalability to Different Domains: The effectiveness of the proposed method in different application domains has yet to be fully explored. Some domains might present unique challenges that are not adequately addressed by the current model, such as highly structured domains where slight inaccuracies in generated data could lead to significant performance drops.

8 Conclusion
In this work, we presented a novel approach to the problem of SFDA that uses a text-to-image diffusion model to generate source images eliciting high-confidence predictions from the pre-trained source model. We conduct extensive experiments on multiple domain adaptation benchmarks. Compared with recent data-based domain adaptation methods, our model achieves the best or comparable performance in the absence of source data, demonstrating the efficacy of our proposed approach.

References
[1] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. URL https://proceedings.neurips.cc/paper/2006/file/b1b0432ceafb0ce714426e9114852ac7-Paper.pdf.
[2] Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, and Stéphane Lathuilière. One-shot unsupervised domain adaptation with personalized diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 698–708, 2023. doi: 10.1109/CVPRW59228.2023.00077.
[3] Róger Bermúdez-Chacón, Mathieu Salzmann, and P. Fua. Domain adaptive multibranch networks. In ICLR, 2020.
[4] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. DAFL: Data-free learning of student networks. In ICCV, 2019.
[5] Ning Ding, Yixing Xu, Yehui Tang, Chao Xu, Yunhe Wang, and Dacheng Tao. Source-free domain adaptation via distribution estimation, 2022. URL https://arxiv.org/abs/2204.11257.
[6] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NAQvF08TcyG.
[7] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation, 2014. URL https://arxiv.org/abs/1409.7495.
[8] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1180–1189. JMLR.org, 2015.
[9] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1180–1189. JMLR.org, 2015.
[10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, January 2016. ISSN 1532-4435.
[11] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. In Thirty-Sixth Conference on Neural Information Processing Systems, 2022. URL https://arxiv.org/pdf/2206.09012.pdf.
[12] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.
[13] Devin Guillory, Vaishaal Shankar, Sayna Ebrahimi, Trevor Darrell, and Ludwig Schmidt. Predicting with confidence on unseen distributions. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1114–1124, 2021. doi: 10.1109/ICCV48922.2021.00117.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. CoRR, abs/2006.11239, 2020. URL https://arxiv.org/abs/2006.11239.
[16] Lanqing Hu, Meina Kan, Shiguang Shan, and Xilin Chen. Unsupervised domain adaptation with hierarchical gradient synchronization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[17] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. URL https://proceedings.neurips.cc/paper/2006/file/a2186aa7c086b46ad4e8bf81e2a3a19b-Paper.pdf.
[18] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, March 2021. ISSN 1573-0565. doi: 10.1007/s10994-021-05946-3. URL https://doi.org/10.1007/s10994-021-05946-3.
[19] Ying Jin, Ximei Wang, Mingsheng Long, and Jianmin Wang. Less confusion more transferable: Minimum class confusion for versatile domain adaptation. CoRR, abs/1912.03699, 2019. URL http://arxiv.org/abs/1912.03699.
[20] Nazmul Karim, Niluthpol Chowdhury Mithun, Abhinav Rajvanshi, Han-pang Chiu, Supun Samarasekera, and Nazanin Rahnavard. C-SFDA: A curriculum learning aided self-training framework for efficient source free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24120–24131, June 2023.
[21] Jogendra Nath Kundu, Naveen Venkat, Rahul M, and R. Venkatesh Babu. Universal source-free domain adaptation, 2020. URL https://arxiv.org/abs/2004.04393.
[22] Vinod K Kurmi, Venkatesh K Subramanian, and Vinay P Namboodiri. Domain impression: A source data free domain adaptation method. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 615–625, 2021. doi: 10.1109/WACV48630.2021.00066.
[23] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xFormers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.
[24] Jingjing Li, Erpeng Chen, Zhengming Ding, Lei Zhu, Ke Lu, and Heng Tao Shen. Maximum density divergence for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3918–3930, 2021. doi: 10.1109/TPAMI.2020.2991050.
[25] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9638–9647, 2020. doi: 10.1109/CVPR42600.2020.00966.
[26] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. CoRR, abs/2002.08546, 2020. URL https://arxiv.org/abs/2002.08546.
[27] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 97–105, Lille, France, July 2015. PMLR. URL https://proceedings.mlr.press/v37/long15.html.
[28] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks, 2015. URL https://arxiv.org/abs/1502.02791.
[29] Mingsheng Long, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. CoRR, abs/1605.06636, 2016. URL http://arxiv.org/abs/1605.06636.
[30] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 136–144, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
[31] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Domain adaptation with randomized multilinear adversarial networks. CoRR, abs/1705.10667, 2017. URL http://arxiv.org/abs/1705.10667.
[32] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Conditional adversarial domain adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1647–1657, Red Hook, NY, USA, 2018. Curran Associates Inc.
[33] Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. FixBi: Bridging domain spaces for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1094–1103, June 2021.
[34] Jaemin Na, Dongyoon Han, Hyung Jin Chang, and Wonjun Hwang. Contrastive vicinal space for unsupervised domain adaptation. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 92–110, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19830-4.
[35] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. VisDA: The visual domain adaptation challenge, 2017.
[36] Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023.
[37] Zhen Qiu, Yifan Zhang, Hongbin Lin, Shuaicheng Niu, Yanxia Liu, Qing Du, and Mingkui Tan. Source-free domain adaptation via avatar prototype generation and adaptation. In International Joint Conference on Artificial Intelligence, 2021.
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[39] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, pages 213–226, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-15561-1.
[40] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. CoRR, abs/1712.02560, 2017. URL http://arxiv.org/abs/1712.02560.
[41] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, and Kate Saenko. Universal domain adaptation through self supervision. CoRR, abs/2002.07953, 2020. URL https://arxiv.org/abs/2002.07953.
[42] Rui Shu, Hung H. Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation, 2018. URL https://arxiv.org/abs/1802.08735.
[43] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015. URL http://arxiv.org/abs/1503.03585.
[44] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. CoRR, abs/1511.05547, 2015. URL http://arxiv.org/abs/1511.05547.
[45] Hui Tang, Ke Chen, and Kui Jia. Unsupervised domain adaptation via structurally regularized deep clustering. CoRR, abs/2003.08607, 2020. URL https://arxiv.org/abs/2003.08607.
[46] Hui Tang, Ke Chen, and Kui Jia. Unsupervised domain adaptation via structurally regularized deep clustering. CoRR, abs/2003.08607, 2020. URL https://arxiv.org/abs/2003.08607.
[47] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014. URL http://arxiv.org/abs/1412.3474.
[48] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5385–5394, Los Alamitos, CA, USA, July 2017. IEEE Computer Society. doi: 10.1109/CVPR.2017.572. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2017.572.
[49] Thomas Westfechtel, Hao-Wei Yeh, Dexuan Zhang, and Tatsuya Harada. Gradual source domain expansion for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1946–1955, January 2024.
[50] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Unsupervised domain adaptation: An adaptive feature norm approach. CoRR, abs/1811.07456, 2018. URL http://arxiv.org/abs/1811.07456.
[51] Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. CoRR, abs/2110.04202, 2021. URL https://arxiv.org/abs/2110.04202.
[52] Shiqi Yang, Yaxing Wang, Kai Wang, Shangling Jui, and Joost van de Weijer. Attracting and dispersing: A simple approach for source-free domain adaptation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5802–5815. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/26300457961c3e056ea61c9d3ebec2a4-Paper-Conference.pdf.
[53] Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, Shangling Jui, and Jian Yang. Trust your good friends: Source-free domain adaptation by reciprocal neighborhood clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15883–15895, 2023. doi: 10.1109/TPAMI.2023.3310791.
[54] Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8712–8721, 2020. doi: 10.1109/CVPR42600.2020.00874.
[55] Zhiqi Yu, Jingjing Li, Zhekai Du, Lei Zhu, and Heng Tao Shen. A comprehensive survey on source-free domain adaptation, 2023.
[56] Zhongqi Yue, Qianru Sun, and Hanwang Zhang. Make the U in UDA matter: Invariant consistency learning for unsupervised domain adaptation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=4hYIxI8ds0.
[57] Wenyu Zhang, Li Shen, and Chuan-Sheng Foo. Rethinking the role of pre-trained networks in source-free domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 18841–18851, October 2023.

