MmAP : Multi-modal Alignment Prompt for Cross-domain Multi-task Learning

Yi Xin1,2*, Junlong Du2*, Qiang Wang2, Ke Yan2†, Shouhong Ding2

1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2 Youtu Lab, Tencent
[email protected], [email protected], [email protected],
[email protected], [email protected]
arXiv:2312.08636v1 [cs.CV] 14 Dec 2023

Abstract

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns the text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately 0.09% of the trainable parameters.

Figure 1: The trade-off between average accuracy over four tasks on the Office-Home (Venkateswara et al. 2017) dataset and the number of trainable parameters. The radius of each circle represents the relative amount of trainable parameters. (The plot compares MmAP-MT (Ours), MaPLe-MT, CoOp-MT, MT Full Fine-Tune, CLIP-Adapter, VPT-MT, and BitFit; the y-axis is the average accuracy (%) of the four tasks and the x-axis is the number of trainable parameters on a logarithmic scale from 10^4 to 10^9.)

Introduction

Multi-Task Learning (MTL) has surfaced as a potent approach in deep learning that allows for joint training of multiple correlated tasks within a unified network architecture, resulting in enhanced model performance in comparison to Single-Task Learning (STL). The core of MTL lies in learning both the task-shared and the task-specific representations. By capitalizing on shared representations and knowledge across tasks, MTL enhances generalization and mitigates overfitting. Utilizing specific representations allows MTL to preserve the distinct characteristics of each task. Moreover, training a unified model for multiple tasks is generally more parameter-efficient than training several single-task models. Consequently, MTL has garnered considerable interest in various fields, including Computer Vision (Shen et al. 2021; Ye and Xu 2023), Natural Language Processing (He et al. 2022), etc.

In this work, we mainly focus on vision multi-task learning. Prior research has predominantly concentrated on the design of multi-task model training frameworks, encompassing encoder-based methods (Gao et al. 2019) and decoder-based methods (Xu, Yang, and Zhang 2023). However, with the growing prowess of vision pre-trained models (e.g., ViT (Dosovitskiy et al. 2021), SwinTransformer (Liu et al. 2021)), directly fine-tuning these models for downstream multi-task learning leads to substantial performance enhancements and has become the mainstream approach for multi-task learning (Liu et al. 2022).

* These authors contributed equally.
† Corresponding author.
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 2: Illustrations of (a) text prompt tuning (Zhou et al. 2022), (b) visual prompt tuning (Jia et al. 2022b), (c) multi-modal prompts learning (Khattak et al. 2023) and (d) our multi-modal alignment prompt tuning. In the figure, flame icons mark trainable parameters and snowflake icons mark frozen parameters; ⊗ represents the Kronecker Product and [class] represents the category name.

In this fine-tuning paradigm, it remains necessary to establish a distinct decoder for each task, with trainable parameters that increase linearly.

To address the above issue, we incorporate the pre-trained vision-language model CLIP (Radford et al. 2021) and consider it tailor-made for vision multi-task learning. On one hand, CLIP is trained to align the language and vision modalities using web-scale data (e.g., 400 million text-image pairs), endowing it with a robust capability for zero-shot transfer to vision downstream tasks. On the other hand, the architecture of CLIP offers a distinct advantage: it comprises a text encoder and an image encoder, eliminating the need to establish additional decoder structures for each task. Therefore, we opt for adapting CLIP to address vision multi-task learning.

Following the conventional pretrain-finetune paradigm, the entire CLIP parameters (∼150M) would require updating, which presents challenges concerning computational and storage expenses. Recently, numerous studies (Zaken, Goldberg, and Ravfogel 2022; Jia et al. 2022b; Gao et al. 2021; Zhou et al. 2022) have introduced parameter-efficient transfer learning techniques to achieve an optimal balance between trainable parameters and performance on downstream tasks. Nonetheless, these existing methods primarily concentrate on pre-trained vision models or language models, and their applicability to more complex vision-language models remains uncertain. Moreover, these approaches tend to emphasize single-task adaptation, while multi-task adaptation continues to pose a challenge.

We first conduct a thorough examination of the performance of existing successful parameter-efficient transfer learning methods when applied to CLIP for vision multi-task learning, as shown in Figure 1. Through our extensive studies, we discover that the prompt tuning methods VPT-MT (Jia et al. 2022b), CoOp-MT (Zhou et al. 2022) and MaPLe-MT (Khattak et al. 2023) are more suitable than BitFit (Zaken, Goldberg, and Ravfogel 2022) and Adapter (Gao et al. 2021). This may be attributed to the fact that BitFit and Adapter update model parameters and disrupt the original structural integrity of CLIP, whereas prompt tuning methods only modify the input embeddings (text or image), as shown in Figure 2. Moreover, we observe that MaPLe-MT outperforms VPT-MT and CoOp-MT, emphasizing the advantages of tuning both modalities simultaneously.

Subsequently, based on these observations, we propose a novel Multi-modal Alignment Prompt (MmAP) for CLIP, along with a framework tailored for multi-task image recognition scenarios. Our MmAP generates text prompts and visual prompts through a source prompt to achieve a tuning alignment effect for both modalities. Additionally, we design a multi-task prompt tuning framework based on MmAP. Previous MTL works (Fifty et al. 2021; Standley et al. 2020) have confirmed that training similar tasks together yields a complementary effect, while training dissimilar tasks together results in a negative effect. Therefore, we first employ gradient similarity to group tasks and then assign a group-shared MmAP to each group for joint training. Furthermore, to maintain the independent characteristics of each task, we establish a task-specific MmAP for each task individually. We evaluate our method on two large cross-domain multi-task datasets, Office-Home and MiniDomainNet. Figure 1 displays the results on Office-Home, illustrating that our proposed method achieves a favorable trade-off between trainable parameters and performance.

Our main contributions are as follows:
• We propose Multi-modal Alignment Prompt (MmAP) for CLIP to favourably align its vision-language representations during parameter-efficient tuning.
• Building upon MmAP, we design a multi-task prompt learning framework for cross-domain image recognition tasks, incorporating both group-shared MmAP and task-specific MmAP.
• We devise a unified library grounded in CLIP to benchmark various parameter-efficient tuning methods for multi-task image recognition. To the best of our knowledge, we are the first to undertake this work.
• Experimental results on two commonly used visual multi-task datasets show that our method achieves competitive performance compared to multi-task full fine-tuning while leveraging merely ∼0.09% of the CLIP parameters, as shown in Figure 1.

Related Work

Multi-Task Learning. Multi-Task Learning (MTL) aims to simultaneously learn multiple tasks by sharing knowledge and computation. There are two classic multi-task settings in the field of computer vision. The first is dense scene understanding multi-task, which performs semantic segmentation, surface normal estimation, saliency detection, etc. for each input sample.
Current research on multi-task dense scene understanding primarily focuses on decoder structure design (Zhang et al. 2021; Xu, Yang, and Zhang 2023; Liang et al. 2023). The other is cross-domain classification multi-task, where the input data consists of multiple datasets with domain shifts. As multiple domains are involved, current research emphasizes learning shared and private information between domains (Shen et al. 2021; Long et al. 2017).

Vision-Language Model. Foundational vision-language models (e.g., CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021)) have exhibited remarkable capabilities in various vision tasks. In contrast to models learned with only image supervision, these V-L models encode rich multi-modal representations. Although these pre-trained V-L models learn rich representations, efficiently adapting them to downstream vision tasks remains a challenging problem. Numerous works have demonstrated improved performance on downstream vision tasks by employing tailored methods to adapt V-L models for detection (Li et al. 2022; Zhong et al. 2022), segmentation (Rao et al. 2022; Xu et al. 2022), and recognition (Wortsman et al. 2022). Furthermore, HiPro (Liu et al. 2023) constructs a hierarchical structure to adapt a pre-trained V-L model to various downstream tasks.

Parameter-Efficient Transfer Learning. Parameter-Efficient Transfer Learning (PETL) aims to adapt a pre-trained model to new downstream tasks by training only a small number of parameters. Existing PETL methods can be categorized into three groups: parameter tuning, adapter tuning, and prompt tuning. Parameter tuning directly modifies the parameters of a pre-trained model, either by tuning the weights (Hu et al. 2022) or the biases (Zaken, Goldberg, and Ravfogel 2022). Adapter tuning inserts trainable bottleneck architectures into a frozen pre-trained model, intending to facilitate learning for downstream tasks, such as AdaptFormer (Chen et al. 2022), VL-Adapter (Sung, Cho, and Bansal 2022), and CLIP-Adapter (Gao et al. 2021). Prompt tuning unifies all downstream tasks into pre-trained tasks via designing a specific template to fully exploit the capabilities of foundation models (Jia et al. 2022a; Khattak et al. 2023; Wang et al. 2023).

Method

In this section, we first revisit vision-language models with a focus on CLIP. Subsequently, we introduce our proposed Multi-modal Alignment Prompt (MmAP). Finally, we propose a unified prompt learning framework that incorporates both group-shared MmAP and task-specific MmAP.

Contrastive Language-Image Pre-training

Image Encoder. In this work, we opt for ViT (Dosovitskiy et al. 2021) as the image encoder to be compatible with the visual prompt (Jia et al. 2022b). Given an input image I ∈ R^{H×W×3}, the image encoder, consisting of K transformer layers, splits the image into M fixed-size patches and projects them into patch embeddings E_0 ∈ R^{M×d_v}. The patch embeddings E_k, accompanied by a learnable class token c_k, are fed into the (k+1)-th layer V_{k+1} of the image encoder and sequentially processed through the following transformer layers:

[c_{k+1}, E_{k+1}] = V_{k+1}([c_k, E_k]),  k = 0, 1, ..., K − 1.  (1)

To acquire the ultimate image representation x, the class token c_K from the last transformer layer is projected into the V-L latent embedding space via ImageProj:

x = ImageProj(c_K).  (2)

Text Encoder. The text encoder adopts a transformer that contains K layers; it tokenizes the input words and projects them into word embeddings W_0 ∈ R^{N×d_l}. The W_k are directly fed into the (k+1)-th layer L_{k+1} of the text encoder:

[W_{k+1}] = L_{k+1}(W_k),  k = 0, 1, ..., K − 1.  (3)

The final text representation z is obtained by projecting the text embeddings associated with the last token of the concluding transformer layer L_K into the V-L latent embedding space via TextProj:

z = TextProj(W_K).  (4)

Zero-Shot Prediction. For zero-shot prediction, a carefully designed prompt is introduced into the language branch of CLIP, which serves to reconstruct the textual input by equipping it with every class name associated with the downstream tasks (e.g., "a photo of a [CLASS]"). The class with the highest cosine similarity score is then selected as the predicted label ŷ for the given image, namely:

p(ŷ|x) = exp(sim(x, z_ŷ)/τ) / Σ_{i=1}^{C} exp(sim(x, z_i)/τ),  (5)

where sim(·, ·) represents the computation of cosine similarity, τ is the temperature coefficient learned by CLIP, and C is the total number of classes.
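As a concrete illustration of the zero-shot prediction in Eq. (5), the following minimal PyTorch sketch computes class probabilities from pre-extracted CLIP features; the tensor shapes, the temperature value, and the function name are illustrative assumptions rather than part of any released CLIP API.

```python
import torch
import torch.nn.functional as F

def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """Eq. (5): softmax over cosine similarities between one image feature x
    and the C class text features z_1..z_C, scaled by 1/tau."""
    x = F.normalize(image_feat, dim=-1)        # cosine similarity = dot product
    z = F.normalize(text_feats, dim=-1)        # of L2-normalized features
    logits = (z @ x) / temperature             # shape (C,)
    return logits.softmax(dim=-1)

# Example with random stand-in features: C = 65 classes (Office-Home), d = 512.
probs = zero_shot_probs(torch.randn(512), torch.randn(65, 512))
pred = probs.argmax().item()                   # predicted label ŷ
```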
Figure 3: Multi-Task Prompt Learning Framework, including (a) grouping tasks by maximizing the complementarity of tasks with high similarity and (b) employing group-shared and task-specific MmAP for adapting CLIP to downstream tasks.

Multi-modal Alignment Prompt

Prior research has mainly focused on designing prompts for a single modality. For example, VPT investigated visual prompts, while CoOp introduced learnable text prompts. We believe that merely tuning one modality disrupts the text-image matching of CLIP, leading to sub-optimal adaptation for downstream tasks. The most recent concurrent method, MaPLe, proposed to use the text prompt to generate the visual prompt via an MLP with considerable parameters, which exhibits limitations regarding the visual modality and model efficiency.

To address these issues, we propose Multi-modal Alignment Prompt (MmAP) to generate the text prompt P_l ∈ R^{b×d_l} and the image prompt P_v ∈ R^{b×d_v} simultaneously, as depicted in Figure 2d. Here, we denote b as the length of the prompts, while d_l and d_v indicate the dimensions of the text and image tokens, respectively. We first initialize a source prompt P_s ∈ R^{m×n} and two individual scaling matrices M_l ∈ R^{(b/m)×(d_l/n)} and M_v ∈ R^{(b/m)×(d_v/n)} for the two modalities. Then, we apply the Kronecker Product to generate prompts for the text and image encoders as follows:

P_l = M_l ⊗ P_s = [ M_l11 P_s ··· M_l1n P_s ; ⋮ ⋱ ⋮ ; M_lm1 P_s ··· M_lmn P_s ],  (6)

 
P_v = M_v ⊗ P_s = [ M_v11 P_s ··· M_v1n P_s ; ⋮ ⋱ ⋮ ; M_vm1 P_s ··· M_vmn P_s ].  (7)

Our proposed MmAP offers two significant advantages. Firstly, the use of the Kronecker Product ensures maximum preservation of the information of the source prompt P_s. This facilitates alignment between the text and image prompts. Secondly, the number of learnable parameters is significantly reduced from K(d_l + d_v) to mn + K(d_l + d_v)/(mn), where K represents the number of transformer layers. This reduction in parameters not only makes the model more efficient but also reduces the risk of overfitting.
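A minimal sketch of the factorization in Eqs. (6)-(7) and of the resulting parameter saving, assuming illustrative sizes b = 4 and m = n = 2 together with the ViT-B/16 token dimensions d_l = 512 and d_v = 768 mentioned later in the ablation study; the class name and the initialization of the scaling matrices are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MmAPSketch(nn.Module):
    """One layer's prompts: a shared source prompt P_s plus two scaling matrices
    M_l, M_v, combined by Kronecker products as in Eqs. (6)-(7)."""
    def __init__(self, b=4, m=2, n=2, d_l=512, d_v=768):
        super().__init__()
        assert b % m == 0 and d_l % n == 0 and d_v % n == 0
        self.P_s = nn.Parameter(torch.randn(m, n) * 0.02)       # source prompt
        self.M_l = nn.Parameter(torch.randn(b // m, d_l // n))  # text scaling matrix
        self.M_v = nn.Parameter(torch.randn(b // m, d_v // n))  # visual scaling matrix

    def forward(self):
        P_l = torch.kron(self.M_l, self.P_s)   # (b, d_l) text prompt, Eq. (6)
        P_v = torch.kron(self.M_v, self.P_s)   # (b, d_v) image prompt, Eq. (7)
        return P_l, P_v

prompt = MmAPSketch()
P_l, P_v = prompt()
print(P_l.shape, P_v.shape)                    # torch.Size([4, 512]) torch.Size([4, 768])

# Per-layer parameter comparison: directly learned prompts vs. the factorization.
direct = 4 * 512 + 4 * 768                     # P_l and P_v learned directly
factored = sum(p.numel() for p in prompt.parameters())
print(direct, factored)                        # 5120 vs. 4 + 512 + 768 = 1284
```

With these toy sizes the factorized prompts use roughly a quarter of the parameters of directly learned prompts, and the saving grows with the prompt length and token dimensions.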
Multi-Task Prompt Learning Framework

In multi-task learning, the joint training of similar tasks can yield mutually beneficial outcomes. Typically, the degree of task similarity can be quantified by evaluating the gradient conflict between tasks. In light of this, we first group similar tasks together. A shared MmAP is assigned to each group, which facilitates the mutual learning and enhancement of tasks within the group. However, to maintain the unique characteristics of each task, we also assign an individual MmAP to each task. This individual MmAP ensures that the distinct features and requirements of each task are adequately catered to. The overall multi-task prompt tuning framework is depicted in Figure 3.

Task Grouping. Existing MTL works (Fifty et al. 2021) have demonstrated that gradient cosine similarity can quantify the similarity of two tasks, i.e., the extent to which two tasks can benefit from joint training. Therefore, we assess the similarity of two tasks by computing gradients on the shared parameters, while keeping the pretrained vision-language model frozen, as shown in Figure 3a.

Specifically, given a global shared MmAP P_glb for all tasks, the similarity between the i-th task and the j-th task can be estimated as the following dot product:

sim(T_i, T_j) = ∇_{P_glb} L_{T_i}(P_glb) · ∇_{P_glb} L_{T_j}(P_glb),  (8)

where L_T denotes the loss on task T. We posit that when sim(T_i, T_j) > 0, the two tasks exhibit a mutual gain effect. Moreover, for robust estimation, we average multiple "snapshots" of similarity during the training of the global shared MmAP. At a high level, we concurrently train all tasks, evaluate pairwise task similarity throughout the training process, and identify task groups that maximize the total inter-task similarity.
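To make the grouping signal of Eq. (8) concrete, the sketch below computes per-task gradients on a globally shared prompt tensor and scores task pairs by their dot product; the toy loss functions and helper names are illustrative assumptions, and the actual method averages such snapshots over training before forming groups.

```python
import itertools
import torch

def task_pair_similarity(shared_prompt, task_loss_fns):
    """Estimate sim(T_i, T_j) of Eq. (8): the dot product of the per-task loss
    gradients w.r.t. a globally shared prompt, with the backbone kept frozen."""
    grads = {}
    for name, loss_fn in task_loss_fns.items():
        loss = loss_fn(shared_prompt)                    # forward pass on one task batch
        grad, = torch.autograd.grad(loss, shared_prompt)
        grads[name] = grad.flatten()
    return {(i, j): torch.dot(grads[i], grads[j]).item()
            for i, j in itertools.combinations(grads, 2)}

# Toy usage: quadratic "losses" stand in for the per-task objectives on CLIP.
P_glb = torch.nn.Parameter(torch.randn(4, 4))
losses = {name: (lambda P, c=c: ((P - c) ** 2).sum())
          for name, c in [("Art", 0.1), ("Clipart", 2.0), ("Product", 0.2), ("Real", 0.15)]}
print(task_pair_similarity(P_glb, losses))
# In the method, such snapshots are averaged over training and tasks are grouped
# to maximize the total positive pairwise similarity within each group.
```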
Multi-Task Prompt Learning. We develop a unified multi-task prompt learning framework upon our proposed MmAP, as depicted in Figure 3b. Given N downstream tasks {T_i}_{i=1}^{N}, we first partition them into several disjoint groups according to gradient similarities. For brevity, we denote G as a task group that consists of |G| tasks (1 ≤ |G| ≤ N). Then we construct group-shared MmAP for CLIP, which contains K transformer layers, including source prompts P_G = {P^k_G}_{k=1}^{K} and scaling matrices M_Gl = {M^k_Gl}_{k=1}^{K} and M_Gv = {M^k_Gv}_{k=1}^{K} for the language and vision branches, respectively. The group-shared MmAP is cumulatively updated by all tasks within group G, achieving complementary benefits across similar tasks. Additionally, for every task in group G, we build task-specific MmAP for learning unique task characteristics, including source prompts P_T = {P^k_T}_{k=1}^{K} and scaling matrices M_Tl = {M^k_Tl}_{k=1}^{K} and M_Tv = {M^k_Tv}_{k=1}^{K} for the language and vision branches.

During the training of one task T in group G, we first generate the text and image prompts of the k-th layers in the two encoders, and then we reconstruct the input tokens by composing the class token, the generated prompts and the text/image tokens from the previous layer. Thereby the calculations of the k-th layers within the text and image encoders can be formally represented as:

[ _, _, W_k ] = L_k([ P^k_Gl, P^k_Tl, W_{k−1} ]) = L_k([ P^k_G ⊗ M^k_Gl, P^k_T ⊗ M^k_Tl, W_{k−1} ]),  (9)

[ c_k, _, _, E_k ] = V_k([ c_{k−1}, P^k_Gv, P^k_Tv, E_{k−1} ]) = V_k([ c_{k−1}, P^k_G ⊗ M^k_Gv, P^k_T ⊗ M^k_Tv, E_{k−1} ]).  (10)

Here [·, ·] refers to the concatenation operation.
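The sketch below illustrates the token composition of Eqs. (9)-(10) for a single layer: the group-shared and task-specific prompts are prepended to the frozen encoders' token sequences. The tensor shapes and the toy prompt generator are assumptions for illustration; in the method the prompts would come from the Kronecker products of Eqs. (6)-(7).

```python
import torch

def toy_prompt(b=4, d_l=512, d_v=768):
    """Stand-in for an MmAP module: returns a callable producing (P_l, P_v).
    In the method these would be Kronecker-generated prompts, Eqs. (6)-(7)."""
    return lambda: (torch.randn(b, d_l), torch.randn(b, d_v))

def compose_layer_inputs(cls_tok, img_toks, word_toks, group_prompt, task_prompt):
    """Prepend group-shared and task-specific prompts to the frozen encoders'
    token sequences, mirroring Eqs. (9)-(10) for one transformer layer."""
    Pg_l, Pg_v = group_prompt()
    Pt_l, Pt_v = task_prompt()
    text_in = torch.cat([Pg_l, Pt_l, word_toks], dim=0)            # (2b + N, d_l)
    image_in = torch.cat([cls_tok, Pg_v, Pt_v, img_toks], dim=0)   # (1 + 2b + M, d_v)
    return text_in, image_in

# Toy shapes: b = 4 prompt tokens, N = 8 word tokens, M = 196 image patches.
text_in, image_in = compose_layer_inputs(
    cls_tok=torch.randn(1, 768), img_toks=torch.randn(196, 768),
    word_toks=torch.randn(8, 512),
    group_prompt=toy_prompt(), task_prompt=toy_prompt())
print(text_in.shape, image_in.shape)   # torch.Size([16, 512]) torch.Size([205, 768])
```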
Finally, the group-shared MmAP is cumulatively updated by optimizing the following loss:

L(P_G, M_Gl, M_Gv) = Σ_{T∈G} L_T(P_G, M_Gl, M_Gv),  (11)

and the task-specific MmAP is trained via:

L(P_T, M_Tl, M_Tv) = L_T(P_T, M_Tl, M_Tv),  (12)

where L_T is the cross-entropy loss of task T.
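A minimal training-step sketch consistent with Eqs. (11)-(12): each task's cross-entropy loss backpropagates into that task's task-specific MmAP, while the group-shared MmAP accumulates gradients from every task in the group. The forward function and the parameter-collection helper are hypothetical placeholders; only the SGD learning rate of 0.0035 is taken from the reported implementation details.

```python
import torch
import torch.nn.functional as F

def multi_task_step(optimizer, task_batches, forward_fn):
    """One optimization step over a task group.

    Each task's cross-entropy loss (Eq. 12) backpropagates into that task's
    task-specific MmAP; because every loss also depends on the group-shared
    MmAP, its gradient accumulates over all tasks in the group (Eq. 11)."""
    optimizer.zero_grad()
    for task, (images, labels) in task_batches.items():
        logits = forward_fn(task, images)   # frozen CLIP with composed prompts, Eqs. (9)-(10)
        loss = F.cross_entropy(logits, labels)
        loss.backward()                     # gradients accumulate on the shared prompts
    optimizer.step()

# Illustrative setup (hypothetical helper): `collect_prompt_parameters()` would
# gather the group-shared and task-specific MmAP tensors; SGD with lr = 0.0035
# follows the reported setting.
# optimizer = torch.optim.SGD(collect_prompt_parameters(), lr=0.0035)
```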
Experiment

Benchmark Setting

Datasets. Following prior MTL works (Shen et al. 2021; Long et al. 2017), we consider the Office-Home (Venkateswara et al. 2017) and MiniDomainNet (Zhou et al. 2021) datasets to construct our benchmark.
• Office-Home contains images from four tasks: Art, Clipart, Product and Real World. Each task covers images from 65 object categories collected under office and home settings. There are about 15,500 images in total.
• MiniDomainNet takes a subset of DomainNet, which is an extremely challenging dataset for multi-task learning. MiniDomainNet has 140,000 images distributed among 126 categories. It contains four different tasks: Clipart, Painting, Sketch and Real.

Based on previous research w.r.t. MTL and prompt learning (Shen et al. 2021; Zhou et al. 2022), we randomly select 10% (6-shot per class) and 20% (12-shot per class) of the samples from Office-Home for training, and 1% (3-shot per class) and 2% (6-shot per class) of the samples from MiniDomainNet for training. The remaining samples are reserved for testing.

Baselines. In order to conduct a comprehensive evaluation of our proposed method, we compare it against several tuning baselines, including:
• Zero-shot uses hand-crafted text prompt ("a photo of [class]") templates for zero-shot prediction.
• Single-Task Full Fine-Tuning updates an individual pretrained model for each task, and Multi-Task Full Fine-Tuning updates a shared pretrained model for all tasks.
• Single-modal prompt tuning methods, including the standard CoOp (Zhou et al. 2022), trained on an individual task for text prompt tuning, and the standard VPT (Jia et al. 2022b), trained on an individual task for visual prompt tuning. CoOp-MT and VPT-MT are the multi-task versions, which train a task-shared prompt with samples from all tasks. Additionally, the recent work MaPLe (Khattak et al. 2023) serves as one of our baselines, which employs text prompts to generate visual prompts. Similarly, we also construct a multi-task version, referred to as MaPLe-MT.
• Other parameter-efficient tuning methods, including CLIP-Adapter (Gao et al. 2021), which learns new features on either the visual or the language branch, and BitFit (Zaken, Goldberg, and Ravfogel 2022), which tunes the bias parameters of the pre-trained model.

Implementation Details. All experiments are conducted using the PyTorch toolkit on NVIDIA V100 GPUs, with CLIP (ViT-B/16) chosen as our default model. To ensure a fair comparison, we maintain consistent hyperparameter settings across all parameter-efficient tuning methods. Specifically, we use a batch size of 16/4 and train for 5 epochs on Office-Home/MiniDomainNet. We employ the SGD optimizer with a learning rate of 0.0035. We evaluate the checkpoint of the last epoch and run the experiments with three different seeds to obtain the average results. For the source prompt P_s, we initialize it with a Gaussian distribution of 0.02 standard deviation.

Experiment Results

Office-Home. The results are presented in Table 1. Firstly, we observe that our method is on par with Multi-Task Full Fine-Tuning across different data splits (10% or 20%) while requiring only 0.09% (0.13M vs. 149.62M) of the trainable parameters. This represents a significant breakthrough in parameter-efficient tuning of CLIP for multi-task image recognition. Secondly, our method consistently outperforms other parameter-efficient tuning methods. In comparison to prompt methods (i.e., MaPLe-MT, CoOp-MT, and VPT-MT), our method exhibits a significant improvement, highlighting the necessity of integrating visual and text modalities when tuning CLIP and combining the group-shared and the task-specific knowledge.
Columns: Method | Single Task Learning — ZeroShot, Full FT, CoOp, VPT, MaPLe | Multi Task Learning — Full FT, C-Adapter, BitFit, CoOp-MT, VPT-MT, MaPLe-MT, Ours

10% split:
Art 82.9 84.9±0.9 84.2±0.3 83.7±0.5 84.4±0.5 85.8±0.8 82.3±0.2 79.1±0.1 84.3±0.8 84.0±0.3 84.8±0.6 85.7±0.5
Clipart 68.3 75.4±0.3 72.6±1.1 70.5±0.3 72.8±0.6 76.3±1.0 71.7±0.2 67.8±1.4 73.0±0.1 72.4±0.6 73.3±0.3 76.3±0.6
Product 89.3 91.6±0.3 92.4±0.2 90.9±0.2 92.2±0.1 92.1±1.3 90.8±0.2 86.7±1.5 92.7±0.2 91.7±0.6 92.7±0.4 92.9±0.5
Real 90.1 89.8±0.8 90.5±0.3 89.2±0.6 90.4±0.4 90.2±1.3 89.2±0.2 85.9±0.5 90.7±0.6 90.6±0.1 90.8±0.3 90.9±0.9
Average 82.6 85.4±0.6 84.9±0.5 83.6±0.4 85.0±0.4 86.1±1.1 83.5±0.2 79.9±0.9 85.2±0.4 84.7±0.4 85.4±0.4 86.5±0.6

20% split:
Art 84.6 87.1±1.2 85.6±0.6 85.4±0.6 85.9±0.4 87.4±0.8 83.2±1.1 81.7±0.6 86.0±0.2 85.9±0.3 86.3±0.4 88.2±0.7
Clipart 68.2 77.9±0.1 74.5±0.6 71.4±0.4 74.2±0.5 78.8±1.0 75.4±0.1 69.6±1.7 73.9±0.6 72.3±0.6 74.2±0.6 77.1±0.6
Product 89.5 91.9±1.3 93.0±0.4 91.5±0.6 92.8±0.6 93.0±1.0 91.7±0.9 87.2±1.4 92.9±0.4 92.1±0.2 92.9±0.2 93.5±0.5
Real 90.7 89.8±0.6 91.8±0.3 90.9±0.1 91.8±0.4 91.9±0.4 90.6±0.2 86.7±1.0 92.0±0.3 91.7±0.3 92.0±0.5 92.4±0.3
Average 83.3 86.7±0.8 86.2±0.5 84.8±0.4 86.2±0.5 87.8±0.8 85.2±0.5 85.5±0.4 86.3±0.4 85.5±0.4 86.4±0.4 87.8±0.5

Parameters - 598.48 M 0.04 M 0.68 M 19.2 M 149.62 M 0.53 M 0.17 M 0.01 M 0.17 M 4.8 M 0.13 M

Table 1: Comparison to various methods on Office-Home, using the average accuracy (%) over 3 different seeds. The benchmark
is implemented by us, based on CLIP with ViT-B/16 backbone. We highlight the best and the second results.

Columns: Method | Single Task Learning — ZeroShot, Full FT, CoOp, VPT, MaPLe | Multi Task Learning — Full FT, C-Adapter, BitFit, CoOp-MT, VPT-MT, MaPLe-MT, Ours

1% split:
Clipart 82.6 82.1±1.5 82.7±0.1 82.3±0.1 82.9±0.2 82.8±0.9 82.6±0.1 78.9±0.4 83.4±0.4 83.0±0.3 83.4±0.4 83.9±0.3
Painting 82.3 81.8±0.7 81.8±0.3 81.7±0.3 82.0±0.2 81.5±0.6 80.4±0.3 74.7±0.4 82.3±0.2 81.9±0.6 82.5±0.4 83.5±0.2
Real 91.2 89.1±0.5 91.9±0.3 91.6±0.2 92.0±0.3 89.1±0.6 90.9±0.2 84.2±0.3 91.3±0.1 90.1±0.2 91.4±0.1 92.2±0.2
Sketch 79.9 77.0±0.7 77.1±0.2 78.5±0.3 78.5±0.4 77.2±1.0 78.3±0.6 72.4±0.5 79.2±0.2 78.6±0.6 79.1±0.2 79.8±0.7
Average 84.0 82.5±0.9 83.4±0.2 83.5±0.2 83.9±0.3 82.7±0.8 83.0±0.3 77.6±0.4 84.0±0.3 83.4±0.4 84.1±0.3 84.9±0.4

2% split:
Clipart 82.6 82.2±1.3 83.8±0.1 83.5±0.3 83.8±0.4 82.8±0.9 83.1±0.1 81.5±0.2 84.7±0.3 83.8±0.5 84.5±0.3 85.7±0.4
Painting 82.3 82.1±1.4 82.5±0.2 82.4±0.1 82.7±0.2 82.1±0.7 81.5±0.3 76.8±0.2 83.2±0.2 82.2±0.1 83.6±0.4 85.0±0.2
Real 91.2 89.2±0.8 91.9±0.1 91.5±0.1 91.6±0.2 89.3±0.5 90.6±0.2 85.9±0.1 91.7±0.1 90.5±0.1 91.9±0.2 92.3±0.1
Sketch 80.0 77.4±0.5 79.0±0.5 79.6±0.2 79.9±0.3 77.7±1.0 78.7±0.6 74.8±0.2 80.1±0.4 79.0±0.2 80.5±0.3 81.5±0.2
Average 84.0 82.7±1.0 84.3±0.2 84.2±0.2 84.5±0.3 83.0±0.8 83.4±0.3 79.8±0.2 84.9±0.4 83.9±0.3 85.1±0.3 86.1±0.2

Parameters - 598.48 M 0.04 M 0.68 M 19.2 M 149.62 M 0.53 M 0.17 M 0.01 M 0.17 M 4.8 M 0.13 M

Table 2: Comparison to various methods on MiniDomainNet, using the average accuracy (%) over 3 different seeds. The
benchmark is implemented by us, based on CLIP with ViT-B/16 backbone. We highlight the best and the second results.

Regarding the number of trainable parameters, our method ranks second only to CoOp-MT, achieving the best trade-off between accuracy and trainable parameters. Thirdly, we also find that prompt methods outperform CLIP-Adapter and BitFit, indicating that aligning downstream data with CLIP is a more efficient approach.

MiniDomainNet. The results are shown in Table 2. We can draw conclusions consistent with Office-Home. Our method performs the best and achieves 84.9% on the 1% split and 86.1% on the 2% split. However, we observe that the performance of Full Fine-Tuning is not very satisfactory and is worse than most parameter-efficient tuning methods, which is caused by overfitting. Specifically, the task difficulty of MiniDomainNet is significantly increased compared to Office-Home, and concurrently, the amount of training data is limited. Moreover, the BitFit method exhibits the worst performance: it updates only a few parameters of CLIP using a small amount of data, which severely impairs the original zero-shot capability of CLIP.

The effects of CoOp-MT, VPT-MT, and MaPLe-MT can only approach zero-shot on the 1% split, but when the amount of training data reaches 2%, CoOp-MT and MaPLe-MT surpass zero-shot by 0.9% and 1.1%, respectively. Therefore, to explore the performance under different training data sizes, we set up related experiments, as detailed in the ablation study.

Ablation Study

In this section, we construct various ablation experiments to further analyze our proposed MmAP and multi-task prompt learning framework. At the same time, we also design related experiments for different downstream data sizes.

Figure 4: Ablation study of MmAP on the Office-Home and MiniDomainNet datasets. We construct two baselines: (a) jointly training the text and visual prompts, and (b) utilizing two MLP layers to generate the text and visual prompts. (Results shown in the figure — panel (c) Office-Home (10%): Full Fine-Tuning 86.09, MmAP (Ours) 86.43, Joint Train 85.48, MLP 85.89, Zero-shot 82.69; panel (d) MiniDomainNet (2%): Full Fine-Tuning 83.00 (149.62M), MmAP (Ours) 86.13 (0.13M), Joint Train 85.51 (0.24M), MLP 85.63 (3.96M), Zero-shot 84.09.)
Figure 5: Main results on Office-Home (four tasks) under the k-shot setting. We report the accuracy (%) for 1/3/6/12 shots.
Overall, our method attains substantial improvements over zero-shot CLIP and performs favorably against other baselines.

Effectiveness of MmAP. To verify the effectiveness of our proposed MmAP, we set up related ablation experiments. As displayed in Figure 4a, a straightforward approach for multi-modal prompts is to tune the text and visual prompts jointly. Another straightforward solution involves sharing the text and visual prompts. However, since the dimensions of the Text Encoder (d_l = 512) and Image Encoder (d_v = 768) of CLIP are not equal, the prompts cannot be shared directly. Therefore, we design the MLP prompt baseline as another comparison scheme, which employs two MLP layers to generate the text and visual prompts, as shown in Figure 4b.

The results are shown in Figure 4c. Across the four tasks of Office-Home, the MLP baseline exhibits a 0.5% improvement compared to the joint-train baseline, demonstrating the effectiveness of establishing a connection between the text prompt and the visual prompt. Additionally, we observe that MmAP achieves a 0.54% improvement compared to the MLP baseline, indicating that the MmAP method is more effective in maximizing information sharing between the text and visual prompts through the Kronecker Product. At the same time, the trainable parameters of MmAP are greatly reduced relative to the MLP baseline (0.13M vs. 3.96M).
Task-Specific | Group-Shared      | Office-Home 10% | Office-Home 20%
✓             | –                 | 85.76           | 86.97
–             | Task Group        | 86.05           | 87.29
✓             | Random            | 85.80           | 86.92
✓             | Task Group        | 86.48           | 87.77
✓             | All in one group  | 86.09           | 87.36

Table 3: Ablation study of the Multi-Task Prompt Learning Framework. "Random" means grouping tasks randomly.

Effectiveness of Multi-Task Prompt Learning Framework. In our multi-task prompt learning framework, the task-specific MmAP and the group-shared MmAP are the primary components. To verify the importance of each module, we conduct related ablation experiments on Office-Home, and the results are presented in Table 3. To substantiate the effectiveness of the task grouping strategy, we incorporate random grouping as a benchmark for comparison. The empirical results elucidate that each module within our framework plays a pivotal role, cumulatively contributing to the superior performance achieved by our multi-task prompt learning framework. Compared to random grouping, our task grouping performs 0.68% and 0.85% higher under the settings of 10% and 20%, respectively. Compared to placing all tasks in one group, our task grouping performs 0.39% and 0.41% higher under the settings of 10% and 20%. From another perspective, task-specific MmAP surpasses CoOp and VPT (results in Table 1), further demonstrating the effectiveness of our MmAP.

Different Downstream Data Size. We examine the impact of training data size on Office-Home (four tasks). We select 1/3/6/12 shots per class and compare our MmAP with CoOp-MT, VPT-MT, and MaPLe-MT. The results for each task and method at different training data scales are presented in Figure 5. The results indicate that our method surpasses all other baselines on the four tasks across data scales, confirming our method's strong generalization. However, we observe that all methods underperform in comparison to Zero-Shot in the 1-shot setting for the Art and Real World tasks. This may be due to the fact that one shot is too specific to serve as a general representation for the entire task. When provided with 3 or more shots for training, the average performance gap achieved by our method is substantial.

Conclusion

In this work, we propose the Multi-modal Alignment Prompt (MmAP) for adapting CLIP to downstream tasks, which achieves the best trade-off between trainable parameters and performance against most of the existing methods. Simultaneously, MmAP addresses the issue of previous single-modal prompt methods (e.g., CoOp and VPT) disrupting CLIP's modal alignment. Building on MmAP, we design a multi-task prompt learning framework, which not only enables similar tasks to be trained together to enhance task complementarity but also preserves the independent characteristics of each task. Our approach achieves significant performance improvements compared to full fine-tuning on two large multi-task learning datasets under limited downstream data while only utilizing ∼0.09% of the trainable parameters.
References

Chen, S.; Ge, C.; Tong, Z.; Wang, J.; Song, Y.; Wang, J.; and Luo, P. 2022. Adaptformer: Adapting vision transformers for scalable visual recognition. In Advances in Neural Information Processing Systems (NeurIPS).
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR).
Fifty, C.; Amid, E.; Zhao, Z.; Yu, T.; Anil, R.; and Finn, C. 2021. Efficiently identifying task groupings for multi-task learning. In Advances in Neural Information Processing Systems (NeurIPS).
Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; and Qiao, Y. 2021. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. arXiv preprint arXiv:2110.04544.
Gao, Y.; Ma, J.; Zhao, M.; Liu, W.; and Yuille, A. L. 2019. Nddr-cnn: Layerwise feature fusing in multi-task cnns by neural discriminative dimensionality reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
He, Y.; Zheng, S.; Tay, Y.; Gupta, J.; Du, Y.; Aribandi, V.; Zhao, Z.; Li, Y.; Chen, Z.; Metzler, D.; et al. 2022. Hyperprompt: Prompt-based task-conditioning of transformers. In Proceedings of the International Conference on Machine Learning (ICML).
Hu, E. J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR).
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (ICML).
Jia, M.; Tang, L.; Chen, B.; Cardie, C.; Belongie, S. J.; Hariharan, B.; and Lim, S. 2022a. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision (ECCV).
Jia, M.; Tang, L.; Chen, B.-C.; Cardie, C.; Belongie, S.; Hariharan, B.; and Lim, S.-N. 2022b. Visual Prompt Tuning. In Proceedings of the European Conference on Computer Vision (ECCV).
Khattak, M. U.; Rasheed, H. A.; Maaz, M.; Khan, S.; and Khan, F. S. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Li, L. H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.-N.; et al. 2022. Grounded language-image pre-training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liang, X.; Niu, M.; Han, J.; Xu, H.; Xu, C.; and Liang, X. 2023. Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liu, Y.; Lu, Y.; Liu, H.; An, Y.; Xu, Z.; Yao, Z.; Zhang, B.; Xiong, Z.; and Gui, C. 2023. Hierarchical Prompt Learning for Multi-Task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liu, Y.-C.; Ma, C.-Y.; Tian, J.; He, Z.; and Kira, Z. 2022. Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks. In Advances in Neural Information Processing Systems (NeurIPS).
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Long, M.; Cao, Z.; Wang, J.; and Yu, P. S. 2017. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems (NeurIPS).
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML).
Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; and Lu, J. 2022. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Shen, J.; Zhen, X.; Worring, M.; and Shao, L. 2021. Variational multi-task learning with gumbel-softmax priors. In Advances in Neural Information Processing Systems (NeurIPS).
Standley, T.; Zamir, A.; Chen, D.; Guibas, L.; Malik, J.; and Savarese, S. 2020. Which tasks should be learned together in multi-task learning? In Proceedings of the International Conference on Machine Learning (ICML).
Sung, Y.; Cho, J.; and Bansal, M. 2022. VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Venkateswara, H.; Eusebio, J.; Chakraborty, S.; and Panchanathan, S. 2017. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, Q.; Du, J.; Yan, K.; and Ding, S. 2023. Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning. In Proceedings of the ACM Conference on Multimedia (MM).
Wortsman, M.; Ilharco, G.; Kim, J. W.; Li, M.; Kornblith, S.; Roelofs, R.; Lopes, R. G.; Hajishirzi, H.; Farhadi, A.; Namkoong, H.; et al. 2022. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz,
J.; and Wang, X. 2022. Groupvit: Semantic segmentation
emerges from text supervision. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR).
Xu, Y.; Yang, Y.; and Zhang, L. 2023. DeMT: Deformable
mixer transformer for multi-task learning of dense predic-
tion. In Proceedings of the AAAI Conference on Artificial
Intelligence (AAAI).
Ye, H.; and Xu, D. 2023. Taskprompter: Spatial-channel
multi-task prompting for dense scene understanding. In Pro-
ceedings of the International Conference on Learning Rep-
resentations (ICLR).
Zaken, E. B.; Goldberg, Y.; and Ravfogel, S. 2022. Bit-
fit: Simple parameter-efficient fine-tuning for transformer-
based masked language-models. In Proceedings of the An-
nual Meeting of the Association for Computational Linguis-
tics (ACL).
Zhang, X.; Zhou, L.; Li, Y.; Cui, Z.; Xie, J.; and Yang, J.
2021. Transfer vision patterns for multi-task pixel learn-
ing. In Proceedings of the ACM Conference on Multimedia
(MM).
Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L. H.;
Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. 2022. Regionclip:
Region-based language-image pretraining. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022. Learn-
ing to prompt for vision-language models. In International
Journal of Computer Vision (IJCV).
Zhou, K.; Yang, Y.; Qiao, Y.; and Xiang, T. 2021. Domain
adaptive ensemble learning. In IEEE Transactions on Image
Processing (TIP).
