RepAdapter
Gen Luo1, Minglang Huang1, Yiyi Zhou1,2, Xiaoshuai Sun1,2, Guannan Jiang3, Zhiyu Wang3, Rongrong Ji1,2
1 Media Analytics and Computing Lab, Department of Artificial Intelligence,
School of Informatics, Xiamen University, 361005, China.
2 Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China.
3 Intelligent Manufacturing Department, Contemporary Amperex Technology Co. Limited (CATL).
{luogen, huangminglang}@stu.xmu.edu.cn,
{zhouyiyi,xssun,rrji}@xmu.edu.cn, {jianggn, wangzy13}@catl.com
Abstract
main categories, namely visual adapter [2, 16, 20, 31, 38] and prompt tuning [19, 38, 42, 43]. Notably, very recent progress also demonstrates competitive performance with lower parameter cost than full fine-tuning on Vision Transformers [2, 19, 38, 42, 43].

Despite the great success, most existing PETL methods inevitably slow down model inference [18], as shown in Fig. 1. For prompt-tuning methods [19, 42], the inserted tokens greatly increase the computation cost of vision models, especially the Transformer-based ones [11, 33]. In terms of visual adapters [2, 16], their theoretical cost is actually cheap, e.g., +0.03 GFLOPs for the visual adapter [16]. But the modules they add also increase the network complexity, e.g., the network depth, thus reducing the efficiency of GPU parallel computing. As shown in Fig. 1, when the batch size is 1, the latency of ViT [11] is increased by almost 20%, which is significant in real-world applications.

A trade-off solution is the recently proposed PETL method for pre-trained language models called Low-Rank Adaptation (LoRA) [18]. Inspired by the finding of the "low intrinsic rank" of large-scale pre-trained models [1], Hu et al. apply two decomposition matrices to approximate the projection weights of self-attention, as shown in Fig. 1. During inference, these weights can be re-parameterized into the pre-trained model, thereby avoiding additional computation. However, the generalization ability of LoRA is still limited for common vision models. For instance, when applying LoRA to a CNN, e.g., ConvNeXt [27], its performance is inferior to full tuning, i.e., -1.9% on VTAB-1k [37]. On ViT [11], LoRA also performs worse than the adapter [2]. In addition, its re-parameterization is not feasible for common adapters [16, 38] that are sequentially placed after neural modules.

In this paper, we investigate whether common adaptation modules can be fully merged into pre-trained models. In existing re-parameterization methods [6, 8, 10, 18], the merged parameters all come from a branch added in parallel, except for the re-parameterization of the norm layer [24]. However, most visual adapters [16, 31, 38] are deployed sequentially to directly optimize the feature spaces on downstream tasks, as shown in Fig. 2. In this paper, we find that when the adaptation module is linear, it can also be re-parameterized despite its feed-forward (sequential) structure, without performance degradation. This finding allows us to keep the network intact during inference, as LoRA does, while retaining the effectiveness of adapters.

Based on this finding, we further propose a novel PETL method called RepAdapter. As shown in Fig. 2, RepAdapter also inserts lightweight networks into the pre-trained model, and the additional parameters are re-parameterized into the nearby projection weights after training. To the best of our knowledge, this is the first attempt in the literature to re-parameterize such a sequential structure. In addition, we also investigate the sparse design of the visual adapter and obtain a new dense-sparse structure, which can further save 25% of the parameters. Meanwhile, we empirically find that the adapter placement is essential for giant vision models.

To validate RepAdapter, we apply it to various vision models, ranging from CNNs like ConvNeXt [27] to single- and multi-modal Transformers, e.g., ViT [11] and CLIP [29]. Extensive experiments are conducted on 27 benchmark datasets of image and video classification and semantic segmentation [3, 12, 14, 15, 21, 34, 37, 41]. Experimental results show that RepAdapter can outperform the state-of-the-art (SOTA) PETL methods [2, 19, 38, 43] in both performance and parameter size, while incurring no additional computation during inference. Meanwhile, we examine RepAdapter under the settings of few-shot learning and domain adaptation, where its superior performance and generalization ability are also witnessed.

In summary, our contributions are three-fold:

• We propose a novel PETL method for vision models, called RepAdapter, which shows that common visual adapters can also be sequentially re-parameterized into pre-trained models.

• We investigate the sparse design and effective placement of the visual adapter, which can further improve RepAdapter in terms of parameter efficiency and performance.

• RepAdapter outperforms most existing PETL methods on 27 benchmarks of three vision tasks. Its generalization is validated on a wide range of vision models, including ConvNeXt, ViT, Swin-Transformer and CLIP.
2. Related Work

2.1. Parameter-efficient Transfer Learning

With the rapid growth of model sizes, parameter-efficient transfer learning (PETL) has attracted increasing research interest [2, 16, 18, 19, 22, 23, 25, 30, 31, 38, 40, 42, 43]. PETL for large-scale pre-trained models first emerged in the field of natural language processing (NLP) [16, 18, 22, 23, 25, 30, 40], which demonstrates that fine-tuning only a few lightweight modules in a large-scale pre-trained model can achieve almost the performance of full tuning. Drawing on this success in NLP, researchers have begun to apply the principle of PETL to large pre-trained vision models on various vision tasks [2, 16, 31, 38, 42, 43]. Among them, adapter-based [2, 16, 31] and prompt-tuning-based methods [42, 43] are the two main paradigms for large-scale vision models. As illustrated in Fig. 2, adapter-based methods [2, 16, 31] insert small MLP networks into the vision model to adapt to downstream tasks. Prompt tuning [42, 43] adds a few trainable tokens to the input sequence of the vision Transformer to mitigate the gap between pre-training and downstream data distributions. LoRA [18] learns low-rank parameters for the frozen weights of multi-head attention [33]. Zhang et al. [38] propose a prompt search algorithm to automatically combine the adapter, prompt tuning and LoRA. Very recently, Lian et al. [24] insert normalization layers into vision models to adapt to downstream tasks, which can also be re-parameterized.

The principle of RepAdapter obviously differs from existing visual adapters [2, 16, 38] in its structure and placement. LoRA [18] and SSF [24] are two related methods, but their re-parameterizations are designed for simple modules like the normalization layer [24]. Compared to these works, RepAdapter also demonstrates a better trade-off among performance, efficiency and generalization.

2.2. Structural Re-parameterization

Structural re-parameterization (SR) has achieved great success in designing efficient deep neural networks [5, 7, 9, 10]. The main target of existing SR methods is to convert a multi-branch structure to a single-branch one during inference. One representative SR work is RepVGG [10], which merges a multi-branch block with 1 × 1 and 3 × 3 convolution kernels and an identity layer into a single convolution layer, greatly reducing the computation overhead during inference. Inspired by RepVGG, DBBNet [8] proposes an inception-like unit for ConvNets, which can be transformed into a convolution layer during inference. Similar work includes ACNet [6] and RepMLPNet [5], which effectively improve the model capacity via SR. Recently, some works [7, 9] find that SR benefits the training of large convolutional kernels [9] and the lossless pruning of CNNs [7].

Our work is inspired by these advances but also differs in two aspects. Firstly, our strategy is more flexible and can be deployed in common parameterized modules, e.g., convolutions. Secondly, our RepAdapter is capable of re-parameterizing sequential structures. Based on these two aspects, we believe that the proposed method is a viable complement to existing SR research.

3. Methods

3.1. Preliminary

We first revisit visual adaptation on a widely used pre-trained model called Vision Transformer (ViT) [11].

Vision Transformer. Given an input image I ∈ R^(H×W×3), ViT serializes it into visual tokens X ∈ R^(n×d) via patch embedding [11]. Then, a learnable token x_cls ∈ R^(1×d) for classification is concatenated with X, and the positional embeddings P ∈ R^((n+1)×d) are also added, which can be formulated by

  X_0 = [x_cls, x_1, ..., x_n] + P.  (1)

Afterwards, these visual inputs are processed by a set of Transformer layers, and the l-th block can be defined as

  X'_l = MHA(LN(X_{l-1})) + X_{l-1},
  X_l = FFN(LN(X'_l)) + X'_l.  (2)

MHA, FFN and LN denote the multi-head attention, feed-forward network and layer normalization, respectively. In particular, MHA can be formulated by

  Attn_i(X) = softmax((X W_Q^i)(X W_K^i)^T / √d_k) (X W_V^i),
  MHA(X) = [Attn_1(X), ..., Attn_{n_h}(X)] W_O.  (3)

Here, Attn_i(X) is the scaled dot-product attention of the i-th head, and [·] denotes the concatenation operation. W_Q^i ∈ R^(d×d/n_h), W_K^i ∈ R^(d×d/n_h), W_V^i ∈ R^(d×d/n_h) and W_O ∈ R^(d×d) are the projection matrices. FFN can be defined as

  FFN(X) = σ(X W_1 + b_1) W_2 + b_2,  (4)

where W_1 ∈ R^(d×4d) and W_2 ∈ R^(4d×d) are two projection weights, b_1 ∈ R^(4d) and b_2 ∈ R^d are bias vectors, and σ(·) is the GELU function [4].
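To make Eqs. 2-4 concrete, the following is a minimal PyTorch sketch of one pre-norm ViT block. It is our own illustrative re-implementation of the standard layer (the class name ViTBlock and the dimensions are our assumptions), not code from the paper; nn.MultiheadAttention internally packs the W_Q, W_K, W_V and W_O projections of Eq. 3.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One pre-norm Transformer block, mirroring Eq. 2: MHA + FFN with residuals."""
    def __init__(self, d=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        # Multi-head attention of Eq. 3 (W_Q, W_K, W_V, W_O are packed inside).
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # FFN of Eq. 4: expand to 4d, GELU, project back to d.
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                                   # x: (batch, n+1, d)
        h = self.ln1(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]    # X'_l
        x = x + self.ffn(self.ln2(x))                       # X_l
        return x

tokens = torch.randn(2, 197, 768)   # [x_cls, x_1..x_196] + P for a 224x224 image, patch 16
print(ViTBlock()(tokens).shape)     # torch.Size([2, 197, 768])
```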
Visual Adapter. A visual adapter is often a lightweight neural network with a bottleneck structure [2, 16] and a residual connection, which can be formulated by

  f(X; θ) = X + φ_u(σ(φ_d(X))).  (5)

Here, σ is the activation function, and φ_d and φ_u denote the downsampling and upsampling projections, respectively. φ is defined by φ(X) = XW + b, where W ∈ R^(d×d') and b ∈ R^(d') are the projection weight and bias, respectively. In practice, the hidden size of the adapter is very small, e.g., 8, which makes it very compact.

There are two common ways to deploy the adapter in Vision Transformers [2, 16]. The first one is the sequential manner [16], which places the adapter after FFN. Under this deployment, Eq. 2 can be modified to

  X_l = f(FFN(LN(X'_l)); θ) + X'_l.  (6)

The other one is the parallel deployment [2], where the adapter is placed in parallel to FFN:

  X_l = FFN(LN(X'_l)) + f(X'_l; θ) + X'_l.  (7)

According to the principle of re-parameterization [6, 10], the parallel adapter cannot be merged into the Transformer due to the non-linearity of FFN. To the best of our knowledge, the re-parameterization of sequential adapters is also left unexplored in the literature.
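As a concrete reference for Eq. 5-6, below is a minimal PyTorch sketch of a bottleneck adapter and its sequential placement after a (frozen) FFN. The class name, hidden size of 8 and dimensions follow the description above, but the code itself is our illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Eq. 5: f(X) = X + phi_u(sigma(phi_d(X))) with a small hidden size (e.g., 8)."""
    def __init__(self, d=768, hidden=8):
        super().__init__()
        self.down = nn.Linear(d, hidden)   # phi_d
        self.up = nn.Linear(hidden, d)     # phi_u
        self.act = nn.GELU()               # sigma

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

d = 768
ln = nn.LayerNorm(d)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))  # frozen FFN
adapter = BottleneckAdapter(d)

x = torch.randn(2, 197, d)
x_seq = adapter(ffn(ln(x))) + x   # sequential deployment (Eq. 6)
# The parallel deployment (Eq. 7) would instead add an adapter branch alongside ffn(ln(x)).
```

With a hidden size of 8, each adapter adds only roughly 2 × 8 × 768 ≈ 12K weights (plus biases) per insertion point, which is why the text describes it as very compact.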
We empirically find that removing the non-linearity of adapters does not cause performance degradation on vision tasks. In this case, we first remove the non-linear function of the visual adapter, and f(X; θ) can be re-written as

  f(X; θ) = X + φ_u(φ_d(X)).  (8)

Here, φ_u and φ_d denote the dense projections in common adapters, but they can also be almost arbitrary linear transformations, e.g., the sparse layer in RepAdapter. During inference, the formulation f(X; θ) of the adapter is simplified to

  f(X; θ) = (X W_d + b_d) W_u + b + X
          = X W_d W_u + X W_I + b_d W_u + b
          = X W_ada + b_ada.  (9)

Here, W_d ∈ R^(d×c) and W_u ∈ R^(c×d) are the weight matrices, and W_I ∈ R^(d×d) is an identity matrix. W_ada = W_d W_u + W_I and b_ada = b_d W_u + b are the re-parameterized weight and bias, respectively. In this way, we simplify the adapter structure to a single linear projection layer, which can be incorporated into the nearby projection weights via matrix multiplication. Notably, Eq. 9 is also applicable to more complex structures, e.g., a deep multi-layer network.

Based on Eq. 9, we depict the re-parameterization of adapters. When the adapter is sequentially placed before a pre-trained projection in the vision model, we can re-parameterize f(X; θ) into the pre-trained weight W_0 and bias b_0 by

  W'_0 = W_ada W_0,  b'_0 = b_ada W_0 + b_0,  (10)

since the subsequent projection then computes f(X; θ) W_0 + b_0 = X W'_0 + b'_0.
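The merge in Eqs. 9-10 is easy to verify numerically. The snippet below is a small sketch of our own (variable names and dimensions are assumptions, not the released code) that folds a linear adapter placed before a frozen projection into that projection's weight and bias, and checks that the two computation paths agree.

```python
import torch

torch.manual_seed(0)
n, d, c = 4, 16, 8                       # tokens, feature dim, adapter hidden dim

# Linear adapter f(X) = (X Wd + bd) Wu + b + X   (Eq. 8/9, no activation)
Wd, bd = torch.randn(d, c), torch.randn(c)
Wu, b = torch.randn(c, d), torch.randn(d)
# Frozen pre-trained projection that follows the adapter: Y = f(X) W0 + b0
W0, b0 = torch.randn(d, d), torch.randn(d)

X = torch.randn(n, d)
adapter_out = (X @ Wd + bd) @ Wu + b + X
y_train_path = adapter_out @ W0 + b0     # what the adapted model computes

# Re-parameterization (Eq. 9-10): collapse adapter + projection into one linear layer
W_ada = Wd @ Wu + torch.eye(d)           # W_ada = Wd Wu + W_I
b_ada = bd @ Wu + b                      # b_ada = bd Wu + b
W0_new = W_ada @ W0                      # merged weight  W'_0
b0_new = b_ada @ W0 + b0                 # merged bias    b'_0
y_merged = X @ W0_new + b0_new           # inference path: no extra module

print(torch.allclose(y_train_path, y_merged, atol=1e-4))   # True
```

Because every step is linear, the same collapse applies when the adapter consists of several stacked linear layers, which is what makes the sequential re-parameterization noted after Eq. 9 possible.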
Figure 4. The deployments of RepAdapter in ViT. Four possible locations where RepAdapter can be inserted and re-parameterized. Our final deployments are in dark orange.

3.2.2 Adapter Structure

Next, we investigate the sparse structure of RepAdapter. Although the lightweight structure has been actively discussed in recent works [2, 16, 24, 38], we believe that it still has room to explore on vision models.

The first observation is that sparse transformation is a fundamental characteristic of many vision modules, e.g., depthwise separable convolution [17]. Some research [35] also shows that sparse transformation can improve model capacity for better performance. However, this property has yet to materialize in adapters.

Inspired by this, we propose a dense-to-sparse structure for RepAdapter, where φ_u is formulated as a group-wise transformation [28]:

  φ_u(X') = [X'_{g1} W_{g1}, ..., X'_{gk} W_{gk}] + b.  (11)

Here, X'_i ∈ R^(n×c/k) are the features split from X' ∈ R^(n×c), k is the number of groups, W_i ∈ R^((c/k)×(d/k)) is the projection weight matrix of the i-th group, and b ∈ R^d is the bias term. During inference, φ_u(X') can also be converted to a dense projection layer by zero-padding W_i. In this case, it can be re-parameterized with Eq. 9 and 10.

This sparse design makes RepAdapter more lightweight than common visual adapters [2, 16], e.g., saving about 25% of the parameters when the group number is 2.
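To illustrate the zero-padding step mentioned above, the following sketch (our own, with assumed shapes) builds the dense matrix equivalent to the group-wise projection of Eq. 11: each group's weight becomes one block of a block-diagonal matrix, so φ_u can then be merged with Eq. 9-10 like any other linear layer.

```python
import torch

torch.manual_seed(0)
n, c, d, k = 4, 8, 16, 2                     # tokens, adapter hidden dim, model dim, groups

# One (c/k) x (d/k) weight per group, plus a shared bias (Eq. 11).
W_groups = [torch.randn(c // k, d // k) for _ in range(k)]
b = torch.randn(d)

X = torch.randn(n, c)
chunks = X.chunk(k, dim=-1)                  # split features into k groups
y_group = torch.cat([x_i @ W_i for x_i, W_i in zip(chunks, W_groups)], dim=-1) + b

# Zero-pad the group weights into one dense (block-diagonal) c x d matrix.
W_dense = torch.block_diag(*W_groups)        # zeros everywhere off the diagonal blocks
y_dense = X @ W_dense + b

print(torch.allclose(y_group, y_dense, atol=1e-5))   # True
```

With k = 2, φ_u keeps only c·d/2 of its c·d weights, so the whole adapter drops from roughly 2cd to 1.5cd parameters, which matches the ~25% saving quoted above.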
Adapter Placement. Existing visual adapters [38] usually follow the deployment used for pre-trained language models [16]. However, due to the great differences between vision and language models, we think it is necessary to investigate the placement of adapters on vision models.

Considering that parallel adapters are hard to re-parameterize, we investigate all possible sequential locations, as shown in Fig. 4. Empirically, we find that deploying RepAdapter before the neural modules leads to better performance, and this placement is also feasible for re-parameterization. Meanwhile, we observe that it is more beneficial to apply RepAdapter to both MHA and FFN in ViT. These observations are further supported in our experiments. Notably, this deployment is also viable and effective for other vision models like CNNs [27].

4. Experiments

4.1. Datasets and Metrics

Image Classification. The VTAB-1k benchmark contains 19 diverse image classification datasets, which are divided into three groups, i.e., the Natural, Specialized and Structured groups. Each dataset contains 800 and 200 examples for training and validation, respectively. Following previous work [19, 38], we train models with all samples of the train and val splits and report the top-1 accuracy on the test split. ImageNet [3] is a large-scale image classification dataset with 1,000 categories. Following previous work [42, 43], we use 16 shots per category for few-shot learning. We evaluate domain generalization on the val sets of ImageNet-Sketch [34], ImageNet-R [14] and ImageNet-A [15], and report top-1 accuracy on each val set. Details about these datasets are reported in the appendix.

Video Classification. Something-Something V2 [12] is a large collection of video clips with 174 categories, which contains 169k videos for training and 20k videos for validation. HMDB51 [21] has 5k and 1.5k videos from 51 categories for training and validation, respectively. Following [2], we report top-1 accuracy on the val sets.

Semantic Segmentation. ADE20K [41] is a challenging dataset for semantic segmentation, which has 20k and 2k images from 150 categories for training and validation. We report mIoU on its val set.

4.2. Implementation Details

For image classification, the default visual backbone is ViT-B/16 [11] pre-trained on ImageNet-21k [3]. For video classification, we use ViT-B/16 pre-trained by VideoMAE [32] as the visual backbone. For semantic segmentation, the visual backbone is ViT-L/14 pre-trained on ImageNet-21k. The hidden dimension c and the number of groups k of RepAdapter are set to 8 and 2, respectively. The hyper-parameter s is searched from [0.1, 0.5, 1, 5, 10]. By default, we insert RepAdapter before MHA and FFN. We also provide a lightweight variant called RepAdapter_attn, where the adapter is only deployed before MHA. Other details, including image augmentation and hyper-parameters, are kept the same as previous work [2, 19, 38, 42] and are provided in the appendix.
Table 1. Comparison of RepAdapter and the state-of-the-art PETL methods on the VTAB-1k benchmark. ViT-B/16 pre-trained on ImageNet-21k is used as the vision model for all methods. Dataset columns are grouped as Natural (Cifar100, Caltech101, DTD, Flower102, Pets, SVHN, Sun397), Specialized (Camelyon, EuroSAT, Resisc45, Retinopathy) and Structured (Clevr-Count, Clevr-Dist, DMLab, KITTI-Dist, dSpr-Loc, dSpr-Ori, sNORB-Azim, sNORB-Ele).

Model             | Param (M) | Avg. Acc. | Natural (7 datasets)                | Specialized (4)     | Structured (8)
Full tuning [19]  | 85.8      | 68.9      | 68.9 87.7 64.3 97.2 86.9 87.4 38.8  | 79.7 95.7 84.2 73.9 | 56.3 58.6 41.7 65.5 57.5 46.7 25.7 29.1
Linear probe [19] | 0.04      | 57.6      | 64.4 85.0 63.2 97.0 86.3 36.6 51.0  | 78.5 87.5 68.5 74.0 | 34.3 30.6 33.2 55.4 12.5 20.0 9.6 19.2
Adapter [16]      | 0.16      | 73.9      | 69.2 90.1 68.0 98.8 89.9 82.8 54.3  | 84.0 94.9 81.9 75.5 | 80.9 65.3 48.6 78.3 74.8 48.5 29.9 41.6
AdaptFormer [2]   | 0.16      | 74.7      | 70.8 91.2 70.5 99.1 90.9 86.6 54.8  | 83.0 95.8 84.4 76.3 | 81.9 64.3 49.3 80.3 76.3 45.7 31.7 41.1
RepAdapter_attn   | 0.11      | 75.5      | 70.7 91.6 72.5 99.1 91.3 88.5 54.2  | 84.1 95.7 85.1 74.6 | 81.6 69.1 50.4 81.9 79.5 45.6 34.6 41.9
VPT [19]          | 0.53      | 72.0      | 78.8 90.8 65.8 98.0 88.3 78.1 49.6  | 81.8 96.1 83.4 68.4 | 68.5 60.0 46.5 72.8 73.6 47.9 32.9 37.8
LoRA [18]         | 0.29      | 74.5      | 67.1 91.4 69.4 98.8 90.4 85.3 54.0  | 84.9 95.3 84.4 73.6 | 82.9 69.2 49.8 78.5 75.7 47.1 31.0 44.0
NOAH [38]         | 0.36      | 75.5      | 69.6 92.7 70.2 99.1 90.4 86.1 53.7  | 84.4 95.4 83.9 75.8 | 82.8 68.9 49.9 81.7 81.8 48.3 32.8 44.2
SSF [24]          | 0.24      | 75.7      | 69.0 92.6 75.1 99.4 91.8 90.2 52.9  | 87.4 95.9 87.4 75.5 | 75.9 62.3 53.3 80.6 77.3 54.9 29.5 37.9
RepAdapter        | 0.22      | 76.1      | 72.4 91.6 71.0 99.2 91.4 90.7 55.1  | 85.3 95.9 84.6 75.9 | 82.3 68.0 50.4 79.9 80.4 49.2 38.6 41.0
Table 2. Ablation studies on VTAB-1k. The base model is ViT-B/16, and the default setting is RepAdapter_attn, which only has an adapter before MHA. "Avg acc" denotes the average accuracy on VTAB-1k. "Act." denotes the use of activation functions. "Parallel" denotes that RepAdapter is placed in parallel to MHA. "Full sparse" means that all projections are group-wise. The best settings are in gray.

(a) Number of groups
groups | params | avg acc.
1      | 0.16M  | 75.3
2      | 0.11M  | 75.5
4      | 0.09M  | 74.9
8      | 0.08M  | 74.5

(b) Hidden dimensions
dims | params | avg acc.
4    | 0.05M  | 74.4
8    | 0.11M  | 75.5
12   | 0.16M  | 75.3
16   | 0.22M  | 75.1

(c) Adapter position
position    | avg acc.
before attn | 75.5
after attn  | 74.9
before mlp  | 75.2
after mlp   | 74.6

(d) Adapter variants
setting      | avg acc.
default      | 75.5
w.i. act.    | 75.5
parallel [2] | 75.2
full sparse  | 75.1
Table 3. Cumulative ablation of RepAdapter_attn on VTAB-1k. We use ViT-B/16 as the base model.

Settings                       | Param (M) | Avg Acc.
Baseline (Adapter [16])        | 0.16      | 73.9
+ dense-sparse transformations | 0.11      | 74.5
+ pre-insertion                | 0.11      | 75.5
+ linear structure             | 0.11      | 75.5

4.3. Experimental Results

4.3.1 Comparison with the State of the Art

We first compare RepAdapter with the state-of-the-art (SOTA) PETL methods on ViT, as reported in Tab. 1. We first observe that all PETL methods outperform full fine-tuning by a large margin, while linear probing, which only tunes the classifier, performs much worse. These results confirm the effectiveness of PETL methods for ViT. Compared to these approaches, we can see that RepAdapter performs much better on VTAB-1k, e.g., +0.7% on Resisc45. In particular, RepAdapter_attn achieves SOTA performance while being much more lightweight than all PETL approaches, i.e., 0.11M parameters. When employing RepAdapter in both MHA and FFN, the average performance is further improved from 75.5 to 76.1, which outperforms the SOTA approach NOAH [38] by +0.6%. Compared to LoRA [18], which is also zero-cost during inference, RepAdapter also merits in performance and efficiency, e.g., +1.6% average accuracy. These results greatly validate the effectiveness and parameter efficiency of the proposed RepAdapter.

4.3.2 Ablation Studies

To gain deeper insights into RepAdapter, we conduct extensive ablation studies in Tab. 2 - 3. In Tab. 3, we validate the effectiveness of each design in RepAdapter. From this table, the first observation is that the sparse structure and the new placement obtain obvious improvements on ViT-B/16, resulting in +0.6% and +1.0% accuracy, respectively. Meanwhile, we see that the activation function is less important under our settings. Overall, these results validate the benefits of RepAdapter's designs on vision models.

In Tab. 2a, we show the impact of the group number. When the number of groups is set to 1, RepAdapter is actually a dense network. Notably, this dense structure does not perform best despite having more parameters. Instead, increasing the number of groups to a certain extent is beneficial to both performance and efficiency. Similar results can be found in Tab. 2b, which shows the impact of the parameter size. We observe that more parameters for RepAdapter do not always improve performance.
Table 4. Efficiency comparison of RepAdapter and existing PETL methods during inference. We use ViT-B/16 as the vision model. ∆P and ∆F denote the additional parameters and FLOPs introduced by the PETL methods. The inference speed is defined by images per second (imgs/sec) and measured on an NVIDIA 3090 GPU. All results are the average of 100 trials.

Methods           | ∆P    | ∆F    | GPU speed (imgs/sec): bs=1 | bs=4           | bs=16          | bs=128
Full tuning       | 0     | 0     | 91.5                       | 375.7          | 539.5          | 578.3
VPT [19]²         | 0.55M | 5.60G | 86.1 (-5.9%)               | 283.5 (-24.5%) | 381.5 (-29.2%) | 421.6 (-27.1%)
Adapter [16]      | 0.16M | 0.03G | 70.9 (-22.5%)              | 306.6 (-18.3%) | 504.7 (-6.4%)  | 552.4 (-5.8%)
AdaptFormer [2]   | 0.16M | 0.03G | 71.4 (-21.9%)              | 309.9 (-17.5%) | 508.1 (-4.2%)  | 555.2 (-3.9%)
NOAH [38]²        | 0.12M | 0.02G | 72.1 (-21.2%)              | 312.7 (-16.7%) | 492.9 (-8.6%)  | 534.7 (-7.5%)
RepAdapter (ours) | 0     | 0     | 91.5 (-0.0%)               | 375.7 (-0.0%)  | 539.5 (-0.0%)  | 578.3 (-0.0%)

² … datasets, so we use the averaged dimensions to measure their latency.

can see that the additional FLOPs of visual prompt tuning
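The measurement protocol in Table 4 (throughput in imgs/sec on a single GPU, averaged over repeated trials) can be approximated with a simple timing loop like the sketch below. This is our own illustration of such a measurement, not the authors' benchmarking script; the use of the timm library and the plain ViT-B/16 model are assumptions, and a CUDA GPU is required.

```python
import time
import torch
import timm  # assumed dependency for a ViT-B/16 backbone

@torch.no_grad()
def images_per_second(model, batch_size, trials=100, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(trials):
        model(x)
    torch.cuda.synchronize()                 # wait for all queued GPU work to finish
    return batch_size * trials / (time.time() - start)

vit = timm.create_model("vit_base_patch16_224", pretrained=False)
for bs in (1, 4, 16, 128):
    print(bs, round(images_per_second(vit, bs), 1))
```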
Table 5. Results of adapting CLIP to 16-shot ImageNet classification and domain generalization. ViT-B/16 is used as the visual backbone. RepAdapter-V and RepAdapter-T denote the adaptations of the visual and text encoders of CLIP, respectively. The hand-crafted prompt "a photo of a [CLASS]" is used for tuning.

Method         | Source: ImageNet | Target: -V2 | -S    | -A    | -R
Zero-shot [29] | 66.73            | 60.83       | 46.15 | 47.77 | 73.96
CoOp [42]      | 71.51            | 64.20       | 47.99 | 49.71 | 75.21
CoCoOp [43]    | 71.02            | 64.07       | 48.75 | 50.63 | 76.18
RepAdapter-V   | 70.93            | 64.00       | 48.40 | 45.53 | 75.77
RepAdapter-T   | 71.87            | 64.77       | 49.30 | 51.13 | 76.47

Table 8. Results of RepAdapter on semantic segmentation. SETR [39] is the vision model, and we report its mIoU scores on the ADE20K [41] val set. "mIoU-SS" and "mIoU-MS" denote the results of single- and multi-scale predictions, respectively.

Methods          | Params (M) | mIoU-SS | mIoU-MS
Full tuning [39] | 318.31     | 48.31   | 50.07
Head only [39]   | 13.18      | 35.12   | 37.46
Bias [19]        | 13.46      | 43.40   | 45.33
VPT [19]         | 13.43      | 42.11   | 44.06
VPT + Bias [19]  | 15.79      | 44.04   | 45.63
RepAdapter       | 13.82      | 44.44   | 46.71