
Towards Efficient Visual Adaption via Structural Re-parameterization

Gen Luo1, Minglang Huang1, Yiyi Zhou1,2, Xiaoshuai Sun1,2, Guannan Jiang3, Zhiyu Wang3, Rongrong Ji1,2
1 Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, 361005, China.
2 Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China.
3 Intelligent Manufacturing Department, Contemporary Amperex Technology Co. Limited (CATL).

{luogen, huangminglang}@stu.xmu.edu.cn,
{zhouyiyi,xssun,rrji}@xmu.edu.cn, {jianggn, wangzy13}@catl.com

Abstract

Parameter-efficient transfer learning (PETL) is an emerging research topic aimed at inexpensively adapting large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in saving storage costs for various pre-trained models by updating only a small number of parameters instead of full tuning. However, we notice that most existing PETL methods still incur non-negligible latency during inference. In this paper, we propose a parameter-efficient and computation-friendly adapter for giant vision models, called RepAdapter. Specifically, we first show that common adaptation modules can be seamlessly integrated into most giant vision models via our structural re-parameterization, thereby achieving zero extra cost during inference. We then investigate the sparse design and effective placement of the adapter structure, which gives RepAdapter further advantages in parameter efficiency and performance. To validate RepAdapter, we conduct extensive experiments on 27 benchmark datasets of three vision tasks, i.e., image classification, video classification and semantic segmentation. Experimental results show the superior performance and efficiency of RepAdapter over state-of-the-art PETL methods. For instance, RepAdapter outperforms full tuning by +7.2% on average and saves up to 25% training time, 20% GPU memory, and 94.6% storage cost of ViT-B/16 on VTAB-1k. The generalization ability of RepAdapter is also well validated on a range of vision models. Our source code is released at https://github.com/luogen1996/RepAdapter.

Figure 1. Performance comparison of our RepAdapter and existing PETL methods [2, 16, 18, 19, 38] on VTAB-1k. The vision model is ViT-B/16 and the inference speed is measured on an NVIDIA 3090 GPU with a batch size of 1. Most existing PETL methods incur non-negligible GPU latency during inference, while our RepAdapter does not.

1. Introduction

In the last couple of years, the research of large-scale pre-trained models has attracted an influx of interest from the computer vision community [11, 13, 29, 32, 36]. Along with the outstanding performance on various vision tasks [3, 12, 21, 37, 41], large-scale pre-training also leads to a rapid growth in parameter size. In this case, directly fine-tuning these pre-trained models on downstream tasks, the common transfer learning strategy used before, becomes prohibitively expensive in terms of storage overhead. For instance, fully fine-tuning ViT-G [36] on the 19 vision tasks of VTAB-1k [37] requires storing over 35 billion parameters for deployment.

To address this issue, numerous efforts have recently been devoted to parameter-efficient transfer learning (PETL) [2, 16, 18, 19, 22, 23, 25, 30, 31, 38, 40, 42, 43]. Inspired by the great success in natural language processing (NLP) [16, 18, 22, 23, 25, 30, 40], PETL methods for giant vision models also aim at reducing the tuning cost by updating or injecting a small fraction of parameters for each downstream task. They can be roughly divided into two main categories, namely visual adapters [2, 16, 20, 31, 38] and prompt tuning [19, 38, 42, 43]. Notably, very recent progress also demonstrates competitive performance at a lower parameter cost than full fine-tuning on Vision Transformers [2, 19, 38, 42, 43].

Figure 2. Comparison of existing PETL methods [2, 18, 19] (Visual Adapter, Prompt Learning and LoRA) and our RepAdapter. The figure distinguishes frozen weights of the vision model from trainable weights of the PETL methods. RepAdapter is deployed in a sequential manner, but it can be completely re-parameterized into the vision models during inference, enabling zero additional computational overhead. Its structure is also more lightweight than existing PETL methods.

Despite the great success, most existing PETL methods inevitably slow down model inference [18], as shown in Fig. 1. For prompt-tuning methods [19, 42], the inserted tokens greatly increase the computation cost of vision models, especially the Transformer-based ones [11, 33]. In terms of visual adapters [2, 16], their theoretical cost is actually cheap, e.g., +0.03 GFLOPs for the visual adapter [16]. But the modules they add also increase the network complexity, e.g., the network depth, thus reducing the efficiency of GPU parallel computing. As shown in Fig. 1, when the batch size is 1, the latency of ViT [11] is increased by almost 20%, which is significant in real-world applications.

A trade-off solution is the recently proposed PETL method for pre-trained language models called Low-Rank Adaptation (LoRA) [18]. Inspired by the finding of the "low intrinsic rank" of large-scale pre-trained models [1], Hu et al. apply two decomposition matrices to approximate the projection weights of self-attention, as shown in Fig. 2. During inference, these weights can be re-parameterized into the pre-trained model, thereby avoiding additional computation. However, the generalization ability of LoRA is still limited for common vision models. For instance, when applying LoRA to a CNN, e.g., ConvNeXt [27], its performance is inferior to full tuning, i.e., -1.9% on VTAB-1k [37]. On ViT [11], LoRA also performs worse than the adapter [2]. In addition, its re-parameterization is not feasible for common adapters [16, 38] that are sequentially placed after neural modules.

In this paper, we investigate whether common adaptation modules can be fully merged into the pre-trained models. In existing re-parameterization methods [6, 8, 10, 18], the merged parameters all come from a branch added in parallel, except the one for the re-parameterization of the normalization layer [24]. However, most visual adapters [16, 31, 38] are deployed sequentially to directly optimize the feature spaces on downstream tasks, as shown in Fig. 2. In this paper, we find that when the adaptation module is linear, it can also be re-parameterized in a feed-forward structure without performance degeneration. This finding allows us, in addition to LoRA, to keep the network intact during inference while retaining the effectiveness of the adapter.

Based on this finding, we further propose a novel PETL method called RepAdapter. As shown in Fig. 2, RepAdapter also inserts lightweight networks into the pre-trained models, and the additional parameters are re-parameterized into the nearby projection weights after training. To the best of our knowledge, re-parameterization of this sequential structure is the first attempt in the literature. In addition, we also investigate the sparse design of the visual adapter and obtain a new dense-sparse structure, which can further save 25% of the parameters. Meanwhile, we empirically find that the adapter placement is essential for giant vision models.

To validate RepAdapter, we apply it to various vision models, ranging from CNNs like ConvNeXt [27] to single- and multi-modal Transformers, e.g., ViT [11] and CLIP [29]. Extensive experiments are conducted on 27 benchmark datasets of image and video classification and semantic segmentation [3, 12, 14, 15, 21, 34, 37, 41]. Experimental results show that RepAdapter can outperform the state-of-the-art (SOTA) PETL methods [2, 19, 38, 43] in both performance and parameter size, while incurring no additional computation during inference. Meanwhile, we examine RepAdapter under the settings of few-shot learning and domain adaptation, where its superior performance and generalizability are also witnessed.

In summary, our contributions are three-fold:

• We propose a novel PETL method for vision models, called RepAdapter, which shows that common visual adapters can also be sequentially re-parameterized into pre-trained models.

• We investigate the sparse design and effective placement of the visual adapter, which further improve RepAdapter in terms of parameter efficiency and performance.

• RepAdapter outperforms most existing PETL methods on 27 benchmarks of three vision tasks. Its generalization is validated on a wide range of vision models, including ConvNeXt, ViT, Swin-Transformer and CLIP.

2. Related Work

2.1. Parameter-efficient Transfer Learning

With the rapid growth of model size, parameter-efficient transfer learning (PETL) has attracted increasing research interest [2, 16, 18, 19, 22, 23, 25, 30, 31, 38, 40, 42, 43]. PETL for large-scale pre-trained models first emerged in the field of natural language processing (NLP) [16, 18, 22, 23, 25, 30, 40], where it was shown that fine-tuning only a few lightweight modules in a large-scale pre-trained model can achieve performance close to full tuning. Drawing on this success in NLP, researchers have begun to apply the principle of PETL to large pre-trained vision models on various vision tasks [2, 16, 31, 38, 42, 43]. Among them, adapter-based [2, 16, 31] and prompt-tuning-based methods [42, 43] are two main paradigms for large-scale vision models. As illustrated in Fig. 2, adapter-based methods [2, 16, 31] insert small MLP networks into the vision model to adapt to downstream tasks. Prompt tuning [42, 43] adds a few trainable tokens to the input sequence of the vision Transformer to mitigate the gap between pre-training and downstream data distributions. LoRA [18] learns low-rank parameters for the frozen weights of multi-head attention [33]. Zhang et al. [38] propose a prompt search algorithm to automatically combine the adapter, prompt tuning and LoRA. Very recently, Lian et al. [24] insert normalization layers into vision models to adapt to downstream tasks, which can also be re-parameterized.

The principle of RepAdapter obviously differs from existing visual adapters [2, 16, 38] in its structure and placement. LoRA [18] and SSF [24] are two related methods, but their re-parameterizations are designed for simple modules like the normalization layer [24]. Compared to these works, RepAdapter also demonstrates a better trade-off among performance, efficiency and generalization.

2.2. Structural Re-parameterization

Structural re-parameterization (SR) has achieved great success in designing efficient deep neural networks [5, 7, 9, 10]. The main target of existing SR methods is to convert a multi-branch structure to a single-branch one during inference. One representative SR work is RepVGG [10], which merges a multi-branch block with 1 × 1 and 3 × 3 convolution kernels and an identity layer into a single convolution layer, greatly reducing the computation overhead during inference. Inspired by RepVGG, DBBNet [8] proposes an inception-like unit for ConvNets, which can be transformed into a convolution layer during inference. Similar work includes ACNet [6] and RepMLPNet [5], which effectively improve the model capacity via SR. Recently, some works [7, 9] find that SR benefits the training of large convolutional kernels [9] and the lossless pruning of CNNs [7]. Our work is inspired by these advances but also differs in two aspects. Firstly, our strategy is more flexible and can be deployed in common parameterized modules, e.g., convolutions. Secondly, our RepAdapter is capable of re-parameterizing sequential structures. Based on these two aspects, we believe that the proposed method is a viable complement to existing SR research.

3. Methods

3.1. Preliminary

We first revisit visual adaptation on a widely used pre-trained model called Vision Transformer (ViT) [11].

Vision Transformer. Given an input image I ∈ R^{H×W×3}, ViT serializes it to visual tokens X ∈ R^{n×d} via patch embedding [11]. Then, a learnable token x_cls ∈ R^{1×d} for classification is concatenated with X, and the positional embeddings P ∈ R^{(n+1)×d} are also added, which can be formulated by

X_0 = [x_cls, x_1, ..., x_n] + P.  (1)

Afterwards, these visual inputs are processed by a set of Transformer layers, and the l-th block can be defined as

X'_l = MHA(LN(X_{l-1})) + X_{l-1},
X_l = FFN(LN(X'_l)) + X'_l.  (2)

MHA, FFN and LN denote the multi-head attention, the feed-forward network and layer normalization, respectively. In particular, MHA can be formulated by

Attn_i(X) = softmax((X W_Q^i)(X W_K^i)^T / sqrt(d_k)) (X W_V^i),
MHA(X) = [Attn_1(X), ..., Attn_{n_h}(X)] W_O.  (3)

Here, Attn_i(X) is the scaled dot-product attention of the i-th head, and [·] denotes the concatenation operation. W_Q^i ∈ R^{d×(d/n_h)}, W_K^i ∈ R^{d×(d/n_h)}, W_V^i ∈ R^{d×(d/n_h)} and W_O ∈ R^{d×d} are the projection matrices. FFN can be defined as

FFN(X) = σ(X W_1 + b_1) W_2 + b_2,  (4)

where W_1 ∈ R^{d×4d} and W_2 ∈ R^{4d×d} are two projection weights, b_1 ∈ R^{4d} and b_2 ∈ R^d are bias vectors, and σ(·) is the GELU function [4].
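For reference, the preliminary above can be written in a few lines of PyTorch. The sketch below (assumed names and hyper-parameters, not the authors' code) mirrors Eq. 2-4: a pre-norm block with multi-head attention and a GELU FFN, each wrapped in a residual connection.

```python
# Minimal ViT block following Eq. 2-4 (a reference sketch, not the paper's code).
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, d: int, n_heads: int):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        # MHA of Eq. 3; W_Q, W_K, W_V and W_O are packed inside the module.
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # FFN of Eq. 4: two projections with a GELU in between.
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                                   # x: (batch, n + 1, d)
        h = self.ln1(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]    # Eq. 2, first line
        x = x + self.ffn(self.ln2(x))                       # Eq. 2, second line
        return x
```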
Figure 3. Illustration of the structural re-parameterization of RepAdapter. (a) RepAdapter can be simplified to a linear projection after training (Step 1: training-time structure to simplified structure; Step 2: simplified structure to inference-time structure). (b) The simplified weights can be merged into MHA and FFN (training-time MHA/FFN to re-parameterized MHA/FFN).

Visual Adapter. A visual adapter is often a lightweight neural network with a bottleneck structure [2, 16] and a residual connection, which can be formulated by

f(X; θ) = X + φ_u(σ(φ_d(X))).  (5)

Here, σ is the activation function, and φ_d and φ_u denote the downsampling and upsampling projections, respectively. φ is defined by φ(X) = X W + b, where W ∈ R^{d×d'} and b ∈ R^{d'} are the projection weight and bias, respectively. In practice, the hidden size of the adapter is very small, e.g., 8, which makes it very compact.

There are two common ways to deploy the adapter in Vision Transformers [2, 16]. The first one is the sequential manner [16], which places the adapter after the FFN. Under this deployment, Eq. 2 can be modified to

X_l = f(FFN(LN(X'_l)); θ) + X'_l.  (6)

The other one is the parallel deployment [2], where the adapter is placed in parallel to the FFN:

X_l = FFN(LN(X'_l)) + f(X'_l; θ) + X'_l.  (7)

According to the principle of re-parameterization [6, 10], the parallel adapter cannot be merged into the Transformer due to the non-linearity of the FFN. To the best of our knowledge, the re-parameterization of sequential adapters is also left unexplored in the literature.
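As a concrete reference for Eq. 5, the sketch below gives a minimal PyTorch implementation of such a bottleneck adapter (assumed class and attribute names; the exact activation varies across adapters).

```python
# Common bottleneck visual adapter of Eq. 5 (a hedged sketch, not the authors' code).
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d: int, hidden: int = 8):
        super().__init__()
        self.phi_d = nn.Linear(d, hidden)   # downsampling projection phi_d
        self.phi_u = nn.Linear(hidden, d)   # upsampling projection phi_u
        self.sigma = nn.GELU()              # non-linearity sigma (choice varies)

    def forward(self, x):
        # f(X; theta) = X + phi_u(sigma(phi_d(X)))
        return x + self.phi_u(self.sigma(self.phi_d(x)))
```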

3.2. RepAdapter

3.2.1 Sequential Structural Re-parameterization

We first propose a sequential structural re-parameterization scheme towards zero extra cost during inference, which shows that common adapters can also be merged into the pre-trained model via simple tweaks.

Above all, we notice that most existing visual adapters [2, 31] involve a non-linear function in their structures, which was originally designed to improve adaptation on NLP tasks [16]. However, we find that removing the non-linearity of adapters does not degrade performance on vision tasks.

In this case, we first remove the non-linear function of the visual adapter, and f(X; θ) can be re-written as

f(X; θ) = X + φ_u(φ_d(X)).  (8)

Here, φ_u and φ_d denote the dense projections in common adapters, and they can also be other linear transformations, e.g., the sparse layer in RepAdapter. During inference, the formulation f(X; θ) of adapters is simplified to

f(X; θ) = (X W_d + b_d) W_u + b + X
        = X W_d W_u + X W_I + b_d W_u + b  (9)
        = X W_ada + b_ada.

Here, W_d ∈ R^{d×c} and W_u ∈ R^{c×d} are the weight matrices, and W_I ∈ R^{d×d} is an identity matrix. W_ada = W_d W_u + W_I and b_ada = b_d W_u + b are the re-parameterized weight and bias, respectively. In this way, we simplify the adapter structure to a linear projection layer, which can be incorporated into the nearby projection weights via matrix multiplications. Notably, Eq. 9 is also applicable to more complex structures, e.g., a deep multi-layer network.

Based on Eq. 9, we depict the re-parameterization of adapters. When the adapter is sequentially placed into the vision model, we can re-parameterize f(X; θ) into the pre-trained weight W_0 and bias b_0 by

Rep(f(X; θ), W_0, b_0) = f(X; θ) W_0 + b_0
                       = X W_ada W_0 + b_ada W_0 + b_0  (10)
                       = X W_rep + b_rep.

Here, W_rep = W_ada W_0 is the re-parameterized projection weight, and b_rep = b_ada W_0 + b_0 is the re-parameterized bias term. In practice, W_0 can also be a convolutional kernel, and we give its re-parameterization in the appendix. As shown in Fig. 3, we can incorporate common adapters into existing vision modules, e.g., MHA, FFN and convolutions, thereby avoiding additional inference costs.
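The merging in Eq. 9 and Eq. 10 is exact and can be checked numerically. The short sketch below (assumed shapes and variable names, row-vector convention φ(X) = XW + b) folds a linear, residual adapter into the frozen projection that follows it and verifies that the outputs match.

```python
# Numerical check of the sequential re-parameterization (Eq. 9 and Eq. 10).
import torch

n, d, c = 4, 16, 8                               # tokens, feature dim, adapter hidden dim
X = torch.randn(n, d)

W_d, b_d = torch.randn(d, c), torch.randn(c)     # adapter down-projection phi_d
W_u, b_u = torch.randn(c, d), torch.randn(d)     # adapter up-projection phi_u
W0, b0 = torch.randn(d, d), torch.randn(d)       # frozen projection that follows the adapter

# Training-time computation: residual adapter, then the frozen projection.
y_train = (X + ((X @ W_d + b_d) @ W_u + b_u)) @ W0 + b0

# Eq. 9: collapse the adapter into a single linear map.
W_ada = W_d @ W_u + torch.eye(d)
b_ada = b_d @ W_u + b_u
# Eq. 10: fold that map into the frozen projection.
W_rep = W_ada @ W0
b_rep = b_ada @ W0 + b0
y_infer = X @ W_rep + b_rep

print(torch.allclose(y_train, y_infer, atol=1e-4))   # True (up to float error)
```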
Figure 4. The deployments of RepAdapter in ViT: four possible locations where RepAdapter can be inserted and re-parameterized (before or after MHA, and before or after FFN). Our final deployments (shown in dark orange in the original figure) are before MHA and before FFN.

3.2.2 Adapter Structure

Next, we investigate the sparse structure of RepAdapter. Although the lightweight structure has been actively discussed in recent works [2, 16, 24, 38], we believe that it still has room to explore on vision models.

The first observation is that sparse transformation is a fundamental characteristic of many vision modules, e.g., depthwise separable convolution [17]. Some research [35] also shows that sparse transformation can improve the model capacity for better performance. However, this property has yet to materialize in adapters.

Inspired by this, we propose a dense-to-sparse structure for RepAdapter, where φ_u is formulated as a group-wise transformation [28] by

φ_u(X) = [X'_{g_1} W_{g_1}, ..., X'_{g_k} W_{g_k}] + b.  (11)

Here, X'_i ∈ R^{n×(c/k)} are the features split from X ∈ R^{n×c}, and k is the number of groups. W_i ∈ R^{(c/k)×(d/k)} is the projection weight matrix and b ∈ R^d is the bias term. During inference, φ_u(X) can also be converted to a dense projection layer by zero-padding W_i. In this case, it can be re-parameterized with Eq. 9 and Eq. 10.

This sparse design makes RepAdapter more lightweight than common visual adapters [2, 16], e.g., saving about 25% of the parameters when the group number is 2.
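To make the zero-padding argument concrete, the sketch below (assumed names and shapes) checks that the group-wise projection of Eq. 11 equals a dense projection whose weight is block-diagonal, so it can be folded into the following layer exactly as in Eq. 9 and Eq. 10.

```python
# Group-wise up-projection of Eq. 11 vs. its dense (zero-padded) equivalent.
import torch

n, c, d, k = 4, 8, 16, 2                       # tokens, adapter hidden dim, model dim, groups
X = torch.randn(n, c)
W_groups = [torch.randn(c // k, d // k) for _ in range(k)]
b = torch.randn(d)

# Group-wise transformation: split the features, project each group, concatenate.
chunks = X.chunk(k, dim=-1)
y_group = torch.cat([x_i @ W_i for x_i, W_i in zip(chunks, W_groups)], dim=-1) + b

# Equivalent dense weight: the group weights sit on the block diagonal, zeros elsewhere.
W_dense = torch.block_diag(*W_groups)          # shape (c, d)
y_dense = X @ W_dense + b

print(torch.allclose(y_group, y_dense, atol=1e-5))   # True
```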
Adapter Placement. Existing visual adapters [38] usually follow the deployment used for pre-trained language models [16]. However, due to the great difference between visual and language models, we think it is necessary to investigate the placement of adapters on vision models. Considering that parallel adapters are hard to re-parameterize, we investigate all possible sequential locations, as shown in Fig. 4. Empirically, we find that deploying RepAdapter before the neural modules leads to better performance, and this placement is also feasible for re-parameterization. Meanwhile, we also observe that it is more beneficial to apply RepAdapter to both MHA and FFN in ViT. These observations are further supported in our experiments.

Therefore, the deployment of RepAdapter in the Transformer is

X'_l = MHA(f(LN(X_{l-1}); θ)) + X_{l-1},
X_l = FFN(f(LN(X'_l); θ)) + X'_l.  (12)

Notably, this deployment is also viable and effective for other vision models like CNNs [27].
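A schematic training-time view of Eq. 12 is sketched below in PyTorch (assumed class names; the group-wise up-projection of Eq. 11 and the hyper-parameter s from Sec. 4.2 are omitted for brevity). Because each adapter is linear, residual, and sits directly in front of the first projection of MHA or FFN, it can be folded into that projection after training via Eq. 9 and Eq. 10.

```python
# Training-time deployment of Eq. 12 (a simplified sketch, not the released code).
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Activation-free bottleneck adapter with a residual connection (Eq. 8)."""
    def __init__(self, d: int, hidden: int = 8):
        super().__init__()
        self.down, self.up = nn.Linear(d, hidden), nn.Linear(hidden, d)

    def forward(self, x):
        return x + self.up(self.down(x))

class RepAdapterBlock(nn.Module):
    def __init__(self, d: int, n_heads: int, hidden: int = 8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.adapter_attn = LinearAdapter(d, hidden)   # f(.; theta) before MHA
        self.adapter_ffn = LinearAdapter(d, hidden)    # f(.; theta) before FFN
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.adapter_attn(self.ln1(x))
        x = x + self.mha(h, h, h, need_weights=False)[0]   # Eq. 12, first line
        x = x + self.ffn(self.adapter_ffn(self.ln2(x)))    # Eq. 12, second line
        return x
```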
4. Experiments

4.1. Datasets and Metrics

Image Classification. The VTAB-1k benchmark contains 19 diverse image classification datasets, which are divided into three groups, i.e., the Natural, Specialized and Structured groups. Each dataset contains 800 and 200 examples for training and validation, respectively. Following previous work [19, 38], we train models with all samples of the train and val splits and report the top-1 accuracy on the test split. ImageNet [3] is a large-scale image classification dataset with 1,000 categories. Following previous work [42, 43], we use 16 shots per category for few-shot learning. We evaluate domain generalization on the val sets of ImageNet-Sketch [34], ImageNet-R [14] and ImageNet-A [15], and report top-1 accuracy on each. Details about these datasets are reported in the appendix.

Video Classification. Something-Something V2 [12] is a large collection of video clips with 174 categories, which contains 169k videos for training and 20k videos for validation. HMDB51 [21] has 5k and 1.5k videos from 51 categories for training and validation, respectively. Following [2], we report top-1 accuracy on the val sets.

Semantic Segmentation. ADE20K [41] is a challenging dataset for semantic segmentation, which has 20k and 2k images from 150 categories for training and validation. We report mIoU on its val set.

4.2. Implementation Details

For image classification, the default visual backbone is ViT-B/16 [11] pre-trained on ImageNet-21k [3]. For video classification, we use ViT-B/16 pre-trained by VideoMAE [32] as the visual backbone. For semantic segmentation, the visual backbone is ViT-L/14 pre-trained on ImageNet-21k. The hidden dimension c and the number of groups k of RepAdapter are set to 8 and 2, respectively. The hyper-parameter s is searched over [0.1, 0.5, 1, 5, 10]. By default, we insert RepAdapter before MHA and FFN. We also provide a lightweight variant called RepAdapter_attn, where the adapter is only deployed before MHA. Other details, including image augmentation and hyper-parameters, are kept the same as previous work [2, 19, 38, 42] and are provided in the appendix.
Table 1. Comparison of RepAdapter and state-of-the-art PETL methods on the VTAB-1k benchmark. ViT-B/16 pre-trained on ImageNet-21k is used as the vision model for all methods. For each model we report the tunable parameters (Param, M), the average accuracy (Avg. Acc.), and the per-dataset accuracies of the Natural group (Cifar100, Caltech101, DTD, Flower102, Pets, SVHN, Sun397), the Specialized group (Camelyon, EuroSAT, Resisc45, Retinopathy) and the Structured group (Clevr-Count, Clevr-Dist, DMLab, KITTI-Dist, dSpr-Loc, dSpr-Ori, sNORB-Azim, sNORB-Ele).

Full tuning [19] | Param 85.8 | Avg. 68.9 | Natural: 68.9 / 87.7 / 64.3 / 97.2 / 86.9 / 87.4 / 38.8 | Specialized: 79.7 / 95.7 / 84.2 / 73.9 | Structured: 56.3 / 58.6 / 41.7 / 65.5 / 57.5 / 46.7 / 25.7 / 29.1
Linear probe [19] | Param 0.04 | Avg. 57.6 | Natural: 64.4 / 85.0 / 63.2 / 97.0 / 86.3 / 36.6 / 51.0 | Specialized: 78.5 / 87.5 / 68.5 / 74.0 | Structured: 34.3 / 30.6 / 33.2 / 55.4 / 12.5 / 20.0 / 9.6 / 19.2
Adapter [16] | Param 0.16 | Avg. 73.9 | Natural: 69.2 / 90.1 / 68.0 / 98.8 / 89.9 / 82.8 / 54.3 | Specialized: 84.0 / 94.9 / 81.9 / 75.5 | Structured: 80.9 / 65.3 / 48.6 / 78.3 / 74.8 / 48.5 / 29.9 / 41.6
AdaptFormer [2] | Param 0.16 | Avg. 74.7 | Natural: 70.8 / 91.2 / 70.5 / 99.1 / 90.9 / 86.6 / 54.8 | Specialized: 83.0 / 95.8 / 84.4 / 76.3 | Structured: 81.9 / 64.3 / 49.3 / 80.3 / 76.3 / 45.7 / 31.7 / 41.1
RepAdapter_attn | Param 0.11 | Avg. 75.5 | Natural: 70.7 / 91.6 / 72.5 / 99.1 / 91.3 / 88.5 / 54.2 | Specialized: 84.1 / 95.7 / 85.1 / 74.6 | Structured: 81.6 / 69.1 / 50.4 / 81.9 / 79.5 / 45.6 / 34.6 / 41.9
VPT [19] | Param 0.53 | Avg. 72.0 | Natural: 78.8 / 90.8 / 65.8 / 98.0 / 88.3 / 78.1 / 49.6 | Specialized: 81.8 / 96.1 / 83.4 / 68.4 | Structured: 68.5 / 60.0 / 46.5 / 72.8 / 73.6 / 47.9 / 32.9 / 37.8
LoRA [18] | Param 0.29 | Avg. 74.5 | Natural: 67.1 / 91.4 / 69.4 / 98.8 / 90.4 / 85.3 / 54.0 | Specialized: 84.9 / 95.3 / 84.4 / 73.6 | Structured: 82.9 / 69.2 / 49.8 / 78.5 / 75.7 / 47.1 / 31.0 / 44.0
NOAH [38] | Param 0.36 | Avg. 75.5 | Natural: 69.6 / 92.7 / 70.2 / 99.1 / 90.4 / 86.1 / 53.7 | Specialized: 84.4 / 95.4 / 83.9 / 75.8 | Structured: 82.8 / 68.9 / 49.9 / 81.7 / 81.8 / 48.3 / 32.8 / 44.2
SSF [24] | Param 0.24 | Avg. 75.7 | Natural: 69.0 / 92.6 / 75.1 / 99.4 / 91.8 / 90.2 / 52.9 | Specialized: 87.4 / 95.9 / 87.4 / 75.5 | Structured: 75.9 / 62.3 / 53.3 / 80.6 / 77.3 / 54.9 / 29.5 / 37.9
RepAdapter | Param 0.22 | Avg. 76.1 | Natural: 72.4 / 91.6 / 71.0 / 99.2 / 91.4 / 90.7 / 55.1 | Specialized: 85.3 / 95.9 / 84.6 / 75.9 | Structured: 82.3 / 68.0 / 50.4 / 79.9 / 80.4 / 49.2 / 38.6 / 41.0

Table 2. Ablation studies on VTAB-1k. The base model is ViT-B/16, and the default setting is RepAdapter_attn, which only has an adapter before MHA. "Avg acc." denotes the average accuracy on VTAB-1k. "Act." denotes the use of activation functions. "Parallel" denotes that RepAdapter is placed in parallel to MHA. "Full sparse" means that all projections are group-wise. The best settings are marked in gray in the original table.

(a) Number of groups (groups | params | avg acc.): 1 | 0.16M | 75.3;  2 | 0.11M | 75.5;  4 | 0.09M | 74.9;  8 | 0.08M | 74.5.
(b) Hidden dimensions (dims | params | avg acc.): 4 | 0.05M | 74.4;  8 | 0.11M | 75.5;  12 | 0.16M | 75.3;  16 | 0.22M | 75.1.
(c) Adapter position (position | avg acc.): before attn | 75.5;  after attn | 74.9;  before mlp | 75.2;  after mlp | 74.6.
(d) Adapter variants (setting | avg acc.): default | 75.5;  w/ act. | 75.5;  parallel [2] | 75.2;  full sparse | 75.1.

Table 3. Cumulative ablation of RepAdapter_attn on VTAB-1k. We use ViT-B/16 as the base model.

Settings | Param (M) | Avg Acc.
Baseline (Adapter [16]) | 0.16 | 73.9
+ dense-sparse transformations | 0.11 | 74.5
+ pre-insertion | 0.11 | 75.5
+ linear structure | 0.11 | 75.5

4.3. Experimental Results

4.3.1 Comparison with the State of the Art

We first compare RepAdapter with the state-of-the-art (SOTA) PETL methods on ViT, as reported in Tab. 1. We first observe that all PETL methods outperform full fine-tuning by a large margin, while linear probing, which only tunes the classifier, performs much worse. These results confirm the effectiveness of PETL methods for ViT. Compared to these approaches, we can see that RepAdapter performs much better on VTAB-1k, e.g., +0.7% on Resisc45. In particular, RepAdapter_attn can achieve SOTA performance while being much more lightweight than all PETL approaches, i.e., 0.11M parameters. When employing RepAdapter in both MHA and FFN, the average performance can be further improved from 75.5 to 76.1, which outperforms the SOTA approach NOAH [38] by +0.6%. Compared to LoRA [18], which is also zero-cost during inference, RepAdapter also merits in performance and efficiency, e.g., +1.6% average accuracy. These results greatly validate the effectiveness and parameter efficiency of the proposed RepAdapter.

4.3.2 Ablation Studies

To gain deep insights into RepAdapter, we conduct extensive ablation studies in Tab. 2 - 3. In Tab. 3, we validate the effectiveness of each design in RepAdapter. From this table, the first observation is that the sparse structure and the new placement obtain obvious improvements on ViT-B/16, resulting in +0.6% and +1.0% accuracy, respectively. Meanwhile, we see that the activation function is less important under our settings. Overall, these results validate the benefits of RepAdapter's designs on vision models.

In Tab. 2a, we show the impact of the group number. When the number of groups is set to 1, RepAdapter is actually a dense network. Notably, this dense structure does not perform best despite having more parameters. Instead, increasing the number of groups to a certain point is beneficial to both performance and efficiency. Similar results can be found in Tab. 2b, which shows the impact of parameter size.
We observe that more parameters for RepAdapter do not always improve performance, which may be attributed to overfitting on small-scale downstream datasets; this also suggests the superiority of our sparse design. Tab. 2c shows the impact of the deployment location. It can be seen that pre-insertion is consistently better than post-insertion, while the latter is more commonly used [16, 31]. In Tab. 2d, we compare RepAdapter to its three variants. We can see that parallel adapters like AdaptFormer [2] are worse than RepAdapter. Besides, the fully sparse structure degrades performance, suggesting the importance of RepAdapter's dense part for information exchange. Overall, these results well validate the design of RepAdapter.

4.3.3 Efficiency Analysis

Table 4. Efficiency comparison of RepAdapter and existing PETL methods during inference. We use ViT-B/16 as the vision model. ∆P and ∆F denote the additional parameters and FLOPs introduced by the PETL methods. The inference speed is measured in images per second (imgs/sec) on an NVIDIA 3090 GPU. All results are the average of 100 trials.

Methods | ∆P | ∆F | GPU latency (imgs/sec): bs=1 | bs=4 | bs=16 | bs=128
Full tuning | 0 | 0 | 91.5 | 375.7 | 539.5 | 578.3
VPT [19]* | 0.55M | 5.60G | 86.1 (-5.9%) | 283.5 (-24.5%) | 381.5 (-29.2%) | 421.6 (-27.1%)
Adapter [16] | 0.16M | 0.03G | 70.9 (-22.5%) | 306.6 (-18.3%) | 504.7 (-6.4%) | 552.4 (-5.8%)
AdaptFormer [2] | 0.16M | 0.03G | 71.4 (-21.9%) | 309.9 (-17.5%) | 508.1 (-4.2%) | 555.2 (-3.9%)
NOAH [38]* | 0.12M | 0.02G | 72.1 (-21.2%) | 312.7 (-16.7%) | 492.9 (-8.6%) | 534.7 (-7.5%)
RepAdapter (ours) | 0 | 0 | 91.5 (-0.0%) | 375.7 (-0.0%) | 539.5 (-0.0%) | 578.3 (-0.0%)
* The module dimensions of VPT and NOAH are different across datasets, so we use the averaged dimensions to measure their latency.

In Tab. 4, we compare the inference speed of RepAdapter and existing PETL methods. Compared to RepAdapter, the four recent PETL approaches all slow down the inference to different degrees. For example, Adapter [16] is slower than RepAdapter by 22.5% when the batch size is 1, even though the extra computation it brings is only 0.03 GFLOPs. To explain, FLOPs mainly reflect the theoretical computation complexity; in practice, GPU latency is also affected by the network topology, e.g., the network depth. In this case, we can see that the additional FLOPs of visual prompt tuning (VPT) [19] are up to 5.6G, but its latency is smaller than that of visual adapters when the batch size is 1.

In Fig. 5, we further compare the training costs of these PETL methods. The first observation is that VPT [19] and NOAH [38] consume much more training resources. For example, due to the super-network training and sub-network search, NOAH requires about 5× the training time and 1.4× the GPU memory of full fine-tuning. Compared to full tuning, RepAdapter reduces about 25% of the training time and 20% of the GPU memory, well confirming its efficiency.

Figure 5. Comparisons of training time and memory overhead on an NVIDIA A100 GPU.
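For readers who want to reproduce this kind of measurement, the sketch below shows a generic way to time GPU inference in images per second (warm-up, torch.cuda.synchronize() around the timed loop, averaging over trials). It is a hedged, generic benchmarking recipe under assumed settings, not the script used for Tab. 4.

```python
# Generic GPU throughput measurement in images/sec (assumes a CUDA device).
import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, batch_size: int = 1, trials: int = 100,
               resolution: int = 224, device: str = "cuda") -> float:
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(trials):
        model(x)
    torch.cuda.synchronize()
    return batch_size * trials / (time.perf_counter() - start)
```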

4.3.4 Generalization Experiments

Few-shot learning and domain generalization. We further apply RepAdapter to CLIP [29] and validate it on few-shot learning and domain generalization, as shown in Tab. 5. We provide two different setups for CLIP, i.e., RepAdapter-V and RepAdapter-T, which deploy RepAdapter in the visual and text encoders, respectively. Compared to zero-shot CLIP [29], both RepAdapter-V and RepAdapter-T can improve the performance on the source and target datasets, e.g., +3.77% on ImageNet-A. Notably, RepAdapter-T even outperforms the SOTA soft-prompt approach, i.e., CoCoOp [43], by using a simple hand-crafted prompt of "a photo of a [CLASS]".

Results of more network architectures. In Tab. 6, we deploy RepAdapter to more vision models including ConvNeXt [27] and Swin-Transformer [26]. We compare RepAdapter with three baselines on VTAB-1k, i.e., full fine-tuning, linear probing and VPT. From Tab. 6, we can first see that linear probing performs much worse than full fine-tuning on all models. Besides, we also find that the generalization of VPT and LoRA is poor on the non-Transformer network, i.e., ConvNeXt [27]: their performance is about 5.3% lower than full fine-tuning. In stark contrast, our RepAdapter achieves significant performance gains over full fine-tuning on all models, e.g., +5% on ConvNeXt, strongly confirming its generalization ability.
Table 5. Results of adapting CLIP to 16-shot ImageNet classification and domain generalization. ViT-B/16 is used as the visual backbone. RepAdapter-V and RepAdapter-T denote the adaptations of the visual and text encoders of CLIP, respectively. The hand-crafted prompt "a photo of a [CLASS]" is used for tuning.

Method | ImageNet (Source) | -V2 | -S | -A | -R (Target)
Zero-shot [29] | 66.73 | 60.83 | 46.15 | 47.77 | 73.96
CoOp [42] | 71.51 | 64.20 | 47.99 | 49.71 | 75.21
CoCoOp [43] | 71.02 | 64.07 | 48.75 | 50.63 | 76.18
RepAdapter-V | 70.93 | 64.00 | 48.40 | 45.53 | 75.77
RepAdapter-T | 71.87 | 64.77 | 49.30 | 51.13 | 76.47

Table 6. Results of RepAdapter on different network architectures on VTAB-1k. "Avg." denotes the average accuracy. "Nat.", "Spe." and "Str." are the average accuracies of the Natural, Specialized and Structured groups, respectively.

Model | Method | Avg. | Nat. | Spe. | Str.
Convolutional Network:
ConvNeXt-B | Full | 74.0 | 78.0 | 83.7 | 60.4
ConvNeXt-B | Linear | 63.6 | 74.5 | 81.5 | 34.8
ConvNeXt-B | VPT | 68.7 | 78.5 | 83.0 | 44.6
ConvNeXt-B | LoRA | 72.1 | 79.2 | 83.4 | 53.8
ConvNeXt-B | RepAdapter | 79.0 | 83.5 | 86.7 | 66.8
Hierarchical Vision Transformer:
Swin-B | Full | 75.0 | 79.2 | 86.2 | 59.7
Swin-B | Linear | 62.6 | 73.5 | 80.8 | 33.5
Swin-B | VPT | 71.6 | 76.8 | 84.5 | 53.4
Swin-B | RepAdapter | 77.4 | 82.7 | 87.5 | 62.0
Vision Transformer:
ViT-B/16 | Full | 68.9 | 75.9 | 83.4 | 47.6
ViT-B/16 | Linear | 57.6 | 68.9 | 77.2 | 26.8
ViT-B/16 | VPT | 72.0 | 78.5 | 82.4 | 55.0
ViT-B/16 | RepAdapter | 76.0 | 81.6 | 85.4 | 61.2

Table 7. Comparison of RepAdapter and state-of-the-art PETL methods on video classification. For all methods, the backbone is ViT-B/16 pre-trained by VideoMAE [32]. We report top-1 accuracy on the val sets.

Method | Params (M) | SSv2 | HMDB51
Full tuning [32] | 86.04 | 53.97 | 46.41
Linear probe [32] | 0.07 | 29.23 | 49.84
VPT [19] | 0.08 | 43.73 | 52.67
AdaptFormer-1 [2] | 0.10 | 50.03 | 51.68
AdaptFormer-4 [2] | 0.15 | 54.70 | 51.81
AdaptFormer-64 [2] | 1.26 | 59.02 | 55.69
RepAdapter-2 | 0.15 | 55.26 | 55.67
RepAdapter-16 | 0.53 | 60.52 | 59.21

Table 8. Results of RepAdapter on semantic segmentation. SETR [39] is the vision model, and we report its mIoU scores on the ADE20K [41] val set. "mIoU-SS" and "mIoU-MS" denote the results of single- and multi-scale predictions, respectively.

Methods | Params (M) | mIoU-SS | mIoU-MS
Full tuning [39] | 318.31 | 48.31 | 50.07
Head only [39] | 13.18 | 35.12 | 37.46
Bias [19] | 13.46 | 43.40 | 45.33
VPT [19] | 13.43 | 42.11 | 44.06
VPT + Bias [19] | 15.79 | 44.04 | 45.63
RepAdapter | 13.82 | 44.44 | 46.71

Results of more vision tasks. In Tab. 7, we compare RepAdapter with VPT [19] and AdaptFormer [2] on video classification. The first observation is that video classification is difficult for VPT, so its accuracy is inferior to full tuning on SSv2 [12]. Meanwhile, we find that AdaptFormer can outperform full tuning with much fewer parameters, and its best performance only requires 1.26M parameters. Even so, RepAdapter is consistently better than AdaptFormer at similar parameter scales. For example, RepAdapter-16 outperforms AdaptFormer-64 by +3.52% on HMDB51 while saving more than 50% of the parameters.

Afterwards, we validate RepAdapter on SETR [39] for semantic segmentation in Tab. 8. This adaptation is challenging due to the huge gap between the objectives of pre-training and the downstream task. In this case, we can see that only fine-tuning the head results in very poor performance, i.e., -13.19% mIoU-SS. Meanwhile, the performance of VPT is still inferior to tuning the bias terms (Bias). Compared to these approaches, RepAdapter demonstrates better adaptation. With fewer parameters, RepAdapter outperforms the best PETL solution "VPT+Bias" by +1.08 mIoU under the multi-scale prediction setting (mIoU-MS).

5. Conclusions

In this paper, we focus on parameter-efficient transfer learning (PETL) for giant vision models and propose a novel PETL method, termed RepAdapter. The most outstanding property of RepAdapter is that its parameters can be completely merged into the pre-trained vision model via structural re-parameterization, thereby incurring zero extra cost during inference. In addition, RepAdapter is still more effective than existing PETL approaches due to its novel sparse structure and our careful deployment. To validate RepAdapter, we apply it to a set of large vision models and conduct extensive experiments on 27 datasets of three vision tasks. Experimental results well confirm its superiority in terms of efficiency, performance and generalization.
Acknowledgements. This work was supported by the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), the Guangdong Basic and Applied Basic Research Foundation (No. 2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002, No. 2022J06001).

References

[1] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.
[2] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. AdaptFormer: Adapting vision transformers for scalable visual recognition. CoRR, abs/2205.13535, 2022.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[5] Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Jungong Han, and Guiguang Ding. RepMLPNet: Hierarchical vision MLP with re-parameterized locality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 578–587, 2022.
[6] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1911–1920, 2019.
[7] Xiaohan Ding, Tianxiang Hao, Jianchao Tan, Ji Liu, Jungong Han, Yuchen Guo, and Guiguang Ding. ResRep: Lossless CNN pruning via decoupling remembering and forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4510–4520, 2021.
[8] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Diverse branch block: Building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10886–10895, 2021.
[9] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11963–11975, 2022.
[10] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. RepVGG: Making VGG-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[12] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
[13] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[14] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.
[15] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021.
[16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, 2019.
[17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[19] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022.
[20] Shibo Jie and Zhi-Hong Deng. Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039, 2022.
[21] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
[22] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[23] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL/IJCNLP, 2021.
[24] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[25] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
[26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[27] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.
[28] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Towards lightweight transformer via group-wise transformation for vision-and-language tasks. IEEE Transactions on Image Processing, 31:3386–3398, 2022.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, Proceedings of Machine Learning Research, 2021.
[30] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
[31] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
[32] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[34] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019.
[35] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[36] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022.
[37] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, André Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, and Neil Houlsby. The visual task adaptation benchmark. CoRR, abs/1910.04867, 2019.
[38] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. CoRR, abs/2206.04673, 2022.
[39] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.
[40] Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [MASK]: Learning vs. learning to recall. arXiv preprint arXiv:2104.05240, 2021.
[41] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
[42] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. CoRR, abs/2109.01134, 2021.
[43] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022.
