2204.07118v1
Meta AI, Sorbonne University
Abstract
A Vision Transformer (ViT) is a simple neural architecture amenable to serving
several computer vision tasks. It has limited built-in architectural priors, in
contrast to more recent architectures that incorporate priors either about the
input data or about specific tasks. Recent works show that ViTs benefit from self-
supervised pre-training, in particular BERT-like pre-training such as BEiT.
In this paper, we revisit the supervised training of ViTs. Our procedure
builds upon and simplifies a recipe introduced for training ResNet-50. It in-
cludes a new, simple data-augmentation procedure with only 3 augmentations,
closer to the practice in self-supervised learning. Our evaluations on image
classification (ImageNet-1k with and without pre-training on ImageNet-21k),
transfer learning and semantic segmentation show that our procedure outper-
forms previous fully supervised training recipes for ViT by a large margin. It
also reveals that the performance of our ViT trained with supervision is com-
parable to that of more recent architectures. Our results could serve as better
baselines for recent self-supervised approaches demonstrated on ViT.
[Figure 1 plots: top-1 accuracy (%) versus FLOPs (×10^9). Left (ImageNet-1k): ViT-S/B/L/H trained with our recipe compared to DeiT, He et al. (with and without EMA) and Steiner et al. Right (ImageNet-21k pre-training): ViT-B and ViT-L at 224 or 384 with our recipe compared to Dosovitskiy et al. and Steiner et al.]
Figure 1: Comparison of training recipes for (left) vanilla vision transformers trained on ImageNet-1k
and evaluated at resolution 224×224, and (right) pre-trained on ImageNet-21k at 224×224 and fine-
tuned on ImageNet-1k at resolution 224×224 or 384×384.
1 Introduction
After their vast success in NLP, transformer models [55] and their derivatives
are increasingly popular in computer vision. They are now used in image
classification [13], detection & segmentation [3], video analysis, etc. In particular,
the vision transformers (ViT) of Dosovitskiy et al. [13] are a reasonable alternative
to convolutional architectures. This supports the adoption of transformers as a
general architecture able to learn convolutions as well as longer-range operations
through the attention process [5, 8]. In contrast, convolutional networks [20, 27, 29,
41] implicitly offer built-in translation invariance, so their training does
not have to learn this prior. It is therefore not surprising that hybrid architectures
that include convolutions converge faster than vanilla transformers [18].
Because they incorporate as priors only the co-localisation of pixels in patches,
transformers have to learn about the structure of images while optimizing the
model such that it processes the input with the objective of solving a given task.
This can be either reproducing labels in the supervised case, or other proxy tasks
in the case of self-supervised approaches. Nevertheless, despite their huge suc-
cess, there have been only a few works in computer vision studying how to efficiently
train vision transformers, in particular on a midsize dataset like ImageNet-
1k. Since the work of Dosovitskiy et al. [13], training procedures have mostly been
variants of the proposals of Touvron et al. [48] and Steiner et al. [42]. In con-
trast, multiple works have proposed alternative architectures by introducing pool-
ing, more efficient attention, or hybrid architectures re-incorporating convolutions
and a pyramid structure. These new designs, while being particularly effective
for some tasks, are less general. One difficult question to address is whether the
improved performance is due to a specific architectural design, or because the design
facilitates the optimization, as has been suggested for convolutions combined with ViTs [60].
Recently, self-supervised approaches inspired by the popular BERT pre-training
have raised hopes for a "BERT moment" in computer vision. There are some analo-
gies between the fields of NLP and computer vision, starting with the transformer
architecture itself. However, these fields are not identical in every way: the modal-
ities processed are of a different nature (continuous versus discrete). Computer vi-
sion offers large annotated databases like ImageNet [40], and fully supervised pre-
training on ImageNet is effective for handling different downstream tasks such as
transfer learning [37] or semantic segmentation.
Without further work on fully supervised approaches on ImageNet, it is diffi-
cult to conclude whether the intriguing performance of self-supervised approaches like
BeiT [2] is due to the training itself, e.g. data augmentation, regularization, optimiza-
tion, or to an underlying mechanism that is capable of learning more general
implicit representations. In this paper, we do not claim to answer this dif-
ficult question, but we want to feed this debate by renewing the training pro-
cedure for vanilla ViT architectures. We hope to contribute to a better under-
standing of how to fully exploit the potential of transformers and of the impor-
tance of BERT-like pre-training. Our work builds upon the recent state of the art
on fully supervised and self-supervised approaches, with new insights regarding
data-augmentation. We propose new training recipes for vision transformers on
ImageNet-1k and ImageNet-21k. The main ingredients are as follows:
• We build upon the work of Wightman et al. [57] introduced for ResNet-50. In
particular, we adopt a binary cross-entropy loss for ImageNet-1k-only training.
We adapt this method by including ingredients that significantly improve the
training of large ViTs [51], namely stochastic depth [24] and LayerScale [51].
• 3-Augment: a simple data augmentation inspired by those employed in
self-supervised learning. Surprisingly, with ViT we observe that it works bet-
ter than the usual automatic/learned data-augmentation policies employed to train
vision transformers, such as RandAugment [6].
• Simple Random Cropping is more effective than Random Resized Cropping
when pre-training on a larger set like ImageNet-21k.
• A lower resolution at training time. This choice reduces the train-test dis-
crepancy [53] but has not been much exploited with ViT. We observe that it
also has a regularizing effect for the largest models by preventing overfitting.
For instance, for a target resolution of 224 × 224, a ViT-H pre-trained at res-
olution 126 × 126 (81 tokens) achieves a better performance on ImageNet-1k
than when pre-training at resolution 224 × 224 (256 tokens). This is also less
demanding at pre-training time, since there are 70% fewer tokens. From this
perspective, it offers scaling properties similar to those of masked autoencoders [19]
(see the token-count sketch after this list).
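To make the token-count argument concrete, the short sketch below computes the number of patch tokens at different resolutions; the 14×14 patch size for ViT-H is our assumption, consistent with the 81 and 256 token counts quoted above.

```python
def num_patch_tokens(resolution: int, patch_size: int) -> int:
    """Number of patch tokens for a square input (class token excluded)."""
    return (resolution // patch_size) ** 2

# Assuming a ViT-H with 14x14 patches:
low  = num_patch_tokens(126, 14)           # 9 * 9   = 81 tokens
full = num_patch_tokens(224, 14)           # 16 * 16 = 256 tokens
print(low, full, f"{1 - low / full:.0%}")  # 81 256 68% -> roughly 70% fewer tokens
```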
Our “new” training strategies do not saturate with the largest models, making
another step beyond the Data-Efficient Image Transformer (DeiT) by Touvron et
al. [48]. As a result, we obtain a competitive performance in image classification
and segmentation, even when compared to recent popular architectures such as
SwinTransformers [31] or modern convnet architectures like ConvNext [32]. Below
we point out a few interesting outcomes.
• We leverage models with more capacity even on midsize datasets. For in-
stance, we reach 85.2% top-1 accuracy when training a ViT-H on ImageNet-1k
only, an improvement of +5.1% over the best ViT-H with a supervised
training procedure reported in the literature at resolution 224×224.
• We also report results on ImageNet-V2 [39], which indicate that our trained
models generalize better to another validation set than most prior works.
• An ablation on the effect of the crop ratio employed in transfer learning clas-
sification tasks. We observe that it has a noticeable impact on the perfor-
mance but that the best value depends a lot on the target dataset/task.
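As background for this ablation, the test-time crop ratio usually enters the evaluation pipeline as in the minimal sketch below, which assumes the common resize-to-resolution/crop_ratio-then-center-crop convention; the exact pipeline used in the paper may differ.

```python
import torchvision.transforms as T

def eval_transform(resolution: int = 224, crop_ratio: float = 1.0):
    """Center-crop evaluation pipeline parameterized by the test-time crop ratio."""
    resize_size = int(resolution / crop_ratio)  # e.g. crop_ratio=0.875 -> resize to 256 for a 224 crop
    return T.Compose([
        T.Resize(resize_size, interpolation=T.InterpolationMode.BICUBIC),
        T.CenterCrop(resolution),
        T.ToTensor(),
    ])
```

With crop_ratio = 1.0 (our default for transfer learning), the image is resized so that its smallest side equals the target resolution and the center crop keeps almost the whole image.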
2 Related work
Vision Transformers were introduced by Dosovitskiy et al. [13]. This architec-
ture, which derives from the transformer by Vaswani et al. [55], is now used as an
alternative to convnets in many tasks: image classification [13, 48], detection [3, 31],
semantic segmentation [2, 31], video analysis [17, 35], to name only a few. This
greater flexibility typically comes with the downside that they need larger datasets,
or that the training must be adapted when data is scarcer [14, 48]. Many vari-
ants have been introduced to reduce the cost of attention, for example with more
efficient attention mechanisms [16, 17, 31] or pooling layers [21, 31, 56]. Some
papers re-introduce spatial biases specific to convolutions within hybrid architec-
tures [18, 58, 60]. These models are less general than vanilla transformers but gen-
erally perform well in certain computer vision tasks, because their architectural
priors reduce the need to learn from scratch the task biases. This is especially im-
portant for smaller models, where specialized models do not have to devote some
capacity to reproduce known priors such as translation invariance. The models are
formally less flexible but they do not require sophisticated training procedures.
Training procedures: The first procedure proposed in the ViT paper [13] was
mostly effective for larger models trained on large datasets. In particular, ViT
models were not competitive with convnets when trained from scratch on ImageNet. Tou-
vron et al. [48] showed that, by adapting the training procedure, it is possible to
achieve a performance comparable to that of convnets with ImageNet-only training.
After this Data-Efficient Image Transformer (DeiT) procedure, only a few adapta-
tions have been proposed to improve the training of vision transformers. Steiner et
al. [42] published a thorough study on how to train vision transformers on dif-
ferent datasets, with a complete ablation of the different training components.
Their results on ImageNet [40] are slightly inferior to those of DeiT but they re-
port improvements on ImageNet-21k compared to Dosovitskiy et al. [13]. The self-
supervised approach referred to as masked auto-encoder (MAE) [19] proposes an
improved supervised baseline for the larger ViT models.
In the setting with pre-training on ImageNet-21k and fine-tuning on ImageNet-1k, BeiT [2] requires a
full 90-epoch fine-tuning on ImageNet-21k followed by another full fine-tuning on
ImageNet-1k to reach its best performance, suggesting that a large labeled dataset
is needed for BeiT to realize its best potential. A recent work suggests that such
auto-encoders are mostly interesting in a data-starved context [15], but this ques-
tions their advantage in the case where more labelled data is actually available.
LayerScale. We use LayerScale [51]. This method was introduced to facilitate the
convergence of deep transformers. With our training procedure we do not have
convergence problems; however, we observe that LayerScale allows the largest
models to reach a higher accuracy. In the original paper [51], the
initialization of LayerScale is adapted according to the depth. In order to simplify
the method, we use the same initialization (10^-4) for all our models.
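For reference, the sketch below shows the core LayerScale operation in PyTorch: a learnable per-channel scaling of each residual branch output, initialized here to 10^-4 as stated above (the block wiring in the trailing comment is a simplification, not our exact implementation).

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel learnable scaling of a residual branch output (LayerScale [51])."""
    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x

# Simplified use inside a transformer block:
#   x = x + layerscale_attn(attention(norm1(x)))
#   x = x + layerscale_mlp(mlp(norm2(x)))
```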
Binary Cross entropy. Wightman et al. [57] adopt a binary cross-entropy (BCE)
loss instead of the more common cross-entropy (CE) to train ResNet-50. They con-
clude that the gains are limited compared to the CE loss, but that this choice is more
convenient when employed with Mixup [64] and CutMix [63]. For larger ViTs and
with our training procedure on ImageNet-1k, the BCE loss provides a significant
improvement in performance, see the ablation in Table 4. We did not achieve com-
pelling results during our exploration phase on ImageNet-21k, and therefore keep
CE when pre-training with this dataset as well as for the subsequent fine-tuning.
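As an illustration, here is a minimal sketch of a BCE objective over soft targets (e.g. after Mixup/CutMix); the target representation is an assumption about how mixed labels can be encoded, not the exact implementation used in our code base.

```python
import torch
import torch.nn.functional as F

def bce_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over classes; `target` holds soft labels (e.g. after Mixup/CutMix)."""
    return F.binary_cross_entropy_with_logits(logits, target)

# Example: batch of 2 samples, 5 classes, with mixed targets for the first sample.
logits = torch.randn(2, 5)
target = torch.tensor([[0.0, 0.7, 0.0, 0.3, 0.0],
                       [1.0, 0.0, 0.0, 0.0, 0.0]])
loss = bce_loss(logits, target)
```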
Table 1: Summary of our training procedures with ImageNet-1k and ImageNet-21k. We also provide
the DeiT [48], Wightman et al. [57] and Steiner et al. [42] baselines for reference. Adapt. means that the
hyper-parameter is adapted to the size of the model. For fine-tuning to a higher resolution with models
pre-trained on ImageNet-1k only, we use the fine-tuning procedure from DeiT; see Section A for more details.
3.2 Data-augmentation
Since the advent of AlexNet, there have been significant modifications to the data-
augmentation procedures employed to train neural networks. Interestingly, the
same data augmentation, like RandAugment [6], is widely employed for ViT even though
its policy was initially learned for convnets. Given that the architectural priors
and biases are quite different in these architectures, the augmentation policy may
not be well adapted, and is possibly overfitted considering the large number of choices
involved in its selection. We therefore revisit this prior choice.
Our 3-Augment proposal relies on the following three simple transformations:
• Grayscale: this favors color invariance and gives more focus to shapes.
• Solarization: this adds strong noise on the color, to be more robust to the
variation of color intensity, and so focuses more on shape.
• Gaussian blur: this slightly alters the details of the image.
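The sketch below illustrates one way to implement such a 3-Augment-style pipeline with torchvision, assuming that each image receives exactly one of the three transformations chosen uniformly at random, on top of the usual cropping, horizontal flip and color-jitter; the exact magnitudes (solarization threshold, blur kernel size) are assumptions, not released values.

```python
import torchvision.transforms as T

def three_augment(img_size: int = 224, color_jitter: float = 0.3):
    """Sketch of a 3-Augment-style pipeline: one of {grayscale, solarization, blur}
    applied per image, on top of crop / flip / color-jitter (magnitudes are assumptions)."""
    one_of_three = T.RandomChoice([
        T.Grayscale(num_output_channels=3),
        T.RandomSolarize(threshold=128, p=1.0),
        T.GaussianBlur(kernel_size=9),          # kernel size chosen for illustration
    ])
    return T.Compose([
        T.RandomResizedCrop(img_size, interpolation=T.InterpolationMode.BICUBIC),
        T.RandomHorizontalFlip(),
        one_of_three,
        T.ColorJitter(color_jitter, color_jitter, color_jitter),
        T.ToTensor(),
    ])
```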
                 Data-Augmentation                          ImageNet-1k
ColorJitter   Grayscale   Gaussian Blur   Solarization    Val    Real    V2
    0.3           ✗             ✗              ✗          81.4   86.1   70.3
    0.3           ✓             ✗              ✗          81.0   86.0   69.7
    0.3           ✓             ✓              ✗          82.7   87.6   72.7
    0.3           ✓             ✓              ✓          83.1   87.7   72.6
    0.0           ✓             ✓              ✓          83.1   87.7   72.0
Table 2: Ablation of the components of our data-augmentation strategy with ViT-B on ImageNet-1k.
[Figure 2: example images for the original image and its Gaussian blur, grayscale and solarization variants.]
3.3 Cropping
Random Resized Crop (RRC) was introduced in the GoogLeNet [43] paper. It
serves as a regularisation to limit model overfitting, while encouraging the deci-
sion made by the model to be invariant to a certain class of transformations. This data
augmentation was deemed important on ImageNet-1k to prevent overfitting, which
happens to occur rapidly with modern large models.
Figure 3: Examples of crops selected by the two strategies: Random Resized Crop (RRC, left) and Simple Random Crop (SRC, right).
Simple Random Crop (SRC) is a much simpler way to extract crops. It is similar
to the original cropping choice proposed in AlexNet [27]: we resize the image such
that its smallest side matches the training resolution, then apply a reflect
padding of 4 pixels on all sides, and finally take a square crop of the training size
whose position is randomly selected along the x-axis of the image.
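A minimal torchvision sketch of this cropping strategy is given below; the horizontal flip and bicubic interpolation are assumptions added for a complete pipeline, not stated parts of SRC.

```python
import torchvision.transforms as T

def simple_random_crop(resolution: int = 224):
    """Simple Random Crop sketch: resize the smallest side to the training resolution,
    reflect-pad by 4 pixels on all sides, then take a random square crop."""
    return T.Compose([
        T.Resize(resolution, interpolation=T.InterpolationMode.BICUBIC),  # smallest side -> resolution
        T.RandomCrop(resolution, padding=4, padding_mode="reflect"),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
```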
Figure 3 visualizes the cropping boxes sampled by RRC and SRC. RRC provides a
lot of diversity and very different crop sizes. In contrast, SRC covers a much
larger fraction of the image overall and preserves the aspect ratio, but offers less di-
versity: the crops overlap significantly. As a result, when training on ImageNet-
1k the performance is better with the commonly used RRC. For instance, a ViT-S
loses 0.9% top-1 accuracy if we do not use RRC.
However, in the case of ImageNet-21k (×10 bigger than ImageNet-1k), there is
less risk of overfitting, and the increased regularisation and diversity offered by
RRC is less important. In this context, SRC offers the advantage of reducing the
discrepancy in apparent size and aspect ratio. More importantly, it gives a higher
chance that the actual label of the image matches that of the crop: RRC is relatively
aggressive in terms of cropping, and in many cases the labelled object is not even
present in the crop, as shown in Figure 4, where some of the crops do not contain the
labelled object. For instance, with RRC there is a crop with no zebra in the left example,
and no train in three of the crops from the middle example. This is less likely
to happen with SRC, which covers a much larger fraction of the image pixels. In
Table 5 we provide an ablation of random resized crop on ImageNet-21k, where
we see that these observations translate as a significant gain in performance.
Figure 4: Illustration of Random Resized Crop (RRC) and Simple Random Crop (SRC). The usual RRC
is a more aggressive data-augmentation than SRC: it has a stronger regularizing effect and
avoids overfitting by giving more variability to the images. At the same time, it introduces a discrepancy
of scale and aspect-ratio. It also leads to labeling errors, for instance when the object is not in the
cropped region (e.g., train or boat). On ImageNet-1k this regularization is overall regarded as beneficial.
However, our experiments show that it is detrimental on ImageNet-21k, which is less prone to overfitting.
4 Experiments
This section includes multiple experiments in image classification, with a special
emphasis on ImageNet-1k [9, 39, 40]. We also report results for downstream tasks
in fine-grained classification and segmentation. We include a large number of ab-
lations to better analyze different effects, such as the importance of the training
resolution and longer training schedules. We provide additional results in the ap-
pendices.
4.2 Ablations
4.2.1 Impact of training duration
In Figure 5 we provide an ablation on the number of epochs, which shows that ViT
models trained with our procedure do not saturate as rapidly as with the DeiT training procedure [48]
when we increase the number of epochs beyond the 400 epochs adopted for our baseline.
For ImageNet-21k pre-training, we use 90 epochs, as in a few prior
works [31, 49]. We fine-tune for 50 epochs on ImageNet-1k [49] and marginally
adapt the stochastic depth parameter. We point out that this choice is mostly for
the sake of consistency across models: we observe that training for 30 epochs also
provides similar results.
4.2.2 Data-Augmentation
In Table 3 we compare our handcrafted data-augmentation 3-Augment with exist-
ing learned augmentation methods. With the ViT architecture, our data-augmentation
is the most effective while being simpler than the other approaches. Since previ-
ous augmentations were introduced on convnets, we also provide results for a
ResNet-50. In this case previous augmentation policies have similar (RandAug-
ment, Trivial-Augment) or better results (Auto-Augment) on the validation set.
[Figure 5 plot: ImageNet top-1 (%) versus training epochs for Ours: ViT-S/B/L and DeiT: ViT-B. Figure 6 plot: top-1 (%) versus test-time crop-ratio for CARS, CIFAR-10, CIFAR-100, Flowers, INAT-18 and INAT-19.]
Figure 5: Top-1 accuracy on ImageNet-1k only at resolution 224×224 with our training recipe and different numbers of epochs.
Figure 6: Transfer learning performance on 6 datasets with different test-time crop ratios, for a ViT-B pre-trained on ImageNet-1k at resolution 224.
Table 3: Comparison of some existing data-augmentation methods with our simple 3-Augment pro-
posal inspired by data-augmentation used with self-supervised learning.
This is no longer the case when evaluating on the independent set V2, for which
the better accuracy of Auto-Augment is not significant.
Model   Loss   LayerScale   Data Aug.     Epochs   ImageNet-1k
                                                    val    real   v2
ViT-S   CE     ✗            RandAugment   300      79.8   85.3   68.1
ViT-S   BCE    ✗            RandAugment   300      79.8   85.9   68.2
Table 4: Ablation on different training components with training at resolution 224 × 224 on ImageNet-
1k. We perform ablations with ViT-S, ViT-B and ViT-L. We report top-1 accuracy (%) on the ImageNet
validation set, ImageNet-real and ImageNet-v2.
Table 5: Ablation path: augmentation and regularization with ImageNet-21k pre-training (at reso-
lution 224×224) and ImageNet-1k fine-tuning. We measure the impact of changing Random Resized
Crop (RRC) to Simple Random Crop (SRC), adding LayerScale (LS), removing Mixup, replacing Ran-
dAugment (RA) by 3-Augment (3A), and finally employing a larger number of epochs during the
pre-training phase on ImageNet-21k. All experiments are done with seed 0 and fixed hparams, except
the drop-path rate of stochastic depth, which depends on the model and is increased by 0.05 for the
longer pre-training. We report top-1 accuracy with 2 digits, but note that the standard deviation is
around 0.1 on our ViT-B baseline. Note that all these changes are neutral w.r.t. complexity except in the
last row, where the fine-tuning at resolution 384×384 significantly increases the complexity.
Model         Epochs           Resolution              ImageNet top-1 acc.
              Train.    FT     Train.       FT         val     real    v2
ViT-B         400       20     128×128      224×224    83.2    88.1    73.2
ViT-B         400       20     160×160      224×224    83.3    88.0    73.4
ViT-B         400       20     192×192      224×224    83.5    88.0    72.8
ViT-B         400       20     224×224      224×224    83.1    87.7    72.6
ViT-B         800       20     128×128      224×224    83.5    88.3    73.4
ViT-B         800       20     160×160      224×224    83.6    88.2    73.5
ViT-B         800       20     192×192      224×224    83.8    88.2    73.6
ViT-B         800       20     224×224      224×224    83.7    88.1    73.1
ViT-L         400       20     128×128      224×224    83.9    88.8    74.3
ViT-L         400       20     160×160      224×224    84.4    88.8    74.3
ViT-L         400       20     192×192      224×224    84.5    88.8    75.1
ViT-L         400       20     224×224      224×224    84.2    88.6    74.3
ViT-L         800       20     128×128      224×224    84.5    88.9    74.7
ViT-L         800       20     160×160      224×224    84.7    88.9    75.2
ViT-L         800       20     192×192      224×224    84.9    88.7    75.1
ViT-L         800       20     224×224      224×224    84.5    88.8    75.0
ViT-H         400       20     126×126      224×224    84.7    89.2    75.2
ViT-H         400       20     154×154      224×224    85.1    89.3    75.3
ViT-H         400       20     182×182      224×224    85.1    89.2    75.4
ViT-H         400       20     224×224      224×224    84.8    89.1    75.3
ViT-H         800       20     126×126      224×224    85.1    89.2    75.6
ViT-H         800       20     154×154      224×224    85.2    89.2    75.9
ViT-H         800       20     182×182      224×224    85.1    88.9    75.9
ViT-H         800       20     224×224      224×224    84.6    88.5    74.9
ViT-H-52      400       20     126×126      224×224    84.9    89.2    75.6
ViT-H-26×2    400       20     126×126      224×224    84.9    89.1    75.3
Table 6: We compare ViT architectures pre-trained on ImageNet-1k only with different training reso-
lutions, followed by a fine-tuning at resolution 224 × 224. We benefit from the FixRes effect [53] and
get better performance with a lower training resolution (e.g. resolution 160 × 160 with patch size 16
represents 100 tokens vs. 196 for 224 × 224, i.e. a reduction of about 50% of the number of tokens),
which therefore reduces the training cost and increases the performance. In that re-
spect its effect is comparable to that of MAE [19]. We also report results with a ViT-H with
52 layers and a parallel ViT-H with 26×2 layers [50], models with 1B parameters. Thanks to the
lower-resolution training, these models are easier to train.
[Figure 7 plots: ImageNet-v2 top-1 (%) versus ImageNet top-1 (%). Left (ImageNet-1k training): Ours: ViT, ConvNeXt, DeiT: ViT, EfficientNet, EfficientNet-V2, RSB: ResNet, Swin. Right (ImageNet-21k pre-training): Ours: ViT, ConvNeXt, EfficientNet-V2, Swin.]
Figure 7: Generalization experiment: top-1 accuracy on ImageNet1k-val versus ImageNet-v2 for mod-
els in Table 7 and Table 8. We display a linear interpolation of all points in order to compare the
generalization capability (or level of overfitting) for the different models.
Table 7: Classification with Imagenet1k training. We compare architectures with comparable FLOPs
and number of parameters. All models are trained on ImageNet1k only without distillation nor self-
supervised pre-training. We report Top-1 accuracy on the validation set of ImageNet1k and ImageNet-
V2 with different measures of complexity: throughput, FLOPs, number of parameters and peak memory
usage. The throughput and peak memory are measured on a single V100-32GB GPU with batch size
fixed to 256 and mixed precision. For ResNet [20] and RegNet [38] we report the improved results from
Wightman et al. [57]. Note that different models may have received a different optimization effort. ↑R
indicates that the model is fine-tuned at the resolution R and -R indicates that the model is trained at
resolution R.
Table 8: Classification with Imagenet-21k training. We compare architectures with comparable FLOPs
and number of parameters. All models are trained on ImageNet-21k without distillation nor self-
supervised pre-training. We report Top-1 accuracy on the validation set of ImageNet-1k and ImageNet-
V2 with different measures of complexity: throughput, FLOPs, number of parameters and peak memory
usage. The throughput and peak memory are measured on a single V100-32GB GPU with batch size
fixed to 256 and mixed precision. For Swin-L we decrease the batch size to 128 in order to avoid an
out-of-memory error and re-estimate the memory consumption. ↑R indicates that the model is fine-tuned at
the resolution R.
Model    Method   Pretrained    # pre-training     # finetuning    ImageNet
                  data          epochs             epochs          val     Real    V2
ViT-B    BeiT     INET-1k       300                100 (1k)        82.9
ViT-B    BeiT     INET-1k       800                100 (1k)        83.2
ViT-B    MAE⋆     INET-1k       1600               100 (1k)        83.6    88.1    73.2
ViT-B    Ours     INET-1k       400 (1k)           20 (1k)         83.5    88.0    72.8
ViT-B    Ours     INET-1k       800 (1k)           20 (1k)         83.8    88.2    73.6
ViT-L    BeiT     INET-1k       800                30 (1k)         85.2
ViT-L    MAE      INET-1k       400                50 (1k)         84.3
ViT-L    MAE      INET-1k       800                50 (1k)         84.9
ViT-L    MAE      INET-1k       1600               50 (1k)         85.1
ViT-L    MAE⋆     INET-1k       1600               50 (1k)         85.9    89.4    76.5
ViT-L    Ours     INET-1k       400 (1k)           20 (1k)         84.5    88.8    75.1
ViT-L    Ours     INET-1k       800 (1k)           20 (1k)         84.9    88.7    75.1
ViT-B    BeiT     INET-21k      150                50 (1k)         83.7    88.2    73.1
ViT-B    BeiT     INET-21k      150 + 90 (21k)     50 (1k)         85.2    89.4    75.4
ViT-B    Ours     INET-21k      90 (21k)           50 (1k)         85.2    89.4    76.1
ViT-B    Ours     INET-21k      240 (21k)          50 (1k)         85.7    89.5    76.5
ViT-L    BeiT     INET-21k      150                50 (1k)         86.0    89.6    76.7
ViT-L    BeiT     INET-21k      150 + 90 (21k)     50 (1k)         87.5    90.1    78.8
ViT-L    Ours     INET-21k      90 (21k)           50 (1k)         86.8    89.9    78.3
ViT-L    Ours     INET-21k      240 (21k)          50 (1k)         87.0    90.0    78.6
Table 9: Comparison of self-supervised pre-training with our approach. As our approach is fully super-
vised, this table is given as an indication. All models are evaluated at resolution 224 × 224. We report
image classification results on ImageNet val, real and v2 in order to evaluate overfitting. (21k) indicates
a fine-tuning with labels on ImageNet-21k and (1k) indicates a fine-tuning with labels on ImageNet-1k. ⋆
denotes the improved setting of MAE using the pixel (w/ norm) loss.
We have not made a comparison with approaches such as PeCo [12], which uses an
auxiliary model as a psycho-visual loss, or iBOT [66], which uses multi-crop and
an exponential moving average of the model.
Table 10: We compare transformer-based models on different transfer learning tasks with ImageNet-
1k pre-training. We report results with our default training on ImageNet-1k (400 epochs at resolution
224 × 224). We also report results with convolutional architectures for reference. For consistency we
keep our crop ratio equal to 1.0 on all datasets. Other works use 0.875, which is better for iNat-19 and
iNat-18, see Figure 6.
Table 11: ADE20k semantic segmentation performance using UperNet [59] (in comparable settings [11,
16, 31]). All models are pre-trained on ImageNet-1k, except models with the † symbol, which are pre-trained
on ImageNet-21k. We report the pre-training resolution used on ImageNet-1k and ImageNet-21k.
Backbone                 Pre-training     UperNet
                         resolution       #params (×10^6)   FLOPs (×10^9)   Single-scale mIoU   Multi-scale mIoU
ResNet50 224 × 224 66.5 42.0
DeiT-S 224 × 224 52.0 1099 44.0
XciT-T12/16 224 × 224 34.2 874 41.5
XciT-T12/8 224 × 224 33.9 942 43.5
Swin-T 224 × 224 59.9 945 44.5 46.1
Our ViT-T 224 × 224 10.9 148 40.1 41.8
Our ViT-S 224 × 224 41.7 588 45.6 46.8
XciT-M24/16 224 × 224 112.2 1213 47.6
XciT-M24/8 224 × 224 110.0 2161 48.4
PatchConvNet-B60 224 × 224 140.6 1258 48.1 48.6
PatchConvNet-B120 224 × 224 229.8 1550 49.4 50.3
MAE ViT-B 224 × 224 127.7 1283 48.1
Swin-B 384 × 384 121.0 1188 48.1 49.7
Our ViT-B 224 × 224 127.7 1283 49.3 50.2
Our ViT-L 224 × 224 353.6 2231 51.5 52.0
PatchConvNet–B60† 224 × 224 140.6 1258 50.5 51.1
PatchConvNet-L120† 224 × 224 383.7 2086 52.2 52.9
Swin-B† (640 × 640) 224 × 224 121.0 1841 50.0 51.6
Swin-L† (640 × 640) 224 × 224 234.0 3230 53.5
Our ViT-B† 224 × 224 127.7 1283 51.8 52.8
Our ViT-B† 384 × 384 127.7 1283 53.4 54.1
Our ViT-L† 224 × 224 353.6 2231 53.8 54.7
Our ViT-L† 320 × 320 353.6 2231 54.6 55.6
5 Conclusion
This paper makes a simple contribution: it proposes improved baselines for vision
transformers trained in a supervised fashion, which can serve (1) as a com-
parison basis for new architectures, or (2) for other training approaches such as
those based on self-supervised learning. We hope that this stronger baseline will
serve the community effort in making progress on learning foundation models
that could serve many tasks. Our experiments have also gathered a few insights
on how to train larger ViT models with reduced resources without hurting ac-
curacy, allowing us to train a one-billion-parameter model with 4 nodes of 8 GPUs.
Model              Params      Flops       ImageNet-1k top-1 (%)
                   (×10^6)     (×10^9)     orig.   val    real   v2
ViT-S [48] 22.0 4.6 79.8 80.4 86.1 69.7
ViT-B [13, 48] 86.6 17.6 81.8 83.1 87.7 72.6
PiT-S [21] 23.5 2.9 80.9 80.4 86.1 69.2
PiT-B [21] 73.8 12.5 82.0 82.4 86.8 72.0
TNT-S [62] 23.8 5.2 81.5 81.4 87.2 70.6
TNT-B [62] 65.6 14.1 82.9 82.9 87.6 72.2
ConViT-S [8] 27.8 5.8 81.3 81.3 87.0 70.3
ConViT-B [8] 86.5 17.5 82.4 82.0 86.7 71.3
Swin-S [31] 49.6 8.7 83.0 82.1 86.9 70.7
Swin-B [31] 87.8 15.4 83.5 82.2 86.7 70.7
CaiT-B12 [51] 100.0 18.2 83.3 87.7 73.3
Table 12: We report the performance reached with our training recipe with 400 epochs at resolution
224 × 224 for other transformer architectures. We have not performed an extensive grid search to
adapt the hyper-parameters to each architecture. Our results are overall similar to the ones reported in
the papers where these architectures were originally published (column 'orig.'), except for
Swin Transformers, for which we observe a drop on ImageNet-val.
References
[1] Apex. https://nvidia.github.io/apex/index.html, accessed: 2022-
01-01
[2] Bao, H., Dong, L., Wei, F.: Beit: Bert pre-training of image transformers. arXiv
preprint arXiv:2106.08254 (2021)
[3] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.:
End-to-end object detection with transformers. In: European Conference on
Computer Vision (2020)
[4] Chu, P., Bian, X., Liu, S., Ling, H.: Feature space augmentation for long-tailed
data. arXiv preprint arXiv:2008.03673 (2020)
[5] Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-
attention and convolutional layers. arXiv preprint arXiv:1911.03584 (2019)
[6] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: Practical au-
tomated data augmentation with a reduced search space. arXiv preprint
arXiv:1909.13719 (2019)
[7] Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment:
Learning augmentation policies from data. arXiv preprint arXiv:1805.09501
(2018)
[8] d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.:
Convit: Improving vision transformers with soft convolutional inductive bi-
ases. In: ICML (2021)
[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: Conference on Computer Vision and
Pattern Recognition. pp. 248–255 (2009)
[10] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
bidirectional transformers for language understanding. In: NAACL (2019)
[11] Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.:
Cswin transformer: A general vision transformer backbone with cross-shaped
windows. arXiv preprint arXiv:2107.00652 (2021)
[12] Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F.,
Yu, N.: Peco: Perceptual codebook for bert pre-training of vision transform-
ers. arXiv preprint arXiv:2111.12710 (2021)
[13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Un-
terthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An
image is worth 16x16 words: Transformers for image recognition at scale. In:
International Conference on Learning Representations (2021)
[14] El-Nouby, A., Izacard, G., Touvron, H., Laptev, I., Jegou, H., Grave, E.: Are
large-scale datasets necessary for self-supervised pre-training? arXiv preprint
arXiv:2112.10740 (2021)
[15] El-Nouby, A., Neverova, N., Laptev, I., Jégou, H.: Training vision transform-
ers for image retrieval. arXiv preprint arXiv:2102.05644 (2021)
[16] El-Nouby, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin,
A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al.: Xcit: Cross-
covariance image transformers. arXiv preprint arXiv:2106.09681 (2021)
[17] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.:
Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)
[18] Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze,
M.: Levit: a vision transformer in convnet’s clothing for faster inference. arXiv
preprint arXiv:2104.01136 (2021)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders
are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
[20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni-
tion. In: Conference on Computer Vision and Pattern Recognition (2016)
[21] Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial di-
mensions of vision transformers. arXiv preprint arXiv:2103.16302 (2021)
[22] Horn, G.V., Mac Aodha, O., Song, Y., Shepard, A., Adam, H., Perona, P., Be-
longie, S.J.: The iNaturalist species classification and detection dataset. arXiv
preprint arXiv:1707.06642 (2017)
[23] Horn, G.V., Mac Aodha, O., Song, Y., Shepard, A., Adam, H., Perona,
P., Belongie, S.J.: The inaturalist challenge 2018 dataset. arXiv preprint
arXiv:1707.06642 (2018)
[24] Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with
stochastic depth. In: European Conference on Computer Vision (2016)
[25] Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby,
N.: Big transfer (bit): General visual representation learning. arXiv preprint
arXiv:1912.11370 6, 3 (2019)
[26] Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-
grained categorization. In: IEEE Workshop on 3D Representation and Recog-
nition (2013)
[27] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep
convolutional neural networks. In: NeurIPS (2012)
[28] Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech.
rep., CIFAR (2009)
[29] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied
to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
[30] LingChen, T.C., Khonsari, A., Lashkari, A., Nazari, M.R., Sambee, J.S., Nasci-
mento, M.A.: Uniformaugment: A search-free probabilistic data augmenta-
tion approach. arXiv preprint arXiv:2003.14348 (2020)
[31] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. arXiv
preprint arXiv:2103.14030 (2021)
[32] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for
the 2020s. arXiv preprint arXiv:2201.03545 (2022)
[33] Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. arXiv
preprint arXiv:1711.05101 (2017)
[34] Müller, S., Hutter, F.: Trivialaugment: Tuning-free yet state-of-the-art data
augmentation. arXiv preprint arXiv:2103.10158 (2021)
[35] Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network.
arXiv preprint arXiv:2102.00719 (2021)
[36] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large
number of classes. In: Proceedings of the Indian Conference on Computer
Vision, Graphics and Image Processing (2008)
[37] Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level
image representations using convolutional neural networks. In: Conference
on Computer Vision and Pattern Recognition (2014)
[38] Radosavovic, I., Kosaraju, R.P., Girshick, R.B., He, K., Dollár, P.: Designing
network design spaces. Conference on Computer Vision and Pattern Recog-
nition (2020)
[39] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers gener-
alize to ImageNet? In: International Conference on Machine Learning (2019)
[40] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large
scale visual recognition challenge. International Journal of Computer Vision
115(3), 211–252 (2015)
[41] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-
scale image recognition. In: International Conference on Learning Represen-
tations (2015)
[42] Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.:
How to train your vit? data, augmentation, and regularization in vision trans-
formers. arXiv preprint arXiv:2106.10270 (2021)
[43] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In:
Conference on Computer Vision and Pattern Recognition (2015)
[44] Tan, M., Le, Q.V.: EfficientNet: Rethinking model scaling for convolutional
neural networks. arXiv preprint arXiv:1905.11946 (2019)
[45] Tan, M., Le, Q.V.: Efficientnetv2: Smaller models and faster training. In: Inter-
national Conference on Machine Learning (2021)
[46] Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T.,
Yung, J., Keysers, D., Uszkoreit, J., Lucic, M., Dosovitskiy, A.: MLP-Mixer: An
all-MLP architecture for vision. arXiv preprint arXiv:2105.01601 (2021)
[47] Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave,
E., Joulin, A., Synnaeve, G., Verbeek, J., Jégou, H.: ResMLP: feedforward
networks for image classification with data-efficient training. arXiv preprint
arXiv:2105.03404 (2021)
[48] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Train-
ing data-efficient image transformers & distillation through attention. Inter-
national Conference on Machine Learning (2021)
[49] Touvron, H., Cord, M., El-Nouby, A., Bojanowski, P., Joulin, A., Synnaeve, G.,
Jégou, H.: Augmenting convolutional networks with attention-based aggre-
gation. arXiv preprint arXiv:2112.13692 (2021)
[50] Touvron, H., Cord, M., El-Nouby, A., Verbeek, J., Jégou, H.: Three
things everyone should know about vision transformers. arXiv preprint
arXiv:2203.09795 (2022)
[51] Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going
deeper with image transformers. International Conference on Computer Vi-
sion (2021)
[52] Touvron, H., Sablayrolles, A., Douze, M., Cord, M., Jégou, H.: Grafit: Learn-
ing fine-grained image representations with coarse labels. International Con-
ference on Computer Vision (2021)
[53] Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution
discrepancy. In: NeurIPS (2019)
[54] Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution
discrepancy: Fixefficientnet. arXiv preprint arXiv:2003.08237 (2020)
[55] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
[56] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao,
L.: Pyramid vision transformer: A versatile backbone for dense prediction
without convolutions. arXiv preprint arXiv:2102.12122 (2021)
[57] Wightman, R., Touvron, H., Jégou, H.: Resnet strikes back: An improved
training procedure in timm. arXiv preprint arXiv:2110.00476 (2021)
[58] Wu, H., Xiao, B., Codella, N.C.F., Liu, M., Dai, X., Yuan, L., Zhang,
L.: Cvt: Introducing convolutions to vision transformers. arXiv preprint
arXiv:2103.15808 (2021)
[59] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for
scene understanding. In: European Conference on Computer Vision (2018)
[60] Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early con-
volutions help transformers see better. arXiv preprint arXiv:2106.14881 (2021)
[61] You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Dem-
mel, J., Keutzer, K., Hsieh, C.J.: Large batch optimization for deep learning:
Training BERT in 76 minutes. In: International Conference on Learning Rep-
resentations (2020)
[62] Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Tay, F., Feng, J., Yan, S.: Tokens-
to-token vit: Training vision transformers from scratch on imagenet. arXiv
preprint arXiv:2101.11986 (2021)
[63] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: CutMix: Regulariza-
tion strategy to train strong classifiers with localizable features. arXiv preprint
arXiv:1905.04899 (2019)
[64] Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412 (2017)
[65] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene
parsing through ade20k dataset. Conference on Computer Vision and Pattern
Recognition (2017)
[66] Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A.L., Kong, T.: ibot:
Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832
(2021)
Appendices
A Experimental details
Fine-tuning at higher resolution When pre-training on ImageNet-1k at resolu-
tion 224 × 224 we fix the train-test resolution discrepancy by finetuning at a higher
resolution [53]. Our finetuning procedure is inspired by DeiT, except that we adapt
the stochastic depth rate according to the model size [51]. We fix the learning rate
to lr = 1 × 10^-5 with a batch size of 512 for 20 epochs, with a weight decay of
0.1 and without repeated augmentation. Other hyper-parameters are similar to those
employed in DeiT fine-tuning.
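One practical detail when fine-tuning at a different resolution is that the number of patch tokens changes, so the learned position embeddings have to be resized; a common way to do this in DeiT-style code bases is bicubic interpolation, sketched below under the assumption of a square token grid preceded by a single class token (this is a generic sketch, not our exact implementation).

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT position embeddings of shape (1, 1 + N, dim) to a new square grid.
    Assumes one class token followed by N = g*g patch tokens."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    # (1, N, dim) -> (1, dim, g, g) so that spatial interpolation can be applied.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)
```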
Stochastic depth We adapt the stochastic depth drop rate according to the model
size. We report stochastic depth drop rate values in Table 13.
Table 13: Stochastic depth drop-rate according to the model size, for 400-epoch training on ImageNet-
1k and 90-epoch training on ImageNet-21k. See Section B for further adaptation with longer training.
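For completeness, a minimal sketch of the standard per-sample stochastic depth (drop path) operation is shown below; this is the usual formulation of the technique [24], not code taken from our training implementation.

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth [24]: drop the whole residual branch for a random subset of samples."""
    def __init__(self, drop_rate: float = 0.0):
        super().__init__()
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_rate == 0.0:
            return x
        keep = 1.0 - self.drop_rate
        # One Bernoulli draw per sample, broadcast over the remaining dimensions.
        mask = x.new_empty((x.shape[0],) + (1,) * (x.ndim - 1)).bernoulli_(keep)
        return x * mask / keep
```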
B Additional Ablations
Number of training epochs In Table 15 we provide an ablation on the number of
training epochs on ImageNet-1k.
Model    epochs   ImageNet top-1 acc.
                  val     real    v2
ViT-S    300      79.9    86.1    68.8
ViT-S    400      80.4    86.1    69.7
ViT-S    600      80.8    86.7    69.9
ViT-S    800      81.4    87.0    70.5
ViT-B    300      82.8    87.6    72.1
ViT-B    400      83.1    87.7    72.6
ViT-B    600      83.2    87.8    73.3
ViT-B    800      83.7    88.1    73.1
ViT-L    300      84.1    88.5    74.1
ViT-L    400      84.2    88.6    74.3
ViT-L    600      84.4    88.6    74.6
ViT-L    800      84.5    88.8    75.0
ViT-H    300      84.6    89.0    74.9
ViT-H    400      84.8    89.1    75.3
Table 15: Impact on the performance of the number of training epochs on ImageNet-1k.
We do not observe a saturation when increasing the number of training epochs, as also
observed with BERT-like approaches [2, 19]. For longer training we increase the weight
decay from 0.02 to 0.05 and we increase the stochastic depth drop-rate by 0.05 every
200 epochs to prevent overfitting.
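Read literally, this adaptation can be expressed as the small helper below; taking the 400-epoch setting as the reference point for the +0.05 increments is our assumption, not an explicit statement of the paper.

```python
def longer_training_regularization(epochs: int, base_drop_rate: float):
    """One possible reading of the adaptation above: weight decay moves from 0.02
    to 0.05 for longer schedules, and the drop-path rate grows by 0.05 per extra
    200 epochs beyond the 400-epoch baseline (the reference point is an assumption)."""
    weight_decay = 0.02 if epochs <= 400 else 0.05
    extra_steps = max(0, (epochs - 400) // 200)
    return weight_decay, base_drop_rate + 0.05 * extra_steps
```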