Published as a conference paper at ICLR 2021

AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,† , Lucas Beyer∗ , Alexander Kolesnikov∗ , Dirk Weissenborn∗ ,
Xiaohua Zhai∗ , Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†

∗ equal technical contribution, † equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
arXiv:2010.11929v2 [cs.CV] 3 Jun 2021

ABSTRACT

While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform
very well on image classification tasks. When pre-trained on large amounts of
data and transferred to multiple mid-sized or small image recognition benchmarks
(ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent
results compared to state-of-the-art convolutional networks while requiring
substantially fewer computational resources to train.¹

1 INTRODUCTION

Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become
the model of choice in natural language processing (NLP). The dominant approach is to pre-train on
a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks
to Transformers’ computational efficiency and scalability, it has become possible to train models of
unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the
models and datasets growing, there is still no sign of saturating performance.
In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989;
Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining
CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing
the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while
theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to
the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-
like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al.,
2020).
Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard
Transformer directly to images, with the fewest possible modifications. To do so, we split an image
into patches and provide the sequence of linear embeddings of these patches as an input to a Trans-
former. Image patches are treated the same way as tokens (words) in an NLP application. We train
the model on image classification in supervised fashion.
When trained on mid-sized datasets such as ImageNet without strong regularization, these mod-
els yield modest accuracies of a few percentage points below ResNets of comparable size. This
seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases
inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well
when trained on insufficient amounts of data.

¹ Fine-tuning code and pre-trained models are available at
https://github.com/google-research/vision_transformer

However, the picture changes if the models are trained on larger datasets (14M-300M images). We
find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent
results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When
pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches
or beats state of the art on multiple image recognition benchmarks. In particular, the best model
reaches the accuracy of 8855% on ImageNet, 9072% on ImageNet-ReaL, 9455% on CIFAR-100,
and 7763% on the VTAB suite of 19 tasks.

2 RELATED WORK

[Reader's note: Summarize the state of the art. Transformers were initially used for translation and
have become the state of the art in language tasks. Models are usually pre-trained on large corpora
and then fine-tuned. A naive application of Transformers would have every pixel attend to every other
pixel, but that is very costly, so several techniques have been devised (...). CNNs have also been
combined with attention, and attention has been applied to pixels after reducing the image resolution
and color space.]

Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become
the state of the art method in many NLP tasks. Large Transformer-based models are often
pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019)
uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling
as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).
Naive application of self-attention to images would require that each pixel attends to every other
pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus,
to apply Transformers in the context of image processing, several approximations have been tried in
the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query
pixel instead of globally. Such local multi-head dot-product self attention blocks can completely
replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different
line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-
attention in order to be applicable to images. An alternative way to scale attention is to apply it in
blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho
et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate
promising results on computer vision tasks, but require complex engineering to be implemented
efficiently on hardware accelerators.
Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2 × 2
from the input image and applies full self-attention on top. This model is very similar to ViT,
but our work goes further to demonstrate that large scale pre-training makes vanilla transformers
competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020)
use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution
images, while we handle medium-resolution images as well.
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms
of self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by
further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018;
Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu
et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen
et al., 2020c; Lu et al., 2019; Li et al., 2019).
Another recent related model is image GPT (iGPT) (Chen et al., 2020a), which applies Transformers
to image pixels after reducing image resolution and color space. The model is trained in an unsu-
pervised fashion as a generative model, and the resulting representation can then be fine-tuned or
probed linearly for classification performance, achieving a maximal accuracy of 72% on ImageNet.
Our work adds to the increasing collection of papers that explore image recognition at larger scales
than the standard ImageNet dataset. The use of additional data sources makes it possible to achieve
state-of-the-art results on standard benchmarks (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020).
Moreover, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov
et al. (2020); Djolonga et al. (2020) perform an empirical exploration of CNN transfer learning from
large scale datasets such as ImageNet-21k and JFT-300M. We focus on these two latter datasets as
well, but train Transformers instead of ResNet-based models used in prior works.

Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them,
add position embeddings, and feed the resulting sequence of vectors to a standard Transformer
encoder. In order to perform classification, we use the standard approach of adding an extra learnable
“classification token” to the sequence. The illustration of the Transformer encoder was inspired by
Vaswani et al. (2017).

[Reader's note: Describe how a Vision Transformer works. Include Figure 1.]

3 METHOD

In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and
their efficient implementations – can be used almost out of the box.

3.1 VISION TRANSFORMER (VIT)

An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D
sequence of token embeddings. To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a
sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original
image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$
is the resulting number of patches, which also serves as the effective input sequence length for the
Transformer. The Transformer uses constant latent vector size D through all of its layers, so we
flatten the patches and map to D dimensions with a trainable linear projection (Eq. 1). We refer to
the output of this projection as the patch embeddings.
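The reshaping and projection described above can be sketched in NumPy as follows. This is a minimal illustration, not the released implementation: the function name `patchify`, the random image, and the randomly initialized stand-in for the trainable projection $E$ are assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Reshape an (H, W, C) image into N = HW / P^2 flattened patches of size P^2 * C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    # Split both spatial axes into (grid index, offset inside the patch).
    x = image.reshape(H // P, P, W // P, P, C)
    # Bring the two grid axes to the front: (H/P, W/P, P, P, C).
    x = x.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return x.reshape((H // P) * (W // P), P * P * C)

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 768
image = rng.standard_normal((H, W, C))           # placeholder input image
x_p = patchify(image, P)                         # shape (N, P^2 * C) = (196, 768)
E = 0.02 * rng.standard_normal((P * P * C, D))   # stand-in for the trainable projection E
patch_embeddings = x_p @ E                       # shape (N, D) = (196, 768)
```

With a 224 × 224 input and 16 × 16 patches, the sequence has N = 196 entries; the class token described next adds one more.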
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded
patches ($z_0^0 = x_{\text{class}}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the
image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is
attached to $z_L^0$. The classification head is implemented by an MLP with one hidden layer at pre-training
time and by a single linear layer at fine-tuning time.
Position embeddings are added to the patch embeddings to retain positional information. We use
standard learnable 1D position embeddings, since we have not observed significant performance
gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting
sequence of embedding vectors serves as input to the encoder.
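As a small sketch of how the encoder input is assembled (Eq. 1), the snippet below prepends a class embedding and adds 1D position embeddings. All arrays are random or zero placeholders for parameters that would be learned; the initial values and the scale 0.02 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768
patch_embeddings = rng.standard_normal((N, D))   # placeholder for the projected patches
x_class = np.zeros((1, D))                       # learnable [class] embedding (placeholder init)
E_pos = 0.02 * rng.standard_normal((N + 1, D))   # learnable 1D position embeddings (placeholder init)

# Prepend the class token, then add one position embedding per sequence slot.
z0 = np.concatenate([x_class, patch_embeddings], axis=0) + E_pos   # shape (N + 1, D)
```

The position embeddings are indexed only by sequence position, so nothing in this step encodes the 2D layout of the patches.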
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-
attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before
every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).

The MLP contains two layers with a GELU non-linearity.


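\begin{align}
z_0 &= [x_{\text{class}};\, x_p^1 E;\, x_p^2 E;\, \cdots;\, x_p^N E] + E_{pos}, & E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \tag{1} \\
z'_\ell &= \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, & \ell = 1 \ldots L \tag{2} \\
z_\ell &= \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, & \ell = 1 \ldots L \tag{3} \\
y &= \mathrm{LN}(z_L^0) \tag{4}
\end{align}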
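The pre-norm residual structure of Eq. 2 to 4 can be sketched as below. To keep the focus on the layer wiring, the MSA and MLP sublayers are passed in as plain functions; the stand-ins used in the usage lines (`mix`, `mlp`) are hypothetical toys, not trained sublayers, and the layer norm omits its learnable scale and bias.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    # Normalize each token vector; the learnable scale and bias are omitted in this sketch.
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def encoder_layer(z, msa_fn, mlp_fn):
    z_prime = msa_fn(layer_norm(z)) + z             # Eq. 2: LN, attention, residual
    return mlp_fn(layer_norm(z_prime)) + z_prime    # Eq. 3: LN, MLP, residual

def vit_encoder(z0, layers):
    z = z0
    for msa_fn, mlp_fn in layers:
        z = encoder_layer(z, msa_fn, mlp_fn)
    return layer_norm(z[0])                         # Eq. 4: normalized class-token state

# Usage with toy sublayers (assumptions for illustration only).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((197, 64))
mix = lambda x: np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)  # crude global mixing
mlp = lambda x: np.tanh(x)
y = vit_encoder(z0, [(mix, mlp)] * 2)               # shape (64,)
```

Because each residual adds the sublayer output to its unnormalized input, a layer whose sublayers output zeros leaves the sequence unchanged.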

[Reader's note: Why is there less inductive bias than in CNNs? Is there a hybrid ViT architecture?
Comment on what is recommended for fine-tuning. Because in CNNs locality, the 2D neighborhood
structure, and translation equivariance are used in every layer of the model. In ViT only the MLP
layers are local and translationally equivariant (the self-attention layers are global). One hybrid
architecture takes the feature maps of a CNN as input instead of the patches. For fine-tuning, they
recommend using a ViT pre-trained on a large dataset and replacing the prediction head, training with
higher-resolution images.]

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally
equivariant, while the self-attention layers are global. The two-dimensional neighborhood
structure is used very sparingly: in the beginning of the model by cutting the image into patches and
at fine-tuning time for adjusting the position embeddings for images of different resolution (as
described below). Other than that, the position embeddings at initialization time carry no information
about the 2D positions of the patches and all spatial relations between the patches have to be learned
from scratch.

Hybrid Architecture. As an alternative to raw image patches, the input sequence can be formed
from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding
projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case,
the patches can have spatial size 1x1, which means that the input sequence is obtained by simply
flattening the spatial dimensions of the feature map and projecting to the Transformer dimension.
The classification input embedding and position embeddings are added as described above.
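For the 1x1 special case, forming the input sequence reduces to a reshape followed by a matrix product. The sketch below assumes a hypothetical feature map of shape 14 × 14 × 1024 and a random stand-in for the projection $E$; neither value is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c, D = 14, 14, 1024, 768                  # hypothetical feature-map shape and model width
feature_map = rng.standard_normal((h, w, c))    # placeholder for a CNN's output
E = 0.02 * rng.standard_normal((c, D))          # stand-in for the trainable projection E

# Each spatial location becomes one token: flatten (h, w) and project the channels to D.
tokens = feature_map.reshape(h * w, c) @ E      # shape (h * w, D) = (196, 768)
```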

3.2 FINE-TUNING AND HIGHER RESOLUTION

Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For
this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward
layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher
resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images
of higher resolution, we keep the patch size the same, which results in a larger effective sequence
length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints);
however, the pre-trained position embeddings may no longer be meaningful. We therefore perform
2D interpolation of the pre-trained position embeddings, according to their location in the original
image. Note that this resolution adjustment and patch extraction are the only points at which an
inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
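One way to picture the 2D interpolation of position embeddings is the NumPy sketch below. It reshapes the patch position embeddings back onto their square grid, resizes the grid bilinearly, and leaves the class-token embedding untouched. The function name, the endpoint-aligned sampling, and the choice of bilinear interpolation are assumptions for illustration; the released code may differ in these details.

```python
import numpy as np

def resize_pos_embed(E_pos, old_grid, new_grid):
    """Bilinearly resize (1 + old_grid^2, D) position embeddings to (1 + new_grid^2, D)."""
    cls_tok, grid = E_pos[:1], E_pos[1:]
    D = grid.shape[1]
    grid = grid.reshape(old_grid, old_grid, D)
    # Sample locations expressed in the old grid's coordinates, endpoints aligned.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo
    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_tok, out.reshape(new_grid * new_grid, D)], axis=0)

rng = np.random.default_rng(0)
E_pos = rng.standard_normal((1 + 14 * 14, 8))   # e.g. a 14 x 14 patch grid at pre-training
resized = resize_pos_embed(E_pos, 14, 24)       # e.g. a 24 x 24 patch grid at fine-tuning
```

The patch size is unchanged, so only the grid of positions grows; the rest of the model's weights are reused as they are.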

4 EXPERIMENTS
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the
hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size
and evaluate many benchmark tasks. When considering the computational cost of pre-training the
model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at
a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show
that self-supervised ViT holds promise for the future.

4.1 SETUP

Datasets. To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes
and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with
21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and
303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the
downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these
datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up
ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al.,
2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing
follows Kolesnikov et al. (2020).


Model      Layers  Hidden size D  MLP size  Heads  Params
ViT-Base   12      768            3072      12     86M
ViT-Large  24      1024           4096      16     307M
ViT-Huge   32      1280           5120      16     632M

Table 1: Details of Vision Transformer model variants.
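As a rough sanity check on the parameter counts in Table 1, the sketch below counts only the large weight matrices of a standard Transformer encoder layer: four D × D attention projections and two D × MLP-size matrices. It deliberately ignores biases, layer norms, embeddings, and the classification head, so it is an approximation under those stated assumptions rather than an exact count.

```python
def approx_encoder_params(layers, hidden, mlp):
    attention = 4 * hidden * hidden    # query, key, value and output projections
    feedforward = 2 * hidden * mlp     # the two MLP weight matrices
    return layers * (attention + feedforward)

# (layers, hidden size D, MLP size, reported parameter count) from Table 1
table1 = {
    "ViT-Base": (12, 768, 3072, 86e6),
    "ViT-Large": (24, 1024, 4096, 307e6),
    "ViT-Huge": (32, 1280, 5120, 632e6),
}
estimates = {name: approx_encoder_params(L, D, M) for name, (L, D, M, _) in table1.items()}
for name, (_, _, _, reported) in table1.items():
    print(f"{name}: ~{estimates[name] / 1e6:.0f}M estimated, {reported / 1e6:.0f}M reported")
```

The estimates land within about 2% of the reported values, which suggests that the encoder weight matrices account for nearly all of the parameters.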

We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates
low-data transfer to diverse tasks, using 1 000 training examples per task. The tasks are divided into
three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite
imagery, and Structured – tasks that require geometric understanding like localization.
Model Variants. We base ViT configurations on those used for BERT (Devlin et al., 2019), as
summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we
add the larger “Huge” model. In what follows we use brief notation to indicate the model size and
the input patch size: for instance, ViT-L/16 means the “Large” variant with 16 × 16 input patch size.
Note that the Transformer’s sequence length is inversely proportional to the square of the patch size,
thus models with smaller patch size are computationally more expensive.
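The relation between patch size and sequence length can be made concrete with a few lines of arithmetic. The helper name `seq_len` is introduced for this example, and the 224-pixel resolution is the training resolution stated in Table 3.

```python
def seq_len(image_size, patch_size):
    """Number of patches for a square image, excluding the class token."""
    assert image_size % patch_size == 0, "image size must be divisible by the patch size"
    return (image_size // patch_size) ** 2

lengths = {p: seq_len(224, p) for p in (32, 16, 14)}
print(lengths)   # {32: 49, 16: 196, 14: 256}
```

Halving the patch size from 32 to 16 quadruples the sequence length, which is why the /16 and /14 variants are the more expensive ones.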
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization
layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and use standardized
convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020),
and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate fea-
ture maps into ViT with patch size of one “pixel”. To experiment with different sequence lengths,
we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same
number of layers in stage 3 (keeping the total number of layers), and take the output of this extended
stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
Training & Fine-tuning. We train all models, including ResNets, using Adam (Kingma & Ba,
2015) with β1 = 09, β2 = 0999, a batch size of 4096 and apply a high weight decay of 01, which
we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common
practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning
rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum,
batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at
higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992)
averaging with a factor of 09999 (Ramachandran et al., 2019; Wang et al., 2020b).
Metrics. We report results on downstream datasets either through few-shot or fine-tuning accuracy.
Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective
dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem
that maps the (frozen) representation of a subset of training images to {−1, 1}K target vectors. This
formulation allows us to recover the exact solution in closed form. Though we mainly focus on
fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation
where fine-tuning would be too costly.
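The closed-form few-shot evaluation can be sketched as ridge regression from frozen features to $\{-1, 1\}^K$ targets. The regularization strength `l2`, the absence of a bias term, and the synthetic features below are assumptions for illustration; the paper does not specify them in this section.

```python
import numpy as np

def fit_few_shot(features, labels, num_classes, l2=1.0):
    """Closed-form regularized least squares: W = (X^T X + l2 * I)^{-1} X^T Y."""
    n, d = features.shape
    Y = -np.ones((n, num_classes))
    Y[np.arange(n), labels] = 1.0            # +1 for the true class, -1 elsewhere
    A = features.T @ features + l2 * np.eye(d)
    return np.linalg.solve(A, features.T @ Y)

def predict(features, W):
    return np.argmax(features @ W, axis=1)

# Synthetic stand-in for frozen representations: 3 well-separated classes, 10 shots each.
rng = np.random.default_rng(0)
K, d, shots = 3, 8, 10
labels = np.repeat(np.arange(K), shots)
centers = 5.0 * np.eye(K, d)
features = centers[labels] + 0.1 * rng.standard_normal((K * shots, d))
W = fit_few_shot(features, labels, K)
accuracy = (predict(features, W) == labels).mean()
```

Because the solution is a single linear solve, this evaluation is cheap enough to run repeatedly during pre-training.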

4.2 COMPARISON TO STATE OF THE ART

We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from
the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which
performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al.,
2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-
300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and
BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we
report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU
v3 cores (2 per chip) used for training multiplied by the training time in days.
Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L
(which is pre-trained on the same dataset) on all tasks, while requiring substantially fewer
computational resources to train. The larger model, ViT-H/14, further improves the performance, especially
on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this

[Reader's notes]

What experiments do they run to compare ViT with the state of the art?
They train several ViTs of different sizes (number of parameters) and different patch sizes, and find
that those pre-trained on a large dataset achieve better results than ResNet and take less time to
train.

What experiments do they run to analyze the influence of pre-training?
1. Pre-training on datasets of different sizes. They conclude that ViTs give worse results when the
datasets are small, but improve on the state of the art when they are larger (Fig. 3).
2. With a dataset of 300M images, they look at top-1 accuracy when pre-training on subsets of
different sizes (Fig. 4).

What do they do to check the scalability of ViT?
They plot the number of exaFLOPs on the x axis and accuracy on the y axis, and find that ViT needs
2-4 times less compute than ResNet. At smaller compute budgets the hybrid ViTs are better, but as
compute increases they perform the same.

What is done to investigate how ViT works internally?
a) In the first layer, the principal components of the learned filters are shown.
b) The positional encoding is studied, and patches at similar positions turn out to have similar
encodings.
c) The final layers attend to more distant pixels (global information), while layers closer to the
input attend to nearer pixels (local information). In hybrid models there is less local attention,
probably because it is not needed: the preceding CNN has already captured it.

Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In
ICLR, 2019.
I. Bello, B. Zoph, Q. Le, A. Vaswani, and J. Shlens. Attention augmented convolutional networks.
In ICCV, 2019.
Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are
we done with imagenet? arXiv, 2020.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv, 2020.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
Mark Chen, Alec Radford, Rewon Child, Jeff Wu, and Heewoo Jun. Generative pretraining from
pixels. In ICML, 2020a.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework
for contrastive learning of visual representations. In ICML, 2020b.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and
Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. In ECCV, 2020c.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. arXiv, 2019.
Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-
attention and convolutional layers. In ICLR, 2020.
J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In CVPR, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL, 2019.
Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander
Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D’Amour, Dan Moldovan, Sylvain
Gelly, Neil Houlsby, Xiaohua Zhai, and Mario Lucic. On robustness and transferability of
convolutional neural networks. arXiv, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In CVPR, 2016.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
unsupervised visual representation learning. In CVPR, 2020.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidi-
mensional transformers. arXiv, 2019.
Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object
detection. In CVPR, 2018.
Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition.
In ICCV, 2019.
Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, and
Thomas S. Huang. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2020.
Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami,
and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. In
ICML, 2020.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. 2015.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly,
and Neil Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, 2020.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convo-
lutional neural networks. In NIPS, 2012.
Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropa-
gation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang,
Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional
computation and automatic sharding. arXiv, 2020.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A
Simple and Performant Baseline for Vision and Language. arXiv, 2019.
Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold,
Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot atten-
tion. arXiv, 2020.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining Task-Agnostic
Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS, 2019.
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li,
Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised
pretraining. In ECCV, 2018.
M. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In
ICVGIP, 2008.
Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR,
2012.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and
Dustin Tran. Image transformer. In ICML, 2018.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM
Journal on Control and Optimization, 30(4):838–855, 1992. doi: 10.1137/0330046. URL
https://doi.org/10.1137/0330046.
Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv
preprint arXiv:1903.10520, 2019.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under-
standing with unsupervised learning. Technical Report, 2018.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. Technical Report, 2019.
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens.
Stand-alone self-attention in vision models. In NeurIPS, 2019.
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable ef-
fectiveness of data in deep learning era. In ICCV, 2017.
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint
model for video and language representation learning. In ICCV, 2019.

Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution
discrepancy. In NeurIPS, 2019.
Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution
discrepancy: Fixefficientnet. arXiv preprint arXiv:2003.08237, 2020.
Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, Sylvain
Gelly, and Mario Lucic. Self-supervised learning of video-induced visual invariances. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen.
Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020a.
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh
Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint
arXiv:2003.07853, 2020b.
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao.
Learning deep transformer models for machine translation. In ACL, 2019.
Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In
CVPR, 2018.
Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In
ICLR, 2019.
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt
Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing
for computer vision. arXiv, 2020.
Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student
improves imagenet classification. In CVPR, 2020.
Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-Supervised
Semi-Supervised Learning. In ICCV, 2019a.
Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario
Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A
large-scale study of representation learning with the visual task adaptation benchmark. arXiv
preprint arXiv:1910.04867, 2019b.
Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In
CVPR, 2020.

Models             Dataset       Epochs  Base LR    LR decay  Weight decay  Dropout
ViT-B/{16,32}      JFT-300M      7       8 · 10^-4  linear    0.1           0.0
ViT-L/32           JFT-300M      7       6 · 10^-4  linear    0.1           0.0
ViT-L/16           JFT-300M      7/14    4 · 10^-4  linear    0.1           0.0
ViT-H/14           JFT-300M      14      3 · 10^-4  linear    0.1           0.0
R50x{1,2}          JFT-300M      7       10^-3      linear    0.1           0.0
R101x1             JFT-300M      7       8 · 10^-4  linear    0.1           0.0
R152x{1,2}         JFT-300M      7       6 · 10^-4  linear    0.1           0.0
R50+ViT-B/{16,32}  JFT-300M      7       8 · 10^-4  linear    0.1           0.0
R50+ViT-L/32       JFT-300M      7       2 · 10^-4  linear    0.1           0.0
R50+ViT-L/16       JFT-300M      7/14    4 · 10^-4  linear    0.1           0.0
ViT-B/{16,32}      ImageNet-21k  90      10^-3      linear    0.03          0.1
ViT-L/{16,32}      ImageNet-21k  30/90   10^-3      linear    0.03          0.1
ViT-∗              ImageNet      300     3 · 10^-3  cosine    0.3           0.1

Table 3: Hyperparameters for training. All models are trained with a batch size of 4096 and learn-
ing rate warmup of 10k steps. For ImageNet we found it beneficial to additionally apply gradient
clipping at global norm 1. Training resolution is 224.
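
The schedule named in Table 3 (learning rate warmup for 10k steps, followed by linear or cosine decay) can be sketched as below. This is a minimal illustration rather than the authors' code: the function name `learning_rate` is hypothetical, and both the linear ramp starting from zero and the decay ending at zero at `total_steps` are assumptions, since the appendix does not state the warmup shape or the final learning rate.

```python
import math


def learning_rate(step, base_lr, total_steps, warmup_steps=10_000, decay="linear"):
    """Warmup followed by linear or cosine decay (hypothetical helper).

    Assumptions: the warmup ramps linearly from 0 to base_lr, and the
    decay reaches 0 exactly at total_steps.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase already completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    if decay == "linear":
        return base_lr * (1.0 - progress)
    if decay == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown decay: {decay}")
```

For example, `learning_rate(5_000, 8e-4, 100_000)` is halfway through the warmup and returns half of the base learning rate listed for ViT-B on JFT-300M; the total of 100,000 steps is an arbitrary value chosen for illustration.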

APPENDIX

A MULTIHEAD SELF-ATTENTION


Standard qkv self-attention (SA, Vaswani et al. (2017)) is a popular building block for neural architectures. For each element in an input sequence $z \in \mathbb{R}^{N \times D}$, we compute a weighted sum over all values $v$ in the sequence. The attention weights $A_{ij}$ are based on the pairwise similarity between two elements of the sequence and their respective query $q_i$ and key $k_j$ representations.

\begin{align}
[q, k, v] &= z U_{qkv}, & U_{qkv} &\in \mathbb{R}^{D \times 3 D_h}, \tag{5} \\
A &= \operatorname{softmax}\left( q k^\top / \sqrt{D_h} \right), & A &\in \mathbb{R}^{N \times N}, \tag{6} \\
\operatorname{SA}(z) &= A v. \tag{7}
\end{align}

Multihead self-attention (MSA) is an extension of SA in which we run $k$ self-attention operations, called "heads", in parallel, and project their concatenated outputs. To keep compute and number of parameters constant when changing $k$, $D_h$ (Eq. 5) is typically set to $D/k$.

\begin{equation}
\operatorname{MSA}(z) = [\operatorname{SA}_1(z); \operatorname{SA}_2(z); \cdots; \operatorname{SA}_k(z)] \, U_{msa}, \qquad U_{msa} \in \mathbb{R}^{k \cdot D_h \times D} \tag{8}
\end{equation}
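
Equations 5 to 8 can be followed step by step in the sketch below. It is a dependency-free illustration using nested lists, not an efficient or official implementation; the helper names (`matmul`, `softmax_rows`, `self_attention`, `multihead_self_attention`) are hypothetical, and a practical version would use an array library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(z, u_qkv, d_h):
    """Eqs. 5-7: z is N x D, u_qkv is D x 3*d_h; returns an N x d_h matrix."""
    qkv = matmul(z, u_qkv)                      # Eq. 5: joint projection
    q = [row[:d_h] for row in qkv]
    k = [row[d_h:2 * d_h] for row in qkv]
    v = [row[2 * d_h:] for row in qkv]
    k_t = [list(col) for col in zip(*k)]        # transpose of k, d_h x N
    scores = [[s / math.sqrt(d_h) for s in row] for row in matmul(q, k_t)]
    a = softmax_rows(scores)                    # Eq. 6: N x N attention weights
    return matmul(a, v)                         # Eq. 7: weighted sum of values


def multihead_self_attention(z, heads_u_qkv, u_msa, d_h):
    """Eq. 8: concatenate the head outputs along features, then project."""
    head_outputs = [self_attention(z, u, d_h) for u in heads_u_qkv]
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(z))]
    return matmul(concat, u_msa)                # u_msa is (k * d_h) x D
```

With D = 4 and k = 2 heads, setting d_h = D/k = 2 keeps the concatenated head outputs at width k · d_h = D, which is why the projection `u_msa` maps a width-D vector back to width D.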

B EXPERIMENT DETAILS

B.1 TRAINING

Table 3 summarizes the training setups for our different models. We found strong regularization
to be key when training models from scratch on ImageNet. Dropout, when used, is applied after
every dense layer except for the qkv-projections, and directly after adding positional to patch
embeddings. Hybrid models are trained with the exact same setup as their ViT counterparts. Finally,
all training is done at resolution 224.

B.1.1 FINE-TUNING


We fine-tune all ViT models using SGD with a momentum of 0.9. We run a small grid search over
learning rates; see the learning rate ranges in Table 4. To do so, we use small sub-splits from the
training set (10% for Pets and Flowers, 2% for CIFAR, 1% for ImageNet) as a development set and
train on the remaining data. For final results we train on the entire training set and evaluate on the
respective test data. For fine-tuning ResNets and hybrid models we use the exact same setup, with
the only exception of ImageNet, where we add another value, 0.06, to the learning rate sweep.
Additionally, for ResNets we also run the setup of Kolesnikov et al. (2020) and select the best results
across this run and our sweep. Finally, if not mentioned otherwise, all fine-tuning experiments run at
384 resolution (running fine-tuning at a different resolution than training is common practice
(Kolesnikov et al., 2020)).

Dataset               Steps     Base LR
ImageNet              20 000    {0.003, 0.01, 0.03, 0.06}
CIFAR100              10 000    {0.001, 0.003, 0.01, 0.03}
CIFAR10               10 000    {0.001, 0.003, 0.01, 0.03}
Oxford-IIIT Pets      500       {0.001, 0.003, 0.01, 0.03}
Oxford Flowers-102    500       {0.001, 0.003, 0.01, 0.03}
VTAB (19 tasks)       2 500     0.01

Table 4: Hyperparameters for fine-tuning. All models are fine-tuned with cosine learning rate decay,
a batch size of 512, no weight decay, and grad clipping at global norm 1. If not mentioned otherwise,
fine-tuning resolution is 384.
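
Gradient clipping at global norm 1, which Table 4 lists for fine-tuning, rescales all gradients jointly so that their combined L2 norm does not exceed the threshold. The sketch below shows the idea on plain lists; `clip_by_global_norm` is a hypothetical helper, and representing each parameter's gradient as a flat list of floats is an assumption made for brevity.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Jointly rescale gradients so their combined L2 norm is at most max_norm.

    grads: one flat list of floats per parameter tensor (an assumption).
    Returns (clipped gradients, global norm before clipping).
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        # Already within the budget: return an unchanged copy.
        return [list(tensor) for tensor in grads], global_norm
    scale = max_norm / global_norm
    # A single shared factor keeps the direction of the full gradient intact.
    return [[g * scale for g in tensor] for tensor in grads], global_norm
```

Because every tensor is scaled by the same factor, the update direction is preserved; only its length is capped, unlike per-element clipping.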
When transferring ViT models to another dataset, we remove the whole head (two linear layers) and
replace it by a single, zero-initialized linear layer outputting the number of classes required by the
target dataset. We found this to be a little more robust than simply re-initializing the very last layer.
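
One consequence of the zero initialization is that the new head outputs all-zero logits, and therefore uniform class probabilities, at the start of fine-tuning, whatever features the backbone produces. The sketch below illustrates this under stated assumptions: the model is a hypothetical dict with "backbone" and "head" entries, and `replace_head`, `head_logits` and `softmax` are illustrative helpers rather than functions from any particular library.

```python
import math


def replace_head(model, feature_dim, num_classes):
    """Drop the pre-training head and attach a zero-initialized linear layer.

    `model` is a hypothetical dict with "backbone" and "head" entries.
    """
    new_head = {
        "weights": [[0.0] * num_classes for _ in range(feature_dim)],
        "bias": [0.0] * num_classes,
    }
    return {"backbone": model["backbone"], "head": new_head}


def head_logits(head, features):
    """Logits of a single linear layer: features @ weights + bias."""
    return [
        sum(f * row[c] for f, row in zip(features, head["weights"])) + head["bias"][c]
        for c in range(len(head["bias"]))
    ]


def softmax(logits):
    """Softmax over one list of logits, shifted by the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `softmax(head_logits(new_model["head"], features))` on a freshly replaced head gives probability 1/num_classes for each class for any `features` vector.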
For VTAB we follow the protocol in Kolesnikov et al. (2020), and use the same hyperparameter
setting for all tasks. We use a learning rate of 001 and train for 2500 steps (Tab. 4). We chose this
setting by running a small sweep over two learning rates and two schedules, and selecting the setting
with the highest VTAB score on the 200-example validation sets. We follow the pre-processing used
in Kolesnikov et al. (2020), except that we do not use task-specific input resolutions. Instead we find
that Vision Transformer benefits most from a high resolution (384 × 384) for all tasks.

B.1.2 SELF-SUPERVISION

We employ the masked patch prediction objective for preliminary self-supervision experiments. To
do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable
[mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is
(10%). This setup is very similar to the one used for language by Devlin et al. (2019). Finally, we
predict the 3-bit, mean color (i.e., 512 colors in total) of every corrupted patch using their respective
patch representations.
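
The 50% corruption with its 80/10/10 split can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `corrupt_patches` is a hypothetical helper, embeddings are plain lists of floats, and exactly half of the patches are selected per image (the text does not say whether the 50% is exact or applied independently per patch).

```python
import random


def corrupt_patches(embeddings, mask_embedding, rng, corrupt_rate=0.5):
    """Return (corrupted embeddings, sorted indices of the corrupted patches).

    Requires at least two patches so that a "random other patch" exists.
    """
    n = len(embeddings)
    chosen = sorted(rng.sample(range(n), round(corrupt_rate * n)))
    corrupted = [list(e) for e in embeddings]   # copy; the input is not modified
    for i in chosen:
        r = rng.random()
        if r < 0.8:
            # 80%: replace with the learnable [mask] embedding.
            corrupted[i] = list(mask_embedding)
        elif r < 0.9:
            # 10%: replace with the embedding of a random other patch.
            j = rng.choice([p for p in range(n) if p != i])
            corrupted[i] = list(embeddings[j])
        # Remaining 10%: keep the embedding as is; it still counts as corrupted.
    return corrupted, chosen
```

The returned indices identify the patches whose mean color is to be predicted, including those that were left unchanged.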
We trained our self-supervised model for 1M steps (ca. 14 epochs) with batch size 4096 on JFT. We
use Adam, with a base learning rate of 2 · 10^-4, warmup of 10k steps and cosine learning rate decay.
As prediction targets for pretraining we tried the following settings: 1) predicting only the mean,
3-bit color (i.e., 1 prediction of 512 colors), 2) predicting a 4 × 4 downsized version of the 16 × 16
patch with 3-bit colors in parallel (i.e., 16 predictions of 512 colors), 3) regression on the full patch
using L2 (i.e., 256 regressions on the 3 RGB channels). Surprisingly, we found that all worked quite
well, though L2 was slightly worse. We report final results only for option 1) because it showed the
best few-shot performance. We also experimented with a 15% corruption rate as used by Devlin et al.
(2019), but results were also slightly worse on our few-shot metrics.
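
The count of 512 colors follows from quantizing each of the three RGB channels to 3 bits: 2^3 · 2^3 · 2^3 = 512 classes. A minimal sketch of such a target is given below. The details are assumptions, since the text does not specify them: pixel channels are taken to lie in 0..255, each channel mean is binned uniformly into 8 levels of width 32, and the three bins are packed into one index with red in the highest bits. `mean_color_class` is a hypothetical name.

```python
def mean_color_class(patch):
    """Map a patch to one of 512 classes via its 3-bit-per-channel mean color.

    patch: list of (r, g, b) pixels with channel values in 0..255 (assumed).
    """
    n = len(patch)
    means = [sum(px[c] for px in patch) / n for c in range(3)]
    # 3 bits per channel: 8 uniform bins of width 32 over 0..255.
    r, g, b = (min(7, int(m // 32)) for m in means)
    # Pack the three 3-bit values into a single 9-bit class index.
    return (r << 6) | (g << 3) | b
```

Option 2) in the list above would apply the same quantization to each of the 16 cells of the 4 × 4 downsized patch instead of to the whole-patch mean.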
Lastly, we would like to remark that our instantiation of masked patch prediction doesn’t require
such an enormous amount of pretraining nor a large dataset such as JFT in order to lead to sim-
ilar performance gains on ImageNet classification. That is, we observed diminishing returns on
downstream performance after 100k pretraining steps, and see similar gains when pretraining on
ImageNet.

C ADDITIONAL RESULTS

We report detailed results corresponding to the figures presented in the paper. Table 5 corresponds
to Figure 3 from the paper and shows transfer performance of different ViT models pre-trained
on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M. Table 6 corresponds to
Figure 5 from the paper and shows the transfer performance of ViT, ResNet, and hybrid models of
varying size, as well as the estimated computational cost of their pre-training.

Pre-training    Fine-tuning           ViT-B/16   ViT-B/32   ViT-L/16   ViT-L/32   ViT-H/14
ImageNet        CIFAR-10              98.13      97.77      97.86      97.94      -
                CIFAR-100             87.13      86.31      86.35      87.07      -
                ImageNet              77.91      73.38      76.53      71.16      -
                ImageNet ReaL         83.57      79.56      82.19      77.83      -
                Oxford Flowers-102    89.49      85.43      89.66      86.36      -
                Oxford-IIIT-Pets      93.81      92.04      93.64      91.35      -
ImageNet-21k    CIFAR-10              98.95      98.79      99.16      99.13      99.27
                CIFAR-100             91.67      91.97      93.44      93.04      93.82
                ImageNet              83.97      81.28      85.15      80.99      85.13
                ImageNet ReaL         88.35      86.63      88.40      85.65      88.70
                Oxford Flowers-102    99.38      99.11      99.61      99.19      99.51
                Oxford-IIIT-Pets      94.43      93.02      94.73      93.09      94.82
JFT-300M        CIFAR-10              99.00      98.61      99.38      99.19      99.50
                CIFAR-100             91.87      90.49      94.04      92.52      94.55
                ImageNet              84.15      80.73      87.12      84.37      88.04
                ImageNet ReaL         88.85      86.27      89.99      88.28      90.33
                Oxford Flowers-102    99.56      99.27      99.56      99.45      99.68
                Oxford-IIIT-Pets      95.80      93.40      97.11      95.83      97.56

Table 5: Top-1 accuracy (in %) of Vision Transformer on various datasets when pre-trained on
ImageNet, ImageNet-21k or JFT-300M. These values correspond to Figure 3 in the main text. Models
are fine-tuned at 384 resolution. Note that the ImageNet results are computed without additional
techniques (Polyak averaging and 512 resolution images) used to achieve results in Table 2.

Model             Epochs   ImageNet   ImageNet ReaL   CIFAR-10   CIFAR-100   Pets    Flowers   exaFLOPs
ViT-B/32          7        80.73      86.27           98.61      90.49       93.40   99.27     55
ViT-B/16          7        84.15      88.85           99.00      91.87       95.80   99.56     224
ViT-L/32          7        84.37      88.28           99.19      92.52       95.83   99.45     196
ViT-L/16          7        86.30      89.43           99.38      93.46       96.81   99.66     783
ViT-L/16          14       87.12      89.99           99.38      94.04       97.11   99.56     1567
ViT-H/14          14       88.08      90.36           99.50      94.71       97.11   99.71     4262
ResNet50x1        7        77.54      84.56           97.67      86.07       91.11   94.26     50
ResNet50x2        7        82.12      87.94           98.29      89.20       93.43   97.02     199
ResNet101x1       7        80.67      87.07           98.48      89.17       94.08   95.95     96
ResNet152x1       7        81.88      87.96           98.82      90.22       94.17   96.94     141
ResNet152x2       7        84.97      89.69           99.06      92.05       95.37   98.62     563
ResNet152x2       14       85.56      89.89           99.24      91.92       95.75   98.75     1126
ResNet200x3       14       87.22      90.15           99.34      93.53       96.32   99.04     3306
R50x1+ViT-B/32    7        84.90      89.15           99.01      92.24       95.75   99.46     106
R50x1+ViT-B/16    7        85.58      89.65           99.14      92.63       96.65   99.40     274
R50x1+ViT-L/32    7        85.68      89.04           99.24      92.93       96.97   99.43     246
R50x1+ViT-L/16    7        86.60      89.72           99.18      93.64       97.03   99.40     859
R50x1+ViT-L/16    14       87.12      89.76           99.31      93.89       97.36   99.11     1668

Table 6: Detailed results of model scaling experiments. These correspond to Figure 5 in the main
paper. We show transfer accuracy on several datasets, as well as the pre-training compute (in
exaFLOPs).

D ADDITIONAL ANALYSES

D.1 SGD VS. ADAM FOR RESNETS

ResNets are typically trained with SGD and our use of Adam as optimizer is quite unconventional.
Here we show the experiments that motivated this choice. Namely, we compare the fine-tuning


Table 9: Breakdown of VTAB-1k performance across tasks (rows: tasks; columns: models).

Task           ViT-H/14 (JFT)   ViT-L/16 (JFT)   ViT-L/16 (I21k)
Caltech101     95.3             95.4             90.8
CIFAR-100      85.5             81.9             84.1
DTD            75.2             74.3             74.1
Flowers102     99.7             99.7             99.3
Pets           97.2             96.7             92.7
Sun397         65.0             63.5             61.0
SVHN           88.9             87.4             80.9
Camelyon       83.3             83.6             82.5
EuroSAT        96.7             96.5             95.6
Resisc45       91.4             89.7             85.2
Retinopathy    76.6             77.1             75.3
Clevr-Count    91.7             86.4             70.3
Clevr-Dist     63.8             63.1             56.1
DMLab          53.1             49.7             41.9
dSpr-Loc       79.4             74.5             74.7
dSpr-Ori       63.3             60.5             64.9
KITTI-Dist     84.5             82.2             79.9
sNORB-Azim     33.2             36.2             30.5
sNORB-Elev     51.2             51.1             41.7
Mean           77.6             76.3             72.7
