Scheibenreif — Self-Supervised Vision Transformers for Land-Cover Segmentation and Classification (CVPRW 2022)
the shifting-window vision Transformer architecture (Swin Transformer) [17] with a contrastive data fusion SSL strategy [25] and evaluate this pipeline with single-label classification, multi-label classification and semantic segmentation (i.e. pixel-wise classification) as downstream tasks.

We summarize the contributions of this work as follows:

• We show that vision Transformers combined with self-supervised pre-training provide an effective approach for deep learning in the remote sensing domain, surpassing ConvNet performance in some settings.
• We show that latent representations derived through self-supervised pre-training and subsequent supervised fine-tuning are task agnostic and can be utilized for both land cover classification and segmentation.
• Our work further illustrates that SSL in combination with vision Transformers or ConvNets can yield large performance gains (up to +30% over supervised baselines) across different downstream tasks when fine-tuned with labelled data.
• In an ablation study on fine-tuning self-supervised models with different amounts of labelled data, we demonstrate that fully supervised approaches can be outperformed with as little as 10% of labelled data through SSL.

2. Related Work

2.1. Self-supervised Learning

Self-supervised learning is a branch of machine learning that aims to learn data representations from unlabelled datasets. The literature on self-supervised deep learning has expanded rapidly in recent years, following the success of methods like Word2Vec [19] or GloVe [22] in Natural Language Processing (NLP). Consequently, similar approaches were also adopted in the vision domain. Most SSL methods for images either employ pretext tasks or the principle of contrastive learning. Pretext-based methods utilize inherent properties of data samples to construct prediction tasks for training. These tasks include the prediction of future states from earlier states in sequential data (e.g. audio [34] or text [38]), as well as colorization of artificially gray-scaled images [40], jigsaw tasks [20] or rotation prediction [12] in image data. The second popular SSL strategy, contrastive learning, trains neural networks to learn the relationships between different data points by distinguishing among them. Fundamentally, this approach aims to structure the latent space such that embeddings of similar samples are close together, while those of dissimilar samples are far apart [13]. Different techniques for contrastive learning on image data have been proposed [6, 14, 21, 32, 37] and have recently even surpassed the performance of supervised training for ImageNet classification [7].

The remote sensing community has adapted SSL techniques to learn meaningful representations of satellite imagery in multiple works. Pretext tasks like inpainting and the prediction of relative positions for image patches have been utilized with different satellite datasets and compared to contrastive estimation [30]. The authors of Seasonal Contrast [18] obtain positive samples for contrastive learning from satellite images of the same locations at different points in time, together with augmented data points. Additionally, the data is mapped into multiple embedding subspaces, which results in representations with invariances with respect to different transformations. Work based on the momentum contrast SSL technique [14] also utilizes satellite imagery of given locations at different points in time as temporal positives in contrastive learning, but combines it with location classification in a multi-task framework [1]. The Contrastive Multiview Coding [32] framework for SSL has also been adapted to remote sensing data [27, 29]. These works explored the potential of multi-spectral imaging data in SSL with different band and sensor combinations, as well as cross-dataset transfer of pre-trained models. A different strategy for self-supervised pre-training specific to Transformer models is proposed in [39]. This approach exploits the temporal structure of satellite imagery and frames the prediction of artificially corrupted observations in a satellite image timeseries as the pretext task. Most relevant to our work, [8] propose the use of a UNet-like architecture to obtain pixel-wise representations of multi-modal remote sensing data through contrastive learning. Similarly, [24] combines three different unsupervised loss functions, including a contrastive loss, on multi-modal remote sensing data to pre-train a change detection model. Our work builds on the multi-modal SSL approach from [25], which utilizes image pairs from different satellite instruments as positive pairs. However, we move beyond ConvNets, and the proposed vision Transformer backbone enables our technique to learn task-agnostic representations for classification and segmentation downstream tasks in a self-supervised fashion.

2.2. Vision Transformers

Transformers in NLP   Transformers have revolutionized the field of Natural Language Processing, being the state of the art for several NLP tasks [3, 9] and slowly replacing RNN-based models. Unlike RNNs, Transformers use attention mechanisms that allow them to process sequential data without necessarily following the order of the sequence and to capture long-range dependencies between tokens in a sequence (e.g. words in a sentence).
Figure 2. Network architecture for our proposed method. The training is performed in two stages. First, for Sentinel-1 and Sentinel-2
input pairs, we train a unique backbone consisting of two streams of Swin Transformers (Section 3.2), using a self-supervised contrastive
approach (A) (Section 3.1). Then, for the supervised learning of both tasks (B), the two outputs of the backbone (Z1 , Z2 ) are fed into the
classification head (B.1) and the segmentation head (B.2). Intermediate representations (Z1i and Z2i ) are also used for the segmentation
head. The final projection layer of the segmentation head consists of an up-sampling layer followed by a 1x1 convolutional layer.
Transformers in vision   In computer vision, however, convolutional architectures remain dominant. Inspired by the successes of Transformers in NLP, several works [2, 31] attempt to combine ConvNet-like architectures with attention mechanisms. Moreover, ConvNet-Transformer hybrid models began to emerge, using convolutions for the backbone and appending a Transformer for the task head [5]. The Vision Transformer (ViT) [10] was the first to replace convolutions entirely, proposing to apply a standard Transformer directly to images with as little modification as possible by dividing an image into patches and treating these patches the same as tokens (words) in an NLP application. This Transformer applies self-attention over a global receptive field and has computational complexity quadratic in the number of tokens. After being pre-trained on a large-scale labelled dataset, ViT obtained competitive results on ImageNet, but it has limitations on dense pixel-level predictions (e.g., semantic segmentation), failing to capture fine details due to its fixed patch size. For these reasons, the Swin Transformer [17], a variant of the vision Transformer, proposes a hierarchical way of processing the image, with the goal of achieving scale-invariant representations. It uses the same concept of dividing the image into patches, but groups non-overlapping patches into windows and applies self-attention within each window. A shifted-window scheme is used to allow for cross-window attention connections, which provides a better global representation. The Swin Transformer achieves a better speed-accuracy tradeoff than other architectures of comparable complexity on many downstream tasks such as image classification and object detection. We use the Swin Transformer for single-label and multi-label classification.

Semantic segmentation with Transformers   Semantic segmentation consists of assigning a class label to each pixel of an image. This prediction task requires modeling the interactions between pixels to generate refined representations, which is not straightforward with Transformers. Recently, [36] proposed a pure attention-based model for semantic segmentation and introduced a position-aware axial attention layer that propagates information densely and efficiently along the height and width axes sequentially. While this work follows a ConvNet-like design by gradually reducing the spatial dimension of the feature maps, others have proposed complete encoder-decoder architectures based on Transformers [28]. Here, we do not propose a new segmentation network; instead, we study the advantages of using a task-agnostic representation obtained by self-supervised pre-training of Swin Transformers with multi-modal inputs, which greatly improves the segmentation task.
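As a concrete illustration of the patch-token view that both ViT and Swin build on, the following is a minimal PyTorch sketch (not from the paper) of splitting an image into flattened patch tokens; the patch size and function name are illustrative, and in ViT a learned linear projection subsequently maps each token to the embedding dimension.

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split images of shape (B, C, H, W) into flattened patch tokens (B, N, C * patch_size**2)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = images.reshape(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * patch_size * patch_size)
    return x  # each of the N = (H/p)*(W/p) rows is treated like a word token

tokens = patchify(torch.randn(2, 3, 224, 224))  # -> (2, 196, 768)
```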
3. Methods

Figure 2 illustrates our overall approach, which we detail in the following.

3.1. Self-supervised Learning

In this work we propose the use of contrastive SSL for pre-training of Transformer models on remote sensing data. A key property of remote sensing data is that data obtained by a multitude of sensors aboard different satellites close in time may be available for the same location. This property can be exploited to generate multiple views of the same scene in an augmentation-free manner. The resulting SSL strategy uses satellite imagery from different sensors for the same location as positive image pairs and images from other sensors and locations as negative samples [25]. This approach enables contrastive SSL without the use of strong random augmentations and with dedicated encoders for each modality (i.e., no weight-sharing), contrary to standard practice in SSL methods for natural images [6]. The contrastive loss is defined as:

\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(\mathcal{R}_i, \mathcal{R}_j)/\tau)}{\sum_{k=1}^{2N} \mathbbm{1}_{[k \neq i]} \exp(\mathrm{sim}(\mathcal{R}_i, \mathcal{R}_k)/\tau)},    (1)

where R_i and R_j are representations of a positive pair, R_k are negative (contrastive) representations, sim(·, ·) is the dot product, 𝟙 is the indicator function, and τ is a so-called temperature parameter. The vector representations R are obtained by passing the samples through the backbone models (see Figure 2).
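As an illustration of Eq. (1), a minimal PyTorch sketch for a batch of N positive pairs, assuming z1 and z2 are the representations produced by the two backbone streams for corresponding Sentinel-1/2 patches; the function name and temperature value are hypothetical, and sim(·, ·) is implemented as the plain dot product, as defined above.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss of Eq. (1) for a batch of N Sentinel-1/2 embedding pairs.

    z1, z2: (N, D) representations from the two backbone streams; the i-th rows form
    a positive pair, all other samples in the 2N-sized batch serve as negatives."""
    z = torch.cat([z1, z2], dim=0)                             # (2N, D)
    sim = (z @ z.t()) / tau                                    # pairwise dot-product similarities
    n = z1.shape[0]
    diag = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(diag, float("-inf"))                 # enforce k != i in the denominator
    positives = torch.arange(2 * n, device=z.device).roll(n)   # row i is paired with row i +/- N
    return F.cross_entropy(sim, positives)                     # mean of L_{i,j} over all 2N anchors

# e.g. loss = multimodal_info_nce(torch.randn(32, 128), torch.randn(32, 128))
```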
3.2. Swin Transformers

Swin Transformers [17] are vision Transformers designed as backbones for all kinds of visual tasks, such as image classification, object detection and semantic segmentation. They owe their success to their scale-invariance property, which allows them to be used for both high-level and dense predictions. Their strength results from their shifted-window approach: a window contains non-overlapping square patches, and self-attention is calculated locally within each window before shifting. As a result, the computational complexity is reduced compared to the standard Transformer. The backbone (encoder) is made up of 4 building blocks, where each block is constructed by connecting a patch merging layer and several Swin Transformer blocks. A Swin Transformer block is composed of Multi-head Self-Attention (MSA), followed by a 2-layer Multi-Layer Perceptron (MLP). A Layer Norm (LN) is applied before each MSA and MLP. The first Swin Transformer block uses a standard window partitioning configuration to locally compute self-attention from uniformly separated windows. The next one adopts a window configuration shifted by a certain pixel offset relative to the previous layer, and so on.
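For intuition, a minimal sketch (illustrative, not the reference Swin implementation) of the regular and shifted window partitioning described above; the cyclic roll realises the shift, and the attention masks Swin applies to the wrapped-around regions are omitted here.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int, shift: int = 0) -> torch.Tensor:
    """Group a (B, H, W, C) feature map into non-overlapping windows of size window_size.

    With shift > 0 the map is cyclically rolled first; alternating shift = 0 and
    shift = window_size // 2 across layers connects neighbouring windows."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    b, h, w, c = x.shape
    x = x.reshape(b, h // window_size, window_size, w // window_size, window_size, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)
    return windows  # (num_windows * B, window_size**2, C); self-attention runs per window

regular = window_partition(torch.randn(1, 8, 8, 96), window_size=4, shift=0)
shifted = window_partition(torch.randn(1, 8, 8, 96), window_size=4, shift=2)
```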
Regarding the classification task, we concatenate the two representations coming out of the backbone (see Figure 2) and feed them into a fully connected layer. We detail in the next section the architecture used for the segmentation task.
3.2.1 SwinUNet

We use SwinUNet [4] as the basic framework for the semantic segmentation task. The architecture is similar to a UNet [23]; it consists of an encoder, bottleneck, decoder and skip connections. While in a traditional UNet the encoder and decoder are symmetric blocks of convolutional and max-pooling layers, in the SwinUNet they are symmetric blocks of Swin Transformers (see Section 3.2). To generate the hierarchical representations of the features, the Swin Transformer blocks are preceded by a patch merging (downsampling) step in the contracting path and a patch expanding (upsampling) step in the expansive path. Features extracted after blocks of the same spatial dimension are merged via skip connections. Finally, to restore the initial resolution (height and width) of the input, an upsampling operation is performed on the last patch expansion layer.

Dual SwinUNet: We propose two separate SwinUNet network streams with identical architectures (see previous description) to process pairs of Sentinel-1 and Sentinel-2 data in parallel, in a dual-stream concept (see Figure 2). The features extracted after the last patch expansion layer of each stream are concatenated along the channel dimension and fed into the last pointwise convolution (1x1 kernel) to produce the segmentation predictions at the pixel level. The purpose of the dual SwinUNet is to capture the information belonging to each of the two inputs first, before merging the two representations for the final decision.
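A minimal sketch of this late-fusion output stage, assuming both streams return feature maps of the same spatial size and channel count; the module name is illustrative, and the eight output classes correspond to the DFC2020 land-cover classes.

```python
import torch
import torch.nn as nn

class DualStreamFusionHead(nn.Module):
    """Late-fusion output stage: per-stream feature maps are concatenated along the
    channel dimension and mapped to pixel-wise class scores by a 1x1 convolution."""

    def __init__(self, in_channels: int, num_classes: int = 8):
        super().__init__()
        self.classifier = nn.Conv2d(2 * in_channels, num_classes, kernel_size=1)

    def forward(self, feats_s1: torch.Tensor, feats_s2: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([feats_s1, feats_s2], dim=1)  # (B, 2C, H, W)
        return self.classifier(fused)                   # (B, num_classes, H, W) logits

head = DualStreamFusionHead(in_channels=96)
logits = head(torch.randn(2, 96, 256, 256), torch.randn(2, 96, 256, 256))
```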
4. Data

This work applies SSL and vision Transformers on paired satellite data from the Sentinel-1 and Sentinel-2 missions of the European Space Agency's Copernicus program.

Sentinel-1 is a satellite mission for Earth observation with Synthetic Aperture Radar (SAR) [33]. It provides medium resolution (∼10m) C-band SAR measurements with dual polarisation, which enables data acquisition during night or through cloud cover. The two Sentinel-1 satellites are in sun-synchronous orbits with a 12 day repeat cycle. This work utilizes VV and VH polarized data from the main Interferometric Wide-swath mode.

Sentinel-2 is a constellation of two sun-synchronous satellites for optical Earth observation at medium resolution [11]. The on-board instrument provides multi-spectral observations in the visible, near- and short-wave infrared in 13 bands with up to 10m pixel resolution.
Table 1. Results for single- and multi-label classification downstream tasks with ResNet50 and Swin Transformer backbones. S1 and S2 models are trained solely on data from Sentinel-1 or Sentinel-2 without data fusion. EarlyF. and LateF. perform Sentinel-1/2 data fusion at the model input or embedding level. FT corresponds to fine-tuning the pre-trained self-supervised model for the downstream task, whereas "Frozen" models only train task-specific heads. Accuracies are reported with their standard deviations from 5 runs. The best performing model for each class is highlighted in bold. The frequency of each class in the training set is noted in parentheses in the Class column, reflecting class imbalances.
Shrubl. (8%)
Grassl. (27%) 33 ± 8 53 ± 5 49 ± 4 43 ± 4 47 ± 6 11 ± 3 14 ± 4 51 ± 2 39 ± 7 50 ± 2 18 ± 6 40 ± 3
Wetl. (35%) 18 ± 2 9 ± 2 10 ± 1 12 ± 2 24 ± 2 16 ± 2 18 ± 1 9 ± 1 9 ± 1 10 ± 1 23 ± 3 27 ± 1
Cropl. (23%) 63 ± 1 58 ± 3 60 ± 1 63 ± 1 70 ± 3 64 ± 1 46 ± 3 56 ± 1 56 ± 2 61 ± 1 69 ± 1 65 ± 1
Urban (10%) 70 ± 3 55 ± 3 61 ± 2 73 ± 1 80 ± 2 79 ± 1 69 ± 0 51 ± 1 65 ± 2 70 ± 1 77 ± 1 83 ± 1
Barren (6%) 27 ± 3 24 ± 2 22 ± 3 22 ± 2 34 ± 3 25 ± 0 14 ± 2 22 ± 2 25 ± 1 26 ± 2 33 ± 4 32 ± 2
Water (43%) 95 ± 0 89 ± 2 96 ± 3 96 ± 0 97 ± 0 93 ± 0 95 ± 0 72 ± 1 94 ± 0 95 ± 0 97 ± 0 96 ± 1
Overall 56 ± 2 56 ± 2 59 ± 2 61 ± 1 67 ± 1 60 ± 0 42 ± 1 51 ± 2 58 ± 1 56 ± 1 60 ± 1 62 ± 1
Average 49 ± 2 47 ± 2 50 ± 1 52 ± 1 59 ± 1 48 ± 1 36 ± 1 43 ± 0 49 ± 1 49 ± 0 53 ± 1 56 ± 1
Table 2. Results for segmentation downstream tasks with Swin Transformer backbone. S1 and S2 models are trained solely on data from Sentinel-1 or Sentinel-2 without data fusion. EarlyF. and LateF. perform Sentinel-1/2 data fusion at the model input or embedding level. FT corresponds to fine-tuning the pre-trained self-supervised model for the downstream task, whereas "Frozen" models only train task-specific heads. Per-class accuracies and mean Intersection over Union are reported with their standard deviations from 5 runs. The best performing model for each class is highlighted in bold. The per-class pixel-wise distribution in our training set is noted next to each class.
The two Sentinel-2 satellites achieve a revisit rate of 5 days at the equator.

SEN12MS   The SEN12MS dataset [26] is a large-scale collection of spatially aligned observation pairs from Sentinel-1 and Sentinel-2. The dataset contains 180,662 observations and covers different geographical areas around the world. All Sentinel-1/2 image pairs are obtained in the same season and pre-processed to a harmonized resolution of 10m for all bands. This work utilizes SEN12MS for self-supervised pre-training without access to any labels.
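For illustration, a hypothetical PyTorch dataset that serves such co-registered Sentinel-1/2 pairs for the label-free pre-training stage; the sample list and `load_patch` helper are assumptions, not the SEN12MS loading API.

```python
from torch.utils.data import Dataset

class PairedSentinelDataset(Dataset):
    """Serves co-registered (Sentinel-1, Sentinel-2) patches of the same scene
    as the two views for contrastive pre-training; no labels are required."""

    def __init__(self, samples, load_patch):
        # `samples`: list of (s1_path, s2_path) tuples; `load_patch`: callable
        # returning a (C, H, W) float tensor for a given file path.
        self.samples = samples
        self.load_patch = load_patch

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        s1_path, s2_path = self.samples[idx]
        s1 = self.load_patch(s1_path)  # e.g. 2 channels: VV/VH backscatter
        s2 = self.load_patch(s2_path)  # e.g. 13 multi-spectral bands
        return s1, s2
```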
DFC2020   The DFC2020 dataset is an extension to SEN12MS constructed for the IEEE GRSS Data Fusion Contest 2020 [16]. This dataset consists of validation and test sets with 986 and 5,128 paired Sentinel-1/2 observations, respectively. In addition to the satellite imagery, DFC2020 also provides dense (i.e., pixel-level) land-cover annotations for the classes Forest, Shrubland, Grassland, Wetland, Cropland, Urban/Built-up, Barren and Water. We use the DFC2020 dataset to evaluate the downstream tasks of single- and multi-label classification and semantic segmentation. In this work we use the validation split for training, as in Track 2 of the Contest. We note that this data is highly unbalanced (see Tables 1 and 2). Furthermore, we follow a different objective than the Data Fusion Contest [16] by utilizing its dataset to investigate the use of SSL and vision Transformer models in the remote sensing domain.
5. Experiments and Results

We perform extensive experiments to assess the performance of vision Transformers on three different downstream tasks based on the DFC2020 dataset. The Transformer architectures are compared against different baselines, including commonly used ResNet50 ConvNet models [15]. In particular, we focus on the benefits of SSL and subsequent fine-tuning over training from scratch to leverage the large vision Transformer models on small labelled remote sensing datasets.
5.1. Baselines

Classification Baselines: We use four different data settings for each classification baseline model architecture: only Sentinel-1 input data, only Sentinel-2 input data, early Sentinel-1/2 fusion through concatenation across the channel dimension at the data input stage, and late fusion by concatenating feature maps derived from Sentinel-1/2 inputs with distinct model backbones before the final classification layer. Besides Swin Transformers, we use ResNet50 as a baseline model architecture, as it comprises a comparable number of parameters to the Transformer (Swin-T). These models are trained from scratch on the validation split of the DFC2020 dataset. Results evaluated on the test split are reported in Table 1. Both architectures result in moderately good performance for single-label classification despite the small training dataset. The Swin Transformer achieves the best average accuracy in the late-fusion setting (44±0 percentage points), while the ResNet50 slightly outperforms the Transformer in the early-fusion and uni-modal settings. For multi-label classification (see Table 1, bottom half) the late fusion approach yields the highest F1 Scores for both backbone architectures (52±1 and 49±0 percentage points for ResNet50 and Swin Transformer, respectively).
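A schematic sketch of the two fusion settings (illustrative only): early fusion concatenates the Sentinel-1 and Sentinel-2 bands along the channel dimension before a single backbone, whereas late fusion concatenates the embeddings produced by two modality-specific backbones before the final classification layer.

```python
import torch
import torch.nn as nn

def early_fusion_input(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    # (B, 2, H, W) Sentinel-1 + (B, 13, H, W) Sentinel-2 -> (B, 15, H, W);
    # the single backbone's first layer must accept the stacked channel count.
    return torch.cat([s1, s2], dim=1)

class LateFusionClassifier(nn.Module):
    """Two modality-specific backbones; their (B, embed_dim) embeddings are
    concatenated before the final fully connected classification layer."""

    def __init__(self, backbone_s1: nn.Module, backbone_s2: nn.Module,
                 embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone_s1 = backbone_s1
        self.backbone_s2 = backbone_s2
        self.head = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.backbone_s1(s1), self.backbone_s2(s2)], dim=1)
        return self.head(z)
```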
Segmentation Baselines: We use two baseline models to compare against the self-supervised model we present for the semantic segmentation task: i) a standard UNet architecture [23] and ii) a SwinUNet (see Section 3.2.1). We train both of these baselines from scratch, following the same experimental setup described above. Note that for the late-fusion experiment, we use a Dual SwinUNet model (see Section 3.2.1). Results are reported in terms of pixel-wise accuracy and mean Intersection over Union (mIoU) in Table 2. We note that UNet achieves higher pixel accuracy than SwinUNet in both uni-modal and multi-modal configurations, with the best average accuracy reaching 46 ± 1.

5.2. SSL Pre-training with Fine-tuning

After self-supervised training of the respective model backbone on the SEN12MS dataset (∼3 days on an NVIDIA Tesla V100 GPU), all model parameters (backbone and task-specific head) are fine-tuned for the downstream task of interest. For single-label classification, we find that self-supervised pre-training with subsequent fine-tuning strongly outperforms all baseline models (see Table 1). The average accuracy score increases by 10 and 5 percentage points for ResNet50 and Swin Transformer, respectively. This corresponds to a relative increase of 23.3% and 11.4% over the best baselines. We find similar results in the multi-label setting, with relative increases in F1 Scores of 13.5% and 8.2% over the best baselines with the fine-tuned self-supervised ResNet50 and Swin Transformer models. Segmentation results are shown in Table 2 for uni-modal (Sentinel-1, Sentinel-2) and multi-modal (late-fusion, see Figure 2) inputs. For the multi-modal case, we note that the average pixel accuracy of our fine-tuned model increased by 8 (+19% relative increase) and 6 (+13%) percentage points, compared to the SwinUNet and the UNet trained from scratch, respectively.

5.3. SSL Pre-training with Frozen Backbone

We investigate to what degree feature maps produced by models trained in a self-supervised manner encode relevant information for land-cover classification and segmentation downstream tasks. To test this, we freeze all parameters of the model backbones and only train the parameters of randomly initialized classification or segmentation heads for each task.
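A minimal sketch of this frozen-backbone protocol, assuming a model with `backbone` and `head` submodules (names are illustrative); the optimizer choice and the example learning rate (taken from the fixed hyperparameters reported in Section 5.4) are assumptions rather than the authors' exact training code. Omitting the freezing step recovers the full fine-tuning setting of Section 5.2.

```python
import torch

def configure_optimizer(model: torch.nn.Module, lr: float, freeze_backbone: bool) -> torch.optim.Optimizer:
    """Frozen-backbone setting: exclude the SSL pre-trained backbone from gradient updates
    and optimize only the randomly initialized task head. With freeze_backbone=False,
    all parameters are updated, i.e. full fine-tuning."""
    if freeze_backbone:
        for p in model.backbone.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# e.g. optimizer = configure_optimizer(model, lr=3e-6, freeze_backbone=True)
```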
Figure 3. (a) Classification and (b) segmentation results for training on different fractions of labelled data (label fractions 0.01, 0.10, 0.50 and 1.00). (a) Swin Transformer and ResNet50 models pre-trained with SSL strongly outperform training from scratch on the classification downstream task. (b) SwinUNets with pre-training outperform training from scratch when freezing or fine-tuning the backbone.
Table 3. Comparison of different methods for segmentation. All models are trained using multi-modal input; SSL-ft. means fine-tuning the pre-trained model.

                 UNet   SwinUNet   SwinUNet SSL-ft.   Ensemble
Avg. Accuracy    0.45   0.43       0.51               0.53
Avg. IoU         0.31   0.33       0.37               0.39

Evidently, our SSL strategy extracts meaningful features for land-cover classification. Training a single-label classification head on top of the frozen ResNet50 backbone yields strong performance, and even surpasses the best ResNet50 baseline model by 13 percentage points average accuracy (+30.2%). The frozen Swin Transformer model even yields the best single-label classification performance of all approaches presented in this work, with an average accuracy of 57±1 (+29.5% over the best Swin Transformer baseline). The pre-trained and frozen Swin Transformer yields similar performance on the multi-label downstream task (+14% average F1 Score over the best baseline), even though the frozen ResNet50 fails to outperform the best multi-label classification baseline (-7.7% average F1 Score). This indicates that the self-supervised Transformer model learns more meaningful representations that encode sufficient information to extract multiple class labels with a small classification head.

We observe largely similar behavior for the segmentation task; training only the segmentation head surpasses our two baselines (UNet and SwinUNet) by 5 (+12%) and 3 (+7%) percentage points, respectively, but performs no better than fine-tuning.

5.4. Label Fraction Experiments

Hyperparameters for these experiments are fixed to sensible values a-priori, rather than tuned for every individual experimental setting. The batch size is set to 32, the learning rate to 3 · 10⁻⁶ and the number of training epochs to 200. This approach also makes it possible to utilize the full DFC2020 validation set (986 observations with dense land-cover labels) for training, as we do not require a validation set for hyperparameter tuning.
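A sketch of how such label-fraction subsets can be drawn (the paper does not specify the exact sampling procedure, so this is an assumption): a fixed random fraction of the labelled training indices is selected for fine-tuning, with fractions matching the x-axis of Figure 3.

```python
import torch
from torch.utils.data import Subset

def label_fraction_subset(dataset, fraction: float, seed: int = 0) -> Subset:
    """Randomly select `fraction` of the labelled training samples for fine-tuning."""
    generator = torch.Generator().manual_seed(seed)
    n = max(1, int(round(fraction * len(dataset))))
    indices = torch.randperm(len(dataset), generator=generator)[:n].tolist()
    return Subset(dataset, indices)

# fractions shown in Figure 3; with 986 labelled observations, 0.10 corresponds to ~100 samples
# subsets = {f: label_fraction_subset(train_set, f) for f in (0.01, 0.10, 0.50, 1.00)}
```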
Figure 4. Qualitative comparison of results for 3 different regions. Results from left to right: Sentinel-2 true color (RGB), DFC groundtruth, UNet trained from scratch on fusion of both inputs, SwinUNet trained from scratch on both inputs, SwinUNet fine-tuned on both inputs, and finally an ensemble model of both UNet and SwinUNet (see Section 6). The colour legend covers the classes Water, Barren, Urban, Cropland, Wetland, Grassland, Shrubland and Forest.
6. Discussion

Our experiments demonstrate that large performance gains can be achieved through self-supervised pre-training. For single-label classification, the frozen backbone models performed better than fine-tuning all parameters, which we attribute to the hyperparameter choice. For the segmentation task, we arrive at the same conclusion as above: self-supervised pre-training considerably boosts performance. Nevertheless, we note that the SSL Swin Transformer with frozen backbone does not perform better than the SSL fine-tuned one. This may be a result of the segmentation head architecture, which uses skip connections to merge the multi-scale characteristics of the encoder with the upsampled characteristics of the decoder. Therefore, to achieve the best performance, encoder and decoder parameters should be updated simultaneously.

The importance of our approach for data-efficient learning is further underlined by the results of the label fraction experiment. Across all downstream tasks, our pre-trained and fine-tuned models perform on par with models trained from scratch with as little as 10% of the labelled data. This SSL approach thus opens a path to learning based on very small datasets (∼100 samples), enabling data-efficient applications. In a qualitative comparison, we show some segmentation results in Figure 4. We first observe that the UNet method produces smoother segmentation masks compared to the SwinUNet. On the other hand, the segmentation masks produced by the SwinUNet are much more detailed and accurate. These observations motivated the idea of trying an ensemble of these two methods, where we take the average of the predictions of the two models before computing the final prediction (see Figure 4). Across all our downstream tasks, we note that classes like Grassland or Wetland are commonly misclassified on some images. This is most likely due to our limited and very unbalanced training set (see Tables 1 and 2). Overall, the ensemble model works best, both visually, giving smooth and detailed results, and numerically (see Table 3). This pushes us to explore this direction in future works, with the aim of improving land cover segmentation, taking into account the advantages of each of these methods.
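A minimal sketch of this ensembling step (illustrative; the paper does not state whether probabilities or logits are averaged, here we average per-pixel class probabilities before the final argmax):

```python
import torch

@torch.no_grad()
def ensemble_prediction(unet_logits: torch.Tensor, swinunet_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-pixel class probabilities of the two segmentation models,
    then take the argmax as the final land-cover prediction.

    Both inputs are (B, num_classes, H, W) logits for the same scene."""
    probs = (unet_logits.softmax(dim=1) + swinunet_logits.softmax(dim=1)) / 2
    return probs.argmax(dim=1)  # (B, H, W) class indices
```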
7. Conclusion

This work introduced a self-supervised pre-trained Swin Transformer for land cover classification and segmentation using a contrastive learning approach, as illustrated in Figure 2. The training is done in two stages: first, self-supervised training of one unique backbone is performed on a large unlabelled dataset; second, supervised fine-tuning of this backbone is performed on a small, labelled dataset for two separate downstream tasks. Experimental results on the test set validate our proposed method against a variety of baseline models trained from scratch. Our self-supervised approach yields consistently higher performance across different downstream tasks, with particularly strong improvements in the low-data regime. Furthermore, our work illustrates the utility of Transformer models for Earth observation without the need for large labelled datasets.
References

[1] Kumar Ayush, Burak Uzkent, Chenlin Meng, Kumar Tanmay, Marshall Burke, David Lobell, and Stefano Ermon. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10181–10190, 2021. 2
[2] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3285–3294, 2019. 2
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. 2
[4] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-Unet: Unet-like pure transformer for medical image segmentation. ArXiv, abs/2105.05537, 2021. 4
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. ArXiv, abs/2005.12872, 2020. 3
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607. PMLR, 2020. 2, 4
[7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020. 2
[8] Yuxing Chen and Lorenzo Bruzzone. Self-supervised SAR-optical data fusion of Sentinel-1/-2 images. IEEE Transactions on Geoscience and Remote Sensing, 2021. 2
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. 2
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2021. 1, 3
[11] Matthias Drusch, Umberto Del Bello, Sébastien Carlier, Olivier Colin, Veronica Fernandez, Ferran Gascon, Bianca Hoersch, Claudia Isola, Paolo Laberinti, Philippe Martimort, et al. Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sensing of Environment, 120:25–36, 2012. 4
[12] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018. 2
[13] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742. IEEE, 2006. 2
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. 2
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 6
[16] Michael Schmitt, Lloyd Hughes, Pedram Ghamisi, Naoto Yokoya, and Ronny Hänsch. 2020 IEEE GRSS Data Fusion Contest, 2019. 6
[17] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 2, 3, 4
[18] Oscar Mañas, Alexandre Lacoste, Xavier Giro-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021. 2
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 2
[20] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016. 2
[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2
[22] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. 2
[23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 4, 6
[24] Sudipan Saha, Patrick Ebel, and Xiao Xiang Zhu. Self-supervised multisensor change detection. IEEE Transactions on Geoscience and Remote Sensing, 2021. 2
[25] Linus Scheibenreif, Michael Mommert, and Damian Borth. Contrastive self-supervised data fusion for satellite imagery. In International Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2022. 2, 4
[26] Michael Schmitt, Lloyd Haydn Hughes, Chunping Qiu, and Xiao Xiang Zhu. SEN12MS – a curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. arXiv preprint arXiv:1906.07789, 2019. 5
[27] Vladan Stojnic and Vladimir Risojevic. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021. 2
[28] Robin A. M. Strudel, Ricardo Garcia Pinel, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7242–7252, 2021. 3
[29] Aidan M. Swope, Xander H. Rudelis, and Kyle T. Story. Representation learning for remote sensing: An unsupervised sensor fusion approach. arXiv preprint arXiv:2108.05094, 2021. 2
[30] Chao Tao, Ji Qi, Weipeng Lu, Hao Wang, and Haifeng Li. Remote sensing image scene classification with self-supervised paradigm under limited labeled samples. IEEE Geoscience and Remote Sensing Letters, 2020. 2
[31] Chunwei Tian, Yong Xu, Zuoyong Li, Wangmeng Zuo, Lunke Fei, and Hong Liu. Attention-guided CNN for image denoising. Neural Networks, 124:117–129, 2020. 2
[32] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794. Springer, 2020. 2
[33] Ramon Torres, Paul Snoeij, Dirk Geudtner, David Bibby, Malcolm Davidson, Evert Attema, Pierre Potin, Björn Rommen, Nicolas Floury, Mike Brown, et al. GMES Sentinel-1 mission. Remote Sensing of Environment, 120:9–24, 2012. 4
[34] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. SSW, 125:2, 2016. 2
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. 1
[36] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Loddon Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020. 3
[37] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018. 2
[38] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 2019. 2
[39] Yuan Yuan and Lei Lin. Self-supervised pretraining of transformers for satellite image time series classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:474–487, 2020. 2
[40] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016. 2
[41] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4):8–36, 2017. 1