Scheibenreif — Self-Supervised Vision Transformers for Land-Cover Segmentation and Classification (CVPRW 2022)
the shifting-window vision Transformer architecture (Swin Transformer) [17] with a contrastive data fusion SSL strategy [25] and evaluate this pipeline with single-label classification, multi-label classification and semantic segmentation (i.e. pixel-wise classification) as downstream tasks.

We summarize the contributions of this work as follows:

• We show that vision Transformers combined with self-supervised pre-training provide an effective approach for deep learning in the remote sensing domain, surpassing ConvNet performance in some settings.
• We show that latent representations derived through self-supervised pre-training and subsequent supervised fine-tuning are task agnostic and can be utilized for both land cover classification and segmentation.
• Our work further illustrates that SSL in combination with vision Transformers or ConvNets can yield large performance gains (up to +30% over supervised baselines) across different downstream tasks when fine-tuned with labelled data.
• In an ablation study on fine-tuning self-supervised models with different amounts of labelled data, we demonstrate that fully supervised approaches can be outperformed with as little as 10% of labelled data through SSL.

2. Related Work

2.1. Self-supervised Learning

Self-supervised learning is a branch of machine learning that aims to learn data representations from unlabelled datasets. The literature on self-supervised deep learning has expanded rapidly in recent years, following the success of methods like Word2Vec [19] or GloVe [22] in Natural Language Processing (NLP). Consequently, similar approaches were also adopted in the vision domain. Most SSL methods for images either employ pretext tasks or the principle of contrastive learning. Pretext-based methods utilize inherent properties of data samples to construct prediction tasks for training. These tasks include the prediction of future states from earlier states in sequential data (e.g. audio [34] or text [38]), as well as colorization of artificially gray-scaled images [40], jigsaw tasks [20] or rotation prediction [12] in image data. The second popular SSL strategy, contrastive learning, trains neural networks to learn the relationships between different data points by distinguishing among them. Fundamentally, this approach aims to structure the latent space such that embeddings of similar samples are close together, while those of dissimilar samples are far apart [13]. Different techniques for contrastive learning on image data have been proposed [6, 14, 21, 32, 37] and have recently even surpassed the performance of supervised training for ImageNet classification [7].

The remote sensing community has adapted SSL techniques to learn meaningful representations of satellite imagery in multiple works. Pretext tasks like inpainting and the prediction of relative positions for image patches have been utilized with different satellite datasets and compared to contrastive estimation [30]. The authors of Seasonal Contrast [18] obtain positive samples for contrastive learning from satellite images of the same locations at different points in time, together with augmented data points. Additionally, the data is mapped into multiple embedding subspaces, which results in representations with invariances with respect to different transformations. Work based on the momentum contrast SSL technique [14] also utilizes satellite imagery of given locations at different points in time as temporal positives in contrastive learning, but combines it with location classification in a multi-task framework [1]. The Contrastive Multiview Coding [32] framework for SSL has also been adapted to remote sensing data [27, 29]. These works explored the potential of multi-spectral imaging data in SSL with different band and sensor combinations, as well as cross-dataset transfer of pre-trained models. A different strategy for self-supervised pre-training specific to Transformer models is proposed in [39]. This approach exploits the temporal structure of satellite imagery and frames the prediction of artificially corrupted observations in a satellite image timeseries as the pretext task. Most relevant to our work, [8] propose the use of a UNet-like architecture to obtain pixel-wise representations of multi-modal remote sensing data through contrastive learning. Similarly, [24] combines three different unsupervised loss functions, including a contrastive loss, on multi-modal remote sensing data to pre-train a change detection model. Our work builds on the multi-modal SSL approach from [25], which utilizes image pairs from different satellite instruments as positive pairs. However, we move beyond ConvNets, and the proposed vision Transformer backbone enables our technique to learn task-agnostic representations for classification and segmentation downstream tasks in a self-supervised fashion.

2.2. Vision Transformers

Transformers in NLP   Transformers have revolutionized the field of Natural Language Processing, being the state of the art for several NLP tasks [3, 9] and slowly replacing RNN-based models. Unlike RNNs, Transformers use attention mechanisms that allow them to process sequential data without necessarily following the order of the sequence and to capture long-range dependencies between tokens in a sequence (e.g. words in a sentence).
Figure 2. Network architecture for our proposed method. The training is performed in two stages. First, for Sentinel-1 and Sentinel-2
input pairs, we train a unique backbone consisting of two streams of Swin Transformers (Section 3.2), using a self-supervised contrastive
approach (A) (Section 3.1). Then, for the supervised learning of both tasks (B), the two outputs of the backbone (Z1 , Z2 ) are fed into the
classification head (B.1) and the segmentation head (B.2). Intermediate representations (Z1i and Z2i ) are also used for the segmentation
head. The final projection layer of the segmentation head consists of an up-sampling layer followed by a 1x1 convolutional layer.
Transformers in vision   In computer vision, however, convolutional architectures remain dominant. Inspired by the successes of Transformers in NLP, several works [2, 31] attempt to combine ConvNet-like architectures with attention mechanisms. Moreover, ConvNet-Transformer hybrid models began to emerge, using convolutions for the backbone and appending a Transformer for the task head [5]. The Vision Transformer (ViT) [10] was the first to replace convolutions entirely, proposing to apply a standard Transformer directly to images with as little modification as possible by dividing an image into patches and treating these patches the same as tokens (words) in an NLP application. This Transformer applies self-attention over a global receptive field and has computational complexity quadratic in the number of tokens. After being pre-trained on a large-scale labelled dataset, ViT obtained competitive results on ImageNet, but it has limitations on dense pixel-level predictions (e.g., semantic segmentation), failing to capture fine details due to its fixed patch size. For these reasons, the Swin Transformer [17], a variant of the vision Transformer, proposes a hierarchical way of processing the image, with the goal of achieving scale-invariant representations. It uses the same concept of dividing the image into patches, but groups non-overlapping patches into windows and applies self-attention within each window. A shifted-window scheme is used to allow for cross-window attention connections, which provides a better global representation. The Swin Transformer achieves a better speed-accuracy tradeoff than other architectures of comparable complexity on many downstream tasks such as image classification and object detection. We use the Swin Transformer for single-label and multi-label classification.

Semantic segmentation with Transformers   Semantic segmentation consists of assigning a class label to each pixel of an image. This prediction task requires modeling the interactions between pixels to generate refined representations, which is not straightforward with Transformers. Recently, [36] proposed a pure attention-based model for semantic segmentation and introduced a position-aware axial attention layer that propagates information densely and efficiently along the height and width axes sequentially. While this work follows a ConvNet-like design by gradually reducing the spatial dimension of the feature maps, others have proposed complete encoder-decoder architectures based on Transformers [28]. Here, we do not propose a new segmentation network; instead, we study the advantages of using a task-agnostic representation obtained by self-supervised pre-training of Swin Transformers with multi-modal inputs, which greatly improves the segmentation task.
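As a concrete illustration of the patch-token view that both ViT and Swin build on, the following is a minimal PyTorch sketch (not from the paper) of splitting an image into flattened patch tokens; the patch size and function name are illustrative, and in ViT a learned linear projection subsequently maps each token to the embedding dimension.

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split images of shape (B, C, H, W) into flattened patch tokens (B, N, C * patch_size**2)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = images.reshape(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * patch_size * patch_size)
    return x  # each of the N = (H/p)*(W/p) rows is treated like a word token

tokens = patchify(torch.randn(2, 3, 224, 224))  # -> (2, 196, 768)
```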
3. Methods

Figure 2 illustrates our overall approach, which we detail in the following.

3.1. Self-supervised Learning

In this work we propose the use of contrastive SSL for pre-training of Transformer models on remote sensing data. A key property of remote sensing data is that data obtained by a multitude of sensors aboard different satellites close in time may be available for the same location. This property can be exploited to generate multiple views of the same scene in an augmentation-free manner. The resulting SSL strategy uses satellite imagery from different sensors for the same location as positive image pairs and images from other sensors and locations as negative samples [25]. This approach enables contrastive SSL without the use of strong random augmentations and with dedicated encoders for each modality (i.e., no weight-sharing), contrary to standard practice in SSL methods for natural images [6]. The contrastive loss is defined as:

\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(\mathcal{R}_i, \mathcal{R}_j)/\tau)}{\sum_{k=1}^{2N} \mathbbm{1}_{[k \neq i]} \exp(\mathrm{sim}(\mathcal{R}_i, \mathcal{R}_k)/\tau)},    (1)

where R_i and R_j are representations of a positive pair, R_k are negative (contrastive) representations, sim(·, ·) is the dot product, 𝟙 is the indicator function, and τ is a so-called temperature parameter. The vector representations R are obtained by passing the samples through the backbone models (see Figure 2).
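As an illustration of Eq. (1), a minimal PyTorch sketch for a batch of N positive pairs, assuming z1 and z2 are the representations produced by the two backbone streams for corresponding Sentinel-1/2 patches; the function name and temperature value are hypothetical, and sim(·, ·) is implemented as the plain dot product, as defined above.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss of Eq. (1) for a batch of N Sentinel-1/2 embedding pairs.

    z1, z2: (N, D) representations from the two backbone streams; the i-th rows form
    a positive pair, all other samples in the 2N-sized batch serve as negatives."""
    z = torch.cat([z1, z2], dim=0)                             # (2N, D)
    sim = (z @ z.t()) / tau                                    # pairwise dot-product similarities
    n = z1.shape[0]
    diag = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(diag, float("-inf"))                 # enforce k != i in the denominator
    positives = torch.arange(2 * n, device=z.device).roll(n)   # row i is paired with row i +/- N
    return F.cross_entropy(sim, positives)                     # mean of L_{i,j} over all 2N anchors

# e.g. loss = multimodal_info_nce(torch.randn(32, 128), torch.randn(32, 128))
```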
3.2. Swin Transformers

Swin Transformers [17] are vision Transformers designed as backbones for all kinds of visual tasks, such as image classification, object detection and semantic segmentation. They owe their success to their scale-invariance property, which allows them to be used for both high-level and dense predictions. Their strength results from their shifted-window approach: a window contains non-overlapping square patches, and self-attention is calculated locally within each window before shifting. As a result, the computational complexity is reduced compared to the standard Transformer. The backbone (encoder) is made up of 4 building blocks, where each block is constructed by connecting a patch merging layer and several Swin Transformer blocks. A Swin Transformer block is composed of Multi-head Self-Attention (MSA), followed by a 2-layer Multi-Layer Perceptron (MLP). A Layer Norm (LN) is applied before each MSA and MLP. The first Swin Transformer block uses a standard window partitioning configuration to locally compute self-attention from uniformly separated windows. The next one adopts a window configuration shifted by a certain pixel offset relative to the previous layer, and so on.
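For intuition, a minimal sketch (illustrative, not the reference Swin implementation) of the regular and shifted window partitioning described above; the cyclic roll realises the shift, and the attention masks Swin applies to the wrapped-around regions are omitted here.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int, shift: int = 0) -> torch.Tensor:
    """Group a (B, H, W, C) feature map into non-overlapping windows of size window_size.

    With shift > 0 the map is cyclically rolled first; alternating shift = 0 and
    shift = window_size // 2 across layers connects neighbouring windows."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    b, h, w, c = x.shape
    x = x.reshape(b, h // window_size, window_size, w // window_size, window_size, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)
    return windows  # (num_windows * B, window_size**2, C); self-attention runs per window

regular = window_partition(torch.randn(1, 8, 8, 96), window_size=4, shift=0)
shifted = window_partition(torch.randn(1, 8, 8, 96), window_size=4, shift=2)
```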
Regarding the classification task, we concatenate the two representations coming out of the backbone (see Figure 2) and feed them into a fully connected layer. We detail in the next section the architecture used for the segmentation task.
3.2.1 SwinUNet

We use SwinUNet [4] as the basic framework for the semantic segmentation task. The architecture is similar to a UNet [23]; it consists of an encoder, bottleneck, decoder and skip connections. While in a traditional UNet the encoder and decoder are symmetric blocks of convolutional and max-pooling layers, in the SwinUNet they are symmetric blocks of Swin Transformers (see Section 3.2). To generate the hierarchical representations of the features, the Swin Transformer blocks are preceded by a patch merging (downsampling) step in the contracting path and a patch expanding (upsampling) step in the expansive path. Features extracted after blocks of the same spatial dimension are merged via skip connections. Finally, to restore the initial resolution (height and width) of the input, an upsampling operation is performed on the last patch expansion layer.

Dual SwinUNet: We propose two separate SwinUNet network streams with identical architectures (see previous description) to process pairs of Sentinel-1 and Sentinel-2 data in parallel, in a dual-stream concept (see Figure 2). The features extracted after the last patch expansion layer of each stream are concatenated along the channel dimension and fed into the last pointwise convolution (1x1 kernel) to produce the segmentation predictions at the pixel level. The purpose of the dual SwinUNet is to capture the information belonging to each of the two inputs first, before merging the two representations for the final decision.
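A minimal sketch of this late-fusion output stage, assuming both streams return feature maps of the same spatial size and channel count; the module name is illustrative, and the eight output classes correspond to the DFC2020 land-cover classes.

```python
import torch
import torch.nn as nn

class DualStreamFusionHead(nn.Module):
    """Late-fusion output stage: per-stream feature maps are concatenated along the
    channel dimension and mapped to pixel-wise class scores by a 1x1 convolution."""

    def __init__(self, in_channels: int, num_classes: int = 8):
        super().__init__()
        self.classifier = nn.Conv2d(2 * in_channels, num_classes, kernel_size=1)

    def forward(self, feats_s1: torch.Tensor, feats_s2: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([feats_s1, feats_s2], dim=1)  # (B, 2C, H, W)
        return self.classifier(fused)                   # (B, num_classes, H, W) logits

head = DualStreamFusionHead(in_channels=96)
logits = head(torch.randn(2, 96, 256, 256), torch.randn(2, 96, 256, 256))
```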
4. Data

This work applies SSL and vision Transformers on paired satellite data from the Sentinel-1 and Sentinel-2 missions of the European Space Agency's Copernicus program.

Sentinel-1 is a satellite mission for Earth observation with Synthetic Aperture Radar (SAR) [33]. It provides medium resolution (∼10m) C-band SAR measurements with dual polarisation, which enables data acquisition during night or through cloud cover. The two Sentinel-1 satellites are in sun-synchronous orbits with a 12 day repeat cycle. This work utilizes VV and VH polarized data from the main Interferometric Wide-swath mode.

Sentinel-2 is a constellation of two sun-synchronous satellites for optical Earth observation at medium resolution [11]. The on-board instrument provides multi-spectral observations in the visible, near- and short-wave infrared in 13 bands with up to 10m pixel resolution.
Table 1. Results for single- and multi-label classification downstream tasks with ResNet50 and Swin Transformer backbones. S1 and S2 models are trained solely on data from Sentinel-1 or Sentinel-2 without data fusion. EarlyF. and LateF. perform Sentinel-1/2 data fusion at the model input or embedding level. FT corresponds to fine-tuning the pre-trained self-supervised model for the downstream task, whereas "Frozen" models only train task-specific heads. Accuracies are reported with their standard deviations from 5 runs. The best performing model for each class is highlighted in bold. The frequency of each class in the training set is noted in parentheses in the Class column, reflecting class imbalances.
Shrubl. (8%)
Grassl. (27%) 33 ± 8 53 ± 5 49 ± 4 43 ± 4 47 ± 6 11 ± 3 14 ± 4 51 ± 2 39 ± 7 50 ± 2 18 ± 6 40 ± 3
Wetl. (35%) 18 ± 2 9 ± 2 10 ± 1 12 ± 2 24 ± 2 16 ± 2 18 ± 1 9 ± 1 9 ± 1 10 ± 1 23 ± 3 27 ± 1
Cropl. (23%) 63 ± 1 58 ± 3 60 ± 1 63 ± 1 70 ± 3 64 ± 1 46 ± 3 56 ± 1 56 ± 2 61 ± 1 69 ± 1 65 ± 1
Urban (10%) 70 ± 3 55 ± 3 61 ± 2 73 ± 1 80 ± 2 79 ± 1 69 ± 0 51 ± 1 65 ± 2 70 ± 1 77 ± 1 83 ± 1
Barren (6%) 27 ± 3 24 ± 2 22 ± 3 22 ± 2 34 ± 3 25 ± 0 14 ± 2 22 ± 2 25 ± 1 26 ± 2 33 ± 4 32 ± 2
Water (43%) 95 ± 0 89 ± 2 96 ± 3 96 ± 0 97 ± 0 93 ± 0 95 ± 0 72 ± 1 94 ± 0 95 ± 0 97 ± 0 96 ± 1
Overall 56 ± 2 56 ± 2 59 ± 2 61 ± 1 67 ± 1 60 ± 0 42 ± 1 51 ± 2 58 ± 1 56 ± 1 60 ± 1 62 ± 1
Average 49 ± 2 47 ± 2 50 ± 1 52 ± 1 59 ± 1 48 ± 1 36 ± 1 43 ± 0 49 ± 1 49 ± 0 53 ± 1 56 ± 1
Table 2. Results for segmentation downstream tasks with Swin Transformer backbone. S1 and S2 models are trained solely on data from Sentinel-1 or Sentinel-2 without data fusion. EarlyF. and LateF. perform Sentinel-1/2 data fusion at the model input or embedding level. FT corresponds to fine-tuning the pre-trained self-supervised model for the downstream task, whereas "Frozen" models only train task-specific heads. Per-class accuracies and mean Intersection over Union are reported with their standard deviations from 5 runs. The best performing model for each class is highlighted in bold. The per-class pixel-wise distribution in our training set is noted next to each class.
The two Sentinel-2 satellites achieve a revisit rate of 5 days at the equator.

SEN12MS   The SEN12MS dataset [26] is a large-scale collection of spatially aligned observation pairs from Sentinel-1 and Sentinel-2. The dataset contains 180,662 observations and covers different geographical areas around the world. All Sentinel-1/2 image pairs are obtained in the same season and pre-processed to a harmonized resolution of 10m for all bands. This work utilizes SEN12MS for self-supervised pre-training without access to any labels.
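For illustration, a hypothetical PyTorch dataset that serves such co-registered Sentinel-1/2 pairs for the label-free pre-training stage; the sample list and `load_patch` helper are assumptions, not the SEN12MS loading API.

```python
from torch.utils.data import Dataset

class PairedSentinelDataset(Dataset):
    """Serves co-registered (Sentinel-1, Sentinel-2) patches of the same scene
    as the two views for contrastive pre-training; no labels are required."""

    def __init__(self, samples, load_patch):
        # `samples`: list of (s1_path, s2_path) tuples; `load_patch`: callable
        # returning a (C, H, W) float tensor for a given file path.
        self.samples = samples
        self.load_patch = load_patch

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        s1_path, s2_path = self.samples[idx]
        s1 = self.load_patch(s1_path)  # e.g. 2 channels: VV/VH backscatter
        s2 = self.load_patch(s2_path)  # e.g. 13 multi-spectral bands
        return s1, s2
```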
DFC2020   The DFC2020 dataset is an extension to SEN12MS constructed for the IEEE GRSS Data Fusion Contest 2020 [16]. This dataset consists of validation and test sets with 986 and 5,128 paired Sentinel-1/2 observations, respectively. In addition to the satellite imagery, DFC2020 also provides dense (i.e., pixel-level) land-cover annotations for the classes Forest, Shrubland, Grassland, Wetland, Cropland, Urban/Built-up, Barren and Water. We use the DFC2020 dataset to evaluate the downstream tasks of single- and multi-label classification and semantic segmentation. In this work we use the validation split for training, as in Track 2 of the Contest. We note that this data is highly unbalanced (see Tables 1 and 2). Furthermore, we follow a different objective than the Data Fusion Contest [16] by utilizing its dataset to investigate the use of SSL and vision Transformer models in the remote sensing domain.
5. Experiments and Results

We perform extensive experiments to assess the performance of vision Transformers on three different downstream tasks based on the DFC2020 dataset. The Transformer architectures are compared against different baselines, including commonly used ResNet50 ConvNet models [15]. In particular, we focus on the benefits of SSL and subsequent fine-tuning over training from scratch to leverage the large vision Transformer models on small labelled remote sensing datasets.
5.1. Baselines

Classification Baselines: We use four different data settings for each classification baseline model architecture: only Sentinel-1 input data, only Sentinel-2 input data, early Sentinel-1/2 fusion through concatenation across the channel dimension at the data input stage, and late fusion by concatenating feature maps derived from Sentinel-1/2 inputs with distinct model backbones before the final classification layer. Besides Swin Transformers, we use ResNet50 as a baseline model architecture, as it comprises a comparable number of parameters to the Transformer (Swin-T). These models are trained from scratch on the validation split of the DFC2020 dataset. Results evaluated on the test split are reported in Table 1. Both architectures result in moderately good performance for single-label classification despite the small training dataset. The Swin Transformer achieves the best average accuracy in the late-fusion setting (44±0 percentage points), while the ResNet50 slightly outperforms the Transformer in the early-fusion and uni-modal settings. For multi-label classification (see Table 1, bottom half) the late fusion approach yields the highest F1 Scores for both backbone architectures (52±1 and 49±0 percentage points for ResNet50 and Swin Transformer, respectively).
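A schematic sketch of the two fusion settings (illustrative only): early fusion concatenates the Sentinel-1 and Sentinel-2 bands along the channel dimension before a single backbone, whereas late fusion concatenates the embeddings produced by two modality-specific backbones before the final classification layer.

```python
import torch
import torch.nn as nn

def early_fusion_input(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    # (B, 2, H, W) Sentinel-1 + (B, 13, H, W) Sentinel-2 -> (B, 15, H, W);
    # the single backbone's first layer must accept the stacked channel count.
    return torch.cat([s1, s2], dim=1)

class LateFusionClassifier(nn.Module):
    """Two modality-specific backbones; their (B, embed_dim) embeddings are
    concatenated before the final fully connected classification layer."""

    def __init__(self, backbone_s1: nn.Module, backbone_s2: nn.Module,
                 embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone_s1 = backbone_s1
        self.backbone_s2 = backbone_s2
        self.head = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.backbone_s1(s1), self.backbone_s2(s2)], dim=1)
        return self.head(z)
```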
Segmentation Baselines: We use two baseline models to compare against the self-supervised model we present for the semantic segmentation task: i) a standard UNet architecture [23] and ii) a SwinUNet (see Section 3.2.1). We train both of these baselines from scratch, following the same experimental setup described above. Note that for the late-fusion experiment, we use a Dual SwinUNet model (see Section 3.2.1). Results are reported in terms of pixel-wise accuracy and mean Intersection over Union (mIoU) in Table 2. We note that UNet achieves higher pixel accuracy than SwinUNet in both uni-modal and multi-modal configurations, with the best average accuracy reaching 46 ± 1.

5.2. SSL Pre-training with Fine-tuning

After self-supervised training of the respective model backbone on the SEN12MS dataset (∼3 days on an NVIDIA Tesla V100 GPU), all model parameters (backbone and task-specific head) are fine-tuned for the downstream task of interest. For single-label classification, we find that self-supervised pre-training with subsequent fine-tuning strongly outperforms all baseline models (see Table 1). The average accuracy score increases by 10 and 5 percentage points for ResNet50 and Swin Transformer, respectively. This corresponds to a relative increase of 23.3% and 11.4% over the best baselines. We find similar results in the multi-label setting, with relative increases in F1 Scores of 13.5% and 8.2% over the best baselines with the fine-tuned self-supervised ResNet50 and Swin Transformer models. Segmentation results are shown in Table 2 for uni-modal (Sentinel-1, Sentinel-2) and multi-modal (late-fusion, see Figure 2) inputs. For the multi-modal case, we note that the average pixel accuracy of our fine-tuned model increased by 8 (+19% relative increase) and 6 (+13%) percentage points, compared to the SwinUNet and the UNet trained from scratch, respectively.

5.3. SSL Pre-training with Frozen Backbone

We investigate to what degree feature maps produced by models trained in a self-supervised manner encode relevant information for land-cover classification and segmentation downstream tasks. To test this, we freeze all parameters of the model backbones and only train the parameters of randomly initialized classification or segmentation heads for each task.
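A minimal sketch of this frozen-backbone protocol, assuming a model with `backbone` and `head` submodules (names are illustrative); the optimizer choice and the example learning rate (taken from the fixed hyperparameters reported in Section 5.4) are assumptions rather than the authors' exact training code. Omitting the freezing step recovers the full fine-tuning setting of Section 5.2.

```python
import torch

def configure_optimizer(model: torch.nn.Module, lr: float, freeze_backbone: bool) -> torch.optim.Optimizer:
    """Frozen-backbone setting: exclude the SSL pre-trained backbone from gradient updates
    and optimize only the randomly initialized task head. With freeze_backbone=False,
    all parameters are updated, i.e. full fine-tuning."""
    if freeze_backbone:
        for p in model.backbone.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# e.g. optimizer = configure_optimizer(model, lr=3e-6, freeze_backbone=True)
```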
Figure 3. (a) Classification and (b) segmentation results for training on different fractions of labelled data (label fractions 0.01, 0.10, 0.50 and 1.00). (a) Swin Transformer and ResNet50 models pre-trained with SSL strongly outperform training from scratch on the classification downstream task. (b) SwinUNets with pre-training outperform training from scratch when freezing or fine-tuning the backbone.
Table 3. Comparison of different methods for segmentation. All models are trained using multi-modal input; SSL-ft. means fine-tuning the pre-trained model.

                 UNet   SwinUNet   SwinUNet SSL-ft.   Ensemble
Avg. Accuracy    0.45   0.43       0.51               0.53
Avg. IoU         0.31   0.33       0.37               0.39

Evidently, our SSL strategy extracts meaningful features for land-cover classification. Training a single-label classification head on top of the frozen ResNet50 backbone yields strong performance, and even surpasses the best ResNet50 baseline model by 13 percentage points average accuracy (+30.2%). The frozen Swin Transformer model even yields the best single-label classification performance of all approaches presented in this work, with an average accuracy of 57±1 (+29.5% over the best Swin Transformer baseline). The pre-trained and frozen Swin Transformer yields similar performance on the multi-label downstream task (+14% average F1 Score over the best baseline), even though the frozen ResNet50 fails to outperform the best multi-label classification baseline (-7.7% average F1 Score). This indicates that the self-supervised Transformer model learns more meaningful representations that encode sufficient information to extract multiple class labels with a small classification head.

We observe largely similar behavior for the segmentation task; training only the segmentation head surpasses our two baselines (UNet and SwinUNet) by 5 (+12%) and 3 (+7%) percentage points, respectively, but performs no better than fine-tuning.

5.4. Label Fraction Experiments

Hyperparameters for these experiments are fixed to sensible values a-priori, rather than tuned for every individual experimental setting. The batch size is set to 32, the learning rate to 3 · 10⁻⁶ and the number of training epochs to 200. This approach also makes it possible to utilize the full DFC2020 validation set (986 observations with dense land-cover labels) for training, as we do not require a validation set for hyperparameter tuning.
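A sketch of how such label-fraction subsets can be drawn (the paper does not specify the exact sampling procedure, so this is an assumption): a fixed random fraction of the labelled training indices is selected for fine-tuning, with fractions matching the x-axis of Figure 3.

```python
import torch
from torch.utils.data import Subset

def label_fraction_subset(dataset, fraction: float, seed: int = 0) -> Subset:
    """Randomly select `fraction` of the labelled training samples for fine-tuning."""
    generator = torch.Generator().manual_seed(seed)
    n = max(1, int(round(fraction * len(dataset))))
    indices = torch.randperm(len(dataset), generator=generator)[:n].tolist()
    return Subset(dataset, indices)

# fractions shown in Figure 3; with 986 labelled observations, 0.10 corresponds to ~100 samples
# subsets = {f: label_fraction_subset(train_set, f) for f in (0.01, 0.10, 0.50, 1.00)}
```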
Figure 4. Qualitative comparison of results for 3 different regions. Results from left to right: Sentinel-2 true color (RGB), DFC groundtruth, UNet trained from scratch on fusion of both inputs, SwinUNet trained from scratch on both inputs, SwinUNet fine-tuned on both inputs, and finally an ensemble model of both UNet and SwinUNet (see Section 6). The colour legend covers the classes Water, Barren, Urban, Cropland, Wetland, Grassland, Shrubland and Forest.
6. Discussion

Our experiments demonstrate that large performance gains can be achieved through self-supervised pre-training. For single-label classification, the frozen backbone models performed better than fine-tuning all parameters, which we attribute to the hyperparameter choice. For the segmentation task, we arrive at the same conclusion as above: self-supervised pre-training considerably boosts performance. Nevertheless, we note that the SSL Swin Transformer with frozen backbone does not perform better than the SSL fine-tuned one. This may be a result of the segmentation head architecture, which uses skip connections to merge the multi-scale characteristics of the encoder with the upsampled characteristics of the decoder. Therefore, to achieve the best performance, encoder and decoder parameters should be updated simultaneously.

The importance of our approach for data-efficient learning is further underlined by the results of the label fraction experiment. Across all downstream tasks, our pre-trained and fine-tuned models perform on par with models trained from scratch with as little as 10% of the labelled data. This SSL approach thus opens a path to learning based on very small datasets (∼100 samples), enabling data-efficient applications. In a qualitative comparison, we show some segmentation results in Figure 4. We first observe that the UNet method produces smoother segmentation masks compared to the SwinUNet. On the other hand, the segmentation masks produced by the SwinUNet are much more detailed and accurate. These observations motivated the idea of trying an ensemble of these two methods, where we take the average of the predictions of the two models before computing the final prediction (see Figure 4). Across all our downstream tasks, we note that classes like Grassland or Wetland are commonly misclassified on some images. This is most likely due to our limited and very unbalanced training set (see Tables 1 and 2). Overall, the ensemble model works best, both visually, giving smooth and detailed results, and numerically (see Table 3). This pushes us to explore this direction in future works, with the aim of improving land cover segmentation, taking into account the advantages of each of these methods.
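A minimal sketch of this ensembling step (illustrative; the paper does not state whether probabilities or logits are averaged, here we average per-pixel class probabilities before the final argmax):

```python
import torch

@torch.no_grad()
def ensemble_prediction(unet_logits: torch.Tensor, swinunet_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-pixel class probabilities of the two segmentation models,
    then take the argmax as the final land-cover prediction.

    Both inputs are (B, num_classes, H, W) logits for the same scene."""
    probs = (unet_logits.softmax(dim=1) + swinunet_logits.softmax(dim=1)) / 2
    return probs.argmax(dim=1)  # (B, H, W) class indices
```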
7. Conclusion

This work introduced a self-supervised pre-trained Swin Transformer for land cover classification and segmentation using a contrastive learning approach, as illustrated in Figure 2. The training is done in two stages: first, self-supervised training of one unique backbone is performed on a large unlabelled dataset; second, supervised fine-tuning of this backbone is performed on a small, labelled dataset for two separate downstream tasks. Experimental results on the test set validate our proposed method against a variety of baseline models trained from scratch. Our self-supervised approach yields consistently higher performance across different downstream tasks, with particularly strong improvements in the low-data regime. Furthermore, our work illustrates the utility of Transformer models for Earth observation without the need for large labelled datasets.
References

[1] Kumar Ayush, Burak Uzkent, Chenlin Meng, Kumar Tanmay, Marshall Burke, David Lobell, and Stefano Ermon. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10181–10190, 2021. 2
[2] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3285–3294, 2019. 2
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. 2
[4] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-Unet: Unet-like pure transformer for medical image segmentation. ArXiv, abs/2105.05537, 2021. 4
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. ArXiv, abs/2005.12872, 2020. 3
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607. PMLR, 2020. 2, 4
[7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020. 2
[8] Yuxing Chen and Lorenzo Bruzzone. Self-supervised SAR-optical data fusion of Sentinel-1/-2 images. IEEE Transactions on Geoscience and Remote Sensing, 2021. 2
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. 2
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2021. 1, 3
[11] Matthias Drusch, Umberto Del Bello, Sébastien Carlier, Olivier Colin, Veronica Fernandez, Ferran Gascon, Bianca Hoersch, Claudia Isola, Paolo Laberinti, Philippe Martimort, et al. Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sensing of Environment, 120:25–36, 2012. 4
[12] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018. 2
[13] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742. IEEE, 2006. 2
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. 2
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 6
[16] Michael Schmitt, Lloyd Hughes, Pedram Ghamisi, Naoto Yokoya, and Ronny Hänsch. 2020 IEEE GRSS Data Fusion Contest, 2019. 6
[17] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 2, 3, 4
[18] Oscar Mañas, Alexandre Lacoste, Xavier Giro-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021. 2
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 2
[20] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016. 2
[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2
[22] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. 2
[23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 4, 6
[24] Sudipan Saha, Patrick Ebel, and Xiao Xiang Zhu. Self-supervised multisensor change detection. IEEE Transactions on Geoscience and Remote Sensing, 2021. 2
[25] Linus Scheibenreif, Michael Mommert, and Damian Borth. Contrastive self-supervised data fusion for satellite imagery. In International Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2022. 2, 4
[26] Michael Schmitt, Lloyd Haydn Hughes, Chunping Qiu, and Xiao Xiang Zhu. SEN12MS – a curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. arXiv preprint arXiv:1906.07789, 2019. 5
[27] Vladan Stojnic and Vladimir Risojevic. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021. 2
[28] Robin A. M. Strudel, Ricardo Garcia Pinel, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7242–7252, 2021. 3
[29] Aidan M. Swope, Xander H. Rudelis, and Kyle T. Story. Representation learning for remote sensing: An unsupervised sensor fusion approach. arXiv preprint arXiv:2108.05094, 2021. 2
[30] Chao Tao, Ji Qi, Weipeng Lu, Hao Wang, and Haifeng Li. Remote sensing image scene classification with self-supervised paradigm under limited labeled samples. IEEE Geoscience and Remote Sensing Letters, 2020. 2
[31] Chunwei Tian, Yong Xu, Zuoyong Li, Wangmeng Zuo, Lunke Fei, and Hong Liu. Attention-guided CNN for image denoising. Neural Networks, 124:117–129, 2020. 2
[32] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794. Springer, 2020. 2
[33] Ramon Torres, Paul Snoeij, Dirk Geudtner, David Bibby, Malcolm Davidson, Evert Attema, Pierre Potin, Björn Rommen, Nicolas Floury, Mike Brown, et al. GMES Sentinel-1 mission. Remote Sensing of Environment, 120:9–24, 2012. 4
[34] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. SSW, 125:2, 2016. 2
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. 1
[36] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Loddon Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020. 3
[37] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018. 2
[38] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 2019. 2
[39] Yuan Yuan and Lei Lin. Self-supervised pretraining of transformers for satellite image time series classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:474–487, 2020. 2
[40] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016. 2
[41] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4):8–36, 2017. 1