

Masked Vision Transformers for Hyperspectral Image Classification

Linus Scheibenreif Michael Mommert Damian Borth


AIML Lab, School of Computer Science, University of St. Gallen
{firstname}.{lastname}@unisg.ch

Abstract

Transformer architectures have become state-of-the-art models in computer vision and natural language processing. To a significant degree, their success can be attributed to self-supervised pre-training on large scale unlabeled datasets. This work investigates the use of self-supervised masked image reconstruction to advance transformer models for hyperspectral remote sensing imagery. To facilitate self-supervised pre-training, we build a large dataset of unlabeled hyperspectral observations from the EnMAP satellite and systematically investigate modifications of the vision transformer architecture to optimally leverage the characteristics of hyperspectral data. We find significant improvements in accuracy on different land cover classification tasks over both standard vision and sequence transformers using (i) blockwise patch embeddings, (ii) spatial-spectral self-attention, (iii) spectral positional embeddings and (iv) masked self-supervised pre-training¹. The resulting model outperforms standard transformer architectures by +5% accuracy on a labeled subset of our EnMAP data and by +15% on the Houston2018 hyperspectral dataset, making it competitive with a strong 3D convolutional neural network baseline. In an ablation study on label efficiency based on the Houston2018 dataset, self-supervised pre-training significantly improves transformer accuracy when little labeled training data is available. The self-supervised model outperforms randomly initialized transformers and the 3D convolutional neural network by +7-8% when only 0.1-10% of the training labels are available.

¹ Code available at github.com/HSG-AIML/MaskedSST

Figure 1. A: We propose the use of masked image modeling to pre-train spatial-spectral transformer networks on a large dataset of unlabeled hyperspectral EnMAP data. B: The pre-trained model can then be fine-tuned on small labeled datasets for supervised downstream tasks like land cover classification.

1. Introduction

Hyperspectral remote sensing provides measurements of the Earth's surface with high spectral resolution. This enables applications like the detection of specific material categories or agricultural parameters which often depend on fine-grained spectral reflectance patterns [15, 21]. In recent years, the availability of hyperspectral remote sensing data has strongly improved, and the launch of the German hyperspectral Environmental Mapping and Analysis Program (EnMAP) mission in April 2022 made global hyperspectral data of high spectral and temporal resolution publicly available on a large scale [16]. In contrast to this trend, deep learning approaches for the analysis of hyperspectral remote sensing data are overwhelmingly developed on well-established benchmark datasets that are very small in comparison to commonly used datasets in other computer vision domains [10, 30]. To a large extent, this is due to the high acquisition cost of hyperspectral data itself and the corresponding labels for individual spectral sequences. This strongly limits the size of available labeled datasets and the development of deep learning approaches in the hyperspectral domain. In this work we aim to improve vision transformer architectures for the specific characteristics of hyperspectral data and to leverage the growing amount of freely available unlabeled hyperspectral remote sensing imagery for self-supervised pre-training of these models. We illustrate how to increase the performance while decreasing the amount of required labeled data for hyperspectral classification tasks.
Figure 2. Overview of our proposed transformer model for hyperspectral data with spatial-spectral factorization within the masked self-supervised pre-training framework. A: The hyperspectral data cube x ∈ R^(h×w×c) is first divided into spatial-spectral patches p ∈ R^(p_h×p_w×p_c). The patches are randomly masked, embedded and processed by the transformer, which sequentially applies self-attention spatially and spectrally between all embeddings. A linear layer maps representations of the masked patches back to pixel space to compute the reconstruction error. B: Our spatial-spectral transformer consists of a patch embedding layer and transformer blocks that apply self-attention among tokens with the same spectral or spatial index. The colors indicate token locations in the hyperspectral cube.

The contributions of this work can be summarized as follows:

• We collect a large scale unlabeled dataset of EnMAP observations over Europe and create a labeled dataset of Mexico City by matching EnMAP observations with land cover labels. Based on these datasets, our work provides a large scale evaluation of transformer models for hyperspectral data.

• We investigate different positional and spectral encoding schemes and show that block-wise embedding significantly improves the performance of transformers on hyperspectral data.

• To facilitate spatial-spectral learning with transformers, we utilize a spatial-spectral factorization scheme which greatly reduces the computational burden of the self-attention operation on high-dimensional hyperspectral data.

• We show that a self-supervised masked image modeling task for hyperspectral data improves model performance on downstream tasks, and can significantly improve label efficiency for transformer models.

2. Related Work

2.1. Hyperspectral Deep Learning

The high dimensionality and spectral correlation of hyperspectral data present unique challenges for machine learning methods. Accordingly, many machine learning techniques have been developed for common hyperspectral tasks such as dimensionality reduction [55], data fusion [51], unmixing [54] or classification [2]. In particular, deep learning approaches like fully connected [14], convolutional [26] (CNN), and recurrent neural networks [28] have been successfully applied on hyperspectral imaging data (see [2] for an overview). Hybrid transformer-CNN methods combine convolutional feature extractors with transformer networks (e.g., [20, 38, 43]) to leverage the spatial inductive bias of CNNs in a transformer framework. Following the general trend in the deep learning field, pure transformer networks have also recently been developed for hyperspectral remote sensing imagery [22, 31].

2.2. Vision Transformers

Transformer models are state-of-the-art in natural language processing [11] (NLP), where their attention mechanism [42] models pairwise interactions between tokens and allows them to capture long-range interactions. The transformer approach has since also been successfully adapted for computer vision applications, where pre-trained transformers are now among the strongest general purpose backbones [12]. The standard vision transformer [12] (ViT) method first divides input images x ∈ R^(h×w×c) into patches p ∈ R^(n×(p_h·p_w·c)) of patch size p_h · p_w. This set of n = (h/p_h) · (w/p_w) non-overlapping patches is linearly embedded to the transformer dimension d and summed with positional encodings for every patch. The resulting embeddings z are processed by the transformer encoder consisting of l layers of alternating multi-head self-attention [42] (MSA) and feed-forward (FF) blocks, both with layer normalization [3] (LN):

\begin{array}{ll} \mathbf{y}^l &= \mathrm{MSA}(\mathrm{LN}(\mathbf{z}^l)) + \mathbf{z}^l \\ \mathbf{z}^{l+1} &= \mathrm{FF}(\mathrm{LN}(\mathbf{y}^l)) + \mathbf{y}^l \end{array}   (1)

The major bottleneck in the application of (vision) transformers on high dimensional data is the quadratic complexity of the attention operation in the number of input tokens. A number of works try to improve the efficiency of transformers for large numbers of tokens by reducing the complexity of self-attention [7, 8], or by applying self-attention selectively rather than pairwise between all tokens [5, 23]. Most relevant for this work are transformer architectures for video data, where different approaches to divide the self-attention operation along the temporal and spatial dimensions have been proposed [1, 5].
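As a concrete illustration of Eq. 1, the following PyTorch sketch implements one pre-norm encoder block with multi-head self-attention and a feed-forward network. The class and parameter names are our own illustrative choices rather than the authors' released implementation; the dimensionality d = 96 and 8 heads follow the configuration reported later in Section 5.1.

import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm transformer block: y = MSA(LN(z)) + z, z' = FF(LN(y)) + y (Eq. 1)."""
    def __init__(self, dim=96, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                                    # z: (batch, tokens, dim)
        h = self.norm1(z)
        y = self.attn(h, h, h, need_weights=False)[0] + z    # MSA + residual connection
        return self.ff(self.norm2(y)) + y                    # FF + residual connection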
Figure 3. A: Standard patch-embedding approach for vision transformers. Patches are flattened and embedded using the same linear transform w. B: Blockwise spectral embedding for spatial-spectral patches. Each spectral interval is embedded with a specific linear transform w_b to account for the characteristics of the corresponding wavelength interval.

2.3. Self-supervised Learning

The goal of self-supervised learning (SSL) is to learn rich representations from unlabeled data. To that end, an artificial supervision signal is constructed from information that is inherent to the data sample. Models can then be trained to solve such a 'pretext task' before the learned representations are transferred to different downstream tasks of interest. Common pretext tasks include the prediction of relative rotation [13], solving of jigsaw puzzles [29] or image colorization [53]. More recently, contrastive learning has emerged as a powerful pre-training strategy [6, 45]. This approach aims to solve an instance-wise classification problem between data samples with noise contrastive estimation [17]. The objective is to distinguish positive and negative pairs of data points, where the pairwise relationships are derived from inherent characteristics of the data samples rather than classical labels. The network thus learns to map positive pairs close to each other and far apart from negative samples in the representation space. Contrast can be defined on the image [6], patch [46], or pixel [47] level to control the granularity of resulting representations.

Masked Image Modeling Self-supervised learning through the prediction of masked data components is widely used in NLP and a central contributor to the success of transformer networks in this domain. The central idea is to replace a fraction of input tokens with a special mask token that has to be predicted by the transformer [11]. Following the success of such approaches with natural language, masked modeling approaches are now also used in the vision domain. These approaches closely follow the NLP approach by predicting masked tokens from adjacent visible tokens [4, 56], which requires a suitable tokenizer. Recent work has shown that in the vision domain token prediction can be substituted by directly regressing the values of masked pixels. This pre-training approach results in strong visual representations when combined with autoencoder networks [19] or by estimating pixel values from latent representations with a simple linear layer [48].

SSL in Remote Sensing Remote sensing offers large amounts of unlabeled data, which has been leveraged in a number of self-supervised learning strategies (see [44] for a review). Early approaches utilize hand-crafted pretext tasks like inpainting and location prediction [40]. A number of methods have tailored the contrastive learning principle to the characteristics of remote sensing data by utilizing temporal information from consecutive overpasses [27, 52], multi-modal data from different sensors [33–35], or multi-spectral observations [37, 39] to define positive sample pairs. Masked image modeling approaches for remote sensing data utilize masked autoencoding [19] with multi-spectral and multi-temporal data [9], or extend it to imagery of varying ground sampling distance [32]. In the hyperspectral domain, masked sequence modeling has been used to model the spectral signal [18], and within the masked autoencoding framework [24].
Figure 4. Top: Tiles from EnMAP L2 scenes over Mexico City. Center: Corresponding DFC2020 land cover labels. Bottom: Predicted land cover classes from the masked spatial-spectral transformer model. (Legend classes: Water, Barren, Urban/Built-up, Croplands, Wetlands, Grassland, Shrubland, Forest.)

3. Method

This section introduces the proposed transformer model for hyperspectral data (3.1), the spatial-spectral patch embedding strategy (3.1.1) and how hyperspectral data is efficiently processed by factorizing self-attention spatially and spectrally (3.1.2). Finally, we present the masked pre-training scheme (3.2).

3.1. Transformer Architecture

This work adapts the vision transformer [12] architecture to hyperspectral imagery. Starting from a baseline transformer model, we successively add model components and adjust design choices to improve efficiency and performance on hyperspectral data. Our baseline spectral model processes the spectral sequence of individual pixels with a transformer encoder. Each pixel is divided into patches along the spectral dimension, resulting in n = c/p_c blocks of size p_c, and then embedded with a shared linear transform. Learnable positional embeddings are added to the embedding sequence. As a spatial transformer baseline, we apply the original ViT [12] architecture on the RGB bands of hyperspectral data (ViT-RGB).

3.1.1 Spatial-Spectral Patch Embeddings

Spatial-Spectral Patches To incorporate the spatial context for the spectral sequence of each pixel, our spatial-spectral model divides the input image x ∈ R^(h×w×c) into n = (h/p_h) · (w/p_w) · (c/p_c) patches of size p_h · p_w · p_c. This approach mirrors the spatial patching of the original ViT, but does not aggregate all spectral bands per location, thus retaining the hyperspectral 3D cube. This makes it possible to model both spectral and spatial relationships within the image using attention. Spatial-spectral patching increases the number of tokens by a factor of c/p_c compared to spatial ViT patching and by (h/p_h) · (w/p_w) compared to the spectral transformer approach. Since the computational cost of self-attention is quadratic in the number of tokens, modeling all spatial-spectral relationships is practically infeasible for anything but very large spatial and spectral patch sizes p_{h,w,c}. We address this limitation in Section 3.1.2.

Blockwise Spectral Embedding Vision transformers create embeddings from patches through a learned linear transform that is shared between all patches (see Fig. 3 A). Unlike the spatial patches of ViT, which always represent the RGB intervals of the electromagnetic spectrum, our spatial-spectral patches represent multiple different spectral wavelength intervals for every spatial patch. To account for this diversity in the spectral signal, we propose a blockwise spectral embedding scheme that utilizes a separate linear transform for each of the c/p_c spectral blocks in the patched hyperspectral data (see Fig. 3 B). This approach is most similar to group embeddings which have been used for multi-temporal and multi-spectral remote sensing imagery [9].
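To illustrate how spatial-spectral patching and the blockwise spectral embedding of Fig. 3 B could be realized, the sketch below assumes a channels-last data cube, p_h = p_w = 1 and p_c = 10 as used later in Section 5.2, and one linear layer per spectral block. The module name and tensor layout are illustrative assumptions, not the released code.

import torch
import torch.nn as nn

class BlockwisePatchEmbed(nn.Module):
    """Split the cube into p_h x p_w x p_c patches and embed each spectral block
    with its own linear transform (blockwise patch embedding, Fig. 3 B)."""
    def __init__(self, bands=200, p_c=10, p_h=1, p_w=1, dim=96):
        super().__init__()
        self.p_h, self.p_w, self.p_c = p_h, p_w, p_c
        self.n_blocks = bands // p_c
        # one embedding matrix per spectral block instead of a single shared one
        self.embed = nn.ModuleList(
            [nn.Linear(p_h * p_w * p_c, dim) for _ in range(self.n_blocks)])

    def forward(self, x):                       # x: (batch, h, w, c), channels last
        b, h, w, c = x.shape
        # expose the spatial and spectral patch grids as separate axes
        x = x.reshape(b, h // self.p_h, self.p_h, w // self.p_w, self.p_w,
                      self.n_blocks, self.p_c)
        x = x.permute(0, 1, 3, 5, 2, 4, 6)      # (b, n_h, n_w, n_c, p_h, p_w, p_c)
        x = x.flatten(4)                        # flatten the contents of each patch
        tokens = torch.stack(
            [self.embed[i](x[:, :, :, i]) for i in range(self.n_blocks)], dim=3)
        return tokens                           # (b, n_h, n_w, n_c, dim)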
Spectral Positional Embedding We investigate the utility of two different positional encoding techniques for the spatial-spectral embeddings: Learnable positional embeddings for every spatial-spectral patch that are optimized along with the transformer during model training. Alternatively, spectral positional embeddings explicitly encode spatial and spectral positional information of the hyperspectral data separately with fixed sine and cosine functions [9, 42] and the transformer dimensionality d:

\begin{array}{ll} \mathrm{PE}_{pos,2i} &= \sin(pos/10000^{2i/d}) \\ \mathrm{PE}_{pos,2i+1} &= \cos(pos/10000^{2i/d}) \end{array}   (2)

We allot one third of the embedding vector to the spectral positional embedding (i.e., the encoding of the patch's index in the spectral sequence), and the remainder for the spatial embedding of horizontal and vertical position.
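A minimal sketch of such a fixed spectral positional embedding is shown below, assuming that one third of the vector encodes the spectral block index (Eq. 2) and that the remaining dimensions are split evenly between the vertical and horizontal patch position; the exact split of the spatial part and all names are our assumptions.

import torch

def sincos_1d(positions, dim):
    """Fixed sine/cosine encoding of Eq. 2 for a 1D index."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    freq = 1.0 / (10000 ** (2 * i / dim))
    angles = positions[:, None].float() * freq[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def spectral_positional_embedding(n_h, n_w, n_c, dim=96):
    """One third of the embedding encodes the spectral block index,
    the remainder encodes the vertical and horizontal patch position."""
    d_spec = dim // 3
    d_spat = dim - d_spec
    pe_c = sincos_1d(torch.arange(n_c), d_spec)               # spectral index
    pe_h = sincos_1d(torch.arange(n_h), d_spat // 2)          # vertical position
    pe_w = sincos_1d(torch.arange(n_w), d_spat - d_spat // 2) # horizontal position
    pe = torch.cat([
        pe_h[:, None, None, :].expand(n_h, n_w, n_c, -1),
        pe_w[None, :, None, :].expand(n_h, n_w, n_c, -1),
        pe_c[None, None, :, :].expand(n_h, n_w, n_c, -1)], dim=-1)
    return pe                                                 # (n_h, n_w, n_c, dim)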
3.1.2 Spatial-Spectral Factorization

The spatial-spectral patch embedding strategy yields a large number of tokens for high dimensional hyperspectral data. This is a bottleneck for the attention operation, which has quadratic runtime in the number of tokens. To make training feasible, we resolve this limitation by factorizing the transformer model to sequentially process spatial and spectral relationships within the data (see Fig. 2 B). This approach is similar to separable convolutions in CNNs, where 2D and 1D convolutions are sequentially applied over and across feature maps [36]. This strategy reduces the computational load of self-attention from the squared product of the number of spatial and spectral patches to their squared sum, i.e., from O(((h/p_h) · (w/p_w) · (c/p_c))^2 · d) to O(((h/p_h) · (w/p_w) + (c/p_c))^2 · d). In practice, the number of required operations on our hyperspectral data changes by a factor of ∼1/75.
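The following sketch shows one way to implement this factorization, reusing the EncoderBlock sketched in Section 2.2: spatial self-attention is applied among tokens that share a spectral index by folding the spectral axis into the batch dimension, and spectral self-attention among tokens that share a spatial location by folding the spatial axes into the batch dimension. Names and tensor layout are illustrative assumptions.

import torch.nn as nn

class SpatialSpectralBlock(nn.Module):
    """Factorized attention: spatial attention within each spectral block,
    then spectral attention within each spatial location (Fig. 2 B)."""
    def __init__(self, dim=96, heads=8):
        super().__init__()
        self.spatial = EncoderBlock(dim, heads)   # attends over n_h * n_w tokens
        self.spectral = EncoderBlock(dim, heads)  # attends over n_c tokens

    def forward(self, tokens):                    # tokens: (b, n_h, n_w, n_c, dim)
        b, nh, nw, nc, d = tokens.shape
        # spatial attention: fold the spectral axis into the batch dimension
        x = tokens.permute(0, 3, 1, 2, 4).reshape(b * nc, nh * nw, d)
        x = self.spatial(x).reshape(b, nc, nh, nw, d)
        # spectral attention: fold the spatial axes into the batch dimension
        x = x.permute(0, 2, 3, 1, 4).reshape(b * nh * nw, nc, d)
        x = self.spectral(x).reshape(b, nh, nw, nc, d)
        return x                                  # attention is now quadratic only in
                                                  # n_h*n_w and in n_c separately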
3.2. Masked Self-supervised Learning

The transformer models investigated in this work consist of a transformer encoder. During masked pre-training, we add an additional linear layer to map latent token representations from the transformer to pixel values, following the SimMIM method [48]. After patch embedding, a fraction of the embeddings is selected and replaced with a learnable mask token. The pre-training objective is to reconstruct the pixel values corresponding to the masked tokens (see Fig. 2). The reconstruction quality is measured by the L1 loss, which is only evaluated for masked pixel tokens (see Fig. 6). Unlike similar approaches that utilize encoder-decoder architectures for masked pre-training [19], the small linear reconstruction head in this approach forces the encoder to focus its capacity on modeling the masked tokens, rather than leaving this task to the decoder. Masked sentence models [11] commonly mask 15% of tokens, while image [48] and video [41] models mask around 50% and 90%, respectively. We employ a blockwise masking strategy (i.e., masking 4 × 4 windows of tokens instead of individual tokens) to prevent trivial solutions which are possible due to the high correlation of spectrally adjacent tokens.

Figure 6. Top: Original RGB representation and reconstruction with 70% masking and mask patch size 4 after 200 training epochs. The heatmap indicates how many of the RGB bands were masked for each image patch. Bottom: Reconstruction along the spectral axis for the two pixels highlighted in red in the top-left image. Gray sections correspond to masked patches.
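A simplified sketch of this pre-training objective, in the spirit of SimMIM [48]: a fraction of patch embeddings is replaced by a learnable mask token, a single linear head regresses the pixel values of each patch, and the L1 loss is evaluated only at masked positions. For brevity the sketch samples a simple random mask instead of the 4 × 4 blockwise mask used in the paper; all names and the flattened token layout are illustrative assumptions.

import torch
import torch.nn as nn

class MaskedReconstruction(nn.Module):
    """Masked pixel reconstruction with a linear head and L1 loss on masked tokens."""
    def __init__(self, encoder, dim=96, patch_pixels=10, mask_ratio=0.7):
        super().__init__()
        self.encoder = encoder                    # any module mapping (b, n, dim) -> (b, n, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.head = nn.Linear(dim, patch_pixels)  # latent token -> pixel values of one patch
        self.mask_ratio = mask_ratio

    def forward(self, tokens, target):            # tokens: (b, n, dim), target: (b, n, patch_pixels)
        b, n, _ = tokens.shape
        mask = torch.rand(b, n, device=tokens.device) < self.mask_ratio   # True = masked
        tokens = torch.where(mask[..., None], self.mask_token.expand(b, n, -1), tokens)
        recon = self.head(self.encoder(tokens))   # reconstruct pixel values for every token
        loss = (recon - target).abs()[mask].mean()   # L1 loss only on masked patches
        return loss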

4. Data

This work applies transformer models on hyperspectral remote sensing data. To that end, we utilize hyperspectral datasets from the Environmental Mapping and Analysis Program [16] and the IEEE GRSS Data Fusion Challenge (DFC) 2018 [49].

Figure 5. Top: RGB representation of the Houston2018 hyperspectral training set (bands 48, 32, 16). Bottom: Training labels for Houston2018 (20 classes, unlabeled pixels shown in white).
Model Name | Spectral | Spatial | BPE | SPE | SSL | Finetuned Acc. (%) | Finetuned MAcc. (%) | Frozen Acc. (%) | Frozen MAcc. (%)
3D-CNN [25] | ✓ | ✓ | | | | 83 ± 0.3 | 57 ± 1.0 | 81 ± 0.4 | 54 ± 1.1
ViT-RGB [12] | | ✓ | | | | 69 ± 0.5 | 20 ± 1.0 | 68 ± 0.3 | 16 ± 0.3
Transformer [42] | ✓ | | | | | 77 ± 0.2 | 32 ± 0.4 | 72 ± 0.1 | 23 ± 0.2
Spectral T. | ✓ | | ✓ | | | 80 ± 0.1 | 38 ± 0.5 | 71 ± 1.0 | 27 ± 1.3
Masked Transformer | ✓ | | | | ✓ | 76 ± 0.2 | 29 ± 0.4 | 65 ± 0.0 | 14 ± 0.0
Masked Spectral T. | ✓ | | ✓ | | ✓ | 81 ± 0.3 | 40 ± 0.7 | 78 ± 0.1 | 31 ± 0.1
SST | ✓ | ✓ | | | | 79 ± 0.1 | 38 ± 0.4 | 74 ± 0.4 | 32 ± 0.5
SST | ✓ | ✓ | ✓ | | | 81 ± 0.1 | 40 ± 0.8 | 75 ± 1.0 | 27 ± 1.3
SST | ✓ | ✓ | | ✓ | | 78 ± 0.2 | 33 ± 0.6 | 73 ± 0.6 | 24 ± 0.6
SST | ✓ | ✓ | ✓ | ✓ | | 82 ± 0.1 | 44 ± 0.3 | 76 ± 1.0 | 35 ± 1.0
Masked SST | ✓ | ✓ | | | ✓ | 77 ± 0.5 | 31 ± 0.1 | 65 ± 0.0 | 14 ± 0.0
Masked SST | ✓ | ✓ | ✓ | | ✓ | 82 ± 0.1 | 42 ± 0.2 | 77 ± 0.1 | 29 ± 0.2
Masked SST | ✓ | ✓ | | ✓ | ✓ | 78 ± 0.4 | 32 ± 0.3 | 65 ± 0.0 | 14 ± 0.0
Masked SST | ✓ | ✓ | ✓ | ✓ | ✓ | 82 ± 0.2 | 45 ± 0.6 | 79 ± 0.1 | 40 ± 0.1
MSST-Center | ✓ | ✓ | ✓ | ✓ | ✓ | 82 ± 0.2 | 55 ± 0.5 | 82 ± 0.2 | 55 ± 0.2

Table 1. Hyperspectral classification performance of baselines and different transformer configurations on the EnMAP-DFC dataset. Columns 'Spectral' and 'Spatial' indicate whether the model utilizes spectral/spatial context. Please refer to Section 3.1.1 for details about blockwise patch embedding (BPE) and spectral positional encoding (SPE). SSL indicates that the model has been pre-trained on EnMAP data with the masked reconstruction task. SST refers to the spatial-spectral transformer model. 3D-CNN and MSST-Center provide predictions for the center pixel of a patch, the other methods for all pixels in the patch simultaneously. Finetuned results indicate performance after training all model parameters on labeled data, frozen indicates that only the classification head is trained on labeled data.

EnMAP The EnMAP satellite carries an imaging spectrometer that scans the Earth's surface with 224 spectral bands in the visible and near-infrared (420 − 1000 nm) and short-wave infrared (900 − 2450 nm) intervals [16]. The sensor has a spatial resolution of 30 × 30 m and a 27-day revisit time. We collect a dataset consisting of 90 cloud-free EnMAP L2 scenes (orthorectified and atmospherically corrected) over Europe in Q4 2022. The EnMAP scenes are divided into non-overlapping 64 × 64 pixel tiles, and invalid atmospheric bands are removed (resulting in a total of 200 spectral bands). Our dataset consists of 19 792 tiles, for a total of more than 81M hyperspectral pixels.
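The tiling step described above could look roughly as follows; the valid_bands index list stands in for the retained spectral bands (the exact set of removed atmospheric bands is not listed here), and the function is a sketch rather than the actual preprocessing pipeline.

import numpy as np

def tile_scene(scene, valid_bands, tile=64):
    """Cut an EnMAP L2 scene array (h, w, 224) into non-overlapping 64x64 tiles,
    keeping only the valid spectral bands (200 bands after removal)."""
    scene = scene[:, :, valid_bands]               # drop invalid atmospheric bands
    h, w, c = scene.shape
    tiles = []
    for i in range(0, h - tile + 1, tile):         # non-overlapping grid
        for j in range(0, w - tile + 1, tile):
            tiles.append(scene[i:i + tile, j:j + tile, :])
    return np.stack(tiles)                         # (n_tiles, 64, 64, c)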
EnMAP-DFC We create a labeled EnMAP dataset by matching two atmospherically corrected EnMAP L2 scenes over Mexico City with land cover data for the same region that was published for the IEEE GRSS DFC 2020 [50] (see Fig. 4). This dataset consists of 357 64 × 64 pixel tiles with pixel-wise labels for the classes Forest, Shrubland, Grassland, Wetland, Cropland, Urban/Built-up, Barren and Water. For our experiments, the data is randomly split into 286 training/validation tiles and 71 tiles for final testing. We note that some label noise is introduced due to the difference in labeling date (2020) and time of the EnMAP overflight in 2022 (see Fig. 4, top and center rows).

Houston2018 As a second labeled hyperspectral dataset, we use the Houston data from the IEEE GRSS DFC in 2018 [49]. This dataset consists of aerial imagery of the city of Houston (see Fig. 5), obtained with a hyperspectral instrument in the 380 − 1050 nm spectral range with 48 bands and 1 m spatial resolution. The scene has 1202 × 4172 hyperspectral pixels, 590 149 of which are labeled into 20 fine-grained classes. We use the official train/test split of the dataset in our experiments (504 712 pixels for training and validation, 85 437 for testing).

Metrics We evaluate model performance for land cover classification on EnMAP-DFC and Houston2018 with accuracy and macro accuracy metrics. The standard accuracy measures the fraction of correctly classified samples over the entire dataset (see Eqn. 3). Macro accuracy provides the average of class-wise accuracies, which can deviate from the accuracy on unbalanced datasets (see Eqn. 4). We report the average and standard deviation of each metric, computed over 5 training runs with different random seeds.

\mathrm{Acc.} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}   (3)

where TP, TN, FP and FN correspond to true positive, true negative, false positive, and false negative, respectively.

\mathrm{Macro\ Acc.} = \frac{\sum_{c\in\mathcal{C}} \mathrm{Acc.}(\{x_i \mid y_i = c\}_i)}{|\mathcal{C}|}   (4)

where x_i is a data sample, y_i the corresponding class label, and \mathcal{C} the set of all classes in the dataset.
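For reference, a small sketch of both metrics on flattened label arrays; the class-wise accuracy of Eq. 4 is computed here as the fraction of correctly classified samples within each class, which is one straightforward reading of the definition, and the function names are our own.

import numpy as np

def accuracy(y_true, y_pred):
    """Overall accuracy: fraction of correctly classified pixels (Eq. 3)."""
    return np.mean(y_true == y_pred)

def macro_accuracy(y_true, y_pred):
    """Macro accuracy: average of class-wise accuracies (Eq. 4)."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(per_class)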
5. Experiments & Results

5.1. Baselines

We use three different baseline models in this work. The ViT-RGB model directly applies the ViT [12] approach with dimensionality d = 96, 4 blocks, and 8 heads in the multi-head self-attention to the RGB bands of hyperspectral data. Accordingly, the data only consists of three spectral bands, which are aggregated during patch embedding. This model yields an accuracy of 69 ± 0.5% on the EnMAP-DFC dataset (see Table 1) and 19 ± 1.4% on Houston2018 (see Table 2). Our spectral model is a sequential transformer of the same size as the ViT-RGB. Hyperspectral pixels are processed individually, and self-attention acts between spectral tokens of width p_c = 10. Unlike the ViT-RGB, this approach can fully leverage the spectral information of the hyperspectral dataset, and improves the accuracy significantly to 77 ± 0.2% on EnMAP-DFC. On the Houston2018 dataset, the spectral model reaches an accuracy of 47 ± 3.2%. As a convolutional baseline, we use an established 3D-CNN model [25] with strong performance on hyperspectral datasets [2]. Using 3D convolutions, this model can incorporate both spectral and spatial information, yielding a performance of 83 ± 0.3% for EnMAP and 45 ± 1.8% on Houston2018. Unlike the presented transformer approaches (see Table 1), the 3D-CNN model only makes predictions for the center pixel of every input patch. This improves performance but necessitates a sliding-window inference strategy to create pixelwise land cover maps, which strongly increases computational cost. We re-train our best performing model using this approach on the EnMAP data and adopt the same strategy on the Houston2018 dataset for comparability.

Model | Acc (%) | MAcc (%)
3D-CNN [25] | 45 ± 1.8 | 45 ± 1.0
Transformer [42] | 33 ± 1.1 | 26 ± 1.0
ViT-RGB | 19 ± 1.4 | 21 ± 1.4
Spectral T. | 47 ± 3.2 | 43 ± 1.6
SST | 43 ± 2.4 | 40 ± 1.7
Masked SST | 48 ± 2.8 | 42 ± 1.2

Table 2. Land cover classification results for the Houston2018 dataset. SST corresponds to the spatial-spectral transformer with BPE. The masked SST is pre-trained and fine-tuned on Houston2018 training data. All models besides the standard transformer are trained for center pixel prediction.

5.2. Spatial-Spectral Embedding

We extend the spectral transformer to deal with spatial-spectral signals by embedding the data along both spatial and spectral axes (see Fig. 3). This increases the number of embeddings by a factor of (h/p_h) · (w/p_w) compared to the spectral model. The spatial-spectral factorization strategy detailed in Section 3.1.2 allows our spatial-spectral model (SST) to efficiently process the increased number of tokens. The model consists of two stacked transformers (with d = 96, 4 transformer blocks and 8 heads) that sequentially process the tokens with p_{h,w} = 1 and p_c = 10 along the spatial and spectral dimension, respectively. This approach yields an accuracy of 79 ± 0.1% on EnMAP-DFC and serves as the basic backbone for the other presented transformer modifications. Adding the blockwise patch embedding (BPE) scheme allows the model to embed patches conditionally on their position along the spectral axis (see Fig. 3) and improves accuracy to 81 ± 0.2%. We find that spectral positional embeddings (SPE) slightly harm the performance of the SST model (78 ± 0.2%), while the combination of BPE and SPE yields an improvement to 82 ± 0.1% on the EnMAP data. This model reaches an accuracy of 43 ± 2.4% on Houston2018.

5.3. Masked Pre-training

Self-supervised masked modeling increases the data efficiency of transformer models for natural language [11] or image [48] applications. We pre-train our transformer configurations for hyperspectral data on the unlabeled EnMAP dataset with a masked pixel reconstruction strategy. The model is trained for 200 epochs to reconstruct the 70% of patches which were masked in a 4 × 4 blockwise fashion. This pre-training yields small improvements over training from scratch on the larger EnMAP-DFC dataset (e.g., +1% accuracy for the masked SST with BPE). On the Houston2018 data, masked pre-training improves the SST model by +5% to 48 ± 2.8% accuracy. The combination of masked pre-training and BPE results in strong representations, as revealed by linear probing on the frozen transformer weights (see columns 'Frozen' in Table 1). The masked SST improves by +3% accuracy to 79 ± 0.1% and +5% in macro accuracy over the SST without pre-training on the EnMAP-DFC dataset.

Model | 0.1% | 1% | 10% | 100%
3D-CNN [25] | 28 ± 1.8 | 38 ± 1.3 | 42 ± 1.0 | 45 ± 1.8
Transf. [42] | 10 ± 0.1 | 10 ± 0.2 | 17 ± 1.8 | 33 ± 1.2
ViT-RGB | 14 ± 1.7 | 14 ± 1.2 | 17 ± 0.9 | 19 ± 1.4
Spectral T. | 17 ± 1.7 | 34 ± 2.0 | 44 ± 2.9 | 47 ± 3.2
SST | 27 ± 3.2 | 38 ± 1.7 | 43 ± 2.8 | 43 ± 2.4
Masked SST | 35 ± 2.0 | 46 ± 3.1 | 47 ± 1.9 | 48 ± 2.8

Table 3. Land cover classification accuracy on Houston2018 for different fractions of the training set (100%: 504 712 labeled pixels). When labeled training data is scarce, the pre-trained transformer significantly outperforms the other models.
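The difference between the 'Finetuned' and 'Frozen' settings of Table 1 can be sketched as follows: fine-tuning updates all encoder parameters on the labeled data, whereas the frozen setting (linear probing) trains only a classification head on top of the pre-trained representations. The helper below is a hypothetical illustration; the head shape and the eight EnMAP-DFC classes are taken from Section 4, all other names are assumptions.

import torch.nn as nn

def build_classifier(encoder, dim=96, n_classes=8, freeze_encoder=True):
    """Attach a land cover classification head to a (pre-trained) encoder.
    freeze_encoder=True mirrors the 'Frozen' columns of Table 1 (linear probing),
    freeze_encoder=False mirrors the 'Finetuned' columns (all parameters updated)."""
    if freeze_encoder:
        for p in encoder.parameters():
            p.requires_grad = False       # only the new head receives gradients
    head = nn.Linear(dim, n_classes)      # maps token representations to class logits
    return nn.Sequential(encoder, head)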
Data Efficiency To investigate model performance on downstream applications with little labeled data, we train pre-trained and randomly initialized SST models with BPE and SPE on successively smaller portions of the Houston2018 dataset (see Table 3). Using as little as 0.1% of the Houston2018 training data (∼504 pixels) results in an accuracy of 27 ± 3.2% for the randomly initialized SST model and 28 ± 1.8% for the baseline 3D-CNN [25]. The pre-trained SST model reaches an accuracy of 35 ± 2.0%, which corresponds to a +8% increase that can be attributed to self-supervised pre-training (see Fig. 7). We observe a similar performance advantage for the self-supervised model when training on 1% and 10% of the Houston2018 training set (see Table 3). We note that the masked SST model outperforms the SST model without pre-training and the 3D-CNN with as little as 1% of the labeled training data.

Figure 7. Performance of the 3D-CNN baseline [25] and spatial-spectral transformer models (SST, Masked SST) trained on varying fractions of the Houston2018 dataset.

6. Discussion

This work investigates the utility of masked hyperspectral image reconstruction for self-supervised learning of transformers. We pre-train different transformer model configurations on unlabeled data and evaluate them on the labeled EnMAP-DFC and Houston2018 datasets. A comparison of the baseline vanilla transformer and ViT-RGB methods reveals the high importance of spectral information for the EnMAP-DFC land cover classification task: ViT-RGB, which has access to larger spatial context but disregards spectral information beyond the RGB bands, performs significantly worse than the standard transformer trained on the entire spectral sequence (−8% accuracy). Interestingly, both baseline transformer approaches lag significantly behind the convolutional 3D-CNN baseline [25]. We find that blockwise patch embedding is an important enhancement for spectral transformers and provides an implicit encoding of each token position in the spectral sequence. Despite EnMAP's high spectral resolution, the use of spatial context provides improvements in model performance. Incorporating spectral positional embeddings into the spatial-spectral transformer further boosts classification accuracy in our experiments when combined with blockwise patch embedding.

In order to leverage large unlabeled hyperspectral datasets and to boost the label efficiency of transformer models, we utilize masked data reconstruction as a self-supervised pre-training task. Linear probing from the self-supervised representations indicates that masked hyperspectral image reconstruction yields meaningful representations that can achieve strong classification performance on EnMAP-DFC. We further conduct an ablation study on the label efficiency of our masked spatial-spectral transformer on the Houston2018 dataset. The pre-trained model can be fine-tuned with 1% of the labeled data to surpass the performance of the baseline models trained on 100% of the labeled data.

7. Conclusion

Our systematic evaluation of vision transformer models for hyperspectral remote sensing data reveals the benefits of different positional encoding schemes and the importance of modeling spatial-spectral interactions with self-attention. Factorizing self-attention between the spatial and spectral dimensions enables self-attention for high-dimensional hyperspectral data. We further showcase the potential of masked transformer pre-training and evaluate the resulting models with different amounts of labeled training data. The results of this study indicate that masked pre-training is highly effective at improving the label efficiency of transformer models, and can also boost performance when a large number of labels is available. We believe that these results will be highly relevant for the hyperspectral remote sensing community as transformer networks continue to excel at vision tasks and more large unlabeled hyperspectral datasets become publicly available.

Acknowledgements

We thank the EnMAP mission team, the Hyperspectral Image Analysis Lab (University of Houston) and the IEEE GRSS IADF for providing the data used in this work, as well as the anonymous reviewers for their helpful comments.
References [14] Kamlesh Golhani, Siva K Balasundram, Ganesan Vadamalai,
and Biswajeet Pradhan. A Review of Neural Networks in
[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Plant Disease Detection Using Hyperspectral Data. Infor-
Sun, Mario Lučić, and Cordelia Schmid. ViViT: A Video mation Processing in Agriculture, 5(3):354–371, 2018. 2
Vision Transformer. In Proceedings of the IEEE/CVF Inter-
[15] Megandhren Govender, Kershani Chetty, and Hartley Bul-
national Conference on Computer Vision, pages 6836–6846,
cock. A Review of Hyperspectral Remote Sensing and its
2021. 3
Application in Vegetation and Water Resource Studies. Wa-
[2] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre. ter Sa, 33(2):145–151, 2007. 1
Deep Learning for Classification of Hyperspectral Data: A
[16] Luis Guanter, Hermann Kaufmann, Karl Segl, Saskia Fo-
Comparative Review. IEEE Geoscience and Remote Sensing
erster, Christian Rogass, Sabine Chabrillat, Theres Kuester,
Magazine, 7(2):159–173, 2019. 2, 7
André Hollstein, Godela Rossner, Christian Chlebek, et al.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- The EnMAP Spaceborne Imaging Spectroscopy Mission for
ton. Layer Normalization. arXiv preprint arXiv:1607.06450, Earth Observation. Remote Sensing, 7(7):8830–8857, 2015.
2016. 3 1, 6
[4] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: [17] Michael Gutmann and Aapo Hyvärinen. Noise-Contrastive
Bert pre-training of image transformers. International Con- Estimation: A New Estimation Principle for Unnormalized
ference on Learning Representations, 2022. 3 Statistical Models. In Proceedings of the Thirteenth Inter-
[5] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is national Conference on Artificial Intelligence and Statistics,
Space-time Attention all you need for Video Understanding? pages 297–304. JMLR Workshop and Conference Proceed-
In ICML, volume 2, page 4, 2021. 3 ings, 2010. 3
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple [18] Ji He, Lina Zhao, Hongwei Yang, Mengmeng Zhang, and
framework for contrastive learning of visual representations. Wei Li. HSI-BERT: Hyperspectral image classification us-
In ICML, pages 1597–1607. PMLR, 2020. 3 ing the bidirectional encoder representation from transform-
[7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. ers. IEEE Transactions on Geoscience and Remote Sensing,
Generating Long Sequences with Sparse Transformers. 58(1):165–178, 2019. 4
arXiv preprint arXiv:1904.10509, 2019. 3 [19] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr
[8] Krzysztof Marcin Choromanski, Valerii Likhosherstov, Dollár, and Ross Girshick. Masked Autoencoders are Scal-
David Dohan, Xingyou Song, Andreea Gane, Tamas Sar- able Vision Learners. In Proceedings of the IEEE/CVF Con-
los, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, ference on Computer Vision and Pattern Recognition, pages
Lukasz Kaiser, et al. Rethinking attention with perform- 16000–16009, 2022. 3, 5
ers. In International Conference on Learning Representa- [20] Xin He, Yushi Chen, and Zhouhan Lin. Spatial-spectral
tions year=2021. 3 Transformer for Hyperspectral Image Classification. Remote
[9] Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Sensing, 13(3):498, 2021. 2
Erik Rozi, Yutong He, Marshall Burke, David B Lobell, [21] Uta Heiden, Karl Segl, Sigrid Roessner, and Hermann Kauf-
and Stefano Ermon. SatMAE: Pre-training Transformers for mann. Determination of Robust Spectral Features for Identi-
Temporal and Multi-Spectral Satellite Imagery. In Advances fication of Urban Surface Materials in Hyperspectral Remote
in Neural Information Processing Systems. 4, 5 Sensing Data. Remote Sensing of Environment, 111(4):537–
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and 552, 2007. 1
Li Fei-Fei. Imagenet: A Large-scale Hierarchical Image [22] Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang,
Database. In 2009 IEEE Conference on Computer Vision Antonio Plaza, and Jocelyn Chanussot. SpectralFormer:
and Pattern Recognition, pages 248–255. Ieee, 2009. 1 Rethinking Hyperspectral Image Classification with Trans-
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina formers. IEEE Transactions on Geoscience and Remote
Toutanova. BERT: Pre-training of Deep Bidirectional Trans- Sensing, 60:1–15, 2021. 2
formers for Language Understanding. In Proceedings of the [23] Zilong Huang, Xinggang Wang, Lichao Huang, Chang
2019 Conference of the North American Chapter of the As- Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross
sociation for Computational Linguistics: Human Language Attention for Semantic Segmentation. In Proceedings of the
Technologies, Volume 1 (Long and Short Papers), pages IEEE/CVF International Conference on Computer Vision,
4171–4186, 2019. 2, 3, 5, 7 pages 603–612, 2019. 3
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, [24] Damian Ibanez, Ruben Fernandez-Beltran, Filiberto Pla, and
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Naoto Yokoya. Masked Auto-Encoding Spectral–Spatial
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- Transformer for Hyperspectral Image Classification. IEEE
vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Transactions on Geoscience and Remote Sensing, 60:1–14,
Worth 16x16 Words: Transformers for Image Recognition at 2022. 4
Scale. ArXiv, abs/2010.11929, 2021. 3, 4, 6, 7 [25] Ying Li, Haokui Zhang, and Qiang Shen. Spectral–Spatial
[13] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsu- Classification of Hyperspectral Imagery with 3D Convolu-
pervised Representation Learning by Predicting image Rota- tional Neural Network. Remote Sensing, 9(1):67, 2017. 6, 7,
tions. arXiv preprint arXiv:1803.07728, 2018. 3 8

[26] Konstantinos Makantasis, Konstantinos Karantzalos, Anas- [39] Aidan M Swope, Xander H Rudelis, and Kyle T Story. Rep-
tasios Doulamis, and Nikolaos Doulamis. Deep Supervised resentation Learning for Remote Sensing: An Unsupervised
Learning for Hyperspectral Data Classification through Con- Sensor Fusion Approach. arXiv preprint arXiv:2108.05094,
volutional Neural Networks. In 2015 IEEE International 2021. 3
Geoscience and Remote Sensing Symposium (IGARSS), [40] Chao Tao, Ji Qi, Weipeng Lu, Hao Wang, and Haifeng
pages 4959–4962. IEEE, 2015. 2 Li. Remote Sensing Image Scene Classification with Self-
[27] Oscar Mañas, Alexandre Lacoste, Xavier Giro-i Nieto, supervised Paradigm under Limited Labeled Samples. IEEE
David Vazquez, and Pau Rodriguez. Seasonal contrast: Un- Geoscience and Remote Sensing Letters, 2020. 3
supervised pre-training from uncurated remote sensing data. [41] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang.
In Proceedings of the IEEE/CVF International Conference VideoMAE: Masked Autoencoders are Data-Efficient Learn-
on Computer Vision, pages 9414–9423, 2021. 3 ers for Self-Supervised Video Pre-Training. In Advances in
[28] Lichao Mou, Pedram Ghamisi, and Xiao Xiang Zhu. Deep Neural Information Processing Systems. 5
Recurrent Neural Networks for Hyperspectral Image Clas- [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
sification. IEEE Transactions on Geoscience and Remote reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Sensing, 55(7):3639–3655, 2017. 2 Polosukhin. Attention is all you need. Advances in Neural
[29] Mehdi Noroozi and Paolo Favaro. Unsupervised Learning Information Processing Systems, 30, 2017. 2, 3, 5, 6, 7
of Visual Representations by Solving Jigsaw Puzzles. In [43] Wenxuan Wang, Leiming Liu, Tianxiang Zhang, Jiachen
European Conference on Computer Vision, pages 69–84. Shen, Jing Wang, and Jiangyun Li. Hyper-ES2T: Ef-
Springer, 2016. 3 ficient Spatial–Spectral Transformer for the Classification
[30] ME Paoletti, JM Haut, J Plaza, and A Plaza. Deep Learning of Hyperspectral Remote Sensing Images. International
Classifiers for Hyperspectral Imaging: A Review. ISPRS Journal of Applied Earth Observation and Geoinformation,
Journal of Photogrammetry and Remote Sensing, 158:279– 113:103005, 2022. 2
317, 2019. 1 [44] Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham,
[31] Yuhao Qing, Wenyi Liu, Liuyan Feng, and Wanjia Gao. Im- LiChao Mou, and Xiao Xiang Zhu. Self-supervised Learn-
proved Transformer Net for Hyperspectral Image Classifica- ing in Remote Sensing: A Review. IEEE Geoscience and
tion. Remote Sensing, 13(11):2216, 2021. 2 Remote Sensing Magazine (GRSM), 2022. 3
[32] Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brock- [45] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.
man, Christopher Funk, Brian Clipp, Salvatore Candido, Unsupervised Feature Learning via Non-parametric Instance
Matt Uyttendaele, and Trevor Darrell. Scale-MAE: A Scale- Discrimination. In Proceedings of the IEEE Conference
Aware Masked Autoencoder for Multiscale Geospatial Rep- on Computer Vision and Pattern Recognition, pages 3733–
resentation Learning. arXiv preprint arXiv:2212.14532, 3742, 2018. 3
2022. 4 [46] Tete Xiao, Colorado J Reed, Xiaolong Wang, Kurt Keutzer,
[33] Sudipan Saha, Patrick Ebel, and Xiao Xiang Zhu. Self- and Trevor Darrell. Region similarity representation learn-
supervised Multisensor Change Detection. IEEE Transac- ing. In Proceedings of the IEEE/CVF International Confer-
tions on Geoscience and Remote Sensing, 2021. 3 ence on Computer Vision, pages 10539–10548, 2021. 3
[34] Linus Scheibenreif, Joëlle Hanna, Michael Mommert, and [47] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen
Damian Borth. Self-supervised Vision Transformers for Lin, and Han Hu. Propagate Yourself: Exploring Pixel-level
Land-cover Segmentation and Classification. In Proceedings Consistency for Unsupervised Visual Representation Learn-
of the IEEE/CVF Conference on Computer Vision and Pat- ing. In Proceedings of the IEEE/CVF Conference on Com-
tern Recognition Workshops, pages 1422–1431, 2022. 3 puter Vision and Pattern Recognition, pages 16684–16693,
[35] Linus Scheibenreif, Michael Mommert, and Damian Borth. 2021. 3
Contrastive Self-supervised Data Fusion for Satellite Im- [48] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin
agery. In International Annals of the Photogrammetry, Re- Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A Simple
mote Sensing and Spatial Information Sciences, 2022. 3 Framework for Masked Image Modeling. In Proceedings of
[36] Laurent Sifre and Prof Stéphane Mallat. Rigid-motion Scat- the IEEE/CVF Conference on Computer Vision and Pattern
tering for Image Classification. English. Supervisor: Prof. Recognition, pages 9653–9663, 2022. 3, 5, 7
Stéphane Mallat. Ph. D. Thesis. Ecole Polytechnique, 2, [49] Yonghao Xu, Bo Du, Liangpei Zhang, Daniele Cerra, Miguel
2014. 5 Pato, Emiliano Carmona, Saurabh Prasad, Naoto Yokoya,
[37] Vladan Stojnic and Vladimir Risojevic. Self-supervised Ronny Hänsch, and Bertrand Le Saux. Advanced Multi-
Learning of Remote Sensing Scene Representations us- sensor Optical Remote Sensing for Urban Land Use and
ing Contrastive Multiview Coding. In Proceedings of the Land Cover Classification: Outcome of the 2018 IEEE
IEEE/CVF Conference on Computer Vision and Pattern GRSS Data Fusion Contest. IEEE Journal of Selected
Recognition, pages 1182–1191, 2021. 3 Topics in Applied Earth Observations and Remote Sensing,
[38] Le Sun, Guangrui Zhao, Yuhui Zheng, and Zebin Wu. 12(6):1709–1724, 2019. 6
Spectral–Spatial Feature Tokenization Transformer for Hy- [50] Naoto Yokoya, Pedram Ghamisi, Ronny Hänsch, and
perspectral Image Classification. IEEE Transactions on Geo- Michael Schmitt. 2020 IEEE GRSS Data Fusion Contest:
science and Remote Sensing, 60:1–14, 2022. 2 Global Land Cover Mapping with Weak Supervision [tech-

nical committees]. IEEE Geoscience and Remote Sensing
Magazine, 8(1):154–157, 2020. 6
[51] Naoto Yokoya, Claas Grohnfeldt, and Jocelyn Chanussot.
Hyperspectral and Multispectral Data Fusion: A Compara-
tive Review of the Recent Literature. IEEE Geoscience and
Remote Sensing Magazine, 5(2):29–56, 2017. 2
[52] Yuan Yuan and Lei Lin. Self-supervised Pretraining of
Transformers for Satellite Image Time Series Classification.
IEEE Journal of Selected Topics in Applied Earth Observa-
tions and Remote Sensing, 14:474–487, 2020. 3
[53] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
image colorization. In European Conference on Computer
Vision, pages 649–666. Springer, 2016. 3
[54] Xiangrong Zhang, Yujia Sun, Jingyan Zhang, Peng Wu, and
Licheng Jiao. Hyperspectral Unmixing via Deep Convolu-
tional Neural Networks. IEEE Geoscience and Remote Sens-
ing Letters, 15(11):1755–1759, 2018. 2
[55] Wenzhi Zhao and Shihong Du. Spectral–Spatial Feature Ex-
traction for Hyperspectral Image Classification: A Dimen-
sion Reduction and Deep Learning Approach. IEEE Trans-
actions on Geoscience and Remote Sensing, 54(8):4544–
4554, 2016. 2
[56] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang
Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT Pre-
Training with Online Tokenizer. International Conference
on Learning Representations, 2022. 3

