Masked Vision Transformers for Hyperspectral Image Classification
Scheibenreif et al., CVPR Workshops (CVPRW) 2023
Abstract
Transformer architectures have become state-of-the-art
Figure 2. Overview of our proposed transformer model for hyperspectral data with spatial-spectral factorization within the masked self-supervised pre-training framework. A: The hyperspectral data cube is first divided into spatial-spectral patches p ∈ R^{p_h × p_w × p_c}. The patches are randomly masked, embedded and processed by the transformer, which sequentially applies self-attention spatially and spectrally between all embeddings. A linear layer maps representations of the masked patches back to pixel space to compute the reconstruction error. B: Our spatial-spectral transformer consists of a patch embedding layer and transformer blocks that apply self-attention among tokens with the same spectral or spatial index. The colors indicate token locations in the hyperspectral cube.
2.3. Self-supervised Learning

The goal of self-supervised learning (SSL) is to learn
Figure 4. Top: Tiles from EnMAP L2 scenes over Mexico City. Center: Corresponding DFC2020 land cover labels. Bottom: Predicted
land cover classes from the masked spatial-spectral transformer model.
spectral and multi-temporal data [9], or extend it to imagery of varying ground sampling distance [32]. In the hyperspectral domain, masked sequence modeling has been used to model the spectral signal [18], and within the masked autoencoding framework [24].

3. Method

This section introduces the proposed transformer model for hyperspectral data (3.1), the spatial-spectral patch embedding strategy (3.1.1), and how hyperspectral data is efficiently processed by factorizing self-attention spatially and spectrally (3.1.2). Finally, we present the masked pre-training scheme (3.2).

3.1. Transformer Architecture

This work adapts the vision transformer [12] architecture to hyperspectral imagery. Starting from a baseline transformer model, we successively add model components and adjust design choices to improve efficiency and performance on hyperspectral data. Our baseline spectral model processes the spectral sequence of individual pixels with a transformer encoder. Each pixel is divided into patches along the spectral dimension, resulting in n = c/p_c blocks of size p_c, and then embedded with a shared linear transform. Learnable positional embeddings are added to the embedding sequence. As a spatial transformer baseline, we apply the original ViT [12] architecture on the RGB bands of hyperspectral data (ViT-RGB).

3.1.1 Spatial-Spectral Patch Embeddings

Spatial-Spectral Patches  To incorporate the spatial context for the spectral sequence of each pixel, our spatial-spectral model divides the input image x ∈ R^{h×w×c} into n = (h/p_h) · (w/p_w) · (c/p_c) patches of size p_h · p_w · p_c. This approach mirrors the spatial patching of the original ViT, but does not aggregate all spectral bands per location, thus retaining the hyperspectral 3D cube. This makes it possible to model both spectral and spatial relationships within the image using attention. Spatial-spectral patching increases the number of tokens by a factor of c/p_c compared to spatial ViT patching and by (h/p_h) · (w/p_w) compared to the spectral transformer approach. Since the computational cost of self-attention is quadratic in the number of tokens, modeling all spatial-spectral relationships is practically infeasible for anything but very large spatial and spectral patch sizes p_{h,w,c}. We address this limitation in Section 3.1.2.
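Concretely, the patching step amounts to a single reshape of the data cube into n tokens of p_h · p_w · p_c values each. The following is a minimal PyTorch sketch (our own illustrative code, not the authors' implementation); the default patch sizes follow the configuration reported in Section 5.2 (p_{h,w} = 1, p_c = 10).

```python
import torch

def spatial_spectral_patchify(x, ph=1, pw=1, pc=10):
    """Split a hyperspectral cube (h, w, c) into n = (h/ph)*(w/pw)*(c/pc)
    patches of size ph*pw*pc, one row per spatial-spectral token."""
    h, w, c = x.shape
    assert h % ph == 0 and w % pw == 0 and c % pc == 0
    # split each axis into (block index, within-block offset)
    x = x.reshape(h // ph, ph, w // pw, pw, c // pc, pc)
    # group the patch-index axes first, then flatten each patch
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, ph * pw * pc)

# Example: a 64x64 EnMAP tile with 200 bands
tokens = spatial_spectral_patchify(torch.randn(64, 64, 200))
print(tokens.shape)  # torch.Size([81920, 10])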
Blockwise Spectral Embedding  Vision transformers create embeddings from patches through a learned linear transform that is shared between all patches (see Fig. 3 A). Unlike the spatial patches of ViT, which always represent the RGB intervals of the electromagnetic spectrum, our spatial-spectral patches represent multiple different spectral wavelength intervals for every spatial patch. To account for this diversity in the spectral signal, we propose a blockwise spectral embedding scheme that utilizes a separate linear transform for each of the c/p_c spectral blocks in the patched hyperspectral data (see Fig. 3 B). This approach is most similar to group embeddings which have been used for multi-temporal and multi-spectral remote sensing imagery [9].
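A sketch of the blockwise embedding under the same assumptions (illustrative names; num_blocks corresponds to c/p_c): each spectral block index gets its own projection instead of one shared embedding matrix.

```python
import torch
import torch.nn as nn

class BlockwiseSpectralEmbedding(nn.Module):
    """One linear projection per spectral block (BPE), instead of a single
    embedding matrix shared across all patches."""

    def __init__(self, patch_dim: int, embed_dim: int, num_blocks: int):
        super().__init__()
        self.num_blocks = num_blocks
        self.projections = nn.ModuleList(
            [nn.Linear(patch_dim, embed_dim) for _ in range(num_blocks)]
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_spatial, num_blocks, patch_dim)
        out = [self.projections[b](patches[:, :, b]) for b in range(self.num_blocks)]
        return torch.stack(out, dim=2)  # (batch, n_spatial, num_blocks, embed_dim)

# 200 bands with p_c = 10 gives 20 spectral blocks, embedded into d = 96
embed = BlockwiseSpectralEmbedding(patch_dim=10, embed_dim=96, num_blocks=20)
tokens = embed(torch.randn(2, 64 * 64, 20, 10))
print(tokens.shape)  # torch.Size([2, 4096, 20, 96])
```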
We investigate the applicability of two different positional encoding techniques for the spatial-spectral embeddings: Learnable positional embeddings for every spatial-spectral patch that are optimized along with the transformer during model training. Alternatively, spectral positional embeddings explicitly encode spatial and spectral positional information of the hyperspectral data separately with fixed sine and cosine functions [9, 42] and the transformer dimensionality d.

\mathrm{PE}_{pos,2i} = \sin(pos/10000^{2i/d}), \quad \mathrm{PE}_{pos,2i+1} = \cos(pos/10000^{2i/d})    (2)

We allot one third of the embedding vector to the spectral positional embedding (i.e., the encoding of the patch's index in the spectral sequence), and the remainder for the spatial embedding of horizontal and vertical position.
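The fixed spatial-spectral positional encoding can be sketched as follows. This is our illustration of Eq. (2) and the one-third split described above, not the authors' exact layout; in particular, it concatenates rather than interleaves the sine and cosine channels.

```python
import torch

def sincos_1d(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Sine/cosine encoding of integer positions into `dim` channels (cf. Eq. (2))."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    freqs = 1.0 / (10000 ** (2 * i / dim))          # (dim/2,)
    angles = positions.float()[:, None] * freqs      # (n, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

def spatial_spectral_pos_encoding(n_h: int, n_w: int, n_c: int, d: int = 96) -> torch.Tensor:
    """One third of d encodes the spectral block index; the remainder is split
    between the vertical and horizontal patch position."""
    d_spec = d // 3
    d_vert = (d - d_spec) // 2
    d_horiz = d - d_spec - d_vert
    gh, gw, gc = torch.meshgrid(
        torch.arange(n_h), torch.arange(n_w), torch.arange(n_c), indexing="ij"
    )
    return torch.cat(
        [
            sincos_1d(gc.reshape(-1), d_spec),    # spectral block index
            sincos_1d(gh.reshape(-1), d_vert),    # vertical position
            sincos_1d(gw.reshape(-1), d_horiz),   # horizontal position
        ],
        dim=1,
    )                                             # (n_h * n_w * n_c, d)

pe = spatial_spectral_pos_encoding(n_h=64, n_w=64, n_c=20)
print(pe.shape)  # torch.Size([81920, 96])
```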
3.1.2 Spatial-Spectral Factorization

The spatial-spectral patch embedding strategy yields a large number of tokens for high dimensional hyperspectral data. This is a bottleneck for the attention operation, which has quadratic runtime in the number of tokens. To make training feasible, we resolve this limitation by factorizing the transformer model to sequentially process spatial and spectral relationships within the data (see Fig. 2 B). This approach is similar to separable convolutions in CNNs, where 2D and 1D convolutions are sequentially applied over and across feature maps [36]. This strategy reduces the computational load of self-attention from the squared product of the number of spatial and spectral patches to their squared sum, i.e., from O((h/p_h · w/p_w · c/p_c)^2 · d) to O((h/p_h · w/p_w + c/p_c)^2 · d). In practice the number of required operations on our hyperspectral data changes by a factor of ∼1/75.
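A minimal sketch of the factorization idea using standard PyTorch attention modules (illustrative code, not the authors' implementation; layer normalization, MLP sublayers, and the full block structure of Fig. 2 B are omitted): attention first acts over spatial tokens that share a spectral index, then over spectral tokens that share a spatial location.

```python
import torch
import torch.nn as nn

class FactorizedSpatialSpectralAttention(nn.Module):
    """Self-attention applied sequentially along the spatial and spectral axes.
    Tokens are arranged as (batch, n_spatial, n_spectral, d); each pass only
    attends within one axis, so cost scales with n_spatial^2 + n_spectral^2
    rather than (n_spatial * n_spectral)^2."""

    def __init__(self, d: int = 96, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.spectral_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, ns, nc, d = x.shape
        # spatial attention: sequences of length ns, one per spectral index
        xs = x.permute(0, 2, 1, 3).reshape(b * nc, ns, d)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        x = xs.reshape(b, nc, ns, d).permute(0, 2, 1, 3)
        # spectral attention: sequences of length nc, one per spatial location
        xc = x.reshape(b * ns, nc, d)
        xc = xc + self.spectral_attn(xc, xc, xc, need_weights=False)[0]
        return xc.reshape(b, ns, nc, d)

block = FactorizedSpatialSpectralAttention()
out = block(torch.randn(2, 8 * 8, 20, 96))
print(out.shape)  # torch.Size([2, 64, 20, 96])
```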
Figure 6. Top: Original RGB representation and reconstruction with 70% masking and mask patch size 4 after 200 training epochs. The heatmap indicates how many of the RGB bands were masked for each image patch. Bottom: Reconstruction along the spectral axis for the two pixels highlighted in red in the top-left image. Gray sections correspond to masked patches.

3.2. Masked Self-supervised Learning

The transformer models investigated in this work consist of a transformer encoder. During masked pre-training, we add an additional linear layer to map latent token representations from the transformer to pixel values, following the SimMIM method [48]. After patch embedding, a fraction of the embeddings is selected and replaced with a learnable mask token. The pre-training objective is to reconstruct the pixel values corresponding to the masked tokens (see Fig. 2). The reconstruction quality is measured by L1 loss, which is only evaluated for masked pixel tokens (see Fig. 6). Unlike similar approaches that utilize encoder-decoder architectures for masked pre-training [19], the small linear reconstruction head in this approach forces the encoder to focus its capacity on modeling the masked tokens, rather than leaving this task to the decoder. Masked sentence models [11] commonly mask 15% of tokens, while image [48] and video [41] models mask around 50% and 90%, respectively. We employ a blockwise masking strategy (i.e., masking 4 × 4 windows of tokens instead of individual tokens) to prevent trivial solutions which are possible due to the high correlation of spectrally adjacent tokens.
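The pre-training objective can be sketched as below. This is our illustration, not the released code: we assume the mask is sampled per spatial-spectral token with 4 × 4 blocks along the spatial axes (the heatmap in Fig. 6 suggests the masked bands differ between image patches), and `encoder`, `head`, and `mask_token` are placeholders for the transformer, the linear reconstruction layer, and the learned mask embedding.

```python
import torch
import torch.nn.functional as F

def blockwise_mask(n_h, n_w, n_c, mask_ratio=0.7, block=4):
    """Boolean token mask (True = masked) where masked positions come in
    block x block windows along the spatial axes."""
    coarse = torch.rand(n_h // block, n_w // block, n_c) < mask_ratio
    return coarse.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)

def masked_reconstruction_loss(encoder, head, mask_token, tokens, targets, mask):
    """SimMIM-style objective: swap masked embeddings for the learned mask
    token (shape (1, 1, d)), encode, project back to pixel space, and take the
    L1 loss on masked tokens only."""
    x = torch.where(mask.reshape(1, -1, 1), mask_token, tokens)
    recon = head(encoder(x))               # (batch, n_tokens, pixels per patch)
    flat = mask.reshape(-1)
    return F.l1_loss(recon[:, flat], targets[:, flat])

mask = blockwise_mask(64, 64, 20)
print(mask.shape, mask.float().mean())     # roughly 70% of tokens masked
```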
4. Data
This work applies transformer models to hyperspectral remote sensing data. To that end, we utilize hyperspectral datasets from the Environmental Mapping and Analysis Program [16] and the IEEE GRSS Data Fusion Challenge (DFC) 2018 [49].
Figure 5. Top: RGB representation of the Houston2018 hyperspectral training set (bands 48, 32, 16). Bottom: Training labels for Houston2018 (20 classes, unlabeled pixels shown in white).

| Model Name | Spectral | Spatial | BPE | SPE | SSL | Finetuned Acc. (%) | Finetuned MAcc. (%) | Frozen Acc. (%) | Frozen MAcc. (%) |
|---|---|---|---|---|---|---|---|---|---|
| 3D-CNN [25] | ✓ | ✓ | | | | 83 ± 0.3 | 57 ± 1.0 | 81 ± 0.4 | 54 ± 1.1 |
| ViT-RGB [12] | | ✓ | | | | 69 ± 0.5 | 20 ± 1.0 | 68 ± 0.3 | 16 ± 0.3 |
| Transformer [42] | ✓ | | | | | 77 ± 0.2 | 32 ± 0.4 | 72 ± 0.1 | 23 ± 0.2 |
| Spectral T. | ✓ | | ✓ | | | 80 ± 0.1 | 38 ± 0.5 | 71 ± 1.0 | 27 ± 1.3 |
| Masked Transformer | ✓ | | | | ✓ | 76 ± 0.2 | 29 ± 0.4 | 65 ± 0.0 | 14 ± 0.0 |
| Masked Spectral T. | ✓ | | ✓ | | ✓ | 81 ± 0.3 | 40 ± 0.7 | 78 ± 0.1 | 31 ± 0.1 |
| SST | ✓ | ✓ | | | | 79 ± 0.1 | 38 ± 0.4 | 74 ± 0.4 | 32 ± 0.5 |
| SST | ✓ | ✓ | ✓ | | | 81 ± 0.1 | 40 ± 0.8 | 75 ± 1.0 | 27 ± 1.3 |
| SST | ✓ | ✓ | | ✓ | | 78 ± 0.2 | 33 ± 0.6 | 73 ± 0.6 | 24 ± 0.6 |
| SST | ✓ | ✓ | ✓ | ✓ | | 82 ± 0.1 | 44 ± 0.3 | 76 ± 1.0 | 35 ± 1.0 |
| Masked SST | ✓ | ✓ | | | ✓ | 77 ± 0.5 | 31 ± 0.1 | 65 ± 0.0 | 14 ± 0.0 |
| Masked SST | ✓ | ✓ | ✓ | | ✓ | 82 ± 0.1 | 42 ± 0.2 | 77 ± 0.1 | 29 ± 0.2 |
| Masked SST | ✓ | ✓ | | ✓ | ✓ | 78 ± 0.4 | 32 ± 0.3 | 65 ± 0.0 | 14 ± 0.0 |
| Masked SST | ✓ | ✓ | ✓ | ✓ | ✓ | 82 ± 0.2 | 45 ± 0.6 | 79 ± 0.1 | 40 ± 0.1 |
| MSST-Center | ✓ | ✓ | ✓ | ✓ | ✓ | 82 ± 0.2 | 55 ± 0.5 | 82 ± 0.2 | 55 ± 0.2 |

Table 1. Hyperspectral classification performance of baselines and different transformer configurations on the EnMAP-DFC dataset. Columns 'Spectral' and 'Spatial' indicate whether the model utilizes spectral/spatial context. Please refer to Section 3.1.1 for details about blockwise patch embedding (BPE) and spectral positional encoding (SPE). SSL indicates that the model has been pre-trained on EnMAP data with the masked reconstruction task. SST refers to the spatial-spectral transformer model. 3D-CNN and MSST-Center provide predictions for the center pixel of a patch, the other methods for all pixels in the patch simultaneously. Finetuned results indicate performance after training all model parameters on labeled data; frozen indicates that only the classification head is trained on labeled data.
EnMAP  The EnMAP satellite carries an imaging spectrometer that scans the Earth's surface with 224 spectral bands in the visible and near-infrared (420–1000 nm) and shortwave infrared (900–2450 nm) intervals [16]. The sensor has a spatial resolution of 30 × 30 m and a 27-day revisit time. We collect a dataset consisting of 90 cloud-free EnMAP L2 scenes (orthorectified and atmospherically corrected) over Europe in Q4 2022. The EnMAP scenes are divided into non-overlapping 64 × 64 pixel tiles, and invalid atmospheric bands are removed (resulting in a total of 200 spectral bands). Our dataset consists of 19 792 tiles, for a total of more than 81M hyperspectral pixels.

EnMAP-DFC  We create a labeled EnMAP dataset by matching two atmospherically corrected EnMAP L2 scenes over Mexico City with land cover data for the same region that was published for the IEEE GRSS DFC 2020 [50] (see Fig. 4). This dataset consists of 357 64 × 64 pixel tiles with pixel-wise labels for the classes Forest, Shrubland, Grassland, Wetland, Cropland, Urban/Built-up, Barren and Water. For our experiments, the data is randomly split into 286 training/validation tiles and 71 tiles for final testing. We note that some label noise is introduced due to the difference in labeling date (2020) and time of the EnMAP overflight in 2022 (see Fig. 4 top and center rows).

Houston2018  As a second labeled hyperspectral dataset, we use the Houston data from the IEEE GRSS DFC in 2018 [49]. This dataset consists of aerial imagery of the city of Houston (see Fig. 5), obtained with a hyperspectral instrument in the 380–1050 nm spectral range with 48 bands and 1 m spatial resolution. The scene has 1202 × 4172 hyperspectral pixels, 590 149 of which are labeled into 20 fine-grained classes. We use the official train/test split of the dataset in our experiments (504 712 pixels for training and validation, 85 437 for testing).

Metrics  We evaluate model performance for land cover classification on EnMAP-DFC and Houston2018 with accuracy and macro accuracy metrics. The standard accuracy measures the fraction of correctly classified samples over the entire dataset (see Eqn. 3). Macro accuracy provides the average of class-wise accuracies, which can deviate from the accuracy on unbalanced datasets (see Eqn. 4). We report the average and standard deviation of each metric, computed over 5 training runs with different random seeds.

\mathrm{Acc.} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}    (3)

where TP, TN, FP, and FN correspond to true positive, true negative, false positive, and false negative, respectively.
\mathrm{Macro\ Acc.} = \frac{\sum_{c \in \mathcal{C}} \mathrm{Acc.}(\{x_i \,|\, y_i = c\}_i)}{|\mathcal{C}|}    (4)

where x_i is a data sample, y_i the corresponding class label, and C the set of all classes in the dataset.
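Both metrics follow directly from the textual definitions above (overall accuracy as the fraction of correct predictions, macro accuracy as the mean of per-class accuracies); a short PyTorch sketch with our own helper names:

```python
import torch

def accuracy(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Eq. (3): fraction of correctly classified samples."""
    return (pred == target).float().mean().item()

def macro_accuracy(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Eq. (4): mean of the per-class accuracies over all classes in the dataset."""
    per_class = [
        (pred[target == c] == c).float().mean().item() for c in target.unique()
    ]
    return sum(per_class) / len(per_class)

pred = torch.tensor([0, 0, 1, 1, 1, 2])
target = torch.tensor([0, 1, 1, 1, 1, 2])
print(accuracy(pred, target), macro_accuracy(pred, target))
# 0.833..., (1.0 + 0.75 + 1.0) / 3 = 0.916...
```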
| Model | Acc (%) | MAcc (%) |
|---|---|---|
| 3D-CNN [25] | 45 ± 1.8 | 45 ± 1.0 |
| Transformer [42] | 33 ± 1.1 | 26 ± 1.0 |
| ViT-RGB | 19 ± 1.4 | 21 ± 1.4 |
| Spectral T. | 47 ± 3.2 | 43 ± 1.6 |
| SST | 43 ± 2.4 | 40 ± 1.7 |
| Masked SST | 48 ± 2.8 | 42 ± 1.2 |

Table 2. Land cover classification results for the Houston2018 dataset. SST corresponds to the spatial-spectral transformer with BPE. The masked SST is pre-trained and fine-tuned on Houston2018 training data. All models besides the standard transformer are trained for center pixel prediction.

5. Experiments & Results

5.1. Baselines

We use three different baseline models in this work. The ViT-RGB model directly applies the ViT [12] approach with dimensionality d = 96, 4 blocks, and 8 heads in the multi-head self-attention to the RGB bands of hyperspectral data. Accordingly, the data only consists of three spectral bands, which are aggregated during patch embedding. This model yields an accuracy of 69 ± 0.5% on the EnMAP-DFC dataset (see Table 1) and 19 ± 1.4% on Houston2018 (see Table 2). Our spectral model is a sequential transformer of the same size as the ViT-RGB. Hyperspectral pixels are processed individually, and self-attention acts between spectral tokens of width p_c = 10. Unlike the ViT-RGB, this approach can fully leverage the spectral information of the hyperspectral dataset, and improves the accuracy significantly to 77 ± 0.2% on EnMAP-DFC. On the Houston2018 dataset, the spectral model reaches an accuracy of 47 ± 3.2%. As a convolutional baseline, we use an established 3D-CNN model [25] with strong performance on hyperspectral datasets [2]. Using 3D convolutions, this model can incorporate both spectral and spatial information, yielding a performance of 83 ± 0.3% for EnMAP and 45 ± 1.8% on Houston2018. Unlike the presented transformer approaches (see Table 1), the 3D-CNN model only makes predictions for the center pixel of every input patch. This improves performance but necessitates a sliding-window inference strategy to create pixelwise land cover maps, which strongly increases computational cost. We re-train our best performing model using this approach on the EnMAP data and adopt the same strategy on the Houston2018 dataset for comparability.
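The sliding-window strategy mentioned above can be sketched as follows (illustrative code, not the authors' implementation; the 9 × 9 patch size and the chunk size are our assumptions): one patch is extracted per pixel and only its center is classified.

```python
import torch
import torch.nn.functional as F

def sliding_window_map(model, image, patch=9):
    """Dense land cover map from a center-pixel classifier: extract one patch
    per pixel (reflection padding at the border) and classify its center.
    `model` maps (n, c, patch, patch) -> (n, num_classes)."""
    c, h, w = image.shape
    pad = patch // 2
    padded = F.pad(image.unsqueeze(0), (pad, pad, pad, pad), mode="reflect")
    # unfold into one patch per output pixel: (h*w, c, patch, patch)
    patches = F.unfold(padded, kernel_size=patch).squeeze(0).T
    patches = patches.reshape(h * w, c, patch, patch)
    with torch.no_grad():
        logits = torch.cat([model(chunk) for chunk in patches.split(4096)])
    return logits.argmax(dim=1).reshape(h, w)

# usage with a throwaway classifier over 200-band patches
dummy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(200 * 9 * 9, 8))
label_map = sliding_window_map(dummy, torch.randn(200, 32, 32))
print(label_map.shape)  # torch.Size([32, 32])
```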
5.2. Spatial-Spectral Embedding
We extend the spectral transformer to deal with spatial-spectral signals by embedding the data along both spatial and spectral axes (see Fig. 3). This increases the number of embeddings by a factor of (h/p_h) · (w/p_w) compared to the spectral model. The spatial-spectral factorization strategy detailed in Section 3.1.2 allows our spatial-spectral model (SST) to efficiently process the increased number of tokens. The model consists of two stacked transformers (with d = 96, 4 transformer blocks and 8 heads) that sequentially process the tokens with p_{h,w} = 1 and p_c = 10 along the spatial and spectral dimension, respectively. This approach yields an accuracy of 79 ± 0.1% on EnMAP-DFC and serves as the basic backbone for the other presented transformer modifications. Adding the blockwise patch embedding (BPE) scheme allows the model to embed patches conditionally on their position along the spectral axis (see Fig. 3) and improves accuracy to 81 ± 0.2%. We find that spectral positional embeddings (SPE) slightly harm the performance of the SST model (78 ± 0.2%), while the combination of BPE and SPE yields an improvement to 82 ± 0.1% on the EnMAP data. This model reaches an accuracy of 43 ± 2.4% on Houston2018.

5.3. Masked Pre-training

Self-supervised masked modeling increases the data efficiency of transformer models for natural language [11] or image [48] applications. We pre-train our transformer configurations for hyperspectral data on the unlabeled EnMAP dataset with a masked pixel reconstruction strategy. The model is trained for 200 epochs to reconstruct the 70% of patches which were masked in a 4 × 4 blockwise fashion. This pre-training yields small improvements over training from scratch on the larger EnMAP-DFC dataset (e.g., +1% accuracy for the masked SST with BPE). On the Houston2018 data, masked pre-training improves the SST model by +5% to 48 ± 2.8% accuracy. The combination of masked pre-training and BPE results in strong representations, as re-

| Model | 0.1% | 1% | 10% | 100% |
|---|---|---|---|---|
| 3D-CNN [25] | 28 ± 1.8 | 38 ± 1.3 | 42 ± 1.0 | 45 ± 1.8 |
| Transf. [42] | 10 ± 0.1 | 10 ± 0.2 | 17 ± 1.8 | 33 ± 1.2 |
| ViT-RGB | 14 ± 1.7 | 14 ± 1.2 | 17 ± 0.9 | 19 ± 1.4 |
| Spectral T. | 17 ± 1.7 | 34 ± 2.0 | 44 ± 2.9 | 47 ± 3.2 |
| SST | 27 ± 3.2 | 38 ± 1.7 | 43 ± 2.8 | 43 ± 2.4 |
| Masked SST | 35 ± 2.0 | 46 ± 3.1 | 47 ± 1.9 | 48 ± 2.8 |

Table 3. Land cover classification accuracy on Houston2018 for different training set sizes (columns give the fraction of the training set; 100%: 504 712 labeled pixels). When labeled training data is scarce, the pre-trained transformer significantly outperforms the other models.
spectral information beyond the RGB bands, performs significantly worse than the standard transformer trained on the entire spectral sequence (−8% accuracy). Interestingly, both baseline transformer approaches lag significantly behind the convolutional 3D-CNN baseline [25]. We find that
References

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
[2] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre. Deep Learning for Classification of Hyperspectral Data: A Comparative Review. IEEE Geoscience and Remote Sensing Magazine, 7(2):159–173, 2019.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer Normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT Pre-Training of Image Transformers. International Conference on Learning Representations, 2022.
[5] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is Space-Time Attention All You Need for Video Understanding? In ICML, volume 2, page 4, 2021.
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In ICML, pages 1597–1607. PMLR, 2020.
[7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509, 2019.
[8] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking Attention with Performers. In International Conference on Learning Representations, 2021.
[9] Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B Lobell, and Stefano Ermon. SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. In Advances in Neural Information Processing Systems.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-scale Hierarchical Image Database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv, abs/2010.11929, 2021.
[13] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised Representation Learning by Predicting Image Rotations. arXiv preprint arXiv:1803.07728, 2018.
[14] Kamlesh Golhani, Siva K Balasundram, Ganesan Vadamalai, and Biswajeet Pradhan. A Review of Neural Networks in Plant Disease Detection Using Hyperspectral Data. Information Processing in Agriculture, 5(3):354–371, 2018.
[15] Megandhren Govender, Kershani Chetty, and Hartley Bulcock. A Review of Hyperspectral Remote Sensing and its Application in Vegetation and Water Resource Studies. Water SA, 33(2):145–151, 2007.
[16] Luis Guanter, Hermann Kaufmann, Karl Segl, Saskia Foerster, Christian Rogass, Sabine Chabrillat, Theres Kuester, André Hollstein, Godela Rossner, Christian Chlebek, et al. The EnMAP Spaceborne Imaging Spectroscopy Mission for Earth Observation. Remote Sensing, 7(7):8830–8857, 2015.
[17] Michael Gutmann and Aapo Hyvärinen. Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
[18] Ji He, Lina Zhao, Hongwei Yang, Mengmeng Zhang, and Wei Li. HSI-BERT: Hyperspectral Image Classification Using the Bidirectional Encoder Representation from Transformers. IEEE Transactions on Geoscience and Remote Sensing, 58(1):165–178, 2019.
[19] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked Autoencoders are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[20] Xin He, Yushi Chen, and Zhouhan Lin. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sensing, 13(3):498, 2021.
[21] Uta Heiden, Karl Segl, Sigrid Roessner, and Hermann Kaufmann. Determination of Robust Spectral Features for Identification of Urban Surface Materials in Hyperspectral Remote Sensing Data. Remote Sensing of Environment, 111(4):537–552, 2007.
[22] Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2021.
[23] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
[24] Damian Ibanez, Ruben Fernandez-Beltran, Filiberto Pla, and Naoto Yokoya. Masked Auto-Encoding Spectral–Spatial Transformer for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2022.
[25] Ying Li, Haokui Zhang, and Qiang Shen. Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network. Remote Sensing, 9(1):67, 2017.
[26] Konstantinos Makantasis, Konstantinos Karantzalos, Anastasios Doulamis, and Nikolaos Doulamis. Deep Supervised Learning for Hyperspectral Data Classification through Convolutional Neural Networks. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 4959–4962. IEEE, 2015.
[27] Oscar Mañas, Alexandre Lacoste, Xavier Giro-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal Contrast: Unsupervised Pre-training from Uncurated Remote Sensing Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021.
[28] Lichao Mou, Pedram Ghamisi, and Xiao Xiang Zhu. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3639–3655, 2017.
[29] Mehdi Noroozi and Paolo Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[30] ME Paoletti, JM Haut, J Plaza, and A Plaza. Deep Learning Classifiers for Hyperspectral Imaging: A Review. ISPRS Journal of Photogrammetry and Remote Sensing, 158:279–317, 2019.
[31] Yuhao Qing, Wenyi Liu, Liuyan Feng, and Wanjia Gao. Improved Transformer Net for Hyperspectral Image Classification. Remote Sensing, 13(11):2216, 2021.
[32] Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning. arXiv preprint arXiv:2212.14532, 2022.
[33] Sudipan Saha, Patrick Ebel, and Xiao Xiang Zhu. Self-supervised Multisensor Change Detection. IEEE Transactions on Geoscience and Remote Sensing, 2021.
[34] Linus Scheibenreif, Joëlle Hanna, Michael Mommert, and Damian Borth. Self-supervised Vision Transformers for Land-cover Segmentation and Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1422–1431, 2022.
[35] Linus Scheibenreif, Michael Mommert, and Damian Borth. Contrastive Self-supervised Data Fusion for Satellite Imagery. In International Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2022.
[36] Laurent Sifre. Rigid-Motion Scattering for Image Classification. Ph.D. thesis, Ecole Polytechnique, 2014.
[37] Vladan Stojnic and Vladimir Risojevic. Self-supervised Learning of Remote Sensing Scene Representations using Contrastive Multiview Coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021.
[38] Le Sun, Guangrui Zhao, Yuhui Zheng, and Zebin Wu. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2022.
[39] Aidan M Swope, Xander H Rudelis, and Kyle T Story. Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach. arXiv preprint arXiv:2108.05094, 2021.
[40] Chao Tao, Ji Qi, Weipeng Lu, Hao Wang, and Haifeng Li. Remote Sensing Image Scene Classification with Self-supervised Paradigm under Limited Labeled Samples. IEEE Geoscience and Remote Sensing Letters, 2020.
[41] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. In Advances in Neural Information Processing Systems.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. Advances in Neural Information Processing Systems, 30, 2017.
[43] Wenxuan Wang, Leiming Liu, Tianxiang Zhang, Jiachen Shen, Jing Wang, and Jiangyun Li. Hyper-ES2T: Efficient Spatial–Spectral Transformer for the Classification of Hyperspectral Remote Sensing Images. International Journal of Applied Earth Observation and Geoinformation, 113:103005, 2022.
[44] Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham, LiChao Mou, and Xiao Xiang Zhu. Self-supervised Learning in Remote Sensing: A Review. IEEE Geoscience and Remote Sensing Magazine (GRSM), 2022.
[45] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised Feature Learning via Non-parametric Instance Discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
[46] Tete Xiao, Colorado J Reed, Xiaolong Wang, Kurt Keutzer, and Trevor Darrell. Region Similarity Representation Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10539–10548, 2021.
[47] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate Yourself: Exploring Pixel-level Consistency for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16684–16693, 2021.
[48] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A Simple Framework for Masked Image Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022.
[49] Yonghao Xu, Bo Du, Liangpei Zhang, Daniele Cerra, Miguel Pato, Emiliano Carmona, Saurabh Prasad, Naoto Yokoya, Ronny Hänsch, and Bertrand Le Saux. Advanced Multi-sensor Optical Remote Sensing for Urban Land Use and Land Cover Classification: Outcome of the 2018 IEEE GRSS Data Fusion Contest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(6):1709–1724, 2019.
[50] Naoto Yokoya, Pedram Ghamisi, Ronny Hänsch, and Michael Schmitt. 2020 IEEE GRSS Data Fusion Contest: Global Land Cover Mapping with Weak Supervision [Technical Committees]. IEEE Geoscience and Remote Sensing Magazine, 8(1):154–157, 2020.
[51] Naoto Yokoya, Claas Grohnfeldt, and Jocelyn Chanussot. Hyperspectral and Multispectral Data Fusion: A Comparative Review of the Recent Literature. IEEE Geoscience and Remote Sensing Magazine, 5(2):29–56, 2017.
[52] Yuan Yuan and Lei Lin. Self-supervised Pretraining of Transformers for Satellite Image Time Series Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:474–487, 2020.
[53] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful Image Colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[54] Xiangrong Zhang, Yujia Sun, Jingyan Zhang, Peng Wu, and Licheng Jiao. Hyperspectral Unmixing via Deep Convolutional Neural Networks. IEEE Geoscience and Remote Sensing Letters, 15(11):1755–1759, 2018.
[55] Wenzhi Zhao and Shihong Du. Spectral–Spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and Deep Learning Approach. IEEE Transactions on Geoscience and Remote Sensing, 54(8):4544–4554, 2016.
[56] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT Pre-Training with Online Tokenizer. International Conference on Learning Representations, 2022.