Global–Local Multigranularity Transformer for Hyperspectral Image Classification
features and achieve effective spatial feature extraction from HSI. However, 2-D CNNs cannot capture feature information in the spectral dimension and often need to be combined with other methods to fully capture the characteristics of hyperspectral data. Subsequently, researchers introduced 3-D CNNs to extract both spectral and spatial features. For instance, Yang et al. [16] proposed a novel recurrent 3-D CNN approach aimed at learning integrated spectral-spatial features by gradually reducing the patch size. Similar network design principles are followed in [17], which proved the effectiveness of 3-D CNNs for HSI. Besides applying 2-D CNNs and 3-D CNNs independently, combining the two is also a feasible approach. Roy et al. [18] combined a 2-D CNN with a 3-D CNN, which could comprehensively capture the spatial and spectral features of HSI data. Chollet et al. proposed depthwise convolution (DWConv) to significantly reduce the number of parameters of CNN-based methods. Gao et al. [19] created lightweight networks for more complex HSI classification tasks. Xu et al. [20] proposed a dual-branch network integrating 3-D CNNs and a transformer, which introduces DWConv at different scales to effectively extract multiscale features and enhance HSI classification performance. In [21], a dynamic spatial-spectral attention network based on dynamic convolution is designed to achieve more discriminative feature extraction by dynamically calibrating the feature responses in the spatial and spectral domains. However, CNNs still exhibit notable limitations in HSI classification. Specifically, due to the limitation of the fixed convolutional kernel, a CNN has difficulty capturing comprehensive global features, which hinders its ability to resolve the complex patterns inherent in HSI, especially in high-dimensional data scenes.

Over the past few years, the vision transformer (ViT) [22] has been extensively employed in various fields of computer vision, such as object detection, instance segmentation, and image classification [23], [24], [25], and many researchers have also utilized the transformer for HSI classification and achieved good results. Rather than using convolution, the transformer learns features primarily through multihead self-attention (MHSA). Specifically, the convolutional operation is inherently local, while self-attention mechanisms provide the crucial characteristic of a global receptive field, enabling the transformer to capture long-range dependencies. Hong et al. [26] introduced the transformer into HSI classification and focused on the extraction of spectral information. In [27], a transformer-based network is proposed for HSI classification, replacing CNNs to enable global feature extraction and improving computational efficiency through an image-based classification framework. In [28], a novel multiscale feature attention transformer model is designed to extract comprehensive multiscale features while effectively capturing long-range dependencies among features across various scales. Arshad et al. [29] designed a lightweight multihead attention mechanism for the transformer and integrated a hybrid spectral-spatial feature extractor to improve classification and mitigate the computational cost. Roy et al. [30] integrated the attention mechanism with morphological operations to design the input patch for the transformer, which enriched the spatial-spectral information by combining morphological features and further improved the classification results. In [31], a spatial–spectral transformer with a cross-attention network is designed for HSI classification, which employs a dual-branch structure that uses an enhanced cross-attention module and a spatial–spectral weighted sharing mechanism to efficiently extract and integrate features in the spatial and spectral dimensions. However, transformer-based models still have some limitations in HSI classification. Specifically, the transformer mainly relies on modeling global contextual relationships of input tokens, but it falls short of modeling local information.

Considering that CNNs excel at extracting local features while transformers are effective at capturing global features, combining CNN and transformer allows for the simultaneous extraction of both local and global features from HSIs. For instance, Sun et al. [32] combined 3-D and 2-D convolution layers for low-level feature extraction while utilizing a transformer encoder for high-level feature learning, resulting in impressive performance. Qi et al. [33] embedded 3-D convolution within a dual-branch transformer to extract global and local features, effectively preventing the loss of spectral information. Zhao et al. [34] designed a lightweight network model called GSC-ViT, which incorporates groupwise separable convolution into the transformer to extract local–global spatial features. Ouyang et al. [35] introduced a hybrid CNN-transformer model that effectively captures global and local information in HSIs by integrating multigranularity semantic tokens and a spatial-spectral attention module. In [36], a group-aware hierarchical transformer network is developed, utilizing a grouped pixel embedding module that emphasizes local relationships among spectral channels. In addition, a local–global spectral feature (LGSF) extraction and optimization method [37] is developed to extract both global and local spectral features through spectral restructuring and a dilated convolution-based network. However, the LGSF method primarily focuses on global–local features in the spectral domain, neglecting the spatial domain. As for GSC-ViT, it overlooks the learning of global features in the spectral domain. Furthermore, most global–local methods do not adequately extract spatial and spectral feature representations at varying granularities. To address these limitations, we introduce a novel global–local multigranularity transformer (GLMGT) network for HSI classification. The key contributions of our study are outlined as follows:

1) The GLMGT is proposed for HSI classification, leveraging the strengths of CNN and transformer to sequentially capture spectral and spatial features. In comparison to other global–local methods, our method comprehensively explores local spatial features, global spatial features, local spectral features, and global spectral features, resulting in more robust and discriminative feature representations for HSI classification.

2) Our proposed method introduces two new blocks: the multigranularity spatial feature extraction (MGAFE) block and the multigranularity spectral feature extraction (MGEFE) block. These blocks consecutively extract and integrate discriminative features at different granularities in the spectral and spatial domains.
The 2-D convolution operation can be expressed as follows:

v_{xy}^{ij} = f( b_{ij} + \sum_{m} \sum_{h=0}^{H_i-1} \sum_{w=0}^{W_i-1} w_{hw}^{ijm} \cdot v_{(x+h)(y+w)}^{(i-1)m} )    (1)

where v_{xy}^{ij} is the output at position (x, y) in the jth feature map of the ith layer, f indicates the activation function, b_{ij} is the network bias, m is the index over the set of feature maps in the (i - 1)th layer that are connected to the current feature map, w_{hw}^{ijm} denotes the weight of the convolution kernel at spatial position (h, w), and H_i and W_i represent the height and width of the convolution kernel. Given that HSIs have a 3-D structure incorporating both spatial and spectral information, 3-D convolution operations are employed to extract features, enabling the simultaneous capture of characteristics in both the spatial and spectral domains. The 3-D convolution operation can be expressed as follows:

v_{xyz}^{ij} = f( b_{ij} + \sum_{m} \sum_{h=0}^{H_i-1} \sum_{w=0}^{W_i-1} \sum_{d=0}^{D_i-1} w_{hwd}^{ijm} \cdot v_{(x+h)(y+w)(z+d)}^{(i-1)m} )    (2)

where v_{xyz}^{ij} is the output at position (x, y, z) in the jth feature map of the ith layer, D_i is the depth of the corresponding 3-D convolution kernel of the layer, and the other parameters are consistent with those used in the 2-D convolution operation. By convolving the HSI cube with a 3-D convolutional kernel, both spatial and spectral features can be extracted.
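As a concrete illustration, the following minimal PyTorch sketch contrasts the two operations: a 3-D convolution slides over both the spectral and spatial axes of the HSI cube, while a 2-D convolution mixes bands only through its input channels. The patch size, band count, and channel widths are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the 3-D convolution in (2) applied to an HSI patch.
# Patch size (9 x 9), band count (30), and channel widths are assumptions.
hsi_cube = torch.randn(1, 1, 30, 9, 9)  # (batch, channel, bands, height, width)

conv3d = nn.Conv3d(in_channels=1, out_channels=8,
                   kernel_size=(7, 3, 3), padding=(3, 1, 1))
feat3d = conv3d(hsi_cube)  # joint spectral-spatial features: (1, 8, 30, 9, 9)

# A 2-D convolution, as in (1), slides over the spatial plane alone and
# can only mix the 30 bands through its input channels.
conv2d = nn.Conv2d(in_channels=30, out_channels=8, kernel_size=3, padding=1)
feat2d = conv2d(hsi_cube.squeeze(1))  # spatial features only: (1, 8, 9, 9)
```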
B. Transformer

Recently, transformer-based models have been increasingly applied to HSI classification due to their ability to effectively capture long-distance dependencies and global information [39]. As shown in Fig. 1, the core component of the transformer is the multihead self-attention (MHSA) mechanism, which models global contextual information using multiple attention heads in parallel. The input of MHSA is first linearly projected into the query, key, and value matrices Q, K, and V. Enhanced representations are then obtained through scaled dot-product attention, where d_k is the dimension of the key vectors:

Attention(Q, K, V) = softmax( QK^T / \sqrt{d_k} ) V    (3)

and the outputs of all heads are concatenated and projected as

MHSA(Q, K, V) = concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).    (4)

Here, W^O is the parameter matrix used to generate the output, and W_i^Q, W_i^K, W_i^V are learnable weight matrices.
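Equations (3) and (4) correspond directly to the standard multihead attention available in PyTorch; a minimal sketch follows, where the token count and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of MHSA in (3)-(4); token count and embedding dimension
# are illustrative assumptions.
tokens = torch.randn(1, 81, 64)  # (batch, number of tokens, embedding dim)
mhsa = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Self-attention: Q, K, and V are linear projections of the same sequence;
# each head applies scaled dot-product attention, and the concatenated
# heads are mixed by the output projection W^O.
out, _ = mhsa(tokens, tokens, tokens)
print(out.shape)  # torch.Size([1, 81, 64])
```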
III. METHOD

In this section, we detail the proposed GLMGT for HSI classification. First, we clarify the overall framework of GLMGT and its workflow for HSI classification. Second, we delve into the specifics of its main components.

A. Overall Framework

The overall framework of GLMGT is shown in Fig. 2. It consists of four main components: depthwise convolution position embedding (DCPE), MGAFE, MGEFE, and GFFM, where the latter three components are integrated into the transformer encoder. Specifically, given an input HSI X ∈ R^{H×W×C}, the GLMGT first divides it into patches X_patch ∈ R^{P×P×C}, where H × W represents the spatial dimensions, P is the patch size, and C is the number of channels. First, a 3 × 3 convolution layer is used for dimensionality reduction, extracting shallow features X̂_patch ∈ R^{P×P×D}, where D is the channel dimension after reduction. Second, the DCPE module embeds position information into the input patches, which are then fed into the transformer encoder. Next, the MGAFE block enhances local spatial features across multiple granularities through a multiscale local spatial feature enhancement (MSLAFE) module and captures global spatial information with a global spatial attention (GAA) module.
Fig. 3. Multigranularity Spatial Feature Extraction (MGAFE) block, consisting of the multiscale local spatial feature enhancement (MSLAFE) module and the
GAA module. MGAFE focuses on extracting spatial features from HSI across different granularities, integrating both multiscale local spatial information and
global spatial information.
Subsequently, the MGEFE block focuses on global–local spectral feature extraction, enhancing local spectral features at different granularities with a multiscale local spectral feature enhancement (MSLEFE) module and integrating global spectral dependencies with a global spectral attention (GEA) module. Then, the GFFM module extracts local features and manages finer feature propagation with its gating mechanism. Finally, the extracted features are converted into a 1-D vector and fed into a linear classifier to obtain the final classification results.
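To make this data flow concrete, the following runnable sketch mirrors the pipeline described above. The three encoder blocks are simple residual placeholders for MGAFE, MGEFE, and GFFM (each sketched in the following subsections), and all widths, the patch size, and the class count are assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlockPlaceholder(nn.Module):
    """Stand-in for MGAFE / MGEFE / GFFM; keeps the (B, D, P, P) shape."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())

    def forward(self, x):
        return x + self.body(x)  # residual, as in a transformer block

class GLMGTSketch(nn.Module):
    def __init__(self, bands=200, dim=64, patch=9, classes=16):
        super().__init__()
        self.reduce = nn.Conv2d(bands, dim, 3, padding=1)  # 3x3 dim. reduction
        self.dcpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # see (5)
        self.mgafe = EncoderBlockPlaceholder(dim)  # multigranularity spatial
        self.mgefe = EncoderBlockPlaceholder(dim)  # multigranularity spectral
        self.gffm = EncoderBlockPlaceholder(dim)   # gated feed-forward
        self.head = nn.Linear(dim * patch * patch, classes)

    def forward(self, x):              # x: (B, C, P, P) HSI patch
        x = self.reduce(x)             # shallow features, (B, D, P, P)
        x = x + self.dcpe(x)           # position embedding, eq. (5)
        x = self.gffm(self.mgefe(self.mgafe(x)))  # encoder blocks, in series
        return self.head(x.flatten(1))            # 1-D vector -> classifier

logits = GLMGTSketch()(torch.randn(2, 200, 9, 9))  # shape: (2, 16)
```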
B. Depthwise Convolution Position Embedding

The absolute position embedding used in previous transformers [22], [40] injects position information by adding a unique position encoding to the input, but it fails to make the model translation-equivariant [41]. To address this limitation, DCPE is designed to dynamically integrate position information into HSI patches based on the local neighborhood information of the input patch. As shown in Fig. 2, DCPE consists of a 3 × 3 depthwise convolution (DWConv) layer and a residual connection, which can be expressed as follows:

X_a = f_{DWConv2D,3×3}(X̂_patch) + X̂_patch    (5)

where X̂_patch is the input patch, X_a is the output feature of DCPE, and f_{DWConv2D,3×3}(·) denotes DWConv with a kernel size of 3 × 3.
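A minimal sketch of (5) follows, assuming the patch is kept in channels-first image form; the channel width and patch size are illustrative.

```python
import torch
import torch.nn as nn

class DCPE(nn.Module):
    """Depthwise convolution position embedding, eq. (5): a 3x3 DWConv
    (groups equal to the channel count) plus a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x_patch):  # (B, D, P, P)
        # Position information comes from each pixel's local neighborhood,
        # so the encoding shifts with the input (translation equivariance).
        return self.dwconv(x_patch) + x_patch

x = torch.randn(2, 64, 9, 9)   # assumed width and patch size
print(DCPE(64)(x).shape)       # torch.Size([2, 64, 9, 9])
```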
serves as the input for the MSLAFE module. The output feature
where X̂patch is the input patch and Xa is the output features of of three 2-D convolution layers can be formulated as
DCPE, fDW Conv2D,3×3 (.) represents DWConv with kernel size
of 3 × 3. X1 = fConv2D,1×1 (X̂a )
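A minimal sketch of the three parallel branches follows. The kernel sizes match the combination reported in Table XXII; how the paper fuses the three outputs is not recoverable from the surviving pages, so the summation below is an assumption.

```python
import torch
import torch.nn as nn

class MSLAFESketch(nn.Module):
    """Three parallel 2-D convolutions over the normalized patch; the
    summation fusion is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 1)             # finest granularity
        self.conv3 = nn.Conv2d(dim, dim, 3, padding=1)  # medium granularity
        self.conv5 = nn.Conv2d(dim, dim, 5, padding=2)  # coarsest granularity

    def forward(self, x_hat):  # x_hat: layer-normalized patch, (B, D, P, P)
        x1, x2, x3 = self.conv1(x_hat), self.conv3(x_hat), self.conv5(x_hat)
        return x1 + x2 + x3    # assumed fusion of the multiscale features

feats = MSLAFESketch(64)(torch.randn(2, 64, 9, 9))  # (2, 64, 9, 9)
```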
Fig. 4. MGEFE block, consisting of the MSLEFE module and the GEA module. MGEFE is designed to extract global-local multigranularity spectral features of
HSI.
TABLE I
DETAILED CONFIGURATION OF THE GLMGT NETWORK

TABLE II
SAMPLE DISTRIBUTION OF THE IP DATASET

TABLE III
SAMPLE DISTRIBUTION OF THE UP DATASET

TABLE IV
SAMPLE DISTRIBUTION OF THE HT DATASET

TABLE V
SAMPLE DISTRIBUTION OF THE KSC DATASET

TABLE VI
SAMPLE DISTRIBUTION OF THE WHLK DATASET

TABLE VII
SAMPLE DISTRIBUTION OF THE ZYHHK DATASET

TABLE VIII
SAMPLE DISTRIBUTION OF THE GFYC DATASET

Fig. 7. UP dataset. (a) False-color image. (b) Ground truth image.
Fig. 9. KSC dataset. (a) False-color image. (b) Ground truth image.
Fig. 10. WHLK dataset. (a) False-color image. (b) Ground truth image.
Fig. 11. ZYHHK dataset. (a) False-color image. (b) Ground truth image.
Fig. 12. GFYC dataset. (a) False-color image. (b) Ground truth image.
dataset, with 0.3% of training samples and 99.7% of test samples in each class.

6) ZY1-02D Huanghekou (ZYHHK) Dataset: The ZYHHK dataset was collected by the AHSI sensor equipped on the China ZY1-02D satellite at the Yellow River Estuary in Shandong province, China, in 2021 [46], [47]. The ZYHHK dataset encompasses eight different land cover classes and 1050 × 1219 pixels, with a spatial resolution of 30 m per pixel. It contains 108 spectral bands ranging from 0.4 to 2.5 μm. The false-color image and ground truth image of the dataset are shown in Fig. 11. The total number of samples in the ZYHHK dataset is 14 087. Table VII shows the specific class information and sample distribution of the ZYHHK dataset, with 3% of training samples and 97% of test samples in each class.

7) GF-5 Yancheng (GFYC) Dataset: The GF-5 Yancheng dataset was collected by the AHSI sensor on board the Chinese GF-5 satellite in Yancheng city, Jiangsu province, China [46], [47]. The GFYC dataset contains seven land cover categories.
B. Experimental Setting

All of the experiments are performed on a computer equipped with an RTX 3060 GPU. Our proposed model is based on the PyTorch framework and built with Python 3.9.7. To assess the performance of our method across the seven datasets, we used three main evaluation metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient. For parameter optimization, we train models with the Adam optimization algorithm (learning rate = 0.001, weight decay = 0.0001). The training process consists of 100 epochs with a consistent batch size of 100. In addition, to ensure the reliability of the experiments, five runs were conducted for each model, and the final mean and standard deviation were used as performance metrics.

The effectiveness of the proposed method is validated in this article through comparisons with several state-of-the-art models. We select a range of representative models as comparative methods, including MSRN [19], DSSAN [21], LGG-CNN [48], SpectralFormer [26], morphFormer [30], MSFAT [28], SSFTT [32], CTMixer [49], HybridFormer, and GSC-ViT [34].
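The three metrics follow their standard definitions from the confusion matrix; a short sketch (not code from the paper) is given below.

```python
import numpy as np

def oa_aa_kappa(conf):
    """Standard OA, AA, and kappa from a confusion matrix where
    conf[i, j] counts samples of true class i predicted as class j."""
    conf = conf.astype(float)
    total = conf.sum()
    oa = np.trace(conf) / total                    # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)   # per-class recall
    aa = per_class.mean()                          # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                   # chance-corrected agreement
    return oa, aa, kappa

conf = np.array([[48, 2], [5, 45]])   # toy two-class example
print(oa_aa_kappa(conf))              # approx. (0.93, 0.93, 0.86)
```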
TABLE IX
OA (%) OF GLMGT WITH DIFFERENT NUMBERS OF TRANSFORMER ENCODERS ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE X
OA (%) OF GLMGT WITH DIFFERENT PATCH SIZES ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE XI
OA (%) OF GLMGT WITH DIFFERENT EMBEDDING DIMENSIONS ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE XII
OA (%) OF GLMGT WITH DIFFERENT EXPANSION FACTORS ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE XIII
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE INDIAN PINES DATASET

TABLE XIV
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE UNIVERSITY OF PAVIA DATASET

TABLE XV
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE HOUSTON 2013 DATASET

TABLE XVI
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE KSC DATASET

TABLE XVII
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE WHLK DATASET

TABLE XVIII
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE ZYHHK DATASET

TABLE XIX
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE GF-5 YANCHENG DATASET
on all three metrics: OA, AA, and kappa. Specifically, on the IP dataset, GLMGT outperforms in 11 out of 16 categories. Similarly, on the KSC dataset, GLMGT even outperforms the other methods in classification accuracy in every category. Also, on the HT dataset, GLMGT outperforms the other methods, with OA values 1.24% higher than SSFTT and 3.50% higher than DSSAN. Furthermore, the accuracy of GLMGT exceeds 90% in almost every category on the seven datasets, which suggests that the GLMGT network not only significantly improves the accuracy in small-sample categories but also maintains a stable level of performance in the other categories.

In addition, it can be observed that the classification performance of GLMGT is also higher than that of the five global–local models (i.e., LGG-CNN, SSFTT, CTMixer, HybridFormer, and GSC-ViT) on the seven datasets. For example, the OA value of GLMGT is 5.24% higher than that of LGG-CNN on the IP dataset, 3.99% higher than that of CTMixer on the HT dataset, 6.46% higher than that of HybridFormer on the ZYHHK dataset, and 5.54% higher than that of GSC-ViT on the KSC dataset. This also shows the superiority of GLMGT in extracting global and local features of HSI.

2) Visual Evaluation: Figs. 13–19 show the classification maps of each method on the seven datasets. It can be seen that the proposed method has the fewest noisy pixels and is closest to the ground truth. Taking the classification maps on the WHLK dataset in Fig. 17 as an example, it is clear that the classification maps of LGG-CNN and SpectralFormer have more noisy pixels, which means that there are more misclassified regions. Furthermore, in the classification maps of all seven datasets, typical regions are highlighted with a gray border and enlarged to clearly display the classification results for those specific areas. For the IP dataset, the classification map generated by the proposed GLMGT closely matches the ground truth map, particularly for the "Alfalfa" category, as shown in the zoomed-in images in Fig. 13. This observation aligns with the quantitative assessment results presented in Table XIII. Similarly, the classification map of our GLMGT for the UP dataset shows fewer misclassified pixels, especially in the upper left corner of the zoomed-in region, as shown in Fig. 14. For the HT dataset, while most methods misclassify the similar crops "Healthy Grass" and "Stressed Grass" (see Fig. 15), GLMGT effectively mitigates this issue by learning multigranularity spatial-spectral features from HSI. For the KSC dataset, the classification map of our GLMGT shows fewer noise points than the other methods, particularly for the "Slash Pine" category, as illustrated in the upper right corner of the zoomed-in region in Fig. 16.
Fig. 13. Classification maps of different models on the IP dataset with 10% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

Fig. 14. Classification maps of different models on the UP dataset with 1% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.
In conclusion, the classification maps generated by the proposed GLMGT model across the seven datasets show fewer misclassified areas compared to those produced by the other models, further demonstrating the superiority of GLMGT.

3) Impact of Training Ratio: To further evaluate the efficacy of GLMGT, we analyzed the performance of eleven different methods across a range of training sample ratios: 0.1%, 0.5%, 1.0%, 1.5%, and 2.0% for the UP dataset; 1%, 5%, 10%, 15%, and 20% for the IP, HT, and KSC datasets; 0.1%, 0.2%, 0.3%, 0.4%, and 0.5% for the WHLK dataset; and 1%, 2%, 3%, 4%, and 5% for the ZYHHK and GFYC datasets. As illustrated in Fig. 20, GLMGT consistently outperformed the other methods at every training sample size. For example, when the training sample ratio is small, GLMGT outperforms other comparable methods.
Fig. 15. Classification maps of different models on the HT dataset with 10% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

Fig. 16. Classification maps of different models on the KSC dataset with 10% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.
For example, on the UP dataset, GLMGT can achieve 88% OA when the training sample ratio is set to 0.1%, whereas for other transformer-based methods, such as SpectralFormer, the OA is only about 71%. Moreover, as the training sample ratio increased, the performance of the GLMGT network exhibited a steady upward trend, indicating that it can achieve good results under different data scales.
Fig. 17. Classification maps of different models on the WHLK dataset with 0.3% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.
TABLE XX
RESULTS OF ABLATION STUDIES OF DIFFERENT COMPONENTS ON THE IP, UP, HT, AND KSC DATASETS (OA %)
A. Ablation Analysis
1) Ablation Analysis of Main Components: We evaluated the effect of each component on the classification results by progressively removing the main components. Table XX shows the results of the component ablation experiments performed on four datasets. Across these four datasets, GLMGT performs best when all main components are included, while deleting any component adversely affects model performance. Specifically, the lack of DCPE caused a decrease in OA values, and extracting only spatial features (without MGEFE) or only spectral features (without MGAFE) also resulted in poor classification results. However, adding MGEFE after MGAFE significantly improves the classification performance, with an increase in OA of up to 1.64%, highlighting the critical importance of combining spatial features and subtle spectral information for category differentiation. In addition, we observed that removing GFFM also diminishes classification performance, for example, a decrease of 5.05% on the HT dataset. Thus, these results illustrate the effectiveness of each component in the GLMGT network.

2) Ablation Analysis of Granularity Configurations in the GLMGT: In this article, we introduce the MSLAFE module and the GAA module to extract fine- and coarse-grained spatial features, respectively. In addition, we employ the MSLEFE module and the GEA module for the extraction of fine- and coarse-grained spectral features. By thoroughly exploring both fine- and coarse-grained representations in HSIs, we can extract more robust features. As shown in Table XXI, we assessed the impact of these four modules on classification outcomes by systematically eliminating each module in a stepwise manner.
Fig. 18. Classification maps of different models on the ZYHHK dataset with 3% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.
Our analysis across four datasets (i.e., IP, UP, HT, and KSC) revealed that GLMGT achieves optimal performance when all modules are integrated. Conversely, removing any single module leads to a decline in classification accuracy, underscoring the importance of each component within the overall framework. This demonstrates that feature representations at different granularities contribute to improving classification results.
3) Ablation Analysis of Different Granularity Configurations in the MSLAFE and MSLEFE Modules: In this article, the MSLAFE module utilizes parallel 2-D convolutions with varying kernel sizes to extract local spatial features at different granularities. Likewise, the MSLEFE module employs parallel 3-D convolutions with diverse kernels to capture a range of local spectral features. As illustrated in Table XXII, the best accuracy for the MSLAFE module was achieved with the combination of 1 × 1, 3 × 3, and 5 × 5 convolution kernels. As illustrated in Table XXIII, for the MSLEFE module, the combination of 1 × 1 × 3, 1 × 1 × 5, and 1 × 1 × 7 kernels obtained better accuracy across all datasets. Using either too few or too many granularities results in suboptimal performance. A limited number of granularities may cause the model to overlook important local features necessary for accurate predictions. Conversely, having too many granularities increases model complexity and may introduce noise.
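For illustration, the spectral-only kernels can be written as 3-D convolutions whose spatial extent is 1 × 1; the tensor layout and the summation fusion below are assumptions.

```python
import torch
import torch.nn as nn

class MSLEFESketch(nn.Module):
    """Parallel 3-D convolutions sliding along the spectral axis only
    (1 x 1 x 3, 1 x 1 x 5, and 1 x 1 x 7 in the paper's notation)."""
    def __init__(self):
        super().__init__()
        self.conv3 = nn.Conv3d(1, 1, (3, 1, 1), padding=(1, 0, 0))
        self.conv5 = nn.Conv3d(1, 1, (5, 1, 1), padding=(2, 0, 0))
        self.conv7 = nn.Conv3d(1, 1, (7, 1, 1), padding=(3, 0, 0))

    def forward(self, x):  # x: (B, 1, bands, P, P) spectral cube
        return self.conv3(x) + self.conv5(x) + self.conv7(x)  # assumed fusion

out = MSLEFESketch()(torch.randn(2, 1, 64, 9, 9))  # (2, 1, 64, 9, 9)
```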
4) Ablation Analysis of DCPE: To evaluate the effect of DCPE on the classification performance of HSI, we conducted ablation experiments comparing DCPE with absolute positional embedding (APE) [44]. As shown in Table XXIV, the OA value decreases by up to 2.40% on the four datasets when replacing DCPE with APE.
Fig. 19. Classification maps of different models on the GFYC dataset with 3% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.
Fig. 20. Classification performance of eleven models across varied training sample ratios on the seven datasets. (a) IP. (b) UP. (c) HT. (d) KSC. (e) WHLK.
(f) ZYHHK. (g) GFYC.
In addition, there is also a slight decrease in the OA value when both DCPE and APE are used. These results indicate the effectiveness of DCPE in embedding position information.

5) Ablation Analysis of Attention Module: To evaluate the efficacy of the attention module in GLMGT, we used the MHSA module from the original transformer to replace MGAFE or MGEFE. Table XXV shows that the use of MGAFE in combination with MGEFE demonstrated the highest OA on the four datasets. Specifically, the utilization of MHSA only allows for the capture of global information. However, the combination of MGAFE and MGEFE in GLMGT can not only extract multiscale local features but also capture global spatial and spectral dependencies, thus significantly enhancing the classification results.

6) Ablation Analysis of GFFM: To investigate the effectiveness of GFFM, we replace GFFM with an MLP in the GLMGT network.
Compared with the MLP, the gating mechanism in GFFM can control the delivery of useful information, and DWConv can complement the local details in the feed-forward module. Table XXVI shows the classification performance of the GLMGT model with different feed-forward modules. It is observed that employing GFFM results in a notable enhancement of the OA value, which indicates that the performance improvement is due to the design of the gating mechanism and the enhancement of local spatial information.

TABLE XXVI
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT FEED-FORWARD MODULE CONFIGURATIONS
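In the spirit of this description, a gated feed-forward module can be sketched as below; the exact branch layout of GFFM is not shown on the surviving pages, so the arrangement (a value branch modulated by a GELU-activated gate, both refined by DWConv) is an assumption, as is the expansion factor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GFFMSketch(nn.Module):
    """Gated feed-forward module with depthwise convolution (an assumed
    layout, not the paper's exact design)."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.proj_in = nn.Conv2d(dim, hidden * 2, 1)  # value and gate branches
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1,
                                groups=hidden * 2)    # local spatial details
        self.proj_out = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):                             # x: (B, D, P, P)
        v, g = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        # The gate decides how much of each feature passes through,
        # controlling the delivery of useful information.
        return self.proj_out(v * F.gelu(g))

out = GFFMSketch(64)(torch.randn(2, 64, 9, 9))  # (2, 64, 9, 9)
```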
7) Ablation Analysis of the Connection Configuration of MGAFE and MGEFE: In the GLMGT network, the MGAFE and MGEFE blocks are connected in a series configuration. To assess the impact of the series and parallel configurations of these two blocks on the classification accuracy, we performed additional ablation experiments. Specifically, in the parallel configuration, the MGAFE and MGEFE blocks are computed in parallel, and their outputs are directly added together. We also investigated how different cascading orders of the MGAFE and MGEFE blocks influence classification performance. As shown in Table XXVII, the series configuration where the MGAFE block precedes the MGEFE block consistently yielded the highest OA across the four datasets. In contrast, the alternative series configuration (spectral first, then spatial) and the parallel configuration were less effective. Based on these results, we finalized the network architecture with the MGAFE block preceding the MGEFE block in sequence.

TABLE XXVII
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT CONNECTION CONFIGURATIONS OF MGAFE AND MGEFE
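The three configurations compared in Table XXVII amount to different compositions of the two blocks; a toy illustration follows, with plain convolutions standing in for MGAFE and MGEFE.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two blocks; their internals are irrelevant here.
mgafe = nn.Conv2d(64, 64, 3, padding=1)  # stand-in for the spatial block
mgefe = nn.Conv2d(64, 64, 1)             # stand-in for the spectral block
x = torch.randn(2, 64, 9, 9)

out_series_sa = mgefe(mgafe(x))      # spatial first, then spectral (chosen)
out_series_as = mgafe(mgefe(x))      # spectral first, then spatial
out_parallel = mgafe(x) + mgefe(x)   # parallel, outputs directly added
```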
MGAFE block precedes the MGEFE block consistently yielded [16] X. Yang, Y. Ye, X. Li, R. Y. Lau, X. Zhang, and X. Huang, “Hyperspectral
the highest OA across four datasets. In contrast, the alternative image classification with deep learning models,” IEEE Trans. Geosci.
series configuration (spectral first, then spatial) and the parallel Remote Sens., vol. 56, no. 9, pp. 5408–5423, Sep. 2018.
[17] W. Qi, X. Zhang, N. Wang, M. Zhang, and Y. Cen, “A spectral-spatial
configuration were less effective. Based on these results, we cascaded 3D convolutional neural network with a convolutional long short-
have finalized the network architecture with the MGAFE block term memory network for hyperspectral image classification,” Remote
preceding the MGEFE block in sequence. Sens., vol. 11, no. 20, Oct. 2019, Art. no. 2363.
[18] S. K. Roy, G. Krishna, S. R. Dubey, and B. B. Chaudhuri, "HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 17, no. 2, pp. 277–281, Feb. 2020.
[19] H. Gao, Y. Yang, C. Li, L. Gao, and B. Zhang, "Multiscale residual network with mixed depthwise convolution for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 4, pp. 3396–3408, Apr. 2021.
[20] R. Xu, X.-M. Dong, W. Li, J. Peng, W. Sun, and Y. Xu, "DBCTNet: Double branch convolution-transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5509915.
[21] Z. Meng, Q. Yan, F. Zhao, and M. Liang, "Hyperspectral image classification with dynamic spatial-spectral attention network," in Proc. 13th Workshop Hyperspectral Imag. Signal Process.: Evol. Remote Sens., 2023, pp. 1–4.
[22] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in Proc. 9th Int. Conf. Learn. Representations, Austria, May 3–7, 2021.
[23] Z. Sun, S. Cao, Y. Yang, and K. M. Kitani, "Rethinking transformer-based set prediction for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 3611–3620.
[24] Y. Wang et al., "End-to-end video instance segmentation with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 8741–8750.
[25] C.-F. R. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-attention multi-scale vision transformer for image classification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 357–366.
[26] D. Hong et al., "SpectralFormer: Rethinking hyperspectral image classification with transformers," IEEE Trans. Geosci. Remote Sens., vol. 60, 2021, Art. no. 5518615.
[27] H. Yu, Z. Xu, K. Zheng, D. Hong, H. Yang, and M. Song, "MSTNet: A multilevel spectral–spatial transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5532513.
[28] Z. Meng, Q. Yan, F. Zhao, and M. Liang, "Multi-scale feature attention and transformer for hyperspectral image classification," in Proc. 13th Workshop Hyperspectral Imag. Signal Process.: Evol. Remote Sens., 2023, pp. 1–5.
[29] T. Arshad and J. Zhang, "A light-weighted spectral-spatial transformer model for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 12008–12019, 2024.
[30] S. K. Roy, A. Deria, C. Shah, J. M. Haut, Q. Du, and A. Plaza, "Spectral–spatial morphological attention transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5503615.
[31] Y. Peng, Y. Zhang, B. Tu, Q. Li, and W. Li, "Spatial–spectral transformer with cross-attention for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5537415.
[32] L. Sun, G. Zhao, Y. Zheng, and Z. Wu, "Spectral–spatial feature tokenization transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5522214.
[33] W. Qi, C. Huang, Y. Wang, X. Zhang, W. Sun, and L. Zhang, "Global-local three-dimensional convolutional transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5510820.
[34] Z. Zhao, X. Xu, S. Li, and A. Plaza, "Hyperspectral image classification using groupwise separable convolutional vision transformer network," IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5511817.
[35] E. Ouyang, B. Li, W. Hu, G. Zhang, L. Zhao, and J. Wu, "When multigranularity meets spatial–spectral attention: A hybrid transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 4401118.
[36] S. Mei, C. Song, M. Ma, and F. Xu, "Hyperspectral image classification using group-aware hierarchical transformer," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5539014.
[37] Z. Xu, C. Su, S. Wang, and X. Zhang, "Local and global spectral features for hyperspectral image classification," Remote Sens., vol. 15, no. 7, 2023, Art. no. 1803.
[38] B. Zhang, L. Zhao, and X. Zhang, "Three-dimensional convolutional neural network model for tree species classification using airborne hyperspectral images," Remote Sens. Environ., vol. 247, 2020, Art. no. 111938.
[39] Z. Meng, T. Zhang, F. Zhao, G. Chen, and M. Liang, "Multiscale super token transformer for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 21, 2024, Art. no. 5508105.
[40] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357.
[41] X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, "Conditional positional encodings for vision transformers," in Proc. 11th Int. Conf. Learn. Representations, Kigali, Rwanda, May 1–5, 2023.
[42] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[43] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," 2016, arXiv:1606.08415.
[44] A. Vaswani et al., "Attention is all you need," Adv. Neural Inf. Process. Syst., vol. 30, pp. 1–11, 2017.
[45] Y. Zhong, X. Hu, C. Luo, X. Wang, J. Zhao, and L. Zhang, "WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with CRF," Remote Sens. Environ., vol. 250, 2020, Art. no. 112012.
[46] W. Sun and J. Peng, "Cross-scene hyperspectral remote sensing wetland image data," Jul. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.8105220
[47] Y. Huang et al., "Cross-scene wetland mapping on hyperspectral remote sensing images using adversarial domain adaptation network," ISPRS J. Photogrammetry Remote Sens., vol. 203, pp. 37–54, 2023.
[48] W. Fu, K. Ding, X. Kang, and D. Wang, "Local-global gated convolutional neural network for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 21, 2023, Art. no. 5500205.
[49] J. Zhang, Z. Meng, F. Zhao, H. Liu, and Z. Chang, "Convolution transformer mixer for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 6014205.

Zhe Meng (Member, IEEE) received the B.S. degree in electronic information engineering and the Ph.D. degree in circuits and systems from Xidian University, Xi'an, China, in 2014 and 2020, respectively.
He is currently a Lecturer with the School of Communications and Information Engineering & School of Artificial Intelligence, Xi'an University of Posts and Telecommunications. His research interests include deep learning and hyperspectral image classification.

Qian Yan received the B.S. degree from Xi'an University of Posts and Telecommunications, Xi'an, China, in 2022, where she is currently working toward the master's degree in information and communication engineering.
Her research interests include deep learning and hyperspectral image classification.

Feng Zhao (Member, IEEE) received the B.S. degree in computer science and technology from Heilongjiang University, Heilongjiang, China, in 2004, the M.S. degree in signal and information processing from the Xi'an University of Posts and Telecommunications, Xi'an, China, in 2007, and the Ph.D. degree in pattern recognition and intelligent system from Xidian University, Xi'an, in 2010.
She has been a Professor with the School of Communications and Information Engineering and School of Artificial Intelligence, Xi'an University of Posts and Telecommunications, since 2015. She has authored or coauthored more than 30 articles and two books. Her research interests include fuzzy information processing, pattern recognition, and image processing.
Dr. Zhao was a recipient of the New-Star of Young Science and Technology Award supported by Shaanxi, in 2014, and the IET International Conference on Ubi-media Computing Best Paper Award, in 2012.
Gaige Chen was born in 1985. He received the Ph.D. degree in mechanical engineering from Xi'an Jiaotong University, Xi'an, China.
He is currently an Associate Professor with the School of Communications and Information Engineering and School of Artificial Intelligence, Xi'an University of Posts and Telecommunications. His research interests include multisensor data fusion and complex equipment prognostics.

Miaomiao Liang received the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an, China, in 2018.
She is currently an Associate Professor with the School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, China. Her research interests include computer vision, machine learning, and hyperspectral image processing.