

Global–Local Multigranularity Transformer for Hyperspectral Image Classification

Zhe Meng, Member, IEEE, Qian Yan, Feng Zhao, Member, IEEE, Gaige Chen, Wenqiang Hua, Member, IEEE, and Miaomiao Liang

Abstract—Hyperspectral image (HSI) classification is a challenging task in remote sensing applications, aiming to determine the category of each pixel by utilizing the rich spectral and spatial information in HSI. Convolutional neural networks (CNNs) have been effective in processing HSI data by extracting local features, but they are deficient in capturing global contextual information. Recently, transformers have become proficient at attending to global information due to their self-attention mechanisms, yet they may fall short in capturing multiscale features of HSI. To address these limitations, a global–local multigranularity transformer (GLMGT) network is proposed for HSI classification. The GLMGT combines CNN with the transformer to comprehensively capture multigranularity spectral and spatial features across global and local scales. Specifically, we introduce a multigranularity spatial feature extraction block to extensively extract spatial information at different granularities, including multiscale local spatial features and global spatial features. In addition, we introduce a multigranularity spectral feature extraction block to fully leverage spectral information across different granularities. The validity of the proposed method is demonstrated through experimental validation using seven publicly available datasets, which include two Chinese satellite hyperspectral datasets (ZY1-02D Huanghekou and GF-5 Yancheng) and one UAV-based hyperspectral dataset.

Index Terms—CNN, multigranularity, transformer, hyperspectral image (HSI) classification.

Received 17 August 2024; revised 6 October 2024; accepted 31 October 2024. Date of publication 6 November 2024; date of current version 22 November 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62266020, Grant 62271390, and Grant 62071379; in part by the State Key Laboratory of Geo-Information Engineering and Key Laboratory of Surveying and Mapping Science and Geospatial Information Technology of MNR, CASM, under Grant 2024-02-02; in part by the Jiangxi Provincial Natural Science Foundation under Grant 20224BAB212008; in part by the Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control of China under Grant 2024SSY03161; and in part by the Youth Innovation Team of Shaanxi Universities. (Corresponding author: Zhe Meng.)

Zhe Meng, Qian Yan, Feng Zhao, and Gaige Chen are with the School of Communications and Information Engineering and School of Artificial Intelligence, Xi'an University of Posts and Telecommunications, Xi'an 710121, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Wenqiang Hua is with the School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China (e-mail: [email protected]).

Miaomiao Liang is with the Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control, School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China (e-mail: [email protected]).

The source code will be available at https://github.com/zhe-meng/GLMGT.

Digital Object Identifier 10.1109/JSTARS.2024.3491294

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

HYPERSPECTRAL imaging is an advanced imaging technique capable of capturing reflectance information over hundreds or even thousands of consecutive narrow bands. The abundant spectral information within each pixel of a hyperspectral image (HSI) can reveal differences in the internal physical structure and chemical composition of the observed object. Therefore, HSI is extensively employed in environmental monitoring, military reconnaissance, geological exploration, and other fields [1], [2]. HSI classification aims to identify the category of each pixel in HSI, serving as a foundational step in realizing various hyperspectral remote sensing applications.

Over the past decades, scholars have made great efforts to improve HSI classification performance. Initially, many traditional machine learning methods were used for HSI classification, such as k-nearest neighbors (k-NN), random forests (RF), and support vector machines (SVM) [3], [4], [5]. However, these methods are limited by their reliance on handcrafted features and their restricted learning abilities, which hinder their capacity to capture the intricate spectral information needed to achieve better classification results.

In recent years, deep learning (DL) has achieved significant success in the field of computer vision, and more and more DL-based methods for HSI classification have been proposed [6]. In [7], a deep belief network is employed to simultaneously extract features and conduct classification. Similarly, the stacked autoencoder [8] is another successful DL-based network chosen for generating classification maps in HSI applications. However, the two aforementioned networks operate on the spectral vectors of individual pixels as input, failing to utilize spatial information. Recently, convolutional neural networks (CNNs) have made significant progress in computer vision. CNNs can efficiently capture context information in HSI data by exploiting local receptive fields and shared parameters. Many researchers have successfully applied CNNs of different dimensions to HSI classification and achieved significant results. Specifically, 1-D CNNs [9], [10] are used for spectral feature extraction, 2-D CNNs [11], [12] for spatial feature extraction, and 3-D CNNs [13], [14] for joint spectral-spatial analysis via 3-D convolution. Hu et al. [9] first applied a multilayer 1-D CNN to capture spectral features and verified the effectiveness of CNNs in HSI classification. However, spatial features cannot be captured using only 1-D convolution, so 2-D CNNs were introduced for HSI. In [15], 2-D CNN is utilized to find spatial-related


features and achieve effective spatial feature extraction from HSI. However, 2-D CNN cannot capture feature information in the spectral dimension and often needs to be combined with other methods to fully capture the characteristics of hyperspectral data. Subsequently, researchers introduced 3-D CNNs to extract both spectral and spatial features. For instance, Yang et al. [16] proposed a novel recurrent 3-D CNN approach aimed at learning integrated spectral-spatial features by gradually reducing the patch size. Similar network design principles are followed in [17], which proved the effectiveness of 3-D CNNs for HSI. Besides applying 2-D CNN and 3-D CNN independently, combining both of them is also a feasible approach. Roy et al. [18] combined 2-D CNN with 3-D CNN, which could comprehensively capture the spatial and spectral features of HSI data. Chollet proposed depthwise convolution (DWConv), which significantly reduces the number of parameters of CNN-based methods. Gao et al. [19] created lightweight networks for more complex HSI classification tasks. Xu et al. [20] proposed a dual-branch network integrating 3-D CNNs and transformers, which introduces DWConv at different scales to effectively extract multiscale features and enhance HSI classification performance. In [21], a dynamic spatial-spectral attention network based on dynamic convolution is designed to achieve more discriminative feature extraction by dynamically calibrating the feature responses in the spatial and spectral domains. However, CNNs still exhibit notable limitations in HSI classification. Specifically, due to the limitation of the fixed convolutional kernel, it is difficult for CNNs to capture comprehensive global features, which hinders their ability to resolve the complex patterns inherent in HSI, especially in high-dimensional data scenes.

Over the past few years, the vision transformer (ViT) [22] has been extensively employed in various fields of computer vision, such as object detection, instance segmentation, and image classification [23], [24], [25], and many researchers have also utilized transformers in HSI classification and achieved good results. Rather than using convolution, the transformer learns features primarily by multihead self-attention (MHSA). Specifically, the convolutional operation is inherently local, while self-attention mechanisms provide the crucial property of global receptive fields, enabling the transformer to capture long-range dependencies. Hong et al. [26] introduced the transformer into HSI classification and focused on the extraction of spectral information. In [27], a transformer-based network is proposed for HSI classification, replacing CNN to enable global feature extraction and improving computational efficiency through an image-based classification framework. In [28], a novel multiscale feature attention transformer model is designed to extract comprehensive multiscale features while effectively capturing long-range dependencies among features across various scales. Arshad et al. [29] designed a lightweight multihead attention mechanism for the transformer and integrated a hybrid spectral-spatial feature extractor to improve classification and mitigate the computational cost. Roy et al. [30] integrated the attention mechanism with morphological operations to design the input patch for the transformer, which enriched the spatial-spectral information by combining morphological features and further improved the classification results. In [31], a spatial–spectral transformer with cross-attention network is designed for HSI classification, which employs a dual-branch structure that uses an enhanced cross-attention module and a spatial–spectral weighted sharing mechanism to efficiently extract and integrate features along the spatial and spectral dimensions. However, transformer-based models still have some limitations in HSI classification. Specifically, the transformer mainly relies on modeling global contextual relationships of input tokens, but it falls short of modeling local information.

Considering that CNNs excel at extracting local features while transformers are effective at capturing global features, combining CNN and transformer allows for the simultaneous extraction of both local and global features from HSIs. For instance, Sun et al. [32] combined 3-D and 2-D convolution layers for low-level feature extraction while utilizing a transformer encoder for high-level feature learning, resulting in impressive performance. Qi et al. [33] embedded 3-D convolution within a dual-branch transformer to extract global and local features, effectively preventing the loss of spectral information. Zhao et al. [34] designed a lightweight network model called GSC-ViT, which incorporates groupwise separable convolution into the transformer to extract local–global spatial features. Ouyang et al. [35] introduced a hybrid CNN-transformer model that effectively captures global and local information in HSIs by integrating multigranularity semantic tokens and a spatial-spectral attention module. In [36], a group-aware hierarchical transformer network is developed, utilizing a grouped pixel embedding module that emphasizes local relationships among spectral channels. In addition, a local–global spectral feature (LGSF) extraction and optimization method [37] is developed to extract both global and local spectral features through spectral restructuring and a dilated convolution-based network. However, the LGSF method primarily focuses on global–local features in the spectral domain, neglecting the spatial domain. As for GSC-ViT, it overlooks the learning of global features in the spectral domain. Furthermore, most global–local methods do not adequately extract spatial and spectral feature representations at varying granularities. To address these limitations, we introduce a novel global–local multigranularity transformer (GLMGT) network for HSI classification. The key contributions of our study are outlined as follows:

1) The GLMGT is proposed for HSI classification, leveraging the strengths of CNN and transformer to sequentially capture spectral and spatial features. In comparison to other global–local methods, our method comprehensively explores local spatial features, global spatial features, local spectral features, and global spectral features, resulting in more robust and discriminative feature representations for HSI classification.

2) Our proposed method introduces two new blocks: the multigranularity spatial feature extraction (MGAFE) block and the multigranularity spectral feature extraction (MGEFE) block. These blocks consecutively extract and integrate discriminative features at different granularities in the spectral and spatial domains.

3) We propose the gated feed-forward module (GFFM) to replace the multilayer perceptron (MLP) in the original transformer. The gating mechanism is designed to suppress useless features and emphasize useful features for more effective classification.

The remainder of this article is organized as follows: Section II provides an overview of recent related work in HSI classification. Section III details the methodology of the GLMGT approach. Section IV presents the experimental results across seven datasets. Section V gives a discussion of our approach. Finally, Section VI offers conclusions to this study.

Fig. 1. Structure of the transformer encoder. MLP denotes the multilayer perceptron.
II. RELATED WORK

A. Convolutional Neural Networks

CNN has been widely used in the field of HSI classification [38]. The convolution operation is at the heart of CNNs, enabling them to effectively extract local features from input data. In HSI classification, 2-D convolutional layers are commonly used to capture the local spatial information. The 2-D convolution operation can be formulated as follows:

$$v_{xy}^{ij} = f\left(b_{ij} + \sum_{m}\sum_{h=0}^{H_i-1}\sum_{w=0}^{W_i-1} w_{hw}^{ijm}\, v_{(x+h)(y+w)}^{i-1,m}\right) \tag{1}$$

where $v_{xy}^{ij}$ is the output at position $(x, y)$ in the $j$th feature map of the $i$th layer, $f$ indicates the activation function, $b_{ij}$ is the network bias, $m$ is the index over the set of feature maps in the $(i-1)$th layer that are connected to the current feature map, $w_{hw}^{ijm}$ denotes the weight of the convolution kernel at spatial position $(h, w)$, and $H_i$ and $W_i$ represent the height and width of the convolution kernel. Given that HSIs have a 3-D structure incorporating both spatial and spectral information, 3-D convolution operations are employed to extract features, enabling the simultaneous capture of characteristics in both spatial and spectral domains. The 3-D convolution operation can be expressed as follows:

$$v_{xyz}^{ij} = f\left(b_{ij} + \sum_{m}\sum_{h=0}^{H_i-1}\sum_{w=0}^{W_i-1}\sum_{d=0}^{D_i-1} w_{hwd}^{ijm}\, v_{(x+h)(y+w)(z+d)}^{i-1,m}\right) \tag{2}$$

where $v_{xyz}^{ij}$ is the output at position $(x, y, z)$ in the $j$th feature map of the $i$th layer, $D_i$ is the depth of the corresponding 3-D convolution kernel of the layer, and the other parameters are consistent with those used in the 2-D convolution operation. By convolving the HSI cube with a 3-D convolutional kernel, both spatial and spectral features can be extracted.
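To make the distinction between (1) and (2) concrete, the following PyTorch sketch (our own illustration, not code from the GLMGT repository; the band count and patch size are assumed for demonstration) applies both operations to a toy HSI patch:

```python
import torch
import torch.nn as nn

# Toy HSI patch: batch of 1, 200 spectral bands treated as channels, 11 x 11 spatial window.
x_2d = torch.randn(1, 200, 11, 11)

# 2-D convolution, Eq. (1): each kernel slides over the spatial plane only,
# mixing all spectral bands at once -> captures local spatial context.
conv2d = nn.Conv2d(in_channels=200, out_channels=64, kernel_size=3, padding=1)
print(conv2d(x_2d).shape)  # torch.Size([1, 64, 11, 11])

# 3-D convolution, Eq. (2): the kernel also slides along the spectral axis,
# so spatial and spectral structure are convolved jointly.
x_3d = x_2d.unsqueeze(1)   # (1, 1, 200, 11, 11): bands become the depth dimension
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(7, 3, 3), padding=(3, 1, 1))
print(conv3d(x_3d).shape)  # torch.Size([1, 8, 200, 11, 11])
```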
B. Transformer

Recently, transformer-based models have been increasingly applied to HSI classification due to their ability to effectively capture long-distance dependencies and global information [39]. As shown in Fig. 1, the core component of the transformer is the multihead self-attention (MHSA) mechanism, which models global contextual information using multiple attention heads in parallel. The input of MHSA is first linearly projected into the query, key, and value matrices. Enhanced representations are then generated by the self-attention mechanism, which can be formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3}$$

Here, $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, $d_k$ denotes the dimension of $K$, and the softmax function is used to obtain the attention weights. The MHSA with $h$ heads can be formulated as follows:

$$\mathrm{MHSA}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{4}$$

Here, $W^{O}$ is the parameter matrix that generates the output, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are learnable weight matrices.
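The following minimal PyTorch sketch of (3) and (4) (an illustration of ours, not the authors' implementation; the token and channel dimensions are assumptions) shows how the per-head projections, the scaled dot-product, and the output projection $W^{O}$ fit together:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Eq. (3): softmax(Q K^T / sqrt(d_k)) V
    d_k = k.size(-1)
    weights = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ v

class MHSA(torch.nn.Module):
    """Multihead self-attention per Eq. (4); the per-head projection matrices
    W_i^Q, W_i^K, W_i^V are fused into a single linear layer."""
    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.out = torch.nn.Linear(dim, dim)      # W^O

    def forward(self, x):                         # x: (batch, tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the channel dim into h heads: (batch, heads, tokens, dim/heads).
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        out = attention(q, k, v)                  # each head attends independently
        return self.out(out.transpose(1, 2).reshape(b, n, d))

x = torch.randn(2, 121, 32)                       # e.g., an 11 x 11 patch flattened to 121 tokens
print(MHSA(dim=32, heads=4)(x).shape)             # torch.Size([2, 121, 32])
```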
III. METHOD

In this section, we detail the proposed GLMGT for HSI classification. First, we clarify the overall framework of GLMGT and its workflow for HSI classification. Second, we delve into the specifics of the main components of the proposed model.

A. Overall Framework

The overall framework of GLMGT is shown in Fig. 2, which consists of four main components: depthwise convolution position embedding (DCPE), MGAFE, MGEFE, and GFFM, where the latter three components are integrated into the transformer encoder. Specifically, given an input HSI $X \in \mathbb{R}^{H \times W \times C}$, the GLMGT first divides it into patches $X_{patch} \in \mathbb{R}^{P \times P \times C}$, where $H \times W$ represents the spatial dimensions, $P$ is the patch size, and $C$ is the number of channels. First, a 3 × 3 convolution layer is used for dimensionality reduction and to extract the shallow features $\hat{X}_{patch} \in \mathbb{R}^{P \times P \times D}$, where $D$ is the channel dimension after dimensionality reduction. Second, the DCPE module embeds the position information into the input patches, which are fed into the transformer encoder.
MENG et al.: GLMGT FOR HYPERSPECTRAL IMAGE CLASSIFICATION[COMP: PLEASE TAKE CARE OF GRAPHICS IN TABLES 115

Fig. 2. Structure of GLMGT for HSI classification.

Fig. 3. Multigranularity Spatial Feature Extraction (MGAFE) block, consisting of the multiscale local spatial feature enhancement (MSLAFE) module and the
GAA module. MGAFE focuses on extracting spatial features from HSI across different granularities, integrating both multiscale local spatial information and
global spatial information.

Next, the MGAFE block enhances local spatial features across multiple granularities through a multiscale local spatial feature enhancement (MSLAFE) module and captures global spatial information by a global spatial attention (GAA) module. Subsequently, the MGEFE block focuses on global–local spectral feature extraction, enhancing local spectral features at different granularities with a multiscale local spectral feature enhancement (MSLEFE) module and integrating global spectral dependencies with a global spectral attention (GEA) module. Then, the GFFM module extracts local features and manages finer feature propagation with its gating mechanism. Finally, the extracted features are converted into a 1-D vector and fed into a linear classifier for the final classification results.
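The overall data flow can be summarized in a few lines of PyTorch. The sketch below is a structural illustration only: the nn.Identity placeholders stand in for the DCPE, MGAFE, MGEFE, and GFFM modules detailed in the following subsections, and the mean-pooling head is our assumption about how the features are converted into a 1-D vector before the linear classifier.

```python
import torch
import torch.nn as nn

class GLMGT(nn.Module):
    """Structural sketch of the GLMGT pipeline in Fig. 2 (our illustration,
    not the authors' released code)."""
    def __init__(self, bands=200, dim=32, num_classes=16):
        super().__init__()
        self.reduce = nn.Conv2d(bands, dim, kernel_size=3, padding=1)  # 3x3 reduction
        self.dcpe = nn.Identity()    # depthwise-conv position embedding, Eq. (5)
        self.mgafe = nn.Identity()   # multigranularity spatial block, Eq. (6)
        self.mgefe = nn.Identity()   # multigranularity spectral block, Eq. (11)
        self.gffm = nn.Identity()    # gated feed-forward module, Eq. (16)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch):        # patch: (batch, C, P, P), channel-first in PyTorch
        x = self.reduce(patch)       # shallow features, (batch, D, P, P)
        x = self.gffm(self.mgefe(self.mgafe(self.dcpe(x))))
        x = x.flatten(2).mean(-1)    # assumed pooling to a 1-D vector per patch
        return self.head(x)

logits = GLMGT()(torch.randn(4, 200, 11, 11))
print(logits.shape)                  # torch.Size([4, 16])
```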
B. Depthwise Convolution Position Embedding

The absolute position embedding used in previous transformers [22], [40] embeds position information by adding a unique position encoding to the input, but it fails to make the model translation-equivariant [41]. To address this limitation, DCPE is designed to dynamically integrate position information into HSI patches based on the local neighborhood information of the input patch. As shown in Fig. 2, DCPE consists of a 3 × 3 depthwise convolution (DWConv) layer and a residual connection, which can be expressed as follows:

$$X_a = f_{DWConv2D,3\times3}(\hat{X}_{patch}) + \hat{X}_{patch} \tag{5}$$

where $\hat{X}_{patch}$ is the input patch, $X_a$ is the output feature of DCPE, and $f_{DWConv2D,3\times3}(\cdot)$ represents DWConv with a kernel size of 3 × 3.
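A minimal PyTorch sketch of (5) (our illustration, not the released implementation) shows where the translation-equivariance comes from: the position information is derived from a depthwise convolution over each channel's local neighborhood rather than from an absolute, location-indexed table.

```python
import torch
import torch.nn as nn

class DCPE(nn.Module):
    """Depthwise convolution position embedding, Eq. (5): a 3x3 depthwise
    convolution with a residual connection (a minimal sketch)."""
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the convolution depthwise: one 3x3 filter per channel,
        # so each channel's position cue comes from its own local neighborhood.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):             # x: (batch, D, P, P)
        return self.dwconv(x) + x     # X_a = f_DWConv(X_patch) + X_patch

print(DCPE(32)(torch.randn(2, 32, 11, 11)).shape)  # torch.Size([2, 32, 11, 11])
```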
C. Multigranularity Spatial Feature Extraction

Transformers excel at capturing global long-range dependencies, while they often fall short in processing local multiscale details. To address this inherent limitation of the transformer, the MGAFE block is introduced for the extraction of multigranularity spatial features within HSI. Specifically, we employ the multiscale local spatial feature enhancement (MSLAFE) module and the GAA module to extract fine- and coarse-grained spatial features, respectively. As illustrated in Fig. 3, the output feature $X_a'$ of MGAFE is given by

$$X_a' = \mathrm{GAA}(\mathrm{MSLAFE}(\mathrm{LN}(X_a))) + X_a \tag{6}$$

where $X_a \in \mathbb{R}^{P \times P \times D}$ is the input of the transformer encoder and LN denotes the layer normalization (LN) [42] layer. In this work, $X_a$ is maintained in the form of image patches to better exploit spatial information.

1) MSLAFE: Given the complex environment of HSI, different objects often exhibit different scales. Therefore, we designed the MSLAFE module to extract multiscale local spatial features by using three parallel 2-D convolutions with different kernels. As shown in the left part of Fig. 3, $X_a$ first undergoes the LN layer, producing an output feature represented as $\hat{X}_a$, which serves as the input for the MSLAFE module. The output features of the three 2-D convolution layers can be formulated as
116 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 18, 2025

$$X_1 = f_{Conv2D,1\times1}(\hat{X}_a)$$
$$X_2 = f_{Conv2D,3\times3}(\hat{X}_a)$$
$$X_3 = f_{Conv2D,5\times5}(\hat{X}_a) \tag{7}$$

where $f_{Conv2D,k\times k}(\cdot)$ represents 2-D convolution with kernel $k \times k$. Then, $X_1$, $X_2$, and $X_3$ are concatenated along the channel dimension, followed by a 3 × 3 DWConv layer, which further extracts local spatial features. The output of MSLAFE can be denoted as

$$I = f_{DWConv2D,3\times3}(\mathrm{concat}(X_1, X_2, X_3)) \tag{8}$$

where $I \in \mathbb{R}^{P \times P \times 3D}$ is the output feature and $\mathrm{concat}(\cdot)$ represents the concatenation operation. Through the MSLAFE module, the spatial information of HSI at different receptive field sizes can be learned.
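A minimal sketch of (7) and (8) in PyTorch (our illustration; channel counts are assumptions) makes the branch structure explicit:

```python
import torch
import torch.nn as nn

class MSLAFE(nn.Module):
    """Multiscale local spatial feature enhancement, Eqs. (7)-(8): three parallel
    2-D convolutions with 1x1, 3x3, and 5x5 kernels, concatenated along the
    channel dimension and fused by a 3x3 depthwise convolution (a sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.branch1 = nn.Conv2d(dim, dim, 1)
        self.branch3 = nn.Conv2d(dim, dim, 3, padding=1)
        self.branch5 = nn.Conv2d(dim, dim, 5, padding=2)
        self.fuse = nn.Conv2d(3 * dim, 3 * dim, 3, padding=1, groups=3 * dim)  # DWConv

    def forward(self, x):            # x: (batch, D, P, P), already layer-normalized
        i = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.fuse(i)          # I in Eq. (8): (batch, 3D, P, P)

print(MSLAFE(32)(torch.randn(2, 32, 11, 11)).shape)  # torch.Size([2, 96, 11, 11])
```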
2) GAA: The GAA module is designed to model the connections between different locations of HSI and enhance the spatial awareness of the model. The right part of Fig. 3 shows the structure of GAA. After the MSLAFE module, the multiscale spatial features $I \in \mathbb{R}^{P \times P \times 3D}$ are divided into three parts, each of size $P \times P \times D$, which are then reshaped into $A, B, C \in \mathbb{R}^{PP \times D}$. Next, a softmax layer is applied to the product of the matrices $A$ and $B^{T}$ to generate spatial attention weights, where $B^{T}$ is the transpose of $B$. The formulation of the GAA map $M_{attn}^{a}$ is given by

$$M_{attn}^{a} = \mathrm{Softmax}\left(\frac{A \cdot B^{T}}{\alpha}\right) \tag{9}$$

where $\mathrm{Softmax}(\cdot)$ denotes the softmax function applied to the final dimension of the input and $\alpha$ is a learnable scaling parameter. The resulting $M_{attn}^{a} \in \mathbb{R}^{PP \times PP}$ represents a matrix where each entry corresponds to the attention weight between different pixels in the input, facilitating the aggregation of global spatial information. After that, we multiply $M_{attn}^{a}$ with $C$ and reshape the result to $\mathbb{R}^{P \times P \times D}$. At the end of the GAA module, we extract the global spatial features using a 1 × 1 convolution layer. The output feature $F$ of the GAA module is given by

$$F = f_{Conv2D,1\times1}(M_{attn}^{a} \cdot C). \tag{10}$$

Fig. 4. MGEFE block, consisting of the MSLEFE module and the GEA module. MGEFE is designed to extract global–local multigranularity spectral features of HSI.

D. Multigranularity Spectral Feature Extraction

Given the abundant spectral bands in HSI, it is also particularly critical to extract spectral features; in particular, fine spectral information plays a crucial role in distinguishing vegetation classes that exhibit similar spectral features. Therefore, the MGEFE block is designed after MGAFE to further extract multigranularity spectral features. Similarly, we utilize the multiscale local spectral feature enhancement (MSLEFE) module and the global spectral attention (GEA) module for the extraction of fine- and coarse-grained spectral features. Since the MGAFE block and MGEFE block are sequentially connected, the output feature $X_a'$ of the MGAFE block serves as the input of the MGEFE block. As shown in Fig. 4, the output feature $X_e$ of MGEFE can be expressed as

$$X_e = \mathrm{GEA}(\mathrm{MSLEFE}(\mathrm{LN}(X_a'))) + X_a'. \tag{11}$$

1) MSLEFE: Similar to the structural design of MSLAFE, to enhance multiscale spectral features of HSI, MSLEFE is designed to use 3-D convolutions with different kernel sizes to exploit different levels of spectral information. As shown in Fig. 4, similar to the MSLAFE block, $X_a'$ also undergoes the LN layer, producing an output feature represented as $\hat{X}_e$, which serves as the input for the MSLEFE module. $\hat{X}_e$ is first reshaped to four dimensions ($1 \times P \times P \times D$) to extract multiscale spectral features by three parallel 3-D convolutions with kernel sizes of $1 \times 1 \times k$. Here, we set $k = 3, 5, 7$, respectively; the output of the three 3-D convolution layers can be formulated as

$$\hat{X}_1 = f_{Conv3D,1\times1\times3}(\mathrm{unsqueeze}(\hat{X}_e))$$
$$\hat{X}_2 = f_{Conv3D,1\times1\times5}(\mathrm{unsqueeze}(\hat{X}_e))$$
$$\hat{X}_3 = f_{Conv3D,1\times1\times7}(\mathrm{unsqueeze}(\hat{X}_e)) \tag{12}$$

where $\mathrm{unsqueeze}(\cdot)$ is used to expand $\hat{X}_e$ and $f_{Conv3D,1\times1\times k}(\cdot)$ represents 3-D convolution with kernel $1 \times 1 \times k$. Next, $\hat{X}_1$, $\hat{X}_2$, and $\hat{X}_3$ are concatenated along a new dimension. Then, the concatenated multiscale spectral features are reshaped and processed through a 2-D convolutional layer to restore the three

dimensions of $P \times P \times D$. The output feature $\hat{X}$ of MSLEFE is obtained through a residual connection, which can be expressed as follows:

$$\hat{X} = f_{Conv2D,1\times1}(\mathrm{reshape}(\mathrm{concat}(\hat{X}_1, \hat{X}_2, \hat{X}_3))) + \hat{X}_e \tag{13}$$

where $\mathrm{reshape}(\cdot)$ denotes the reshaping operation.
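The following sketch of (12) and (13) is our illustration, not the released code; it maps the paper's $1 \times 1 \times k$ kernels onto PyTorch's $(k, 1, 1)$ kernel layout for a $(B, 1, D, P, P)$ tensor, where the spectral axis is the depth dimension:

```python
import torch
import torch.nn as nn

class MSLEFE(nn.Module):
    """Multiscale local spectral feature enhancement, Eqs. (12)-(13): parallel
    3-D convolutions with kernels of extent 3, 5, and 7 along the spectral axis,
    followed by a 1x1 fusion convolution and a residual connection (a sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(1, 1, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))
            for k in (3, 5, 7))
        self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (batch, D, P, P), layer-normalized
        x3d = x.unsqueeze(1)                   # unsqueeze(.): (batch, 1, D, P, P)
        feats = [b(x3d).squeeze(1) for b in self.branches]  # each (batch, D, P, P)
        fused = self.fuse(torch.cat(feats, dim=1))          # restore the D channels
        return fused + x                       # residual connection in Eq. (13)

print(MSLEFE(32)(torch.randn(2, 32, 11, 11)).shape)  # torch.Size([2, 32, 11, 11])
```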
2) GEA: Similar to the GAA module in MGAFE, the GEA module is proposed to enhance the spectral awareness of the model by extracting global spectral features. As shown in the right part of Fig. 4, the input feature $\hat{X} \in \mathbb{R}^{P \times P \times D}$ is first fed into three 1 × 1 convolution layers and three 3 × 3 DWConv layers to enrich local context. Subsequently, it is reshaped to yield three sets of new features $\hat{A}, \hat{B}, \hat{C} \in \mathbb{R}^{D \times PP}$. Next, a matrix multiplication of $\hat{A}$ and $\hat{B}$ is conducted to obtain the correlation between different channels, producing a global spectral attention map $M_{attn}^{e} \in \mathbb{R}^{D \times D}$ through the softmax function. The formulation of $M_{attn}^{e}$ can be defined as follows:

$$M_{attn}^{e} = \mathrm{Softmax}\left(\frac{\hat{A} \cdot \hat{B}^{T}}{\hat{\alpha}}\right) \tag{14}$$

where $\hat{\alpha}$ represents a learnable scaling parameter. The resulting spectral attention map is used to weigh the contribution of each channel to the output features of every other channel. Then, we reshape the product of $M_{attn}^{e}$ and $\hat{C}$ to $\mathbb{R}^{P \times P \times D}$ and pass it through a 1 × 1 convolution layer. The output feature $\hat{F}$ of the GEA module can be expressed as

$$\hat{F} = f_{Conv2D,1\times1}(M_{attn}^{e} \cdot \hat{C}). \tag{15}$$

The GEA module effectively integrates global spectral attention to enhance the representation of the spectral features extracted by MGEFE.

E. Gated Feed-Forward Module

As shown in Fig. 5(a), the multilayer perceptron (MLP) used for the feed-forward module in the original transformer network consists of two fully connected layers. The GELU activation function [43] is employed in the middle of the MLP, adding nonlinearity to the hidden layer. To enhance the extraction of local information and keep useful features propagating forward, we propose an alternative feed-forward module called GFFM. The output feature $Y$ of GFFM in the GLMGT encoder block can be formulated as follows:

$$Y = \mathrm{GFFM}(\mathrm{LN}(X_e)) + X_e. \tag{16}$$

Fig. 5. Comparison of different feed-forward modules: (a) Multilayer perceptron (MLP). (b) Gated feed-forward module (GFFM).

As shown in Fig. 5(b), the GFFM replaces the fully connected layers of the MLP with 1 × 1 convolutions and devises a gating mechanism to pass more useful information for effective classification. Specifically, the gating mechanism is formulated as the elementwise product of two parallel linear transformation paths, one of which is nonlinearly activated by the GELU. In addition, DWConv is introduced into the two parallel paths of the gating mechanism to enhance spatial information, especially to capture local details. The process of GFFM can be formulated as

$$Z' = \phi(f_{DWConv2D,3\times3}(f_{Conv2D,1\times1}(Z)))$$
$$Z'' = f_{DWConv2D,3\times3}(f_{Conv2D,1\times1}(Z))$$
$$\hat{Z} = f_{Conv2D,1\times1}(Z' \odot Z'') \tag{17}$$

where $Z$ is the input of GFFM, $\hat{Z}$ denotes the output, $\phi$ represents the GELU activation function, and $\odot$ signifies elementwise multiplication. In this work, in the same way as previous works [22], [44], we expand the output channel dimensions of the first two 1 × 1 convolutions by an expansion factor of $\gamma$, and the final 1 × 1 convolution reverts to the original channel dimensions. To improve the reproducibility of the proposed method, we provide the detailed configuration of GLMGT in Table I.
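A minimal PyTorch sketch of (17) follows (our illustration, not the released implementation; the default expansion factor here is a placeholder, since the optimal $\gamma$ is tuned per dataset in Table XII):

```python
import torch
import torch.nn as nn

class GFFM(nn.Module):
    """Gated feed-forward module, Eq. (17): two parallel 1x1-conv + 3x3-DWConv
    paths, one passed through GELU, combined by elementwise multiplication."""
    def __init__(self, dim, gamma=4):          # gamma: expansion factor (placeholder)
        super().__init__()
        hidden = dim * gamma
        def path():
            return nn.Sequential(
                nn.Conv2d(dim, hidden, 1),                              # 1x1 expansion
                nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)) # 3x3 DWConv
        self.gate, self.value = path(), path()
        self.act = nn.GELU()                   # phi in Eq. (17)
        self.project = nn.Conv2d(hidden, dim, 1)  # revert to the original channels

    def forward(self, z):                  # z: (batch, D, P, P), layer-normalized
        z1 = self.act(self.gate(z))        # Z'  = GELU(DWConv(Conv1x1(Z)))
        z2 = self.value(z)                 # Z'' = DWConv(Conv1x1(Z))
        return self.project(z1 * z2)       # gating: elementwise product, then 1x1 conv

print(GFFM(32)(torch.randn(2, 32, 11, 11)).shape)  # torch.Size([2, 32, 11, 11])
```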

TABLE I
DETAILED CONFIGURATION OF THE GLMGT NETWORK

IV. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we first present the seven standard HSI datasets used in the experiments. Next, we describe the specific setup of the experiments. Finally, we compare the classification results of GLMGT with several advanced models on the seven datasets to verify its effectiveness.

A. Hyperspectral Datasets

1) Indian Pines (IP) Dataset: The Indian Pines dataset was collected using the airborne visible/infrared imaging spectrometer (AVIRIS) in northwestern Indiana, USA. The IP dataset contains 16 distinct land cover classes and consists of 145 × 145 pixels. After noise band removal, the IP dataset comprises 200 spectral bands ranging from 0.4 to 2.5 micrometers, with a spatial resolution of 20 m per pixel. The false color image and ground truth image of the IP dataset are shown in Fig. 6. The dataset consists of 10 249 labeled samples, where 10% of the samples of each category are randomly selected as training samples and the remainder as test samples. The color, class name, and specific sample information of the IP dataset are listed in Table II.

Fig. 6. IP dataset. (a) False-color image. (b) Ground truth image.

TABLE II
SAMPLE DISTRIBUTION OF THE IP DATASET

2) University of Pavia (UP) Dataset: The University of Pavia dataset was collected at the University of Pavia, Italy, using a sensor named the reflective optics system imaging spectrometer. The UP dataset encompasses nine different land cover classes and 610 × 340 pixels. After eliminating the noise bands, the UP dataset contains 103 spectral bands between 0.43 and 0.86 micrometers and has a spatial resolution of 1.3 meters per pixel. The false color image and ground truth image of the dataset are shown in Fig. 7. The total number of samples in the UP dataset is 42 776. 1% of each class was randomly selected as the training set and the remaining 99% as the test set. Table III shows the UP dataset with specific class information and sample distribution.

Fig. 7. UP dataset. (a) False-color image. (b) Ground truth image.

3) Houston 2013 (HT) Dataset: The Houston 2013 dataset was collected by the ITRES CASI-1500 sensor in rural areas

TABLE III
SAMPLE DISTRIBUTION OF THE UP DATASET

around Texas, USA, covering 15 land cover classes. The HT dataset has a spatial resolution of 2.5 m per pixel and a data size of 349 × 1905 pixels. It contains 144 spectral bands ranging from 0.38 to 1.05 μm. The false color image and ground truth image of the dataset are shown in Fig. 8. The total number of samples in the HT dataset is 15 029. Table IV shows the specific class information and sample distribution of the HT dataset, with 10% training samples and 90% test samples in each class.

Fig. 8. HT dataset. (a) False-color image. (b) Ground truth image.

TABLE IV
SAMPLE DISTRIBUTION OF THE HT DATASET

4) Kennedy Space Center (KSC) Dataset: The KSC dataset was collected by NASA's AVIRIS instrument in 1996 at the KSC in Florida, USA, and has 13 land cover categories. This dataset consists of 512 × 614 pixels, each with a spatial resolution of 18 m. After noise band removal, the KSC dataset contains 176 usable spectral bands ranging from 0.4 to 2.5 μm. The false color image and ground truth image of the dataset are shown in Fig. 9. The total number of samples in the KSC dataset is 5211. Table V shows the specific class information and sample distribution of the KSC dataset, with 10% training samples and 90% test samples in each class.

Fig. 9. KSC dataset. (a) False-color image. (b) Ground truth image.

TABLE V
SAMPLE DISTRIBUTION OF THE KSC DATASET

TABLE VI
SAMPLE DISTRIBUTION OF THE WHLK DATASET

5) WHU-Hi-LongKou (WHLK) Dataset: The WHLK dataset was collected by the Headwall Nano-Hyperspec sensor aboard a DJI M600 Pro UAV platform in Longkou town, Hubei province, China, in 2018 [45]. The WHLK dataset encompasses nine different land cover classes and 550 × 400 pixels. It contains 270 spectral bands between 0.4 and 1 μm and has a spatial resolution of 0.463 meters per pixel. The false color image and ground truth image of the dataset are shown in Fig. 10. The total number of samples in the WHLK dataset is 204 542. Table VI shows the specific class information and sample distribution of the WHLK

Fig. 10. WHLK dataset. (a) False-color image. (b) Ground truth image.

Fig. 11. ZYHHK dataset. (a) False-color image. (b) Ground truth image.

TABLE VII
SAMPLE DISTRIBUTION OF THE ZYHHK DATASET

Fig. 12. GFYC dataset. (a) False-color image. (b) Ground truth image.

TABLE VIII
SAMPLE DISTRIBUTION OF THE GFYC DATASET

dataset, with 0.3% training samples and 99.7% test samples in each class.

6) ZY1-02D Huanghekou (ZYHHK) Dataset: The ZYHHK dataset was collected by the AHSI sensor on board the Chinese ZY1-02D satellite at the Yellow River Estuary in Shandong province, China, in 2021 [46], [47]. The ZYHHK dataset encompasses eight different land cover classes and 1050 × 1219 pixels, with a spatial resolution of 30 m per pixel. It contains 108 spectral bands ranging from 0.4 to 2.5 μm. The false color image and ground truth image of the dataset are shown in Fig. 11. The total number of samples in the ZYHHK dataset is 14 087. Table VII shows the specific class information and sample distribution of the ZYHHK dataset, with 3% training samples and 97% test samples in each class.

7) GF-5 Yancheng (GFYC) Dataset: The GF-5 Yancheng dataset was collected by the AHSI sensor on board the Chinese GF-5 satellite in Yancheng city, Jiangsu province, China [46], [47]. The GFYC dataset has seven land cover categories and 1175 × 585 pixels, with a spatial resolution of 30 m per pixel. It contains 147 spectral bands. The false color image and ground truth image of the GFYC dataset are shown in Fig. 12. The total number of samples in the GFYC dataset is 5475. Table VIII shows the specific class information and sample distribution of the GFYC dataset, with 3% training samples and 97% test samples in each class.

B. Experimental Setting

All of the experiments are performed on a computer equipped with an RTX 3060 GPU. Our proposed model is based on the PyTorch framework and built with Python 3.9.7. To assess the performance of our method across the seven datasets, we use three main evaluation metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient. For parameter optimization, we train the models with the Adam optimization algorithm (learning rate = 0.001, weight decay = 0.0001). The training process consists of 100 epochs using a consistent batch size of 100. In addition, to ensure the reliability of the experiments, five experiments were conducted for each model, and the final mean and standard deviation were used as performance metrics.
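For reference, the training loop implied by this setup can be sketched as follows. The GLMGT class and train_loader here are hypothetical placeholders; only the hyperparameters are the ones reported above.

```python
import torch

# Sketch of the training configuration described in the text. GLMGT and
# train_loader are hypothetical placeholders; the optimizer settings, epoch
# count, and batch size are the reported values.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GLMGT(bands=200, dim=32, num_classes=16).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):                      # 100 epochs
    model.train()
    for patches, labels in train_loader:      # batches of 100 labeled HSI patches
        optimizer.zero_grad()
        loss = criterion(model(patches.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```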
The effectiveness of the proposed method is validated in this article through comparisons with several state-of-the-art models. We select a range of representative models as comparative methods, including MSRN [19], DSSAN [28], LGG-CNN [48], SpectralFormer [26], morphFormer [30], MSFAT [21], SSFTT [32],

TABLE IX
OA (%) OF GLMGT WITH DIFFERENT NUMBERS OF TRANSFORMER ENCODERS ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE X
OA (%) OF GLMGT WITH DIFFERENT PATCH SIZES ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE XI
OA (%) OF GLMGT WITH DIFFERENT EMBEDDING DIMENSIONS ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

TABLE XII
OA (%) OF GLMGT WITH DIFFERENT EXPANSION FACTORS ON THE IP, UP, HT, KSC, WHLK, ZYHHK, AND GFYC DATASETS

CTMixer [49], HybridFormer [35], and GSC-ViT [34]. Among these, the first three are CNN-based models, while the remaining seven are transformer-based models.

To thoroughly verify the effectiveness of our proposed model, several of the chosen comparison models are themselves global–local models. Specifically, LGG-CNN, among the CNN-based models, employs a gated convolution-based global–local module to capture global and local information. Among the transformer-based models, CTMixer combines CNN and transformer in a two-branch structure to capture both local and global features. HybridFormer introduces convolutional operations into the self-attention mechanism to efficiently learn global–local information in the spectral and spatial domains. GSC-ViT incorporates groupwise separable convolution into the transformer to extract local–global spatial features.

C. Parameter Sensitivity Analysis

In this section, we explore the effects of critical parameters within the proposed model, encompassing the number of transformer encoders, the patch size, the embedding dimension, and the expansion factor. It should be noted that when we change one of the parameters, the others remain at their default values.

1) Number of Transformer Encoders: To investigate the influence of the number of transformer encoders on the classification performance, we conducted experiments varying the number from 1 to 4. The results on the seven datasets are presented in Table IX. It is evident that the optimal number of transformer encoders for the GLMGT model is 1. Therefore, a single transformer encoder is used as the default setting in the following experiments.

2) Patch Size: The patch size of HSI data directly affects the effective use of spatial information, which in turn affects the classification accuracy. Therefore, we conducted experiments on the seven datasets using various patch sizes ranging from 3 to 15, with intervals of 2. Table X shows that a patch size of 15 × 15 yielded the highest OA value on the KSC dataset, while a patch size of 11 × 11 achieved the best classification effect on the other datasets. Therefore, we chose the patch size which obtained the highest OA value as the default setting for each dataset. Due to the relatively complex distribution of species in the KSC dataset, employing larger patch sizes can provide a wider range of contextual information.

3) Embedding Dimension: Our model begins with the application of a 3 × 3 convolution layer for dimensionality reduction. The embedding dimension D is the channel dimension after dimensionality reduction. The choice of D affects the quality and richness of the features extracted from the HSI patch. A higher D may lead the model to capture finer spectral details, but it is also likely to process redundant spectral information. On the other hand, a lower D may result in the loss of important spectral information. Experiments are conducted with different embedding dimension values. The candidate set of embedding dimensions is {4, 8, 16, 32, 64, 96, 128}. Table XI shows that an embedding dimension of 32 consistently delivered the best performance across all seven datasets. It can also be seen that choosing excessively small embedding dimensions results in poorer classification performance. This indicates that an embedding dimension of 32 strikes an optimal balance between retaining essential spectral details and minimizing redundancy.

4) Expansion Factor: To analyze the optimal value of the expansion factor (γ) for extending the channels in GFFM, we evaluate it in the range of 1 to 10. As shown in Table XII, the optimal expansion factors for the seven datasets were found to be 9, 9, 10, 5, 10, 10, and 9, respectively. These results suggest that higher expansion factors typically lead to better performance on all seven datasets.

D. Classification Results

1) Quantitative Evaluation: To validate the advantages of the GLMGT model, we adopted ten state-of-the-art models as comparison methods. Tables XIII–XIX show the classification results of the eleven methods on the seven datasets. It can be observed that GLMGT is ahead of the other models

TABLE XIII
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE INDIAN PINES DATASET

TABLE XIV
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE UNIVERSITY OF PAVIA DATASET

TABLE XV
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE HOUSTON 2013 DATASET

TABLE XVI
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE KSC DATASET

TABLE XVII
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE WHLK DATASET

TABLE XVIII
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE ZYHHK DATASET

TABLE XIX
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE GF-5 YANCHENG DATASET

on all three metrics: OA, AA, and Kappa. Specifically, on the IP dataset, GLMGT outperforms in 11 out of 16 categories. Similarly, on the KSC dataset, GLMGT even outperforms the other methods in classification accuracy in every category. Also, on the HT dataset, GLMGT outperforms the other methods, with OA values 1.24% higher than SSFTT and 3.50% higher than DSSAN. Furthermore, the accuracy of GLMGT exceeds 90% in almost every category on the seven datasets, which suggests that the GLMGT network not only significantly improves the accuracy for small-sample categories but also maintains a stable level of performance in the other categories.

In addition, it can be observed that the classification performance of GLMGT is also higher than that of the five global–local models (i.e., LGG-CNN, SSFTT, CTMixer, HybridFormer, and GSC-ViT) on the seven datasets. For example, the OA value of GLMGT is 5.24% superior to that of LGG-CNN on the IP dataset, 3.99% superior to that of CTMixer on the HT dataset, 6.46% superior to that of HybridFormer on the ZYHHK dataset, and 5.54% superior to that of GSC-ViT on the KSC dataset. This also shows the superiority of GLMGT in extracting the global and local features of HSI.

2) Visual Evaluation: Figs. 13–19 show the classification maps of each method on the seven datasets. It can be seen that the proposed method has the fewest noisy pixels and is closest to the ground truth. Taking the classification maps on the WHLK dataset in Fig. 17 as an example, it is clear that the classification maps of LGG-CNN and SpectralFormer have more noisy pixels, which means that there are more misclassified regions. Furthermore, in the classification maps of all seven datasets, typical regions are highlighted with a gray border and enlarged to clearly display the classification results for those specific areas. For the IP dataset, the classification map generated by the proposed GLMGT closely matches the ground truth map, particularly for the "Alfalfa" category, as shown in the zoomed-in images in Fig. 13. This observation aligns with the quantitative assessment results presented in Table XIII. Similarly, the classification map of our GLMGT for the UP dataset shows fewer misclassified pixels, especially in the upper left corner of the zoomed-in region, as shown in Fig. 14. For the HT dataset, while most methods misclassify the similar crops "Healthy Grass" and "Stressed Grass" (see Fig. 15), GLMGT effectively mitigates this issue by learning multigranularity spatial-spectral features from HSI. For the KSC dataset, the classification map of our GLMGT shows fewer noise points than the other methods, particularly for the "Slash Pine" category, as illustrated in the upper right corner of the zoomed-in region in Fig. 16. In conclusion, the

Fig. 13. Classification maps of different models on the IP dataset with 10% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

Fig. 14. Classification maps of different models on the UP dataset with 1% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

classification maps generated by the proposed GLMGT model across the seven datasets show fewer misclassified areas compared to those produced by the other models, further demonstrating the superiority of GLMGT.

3) Impact of Training Ratio: To further evaluate the efficacy of GLMGT, we analyzed the performance of the eleven methods across a range of training sample ratios: 0.1%, 0.5%, 1.0%, 1.5%, and 2.0% for the UP dataset; 1%, 5%, 10%, 15%, and 20% for the IP, HT, and KSC datasets; 0.1%, 0.2%, 0.3%, 0.4%, and 0.5% for the WHLK dataset; and 1%, 2%, 3%, 4%, and 5% for the ZYHHK and GFYC datasets. As illustrated in Fig. 20, GLMGT consistently outperformed the other methods at every training sample ratio. For example, when the training sample ratio is small, GLMGT outperforms other

Fig. 15. Classification maps of different models on the HT dataset with 10% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

Fig. 16. Classification maps of different models on the KSC dataset with 10% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

comparable methods. For example, on the UP dataset, GLMGT achieves 88% OA when the training sample ratio is set to 0.1%, whereas for other transformer-based methods, such as SpectralFormer, the OA is only about 71%. Moreover, as the training sample ratio increases, the performance of the GLMGT network exhibits a steady upward trend, indicating that it can achieve good results under different data scales.

Fig. 17. Classification maps of different models on the WHLK dataset with 0.3% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

TABLE XX
RESULTS OF ABLATION STUDIES OF DIFFERENT COMPONENTS IN IP, UP, HT AND KSC DATASETS (OA%)

V. DISCUSSION

In this section, we discuss the validity of the main components of GLMGT on four publicly available datasets: the IP, UP, HT, and KSC datasets.

TABLE XXI
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT GRANULARITY CONFIGURATIONS IN THE GLMGT NETWORK

A. Ablation Analysis

1) Ablation Analysis of Main Components: We evaluated the effect of each component on the classification results by progressively removing the main components. Table XX shows the results of the component ablation experiments performed on the four datasets. Across these four datasets, GLMGT performs best when all main components are included, while deleting any component adversely affects model performance. Specifically, the lack of DCPE caused a decrease in OA values, and extracting only spatial features (i.e., without MGEFE) or only spectral features (i.e., without MGAFE) also resulted in poor classification results. However, the addition of MGEFE after MGAFE significantly improves the classification performance, with an increase in OA of up to 1.64%, highlighting the critical importance of combining spatial features and subtle spectral information for category differentiation. In addition, we observed that removing GFFM also diminishes classification performance, for example, by a decrease of 5.05% on the HT dataset. Thus, these results illustrate the effectiveness of each component in the GLMGT network.

2) Ablation Analysis of Granularity Configurations in the GLMGT: In this article, we introduce the MSLAFE module and the GAA module to extract fine- and coarse-grained spatial features, respectively. In addition, we employ the MSLEFE module and the GEA module for the extraction of fine- and coarse-grained spectral features. By thoroughly exploring both fine- and coarse-grained representations in HSIs, we can extract more robust features. As shown in Table XXI, we assessed the impact of these four modules on classification outcomes by systematically eliminating each module in a stepwise manner. Our analysis across the four datasets (i.e., IP, UP, HT, and KSC) revealed that GLMGT achieves optimal performance when all

Fig. 18. Classification maps of different models on the ZYHHK dataset with 3% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

TABLE XXII
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT GRANULARITY CONFIGURATIONS IN THE MSLAFE MODULE

TABLE XXIII
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT GRANULARITY CONFIGURATIONS IN THE MSLEFE MODULE

modules are integrated. Conversely, removing any single module leads to a decline in classification accuracy, underscoring the importance of each component within the overall framework. This demonstrates that feature representations at different granularities contribute to improving classification results.

3) Ablation Analysis of Different Granularity Configurations in the MSLAFE and MSLEFE Modules: In this article, the MSLAFE module utilizes parallel 2-D convolutions with varying kernel sizes to extract local spatial features at different granularities. Likewise, the MSLEFE module employs parallel 3-D convolutions with diverse kernels to capture a range of local spectral features. As illustrated in Table XXII, the best accuracy for the MSLAFE module was achieved with the combination of 1 × 1, 3 × 3, and 5 × 5 convolution kernels. As illustrated in Table XXIII, for the MSLEFE module, the combination of 1 × 1 × 3, 1 × 1 × 5, and 1 × 1 × 7 kernels obtained the best accuracy across all datasets. Using either too few or too many granularities results in suboptimal performance. A limited number of granularities may cause the model to overlook important local features necessary for accurate predictions. Conversely, having too many granularities increases model complexity and may introduce noise.

4) Ablation Analysis of DCPE: To evaluate the effect of DCPE on the classification performance of HSI, we conducted ablation experiments comparing DCPE with absolute positional embedding (APE) [44]. As shown in Table XXIV, the OA value decreases by up to 2.40% on the four datasets when replacing DCPE with APE. In addition, there is also a slight decrease in the OA

Fig. 19. Classification maps of different models on the GFYC dataset with 3% training samples. (a) Ground truth map. (b) MSRN. (c) DSSAN. (d) LGG-CNN. (e) SpectralFormer. (f) morphFormer. (g) MSFAT. (h) SSFTT. (i) CTMixer. (j) HybridFormer. (k) GSC-ViT. (l) GLMGT.

Fig. 20. Classification performance of eleven models across varied training sample ratios on the seven datasets. (a) IP. (b) UP. (c) HT. (d) KSC. (e) WHLK.
(f) ZYHHK. (g) GFYC.

value when both DCPE and APE are used. These results indicate the effectiveness of DCPE in embedding position information.

5) Ablation Analysis of the Attention Modules: To evaluate the efficacy of the attention modules in GLMGT, we used the MHSA module from the original transformer to replace MGAFE or MGEFE. Table XXV shows that the use of MGAFE in combination with MGEFE achieved the highest OA on the four datasets. Specifically, the utilization of MHSA only allows for the capture of global information. However, the combination of MGAFE and MGEFE in GLMGT can not only extract multiscale local features but also capture global spatial and spectral dependencies, thus significantly enhancing the classification results.

6) Ablation Analysis of GFFM: To investigate the effectiveness of GFFM, we replaced GFFM with the MLP in the GLMGT network. Compared with the MLP, the gating mechanism in GFFM can control the delivery of useful information, and DWConv

TABLE XXIV VI. CONCLUSION


OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT EMBEDDING
MODULE CONFIGURATIONS In this article, a novel and effective global-local multigran-
ularity transformer (GLMGT) network is designed for HSI
classification, leveraging the advantages of CNN and trans-
former to extract discriminative global–local features from both
spatial and spectral dimensions of HSI. Wherein, two multi-
granularity feature extraction blocks (MGAFE and MGEFE)
consecutively extract multigranularity features from the rich
TABLE XXV spectral signatures and spatial contexts in HSI. Experiments per-
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT ATTENTION
MODULE CONFIGURATIONS formed on seven datasets indicate the superiority of the GLMGT
model.

REFERENCES
[1] S. Peyghambari and Y. Zhang, “Hyperspectral remote sensing in lithologi-
cal mapping, mineral exploration, and environmental geology: An updated
review,” J. Appl. Remote Sens., vol. 15, no. 3, pp. 031501–031501, 2021.
TABLE XXVI [2] M. Shimoni, R. Haelterman, and C. Perneel, “Hypersectral imaging for
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT military and security applications: Combining myriad processing and
FEED-FORWARD MODULE CONFIGURATIONS sensing techniques,” IEEE Geosci. Remote Sens. Mag., vol. 7, no. 2,
pp. 101–117, Jun. 2019.
[3] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sens-
ing images with support vector machines,” IEEE Trans. Geosci. Remote
Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.
[4] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, “Investigation
of the random forest framework for classification of hyperspectral
data,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 492–501,
Mar. 2005.
TABLE XXVII
[5] L. Ma, M. M. Crawford, and J. Tian, “Local manifold learning-
OA (%) ACROSS IP, UP, HT, AND KSC DATASETS FOR DIFFERENT
based k-nearest-neighbor for hyperspectral image classification,” IEEE
CONNECTION CONFIGURATIONS OF MGAFE AND MGEFE
Trans. Geosci. Remote Sens., vol. 48, no. 11, pp. 4099–4109,
Nov. 2010.
[6] T. Zhao et al., “Artificial intelligence for geoscience: Progress, challenges
and perspectives,” Innovation, vol. 5, no. 5, 2024, Art. no. 100691.
[7] Y. Chen, X. Zhao, and X. Jia, “Spectral–spatial classification of hyperspec-
tral data based on deep belief network,” IEEE J. Sel. Topics Appl. Earth
Observ. Remote Sens., vol. 8, no. 6, pp. 2381–2392, Jun. 2015.
[8] X. Yuan, B. Huang, Y. Wang, C. Yang, and W. Gui, “Deep learning-based
feature representation and its application for soft sensor modeling with
variable-wise weighted SAE,” IEEE Trans. Ind. Inform., vol. 14, no. 7,
To verify the effectiveness of GFFM, we replace GFFM with MLP in the GLMGT network. Compared with MLP, the gating mechanism in GFFM can control the delivery of useful information, and DWConv can complement the local details in the feed-forward module. Table XXVI shows the classification performance of the GLMGT model with different feed-forward modules. It is observed that employing GFFM results in a notable enhancement of the OA value, indicating that the performance improvement stems from the design of the gating mechanism and the enhancement of local spatial information.
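As an illustration of this design, the following is a minimal sketch of a gated feed-forward module with depthwise convolution, assuming an expansion ratio, layer names, and a 3 × 3 kernel of our own choosing; it shows the general gating pattern, not the exact GFFM implementation.

```python
# Minimal sketch of a gated feed-forward module with DWConv.
# Expansion ratio, kernel size, and names are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        # Pointwise projection produces both a content and a gate branch.
        self.proj_in = nn.Conv2d(dim, hidden * 2, kernel_size=1)
        # Depthwise convolution reintroduces local spatial detail.
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)
        self.proj_out = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):                            # x: (B, C, H, W)
        content, gate = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        # The gate controls how much of the content is passed on.
        return self.proj_out(content * F.gelu(gate))
```

The MLP baseline in Table XXVI essentially corresponds to dropping both the gate and the depthwise convolution, leaving only two pointwise projections with a nonlinearity in between.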
7) Ablation Analysis of the Connection Configuration of MGAFE and MGEFE: In the GLMGT network, the MGAFE and MGEFE blocks are connected in a series configuration. To assess the impact of the series and parallel configurations of these two blocks on the classification accuracy, we performed additional ablation experiments. Specifically, in the parallel configuration, the MGAFE and MGEFE blocks are computed in parallel, and their outputs are directly added together. We also investigated how different cascading orders of the MGAFE and MGEFE blocks influence classification performance. As shown in Table XXVII, the series configuration where the MGAFE block precedes the MGEFE block consistently yielded the highest OA across the four datasets. In contrast, the alternative series configuration (spectral first, then spatial) and the parallel configuration were less effective. Based on these results, we have finalized the network architecture with the MGAFE block preceding the MGEFE block in sequence.
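The configurations compared in Table XXVII can be summarized with the following sketch, in which spatial_block and spectral_block are hypothetical stand-ins for MGAFE and MGEFE; only the wiring differs between the variants.

```python
# Series vs. parallel wiring of two feature-extraction blocks.
# spatial_block / spectral_block are stand-ins for MGAFE / MGEFE;
# both are assumed to preserve the feature shape.
import torch.nn as nn

class SeriesSpatialFirst(nn.Module):
    """Adopted configuration: MGAFE first, then MGEFE."""
    def __init__(self, spatial_block, spectral_block):
        super().__init__()
        self.spatial, self.spectral = spatial_block, spectral_block

    def forward(self, x):
        return self.spectral(self.spatial(x))

class ParallelAdd(nn.Module):
    """Ablated alternative: both blocks run on the same input."""
    def __init__(self, spatial_block, spectral_block):
        super().__init__()
        self.spatial, self.spectral = spatial_block, spectral_block

    def forward(self, x):
        return self.spatial(x) + self.spectral(x)
```

One plausible reading of this result is that the spectral block benefits from operating on features that already encode spatial context, which the parallel wiring cannot provide.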
VI. CONCLUSION

In this article, a novel and effective global–local multigranularity transformer (GLMGT) network is designed for HSI classification, leveraging the advantages of CNN and transformer to extract discriminative global–local features from both the spatial and spectral dimensions of HSI. Within the network, two multigranularity feature extraction blocks (MGAFE and MGEFE) consecutively extract multigranularity features from the rich spectral signatures and spatial contexts in HSI. Experiments performed on seven datasets indicate the superiority of the GLMGT model.
Zhe Meng (Member, IEEE) received the B.S. degree in electronic information engineering and the Ph.D. degree in circuits and systems from Xidian University, Xi’an, China, in 2014 and 2020, respectively.
He is currently a Lecturer with the School of Communications and Information Engineering & School of Artificial Intelligence, Xi’an University of Posts and Telecommunications. His research interests include deep learning and hyperspectral image classification.
Qian Yan received the B.S. degree from Xi’an University of Posts and Telecommunications, Xi’an, China, in 2022, where she is currently working toward the master’s degree in information and communication engineering.
Her research interests include deep learning and hyperspectral image classification.
Feng Zhao (Member, IEEE) received the B.S. degree in computer science and technology from Heilongjiang University, Heilongjiang, China, in 2004, the M.S. degree in signal and information processing from the Xi’an University of Posts and Telecommunications, Xi’an, China, in 2007, and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi’an, in 2010.
She has been a Professor with the School of Communications and Information Engineering and School of Artificial Intelligence, Xi’an University of Posts and Telecommunications, since 2015. She has authored or coauthored more than 30 articles and two books. Her research interests include fuzzy information processing, pattern recognition, and image processing.
Dr. Zhao was a recipient of the New-Star of Young Science and Technology Award supported by Shaanxi, in 2014, and the IET International Conference on Ubi-media Computing Best Paper Award, in 2012.

Gaige Chen was born in 1985. He received the Ph.D. degree in mechanical engineering from Xi’an Jiaotong University, Xi’an, China.
He is currently an Associate Professor with the School of Communications and Information Engineering and School of Artificial Intelligence, Xi’an University of Posts and Telecommunications. His research interests include multisensor data fusion and complex equipment prognostics.

Wenqiang Hua (Member, IEEE) received the B.E. degree in electronic science and technology from the University of Electronic Science and Technology of China, Chengdu, China, in 2012, and the Ph.D. degree in circuits and systems from the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an, China, in 2018.
He is currently an Associate Professor with the School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an. His research interests include machine learning, deep learning, and PolSAR image processing.

Miaomiao Liang received the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi’an, China, in 2018.
She is currently an Associate Professor with the School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, China. Her research interests include computer vision, machine learning, and hyperspectral image processing.
