
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 47, NO. 3, MARCH 2025

Fully-Connected Transformer for Multi-Source Image Fusion

Xiao Wu, Zi-Han Cao, Ting-Zhu Huang, Member, IEEE, Liang-Jian Deng, Senior Member, IEEE, Jocelyn Chanussot, Fellow, IEEE, and Gemine Vivone, Senior Member, IEEE

Abstract—Multi-source image fusion combines the information coming from multiple images into one result, thus improving imaging quality. This topic has aroused great interest in the community. How to integrate information from different sources is still a big challenge, although the existing self-attention based transformer methods can capture spatial and channel similarities. In this paper, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, where the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation, embedding it into the FCSA framework as a non-local prior within an optimization problem. Several different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former exhibits robust and superior performance, showing its capability of faithfully preserving information.

Index Terms—Transformer, multilinear algebra, model-driven neural network, multi-source image fusion, multispectral and hyperspectral image fusion, remote sensing pansharpening, visible and infrared image fusion.

I. INTRODUCTION

THE use of deep learning technology for the analysis and processing of biomedical and image information has become an important research direction [1], [2], [3]. Multi-source image fusion is widely applied to various image processing problems, such as image fusion [4], [5], [6], image denoising and reconstruction [7], [8], [9], and image enhancement [10], and is further applied to high-level computer vision tasks, such as classification [11], object detection [12], [13], and medical diagnosis [14]. Differently from the incomplete information that can be captured by a single device, a multi-source imaging system can better describe the information in the scene, e.g., combining visible and hyperspectral images, or thermal infrared images in night scenes, as well as panchromatic and multispectral data, digital images, etc. Hence, multi-source image fusion (MSIF) can be divided into several research fields, such as multispectral and hyperspectral image fusion (MHIF) [15], [16], visible and infrared image fusion (VIS-IR) [17], [18], remote sensing pansharpening [19], [20], [21], [22], multi-focus image fusion, and multi-exposure image fusion. The fused image preserves spatial and spectral information for MHIF and remote sensing pansharpening, while for VIS-IR and digital photographic image fusion the complementary features of the two images are fused to avoid the influence of the shooting environment on the camera.
Recently, deep-learning techniques have obtained increasing attention, clearly outperforming the latest model-based methods [27], [28]. Classic CNN-based methods [29], [30] adopt single-scale [31] or multi-scale structures [32], [33], [34] to learn high-quality information for various vision tasks. However, in the aforementioned approaches, the network structure determines whether the information in the data can be fully extracted. Researchers have also devoted attention to model-driven neural network techniques that offer both good interpretability and superior generalization capabilities, getting state-of-the-art results. Model unfolding methods [35], [36] represent an example in this class. These approaches involve the transformation of a linear observation model through a certain variable replacement (i.e., the half-quadratic splitting (HQS) [37], [38] and the alternating direction method of multipliers (ADMM) algorithm [39]). Afterwards, the transformed model is converted into a learnable network structure, thus endowing the traditional method with a nonlinear representation [40], [41]. As prior knowledge, the deep [42] and autoencoder priors [43] impose local priors. A non-local method has been proposed in [44] using non-local priors for model-driven neural networks.

Received 20 November 2023; revised 5 October 2024; accepted 12 December 2024. Date of current version 5 February 2025. This work was supported in part by the NSFC under Grant 12171072 and Grant 12271083, in part by the Natural Science Foundation of Sichuan Province under Grant 2024NSFSC0038, and in part by the National Key Research and Development Program of China under Grant 2020YFA0714001. Recommended for acceptance by J. Wang. (Corresponding authors: Ting-Zhu Huang; Liang-Jian Deng.)
Xiao Wu, Zi-Han Cao, Ting-Zhu Huang, and Liang-Jian Deng are with the School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Jocelyn Chanussot is with Inria, CNRS, Grenoble INP, LJK, Université Grenoble Alpes, 38000 Grenoble, France, and also with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100045, China (e-mail: [email protected]).
Gemine Vivone is with the Institute of Methodologies for Environmental Analysis, CNR-IMAA, 85050 Tito Scalo, Italy, and also with the National Biodiversity Future Center (NBFC), 90133 Palermo, Italy (e-mail: [email protected]).
Our code is available at https://github.com/XiaoXiao-Woo/FC-Former.
This article has supplementary downloadable material available at https://doi.org/10.1109/TPAMI.2024.3523364, provided by the authors.
Digital Object Identifier 10.1109/TPAMI.2024.3523364


Fig. 1. The current existing forms for self-attention along spatial or channel modes. They are built by matrix multiplication, connecting all the other elements
in one mode. To illustrate the information representation of self-attention, we show four key variants: self-attention (SA) [23], window-SA [24], reduced SA [25],
and cross-scale SA [26]. For multi-source image fusion tasks, cross-scale SA can process mutual fusion at different scales. For example, the Query, Q, can be the
LR-HSI, then the matrices K and V can both be the HR-MSI. Hence, the SA can retain domain-specific information from different domains, while simultaneously
disregarding the two internal source paradigms across scales.

However, the aforementioned single-scale networks lack contextual guidance for feature representations. In contrast, multi-scale networks always reduce the spatial resolution of features in the process of feature extraction, using skip-connections to compensate for information loss, thus failing to achieve the expected feature representation for the multi-source image fusion task. Another shortcoming is that CNN-based methods have limited receptive fields and feature representation ability due to static kernels for feature extraction [45], [46]. Recent exploration into the self-attention (SA) mechanism within transformers, as elaborated by Vaswani et al. [23], seeks to unveil latent non-local relationships across specific dimensions (or modes). More specifically, transformer-based methods [24], [47], [48] exploit corresponding non-local information by computing the response of a given pixel along a specific dimension (or mode). Transformer methods proposed in the field of multi-source image fusion [49], [50] capture domain-related non-local information in both the spatial and spectral domains. However, the quality of the fused images is limited due to a lack of multi-dimensional information. Therefore, researchers developed various forms of self-attention and performed matrix multiplication among three factors (Query, Key, and Value) along different dimensions (or modes) within intra-scales (aka in-scale) and cross-scales, i.e., spatial self-attention, channel self-attention, and hybrid self-attention, as shown in Fig. 1. Regarding spatial in-scale self-attention, each spatial element is connected to all other elements while integrating channel information, without being able to perceive the channel information of each element. Besides, some hybrid self-attention methods combine different vertical and horizontal self-attention paths to model pixel relations in all dimensions [51], [52]. Since the in-scale self-attention captures intrinsic similarity, it cannot learn cross-scale patch similarity, leading to reduced accuracy. Accordingly, Mei et al. [53] explored in-scale and cross-scale self-attention in an independent connection module. Zhou et al. [54] proposed a cross-scale self-attention to build spatial similarity between two feature resolutions of the image. Instead, NLRN [26] directly adopts a non-local framework as soft block matching, with the Euclidean distance and a kernel function to measure the spatial self-similarity. These works just verify that cross-scale patch similarity widely exists in a single dimension (mode) of the images.

Fig. 2. The comparison between the existing self-attention mechanism and the proposed fully-connected self-attention framework based on the proposed generalized self-attention scheme.

Although the above-mentioned papers provided relevant contributions, they show some shortcomings in feature representation. On the one hand, self-attention just achieves preliminary similarities for one or more unfolded dimensions (modes). This leads to a lack of multi-dimensional information. On the other hand, in-scale and cross-scale self-attentions are independent, and thus not unified in a mathematical mechanism.

In this work, we derive a generalized version of self-attention from the computational process of self-attention in terms of multilinear algebra [55], [56]. Based on the proposed generalized self-attention mechanism, the form of self-attention can be further extended, getting the so-called fully-connected self-attention (FCSA). Fig. 2 depicts the relationship between the proposal and the existing self-attention mechanism. Afterwards, we present a novel architecture for the task of multi-source image fusion (MSIF), i.e., the fully-connected transformer (FC-Former).
The proposed FC-Former adopts three parallel branches for cross-scale fusion, where one branch retains the same resolution as the HR-MSI and serves as the main branch. One of the remaining two branches has the same spatial resolution as the LR-HSI image, and the last one is twice that of the LR-HSI image. The FCSA module successively calculates the in-scale channel self-attention for each branch and performs cross-scale spatial self-attention among branches. Compared with previously developed advanced hybrid non-local self-attention and transformer methods [53], [57], the proposed FCSA method injects the characteristics of intra-branch and inter-branch non-local self-similarity (NSS) into feature maps, connecting each dimension (mode) to learn multi-dimensional information from these feature maps via multilinear products. Moreover, it also achieves intra-scale and cross-scale feature aggregation for MSIF tasks. Overall, our FC-Former can fully consider the differences among features from multi-source images. The resulting feature extraction enables the network to capture more details, leading to more faithful, accurate, and high-quality reconstructions.

Fig. 3. Schematic illustration of the different MSIF tasks, including multispectral and hyperspectral image fusion, visible and infrared image fusion, and remote sensing pansharpening.

The main contributions of this paper are summarized as follows:
1) The proposed generalized self-attention provides a unified framework, relying on a multilinear product among three factors, for the existing self-attention mechanisms.
2) We propose a novel fully-connected self-attention framework (FCSA). The FCSA framework overcomes the limitations of self-attention in terms of multi-dimensional and domain-related characterizations. The proposed FCSA method can fully exploit the characteristics of feature maps among parallel branches, such as cross-scale and intra-scale, local, and non-local self-similarity.
3) We propose a novel architecture, called FC-Former, which is the first fully-connected self-attention network with multi-scale feature representation. Benefiting from the information fidelity of high-resolution branches, our model achieves state-of-the-art performance for several MSIF tasks as shown in Fig. 3, i.e., multispectral and hyperspectral image fusion (MHIF), visible and infrared (VIS-IR) image fusion, and remote sensing pansharpening. Extensive ablation experiments corroborate the effectiveness of the proposed network. In addition, we also provide digital photographic image fusion results in the supplementary material.
4) The proposed multi-source image representation incorporates and unfolds the fusion problem into the FC-Former. The network can be considered interpretable thanks to the explicit characterization of both image priors and feature representation.

This paper is an extended version of the conference paper in [58], which is the first cross-scale parallel fusion network specifically designed for remote sensing pansharpening, called DCFNet. In this version, we extend the work in [58] from both methodological and application points of view. The related improvements are as follows:
1) The DCFNet shows a trade-off between parameter number and feature representation. To get a win-win situation, we propose the new idea of generalized self-attention, further developing the FCSA framework to fully exploit multiple sources of information.
2) The proposed FCSA framework explores self-attention along different unfolded dimensions (modes), fully considering the differences between spatial and channel features.
3) We develop a model-inspired FC-Former, where the pre-fusion design is replaced by a multi-source input representation embedded as a network prior that improves the outcomes using classical physical constraints.
4) Unlike DCFNet, which is a network tailored to the pansharpening problem, several different applications are considered in this work: multispectral and hyperspectral image fusion (MHIF), visible and infrared image fusion (VIS-IR), remote sensing pansharpening, and digital photographic image fusion.

The rest of the paper is organized as follows. Section II sequentially presents three classes of related works: model-based methods, data-driven methods, and model-driven methods. This section also provides the motivation for the work. Section III introduces the proposed mathematical idea and framework as well as the overall network. In Section IV, we conduct extensive experiments on three MSIF tasks. Furthermore, additional discussions and ablation studies demonstrating the FC-Former's superior performance, efficiency, and low parameter count are reported in Sections V and VI. Finally, concluding remarks are drawn in Section VII.

II. RELATED WORK

A. Model-Based Methods

In the MSIF task, some early methods exploited domain-specific features of source images using linear transformations, see, e.g., component substitution (CS) [59] and multi-resolution analysis (MRA) [60], [61] approaches.

Other methods related to the MSIF problem belong to the variational optimization-based (VO) class. VO approaches yield the unknown fused image by minimizing a given domain-specific optimization problem involving the multi-source images in input. The advantages of VO methods are the better representation of the information and an elevated interpretability. Prior knowledge is introduced by adding a regularization term to address the ill-posed nature of the optimization problem. For example, sparse representation methods in the VO class are related to the building of a dictionary to model (as a prior) the sparsity of image patches [62], [63], [64].
To regularize image gradients, spatial priors impose first-order smoothness on the unknown (fused) image [65], [66]. Some other methods [27], [67] exploit the low-rank property. Subspace analysis [67] and matrix/tensor decomposition [67], [68] have also been used in conjunction with the low-rank property. However, handcrafted priors are not usually enough to represent real-world data accurately.

B. Data-Driven Deep Learning Methods

Deep learning (DL)-based methods have successfully exploited their powerful feature representation capability. DL-based methods can be roughly summarized as convolutional neural network (CNN)-based methods and transformer-based methods. Regarding pansharpening, PNN has been proposed in [69]. It is based on a three-layer CNN to obtain the pansharpened image (HR-MSI). To fuse useful high-frequency information based on physical constraints, in [70], the fusion process is formulated as a linear observation model in which deep and fusion networks are used to extract and fuse features from different source images. Some specialized modules, such as the multi-scale mechanism [49] and the spatial/channel attentions [71], have recently been proposed for the MSIF problem. To enlarge the receptive field, a pixel-adaptive convolution method has been proposed, the so-called LAGNet [72], to exploit pixel-to-pixel similarity in local windows to characterize content-aware features. Bandara et al. [73] designed a cross-attention mechanism to correlate pixel relations for multi-source images in MHIF. Besides, Ma et al. [74] first presented a transformer-based framework for VIS-IR image fusion and digital photographic image fusion, explaining the significance of the transformer's long-distance dependency for image fusion tasks. The study in [75] first blends image matching, fusion, and semantic awareness into the same framework, yielding promising results. Wang et al. [76] leveraged domain knowledge to design a semi-supervised transfer learning method to fuse infrared and visible images. However, the above-mentioned networks are limited by the use of multi-scale and multi-dimensional feature representation from the self-attention mechanism, often resulting in poor fusion performance.

C. Model-Driven Deep Learning Methods

Wang et al. [77] proposed the DBIN model, where the estimation of the exploited observation model and the related fusion process are optimized iteratively and alternately during the reconstruction. Xie et al. [15] proposed the so-called MHFNet to combine a low-rank prior and a complete basis set of HR-HSIs to build the unfolding network. Guo et al. [2] designed a variational gate mechanism to fuse three different similarities of miRNAs via a novel contrastive cross-entropy function. As in the case of classical convolutional networks, where local information is extracted by convolutions, the deep [42] and autoencoder priors [43] also impose local priors for model-based methods.

Non-local self-similarity (NSS) priors have recently been explored in various research fields [23], [78]. The approaches based on the use of these priors consider similar pixels/patches of a given image to exploit the internal redundant information. The self-attention mechanism is a good instance of NSS methods based on long-range dependencies through matrix multiplication. Unlike the feature representation of convolutions, transformers [79] can theoretically expand the receptive field infinitely, thereby correlating different pixels/patches to each other. Transformer methods often demonstrate a superior ability to learn intrinsic features compared to CNN-based approaches. To date, non-local networks [80] and transformer methods [81] represent state-of-the-art mechanisms in computer vision. To encourage joint feature learning across two dimensions (modes), cross-modality transformers [57], [74] have recently been designed to learn better feature representations between two different domains. Wang et al. [44] integrated a data-driven NSS prior and the HQS method, addressing the problem with an optimization-inspired deep neural network.

D. Motivation

In multi-source image fusion, the input data contain rich multi-dimensional information and domain-specific information, namely, local and non-local similarities within or across scales, as well as spatial and spectral information. To fully explore this potential information, we use multilinear algebra to develop the mathematical concepts behind the generalized self-attention mechanism and propose the FCSA framework. The naive self-attention-based methods are often limited to a single dimension or a specific perspective, resulting in the loss of key information from different sources. To solve this problem, the FCSA mechanism can simultaneously integrate and process multi-dimensional feature information from different scales and domains, and fully mine rich details in the image. Then, the FCSA framework can deeply explore the information interaction between various features in the image. Through a parallel branch design, the FCSA framework establishes a fully connected relationship, ensuring the maximum utilization of the potential information in the input data.

MSIF networks often do not get contextual guidance for feature representation, and they usually show a feature extraction phase that reduces the features' spatial resolution; hence, we cannot advise their use for MSIF tasks. Instead, starting from the promising results obtained in our conference paper, we develop, in this work, the so-called FC-Former network based on the FCSA framework to consider feature similarity within and across scales while obtaining discriminative information from different sources.

III. FULLY-CONNECTED TRANSFORMER MODEL

A. Generalized Self-Attention Mechanism

In this section, we first summarize the necessary notations and give several new definitions used in this paper. For the MSIF task, the two input images are defined as $\mathcal{I}_1 \in \mathbb{R}^{H \times W \times c}$ and $\mathcal{I}_2 \in \mathbb{R}^{h \times w \times C}$, respectively. The desired fused image is indicated as $\mathcal{I}_f$, where the scale ratio is $r = H/h$ (e.g., 4 or 8). For the MHIF task, the two source images are the HR-MSI and the LR-HSI, respectively; for visible and infrared image fusion, they are the infrared and the visible images, respectively; and, for remote sensing pansharpening, they are the panchromatic image and the multispectral cube, respectively.
Before introducing the generalized self-attention and the FCSA method, we first describe the classic spatial self-attention (Spa-SA) based on the definition of the batched matrix multiplication. Given the input tensor $\mathcal{X} \in \mathbb{R}^{B \times P \times N \times C}$, the Spa-SA can be formulated as follows:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q^{T}, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K^{T}, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V^{T},$$
$$\mathbf{A} = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right), \qquad \mathbf{Z} = \mathbf{A}\mathbf{V} + \mathbf{X}, \tag{1}$$

where $\mathbf{W}_{Q,K,V}$ indicates the learnable parameters and $\mathbf{Z}$ represents the output features. The Query is $\mathbf{Q} \in \mathbb{R}^{N_Q \times C_Q}$, determining the spatial size of the output features and of the attention matrix; the Key and Value are $\mathbf{K} \in \mathbb{R}^{N_K \times C_Q}$ and $\mathbf{V} \in \mathbb{R}^{N_K \times C_V}$, defining the size of the attention matrix and the output channels of $\mathbf{Z}$, respectively. $\mathbf{K}$ and $\mathbf{V}$ must have the same spatial size, i.e., $N_K = N_V = N$. We assume that $\mathbf{Q}$ and $\mathbf{K}$ are $d$-dimensional vectors. The attention $\mathbf{A}$ relies upon the dot product between $\mathbf{Q}$ and $\mathbf{K}$ to get spatial self-similarity, which influences $\mathbf{V}$ and vice versa. Thanks to the self-attention mechanism, transformers can achieve self-similarity along a specific dimension (mode).

Next, we introduce the generalized self-attention mechanism through the following new definitions and theorems.

Definition 1 (Tensor Blocking): For a 4th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}$, a window $(q \times q)$ is set to be centered at each spatial location. Tensor blocking with stride $(s \times s)$ generates a blocking tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times P \times q \times q \times I_2}$. Thus, we have:

$$\mathcal{T} = \mathrm{unfold}_{(s \times s)}^{(q \times q)}(\mathcal{X}), \tag{2}$$

where $P$ denotes the number of patches and satisfies $P = \prod_{i=3}^{4} \frac{I_i - q + 2p}{s}$, with $p$ being the border padding. The unfold operator is implemented in PyTorch [82] with fast runtime.

Definition 2 (Batched Mode-$k$ Unfolding): Given an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and a reordering vector $n = (n_1, n_2, \ldots, n_N)$, the batched mode-$k$ unfolding of $\mathcal{X}$ is defined as $\mathbf{X}_{[n;k]} \in \mathbb{R}^{\prod_{i=1}^{k-1} I_{n_i} \times \prod_{j=k+1}^{N} I_{n_j} \times I_{n_k}}$ ($1 < k \leq N$, $k \in \mathbb{Z}$):

$$\mathbf{X}_{[n;k]}(i_{n_1} \cdots i_{n_{k-1}},\, i_{n_{k+1}} \cdots i_{n_N},\, i_{n_k}) = \mathrm{reshape}(\mathcal{X}, [I_{n_1} \cdots I_{n_{k-1}},\, I_{n_{k+1}} \cdots I_{n_N},\, I_{n_k}]), \quad 1 < k < N,$$
$$\mathbf{X}_{[n;N]}(i_{n_1} \cdots i_{n_{N-2}},\, i_{n_{N-1}},\, i_{n_N}) = \mathrm{reshape}(\mathcal{X}, [I_{n_1} \cdots I_{n_{N-2}},\, I_{n_{N-1}},\, I_{n_N}]), \quad k = N, \tag{3}$$

and its inverse operator yields $\mathcal{X} = \mathrm{reshape}(\mathbf{X}_n, [I_{n_1}, I_{n_2}, \ldots, I_{n_N}])$ via the indices $n = (n_1, n_2, \ldots, n_N)$. Definition 2 also naturally yields the tensor permutation $\mathcal{X} = \mathcal{X}(i_{n_1}, i_{n_2}, \ldots, i_{n_N})$ based on the vector $n$.

Definition 3 (Batched Tensor Product): Suppose that $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ is an $M$th-order tensor and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ is an $N$th-order tensor, and that $m = (m_1, m_2, \ldots, m_M)$ and $n = (n_1, n_2, \ldots, n_N)$ are vectors satisfying $I_{m_i} = J_{n_i}$ for $i = 1, 2, \ldots, k$. The batched tensor product between $\mathcal{X}$ and $\mathcal{Y}$ along mode $k$ ($1 < k \leq \min(M, N)$, $k \in \mathbb{Z}$) in matrix form is as follows:

$$\mathbf{Z} = \mathcal{X} \times_{n_1, n_2, \ldots, n_k}^{m_1, m_2, \ldots, m_k} \mathcal{Y}, \tag{4}$$

where the size of $\mathbf{Z}$ is $\prod_{i=1}^{k-1} I_{m_i} \times \prod_{j=k+1}^{M} I_{m_j} \times \prod_{j=k+1}^{N} J_{n_j}$ for $1 < k < \min(M, N)$, $k \in \mathbb{Z}$, or $\prod_{i=1}^{M-2} I_{m_i} \times I_{N-1} \times J_{N-1}$ for $k = \min(M, N)$. The last dimensions of $\mathcal{X}$ and $\mathcal{Y}$ are contracted. The batched tensor product is the batched format of the multilinear product and requires $I_{m_i} = J_{n_i}$ for $i = 1, 2, \ldots, k$ when $k < \min(M, N)$, or $I_{m_i} = J_{n_i}$ for $i = 1, 2, \ldots, k-2, k$ when $k = \min(M, N)$. The associative and commutative properties are not satisfied.

In Fig. 4(a), we give an illustration of the above definitions. Below, we introduce the theorems of the generalized self-attention mechanism.

Theorem 1: Suppose that $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ are two tensors. Then, we have:
1) $\mathbf{Y}^{T}(i_{n_{1:k-1}}, i_{n_{k+1:N}}, i_{n_k}) = \mathcal{Y}(i_{n_{1:k-1}}, i_{n_k}, i_{n_{k+1:N}})$;
2) $\mathbf{Z} = \mathcal{X} \times_{n_1, n_2, \ldots, n_k}^{m_1, m_2, \ldots, m_k} \mathcal{Y} \Leftrightarrow \mathbf{X}_{[m;k]} \mathbf{Y}_{[n;k]}^{T}$.

The interested reader can refer to the supplementary material for the proof of Theorem 1. Theorem 1 describes the relationship between the batched tensor product and the batched matrix multiplication. Below, Theorem 2 considers the self-attention mechanism with two special tensor forms by using the above definitions.

Theorem 2 (Generalized Self-Attention Mechanism): Let us assume an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the learnable parameters $\mathbf{W} \in \mathbb{R}^{I_{i_k} \times J_3}$, and the reordering vector $i = (i_1, i_2, \ldots, i_N)$. The generalized self-attention of $\mathcal{X}$ has three reordered factors $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$ along mode $k$, where $(J_1, J_2, J_3) = \big(\prod_{i=1}^{k-1} I_{i_i}, \prod_{j=k+1}^{N} I_{i_j}, I_{i_k}\big)$. Let us define $m = (m_1, m_2, \ldots, m_N)$ and $n = (n_1, n_2, \ldots, n_N)$ as the indexes of the batched tensor product. The generalized self-attention generates two matrices $\mathbf{A}$ and $\mathbf{Z}$ along the $k$th dimension (mode $k$), which have the following forms:

$$\mathbf{Q} = \mathbf{X}_{[i;k]}\mathbf{W}_Q^{T}, \quad \mathbf{K} = \mathbf{X}_{[i;k]}\mathbf{W}_K^{T}, \quad \mathbf{V} = \mathbf{X}_{[i;k]}\mathbf{W}_V^{T},$$
$$\mathbf{A} = \mathrm{Softmax}\!\left(\frac{\mathbf{Q} \times_{n_1, n_2, \ldots, n_k}^{m_1, m_2, \ldots, m_k} \mathbf{K}}{\sqrt{d}}\right),$$
$$\mathbf{Z} = \mathbf{A} \times_{n_1, n_2, \ldots, n_k}^{1,\, n_{k+1}-k+1,\, n_{k+2}-k+1,\, \ldots,\, n_N-k+1,\, 2} \mathbf{V} + \mathbf{X}_{[i;k]}, \tag{5}$$

where the matrices $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$, and $\mathbf{A}$ are brought back to tensor format by the inverse operator of the batched mode-$k$ unfolding.

The interested reader can refer to the supplementary material for the proof of Theorem 2. Here, by utilizing the tensor blocking operator given in Definition 1 and the batched mode-$k$ unfolding operator in Definition 2, we can sequentially obtain the three factors $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ of the self-attention mechanism. A graphical illustration of the generalized self-attention is given in Fig. 4(b), where a special form of spatial self-attention is shown based on our generalized mechanism. By using the proposed definitions and theorems, we can derive several forms of self-attention. For example, assuming that the input tensor $\mathcal{Y}$ is $\mathbb{R}^{B \times d \times C \times H \times W}$ and transforming the dimensions $H \times W$ into the spatial size $S$, for multi-head spatial self-attention the batched mode-3 unfolding is performed to generate $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V} \in \mathbb{R}^{Bd \times S \times C}$, where $i = (1, 2, 4, 5, 3)$. Afterwards, the batched tensor product is performed for $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, where $m = n = (1, 2, 3)$.
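To make the mode-$k$ view of self-attention concrete, the short PyTorch sketch below shows how the spatial and channel self-attentions of the example above can be obtained from one routine: the input is permuted by a reordering vector, flattened to the three-way form of Definition 2, and attention is computed with a batched matrix product as in (5). This is our own illustrative code (learnable 1x1 projections are omitted), not the authors' released implementation.

```python
import torch


def to_tokens(x: torch.Tensor, order, n_batch: int, n_token: int) -> torch.Tensor:
    """Permute x by `order`, then flatten to (batch', tokens, features): the
    first n_batch permuted dims are merged into the batch, the next n_token
    dims become the attended mode, the remaining dims form the feature mode."""
    x = x.permute(*order).contiguous()
    shape = x.shape
    b = int(torch.prod(torch.tensor(shape[:n_batch])))
    t = int(torch.prod(torch.tensor(shape[n_batch:n_batch + n_token])))
    return x.reshape(b, t, -1)


def mode_self_attention(x, order, n_batch, n_token):
    """Self-attention along the mode selected by the reordering (Eq. (5),
    with the W_Q, W_K, W_V projections dropped for brevity)."""
    t = to_tokens(x, order, n_batch, n_token)                     # (B', N, C)
    q = k = v = t
    a = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return a @ v + t                                              # residual term


y = torch.randn(2, 4, 8, 16, 16)                                  # (B, d, C, H, W)
# spatial SA: tokens are the H*W positions, similarity measured over C
z_spa = mode_self_attention(y, (0, 1, 3, 4, 2), n_batch=2, n_token=2)  # (8, 256, 8)
# channel SA: tokens are the C channels, similarity measured over H*W
z_cha = mode_self_attention(y, (0, 1, 2, 3, 4), n_batch=2, n_token=1)  # (8, 8, 256)
```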
Fig. 4. Graphical illustration of the batched tensor product in Definition 3. Furthermore, we present the spatial self-attention based on the proposed definitions. The tensor blocking of Definition 1 takes precedence over the batched mode-k unfolding.

TABLE I: Some notations used in this paper are summarized here.

For the channel self-attention, we first merge the $H$ and $W$ dimensions to obtain $\mathcal{Y} \in \mathbb{R}^{B \times d \times C \times S}$; then $\mathbf{Y}_{[i;4]}(1, 2, 3, 4)$ yields $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V} \in \mathbb{R}^{Bd \times C \times S}$, which are also derived from the batched mode-4 unfolding. In addition, assuming that the input tensor $\mathcal{Y}$ is $\mathbb{R}^{B \times d \times P \times S \times C}$, the spatial and spectral multi-head self-attention forms are the same as above. Definition 2 gets three factors of self-attention with three dimensions of information, i.e., $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V} \in \mathbb{R}^{Bd \times P \times SC}$, called patch self-attention [83], [84]. Further descriptions of the existing self-attention forms can be found in other related papers [85].

Remark 1: For $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the generalized self-attention generates tensors $\mathbf{A}$ and $\mathbf{Z}$ along the $k$th dimension. In addition, Theorem 2 admits a simplified form that specifies the batched tensor product as $\mathbf{Q} \times_{n_1, n_2, n_3}^{m_1, m_2, m_3} \mathbf{K}$ and $\mathbf{A} \times_{n_1, n_2, n_3}^{1, m_3, 2} \mathbf{V}$, where the inverse operator of the batched mode-$k$ unfolding is not used.

Previous works introduced different forms of self-attention and explored multi-dimensional information based on hybrid structures. In this paper, we exploit multilinear analysis in tensor algebra to generalize these self-attention forms.

B. Fully-Connected Self-Attention Framework

The separated matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are multi-dimensional and domain-related. This information at different modes lacks an effective way to be combined. Based on the generalized self-attention mechanism, we can develop the FCSA framework, which is depicted in Fig. 5. More specifically, we use cross-scale and intra-scale (aka in-scale) self-attention to transfer features among features at different or the same resolutions. Following Theorem 2 and the previous self-attention mechanisms, the FCSA framework applies three $1 \times 1$ Conv2D layers to obtain $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ and calculates the response in the same-resolution branch. Afterwards, we adopt the cross-scale self-attention, which is defined in Theorem 2. Finally, these features are transformed into new features along different modes with different resolutions and channels.

The inputs of the FCSA framework are the high-resolution (HR) feature maps, $\mathcal{F}_H$ and $\mathcal{I}_H$, the medium-resolution (MR) feature maps, $\mathcal{F}_M$ and $\mathcal{I}_M$, and the low-resolution (LR) feature maps, $\mathcal{F}_L$ and $\mathcal{I}_L$. The tensors $\mathcal{I}_H$, $\mathcal{I}_M$, and $\mathcal{I}_L$ represent important source images, such as multispectral and thermal images, which are downsampled to a lower resolution. Afterwards, the FCSA model calculates self-attention along each of their modes. By employing the proposed idea of generalized self-attention, the fully-connected self-attention scheme is as follows:

$$(\mathcal{Z}_H, \mathcal{Z}_M, \mathcal{Z}_L) = \mathrm{FCSA}_{[m_k; n_k]}(\mathcal{X}_H, \mathcal{X}_M, \mathcal{X}_L). \tag{6}$$
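As a concrete (and deliberately simplified) reading of the cross-scale part of (6), the sketch below computes a spatial attention in which the Query tokens come from a high-resolution branch while the Key/Value tokens come from a low-resolution branch, so the output keeps the HR spatial size. Module and variable names, as well as the equal channel count across branches, are assumptions of this sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleSpatialAttention(nn.Module):
    """Query from the HR branch, Key/Value from the LR branch (a sketch).
    The 1x1 convolutions play the role of W_Q, W_K, W_V in Theorem 2."""

    def __init__(self, channels: int):
        super().__init__()
        self.q_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.k_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.v_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_hr: torch.Tensor, f_lr: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_hr.shape
        q = self.q_proj(f_hr).flatten(2).transpose(1, 2)       # (B, H*W, C)
        k = self.k_proj(f_lr).flatten(2).transpose(1, 2)       # (B, h*w, C)
        v = self.v_proj(f_lr).flatten(2).transpose(1, 2)       # (B, h*w, C)
        out = F.scaled_dot_product_attention(q, k, v)           # (B, H*W, C)
        return out.transpose(1, 2).reshape(b, c, h, w) + f_hr   # back to HR map


f_hr, f_lr = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 16, 16)
z_hr = CrossScaleSpatialAttention(32)(f_hr, f_lr)               # (1, 32, 64, 64)
```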
Fig. 5. Illustration of the FCSA framework. The proposed FCSA framework unifies several self-attention mechanisms, such as [49] and [57], and includes their corresponding multilinear product representation. The FCSA framework can facilitate the fusion of local and non-local prior information within and across images from different sources. Note that stages 2 and 3 of the FCSA are simply plotted, not affecting the required tensor format.

Algorithm 1: One Stage of FCSA.

Here, we further use feature branches to represent the input tensors, i.e., $(\mathcal{X}_H, \mathcal{X}_M, \mathcal{X}_L)$. Then, $\mathcal{X}_H = (\mathcal{F}_H, \mathcal{I}_M, \mathcal{I}_L)$, $\mathcal{X}_M = (\mathcal{F}_M, \mathcal{I}_H, \mathcal{I}_L)$, and $\mathcal{X}_L = (\mathcal{F}_L, \mathcal{I}_H, \mathcal{I}_M)$ denote the three factors (i.e., Query, Key, and Value), respectively. $[m_k; n_k]$ is one of the reordering vectors of $m$ and $n$ at mode $k$. For a better understanding of (6), the detailed algorithm is reported in Algorithm 1.

Equation (5) performs the transfer of feature maps at different resolutions. When transferring a lower-resolution branch to a higher-resolution branch, the Query represents the higher-resolution feature maps, while the Key and Value denote the lower-resolution feature maps. Following the above definitions and Theorem 2, lower-resolution feature maps influence higher-resolution feature maps according to the reordering vector $m_k$. Furthermore, these different-resolution feature maps can progressively aggregate new feature maps from the high-to-low and low-to-high branches and transfer the cross-scale feature maps back to the high-resolution branch. In summary, the proposed scheme can enhance feature representation and achieve higher performance.

Remark 2: It is worth remarking that the FCSA framework conducts multi-dimensional self-attention using the generalized self-attention mechanism. The separated matrices, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, are used to calculate two different unfolded self-attentions. This induces both long-range spatial and global channel responses. The FCSA framework retains the transformer's solution while reducing the computational cost and increasing the non-local information. The FCSA can improve the performance of MSIF, as reported in Table VI.

C. Complexity Analysis

Let us transform an $N$th-order tensor, $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_k \times \cdots \times I_N}$, into $I_1 \times \cdots \times I_{k-1} \times S \times C$, where $I_k = S$ or $I_k = C$. Then, we have batched operations for $i < k$, and multilinear product operations for $i \leq k$, $1 < i \leq N$. Therefore, the computational complexity of the FCSA is $O\big(\sum_{i=1}^{k-1} I_i I_k^2 I_{k+1} + \sum_{j=1}^{k-2} I_j I_k^2 I_{k-1}\big)$, that is, $O\big(\sum_{i=1}^{k-1} I_i S^2 C + \sum_{j=1}^{k-2} I_j S C^2\big)$. The computational complexity increases linearly with the size of the image and the number of channels. Besides, self-attention has some (GPU memory) storage costs. The FCSA storage cost, which depends on $S$ and $C$, is $O\big(\sum_{i=1}^{k-1} I_i S^2 + \sum_{i=1}^{k-2} I_i C^2\big)$, consistent with the hybrid self-attention considering both spatial and spectral modes.
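Algorithm 1 itself is not reproduced in this version; the rough sketch below only illustrates, under our own assumptions, how one FCSA stage could wire the three branches as described above: an in-scale channel self-attention on every branch, followed by cross-scale spatial attentions that transfer lower-resolution information back to the higher-resolution Queries (the CrossScaleSpatialAttention module is the one sketched in the previous code example).

```python
# Illustrative wiring of one FCSA stage (Eq. (6)); not the authors' code.
import torch


def channel_self_attention(f: torch.Tensor) -> torch.Tensor:
    """In-scale channel self-attention: tokens are channels, similarity over pixels."""
    b, c, h, w = f.shape
    t = f.flatten(2)                                                    # (B, C, H*W)
    a = torch.softmax(t @ t.transpose(1, 2) / t.shape[-1] ** 0.5, dim=-1)  # (B, C, C)
    return (a @ t).reshape(b, c, h, w) + f


def fcsa_stage(x_h, x_m, x_l, cross_hm, cross_hl, cross_ml):
    """x_h, x_m, x_l: HR/MR/LR feature maps; cross_*: cross-scale attention
    modules (e.g., instances of the CrossScaleSpatialAttention sketch)."""
    # 1) in-scale channel self-attention on every branch
    x_h, x_m, x_l = map(channel_self_attention, (x_h, x_m, x_l))
    # 2) cross-scale spatial attention: lower-resolution branches act as
    #    Key/Value and refine the higher-resolution Queries
    z_h = cross_hl(cross_hm(x_h, x_m), x_l)
    z_m = cross_ml(x_m, x_l)
    z_l = x_l
    return z_h, z_m, z_l
```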
D. Multi-Source Image Representation

Several networks for MSIF can be separated into two parts: deep and fusion sub-networks. The simplest fusion method relies upon just adding or concatenating features. Instead, in this work, we introduce two more elaborate fusion strategies: (a) dynamic branch fusion; (b) model-based branch fusion.

Dynamic Branch Fusion: In our previous work [58], we showed that different resolutions have unequal effects on the fusion results. Thus, feature maps at different resolutions should be reweighed before being combined by the dynamic branch fusion (DBF) module. The DBF method adds fusion coefficients to features at different resolutions and resizes features to the same resolution before the weighted fusion. The DBF method can be widely applied to different fusion scenarios, thereby we chose it as the baseline for the multi-source image representation (MSIR) module.

Model-Based Branch Fusion: The DBF method does not consider physical constraints. Physical constraints are usually introduced by linear observation models [9], [35]. The linear relationships (reflecting prior knowledge) among the input and output data of the MSIF problem are computed by solving an optimization problem. By using MSIR with linear observation models, we can draw the following conclusions.

Lemma 1 (Linear Observation Models for MSIF): Assume the MSIF problem with $\mathbf{X} \in \mathbb{R}^{HW \times S}$ denoting the desired result. The linear observation models, having as information sources the two cubes $\mathbf{Y} \in \mathbb{R}^{hw \times S}$ and $\mathbf{M} \in \mathbb{R}^{HW \times s}$, where $(H, W)$ and $(h, w)$ denote the different spatial sizes (with a scale ratio $r = \frac{H}{h}$), and $S$ and $s$ indicate the number of spectral bands of the two inputs, are as follows:

$$\mathbf{Y} = f_1(\mathbf{X}), \qquad \mathbf{M} = f_2(\mathbf{X}). \tag{7}$$

The functions $f_1(\cdot)$ and $f_2(\cdot)$ represent degradation operators. Thus, the objective function can be formulated as:

$$\mathbf{X} = \arg\min_{\mathbf{X}} \|\mathbf{Y} - f_1(\mathbf{X})\|_F + \|\mathbf{M} - f_2(\mathbf{X})\|_F + \phi(\mathbf{X}). \tag{8}$$

Remark 3: For the MHIF problem, $f_1$ and $f_2$ can be defined as $f_1(\mathbf{X}) = \mathbf{X}\mathbf{B}\mathbf{S}$ and $f_2(\mathbf{X}) = \mathbf{R}\mathbf{X}$, where $\mathbf{B}$, $\mathbf{S} \in \mathbb{R}^{HW \times hw}$, and $\mathbf{R} \in \mathbb{R}^{S \times s}$ denote the blur operator, the downsampling operator, and the spectral response matrix, respectively. $f_1$ and $f_2$ are the spatial and the spectral fidelity terms, respectively. The problem can be solved within the linear least-squares framework [15]. Similar linear relationships hold for other related MSIF problems.

It is worth pointing out that, to obtain the desired fused image, we adopt the proximal gradient algorithm [86] to solve the problem, as follows.

Theorem 3 (Interpretable Model-Based Unfolding Representation [86]): Let an observed image, $\mathbf{Y}$, be the corrupted (or degraded) tensor version, through a function $f(\cdot)$, of an unknown image, $\mathbf{X}$, with the simplified form:

$$\mathbf{X} = \arg\min_{\mathbf{X}} \frac{1}{2}\|\mathbf{Y} - f(\mathbf{X})\|^2 + \phi(\mathbf{X}). \tag{9}$$

There exists a solution $\widehat{\mathbf{X}}$ of the algorithm in the form:

$$\widehat{\mathbf{X}}^{(t+1)} = \mathrm{prox}_{(\lambda\eta\phi)}\big(\mathbf{X}^{(t)} - \eta \nabla g(\mathbf{X}^{(t)})\big), \tag{10}$$

where $\mathrm{prox}_{(\lambda\eta\phi)}(\cdot)$ denotes a proximal operator, $\eta$ is a weighting coefficient, $\nabla g(\cdot)$ is the gradient operator, and $t$ is an iteration index.

Problem (9) requires HQS or ADMM frameworks to be solved. For different regularization terms, including deep network priors, we usually can have a closed-form solution. For the MHIF problem, the gradient of $\mathbf{X}$ is $\nabla g(\mathbf{X}^{(t)}) = (\mathbf{X}^{(t)}\mathbf{D} - \mathbf{Y})\mathbf{D}^{T} + \mathbf{R}^{T}(\mathbf{R}\mathbf{X}^{(t)} - \mathbf{M})$, where the degradation operator is $\mathbf{D} = \mathbf{B}\mathbf{S}$. By separating each sub-problem and transforming it into a specific network form, we can build an optimization-induced deep network through the MSIR block. It allows the network to approximate the proximal operator of a regularizer, not just a denoiser [35], [84].

The main difference among the several fusion problems is how to formulate the fusion model. In this work, we embed the observation model into the FC-Former to realize an interpretable deep network for MSIF. The proposed FC-Former network will be introduced in Section III-E.

Remark 4: It is worth remarking that the optimization-induced neural network is motivated by the linear observation model and its solution. Under the aforementioned framework, the regularization term can provide a non-linear representation in the objective function. This allows us to estimate the solution by exploiting deep learning and physical constraints.
E. Network Architecture

The overall architecture of the fully-connected transformer (FC-Former) is presented in Fig. 6. It consists of three parallel branches: the main HR feature branch, the MR feature branch, and the LR feature branch. More specifically, the three branches are arranged in parallel, and they are progressively combined to form three stages. The main HR feature branch considers $H \times W$ spatial size images from different domains. The MR feature branch receives the MR input and the feature maps from the HR branch. Similarly, the LR feature branch takes an LR input and the feature maps from the above two branches as input. For the inputs in each branch, we design MSIR as the head structure of each branch to aggregate the feature maps transferred from the other branches with the source images, as shown in Fig. 6. From an implementation point of view, inspired by HRNet [34], we chose the residual block and bottleneck as building blocks. The convolution kernel of the residual block of each branch is the same. Finally, the stacked residual blocks are arranged behind the MSIR blocks. Therefore, a complete stage is built to extract better features.

Finally, we train the proposed model on supervised and unsupervised tasks. Let $\mathcal{I} = \{\mathcal{I}_1, \mathcal{I}_2, \ldots\}$ denote the input images from different sources. For the supervised task, we use the mean absolute error (i.e., $\ell_1$ loss, $\mathcal{L}_1$) and the structural similarity index measure (SSIM) as losses [50] ($\mathcal{L}_{\mathrm{SSIM}}$) to calculate the differences between the outputs and the ground-truths (GTs):

$$\Theta = \arg\min_{\Theta}\; \mathcal{L}_1(f_{\Theta}(\mathcal{I}), \mathrm{GT}) + \lambda \mathcal{L}_{\mathrm{SSIM}}(f_{\Theta}(\mathcal{I}), \mathrm{GT}), \tag{11}$$

where $\lambda$ is set to 0.1 to balance the two losses, GT is the ground-truth image, and $f_{\Theta}$ is a non-linear function depending on the learnable parameters $\Theta$.

For unsupervised tasks, we use the intensity loss ($\mathcal{L}_1$), the SSIM loss ($\mathcal{L}_{\mathrm{SSIM}}$), and the texture loss ($\mathcal{L}_{\mathrm{text}}$) to compute the loss between output and input images. The details of the loss functions can be found in related works [17], [74], [87]:

$$\Theta = \arg\min_{\Theta}\; \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_{\mathrm{SSIM}} + \lambda_3 \mathcal{L}_{\mathrm{text}}, \tag{12}$$

where each term $\mathcal{L}$ is $\mathcal{L}(f_{\Theta}(\mathcal{I}), \mathcal{I})$, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters.
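A compact sketch of the two objectives (11) and (12) is given below. Here `ssim_fn` stands for any differentiable SSIM implementation (e.g., from torchmetrics or kornia) and `texture_fn` is a placeholder for the texture loss of the cited works, so the exact terms and weights should be taken from [17], [74], [87] rather than from this sketch.

```python
import torch
import torch.nn.functional as F


def supervised_loss(pred, gt, ssim_fn, lam: float = 0.1):
    """Eq. (11): L1 + lambda * SSIM-based loss between the output and the GT.
    `ssim_fn` is assumed to return a similarity in [0, 1] (higher is better)."""
    return F.l1_loss(pred, gt) + lam * (1.0 - ssim_fn(pred, gt))


def unsupervised_loss(pred, inputs, ssim_fn, texture_fn, lambdas=(1.0, 1.0, 1.0)):
    """Eq. (12): each term is computed between the output and the source images."""
    l1, l2, l3 = lambdas
    loss = pred.new_zeros(())
    for src in inputs:
        loss = loss + l1 * F.l1_loss(pred, src) \
                    + l2 * (1.0 - ssim_fn(pred, src)) \
                    + l3 * texture_fn(pred, src)
    return loss
```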
The FC-Former is summarized in Algorithm 2.

Fig. 6. The overall architecture of FC-Former. The blue boxes represent network stages and the yellow parts denote the FCSA method depicted in Fig. 5.

Algorithm 2: Fully-Connected Transformer Algorithm.

IV. EXPERIMENTS AND RESULTS

We assess the performance of the proposed model on different tasks, i.e., MHIF, VIS-IR image fusion, and remote sensing pansharpening. In addition, digital photographic image fusion tasks (i.e., multi-exposure image fusion and multi-focus image fusion) are shown in the supplementary material. MHIF data have different spatial and spectral resolutions. The VIS-IR image fusion task relies upon unsupervised image fusion (combining data at the same resolution). Finally, the remote sensing pansharpening task considers both simulated and real-world data to fully assess the performance of our method and its generalization ability. The proposed approach is implemented in PyTorch and trained on a workstation with 2 NVIDIA GeForce RTX 3090 GPUs and 128 GB of memory. For the sake of brevity, we selected the results of some representative methods. The interested reader can refer to the supplementary material for all the outcomes.

A. Multispectral and Hyperspectral Image Fusion

Setup: We test the proposed method on two widely used MHIF datasets (i.e., CAVE [106]¹ and Harvard [107]²), considering 15 state-of-the-art techniques: MTF-GLP-HS [88] (JSTARS 2015), CSTF-FUS [108] (TIP 2018), BDSD-PC [59], LTTR [109] (TNNLS 2019), LTMR [67] (TIP 2019), UTV [110] (JSTARS 2020), DBIN [77] (ICCV 2019), SSRNet [89] (TGRS 2021), HSRNet [71] (TNNLS 2021), MoG-DCN [35] (TIP 2021), Fusformer [90] (GRSL 2022), DHIF [91] (TCI 2022), 3DTNet [84] (IF 2023), MIMO-SST [92] (TGRS 2024), and DCINN [93] (IJCV 2024).

¹ http://www.cs.columbia.edu/CAVE/databases/
² http://vision.seas.harvard.edu/hyperspec/

Datasets: We assess the performance on the CAVE and Harvard datasets, simulating a scaling factor of 4/8. Details about the simulation procedure are provided in the supplementary material. We randomly chose 20 samples for the simulated training/validation dataset. The remaining 11 samples are used for testing, i.e., balloons, cd, chart and stuffed toy, clay, fake and real beers, fake and real lemon slices, fake and real tomatoes, feathers, flowers, hairs, and jelly beans.

Results: The results are reported in Table II. For scaling factor 4, we also show the true-color images of the fusion results and the corresponding error maps in Fig. 7. It can be noted that both the details and the color accuracy of the proposed method are the closest to the GT. Besides, the high performance of our technique is also confirmed using scaling factor 8, see Table II. Table II generally shows that our method achieves competitive results compared to the benchmark.

TABLE II: Quantitative results for the MHIF task comparing some representative state-of-the-art approaches.

Fig. 7. In the first row, true-color fused images obtained by the proposed FC-Former and by some representative methods on chart and stuffed toy with scaling factor 4 for the CAVE dataset are depicted. In the second row, the related error maps (calculated between the fused image and the GT) are represented. Some close-ups are also considered.

B. Visible-Infrared Image Fusion

Setup: Since our method is a general model, we can substitute the MHIF fusion task with the visible and infrared (VIS-IR) image fusion problem. The related datasets (i.e., TNO [111]³ and RoadScene [17]⁴) are publicly available.

³ https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029
⁴ https://github.com/hanna-xu/RoadScene

To build the training and testing data, all red-green-blue (RGB) inputs are converted into the YCbCr color space, and then image fusion is performed between the IR image and the luminance (Y) channel.
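A minimal version of that pre/post-processing is sketched below: the IR image is fused with the luminance channel only, and the chrominance channels are restored afterwards. The conversion coefficients are the standard full-range BT.601 ones, and `fuse_y` is a placeholder standing in for the trained FC-Former, so this is an illustration of the workflow rather than the authors' exact pipeline.

```python
import torch


def rgb_to_ycbcr(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: (3, H, W) in [0, 1]; full-range BT.601 conversion."""
    r, g, b = rgb[0], rgb[1], rgb[2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return torch.stack([y, cb, cr])


def ycbcr_to_rgb(ycbcr: torch.Tensor) -> torch.Tensor:
    y, cb, cr = ycbcr[0], ycbcr[1] - 0.5, ycbcr[2] - 0.5
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return torch.stack([r, g, b]).clamp(0.0, 1.0)


def fuse_vis_ir(vis_rgb: torch.Tensor, ir: torch.Tensor, fuse_y) -> torch.Tensor:
    """Fuse the IR image with the luminance channel only, then restore color.
    `fuse_y(y, ir)` is a placeholder for the trained fusion network."""
    ycbcr = rgb_to_ycbcr(vis_rgb)
    fused_y = fuse_y(ycbcr[0:1], ir)                 # (1, H, W) fused luminance
    return ycbcr_to_rgb(torch.cat([fused_y, ycbcr[1:]], dim=0))


# example with a trivial stand-in for the network (pixel-wise maximum)
vis, ir = torch.rand(3, 128, 128), torch.rand(1, 128, 128)
out = fuse_vis_ir(vis, ir, fuse_y=lambda y, i: torch.maximum(y, i))
```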
We compare the proposed FC-Former with 11 representative state-of-the-art methods: NSST [94] (TIM 2018), DenseFuse [95] (TIP 2018), IFCNN [96] (IF 2020), DDcGAN [97] (TIP 2020), U2Fusion [17] (TPAMI 2020), YDTR [87] (TMM 2022), DecompFusion [98] (ECCV 2022), SwinFuse [112] (TIM 2022), SwinFusion [74] (JAS 2022), EMMA [100] (CVPR 2024), and TC-MOA [99] (CVPR 2024).

Datasets: According to [113] and the website⁵, we have 98/38 training/test images for the TNO dataset. For the RoadScene dataset, we randomly selected 190/10/20 pairs for training/validation/test, containing heterogeneous characteristics such as roads, vehicles, and pedestrians. We use the same data augmentation strategy as U2Fusion [17] (i.e., images are randomly cropped to patches of size 64 × 64 with flipping) to enlarge the number of samples.

⁵ https://github.com/Linfeng-Tang/Image-Fusion

Results: The quantitative results related to the TNO and RoadScene datasets are shown in Table III. Five quality metrics are used to assess the performance, i.e., the peak signal-to-noise ratio (PSNR) [114], the SSIM [115], the learned perceptual image patch similarity (LPIPS) [116], Qabf [117], and Qs [118].
TABLE III: Quantitative results for the VIS-IR image fusion task on the TNO and RoadScene datasets. Please refer to Section IV-B for further details.

Fig. 8. Comparison among some representative state-of-the-art methods for the VIS-IR image fusion task. Some close-ups are depicted in yellow and red boxes. No error map is depicted because of the absence of a GT.

It is clear that our FC-Former achieves state-of-the-art performance on both VIS-IR datasets, getting superior SSIM and LPIPS scores and top performance (close to the best) on the PSNR, Qabf, and Qs metrics. Some qualitative results are depicted in Fig. 8(a) and (b). It can be observed that our approach gets high performance, accurately preserving details without introducing issues such as grayscale biases, artifacts, or noise.

C. Remote Sensing Pansharpening

Setup: We compare our method using a publicly available remote sensing pansharpening dataset [6], namely PanCollection, consisting of data acquired by WorldView-3 (WV3), QuickBird (QB), GaoFen-2 (GF2), and WorldView-2 (WV2) sensors.
Fig. 9. Visual comparisons involving some representative pansharpening methods on one example of the reduced resolution WV3 dataset. True-color fused images are depicted in the first row. The second row is devoted to error maps between fused images and the GT. Some close-ups are also reported in yellow and red boxes.

Reduced resolution data are simulated starting from real-world images exploiting Wald's protocol [119]. It is very valuable to perform experiments on remote sensing pansharpening because it allows for a more comprehensive evaluation of model performance in real fusion scenarios. In addition, we compare the proposed approach with 15 state-of-the-art methods. They are divided into four classes [120], i.e., CS, MRA, VO, and DL (CNNs and transformers) methods:⁶
1) Component substitution (CS) methods: BT-H [121] (GRSL 2017) and BDSD-PC [59] (TGRS 2019).
2) Multi-resolution analysis (MRA) approaches: the generalized Laplacian pyramid (GLP) with modulation transfer function (MTF)-matched filters [122] and its full-scale regression version (MTF-GLP-FS) [101] (TIP 2018).
3) A variational optimization-based (VO) technique: LRTCFPan [27] (TIP 2023).
4) DL methods: 11 CNN-based approaches, such as PNN [69] (RS 2016), PanNet [70] (ICCV 2017), MSDCNN [123] (JSTARS 2018), DiCNN [124] (JSTARS 2019), FusionNet [102] (TGRS 2020), LAGNet [72] (AAAI 2021), DCFNet [58] (ICCV 2021), HMPNet [103] (TNNLS 2023), CANNet [104] (CVPR 2024), and PanMamba [105] (arXiv 2024), and one transformer-based technique, i.e., Invformer [57] (AAAI 2022). All the compared DL methods are trained on the same data using the default experimental settings (as suggested in the related papers) for a fair comparison.

Datasets: The dataset can be downloaded online.⁷ We chose WV3, GF2, and QB data for performance assessment, and WV2 data to test network generalization. The number of testing samples for each reduced resolution and full resolution dataset is 20 (i.e., 160 images in total). Please refer to the supplementary material for further details.

⁶ All the obtained results are reported in the supplementary material.
⁷ https://github.com/liangjiandeng/PanCollection
⁸ https://github.com/liangjiandeng/DLPan-Toolbox/tree/main/02-Test-toolbox-for-traditional-and-DL(Matlab)

Results: To evaluate the quality of the proposed method, we use reference and no-reference quality metrics.⁸ The reduced resolution assessment exploits the following reference-based quality metrics: the spatial correlation coefficient [125] (SCC), the spectral angle mapper [126] (SAM), the erreur relative globale adimensionnelle de synthèse [127] (ERGAS), and the Q2n (Q8 for 8-band data and Q4 for 4-band data). From Table IV, the proposed FC-Former outperforms the other methods for almost all metrics, and it is very close to the optimal values for the rest of the cases. Fig. 9 shows the fused results with the related error maps to appreciate the goodness of the outcome of the proposed approach.

To assess the performance on real (full resolution) data, where the reference image is not available, indexes without a reference are used, i.e., the spectral distortion ($D_\lambda$), the spatial distortion ($D_s$), and the hybrid quality with no reference (HQNR) indexes [128]. Table IV reports the average performance on the full resolution (FR) examples for the exploited public dataset. Again, FC-Former obtains the best results on average with the lowest standard deviations, showing its superiority and greater stability. Furthermore, in Fig. 10, we show visual results on a full resolution WV3 example. The outcome of the proposed FC-Former shows more details and better visual quality.
for each reduced resolution and full resolution dataset is 20 (i.e., A. MSIR
160 images in total). Please, refer to the supplementary material We analyze the FC-Former combined with classical fu-
for further details. sion methods. Two model-based fusion methods, model-based
Results: To evaluate the quality of the proposed method, we branch fusion (MBF)-1 and MBF-2, have been adopted. In
use reference and no-reference quality metrics.8 The reduced MBF-1 [53], the mutual-projected fusion is used to replace the
simple addition or concatenation of feature maps. The residual
6 All
between two features from the in-scale and cross-scale branches
the obtained results are reported in the supplementary material.
7 https://github.com/liangjiandeng/PanCollection is downsampled into the original identity branch, transferring
8 https://github.com/liangjiandeng/DLPan-Toolbox/tree/main/02-Test- information from both cross-scale and in-scale features. In
toolbox-for-traditional-and-DL(Matlab) contrast, MBF-2 [35] adopts model-guided fusion to perform

Authorized licensed use limited to: Wuhan University. Downloaded on May 13,2025 at 07:57:11 UTC from IEEE Xplore. Restrictions apply.
WU et al.: FULLY-CONNECTED TRANSFORMER FOR MULTI-SOURCE IMAGE FUSION 2083

TABLE IV
QUANTITATIVE RESULTS FOR THE REMOTE SENSING PANSHARPENING TASK COMPARING SOME REPRESENTATIVE STATE-OF-THE-ART APPROACHES

Fig. 10. Visual comparisons involving some representative pansharpening methods on one example of the full resolution WV3 dataset. True-color fused images
are depicted in the first row. The second row is devoted to the HQNR maps. Some close-ups are also reported in yellow boxes.


TABLE V
COMPARISON OF BRANCH FUSION APPROACHES ON THE CAVE ×4 DATASET

Fig. 11. The PSNR values between the proposed FC-Former and the second best method for the three fusion tasks. The FC-Former achieves robust and superior performance with a small parameter number and FLOPs considering the CAVE ×4, the RoadScene, and the WorldView-3 (WV3) datasets. RR denotes reduced resolution data simulated starting from full resolution (FR) images. The red dotted arrow, in the remote sensing pansharpening case, indicates the performance gain compared to the conference version [58]. The circle radius indicates the parameter number (i.e., the larger the circle, the higher the parameter number).

Fig. 12. PSNR vs. FLOPs for all the high performance methods on MSI/HSI images with sizes of 16 × 16/64 × 64 and 16 × 16/128 × 128 related to the CAVE ×4 and ×8 datasets, respectively. The proposed FC-Former (indicated with a star marker) gets the best PSNR values with a small amount of FLOPs.

the MSIR operation. The MBF-2 explicitly incorporates the observation model into the MHIF problem, where a convolution network is used as a denoiser and guidance. In contrast, the deep network based on the FCSA framework can get a nonlinear representation with a non-local self-similarity prior. The DBF method represents the basic implementation for the MSIR module. Table V shows that the FC-Former has a better performance by using MBFs to deal with multi-source inputs. Finally, it indicates that our FC-Former is an interpretable network, which is capable of combining the advantages of the model-driven and data-driven approaches.
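To make the branch-fusion idea concrete, a rough sketch of the residual transfer described above is given below. This is only an illustrative toy module under assumed shapes (the layer name, channel size, and downsampling operator are placeholders), not the exact MBF-1 implementation of [53].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBranchFusion(nn.Module):
    """Toy branch fusion: the residual between the cross-scale and in-scale
    features is projected, downsampled, and added to the identity branch."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_in, f_cross, f_identity):
        # f_in, f_cross:  (B, C, H, W)     in-scale / cross-scale branch features
        # f_identity:     (B, C, H/2, W/2) identity (coarser) branch feature
        residual = f_cross - f_in                        # disagreement between the two branches
        residual = F.avg_pool2d(self.proj(residual), 2)  # project, then downsample to the identity size
        return f_identity + residual                     # inject the transferred information

# quick shape check
fuse = ResidualBranchFusion(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32), torch.randn(1, 64, 16, 16))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```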
B. Complexity Analysis

It is well known that there is a trade-off between the performance of DL methods and the number of parameters (or computational cost). Fig. 11 shows these trade-offs for 23 state-of-the-art approaches belonging to the considered three fusion tasks. It can be concluded that the proposed FC-Former achieves the best trade-off. Moreover, in Fig. 12, we show that the floating-point operations (FLOPs) of the FC-Former are quite low, yet it achieves the highest performance in both the CAVE test cases.
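As a side note, the parameter counts behind Fig. 11 can be reproduced for any of the compared networks with a few lines of PyTorch. The small CNN below is only a stand-in to demonstrate the counting (it is not FC-Former), and FLOPs are typically measured with an external profiler or estimated analytically per layer.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters (the quantity encoded by the circle radius in Fig. 11)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# stand-in model used only to demonstrate the counting
toy = nn.Sequential(nn.Conv2d(8, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 8, 3, padding=1))
print(f"params: {count_parameters(toy) / 1e3:.1f} K")

# FLOPs of a single convolution can be estimated as roughly
#   2 * C_in * C_out * k * k * H_out * W_out,
# while full-network FLOPs are usually obtained with a profiling tool.
```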
VI. ABLATION STUDY

In this section, without loss of generality, we consider the CAVE ×4 dataset to conduct ablation studies.

A. Fully-Connected Self-Attention

The proposed FCSA framework, corresponding to the multilinear algebra in Section III-A, retains several forms of self-attention along different unfolded modes. To explore the effects of the FCSA framework, we perform ablation studies by considering five improved DCFNet methods and using the original DCFNet as a baseline. Table VI reports the average and the corresponding standard deviation results for the proposed FC-Former approaches. The beneficial effects of adding self-attention to the baseline can be observed: fusion results with self-attention achieve higher performance on all the metrics. When the cross-scale self-attention is the spatial self-attention (Spa-SA) and the in-scale self-attention is the channel self-attention (CAL-SA), the FCSA framework achieves the best results.
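To make the two attention forms compared in Table VI concrete, a minimal single-head channel self-attention (attention computed over the channel mode, in the spirit of the transposed attention of [48]) is sketched below; the spatial counterpart computes the same inner products over the pixel mode instead. This is an illustrative sketch under assumed layer choices, not the exact layer used in FC-Former.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Single-head channel self-attention: the C x C attention map captures
    inter-channel (spectral) similarities instead of pixel-wise ones."""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
        self.out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)  # each (B, C, H, W)
        q = F.normalize(q.flatten(2), dim=-1)  # (B, C, HW)
        k = F.normalize(k.flatten(2), dim=-1)
        attn = (q @ k.transpose(1, 2)).softmax(dim=-1)  # (B, C, C) channel-mode attention
        y = attn @ v.flatten(2)                          # (B, C, HW)
        return self.out(y.view(b, c, h, w)) + x          # residual connection

x = torch.randn(2, 32, 16, 16)
print(ChannelSelfAttention(32)(x).shape)  # torch.Size([2, 32, 16, 16])
```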
B. Spatial Multi-Head Self-Attention

Leveraging the trade-off between global dependency and computational complexity in spatial multi-head self-attention (Spa-MSA), we employ a window-based Spa-MSA to model long-range information along the spatial mode. In Table VII, we compare the effects of different window sizes, reducing the window size from the HR to the LR branch to obtain a long-range response with a flexible range. We chose a window size of 16/8/4 as a good trade-off between computational burden and performance.
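Finally, the windowing step behind Table VII can be sketched as follows, assuming for illustration feature maps in (B, H, W, C) layout whose spatial size is divisible by the window size; self-attention is then computed independently within each window, so the 16/8/4 setting simply grants a larger spatial range to the HR branch than to the MR and LR ones.

```python
import torch

def window_partition(x: torch.Tensor, p: int) -> torch.Tensor:
    """Split (B, H, W, C) features into non-overlapping p x p windows -> (B*nW, p*p, C)."""
    b, h, w, c = x.shape
    x = x.view(b, h // p, p, w // p, p, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)

def window_reverse(win: torch.Tensor, p: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition, back to (B, H, W, C)."""
    b = win.shape[0] // ((h // p) * (w // p))
    x = win.reshape(b, h // p, w // p, p, p, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, -1)

# windows of size 16/8/4 would be used for the HR/MR/LR branches, respectively
feat = torch.randn(1, 64, 64, 32)            # (B, H, W, C) toy HR feature
tokens = window_partition(feat, 16)          # (16 windows, 256 tokens, 32 channels)
print(tokens.shape, window_reverse(tokens, 16, 64, 64).shape)
```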


TABLE VI
ABLATION STUDIES FOR THE FCSA FRAMEWORK USING CROSS-SCALE AND/OR IN-SCALE ATTENTION

TABLE VII
A COMPARISON OF DIFFERENT WINDOW SIZES FOR THE WINDOW-BASED SPA-MSA FOR THE HR, THE MR, AND THE LR BRANCHES

VII. CONCLUSION

In this paper, inspired by multilinear algebra, we proposed the mathematical idea of the generalized self-attention to unify and generalize existing self-attention mechanisms. Based on this generalized mechanism, we developed the first fully-connected self-attention framework that captures intra- and cross-scale patterns, as well as local and non-local similarities. Through theoretical analysis and broad experiments, the proposed FCSA framework addresses the representation issue at different dimensions (modes) and scales while achieving better detail reconstruction and lower computational costs. Afterwards, we built a fully-connected transformer network using the FCSA framework, called FC-Former. In this case, the multi-source image representation module provides support to improve the physical interpretation of the network and to guide the FCSA regularization. FC-Former demonstrated superior performance with high efficiency and low parameters (and computational costs) for MHIF, VIS-IR image fusion, remote sensing pansharpening, and digital photographic image fusion. Thanks to the positive impact of strong feature representations for different fusion tasks, the proposed method can outperform some state-of-the-art methods, specially designed for the above-mentioned problems, demonstrating its usefulness for a wide range of image processing tasks.

REFERENCES

[1] T. Zhou, S. Ruan, and S. Canu, "A review: Deep learning for medical image segmentation using multi-modality fusion," Array, vol. 3, 2019, Art. no. 100004.
[2] Y. Guo, D. Zhou, X. Ruan, and J. Cao, "Variational gated autoencoder-based feature extraction model for inferring disease-MIRNA associations based on multiview features," Neural Netw., vol. 165, pp. 491–505, 2023.
[3] W. Li, Y. Guo, B. Wang, and B. Yang, "Learning spatiotemporal embedding with gated convolutional recurrent networks for translation initiation site prediction," Pattern Recognit., vol. 136, 2023, Art. no. 109234.
[4] X. Zhang and Y. Demiris, "Visible and infrared image fusion using deep learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 10535–10554, Aug. 2023.
[5] G. Vivone, "Multispectral and hyperspectral image fusion in remote sensing: A survey," Inf. Fusion, vol. 89, pp. 405–417, 2023.
[6] L. J. Deng et al., "Machine learning in pansharpening: A benchmark, from shallow to deep networks," IEEE Geosci. Remote Sens. Mag., vol. 10, no. 3, pp. 279–315, Sep. 2022.
[7] W. He et al., "Non-local meets global: An iterative paradigm for hyperspectral image restoration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 2089–2107, Apr. 2022.
[8] Y. Guo, D. Zhou, P. Li, C. Li, and J. Cao, "Context-aware poly(A) signal prediction model via deep spatial–temporal neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 6, pp. 8241–8253, 2022.
[9] X. Deng and P. L. Dragotti, "Deep convolutional neural network for multi-modal image restoration and fusion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3333–3348, Oct. 2020.
[10] S. Yang, D. Zhou, J. Cao, and Y. Guo, "LightingNet: An integrated learning method for low-light image enhancement," IEEE Trans. Comput. Imag., vol. 9, pp. 29–42, 2023.
[11] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, "Deep learning for hyperspectral image classification: An overview," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6690–6709, Sep. 2019.
[12] L. Tang, J. Yuan, and J. Ma, "Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network," Inf. Fusion, vol. 82, pp. 28–42, 2022.
[13] H. Li, X.-J. Wu, and J. Kittler, "RFN-Nest: An end-to-end residual fusion network for infrared and visible images," Inf. Fusion, vol. 73, pp. 72–86, 2021.
[14] D. Liu et al., "Transfusion: Multi-view divergent fusion for medical image segmentation with transformers," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Interv., 2022, pp. 485–495.
[15] Q. Xie, M. Zhou, Q. Zhao, Z. Xu, and D. Meng, "MHF-Net: An interpretable deep network for multispectral and hyperspectral image fusion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 3, pp. 1457–1473, Mar. 2022.
[16] R. Dian, A. Guo, and S. Li, "Zero-shot hyperspectral sharpening," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 10, pp. 12650–12666, Oct. 2023.
[17] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, "U2Fusion: A unified unsupervised image fusion network," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 502–518, Jan. 2022.
[18] Z. Liu, E. Blasch, Z. Xue, J. Zhao, R. Laganiere, and W. Wu, "Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: A comparative study," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 94–109, Jan. 2012.
[19] G. Vivone, M. Dalla Mura, A. Garzelli, and F. Pacifici, "A benchmarking protocol for pansharpening: Dataset, preprocessing, and quality assessment," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 6102–6118, 2021.
[20] Y. Yan, J. Liu, S. Xu, Y. Wang, and X. Cao, "MD3Net: Integrating model-driven and data-driven approaches for pansharpening," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5411116.
[21] Y. Liang, P. Zhang, Y. Mei, and T. Wang, "PMACNet: Parallel multiscale attention constraint network for pan-sharpening," IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 5512805.


[22] S. Deng, L.-J. Deng, X. Wu, R. Ran, and R. Wen, "Bidirectional dilation transformer for multispectral and hyperspectral image fusion," in Proc. Int. Joint Conf. Artif. Intell., 2023, pp. 3633–3641.
[23] A. Vaswani et al., "Attention is all you need," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[24] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10012–10022.
[25] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 568–578.
[26] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, "Non-local recurrent network for image restoration," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 1680–1689.
[27] Z. C. Wu, T. Z. Huang, L. J. Deng, J. Huang, J. Chanussot, and G. Vivone, "LRTCFPan: Low-rank tensor completion based framework for pansharpening," IEEE Trans. Image Process., vol. 32, pp. 1640–1655, 2023.
[28] S. Karim, G. Tong, J. Li, A. Qadir, U. Farooq, and Y. Yu, "Current advances and future perspectives of image fusion: A comprehensive review," Inf. Fusion, vol. 90, pp. 185–217, 2023.
[29] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 136–144.
[30] Y. Wang, L.-J. Deng, T.-J. Zhang, and X. Wu, "SSconv: Explicit spectral-to-spatial convolution for pansharpening," in Proc. ACM Int. Conf. Multimedia, 2021, pp. 4472–4480.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[32] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2117–2125.
[33] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assisted Interv., 2015, pp. 234–241.
[34] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
[35] W. Dong, C. Zhou, F. Wu, J. Wu, G. Shi, and X. Li, "Model-guided deep hyperspectral image super-resolution," IEEE Trans. Image Process., vol. 30, pp. 5754–5768, 2021.
[36] Q. Ma, J. Jiang, X. Liu, and J. Ma, "Deep unfolding network for spatiospectral image super-resolution," IEEE Trans. Comput. Imag., vol. 8, pp. 28–40, 2021.
[37] D. Geman and G. Reynolds, "Constrained restoration and the recovery of discontinuities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 3, pp. 367–383, Mar. 1992.
[38] D. Geman and C. Yang, "Nonlinear image recovery with half-quadratic regularization," IEEE Trans. Image Process., vol. 4, no. 7, pp. 932–946, Jul. 1995.
[39] S. Boyd et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[40] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3929–3938.
[41] K. Zhang, L. V. Gool, and R. Timofte, "Deep unfolding network for image super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3217–3226.
[42] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9446–9454.
[43] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, "Plug-and-play image restoration with deep denoiser prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6360–6376, Oct. 2022.
[44] L. Wang, C. Sun, M. Zhang, Y. Fu, and H. Huang, "DNU: Deep non-local unrolling for computational spectral imaging," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1661–1671.
[45] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, "Dynamic neural networks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 7436–7456, Nov. 2022.
[46] S. Peng, L.-J. Deng, J.-F. Hu, and Y. Zhuo, "Source-adaptive discriminative kernels based network for remote sensing pansharpening," in Proc. Int. Joint Conf. Artif. Intell., 2022, pp. 1283–1289.
[47] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image restoration using swin transformer," in Proc. Int. Conf. Comput. Vis., 2021, pp. 1833–1844.
[48] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, "Restormer: Efficient transformer for high-resolution image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5728–5739.
[49] S. Jia, Z. Min, and X. Fu, "Multiscale spatial–spectral transformer network for hyperspectral and multispectral image fusion," Inf. Fusion, vol. 96, pp. 117–129, 2023.
[50] S.-Q. Deng, L.-J. Deng, X. Wu, R. Ran, D. Hong, and G. Vivone, "PSRT: Pyramid shuffle-and-reshuffle transformer for multispectral and hyperspectral image fusion," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5503715.
[51] M. Li, Y. Fu, and Y. Zhang, "Spatial-spectral transformer for hyperspectral image denoising," in Proc. AAAI Conf. Artif. Intell., 2023, pp. 1368–1376.
[52] Y. Peng, Y. Zhang, B. Tu, Q. Li, and W. Li, "Spatial–spectral transformer with cross-attention for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5537415.
[53] Y. Mei, Y. Fan, Y. Zhou, L. Huang, T. S. Huang, and H. Shi, "Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5690–5699.
[54] S. Zhou, J. Zhang, W. Zuo, and C. C. Loy, "Cross-scale internal graph neural network for image super-resolution," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 3499–3509.
[55] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: Tensorfaces," in Proc. Eur. Conf. Comput. Vis., 2002, pp. 447–460.
[56] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[57] M. Zhou, X. Fu, J. Huang, F. Zhao, A. Liu, and R. Wang, "Effective pan-sharpening with transformer and invertible neural network," IEEE Trans. Geosci. Remote Sens., vol. 60, 2021, Art. no. 5406815.
[58] X. Wu, T.-Z. Huang, L.-J. Deng, and T.-J. Zhang, "Dynamic cross feature fusion for remote sensing pansharpening," in Proc. Int. Conf. Comput. Vis., 2021, pp. 14687–14696.
[59] G. Vivone, "Robust band-dependent spatial-detail approaches for panchromatic sharpening," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6421–6433, Sep. 2019.
[60] G. Vivone et al., "Pansharpening based on semiblind deconvolution," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 1997–2010, Apr. 2015.
[61] G. Vivone, S. Marano, and J. Chanussot, "Pansharpening: Context-based generalized Laplacian pyramids by robust regression," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 9, pp. 6152–6167, Sep. 2020.
[62] N. Akhtar and A. Mian, "Hyperspectral recovery from RGB images using Gaussian processes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 100–113, Jan. 2020.
[63] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[64] Q. Zhang, Y. Liu, R. S. Blum, J. Han, and D. Tao, "Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review," Inf. Fusion, vol. 40, pp. 57–75, 2018.
[65] X. Fu, Z. Lin, Y. Huang, and X. Ding, "A variational pan-sharpening with local gradient constraints," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10265–10274.
[66] L.-J. Deng, G. Vivone, W. Guo, M. Dalla Mura, and J. Chanussot, "A variational pansharpening approach based on reproducible kernel Hilbert space and heaviside function," IEEE Trans. Image Process., vol. 27, no. 9, pp. 4330–4344, Sep. 2018.
[67] R. Dian and S. Li, "Hyperspectral image super-resolution via subspace-based low tensor multi-rank regularization," IEEE Trans. Image Process., vol. 28, no. 10, pp. 5135–5146, Oct. 2019.
[68] T. Xu, T.-Z. Huang, L.-J. Deng, and N. Yokoya, "An iterative regularization method based on tensor subspace representation for hyperspectral image super-resolution," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5529316.
[69] G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa, "Pansharpening by convolutional neural networks," Remote Sens., vol. 8, no. 7, 2016, Art. no. 594.


[70] J. Yang, X. Fu, Y. Hu, Y. Huang, X. Ding, and J. Paisley, "PanNet: A deep network architecture for pan-sharpening," in Proc. Int. Conf. Comput. Vis., 2017, pp. 5449–5457.
[71] J.-F. Hu, T.-Z. Huang, L.-J. Deng, T.-X. Jiang, G. Vivone, and J. Chanussot, "Hyperspectral image super-resolution via deep spatiospectral attention convolutional neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 7251–7265, Dec. 2022.
[72] Z.-R. Jin, T.-J. Zhang, T.-X. Jiang, G. Vivone, and L.-J. Deng, "LAGConv: Local-context adaptive convolution kernels with global harmonic bias for pansharpening," in Proc. AAAI Conf. Artif. Intell., 2022, pp. 1113–1121.
[73] W. G. C. Bandara and V. M. Patel, "HyperTransformer: A textural and spectral feature fusion transformer for pansharpening," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 1767–1777.
[74] J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, and Y. Ma, "SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer," IEEE-CAA J. Automatica Sinica, vol. 9, no. 7, pp. 1200–1217, Jul. 2022.
[75] L. Tang, Y. Deng, Y. Ma, J. Huang, and J. Ma, "SuperFusion: A versatile image registration and fusion network with semantic awareness," IEEE-CAA J. Automatica Sinica, vol. 9, no. 12, pp. 2121–2137, Dec. 2022.
[76] H. Liu, C. Feng, R. Dian, and S. Li, "SSTF-Unet: Spatial–spectral transformer-based U-Net for high-resolution hyperspectral image acquisition," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 12, pp. 18222–18236, Dec. 2023.
[77] W. Wang, W. Zeng, Y. Huang, X. Ding, and J. W. Paisley, "Deep blind hyperspectral image fusion," in Proc. Int. Conf. Comput. Vis., 2019, pp. 4149–4158.
[78] B. Lecouat, J. Ponce, and J. Mairal, "Fully trainable and interpretable non-local sparse models for image restoration," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 238–254.
[79] N. Park and S. Kim, "How do vision transformers work?," 2021, arXiv:2202.06709.
[80] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803.
[81] A. Kolesnikov et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[82] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," 2019, arXiv:1912.01703.
[83] H. Chen et al., "Pre-trained image processing transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12299–12310.
[84] Q. Ma, J. Jiang, X. Liu, and J. Ma, "Learning a 3D-CNN and transformer prior for hyperspectral image super-resolution," Inf. Fusion, vol. 100, 2023, Art. no. 101907.
[85] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Comput. Surv., vol. 54, no. 10s, pp. 1–41, 2022.
[86] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proc. Int. Conf. Mach. Learn., 2010, pp. 399–406.
[87] W. Tang, F. He, and Y. Liu, "YDTR: Infrared and visible image fusion via Y-shape dynamic transformer," IEEE Trans. Multimedia, vol. 25, pp. 5413–5428, 2023.
[88] M. Selva, B. Aiazzi, F. Butera, L. Chiarantini, and S. Baronti, "Hyper-sharpening: A first approach on SIM-GA data," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 3008–3024, Jun. 2015.
[89] X. Zhang, W. Huang, Q. Wang, and X. Li, "SSR-NET: Spatial–spectral reconstruction network for hyperspectral and multispectral image fusion," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 7, pp. 5953–5965, Jul. 2021.
[90] J.-F. Hu, T.-Z. Huang, L.-J. Deng, H.-X. Dou, D. Hong, and G. Vivone, "Fusformer: A transformer-based fusion network for hyperspectral image super-resolution," IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 6012305.
[91] T. Huang, W. Dong, J. Wu, L. Li, X. Li, and G. Shi, "Deep hyperspectral image fusion network with iterative spatio-spectral regularization," IEEE Trans. Comput. Imag., vol. 8, pp. 201–214, 2022.
[92] J. Fang, J. Yang, A. Khader, and L. Xiao, "MIMO-SST: Multi-input multi-output spatial-spectral transformer for hyperspectral and multispectral image fusion," IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5510020.
[93] W. Wang, L.-J. Deng, R. Ran, and G. Vivone, "A general paradigm with detail-preserving conditional invertible network for image fusion," Int. J. Comput. Vis., vol. 132, no. 4, pp. 1029–1054, 2024.
[94] M. Yin, X. Liu, Y. Liu, and X. Chen, "Medical image fusion with parameter-adaptive pulse coupled neural network in nonsubsampled Shearlet transform domain," IEEE Trans. Instrum. Meas., vol. 68, no. 1, pp. 49–64, Jan. 2019.
[95] H. Li and X.-J. Wu, "DenseFuse: A fusion approach to infrared and visible images," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2614–2623, May 2019.
[96] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "IFCNN: A general image fusion framework based on convolutional neural network," Inf. Fusion, vol. 54, pp. 99–118, 2020.
[97] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, "DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion," IEEE Trans. Image Process., vol. 29, pp. 4980–4995, 2020.
[98] P. Liang, J. Jiang, X. Liu, and J. Ma, "Fusion from decomposition: A self-supervised decomposition approach for image fusion," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 719–735.
[99] P. Zhu, Y. Sun, B. Cao, and Q. Hu, "Task-customized mixture of adapters for general image fusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 7099–7108.
[100] Z. Zhao et al., "Equivariant multi-modality image fusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 25912–25921.
[101] G. Vivone, R. Restaino, and J. Chanussot, "Full scale regression-based injection coefficients for panchromatic sharpening," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3418–3431, Jul. 2018.
[102] L.-J. Deng, G. Vivone, C. Jin, and J. Chanussot, "Detail injection-based deep convolutional neural networks for pansharpening," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 8, pp. 6995–7010, Aug. 2021.
[103] X. Tian, K. Li, W. Zhang, Z. Wang, and J. Ma, "Interpretable model-driven deep network for hyperspectral, multispectral, and panchromatic image fusion," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 10, pp. 14382–14395, Oct. 2024.
[104] Y. Duan, X. Wu, H. Deng, and L.-J. Deng, "Content-adaptive non-local convolution for remote sensing pansharpening," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27738–27747.
[105] X. He et al., "Pan-mamba: Effective pan-sharpening with state space model," 2024, arXiv:2402.12192.
[106] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, "Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum," IEEE Trans. Image Process., vol. 19, no. 9, pp. 2241–2253, Sep. 2010.
[107] A. Chakrabarti and T. Zickler, "Statistics of real-world hyperspectral images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2011, pp. 193–200.
[108] S. Li, R. Dian, L. Fang, and J. M. Bioucas-Dias, "Fusing hyperspectral and multispectral images via coupled sparse tensor factorization," IEEE Trans. Image Process., vol. 27, no. 8, pp. 4118–4130, Aug. 2018.
[109] R. Dian, S. Li, and L. Fang, "Learning a low tensor-train rank representation for hyperspectral image super-resolution," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2672–2683, Sep. 2019.
[110] T. Xu, T.-Z. Huang, L.-J. Deng, X.-L. Zhao, and J. Huang, "Hyperspectral image superresolution using unidirectional total variation with tucker decomposition," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 4381–4398, 2020.
[111] A. Toet, "The TNO multiband image data collection," Data Brief, vol. 15, pp. 249–251, 2017.
[112] Z. Wang, Y. Chen, W. Shao, H. Li, and L. Zhang, "SwinFuse: A residual swin transformer fusion network for infrared and visible images," IEEE Trans. Instrum. Meas., vol. 71, 2022, Art. no. 5016412.
[113] L. Tang, H. Zhang, H. Xu, and J. Ma, "Deep learning-based image fusion: A survey," J. Image Graph., vol. 28, no. 1, pp. 3–36, 2023.
[114] Q. Huynh-Thu and M. Ghanbari, "Scope of validity of PSNR in image/video quality assessment," Electron. Lett., vol. 44, no. 13, pp. 800–801, 2008.
[115] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.


[116] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595.
[117] C. S. Xydeas et al., "Objective image fusion performance measure," Electron. Lett., vol. 36, no. 4, pp. 308–309, 2000.
[118] G. Piella and H. Heijmans, "A new quality metric for image fusion," in Proc. IEEE 2003 Int. Conf. Image Process., 2003, pp. III-173.
[119] L. Wald, T. Ranchin, and M. Mangolini, "Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images," Photogrammetric Eng. Remote Sens., vol. 63, no. 6, pp. 691–699, 1997.
[120] G. Vivone et al., "A new benchmark based on recent advances in multispectral pansharpening: Revisiting pansharpening with classical and emerging pansharpening methods," IEEE Geosci. Remote Sens. Mag., vol. 9, no. 1, pp. 53–81, Mar. 2021.
[121] S. Lolli, L. Alparone, A. Garzelli, and G. Vivone, "Haze correction for contrast-based multispectral pansharpening," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. 2255–2259, Dec. 2017.
[122] B. Aiazzi, L. Alparone, S. Baronti, A. Garzelli, and M. Selva, "MTF-tailored multiscale fusion of high-resolution MS and Pan imagery," Photogrammetric Eng. Remote Sens., vol. 72, no. 5, pp. 591–596, 2006.
[123] Q. Yuan, Y. Wei, X. Meng, H. Shen, and L. Zhang, "A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 3, pp. 978–989, Mar. 2018.
[124] L. He et al., "Pansharpening via detail injection based convolutional neural networks," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 4, pp. 1188–1204, Apr. 2019.
[125] J. Zhou, D. L. Civco, and J. A. Silander, "A wavelet transform method to merge Landsat TM and SPOT panchromatic data," Int. J. Remote Sens., vol. 19, no. 4, pp. 743–757, 1998.
[126] R. H. Yuhas, A. F. Goetz, and J. W. Boardman, "Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm," in Proc. Summaries Third Annu. JPL Airborne Geosci. Workshop, 1992, vol. 1, pp. 147–149.
[127] L. Wald, Data Fusion: Definitions and Architectures: Fusion of Images of Different Spatial Resolutions. Vouillé, France: Presses des MINES, 2002.
[128] A. Arienzo, G. Vivone, A. Garzelli, L. Alparone, and J. Chanussot, "Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches," IEEE Geosci. Remote Sens. Mag., vol. 10, no. 3, pp. 168–201, Sep. 2022.

Xiao Wu received the MSc degree from the School of Mathematical Sciences, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2023. He is currently admitted to the UESTC and studies for the PhD degree under prof. Ting-Zhu Huang. His research interests include theories and applications of machine learning and deep learning in image processing.

Zi-Han Cao received the BS degree from the School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2023. He is currently working toward the MS degree under prof. Liang-Jian Deng in the School of Mathematics, University of Electronic Science and Technology of China. His research interests include computer vision, machine learning, and applications on low-level vision tasks including super-resolution, image fusion, and inverse problems.

Ting-Zhu Huang (Member, IEEE) received the BS, MS, and PhD degrees in computational mathematics from the Department of Mathematics, Xi'an Jiaotong University, Xi'an, China, in 1986, 1992, and 2001, respectively. He is currently a professor with the School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, China. His research interests include scientific computation and applications, numerical algorithms for image processing, numerical linear algebra, preconditioning technologies, and matrix analysis with applications. Dr. Huang is an editor of the Scientific World Journal, Advances in Numerical Analysis, the Journal of Applied Mathematics, the Journal of Pure and Applied Mathematics: Advances in Applied Mathematics, and the Journal of Electronic Science and Technology, China.

Liang-Jian Deng (Senior Member, IEEE) received the BS and PhD degrees in applied mathematics from the School of Mathematical Sciences, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2010 and 2016, respectively. He is currently a research fellow with the School of Mathematical Sciences, UESTC. From 2013 to 2014, he was a Joint-Training PhD student with the Case Western Reserve University, Cleveland, OH, USA. In 2017, he was a postdoc with Hong Kong Baptist University (HKBU). In addition, he also stayed with Isaac Newton Institute for Mathematical Sciences, Cambridge University and HKBU for short visits. His research interests include the use of partial differential equations (PDE), optimization modeling, and deep learning to address several tasks in image processing, and computer vision, e.g., resolution enhancement and restoration.

Jocelyn Chanussot (Fellow, IEEE) received the MSc degree in electrical engineering from the Grenoble Institute of Technology (Grenoble INP), Grenoble, France, in 1995, and the PhD degree from the Université de Savoie, Annecy, France, in 1998. Since 1999, he has been with Grenoble INP, where he is currently a professor of signal and image processing. His research interests include image analysis, hyperspectral remote sensing, data fusion, machine learning, and artificial intelligence. Dr. Chanussot was a member of the Institut Universitaire de France from 2012 to 2017. He was the vice-president of the IEEE Geoscience and Remote Sensing Society (GRSS), in charge of meetings and symposia from 2017 to 2019. He was the general chair of the first IEEE GRSS Workshop on Hyperspectral Image and Signal Processing, Evolution in Remote sensing. He is an associate editor for the IEEE Transactions on Geoscience and Remote Sensing, IEEE Transactions on Image Processing and the Proceedings of the IEEE. He was the editor-in-chief of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing from 2011 to 2015. In 2014, he served as a guest editor for the IEEE Signal Processing Magazine. He has been a Highly Cited Researcher (Clarivate Analytics/Thomson Reuters), since 2018.

Gemine Vivone (Senior Member, IEEE) received the BSc and MSc degrees (summa cum laude), and the PhD degree in information engineering from the University of Salerno, Fisciano, Italy, in 2008, 2011, and 2014, respectively. He is a senior researcher with the National Research Council (Italy). His main research interests focus on image fusion, statistical signal processing, deep learning, and classification and tracking of remotely sensed images. Dr. Vivone is a co-chair of the IEEE GRSS Image Analysis and Data Fusion Technical Committee, a member of the IEEE Task Force on "Deep Vision in Space", and he was the Leader of the Image and Signal Processing Working Group of the IEEE Image Analysis and Data Fusion Technical Committee (2020-2021). Dr. Vivone is currently an area editor for Elsevier Information Fusion, and associate editor for IEEE Transactions on Geoscience and Remote Sensing (TGRS) and IEEE Geoscience and Remote Sensing Letters (GRSL). Moreover, he is an Editorial Board Member for Nature Scientific Reports and MDPI Remote Sensing. Dr. Vivone received the IEEE GRSS Early Career Award in 2021, the Symposium Best Paper Award at IEEE International Geoscience and Remote Sensing Symposium (IGARSS), in 2015 and the Best Reviewer Award of the IEEE Transactions on Geoscience and Remote Sensing, in 2017. Moreover, he is listed in the World's Top 2% Scientists by Stanford University.
