Fully-Connected Transformer for Multi-Source Image Fusion
Fig. 1. Existing forms of self-attention along the spatial or channel modes. They are built by matrix multiplication, connecting all the other elements in one mode. To illustrate the information representation of self-attention, we show four key variants: self-attention (SA) [23], window-SA [24], reduced SA [25], and cross-scale SA [26]. For multi-source image fusion tasks, cross-scale SA can process mutual fusion at different scales. For example, the Query, Q, can be the LR-HSI, and the matrices K and V can both be the HR-MSI. Hence, the SA can retain domain-specific information from different domains, while simultaneously disregarding the two internal source paradigms across scales.
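To fix ideas about the cross-scale SA of Fig. 1, where the Query comes from one source (e.g., the LR-HSI) and the Key/Value from another (e.g., the HR-MSI), the following is a minimal PyTorch sketch of attention across two sources. The class, tensor shapes, and token counts are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of cross-source attention: the Query comes from one source
# (e.g., LR-HSI features) and Key/Value from another (e.g., HR-MSI features).
import torch
import torch.nn as nn

class CrossSourceAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_q: torch.Tensor, feat_kv: torch.Tensor) -> torch.Tensor:
        # feat_q:  (B, N_q, C)  tokens from the domain providing the Query
        # feat_kv: (B, N_kv, C) tokens from the domain providing Key and Value
        out, _ = self.attn(feat_q, feat_kv, feat_kv)
        return out + feat_q  # residual connection

# toy usage: 16x16 LR-HSI tokens attend to 64x64 HR-MSI tokens
q_tokens = torch.randn(1, 16 * 16, 48)
kv_tokens = torch.randn(1, 64 * 64, 48)
fused = CrossSourceAttention(48)(q_tokens, kv_tokens)
print(fused.shape)  # torch.Size([1, 256, 48])
```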
spatial priors impose the first-order smoothness on the unknown (fused) image [65], [66]. Some other methods [27], [67] exploit the low-rank property. Subspace analysis [67] and matrix/tensor decomposition [67], [68] have also been used in conjunction with the low-rank property. However, handcrafted priors are usually not enough to represent real-world data accurately.

B. Data-Driven Deep Learning Methods

Deep learning (DL)-based methods have successfully exploited their powerful feature representation capability. DL-based methods can be roughly summarized as convolutional neural network (CNN)-based methods and transformer-based methods. Regarding pansharpening, PNN has been proposed in [69]. It is based on a three-layer CNN to obtain the pansharpened image (HR-MSI). To fuse useful high-frequency information based on physical constraints, in [70], the fusion process is formulated as a linear observed model in which deep and fusion networks are used to extract and fuse features from different source images. Some specialized modules, such as the multi-scale mechanism [49] and the spatial/channel attentions [71], have recently been proposed for the MSIF problem. To enlarge the receptive field, a pixel-adaptive convolution method, the so-called LAGNet [72], has been proposed to exploit pixel-to-pixel similarity in local windows to characterize content-aware features. Bandara et al. [73] designed a cross-attention mechanism to correlate pixel relations for multi-source images in MHIF. Besides, Ma et al. [74] first presented a transformer-based framework for VIS-IR image fusion and digital photographic image fusion, explaining the significance of the transformer's long-distance dependency for image fusion tasks. The study in [75] first blends image matching, fusion, and semantic awareness into the same framework, yielding promising results. Wang et al. [76] leveraged domain knowledge to design a semi-supervised transfer learning method for infrared and visible image fusion. However, the above-mentioned networks are limited in their use of multi-scale and multi-dimensional feature representation within the self-attention mechanism, often resulting in poor fusion performance.

C. Model-Driven Deep Learning Methods

Wang et al. [77] proposed the DBIN model, where the estimation of the exploited observation model and the related fusion process are optimized iteratively and alternately during the reconstruction. Xie et al. [15] proposed the so-called MHFNet to combine a low-rank prior and a complete basis set of HR-HSIs to build the unfolding network. Guo et al. [2] designed a variational gate mechanism to fuse three different similarities of miRNAs via a novel contrastive cross-entropy function. As in the case of classical convolutional networks, where local information is extracted by convolutions, the deep [42] and autoencoder priors [43] also impose local priors for model-based methods.

Non-local self-similarity (NSS) priors have recently been explored in various research fields [23], [78]. The approaches based on the use of these priors consider similar pixels/patches of a given image to exploit the internal redundant information. The self-attention mechanism is a good instance of NSS methods based on long-range dependencies through matrix multiplication. Unlike the feature representation of convolutions, transformers [79] can theoretically expand the receptive field infinitely, thereby correlating different pixels/patches to each other. Transformer methods often demonstrate a superior ability to learn intrinsic features compared to CNN-based approaches. To date, non-local networks [80] and transformer methods [81] represent state-of-the-art mechanisms in computer vision. To encourage joint feature learning across two dimensions (modes), cross-modality transformers [57], [74] have recently been designed to learn better feature representations between two different domains. Wang et al. [44] integrated a data-driven NSS prior and the HQS method, addressing the problem with an optimization-inspired deep neural network.

D. Motivation

In multi-source image fusion, the input data contains rich multi-dimensional information and domain-specific information, namely, local and non-local similarities within or across scales, as well as spatial and spectral information. To fully explore this potential information, we use multilinear algebra to develop the mathematical concepts behind the generalized self-attention mechanism and propose the FCSA framework. Naive self-attention-based methods are often limited to a single dimension or a specific perspective, resulting in the loss of key information from different sources. To solve this problem, the FCSA mechanism can simultaneously integrate and process multi-dimensional feature information from different scales and domains, and fully mine rich details in the image. Then, the FCSA framework can deeply explore the information interaction between various features in the image. By a parallel branch design, the FCSA framework establishes a fully connected relationship, ensuring the maximum utilization of potential information in the input data.

MSIF networks do not often get contextual guidance for feature representation, even showing a feature extraction phase that usually reduces the features' spatial resolution. Hence, we cannot advise such a design for MSIF tasks. Instead, starting from the promising results obtained in our conference paper, we develop, in this work, the so-called FC-Former network based on the FCSA framework to consider feature similarity within and across scales while obtaining discriminative information from different sources.

III. FULLY-CONNECTED TRANSFORMER MODEL

A. Generalized Self-Attention Mechanism

In this section, we first summarize the necessary notations and give several new definitions used in this paper. For the MSIF task, an image and another image are defined as I_1 ∈ R^{H×W×c} and I_2 ∈ R^{h×w×C}, respectively. The desired fused image is indicated as I_f, where the scale ratio is r = H/h (e.g., 4 or 8). For the MHIF task, the source images are the HR-MSI and the LR-HSI, respectively; for visible and infrared image fusion, they are the infrared and the visible images, respectively; and, for remote sensing pansharpening, they are the panchromatic image and the multispectral cube, respectively.
Before introducing the generalized self-attention and the FCSA method, we first describe the classic spatial self-attention (Spa-SA) based on the definition of the batched matrix multiplication. Given the input tensor X ∈ R^{B×P×N×C}, the Spa-SA can be formulated as follows:

Q = X W_Q^\top, \quad K = X W_K^\top, \quad V = X W_V^\top,
A = \mathrm{Softmax}\!\left( \frac{Q K^\top}{\sqrt{d}} \right),
Z = A V + X,   (1)

where W_{Q,K,V} indicates the learnable parameters, Z represents the output features, the Query is Q ∈ R^{N_Q×C_Q} determining the spatial sizes of the output features and of the attention matrix, and the Key and Value are K ∈ R^{N_K×C_Q} and V ∈ R^{N_K×C_V} defining the sizes of the attention matrix and of the output channel of Z, respectively. K and V must have the same spatial size, i.e., N_K = N_V = N. We assume that Q and K are d-dimensional vectors. The attention, A, relies upon the dot product between Q and K to get spatial self-similarity, which influences V and vice versa. Thanks to the self-attention mechanism, transformers can achieve self-similarity along a specific dimension (mode).

Next, we will introduce the generalized self-attention mechanism through the following new definitions and theorems.

Definition 1 (Tensor Blocking): For a 4th-order tensor, X ∈ R^{I_1×I_2×I_3×I_4}, a window (q × q) is set to be centered at each spatial location. Tensor blocking with stride (s × s) generates a blocking tensor, T ∈ R^{I_1×P×q×q×I_2}. Thus, we have:

T = \mathrm{unfold}_{(s\times s)}^{(q\times q)}(X),   (2)

where P denotes the number of patches and satisfies P = \prod_{i=3}^{4} \frac{I_i - q + 2p}{s}, where p is the border padding. The unfold operator is implemented in PyTorch [82] with fast runtime.

Definition 2 (Batched Mode-k Unfolding): Given an Nth-order tensor, X ∈ R^{I_1×I_2×···×I_N}, let n = (n_1, n_2, . . . , n_N) be a reordering vector. The batched mode-k unfolding of X is defined as X_{[n;k]} ∈ R^{(\prod_{i=1}^{k-1} I_{n_i}) × (\prod_{j=k+1}^{N} I_{n_j}) × I_{n_k}} (1 < k ≤ N, k ∈ Z), with

X_{[n;k]}(i_{n_1} i_{n_2} \cdots i_{n_{k-1}},\, i_{n_{k+1}} i_{n_{k+2}} \cdots i_{n_N},\, i_{n_k}) = \mathrm{reshape}(X, [I_{n_1} I_{n_2} \cdots I_{n_{k-1}},\, I_{n_{k+1}} I_{n_{k+2}} \cdots I_{n_N},\, I_{n_k}]), \quad 1 < k < N,
X_{[n;N]}(i_{n_1} i_{n_2} \cdots i_{n_{N-2}},\, i_{n_{N-1}},\, i_{n_N}) = \mathrm{reshape}(X, [I_{n_1} I_{n_2} \cdots I_{n_{N-2}},\, I_{n_{N-1}},\, I_{n_N}]), \quad k = N,   (3)

and its inverse operator yields X = \mathrm{reshape}(X_n, [I_{n_1}, I_{n_2}, . . . , I_{n_N}]) via the indices n = (n_1, n_2, . . . , n_N). Definition 2 also yields a well-defined tensor permutation, X = X(i_{n_1}, i_{n_2}, . . . , i_{n_N}), based on the vector n.

Definition 3 (Batched Tensor Product): Suppose an Mth-order tensor X ∈ R^{I_1×I_2×···×I_M} and an Nth-order tensor Y ∈ R^{J_1×J_2×···×J_N}. Assume that m = (m_1, m_2, . . . , m_M) and n = (n_1, n_2, . . . , n_N) are two vectors satisfying I_{m_i} = J_{n_i} for i = 1, 2, . . . , k. The batched tensor product between X and Y along mode k (1 < k ≤ min(M, N), k ∈ Z) in matrix form is as follows:

Z = X \times_{n_1, n_2, \ldots, n_k}^{m_1, m_2, \ldots, m_k} Y,   (4)

where the size of Z is (\prod_{i=1}^{k-1} I_{m_i}) × (\prod_{j=k+1}^{M} I_{m_j}) × (\prod_{j=k+1}^{N} J_{n_j}) for 1 < k < min(M, N), k ∈ Z, or (\prod_{i=1}^{M-2} I_{m_i}) × I_{N-1} × J_{N-1} for k = min(M, N). The last dimensions of X and Y are contracted. The batched tensor product is the batched format of the multilinear product and requires I_{m_i} = J_{n_i} for i = 1, 2, . . . , k when 1 < k < min(M, N), or I_{m_i} = J_{n_i} for i = 1, 2, . . . , k − 2, k when k = min(M, N). The associative and commutative properties are not satisfied.

In Fig. 4(a), we give an illustration of the above definitions. Below, we will introduce the theorems of the generalized self-attention mechanism.

Theorem 1: Suppose that X ∈ R^{I_1×I_2×···×I_M} and Y ∈ R^{J_1×J_2×···×J_N} are two tensors. Then, we have:
1) Y^\top(i_{n_{1:k-1}}, i_{n_{k+1:N}}, i_{n_k}) = Y(i_{n_{1:k-1}}, i_{n_k}, i_{n_{k+1:N}}),
2) Z = X \times_{n_1, n_2, \ldots, n_k}^{m_1, m_2, \ldots, m_k} Y ⇔ X_{[m;k]} Y_{[n;k]}^\top.

Interested readers can refer to the supplementary material for the proof of Theorem 1. Theorem 1 describes the relationship between the batched tensor product and the batched matrix multiplication. Below, Theorem 2 will consider the self-attention mechanism with two special tensor forms by using the above definitions.

Theorem 2 (Generalized Self-Attention Mechanism): Let us assume an Nth-order tensor, X ∈ R^{I_1×I_2×···×I_N}, the learnable parameters, W ∈ R^{I_{i_k}×J_3}, and the reordering vector, i = (i_1, i_2, . . . , i_N). The generalized self-attention of X has three reordering factors, Q, K, and V ∈ R^{J_1×J_2×J_3}, along mode k, where (J_1, J_2, J_3) = (\prod_{i=1}^{k-1} I_{i_i}, \prod_{j=k+1}^{N} I_{i_j}, I_{i_k}). Let us define m = (m_1, m_2, . . . , m_N) and n = (n_1, n_2, . . . , n_N) as indexes of the batched tensor product. The generalized self-attention generates two matrices A and Z along the kth dimension (mode k), which have the following forms:

Q = X_{[i;k]} W_Q^\top, \quad K = X_{[i;k]} W_K^\top, \quad V = X_{[i;k]} W_V^\top,
A = \mathrm{Softmax}\!\left( \frac{Q \times_{n_1, n_2, \ldots, n_k}^{m_1, m_2, \ldots, m_k} K}{\sqrt{d}} \right),
Z = A \times_{n_1, n_2, \ldots, n_k}^{1,\, n_{k+1}-k+1,\, n_{k+2}-k+1,\, \ldots,\, n_N-k+1,\, 2} V + X_{[i;k]},   (5)

where the matrices Q, K, V, and A perform the inverse operator of the batched mode-k unfolding to return to the tensor format.

Interested readers can refer to the supplementary material for the proof of Theorem 2. Here, by utilizing the tensor blocking operator given in Definition 1 and the batched mode-k unfolding operator in Definition 2, we can sequentially obtain the three factors, Q, K, and V, represented in the self-attention mechanism. A graphic illustration of the generalized self-attention is in Fig. 4(b). A special form of spatial self-attention is shown based on our generalized mechanism.

By using the proposed definitions and theorems, we can derive several forms of self-attention. For example, assuming that the input tensor Y is in R^{B×d×C×H×W} and transforming the dimensions H × W into the spatial size S, for multi-head spatial self-attention, the batched mode-3 unfolding is performed to generate Q, K, and V ∈ R^{Bd×S×C}, where i = (1, 2, 4, 5, 3). Afterwards, the batched tensor product is performed for Q, K, and V, where m = n = (1, 2, 3). For the channel self-attention, we first merge the H and W dimensions to obtain Y ∈ R^{B×d×C×S},
Fig. 4. Graphical illustration of the batched tensor product in Definition 3. Furthermore, we present the spatial self-attention based on the proposed definitions.
The tensor blocking of Definition 1 takes precedence over the batched mode-k unfolding.
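To make the batched mode-k unfolding of Definition 2 and its use in Theorem 2 concrete, the sketch below permutes and merges tensor modes into a (batch, tokens, features) layout and then applies plain scaled dot-product attention, recovering the spatial and channel self-attention examples of Section III-A. The helper names and the single-head, projection-free attention are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's exact code) of the batched mode-k unfolding idea:
# permute/reshape a high-order tensor so one group of modes becomes the token axis,
# then apply ordinary scaled dot-product attention on the resulting matrix stack.
import math
import torch

def mode_unfold(x: torch.Tensor, batch_modes, token_modes, feature_modes) -> torch.Tensor:
    """Group the chosen modes as (batch, tokens, features) and merge each group."""
    sizes = x.shape
    perm = (*batch_modes, *token_modes, *feature_modes)
    b = math.prod(sizes[m] for m in batch_modes)
    t = math.prod(sizes[m] for m in token_modes)
    f = math.prod(sizes[m] for m in feature_modes)
    return x.permute(*perm).reshape(b, t, f)

def self_attention(tokens: torch.Tensor) -> torch.Tensor:
    # single head, no learned projections: only the batched matrix products of Eq. (1)/(5)
    d = tokens.shape[-1]
    attn = torch.softmax(tokens @ tokens.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ tokens + tokens

B, h, C, H, W = 2, 3, 31, 16, 16              # batch, heads, channels, height, width
Y = torch.randn(B, h, C, H, W)

spa = self_attention(mode_unfold(Y, (0, 1), (3, 4), (2,)))   # spatial SA: tokens = H*W pixels
cha = self_attention(mode_unfold(Y, (0, 1), (2,), (3, 4)))   # channel SA: tokens = C channels
print(spa.shape, cha.shape)  # torch.Size([6, 256, 31]) torch.Size([6, 31, 256])
```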
Fig. 5. Illustration of the FCSA framework. The proposed FCSA framework unifies several self-attention mechanisms, such as [49] and [57], and includes their
corresponding multilinear product representation. The FCSA framework can facilitate the fusion of local and non-local prior information within and across images
from different sources. Note that stages 2 and 3 of the FCSA are plotted in simplified form; this does not affect the required tensor format.
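The following is a schematic sketch of the fully-connected idea illustrated in Fig. 5: each branch refines its features with in-scale self-attention and then attends to the other branches with cross-scale attention. It uses standard PyTorch attention to show the wiring only and is not the exact FCSA implementation; the module and variable names are assumptions.

```python
# Schematic wiring of a fully-connected stage over three resolution branches.
import torch
import torch.nn as nn

class FullyConnectedStage(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)])
        self.cross_attn = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)])

    def forward(self, branches):
        # branches: [(B, N_hr, C), (B, N_mr, C), (B, N_lr, C)] token tensors
        out = []
        for i, x in enumerate(branches):
            y = x + self.self_attn[i](x, x, x)[0]             # in-scale (within-branch) attention
            others = torch.cat([b for j, b in enumerate(branches) if j != i], dim=1)
            y = y + self.cross_attn[i](y, others, others)[0]  # cross-scale (cross-branch) attention
            out.append(y)
        return out

hr, mr, lr = (torch.randn(1, n, 48) for n in (64 * 64, 32 * 32, 16 * 16))
outs = FullyConnectedStage(48)([hr, mr, lr])
print([o.shape for o in outs])
```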
Here, we further use feature branches to represent input tensors, i.e., (X_H, X_M, X_L). Then, X_H = (F_H, I_M, I_L), X_M = (F_M, I_H, I_L), and X_L = (F_L, I_H, I_M) denote the three factors (i.e., Query, Key, and Value), respectively. [m_k; n_k] is one of the reordering vectors of m and n at mode k. For a better understanding of (6), the detailed algorithm is reported in Algorithm 1.

Algorithm 1: One Stage of FCSA.

Equation (5) performs the transfer of feature maps at different resolutions. When transferring a lower resolution branch to a higher resolution branch, the Query represents higher resolution feature maps, while Key and Value denote lower resolution feature maps. Following the definitions and Theorem 2, lower-resolution feature maps influence higher-resolution feature maps according to the reordering vector m_k. Furthermore, these different resolution feature maps can progressively aggregate new feature maps from high-to-low and low-to-high branches and transfer the cross-scale feature maps back to high-resolution branches. In summary, the proposed scheme can enhance feature representation and achieve higher performance.

Remark 2: It is worth remarking that the FCSA framework conducts multi-dimensional self-attention using the generalized self-attention mechanism. The separated matrices, Q, K, and V, are used to calculate two different unfolded self-attentions. This induces both the long-range spatial and the global channel responses. The FCSA framework retains the transformer's solution while reducing the computational cost and increasing non-local information. The FCSA can improve the performance of MSIF, as reported in Table VI.

C. Complexity Analysis

Let us transform an Nth-order tensor, X ∈ R^{I_1×···×I_k×···×I_N}, into I_1 × · · · × I_{k−1} × S × C, where I_k = S or I_k = C. Then, we have batched operations for i < k, and multilinear product operations for i ≤ k, 1 < i ≤ N. Therefore, the computational complexity of the FCSA is O(\sum_{i=1}^{k-1} I_i I_k^2 I_{k+1} + \sum_{j=1}^{k-2} I_j I_k^2 I_{k-1}), that is, O(\sum_{i=1}^{k-1} I_i S^2 C + \sum_{j=1}^{k-2} I_j S C^2). The computational complexity linearly increases with the size of the image and the number of channels. Besides, self-attention has some (GPU memory) storage costs. The FCSA storage cost, which depends on S and C, is O(\sum_{i=1}^{k-1} I_i S^2 + \sum_{i=1}^{k-2} I_i C^2), consistently with the hybrid self-attention considering both spatial and spectral modes.

D. Multi-Source Image Representation

Several networks for MSIF can be separated into two parts: deep and fusion sub-networks. The simplest fusion method relies upon just adding or concatenating features. Instead, in this work, we will introduce two more elaborate fusion strategies: (a) dynamic branch fusion; (b) model-based branch fusion.
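As a minimal illustration of the branch aggregation discussed above, and of the reweighting idea behind the dynamic branch fusion introduced next, the sketch below resizes branch features to a common resolution and combines them with learnable per-branch coefficients instead of a plain sum or concatenation. The module and parameter names are illustrative, not the exact DBF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedBranchFusion(nn.Module):
    """Resize incoming branch features to a common resolution and combine them
    with learnable per-branch coefficients (a reweighting of the plain sum)."""
    def __init__(self, num_branches: int):
        super().__init__()
        self.coeff = nn.Parameter(torch.ones(num_branches))

    def forward(self, feats, size):
        # feats: list of (B, C, H_i, W_i) maps coming from different branches
        w = torch.softmax(self.coeff, dim=0)
        resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False) for f in feats]
        return sum(w[i] * f for i, f in enumerate(resized))

hr = torch.randn(1, 48, 64, 64)
mr = torch.randn(1, 48, 32, 32)
lr = torch.randn(1, 48, 16, 16)
fused = WeightedBranchFusion(3)([hr, mr, lr], size=(64, 64))
print(fused.shape)  # torch.Size([1, 48, 64, 64])
```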
Dynamic Branch Fusion: In our previous work [58], we showed that different resolutions have unequal effects on fusion results. Thus, feature maps at different resolutions should be reweighed before being combined by the dynamic branch fusion (DBF) module. The DBF method adds fusion coefficients to features at different resolutions and resizes features to the same resolution before the weighted fusion. The DBF method can be widely applied to different fusion scenarios; thereby, we chose it as the baseline for the multi-source image representation (MSIR) module.

Model-based Branch Fusion: The DBF method does not consider physical constraints. Physical constraints are usually introduced by linear observation models [9], [35]. The linear relationships (reflecting prior knowledge) among input and output data of the MSIF problem are computed by solving an optimization problem. By using MSIR with linear observation models, we can draw the following conclusions.

Lemma 1 (Linear Observation Models for MSIF): Assume the MSIF problem with X ∈ R^{HW×S} denoting the desired results. The linear observed models, having as information sources two cubes Y ∈ R^{hw×S} and M ∈ R^{HW×s}, where (H, W) and (h, w) denote the different spatial sizes (with a scale ratio r = H/h), and S and s indicate the number of spectral bands for the two inputs, are as follows:

Y = f_1(X), \quad M = f_2(X).   (7)

The functions f_1(·) and f_2(·) represent degradation operators. Thus, the objective function can be formulated as:

X = \arg\min_{X} \|Y - f_1(X)\|_F + \|M - f_2(X)\|_F + \phi(X).   (8)

Remark 3: For the MHIF problem, f_1 and f_2 can be defined as f_1(X) = XBS and f_2(X) = RX, where B, S ∈ R^{HW×hw}, and R ∈ R^{S×s} denote the blur operator, the downsampling operator, and the spectral response matrix, respectively. f_1 and f_2 are the spatial and the spectral fidelity terms, respectively. The problem can be solved within the linear least-squares framework [15]. Similar linear relationships hold for other related MSIF problems.

It is worth pointing out that, to obtain the desired fused image, we adopt the proximal gradient algorithm [86] to solve the problem as follows.

Theorem 3 (Interpretable Model-based Unfolding Representation [86]): Let an observed image, Y, be the corrupted (or degraded) tensor version, through a function f(·), of an unknown image, X, having a simplified form as:

X = \arg\min_{X} \frac{1}{2}\|Y - f(X)\|^2 + \phi(X).   (9)

There exists a solution, X̂, of the algorithm, in the form:

\hat{X}^{(t+1)} = \mathrm{prox}_{(\lambda\eta\phi)}(X^{(t)} - \eta \nabla g(X^{(t)})),   (10)

where \mathrm{prox}_{(\lambda\eta\phi)}(·) denotes a proximal operator, η is a weighting coefficient, ∇g(·) is the gradient operator, and t is an iteration index.

Problem (9) requires HQS or ADMM frameworks to be solved. For different regularization terms, including deep network priors, we can usually have a closed-form solution. For the MHIF problem, the gradient of X is: ∇g(X^{(t)}) = (X^{(t)}D − Y)D^\top + R^\top(RX^{(t)} − M), where the degradation operator is D = BS. By separating each sub-problem and transforming it into a specific network form, we can build an optimization-induced deep network through employing the MSIR block. It allows the network to approximate the proximal operator of a regularizer, not just a denoiser [35], [84].

The main difference among several fusion problems is how to formulate a fusion model. In this work, we embed the observed model into the FC-Former to realize an interpretable deep network for MSIF. The proposed FC-Former network will be introduced in Section III-E.

Remark 4: It is worth remarking that the optimization-induced neural network is motivated by the linear observed model and its solution. Under the aforementioned framework, the regularization term can provide a non-linear representation in the objective function. This allows us to estimate the solution by exploiting deep learning and physical constraints.

E. Network Architecture

The overall architecture of the fully-connected transformer (FC-Former) is presented in Fig. 6. It consists of three parallel branches: the main HR feature branch, the MR feature branch, and the LR feature branch. More specifically, the three branches are arranged in parallel, and they are progressively combined to form three stages. The main HR feature branch considers H × W spatial size images from different domains. The MR feature branch receives the MR input and the feature maps from the HR branch. Similarly, the LR feature branch takes an LR input and the feature maps from the above two branches as input. For the inputs in each branch, we design MSIR as the head structure of each branch to aggregate feature maps transferred from other branches with source images, as shown in Fig. 6. From an implementation point of view, inspired by HRNet [34], we chose the residual block and bottleneck as building blocks. The convolution kernel of the residual block of each branch is the same. Finally, the stacked residual blocks are arranged behind the MSIR blocks. Therefore, a complete stage is built to extract better features. We then train the proposed model on supervised and unsupervised tasks. Let I = {I_1, I_2, . . .} denote input images from different sources. Then, for the supervised task, we use the mean absolute error (i.e., ℓ1 loss; L_1) and the structural similarity index measure (SSIM) as losses [50] (L_SSIM) to calculate differences between outputs and ground-truths (GTs):

\Theta = \arg\min_{\Theta} L_1(f_\Theta(I), \mathrm{GT}) + \lambda L_{\mathrm{SSIM}}(f_\Theta(I), \mathrm{GT}),   (11)

where λ is set to 0.1 to balance the two losses, GT is the ground-truth image, and f_\Theta is a non-linear function depending on the learnable parameters Θ.

For unsupervised tasks, we use the intensity loss (L_1), SSIM loss (L_SSIM), and texture loss (L_text) to compute the loss between output and input images. The details of the loss functions can
Fig. 6. The overall architecture of FC-Former. The blue boxes represent network stages and the yellow parts denote the FCSA method depicted in Fig. 5.
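To make the model-based branch fusion concrete, the following sketch implements one unfolded iteration of Eq. (10) for the MHIF observation model: a gradient step on the data term, using the gradient (XD − Y)Dᵀ + Rᵀ(RX − M) given in Section III-D, followed by a learned proximal module. The matrix convention (bands × pixels), the ProxNet stand-in, and all names are illustrative assumptions, not the paper's exact MSIR implementation.

```python
# One unfolded proximal-gradient iteration for a simulated MHIF observation model.
import torch
import torch.nn as nn

class ProxNet(nn.Module):
    """Stand-in for the learned proximal operator (played by the MSIR/FCSA blocks in the paper)."""
    def __init__(self, bands: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bands, hidden), nn.ReLU(), nn.Linear(hidden, bands))

    def forward(self, x):               # x: (bands, pixels)
        return x + self.net(x.t()).t()  # residual refinement applied per pixel

def unfolded_step(X, Y, M, D, R, prox, eta=0.1):
    # gradient of the data term:  (XD - Y) D^T + R^T (RX - M)
    grad = (X @ D - Y) @ D.t() + R.t() @ (R @ X - M)
    return prox(X - eta * grad)

C, c, N, n = 31, 3, 64 * 64, 16 * 16    # HS bands, MS bands, HR pixels, LR pixels
X = torch.randn(C, N)                   # current estimate of the fused image (bands x pixels)
D = torch.randn(N, n) / N               # spatial degradation (blur + downsampling)
R = torch.randn(c, C) / C               # spectral response matrix
Y, M = X @ D, R @ X                     # simulated LR-HSI and HR-MSI observations
X_next = unfolded_step(X, Y, M, D, R, ProxNet(C))
print(X_next.shape)  # torch.Size([31, 4096])
```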
TABLE II
QUANTITATIVE RESULTS FOR THE MHIF TASK COMPARING SOME REPRESENTATIVE STATE-OF-THE-ART APPROACHES
Fig. 7. In the first row, the true-color fused images obtained by the proposed FC-Former and by some representative methods on chart and stuffed toy with scaling factor 4 for the CAVE dataset are depicted. In the second row, the related error maps (calculated between the fused image and the GT) are represented. Some close-ups are also considered.
[96]IF 2020, DDcGAN [97]TIP 2020, U2Fusion [17]TPAMI 2020, YDTR [87]TMM 2022, DecompFusion [98]ECCV 2022, SwinFuse [112]TIM 2022, SwinFusion [74]JAS 2022, EMMA [100]CVPR 2024, and TC-MOA [99]CVPR 2024.

Datasets: According to [113] and the website^5, we have 98/38 training/test images for the TNO dataset. For the RoadScene dataset, we randomly selected 190/10/20 pairs for training/validation/test containing heterogeneous characteristics, such as roads, vehicles, and pedestrians. We use the same data augmentation strategy as U2Fusion [17] (i.e., images are randomly cropped to patches of size 64 × 64 with flipping) to enlarge the number of samples.

Results: The quantitative results related to the TNO and RoadScene datasets are shown in Table III. Five quality metrics are used to assess the performance, i.e., the peak signal-to-noise ratio (PSNR) [114], the SSIM [115], the learned perceptual image patch similarity (LPIPS) [116], Qabf [117], and Qs [118]. It is

^5 https://github.com/Linfeng-Tang/Image-Fusion
TABLE III
QUANTITATIVE RESULTS FOR THE VIS-IR IMAGE FUSION TASK ON THE TNO AND ROADSCENE DATASETS. PLEASE, REFER TO
SECTION IV-B FOR FURTHER DETAILS
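For reference, a minimal sketch of the paired crop-and-flip augmentation described in the experimental setup above (64 × 64 patches with flipping); the function name, shapes, and flip probability are illustrative assumptions, not the authors' pipeline.

```python
# Joint random crop + horizontal flip for a registered visible/infrared pair.
import random
import torch

def random_crop_flip(vis: torch.Tensor, ir: torch.Tensor, patch: int = 64):
    # vis, ir: (C, H, W) registered images; the same crop/flip is applied to both
    _, H, W = vis.shape
    top, left = random.randint(0, H - patch), random.randint(0, W - patch)
    vis_p = vis[:, top:top + patch, left:left + patch]
    ir_p = ir[:, top:top + patch, left:left + patch]
    if random.random() < 0.5:                       # horizontal flip
        vis_p, ir_p = vis_p.flip(-1), ir_p.flip(-1)
    return vis_p, ir_p

vis, ir = torch.randn(1, 256, 320), torch.randn(1, 256, 320)
vp, ip = random_crop_flip(vis, ir)
print(vp.shape, ip.shape)  # torch.Size([1, 64, 64]) torch.Size([1, 64, 64])
```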
Fig. 8. Comparison among some representative state-of-the-art methods for the VIS-IR image fusion task. Some close-ups are depicted in yellow and red boxes.
No error map is depicted because of the absence of a GT.
Fig. 9. Visual comparisons involving some representative pansharpening methods on one example of the reduced resolution WV3 dataset. True-color fused
images are depicted in the first row. The second row is devoted to error maps between fused images and the GT. Some close-ups are also reported in yellow and
red boxes.
real-world images exploiting Wald's protocol [119]. It is very valuable to perform experiments on remote sensing pansharpening because it allows for a more comprehensive evaluation of model performance in real fusion scenarios. In addition, we compare the proposed approach with 15 state-of-the-art methods. They are divided into four classes [120], i.e., CS, MRA, VO, and DL (CNNs and transformers) methods:^6
1) Component substitution (CS) methods: BT-H [121]GRSL 2017 and BDSD-PC [59]TGRS 2019.
2) Multi-resolution analysis (MRA) approaches: the generalized Laplacian pyramid (GLP) with modulation transfer function (MTF)-matched filters [122] and its full-scale regression version (MTF-GLP-FS) [101]TIP 2018.
3) A variational optimization-based (VO) technique: LRTCFPan [27]TIP 2023.
4) DL methods: 11 CNN-based approaches, such as PNN [69]RS 2016, PanNet [70]ICCV 2017, MSDCNN [123]JSTARS 2018, DiCNN [124]JSTARS 2019, FusionNet [102]TGRS 2020, LAGNet [72]AAAI 2021, DCFNet [58]ICCV 2021, HMPNet [103]TNNLS 2023, CANNet [104]CVPR 2024, and PanMamba [105]Arxiv 2024, and one transformer-based technique, i.e., Invformer [57]AAAI 2022. All the compared DL methods are trained on the same data using default experimental settings (as suggested in the related papers) for a fair comparison.

Datasets: The dataset can be downloaded from the public repository.^7 We chose WV3, GF2, and QB data for performance assessment, and WV2 data to test network generalization. The number of testing samples for each reduced resolution and full resolution dataset is 20 (i.e., 160 images in total). Please refer to the supplementary material for further details.

Results: To evaluate the quality of the proposed method, we use reference and no-reference quality metrics.^8 The reduced resolution assessment exploits the following reference-based quality metrics: the spatial correlation coefficient [125] (SCC), the spectral angle mapper [126] (SAM), the erreur relative globale adimensionnelle de synthèse [127] (ERGAS), and the Q2n (Q8 for 8-band data and Q4 for 4-band data). From Table IV, the proposed FC-Former outperforms the other methods for almost all metrics, and it is very close to the optimal values for the rest of the cases. Fig. 9 shows the fused results with the related error maps to appreciate the goodness of the outcome of the proposed approach.

To assess the performance on real (full resolution) data, where the reference image is not available, indexes without reference are used, i.e., the spectral distortion (Dλ), the spatial distortion (Ds), and the hybrid quality with no reference (HQNR) indexes [128]. Table IV reports the average performance on the full resolution (FR) examples for the exploited public dataset. Again, FC-Former obtains the best results on average with the lowest standard deviations, showing its superiority and greater stability. Furthermore, in Fig. 10, we show visual results on a full resolution WV3 example. The outcome of the proposed FC-Former shows more details and a better visual quality.

V. DISCUSSIONS

In this section, we will discuss the components of the proposed FC-Former. Without loss of generality, we consider the MHIF problem using the CAVE ×4 dataset as an example.

A. MSIR

We analyze the FC-Former combined with classical fusion methods. Two model-based fusion methods, model-based branch fusion (MBF)-1 and MBF-2, have been adopted. In MBF-1 [53], the mutual-projected fusion is used to replace the simple addition or concatenation of feature maps. The residual between two features from the in-scale and cross-scale branches is downsampled into the original identity branch, transferring information from both cross-scale and in-scale features. In contrast, MBF-2 [35] adopts model-guided fusion to perform

^6 All the obtained results are reported in the supplementary material.
^7 https://github.com/liangjiandeng/PanCollection
^8 https://github.com/liangjiandeng/DLPan-Toolbox/tree/main/02-Test-toolbox-for-traditional-and-DL(Matlab)
TABLE IV
QUANTITATIVE RESULTS FOR THE REMOTE SENSING PANSHARPENING TASK COMPARING SOME REPRESENTATIVE STATE-OF-THE-ART APPROACHES
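For reference, a small sketch of two of the reference-based metrics listed above (SAM and ERGAS), following their standard definitions (SAM in degrees; ERGAS with the resolution ratio r); the toolbox cited in footnote 8 may differ in implementation details, so this is an illustrative approximation rather than the evaluation code used in the paper.

```python
import torch

def sam(ref: torch.Tensor, fus: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # ref, fus: (C, H, W); mean spectral angle over all pixels, in degrees
    dot = (ref * fus).sum(0)
    angle = torch.acos(torch.clamp(dot / (ref.norm(dim=0) * fus.norm(dim=0) + eps), -1.0, 1.0))
    return torch.rad2deg(angle).mean()

def ergas(ref: torch.Tensor, fus: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    # ref, fus: (C, H, W); relative dimensionless global error in synthesis
    rmse = ((ref - fus) ** 2).flatten(1).mean(1).sqrt()   # per-band RMSE
    mean = ref.flatten(1).mean(1)                         # per-band mean of the reference
    return 100.0 / ratio * torch.sqrt(((rmse / mean) ** 2).mean())

gt = torch.rand(8, 256, 256) + 0.1
pred = gt + 0.01 * torch.randn_like(gt)
print(float(sam(gt, pred)), float(ergas(gt, pred)))
```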
Fig. 10. Visual comparisons involving some representative pansharpening methods on one example of the full resolution WV3 dataset. True-color fused images
are depicted in the first row. The second row is devoted to the HQNR maps. Some close-ups are also reported in yellow boxes.
TABLE V
COMPARISON OF BRANCH FUSION APPROACHES ON THE CAVE ×4 DATASET
Fig. 12. PSNR vs. FLOPs for all the high-performance methods on MSI/HSI images with sizes of 16 × 16/64 × 64 and 16 × 16/128 × 128 related to the CAVE ×4 and ×8 datasets, respectively. The proposed FC-Former (indicated with a star marker) achieves the best PSNR values with a small number of FLOPs.
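A small utility of the kind typically used to produce the efficiency comparisons behind plots such as Fig. 12 is sketched below; only the parameter count is shown, since FLOPs are usually measured with an external profiler. The toy model is purely illustrative.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

toy = nn.Sequential(nn.Conv2d(31, 48, 3, padding=1), nn.ReLU(), nn.Conv2d(48, 31, 3, padding=1))
print(f"{count_parameters(toy) / 1e3:.1f}K parameters")
```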
TABLE VI
ABLATION STUDIES FOR THE FCSA FRAMEWORK USING CROSS-SCALE AND/OR IN-SCALE ATTENTION
TABLE VII
A COMPARISON OF DIFFERENT WINDOW SIZES FOR THE WINDOW-BASED SPA-MSA FOR THE HR, THE MR, AND THE LR BRANCHES

VII. CONCLUSION

In this paper, inspired by multilinear algebra, we proposed the mathematical idea of the generalized self-attention to unify and generalize existing self-attention mechanisms. Based on this generalized mechanism, we developed the first fully-connected self-attention framework that captures intra- and cross-scale patterns, as well as local and non-local similarities. Through theoretical analysis and broad experiments, the proposed FCSA framework addresses the representation issue at different dimensions (modes) and scales while achieving better detail reconstruction and lower computational costs. Afterwards, we built a fully-connected transformer network using the FCSA framework, called FC-Former. In this case, the multi-source image representation module provides support to improve the physical interpretation of the network and to guide the FCSA regularization. FC-Former demonstrated superior performance with high efficiency and few parameters (and low computational costs) for MHIF, VIS-IR image fusion, remote sensing pansharpening, and digital photographic image fusion. Thanks to the positive impact of strong feature representations for different fusion tasks, the proposed method can outperform some state-of-the-art methods specially designed for the above-mentioned problems, demonstrating its usefulness for a wide range of image processing tasks.

REFERENCES

[1] T. Zhou, S. Ruan, and S. Canu, "A review: Deep learning for medical image segmentation using multi-modality fusion," Array, vol. 3, 2019, Art. no. 100004.
[2] Y. Guo, D. Zhou, X. Ruan, and J. Cao, "Variational gated autoencoder-based feature extraction model for inferring disease-MIRNA associations based on multiview features," Neural Netw., vol. 165, pp. 491–505, 2023.
[3] W. Li, Y. Guo, B. Wang, and B. Yang, "Learning spatiotemporal embedding with gated convolutional recurrent networks for translation initiation site prediction," Pattern Recognit., vol. 136, 2023, Art. no. 109234.
[4] X. Zhang and Y. Demiris, "Visible and infrared image fusion using deep learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 10535–10554, Aug. 2023.
[5] G. Vivone, "Multispectral and hyperspectral image fusion in remote sensing: A survey," Inf. Fusion, vol. 89, pp. 405–417, 2023.
[6] L. J. Deng et al., "Machine learning in pansharpening: A benchmark, from shallow to deep networks," IEEE Geosci. Remote Sens. Mag., vol. 10, no. 3, pp. 279–315, Sep. 2022.
[7] W. He et al., "Non-local meets global: An iterative paradigm for hyperspectral image restoration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 2089–2107, Apr. 2022.
[8] Y. Guo, D. Zhou, P. Li, C. Li, and J. Cao, "Context-aware poly(A) signal prediction model via deep spatial–temporal neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 6, pp. 8241–8253, 2022.
[9] X. Deng and P. L. Dragotti, "Deep convolutional neural network for multi-modal image restoration and fusion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3333–3348, Oct. 2020.
[10] S. Yang, D. Zhou, J. Cao, and Y. Guo, "LightingNet: An integrated learning method for low-light image enhancement," IEEE Trans. Comput. Imag., vol. 9, pp. 29–42, 2023.
[11] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, "Deep learning for hyperspectral image classification: An overview," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6690–6709, Sep. 2019.
[12] L. Tang, J. Yuan, and J. Ma, "Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network," Inf. Fusion, vol. 82, pp. 28–42, 2022.
[13] H. Li, X.-J. Wu, and J. Kittler, "RFN-Nest: An end-to-end residual fusion network for infrared and visible images," Inf. Fusion, vol. 73, pp. 72–86, 2021.
[14] D. Liu et al., "Transfusion: Multi-view divergent fusion for medical image segmentation with transformers," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Interv., 2022, pp. 485–495.
[15] Q. Xie, M. Zhou, Q. Zhao, Z. Xu, and D. Meng, "MHF-Net: An interpretable deep network for multispectral and hyperspectral image fusion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 3, pp. 1457–1473, Mar. 2022.
[16] R. Dian, A. Guo, and S. Li, "Zero-shot hyperspectral sharpening," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 10, pp. 12650–12666, Oct. 2023.
[17] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, "U2Fusion: A unified unsupervised image fusion network," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 502–518, Jan. 2022.
[18] Z. Liu, E. Blasch, Z. Xue, J. Zhao, R. Laganiere, and W. Wu, "Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: A comparative study," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 94–109, Jan. 2012.
[19] G. Vivone, M. Dalla Mura, A. Garzelli, and F. Pacifici, "A benchmarking protocol for pansharpening: Dataset, preprocessing, and quality assessment," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 6102–6118, 2021.
[20] Y. Yan, J. Liu, S. Xu, Y. Wang, and X. Cao, "MD3Net: Integrating model-driven and data-driven approaches for pansharpening," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5411116.
[21] Y. Liang, P. Zhang, Y. Mei, and T. Wang, "PMACNet: Parallel multiscale attention constraint network for pan-sharpening," IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 5512805.
[22] S. Deng, L.-J. Deng, X. Wu, R. Ran, and R. Wen, "Bidirectional dilation transformer for multispectral and hyperspectral image fusion," in Proc. Int. Joint Conf. Artif. Intell., 2023, pp. 3633–3641.
[23] A. Vaswani et al., "Attention is all you need," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[24] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10012–10022.
[25] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 568–578.
[26] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, "Non-local recurrent network for image restoration," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 1680–1689.
[27] Z. C. Wu, T. Z. Huang, L. J. Deng, J. Huang, J. Chanussot, and G. Vivone, "LRTCFPan: Low-rank tensor completion based framework for pansharpening," IEEE Trans. Image Process., vol. 32, pp. 1640–1655, 2023.
[28] S. Karim, G. Tong, J. Li, A. Qadir, U. Farooq, and Y. Yu, "Current advances and future perspectives of image fusion: A comprehensive review," Inf. Fusion, vol. 90, pp. 185–217, 2023.
[29] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 136–144.
[30] Y. Wang, L.-J. Deng, T.-J. Zhang, and X. Wu, "SSconv: Explicit spectral-to-spatial convolution for pansharpening," in Proc. ACM Int. Conf. Multimedia, 2021, pp. 4472–4480.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[32] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2117–2125.
[33] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assisted Interv., 2015, pp. 234–241.
[34] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
[35] W. Dong, C. Zhou, F. Wu, J. Wu, G. Shi, and X. Li, "Model-guided deep hyperspectral image super-resolution," IEEE Trans. Image Process., vol. 30, pp. 5754–5768, 2021.
[36] Q. Ma, J. Jiang, X. Liu, and J. Ma, "Deep unfolding network for spatiospectral image super-resolution," IEEE Trans. Comput. Imag., vol. 8, pp. 28–40, 2021.
[37] D. Geman and G. Reynolds, "Constrained restoration and the recovery of discontinuities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 3, pp. 367–383, Mar. 1992.
[38] D. Geman and C. Yang, "Nonlinear image recovery with half-quadratic regularization," IEEE Trans. Image Process., vol. 4, no. 7, pp. 932–946, Jul. 1995.
[39] S. Boyd et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[40] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3929–3938.
[41] K. Zhang, L. V. Gool, and R. Timofte, "Deep unfolding network for image super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3217–3226.
[42] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9446–9454.
[43] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, "Plug-and-play image restoration with deep denoiser prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6360–6376, Oct. 2022.
[44] L. Wang, C. Sun, M. Zhang, Y. Fu, and H. Huang, "DNU: Deep non-local unrolling for computational spectral imaging," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1661–1671.
[45] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, "Dynamic neural networks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 7436–7456, Nov. 2022.
[46] S. Peng, L.-J. Deng, J.-F. Hu, and Y. Zhuo, "Source-adaptive discriminative kernels based network for remote sensing pansharpening," in Proc. Int. Joint Conf. Artif. Intell., 2022, pp. 1283–1289.
[47] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image restoration using swin transformer," in Proc. Int. Conf. Comput. Vis., 2021, pp. 1833–1844.
[48] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, "Restormer: Efficient transformer for high-resolution image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5728–5739.
[49] S. Jia, Z. Min, and X. Fu, "Multiscale spatial–spectral transformer network for hyperspectral and multispectral image fusion," Inf. Fusion, vol. 96, pp. 117–129, 2023.
[50] S.-Q. Deng, L.-J. Deng, X. Wu, R. Ran, D. Hong, and G. Vivone, "PSRT: Pyramid shuffle-and-reshuffle transformer for multispectral and hyperspectral image fusion," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5503715.
[51] M. Li, Y. Fu, and Y. Zhang, "Spatial-spectral transformer for hyperspectral image denoising," in Proc. AAAI Conf. Artif. Intell., 2023, pp. 1368–1376.
[52] Y. Peng, Y. Zhang, B. Tu, Q. Li, and W. Li, "Spatial–spectral transformer with cross-attention for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5537415.
[53] Y. Mei, Y. Fan, Y. Zhou, L. Huang, T. S. Huang, and H. Shi, "Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5690–5699.
[54] S. Zhou, J. Zhang, W. Zuo, and C. C. Loy, "Cross-scale internal graph neural network for image super-resolution," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 3499–3509.
[55] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: Tensorfaces," in Proc. Eur. Conf. Comput. Vis., 2002, pp. 447–460.
[56] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[57] M. Zhou, X. Fu, J. Huang, F. Zhao, A. Liu, and R. Wang, "Effective pan-sharpening with transformer and invertible neural network," IEEE Trans. Geosci. Remote Sens., vol. 60, 2021, Art. no. 5406815.
[58] X. Wu, T.-Z. Huang, L.-J. Deng, and T.-J. Zhang, "Dynamic cross feature fusion for remote sensing pansharpening," in Proc. Int. Conf. Comput. Vis., 2021, pp. 14687–14696.
[59] G. Vivone, "Robust band-dependent spatial-detail approaches for panchromatic sharpening," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6421–6433, Sep. 2019.
[60] G. Vivone et al., "Pansharpening based on semiblind deconvolution," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 1997–2010, Apr. 2015.
[61] G. Vivone, S. Marano, and J. Chanussot, "Pansharpening: Context-based generalized Laplacian pyramids by robust regression," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 9, pp. 6152–6167, Sep. 2020.
[62] N. Akhtar and A. Mian, "Hyperspectral recovery from RGB images using Gaussian processes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 100–113, Jan. 2020.
[63] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[64] Q. Zhang, Y. Liu, R. S. Blum, J. Han, and D. Tao, "Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review," Inf. Fusion, vol. 40, pp. 57–75, 2018.
[65] X. Fu, Z. Lin, Y. Huang, and X. Ding, "A variational pan-sharpening with local gradient constraints," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10265–10274.
[66] L.-J. Deng, G. Vivone, W. Guo, M. Dalla Mura, and J. Chanussot, "A variational pansharpening approach based on reproducible kernel Hilbert space and heaviside function," IEEE Trans. Image Process., vol. 27, no. 9, pp. 4330–4344, Sep. 2018.
[67] R. Dian and S. Li, "Hyperspectral image super-resolution via subspace-based low tensor multi-rank regularization," IEEE Trans. Image Process., vol. 28, no. 10, pp. 5135–5146, Oct. 2019.
[68] T. Xu, T.-Z. Huang, L.-J. Deng, and N. Yokoya, "An iterative regularization method based on tensor subspace representation for hyperspectral image super-resolution," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5529316.
[69] G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa, "Pansharpening by convolutional neural networks," Remote Sens., vol. 8, no. 7, 2016, Art. no. 594.
[70] J. Yang, X. Fu, Y. Hu, Y. Huang, X. Ding, and J. Paisley, "PanNet: A deep network architecture for pan-sharpening," in Proc. Int. Conf. Comput. Vis., 2017, pp. 5449–5457.
[71] J.-F. Hu, T.-Z. Huang, L.-J. Deng, T.-X. Jiang, G. Vivone, and J. Chanussot, "Hyperspectral image super-resolution via deep spatiospectral attention convolutional neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 7251–7265, Dec. 2022.
[72] Z.-R. Jin, T.-J. Zhang, T.-X. Jiang, G. Vivone, and L.-J. Deng, "LAGConv: Local-context adaptive convolution kernels with global harmonic bias for pansharpening," in Proc. AAAI Conf. Artif. Intell., 2022, pp. 1113–1121.
[73] W. G. C. Bandara and V. M. Patel, "HyperTransformer: A textural and spectral feature fusion transformer for pansharpening," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 1767–1777.
[74] J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, and Y. Ma, "SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer," IEEE-CAA J. Automatica Sinica, vol. 9, no. 7, pp. 1200–1217, Jul. 2022.
[75] L. Tang, Y. Deng, Y. Ma, J. Huang, and J. Ma, "SuperFusion: A versatile image registration and fusion network with semantic awareness," IEEE-CAA J. Automatica Sinica, vol. 9, no. 12, pp. 2121–2137, Dec. 2022.
[76] H. Liu, C. Feng, R. Dian, and S. Li, "SSTF-Unet: Spatial–spectral transformer-based U-Net for high-resolution hyperspectral image acquisition," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 12, pp. 18222–18236, Dec. 2023.
[77] W. Wang, W. Zeng, Y. Huang, X. Ding, and J. W. Paisley, "Deep blind hyperspectral image fusion," in Proc. Int. Conf. Comput. Vis., 2019, pp. 4149–4158.
[78] B. Lecouat, J. Ponce, and J. Mairal, "Fully trainable and interpretable non-local sparse models for image restoration," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 238–254.
[79] N. Park and S. Kim, "How do vision transformers work?," 2021, arXiv:2202.06709.
[80] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803.
[81] A. Kolesnikov et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[82] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," 2019, arXiv:1912.01703.
[83] H. Chen et al., "Pre-trained image processing transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12299–12310.
[84] Q. Ma, J. Jiang, X. Liu, and J. Ma, "Learning a 3D-CNN and transformer prior for hyperspectral image super-resolution," Inf. Fusion, vol. 100, 2023, Art. no. 101907.
[85] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Comput. Surv., vol. 54, no. 10s, pp. 1–41, 2022.
[86] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proc. Int. Conf. Mach. Learn., 2010, pp. 399–406.
[87] W. Tang, F. He, and Y. Liu, "YDTR: Infrared and visible image fusion via Y-shape dynamic transformer," IEEE Trans. Multimedia, vol. 25, pp. 5413–5428, 2023.
[88] M. Selva, B. Aiazzi, F. Butera, L. Chiarantini, and S. Baronti, "Hyper-sharpening: A first approach on SIM-GA data," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 3008–3024, Jun. 2015.
[89] X. Zhang, W. Huang, Q. Wang, and X. Li, "SSR-NET: Spatial–spectral reconstruction network for hyperspectral and multispectral image fusion," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 7, pp. 5953–5965, Jul. 2021.
[90] J.-F. Hu, T.-Z. Huang, L.-J. Deng, H.-X. Dou, D. Hong, and G. Vivone, "Fusformer: A transformer-based fusion network for hyperspectral image super-resolution," IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 6012305.
[91] T. Huang, W. Dong, J. Wu, L. Li, X. Li, and G. Shi, "Deep hyperspectral image fusion network with iterative spatio-spectral regularization," IEEE Trans. Comput. Imag., vol. 8, pp. 201–214, 2022.
[92] J. Fang, J. Yang, A. Khader, and L. Xiao, "MIMO-SST: Multi-input multi-output spatial-spectral transformer for hyperspectral and multispectral image fusion," IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5510020.
[93] W. Wang, L.-J. Deng, R. Ran, and G. Vivone, "A general paradigm with detail-preserving conditional invertible network for image fusion," Int. J. Comput. Vis., vol. 132, no. 4, pp. 1029–1054, 2024.
[94] M. Yin, X. Liu, Y. Liu, and X. Chen, "Medical image fusion with parameter-adaptive pulse coupled neural network in nonsubsampled Shearlet transform domain," IEEE Trans. Instrum. Meas., vol. 68, no. 1, pp. 49–64, Jan. 2019.
[95] H. Li and X.-J. Wu, "DenseFuse: A fusion approach to infrared and visible images," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2614–2623, May 2019.
[96] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "IFCNN: A general image fusion framework based on convolutional neural network," Inf. Fusion, vol. 54, pp. 99–118, 2020.
[97] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, "DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion," IEEE Trans. Image Process., vol. 29, pp. 4980–4995, 2020.
[98] P. Liang, J. Jiang, X. Liu, and J. Ma, "Fusion from decomposition: A self-supervised decomposition approach for image fusion," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 719–735.
[99] P. Zhu, Y. Sun, B. Cao, and Q. Hu, "Task-customized mixture of adapters for general image fusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 7099–7108.
[100] Z. Zhao et al., "Equivariant multi-modality image fusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 25912–25921.
[101] G. Vivone, R. Restaino, and J. Chanussot, "Full scale regression-based injection coefficients for panchromatic sharpening," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3418–3431, Jul. 2018.
[102] L.-J. Deng, G. Vivone, C. Jin, and J. Chanussot, "Detail injection-based deep convolutional neural networks for pansharpening," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 8, pp. 6995–7010, Aug. 2021.
[103] X. Tian, K. Li, W. Zhang, Z. Wang, and J. Ma, "Interpretable model-driven deep network for hyperspectral, multispectral, and panchromatic image fusion," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 10, pp. 14382–14395, Oct. 2024.
[104] Y. Duan, X. Wu, H. Deng, and L.-J. Deng, "Content-adaptive non-local convolution for remote sensing pansharpening," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27738–27747.
[105] X. He et al., "Pan-mamba: Effective pan-sharpening with state space model," 2024, arXiv:2402.12192.
[106] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, "Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum," IEEE Trans. Image Process., vol. 19, no. 9, pp. 2241–2253, Sep. 2010.
[107] A. Chakrabarti and T. Zickler, "Statistics of real-world hyperspectral images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2011, pp. 193–200.
[108] S. Li, R. Dian, L. Fang, and J. M. Bioucas-Dias, "Fusing hyperspectral and multispectral images via coupled sparse tensor factorization," IEEE Trans. Image Process., vol. 27, no. 8, pp. 4118–4130, Aug. 2018.
[109] R. Dian, S. Li, and L. Fang, "Learning a low tensor-train rank representation for hyperspectral image super-resolution," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2672–2683, Sep. 2019.
[110] T. Xu, T.-Z. Huang, L.-J. Deng, X.-L. Zhao, and J. Huang, "Hyperspectral image superresolution using unidirectional total variation with tucker decomposition," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 4381–4398, 2020.
[111] A. Toet, "The TNO multiband image data collection," Data Brief, vol. 15, pp. 249–251, 2017.
[112] Z. Wang, Y. Chen, W. Shao, H. Li, and L. Zhang, "SwinFuse: A residual swin transformer fusion network for infrared and visible images," IEEE Trans. Instrum. Meas., vol. 71, 2022, Art. no. 5016412.
[113] L. Tang, H. Zhang, H. Xu, and J. Ma, "Deep learning-based image fusion: A survey," J. Image Graph., vol. 28, no. 1, pp. 3–36, 2023.
[114] Q. Huynh-Thu and M. Ghanbari, "Scope of validity of PSNR in image/video quality assessment," Electron. Lett., vol. 44, no. 13, pp. 800–801, 2008.
[115] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[116] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595.
[117] C. S. Xydeas et al., “Objective image fusion performance measure,” Electron. Lett., vol. 36, no. 4, pp. 308–309, 2000.
[118] G. Piella and H. Heijmans, “A new quality metric for image fusion,” in Proc. IEEE Int. Conf. Image Process., 2003, pp. III-173.
[119] L. Wald, T. Ranchin, and M. Mangolini, “Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images,” Photogrammetric Eng. Remote Sens., vol. 63, no. 6, pp. 691–699, 1997.
[120] G. Vivone et al., “A new benchmark based on recent advances in multispectral pansharpening: Revisiting pansharpening with classical and emerging pansharpening methods,” IEEE Geosci. Remote Sens. Mag., vol. 9, no. 1, pp. 53–81, Mar. 2021.
[121] S. Lolli, L. Alparone, A. Garzelli, and G. Vivone, “Haze correction for contrast-based multispectral pansharpening,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. 2255–2259, Dec. 2017.
[122] B. Aiazzi, L. Alparone, S. Baronti, A. Garzelli, and M. Selva, “MTF-tailored multiscale fusion of high-resolution MS and Pan imagery,” Photogrammetric Eng. Remote Sens., vol. 72, no. 5, pp. 591–596, 2006.
[123] Q. Yuan, Y. Wei, X. Meng, H. Shen, and L. Zhang, “A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 3, pp. 978–989, Mar. 2018.
[124] L. He et al., “Pansharpening via detail injection based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 4, pp. 1188–1204, Apr. 2019.
[125] J. Zhou, D. L. Civco, and J. A. Silander, “A wavelet transform method to merge Landsat TM and SPOT panchromatic data,” Int. J. Remote Sens., vol. 19, no. 4, pp. 743–757, 1998.
[126] R. H. Yuhas, A. F. Goetz, and J. W. Boardman, “Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm,” in Proc. Summaries Third Annu. JPL Airborne Geosci. Workshop, 1992, vol. 1, pp. 147–149.
[127] L. Wald, Data Fusion: Definitions and Architectures: Fusion of Images of Different Spatial Resolutions. Vouillé, France: Presses des MINES, 2002.
[128] A. Arienzo, G. Vivone, A. Garzelli, L. Alparone, and J. Chanussot, “Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches,” IEEE Geosci. Remote Sens. Mag., vol. 10, no. 3, pp. 168–201, Sep. 2022.

Xiao Wu received the MSc degree from the School of Mathematical Sciences, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2023. He is currently working toward the PhD degree at UESTC under the supervision of Prof. Ting-Zhu Huang. His research interests include the theory and applications of machine learning and deep learning in image processing.

Zi-Han Cao received the BS degree from the School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2023. He is currently working toward the MS degree with the School of Mathematical Sciences, UESTC, under the supervision of Prof. Liang-Jian Deng. His research interests include computer vision, machine learning, and their applications to low-level vision tasks such as super-resolution, image fusion, and inverse problems.

Ting-Zhu Huang (Member, IEEE) received the BS, MS, and PhD degrees in computational mathematics from the Department of Mathematics, Xi’an Jiaotong University, Xi’an, China, in 1986, 1992, and 2001, respectively. He is currently a professor with the School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, China. His research interests include scientific computation and applications, numerical algorithms for image processing, numerical linear algebra, preconditioning technologies, and matrix analysis with applications. Dr. Huang is an editor of the Scientific World Journal, Advances in Numerical Analysis, the Journal of Applied Mathematics, the Journal of Pure and Applied Mathematics: Advances in Applied Mathematics, and the Journal of Electronic Science and Technology, China.

Liang-Jian Deng (Senior Member, IEEE) received the BS and PhD degrees in applied mathematics from the School of Mathematical Sciences, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2010 and 2016, respectively. He is currently a research fellow with the School of Mathematical Sciences, UESTC. From 2013 to 2014, he was a joint-training PhD student with Case Western Reserve University, Cleveland, OH, USA. In 2017, he was a postdoctoral researcher with Hong Kong Baptist University (HKBU), and he has also made short visits to the Isaac Newton Institute for Mathematical Sciences, University of Cambridge, and HKBU. His research interests include the use of partial differential equations (PDEs), optimization modeling, and deep learning to address several tasks in image processing and computer vision, e.g., resolution enhancement and restoration.

Jocelyn Chanussot (Fellow, IEEE) received the MSc degree in electrical engineering from the Grenoble Institute of Technology (Grenoble INP), Grenoble, France, in 1995, and the PhD degree from the Université de Savoie, Annecy, France, in 1998. Since 1999, he has been with Grenoble INP, where he is currently a professor of signal and image processing. His research interests include image analysis, hyperspectral remote sensing, data fusion, machine learning, and artificial intelligence. Dr. Chanussot was a member of the Institut Universitaire de France from 2012 to 2017. He was the vice-president of the IEEE Geoscience and Remote Sensing Society (GRSS), in charge of meetings and symposia, from 2017 to 2019, and the general chair of the first IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing. He is an associate editor for the IEEE Transactions on Geoscience and Remote Sensing, the IEEE Transactions on Image Processing, and the Proceedings of the IEEE, and he was the editor-in-chief of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing from 2011 to 2015. In 2014, he served as a guest editor for the IEEE Signal Processing Magazine. He has been a Highly Cited Researcher (Clarivate Analytics/Thomson Reuters) since 2018.

Gemine Vivone (Senior Member, IEEE) received the BSc and MSc degrees (summa cum laude) and the PhD degree in information engineering from the University of Salerno, Fisciano, Italy, in 2008, 2011, and 2014, respectively. He is a senior researcher with the National Research Council (Italy). His main research interests focus on image fusion, statistical signal processing, deep learning, and the classification and tracking of remotely sensed images. Dr. Vivone is a co-chair of the IEEE GRSS Image Analysis and Data Fusion Technical Committee, a member of the IEEE Task Force on “Deep Vision in Space”, and was the leader of the Image and Signal Processing Working Group of the IEEE Image Analysis and Data Fusion Technical Committee (2020–2021). He is currently an area editor for Elsevier Information Fusion and an associate editor for the IEEE Transactions on Geoscience and Remote Sensing (TGRS) and the IEEE Geoscience and Remote Sensing Letters (GRSL). He is also an editorial board member for Nature Scientific Reports and MDPI Remote Sensing. Dr. Vivone received the IEEE GRSS Early Career Award in 2021, the Symposium Best Paper Award at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) in 2015, and the Best Reviewer Award of the IEEE Transactions on Geoscience and Remote Sensing in 2017. Moreover, he is listed among the World’s Top 2% Scientists by Stanford University.