SGCNet: Silhouette Guided Cascaded Network for Multi-modal Image Fusion

Yuxuan Wang a,1, Zhongwei Shen a,1, Hui Li b, Zhenping Xia a,*

a Suzhou University of Science and Technology, Suzhou 215009, China
b Jiangnan University, Wuxi 214122, China

Abstract

In the field of image fusion, capturing both high-frequency details, such as textures, and low-frequency elements, like color blocks, is crucial for generating high-quality composite images. However, traditional fusion techniques often rely on simple weighting strategies, which can overemphasize certain frequency bands, compromising the contrast and detail of the fused image. To address this challenge, we propose a novel multi-modal image fusion method, SGCNet, which enhances salient objects through low-frequency guidance while preserving fine details by embedding high-frequency information. Our approach features a dual-layer dense connection architecture, integrating CNNs and Transformers within the encoder to efficiently extract both high- and low-frequency features. By incorporating advanced segmentation prior knowledge, we ensure the accurate embedding of high-frequency details under low-frequency guidance, effectively mitigating the biases in existing methods that tend to favor visible or infrared images. This strategy significantly improves both the contrast and detail of the fused image. Extensive experiments demonstrate that SGCNet outperforms existing fusion methods across a variety of tasks, including infrared-visible and medical image fusion, highlighting its technological advancements and broad practical application potential.

Keywords: Image fusion, multi-modal fusion, multi-head cross attention, semantic information
1. Introduction

Multimodal image fusion technology aims to extract complementary information from various modalities, e.g., the rich detail in visible images and the heat source information in infrared images that is robust to lighting or weather conditions, and integrate them into a single image. Such fused images typically contain richer information, benefiting subsequent downstream visual tasks [1] such as object detection [2], [3], [4], object tracking [5], [6], [7], digital autofocus [8] and semantic segmentation [9], [10], [11].

Current methods lean, to varying degrees, towards either infrared or visible light information, resulting in a loss of detail or colour contrast information. Some methods focus more on retaining information during the feature encoding process, using structures like dense connections to preserve the detailed information of the input image modalities. Consequently, the fused image performs better in high-frequency regions and visually resembles the visible modality, but the colour contrast of objects highlighted by infrared is relatively poor. Other methods, using fusion structures like cross-attention, prioritize global information consistency, resulting in better performance in the low-frequency regions. Visually, these fused images lean more towards infrared images, but the details of objects provided by visible light tend to be less prominent. The former methods provide more visually appealing results, while the latter are better at highlighting salient targets, making them more supportive for downstream tasks.

* Corresponding author: Zhenping Xia (Email: [email protected]).
1 The two authors contribute equally to this work.

Preprint submitted to Computer Vision and Image Understanding, December 16, 2024
Figure 1: Nighttime infrared and visible image pair and the fusion results. (a) Visible & Infrared; (b) DenseFuse; (c) DATFuse; (d) SGCNet (Ours). The high-frequency biased method, DenseFuse in (b), preserves text details clearly but does not highlight pedestrians prominently. In contrast, the low-frequency biased method, DATFuse in (c), highlights pedestrians well but loses text detail. Our method achieves a better balance between the two in (d).

Figure 1 demonstrates the bias issue in the fusion results of existing methods between visible and infrared images. As shown in Figure 1(b), DenseFuse [12] clearly displays the detailed text, while the high-intensity effect is less pronounced for infrared-sensitive areas, such as highlighted human figures in the infrared image. Conversely, DATFuse [13] in Figure 1(c) shows clear high contrast for the human figure but preserves text details less effectively. In simple terms, effectively combining these two strategies during the encoding process can address the bias issue in multimodal image fusion, leveraging the strengths of both types of methods.

Based on this foundation, we extend the simple convolutional encoding in the classic DenseFuse structure to a cascaded architecture to mitigate the tendency of basic convolutional structures to overly focus on high-frequency information within a single modality. Considering that the transformer architecture tends to significantly disrupt visual information, we introduce a dense-connected convolution in parallel to preserve local information. Through adaptive mutual weighting, we construct an encoding unit referred to as the Enhanced Dense Block (EDB). This cascaded structure weakens the disruption of high-frequency local details caused by global modelling by reintroducing shallow visual features twice in succession.

Before cross-modal fusion, we introduce a pretrained semantic segmentation model (i.e., Deeplabv3 [14] in this paper) as prior knowledge to spatially weight and align features within each modality. The semantic segmentation model separates objects within the input modality using different masks, effectively creating low-pass filtering. Using this enhanced low-frequency prior knowledge as weights enables spatial convergence of intra-modality information toward salient targets, preserving high-frequency detail textures while suppressing noise. The enhanced intra-modality information is fused across modalities using a classic cross-attention mechanism, achieving unbiased information complementarity.

The main contributions of this work can be summarized as follows:
• Using a dual-layer dense connection, we precisely enhance high-frequency feature extraction, guided by low-frequency data to ensure accurate detail placement and enhancement.
• Integrating semantic segmentation enhances different modal images' complementary information, particularly for infrared images, by preserving details and highlighting key targets.
• Experiments show our network is a superior fusion method, offering a stronger and more efficient solution for combining images from different sources.

The remainder of this article is organized as follows: Section 2 reviews the related work. Section 3 provides a detailed description of the proposed Silhouette Guided Cascaded Network (SGCNet) model and its associated loss function. Experimental results and discussion are presented in Section 4. Finally, Section 5 concludes the article.
2. Related Work

Traditional image fusion methods include multi-scale transformation (MST)-based techniques [15, 16], sparse representations (SR) [17, 18, 19], and subspace-based approaches [20]. MST-based methods encompass pyramid-based algorithms [21], wavelet-based algorithms [22], curvelet-based algorithms [23], and multi-scale geometric representation-based algorithms [24]. The fundamental fusion strategies employed by these methods typically involve weighted averaging [25] or maximum value selection [26]. With the advent of deep learning, the field of image fusion has seen significant advancements, as the integration of image information has become more efficient and accurate through the use of advanced neural network architectures and optimization algorithms. This shift has greatly accelerated the progress of image processing technology.

2.1. High-Frequency Biased Methods

In certain deep learning-based methods, feature extraction is performed using Convolutional Neural Networks (CNNs) [27, 28, 29]. These methods are particularly effective at extracting high-frequency information, such as local image features and texture details. For example, DenseFuse [12] employs a dense connection strategy, where each layer is directly connected to all preceding layers. This design enables efficient feature reuse and enhances the flow of information across layers, improving the overall feature transfer process.

GAN-based methods [30], [31] rely on an adversarial game between the generator and discriminator to estimate the target's probability distribution. This approach can implicitly perform feature extraction, feature fusion, and image reconstruction simultaneously. FusionGAN [32] is a pioneering GAN-based image fusion method that establishes an adversarial game between the fused image and the visible image, thereby enriching the texture details of the fused output. GANMcC [33] further improves fusion results by introducing multi-classification constraints. Its core idea is to use multiple classifiers as discriminators to assess the likelihood that an input image is either a visible light or infrared image. DDcGAN [34] achieves image fusion by constructing a conditional generative adversarial network with two discriminators. This design enables the model to be trained without the need for ground truth images, making it effective for fusion tasks at varying resolutions.

2.2. Low-Frequency Biased Methods

Traditional methods of extracting high-frequency information often fail to capture the long-range dependencies within the source image, leading to insufficient extraction of important global contextual information. To address this limitation, Transformers have been increasingly applied to image fusion tasks. Introduced by [35], the Transformer architecture, with its self-attention mechanism, has gained significant popularity in deep learning due to its ability to model global interactions between contextual elements. This capability has led to impressive performance across various visual tasks [36, 37, 38]. As a result, Transformers have been integrated into image fusion to model long-range dependencies between different domains, achieving superior fusion results [39, 40, 41].

DATFuse [13] employs a dual attention residual module (DARM) to preserve global complementary information, enabling the model to capture long-range dependencies in images and enhancing its context-awareness. SwinFusion [42], on the other hand, constructs a feature encoding backbone entirely based on attention mechanisms, effectively simulating long-range dependencies. This design allows global features to exhibit stronger representational capabilities compared to local features, improving the model's overall performance in image fusion tasks.
Figure 2: The SGCNet framework for multimodal image fusion comprises an encoder, a fusion layer, and a decoder. The encoder includes four EDB layers, each interconnected by dense connections. Each EDB layer contains three modules: DC, TR, and EM, where DC stands for Dense Connection CNN, TR stands for Transformer, and EM stands for Enhance Module. The fusion layer employs a cross-attention module based on semantic prior adjustment. The decoder consists of four convolutional layers. Abbreviations: EDB: Enhanced Dense Block; MLP: Multi-Layer Perceptron; PE: Position Embedding; LN: Layer Normalization; SA: Self Attention; SSE: Semantic Segmentation Embedding; MSA: Multi-head Self Attention; Sigmoid: Sigmoid Activation Layer; MCA: Multi-head Cross Attention.

To balance the relationship between high-frequency and low-frequency information, several studies have applied cross-attention mechanisms in image fusion [43, 44, 45]. CrossFuse [46] introduces an innovative cross-attention mechanism within the Transformer architecture to strengthen the interaction between features from different modalities. This mechanism uses self-attention to enhance intra-modal features, while cross-attention improves the interaction between features from distinct modalities. As a result, it effectively reduces the correlation between inputs and enhances the fusion process.

2.3. Down-stream Vision Task Guided Fusion Methods

In semantic segmentation, image fusion can serve as a preprocessing step to enhance both the accuracy and robustness of segmentation results. MFFNet [47] introduces a multi-level feature fusion network designed to improve image semantic segmentation tasks. SeAFusion [48] is an advanced image fusion framework that integrates semantic-aware real-time infrared and visible light image fusion with visual tasks, boosting segmentation performance. MRFS [49] presents a coupled learning framework that emphasizes the mutual enhancement of image fusion and semantic segmentation, leveraging visual and semantic consistency to improve overall results.
In the downstream task of salient object detection within image fusion, researchers face the challenge of effectively integrating information from diverse modalities to enhance object detection performance. Salient object detection requires models to understand image content and identify the most representative objects. To address this, various fusion methods are proposed to improve the model's ability to recognize targets by combining information from different modalities, such as visible light and infrared images. For example, ICAFusion [50] effectively aggregates complementary information from RGB and thermal images using an iterative cross-attention-guided feature fusion mechanism. DDcGAN [34] generates saliency maps through a saliency detection network, then fuses these maps with the original images to enhance both the visual quality of the images and the accuracy of object detection.

3. Method

In this section, we provide a comprehensive overview of the SGCNet framework, with a particular focus on the double-layer dense connection and the multi-head cross-attention module based on semantic segmentation. Furthermore, we also elaborate on the loss function designed for the network.

3.1. Framework Overview

The framework of SGCNet is shown in Figure 2, majorly including an encoder, a fusion layer, and a decoder. Let I_IR ∈ R^(H×W×C_in) and I_VIS ∈ R^(H×W×C_in) represent the source images of infrared and visible light, respectively. I_F ∈ R^(H×W×C_out) is the fused image with complete scene representation. H, W and C_in are the height, width and channel number of the input images. C_out is the channel number of the fused images. The proposed SGCNet aims to generate the fused image I_F via merging local and global complementary information in the source images I_IR and I_VIS.

Firstly, the low-frequency biased feature F_L^i and the high-frequency biased feature F_H^i of the source image are extracted in parallel by the proposed modules, Dense Connection CNN (DC, H_C(·)) and Transformer (TR, H_T(·)). This processing can be expressed as:

F_L^1, F_H^1 = H_C(I_IR^1, I_VIS^1), H_T(I_IR^1, I_VIS^1),   (1)

As we know, convolutional layers are essential in the initial stages of visual processing, providing stability to the optimization process through their excellent performance, which leads to superior results. Conversely, the Transformer model is famed for its self-attention mechanism that captures long-term dependencies. In the realm of image fusion, this mechanism excels by perceiving global connections between image elements, ensuring a coherent and unified fusion process. To further enhance the fusion effect, we design the convolutional component and introduce a dual-layer Dense Connection structure. This design facilitates precise spatial filling of high-frequency information under the guidance of low-frequency information.

Moreover, we employ a parallel network architecture that facilitates the collaborative operation of DC and TR. This approach not only enhances computational efficiency but also ensures that the features extracted by both components at different stages remain distinct and do not interfere with each other. By addressing the issue of images gravitating towards a specific modality in traditional fusion processes, this architecture introduces greater flexibility and selectivity in fusion strategies. Additionally, it significantly mitigates the risk of information loss during the fusion process, thereby paving the way for new advancements in image fusion technology.
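To make the parallel DC/TR design of Eq. (1) concrete, the following PyTorch sketch shows one possible form of an Enhanced Dense Block: a dense-connection convolutional branch for high-frequency local detail, a small Transformer branch for low-frequency global context, and an adaptive mutual weighting between them. Channel sizes, layer counts, the scalar gate, and the class names are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DenseConvBranch(nn.Module):
    """DC branch: each 3x3 conv sees the concatenation of all previous outputs."""
    def __init__(self, channels: int, growth: int = 16, layers: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = channels
        for _ in range(layers):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)  # project back to `channels`

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))

class TransformerBranch(nn.Module):
    """TR branch: flatten the feature map into tokens and apply self-attention."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels,
            batch_first=True, norm_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class EnhancedDenseBlock(nn.Module):
    """Runs DC and TR in parallel (Eq. (1)) and mixes them with a learned gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.dc = DenseConvBranch(channels)            # high-frequency biased F_H
        self.tr = TransformerBranch(channels)          # low-frequency biased F_L
        self.gate = nn.Parameter(torch.tensor(0.5))    # adaptive mutual weighting (assumed scalar)

    def forward(self, x):
        f_h, f_l = self.dc(x), self.tr(x)
        g = torch.sigmoid(self.gate)
        return g * f_h + (1.0 - g) * f_l, f_h, f_l

# Usage: mixed, f_h, f_l = EnhancedDenseBlock(32)(torch.randn(1, 32, 64, 64))
```

The returned f_h and f_l play the roles of the high- and low-frequency biased features F_H and F_L that the enhancement and fusion stages described next operate on.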
Subsequently, the features extracted by the DC and TR modules are combined and fed into a spatial attention enhancement module. Utilizing a multi-layer perceptron (MLP), the features are compressed into a thin slice in which the number of channels is reduced to 1 while the height (H) and width (W) remain unchanged. This compressed slice is processed through transformer self-attention, transformed into weights using a sigmoid function, and subsequently used to enhance the input, yielding the enhanced high-frequency biased feature F_EH^1 and the enhanced low-frequency biased feature F_EL^1:

F_EH^1, F_EL^1 = H_E(F_H^1, F_L^1),   (2)

where H_E(·) represents a Contrast Enhancement Unit based on a spatial attention mechanism. Multi-scale features are utilized after feature extraction at each layer, allowing for the retention of more valuable information through the DenseBlock. As the encoder deepens, the extracted deep features concentrate on salient content. Furthermore, to enhance detailed information and prominent features, skip connections are implemented in both the encoder and decoder.

After four layers of feature extraction, the size of the image is reduced, resulting in a smaller bottom image. Large-scale images are utilized to extract detailed features, while small-scale images focus on capturing overall characteristics. In total, four scales are employed for feature extraction. Each layer is further enhanced through image contrast adjustments, enabling the model to capture information at various resolution levels. This facilitates the simultaneous recognition and comprehension of features, ranging from minute details to large-scale structures. Additionally, this process accentuates the features, thereby enhancing the model's ability to recognize and interpret image content.

3.2. Semantic Weight Guided Fusion Strategy

Introducing semantic features extracted from pre-trained models into fusion networks can significantly enhance the expressiveness of the fused features [51], [52]. Thus, DeepLabv3 is utilized to achieve accurate image semantic information from source images. In the fusion process, we employ these semantic features as prior knowledge to guide the fusion strategy. Given that image fusion typically involves small datasets, it is often challenging to independently extract high-quality salient features from these datasets.

To achieve this, we utilize semantic information from visible light images as prior knowledge. By employing the weights generated by these models, the semantic representations of Q, K, and V can be enhanced, making them more robust and accurate. Enhancing semantic weights allows for more effective integration of visual features from images with the semantic features of text, thereby improving the model's overall performance. This approach ensures that critical details are effectively preserved and augmented during the fusion process, resulting in a more accurate and vivid final image.

Furthermore, effective segmentation information directs feature data to focus on areas where significant differences exist between infrared and visible light images. This methodology facilitates more precise incorporation of high-frequency texture information into these regions, thereby enhancing the quality of the fused image. By adopting this strategy, our fusion network processes local image features with greater accuracy while maintaining a high degree of global consistency and coordination, yielding exceptional results in image fusion tasks.

We generate semantic weights using the pre-trained Deeplabv3 network to enhance the Q, K, and V components of the cross-attention module. After obtaining the semantically enhanced features, we introduce the proposed cross-attention block. The attention mechanism is defined as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k + B) V,   (3)

where d_k is the dimension of the keys and B is the learnable relative positional encoding.
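As an illustration of how the segmentation prior of Section 3.2 can be turned into spatial weights for the attention of Eq. (3), the sketch below derives a foreground-probability map from the visible image with the torchvision DeepLabv3 model and uses it to re-weight the features before the Q, K, V projections. The exact weighting rule, the projection layers, and the resizing strategy are assumptions; the paper only states that Deeplabv3 masks act as a low-frequency prior on Q, K and V.

```python
import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_resnet50

# Pretrained DeepLabv3 (21 Pascal-VOC classes, channel 0 = background).
# torchvision >= 0.13; older versions use pretrained=True instead of weights="DEFAULT".
seg_model = deeplabv3_resnet50(weights="DEFAULT").eval()

@torch.no_grad()
def semantic_weight(rgb: torch.Tensor, feat_hw: tuple) -> torch.Tensor:
    """Foreground probability map in [0, 1], resized to the feature resolution.
    `rgb` is assumed to be an ImageNet-normalized (B, 3, H, W) visible image."""
    logits = seg_model(rgb)["out"]                       # (B, 21, H, W)
    prob = logits.softmax(dim=1)
    foreground = 1.0 - prob[:, 0:1]                      # 1 - P(background)
    return F.interpolate(foreground, size=feat_hw, mode="bilinear", align_corners=False)

def weighted_qkv(feat: torch.Tensor, weight: torch.Tensor, proj_q, proj_k, proj_v):
    """Spatially re-weight a (B, C, H, W) feature map with the semantic prior,
    then project the flattened tokens to Q, K, V (proj_* would typically be nn.Linear(C, C))."""
    feat = feat * (1.0 + weight)                         # emphasise salient regions (assumed form)
    tokens = feat.flatten(2).transpose(1, 2)             # (B, H*W, C)
    return proj_q(tokens), proj_k(tokens), proj_v(tokens)

def attention(q, k, v, bias=None):
    """Scaled dot-product attention of Eq. (3); `bias` is the learnable positional term B."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if bias is not None:
        scores = scores + bias
    return scores.softmax(dim=-1) @ v
```

In practice the same weight map can be applied to the features of both modalities, so that the subsequent attention concentrates on the regions the prior marks as salient.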
We extend the self-attention into multi-head self-attention (MSA) to enable the attention mechanism to consider various attention distributions and make the model capture information from different perspectives. We perform the attention function h times in parallel and concatenate the results for multi-head self-attention; in our work h is set to 4. Next, the feature markers generated by the MHA layer are refined by two multi-layer perceptron layers with GELU activation layers. Layer normalization (LN) is performed after the MHA and before the MLP. The complete process can be expressed as:

{Q_1, K_1, V_1} = {X_1 W_1^Q, X_1 W_1^K, X_1 W_1^V},
{Q_2, K_2, V_2} = {X_2 W_2^Q, X_2 W_2^K, X_2 W_2^V},   (4)

Given two local window features X_1 and X_2 from different domains, the weight matrices W^Q, W^K, W^V shared across the different windows are used to project them into the query Q, key K, and value V in the above way.

The full process of fusion based on cross-attention is formalized as:

Z̃ = MHA(Q_1, K_2, V_2) + MHA(Q_2, K_1, V_1),   (5)

For Q_1 from domain 1, it merges the cross-domain information by performing attention weighting on K_2 and V_2 from domain 2, while preserving the information in domain 1 by residual joining, and vice versa.

Z = LN(MLP(Z̃)),   (6)

where Z is the output of the fusion unit with X_1, X_2 as the input. The MLP is as follows:

MLP(X) = GELU(W_1 X + b_1) W_2 + b_2,   (7)

where GELU is the Gaussian error linear unit. As presented in Eqs. (4)-(7), for Q_1 from the visible light image, it incorporates cross-domain information by performing attention weighting with K_2 and V_2 from the infrared image, while preserving information in the visible light image through the residual connection, and vice versa.
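A compact sketch of the bidirectional cross-attention fusion of Eqs. (4)-(7), written with PyTorch's built-in multi-head attention. The embedding size, the head count, and the exact residual arrangement are illustrative assumptions consistent with the description above (h = 4 heads, LN and a GELU MLP after the attention).

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Bidirectional cross attention: Z~ = MHA(Q1,K2,V2) + MHA(Q2,K1,V1); Z = LN(MLP(Z~))."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.mha_12 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mha_21 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_vis: torch.Tensor, x_ir: torch.Tensor) -> torch.Tensor:
        # x_vis, x_ir: token sequences of shape (B, N, dim) from the two modalities.
        # Visible queries attend to infrared keys/values, and vice versa (Eq. (5)).
        z_12, _ = self.mha_12(query=x_vis, key=x_ir, value=x_ir)
        z_21, _ = self.mha_21(query=x_ir, key=x_vis, value=x_vis)
        z_tilde = z_12 + x_vis + z_21 + x_ir   # residual joining keeps intra-domain information
        return self.norm(self.mlp(z_tilde))    # Eq. (6)

# Usage with h = 4 heads, as in the paper:
# fused = CrossAttentionFusion(dim=64, heads=4)(vis_tokens, ir_tokens)
```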
After fully merging complementary information in different domains, we design a CNN-based image reconstruction unit to map the fused deep features back to the image space. The CNN-based image reconstruction unit H_CR is deployed to reduce the number of channels and generate the fused image I_f, which is denoted as:

I_f = H_CR(F_D),   (8)

where F_D represents the fused deep features and is the input of the feature reconstruction module. The image reconstruction unit consists of three convolutional layers.

3.3. Loss Function

The information contained in the infrared and visible images varies greatly on account of their individual imaging mechanisms. Therefore, the loss function formulation ought to comprehensively consider the inherent characteristics of the source images so as to obtain an informative fusion result that can highlight the target together with abundant texture details of scenes.

In this work, the loss function of the proposed SGCNet is defined as follows,

L = γ L_aux + λ L_texture,   (9)

where L_aux represents the auxiliary intensity loss and L_texture represents the gradient loss. γ and λ are weight factors controlling the contributions of each term.

One of the goals of infrared and visible image fusion is to exploit spatial details from the input images, which can be characterized by the gradient. In addition, significant details can be effectively captured by the maximum aggregation of the gradient information. Thus, the gradient loss is devised to restrain the fusion result to maintain vital details from the input images, which is formulated as follows,

L_texture = (1/(HW)) ‖ |∇I_f| − max(|∇I_ir|, |∇I_vi|) ‖_1,   (10)

where H and W denote the height and width of the source images, respectively, ‖·‖_1 denotes the l_1-norm, and max(·) refers to the element-wise maximum selection.

For the auxiliary intensity loss, excellent image fusion algorithms are expected to generate fused images with appropriate intensity based on the global apparent intensity information of the source image. To this end, we design the following auxiliary intensity loss to guide our fusion model in capturing appropriate intensity information:

L_aux = (1/(HW)) ‖ I_f − max(I_ir, I_vi) ‖_1.   (11)
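A minimal sketch of the loss in Eqs. (9)-(11). The text does not specify the discrete gradient operator, so a Sobel filter is assumed here, and the weights follow the setting γ = 40 and λ = 50 reported later in Section 4.1.

```python
import torch
import torch.nn.functional as F

# Sobel kernels used to approximate |∇I| (the exact gradient operator is our assumption).
_KX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_KY = _KX.transpose(2, 3)

def grad_mag(img: torch.Tensor) -> torch.Tensor:
    """Gradient magnitude of a single-channel image batch of shape (B, 1, H, W)."""
    gx = F.conv2d(img, _KX.to(img), padding=1)
    gy = F.conv2d(img, _KY.to(img), padding=1)
    return gx.abs() + gy.abs()

def texture_loss(i_f, i_ir, i_vi):
    """Eq. (10): match fused gradients to the element-wise max of the source gradients."""
    target = torch.maximum(grad_mag(i_ir), grad_mag(i_vi))
    return F.l1_loss(grad_mag(i_f), target)

def aux_intensity_loss(i_f, i_ir, i_vi):
    """Eq. (11): keep the fused intensity close to the element-wise max of the sources."""
    return F.l1_loss(i_f, torch.maximum(i_ir, i_vi))

def total_loss(i_f, i_ir, i_vi, gamma: float = 40.0, lam: float = 50.0):
    """Eq. (9): L = gamma * L_aux + lambda * L_texture."""
    return gamma * aux_intensity_loss(i_f, i_ir, i_vi) + lam * texture_loss(i_f, i_ir, i_vi)
```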
4. Experiment

In this section, we compare SGCNet with several state-of-the-art algorithms in multimodal image fusion scenarios through quantitative and qualitative comparisons. At the same time, we conduct ablation studies on the network structure and loss function to validate the effectiveness of specific designs. We also design a downstream detection task based on the M3FD dataset.

4.1. Dataset And Experimental Details

In this work, 1083 training image pairs from the MSRS dataset are used to train the infrared-visible image fusion task. The training samples are cropped into blocks of size 64 × 64 pixels to obtain sufficient data to train the proposed SGCNet, and these image blocks are normalized to [0, 1]. The learning rate is fixed at 0.001, the batch size is 64, and the number of epochs is 30. The hyperparameters that control the trade-off of each sub-loss term are empirically set to γ = 40 and λ = 50. The experiment is conducted using the PyTorch framework on an NVIDIA GeForce RTX 3090 GPU.

4.2. Comparative Methods And Objective Evaluation Indicators

To demonstrate the importance of our SGCNet, we compare the performance of nine state-of-the-art infrared and visible image fusion algorithms, namely DenseFuse [12], FusionGAN [32], GANMcC [33], RFN-Nest [53], SwinFusion [42], DATFuse [13], U2Fusion [54], LRRNet [55], and DeepM2CDL [56]. The source code for these nine methods is publicly available or provided by their authors.

In order to achieve comprehensive and objective evaluation, six commonly used quantitative evaluation indicators in image fusion are utilized: information entropy EN, spatial frequency SF, mutual information MI, visual fidelity VIF, average gradient AG, and gradient-based fusion performance Qabf.

Figure 3: Qualitative comparison on the MSRS database. (a) Infrared images; (b) Visible images; (c) DenseFuse; (d) FusionGAN; (e) GANMcC; (f) RFN-Nest; (g) SwinFusion; (h) DATFuse; (i) U2Fusion; (j) LRRNet; (k) DeepM2CDL; (l) SGCNet.

Table 1: Quantitative comparison of the proposed SGCNet with nine state-of-the-art methods on the MSRS database. For each indicator, the optimal and suboptimal methods are marked in bold and underlined, respectively.

       DenseFuse  FusionGAN  GANMcC  RFN-Nest  SwinFusion  DATFuse  U2Fusion  LRRNet  DeepM2CDL  SGCNet (Ours)
EN     5.544      5.272      5.856   5.408     5.923       5.818    4.459     5.209   5.846      6.075
SF     5.717      3.909      5.271   4.396     8.806       8.881    6.782     6.140   8.373      9.976
MI     2.432      1.853      2.384   2.149     3.261       3.194    1.782     2.083   3.733      3.490
VIF    0.769      0.417      0.684   0.579     0.959       0.851    0.469     0.437   0.943      1.021
AG     1.865      1.849      1.703   1.279     2.676       2.615    1.761     1.703   2.497      3.439
Qabf   0.440      0.140      0.340   0.247     0.580       0.541    0.309     0.311   0.628      0.656

4.3. Results Analysis

4.3.1. The fusion results on the MSRS dataset

Figure 3 shows a pair of source images from the MSRS database and their corresponding fused images obtained through different methods. Due to the significant difference in the information contained in infrared images and visible light images, where infrared images provide low-resolution salient targets while visible images present clear scene details but less obvious targets, a good fusion method should be able to produce information-rich fusion results that display complementary information, revealing both obvious objects and obvious environments.

Obviously, all ten fusion methods can achieve relatively good performance, but the other nine comparison methods still have some shortcomings. Due to the lack of global information exchange and inappropriate intensity control, U2Fusion cannot effectively present scene information in visible images. In addition, DenseFuse can retain some texture details of visible light images, but it still suffers from thermal radiation pollution, which weakens the prominent targets of infrared images to varying degrees. FusionGAN and DeepM2CDL have good thermal radiation preservation, but they lose some important texture details. GANMcC displays better scene details, but there are still somewhat blurry results. RFN-Nest is limited in its ability to retain complementary features, resulting in blurred fusion results. DATFuse reveals more prominent roads than RFN-Nest, but still has a certain degree of blurring, resulting in the loss of some image details. SwinFusion has low contrast, resulting in some texture details becoming blurry. The overall image of LRRNet is dark and somewhat blurry.

It is worth emphasizing that our SGCNet not only preserves the scene information of visible images but also preserves salient objects, benefiting from effective global context awareness and appropriate strength control. Especially, our model can adaptively focus on salient regions in infrared images and backgrounds in visible images through global context aggregation and spatial attention enhancement. For visible and near-infrared image fusion, it delivers excellent visual quality in comparison.

Further quantitative comparisons were conducted on 50 pairs of images in the MSRS dataset, and the results of the six evaluation metrics are shown in Table 1. For each indicator, the best and second-best methods are marked in bold and underlined, respectively. The proposed SGCNet achieves top-level performance on EN, SF, VIF, AG, and Qabf, and demonstrates significant advantages over other alternative solutions. Although our method achieves only suboptimal performance on MI, the gap with the optimal performance is small. Overall, SGCNet exhibits good objective performance on the MSRS dataset, which proves the significance of the proposed spatial attention contrast enhancement module.
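For reference, two of the simpler indicators reported in Table 1 can be computed as follows. EN is the Shannon entropy of the 8-bit grey-level histogram and AG is the mean root-mean-square of the horizontal and vertical finite differences; these are the standard definitions, shown only as a sketch rather than the exact evaluation scripts used for the tables.

```python
import numpy as np

def entropy_EN(img_u8: np.ndarray) -> float:
    """Information entropy of an 8-bit grayscale image (higher = more information)."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient_AG(img: np.ndarray) -> float:
    """Average gradient: mean RMS of horizontal/vertical finite differences."""
    img = img.astype(np.float64)
    gx = img[:, 1:] - img[:, :-1]
    gy = img[1:, :] - img[:-1, :]
    gx, gy = gx[:-1, :], gy[:, :-1]          # crop both to a common (H-1, W-1) grid
    return float(np.sqrt((gx ** 2 + gy ** 2) / 2.0).mean())
```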
4.3.2. The fusion results of the LLVIP dataset

Figure 4 shows a set of source image pairs from the LLVIP database and fused images generated by different methods, with observation results similar to those in the MSRS dataset.

Figure 4: Qualitative comparison on the LLVIP database. (a) Infrared images; (b) Visible images; (c) DenseFuse; (d) FusionGAN; (e) GANMcC; (f) RFN-Nest; (g) SwinFusion; (h) DATFuse; (i) U2Fusion; (j) LRRNet; (k) DeepM2CDL; (l) SGCNet (Ours).

Table 2: Quantitative comparison of the proposed SGCNet with nine state-of-the-art methods on the LLVIP database. For each indicator, the optimal and suboptimal methods are marked in bold and underlined, respectively.

       DenseFuse  FusionGAN  GANMcC  RFN-Nest  SwinFusion  DATFuse  U2Fusion  LRRNet  DeepM2CDL  SGCNet (Ours)
EN     6.875      6.308      6.690   6.862     7.159       7.082    6.36      6.145   7.093      7.322
SF     9.266      6.919      6.814   6.321     13.623      12.331   11.047    9.108   10.851     13.688
MI     2.717      2.816      2.681   2.553     3.839       4.435    2.274     2.347   3.919      4.065
VIF    0.763      0.476      0.613   0.669     0.932       0.839    0.674     0.564   0.849      0.907
AG     2.722      1.947      2.123   2.158     3.915       3.023    3.289     2.467   2.841      3.968
Qabf   0.489      0.254      0.297   0.313     0.649       0.514    0.503     0.424   0.519      0.652

At first glance, all 10 methods produce good fusion results, but the other 9 comparative methods still have some shortcomings. FusionGAN has a certain degree of scene blurring. The result of SwinFusion is too biased towards visible light, resulting in overexposure of the headlights. DeepM2CDL, DATFuse, and RFN-Nest are unable to effectively preserve the thermal radiation information of infrared images, and the overall scene appears dark. DenseFuse has to some extent alleviated the problem of insufficient detection of thermal radiation information, but there are still minor defects. People in the GANMcC, U2Fusion, and LRRNet scenes are darker and lose some thermal radiation information. Overall, the proposed SGCNet has the advantage of appropriately preserving the complementary information of the source images. It can produce fusion results with clear targets and environments. The proposed method can effectively preserve texture details in visible images, such as the lighting on the car head in the image, which retains both low-frequency illumination information and high-frequency detail information.

Table 2 shows the quantitative comparison of the ten comparison methods on six widely used evaluation indicators. For each indicator, we calculate the average score of 50 test samples and mark the best method in bold, with an underline indicating the second-best method. It can be seen that our SGCNet achieves the best objective evaluation in EN, SF, AG, and Qabf, and demonstrates significant advantages over other competitors. It achieves the second-best objective evaluation on MI and VIF.

Based on qualitative and quantitative analysis, we can conclude that our proposed SGCNet has pleasing fusion performance and outperforms other state-of-the-art methods in infrared and visible light visual perception and quantitative evaluation.
Figure 5: Qualitative comparison on the medical image database. (a) MRI images; (b) PET images; (c) DenseFuse; (d) FusionGAN; (e) GANMcC; (f) RFN-Nest; (g) SwinFusion; (h) DATFuse; (i) U2Fusion; (j) LRRNet; (k) DeepM2CDL; (l) SGCNet (Ours).

Table 3: Quantitative comparison of the proposed SGCNet with nine state-of-the-art methods on the medical image dataset. For each indicator, the optimal and suboptimal methods are marked in bold and underlined, respectively.

       DenseFuse  FusionGAN  GANMcC  RFN-Nest  SwinFusion  DATFuse  U2Fusion  LRRNet  DeepM2CDL  SGCNet (Ours)
EN     5.657      5.791      5.664   5.654     5.821       5.938    5.487     5.523   5.623      6.318
SF     16.786     13.269     16.086  10.030    30.612      32.200   23.128    15.629  34.682     34.828
MI     3.043      2.158      2.709   2.639     2.956       2.833    2.619     2.191   3.318      3.044
VIF    0.568      0.317      0.496   0.489     0.413       0.590    0.483     0.338   0.701      0.711
AG     6.263      4.687      5.631   4.100     10.209      10.033   7.055     5.172   10.923     11.289
Qabf   0.433      0.153      0.288   0.172     0.690       0.668    0.471     0.166   0.745      0.712

4.3.3. The fusion results of medical image datasets

The visual quality comparison of PET and MRI image fusion is shown in Figure 5. From the results, it can be seen that other fusion algorithms inevitably weaken the basic information in the source image. More specifically, DenseFuse, U2Fusion, and LRRNet have weaker retention of MRI image features, and the colour blocks are noticeably darker compared to the other images. FusionGAN, RFN-Nest, and DATFuse exhibit ambiguity due to the lack of guidance from low-frequency information. GANMcC, SwinFusion, and DeepM2CDL show poor performance in preserving details and colour blocks of intermediate particles. Our fusion model can preserve rich details in MRI images and fully characterize the functional information in PET images.

Table 3 shows the quantitative comparison of the ten comparison methods on six widely used evaluation indicators. For each indicator, we calculate the average score of 269 test samples and mark the best method in bold and the second-best method in underline. It can be seen that our SGCNet achieves the best objective evaluation in EN, SF, VIF, and AG, and achieves the second-best objective evaluation in MI and Qabf.
Figure 6: Visual results of the visible and infrared image fusion ablation experiment. (a) and (b) are the infrared and visible images; (c) is the result of replacing the feature extraction layer with a 4-layer CNN (w/o Feature Extract); (d) is the result of removing the feature enhancement module (w/o Enhance); (e) is the result of removing the cross-attention-based fusion module (w/o Fuse); and (f) is the result of our network.

Table 4: Quantitative evaluation results of ablation studies on the TNO dataset. Bold indicates the best result, underlined indicates the second-best result.

       w/o Feature Extraction  w/o Enhancement Module  w/o Cross Attention  w/o Semantic Information  SGCNet (Ours)
EN     6.675                   6.749                   6.574                6.864                     6.935
SF     10.729                  9.985                   10.695               11.355                    11.029
MI     2.367                   3.174                   2.330                3.240                     3.936
VIF    0.654                   0.760                   0.364                0.725                     0.782
AG     3.925                   3.832                   3.618                4.406                     4.259
Qabf   0.470                   0.509                   0.310                0.565                     0.573

4.4. Ablation Study

4.4.1. Research on the ablation of network structure

Our ablation research mainly focuses on three modules of the network: the feature extraction module with parallel Transformer and CNN branches, the spatial-attention-like feature enhancement module, and the multi-head cross-attention fusion module based on semantic segmentation adjustment.

Figure 6 shows the fused images obtained on the TNO dataset using w/o Feature Extraction, w/o Enhancement, w/o Fuse, and the complete fusion model (i.e., SGCNet). It can be seen that without the feature extraction module, deep networks cannot construct long-range dependencies well, resulting in blurry fused images. Without the feature enhancement module, the fusion model cannot accurately focus on significant features, so the fused image cannot fully extract detailed features from the input image. Without the multi-head cross-attention module based on semantic segmentation adjustment, the fusion model cannot attend to significant features and loses information due to insufficient fusion. Overall, the complete architecture equipped with the above three modules can simultaneously capture local and global information from the source image and focus on significant features, with better performance than the other three degraded architectures.

Table 4 reports a quantitative comparison of our SGCNet with the various network architectures. For each indicator, the best model is marked in bold. It is obvious that the complete structure has significant advantages over the other three fusion networks.
4.4.2. Research on parameter selection of loss function

In order to investigate the importance of each term in the proposed loss function, namely the gradient loss and the auxiliary intensity loss, we examine them separately. The gradient loss is designed to force the fused image to transfer effective spatial details from the input images, while the auxiliary intensity loss maintains the optimal intensity distribution of the fused image. In order to verify the influence of the weights of these two loss terms on the model, a loss function parameter selection experiment was conducted on the test set.

Figure 7: Visualization results of the parameter-selection experiment for the loss function. (a) and (b) are infrared and visible images, respectively. In (c), γ is set to 0 and λ to 50. In (d), γ is set to 20 and λ to 50. In (e), γ is set to 10 and λ to 0. Our method sets γ to 40 and λ to 50.

Table 5: Quantitative results of loss function parameter selection experiments on the TNO dataset, bold for best results and underlined for second-best results.

       γ=0,λ=50  γ=10,λ=50  γ=20,λ=50  γ=30,λ=50  γ=50,λ=50  γ=10,λ=0  γ=40,λ=50 (Ours)
EN     6.829     6.828      6.844      6.875      6.827      5.874     6.935
SF     9.645     11.014     10.929     10.634     10.474     10.448    11.029
MI     3.048     3.174      3.330      3.224      3.353      2.700     3.936
VIF    0.734     0.740      0.752      0.785      0.706      0.618     0.782
AG     3.640     3.852      4.214      4.090      4.029      4.022     4.259
Qabf   0.510     0.509      0.541      0.559      0.516      0.420     0.573

Figure 7 shows the fusion results of images with different loss functions on the TNO dataset. It can be seen that the fusion effect is best when γ = 40 and λ = 50. This setting can simultaneously preserve the complementary features of the input images and obtain fusion results with significant targets and clear scenes. In contrast, without the gradient loss, i.e., λ = 0, the fusion results are affected by exposure and lose some important details. In the absence of the auxiliary intensity loss, i.e., γ = 0, the fusion result deteriorates severely and produces undesirable noise.

Table 5 reports a quantitative comparison of the proposed method with different loss functions. Consistent with the subjective observation, the best fusion effect is obtained when γ = 40 and λ = 50.

In order to verify the contribution of semantic segmentation information to the fusion result, we conduct a feature visualization study on this part, as shown in Figure 8. The texture details of the image fusion result are clearer with the addition of semantic information adjustment.

Figure 8: Visualization results of features adjusted with semantic segmentation embedding. (a) Infrared & Visible; (b) w/o SWE; (c) SGCNet (Ours). (a) is the infrared and visible image pair, (b) is the result heatmap after removing the prior semantic information, and (c) is our result heatmap.

4.5. Multi-modal Object Detection

To verify the better performance of our method in salient object detection tasks, we conducted object detection experiments on the M3FD dataset using the YOLOv8 detection model, as shown in Figure 9. The training set contains 4200 images, and 63 randomly selected images are used as the test set for prediction. The infrared image mistakenly detects blank areas as cars, while the visible light image does not detect the pedestrians in the image. We also compare three other recent methods, and none of them reaches the same level of confidence on pedestrians as our method.
Figure 9: Test results of the salient object detection task on the M3FD dataset. (a) Infrared images; (b) Visible images; (c) U2Fusion; (d) DeepM2CDL; (e) DATFuse; (f) SGCNet (Ours).

Table 6: Quantitative results of the salient object detection experiment on M3FD data.

       Infrared Image  Visible Image  U2Fusion  DeepM2CDL  DATFuse  SGCNet (Ours)
MAE    0.125           0.098          0.084     0.057      0.063    0.045
wFβ    0.718           0.614          0.729     0.734      0.715    0.854

Table 6 shows the results of our quantitative comparison. We compare the mean absolute error (MAE) with the weighted F-measure (wFβ), where MAE represents the average per-pixel difference between the predicted saliency map and its ground truth mask. wFβ, as a supplementary measure to maxFβ, is used to overcome the potential unfair comparisons caused by interpolation defects, dependency defects, and equal-importance defects. From the results, it can be seen that both of our indicators are far ahead of the other methods.

5. Conclusion

This article proposes an image fusion method based on low-frequency information-guided high-frequency information embedding, termed SGCNet. SGCNet employs our original dual-layer dense connection CNN and a Transformer cascaded encoder, enabling efficient extraction and processing of both high-frequency and low-frequency information from images. By incorporating advanced segmentation prior knowledge, our method ensures accurate embedding of high-frequency details under the guidance of low-frequency information, effectively addressing the bias issue prevalent in traditional fusion techniques. Our fusion method is applicable to multimodal image fusion as well as medical image fusion. From a global perspective, our model presents the appropriate apparent intensity of the fused image. We conduct extensive experiments demonstrating that SGCNet outperforms state-of-the-art alternatives.

Acknowledgement

This research is supported by the National Natural Science Foundation of China (No. 62306203, No. 62202205, No. 62002254) and the Natural Science Foundation of Jiangsu Province, China (No. BK20200988).

References

[1] X. Wang, H. Tang, Z. Zhu, Gmc: A general framework of multi-stage context learning and utilization for visual detection tasks, Computer Vision and Image Understanding 241 (2024) 103944.
[2] Y. Cao, D. Guan, W. Huang, J. Yang, Y. Cao, Y. Qiao, Pedestrian detection with unsupervised multispectral feature learning using deep neural networks, Information Fusion 46 (2019) 206–217.
[3] Y. Wang, J. Chen, X. Fang, M. Jiang, J. Ma, Dual cross perception network with texture and boundary guidance for camouflaged object detection, Computer Vision and Image Understanding 248 (2024) 104131.
[4] N.-T. Thu, H. N. Tran, M. D. Hossain, E.-N. Huh, Lightsod: Towards lightweight and efficient network for salient object detection, Computer Vision and Image Understanding 249 (2024) 104148.
[5] Y. Cui, Q. Cheng, D. Guo, X. Kong, Z. Wang, J. Zhang, Object discriminability re-extraction for distractor-aware visual object tracking, Computer Vision and Image Understanding 247 (2024) 104075.
[6] A. Psalta, V. Tsironis, K. Karantzalos, Transformer-based assignment decision network for multiple object tracking, Computer Vision and Image Understanding 241 (2024) 103957.
[7] M. Saada, C. Kouppas, B. Li, Q. Meng, A multi-object tracker using dynamic bayesian networks and a residual neural network based similarity estimator, Computer Vision and Image Understanding 225 (2022) 103569.
[8] W. Zhou, D. Yang, Analysis and comparison of automatic image focusing algorithms in digital image processing, Journal of Radiation Research and Applied Sciences 16 (4) (2023) 100672.
[9] C. Liang, S. Bai, Found missing semantics: Supplemental prototype network for few-shot semantic segmentation, Computer Vision and Image Understanding 249 (2024) 104191.
[10] S. Sun, W. Yang, H. Peng, J. Wang, Z. Liu, A semantic segmentation method integrated convolutional nonlinear spiking neural model with transformer, Computer Vision and Image Understanding 249 (2024) 104196.
[11] Z. Wu, Z. Zhou, G. Allibert, C. Stolz, C. Demonceaux, C. Ma, Transformer fusion for indoor rgb-d semantic segmentation, Computer Vision and Image Understanding 249 (2024) 104174.
[12] H. Li, X.-J. Wu, Densefuse: A fusion approach to infrared and visible images, IEEE Transactions on Image Processing 28 (5) (2018) 2614–2623.
[13] W. Tang, F. He, Y. Liu, Y. Duan, T. Si, Datfuse: Infrared and visible image fusion via dual attention transformer, IEEE Transactions on Circuits and Systems for Video Technology 33 (7) (2023) 3159–3172.
[14] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587.
[15] J. Zhu, W. Jin, L. Li, Z. Han, X. Wang, Multiscale infrared and visible image fusion using gradient domain guided image filtering, Infrared Physics & Technology 89 (2018) 8–19.
[16] J. Chen, X. Li, L. Luo, X. Mei, J. Ma, Infrared and visible image fusion based on target-enhanced multiscale transform decomposition, Information Sciences 508 (2020) 64–78.
[17] Q. Zhang, Y. Liu, R. S. Blum, J. Han, D. Tao, Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review, Information Fusion 40 (2018) 57–75.
[18] Z. Zhu, H. Yin, Y. Chai, Y. Li, G. Qi, A novel multi-modality image fusion method based on image decomposition and sparse representation, Information Sciences 432 (2018) 516–529.
[19] W. Ding, D. Bi, L. He, Z. Fan, Infrared and visible image fusion method based on sparse features, Infrared Physics & Technology 92 (2018) 372–380.
[20] N. Mitianoudis, T. Stathaki, Pixel-based and region-based image fusion schemes using ica bases, Information Fusion 8 (2) (2007) 131–142.
[21] L. Yan, Q. Hao, J. Cao, R. Saad, K. Li, Z. Yan, Z. Wu, Infrared and visible image fusion via octave gaussian pyramid framework, Scientific Reports 11 (1) (2021) 1235.
[22] G. Pajares, J. M. De La Cruz, A wavelet-based image fusion tutorial, Pattern Recognition 37 (9) (2004) 1855–1872.
[23] F. Meng-Yin, Z. Cheng, Fusion of infrared and visible images based on the second generation curvelet transform, Journal of Infrared and Millimeter Waves 28 (4) (2009) 254–258.
[24] L. Yang, B. Guo, W. Ni, Multimodality medical image fusion based on multiscale geometric analysis of contourlet transform, Neurocomputing 72 (1-3) (2008) 203–211.
[25] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing 22 (7) (2013) 2864–2875.
[26] J. Hu, S. Li, The multiscale directional bilateral filter and its application to multisensor image fusion, Information Fusion 13 (3) (2012) 196–206.
[27] E. S. Ribeiro, L. R. Araújo, G. T. Chaves, A. P. Braga, Distance-based loss function for deep feature space learning of convolutional neural networks, Computer Vision and Image Understanding 249 (2024) 104184.
[28] X. He, J. Jin, Y. Jiang, D. Li, A lightweight convolutional neural network-based feature extractor for visible images, Computer Vision and Image Understanding 249 (2024) 104157.
[29] M. Edwards, X. Xie, R. I. Palmer, G. K. Tam, R. Alcock, C. Roobottom, Graph convolutional neural network for multi-scale feature learning, Computer Vision and Image Understanding 194 (2020) 102881.
[30] W. Tang, Y. Liu, C. Zhang, J. Cheng, H. Peng, X. Chen, Green fluorescent protein and phase-contrast image fusion via generative adversarial networks, Computational and Mathematical Methods in Medicine 2019 (1) (2019) 5450373.
[31] K. Regmi, A. Borji, Cross-view image synthesis using geometry-guided conditional gans, Computer Vision and Image Understanding 187 (2019) 102788.
[32] J. Ma, W. Yu, P. Liang, C. Li, J. Jiang, Fusiongan: A generative adversarial network for infrared and visible image fusion, Information Fusion 48 (2019) 11–26.
[33] J. Ma, H. Zhang, Z. Shao, P. Liang, H. Xu, Ganmcc: A generative adversarial network with multiclassification constraints for infrared and visible image fusion, IEEE Transactions on Instrumentation and Measurement 70 (2020) 1–14.
[34] J. Ma, H. Xu, J. Jiang, X. Mei, X.-P. Zhang, Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion, IEEE Transactions on Image Processing 29 (2020) 4980–4995.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30.
[36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929.
[37] X. Wang, K. Chen, Z. Zhao, G. Shi, X. Xie, X. Jiang, Y. Yang, Multi-scale adaptive skeleton transformer for action recognition, Computer Vision and Image Understanding (2024) 104229.
[38] L. Huang, X. Bai, J. Zeng, M. Yu, W. Pang, K. Wang, Fam: Improving columnar vision transformer with feature attention mechanism, Computer Vision and Image Understanding 242 (2024) 103981.
[39] L. Qu, S. Liu, M. Wang, Z. Song, Transmef: A transformer-based multi-exposure image fusion framework using self-supervised multi-task learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 2126–2134.
[40] Z. Cai, R. Hong, X. Lin, J. Yang, Y. Ni, Z. Liu, C. Jin, F. Da, A mlp architecture fusing rgb and cassi for computational spectral imaging, Computer Vision and Image Understanding 249 (2024) 104214.
[41] S. N. Uddin, Y. J. Jung, Sifnet: Free-form image inpainting using color split-inpaint-fuse approach, Computer Vision and Image Understanding 221 (2022) 103446.
[42] J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, Y. Ma, Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer, IEEE/CAA Journal of Automatica Sinica 9 (7) (2022) 1200–1217.
[43] J. Ma, J. Zhao, J. Jiang, H. Zhou, X. Guo, Locality preserving matching, International Journal of Computer Vision 127 (2019) 512–531.
[44] J. Yao, J. Zhang, H. Zhang, L. Zhuo, Lcma-net: A light cross-modal attention network for streamer re-identification in live video, Computer Vision and Image Understanding 249 (2024) 104183.
[45] J. Dang, Y. Zhong, X. Qin, Ppformer: Using pixel-wise and patch-wise cross-attention for low-light image enhancement, Computer Vision and Image Understanding 241 (2024) 103930.
[46] H. Li, X.-J. Wu, Crossfuse: A novel cross attention mechanism based infrared and visible image fusion approach, Information Fusion 103 (2024) 102147.
[47] B. Wan, X. Zhou, Y. Sun, T. Wang, C. Lv, S. Wang, H. Yin, C. Yan, Mffnet: Multi-modal feature fusion network for vdt salient object detection, IEEE Transactions on Multimedia 26 (2023) 2069–2081.
[48] L. Tang, J. Yuan, J. Ma, Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network, Information Fusion 82 (2022) 28–42.
[49] H. Zhang, X. Zuo, J. Jiang, C. Guo, J. Ma, Mrfs: Mutually reinforcing image fusion and segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26974–26983.
[50] J. Shen, Y. Chen, Y. Liu, X. Zuo, H. Fan, W. Yang, Icafusion: Iterative cross-attention guided feature fusion for multispectral object detection, Pattern Recognition 145 (2024) 109913.
[51] J. Shui, S. Ding, M. Li, Y. Ma, Entity semantic feature fusion network for remote sensing image-text retrieval, in: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Springer, 2024, pp. 130–145.
[52] Z. Wen, F. Zhang, S. Zhang, H. Sun, M. Xu, L. Sun, Z. Lian, B. Liu, J. Tao, Multimodal fusion with pre-trained model features in affective behaviour analysis in-the-wild, arXiv preprint arXiv:2403.15044.
[53] H. Li, X.-J. Wu, J. Kittler, Rfn-nest: An end-to-end residual fusion network for infrared and visible images, Information Fusion 73 (2021) 72–86.
[54] H. Xu, J. Ma, J. Jiang, X. Guo, H. Ling, U2fusion: A unified unsupervised image fusion network, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (1) (2020) 502–518.
[55] H. Li, T. Xu, X.-J. Wu, J. Lu, J. Kittler, Lrrnet: A novel representation learning guided fusion network for infrared and visible images, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (9) (2023) 11040–11052.
[56] X. Deng, J. Xu, F. Gao, X. Sun, M. Xu, Deepm2cdl: Deep multi-scale multi-modal convolutional dictionary learning network, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2023) 2770–2787.