
IEEE Transactions on Instrumentation and Measurement, January 2021. DOI: 10.1109/TIM.2021.3075747



STDFusionNet: An Infrared and Visible Image Fusion Network Based on Salient Target Detection

Jiayi Ma, Linfeng Tang, Meilong Xu, Hao Zhang, and Guobao Xiao

Abstract—In this paper, we propose an infrared and visible image fusion network based on the salient target detection, termed as STDFusionNet, which can preserve the thermal targets in infrared images and the texture structures in visible images. Firstly, a salient target mask is dedicated to annotating regions of the infrared image that people or machines pay more attention to, so as to provide spatial guidance for the integration of different information. Secondly, we combine this salient target mask to design a specific loss function to guide the extraction and reconstruction of features. Specifically, the feature extraction network can selectively extract salient target features from infrared images and background texture features from visible images, while the feature reconstruction network can effectively fuse these features and reconstruct the desired results. It is worth noting that the salient target mask is only required in the training phase, which enables the proposed STDFusionNet to be an end-to-end model. In other words, our STDFusionNet can fulfill salient target detection and key information fusion in an implicit manner. Extensive qualitative and quantitative experiments demonstrate the superiority of our fusion algorithm over the state-of-the-art, where our algorithm is much faster and the fusion results look like high-quality visible images with clearly highlighted infrared targets. Moreover, the experimental results on the public datasets verify that our algorithm can improve the EN, MI, VIF and SF metrics with 1.25%, 22.65%, 4.3% and 0.89% gains, respectively. Our code is publicly available at: https://github.com/Linfeng-Tang/STDFusionNet.

Index Terms—Image fusion, salient target detection, deep learning, infrared image, mask.

Fig. 1. The weakening of useful information in existing methods. From left to right: infrared image, visible image, results of U2Fusion [9] and FusionGAN [10].

This work was supported by the National Natural Science Foundation of China under Grant 61773295, the Key Research and Development Program of Hubei Province under Grant 2020BAB113, and the Natural Science Foundation of Hubei Province under Grant 2019CFA037.
J. Ma, L. Tang, M. Xu and H. Zhang are with the Electronic Information School, Wuhan University, Wuhan, 430072, China (e-mail: [email protected], [email protected], [email protected], [email protected]).
G. Xiao is with the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, 350108, China (e-mail: [email protected]).

I. INTRODUCTION

The image captured by a single sensor or under a single shooting setting can only describe the imaging scene from a limited perspective, hence fusing complementary images from various sensors or different shooting settings contributes to enhanced understanding of the scene. Among all image fusion scenarios, infrared and visible image fusion is probably the most widely used [1]. Infrared images capture thermal radiation emitted from objects, which can effectively highlight salient targets but lack texture details. In contrast, visible images usually contain rich structure information, but are easily affected by the environment and may lose targets. This complementarity gives us the opportunity to fuse them to obtain desired results that contain both significant thermal targets and rich textural details. Due to the excellent characteristics of fused results, infrared and visible image fusion has been widely used in military actions, target detection, recognition, surveillance, and many other domains.

In the past few decades, numerous traditional infrared and visible image fusion algorithms have been proposed, which can be classified into five classes, including multi-scale decomposition-based methods [2], [3], saliency-based methods [4], sparse representation-based methods [5], optimization-based methods [6], and hybrid-based methods [7]. Although the above algorithms have achieved relatively satisfactory fusion performance in most cases, there are still some drawbacks. i) Existing traditional methods generally use the same transformation or representation to extract features from source images without considering the inherent properties of infrared and visible images. ii) Measurement of activity levels and design of fusion rules in most methods are manual and tend to become more complex [8].

In recent years, as deep learning techniques are maturing, researchers have proposed several deep fusion algorithms. Generally, the predominant deep fusion methods can be divided into two categories, i.e., convolutional neural network (CNN)-based methods [12], [11] and generative adversarial network (GAN)-based methods [13]. The CNN-based fusion methods [14], [9], [15] usually rely on the powerful fitting ability of neural networks to realize the extraction and reconstruction of effective information under the guidance of well-designed loss functions. The GAN-based fusion methods establish an adversarial game between the fused image and source images, so as to make the fused image approximate the desired probability distribution without supervision [10], [16]. Although the existing deep learning-based methods have achieved relatively good fusion performance compared to traditional methods, there is still a challenge that cannot be ignored. Specifically, previous deep fusion methods treat different regions of various source images without distinction when constructing loss functions, which means that these methods introduce a lot of redundant or even invalid information in the fusion process. As a result, the useful information is weakened in the fused image. We provide a typical example in Fig. 1 to illustrate this more intuitively, in which U2Fusion [9] is a representative CNN-based method and FusionGAN [10] is a typical GAN-based method. It can be seen that U2Fusion weakens the salient target, while FusionGAN weakens the background textures.

Fig. 2. Schematic illustration of STDFusionNet. From left to right: the infrared image, the visible image, the fusion result of a traditional method GTF [6], the
fusion result of a deep learning-based method DenseFuse [11], and the result of our proposed STDFusionNet. The red box and green box show that GTF and
DenseFuse suffer detail loss, blurred edges, and artifacts, and our result better highlights prominent targets and has abundant textures.

To address the above challenges, we propose a new network based on salient target detection for infrared and visible image fusion, namely STDFusionNet. First, for infrared images, humans and machines primarily pay attention to the regions where salient targets such as pedestrians, vehicles, bunkers, etc., are located. For visible images, the rich background textures help to make the imaging scene more vivid. Therefore, we can define the most meaningful information for the fusion process as the significant thermal targets in the infrared image and the background texture structures in the visible image. Based on this definition, we annotate the salient targets in the infrared image to obtain the salient target masks. Then, the obtained salient masks are introduced into the design of the specific loss function, which selectively drives the network to extract and reconstruct the above-defined effective features. In addition, due to the significant differences between multi-modal source images, we adopt a pseudo-siamese network to extract different types of information from source images with distinction, such as the salient target intensity and background texture structures. It is worth emphasizing that the salient target mask is only utilized to guide the training of networks, and is not required to be fed into the network during the testing phase, and hence our network is an end-to-end model. Under these specific designs, our STDFusionNet effectively addresses the problems of feature extraction effectiveness and desired information definition.

To intuitively show the performance of our method, we provide a typical example in Fig. 2, with a traditional GTF [6] and a deep learning-based DenseFuse [11] for comparison. GTF views the fused image as an infrared image with additional visible gradients. Even though the targets in its result can be highlighted, the textures are not sharp, and artifacts are presented, leading to an unnatural fused image. On the contrary, in the result of DenseFuse, the texture details are better preserved, while the thermal information of the target is weakened. Our STDFusionNet has the advantages of both GTF and DenseFuse. Specifically, our method implicitly achieves the salient target detection and the extraction and reconstruction of effective information, and the fused image could highlight significant targets while retaining abundant textures (e.g., the bush, road, and wall).

The main contributions of this work include the following three aspects:
• We introduce the salient target mask to the specific loss function, which can guide the network to detect the thermal radiation targets in the infrared image, fusing them with the background texture details in the visible image.
• We explicitly define the desired information in the fusion process as the salient target in the infrared image and the background textures in the visible image. To the best of our knowledge, this is the first precise definition for the target of infrared and visible image fusion.
• Extensive experiments demonstrate the superiority of our method over state-of-the-art alternatives. Compared with the competitors, our approach could generate fusion results looking like high-quality visible images with highlighted targets, which contribute to improving target recognition and scene understanding.

The rest of this paper is organized as follows. In Section II, we briefly introduce the related works on image fusion. In Section III, the proposed method is introduced in detail. Section IV illustrates the fusion performance of our method on public datasets with comparisons to other approaches, followed by some concluding remarks in Section V.

II. RELATED WORK

In this section, we review the existing infrared and visible image fusion approaches, including traditional fusion methods and deep learning-based fusion methods.

A. Traditional Fusion Methods

Traditional fusion methods generally manually design activity level measurements or fusion rules to perform image fusion in the spatial or transform domain, and can be divided into five categories according to their principles, including multi-scale transform-based methods [17], [2], saliency-based methods [4], [18], sparse representation-based methods [5], [19], optimization-based methods [6], [20] and hybrid methods [21], [7]. The main ideas of these methods are discussed below.

The multi-scale transform-based methods believe that objects in the physical world are typically composed of components of various scales, and the multi-scale transform is consistent with the human visual system. Therefore, the fused images obtained by the multi-scale transform can have a pleasing visual effect [1].

In general, infrared and visible image fusion schemes based on multi-scale decomposition typically involve three steps. Firstly, all source images are decomposed into a series of multi-scale representations. Subsequently, the multi-scale representations of the original images are fused according to specific fusion rules. Eventually, the fused image is obtained by performing corresponding inverse transforms on the fused multi-scale representations [17].

The saliency-based methods are usually built on the basis that salient targets are more easily perceived by human vision than their adjacent objects or pixels. Saliency is applied to infrared and visible image fusion in two main ways, i.e., weight calculation and salient target extraction. The former is usually combined with multi-scale transforms, where the source images are decomposed into base and detail layers through the multi-scale transform. Then saliency detection is used to obtain a saliency map of the base or detail layer, and a weight map of the base or detail layer is obtained from the saliency map [7]. The latter uses saliency detection to extract information about the significant regions from the infrared and visible images and then integrates the crucial information into the final fused image [18]. Generally, the saliency-based methods can maintain the integrity and pixel intensity of the significant object regions and improve the visual quality of the fused image [4].

The premise of the sparse representation-based methods is to learn an over-complete dictionary from a great number of high-quality images, which is usually achieved by the joint sparse representation [22] and the convolutional sparse representation [5]. Then, the sparse representation coefficients of the source images can be obtained with the learned over-complete dictionary, and be fused according to the given fusion rule. Finally, the fused image is reconstructed from the fused sparse representation coefficients with the learned over-complete dictionary.

The optimization-based methods generate the desired fusion result via minimizing an objective function [6], [20]. Therefore, the key to such methods lies in the design of objective functions. The construction of the objective functions should consider two aspects, namely the overall intensity fidelity and texture structure preservation. The former constrains the fused result to have the desired brightness distribution, while the latter drives the fused result to contain rich texture details. The above-mentioned infrared and visible image fusion methods all have their strengths and weaknesses, and the hybrid models combine their strengths to improve the fusion performance [21], [7].

B. Deep Learning-based Fusion Methods

Relying on the excellent feature extraction capabilities of neural networks, deep learning has promoted tremendous progress in image fusion. Early deep learning-based methods only adopt the neural network to construct a weight map or extract features [9]. Liu et al. adopted a pre-trained convolutional neural network to implement activity level measurement of source images and generate a weight map, in which the whole fusion process is based on pyramids [12]. However, the neural network is not specifically trained for image fusion, which limits the fusion performance.

With further research, some deep methods based on auto-encoders have been proposed. These methods usually pre-train an auto-encoder to implement feature extraction and image recovery, while the feature fusion is fulfilled by traditional rules. Li et al. introduced the dense block into the encoder and decoder to design a new image fusion network, termed as DenseFuse [11]. In the fusion layer, DenseFuse is achieved using the conventional addition and l1-norm strategy. Considering that a network without down-sampling cannot extract multi-scale features from source images, nest connection-based networks are introduced in NestFuse to extract information from source images in a multi-scale perspective [23]. The spatial and channel attention models are used to fuse the extracted information, but it is worth mentioning that the attention mechanism used for fusion is still unlearnable.

Since the unsupervised distribution estimation ability of the generative adversarial network is very suitable for the image fusion task, more and more GAN-based fusion methods have been proposed. Ma et al. established an adversarial game between the fused result and the visible image to further enhance the preservation of texture structures. However, this single adversarial mechanism can easily lead to unbalanced fusion. To ameliorate this problem, they later proposed the dual-discriminator conditional generative adversarial network (DDcGAN) [13] to realize image fusion, in which both the infrared image and the visible image participate in the adversarial games. It is worth noting that the generative adversarial network with dual discriminators is not easy to train. In this context, a generative adversarial network with multi-classification constraints is proposed [24] to achieve information balance in the fusion process, in which the multi-distribution simultaneous estimation is done by a single adversarial game.

Due to the strong ability of feature representation in neural networks, varied information can be represented in a unified way [9]. A growing number of researchers are dedicated to exploring a general image fusion framework. Zhang et al. first utilized two convolutional layers to extract the salient features from source images. Then, they selected appropriate fusion rules according to the type of input images to fuse the source image features, and recovered the fused images from the convolutional features by two convolutional layers [14]. Their proposed network framework only needs to be trained on one type of image fusion dataset and subsequently adjusts the fusion rules according to the type of source images, thus implementing a unified network to solve various fusion tasks. In contrast, Zhang et al. proposed a network structure based on proportional maintenance of gradient and intensity, which adapts to different fusion tasks via adjusting the weights of the loss terms when constructing the loss function [25]. Considering the cross-fertilization between different fusion tasks, U2Fusion was trained sequentially for different fusion missions, and a unified model for multiple fusion tasks was obtained [9].

Compared with the above-mentioned methods, the proposed STDFusionNet has two main technical contributions. First, the desired information in the image fusion process is defined as the salient target in the infrared image and the texture information in the visible image. The defined desired information can provide a more explicit optimization direction for parameter learning. Second, we design a special loss in conjunction with the salient target mask to guide the network to achieve salient target detection and information fusion. This enables the fused images generated by STDFusionNet to retain as much important information as possible from the source images and reduce the effect of redundant information.
III. METHOD

In this section, we describe the proposed infrared and visible image fusion network based on the salient target detection, STDFusionNet. First, we provide the problem formulation of STDFusionNet, in which the core ideas are discussed. Then, we describe the designed loss function in detail. Finally, we give the network architecture of the proposed STDFusionNet.

A. Problem Formulation

The target of image fusion is extracting significant information from multiple source images and fusing the complementary information to generate a synthesized image. The key to this problem is how to define the most meaningful information and how to fuse the complementary information. In infrared and visible image fusion, the most critical information is the salient targets and the texture structures, which are contained in infrared images and visible images, respectively. Therefore, we explicitly define the desired information as the salient target information in infrared images and the background texture structure information in visible images. Consequently, there are two keys to image fusion based on this definition.

The first key is to determine the salient target in the infrared image. As we observed, the significant information of infrared images is mainly presented in the regions containing objects (e.g., pedestrians, vehicles and bunkers) that can emit more heat. Hence, the proposed network should learn to automatically detect these regions from infrared images. The second key is to accurately extract the desired information from the detected regions and perform effective fusion and reconstruction. In other words, the fused result should accurately contain the salient target in the infrared image and the background texture in the visible image. The specific loss function and effective network structure are designed to address the above two key problems.

First, we propose a specific loss function to constrain the fusion process, in which the salient target mask is introduced to guide the network to detect salient areas, while the preservation of thermal targets and background texture is achieved by ensuring the intensity and gradient consistency in specific regions. Second, we design an effective network structure to realize feature extraction, fusion and reconstruction. Concretely, the feature extraction network adopts a pseudo-siamese network architecture to treat source images differently, so as to selectively extract salient target features from the infrared image I_ir and background texture features from the visible image I_vi. Eventually, the feature reconstruction network fuses the extracted features and reconstructs the fused image I_f, highlighting the salient targets in the infrared image while preserving the texture details of the visible image. Under the above design, our model can implicitly realize salient target detection and desired information fusion.

B. Loss Function

The loss function determines the type of information retained in the fused image and the proportional relationship between various kinds of information. The loss function of our STDFusionNet consists of two kinds of losses, i.e., the pixel loss and the gradient loss. The pixel loss constrains the pixel intensity of the fused image to be consistent with the source images, while the gradient loss forces the fused image to contain more detailed information. We construct the pixel loss and gradient loss for the salient regions and background areas. Combined with the salient target mask I_m, the desired result I_d can be defined as follows:

I_d = I_m \circ I_{ir} + (1 - I_m) \circ I_{vi},   (1)

where the operator \circ denotes element-wise multiplication. Similarly, the fused image generated by STDFusionNet can be segmented into a prominent region I_m \circ I_f containing the thermal infrared target and a background region (1 - I_m) \circ I_f with texture details.

Therefore, we construct the corresponding losses in the salient and background regions respectively for guiding the optimization of STDFusionNet. On the one hand, we constrain the fused image to have the same pixel intensity distribution as the desired image. The salient pixel loss L_{pixel}^{salient} and the background pixel loss L_{pixel}^{back} are formulated as:

L_{pixel}^{salient} = \frac{1}{HW} \| I_m \circ (I_f - I_{ir}) \|_1,   (2)

L_{pixel}^{back} = \frac{1}{HW} \| (1 - I_m) \circ (I_f - I_{vi}) \|_1,   (3)

where H and W are the height and width of the image, respectively, and \| \cdot \|_1 stands for the l1-norm. On the other hand, the gradient loss is introduced to enhance the constraints on the network in order to force the fused images to have sharper textures and salient targets with sharpened edges. Similar to the definition of the pixel loss, the gradient loss also contains the salient gradient loss L_{grad}^{salient} and the background gradient loss L_{grad}^{back}, which are formulated as follows:

L_{grad}^{salient} = \frac{1}{HW} \| I_m \circ (\nabla I_f - \nabla I_{ir}) \|_1,   (4)

L_{grad}^{back} = \frac{1}{HW} \| (1 - I_m) \circ (\nabla I_f - \nabla I_{vi}) \|_1,   (5)

where \nabla denotes the gradient operator; in this paper, we employ the Sobel operator to compute the gradient of an image.

Different from previous methods, we treat the pixel loss and gradient loss in the same region equally, so the final loss function is defined as:

L = (L_{pixel}^{back} + L_{grad}^{back}) + \alpha (L_{pixel}^{salient} + L_{grad}^{salient}),   (6)

where \alpha is the hyper-parameter that controls the loss balance between different regions. Due to the introduction of the salient region losses, i.e., L_{pixel}^{salient} and L_{grad}^{salient}, STDFusionNet has the ability to detect and extract salient targets in infrared images in an implicit manner.
Fig. 3. The architecture of the proposed infrared and visible image fusion network based on the salient target detection. The mask is only needed to construct the loss function in the training of the model, and is not needed in the testing phase.

C. Network Architecture

As illustrated in Fig. 3, our network architecture consists of two parts, i.e., the feature extraction network and the feature reconstruction network.

Feature Extraction Network. The feature extraction network is constructed on the basis of the convolutional neural network, and the ResBlock is introduced to enhance the feature extraction and alleviate the problem of vanishing/exploding gradients [26]. As shown in Fig. 3, the feature extraction network consists of a common layer and three ResBlocks that can reinforce the extracted information. The common layer consists of a convolutional layer with a kernel size of 5 × 5 and a Leaky Rectified Linear Unit activation layer. Each ResBlock consists of three convolutional layers, named Conv1, Conv2 and Conv3, and a skip-connected identity mapping convolutional layer, termed as identity conv. The kernel size of all convolutional layers is 1 × 1 except for Conv2, which has a kernel size of 3 × 3. Both Conv1 and Conv2 use the Leaky Rectified Linear Unit as the activation function, while the outputs of Conv3 and identity conv are summed and followed by the Leaky Rectified Linear Unit activation function. The identity conv is designed to overcome the inconsistent dimensionality of the ResBlock input and output. It is worth noting that, considering the different properties of infrared and visible images, both feature extraction branches use the same network architecture, but their respective parameters are trained independently. In combination with the proposed loss function, the feature extraction network can extract the salient target features and texture detail features from the source images.

Feature Reconstruction Network. The feature reconstruction network consists of four ResBlocks, which play the role of feature fusion and image reconstruction. It is worth noting that the activation function of the last layer uses Tanh to ensure that the range of variation of the fused image is consistent with that of the input images. The input of the feature reconstruction network is the concatenation of the infrared convolutional features and visible convolutional features in the channel dimension, and its output is the fused image. It is well known that information loss is a catastrophic problem in image fusion missions. Therefore, in all convolutional layers of STDFusionNet, the padding is set to SAME, and the stride is set to 1. As a result, our network does not introduce any downsampling, and the size of the fused image is consistent with the source images.

The purpose of the salient target mask is to highlight the objects (e.g., the pedestrians, vehicles, and bunkers) that radiate a large amount of heat in infrared images. Therefore, we use the labelme toolbox [27] to annotate salient targets in infrared images and convert them to binary salient target masks. Then, the salient target masks are inverted to obtain the background masks. After that, we multiply the salient target masks and texture background masks with the infrared images and visible images at the pixel level to obtain the source salient target regions and source background texture regions, respectively. Moreover, the fused images are also multiplied with the salient target masks and the texture background masks at the pixel level to obtain the fused salient target regions and the fused background regions. Subsequently, the original salient regions, original background regions, fused salient regions, and fused background regions are applied to construct the specific loss function, which guides the network to realize salient target detection and information fusion implicitly.
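The description above maps directly onto a small convolutional model. The following PyTorch sketch is an illustrative reading of Sec. III-C rather than the released TensorFlow implementation; in particular, the channel widths, the LeakyReLU slope, and the exact placement of the final Tanh are assumptions, since they are not specified in this excerpt.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Conv1 (1x1) -> Conv2 (3x3) -> Conv3 (1x1) plus a 1x1 identity conv skip path;
    # every convolution keeps the spatial size (stride 1, SAME-style padding).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.LeakyReLU(0.2))
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.conv3 = nn.Conv2d(out_ch, out_ch, 1)
        self.identity_conv = nn.Conv2d(in_ch, out_ch, 1)  # aligns input/output channels
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.conv3(self.conv2(self.conv1(x))) + self.identity_conv(x))

class Extractor(nn.Module):
    # Common 5x5 convolutional layer followed by three ResBlocks; one independent copy
    # is instantiated per modality (the pseudo-siamese design). Channel widths are assumed.
    def __init__(self):
        super().__init__()
        self.common = nn.Sequential(nn.Conv2d(1, 16, 5, padding=2), nn.LeakyReLU(0.2))
        self.blocks = nn.Sequential(ResBlock(16, 16), ResBlock(16, 32), ResBlock(32, 64))

    def forward(self, x):
        return self.blocks(self.common(x))

class STDFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.ir_branch, self.vis_branch = Extractor(), Extractor()
        # Reconstruction: four ResBlocks on the channel-wise concatenation, with a final
        # 1x1 projection and Tanh so the output range matches the normalized inputs.
        self.reconstruct = nn.Sequential(ResBlock(128, 64), ResBlock(64, 32),
                                         ResBlock(32, 16), ResBlock(16, 16),
                                         nn.Conv2d(16, 1, 1), nn.Tanh())

    def forward(self, ir, vis):
        return self.reconstruct(torch.cat([self.ir_branch(ir), self.vis_branch(vis)], dim=1))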
Fig. 4. Four source and mask image pairs. The top row contains visible images, the second row contains infrared images, and the corresponding mask images are in the bottom row.

Fig. 5. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the bench image pair. For a clear comparison, we select a salient region (i.e., the red box) in each image and then zoom in on it in the bottom right corner.

IV. EXPERIMENTS

In this section, we first describe the experimental settings, including datasets, evaluation metrics and training details. Then, we demonstrate the efficiency of the proposed STDFusionNet on public datasets, and compare it with nine state-of-the-art fusion methods, including two traditional methods, i.e., GTF [6] and MDLatLRR [2], and seven deep learning-based methods, i.e., DenseFuse [11], NestFuse [23], FusionGAN [10], GANMcC [24], IFCNN [14], PMGI [25] and U2Fusion [9]. The implementations of all these nine methods are publicly available, and we set the parameters as reported in the original papers. In addition, we provide the generalization experiment, efficiency comparison, visualization of salient target detection and ablation experiments to verify the effectiveness of the specific designs.

A. Experimental Settings

1) Datasets: Our experiments are executed on two datasets, namely the TNO dataset [28] and the RoadScene dataset [9]. The TNO dataset is a common dataset for infrared and visible image fusion, containing various types of military-related scenes. The dataset contains 60 pairs of infrared and visible images, with 3 sequences containing 19, 23, and 32 image pairs, respectively. A typical set of source images and their mask images are shown in Fig. 4. In order to remedy the shortage in quantity of existing datasets, Xu et al. released the RoadScene dataset based on the FLIR video [9]. The RoadScene dataset contains 221 pairs of aligned infrared and visible images containing rich scenes of roads, vehicles, and pedestrians. The release of this dataset effectively alleviates the challenges of few image pairs and low spatial resolution in the benchmark dataset.

2) Evaluation Metrics: The assessment of fusion performance can be divided into subjective and objective evaluations. Subjective evaluation relies on human visual perception. For infrared and visible image fusion, the desired result should contain significant thermal targets and rich texture structures. The objective evaluation is a supplement to the subjective evaluation, which usually uses some quantitative metrics to measure the fusion performance. In this paper, four popular metrics are selected, including entropy (EN) [29], mutual information (MI) [30], visual information fidelity (VIF) [31], and spatial frequency (SF) [32]. Their definitions are as follows.

The EN measures the amount of information contained in a fused image, which is defined based on information theory. The mathematical definition of EN is as follows:

EN = -\sum_{l=0}^{L-1} p_l \log_2 p_l,   (7)

where L denotes the number of gray levels and p_l is the normalized histogram of the corresponding gray level in the fused image. A larger entropy indicates that the fused image contains more information and that the method achieves better fusion performance.

The MI metric measures the amount of information transferred from the source images to the fused image. In information theory, MI measures the dependence of two random variables, and in image fusion evaluation, the MI fusion metric is defined as follows:

MI = MI_{A,F} + MI_{B,F},   (8)

where MI_{A,F} and MI_{B,F} denote the amount of information transferred from source images A and B to the fused image F, respectively. The MI between two random variables can be calculated by the Kullback-Leibler measure, which is defined as follows:

MI_{X,F} = \sum_{x,f} p_{X,F}(x,f) \log \frac{p_{X,F}(x,f)}{p_X(x)\, p_F(f)},   (9)

where p_X(x) and p_F(f) denote the marginal histograms of the source image X and the fused image F, respectively, and p_{X,F}(x,f) is the joint histogram of the source image X and the fused image F. The larger the MI, the more information is transferred from the source images to the fused image and the better the fusion performance.
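The two histogram-based metrics can be computed in a few lines. The following NumPy sketch is an illustrative implementation of Eqs. (7)-(9) for 8-bit grayscale images, not the evaluation code used in the paper.

import numpy as np

def entropy(img, levels=256):
    # Eq. (7): Shannon entropy of the normalized gray-level histogram.
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(src, fused, levels=256):
    # Eq. (9): MI between a source image and the fused image, from the joint histogram.
    joint, _, _ = np.histogram2d(src.ravel(), fused.ravel(), bins=levels,
                                 range=[[0, levels], [0, levels]])
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of the source image
    p_f = p_xy.sum(axis=0, keepdims=True)   # marginal of the fused image
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_f)[nz]))

def mi_metric(ir, vis, fused):
    # Eq. (8): total amount of information transferred from both source images.
    return mutual_information(ir, fused) + mutual_information(vis, fused)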
Fig. 6. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the Kaptein 1123 image pair. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 7. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the Kaptein 1654 image pair. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 8. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the Tree 4915 image pair. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

The VIF metric measures the information fidelity of the fused image, which is consistent with the human visual system. Computing the VIF metric usually involves the following four steps: first, the source images and the fused image are divided into different blocks; second, the distortion of the visual information in each block is evaluated; third, the VIF of each sub-band is calculated; finally, the overall metric is calculated based on the VIF values.

The SF metric is a reference-free metric that measures the spatial frequency information contained in the fused image through the row frequency and column frequency. The mathematical definition of SF is as follows:

SF = \sqrt{RF^2 + CF^2},   (10)

where RF = \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N} (F(i,j) - F(i,j-1))^2} and CF = \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N} (F(i,j) - F(i-1,j))^2}. A large SF metric indicates that the fused image contains abundant textures and detail information, so the fusion method has excellent performance.

3) Training Details: We train our model on the TNO dataset, and the number of image pairs for training is 20. In order to obtain more training data, we crop each image by setting the stride to 24, and each patch is of the same size of 128 × 128. As a result, the number of produced image patch pairs for training is 6,921. In the testing phase, we select 20 image pairs from the TNO dataset for the comparative experiment and 20 image pairs from the RoadScene dataset for the generalization experiment. It is worth noting that each source image is normalized to [−1, 1]. We adopt Adam as the optimizer for training the model. The proposed algorithm is implemented on the TensorFlow [33] platform. The training parameters are set as follows: the batch size is equal to 32, the number of iterations is set to 30, and the learning rate is set to 10^{-3}. As we have observed, the salient regions take up only a slight proportion of the infrared image. In order to balance the losses of the salient and background regions, in this work, α is set to 7. It is important to note that the source images are fed directly into the fusion network without any cropping during testing. What's more, all the experiments are conducted on an NVIDIA TITAN V GPU and a 2.00 GHz Intel(R) Xeon(R) Gold 5117 CPU.
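For completeness, the SF metric of Eq. (10) reduces to two finite-difference sums; the NumPy sketch below is an illustrative implementation, not the paper's evaluation code, and it normalizes by the number of pixels, which is the common convention for SF.

import numpy as np

def spatial_frequency(fused):
    # Eq. (10): RF and CF are RMS values of horizontal and vertical first differences.
    f = fused.astype(np.float64)
    m, n = f.shape
    rf = np.sqrt(np.sum((f[:, 1:] - f[:, :-1]) ** 2) / (m * n))
    cf = np.sqrt(np.sum((f[1:, :] - f[:-1, :]) ** 2) / (m * n))
    return np.sqrt(rf ** 2 + cf ** 2)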
B. Comparative Experiment

In order to comprehensively evaluate the performance of our method, we compare the proposed STDFusionNet with the other nine methods on the TNO dataset.

1) Qualitative Results: To observe the differences in the fusion performance of the various algorithms intuitively, we first select four typical pairs of source images (bench, Kaptein 1123, Kaptein 1654, Tree 4915) from the TNO dataset for subjective evaluation. The fused results of the different algorithms are shown in Figs. 5 - 8. In Fig. 5, we select a salient region (i.e., the red box) in each fused image, then zoom in and place it in the bottom right corner for clear comparison. As shown in Fig. 5, MDLatLRR loses the thermal emission target information, resulting in a failure to capture infrared targets in the salient region. DenseFuse, IFCNN, and U2Fusion retain the thermal emission target information but suffer from serious noise contamination, which comes from the visible images. Moreover, FusionGAN preserves thermal radiation information to some extent, but suffers from the shortcoming of blurred infrared target edges. GTF, NestFuse, GANMcC, PMGI, and STDFusionNet are able to highlight the salient target. In particular, STDFusionNet generates a fused result that maintains the contrast of the salient targets well.

In the other three scenes, we select a background region with abundant detail in each fused image, and then zoom in on it and put it in the bottom right corner. Also, we label the salient target with a green box. From the fusion results of the remaining three image pairs, we can find that our STDFusionNet not only highlights the salient targets in the scene effectively, but also has a distinct advantage in maintaining the detailed texture of the background region. Specifically, in the Kaptein 1123 scene, the texture of the tree branches in the fused image generated by our method is the clearest, and STDFusionNet is the only method in which the sky is not contaminated by thermal radiation information. In the Kaptein 1654 scene, the streetlights in the background region are almost consistent with the visible image. Moreover, in the Tree 4915 image pair, it is almost impossible to distinguish the shrubs from their surroundings for all methods other than ours and NestFuse. However, NestFuse weakens the thermal radiation targets in the significant regions. It is worth noting that STDFusionNet can highlight the infrared targets in the salient regions and effectively distinguish the shrubs from their surroundings.

By comparison, it can be found that STDFusionNet is able to selectively preserve the salient targets of infrared images and the texture details of visible images during the fusion process. This mainly benefits from the manually extracted salient target mask and the constructed loss function during network training.

2) Quantitative Results: The results of the four popular quantitative metrics on 20 image pairs from the TNO dataset are shown in Fig. 9 and Table I. Among the four metrics, our method has a significant superiority on three metrics, i.e., EN, MI, and VIF. As for the SF metric, our STDFusionNet only lags behind IFCNN by a narrow margin.

It is important to note that STDFusionNet has the highest value on almost all image pairs on the VIF metric, which is consistent with the conclusions of the subjective evaluation and indicates that STDFusionNet generates fused images with better visual effects. The largest EN demonstrates that the fused images generated by our proposed method have more abundant information than the other nine comparison approaches. The largest MI indicates that our method transfers more information from the source images to the fused images. Although the SF metric of our algorithm is not the best, the comparable results still indicate that our fused results have sufficient gradient information.

C. Generalization Experiment

The generalization ability of the network is an important basis for evaluating the performance of a deep model. In order to evaluate the generalization ability of our STDFusionNet, we use the image pairs of the RoadScene dataset to test the model trained on the TNO dataset. Since the visible images contained in the RoadScene dataset are in color, we use a specific fusion strategy [34] to achieve image fusion that preserves the color. Specifically, the RGB visible images are first converted to the YCbCr color space. Then the Y channel and the grayscale infrared image are used for fusion, as the structural details are mainly in the Y channel. Finally, through the inverse conversion, the fused image can be converted into the RGB color space with the Cb and Cr (chrominance) channels of the visible image.

1) Qualitative Results: The fused results of the different methods are shown in Figs. 10 - 13. From the results, we can observe that our STDFusionNet selectively preserves useful information in both infrared and visible images. Compared to the fused images generated by other methods, our fused images are very close to the infrared images in salient regions, and the texture structure of the visible images is almost completely preserved in background regions.

Although other methods can highlight the distinctive targets, the background of their fused images is extremely unpleasant. In particular, the sky in the fused image is heavily contaminated with thermal information, and it is not even possible to accurately estimate the current time and weather from the fused image, which is fatal for road scenes. Moreover, other methods are undesirable for retaining texture details in background areas, such as the writing on walls, bicycles and tree stumps, fences, street lights, etc. In contrast, STDFusionNet effectively preserves the background region detail information while maintaining and even enhancing the intensity and contrast of thermal infrared targets in salient regions.

2) Quantitative Results: We also select 20 infrared and visible image pairs from the RoadScene dataset for objective evaluation, and the performance of the different methods on the four metrics is shown in Table I and Fig. 14. Similar to the results on the TNO dataset, our STDFusionNet has the best average values for three metrics, i.e., MI, VIF and SF, but the advantage is not as pronounced as on the TNO dataset. On the EN metric, our method only follows NestFuse by a narrow margin.

In general, both the qualitative and quantitative results demonstrate that our STDFusionNet has good generalization performance, which is less affected by the characteristics of the imaging sensors.

Fig. 9. Quantitative comparisons of the four metrics, i.e., EN, MI, VIF and SF, on twenty image pairs from the TNO dataset. The nine state-of-the-art methods GTF [6], MDLatLRR [2], DenseFuse [11], NestFuse [23], FusionGAN [10], GANMcC [24], IFCNN [14], PMGI [25] and U2Fusion [9] are used for comparison. A point (x, y) on a curve denotes that (100·x)% of the image pairs have metric values no greater than y.

TABLE I
QUANTITATIVE COMPARISONS OF THE FOUR METRICS, i.e., EN, MI, VIF AND SF, ON TWENTY IMAGE PAIRS FROM THE TNO DATASET AND THE ROADSCENE DATASET, RESPECTIVELY. RED INDICATES THE BEST RESULT AND BLUE REPRESENTS THE SECOND BEST RESULT.
(Each row lists, in order: EN, MI, VIF and SF on TNO, followed by EN, MI, VIF and SF on RoadScene.)
GTF [6] 6.8484 ± 0.5058 2.6183 ± 1.2131 0.6057 ± 0.1228 0.0383 ± 0.0200 7.3974 ± 0.2669 3.5454 ± 0.6440 0.6455 ± 0.1225 0.0335 ± 0.0073
MDLatLRR [2] 6.3772 ± 0.4305 1.9782 ± 0.5423 0.6810 ± 0.1147 0.0297 ± 0.0134 6.8413 ± 0.2784 3.0232 ± 0.5338 0.7282 ± 0.1270 0.0305 ± 0.0074
DenseFuse [11] 6.8618 ± 0.3880 2.1487 ± 0.6490 0.7930 ± 0.1864 0.0377 ± 0.0145 7.1794 ± 0.2615 3.1297 ± 0.5293 0.7705 ± 0.1390 0.0373 ± 0.0082
NestFuse [23] 7.0034 ± 0.3489 2.9358 ± 0.5606 0.9229 ± 0.1650 0.0404 ± 0.0157 7.4875 ± 0.1753 3.9642 ± 0.5538 0.9262 ± 0.1271 0.0454 ± 0.0112
FusionGAN [10] 6.5984 ± 0.5161 2.2194 ± 0.6300 0.6330 ± 0.1235 0.0260 ± 0.0089 7.0985 ± 0.2051 3.0262 ± 0.4277 0.6036 ± 0.0663 0.0313 ± 0.0040
GANMcC [24] 6.8099 ± 0.4491 2.1722 ± 0.5346 0.7010 ± 0.1565 0.0243 ± 0.0087 7.2510 ± 0.1892 3.0797 ± 0.5311 0.7180 ± 0.1127 0.0319 ± 0.0049
IFCNN [14] 6.9338 ± 0.4377 1.9199 ± 0.4643 0.7835 ± 0.1576 0.0535 ± 0.0196 7.2027 ± 0.1683 3.1281 ± 0.4737 0.7830 ± 0.1173 0.0516 ± 0.0130
PMGI [25] 7.0527 ± 0.3281 2.2563 ± 0.6806 0.8413 ± 0.2002 0.0352 ± 0.0146 7.3089 ± 0.1400 3.5906 ± 0.5444 0.8314 ± 0.1246 0.0382 ± 0.0062
U2Fusion [9] 7.0762 ± 0.3915 1.9303 ± 0.5256 0.8061 ± 0.1786 0.0493 ± 0.0161 7.1955 ± 0.2966 2.7669 ± 0.5204 0.7371 ± 0.1404 0.0499 ± 0.0102
STDFusionNet 7.1978 ± 0.4793 3.7416 ± 0.5181 1.0436 ± 0.2107 0.0505 ± 0.0156 7.4213 ± 0.1926 4.6754 ± 0.7310 0.9528 ± 0.1588 0.0553 ± 0.0114

D. Efficiency Comparison

Running efficiency is also an important factor in evaluating model performance. We provide the average running time of the different methods on the TNO and RoadScene datasets in Table II. From the results, we can see that deep learning-based algorithms have a considerable advantage in runtime due to the utilization of GPUs for acceleration, especially our STDFusionNet. In contrast, traditional methods consume more time to generate the fused image. In particular, MDLatLRR needs to decompose the source image into low-rank parts and saliency parts by latent low-rank representation, so it is particularly time-consuming. In general, our STDFusionNet has the smallest average running time and the lowest standard deviation of running time on both datasets. This illustrates the robustness of our network to source images of different resolutions and further proves the efficiency of the designed network.

TABLE II
MEAN AND STANDARD DEVIATION OF THE RUNNING TIMES OF ALL METHODS ON THE TNO AND ROADSCENE DATASETS (UNIT: SECOND; RED INDICATES THE BEST RESULT AND BLUE REPRESENTS THE SECOND BEST RESULT).

Method          TNO                  RoadScene
GTF [6]         2.6122 ± 1.9535      1.8188 ± 0.7396
MDLatLRR [2]    135.0391 ± 72.0068   86.8480 ± 19.8430
DenseFuse [11]  0.7732 ± 0.8658      0.7892 ± 0.763
NestFuse [23]   0.2982 ± 0.4067      0.2187 ± 0.3496
FusionGAN [10]  0.4810 ± 0.6025      0.5118 ± 0.4155
GANMcC [24]     0.7258 ± 0.7856      0.7050 ± 0.4239
IFCNN [14]      0.0885 ± 0.3358      0.0796 ± 0.3172
PMGI [25]       0.2597 ± 0.4320      0.2721 ± 0.3574
U2Fusion [9]    0.7155 ± 0.7284      0.7820 ± 0.3512
STDFusionNet    0.0461 ± 0.0497      0.0292 ± 0.0333
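For reference, per-image running times of the kind reported in Table II are typically collected as in the generic sketch below; this is not the exact measurement protocol used in the paper.

import time
import numpy as np

def time_method(fuse_fn, image_pairs):
    # fuse_fn: callable taking (ir, vis) and returning the fused image.
    # For GPU-based methods, synchronize the device before reading the timer.
    times = []
    for ir, vis in image_pairs:
        start = time.perf_counter()
        fuse_fn(ir, vis)
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))  # mean and std in seconds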
Fig. 10. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the RoadScene dataset. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 11. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the RoadScene dataset. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 12. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the RoadScene dataset. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 13. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the RoadScene dataset. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

E. Visualization of Salient Target Detection

As mentioned earlier, the proposed STDFusionNet can fulfill salient target detection in an implicit manner. Several visual examples are provided to confirm this. The salient regions of the infrared images and the results of subtracting the visible background regions from the fused results are shown in Fig. 15. From these results, we can see that the results of subtracting the visible background regions from the fused images are almost consistent with the infrared image salient regions. And there is only a slight difference between the salient regions detected by our method and the annotated ones, namely the additional salient thermal targets detected by our method. These phenomena demonstrate that our STDFusionNet can implicitly perform salient target detection well.

F. Ablation Experiment

1) Desired Information Analysis: In our model, the desired information is explicitly defined as the salient target in the infrared image and the background texture structure in the visible image. To verify the rationality of the desired information definition, we train two models on the TNO dataset based on whether or not the desired information definition is used to guide the optimization of the network. More specifically, the main difference between these two models is whether or not the salient target masks are introduced into the loss function. Since there is no need to treat the salient and background regions distinctively, the control trade-off parameter α is set to 1 when the salient target masks are removed.

It can be seen from Fig. 16 that with the desired information definition, the fused results of STDFusionNet can not only highlight distinctive targets in salient regions, but also maintain the texture details in the background regions. In contrast, when not using the desired information definition, the network only fuses the infrared and visible images in a coarse manner, resulting in the thermal emission information of the infrared images and the texture information of the visible images not being well preserved. What's more, as presented in Table III, there is a significant degradation in the performance of the model without explicitly defining the desired information. Specifically, compared to STDFusionNet, without introducing the salient target masks, the EN, MI, VIF, and SF metrics decrease by 15.8%, 52.6%, 42.3%, and 21.5%, respectively. These results prove that our definition of the desired information is reasonable, and it is of great significance for improving the fusion performance.
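The two ablated models can be expressed in terms of the loss sketch given after Sec. III-B. The helper below is one plausible, illustrative reading of the ablation settings (it assumes std_fusion_loss and sobel_gradient from that sketch are in scope), not the authors' training code.

import torch

def ablation_loss(fused, ir, vis, mask, variant="full"):
    if variant == "wo_desired_info":
        # No salient target masks: every term is computed over the whole image and the
        # trade-off parameter is effectively 1, since salient and background regions are
        # no longer distinguished (one plausible reading of Sec. IV-F).
        pixel = torch.mean(torch.abs(fused - ir)) + torch.mean(torch.abs(fused - vis))
        g_f, g_ir, g_vis = sobel_gradient(fused), sobel_gradient(ir), sobel_gradient(vis)
        grad = torch.mean(torch.abs(g_f - g_ir)) + torch.mean(torch.abs(g_f - g_vis))
        return pixel + grad
    if variant == "wo_gradient_loss":
        # Keep only the masked pixel terms of Eqs. (2), (3) and (6).
        pixel_salient = torch.mean(torch.abs(mask * (fused - ir)))
        pixel_back = torch.mean(torch.abs((1.0 - mask) * (fused - vis)))
        return pixel_back + 7.0 * pixel_salient
    return std_fusion_loss(fused, ir, vis, mask, alpha=7.0)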
Fig. 14. Quantitative comparisons of the four metrics, i.e., EN, MI, VIF and SF, on twenty image pairs from the RoadScene dataset. The nine state-of-the-art methods GTF [6], MDLatLRR [2], DenseFuse [11], NestFuse [23], FusionGAN [10], GANMcC [24], IFCNN [14], PMGI [25] and U2Fusion [9] are used for comparison. A point (x, y) on a curve denotes that (100·x)% of the image pairs have metric values no greater than y.

Fig. 15. Visualization of salient target detection on four typical infrared and visible image pairs. From left to right: Kaptein 1123, Kaptein 1654, Bunker and Nato camp 1816. From top to bottom: infrared images, visible images, results of STDFusionNet, salient regions of infrared images, and the difference between the fused images and the background regions of the visible images.

Fig. 16. Visualized results of ablation on four typical infrared and visible image pairs. From left to right: Sandpath, Kaptein 1123, Kaptein 1654 and Nato camp 1816. From top to bottom: infrared images, visible images, results of STDFusionNet, STDFusionNet without desired information definition, and STDFusionNet without gradient loss.
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT 12

TABLE III [5] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, “Image fusion with
Q UANTITATIVE EVALUATION RESULTS OF ABLATION . T HE W / O D ESIRED convolutional sparse representation,” IEEE Signal Processing Letters,
INFORMATION INDICATES THAT STDF USION N ET WITHOUT DESIRED vol. 23, no. 12, pp. 1882–1886, 2016.
INFORMATION DEFINITION AND W / O G RADIENT LOSS DENOTES [6] J. Ma, C. Chen, C. Li, and J. Huang, “Infrared and visible image fusion via
STDF USION N ET WITHOUT GRADIENT LOSS . RED INDICATES THE BEST gradient transfer and total variation minimization,” Information Fusion,
RESULT AND BLUE REPRESENTS THE SECOND BEST RESULT ). vol. 31, pp. 100–109, 2016.
[7] J. Ma, Z. Zhou, B. Wang, and H. Zong, “Infrared and visible image fusion
w/o Desired information w/o Gradient loss STDFusionNet based on visual saliency map and weighted least square optimization,”
Infrared Physics & Technology, vol. 82, pp. 8–17, 2017.
EN 6.5010 ± 0.5142 6.0294 ± 0.9807 7.2462 ± 0.4308 [8] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, “Pixel-level image fusion: A
MI 1.9518 ± 0.5197 3.5511 ± 0.8877 4.1213 ± 0.5382 survey of the state of the art,” Information Fusion, vol. 33, pp. 100–112,
VIF 0.6142 ± 0.1692 0.6869 ± 0.2640 1.0652 ± 0.1954 2017.
SF 0.0348 ± 0.0060 0.0691 ± 0.0223 0.0489 ± 0.0159 [9] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified
2) Gradient Loss Analysis: When constructing the loss function, in addition to the pixel constraints, the gradient loss is also introduced to force the salient targets in the fused image to have sharper textures and contours. We conduct an ablation experiment to demonstrate the effectiveness of the gradient loss. Specifically, we train a model without the gradient loss, and the results are shown in Fig. 16. It can be seen that when the gradient loss is removed, the salient regions hardly retain any texture information, and there is also severe distortion of the salient target shapes. In addition, several artifacts occur in the background region. Moreover, the results of the quantitative comparison are exhibited in Table III, where all metrics present decreasing tendencies except for the SF metric. These experimental results demonstrate the importance of the gradient loss, which ensures the texture sharpness of salient targets in the fused image.
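As a minimal sketch of such a gradient constraint (an illustration under our own assumptions, not the released implementation), a Sobel operator can be used to penalize the difference between the gradients of the fused image and those of the source images, with the salient target mask deciding which source supplies the edges in each region; all function and variable names below are assumptions.

import tensorflow as tf

def gradient_loss(fused, ir, vis, mask):
    # Sobel-gradient loss sketch: edges inside the salient mask should
    # follow the infrared image, background edges the visible image.
    def grad(x):
        # tf.image.sobel_edges returns shape [B, H, W, C, 2] (dy, dx).
        return tf.image.sobel_edges(x)

    g_fused, g_ir, g_vis = grad(fused), grad(ir), grad(vis)
    m = tf.expand_dims(mask, axis=-1)  # broadcast mask over the (dy, dx) axis
    salient_term = tf.reduce_mean(tf.abs(m * (g_fused - g_ir)))
    background_term = tf.reduce_mean(tf.abs((1.0 - m) * (g_fused - g_vis)))
    return salient_term + background_term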
V. CONCLUSION

In this paper, we propose a novel infrared and visible image fusion network based on salient target detection, named STDFusionNet. We explicitly define the desired information for infrared and visible image fusion as the salient region of the infrared image and the background region of the visible image. Based on this definition, we introduce the salient target mask into the loss function to precisely guide the optimization of the network. As a result, our model can fulfill salient target detection and information fusion in an implicit manner, and the fused result not only contains salient thermal targets but also preserves rich background textures. Extensive qualitative and quantitative experiments demonstrate the superiority of our STDFusionNet over state-of-the-art methods in terms of both subjective visual effect and quantitative metrics. Moreover, our method is much faster than the comparative methods.
REFERENCES

[1] J. Ma, Y. Ma, and C. Li, "Infrared and visible image fusion methods and applications: A survey," Information Fusion, vol. 45, pp. 153–178, 2019.
[2] H. Li, X.-J. Wu, and J. Kittler, "MDLatLRR: A novel decomposition method for infrared and visible image fusion," IEEE Transactions on Image Processing, vol. 29, pp. 4733–4746, 2020.
[3] J. Ma and Y. Zhou, "Infrared and visible image fusion via gradientlet filter," Computer Vision and Image Understanding, vol. 197–198, p. 103016, 2020.
[4] D. P. Bavirisetti and R. Dhuli, "Two-scale image fusion of visible and infrared images using saliency detection," Infrared Physics & Technology, vol. 76, pp. 52–64, 2016.
[5] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, "Image fusion with convolutional sparse representation," IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1882–1886, 2016.
[6] J. Ma, C. Chen, C. Li, and J. Huang, "Infrared and visible image fusion via gradient transfer and total variation minimization," Information Fusion, vol. 31, pp. 100–109, 2016.
[7] J. Ma, Z. Zhou, B. Wang, and H. Zong, "Infrared and visible image fusion based on visual saliency map and weighted least square optimization," Infrared Physics & Technology, vol. 82, pp. 8–17, 2017.
[8] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, "Pixel-level image fusion: A survey of the state of the art," Information Fusion, vol. 33, pp. 100–112, 2017.
[9] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, "U2Fusion: A unified unsupervised image fusion network," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[10] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, "FusionGAN: A generative adversarial network for infrared and visible image fusion," Information Fusion, vol. 48, pp. 11–26, 2019.
[11] H. Li and X.-J. Wu, "DenseFuse: A fusion approach to infrared and visible images," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2018.
[12] Y. Liu, X. Chen, J. Cheng, H. Peng, and Z. Wang, "Infrared and visible image fusion with convolutional neural networks," International Journal of Wavelets, Multiresolution and Information Processing, vol. 16, no. 3, p. 1850018, 2018.
[13] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, "DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion," IEEE Transactions on Image Processing, vol. 29, pp. 4980–4995, 2020.
[14] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "IFCNN: A general image fusion framework based on convolutional neural network," Information Fusion, vol. 54, pp. 99–118, 2020.
[15] H. Xu, X. Wang, and J. Ma, "DRF: Disentangled representation for visible and infrared image fusion," IEEE Transactions on Instrumentation and Measurement, 2021.
[16] J. Ma, P. Liang, W. Yu, C. Chen, X. Guo, J. Wu, and J. Jiang, "Infrared and visible image fusion via detail preserving adversarial learning," Information Fusion, vol. 54, pp. 85–98, 2020.
[17] J. Chen, X. Li, L. Luo, X. Mei, and J. Ma, "Infrared and visible image fusion based on target-enhanced multiscale transform decomposition," Information Sciences, vol. 508, pp. 64–78, 2020.
[18] C. Liu, Y. Qi, and W. Ding, "Infrared and visible image fusion method based on saliency detection in sparse domain," Infrared Physics & Technology, vol. 83, pp. 94–102, 2017.
[19] M. Wu, Y. Ma, F. Fan, X. Mei, and J. Huang, "Infrared and visible image fusion via joint convolutional sparse representation," JOSA A, vol. 37, no. 7, pp. 1105–1115, 2020.
[20] J. Zhao, G. Cui, X. Gong, Y. Zang, S. Tao, and D. Wang, "Fusion of visible and infrared images using global entropy and gradient constrained regularization," Infrared Physics & Technology, vol. 81, pp. 201–209, 2017.
[21] F. Fakhari, M. R. Mosavi, and M. M. Lajvardi, "Image fusion based on multi-scale transform and sparse representation: An image energy approach," IET Image Processing, vol. 11, no. 11, pp. 1041–1049, 2017.
[22] N. Yu, T. Qiu, F. Bi, and A. Wang, "Image features extraction and fusion based on joint sparse representation," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 1074–1082, 2011.
[23] H. Li, X.-J. Wu, and T. Durrani, "NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models," IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 12, pp. 9645–9656, 2020.
[24] J. Ma, H. Zhang, Z. Shao, P. Liang, and H. Xu, "GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2020.
[25] H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma, "Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12797–12804.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[27] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157–173, 2008.
[28] A. Toet, "TNO Image Fusion Dataset," Apr. 2014. [Online]. Available: https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029

[29] J. W. Roberts, J. A. van Aardt, and F. B. Ahmed, "Assessment of image fusion procedures using entropy, image quality, and multispectral classification," Journal of Applied Remote Sensing, vol. 2, no. 1, p. 023522, 2008.
[30] G. Qu, D. Zhang, and P. Yan, "Information measure for performance of image fusion," Electronics Letters, vol. 38, no. 7, pp. 313–315, 2002.
[31] Y. Han, Y. Cai, Y. Cao, and X. Xu, "A new image fusion performance metric based on visual information fidelity," Information Fusion, vol. 14, no. 2, pp. 127–135, 2013.
[32] A. M. Eskicioglu and P. S. Fisher, "Image quality measures and their performance," IEEE Transactions on Communications, vol. 43, no. 12, pp. 2959–2965, 1995.
[33] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[34] K. R. Prabhakar, V. S. Srikar, and R. V. Babu, "DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs," in ICCV, vol. 1, no. 2, 2017, p. 3.

Jiayi Ma received the B.S. degree in information and computing science and the Ph.D. degree in control science and engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2008 and 2014, respectively. He is currently a Professor with the Electronic Information School, Wuhan University. He has authored or co-authored more than 150 refereed journal and conference papers, including IEEE TPAMI/TIP, IJCV, CVPR, ICCV, ECCV, etc. His research interests include computer vision, machine learning, and remote sensing. Dr. Ma has been identified in the 2020 and 2019 Highly Cited Researcher lists from the Web of Science Group. He is an Area Editor of Information Fusion, an Associate Editor of Neurocomputing and Entropy, and a Guest Editor of Remote Sensing.

Linfeng Tang received the B.E. degree from the School of Computer Science and Engineering, Central South University, Changsha, China, in 2020. He is currently a Master student with the Electronic Information School, Wuhan University. His research interests include computer vision, machine learning, and pattern recognition.

Meilong Xu is currently pursuing the bachelor's degree majoring in electronic engineering with the Electronic Information School, Wuhan University. His research interests are in the areas of computer vision and pattern recognition.

Hao Zhang received the B.E. degree from the School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan, China, in 2019. He is currently a Master student with the Electronic Information School, Wuhan University. His research interests include computer vision, machine learning, and pattern recognition.

Guobao Xiao is currently a Professor at Minjiang University, China. He was a Postdoctoral Fellow (2016–2018) in the School of Aerospace Engineering at Xiamen University, China. He received the Ph.D. degree in Computer Science and Technology from Xiamen University, China, in 2016. He has published over 30 papers in international journals and conferences, including IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Pattern Recognition, Pattern Recognition Letters, Computer Vision and Image Understanding, ICCV, ECCV, ACCV, AAAI, ICIP, ICARCV, etc. His research interests include machine learning, computer vision, pattern recognition, and bioinformatics. He received the best Ph.D. thesis award in Fujian Province and the best Ph.D. thesis award from the China Society of Image and Graphics. He serves on the reviewer panel for several international journals and top conferences.

