STDFusionNet: An Infrared and Visible Image Fusion Network Based on Salient Target Detection (2021)
Fig. 2. Schematic illustration of STDFusionNet. From left to right: the infrared image, the visible image, the fusion result of a traditional method GTF [6], the fusion result of a deep learning-based method DenseFuse [11], and the result of our proposed STDFusionNet. The red box and green box show that GTF and DenseFuse suffer from detail loss, blurred edges, and artifacts, whereas our result better highlights prominent targets and preserves abundant textures.
invalid information in the fusion process. As a result, the useful information is weakened in the fused image. We provide a typical example in Fig. 1 to illustrate this more intuitively, in which U2Fusion [9] is a representative CNN-based method and FusionGAN [10] is a typical GAN-based method. It can be seen that U2Fusion weakens the salient target, while FusionGAN weakens the background textures.

To address the above challenges, we propose a new network based on salient target detection for infrared and visible image fusion, namely STDFusionNet. First, for infrared images, humans and machines primarily pay attention to the regions where salient targets such as pedestrians, vehicles, and bunkers are located. For visible images, the rich background textures help to make the imaging scene more vivid. Therefore, we define the most meaningful information for the fusion process as the significant thermal targets in the infrared image and the background texture structures in the visible image. Based on this definition, we annotate the salient targets in the infrared image to obtain the salient target masks. Then, the obtained salient masks are introduced into the design of a specific loss function, which selectively drives the network to extract and reconstruct the above-defined effective features. In addition, due to the significant differences between multi-modal source images, we adopt a pseudo-siamese network to extract different types of information from the source images with distinction, such as the salient target intensity and the background texture structures. It is worth emphasizing that the salient target mask is only utilized to guide the training of the networks and is not required to be fed into the network during the testing phase; hence our network is an end-to-end model. Under these specific designs, our STDFusionNet effectively addresses the problems of feature extraction effectiveness and desired information definition.

To intuitively show the performance of our method, we provide a typical example in Fig. 2, with the traditional GTF [6] and the deep learning-based DenseFuse [11] for comparison. GTF views the fused image as an infrared image with additional visible gradients. Even though the targets in its result can be highlighted, the textures are not sharp and artifacts are present, leading to an unnatural fused image. On the contrary, in the result of DenseFuse, the texture details are better preserved, while the thermal information of the target is weakened. Our STDFusionNet has the advantages of both GTF and DenseFuse. Specifically, our method implicitly achieves salient target detection and the extraction and reconstruction of effective information, and the fused image highlights significant targets while retaining abundant textures (e.g., the bush, road, and wall).

The main contributions of this work include the following three aspects:

• We introduce the salient target mask into the specific loss function, which can guide the network to detect the thermal radiation targets in the infrared image and fuse them with the background texture details in the visible image.

• We explicitly define the desired information in the fusion process as the salient target in the infrared image and the background textures in the visible image. To the best of our knowledge, this is the first precise definition of the target of infrared and visible image fusion.

• Extensive experiments demonstrate the superiority of our method over state-of-the-art alternatives. Compared with the competitors, our approach generates fusion results that look like high-quality visible images with highlighted targets, which contributes to improving target recognition and scene understanding.

The rest of this paper is organized as follows. In Section II, we briefly introduce the related works on image fusion. In Section III, the proposed method is introduced in detail. Section IV illustrates the fusion performance of our method on public datasets with comparisons to other approaches, followed by some concluding remarks in Section V.

II. RELATED WORK

In this section, we review the existing infrared and visible image fusion approaches, including traditional fusion methods and deep learning-based fusion methods.

A. Traditional Fusion Methods

Traditional fusion methods generally design activity level measurements or fusion rules manually to perform image fusion in the spatial or transform domain, and can be divided into five categories according to their principles: multi-scale transform-based methods [17], [2], saliency-based methods [4], [18], sparse representation-based methods [5], [19], optimization-based methods [6], [20], and hybrid methods [21], [7]. The main ideas of these methods are discussed below.

The multi-scale transform-based methods build on the observation that objects in the physical world are typically composed of components of various scales, and that the multi-scale transform is consistent with the human visual system. Therefore, the fused images obtained by multi-scale transform can have a pleasing visual
effect [1]. In general, infrared and visible image fusion schemes based on multi-scale decomposition involve three steps. First, all source images are decomposed into a series of multi-scale representations. Subsequently, the multi-scale representations of the original images are fused according to specific fusion rules. Eventually, the fused image is obtained by performing the corresponding inverse transforms on the fused multi-scale representations [17].

The saliency-based methods are usually built on the basis that salient targets are more easily perceived by human vision than their adjacent objects or pixels. Saliency is applied to infrared and visible image fusion in two main ways, i.e., weight calculation and salient target extraction. The former is usually combined with multi-scale transforms, where the source images are decomposed into base and detail layers through the multi-scale transform. Saliency detection is then used to obtain a saliency map of the base or detail layer, from which a weight map of the base or detail layer is derived [7]. The latter uses saliency detection to extract information about the significant regions from the infrared and visible images and then integrates this crucial information into the final fused image [18]. Generally, the saliency-based methods can maintain the integrity and pixel intensity of the significant object regions and improve the visual quality of the fused image [4].

The premise of the sparse representation-based methods is to learn an over-complete dictionary from a great number of high-quality images, which is usually achieved by joint sparse representation [22] or convolutional sparse representation [5]. The sparse representation coefficients of the source images are then obtained with the learned over-complete dictionary and fused according to a given fusion rule. Finally, the fused image is reconstructed from the fused sparse representation coefficients with the learned over-complete dictionary.

The optimization-based methods generate the desired fusion result by minimizing an objective function [6], [20]. Therefore, the key to such methods lies in the design of the objective function, whose construction should consider two aspects, namely overall intensity fidelity and texture structure preservation. The former constrains the fused result to have the desired brightness distribution, while the latter drives the fused result to contain rich texture details. The above-mentioned infrared and visible image fusion methods all have their strengths and weaknesses, and the hybrid models combine their strengths to improve the fusion performance [21], [7].

B. Deep Learning-based Fusion Methods

Relying on the excellent feature extraction capabilities of neural networks, deep learning has promoted tremendous progress in image fusion. Early deep learning-based methods only adopted the neural network to construct a weight map or extract features [9]. Liu et al. adopted a pre-trained convolutional neural network to implement activity level measurement of the source images and generate a weight map, while the whole fusion process is based on pyramids [12]. However, the neural network is not specifically trained for image fusion, which limits the fusion performance.

With further research, some deep methods based on auto-encoders have been proposed. These methods usually pre-train an auto-encoder to implement feature extraction and image recovery, while the feature fusion is fulfilled by traditional rules. Li et al. introduced the dense block into the encoder and decoder to design a new image fusion network, termed DenseFuse [11]. In the fusion layer, DenseFuse relies on the conventional addition and l1-norm strategies. Considering that a network without down-sampling cannot extract multi-scale features from source images, nest connection-based networks are introduced in NestFuse to extract information from source images from a multi-scale perspective [23]. Spatial and channel attention models are used to fuse the extracted information, but it is worth mentioning that the attention mechanism used for fusion is still unlearnable.

Since the unsupervised distribution estimation ability of the generative adversarial network is very suitable for the image fusion task, more and more GAN-based fusion methods have been proposed. Ma et al. established an adversarial game between the fused result and the visible image to further enhance the preservation of texture structures. However, this single adversarial mechanism can easily lead to unbalanced fusion. To ameliorate this problem, they later proposed the dual-discriminator conditional generative adversarial network (DDcGAN) [13] to realize image fusion, in which both the infrared image and the visible image participate in the adversarial games. It is worth noting that a generative adversarial network with dual discriminators is not easy to train. In this context, a generative adversarial network with a multi-classification constraint was proposed [24] to achieve information balance in the fusion process, in which the multi-distribution simultaneous estimation is done by a single adversarial game.

Due to the strong feature representation ability of neural networks, varied information can be represented in a unified way [9]. A growing number of researchers are dedicated to exploring a general image fusion framework. Zhang et al. first utilized two convolutional layers to extract the salient features from source images. Then, they selected appropriate fusion rules according to the type of input images to fuse the source image features, and recovered the fused images from the convolutional features with two convolutional layers [14]. Their proposed network framework only needs to be trained on one type of image fusion dataset and subsequently adjusts the fusion rules according to the type of source images, thus implementing a unified network that solves various fusion tasks. In contrast, Zhang et al. proposed a network structure based on proportional maintenance of gradient and intensity, which adapts to different fusion tasks by adjusting the weights of the loss terms when constructing the loss function [25]. Considering the cross-fertilization between different fusion tasks, U2Fusion was trained sequentially on different fusion missions, and a unified model for multiple fusion tasks was obtained [9].

Compared with the above-mentioned methods, the proposed STDFusionNet has two main technical contributions. First, the desired information in the image fusion process is defined as the salient target in the infrared image and the texture information
in the visible image. The defined desired information can provide a more explicit optimization direction for parameter learning. Second, we design a special loss in conjunction with the salient target mask to guide the network to achieve salient target detection and information fusion. This enables the fused images generated by STDFusionNet to retain as much important information as possible from the source images and reduce the effect of redundant information.

III. METHOD

In this section, we describe the proposed infrared and visible image fusion network based on salient target detection, STDFusionNet. First, we provide the problem formulation of STDFusionNet, in which the core ideas are discussed. Then, we describe the designed loss function in detail. Finally, we give the network architecture of the proposed STDFusionNet.

A. Problem Formulation

The target of image fusion is to extract significant information from multiple source images and fuse the complementary information to generate a synthesized image. The key to this problem is how to define the most meaningful information and how to fuse the complementary information. In infrared and visible image fusion, the most critical information is the salient targets and the texture structures, which are contained in infrared images and visible images, respectively. Therefore, we explicitly define the desired information as the salient target information in infrared images and the background texture structure information in visible images. Consequently, there are two keys to image fusion based on this definition.

The first key is to determine the salient target in the infrared image. As we observed, the significant information of infrared images is mainly presented in the regions containing objects (e.g., pedestrians, vehicles and bunkers) that emit more heat. Hence, the proposed network should learn to automatically detect these regions from infrared images. The second key is to accurately extract the desired information from the detected regions and perform effective fusion and reconstruction. In other words, the fused result should accurately contain the salient target of the infrared image and the background texture of the visible image. A specific loss function and an effective network structure are designed to address these two key problems.

First, we propose a specific loss function to constrain the fusion process, in which the salient target mask is introduced to guide the network to detect salient areas, while the preservation of thermal targets and background textures is achieved by ensuring intensity and gradient consistency in the corresponding regions. Second, we design an effective network structure to realize feature extraction, fusion and reconstruction. Concretely, the feature extraction network adopts a pseudo-siamese network architecture to treat the source images differently, so as to selectively extract salient target features from the infrared image I_ir and background texture features from the visible image I_vi. Eventually, the feature reconstruction network fuses the extracted features and reconstructs the fused image I_f, highlighting the salient targets in the infrared image while preserving the texture details of the visible image. Under the above design, our model can implicitly realize salient target detection and desired information fusion.

B. Loss Function

The loss function determines the type of information retained in the fused image and the proportional relationship between various kinds of information. The loss function of our STDFusionNet consists of two kinds of losses, i.e., the pixel loss and the gradient loss. The pixel loss constrains the pixel intensity of the fused image to be consistent with the source images, while the gradient loss forces the fused image to contain more detailed information. We construct the pixel loss and gradient loss for the salient regions and background areas separately. Combined with the salient target mask I_m, the desired result I_d can be defined as follows:

I_d = I_m ◦ I_ir + (1 − I_m) ◦ I_vi,   (1)

where the operator ◦ denotes element-wise multiplication. Similarly, the fused image generated by STDFusionNet can be segmented into a prominent region I_m ◦ I_f containing the thermal infrared target and a background region (1 − I_m) ◦ I_f with texture details.

Therefore, we construct the corresponding losses in the salient and background regions respectively to guide the optimization of STDFusionNet. On the one hand, we constrain the fused image to have the same pixel intensity distribution as the desired image. The salient pixel loss L_pixel^salient and the background pixel loss L_pixel^back are formulated as:

L_pixel^salient = (1/HW) ||I_m ◦ (I_f − I_ir)||_1,   (2)

L_pixel^back = (1/HW) ||(1 − I_m) ◦ (I_f − I_vi)||_1,   (3)

where H and W are the height and width of the image, respectively, and ||·||_1 stands for the l1-norm. On the other hand, the gradient loss is introduced to strengthen the constraints on the network in order to force the fused image to have sharper textures and salient targets with sharpened edges. Similar to the definition of the pixel loss, the gradient loss also contains a salient gradient loss L_grad^salient and a background gradient loss L_grad^back, which are formulated as follows:

L_grad^salient = (1/HW) ||I_m ◦ (∇I_f − ∇I_ir)||_1,   (4)

L_grad^back = (1/HW) ||(1 − I_m) ◦ (∇I_f − ∇I_vi)||_1,   (5)

where ∇ denotes the gradient operator; in this paper, we employ the Sobel operator to compute the gradient of an image. Different from previous methods, we treat the pixel loss and the gradient loss in the same region equally, so the final loss function is defined as:

L = (L_pixel^back + L_grad^back) + α(L_pixel^salient + L_grad^salient),   (6)

where α is the hyper-parameter that controls the loss balance between different regions. Due to the introduction of the salient region losses, i.e., L_pixel^salient and L_grad^salient, STDFusionNet has the ability to detect and extract salient targets in infrared images in an implicit manner.
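For concreteness, the following is a minimal TensorFlow sketch of Eqs. (2)-(6) (the paper reports a TensorFlow implementation, but this is our own reconstruction, not the released code; the tensor layout, the Sobel call, and the function names are assumptions):

```python
import tensorflow as tf

def sobel(img):
    # img: [B, H, W, 1]; tf.image.sobel_edges returns [B, H, W, 1, 2] (dy, dx),
    # serving as the gradient operator in Eqs. (4)-(5).
    return tf.image.sobel_edges(img)

def l1_region(mask, diff):
    # Mean absolute error restricted to a region, i.e. (1/HW)*||mask * diff||_1,
    # averaged over the batch as well.
    return tf.reduce_mean(tf.abs(mask * diff))

def std_fusion_loss(i_f, i_ir, i_vi, mask, alpha=7.0):
    """Eq. (6): background terms plus alpha-weighted salient terms.
    mask is the binary salient target mask I_m (1 inside salient regions)."""
    bg = 1.0 - mask
    # Pixel losses, Eqs. (2)-(3)
    l_pix_sal = l1_region(mask, i_f - i_ir)
    l_pix_bg = l1_region(bg, i_f - i_vi)
    # Gradient losses, Eqs. (4)-(5); masks are broadcast over the two Sobel channels
    m5, b5 = mask[..., None], bg[..., None]
    l_grad_sal = l1_region(m5, sobel(i_f) - sobel(i_ir))
    l_grad_bg = l1_region(b5, sobel(i_f) - sobel(i_vi))
    return (l_pix_bg + l_grad_bg) + alpha * (l_pix_sal + l_grad_sal)
```

During training, i_f would be the network output for a batch of 128 × 128 patches, with α = 7 as reported in Section IV-A.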
Fig. 3. The architecture of the proposed infrared and visible image fusion network based on salient target detection. The mask is only needed to construct the loss function during training of the model, and is not needed in the testing phase.
C. Network Architecture

As illustrated in Fig. 3, our network architecture consists of two parts, i.e., the feature extraction network and the feature reconstruction network.

Feature Extraction Network. The feature extraction network is constructed on the basis of the convolutional neural network, and the ResBlock is introduced to enhance feature extraction and alleviate the problem of vanishing/exploding gradients [26]. As shown in Fig. 3, the feature extraction network consists of a common layer and three ResBlocks that reinforce the extracted information. The common layer consists of a convolutional layer with a kernel size of 5 × 5 and a Leaky Rectified Linear Unit activation layer. Each ResBlock consists of three convolutional layers, named Conv1, Conv2, and Conv3, and a skip-connected identity mapping convolutional layer, termed identity conv. The kernel size of all convolutional layers is 1 × 1 except for Conv2, which has a kernel size of 3 × 3. Both Conv1 and Conv2 use the Leaky Rectified Linear Unit as the activation function, while the outputs of Conv3 and identity conv are summed and followed by a Leaky Rectified Linear Unit activation function. The identity conv is designed to overcome the inconsistent dimensionality of the ResBlock input and output. It is worth noting that, considering the different properties of infrared and visible images, both feature extraction networks use the same network architecture, but their respective parameters are trained independently. In combination with the proposed loss function, the feature extraction network can extract the salient features and texture detail features from the source images.

Feature Reconstruction Network. The feature reconstruction network consists of four ResBlocks, which play the role of feature fusion and image reconstruction. It is worth noting that the activation function of the last layer is Tanh, which ensures that the range of variation of the fused image is consistent with that of the input images. The input of the feature reconstruction network is the concatenation of the infrared convolutional features and the visible convolutional features in the channel dimension, and its output is the fused image. It is well known that information loss is a catastrophic problem in image fusion missions. Therefore, in all convolutional layers of STDFusionNet, the padding is set to SAME and the stride is set to 1. As a result, our network does not introduce any downsampling, and the size of the fused image is consistent with that of the source images.
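As a companion to this description, here is a minimal Keras sketch of the architecture (a reconstruction under stated assumptions, not the authors' code: the channel widths, the LeakyReLU slope, and the exact placement of the final Tanh are guesses; only the layer pattern follows the text above):

```python
import tensorflow as tf
from tensorflow.keras import layers

def lrelu(x):
    return layers.LeakyReLU(0.2)(x)  # slope 0.2 is an assumption

def res_block(x, filters, last_act="lrelu"):
    # Conv1 (1x1) + LeakyReLU, Conv2 (3x3) + LeakyReLU, Conv3 (1x1),
    # plus a 1x1 "identity conv" shortcut; the sum passes through the final activation.
    y = lrelu(layers.Conv2D(filters, 1, padding="same")(x))
    y = lrelu(layers.Conv2D(filters, 3, padding="same")(y))
    y = layers.Conv2D(filters, 1, padding="same")(y)
    s = layers.Conv2D(filters, 1, padding="same")(x)          # identity conv
    out = layers.Add()([y, s])
    return layers.Activation("tanh")(out) if last_act == "tanh" else lrelu(out)

def extraction_branch(x, widths=(16, 32, 64)):                 # widths are assumptions
    x = lrelu(layers.Conv2D(widths[0], 5, padding="same")(x))  # common 5x5 layer
    for w in widths:                                           # three ResBlocks
        x = res_block(x, w)
    return x

def build_stdfusionnet():
    ir = layers.Input(shape=(None, None, 1))
    vi = layers.Input(shape=(None, None, 1))
    # Pseudo-siamese extraction: identical architecture, independently trained weights.
    feat = layers.Concatenate()([extraction_branch(ir), extraction_branch(vi)])
    # Feature reconstruction: four ResBlocks; the last one outputs one channel through Tanh.
    for w in (64, 32, 16):
        feat = res_block(feat, w)
    fused = res_block(feat, 1, last_act="tanh")
    return tf.keras.Model([ir, vi], fused, name="STDFusionNet")
```

All convolutions in this sketch use SAME padding and stride 1, so no downsampling is introduced and the fused output keeps the source resolution, matching the description above.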
The purpose of the salient target mask is to highlight the objects (e.g., the pedestrians, vehicles, and bunkers) that radiate a large amount of heat in infrared images. Therefore, we use the labelme toolbox [27] to annotate salient targets in infrared images and convert the annotations to binary salient target masks. Then, the salient target masks are inverted to obtain the background masks. After that, we multiply the salient target masks and texture background masks with the infrared images and visible images at the pixel level to obtain the source salient target regions and source background texture regions, respectively. Moreover, the fused images are also multiplied with the salient target masks and the texture background masks at the pixel
level to obtain the fused salient target regions and the fused background regions. Subsequently, the original salient regions, original background regions, fused salient regions, and fused background regions are applied to construct the specific loss function, which guides the network to realize salient target detection and information fusion implicitly.

IV. EXPERIMENTS

In this section, we first describe the experimental settings, including datasets, evaluation metrics and training details. Then, we demonstrate the effectiveness of the proposed STDFusionNet on public datasets and compare it with nine state-of-the-art fusion methods, including two traditional methods, i.e., GTF [6] and MDLatLRR [2], and seven deep learning-based methods, i.e., DenseFuse [11], NestFuse [23], FusionGAN [10], GANMcC [24], IFCNN [14], PMGI [25] and U2Fusion [9]. The implementations of all nine methods are publicly available, and we set their parameters as reported in the original papers. In addition, we provide a generalization experiment, an efficiency comparison, a visualization of salient target detection and ablation experiments to verify the effectiveness of the specific designs.

A. Experimental Settings

1) Datasets: Our experiments are executed on two datasets, namely the TNO dataset [28] and the RoadScene dataset [9]. The TNO dataset is a common dataset for infrared and visible image fusion, containing various types of military-related scenes. The dataset contains 60 pairs of infrared and visible images, with 3 sequences containing 19, 23, and 32 image pairs, respectively. A typical set of source images and their mask images are shown in Fig. 4. In order to remedy the shortage in quantity of existing datasets, Xu et al. released the RoadScene dataset based on the FLIR video [9]. The RoadScene dataset contains 221 pairs of aligned infrared and visible images, covering rich scenes of roads, vehicles, and pedestrians. The release of this dataset effectively alleviates the challenges of few image pairs and low spatial resolution in the benchmark dataset.

Fig. 5. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the bench image pair. For a clear comparison, we select a salient region (i.e., the red box) in each image and then zoom in on it in the bottom right corner.

2) Evaluation Metrics: The assessment of fusion performance can be divided into subjective and objective evaluations. Subjective evaluation relies on human visual perception. For infrared and visible image fusion, the desired result should contain significant thermal targets and rich texture structures. The objective evaluation is a supplement to the subjective evaluation, and usually uses quantitative metrics to measure the fusion performance. In this paper, four popular metrics are selected, including entropy (EN) [29], mutual information (MI) [30], visual information fidelity (VIF) [31], and spatial frequency (SF) [32]. Their definitions are as follows.

The EN measures the amount of information contained in a fused image, and is defined based on information theory. The mathematical definition of EN is as follows:

EN = − Σ_{l=0}^{L−1} p_l log_2 p_l,   (7)

where L denotes the number of gray levels and p_l is the normalized histogram of the corresponding gray level in the fused image. A larger entropy indicates that the fused image contains more information and that the method achieves better fusion performance.

The MI metric measures the amount of information transferred from the source images to the fused image. In information theory, MI measures the dependence between two random variables, and in image fusion evaluation, the MI fusion metric is defined as follows:

MI = MI_{A,F} + MI_{B,F},   (8)

where MI_{A,F} and MI_{B,F} denote the amount of information transferred from source images A and B to the fused image F, respectively. The MI between two random variables can be
calculated from the joint and marginal histograms as follows:

MI_{X,F} = Σ_{x,f} p_{X,F}(x, f) log( p_{X,F}(x, f) / (p_X(x) p_F(f)) ),   (9)

where p_X(x) and p_F(f) denote the marginal histograms of the source image X and the fused image F, respectively, and p_{X,F}(x, f) denotes the joint histogram of the source image X and the fused image F. The larger the MI, the more information is transferred from the source images to the fused image and the better the fusion performance.

Fig. 6. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the Kaptein 1123 image pair. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 7. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the Kaptein 1654 image pair. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

Fig. 8. Qualitative comparison of STDFusionNet with 9 state-of-the-art methods on the Tree 4915 image pair. For a clear comparison, we select a small area (i.e., the red box) with abundant texture in each image and then zoom in on it in the bottom right corner, and highlight a salient region (i.e., the green box).

The VIF metric measures the information fidelity of the fused image, which is consistent with the human visual system. Computing the VIF metric usually involves the following four steps: first, the source images and the fused image are divided into different blocks; second, the distortion of the visual information in each block is evaluated; third, the VIF of each sub-band is calculated; finally, the overall metric is computed based on the VIF values.

The SF metric is a reference-free metric that measures the spatial frequency information contained in the fused image through the row frequency and the column frequency. The mathematical definition of SF is as follows:

SF = √(RF² + CF²),   (10)

where RF = √( Σ_{i=1}^{M} Σ_{j=1}^{N} (F(i, j) − F(i, j − 1))² ) and CF = √( Σ_{i=1}^{M} Σ_{j=1}^{N} (F(i, j) − F(i − 1, j))² ). A large SF metric indicates that the fused image contains abundant textures and detail information, so the fusion method has excellent performance.
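For reference, a small NumPy sketch of the EN, MI and SF metrics defined in Eqs. (7)-(10) is given below (our own implementation under the usual conventions, not the evaluation code used in the paper; 8-bit images and 256 histogram bins are assumptions, and SF is computed here with a per-pixel mean rather than a raw sum):

```python
import numpy as np

def entropy(fused, bins=256):
    # Eq. (7): Shannon entropy of the gray-level histogram.
    hist, _ = np.histogram(fused, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(src, fused, bins=256):
    # Eq. (9): MI between one source image and the fused image.
    joint, _, _ = np.histogram2d(src.ravel(), fused.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    p_xf = joint / joint.sum()
    p_x = p_xf.sum(axis=1, keepdims=True)
    p_f = p_xf.sum(axis=0, keepdims=True)
    nz = p_xf > 0
    return np.sum(p_xf[nz] * np.log2(p_xf[nz] / (p_x @ p_f)[nz]))

def mi_metric(ir, vi, fused):
    # Eq. (8): MI = MI_{A,F} + MI_{B,F}.
    return mutual_information(ir, fused) + mutual_information(vi, fused)

def spatial_frequency(fused):
    # Eq. (10): SF = sqrt(RF^2 + CF^2) from row and column differences.
    f = fused.astype(np.float64)
    rf = np.sqrt(np.mean((f[:, 1:] - f[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((f[1:, :] - f[:-1, :]) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)
```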
3) Training Details: We train our model on the TNO dataset, and the number of image pairs used for training is 20. In order to obtain more training data, we crop each image with a stride of 24, and each patch has the same size of 128 × 128. As a result, the number of produced image patch pairs for training is 6,921. In the testing phase, we select 20 image pairs from the TNO dataset for the comparative experiment and 20 image pairs from the RoadScene dataset for the generalization experiment. It is worth noting that each source image is normalized to [−1, 1]. We adopt Adam as the optimizer for training the model. The proposed algorithm is implemented on the TensorFlow [33] platform. The training parameters are set as follows: the batch size is 32, the number of iterations is set to 30, and the learning rate is set to 10^−3. As we have observed, the salient regions take up only a small proportion of the infrared image. In order to balance the losses of the salient and background regions, α is set to 7 in this work. It is important to note that the source images are fed directly into the fusion network without any cropping during testing. Moreover, all the experiments are conducted on an NVIDIA TITAN V GPU and a 2.00 GHz Intel(R) Xeon(R) Gold 5117 CPU.
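The patch preparation described above can be sketched as follows in TensorFlow (a minimal illustration of the reported settings; the helper names and the exact cropping implementation are our assumptions):

```python
import tensorflow as tf

PATCH, STRIDE = 128, 24  # 128 x 128 patches, stride 24 (Section IV-A)

def normalize(img_uint8):
    # Map [0, 255] to [-1, 1], matching the Tanh output range of the network.
    return tf.cast(img_uint8, tf.float32) / 127.5 - 1.0

def to_patches(img):
    # img: [H, W, 1]; returns [N, 128, 128, 1] crops extracted with stride 24.
    x = tf.expand_dims(img, 0)
    p = tf.image.extract_patches(x,
                                 sizes=[1, PATCH, PATCH, 1],
                                 strides=[1, STRIDE, STRIDE, 1],
                                 rates=[1, 1, 1, 1],
                                 padding="VALID")
    return tf.reshape(p, [-1, PATCH, PATCH, 1])

# Assumed training wiring (model and loss from the earlier sketches, with our names):
#   model = build_stdfusionnet()
#   optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
#   for batches of 32 aligned (ir, vi, mask) patches, minimize
#   std_fusion_loss(model([ir, vi]), ir, vi, mask, alpha=7.0).
```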
B. Comparative Experiment

In order to comprehensively evaluate the performance of our method, we compare the proposed STDFusionNet with the nine other methods on the TNO dataset.

1) Qualitative Results: To observe the differences in the fusion performance of the various algorithms intuitively, we first select four typical pairs of source images (bench, Kaptein 1123, Kaptein 1654, Tree 4915) from the TNO dataset for subjective evaluation. The fused results of the different algorithms are shown in Figs. 5 - 8. In Fig. 5, we select a salient region (i.e., the red box) in each fused image, then zoom in and place it in the bottom right corner for clear comparison. As shown in Fig. 5, MDLatLRR loses the thermal emission target information, resulting in a failure to capture infrared targets in the salient region. DenseFuse, IFCNN, and U2Fusion retain the thermal emission target information but suffer from serious noise contamination, which comes from the visible images. Moreover, FusionGAN preserves thermal radiation information to some extent, but suffers from the shortcoming of blurred infrared target edges. GTF, NestFuse, GANMcC, PMGI, and STDFusionNet are able to highlight the salient target. In particular, STDFusionNet generates a fused result that maintains the contrast of the salient targets well.

In the other three scenes, we select a background region with abundant detail in each fused image, and then zoom in on it and put it in the bottom right corner. Also, we label the salient target with a green box. From the fusion results of the remaining three image pairs, we can find that our STDFusionNet not only highlights the salient targets in the scene effectively, but also has a distinct advantage in maintaining the detailed texture of the background region. Specifically, in the Kaptein 1123 scene, the texture of the tree branches in the fused image generated by our method is the clearest, and STDFusionNet is the only method in which the sky is not contaminated by thermal radiation information. In the Kaptein 1654 scene, the streetlights in the background region are almost consistent with the visible image. Moreover, in the Tree 4915 image pair, it is almost impossible to distinguish the shrubs from their surroundings with any method other than ours and NestFuse. However, NestFuse weakens the thermal radiation targets in the significant regions. It is worth noting that STDFusionNet can highlight the infrared targets in the salient regions and effectively distinguish the shrubs from their surroundings.

By comparison, it can be found that STDFusionNet is able to selectively preserve the salient targets of infrared images and the texture details of visible images during the fusion process. This mainly benefits from the manually extracted salient target mask and the constructed loss function during network training.

2) Quantitative Results: The results of the four popular quantitative metrics on 20 image pairs from the TNO dataset are shown in Fig. 9 and Table I. Among the four metrics, our method has a significant superiority on three, i.e., EN, MI, and VIF. As for the SF metric, our STDFusionNet only lags behind IFCNN by a narrow margin.

It is important to note that STDFusionNet has the highest value on almost all image pairs on the VIF metric, which is consistent with the conclusions of the subjective evaluation and indicates that STDFusionNet generates fused images with better visual effects. The largest EN demonstrates that the fused images generated by our proposed method contain more abundant information than those of the other nine comparison approaches. The largest MI indicates that our method transfers more information from the source images to the fused images. Although the SF metric of our algorithm is not the best, the comparable results still indicate that our fused results have sufficient gradient information.

C. Generalization Experiment

The generalization ability of the network is an important basis for evaluating the performance of a deep model. In order to evaluate the generalization ability of our STDFusionNet, we use the image pairs of the RoadScene dataset to test the model trained on the TNO dataset. Since the visible images contained in the RoadScene dataset are in color, we use a specific fusion strategy [34] to achieve image fusion that preserves the color. Specifically, the RGB visible images are first converted to the YCbCr color space. Then the Y channel and the grayscale infrared image are used for fusion, as the structural details are mainly in the Y channel. Finally, through the inverse conversion, the fused image can be converted into the RGB color space with the Cb and Cr (chrominance) channels of the visible image (see the code sketch at the end of this subsection).

1) Qualitative Results: The fused results of the different methods are shown in Figs. 10 - 13. From the results, we can observe that our STDFusionNet selectively preserves useful information from both the infrared and visible images. Compared to the fused images generated by other methods, our fused images are very close to the infrared images in the salient regions, and the texture structure of the visible images is almost completely preserved in the background regions.

Although other methods can highlight the distinctive targets, the background of their fused images is extremely unpleasant. In particular, the sky in the fused image is heavily contaminated with thermal information, and it is not even possible to accurately estimate the current time and weather from the fused image, which is fatal for road scenes. Moreover, other methods are undesirable in retaining texture details in background areas, such as the writing on walls, bicycles and tree stumps, fences, street lights, etc. In contrast, STDFusionNet effectively preserves the background region detail information while maintaining and even enhancing the intensity and contrast of the thermal infrared targets in the salient regions.

2) Quantitative Results: We also select 20 infrared and visible image pairs from the RoadScene dataset for objective evaluation, and the performance of the different methods on the four metrics is shown in Table I and Fig. 14. Similar to the results on the TNO dataset, our STDFusionNet has the best average values for three metrics, i.e., MI, VIF and SF, but the advantage is not as pronounced as on the TNO dataset. On the EN metric, our method only trails NestFuse by a narrow margin.

In general, both the qualitative and quantitative results demonstrate that our STDFusionNet has good generalization performance and is less affected by the characteristics of the imaging sensors.
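The color-preserving fusion strategy described above can be sketched with OpenCV as follows (a minimal illustration; note that OpenCV orders the channels as Y, Cr, Cb, and the fuse_y callable is a placeholder for the trained grayscale fusion model, not an API from the paper):

```python
import cv2
import numpy as np

def fuse_color(vis_bgr, ir_gray, fuse_y):
    """Fuse only the luminance, then reinstate the visible chrominance."""
    ycrcb = cv2.cvtColor(vis_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    fused_y = fuse_y(y, ir_gray)                      # fuse the Y channel with the infrared image
    fused_y = np.clip(fused_y, 0, 255).astype(np.uint8)
    fused_ycrcb = cv2.merge([fused_y, cr, cb])        # reuse the visible Cr/Cb channels
    return cv2.cvtColor(fused_ycrcb, cv2.COLOR_YCrCb2BGR)
```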