
International Journal of Computer Vision
https://doi.org/10.1007/s11263-021-01501-8

SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion

Hao Zhang1 · Jiayi Ma1

Received: 12 December 2020 / Accepted: 5 July 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
In this paper, a squeeze-and-decomposition network (SDNet) is proposed to realize multi-modal and digital photography image fusion in real time. Firstly, we generally transform multiple fusion problems into the extraction and reconstruction of gradient and intensity information, and design a universal form of loss function accordingly, which is composed of an intensity term and a gradient term. For the gradient term, we introduce an adaptive decision block to decide the optimization target of the gradient distribution according to the texture richness at the pixel scale, so as to guide the fused image to contain richer texture details. For the intensity term, we adjust the weight of each intensity loss term to change the proportion of intensity information from different images, so that the loss can be adapted to multiple image fusion tasks. Secondly, we introduce the idea of squeeze and decomposition into image fusion. Specifically, we consider not only the squeeze process from the source images to the fused result, but also the decomposition process from the fused result back to the source images. Because the quality of the decomposed images directly depends on the fused result, this forces the fused result to contain more scene details. Experimental results demonstrate the superiority of our method over the state of the art in terms of subjective visual effect and quantitative metrics in a variety of fusion tasks. Moreover, our method is much faster than the state of the art and can deal with real-time fusion tasks.

Keywords Image fusion · Real time · Adaptive · Proportion · Squeeze decomposition

Communicated by Ioannis Gkioulekas.
Corresponding author: Jiayi Ma, [email protected]
Hao Zhang, [email protected]
1 Electronic Information School, Wuhan University, Wuhan 430072, China

1 Introduction

Due to the limitation of hardware devices and optical imaging, an image obtained by a single sensor or under a single shooting setting can often capture only part of the details in the scene. For example, the image generated by capturing visible light usually only withstands a limited illumination variation and has a predefined depth-of-field. In addition, it is also susceptible to external factors such as the weather when shooting. Naturally, image fusion can extract the most meaningful information from images acquired by different sensors or under different shooting settings, and combine this information to generate a single image that contains more texture content (Piella 2009). Because of the excellent performance of the fused image, image fusion as an enhancement method is widely used in many fields such as military detection, medical diagnosis, and remote sensing (Ma et al. 2017; Xing et al. 2018; Ma et al. 2021; Kong et al. 2007; Shen et al. 2015).

Typically, image fusion scenarios can be divided into two categories according to the differences in imaging of source images. The first category is multi-modal image fusion. Due to factors such as imaging environment or device performance, sometimes a single sensor cannot effectively describe the entire scene, and combining multiple sensors for observation is a better choice. For example, positron emission tomography (PET) can produce images that reflect the metabolic state of the body, while magnetic resonance imaging (MRI) can provide excellent structure textures of organs and tissues (Liu et al. 2017). The infrared image can distinguish the target from the background, while the visible image contains more texture details (Ma et al. 2016). The second category is digital photography image fusion. Due to the limit of technology, the sensor is often unable to characterize all content in the scenario under a single setting.
Fig. 1 Schematic illustration of multi-modal image fusion and digital photography image fusion. First row: source image pairs to be fused (infrared/visible, PET/MRI, underexposed/overexposed, near-focused/far-focused); second row: fused results of our proposed method (SDNet). (a) Infrared and visible image fusion; (b) PET and MRI image fusion; (c) multi-exposure image fusion; (d) multi-focus image fusion

Concretely, it is difficult to have all objects of different depth-of-field be all-in-focus within one image (Ma et al. 2020). Besides, the image is sometimes exposed to inappropriate exposures such as underexposure and overexposure (Hayat and Imran 2019; Goshtasby 2005). Under these circumstances, the scene can be described more comprehensively by combining images under different shooting settings. A few examples are provided in Fig. 1 to illustrate these two types of image fusion scenarios more intuitively.

In recent years, researchers have proposed a number of methods to solve the image fusion problem, which can be broadly divided into two categories. The first category is traditional image fusion methods, which usually use relevant mathematical transformations to realize the fusion by designing the activity level measurement and fusion rules in the spatial domain or the transform domain (Li et al. 2012; Zhao et al. 2019; Shen et al. 2014; Paul et al. 2016; Ballester et al. 2006; Szeliski et al. 2011). The second category is deep learning-based methods. Methods of this type usually constrain the fused image by constructing an objective function to make it have the desired distribution characteristics. Because of the strong nonlinear fitting ability of neural networks, this kind of methods can usually achieve better fused results (Ma et al. 2019; Prabhakar et al. 2017; Liu et al. 2017; Lai and Fang 1998).

Although the existing methods have achieved promising results in most cases, there are still several aspects to be improved. First, the existing traditional methods usually need to manually design the activity level measurement and fusion rules, which become complex because of the diversity of source images. This also limits fusion results because it is impossible to consider all the factors in one manually designed way. Second, the most prominent obstacle in applying deep learning to image fusion is the lack of the ground-truth fused image for supervised learning. A few methods solve this difficulty by manually constructing the ground truth, which is usually inaccurate and will set an upper limit for learning. Third, as mentioned earlier, there are large differences between image fusion tasks. In multi-modal image fusion, source images are captured by different sensors. Conversely, source images in digital photography image fusion are taken by the same sensor under different shooting settings. As a result, the existing methods cannot solve different image fusion problems according to the same idea. Finally, the existing methods are usually less competitive in operating efficiency due to the large number of parameters or the high complexity of fusion rules.

To address the above-mentioned challenges, we design a squeeze-and-decomposition network, called SDNet, to implement multi-modal image fusion and digital photography image fusion end-to-end in real time. Our design is mainly developed from the following two aspects.

On the one hand, we model the multi-modal image fusion and the digital photography image fusion as the extraction and reconstruction of intensity and gradient information. Our opinion is that the information contained in an image can be divided into gradient and intensity information, wherein the gradient information represents the texture structure, while the intensity information indicates the overall brightness distribution of the image. Based on this idea, we design a loss function in a universal form for the above two types of image fusion scenarios, which can force the network to extract the gradient and intensity information and fuse them by two different rules. Specifically, for the gradient information, we believe that, apart from noise, areas with strong gradients are clear or have a large amount of texture content. Based on this observation, we propose an adaptive decision block, which first uses a Gaussian low-pass filter to reduce the effects of noise, and then scores each pixel based on the level of the gradient, thereby directing the gradient distribution of the fused image to approximate the source pixel with the larger gradient strength. For the intensity information, because different fusion tasks have different preferences for intensity
information preservation, we select more effective and interesting intensity information to be preserved in the fused result by adjusting the weight proportion of each intensity loss item. By using these two strategies to extract and reconstruct gradient and intensity information, the proposed loss function can be well adopted in multi-modal image fusion and digital photography image fusion.

On the other hand, we propose a fast SDNet to implement more effective image fusion. The previous methods only consider the squeeze process from the source images to the fusion result; then, can the fused result be decomposed to regenerate the source images? Although part of the information will inevitably be discarded in the fusion process, requiring the decomposition result to be consistent with the source images will reduce the information loss as much as possible. In other words, this decomposition consistency will force the fused result to contain more scene details, because the quality of the decomposition result directly depends on the fused result. Based on this motivation, we design a squeeze-and-decomposition network, which contains two parts: squeeze and decomposition. In the squeeze stage, the source images are fused into a single image, while in the decomposition stage, the fused result is re-decomposed into the source images. This squeeze-and-decomposition network is likewise suitable for both the multi-modal and digital photography image fusion.

Our method has the following advantages. First of all, our method does not need to design the activity level measurement and fusion rules, and can implement end-to-end fusion. Second, our network does not require ground truth for supervised learning, but uses unsupervised learning with weak constraints. Third, our method is not only applicable to fusion of images obtained by multi-modality imaging, but also to fusion of images obtained by digital photography. It is worth noting that, due to the use of 1 × 1 convolution kernels and the control of the number of feature channels, the quantity of parameters in our network is limited within a certain range. As a result, our method can achieve fusion at a high speed. Our contributions include the following five aspects:

– We propose a new end-to-end image fusion model, which can realize the multi-modal image fusion and the digital photography image fusion well.
– We design a specific form of loss function, which can force the network to generate expected fused results.
– We propose an adaptive decision block for the gradient loss terms, which can reduce the effect of noise and effectively guide the fused result to contain richer texture details.
– We design a squeeze-and-decomposition network, which can focus on the two stages of fusion and decomposition at the same time, so as to make the fused result contain more scene details.
– Our method can perform image fusion in real time for multiple fusion tasks. The code is publicly available at: https://github.com/HaoZhang1018/SDNet.

A preliminary version of this manuscript appeared in Zhang et al. (2020). The primary new contributions include the following two aspects. First, we design an adaptive decision block to constrain the gradient information instead of the previous manually proportioned setting strategy. On the one hand, it reduces the number of hyper-parameters that need to be manually adjusted. On the other hand, it makes our method perform better, especially in multi-focus image fusion. Second, we have further improved the network, which considers not only the fusion process but also the decomposition process. This decomposition consistency can make the fused image contain more scene details and thus have a better visual effect.

The remainder of this paper is organized as follows. Section 2 describes some related work, including an overview of existing traditional and deep learning-based fusion methods. Section 3 provides the overall framework, loss functions and network architecture design. In Sect. 4, we give the detailed experimental settings and compare our method with several state-of-the-art methods on publicly available datasets by qualitative and quantitative comparisons. In addition, we also carry out comparative experiments of efficiency, ablation experiments, visualization of decomposition, infrared and RGB visible image fusion, sequence image fusion, comparison with the preliminary version (Zhang et al. 2020), and application verification in this section. Conclusions are given in Sect. 5.

2 Related Work

With various methods proposed, the field of image fusion has made great progress. Existing methods can be broadly divided into traditional and deep learning-based methods.

The traditional methods usually use related mathematical transformations and manual design of the fusion rules to realize the image fusion. Piella (2003) presented an overview of image fusion techniques using multi-resolution decompositions, which make a multi-resolution segmentation based on all different input images, and this segmentation is subsequently used to guide the infrared and visible image fusion process. A ghost-free multi-exposure image fusion technique using the dense SIFT descriptor and guided filter was proposed by Hayat and Imran (2019), which can produce high-quality images without artifacts using ordinary cameras. Paul et al. (2016) proposed a general algorithm for multi-focus and multi-exposure image fusion, which is based on blending the gradients of the luminance components of the input images using the maximum gradient magnitude at each
pixel location and then obtaining the fused luminance using a Haar wavelet-based image reconstruction technique. Fu et al. (2019) introduced a more accurate spatial preservation based on local gradient constraints into remote sensing image fusion, which can fully utilize the spatial information contained in the PAN image while maintaining spectral information. As a result, they can obtain very promising fused results.

Compared with traditional methods, deep learning-based methods can learn fusion models with good generalization ability from a large amount of data. In the field of infrared and visible image fusion, Ma et al. (2019) proposed an end-to-end model called FusionGAN, which generates a fused image with a dominant infrared intensity and an additional visible gradient on the basis of GAN. Subsequently, they introduced a dual-discriminator (Ma et al. 2020), a detail loss and a target edge-enhancement loss (Ma et al. 2020) based on FusionGAN to further enhance the texture details in the fused results. In the field of multi-exposure image fusion, Prabhakar et al. (2017) proposed an unsupervised deep learning framework that utilizes a no-reference quality metric as a loss function and can produce satisfactory fusion results. Xu et al. (2020) introduced an end-to-end architecture based on GAN with a self-attention mechanism and achieved state-of-the-art performance. In medical image fusion, Liu et al. (2017) used a neural network to generate the weight map that integrates pixel activity levels of two source images, while the fusion process is conducted in a multi-scale manner via image pyramids. With the application of deep learning, great progress has also been made in the field of multi-focus image fusion. In particular, Ma et al. (2020) proposed an unsupervised network to generate the decision map for fusion, which can indicate whether a pixel is focused. Deep networks have also promoted the progress of remote sensing image fusion. Zhou et al. (2019) designed a deep model composed of an encoder network and a pyramid fusion network to fuse the low-resolution hyperspectral and high-resolution multi-spectral images, which improves the preservation of spatial information by progressive refinement. Ma et al. (2020) proposed an unsupervised deep model for pansharpening to make full use of the texture structure in panchromatic images. They transformed pansharpening into multi-task learning by using two independent discriminators, which preserve spectral and spatial information well. Our preliminary version PMGI (Zhang et al. 2020) proposed a new image fusion network based on proportional maintenance of gradient and intensity information, which can realize a variety of image fusion tasks. However, changing the maintenance proportion of the gradient information by adjusting the weight will cause a certain degree of texture structure loss or blur, thereby reducing the quality of the fused result. Therefore, in this paper we improve PMGI to realize better fusion performance.

3 Method

In this section, we give a detailed introduction of our SDNet. We first introduce the overall framework, and then give the definition of the loss function. Finally, we provide the detailed structure of the network. Note that the source images are assumed to be pre-registered in our method (Ma et al. 2021).

3.1 Overall Framework

The idea of image fusion is to extract and combine the most meaningful information from the source images. On the one hand, for different image fusion tasks, the most meaningful information contained in the source images is different. Because there is no common standard for such meaningful information, existing methods are usually difficult to migrate to other fusion tasks. Thus, it is desirable to develop a versatile model to fulfill multiple types of image fusion tasks. On the other hand, it is very important to preserve as much information from the source images as possible in the fused image. Our method is designed based on the above two observations, and it is an end-to-end model.

Firstly, we divide the meaningful information into two categories: gradient and intensity information. For any image, the most essential element is the pixel. The intensity of pixels can represent the overall brightness distribution, which can reflect contrast characteristics of the image. The difference between pixels constitutes the gradient, which can represent the texture details in the image. Therefore, multi-modal image fusion and digital photography image fusion can be modeled as the extraction and reconstruction of these two kinds of information, as shown in Fig. 2. The extraction and reconstruction of gradient and intensity information depend on the design of the loss function. In our model, we propose a universal loss function for different image fusion tasks, which consists of a gradient loss term and an intensity loss term constructed between the fused image and both source images. Although using intensity loss and gradient loss (Ma et al. 2019, 2020) is a common practice in a specific image fusion task (Szeliski et al. 2011), it is non-trivial to extend them to other image fusion tasks. To this end, the reconstruction rules we design for gradient information and intensity information are greatly different. For gradient information reconstruction, we introduce an adaptive decision block that acts on the gradient loss term. The adaptive decision block first uses Gaussian low-pass filtering to reduce the influence of noise on the decision-making process, and then evaluates the importance of corresponding pixels based on the gradient richness, so as to generate the pixel-scale decision map that guides textures in the fused image to approximate those in the source pixels with richer textures.
As a result, the texture details contained in the fused image are consistent with the texture details that are strongest in the corresponding regions of the source images. For the intensity information reconstruction, we adopt the proportioned setting strategy. Specifically, we adjust the weight ratio of the intensity loss items between the fused image and the two source images, so as to satisfy the requirements of different tasks on intensity distribution. For example, in infrared and visible image fusion, the intensity distribution of the fused image should be more biased towards the infrared image to maintain significant contrast. In MRI and PET image fusion, the intensity distribution of the fused image should be more inclined to the PET image, so as to preserve the functional and metabolic information of biological tissues. The adaptive decision block and the proportioned setting strategy will be introduced in detail in Sect. 3.2.

Fig. 2 Overall fusion framework of our SDNet: the squeeze network produces the fused image under an intensity loss (with the proportioned setting) and a gradient loss (with the adaptive decision block), while the decomposition network is constrained by the decomposition consistency losses

Secondly, we propose the idea of squeezing and decomposing to preserve as much information from the source images as possible in the fused result. Concretely, the proposed SDNet is composed of two parts: the squeeze network and the decomposition network, as shown in Fig. 2. The squeeze network is the target network for realizing image fusion, which is dedicated to squeezing the source images into a single fused image. In contrast, the decomposition network is dedicated to decomposing this fused result to obtain images consistent with the source images. On the whole, SDNet is very similar to an auto-encoder network. The difference is that the intermediate result of SDNet is the fused image, while the intermediate result of the auto-encoder network is an encoding vector. What they share is that both require the intermediate result to contain more scene content, which is conducive to reconstructing the source images. Therefore, the squeeze-and-decomposition network we design can force the fused result to contain richer scene details, and thus have a better fusion effect.

3.2 Loss Functions

Our SDNet is divided into two parts, where the squeeze network generates a single fused image through the extraction and reconstruction of intensity and gradient information, and the decomposition network is dedicated to decomposing results that approximate the source images from the fused result. Correspondingly, the loss function also consists of two parts: the squeeze fusion loss Lsf and the decomposition consistency loss Ldc, which is defined as:

$\mathcal{L} = \mathcal{L}_{\mathrm{sf}} + \mathcal{L}_{\mathrm{dc}}.$ (1)

3.2.1 Squeeze Fusion Loss

The squeeze fusion loss Lsf determines the type of information extracted and the primary and secondary relationships between the various types of information in reconstruction. Because our method is based on the extraction and reconstruction of gradient and intensity information, our loss function consists of two types of loss terms, the gradient loss Lgrad and the intensity loss Lint. We formalize it as:

$\mathcal{L}_{\mathrm{sf}} = \beta \mathcal{L}_{\mathrm{grad}} + \mathcal{L}_{\mathrm{int}},$ (2)

where β is used to balance the intensity and gradient terms.

The gradient loss Lgrad forces the fused image to contain rich texture detail. We introduce an adaptive decision block into the gradient loss term to guide the texture of the fused image to be consistent with the strongest texture at the corresponding position of the source images, which is defined as:

$\mathcal{L}_{\mathrm{grad}} = \frac{1}{HW}\sum_{i}\sum_{j}\Big[ S_{1}^{i,j}\big(\nabla I_{\mathrm{fused}}^{i,j}-\nabla I_{1}^{i,j}\big)^{2} + S_{2}^{i,j}\big(\nabla I_{\mathrm{fused}}^{i,j}-\nabla I_{2}^{i,j}\big)^{2}\Big],$ (3)

where i and j represent the pixel in the i-th row and the j-th column of the decision maps or gradient maps, H and W represent the height and width of the image, I1 and I2 are the source images, Ifused is the fused image, and ∇(·) represents the operation of finding the gradient map using the Laplacian operator. In addition, S(·) is the decision map generated by the decision block based on the gradient level of the source images. The schematic diagram of the adaptive decision block is shown in Fig. 3. In order to reduce the influence of noise on the gradient judgment, the decision block first performs Gaussian low-pass filtering on the source images. Then, we find the gradient maps using the Laplacian operator, and the decision maps are generated at the pixel scale according to the magnitude of the gradient. The entire generation process of the decision maps can be formalized as:

$S_{1}^{i,j} = \mathrm{sign}\Big( \big|\nabla\big(L(I_{1}^{i,j})\big)\big| - \min\big( \big|\nabla\big(L(I_{1}^{i,j})\big)\big|, \big|\nabla\big(L(I_{2}^{i,j})\big)\big| \big) \Big),$ (4)

$S_{2}^{i,j} = 1 - S_{1}^{i,j},$ (5)

where |·| indicates the absolute value function, ∇(·) is the Laplacian operator, L(·) denotes the Gaussian low-pass filter function, min(·) denotes the minimum function, and sign(·) is the sign function. It is worth noting that the size of S(·) is also H × W. Since the two source images are both filtered with the low-pass function and the pixels with larger gradient values are selected, normal texture is hardly misjudged. A similar idea of choice decision is mentioned in Goshtasby (2005). However, our adaptive decision block is obviously more advanced for the following reasons. First, the weight function used in Goshtasby (2005) is on the patch scale, while the proposed adaptive decision block is on the pixel scale. From this perspective, our adaptive decision block is more refined. Second, the weight function in Goshtasby (2005) is based on the information entropy, which is not robust to noise. Specifically, when a patch contains more noise, it will be unreasonably given a greater weight instead. In contrast, the proposed adaptive decision block not only considers the influence of noise, but also makes decisions based on the scene texture richness, which is more reasonable. Third, the weight function in Goshtasby (2005) is directly applied to the source images, which is essentially a linear mapping from the source images to the fused image. The difference is that our adaptive decision block acts on the gradient loss function to guide the fused image to preserve rich textures at the pixel scale in an optimized manner, which is actually a non-linear mapping with better performance.

Fig. 3 Schematic diagram of the adaptive decision block: Gaussian low-pass filtering, Laplacian-based gradient computation, and complementary selection of the two decision maps
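To make the decision-map generation of Eqs. (4) and (5) concrete, a minimal Python (NumPy/SciPy) sketch is given below. The Gaussian filter width sigma is an assumed value, since the standard deviation and kernel size are not specified here; the rest follows the formulas directly.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def decision_maps(I1, I2, sigma=2.0):
    """Pixel-scale decision maps of Eqs. (4)-(5) (sketch; sigma is an assumed value).

    I1, I2: source images as float arrays of identical shape (H, W).
    Returns S1, S2 with S1 + S2 = 1 at every pixel.
    """
    # 1) Gaussian low-pass filtering to suppress noise before judging gradients.
    g1 = gaussian_filter(I1.astype(np.float64), sigma)
    g2 = gaussian_filter(I2.astype(np.float64), sigma)
    # 2) Gradient maps via the Laplacian operator, taken in absolute value.
    grad1 = np.abs(laplace(g1))
    grad2 = np.abs(laplace(g2))
    # 3) Complementary selection, Eq. (4): S1 = sign(|grad1| - min(|grad1|, |grad2|)),
    #    i.e. 1 where I1 has the strictly larger filtered gradient and 0 elsewhere.
    S1 = np.sign(grad1 - np.minimum(grad1, grad2))
    S2 = 1.0 - S1
    return S1, S2
```

With the sign convention of Eq. (4), S1 equals 1 only where the filtered gradient of I1 is strictly larger, so ties fall to the second source; Eq. (3) then compares the fused gradient only against the winning source at each pixel.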
The intensity loss Lint guides the fused image to preserve useful information that is represented by the pixel intensity, such as the contrast. Meanwhile, it can make the overall scene style of the fused image more natural and not divorced from reality. The intensity loss can be formalized as:

$\mathcal{L}_{\mathrm{int}} = \frac{1}{HW}\sum_{i}\sum_{j}\Big[ \big(I_{\mathrm{fused}}^{i,j}-I_{1}^{i,j}\big)^{2} + \alpha\big(I_{\mathrm{fused}}^{i,j}-I_{2}^{i,j}\big)^{2}\Big].$ (6)

Because different image fusion tasks have different propensity requirements for the intensity distribution of the fused image, we adopt the proportioned setting strategy to adjust α, so as to satisfy the intensity distribution requirements of different types of fusion tasks. The proportioned setting strategies of α are related to the type of image fusion, and are summarized below.

For multi-modal image fusion, the intensity distribution of the fused result is often biased to a certain source image. For instance, in infrared and visible image fusion, the main intensity information should be obtained from the infrared image, so as to retain significant contrast. Similarly, for MRI and PET image fusion, the main intensity information should be obtained from the PET image, thus preserving the functional activity information of the organism. Therefore, the parameter α should meet the following setting rules:

$I_{1} = I_{\mathrm{ir}},\; I_{2} = I_{\mathrm{vis}},\; \alpha < 1,$ (7)
$I_{1} = I_{\mathrm{PET}},\; I_{2} = I_{\mathrm{MRI}},\; \alpha < 1.$ (8)

For digital photography image fusion, the scene content of source images captured under different shooting settings is often highly complementary, and the intensity information of the fused image should come from all source images uniformly. For example, in the multi-exposure image fusion task, both overexposed and underexposed images contain texture details, but their intensity is too strong or too weak. Therefore, the same weights should be set to balance them to get the right intensity. Similarly, for multi-focus image fusion, the source images contain complementary textures, and their intensity is equally important. So the parameter α should meet the following setting rules:

$I_{1} = I_{\mathrm{over}},\; I_{2} = I_{\mathrm{under}},\; \alpha = 1,$ (9)
$I_{1} = I_{\mathrm{focus1}},\; I_{2} = I_{\mathrm{focus2}},\; \alpha = 1.$ (10)

3.2.2 Decomposition Consistency Loss

The decomposition consistency loss Ldc requires the decomposition results from the fused image to be as similar as possible to the source images, and is defined as:

$\mathcal{L}_{\mathrm{dc}} = \frac{1}{HW}\sum_{i}\sum_{j}\Big[ \big(I_{1\_\mathrm{de}}^{i,j}-I_{1}^{i,j}\big)^{2} + \big(I_{2\_\mathrm{de}}^{i,j}-I_{2}^{i,j}\big)^{2}\Big],$ (11)

in which I1_de and I2_de are the results of decomposition from the fused image, and I1 and I2 are the source images. Because the degree of similarity between the decomposition results and the source images directly depends on the quality of the fused image, the decomposition consistency loss can force the fused result to contain more scene content, so as to achieve better fusion performance.
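Putting Sect. 3.2 together, the following PyTorch-style sketch evaluates the total objective of Eq. (1) from Eqs. (2), (3), (6) and (11). It is a sketch under stated assumptions rather than the official implementation: the Laplacian is discretized with a fixed 3 × 3 kernel (an assumption), the decision maps S1 and S2 are precomputed by the adaptive decision block, and alpha and beta are the task-dependent weights of Eqs. (7)–(10) and Sect. 4.1.2.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Laplacian kernel (an assumed discretization of the operator in Eq. (3)).
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian(x):
    """Gradient map of a single-channel batch (N, 1, H, W) via the Laplacian operator."""
    return F.conv2d(x, LAPLACIAN.to(x.device), padding=1)

def sdnet_loss(fused, dec1, dec2, I1, I2, S1, S2, alpha, beta):
    """Total loss L = Lsf + Ldc of Eq. (1); a sketch, not the authors' official code.

    fused      : output of the squeeze network
    dec1, dec2 : outputs of the decomposition network
    I1, I2     : source images (I1 is the intensity-dominant one, e.g. infrared/PET)
    S1, S2     : pixel-scale decision maps from the adaptive decision block
    alpha      : intensity trade-off of Eq. (6), set by Eqs. (7)-(10)
    beta       : gradient/intensity balance of Eq. (2)
    """
    # Gradient loss, Eq. (3): follow the source with the richer texture at each pixel.
    g_f, g_1, g_2 = laplacian(fused), laplacian(I1), laplacian(I2)
    loss_grad = torch.mean(S1 * (g_f - g_1) ** 2 + S2 * (g_f - g_2) ** 2)

    # Intensity loss, Eq. (6): proportioned setting between the two sources.
    loss_int = torch.mean((fused - I1) ** 2 + alpha * (fused - I2) ** 2)

    # Squeeze fusion loss, Eq. (2).
    loss_sf = beta * loss_grad + loss_int

    # Decomposition consistency loss, Eq. (11).
    loss_dc = torch.mean((dec1 - I1) ** 2 + (dec2 - I2) ** 2)

    return loss_sf + loss_dc
```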
3.3 Network Architecture

The proposed SDNet is composed of two sub-networks, i.e., the squeeze network and the decomposition network, as shown in Fig. 4.

Fig. 4 Network architecture of the proposed SDNet (convolution + Leaky ReLU blocks with channel concatenation; tanh output layers; source images → fused image → decomposed images)

The purpose of the squeeze network is to fuse the source images into a single image that contains richer texture content. The nature of the source image pairs tends to be quite different: some are captured by different sensors, and some are shot with the same sensor at different shooting settings. As a result, it is a better choice to handle them separately. Because the pseudo-siamese network is very good at processing data with large differences, we design the squeeze network with reference to the pseudo-siamese network, to realize a variety of image fusion tasks. At the same time, as there is information loss caused by padding in the convolution process, we use dense connections like DenseNet (Huang et al. 2017) to reduce information loss and maximize the use of information. In each path, we use four convolutional layers for feature extraction. The first layer uses a 5 × 5 convolution kernel and the latter three layers use a 3 × 3 convolution kernel, all with the leaky ReLU activation function. Then, we fuse the features extracted from the two paths, and use the strategies of concatenation and convolution to achieve this purpose. We concatenate the two feature maps along the channel dimension. The kernel size of the last convolutional layer is 1 × 1, and its activation function is tanh.

The decomposition network is dedicated to decomposing the fused image to obtain results similar to the source images. We first use one common convolutional layer to extract features from the fused image. Then, we implement decomposition from two branches to generate the results, where each branch contains three convolutional layers. The first common convolutional layer uses a convolution kernel with a size of 1 × 1, and the remaining convolutional layers all use convolution kernels with a size of 3 × 3. Except for the last convolutional layer, all other convolutional layers use the leaky ReLU as the activation function, and the last uses tanh as the activation function.

In all convolution layers, the padding is set to SAME and the stride is set to 1. As a result, none of these convolutional layers changes the size of the feature map.
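As an illustration of Fig. 4 and the description above, the following PyTorch sketch instantiates one possible version of the two sub-networks. The kernel sizes, dense connections, Leaky ReLU/tanh activations and the 1 × 1 fusion and entry layers follow the text; the channel width c and the exact dense-connection pattern are assumptions, and the official repository (https://github.com/HaoZhang1018/SDNet) remains the authoritative reference.

```python
import torch
import torch.nn as nn

class DenseBranch(nn.Module):
    """One pseudo-siamese path of the squeeze network: a 5x5 layer followed by three
    3x3 layers with DenseNet-style connections and Leaky ReLU (width c is an assumption)."""
    def __init__(self, c=16):
        super().__init__()
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.conv1 = nn.Conv2d(1, c, 5, padding=2)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)
        self.conv3 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.conv4 = nn.Conv2d(3 * c, c, 3, padding=1)

    def forward(self, x):
        f1 = self.act(self.conv1(x))
        f2 = self.act(self.conv2(f1))
        f3 = self.act(self.conv3(torch.cat([f1, f2], dim=1)))
        return self.act(self.conv4(torch.cat([f1, f2, f3], dim=1)))

class SqueezeNetwork(nn.Module):
    """Squeeze network: two branches, channel concatenation, and a 1x1 tanh output layer."""
    def __init__(self, c=16):
        super().__init__()
        self.branch1, self.branch2 = DenseBranch(c), DenseBranch(c)
        self.fuse = nn.Conv2d(2 * c, 1, 1)

    def forward(self, I1, I2):
        feats = torch.cat([self.branch1(I1), self.branch2(I2)], dim=1)
        return torch.tanh(self.fuse(feats))

class DecompositionNetwork(nn.Module):
    """Decomposition network: a shared 1x1 layer, then two three-layer 3x3 branches
    ending in tanh, each regenerating one source image."""
    def __init__(self, c=16):
        super().__init__()
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.shared = nn.Conv2d(1, c, 1)
        def branch():
            return nn.ModuleList([nn.Conv2d(c, c, 3, padding=1),
                                  nn.Conv2d(c, c, 3, padding=1),
                                  nn.Conv2d(c, 1, 3, padding=1)])
        self.b1, self.b2 = branch(), branch()

    def _run(self, layers, x):
        x = self.act(layers[0](x))
        x = self.act(layers[1](x))
        return torch.tanh(layers[2](x))

    def forward(self, fused):
        h = self.act(self.shared(fused))
        return self._run(self.b1, h), self._run(self.b2, h)
```

Consistent with the text, all layers use stride 1 and SAME-style padding, so feature-map sizes are preserved; at test time only the squeeze network is needed.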
4 Experiments

In this section, we verify the superiority of our SDNet on multi-modal image fusion and digital photography image fusion. First, we give the specific experimental settings. Then, we provide the qualitative and quantitative experimental results of the two types of fusion scenarios, and analyze the results. Besides, we also conduct the comparative experiment of efficiency, ablation experiments, visualization of decomposition, infrared and RGB visible image fusion, sequence image fusion, and comparison with our preliminary version (Zhang et al. 2020). Finally, the application verification is provided.

4.1 Experimental Settings

4.1.1 Data

We verify our SDNet in multi-modal image fusion and digital photography image fusion. Two representative multi-modal fusion tasks are medical image fusion and infrared and visible image fusion, while the typical digital photography image fusion tasks are multi-exposure image fusion and multi-focus image fusion. The training and test sets for all fusion tasks are from publicly available datasets: MRI and PET images from the Harvard medical school website (http://www.med.harvard.edu/AANLIB/home.html) for the medical image fusion task; the TNO dataset (http://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029) for the infrared and visible image fusion task; the dataset provided by Cai et al. (2018) for the multi-exposure image fusion task; and the dataset provided by Nejati et al. (2015) for multi-focus image fusion.

In these four image fusion tasks, the number of image pairs used for testing is 10 in each case. For training, in order to obtain more training data, we adopt the expansion strategy of tailoring and decomposition. Specifically, for the medical image fusion task, we crop the remaining images into 13,328 image patch pairs of size 120 × 120 for training; for the infrared and visible image fusion task, we crop the remaining images into 80,881 image patch pairs of size 120 × 120 for training; for the multi-exposure image fusion task, we crop the remaining images into 91,840 image patch pairs of size 120 × 120 for training; and for the multi-focus image fusion task, we crop the remaining images into 184,790 image patch pairs of size 60 × 60 for training. Since the proposed SDNet is a fully convolutional network, the source images do not need to be cropped into small patches with the same size as the training data during the test phase. In other words, the test is performed on the original size of the source images.

4.1.2 Training Details

For fusion tasks where the source images are all grayscale images, the proposed model can be directly used to fuse the source images and generate the fused result, such as infrared and visible image fusion. For grayscale and color image fusion tasks, such as medical image fusion, we first transform the color image from RGB to YCbCr color space. Because the Y channel (luminance channel) can represent structural details and brightness variation, we fuse the Y channel of the color source image with the grayscale source image. Then, we directly concatenate the Cb and Cr channels of the color source image with the fused Y channel, and transfer these components to RGB space to obtain the final result. For fusion tasks where the source images are all color images, such as multi-exposure and multi-focus image fusion, we transform all color source images from RGB to YCbCr color space. Then, we fuse the Y channels of the source images using the proposed model, and follow Eq. (12) to fuse the Cb and Cr channels of the source images:

$C = \frac{C_{1}\,|C_{1}-\zeta| + C_{2}\,|C_{2}-\zeta|}{|C_{1}-\zeta| + |C_{2}-\zeta|},$ (12)

where |·| indicates the absolute value function, C is the fused Cb or Cr, and C1 and C2 represent the Cb or Cr of the two source images, respectively. In addition, ζ is the median value of the dynamic range, which is set to 128. Finally, the fused components are transferred to RGB space to obtain the final result.

The batch size is set to b, and it takes m steps to train one epoch. The total number of training epochs is M. In our experiment, we set b = 32, M = 30, and m is set as the ratio between the whole number of patches and b. The parameters in our SDNet are updated by the Adam optimizer. In addition, α of Eq. (6) in the four tasks is set according to the rules in Eqs. (7)–(10): 0.5, 0.5, 1 and 1. For β in Eq. (2), we aim to obtain the best results by setting it in the four tasks as: 10, 80, 50 and 3. All deep learning-based methods run on the same GPU (RTX 2080Ti), while the other methods run on the same CPU (Intel i7-8750H). It is worth noting that during the testing phase, only the squeeze network is used to generate the fused result.
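As a small illustration of the chrominance handling, the sketch below implements the fusion rule of Eq. (12) for a Cb or Cr channel in Python/NumPy. The midpoint ζ = 128 follows the text, while the epsilon guard and the final clipping are assumptions added to keep the division well defined for 8-bit data.

```python
import numpy as np

def fuse_chroma(C1, C2, zeta=128.0, eps=1e-8):
    """Fuse two Cb (or Cr) channels following Eq. (12).

    Each channel is weighted by its distance from the dynamic-range midpoint zeta,
    so the more saturated chrominance dominates; eps avoids division by zero when
    both channels equal zeta.
    """
    w1 = np.abs(C1.astype(np.float64) - zeta)
    w2 = np.abs(C2.astype(np.float64) - zeta)
    fused = (C1 * w1 + C2 * w2) / (w1 + w2 + eps)
    return np.clip(fused, 0.0, 255.0)
```

The fused Y channel produced by the squeeze network is then stacked with the fused Cb and Cr channels and converted back to RGB, as described above.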
4.2 Results on Multi-modal Image Fusion

The two representative tasks of multi-modal image fusion we selected are medical image fusion and infrared and visible image fusion. For the former, we use five state-of-the-art medical image fusion methods for comparison, including ASR (Liu and Wang 2014), PCA (Naidu and Raol 2008), NSCT (Zhu et al. 2019), CNN (Liu et al. 2017) and U2Fusion (Xu et al. 2020). For the latter, we select five state-of-the-art infrared and visible image fusion methods for comparison, namely GTF (Ma et al. 2016), MDLatLRR (Li et al. 2020), DenseFuse (Li and Wu 2018), FusionGAN (Ma et al. 2019) and U2Fusion (Xu et al. 2020).

4.2.1 Qualitative Comparisons

In each fusion task, three typical image pairs are selected to qualitatively demonstrate the performance of each method, as shown in Figs. 5 and 6. The qualitative analysis is as follows.

Fig. 5 Qualitative results of the medical image fusion. From left to right: PET image, MRI image, fused results of ASR (Liu and Wang 2014), PCA (Naidu and Raol 2008), NSCT (Zhu et al. 2019), CNN (Liu et al. 2017), U2Fusion (Xu et al. 2020) and our SDNet

Fig. 6 Qualitative results of infrared and visible image fusion. From left to right: visible image, infrared image, fused results of GTF (Ma et al. 2016), MDLatLRR (Li et al. 2020), DenseFuse (Li and Wu 2018), FusionGAN (Ma et al. 2019), U2Fusion (Xu et al. 2020) and our SDNet

For medical image fusion, the three typical results we select are on different transaxial sections of the brain hemispheres. From these results, we can see that our SDNet has two advantages over the other methods. First, the results of our method contain a wealth of brain structural textures. Only NSCT, CNN, U2Fusion and our SDNet can well preserve the texture details in the MRI image, while ASR and PCA cannot, such as the structural textures in the first and second sets of results in Fig. 5. Moreover, these textures are much finer and sharper in the results of our method. Second, our SDNet can maintain better functional information; in other words, color distortion rarely occurs in our SDNet compared with the other methods. For example, in the highlighted part of the third set of results of Fig. 5, the results of NSCT, CNN and U2Fusion appear whitened, which is inconsistent with the distribution of functional information in the PET image.

In infrared and visible image fusion, according to the characteristics of the fused results, the comparative methods can be divided into two categories. The results of the first category are biased towards the visible image, such as MDLatLRR, DenseFuse and U2Fusion. Specifically, although their fused results contain richer texture details, they cannot maintain the significant contrast of the infrared image. The results of the second category are more similar to the infrared image. They maintain the significant contrast well, but the texture details in them are not rich enough, so they look more like sharpened infrared images, such as GTF and FusionGAN. In comparison, our method is more like a combination of these two categories. First, our method can maintain the significant contrast and effectively highlight the target from the background, like methods of the second category. For example, only the fused results of GTF, FusionGAN and our method have significant contrast and can highlight the targets, such as the human in the first and third groups of results in Fig. 6. Second, the fused results of our method also contain rich texture structures, just like methods of the first category. For example, in the second row of Fig. 6, our SDNet maintains the texture details of the shrub and the clothes well, while GTF and FusionGAN cannot.

In general, the qualitative results of our SDNet show a certain superiority over the other comparative methods in multi-modal image fusion.

4.2.2 Quantitative Comparisons

In order to assess our method in multi-modal image fusion more comprehensively, we conduct a quantitative comparison on ten pairs of images in each fusion task. Considering the characteristics of multi-modal image fusion, four objective metrics are selected to evaluate the quality of the fused images, namely entropy (EN) (Roberts et al. 2008), mutual information of discrete cosine features (FMIdct) (Haghighat and Razian 2014), the peak signal-to-noise ratio (PSNR), and mean gradient (MG). The reasons for selecting them are as follows. In multi-modal image fusion, part of the information will inevitably be discarded, so we adopt EN to evaluate the amount of information remaining in the fused image; the larger the value of EN, the more information the fused image contains. We use FMIdct to assess the amount of features that are transferred from the source images to the fused image. It can also reflect the degree of correlation between the features in the fused image and the source images, which has a characterizing significance for data fidelity. A large FMIdct metric generally indicates that considerable feature information is transferred from the source images to the fused image. Affected by the imaging environment, visible images often contain a lot of noise. Similarly, the particularity of medical images also requires the noise level to be as low as possible. Therefore, we introduce the PSNR to evaluate the noise level in the fused results; a larger PSNR value indicates less noise relative to the useful information. In addition, we adopt the MG to assess the richness of texture structure: a large MG metric indicates that the fused image contains rich texture details. The quantitative results of multi-modal image fusion are shown in Table 1.
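For reference, minimal Python sketches of two of these metrics, EN and MG, are shown below. They follow common definitions (Shannon entropy of the 8-bit histogram and the average gradient magnitude); the exact formulations used in the evaluation are not restated in this section, so these should be read as assumptions.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy (in bits) of an 8-bit grayscale image."""
    hist, _ = np.histogram(img.astype(np.uint8), bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mean_gradient(img):
    """MG: average magnitude of the horizontal/vertical intensity differences."""
    img = img.astype(np.float64)
    gx = img[:, 1:] - img[:, :-1]   # horizontal differences
    gy = img[1:, :] - img[:-1, :]   # vertical differences
    return float(np.mean(np.sqrt((gx[:-1, :] ** 2 + gy[:, :-1] ** 2) / 2.0)))
```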
For medical image fusion, it can be seen that our method achieves the largest average values on three metrics, EN, FMIdct and MG. On the PSNR metric, our method is second only to U2Fusion. From these results, we can conclude that the results of our SDNet contain the most information and can obtain the most features from the source images. Moreover, the results of our method contain the richest texture details, which proves that our SDNet can preserve sufficient structural information of the MRI images. It is worth noting that the level of noise in our results is low, which means the proposed SDNet is rigorous in fusing information.

In the infrared and visible image fusion, our SDNet achieves the largest average values on FMIdct, PSNR and MG. Therefore, the results of our SDNet contain a large amount of feature information transferred from the source images, have the largest signal-to-noise ratio, and contain the richest texture details. Interestingly, our method does not achieve as good an EN value as in medical image fusion, and this is caused by the characteristics of infrared and visible image fusion. Specifically, visible images usually contain a large amount of noise, and the noise suppression of our method reduces the entropy of the fused results to a certain extent.

Overall, the proposed SDNet performs better in quantitative comparisons than the other comparative methods in multi-modal image fusion, which is consistent with the qualitative comparison.

4.3 Results on Digital Photography Image Fusion

Our SDNet also performs well on digital photography image fusion. To verify this point, we conducted comparative experiments on two typical digital photography image fusion tasks: multi-exposure image fusion and multi-focus image fusion. For multi-exposure image fusion, we select five state-of-the-art methods to compare with our SDNet, which are AWPIGG (Lee et al. 2018), DSIFT (Hayat and Imran 2019), GFF (Li et al. 2013), DeepFuse (Prabhakar et al. 2017) and U2Fusion (Xu et al. 2020). For multi-focus image fusion, five state-of-the-art methods are also selected to compare with our method. These methods are CNN (Liu et al. 2017), DSIFT (Liu et al. 2015), GD (Paul et al. 2016), SESF (Ma et al. 2020) and U2Fusion (Xu et al. 2020).

4.3.1 Qualitative Comparisons

In each digital photography image fusion task, we give three typical intuitive results to qualitatively compare our SDNet with the other methods, as shown in Figs. 7 and 8. The detailed qualitative analysis is as follows.

In the multi-exposure image fusion task, we can find that the proposed SDNet has two major advantages. Firstly, our method can avoid strange black shadows and unnatural illumination transitions. Concretely, the fused results of AWPIGG, DSIFT and GFF show these strange shadows, such as the sky in the second set of results, while DeepFuse, U2Fusion and the proposed SDNet do not. Secondly, in areas of extreme overexposure and extreme underexposure (where texture details are only visible in a single source image), our SDNet can better preserve these details and their shapes. For example, in the third group of results in Fig. 7, the lamps are blurred and their edges are even invisible in the results of the methods other than DeepFuse, U2Fusion and our SDNet. Moreover, compared to DeepFuse and U2Fusion, our SDNet can preserve these texture details more finely, such as the roof in the first set of results and the tree branches in the second group of results. Note that our results have insufficient local contrast in some scenes. Even so, the visual effect of our method is still better than that of the other methods.

In the multi-focus image fusion task, the five comparative methods can be divided into two categories according to their principles. Methods of the first category generate a decision map based on focus detection to fuse multi-focus images, such as CNN, SESF and DSIFT. Such methods often lose detail due to misjudgment near the junction of focused and non-focused regions. The other category fuses the multi-focus images from a global perspective instead of focused-area detection at the pixel scale. The disadvantages of this kind of methods are that intensity distortion and blur effects will appear in the fused results, such as GD and U2Fusion. From the results, we see that our SDNet has clear advantages over these two categories of methods. First of all, compared with the decision map-based methods, our method can accurately retain details near the junction of focused and non-focused regions. For example, in the first set of results in Fig. 8, these methods lose the golf ball. Secondly, our method can also maintain the same intensity distribution as the source images. It can be clearly seen that the fused results of GD have intensity distortion compared to the source images, and the results of U2Fusion have a certain degree of detail blur, while our method does not. Note that there are some visible halos in the results of our method and the other methods at the boundary between the focused and non-focused regions. This is caused by the outline of the foreground target in the source image spreading to the background area due to defocus.

Overall, in the digital photography image fusion scenario, our SDNet has better performance in terms of intuitive effect.
Table 1 Quantitative results of multi-modal image fusion (mean ± standard deviation over the ten test pairs; in the original, bold indicates the best and bold-italic the second best)

Medical (PET–MRI) fusion:
  Metric     PCA             ASR             NSCT            CNN             U2Fusion        Ours
  EN↑        4.969 ± 0.283   4.779 ± 0.284   5.312 ± 0.239   5.254 ± 0.242   5.325 ± 0.293   5.561 ± 0.290
  FMIdct↑    0.410 ± 0.009   0.344 ± 0.010   0.405 ± 0.007   0.161 ± 0.016   0.302 ± 0.004   0.433 ± 0.005
  PSNR↑      64.092 ± 0.517  63.937 ± 0.508  63.187 ± 0.586  63.789 ± 0.654  65.013 ± 0.433  64.550 ± 0.377
  MG↑        0.024 ± 0.002   0.038 ± 0.003   0.043 ± 0.003   0.040 ± 0.004   0.036 ± 0.003   0.044 ± 0.004

Infrared–visible fusion:
  Metric     GTF             MDLatLRR        DenseFuse       FusionGAN       U2Fusion        Ours
  EN↑        6.982 ± 0.259   6.518 ± 0.439   6.848 ± 0.302   6.608 ± 0.331   7.012 ± 0.273   6.780 ± 0.325
  FMIdct↑    0.422 ± 0.036   0.409 ± 0.035   0.417 ± 0.022   0.370 ± 0.033   0.340 ± 0.016   0.423 ± 0.021
  PSNR↑      63.789 ± 2.006  62.976 ± 2.090  62.420 ± 1.625  61.799 ± 1.455  63.068 ± 1.940  64.292 ± 1.822
  MG↑        0.022 ± 0.014   0.016 ± 0.008   0.020 ± 0.012   0.014 ± 0.007   0.026 ± 0.014   0.027 ± 0.013

Metric references: EN (Roberts et al. 2008), FMIdct (Haghighat and Razian 2014). Method references: PCA (Naidu and Raol 2008), ASR (Liu and Wang 2014), NSCT (Zhu et al. 2019), CNN (Liu et al. 2017), U2Fusion (Xu et al. 2020), GTF (Ma et al. 2016), MDLatLRR (Li et al. 2020), DenseFuse (Li and Wu 2018), FusionGAN (Ma et al. 2019).
Fig. 7 Qualitative results of multi-exposure image fusion. From left to right: underexposed image, overexposed image, fused results of AWPIGG (Lee et al. 2018), DSIFT (Hayat and Imran 2019), GFF (Li et al. 2013), DeepFuse (Prabhakar et al. 2017), U2Fusion (Xu et al. 2020) and our SDNet

Fig. 8 Qualitative results of multi-focus image fusion. From left to right: the near-focused image, the far-focused image, fused results of CNN (Liu et al. 2017), DSIFT (Liu et al. 2015), GD (Paul et al. 2016), SESF (Ma et al. 2020), U2Fusion (Xu et al. 2020) and our SDNet

4.3.2 Quantitative Comparisons

We also perform quantitative comparisons to further demonstrate the performance of our SDNet in the digital photography image fusion scenario. Considering the characteristics of digital photography image fusion, four objective metrics are selected to evaluate the fused results, which are Q^{AB/F}, mutual information of discrete cosine features (FMIdct) (Haghighat and Razian 2014), the peak signal-to-noise ratio (PSNR) and N^{AB/F} (Kumar 2013). PSNR and FMIdct have been described previously, and the rest are explained below. Different from multi-modal image fusion, the scene information reflected by the source images in digital photography image fusion is strictly complementary according to regions, so the preservation of the scene information is as important as its fidelity. Therefore, we adopt Q^{AB/F} rather than MG to assess the amount of edge structure that is transferred from the source images to the fused image. A large Q^{AB/F} means that the scene content contained in the fused image is richer and more comprehensive. In addition, in multi-exposure image fusion the fused result often contains unnatural artifacts, while in multi-focus image fusion the fused result often suffers from overall contrast distortion. We introduce N^{AB/F} to evaluate artifacts and contrast distortion in the fused image. A smaller N^{AB/F} value indicates fewer artifacts and less contrast distortion. The quantitative results of digital photography image fusion are shown in Table 2.

In multi-exposure image fusion, our SDNet achieves the largest average values on the first three metrics, Q^{AB/F}, FMIdct and PSNR. For N^{AB/F}, our method is second only to DeepFuse by a slight difference. These results indicate that the fused images of our method contain the richest and most comprehensive scene content, contain a large amount of feature information transferred from the source images, and involve the lowest level of noise. In addition, like DeepFuse, our method can also avoid unnatural artifacts, which can also be seen from the qualitative results in Fig. 7.
International Journal of Computer Vision

In multi-focus image fusion, as can be seen from the statistical results, our SDNet can also achieve the largest average values on Q^AB/F, FMI_dct and PSNR. These results indicate that the fused results of our method contain the most scene content, obtain the most feature information from source images, and have the largest signal-to-noise ratio. Moreover, our SDNet ranks second on N^AB/F, next to DSIFT.

Generally, in the digital photography image fusion, our method performs better than the comparative methods in quantitative comparison.

4.4 Comparisons of Efficiency

In our method, the number of parameters used for testing is 0.287 M, which is very lightweight. In order to evaluate it more comprehensively, we carry out comparative experiments on the efficiency of multi-modal image fusion and digital photography image fusion. The number of image pairs for testing in each of the four representative tasks is 10.

The results are shown in Table 3. It can be seen that our method achieves the highest running efficiency in all four image fusion tasks: medical image fusion, infrared and visible image fusion, multi-exposure image fusion, and multi-focus image fusion. In general, our SDNet has a significant superiority in running time, almost an order of magnitude faster than the comparative methods, and can therefore deal with real-time fusion tasks.

4.5 Ablation Experiments

To verify the effectiveness of the specific designs in this paper, we perform relevant ablation experiments. First, we reveal the role of the two proposed key modules, including the adaptive decision block and the decomposition network. Then, we separately evaluate the fusion performance when the intensity loss, the gradient loss, and the squeeze fusion loss are removed.

4.5.1 Adaptive Decision Block Analysis

In this work, the adaptive decision block acts on the gradient loss term, which can adaptively guide the gradient distribution of the fused image to approximate the pixel with the stronger gradient. To our knowledge, it is the first time that the pixel-scale guidance strategy is adopted for addressing the image fusion task. More importantly, it fits well with both multi-modal image fusion and digital photography image fusion. We analyze the feasibility of the adaptive decision block in four image fusion tasks as follows. In medical image fusion, because the difference in the intensity of PET and MRI images is very large, the phenomenon of imbalance between structural textures and functional information often occurs. More specifically, the structural textures from MRI images are often submerged in the functional information from PET images, causing texture details to be weakened or lost. The adaptive decision block can decide the target of gradient optimization at the pixel scale, which can make the network retain the functional information while preserving the structural texture as significantly as possible. In infrared and visible image fusion, visible images often contain a lot of noise due to environmental factors such as weather. Reducing the influence of noise on the fusion process is the key to improving the quality of the infrared and visible image fusion. The adaptive decision block can enhance the robustness of our method against noise, because there is a Gaussian low-pass filtering operation in the decision block, which can reduce the noise interference to a certain extent. In multi-exposure image fusion, the exposure degree of the source image is uncertain. In other words, it is difficult to determine the exposure level of the source image and its distance from the ideal illumination. In this case, the adaptive decision block can measure the appropriateness of exposure based on the gradient magnitude at the pixel scale, because both overexposure and underexposure can cause texture detail loss. Therefore, the adaptive decision block can force the fused results to have more appropriate lighting while preserving more comprehensive texture details. In multi-focus image fusion, the adaptive decision block can also make the final fused result more promising. The adaptive decision block can select the maximum gradient at the corresponding pixel position as the optimization target to force the fused result to contain richer texture details, which avoids the fusion result being an intermediate result between clear and fuzzy. In order to fully verify its role, we train our network without it. Specifically, we adopt the proportioned setting strategy for the gradient loss as in the intensity loss, weighting from the global perspective of image patches.

The qualitative results are shown in Fig. 9. In medical image fusion, it can be seen that when using the adaptive decision block, the structural textures in the fused result are maintained very well. At the same time, the functional information is not worse than in the fused result without the adaptive decision block. In infrared and visible image fusion, it is obvious that the visible image contains a lot of noise. When there is no adaptive decision block, the gradient constraint is severely affected by noise and the detail in the window of the car is weak, as highlighted. In contrast, this detail is well preserved when using the adaptive decision block. For multi-exposure image fusion, the fused result with the adaptive decision block has more suitable lighting, and can retain more scene details, which are poorly preserved without it. In multi-focus image fusion, when there is no adaptive decision block, the texture detail of the fused result is an intermediate result between sharpness and blurring, which is not as clear as the result using the adaptive decision block. Therefore, the adaptive decision block is reasonable in all four fusion tasks, which can help our method produce more promising visualization results.
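For illustration, the following is a minimal PyTorch-style sketch of the pixel-scale decision described above. It is not the released SDNet implementation; the Sobel operator, the 5 x 5 Gaussian kernel, the L1 distance and all function names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img):
    """Per-pixel gradient magnitude via Sobel filtering (img: B x 1 x H x W, float)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def gaussian_blur(img, sigma=1.0, ksize=5):
    """Gaussian low-pass filtering used to suppress noise before the decision."""
    ax = torch.arange(ksize, dtype=torch.float32, device=img.device) - (ksize - 1) / 2
    g1d = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    kernel = g1d[:, None] * g1d[None, :]
    kernel = (kernel / kernel.sum()).view(1, 1, ksize, ksize)
    return F.conv2d(img, kernel, padding=ksize // 2)

def adaptive_gradient_loss(fused, src_a, src_b):
    """Pixel-scale adaptive decision: the optimization target at each pixel is the
    gradient of whichever (smoothed) source has the stronger gradient there."""
    ga = sobel_gradient(gaussian_blur(src_a))
    gb = sobel_gradient(gaussian_blur(src_b))
    target = torch.maximum(ga, gb)  # per-pixel decision
    return F.l1_loss(sobel_gradient(fused), target)
```

The ablation setting above (proportioned, patch-level weighting) would roughly correspond to replacing the per-pixel torch.maximum with fixed scalar weights on ga and gb.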

Table 2 Quantitative results of digital photography fusion

Multi-Exposure | AWPIGG (Lee et al. 2018) | DSIFT (Hayat and Imran 2019) | GFF (Li et al. 2013) | DeepFuse (Prabhakar et al. 2017) | U2Fusion (Xu et al. 2020) | Ours
Q^AB/F ↑ | 0.652 ± 0.064 | 0.658 ± 0.066 | 0.662 ± 0.044 | 0.640 ± 0.017 | 0.607 ± 0.061 | 0.677 ± 0.027
FMI_dct (Haghighat and Razian 2014) ↑ | 0.390 ± 0.030 | 0.400 ± 0.028 | 0.430 ± 0.030 | 0.451 ± 0.049 | 0.375 ± 0.057 | 0.462 ± 0.048
PSNR ↑ | 57.053 ± 0.691 | 57.015 ± 0.865 | 56.237 ± 0.768 | 57.661 ± 0.717 | 57.720 ± 0.657 | 57.841 ± 0.659
N^AB/F (Kumar 2013) ↓ | 0.154 ± 0.022 | 0.170 ± 0.016 | 0.189 ± 0.041 | 0.085 ± 0.018 | 0.241 ± 0.024 | 0.095 ± 0.023

Multi-Focus | CNN (Liu et al. 2017) | DSIFT (Liu et al. 2015) | GD (Paul et al. 2016) | SESF (Ma et al. 2020) | U2Fusion (Xu et al. 2020) | Ours
Q^AB/F ↑ | 0.618 ± 0.080 | 0.497 ± 0.060 | 0.683 ± 0.049 | 0.611 ± 0.085 | 0.670 ± 0.034 | 0.691 ± 0.016
FMI_dct (Haghighat and Razian 2014) ↑ | 0.389 ± 0.017 | 0.405 ± 0.022 | 0.339 ± 0.044 | 0.383 ± 0.034 | 0.325 ± 0.023 | 0.410 ± 0.034
PSNR ↑ | 73.827 ± 2.163 | 73.809 ± 2.173 | 69.325 ± 1.810 | 73.772 ± 2.153 | 72.488 ± 1.294 | 74.376 ± 1.782
N^AB/F (Kumar 2013) ↓ | 0.073 ± 0.052 | 0.005 ± 0.005 | 0.172 ± 0.039 | 0.079 ± 0.055 | 0.156 ± 0.041 | 0.048 ± 0.018

Bold indicates the best, and bold italic indicates the second best

Table 3 Average running time of different methods in multi-modal image fusion and digital photography image fusion (unit: second)

Multi-Modal
Medical | ASR (Liu and Wang 2014) | PCA (Naidu and Raol 2008) | NSCT (Zhu et al. 2019) | CNN (Liu et al. 2017) | U2Fusion (Xu et al. 2020) | Ours
(Size 256 × 256) | 32.961 ± 1.195 | 0.007 ± 0.001 | 7.467 ± 0.377 | 16.504 ± 0.562 | 0.206 ± 0.035 | 0.004 ± 0.001
Infrared-Visible | GTF (Ma et al. 2016) | MDLatLRR (Li et al. 2020) | DenseFuse (Li and Wu 2018) | FusionGAN (Ma et al. 2019) | U2Fusion (Xu et al. 2020) | Ours
(Size 576 × 768) | 3.830 ± 2.377 | 26.235 ± 12.761 | 0.321 ± 0.122 | 0.170 ± 0.129 | 0.636 ± 0.536 | 0.007 ± 0.003

Digital Photography
Multi-Exposure | AWPIGG (Lee et al. 2018) | DSIFT (Hayat and Imran 2019) | GFF (Li et al. 2013) | DeepFuse (Prabhakar et al. 2017) | U2Fusion (Xu et al. 2020) | Ours
(Size 600 × 1000) | 0.969 ± 0.084 | 2.576 ± 0.253 | 0.799 ± 0.044 | 0.227 ± 0.033 | 1.579 ± 1.513 | 0.017 ± 0.001
Multi-Focus | CNN (Liu et al. 2017) | DSIFT (Liu et al. 2015) | GD (Paul et al. 2016) | SESF (Ma et al. 2020) | U2Fusion (Xu et al. 2020) | Ours
(Size 520 × 520) | 246.064 ± 307.537 | 17.352 ± 31.399 | 1.429 ± 1.200 | 0.686 ± 0.678 | 1.794 ± 3.165 | 0.022 ± 0.022

Bold indicates the best, and bold italic indicates the second best


Fig. 9 Qualitative results of ablation on the adaptive decision block. From left to right: source image pairs, results without the adaptive decision block and ours. ADB refers to the adaptive decision block. Rows: MRI-PET, visible-infrared, overexposed-underexposed, and near-/far-focused source pairs

Fig. 10 Qualitative results of ablation on the decomposition network. From left to right: source image pairs, fused results without the decomposition network and ours. Rows: MRI-PET, visible-infrared, overexposed-underexposed, and near-/far-focused source pairs
The quantitative results are reported in Table 4. It can be seen that the results with the adaptive decision block are better than those without the adaptive decision block on 11/16 metrics, which demonstrates the effectiveness of the adaptive decision block.

4.5.2 Decomposition Network Analysis

The role of the decomposition network is to decompose the fused image to generate results that approximate source images. Because the quality of the decomposed results directly depends on the fused image, the decomposition consistency constraint can force the fused image to contain more scene details. To validate the effectiveness of the decomposition network, we train our model without the decomposition network. The difference of the results is shown in Fig. 10. In medical image fusion, the result without the decomposition network loses some structural textures of the brains, while that with the decomposition network can retain them well. In infrared and visible image fusion, the decomposition network makes the smoke thinner, thereby more clearly highlighting the soldier. In the multi-exposure image fusion, the result with the decomposition network can more clearly preserve the texture structures in the underexposed region, while in the result without the decomposition network these texture structures are weak. Similarly, in multi-focus image fusion, the decomposition network can make the fused result contain richer texture details.

The quantitative results are reported in Table 4. The results with the decomposition network are better than those without it on 12/16 metrics. These results show that the decomposition network can indeed improve the fusion performance.

4.5.3 Intensity Loss Analysis

The intensity loss guides the fused image to preserve useful information that is characterized by the pixel intensity, such as the contrast. Meanwhile, it can make the overall scene style of the fused image more natural and not divorced from reality. In order to evaluate the effectiveness of the intensity loss, we conduct the ablation experiment on it, and the results are shown in Fig. 11. It can be seen that in most image fusion tasks, the results without the intensity loss have problems such as information loss and unrealistic style. For example, the result of infrared and visible image fusion loses the saliency of the thermal radiation target, and even has an intensity reversal. And the results of several other image fusion tasks have unrealistic styles. These results show that the intensity loss is very important. As the results without the intensity loss are far from expected, we do not perform quantitative experiments.
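To make the proportioned setting concrete, a minimal sketch of the intensity term is given below; the mean-squared form, the default weight values and the function name are illustrative assumptions rather than the exact formulation of SDNet.

```python
import torch.nn.functional as F

def intensity_loss(fused, src_a, src_b, w_a=0.5, w_b=0.5):
    """Proportioned intensity term: w_a and w_b set how strongly the fused image
    should follow the intensity distribution of each source. A larger weight on
    the infrared or PET source would emphasize salient contrast, while equal
    weights suit multi-exposure and multi-focus fusion (weights are examples)."""
    return w_a * F.mse_loss(fused, src_a) + w_b * F.mse_loss(fused, src_b)
```

Removing this term altogether corresponds to the "w/o intensity loss" setting compared in Fig. 11.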


Table 4 Quantitative results of ablation experiments. ADB indicates the adaptive decision block, and DN is the decomposition network. Bold indicates the best result

Multi-Modal
Medical | w/o ADB | w/o DN | w/o Gradient Loss | Ours
EN (Roberts et al. 2008) ↑ | 5.646 ± 0.264 | 5.686 ± 0.259 | 5.530 ± 0.281 | 5.561 ± 0.290
FMI_dct (Haghighat and Razian 2014) ↑ | 0.258 ± 0.003 | 0.270 ± 0.002 | 0.029 ± 0.002 | 0.433 ± 0.005
PSNR ↑ | 64.163 ± 0.338 | 64.435 ± 0.413 | 64.302 ± 0.341 | 64.550 ± 0.377
MG ↑ | 0.032 ± 0.002 | 0.038 ± 0.003 | 0.029 ± 0.002 | 0.044 ± 0.004

Infrared-Visible | w/o ADB | w/o DN | w/o Gradient Loss | Ours
EN (Roberts et al. 2008) ↑ | 6.725 ± 0.339 | 6.688 ± 0.289 | 6.713 ± 0.438 | 6.780 ± 0.325
FMI_dct (Haghighat and Razian 2014) ↑ | 0.440 ± 0.028 | 0.411 ± 0.041 | 0.248 ± 0.025 | 0.423 ± 0.026
PSNR ↑ | 63.917 ± 2.065 | 63.694 ± 2.087 | 62.623 ± 1.779 | 64.292 ± 1.822
MG ↑ | 0.027 ± 0.015 | 0.026 ± 0.013 | 0.013 ± 0.002 | 0.027 ± 0.013

Digital Photography
Multi-Exposure | w/o ADB | w/o DN | w/o Gradient Loss | Ours
Q^AB/F ↑ | 0.660 ± 0.022 | 0.629 ± 0.051 | 0.370 ± 0.054 | 0.677 ± 0.027
FMI_dct (Haghighat and Razian 2014) ↑ | 0.486 ± 0.051 | 0.487 ± 0.050 | 0.258 ± 0.033 | 0.462 ± 0.048
PSNR ↑ | 57.676 ± 0.051 | 57.792 ± 0.645 | 57.534 ± 0.759 | 57.841 ± 0.659
N^AB/F (Kumar 2013) ↓ | 0.047 ± 0.026 | 0.228 ± 0.673 | 0.119 ± 0.025 | 0.095 ± 0.022

Multi-Focus | w/o ADB | w/o DN | w/o Gradient Loss | Ours
Q^AB/F ↑ | 0.637 ± 0.052 | 0.694 ± 0.024 | 0.599 ± 0.045 | 0.691 ± 0.017
FMI_dct (Haghighat and Razian 2014) ↑ | 0.403 ± 0.040 | 0.407 ± 0.036 | 0.260 ± 0.023 | 0.410 ± 0.035
PSNR ↑ | 73.770 ± 1.661 | 74.193 ± 1.729 | 74.090 ± 1.665 | 74.376 ± 1.879
N^AB/F (Kumar 2013) ↓ | 0.009 ± 0.007 | 0.030 ± 0.013 | 0.042 ± 0.017 | 0.048 ± 0.019

Fig. 11 Results of ablation on the intensity loss. From left to right: source image pairs, fused results without the intensity loss and ours

Fig. 12 Results of ablation on the gradient loss. From left to right: source image pairs, fused results without the gradient loss and ours

Fig. 13 Results of ablation on the squeeze fusion loss. From left to right: source image pairs, fused results without the squeeze fusion loss and ours

4.5.4 Gradient Loss Analysis

In our model, the gradient loss forces the fused image to contain rich texture detail. To verify the effect of the gradient loss, we provide the qualitative results of the ablation experiment on the gradient loss, as shown in Fig. 12. It can be seen that in all four image fusion tasks, the results without the gradient loss suffer from texture loss and sharpness degradation. On the contrary, the results with the gradient loss can maintain the original sharpness well and contain rich texture details. Further, we report the quantitative results of the ablation experiment on the gradient loss in Table 4. The results with the gradient loss are better than those without the gradient loss on 15/16 metrics, which strongly proves that the gradient loss is very important to the fusion performance.

4.5.5 Squeeze Fusion Loss Analysis

Further, we completely remove the squeeze fusion loss. In other words, we not only remove the intensity loss but also the gradient loss, so that both the squeeze network and the decomposition network are optimized only under the guidance of the decomposition consistency loss. The results are shown in Fig. 13. After completely removing the squeeze fusion loss, although the fused results can contain a certain degree of scene content, the overall style is far from that of source images, which deviates from the reality. Because of this, there is no need to implement quantitative verification. In general, these results demonstrate that the squeeze fusion loss is very important to the fusion rationality.
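The ablation settings above can be summarized by the following sketch of the overall objective. It reuses the intensity_loss and adaptive_gradient_loss helpers sketched earlier, and the trade-off weights alpha, beta and gamma as well as the L1 form of the consistency term are illustrative assumptions, not the values used in the paper.

```python
import torch.nn.functional as F

def total_loss(fused, dec_a, dec_b, src_a, src_b,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Overall objective as suggested by the ablations: a squeeze fusion loss on
    the fused image (intensity + adaptive gradient terms) plus a decomposition
    consistency term that asks the decomposed outputs to match the sources."""
    squeeze_fusion = alpha * intensity_loss(fused, src_a, src_b) \
                   + beta * adaptive_gradient_loss(fused, src_a, src_b)
    decomposition_consistency = gamma * (F.l1_loss(dec_a, src_a)
                                         + F.l1_loss(dec_b, src_b))
    # "w/o squeeze fusion loss" keeps only decomposition_consistency;
    # "w/o intensity loss" / "w/o gradient loss" drop the corresponding term.
    return squeeze_fusion + decomposition_consistency
```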


Fig. 14 Fusion and decomposition of images with the same scene. Left: decomposition of multi-modal image fusion; right: decomposition of digital photographic image fusion

Fig. 15 Fusion and decomposition of different scenes. Left: decomposition of multi-modal image fusion; right: decomposition of digital photographic image fusion

4.6 Visualization of Decomposition

4.6.1 Fusion and Decomposition of the Same Scene

When source images describe the same scene, the decomposition network is dedicated to decomposing the fused image produced by the squeeze network into results similar to source images, so as to force the fused image to contain more scene details. In order to show this process intuitively, for multi-modal image fusion and digital photographic image fusion we select one representative task each for visualization, namely medical image fusion and multi-exposure image fusion. The visualization results are in Fig. 14. It can be seen that no matter in multi-modal image fusion or digital photography image fusion, good decomposed results can be obtained, which are very similar to source images. Simultaneously, the fused results can provide a good visual perception, which verifies the feasibility of our design.

4.6.2 Fusion and Decomposition of Different Scenes

An interesting question is: if the scenes represented by source images are different, what will the results of the squeeze network and decomposition network look like? In order to observe this, we implement the fusion and decomposition of source images with different scenes, and the results are shown in Fig. 15. It can be seen that although the decomposition network does not completely separate the different scene content, the decomposed results are still dominated by the scene content represented by the corresponding source image. In addition, the fused results are able to integrate the content of different scenes well.

4.6.3 Decomposition of the Real Image

In the proposed method, the squeeze network and the decomposition network are collaboratively optimized. In other words, what the decomposition network disintegrates is actually a synthesized fused image, which is not a real image. To verify the performance of the decomposition network, we adopt the trained decomposition network to decompose the real images in the COCO dataset (Lin et al. 2014), and the results are shown in Fig. 16. It can be seen that the decomposition network can well decompose the well-exposed real images into underexposed and overexposed images. And the decomposed results are natural, in line with human visual experience.

4.7 Different Exposure Levels Fusion

The proposed model has good adaptability to test data with different distributions. In order to verify this point, we test the trained model on source images with different exposure levels, and the results are shown in Fig. 17. It can be seen that the exposure levels of these three pairs of source images are significantly different, which means that their distributions are different. However, our SDNet can generate good fused results in all three tests, which contain rich scene details from the source images. This demonstrates that our SDNet has good adaptability.


Fig. 16 Decomposition of the real image. The decomposition network is able to decompose the well-exposed real image into overexposed and underexposed images

Fig. 17 Fusion of different exposure levels. The proposed method can generate good fused results in all three tests

Fig. 18 Results of the infrared and RGB visible image fusion

Fig. 19 Sequence image fusion. From left to right: source image 1, source image 2, source image 3 and the result of our SDNet

4.8 Infrared and RGB Visible Image Fusion

The proposed SDNet is also applicable to the infrared and RGB image fusion. We provide two typical results to demonstrate this, as shown in Fig. 18. It can be seen that the fused results not only accurately maintain significant contrast, but also contain rich texture details. Moreover, the colors in RGB visible images are also well transferred to the fused images, which have a good visual effect.

4.9 Sequence Image Fusion

In the digital photographic image fusion scenario, the number of source images may exceed two. In this case, our method is also applicable. To confirm this point, we implement our method on a sequence with three multi-focus source images, and a sequence with three multi-exposure source images, respectively. Like in Nejati et al. (2015), we first fuse two of the source images as before, and then fuse this intermediate result with the last source image to obtain the final fused image (a minimal sketch of this cascaded strategy is given at the end of Sect. 4.10). The results are shown in Fig. 19. In multi-focus image fusion, it can be seen that the fused result of our method contains all the clear regions in the three source images, which is a full-clear image with good visual effects. Similarly, in multi-exposure image fusion, our result has proper lighting and can retain almost all scene content. All these prove that our SDNet is suitable for the fusion of image sequences.

4.10 Composite Fusion Scene

Because the proposed method has the same idea for different tasks, it can handle some composite image fusion scenes well. For example, some visible images are overexposed, and some details are blurred or even invisible, which is a new challenge for infrared and visible image fusion. This is more like a mixture of two fusion tasks, namely multi-exposure image fusion and infrared and visible image fusion. We provide a typical example to illustrate this intuitively, as shown in Fig. 20. Among them, GTF (Ma et al. 2016) and FusionGAN (Ma et al. 2019) are two algorithms designed specifically for the infrared and visible image fusion. It can be seen that GTF and FusionGAN lose the vehicles in the overexposed region, while our method can preserve them well. Therefore, we believe it is desirable to design a universal model for different image fusion tasks, which may bring new inspirations to the image fusion community.
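As referenced in Sect. 4.9, fusing more than two source images reduces to repeated pairwise fusion. The short sketch below illustrates this cascaded strategy; fuse_pair and sdnet_fuse are placeholders for any two-image fusion routine (e.g. a forward pass of the trained squeeze network), not functions from the released code.

```python
from functools import reduce

def fuse_sequence(images, fuse_pair):
    """Cascaded fusion for N >= 2 source images: fuse the first two, then fuse
    the intermediate result with the next source image, and so on."""
    if len(images) < 2:
        raise ValueError("need at least two source images")
    return reduce(fuse_pair, images)

# Example (hypothetical): fused = fuse_sequence([img1, img2, img3], fuse_pair=sdnet_fuse)
```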


4.11 SDNet vs. PMGI

The previous version of the proposed SDNet is PMGI (Zhang et al. 2020), and the improvements have two main aspects. Firstly, we design a new adaptive decision block and introduce it into the construction of the gradient loss. In PMGI, the weight of the gradient loss term is proportionally set according to the global texture richness of source images. For example, visible images can provide more texture details, so a large weight is set for the gradient loss term of the visible image and a small weight is set for the gradient loss term of the infrared image. And once these global weights are set, they are fixed throughout the training process. The direct negative effect brought by this is texture structure loss and sharpness reduction. Instead, we use the adaptive decision block to guide the optimization of the gradient loss in SDNet, which can adaptively select the gradient of sharp source pixels as the optimization target at the pixel scale, making the fused result contain richer texture structures and higher sharpness. Secondly, unlike PMGI that only considers the squeezing process of image fusion, the SDNet proposed in this paper considers both squeeze and decomposition. Concretely, the proposed SDNet is composed of two parts: the squeeze network and the decomposition network. The squeeze network is dedicated to squeezing source images into a single fused image, while the decomposition network is devoted to decomposing this fused result to obtain images consistent with source images. The decomposition consistency can force the fused result to contain richer scene details, and thus have a better fusion effect.

Compared to PMGI, SDNet performs better on both multi-modal image fusion and digital photography image fusion. To visually show the difference, we compare their results, as shown in Fig. 21. Firstly, in multi-modal image fusion, our SDNet can better preserve the texture structure, and has the expected contrast distribution. For instance, in the infrared and visible image fusion, SDNet retains the clear roof texture in the first column, while PMGI loses it. In the second column of results, the contour of the target in the result of PMGI has artifacts, while the one in SDNet is clean and sharp. Similarly, in the medical image fusion task, SDNet can better preserve the distribution of brain structures in MRI images. In addition, the functional information is more similar to that in PET. Secondly, in digital photographic image fusion, our method can retain almost all scene content, so as to have better visual performance.

Fig. 20 Results of the composite fusion scene

Fig. 21 Comparative results of our SDNet with the previous PMGI


Table 5 Efficiency comparisons of our SDNet with the previous version PMGI (Zhang et al. 2020). Bold indicates the best result

Method | PMGI | SDNet
Medical | 0.0148 ± 0.0008 | 0.004 ± 0.001
Infrared-Visible | 0.0831 ± 0.0523 | 0.007 ± 0.003
Multi-Exposure | 0.0978 ± 0.0423 | 0.017 ± 0.003
Multi-Focus | 0.0691 ± 0.0474 | 0.022 ± 0.002

Fig. 22 Feature matching visualization on different modalities. Blue indicates the true matches, and red represents the false matches

For example, in the multi-exposure image fusion task, the results of SDNet have more reasonable illumination and contain more comprehensive texture details in the scenes. On the contrary, they are lost in the results of PMGI. The most obvious difference is in multi-focus image fusion. PMGI produces results between clarity and blur. In contrast, SDNet can produce quite promising full-clear results. Overall, SDNet performs better than PMGI on both multi-modal and digital photography image fusion.

In addition, we optimize the network structure so that it has higher operating efficiency. We provide the average running time of PMGI and SDNet in Table 5 for comparison. It can be seen that in all image fusion tasks, SDNet is faster than PMGI. Therefore, we can conclude that the newly designed network structure in SDNet has higher efficiency without affecting performance.

4.12 Application Verification

Taking infrared and visible image fusion as an example, we demonstrate the advantages of the proposed method in preserving scene textures and salient targets through downstream applications, including the feature matching between multiple views and the pedestrian semantic segmentation.

4.12.1 Feature Matching between Multiple Views

Since feature matching performance is closely related to scene textures, it can verify the retention ability of the fusion method with respect to scene textures. Specifically, we select infrared and visible image pairs from two views, and test the matching performance on the visible modality, the infrared modality and different fused results. The open source VLFeat toolbox (Vedaldi and Fulkerson 2010) is employed to determine the putative correspondences with SIFT (Lowe 2004). Note that the images used are captured from the FLIR video sequence, and the scenes usually involve moving objects such as pedestrians and cars in our matching pairs. Thus, it is extremely challenging to identify the inlier correspondences by recovering the 3D scene geometry in the dataset, and we establish ground truth by manually labeling each correspondence. The results are shown in Fig. 22 and Table 6. It can be seen that the matching performance on our fused result is the best, which demonstrates that the proposed method can best maintain the texture structure in the scene.

4.12.2 Pedestrian Semantic Segmentation

In the infrared image, the pedestrians as thermal targets are salient, and it is suitable to implement semantic segmentation on this modality. Therefore, we first train a model for pedestrian semantic segmentation on the infrared image. Then, the segmentation model is directly tested on the results of different fusion methods, and the segmentation accuracy can indicate the quality of each fusion method in preserving the salient target. The results are shown in Fig. 23 and Table 7. It can be seen that the segmentation performance of our results is the best, which can accurately detect pedestrians, even small-scale ones. These results prove that the proposed method can best preserve the salient thermal targets in the infrared image. It is worth noting that although the segmentation performance of our results is comparable to that of the infrared image, our results that integrate the visible scene information will probably be more conducive to higher-level decision-making tasks, such as pedestrian identification.
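For reference, the IOU score reported in Table 7 can be computed as sketched below; the binary-mask format of the ground truth is an assumption, since the exact evaluation protocol is not detailed here.

```python
import numpy as np

def binary_iou(pred_mask, gt_mask):
    """Intersection over union between a predicted and a ground-truth
    pedestrian mask (arrays of identical shape, treated as boolean)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```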


Table 6 Feature matching accuracy on different modalities. Bold indicates the best result

Modal | Infrared | Visible | GTF (Ma et al. 2016) | MDLatLRR (Li et al. 2020) | DenseFuse (Li and Wu 2018) | FusionGAN (Ma et al. 2019) | U2Fusion (Xu et al. 2020) | Ours
Accuracy | 0.731 | 0.739 | 0.686 | 0.770 | 0.758 | 0.746 | 0.792 | 0.792

Table 7 Pedestrian semantic segmentation accuracy on different fused results

Modal | GTF (Ma et al. 2016) | MDLatLRR (Li et al. 2020) | DenseFuse (Li and Wu 2018) | FusionGAN (Ma et al. 2019) | U2Fusion (Xu et al. 2020) | Ours
IOU | 0.698 | 0.853 | 0.852 | 0.767 | 0.807 | 0.910

Fig. 23 Visualization results of the pedestrian semantic segmentation

Fig. 24 The proposed method cannot handle non-registered data from Ha et al. (2017), and artifacts appear in the fused results

5 Conclusion and Discussion

In this paper, a squeeze-and-decomposition network, called SDNet, is proposed to generally fulfill multi-modal image fusion and digital photography image fusion. On the one hand, we model multiple image fusion tasks as the extraction and reconstruction of gradient and intensity information, and design a universal form of loss function accordingly. For gradient information, an adaptive decision block is designed to decide the optimization target of the gradient distribution according to the texture richness on the pixel scale, which can adaptively force the fused result to contain richer texture details. For intensity information, we adopt the proportioned setting strategy to adjust the loss weight ratio, so as to satisfy the requirement of intensity distribution tendency in different image fusion tasks. On the other hand, we design a squeeze-and-decomposition network, which not only considers the squeeze process from source images to the fused result, but also strives to decompose results that approximate source images from the fused image. The decomposition consistency will force the fused image to contain more scene details. Extensive qualitative and quantitative experiments demonstrate the superiority of our SDNet over state-of-the-art methods in terms of both subjective visual effect and quantitative metrics in multiple image fusion tasks. Moreover, our method is about one order of magnitude faster compared with the state-of-the-art, which is suitable for addressing real-time fusion tasks.

In a real scene, the images captured by the sensor are all unregistered. Unfortunately, the existing methods cannot handle these real unregistered data, as shown in Fig. 24. Inevitably, these fusion methods rely on pre-processing by registration algorithms. As a result, they may have certain limitations in real scenes, such as low efficiency and dependence on registration accuracy. In the future, we will focus on the research of unregistered fusion algorithms, so as to fulfill the image registration and fusion in an implicit manner. We believe this will greatly improve the suitability of image fusion for real-world scenarios.

Declarations

Conflict of Interest The authors declare that they have no conflict of interest.

References

Ballester, C., Caselles, V., Igual, L., Verdera, J., & Rougé, B. (2006). A variational model for p+xs image fusion. International Journal of Computer Vision, 69(1), 43–58.
Cai, J., Gu, S., & Zhang, L. (2018). Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing, 27(4), 2049–2062.
Fu, X., Lin, Z., Huang, Y., & Ding, X. (2019). A variational pan-sharpening with local gradient constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10,265–10,274.


Goshtasby, A. A. (2005). Fusion of multi-exposure images. Image and Vision Computing, 23(6), 611–618.
Ha, Q., Watanabe, K., Karasawa, T., Ushiku, Y., & Harada, T. (2017). Mfnet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: Proceedings of the International Conference on Intelligent Robots and Systems, pp. 5108–5115.
Haghighat, M., & Razian, M. A. (2014). Fast-fmi: non-reference image fusion metric. In: Proceedings of the IEEE International Conference on Application of Information and Communication Technologies, pp. 1–3.
Hayat, N., & Imran, M. (2019). Ghost-free multi exposure image fusion technique using dense sift descriptor and guided filter. Journal of Visual Communication and Image Representation, 62, 295–308.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
Kong, S. G., Heo, J., Boughorbel, F., Zheng, Y., Abidi, B. R., Koschan, A., et al. (2007). Multiscale fusion of visible and thermal ir images for illumination-invariant face recognition. International Journal of Computer Vision, 71(2), 215–233.
Kumar, B. S. (2013). Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing, 7(6), 1125–1143.
Lai, S. H., & Fang, M. (1998). Adaptive medical image visualization based on hierarchical neural networks and intelligent decision fusion. In: Proceedings of the IEEE Neural Networks for Signal Processing Workshop, pp. 438–447.
Lee, S. H., Park, J. S., & Cho, N. I. (2018). A multi-exposure image fusion based on the adaptive weights reflecting the relative pixel intensity and global gradient. In: Proceedings of the IEEE International Conference on Image Processing, pp. 1737–1741.
Li, H., & Wu, X. J. (2018). Densefuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing, 28(5), 2614–2623.
Li, H., Wu, X. J., & Kittler, J. (2020). Mdlatlrr: A novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing, 29, 4733–4746.
Li, S., Kang, X., & Hu, J. (2013). Image fusion with guided filtering. IEEE Transactions on Image Processing, 22(7), 2864–2875.
Li, S., Yin, H., & Fang, L. (2012). Group-sparse representation with dictionary learning for medical image denoising and fusion. IEEE Transactions on Biomedical Engineering, 59(12), 3450–3459.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755.
Liu, Y., Chen, X., Cheng, J., & Peng, H. (2017). A medical image fusion method based on convolutional neural networks. In: Proceedings of the International Conference on Information Fusion, pp. 1–7.
Liu, Y., Chen, X., Peng, H., & Wang, Z. (2017). Multi-focus image fusion with a deep convolutional neural network. Information Fusion, 36, 191–207.
Liu, Y., Liu, S., & Wang, Z. (2015). Multi-focus image fusion with dense sift. Information Fusion, 23, 139–155.
Liu, Y., & Wang, Z. (2014). Simultaneous image fusion and denoising with adaptive sparse representation. IET Image Processing, 9(5), 347–357.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Ma, B., Zhu, Y., Yin, X., Ban, X., Huang, H., & Mukeshimana, M. (2020). Sesf-fuse: An unsupervised deep model for multi-focus image fusion. Neural Computing and Applications, pp. 1–12.
Ma, J., Chen, C., Li, C., & Huang, J. (2016). Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion, 31, 100–109.
Ma, J., Jiang, X., Fan, A., Jiang, J., & Yan, J. (2021). Image matching from handcrafted to deep features: A survey. International Journal of Computer Vision, 129(1), 23–79.
Ma, J., Liang, P., Yu, W., Chen, C., Guo, X., Wu, J., et al. (2020). Infrared and visible image fusion via detail preserving adversarial learning. Information Fusion, 54, 85–98.
Ma, J., Xu, H., Jiang, J., Mei, X., & Zhang, X. P. (2020). Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing, 29, 4980–4995.
Ma, J., Yu, W., Chen, C., Liang, P., Guo, X., & Jiang, J. (2020). Pan-gan: An unsupervised pan-sharpening method for remote sensing image fusion. Information Fusion, 62, 110–120.
Ma, J., Yu, W., Liang, P., Li, C., & Jiang, J. (2019). Fusiongan: A generative adversarial network for infrared and visible image fusion. Information Fusion, 48, 11–26.
Ma, K., Li, H., Yong, H., Wang, Z., Meng, D., & Zhang, L. (2017). Robust multi-exposure image fusion: A structural patch decomposition approach. IEEE Transactions on Image Processing, 26(5), 2519–2532.
Naidu, V., & Raol, J. R. (2008). Pixel-level image fusion using wavelets and principal component analysis. Defence Science Journal, 58(3), 338–352.
Nejati, M., Samavi, S., & Shirani, S. (2015). Multi-focus image fusion using dictionary-based sparse representation. Information Fusion, 25, 72–84.
Paul, S., Sevcenco, I. S., & Agathoklis, P. (2016). Multi-exposure and multi-focus image fusion in gradient domain. Journal of Circuits, Systems and Computers, 25(10), 1650123.
Piella, G. (2003). A general framework for multiresolution image fusion: From pixels to regions. Information Fusion, 4(4), 259–280.
Piella, G. (2009). Image fusion for enhanced visualization: A variational approach. International Journal of Computer Vision, 83(1), 1–11.
Prabhakar, K. R., Srikar, V. S., & Babu, R. V. (2017). Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4714–4722.
Roberts, J. W., Van Aardt, J. A., & Ahmed, F. B. (2008). Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing, 2(1).
Shen, J., Zhao, Y., Yan, S., Li, X., et al. (2014). Exposure fusion using boosting laplacian pyramid. IEEE Transactions on Cybernetics, 44(9), 1579–1590.
Shen, X., Yan, Q., Xu, L., Ma, L., & Jia, J. (2015). Multispectral joint image restoration via optimizing a scale map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12), 2518–2530.
Szeliski, R., Uyttendaele, M., & Steedly, D. (2011). Fast poisson blending using multi-splines. In: Proceedings of the IEEE International Conference on Computational Photography, pp. 1–8.
Vedaldi, A., & Fulkerson, B. (2010). Vlfeat: An open and portable library of computer vision algorithms. In: Proceedings of the ACM International Conference on Multimedia, pp. 1469–1472.
Xing, L., Cai, L., Zeng, H., Chen, J., Zhu, J., & Hou, J. (2018). A multi-scale contrast-based image quality assessment model for multi-exposure image fusion. Signal Processing, 145, 233–240.
Xu, H., Ma, J., Jiang, J., Guo, X., & Ling, H. (2020). U2fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence.


Xu, H., Ma, J., & Zhang, X. P. (2020). Mef-gan: Multi-exposure image fusion via generative adversarial networks. IEEE Transactions on Image Processing, 29, 7203–7216.
Zhang, H., Xu, H., Xiao, Y., Guo, X., & Ma, J. (2020). Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12,797–12,804.
Zhao, F., Xu, G., & Zhao, W. (2019). Ct and mr image fusion based on adaptive structure decomposition. IEEE Access, 7, 44002–44009.
Zhou, F., Hang, R., Liu, Q., & Yuan, X. (2019). Pyramid fully convolutional network for hyperspectral and multispectral image fusion. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(5), 1549–1558.
Zhu, Z., Zheng, M., Qi, G., Wang, D., & Xiang, Y. (2019). A phase congruency and local laplacian energy based multi-modality medical image fusion method in nsct domain. IEEE Access, 7, 20811–20824.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
