2021 (code) SDNet - A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion
https://doi.org/10.1007/s11263-021-01501-8
Abstract
In this paper, a squeeze-and-decomposition network (SDNet) is proposed to realize multi-modal and digital photography image fusion in real time. First, we generally transform multiple fusion problems into the extraction and reconstruction of gradient and intensity information, and design a universal form of loss function accordingly, which is composed of an intensity term and a gradient term. For the gradient term, we introduce an adaptive decision block to decide the optimization target of the gradient distribution according to the texture richness at the pixel scale, so as to guide the fused image to contain richer texture details. For the intensity term, we adjust the weight of each intensity loss term to change the proportion of intensity information from different images, so that the loss can be adapted to multiple image fusion tasks. Second, we introduce the idea of squeeze and decomposition into image fusion. Specifically, we consider not only the squeeze process from the source images to the fused result, but also the decomposition process from the fused result back to the source images. Because the quality of the decomposed images directly depends on the fused result, this forces the fused result to contain more scene details. Experimental results demonstrate the superiority of our method over state-of-the-art methods in terms of subjective visual effect and quantitative metrics in a variety of fusion tasks. Moreover, our method is much faster than the state of the art, which enables it to deal with real-time fusion tasks.
Fig. 1 Schematic illustration of multi-modal image fusion and digital photography image fusion. First row: source image pairs to be fused (infrared/visible, PET/MRI, underexposed/overexposed, near-focused/far-focused); second row: fused results of our proposed method (SDNet)
… characterize all content in the scenario under a single setting. Concretely, it is difficult for all objects at different depths of field to be in focus within one image (Ma et al. 2020). Besides, the image is sometimes captured under inappropriate exposure, such as underexposure and overexposure (Hayat and Imran 2019; Goshtasby 2005). Under these circumstances, the scene can be described more comprehensively by combining images taken under different shooting settings. A few examples are provided in Fig. 1 to illustrate these two types of image fusion scenarios more intuitively.

In recent years, researchers have proposed a number of methods to solve the image fusion problem, which can be broadly divided into two categories. The first category is traditional image fusion methods, which usually rely on relevant mathematical transformations and realize the fusion by designing activity level measurements and fusion rules in the spatial domain or the transform domain (Li et al. 2012; Zhao et al. 2019; Shen et al. 2014; Paul et al. 2016; Ballester et al. 2006; Szeliski et al. 2011). The second category is deep learning-based methods. Methods of this type usually constrain the fused image by constructing an objective function that makes it exhibit the desired distribution characteristics. Because of the strong nonlinear fitting ability of neural networks, such methods can usually achieve better fused results (Ma et al. 2019; Prabhakar et al. 2017; Liu et al. 2017; Lai and Fang 1998).

Although the existing methods have achieved promising results in most cases, there are still several aspects to be improved. First, existing traditional methods usually need to manually design the activity level measurement and fusion rules, which become complex because of the diversity of source images. This also limits fusion results, because it is impossible to consider all factors in one manually designed scheme. Second, the most prominent obstacle in applying deep learning to image fusion is the lack of a ground-truth fused image for supervised learning. A few methods sidestep this difficulty by manually constructing the ground truth, which is usually inaccurate and sets an upper limit for learning. Third, as mentioned earlier, there are large differences between image fusion tasks. In multi-modal image fusion, source images are captured by different sensors; in digital photography image fusion, by contrast, source images are taken by the same sensor under different shooting settings. As a result, existing methods cannot solve different image fusion problems with the same idea. Finally, existing methods are usually less competitive in operating efficiency due to their large number of parameters or the high complexity of their fusion rules.

To address the above-mentioned challenges, we design a squeeze-and-decomposition network, called SDNet, to implement multi-modal image fusion and digital photography image fusion end-to-end in real time. Our design is mainly developed from the following two aspects.

On the one hand, we model both multi-modal image fusion and digital photography image fusion as the extraction and reconstruction of intensity and gradient information. Our view is that the information contained in an image can be divided into gradient and intensity information, where the gradient information represents the texture structure, while the intensity information indicates the overall brightness distribution of the image. Based on this idea, we design a loss function of a universal form for the above two types of image fusion scenarios, which forces the network to extract the gradient and intensity information and fuse them by two different rules. Specifically, for the gradient information, we believe that, apart from noise, areas with strong gradients are clear or contain a large amount of texture. Based on this observation, we propose an adaptive decision block, which first uses a Gaussian low-pass filter to reduce the effect of noise, and then scores each pixel based on the level of its gradient, thereby directing the gradient distribution of the fused image to approximate the source pixel with the larger gradient strength. For the intensity information, because different fusion tasks have different preferences for intensity
information preservation, we select more effective and interesting intensity information to be preserved in the fused result by adjusting the weight proportion of each intensity loss term. By using these two strategies to extract and reconstruct gradient and intensity information, the proposed loss function can be well adopted in both multi-modal image fusion and digital photography image fusion.

On the other hand, we propose a fast SDNet to implement more effective image fusion. Previous methods only consider the squeeze process from the source images to the fused result; can the fused result, in turn, be decomposed to regenerate the source images? Although part of the information will inevitably be discarded in the fusion process, requiring the decomposition result to be consistent with the source images reduces this information loss as much as possible. In other words, this decomposition consistency forces the fused result to contain more scene details, because the quality of the decomposition result directly depends on the fused result. Based on this motivation, we design a squeeze-and-decomposition network, which contains two parts: squeeze and decomposition. In the squeeze stage, the source images are fused into a single image, while in the decomposition stage, the fused result is decomposed back into the source images. This squeeze-and-decomposition network is likewise suitable for both multi-modal and digital photography image fusion.

Our method has the following advantages. First of all, our method does not need manually designed activity level measurements or fusion rules, and implements fusion end-to-end. Second, our network does not require ground truth for supervised learning; instead, it is trained in an unsupervised manner with weak constraints. Third, our method is applicable not only to the fusion of images obtained by multi-modality imaging, but also to the fusion of images obtained by digital photography. It is worth noting that, due to the use of 1 × 1 convolution kernels and the control of the number of feature channels, the number of parameters in our network is kept within a small range. As a result, our method can achieve fusion at a high speed. Our contributions include the following five aspects:

– We propose a new end-to-end image fusion model, which can realize both multi-modal image fusion and digital photography image fusion well.
– We design a specific form of loss function, which can force the network to generate the expected fused results.
– We propose an adaptive decision block for the gradient loss terms, which can reduce the effect of noise and effectively guide the fused result to contain richer texture details.
– We design a squeeze-and-decomposition network, which focuses on the two stages of fusion and decomposition at the same time, so as to make the fused result contain more scene details.
– Our method can perform image fusion in real time for multiple fusion tasks. The code is publicly available at: https://github.com/HaoZhang1018/SDNet.

A preliminary version of this manuscript appeared in Zhang et al. (2020). The primary new contributions include the following two aspects. First, we design an adaptive decision block to constrain the gradient information instead of the previous manually proportioned setting strategy. On the one hand, this reduces the number of hyper-parameters that need to be manually adjusted; on the other hand, it makes our method perform better, especially in multi-focus image fusion. Second, we have further improved the network, which now considers not only the fusion process but also the decomposition process. This decomposition consistency can make the fused image contain more scene details and thus have a better visual effect.

The remainder of this paper is organized as follows. Section 2 describes related work, including an overview of existing traditional and deep learning-based fusion methods. Section 3 provides the overall framework, loss functions and network architecture design. In Sect. 4, we give the detailed experimental settings and compare our method with several state-of-the-art methods on publicly available datasets by qualitative and quantitative comparisons. In addition, we also carry out comparative experiments on efficiency, ablation experiments, visualization of decomposition, infrared and RGB visible image fusion, sequence image fusion, comparison with the preliminary version (Zhang et al. 2020), and application verification in this section. Conclusions are given in Sect. 5.

2 Related Work

With various methods proposed, the field of image fusion has made great progress. Existing methods can be broadly divided into traditional and deep learning-based methods.

Traditional methods usually rely on relevant mathematical transformations and manually designed fusion rules to realize image fusion. Piella (2003) presented an overview of image fusion techniques using multi-resolution decompositions, which make a multi-resolution segmentation based on all the different input images; this segmentation is subsequently used to guide the infrared and visible image fusion process. A ghost-free multi-exposure image fusion technique using the dense SIFT descriptor and guided filter was proposed by Hayat and Imran (2019), which can produce high-quality, artifact-free images using ordinary cameras. Paul et al. (2016) proposed a general algorithm for multi-focus and multi-exposure image fusion, which is based on blending the gradients of the luminance components of the input images using the maximum gradient magnitude at each
pixel location and then obtaining the fused luminance using a Haar wavelet-based image reconstruction technique. Fu et al. (2019) introduced a more accurate spatial preservation based on local gradient constraints into remote sensing image fusion, which can fully utilize the spatial information contained in the PAN image while maintaining spectral information. As a result, they can obtain very promising fused results.

Compared with traditional methods, deep learning-based methods can learn fusion models with good generalization ability from a large amount of data. In the field of infrared and visible image fusion, Ma et al. (2019) proposed an end-to-end model called FusionGAN, which generates a fused image with a dominant infrared intensity and an additional visible gradient on the basis of GAN. Subsequently, they introduced a dual discriminator (Ma et al. 2020), a detail loss and a target edge-enhancement loss (Ma et al. 2020) based on FusionGAN to further enhance the texture details in the fused results. In the field of multi-exposure image fusion, Prabhakar et al. (2017) proposed an unsupervised deep learning framework that utilizes a no-reference quality metric as a loss function and can produce satisfactory fusion results. Xu et al. (2020) introduced an end-to-end architecture based on GAN with a self-attention mechanism and achieved state-of-the-art performance. In medical image fusion, Liu et al. (2017) used a neural network to generate the weight map that integrates the pixel activity levels of two source images, while the fusion process is conducted in a multi-scale manner via image pyramids. With the application of deep learning, great progress has also been made in the field of multi-focus image fusion. In particular, Ma et al. (2020) proposed an unsupervised network to generate the decision map for fusion, which can indicate whether a pixel is focused. Deep networks have also promoted the progress of remote sensing image fusion. Zhou et al. (2019) designed a deep model composed of an encoder network and a pyramid fusion network to fuse low-resolution hyperspectral and high-resolution multi-spectral images, which improves the preservation of spatial information by progressive refinement. Ma et al. (2020) proposed an unsupervised deep model for pansharpening to make full use of the texture structure in panchromatic images. They transformed pansharpening into multi-task learning by using two independent discriminators, which preserve spectral and spatial information well. Our preliminary version PMGI (Zhang et al. 2020) proposed a new image fusion network based on the proportional maintenance of gradient and intensity information, which can realize a variety of image fusion tasks. However, changing the maintenance proportion of the gradient information by adjusting the weight causes a certain degree of texture structure loss or blur, thereby reducing the quality of the fused result. Therefore, in this paper we improve PMGI to realize better fusion performance.

3 Method

In this section, we give a detailed introduction to our SDNet. We first introduce the overall framework, and then give the definition of the loss function. Finally, we provide the detailed structure of the network. Note that the source images are assumed to be pre-registered in our method (Ma et al. 2021).

3.1 Overall Framework

The idea of image fusion is to extract and combine the most meaningful information from the source images. On the one hand, for different image fusion tasks, the most meaningful information contained in the source images is different. Because there is no common standard for such meaningful information, existing methods are usually difficult to migrate to other fusion tasks. Thus, it is desirable to develop a versatile model to fulfill multiple types of image fusion tasks. On the other hand, it is very important to preserve as much information from the source images as possible in the fused image. Our method, an end-to-end model, is designed based on the above two observations.

Firstly, we divide the meaningful information into two categories: gradient and intensity information. For any image, its most essential element is the pixel. The intensity of pixels can represent the overall brightness distribution, which reflects the contrast characteristics of the image. The difference between pixels constitutes the gradient, which can represent the texture details in the image. Therefore, both multi-modal image fusion and digital photography image fusion can be modeled as the extraction and reconstruction of these two kinds of information, as shown in Fig. 2. The extraction and reconstruction of gradient and intensity information depends on the design of the loss function. In our model, we propose a universal loss function for different image fusion tasks, which consists of a gradient loss term and an intensity loss term constructed between the fused image and both source images. Although using an intensity loss and a gradient loss (Ma et al. 2019, 2020) is a common practice in a specific image fusion task (Szeliski et al. 2011), it is non-trivial to extend them to other image fusion tasks. To this end, the reconstruction rules we design for gradient information and intensity information are greatly different. For gradient information reconstruction, we introduce an adaptive decision block that acts on the gradient loss term. The adaptive decision block first uses Gaussian low-pass filtering to reduce the influence of noise on the decision-making process, and then evaluates the importance of the corresponding pixels based on their gradient richness, so as to generate a pixel-scale decision map that guides textures in the fused image to approximate those in the source pixels with richer textures.
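Because the exact loss formulation of Sect. 3.2 is not reproduced in this excerpt, the following minimal numpy/scipy sketch only illustrates the adaptive decision mechanism described above; the filter width, the Sobel gradient operator, the hard (rather than soft) decision rule, the L1 norm, and the helper names `adaptive_decision_map` and `gradient_loss` are all assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def adaptive_decision_map(src_a, src_b, sigma=1.0):
    """Per-pixel map selecting the source with the stronger (noise-suppressed) gradient."""
    def texture_score(img):
        smooth = gaussian_filter(img, sigma=sigma)  # Gaussian low-pass: suppress noise first
        return np.hypot(sobel(smooth, axis=0), sobel(smooth, axis=1))  # gradient magnitude
    return (texture_score(src_a) >= texture_score(src_b)).astype(np.float32)

def gradient_loss(fused, src_a, src_b, sigma=1.0):
    """L1 distance between the fused gradient and the per-pixel target gradient."""
    m = adaptive_decision_map(src_a, src_b, sigma)
    gy_f, gx_f = np.gradient(fused)
    gy_a, gx_a = np.gradient(src_a)
    gy_b, gx_b = np.gradient(src_b)
    # The decision map picks, pixel by pixel, which source's gradient the fused image should follow.
    tx = m * gx_a + (1.0 - m) * gx_b
    ty = m * gy_a + (1.0 - m) * gy_b
    return float(np.mean(np.abs(gx_f - tx) + np.abs(gy_f - ty)))
```

Note that the decision map is computed from the fixed source images, so it only shapes the loss; the fusion network itself remains a plain feed-forward model.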
[Figure: SDNet network architecture (original caption not recoverable from this extraction). The squeeze network concatenates the source images and passes them through stacked Conv + LReLU blocks with concatenation connections and a Tanh output layer to produce the fused image; the decomposition network passes the fused image through further Conv + LReLU blocks, again ending in Tanh, to produce the decomposed images.]
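The exact layer counts, kernel sizes and channel widths are not recoverable from this excerpt, so the sketch below is only a toy rendering of the two-stage pipeline suggested by the figure (Conv + LReLU blocks, Tanh outputs, channel concatenation) and by the 1 × 1 kernels mentioned in Sect. 1; the class and helper names are hypothetical, the dense skip concatenations shown in the figure are omitted, and the channel width of 16 is a placeholder.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv + LeakyReLU: the basic unit visible in the architecture figure."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.conv(x))

class SqueezeDecompose(nn.Module):
    """Toy squeeze-and-decomposition pipeline: two sources -> fused image -> two reconstructions."""
    def __init__(self, ch=16):
        super().__init__()
        # Squeeze network: concatenated sources -> fused image (Tanh output, inputs assumed in [-1, 1]).
        self.squeeze = nn.Sequential(
            ConvBlock(2, ch), ConvBlock(ch, ch), ConvBlock(ch, ch),
            nn.Conv2d(ch, 1, kernel_size=1), nn.Tanh())
        # Decomposition network: fused image -> estimates of both sources.
        self.decompose = nn.Sequential(
            ConvBlock(1, ch), ConvBlock(ch, ch),
            nn.Conv2d(ch, 2, kernel_size=1), nn.Tanh())

    def forward(self, src_a, src_b):
        fused = self.squeeze(torch.cat([src_a, src_b], dim=1))
        dec = self.decompose(fused)
        return fused, dec[:, :1], dec[:, 1:]

def decomposition_consistency(dec_a, dec_b, src_a, src_b):
    """Penalty tying the decomposed images back to the sources, forcing the fused image to keep their content."""
    return (dec_a - src_a).abs().mean() + (dec_b - src_b).abs().mean()
```

At test time only the squeeze branch is run, matching the note in Sect. 4.1.2 that only the squeeze network is used to generate the fused result.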
… image fusion, we crop the rest of the images into 80,881 image patch pairs of size 120 × 120 for training; for the multi-exposure image fusion task, we crop the rest of the images into 91,840 image patch pairs of size 120 × 120 for training; for the multi-focus image fusion task, we crop the rest of the images into 184,790 image patch pairs of size 60 × 60 for training. Since the proposed SDNet is a fully convolutional network, the source images do not need to be cropped into small patches of the same size as the training data during the test phase. In other words, the test is performed on source images of their original size.

4.1.2 Training Details

For fusion tasks where the source images are all grayscale images, such as infrared and visible image fusion, the proposed model can be used directly to fuse the source images and generate the fused result. For grayscale-and-color image fusion tasks, such as medical image fusion, we first transform the color image from RGB to YCbCr color space. Because the Y channel (luminance channel) represents structural details and brightness variation, we fuse the Y channel of the color source image with the grayscale source image. Then, we directly concatenate the Cb and Cr channels of the color source image with the fused Y channel, and transfer these components back to RGB space to obtain the final result. For fusion tasks where the source images are all color images, such as multi-exposure and multi-focus image fusion, we transform all color source images from RGB to YCbCr color space. Then, we fuse the Y channels of the source images using the proposed model, and follow Eq. (12) to fuse the Cb or Cr channels of the source images:

C = (C1 |C1 − ζ| + C2 |C2 − ζ|) / (|C1 − ζ| + |C2 − ζ|),   (12)

where |·| indicates the absolute value function, C is the fused Cb or Cr, and C1 and C2 represent the Cb or Cr channels of the two source images, respectively. In addition, ζ is the median value of the dynamic range, which is set to 128. Finally, the fused components are transferred back to RGB space to obtain the final result.

The batch size is set to b, and it takes m steps to train one epoch. The total number of training epochs is M. In our experiment, we set b = 32 and M = 30, and m is set as the ratio between the total number of patches and b. The parameters in our SDNet are updated by the Adam optimizer. In addition, α of Eq. (5) in the four tasks is set according to the rules in Eqs. (7)–(10): 0.5, 0.5, 1 and 1. For β in Eq. (1), we aim to obtain the best results by setting it in the four tasks as: 10, 80, 50 and 3. All deep learning-based methods run on the same GPU (RTX 2080Ti), while the other methods run on the same CPU (Intel i7-8750H). It is worth noting that during the testing phase, only the squeeze network is used to generate the fused result.
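As a concrete reference for the chrominance handling above, the following numpy sketch implements the weighting of Eq. (12); the guard against a zero denominator (both channels exactly at ζ) is our addition rather than part of the stated equation, and `fuse_chrominance` is a hypothetical helper name.

```python
import numpy as np

def fuse_chrominance(c1, c2, zeta=128.0):
    """Eq. (12): fuse two Cb (or Cr) channels, weighting each by its distance from the mid-value zeta."""
    w1, w2 = np.abs(c1 - zeta), np.abs(c2 - zeta)
    denom = w1 + w2
    fused = (c1 * w1 + c2 * w2) / np.maximum(denom, 1e-8)
    # Where both channels sit exactly at zeta the equation is 0/0; fall back to their mean.
    return np.where(denom > 0, fused, (c1 + c2) / 2.0)
```

The weighting favors the channel whose chrominance deviates more from the neutral value, so saturated colors from either source survive in the fused Cb/Cr.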
Fig. 5 Qualitative results of the medical image fusion. From left to right: PET image, MRI image, fused results of ASR (Liu and Wang 2014), PCA
(Naidu and Raol 2008), NSCT (Zhu et al. 2019), CNN (Liu et al. 2017), U2Fusion (Xu et al. 2020) and our SDNet
Fig. 6 Qualitative results of infrared and visible fusion. From left to right: visible image, infrared image, results of GTF (Ma et al. 2016), MDLatLRR
(Li et al. 2020), DenseFuse (Li and Wu 2018), FusionGAN (Ma et al. 2019), U2Fusion (Xu et al. 2020) and our SDNet
4.2 Results on Multi-modal Image Fusion

The two representative tasks of multi-modal image fusion we selected are medical image fusion and infrared and visible image fusion. For the former, we use five state-of-the-art medical image fusion methods for comparison, including ASR (Liu and Wang 2014), PCA (Naidu and Raol 2008), NSCT (Zhu et al. 2019), CNN (Liu et al. 2017) and U2Fusion (Xu et al. 2020). For the latter, we select five state-of-the-art infrared and visible image fusion methods for comparison, namely GTF (Ma et al. 2016), MDLatLRR (Li et al. 2020), DenseFuse (Li and Wu 2018), FusionGAN (Ma et al. 2019) and U2Fusion (Xu et al. 2020).

4.2.1 Qualitative Comparisons

In each fusion task, three typical image pairs are selected to qualitatively demonstrate the performance of each method, as shown in Figs. 5 and 6. The qualitative analysis is as follows.

For medical image fusion, the three typical results we select are on different transaxial sections of the brain hemispheres. From these results, we can see that our SDNet has two advantages over the other methods. First, the results of our method contain a wealth of brain structural textures. Only NSCT, CNN, U2Fusion and our SDNet can well preserve the texture details in the MRI image, while ASR and PCA cannot, such as the structural textures in the first and second sets of results in Fig. 5. Moreover, these textures are much finer and sharper in the results of our method. Second, our SDNet better maintains the functional information; in other words, color distortion occurs more rarely in our SDNet than in the other methods. For example, in the highlighted part of the third set of results in Fig. 5, the results of NSCT, CNN and U2Fusion appear whitened, which is inconsistent with the distribution of functional information in the PET image.

In infrared and visible image fusion, according to the characteristics of the fused results, the comparative methods can be divided into two categories. The results of the first category are biased towards the visible image, as with MDLatLRR, DenseFuse and U2Fusion. Specifically, although their fused results contain richer texture details, they cannot maintain the significant contrast of the infrared image. The results of the second category are more similar to the infrared image. They maintain the significant contrast well, but the texture details in them are not rich enough, so they look more like sharpened infrared images, as with GTF and FusionGAN. In comparison, our method is more like a combination of these two categories. First, our method can maintain the significant contrast and effectively highlight the target from the background, like methods of the second category. For example, only the fused results of GTF, FusionGAN and our method have significant contrast and can highlight the targets, such as the human in the first and third groups of results in Fig. 6. Second, the fused results of our method also contain rich texture structures, just like methods of the first category. For example, in the second row of Fig. 6, our SDNet maintains the texture details of the shrub and clothes well, while GTF and FusionGAN cannot.

In general, the qualitative results of our SDNet show a certain superiority over the other comparative methods in multi-modal image fusion.

4.2.2 Quantitative Comparisons

In order to assess our method in multi-modal image fusion more comprehensively, we conduct a quantitative comparison on ten pairs of images in each fusion task. Considering the characteristics of multi-modal image fusion, four objective metrics are selected to evaluate the quality of the fused images, namely entropy (EN) (Roberts et al. 2008), mutual information of discrete cosine features (FMI_dct) (Haghighat and Razian 2014), the peak signal-to-noise ratio (PSNR), and mean gradient (MG). The reasons for selecting them are as follows. In multi-modal image fusion, part of the information will inevitably be discarded, so we adopt EN to evaluate the amount of information remaining in the fused image. The larger the value of EN, the more information the fused image contains. We use FMI_dct to assess the amount of features that are transferred from the source images to the fused image. It can also reflect the degree of correlation between the features in the fused image and the source images, which is meaningful for characterizing data fidelity. A large FMI_dct metric generally indicates that considerable feature information is transferred from the source images to the fused image. Affected by the imaging environment, visible images often contain a lot of noise. Similarly, the particularity of medical images also requires the noise level to be as low as possible. Therefore, we introduce PSNR to evaluate the noise level in the fused results. A larger PSNR value indicates less noise relative to the useful information. In addition, we adopt MG to assess the richness of the texture structure. A large MG metric indicates that the fused image contains rich texture details. The quantitative results of multi-modal image fusion are shown in Table 1.
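EN and MG are standard reference-free statistics; since the paper's exact formulas are not reproduced in this excerpt, the sketch below uses common definitions (Shannon entropy of the grey-level histogram, and the average gradient with the usual 1/2 normalization), which should be read as assumptions rather than the authors' implementation.

```python
import numpy as np

def entropy(img, levels=256):
    """EN: Shannon entropy of the grey-level histogram (image assumed in [0, 255])."""
    hist, _ = np.histogram(img.astype(np.uint8), bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mean_gradient(img):
    """MG: average gradient magnitude, a common proxy for texture richness."""
    gy, gx = np.gradient(img.astype(np.float64))
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))
```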
For medical image fusion, it can be seen that our method achieves the largest average values on three metrics: EN, FMI_dct and MG. On PSNR, our method is second only to U2Fusion. From these results, we can conclude that the results of our SDNet contain the most information and obtain the most features from the source images. Moreover, the results of our method contain the richest texture details, which proves that our SDNet can preserve sufficient structural information from the MRI images. It is worth noting that the level of noise in our results is low, which means the proposed SDNet is rigorous in fusing information.

In infrared and visible image fusion, our SDNet achieves the largest average values on FMI_dct, PSNR and MG. Therefore, the results of our SDNet carry a large amount of feature information transferred from the source images, have the largest signal-to-noise ratio, and contain the richest texture details. Interestingly, our method does not achieve as good an EN value as in medical image fusion, and this is caused by the characteristics of infrared and visible image fusion. Specifically, visible images usually contain a large amount of noise, and the noise suppression of our method reduces the entropy of the fused results to a certain extent.

Overall, the proposed SDNet performs better in the quantitative comparisons than the other comparative methods in multi-modal image fusion, which is consistent with the qualitative comparison.

4.3 Results on Digital Photography Image Fusion

Our SDNet also performs well on digital photography image fusion. To verify this point, we conducted comparative experiments on two typical digital photography image fusion tasks: multi-exposure image fusion and multi-focus image fusion. For multi-exposure image fusion, we select five state-of-the-art methods to compare with our SDNet: AWPIGG (Lee et al. 2018), DSIFT (Hayat and Imran 2019), GFF (Li et al. 2013), DeepFuse (Prabhakar et al. 2017) and U2Fusion (Xu et al. 2020). For multi-focus image fusion, five state-of-the-art methods are also selected for comparison: CNN (Liu et al. 2017), DSIFT (Liu et al. 2015), GD (Paul et al. 2016), SESF (Ma et al. 2020) and U2Fusion (Xu et al. 2020).

4.3.1 Qualitative Comparisons

In each digital photography image fusion task, we give three typical intuitive results to qualitatively compare our SDNet with the other methods, as shown in Figs. 7 and 8. The detailed qualitative analysis is as follows.

In the multi-exposure image fusion task, we can find that the proposed SDNet has two major advantages. First, our method can avoid strange black shadows and unnatural illumination transitions. Concretely, the fused results of AWPIGG, DSIFT and GFF show these strange shadows, such as in the sky in the second set of results, while DeepFuse, U2Fusion and the proposed SDNet do not. Second, in areas of extreme overexposure and extreme underexposure (where texture details are only visible in a single source image), our SDNet can better preserve these details and their shapes. For example, in the third group of results in Fig. 7, the lamps are blurred and their edges are even invisible in the results of the other three methods, i.e., all except DeepFuse, U2Fusion and our SDNet. However, compared to DeepFuse and U2Fusion, our SDNet can preserve these texture details more finely, such as the roof in the first set of results and the tree branches in the second group of results. Note that our results have insufficient local contrast in some scenes. Even so, the visual effect of our method is still better than that of the other methods.

In the multi-focus image fusion task, the five comparative methods can be divided into two categories according to their principle. Methods of the first category generate a decision map based on focus detection to fuse the multi-focus images, such as CNN, SESF and DSIFT. Such methods often lose detail due to misjudgment near the junction of focused and non-focused regions. The other category fuses the multi-focus images from a global perspective instead of detecting focused areas at the pixel scale. The disadvantage of this kind of methods is that intensity distortion and blur appear in the fused results, as with GD and U2Fusion. From the results, we see that our SDNet has clear advantages over both categories of methods. First of all, compared with the decision map-based methods, our method can accurately retain details near the junction of focused and non-focused regions. For example, in the first set of results in Fig. 8, these methods lose the golf ball. Second, our method can also maintain the same intensity distribution as the source images. It can be clearly seen that the fused results of GD have intensity distortion compared to the source images, and the results of U2Fusion have a certain degree of detail blur, while our method does not. Note that there are some visible halos in the results of ours and the other methods at the boundary between the focused and non-focused regions. This is caused by the outline of the foreground target in the source image spreading into the background area due to defocus.

Overall, in the digital photography image fusion scenario, our SDNet has better performance in terms of intuitive effect.
Table 1 Quantitative results of multi-modal image fusion

| Medical | ASR (Liu and Wang 2014) | PCA (Naidu and Raol 2008) | NSCT (Zhu et al. 2019) | CNN (Liu et al. 2017) | U2Fusion (Xu et al. 2020) | Ours |
| EN (Roberts et al. 2008) ↑ | 4.969 ± 0.283 | 4.779 ± 0.284 | 5.312 ± 0.239 | 5.254 ± 0.242 | 5.325 ± 0.293 | 5.561 ± 0.290 |
| FMI_dct (Haghighat and Razian 2014) ↑ | 0.410 ± 0.009 | 0.344 ± 0.010 | 0.405 ± 0.007 | 0.161 ± 0.016 | 0.302 ± 0.004 | 0.433 ± 0.005 |
| PSNR ↑ | 64.092 ± 0.517 | 63.937 ± 0.508 | 63.187 ± 0.586 | 63.789 ± 0.654 | 65.013 ± 0.433 | 64.550 ± 0.377 |
| MG ↑ | 0.024 ± 0.002 | 0.038 ± 0.003 | 0.043 ± 0.003 | 0.040 ± 0.004 | 0.036 ± 0.003 | 0.044 ± 0.004 |

| Infrared–Visible | GTF (Ma et al. 2016) | MDLatLRR (Li et al. 2020) | DenseFuse (Li and Wu 2018) | FusionGAN (Ma et al. 2019) | U2Fusion (Xu et al. 2020) | Ours |
| EN (Roberts et al. 2008) ↑ | 6.982 ± 0.259 | 6.518 ± 0.439 | 6.848 ± 0.302 | 6.608 ± 0.331 | 7.012 ± 0.273 | 6.780 ± 0.325 |
| FMI_dct (Haghighat and Razian 2014) ↑ | 0.422 ± 0.036 | 0.409 ± 0.035 | 0.417 ± 0.022 | 0.370 ± 0.033 | 0.340 ± 0.016 | 0.423 ± 0.021 |
| PSNR ↑ | 63.789 ± 2.006 | 62.976 ± 2.090 | 62.420 ± 1.625 | 61.799 ± 1.455 | 63.068 ± 1.940 | 64.292 ± 1.822 |
| MG ↑ | 0.022 ± 0.014 | 0.016 ± 0.008 | 0.020 ± 0.012 | 0.014 ± 0.007 | 0.026 ± 0.014 | 0.027 ± 0.013 |

Bold indicates the best, and bolditalic indicates the second best
Fig. 7 Qualitative results of multi-exposure image fusion. From left to right: underexposed image, overexposed image, fused results of AWPIGG
(Lee et al. 2018), DSIFT (Hayat and Imran 2019), GFF (Li et al. 2013), DeepFuse (Prabhakar et al. 2017), U2Fusion (Xu et al. 2020) and our
SDNet
Fig. 8 Qualitative results of multi-focus fusion. From left to right: the near focus image, the far focus image, fused results of CNN (Liu et al. 2017),
DSIFT (Liu et al. 2015), GD (Paul et al. 2016), SESF (Ma et al. 2020), U2Fusion (Xu et al. 2020) and our SDNet
4.3.2 Quantitative Comparisons

We also perform quantitative comparisons to further demonstrate the performance of our SDNet in the digital photography image fusion scenario. Considering the characteristics of digital photography image fusion, four objective metrics are selected to evaluate the fused results: Q_AB/F, mutual information of discrete cosine features (FMI_dct) (Haghighat and Razian 2014), the peak signal-to-noise ratio (PSNR) and N_AB/F (Kumar 2013). PSNR and FMI_dct have been described previously, and the rest are explained below. Different from multi-modal image fusion, the scene information reflected by the source images in digital photography image fusion is strictly complementary by region, so the preservation of the scene information is as important as its fidelity. Therefore, we adopt Q_AB/F, rather than MG, to assess the amount of edge structure that is transferred from the source images to the fused image. A large Q_AB/F means that the scene content contained in the fused image is richer and more comprehensive. In addition, in multi-exposure image fusion the fused result often contains unnatural artifacts, while in multi-focus image fusion the fused result often suffers from overall contrast distortion. We introduce N_AB/F to evaluate artifacts and contrast distortion in the fused image. A smaller N_AB/F value indicates fewer artifacts and less contrast distortion. The quantitative results of digital photography image fusion are shown in Table 2.

In multi-exposure image fusion, our SDNet achieves the largest average values on the first three metrics, Q_AB/F, FMI_dct and PSNR. On N_AB/F, our method is second only to DeepFuse by a slight margin. These results indicate that the fused images of our method contain the richest and most comprehensive scene content, carry a large amount of feature information transferred from the source images, and involve the lowest level of noise. In addition, like DeepFuse, our method can also avoid unnatural artifacts, which can also be seen from the qualitative results in Fig. 7.
In multi-focus image fusion, as can be seen from the statistical results, our SDNet also achieves the largest average values on Q_AB/F, FMI_dct and PSNR. These results indicate that the fused results of our method contain the most scene content, obtain the most feature information from the source images, and have the largest signal-to-noise ratio. Moreover, our SDNet ranks second on N_AB/F, next to DSIFT.

Generally, in digital photography image fusion, our method performs better than the comparative methods in the quantitative comparison.

4.4 Comparisons of Efficiency

In our method, the number of parameters used for testing is 0.287 M, which is very lightweight. In order to evaluate it more comprehensively, we carry out comparative experiments on the efficiency of multi-modal image fusion and digital photography image fusion. The number of image pairs for testing is 10 in each of the four representative tasks.

The results are shown in Table 3. It can be seen that our method achieves the highest running efficiency in all four image fusion tasks: medical image fusion, infrared and visible image fusion, multi-exposure image fusion, and multi-focus image fusion. In general, our SDNet has a significant superiority in running time, being almost an order of magnitude faster than the comparative methods, which allows it to deal with real-time fusion tasks.

4.5 Ablation Experiments

To verify the effectiveness of the specific designs in this paper, we perform relevant ablation experiments. First, we reveal the roles of the two proposed key modules, the adaptive decision block and the decomposition network. Then, we separately evaluate the fusion performance when the intensity loss, the gradient loss, and the squeeze fusion loss are removed.

4.5.1 Adaptive Decision Block Analysis

In this work, the adaptive decision block acts on the gradient loss term and adaptively guides the gradient distribution of the fused image to approximate the pixel with the stronger gradient. To our knowledge, this is the first time that such a pixel-scale guidance strategy is adopted for the image fusion task. More importantly, it fits well with both multi-modal image fusion and digital photography image fusion. We analyze the feasibility of the adaptive decision block in the four image fusion tasks as follows. In medical image fusion, because the difference in intensity between PET and MRI images is very large, an imbalance between structural textures and functional information often occurs. More specifically, textures from MRI images are often submerged in functional information from PET images, causing texture details to be weakened or lost. The adaptive decision block decides the target of gradient optimization at the pixel scale, which allows the network to retain the functional information while preserving the structural textures as much as possible. In infrared and visible image fusion, visible images often contain a lot of noise due to environmental factors such as weather. Reducing the influence of noise on the fusion process is the key to improving the quality of infrared and visible image fusion. The adaptive decision block can enhance the robustness of our method against noise, because the Gaussian low-pass filtering operation in the decision block reduces the noise interference to a certain extent. In multi-exposure image fusion, the exposure degree of the source images is uncertain; in other words, it is difficult to determine the exposure level of a source image and its distance from the ideal illumination. In this case, the adaptive decision block can measure the appropriateness of exposure based on the gradient magnitude at the pixel scale, because both overexposure and underexposure cause texture detail loss. Therefore, the adaptive decision block can force the fused results to have more appropriate lighting while preserving more comprehensive texture details. In multi-focus image fusion, the adaptive decision block also makes the final fused result more promising. It selects the maximum gradient at each pixel position as the optimization target to force the fused result to contain richer texture details, which avoids the fused result being an intermediate between clear and blurred. To fully verify its role, we train our network without it. Specifically, we adopt the proportioned setting strategy for the gradient loss, as in the intensity loss, weighting from the global perspective of image patches.

The qualitative results are shown in Fig. 9. In medical image fusion, it can be seen that when using the adaptive decision block, the structural textures in the fused result are maintained very well. At the same time, the functional information is not worse than in the fused result without the adaptive decision block. In infrared and visible image fusion, it is obvious that the visible image contains a lot of noise. When there is no adaptive decision block, the gradient constraint is severely affected by noise and the detail in the window of the car is weak, as highlighted. In contrast, this detail is well preserved when using the adaptive decision block. For multi-exposure image fusion, the fused result with the adaptive decision block has more suitable lighting and retains more scene details, which the variant without it does poorly. In multi-focus image fusion, when there is no adaptive decision block, the texture detail of the fused result is an intermediate result between sharpness and blurring, which is not as clear as the result obtained with the adaptive decision block.
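The ablation baseline above replaces the per-pixel decision map with two fixed global weights. A minimal numpy sketch of that baseline follows; the weight values and the L1 norm are illustrative assumptions (the paper sets its proportioned weights per task), and `proportioned_gradient_loss` is a hypothetical name.

```python
import numpy as np

def proportioned_gradient_loss(fused, src_a, src_b, w_a=0.7, w_b=0.3):
    """Ablation baseline: fixed global weights on the two gradient terms instead of a pixel-wise decision map."""
    gy_f, gx_f = np.gradient(fused)
    gy_a, gx_a = np.gradient(src_a)
    gy_b, gx_b = np.gradient(src_b)
    loss_a = np.mean(np.abs(gx_f - gx_a) + np.abs(gy_f - gy_a))
    loss_b = np.mean(np.abs(gx_f - gx_b) + np.abs(gy_f - gy_b))
    return w_a * loss_a + w_b * loss_b  # w_a and w_b stay fixed for the whole training run
```

Because w_a and w_b are constant over the image, sharp and blurred regions of the same source are weighted identically, which is consistent with the intermediate-sharpness behaviour observed in the ablation.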
Table 2 Quantitative results of digital photography fusion

| Multi-Exposure | AWPIGG (Lee et al. 2018) | DSIFT (Hayat and Imran 2019) | GFF (Li et al. 2013) | DeepFuse (Prabhakar et al. 2017) | U2Fusion (Xu et al. 2020) | Ours |
| Q_AB/F ↑ | 0.652 ± 0.064 | 0.658 ± 0.066 | 0.662 ± 0.044 | 0.640 ± 0.017 | 0.607 ± 0.061 | 0.677 ± 0.027 |
| FMI_dct (Haghighat and Razian 2014) ↑ | 0.390 ± 0.030 | 0.400 ± 0.028 | 0.430 ± 0.030 | 0.451 ± 0.049 | 0.375 ± 0.057 | 0.462 ± 0.048 |
| PSNR ↑ | 57.053 ± 0.691 | 57.015 ± 0.865 | 56.237 ± 0.768 | 57.661 ± 0.717 | 57.720 ± 0.657 | 57.841 ± 0.659 |
| N_AB/F (Kumar 2013) ↓ | 0.154 ± 0.022 | 0.170 ± 0.016 | 0.189 ± 0.041 | 0.085 ± 0.018 | 0.241 ± 0.024 | 0.095 ± 0.023 |

| Multi-Focus | CNN (Liu et al. 2017) | DSIFT (Liu et al. 2015) | GD (Paul et al. 2016) | SESF (Ma et al. 2020) | U2Fusion (Xu et al. 2020) | Ours |
| Q_AB/F ↑ | 0.618 ± 0.080 | 0.497 ± 0.060 | 0.683 ± 0.049 | 0.611 ± 0.085 | 0.670 ± 0.034 | 0.691 ± 0.016 |
| FMI_dct (Haghighat and Razian 2014) ↑ | 0.389 ± 0.017 | 0.405 ± 0.022 | 0.339 ± 0.044 | 0.383 ± 0.034 | 0.325 ± 0.023 | 0.410 ± 0.034 |
| PSNR ↑ | 73.827 ± 2.163 | 73.809 ± 2.173 | 69.325 ± 1.810 | 73.772 ± 2.153 | 72.488 ± 1.294 | 74.376 ± 1.782 |
| N_AB/F (Kumar 2013) ↓ | 0.073 ± 0.052 | 0.005 ± 0.005 | 0.172 ± 0.039 | 0.079 ± 0.055 | 0.156 ± 0.041 | 0.048 ± 0.018 |

Bold indicates the best, and bolditalic indicates the second best
Table 3 Average running time of different methods in multi-modal image fusion and digital photography image fusion (unit: second)

| Multi-Modal: Medical (size 256 × 256) | ASR (Liu and Wang 2014) | PCA (Naidu and Raol 2008) | NSCT (Zhu et al. 2019) | CNN (Liu et al. 2017) | U2Fusion (Xu et al. 2020) | Ours |
| Running time | 32.961 ± 1.195 | 0.007 ± 0.001 | 7.467 ± 0.377 | 16.504 ± 0.562 | 0.206 ± 0.035 | 0.004 ± 0.001 |

| Multi-Modal: Infrared–Visible (size 576 × 768) | GTF (Ma et al. 2016) | MDLatLRR (Li et al. 2020) | DenseFuse (Li and Wu 2018) | FusionGAN (Ma et al. 2019) | U2Fusion (Xu et al. 2020) | Ours |
| Running time | 3.830 ± 2.377 | 26.235 ± 12.761 | 0.321 ± 0.122 | 0.170 ± 0.129 | 0.636 ± 0.536 | 0.007 ± 0.003 |

| Digital Photography: Multi-Exposure (size 600 × 1000) | AWPIGG (Lee et al. 2018) | DSIFT (Hayat and Imran 2019) | GFF (Li et al. 2013) | DeepFuse (Prabhakar et al. 2017) | U2Fusion (Xu et al. 2020) | Ours |
| Running time | 0.969 ± 0.084 | 2.576 ± 0.253 | 0.799 ± 0.044 | 0.227 ± 0.033 | 1.579 ± 1.513 | 0.017 ± 0.001 |

| Digital Photography: Multi-Focus (size 520 × 520) | CNN (Liu et al. 2017) | DSIFT (Liu et al. 2015) | GD (Paul et al. 2016) | SESF (Ma et al. 2020) | U2Fusion (Xu et al. 2020) | Ours |
| Running time | 246.064 ± 307.537 | 17.352 ± 31.399 | 1.429 ± 1.200 | 0.686 ± 0.678 | 1.794 ± 3.165 | 0.022 ± 0.022 |

Bold indicates the best, and bolditalic indicates the second best
Table 4 Quantitative results of ablation experiments. ADB indicates the adaptive decision block, and DN is the decomposition network. Bold indicates the best result
Fig. 11 Results of ablation on the intensity loss. From left to right: source image pairs, fused results without the intensity loss and ours
Fig. 13 Results of ablation on the squeeze fusion loss. From left to right: source image pairs, fused results without the squeeze fusion loss and ours
Fig. 14 Fusion and decomposition of images with the same scene. Left: decomposition of multi-modal image fusion; right: decomposition of
digital photographic image fusion
Fig. 15 Fusion and decomposition of different scenes. Left: decomposition of multi-modal image fusion; right: decomposition of digital photo-
graphic image fusion
4.6 Visualization of Decomposition

4.6.1 Fusion and Decomposition of the Same Scene

4.6.2 Fusion and Decomposition of Different Scenes

An interesting question is: if the scenes represented by the source images are different, what will the results of the squeeze network and the decomposition network look like? To observe this, we apply the fusion and decomposition to source images of different scenes, and the results are shown in Fig. 15. It can be seen that, although the decomposition network does not completely separate the different scene content, the decomposed results are still dominated by the scene content represented by the corresponding source image. In addition, the fused results are able to integrate the content of the different scenes well.

4.7 Different Exposure Levels Fusion

The proposed model has good adaptability to test data with different distributions. To verify this point, we test the trained model on source images with different exposure levels, and the results are shown in Fig. 17. It can be seen that the exposure levels of these three pairs of source images are significantly different, which means that their distributions are different. However, our SDNet can generate good fused results in all three tests, which contain rich scene details.
Fig. 16 Decomposition of the real image. The decomposition network is able to decompose the well-exposed real image into overexposed and underexposed images
Fig. 18 Results of the infrared and RGB visible image fusion
4.11 SDNet vs. PMGI

The previous version of the proposed SDNet is PMGI (Zhang et al. 2020), and the improvements have two main aspects. First, we design a new adaptive decision block and introduce it into the construction of the gradient loss. In PMGI, the weight of each gradient loss term is proportionally set according to the global texture richness of the source images. For example, visible images can provide more texture details, so a large weight is set for the gradient loss term of the visible image and a small weight for the gradient loss term of the infrared image. Once these global weights are set, they are fixed throughout the training process. The direct negative effect of this is texture structure loss and sharpness reduction. Instead, in SDNet we use the adaptive decision block to guide the optimization of the gradient loss, which adaptively selects the gradient of the sharp source pixels as the optimization target at the pixel scale, making the fused result contain richer texture structures and higher sharpness. Second, unlike PMGI, which only considers the squeezing process of image fusion, the SDNet proposed in this paper considers both squeeze and decomposition. Concretely, the proposed SDNet is composed of two parts: the squeeze network and the decomposition network. The squeeze network is dedicated to squeezing the source images into a single fused image, while the decomposition network is devoted to decomposing this fused result to obtain images consistent with the source images. The decomposition consistency can force the fused result to contain richer scene details, and thus yield a better fusion effect.

Compared to PMGI, SDNet performs better on both multi-modal image fusion and digital photography image fusion. To visually show the difference, we compare their results, as shown in Fig. 21. First, in multi-modal image fusion, our SDNet can better preserve the texture structure and has the expected contrast distribution. For instance, in infrared and visible image fusion, SDNet retains the clear roof texture in the first column, while PMGI loses it. In the second column of results, the contour of the target in the result of PMGI has artifacts, while the one in SDNet is clean and sharp. Similarly, in the medical image fusion task, SDNet can better preserve the distribution of brain structures in the MRI images. In addition, the functional information is more similar to that in the PET image.
Table 6 Feature matching accuracy on different modalities. Bold indicates the best result. [Columns: Infrared, Visible, GTF (Ma et al. 2016), MDLatLRR (Li et al. 2020), DenseFuse (Li and Wu 2018), FusionGAN (Ma et al. 2019), U2Fusion (Xu et al. 2020), Ours; the numerical entries of this table are not recoverable from this extraction.]
Fig. 23 Visualization results of the pedestrian semantic segmentation

… according to the texture richness at the pixel scale, which can adaptively force the fused result to contain richer texture details. For intensity information, we adopt the proportioned setting strategy to adjust the loss weight ratio, so as to satisfy the requirement on the intensity distribution tendency in different image fusion tasks. On the other hand, we design a squeeze-and-decomposition network, which not only considers the squeeze process from the source images to the fused result, but also strives to decompose results that approximate the source images from the fused image. The decomposition consistency forces the fused image to contain more scene details. Extensive qualitative and quantitative experiments demonstrate the superiority of our SDNet over state-of-the-art methods in terms of both subjective visual effect and quantitative metrics in multiple image fusion tasks. Moreover, our method is about one order of magnitude faster than the state of the art, which makes it suitable for addressing real-time fusion tasks.

In a real scene, the images captured by the sensors are all unregistered. Unfortunately, the existing methods cannot handle these real unregistered data, as shown in Fig. 24. Inevitably, these fusion methods rely on the pre-processing of registration algorithms. As a result, they may have certain limitations in real scenes, such as low efficiency and dependence on registration accuracy. In the future, we will focus on research into unregistered fusion algorithms, so as to fulfill image registration and fusion in an implicit manner. We believe this will greatly improve the suitability of image fusion for real-world scenarios.

Declarations

Conflict of Interest The authors declare that they have no conflict of interest.

References

Ballester, C., Caselles, V., Igual, L., Verdera, J., & Rougé, B. (2006). A variational model for P+XS image fusion. International Journal of Computer Vision, 69(1), 43–58.
Cai, J., Gu, S., & Zhang, L. (2018). Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing, 27(4), 2049–2062.
Fu, X., Lin, Z., Huang, Y., & Ding, X. (2019). A variational pan-sharpening with local gradient constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10,265–10,274.
Goshtasby, A. A. (2005). Fusion of multi-exposure images. Image and Vision Computing, 23(6), 611–618.
Ha, Q., Watanabe, K., Karasawa, T., Ushiku, Y., & Harada, T. (2017). MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: Proceedings of the International Conference on Intelligent Robots and Systems, pp. 5108–5115.
Haghighat, M., & Razian, M. A. (2014). Fast-FMI: Non-reference image fusion metric. In: Proceedings of the IEEE International Conference on Application of Information and Communication Technologies, pp. 1–3.
Hayat, N., & Imran, M. (2019). Ghost-free multi exposure image fusion technique using dense SIFT descriptor and guided filter. Journal of Visual Communication and Image Representation, 62, 295–308.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
Kong, S. G., Heo, J., Boughorbel, F., Zheng, Y., Abidi, B. R., Koschan, A., et al. (2007). Multiscale fusion of visible and thermal IR images for illumination-invariant face recognition. International Journal of Computer Vision, 71(2), 215–233.
Kumar, B. S. (2013). Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing, 7(6), 1125–1143.
Lai, S. H., & Fang, M. (1998). Adaptive medical image visualization based on hierarchical neural networks and intelligent decision fusion. In: Proceedings of the IEEE Neural Networks for Signal Processing Workshop, pp. 438–447.
Lee, S. H., Park, J. S., & Cho, N. I. (2018). A multi-exposure image fusion based on the adaptive weights reflecting the relative pixel intensity and global gradient. In: Proceedings of the IEEE International Conference on Image Processing, pp. 1737–1741.
Li, H., & Wu, X. J. (2018). DenseFuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing, 28(5), 2614–2623.
Li, H., Wu, X. J., & Kittler, J. (2020). MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing, 29, 4733–4746.
Li, S., Kang, X., & Hu, J. (2013). Image fusion with guided filtering. IEEE Transactions on Image Processing, 22(7), 2864–2875.
Li, S., Yin, H., & Fang, L. (2012). Group-sparse representation with dictionary learning for medical image denoising and fusion. IEEE Transactions on Biomedical Engineering, 59(12), 3450–3459.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755.
Liu, Y., Chen, X., Cheng, J., & Peng, H. (2017). A medical image fusion method based on convolutional neural networks. In: Proceedings of the International Conference on Information Fusion, pp. 1–7.
Liu, Y., Chen, X., Peng, H., & Wang, Z. (2017). Multi-focus image fusion with a deep convolutional neural network. Information Fusion, 36, 191–207.
Liu, Y., Liu, S., & Wang, Z. (2015). Multi-focus image fusion with dense SIFT. Information Fusion, 23, 139–155.
Liu, Y., & Wang, Z. (2014). Simultaneous image fusion and denoising with adaptive sparse representation. IET Image Processing, 9(5), 347–357.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Ma, B., Zhu, Y., Yin, X., Ban, X., Huang, H., & Mukeshimana, M. (2020). SESF-Fuse: An unsupervised deep model for multi-focus image fusion. Neural Computing and Applications, pp. 1–12.
Ma, J., Chen, C., Li, C., & Huang, J. (2016). Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion, 31, 100–109.
Ma, J., Jiang, X., Fan, A., Jiang, J., & Yan, J. (2021). Image matching from handcrafted to deep features: A survey. International Journal of Computer Vision, 129(1), 23–79.
Ma, J., Liang, P., Yu, W., Chen, C., Guo, X., Wu, J., et al. (2020). Infrared and visible image fusion via detail preserving adversarial learning. Information Fusion, 54, 85–98.
Ma, J., Xu, H., Jiang, J., Mei, X., & Zhang, X. P. (2020). DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing, 29, 4980–4995.
Ma, J., Yu, W., Chen, C., Liang, P., Guo, X., & Jiang, J. (2020). Pan-GAN: An unsupervised pan-sharpening method for remote sensing image fusion. Information Fusion, 62, 110–120.
Ma, J., Yu, W., Liang, P., Li, C., & Jiang, J. (2019). FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion, 48, 11–26.
Ma, K., Li, H., Yong, H., Wang, Z., Meng, D., & Zhang, L. (2017). Robust multi-exposure image fusion: A structural patch decomposition approach. IEEE Transactions on Image Processing, 26(5), 2519–2532.
Naidu, V., & Raol, J. R. (2008). Pixel-level image fusion using wavelets and principal component analysis. Defence Science Journal, 58(3), 338–352.
Nejati, M., Samavi, S., & Shirani, S. (2015). Multi-focus image fusion using dictionary-based sparse representation. Information Fusion, 25, 72–84.
Paul, S., Sevcenco, I. S., & Agathoklis, P. (2016). Multi-exposure and multi-focus image fusion in gradient domain. Journal of Circuits, Systems and Computers, 25(10), 1650123.
Piella, G. (2003). A general framework for multiresolution image fusion: From pixels to regions. Information Fusion, 4(4), 259–280.
Piella, G. (2009). Image fusion for enhanced visualization: A variational approach. International Journal of Computer Vision, 83(1), 1–11.
Prabhakar, K. R., Srikar, V. S., & Babu, R. V. (2017). DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4714–4722.
Roberts, J. W., Van Aardt, J. A., & Ahmed, F. B. (2008). Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing, 2(1).
Shen, J., Zhao, Y., Yan, S., Li, X., et al. (2014). Exposure fusion using boosting Laplacian pyramid. IEEE Transactions on Cybernetics, 44(9), 1579–1590.
Shen, X., Yan, Q., Xu, L., Ma, L., & Jia, J. (2015). Multispectral joint image restoration via optimizing a scale map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12), 2518–2530.
Szeliski, R., Uyttendaele, M., & Steedly, D. (2011). Fast Poisson blending using multi-splines. In: Proceedings of the IEEE International Conference on Computational Photography, pp. 1–8.
Vedaldi, A., & Fulkerson, B. (2010). VLFeat: An open and portable library of computer vision algorithms. In: Proceedings of the ACM International Conference on Multimedia, pp. 1469–1472.
Xing, L., Cai, L., Zeng, H., Chen, J., Zhu, J., & Hou, J. (2018). A multi-scale contrast-based image quality assessment model for multi-exposure image fusion. Signal Processing, 145, 233–240.
Xu, H., Ma, J., Jiang, J., Guo, X., & Ling, H. (2020). U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Xu, H., Ma, J., & Zhang, X. P. (2020). MEF-GAN: Multi-exposure image fusion via generative adversarial networks. IEEE Transactions on Image Processing, 29, 7203–7216.
Zhang, H., Xu, H., Xiao, Y., Guo, X., & Ma, J. (2020). Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12,797–12,804.
Zhao, F., Xu, G., & Zhao, W. (2019). CT and MR image fusion based on adaptive structure decomposition. IEEE Access, 7, 44002–44009.
Zhou, F., Hang, R., Liu, Q., & Yuan, X. (2019). Pyramid fully convolutional network for hyperspectral and multispectral image fusion. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(5), 1549–1558.
Zhu, Z., Zheng, M., Qi, G., Wang, D., & Xiang, Y. (2019). A phase congruency and local Laplacian energy based multi-modality medical image fusion method in NSCT domain. IEEE Access, 7, 20811–20824.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.