A weight induced contrast map for infrared and visible image fusion
Manoj Kumar Panda a, Priyadarsan Parida a,∗, Deepak Kumar Rout b
a Department of Electronics and Communication Engineering, GIET University, Gunupur, Rayagada, 765022, Odisha, India
b Department of Electronics and Telecommunication Engineering, IIIT Bhubaneswar, Gothapatana, Bhubaneswar, 751003, Odisha, India
Keywords: Infrared image; Visible image; Image decomposition; Contrast detection map; Weight map

Abstract: Fusion involves merging details from both infrared (IR) and visible images to generate a unified composite image that offers richer and more valuable information than either individual image. Surveillance, navigation, remote sensing, and military applications require various imaging modalities, including visible and IR, to oversee specific scenes. These sensors provide supplementary data and improve situational understanding, so it is essential to fuse this information into a single image. Fusing IR and visible images presents several challenges
due to the differences in imaging modalities, data characteristics, and the need for accurate
and meaningful integration of information. In this context, a novel image fusion architecture is proposed that focuses on enhancing prominent targets, with the objective of integrating thermal information from infrared images into visible images while preserving textural details within the visible
images. Initially, in the proposed algorithm, the images from different sensors are divided into
components of high and low frequencies using a Guided filter and an Average filter, respectively.
A unique contrast detection mechanism is proposed that is capable of preserving the contrast
information from the original images. Further, the contrast details of the IR and visible images
are enhanced using local standard deviation filtering and local range filtering, respectively.
We have developed a new weight map construction strategy that can effectively preserve
the supplemental data from both the original images. These weights and gradient details of
the source images are utilized to preserve the salient feature details of the images acquired
from the various modalities. A decision-making approach is utilized among the high-frequency
components of the original images to retain the prominent feature details of the source images.
Finally, the salient feature details and the prominent feature details are integrated to generate
the fused image. The developed technique is validated using both subjective and quantitative
perspectives. The developed approach provides EN, MI, 𝑁𝑎𝑏𝑓, and SD values of 6.86815, 13.73269, 0.15390, and 78.16158, respectively, when compared against deep learning-based approaches. Also, the proposed algorithm provides EN, MI, 𝑁𝑎𝑏𝑓, 𝐹𝑀𝐼𝑤, and 𝑄𝑎𝑏𝑓 values of 6.86815, 13.73269, 0.15390, 0.41634, and 0.47196, respectively, when compared against existing traditional fusion methods. It is observed that the developed technique provides adequate accuracy against twenty-seven state-of-the-art techniques.
1. Introduction
Object detection algorithms enable machines to identify and locate specific objects within images or video streams. This
technology empowers us to enhance safety, efficiency, and convenience in various domains [1]. However, visual data, while
incredibly valuable for object detection, may not be reliable in poor lighting or camouflaged conditions. Thermal images or videos, on the other hand, cannot be trusted when the object is behind transparent or reflective surfaces, when the object and its surroundings have similar temperature profiles (thermal camouflage), when objects are covered with non-radiative or highly absorbent materials, or when objects are very small with respect to the field of view. Hence, in various real-time applications, merging thermal
and visual data can surmount these constraints, offering a more comprehensive comprehension of the environment, and improving
object detection capabilities, especially in challenging and complex scenarios [2].
Image fusion, particularly the amalgamation of thermal and visual images, holds immense importance across various domains.
These techniques allow for a comprehensive understanding of complex scenes and objects, surpassing what can be achieved with
individual images. In fields like medicine, it enhances diagnostic accuracy by combining different modalities, aiding in the detection
and treatment of ailments. In the case of surveillance and security applications, it enables better threat detection and tracking,
ensuring public safety. Furthermore, image fusion finds applications in remote sensing, helping researchers and policymakers
make informed decisions about environmental and resource management. Whether in military operations, medical diagnostics, or
scientific research, image fusion empowers professionals to extract richer insights from data, improving decision-making processes
and outcomes.
Visual images and thermal images possess distinct attributes that make each of them valuable in various applications: Visual
images provide a high level of detail, allowing for the clear identification of objects, textures, and colors. They offer a representation
of the world that aligns with human perception, making them easily interpretable by people. Visual images are effective in well-
lit environments and during daytime operations. They excel in object recognition, making them essential for tasks like facial
recognition, reading signs, and identifying landmarks. Visual images provide the contextual information necessary for understanding
scenes and environments. On the other hand, thermal images capture temperature variations, making them ideal for detecting
heat sources and temperature anomalies. They are highly effective in low-light conditions and complete darkness, as they do not
rely on visible light. Thermal images can penetrate smoke, fog, and different materials, allowing for object detection in obscured
environments. They are valuable for spotting camouflaged or concealed objects, such as hidden individuals or equipment. Thermal
images are less affected by light pollution, making them suitable for a wide variety of applications from urban surveillance to
astronomical observations.
The choice between visual and thermal images depends on the specific requirements of the task and environmental conditions.
Frequently, integrating these two image types results in a more thorough and efficient solution, combining the strengths of each
modality to improve situational awareness and decision-making [3].
This article is an attempt to develop a visible and IR image integration mechanism that can result in fused images enabling clear
differentiation between background and foreground details while maintaining relevant information. The primary contributions of
this work can be outlined as follows:
• A contrast detection system is devised to precisely preserve the contrast details present in the source images.
• An innovative procedure is curated to construct the weight map to efficiently safeguard the complementary details found in
the source images.
• The algorithm proposed outperforms a range of recently developed deep learning-based methods for infrared and visible image
fusion, as witnessed through both qualitative and quantitative evaluations.
2. State-of-the-art techniques
In recent years, the fusion of thermal and visual imagery has emerged as a pivotal area of research in the field of computer vision
and remote sensing. This interdisciplinary pursuit seeks to combine the complementary advantages of thermal infrared and visible
spectrum images to enhance the overall perception and understanding of complex scenes. The integration of thermal and visual data
has found applications in a wide array of domains, including surveillance, autonomous navigation, environmental monitoring, and
medical imaging. To harness the full potential of this fusion technique, researchers have delved into a rich tapestry of methodologies,
algorithms, and applications. In this literature section, we aim to provide a comprehensive survey of the key developments and
trends in thermal and visual image fusion, highlighting the seminal works and critical challenges that have shaped our proposed
methodology.
Situational awareness can be improved with the help of visual and thermal image fusion, as verified by Toet et al. [4] in 1997. Since then, a great deal of research work has been carried out by many experts [5]. Toet et al. [6] have suggested
that reference contour images can potentially be used to design image fusion schemes that align optimally with human visual
perception for various applications and scenarios. A combination of thermal and visible images is used for the detection of people
and abandoned objects separately by Yigit and Temizel [7], reducing false alarms and improving the accuracy of abandoned object
detection. Shreyamsha Kumar [8] introduced the Cross Bilateral Filter (CBF), which considers both pixel similarities and geometric
closeness without blurring edges. It suggested the use of CBF to extract detailed images for fusing source images by weighted average.
A fusion framework that combines multi-scale transform (MST) and sparse representation (SR) is proposed by Liu et al. [9]. It
has merged low-pass bands with SR and high-pass bands using absolute coefficients for activity measurement, followed by inverse
MST. Bavirisetti and Dhuli [10] utilized anisotropic diffusion to decompose the source images into approximation and detail layers
and employ the Karhunen–Loeve transform and weighted linear superposition to obtain the final layers. An image fusion approach
is developed by Liu et al. [11] incorporating Convolutional Sparse Representation (CSR) to address limitations in existing sparse
representation (SR) methods. CSR effectively overcomes drawbacks related to detail preservation and misregistration sensitivity. The
proposed CSR-based fusion framework outperforms traditional SR-based methods in both objective assessment and visual quality,
making it a significant contribution to image fusion techniques.
A multi-scale fusion method for infrared and visible image fusion is proposed by Ma et al. [12]. It employs a multi-scale
decomposition technique using the rolling guidance filter and Gaussian filter to preserve scale information and reduce edge artifacts.
The method optimizes the fusion of base and detail layers through visual saliency mapping and weighted least square optimization.
Bavirisetty et al. [13] introduced an image fusion algorithm that utilizes fourth-order partial differential equations and principal
component analysis, a first in the context of image fusion. The algorithm involves applying these equations to source images to obtain
approximation and detail images, using principal component analysis to find optimal weights for detail images, and then fusing detail
and approximation images. The algorithm developed by Zhang et al. [14] utilized quadtree decomposition and Bézier interpolation
to reconstruct the infrared background and extract bright infrared features. To prevent over-exposure, these features are carefully
incorporated into the visual image. Panda et al. [3] introduced an edge-preserving image fusion method for merging images from
different sensors to highlight informative content into a single image. To address sensor differences and pixel uncertainty, a fuzzy
set-based approach is used in three stages: combining pixel intensities, preserving edges, and fusing results. A pixel-level fusion
algorithm for visible and thermal images is proposed by Panda et al. [15], which used a center sliding window to create an initial
feature map through maximum selection, followed by a final feature map generated using the minimum selection strategy. The
fused image is produced by weighted averaging of these feature maps.
Many researchers have also used learning-based techniques for the fusion of thermal and visual images [16]. Li et al. [17]
presented an image fusion method by combining original images into a single composite image which incorporates distinguishing
details from both sources. It decomposes the source images into base parts, fuses them by weighted averaging, and utilizes a deep
learning network to extract multi-layer features from the detailed content. The final fused detail content is obtained through 𝑙1 -norm
and weighted-average strategies, resulting in a better fused image in both qualitative and quantitative assessment compared to other
methods. A siamese convolutional neural network (CNN) is employed by Liu et al. [18] to create a weight map integrating pixel
activity information from both source images. The approach effectively addresses critical image fusion issues, including activity
level measurement and weight assignment. Using a multi-scale approach with image pyramids and a local similarity-based strategy,
it adapts the fusion mode for decomposed coefficients. A convolutional neural network-based image fusion framework called IFCNN
is proposed by Zhang et al. [19]. It extracts salient features from multiple input images using two convolutional layers and fuses
them based on the input image type (elementwise-max, elementwise-min, or elementwise-mean). The fused features are extracted
through two additional convolutional layers that produce the composite image. The model is fully convolutional, allowing end-to-
end training. It is tested on a large-scale multi-focus image dataset derived from NYU-D2, demonstrating superior generalization
for various image types, including multi-focus, infrared-visual, medical, and multi-exposure images, without fine-tuning on other
datasets. Deng et al. [20] developed a deep convolutional neural network called CUNet for solving multi-modal image restoration
(MIR) and multi-modal image fusion (MIF) tasks. It is inspired by a multi-modal convolutional sparse coding (MCSC) model and
is designed to automatically separate common and unique information in different modalities. CUNet consists of three modules:
unique feature extraction, common feature preservation, and image reconstruction, each using learned convolutional sparse coding
blocks.
Designing deep learning-based image fusion methods is complex and often specific to the fusion task. The challenge lies in
choosing the right strategy to generate fused images. To address this issue, Li et al. [21] proposed the RFN-Nest, an end-to-end
fusion network for infrared and visible image fusion. It uses a residual fusion network (RFN) with novel detail-preserving and
feature-enhancing loss functions, employing a two-stage training approach that includes an auto-encoder with a nest connection
concept in the first stage and training the RFN using the proposed loss functions in the second stage. Zhang et al. [2] addressed the
limitations in RGB-based saliency detection algorithms when dealing with challenging scenarios, low illumination, and occluded
objects. It introduced RGB-T saliency detection, leveraging the robustness of thermal imaging. The proposed method incorporates a
deep feature fusion network with multi-scale, multi-modality, and multi-level feature fusion modules. Many deep learning methods
focus on local features but miss multi-scale characteristics and global dependencies, potentially losing target regions and details. In
order to address this concern, Wang et al. [22] introduced UNFusion, a multi-scale densely connected fusion network that efficiently
extracts and reconstructs features, employs dense skip connections, and uses 𝐿𝑝 normalized attention models to preserve thermal
targets and textures, achieving good scene representation and visual perception on TNO and Roadscene datasets.
Wang et al. [23] introduced Res2Fusion, a fusion network that combines dense Res2Net and double nonlocal attention models
in its design. It leverages multiple receptive fields to extract multiscale features efficiently and uses nonlocal attention models
to capture long-range dependencies. This approach enhances the fusion process, focusing on infrared targets and visible details
for a more effective fused result. The Y-shape Dynamic Transformer method proposed by Tang et al. [24] introduced a Dynamic
Transformer module (DTRM) to capture both local and contextual information. The Y-shaped network structure is designed
to maintain IR information and scene details from the original images. Additionally, a loss function incorporating structural
similarity (SSIM) and spatial frequency (SF) terms is used to enhance fusion quality. Ma et al. [25] developed an image fusion
framework called SwinFusion, using cross-domain long-range learning and the Swin Transform. It includes an attention-guided
cross-domain module with intra-domain and inter-domain fusion units that capture long-range dependencies within and across
domains. The model maintains domain-specific information, integrates cross-domain information, and ensures appropriate global
intensity. SwinFusion addresses multi-scene image fusion with structure preservation, detail retention, and intensity control. A
comprehensive loss function, including SSIM loss, texture loss, and intensity loss, encourages the network to preserve texture details,
structural information, and optimal apparent intensity.
Table 1
The summary of the existing image fusion techniques.

Fusion approaches | Advantages | Disadvantages
Multi-scale decomposition [12–14] | Able to retain the features of the source images at various scales, producing a fused image with highlighted background as well as foreground details. | Generates the fused image with reduced contrast.
Sparse representation [9–11] | Able to retain the textural and structural features accurately from the source images. | The fused image suffers from ringing artifacts.
Salient feature map detection [3–8,15] | Preserves the contrast details in the fused image. | Suffers from halo effects and adds a higher amount of noise to the fused image.
Deep learning based methods [16–28] | Capable of preserving low, mid, and high-level features from the source images. | High space and time complexity against existing traditional approaches.
An unsupervised end-to-end image fusion network coined as U2Fusion, capable of addressing various fusion problems, including multi-modal, multi-exposure, and multi-focus scenarios, is developed by Xu et al. [26]. U2Fusion automatically estimates the importance of source images and adapts information preservation levels,
unifying different fusion tasks in one framework. Training the network to preserve adaptive similarity, mitigates the need for
ground-truth data and specific metrics, allowing a single model to handle multiple fusion tasks. Panda et al. [27] introduced a fusion
technique that combines bi-dimensional empirical mode decomposition (BEMD) with a pre-trained VGG-16 deep neural network.
The approach effectively handles the complexities of infrared and visible images, retaining multi-layer features at different scales
in the frequency domain. The fusion strategy preserves correlated information from the source images using a minimum selection
strategy, resulting in a fused image with reduced artifacts. A semantic-aware real-time image fusion network called SeAFusion is
proposed by Tang et al. [28]. It combined image fusion and semantic segmentation modules, utilizing a semantic loss to enhance
high-level vision task performance on fused images. It also incorporates a gradient residual dense block (GRDB) for improved spatial
detail description.
Deep learning methods have been successful in image fusion, but designing fusion network architectures remains challenging.
Li et al. [29] introduced an approach that mathematically formulates the fusion task and connects the optimal solution to a
network architecture. The result is a lightweight fusion network that avoids the empirical trial-and-error network design process.
The low-rank representation (LRR) objective serves as the foundation, and matrix multiplications are transformed into convolutional
operations. An end-to-end lightweight fusion network is constructed to fuse infrared and visible light images, and a detail-to-semantic
information loss function is used for successful training. Panda et al. [30] presented a scheme for fusing visible and infrared
images based on multi-scale decomposition and salient feature map detection. The technique combines bidimensional empirical
mode decomposition (BEMD) with Bayesian probabilistic fusion. It effectively handles uncertainty in source images and retains
maximum details at multiple scales. Salient feature maps extracted from both types of images preserve common information and
reduce superfluous details, resulting in a fused image that provides complete target scene information with reduced artifacts.
It can be summarized from the above study that various approaches and methods in the field of image fusion exist in the literature.
It covers techniques such as empirical mode decomposition, Bayesian probabilistic fusion, multi-scale decomposition, and salient
feature map detection. Deep learning-based methods, including convolutional neural networks (CNNs), siamese CNNs, and multi-
scale fusion networks, are also used by many researchers, because of their ability to produce better fused images. These methods
aim to combine information from visible and infrared images, enhancing the quality and content of the fused images for applications
like object detection and scene analysis. These image fusion techniques offer enhanced situational awareness, reduced false alarms,
and multi-scale feature preservation. However, they also have limitations related to design complexity, sensitivity to challenging scenarios, computational expense, and sensitivity to noise. The strengths and weaknesses of the existing image fusion methods
are reported in Table 1. Thus a method that can produce better fused images at low complexity and cost would always be preferred.
The algorithm proposed in this article is an attempt to achieve the same.
The rest of the article is organized as follows: Section 3 illustrates the proposed algorithm and the various steps involved in it.
The analysis of results and discussions are carried out in Section 4, which includes objective as well as subjective approaches and
ablation study. Section 5 summarizes the article with concluding remarks.
3. Proposed methodology

This article presents an algorithm for fusing IR and visible images by developing a novel contrast detection map and a unique weight map. The flow chart of the developed algorithm is shown in Fig. 1. The proposed algorithm is also applied to color source image pairs. To do so, each source image is first decomposed into three different channels, named Red (R), Green (G), and Blue (B). Then the corresponding channels of the source images are fused to produce the fused color channels. Finally, the fused color channels are concatenated to generate the fused image, as presented in Fig. 2. The different stages of the developed scheme are discussed as follows.
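As a minimal illustration of this channel-wise colour fusion (a sketch, not the authors' code), the following MATLAB snippet assumes a hypothetical helper fuse_gray that implements the grayscale pipeline of Fig. 1 and uses placeholder file names for a registered source pair.

```matlab
% Sketch of the colour-fusion flow of Fig. 2 (illustrative only).
% 'fuse_gray' is a hypothetical helper implementing the grayscale pipeline of
% Fig. 1; the file names below are placeholders for a registered image pair.
RGB1 = im2double(imread('visible_color.png'));   % visible source image
RGB2 = im2double(imread('infrared_color.png'));  % IR source image
F_rgb = zeros(size(RGB1));
for c = 1:3                                      % R, G and B channels
    F_rgb(:,:,c) = fuse_gray(RGB1(:,:,c), RGB2(:,:,c));
end
imshow(F_rgb);                                   % concatenated fused colour channels
```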
Fig. 1. Flow chart of the proposed algorithm for gray source images.
Fig. 2. Flow chart of the proposed algorithm for color source images.
For the image fusion process, both the high-frequency and low-frequency components are highly essential, as they preserve the various details from the original images. Therefore, in the developed scheme, the source images are decomposed into high-frequency
and low-frequency components using Guided filter [31] and Mean filter, respectively. The Guided filter is a popular image-filtering
tool used for various tasks, such as image denoising, enhancement, and detail preservation. In this work, the Guided filter is used
for detail preservation of the images from several sensors. For detail preservation, the guided filter requires two input
images. The first one is called the guidance image, which is typically a gray-scale image or a single channel extracted from the
input image. The guidance image is used to guide the filtering process by providing information about the desired structures and
details to preserve. The second input is the source image in which the high-frequency details are retained, and the output can be given as
$$G_K(x,y) = f \times I_K + b \tag{1}$$

where 𝑓 is a filter that captures the relationship between the guidance image and the source images 𝐼𝐾 (𝐾 ∈ {1, 2}, where 1 indicates the visible image and 2 represents the IR image), and 𝑏 is a bias term.
The low-frequency components are retained from the images from the various sensors using a Mean (or average) filter. The
Mean or average filter determines the average pixel value within a considered kernel or neighborhood. The Mean filter can reduce
noise in source images effectively. Also, the Mean filter is easy to implement and computationally efficient. Further, the Mean filter
tends to preserve the overall structure of the source images. This means that important features and edges in the source images are
not lost or significantly distorted. The outcomes of the Mean filter can be represented as

$$d_K(x,y) = \frac{1}{sw_\phi \times sw_\phi}\sum_{(i,j)\,\in\,\phi(x,y)} I_K(i,j) \tag{2}$$

where 𝑑𝐾(𝑥, 𝑦) is the outcome of the Mean filter and 𝜙(𝑥, 𝑦) denotes the center sliding window of size 𝑠𝑤𝜙 × 𝑠𝑤𝜙.
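A minimal MATLAB sketch of this two-component decomposition is given below; it is an illustration rather than the authors' implementation, it assumes the Image Processing Toolbox, it uses each source image as its own guidance image, and it adopts the (16, 35) window sizes discussed later in the ablation study.

```matlab
% Sketch: decomposition of a registered grayscale source pair into
% guided-filter components G_K (Eq. (1)) and low-frequency components d_K
% (Eq. (2)). Illustrative only; assumes the Image Processing Toolbox.
I1 = im2double(imread('visible.png'));    % visible image I_1 (placeholder file name)
I2 = im2double(imread('infrared.png'));   % IR image I_2 (placeholder file name)

% Guided filtering with each image guiding itself (an assumption).
G1 = imguidedfilter(I1, I1, 'NeighborhoodSize', [16 16]);
G2 = imguidedfilter(I2, I2, 'NeighborhoodSize', [16 16]);

% 35x35 mean (average) filtering for the low-frequency components.
h  = fspecial('average', 35);
d1 = imfilter(I1, h, 'replicate');
d2 = imfilter(I2, h, 'replicate');
```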
The pixel values typically convey thermal radiation information, leading to clear object identification in the IR image due to
variations in gray-scale levels between the background and objects. This motivation prompted us to create a fused image with a
pixel value distribution comparable to that of the original IR image. Also, the fused image must have textural details of the visible
image to enhance the background information. This motivated us to develop contrast detection maps of the source images that can
preserve the foreground and background information accurately with clear identification of the same with reduced artifacts. The
contrast detection maps of the images from different modalities are obtained by subtracting the coarse (low-frequency) details from the corresponding source images as

$$C_K(x,y) = I_K(x,y) - d_K(x,y) \tag{3}$$
It is observed that the contrast features of the source images are better enhanced and preserved by the use of local standard deviation filtering and local range filtering. The visible image provides the textural details of the target scene, which are enhanced by the use of a local standard deviation filter as
$$L_{sd} = \sigma = \sqrt{\frac{1}{m^{2}-1}\sum_{x=1}^{m}\sum_{y=1}^{m}\left(C_1(x,y)-\bar{C}_1\right)^{2}} \tag{4}$$
where 𝐿𝑠𝑑 indicates the local standard deviation features of the visible image, 𝑚 denotes the size of the 𝑚 × 𝑚 local neighborhood, 𝐶1 represents the contrast detection map of the visible image, and 𝐶̄1 is the mean value of 𝐶1 within the neighborhood. The filtering operation is carried out from left to right and top to bottom, resulting in an image as
$$L_{sd} = \begin{bmatrix} L_{sd}(1,1) & L_{sd}(1,2) & \cdots & L_{sd}(1,N) \\ L_{sd}(2,1) & L_{sd}(2,2) & \cdots & L_{sd}(2,N) \\ \cdots & \cdots & \cdots & \cdots \\ L_{sd}(M,1) & L_{sd}(M,2) & \cdots & L_{sd}(M,N) \end{bmatrix} \tag{5}$$
The IR image provides the object details of the target scene. Further, the target details can be enhanced using the Local Range filter over a 3 × 3 local neighborhood as

$$L_{rg}(x,y) = f\bigl(C_2(x+i,\, y+j)\bigr), \quad -1 \le i, j \le +1 \tag{6}$$

where 𝑖, 𝑗 range from −1 to +1, 𝑓 is a function that computes the filtered value for a pixel in the neighborhood based on the range (maximum minus minimum) of pixel values, and 𝐶2 indicates the contrast detection map of the IR image. The filtering is carried out in a similar manner to the local standard deviation filter.
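Continuing from the variables of the previous sketch, the contrast detection maps and their enhancement can be approximated in MATLAB as follows; stdfilt and rangefilt implement local standard deviation and local range filtering, and the 9 × 9 neighbourhood passed to stdfilt is an assumed value, since the text only specifies an m × m window.

```matlab
% Sketch (continues the previous snippet): contrast detection maps, Eq. (3),
% and their enhancement following Eqs. (4)-(6).
C1 = I1 - d1;                  % contrast map of the visible image
C2 = I2 - d2;                  % contrast map of the IR image

Lsd = stdfilt(C1, true(9));    % local standard deviation filtering (m = 9 assumed)
Lrg = rangefilt(C2);           % local range filtering over a 3x3 neighbourhood
```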
In this work, we have developed a unique weight map generation technique to preserve the complementary information from
the original images accurately as
$$w_1(x,y) = \frac{L_{sd}(x,y)}{\max\bigl(L_{sd}(x,y)\bigr)} \tag{7}$$

$$w_2(x,y) = \frac{L_{rg}(x,y)}{\max\bigl(L_{rg}(x,y)\bigr)} \tag{8}$$
The salient detail map 𝐹1(𝑥, 𝑦), which preserves the significant details from the images of the different modalities, can be obtained using the weight maps 𝑤𝐾(𝑥, 𝑦) and the Mean filter outcomes 𝑑𝐾(𝑥, 𝑦) as
$$F_1(x,y) = \begin{cases} w_1(x,y) \times d_1(x,y) + \bigl(1 - w_1(x,y)\bigr) \times d_2(x,y), & \text{if } w_1(x,y) > w_2(x,y) \\ \bigl(1 - w_2(x,y)\bigr) \times d_1(x,y) + w_2(x,y) \times d_2(x,y), & \text{else} \end{cases} \tag{9}$$
The salient detail map achieved using Eq. (9) is incapable of highlighting the object details as the IR image possesses high
uncertainty with low contrast and poor resolution characteristics. Therefore, to retain the prominent detail map (𝐹2 (𝑥, 𝑦)) from the
source images, we have utilized the 𝑚𝑎𝑥(⋅) selection operator at corresponding locations of the high-frequency components
of the images from different modalities as
$$F_2(x,y) = \begin{cases} G_1(x,y), & \text{if } G_1(x,y) > G_2(x,y) \\ G_2(x,y), & \text{else} \end{cases} \tag{10}$$
The salient detail map 𝐹1(𝑥, 𝑦) is preserved from the source images and highlights the visual details. However, 𝐹1(𝑥, 𝑦) is incapable of retaining the IR thermal radiation information accurately. Similarly, the details 𝐹2(𝑥, 𝑦) achieved from the high-frequency components highlight the thermal radiation information. However, the 𝐹2(𝑥, 𝑦) details are incapable of preserving the textural
details from the images from various sensors. To improve the object’s situational awareness in the fused image, it is essential to
comprehensively analyze the background as well as foreground information. It is observed that the 𝐹1 (𝑥, 𝑦) map provides better
background information preserved from the visible images against 𝐹2 (𝑥, 𝑦) details. However, it is found that the generated 𝐹2 (𝑥, 𝑦)
map retains the required thermal radiation information accurately from the IR image compared to the 𝐹1 (𝑥, 𝑦) map. Therefore, to
acquire both background and foreground information from the source images, we have combined 𝐹1 (𝑥, 𝑦) map and 𝐹2 (𝑥, 𝑦) map to
reconstruct the fused image 𝐹(𝑥, 𝑦) with less noise. The fused image can be determined as

$$F(x,y) = F_1(x,y) + F_2(x,y) \tag{11}$$
Algorithm 1 : Infrared & Visible Image Fusion Using Contrast Detection Map and Weight Map
Input: Visible image indicated by 𝐼1 and Infrared image indicated by 𝐼2 .
− Decompose 𝐼1 and 𝐼2 into various components with diverse details using equations (1)-(2).
− Obtain contrast detection map of the source images using equation (3).
∗ Contrast features of the 𝐼1 are preserved and enhanced using equations (4)-(5).
∗ Contrast features of the 𝐼2 are preserved and enhanced using equation (6).
− Compute the weight maps 𝑤1 and 𝑤2 using equations (7)-(8).
− for every pixel of 𝐼1 and 𝐼2 ;
∗ Compute 𝐹1 using equation (9).
if 𝑤1 > 𝑤2
𝐹1 is computed using equation (9) with 𝑤1 only.
else
𝐹1 is computed using equation (9) with 𝑤2 only.
∗ Compute 𝐹2 using equation (10).
if 𝐺1 > 𝐺2
𝐹2 = 𝐺1 .
else
𝐹2 = 𝐺2 .
− Obtain the fused image using equation (11).
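The remaining steps of Algorithm 1 can be sketched in MATLAB as follows, continuing from the variables of the earlier snippets; the final summation implementing Eq. (11) reflects an assumption about how the two maps are integrated.

```matlab
% Sketch of Algorithm 1 (continues the earlier snippets; illustrative only).
w1 = Lsd ./ (max(Lsd(:)) + eps);            % weight map of Eq. (7)
w2 = Lrg ./ (max(Lrg(:)) + eps);            % weight map of Eq. (8)

% Salient detail map F1 from the low-frequency components, Eq. (9).
F1  = (1 - w2) .* d1 + w2 .* d2;            % else branch
idx = w1 > w2;                              % pixels where the visible weight dominates
F1(idx) = w1(idx) .* d1(idx) + (1 - w1(idx)) .* d2(idx);

% Prominent detail map F2 from the guided-filter components, Eq. (10).
F2 = max(G1, G2);

% Fused image, Eq. (11) (integration assumed here to be a direct sum).
F = mat2gray(F1 + F2);                      % rescaled to [0, 1] for display
figure; imshow(F); title('Fused image');
```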
4. Results and discussions

The designed architecture has been implemented on a Core i5 system equipped with 16 GB of RAM. The experimentation is carried out using the MATLAB 2017a platform on a Windows 10 operating system. Our testing involved a comprehensive assessment
of the proposed scheme on a diverse set of image pairs collected from the TNO benchmark database [32], and [16] encompassing
various challenging scenarios such as smoky environment, variations in illumination, occluded objects, and non-uniform lighting
conditions. The TNO benchmark dataset [32] comprises 63 pairs of images showcasing visual (400–700 nm), near-infrared (700–
1000 nm), and long-wave infrared (8–14 μm) nighttime scenes featuring diverse surveillance and military scenarios. These images
encompass individuals engaged in activities like walking, running, or remaining stationary, carrying assorted objects, alongside
Fig. 3. Visual demonstration of Street images: (a) Visible frame, (b) IR frame, outcomes attained by image fusion technique based on (c) CBF [8], (d) CSR [11],
(e) CSMCA [36], (f) RP [9], (g) Pixel-level [15], (h) VGG-16 [27], (i) DNN [33], (j) RFN-Nest [21], (k) LRRNet [29] and (l) Proposed approach.
vehicles, buildings, foliage, and artificial structures. Different sensors such as Athena, DHV, FEL, and TRICLOBS were utilized
to capture these images, which have been meticulously aligned and transformed to ensure pixel-level correspondence between
related image pairs. The imagery is taken during nighttime in a range of outdoor settings, encompassing both rural and urban
environments. Also, we have collected source images, including the manCall images and color images, from [16], encompassing a diverse
array of environments and work situations, including indoor and outdoor scenarios, situations with low light, and instances of
overexposure. The effectiveness of the developed technique has been validated through both qualitative and quantitative analysis.
We have presented visual analysis results on a few considered sets of source image pairs, while objective assessments are provided
for all image pairs. This section is further divided into three parts: subjective analysis, objective assessment, and discussions along
with future work and ablation study.
To verify the effectiveness of the proposed algorithm, we conducted a comparative analysis by comparing its results with
those achieved by fourteen deep learning-based image fusion frameworks including IFCNN [19], CUNet [20], RFN-Nest [21],
Res2Fusion [23], YDTR [24], SwinFusion [25], SeAFusion [28], U2Fusion [26], LRRNet [29], VGG-16 [27], DNN [33], ResNet [34],
DLF [17], and CNN [18]. To further assess the proposed algorithm, it is also compared with thirteen non-deep learning image fusion methods, including CBF [8], WLS [12], LRR [35], RP [9], Fuzzy edge [3], Pixel-level [15], CSR [11], CSMCA [36], MGFF [37], BPS [30], ADF [10], FPDE [13], and IEFVIP [14].
The images captured by the various modalities and the outcomes attained by the developed and existing fusion techniques on the Street images are presented in Fig. 3. The encircled red regions are zoomed in and showcased after every image. Fig. 3(a) and (b) depict the visible and IR images, respectively. Fig. 3(c) and (f) indicate the outcomes acquired by the CBF [8] and RP [9] methods, where
the fused images contain more artifacts. The outcomes attained by the CSR [11] and CSMCA [36] techniques are shown in Fig. 3(d)
and (e) respectively. From the Fig. 3(d) and (e), it is found that the said methods are incapable of preserving the required details in
the fused images due to ringing artifacts. The Pixel-level [15] technique is incapable of preserving the edge details in the outcome
Fig. 4. Visual demonstration of Camp images: (a) Visible frame, (b) IR frame, outcomes attained by image fusion technique based on (c) CBF [8], (d) CSR [11],
(e) CSMCA [36], (f) RP [9], (g) Pixel-level [15], (h) VGG-16 [27], (i) DNN [33], (j) RFN-Nest [21], (k) LRRNet [29] and (l) Proposed approach.
as represented in Fig. 3(g). The deep learning-based methods, including VGG-16 [27] (Fig. 3(h)), DNN [33] (Fig. 3(i)), RFN-Nest [21] (Fig. 3(j)), and LRRNet [29] (Fig. 3(k)), are incapable of retaining the contrast information in the fused image. The outcome
attained by the proposed fusion technique is presented in Fig. 3(l) where the contrast details are preserved accurately in the fused
image. Also, the proposed technique acquired edge details from the original images effectively with minimal artifacts. Therefore, the
outcomes achieved by the developed technique differentiate the foreground as well as background information accurately against
the existing fusion techniques. Again, similar kinds of observations are attained for the Camp image shown in Fig. 4. From Fig. 4(l),
it is found that the fused image attained by the proposed technique preserves better contrast details and textural information from
the original images against all the existing fusion techniques. Therefore, the proposed algorithm may be suitable for real-time
applications and can be used for 24-hour video surveillance purposes. Further, the proposed technique is visually demonstrated
on a manCall image against different existing techniques in Fig. 5. Fig. 5(a) and (b) indicate the visible and
IR image, respectively. The outcome attained by the recent transformer-based approach YDTR [24] is shown in Fig. 5(c) where
the details are not clearly visible. Fig. 5(d) indicates the fused image acquired by the WLS [12] method in which the foreground
details are not accurately preserved. The outcome attained by the U2Fusion [26] method is shown in Fig. 5(e), where it is observed that the said technique is incapable of retaining the thermal radiation information. Fig. 5(f) and (g) show the results obtained by the SwinFusion [25] and SeAFusion [28] techniques, where the textural details are not well preserved. The results obtained by
the ResNet [34], MGFF [37], IFCNN [19], and DLF [17] techniques are presented in Fig. 5(h), (i), (j) and (k), respectively. From
these Figures, it is observed that the said techniques highlight the object with lesser thermal radiation information. In contrast, the
outcome attained by the proposed technique is represented in Fig. 5(l) where the textural and thermal radiation information are
preserved accurately from the images of IR and visible sensors against all the state-of-the-art techniques. Therefore, the fused image
attained by the proposed technique highlights the background and foreground details simultaneously for a better representation of
the object situations. This corroborates the effectiveness of the proposed technique, which outperforms the existing fusion techniques.
Fig. 5. Visual demonstration of manCall images: (a) Visible frame, (b) IR frame, outcomes attained by image fusion technique based on (c) YDTR [24], (d)
WLS [12], (e) U2Fusion [26], (f) SwinFusion [25], (g) SeAFusion [28], (h) ResNet [34], (i) MGFF [37], (j) IFCNN [19], (k) DLF [17] and (l) Proposed approach.
Fig. 6. Visual demonstration of color images: (a) Visible frame, (b) IR frame, outcomes attained by image fusion technique based on (c) ADF [10], (d) CBF [8],
(e) CNN [18], (f) DLF [17], (g) FPDE [13], (h) IEFVIP [14], (i) LRR [35], (j) ResNet [34] and (k) Proposed approach.
Also, the outcomes attained by the developed and existing techniques for color source images are represented in Fig. 6. From
Fig. 6(k), it is found that the proposed technique is capable of enhancing scene details with reduced artifacts against the existing
IR and visible image fusion schemes.
It is important to understand that no fusion strategy can perform exceptionally well on every image. Similarly, no image can be
ideal for all fusion algorithms. Furthermore, comparing different fusion algorithms through subjective assessment can be challenging
Table 2
Quantitative comparison of the proposed scheme with different deep learning-based fusion methods.

Approaches | EN ↑ | MI ↑ | 𝑁𝑎𝑏𝑓 ↓ | SD ↑
IFCNN [19] | 6.59545 | 13.1909 | 0.17959 | 66.87578
CUNet [20] | 6.13996 | 12.27992 | 0.16574 | 43.53543
RFN-Nest [21] | 6.84134 | 13.68269 | 0.07288 | 71.90131
Res2Fusion [23] | 6.67774 | 13.35549 | 0.09223 | 67.27749
YDTR [24] | 6.22681 | 12.45363 | 0.02167 | 51.48819
SwinFusion [25] | 6.68096 | 13.36191 | 0.12478 | 80.41930
U2Fusion [26] | 6.75708 | 13.51416 | 0.29088 | 64.91158
LRRNet [29] | 6.85836 | 13.71673 | 0.14168 | 81.78905
Proposed | 6.86815 | 13.73269 | 0.15390 | 78.16158
Table 3
Quantitative comparison of the proposed scheme with existing traditional fusion methods.

Approaches | EN ↑ | MI ↑ | 𝑁𝑎𝑏𝑓 ↓ | 𝐹𝑀𝐼𝑤 ↑ | 𝑄𝑎𝑏𝑓 ↑
CBF [8] | 6.85745 | 13.71498 | 0.31727 | 0.32350 | 0.43961
WLS [12] | 6.64071 | 13.28142 | 0.21257 | 0.37662 | 0.50077
LRR [35] | 6.35743 | 12.71486 | 0.01596 | 0.38257 | 0.41277
RP [9] | 6.50143 | 13.00287 | 0.22677 | 0.40027 | 0.46364
Fuzzy edge [3] | 6.59686 | 13.19372 | 0.28250 | 0.35211 | 0.35838
Pixel-level [15] | 5.98699 | 11.97398 | 0.00563 | 0.38781 | 0.32259
BPS [30] | 6.17464 | 12.34927 | 0.00102 | 0.40862 | 0.34372
Proposed | 6.86815 | 13.73269 | 0.15390 | 0.41634 | 0.47196
due to the subtle differences in visually presented fusion results. Therefore, conducting a quantitative analysis of any algorithm is
imperative to assess its effectiveness. Hence, for the proposed technique we have used quantitative assessments including Entropy (EN) [3], Mutual information (MI) [3], the amount of noise added in the fused image (𝑁𝑎𝑏𝑓) [15], Standard deviation (SD) [29], Wavelet features (𝐹𝑀𝐼𝑤) [17], and 𝑄𝑎𝑏𝑓 [22], which assesses edge information by considering both the strength of edges and their ability to preserve orientation. If the numerical values of EN, MI, SD, 𝐹𝑀𝐼𝑤, and 𝑄𝑎𝑏𝑓 are higher with a lower value of 𝑁𝑎𝑏𝑓, then the effectiveness of a data fusion technique is regarded as superior.
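As an illustration, two of the simpler measures, EN and SD, can be computed for a fused image F in the range [0, 1] (for example, the F of the earlier sketch) as shown below; these are the standard definitions and may differ in detail from the evaluation code used to produce Tables 2 and 3.

```matlab
% Sketch: entropy (EN) and standard deviation (SD) of a fused image F in [0,1].
EN = entropy(F);        % Shannon entropy of the 256-bin grayscale histogram
SD = std2(255 * F);     % standard deviation of the 8-bit pixel intensities
```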
To test the efficiency of the proposed technique we have compared the quantitative measurements obtained by it against different
existing deep learning-based frameworks. Table 2 depicts the objective assessment of the different recent image fusion frameworks
based on deep-learning with proposed technique using average values of EN, MI, 𝑁𝑎𝑏𝑓 and SD for all source images available in
the benchmark TNO dataset. In Table 2 best performing results are highlighted in bold. Here, it may be noticed that the developed
technique achieves better values of EN and MI against all the existing techniques. A higher value of entropy and mutual information
in the fused image indicates greater information content and better preservation of information, respectively. Also, the proposed
algorithm attains higher values of SD as compared to the IFCNN [19], CUNet [20], RFN-Nest [21], Res2Fusion [23], YDTR [24], and
U2Fusion [26] IR and visible image fusion techniques. A higher value of standard deviation in the fused image indicates greater
variability or spread of pixel intensities, which can be associated with increased image contrast or texture. However, the SD value
attained by the proposed technique is comparable with SwinFusion [25] and LRRNet [29] techniques. In contrast, the amount of
noise added in the fused image acquired by the proposed technique is lower against IFCNN [19], CUNet [20], and U2Fusion [26]
techniques. Furthermore, the amount of noise added in the fused image attained by the proposed technique is slightly higher than that of the RFN-Nest [21], Res2Fusion [23], YDTR [24], SwinFusion [25], and LRRNet [29] fusion techniques. Therefore, the fused image attained by the proposed algorithm contains slightly more artificial information compared to RFN-Nest [21], Res2Fusion [23], YDTR [24], SwinFusion [25], and LRRNet [29].
Also, the developed method is validated against several non-deep learning-based image fusion methods. Table 3 depicts the
objective assessment of the various non-deep learning-based image fusion methods and proposed technique using average values of
EN, MI, 𝑁𝑎𝑏𝑓 , 𝐹 𝑀𝐼𝑤 , and 𝑄𝑎𝑏𝑓 quantitative measures. In Table 3 best performing results are highlighted in bold. Here, it can be
noticed that the developed technique achieves higher values of EN, MI, and 𝐹 𝑀𝐼𝑤 against all the existing techniques. This indicates
that the fused image acquired by the proposed algorithm consists of significant information with dominating wavelet features that
enhance the visual perception of the challenging source pairs. Further, Table 3 depicts that the proposed technique has a greater
value of 𝑄𝑎𝑏𝑓 measure against all the state-of-the-art techniques except WLS [12] method. However, the 𝑄𝑎𝑏𝑓 value achieved by
the proposed technique is comparable with WLS [12] method. The higher value of 𝑄𝑎𝑏𝑓 measure indicates that the fused image
obtained by the proposed technique is capable of preserving required textural details with enhanced edge strength. Furthermore,
Table 3 shows that the proposed technique attained a lower value of 𝑁𝑎𝑏𝑓 against CBF [8], WLS [12], RP [9], and Fuzzy edge [3]
techniques. This indicates that the outcomes attained by the proposed technique contain a lesser amount of artificial information compared to these techniques. However, the proposed technique introduces a higher amount of artifacts than the LRR [35], Pixel-level [15], and BPS [30] techniques.
In this article, we have developed a novel IR and visible image fusion technique using a contrast map and weight map. In the
proposed algorithm, we begin by decomposing the source images into high-frequency and low-frequency components by using a
Guided filter and a mean filter, respectively. The guided filter is used for tasks such as edge-preserving, detail enhancement, and
tone mapping. It can effectively preserve structural details in an image while reducing noise and unwanted details. The average
filter is a simple linear filter that determines the average value of pixel intensities in a neighborhood around each pixel. Therefore,
it is effective for smoothing or blurring images, reducing noise, and preserving low-frequency components. To retain both the
edge details, and structural details with reduced artifacts the source images are decomposed into high as well as low-frequency
components using a Guided filter and Average filter respectively. We introduce a novel contrast detection mechanism to effectively
preserve the contrast details found in the source images. The contrast details of the source images are retained by subtracting the
coarse details from the source images which can produce contrast detection maps of both the images. The contrast detection map
of the visible image highlights the fine details of the background and the contrast detection map of the IR image highlights the fine
details of the foreground. Therefore, the outcomes obtained by the proposed technique contain fine details of the source images that
are highly essential for image fusion. Furthermore, we enhance the contrast details in the visible and IR images by applying local
standard deviation filtering and local range filtering, respectively. A local standard deviation filter, also known as a local contrast filter, determines the standard deviation of pixel values in a local neighborhood around each pixel. It preserves edges and structures in visible images while providing smoothing in homogeneous regions, making it suitable for
retaining detailed information. Local range filtering, also known as local range normalization or local range adjustment, involves
modifying pixel values based on the range of values in their local neighborhood. By considering the range of pixel values in a local
neighborhood of the IR image, the filter can amplify the differences between neighboring pixels, making structures and details more
pronounced.
Choosing a weight map construction strategy for any algorithm depends on the specific goals of the image processing or computer
vision task based on images from various modalities. Weight maps often assign different levels of importance or influence to pixels
in an image. In the proposed algorithm as we are dealing with IR and visible source images which are having complementary
information, the proposed weight map construction strategy utilizes the outcomes of the local standard deviation filter and local
range filter to generate the weight values. These weight values are capable of capturing significant information from both the
source images with reduced redundant information. We have devised a strategy for constructing weight maps that efficiently retain
complementary information from both IR and visible images. These weight maps, in conjunction with gradient details from the
source images, play a crucial role in preserving the salient feature details originating from the various modalities. Evaluating the
salient feature details of an image involves assessing how well certain features or regions stand out and capture attention. In the
proposed algorithm from Fig. 1, it can be well remarked that the salient feature map (𝐹1 ) highlights areas that attract human
attention and can guide the weighting strategy. We employ a decision-making approach on the high-frequency components of the
source images to extract the prominent feature details. Evaluating prominent feature details (𝐹2 ) in an image involves assessing the
significance and visibility of specific regions of the source images. Human perception plays a vital role in assessing the prominence
of features. Visual inspection by observers can provide qualitative insights into which features stand out and capture attention as
shown in Fig. 1. Also, prominent feature details identify regions that are likely to attract attention, that is, prominent features
coincide with salient objects or regions. Further, we combine the salient and prominent feature details to generate the fused image.
The salient features extract the textural details from the source images, and the prominent features retain the foreground information appropriately. Therefore, integrating both preserves the required significant details from the source images. Also, the
integrated image enhances the visual perception that provides the object’s situational awareness accurately.
We assess the efficiency of our proposed algorithm by conducting evaluations that encompass both subjective and objective
measures. The experimental analysis reveals that our approach outperforms both traditional and deep learning-dependent fusion
techniques in terms of accuracy across all the sources included in the TNO benchmark database and source images available at [16].
The deep learning approaches are capable of extracting diverse features automatically including low, mid, and high-level features
from the source images. However, in the deep learning-based fusion approaches the fused images produced are more dependent
on high-level features rather than low-level features. Also, due to the high correlation among the extracted features, using deep
learning-based fusion algorithms may provide a higher amount of redundant information with enhanced artifacts. Further, the use
of downsampling operation in the deep learning framework reduces the spatial information which degrades the performance of
the image fusion algorithm. In our proposed work, we have used the feature extraction technique which preserves the significant
information providing a better fusion strategy over deep learning-based approaches.
The performance measures reported in Tables 2 and 3 are collected from articles published in highly reputed journals and conferences, and can be verified from the published articles [15,29], and [30]. We have also used the same datasets that were used in these articles. Therefore, we have collected the results from their published articles as such, without any further tuning
of parameters from our side. The experimentation done on the proposed method and the parameters used for the same are described
appropriately in the ablation study section of the proposed work. As the proposed work is a non-deep learning-based approach very
few parameters require tuning which is justified in the ablation study of the article. The performance measures compared in Table 2
include the EN, MI, 𝑁𝑎𝑏𝑓 , and SD for the techniques IFCNN, CUNet, RFNNest, Res2Fusion, YDTR, SwinFusion, U2Fusion, and LRRNet
collected from [29]. Similarly, the performance measures compared in Table 3 include the EN, MI, 𝑁𝑎𝑏𝑓, 𝐹𝑀𝐼𝑤, and 𝑄𝑎𝑏𝑓 for the techniques CBF, WLS, LRR, RP, Fuzzy edge, Pixel-level, and BPS, which are simulated with the original tuning parameters reported in their respective published articles. The 𝑁𝑎𝑏𝑓 value for the Pixel-level technique is reported in [15].
Table 4
Quantitative comparison of the proposed scheme on the street image with different window sizes.

Guided and Mean filter window sizes | 𝐹𝑀𝐼𝑝𝑖𝑥𝑒𝑙 ↑ | SCD ↑ | 𝑁𝑎𝑏𝑓 ↓ | SSIM ↑ | 𝑀𝑆𝑆𝑆𝐼𝑀 ↑
(10, 35) | 0.90716 | 1.46673 | 0.17347 | 0.61086 | 0.91578
(16, 35) | 0.90884 | 1.47006 | 0.13768 | 0.61127 | 0.90210
(20, 35) | 0.90515 | 1.46161 | 0.13583 | 0.61101 | 0.89339
(16, 30) | 0.90589 | 1.43554 | 0.14003 | 0.61132 | 0.89984
(16, 40) | 0.90588 | 1.46949 | 0.14833 | 0.61067 | 0.90238
The 𝑁𝑎𝑏𝑓 values for the CBF, WLS, LRR, RP, Fuzzy edge, and BPS techniques are reported in [30]. The remaining performance measures are calculated from the simulated results. The experiments show that the proposed technique introduces slightly more artifacts in the fused images. This results from the limitations of the Guided and Mean filters in handling the high uncertainty that arises from the variability in source image pixel brightness values and sensor noise. Hence, future exploration of fuzzy-induced deep learning-based algorithms for IR and visible image fusion may be capable of reducing the artifacts in the fused image.
In this work, we have conducted an ablation study to check the efficiency of the various components used in the proposed
algorithm. Initially, the effectiveness of the proposed algorithm is verified without and with local standard deviation and local
range filtering. From Fig. 7, it is observed that the proposed technique with local standard deviation and local range filtering is
able to enhance the details in the fused image effectively compared to the proposed technique without local standard deviation
and local range filtering. Fig. 7(c) indicates the result with local standard deviation and local range filtering and Fig. 7(d) denotes
the result without local standard deviation and local range filtering. The respective yellow, red, and blue regions are zoomed in to
demonstrate the effectiveness of the usage of the aforementioned filters. From the zoomed regions of Fig. 7(c) and (d) it can be well
remarked that the usage of filters extracts better information in contrast to without using the filters.
Also, we have conducted an ablation study to validate the effectiveness of the proposed weight map generation strategy. From Fig. 8(c), it is observed that the proposed technique without the unique weight map generation strategy is incapable of extracting fine as well as coarse-scale features from the source images in the fused image. The result presented in Fig. 8(c) uses the weight maps
$$w_1(x,y) = \frac{L_{sd}(x,y)}{L_{sd}(x,y) + L_{rg}(x,y)} \tag{12}$$

$$w_2(x,y) = \frac{L_{rg}(x,y)}{L_{sd}(x,y) + L_{rg}(x,y)} \tag{13}$$
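For comparison, a short MATLAB sketch of this alternative, sum-normalized weighting (using the Lsd and Lrg variables of the earlier snippets) is:

```matlab
% Sketch: ablation weight maps of Eqs. (12)-(13), normalized by the sum of the
% two filter responses instead of their maxima as in Eqs. (7)-(8).
w1_alt = Lsd ./ (Lsd + Lrg + eps);
w2_alt = Lrg ./ (Lsd + Lrg + eps);
```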
In contrast, the result showcased in Fig. 8(d) uses the unique weight map generation strategy developed in Eqs. (7)–(8). From Fig. 8(d), it is observed that the proposed technique with the unique weight map is able to attain the small and large scale details accurately in the fused image, unlike the proposed technique without it. We have also tried a simple weighted average fusion strategy, where 𝑤1 = 0.5 and 𝑤2 = 0.5 are used to generate the salient detail map (𝐹1(𝑥, 𝑦)); nevertheless, the experimental results achieved with this strategy contain lesser contrast details.
The optimal window sizes for the Guided filter and the Mean filter were established after implementing the proposed algorithm with different sizes. There are only two exceptions where the optimal filter sizes deviate from a Guided filter window size of 16 and a Mean filter window size of 35. After adjusting the proposed algorithm to these optimal parameters, the street image from the TNO dataset undergoes fusion, and the resulting fusion metric values, which include 𝐹𝑀𝐼𝑝𝑖𝑥𝑒𝑙, SCD, 𝑁𝑎𝑏𝑓, SSIM, and 𝑀𝑆𝑆𝑆𝐼𝑀 [34], are recorded in Table 4. In Table 4, the best performing results are highlighted in bold. From Table 4 it may
be noticed that the proposed algorithm with Guided and Mean filter window sizes of (16, 35) attained higher values of 𝐹 𝑀𝐼𝑝𝑖𝑥𝑒𝑙
and SCD values against various window sizes used in Guided and Mean filter. Higher values of 𝐹 𝑀𝐼𝑝𝑖𝑥𝑒𝑙 and SCD indicate that the
results obtained by the developed method with window sizes (16, 35) are highly correlated with source images and contain enhanced
details. Also, from Table 4 it is observed that the proposed technique with the optimized window sizes (16, 35) attained satisfactory values of 𝑁𝑎𝑏𝑓, SSIM, and 𝑀𝑆𝑆𝑆𝐼𝑀, which shows that the fused images preserve better structural details with lesser artifacts. Therefore, in the proposed algorithm we have considered the optimized window sizes (16, 35), rather than (10, 35), (20, 35), (16, 30), and (16, 40), for simulation. In this work, we have also studied the computational time of the proposed algorithm as
reported in Table 5. The experimentation is performed on the TNO dataset with the highest resolution image pair of size 632 × 496,
expressed in seconds for the existing and proposed techniques. From Table 5 it is observed that the developed algorithm takes less
time compared to existing deep learning as well as non-deep learning-based algorithms. Therefore, the proposed algorithm may be
suitable for real-time applications.
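As an illustration of the two-scale decomposition with the optimised window sizes (16, 35), a minimal sketch is given below; the self-guided grey-scale guided filter follows He et al. [31], while eps = 0.01 and the definition of the high-frequency component as the residual of the filtered image are assumptions made for this sketch rather than the authors' exact formulation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, size=16, eps=0.01):
    # Grey-scale guided filter (He et al. [31]) built from box means of window `size`.
    mean_g = uniform_filter(guide, size)
    mean_s = uniform_filter(src, size)
    cov_gs = uniform_filter(guide * src, size) - mean_g * mean_s
    var_g = uniform_filter(guide * guide, size) - mean_g * mean_g
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    # Averaging the linear coefficients yields the edge-preserving output.
    return uniform_filter(a, size) * guide + uniform_filter(b, size)

def two_scale_decomposition(img, guided_size=16, mean_size=35):
    # Low-frequency component from the mean (average) filter; high-frequency detail
    # taken as the residual of a self-guided filtering of the image.
    low = uniform_filter(img, mean_size)
    high = img - guided_filter(img, img, guided_size)
    return low, high

# Usage (hypothetical arrays):
# low_ir, high_ir = two_scale_decomposition(ir)
# low_vis, high_vis = two_scale_decomposition(vis)
```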
Fig. 7. Visual demonstration of color images: (a) visible frame, (b) IR frame, and outcomes attained by the proposed image fusion technique (c) with and (d) without local standard deviation and local range filtering.

Fig. 8. Visual demonstration of color images: (a) visible frame, (b) IR frame, and outcomes attained by the proposed image fusion technique (c) without and (d) with the proposed weight map generation strategy.

Table 5
Comparison of computational time of the proposed algorithm with existing fusion techniques.
Fusion approaches    Computational time (in seconds)
CBF [8]              12.05
WLS [12]             2.11
LRR [35]             110.95
Fuzzy edge [3]       7.87
CSR [11]             148.27
CS-MCA [36]          478.72
CNN [18]             51.95
DNN [33]             118.77
Proposed             0.74

5. Conclusion

This article proposes a novel infrared and visible image fusion technique capable of accurately preserving thermal radiation information and edge details from the original images within the fused image. In the proposed algorithm, a Guided filter is employed to acquire the high-frequency components of the source images, which retain subtle details accurately. To preserve the contextual relation among the pixels of the source images, a Mean filter is used to generate the low-frequency components. In this work, we have introduced a novel strategy to produce a contrast detection map that is capable of handling high uncertainties in the source images. Further, the contrast details are enhanced using local standard deviation filtering and local range filtering. Also, we have developed a weight map construction strategy to preserve the complementary details of the source images. Later, the weight values and gradient features are used to extract the salient details of the original images effectively. Further, the prominent feature details of the source images are acquired by utilizing the high-frequency components. Finally, the salient feature details and prominent feature details of the source images are combined to generate the fused image.
From the various experiments conducted on the proposed algorithm, it is observed that the fused image acquired by the proposed technique contains richer details of both the IR and visible images than those of all the existing techniques. The efficacy of the developed method is corroborated against twenty-seven state-of-the-art techniques. The proposed technique is validated through both subjective and objective assessments against the considered existing fusion techniques. It is found that the proposed technique attains better accuracy and better preserves the situational awareness of the scenes.
CRediT authorship contribution statement
Manoj Kumar Panda: Conception and design of study, Data collection, Analysis and interpretation of results, Writing – original draft. Priyadarsan Parida: Conception and design of study, Analysis and interpretation of results. Deepak Kumar Rout: Conception and design of study, Data collection, Analysis and interpretation of results, Writing – original draft.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
The authors have not used any generative AI tools in scientific writing.
References
[1] Subudhi Badri Narayan, Rout Deepak Kumar, Ghosh Ashish. Big data analytics for video surveillance. Multimedia Tools Appl 2019;78(18):26129–62.
[2] Zhang Qiang, Xiao Tonglin, Huang Nianchang, Zhang Dingwen, Han Jungong. Revisiting feature fusion for RGB-T salient object detection. IEEE Trans
Circuits Syst Video Technol 2021;31(5):1804–18.
[3] Panda Manoj Kumar, Subudhi Badri Narayan, Veerakumar Thangaraj, Gaur Manoj Singh. Edge preserving image fusion using intensity variation approach.
In: Proceedings of the region 10 conference. IEEE; 2020, p. 251–6.
[4] Toet Alexander, IJspeert Jan Kees, Waxman Allen M, Aguilar Mario. Fusion of visible and thermal imagery improves situational awareness. Displays
1997;18(2):85–95.
[5] Jin Xin, Jiang Qian, Yao Shaowen, Zhou Dongming, Nie Rencan, Hai Jinjin, et al. A survey of infrared and visual image fusion methods. Infrared Phys
Technol 2017;85:478–501.
[6] Toet Alexander, Hogervorst Maarten A, Nikolov Stavri G, Lewis John J, Dixon Timothy D, Bull David R, et al. Towards cognitive image fusion. Inf Fusion
2010;11(2):95–113.
[7] Yigit Ahmet, Temizel Alptekin. Abandoned object detection using thermal and visible band image fusion. In: Proceedings of the 18th signal processing
and communications applications conference. IEEE; 2010, p. 617–20.
[8] Shreyamsha Kumar BK. Image fusion based on pixel significance using cross bilateral filter. Signal, Image Video Process 2015;9:1193–204.
[9] Liu Yu, Liu Shuping, Wang Zengfu. A general framework for image fusion based on multi-scale transform and sparse representation. Inf Fusion
2015;24:147–64.
[10] Bavirisetti Durga Prasad, Dhuli Ravindra. Fusion of infrared and visible sensor images based on anisotropic diffusion and Karhunen-Loeve transform. IEEE
Sens J 2016;16(1):203–9.
[11] Liu Yu, Chen Xun, Ward Rabab K, Wang Z Jane. Image fusion with convolutional sparse representation. IEEE Signal Process Lett 2016;23(12):1882–6.
[12] Ma Jinlei, Zhou Zhiqiang, Wang Bo, Zong Hua. Infrared and visible image fusion based on visual saliency map and weighted least square optimization.
Infrared Phys Technol 2017;82:8–17.
[13] Bavirisetti Durga Prasad, Xiao Gang, Liu Gang. Multi-sensor image fusion based on fourth order partial differential equations. In: Proceedings of the 20th
international conference on information fusion. 2017, p. 1–9.
[14] Zhang Yu, Zhang Lijia, Bai Xiangzhi, Zhang Li. Infrared and visual image fusion through infrared feature extraction and visual information preservation.
Infrared Phys Technol 2017;83:227–37.
[15] Panda Manoj Kumar, Subudhi Badri Narayan, Veerakumar T, Gaur Manoj Singh. Pixel-level visual and thermal images fusion using maximum and minimum
value selection strategy. In: Proceedings of the IEEE international symposium on sustainable energy, signal processing and cyber security. 2020, p. 1–6.
[16] Zhang Xingchen, Demiris Yiannis. Visible and infrared image fusion using deep learning. IEEE Trans Pattern Anal Mach Intell 2023.
[17] Li Hui, Wu Xiao-Jun, Kittler Josef. Infrared and visible image fusion using a deep learning framework. In: Proceedings of the international conference on
pattern recognition. 2018, p. 2705–10.
[18] Liu Yu, Chen Xun, Cheng Juan, Peng Hu, Wang Zengfu. Infrared and visible image fusion with convolutional neural networks. Int J Wavelets Multiresolut
Inf Process 2018;16(03):1850018.
[19] Zhang Yu, Liu Yu, Sun Peng, Yan Han, Zhao Xiaolin, Zhang Li. IFCNN: A general image fusion framework based on convolutional neural network. Inf
Fusion 2020;54:99–118.
[20] Deng Xin, Dragotti Pier Luigi. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans Pattern Anal Mach Intell
2020;43(10):3333–48.
[21] Li Hui, Wu Xiao-Jun, Kittler Josef. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf Fusion 2021;73:72–86.
[22] Wang Zhishe, Wang Junyao, Wu Yuanyuan, Xu Jiawei, Zhang Xiaoqin. UNFusion: A unified multi-scale densely connected network for infrared and visible
image fusion. IEEE Trans Circuits Syst Video Technol 2021;32(6):3360–74.
[23] Wang Zhishe, Wu Yuanyuan, Wang Junyao, Xu Jiawei, Shao Wenyu. Res2Fusion: Infrared and visible image fusion based on dense Res2net and double
nonlocal attention models. IEEE Trans Instrum Meas 2022;71:1–12.
[24] Tang Wei, He Fazhi, Liu Yu. YDTR: Infrared and visible image fusion via Y-shape dynamic transformer. IEEE Trans Multimed 2022.
[25] Ma Jiayi, Tang Linfeng, Fan Fan, Huang Jun, Mei Xiaoguang, Ma Yong. SwinFusion: Cross-domain long-range learning for general image fusion via swin
transformer. IEEE/CAA J Autom Sin 2022;9(7):1200–17.
[26] Xu Han, Ma Jiayi, Jiang Junjun, Guo Xiaojie, Ling Haibin. U2Fusion: A unified unsupervised image fusion network. IEEE Trans Pattern Anal Mach Intell
2022;44(1):502–18.
[27] Panda Manoj Kumar, Subudhi Badri N, Veerakumar T, Jakhetiya Vinit. Integration of bi-dimensional empirical mode decomposition with two streams deep
learning network for infrared and visible image fusion. In: Proceedings of the 30th European signal processing conference. 2022, p. 493–7.
[28] Tang Linfeng, Yuan Jiteng, Ma Jiayi. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion
network. Inf Fusion 2022;82:28–42.
[29] Li Hui, Xu Tianyang, Wu Xiao-Jun, Lu Jiwen, Kittler Josef. LRRNet: A novel representation learning guided fusion network for infrared and visible images.
IEEE Trans Pattern Anal Mach Intell 2023.
[30] Panda Manoj Kumar, Thangaraj Veerakumar, Subudhi Badri Narayan, Jakhetiya Vinit. Bayesian’s probabilistic strategy for feature fusion from visible and
infrared images. Vis Comput 2023;1–13.
[31] He Kaiming, Sun Jian, Tang Xiaoou. Guided image filtering. IEEE Trans Pattern Anal Mach Intell 2012;35(6):1397–409.
[32] Toet Alexander. The TNO multiband image data collection. Data Brief 2017;15:249–51.
[33] Liu Yu, Chen Xun, Peng Hu, Wang Zengfu. Multi-focus image fusion with a deep convolutional neural network. Inf Fusion 2017;36:191–207.
[34] Li Hui, Wu Xiao-jun, Durrani Tariq S. Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Phys Technol
2019;102:103039.
[35] Li Hui, Wu Xiao-Jun. Infrared and visible image fusion using latent low-rank representation. 2018, arXiv preprint arXiv:1804.08992.
[36] Liu Yu, Chen Xun, Ward Rabab K, Wang Z Jane. Medical image fusion via convolutional sparsity based morphological component analysis. IEEE Signal
Process Lett 2019;26(3):485–9.
[37] Bavirisetti Durga Prasad, Xiao Gang, Zhao Junhao, Dhuli Ravindra, Liu Gang. Multi-scale guided image and video fusion: A fast and efficient approach.
Circuits Systems Signal Process 2019;38:5576–605.