Article
IV-YOLO: A Lightweight Dual-Branch Object Detection Network
Dan Tian , Xin Yan, Dong Zhou *, Chen Wang and Wenshuai Zhang
Institute of Electronic Science and Technology, University of Electronic Science and Technology of China,
Chengdu 611731, China; [email protected] (D.T.); [email protected] (X.Y.); [email protected] (C.W.);
[email protected] (W.Z.)
* Correspondence: [email protected]
Abstract: With the rapid growth in demand for security surveillance, assisted driving, and remote
sensing, object detection networks with robust environmental perception and high detection accuracy
have become a research focus. However, single-modality image detection technologies face limitations
in environmental adaptability, often affected by factors such as lighting conditions, fog, rain, and
obstacles like vegetation, leading to information loss and reduced detection accuracy. We propose an
object detection network that integrates features from visible light and infrared images—IV-YOLO—to
address these challenges. This network is based on YOLOv8 (You Only Look Once v8) and employs
a dual-branch fusion structure that leverages the complementary features of infrared and visible light
images for target detection. We designed a Bidirectional Pyramid Feature Fusion structure (Bi-Fusion)
to effectively integrate multimodal features, reducing errors from feature redundancy and extracting
fine-grained features for small object detection. Additionally, we developed a Shuffle-SPP structure
that combines channel and spatial attention to enhance the focus on deep features and extract richer
information through upsampling. Regarding model optimization, we designed a loss function
tailored for multi-scale object detection, accelerating the convergence speed of the network during
training. Compared with the current state-of-the-art Dual-YOLO model, IV-YOLO achieves mAP
improvements of 2.8%, 1.1%, and 2.2% on the Drone Vehicle, FLIR, and KAIST datasets, respectively.
On the Drone Vehicle and FLIR datasets, IV-YOLO has a parameter count of 4.31 M and achieves
a frame rate of 203.2 fps, significantly outperforming YOLOv8n (5.92 M parameters, 188.6 fps on
the Drone Vehicle dataset) and YOLO-FIR (7.1 M parameters, 83.3 fps on the FLIR dataset), which
had previously achieved the best performance on these datasets. This demonstrates that IV-YOLO
achieves higher real-time detection performance while maintaining lower parameter complexity,
making it highly promising for applications in autonomous driving, public safety, and beyond.
Keywords: dual-branch image object detection; IV-YOLO; bi-directional pyramid feature fusion; attention mechanism; small target detection
Citation: Tian, D.; Yan, X.; Zhou, D.; Wang, C.; Zhang, W. IV-YOLO: A Lightweight Dual-Branch Object Detection Network. Sensors 2024, 24, 6181. https://doi.org/10.3390/s24196181
For instance, the Depth Attention Enhancement Module (DAEM) and the RGB-Depth Fu-
sion Module (RDFM) introduced in the MDFN model effectively extract fine texture details
from visible light images [7], addressing issues like insufficient depth information and noise
interference, thereby greatly improving detection performance. The fusion of infrared and
visible light modalities significantly enhances the robustness of object detection. However,
achieving an effective fusion of these two modalities presents a complex challenge, requir-
ing the establishment of meaningful correlations and complementary relationships between
them [8]. Due to the inherent differences between infrared and visible light modalities,
the feature extraction and fusion process must overcome these disparities and fully leverage
the unique strengths of each modality. Thus, ensuring feature complementarity between
the two modalities is critical for successful fusion [9].
By thoroughly investigating the complementary nature of information between dif-
ferent sensing modalities, the model’s ability to capture richer and more diverse target
features can be enhanced, leading to more accurate target recognition and classification.
This approach improves detection accuracy and effectively reduces false positives and
missed detections, equipping the model with greater adaptability to complex and dynamic
real-world scenarios [10]. Furthermore, designing an appropriate network architecture
and loss function is crucial to optimizing the model’s learning capacity and generalization
performance, ensuring stable results across diverse environments [9].
Significant advancements have been made in dual-modal image object detection
in recent years. For instance, FusionNet [11] utilizes Convolutional Neural Networks
(CNNs) as a foundational feature extractor and inputs the fused feature maps into the
object detection network to enhance detection performance. Dual-YOLO [12] introduces
an information fusion module and a fusion shuffling module, enabling the network to
complete infrared images using visible light features, thus improving detection accuracy
and robustness. Guan et al. [13] extract features from infrared and visible light images
separately using deep convolutional neural networks and integrate these features with a
convolutional fusion module. Kim et al. [14] employed a joint training approach to enhance
overall detection performance further. J. Zhu et al. [15] designed a Modal Interaction
Module based on the Transformer architecture, which fuses features from RGB and thermal
infrared images and incorporates a Query Location Module for precise object localization.
Z. Ye et al. [16] proposed the Cross-Modality Fusion Transformer, utilizing a Cross-Modality
Fusion Transformer structure to effectively integrate information from both modalities
through Transformer modules, thereby improving the performance of dual-modal drone
object detection.
However, current dual-modal object detection methods still exhibit four main limi-
tations. First, the precise control and interpretation of dual-modal feature fusion remain
challenging, often resulting in either information redundancy or insufficiency. As noted
by Gao et al. [17], although deep learning has advanced image fusion, the absence of
well-designed loss functions can lead to suboptimal fusion performance. Ataman et al. [18]
also point out that balancing preserving details with eliminating information redundancy
across different spectral features remains problematic. This reflects the limitations of cur-
rent methods in extracting and fusing both deep and shallow features, particularly in
ensuring the sufficiency and effectiveness of the fused information. Second, the loss of deep
information in complex scenes significantly affects detection accuracy, especially during the
forward propagation of fusion networks. Zhao et al. [19] mitigate this issue by employing
a dual-scale decomposition mechanism that processes low-frequency base information
and high-frequency details separately, thereby reducing deep information loss. Similarly,
Wang et al. [20] enhance the retention of background details and highlight infrared features
by introducing multi-scale information through an improved generative adversarial net-
work. Third, extracting fine-grained features and detecting multi-scale objects in complex
backgrounds remain difficult. Liu et al. [21] emphasize the advantages of multi-scale fusion
strategies in addressing occlusion issues in complex scenes. In contrast, Bao et al. [22] high-
light the critical role of multi-scale processing in dealing with intricate backgrounds. Lastly,
the high parameter count of fusion networks results in large model sizes and increased
computational costs, negatively impacting real-time performance. Nousias et al. [23] ex-
plore model compression and acceleration techniques to reduce computational burdens
and meet real-time inference requirements. Additionally, Poeppel et al. [24] stress that
optimizing computational resource consumption and enhancing inference speed are key to
achieving efficient real-time detection.
We propose a real-time object detection method based on the dual-branch fusion of
visible and infrared images to address the aforementioned challenges. This approach effec-
tively overcomes common issues in feature fusion, such as feature disparity, information
redundancy, and model optimization difficulties, enabling efficient and accurate real-time
object detection. Experimental results demonstrate that the proposed dual-branch detection
method significantly improves the detection accuracy of multi-scale objects in complex
environments while maintaining a low parameter count. The main contributions of this
paper are summarized as follows:
(1) Building on the outstanding performance of YOLOv8 [25] in real-time object
detection, we have developed a dual-branch network named IV-YOLO. This network
consists of one branch dedicated to feature extraction from infrared images and another
for feature extraction from visible light images. Additionally, a small object detection
layer is incorporated into the neck network. This approach effectively addresses issues
of background complexity and target feature occlusion inherent in single-modal image
detection, enabling the extraction of fine-grained features of small objects. IV-YOLO
significantly improves detection accuracy for multi-scale objects in complex backgrounds
by maintaining a low parameter count and high frame rate.
(2) We propose a solution incorporating the Bi-Concat module, which employs a
bidirectional pyramid structure for the weighted fusion of features extracted from infrared
and visible light images at different levels. This approach effectively reduces redundant
features and optimizes the fusion of features from the two modalities.
(3) We have designed the Shuffle-SPP structure, which uses Spatial Pyramid Pooling
(SPP) to extract features at multiple scales, thereby enhancing the model’s ability to detect
multi-scale objects. Subsequently, an efficient hierarchical aggregation mechanism is em-
ployed to merge features from different levels, further improving feature representation.
This approach also emphasizes different parts of the features to enhance the Precision
of object details and localization. The multi-level feature fusion, shuffling, and pooling
operations effectively minimize information loss and retain more critical details, improving
overall network performance. A loss function is also introduced to enhance the focus on
small objects and accelerate the network’s convergence.
(4) Our method has achieved state-of-the-art results on the challenging KAIST [26] mul-
tispectral pedestrian dataset and the Drone Vehicle dataset [27]. Furthermore, experiments
on the multispectral target detection dataset FLIR [28] further validate the effectiveness
and generalizability of the algorithm.
The remainder of this paper is structured as follows: Section 2 describes related work
pertinent to our network. In Section 3, we detail the network architecture and methodology.
Section 4 presents our experimental details and results, comparing them with state-of-the-
art networks to validate the effectiveness of our approach. In Section 5, we summarize the
research content and experimental findings.
2. Related Work
This section briefly reviews classical deep learning methods for real-time object detec-
tion, summarizes the research progress in multi-modal image fusion techniques, and dis-
cusses the structures and applications of attention mechanisms.
Real-time object detection focuses on classifying and localizing objects under low-latency conditions. Modern real-time ob-
ject detection methods primarily rely on deep learning models such as the YOLO (You
Only Look Once) series, SSD [29] (Single Shot MultiBox Detector), and Fast R-CNN [30].
The YOLO series achieves object localization and classification through a single forward
pass. At the same time, SSD directly predicts the classes and locations of multiple objects
using a convolutional neural network, and Fast R-CNN detects objects by first generating
candidate regions and then classifying and precisely localizing them.
Among these, the YOLO series has gradually become mainstream due to its efficient
design. Versions like YOLOv1 [31], YOLOv2 [32], and YOLOv3 [33] have defined the
classic detection architecture, which typically consists of three parts: the backbone, the neck,
and the head. YOLOv4 [34] and YOLOv5 [35] introduced the CSPNet design to replace
DarkNet, combined with data augmentation strategies, an enhanced Path Aggregation
Network (PAN), and more diversified model scales. YOLOv6 [36] introduced BiC and
SimCSPSPPF in the neck and backbone and adopted anchor-assisted training and self-
distillation strategies. YOLOv7 [37] introduced the E-ELAN layer, while YOLOv8 proposed
the C2f block for efficient feature extraction and fusion. YOLOv9 [38] further improved the
architecture with the introduction of GELAN and incorporated PGI to enhance the training
process. The latest YOLOv10 [39] introduced a continuous dual assignment method with
NMS-free (Non-Maximum Suppression) training, offering competitive performance and
the advantage of low inference latency.
CBAM [50] complements global average pooling with global max pooling to enhance feature
extraction, and GSoP Net [51] introduces global second-order pooling, improving perfor-
mance at the cost of higher computational overhead. ECA-Net [52] significantly reduces
the complexity of the SE model by generating channel weights through a one-dimensional
convolution filter. The non-local (NL) module proposed by Wang et al. [53] generates
an attention map by computing a correlation matrix between each spatial point in the
feature map.
Shuffle Attention [54] and DANet [55] flexibly fuse local features and global dependen-
cies by weighting and combining attention modules from different branches. FcaNet [56]
interprets the relationship between GAP and the initial frequencies of the discrete co-
sine transform, using residual frequency selection to extract channel information, while
WaveNet leverages discrete wavelet transform to capture channel features. OrthoNet [57]
validates the effectiveness of orthogonal filters, attributing their performance largely to
the orthogonality of DCT kernels. These studies offer diverse directions and theoretical
support for applying attention mechanisms in deep learning.
3. Methods
3.1. Overall Network Architecture
We designed a dual-branch object detection network, IV-YOLO, based on the YOLOv8
framework, as illustrated in Figure 1. The network backbone consists of two branches that
extract multi-scale object features from infrared and visible light images across five levels,
from P1 to P5. At the P1 level, a single-layer convolutional (Conv) structure is employed,
as defined by Equation (1). Here, $F_{C_i} \in \mathbb{R}^{C_{in} \times H_{in} \times W_{in}}$ represents the input feature map,
while conv denotes a convolution operation with a kernel size of 3 × 3 and a stride of
2. Batch normalization (BatchNorm2d) is applied to accelerate network convergence and
enhance training stability, and the SiLU activation function ensures the non-linearity of the
convolution operation. From P2 to P5, convolutional layers are integrated with the C2f
structure proposed by YOLOv8, with the specific design of C2f shown in Figure 2.
$$\mathrm{Conv}(F_{C_i}) = \mathrm{SiLU}\left(\mathrm{BatchNorm2d}\left(\mathrm{conv}_{3\times 3,\, s=2}(F_{C_i})\right)\right) \tag{1}$$
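For illustration, the following is a minimal PyTorch sketch of the Conv block in Equation (1); the channel counts and input size in the example are assumptions rather than the exact values used at each pyramid level.

```python
# Minimal sketch of Equation (1): a 3x3, stride-2 convolution followed by
# BatchNorm2d and SiLU. Channel counts below are illustrative assumptions.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)   # stabilizes and speeds up convergence
        self.act = nn.SiLU()              # non-linearity after the convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# Example: a 640x640 visible-light (or single-channel infrared) input is downsampled to 320x320.
x = torch.randn(1, 3, 640, 640)
print(ConvBlock(3, 32)(x).shape)  # torch.Size([1, 32, 320, 320])
```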
In the feature fusion phase, we have designed a novel Bidirectional Pyramid Fusion
(Bi-Fusion) module detailed in Section 3.2.1. Under well-lit conditions, visible light images
capture rich target details. In contrast, infrared images provide significant contrast in low-
light environments and can penetrate certain materials for detection due to their thermal
sensitivity. Thus, we employ the Bi-Fusion module to perform bidirectional pyramid fusion
of features extracted from both branches at different scales (P2, P3, P4, P5). The P2 layer,
with the smallest receptive field, is suited for fusing features of small targets; as the number
of convolutional layers increases from P3 to P5, the receptive field expands, corresponding
to the fusion of features for small, medium, and large targets. Experiments demonstrate
that these fused features effectively complement the detailed information from the infrared
branch. Consequently, we input the fused feature map vectors from layers P2 to P4 into
layers P3 to P5 of the infrared branch for further feature extraction.
In the design of the neck structure, we introduced the Shuffled Spatial Pyramid Pooling
(Shuffle-SPP) module. This module enhances feature utilization efficiency through multi-scale
pooling, multi-layer convolution operations, and channel and spatial attention mechanisms. Its
specific structure and characteristics are detailed in Section 3.2.2. Additionally, we incorporated
elements from YOLOv8, adding an upsampling layer and a convolutional concatenation
operation to retain multi-scale features and mitigate the gradual feature loss that can occur
with convolutional operations, thereby enhancing object detection capability.
This architecture ensures comprehensive extraction of target texture details from
visible light images and thermal radiation and material features from infrared images.
Integrating multi-level feature fusion with shuffled attention mechanisms guarantees
optimal utilization of dual-modal features, further improving the network’s robustness in
complex environments.
Figure 1. The overall architecture of the dual-branch IV-YOLO network. This network is specifically
designed for object detection in complex environments. The backbone network extracts features from
visible light and infrared images in parallel and fuses the multi-scale features obtained from the P2 to
P5 layers. The fused features are fed into the infrared branch for deeper feature extraction. In the
neck structure, we employ a Shuffle-SPP module, which integrates the extracted features with those
from the backbone network through a three-layer upsampling process, enabling precise detection of
objects at different scales.
This weight-allocation design enables the network to assess the importance of different input features effectively,
thereby improving fusion efficacy. Figure 3 illustrates the weight allocation structure,
where wi represents the weight assigned to input i. We employ the Fast Normalized Fusion
method to obtain the fused features from the two modalities, as detailed in Equation (2).
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i \tag{2}$$
In the above equation, $P_0^{in}$ and $P_1^{in}$ represent the features extracted from the infrared and visible light images, respectively, while $w_0$ and $w_1$ are the corresponding initial weight parameters. $P^{td}$ is the normalized intermediate feature generated by applying a weighted average to the two weighted inputs, followed by normalization and a convolution operation. $P_0^{out}$ and $P_1^{out}$ denote the fused output features for the infrared and visible light images, respectively. These are computed using the weight parameters, input features, and intermediate features described by Equations (4) and (5).
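The weighting scheme of Equation (2) can be sketched in PyTorch as follows; the learnable weights, the ReLU-based non-negativity constraint, and the trailing 3 × 3 convolution standing in for the post-fusion convolution are assumptions consistent with Fast Normalized Fusion, not the exact IV-YOLO implementation.

```python
# Sketch of Fast Normalized Fusion (Equation (2)) for two modality features of
# identical shape: learnable weights are kept non-negative with ReLU and
# normalized by their sum plus a small epsilon before the weighted sum.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    def __init__(self, channels: int, num_inputs: int = 2, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # assumed post-fusion conv

    def forward(self, inputs):  # inputs: e.g. [P0_in (infrared), P1_in (visible)]
        w = F.relu(self.weights)              # keep weights non-negative
        w = w / (w.sum() + self.eps)          # w_i / (eps + sum_j w_j)
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.conv(fused)               # normalized intermediate feature

ir_feat = torch.randn(1, 128, 80, 80)
vis_feat = torch.randn(1, 128, 80, 80)
print(FastNormalizedFusion(128)([ir_feat, vis_feat]).shape)  # torch.Size([1, 128, 80, 80])
```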
The Bi-Fusion structure achieves efficient feature flow and information exchange across
different scales through bidirectional cross-scale connections and a repeated module design.
This bidirectional flow promotes deep integration of features across scales, leveraging
the complementarity of features from each scale while avoiding significant increases in
computational cost. Compared with traditional unidirectional pyramid structures, Bi-
Fusion maximizes feature utilization through bidirectional information flow, enhancing
detection performance and reducing information loss between different feature levels. This
approach helps the network capture more details, providing richer semantic and spatial
information, particularly when dealing with targets of varying scales.
In Figure 4, we first adjust the number of channels using a Conv layer, followed by
three MaxPool2d layers for multi-scale feature extraction. Next, we aggregate the extracted
spatial information and restore the features through a Conv layer to maintain the original
image dimensions. To better integrate the different characteristics of infrared and visible
light modes, we divide the channels into $n$ sub-features $X_k$ and iterate through each one. Each sub-feature $X_k$ is further divided into two parts: $X_{k1}$ and $X_{k2}$. The $X_{k1}$ part fully exploits the relationships between channels, applying a channel attention mechanism to focus on key features, whereas the $X_{k2}$ part leverages the spatial relationships of the features, using a spatial attention mechanism to highlight important regions.
The design of the $X_{k1}$ part references the SE module and employs a channel attention mechanism. The module uses a single-layer GAP + Scale + Sigmoid transformation to compress the feature $X_{k1}$ from dimensions $\frac{c}{2n} \times h \times w$ to $\frac{c}{2n} \times 1 \times 1$ through global average pooling, as shown in Equations (6) and (7). The obtained channel weights are then normalized and multiplied with $X_{k1}$ to obtain the feature module $X_{k1}'$ with channel attention.

$$s = F_{gp}(X_{k1}) = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} X_{k1}(i, j) \tag{6}$$

$$X_{k1}' = \sigma(s) \cdot X_{k1} \tag{7}$$
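A minimal sketch of this channel-attention branch, assuming a learnable per-channel scale and shift for the 'Scale' step, is shown below.

```python
# Sketch of the channel-attention branch for X_k1 (Equations (6) and (7)):
# global average pooling produces the per-channel statistic s, which is
# rescaled (learnable scale/shift assumed) and gated by a sigmoid before
# reweighting X_k1.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x_k1: torch.Tensor) -> torch.Tensor:
        s = x_k1.mean(dim=(2, 3), keepdim=True)           # Eq. (6): global average pooling
        attn = torch.sigmoid(self.scale * s + self.shift)
        return attn * x_k1                                # Eq. (7): X_k1' = sigma(s) * X_k1
```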
The design of the $X_{k2}$ part employs a spatial attention mechanism, which, unlike channel attention, focuses on 'where' the information is, complementing channel attention. First, we apply Group Norm (GN) [50] to $X_{k2}$ to obtain spatial statistics and then apply a transformation to enhance the representation of $X_{k2}'$. Subsequently, the obtained spatial weights are multiplied with $X_{k2}$ to produce the feature module $X_{k2}'$ with spatial attention, as shown in Equation (8).
Then, all sub-features are aggregated to obtain a feature module with comprehensive
information interaction.
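Since Equation (8) itself is not reproduced above, the spatial-attention branch is sketched here under the assumption that it follows the Shuffle Attention formulation [54]: Group Norm statistics, a learnable scale and shift, and a sigmoid gate.

```python
# Assumed sketch of the spatial-attention branch for X_k2: Group Norm supplies
# the spatial statistics, followed by a learnable scale/shift and a sigmoid.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gn = nn.GroupNorm(channels, channels)  # per-channel spatial normalization
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x_k2: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.scale * self.gn(x_k2) + self.shift)
        return attn * x_k2                          # X_k2' with spatial attention

# Each sub-feature X_k is split in half along the channel dimension.
x_k = torch.randn(1, 64, 40, 40)
x_k1, x_k2 = x_k.chunk(2, dim=1)
x_k2_prime = SpatialAttention(32)(x_k2)
# In the full module, X_k1 passes through the channel-attention branch sketched
# earlier; the two halves are concatenated, all sub-features are aggregated,
# and a channel shuffle follows.
out = torch.cat([x_k1, x_k2_prime], dim=1)
```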
The Shuffle-SPP module acquires global context information by rearranging and pool-
ing features across the channel and spatial dimensions. It effectively combines relevant
information from different channels while capturing a broader spatial context. This process
allows the network to focus more precisely on the semantic and positional information of
the target when processing deep features, thereby enhancing the model’s expressiveness
and target detection accuracy. In this way, the Shuffle-SPP module enriches feature repre-
sentation and strengthens the network’s ability to recognize targets in complex scenarios.
For bounding-box regression, the Intersection over Union (IoU) between a predicted box $A$ and a ground-truth box $B$ is defined as

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{9}$$
To better handle the regression samples of different categories and focus on their
respective detection tasks, we adopt a linear interval mapping method to reconstruct the
IoU loss. This method enhances the regression accuracy for edge targets. The formula is
as follows:
$$\mathrm{IoU}^{Focaler} = \begin{cases} 0, & \mathrm{IoU} < d \\ \dfrac{\mathrm{IoU} - d}{u - d}, & d \leqslant \mathrm{IoU} \leqslant u \\ 1, & \mathrm{IoU} > u \end{cases} \tag{10}$$
Here, $\mathrm{IoU}^{Focaler}$ represents the reconstructed Focaler-IoU, $\mathrm{IoU}$ is the original value, and $[d, u] \subseteq [0, 1]$. By adjusting the values of $d$ and $u$, we can make $\mathrm{IoU}^{Focaler}$ focus on different regression samples. The loss is defined as follows:

$$L_{Focaler\text{-}IoU} = 1 - \mathrm{IoU}^{Focaler} \tag{11}$$
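A minimal sketch of Equations (9)-(11) is given below; the axis-aligned box format and the example values of $d$ and $u$ are assumptions, since the text does not fix them.

```python
# Sketch of the Focaler-IoU loss (Equations (9)-(11)): plain IoU of
# axis-aligned boxes is linearly remapped from [d, u] to [0, 1], clamped
# outside that interval, and the loss is 1 minus the remapped value.
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Element-wise IoU for boxes of shape (N, 4) in (x1, y1, x2, y2) format (Eq. 9)."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-7)

def focaler_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                     d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    # d and u are illustrative defaults, not values specified by the paper.
    iou = box_iou(pred, target)
    iou_focaler = ((iou - d) / (u - d)).clamp(0.0, 1.0)  # Eq. (10): linear interval mapping
    return (1.0 - iou_focaler).mean()                    # Eq. (11)
```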
4. Results
In this section, we describe the implementation process for the dual-branch infrared
and visible light image detection network, including hardware and software configuration
specifics. To validate the effectiveness of the proposed method, we conducted extensive
experiments across several publicly available datasets, including the Drone Vehicle dataset,
the KAIST dataset, and the FLIR pedestrian dataset. The experimental results demonstrate
that our method performs exceptionally well on these datasets.
24-bit visible light images, encompassing targets such as people, vehicles, and bicycles.
The resolution of the infrared images is 640 × 512 pixels, while the resolution of the visible
light images ranges from 720 × 480 to 2048 × 1536 pixels.
Category Parameter
CPU Intel i7-12700H (Intel Corporation, Santa Clara, CA, USA)
GPU NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA)
System Windows 11
Python 3.8.19
PyTorch 1.12.1
Training Epochs 300
Learning Rate 0.01
Weight Decay 0.0005
Momentum 0.937
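Assuming standard SGD training, the listed hyperparameters map onto a PyTorch optimizer as sketched below; the placeholder model and the choice of SGD itself are assumptions inferred from the momentum setting.

```python
# Sketch of the training configuration implied by the settings above:
# learning rate 0.01, momentum 0.937, weight decay 0.0005, 300 epochs.
import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder for the IV-YOLO network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,             # Learning Rate
    momentum=0.937,      # Momentum
    weight_decay=0.0005  # Weight Decay
)
num_epochs = 300         # Training Epochs
```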
Table 3. The hyperparameters of the datasets used in this paper. ‘test-val’ indicates that the same
dataset is used for both testing and validation in this study.
Detection accuracy is evaluated using precision and recall, defined as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{14}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{15}$$
Specifically, TP (True Positive) represents the number of correctly predicted positive
samples, FP (False Positive) denotes the number of incorrectly predicted positive samples,
and FN (False Negative) refers to the number of positive samples that were incorrectly
predicted as negative. Average Precision (AP) is the area under the Precision–Recall curve,
and the closer the AP value is to 1, the better the detection performance of the algorithm.
$$AP = \int_0^1 p(r)\, dr \tag{16}$$
Mean Average Precision (mAP) is the average of the AP values across all classes,
offering a balanced evaluation by combining Precision and Recall. In multi-class detection
tasks, mAP is particularly important because it ensures good performance across all classes.
Moreover, mAP is robust to class imbalance, making it widely used in evaluating multi-
class object detection tasks. It is also a key metric in our experiments to assess detection
accuracy. The formula for calculating mAP is as follows:
$$mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c \tag{17}$$
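The evaluation metrics in Equations (14)-(17) can be sketched as follows; the trapezoidal integration of the Precision-Recall curve is a simplification of the AP computation actually used by detection benchmarks.

```python
# Sketch of the evaluation metrics: precision/recall from TP/FP/FN counts,
# AP as the area under the precision-recall curve, and mAP as the mean of
# per-class AP values.
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp + 1e-12)   # Eq. (14)
    recall = tp / (tp + fn + 1e-12)      # Eq. (15)
    return precision, recall

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    order = np.argsort(recalls)          # integrate p(r) over r in [0, 1] (Eq. 16)
    return float(np.trapz(precisions[order], recalls[order]))

def mean_average_precision(per_class_ap: list) -> float:
    return float(np.mean(per_class_ap))  # Eq. (17)

print(precision_recall(tp=80, fp=10, fn=20))  # (~0.889, 0.8)
```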
The number of parameters and FPS evaluate the efficiency of the model. The number
of parameters refers to the total count of all learnable parameters, including weights
and biases. FPS measures how many image frames the network can process per second,
an essential real-time performance indicator.
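As an illustration, both efficiency metrics can be measured in PyTorch as sketched below; the placeholder model, input resolution, and iteration counts are assumptions.

```python
# Sketch of measuring the two efficiency metrics: the parameter count sums all
# learnable tensor sizes, and FPS counts forward passes per second on a
# fixed-size input (with a few warm-up iterations).
import time
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 32, 3, 2, 1), torch.nn.SiLU()).eval()
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params:,}")  # real detectors report this in millions (M)

x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    for _ in range(5):                # warm-up iterations
        model(x)
    start = time.time()
    n = 100
    for _ in range(n):
        model(x)
fps = n / (time.time() - start)
print(f"FPS: {fps:.1f}")
```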
Table 4. Evaluation results based on the Drone Vehicle dataset. All values are expressed as percentages.
The top-ranked results are highlighted in green.
Since many network models are based on single-modal images for detection, we
evaluated these networks using visible and infrared images. As shown in Table 4, YOLOv8
demonstrates outstanding performance in detection accuracy and speed in single-modal
scenarios, so we chose YOLOv8 as the foundational framework for the IV-YOLO algorithm.
Through our innovative improvements, the IV-YOLO algorithm achieved an accuracy of
74.6% on the Drone Vehicle dataset, outperforming all other networks. Notably, in fine-
grained feature extraction and fusion, IV-YOLO achieved detection accuracies of 63.1% for
freight cars and 53% for vans, significantly surpassing other networks.
The results in Table 4 indicate that IV-YOLO effectively integrates dual-modal features
to enable robust detection in complex environments and improves the accuracy of detecting
visually similar objects through dual-branch feature fusion. However, the emphasis on fine-
grained features led to a decrease in detection performance for visually similar categories.
Additionally, in small object detection, the network focuses more on extracting low-level
features to capture fine details, as small objects require higher-resolution feature maps
and precise local details. This focus may weaken the network’s performance on larger
objects, as detecting larger targets requires a broader receptive field to understand the
global context fully. An overemphasis on details can reduce the utilization of high-level
semantic information, affecting larger objects’ detection performance.
Figure 5 illustrates the visualized detection results on the Drone Vehicle dataset.
The images are divided into six groups, each containing three rows of images. The visible
light detection results are on the left side of each group, while the right side shows the
infrared image detection results. The first row demonstrates that the IV-YOLO network is
capable of robust target detection even in low-light conditions or when soft occlusion is
present. The second row highlights the network’s ability to effectively extract fine-grained
features, successfully distinguishing between visually similar objects. The third row shows
the network’s robustness when processing scenes with dense targets.
Figure 5. Visualization of IV-YOLO Detection Results Based on the Drone Vehicle Dataset.
Table 5. The object detection results on the FLIR dataset, calculated at a single IoU threshold of 0.5.
All values are expressed as percentages, with the top-ranked results highlighted in green.
The experimental results on the FLIR dataset are summarized in Table 5. The table
shows that our network achieves the highest mAP value on this dataset, outperforming
other methods. Integrating multi-scale feature fusion and the triple upsampling operations
in the neck significantly enhances our network’s ability to extract features from small
objects. The results indicate a noticeable improvement in Precision for detecting small
objects, such as bicycles, which are challenging to extract. However, due to a slight
reduction in global feature capture, the detection performance for larger objects, such
as cars and pedestrians, is marginally lower than that of the Dual-YOLO network. Overall,
the mAP results demonstrate that our network effectively extracts multi-scale features,
particularly capturing fine details at smaller receptive field levels, thus improving the
detection of small targets. Additionally, through the fusion module, our network effectively
extracts and integrates shared features from both modalities and their unique characteristics.
Through weighted parameters, the fusion mechanism enables mutual enhancement and
compensation between the two modalities, leading to superior detection performance.
Table 6. The object detection results on the KAIST dataset, calculated at a single IoU threshold of 0.5.
All values are expressed as percentages, with the top-ranked results highlighted in green.
Figure 6 illustrates the visualization of object detection results on the FLIR dataset.
The first row of images demonstrates that, regardless of changes in background lighting, our
network accurately locates and detects objects by integrating features from both modalities.
The second row shows that the network effectively uses the dual-branch structure to extract
fine-grained features of objects at different scales. Combined with a specially designed
loss function, it successfully detects each object, even in cases of occlusion and high target
density. The third row highlights the network’s advantage in multi-scale feature extraction,
particularly at smaller receptive field levels, where it captures more subtle features, thus
significantly enhancing the detection of small targets.
Table 7. Model complexity and runtime comparison of IV-YOLO and the plain counterparts.
5. Conclusions
This paper presents a dual-branch object detection network based on YOLOv8, named
IV-YOLO, which integrates infrared and visible light images. The network is meticulously
designed to effectively extract features from both modalities during the dual-branch feature
extraction process and perform object detection at multiple scales. Additionally, we devel-
oped a Bidirectional Pyramid Feature Fusion structure (Bi-Fusion) that integrates features
from different modalities across multiple scales using weighted parameters. This approach
effectively reduces the number of parameters and avoids feature redundancy, enhancing
the network’s fusion performance. To further bolster the network’s expressive capabil-
ity, we introduced the Shuffle Attention Spatial Pyramid Pooling structure (Shuffle-SPP),
which captures global contextual information across channels and spatial dimensions. This
structure enables the network to focus more accurately on the semantic and positional infor-
mation of objects within deep features, significantly improving the model’s representational
power. Finally, by incorporating the PIoU v2 loss function, we accelerated the convergence
of object detection boxes for targets of varying sizes, enhanced the fitting Precision for
small targets, and sped up the overall convergence of the network. Experimental results
demonstrate that the IV-YOLO network achieves a mean Average Precision (mAP) of 74.6%
on the Drone Vehicle dataset, 75.4% on the KAIST dataset, and 85.6% on the FLIR dataset.
These results indicate that IV-YOLO excels in feature extraction, fusion, and object detection
for infrared and visible light images. Furthermore, the network’s parameter count is only
4.31 M, significantly lower than other networks, showcasing its potential for deployment
on mobile devices.
Despite its excellent performance in dual-modal object detection tasks, IV-YOLO has
some limitations: (1) it requires further optimization of speed and memory usage before
hardware deployment; (2) there is limited availability of dual-modal datasets with diverse
scenes and targets, necessitating further research and validation of the method’s effective-
ness; and (3) while the network enhances detection of small targets, future improvements
are needed to strengthen its global perception capability. Although single-modal detection
performs well for certain specific tasks, its limitations become evident in complex real-
world applications. In contrast, dual-modal detection leverages the advantages of multiple
data sources to provide richer feature information, significantly improving object detection
accuracy and robustness. Therefore, dual-modal detection technology is undoubtedly a
key development direction for future object detection research, offering strong support for
addressing more complex detection tasks.
Author Contributions: Conceptualization, X.Y.; methodology, X.Y., D.Z. and D.T.; software, X.Y.;
validation, X.Y. and W.Z.; formal analysis, C.W.; investigation, D.T.; resources, X.Y. and D.Z.; data
curation, X.Y. and W.Z.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y.,
D.Z. and D.T.; visualization, X.Y.; supervision, W.Z. All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The DroneVehicle remote sensing dataset is obtained from https://
github.com/VisDrone/DroneVehicle, accessed on 29 December 2021. The KAIST pedestrian dataset is
obtained from https://github.com/SoonminHwang/rgbt-ped-detection/tree/master/data, accessed
on 12 November 2021. The FLIR dataset is obtained from https://www.flir.com/oem/adas/adas-
dataset-form/, accessed on 19 January 2022.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Zhang, H.; Xu, H.; Xiao, Y.; Guo, X.; Ma, J. Rethinking the image fusion: A fast unified image fusion network based on
proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, New York,
NY, USA, 7–12 February 2020; Volume 34, pp. 12797–12804.
2. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning.
In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
3. Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European
Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189.
4. Qu, L.; Liu, S.; Wang, M.; Song, Z. Transmef: A transformer-based multi-exposure image fusion framework using self-supervised
multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022;
Volume 36, pp. 2126–2134.
5. Munsif, M.; Khan, N.; Hussain, A.; Kim, M.J.; Baik, S.W. Darkness-Adaptive Action Recognition: Leveraging Efficient Tubelet
Slow-Fast Network for Industrial Applications. IEEE Trans. Ind. Inform. 2024, early access.
6. Munsif, M.; Khan, S.U.; Khan, N.; Baik, S.W. Attention-based deep learning framework for action recognition in a dark
environment. Hum. Centric Comput. Inf. Sci. 2024, 14, 1–22.
7. Wen, X.; Wang, F.; Feng, Z.; Lin, J.; Shi, C. MDFN: Multi-scale Dense Fusion Network for RGB-D Salient Object Detection.
In Proceedings of the 2023 IEEE 3rd International Conference on Information Technology, Big Data and Artificial Intelligence
(ICIBA), Chongqing, China, 26–28 May 2023; Volume 3, pp. 730–734.
8. Han, D.; Li, L.; Guo, X.; Ma, J. Multi-exposure image fusion via deep perceptual enhancement. Inf. Fusion 2022, 79, 248–262.
[CrossRef]
9. Hou, J.; Zhang, D.; Wu, W.; Ma, J.; Zhou, H. A generative adversarial network for infrared and visible image fusion based on
semantic segmentation. Entropy 2021, 23, 376. [CrossRef]
10. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2849–2858.
11. French, G.; Finlayson, G.; Mackiewicz, M. Multi-spectral pedestrian detection via image fusion and deep neural networks.
J. Imaging Sci. Technol. 2018, 176–181.
12. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO architecture from infrared and visible images for object
detection. Sensors 2023, 23, 2934. [CrossRef]
13. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for
pedestrian detection. Inf. Fusion 2019, 50, 148–157. [CrossRef]
14. Kim, J.U.; Park, S.; Ro, Y.M. Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection. IEEE Trans.
Circuits Syst. Video Technol. 2021, 32, 1510–1523. [CrossRef]
15. Zhu, J.; Zhang, X.; Dong, F.; Yan, S.; Meng, X.; Li, Y.; Tan, P. Transformer-based Adaptive Interactive Promotion Network for
RGB-T Salient Object Detection. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China,
15–17 August 2022; pp. 1989–1994.
16. Ye, Z.; Peng, Y.; Han, B.; Hao, H.; Liu, W. Unmanned Aerial Vehicle Target Detection Algorithm Based on Infrared Visible Light
Feature Level Fusion. In Proceedings of the 2024 6th International Conference on Communications, Information System and
Computer Engineering (CISCE), Guangzhou, China, 10–12 May 2024; pp. 1–5.
17. Gao, Y.; Cheng, Z.; Su, H.; Ji, Z.; Hu, J.; Peng, Z. Infrared and Visible Image Fusion Method based on Residual Network. In
Proceedings of the 2023 4th International Conference on Computer Engineering and Intelligent Control (ICCEIC), Guangzhou,
China, 20–22 October 2023; pp. 366–370.
18. Ataman, F.C.; Akar, G.B. Visible and infrared image fusion using encoder-decoder network. In Proceedings of the 2021 IEEE
International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1779–1783.
19. Zhao, Z.; Xu, S.; Zhang, J.; Liang, C.; Zhang, C.; Liu, J. Efficient and model-based infrared and visible image fusion via algorithm
unrolling. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1186–1196. [CrossRef]
20. Wang, S.; Li, X.; Huo, W.; You, J. Fusion of infrared and visible images based on improved generative adversarial networks.
In Proceedings of the 2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS),
Guangzhou, China, 22–24 July 2022; pp. 247–251.
21. Liu, H.; Liu, H.; Wang, Y.; Sun, F.; Huang, W. Fine-grained multilevel fusion for anti-occlusion monocular 3d object detection.
IEEE Trans. Image Process. 2022, 31, 4050–4061. [CrossRef]
22. Bao, W.; Hu, J.; Huang, M.; Xu, Y.; Ji, N.; Xiang, X. Detecting Fine-Grained Airplanes in SAR Images with Sparse Attention-Guided
Pyramid and Class-Balanced Data Augmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8586–8599. [CrossRef]
23. Nousias, S.; Pikoulis, E.V.; Mavrokefalidis, C.; Lalos, A.S. Accelerating deep neural networks for efficient scene understanding in
multi-modal automotive applications. IEEE Access 2023, 11, 28208–28221. [CrossRef]
24. Poeppel, A.; Eymüller, C.; Reif, W. SensorClouds: A Framework for Real-Time Processing of Multi-modal Sensor Data for
Human-Robot-Collaboration. In Proceedings of the 2023 9th International Conference on Automation, Robotics and Applications
(ICARA), Abu Dhabi, United Arab Emirates, 10–12 February 2023; pp. 294–298.
25. Ultralytics. YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 November 2023).
26. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015;
pp. 1037–1045.
27. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE
Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [CrossRef]
28. FLIR ADAS Dataset. 2022. Available online: https://www.flir.com/oem/adas/adas-dataset-form (accessed on 19 January 2022).
29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of
the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings,
Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
30. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015; pp. 1440–1448.
31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only Look Once: Unified, Real-Time Object Detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
[CrossRef]
32. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [CrossRef]
33. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
34. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
35. Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 September 2024).
36. Li, C.Y.; Wang, N.; Mu, Y.Q.; Wang, J.; Liao, H.Y.M. YOLOv6: A Single-Stage Object Detection Framework for Industrial
Applications. arXiv 2022, arXiv:2209.02976.
37. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
38. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv
2024, arXiv:2402.13616.
39. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024,
arXiv:2405.14458.
40. Farahnakian, F.; Heikkonen, J. Deep learning based multi-modal fusion architectures for maritime vessel detection. Remote Sens.
2020, 12, 2509. [CrossRef]
41. Li, R.; Peng, Y.; Yang, Q. Fusion enhancement: UAV target detection based on multi-modal GAN. In Proceedings of the 2023
IEEE 7th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 15–17 September 2023;
Volume 7, pp. 1953–1957.
42. Pahde, F.; Puscas, M.; Klein, T.; Nabi, M. Multimodal prototypical networks for few-shot learning. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Event, 5–9 January 2021; pp. 2644–2653.
43. Liang, P.; Jiang, J.; Liu, X.; Ma, J. Fusion from decomposition: A self-supervised decomposition approach for image fusion. In
Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022, Springer: Berlin/Heidelberg,
Germany, 2022; pp. 719–735.
44. Dai, X.; Yuan, X.; Wei, X. TIRNet: Object detection in thermal infrared images for autonomous driving. Appl. Intell. 2021, 51, 1244–1261.
[CrossRef]
45. Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-infrared object detection by reducing cross-modality
redundancy. Remote Sens. 2022, 14, 2020. [CrossRef]
46. Luo, F.; Li, Y.; Zeng, G.; Peng, P.; Wang, G.; Li, Y. Thermal infrared image colorization for nighttime driving scenes with top-down
guided attention. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15808–15823. [CrossRef]
47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
48. Guo, Z.; Ma, H.; Liu, J. NLNet: A narrow-channel lightweight network for finger multimodal recognition. Digit. Signal Process.
2024, 150, 104517. [CrossRef]
49. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings
of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019;
pp. 1971–1980.
50. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
51. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3024–3033.
52. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020;
pp. 11534–11542.
53. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
54. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP
2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June
2021; pp. 2235–2239.
55. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
56. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 783–792.
57. Salman, H.; Parks, C.; Swan, M.; Gauch, J. Orthonets: Orthogonal channel attention networks. In Proceedings of the 2023 IEEE
International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 829–837.
58. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
59. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE
International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
60. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision,
Venice, Italy, 22–29 October 2017; pp. 2961–2969.
61. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
62. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838.
63. Wang, H.; Wang, C.; Fu, Q.; Si, B.; Zhang, D.; Kou, R.; Yu, Y.; Feng, C. YOLOFIV: Object detection algorithm for around-the-clock aerial
remote sensing images by fusing infrared and visible features. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 15269–15287.
[CrossRef]
64. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355.
65. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
66. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875.
[CrossRef]
67. Jiang, X.; Cai, W.; Yang, Z.; Xu, P.; Jiang, B. IARet: A lightweight multiscale infrared aerocraft recognition algorithm. Arab. J. Sci.
Eng. 2022, 47, 2289–2303. [CrossRef]
68. Li, Q.; Zhang, C.; Hu, Q.; Fu, H.; Zhu, P. Confidence-aware fusion using dempster-shafer theory for multispectral pedestrian
detection. IEEE Trans. Multimed. 2022, 25, 3420–3431. [CrossRef]
69. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13039–13048.
70. Zheng, Z.; Wu, Y.; Han, X.; Shi, J. Forkgan: Seeing into the rainy night. In Proceedings of the Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16; Springer: Berlin/Heidelberg, Germany, 2020;
pp. 155–170.
71. Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Van Gool, L. Night-to-day image translation for retrieval-based localization.
In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, BC, Canada, 20–24 May 2019;
pp. 5958–5964.
72. Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. Adv. Neural Inf. Process. Syst. 2017, 30.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.