Article

LI-YOLO: An Object Detection Algorithm for UAV Aerial Images in Low-Illumination Scenes
Songwen Liu 1, Hao He 1,2,*, Zhichao Zhang 1 and Yatong Zhou 1,2

1 School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China;
[email protected] (S.L.); [email protected] (Y.Z.)
2 Innovation and Research Institute, Hebei University of Technology in Shijiazhuang,
Shijiazhuang 050299, China
* Correspondence: [email protected]

Abstract: With the development of unmanned aerial vehicle (UAV) technology, deep learning is being used more and more widely for object detection in UAV aerial images; however, detecting and identifying small objects in low-illumination scenes is still a major challenge. Aiming at the problems of low brightness, high noise, and obscure details in low-illumination images, an object detection algorithm for UAV aerial images in low-illumination scenes, LI-YOLO (Low-Illumination You Only Look Once), is proposed. Specifically, in the feature extraction part, this paper proposes a feature enhancement block (FEB) that realizes global receptive field and context information learning through lightweight operations and embeds it into the C2f module at the end of the backbone network to alleviate the problems of high noise and detail blur caused by low illumination at very little parameter cost. In the feature fusion part, aiming to improve the detection performance for small objects in UAV aerial images, a shallow feature fusion network and a small object detection head are added. In addition, the adaptive spatial feature fusion structure (ASFF) is introduced, which adaptively fuses information from different levels of feature maps by optimizing the feature fusion strategy so that the network can more accurately identify and locate objects of various scales. The experimental results show that the mAP50 of LI-YOLO reaches 76.6% on the DroneVehicle dataset and 90.8% on the LLVIP dataset. Compared with other current algorithms, LI-YOLO improves mAP50 by 3.1% on the DroneVehicle dataset and 6.9% on the LLVIP dataset. These results show that the proposed algorithm can effectively improve object detection performance in low-illumination scenes.

Keywords: low illumination; small object detection; UAV; YOLOv8

Citation: Liu, S.; He, H.; Zhang, Z.; Zhou, Y. LI-YOLO: An Object Detection Algorithm for UAV Aerial Images in Low-Illumination Scenes. Drones 2024, 8, 653. https://doi.org/10.3390/drones8110653

Academic Editor: Diego González-Aguilera
Received: 27 September 2024; Revised: 26 October 2024; Accepted: 4 November 2024; Published: 7 November 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Unmanned aerial vehicles (UAVs) have been widely used in fields such as traffic patrol, environmental monitoring, and maritime search and rescue [1]. However, in practical applications, UAV aerial images often face the challenges of low-illumination environments, such as night [2], haze or dawn [3], and dusk, and images captured under these conditions often suffer from low brightness, high noise, and blurred details, which seriously affect the accuracy and efficiency of detecting objects such as vehicles and pedestrians. With the increasing demand for real-time detection on UAV platforms, UAV object detection algorithms not only need to accurately identify objects but also need high robustness and real-time capability, which poses a major challenge to the design and optimization of object detection algorithms from the UAV perspective in low-illumination scenes.
In recent years, two types of efficient algorithms based on deep learning [4] have emerged in the field of object detection: single-stage (e.g., the YOLO series) and two-stage (e.g., the R-CNN series). Single-stage algorithms such as YOLO [5] can predict the object
category and location through direct regression to achieve fast real-time detection, but they may be slightly inferior in accuracy to two-stage algorithms, especially for small targets and complex backgrounds. The two-stage R-CNN series [6] performs fine classification of candidate regions with high accuracy, especially in complex scenes, but at a slightly slower speed. Both perform well under good lighting conditions, while their performance in low-illumination scenes is limited by the degradation of image quality.
More specifically, there are significant differences between the object detection al-
gorithm in low-illumination scenes and the object detection algorithm in general scenes,
especially in the perspective of UAVs. The object detection task in low-illumination scenes
mainly has problems such as low brightness, high noise, and blurred details. In response to
the above problems, researchers have conducted extensive research and provided some
solutions. The wavelet-based Retinex image enhancement algorithm [7] uses the Retinex
algorithm to mitigate the effects of low illumination and the wavelet threshold algorithm to
suppress noise while preserving edge information. UA-CMDet [8] proposes an uncertainty-
aware module using cross-modal intersection over union and illumination estimation
to quantify the uncertainty of each object. C2Former [9] designs an intermodality cross-
attention (ICA) module to obtain the calibrated and complementary features by learning
the cross-attention relationship between the RGB and IR modalities. Dark-Waste [10] combines an improved ConvNeXt network with YOLOv5, using large-scale non-overlapping convolution to improve the algorithm's ability to capture objects in low-illumination im-
ages. IAIFNet [11] incorporates a salient target aware module (STAM) and an adaptive
differential fusion module (ADFM) to respectively enhance gradient and contrast with
sensitivity to brightness. DIVFusion [12] uses a scene-illumination disentangled network
(SIDNet [13]) to strip the illumination degradation in nighttime visible images while pre-
serving informative features of source images and devises a texture–contrast enhancement
fusion network [14] (TCEFNet) to integrate complementary information and enhance the
contrast and texture details of fused features. The MAFusion [15] method inputs infrared
images and visible images into an adaptive weighting module and learns the difference
between them through dual-stream interaction to obtain a two-modal image fusion task to
improve network performance in low-illumination scenes. The Dark-YOLO [16] method
proposes a path aggregation enhanced module to further enhance the ability of feature
representation and improve the performance of object detection in low-illumination scenes.
Despite these advances, the object detection of UAV aerial images in low-illumination
scenes is still a daunting challenge due to the problems of low brightness, high noise, and
blurred details in low-illumination scenes, as well as the need for high inference speed and
lightweight models for embedded devices such as UAVs.
In order to further improve the accuracy of object detection in UAV-captured images in
low-illumination scenes, this paper proposes an object detection algorithm for UAV aerial
images in low-illumination scenes. The main innovations include the following points:
• A feature enhancement block (FEB) is proposed to realize global receptive field and
context feature learning through lightweight operations and then integrate it into the
C2f module at the end of the backbone network, so as to improve the feature extraction
ability of the algorithm with very low parameter costs and alleviate the problems of
high noise and detail blur caused by low illumination.
• On the basis of YOLOv8, a shallow feature fusion network and a small object detection
head are added, aiming to solve the problem of small object scale (any dimension of
the object is less than 10% of the total size of the image) in UAV aerial images and
improve the detection performance of the algorithm for small objects.
• In view of the inconsistencies between different scale features exhibited in aerial
images, the adaptive spatial feature fusion structure (ASFF) is introduced in this paper,
which enables the network to more accurately identify and locate objects of various
scales by adaptively fusing information from different levels of feature maps.

2. Related Works
2.1. Benchmark Network YOLOv8
As a mainstream one-stage object detection algorithm, YOLOv8 [17] exhibits significant advantages in the field of object detection due to its superior real-time detection performance, rich global information representation capabilities, and robust generalization abilities. YOLOv8 is used as the benchmark network in this paper, and its network structure is shown in Figure 1, including the Backbone, Neck, and Head. The backbone network is responsible for extracting features from the input image. The neck network sits between the backbone network and the head network and is responsible for further processing and fusing the features extracted from the backbone network. The head network is the last part of YOLOv8, which performs object detection based on the feature maps provided by the neck network.

Figure 1. The network structure of YOLOv8. It contains the backbone, neck, and head.

2.2. Feature Extraction Module in YOLOv8
The benchmark network YOLOv8 uses C2f [18] in its feature extraction and fusion networks to enhance feature extraction capability so that the network can better learn and use the correlation information between features. It borrows ideas from ELAN [19] and improves the three serial Bottlenecks in the C3 module [20]. By parallelizing more gradient flow branches, it can obtain richer gradient information at more reasonable latency than the C3 module. A comparison of this structure with C3 is shown in Figure 2.
Figure 2. Comparison of the C3 module and the C2f module. (a) represents the structure of C3 and (b) represents the structure of C2f, where the structure of the Bottleneck module is shown in (c).

2.3. Adaptive Spatial Feature Fusion Structure (ASFF)
Feature pyramid networks (FPN) [21] are common methods used to solve the challenge of scale changes in object detection tasks. However, for FPN-based single-stage detectors, the inconsistency between different feature scales is the main limitation. This inconsistency can interfere with gradient computation during training and reduce the effectiveness of the feature pyramid. Therefore, the adaptive spatial feature fusion (ASFF) method was proposed to alleviate this problem. ASFF is a feature fusion strategy proposed in the field of object detection. It spatially filters conflicting information to suppress inconsistencies, so as to improve the scale invariance of the features. It fuses the features of different layers by learning weight parameters so that each spatial location can adaptively select the most useful features for prediction. The pipeline is shown in Figure 3 [22].

Figure 3. Illustration of the adaptive spatial feature fusion mechanism. For each level, the features of all the other levels are resized to the same shape and spatially fused according to the learned weight maps.

The green box in Figure 3 illustrates the operation, where $X^{1\to 3}$, $X^{2\to 3}$, and $X^{3\to 3}$ represent three output features from PANet. These features are multiplied by the learned weights $\alpha^{3}$, $\beta^{3}$, and $\gamma^{3}$, respectively, and then summed to obtain the fused features $y^{3}$. This process can be mathematically expressed as follows:

$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1\to l} + \beta_{ij}^{l} \cdot x_{ij}^{2\to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3\to l}$ (1)
where $y_{ij}^{l}$ denotes the (i, j)-th vector of the output feature map $y^{l}$ across channels, and $x_{ij}^{n\to l}$ denotes the feature vector at position (i, j) of the feature map resized from level n to level l. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are the learnable spatial weights of the three different levels at level l, which are adaptively learned by the network. Following [22], the weights are forced to satisfy $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$ with $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1]$, and they are defined accordingly:

$\alpha_{ij}^{l} = \dfrac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}$ (2)

The variable $\lambda$ is a weight scalar map computed from each of the three input tensors using 1 × 1 convolution blocks, which can be learned by standard backpropagation. Finally, the fused feature y obtained in Equation (1) can be activated by a nonlinear function f(·), such as BatchNorm + ReLU.

3. Proposed Algorithm
In this paper, YOLOv8 is selected as the benchmark algorithm, and the improved algorithm LI-YOLO is shown in Figure 4. It is described below in terms of the proposed feature enhancement block (FEB), the C2f module integrated with FEB (C2fFE), the improved feature fusion network, and the adaptive spatial feature fusion structure (ASFF).

Figure 4. LI-YOLO's framework. It contains the backbone, neck, and head. The gray cells represent the original modules of the baseline algorithm YOLOv8 and the colored bold cells represent the improved or newly added modules. Among them, the model structure is improved with C2fFE, the improved feature fusion network, and ASFF, all of which are bordered in black.

3.1. Principle of the LI-YOLO
In order to further enhance the performance of YOLOv8 in small object detection tasks under low-illumination scenes, this study optimizes three parts of the original YOLOv8 framework. Firstly, inspired by the multi-head self-attention mechanism (MHSA) integrated in the Transformer, this paper proposes a feature enhancement block, which realizes global receptive field and context information learning through lightweight operations and integrates it into the C2f module to alleviate the problems of high noise and obscure details caused by low illumination at extremely low parameter cost. Secondly, in order to make full use of the abundant details and edge information contained in shallow features, a shallow feature fusion network and a small object detection head are added to the feature fusion network composed of FPN and PAN [23]. Additionally, an adaptive spatial feature fusion (ASFF) structure is inserted in front of the four detection heads for multi-scale feature fusion, which further improves the model's capability of detecting small objects.
3.2. Improvements in the Feature Enhancement Network
Although C2f improves the feature extraction capability of the network, the detection accuracy for small objects in low-illumination scenes is not ideal. Due to the problems of high noise and blurred details in low-illumination scenes, it is difficult for the network to fully mine useful information while ignoring interference information [24]. The Transformer can directly calculate the dependencies between any two positions in a tensor in order to better capture global information [25], while the multi-head self-attention mechanism (MHSA) can capture multiple dependencies and feature information in the input sequence, so as to reduce the dependence of the model on a single representation [26]. However, the huge computational cost makes it difficult to deploy an object detection algorithm with a multi-head self-attention mechanism on UAV devices. Therefore, this paper proposes a novel feature enhancement module (C2fFE) based on C2f, which has stronger feature characterization capability than C3 and C2f.
3.2.1. C2fFE
The C2f module is deployed at all layers of the YOLOv8 network, capturing and fusing rich information from shallow layers to deep layers [27]. Because low-illumination scenes suffer from high noise and blurred details, it is important to make reasonable use of the shallow detail information and the deep global information of the network to improve the feature characterization ability in such scenes. In this paper, a feature enhancement block (FEB) is embedded in C2f to form a feature enhancement module (C2fFE) at the end of the YOLOv8 backbone network, and its structure is shown in Figure 5. First, the input feature map is transformed through the first convolutional layer and then divided into two parts. One part is fed into an ELAN-like network, where the feature map of each channel is first enhanced by FEB and then passed into the Bottleneck module. The other part is passed directly to the output in the form of a residual. Finally, the results of the two parts are fused in the channel dimension and passed through the second convolutional layer to obtain the final output. Therefore, the C2fFE module has a more robust feature characterization ability than the original C2f module.

Figure 5. The structure of C2fFE. The structure of the Bottleneck in this figure is the same as in Figure 2c.
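The following PyTorch sketch summarizes the data flow of C2fFE as described above: a first convolution, a split into a residual branch and an ELAN-like branch whose features pass through FEB and then a Bottleneck, and a channel-wise concatenation followed by a second convolution. It is a rough illustration rather than the authors' implementation; the module names and channel choices are assumptions, and `feb` is assumed to be the SiLU linear-attention block described in Section 3.2.2 below.

```python
import torch
import torch.nn as nn

class C2fFE(nn.Module):
    """Sketch of the C2fFE block: C2f with an embedded feature enhancement block (FEB).

    Assumes c_out is even and that `feb` and `bottleneck` preserve their input
    channel count (c_out // 2).
    """
    def __init__(self, c_in: int, c_out: int, feb: nn.Module, bottleneck: nn.Module):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, c_out, kernel_size=1)   # first conv, then split
        self.feb = feb                                      # global-context enhancement
        self.bottleneck = bottleneck                        # standard C2f Bottleneck
        self.cv2 = nn.Conv2d(c_out, c_out, kernel_size=1)   # fuse both halves

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.cv1(x)
        residual, branch = y.chunk(2, dim=1)        # split along channels
        branch = self.bottleneck(self.feb(branch))  # FEB then Bottleneck (ELAN-like path)
        return self.cv2(torch.cat((residual, branch), dim=1))
```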

3.2.2. Feature Enhancement Block (FEB)
Inspired by MHSA, FEB can be seen as a feature enhancement module that enables global receptive field and context feature learning through lightweight operations without relying on large-scale matrix multiplication [28] or the exponential Softmax [29] nonlinear operation. The structures of MHSA and FEB are shown in Figure 6.

Figure 6. The structures of MHSA and FEB, where (a) represents MHSA and (b) represents FEB, and n·d, d·n, and n·n represent the dimensions of the matrices, respectively.

The generalized form of MHSA can be expressed as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{Q \cdot K^{T}}{\sqrt{d_k}}\right) \cdot V$ (3)

The algorithm proceeds as follows:
1. Firstly, the input tensor is projected into Q/K/V tokens through linear projection layers. Q, K, and V represent the query matrix, key matrix, and value matrix, respectively [30];
2. Secondly, an attention score is obtained through $S_n = S/\sqrt{d_k}$, where $S = Q \cdot K^{T}$;
3. Then, the Softmax activation function is used to convert the score into a probability: $P = \mathrm{softmax}(S_n)$;
4. Finally, the weighted output is obtained by $Z = P \cdot V$.

The above algorithm can significantly improve network performance, but the exponential Softmax operation and large-scale matrix multiplication make it difficult to deploy the algorithm on UAV devices with limited computing resources. In order to solve this problem, this paper uses the lightweight SiLU [31] to replace the exponential Softmax without changing the basic idea of MHSA and cleverly takes advantage of the associativity of matrix multiplication to greatly reduce the amount of computation. The improved algorithm is defined as follows:

$O_i = \dfrac{\sum_{j=1}^{N} \mathrm{SiLU}(Q_i)\,\mathrm{SiLU}(K_j)^{T} V_j}{\mathrm{SiLU}(Q_i) \sum_{j=1}^{N} \mathrm{SiLU}(K_j)^{T}} = \dfrac{\mathrm{SiLU}(Q_i)\left[\sum_{j=1}^{N} \mathrm{SiLU}(K_j)^{T} V_j\right]}{\mathrm{SiLU}(Q_i) \sum_{j=1}^{N} \mathrm{SiLU}(K_j)^{T}}$ (4)
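To illustrate Equation (4), the sketch below implements SiLU-based linear attention of this form in PyTorch. It is a minimal single-head reconstruction from the formula, not the authors' released code; the tensor layout and the small epsilon added for numerical stability are assumptions.

```python
import torch
import torch.nn.functional as F

def feb_linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """SiLU-gated linear attention (Eq. (4)), single head.

    q, k: (B, N, d); v: (B, N, dv). Instead of softmax(QK^T)V, which costs
    O(N^2), the summation over keys is factored out so the cost is O(N*d*dv).
    """
    q, k = F.silu(q), F.silu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)       # sum_j SiLU(K_j)^T V_j
    k_sum = k.sum(dim=1)                          # sum_j SiLU(K_j)
    num = torch.einsum("bnd,bde->bne", q, kv)     # SiLU(Q_i) [sum_j SiLU(K_j)^T V_j]
    den = torch.einsum("bnd,bd->bn", q, k_sum)    # SiLU(Q_i) sum_j SiLU(K_j)^T
    return num / (den.unsqueeze(-1) + eps)
```

Relative to softmax attention, the quadratic dependence on the token count N is removed, which is what makes the block light enough to place at the end of the backbone.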

3.3. The Improved Feature Fusion Network


In the feature maps extracted by the backbone network, details and edge information
are mainly concentrated in the shallow feature maps [32]. As the network depth increases,
the semantic information in the feature maps becomes richer, but the detailed information
gradually diminishes [33]. The detected objects of this study are small objects captured by
UAVs, which, due to their small-scale and susceptibility to complex backgrounds, have
features predominantly residing in shallow feature maps [34]. However, the YOLOv8
algorithm does not adequately fuse shallow features, resulting in suboptimal performance
in small object detection [35]. To address these issues and enhance the network’s accuracy
and robustness, this paper introduces a shallow feature fusion network subsequent to the
shallow feature extraction module of the backbone network. The core function of this
network is to integrate the abundant detailed information from shallow feature maps with
the feature maps extracted by the original backbone network, thereby fully leveraging the
rich details in shallow feature maps. Additionally, to strengthen the model’s ability to
detect small objects, this paper appends a small object detection head, P2, after the shallow
feature fusion layer. The anchor boxes of this detection head are designed based on the
typical sizes of small objects in this study, which can enhance the network’s capability for
detecting small objects.
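As a rough illustration of this design (not the exact LI-YOLO layer layout), the sketch below fuses a shallow backbone feature map with an upsampled neck feature and attaches an extra detection head at that higher resolution; the channel counts and the placeholder `detect_p2` head are assumptions.

```python
import torch
import torch.nn as nn

class ShallowFusionP2(nn.Module):
    """Sketch: fuse a shallow (high-resolution) feature into the neck and
    run an extra small-object detection head (P2) on the result."""
    def __init__(self, c_shallow: int, c_neck: int, c_fused: int, num_outputs: int):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(c_shallow + c_neck, c_fused, kernel_size=3, padding=1)
        self.detect_p2 = nn.Conv2d(c_fused, num_outputs, kernel_size=1)  # placeholder head

    def forward(self, shallow_feat: torch.Tensor, neck_feat: torch.Tensor) -> torch.Tensor:
        # Bring the deeper neck feature up to the shallow resolution, then
        # concatenate so edge/detail information from the shallow map is kept.
        x = torch.cat((shallow_feat, self.upsample(neck_feat)), dim=1)
        return self.detect_p2(self.fuse(x))
```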

3.4. An Improved ASFF


In low-illumination scenes, objects may exhibit diverse positions and scales, posing
a significant challenge to the scale invariance of detection algorithms. Traditional image
processing algorithms often struggle to accurately process such data, whereas the PANet
in YOLOv8 converts different feature maps to the same size and sums them up, leading
to incomplete utilization of features across various scales [36]. In order to overcome the
limitations of these approaches, the ASFF module is introduced, which adaptively fuses
information from different hierarchical feature maps by optimizing the feature fusion strategy,
enabling the network to more accurately identify and locate objects of various scales.
The YOLOv8 detection head is integrated with the ASFF module, as shown in Figure 7.
The yellow box illustrates the operation. Due to the addition of a small object detection
layer, the new ASFF algorithm adds a shallow pair of input x4 and output y4 compared
to the original ASFF algorithm, and introduces a shallow weight η. This process can be
mathematically expressed as follows:
$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1\to l} + \beta_{ij}^{l} \cdot x_{ij}^{2\to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3\to l} + \eta_{ij}^{l} \cdot x_{ij}^{4\to l}$ (5)

Figure 7. Adaptive spatial feature fusion structure diagram. The formula in the light blue dotted box represents the fusion algorithm for one layer, and the other layers are similar.
4. Experiment and Analysis
In order to verify the performance of LI-YOLO in low-illumination scenes, this section presents the experimental process, including the experimental environment, experimental parameter settings, datasets, evaluation metrics, result comparison, and analysis of the experimental results.

4.1. Experimental Environment and Training Parameter Settings
The experimental environment is Python 3.8 (Ubuntu 20.04), PyTorch 1.10.0, and CUDA 11.3. The server used for the experiments is equipped with an NVIDIA RTX 4090 GPU (24 GB video memory) and an AMD EPYC 9654 96-core processor; all models are trained, validated, and tested under the same hyperparameters. The specific parameters of the experiment are shown in Table 1.

Table 1. Training parameter settings.

Training Parameters Value


Epoch 300
Batch size 16
Image size 640 × 640
Learning rate 0.01
Momentum 0.937
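For reproducibility, the snippet below shows how hyperparameters like those in Table 1 could be passed to a YOLOv8-style training run using the Ultralytics Python API. It is a hedged illustration rather than the authors' training script; the model and dataset configuration file names are placeholders.

```python
from ultralytics import YOLO

# Hypothetical model definition file describing the LI-YOLO architecture;
# the name is a placeholder, not a file shipped with Ultralytics.
model = YOLO("li-yolo.yaml")

# Hyperparameters from Table 1.
model.train(
    data="dronevehicle_night.yaml",  # placeholder dataset config
    epochs=300,
    batch=16,
    imgsz=640,
    lr0=0.01,       # initial learning rate
    momentum=0.937,
)
```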

4.2. Datasets
4.2.1. DroneVehicle Dataset
DroneVehicle [37] is a large-scale UAV aerial vehicle dataset released by scholars from
Tianjin University, covering urban areas, suburbs, highways, and parking lots from day
to night, along with real-world occlusion and scale changes. In this paper, the research
problem is UAV aerial object detection in low-illumination scenes, so the images taken at
night are screened to construct the dataset in low-illumination scenes, and 1963 images are
selected as the training set, 148 as the validation set, and 1022 as the test set. DroneVehicle
has 5 detection categories, and the category distribution sample statistics in Figure 8 show
the width and height distribution of the objects in the dataset. It can be noted that the lower
left quadrant has a higher concentration of points, indicating the dominance of smaller objects in the DroneVehicle dataset.

Figure 8. The amount of data for all labels in the DroneVehicle training set. (a) Position distribution of the labels in the training set. (b) Width and height distribution of the labels in the training set.
4.2.2. LLVIP Dataset
LLVIP [38] is a visible-infrared paired dataset for low-illumination vision, which contains 30,976 images, or 15,488 pairs. Most of these images were taken in low-illumination scenes and there is only one category, person; all images are strictly aligned in time and space. The algorithm in this paper only needs RGB images, so only 3607 RGB images are randomly selected, of which 2886 images are used as the training set and 721 images are used as the validation set.
4.3. Evaluation Metrics


Common metrics used to evaluate the performance of object detection algorithms
include P (Precision), R (Recall), AP (Average Precision), mAP (mean Average Precision),
Params (number of parameters in the model), FLOPs (number of floating-point operations),
and FPS (number of image frames processed per second). Considering the limited storage
space of embedded devices such as UAVs and the certain requirements for real-time
detection during UAV flight, mAP 50, mAP 50:95, Params, FLOPs, and FPS are selected as
the evaluation metrics of our model.
Among these, mAP is the mean value of all categories of AP (AP is an area under the
P-R curve) and mAP 50 represents the average precision over an IoU threshold greater than
0.5, while mAP 50:95 represents the average precision over IoU thresholds ranging from
0.5 to 0.95 in increments of 0.05. The formulas are defined as follows:

$P = \dfrac{TP}{TP + FP} \times 100\%$ (6)

$R = \dfrac{TP}{TP + FN} \times 100\%$ (7)

$AP = \int_{0}^{1} P(R)\,dR$ (8)

$mAP = \dfrac{1}{N} \sum_{q} AP(q)$ (9)

where TP, FP, TN, and FN denote the number of true positive samples, false positive
samples, true negative samples, and false negative samples, respectively.
In addition, this paper uses FLOPs to evaluate the complexity of the model and FPS to
evaluate the real-time performance of the algorithm.
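As a concrete illustration of Equations (6)-(9), the snippet below computes precision, recall, and an interpolated AP from a precision-recall curve with NumPy. It is a simplified single-class sketch (detections assumed pre-sorted by confidence), not the exact COCO-style evaluator used by the YOLOv8 toolchain.

```python
import numpy as np

def precision_recall(tp: np.ndarray, fp: np.ndarray, n_gt: int):
    """Cumulative precision/recall from detections sorted by confidence."""
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(n_gt, 1)                            # Eq. (7)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)    # Eq. (6)
    return precision, recall

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Area under the P-R curve (Eq. (8)), using the monotonic envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP (Eq. (9)) is then the mean of AP over all classes (and over IoU
# thresholds from 0.5 to 0.95 for mAP 50:95).
```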

4.4. Ablation Experiments


In order to assess the contribution of each module to the performance of LI-YOLO, this paper
conducted ablation experiments on each module. First, YOLOv8s was used as the benchmark
algorithm, then ablation experiments were performed on each module and finally compared
with the YOLOv8m algorithm, and the experimental results are shown in Table 2 and Figure 9.
Among them, SOD stands for shallow feature fusion network and small object detection head.
As can be seen from Table 2, the benchmark algorithm has mAP50 and mAP 50:95 of 72.6% and
46.0%, respectively. As can also be seen in Figure 9, the network is well converged.

Table 2. Ablation experiment based on YOLOv8s.

Models | SOD | ASFF | FEB | mAP 50 (%) | mAP 50:95 (%) | Params (M) | FLOPs (G) | FPS
YOLOv8s | - | - | - | 72.6 | 46.0 | 11.13 | 28.5 | 212.7
Model A | √ | - | - | 73.4 | 47.0 | 16.60 | 36.7 | 138.9
Model B | √ | √ | - | 74.2 | 47.3 | 16.60 | 46.9 | 138.9
Model C | √ | √ | √ | 75.8 | 49.8 | 16.80 | 47.0 | 135.1
YOLOv8m | - | - | - | 75.0 | 47.6 | 56.97 | 195.2 | 137.0

Notes: "√" indicates the improvement was used, while "-" indicates the improvement was not applied.
Figure 9. mAP50 and mAP50:95 for the ablation experiment, where (a) represents mAP50 and (b) represents mAP50:95.

The first experiment (Model A) is based on YOLOv8s, and SOD is added to enhance its ability to detect small objects. Compared to the benchmark algorithm, mAP 50 and mAP 50:95 improved by 0.8% and 1.0%, respectively. Due to the addition of a shallow detection head with a large number of parameters, the number of parameters and the computational complexity of the algorithm increased. The FPS is reduced from 212.7 to 138.9, but it still meets the requirement that the FPS cannot be lower than 30 for real-time UAV detection. In the second experiment (Model B), the ASFF module is inserted into the detection head to perform multi-scale adaptive weighted feature fusion on the features of each layer; the computational cost increases slightly, while the impact on the real-time performance of the algorithm is small, and mAP 50 and mAP 50:95 reach 74.2% and 47.3%, respectively. In the third experiment (Model C), the proposed feature enhancement block FEB is embedded into the C2f module at the end of the backbone network of YOLOv8s to capture global and contextual information, enhance the feature expression ability of the object, and suppress the noise interference caused by low illumination. Compared with the multi-head self-attention mechanism, the number of parameters and the computational complexity of this module are lower. As can be seen from Table 2, at a small additional storage and computation cost, mAP 50 and mAP 50:95 increase by 1.6% and 2.5%, respectively, and the FPS of the algorithm is only reduced by 3.8, which meets the real-time requirements of UAVs. Finally, compared with YOLOv8m, the parameters and computational complexity of LI-YOLO are much lower, while its mAP 50 and mAP 50:95 are better than those of YOLOv8m.

4.5. Comparative Experiments


4.5.1. Performance Comparison Between Attention Mechanisms
In order to verify the performance of the C2fFE proposed in this paper and demonstrate the rationality of combining C2f with the FEB attention mechanism, we compared FEB with some common attention mechanisms, such as CBAM [39], EMA [40], and CA [41], as well as with a baseline that uses no attention mechanism. Experiments are conducted with the same experimental setup, and the results are shown in Table 3.

Table 3. Comparison of different attention mechanisms.

Attention | P (%) | R (%) | mAP 50 (%) | mAP 50:95 (%) | FLOPs (G)
CBAM | 74.7 | 66.8 | 74.4 | 48.3 | 46.9
EMA | 75.4 | 67.1 | 76.1 | 48.7 | 46.9
CA | 73.9 | 68.6 | 73.7 | 47.5 | 46.9
FEB | 77.5 | 68.1 | 75.8 | 49.8 | 47.0
None | 68.5 | 70.9 | 74.2 | 47.3 | 46.9

As can be seen from Table 3, the Precision, Recall, mAP 50, and mAP 50:95 of the
detection results of FEB are 2.8%, 1.3%, 1.4%, and 1.5% higher than those of the CBAM
attention mechanism, respectively. The Precision, Recall, and mAP 50:95 of FEB are 2.1%,
1.0%, and 1.1% higher than those of EMA, respectively. The Precision, mAP 50, and mAP
50:95 of FEB are 3.6%, 2.1%, and 2.3% higher than those of CA, respectively. And compared
with not using any attention mechanism, the Precision, mAP 50, and mAP 50:95 of FEB are
9%, 1.6%, and 2.5% higher, respectively. Overall, except for the Recall of CA and of the non-attention baseline and the mAP 50 of EMA being slightly higher than those of FEB, FEB outperforms the other attention mechanisms.

4.5.2. Performance Comparison Between Low-Illumination Image Preprocessing Algorithms


In order to assess the contribution of the FEB algorithm proposed in this paper to
UAV aerial object detection in low-illumination scenes, several low-illumination image
enhancement algorithms are introduced as comparison items and compared with YOLOv8
without low-illumination image enhancement but integrated with FEB. The experimental
results are shown in Table 4. For low-illumination UAV aerial images, the Recall, mAP50,
and mAP 50:95 are lower than those of the single-scale Retinex (SSR), except that the
Precision of multi-scale Retinex (MSR) is 0.8% higher than that of SSR. In the experimental
results of SSR, the Gaussian orbit scale size is equal to 5 for the best enhancement of low-
illumination images, but its Precision, mAP50, and mAP 50:95 are lower than those of the
enhancement algorithm FEB proposed in this paper.

Table 4. Comparison of different low-illumination image enhancement algorithms.

Algorithm | Size | P (%) | R (%) | mAP 50 (%) | mAP 50:95 (%) | FPS
SSR | 1 | 69.3 | 69.1 | 73.1 | 46.6 | 8.25
SSR | 3 | 75.5 | 68.7 | 74.9 | 48.3 | 8.18
SSR | 5 | 76.4 | 69.3 | 75.8 | 47.4 | 8.12
SSR | 7 | 71.5 | 69.1 | 72.3 | 46.0 | 8.05
MSR | (1, 3, 5) | 77.2 | 64.9 | 73.1 | 46.9 | 3.05
LI-YOLO | - | 77.5 | 68.1 | 75.8 | 49.8 | 135.1

As can be clearly observed from Figure 10b, the Retinex algorithm suppresses the influence of the low-illumination scene on the image, but it also introduces noise, resulting in false and missed detections, such as the two cars on the far right of the figure, which are missed. In contrast, FEB achieves better results in low-illumination scenes, as shown in Figure 10a.

Figure 10. Comparison of experimental results with or without Retinex low-illumination image enhancement, where (a) indicates that the Retinex low-illumination image enhancement technology is not used, and (b) indicates that the enhancement technology is used.

It can also be seen from Table 4 that the real-time performance of the Retinex algorithm is poor: the FPS of MSR is only 3.05, which is far from meeting the requirement that the FPS cannot be lower than 30 for real-time UAV detection. In contrast, the FPS of LI-YOLO reaches 135.1 while maintaining detection accuracy, showing good real-time performance. These results indicate that the accuracy and real-time performance of FEB are better than those of the Retinex image enhancement algorithm in low-illumination scenes.

4.5.3. Performance Comparison on the DroneVehicle Dataset
In order to evaluate the detection accuracy and detection speed of LI-YOLO in low-illumination scenes, the performance of LI-YOLO is compared with UAV aerial object detection algorithms for low-illumination scenes proposed in recent years, as well as with common YOLO series object detection algorithms, including YOLOv8s and YOLOv10s; the experimental results are shown in Table 5. "Night" denotes the dataset composed of night images screened from DroneVehicle, while "all-day" denotes the whole DroneVehicle dataset. All algorithms are implemented in the same environment and on the same hardware devices and datasets.
devices, and datasets.
Table 5. Comparisons on the DroneVehicle Dataset.

Scenes | Models | mAP 50 (%) | mAP 50:95 (%) | Params (M) | FLOPs (G) | FPS
Night | YOLOv8s | 72.6 | 46.0 | 17.10 | 28.5 | 200.0
Night | YOLOv10s | 68.8 | 44.2 | 14.02 | 24.6 | 149.3
Night | IAW [42] | 72.6 | 41.6 | - | - | 50
Night | CRSIOD [43] | 68.46 | - | 18.26 | - | -
Night | DEDet [44] | 62.3 | 34.9 | - | 129.8 | 36.7
Night | Improving YOLOv7-Tiny [45] (RGB) | 71.7 | - | 6.05 | 13.3 | 88.7
Night | Improving YOLOv7-Tiny [45] (IR) | 74.1 | - | 6.05 | 13.3 | 88.7
Night | LI-YOLO | 75.8 | 49.8 | 16.80 | 47.0 | 135.1
All-day | C2Former | 74.2 | 47.5 | 132.51 | 100.9 | -
All-day | Dark-Waste | 73.5 | 45.9 | 17.21 | 30.7 | 161.3
All-day | UA-CMDet | 64.0 | 41.3 | 138.69 | - | 2.7
All-day | YOLOv8s | 76.4 | 54.5 | 17.10 | 28.5 | 200.0
All-day | YOLOv10s | 76.8 | 54.0 | 14.02 | 24.6 | 149.3
All-day | LI-YOLO | 76.6 | 55.2 | 16.80 | 47.0 | 135.1

As can be seen from Table 5, UA-CMDet [8] proposes an uncertainty-aware module


As can be seen from Table 5, UA-CMDet [8] proposes an uncertainty-aware module
using cross-modal intersection over union and illumination estimation to quantify the un-
using cross-modal intersection over union and illumination estimation to quantify the
certainty of each object, achieving 64.0% mAP50 and 41.3% mAP 50:95. C2Former [9] de-
uncertainty of each object, achieving 64.0% mAP50 and 41.3% mAP 50:95. C2Former [9]
signs an intermodality cross-attention (ICA) module to obtain the calibrated and comple-
designs an intermodality cross-attention (ICA) module to obtain the calibrated and com-
mentary features by learning the cross-attention relationship between the RGB and IR mo-
plementary features by learning the cross-attention relationship between the RGB and IR
dality, resulting in 74.2% mAP50 and 47.5% mAP 50:95. Dark-Waste [10] combines an im-
modality, resulting in 74.2% mAP50 and 47.5% mAP 50:95. Dark-Waste [10] combines an
proved ConvNeXt network with YOLOv5 to use large-scale non-overlapping convolution
improved ConvNeXt network with YOLOv5 to use large-scale non-overlapping convolu-
to
tionimprove
to improve the the
algorithm’s
algorithm’s ability of of
ability capturing
capturingobjects
objectson on low-illumination
low-illumination images, images,
achieving 73.5% mAP50
achieving 73.5% mAP50 and and 45.9%
45.9% mAPmAP 50:95.
50:95. IAW
IAW [42]
[42] introduces
introduces an an Illumination-aware
Illumination-aware
Network
Network to to give
give an illumination measure
an illumination measure of the input
of the input image,
image, resulting
resulting in in 72.6%
72.6% mAP50
mAP50
and 41.6% mAP 50:95. CRSIOD [43] proposes an uncertainty-aware
and 41.6% mAP 50:95. CRSIOD [43] proposes an uncertainty-aware module to quantify the module to quantify
the uncertainties
uncertainties present
present in eachin each modality
modality as weights
as weights to motivate
to motivate the network
the network to learntoinlearn in
a direc-
ation
direction favorable for optimal object detection, achieving 68.46%
favorable for optimal object detection, achieving 68.46% mAP50. DEDet [44] develops a mAP50. DEDet [44]
develops
fine-graineda fine-grained parameter
parameter predictor predictor
(FPP) (FPP)pixelwise
to estimate to estimate pixelwise
parameter parameter
maps maps
of the image
of the image filters, resulting in 62.3% mAP50 and 34.9% mAP
filters, resulting in 62.3% mAP50 and 34.9% mAP 50:95. Improving YOLOv7-Tiny [45] 50:95. Improving YOLOv7-
Tiny
assigns[45]anchor
assigns anchor
boxes boxes according
according to the aspectto the aspect
ratio ratio oftruth
of ground groundboxestruth boxes toprior
to provide pro-
vide prior information on object shape for the network and uses
information on object shape for the network and uses a hard sample mining loss function a hard sample mining
loss function
(HSM Loss) to (HSM
guideLoss) to guide to
the network theenhance
networklearning
to enhance fromlearning from hard
hard samples; samples;
in RGB and IRin
RGB
scenes,and IR scenes,
mAP reachesmAP71.7% reaches 71.7%respectively.
and 74.1%, and 74.1%, respectively.
Compared with Compared
the above withalgorithms,
the above
algorithms,
LI-YOLO achieves LI-YOLO 75.8%achieves
mAP50 75.8% mAP50mAP
and 49.8% and 50:95
49.8%inmAP 50:95 in low-illumination
low-illumination scenes while
scenes
achievingwhile achieving
76.6% mAP5076.6% mAP50
and 55.2% and 50:95
mAP 55.2%inmAP 50:95
all-day in all-day
scenes; scenes; the
the detection detection
accuracy is
accuracy is improved in both low-illumination and all-day scenes.
improved in both low-illumination and all-day scenes. Among the YOLO series algorithms, Among the YOLO se-
ries algorithms,
YOLOv8 has theYOLOv8 has the
best detection best detection
performance performance in low-illumination
in low-illumination scenes, achievingscenes, 72.6%
achieving
mAP50 and72.6% 46.0%mAP50
mAP 50:95, and but
46.0% mAP
both 50:95,than
are lower but LI-YOLO.
both are YOLOv10
lower thanhas LI-YOLO.
the best
YOLOv10 has the bestindetection
detection performance performance
all-day scenes, with mAP50 in all-day
reachingscenes,
76.8%,with
whichmAP50 reaching
is 0.2% higher
76.8%, which isHowever,
than LI-YOLO. 0.2% higher thethan
mAPLI-YOLO. However,isthe
50:95 of YOLOv10 mAP
1.2% 50:95
lower thanof LI-YOLO,
YOLOv10and is 1.2%
the
lower
FPS ofthan
135.1LI-YOLO, and the real-time
can still meet FPS of 135.1 can still meet the real-time requirements.
requirements.
These results show that LI-YOLO outperforms other algorithms in low-illumination scenes, and its performance in all-day scenes is not inferior to that of other algorithms. The experimental results also show that LI-YOLO has good robustness in low-illumination scenes, as shown in Figure 11.
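For reference, the two accuracy metrics quoted throughout this comparison are related as follows: mAP50 is the average precision computed at an IoU threshold of 0.50, while mAP 50:95 averages the AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95. The short Python sketch below illustrates this relationship; the per-threshold AP values are placeholders, not results reported in this paper.

# Minimal, illustrative Python sketch: mAP 50:95 is the mean of the AP values
# evaluated at IoU thresholds 0.50 through 0.95 in steps of 0.05, while mAP50
# is simply the AP at IoU = 0.50. The AP values below are placeholders.
ap_per_iou = {
    0.50: 0.758, 0.55: 0.72, 0.60: 0.69, 0.65: 0.65, 0.70: 0.60,
    0.75: 0.54, 0.80: 0.46, 0.85: 0.36, 0.90: 0.24, 0.95: 0.10,
}
map50 = ap_per_iou[0.50]
map50_95 = sum(ap_per_iou.values()) / len(ap_per_iou)
print(f"mAP50 = {map50:.3f}, mAP 50:95 = {map50_95:.3f}")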

Figure 11. Comparison of experimental results in different low-illumination scenes, where (a) represents the detection effect of YOLOv8 in foggy scenes, (b) represents the detection effect of LI-YOLO in foggy scenes, (c) represents the detection effect of YOLOv8 in night scenes, and (d) represents the detection effect of LI-YOLO in night scenes.

4.5.4. Performance Comparison on the LLVIP Dataset
In order to further evaluate the performance of LI-YOLO in aerial object detection in low-illumination scenes, comparative experiments are also carried out on the LLVIP dataset, and the experimental results are recorded in Table 6. We can observe that LI-YOLO outperforms YOLOv8 by 0.5% mAP50 and 1.7% mAP 50:95, respectively, and achieves the best mAP50 among the models in Table 6. In addition, LI-YOLO also satisfies the real-time requirements. Experimental results indicate that LI-YOLO also achieves advanced detection performance on the LLVIP dataset.

Table 6. Comparisons on the LLVIP Dataset.

Models             mAP 50 (%)   mAP 50:95 (%)   Params (M)   FLOPs (G)   FPS
YOLOv8             90.3         46.8            17.10        28.5        188.7
YOLOv10            90.2         46.9            14.01        24.5        151.5
IAIFNet [11]       83.9         63.3            63.5         -           90.9
Literature [46]    87.6         36.4            -            -           6.7
DIVFusion [12]     89.8         52.0            4.4          14,454.9    0.331
LI-YOLO            90.8         48.5            16.80        47.0        156.3
5. Conclusions
This paper proposes a UAV aerial object detection algorithm for low-illumination scenes, which aims to improve the accuracy of real-time object detection under low illumination. Firstly, this paper proposes a feature enhancement block to improve the feature extraction ability of the algorithm and suppress the noise interference caused by low illumination. Secondly, in order to improve the algorithm's ability to detect small objects, a shallow feature fusion layer and a small object detection head are added to the feature fusion network. Finally, an adaptive spatial feature fusion (ASFF) network is introduced, which adaptively learns spatial weights and uses them to weight the features of different layers, so as to effectively improve the scale invariance of the features (an illustrative sketch of this fusion step is given at the end of this section). The method in this paper aims to achieve real-time detection of aerial photographic objects in low-illumination scenes. With the development and improvement of specific modules for low-illumination scenes and UAV aerial photographic scenes, we have achieved remarkable performance. Experimental results on UAV aerial datasets in different scenes show that the proposed
algorithm can achieve UAV aerial object detection in low-illumination scenes while ensuring
real-time performance. In conclusion, compared with the existing publicly available UAV
aerial object detection algorithms in low-illumination scenes, our method has significant
advantages in detection accuracy and real-time performance, and it is more suitable for
real-time detection tasks on UAV platforms in low-illumination scenes. LI-YOLO is a pure
vision scheme based on visible light; with the growing adoption of infrared imaging in low-illumination scenes, future work should focus on designing a multi-modal feature extraction module [47] that processes infrared and visible light data jointly so that the two modalities complement each other's advantages.
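As a concrete illustration of the adaptive spatial feature fusion step summarized above, the following minimal PyTorch-style sketch shows one common way pixelwise fusion weights can be learned and applied across three pyramid levels. The module name, channel counts, and 1x1 weight heads are assumptions for illustration only and do not reproduce the exact LI-YOLO implementation.

# Minimal PyTorch-style sketch of adaptive spatial feature fusion (ASFF), assuming
# three pyramid levels with the same channel count. All layer names and sizes are
# illustrative assumptions, not the exact LI-YOLO implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFSketch(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # One 1x1 convolution per level produces a single-channel spatial weight map.
        self.w0 = nn.Conv2d(channels, 1, kernel_size=1)
        self.w1 = nn.Conv2d(channels, 1, kernel_size=1)
        self.w2 = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, f0, f1, f2):
        # Resize the coarser levels to the spatial size of the target level f0.
        size = f0.shape[-2:]
        f1 = F.interpolate(f1, size=size, mode="nearest")
        f2 = F.interpolate(f2, size=size, mode="nearest")
        # Learn one weight map per level and normalize them pixelwise with softmax,
        # so the three weights sum to 1 at every spatial location.
        w = torch.softmax(torch.cat([self.w0(f0), self.w1(f1), self.w2(f2)], dim=1), dim=1)
        # Each pixel adaptively mixes the three feature levels.
        return f0 * w[:, 0:1] + f1 * w[:, 1:2] + f2 * w[:, 2:3]

# Example: fuse 80x80, 40x40, and 20x20 feature maps into a single 80x80 map.
fused = ASFFSketch()(torch.randn(1, 256, 80, 80),
                     torch.randn(1, 256, 40, 40),
                     torch.randn(1, 256, 20, 20))
print(fused.shape)  # torch.Size([1, 256, 80, 80])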

Author Contributions: Conceptualization, S.L. and H.H.; methodology, S.L.; software, S.L.; validation,
S.L. and Z.Z.; formal analysis, H.H.; investigation, S.L. and Z.Z.; resources, H.H. and Y.Z.; data
curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L., H.H.,
and Y.Z.; visualization, S.L.; supervision, H.H. and Y.Z.; project administration, S.L. and H.H.;
funding acquisition, H.H. and Y.Z. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China
(No. 32200546), the Science and Technology Project of Hebei Education Department (No. QN2021038),
the Hebei Natural Science Foundation (No. C2024202003), and the Science and Technology Coopera-
tion Special Project of Shijiazhuang (No. SJZZXA23005).
Data Availability Statement: The DroneVehicle and LLVIP datasets were obtained from https:
//github.com/VisDrone/DroneVehicle (accessed on 21 July 2024), https://github.com/bupt-ai-cz/
LLVIP (accessed on 16 August 2024), separately.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Klemas, V.V. Coastal and environmental remote sensing from unmanned aerial vehicles: An overview. J. Coast. Res. 2015, 31,
1260–1267. [CrossRef]
2. Lin, S.; Jin, L.; Chen, Z. Real-time monocular vision system for UAV autonomous landing in outdoor low-illumination environ-
ments. Sensors 2021, 21, 6226. [CrossRef] [PubMed]
3. Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions:
A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177. [CrossRef]
4. Zhao, Y.Q.; Rao, Y.; Dong, S.P.; Zhang, J.Y. Survey on deep learning object detection. J. Image Graph. 2020, 25, 629–654. [CrossRef]
5. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp.
580–587.
7. Singh, P.; Bhandari, A.K.; Kumar, R. Low light image enhancement using reflection model and wavelet fusion. Multimed. Tools
Appl. 2024, 1, 1–29. [CrossRef]
8. Xie, J.; Nie, J.; Ding, B.; Yu, M.; Cao, J. Cross-modal Local Calibration and Global Context Modeling Network for RGB-Infrared
Remote Sensing Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1–10. [CrossRef]
9. Yuan, M.; Wei, X. C2 Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. arXiv 2024,
arXiv:2306.16175.
10. Qiao, Y.; Zhang, Q.; Qi, Y.; Wan, T.; Yang, L.; Yu, X. A Waste Classification model in Low-illumination scenes based on ConvNeXt.
Resour. Conserv. Recycl. 2023, 199, 107274. [CrossRef]
11. Yang, Q.; Zhang, Y.; Zhao, Z.; Zhang, J.; Zhang, S. IAIFNet: An Illumination-Aware Infrared and Visible Image Fusion Network.
IEEE Signal Process. Lett. 2024, 13, 1374–1378. [CrossRef]
12. Tang, L.; Xiang, X.; Zhang, H.; Gong, M.; Ma, J. DIVFusion: Darkness-free infrared and visible image fusion. Inf. Fusion 2023, 91,
477–493. [CrossRef]
13. Huang, J.; Xu, H.; Liu, G.; Wang, C.; Hu, Z.; Li, Z. SIDNet: A single image dedusting network with color cast correction. Signal Process. 2022, 199, 108612. [CrossRef]
14. Xu, K.; Chen, H.; Xu, C.; Jin, Y.; Zhu, C. Structure-texture aware network for low-light image enhancement. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4983–4996. [CrossRef]
15. Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf. Fusion 2023, 99, 101870. [CrossRef]
16. Zetao, J.; Yun, X.; Shaoqin, Z. Low-illumination object detection method based on Dark-YOLO. J. Comput.-Aided Des. Comput.
Graph. 2023, 35, 441–451.
17. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In
Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS),
Chennai, India, 18–19 April 2024; pp. 1–6.
18. Sohan, M.; Sai Ram, T.; Reddy, R. Data Intelligence and Cognitive Informatics; Springer: Berlin/Heidelberg, Germany, 2024.
19. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Computer Vision;
Springer Nature: Berlin/Heidelberg, Germany, 2022; pp. 649–667.
20. Available online: https://blog.csdn.net/Jiangnan_Cai/article/details/137099734?fromshare=blogdetail&sharetype=blogdetail&
sharerId=137099734&sharerefer=PC&sharesource=weixin_42488451&sharefrom=from_link (accessed on 6 August 2024).
21. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
22. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
23. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
24. Xiao, Y.; Jiang, A.; Ye, J. Making of night vision: Object detection under low-illumination. IEEE Access 2020, 8, 123075–123086.
[CrossRef]
25. Han, K.; Wang, Y.; Chen, H. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [CrossRef]
26. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
27. Safaldin, M.; Zaghden, N.; Mejdoub, M. An Improved YOLOv8 to Detect Moving Objects. IEEE Access 2024, 12, 59782–59806.
[CrossRef]
28. Acer, S.; Selvitopi, O.; Aykanat, C. Improving performance of sparse matrix dense matrix multiplication on large-scale parallel
systems. Parallel Comput. 2016, 59, 71–96. [CrossRef]
29. Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. arXiv 2016, arXiv:1612.02295.
30. Ma, X.; Zhang, P.; Zhang, S.; Duan, N.; Hou, Y.; Zhou, M.; Song, D.; Zhou, M. A tensorized transformer for language modeling.
arXiv 2019.
31. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Chaurasia, A.; Diaconu, L.; Ingham, F.; Colmagro, A.; Ye, H.; et al.
ultralytics/yolov5: v4. 0-nn. SiLU () Activations, Weights & Biases Logging, PyTorch Hub Integration; Zenodo: Geneva, Switzerland, 2021.
32. Li, K.; Zou, C.; Bu, S.; Liang, Y.; Zhang, J.; Gong, M. Multi-modal feature fusion for geographic image annotation. Pattern Recognit.
2018, 73, 1–14. [CrossRef]
33. Li, X.; Liu, Z.; Luo, P.; Change Loy, C.; Tang, X. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer
cascade. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July
2017; pp. 3193–3202.
34. Cazzato, D.; Cimarelli, C.; Sanchez-Lopez, J.L.; Voos, H.; Leo, M. A survey of computer vision methods for 2d object detection
from unmanned aerial vehicles. J. Imaging 2020, 6, 78. [CrossRef] [PubMed]
35. Liu, Q.; Ye, H.; Wang, S.; Xu, Z. YOLOv8-CB: Dense Pedestrian Detection Algorithm Based on In-Vehicle Camera. Electronics 2024,
13, 236. [CrossRef]
36. Li, X.; Li, W.; Ren, D.; Zhang, H.; Wang, M.; Zuo, W. Enhanced blind face restoration with multi-exemplar images and adaptive
spatial feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA,
USA, 13–19 June 2020; pp. 2706–2715.
37. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE
Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [CrossRef]
38. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3496–3504.
39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
40. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November
2019; pp. 9167–9176.
41. Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive
attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711.
[CrossRef]
42. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern
Recognit. 2019, 85, 161–171. [CrossRef]
43. Wang, H.; Wang, C.; Fu, Q.; Zhang, D.; Kou, R.; Yu, Y.; Song, J. Cross-Modal Oriented Object Detection of UAV Aerial Images Based on Image Feature. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–21. [CrossRef]
44. Xi, Y.; Jia, W.; Miao, Q.; Feng, J.; Ren, J.; Luo, H. Detection-Driven Exposure-Correction Network for Nighttime Drone-View Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62. [CrossRef]
45. Hu, S.; Zhao, F.; Lu, H.; Deng, Y.; Du, J.; Shen, X. Improving YOLOv7-tiny for infrared and visible light image object detection on
drones. Remote Sens. 2023, 15, 3214. [CrossRef]
46. Cao, Y.; Bin, J.; Hamari, J.; Blasch, E.; Liu, Z. Multimodal object detection by channel switching and spatial attention. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023;
pp. 403–411.
47. Liu, C.; Chen, H.; Deng, L.; Guo, C.; Lu, X.; Yu, H.; Zhu, L.; Dong, M. Modality specific infrared and visible image fusion based
on multi-scale rich feature representation under low-light environment. Infrared Phys. Technol. 2024, 140, 105351. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
