LI-YOLO: An Object Detection Algorithm for UAV Aerial Images in Low-Illumination Scenes
1 School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China;
[email protected] (S.L.); [email protected] (Y.Z.)
2 Innovation and Research Institute, Hebei University of Technology in Shijiazhuang,
Shijiazhuang 050299, China
* Correspondence: [email protected]
Abstract: With the development of unmanned aerial vehicle (UAV) technology, deep learning is
becoming more and more widely used in object detection in UAV aerial images; however, detecting
and identifying small objects in low-illumination scenes is still a major challenge. Aiming at the
problem of low brightness, high noise, and obscure details of low-illumination images, an object
detection algorithm, LI-YOLO (Low-Illumination You Only Look Once), for UAV aerial images in
low-illumination scenes is proposed. Specifically, in the feature extraction section, this paper proposes
a feature enhancement block (FEB) to realize global receptive field and context information learning
through lightweight operations and embeds it into the C2f module at the end of the backbone network
to alleviate the problems of high noise and detail blur caused by low illumination with very few
parameter costs. In the feature fusion part, aiming to improve the detection performance for small
objects in UAV aerial images, a shallow feature fusion network and a small object detection head are
added. In addition, the adaptive spatial feature fusion structure (ASFF) is also introduced, which
adaptively fuses information from different levels of feature maps by optimizing the feature fusion
strategy so that the network can more accurately identify and locate objects of various scales. The
experimental results show that the mAP50 of LI-YOLO reaches 76.6% on the DroneVehicle dataset
and 90.8% on the LLVIP dataset. Compared with other current algorithms, LI-YOLO improves
the mAP50 by 3.1% on the DroneVehicle dataset and 6.9% on the LLVIP dataset. Experimental results show that the proposed algorithm can effectively improve object detection performance in low-illumination scenes.

Keywords: low illumination; small object detection; UAV; YOLOv8

Citation: Liu, S.; He, H.; Zhang, Z.; Zhou, Y. LI-YOLO: An Object Detection Algorithm for UAV Aerial Images in Low-Illumination Scenes. Drones 2024, 8, 653. https://doi.org/10.3390/drones8110653
One-stage algorithms such as the YOLO series [5] predict the object category and location through direct regression to achieve fast real-time detection but may be slightly inferior in accuracy to two-stage algorithms, especially for small targets and complex backgrounds. The two-stage R-CNN series [6] performs fine classification of candidate regions with high accuracy, especially in complex scenes, but is slightly slower. Both perform well under good lighting conditions, while their performance is limited by the degradation of image quality in low-illumination scenes.
More specifically, there are significant differences between the object detection al-
gorithm in low-illumination scenes and the object detection algorithm in general scenes,
especially in the perspective of UAVs. The object detection task in low-illumination scenes
mainly has problems such as low brightness, high noise, and blurred details. In response to
the above problems, researchers have conducted extensive research and provided some
solutions. The wavelet-based Retinex image enhancement algorithm [7] uses the Retinex
algorithm to mitigate the effects of low illumination and the wavelet threshold algorithm to
suppress noise while preserving edge information. UA-CMDet [8] proposes an uncertainty-
aware module using cross-modal intersection over union and illumination estimation
to quantify the uncertainty of each object. C2Former [9] designs an intermodality cross-attention (ICA) module to obtain calibrated and complementary features by learning the cross-attention relationship between the RGB and IR modalities. Dark-Waste [10] combines an improved ConvNeXt network with YOLOv5, using large-scale non-overlapping convolution to improve the algorithm's ability to capture objects in low-illumination images. IAIFNet [11] incorporates a salient target aware module (STAM) and an adaptive differential fusion module (ADFM) to respectively enhance gradient and contrast with sensitivity to brightness. DIVFusion [12] uses a scene-illumination disentangled network (SIDNet [13]) to strip the illumination degradation from nighttime visible images while preserving informative features of the source images and devises a texture–contrast enhancement fusion network (TCEFNet) [14] to integrate complementary information and enhance the contrast and texture details of the fused features. The MAFusion [15] method feeds infrared and visible images into an adaptive weighting module and learns the differences between them through dual-stream interaction to accomplish the two-modal image fusion task and improve network performance in low-illumination scenes. The Dark-YOLO [16] method
proposes a path aggregation enhanced module to further enhance the ability of feature
representation and improve the performance of object detection in low-illumination scenes.
Despite these advances, the object detection of UAV aerial images in low-illumination
scenes is still a daunting challenge due to the problems of low brightness, high noise, and
blurred details in low-illumination scenes, as well as the need for high inference speed and
lightweight models for embedded devices such as UAVs.
In order to further improve the accuracy of object detection in UAV-captured images in
low-illumination scenes, this paper proposes an object detection algorithm for UAV aerial
images in low-illumination scenes. The main innovations include the following points:
• A feature enhancement block (FEB) is proposed to realize global receptive field and
context feature learning through lightweight operations and then integrate it into the
C2f module at the end of the backbone network, so as to improve the feature extraction
ability of the algorithm with very low parameter costs and alleviate the problems of
high noise and detail blur caused by low illumination.
• On the basis of YOLOv8, a shallow feature fusion network and a small object detection
head are added, aiming to solve the problem of small object scale (any dimension of
the object is less than 10% of the total size of the image) in UAV aerial images and
improve the detection performance of the algorithm for small objects.
• In view of the inconsistencies between different scale features exhibited in aerial
images, the adaptive spatial feature fusion structure (ASFF) is introduced in this paper,
which enables the network to more accurately identify and locate objects of various
scales by adaptively fusing information from different levels of feature maps.
2. Related Works
2.1. Benchmark Network YOLOv8
As a mainstream one-stage object detection algorithm, YOLOv8 [17] exhibits significant advantages in the field of object detection due to its superior real-time detection performance, rich global information representation capabilities, and robust generalization abilities. YOLOv8 is used as the benchmark network in this paper, and its network structure is shown in Figure 1, including the Backbone, Neck, and Head. The backbone network is responsible for extracting features from the input image. The neck network sits between the backbone network and the head network, and it is responsible for further processing and fusing the features extracted from the backbone network. The head network is the last part of YOLOv8, which is responsible for object detection based on the feature map provided by the neck network.
Figure 1. The network structure of YOLOv8. It contains the backbone, neck, and head.

Figure 2. Comparison of the C3 module and the C2f module. (a) represents the structure of C3, (b) represents the structure of C2f, where the structure of the Bottleneck module is shown in (c).
Figure 3. Illustration of the adaptive spatial feature fusion mechanism. For each level, the features of all the other levels are resized to the same shape and spatially fused according to the learned weight maps.

The green box illustrates its operation, where $X^{1\to 3}$, $X^{2\to 3}$, and $X^{3\to 3}$ represent three output features from PANet. These features are then multiplied by the learned weights $\alpha^{3}$, $\beta^{3}$, and $\gamma^{3}$, respectively, and then summed to obtain the fused features $y^{3}$. This process can be mathematically expressed as follows:

$$y_{ij}^{l} = \alpha_{ij}^{l}\cdot x_{ij}^{1\to l} + \beta_{ij}^{l}\cdot x_{ij}^{2\to l} + \gamma_{ij}^{l}\cdot x_{ij}^{3\to l} \qquad (1)$$

where $y_{ij}^{l}$ represents the (i, j)-th vector of the output feature maps $y^{l}$ among channels, and $x_{ij}^{n\to l}$ denotes the feature vector at position (i, j) on the feature maps resized from level n to level l. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ imply the learnable spatial weights of the three different levels to level l, which are adaptively learned by the network. Following [22], we force $\alpha_{ij}^{l}+\beta_{ij}^{l}+\gamma_{ij}^{l}=1$ and $\alpha_{ij}^{l},\beta_{ij}^{l},\gamma_{ij}^{l}\in[0,1]$, defined accordingly:

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}} \qquad (2)$$

The variable $\lambda$ is a weighted scalar map computed from each of the three input tensors using 1 × 1 convolution blocks, which can be learned by standard backpropagation. Finally, the obtained y in Equation (1) can be activated by a nonlinear function f(·), such as BatchNorm + ReLU.
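To make the fusion concrete, the following is a minimal PyTorch sketch of the adaptive spatial fusion described by Equations (1) and (2); module and variable names are illustrative, and the three inputs are assumed to have already been resized to a common shape and channel count.

```python
import torch
import torch.nn as nn

class ASFFLevel(nn.Module):
    """Adaptive spatial feature fusion for one output level (illustrative sketch).

    Assumes x1, x2, x3 are feature maps from three pyramid levels that have
    already been resized to the same spatial shape and channel count.
    """
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions produce the scalar weight maps (the lambda terms in Eq. 2)
        self.weight_conv = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)]
        )
        # Optional post-fusion activation, e.g., BatchNorm + ReLU as in the text
        self.post = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x1, x2, x3):
        feats = [x1, x2, x3]
        # Per-pixel scalar maps, one per input level: shape (B, 1, H, W)
        lambdas = [conv(f) for conv, f in zip(self.weight_conv, feats)]
        # Softmax across the three levels enforces alpha + beta + gamma = 1 (Eq. 2)
        weights = torch.softmax(torch.cat(lambdas, dim=1), dim=1)  # (B, 3, H, W)
        # Weighted sum of the three levels (Eq. 1)
        fused = sum(weights[:, i:i + 1] * feats[i] for i in range(3))
        return self.post(fused)
```

The 1 × 1 convolutions produce the scalar maps $\lambda$, and the softmax across the three channels enforces the constraint that the three weights sum to one at every spatial position.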
3. Proposed Algorithm
In this paper, YOLOv8 is selected as the benchmark algorithm, and the improved algorithm LI-YOLO is shown in Figure 4, which is described below around the proposed feature enhancement block (FEB), the C2f module integrated with FEB (C2fFE), the improved feature fusion network, and the adaptive spatial feature fusion structure (ASFF).

Figure 4. LI-YOLO's framework. It contains the backbone, neck, and head. The gray cells represent the original modules of the baseline algorithm YOLOv8, and the colored bold cells represent the improved or newly added modules. Among them, the model structure is improved with C2fFE, the improved feature fusion network, and ASFF, all of which are bordered in black.

3.1. Principle of the LI-YOLO
In order to further enhance the performance of YOLOv8 in small object detection tasks under low-illumination scenes, this study optimizes three parts of the original YOLOv8 framework. Firstly, inspired by the multi-head self-attention mechanism (MHSA) integrated in Transformer, this paper proposes a feature enhancement block, which realizes global receptive field and context information learning through lightweight operations and integrates it into the C2f module to alleviate the problems of high noise and obscure details caused by low illumination with extremely low parameter costs. Secondly, in order to make full use of the abundant details and edge information contained in shallow features, a shallow feature fusion network and a small detection head are added to the feature fusion network composed of FPN and PAN [23]. Additionally, an adaptive spatial feature fusion (ASFF) structure is inserted in front of the four detection heads for multi-scale feature fusion, which further improves the model's capability of detecting small objects.
3.2. Improvements in Feature Enhancement Network
Although C2f improves the feature extraction capability of the network, the detection accuracy of small objects in low-illumination scenes is not ideal. Due to the problems of high noise and blurred details in low-illumination scenes, it is difficult for the network to fully mine useful information while ignoring the interference information [24]. Transformer can directly calculate the dependencies between any two positions in the tensor in order to better capture the global information [25], while the multi-head self-attention mechanism (MHSA) can capture multiple dependencies and feature information in the input sequence, so as to reduce the dependence of the model on a single representation [26]. However, the huge computational cost makes it difficult to deploy an object detection algorithm with a multi-head self-attention mechanism on UAV devices. Therefore, this paper proposes a novel feature enhancement module (C2fFE) based on the C2f, which has stronger feature characterization capabilities than the C3 and C2f.
3.2.1. C2fFE
The C2f module is deployed at all layers of the YOLOv8 network, capturing and fusing rich information from shallow layers to deep layers [27]. There are some problems such as high noise and blurred details in low-illumination scenes, so it is important to make reasonable use of the shallow detail information and the deep global information of the network to improve the feature characterization ability in low-illumination scenes. In this paper, a feature enhancement block (FEB) is embedded in C2f as a feature enhancement module (C2fFE) at the end of the YOLOv8 backbone network, and its structure is shown in Figure 5. First, the input feature map is transformed through the first convolutional layer and then divided into two parts. One part is fed into an ELAN-like network, where the feature map of each channel is first enhanced by FEB and then fed into the Bottleneck module. The other part is passed directly to the output in the form of residuals. Finally, the results of the two parts are feature fused in the channel dimension and passed through the second convolutional layer to obtain the final output. Therefore, the C2fFE module has a more robust feature characterization ability than the original C2f module.
Figure 5. The structure of C2fFE. The structure of the Bottleneck in this figure is the same as in
Figure 2c.
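The data flow described above can be outlined roughly as follows. This is only an illustrative PyTorch sketch based on the textual description, not the authors' implementation; the FEB and Bottleneck internals, the split ratio, and the channel counts are placeholders.

```python
import torch
import torch.nn as nn

class C2fFE(nn.Module):
    """Illustrative outline of the C2fFE block described in Section 3.2.1.

    `feb` and `bottleneck` are placeholders for the paper's feature enhancement
    block and Bottleneck module; both are assumed to preserve channel count.
    Only the split / enhance / residual / concat data flow is sketched here.
    """
    def __init__(self, in_channels: int, out_channels: int,
                 feb: nn.Module, bottleneck: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.feb = feb                  # global-context feature enhancement branch
        self.bottleneck = bottleneck    # subsequent Bottleneck module
        self.conv2 = nn.Conv2d(2 * (out_channels // 2), out_channels, kernel_size=1)

    def forward(self, x):
        x = self.conv1(x)
        # Split the transformed feature map into two parts along the channel dimension
        branch, residual = torch.chunk(x, 2, dim=1)
        # One part is enhanced by FEB and then processed by the Bottleneck module
        branch = self.bottleneck(self.feb(branch))
        # The other part is passed directly to the output as a residual,
        # then both are fused in the channel dimension and projected
        out = torch.cat([branch, residual], dim=1)
        return self.conv2(out)
```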
Figure 6. The structures of MHSA and FEB, where (a) represents MHSA and (b) represents FEB, and where n·d, d·n, and n·n represent the dimensions of the matrices, respectively.
The generalized form of MHSA can be expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q\cdot K^{T}}{\sqrt{d_k}}\right)\cdot V \qquad (3)$$
The algorithm process is as follows:
1. Firstly, the input tensor obtains the Q/K/V tokens through the linear projection layer. Q, K, and V represent the query matrix, key matrix, and value matrix, respectively [30];
2. Secondly, an attention score is obtained through $S_n = S/\sqrt{d_k}$, where $S = Q\cdot K^{T}$;
3. Then, the Softmax activation function is used to convert the score into a probability: $P = \mathrm{softmax}(S_n)$;
4. Finally, the weighted output is obtained by $Z = V\cdot P$.
The above algorithm can significantly improve the network performance, but the exponential Softmax operation and large-scale matrix multiplication make it difficult to deploy the algorithm on UAV devices with limited computing resources. In order to solve this problem, this paper uses the lightweight SiLU [31] to replace the exponential Softmax without changing the basic idea of MHSA and cleverly takes advantage of the properties of matrix multiplication to greatly reduce the amount of computation. The improved algorithm is defined as follows:

$$O_i = \frac{\sum_{j=1}^{N}\left[\mathrm{SiLU}(Q_i)\,\mathrm{SiLU}(K_j)^{T}\right]V_j}{\mathrm{SiLU}(Q_i)\sum_{j=1}^{N}\mathrm{SiLU}(K_j)^{T}} = \frac{\mathrm{SiLU}(Q_i)\left[\sum_{j=1}^{N}\mathrm{SiLU}(K_j)^{T}V_j\right]}{\mathrm{SiLU}(Q_i)\sum_{j=1}^{N}\mathrm{SiLU}(K_j)^{T}} \qquad (4)$$
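A minimal PyTorch sketch of Equation (4) is given below to show how the associativity of matrix multiplication removes the N × N score matrix; the tensor shapes and the small epsilon added to the denominator for numerical stability are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def silu_linear_attention(q, k, v, eps: float = 1e-6):
    """Linear attention per Eq. (4): SiLU feature maps replace the Softmax.

    q, k, v: (n, d) token matrices. Computing sum_j SiLU(k_j)^T v_j first gives
    a (d, d) matrix, so the cost scales with n*d*d instead of the n*n score
    matrix of standard attention.
    """
    q, k = F.silu(q), F.silu(k)
    kv = k.transpose(-2, -1) @ v                 # (d, d): sum_j SiLU(k_j)^T v_j
    k_sum = k.sum(dim=-2)                        # (d,):   sum_j SiLU(k_j)
    numer = q @ kv                               # (n, d): SiLU(q_i) [sum_j SiLU(k_j)^T v_j]
    denom = (q @ k_sum).unsqueeze(-1) + eps      # (n, 1): SiLU(q_i) sum_j SiLU(k_j)^T
    return numer / denom
```

Because SiLU(K)ᵀV is only d × d, the computation grows linearly with the number of tokens N rather than quadratically, which is the point of the reformulation above.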
the semantic information in the feature maps becomes richer, but the detailed information
gradually diminishes [33]. The detected objects of this study are small objects captured by
UAVs, which, due to their small scale and susceptibility to complex backgrounds, have
features predominantly residing in shallow feature maps [34]. However, the YOLOv8
algorithm does not adequately fuse shallow features, resulting in suboptimal performance
in small object detection [35]. To address these issues and enhance the network’s accuracy
and robustness, this paper introduces a shallow feature fusion network subsequent to the
shallow feature extraction module of the backbone network. The core function of this
network is to integrate the abundant detailed information from shallow feature maps with
the feature maps extracted by the original backbone network, thereby fully leveraging the
rich details in shallow feature maps. Additionally, to strengthen the model’s ability to
detect small objects, this paper appends a small object detection head, P2, after the shallow
feature fusion layer. The anchor boxes of this detection head are designed based on the
typical sizes of small objects in this study, which can enhance the network’s capability for
detecting small objects.
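As an illustration only (not the authors' exact implementation), the sketch below shows one common way to fold a shallow, high-resolution backbone feature map into the neck and attach an extra P2 prediction layer for small objects; the layer names, channel counts, and the placeholder head are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowFusionP2(nn.Module):
    """Illustrative shallow-feature fusion feeding a small-object (P2) head.

    `p2_backbone` is a high-resolution (stride-4) feature map from the backbone,
    `p3_neck` the stride-8 feature map produced by the existing neck.
    """
    def __init__(self, c_p2: int, c_p3: int, c_out: int, num_outputs: int):
        super().__init__()
        self.reduce = nn.Conv2d(c_p2 + c_p3, c_out, kernel_size=1)
        self.fuse = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)
        self.p2_head = nn.Conv2d(c_out, num_outputs, kernel_size=1)  # placeholder detection head

    def forward(self, p2_backbone, p3_neck):
        # Upsample the deeper neck feature to the shallow resolution and concatenate,
        # so the detailed shallow information is combined with deeper semantics
        p3_up = F.interpolate(p3_neck, size=p2_backbone.shape[-2:], mode="nearest")
        fused = self.fuse(self.reduce(torch.cat([p2_backbone, p3_up], dim=1)))
        # Extra P2 prediction map dedicated to small objects
        return self.p2_head(fused)
```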
Figure 7. Adaptive spatial feature fusion structure diagram. The formula in the light blue dotted box represents the fusion algorithm for one layer, and the other layers are similar.
4.2. Datasets
4.2.1. DroneVehicle Dataset
DroneVehicle [37] is a large-scale UAV aerial vehicle dataset released by scholars from
Tianjin University, covering urban areas, suburbs, highways, and parking lots from day
to night, along with real-world occlusion and scale changes. In this paper, the research
problem is UAV aerial object detection in low-illumination scenes, so the images taken at
night are screened to construct the dataset in low-illumination scenes, and 1963 images are
selected as the training set, 148 as the validation set, and 1022 as the test set. DroneVehicle
has 5 detection categories, and the category distribution sample statistics in Figure 8 show
the width and height distribution of the objects in the dataset. It can be noted that the lower left quadrant has a higher concentration of points, indicating the dominance of smaller objects.

Figure 8. The amount of data for all labels in the DroneVehicle training set. (a) Position distribution of the labels in the training set. (b) Width and height distribution of the labels in the training set.
In this paper, the UAV object detection in low-illumination scenes is screened out to construct the dataset in low-illumination scenes, and 1963 images are selected as the training set, 148 as the validation set, and 1022 as the test set.

4.2.2. LLVIP Dataset
LLVIP [38] is a visible-infrared paired dataset for low-illumination vision, which con-
tains 30,976 images, or 15,488 pairs. Most of these images were taken in low-illumination scenes with only one category, person, and all images are strictly aligned in time and space.
The algorithm in this paper only needs RGB images, so only 3607 RGB images are ran-
domly selected, of which 2886 images are used as the training set and 721 images are used
as the validation set.
where TP, FP, TN, and FN denote the number of true positive samples, false positive
samples, true negative samples, and false negative samples, respectively.
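For reference, Precision and Recall are computed from these counts in the standard way:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

mAP50 is the mean over all classes of the average precision at an IoU threshold of 0.5, and mAP50:95 averages the AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.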
In addition, this paper uses FLOPs to evaluate the complexity of the model and FPS to
evaluate the real-time performance of the algorithm.
Figure 9. mAP50 and mAP50:95 for the ablation experiment, where (a) represents mAP50 and (b) represents mAP50:95.
As can be seen from Table 2, at the cost of only a small amount of extra storage and computation, mAP 50 and mAP 50:95 achieve increases of 1.6% and 2.5%, respectively, and the FPS of the algorithm is only reduced by 3.8, which meets the real-time requirements of UAVs. Finally, compared with YOLOv8m, the parameters and computational complexity of LI-YOLO are much lower, while its mAP 50 and mAP 50:95 are better than those of YOLOv8m.
Attention    P [%]    R [%]    mAP 50 [%]    mAP 50:95 [%]    FLOPs [G]
CBAM         74.7     66.8     74.4          48.3             46.9
EMA          75.4     67.1     76.1          48.7             46.9
CA           73.9     68.6     73.7          47.5             46.9
FEB          77.5     68.1     75.8          49.8             47.0
-            68.5     70.9     74.2          47.3             46.9
As can be seen from Table 3, the Precision, Recall, mAP 50, and mAP 50:95 of the detection results of FEB are 2.8%, 1.3%, 1.4%, and 1.5% higher than those of the CBAM attention mechanism, respectively. The Precision, Recall, and mAP 50:95 of FEB are 2.1%, 1.0%, and 1.1% higher than those of EMA, respectively. The Precision, mAP 50, and mAP 50:95 of FEB are 3.6%, 2.1%, and 2.3% higher than those of CA, respectively. Compared with not using any attention mechanism, the Precision, mAP 50, and mAP 50:95 of FEB are 9%, 1.6%, and 2.5% higher, respectively. Overall, apart from the Recall of CA and of the non-attention option and the mAP 50 of EMA, which are slightly higher than those of FEB, the performance of FEB is better than that of the other attention mechanisms.
Algorithms    Size         P [%]    R [%]    mAP 50 [%]    mAP 50:95 [%]    FPS
SSR           1            69.3     69.1     73.1          46.6             8.25
SSR           3            75.5     68.7     74.9          48.3             8.18
SSR           5            76.4     69.3     75.8          47.4             8.12
SSR           7            71.5     69.1     72.3          46.0             8.05
MSR           (1, 3, 5)    77.2     64.9     73.1          46.9             3.05
LI-YOLO       -            77.5     68.1     75.8          49.8             135.1
It can be clearly observed from Figure 10b that the Retinex algorithm suppresses the influence of low-illumination scenes on the image, but it also introduces noise into the image, resulting in false detections and missed detections, such as the two cars on the far right of the figure, which are missed. In contrast, FEB achieves better results in low-illumination scenes, as shown in Figure 10a.
Figure 10. Comparison of experimental results with or without Retinex low-illumination image enhancement, where (a) indicates that the Retinex low-illumination image enhancement technology is not used, and (b) indicates that the enhancement technology is used.
4.5.3. Performance Comparison on the DroneVehicle Dataset
In order to evaluate the detection accuracy and detection speed of LI-YOLO in low-illumination scenes, the performance of LI-YOLO is compared with UAV aerial object detection algorithms for low-illumination scenes proposed in recent years and with common YOLO series object detection algorithms, including YOLOv8s and YOLOv10s; the experimental results are shown in Table 5. Low illumination represents the dataset composed of night images screened from DroneVehicle, while all-day represents the whole DroneVehicle dataset. All algorithms are implemented in the same environment and on the same hardware devices and datasets.
Table 5. Comparisons on the DroneVehicle dataset.

Scenes     Models                                   mAP 50 (%)    mAP 50:95 (%)    Params (M)    FLOPs (G)    FPS
Night      YOLOv8s                                  72.6          46.0             17.10         28.5         200.0
Night      YOLOv10s                                 68.8          44.2             14.02         24.6         149.3
Night      IAW [42]                                 72.6          41.6             -             -            50
Night      CRSIOD [43]                              68.46         -                18.26         -            -
Night      DEDet [44]                               62.3          34.9             -             129.8        36.7
Night      Improving YOLOv7-Tiny [45] (RGB / IR)    71.7 / 74.1   -                6.05          13.3         88.7
Night      LI-YOLO                                  75.8          49.8             16.80         47.0         135.1
All-day    C2Former                                 74.2          47.5             132.51        100.9        -
All-day    Dark-Waste                               73.5          45.9             17.21         30.7         161.3
All-day    UA-CMDet                                 64.0          41.3             138.69        -            2.7
All-day    YOLOv8s                                  76.4          54.5             17.10         28.5         200.0
All-day    YOLOv10s                                 76.8          54.0             14.02         24.6         149.3
All-day    LI-YOLO                                  76.6          55.2             16.80         47.0         135.1
Figure 11. Comparison of experimental results in different low-illumination scenes, where (a) represents the detection effect of YOLOv8 in foggy scenes, (b) represents the detection effect of LI-YOLO in foggy scenes, (c) represents the detection effect of YOLOv8 in night scenes, and (d) represents the detection effect of LI-YOLO in night scenes.
4.5.4. Performance Comparison on the LLVIP Dataset
In order to further evaluate the performance of LI-YOLO in aerial object detection in low-illumination scenes, comparative experiments are also carried out on the LLVIP dataset, and the experimental results are recorded in Table 6. We can observe that LI-YOLO outperforms YOLOv8 by 0.5% mAP50 and 1.7% mAP50:95, respectively, which achieves the best results in Table 6. In addition, LI-YOLO also satisfies the real-time requirements. Experimental results indicate that LI-YOLO also achieves advanced detection performance on the LLVIP dataset.
algorithm can achieve UAV aerial object detection in low-illumination scenes while ensuring
real-time performance. In conclusion, compared with the existing publicly available UAV
aerial object detection algorithms in low-illumination scenes, our method has significant
advantages in detection accuracy and real-time performance, and it is more suitable for
real-time detection tasks on UAV platforms in low-illumination scenes. LI-YOLO is a pure
vision scheme based on visible light, and with the popularization of infrared technology in low-illumination scenes, future work should focus on designing a multi-modal feature extraction module [47] that processes both infrared and visible light data and exploits their complementary advantages.
Author Contributions: Conceptualization, S.L. and H.H.; methodology, S.L.; software, S.L.; validation,
S.L. and Z.Z.; formal analysis, H.H.; investigation, S.L. and Z.Z.; resources, H.H. and Y.Z.; data
curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L., H.H.,
and Y.Z.; visualization, S.L.; supervision, H.H. and Y.Z.; project administration, S.L. and H.H.;
funding acquisition, H.H. and Y.Z. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China
(No. 32200546), the Science and Technology Project of Hebei Education Department (No. QN2021038),
the Hebei Natural Science Foundation (No. C2024202003), and the Science and Technology Coopera-
tion Special Project of Shijiazhuang (No. SJZZXA23005).
Data Availability Statement: The DroneVehicle and LLVIP datasets were obtained from https:
//github.com/VisDrone/DroneVehicle (accessed on 21 July 2024), https://github.com/bupt-ai-cz/
LLVIP (accessed on 16 August 2024), separately.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Klemas, V.V. Coastal and environmental remote sensing from unmanned aerial vehicles: An overview. J. Coast. Res. 2015, 31,
1260–1267. [CrossRef]
2. Lin, S.; Jin, L.; Chen, Z. Real-time monocular vision system for UAV autonomous landing in outdoor low-illumination environ-
ments. Sensors 2021, 21, 6226. [CrossRef] [PubMed]
3. Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions:
A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177. [CrossRef]
4. Zhao, Y.Q.; Rao, Y.; Dong, S.P.; Zhang, J.Y. Survey on deep learning object detection. J. Image Graph. 2020, 25, 629–654. [CrossRef]
5. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp.
580–587.
7. Singh, P.; Bhandari, A.K.; Kumar, R. Low light image enhancement using reflection model and wavelet fusion. Multimed. Tools
Appl. 2024, 1, 1–29. [CrossRef]
8. Xie, J.; Nie, J.; Ding, B.; Yu, M.; Cao, J. Cross-modal Local Calibration and Global Context Modeling Network for RGB-Infrared
Remote Sensing Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1–10. [CrossRef]
9. Yuan, M.; Wei, X. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. arXiv 2024, arXiv:2306.16175.
10. Qiao, Y.; Zhang, Q.; Qi, Y.; Wan, T.; Yang, L.; Yu, X. A Waste Classification model in Low-illumination scenes based on ConvNeXt.
Resour. Conserv. Recycl. 2023, 199, 107274. [CrossRef]
11. Yang, Q.; Zhang, Y.; Zhao, Z.; Zhang, J.; Zhang, S. IAIFNet: An Illumination-Aware Infrared and Visible Image Fusion Network.
IEEE Signal Process. Lett. 2024, 13, 1374–1378. [CrossRef]
12. Tang, L.; Xiang, X.; Zhang, H.; Gong, M.; Ma, J. DIVFusion: Darkness-free infrared and visible image fusion. Inf. Fusion 2023, 91,
477–493. [CrossRef]
13. Huang, J.; Xu, H.; Liu, G.; Wang, C.; Hu, Z.; Li, Z.J.S.P. SIDNet: A single image dedusting network with color cast correction.
Signal Process. 2022, 199, 108612. [CrossRef]
14. Xu, K.; Chen, H.; Xu, C.; Jin, Y.; Zhu, C. Structure-texture aware network for low-light image enhancement. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4983–4996. [CrossRef]
15. Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf. Fusion 2023, 99, 101870. [CrossRef]
16. Zetao, J.; Yun, X.; Shaoqin, Z. Low-illumination object detection method based on Dark-YOLO. J. Comput.-Aided Des. Comput.
Graph. 2023, 35, 441–451.
17. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In
Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS),
Chennai, India, 18–19 April 2024; pp. 1–6.
18. Sohan, M.; Sai Ram, T.; Reddy, R. Data Intelligence and Cognitive Informatics; Springer: Berlin/Heidelberg, Germany, 2024.
19. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Computer Vision;
Springer Nature: Berlin/Heidelberg, Germany, 2022; pp. 649–667.
20. Available online: https://blog.csdn.net/Jiangnan_Cai/article/details/137099734?fromshare=blogdetail&sharetype=blogdetail&
sharerId=137099734&sharerefer=PC&sharesource=weixin_42488451&sharefrom=from_link (accessed on 6 August 2024).
21. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
22. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
23. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
24. Xiao, Y.; Jiang, A.; Ye, J. Making of night vision: Object detection under low-illumination. IEEE Access 2020, 8, 123075–123086.
[CrossRef]
25. Han, K.; Wang, Y.; Chen, H. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [CrossRef]
26. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; Vaswani: Mountain View, CA, USA, 2017.
27. Safaldin, M.; Zaghden, N.; Mejdoub, M. An Improved YOLOv8 to Detect Moving Objects. IEEE Access 2024, 12, 59782–59806.
[CrossRef]
28. Acer, S.; Selvitopi, O.; Aykanat, C. Improving performance of sparse matrix dense matrix multiplication on large-scale parallel
systems. Parallel Comput. 2016, 59, 71–96. [CrossRef]
29. Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. arXiv 2016, arXiv:1612.02295.
30. Ma, X.; Zhang, P.; Zhang, S.; Duan, N.; Hou, Y.; Zhou, M.; Song, D.; Zhou, M. A tensorized transformer for language modeling.
arXiv 2019.
31. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Chaurasia, A.; Diaconu, L.; Ingham, F.; Colmagro, A.; Ye, H.; et al.
ultralytics/yolov5: v4. 0-nn. SiLU () Activations, Weights & Biases Logging, PyTorch Hub Integration; Zenodo: Geneva, Switzerland, 2021.
32. Li, K.; Zou, C.; Bu, S.; Liang, Y.; Zhang, J.; Gong, M. Multi-modal feature fusion for geographic image annotation. Pattern Recognit.
2018, 73, 1–14. [CrossRef]
33. Li, X.; Liu, Z.; Luo, P.; Change Loy, C.; Tang, X. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer
cascade. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July
2017; pp. 3193–3202.
34. Cazzato, D.; Cimarelli, C.; Sanchez-Lopez, J.L.; Voos, H.; Leo, M. A survey of computer vision methods for 2d object detection
from unmanned aerial vehicles. J. Imaging 2020, 6, 78. [CrossRef] [PubMed]
35. Liu, Q.; Ye, H.; Wang, S.; Xu, Z. YOLOv8-CB: Dense Pedestrian Detection Algorithm Based on In-Vehicle Camera. Electronics 2024,
13, 236. [CrossRef]
36. Li, X.; Li, W.; Ren, D.; Zhang, H.; Wang, M.; Zuo, W. Enhanced blind face restoration with multi-exemplar images and adaptive
spatial feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA,
USA, 13–19 June 2020; pp. 2706–2715.
37. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE
Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [CrossRef]
38. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3496–3504.
39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
40. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November
2019; pp. 9167–9176.
41. Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive
attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711.
[CrossRef]
42. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern
Recognit. 2019, 85, 161–171. [CrossRef]
43. Wang, H.; Wang, C.; Fu, Q.; Zhang, D.; Kou, R.; Yu, Y.; Song, J. Cross-Modal Oriented Object Detection of UAV Aerial Images Based on Image Feature. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–21. [CrossRef]
44. Xi, Y.; Jia, W.; Miao, Q.; Feng, J.; Ren, J.; Luo, H. Detection-Driven Exposure-Correction Network for Nighttime Drone-View Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62. [CrossRef]
45. Hu, S.; Zhao, F.; Lu, H.; Deng, Y.; Du, J.; Shen, X. Improving YOLOv7-tiny for infrared and visible light image object detection on
drones. Remote Sens. 2023, 15, 3214. [CrossRef]
46. Cao, Y.; Bin, J.; Hamari, J.; Blasch, E.; Liu, Z. Multimodal object detection by channel switching and spatial attention. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023;
pp. 403–411.
47. Liu, C.; Chen, H.; Deng, L.; Guo, C.; Lu, X.; Yu, H.; Zhu, L.; Dong, M. Modality specific infrared and visible image fusion based
on multi-scale rich feature representation under low-light environment. Infrared Phys. Technol. 2024, 140, 105351. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.