CSPPartial-YOLO: A Lightweight YOLO-Based Method for Typical Objects Detection in Remote Sensing Images

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 17, 2024
Abstract—Detecting and recognizing objects are crucial steps in interpreting remote sensing images. At present, deep learning methods are predominantly employed for detecting objects in remote sensing images, necessitating a significant number of floating-point computations. However, computing devices with low computing power and small storage can hardly afford the large model parameter quantities and high computational complexity involved. To address these constraints, this article presents a lightweight detection model called CSPPartial-YOLO. This model introduces the partial hybrid dilated convolution (PHDC) Block module, which combines hybrid dilated convolutions and partial convolutions to increase the receptive field at a low computational cost. By using the PHDC Block within the model design framework of cross-stage partial connection, we construct CSPPartialStage, which reduces the computational burden without compromising accuracy. A coordinate attention module is also employed in CSPPartialStage to aggregate position information and improve the detection of small objects with complex distributions in remote sensing images. A backbone and neck are developed with CSPPartialStage, and the rotation head of the PPYOLOE-R model adapts to objects of multiple orientations in remote sensing images. Empirical experiments on the dataset for object deTection in aerial images (DOTA) and the large-scale small object detection dAtaset (SODA-A) indicate that our method is faster and more resource efficient than the baseline model (PPYOLOE-R), while achieving higher accuracy. Furthermore, comparisons with current state-of-the-art YOLO series detectors show our proposed model's competitiveness in terms of speed and accuracy. Moreover, compared to mainstream lightweight networks, our model exhibits better hardware adaptability, with lower inference latency and higher detection accuracy.

Index Terms—Deep learning, object detection, partial convolution, remote sensing image.

I. INTRODUCTION

REMOTE sensing images possess a broad range of applications, including traffic monitoring [1], maritime rescue [2], and aviation control [3]. The development of deep learning technology has resulted in more intelligent and efficient analysis of remote sensing images, decreasing the reliance on manual work. Object recognition and detection are fundamental tasks in computer vision and are core components of the analysis of remote sensing images.

Deep learning-based object detectors can be categorized into two groups. Two-stage detectors include R-CNN [4], Mask R-CNN [5], Faster R-CNN [6], and others. In 2014, Girshick et al. proposed R-CNN as the first two-stage object detection algorithm. This method first utilizes selective search to extract candidate boxes, then feeds them through a convolutional neural network to extract target features, and finally performs support vector machine classification and box calibration on those features. Single-stage detectors, on the other hand, including YOLO [7] and SSD [8], treat the detection process as regression, eschewing the region proposal stage to reduce computation time and thus achieving faster detection. Redmon et al. introduced YOLO in 2015, dividing the image into a grid and predicting bounding boxes in each cell; redundant predicted boxes are then removed using nonmaximum suppression (NMS). YOLOv2 [9] expands the detection dataset through joint training, YOLOv3 [10] uses Darknet53 as a backbone network to boost detection performance, and YOLOv4 [11] utilizes CIoU loss for predicted-box filtering to improve the model's convergence. YOLOv5 [12] uses the feature pyramid network (FPN) and pixel aggregation network (PAN) structure in the neck network, achieving superior speed with precision equivalent to YOLOv4 thanks to its lighter model size. YOLOX [13] reintroduced the anchor-free technique to the YOLO series, proposing the SimOTA label assignment method and a decoupled detection head that separates the classification and localization tasks, thereby producing higher quality predicted bounding boxes. PPYOLOE [14] introduced advanced technologies such as reparameterization, redesigned the backbone network, and achieves a good balance between speed and accuracy on the MS COCO dataset. In addition, the PPYOLOE-R [15] model is better suited to the multidirectional object distributions in remote sensing images, thanks to a detection head and an angle loss designed for rotated boxes.

In spite of achieving good results on general datasets, object detectors face the challenge that their large parameter volumes and high computational costs exceed the limited computing power and storage available to real-time surveillance applications. Although the use of
deep and wide pruning can decrease the model size, as seen with the YOLOv5 model versions L, M, S, and N, simple pruning of the model depth and feature map channel numbers can weaken the model's representation ability, which results in performance degradation. Moreover, because objects seen from the remote sensing aerial perspective are small, the model can easily lose important features during downsampling, so extracting adequate features for accurate detection becomes difficult.

In order to address the real-time processing issue of object detection in remote sensing image interpretation, many scholars have designed lightweight object detection models based on the characteristics of remote sensing images. Guo et al. [16] used depthwise separable convolution to replace standard convolution, reducing the model's parameter volume. They also adopted the ACON activation function, which effectively avoids neuronal death during large-gradient propagation. In addition, they introduced the DSASFF module, which effectively aggregates target properties at different scales that are otherwise neglected during feature fusion. Cui et al. [17] introduced prior knowledge of the Laplacian operator into the Bottleneck and added a sharp value transition based on the original tensor to enhance the low-level feature tensor that contains small target contours. They concurrently decreased the parameter volume and computational complexity of the model by employing multiple small convolutional kernels in place of larger ones. Zhang et al. [18] used the ShuffleV2 module to construct a lightweight FPN network that fully fuses shallow and deep features to generate an abundant fused feature map with rich object position information, thereby improving the original model's ability to locate targets. Lyu et al. [19] took inspiration from Liu's [20] utilization of large kernel convolution to enhance the detector's performance; however, to balance efficiency, they employed depthwise separable convolution. The RTMDet model they designed achieves a good balance between parameters and accuracy.

In comparison with the aforementioned methods, this article focuses on the redundant feature maps that arise during model inference. Inspired by [21], this article uses pointwise convolution and partially connected layers to construct module stages and improve the PPYOLOE-R model, thereby proposing a lightweight and efficient object detector called CSPPartial-YOLO. Specifically, the primary contributions of this research are as follows:

1) We present the partial hybrid dilated convolution (PHDC) Block module, which combines partial convolution and pointwise convolution to fully utilize the redundancy of the feature map and reduce the model's parameter count and computational burden. In addition, hybrid dilated convolutions are used in the partial convolution to reduce the computational burden of large convolution kernels, to enlarge the receptive field so that small targets can be accurately extracted from the complex backgrounds of remote sensing images, and to alleviate the long-range dependency problem.

2) The CSPPartialStage is constructed by integrating the PHDC Block with partial convolution and CSPNet to decrease computational complexity while preserving comparable precision. At the end of the CSPPartialStage, a coordinate attention (CA) module is appended to enhance the module's object representation capability. A new backbone and neck were established using the CSPPartialStage, and a lightweight and efficient remote sensing image rotated-box detector named CSPPartial-YOLO was developed based on the PPYOLOE-R detection head.

3) Experimental studies were undertaken on typical objects in remote sensing images using the dataset for object deTection in aerial images (DOTA) and the large-scale small object detection dAtaset (SODA-A). In terms of accuracy, the proposed model was compared with the common YOLO models YOLOv8, YOLOX, and RTMDet, and it demonstrates superior results in terms of both volume and speed. In addition, the backbone of the proposed method exhibits the dual advantages of accuracy and speed when compared with common lightweight backbones such as MobileNetV3, ShuffleNetV2, and GhostNet.

The rest of this article is organized as follows. Section II provides a review of relevant research on lightweight backbone networks and attention mechanisms. Section III presents a detailed description of the proposed model. Section IV presents the experimental details, including the findings and discussions from both ablation experiments and comparative experiments. Finally, Section V summarizes the key findings and presents the conclusion of the study.

II. RELATED WORK

A. Lightweight Backbone Networks

In recent years, deep neural networks have been advancing toward deeper and larger models, achieving continuous accuracy improvements across multiple benchmark datasets. However, a high number of parameters and computations poses a challenge for model deployment. Researchers have explored lightweight backbone networks in an attempt to reduce the number of model parameters and computations while still maintaining similar accuracy. In 2017, Howard et al. [22] introduced depthwise separable convolution (DSC) in MobileNet V1, decomposing standard convolution into depthwise convolution and pointwise convolution to effectively reduce the number of parameters and computations in convolutional layers. That same year, ShuffleNet V1 [23] used group convolutions to reduce computation and employed the ChannelShuffle operation to enhance interchannel information flow, resulting in better performance than MobileNet V1. In 2018, ShuffleNet V2 [24] proposed four guidelines to optimize the model, further increasing inference speed. In 2020, Han et al. [25] demonstrated the redundancy in feature maps through experiments and proposed GhostNet, which employs cheap operations to replace standard convolutional layers, generating additional feature maps while reducing the calculation cost. In 2023, Chen et al. [21] proposed the partial convolution module, which employs a combination of partial convolution and pointwise convolution to reduce the computational cost while exploiting feature redundancy. Building on the partial convolution module, this study employs hybrid dilated convolution to expand the receptive field and enhance the module's feature extraction ability for small objects.
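As a concrete illustration of the DSC factorization described above, the following PyTorch sketch (ours, for illustration only; not code from the cited works) decomposes a standard convolution in the MobileNet V1 style:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Standard convolution factorized into depthwise + pointwise convolutions,
    in the MobileNet V1 style described above."""

    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups = in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride,
                                   padding=k // 2, groups=in_ch, bias=False)
        # Pointwise: 1 x 1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameters: k*k*in_ch + in_ch*out_ch, versus k*k*in_ch*out_ch for a
# standard convolution -- roughly k*k times fewer when out_ch is large.
```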
B. Attention Mechanism
Attention is a cognitive mechanism that imitates human ability
to selectively focus on specific information and amplify key
details to grasp the essence of data. Deep learning models
employ attention mechanisms to improve their performance.
Visual attention mechanisms in deep learning are classified into
channel attention mechanisms and spatial attention mechanisms.
The squeeze-and-excitation (SE) [26] block is a well-known
module that performs dynamic attention on channel features. It
utilizes global average pooling to compress the channels into a
single value, which is then subject to nonlinear transformations
via a fully connected network before being multiplied with the
input channel vector as weights. ECA [27] reduces model redun-
dancy and captures channel interactions by removing the fully
connected layer and leveraging 1-D convolutional layers. Both
SE and ECA apply attention mechanisms in the channel domain
while ignoring the spatial one. CBAM [28] combines channel
and spatial attention by exploiting large kernel size convolutions
to aggregate positional information within a certain range. Nevertheless, this design choice leads to increased computational costs, making it less suitable for developing lightweight models. In addition, a single layer with a large convolutional kernel can only capture local position information instead of global position information. Coordinate attention (CA) [29] captures precise positional dependencies by embedding positional information into channel attention. This approach offers benefits for dense prediction tasks in lightweight networks. Incorporating the CA module into the CSPPartialStage results in an improvement in detection accuracy for typical targets in remote sensing images, at an acceptable computational cost.

Fig. 1. Comparison of feature maps after the first few layers of convolution in a well-trained neural network. The last image represents the input image, and the circles with the same color indicate the parts with similar features.
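A minimal sketch of an SE-style block, following the description above (squeeze by global average pooling, excite with a small fully connected network, then rescale the channels); the reduction ratio is the usual default from [26] and is an assumption here:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention, per the description of [26]."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # global average pooling per channel
        w = self.fc(w).view(b, c, 1, 1)  # per-channel attention weights
        return x * w                     # recalibrate the input channels
```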
Fig. 5. Details of the rotated detection head.

Partial convolution applies convolution operations to only a portion of the input channels for spatial feature extraction while keeping the remaining channels unchanged. For sequential memory access, we consider the first or last consecutive $C_p$ channels as representatives of the entire feature map when performing computations. The number of computations for a single forward propagation can be calculated as follows:

$$\mathrm{FLOPs} = h \times w \times C_p^2 \times k^2. \quad (1)$$

Here, $h$ and $w$ represent the height and width of the output feature map, $C_p$ represents the number of channels involved in the convolution operation of partial convolution, and $k$ represents the size of the convolution kernel. It can be seen that the computational cost of partial convolution is $\left(C_p/C\right)^2$ that of standard convolution, where $C$ is the total number of channels.
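For instance, with the split ratio $C_p/C = 1/4$ used in [21] (a value carried over from that work, not stated in this excerpt), the cost of partial convolution falls to one sixteenth of that of a standard convolution:

$$\frac{h \times w \times C_p^2 \times k^2}{h \times w \times C^2 \times k^2} = \left(\frac{C_p}{C}\right)^2 = \left(\frac{1}{4}\right)^2 = \frac{1}{16}.$$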
Partial convolution has a limited capacity to capture long-distance dependencies, which can impede small target detection in remote sensing images. Expanding the receptive field by utilizing a large convolution kernel such as 5 × 5 or 7 × 7 introduces more contextual correlation features to the model. Nevertheless, convolutional layers using large kernels result in a sharp increase in computational burden, challenging our aim of designing a lightweight model. To address this limitation, we use hybrid dilated convolution (HDC) to replace the regular convolution operation in partial convolution, inspired by [31]. An affordable increase in computational complexity allows us to achieve a wider receptive field. Fig. 6 illustrates the size of the receptive field of three consecutive convolution layers using different dilation rates.

Fig. 6. Heatmap color, ranging from light to dark, indicates the number of times a single pixel is involved in computation; (a) represents a traditional 3 × 3 convolution. As seen from the figure, dilated convolutions with hybrid dilation rates have a more comprehensive range of semantic understanding than the standard 3 × 3 convolution and can capture more distant and highly related features. Other combinations of dilation rates are shown in subfigures (b), (c), and (d). (a) [1, 1, 1]. (b) [2, 2, 2]. (c) [1, 2, 3]. (d) [1, 2, 5].

To ensure that the dilated convolution group adequately covers the spatial range while avoiding the sampling loss caused by consecutive dilated convolutions on the input feature map, it is crucial to carefully select the combination of dilation rates used in HDC. Fig. 6(b) illustrates the adverse effects of an improper dilation rate combination, which can cause HDC to miss adjacent pixel points and result in incomplete feature sampling. In contrast, the [1, 2, 5] dilation rate combination used in this study improves this situation by encompassing a larger receptive field and providing comprehensive information on all related pixel points in adjacent areas, as compared with the [1, 2, 3] dilation rate combination shown in Fig. 6(c). Moreover, when combined with the partial convolution module, as displayed in Fig. 2, the HDC with the [1, 2, 5] dilation rate combination further enhances the performance of the proposed network.

Fig. 7. Improving the partial convolution with HDC [1, 2, 5]. (a) Partial convolution with a single 3 × 3 convolutional layer. (b) Partial convolution with HDC (the [1, 2, 5] dilation rate combination).

Partial convolution leads to an inevitable loss of channel information due to its inability to involve all channel features in convolutional operations. Nonetheless, this channel information loss can be mitigated by utilizing pointwise convolution after partial convolution. To achieve this, we apply pointwise convolution to the output of partial convolution, then follow up with a BatchNorm layer and a rectified linear unit (ReLU) activation function, before finally restoring the channel dimensionality using pointwise convolution, as shown in Fig. 8(b). To help avert the gradient vanishing and explosion issues caused by excessively deep convolutional layers, we adopt residual connections as part of our PHDC Block module, consistent with the ResNet method [32].

Fig. 8. Comparison of the basic block in PPYOLOE-R and the PHDC block in our model. (a) Basic block in PPYOLOE-R. (b) PHDC block.

The comparison in Fig. 8 shows the main building block used in constructing the stage of the PPYOLOE-R model and the PHDC block that forms our model stage. Despite the implementation of a reparameterization hierarchy in the PPYOLOE-R model, its building blocks exhibit high computational complexity. In contrast, the PHDC block in our model features a simple and efficient structure.

Compared to the inverted residual module used in MobileNetV2 [33], the PHDC Block employs only a BatchNorm layer and ReLU activation, without performing depthwise convolution after channel expansion with pointwise convolution. This approach effectively avoids the frequent memory access caused by multiple groups of depthwise convolution, thereby increasing the operating efficiency of the module. Moreover, after training is completed, the BatchNorm layer can easily be merged into adjacent convolutional layers, further simplifying the network.
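Putting the pieces together, the following PyTorch sketch shows one plausible reading of the PHDC Block: the partial channels pass through HDC with rates [1, 2, 5], followed by the pointwise expand/BatchNorm/ReLU/pointwise restore sequence and a residual connection. The partial ratio, expansion factor, and layer names are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class PHDCBlock(nn.Module):
    """Sketch of the PHDC Block: hybrid dilated convolutions on a partial
    slice of the channels, pointwise expand/restore, and a residual path.
    Three stacked 3x3 convs with dilations (1, 2, 5) cover a 17x17
    receptive field (1 + 2*(1 + 2 + 5) = 17) at small per-layer cost."""

    def __init__(self, channels, partial_ratio=0.25, expansion=2.0):
        super().__init__()
        self.cp = int(channels * partial_ratio)  # C_p channels that get convolved
        # HDC: consecutive dilated 3x3 convs; padding = dilation keeps h x w.
        self.hdc = nn.Sequential(*[
            nn.Conv2d(self.cp, self.cp, 3, padding=d, dilation=d, bias=False)
            for d in (1, 2, 5)
        ])
        hidden = int(channels * expansion)
        # Pointwise expand -> BN -> ReLU -> pointwise restore; no depthwise
        # step, unlike MobileNetV2's inverted residual.
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        # Partial convolution: convolve the first C_p channels, pass the rest.
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        y = torch.cat((self.hdc(x1), x2), dim=1)
        return x + self.pw(y)  # residual connection (ResNet style)
```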
B. CSPPartialStage

1) CSP Structure With PHDC Block: Deep convolutional neural networks often involve dense convolution operations as the channel numbers expand, which can dramatically increase the computational cost of the model. This, coupled with a single feature propagation path, can cause repeated usage of gradient information, resulting in redundancy and inefficient network training. To address this issue, CSPNet [34] separates the gradient flow to propagate through different network paths, ensuring that the gradient information obtained has greater correlation differences. Both YOLOv5 [12] and YOLOX [13] utilize a CSPNet-like structure to reduce the computational burden without compromising accuracy. In our research, we implemented this method to design the main module.

Fig. 3 illustrates the structure of CSPPartialStage, where CBN refers to the concatenation of a convolutional layer, a BatchNorm layer, and a nonlinear activation layer. Our CSPPartialStage incorporates the PHDC block mentioned earlier as a crucial module in consecutive CSPNet-based feature extraction layers. After the input feature map is processed by the first CBN, the number of its channels is halved, and it is then routed into two parallel branching structures. One of the branches executes only a simple CBN operation, while the other goes through a single CBN before being sent to the feature extraction module, made up of N PHDC Blocks arranged in series, to perform deeper feature extraction. The outputs of both branches are concatenated in the channel dimension and given coordinate attention through the CA attention module. Finally, a CBN is used for channel matching to obtain the correct number of channels. It is worth noting that all the CBNs in CSPPartialStage use 1 × 1 convolution, merely changing the number of feature map channels or performing simple feature mapping, without introducing a noticeable increase in computational burden.
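The structure just described can be sketched as follows, reusing the PHDCBlock from the sketch above and the CoordinateAttention module sketched after (2) and (3) below; the exact channel counts and activation are assumptions:

```python
import torch
import torch.nn as nn

def cbn(in_ch, out_ch):
    """'CBN' unit from Fig. 3: 1x1 convolution + BatchNorm + nonlinearity."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CSPPartialStage(nn.Module):
    """Sketch of CSPPartialStage: halve channels, run two parallel branches
    (plain CBN vs. CBN + N serial PHDC blocks), concatenate, apply
    coordinate attention, and match channels with a final CBN."""

    def __init__(self, in_ch, out_ch, n_blocks):
        super().__init__()
        mid = in_ch // 2
        self.stem = cbn(in_ch, mid)        # first CBN halves the channels
        self.branch_a = cbn(mid, mid)      # shortcut branch: CBN only
        self.branch_b = nn.Sequential(     # deeper feature extraction branch
            cbn(mid, mid),
            *[PHDCBlock(mid) for _ in range(n_blocks)],
        )
        self.ca = CoordinateAttention(2 * mid)  # CA on the concatenated map
        self.out = cbn(2 * mid, out_ch)         # channel matching

    def forward(self, x):
        x = self.stem(x)
        y = torch.cat((self.branch_a(x), self.branch_b(x)), dim=1)
        return self.out(self.ca(y))
```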
2) Coordinate Attention: A general attention mechanism such as squeeze-and-excitation accounts for the correlation between channels and recalibrates the channel information for effective aggregation, leading to a better model representation. While this approach proves useful for detection in natural images, small object detection in remote sensing images with complex spatial distributions requires a more prominent focus on the target's localization features. We therefore utilize the coordinate attention (CA) module [29] to augment the model's ability to extract location-based features. A schematic of the CA module's workflow is illustrated in Fig. 9.

The input feature map is first encoded for each channel independently by performing global average pooling separately in the horizontal and vertical directions. Specifically, for the $c$th channel, the outputs in the vertical and horizontal directions, over the dimensions $h$ and $w$, are

$$z_c^h(h) = \frac{1}{w} \sum_{0 < i \le w} x_c(h, i) \quad (2)$$

$$z_c^w(w) = \frac{1}{h} \sum_{0 < j \le h} x_c(j, w). \quad (3)$$

Following the transformations in the two spatial directions stated earlier, a pair of direction-sensitive feature maps is obtained.
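The pooling in (2) and (3) is the first half of the CA computation; the remainder (a shared 1 × 1 transform over the concatenated descriptors, a split, and per-direction sigmoid gates) follows the design of [29]. The sketch below is our reading of that module, with the reduction ratio as an assumption:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention [29]: encode position by pooling along
    each spatial axis, then gate the input with direction-aware weights."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.transform = nn.Sequential(       # shared 1x1 transform
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.to_h = nn.Conv2d(mid, channels, 1)  # attention along height
        self.to_w = nn.Conv2d(mid, channels, 1)  # attention along width

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (b, c, h, 1), Eq. (2)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (b, c, w, 1), Eq. (3)
        y = self.transform(torch.cat((z_h, z_w), dim=2))       # joint encoding
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.to_h(y_h))                    # (b, c, h, 1)
        a_w = torch.sigmoid(self.to_w(y_w.permute(0, 1, 3, 2)))  # (b, c, 1, w)
        return x * a_h * a_w
```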
TABLE II
EXPERIMENTAL RESULTS OF ABLATION OF THE ALGORITHM MODULE

TABLE III
EXPERIMENTAL RESULTS OF DIFFERENT NUMBERS OF BLOCKS IN EACH STAGE

Experiment 3 was conducted to investigate the impact of the CA Block on the model's performance and efficiency. In Experiment 4, we utilized both hybrid dilated convolution and the CA Block to evaluate their combined effect on model accuracy and complexity. Our results show that Experiment 4 outperformed Experiments 2 and 3 in terms of accuracy. Specifically, Experiment 4 achieved a 1.06% increase in mean average precision (mAP) while maintaining comparable parameter (Params) and computational (FLOPs) complexity. This improvement in accuracy is attributed to the larger receptive field generated by hybrid dilated convolution, which better captures small remote sensing targets. Although the CA block contributed slightly to the increase in inference latency, the improvement in accuracy justifies its use in the model.

Fig. 10 displays partial results of our CSPPartial-YOLO and PPYOLOE-R models on the DOTA dataset. As observed, our model exhibits fewer missed and false detections than the baseline model in scenarios such as detecting ships near or far from the shore, detecting dense vehicles, and distinguishing between confusable object classes. This improvement can be attributed to the use of hybrid dilated convolution and the improved visual attention module, which allow for a larger receptive field and richer semantic information. Specifically, hybrid dilated convolution expands the field of view of each convolutional layer, allowing the network to capture more contextual information and improve object recognition accuracy. The coordinate attention module enhances the network's ability to focus on relevant features by adaptively weighting feature maps. Our proposed model achieved a 1.9% increase in accuracy compared to the baseline model, while also reducing inference latency by 25.8%. These results demonstrate the efficacy of our proposed model in improving object detection performance while maintaining a reasonable inference speed.

Table III presents the experimental results of allocating different ratios of PHDC blocks in CSPPartialStage. The results suggest that employing the [1:1:3:1] ratio suggested in ConvNeXt [20] yields higher validation accuracy and inference speed with comparable parameter and computational costs.

E. Comparison With Other YOLO Models

To ascertain the competitiveness of the model introduced in this article in both speed and accuracy, we conducted a comparison with modern YOLO series models of identical model size. The resulting comparisons on the DOTA and SODA-A datasets are presented in Table IV.

In object detection benchmarks, it is vital to ensure that the models are evaluated on a level playing field. To this end, we employed the same rotation detection head, namely the PPYOLOE-R detection head, across all models for predicting bounding boxes with rotation angles. The results of our experiments indicate that while the YOLOX model achieved the highest mAP score on both the DOTA and SODA-A datasets among all models, it also incurred the highest inference delay, due to suboptimal inference speed optimization in the CSPDarkNet backbone network used by YOLOX. On the other hand, PPYOLOE-R utilizes reparameterization techniques to enhance the model's inference speed, leading to a better tradeoff between accuracy and speed. In contrast to YOLOX and PPYOLOE-R, YOLOv8 replaces the C3 module of CSPDarkNet with the C2F module, which maximizes gradient flow information while maintaining a lightweight architecture. This design choice enables YOLOv8 to achieve high accuracy while also maintaining a reasonable inference speed. The RTMDet model utilizes a design approach comparable to ours, whereby the receptive field is expanded via an increase in the kernel size of convolution layers, thereby improving the model's capability for feature extraction. However, RTMDet differs in that it directly enlarges the convolution kernel size and employs depthwise separable convolution to achieve a balance between efficiency and effectiveness.

In our proposed CSPPartial-YOLO, we also aimed to achieve a more efficient and lightweight network by utilizing partial convolution in constructing the network. Furthermore, our approach combines typical characteristics of targets in remote sensing images, such as small object sizes and complex backgrounds, with advanced techniques in computer vision. Specifically, we employ hybrid dilated convolutions to establish long-distance dependency relationships and broaden the receptive field, and we implement coordinate attention modules to enhance the position information. These design choices result in improved representation ability for small targets in remote sensing images, which are often challenging to detect.

Our model incurred only a 0.19% (DOTA) and 2.21% (SODA-A) loss in mAP score when compared to YOLOX, yet registered declines of 22.0%, 42.6%, and 32.3% in terms of parameter count, computational cost, and inference delay, respectively. Our model also showed a 14.8% speed advantage over the fastest YOLOv8 model, requiring only 67.1% of the parameters and 74.3% of the computational cost, while maintaining some advantage in precision. Overall, taking both speed and precision into consideration, our model remains competitive with current state-of-the-art YOLO models for typical target detection in remote sensing.
Fig. 10. Prediction results of the baseline model and our CSPPartial-YOLO in some scenes of the DOTA dataset. The green circle indicates false detection, and
the red circle indicates missed detection. (a) Ground truth. (b) PPYOLOE-R. (c) Ours.
TABLE IV
EXPERIMENTAL RESULTS OF COMPARISON WITH MODERN YOLO SERIES MODELS
TABLE V
EXPERIMENTAL RESULTS OF COMPARISON WITH ADVANCED LIGHTWEIGHT BACKBONE NETWORKS
Fig. 11. First three rows show the output feature maps of the last three stages of the backbone network, with sizes of 128 × 128, 64 × 64, and 32 × 32, respectively.
The last row displays the prediction results of the model, with the missed detection highlighted by a red circle. (a) MobileNetV3. (b) ShuffleNetV2. (c) GhostNet.
(d) Ours.
F. Comparison With Lightweight Backbones

To validate the performance of the proposed backbone network, namely CSPPartialNet, we conducted comparative experiments against several lightweight backbone networks. Table V showcases the experimental results obtained on the DOTA and SODA-A datasets.

In this set of experiments, we evaluated only the performance of the backbone network, using the same neck and head and replacing only the backbone. As such, the metrics "Params," "FLOPs," and "Latency" in the table were calculated exclusively for the backbone network.

MobileNetV3 is the latest addition to the MobileNet series of networks. It mainly employs depthwise separable convolutions to decrease the number of trainable parameters and the computational complexity of the network. In addition, it integrates SE channel attention and utilizes neural architecture search (NAS) techniques to obtain optimal parameters. Nevertheless, it is noteworthy that despite having the lowest FLOPs in this set of experiments, MobileNetV3 still shows considerable latency. This is because its architecture is better suited to CPU computation, whereas GPU computation is more important in our experimental environment. This indicates that FLOPs cannot accurately reflect a model's inference time, and one should focus more on latency. Moreover, MobileNetV3 has been observed to exhibit lower sensitivity to small targets and less attention to intricate details, which may reduce its effectiveness on typical targets in remote sensing images. Hence, caution should be exercised when employing MobileNetV3 for remote sensing image classification.

ShuffleNetV2 is a lightweight backbone network that considers the impact of memory access cost (MAC) on inference latency. Nonetheless, similar to MobileNetV3, it is better suited to CPUs on mobile devices than to GPUs. Moreover, ShuffleNetV2 exhibits a lower capability to concentrate on long-range dependency information, making it difficult to derive essential information from images containing large objects with significant aspect ratios. Consequently, it may yield inadequate results in intricate and uncertain remote sensing object detection scenarios.

GhostNet is among the very first models to concentrate on redundant feature maps in convolutional neural networks. By replacing conventional convolutions with a cheap operation, it facilitates the acquisition of feature maps. Nonetheless, the use of depthwise convolution in the cheap operation can raise the MAC, which negates the previously reduced FLOPs. Consequently, despite having fewer FLOPs, GhostNet still demonstrates high latency.

Fig. 11 shows the output feature maps of the final three stages and the prediction results for each lightweight backbone network. Our CSPPartialNet provides a clearer representation of the target position at all three scales, especially in the 32 × 32 feature map output. Our model accurately captures the information about the vehicle parking positions and displays it with higher values in the heat map, owing to our use of the coordinate attention (CA) module, which enhances the target position features. As seen in the prediction results, our model has the fewest missed detections.

Based on the experimental results, CSPPartialNet achieved the highest mAP among the three lightweight backbone networks above. It also achieved the best inference time on real hardware, lower by 42.1%, 57.7%, and 63.3% than MobileNetV3, ShuffleNetV2, and GhostNet, respectively. In addition, CSPPartialNet has the lowest parameter count, which makes the model suitable for devices with low storage.

V. DISCUSSION AND CONCLUSION

The detection of objects in remote sensing images has long been challenging due to the limited computing and storage resources of remote sensing platforms, and current object detectors struggle to achieve fast and accurate predictions under these constraints. In this article, we improved the baseline model to achieve a better balance between speed and accuracy, and we refer to the improved model as CSPPartial-YOLO. The new model is specifically designed for the detection of typical targets in remote sensing images.

To improve the model's inference speed and reduce its parameters and calculations, we exploited the redundant feature maps in the model inference process and introduced the PHDC module, a combination of partial convolution and hybrid dilated convolution with specific dilation rates. Furthermore, we incorporated the CA module to increase the model's sensitivity to target location information, considering the multidirectionality, dense distribution, and small size of typical targets in remote sensing images. Finally, we designed the CSPPartialStage, explored the appropriate computational depth ratio for the backbone network, and constructed the backbone and neck networks.

In this article, we conducted ablation experiments to demonstrate the advantages of the proposed model over the baseline model. Furthermore, we evaluated the effectiveness of the main improvement methods through comparative experiments with state-of-the-art YOLO series models and lightweight backbone networks. The proposed model and methods achieved competitive advantages in terms of both accuracy and speed. Our experiments show that the lightweight detector introduced in this article has potential for real-time detection of typical targets in remote sensing images. Our future research will explore advanced lightweight network design methods, such as neural network pruning and neural architecture search (NAS), to further decrease model redundancy and enhance detection efficiency.

REFERENCES

[1] P. Patil, "Applications of deep learning in traffic management: A review," Int. J. Bus. Intell. Big Data Analytics, vol. 5, no. 1, pp. 16–23, 2022.
[2] S. Wang, Y. Han, J. Chen, Z. Zhang, G. Wang, and N. Du, "A deep-learning-based sea search and rescue algorithm by UAV remote sensing," in Proc. IEEE CSAA Guid., Navigation Control Conf., 2018, pp. 1–5.
[3] Y. Xu, M. Zhu, P. Xin, S. Li, M. Qi, and S. Ma, "Rapid airplane detection in remote sensing images based on multilayer feature fusion in fully convolutional neural networks," Sensors, vol. 18, no. 7, 2018, Art. no. 2335.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, vol. 28.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[8] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[9] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7263–7271.
[10] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[11] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[12] G. Jocher, "YOLOv5 by Ultralytics," May 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[13] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," 2021, arXiv:2107.08430.
[14] S. Xu et al., "PP-YOLOE: An evolved version of YOLO," 2022, arXiv:2203.16250.
[15] X. Wang, G. Wang, Q. Dang, Y. Liu, X. Hu, and D. Yu, "PP-YOLOE-R: An efficient anchor-free rotated object detector," 2022, arXiv:2211.02386.
[16] Y. Guo, S. Chen, R. Zhan, W. Wang, and J. Zhang, "LMSD-YOLO: A lightweight YOLO algorithm for multi-scale SAR ship detection," Remote Sens., vol. 14, no. 19, 2022, Art. no. 4801.
[17] M. Cui et al., "LC-YOLO: A lightweight model with efficient utilization of limited detail features for small object detection," Appl. Sci., vol. 13, no. 5, 2023, Art. no. 3174.
[18] H. Zhang et al., "An improved lightweight YOLO-Fastest V2 for engineering vehicle recognition fusing location enhancement and adaptive label assignment," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 16, pp. 2450–2461, 2023.
[19] C. Lyu et al., "RTMDet: An empirical study of designing real-time object detectors," 2022, arXiv:2212.07784.
[20] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11976–11986.
[21] J. Chen et al., "Run, don't walk: Chasing higher FLOPS for faster neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 12021–12031.
[22] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[23] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6848–6856.
[24] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 116–131.
[25] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1580–1589.
[26] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[27] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11534–11542.
[28] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[29] Q. Hou, D. Zhou, and J. Feng, "Coordinate attention for efficient mobile network design," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13713–13722.
[30] Q. Zhang et al., "Split to be slim: An overlooked redundancy in vanilla convolution," 2020, arXiv:2006.12085.
[31] P. Wang et al., "Understanding convolution for semantic segmentation," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 1451–1460.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[34] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 390–391.
[35] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10012–10022.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[37] G.-S. Xia et al., "DOTA: A large-scale dataset for object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3974–3983.
[38] G. Cheng et al., "Towards large-scale small object detection: Survey and benchmarks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 13467–13488, Nov. 2023.
[39] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[40] G. Jocher, A. Chaurasia, and J. Qiu, "YOLO by Ultralytics," Jan. 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[41] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1314–1324.

Siyu Xie received the B.S. degree in electronic information science and technology from the College of Science, Beijing Forestry University, Beijing, China, in 2021. He is currently working toward the M.S. degree in signal and information processing with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning, lightweight models, and remote sensing.

Mei Zhou was born in Sichuan, China, in 1980. She received the Ph.D. degree in communication and information systems from the Graduate School of the Chinese Academy of Sciences, Beijing, China, in 2007. She is currently an Associate Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences. Her main research direction is multidimensional imaging technology for active and passive sensors.

Chunle Wang was born in Jilin, China, in 1986. She received the B.S. degree in electronic information engineering from Beijing Information Science and Technology University, Beijing, China, in 2008, and the Ph.D. degree in communication and information systems from the University of Chinese Academy of Sciences, Beijing, China, in 2013. She is currently an Associate Researcher with the Aerospace Information Research Institute, Chinese Academy of Sciences. Her research interests include spaceborne synthetic aperture radar (SAR) system design and SAR image processing.

Shisheng Huang received the B.S. degree in applied mathematics in 2006 and the Ph.D. degree in system analysis and integration in 2012, both from the National University of Defense Technology, Changsha, China. He is currently working in the field of spaceborne synthetic aperture radar design and image processing with the Beijing Institute of Tracking and Telecommunications Technology, Beijing, China.