CSPPartial-YOLO: A Lightweight YOLO-Based Method for Typical Objects Detection in Remote Sensing Images

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 17, 2024
Abstract—Detecting and recognizing objects are crucial steps in interpreting remote sensing images. At present, deep learning methods are predominantly employed for detecting objects in remote sensing images, necessitating a significant number of floating-point computations. However, computing devices with low computing power and small storage can hardly afford the large model parameter quantities and high computational complexity involved. To address these constraints, this article presents a lightweight detection model called CSPPartial-YOLO. This model introduces the partial hybrid dilated convolution (PHDC) Block module, which combines hybrid dilated convolutions and partial convolutions to increase the receptive field at a low computational cost. By using the PHDC Block within the model design framework of cross-stage partial connection, we construct CSPPartialStage, which reduces the computational burden without compromising accuracy. A coordinate attention module is also employed in CSPPartialStage to aggregate position information and improve the detection of small objects with complex distributions in remote sensing images. A backbone and neck are developed with CSPPartialStage, and the rotation head of the PPYOLOE-R model adapts to objects of multiple orientations in remote sensing images. Empirical experiments on the dataset for object deTection in aerial images (DOTA) and the large-scale small object detection dAtaset (SODA-A) indicate that our method is faster and more resource efficient than the baseline model (PPYOLOE-R), while achieving higher accuracy. Furthermore, comparisons with current state-of-the-art YOLO series detectors show our proposed model's competitiveness in terms of speed and accuracy. Moreover, compared to mainstream lightweight networks, our model exhibits better hardware adaptability, with lower inference latency and higher detection accuracy.

Index Terms—Deep learning, object detection, partial convolution, remote sensing image.

I. INTRODUCTION

REMOTE sensing images possess a broad range of applications, including traffic monitoring [1], maritime rescue [2], and aviation control [3]. The development of deep learning technology has resulted in more intelligent and efficient analysis of remote sensing images, decreasing the reliance on manual work. Object recognition and detection are fundamental tasks in computer vision and are core components of the analysis of remote sensing images.

Deep learning-based object detectors can be categorized into two groups. Two-stage detectors include R-CNN [4], Mask R-CNN [5], Faster R-CNN [6], and others. In 2014, Girshick et al. proposed R-CNN as the first two-stage object detection algorithm. This method first utilizes selective search to extract candidate boxes, then feeds them through a convolutional neural network to extract target features, and finally performs support vector machine classification and box calibration on those features. Single-stage detectors, on the other hand, including YOLO [7] and SSD [8], treat the detection process as regression, eschewing the region proposal stage to reduce computation time and thus achieving faster detection. Redmon et al. introduced YOLO in 2015, dividing the image into a grid and predicting bounding boxes in each cell; redundant predicted boxes are then removed using nonmaximum suppression (NMS). YOLOv2 [9] expands the detection dataset through joint training, YOLOv3 [10] uses Darknet53 as a backbone network to boost detection performance, and YOLOv4 [11] utilizes CIoU loss for predicted-box filtering to improve the model's convergence. YOLOv5 [12] uses the feature pyramid network (FPN) and pixel aggregation network (PAN) structure in the neck network, achieving superior speed with precision equivalent to YOLOv4 thanks to its lighter model size. YOLOX [13] reintroduced the anchor-free technique to the YOLO series, proposing the SimOTA label assignment method and a decoupled detection head that separates the classification and localization tasks, thereby producing higher quality predicted bounding boxes. PPYOLOE [14] introduced advanced technologies such as reparameterization, redesigned the backbone network, and achieves a good balance between speed and accuracy on the MS COCO dataset. In addition, the PPYOLOE-R [15] model is better suited to the multidirectional object distributions in remote sensing images, thanks to a detection head and an angle loss designed for rotated boxes.

In spite of achieving good results on general datasets, object detectors face the challenge that their large parameter volumes and high computational costs exceed the limited computing power and storage available to real-time surveillance applications. Although the use of
deep and wide pruning can decrease the model size, as seen with the YOLOv5 model versions L, M, S, and N, simple pruning of the model depth and feature map channel numbers can weaken the model's representation ability, which results in performance degradation. Moreover, because objects seen from the remote sensing aerial perspective are small, the model can easily lose important features during downsampling, so extracting adequate features for accurate detection becomes difficult.

In order to address the real-time processing issue of object detection in remote sensing image interpretation, many scholars have designed lightweight object detection models based on the characteristics of remote sensing images. Guo et al. [16] used depthwise separable convolution to replace standard convolution, reducing the model's parameter volume. They also adopted the ACON activation function, which effectively avoids neuronal death during large-gradient propagation. In addition, they introduced the DSASFF module, which effectively aggregates target properties at different scales that are otherwise neglected during feature fusion. Cui et al. [17] introduced prior knowledge of the Laplacian operator into the Bottleneck and added a sharp value transition based on the original tensor to enhance the low-level feature tensor that contains small target contours. They concurrently decreased the parameter volume and computational complexity of the model by employing multiple small convolutional kernels in place of larger ones. Zhang et al. [18] used the ShuffleV2 module to construct a lightweight FPN network that fully fuses shallow and deep features to generate an abundant fused feature map with rich object position information, thereby improving the original model's ability to locate targets. Lyu et al. [19] took inspiration from Liu's [20] utilization of large kernel convolution to enhance the detector's performance; however, to balance efficiency, they employed depthwise separable convolution. The RTMDet model they designed achieves a good balance between parameters and accuracy.

In comparison with the aforementioned methods, this article focuses on the redundant feature maps that arise during model inference. Inspired by [21], this article uses pointwise convolution and partially connected layers to construct module stages and improve the PPYOLOE-R model, thereby proposing a lightweight and efficient object detector called CSPPartial-YOLO. Specifically, the primary contributions of this research are as follows:

1) We present the partial hybrid dilated convolution (PHDC) Block module, which combines partial convolution and pointwise convolution to fully utilize the redundancy of the feature map and reduce the model's parameter count and computational burden. In addition, hybrid dilated convolutions are used in the partial convolution to reduce the computational burden of large convolution kernels, to enlarge the receptive field so that small targets can be accurately extracted from the complex backgrounds of remote sensing images, and to alleviate the long-range dependency problem.

2) The CSPPartialStage is constructed by integrating the PHDC Block with partial convolution and CSPNet to decrease computational complexity while preserving comparable precision. At the end of the CSPPartialStage, a coordinate attention (CA) module is appended to enhance the module's object representation capability. A new backbone and neck were established using the CSPPartialStage, and a lightweight and efficient remote sensing image rotated-box detector named CSPPartial-YOLO was developed based on the PPYOLOE-R detection head.

3) Experimental studies were undertaken on typical objects in remote sensing images using the dataset for object deTection in aerial images (DOTA) and the large-scale small object detection dAtaset (SODA-A). In terms of accuracy, the proposed model was compared with the common YOLO models YOLOv8, YOLOX, and RTMDet, and it demonstrates superior results in terms of both volume and speed. In addition, the backbone of the proposed method exhibits the dual advantages of accuracy and speed when compared with common lightweight backbones such as MobileNetV3, ShuffleNetV2, and GhostNet.

The rest of this article is organized as follows. Section II provides a review of relevant research on lightweight backbone networks and attention mechanisms. Section III presents a detailed description of the proposed model. Section IV presents the experimental details, including the findings and discussions from both ablation experiments and comparative experiments. Finally, Section V summarizes the key findings and presents the conclusion of the study.

II. RELATED WORK

A. Lightweight Backbone Networks

In recent years, deep neural networks have been advancing toward deeper and larger models, achieving continuous accuracy improvements across multiple benchmark datasets. However, a high number of parameters and computations poses a challenge for model deployment. Researchers have explored lightweight backbone networks in an attempt to reduce the number of model parameters and computations while still maintaining similar accuracy. In 2017, Howard et al. [22] introduced depthwise separable convolution (DSC) in MobileNet V1, decomposing standard convolution into depthwise convolution and pointwise convolution to effectively reduce the number of parameters and computations in convolutional layers. That same year, ShuffleNet V1 [23] used group convolutions to reduce computation and employed the ChannelShuffle operation to enhance interchannel information flow, resulting in better performance than MobileNet V1. In 2018, ShuffleNet V2 [24] proposed four guidelines to optimize the model, further increasing inference speed. In 2020, Han et al. [25] demonstrated the redundancy in feature maps through experiments and proposed GhostNet, which employs cheap operations to replace standard convolutional layers, generating additional feature maps while reducing the calculation cost. In 2023, Chen et al. [21] proposed the partial convolution module, which employs a combination of partial convolution and pointwise convolution to reduce the computational cost while exploiting feature redundancy. Building on the partial convolution module, this study employs hybrid dilated convolution to expand the receptive field and enhance the module's feature extraction ability for small objects.
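As a concrete illustration of the DSC factorization described above, the following PyTorch sketch (ours, for illustration only; not code from the cited works) decomposes a standard convolution in the MobileNet V1 style:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Standard convolution factorized into depthwise + pointwise convolutions,
    in the MobileNet V1 style described above."""

    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups = in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride,
                                   padding=k // 2, groups=in_ch, bias=False)
        # Pointwise: 1 x 1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameters: k*k*in_ch + in_ch*out_ch, versus k*k*in_ch*out_ch for a
# standard convolution -- roughly k*k times fewer when out_ch is large.
```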
B. Attention Mechanism
Attention is a cognitive mechanism that imitates human ability
to selectively focus on specific information and amplify key
details to grasp the essence of data. Deep learning models
employ attention mechanisms to improve their performance.
Visual attention mechanisms in deep learning are classified into
channel attention mechanisms and spatial attention mechanisms.
The squeeze-and-excitation (SE) [26] block is a well-known
module that performs dynamic attention on channel features. It
utilizes global average pooling to compress the channels into a
single value, which is then subject to nonlinear transformations
via a fully connected network before being multiplied with the
input channel vector as weights. ECA [27] reduces model redun-
dancy and captures channel interactions by removing the fully
connected layer and leveraging 1-D convolutional layers. Both
SE and ECA apply attention mechanisms in the channel domain
while ignoring the spatial one. CBAM [28] combines channel
and spatial attention by exploiting large kernel size convolutions
to aggregate positional information within a certain range. Nevertheless, this design choice leads to increased computational costs, making it less suitable for developing lightweight models. In addition, a single layer with a large convolutional kernel can only capture local position information instead of global position information. Coordinate attention (CA) [29] captures precise positional dependencies by embedding positional information into channel attention. This approach offers benefits for dense prediction tasks in lightweight networks. Incorporating the CA module into the CSPPartialStage results in an improvement in detection accuracy for typical targets in remote sensing images, at an acceptable computational cost.

Fig. 1. Comparison of feature maps after the first few layers of convolution in a well-trained neural network. The last image represents the input image, and the circles with the same color indicate the parts with similar features.
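A minimal sketch of an SE-style block, following the description above (squeeze by global average pooling, excite with a small fully connected network, then rescale the channels); the reduction ratio is the usual default from [26] and is an assumption here:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention, per the description of [26]."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # global average pooling per channel
        w = self.fc(w).view(b, c, 1, 1)  # per-channel attention weights
        return x * w                     # recalibrate the input channels
```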
Fig. 5. Details of the rotated detection head.

Partial convolution applies convolution operations to only a portion of the input channels for spatial feature extraction while keeping the remaining channels unchanged. For sequential memory access, we consider the first or last consecutive $C_p$ channels as representatives of the entire feature map when performing computations. The number of computations for a single forward propagation can be calculated as follows:

$$\mathrm{FLOPs} = h \times w \times C_p^2 \times k^2. \quad (1)$$

Here, $h$ and $w$ represent the height and width of the output feature map, $C_p$ represents the number of channels involved in the convolution operation of partial convolution, and $k$ represents the size of the convolution kernel. It can be seen that the computational cost of partial convolution is $\left(C_p/C\right)^2$ that of standard convolution, where $C$ is the total number of channels.
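For instance, with the split ratio $C_p/C = 1/4$ used in [21] (a value carried over from that work, not stated in this excerpt), the cost of partial convolution falls to one sixteenth of that of a standard convolution:

$$\frac{h \times w \times C_p^2 \times k^2}{h \times w \times C^2 \times k^2} = \left(\frac{C_p}{C}\right)^2 = \left(\frac{1}{4}\right)^2 = \frac{1}{16}.$$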
Partial convolution has a limited capacity to capture long-distance dependencies, which can impede small target detection in remote sensing images. Expanding the receptive field by utilizing a large convolution kernel such as 5 × 5 or 7 × 7 introduces more contextual correlation features to the model. Nevertheless, convolutional layers using large kernels result in a sharp increase in computational burden, challenging our aim of designing a lightweight model. To address this limitation, we use hybrid dilated convolution (HDC) to replace the regular convolution operation in partial convolution, inspired by [31]. An affordable increase in computational complexity allows us to achieve a wider receptive field. Fig. 6 illustrates the size of the receptive field of three consecutive convolution layers using different dilation rates.

Fig. 6. Heatmap color, ranging from light to dark, indicates the number of times a single pixel is involved in computation; (a) represents a traditional 3 × 3 convolution. As seen from the figure, dilated convolutions with hybrid dilation rates have a more comprehensive range of semantic understanding than the standard 3 × 3 convolution and can capture more distant and highly related features. Other combinations of dilation rates are shown in subfigures (b), (c), and (d). (a) [1, 1, 1]. (b) [2, 2, 2]. (c) [1, 2, 3]. (d) [1, 2, 5].

To ensure that the dilated convolution group adequately covers the spatial range while avoiding the sampling loss caused by consecutive dilated convolutions on the input feature map, it is crucial to carefully select the combination of dilation rates used in HDC. Fig. 6(b) illustrates the adverse effects of an improper dilation rate combination, which can cause HDC to miss adjacent pixel points and result in incomplete feature sampling. In contrast, the [1, 2, 5] dilation rate combination used in this study improves this situation by encompassing a larger receptive field and providing comprehensive information on all related pixel points in adjacent areas, as compared with the [1, 2, 3] dilation rate combination shown in Fig. 6(c). Moreover, when combined with the partial convolution module, as displayed in Fig. 2, the HDC with the [1, 2, 5] dilation rate combination further enhances the performance of the proposed network.

Fig. 7. Improving the partial convolution with HDC [1, 2, 5]. (a) Partial convolution with a single 3 × 3 convolutional layer. (b) Partial convolution with HDC (the [1, 2, 5] dilation rate combination).

Partial convolution leads to an inevitable loss of channel information due to its inability to involve all channel features in convolutional operations. Nonetheless, this channel information loss can be mitigated by utilizing pointwise convolution after partial convolution. To achieve this, we apply pointwise convolution to the output of partial convolution, then follow up with a BatchNorm layer and a rectified linear unit (ReLU) activation function, before finally restoring the channel dimensionality using pointwise convolution, as shown in Fig. 8(b). To help avert the gradient vanishing and explosion issues caused by excessively deep convolutional layers, we adopt residual connections as part of our PHDC Block module, consistent with the ResNet method [32].

Fig. 8. Comparison of the basic block in PPYOLOE-R and the PHDC block in our model. (a) Basic block in PPYOLOE-R. (b) PHDC block.

The comparison in Fig. 8 shows the main building block used in constructing the stage of the PPYOLOE-R model and the PHDC block that forms our model stage. Despite the implementation of a reparameterization hierarchy in the PPYOLOE-R model, its building blocks exhibit high computational complexity. In contrast, the PHDC block in our model features a simple and efficient structure.

Compared to the inverted residual module used in MobileNetV2 [33], the PHDC Block employs only a BatchNorm layer and ReLU activation, without performing depthwise convolution after channel expansion with pointwise convolution. This approach effectively avoids the frequent memory access caused by multiple groups of depthwise convolution, thereby increasing the operating efficiency of the module. Moreover, after training is completed, the BatchNorm layer can easily be merged into adjacent convolutional layers, further simplifying the network.
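Putting the pieces together, the following PyTorch sketch shows one plausible reading of the PHDC Block: the partial channels pass through HDC with rates [1, 2, 5], followed by the pointwise expand/BatchNorm/ReLU/pointwise restore sequence and a residual connection. The partial ratio, expansion factor, and layer names are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class PHDCBlock(nn.Module):
    """Sketch of the PHDC Block: hybrid dilated convolutions on a partial
    slice of the channels, pointwise expand/restore, and a residual path.
    Three stacked 3x3 convs with dilations (1, 2, 5) cover a 17x17
    receptive field (1 + 2*(1 + 2 + 5) = 17) at small per-layer cost."""

    def __init__(self, channels, partial_ratio=0.25, expansion=2.0):
        super().__init__()
        self.cp = int(channels * partial_ratio)  # C_p channels that get convolved
        # HDC: consecutive dilated 3x3 convs; padding = dilation keeps h x w.
        self.hdc = nn.Sequential(*[
            nn.Conv2d(self.cp, self.cp, 3, padding=d, dilation=d, bias=False)
            for d in (1, 2, 5)
        ])
        hidden = int(channels * expansion)
        # Pointwise expand -> BN -> ReLU -> pointwise restore; no depthwise
        # step, unlike MobileNetV2's inverted residual.
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        # Partial convolution: convolve the first C_p channels, pass the rest.
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        y = torch.cat((self.hdc(x1), x2), dim=1)
        return x + self.pw(y)  # residual connection (ResNet style)
```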
B. CSPPartialStage

1) CSP Structure With PHDC Block: Deep convolutional neural networks often involve dense convolution operations as the channel numbers expand, which can dramatically increase the computational cost of the model. This, coupled with a single feature propagation path, can cause repeated usage of gradient information, resulting in redundancy and inefficient network training. To address this issue, CSPNet [34] separates the gradient flow to propagate through different network paths, ensuring that the gradient information obtained has greater correlation differences. Both YOLOv5 [12] and YOLOX [13] utilize a CSPNet-like structure to reduce the computational burden without compromising accuracy. In our research, we implemented this method to design the main module.

Fig. 3 illustrates the structure of CSPPartialStage, where CBN refers to the concatenation of a convolutional layer, a BatchNorm layer, and a nonlinear activation layer. Our CSPPartialStage incorporates the PHDC block mentioned earlier as a crucial module in consecutive CSPNet-based feature extraction layers. After the input feature map is processed by the first CBN, the number of its channels is halved, and it is then routed into two parallel branching structures. One of the branches executes only a simple CBN operation, while the other goes through a single CBN before being sent to the feature extraction module, made up of N PHDC Blocks arranged in series, to perform deeper feature extraction. The outputs of both branches are concatenated in the channel dimension and given coordinate attention through the CA attention module. Finally, a CBN is used for channel matching to obtain the correct number of channels. It is worth noting that all the CBNs in CSPPartialStage use 1 × 1 convolution, merely changing the number of feature map channels or performing simple feature mapping, without introducing a noticeable increase in computational burden.
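The structure just described can be sketched as follows, reusing the PHDCBlock from the sketch above and the CoordinateAttention module sketched after (2) and (3) below; the exact channel counts and activation are assumptions:

```python
import torch
import torch.nn as nn

def cbn(in_ch, out_ch):
    """'CBN' unit from Fig. 3: 1x1 convolution + BatchNorm + nonlinearity."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CSPPartialStage(nn.Module):
    """Sketch of CSPPartialStage: halve channels, run two parallel branches
    (plain CBN vs. CBN + N serial PHDC blocks), concatenate, apply
    coordinate attention, and match channels with a final CBN."""

    def __init__(self, in_ch, out_ch, n_blocks):
        super().__init__()
        mid = in_ch // 2
        self.stem = cbn(in_ch, mid)        # first CBN halves the channels
        self.branch_a = cbn(mid, mid)      # shortcut branch: CBN only
        self.branch_b = nn.Sequential(     # deeper feature extraction branch
            cbn(mid, mid),
            *[PHDCBlock(mid) for _ in range(n_blocks)],
        )
        self.ca = CoordinateAttention(2 * mid)  # CA on the concatenated map
        self.out = cbn(2 * mid, out_ch)         # channel matching

    def forward(self, x):
        x = self.stem(x)
        y = torch.cat((self.branch_a(x), self.branch_b(x)), dim=1)
        return self.out(self.ca(y))
```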
2) Coordinate Attention: A general attention mechanism such as squeeze-and-excitation accounts for the correlation between channels and recalibrates the channel information for effective aggregation, leading to a better model representation. While this approach proves useful for detection in natural images, small object detection in remote sensing images with complex spatial distributions requires a more prominent focus on the target's localization features. We therefore utilize the coordinate attention (CA) module [29] to augment the model's ability to extract location-based features. A schematic of the CA module's workflow is illustrated in Fig. 9.

The input feature map is first encoded for each channel independently by performing global average pooling separately in the horizontal and vertical directions. Specifically, for the $c$th channel, the outputs in the vertical and horizontal directions, over the dimensions $h$ and $w$, are

$$z_c^h(h) = \frac{1}{w} \sum_{0 < i \le w} x_c(h, i) \quad (2)$$

$$z_c^w(w) = \frac{1}{h} \sum_{0 < j \le h} x_c(j, w). \quad (3)$$

Following the transformations in the two spatial directions stated earlier, a pair of direction-sensitive feature maps is obtained.
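The pooling in (2) and (3) is the first half of the CA computation; the remainder (a shared 1 × 1 transform over the concatenated descriptors, a split, and per-direction sigmoid gates) follows the design of [29]. The sketch below is our reading of that module, with the reduction ratio as an assumption:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention [29]: encode position by pooling along
    each spatial axis, then gate the input with direction-aware weights."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.transform = nn.Sequential(       # shared 1x1 transform
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.to_h = nn.Conv2d(mid, channels, 1)  # attention along height
        self.to_w = nn.Conv2d(mid, channels, 1)  # attention along width

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (b, c, h, 1), Eq. (2)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (b, c, w, 1), Eq. (3)
        y = self.transform(torch.cat((z_h, z_w), dim=2))       # joint encoding
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.to_h(y_h))                    # (b, c, h, 1)
        a_w = torch.sigmoid(self.to_w(y_w.permute(0, 1, 3, 2)))  # (b, c, 1, w)
        return x * a_h * a_w
```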
TABLE II
EXPERIMENTAL RESULTS OF ABLATION OF THE ALGORITHM MODULE

TABLE III
EXPERIMENTAL RESULTS OF DIFFERENT NUMBERS OF BLOCKS IN EACH STAGE

Experiment 3 was conducted to investigate the impact of the CA Block on the model's performance and efficiency. In Experiment 4, we utilized both hybrid dilated convolution and the CA Block to evaluate their combined effect on model accuracy and complexity. Our results show that Experiment 4 outperformed Experiments 2 and 3 in terms of accuracy. Specifically, Experiment 4 achieved a 1.06% increase in mean average precision (mAP) while maintaining comparable parameter (Params) and computational (FLOPs) complexity. This improvement in accuracy is attributed to the larger receptive field generated by hybrid dilated convolution, which better captures small remote sensing targets. Although the CA block contributed slightly to the increase in inference latency, the improvement in accuracy justifies its use in the model.

Fig. 10 displays partial results of our CSPPartial-YOLO and PPYOLOE-R models on the DOTA dataset. As observed, our model exhibits fewer missed and false detections than the baseline model in scenarios such as detecting ships near or far from the shore, detecting dense vehicles, and distinguishing between confusable object classes. This improvement can be attributed to the use of hybrid dilated convolution and the improved visual attention module, which allow for a larger receptive field and richer semantic information. Specifically, hybrid dilated convolution expands the field of view of each convolutional layer, allowing the network to capture more contextual information and improve object recognition accuracy. The coordinate attention module enhances the network's ability to focus on relevant features by adaptively weighting feature maps. Our proposed model achieved a 1.9% increase in accuracy compared to the baseline model, while also reducing inference latency by 25.8%. These results demonstrate the efficacy of our proposed model in improving object detection performance while maintaining a reasonable inference speed.

Table III presents the experimental results of allocating different ratios of PHDC blocks in CSPPartialStage. The results suggest that employing the [1:1:3:1] ratio suggested in ConvNeXt [20] yields higher validation accuracy and inference speed with comparable parameter and computational costs.

E. Comparison With Other YOLO Models

To ascertain the competitiveness of the model introduced in this article in both speed and accuracy, we conducted a comparison with modern YOLO series models of identical model size. The resulting comparisons on the DOTA and SODA-A datasets are presented in Table IV.

In object detection benchmarks, it is vital to ensure that the models are evaluated on a level playing field. To this end, we employed the same rotation detection head, namely the PPYOLOE-R detection head, across all models for predicting bounding boxes with rotation angles. The results of our experiments indicate that while the YOLOX model achieved the highest mAP score on both the DOTA and SODA-A datasets among all models, it also incurred the highest inference delay, due to suboptimal inference speed optimization in the CSPDarkNet backbone network used by YOLOX. On the other hand, PPYOLOE-R utilizes reparameterization techniques to enhance the model's inference speed, leading to a better tradeoff between accuracy and speed. In contrast to YOLOX and PPYOLOE-R, YOLOv8 replaces the C3 module of CSPDarkNet with the C2F module, which maximizes gradient flow information while maintaining a lightweight architecture. This design choice enables YOLOv8 to achieve high accuracy while also maintaining a reasonable inference speed. The RTMDet model utilizes a design approach comparable to ours, whereby the receptive field is expanded via an increase in the kernel size of convolution layers, thereby improving the model's capability for feature extraction. However, RTMDet differs in that it directly enlarges the convolution kernel size and employs depthwise separable convolution to achieve a balance between efficiency and effectiveness.

In our proposed CSPPartial-YOLO, we also aimed to achieve a more efficient and lightweight network by utilizing partial convolution in constructing the network. Furthermore, our approach combines typical characteristics of targets in remote sensing images, such as small object sizes and complex backgrounds, with advanced techniques in computer vision. Specifically, we employ hybrid dilated convolutions to establish long-distance dependency relationships and broaden the receptive field, and we implement coordinate attention modules to enhance the position information. These design choices result in improved representation ability for small targets in remote sensing images, which are often challenging to detect.

Our model incurred only a 0.19% (DOTA) and 2.21% (SODA-A) loss in mAP score when compared to YOLOX, yet registered declines of 22.0%, 42.6%, and 32.3% in terms of parameter count, computational cost, and inference delay, respectively. Our model also showed a 14.8% speed advantage over the fastest YOLOv8 model, requiring only 67.1% of the parameters and 74.3% of the computational cost, while maintaining some advantage in precision. Overall, taking both speed and precision into consideration, our model remains competitive with current state-of-the-art YOLO models for typical target detection in remote sensing.
Fig. 10. Prediction results of the baseline model and our CSPPartial-YOLO in some scenes of the DOTA dataset. The green circle indicates false detection, and
the red circle indicates missed detection. (a) Ground truth. (b) PPYOLOE-R. (c) Ours.
TABLE IV
EXPERIMENTAL RESULTS OF COMPARISON WITH MODERN YOLO SERIES MODELS
TABLE V
EXPERIMENTAL RESULTS OF COMPARISON WITH ADVANCED LIGHTWEIGHT BACKBONE NETWORKS
Fig. 11. First three rows show the output feature maps of the last three stages of the backbone network, with sizes of 128 × 128, 64 × 64, and 32 × 32, respectively.
The last row displays the prediction results of the model, with the missed detection highlighted by a red circle. (a) MobileNetV3. (b) ShuffleNetV2. (c) GhostNet.
(d) Ours.
F. Comparison With Lightweight Backbones

To validate the performance of the proposed backbone network, namely CSPPartialNet, we conducted comparative experiments against several lightweight backbone networks. Table V showcases the experimental results obtained on the DOTA and SODA-A datasets.

In this set of experiments, we evaluated only the performance of the backbone network, using the same neck and head and replacing only the backbone. As such, the metrics "Params," "FLOPs," and "Latency" in the table were calculated exclusively for the backbone network.

MobileNetV3 is the latest addition to the MobileNet series of networks. It mainly employs depthwise separable convolutions to decrease the number of trainable parameters and the computational complexity of the network. In addition, it integrates SE channel attention and utilizes neural architecture search (NAS) techniques to obtain optimal parameters. Nevertheless, it is noteworthy that despite having the lowest FLOPs in this set of experiments, MobileNetV3 still shows considerable latency. This is because its architecture is better suited to CPU computation, whereas GPU computation is more important in our experimental environment. This indicates that FLOPs cannot accurately reflect a model's inference time, and one should focus more on latency. Moreover, MobileNetV3 has been observed to exhibit lower sensitivity to small targets and less attention to intricate details, which may reduce its effectiveness on typical targets in remote sensing images. Hence, caution should be exercised when employing MobileNetV3 for remote sensing image classification.

ShuffleNetV2 is a lightweight backbone network that considers the impact of memory access cost (MAC) on inference latency. Nonetheless, similar to MobileNetV3, it is better suited to CPUs on mobile devices than to GPUs. Moreover, ShuffleNetV2 exhibits a lower capability to concentrate on long-range dependency information, making it difficult to derive essential information from images containing large objects with significant aspect ratios. Consequently, it may yield inadequate results in intricate and uncertain remote sensing object detection scenarios.

GhostNet is among the very first models to concentrate on redundant feature maps in convolutional neural networks. By replacing conventional convolutions with a cheap operation, it facilitates the acquisition of feature maps. Nonetheless, the use of depthwise convolution in the cheap operation can raise the MAC, which negates the previously reduced FLOPs. Consequently, despite having fewer FLOPs, GhostNet still demonstrates high latency.

Fig. 11 shows the output feature maps of the final three stages and the prediction results for each lightweight backbone network. Our CSPPartialNet provides a clearer representation of the target position at all three scales, especially in the 32 × 32 feature map output. Our model accurately captures the information about the vehicle parking positions and displays it with higher values in the heat map, owing to our use of the coordinate attention (CA) module, which enhances the target position features. As seen in the prediction results, our model has the fewest missed detections.

Based on the experimental results, CSPPartialNet achieved the highest mAP among the three lightweight backbone networks above. It also achieved the best inference time on real hardware, lower by 42.1%, 57.7%, and 63.3% than MobileNetV3, ShuffleNetV2, and GhostNet, respectively. In addition, CSPPartialNet has the lowest parameter count, which makes the model suitable for devices with low storage.

V. DISCUSSION AND CONCLUSION

The detection of objects in remote sensing images has long been challenging due to the limited computing and storage resources of remote sensing platforms, and current object detectors struggle to achieve fast and accurate predictions under these constraints. In this article, we improved the baseline model to achieve a better balance between speed and accuracy, and we refer to the improved model as CSPPartial-YOLO. The new model is specifically designed for the detection of typical targets in remote sensing images.

To improve the model's inference speed and reduce its parameters and calculations, we exploited the redundant feature maps in the model inference process and introduced the PHDC module, a combination of partial convolution and hybrid dilated convolution with specific dilation rates. Furthermore, we incorporated the CA module to increase the model's sensitivity to target location information, considering the multidirectionality, dense distribution, and small size of typical targets in remote sensing images. Finally, we designed the CSPPartialStage, explored the appropriate computational depth ratio for the backbone network, and constructed the backbone and neck networks.

In this article, we conducted ablation experiments to demonstrate the advantages of the proposed model over the baseline model. Furthermore, we evaluated the effectiveness of the main improvement methods through comparative experiments with state-of-the-art YOLO series models and lightweight backbone networks. The proposed model and methods achieved competitive advantages in terms of both accuracy and speed. Our experiments show that the lightweight detector introduced in this article has potential for real-time detection of typical targets in remote sensing images. Our future research will explore advanced lightweight network design methods, such as neural network pruning and neural architecture search (NAS), to further decrease model redundancy and enhance detection efficiency.

REFERENCES

[1] P. Patil, "Applications of deep learning in traffic management: A review," Int. J. Bus. Intell. Big Data Analytics, vol. 5, no. 1, pp. 16–23, 2022.
[2] S. Wang, Y. Han, J. Chen, Z. Zhang, G. Wang, and N. Du, "A deep-learning-based sea search and rescue algorithm by UAV remote sensing," in Proc. IEEE CSAA Guid., Navigation Control Conf., 2018, pp. 1–5.
[3] Y. Xu, M. Zhu, P. Xin, S. Li, M. Qi, and S. Ma, "Rapid airplane detection in remote sensing images based on multilayer feature fusion in fully convolutional neural networks," Sensors, vol. 18, no. 7, 2018, Art. no. 2335.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, vol. 28.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[8] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[9] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7263–7271.
[10] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[11] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[12] G. Jocher, "YOLOv5 by Ultralytics," May 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[13] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," 2021, arXiv:2107.08430.
[14] S. Xu et al., "PP-YOLOE: An evolved version of YOLO," 2022, arXiv:2203.16250.
[15] X. Wang, G. Wang, Q. Dang, Y. Liu, X. Hu, and D. Yu, "PP-YOLOE-R: An efficient anchor-free rotated object detector," 2022, arXiv:2211.02386.
[16] Y. Guo, S. Chen, R. Zhan, W. Wang, and J. Zhang, "LMSD-YOLO: A lightweight YOLO algorithm for multi-scale SAR ship detection," Remote Sens., vol. 14, no. 19, 2022, Art. no. 4801.
[17] M. Cui et al., "LC-YOLO: A lightweight model with efficient utilization of limited detail features for small object detection," Appl. Sci., vol. 13, no. 5, 2023, Art. no. 3174.
[18] H. Zhang et al., "An improved lightweight YOLO-Fastest V2 for engineering vehicle recognition fusing location enhancement and adaptive label assignment," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 16, pp. 2450–2461, 2023.
[19] C. Lyu et al., "RTMDet: An empirical study of designing real-time object detectors," 2022, arXiv:2212.07784.
[20] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11976–11986.
[21] J. Chen et al., "Run, don't walk: Chasing higher FLOPS for faster neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 12021–12031.
[22] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[23] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6848–6856.
[24] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 116–131.
[25] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1580–1589.
[26] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[27] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11534–11542.
[28] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[29] Q. Hou, D. Zhou, and J. Feng, "Coordinate attention for efficient mobile network design," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13713–13722.
[30] Q. Zhang et al., "Split to be slim: An overlooked redundancy in vanilla convolution," 2020, arXiv:2006.12085.
[31] P. Wang et al., "Understanding convolution for semantic segmentation," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 1451–1460.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[34] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 390–391.
[35] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10012–10022.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[37] G.-S. Xia et al., "DOTA: A large-scale dataset for object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3974–3983.
[38] G. Cheng et al., "Towards large-scale small object detection: Survey and benchmarks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 13467–13488, Nov. 2023.
[39] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[40] G. Jocher, A. Chaurasia, and J. Qiu, "YOLO by Ultralytics," Jan. 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[41] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1314–1324.

Siyu Xie received the B.S. degree in electronic information science and technology from the College of Science, Beijing Forestry University, Beijing, China, in 2021. He is currently working toward the M.S. degree in signal and information processing with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning, lightweight models, and remote sensing.

Mei Zhou was born in Sichuan, China, in 1980. She received the Ph.D. degree in communication and information systems from the Graduate School of the Chinese Academy of Sciences, Beijing, China, in 2007. She is currently an Associate Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences. Her main research direction is multidimensional imaging technology for active and passive sensors.

Chunle Wang was born in Jilin, China, in 1986. She received the B.S. degree in electronic information engineering from Beijing Information Science and Technology University, Beijing, China, in 2008, and the Ph.D. degree in communication and information systems from the University of Chinese Academy of Sciences, Beijing, China, in 2013. She is currently an Associate Researcher with the Aerospace Information Research Institute, Chinese Academy of Sciences. Her research interests include spaceborne synthetic aperture radar (SAR) system design and SAR image processing.

Shisheng Huang received the B.S. degree in applied mathematics in 2006 and the Ph.D. degree in system analysis and integration in 2012, both from the National University of Defense Technology, Changsha, China. He is currently working in the field of spaceborne synthetic aperture radar design and image processing with the Beijing Institute of Tracking and Telecommunications Technology, Beijing, China.