Article
YOLO-DCTI: Small Object Detection in Remote Sensing Based
on Contextual Transformer Enhancement
Lingtong Min 1 , Ziman Fan 1 , Qinyi Lv 1, *, Mohamed Reda 2 , Linghao Shen 3 and Binglu Wang 3
Abstract: Object detection is a fundamental task in remote sensing image processing; as one of its
core components, small or tiny object detection plays an important role. Despite
the considerable advancements achieved in small object detection with the integration of CNN and
transformer networks, there remains untapped potential for enhancing the extraction and utilization
of information associated with small objects. Particularly within transformer structures, this potential
arises from the disregard of the complex and intertwined interplay between spatial context
information and channel information during the global modeling of pixel-level information within
small objects. As a result, valuable information is prone to being obfuscated and annihilated. To
mitigate this limitation, we propose an innovative framework, YOLO-DCTI, that capitalizes on the
Contextual Transformer (CoT) framework for the detection of small or tiny objects. Specifically, within
CoT, we seamlessly incorporate global residuals and local fusion mechanisms throughout the entire
input-to-output pipeline. This integration facilitates a profound investigation into the network’s
intrinsic representations at deeper levels and fosters the fusion of spatial contextual attributes with
channel characteristics. Moreover, we propose an improved decoupled contextual transformer
detection head structure, denoted as DCTI, to effectively resolve the feature conflicts that ensue from
the concurrent classification and regression tasks. The experimental results on the Dota, VisDrone,
and NWPU VHR-10 datasets show that, on the powerful real-time detection network YOLOv7, the
speed and accuracy of tiny targets are better balanced.
Keywords: small object detection; remote sensing images; transformer; YOLOv7
Li J. et al. used GAN models to generate high-resolution images of small objects, narrowing the gap
between small and large objects and improving the detection capability of tiny objects [5].
Xu W et al. integrated contextual information into the Swin Transformer and designed an
advanced framework called the foreground-enhanced attention Swin Transformer (FEA-
Swin) [6]. Although the accuracy of detecting small objects has improved, the speed has
been somewhat compromised. Zhu X. et al. proposed the TPH-YOLOv5 model, which
is based on YOLOv5 and adds a transformer model with an attention mechanism to the
detection head [7]. While this enhances the network’s performance in detecting small
objects, it also brings a significant computing burden.
In the field of remote sensing, detecting small objects remains challenging due to
large image scales, complex and varied backgrounds, and unique shooting perspectives.
Cheng et al. proposed a model training regularization method that enhances the detection
performance of small or tiny objects in remote sensing by exploiting and incorporating
global contextual cues and image-level contextual information [8]. Liu J. et al. added a
dilated convolution module to the FPN and designed a relationship connection attention
module to automatically select and refine features, combining global and local attention to
achieve the detection task of small objects in remote sensing [9]. Cheng et al. proposed
an end-to-end cross-scale feature fusion (CSFF) framework based on the feature pyramid
network (FPN), which inserted squeeze-and-excitation (SE) modules at the top layer to
achieve better detection of tiny objects in optical remote sensing images [10]. Dong et al.
proposed a CNN method based on balanced multi-scale fusion (BMF-CNN), which fused
high- and low-level semantic information to improve the detection performance of tiny
objects in remote sensing [11]. Liang X. et al. proposed a single-shot detector (FS-SSD)
based on feature fusion and scaling to better adapt to the detection of tiny or small ob-
jects in remote sensing. FS-SSD added a scaling branch in the deconvolution module and
used two feature pyramids generated by the deconvolution module and feature fusion
module together for prediction, improving the accuracy of object detection [12]. Xu et al.
designed a transformer-guided multi-interaction network (TransMIN) using local–global
feature interaction (LGFI) and cross-view feature interaction (CVFI) modules to enhance
the performance of small object detection in remote sensing. However, this improvement
unavoidably introduces a computational burden [13]. Li et al. proposed a transformer that
aggregates multi-scale global spatial positions to enhance small object detection perfor-
mance but it also comes with a computational burden [14]. To reduce the computational
cost of the transformer, Xu et al. improved the lightweight Swin transformer and designed
a Local Perception Swin transformer (LPSW) backbone network to enhance small-scale
detection accuracy [15]. Gong et al., designed an SPH-YOLOv5 model based on Swin
Transformer Prediction Heads (SPHs) to balance the accuracy and speed of small object
detection in remote sensing [16]. Although many experts and scholars are studying the
balance between detection accuracy and inference speed, achieving an elegant balance
remains a challenging problem [17–21].
Considerable advancements have been achieved in the utilization of transformers [6,7,13–16]
for small object detection within the remote sensing domain. The exceptional performance
of the Contextual Transformer (CoT) [22] in harnessing spatial contextual information,
thereby offering a fresh outlook on transformer design, merits significant attention. In
the domain of remote sensing, small target pixels are characterized by a scarcity of spatial
information but a profusion of channel-based data. Consequently, the amalgamation and
modeling of spatial and channel information assume paramount importance. Further-
more, transformers impose notable demands on computational resources and network
capacity, presenting a challenge in striking an optimal balance between detection accuracy
and processing speed for small object detection in the remote sensing discipline. Mean-
while, Bar M et al. demonstrated that the background is critical for human recognition of
objects [18]. Empirical research in computer vision has also shown that both traditional
methods [19] and deep learning-based methods [12] can enhance algorithm performance
by properly modeling spatial context. Moreover, He K. et al. have proven that residual
structures are advantageous for improving network performance [17,20]. Finally, we note
that the classification and regression tasks of object detection focus on the salient features
and boundary features of the target, respectively [23]. Therefore, a decoupled detection
head incorporating residual structure as well as channel and spatial context knowledge
should have a positive impact on the detection of small or tiny objects.
We propose a new detection framework, YOLO-DCTI, for detecting small or tiny ob-
jects in remote sensing images. By introducing a global residual structure and a local fusion
structure into the contextual transformer (CoT), and designing an improved decoupled
contextual transformer detection head structure (DCTI) based on CoT, we have achieved
improved detection performance for small or tiny objects on the powerful single-stage
benchmark network YOLOv7. The main contributions of this paper can be summarized
as follows:
1. We have developed the CoT-I module, an extension of the original CoT framework,
which integrates global residual structures and local fusion modules. This integration
facilitates the extraction of spatial context background features and the fusion of
channel features, thereby enabling the network to learn deeper-level characteristics.
In comparison to the conventional CoT approach, the inclusion of global residual
structures empowers the network to capture more profound features, while the
incorporation of local fusion structures seamlessly combines background context
features with channel features.
2. We introduce an efficient decoupled detection head structure DCTI, leveraging the
CoT-I framework, to mitigate the limited exploration and utilization of salient region
features and boundary-adjoining features arising from the interdependence of clas-
sification and regression tasks within most object detection heads. This decoupled
design allows the classification task to emphasize salient region features, while the
regression task focuses on boundary-surrounding features. Concurrently, CoT-I effec-
tively exploits and harnesses the feature relationships between spatial context and
channels, facilitating the detection of small objects in remote sensing and yielding
substantial improvements in detection accuracy.
3. Despite the escalation in model parameters and the consequential inference latency
resulting from the adoption of our proposed DCTI structure, the integration of global
residual connections and local fusion strategies yields a notable enhancement in infer-
ence accuracy without incurring any detrimental impact on the inference speed. Com-
parative evaluation against the baseline YOLO v7 model showcases a substantial im-
provement in the inference accuracy specifically for diminutive targets, with mAP@0.5:0.95
surging from 61.8 to 65.2. Additionally, our model incurs only a 0.3 ms increase in
per-image inference time at an input resolution of 640 × 640.
2. Related Work
2.1. Transformer Framework for Object Detection
The transformer structure, based on self-attention, first appeared in NLP tasks. Com-
pared to modern convolutional neural networks (CNN) [24], the Vision Transformer has
made impressive progress in the field of computer vision. After Dosovitskiy A et al.
successfully introduced transformers into computer vision [25], many scholars turned
to transformers [26–28]. In object detection, DETR [29] and Pix2seq [30] are the earliest
transformer detectors that define two different object detection paradigms. However, trans-
formers have many parameters, require high computing power and hardware, and are
not easy to deploy. To apply transformers on mobile devices, Mehta S. et al. proposed a
lightweight MobileVIT series [31–33], which achieved a good balance between accuracy
and real-time performance, and has been widely used in risk detection [34], medicine [35],
and other fields. A major advantage of transformers is that they can use the attention mech-
anism to model the global dependencies of input data and obtain longer-range global information,
but they tend to ignore the connections between local contexts. To address this problem, Li Y. et al.
proposed a lightweight CoT [22] self-attention module to capture contextual background
information on 2D feature maps. It can extract information between local contexts while
capturing global dependencies for more adequate information exchange. In this paper, we
use CoT to exploit the global characteristics of the spatial context and channels. On top of the
original structure, we add global residual and local fusion structures to further exploit
and utilize spatial and channel characteristics.
3. Proposed Method
We present an improved decoupled contextual transformer (YOLO-DCTI) for the
detection of tiny or small objects in the domain of remote sensing. Our proposed framework
is built upon the foundation of YOLOv7. The comprehensive architecture of our framework
is depicted in Figure 1. Our contributions begin with the feature X ∈ ℝ^{H×W×C} obtained
after the backbone, FPN, and PAN stages. In this section, we first give a brief overview of the
widely adopted Contextual Transformer (CoT) framework in object detection. Subsequently,
we introduce an enhanced variant named CoT-I, which incorporates a global residual
structure and a local fusion structure into the CoT module. The global residual mechanism
integrates input information with self-attention features, while the local fusion mechanism
combines spatial contextual information with channel-based information. Ultimately,
we integrate the CoT-I module into a decoupled detection head named DCTI, enabling
the establishment of global interdependencies between the classification and regression
tasks through the utilization of self-attention mechanisms. This integration facilitates the
comprehensive exploration and exploitation of a wider spectrum of channel features and
spatial contextual features.
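To make the CoT-I block concrete, the sketch below shows one plausible PyTorch-style implementation assuming the standard CoT layout (a k × k static context convolution on the keys, a two-layer 1 × 1 attention embedding, and a value aggregation); the global residual and local fusion paths follow the description above, the dynamic-context aggregation is simplified to a sigmoid gate, and all names and hyperparameters are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class CoTI(nn.Module):
    """Illustrative CoT block extended with a global residual and a local
    spatial-channel fusion (a sketch, not the published implementation)."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Static spatial context: k x k grouped convolution over the keys.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                      groups=4, bias=False),   # dim must be divisible by 4
            nn.BatchNorm2d(dim), nn.SiLU())
        # Channel values: 1 x 1 embedding.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False), nn.BatchNorm2d(dim))
        # Attention embedding over [static context, input].
        self.attn_embed = nn.Sequential(
            nn.Conv2d(2 * dim, dim // 4, 1, bias=False),
            nn.BatchNorm2d(dim // 4), nn.SiLU(),
            nn.Conv2d(dim // 4, dim, 1))
        # Local fusion: 1 x 1 mixing of the spatial-context and channel branches.
        self.local_fusion = nn.Conv2d(2 * dim, dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.key_embed(x)                             # static spatial context
        v = self.value_embed(x)                            # channel information
        attn = self.attn_embed(torch.cat([k1, x], dim=1))
        k2 = torch.sigmoid(attn) * v                       # simplified dynamic context
        out = self.local_fusion(torch.cat([k1, k2], dim=1))
        return out + x                                     # global residual over the block
```

A neck feature such as `torch.randn(1, 256, 20, 20)` passes through `CoTI(256)` with its shape unchanged, which is what allows the block to be dropped into the detection head.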
Figure 1. The overall framework of our YOLO-DCTI; DCTI consists of CoT-I and Decoupled-Head.
The FPN features are input to the CoT-I module for comprehensive modeling of global contextual
information and spatial relationships. Subsequently, a dual-branch architecture is employed to
effectively extract and discriminate both category-specific and localization-specific information.
Figure 2. The detailed structures of (left) CoT and (right) our CoT-I block.
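Following Figure 1, the decoupled head can then be pictured as a shared CoT-I block feeding two parallel branches, one for classification and one for box regression plus objectness. The sketch below reuses the hypothetical CoTI module and imports from the previous listing; channel widths, anchor count, and class count are placeholders, not the authors' exact configuration.

```python
class DCTIHead(nn.Module):
    """Illustrative decoupled head: shared CoT-I context modeling followed by
    separate classification and regression branches (a sketch only)."""
    def __init__(self, in_dim: int, num_classes: int, num_anchors: int = 3):
        super().__init__()
        self.context = CoTI(in_dim)                  # spatial-context + channel modeling
        self.cls_branch = nn.Sequential(             # salient-region features
            nn.Conv2d(in_dim, in_dim, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_dim), nn.SiLU(),
            nn.Conv2d(in_dim, num_anchors * num_classes, 1))
        self.reg_branch = nn.Sequential(             # boundary-related features
            nn.Conv2d(in_dim, in_dim, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_dim), nn.SiLU(),
            nn.Conv2d(in_dim, num_anchors * 5, 1))   # 4 box offsets + objectness

    def forward(self, x: torch.Tensor):
        x = self.context(x)
        return self.cls_branch(x), self.reg_branch(x)
```

Keeping the two branches separate is what lets the classification path specialize on salient regions while the regression path attends to boundary features, as argued in Section 1.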
4. Results
To evaluate the performance of the YOLOv7-DCTI algorithm for remote sensing
small object detection, training and testing were conducted on the Dota-small dataset [49].
Furthermore, to assess the algorithm’s overall performance, this experiment included
training and testing on the VisDrone dataset [50] and NWPU VHR-10 dataset [51]. A
comparison was made between seven different networks, namely Faster R-CNN, SSD, YOLOv5s,
YOLOv5l, YOLOv5m, YOLOv7-tiny, and YOLOv7, using the aforementioned three
datasets. To ensure fairness among the YOLO series, a batch size of 16 was utilized during
the training process and pre-trained weights were not employed. The data augmentation
strategy [52,53], like the other training conditions, was kept identical across models. During testing, the
NMS [54] threshold was uniformly set to 0.65 and the batch size was uniformly set to 32.
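For reference, the class-wise NMS step used at test time with an IoU threshold of 0.65 can be written with torchvision's built-in operator; the tensors below are random placeholders standing in for one image's candidate detections.

```python
import torch
from torchvision.ops import batched_nms

# boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,); labels: (N,) class indices.
boxes = torch.rand(100, 4) * 640
boxes[:, 2:] += boxes[:, :2]              # guarantee x2 > x1 and y2 > y1
scores = torch.rand(100)
labels = torch.randint(0, 5, (100,))

keep = batched_nms(boxes, scores, labels, iou_threshold=0.65)  # per-class NMS
final_boxes, final_scores = boxes[keep], scores[keep]
```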
4.1. Datasets
4.1.1. Dota-Small
In recent years, several remote sensing datasets have been developed. This paper
focuses on the extraction of small or tiny objects from the DOTA-v1.0 dataset, which consists
of 2000 aerial images of different cities and over 190,000 fully labeled objects, each of which
is annotated with eight positional parameters (x1, y1, x2, y2, x3, y3, x4, and y4). In this study, we
have selected a subset that includes five categories of small objects, namely small vehicles,
large vehicles, planes, storage tanks, and ships. However, because of the large image size
in the DOTA-v1.0 dataset, direct training is not feasible. Therefore, we cropped the
images into tiles of 1024 × 1024 pixels, resulting in a total of 8624 images. These images were
subsequently divided into training, validation, and test sets at a train:val:test ratio of 6:2:2.
Among these, 5176 images were used for training, while 1724 images each were allocated to
validation and testing. The five types of objects included in the dataset are small vehicle,
large vehicle, plane, storage tank, and ship, as illustrated in Figure 4. We have set three sets
of anchors with the following dimensions: (10,10, 13,22, 24,12), (23,24, 28,37, 45,25), and
(49,50, 91,88, 186,188).
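Because the original DOTA images are far larger than the network input, they are cropped into 1024 × 1024 tiles before training. The sketch below illustrates such a tiling step under the assumption of non-overlapping tiles (the overlap, if any, is not stated here), with the last tile clamped to the image edge; shifting and clipping the eight-parameter box annotations to each tile is omitted.

```python
from pathlib import Path
from PIL import Image

def tile_starts(length: int, tile: int, stride: int) -> list:
    """Start offsets along one axis, clamping the final tile to the image edge."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts

def tile_image(src: Path, dst_dir: Path, tile: int = 1024, stride: int = 1024) -> None:
    """Crop a large aerial image into tile x tile patches and save them."""
    img = Image.open(src)
    w, h = img.size
    dst_dir.mkdir(parents=True, exist_ok=True)
    for top in tile_starts(h, tile, stride):
        for left in tile_starts(w, tile, stride):
            patch = img.crop((left, top, left + tile, top + tile))
            patch.save(dst_dir / f"{src.stem}_{left}_{top}.png")
```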
4.1.2. VisDrone
VisDrone is a widely recognized and highly demanding aerial photography dataset
that is extensively used in UAV (Unmanned Aerial Vehicle) applications. It features a metic-
ulous manual annotation process that has accurately labeled and classified 342,391 objects
into 10 distinct categories. However, the official evaluation portal for the test-challenge set
is unavailable, so we have utilized the test-dev set for evaluating our proposed method.
Figures 5–7 showcase a selection of unprocessed images extracted from the VisDrone
dataset. In our experiments, we have employed three sets of anchor dimensions: (3,4, 4,8, 8,7),
(7,14, 14,9, 13,20), and (25,13, 27,27, 51,40).
Figure 7. Recognition results in crowded environments of Dota-small dataset. (a) Recognition results
of the YOLOv7 network. (b) Recognition results of the YOLO-DCTI network.
Figure 8. Some examples of detection results on the Dota-small dataset using YOLO-DCTI.
In our experimental analysis, we
conducted a comparative evaluation of the YOLOv7-DCTI algorithm with mainstream
object detection algorithms using the Dota-small dataset generated specifically for this
study. The outcomes of these experiments are documented in Table 1, encompassing five
distinct categories: small vehicles, large vehicles, planes, storage tanks, and ships. The
mAP@0.5 values provided in the table indicate the average recognition accuracy achieved by
each algorithm for individual categories. Additionally, the columns denoted as mAP@0.5
and mAP@0.5:0.95 represent the average recognition accuracy across all categories.
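As a reminder of how the two summary columns are obtained, mAP@0.5 averages per-class AP at a single IoU threshold of 0.5, whereas mAP@0.5:0.95 additionally averages over the ten IoU thresholds 0.50, 0.55, ..., 0.95; the toy computation below uses made-up AP values purely to illustrate the averaging, not results from this paper.

```python
import numpy as np

iou_thresholds = np.arange(0.5, 1.0, 0.05)   # 0.50, 0.55, ..., 0.95 (ten thresholds)

# Hypothetical AP table: rows = 5 classes, columns = IoU thresholds.
ap = np.random.rand(5, len(iou_thresholds))

map_50 = ap[:, 0].mean()        # mAP@0.5: average over classes at IoU = 0.5
map_50_95 = ap.mean()           # mAP@0.5:0.95: average over classes and thresholds
print(f"mAP@0.5 = {map_50:.3f}, mAP@0.5:0.95 = {map_50_95:.3f}")
```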
The Dota-small dataset contains predominantly small or tiny objects with limited infor-
mation and is characterized by complex and variable image backgrounds. Distinguishing
these objects from the background presents significant challenges, as some objects may
be partially occluded, further complicating detection. In comparison to other mainstream
object detection algorithms, the enhanced network proposed in this study demonstrates a
notable accuracy advantage in detecting small or tiny objects.
Our proposed method achieves the highest mAP@0.5:0.95 of 65.2% for small or tiny
objects in the Dota-small dataset, surpassing YOLOv7 by 3.4%. In comparison to four other
object detection algorithms, namely Faster R-CNN, SSD, YOLOv5l, and YOLOv7-tiny, the mean
average precision (mAP) at the intersection over union (IoU) threshold range of 0.5 to 0.95
showed improvements of 19.3%, 45.2%, 4.5%, and 12.1%, respectively.
Although the detection speed of YOLOv7 and YOLOv5 is similar to that of our
proposed method, their mAP scores are lower. In scenarios where the differentiation
among the YOLO series detection heads is minimal, the proposed method achieves higher
mAP at the IoU range of 0.5 to 0.95, while maintaining similar detection speeds. This
demonstrates that the proposed method effectively compensates for the differences in
detection heads and offers greater advantages.
The introduction of a Contextual Transformer (CoT) in the decoupled head unavoid-
ably leads to a slight sacrifice in inference speed. However, the global residual structure
and local fusion structure do not introduce any additional slowdown. As a result, the de-
tection speed of the proposed structure remains largely unaffected even with increasing
complexity, achieving a favorable balance between inference speed and detection accuracy.
Table 1. Comparison of detection accuracy of different object detection algorithms on Dota-small dataset.
Model Small Vehicle Large Vehicle Plane Storage Tank Ship mAP@0.5 mAP@0.5:0.95
SSD [55] 26.9 47.4 79.9 35.2 28.9 43.7 20.0
Faster R-CNN [56] 67.8 78.8 93.4 74.6 58.6 74.6 45.9
YOLOv5s [40] 82.9 83.4 89.0 77.5 86.4 83.8 51.9
YOLOv5l [40] 89.0 91.5 95.5 84.5 90.4 90.2 60.7
YOLOv5m [40] 88.9 91.2 92.6 86.3 90.5 89.9 60.8
YOLOv7-tiny [42] 84.5 87.8 92.8 77.3 86.2 85.7 53.1
YOLOv7 [42] 89.9 91.9 95.6 84.9 91.1 90.7 61.8
Ours 90.0 92.6 96.8 85.7 91.5 91.4 65.2
specific issue accentuates the imperativeness of conducting more extensive and meticulous
investigations in future research endeavors.
Figure 9. Some examples of detection results on the VisDrone dataset using YOLO-DCTI.
and assign appropriate attention to them. Overall, the experimental results highlight the
robust performance of the proposed method in object detection tasks, thus indicating its
potential in various real-world applications.
Figure 10. Some examples of detection results on the NWPU VHR-10 dataset using YOLO-DCTI.
Figure 11. A selection of class activation maps exported using grad-cam on NWPU VHR-10 datasets.
We further compared the inference speed and model parameters of various models using the Dota-small dataset, as presented in Table 4. This analysis
incorporates essential metrics such as mAP@0.5:0.95, inference speed, and model parameters. The
evaluation employed test images with a resolution of 640 × 640 and the inference speed
was quantified in milliseconds (ms). Our model showcased a total of 37.67 M parameters,
closely akin to the 39.46 M parameters of Faster RCNN. However, our model gained a
notable advantage due to the absence of candidate box generation operations, leading
to commendable inference speed. While our inference speed aligns with YOLOv5l and
YOLOv7, the distinctive structural design of DCTI fortifies its capacity to effectively capture
features of small targets. Although our approach may not outperform others in terms of
model parameters and inference speed, it successfully achieves a favorable equilibrium
between inference speed and detection accuracy.
Table 4. Inference speed and parameter comparison.
After incorporating CoT in the decoupled head, the mAP@0.5:0.95 value increased
by 0.4%. CoT helps in identifying small or tiny objects by exploiting spatial context and
global channel information. The addition of global residual structure and local fusion to
CoT led to an improvement in mAP@0.5:0.95 by 1.0%. CoT-I further fuses spatial context
and channel features, enabling the network to learn more about small object information,
thereby enhancing detection performance. After incorporating the CoT-I structure in
YOLOv7, mAP@0.5:0.95 increased by 1.5%, providing further evidence that CoT-I can
enhance detection accuracy.
We conducted a comparative analysis of the model’s performance regarding speed
and parameters, as presented in Table 6. The results demonstrate that the incorporation of
decoupled heads and CoT introduced additional latencies of 0.1 ms and 0.2 ms, respectively, in
comparison to the baseline, along with an augmentation of 0.4 M and 2.86 M parameters.
However, it is noteworthy that the inclusion of Global-Residual and Local-Fusion did not
impose any discernible burden on the inference speed and parameter requirements.
We analyzed the kernel sizes utilized in CoT-I and present our findings in Table 7. In
Equation (1), the kernel size k of W_k is set to 3, 5, and 7. Our
analysis reveals that the model achieves the highest inference speed when employing a
kernel size of k = 3. On the other hand, adopting a kernel size of k = 5 results in improved
detection accuracy, albeit with a certain trade-off in terms of inference speed. Notably,
when utilizing a kernel size of k = 7, both the model’s detection accuracy and inference
speed significantly decrease. These observations suggest that expanding the perception
range does not necessarily lead to performance enhancement.
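Under the hypothetical CoTI sketch from Section 3, these kernel-size variants amount to a single constructor argument; the snippet below is illustrative only.

```python
# Hypothetical CoT-I variants with different static-context kernel sizes,
# mirroring the k = 3, 5, 7 comparison (CoTI is the sketch from Section 3).
variants = {k: CoTI(dim=256, kernel_size=k) for k in (3, 5, 7)}
```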
5. Conclusions
This research proposes the framework YOLO-DCTI for remote sensing small or tiny
object detection based on YOLOv7 and an improved Contextual Transformer (CoT-I), which
improves detection accuracy by mining and utilizing spatial contextual features
and channel features. Specifically, we designed an efficient decoupled detection head
structure, DCTI, by introducing CoT and embedding it into YOLOv7 to capture long-range
dependencies over channel and spatial context features. Furthermore, we introduce
an innovative CoT variant, CoT-I, which incorporates a global residual structure and a
local fusion structure. The global residual structure plays a critical role in merging and
integrating information from various input levels, thereby yielding a more comprehensive
and holistic feature representation. Similarly, the local fusion structure assumes a vital
role in modeling the intricate interactions among neighboring pixels while leveraging
the abundant channel-based information prevalent in remote sensing data. Extensive
experiments demonstrate that our method can improve detection accuracy by mining and
utilizing more features. Although the improvement in detection accuracy is accompanied
by a slight loss of detection speed, the balance between detection accuracy and speed is
crucial for remote sensing object detection, provided that the speed meets the application
requirements. Despite the advancements in detection accuracy demonstrated by our
model, occasional instances of missed detections persist in densely populated scenarios
featuring small targets with similar features. In light of this, we aim to undertake more
comprehensive research in our future endeavors to delve deeper into this matter.
Author Contributions: Conceptualization, L.M. and Z.F.; methodology, L.M. and Z.F.; software, L.M.
and Z.F.; validation, L.M., Z.F. and Q.L.; formal analysis, L.M. and Z.F.; investigation, L.M. and Z.F.;
resources, L.M. and Z.F.; data curation, L.M. and Z.F.; writing—original draft preparation, L.M. and
Z.F.; writing—review and editing, B.W. and L.S.; visualization, L.M. and Z.F.; supervision, L.M. and
Z.F.; project administration, L.M. and Z.F.; funding acquisition, L.M. and M.R.; All authors have read
and agreed to the published version of the manuscript.
Funding: This work was supported by the National Natural Science Foundation of China under
Grant 62206221, the Fundamental Research Funds for the Central Universities, the Postdoctoral
Science Foundation of China under Grant 2022M710393, and the Fourth Special Grant (Pre-Station) of the
China Postdoctoral Science Foundation under Grant 2022TQ0035.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wang, B.; Zhao, Y.; Li, X. Multiple instance graph learning for weakly supervised remote sensing object detection. IEEE Trans.
Geosci. Remote Sens. 2021, 60, 5613112. [CrossRef]
2. Tong, K.; Wu, Y. Deep learning-based detection from the perspective of tiny objects: A survey. Image Vis. Comput. 2022, 123,
104471. [CrossRef]
3. Wu, Y.; Zhang, K.; Wang, J.; Wang, Y.; Wang, Q.; Li, Q. CDD-Net: A context-driven detection network for multiclass object
detection. IEEE Geosci. Remote Sens. Lett. 2020, 19, 8004905. [CrossRef]
4. Liu, W.; Quijano, K.; Crawford, M.M. YOLOv5-Tassel: Detecting tassels in RGB UAV imagery with improved YOLOv5 based
on transfer learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8085–8094. [CrossRef]
5. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1222–1230.
6. Xu, W.; Zhang, C.; Wang, Q.; Dai, P. FEA-swin: Foreground enhancement attention swin transformer network for accurate
UAV-based dense object detection. Sensors 2022, 22, 6993. [CrossRef]
7. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection
on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC,
Canada, 11–17 October 2021; pp. 2778–2788.
8. Cheng, G.; Lang, C.; Wu, M.; Xie, X.; Yao, X.; Han, J. Feature enhancement network for object detection in optical remote sensing
images. J. Remote Sens. 2021, 2021, 9805389. [CrossRef]
9. Liu, J.; Yang, D.; Hu, F. Multiscale object detection in remote sensing images combined with multi-receptive-field features and
relation-connected attention. Remote Sens. 2022, 14, 427. [CrossRef]
10. Cheng, G.; Si, Y.; Hong, H.; Yao, X.; Guo, L. Cross-scale feature fusion for object detection in optical remote sensing images. IEEE
Geosci. Remote Sens. Lett. 2020, 18, 431–435. [CrossRef]
11. Dong, Z.; Lin, B. BMF-CNN: An object detection method based on multi-scale feature fusion in VHR remote sensing images.
Remote Sens. Lett. 2020, 11, 215–224. [CrossRef]
12. Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and
scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770.
[CrossRef]
13. Xu, G.; Song, T.; Sun, X.; Gao, C. TransMIN: Transformer-Guided Multi-Interaction Network for Remote Sensing Object Detection.
IEEE Geosci. Remote Sens. Lett. 2022, 20, 6000505. [CrossRef]
14. Li, Q.; Chen, Y.; Zeng, Y. Transformer with transfer CNN for remote-sensing-image object detection. Remote Sens. 2022, 14, 984.
[CrossRef]
15. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An improved swin transformer-based model for remote sensing
object detection and instance segmentation. Remote Sens. 2021, 13, 4779. [CrossRef]
16. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-Transformer-Enabled YOLOv5
with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861. [CrossRef]
17. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
18. Bar, M. Visual objects in context. Nat. Rev. Neurosci. 2004, 5, 617–629. [CrossRef] [PubMed]
19. Carbonetto, P.; De Freitas, N.; Barnard, K. A statistical model for general contextual object recognition. In Proceedings of 8th
European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; pp. 350–362.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV
2016: 14th European Conference, Part IV 14, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg,
Germany, 2016; pp. 630–645.
21. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al.
Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7310–7311.
22. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022,
45, 1489–1500. [CrossRef] [PubMed]
23. Song, G.; Liu, Y.; Wang, X. Revisiting the sibling head in object detector. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11563–11572.
24. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A survey of visual transformers. arXiv 2021,
arXiv:2111.06091. [CrossRef]
25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October
2021; pp. 10012–10022.
27. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone
for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
Montreal, BC, Canada, 11–17 October 2021; pp. 568–578.
28. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers
from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada,
11–17 October 2021; pp. 558–567.
29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers.
In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Part I 16, Glasgow, UK, 23–28 August 2020;
Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
30. Chen, T.; Saxena, S.; Li, L.; Fleet, D.J.; Hinton, G. Pix2seq: A language modeling framework for object detection. arXiv 2021,
arXiv:2109.10852.
31. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021,
arXiv:2110.02178.
32. Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680.
33. Wadekar, S.N.; Chaurasia, A. Mobilevitv3: Mobile-friendly vision transformer with simple and effective fusion of local, global
and input features. arXiv 2022, arXiv:2209.15159.
34. Tong, H.; Peng, T.; Jiang, X. A Lightweight Risk Advertising Image Detection Method Based on Mobile-ViT. In Proceedings of the
2022 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS), Dalian, China, 11–12 December 2022; IEEE:
Piscataway, NJ, USA, 2022; pp. 1249–1253.
35. Marefat, A.; Joloudari, J.H.; Rastgarpour, M. A Transformer-Based Algorithm for Automatically Diagnosing Malaria Parasite in Thin
Blood Smear Images Using MobileViT; Technical Report; EasyChair: Manchester, UK, 2022.
36. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
37. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
38. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
39. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
40. Yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 15 March 2023).
41. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022, arXiv:2209.02976.
42. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
43. Hussain, M.; Al-Aqrabi, H.; Munawar, M.; Hill, R.; Alsboui, T. Domain Feature Mapping with YOLOv7 for Automated
Edge-Based Pallet Racking Inspections. Sensors 2022, 22, 6927. [CrossRef] [PubMed]
44. Zhao, H.; Zhang, H.; Zhao, Y. Yolov7-sea: Object detection of maritime uav images based on improved yolov7. In Proceedings of
the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 233–238.
45. Jiang, K.; Xie, T.; Yan, R.; Wen, X.; Li, D.; Jiang, H.; Jiang, N.; Feng, L.; Duan, X.; Wang, J. An Attention Mechanism-Improved
YOLOv7 Object Detection Algorithm for Hemp Duck Count Estimation. Agriculture 2022, 12, 1659. [CrossRef]
46. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking classification and localization for object detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020;
pp. 10186–10195.
47. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
48. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021
IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE Computer
Society: Piscataway, NJ, USA, 2021; pp. 3490–3499.
49. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object
detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–23 June 2018; pp. 3974–3983.
50. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision
meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer
Vision Workshops, Seoul, Korea, 27–28 October 2019.
51. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on
collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [CrossRef]
52. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
53. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation
method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928.
54. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern
Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 3, pp. 850–855.
55. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings
of the Computer Vision–ECCV 2016: 14th European Conference, Part I 14, Amsterdam, The Netherlands, 11–14 October 2016;
Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
56. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28, 1497. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.