systemically explores the oriented tiny object detection problem from the perspectives of dataset, benchmark, and method. A visual summary of this work is presented in Figure 1.

To further push the boundaries of oriented object detection for extremely tiny objects, we contribute a new dataset dedicated to oriented tiny object detection, named AI-TOD-R (Section III). With a mean object size of only 10.62 pixels, AI-TOD-R has the smallest object size among oriented object detection datasets. This challenging dataset is established through a semi-automatic annotation process that supplements AI-TOD-v2 [13] with orientation information while ensuring high annotation quality. We then benchmark diverse object detection paradigms on AI-TOD-R to investigate how different detection paradigms perform on oriented tiny objects (Section IV). What distinguishes this benchmark from prior art is that we go beyond the fully-supervised paradigm and study both supervised and label-efficient methods, catering to broader and more practical applications. Our findings reveal that generic object detectors tend to exhibit abnormal results when confronted with extremely tiny objects. Notably, the learning bias appears invariably across various methods: the objects' tiny size and low confidence make them easily suppressed or ignored during model training. The vanilla optimization process inevitably traps them in significantly biased prior settings and biased sample learning dilemmas, severely impeding the performance of oriented tiny object detection (Section IV-D). To address this issue, we propose a new approach, Dynamic Coarse-to-Fine Learning (DCFL), aimed at providing unbiased prior settings and sample supervision for oriented tiny objects (Section V). On the one hand, we reformulate the static prior into an adaptively updating prior, thereby guiding more prior positions towards the main area of tiny objects. On the other hand, dynamic coarse-to-fine sample learning separates label assignment into two steps: the coarse step offers diverse positive sample candidates for objects of various sizes and orientations, and the fine step warrants the high quality of positive samples for predictions.

We perform experiments on eight heterogeneous benchmarks, including tiny/small oriented object detection (AI-TOD-R, SODA-A [10]), oriented object detection with large numbers of tiny objects (DOTA-v1.5 [3], DOTA-v2 [3]), multi-scale oriented object detection (DOTA-v1 [14], DIOR-R [15]), and horizontal object detection (VisDrone [2], MS COCO [11]). Our results demonstrate that DCFL remarkably outperforms existing methods for detecting tiny objects (Section VI). Moreover, our results highlight three characteristics of DCFL. (1) Costless improvement: extensive experiments on various datasets show that DCFL improves detection performance without adding any parameters or computational overhead during inference. (2) Versatility: DCFL can be plugged into both one-stage and two-stage detection pipelines and improves their performance on oriented tiny objects; beyond oriented tiny objects, it also enhances the detection of generic small objects. (3) Unbiased learning: by dissecting the training process, we reveal how DCFL achieves unbiased learning—adaptively updating priors to better align with tiny objects' main areas, while balancing the quantity and quality of samples across different scales.

Aiming to address the challenging task of oriented tiny object detection, this paper provides a comprehensive extension of our previous conference version [16]. Beyond the methodological contributions published previously, this journal extension introduces several additional advancements:
• Establishing a task-specific dataset for oriented tiny object detection, which features the smallest object size among oriented object detection datasets, compensating for the lack of resources in this challenging area.
• Creating a benchmark that covers a variety of object detection paradigms, including both fully-supervised and label-efficient methods, revealing learning biases against oriented tiny objects across these approaches.
• Demonstrating the versatility of DCFL by plugging it into both one-stage and two-stage methods, and verifying its generalization ability on small oriented object detection by validating it on the SODA-A dataset.

II. RELATED WORKS

A. Small and Oriented Object Detection Datasets

Small and tiny object detection datasets. Due to the lack of specialized datasets, early studies on Small Object Detection (SOD) were mainly based on small objects in generic or task-specific datasets. For example, the generic object detection dataset MS COCO [11], the face detection dataset WiderFace [17], the pedestrian detection dataset EuroCity Persons [18], and the drone-view dataset VisDrone [2] all contain a considerable number of small objects that can assist related studies. As SOD performance has been struggling for a long time, the establishment of specialized datasets for SOD is receiving growing attention. TinyPerson [19] is the first dataset designed for tiny-scale person detection. AI-TOD [12], [13] is the first multi-category dataset for tiny object detection. DTOD [20] compounds the challenge by addressing not only the tiny size of objects but also their dense packing. Recently, the introduction of the first large-scale SOD dataset, SODA [10], along with its benchmark further highlights the necessity of targeted research on SOD.

Oriented object detection datasets. Oriented object detection is an important direction of visual detection, since orientation information significantly reduces the background region in bounding boxes with minimal additional parameters. The multi-scale datasets DOTA-v1/1.5/2 [3], [14] and DIOR-R [15] are widely adopted for performance benchmarking, where DOTA-v2 is also characterized by its large number of small objects. In addition to these generic datasets, task-specific datasets have been introduced to dissect targeted problems. For example, some datasets are established to study specific classes (e.g., HRSC2016 [21], UCAS-AOD [22], VEDAI [23]), some are designed for fine-grained object detection (e.g., FAIR1M [24]), and some are proposed for specific modalities (e.g., SSDD [25]). Meanwhile, there are also datasets designed for other scenarios sensitive to the object's orientation, including text [26], retail [27], and crack detection [28].
Fig. 2. Statistical analysis of AI-TOD-R. From left to right: (a) object size distribution, (b) object angle distribution, (c) object number per image distribution, and (d) class size distribution. The box plot of "Class Size Distribution" shows the mean and standard deviation of the absolute object size within each class.
Fig. 4. Visualization of annotations in AI-TOD-R. Compared to AI-TOD-v2, using oriented bounding boxes to represent tiny objects significantly reduces background noise, and this advantage is particularly obvious in densely arranged scenes. In addition to the extremely tiny object size, AI-TOD-R introduces further challenges such as dense arrangement, weak feature representation, and an imbalanced class distribution.
over 100 objects. The vast number of tiny objects in each image significantly increases the computational burden during training and inference, giving rise to the need for efficient detector designs that facilitate practical applications.

Imbalanced class distribution. Like many generic object detection or oriented object detection datasets, AI-TOD-R also exhibits a class imbalance challenge. This imbalance is reflected in both the object number² and the object size distribution (Figure 2(d)) of each class. It depicts the real-world class distribution and calls for robust oriented object detectors capable of class-balanced detection performance.

² Class counts: airplane (1,667), bridge (1,541), storage-tank (13,771), ship (35,813), swimming-pool (1,617), vehicle (662,929), person (34,490), wind-mill (632).

C. Label Visualization

Figure 4 showcases typical samples from the AI-TOD-R dataset. These samples exhibit the characteristics of the dataset, including extremely tiny object scale, arbitrary orientation, dense arrangement, and complex scenes. In particular, the visualized annotations reveal the unique advantages of representing tiny objects with oriented bounding boxes. Oriented bounding boxes allow the annotation boxes to more tightly enclose the object's main area. This advantage is particularly evident in densely packed regions, where rotated bounding boxes significantly reduce the overlap between adjacent object boundaries, preventing confusion during the network's learning and prediction in such areas. In addition, oriented bounding boxes capture the orientation information of moving objects, such as vehicles in motion or ships at sea, providing richer information for downstream applications.

IV. AI-TOD-R BENCHMARK

In this section, we present a comprehensive benchmark for AI-TOD-R, encompassing fully-supervised oriented object detection methods as well as label-efficient methods, namely semi-supervised object detection (SSOD), sparsely annotated object detection (SAOD), and weakly-supervised object detection (WSOD).

A. Implementation Details

For fully-supervised methods, experiments on AI-TOD-R are performed following the default settings of the AI-TOD series [12], [13]. We use AI-TOD-R's trainval set for training and its test set for evaluation, and keep the image size of 800×800 for training and testing. The batch size and learning rate are set to 2 and 0.0025, respectively. We only use random flipping as data augmentation for all experiments.

For label-efficient methods, we reorganize the training labels and schedules to adapt to the different paradigms. Semi-Supervised Object Detection (SSOD) methods randomly retain the annotations of 10%, 20%, or 30% of the images from the AI-TOD-R trainval set as training annotations. We follow the default settings of SOOD [65] with a batch size of 6 (with a 1:2 ratio of unlabeled to labeled data) and a learning rate of 0.0025. Additionally, we maintain the same total number of batch size × iterations as the fully-supervised 40-epoch setup. Sparsely Annotated Object Detection (SAOD) randomly retains 10%, 20%, or 30% of the annotations of all objects from the AI-TOD-R trainval set as training labels. We use a batch size of 2 and a learning rate of 0.0025, and maintain the same total number of batch size × iterations as the fully-supervised 40-epoch setup. Besides, Weakly Supervised Object Detection (WSOD) mainly switches the trainval set's annotations from OBB to HBB and keeps the other settings the same as the fully-supervised setting.
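To make the three label reorganizations concrete, the sketch below generates the corresponding training labels from COCO-style annotation dictionaries. The field names ("id", "image_id", "rbbox") and the helper names are hypothetical assumptions for illustration; they are not the authors' tooling.

```python
import math
import random

def ssod_split(images, annotations, labeled_ratio=0.1, seed=0):
    """SSOD setting: keep the annotations of a random subset of images and
    treat the remaining images as unlabeled."""
    rng = random.Random(seed)
    labeled_ids = set(rng.sample([img["id"] for img in images],
                                 int(labeled_ratio * len(images))))
    labeled = [a for a in annotations if a["image_id"] in labeled_ids]
    unlabeled_imgs = [img for img in images if img["id"] not in labeled_ids]
    return labeled, unlabeled_imgs

def saod_split(annotations, labeled_ratio=0.1, seed=0):
    """SAOD setting: keep a random subset of the object annotations drawn
    across the whole training set."""
    rng = random.Random(seed)
    return rng.sample(annotations, int(labeled_ratio * len(annotations)))

def wsod_labels(annotations):
    """WSOD setting: replace each oriented box (cx, cy, w, h, theta) by its
    axis-aligned bounding rectangle so that only HBB supervision remains."""
    hbb = []
    for a in annotations:
        cx, cy, w, h, t = a["rbbox"]
        dx = 0.5 * (abs(w * math.cos(t)) + abs(h * math.sin(t)))
        dy = 0.5 * (abs(w * math.sin(t)) + abs(h * math.cos(t)))
        hbb.append({**a, "bbox": [cx - dx, cy - dy, 2 * dx, 2 * dy]})
    return hbb
```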
TABLE II
MAIN RESULTS OF FULLY-SUPERVISED METHODS ON AI-TOD-R. FOR THE TRAINING SCHEDULE, 1× DENOTES 12 EPOCHS AND 40E DENOTES 40 EPOCHS. METHODS WITH "-O" ARE THE ROTATED VERSIONS OF BASE DETECTORS, AND THE NAME IN "()" DENOTES THE BASELINE METHOD.
ID Method Backbone Schedule AP AP0.5 AP0.75 APvt APt APs APm #Params.
## Architecture:
#1 RetinaNet-O [66] ResNet-50 1× 7.3 23.9 1.8 2.2 5.9 11.1 15.4 36.3M
#2 FCOS-O [67] ResNet-50 1× 11.0 33.6 3.7 3.0 8.9 15.7 22.0 31.9M
#3 Faster R-CNN-O [68] ResNet-50 1× 10.2 30.8 3.6 0.6 7.8 19.0 22.9 41.1M
#4 RoI Transformer [30] ResNet-50 1× 10.5 34.0 2.2 1.1 8.8 16.9 20.3 55.1M
#5 Oriented R-CNN [9] ResNet-50 1× 11.2 33.2 4.3 0.6 9.1 19.5 23.2 41.1M
#6 Deformable DETR-O [69] ResNet-50 1× 8.4 26.7 2.0 4.8 9.3 8.6 7.3 40.8M
#7 ARS-DETR [35] ResNet-50 1× 14.3 41.1 5.8 6.3 14.5 17.6 18.7 41.1M
## Representation:
#8 KLD (RetinaNet-O) [70] ResNet-50 1× 7.8 24.8 2.3 3.1 6.7 10.3 15.8 36.3M
#9 KFIoU (RetinaNet-O) [71] ResNet-50 1× 8.1 25.2 2.8 2.0 6.6 12.3 17.1 36.3M
#10 Oriented RepPoints [34] ResNet-50 1× 13.0 40.3 4.2 5.2 12.2 16.8 21.4 36.6M
#11 PSC (RetinaNet-O) [59] ResNet-50 1× 4.5 15.8 1.2 1.0 3.7 8.2 12.7 36.4M
#12 Gliding Vertex [72] ResNet-50 1× 8.1 27.4 2.1 0.9 6.7 14.7 17.9 41.1M
## Refinement:
#13 R3 Det [31] ResNet-50 1× 8.1 25.8 2.2 1.9 7.3 12.0 16.8 41.7M
#14 S2 A-Net [32] ResNet-50 1× 10.8 33.4 3.3 4.3 11.2 13.0 16.0 38.6M
## Assignment:
#15 ATSS-O (RetinaNet-O) [39] ResNet-50 1× 10.9 33.8 3.1 2.7 8.9 15.5 19.4 36.0M
#16 SASM (RepPoints-O) [41] ResNet-50 1× 11.4 35.0 3.7 3.6 10.2 15.4 19.8 36.6M
#17 CFA [73] ResNet-50 1× 12.4 38.7 4.0 5.0 11.9 16.5 18.8 36.6M
## Backbone:
#18 Oriented R-CNN ResNet-101 1× 11.2 33.0 4.1 0.5 8.9 19.8 24.4 60.1M
#19 Oriented R-CNN Swin-T 1× 12.0 34.6 4.6 0.7 9.9 20.8 25.3 44.8M
#20 Oriented R-CNN LSKNet-T 1× 11.1 33.4 3.8 0.6 9.2 18.9 22.6 21.0M
#21 ReDet [74] ReResNet-50 1× 11.6 32.8 4.8 1.4 9.5 19.4 23.2 31.6M
#22 DCFL (RetinaNet-O) ResNet-50 1× 12.3 (+5.0) 36.7 (+12.8) 4.5 (+2.7) 4.3 10.7 17.2 22.2 36.1M
#23 DCFL (RetinaNet-O) ResNet-50 40e 15.2 (+7.9) 44.9 (+21.0) 5.1 (+3.3) 4.9 13.1 19.7 25.9 36.1M
#24 DCFL (Oriented R-CNN) ResNet-50 1× 15.7 (+4.5) 47.0 (+13.8) 5.8 (+1.5) 6.3 14.8 19.6 22.5 41.1M
#25 DCFL (Oriented R-CNN) ResNet-50 40e 17.1 (+5.9) 49.0 (+15.8) 7.2 (+2.9) 6.4 16.0 21.6 24.9 41.1M
#26 DCFL (S2 A-Net) ResNet-50 1× 13.7 (+2.9) 39.7 (+6.3) 5.3 (+2.0) 4.7 12.4 18.6 22.6 38.6M
#27 DCFL (S2 A-Net) ResNet-50 40e 17.5 (+6.7) 49.6 (+16.2) 7.9 (+4.6) 6.5 15.7 22.6 27.4 38.6M
All other settings are retained from the respective baseline methods unless otherwise specified.

B. Results of Fully-supervised Methods

In Table II, we benchmark the detection performance on oriented tiny objects across a wide range of oriented object detectors. To better compare and analyze the characteristics of the various detection paradigms on the oriented tiny object detection task, we introduce them in a classified manner.

Basic architecture. Based on the prior setting and the number of stages, oriented object detection architectures can be separated into dense [66], [67] (#1, 2), dense-to-sparse [9], [30], [68] (#3, 4, 5), and sparse [35], [69] (#6, 7) paradigms. The dense paradigm usually refers to one-stage methods that yield dense predictions per feature point; dense-to-sparse methods use a first stage to generate sparse proposals (e.g., RPN) and refine them into final predictions in a second stage (e.g., R-CNN); the sparse paradigm is mainly based on Transformers that reason about the object's class and location with a set of sparse queries. Within the dense paradigm, FCOS-O relaxes the IoU-constrained assignment by labeling gt-covered points as positive samples, performing better than the anchor-based dense method. Benefiting from an FPN with higher resolution (P2) and feature-interpolated RoI Align, dense-to-sparse methods perform slightly better than dense methods, at the cost of higher computational demand. Compared to other paradigms, the state-of-the-art sparse method (#7) performs favorably on oriented tiny objects, mainly owing to its training strategies adapted from advanced generic detectors and its rotated deformable attention optimized for arbitrarily oriented objects.

Box representation and loss design. The vanilla regression-based loss suffers from issues including inconsistency with the evaluation metrics, boundary discontinuity, and square-like problems, giving rise to numerous box representation studies. Oriented tiny object detection also benefits from these improved representations and their induced loss functions. The Gaussian-based losses [70], [71] (#8, 9) eradicate the boundary discontinuity issue and enforce alignment between the optimization goal and the evaluation metric, improving AP by about 1 point over the RetinaNet-O baseline. Notably, the point-set-based methods [34], [41], [73] (#10, 16, 17) are particularly effective for detecting oriented tiny objects, which may be attributed to the robustness of the deformable point representation to extreme geometric characteristics.

Sample selection strategies. The quality of positive sample selection directly affects the supervision information in the training process, playing a crucial role in tiny object detection.
TABLE III
MAIN RESULTS OF LABEL-EFFICIENT METHODS ON AI-TOD-R. EVALUATIONS ARE PERFORMED ON THE test set OF AI-TOD-R BY TRAINING UNDER DIFFERENT RATIOS OF ORIENTED BOUNDING BOX (OBB) ANNOTATIONS OR HORIZONTAL BOUNDING BOX (HBB) ANNOTATIONS FROM ITS trainval set. SSOD, SAOD, AND WSOD DENOTE SEMI-SUPERVISED OBJECT DETECTION, SPARSELY ANNOTATED OBJECT DETECTION, AND WEAKLY SUPERVISED OBJECT DETECTION, RESPECTIVELY.
By adaptively determining the positive anchor threshold for each gt, ATSS-O lifts the RetinaNet-O baseline by 3.6 points. By dynamically assessing sample quality based on the object's arrangement and shape information, SASM and CFA yield promising performances of 11.4% and 12.4%, respectively. The significant improvements brought by these sample selection strategies (#15-17) further highlight the importance of customized sample assignment methods for oriented tiny objects.

Backbone choice. We analyze the effects of various backbones on oriented tiny object detection by investigating deeper architectures, vision transformers, large convolution kernels, and rotation equivariance. Different from generic object detection, oriented tiny object detection does not benefit from a deeper backbone architecture (#18 vs. #5) or large convolution kernels (#20 vs. #5): these improved backbones retain similar AP to the basic ResNet-50. This interesting phenomenon can be largely attributed to the limited and local information representation of tiny objects. After multiple rounds of down-sampling in the deeper layers of the network, the limited information of tiny objects is further lost. Besides, the large receptive field of large convolution kernels struggles to fit or converge to the extremely tiny region of interest. By contrast, the shifted-window transformer (Swin Transformer [75]) and rotation-equivariant feature extraction (ReDet [74]) can still benefit oriented tiny objects in our experiments.

C. Results of Label-efficient Methods

Label-efficient object detection aims at reducing the annotation cost (e.g., quantity, difficulty) while matching or even surpassing the performance of fully-supervised methods. Label-efficient approaches are in great demand and show great potential on oriented tiny objects, since their annotation process is laborious and difficult. Herein, we investigate three dominant label-efficient paradigms, whose results on the AI-TOD-R dataset are listed in Table III.

Semi-Supervised Object Detection (SSOD). SSOD relieves the annotation burden by leveraging the precious annotated images together with massive unlabelled images to train object detectors efficiently. Current state-of-the-art approaches [65], [76], [77] employ a teacher-student network architecture in a pseudo-labelling fashion. Surprisingly, using only 30% labelled images and the remaining unlabelled images, the state-of-the-art SSOD approach SOOD [65] (40 epochs) has already achieved competitive performance with fully-supervised single-stage counterparts (i.e., FCOS-O with 1×) trained on full-set annotations. This uncovers the great potential and application value of SSOD methods for tiny-scale oriented objects.

Sparsely Annotated Object Detection (SAOD). SAOD approaches randomly annotate a proportion of objects throughout the whole training set for label-efficient learning. We adapt a classic SAOD method, Co-mining [78], to oriented tiny object detection. Despite using the same number of annotated objects, SSOD methods outperform the SAOD method tested. This performance gap may be attributed to the fact that Co-mining does not utilize an advanced teacher-student network, thereby limiting its effectiveness.

Weakly-Supervised Object Detection (WSOD). Another popular direction of label-efficient object detection uses coarse-level annotations, which are more easily accessible, for fine-level predictions. Among them, a dominant line of research uses horizontal bounding box supervision for oriented box prediction (e.g., H2RBox [63]). With advanced training strategies, experiments reveal that merely using HBB supervision achieves performance comparable to OBB-supervised single-stage baselines (e.g., FCOS-O).

In short, label-efficient methods have demonstrated excellent performance on the task of oriented tiny object detection. Trained with far fewer annotations, SSOD and WSOD methods show very competitive performance compared to one-stage fully-supervised baselines. These findings demonstrate the significant application value of label-efficient methods in the field of oriented tiny object detection and the potential for further exploration.

D. Uncovering Learning Bias

Despite the differences in detection paradigms, one consistent finding is that the detection performance on oriented tiny objects remains significantly inferior to that on regular-sized objects. To gain a clearer understanding of the underlying reasons for this performance gap, we conduct a statistical analysis from the perspective that directly drives the model's training: the sample learning process of objects across various scales (i.e., the input samples and the output predictions).

Specifically, we investigate the prior matching degree (input) and the posterior confidence scores (output) for different-sized objects during training.
Fig. 5. An illustration of the sample learning bias. SOOD [65] is trained with 10% labels under the semi-supervised object detection pipeline.

The results, presented in Figure 5, show the prior sample selection results across different detectors, obtained by counting the number of positive samples assigned to objects of varying scales (upper line charts), and the model's posterior confidence scores for different-sized objects (lower bar chart).

Our analysis in Figure 5 shows that oriented tiny objects face a biased dilemma across various detection pipelines. At the prior level, tiny-scale objects receive significantly fewer positive samples than larger-scale objects. This phenomenon can largely be attributed to the limited feature map resolution, sub-optimal similarity measurements, and label assignment strategies. Specifically, the stride between adjacent feature points, and thus between their corresponding prior locations (e.g., anchor boxes/points), is constrained by the feature map resolution. For example, the stride of prior locations in a typical single-stage detector is at least 8 pixels. This sparse and fixed prior setting fundamentally limits the number of sample candidates for tiny objects compared to larger ones, leading to a biased prior setting. Furthermore, oriented tiny objects often have a lower similarity with the sparse prior boxes (e.g., RetinaNet-O [66]) or cover very few prior points (e.g., FCOS-O [67]), which exacerbates the problem. Under generic sample selection strategies (e.g., MaxIoU, Center Sampling), the number of positive samples ultimately assigned to tiny objects is further reduced, leading to a serious sample bias problem.

This learning bias against oriented tiny objects is also reflected in the high uncertainty of their posterior predictions, as shown in the lower part of Figure 5. The low confidence scores can further exacerbate the learning bias against oriented tiny objects. In supervised learning, some methods select or re-weight confident samples [36], [37], [79], which further weakens oriented tiny objects due to their high uncertainty. In label-efficient learning, thresholds based on predicted scores are used to select pseudo-labels, and this size-induced bias is amplified in the process: regular objects, having higher posterior confidence scores, are more likely to be selected as pseudo-labels for training, whereas tiny objects tend to be filtered out.

V. METHOD

In this section, we first provide a paradigmatic comparison of our method with prior arts. Following this, we describe the details of the core components (i.e., Dynamic Prior, Coarse Prior Matching, and Finer Posterior Matching) of our proposed DCFL. Figure 6 shows an overview of the proposed method.

A. Pipeline Overview

Static prior → Dynamic prior. Oriented object detection is nowadays predominantly solved with dense one-stage detectors (e.g., RetinaNet-O) or dense-to-sparse two-stage detectors (e.g., Oriented R-CNN) [80]. Despite their different architectures, their detection processes all start from a set of dense priors P ∈ R^{W×H×C} (W×H: the size of the feature map; C: the number of prior parameters per feature point) and remap this set into final detection results D through a Deep Neural Network (DNN), which can be simplified as:

D = DNN_d(P),   (1)

where DNN_d is composed of the backbone and the detection head. The detection results D can be separated into two parts: classification scores D_cls ∈ R^{W×H×A} (A denotes the number of classes) and box locations D_reg ∈ R^{W×H×B} (B is the number of box parameters).

This static prior modeling suffers from a significant prior bias for tiny objects: the prior position mostly deviates from the objects' main body (Section I). To accommodate the extreme sizes and arbitrary geometries of these tiny objects, we incorporate an iterative updating process for the prior position and refine it dynamically in each iteration. This transforms the prior into a dynamic set P̃ (˜ denotes a dynamic item), leading to a reformulated detection process:

D = DNN_d(DNN_p(P)),   (2)

where DNN_p(P) is the dynamic prior P̃, and DNN_p is a learnable block incorporated within the detection pipeline to update the prior.
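To make the reformulation in Eqs. (1)–(2) concrete, the following minimal PyTorch sketch shows a static prior grid and a small learnable block standing in for DNN_p that shifts those priors with predicted offsets. It is an illustration under simplifying assumptions (a single feature level and one offset vector per location), not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PriorUpdate(nn.Module):
    """A minimal stand-in for DNN_p in Eq. (2): predicts per-location offsets
    from the feature map and shifts the prior positions accordingly."""
    def __init__(self, channels, stride):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        self.stride = stride

    def forward(self, feat, prior_xy):
        # prior_xy: (H*W, 2) static grid locations remapped to the image.
        delta = self.offset(feat)                              # (B, 2, H, W)
        delta = delta.permute(0, 2, 3, 1).reshape(feat.size(0), -1, 2)
        return prior_xy.unsqueeze(0) + self.stride * delta     # dynamic priors

def make_static_grid(h, w, stride, device="cpu"):
    """The static prior P of Eq. (1): one location per feature point."""
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    return torch.stack([(xs + 0.5) * stride,
                        (ys + 0.5) * stride], dim=-1).reshape(-1, 2)
```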
Static sample learning → Dynamic coarse-to-fine sample learning. To train DNN_d, a proper matching between the prior set P and the gt set GT needs to be solved to assign positive/negative labels to P and supervise the network learning. Existing assignment strategies can be classified into static and dynamic strategies. For static assignment (e.g., RetinaNet [66]), the set of positive labels G is obtained via a hand-crafted matching function M_s, and the set for a specific image remains the same in every epoch, which is formulated as:

G = M_s(P, GT),   (3)

while dynamic assignment approaches [36], [37], [40] tend to leverage the prior information P along with the posterior information (predictions) D for dynamic sample selection, applying a prediction-aware mapping M_d to obtain the set G:

G = M_d(P, D, GT).   (4)

After the positive/negative label separation, the loss function can be summarized into two parts:

L = \sum_{i=1}^{N_{pos}} L_{pos}(D_i, G_i) + \sum_{j=1}^{N_{neg}} L_{neg}(D_j, y_j),   (5)

where N_{pos} and N_{neg} are the numbers of positive and negative samples, respectively, and y_j denotes the negative label.

Whether dynamic or static, oriented tiny objects face a sample bias dilemma under existing label assignment methods: these strategies typically sample and weight high-scoring samples (i.e., prior locations) as positives, while both the prior and posterior scores of tiny objects are extremely low, so their effective samples are wrongly labeled as outlier negative samples.

Towards unbiased sample learning, we reformulate this process into a dynamic coarse-to-fine learning pipeline based on the dynamic priors. The coarse step works in an object-centric way, where we construct a coarse positive candidate bag to warrant sufficient and diverse positive samples for each object. The fine step aims at guaranteeing the learning quality, where we fit each gt with a Dynamic Gaussian Mixture Model (DGMM) as a constraint to select high-quality samples. Thus, the assignment process can be expressed as:

G̃ = M_d(M_s(P̃, GT), \widetilde{GT}),   (6)

where \widetilde{GT} is a finer representation of an object with the DGMM. In a nutshell, our final loss is modeled as:

L = \sum_{i=1}^{\tilde{N}_{pos}} L_{pos}(D̃_i, G̃_i) + \sum_{j=1}^{\tilde{N}_{neg}} L_{neg}(D̃_j, y_j).   (7)

B. Dynamic Prior

We introduce a dynamic updating mechanism, named the Prior Capturing Block (PCB), that can benefit both the dense and the dense-to-sparse oriented object detection paradigms. Seamlessly embedded into the original detection head, the PCB generates prior positions that are better aligned with the main body and geometry of tiny objects, increasing the number of high-quality sample candidates for these objects and mitigating the biased prior configuration.

The structure of the proposed PCB is illustrated in Figure 6. In this design, a dilated convolution is deployed to incorporate the object's surrounding context information, followed by offset prediction [81] to capture dynamic prior positions. Besides, the learned offsets from the regression branch are used to guide feature extraction in the classification branch, leading to better alignment between the two tasks. As such, the PCB inherits the flexibility of the learnable priors in query-based detectors (e.g., DETR [82]) while retaining the explicit physical meaning of the static priors in dense detectors (e.g., RetinaNet-O).

The dynamic prior capturing process unfolds as follows. At initialization, each prior location p(x, y) is set to the spatial location s of the corresponding feature point, remapped to the image. In each iteration, we forward the network to obtain the offset set Δo of each prior location. The prior's location is then updated by:

s̃ = s + s_t \sum_{i=1}^{n} Δo_i / (2n),   (8)

where s_t represents the stride of the feature map and n is the number of offset vectors per location.

As a model-agnostic approach, the dynamic prior can be adapted to both one-stage and two-stage methods. More specifically, we use a 2-D Gaussian distribution N_p(μ_p, Σ_p), which has proven conducive to small objects [44], [83] and oriented objects [70], [83], to model the prior's spatial location. Each dynamic prior location s̃ serves as the Gaussian mean vector μ_p, and each prior is associated with a square-shaped prior (w, h, θ) as in its baseline detector; this shape information forms the covariance matrix Σ_p [58]:

Σ_p = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.   (9)

C. Dynamic Coarse-to-Fine Learning

Without specialized consideration of tiny-scale objects, previous sample assignment strategies are biased towards sampling large-object samples, which usually hold higher confidence, while discarding tiny-scale oriented objects as background. Towards scale-unbiased optimization, we design a dynamic coarse-to-fine learning pipeline, where the coarse step offers sample diversity and the fine step warrants learning quality.

Coarse prior matching for sample diversity. In the coarse step, we introduce an object-specific sample screening approach to offer sufficient and diverse positive sample candidates for each object. Specifically, we construct a set of Coarse Positive Sample (CPS) candidates for each object, in which we consider prior locations from diverse spatial locations and FPN hierarchies as candidates for a specific gt. Unlike sampling from a single FPN layer or from all FPN layers [84], [85], we slightly expand the range of candidates to the gt's nearby spatial locations and adjacent FPN layers, which warrants relatively diverse and sufficient candidates compared to the single-layer heuristic while narrowing down the search area compared to all-layer candidates, alleviating the tiny objects' lack of positive sample candidates.
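The following is a small sketch of the Gaussian modelling in Eq. (9) and the prior-location update in Eq. (8), assuming angles in radians and batched tensors; it is illustrative rather than the authors' implementation.

```python
import torch

def rbox_to_gaussian(xy, wh, theta):
    """Eq. (9): represent a rotated prior (or a gt box) as a 2-D Gaussian
    N(mu, Sigma).  xy: (N, 2), wh: (N, 2), theta: (N,) in radians."""
    cos, sin = torch.cos(theta), torch.sin(theta)
    R = torch.stack([torch.stack([cos, -sin], -1),
                     torch.stack([sin,  cos], -1)], -2)   # (N, 2, 2) rotation
    S = torch.diag_embed(wh ** 2 / 4.0)                   # diag(w^2/4, h^2/4)
    return xy, R @ S @ R.transpose(-1, -2)                # mu, Sigma

def update_prior(s, offsets, stride):
    """Eq. (8): move each prior location by the mean of its n predicted offset
    vectors, scaled by the feature-map stride and the factor 1/2.
    s: (N, 2); offsets: (N, n, 2)."""
    return s + stride * offsets.mean(dim=1) / 2.0
```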
Fig. 6. An overview of the proposed method. The proposed DCFL learning scheme can be adapted into both one-stage and two-stage detection pipelines
for oriented tiny object detection. Left: Feature extraction process and the prior updating process of the PCB. Right: The schematic diagram of the dynamic
coarse-to-fine sample learning.
In this step, we also model the gt as a 2-D Gaussian N_g(μ_g, Σ_g) in the same way, to assist sample selection. The similarity measure used to construct the CPS is the Jensen-Shannon Divergence (JSD) [86] between the anchor and the gt. JSD inherits the scale-invariance property of the Kullback–Leibler Divergence (KLD) [70] and can measure the similarity between the gt and nearby non-overlapping priors [44], [70]. Moreover, it overcomes KLD's drawback of asymmetry. However, a closed-form solution of the JSD between Gaussian distributions is unavailable [87]. Thus, we use the Generalized Jensen-Shannon Divergence (GJSD) [87], which yields a closed-form solution, as a substitute. The GJSD between two Gaussian distributions N_p(μ_p, Σ_p) and N_g(μ_g, Σ_g) is defined as:

GJSD(N_p, N_g) = (1 − α) KL(N_α, N_p) + α KL(N_α, N_g),   (10)

where KL denotes the KLD, and N_α(μ_α, Σ_α) is given by:

Σ_α = \left((1 − α)Σ_p^{-1} + αΣ_g^{-1}\right)^{-1},   (11)

and

μ_α = Σ_α\left((1 − α)Σ_p^{-1}μ_p + αΣ_g^{-1}μ_g\right),   (12)

where α is a parameter that controls the weighting of the two distributions [87] in the similarity measurement. In our case, N_p and N_g contribute equally, so α is set to 0.5.

Ultimately, for each gt, we select the K priors with the top-K GJSD scores as the Coarse Positive Samples (CPS) and label the remaining priors as negative samples. This coarse matching serves as M_s in Equation 6. GJSD can effectively measure the similarity between a specific gt and samples across FPN layers. Consequently, we extend the CPS to cover both the object's adjacent region and neighboring hierarchies by selecting a relatively large number of sample candidates.

Finer posterior matching enhances sample quality. In the fine step, we aim to improve the learning quality without exacerbating the inter-object learning bias. To achieve this, we approximate the instance-wise semantic pattern by representing each object with a Dynamic Gaussian Mixture Model (DGMM). This model serves as M_d in Equation 6 for the object-wise sample constraint. Unlike batch-wise or sample-wise evaluations [36], [37], [40], which tend to favor larger objects, our approach assesses the relative quality of samples within each object, ensuring consistent positive sample supervision across objects of varying sizes.

First, we refine the sample candidates in the CPS according to their predicted scores to fit the object's semantically salient regions. More specifically, we define the Possibility of becoming a True prediction (PT) [37] for sample screening, which is a linear combination of the predicted classification score and the location score with respect to the gt. We define the PT of the i-th sample D_i as:

PT_i = 0.5 (Cls(D_i) + IoU(D_i, gt_i)),   (13)

where Cls is the predicted classification confidence and IoU is the rotated IoU between the predicted location and its corresponding gt location. We select the candidates with the Q highest PT values as Medium Positive Sample (MPS) candidates.

Following this, we define the DGMM using a mixture of the gt's geometry and the MPS distribution to eliminate misaligned samples and obtain the final positive samples for prediction. Unlike previous works that utilize a center probability map [88] or a single Gaussian [42], [58] for instance representation, our approach represents the instance with a more refined DGMM. This model consists of two components: one centered on the geometric center and the other on the semantic center of the object. Specifically, for a given instance gt_i, the geometric center (cx_i, cy_i) serves as the mean vector μ_{i,1} of the first Gaussian, and the semantic center (sx_i, sy_i), deduced by averaging the locations of the samples in the MPS, serves as μ_{i,2}. That is to say, we parameterize the instance representation as:

DGMM_i(s|x, y) = \sum_{m=1}^{2} w_{i,m} \sqrt{2π|Σ_{i,m}|}\, N_{i,m}(μ_{i,m}, Σ_{i,m}),   (14)

where w_{i,m} is the weight of each Gaussian, with the weights summing to 1, and Σ_{i,m} equals the gt's Σ_g. Under this modeling, each sample in the MPS is associated with a DGMM score DGMM(s|MPS). Samples with DGMM(s|MPS) < e^{−g} for any gt are assigned negative masks, with g being an adjustable parameter.

VI. EXPERIMENTS

A. Datasets and Implementation Details

Datasets. In addition to the experiments on AI-TOD-R, we conduct experiments on seven more datasets covering various tasks to verify the method's broad adaptability.
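As a concrete reference for the coarse matching step of Section V-C, the sketch below implements the closed-form GJSD of Eqs. (10)–(12) and a top-K CPS selection. The tensor shapes, the closed-form Gaussian KLD, and the convention of ranking by the K smallest divergences are illustrative assumptions rather than the paper's exact code.

```python
import torch

def kl_gauss(mu0, s0, mu1, s1):
    """KL(N0 || N1) for 2-D Gaussians; mu*: (..., 2), s*: (..., 2, 2)."""
    s1_inv = torch.inverse(s1)
    d = (mu1 - mu0).unsqueeze(-1)
    trace = torch.diagonal(s1_inv @ s0, dim1=-2, dim2=-1).sum(-1)
    maha = (d.transpose(-1, -2) @ s1_inv @ d).squeeze(-1).squeeze(-1)
    return 0.5 * (trace + maha - 2.0 + torch.logdet(s1) - torch.logdet(s0))

def gjsd(mu_p, s_p, mu_g, s_g, alpha=0.5):
    """Eqs. (10)-(12): Generalized JSD between prior and gt Gaussians."""
    sp_inv, sg_inv = torch.inverse(s_p), torch.inverse(s_g)
    s_a = torch.inverse((1 - alpha) * sp_inv + alpha * sg_inv)
    mu_a = (s_a @ ((1 - alpha) * sp_inv @ mu_p.unsqueeze(-1)
                   + alpha * sg_inv @ mu_g.unsqueeze(-1))).squeeze(-1)
    return ((1 - alpha) * kl_gauss(mu_a, s_a, mu_p, s_p)
            + alpha * kl_gauss(mu_a, s_a, mu_g, s_g))

def coarse_positive_samples(mu_p, s_p, mu_g, s_g, k=16):
    """Coarse step for one gt (mu_g: (1, 2), s_g: (1, 2, 2)): keep the K priors
    most similar to the gt, treating a smaller divergence as a higher score."""
    div = gjsd(mu_p, s_p, mu_g, s_g)
    return torch.topk(div, k, largest=False).indices
```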
TABLE IV
RESULTS ON THE SODA-A TEST SET. ALL MODELS ARE TRAINED ON THE SODA-A TRAIN SET WITH A RESNET-50 BACKBONE. SCHEDULE DENOTES THE TRAINING EPOCHS, WHERE '1×' REFERS TO 12 EPOCHS.
Method Publication Schedule AP AP0.5 AP0.75 APeS APrS APgS APN #Params. FLOPs
Faster RCNN-O [68] TPAMI 2017 1× 32.5 70.1 24.3 11.9 27.3 42.2 34.4 41.1M 292.25G
RetinaNet-O [66] TPAMI 2020 1× 26.8 63.4 16.2 9.1 22.0 35.4 28.2 36.2M 221.90G
RoI Transformer [30] CVPR 2019 1× 36.0 73.0 30.1 13.5 30.3 46.1 39.5 55.1M 306.20G
Gliding Vertex [72] TPAMI 2021 1× 31.7 70.8 22.6 11.7 27.0 41.1 33.8 41.1M 292.25G
Oriented RCNN [9] ICCV 2021 1× 34.4 70.7 28.6 12.5 28.6 44.5 36.7 41.1M 292.44G
S2 A-Net [32] TGRS 2022 1× 28.3 69.6 13.1 10.2 22.8 35.8 29.5 38.6M 277.72G
DODet [89] TGRS 2022 1× 31.6 68.1 23.4 11.3 26.3 41.0 33.5 69.3M 555.49G
Oriented RepPoints [34] CVPR 2022 1× 26.3 58.8 19.0 9.4 22.6 32.4 28.5 55.7M 274.07G
DHRec [90] TPAMI 2022 1× 30.1 68.8 19.8 10.6 24.6 40.3 34.6 32.0M 792.76G
CFINet [45] ICCV 2023 1× 34.4 73.1 26.1 13.5 29.3 44.0 35.9 44.0M 312.60G
DCFL (RetinaNet-O) Ours 1× 34.9 (+8.1) 73.2 (+9.8) 27.8 (+11.6) 14.2 29.8 43.7 38.0 36.1M 221.90G
DCFL (Oriented R-CNN) Ours 1× 36.6 (+2.2) 72.6 (+1.9) 32.4 (+3.8) 13.9 30.3 47.4 41.2 41.1M 292.44G
TABLE V
MAIN RESULTS ON THE DOTA-V2 OBB TASK. WE FOLLOW THE OFFICIAL CLASS ABBREVIATIONS OF THE DOTA-V2.0 BENCHMARK [3]. DP DENOTES DEFORMABLE ROI POOLING [81]. † DENOTES TRAINING FOR 40 EPOCHS. NOTE THAT THE PAPER [70] REPORTS 50.90% MAP FOR R3DET W/ KLD UNDER 20 EPOCHS; THE RER101 BACKBONE IS PROPOSED BY REDET [74]. RESULTS IN BOLD AND UNDERLINE DENOTE THE BEST AND SECOND-BEST PERFORMANCE IN EACH COLUMN.
Method Backbone Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC Air Heli mAP
multi-stage:
Faster R-CNN-O [68] R50 71.61 47.20 39.28 58.70 35.55 48.88 51.51 78.97 58.36 58.55 36.11 51.73 43.57 55.33 57.07 3.51 52.94 2.79 47.31
Faster R-CNN-O w/ Dp R50 71.55 49.74 40.34 60.40 40.74 50.67 56.58 79.03 58.22 58.24 34.73 51.95 44.33 55.10 53.14 7.21 59.53 6.38 48.77
Mask R-CNN [91] R50 76.20 49.91 41.61 60.00 41.08 50.77 56.24 78.01 55.85 57.48 36.62 51.67 47.39 55.79 59.06 3.64 60.26 8.95 49.47
HTC* [92] R50 77.69 47.25 41.15 60.71 41.77 52.79 58.87 78.74 55.22 58.49 38.57 52.48 49.58 56.18 54.09 4.20 66.38 11.92 50.34
RoI Transformer [30] R50 71.81 48.39 45.88 64.02 42.09 54.39 59.92 82.70 63.29 58.71 41.04 52.82 53.32 56.18 57.94 25.71 63.72 8.70 52.81
Oriented R-CNN [9] R50 77.95 50.29 46.73 65.24 42.61 54.56 60.02 79.08 61.69 59.42 42.26 56.89 51.11 56.16 59.33 25.81 60.67 9.17 53.28
one-stage:
DAL [40] R50 71.23 38.36 38.60 45.24 35.42 43.75 56.04 70.84 50.87 56.63 20.28 46.53 33.49 47.29 12.15 0.81 25.77 0.00 38.52
SASM [41] R50 70.30 40.62 37.01 59.03 40.21 45.46 44.60 78.58 49.34 60.73 29.89 46.57 42.95 48.31 28.13 1.82 76.37 0.74 44.53
RetinaNet-O [66] R50 70.63 47.26 39.12 55.02 38.10 40.52 47.16 77.74 56.86 52.12 37.22 51.75 44.15 53.19 51.06 6.58 64.28 7.45 46.68
R3 Det w/ KLD [70] R50 75.44 50.95 41.16 61.61 41.11 45.76 49.65 78.52 54.97 60.79 42.07 53.20 43.08 49.55 34.09 36.26 68.65 0.06 47.26
FCOS-O [67] R50 74.84 47.53 40.83 57.41 43.89 47.72 55.66 78.61 57.86 63.00 38.02 52.38 41.91 53.24 40.22 7.15 65.51 7.42 48.51
Oriented Reppoints [34] R50 73.02 46.68 42.37 63.05 47.06 50.28 58.64 78.84 57.12 66.77 35.21 50.76 48.77 51.62 34.23 6.17 64.66 5.87 48.95
ATSS-O [39] R50 77.46 49.55 42.12 62.61 45.15 48.40 51.70 78.43 59.33 62.65 39.18 52.43 42.92 53.98 42.70 5.91 67.09 10.68 49.57
S2 A-Net [32] R50 77.84 51.31 43.72 62.59 47.51 50.58 57.86 80.73 59.11 65.32 36.43 52.60 45.36 52.46 40.12 0.00 62.81 11.11 49.86
ours:
DCFL (Retinanet-O) R50 75.71 49.40 44.69 63.23 46.48 51.55 55.50 79.30 59.96 65.39 41.86 54.42 47.03 55.72 50.49 11.75 69.01 7.75 51.57
DCFL (S2 A-Net) R50 74.79 53.25 45.81 65.46 46.49 53.23 58.10 81.51 60.13 66.42 43.24 55.09 50.52 55.58 54.53 5.23 68.73 13.06 52.84
DCFL (Oriented R-CNN) R50 77.59 52.46 45.98 61.73 49.77 54.32 60.55 79.27 61.76 68.17 43.41 56.59 52.41 56.68 55.32 27.42 63.50 12.64 54.42
DCFL (Retinanet-O)† R50 78.30 53.03 44.24 60.17 48.56 55.42 58.66 78.29 60.89 65.93 43.54 55.82 53.33 60.00 54.76 30.90 74.01 15.60 55.08
DCFL (Retinanet-O)† ReR101 79.49 55.97 50.15 61.59 49.00 55.33 59.31 81.18 66.52 60.06 52.87 56.71 57.83 58.13 60.35 35.66 78.65 13.03 57.66
These tasks include small oriented object detection (SODA-A [10]), oriented object detection with a large number of tiny objects (DOTA-v1.5 [3], DOTA-v2 [3]), multi-scale oriented object detection (DOTA-v1 [14], DIOR-R [15]), and horizontal object detection (VisDrone [61], MS COCO [11], DOTA-v2 HBB).

For ablation studies and analyses, we choose the large-scale DOTA-v2 train set for training and its val set for evaluation, since DOTA-v2 is the largest dataset for oriented object detection and contains a substantial number of tiny objects. This dataset enables us to simultaneously verify the method's effectiveness on tiny object detection and on generic oriented object detection. For fair comparison with other methods, we use the trainval sets of DOTA-v1, DOTA-v1.5, DOTA-v2, and DIOR-R for training and their respective test sets for testing; for SODA-A we use its train set and test set, and for VisDrone2019 and MS COCO we use their train sets and val sets for training and evaluation.

Implementation details. We conduct all experiments on a computer equipped with a single NVIDIA RTX 4090 GPU, setting the batch size to 4. The models are built on the MMDetection [93] and MMRotate [94] frameworks with PyTorch [95]. We utilize ImageNet [96] pre-trained models as the backbone. For training, we employ the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.005, a momentum of 0.9, and a weight decay of 0.0001. Unless otherwise specified, the default backbone is ResNet-50 [97] with FPN [98]. We use the focal loss [66] for classification and the IoU loss [57] for regression. We only use random flipping for data augmentation across all experiments.

For experiments on DOTA-v1 and DOTA-v2, we adhere to the official settings of the DOTA-v2 benchmark [3]. Specifically, we crop the images into 1024 × 1024 patches with 200-pixel overlaps and train the models for 12 epochs.
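A minimal PyTorch sketch of the optimization settings and the single flip augmentation described above; the placeholder model and the oriented-box flip convention (angle negation) are assumptions for illustration, not the paper's configuration files.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for ResNet-50 + FPN; only the optimizer
# settings mirror those stated in the text.
model = nn.Conv2d(3, 256, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)

def random_flip(img, boxes, p=0.5):
    """Random horizontal flipping, the only augmentation used.
    img: (C, H, W) tensor; boxes: (N, 5) oriented boxes (cx, cy, w, h, theta)."""
    if torch.rand(()) < p:
        img = torch.flip(img, dims=[-1])
        boxes = boxes.clone()
        boxes[:, 0] = img.shape[-1] - boxes[:, 0]   # mirror the centre x
        boxes[:, 4] = -boxes[:, 4]                  # mirror the angle (assumed convention)
    return img, boxes
```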
For DOTA-v2, we reproduce several state-of-the-art one-stage methods [31], [32], [34], [39]–[41], [70], [99] using the same settings. For experiments on the other datasets, we follow their default benchmarks for image pre-processing, including setting the input size to 1200×1200 for SODA-A, 1024×1024 with 200-pixel overlaps for DOTA-v1.5, 800×800 for DIOR-R, and 1333×800 for VisDrone and MS COCO. The models are trained for 40 epochs on DOTA-v1.5 and DIOR-R, and for 12 epochs on SODA-A, VisDrone, and MS COCO, following previous works [34], [73]. DCFL uses RetinaNet-O as the baseline detector if not specified. Unless otherwise stated, these settings are consistently maintained.

TABLE VI
COMPARISON WITH ONE-STAGE DETECTORS ON THE DOTA-V1 OBB TASK. ALL RESULTS ARE BASED ON MMROTATE [94] WITH 12 EPOCHS EXCEPT FOR GGHL [42]. 3× MEANS TRAINING FOR 36 EPOCHS.

Method CFA [73] RetinaNet-O [66] R3Det [31] Oriented Rep [34] ATSS-O [39]
mAP 69.63 69.79 70.18 71.94 72.29
Method KLD [70] S2A-Net [32] GGHL [42] (3×) DCFL DCFL (3×)
mAP 72.76 73.91 73.98 74.26 75.35

TABLE VII
MAIN RESULTS ON THE DOTA-V1.5 OBB TASK.

Method Backbone SV Ship ST mAP
RetinaNet-O [66] R50 44.53 73.31 59.96 59.16
Faster R-CNN-O [91] R50 51.28 79.37 67.50 62.00
CMR [91] R50 51.64 79.99 67.58 63.41
RoI Transformer [30] R50 52.05 80.72 68.26 65.03
ReDet [74] ReR50 52.38 80.92 68.64 66.86
DCFL R50 56.72 80.87 75.65 67.37 (+8.21)
DCFL ReR101 57.31 86.60 76.55 70.24 (+11.08)

B. Main Results

Tiny/small oriented object detection. As the main track, we evaluate the performance of DCFL on the challenging datasets dedicated to tiny (AI-TOD-R) and small (SODA-A) oriented object detection. First, the results on AI-TOD-R are shown in Table II. Without bells and whistles, DCFL improves both one-stage (#1 vs. #22) and two-stage object detectors (#5 vs. #24) by large margins. Notably, when plugging DCFL into the advanced one-stage method S2A-Net, our approach reaches a new state-of-the-art performance of 49.6% AP0.5, a remarkable improvement of 16.2 points, with significant gains on very tiny objects. Besides, we also evaluate the proposed method on another oriented small object detection benchmark, SODA-A. As a recently proposed dataset, the challenging and large-scale SODA [10] is attracting increasing attention. Results on this benchmark are shown in Table IV, where DCFL shines by boosting RetinaNet-O by 8.1 AP points and the strong Oriented R-CNN baseline by 2.2 AP points. Moreover, the improvement in AP0.75 is more pronounced than in AP0.5, indicating that DCFL locates oriented tiny objects more precisely. Given that DCFL mainly optimizes the model's training process, the accuracy improvement incurs no additional parameter or computational cost on either dataset, as shown in Tables II and IV.

Oriented object detection with massive tiny objects. More generally, evaluating a model's detection performance on datasets containing both massive tiny objects and other-sized objects not only validates its ability to address tiny objects but also examines its robustness to scale variance. We thus perform experiments on DOTA-v1.5 and DOTA-v2, which are general-purpose datasets characterized by a significant number of tiny objects. As shown in Table V, our proposed method achieves a state-of-the-art performance of 57.66% mAP on the challenging DOTA-v2 benchmark with single-scale training and testing. Meanwhile, our model attains 51.57% mAP on this dataset without bells and whistles, outperforming all tested one-stage oriented object detectors. Besides, results on DOTA-v1.5 are presented in Table VII, where DCFL notably improves the baseline and achieves leading performance among one-stage methods.

Multi-scale oriented object detection. Investigating the method's performance on multi-scale oriented object detection datasets demonstrates its versatility and generality across diverse oriented object detection tasks. Therefore, we validate DCFL on the DOTA-v1 and DIOR-R multi-scale oriented object detection datasets, which also include some tiny object classes. The results on these datasets are shown in Tables VI and VIII. Beyond tiny-object-specific datasets, DCFL also excels in multi-scale scenarios, achieving leading performance among all one-stage methods. Furthermore, the class-wise AP of tiny objects on DOTA-v1 and DIOR-R, listed in Tables VI and IX, shows particularly significant improvements for tiny-size classes, often by more than 10 points.

Horizontal object detection. The proposed method can also be applied to generic object detection tasks and enhance their performance, simply by discarding the angle information. We evaluate the model on three different scenarios: drone-captured images (VisDrone), natural images (MS COCO), and aerial images (DOTA-v2 HBB). These datasets, annotated with horizontal bounding boxes, contain a significant number of small objects. Integrating our learning pipeline into the RetinaNet-O baseline yields an improvement of 2-3 points, as shown in Table X.

In a nutshell, these results demonstrate that DCFL is not only highly effective for detecting oriented tiny objects (such as small vehicles, ships, and storage tanks), achieving an approximately 10-point improvement over the baseline for these classes, but also excels in general-purpose oriented object detection and horizontal object detection tasks, as evidenced by its performance on tracks like DOTA-v1, DIOR-R, and MS COCO.

C. Ablations

Effects of individual strategies. We evaluate the effectiveness of each proposed strategy through a series of ablation experiments. For consistency and fair comparison, we tile one prior for each feature point in all experiments. As shown in Table XIIa, the baseline detector, RetinaNet-O, achieves an mAP of 51.70%.
Fig. 7. Visualization analysis of the predicted results. The first row shows the results predicted by Oriented R-CNN, while the second row shows results from DCFL on the AI-TOD-R dataset. True positive, false negative, and false positive predictions are marked in green, red, and blue, respectively.

TABLE VIII
Method RetinaNet-O [66] FR-OBB [68] RT [30] AOPG [15]
mAP 57.55 59.54 63.87 64.41
Method GGHL [42] Oriented Rep [34] DCFL DCFL (ReR101)
mAP 66.48 66.71 66.80 71.03

TABLE IX
DETECTION RESULTS OF TYPICAL TINY OBJECTS ON THE DIOR-R DATASET. VE, BR, AND WM DENOTE VEHICLE, BRIDGE, AND WIND-MILL.

Method Backbone VE BR WM
RetinaNet-O [66] R50 38.0 24.0 60.2
Oriented Rep [34] R50 50.4 38.8 64.7
DCFL R50 50.9 (+12.9) 42.1 (+18.1) 70.9 (+10.7)

TABLE X
Dataset VisDrone MS COCO DOTA-v2 HBB
Method RetinaNet [66] DCFL RetinaNet [66] DCFL FCOS [44] DCFL
AP0.5 29.2 32.1 55.4 57.3 55.4 57.4

Gradually integrating the posterior re-ranked MPS and the DGMM into the detector, on top of the CPS, results in progressive mAP improvements, confirming the effectiveness of each design. It is important to note that the CPS cannot be used independently, as its samples are too coarse to serve as final positive samples. Nevertheless, we compare different ways of constructing the CPS to verify its superiority.

Comparisons of different CPS. The design choice of the CPS determines the range of sample candidates during training. In this section, we compare several CPS design paradigms, including limiting the CPS of a specific gt to a single layer and utilizing all FPN layers as the CPS, similar to ObjectBox [84]. We present their performance in Table XIIb. For fair comparison, the number of samples in the CPS is fixed at 16, and all other components remain unchanged. In the Single-FPN-layer approach, we group gts onto different layers based on the regression range defined in FCOS [99] and assign labels within each layer. In the All-FPN-layer approach, we do not group gts onto different layers but instead discard the prior scale information and directly measure the distance between the Gaussian gt and the prior points. As shown in Table XIIb, neither of these two methods yields the best performance. In contrast, using distribution distances (KLD, GWD, GJSD) to construct the Cross-FPN-layer CPS extends the candidate range to adjacent layers in addition to the main layer. We can also see that GJSD achieves the best performance of 59.15% mAP, mainly owing to its scale invariance [70], [87], symmetry [87], and ability to measure non-overlapping boxes [87] compared to the other counterparts.

Fixed prior or dynamic prior. We conduct a detailed set of ablation studies to verify the necessity of introducing the dynamic prior. As shown in Table XIIc, disabling the dynamic prior by fixing the sample locations results in a performance drop. This indicates that the prior should be adjusted accordingly when leveraging the dynamic sampling strategy, in order to better capture the shape of objects.

Detailed design of the PCB. The PCB consists of a dilated convolution and a guiding DCN. We slightly enlarge the receptive field using a dilation rate of 3 and then utilize the DCN to generate dynamic priors in a guiding manner. As shown in Table XIIc, the DCN provides an improvement of 0.34 mAP points, and the dilated convolution slightly enhances the mAP. However, applying the DCN [100] to the regression branch alone slightly deteriorates accuracy (denoted as Separate in Table XIIc), likely due to mismatch issues between the two branches. To address this, we use the offsets from the regression head to guide the offsets for the classification head, resulting in better alignment (denoted as Guiding).
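The following is a minimal sketch of such a prior-capturing head, assuming torchvision's DeformConv2d and 256-channel head features; it mirrors the 'Guiding' design described above but is not the authors' exact PCB.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class PriorCapturingBlock(nn.Module):
    """Sketch of a PCB-style head: a dilated 3x3 convolution enlarges the
    receptive field, a small branch predicts DCN offsets from the regression
    features, and the same offsets guide deformable feature extraction in the
    classification branch (the 'Guiding' variant of the ablation)."""
    def __init__(self, channels=256):
        super().__init__()
        self.context = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        self.offset = nn.Conv2d(channels, 18, 3, padding=1)   # 9 points x (dy, dx)
        self.reg_dcn = DeformConv2d(channels, channels, 3, padding=1)
        self.cls_dcn = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, cls_feat, reg_feat):
        reg_feat = self.context(reg_feat)
        offsets = self.offset(reg_feat)            # shared, regression-guided offsets
        reg_out = self.reg_dcn(reg_feat, offsets)
        cls_out = self.cls_dcn(cls_feat, offsets)  # cls branch guided by reg offsets
        return cls_out, reg_out, offsets           # offsets also update the priors (Eq. (8))
```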
TABLE XII
ABLATIONS. WE TRAIN ON THE DOTA-V2 train SET, TEST ON ITS val SET, AND REPORT MAP UNDER AN IOU THRESHOLD OF 0.5.

(a) Individual effectiveness. CPS, MPS, and DGMM denote the Coarse and Medium Positive Sample candidates and the Dynamic Gaussian Mixture Model.
Method CPS MPS DGMM mAP
baseline [66] 51.70
DCFL ✓ ✓ 53.41
DCFL ✓ ✓ 57.20
DCFL ✓ ✓ ✓ 59.15

(b) Comparisons of different CPS. The number of FPN layers involved varies with the strategy used to obtain the CPS.
Strategy Measurement mAP
All-FPN-layer Gaussian 50.12
Single-FPN-layer Gaussian 56.72
Cross-FPN-layer KLD [70] 57.82
Cross-FPN-layer GWD [83] 58.55
Cross-FPN-layer GJSD 59.15

(c) Effects of designs in the PCB. DP: the dynamic prior. Guiding: the regression branch guides the classification branch.
DCN Dilated Conv DP mAP
58.07
✓ 58.41
✓ ✓ 58.65
Separate ✓ ✓ 58.71
Guiding ✓ ✓ 59.15

(d) Effects of K and Q.
K 16 16 16 16 12 12 12 12
Q 12 10 8 6 10 8 6 4
mAP 59.15 58.57 58.97 57.84 58.79 58.25 57.01 57.37

(e) Effect of the DGMM threshold g.
g 0.8 0.4
mAP 59.15 58.95
Effects of parameters. The three introduced parameters are robust within a certain range. As shown in Table XIId, the combination of K = 16 and Q = 12 yields the best performance. In Table XIIe, we verify the threshold e^{−g} in the DGMM and find that setting w_{i,1} to 0.7 and a threshold of g = 0.8 results in the highest mAP. Although making the CPS/MPS/DGMM coarser or stricter can weaken performance, the mAP only fluctuates slightly. This indicates that the coarse-to-fine assignment method is robust to parameter selection, as multiple parameters can mitigate the effect of any single under-tuned parameter.

Fig. 8. Analysis of the learning bias across different methods. The first and second columns investigate the quality and quantity imbalances, respectively. Results are sampled from the model's last training epoch.

D. Analysis

Visual analysis. We visualize DCFL's predictions and dynamic prior positions to better show the model's capability of addressing oriented tiny objects in Figures 7 and 8, respectively. In Figure 7, by separating the model's predictions into true positive, false negative, and false positive predictions with different colors according to the gt, we can easily see that DCFL significantly suppresses false negative predictions (i.e., missed detections) for tiny objects. This improvement can be largely attributed to the sufficient and unbiased sample learning of different-sized objects resulting from the coarse-to-fine sample selection scheme. Besides, from Figure 8 (upper), we can see that the prior setting in DCFL better matches the discriminative areas of oriented tiny objects. This further verifies that, by adaptively adjusting prior positions according to the object's region of interest, the prior bias of previous static prior designs can be mitigated.

How does DCFL achieve unbiased learning? To better understand the working mechanism of DCFL, we delve into its training process by statistically investigating its sample assignment. Specifically, we calculate the quantity and quality of the positive samples assigned to ground truth (gt) bounding boxes within various angle and scale intervals. This analysis reveals two types of imbalance (quantity and quality) in the baseline methods: (1) the number of positive samples assigned to each object varies periodically with respect to its angle and scale, with objects whose shapes (scale, angle) differ from the predefined anchors receiving far fewer positive samples; (2) the predicted IoU fluctuates periodically with respect to the gt's scale while remaining invariant with respect to the gt's angle. In contrast, DCFL effectively addresses these learning biases: (1) it compensates by assigning more positive samples to previously outlier angles and scales; (2) it improves and balances the quality of samples (predicted IoU) across all angles and scales. These results demonstrate the desired behavior of dynamic coarse-to-fine learning.
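As a pointer for reproducing this kind of analysis, the sketch below bins per-gt statistics by scale and angle and reports the mean number of positives (quantity) and the mean predicted IoU of those positives (quality). The tensor layout, the per-gt aggregation, and the bin choices are illustrative assumptions, not the authors' analysis code.

```python
import torch

def assignment_bias_stats(gt_scale, gt_angle, num_pos, pred_iou,
                          scale_bins, angle_bins):
    """gt_scale, gt_angle, num_pos, pred_iou: float (N,) tensors, one entry per
    gt (pred_iou is the mean predicted IoU of that gt's positive samples).
    scale_bins, angle_bins: 1-D tensors of bin boundaries."""
    s_idx = torch.bucketize(gt_scale, scale_bins)
    a_idx = torch.bucketize(gt_angle, angle_bins)
    stats = {}
    for name, idx, bins in (("scale", s_idx, scale_bins),
                            ("angle", a_idx, angle_bins)):
        quantity, quality = [], []
        for b in range(len(bins) + 1):
            mask = idx == b
            quantity.append(num_pos[mask].mean().item() if mask.any() else 0.0)
            quality.append(pred_iou[mask].mean().item() if mask.any() else 0.0)
        stats[name] = {"mean_num_pos": quantity, "mean_pred_iou": quality}
    return stats
```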
VII. DISCUSSIONS

The precise detection of arbitrary-oriented tiny objects is a fundamental step towards more generic pattern recognition in numerous specialized scenarios. Meanwhile, state-of-the-art object detectors significantly degrade when detecting these objects. Moreover, there is still a lack of task-specific datasets and benchmarks dedicated to the corresponding research. This motivates us to address this intricate but inevitable challenge. To this end, we establish a task-specific dataset and benchmark, and design a new method that realizes unbiased learning for objects of different scales and orientations.

Nevertheless, some challenges remain. First, the detection of oriented tiny objects is a widespread issue across various scenarios (e.g., autonomous driving, medical imaging, and defect detection) and diverse modalities (e.g., SAR, thermal, and X-ray data). This work, however, primarily focuses on aerial scenes in high-resolution optical data. By focusing on the typical scenario of aerial imagery where oriented tiny objects frequently appear, we aim to establish a solid foundation and open the possibility of understanding these challenging objects in a broader range of scenarios and modalities. Future research could also explore incorporating complementary information from different modalities or leveraging temporal data to enhance the detection of oriented tiny objects, further expanding and fulfilling practical applications. Second, the methodology in this paper operates under the closed-set setting, which requires full object annotations for the training set. However, oriented annotations for tiny objects are scarce and difficult to acquire, especially in scenarios under an open-world assumption. Meanwhile, experimental results have shown that label-efficient methods achieve very competitive performance compared with fully-supervised methods on oriented tiny object detection. Thus, it is worth further exploring the simplification of annotation requirements and the enhancement of tiny object detection performance with limited annotations. Third, foundation models are becoming a hot topic that facilitates various research directions, yet this work does not discuss or build upon them. How foundation models perform on oriented tiny objects, and how to pre-train or adapt them for this task, are also questions worth exploring in the future.

VIII. CONCLUSION

In this work, we systematically address the challenging task of detecting oriented tiny objects by establishing a new dataset and benchmark, and proposing a dynamic coarse-to-fine learning scheme aimed at scale-unbiased learning. Our dataset, AI-TOD-R, has the smallest mean object size among all oriented object detection datasets, and it presents additional challenges such as dense arrangement and class imbalance. Based on this dataset, we establish a benchmark and investigate the performance of various detection paradigms, uncovering two key insights. First, label-efficient detection methods now offer highly competitive performance on oriented tiny objects, showing great potential for further exploration. Second, biased prior settings and biased sample assignment across various detection pipelines significantly impede the detection performance of oriented tiny objects. To address these biases, we propose a dynamic coarse-to-fine learning (DCFL) scheme that is applicable to both one-stage and two-stage architectures. Extensive experiments on eight heterogeneous benchmarks verify that DCFL can significantly improve the detection accuracy of oriented tiny objects while maintaining high efficiency.

ACKNOWLEDGMENTS

We would like to thank Zijuan Chen, Xianhang Ye, Nuoyi Wang, Jinrui Zhang, Yuxin Li, Zheyan Xiao, Ziming Gui, Zhiwei Chen, Zijun Wu, and Huan Li for their voluntary annotation work. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62271355.

REFERENCES

[1] Y. Li, “Detecting lesion bounding ellipses with gaussian proposal networks,” in Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10. Springer, 2019, pp. 337–344.
[2] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021.
[3] J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y. Yang, S. Belongie, J. Luo, M. Datcu, M. Pelillo et al., “Object detection in aerial images: A large-scale benchmark and challenges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7778–7796, 2021.
[4] B. Zhao, P. Han, and X. Li, “Vehicle perception from satellite,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[5] N. Bhadwal, V. Madaan, P. Agrawal, A. Shukla, and A. Kakran, “Smart border surveillance system using wireless sensor network and computer vision,” in 2019 International Conference on Automation, Computational and Technology Management (ICACTM). IEEE, 2019, pp. 183–190.
[6] N. Zeng, P. Wu, Z. Wang, H. Li, W. Liu, and X. Liu, “A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–14, 2022.
[7] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023.
[8] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
[9] X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented r-cnn for object detection,” in IEEE International Conference on Computer Vision, 2021, pp. 3520–3529.
[10] G. Cheng, X. Yuan, X. Yao, K. Yan, Q. Zeng, X. Xie, and J. Han, “Towards large-scale small object detection: Survey and benchmarks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13467–13488, 2023.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[12] J. Wang, W. Yang, H. Guo, R. Zhang, and G.-S. Xia, “Tiny object detection in aerial images,” in International Conference on Pattern Recognition, 2021, pp. 3791–3798.
[13] C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Detecting tiny objects in aerial images: A normalized wasserstein distance and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 190, pp. 79–93, 2022.
[14] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, [34] W. Li, Y. Chen, K. Hu, and J. Zhu, “Oriented reppoints for aerial
M. Pelillo, and L. Zhang, “DOTA: A large-scale dataset for object object detection,” in IEEE Conference on Computer Vision and Pattern
detection in aerial images,” in IEEE Conference on Computer Vision Recognition, 2022, pp. 1829–1838.
and Pattern Recognition, 2018, pp. 3974–3983. [35] Y. Zeng, Y. Chen, X. Yang, Q. Li, and J. Yan, “Ars-detr: Aspect ratio-
[15] G. Cheng, J. Wang, K. Li, X. Xie, C. Lang, Y. Yao, and J. Han, sensitive detection transformer for aerial oriented object detection,”
“Anchor-free oriented proposal generator for object detection,” IEEE IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp.
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 1–15, 2024.
2022. [36] K. Kim and H. S. Lee, “Probabilistic anchor assignment with iou
[16] C. Xu, J. Ding, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, prediction for object detection,” in European Conference on Computer
“Dynamic coarse-to-fine learning for oriented tiny object detection,” in Vision. Springer, 2020, pp. 355–371.
IEEE Conference on Computer Vision and Pattern Recognition, June [37] Z. Ge, S. Liu, Z. Li, O. Yoshie, and J. Sun, “Ota: Optimal transport
2023, pp. 7318–7328. assignment for object detection,” in IEEE Conference on Computer
[17] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection Vision and Pattern Recognition, 2021, pp. 303–312.
benchmark,” in IEEE Conference on Computer Vision and Pattern [38] Y. Ma, S. Liu, Z. Li, and J. Sun, “Iqdet: Instance-wise quality distribu-
Recognition, 2016, pp. 5525–5533. tion sampling for object detection,” in Proceedings of the IEEE/CVF
[18] M. Braun, S. Krebs, F. Flohr, and D. M. Gavrila, “Eurocity persons: Conference on Computer Vision and Pattern Recognition, 2021, pp.
A novel benchmark for person detection in traffic scenes,” IEEE 1717–1725.
Transactions on Pattern Analysis and Machine Intelligence, vol. 41, [39] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap
no. 8, pp. 1844–1861, 2019. between anchor-based and anchor-free detection via adaptive training
[19] X. Yu, Y. Gong, N. Jiang, Q. Ye, and Z. Han, “Scale match for tiny sample selection,” in IEEE Conference on Computer Vision and Pattern
person detection,” in IEEE Workshops on Applications of Computer Recognition, 2020, pp. 9759–9768.
Vision, 2020, pp. 1257–1265. [40] Q. Ming, Z. Zhou, L. Miao, H. Zhang, and L. Li, “Dynamic anchor
[20] Z. Zhao, J. Du, C. Li, X. Fang, Y. Xiao, and J. Tang, “Dense tiny object learning for arbitrary-oriented object detection,” in AAAI Conference
detection: A scene context guided approach and a unified benchmark,” on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2355–2363.
IEEE Transactions on Geoscience and Remote Sensing, 2024. [41] L. Hou, K. Lu, J. Xue, and Y. Li, “Shape-adaptive selection and
[21] Z. Liu, H. Wang, L. Weng, and Y. Yang, “Ship rotated bounding box measurement for oriented object detection,” in AAAI Conference on
space for ship extraction from high-resolution optical satellite images Artificial Intelligence, 2022.
with complex backgrounds,” IEEE Geoscience and Remote Sensing [42] Z. Huang, W. Li, X.-G. Xia, and R. Tao, “A general gaussian heatmap
Letters, vol. 13, no. 8, pp. 1074–1078, 2016. label assignment for arbitrary-oriented object detection,” IEEE Trans-
actions on Image Processing, vol. 31, pp. 1895–1910, 2022.
[22] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Orientation
robust object detection in aerial images using deep convolutional neural [43] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set
network,” in IEEE International Conference on Image Processing, representation for object detection,” in IEEE International Conference
2015, pp. 3735–3739. on Computer Vision, 2019, pp. 9657–9666.
[44] C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Rfla: Gaussian
[23] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A
receptive field based label assignment for tiny object detection,” in
small target detection benchmark,” Journal of Visual Communication
European Conference on Computer Vision. Springer, 2022, pp. 526–
and Image Representation, vol. 34, pp. 187–203, 2016.
543.
[24] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li,
[45] X. Yuan, G. Cheng, K. Yan, Q. Zeng, and J. Han, “Small object
Y. Feng, T. Xu et al., “Fair1m: A benchmark dataset for fine-grained
detection via coarse-to-fine proposal generation and imitation learning,”
object recognition in high-resolution remote sensing imagery,” ISPRS
in IEEE International Conference on Computer Vision, 2023, pp. 6317–
Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–
6327.
130, 2022.
[46] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Perceptual
[25] T. Zhang, X. Zhang, J. Li, X. Xu, B. Wang, X. Zhan, Y. Xu, X. Ke, generative adversarial networks for small object detection,” in IEEE
T. Zeng, H. Su et al., “Sar ship detection dataset (ssdd): Official release Conference on Computer Vision and Pattern Recognition, 2017, pp.
and comprehensive data analysis,” Remote Sensing, vol. 13, no. 18, p. 1222–1230.
3690, 2021.
[47] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “Sod-mtgan: Small object
[26] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, detection via multi-task generative adversarial network,” in European
M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., Conference on Computer Vision. Springer, 2018, pp. 206–221.
“Icdar 2015 competition on robust reading,” in 2015 13th international [48] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, “Better to follow, follow
conference on document analysis and recognition (ICDAR). IEEE, to be better: Towards precise supervision of feature super-resolution for
2015, pp. 1156–1160. small object detection,” in IEEE International Conference on Computer
[27] E. Goldman, R. Herzig, A. Eisenschtat, J. Goldberger, and T. Hassner, Vision, 2019, pp. 9725–9734.
“Precise detection in densely packed scenes,” in Proceedings of the [49] L. Courtrai, M.-T. Pham, and S. Lefèvre, “Small object detection
IEEE/CVF Conference on Computer Vision and Pattern Recognition, in remote sensing images based on super-resolution with auxiliary
2019, pp. 5227–5236. generative adversarial networks,” Remote Sensing, vol. 12, no. 19, p.
[28] Z. Chen, J. Zhang, Z. Lai, G. Zhu, Z. Liu, J. Chen, and J. Li, “The devil 3152, 2020.
is in the crack orientation: A new perspective for crack detection,” in [50] S. M. A. Bashir and Y. Wang, “Small object detection in remote sensing
Proceedings of the IEEE/CVF International Conference on Computer images with residual feature aggregation-based super-resolution and
Vision, 2023, pp. 6653–6663. object detector network,” Remote Sensing, vol. 13, no. 9, p. 1854, 2021.
[29] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, [51] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, “Small-
“Arbitrary-oriented scene text detection via rotation proposals,” IEEE object detection in remote sensing images with end-to-end edge-
Transactions on Multimedia, vol. 20, no. 11, pp. 3111–3122, 2018. enhanced gan and object detector network,” Remote Sensing, vol. 12,
[30] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning roi no. 9, p. 1432, 2020.
transformer for detecting oriented objects in aerial images,” in IEEE [52] C. Xu, J. Wang, W. Yang, and L. Yu, “Dot distance for tiny object
Conference on Computer Vision and Pattern Recognition, 2019, pp. detection in aerial images,” in IEEE Conference on Computer Vision
2849–2858. and Pattern Recognition Workshops, 2021, pp. 1192–1201.
[31] X. Yang, Q. Liu, J. Yan, A. Li, Z. Zhang, and G. Yu, “R3det: [53] J. Wang, C. Xu, W. Yang, and L. Yu, “A normalized gaus-
Refined single-stage detector with feature refinement for rotating sian wasserstein distance for tiny object detection,” arXiv preprint
object,” CoRR, vol. abs/arXiv:1908.05612, 2019. [Online]. Available: arXiv:2110.13389, 2021.
https://arxiv.org/abs/1908.05612 [54] Z. Zhou and Y. Zhu, “Kldet: Detecting tiny objects in remote sensing
[32] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for ori- images via kullback-leibler divergence,” IEEE Transactions on Geo-
ented object detection,” IEEE Transactions on Geoscience and Remote science and Remote Sensing, 2024.
Sensing, vol. 60, pp. 1–11, 2021. [55] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and
[33] Z. Li, B. Hou, Z. Wu, L. Jiao, B. Ren, and C. Yang, “Fcosr: A simple S. Savarese, “Generalized intersection over union: A metric and a loss
anchor-free rotated detector for aerial object detection,” arXiv preprint for bounding box regression,” in IEEE Conference on Computer Vision
arXiv:2111.10780, 2021. and Pattern Recognition, 2019, pp. 658–666.
[56] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou IEEE Conference on Computer Vision and Pattern Recognition, 2021,
loss: Faster and better learning for bounding box regression,” in AAAI pp. 3060–3069.
Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 993– [78] T. Wang, T. Yang, J. Cao, and X. Zhang, “Co-mining: Self-supervised
13 000. learning for sparsely annotated object detection,” in AAAI Conference
[57] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, “Unitbox: An advanced on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2800–2808.
object detection network,” 2016, pp. 516–520. [79] B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, and J. Sun, “Au-
[58] X. Yang, G. Zhang, X. Yang, Y. Zhou, W. Wang, J. Tang, T. He, and toassign: Differentiable label assignment for dense object detection,”
J. Yan, “Detecting rotated objects as gaussian distributions and its 3-d arXiv preprint arXiv:2007.03496, 2020.
generalization,” IEEE Transactions on Pattern Analysis and Machine [80] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka,
Intelligence, vol. 45, no. 4, pp. 4335–4354, 2023. L. Li, Z. Yuan, C. Wang, and P. Luo, “Sparse r-cnn: End-to-end object
[59] Y. Yu and F. Da, “Phase-shifting coder: Predicting accurate orientation detection with learnable proposals,” in IEEE Conference on Computer
in oriented object detection,” in IEEE Conference on Computer Vision Vision and Pattern Recognition, 2021, pp. 14 454–14 463.
and Pattern Recognition, 2023, pp. 13 354–13 363. [81] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “De-
[60] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection formable convolutional networks,” in IEEE Conference on Computer
in optical remote sensing images: A survey and a new benchmark,” Vision and Pattern Recognition, 2017, pp. 764–773.
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. [82] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
296–307, 2020. S. Zagoruyko, “End-to-end object detection with transformers,” in
[61] D. Du, P. Zhu, L. Wen, and et al., “Visdrone-det2019: The vision European Conference on Computer Vision. Springer, 2020, pp. 213–
meets drone object detection in image challenge results,” in IEEE 229.
International Conference on Computer Vision Workshops, 2019, pp. [83] X. Yang, J. Yan, Q. Ming, W. Wang, X. Zhang, and Q. Tian, “Rethink-
213–226. ing rotated object detection with gaussian wasserstein distance loss,”
[62] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, in International Conference on Machine Learning, vol. 139, 2021, pp.
Y. Bulatov, and B. McCord, “xview: Objects in context in overhead 11 830–11 841.
imagery,” arXiv preprint arXiv:1802.07856, 2018. [84] M. Zand, A. Etemad, and M. Greenspan, “Objectbox: From centers
[63] X. Yang, G. Zhang, W. Li, Y. Zhou, X. Wang, and J. Yan, “H2rbox: to boxes for anchor-free object detection,” in European Conference on
Horizontal box annotation is all you need for oriented object detection,” Computer Vision, 2022, pp. 390–406.
in The Eleventh International Conference on Learning Representations, [85] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module
2022. for single-shot object detection,” in IEEE Conference on Computer
[64] Y. Yu, X. Yang, Q. Li, Y. Zhou, F. Da, and J. Yan, “H2rbox- Vision and Pattern Recognition, 2019, pp. 840–849.
v2: Incorporating symmetry for boosting horizontal box supervised [86] D. M. Endres and J. E. Schindelin, “A new metric for probability
oriented object detection,” Advances in Neural Information Processing distributions,” IEEE Transactions on Information Theory (TIT), vol. 49,
Systems, vol. 36, 2024. no. 7, pp. 1858–1860, 2003.
[65] W. Hua, D. Liang, J. Li, X. Liu, Z. Zou, X. Ye, and X. Bai, [87] F. Nielsen, “On a generalization of the jensen–shannon divergence and
“Sood: Towards semi-supervised oriented object detection,” in IEEE the jensen–shannon centroid,” Entropy, vol. 22, no. 2, p. 221, 2020.
Conference on Computer Vision and Pattern Recognition, 2023, pp. [88] J. Wang, W. Yang, H.-c. Li, H. Zhang, and G.-S. Xia, “Learning
15 558–15 567. center probability map for detecting objects in aerial images,” IEEE
[66] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for Transactions on Geoscience and Remote Sensing, vol. 59, no. 5, pp.
dense object detection,” in IEEE International Conference on Computer 4307–4323, 2021.
Vision, 2017, pp. 2980–2988. [89] G. Cheng, Y. Yao, S. Li, K. Li, X. Xie, J. Wang, X. Yao, and J. Han,
[67] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: A simple and strong “Dual-aligned oriented detector,” IEEE Transactions on Geoscience
anchor-free object detector,” IEEE Transactions on Pattern Analysis and Remote Sensing, vol. 60, pp. 1–11, 2022.
and Machine Intelligence, vol. 44, no. 4, pp. 1922–1933, 2022. [90] G. Nie and H. Huang, “Multi-oriented object detection in aerial images
[68] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- with double horizontal rectangles,” IEEE Transactions on Pattern
time object detection with region proposal networks,” in Advances in Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4932–4944, 2023.
Neural Information Processing Systems, 2015, pp. 91–99. [91] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in
[69] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable IEEE International Conference on Computer Vision, 2017, pp. 2961–
detr: Deformable transformers for end-to-end object detection,” in 2969.
International Conference on Learning Representations, 2021. [92] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng,
[70] X. Yang, X. Yang, J. Yang, Q. Ming, W. Wang, Q. Tian, and J. Yan, Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance
“Learning high-precision bounding box for rotated object detection via segmentation,” in IEEE Conference on Computer Vision and Pattern
kullback-leibler divergence,” Advances in Neural Information Process- Recognition, 2019, pp. 4974–4983.
ing Systems, vol. 34, pp. 18 381–18 394, 2021. [93] K. Chen, J. Wang, J. Pang, and et al., “MMDetection: Open mmlab
[71] X. Yang, Y. Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. ZHANG, detection toolbox and benchmark,” CoRR, vol. abs/arXiv:1906.07155,
and Q. Tian, “The kfiou loss for rotated object detection,” in The 2019. [Online]. Available: https://arxiv.org/abs/1906.07155
Eleventh International Conference on Learning Representations, 2022. [94] Y. Zhou, X. Yang, G. Zhang, J. Wang, Y. Liu, L. Hou, X. Jiang, X. Liu,
[72] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, J. Yan, C. Lyu et al., “Mmrotate: A rotated object detection benchmark
“Gliding vertex on the horizontal bounding box for multi-oriented using pytorch,” in Proceedings of the 30th ACM International Confer-
object detection,” IEEE Transactions on Pattern Analysis and Machine ence on Multimedia, 2022, pp. 7331–7334.
Intelligence, vol. 43, no. 4, pp. 1452–1459, 2021. [95] A. Paszke, S. Gross, F. Massa, A. Lerer et al., “Pytorch: An imperative
[73] Z. Guo, C. Liu, X. Zhang, J. Jiao, X. Ji, and Q. Ye, “Beyond bounding- style, high-performance deep learning library,” in Advances in Neural
box: Convex-hull feature adaptation for oriented and densely packed Information Processing Systems, 2019, pp. 8024–8035.
object detection,” in IEEE Conference on Computer Vision and Pattern [96] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Recognition, 2021, pp. 8792–8801. Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large
[74] J. Han, J. Ding, N. Xue, and G.-S. Xia, “Redet: A rotation-equivariant scale visual recognition challenge,” International Journal of Computer
detector for aerial object detection,” in IEEE Conference on Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
Vision and Pattern Recognition, 2021, pp. 2786–2795. [97] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
[75] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, recognition,” in IEEE Conference on Computer Vision and Pattern
“Swin transformer: Hierarchical vision transformer using shifted win- Recognition, 2016, pp. 770–778.
dows,” in Proceedings of the IEEE/CVF international conference on [98] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie,
computer vision, 2021, pp. 10 012–10 022. “Feature pyramid networks for object detection,” in IEEE Conference
[76] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi- [99] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-
supervised object detection,” in International Conference on Learning stage object detection,” in IEEE International Conference on Computer
Representations, 2021. [Online]. Available: https://openreview.net/ Vision, 2019, pp. 9627–9636.
forum?id=MJIve1zgR [100] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More
[77] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, deformable, better results,” in IEEE Conference on Computer Vision
“End-to-end semi-supervised object detection with soft teacher,” in and Pattern Recognition, 2019, pp. 9308–9316.
Chang Xu received his B.S. degree in electronic information engineering and his M.S. degree in information and communication systems, both from Wuhan University, Wuhan, China, in 2021 and 2024, respectively. This work was done during his master’s study at Wuhan University. He is currently pursuing his Ph.D. degree in the Environmental Computational Science and Earth Observation Laboratory, EPFL, Sion, Switzerland. His research focuses on object detection, visual geo-localization, and multi-modal learning.

Fang Xu received her B.S. degree in electronic and information engineering and her Ph.D. degree in communication and information system from Wuhan University, Wuhan, China, in 2018 and 2023, respectively. She is a postdoctoral researcher with the School of Computer Science, Wuhan University, China. Her research involves remote sensing image processing, including multi-modal data matching and fusion.