systemically explores the oriented tiny object detection problem from the perspectives of dataset, benchmark, and method. A visual summary of this work is presented in Figure 1.

To further push the boundaries of oriented object detection for extremely tiny objects, we contribute a new dataset dedicated to oriented tiny object detection, named AI-TOD-R (Section III). With a mean object size of only 10.62 pixels, AI-TOD-R has the smallest object size among oriented object detection datasets. This challenging dataset is established through a semi-automatic annotation process that supplements AI-TOD-v2 [13] with orientation information while ensuring high annotation quality. We then benchmark diverse object detection paradigms on AI-TOD-R to investigate how different detection paradigms perform on oriented tiny objects (Section IV). What distinguishes this benchmark from prior art is that we go beyond the fully-supervised paradigm and study both supervised and label-efficient methods, catering to broader and more practical applications. Our findings reveal that generic object detectors tend to exhibit abnormal results when confronted with extremely tiny objects. Notably, the learning bias appears invariably across various methods: the objects' tiny size and low confidence make them easily suppressed or ignored during model training. The vanilla optimization process inevitably traps them in significantly biased prior settings and biased sample learning dilemmas, severely impeding the performance of oriented tiny object detection (Section IV-D). To address this issue, we propose a new approach, Dynamic Coarse-to-Fine Learning (DCFL), aimed at providing unbiased prior settings and sample supervision for oriented tiny objects (Section V). On the one hand, we reformulate the static prior into an adaptively updating prior, thereby guiding more prior positions towards the main area of tiny objects. On the other hand, dynamic coarse-to-fine sample learning separates label assignment into two steps: the coarse step offers diverse positive sample candidates for objects of various sizes and orientations, and the fine step warrants the high quality of positive samples for predictions.

We perform experiments on eight heterogeneous benchmarks, including tiny/small oriented object detection (AI-TOD-R, SODA-A [10]), oriented object detection with large numbers of tiny objects (DOTA-v1.5 [3], DOTA-v2 [3]), multi-scale oriented object detection (DOTA-v1 [14], DIOR-R [15]), and horizontal object detection (VisDrone [2], MS COCO [11]). Our results demonstrate that DCFL remarkably outperforms existing methods for detecting tiny objects (Section VI). Moreover, our results highlight three characteristics of DCFL. (1) Costless improvement: extensive experiments on various datasets show that DCFL improves detection performance without adding any parameters or computational overhead during inference. (2) Versatility: DCFL can be plugged into both one-stage and two-stage detection pipelines and improves their performance on oriented tiny objects; beyond oriented tiny objects, it also enhances the detection of generic small objects. (3) Unbiased learning: by dissecting the training process, we reveal how DCFL achieves unbiased learning—adaptively updating priors to better align with tiny objects' main areas, while balancing the quantity and quality of samples across different scales.

Aiming to address the challenging task of oriented tiny object detection, this paper provides a comprehensive extension of our previous conference version [16]. Beyond the methodological contributions published previously, this journal extension introduces several additional advancements:
• Establishing a task-specific dataset for oriented tiny object detection, which features the smallest object size among oriented object detection datasets, compensating for the lack of resources in this challenging area.
• Creating a benchmark that covers a variety of object detection paradigms, including both fully-supervised and label-efficient methods, revealing learning biases against oriented tiny objects across these approaches.
• Demonstrating the versatility of DCFL by plugging it into both one-stage and two-stage methods, and verifying its generalization ability on small oriented object detection by validating it on the SODA-A dataset.

II. RELATED WORKS

A. Small and Oriented Object Detection Datasets

Small and tiny object detection datasets. Due to the lack of specialized datasets, early studies on Small Object Detection (SOD) were mainly based on small objects in generic or task-specific datasets. For example, the generic object detection dataset MS COCO [11], the face detection dataset WiderFace [17], the pedestrian detection dataset EuroCity Persons [18], and the drone-view dataset VisDrone [2] all contain a considerable number of small objects that can assist related studies. As SOD performance has been struggling for a long time, the establishment of specialized datasets for SOD is receiving growing attention. TinyPerson [19] is the first dataset designed for tiny-scale person detection. AI-TOD [12], [13] is the first multi-category dataset for tiny object detection. DTOD [20] compounds the challenge by addressing not only the tiny size of objects but also their dense packing. Recently, the introduction of the first large-scale SOD dataset, SODA [10], along with its benchmark further highlights the necessity of targeted research on SOD.

Oriented object detection datasets. Oriented object detection is an important direction of visual detection, since orientation information significantly reduces the background region in bounding boxes with minimal additional parameters. The multi-scale datasets DOTA-v1/1.5/2 [3], [14] and DIOR-R [15] are widely adopted for performance benchmarking, where DOTA-v2 is also characterized by its large number of small objects. In addition to these generic datasets, task-specific datasets have been introduced to dissect targeted problems. For example, some datasets are established to study specific classes (e.g., HRSC2016 [21], UCAS-AOD [22], VEDAI [23]), some are designed for fine-grained object detection (e.g., FAIR1M [24]), and some are proposed for specific modalities (e.g., SSDD [25]). Meanwhile, there are also datasets designed for other scenarios sensitive to the object's orientation, including text [26], retail [27], and crack detection [28].
Fig. 2. Statistical analysis of AI-TOD-R. From left to right: (a) object size distribution, (b) object angle distribution, (c) object number per image distribution, and (d) class size distribution. The box plot of "Class Size Distribution" shows the mean and standard deviation of the absolute object size within each class.
Fig. 4. Visualization of annotations in AI-TOD-R. Compared to AI-TOD-v2, using oriented bounding boxes to represent tiny objects significantly reduces background noise, and this advantage is particularly obvious in densely arranged scenes. In addition to the extremely tiny object size, AI-TOD-R introduces further challenges such as dense arrangement, weak feature representation, and an imbalanced class distribution.
over 100 objects. The vast number of tiny objects in each image significantly increases the computational burden during training and inference, giving rise to the need for efficient detector designs that facilitate practical applications.

Imbalanced class distribution. Like many generic object detection or oriented object detection datasets, AI-TOD-R also exhibits a class imbalance challenge. This imbalance is reflected in both the object number² and the object size distribution (Figure 2(d)) of each class. It depicts the real-world class distribution and calls for robust oriented object detectors capable of class-balanced detection performance.

² Class counts: airplane (1,667), bridge (1,541), storage-tank (13,771), ship (35,813), swimming-pool (1,617), vehicle (662,929), person (34,490), wind-mill (632).

C. Label Visualization

Figure 4 showcases typical samples from the AI-TOD-R dataset. These samples exhibit the characteristics of the dataset, including extremely tiny object scale, arbitrary orientation, dense arrangement, and complex scenes. In particular, the visualized annotations reveal the unique advantages of representing tiny objects with oriented bounding boxes. Oriented bounding boxes allow the annotation boxes to more tightly enclose the object's main area. This advantage is particularly evident in densely packed regions, where rotated bounding boxes significantly reduce the overlap between adjacent object boundaries, preventing confusion during the network's learning and prediction in such areas. In addition, oriented bounding boxes capture the orientation information of moving objects, such as vehicles in motion or ships at sea, providing richer information for downstream applications.

IV. AI-TOD-R BENCHMARK

In this section, we present a comprehensive benchmark for AI-TOD-R, encompassing fully-supervised oriented object detection methods as well as label-efficient methods, namely semi-supervised object detection (SSOD), sparsely annotated object detection (SAOD), and weakly-supervised object detection (WSOD).

A. Implementation Details

For fully-supervised methods, experiments on AI-TOD-R are performed following the default settings of the AI-TOD series [12], [13]. We use AI-TOD-R's trainval set for training and its test set for evaluation, and keep the image size of 800×800 for training and testing. The batch size and learning rate are set to 2 and 0.0025, respectively. We only use random flipping as data augmentation for all experiments.

For label-efficient methods, we reorganize the training labels and schedules to adapt to the different paradigms. Semi-Supervised Object Detection (SSOD) methods randomly retain the annotations of 10%, 20%, or 30% of the images from the AI-TOD-R trainval set as training annotations. We follow the default settings of SOOD [65] with a batch size of 6 (with a 1:2 ratio of unlabeled to labeled data) and a learning rate of 0.0025. Additionally, we maintain the same total number of batch size × iterations as the fully-supervised 40-epoch setup. Sparsely Annotated Object Detection (SAOD) randomly retains 10%, 20%, or 30% of the annotations of all objects from the AI-TOD-R trainval set as training labels. We use a batch size of 2 and a learning rate of 0.0025, and maintain the same total number of batch size × iterations as the fully-supervised 40-epoch setup. Besides, Weakly Supervised Object Detection (WSOD) mainly switches the trainval set's annotations from OBB to HBB and keeps the other settings the same as the fully-supervised setting.
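To make the three label reorganizations concrete, the sketch below generates the corresponding training labels from COCO-style annotation dictionaries. The field names ("id", "image_id", "rbbox") and the helper names are hypothetical assumptions for illustration; they are not the authors' tooling.

```python
import math
import random

def ssod_split(images, annotations, labeled_ratio=0.1, seed=0):
    """SSOD setting: keep the annotations of a random subset of images and
    treat the remaining images as unlabeled."""
    rng = random.Random(seed)
    labeled_ids = set(rng.sample([img["id"] for img in images],
                                 int(labeled_ratio * len(images))))
    labeled = [a for a in annotations if a["image_id"] in labeled_ids]
    unlabeled_imgs = [img for img in images if img["id"] not in labeled_ids]
    return labeled, unlabeled_imgs

def saod_split(annotations, labeled_ratio=0.1, seed=0):
    """SAOD setting: keep a random subset of the object annotations drawn
    across the whole training set."""
    rng = random.Random(seed)
    return rng.sample(annotations, int(labeled_ratio * len(annotations)))

def wsod_labels(annotations):
    """WSOD setting: replace each oriented box (cx, cy, w, h, theta) by its
    axis-aligned bounding rectangle so that only HBB supervision remains."""
    hbb = []
    for a in annotations:
        cx, cy, w, h, t = a["rbbox"]
        dx = 0.5 * (abs(w * math.cos(t)) + abs(h * math.sin(t)))
        dy = 0.5 * (abs(w * math.sin(t)) + abs(h * math.cos(t)))
        hbb.append({**a, "bbox": [cx - dx, cy - dy, 2 * dx, 2 * dy]})
    return hbb
```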
TABLE II
MAIN RESULTS OF FULLY-SUPERVISED METHODS ON AI-TOD-R. FOR THE TRAINING SCHEDULE, 1× DENOTES 12 EPOCHS AND 40E DENOTES 40 EPOCHS. METHODS WITH "-O" ARE THE ROTATED VERSIONS OF BASE DETECTORS, AND THE NAME IN "()" DENOTES THE BASELINE METHOD.
ID Method Backbone Schedule AP AP0.5 AP0.75 APvt APt APs APm #Params.
## Architecture:
#1 RetinaNet-O [66] ResNet-50 1× 7.3 23.9 1.8 2.2 5.9 11.1 15.4 36.3M
#2 FCOS-O [67] ResNet-50 1× 11.0 33.6 3.7 3.0 8.9 15.7 22.0 31.9M
#3 Faster R-CNN-O [68] ResNet-50 1× 10.2 30.8 3.6 0.6 7.8 19.0 22.9 41.1M
#4 RoI Transformer [30] ResNet-50 1× 10.5 34.0 2.2 1.1 8.8 16.9 20.3 55.1M
#5 Oriented R-CNN [9] ResNet-50 1× 11.2 33.2 4.3 0.6 9.1 19.5 23.2 41.1M
#6 Deformable DETR-O [69] ResNet-50 1× 8.4 26.7 2.0 4.8 9.3 8.6 7.3 40.8M
#7 ARS-DETR [35] ResNet-50 1× 14.3 41.1 5.8 6.3 14.5 17.6 18.7 41.1M
## Representation:
#8 KLD (RetinaNet-O) [70] ResNet-50 1× 7.8 24.8 2.3 3.1 6.7 10.3 15.8 36.3M
#9 KFIoU (RetinaNet-O) [71] ResNet-50 1× 8.1 25.2 2.8 2.0 6.6 12.3 17.1 36.3M
#10 Oriented RepPoints [34] ResNet-50 1× 13.0 40.3 4.2 5.2 12.2 16.8 21.4 36.6M
#11 PSC (RetinaNet-O) [59] ResNet-50 1× 4.5 15.8 1.2 1.0 3.7 8.2 12.7 36.4M
#12 Gliding Vertex [72] ResNet-50 1× 8.1 27.4 2.1 0.9 6.7 14.7 17.9 41.1M
## Refinement:
#13 R3 Det [31] ResNet-50 1× 8.1 25.8 2.2 1.9 7.3 12.0 16.8 41.7M
#14 S2 A-Net [32] ResNet-50 1× 10.8 33.4 3.3 4.3 11.2 13.0 16.0 38.6M
## Assignment:
#15 ATSS-O (RetinaNet-O) [39] ResNet-50 1× 10.9 33.8 3.1 2.7 8.9 15.5 19.4 36.0M
#16 SASM (RepPoints-O) [41] ResNet-50 1× 11.4 35.0 3.7 3.6 10.2 15.4 19.8 36.6M
#17 CFA [73] ResNet-50 1× 12.4 38.7 4.0 5.0 11.9 16.5 18.8 36.6M
## Backbone:
#18 Oriented R-CNN ResNet-101 1× 11.2 33.0 4.1 0.5 8.9 19.8 24.4 60.1M
#19 Oriented R-CNN Swin-T 1× 12.0 34.6 4.6 0.7 9.9 20.8 25.3 44.8M
#20 Oriented R-CNN LSKNet-T 1× 11.1 33.4 3.8 0.6 9.2 18.9 22.6 21.0M
#21 ReDet [74] ReResNet-50 1× 11.6 32.8 4.8 1.4 9.5 19.4 23.2 31.6M
#22 DCFL (RetinaNet-O) ResNet-50 1× 12.3 (+5.0) 36.7 (+12.8) 4.5 (+2.7) 4.3 10.7 17.2 22.2 36.1M
#23 DCFL (RetinaNet-O) ResNet-50 40e 15.2 (+7.9) 44.9 (+21.0) 5.1 (+3.3) 4.9 13.1 19.7 25.9 36.1M
#24 DCFL (Oriented R-CNN) ResNet-50 1× 15.7 (+4.5) 47.0 (+13.8) 5.8 (+1.5) 6.3 14.8 19.6 22.5 41.1M
#25 DCFL (Oriented R-CNN) ResNet-50 40e 17.1 (+5.9) 49.0 (+15.8) 7.2 (+2.9) 6.4 16.0 21.6 24.9 41.1M
#26 DCFL (S2 A-Net) ResNet-50 1× 13.7 (+2.9) 39.7 (+6.3) 5.3 (+2.0) 4.7 12.4 18.6 22.6 38.6M
#27 DCFL (S2 A-Net) ResNet-50 40e 17.5 (+6.7) 49.6 (+16.2) 7.9 (+4.6) 6.5 15.7 22.6 27.4 38.6M
All other settings are retained from the respective baseline methods unless otherwise specified.

B. Results of Fully-supervised Methods

In Table II, we benchmark the detection performance on oriented tiny objects across a wide range of oriented object detectors. To better compare and analyze the characteristics of the various detection paradigms on the oriented tiny object detection task, we introduce them in a classified manner.

Basic architecture. Based on the prior setting and the number of stages, oriented object detection architectures can be separated into dense [66], [67] (#1, 2), dense-to-sparse [9], [30], [68] (#3, 4, 5), and sparse [35], [69] (#6, 7) paradigms. The dense paradigm usually refers to one-stage methods that yield dense predictions per feature point; dense-to-sparse methods use a first stage to generate sparse proposals (e.g., RPN) and refine them into final predictions in a second stage (e.g., R-CNN); the sparse paradigm is mainly based on Transformers that reason about the object's class and location with a set of sparse queries. Within the dense paradigm, FCOS-O relaxes the IoU-constrained assignment by labeling gt-covered points as positive samples, performing better than the anchor-based dense method. Benefiting from an FPN with higher resolution (P2) and feature-interpolated RoI Align, dense-to-sparse methods perform slightly better than dense methods, at the cost of higher computational demand. Compared to other paradigms, the state-of-the-art sparse method (#7) performs favorably on oriented tiny objects, mainly owing to its training strategies adapted from advanced generic detectors and its rotated deformable attention optimized for arbitrarily oriented objects.

Box representation and loss design. The vanilla regression-based loss suffers from issues including inconsistency with the evaluation metrics, boundary discontinuity, and square-like problems, giving rise to numerous box representation studies. Oriented tiny object detection also benefits from these improved representations and their induced loss functions. The Gaussian-based losses [70], [71] (#8, 9) eradicate the boundary discontinuity issue and enforce alignment between the optimization goal and the evaluation metric, improving AP by about 1 point over the RetinaNet-O baseline. Notably, the point-set-based methods [34], [41], [73] (#10, 16, 17) are particularly effective for detecting oriented tiny objects, which may be attributed to the robustness of the deformable point representation to extreme geometric characteristics.

Sample selection strategies. The quality of positive sample selection directly affects the supervision information in the training process, playing a crucial role in tiny object detection.
TABLE III
MAIN RESULTS OF LABEL-EFFICIENT METHODS ON AI-TOD-R. EVALUATIONS ARE PERFORMED ON THE test set OF AI-TOD-R BY TRAINING UNDER DIFFERENT RATIOS OF ORIENTED BOUNDING BOX (OBB) ANNOTATIONS OR HORIZONTAL BOUNDING BOX (HBB) ANNOTATIONS FROM ITS trainval set. SSOD, SAOD, AND WSOD DENOTE SEMI-SUPERVISED OBJECT DETECTION, SPARSELY ANNOTATED OBJECT DETECTION, AND WEAKLY SUPERVISED OBJECT DETECTION, RESPECTIVELY.
By adaptively determining the positive anchor threshold for each gt, ATSS-O lifts the RetinaNet-O baseline by 3.6 points. By dynamically assessing sample quality based on the object's arrangement and shape information, SASM and CFA yield promising performances of 11.4% and 12.4%, respectively. The significant improvements brought by these sample selection strategies (#15-17) further highlight the importance of customized sample assignment methods for oriented tiny objects.

Backbone choice. We analyze the effects of various backbones on oriented tiny object detection by investigating deeper architectures, vision transformers, large convolution kernels, and rotation equivariance. Different from generic object detection, oriented tiny object detection does not benefit from a deeper backbone architecture (#18 vs. #5) or large convolution kernels (#20 vs. #5): these improved backbones retain similar AP to the basic ResNet-50. This interesting phenomenon can be largely attributed to the limited and local information representation of tiny objects. After multiple rounds of down-sampling in the deeper layers of the network, the limited information of tiny objects is further lost. Besides, the large receptive field of large convolution kernels struggles to fit or converge to the extremely tiny region of interest. By contrast, the shifted-window transformer (Swin Transformer [75]) and rotation-equivariant feature extraction (ReDet [74]) can still benefit oriented tiny objects in our experiments.

C. Results of Label-efficient Methods

Label-efficient object detection aims at reducing the annotation cost (e.g., quantity, difficulty) while matching or even surpassing the performance of fully-supervised methods. Label-efficient approaches are in great demand and show great potential on oriented tiny objects, since their annotation process is laborious and difficult. Herein, we investigate three dominant label-efficient paradigms, whose results on the AI-TOD-R dataset are listed in Table III.

Semi-Supervised Object Detection (SSOD). SSOD relieves the annotation burden by leveraging the precious annotated images together with massive unlabelled images to train object detectors efficiently. Current state-of-the-art approaches [65], [76], [77] employ a teacher-student network architecture in a pseudo-labelling fashion. Surprisingly, using only 30% labelled images and the remaining unlabelled images, the state-of-the-art SSOD approach SOOD [65] (40 epochs) has already achieved competitive performance with fully-supervised single-stage counterparts (i.e., FCOS-O with 1×) trained on full-set annotations. This uncovers the great potential and application value of SSOD methods for tiny-scale oriented objects.

Sparsely Annotated Object Detection (SAOD). SAOD approaches randomly annotate a proportion of objects throughout the whole training set for label-efficient learning. We adapt a classic SAOD method, Co-mining [78], to oriented tiny object detection. Despite using the same number of annotated objects, SSOD methods outperform the SAOD method tested. This performance gap may be attributed to the fact that Co-mining does not utilize an advanced teacher-student network, thereby limiting its effectiveness.

Weakly-Supervised Object Detection (WSOD). Another popular direction of label-efficient object detection uses coarse-level annotations, which are more easily accessible, for fine-level predictions. Among them, a dominant line of research uses horizontal bounding box supervision for oriented box prediction (e.g., H2RBox [63]). With advanced training strategies, experiments reveal that merely using HBB supervision achieves performance comparable to OBB-supervised single-stage baselines (e.g., FCOS-O).

In short, label-efficient methods have demonstrated excellent performance on the task of oriented tiny object detection. Trained with far fewer annotations, SSOD and WSOD methods show very competitive performance compared to one-stage fully-supervised baselines. These findings demonstrate the significant application value of label-efficient methods in the field of oriented tiny object detection and the potential for further exploration.

D. Uncovering Learning Bias

Despite the differences in detection paradigms, one consistent finding is that the detection performance on oriented tiny objects remains significantly inferior to that on regular-sized objects. To gain a clearer understanding of the underlying reasons for this performance gap, we conduct a statistical analysis from the perspective that directly drives the model's training: the sample learning process of objects across various scales (i.e., the input samples and the output predictions).

Specifically, we investigate the prior matching degree (input) and the posterior confidence scores (output) for different-sized objects during training.
Fig. 5. An illustration of the sample learning bias. SOOD [65] is trained with 10% labels under the semi-supervised object detection pipeline.

The results, presented in Figure 5, show the prior sample selection results across different detectors, obtained by counting the number of positive samples assigned to objects of varying scales (upper line charts), and the model's posterior confidence scores for different-sized objects (lower bar chart).

Our analysis in Figure 5 shows that oriented tiny objects face a biased dilemma across various detection pipelines. At the prior level, tiny-scale objects receive significantly fewer positive samples than larger-scale objects. This phenomenon can largely be attributed to the limited feature map resolution, sub-optimal similarity measurements, and label assignment strategies. Specifically, the stride between adjacent feature points, and thus between their corresponding prior locations (e.g., anchor boxes/points), is constrained by the feature map resolution. For example, the stride of prior locations in a typical single-stage detector is at least 8 pixels. This sparse and fixed prior setting fundamentally limits the number of sample candidates for tiny objects compared to larger ones, leading to a biased prior setting. Furthermore, oriented tiny objects often have a lower similarity with the sparse prior boxes (e.g., RetinaNet-O [66]) or cover very few prior points (e.g., FCOS-O [67]), which exacerbates the problem. Under generic sample selection strategies (e.g., MaxIoU, Center Sampling), the number of positive samples ultimately assigned to tiny objects is further reduced, leading to a serious sample bias problem.

This learning bias against oriented tiny objects is also reflected in the high uncertainty of their posterior predictions, as shown in the lower part of Figure 5. The low confidence scores can further exacerbate the learning bias against oriented tiny objects. In supervised learning, some methods select or re-weight confident samples [36], [37], [79], which further weakens oriented tiny objects due to their high uncertainty. In label-efficient learning, thresholds based on predicted scores are used to select pseudo-labels, and this size-induced bias is amplified in the process: regular objects, having higher posterior confidence scores, are more likely to be selected as pseudo-labels for training, whereas tiny objects tend to be filtered out.

V. METHOD

In this section, we first provide a paradigmatic comparison of our method with prior arts. Following this, we describe the details of the core components (i.e., Dynamic Prior, Coarse Prior Matching, and Finer Posterior Matching) of our proposed DCFL. Figure 6 shows an overview of the proposed method.

A. Pipeline Overview

Static prior → Dynamic prior. Oriented object detection is nowadays predominantly solved with dense one-stage detectors (e.g., RetinaNet-O) or dense-to-sparse two-stage detectors (e.g., Oriented R-CNN) [80]. Despite their different architectures, their detection processes all start from a set of dense priors P ∈ R^{W×H×C} (W×H: the size of the feature map; C: the number of prior parameters per feature point) and remap this set into final detection results D through a Deep Neural Network (DNN), which can be simplified as:

D = DNN_d(P),   (1)

where DNN_d is composed of the backbone and the detection head. The detection results D can be separated into two parts: classification scores D_cls ∈ R^{W×H×A} (A denotes the number of classes) and box locations D_reg ∈ R^{W×H×B} (B is the number of box parameters).

This static prior modeling suffers from a significant prior bias for tiny objects: the prior position mostly deviates from the objects' main body (Section I). To accommodate the extreme sizes and arbitrary geometries of these tiny objects, we incorporate an iterative updating process for the prior position and refine it dynamically in each iteration. This transforms the prior into a dynamic set P̃ (˜ denotes a dynamic item), leading to a reformulated detection process:

D = DNN_d(DNN_p(P)),   (2)

where DNN_p(P) is the dynamic prior P̃, and DNN_p is a learnable block incorporated within the detection pipeline to update the prior.
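To make the reformulation in Eqs. (1)–(2) concrete, the following minimal PyTorch sketch shows a static prior grid and a small learnable block standing in for DNN_p that shifts those priors with predicted offsets. It is an illustration under simplifying assumptions (a single feature level and one offset vector per location), not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PriorUpdate(nn.Module):
    """A minimal stand-in for DNN_p in Eq. (2): predicts per-location offsets
    from the feature map and shifts the prior positions accordingly."""
    def __init__(self, channels, stride):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        self.stride = stride

    def forward(self, feat, prior_xy):
        # prior_xy: (H*W, 2) static grid locations remapped to the image.
        delta = self.offset(feat)                              # (B, 2, H, W)
        delta = delta.permute(0, 2, 3, 1).reshape(feat.size(0), -1, 2)
        return prior_xy.unsqueeze(0) + self.stride * delta     # dynamic priors

def make_static_grid(h, w, stride, device="cpu"):
    """The static prior P of Eq. (1): one location per feature point."""
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    return torch.stack([(xs + 0.5) * stride,
                        (ys + 0.5) * stride], dim=-1).reshape(-1, 2)
```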
Static sample learning → Dynamic coarse-to-fine sample learning. To train DNN_d, a proper matching between the prior set P and the gt set GT needs to be solved to assign positive/negative labels to P and supervise the network learning. Existing assignment strategies can be classified into static and dynamic strategies. For static assignment (e.g., RetinaNet [66]), the set of positive labels G is obtained via a hand-crafted matching function M_s, and the set for a specific image remains the same in every epoch, which is formulated as:

G = M_s(P, GT),   (3)

while dynamic assignment approaches [36], [37], [40] tend to leverage the prior information P along with the posterior information (predictions) D for dynamic sample selection, applying a prediction-aware mapping M_d to obtain the set G:

G = M_d(P, D, GT).   (4)

After the positive/negative label separation, the loss function can be summarized into two parts:

L = \sum_{i=1}^{N_{pos}} L_{pos}(D_i, G_i) + \sum_{j=1}^{N_{neg}} L_{neg}(D_j, y_j),   (5)

where N_{pos} and N_{neg} are the numbers of positive and negative samples, respectively, and y_j denotes the negative label.

Whether dynamic or static, oriented tiny objects face a sample bias dilemma under existing label assignment methods: these strategies typically sample and weight high-scoring samples (i.e., prior locations) as positives, while both the prior and posterior scores of tiny objects are extremely low, so their effective samples are wrongly labeled as outlier negative samples.

Towards unbiased sample learning, we reformulate this process into a dynamic coarse-to-fine learning pipeline based on the dynamic priors. The coarse step works in an object-centric way, where we construct a coarse positive candidate bag to warrant sufficient and diverse positive samples for each object. The fine step aims at guaranteeing the learning quality, where we fit each gt with a Dynamic Gaussian Mixture Model (DGMM) as a constraint to select high-quality samples. Thus, the assignment process can be expressed as:

G̃ = M_d(M_s(P̃, GT), \widetilde{GT}),   (6)

where \widetilde{GT} is a finer representation of an object with the DGMM. In a nutshell, our final loss is modeled as:

L = \sum_{i=1}^{\tilde{N}_{pos}} L_{pos}(D̃_i, G̃_i) + \sum_{j=1}^{\tilde{N}_{neg}} L_{neg}(D̃_j, y_j).   (7)

B. Dynamic Prior

We introduce a dynamic updating mechanism, named the Prior Capturing Block (PCB), that can benefit both the dense and the dense-to-sparse oriented object detection paradigms. Seamlessly embedded into the original detection head, the PCB generates prior positions that are better aligned with the main body and geometry of tiny objects, increasing the number of high-quality sample candidates for these objects and mitigating the biased prior configuration.

The structure of the proposed PCB is illustrated in Figure 6. In this design, a dilated convolution is deployed to incorporate the object's surrounding context information, followed by offset prediction [81] to capture dynamic prior positions. Besides, the learned offsets from the regression branch are used to guide feature extraction in the classification branch, leading to better alignment between the two tasks. As such, the PCB inherits the flexibility of the learnable priors in query-based detectors (e.g., DETR [82]) while retaining the explicit physical meaning of the static priors in dense detectors (e.g., RetinaNet-O).

The dynamic prior capturing process unfolds as follows. At initialization, each prior location p(x, y) is set to the spatial location s of the corresponding feature point, remapped to the image. In each iteration, we forward the network to obtain the offset set Δo of each prior location. The prior's location is then updated by:

s̃ = s + s_t \sum_{i=1}^{n} Δo_i / (2n),   (8)

where s_t represents the stride of the feature map and n is the number of offset vectors per location.

As a model-agnostic approach, the dynamic prior can be adapted to both one-stage and two-stage methods. More specifically, we use a 2-D Gaussian distribution N_p(μ_p, Σ_p), which has proven conducive to small objects [44], [83] and oriented objects [70], [83], to model the prior's spatial location. Each dynamic prior location s̃ serves as the Gaussian mean vector μ_p, and each prior is associated with a square-shaped prior (w, h, θ) as in its baseline detector; this shape information forms the covariance matrix Σ_p [58]:

Σ_p = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.   (9)

C. Dynamic Coarse-to-Fine Learning

Without specialized consideration of tiny-scale objects, previous sample assignment strategies are biased towards sampling large-object samples, which usually hold higher confidence, while discarding tiny-scale oriented objects as background. Towards scale-unbiased optimization, we design a dynamic coarse-to-fine learning pipeline, where the coarse step offers sample diversity and the fine step warrants learning quality.

Coarse prior matching for sample diversity. In the coarse step, we introduce an object-specific sample screening approach to offer sufficient and diverse positive sample candidates for each object. Specifically, we construct a set of Coarse Positive Sample (CPS) candidates for each object, in which we consider prior locations from diverse spatial locations and FPN hierarchies as candidates for a specific gt. Unlike sampling from a single FPN layer or from all FPN layers [84], [85], we slightly expand the range of candidates to the gt's nearby spatial locations and adjacent FPN layers, which warrants relatively diverse and sufficient candidates compared to the single-layer heuristic while narrowing down the search area compared to all-layer candidates, alleviating the tiny objects' lack of positive sample candidates.
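The following is a small sketch of the Gaussian modelling in Eq. (9) and the prior-location update in Eq. (8), assuming angles in radians and batched tensors; it is illustrative rather than the authors' implementation.

```python
import torch

def rbox_to_gaussian(xy, wh, theta):
    """Eq. (9): represent a rotated prior (or a gt box) as a 2-D Gaussian
    N(mu, Sigma).  xy: (N, 2), wh: (N, 2), theta: (N,) in radians."""
    cos, sin = torch.cos(theta), torch.sin(theta)
    R = torch.stack([torch.stack([cos, -sin], -1),
                     torch.stack([sin,  cos], -1)], -2)   # (N, 2, 2) rotation
    S = torch.diag_embed(wh ** 2 / 4.0)                   # diag(w^2/4, h^2/4)
    return xy, R @ S @ R.transpose(-1, -2)                # mu, Sigma

def update_prior(s, offsets, stride):
    """Eq. (8): move each prior location by the mean of its n predicted offset
    vectors, scaled by the feature-map stride and the factor 1/2.
    s: (N, 2); offsets: (N, n, 2)."""
    return s + stride * offsets.mean(dim=1) / 2.0
```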
Fig. 6. An overview of the proposed method. The proposed DCFL learning scheme can be adapted into both one-stage and two-stage detection pipelines
for oriented tiny object detection. Left: Feature extraction process and the prior updating process of the PCB. Right: The schematic diagram of the dynamic
coarse-to-fine sample learning.
In this step, we also model the gt as a 2-D Gaussian N_g(μ_g, Σ_g) in the same way, to assist sample selection. The similarity measure used to construct the CPS is the Jensen-Shannon Divergence (JSD) [86] between the anchor and the gt. JSD inherits the scale-invariance property of the Kullback–Leibler Divergence (KLD) [70] and can measure the similarity between the gt and nearby non-overlapping priors [44], [70]. Moreover, it overcomes KLD's drawback of asymmetry. However, a closed-form solution of the JSD between Gaussian distributions is unavailable [87]. Thus, we use the Generalized Jensen-Shannon Divergence (GJSD) [87], which yields a closed-form solution, as a substitute. The GJSD between two Gaussian distributions N_p(μ_p, Σ_p) and N_g(μ_g, Σ_g) is defined as:

GJSD(N_p, N_g) = (1 − α) KL(N_α, N_p) + α KL(N_α, N_g),   (10)

where KL denotes the KLD, and N_α(μ_α, Σ_α) is given by:

Σ_α = \left((1 − α)Σ_p^{-1} + αΣ_g^{-1}\right)^{-1},   (11)

and

μ_α = Σ_α\left((1 − α)Σ_p^{-1}μ_p + αΣ_g^{-1}μ_g\right),   (12)

where α is a parameter that controls the weighting of the two distributions [87] in the similarity measurement. In our case, N_p and N_g contribute equally, so α is set to 0.5.

Ultimately, for each gt, we select the K priors with the top-K GJSD scores as the Coarse Positive Samples (CPS) and label the remaining priors as negative samples. This coarse matching serves as M_s in Equation 6. GJSD can effectively measure the similarity between a specific gt and samples across FPN layers. Consequently, we extend the CPS to cover both the object's adjacent region and neighboring hierarchies by selecting a relatively large number of sample candidates.

Finer posterior matching enhances sample quality. In the fine step, we aim to improve the learning quality without exacerbating the inter-object learning bias. To achieve this, we approximate the instance-wise semantic pattern by representing each object with a Dynamic Gaussian Mixture Model (DGMM). This model serves as M_d in Equation 6 for the object-wise sample constraint. Unlike batch-wise or sample-wise evaluations [36], [37], [40], which tend to favor larger objects, our approach assesses the relative quality of samples within each object, ensuring consistent positive sample supervision across objects of varying sizes.

First, we refine the sample candidates in the CPS according to their predicted scores to fit the object's semantically salient regions. More specifically, we define the Possibility of becoming a True prediction (PT) [37] for sample screening, which is a linear combination of the predicted classification score and the location score with respect to the gt. We define the PT of the i-th sample D_i as:

PT_i = 0.5 (Cls(D_i) + IoU(D_i, gt_i)),   (13)

where Cls is the predicted classification confidence and IoU is the rotated IoU between the predicted location and its corresponding gt location. We select the candidates with the Q highest PT values as Medium Positive Sample (MPS) candidates.

Following this, we define the DGMM using a mixture of the gt's geometry and the MPS distribution to eliminate misaligned samples and obtain the final positive samples for prediction. Unlike previous works that utilize a center probability map [88] or a single Gaussian [42], [58] for instance representation, our approach represents the instance with a more refined DGMM. This model consists of two components: one centered on the geometric center and the other on the semantic center of the object. Specifically, for a given instance gt_i, the geometric center (cx_i, cy_i) serves as the mean vector μ_{i,1} of the first Gaussian, and the semantic center (sx_i, sy_i), deduced by averaging the locations of the samples in the MPS, serves as μ_{i,2}. That is to say, we parameterize the instance representation as:

DGMM_i(s|x, y) = \sum_{m=1}^{2} w_{i,m} \sqrt{2π|Σ_{i,m}|}\, N_{i,m}(μ_{i,m}, Σ_{i,m}),   (14)

where w_{i,m} is the weight of each Gaussian, with the weights summing to 1, and Σ_{i,m} equals the gt's Σ_g. Under this modeling, each sample in the MPS is associated with a DGMM score DGMM(s|MPS). Samples with DGMM(s|MPS) < e^{−g} for any gt are assigned negative masks, with g being an adjustable parameter.

VI. EXPERIMENTS

A. Datasets and Implementation Details

Datasets. In addition to the experiments on AI-TOD-R, we conduct experiments on seven more datasets covering various tasks to verify the method's broad adaptability.
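As a concrete reference for the coarse matching step of Section V-C, the sketch below implements the closed-form GJSD of Eqs. (10)–(12) and a top-K CPS selection. The tensor shapes, the closed-form Gaussian KLD, and the convention of ranking by the K smallest divergences are illustrative assumptions rather than the paper's exact code.

```python
import torch

def kl_gauss(mu0, s0, mu1, s1):
    """KL(N0 || N1) for 2-D Gaussians; mu*: (..., 2), s*: (..., 2, 2)."""
    s1_inv = torch.inverse(s1)
    d = (mu1 - mu0).unsqueeze(-1)
    trace = torch.diagonal(s1_inv @ s0, dim1=-2, dim2=-1).sum(-1)
    maha = (d.transpose(-1, -2) @ s1_inv @ d).squeeze(-1).squeeze(-1)
    return 0.5 * (trace + maha - 2.0 + torch.logdet(s1) - torch.logdet(s0))

def gjsd(mu_p, s_p, mu_g, s_g, alpha=0.5):
    """Eqs. (10)-(12): Generalized JSD between prior and gt Gaussians."""
    sp_inv, sg_inv = torch.inverse(s_p), torch.inverse(s_g)
    s_a = torch.inverse((1 - alpha) * sp_inv + alpha * sg_inv)
    mu_a = (s_a @ ((1 - alpha) * sp_inv @ mu_p.unsqueeze(-1)
                   + alpha * sg_inv @ mu_g.unsqueeze(-1))).squeeze(-1)
    return ((1 - alpha) * kl_gauss(mu_a, s_a, mu_p, s_p)
            + alpha * kl_gauss(mu_a, s_a, mu_g, s_g))

def coarse_positive_samples(mu_p, s_p, mu_g, s_g, k=16):
    """Coarse step for one gt (mu_g: (1, 2), s_g: (1, 2, 2)): keep the K priors
    most similar to the gt, treating a smaller divergence as a higher score."""
    div = gjsd(mu_p, s_p, mu_g, s_g)
    return torch.topk(div, k, largest=False).indices
```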
TABLE IV
RESULTS ON THE SODA-A TEST SET. ALL MODELS ARE TRAINED ON THE SODA-A TRAIN SET WITH A RESNET-50 BACKBONE. SCHEDULE DENOTES THE TRAINING EPOCHS, WHERE '1×' REFERS TO 12 EPOCHS.
Method Publication Schedule AP AP0.5 AP0.75 APeS APrS APgS APN #Params. FLOPs
Faster RCNN-O [68] TPAMI 2017 1× 32.5 70.1 24.3 11.9 27.3 42.2 34.4 41.1M 292.25G
RetinaNet-O [66] TPAMI 2020 1× 26.8 63.4 16.2 9.1 22.0 35.4 28.2 36.2M 221.90G
RoI Transformer [30] CVPR 2019 1× 36.0 73.0 30.1 13.5 30.3 46.1 39.5 55.1M 306.20G
Gliding Vertex [72] TPAMI 2021 1× 31.7 70.8 22.6 11.7 27.0 41.1 33.8 41.1M 292.25G
Oriented RCNN [9] ICCV 2021 1× 34.4 70.7 28.6 12.5 28.6 44.5 36.7 41.1M 292.44G
S2 A-Net [32] TGRS 2022 1× 28.3 69.6 13.1 10.2 22.8 35.8 29.5 38.6M 277.72G
DODet [89] TGRS 2022 1× 31.6 68.1 23.4 11.3 26.3 41.0 33.5 69.3M 555.49G
Oriented RepPoints [34] CVPR 2022 1× 26.3 58.8 19.0 9.4 22.6 32.4 28.5 55.7M 274.07G
DHRec [90] TPAMI 2022 1× 30.1 68.8 19.8 10.6 24.6 40.3 34.6 32.0M 792.76G
CFINet [45] ICCV 2023 1× 34.4 73.1 26.1 13.5 29.3 44.0 35.9 44.0M 312.60G
DCFL (RetinaNet-O) Ours 1× 34.9 (+8.1) 73.2 (+9.8) 27.8 (+11.6) 14.2 29.8 43.7 38.0 36.1M 221.90G
DCFL (Oriented R-CNN) Ours 1× 36.6 (+2.2) 72.6 (+1.9) 32.4 (+3.8) 13.9 30.3 47.4 41.2 41.1M 292.44G
TABLE V
MAIN RESULTS ON THE DOTA-V2 OBB TASK. WE FOLLOW THE OFFICIAL CLASS ABBREVIATIONS OF THE DOTA-V2.0 BENCHMARK [3]. DP DENOTES DEFORMABLE ROI POOLING [81]. † DENOTES TRAINING FOR 40 EPOCHS. NOTE THAT THE PAPER [70] REPORTS 50.90% MAP FOR R3DET W/ KLD UNDER 20 EPOCHS; THE RER101 BACKBONE IS PROPOSED BY REDET [74]. RESULTS IN BOLD AND UNDERLINE DENOTE THE BEST AND SECOND-BEST PERFORMANCE IN EACH COLUMN.
Method Backbone Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC Air Heli mAP
multi-stage:
Faster R-CNN-O [68] R50 71.61 47.20 39.28 58.70 35.55 48.88 51.51 78.97 58.36 58.55 36.11 51.73 43.57 55.33 57.07 3.51 52.94 2.79 47.31
Faster R-CNN-O w/ Dp R50 71.55 49.74 40.34 60.40 40.74 50.67 56.58 79.03 58.22 58.24 34.73 51.95 44.33 55.10 53.14 7.21 59.53 6.38 48.77
Mask R-CNN [91] R50 76.20 49.91 41.61 60.00 41.08 50.77 56.24 78.01 55.85 57.48 36.62 51.67 47.39 55.79 59.06 3.64 60.26 8.95 49.47
HTC* [92] R50 77.69 47.25 41.15 60.71 41.77 52.79 58.87 78.74 55.22 58.49 38.57 52.48 49.58 56.18 54.09 4.20 66.38 11.92 50.34
RoI Transformer [30] R50 71.81 48.39 45.88 64.02 42.09 54.39 59.92 82.70 63.29 58.71 41.04 52.82 53.32 56.18 57.94 25.71 63.72 8.70 52.81
Oriented R-CNN [9] R50 77.95 50.29 46.73 65.24 42.61 54.56 60.02 79.08 61.69 59.42 42.26 56.89 51.11 56.16 59.33 25.81 60.67 9.17 53.28
one-stage:
DAL [40] R50 71.23 38.36 38.60 45.24 35.42 43.75 56.04 70.84 50.87 56.63 20.28 46.53 33.49 47.29 12.15 0.81 25.77 0.00 38.52
SASM [41] R50 70.30 40.62 37.01 59.03 40.21 45.46 44.60 78.58 49.34 60.73 29.89 46.57 42.95 48.31 28.13 1.82 76.37 0.74 44.53
RetinaNet-O [66] R50 70.63 47.26 39.12 55.02 38.10 40.52 47.16 77.74 56.86 52.12 37.22 51.75 44.15 53.19 51.06 6.58 64.28 7.45 46.68
R3 Det w/ KLD [70] R50 75.44 50.95 41.16 61.61 41.11 45.76 49.65 78.52 54.97 60.79 42.07 53.20 43.08 49.55 34.09 36.26 68.65 0.06 47.26
FCOS-O [67] R50 74.84 47.53 40.83 57.41 43.89 47.72 55.66 78.61 57.86 63.00 38.02 52.38 41.91 53.24 40.22 7.15 65.51 7.42 48.51
Oriented Reppoints [34] R50 73.02 46.68 42.37 63.05 47.06 50.28 58.64 78.84 57.12 66.77 35.21 50.76 48.77 51.62 34.23 6.17 64.66 5.87 48.95
ATSS-O [39] R50 77.46 49.55 42.12 62.61 45.15 48.40 51.70 78.43 59.33 62.65 39.18 52.43 42.92 53.98 42.70 5.91 67.09 10.68 49.57
S2 A-Net [32] R50 77.84 51.31 43.72 62.59 47.51 50.58 57.86 80.73 59.11 65.32 36.43 52.60 45.36 52.46 40.12 0.00 62.81 11.11 49.86
ours:
DCFL (Retinanet-O) R50 75.71 49.40 44.69 63.23 46.48 51.55 55.50 79.30 59.96 65.39 41.86 54.42 47.03 55.72 50.49 11.75 69.01 7.75 51.57
DCFL (S2 A-Net) R50 74.79 53.25 45.81 65.46 46.49 53.23 58.10 81.51 60.13 66.42 43.24 55.09 50.52 55.58 54.53 5.23 68.73 13.06 52.84
DCFL (Oriented R-CNN) R50 77.59 52.46 45.98 61.73 49.77 54.32 60.55 79.27 61.76 68.17 43.41 56.59 52.41 56.68 55.32 27.42 63.50 12.64 54.42
DCFL (Retinanet-O)† R50 78.30 53.03 44.24 60.17 48.56 55.42 58.66 78.29 60.89 65.93 43.54 55.82 53.33 60.00 54.76 30.90 74.01 15.60 55.08
DCFL (Retinanet-O)† ReR101 79.49 55.97 50.15 61.59 49.00 55.33 59.31 81.18 66.52 60.06 52.87 56.71 57.83 58.13 60.35 35.66 78.65 13.03 57.66
These tasks include small oriented object detection (SODA-A [10]), oriented object detection with a large number of tiny objects (DOTA-v1.5 [3], DOTA-v2 [3]), multi-scale oriented object detection (DOTA-v1 [14], DIOR-R [15]), and horizontal object detection (VisDrone [61], MS COCO [11], DOTA-v2 HBB).

For ablation studies and analyses, we choose the large-scale DOTA-v2 train set for training and its val set for evaluation, since DOTA-v2 is the largest dataset for oriented object detection and contains a substantial number of tiny objects. This dataset enables us to simultaneously verify the method's effectiveness on tiny object detection and on generic oriented object detection. For fair comparison with other methods, we use the trainval sets of DOTA-v1, DOTA-v1.5, DOTA-v2, and DIOR-R for training and their respective test sets for testing; for SODA-A we use its train set and test set, and for VisDrone2019 and MS COCO we use their train sets and val sets for training and evaluation.

Implementation details. We conduct all experiments on a computer equipped with a single NVIDIA RTX 4090 GPU, setting the batch size to 4. The models are built on the MMDetection [93] and MMRotate [94] frameworks with PyTorch [95]. We utilize ImageNet [96] pre-trained models as the backbone. For training, we employ the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.005, a momentum of 0.9, and a weight decay of 0.0001. Unless otherwise specified, the default backbone is ResNet-50 [97] with FPN [98]. We use the focal loss [66] for classification and the IoU loss [57] for regression. We only use random flipping for data augmentation across all experiments.

For experiments on DOTA-v1 and DOTA-v2, we adhere to the official settings of the DOTA-v2 benchmark [3]. Specifically, we crop the images into 1024 × 1024 patches with 200-pixel overlaps and train the models for 12 epochs.
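A minimal PyTorch sketch of the optimization settings and the single flip augmentation described above; the placeholder model and the oriented-box flip convention (angle negation) are assumptions for illustration, not the paper's configuration files.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for ResNet-50 + FPN; only the optimizer
# settings mirror those stated in the text.
model = nn.Conv2d(3, 256, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)

def random_flip(img, boxes, p=0.5):
    """Random horizontal flipping, the only augmentation used.
    img: (C, H, W) tensor; boxes: (N, 5) oriented boxes (cx, cy, w, h, theta)."""
    if torch.rand(()) < p:
        img = torch.flip(img, dims=[-1])
        boxes = boxes.clone()
        boxes[:, 0] = img.shape[-1] - boxes[:, 0]   # mirror the centre x
        boxes[:, 4] = -boxes[:, 4]                  # mirror the angle (assumed convention)
    return img, boxes
```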
For DOTA-v2, we reproduce several state-of-the-art one-stage methods [31], [32], [34], [39]–[41], [70], [99] using the same settings. For experiments on the other datasets, we follow their default benchmarks for image pre-processing, including setting the input size to 1200×1200 for SODA-A, 1024×1024 with 200-pixel overlaps for DOTA-v1.5, 800×800 for DIOR-R, and 1333×800 for VisDrone and MS COCO. The models are trained for 40 epochs on DOTA-v1.5 and DIOR-R, and for 12 epochs on SODA-A, VisDrone, and MS COCO, following previous works [34], [73]. DCFL uses RetinaNet-O as the baseline detector if not specified. Unless otherwise stated, these settings are consistently maintained.

TABLE VI
COMPARISON WITH ONE-STAGE DETECTORS ON THE DOTA-V1 OBB TASK. ALL RESULTS ARE BASED ON MMROTATE [94] WITH 12 EPOCHS EXCEPT FOR GGHL [42]. 3× MEANS TRAINING FOR 36 EPOCHS.

Method CFA [73] RetinaNet-O [66] R3Det [31] Oriented Rep [34] ATSS-O [39]
mAP 69.63 69.79 70.18 71.94 72.29
Method KLD [70] S2A-Net [32] GGHL [42] (3×) DCFL DCFL (3×)
mAP 72.76 73.91 73.98 74.26 75.35

TABLE VII
MAIN RESULTS ON THE DOTA-V1.5 OBB TASK.

Method Backbone SV Ship ST mAP
RetinaNet-O [66] R50 44.53 73.31 59.96 59.16
Faster R-CNN-O [91] R50 51.28 79.37 67.50 62.00
CMR [91] R50 51.64 79.99 67.58 63.41
RoI Transformer [30] R50 52.05 80.72 68.26 65.03
ReDet [74] ReR50 52.38 80.92 68.64 66.86
DCFL R50 56.72 80.87 75.65 67.37 (+8.21)
DCFL ReR101 57.31 86.60 76.55 70.24 (+11.08)

B. Main Results

Tiny/small oriented object detection. As the main track, we evaluate the performance of DCFL on the challenging datasets dedicated to tiny (AI-TOD-R) and small (SODA-A) oriented object detection. First, the results on AI-TOD-R are shown in Table II. Without bells and whistles, DCFL improves both one-stage (#1 vs. #22) and two-stage object detectors (#5 vs. #24) by large margins. Notably, when plugging DCFL into the advanced one-stage method S2A-Net, our approach reaches a new state-of-the-art performance of 49.6% AP0.5, a remarkable improvement of 16.2 points, with significant gains on very tiny objects. Besides, we also evaluate the proposed method on another oriented small object detection benchmark, SODA-A. As a recently proposed dataset, the challenging and large-scale SODA [10] is attracting increasing attention. Results on this benchmark are shown in Table IV, where DCFL shines by boosting RetinaNet-O by 8.1 AP points and the strong Oriented R-CNN baseline by 2.2 AP points. Moreover, the improvement in AP0.75 is more pronounced than in AP0.5, indicating that DCFL locates oriented tiny objects more precisely. Given that DCFL mainly optimizes the model's training process, the accuracy improvement incurs no additional parameter or computational cost on either dataset, as shown in Tables II and IV.

Oriented object detection with massive tiny objects. More generally, evaluating a model's detection performance on datasets containing both massive tiny objects and other-sized objects not only validates its ability to address tiny objects but also examines its robustness to scale variance. We thus perform experiments on DOTA-v1.5 and DOTA-v2, which are general-purpose datasets characterized by a significant number of tiny objects. As shown in Table V, our proposed method achieves a state-of-the-art performance of 57.66% mAP on the challenging DOTA-v2 benchmark with single-scale training and testing. Meanwhile, our model attains 51.57% mAP on this dataset without bells and whistles, outperforming all tested one-stage oriented object detectors. Besides, results on DOTA-v1.5 are presented in Table VII, where DCFL notably improves the baseline and achieves leading performance among one-stage methods.

Multi-scale oriented object detection. Investigating the method's performance on multi-scale oriented object detection datasets demonstrates its versatility and generality across diverse oriented object detection tasks. Therefore, we validate DCFL on the DOTA-v1 and DIOR-R multi-scale oriented object detection datasets, which also include some tiny object classes. The results on these datasets are shown in Tables VI and VIII. Beyond tiny-object-specific datasets, DCFL also excels in multi-scale scenarios, achieving leading performance among all one-stage methods. Furthermore, the class-wise AP of tiny objects on DOTA-v1 and DIOR-R, listed in Tables VI and IX, shows particularly significant improvements for tiny-size classes, often by more than 10 points.

Horizontal object detection. The proposed method can also be applied to generic object detection tasks and enhance their performance, simply by discarding the angle information. We evaluate the model on three different scenarios: drone-captured images (VisDrone), natural images (MS COCO), and aerial images (DOTA-v2 HBB). These datasets, annotated with horizontal bounding boxes, contain a significant number of small objects. Integrating our learning pipeline into the RetinaNet-O baseline yields an improvement of 2-3 points, as shown in Table X.

In a nutshell, these results demonstrate that DCFL is not only highly effective for detecting oriented tiny objects (such as small vehicles, ships, and storage tanks), achieving an approximately 10-point improvement over the baseline for these classes, but also excels in general-purpose oriented object detection and horizontal object detection tasks, as evidenced by its performance on tracks like DOTA-v1, DIOR-R, and MS COCO.

C. Ablations

Effects of individual strategies. We evaluate the effectiveness of each proposed strategy through a series of ablation experiments. For consistency and fair comparison, we tile one prior for each feature point in all experiments. As shown in Table XIIa, the baseline detector, RetinaNet-O, achieves an mAP of 51.70%.
Fig. 7. Visualization analysis of the predicted results. The first row shows the results predicted by Oriented R-CNN, while the second row shows results from DCFL on the AI-TOD-R dataset. True positive, false negative, and false positive predictions are marked in green, red, and blue, respectively.

TABLE VIII
Method RetinaNet-O [66] FR-OBB [68] RT [30] AOPG [15]
mAP 57.55 59.54 63.87 64.41
Method GGHL [42] Oriented Rep [34] DCFL DCFL (ReR101)
mAP 66.48 66.71 66.80 71.03

TABLE IX
DETECTION RESULTS OF TYPICAL TINY OBJECTS ON THE DIOR-R DATASET. VE, BR, AND WM DENOTE VEHICLE, BRIDGE, AND WIND-MILL.

Method Backbone VE BR WM
RetinaNet-O [66] R50 38.0 24.0 60.2
Oriented Rep [34] R50 50.4 38.8 64.7
DCFL R50 50.9 (+12.9) 42.1 (+18.1) 70.9 (+10.7)

TABLE X
Dataset VisDrone MS COCO DOTA-v2 HBB
Method RetinaNet [66] DCFL RetinaNet [66] DCFL FCOS [44] DCFL
AP0.5 29.2 32.1 55.4 57.3 55.4 57.4

Gradually integrating the posterior re-ranked MPS and the DGMM into the detector, on top of the CPS, results in progressive mAP improvements, confirming the effectiveness of each design. It is important to note that the CPS cannot be used independently, as its samples are too coarse to serve as final positive samples. Nevertheless, we compare different ways of constructing the CPS to verify its superiority.

Comparisons of different CPS. The design choice of the CPS determines the range of sample candidates during training. In this section, we compare several CPS design paradigms, including limiting the CPS of a specific gt to a single layer and utilizing all FPN layers as the CPS, similar to ObjectBox [84]. We present their performance in Table XIIb. For fair comparison, the number of samples in the CPS is fixed at 16, and all other components remain unchanged. In the Single-FPN-layer approach, we group gts onto different layers based on the regression range defined in FCOS [99] and assign labels within each layer. In the All-FPN-layer approach, we do not group gts onto different layers but instead discard the prior scale information and directly measure the distance between the Gaussian gt and the prior points. As shown in Table XIIb, neither of these two methods yields the best performance. In contrast, using distribution distances (KLD, GWD, GJSD) to construct the Cross-FPN-layer CPS extends the candidate range to adjacent layers in addition to the main layer. We can also see that GJSD achieves the best performance of 59.15% mAP, mainly owing to its scale invariance [70], [87], symmetry [87], and ability to measure non-overlapping boxes [87] compared to the other counterparts.

Fixed prior or dynamic prior. We conduct a detailed set of ablation studies to verify the necessity of introducing the dynamic prior. As shown in Table XIIc, disabling the dynamic prior by fixing the sample locations results in a performance drop. This indicates that the prior should be adjusted accordingly when leveraging the dynamic sampling strategy, in order to better capture the shape of objects.

Detailed design of the PCB. The PCB consists of a dilated convolution and a guiding DCN. We slightly enlarge the receptive field using a dilation rate of 3 and then utilize the DCN to generate dynamic priors in a guiding manner. As shown in Table XIIc, the DCN provides an improvement of 0.34 mAP points, and the dilated convolution slightly enhances the mAP. However, applying the DCN [100] to the regression branch alone slightly deteriorates accuracy (denoted as Separate in Table XIIc), likely due to mismatch issues between the two branches. To address this, we use the offsets from the regression head to guide the offsets for the classification head, resulting in better alignment (denoted as Guiding).
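The following is a minimal sketch of such a prior-capturing head, assuming torchvision's DeformConv2d and 256-channel head features; it mirrors the 'Guiding' design described above but is not the authors' exact PCB.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class PriorCapturingBlock(nn.Module):
    """Sketch of a PCB-style head: a dilated 3x3 convolution enlarges the
    receptive field, a small branch predicts DCN offsets from the regression
    features, and the same offsets guide deformable feature extraction in the
    classification branch (the 'Guiding' variant of the ablation)."""
    def __init__(self, channels=256):
        super().__init__()
        self.context = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        self.offset = nn.Conv2d(channels, 18, 3, padding=1)   # 9 points x (dy, dx)
        self.reg_dcn = DeformConv2d(channels, channels, 3, padding=1)
        self.cls_dcn = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, cls_feat, reg_feat):
        reg_feat = self.context(reg_feat)
        offsets = self.offset(reg_feat)            # shared, regression-guided offsets
        reg_out = self.reg_dcn(reg_feat, offsets)
        cls_out = self.cls_dcn(cls_feat, offsets)  # cls branch guided by reg offsets
        return cls_out, reg_out, offsets           # offsets also update the priors (Eq. (8))
```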
TABLE XII
ABLATIONS. WE TRAIN ON THE DOTA-V2 train SET, TEST ON ITS val SET, AND REPORT MAP UNDER AN IOU THRESHOLD OF 0.5.

(a) Individual effectiveness. CPS, MPS, and DGMM denote the Coarse and Medium Positive Sample candidates and the Dynamic Gaussian Mixture Model.
Method CPS MPS DGMM mAP
baseline [66] 51.70
DCFL ✓ ✓ 53.41
DCFL ✓ ✓ 57.20
DCFL ✓ ✓ ✓ 59.15

(b) Comparisons of different CPS. The number of FPN layers involved varies with the strategy used to obtain the CPS.
Strategy Measurement mAP
All-FPN-layer Gaussian 50.12
Single-FPN-layer Gaussian 56.72
Cross-FPN-layer KLD [70] 57.82
Cross-FPN-layer GWD [83] 58.55
Cross-FPN-layer GJSD 59.15

(c) Effects of designs in the PCB. DP: the dynamic prior. Guiding: the regression branch guides the classification branch.
DCN Dilated Conv DP mAP
58.07
✓ 58.41
✓ ✓ 58.65
Separate ✓ ✓ 58.71
Guiding ✓ ✓ 59.15

(d) Effects of K and Q.
K 16 16 16 16 12 12 12 12
Q 12 10 8 6 10 8 6 4
mAP 59.15 58.57 58.97 57.84 58.79 58.25 57.01 57.37

(e) Effect of the DGMM threshold g.
g 0.8 0.4
mAP 59.15 58.95
Effects of parameters. The three introduced parameters are robust within a certain range. As shown in Table XIId, the combination of K = 16 and Q = 12 yields the best performance. In Table XIIe, we verify the threshold e^{−g} in the DGMM and find that setting w_{i,1} to 0.7 and a threshold of g = 0.8 results in the highest mAP. Although making the CPS/MPS/DGMM coarser or stricter can weaken performance, the mAP only fluctuates slightly. This indicates that the coarse-to-fine assignment method is robust to parameter selection, as multiple parameters can mitigate the effect of any single under-tuned parameter.

Fig. 8. Analysis of the learning bias across different methods. The first and second columns investigate the quality and quantity imbalances, respectively. Results are sampled from the model's last training epoch.

D. Analysis

Visual analysis. We visualize DCFL's predictions and dynamic prior positions to better show the model's capability of addressing oriented tiny objects in Figures 7 and 8, respectively. In Figure 7, by separating the model's predictions into true positive, false negative, and false positive predictions with different colors according to the gt, we can easily see that DCFL significantly suppresses false negative predictions (i.e., missed detections) for tiny objects. This improvement can be largely attributed to the sufficient and unbiased sample learning of different-sized objects resulting from the coarse-to-fine sample selection scheme. Besides, from Figure 8 (upper), we can see that the prior setting in DCFL better matches the discriminative areas of oriented tiny objects. This further verifies that, by adaptively adjusting prior positions according to the object's region of interest, the prior bias of previous static prior designs can be mitigated.

How does DCFL achieve unbiased learning? To better understand the working mechanism of DCFL, we delve into its training process by statistically investigating its sample assignment. Specifically, we calculate the quantity and quality of the positive samples assigned to ground truth (gt) bounding boxes within various angle and scale intervals. This analysis reveals two types of imbalance (quantity and quality) in the baseline methods: (1) the number of positive samples assigned to each object varies periodically with respect to its angle and scale, with objects whose shapes (scale, angle) differ from the predefined anchors receiving far fewer positive samples; (2) the predicted IoU fluctuates periodically with respect to the gt's scale while remaining invariant with respect to the gt's angle. In contrast, DCFL effectively addresses these learning biases: (1) it compensates by assigning more positive samples to previously outlier angles and scales; (2) it improves and balances the quality of samples (predicted IoU) across all angles and scales. These results demonstrate the desired behavior of dynamic coarse-to-fine learning.
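As a pointer for reproducing this kind of analysis, the sketch below bins per-gt statistics by scale and angle and reports the mean number of positives (quantity) and the mean predicted IoU of those positives (quality). The tensor layout, the per-gt aggregation, and the bin choices are illustrative assumptions, not the authors' analysis code.

```python
import torch

def assignment_bias_stats(gt_scale, gt_angle, num_pos, pred_iou,
                          scale_bins, angle_bins):
    """gt_scale, gt_angle, num_pos, pred_iou: float (N,) tensors, one entry per
    gt (pred_iou is the mean predicted IoU of that gt's positive samples).
    scale_bins, angle_bins: 1-D tensors of bin boundaries."""
    s_idx = torch.bucketize(gt_scale, scale_bins)
    a_idx = torch.bucketize(gt_angle, angle_bins)
    stats = {}
    for name, idx, bins in (("scale", s_idx, scale_bins),
                            ("angle", a_idx, angle_bins)):
        quantity, quality = [], []
        for b in range(len(bins) + 1):
            mask = idx == b
            quantity.append(num_pos[mask].mean().item() if mask.any() else 0.0)
            quality.append(pred_iou[mask].mean().item() if mask.any() else 0.0)
        stats[name] = {"mean_num_pos": quantity, "mean_pred_iou": quality}
    return stats
```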
VII. DISCUSSIONS

The precise detection of arbitrary-oriented tiny objects is a fundamental step towards more generic pattern recognition in numerous specialized scenarios. Meanwhile, state-of-the-art object detectors significantly degrade when detecting these objects. Moreover, there is still a lack of task-specific datasets and benchmarks dedicated to the corresponding research. This motivates us to address this intricate but inevitable challenge. To this end, we establish a task-specific dataset and benchmark, and design a new method that realizes unbiased learning for objects of different scales and orientations.

Nevertheless, some challenges remain. First, the detection of oriented tiny objects is a widespread issue across various scenarios (e.g., autonomous driving, medical imaging, and defect detection) and diverse modalities (e.g., SAR, thermal, and X-ray data). This work, however, primarily focuses on aerial scenes in high-resolution optical data. By focusing on the typical scenario of aerial imagery where oriented tiny objects frequently appear, we aim to establish a solid foundation and open the possibility of understanding these challenging objects in a broader range of scenarios and modalities. Future research could also explore incorporating complementary information from different modalities or leveraging temporal data to enhance the detection of oriented tiny objects, further expanding and fulfilling practical applications. Second, the methodology in this paper operates under the closed-set setting, which requires full object annotations for the training set. However, oriented annotations for tiny objects are scarce and difficult to acquire, especially in scenarios under an open-world assumption. Meanwhile, experimental results have shown that label-efficient methods achieve very competitive performance compared with fully-supervised methods on oriented tiny object detection. Thus, it is worth further exploring the simplification of annotation requirements and the enhancement of tiny object detection performance with limited annotations. Third, foundation models are becoming a hot topic that facilitates various research directions, yet this work does not discuss or build upon them. How foundation models perform on oriented tiny objects, and how to pre-train or adapt them for this task, are also questions worth exploring in the future.

VIII. CONCLUSION

In this work, we systematically address the challenging task of detecting oriented tiny objects by establishing a new dataset and benchmark, and proposing a dynamic coarse-to-fine learning scheme aimed at scale-unbiased learning. Our dataset, AI-TOD-R, has the smallest mean object size among all oriented object detection datasets, and it presents additional challenges such as dense arrangement and class imbalance. Based on this dataset, we establish a benchmark and investigate the performance of various detection paradigms, uncovering two key insights. First, label-efficient detection methods now offer highly competitive performance on oriented tiny objects, showing great potential for further exploration. Second, biased prior settings and biased sample assignment across various detection pipelines significantly impede the detection performance of oriented tiny objects. To address these biases, we propose a dynamic coarse-to-fine learning (DCFL) scheme that is applicable to both one-stage and two-stage architectures. Extensive experiments on eight heterogeneous benchmarks verify that DCFL can significantly improve the detection accuracy of oriented tiny objects while maintaining high efficiency.

ACKNOWLEDGMENTS

We would like to thank Zijuan Chen, Xianhang Ye, Nuoyi Wang, Jinrui Zhang, Yuxin Li, Zheyan Xiao, Ziming Gui, Zhiwei Chen, Zijun Wu, and Huan Li for their voluntary annotation work. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62271355.

REFERENCES

[1] Y. Li, “Detecting lesion bounding ellipses with gaussian proposal networks,” in Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10. Springer, 2019, pp. 337–344.
[2] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021.
[3] J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y. Yang, S. Belongie, J. Luo, M. Datcu, M. Pelillo et al., “Object detection in aerial images: A large-scale benchmark and challenges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7778–7796, 2021.
[4] B. Zhao, P. Han, and X. Li, “Vehicle perception from satellite,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[5] N. Bhadwal, V. Madaan, P. Agrawal, A. Shukla, and A. Kakran, “Smart border surveillance system using wireless sensor network and computer vision,” in 2019 International Conference on Automation, Computational and Technology Management (ICACTM). IEEE, 2019, pp. 183–190.
[6] N. Zeng, P. Wu, Z. Wang, H. Li, W. Liu, and X. Liu, “A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–14, 2022.
[7] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023.
[8] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
[9] X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented r-cnn for object detection,” in IEEE International Conference on Computer Vision, 2021, pp. 3520–3529.
[10] G. Cheng, X. Yuan, X. Yao, K. Yan, Q. Zeng, X. Xie, and J. Han, “Towards large-scale small object detection: Survey and benchmarks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13467–13488, 2023.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[12] J. Wang, W. Yang, H. Guo, R. Zhang, and G.-S. Xia, “Tiny object detection in aerial images,” in International Conference on Pattern Recognition, 2021, pp. 3791–3798.
[13] C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Detecting tiny objects in aerial images: A normalized wasserstein distance and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 190, pp. 79–93, 2022.
[14] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, [34] W. Li, Y. Chen, K. Hu, and J. Zhu, “Oriented reppoints for aerial
M. Pelillo, and L. Zhang, “DOTA: A large-scale dataset for object object detection,” in IEEE Conference on Computer Vision and Pattern
detection in aerial images,” in IEEE Conference on Computer Vision Recognition, 2022, pp. 1829–1838.
and Pattern Recognition, 2018, pp. 3974–3983. [35] Y. Zeng, Y. Chen, X. Yang, Q. Li, and J. Yan, “Ars-detr: Aspect ratio-
[15] G. Cheng, J. Wang, K. Li, X. Xie, C. Lang, Y. Yao, and J. Han, sensitive detection transformer for aerial oriented object detection,”
“Anchor-free oriented proposal generator for object detection,” IEEE IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp.
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 1–15, 2024.
2022. [36] K. Kim and H. S. Lee, “Probabilistic anchor assignment with iou
[16] C. Xu, J. Ding, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, prediction for object detection,” in European Conference on Computer
“Dynamic coarse-to-fine learning for oriented tiny object detection,” in Vision. Springer, 2020, pp. 355–371.
IEEE Conference on Computer Vision and Pattern Recognition, June [37] Z. Ge, S. Liu, Z. Li, O. Yoshie, and J. Sun, “Ota: Optimal transport
2023, pp. 7318–7328. assignment for object detection,” in IEEE Conference on Computer
[17] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection Vision and Pattern Recognition, 2021, pp. 303–312.
benchmark,” in IEEE Conference on Computer Vision and Pattern [38] Y. Ma, S. Liu, Z. Li, and J. Sun, “Iqdet: Instance-wise quality distribu-
Recognition, 2016, pp. 5525–5533. tion sampling for object detection,” in Proceedings of the IEEE/CVF
[18] M. Braun, S. Krebs, F. Flohr, and D. M. Gavrila, “Eurocity persons: Conference on Computer Vision and Pattern Recognition, 2021, pp.
A novel benchmark for person detection in traffic scenes,” IEEE 1717–1725.
Transactions on Pattern Analysis and Machine Intelligence, vol. 41, [39] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap
no. 8, pp. 1844–1861, 2019. between anchor-based and anchor-free detection via adaptive training
[19] X. Yu, Y. Gong, N. Jiang, Q. Ye, and Z. Han, “Scale match for tiny sample selection,” in IEEE Conference on Computer Vision and Pattern
person detection,” in IEEE Workshops on Applications of Computer Recognition, 2020, pp. 9759–9768.
Vision, 2020, pp. 1257–1265. [40] Q. Ming, Z. Zhou, L. Miao, H. Zhang, and L. Li, “Dynamic anchor
[20] Z. Zhao, J. Du, C. Li, X. Fang, Y. Xiao, and J. Tang, “Dense tiny object learning for arbitrary-oriented object detection,” in AAAI Conference
detection: A scene context guided approach and a unified benchmark,” on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2355–2363.
IEEE Transactions on Geoscience and Remote Sensing, 2024. [41] L. Hou, K. Lu, J. Xue, and Y. Li, “Shape-adaptive selection and
[21] Z. Liu, H. Wang, L. Weng, and Y. Yang, “Ship rotated bounding box measurement for oriented object detection,” in AAAI Conference on
space for ship extraction from high-resolution optical satellite images Artificial Intelligence, 2022.
with complex backgrounds,” IEEE Geoscience and Remote Sensing [42] Z. Huang, W. Li, X.-G. Xia, and R. Tao, “A general gaussian heatmap
Letters, vol. 13, no. 8, pp. 1074–1078, 2016. label assignment for arbitrary-oriented object detection,” IEEE Trans-
actions on Image Processing, vol. 31, pp. 1895–1910, 2022.
[22] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Orientation
robust object detection in aerial images using deep convolutional neural [43] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set
network,” in IEEE International Conference on Image Processing, representation for object detection,” in IEEE International Conference
2015, pp. 3735–3739. on Computer Vision, 2019, pp. 9657–9666.
[44] C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Rfla: Gaussian
[23] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A
receptive field based label assignment for tiny object detection,” in
small target detection benchmark,” Journal of Visual Communication
European Conference on Computer Vision. Springer, 2022, pp. 526–
and Image Representation, vol. 34, pp. 187–203, 2016.
543.
[24] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li,
[45] X. Yuan, G. Cheng, K. Yan, Q. Zeng, and J. Han, “Small object
Y. Feng, T. Xu et al., “Fair1m: A benchmark dataset for fine-grained
detection via coarse-to-fine proposal generation and imitation learning,”
object recognition in high-resolution remote sensing imagery,” ISPRS
in IEEE International Conference on Computer Vision, 2023, pp. 6317–
Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–
6327.
130, 2022.
[46] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Perceptual
[25] T. Zhang, X. Zhang, J. Li, X. Xu, B. Wang, X. Zhan, Y. Xu, X. Ke, generative adversarial networks for small object detection,” in IEEE
T. Zeng, H. Su et al., “Sar ship detection dataset (ssdd): Official release Conference on Computer Vision and Pattern Recognition, 2017, pp.
and comprehensive data analysis,” Remote Sensing, vol. 13, no. 18, p. 1222–1230.
3690, 2021.
[47] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “Sod-mtgan: Small object
[26] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, detection via multi-task generative adversarial network,” in European
M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., Conference on Computer Vision. Springer, 2018, pp. 206–221.
“Icdar 2015 competition on robust reading,” in 2015 13th international [48] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, “Better to follow, follow
conference on document analysis and recognition (ICDAR). IEEE, to be better: Towards precise supervision of feature super-resolution for
2015, pp. 1156–1160. small object detection,” in IEEE International Conference on Computer
[27] E. Goldman, R. Herzig, A. Eisenschtat, J. Goldberger, and T. Hassner, Vision, 2019, pp. 9725–9734.
“Precise detection in densely packed scenes,” in Proceedings of the [49] L. Courtrai, M.-T. Pham, and S. Lefèvre, “Small object detection
IEEE/CVF Conference on Computer Vision and Pattern Recognition, in remote sensing images based on super-resolution with auxiliary
2019, pp. 5227–5236. generative adversarial networks,” Remote Sensing, vol. 12, no. 19, p.
[28] Z. Chen, J. Zhang, Z. Lai, G. Zhu, Z. Liu, J. Chen, and J. Li, “The devil 3152, 2020.
is in the crack orientation: A new perspective for crack detection,” in [50] S. M. A. Bashir and Y. Wang, “Small object detection in remote sensing
Proceedings of the IEEE/CVF International Conference on Computer images with residual feature aggregation-based super-resolution and
Vision, 2023, pp. 6653–6663. object detector network,” Remote Sensing, vol. 13, no. 9, p. 1854, 2021.
[29] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, [51] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, “Small-
“Arbitrary-oriented scene text detection via rotation proposals,” IEEE object detection in remote sensing images with end-to-end edge-
Transactions on Multimedia, vol. 20, no. 11, pp. 3111–3122, 2018. enhanced gan and object detector network,” Remote Sensing, vol. 12,
[30] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning roi no. 9, p. 1432, 2020.
transformer for detecting oriented objects in aerial images,” in IEEE [52] C. Xu, J. Wang, W. Yang, and L. Yu, “Dot distance for tiny object
Conference on Computer Vision and Pattern Recognition, 2019, pp. detection in aerial images,” in IEEE Conference on Computer Vision
2849–2858. and Pattern Recognition Workshops, 2021, pp. 1192–1201.
[31] X. Yang, Q. Liu, J. Yan, A. Li, Z. Zhang, and G. Yu, “R3det: [53] J. Wang, C. Xu, W. Yang, and L. Yu, “A normalized gaus-
Refined single-stage detector with feature refinement for rotating sian wasserstein distance for tiny object detection,” arXiv preprint
object,” CoRR, vol. abs/arXiv:1908.05612, 2019. [Online]. Available: arXiv:2110.13389, 2021.
https://arxiv.org/abs/1908.05612 [54] Z. Zhou and Y. Zhu, “Kldet: Detecting tiny objects in remote sensing
[32] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for ori- images via kullback-leibler divergence,” IEEE Transactions on Geo-
ented object detection,” IEEE Transactions on Geoscience and Remote science and Remote Sensing, 2024.
Sensing, vol. 60, pp. 1–11, 2021. [55] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and
[33] Z. Li, B. Hou, Z. Wu, L. Jiao, B. Ren, and C. Yang, “Fcosr: A simple S. Savarese, “Generalized intersection over union: A metric and a loss
anchor-free rotated detector for aerial object detection,” arXiv preprint for bounding box regression,” in IEEE Conference on Computer Vision
arXiv:2111.10780, 2021. and Pattern Recognition, 2019, pp. 658–666.
[56] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou IEEE Conference on Computer Vision and Pattern Recognition, 2021,
loss: Faster and better learning for bounding box regression,” in AAAI pp. 3060–3069.
Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 993– [78] T. Wang, T. Yang, J. Cao, and X. Zhang, “Co-mining: Self-supervised
13 000. learning for sparsely annotated object detection,” in AAAI Conference
[57] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, “Unitbox: An advanced on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2800–2808.
object detection network,” 2016, pp. 516–520. [79] B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, and J. Sun, “Au-
[58] X. Yang, G. Zhang, X. Yang, Y. Zhou, W. Wang, J. Tang, T. He, and toassign: Differentiable label assignment for dense object detection,”
J. Yan, “Detecting rotated objects as gaussian distributions and its 3-d arXiv preprint arXiv:2007.03496, 2020.
generalization,” IEEE Transactions on Pattern Analysis and Machine [80] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka,
Intelligence, vol. 45, no. 4, pp. 4335–4354, 2023. L. Li, Z. Yuan, C. Wang, and P. Luo, “Sparse r-cnn: End-to-end object
[59] Y. Yu and F. Da, “Phase-shifting coder: Predicting accurate orientation detection with learnable proposals,” in IEEE Conference on Computer
in oriented object detection,” in IEEE Conference on Computer Vision Vision and Pattern Recognition, 2021, pp. 14 454–14 463.
and Pattern Recognition, 2023, pp. 13 354–13 363. [81] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “De-
[60] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection formable convolutional networks,” in IEEE Conference on Computer
in optical remote sensing images: A survey and a new benchmark,” Vision and Pattern Recognition, 2017, pp. 764–773.
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. [82] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
296–307, 2020. S. Zagoruyko, “End-to-end object detection with transformers,” in
[61] D. Du, P. Zhu, L. Wen, and et al., “Visdrone-det2019: The vision European Conference on Computer Vision. Springer, 2020, pp. 213–
meets drone object detection in image challenge results,” in IEEE 229.
International Conference on Computer Vision Workshops, 2019, pp. [83] X. Yang, J. Yan, Q. Ming, W. Wang, X. Zhang, and Q. Tian, “Rethink-
213–226. ing rotated object detection with gaussian wasserstein distance loss,”
[62] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, in International Conference on Machine Learning, vol. 139, 2021, pp.
Y. Bulatov, and B. McCord, “xview: Objects in context in overhead 11 830–11 841.
imagery,” arXiv preprint arXiv:1802.07856, 2018. [84] M. Zand, A. Etemad, and M. Greenspan, “Objectbox: From centers
[63] X. Yang, G. Zhang, W. Li, Y. Zhou, X. Wang, and J. Yan, “H2rbox: to boxes for anchor-free object detection,” in European Conference on
Horizontal box annotation is all you need for oriented object detection,” Computer Vision, 2022, pp. 390–406.
in The Eleventh International Conference on Learning Representations, [85] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module
2022. for single-shot object detection,” in IEEE Conference on Computer
[64] Y. Yu, X. Yang, Q. Li, Y. Zhou, F. Da, and J. Yan, “H2rbox- Vision and Pattern Recognition, 2019, pp. 840–849.
v2: Incorporating symmetry for boosting horizontal box supervised [86] D. M. Endres and J. E. Schindelin, “A new metric for probability
oriented object detection,” Advances in Neural Information Processing distributions,” IEEE Transactions on Information Theory (TIT), vol. 49,
Systems, vol. 36, 2024. no. 7, pp. 1858–1860, 2003.
[65] W. Hua, D. Liang, J. Li, X. Liu, Z. Zou, X. Ye, and X. Bai, [87] F. Nielsen, “On a generalization of the jensen–shannon divergence and
“Sood: Towards semi-supervised oriented object detection,” in IEEE the jensen–shannon centroid,” Entropy, vol. 22, no. 2, p. 221, 2020.
Conference on Computer Vision and Pattern Recognition, 2023, pp. [88] J. Wang, W. Yang, H.-c. Li, H. Zhang, and G.-S. Xia, “Learning
15 558–15 567. center probability map for detecting objects in aerial images,” IEEE
[66] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for Transactions on Geoscience and Remote Sensing, vol. 59, no. 5, pp.
dense object detection,” in IEEE International Conference on Computer 4307–4323, 2021.
Vision, 2017, pp. 2980–2988. [89] G. Cheng, Y. Yao, S. Li, K. Li, X. Xie, J. Wang, X. Yao, and J. Han,
[67] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: A simple and strong “Dual-aligned oriented detector,” IEEE Transactions on Geoscience
anchor-free object detector,” IEEE Transactions on Pattern Analysis and Remote Sensing, vol. 60, pp. 1–11, 2022.
and Machine Intelligence, vol. 44, no. 4, pp. 1922–1933, 2022. [90] G. Nie and H. Huang, “Multi-oriented object detection in aerial images
[68] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- with double horizontal rectangles,” IEEE Transactions on Pattern
time object detection with region proposal networks,” in Advances in Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4932–4944, 2023.
Neural Information Processing Systems, 2015, pp. 91–99. [91] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in
[69] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable IEEE International Conference on Computer Vision, 2017, pp. 2961–
detr: Deformable transformers for end-to-end object detection,” in 2969.
International Conference on Learning Representations, 2021. [92] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng,
[70] X. Yang, X. Yang, J. Yang, Q. Ming, W. Wang, Q. Tian, and J. Yan, Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance
“Learning high-precision bounding box for rotated object detection via segmentation,” in IEEE Conference on Computer Vision and Pattern
kullback-leibler divergence,” Advances in Neural Information Process- Recognition, 2019, pp. 4974–4983.
ing Systems, vol. 34, pp. 18 381–18 394, 2021. [93] K. Chen, J. Wang, J. Pang, and et al., “MMDetection: Open mmlab
[71] X. Yang, Y. Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. ZHANG, detection toolbox and benchmark,” CoRR, vol. abs/arXiv:1906.07155,
and Q. Tian, “The kfiou loss for rotated object detection,” in The 2019. [Online]. Available: https://arxiv.org/abs/1906.07155
Eleventh International Conference on Learning Representations, 2022. [94] Y. Zhou, X. Yang, G. Zhang, J. Wang, Y. Liu, L. Hou, X. Jiang, X. Liu,
[72] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, J. Yan, C. Lyu et al., “Mmrotate: A rotated object detection benchmark
“Gliding vertex on the horizontal bounding box for multi-oriented using pytorch,” in Proceedings of the 30th ACM International Confer-
object detection,” IEEE Transactions on Pattern Analysis and Machine ence on Multimedia, 2022, pp. 7331–7334.
Intelligence, vol. 43, no. 4, pp. 1452–1459, 2021. [95] A. Paszke, S. Gross, F. Massa, A. Lerer et al., “Pytorch: An imperative
[73] Z. Guo, C. Liu, X. Zhang, J. Jiao, X. Ji, and Q. Ye, “Beyond bounding- style, high-performance deep learning library,” in Advances in Neural
box: Convex-hull feature adaptation for oriented and densely packed Information Processing Systems, 2019, pp. 8024–8035.
object detection,” in IEEE Conference on Computer Vision and Pattern [96] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Recognition, 2021, pp. 8792–8801. Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large
[74] J. Han, J. Ding, N. Xue, and G.-S. Xia, “Redet: A rotation-equivariant scale visual recognition challenge,” International Journal of Computer
detector for aerial object detection,” in IEEE Conference on Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
Vision and Pattern Recognition, 2021, pp. 2786–2795. [97] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
[75] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, recognition,” in IEEE Conference on Computer Vision and Pattern
“Swin transformer: Hierarchical vision transformer using shifted win- Recognition, 2016, pp. 770–778.
dows,” in Proceedings of the IEEE/CVF international conference on [98] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie,
computer vision, 2021, pp. 10 012–10 022. “Feature pyramid networks for object detection,” in IEEE Conference
[76] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi- [99] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-
supervised object detection,” in International Conference on Learning stage object detection,” in IEEE International Conference on Computer
Representations, 2021. [Online]. Available: https://openreview.net/ Vision, 2019, pp. 9627–9636.
forum?id=MJIve1zgR [100] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More
[77] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, deformable, better results,” in IEEE Conference on Computer Vision
“End-to-end semi-supervised object detection with soft teacher,” in and Pattern Recognition, 2019, pp. 9308–9316.
Chang Xu received his B.S. degree in electronic information engineering and his M.S. degree in information and communication systems, both from Wuhan University, Wuhan, China, in 2021 and 2024, respectively. This work was done during his master’s study at Wuhan University. He is currently pursuing his Ph.D. degree in the Environmental Computational Science and Earth Observation Laboratory, EPFL, Sion, Switzerland. His research focuses on object detection, visual geo-localization, and multi-modal learning.

Fang Xu received her B.S. degree in electronic and information engineering and her Ph.D. degree in communication and information system from Wuhan University, Wuhan, China, in 2018 and 2023, respectively. She is a postdoctoral researcher with the School of Computer Science, Wuhan University, China. Her research involves remote sensing image processing, including multi-modal data matching and fusion.