Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges
Abstract—In the past decade, object detection has achieved significant progress in natural images but not in aerial images, due to the
massive variations in the scale and orientation of objects caused by the bird's-eye view of aerial images. More importantly, the lack of
large-scale benchmarks has become a major obstacle to the development of object detection in aerial images (ODAI). In this paper,
we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI. The proposed
DOTA dataset contains 1,793,658 object instances, annotated with oriented bounding boxes across 18 categories, collected from 11,268
aerial images. Based on this large-scale and well-annotated dataset, we build baselines covering 10 state-of-the-art algorithms with over 70
configurations, and we evaluate the speed and accuracy of each model. Furthermore, we provide a code
library for ODAI and build a website for evaluating different algorithms. Previous challenges run on DOTA have attracted more than 1300
teams worldwide. We believe that the expanded large-scale DOTA dataset, the extensive baselines, the code library and the challenges
can facilitate the design of robust algorithms and reproducible research on the problem of object detection in aerial images.
Index Terms—Object detection, remote sensing, aerial images, oriented object detection, benchmark dataset.
• J. Ding and L. Zhang are with the State Key Lab. LIESMARS, Wuhan University, Wuhan, 430079, China. Email: {jian.ding, zlp62}@whu.edu.cn.
• N. Xue is with the National Engineering Research Center for Multimedia Software, School of Computer Science and Institute of Artificial Intelligence, Wuhan University, Wuhan, 430072, China. Email: [email protected].
• G.-S. Xia is with the National Engineering Research Center for Multimedia Software, School of Computer Science and Institute of Artificial Intelligence, and also the State Key Lab. LIESMARS, Wuhan University, Wuhan, 430072, China. Email: [email protected].
• X. Bai is with the School of Electronic Information, Huazhong University of Science and Technology, Wuhan, 430079, China. Email: [email protected].
• W. Yang is with the School of Electronic Information, and the State Key Lab. LIESMARS, Wuhan University, Wuhan, 430072, China. Email: [email protected].
• M. Y. Yang is with the Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, The Netherlands. Email: [email protected].
• S. Belongie is with the Department of Computer Science, Cornell University and Cornell Tech. Email: [email protected].
• J. Luo is with the Department of Computer Science, University of Rochester, Rochester, NY 14627. Email: [email protected].
• M. Datcu is with the Remote Sensing Technology Institute, German Aerospace Center (DLR), 82234, Germany. Email: [email protected].
• M. Pelillo is with DAIS, Ca' Foscari University of Venice, Italy. Email: [email protected].
• The studies in this paper have been supported by the NSFC projects under the contracts No.61922065, No.61771350 and No.41820104006. Dr. Nan Xue was also supported by the National Post-Doctoral Program for Innovative Talents under Grant BX20200248.
• The corresponding author is Gui-Song Xia ([email protected]).

1. This paper uses the term "aerial" to refer to any overhead image looking approximately straight down onto the Earth, including both satellite images and airborne images, for simplification unless otherwise indicated. We use the term "airborne" if we do not want to include satellite images.

1 INTRODUCTION
Currently, Earth vision (also known as Earth observation and remote sensing) technologies enable us to observe the Earth's surface with aerial images¹ at a resolution of up to half a meter. Although challenging, developing mathematical tools and numerical algorithms is necessary for interpreting these huge volumes of images, among which object detection refers to localizing objects of interest (e.g., vehicles and ships) on the Earth's surface and predicting their categories. Object detection in aerial images (ODAI) has been an essential step in many real-world applications such as urban management, precision agriculture, emergency rescue and disaster relief [1], [2]. Although extensive studies have been devoted to object detection in aerial images and appreciable breakthroughs have been made [3]–[8], the task still has numerous difficulties such as arbitrary orientations, scale variations, extremely nonuniform object densities and large aspect ratios (ARs), as shown in Fig. 1.

Among these difficulties, the arbitrary orientation of objects caused by the overhead view is the main difference between natural images and aerial images, and it complicates the object detection task in two ways. First, rotation-invariant feature representations are preferred for detecting arbitrarily oriented objects, but they are often beyond the capability of most current deep neural network models. Although methods such as those designed in [6], [9], [10] use rotation-invariant convolutional neural networks (CNNs), the problem is far from solved. Second, the horizontal bounding box (HBB) object representation used in conventional object detection [11]–[13] cannot localize oriented objects precisely, such as ships and large vehicles, as shown in Fig. 1. The oriented bounding box (OBB) object representation is more appropriate for aerial images [4], [14]–[17]. It allows us to distinguish densely packed instances (as shown in Fig. 3) and to extract rotation-invariant features [4], [18], [19]. The OBB object representation actually introduces a new object detection task, called oriented object detection. In contrast with horizontal object detection [8], [20]–[22], oriented object detection is a recently
Fig. 1. An example image taken from DOTA. (a) A typical image in DOTA consisting of many instances from multiple categories. (b), (c), (d), (e)
are cropped from the source image. We can see that instances such as small vehicles have arbitrary orientations. There is also a massive scale
variation across different instances. Moreover, the instances are not distributed uniformly. The instances are sparse in most areas but crowded in
some local areas. Large vehicles and ships have large ARs. (f) and (g) exhibit the size and orientation histograms, respectively, for all instances.
emerging research direction, and most of the methods for this new task attempt to transfer successful deep-learning-based object detectors pre-trained on large-scale natural image datasets (e.g., ImageNet [12] and Microsoft Common Objects in Context (MS COCO) [13]) to aerial scenes [18], [19], [23]–[25], due to the lack of large-scale annotated aerial image datasets.

To mitigate the dataset problem, some public datasets of aerial images have been created, see e.g. [7], [15]–[17], [26], but they contain a limited number of instances and tend to use images taken under ideal conditions (e.g., clear backgrounds and centered objects), which cannot reflect the real-world difficulties of the problem. The recently released xView [27] dataset provides a wide range of categories and contains large quantities of instances in complicated scenes. However, it annotates the instances with HBBs instead of the more precise OBBs. Thus, a large-scale dataset that has OBB annotations and reflects the difficulties of real-world applications of aerial images is in high demand.

Another issue with ODAI is that the module designs and hyperparameter settings of conventional object detectors learned from natural images are not appropriate for aerial images due to domain differences. Thus, when developing new algorithms, comprehensive baselines and sufficient ablative analyses of models on aerial images are required. However, comparing different algorithms is difficult due to the diversity of hardware, software platforms, detailed settings and so on. These factors influence both speed and accuracy. Therefore, when building the baselines, implementing the algorithms with a unified code library and keeping the hardware and software platform the same is highly desirable. Nevertheless, current object detection libraries, e.g., MMDetection [28] and Detectron [29], do not support oriented object detection.

To address the above-mentioned problems, in this paper we first extend the preliminary version of DOTA, i.e., DOTA-v1.0 [14], to DOTA-v2.0. Specifically, DOTA-v2.0 collects 11,268 aerial images from various sensors and platforms and contains approximately 1.8 million object instances annotated with OBBs in 18 common categories, which, to our knowledge, makes it the largest public Earth vision object detection dataset. Then, to facilitate algorithm development and comparison on DOTA, we provide a well-designed code library that supports oriented object detection in aerial images. Based on the code library, we also build more comprehensive baselines than the preliminary version [14], keeping the hardware, software platform, and settings the same. In total, we evaluate 10 algorithms and over 70 models with different configurations. We then provide detailed speed and accuracy analyses to explore module designs and parameter settings for aerial images and to guide future research. These experiments verify the large differences in object detector design between natural and aerial images and provide materials for universal object detection algorithms [30].

The main contributions of this paper are three-fold:

• To the best of our knowledge, the expanded DOTA is the largest dataset for object detection in Earth vision. The OBB annotations of DOTA not only provide a large-scale benchmark for object detection in Earth vision but also pose interesting algorithmic questions
and challenges to generalized object detection in computer vision.
• We build a code library for object detection in aerial images. This is expected to facilitate the development and benchmarking of object detection algorithms in aerial images with both HBB and OBB representations.
• With the expanded DOTA, we evaluate 10 representative algorithms with over 70 model configurations, providing comprehensive analyses that can guide the design of object detection algorithms in aerial images.

The dataset, code library, and regular evaluation server are available and maintained on the DOTA website². It is worth noting that the creation and use of DOTA have already advanced object detection in aerial images. For instance, the regular DOTA evaluation server and the two object detection contests organized at the 2018 International Conference on Pattern Recognition (ICPR'2018, with DOTA-v1.0)³ and the 2019 Conference on Computer Vision and Pattern Recognition (CVPR'2019, with DOTA-v1.5)⁴ have attracted approximately 1300 registrations. We believe that our new DOTA dataset, with a comprehensive code library and an online evaluation platform, will further promote reproducible research in Earth vision.

2. https://captain-whu.github.io/DOTA/
3. https://captain-whu.github.io/ODAI/results.html
4. https://captain-whu.github.io/DOAI2019/challenge.html

2 RELATED WORK
Well-annotated datasets have played an important role in data-driven computer vision research [12], [13], [31]–[35] and have promoted cutting-edge research in a number of tasks such as object detection and classification. In this section, we first review object detection datasets of natural and aerial images. Then we discuss recent deep-learning-based object detectors for aerial images. Finally, we briefly review code libraries for object detection.

2.1 Datasets for Conventional Object Detection
As a pioneer, PASCAL Visual Object Classes (VOC) [11] held challenges on object detection from 2005 to 2012. The computer vision community has widely adopted the PASCAL VOC datasets and their evaluation metrics. Specifically, the PASCAL VOC Challenge 2012 dataset contains 11,530 images, 20 classes, and 27,450 annotated bounding boxes. Later, the ImageNet dataset [12] was developed; it is an order of magnitude larger than PASCAL VOC, containing 200 classes and approximately 500,000 annotated bounding boxes. However, non-iconic views are not addressed. Then MS COCO [13] was released, containing a total of 328K images, 91 categories, and 2.5 million labeled segmented objects. MS COCO has on average more instances and categories per image and contains more contextual information than PASCAL VOC and ImageNet. It is worth noticing that, in Earth vision, the image size can be extremely large (e.g., 20,000 × 20,000 pixels), so the number of images cannot reflect the scale of a dataset. In this case, the pixel area is more reasonable for comparing the scale of datasets of natural and aerial images. Moreover, the large images include more instances per image and more contextual information. Tab. 1 provides the detailed comparisons.

TABLE 1
DOTA vs. general object detection datasets. BBox is short for bounding box; Avg. BBox quantity indicates the average number of bounding boxes per image. For PASCAL VOC (07++12), we count the whole PASCAL VOC 07 and the training and validation (trainval) set of PASCAL VOC 12. DOTA has a scale comparable with the large-scale datasets for object detection in natural images. Note that for the average number of instances per image, DOTA surpasses the other datasets.

Dataset                    Classes   Image quantity   Megapixel area   BBox quantity   Avg. BBox quantity
PASCAL VOC (07++12)        20        21,503           5,133            52,090          2.42
MS COCO (2014 trainval)    80        123,287          32,639           886,266         7.19
ImageNet (2014 train)      200       456,567          82,820           478,807         1.05
DOTA-v1.0                  15        2,806            19,173           188,282         67.10
DOTA-v1.5                  16        2,806            19,173           402,089         143.73
DOTA-v2.0                  18        11,268           126,306          1,793,658       159.18

2.2 Datasets for Object Detection in Aerial Images
In aerial object detection, a dataset resembling MS COCO and ImageNet, both in terms of the number of images and the detailed annotations, has been missing, which has become one of the main obstacles to research in Earth vision, especially for developing deep-learning-based algorithms. In Earth vision, many aerial image datasets are built for actual demands in a specific category, such as building datasets [7], [36], vehicle datasets [8], [15], [16], [26], [37]–[39], ship datasets [4], [40], and plane datasets [17], [41]. Although some public datasets [17], [42]–[45] have multiple categories, they contain only a limited number of samples, which is hardly sufficient for training robust deep models. For example, NWPU [42] only contains 800 images, 10 classes and 3,651 instances. To alleviate this problem, our preliminary work DOTA-v1.0 [14] presented a dataset with 15 categories and 188,282 instances, which for the first time enables us to efficiently train robust deep models for ODAI without the help of large-scale datasets of natural images, such as MS COCO and ImageNet. Later, iSAID [46] provided an instance segmentation extension of DOTA-v1.0 [14]. A notable dataset is xView [27], which contains 1,413 images, 16 main categories, 60 fine-grained categories, and 1 million instances. Another dataset, DIOR [47], provides a number of instances comparable to DOTA-v1.0 [14]. However, the instances in xView and DIOR are both annotated with HBBs, which are not suitable for precisely detecting arbitrarily oriented objects in aerial images. In addition, VisDrone [48] is also a large-scale dataset of drone images, but it focuses more on video object detection and tracking, and its image subset for object detection is not very large. Furthermore, most of the previous datasets are heavily imbalanced in favor of positive samples, whose negative samples are not sufficient to represent the real-world distribution.

As we stated previously [14], a good dataset for aerial image object detection should have the following properties: 1) substantial annotated data to facilitate data-driven, especially deep-learning-based, methods; 2) large images to contain more contextual information; 3) OBB annotations to describe the precise locations of objects; and 4) balance in image sources, as pointed out in [49].
TABLE 2
DOTA vs. object detection datasets in aerial images. HBB is horizontal bounding box, OBB is oriented bounding box, and CP is center point.
Columns: Dataset | Source | Annotation | # of main categories | Total # of categories | # of instances | # of images | Image width | Year
TAS [26] satellite HBB 1 1 1,319 30 792 2008
SZTAKI-INRIA [7] multi source OBB 1 1 665 9 ∼800 2012
NWPU VHR-10 [42] multi source HBB 10 10 3,651 800 ∼1000 2014
VEDAI [15] satellite OBB 3 9 2,950 1,268 512, 1024 2015
DLR 3k [16] aerial OBB 2 8 14,235 20 5616 2015
UCAS-AOD [17] Google Earth OBB 2 2 14,596 1,510 ∼1000 2015
COWC [37] aerial CP 1 1 32,716 53 2000−19,000 2016
HRSC2016 [4] Google Earth OBB 1 26 2,976 1,061 ∼1100 2016
RSOD [43] Google Earth HBB 4 4 6,950 976 ∼1000 2017
CARPK [8] drone HBB 1 1 89,777 1,448 1280 2017
ITCVD [38] aerial HBB 1 1 228 23,543 5616 2018
LEVIR [44] Google Earth HBB 3 3 11,000 22,000 800−600 2018
xView [27] satellite HBB 16 60 1,000,000 1,413 ∼3000 2018
VisDrone [48] drone HBB 10 10 54,200 10,209 2000 2018
SpaceNet MVOI [36] satellite polygon 1 1 126,747 60,000 900 2019
HRRSD [45] multi source HBB 13 13 55,740 21,761 152−10569 2019
DIOR [47] Google Earth HBB 20 20 190,288 23,463 800 2019
iSAID [46] multi source polygon 14 15 655,451 2,806 800−13,000 2019
FGSD [40] Google Earth OBB 1 43 5,634 2,612 930 2020
RarePlanes [41] satellite polygon 1 110 644,258 50,253 1080 2020
DOTA-v1.0 [14] multi source OBB 14 15 188,282 2,806 800−13,000 2018
DOTA-v1.5 multi source OBB 15 16 402,089 2,806 800−13,000 2019
DOTA-v2.0 multi source OBB 17 18 1,793,658 11,268 800−20,000 2021
DOTA is built considering these principles (unless otherwise specified, DOTA refers to DOTA-v2.0). Detailed comparisons of these existing datasets and DOTA are shown in Tab. 2. Compared to other aerial datasets, as we shall see in Sec. 4, DOTA is challenging due to its large number of object instances, arbitrary orientations, various categories, density distribution, and diverse aerial scenes from various image sources. These properties make DOTA helpful for real-world applications.

2.3 Deep Models for Object Detection in Aerial Images
Object detection in aerial images is a longstanding problem. Recently, with the development of deep learning, many researchers in Earth vision have adapted deep object detectors [50]–[54] developed for natural images to aerial images. However, the challenges caused by the domain shift need to be addressed. Here, we highlight some notable works.

Objects in aerial images are often arbitrarily oriented due to the bird's-eye view, and the scale variations are larger than those in natural images. To handle rotation variations, a simple model [9] plugs an additional rotation-invariant layer into R-CNN [52], relying on rotation data augmentation. The oriented response network (ORN) introduces active rotating filters (ARFs) to produce rotation-invariant features without using data augmentation, which is adopted by the rotation-sensitive regression detector (RRD) [23]. The deformable modules [55] designed for general object deformation are also widely used in aerial images. The methods mentioned above do not fully utilize the OBB annotations. When OBB annotations are available, the rotated region CNN (RR-CNN) [56] uses rotated region-of-interest (RRoI) pooling to extract rotation-invariant region features. However, RR-CNN [56] generates proposals in a hand-crafted way. The RoI Transformer [18] instead uses the supervision of OBBs to learn RoI-wise spatial transformations. The later S²A-Net [57] extracts spatially invariant features in one-stage detectors. To solve the challenges of scale variations, feature pyramids [19], [58] and image pyramids [24], [25] are widely used to extract scale-invariant features in aerial images. We evaluate the geometric transformation network modules and geometric data augmentations in Sec. 6.1.

Crowded instances represented by HBBs are difficult to distinguish (see Fig. 3). Traditional HBB-based non-maximum suppression (NMS) will fail in such cases. Therefore, these methods [18], [24], [25] use rotated NMS (R-NMS), which requires precise detections, to address this problem. Similar to text and face detection in natural scenes, e.g., [23], [59]–[61], precise ODAI can also be modeled as an oriented object detection task. Most of the previous works [14], [23]–[25], [62] consider it a regression problem and regress the offsets of the OBB ground truth relative to anchors (or proposals). However, the definition of an OBB is ambiguous: for example, there are four permutations of the corner points of a quadrilateral. Faster R-CNN OBB [14] solves this by using a defined rule to determine the order of points in OBBs. The work in [63] further uses a gliding offset and an obliquity factor to eliminate the ambiguity. The circular smooth label (CSL) [64] transforms the regression of the angle into a classification problem to avoid the ambiguity. Mask OBB [65] and CenterMap [66] treat object detection as a pixel-level classification problem to avoid the ambiguity. Mask-based methods converge more easily but require more floating point operations (FLOPs) than regression-based methods. We will give a more detailed comparison between them in one unified code library in Sec. 6.1.1.

The final challenge is detecting objects in large images. Aerial images are usually extremely large (over 20k × 20k pixels), and current GPU memory capacity is insufficient to process them directly, while downsampling a large image to a small size would lose the detailed information. To solve this problem [14], [16], the large images can simply be split into small patches; after obtaining the results on these patches, the results are integrated back into the large images. To speed up inference on large images, these methods [20]–[22], [67] first find regions that are likely to contain instances in the large
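A minimal sketch of this split-and-merge inference strategy is given below. It is only an illustration under simplifying assumptions, not the DOTA development kit: run_detector stands for any patch-level detector, the detections are horizontal boxes (x1, y1, x2, y2, score) for brevity, and duplicates produced in the overlapping regions are merged with plain greedy NMS.

```python
# Sketch of split-and-merge inference on a very large aerial image.
# Assumptions: patch > overlap, and run_detector(patch_image) returns an
# iterable of (x1, y1, x2, y2, score) boxes in patch coordinates.
import numpy as np

def split_positions(length, patch, overlap):
    """Patch start offsets so that consecutive patches overlap by `overlap` pixels."""
    stride = patch - overlap
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:        # make sure the image border is covered
        starts.append(length - patch)
    return starts

def nms(boxes, iou_thr=0.5):
    """Greedy horizontal NMS on an (N, 5) array of (x1, y1, x2, y2, score)."""
    order = boxes[:, 4].argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thr]
    return boxes[keep]

def detect_large_image(image, run_detector, patch=1024, overlap=200):
    """Run a patch-level detector over a huge image and merge the results."""
    h, w = image.shape[:2]
    all_boxes = []
    for y in split_positions(h, patch, overlap):
        for x in split_positions(w, patch, overlap):
            for x1, y1, x2, y2, score in run_detector(image[y:y + patch, x:x + patch]):
                # Shift the patch-level detection back to global image coordinates.
                all_boxes.append([x1 + x, y1 + y, x2 + x, y2 + y, score])
    if not all_boxes:
        return np.zeros((0, 5))
    return nms(np.array(all_boxes, dtype=float))
```

The 1,024-pixel patches and 200-pixel overlap mirror the settings used for the baselines later in the paper; for OBB results the same idea applies, with the horizontal NMS replaced by a rotated NMS.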
Fig. 2. Number of instances per image among DOTA and general object detection datasets. For PASCAL, ImageNet and MS COCO, we count the statistics of 10,000 random images. As the images in DOTA are very large (up to 20,000 × 20,000 pixels), for a fair comparison we count the statistics of 10,000 image patches with a size of 1024 × 1024, which is also the size used for the baselines in Sec. 5.2. DOTA-v2.0 has a wider range of the number of instances per image.
Fig. 4. Examples of annotated images in DOTA. We show three examples per category.
Fig. 5. Typical examples of images collected from Google Earth (a), GF&JL satellite (b) and CycloMedia (c).
Fig. 6. The statistics of the GSD in 30% of the images in DOTA-v2.0 (number of images, log scale, versus GSD in m/pixel).
TABLE 4
Comparison of the instance size distributions of aerial and natural images in some datasets.

Dataset             10-50 pixels   50-300 pixels   >300 pixels
PASCAL VOC [11]     0.14           0.61            0.25
MS COCO [13]        0.43           0.49            0.08
NWPU VHR-10 [42]    0.15           0.83            0.02
DLR 3K [16]         0.93           0.07            0
OBBs to HBBs. To directly predict the HBB results, we use RetinaNet [85], Mask R-CNN, Cascade Mask R-CNN, Hybrid Task Cascade and Faster R-CNN [54] as baselines. For the OBB predictions, we introduce the methods in the following section.

5.2.2 Baselines with OBBs
Most of the state-of-the-art object detection methods are not designed for oriented objects. To enable these methods to predict OBBs, we build the baselines in two ways. The first is to change the HBB head to an OBB head, which regresses the offsets of OBBs relative to the HBBs. The second is a mask head, which treats the OBB as a coarse mask and predicts a pixel-level classification from each RoI.

OBB Head. To predict OBBs, the previous Faster R-CNN OBB [14] and Textboxes++ [62] modified the RoI head of Faster R-CNN and the anchor head of the single-shot detector (SSD), respectively, to regress quadrangles. In this paper, we use the representation (x, y, w, h, θ) instead of {(x_i, y_i) | i = 1, 2, 3, 4} for OBB regression. More precisely, rectangular RoIs (anchors) can be written as B = (x_min, y_min, x_max, y_max). We can also consider such an RoI a special OBB and rewrite it as R = (x, y, w, h, θ). For matching, IoUs are calculated between the horizontal RoIs (anchors) and the HBBs of the ground truths for computational simplicity. Each OBB has four forms: G = {gt_i | i = 1, 2, 3, 4}, where gt_1 = (x_g, y_g, w_g, h_g, θ_g), gt_2 = (x_g, y_g, w_g, h_g, θ_g + π), gt_3 = (x_g, y_g, h_g, w_g, θ_g), and gt_4 = (x_g, y_g, h_g, w_g, θ_g + π). Before calculating the targets, we choose the best matched ground-truth form. The index of the best matched form is calculated by ξ = arg min_i D(R, gt_i), where D is a distance function, which could be the Euclidean distance or another distance function. We denote the best matched form by gt_ξ = (x_b, y_b, w_b, h_b, θ_b). Then the learning target is calculated as

    t_x = (x_b − x)/w,    t_y = (y_b − y)/h,
    t_w = log(w_b/w),     t_h = log(h_b/h),        (1)
    t_θ = θ_b − θ.
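To make the target assignment above concrete, here is a minimal numpy sketch of the encoding; it is an illustration, not the code of the released library. It assumes that the RoI is a horizontal box (x, y, w, h) viewed as a special OBB with θ = 0, that the equivalent ground-truth forms G defined above are supplied as a list, and that D is the plain Euclidean distance on the five parameters.

```python
# Sketch of the OBB target encoding of Eq. (1).
import numpy as np

def encode_obb_targets(roi, gt_forms):
    """roi: (x, y, w, h) horizontal RoI; gt_forms: list of (x, y, w, h, theta) forms."""
    x, y, w, h = roi
    anchor = np.array([x, y, w, h, 0.0])          # the RoI as a special OBB, theta = 0

    # xi = argmin_i D(R, gt_i): pick the best matched ground-truth form.
    gt_forms = np.asarray(gt_forms, dtype=float)  # shape (4, 5)
    xi = int(np.argmin(np.linalg.norm(gt_forms - anchor, axis=1)))
    xb, yb, wb, hb, tb = gt_forms[xi]

    # Eq. (1): offsets of the chosen form relative to the horizontal RoI.
    tx = (xb - x) / w
    ty = (yb - y) / h
    tw = np.log(wb / w)
    th = np.log(hb / h)
    tt = tb - 0.0                                  # theta of the horizontal RoI is 0
    return np.array([tx, ty, tw, th, tt])
```

At inference time, decoding simply inverts these five equations to recover (x_b, y_b, w_b, h_b, θ_b) from the predicted offsets.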
We then simply replace the HBB RoI head of Faster R-CNN and the anchor head of RetinaNet with the OBB head and obtain two models, called Faster R-CNN OBB and RetinaNet OBB. We also modify Faster R-CNN to predict both the HBB and the OBB in parallel, which is similar to Mask R-CNN [86]. We call this model Faster R-CNN H-OBB. We further evaluate deformable RoI pooling (Dpool) and the RoI Transformer by replacing the RoI Align in Faster R-CNN OBB. Then we have two more models: Faster R-CNN OBB + Dpool and Faster R-CNN OBB + RoI Transformer. Note that the RoI Transformer used here is slightly different from the original one, which uses Light-Head R-CNN [87] as the base detector, while we use Faster R-CNN.

Mask Head. Mask R-CNN [86] was originally used for instance segmentation. Although DOTA does not have pixel-level annotations for each instance, the OBB annotations can be considered coarse pixel-level annotations, so we can apply Mask R-CNN [86] to DOTA. During inference, we calculate the minimum OBBs that contain the predicted masks. The original Mask R-CNN [86] only applies a mask head to the top 100 HBBs in terms of the score. Due to the large number of instances per image, as illustrated in Fig. 2, we apply a mask head to all the HBBs after NMS. In this way, we evaluate Mask R-CNN [86], Cascade Mask R-CNN and Hybrid Task Cascade [88].

5.3 Codebase and Development Kit
We also provide an aerial object detection code library⁷ and a development kit⁸ for using DOTA. To construct the comprehensive baselines, we select MMDetection as the fundamental code library since it contains rich object detection algorithms and has a modular design. However, the original MMDetection [28] lacks the modules needed to support oriented object detection. Therefore, we enriched MMDetection with the OBB head described in Sec. 5.2.2 to enable OBB predictions. We also implemented modules such as rotated RoI Align and rotated position-sensitive RoI Align for rotated region feature extraction, which are crucial components in algorithms such as the rotated region proposal network (RRPN) [89] and the RoI Transformer [18]. These new modules are compatible with the modularly designed MMDetection, so we can easily create new algorithms for oriented object detection that are not restricted to the baseline methods in this paper. We also provide a development kit containing the necessary functions for object detection in DOTA, including:
- Loading and visualizing the ground truths.
- Calculating the IoU between OBBs, which is implemented in a mixture of Python and C. We provide both CPU and GPU versions.
- Evaluating the results. The evaluation metrics are described in Sec. 5.1.
- Cropping and merging images. When using the large images in DOTA, one can utilize this tool kit to split an original image into patches. After testing on the patches, one can use the tools to map the results of the patches back to the original image coordinates and apply NMS.

6 RESULTS
6.1 Benchmark Results and Analyses
In this section, we conduct a comprehensive evaluation of over 70 experiments and analyze the results. First, we demonstrate the baseline results of 10 algorithms on DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0. The baselines cover both two-stage and one-stage algorithms. For most algorithms, we report the mAPs of the HBB and OBB predictions, respectively, except for RetinaNet and Faster R-CNN, since they do not support oriented object detection. For algorithms with only OBB heads (RetinaNet OBB, Faster R-CNN OBB, Faster R-CNN OBB + Dpool, Faster R-CNN OBB + RoI Transformer), we obtain their HBB results by transferring from the OBBs as described in Sec. 5.2.1. For algorithms with both HBB and OBB heads (Mask R-CNN, Cascade Mask R-CNN, Hybrid Task Cascade*, and Faster R-CNN H-OBB), the HBB mAP is the maximum of the predicted HBB mAP and the transferred HBB mAP. It can be seen that the OBB mAP

7. https://github.com/dingjiansw101/AerialDetection
8. https://github.com/CAPTAIN-WHU/DOTA_devkit
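The two geometric conversions used by these baselines, turning a predicted mask into its minimum enclosing OBB (the Mask Head inference step) and transferring an OBB prediction to an HBB by taking its axis-aligned bounding box (Sec. 5.2.1), can be sketched as follows. This is an OpenCV-based illustration with function names of our own choosing, not the devkit API, and the angle convention may need to be adapted to the θ definition of a given codebase.

```python
# Sketch of mask -> minimum OBB and OBB -> enclosing HBB conversions.
import cv2
import numpy as np

def mask_to_obb(binary_mask):
    """Minimum-area oriented box (x, y, w, h, theta in radians) around a binary mask."""
    contours, _ = cv2.findContours(binary_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    points = np.concatenate([c.reshape(-1, 2) for c in contours], axis=0)
    (cx, cy), (w, h), angle_deg = cv2.minAreaRect(points)
    return cx, cy, w, h, np.deg2rad(angle_deg)

def obb_to_hbb(obb):
    """Axis-aligned box (x1, y1, x2, y2) enclosing an OBB given as (x, y, w, h, theta)."""
    x, y, w, h, theta = obb
    corners = cv2.boxPoints(((x, y), (w, h), np.rad2deg(theta)))   # 4 corner points
    x1, y1 = corners.min(axis=0)
    x2, y2 = corners.max(axis=0)
    return float(x1), float(y1), float(x2), float(y2)
```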
TABLE 6
Baseline results on DOTA. For the evaluation of DOTA-v2.0, we use the DOTA-v2.0 test-dev set. The implementation details are described in
Sec. 5.2. All the algorithms in this table adopt the ResNet-50 with an FPN as backbone. The speed refers to the inference speed, which is reported
for a single NVIDIA Tesla V100 GPU on DOTA-v2.0 test-dev. The image size is 1,024 × 1,024. Hybrid Task Cascade* means that the semantic
branch is not used since there are no semantic annotations in DOTA.
method | speed (fps) | DOTA-v1.0 HBB mAP | DOTA-v1.0 OBB mAP | DOTA-v1.5 HBB mAP | DOTA-v1.5 OBB mAP | DOTA-v2.0 HBB mAP | DOTA-v2.0 OBB mAP
RetinaNet [85] 16.7 67.45 - 61.64 - 49.31 -
RetinaNet OBB [85] 12.1 69.05 66.28 62.49 59.16 49.26 46.68
Mask R-CNN [86] 9.7 71.61 70.71 64.54 62.67 51.16 49.47
Cascade Mask R-CNN [88] 7.2 71.36 70.96 64.31 63.41 50.98 50.04
Hybrid Task Cascade* [88] 7.9 72.49 71.21 64.47 63.40 50.88 50.34
Faster R-CNN [54] 14.3 70.76 - 64.16 - 50.71 -
Faster R-CNN OBB [14] 14.1 71.91 69.36 63.85 62.00 49.37 47.31
Faster R-CNN OBB + Dpool [55] 12.1 71.83 70.14 63.67 62.20 50.48 48.77
Faster R-CNN H-OBB [14] 13.7 70.37 70.11 64.43 62.57 50.38 48.90
Faster R-CNN OBB + RoI Transformer [18] 12.4 74.59 73.76 66.09 65.03 53.37 52.81
TABLE 7
Baseline results of class-wise AP on DOTA-v1.0. The abbreviations of algorithms are: Mask R-CNN (MR), Cascade Mask R-CNN (CMR),
Hybrid Task Cascade without a semantic branch (HTC*), Faster R-CNN (FR), Deformable RoI Pooling (Dp) and RoI Transformer (RT). The short
names for categories are defined as: BD–Baseball diamond, GTF–Ground field track, SV–Small vehicle, LV–Large vehicle, TC–Tennis court,
BC–Basketball court, ST–Storage tank, SBF–Soccer-ball field, RA–Roundabout, SP–Swimming pool, HC–Helicopter.
OBB Results
method Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC mAP
RetinaNet [85] 86.54 77.45 42.8 64.87 71.06 58.5 73.53 90.72 80.97 66.67 52.42 62.16 60.79 64.84 40.83 66.28
MR [86] 88.7 74.13 50.75 63.66 73.64 73.98 83.68 89.74 78.92 80.26 47.43 65.09 64.79 66.09 59.79 70.71
CMR [88] 88.93 75.21 51.55 64.9 74.39 75.37 84.74 90.23 77.48 81.51 46.57 63.49 65.39 67.63 56.96 70.96
HTC* [88] 89.17 75.05 51.95 64.5 74.19 76.3 86.05 90.55 79.51 77.18 50.3 61.23 65.89 68.29 58.01 71.21
FR OBB [14] 88.42 74.24 45.31 61.49 73.53 70.03 77.76 90.87 81.8 82.64 48.75 60.14 63.42 67.65 54.31 69.36
FR OBB + Dp [55] 88.82 74.12 45.44 63.07 73.13 73.59 84.39 90.71 82.28 83.59 42.76 58.49 63.52 68.25 59.89 70.14
FR H-OBB [14] 88.41 79.35 45.39 63.16 73.91 72.11 83.86 90.25 77.34 81.04 48.5 60.53 63.88 66.43 57.53 70.11
FR OBB + RT [18] 88.34 77.07 51.63 69.62 77.45 77.15 87.11 90.75 84.9 83.14 52.95 63.75 74.45 68.82 59.24 73.76
HBB Results
method Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC mAP
RetinaNet [85] 88.28 77.76 47.47 59.07 73.83 63.49 77.69 90.43 78.57 65.87 48.67 61.82 68.92 71.59 38.22 67.45
RetinaNet OBB [85] 88.51 78.21 47.9 64.46 75.17 72.9 78.5 90.72 82.22 67.11 51.33 62.16 69.47 69.79 37.3 69.05
MR [86] 88.79 79.06 53.04 63.14 78.22 65.23 77.94 89.6 81.99 81.06 47.17 65.19 72.72 68.67 62.27 71.61
CMR [88] 88.93 81.5 52.85 64.01 79.18 65.85 78.12 90.08 77.48 81.87 46.45 62.88 72.99 69.95 58.32 71.36
HTC* [88] 89.11 79.76 53.87 64.4 79.06 76.23 86.57 90.5 79.51 81.66 50.44 63.75 73.33 70.61 48.59 72.49
FR [54] 89.02 75.8 53.47 60.8 78.02 65.56 78.01 90.1 77.3 81.7 47.12 61.48 72.59 70.89 59.54 70.76
FR OBB [14] 88.37 75.45 52.11 59.98 78.08 73.42 85.65 90.81 83.22 83.18 51.45 60.01 72.44 71.09 53.44 71.91
FR OBB + Dp [55] 88.86 77.4 51.6 63.12 77.62 74.73 86.2 90.72 82.73 83.75 44.35 58.81 71.97 70.98 54.68 71.83
FR H-OBB [14] 88.67 78.95 52.63 57.34 78.55 65.22 78.08 90.69 82.01 82.06 46.08 61.46 72.43 70.34 51.12 70.37
FR OBB + RT [18] 88.47 81.0 54.1 69.19 78.42 81.16 87.35 90.75 84.9 83.55 52.63 62.97 75.89 71.31 57.22 74.59
TABLE 8
Baseline results of class-wise AP on DOTA-v1.5. The abbreviations of algorithms are: Mask R-CNN (MR), Cascade Mask R-CNN (CMR),
Hybrid Task Cascade without a semantic branch (HTC*), Faster R-CNN (FR), Deformable RoI Pooling (Dp) and RoI Transformer (RT). The short
names for categories are defined as: BD–Baseball diamond, GTF–Ground field track, SV–Small vehicle, LV–Large vehicle, TC–Tennis court,
BC–Basketball court, ST–Storage tank, SBF–Soccer-ball field, RA–Roundabout, SP–Swimming pool, HC–Helicopter, CC–Container crane.
OBB Results
method Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC mAP
RetinaNet [85] 71.43 77.64 42.12 64.65 44.53 56.79 73.31 90.84 76.02 59.96 46.95 69.24 59.65 64.52 48.06 0.83 59.16
MR [86] 76.84 73.51 49.9 57.8 51.31 71.34 79.75 90.46 74.21 66.07 46.21 70.61 63.07 64.46 57.81 9.42 62.67
CMR [88] 77.77 74.62 51.09 63.44 51.64 72.9 79.99 90.35 74.9 67.58 49.54 72.85 64.19 64.88 55.87 3.02 63.41
HTC* [88] 77.8 73.67 51.4 63.99 51.54 73.31 80.31 90.48 75.12 67.34 48.51 70.63 64.84 64.48 55.87 5.15 63.4
FR OBB [14] 71.89 74.47 44.45 59.87 51.28 68.98 79.37 90.78 77.38 67.5 47.75 69.72 61.22 65.28 60.47 1.54 62.0
FR OBB + Dp [55] 71.79 73.61 44.76 61.99 51.34 70.04 79.67 90.78 76.58 67.73 44.58 70.51 61.8 65.49 64.35 0.15 62.2
FR H-OBB [14] 71.57 74.71 46.39 63.4 51.54 70.11 79.09 90.63 76.81 67.4 48.66 70.9 63.1 65.67 56.66 4.55 62.57
FR OBB + RT [18] 71.92 76.07 51.87 69.24 52.05 75.18 80.72 90.53 78.58 68.26 49.18 71.74 67.51 65.53 62.16 9.99 65.03
HBB Results
method Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC mAP
RetinaNet [85] 74.05 77.75 48.75 59.94 49.23 61.43 77.31 90.38 75.46 61.19 47.29 69.99 67.99 74.15 40.88 10.4 61.64
RetinaNet OBB [85] 71.66 77.22 48.71 65.16 49.48 69.64 79.21 90.84 77.21 61.03 47.3 68.69 67.22 74.48 46.16 5.78 62.49
MR [86] 78.36 77.41 53.36 56.94 52.17 63.6 79.74 90.31 74.28 66.41 45.49 71.32 70.77 73.87 61.49 17.11 64.54
CMR [88] 78.61 75.43 54.0 63.76 52.55 63.93 79.88 90.06 75.05 67.83 45.76 72.48 72.1 74.11 53.9 9.59 64.31
HTC* [88] 78.41 74.41 53.41 63.17 52.45 63.56 79.89 90.34 75.17 67.64 48.44 69.94 72.13 74.02 56.42 12.14 64.47
FR [54] 71.88 74.06 52.69 62.35 52.08 63.22 79.69 90.55 76.91 66.86 47.84 70.72 70.61 68.55 63.34 15.26 64.16
FR OBB [14] 71.91 71.6 50.58 61.95 51.99 71.05 80.16 90.78 77.16 67.66 47.93 69.35 69.51 74.4 60.33 5.17 63.85
FR OBB + Dp [55] 71.9 72.8 50.84 61.99 51.97 72.15 80.13 90.74 76.53 67.95 45.09 69.93 70.43 74.72 58.28 3.19 63.67
FR H-OBB [14] 71.6 74.14 52.69 62.85 52.13 71.32 79.75 90.64 76.29 67.52 49.31 71.27 72.11 73.91 55.11 10.26 64.43
FR OBB + RT [18] 71.92 75.21 54.09 68.1 52.54 74.87 80.79 90.46 78.58 68.41 51.57 71.48 74.91 74.84 56.66 13.01 66.09
TABLE 9
Baseline results of class-wise AP on DOTA-v2.0. The abbreviations of algorithms are: Mask R-CNN (MR), Cascade Mask R-CNN (CMR),
Hybrid Task Cascade without a semantic branch (HTC*), Faster R-CNN (FR), Deformable RoI Pooling (Dp) and RoI Transformer (RT). The short
names for categories are defined as: BD–Baseball diamond, GTF–Ground field track, SV–Small vehicle, LV–Large vehicle, TC–Tennis court,
BC–Basketball court, ST–Storage tank, SBF–Soccer-ball field, RA–Roundabout, SP–Swimming pool, HC–Helicopter, CC–Container crane,
Air–Airport, Heli–Helipad.
OBB Results
method Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC Air Heli mAP
RetinaNet [85] 70.63 47.26 39.12 55.02 38.1 40.52 47.16 77.74 56.86 52.12 37.22 51.75 44.15 53.19 51.06 6.58 64.28 7.45 46.68
MR [86] 76.2 49.91 41.61 60.0 41.08 50.77 56.24 78.01 55.85 57.48 36.62 51.67 47.39 55.79 59.06 3.64 60.26 8.95 49.47
CMR [88] 77.01 47.54 41.79 58.02 41.58 51.74 57.86 78.2 56.75 58.5 37.89 51.23 49.38 55.98 54.59 12.31 67.33 3.01 50.04
HTC* [88] 77.69 47.25 41.15 60.71 41.77 52.79 58.87 78.74 55.22 58.49 38.57 52.48 49.58 56.18 54.09 4.2 66.38 11.92 50.34
FR OBB [14] 71.61 47.2 39.28 58.7 35.55 48.88 51.51 78.97 58.36 58.55 36.11 51.73 43.57 55.33 57.07 3.51 52.94 2.79 47.31
FR OBB + Dp [55] 71.55 49.74 40.34 60.4 40.74 50.67 56.58 79.03 58.22 58.24 34.73 51.95 44.33 55.1 53.14 7.21 59.53 6.38 48.77
FR H-OBB [14] 71.39 47.59 39.82 59.01 41.51 49.88 57.17 78.36 56.87 58.24 37.66 51.86 44.61 55.49 54.74 7.56 61.88 6.6 48.9
FR OBB + RT [18] 71.81 48.39 45.88 64.02 42.09 54.39 59.92 82.7 63.29 58.71 41.04 52.82 53.32 56.18 57.94 25.71 63.72 8.7 52.81
HBB Results
method Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC Air Heli mAP
RetinaNet [85] 71.86 48.69 42.2 53.12 41.16 45.64 55.9 77.74 56.14 52.0 37.68 51.46 53.27 57.51 46.76 15.66 67.76 12.97 49.31
RetinaNet OBB [85] 70.99 46.77 43.76 55.08 41.55 51.06 58.01 77.78 57.72 53.5 37.66 51.72 53.56 57.32 49.18 12.01 64.53 4.48 49.26
MR [86] 77.61 51.35 44.89 60.12 42.51 48.1 57.93 77.84 57.55 57.88 36.53 51.71 54.79 58.93 60.01 14.42 60.32 8.43 51.16
CMR [88] 78.12 47.89 46.43 57.8 42.97 48.23 59.11 78.19 57.17 58.88 37.42 51.32 53.66 58.07 55.5 17.67 67.37 1.81 50.98
HTC* [88] 78.28 47.95 45.7 59.95 42.97 48.7 59.14 78.58 55.91 58.77 37.75 52.46 53.34 58.64 55.53 9.78 66.67 5.74 50.88
FR [54] 76.14 49.93 44.97 57.8 42.4 47.86 57.76 77.7 56.57 58.65 39.24 52.6 54.94 58.92 56.62 12.88 61.64 6.24 50.71
FR OBB [14] 71.68 45.8 45.56 58.7 42.18 51.28 59.28 79.01 58.74 58.75 37.26 51.93 52.36 58.08 54.12 8.48 53.01 2.4 49.37
FR OBB + Dp [55] 71.58 47.68 46.16 60.48 42.34 52.55 59.48 79.07 59.61 58.46 35.35 53.73 53.12 58.33 52.06 12.56 59.76 6.38 50.48
FR H-OBB [14] 77.14 50.54 45.6 57.53 42.27 48.09 57.6 78.4 59.78 57.8 36.64 52.13 52.51 58.42 48.91 14.99 60.0 8.47 50.38
FR OBB + RT [18] 71.84 48.2 47.84 63.94 42.97 54.79 60.74 82.88 63.51 58.89 40.63 52.83 55.7 58.87 57.94 27.04 64.27 7.68 53.37
TABLE 11
Results using different numbers of proposals on DOTA-v2.0 test-dev. The speeds are tested on a single Tesla V100 GPU. The other settings are the same as those in Tab. 6.

# of proposals                          1,000   2,000   3,000   4,000   5,000   6,000   7,000   8,000   9,000   10,000
Faster R-CNN OBB + RoI Transformer
  OBB mAP (%)                           51.72   52.81   52.81   53.24   53.29   53.51   53.70   53.94   53.93   53.92
  HBB* mAP (%)                          52.56   53.37   53.37   54.63   54.86   55.07   55.08   55.09   55.08   55.06
  speed (fps)                           14.4    12.4    12.2    9.1     8.7     7.8     7.5     6.5     6       5.7
Faster R-CNN OBB
  OBB mAP (%)                           47.10   47.31   48.03   48.09   48.32   48.35   48.48   48.49   48.49   48.49
  HBB* mAP (%)                          48.44   49.37   49.46   49.71   49.74   50.09   50.37   50.39   50.38   50.47
  speed (fps)                           15.8    14.1    12.5    11.9    10.9    9.9     9.3     9.1     8.4     7.8
Fig. 12. Visualization of the results on DOTA-v2.0 test-dev. The five models are from the DOTA-v2.0 models in Tab. 6. The predictions with
scores above 0.1 are shown. The results illustrate the performance in cases of orientation variations, density variations, large ARs and small ARs.
TABLE 12
Data augmentation experiments on DOTA-v1.5. Each column in this table indicates an experiment configuration. The first column represents our baseline method without additional data augmentations, while the other columns gradually add augmentations. We use Faster R-CNN OBB + RoI Transformer as the baseline. High overlap means an overlap of 512 between patches instead of 200 as in Tab. 6.

                      Baseline   ----- Data augmentation (added cumulatively) -----
High overlap                       X       X       X       X       X
Multi scale Train                          X       X       X       X
Multi scale Test                                   X       X       X
Rotation Train                                             X       X
Rotation Test                                                      X
OBB mAP                 65.03    67.57   69.44   73.62   76.43   77.60
HBB mAP                 66.09    67.94   70.63   74.63   77.24   78.88

the small vehicles, large vehicles, and ships). For example, the detection performance for the large vehicle category gains an improvement of 12.18 points compared to the previous results.

6.3 DOAI 2019 Challenge Results
DOTA-v1.5 was first used to hold the DOAI Challenge-2019 in conjunction with CVPR 2019⁹. There were 173 registrations in total; 13 teams submitted valid results on the OBB Task, and 22 teams submitted valid results on the HBB Task. The detailed leaderboards for the two tasks can be found on the DOAI Challenge-2019 website¹⁰, and the top 3 results are listed in Tab. 14. Notice that most of these results were achieved by using an ensemble of detection models, except [25], which used a single model and reported 74.9 and 77.9 in terms of mAP on the OBB and HBB tasks, respectively. In both the training and testing phases, multi-scaling and rotation strategies were used for data augmentation. With the same settings, our single model [18] achieved 76.43 and 77.24 in terms of mAP on the OBB and HBB tasks, respectively, as shown in Tab. 12, which was the best result reported on the OBB task.

9. https://captain-whu.github.io/DOAI2019/
10. https://captain-whu.github.io/DOAI2019/results.html
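As an illustration of the rotation augmentation examined in Tab. 12, the sketch below rotates an image patch about its centre and updates each OBB annotation (x, y, w, h, θ) accordingly. It is a simplified, hypothetical example rather than the actual training pipeline: border padding, boxes rotated out of the image, and the exact sign convention of θ are ignored and may need to be adjusted.

```python
# Sketch of rotation augmentation for an image patch and its OBB annotations.
import cv2
import numpy as np

def rotate_image_and_obbs(image, obbs, angle_deg):
    """Rotate the patch by angle_deg about its centre and rotate the OBBs with it."""
    h_img, w_img = image.shape[:2]
    centre = (w_img / 2.0, h_img / 2.0)
    M = cv2.getRotationMatrix2D(centre, angle_deg, 1.0)   # 2x3 affine matrix
    rotated = cv2.warpAffine(image, M, (w_img, h_img))

    new_obbs = []
    for x, y, w, h, theta in obbs:
        cx, cy = M @ np.array([x, y, 1.0])                # rotate the box centre
        # Width and height are unchanged; the box angle changes by the rotation.
        # Depending on the theta convention (image y axis pointing down), the
        # sign of the correction below may need to be flipped.
        new_obbs.append((cx, cy, w, h, theta - np.deg2rad(angle_deg)))
    return rotated, new_obbs
```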
TABLE 13
State-of-the-art results on DOTA-v1.0 [14]. The short names for categories are defined as: BD–Baseball diamond, GTF–Ground field track,
SV–Small vehicle, LV–Large vehicle, TC–Tennis court, BC–Basketball court, ST–Storage tank, SBF–Soccer-ball field, RA–Roundabout,
SP–Swimming pool, and HC–Helicopter. FR-O indicates the Faster R-CNN OBB detector, which is the previous official baseline provided by
DOTA-v1.0 [14]. ICN [19] is the image cascade network. The LR-O + RT means Light Head R-CNN + RoI Transformer. DR-101-FPN means
deformable ResNet-101 with an FPN. SCRDet means small, cluttered and rotated object detector. R-101-SF-MDA means ResNet-101 with
sampling fusion network (SF-Net) and multi-dimensional attention network (MDA-Net). RT means RoI Transformer. Aug. means the data
augmentations in Sec. 6.1.5. FR-O* means the re-implemented Faster R-CNN OBB detector, which is slightly different from the original FR-O [14].
OBB Results
method backbone Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC mAP
FR-O [14] R-101 79.42 77.13 17.70 64.05 35.30 38.02 37.16 89.41 69.64 59.28 50.30 52.91 47.89 47.40 46.30 54.13
ICN [19] DR-101-FPN 81.36 74.30 47.70 70.32 64.89 67.82 69.98 90.76 79.06 78.20 53.64 62.90 67.02 64.17 50.23 68.16
LR-O + RT [18] R-101-FPN 88.64 78.52 43.44 75.92 68.81 73.68 83.59 90.74 77.27 81.46 58.39 53.54 62.83 58.93 47.67 69.56
SCRDet [24] R-101-SF-MDA 89.98 80.65 52.09 68.36 68.36 60.32 72.41 90.85 87.94 86.86 65.02 66.68 66.25 68.24 65.21 72.61
DRN [90] H-104 89.71 82.34 47.22 64.10 76.22 74.43 85.84 90.57 86.18 84.89 57.65 61.93 69.30 69.63 58.48 73.23
Gliding Vertex [63] R-101-FPN 89.64 85.00 52.26 77.34 73.01 73.14 86.82 90.74 79.02 86.81 59.55 70.91 72.94 70.86 57.32 75.02
CenterMap [66] R-101-FPN 89.83 84.41 54.60 70.25 77.66 78.32 87.19 90.66 84.89 85.27 56.46 69.23 74.13 71.56 66.06 76.03
CSL [91] R-152-FPN 90.25 85.53 54.64 75.31 70.44 73.51 77.62 90.84 86.15 86.69 69.60 68.04 73.83 71.10 68.93 76.17
Li et al. [25] R-101-FPN 90.41 85.21 55.00 78.27 76.19 72.19 82.14 90.70 87.22 86.87 66.62 68.43 75.43 72.70 57.99 76.36
S²A-Net [57] R-50-FPN 88.89 83.60 57.74 81.95 79.94 83.19 89.11 90.78 84.87 87.81 70.30 68.25 78.30 77.01 69.58 79.42
FR-O* + RT [18] R-50-FPN 88.34 77.07 51.63 69.62 77.45 77.15 87.11 90.75 84.90 83.14 52.95 63.75 74.45 68.82 59.24 73.76
FR-O* + RT (Aug.) [18] R-50-FPN 87.89 85.01 57.83 78.55 75.22 84.37 88.04 90.88 87.28 85.79 71.04 69.67 79.00 83.29 73.43 79.82
HBB Results
method backbone Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC mAP
ICN [19] DR-101-FPN 89.97 77.71 53.38 73.26 73.46 65.02 78.22 90.79 79.05 84.81 57.20 62.11 73.45 70.22 58.08 72.45
SCRDet [24] R-101-SF-MDA 90.18 81.88 55.30 73.29 72.09 77.65 78.06 90.91 82.44 86.39 64.53 63.45 75.77 78.21 60.11 75.35
CenterMap [91] R-101-FPN 89.70 84.92 59.72 67.96 79.16 80.66 86.61 90.47 84.47 86.19 56.42 69.00 79.33 80.53 64.81 77.33
Li et al. [25] ResNet101 90.41 85.77 61.94 78.18 77.00 79.94 84.03 90.88 87.30 86.92 67.78 68.76 82.10 80.44 60.43 78.79
FR-O* + RT [18] R-50-FPN 88.47 81.00 54.10 69.19 78.42 81.16 87.35 90.75 84.90 83.55 52.63 62.97 75.89 71.31 57.22 74.59
FR-O* + RT (Aug.) [18] R-50-FPN 87.91 85.11 62.65 77.73 75.83 85.03 88.18 90.88 87.28 86.18 71.49 70.37 84.94 84.11 73.61 80.75
TABLE 14
DOAI 2019 Challenge Results. CC is the container crane for short. The other abbreviations for categories are the same as those in Tab. 13. The
USTC-NELSLIP, pca lab and czh, AICyber are the top 3 participants in the OBB and HBB Tasks. The FR-O means Faster R-CNN OBB. RT
means the RoI Transformer. Aug. means the data augmentation method described in Sec. 6.1.5. Note that FR-O + RT and FR-O + RT (Aug.) are
single models, while others are ensembles of multiple models.
OBB results
team (method) Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC mAP
USTC-NELSLIP [92] 89.19 85.32 57.27 80.86 73.87 81.26 89.5 90.84 85.94 85.62 69.5 76.73 76.34 76 77.84 57.33 78.34
pca lab [25] 89.11 83.83 59.55 82.8 66.93 82.51 89.78 90.88 85.36 84.22 71.95 77.89 78.47 74.27 74.77 53.22 77.84
czh 89 83.22 54.47 73.79 72.61 80.28 89.32 90.83 84.36 85 68.68 75.3 74.22 74.41 73.45 42.13 75.69
FR-O + RT [18] 71.92 76.07 51.87 69.24 52.05 75.18 80.72 90.53 78.58 68.26 49.18 71.74 67.51 65.53 62.16 9.99 65.03
FR-O + RT (Aug.) [18] 87.54 84.34 62.22 79.77 67.29 83.16 89.93 90.86 83.85 77.74 73.91 75.31 78.61 77.07 75.20 54.77 77.60
HBB results
team (method) Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC CC mAP
pca lab [25] 88.26 86.55 65.68 79.83 74.59 79.35 88.12 90.86 85.45 84.15 73.9 77.44 84.1 81.07 76.07 57.07 79.53
USTC-NELSLIP [92] 89.26 85.6 59.61 80.86 75.2 81.13 89.58 90.84 85.94 85.71 69.5 76.34 81.7 81.84 76.53 57.09 79.17
AICyber 89.2 85.56 64.44 74.07 77.45 81.5 89.65 90.83 85.72 86.03 69.82 76.34 82.89 82.95 74.64 44.02 78.44
FR-O + RT [18] 71.92 75.21 54.09 68.10 52.54 74.87 80.79 90.46 78.58 68.41 51.57 71.48 74.91 74.84 56.66 13.01 66.09
FR-O + RT (Aug.) [18] 87.79 84.33 63.75 79.13 72.92 83.08 90.04 90.86 83.85 77.80 73.30 75.66 84.84 82.16 75.20 57.39 78.88
7 CONCLUSION
ODAI is challenging. To advance future research, we introduce a large-scale dataset, DOTA, containing 1,793,658 instances annotated by OBBs. The DOTA statistics show that it can represent the real world well. Then, we build a code library for both oriented and horizontal ODAI to conduct a comprehensive evaluation. We hope these experiments can act as benchmarks for fair comparisons between ODAI algorithms. The results show that the hyperparameter selection and module design of the algorithms (e.g., the number of proposals) for aerial images are very different from those for natural images. This indicates that DOTA can be used as a supplement to natural scene images to facilitate universal object detection.

In the future, we will continue to extend the dataset, host more challenges, and integrate more algorithms for oriented object detection into our code library. We believe that DOTA, the challenges and the code library will not only promote the development of object detection in Earth vision but also pose interesting algorithmic questions for general object detection in computer vision.

ACKNOWLEDGMENT
We thank CycloMedia B.V. for providing the airborne images in DOTA-v2.0. We thank Huan Yi, Zhipeng Lin, Fan Hu, Pu Jin, Xinyi Tong, Xuan Hu, Zhipeng Dong, Liang Wu, Jun Tang, Linyan Cui, Duoyou Zhou, Tengteng Huang, and all the others who were involved in the annotation of DOTA. We also thank Zhen Zhu for his advice on running Faster R-CNN, and Jinwang Wang for valuable discussions on the details of the parameter settings. The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University.

REFERENCES
[1] E. J. Sadgrove, G. Falzon, D. Miron, and D. W. Lamb, "Real-time object detection in agricultural/remote environments using the multiple-expert colour feature extreme learning machine (MEC-ELM)," Computers in Industry, vol. 98, pp. 183–191, 2018.
[2] V. Reilly, H. Idrees, and M. Shah, "Detection and tracking of large number of targets in wide area surveillance," in ECCV. Springer, 2010, pp. 186–199.
[3] J. Porway, Q. Wang, and S. C. Zhu, "A hierarchical and contextual model for aerial image parsing," IJCV, vol. 88, no. 2, pp. 254–283, 2010.
[4] Z. Liu, H. Wang, L. Weng, and Y. Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geosci. Remote Sensing Lett., vol. 13, no. 8, pp. 1074–1078, 2016.
[5] T. Moranduzzo and F. Melgani, "Detecting cars in UAV images with a catalog-based approach," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6356–6367, 2014.
[6] G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection,” IEEE TIP, vol. 28, no. 1, pp. 265–278, 2018.
[7] C. Benedek, X. Descombes, and J. Zerubia, “Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics,” IEEE TPAMI, vol. 34, no. 1, pp. 33–50, 2012.
[8] M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, “Drone-based object counting by spatially regularized regional proposal network,” in ICCV, 2017, pp. 4145–4153.
[9] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, 2016.
[10] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Oriented response networks,” in CVPR. IEEE, 2017, pp. 4961–4970.
[11] M. Everingham, L. V. Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
[12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[13] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV, 2014, pp. 740–755.
[14] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in CVPR, 2018.
[15] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J Vis. Commun. Image R., vol. 34, pp. 187–203, 2016.
[16] K. Liu and G. Máttyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geosci. Remote Sensing Lett., vol. 12, no. 9, pp. 1938–1942, 2015.
[17] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Orientation robust object detection in aerial images using deep convolutional neural network,” in ICIP, 2015, pp. 3735–3739.
[18] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning roi transformer for oriented object detection in aerial images,” in CVPR, 2019, pp. 2849–2858.
[19] S. M. Azimi, E. Vig, R. Bahmanyar, M. Körner, and P. Reinartz, “Towards multi-class object detection in unconstrained remote sensing imagery,” in ACCV. Springer, 2018, pp. 150–165.
[20] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng, “R2-cnn: Fast tiny object detection in large-scale remote sensing images,” IEEE Trans. Geosci. Remote Sens., 2019.
[21] R. LaLonde, D. Zhang, and M. Shah, “Clusternet: Detecting small objects in large scenes by exploiting spatio-temporal information,” in CVPR, June 2018.
[22] F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, “Clustered object detection in aerial images,” in CVPR, 2019, pp. 8311–8320.
[23] M. Liao, Z. Zhu, B. Shi, G.-S. Xia, and X. Bai, “Rotation-sensitive regression for oriented scene text detection,” in CVPR, 2018, pp. 5909–5918.
[24] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu, “Scrdet: Towards more robust detection for small, cluttered and rotated objects,” in ICCV, 2019, pp. 8232–8241.
[25] C. Li, C. Xu, Z. Cui, D. Wang, Z. Jie, T. Zhang, and J. Yang, “Learning object-wise semantic representation for detection in remote sensing imagery,” in CVPR Workshops, 2019, pp. 20–27.
[26] G. Heitz and D. Koller, “Learning spatial context: Using stuff to find things,” in ECCV, 2008, pp. 30–43.
[27] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord, “xview: Objects in context in overhead imagery,” arXiv:1802.07856, 2018.
[28] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv:1906.07155, 2019.
[29] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron,” https://github.com/facebookresearch/detectron, 2018.
[30] X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, “Towards universal object detection by domain attention,” in CVPR, 2019, pp. 7289–7298.
[31] B. Yao, X. Yang, and S.-C. Zhu, “Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks,” in EMMCVPR, 2007, pp. 169–183.
[32] B. Zhou, À. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014, pp. 487–495.
[33] Q. You, J. Luo, H. Jin, and J. Yang, “Building a large scale dataset for image emotion recognition: The fine print and the benchmark,” in AAAI, 2016, pp. 308–314.
[34] G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3965–3981, 2017.
[35] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” arXiv:1811.00982, 2018.
[36] N. Weir, D. Lindenbaum, A. Bastidas, A. V. Etten, S. McPherson, J. Shermeyer, V. Kumar, and H. Tang, “Spacenet mvoi: A multi-view overhead imagery dataset,” in ICCV, October 2019.
[37] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, “A large contextual dataset for classification, detection and counting of cars with deep learning,” in ECCV, 2016, pp. 785–800.
[38] M. Y. Yang, W. Liao, X. Li, and B. Rosenhahn, “Deep learning for vehicle detection in aerial images,” in ICIP. IEEE, 2018, pp. 3079–3083.
[39] M. Y. Yang, W. Liao, X. Li, Y. Cao, and B. Rosenhahn, “Vehicle detection in aerial images,” Photogrammetric engineering and remote sensing: PE&RS, vol. 85, no. 4, pp. 297–304, 2019.
[40] K. Chen, M. Wu, J. Liu, and C. Zhang, “Fgsd: A dataset for fine-grained ship detection in high resolution satellite images,” arXiv:2003.06832, 2020.
[41] J. Shermeyer, T. Hossler, A. Van Etten, D. Hogan, R. Lewis, and D. Kim, “Rareplanes: Synthetic data takes flight,” arXiv:2006.02963, 2020.
[42] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 98, pp. 119–132, 2014.
[43] Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, 2017.
[44] Z. Zou and Z. Shi, “Random access memories: A new paradigm for target detection in high resolution aerial remote sensing images,” IEEE TIP, vol. 27, no. 3, pp. 1100–1111, 2017.
[45] Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection,” IEEE Trans. Geosci. Remote Sens., 2019.
[46] S. Waqas Zamir, A. Arora, A. Gupta, S. Khan, G. Sun, F. Shahbaz Khan, F. Zhu, L. Shao, G.-S. Xia, and X. Bai, “isaid: A large-scale dataset for instance segmentation in aerial images,” in CVPR Workshops, 2019, pp. 28–37.
[47] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
[48] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, “Vision meets drones: A challenge,” arXiv:1804.07437, 2018.
[49] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR, 2011, pp. 1521–1528.
[50] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, “SSD: single shot multibox detector,” in ECCV, 2016, pp. 21–37.
[51] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in CVPR, 2017, pp. 7263–7271.
[52] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014, pp. 580–587.
[53] R. Girshick, “Fast r-cnn,” in CVPR, 2015, pp. 1440–1448.
[54] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE TPAMI, vol. 39, no. 6, pp. 1137–1149, 2017.
[55] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” CoRR, abs/1703.06211, vol. 1, no. 2, p. 3, 2017.
[56] Z. Liu, J. Hu, L. Weng, and Y. Yang, “Rotated region based cnn for ship detection,” in ICIP. IEEE, 2017, pp. 900–904.
[57] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for oriented object detection,” arXiv:2008.09397, 2020.
[58] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” in CVPR, vol. 1, no. 2, 2017, p. 3.
[59] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao, “Geometry-aware scene text detection with instance transformation network,” in CVPR, 2018, pp. 1381–1389.
[60] C. Huang, H. Ai, Y. Li, and S. Lao, “High-performance rotation invariant multiview face detection,” IEEE TPAMI, vol. 29, no. 4, pp. 671–686, 2007.
[61] X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen, “Real-time rotation-invariant face detection with progressive calibration networks,” in CVPR, 2018, pp. 2295–2303.
[62] M. Liao, B. Shi, and X. Bai, “Textboxes++: A single-shot oriented scene text detector,” IEEE TIP, vol. 27, no. 8, pp. 3676–3690, 2018.
[63] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, “Gliding vertex on the horizontal bounding box for multi-oriented object detection,” IEEE TPAMI, 2020.
[64] X. Yang and J. Yan, “Arbitrary-oriented object detection with circular smooth label,” ECCV, 2020.
[65] J. Wang, J. Ding, H. Guo, W. Cheng, T. Pan, and W. Yang, “Mask obb: A semantic attention-based mask oriented bounding box representation for multi-category object detection in aerial images,” Remote Sensing, vol. 11, no. 24, p. 2930, 2019.
[66] J. Wang, W. Yang, H.-C. Li, H. Zhang, and G.-S. Xia, “Learning center probability map for detecting objects in aerial images,” IEEE Trans. Geosci. Remote Sens., 2020.
[67] B. Uzkent, C. Yeh, and S. Ermon, “Efficient object detection in large images using deep reinforcement learning,” in WACV, 2020, pp. 1824–1833.
[68] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” in CVPR, 2017, pp. 7310–7311.
[69] F. Massa and R. Girshick, “maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch,” https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
[70] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2,” https://github.com/facebookresearch/detectron2, 2019.
[71] Y. Chen, C. Han, Y. Li, Z. Huang, Y. Jiang, N. Wang, and Z. Zhang, “Simpledet: A simple and versatile distributed framework for object detection and instance recognition,” JMLR, vol. 20, no. 156, pp. 1–8, 2019.
[72] A.-M. de Oca, R. Bahmanyar, N. Nistor, and M. Datcu, “Earth observation image semantic bias: A collaborative user annotation approach,” J-STARS, 2017.
[73] Z. Zhang, G. Vosselman, M. Gerke, C. Persello, D. Tuia, and M. Y. Yang, “Detecting building changes between airborne laser scanning and photogrammetric data,” Remote Sensing, vol. 11, no. 20, p. 2417, 2019.
[74] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010, pp. 3485–3492.
[75] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” IJCV, vol. 123, no. 1, pp. 32–73, 2017.
[76] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari, “Extreme clicking for efficient object annotation,” in ICCV, 2017, pp. 4930–4939.
[77] Z. Wu, N. Bodla, B. Singh, M. Najibi, R. Chellappa, and L. S. Davis, “Soft sampling for robust object detection,” arXiv:1806.06986, 2018.
[78] H. Zhang, F. Chen, Z. Shen, Q. Hao, C. Zhu, and M. Savvides, “Solving missing-annotation object detection with background recalibration loss,” in ICASSP. IEEE, 2020, pp. 1888–1892.
[79] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, “Bounding box regression with uncertainty for accurate object detection,” in CVPR, 2019, pp. 2888–2897.
[80] W. Li, W. Wei, and L. Zhang, “Gsdet: Object detection in aerial images based on scale reasoning,” IEEE TIP, vol. 30, pp. 4599–4609, 2021.
[81] B. Singh and L. S. Davis, “An analysis of scale invariance in object detection snip,” in CVPR, 2018, pp. 3578–3587.
[82] wiki, “gsd,” https://en.wikipedia.org/wiki/Ground_sample_distance.
[83] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “ICDAR 2015 competition on robust reading,” in ICDAR, 2015.
[84] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A face detection benchmark,” in CVPR, 2016, pp. 5525–5533.
[85] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017, pp. 2980–2988.
[86] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV. IEEE, 2017, pp. 2980–2988.
[87] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, “Light-head r-cnn: In defense of two-stage object detector,” arXiv:1711.07264, 2017.
[88] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance segmentation,” in CVPR, 2019, pp. 4974–4983.
[89] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” TMM, 2018.
[90] X. Pan, Y. Ren, K. Sheng, W. Dong, H. Yuan, X. Guo, C. Ma, and C. Xu, “Dynamic refinement network for oriented and densely packed object detection,” in CVPR, 2020, pp. 11207–11216.
[91] X. Yang and J. Yan, “Arbitrary-oriented object detection with circular smooth label,” arXiv:2003.05597, 2020.
[92] Y. Zhu, X. Wu, and J. Du, “Adaptive period embedding for representing oriented objects in aerial images,” arXiv:1906.09447, 2019.

Jian Ding is currently pursuing his Ph.D. degree at the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. He received the B.S. degree in Aircraft Design and Engineering from Northwestern Polytechnical University, Xi'an, China, in 2017. His research interests include object detection, instance segmentation and remote sensing.

Nan Xue is currently a Research Associate Professor in the School of Computer Science, Wuhan University. He received the B.S. and Ph.D. degrees from Wuhan University in 2014 and 2020, respectively. He was a visiting scholar at North Carolina State University from Sep. 2018 to June 2020. His research interests include geometric structure analysis in computer vision.

Gui-Song Xia received his Ph.D. degree in image processing and computer vision from CNRS LTCI, Télécom ParisTech, Paris, France, in 2011. From 2011 to 2012, he was a Post-Doctoral Researcher with the Centre de Recherche en Mathématiques de la Decision, CNRS, Paris-Dauphine University, Paris, for one and a half years. He is currently working as a full professor at Wuhan University. He has also been working as Visiting Scholar at DMA, École Normale Supérieure (ENS-Paris) for two months in 2018. His current research interests include mathematical modeling of images and videos, structure from motion, perceptual grouping, and remote sensing image understanding. He serves on the Editorial Boards of several journals, including Pattern Recognition, Signal Processing: Image Communications, EURASIP Journal on Image & Video Processing, Journal of Remote Sensing, and Frontiers in Computer Science: Computer Vision.

Xiang Bai received the B.S., M.S., and Ph.D. degrees from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2003, 2005, and 2009, respectively, all in electronics and information engineering. He is currently a Professor with the School of Electronic Information and Communications and the Vice-Director of the National Center of AntiCounterfeiting Technology, HUST. His research interests include object recognition, shape analysis, and scene text recognition.
Wen Yang received the Ph.D. degree in communication and information system from Wuhan University, Wuhan, China, in 2004. From 2008 to 2009, he was a visiting Scholar with the Laboratoire Jean Kuntzmann (LJK), Grenoble, France. From 2010 to 2013, he was a Post-Doctoral Researcher with the State Key Laboratory of Information Engineering, Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University. Since then, he has been a Full Professor with the School of Electronic Information, Wuhan University. His research interests include object detection and recognition, semantic segmentation, and change detection.

Michael Ying Yang is currently Assistant Professor in the Department of Earth Observation Science at ITC - Faculty of Geo-Information Science and Earth Observation, University of Twente, The Netherlands, heading a group working on scene understanding. He received the PhD degree (summa cum laude) from University of Bonn (Germany) in 2011. He received the venia legendi in Computer Science from Leibniz University Hannover in 2016. His research interests are in the fields of computer vision and photogrammetry with specialization on scene understanding and semantic interpretation from imagery. He serves as Associate Editor of ISPRS Journal of Photogrammetry and Remote Sensing, Co-chair of ISPRS working group II/5 Dynamic Scene Analysis, Program Chair of ISPRS Geospatial Week 2019, and recipient of the ISPRS President's Honorary Citation (2016), the Best Science Paper Award at BMVC (2016), and The Willem Schermerhorn Award (2020).

Serge Belongie received a B.S. (with honor) in EE from Caltech in 1995 and a Ph.D. in EECS from Berkeley in 2000. While at Berkeley, his research was supported by an NSF Graduate Research Fellowship. From 2001-2013 he was a professor in the Department of Computer Science and Engineering at University of California, San Diego. He is currently a professor at Cornell Tech and the Department of Computer Science at Cornell University. His research interests include Computer Vision, Machine Learning, Crowdsourcing and Human-in-the-Loop Computing. He is also a co-founder of several companies including Digital Persona, Anchovi Labs and Orpix. He is a recipient of the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review Innovators Under 35 Award and the Helmholtz Prize for fundamental contributions in Computer Vision.

Jiebo Luo joined the University of Rochester in Fall 2011 after over fifteen prolific years at Kodak Research Laboratories, where he was a Senior Principal Scientist leading research and advanced development. He has been involved in numerous technical conferences, including serving as the program co-chair of ACM Multimedia 2010, IEEE CVPR 2012 and IEEE ICIP 2017. He has served on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, ACM Transactions on Intelligent Systems and Technology, Pattern Recognition, Machine Vision and Applications, and Journal of Electronic Imaging. Dr. Luo will serve as the Editor-in-Chief of the IEEE Transactions on Multimedia for the 2020-2022 term. Dr. Luo is a Fellow of SPIE, IAPR, IEEE, ACM, and AAAI. In addition, he is a Board Member of the Greater Rochester Data Science Industry Consortium.

Mihai Datcu received the M.S. and Ph.D. degrees in electronics and telecommunications from the University Politehnica of Bucharest (UPB), Bucharest, Romania, in 1978 and 1986, and the title Habilitation à diriger des recherches from Université Louis Pasteur, Strasbourg, France. He has held a Professorship in electronics and telecommunications with UPB since 1981. Since 1993, he has been a Scientist with the German Aerospace Center (DLR), Oberpfaffenhofen, Germany. He is currently developing algorithms for model-based information retrieval from high-complexity signals and methods for scene understanding from SAR and interferometric SAR data, and he is engaged in research in information theoretical aspects and semantic representations in advanced communication systems. His research interests are in Bayesian inference, information and complexity theory, stochastic processes, model-based scene understanding, and image information mining, for applications in information retrieval and understanding of high-resolution SAR and optical observations.

Marcello Pelillo is a Full Professor of Computer Science at Ca' Foscari University, Venice, where he leads the Computer Vision and Pattern Recognition Lab. He has been the Director of the European Centre for Living Technology (ECLT) and has held visiting research/teaching positions in several institutions including Yale University (USA), University College London (UK), McGill University (Canada), University of Vienna (Austria), York University (UK), NICTA (Australia), Wuhan University (China), Huazhong University of Science and Technology (China), and South China University of Technology (China). He is also an external affiliate of the Computer Science Department at Drexel University (USA). His research interests are in the areas of computer vision, machine learning and pattern recognition, where he has published more than 200 technical papers in refereed journals, handbooks, and conference proceedings. He has been General Chair for ICCV 2017, Program Chair for ICPR 2020, and has been Track or Area Chair for several conferences in his area. He is the Specialty Chief Editor of Frontiers in Computer Vision and serves, or has served, on the Editorial Boards of several journals, including IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Recognition, IET Computer Vision, and Brain Informatics. He also serves on the Advisory Board of Springer's International Journal of Machine Learning and Cybernetics. Prof. Pelillo has been elected Fellow of the IEEE and Fellow of the IAPR, and is an IEEE SMC Distinguished Lecturer.

Liangpei Zhang received the B.S. degree in physics from Hunan Normal University, Changsha, China, in 1982, the M.S. degree in optics from the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, in 1988, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1998. He is currently a Chang-Jiang Scholar Chair Professor with Wuhan University, appointed by the Ministry of Education of China. He has authored or coauthored over 500 research papers and five books. He holds 15 patents. His research interests include hyperspectral remote sensing, high resolution remote sensing, image processing, and artificial intelligence. Dr. Zhang was a recipient of the 2010 Best Paper Boeing Award and the 2013 Best Paper ERDAS Award from the American Society of Photogrammetry and Remote Sensing. He serves as a Co-Chair for the series SPIE Conferences on Multispectral Image Processing and Pattern Recognition, the Conference on Asia Remote Sensing, and many other conferences. He serves as an Associate Editor for the IEEE Transactions on Geoscience and Remote Sensing. He is a fellow of IEEE.