
Received January 2, 2022, accepted January 25, 2022, date of publication February 4, 2022, date of current version February 25, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3149052

A Survey of Deep Learning-Based Object Detection Methods and Datasets for Overhead Imagery
JUNHYUNG KANG 1, SHAHROZ TARIQ 2,3, HAN OH 4, AND SIMON S. WOO 1,5
1 Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, South Korea
2 Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea
3 Data61, CSIRO, Australia
4 National Satellite Operation and Application Center, Korea Aerospace Research Institute (KARI), Daejeon 34133, South Korea
5 Department of Applied Data Science, Sungkyunkwan University, Suwon 16419, South Korea

Corresponding author: Simon S. Woo ([email protected])


This work was supported in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant
funded by the Korea Government (MSIT), AI Graduate School Support Program, Sungkyunkwan University, under Grant 2019-0-00421;
in part by the Regional strategic industry convergence security core talent training business under Grant 2019-0-01343; in part by the Basic
Science Research Program through the National Research Foundation of Korea (NRF) grant funded by Korea Government (MSIT) under
Grant 2020R1C1C1006004 and Grant 2020M1A3B2A02084969; in part by IITP grant funded by the Korea Government (MSIT), Original
Technology Development of Artificial Intelligence Industry, under Grant 2021-0-02157 and Grant 2021-0-00017; and in part by the
Institute of Information and Communications Technology Planning and Evaluation (IITP) Grant funded by the Korea Government (MSIT)
(Artificial Intelligence Innovation Hub) under Grant 2021-0-02068.

ABSTRACT Significant advancements and progress made in recent computer vision research enable more effective processing of various objects in high-resolution overhead imagery obtained from various sources such as drones, airplanes, and satellites. In particular, overhead images combined with computer vision enable many real-world uses for economic, commercial, and humanitarian purposes, including assessing economic impact from crop yields, financial supply chain prediction for a company's revenue management, and rapid disaster surveillance systems (wildfire alarms, rising sea levels, and weather forecasting). Likewise, object detection in overhead images provides insight for many real-world applications, yet it remains challenging because of substantial image volumes, inconsistent image resolution, small-sized objects, highly complex backgrounds, and nonuniform object classes. Although extensive studies in deep learning-based object detection have achieved remarkable performance and success, they often remain ineffective, yielding low detection performance, due to the underlying difficulties in overhead images. Thus, high-performing object detection in overhead images is an active research field aiming to overcome such difficulties. This survey paper provides a comprehensive overview and comparative review of the most up-to-date deep learning-based object detection methods for overhead images. In particular, our work sheds light on the most recent advancements in object detection methods for overhead images and introduces overhead datasets that have not been comprehensively surveyed before.

INDEX TERMS Object detection, satellites, synthetic aperture radar, unmanned aerial vehicles.

I. INTRODUCTION
Deep learning has advanced rapidly in recent years, achieving great success in a variety of fields. As opposed to traditional algorithms, deep learning-based approaches frequently use deep networks to extract feature representations from raw data for various tasks. Especially, the application of deep learning in remote sensing is now gaining considerable attention, motivated by numerous successful applications in the computer vision community [1]–[7]. Consequently, the expeditious advancement of deep learning applications in remote sensing boosts the volume and variety of classification methods available to identify different objects on the earth's surface, such as cars, airplanes, and houses [8], [9]. Our work focuses on reviewing the recent advancements in remote sensing for satellite and aerial-imagery-based object detection.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhongyi Guo.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

With this work, we hope to promote further research in the related fields by providing a general overview of object detection for overhead imagery to both experts and beginners.

Previously, Cheng and Han [1] surveyed object detection methods in optical remote sensing images and discussed the challenges along with promising research directions. Although they proposed deep learning-based feature representation as one of the promising research directions, they focused on traditional methods such as template matching [10], [11] or knowledge-based methods [12], [13], which are far from recently developed deep learning-based methods. On the other hand, our survey performs a comprehensive review focused on modern deep learning-based approaches for object detection in overhead imagery.

For the performance assessment, Groener et al. [2] and Alganci et al. [3] compared the object detection performance of deep learning-based models on single-class satellite datasets sampled from publicly available databases [8], [9]. They contributed to assessing the advantages and limitations of each model based on its performance. Furthermore, Zheng et al. [6] systematically summarized the deep learning-based object detection algorithms for remote sensing images. However, these studies mainly focus on general object detection methods, not specifically on the remote sensing and overhead imagery domains. Moreover, they conducted their experiments for the one-class object detection task, which is not the case for overhead imagery that usually requires detecting multi-class objects. Thus, this survey paper aims to discuss the modern methods for object detection in satellite and aerial images with multi-class datasets.

Furthermore, Yao et al. [4] and Cazzato et al. [5] reviewed the object detection methods for aerial images from unmanned aerial vehicles (UAV) and provided new insight into future research directions. Moreover, Li et al. [7] presented a comprehensive review of the deep learning-based object detection methods in optical remote sensing images. Compared to their approaches, this survey aims to cover the comprehensive methods for object detection in the broader scope of overhead imagery, including both satellite images (Electro-Optical (EO), Synthetic Aperture Radar (SAR)) and aerial images.

There are other studies reviewing deep learning-based applications for satellite and aerial images [14]–[17]. While these studies cover general state-of-the-art object detection methods, our work specifically aims to investigate the recent advancements in object detection for overhead imagery and examine the challenges. The contributions of this paper are summarized as follows:
• This paper provides a comprehensive survey of deep learning-based object detection methods and datasets using satellite images (SAR and EO) and aerial images after thoroughly reviewing more than 90 research papers from the past six years.
• We define six major areas and construct a taxonomy to tackle the challenges of overhead imagery, and extensively analyze and categorize existing studies accordingly.
• Based on this study, we provide a comparative study among the latest methods and datasets, then discuss the limitations of the current approaches and promising future research directions.

II. APPLYING DEEP LEARNING-BASED METHODS FOR OVERHEAD IMAGERY IN CHALLENGING ENVIRONMENTS
Recently, there has been increasing interest in deep learning-based object detection methods along with significant performance improvement compared to traditional algorithms such as template matching and knowledge-based methods [10]–[13]. In general, deep learning-based object detection models consist of a backbone and a head network, where the basic structures of deep learning-based object detection models are described in Fig. 2, and the pseudo code of an object detector is provided in the Appendix. The backbone network extracts features from the input images, while the head network uses the extracted features to localize the bounding boxes of the detected objects and classify them (see Fig. 2).

In the case of backbone networks, CNN-based networks are commonly employed. Meanwhile, methods such as ViT-FRCNN [18], ViT-YOLO [19], and Swin Transformer [20] that incorporate transformer-based networks and self-attention mechanisms have recently demonstrated high performance. However, developing a new backbone structure capable of achieving high performance is a difficult task that requires massive computation cost and pre-training on large-scale image data such as ImageNet [21]. To overcome this limitation, Liang et al. [22] proposed CBNetV2, which improved object detection performance by using existing pre-trained backbones, such as ResNet50 [23], ResNet152 [23], and Res2Net50 [24]. CBNetV2 achieved this performance improvement by composing multiple pre-trained identical backbones into assisting backbones and a lead backbone. CBNetV2 can employ both CNN-based and transformer-based backbones.

On the other hand, the head network is divided into one-stage or two-stage detectors. While the one-stage detector performs object localization and classification simultaneously, the two-stage detector performs classification to determine classes after proposing the regions of interest (RoIs). A typical two-stage detector is Faster R-CNN [25], and a one-stage detector is YOLOv3 [26]. In general, two-stage detectors have higher accuracy than one-stage detectors because the two-stage detector performs localization and classification on the regions proposed in the first stage. However, the inference time of two-stage detectors tends to be longer due to the large number of regions and the additional stage of processing.
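To make the distinction concrete, the following is a minimal Python sketch of the two head designs; it is an illustrative simplification, not the pseudo code from this paper's Appendix, and the backbone, proposal step, and heads are random stand-ins for what would be learned networks in a real detector.

```python
import numpy as np

def backbone(image):
    # Placeholder feature extractor: a real detector would use a CNN or
    # transformer backbone (e.g., ResNet, Swin) instead of average pooling.
    return image.reshape(image.shape[0] // 8, 8, image.shape[1] // 8, 8, -1).mean(axis=(1, 3))

def one_stage_head(features, num_classes=3, boxes_per_cell=1):
    # One-stage head: predicts class scores and box offsets densely,
    # directly on the backbone feature map (YOLO/SSD style).
    h, w, _ = features.shape
    rng = np.random.default_rng(0)            # stands in for learned conv layers
    cls_scores = rng.random((h, w, boxes_per_cell, num_classes))
    box_deltas = rng.random((h, w, boxes_per_cell, 4))
    return cls_scores, box_deltas

def two_stage_head(features, num_proposals=5, num_classes=3):
    # Two-stage head: first proposes candidate regions (RPN-like step), then
    # classifies and refines only those regions (Faster R-CNN style).
    rng = np.random.default_rng(0)
    proposals = rng.random((num_proposals, 4))                           # stage 1: region proposals
    roi_feats = np.stack([features.mean(axis=(0, 1))] * num_proposals)   # RoI pooling stand-in
    cls_scores = rng.random((num_proposals, num_classes))                # stage 2: classify regions
    box_refine = rng.random((num_proposals, 4))                          # stage 2: refine boxes
    return proposals, cls_scores, box_refine

image = np.random.rand(256, 256, 3)
feats = backbone(image)
dense_scores, dense_boxes = one_stage_head(feats)
props, roi_scores, roi_boxes = two_stage_head(feats)
print(dense_scores.shape, props.shape, roi_scores.shape)
```

The one-stage head predicts everywhere in a single pass (fast, dense), while the two-stage head spends extra computation only on proposed regions, which is the structural reason for the accuracy/latency trade-off described above.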


FIGURE 1. High-level overview of the six categories of deep learning-based object detection methods for overhead imagery.

While many methods focus on extracting more accurate features from an image, Dai et al. [27] recently proposed a new head network structure called Dynamic Head, which takes the output of the backbone network as its input. The head network then treats this input as a 3-dimensional tensor organized by the scale, location, and representation of objects. Furthermore, an attention mechanism is applied to each dimension of the feature tensor. Therefore, the proposed approach improved the detection performance more efficiently than simultaneously applying a full self-attention mechanism to the feature tensors.

However, it is challenging to apply the aforementioned state-of-the-art methods to overhead imagery because of the unique underlying characteristics of satellite and aerial images. Therefore, our survey focuses explicitly on methods used in overhead imagery domains instead of general deep learning-based methods such as Faster R-CNN [25], YOLO [28]–[30], and SSD [31].

We define the following six major object detection categories based on the unique challenges associated with overhead imagery, as shown in Fig. 1: 1) efficient detection, 2) small object detection, 3) oriented object detection, 4) augmentation and super-resolution, 5) multimodal object detection, and 6) imbalanced objects detection. We discuss the details of each area in the following sections.

A. EFFICIENT DETECTION
Efficiency is one of the important performance metrics of the object detection task. As the size of deep learning-based models as well as the resolution, complexity, and size of the images increase, the importance of efficiency has become paramount. In particular, the Swin Transformer V2 [32] and the Focal Transformer [33] have been proposed. The Swin Transformer V2 is a method for scaling up the Swin Transformer [20], which has shown high performance in object detection tasks. For scaling up, the Swin Transformer V2 applies specific techniques such as post-normalization, scaled cosine attention, and a log-spaced continuous position bias. Additionally, the Focal Transformer [33] is a method to overcome the computational overhead of self-attention by applying the focal self-attention mechanism.

In recent years, enormous quantities of high-resolution overhead photographs are being created in near-real-time due to the advancements made in earth observation technologies. Therefore, the efficiency research area has also gained considerable interest in the overhead and satellite imagery domains. Representative methods for efficient object detection in overhead imagery follow two approaches, as shown in Fig. 3. One is reducing computation, which lowers the computational load of the model. The other is reducing the search area extracted from the provided input images.

1) REDUCING COMPUTATION
Zhang et al. [34] proposed SlimYOLOv3 using the channel pruning method to make the object detection model lighter and more efficient. The proposed method was inspired by a network slimming approach [35], which prunes the Convolutional Neural Network (CNN) model to reduce the costs of the computational processes. Their approach added the Spatial Pyramid Pooling module [36] to the original YOLOv3 [26] and pruned the less informative CNN channels to improve detection accuracy and reduce floating-point operations (FLOPs) by reducing the number of parameters. On the VisDrone dataset [37], their experimental results showed that the proposed method runs twice as fast with only about 8% of the original parameter count.

Usually, training with high-resolution overhead images requires a high computational cost. To alleviate this problem, Uzkent et al. [38] applied reinforcement learning (RL) to object detection models to minimize the usage of high-resolution overhead images. The agent of the reinforcement learning model determines whether a low-resolution image is enough to detect objects or whether the high-resolution image is required. This process increases runtime efficiency by reducing the number of required high-resolution images.
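As a rough illustration of this idea (a sketch only, not the reinforcement-learning formulation of [38]), the snippet below gates between a cheap low-resolution pass and a full-resolution pass using a placeholder policy score and threshold:

```python
import numpy as np

def downsample(image, factor=4):
    # Cheap low-resolution view of the image patch.
    return image[::factor, ::factor]

def policy_score(low_res_patch):
    # Stand-in for a small learned policy network that estimates whether the
    # low-resolution patch already carries enough evidence for detection.
    return float(low_res_patch.std())  # toy proxy: "busy" patches score high

def detect(patch):
    # Stand-in for any object detector; returns dummy boxes here.
    return [("object", 0.9, (0, 0, 10, 10))] if patch.mean() > 0.5 else []

def gated_detection(image, threshold=0.2):
    low = downsample(image)
    if policy_score(low) < threshold:
        # Low resolution deemed sufficient: run the detector on the cheap view.
        return detect(low), "low-res"
    # Otherwise pay the cost of processing the full-resolution image.
    return detect(image), "high-res"

patch = np.random.rand(512, 512)
detections, branch = gated_detection(patch)
print(branch, len(detections))
```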


FIGURE 2. Basic deep learning-based one-stage vs. two-stage object detection model architectures. The backbone network can be a CNN or a transformer-based network, and the detector can be categorized as one-stage or two-stage according to the structure of the head network. As shown in (a), the one-stage detector simultaneously performs object localization and classification in the head network. On the other hand, localization and classification in the two-stage detector are performed on regions after the region proposals are obtained, as shown in (b).

FIGURE 3. The two primary categories of efficient object detection methods. Efficient detection is achieved by reducing computation or by reducing the search area; the representative methods are listed on the right.

However, the proposed method [38] requires pairs of low-resolution and high-resolution images, while SlimYOLOv3 [34] can be applied directly to high-resolution images without pairing low- and high-resolution datasets.

2) REDUCING SEARCH AREA
Unlike other methods that use the information from entire image areas for object detection, several methods [39]–[41] suggested reducing the search area of images for efficient object detection.

Han et al. [39] applied a Region Locating Network (RLN) in addition to Faster R-CNN [25] to generate cropped images. The proposed architecture of the RLN is the same as the Region Proposal Network (RPN) in Faster R-CNN. However, unlike the RPN, the RLN can predict the possible areas of object locations from the original overhead images. Also, since the cropped images of the predicted areas are much smaller than the entire area of the original images, the proposed method showed a significant improvement in terms of efficiency for detecting specific objects.

In addition, Sommer et al. [40] proposed an approach similar to the RLN [39], called the Search Area Reduction (SAR) module. The image is divided into several image patches; then, the module predicts scores based on the number of objects contained in each image patch. In particular, there are two distinguishing characteristics of the SAR module compared with the previous RLN [39]. Firstly, contrary to the RLN, which generates variously sized images based on a clustering method, the SAR module handles divided images of a fixed size. Secondly, while the SAR module is integrated with Faster R-CNN to share the network, the RLN has a separate network for finding the regions. Such an integrated approach significantly reduces the inference time.

Also, Yang et al. [41] proposed the Clustered Detection (ClusDet) network composed of a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet). CPNet is attached to the feature extraction backbone network and obtains high-level feature maps. Based on the feature map information, the CPNet predicts the location and scale of clusters in input images. Then, ScaleNet predicts the scale offset for objects to rescale the cluster chips. Finally, detection results from DetecNet on cluster chips and on the whole image are combined to generate the final result. This method achieves high runtime efficiency by reducing the search area with the clustering approach. Compared with the existing search-area-based methods [39], [40], it is noteworthy that ClusDet achieves not only high efficiency but also improved detection performance for small objects.
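The patch-scoring idea shared by these search-area-reduction methods can be sketched as follows; this is a generic, simplified illustration with assumed tile sizes, thresholds, and stand-in networks, not the SAR module, RLN, or ClusDet implementation:

```python
import numpy as np

def split_into_tiles(image, tile=256):
    """Yield (row, col, tile) crops covering the image."""
    h, w = image.shape[:2]
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            yield r, c, image[r:r + tile, c:c + tile]

def objectness_score(tile):
    # Stand-in for a tiny scoring network that estimates how many objects a
    # tile is likely to contain (e.g., trained with object-count supervision).
    return float(tile.var())

def detect(tile):
    # Stand-in for the full, expensive detector.
    return [((5, 5, 50, 50), "vehicle", 0.8)]

def detect_with_search_area_reduction(image, score_threshold=0.05):
    detections = []
    for r, c, tile in split_into_tiles(image):
        if objectness_score(tile) < score_threshold:
            continue  # skip tiles that are unlikely to contain objects
        for (x, y, w, h), label, conf in detect(tile):
            # Shift tile-local boxes back into full-image coordinates.
            detections.append(((c + x, r + y, w, h), label, conf))
    return detections

image = np.random.rand(1024, 1024)
print(len(detect_with_search_area_reduction(image)))
```

Only the cheap scoring function sees every tile; the expensive detector runs on the surviving subset, which is where the efficiency gain comes from.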


B. SMALL OBJECT DETECTION
The limitation in detecting small-sized objects is another challenging problem associated with overhead images. Object detection in overhead images not only targets relatively large-sized objects such as buildings, bridges, and soccer fields, but in many cases also needs to detect small-sized objects such as vehicles, people, and ships. However, if the resolution of the images decreases, the capability of detecting small-sized objects decreases drastically. Therefore, performance degradation in detecting small-sized objects in overhead images is extremely challenging and needs to be addressed. Recently, several methods have been proposed for small object detection to achieve better performance, as shown in Fig. 4, which we describe in the following sections.

FIGURE 4. The two primary categories of small object detection. Small object detection is achieved by fine-grained model architectures and multi-scale learning, and the corresponding methods are listed on the right.

1) FINE-GRAINED MODEL ARCHITECTURE
An intuitive and straightforward approach to solve the small-sized object detection problem is to extract fine-grained features from source images by adjusting model parameters.

First, Sommer et al. [42] demonstrated the effectiveness of deep learning-based detection methods for vehicle detection in aerial images. Mainly, they performed the experiments based on Fast R-CNN [43] and Faster R-CNN [25], which are widely used in the object detection domain for terrestrial applications. Specifically, they proposed a common model architecture to detect small-sized objects. To maintain sufficient information in the feature maps, they optimized parameters including the number of layers, the kernel size, and the anchor scale. This work showed the applicability of a general object detection model when applied to vehicle detection in overhead imagery. However, the proposed work could not present a novel methodology beyond optimizing the network parameters of the existing models [25], [43].

While Sommer et al. [42] utilized Faster R-CNN, which is a representative two-stage object detection model, other approaches employed one-stage object detection models such as YOLO [26], [29], [30]. Pham et al. [44] proposed YOLO-fine, which is a one-stage object detection model. The proposed model was implemented based on YOLOv3 [26] to effectively handle small objects. In detail, this model replaced feature extraction layers with finer ones using a lower sub-sampling factor. With this finer object search grid, YOLO-fine could recognize objects smaller than eight pixels that were not recognized by the original YOLOv3. Also, they reduced the number of model parameters compared to the original YOLOv3 by removing the last two convolutional blocks, which were not helpful for small-sized object detection. Overall, their work improved the detection performance for small objects by improving the discrimination of adjacent objects through a finer grid search, while reducing the number of model parameters by removing unnecessary convolution layers.

2) MULTI-SCALE LEARNING
While some achieved better performance in small object detection through parameter optimization, others suggested a multi-scale approach that obtains features of variously scaled objects for small object detection.

Van Etten [28] proposed the You Only Look Twice (YOLT) model inspired by YOLO [29], [30]. They introduced a new structure with an approach similar to a fine-grained model architecture. In the YOLT model, fine-grained features are extracted by adjusting architecture parameters. Furthermore, YOLT applied a multi-scale training approach concurrently, because the fine-grained model could suffer from a high false-positive issue. An intuitive way to understand multi-scale training is that it is similar to building two different models. In this way, YOLT combines the detection results obtained from two different models that detect small and large objects, respectively, to determine the final object detection result.

Inspired by YOLT, Li et al. [45] proposed MOD-YOLT for the Multi-scale Object Detection (MOD) task. They empirically categorized objects into three types with different size criteria. Then, the categorized objects were trained with the Multi-YOLT Network (MYN). While YOLT used a single network structure, MOD-YOLT proposes MYN to obtain optimal feature maps using optimized network structures for each scale. With this advanced framework, MOD-YOLT achieved higher detection performance than YOLT on a dataset from the second stage of the AIIA Cup Competition.¹

¹ https://www.datafountain.cn/competitions/288/datasets

Also, Zhou et al. [46] applied a multi-scale network in addition to the Faster R-CNN architecture. Because the depth of a convolutional neural network is related to the feature level, the multi-scale network enabled the model to use multiple levels of input features. Therefore, this multi-scale network is beneficial for detecting small objects such as ships in SAR imagery, improving mAP performance compared to baseline models such as the Single Shot MultiBox Detector (SSD) [31] and RetinaNet [47].
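To illustrate multi-scale inference in its simplest form, the sketch below runs a (stand-in) detector at several input scales, maps the boxes back to the original resolution, and merges them with non-maximum suppression; it is a generic illustration rather than the YOLT or MOD-YOLT pipeline, and the detector call, scales, and IoU threshold are assumed placeholders.

```python
import numpy as np

def detect(image):
    # Stand-in for any single-scale detector; returns [x1, y1, x2, y2, score].
    h, w = image.shape[:2]
    return np.array([[0.1 * w, 0.1 * h, 0.3 * w, 0.3 * h, 0.9]])

def resize(image, scale):
    # Nearest-neighbour resize, sufficient for the sketch.
    idx_r = (np.arange(int(image.shape[0] * scale)) / scale).astype(int)
    idx_c = (np.arange(int(image.shape[1] * scale)) / scale).astype(int)
    return image[idx_r][:, idx_c]

def nms(boxes, iou_thresh=0.5):
    order = boxes[:, 4].argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou < iou_thresh]
    return boxes[keep]

def multi_scale_detect(image, scales=(0.5, 1.0, 2.0)):
    all_boxes = []
    for s in scales:
        boxes = detect(resize(image, s)).copy()
        boxes[:, :4] /= s                      # map boxes back to original resolution
        all_boxes.append(boxes)
    return nms(np.vstack(all_boxes))

image = np.random.rand(400, 400)
print(multi_scale_detect(image).shape)
```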


FIGURE 5. The two primary categories of oriented object detection, divided at a high level into horizontal and oriented bounding box methods. The core methods are illustrated on the right-hand side.

Another method [48] to overcome the challenges of small object detection is to apply multi-scale training to images cropped from a source image. Based on a clustering method or a density map, cropped images are generated, where the cropped images have various object scales. In particular, Li et al. [48] proposed the Density-Map guided object detection Network (DMNet) to crop the images based on a density map from the proposed density generation module. The Multi-column CNN (MCNN) [49] inspired the idea of the density generation module, which learned features of images and generated a density map. The cropped images were fed into the object detector, and the results were fused to increase detection performance for small objects.

C. ORIENTED OBJECT DETECTION
Oriented objects can also cause misclassification and produce a considerable decrease in object detection models' performance. Therefore, deep learning-based methods [50]–[54] have recently been proposed to detect oriented objects with higher accuracy, as shown in Fig. 5. Existing overhead imagery datasets can be classified according to whether the coordinates of oriented bounding boxes are provided or whether the coordinates are provided as horizontal boxes defined by the center point, width, and height. Depending on the dataset labeling configuration, different methods are employed for detecting oriented objects. Thus, we categorize oriented object detection methods based on the bounding box format they use: either horizontal or oriented.

1) DETECTING HORIZONTAL BOUNDING BOX
The most intuitive way to improve the detection accuracy for oriented objects is to explore data augmentation. Cheng et al. [50] applied a data augmentation strategy and proposed a new objective function to achieve rotation invariance of the feature representations. The proposed method was extended from AlexNet [55] by replacing the last classification layer with a rotation-invariant CNN (RICNN) layer and softmax layers. They applied rotation augmentation to the data so that images from both before and after rotation were used jointly. During the training phase, a unique loss function enabled the RICNN to obtain similar features from an image before and after rotation. The RICNN improved the detection performance for oriented objects on the NWPU VHR-10 [56] dataset; however, this method required additional fine-tuning and cannot be applied to datasets labeled in oriented bounding box form.

2) DETECTING ORIENTED BOUNDING BOX
Liu et al. [51] proposed the rotated region-based CNN (RR-CNN) with a rotated region of interest (RRoI) pooling layer. The proposed RRoI pooling layer pools rotated features into a 5-tuple: center position with respect to the x and y-axis, width, height, and rotation angle. The pooled features obtained by RRoI are more robust and accurate than the previously proposed free-form RoI pooling layer [57]. Another advantage of this approach is its extensibility, as it can be combined with any other two-stage object detection model. For example, Liu et al. [51] used Fast R-CNN with RR-CNN in their experiments.

Even though RRoI can accurately extract rotated features, it suffered from expensive computational costs to generate proposal regions. To address this issue, Ding et al. [52] proposed an RoI transformer that consists of an RRoI learner and a Rotated Position Sensitive (RPS) RoI alignment module. The RRoI learner is trained to learn the transformation from HRoIs (Horizontal RoIs) to RRoIs, and the RPS RoI alignment module extracts rotation-invariant features. Despite a negligible increase in computational cost, this RoI transformer-based method [52] significantly improved performance in detecting oriented objects.

In addition, Yi et al. [53] introduced box boundary-aware vectors (BBAVectors) to detect and predict the oriented bounding boxes of objects. Instead of using angle values predicted from features [51], BBAVectors employed a Cartesian coordinate system: the model detects the center keypoint and then specifies the position of the bounding boxes. The entire model architecture is implemented as an anchor-free one-stage detector so that the model can make inferences faster than other two-stage detectors.

On the other hand, Han et al. [54] proposed a one-stage detection method called the single-shot alignment network (S²A-Net), where the Feature Alignment Module (FAM) and Oriented Detection Module (ODM) were introduced. FAM consists of an Anchor Refinement Network (ARN) and an Alignment Convolution Layer (ACL). ARN generates rotated anchors, and ACL decodes the anchor prediction map into the oriented bounding box, extracting aligned features using alignment convolution (AlignConv). On the other hand, ODM applies active rotating filters (ARF) [58] to extract orientation-sensitive features and orientation-invariant features. These features are used to predict the bounding boxes and classify the categories using two sub-networks. It achieved state-of-the-art performance on the DOTA [9] and HRSC2016 [59] datasets, which are widely utilized in the oriented object detection research area.
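Since several of the methods above parameterize an oriented box as a 5-tuple (center x, center y, width, height, angle), the helper below shows one common way to convert that representation into the four corner points used in oriented annotations; it is a generic geometric utility, not code from any of the cited papers.

```python
import numpy as np

def rotated_box_to_corners(cx, cy, w, h, angle_rad):
    """Convert a (cx, cy, w, h, angle) oriented box to its 4 corner points.

    The angle is the counter-clockwise rotation of the box in radians.
    Returns an array of shape (4, 2) with the corners in order.
    """
    # Axis-aligned corners centered at the origin.
    corners = np.array([[-w / 2, -h / 2],
                        [ w / 2, -h / 2],
                        [ w / 2,  h / 2],
                        [-w / 2,  h / 2]])
    # Standard 2D rotation matrix.
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[c, -s],
                         [s,  c]])
    # Rotate, then translate back to the box center.
    return corners @ rotation.T + np.array([cx, cy])

# Example: a 40x20 box centered at (100, 50), rotated by 30 degrees.
print(rotated_box_to_corners(100, 50, 40, 20, np.deg2rad(30)).round(1))
```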


D. AUGMENTATION AND SUPER-RESOLUTION
In order to further improve detection performance, image data augmentation and super-resolution can be applied at the preprocessing stage. The different preprocessing strategies are categorized and described in Fig. 6.

FIGURE 6. The two primary categories of preprocessing methods. The most frequently used preprocessing methods for improving detection performance are super-resolution and image augmentation.

1) IMAGE AUGMENTATION
Chou et al. [60] proposed an interesting approach for detecting stingrays in aerial images, employing a generative approach called Conditional GLO (C-GLO). Their approach was motivated by Generative Latent Optimization (GLO) proposed by Bojanowski et al. [61]. Unlike the original GLO, C-GLO generates objects that are blended with the background of the selected image region. Through training with these augmented images, the baseline model showed a significant performance improvement.

Also, Chen et al. [62] applied an adaptive augmentation method called AdaResampling to improve model performance, noting two significant issues with regular augmentation methods: background mismatch and scale mismatch. To address these issues, AdaResampling applies a pretrained segmentation network during the augmentation phase to produce a segmented road map. From the segmented road map, the model uses the position information to place objects. Additionally, a simple linear function is utilized to calculate the scale factor for resizing objects. The augmented images were passed to the proposed hybrid detector called RRNet, which has a re-regression module. The re-regression module then takes feature maps with coarse bounding boxes as input and predicts the final bounding boxes as output.

2) SUPER-RESOLUTION
Another approach that is often applied to the original image at the preprocessing stage is to generate a super-resolution image. Shermeyer and Van Etten [63] analyzed the effect of different resolutions of overhead imagery on object detection performance. Very Deep Super-Resolution (VDSR) [64] and Random-Forest Super-Resolution (RFSR), which is extended from Super-Resolution Forests (SRF) [65], were used to generate super-resolution images for their experiments. The results demonstrated that these super-resolution methods improved detection performance. When the resolution of the input images was increased from 30cm to 15cm, the mAP performance increased by 13% to 36%. Conversely, when the resolution was degraded from 30cm to 120cm, the detection performance decreased by 22% to 27%. The outcome of the experiments demonstrated that image resolution is highly related to detection performance; thus, super-resolution methods can generally improve the overall object detection performance.

In a similar direction, Rabbi et al. [66] introduced the Edge-Enhanced Super-Resolution GAN (EESRGAN) method, which was inspired by the Edge-Enhanced GAN (EEGAN) [67] and the Enhanced Super-Resolution GAN (ESRGAN) [68]. The proposed method consists of EESRGAN and an end-to-end network structure with Faster R-CNN or SSD as a base detection network. In particular, EESRGAN generates super-resolution images with rich edge information, and the base detection networks (Faster R-CNN and SSD) achieved improved accuracy with the super-resolution images.

E. MULTIMODAL OBJECT DETECTION
Another challenging but promising research area is object detection with multimodal data such as different resolutions, viewpoints, and data types. In this section, we examine methods using multimodal data for object detection. In order to achieve more robust and accurate detection performance utilizing various types of data, multimodal object detection can be applied, as shown in Fig. 7.

FIGURE 7. Illustration of multimodal object detection methods. E-SVMs and NDFT are the most frequently used methods.

First, a fundamental approach to multimodal object detection is to use images of different resolutions from separate sensors. Cao et al. [69] introduced a detection framework that simultaneously used low-resolution satellite images and high-resolution aerial images. Coupled dictionary learning was applied to obtain augmented features for the detection framework, and then E-SVM [70] was used to make the model more robust across various image resolutions. Compared with multi-scale training and super-resolution generated images, the proposed method obtained data from separate domains, which provide different image resolutions such as satellite and aerial.

Similar to using multi-resolution information, Wegner et al. [71] used information obtained from multiple views such as street and overhead views.


TABLE 1. Summary of EO satellite datasets. The SpaceNet 1 and 2 datasets provide two types of images, which are 8-band multispectral and 3-band RGB
imagery. All datasets are publicly available, where † indicates the performance achieved by the best team for the challenge evaluation metric.

The Faster R-CNN model was utilized as a base detection model to detect objects from each street view image. The results with geographic coordination were combined to calculate multi-view proposal scores, and the scores generated the final detection results for the input region. The proposed model showed significant improvement in mAP score at the evaluation stage compared to the Faster R-CNN model on simple overhead images.

Unlike the approaches that utilized different types of images, Wu et al. [72] proposed the Nuisance Disentangled Feature Transform (NDFT) to use meta-data in conjunction with the images to obtain domain-robust features. Furthermore, by adopting adversarial learning, the NDFT disentangles the features of domain-specific nuisances such as altitude, angle, and weather information. Their proposed training process enables the model to be robust in various domains by learning domain-invariant features.

F. IMBALANCED OBJECTS DETECTION
Imbalanced objects are one of the challenging issues in overhead imagery research [47], [73]–[76]. After RetinaNet [47] introduced the focal loss to overcome the problem of detecting imbalanced objects, there have been more studies extending the focal loss to improve the performance, as shown in Fig. 8.

FIGURE 8. A visual overview of methods for detecting imbalanced objects, where the detailed methods are presented.

Especially, Yang et al. [73] proposed a double focal loss convolutional neural network (DFL-CNN), applying the focal loss to the region proposal network (RPN) and the classifier module of the Faster R-CNN model. Using the focal loss instead of the cross-entropy loss, the RPN considers the class imbalance problem when determining the regions of interest, and the classifier is enabled to handle hard negative data during training. Additionally, a skip connection was proposed to pass detailed features from the shallow layer to the deeper layer. Their method demonstrated an improvement in detection performance compared with the Faster R-CNN model on the ITCVD dataset, which was constructed in this study.

Sergievskiy and Ponamarev [74] addressed the challenging imbalance issue with a reduced focal loss, a modified version of the original focal loss function. A threshold was applied to keep minimum weights on positive samples to prevent an unintended drop in recall. The proposed method was evaluated on the xView [8] dataset with a random undersampling strategy and achieved first place in the DIUx xView 2018 Detection Challenge [8].

Unlike the previous studies using the focal loss function, Zhang et al. [75] proposed a Difficult Region Estimation Network (DREN). The DREN was trained to generate cropped images of difficult-to-detect regions for the testing phase, and these images were passed to the detector along with the original images. Their network utilized the balanced L1 loss from Libra R-CNN [77]. The balanced L1 loss restrains gradients produced by outliers, which are samples with high loss values. By clipping the maximum gradients from outliers, it produces a more balanced regression from accurate samples.
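For reference, the focal loss introduced by RetinaNet [47], which the approaches above build on or modify, down-weights well-classified examples as follows (standard formulation; the reduced focal loss of [74] additionally thresholds the weight applied to positive samples):

$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1-p_t)^{\gamma}\,\log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1,\\ 1-p & \text{otherwise,} \end{cases}$$

where $p$ is the predicted probability, $\alpha_t$ balances positive and negative samples, and the focusing parameter $\gamma \ge 0$ reduces the loss contribution of easy examples.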


III. DATASETS
In this section, we explain the most popular and openly available satellite imagery datasets based on their image sensor sources.

A. EO SATELLITE IMAGERY DATASETS
Although EO satellite images generally have lower resolution and are more difficult to collect compared to other images, EO images are advantageous in capturing large areas that are physically difficult to cover by UAV or aircraft. The following datasets are constructed with EO satellites, as shown in Table 1.

FIGURE 9. Images taken from EO satellite datasets as examples. Each dataset is composed of images captured using different sensors with varying resolutions.

1) HRSC2016
Liu et al. [79] introduced the High-Resolution Ship Collection 2016 (HRSC2016) dataset to promote research on optical remote sensing ship detection and recognition. The dataset was utilized by Liu et al. [59] before its publication to demonstrate detection performance for rotated ships. Therefore, the dataset provides labeling information for both rotated and horizontal bounding box coordinates. It is notable for providing hierarchical classes of ships, in contrast to other ship detection datasets that usually contain a single class. Moreover, since both labeling formats are supported, the dataset can be used with various detection models. However, most ship images are presented with harbor backgrounds, so for better dataset quality, separate open-sea images need to be included, as indicated in their future work. The sample image is shown in Fig. 9 (a).

2) SPACENET CHALLENGE 1 AND 2
Van Etten et al. [80] and the SpaceNet partners (CosmiQ Works, Radiant Solutions, and NVIDIA) released a large satellite dataset called SpaceNet. SpaceNet comprises a series of datasets, datasets 1 and 2, which aim to extract building footprints. SpaceNet 1 obtained images of Rio de Janeiro with the WorldView-2 satellite at 50cm GSD. Furthermore, images of SpaceNet 2 were captured from various areas, including Las Vegas, Paris, Shanghai, and Khartoum, with the WorldView-3 satellite at 30cm ground sample distance (GSD). Both datasets contain 8-band multispectral images with lower resolution. Because building footprints are provided in polygon format, these datasets enable detection models to evaluate performance more accurately than datasets with bounding box formats. The sample images are shown in Fig. 9 (b) and (c).

3) XVIEW
Lam et al. [8] constructed the xView dataset, consisting of over 1 million objects across 60 classes with 30cm ground sample distance from the WorldView-3 satellites. Compared to the other satellite datasets introduced before, xView provides high geographic diversity and various class categories. However, since xView collected images from only a single source, the dataset is not suitable for evaluating detection performance on images from various sources. In addition, they maintained the quality of the dataset through three stages of quality control: worker, supervisory, and expert. In the worker quality control stage, labelers performed the role of reviewers to check the work of other labelers related to bounding boxes. Afterward, in the supervisory quality control stage, quality was checked and feedback was provided to labelers by hosting training sessions. Lastly, in the expert quality control stage, a standard dataset was generated to set thresholds for comparing precision and recall with generated dataset batches. Throughout this quality process, xView minimized human error and achieved consistency of the dataset. The sample image is shown in Fig. 9 (d).

4) SPACENET MVOI
Weir et al. [81] addressed the limitation of previous satellite imagery datasets, which cannot represent the various viewpoints of real-world cases, and introduced the SpaceNet Multi-View Overhead Imagery (MVOI) dataset. While other datasets had a fixed viewing angle from their sensing sources, SpaceNet MVOI was constructed with a broad range of off-nadir images from the WorldView-2 satellite. Therefore, SpaceNet MVOI obtained different images of the same area at various angles. This characteristic makes the dataset effective in evaluating the generalization performance of detection models. Similar to SpaceNet 1 and 2 [80], the labels are provided in polygon format to represent the accurate ground truth of building footprints. The sample image is shown in Fig. 9 (e).

B. SAR SATELLITE IMAGERY DATASETS
Although imagery from SAR satellites contains much speckle noise, SAR satellite imagery is an important research area due to the unique characteristic that SAR images can be provided regardless of obstacles such as clouds and lighting. In particular, most of the existing SAR datasets are constructed for ship detection, as shown in Table 2.

1) SSDD
The SAR Ship Detection Dataset (SSDD) is the most widely used dataset for SAR ship detection research, first introduced by Li et al. [88] in 2017. The images were collected from the RadarSat-2, TerraSAR-X, and Sentinel-1 satellites to utilize various sensor types and resolutions. The minimum size of a ship in low-resolution images is three pixels, for the ship to be recognizable. This dataset is helpful for starting to evaluate the performance of ship detection models on SAR imagery. However, the number of objects is relatively small compared to other datasets, and it is considered a less challenging dataset on which mAP values higher than 95% have already been achieved [82], [89].

2) OPENSARSHIP 2.0
Li et al. [90] presented a SAR ship detection dataset called OpenSARShip 2.0, an updated version of OpenSARShip [91].


FIGURE 10. Sample images from SAR satellite imagery datasets. The radar returns are collected by active sensors and converted to grayscale images. Compared with images from passive sensors, a SAR image generally contains speckle noise.

TABLE 2. Summary of SAR satellite imagery datasets. Because OpenSARShip datasets provide only ship chips without raw images, image size is not
described in this table. For LS-SSDD-v1.0, both sizes of raw images and split image patches are described. All the datasets have ship class only, and they
are publicly available.

The images were collected from the Sentinel-1 satellite with two different product types: single look complex and ground range detected products. Compared to the previous version, OpenSARShip 2.0 provides additional information from the automatic identification system (AIS) and Marine Traffic, containing ship type, longitude, and latitude. Although detailed position information is annotated for each ship, the images are provided as cropped ship chips. Thus, the dataset is more suitable for a classification task rather than object detection. The sample image is shown in Fig. 10 (a).

3) SAR-SHIP DATASET
Wang et al. [92] constructed a SAR image dataset using 102 images from the Chinese Gaofen-3 satellite and 108 images from the Sentinel-1 satellite. Compared with previous SAR datasets, this dataset focuses on containing complex background images such as a harbor or the area near an island. Through this characteristic of the dataset, it aims to increase the performance of ship detection models without any land-ocean segmentation image preprocessing. The sample image is provided in Fig. 10 (b).

4) HRSID
Wei et al. [93] constructed and released the High-Resolution SAR Images Dataset (HRSID) to foster research on ship detection and instance segmentation in SAR imagery. From the Sentinel-1B, TerraSAR-X, and TanDEM satellites, 136 raw images were collected. Then, the images were cropped into a fixed size with a 25% overlapping area. Optical imagery from Google Earth was utilized to minimize annotation errors. While the OpenSARShip [90] and SAR-Ship [92] datasets provide ship chips or small-sized images, HRSID provides comparatively large images, which are beneficial for evaluating object detection methods. Furthermore, HRSID is composed of higher resolution images than other SAR datasets, so the HRSID dataset is more effective in discriminating adjacent ships. The sample image is shown in Fig. 10 (c).

5) LS-SSDD-v1.0
Unlike previously released datasets containing ship chips or small-sized images, Zhang et al. [94] released a dataset with 15 large-scale raw images collected from Sentinel-1. The dataset, called the Large-Scale SAR Ship Detection Dataset-v1.0 (LS-SSDD-v1.0), provides raw images of size 24,000×16,000 pixels with split sub-images of 800×800 pixels. In order to create conditions similar to the actual environment, images are provided with pure backgrounds without making separate ship chips. This means that sub-image patches are provided regardless of whether they contain target objects or not. This characteristic is helpful for detection models to learn pure backgrounds without objects, making the dataset more practical for real-world cases. The sample raw image is presented in Fig. 10 (d).

C. AERIAL IMAGERY DATASETS
As shown in Table 3, there are datasets constructed with images from passive optical sensors mounted on aircraft and drones. It is difficult to specify detailed sensor information because the sensor specifications generally cannot be described in detail. Initial datasets focused primarily on detecting cars.


FIGURE 11. Sample images from aerial imagery datasets. Generally, aerial imagery datasets have higher resolution than satellite images.

However, recent trends for object detection on aerial imagery have extended to detecting various objects in diverse backgrounds.

1) OIRDS
Tanner et al. [101] created the Overhead Imagery Research Data Set (OIRDS) with non-copyrighted overhead imagery. They obtained images from the United States Geological Survey (USGS) and the Video Verification of Identity (VIVID) program of the Defense Advanced Research Projects Agency (DARPA). The dataset contains approximately 1,800 vehicle targets categorized as car, truck, pick-up truck, and unknown. Based on a comparative analysis of the color distribution of actual vehicle statistics against the statistics of the collected images, the OIRDS dataset showed a high-quality reflection of the natural distribution of reality. However, since the amount of data is relatively small, there is a limitation in improving generalized performance when a deep learning-based method is applied. A sample image is shown in Fig. 11 (a).

2) DLR MVDA DATASET
The German Aerospace Center (DLR) obtained aerial images with the DLR 3K camera system and provided a dataset called DLR-MVDA. Liu and Mattyus [95] first utilized the DLR-MVDA dataset to evaluate the performance of their proposed method. Although they considered only two vehicle classes in the research, the dataset has seven vehicle classes. The images were captured from an airplane at a height of 1,000 meters above Munich, Germany. DLR-MVDA has the advantage that the annotation includes angle information for each object. Therefore, the dataset can be utilized in oriented object detection methods. A sample image is shown in Fig. 11 (b).

3) VEDAI
Razakarivony and Jurie [102] introduced the Vehicle Detection in Aerial Imagery (VEDAI) dataset, which consists of subsets of two different image resolutions that support color or infrared image types. They cut large original images into small images over selected regions to maximize diversity. The total number of vehicle classes is nine, and two meta-classes are also defined. Although this dataset is composed of several image types and has an advantage in scalability to various sensor images in real-world cases, the amount of data is relatively small for use with deep learning detection methods, similar to OIRDS [101]. A sample image is shown in Fig. 11 (c).

4) COWC
Mundhenk et al. [99] created a large contextual dataset called Cars Overhead with Context (COWC). Unlike the existing datasets covering one region or the same sensor source [101], [102], COWC covers six regions, including Toronto in Canada, Selwyn in New Zealand, Potsdam and Vaihingen in Germany, Columbus, and Utah, to guarantee diversity. The images from two regions (Vaihingen and Columbus) are grayscale, and the others are color images. COWC has the advantage that the dataset contains a more diverse range of objects that can be utilized in deep learning-based methods for vehicle detection than the previous datasets [101], [102]. Also, COWC includes various usable negative targets to increase the difficulty of the dataset. A sample image is shown in Fig. 11 (d).

5) CARPK
Hsieh et al. [103] presented an aerial-view image dataset collected by drones for detecting cars in parking lots. The dataset contains 89,777 cars with various viewpoints from four different places.


TABLE 3. Summary of aerial imagery datasets. For DLR-MVDA, the values in parentheses are the number used for the training in the study [95]. All
datasets are publicly available, where † indicates the performance achieved by the best team for the challenge evaluation metric.

Compared to previous datasets used for car detection, such as OIRDS, VEDAI, and COWC, this dataset provides higher resolution images that allow fine-grained information to be utilized. In addition, since the images were collected from one designated type of spot (parking lots), a large portion of each image is filled with objects, compared to images with sparse objects from the previous datasets. Because CARPK is a high-resolution image dataset, the images contain distinguishable objects located in close proximity. A sample image is presented in Fig. 11 (e).

6) VISDRONE
There have been two object detection challenges, called the VisDrone Challenge, in 2018 and 2019, with a drone-based benchmark dataset called VisDrone [104]. Zhu et al. [37] released the dataset to motivate research on computer vision tasks for the drone platform. It contains 263 video clips (179,264 frames) and 10,209 images captured by drones in various areas of China. Besides, occlusion and truncation ratio information is provided to capture the characteristics of overhead imagery. Whereas the existing aerial image datasets usually use vehicles as target objects, VisDrone includes other smaller object classes such as vehicles, pedestrians, and bicycles, so that the dataset can be used for various object detection purposes. A sample image is shown in Fig. 11 (f).

7) UAVDT
Du et al. [105] constructed a dataset for detecting vehicles from a UAV platform. This dataset is called UAV Detection and Tracking (UAVDT), and it provides useful annotated attributes such as weather conditions, flying altitude, camera view, vehicle occlusion, and out-of-view. In particular, out-of-view was categorized based on the ratio of objects outside the frame. Because UAVDT represents real-world environments by focusing on various scenes, weather, and camera angles, it has the advantage of evaluating the generalization performance of detection methods. Furthermore, it also contains various backgrounds in subsets divided for training and testing, respectively. The volume of images described in Table 3 excludes images for the single object tracking task. A sample image is shown in Fig. 11 (g).

D. SATELLITE AND AERIAL IMAGERY DATASETS
Lastly, there are datasets constructed with images from both satellite and aerial sources, as shown in Table 4, where such datasets are helpful for improving and evaluating the generalization performance of object detection methods. Generally, an EO satellite is utilized for the satellite images, which have lower image resolution than aerial images.

1) TAS DATASET
Heitz and Koller [109] constructed an overhead car detection dataset obtained from Google Earth for demonstrating the performance of the things and stuff (TAS) context model. The dataset is a set of 30 color images of the city and suburbs of Brussels, Belgium, with a size of 792 × 636 pixels. A total of 1,319 cars are labeled manually, with an average car size of 45 × 45 pixels. The TAS dataset is meaningful in that it is one of the earliest developed overhead-view vehicle datasets. However, the amount of data is insufficient and lacks diversity for applying the latest deep learning-based detection methods. A sample image is shown in Fig. 12 (a).

2) SZTAKI-INRIA
Benedek et al. [110] developed the SZTAKI-INRIA Building Detection Benchmark dataset for evaluating their proposed detection methods. The dataset contains 665 building footprints in 9 images from several cities in Hungary, the UK, Germany, and France. Among the nine images, two were obtained from an aerial source, and the rest were from satellite sources and the Google Earth platform. The SZTAKI-INRIA dataset, similar to the TAS dataset [109], is not suitable for applying a deep learning-based object detection method due to the insufficient volume of images and objects. A sample image is shown in Fig. 12 (b).

3) NWPU VHR-10
Cheng et al. [56] constructed the NWPU VHR-10 dataset, which contains 800 satellite and aerial images from Google Earth and the Vaihingen data [111]. The dataset consists of ten different types of objects, such as airplanes, ships, storage tanks, etc. The size of each object type varies from 418 × 418 pixels for large objects to 33 × 33 pixels for small ones. Furthermore, the image resolution varies from 0.08m to 2m for the diversity of the dataset. In order to use the dataset according to the purpose of application, the NWPU VHR-10 dataset provides four independently divided sub-groups: 1) a negative image set, 2) a positive image set, 3) an optimizing set, and 4) a testing set. A sample image is shown in Fig. 12 (c).


FIGURE 12. Sample images from satellite and aerial imagery datasets. Since these datasets contain both satellite and aerial images, they exhibit varied characteristics such as sensor, resolution, and viewing angle.

TABLE 4. Summary of satellite and aerial imagery datasets. All datasets are publicly available, where † indicates the performance achieved by the best
team for the challenge evaluation metric.

4) DOTA
Xia et al. [9] first introduced the large-scale Dataset for Object deTection in Aerial images (DOTA), aiming for an international object detection challenge. After DOTA-v1.0 was released, DOTA-v1.5 and DOTA-v2.0 were subsequently issued in 2019 and 2021 [112], respectively. DOTA-v1.5 uses the same images as DOTA-v1.0; however, one object class and annotations for small objects were added, because DOTA-v1.0 does not contain annotations for objects smaller than 10 pixels. In DOTA-v2.0, images were additionally collected from various sources such as Google Earth and the GF-2 and JL-1 satellites. Moreover, the object categories were broadened from 15 to 18 classes. Because objects in overhead images appear with arbitrary orientations in the real world, the DOTA datasets provide oriented bounding box annotations so that detection performance can be evaluated accurately. Sample images are shown in Fig. 12. (d), (e), and (f).
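To illustrate how such oriented annotations are typically consumed, the following minimal Python sketch parses one annotation line, assuming the widely used eight-coordinate polygon layout (x1 y1 x2 y2 x3 y3 x4 y4, category, difficulty), and converts the polygon into a center, size, and angle representation. The field order, helper names, and angle convention are illustrative assumptions, not the official DOTA development kit.

```python
import numpy as np

def parse_annotation_line(line):
    """Parse one object annotation line assumed to contain eight polygon
    coordinates followed by a category name and a difficulty flag."""
    parts = line.strip().split()
    poly = np.array(list(map(float, parts[:8])), dtype=np.float32).reshape(4, 2)
    category, difficulty = parts[8], int(parts[9])
    return poly, category, difficulty

def poly_to_obb(poly):
    """Convert a 4-point polygon to an oriented box (cx, cy, w, h, angle).
    The angle is measured from the x-axis to the first polygon edge, in radians."""
    cx, cy = poly.mean(axis=0)
    edge1 = poly[1] - poly[0]
    edge2 = poly[2] - poly[1]
    w, h = float(np.linalg.norm(edge1)), float(np.linalg.norm(edge2))
    angle = float(np.arctan2(edge1[1], edge1[0]))
    return cx, cy, w, h, angle
```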
ods [28], [46], [63] focusing on improve accuracy; however,
IV. FUTURE RESEARCH DIRECTIONS
In this work, we propose two promising research directions based on our comprehensive survey of deep learning-based methods and overhead imagery datasets.

A. ACCURACY VS. EFFICIENCY
Accurate detection of small and imbalanced objects is closely related to the efficiency of detection methods, because accuracy and efficiency are generally in a trade-off relationship. Although well-established object detection methods for natural images have been studied [113]–[115] to meet both accuracy and efficiency requirements, overhead imagery must treat this problem as an even more critical issue, because the amount of data and the size of the images to be processed are much larger than for natural images. Among the currently proposed methods, some improve efficiency, but the upper bound of their detection performance remains that of the vanilla detection model [34], [38], [40]. On the other hand, there are methods [28], [46], [63] that focus on improving accuracy, but their efficiency decreases due to the high computational load of the modified model architectures. Therefore, research on novel approaches that improve both accuracy and efficiency is the first primary research direction.
20130 VOLUME 10, 2022


J. Kang et al.: Survey of Deep Learning-Based Object Detection Methods and Datasets for Overhead Imagery

B. FUSION OF OTHER DOMAIN DATA
Another direction is to utilize additional information from other domains for overhead imagery. In most cases, it is practically challenging to obtain sufficient labeled data for various computer vision tasks. Therefore, many studies, such as the soft teacher method [116], have recently been conducted to overcome this problem. Xu et al. [116] proposed the soft teacher method, in which the framework is composed of a teacher network and a student network. The teacher network assigns a classification score to each unlabeled bounding box and calculates the loss, and throughout this process the accuracy of the pseudo labels is gradually improved. For object detection on overhead imagery, this challenge of scarce data can be overcome through fusion with data from other domains. Even when labeled data are not sufficient, we can leverage more data from other domains, such as different sensing sources [71], [81] and metadata information [105]. Therefore, detection performance can be improved by using this additional information, so that more meaningful results can be obtained in real-world use cases. Moreover, overhead imagery is closely related to social and real-world applications because it is obtained over a broad area on a real-time basis [117]–[119]. Given these considerations, one of the most significant research directions is to combine information from other domains with overhead imagery.
Object detection on overhead imagery is one of the exciting 14: Choose accurate objects up to the maximum number
research areas in the computer vision community. However, of objects
there are challenging issues due to the unique characteristics 15: Calculate batch loss from the objective function
of overhead images that are different from natural images. 16: end for
The characteristics cause difficulty in applying the state-of- 17: Update model parameter based on calculated loss
the-art methods in natural images directly. Therefore, many 18: end for
approaches have been introduced recently to overcome the
challenging issues.
Our survey paper explores the recent approaches in satel- ACKNOWLEDGMENT
lite and aerial imagery-based object detection research and The authors would like to thank Jin Yong Park for reviewing
aims to stimulate further research in this area by presenting the earlier version of the draft, and providing helpful and
comprehensive and comparative reviews. After researching insightful comments.
a number of papers, we categorized the most important
approaches into the six different categories. Further, we com- REFERENCES
pare and analyze publicly available datasets to utilize and [1] G. Cheng and J. Han, ‘‘A survey on object detection in optical
motivate research on object detection with the overhead remote sensing images,’’ ISPRS J. Photogramm. Remote Sens., vol. 117,
imagery domain. Also, based on the difference in image pp. 11–28, Jul. 2016.
[2] A. Groener, G. Chern, and M. Pritt, ‘‘A comparison of deep learning
sources, our paper surveyed the datasets with helpful infor- object detection models for satellite imagery,’’ in Proc. IEEE Appl. Imag.
mation such as image resolution and size. We hope this Pattern Recognit. Workshop (AIPR), Oct. 2019, pp. 1–10.
paper will be helpful in developing more advanced deep [3] U. Alganci, M. Soydas, and E. Sertel, ‘‘Comparative research on deep
learning approaches for airplane detection from very high-resolution
learning-based approaches as well as understanding and dis- satellite images,’’ Remote Sens., vol. 12, no. 3, p. 458, Feb. 2020.
cussing future research directions. [4] H. Yao, R. Qin, and X. Chen, ‘‘Unmanned aerial vehicle for remote
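As a concrete illustration of steps 12–14 of Algorithm 1, the sketch below drops low-confidence predictions, suppresses overlapping boxes with non-maximum suppression based on an IoU threshold, and keeps at most a fixed number of objects. The thresholds and function names are illustrative defaults rather than values prescribed by the surveyed methods.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def post_process(boxes, scores, conf_thresh=0.3, iou_thresh=0.5, max_det=100):
    """Steps 12-14 of Algorithm 1: remove uncertain predictions, run NMS,
    and keep at most max_det objects. boxes: (N, 4) array, scores: (N,) array."""
    keep_conf = scores >= conf_thresh
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    order = np.argsort(-scores)
    kept = []
    while order.size > 0 and len(kept) < max_det:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return boxes[kept], scores[kept]
```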
sensing applications—A review,’’ Remote Sens., vol. 11, no. 12, p. 1443,
Jun. 2019.
[5] D. Cazzato, C. Cimarelli, J. L. Sanchez-Lopez, H. Voos, and M. Leo,
APPENDIX ‘‘A survey of computer vision methods for 2D object detection from
PSEUDO CODE FOR OBJECT DETECTOR unmanned aerial vehicles,’’ J. Imag., vol. 6, no. 8, p. 78, Aug. 2020.
[6] Z. Zheng, L. Lei, H. Sun, and G. Kuang, ‘‘A review of remote sensing
The training procedure for one and two-stage detectors is image object detection algorithms based on deep learning,’’ in Proc. IEEE
formally presented in Algorithm 1. 5th Int. Conf. Image, Vis. Comput. (ICIVC), Jul. 2020, pp. 34–43.


[7] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, ‘‘Object detection in [32] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang,
optical remote sensing images: A survey and a new benchmark,’’ ISPRS L. Dong, F. Wei, and B. Guo, ‘‘Swin transformer v2: Scaling up capacity
J. Photogramm. Remote Sens., vol. 159, pp. 296–307, Jan. 2020. and resolution,’’ 2021, arXiv:2111.09883.
[8] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, [33] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, ‘‘Focal
Y. Bulatov, and B. McCord, ‘‘XView: Objects in context in overhead self-attention for local-global interactions in vision transformers,’’ 2021,
imagery,’’ 2018, arXiv:1802.07856. arXiv:2107.00641.
[9] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, [34] P. Zhang, Y. Zhong, and X. Li, ‘‘SlimYOLOv3: Narrower, faster and
M. Pelillo, and L. and Zhang, ‘‘DOTA: A large-scale dataset for object better for real-time UAV applications,’’ in Proc. IEEE/CVF Int. Conf.
detection in aerial images,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1–9.
Recognit., Jun. 2018, pp. 3974–3983. [35] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, ‘‘Learning efficient
[10] M. A. Fischler and R. A. Elschlager, ‘‘The representation and matching of convolutional networks through network slimming,’’ in Proc. IEEE Int.
pictorial structures,’’ IEEE Trans. Comput., vol. C-22, no. 1, pp. 67–92, Conf. Comput. Vis., Oct. 2017, pp. 2736–2744.
Jan. 1973. [36] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Spatial pyramid pooling in
[11] D. M. McKeown, Jr., and J. L. Denlinger, ‘‘Cooperative methods for deep convolutional networks for visual recognition,’’ IEEE Trans. Pattern
road tracking in aerial imagery,’’ in Proc. DARPA IUS Workshop, 1988, Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Jul. 2015.
pp. 327–341. [37] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, ‘‘Vision meets drones: A
[12] A. Huertas and R. Nevatia, ‘‘Detecting buildings in aerial images,’’ Com- challenge,’’ 2018, arXiv:1804.07437.
put. Vis., Graph., Image Process., vol. 41, no. 2, pp. 131–152, Feb. 1988. [38] B. Uzkent, C. Yeh, and S. Ermon, ‘‘Efficient object detection in large
[13] R. B. Irvin and D. M. McKeown, ‘‘Methods for exploiting the relationship images using deep reinforcement learning,’’ in Proc. IEEE Winter Conf.
between buildings and their shadows in aerial imagery,’’ IEEE Trans. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1824–1833.
Syst., Man Cybern., vol. 19, no. 6, pp. 1564–1575, Nov. 1989. [39] Z. Han, H. Zhang, J. Zhang, and X. Hu, ‘‘Fast aircraft detection based on
[14] J. E. Ball, D. T. Anderson, and C. S. Chan, ‘‘Comprehensive survey of region locating network in large-scale remote sensing images,’’ in Proc.
deep learning in remote sensing: Theories, tools, and challenges for the IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 2294–2298.
community,’’ J. Appl. Remote Sens., vol. 11, no. 4, p. 042609, 2017. [40] L. Sommer, N. Schmidt, A. Schumann, and J. Beyerer, ‘‘Search area
[15] Y. Gu, Y. Wang, and Y. Li, ‘‘A survey on deep learning-driven remote reduction fast-RCNN for fast vehicle detection in large aerial imagery,’’
sensing image scene understanding: Scene classification, scene retrieval in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018,
and scene-guided object detection,’’ Appl. Sci., vol. 9, no. 10, p. 2110, pp. 3054–3058.
May 2019. [41] F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, ‘‘Clustered object
[16] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, ‘‘Deep detection in aerial images,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis.
learning in remote sensing applications: A meta-analysis and review,’’ (ICCV), Oct. 2019, pp. 8311–8320.
ISPRS J. Photogramm. Remote Sens., vol. 152, pp. 166–177, Jun. 2019. [42] L. W. Sommer, T. Schuchert, and J. Beyerer, ‘‘Fast deep vehicle detection
[17] Q. Yuan, H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, in aerial images,’’ in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV),
J. Wang, J. Gao, and L. Zhang, ‘‘Deep learning in environmental remote Mar. 2017, pp. 311–319.
sensing: Achievements and challenges,’’ Remote Sens. Environ., vol. 241, [43] R. Girshick, ‘‘Fast R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis.
May 2020, Art. no. 111716. (ICCV), Dec. 2015, pp. 1440–1448.
[18] J. Beal, E. Kim, E. Tzeng, D. H. Park, A. Zhai, and D. Kislyuk, ‘‘Toward [44] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard,
transformer-based object detection,’’ 2020, arXiv:2012.09958. ‘‘YOLO-fine: One-stage detector of small objects under various back-
[19] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, and F. Liu, ‘‘ViT-YOLO: grounds in remote sensing images,’’ Remote Sens., vol. 12, no. 15,
Transformer-based YOLO for object detection,’’ in Proc. IEEE/CVF Int. p. 2501, Aug. 2020.
Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2799–2808. [45] W. Li, W. Li, F. Yang, and P. Wang, ‘‘Multi-scale object detection in
[20] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, ‘‘Swin satellite imagery based on YOLT,’’ in Proc. IEEE Int. Geosci. Remote
transformer: Hierarchical vision transformer using shifted Windows,’’ Sens. Symp. (IGARSS), Jul. 2019, pp. 162–165.
2021, arXiv:2103.14030. [46] Y. Zhou, Z. Cai, Y. Zhu, and J. Yan, ‘‘Automatic ship detection in SAR
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet: image based on multi-scale faster R-CNN,’’ J. Phys., Conf., vol. 1550,
A large-scale hierarchical image database,’’ in Proc. IEEE Conf. Comput. no. 4, May 2020, Art. no. 042006.
Vis. Pattern Recognit., Jun. 2009, pp. 248–255. [47] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, ‘‘Focal loss for
[22] T. Liang, X. Chu, Y. Liu, Y. Wang, Z. Tang, W. Chu, J. Chen, and dense object detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
H. Ling, ‘‘CBNetV2: A composite backbone network architecture for Oct. 2017, pp. 2980–2988.
object detection,’’ 2021, arXiv:2107.00420. [48] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, ‘‘Density map guided
[23] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image object detection in aerial images,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
recognition,’’ in Proc. IEEE Conf. Comput. Vis. pattern Recognit., 2016, Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 190–191.
pp. 770–778. [49] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, ‘‘Single-image
[24] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and crowd counting via multi-column convolutional neural network,’’ in
P. Torr, ‘‘Res2Net: A new multi-scale backbone architecture,’’ IEEE Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016,
Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021. pp. 589–597.
[25] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards [50] G. Cheng, P. Zhou, and J. Han, ‘‘Learning rotation-invariant convo-
real-time object detection with region proposal networks,’’ 2015, lutional neural networks for object detection in VHR optical remote
arXiv:1506.01497. sensing images,’’ IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12,
[26] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improvement,’’ pp. 7405–7415, Dec. 2016.
2018, arXiv:1804.02767. [51] Z. Liu, J. Hu, L. Weng, and Y. Yang, ‘‘Rotated region based CNN for ship
[27] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, detection,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017,
‘‘Dynamic head: Unifying object detection heads with attentions,’’ pp. 900–904.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), [52] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, ‘‘Learning RoI trans-
Jun. 2021, pp. 7373–7382. former for oriented object detection in aerial images,’’ in Proc. IEEE/CVF
[28] A. Van Etten, ‘‘You only look twice: Rapid multi-scale object detection Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 2849–2858.
in satellite imagery,’’ 2018, arXiv:1805.09512. [53] J. Yi, P. Wu, B. Liu, Q. Huang, H. Qu, and D. Metaxas, ‘‘Ori-
[29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: ented object detection in aerial images with box boundary-aware vec-
Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. tors,’’ in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Jan. 2021,
Pattern Recognit., Jun. 2016, pp. 779–788. pp. 2150–2159.
[30] J. Redmon and A. Farhadi, ‘‘YOLO9000: Better, faster, stronger,’’ [54] J. Han, J. Ding, J. Li, and G.-S. Xia, ‘‘Align deep features for oriented
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, object detection,’’ IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–11,
pp. 7263–7271. 2022.
[31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and [55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘Imagenet classification
A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Proc. Eur. Conf. with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf.
Comput. Vis. Springer, 2016, pp. 21–37. Process. Syst., vol. 25, 2012, pp. 1097–1105.


[56] G. Cheng, J. Han, P. Zhou, and L. Guo, ‘‘Multi-class geospatial object [79] Z. Liu, L. Yuan, L. Weng, and Y. Yang, ‘‘A high resolution optical satellite
detection and geographic image classification based on collection of part image dataset for ship recognition and some new baselines,’’ in Proc. Int.
detectors,’’ ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Conf. Pattern Recognit. Appl. Methods, vol. 2, 2017, pp. 324–331.
Dec. 2014. [80] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, ‘‘SpaceNet: A remote
[57] H. Caesar, J. Uijlings, and V. Ferrari, ‘‘Region-based semantic segmenta- sensing dataset and challenge series,’’ 2018, arXiv:1807.01232.
tion with end-to-end training,’’ in Proc. Eur. Conf. Comput. Vis. Springer, [81] N. Weir, D. Lindenbaum, A. Bastidas, A. V. Etten, S. McPherson,
2016, pp. 381–397. J. Shermeyer, V. Kumar, and H. Tang, ‘‘SpaceNet MVOI: A multi-view
[58] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, ‘‘Oriented response networks,’’ in overhead imagery dataset,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis.,
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 519–528. Oct. 2019, pp. 992–1001.
[59] Z. Liu, H. Wang, H. Weng, and L. Yang, ‘‘Ship rotated bounding box [82] T. Zhang, X. Zhang, and X. Ke, ‘‘Quad-FPN: A novel quad feature
space for ship extraction from high-resolution optical satellite images pyramid network for SAR ship detection,’’ Remote Sens., vol. 13, no. 14,
with complex backgrounds,’’ IEEE Geosci. Remote Sens. Lett., vol. 13, p. 2771, Jul. 2021.
no. 8, pp. 1074–1078, Aug. 2016. [83] T. Zhang and X. Zhang, ‘‘Squeeze- and-excitation Laplacian pyra-
[60] Y.-M. Chou, C.-H. Chen, K.-H. Liu, and C.-S. Chen, ‘‘Stingray detection mid network with dual-polarization feature fusion for ship classifi-
of aerial images using augmented training images generated by a condi- cation in SAR images,’’ IEEE Geosci. Remote Sens. Lett., vol. 19,
tional generative model,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern pp. 1–5, 2022.
Recognit. Workshops (CVPRW), Jun. 2018, pp. 1403–1409. [84] Y. Wu, Y. Yuan, J. Guan, L. Yin, J. Chen, G. Zhang, and P. Feng, ‘‘Joint
[61] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam, ‘‘Optimizing the convolutional neural network for small-scale ship classification in SAR
latent space of generative networks,’’ 2017, arXiv:1707.05776. images,’’ in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS),
[62] C. Chen, Y. Zhang, Q. Lv, S. Wei, X. Wang, X. Sun, and J. Dong, ‘‘RRNet: Jul. 2019, pp. 2619–2622.
A hybrid detector for object detection in drone-captured images,’’ [85] Z. Cui, X. Wang, N. Liu, Z. Cao, and J. Yang, ‘‘Ship detection
in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, Oct. 2019, in large-scale SAR images via spatial shuffle-group enhance atten-
pp. 1–9. tion,’’ IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp. 379–391,
[63] J. Shermeyer and A. Van Etten, ‘‘The effects of super-resolution on object Jan. 2021.
detection performance in satellite imagery,’’ in Proc. IEEE/CVF Conf. [86] Z. Sun, M. Dai, X. Leng, Y. Lei, B. Xiong, K. Ji, and G. Kuang,
Comput. Vis. Pattern Recognit. Workshops, Jun. 2019, pp. 1–10. ‘‘An anchor-free detection method for ship targets in high-resolution SAR
[64] J. Kim, J. K. Lee, and K. M. Lee, ‘‘Accurate image super-resolution using images,’’ IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14,
very deep convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis. pp. 7799–7816, 2021.
Pattern Recognit., Jun. 2016, pp. 1646–1654. [87] X. Zhang, C. Huo, N. Xu, H. Jiang, Y. Cao, L. Ni, and C. Pan, ‘‘Multitask
[65] S. Schulter, C. Leistner, and H. Bischof, ‘‘Fast and accurate image learning for ship detection from synthetic aperture radar images,’’ IEEE
upscaling with super-resolution forests,’’ in Proc. IEEE Conf. Comput. J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 8048–8062,
Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3791–3799. 2021.
[66] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, ‘‘Small- [88] J. Li, C. Qu, and J. Shao, ‘‘Ship detection in SAR images based on an
object detection in remote sensing images with end-to-end edge-enhanced improved faster R-CNN,’’ in Proc. AR Big Data Era, Models, Methods
GAN and object detector network,’’ Remote Sens., vol. 12, no. 9, p. 1432, Appl. (BIGSARDATA), Nov. 2017, pp. 1–6.
May 2020. [89] H. Guo, X. Yang, N. Wang, and X. Gao, ‘‘A CenterNet++ model for
[67] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, ‘‘Edge-enhanced ship detection in SAR images,’’ Pattern Recognit., vol. 112, Apr. 2021,
GAN for remote sensing image superresolution,’’ IEEE Trans. Geosci. Art. no. 107787.
Remote Sens., vol. 57, no. 8, pp. 5799–5812, Mar. 2019. [90] B. Li, B. Liu, L. Huang, W. Guo, Z. Zhang, and W. Yu, ‘‘OpenSARShip
[68] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and 2.0: A large-volume dataset for deeper interpretation of ship targets in
C. C. Loy, ‘‘ESRGAN: Enhanced super-resolution generative adversar- Sentinel-1 imagery,’’ in Proc. SAR Big Data Era, Models, Methods Appl.
ial networks,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops, (BIGSARDATA), Nov. 2017, pp. 1–5.
Sep. 2018, pp. 1–16. [91] L. Huang, B. Liu, B. Li, W. Guo, W. Yu, Z. Zhang, and W. Yu, ‘‘Opensar-
[69] L. Cao, R. Ji, C. Wang, and J. Li, ‘‘Towards domain adaptive vehicle ship: A dataset dedicated to sentinel-1 ship interpretation,’’ IEEE J. Sel.
detection in satellite image by supervised super-resolution transfer,’’ in Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 1, pp. 195–208,
Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1, 2016, pp. 1–7. Jan. 2017.
[70] T. Malisiewicz, A. Gupta, and A. A. Efros, ‘‘Ensemble of exemplar- [92] Y. Wang, C. Wang, H. Zhang, Y. Dong, and S. Wei, ‘‘A SAR dataset of
SVMs for object detection and beyond,’’ in Proc. Int. Conf. Comput. Vis., ship detection for deep learning under complex backgrounds,’’ Remote
Nov. 2011, pp. 89–96. Sens., vol. 11, no. 7, p. 765, Mar. 2019.
[71] J. D. Wegner, S. Branson, D. Hall, K. Schindler, and P. Perona, [93] S. Wei, X. Zeng, Q. Qu, M. Wang, H. Su, and J. Shi, ‘‘HRSID: A high-
‘‘Cataloging public objects using aerial and street-level images-urban resolution SAR images dataset for ship detection and instance segmenta-
trees,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, tion,’’ IEEE Access, vol. 8, pp. 120234–120254, 2020.
pp. 6014–6023. [94] T. Zhang, X. Zhang, X. Ke, X. Zhan, J. Shi, S. Wei, D. Pan, J. Li, H. Su,
[72] Z. Wu, K. Suresh, P. Narayanan, H. Xu, H. Kwon, and Z. Wang, ‘‘Delving Y. Zhou, and D. Kumar, ‘‘LS-SSDD-v1.0: A deep learning dataset ded-
into robust object detection from unmanned aerial vehicles: A deep nui- icated to small ship detection from large-scale sentinel-1 SAR images,’’
sance disentanglement approach,’’ in Proc. IEEE/CVF Int. Conf. Comput. Remote Sens., vol. 12, no. 18, p. 2997, Sep. 2020.
Vis. (ICCV), Oct. 2019, pp. 1201–1210. [95] K. Liu and G. Mattyus, ‘‘Fast multiclass vehicle detection on aerial
[73] M. Y. Yang, W. Liao, X. Li, and B. Rosenhahn, ‘‘Deep learning for vehicle images,’’ IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942,
detection in aerial images,’’ in Proc. 25th IEEE Int. Conf. Image Process. Sep. 2015.
(ICIP), Oct. 2018, pp. 3079–3083. [96] H. Tayara, K. G. Soo, and K. T. Chong, ‘‘Vehicle detection and counting
[74] N. Sergievskiy and A. Ponamarev, ‘‘Reduced focal loss: 1st place solution in high-resolution aerial images using convolutional regression neural
to xView object detection in satellite imagery,’’ 2019, arXiv:1903.01347. network,’’ IEEE Access, vol. 6, pp. 2220–2230, 2018.
[75] J. Zhang, J. Huang, X. Chen, and D. Zhang, ‘‘How to fully exploit the [97] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, ‘‘Vehicle detection in aerial
abilities of aerial image detectors,’’ in Proc. IEEE/CVF Int. Conf. Comput. images based on region convolutional neural networks and hard negative
Vis. Workshops, Oct. 2019, pp. 1–8. example mining,’’ Sensors, vol. 17, no. 2, p. 336, 2017.
[76] S. Hong, S. Kang, and D. Cho, ‘‘Patch-level augmentation for object [98] T. Tang, S. Zhou, Z. Deng, L. Lei, and H. Zou, ‘‘Arbitrary-oriented vehicle
detection in aerial images,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. detection in aerial imagery with single convolutional neural networks,’’
Workshop (ICCVW), Oct. 2019, pp. 1–8. Remote Sens., vol. 9, no. 11, p. 1170, 2017.
[77] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, ‘‘Libra R-CNN: [99] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, ‘‘A large
Towards balanced learning for object detection,’’ in Proc. IEEE/CVF contextual dataset for classification, detection and counting of cars
Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 821–830. with deep learning,’’ in Proc. Eur. Conf. Comput. Vis. Springer, 2016,
[78] D. Liang, Q. Geng, Z. Wei, D. A. Vorontsov, E. L. Kim, M. Wei, and pp. 785–800.
H. Zhou, ‘‘Anchor retouching via model interaction for robust object [100] E. Goldman, R. Herzig, A. Eisenschtat, J. Goldberger, and T. Hassner,
detection in aerial images,’’ IEEE Trans. Geosci. Remote Sens., early ‘‘Precise detection in densely packed scenes,’’ in Proc. IEEE/CVF Conf.
access, Dec. 16, 2021, doi: 10.1109/TGRS.2021.3136350. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 5227–5236.


[101] F. Tanner, B. Colder, C. Pullen, D. Heagy, M. Eppolito, V. Carlan, SHAHROZ TARIQ received the B.S. degree
C. Oertel, and P. and Sallee, ‘‘Overhead imagery research data set—An (Hons.) in computer science from the National
annotated data library & tools to aid in the development of computer University of Computer and Emerging Sciences,
vision algorithms,’’ in Proc. IEEE Appl. Imag. Pattern Recognit. Work- (FAST-NUCES), Islamabad, Pakistan, and the
shop (AIPR), Oct. 2009, pp. 1–8. M.S. degree (Hons.) in computer science from
[102] S. Razakarivony and F. Jurie, ‘‘Vehicle detection in aerial imagery:
Sangmyung University, Cheonan, South Korea.
A small target detection benchmark,’’ J. Vis. Commun. Image Represent.,
He is currently pursuing the Ph.D. degree
vol. 34, pp. 187–203, Jan. 2016.
[103] M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, ‘‘Drone-based object counting by with Sungkyunkwan University (Natural Sci-
spatially regularized regional proposal network,’’ in Proc. IEEE Int. Conf. ences Campus), Suwon, South Korea. He worked
Comput. Vis. (ICCV), Oct. 2017, pp. 4145–4153. as a Software Engineer at Bentley Systems,
[104] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, ‘‘Detection from 2014 to 2015. He was a Ph.D. Research Assistant at Stony Brook
and tracking meet drones challenge,’’ 2020, arXiv:2001.06303. University and SUNY Korea, from 2017 to 2019.
[105] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and
Q. Tian, ‘‘The unmanned aerial vehicle benchmark: Object detection and
tracking,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 370–386.
[106] E. Li, J. Femiani, S. Xu, X. Zhang, and P. Wonka, ‘‘Robust rooftop
extraction from visible band images using higher order CRF,’’ IEEE
Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4483–4495, Aug. 2015.
[107] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, ‘‘Multi-scale object
detection in remote sensing imagery with convolutional neural networks,’’
ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 3–22, Nov. 2018.
[108] W. Qian, X. Yang, S. Peng, J. Yan, and X. Zhang, ‘‘RSDet++: Point-
based modulated loss for more accurate rotated object detection,’’ 2021,
arXiv:2109.11906.
[109] G. Heitz and D. Koller, ‘‘Learning spatial context: Using stuff to find
things,’’ in Proc. Eur. Conf. Comput. Vis. Springer, 2008, pp. 30–43. HAN OH received the B.E. degree (Hons.) in
[110] C. Benedek, X. Descombes, and J. Zerubia, ‘‘Building development
computer engineering from Myongji University,
monitoring in multitemporal remotely sensed image pairs with stochastic
Yongin, South Korea, in 2004, the M.S. degree
birth-death dynamics,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 34,
no. 1, pp. 33–50, Jan. 2011.
in information and communications engineering
[111] M. Cramer, ‘‘The DGPF-test on digital airborne camera evaluation from the Gwangju Institute of Science and Tech-
overview and test design,’’ Photogrammetrie Fernerkundung Geoinfor- nology (GIST), Gwangju, South Korea, in 2006,
mation, vol. 2010, no. 2, pp. 73–82, May 2010. and the Ph.D. degree in electrical engineering from
[112] J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Ying Yang, The University of Arizona, Tucson, AZ, USA,
S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, ‘‘Object detec- in 2011. In 2012, he joined Samsung Electronics,
tion in aerial images: A large-scale benchmark and challenges,’’ 2021, Suwon, South Korea, where he was involved in
arXiv:2102.12219. development of digital cameras and 3D scanning systems. Since 2014, he has
[113] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, ‘‘Learning effi- been with the Korea Aerospace Research Institute (KARI), where he is cur-
cient object detection models with knowledge distillation,’’ in Proc. Adv.
rently with the Satellite Application Division. His current research interests
Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10.
[114] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, ‘‘Single-shot refinement include deep learning-based image processing, digital communications, and
neural network for object detection,’’ in Proc. IEEE/CVF Conf. Comput. space/ground network protocols.
Vis. Pattern Recognit., Jun. 2018, pp. 4203–4212.
[115] M. Tan, R. Pang, and Q. V. Le, ‘‘EfficientDet: Scalable and efficient object
detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2020, pp. 10781–10790.
[116] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu,
‘‘End-to-end semi-supervised object detection with soft teacher,’’ 2021,
arXiv:2106.09018.
[117] M. M. Bennett and L. C. Smith, ‘‘Advances in using multitemporal night-
time lights satellite imagery to detect, estimate, and monitor socioe-
conomic dynamics,’’ Remote Sens. Environ., vol. 192, pp. 176–197,
Apr. 2017.
[118] J. Chen, W. Fan, K. Li, X. Liu, and M. Song, ‘‘Fitting Chinese cities’
population distributions using remote sensing satellite data,’’ Ecol. Indi-
cators, vol. 98, pp. 327–333, Mar. 2019. SIMON S. WOO received the B.S. degree in elec-
[119] C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and trical engineering from the University of Washing-
M. Burke, ‘‘Using publicly available satellite imagery and deep learning ton (UW), Seattle, WA, USA, the M.S. degree in
to understand economic well-being in Africa,’’ Nature Commun., vol. 11, electrical and computer engineering from the Uni-
no. 1, pp. 1–11, Dec. 2020. versity of California, San Diego (UCSD), La Jolla,
CA, USA, and the M.S. and Ph.D. degrees in
computer science from the University of Southern
California (USC), Los Angeles, CA, USA. He was
a member of technical staff (technologist) for nine
JUNHYUNG KANG received the B.S. degree years at the NASA’s Jet Propulsion Laboratory
in mechanical engineering from Sungkyunkwan (JPL), Pasadena, CA, USA, conducting research in satellite communications,
University (Natural Sciences Campus), Suwon, networking, and cybersecurity areas. He was also worked at Intel Corp. and
South Korea, where he is currently pursuing the the Verisign Research Laboratory. Since 2017, he has been a Tenure-Track
M.S. degree. Assistant Professor at SUNY, South Korea, and a Research Assistant Pro-
fessor at Stony Brook University. He is currently a Tenure-Track Assistant
Professor at the SKKU Institute for Convergence and the Department of
Applied Data Science and Software, Sungkyunkwan University, Suwon,
South Korea.

