A Survey of Deep Learning-Based Object Detection Methods and Datasets for Overhead Imagery
Digital Object Identifier 10.1109/ACCESS.2022.3149052
ABSTRACT Significant advancements in recent computer vision research enable more effective processing of various objects in high-resolution overhead imagery obtained from sources such as drones, airplanes, and satellites. In particular, overhead images combined with computer vision support many real-world uses for economic, commercial, and humanitarian purposes, including assessing economic impact from crop yields, financial supply chain prediction for a company's revenue management, and rapid disaster surveillance systems (wildfire alarms, rising sea levels, weather forecasting). Likewise, object detection in overhead images provides insight for many real-world applications, yet it remains challenging because of substantial image volumes, inconsistent image resolution, small-sized objects, highly complex backgrounds, and nonuniform object classes. Although extensive studies on deep learning-based object detection have achieved remarkable performance and success, they can still be ineffective and yield low detection performance due to the underlying difficulties of overhead images. Thus, high-performing object detection in overhead images is an active research field that aims to overcome such difficulties. This survey paper provides a comprehensive overview of, and comparative review on, the most up-to-date deep learning-based object detection methods for overhead images. In particular, our work sheds light on the most recent advancements of object detection methods in overhead images and introduces overhead datasets that have not been comprehensively surveyed before.
INDEX TERMS Object detection, satellites, synthetic aperture radar, unmanned aerial vehicles.
With this work, we hope to promote further research in the related fields by providing a general overview of object detection for overhead imagery to both experts and beginners.

Previously, Cheng and Han [1] surveyed object detection methods in optical remote sensing images and discussed the challenges along with promising research directions. Although they proposed deep learning-based feature representation as one of the promising research directions, they focused on traditional methods such as template matching [10], [11] or knowledge-based methods [12], [13], which are far from recently developed deep learning-based methods. On the other hand, our survey performs a comprehensive review focused on modern deep learning-based approaches for object detection in overhead imagery.

For the performance assessment, Groener et al. [2] and Alganci et al. [3] compared the object detection performance of deep learning-based models on single-class satellite datasets sampled from publicly available databases [8], [9]. They contributed to assessing the advantages and limitations of each model based on its performance. Furthermore, Zheng et al. [6] systematically summarized the deep learning-based object detection algorithms for remote sensing images. However, these studies mainly focus on general object detection methods, not specifically on the remote sensing and overhead imagery domains. Moreover, they conducted experiments on the one-class object detection task, which is not the case for overhead imagery that usually requires detecting multi-class objects. Thus, this survey paper aims to discuss the modern methods for object detection in satellite and aerial images with multi-class datasets.

Furthermore, Yao et al. [4] and Cazzato et al. [5] reviewed the object detection methods for aerial images from unmanned aerial vehicles (UAVs) and provided new insight into future research directions. Moreover, Li et al. [7] presented a comprehensive review of the deep learning-based object detection methods in optical remote sensing images. Compared to their approaches, this survey aims to cover methods for object detection in the broader scope of overhead imagery, including both satellite images (Electro-Optical (EO), Synthetic Aperture Radar (SAR)) and aerial images.

There are other studies reviewing deep learning-based applications for satellite and aerial images [14]–[17]. While these studies cover general state-of-the-art object detection methods, our work specifically aims to investigate the recent advancements in object detection for overhead imagery and examine the challenges. The contributions of this paper are summarized as follows:
• This paper provides a comprehensive survey of deep learning-based object detection methods and datasets using satellite images (SAR and EO) and aerial images after thoroughly reviewing more than 90 research papers from the past six years.
• We define six major areas and construct a taxonomy to tackle the challenges of overhead imagery, and extensively analyze and categorize existing studies accordingly.
• Based on the study, we provide a comparative study among the latest methods and datasets, then discuss the limitations of the current approaches and promising future research directions.

II. APPLYING DEEP LEARNING-BASED METHODS FOR OVERHEAD IMAGERY IN CHALLENGING ENVIRONMENTS
Recently, there has been increasing interest in deep learning-based object detection methods, along with significant performance improvements compared to traditional algorithms such as template matching and knowledge-based methods [10]–[13]. In general, deep learning-based object detection models consist of a backbone and a head network; the basic structures of such models are described in Fig. 2, and the pseudo code of an object detector is provided in the Appendix. The backbone network extracts features from the input images, while the head network uses the extracted features to localize the bounding boxes of the detected objects and classify them (see Fig. 2).

In the case of backbone networks, CNN-based networks are commonly employed. Meanwhile, methods such as ViT-FRCNN [18], ViT-YOLO [19], and the Swin transformer [20], which incorporate transformer-based networks and self-attention mechanisms, have recently demonstrated high performance. However, developing a new backbone structure capable of achieving high performance is a difficult task that requires massive computation and pre-training on large-scale image data such as ImageNet [21]. To overcome this limitation, Liang et al. [22] proposed CBNetV2, which improved object detection performance by reusing existing pre-trained backbones such as ResNet50 [23], ResNet152 [23], and Res2Net50 [24]. CBNetV2 achieved this improvement by composing multiple pre-trained identical backbones into assisting backbones and a lead backbone, and it can employ both CNN-based and transformer-based backbones.

On the other hand, the head network divides detectors into one-stage and two-stage approaches. While the one-stage detector performs object localization and classification simultaneously, the two-stage detector performs classification to determine classes after proposing the regions of interest (RoIs). A typical two-stage detector is Faster R-CNN [25], and a typical one-stage detector is YOLOv3 [26]. In general, two-stage detectors achieve higher accuracy than one-stage detectors because localization and classification are performed on the regions proposed in the first stage. However, the inference time of two-stage detectors tends to be longer due to the large number of regions and the additional stage of processing.

While many methods focus on extracting more accurate features from an image, Dai et al. [27] recently proposed a new head network structure called the Dynamic head.
FIGURE 1. High level overview for six different categories of deep learning-based object detection methods for the overhead imagery.
The Dynamic head takes the output features of the backbone network as its input and treats them as a 3-dimensional tensor organized by the scale, spatial location, and representation of objects. An attention mechanism is then applied to each dimension of the feature tensor. As a result, the proposed approach improves detection performance more efficiently than applying the full self-attention mechanism to the feature tensors all at once.

However, it is challenging to apply the aforementioned state-of-the-art methods to overhead imagery because of the unique underlying characteristics of satellite and aerial images. Therefore, our survey focuses explicitly on methods used in overhead imagery domains instead of general deep learning-based methods such as Faster R-CNN [25], YOLO [28]–[30], and SSD [31].

We define the following six major object detection categories based on the unique challenges associated with overhead imagery, as shown in Fig. 1: 1) efficient detection, 2) small object detection, 3) oriented object detection, 4) augmentation and super-resolution, 5) multimodal object detection, and 6) imbalanced object detection. We discuss the details of each area in the following sections.

A. EFFICIENT DETECTION
Efficiency is one of the important performance metrics of the object detection task. As the size of deep learning-based models, as well as the resolution, complexity, and size of the images, increases, efficiency has recently become paramount. In particular, the Swin Transformer V2 [32] and the Focal Transformer [33] have been proposed. The Swin Transformer V2 is a method for scaling up the Swin Transformer [20], which has shown high performance in object detection tasks. For scaling up, the Swin Transformer V2 applies specific techniques such as post-normalization, scaled cosine attention, and a log-spaced continuous position bias. Additionally, the Focal Transformer [33] is a method to overcome the computational overhead of self-attention by applying the focal self-attention mechanism.

In recent years, enormous quantities of high-resolution overhead photographs are being created in near-real-time due to the advancements made in Earth-observation technologies. Therefore, the efficiency research area has also gained considerable interest in the overhead and satellite imagery domains. The representative methods for efficient object detection in overhead imagery follow two approaches, as shown in Fig. 3. One is the computation-reduction approach, which reduces the computational load of the model; the other is the search-area-reduction approach, which reduces the search area of the provided input images.

1) REDUCING COMPUTATION
Zhang et al. [34] proposed SlimYOLOv3, which uses channel pruning to make the object detection model lighter and more efficient. The proposed method was inspired by a network slimming approach [35], which prunes the Convolutional Neural Network (CNN) model to reduce computational cost. Their approach added the Spatial Pyramid Pooling module [36] to the original YOLOv3 [26] and pruned the less informative CNN channels to improve detection accuracy and reduce floating-point operations (FLOPs) by reducing the number of parameters. On the VisDrone dataset [37], their experimental results showed that the proposed method runs twice as fast with only about 8% of the original parameter count.

Usually, training with high-resolution overhead images requires a high computational cost. To alleviate this problem, Uzkent et al. [38] applied reinforcement learning (RL) to object detection models to minimize the usage of high-resolution overhead images. The agent of the reinforcement learning model determines whether the low-resolution image is sufficient to detect objects or whether the high-resolution image is required. This process increases runtime efficiency by reducing the number of required high-resolution images.
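The two-step, resolution-aware inference described above can be sketched as follows. This is only an illustrative sketch: the learned RL agent of [38] is replaced by a simple confidence heuristic, and the detector and image-fetching callables (detect_low, detect_high, fetch_high_res) are hypothetical placeholders rather than the published models.

```python
# Coarse-to-fine detection sketch, loosely following the idea in [38]:
# a cheap pass on the low-resolution image decides whether the expensive
# high-resolution pass is needed. The RL agent is replaced by a confidence heuristic.
from typing import Callable, List, Tuple

Detection = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score

def adaptive_resolution_detect(
    low_res_image,
    fetch_high_res: Callable[[], object],              # e.g. decodes the full-resolution tile
    detect_low: Callable[[object], List[Detection]],
    detect_high: Callable[[object], List[Detection]],
    confidence_floor: float = 0.6,
) -> List[Detection]:
    """Run the cheap detector first; only pay for high resolution when needed."""
    coarse = detect_low(low_res_image)
    # Heuristic policy: if every coarse detection is confident, accept them as-is.
    if coarse and min(score for *_, score in coarse) >= confidence_floor:
        return coarse
    # Otherwise request the high-resolution image and rerun the full detector.
    return detect_high(fetch_high_res())
```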
FIGURE 2. Basic deep learning-based one-stage vs. two-stage object detection model architectures. The backbone can be a CNN or a transformer-based network, and the detector is categorized as one-stage or two-stage according to the structure of the head network. As shown in (a), the one-stage detector simultaneously performs object localization and classification in the head network. On the other hand, localization and classification in the two-stage detector are performed on regions after the region proposals are obtained, as shown in (b).
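The backbone-plus-head structure summarized in Fig. 2 can be captured in a few lines of PyTorch. The sketch below is a generic, minimal one-stage detector written only for illustration; it is not the pseudo code from the paper's Appendix, and the layer widths are arbitrary assumptions.

```python
# Minimal, generic one-stage detector in the backbone + head form described above.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in CNN backbone: turns an image into a feature map."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.features(x)

class OneStageHead(nn.Module):
    """Predicts class scores and box offsets densely at every feature-map cell."""
    def __init__(self, channels: int, num_classes: int, num_anchors: int = 1):
        super().__init__()
        self.cls = nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1)
        self.reg = nn.Conv2d(channels, num_anchors * 4, 3, padding=1)
    def forward(self, feats):
        return self.cls(feats), self.reg(feats)

class OneStageDetector(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = TinyBackbone()
        self.head = OneStageHead(64, num_classes)
    def forward(self, images):
        feats = self.backbone(images)                 # feature extraction
        cls_logits, box_deltas = self.head(feats)     # simultaneous localization + classification
        return cls_logits, box_deltas

# Example: one 256x256 RGB image -> dense per-cell predictions.
logits, deltas = OneStageDetector()(torch.randn(1, 3, 256, 256))
```

A two-stage variant would insert a region proposal step between the backbone and the head, as in Fig. 2 (b).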
However, the proposed method [38] requires pairs of low-resolution and high-resolution images, while SlimYOLOv3 [34] can be applied directly to high-resolution images without such a paired dataset.

FIGURE 3. The two primary categories of the efficient object detection methods. Efficient detection methods are achieved by reducing the computation and reducing the search area. The representative methods are listed on the right.

2) REDUCING SEARCH AREA
Unlike other methods that use the information from the entire image area for object detection, several methods [39]–[41] suggested reducing the search area of images for efficient object detection.

Han et al. [39] applied a Region Locating Network (RLN) in addition to Faster R-CNN [25] to generate cropped images. The proposed architecture of the RLN is the same as the Region Proposal Network (RPN) in Faster R-CNN. However, unlike the RPN, the RLN can predict the possible areas of object locations from the original overhead images. Also, since the cropped images of the predicted areas are much smaller than the entire area of the original images, the proposed method showed a significant improvement in efficiency for detecting specific objects.

In addition, Sommer et al. [40] proposed an approach similar to the RLN [39], called the Search Area Reduction (SAR) module. The image is divided into several image patches; then, the module predicts scores based on the number of objects contained in each image patch. In particular, the SAR module has two characteristics that distinguish it from the earlier RLN [39]. First, unlike the RLN, which generates variously sized crops based on a clustering method, the SAR module handles image patches of a fixed size. Second, while the SAR module is integrated with Faster R-CNN to share the network, the RLN has a separate network for finding the regions. Such an integrated approach significantly reduces the inference time.

Also, Yang et al. [41] proposed the Clustered Detection (ClusDet) network, composed of a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet). CPNet is attached to the feature extraction backbone network and obtains high-level feature maps. Based on the feature map information, CPNet predicts the locations and scales of clusters in the input images. Then, ScaleNet predicts the scale offset for objects to rescale the cluster chips. Finally, the detection results from DetecNet on the cluster chips and on the whole image are combined to generate the final result. This method achieves high runtime efficiency by reducing the search area with the clustering approach.
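A minimal sketch of the search-area-reduction idea shared by [39]–[41] is given below: score fixed-size patches with a cheap model and run the full detector only on the most promising ones. The score_patch and detect callables are hypothetical stand-ins, not the published RLN, SAR module, or ClusDet networks.

```python
# Generic search-area reduction: detect only in the highest-scoring image patches.
def detect_with_reduced_search_area(image, score_patch, detect, patch=512, keep_ratio=0.25):
    """image: H x W (x C) array; score_patch/detect are placeholder callables."""
    h, w = image.shape[:2]
    candidates = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = image[y:y + patch, x:x + patch]
            candidates.append((score_patch(tile), y, x, tile))
    candidates.sort(key=lambda c: c[0], reverse=True)
    keep = max(1, int(len(candidates) * keep_ratio))      # search only the most promising areas
    detections = []
    for _, y, x, tile in candidates[:keep]:
        for (x1, y1, x2, y2, score) in detect(tile):
            detections.append((x1 + x, y1 + y, x2 + x, y2 + y, score))  # back to global coords
    return detections
```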
Compared with the existing search-area-based methods [39], [40], it is noteworthy that ClusDet achieves not only high efficiency but also improved detection performance for small objects.

B. SMALL OBJECT DETECTION
The limitation in detecting small-sized objects is another challenging problem associated with overhead images. Object detection in overhead images not only targets relatively large-sized objects such as buildings, bridges, and soccer fields but also, in many cases, needs to detect small-sized objects such as vehicles, people, and ships. However, as the resolution of images decreases, the capability of detecting small-sized objects decreases drastically. Therefore, the performance degradation in detecting small-sized objects in overhead images is extremely challenging and needs to be addressed. Recently, several methods have been proposed to achieve better small object detection performance, as shown in Fig. 4, which we describe in the following sections.

FIGURE 4. The two primary categories of the small object detection. Small object detection methods are achieved by fine-grained model architecture and multi-scale learning, and the corresponding methods are listed on the right.

1) FINE-GRAINED MODEL ARCHITECTURE
The intuitive and straightforward approach to solving the small-object detection problem is to extract fine-grained features from source images by adjusting model parameters.

First, Sommer et al. [42] demonstrated the effectiveness of deep learning-based detection methods for vehicle detection in aerial images. They mainly performed experiments based on Fast R-CNN [43] and Faster R-CNN [25], which are widely used in the object detection domain for terrestrial applications. Specifically, they proposed a common model architecture to detect small objects. To maintain sufficient information in the feature maps, they optimized parameters including the number of layers, kernel size, and anchor scale. This work showed the applicability of a general object detection model to vehicle detection in overhead imagery. However, the proposed work did not present a novel methodology beyond optimizing the network parameters of existing models [25], [43].

While Sommer et al. [42] utilized Faster R-CNN, a representative two-stage object detection model, other approaches employed one-stage object detection models such as YOLO [26], [29], [30]. Pham et al. [44] proposed YOLO-fine, a one-stage object detection model. The proposed model was implemented based on YOLOv3 [26] to effectively handle small objects. In detail, this model replaced the feature extraction layers with finer ones that use a lower sub-sampling factor. With this finer object search grid, YOLO-fine could recognize objects smaller than eight pixels that were not recognized by the original YOLOv3. They also reduced the number of model parameters compared to the original YOLOv3 by removing the last two convolutional blocks, which were not helpful for small-object detection. Overall, their work improved the detection performance for small objects by improving the discrimination of adjacent objects through a finer grid search, while reducing the number of model parameters by removing unnecessary convolution layers.

2) MULTI-SCALE LEARNING
While some works achieved better small-object detection performance through parameter optimization, others suggested a multi-scale approach that obtains features of objects at various scales.

Van Etten [28] proposed the You Only Look Twice (YOLT) model, inspired by YOLO [29], [30]. They introduced a new structure with an approach similar to the fine-grained model architectures above: in the YOLT model, fine-grained features are extracted by adjusting architecture parameters. Furthermore, YOLT applied a multi-scale training approach concurrently, because the fine-grained model could suffer from a high false-positive rate. An intuitive way to understand multi-scale training is that it is similar to building two different models. In this way, YOLT combines the detection results obtained from two different models that detect small and large objects, respectively, to determine the final object detection result.

Inspired by YOLT, Li et al. [45] proposed MOD-YOLT for the multi-scale object detection (MOD) task. They empirically categorized objects into three types with different size criteria. Then, the categorized objects were trained with the Multi-YOLT Network (MYN). While YOLT used a single network structure, MOD-YOLT proposes the MYN to obtain optimal feature maps using network structures optimized for each scale. With this framework, MOD-YOLT achieved higher detection performance than YOLT on a dataset from the second stage of the AIIA Cup Competition (https://www.datafountain.cn/competitions/288/datasets).

Also, Zhou et al. [46] applied a multi-scale network in addition to the Faster R-CNN architecture. Because the depth of a convolutional neural network is related to the feature level, the multi-scale network enabled the model to use multiple levels of input features. Therefore, this multi-scale network is beneficial for detecting small objects such as ships in SAR imagery, improving mAP performance compared to baseline models such as the Single Shot MultiBox Detector (SSD) [31] and RetinaNet [47].
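The two-scale detect-and-merge strategy used by YOLT-style methods can be illustrated as below, assuming two placeholder detector callables (one tuned for large objects, one for small objects) whose outputs are fused with a simple non-maximum suppression; this is a sketch of the general idea, not the authors' implementation.

```python
# Sketch of "detect at two scales and merge" for small-object detection.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a[:4]; bx1, by1, bx2, by2 = b[:4]
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter + 1e-9)

def nms(dets, thr=0.5):
    """Greedy non-maximum suppression on (x1, y1, x2, y2, score) tuples."""
    dets = sorted(dets, key=lambda d: d[4], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d, k) < thr for k in kept):
            kept.append(d)
    return kept

def multi_scale_detect(image, detect_large, detect_small):
    """Fuse the outputs of a large-object pass and a small-object pass."""
    results = list(detect_large(image)) + list(detect_small(image))
    return nms(results)
```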
Another method [48] to overcome the challenges of small object detection is to apply multi-scale training to images cropped from a source image. Based on a clustering method or a density map, cropped images are generated, and these cropped images contain objects at various scales. In particular, Li et al. [48] proposed the Density-Map guided object detection Network (DMNet) to crop the images based on a density map from the proposed density generation module. The Multi-column CNN (MCNN) [49] inspired the idea of the density generation module, which learns image features and generates a density map. The cropped images were fed into the object detector, and the results were fused to increase detection performance for small objects.

C. ORIENTED OBJECT DETECTION
Oriented objects can also cause misclassification and produce a considerable decrease in object detection models' performance. Therefore, deep learning-based methods [50]–[54] have recently been proposed to detect oriented objects with higher accuracy, as shown in Fig. 5. Existing overhead imagery datasets can be classified according to whether the coordinates of oriented bounding boxes are provided or whether the coordinates are provided as horizontal boxes, i.e., the center point, width, and height of the boxes. Depending on the dataset labeling configuration, different methods are employed for detecting oriented objects. Thus, we categorize oriented object detection methods based on the bounding box format they use: either horizontal or oriented.

FIGURE 5. The two primary categories of the oriented object detection, where they are divided into the horizontal and oriented bounding box methods at the high level. The core methods are illustrated on the right-hand side.

1) DETECTING HORIZONTAL BOUNDING BOX
The most intuitive way to improve the detection accuracy for oriented objects is to explore data augmentation. Cheng et al. [50] applied a data augmentation strategy and proposed a new objective function to achieve rotation invariance of the feature representations. The proposed method extended AlexNet [55] by replacing the last classification layer with a rotation-invariant CNN (RICNN) layer.

2) DETECTING ORIENTED BOUNDING BOX
Liu et al. [51] proposed the rotated region-based CNN (RR-CNN) with a rotated region of interest (RRoI) pooling layer. The proposed RRoI pooling layer pools rotated features into a 5-tuple: center position with respect to the x- and y-axis, width, height, and rotation angle. The pooled features obtained by RRoI are more robust and accurate than those of the previously proposed free-form RoI pooling layer [57]. Another advantage of this approach is its extensibility, as it can be combined with any other two-stage object detection model. For example, Liu et al. [51] used Fast R-CNN with RR-CNN in their experiments.

Even though RRoI can accurately extract rotated features, it suffered from expensive computational costs to generate proposal regions. To address this issue, Ding et al. [52] proposed an RoI transformer that consists of an RRoI learner and a Rotated Position Sensitive (RPS) RoI alignment module. The RRoI learner is trained to learn the transformation from HRoIs (horizontal RoIs) to RRoIs, and the RPS RoI alignment module extracts rotation-invariant features. Despite a negligible increase in computational cost, this RoI transformer-based method [52] significantly improved performance in detecting oriented objects.

In addition, Yi et al. [53] introduced box boundary-aware vectors (BBAVectors) to detect and predict the oriented bounding boxes of objects. Instead of using angle values predicted from features [51], BBAVectors employs a Cartesian coordinate system. The model detects the center keypoint and then specifies the position of the bounding box. The entire model architecture is implemented as an anchor-free one-stage detector, so the model can make inferences faster than two-stage detectors.

On the other hand, Han et al. [54] proposed a one-stage detection method called the single-shot alignment network (S2A-Net), in which a Feature Alignment Module (FAM) and an Oriented Detection Module (ODM) were introduced. The FAM consists of an Anchor Refinement Network (ARN) and an Alignment Convolution Layer (ACL). The ARN generates rotated anchors, and the ACL decodes the anchor prediction map into the oriented bounding box, extracting aligned features using alignment convolution (AlignConv). The ODM applies active rotating filters (ARFs) [58] to extract orientation-sensitive and orientation-invariant features. These features are used to predict the bounding boxes and classify the categories from two sub-networks. It achieved state-of-the-art performance on the DOTA [9] and HRSC2016 [59] datasets, which are widely used in oriented object detection research.
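For reference, the 5-tuple (center, width, height, angle) representation used by rotated RoI pooling maps to the four corner points of an oriented box through a standard geometric conversion; the small helper below illustrates it and is not code from [51] or [52].

```python
# Convert an oriented box (cx, cy, w, h, angle) to its four corner points.
import math

def obb_to_corners(cx, cy, w, h, angle_rad):
    """Return the 4 corners of an oriented box, rotated counter-clockwise about its center."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]

# Example: a 40 x 20 box centered at (100, 100), rotated 30 degrees.
corners = obb_to_corners(100, 100, 40, 20, math.radians(30))
```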
D. AUGMENTATION AND SUPER-RESOLUTION
In order to further improve detection performance, image data augmentation and super-resolution can be applied at a preprocessing stage. The different preprocessing strategies are categorized and described in Fig. 6.

FIGURE 6. The two primary categories of the preprocessing method. The most frequently used preprocessing methods for improving detection performance are super-resolution and image augmentation.

Shermeyer and Van Etten [63] applied super-resolution techniques [64], [65] to generate super-resolution images for their experiments. The results demonstrated that these super-resolution methods improved detection performance. When the resolution of the input images was increased from 30 cm to 15 cm, the mAP performance increased by 13% to 36%. Conversely, when the resolution was degraded from 30 cm to 120 cm, the detection performance decreased by 22% to 27%. The outcome of the experiments demonstrated that image resolution is highly related to detection performance; thus, super-resolution methods can generally improve overall object detection performance.

In a similar direction, Rabbi et al. [66] introduced the Edge-Enhanced Super-Resolution GAN (EESRGAN) method, which was inspired by the Edge Enhanced GAN (EEGAN) [67] and the Enhanced Super-Resolution GAN (ESRGAN) [68]. The proposed method consists of EESRGAN and an end-to-end network structure with Faster R-CNN or SSD as the base detection network. In particular, EESRGAN generates super-resolution images with rich edge information, and the base detection networks (Faster R-CNN and SSD) achieved improved accuracy with the super-resolution images.
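The preprocessing pipeline discussed in this section can be sketched as: super-resolve first, detect second, then map boxes back to the original pixel grid. In the sketch below, plain bicubic upsampling stands in for a learned super-resolution model such as VDSR [64] or an ESRGAN-style generator [68]; the detector callable is a placeholder assumption.

```python
# Super-resolution as a preprocessing step before detection (illustrative only).
import torch
import torch.nn.functional as F

def super_resolve(image: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """image: (1, 3, H, W) in [0, 1]; returns a (1, 3, scale*H, scale*W) tensor."""
    return F.interpolate(image, scale_factor=scale, mode="bicubic", align_corners=False)

def detect_with_super_resolution(image, detector, scale: int = 2):
    sr = super_resolve(image, scale)
    boxes = detector(sr)                                   # detect on the enhanced image
    return [(x1 / scale, y1 / scale, x2 / scale, y2 / scale, s)  # map back to original pixels
            for (x1, y1, x2, y2, s) in boxes]
```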
TABLE 1. Summary of EO satellite datasets. The SpaceNet 1 and 2 datasets provide two types of images, which are 8-band multispectral and 3-band RGB
imagery. All datasets are publicly available, where † indicates the performance achieved by the best team for the challenge evaluation metric.
E. MULTIMODAL OBJECT DETECTION
For example, Wegner et al. [71] combined aerial imagery with street-level imagery to catalog urban trees, generating object proposals from each street view image. The results with geographic coordination were combined to calculate multi-view proposal scores, and these scores generated the final detection results for the input region. The proposed model showed a significant improvement in mAP score at the evaluation stage compared to a Faster R-CNN model using overhead images alone.

Unlike the approaches that utilized different types of images, Wu et al. [72] proposed the Nuisance Disentangled Feature Transform (NDFT), which uses meta-data in conjunction with the images to obtain domain-robust features. Furthermore, by adopting adversarial learning, the NDFT disentangles the features of domain-specific nuisances such as altitude, angle, and weather information. Their proposed training process enables the model to be robust across various domains by learning domain-invariant features.

F. IMBALANCED OBJECTS DETECTION
Imbalanced objects are one of the challenging issues in overhead imagery research [47], [73]–[76]. After RetinaNet [47] introduced the focal loss to overcome the detection of imbalanced objects, more studies have extended the focal loss and improved its performance, as shown in Fig. 8.

Sergievskiy and Ponamarev [74] proposed the reduced focal loss, a weighted version of the original focal loss function. A threshold was applied to keep a minimum weight on positive samples to prevent an unintended drop in recall. The proposed method was evaluated on the xView [8] dataset with a random undersampling strategy and achieved first place in the DIUx xView 2018 Detection Challenge [8].

Unlike the previous studies using the focal loss function, Zhang et al. [75] proposed a Difficult Region Estimation Network (DREN). The DREN was trained to generate cropped images of difficult-to-detect regions for the testing phase, and these images were passed to the detector together with the original images. Their network utilized the balanced L1 loss from Libra R-CNN [77]. The balanced L1 loss restrains the gradients produced by outliers, which are samples with high loss values. By clipping the maximum gradients from outliers, it yields a more balanced regression from accurate samples.
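For completeness, a reference-style implementation of the focal loss of RetinaNet [47] is given below, together with an optional minimum-weight floor on positive samples in the spirit of the reduced focal loss described above; the floor value is an illustrative assumption, not the setting used in [74].

```python
# Focal loss with an optional minimum weight kept on positive samples.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0, pos_weight_floor=None):
    """logits, targets: tensors of the same shape; targets are 0/1 labels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    weight = alpha_t * (1 - p_t) ** gamma                  # down-weights easy examples
    if pos_weight_floor is not None:                       # keep a minimum weight on positives
        weight = torch.where(targets > 0, weight.clamp(min=pos_weight_floor), weight)
    return (weight * ce).mean()

# Example: 8 anchor predictions, 2 positives.
logits = torch.randn(8)
targets = torch.tensor([1., 0., 0., 1., 0., 0., 0., 0.])
loss = focal_loss(logits, targets, pos_weight_floor=0.02)
```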
III. DATASETS
In this section, we explain the most popular and openly available satellite imagery datasets based on their image sensor sources.
FIGURE 9. Images taken from EO satellite datasets as examples. Each dataset is composed of images captured using different sensors with varying
resolutions.
FIGURE 10. Sample images from SAR satellite imagery datasets. The radar waves are collected from active sensors and converted to black and
white images. Compared with images from passive sensors, generally, a SAR image has speckle noises.
TABLE 2. Summary of SAR satellite imagery datasets. Because OpenSARShip datasets provide only ship chips without raw images, image size is not
described in this table. For LS-SSDD-v1.0, both sizes of raw images and split image patches are described. All the datasets have ship class only, and they
are publicly available.
The images were collected from the Sentinel-1 satellite with two different product types: single look complex and ground range detected products. Compared to the previous version, OpenSARShip 2.0 provides additional information from the automatic identification system (AIS) and Marine Traffic, which contains the ship type, longitude, and latitude. Although detailed position information is annotated for each ship, the images are provided as cropped ship chips. Thus, the dataset is more suitable for a classification task than for object detection. A sample image is shown in Fig. 10 (a).

3) SAR-SHIP DATASET
Wang et al. [92] constructed a SAR image dataset using 102 images from the Chinese Gaofen-3 satellite and 108 images from the Sentinel-1 satellite. Compared with previous SAR datasets, this dataset focuses on complex background images, such as a harbor or the vicinity of an island. Through this characteristic, it aims to increase the performance of ship detection models without any land-ocean segmentation pre-processing. A sample image is provided in Fig. 10 (b).

4) HRSID
Wei et al. [93] constructed and released the High-Resolution SAR Images Dataset (HRSID) to foster research on ship detection and instance segmentation in SAR imagery. A total of 136 raw images were collected from the Sentinel-1B, TerraSAR-X, and TanDEM satellites. Then, the images were cropped to a fixed size with a 25% overlapping area. Optical imagery from Google Earth was utilized to minimize annotation errors. While the OpenSARShip [90] and SAR-Ship [92] datasets provide ship chips or small-size images, HRSID provides comparatively large images, which are beneficial for evaluating object detection methods. Furthermore, HRSID is composed of higher-resolution images than other SAR datasets, so the HRSID dataset is more effective at discriminating adjacent ships. A sample image is shown in Fig. 10 (c).

5) LS-SSDD-v1.0
Unlike previously released datasets containing ship chips or small-sized images, Zhang et al. [94] released a dataset with 15 large-scale raw images collected from Sentinel-1. The dataset, called the Large-Scale SAR Ship Detection Dataset-v1.0 (LS-SSDD-v1.0), provides raw images of size 24,000×16,000 pixels together with split sub-images of 800×800 pixels. In order to create conditions similar to the actual environment, images with pure backgrounds are provided without making separate ship chips. This means that sub-image patches are provided regardless of whether they contain target objects or not. This characteristic helps detection models learn pure backgrounds without objects, making the dataset more practical for real-world cases. A sample raw image is presented in Fig. 10 (d).

C. AERIAL IMAGERY DATASETS
As shown in Table 3, there are datasets constructed with images from passive optical sensors mounted on platforms such as airplanes and drones. It is generally difficult to specify detailed sensor information because the sensor specifications are not described in detail. Initial datasets focused primarily on detecting cars; however, recent trends for object detection extend to a wider variety of object classes.
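The fixed-size tiling used by LS-SSDD-v1.0 can be sketched as follows; the patch size of 800 pixels matches the dataset description above, while the dummy scene is only for demonstration and much smaller than a real Sentinel-1 scene.

```python
# Split a large SAR scene into fixed 800 x 800 sub-images, keeping background-only patches.
import numpy as np

def split_into_patches(scene: np.ndarray, size: int = 800):
    """Yield (row, col, patch) for every size x size tile of a 2-D SAR scene."""
    rows, cols = scene.shape[:2]
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            yield r, c, scene[r:r + size, c:c + size]

# Example with a small dummy scene.
dummy = np.zeros((1600, 2400), dtype=np.float32)
patches = list(split_into_patches(dummy))   # 2 x 3 = 6 patches of 800 x 800
```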
FIGURE 11. Sample images from aerial imagery datasets. Generally, aerial imagery datasets have higher resolution than satellite images.
TABLE 3. Summary of aerial imagery datasets. For DLR-MVDA, the values in parentheses are the number used for the training in the study [95]. All
datasets are publicly available, where † indicates the performance achieved by the best team for the challenge evaluation metric.
5) CARPK
Compared with the earlier datasets for car detection, such as OIRDS, VEDAI, and COWC, the CARPK dataset provides higher-resolution images to utilize fine-grained information. In addition, since the images were collected from one designated spot (a parking lot), a large portion of each image is filled with objects, in contrast to the images with sparse objects in the previous datasets. Because CARPK is a high-resolution image dataset, the images contain distinguishable objects located in close proximity. A sample image is presented in Fig. 11 (e).

6) VISDRONE
There have been two object detection challenges, called the VisDrone Challenge, in 2018 and 2019, with a drone-based benchmark dataset called VisDrone [104]. Zhu et al. [37] released the dataset to motivate research on computer vision tasks on the drone platform. It contains 263 video clips (179,264 frames) and 10,209 images captured by drones in various areas of China. In addition, occlusion and truncation ratio information is provided to capture the characteristics of overhead imagery. Whereas existing aerial image datasets usually use vehicles as target objects, VisDrone includes smaller object classes such as vehicles, pedestrians, and bicycles, so the dataset can be used for various object detection purposes. A sample image is shown in Fig. 11 (f).

7) UAVDT
Du et al. [105] constructed a dataset for detecting vehicles from a UAV platform. This dataset is called UAV Detection and Tracking (UAVDT), and it provides useful annotated attributes such as weather conditions, flying altitude, camera view, vehicle occlusion, and out-of-view. In particular, out-of-view was categorized based on the ratio of an object lying outside the frame. Because UAVDT represents real-world environments by covering various scenes, weather conditions, and camera angles, it has the advantage of evaluating the generalization performance of detection methods. Furthermore, it also contains various backgrounds in subsets divided for training and testing, respectively. The volume of images described in Table 3 excludes images used for the single-object tracking task. A sample image is shown in Fig. 11 (g).

D. SATELLITE AND AERIAL IMAGERY DATASETS
Lastly, there are datasets constructed with images from both satellite and aerial sources, as shown in Table 4; such datasets are helpful for improving and evaluating the generalization performance of object detection methods. Generally, EO satellites are utilized for satellite images, which have lower image resolution than aerial images.

1) TAS DATASET
Heitz and Koller [109] constructed an overhead car detection dataset obtained from Google Earth to demonstrate the performance of the things and stuff (TAS) context model. The dataset is a set of 30 color images of the city and suburbs of Brussels, Belgium, with a size of 792 × 636 pixels. A total of 1,319 cars are labeled manually, with an average car size of 45 × 45 pixels. The TAS dataset is meaningful in that it is one of the earliest overhead-view vehicle datasets. However, the amount of data is insufficient and lacks the diversity needed for applying the latest deep learning-based detection methods. A sample image is shown in Fig. 11 (a).

2) SZTAKI-INRIA
Benedek et al. [110] developed the SZTAKI-INRIA Building Detection Benchmark dataset for evaluating their proposed detection methods. The dataset contains 665 building footprints in 9 images from several cities of Hungary, the UK, Germany, and France. Among the nine images, two were obtained from an aerial source, and the rest were from satellite and Google Earth platforms. The SZTAKI-INRIA dataset, similar to the TAS dataset [109], is not suitable for applying a deep learning-based object detection method due to the insufficient volume of images and objects. A sample image is shown in Fig. 11 (b).

3) NWPU VHR-10
Cheng et al. [56] constructed the NWPU VHR-10 dataset, which contains 800 satellite and aerial images from Google Earth and the Vaihingen data [111]. The dataset consists of ten different object types, such as airplanes, ships, and storage tanks. The size of each object type varies from 418 × 418 pixels for large objects to 33 × 33 pixels for small ones. Furthermore, the image resolution varies from 0.08 m to 2 m for dataset diversity. In order to use the dataset according to its intended purpose, the NWPU VHR-10 dataset provides four independently divided sub-groups: 1) a negative image set, 2) a positive image set, 3) an optimizing set, and 4) a testing set. A sample image is shown in Fig. 11 (c).

4) DOTA
Xia et al. [9] first introduced a large-scale Dataset for Object deTection in Aerial images (DOTA).
FIGURE 12. Sample images for satellite and aerial imagery datasets. Since it contains both satellite and aerial images, datasets consist of
various characteristics such as sensor, resolution, and angle.
TABLE 4. Summary of satellite and aerial imagery datasets. All datasets are publicly available, where † indicates the performance achieved by the best
team for the challenge evaluation metric.
REFERENCES
[7] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, "Object detection in optical remote sensing images: A survey and a new benchmark," ISPRS J. Photogramm. Remote Sens., vol. 159, pp. 296–307, Jan. 2020.
[8] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord, "xView: Objects in context in overhead imagery," 2018, arXiv:1802.07856.
[9] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, "DOTA: A large-scale dataset for object detection in aerial images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3974–3983.
[10] M. A. Fischler and R. A. Elschlager, "The representation and matching of pictorial structures," IEEE Trans. Comput., vol. C-22, no. 1, pp. 67–92, Jan. 1973.
[11] D. M. McKeown, Jr., and J. L. Denlinger, "Cooperative methods for road tracking in aerial imagery," in Proc. DARPA IUS Workshop, 1988, pp. 327–341.
[12] A. Huertas and R. Nevatia, "Detecting buildings in aerial images," Comput. Vis., Graph., Image Process., vol. 41, no. 2, pp. 131–152, Feb. 1988.
[13] R. B. Irvin and D. M. McKeown, "Methods for exploiting the relationship between buildings and their shadows in aerial imagery," IEEE Trans. Syst., Man Cybern., vol. 19, no. 6, pp. 1564–1575, Nov. 1989.
[14] J. E. Ball, D. T. Anderson, and C. S. Chan, "Comprehensive survey of deep learning in remote sensing: Theories, tools, and challenges for the community," J. Appl. Remote Sens., vol. 11, no. 4, p. 042609, 2017.
[15] Y. Gu, Y. Wang, and Y. Li, "A survey on deep learning-driven remote sensing image scene understanding: Scene classification, scene retrieval and scene-guided object detection," Appl. Sci., vol. 9, no. 10, p. 2110, May 2019.
[16] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, "Deep learning in remote sensing applications: A meta-analysis and review," ISPRS J. Photogramm. Remote Sens., vol. 152, pp. 166–177, Jun. 2019.
[17] Q. Yuan, H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, J. Wang, J. Gao, and L. Zhang, "Deep learning in environmental remote sensing: Achievements and challenges," Remote Sens. Environ., vol. 241, May 2020, Art. no. 111716.
[18] J. Beal, E. Kim, E. Tzeng, D. H. Park, A. Zhai, and D. Kislyuk, "Toward transformer-based object detection," 2020, arXiv:2012.09958.
[19] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, and F. Liu, "ViT-YOLO: Transformer-based YOLO for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2799–2808.
[20] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," 2021, arXiv:2103.14030.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[22] T. Liang, X. Chu, Y. Liu, Y. Wang, Z. Tang, W. Chu, J. Chen, and H. Ling, "CBNetV2: A composite backbone network architecture for object detection," 2021, arXiv:2107.00420.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[24] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021.
[25] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," 2015, arXiv:1506.01497.
[26] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[27] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, "Dynamic head: Unifying object detection heads with attentions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 7373–7382.
[28] A. Van Etten, "You only look twice: Rapid multi-scale object detection in satellite imagery," 2018, arXiv:1805.09512.
[29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 779–788.
[30] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 7263–7271.
[31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 21–37.
[32] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo, "Swin transformer V2: Scaling up capacity and resolution," 2021, arXiv:2111.09883.
[33] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, "Focal self-attention for local-global interactions in vision transformers," 2021, arXiv:2107.00641.
[34] P. Zhang, Y. Zhong, and X. Li, "SlimYOLOv3: Narrower, faster and better for real-time UAV applications," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1–9.
[35] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2736–2744.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Jul. 2015.
[37] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, "Vision meets drones: A challenge," 2018, arXiv:1804.07437.
[38] B. Uzkent, C. Yeh, and S. Ermon, "Efficient object detection in large images using deep reinforcement learning," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1824–1833.
[39] Z. Han, H. Zhang, J. Zhang, and X. Hu, "Fast aircraft detection based on region locating network in large-scale remote sensing images," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 2294–2298.
[40] L. Sommer, N. Schmidt, A. Schumann, and J. Beyerer, "Search area reduction fast-RCNN for fast vehicle detection in large aerial imagery," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 3054–3058.
[41] F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, "Clustered object detection in aerial images," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8311–8320.
[42] L. W. Sommer, T. Schuchert, and J. Beyerer, "Fast deep vehicle detection in aerial images," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 311–319.
[43] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[44] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, "YOLO-fine: One-stage detector of small objects under various backgrounds in remote sensing images," Remote Sens., vol. 12, no. 15, p. 2501, Aug. 2020.
[45] W. Li, W. Li, F. Yang, and P. Wang, "Multi-scale object detection in satellite imagery based on YOLT," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2019, pp. 162–165.
[46] Y. Zhou, Z. Cai, Y. Zhu, and J. Yan, "Automatic ship detection in SAR image based on multi-scale faster R-CNN," J. Phys., Conf., vol. 1550, no. 4, May 2020, Art. no. 042006.
[47] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[48] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, "Density map guided object detection in aerial images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 190–191.
[49] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 589–597.
[50] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[51] Z. Liu, J. Hu, L. Weng, and Y. Yang, "Rotated region based CNN for ship detection," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 900–904.
[52] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, "Learning RoI transformer for oriented object detection in aerial images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 2849–2858.
[53] J. Yi, P. Wu, B. Liu, Q. Huang, H. Qu, and D. Metaxas, "Oriented object detection in aerial images with box boundary-aware vectors," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Jan. 2021, pp. 2150–2159.
[54] J. Han, J. Ding, J. Li, and G.-S. Xia, "Align deep features for oriented object detection," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–11, 2022.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 25, 2012, pp. 1097–1105.
[56] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.
[57] H. Caesar, J. Uijlings, and V. Ferrari, "Region-based semantic segmentation with end-to-end training," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 381–397.
[58] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, "Oriented response networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 519–528.
[59] Z. Liu, H. Wang, H. Weng, and L. Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, Aug. 2016.
[60] Y.-M. Chou, C.-H. Chen, K.-H. Liu, and C.-S. Chen, "Stingray detection of aerial images using augmented training images generated by a conditional generative model," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 1403–1409.
[61] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam, "Optimizing the latent space of generative networks," 2017, arXiv:1707.05776.
[62] C. Chen, Y. Zhang, Q. Lv, S. Wei, X. Wang, X. Sun, and J. Dong, "RRNet: A hybrid detector for object detection in drone-captured images," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, Oct. 2019, pp. 1–9.
[63] J. Shermeyer and A. Van Etten, "The effects of super-resolution on object detection performance in satellite imagery," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2019, pp. 1–10.
[64] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 1646–1654.
[65] S. Schulter, C. Leistner, and H. Bischof, "Fast and accurate image upscaling with super-resolution forests," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3791–3799.
[66] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, "Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network," Remote Sens., vol. 12, no. 9, p. 1432, May 2020.
[67] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, "Edge-enhanced GAN for remote sensing image superresolution," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, Mar. 2019.
[68] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, "ESRGAN: Enhanced super-resolution generative adversarial networks," in Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops, Sep. 2018, pp. 1–16.
[69] L. Cao, R. Ji, C. Wang, and J. Li, "Towards domain adaptive vehicle detection in satellite image by supervised super-resolution transfer," in Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1, 2016, pp. 1–7.
[70] T. Malisiewicz, A. Gupta, and A. A. Efros, "Ensemble of exemplar-SVMs for object detection and beyond," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 89–96.
[71] J. D. Wegner, S. Branson, D. Hall, K. Schindler, and P. Perona, "Cataloging public objects using aerial and street-level images-urban trees," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 6014–6023.
[72] Z. Wu, K. Suresh, P. Narayanan, H. Xu, H. Kwon, and Z. Wang, "Delving into robust object detection from unmanned aerial vehicles: A deep nuisance disentanglement approach," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1201–1210.
[73] M. Y. Yang, W. Liao, X. Li, and B. Rosenhahn, "Deep learning for vehicle detection in aerial images," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 3079–3083.
[74] N. Sergievskiy and A. Ponamarev, "Reduced focal loss: 1st place solution to xView object detection in satellite imagery," 2019, arXiv:1903.01347.
[75] J. Zhang, J. Huang, X. Chen, and D. Zhang, "How to fully exploit the abilities of aerial image detectors," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, Oct. 2019, pp. 1–8.
[76] S. Hong, S. Kang, and D. Cho, "Patch-level augmentation for object detection in aerial images," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1–8.
[77] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, "Libra R-CNN: Towards balanced learning for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 821–830.
[78] D. Liang, Q. Geng, Z. Wei, D. A. Vorontsov, E. L. Kim, M. Wei, and H. Zhou, "Anchor retouching via model interaction for robust object detection in aerial images," IEEE Trans. Geosci. Remote Sens., early access, Dec. 16, 2021, doi: 10.1109/TGRS.2021.3136350.
[79] Z. Liu, L. Yuan, L. Weng, and Y. Yang, "A high resolution optical satellite image dataset for ship recognition and some new baselines," in Proc. Int. Conf. Pattern Recognit. Appl. Methods, vol. 2, 2017, pp. 324–331.
[80] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, "SpaceNet: A remote sensing dataset and challenge series," 2018, arXiv:1807.01232.
[81] N. Weir, D. Lindenbaum, A. Bastidas, A. V. Etten, S. McPherson, J. Shermeyer, V. Kumar, and H. Tang, "SpaceNet MVOI: A multi-view overhead imagery dataset," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 992–1001.
[82] T. Zhang, X. Zhang, and X. Ke, "Quad-FPN: A novel quad feature pyramid network for SAR ship detection," Remote Sens., vol. 13, no. 14, p. 2771, Jul. 2021.
[83] T. Zhang and X. Zhang, "Squeeze-and-excitation Laplacian pyramid network with dual-polarization feature fusion for ship classification in SAR images," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[84] Y. Wu, Y. Yuan, J. Guan, L. Yin, J. Chen, G. Zhang, and P. Feng, "Joint convolutional neural network for small-scale ship classification in SAR images," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2019, pp. 2619–2622.
[85] Z. Cui, X. Wang, N. Liu, Z. Cao, and J. Yang, "Ship detection in large-scale SAR images via spatial shuffle-group enhance attention," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp. 379–391, Jan. 2021.
[86] Z. Sun, M. Dai, X. Leng, Y. Lei, B. Xiong, K. Ji, and G. Kuang, "An anchor-free detection method for ship targets in high-resolution SAR images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 7799–7816, 2021.
[87] X. Zhang, C. Huo, N. Xu, H. Jiang, Y. Cao, L. Ni, and C. Pan, "Multitask learning for ship detection from synthetic aperture radar images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 8048–8062, 2021.
[88] J. Li, C. Qu, and J. Shao, "Ship detection in SAR images based on an improved faster R-CNN," in Proc. SAR Big Data Era, Models, Methods Appl. (BIGSARDATA), Nov. 2017, pp. 1–6.
[89] H. Guo, X. Yang, N. Wang, and X. Gao, "A CenterNet++ model for ship detection in SAR images," Pattern Recognit., vol. 112, Apr. 2021, Art. no. 107787.
[90] B. Li, B. Liu, L. Huang, W. Guo, Z. Zhang, and W. Yu, "OpenSARShip 2.0: A large-volume dataset for deeper interpretation of ship targets in Sentinel-1 imagery," in Proc. SAR Big Data Era, Models, Methods Appl. (BIGSARDATA), Nov. 2017, pp. 1–5.
[91] L. Huang, B. Liu, B. Li, W. Guo, W. Yu, Z. Zhang, and W. Yu, "OpenSARShip: A dataset dedicated to Sentinel-1 ship interpretation," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 1, pp. 195–208, Jan. 2017.
[92] Y. Wang, C. Wang, H. Zhang, Y. Dong, and S. Wei, "A SAR dataset of ship detection for deep learning under complex backgrounds," Remote Sens., vol. 11, no. 7, p. 765, Mar. 2019.
[93] S. Wei, X. Zeng, Q. Qu, M. Wang, H. Su, and J. Shi, "HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation," IEEE Access, vol. 8, pp. 120234–120254, 2020.
[94] T. Zhang, X. Zhang, X. Ke, X. Zhan, J. Shi, S. Wei, D. Pan, J. Li, H. Su, Y. Zhou, and D. Kumar, "LS-SSDD-v1.0: A deep learning dataset dedicated to small ship detection from large-scale Sentinel-1 SAR images," Remote Sens., vol. 12, no. 18, p. 2997, Sep. 2020.
[95] K. Liu and G. Mattyus, "Fast multiclass vehicle detection on aerial images," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942, Sep. 2015.
[96] H. Tayara, K. G. Soo, and K. T. Chong, "Vehicle detection and counting in high-resolution aerial images using convolutional regression neural network," IEEE Access, vol. 6, pp. 2220–2230, 2018.
[97] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, "Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining," Sensors, vol. 17, no. 2, p. 336, 2017.
[98] T. Tang, S. Zhou, Z. Deng, L. Lei, and H. Zou, "Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks," Remote Sens., vol. 9, no. 11, p. 1170, 2017.
[99] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, "A large contextual dataset for classification, detection and counting of cars with deep learning," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 785–800.
[100] E. Goldman, R. Herzig, A. Eisenschtat, J. Goldberger, and T. Hassner, "Precise detection in densely packed scenes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 5227–5236.
SHAHROZ TARIQ received the B.S. degree (Hons.) in computer science from the National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan, and the M.S. degree (Hons.) in computer science from Sangmyung University, Cheonan, South Korea. He is currently pursuing the Ph.D. degree with Sungkyunkwan University (Natural Sciences Campus), Suwon, South Korea. He worked as a Software Engineer at Bentley Systems from 2014 to 2015. He was a Ph.D. Research Assistant at Stony Brook University and SUNY Korea from 2017 to 2019.
HAN OH received the B.E. degree (Hons.) in computer engineering from Myongji University, Yongin, South Korea, in 2004, the M.S. degree in information and communications engineering from the Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea, in 2006, and the Ph.D. degree in electrical engineering from The University of Arizona, Tucson, AZ, USA, in 2011. In 2012, he joined Samsung Electronics, Suwon, South Korea, where he was involved in the development of digital cameras and 3D scanning systems. Since 2014, he has been with the Korea Aerospace Research Institute (KARI), where he is currently with the Satellite Application Division. His current research interests include deep learning-based image processing, digital communications, and space/ground network protocols.
JUNHYUNG KANG received the B.S. degree in mechanical engineering from Sungkyunkwan University (Natural Sciences Campus), Suwon, South Korea, where he is currently pursuing the M.S. degree.

SIMON S. WOO received the B.S. degree in electrical engineering from the University of Washington (UW), Seattle, WA, USA, the M.S. degree in electrical and computer engineering from the University of California, San Diego (UCSD), La Jolla, CA, USA, and the M.S. and Ph.D. degrees in computer science from the University of Southern California (USC), Los Angeles, CA, USA. He was a member of technical staff (technologist) for nine years at NASA's Jet Propulsion Laboratory (JPL), Pasadena, CA, USA, conducting research in satellite communications, networking, and cybersecurity. He also worked at Intel Corp. and the Verisign Research Laboratory. From 2017, he was a Tenure-Track Assistant Professor at SUNY, South Korea, and a Research Assistant Professor at Stony Brook University. He is currently a Tenure-Track Assistant Professor at the SKKU Institute for Convergence and the Department of Applied Data Science and Software, Sungkyunkwan University, Suwon, South Korea.