
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 16, NO. 5, MAY 2019

Detection of Multiclass Objects in Optical Remote Sensing Images

Wenchao Liu, Long Ma, Jue Wang, and He Chen

Abstract— Object detection in complex optical remote sensing images is a challenging problem due to the wide variety of scales, densities, and shapes of object instances on the earth's surface. In this letter, we focus on the wide-scale variation problem of multiclass object detection and propose an effective object detection framework for remote sensing images based on YOLOv2. To make the model adaptable to multiscale object detection, we design a network that concatenates feature maps from layers of different depths and adopt a feature introducing strategy based on oriented response dilated convolution. Through this strategy, the performance for small-scale object detection is improved without losing the performance for large-scale object detection. Compared to YOLOv2, the performance of the proposed framework tested on the DOTA (a large-scale data set for object detection in aerial images) data set improves by 4.4% mean average precision without adding extra parameters. The proposed framework achieves real-time detection for 1024 × 1024 images using Titan Xp GPU acceleration.¹

Index Terms— Feature introducing strategy, object detection, optical remote sensing image, oriented response (OR) dilated convolution.

Manuscript received July 10, 2018; revised October 12, 2018; accepted November 18, 2018. Date of publication December 12, 2018; date of current version April 22, 2019. This work was supported in part by the Chang Jiang Scholars Program under Grant T2012122 and in part by the Hundred Leading Talent Project of Beijing Science and Technology under Grant Z141101001514005. (Corresponding author: He Chen.)

W. Liu, J. Wang, and H. Chen are with the Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).

L. Ma is with the School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China.

Digital Object Identifier 10.1109/LGRS.2018.2882778

¹ https://github.com/WenchaoliuMUC/Detection-of-Multiclass-Objects-in-Optical-Remote-Sensing-Images

I. INTRODUCTION

Object detection (e.g., ship, airplane, and vehicle detection) in optical remote sensing images is in high demand in extensive earth observation applications. However, complex scene conditions and massive data quantities make multiclass object detection a challenging problem. Furthermore, huge variations in the scale, orientation, and shape of object instances on the earth's surface, together with the imbalance among a wide variety of categories, further increase the complexity of object detection in optical remote sensing images, which is reflected in existing annotated data sets [1]–[3].

Extensive studies have been devoted to object detection in optical remote sensing images. Drawing upon recent advances in computer vision, many researchers have applied object detection methods originally developed for natural scenes to optical remote sensing images. Bai et al. [4] and Zhang et al. [5] proposed intuitive and effective methods for object detection based on structural feature selection and structural feature description, but these methods are difficult to apply to complicated remote sensing scenes when only limited data are available. Yang et al. [6] proposed an effective fully convolutional network-based airplane detection framework; experiments show the high precision, recall, and location accuracy of this framework. However, it is difficult for this method to detect objects in complex scenes because it relies on Markov random field image segmentation. Liu et al. [7] proposed an arbitrary-oriented ship detection framework based on convolutional neural networks (CNNs). This framework performs well on small object detection but poorly on extremely large-scale objects. Cheng et al. [3] proposed a rotation-invariant CNN for object detection in very high-resolution (VHR) optical remote sensing images, and the experimental results demonstrate excellent performance on a publicly available 10-class VHR object detection data set. Long et al. [8] proposed a multiclass object detection method for remote sensing images based on CNNs; this method focuses on the accurate localization of detected objects and achieves good performance. However, the latter two methods are complex, which may limit their applicability to large-sized remote sensing object detection.

Overall, with the astonishing development of deep learning, state-of-the-art CNN-based object detection frameworks such as SSD [9], R-FCN [10], Faster R-CNN [11], and YOLOv2 [12], [13] perform excellently on the general object detection problem and, therefore, have been widely used to address the object detection problem for remote sensing images. Recently, Xia et al. [2] evaluated the performance of these methods on a large-scale remote sensing object detection data set (DOTA). The pixel sizes of the categories, the aspect ratios and orientations of the instances, the instance density of the images, and the spatial resolutions in DOTA vary widely, which makes DOTA challenging; it is therefore not surprising that these object detectors perform poorly on it. Recently, the new version of YOLO (v3) [14] significantly improved object detection performance, but at the cost of increased model complexity.

In this letter, we focus on the huge-scale variation problem of multiclass object detection and propose a simple object detection method that slightly modifies YOLOv2 and robustly and effectively detects objects in a large-scale challenging data set.

The main contributions of this letter lie in the following two aspects.

1) We present a simple multiclass object detection architecture for wide-scale varied objects in optical remote sensing images.

2) We enhance the detection performance of small objects by introducing fine features, with little performance loss in large object detection.

Compared to the fundamental framework YOLOv2, the mean average precision (mAP) metric is increased by 4.4% without increasing the number of parameters. In summary, the proposed architecture provides a concise procedure to achieve high detection performance for wide-scale varied objects in optical remote sensing images.

II. PROPOSED ARCHITECTURE

The improved YOLOv2 network structure in [7] performs well in small object detection for remote sensing images. The structure of the network is as simple as that of YOLOv2. Moreover, this network can run in real time and is suitable for object detection in large-sized remote sensing images. Thus, we adopt the network structure in [7] as our fundamental network in this letter. However, this network structure performs poorly in large object detection. In this letter, we aim to enhance the performance of large object detection for the fundamental network with no or little performance loss in small object detection. We adopt an oriented response (OR) dilated convolution-based fine feature introduction strategy to improve the performance of large object detection. The network structure is shown in Fig. 1(b). The details are described in Sections II-A and II-B.

Fig. 1. Comparison of network structures. (a) YOLOv2 network structure. (b) Proposed network structure.

A. Network Structure

The network in [7] demonstrates that concatenating fine features can effectively improve the performance of object detection, especially for small objects, in remote sensing images. However, we find that directly concatenating fine features by the route layer can limit the performance for large-scale objects, such as the soccer ball field. We consider that the receptive field becomes too small when training with fine feature maps for large object instances, which makes large objects difficult to correctly detect and classify. Generally, multiscale prediction is an effective solution; however, it can make the network architecture complicated. In this letter, we propose a simple and effective method that does not increase the number of parameters compared to YOLOv2.

To improve the performance of large object detection without a performance loss for small object detection, we need to expand the receptive field while introducing fine features (layer 6). Dilated convolution is a good choice that can effectively expand the receptive field. Thus, we adopt a dilated convolution-based fine feature introduction strategy. We set the stride of the dilated convolution to 2, so the size of the output feature map is half the size of layer 6. Then, the output feature map is brought to layer 25 by the route layer.

Dilated convolution is also adopted in layer 23. The output feature map of layer 23 is upsampled by a transposed convolution, and the upsampled feature map is brought to layer 25 by the route layer. At the same time, a route layer brings the output feature maps of layer 16 to layer 25. All the feature maps brought by the route layers are integrated. The integrated feature map is followed by two convolutional layers with kernel sizes of 3 × 3 and 1 × 1. All convolutional layers except the last layer are sequentially followed by a batch normalization (BN) layer and a ReLU layer.
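As an illustration of the feature introduction and routing described above, the following PyTorch sketch wires a fine-feature branch (dilated convolution with stride 2 on layer 6), a deep branch (dilated convolution on layer 23 followed by transposed-convolution upsampling), and the routed layer-16 features into one concatenated map. The module name, channel counts, and helper function are our own illustrative assumptions and are not taken from the released code; only the connectivity follows the description in the text.

    import torch
    import torch.nn as nn

    def conv_bn_relu(in_ch, out_ch, k, stride=1, dilation=1):
        # "All convolutional layers except the last layer are sequentially
        # followed by a batch normalization (BN) layer and a ReLU layer."
        pad = dilation * (k // 2)
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=pad,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class FineFeatureHead(nn.Module):
        # Integrates the routed feature maps; channel counts are illustrative.
        # n_out = n_anchors * (5 + n_classes), e.g., 5 * (5 + 15) = 100 for DOTA.
        def __init__(self, c6=128, c16=512, c23=1024, n_out=100):
            super().__init__()
            # Layer-6 fine features: dilated conv with stride 2 halves the
            # spatial size before the route layer brings them to layer 25.
            self.fine_branch = conv_bn_relu(c6, 64, 3, stride=2, dilation=2)
            # Layer 23 also uses a dilated conv; its output is upsampled by a
            # transposed convolution before being routed to layer 25.
            self.deep_branch = conv_bn_relu(c23, 256, 3, dilation=2)
            self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
            # The integrated map is followed by a 3 x 3 conv and a final
            # 1 x 1 conv (the last layer has no BN/ReLU).
            self.integrate = conv_bn_relu(64 + c16 + 256, 512, 3)
            self.predict = nn.Conv2d(512, n_out, 1)

        def forward(self, feat6, feat16, feat23):
            fine = self.fine_branch(feat6)                 # half the size of layer 6
            deep = self.upsample(self.deep_branch(feat23))
            fused = torch.cat([fine, feat16, deep], dim=1)  # route-layer concatenation
            return self.predict(self.integrate(fused))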
B. Oriented Response Dilated Convolution

Conventional convolution introduces many parameters when introducing fine features, and it is unnecessary to add these extra parameters. OR convolution [15], [16] can achieve performance similar to that of conventional convolution with fewer parameters by enhancing the model's capability of capturing global/local rotations. Considering that objects in remote sensing images usually appear in arbitrary orientations, we adopt OR dilated convolution in the proposed architecture; that is, we use the OR dilated convolution in the fine feature introduction module and in layer 23. Layer 26 uses the original OR convolution. The network structure with OR convolution is shown in Fig. 2.

Fig. 2. Network structure with OR convolution.

As shown in Fig. 3, OR convolution generates feature maps that contain N orientation channels by means of active rotating filters (ARFs). The number of parameters of OR convolution is 1/N of that of conventional convolution. First, the ARF proactively captures the feature maps of the sampling result and explicitly encodes its location and orientation to generate output feature maps that consist of N orientation channels. The kernel size is set to 3 × 3 for both the OR dilated convolution and the original OR convolution.


Fig. 3. Dilated ARF (3 × 3 × 8). (R0) Materialized filter. (R1–R7) Seven ARFs of the materialized filter.

Then, the generated feature maps are followed by an OR BN layer, which can effectively accelerate the learning process.

During training, back-propagation collectively updates the materialized filter, whose gradient is aggregated from the ARF bank as follows:

    δ = (1/N) Σ_{i=0}^{N−1} δ_i        (1)

where δ_i denotes the gradient of the i-th filter in the ARF (R_i) and δ is the aggregated gradient used to update the materialized filter. In this letter, the orientation channel number N is set to 4 or 8, which are typical values [16].
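A minimal sketch of the gradient aggregation in (1) is given below. It assumes that each per-orientation gradient has already been mapped back to the coordinate frame of the materialized filter; the rotation bookkeeping of a full ARF implementation is omitted, and the function name is ours.

    import torch

    def aggregate_arf_gradient(orientation_grads):
        # Implements Eq. (1): delta = (1/N) * sum_i delta_i.
        # Each entry is the gradient w.r.t. one rotated copy R_i of the
        # materialized filter, assumed here to be already expressed in the
        # canonical orientation of that filter.
        n = len(orientation_grads)                    # N = 4 or 8 in this letter
        return torch.stack(orientation_grads, dim=0).sum(dim=0) / n

    # Hypothetical usage: N = 8 rotated copies of a 3 x 3 filter bank.
    grads = [torch.randn(256, 64, 3, 3) for _ in range(8)]
    materialized_grad = aggregate_arf_gradient(grads)  # used to update the filter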
III. EXPERIMENTAL RESULTS AND DISCUSSION

This section reports several experiments conducted with the proposed architecture. All networks except YOLOv3 [14] are trained in PyTorch 0.4.0 using the Adam stochastic optimization method with the same loss function; note that we adopt the cross-entropy loss to evaluate the classification loss. YOLOv3 is trained and tested in DARKNET. In this section, the threshold for nonmaximum suppression (NMS) is set to 0.4 for all models. All experiments were performed with the assistance of a GTX Titan Xp GPU with 12 GB of memory donated by the NVIDIA Academic Program Team. The parameter settings of the model are the same as those in [7].

A. Data Set Description

The challenging large-scale data set for object detection in aerial images (DOTA) [2] is used in this letter. DOTA contains 2806 aerial images from different sensors and platforms. Each image is approximately 4000 × 4000 pixels and contains instances of 15 common object categories. This data set supports two tasks: detection with horizontal bounding boxes (HBB) and detection with oriented bounding boxes. In this letter, we select the HBB task. The best result (mAP) of the baseline models evaluated with HBB ground truths is 60.46%. Most instances in object classes such as small vehicles, ships, and storage tanks are small. The instance sizes in classes such as soccer ball fields and ground track fields are highly diverse and generally large; some of them are extremely large.

As GPU memory is limited, all training images are divided into 1024 × 1024 pixel patches by the DOTA development kit in the experiment. In the testing phase, the test images are first cropped with an overlapping area of 512 pixels. Then, the detection results of each cropped patch are combined to restore the detection results on the original images. These processing steps for the DOTA data set are consistent with the steps in [2].
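In practice, the cropping and result merging described above are handled by the DOTA development kit. The sketch below illustrates the same idea under our own assumptions (the function names and the torchvision-based NMS are ours): patches are tiled with a 512-pixel overlap, per-patch boxes are shifted back to image coordinates, and class-wise NMS with the 0.4 threshold removes duplicate detections in the overlap regions.

    import torch
    from torchvision.ops import nms

    def crop_positions(size, patch=1024, overlap=512):
        # Top-left offsets that tile one image dimension with the given overlap.
        stride = patch - overlap
        xs = list(range(0, max(size - patch, 0) + 1, stride))
        if xs[-1] + patch < size:          # make sure the image border is covered
            xs.append(size - patch)
        return xs

    def merge_patch_detections(patch_dets, iou_thresh=0.4):
        # patch_dets: list of (x_off, y_off, boxes[K,4], scores[K], labels[K]).
        # Boxes are shifted back to original-image coordinates, then class-wise
        # NMS with threshold 0.4 (the setting used for all models here) is run.
        boxes, scores, labels = [], [], []
        for x_off, y_off, b, s, l in patch_dets:
            shift = torch.tensor([x_off, y_off, x_off, y_off], dtype=b.dtype)
            boxes.append(b + shift); scores.append(s); labels.append(l)
        boxes, scores, labels = torch.cat(boxes), torch.cat(scores), torch.cat(labels)
        keep = []
        for c in labels.unique():          # suppress duplicates class by class
            idx = (labels == c).nonzero(as_tuple=True)[0]
            keep.append(idx[nms(boxes[idx], scores[idx], iou_thresh)])
        keep = torch.cat(keep)
        return boxes[keep], scores[keep], labels[keep]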
B. Performance Evaluation

All experiments in this section are performed on the DOTA validation set. We train all models for 60 epochs. The learning rate is set to 0.01 for the first 50 epochs and 0.0001 for the last 10 epochs. The evaluation result (mAP) on the validation set is obtained using the DOTA development kit; the larger the mAP, the better the performance of the model. Thus, we adopt the model parameters that obtain the largest mAP on the DOTA validation set as the final model parameters.
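The training and model-selection protocol above can be summarized by the following sketch. The model and the training/evaluation helpers are stand-ins (the real ones are the detection network and the DOTA development-kit evaluation), but the optimizer, epoch count, and learning-rate switch follow the settings just described.

    import torch
    import torch.nn as nn

    # Stand-ins so the sketch runs on its own; in practice these are the
    # detection network, the data loaders, and the dev-kit mAP evaluation.
    model = nn.Conv2d(3, 16, 3)
    def train_one_epoch(model, optimizer): pass
    def evaluate_map(model): return 0.0

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    best_map, best_state = -1.0, None
    for epoch in range(1, 61):                      # 60 epochs in total
        if epoch == 51:                             # drop the lr for the last 10 epochs
            for group in optimizer.param_groups:
                group["lr"] = 0.0001
        train_one_epoch(model, optimizer)
        val_map = evaluate_map(model)               # mAP on the DOTA validation set
        if val_map > best_map:                      # keep the best-mAP epoch as final
            best_map = val_map
            best_state = {k: v.clone() for k, v in model.state_dict().items()}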
Fig. 4. mAP curve evaluated on the DOTA validation set.

Pretrain: All networks are pretrained using the union of the VOC 2007 and VOC 2012 trainval data sets. Note that we pretrain two cases, where N is 4 and 8, and select the weight parameters that perform best on the VOC 2007 test set as the pretrained model parameters.

Experiment 1: In this experiment, we train and evaluate the proposed model with the OR convolution kernel. The network structure tested in this experiment is shown in Fig. 2. Note that we test two cases, where N is 4 and 8. Fig. 4 shows the mAP curves for the object detection results evaluated on the DOTA validation set. The model with N = 4 adopts epoch 57 as the final model and obtains a mAP of 69.2%. The model with N = 8 adopts epoch 57 as the final model and obtains a mAP of 65.8%. The details for each class are shown in Table I. From the experimental results, we find that the performance of the proposed model with N = 4 is better than that of the model with N = 8. Therefore, we suspect that the excessive reduction of parameters leads to a decrease in the fitting ability of the model.

Experiment 2: We train and evaluate the proposed model in which all convolution kernels are conventional. The network structure tested in this experiment is shown in Fig. 1(b). The experimental result is shown in Fig. 4. The model with conventional convolution kernels adopts epoch 60 as the final model and obtains a mAP of 69.3%. From the experimental results, we find that the performance of the proposed model with conventional convolution kernels is similar to that of the proposed model with OR convolution kernels.

C. Comparison and Discussion

We consider the state-of-the-art methods YOLOv2 and YOLOv3, and the method in [7], as the baselines.


Fig. 5. Visual comparison results in typical scenes. (First column) Result of the method in [7]. (Second column) Result of the original YOLOv2. (Third column) Result of the original YOLOv3. (Last column) Result of the proposed architecture.

The performance comparison metric (mAP) on the DOTA test set is obtained from the DOTA performance evaluation server. We train and test the model architecture in [7], which was originally designed for ship detection. For multiclass object detection, we modify its last layer to make it consistent with the last layer of the proposed model in this letter. For a fair comparison, this model is pretrained using the union of the VOC 2007 and VOC 2012 trainval data sets. As shown in Fig. 4, epoch 60 performs best on the validation set and is adopted as the final model. As shown in Table II, the result on the DOTA test set is a mAP of 64.3%. This model performs well on small objects such as ships, vehicles, and storage tanks. However, its performance on the soccer ball field, baseball diamond, and ground track field classes, whose instances are large, is poor.

The original YOLOv2 and YOLOv3 are also pretrained using the union of the VOC 2007 and VOC 2012 trainval data sets. As shown in Fig. 4, epoch 55 performs best on the validation set and is adopted as the final model of YOLOv2; as shown in Table II, its result on the DOTA test set is a mAP of 65.7%. Compared with the model in [7], the performance of YOLOv2 on small object detection is not satisfactory. For YOLOv3, epoch 52 performs best on the validation set and is adopted as the final model; its result on the DOTA test set is a mAP of 60.0%. Compared to YOLOv2, YOLOv3 performs better on classes such as ships and bridges, whose instances are small.

The proposed model with the OR convolution kernel (N = 4) is adopted as the comparison model in this letter, with epoch 57 as the final model. As shown in Table II, its result on the DOTA test set is a mAP of 70.1%. The proposed model performs well on every class. Compared with the model in [7], the proposed model has advantages in extremely large object detection; in addition, compared with the original YOLOv2, its performance on small object detection is satisfactory.
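For reference, the baseline adaptation mentioned above (replacing the single-class head of the ship detector in [7] with a multiclass head) amounts to resizing the final 1 × 1 prediction layer. The sketch below assumes the usual YOLOv2-style output of num_anchors × (5 + num_classes) channels; the anchor count, input channel width, and variable names are illustrative assumptions rather than values taken from the released code.

    import torch.nn as nn

    # Head sized for DOTA's 15 categories; 5 anchors as in YOLOv2 is assumed.
    num_anchors, num_classes = 5, 15
    out_channels = num_anchors * (5 + num_classes)   # 5 = (x, y, w, h, objectness)

    # Swap only the final 1 x 1 prediction layer; 512 input channels assumed.
    new_head = nn.Conv2d(512, out_channels, kernel_size=1)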


The visual comparison results on the test set are shown in Fig. 5. The first row shows that YOLOv2 and the proposed model perform better than the others on large object detection; unfortunately, neither model detected the ground track field, whose size is large. From the second row, we can see that all models except YOLOv2 perform well for small object detection. The third row shows that all models except YOLOv3 are robust to object scale variation. Overall, these visual evaluation results demonstrate the robustness of the proposed method.

In terms of efficiency, the proposed object detection framework runs at 88.2 ms per 1024 × 1024 image with Titan Xp GPU acceleration. The details are shown in Table III. The proposed model achieves real-time object detection. However, the proposed model has no advantage in efficiency compared with YOLOv2, because the feature map size of its last layer is twice that of YOLOv2 and the subsequent steps, such as region proposal generation and NMS, are not accelerated in parallel.
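The per-image latency figure above can be reproduced in spirit with a simple measurement loop such as the one below. The network here is only a placeholder, and the script is our own illustration rather than the authors' timing code; the explicit torch.cuda.synchronize() calls ensure that the timer measures completed GPU work.

    import time
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Conv2d(3, 16, 3, padding=1).to(device).eval()   # placeholder network
    x = torch.randn(1, 3, 1024, 1024, device=device)           # one 1024 x 1024 image

    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        print("ms per image:", (time.perf_counter() - start) / 100 * 1000)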
TABLE I. PERFORMANCE COMPARISON OF DIFFERENT METHODS (IN AP)

TABLE II. PERFORMANCE COMPARISON OF DIFFERENT METHODS (IN AP)

TABLE III. TIME CONSUMPTION OF DIFFERENT METHODS (IN MILLISECONDS)

IV. CONCLUSION

In this letter, we focused on the large-scale variation problem of multiclass object detection in optical remote sensing images. To solve this problem, we proposed a multiclass object detection framework for optical remote sensing images that adopts a feature introducing strategy based on OR dilated convolution. With this strategy, the performance of the network for small-scale object detection is enhanced without losing the performance for large-scale object detection. The experiments confirmed that the proposed framework is efficient and robust for multiscale objects in complex optical remote sensing scenes.

REFERENCES

[1] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, "Orientation robust object detection in aerial images using deep convolutional neural network," in Proc. IEEE Int. Conf. Image Process., Sep. 2015, pp. 3735–3739.
[2] G.-S. Xia et al., "DOTA: A large-scale dataset for object detection in aerial images," in Proc. IEEE CVPR, Jun. 2018, pp. 3974–3983.
[3] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[4] X. Bai, H. Zhang, and J. Zhou, "VHR object detection based on structural feature extraction and query expansion," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6508–6520, Oct. 2014.
[5] H. Zhang, X. Bai, J. Zhou, J. Cheng, and H. Zhao, "Object detection via structural feature selection and shape model," IEEE Trans. Image Process., vol. 22, no. 12, pp. 4984–4995, Dec. 2013.
[6] Y. Yang, Y. Zhuang, F. Bi, H. Shi, and Y. Xie, "M-FCN: Effective fully convolutional network-based airplane detection framework," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 8, pp. 1293–1297, Aug. 2017.
[7] W. Liu, L. Ma, and H. Chen, "Arbitrary-oriented ship detection framework in optical remote-sensing images," IEEE Geosci. Remote Sens. Lett., vol. 15, no. 6, pp. 937–941, Jun. 2018.
[8] Y. Long, Y. Gong, Z. Xiao, and Q. Liu, "Accurate object localization in remote sensing images based on convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, May 2017.
[9] W. Liu et al., "SSD: Single shot multibox detector," 2015. [Online]. Available: https://arxiv.org/abs/1512.02325
[10] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Adv. NIPS, 2016, pp. 379–387.
[11] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE CVPR, Jun. 2016, pp. 779–788.
[13] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," 2016. [Online]. Available: https://arxiv.org/abs/1612.08242
[14] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018. [Online]. Available: https://arxiv.org/abs/1804.02767
[15] J. Wang, W. Liu, L. Ma, H. Chen, and L. Chen, "IORN: An effective remote sensing image scene classification framework," IEEE Geosci. Remote Sens. Lett., vol. 15, no. 11, pp. 1695–1699, Nov. 2018.
[16] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, "Oriented response networks," in Proc. IEEE CVPR, Jul. 2017, pp. 4961–4970.
