
International Journal of Remote Sensing

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/tres20

Ship detection of optical remote sensing image in multiple scenes

Xungen Li, Zixuan Li, Shuaishuai Lv, Jing Cao, Mian Pan, Qi Ma & Haibin Yu

To cite this article: Xungen Li, Zixuan Li, Shuaishuai Lv, Jing Cao, Mian Pan, Qi Ma & Haibin Yu (2021): Ship detection of optical remote sensing image in multiple scenes, International Journal of Remote Sensing, DOI: 10.1080/01431161.2021.1931544

To link to this article: https://doi.org/10.1080/01431161.2021.1931544

Published online: 03 Jun 2021.

INTERNATIONAL JOURNAL OF REMOTE SENSING
https://doi.org/10.1080/01431161.2021.1931544

SPECIAL ISSUE: LEARNING FROM DATA FOR REMOTE SENSING IMAGE ANALYSIS

Ship detection of optical remote sensing image in multiple scenes

Xungen Li (a,b), Zixuan Li (a), Shuaishuai Lv (a,b), Jing Cao (a,c), Mian Pan (a), Qi Ma (a) and Haibin Yu (a)

(a) School of Electronics and Information Engineering, Hangzhou Dianzi University, Hangzhou, China; (b) Pujiang Microelectronics and Intelligent Manufacturing Research Institute of Hangzhou Dianzi University, Hangzhou Dianzi University, Jinhua, China; (c) Graduate School of Creative Industry Design, National Taiwan University of Arts

CONTACT: Shuaishuai Lv, [email protected], School of Electronics and Information Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
© 2021 Informa UK Limited, trading as Taylor & Francis Group

ABSTRACT

In view of the characteristics of ships in optical remote-sensing images, such as multiple scales, a majority of small objects, crowded arrangement and complex backgrounds, this paper presents a ship detection framework combining a network that fuses multi-level features across levels, a rotation region proposal network, and a bidirectional recurrent neural network fusing a self-attention mechanism. Firstly, because of the multiple scales and diverse characteristics of remote-sensing ships, we set up a network fusing multi-level features across levels to increase the precision of ship feature extraction, thus improving performance on multi-scale, small-object and complex-background problems. Secondly, we separately design the ROI Pooling layer and the bidirectional recurrent neural network fusing a self-attention mechanism, which infuse prior information on ship dimension and position to achieve good performance and precise ship positioning in crowded scenes. Finally, we verify the effectiveness of the proposed method through experiments; the experimental datasets include a private dataset built by us from Google Earth imagery, the ship subset of DOTA (DOTA-Ship) and the public HRSC2016 ship dataset. The results verify the contribution of each proposed module, and the comparison results show that our proposed method achieves state-of-the-art performance.

ARTICLE HISTORY
Received 21 January 2021
Accepted 12 May 2021

1. Introduction
Ship identification and detection are of great significance for national defence, port management, cargo transport, maritime rescue, combating illegal ships, and so on (Tang, Deng, and Huang et al. 2014; Zhu et al. 2010; Chen et al. 2018; Li, Tang, and Chen et al. 2019; Ma et al. 2018). However, existing ship monitoring systems cannot meet the needs of ship monitoring and emergency management over wide sea areas (Li, Cheng, and Bu et al. 2017). Compared with SAR images (Kang, Leng,
and Lin et al. 2017; Jiao, Zhang, and Sun et al. 2018; Li, Qu, and Shao 2017) and hyperspectral images (Kim et al. 2019), optical remote-sensing images have high resolution and low noise. The structure and texture information of the ship object is clearer, which is more conducive to the extraction and analysis of ship features during detection, and the scenes in which ships appear are richer and more extensive. In the remote-sensing field, ship detection based on optical remote-sensing images has received extensive attention (Chang, Wu, and Chiang 2019; Dong et al. 2019; Bay, Ess, and Tuytelaars et al. 2008; Zou and Shi 2016; Shi et al. 2013). However, with changes in light intensity, climatic conditions, sea surface topography, and so on, the variation of ships against complex backgrounds also increases. Traditional feature-based detection struggles to separate effective ship features, resulting in low detection accuracy. Therefore, ship detection in multiple scenarios is a huge challenge.
Ships in remote-sensing images have the characteristics of multiple scales, small objects, dense arrangement and complex backgrounds. These characteristics appear frequently in scenes such as ports and shipping routes, and solving these problems has strong engineering significance. The existing remote-sensing ship datasets are few in number and cover only single scene types. Current deep neural network ship detection methods mostly use single ship scenes, which cannot contain the ship characteristics mentioned above (Girshick, Donahue, and Darrell et al. 2014; Zhang, Yao, and Zhang et al. 2016; Zou and Shi 2016; Wu, Zhou, and Wang et al. 2018; You, Cao, and Zhang et al. 2019; Xian, Pwab, and Cheng et al. 2020), and therefore cannot prove their effectiveness in detecting these ship characteristics in multiple scenarios. Remote-sensing ship detection should firstly be strongly adaptive and able to identify and detect ships in complicated backgrounds; secondly, it should be designed for the features of remote-sensing ships, for example, crowded arrangement, multiple scales, etc.; finally, it should make better use of the prior condition of the mutual position correlation of ships. Therefore, designing a ship identification and detection model that adapts to different illumination, climate conditions and ocean scenes has become the key issue of current remote-sensing ship detection.
The traditional optical remote-sensing image-based ship detection algorithm is mainly
divided into three steps. Firstly, extract the ship regions employing window sliding,
selective search, and so on; then extract the shape and texture features of ships using
HOG feature descriptor (Chang, Wu, and Chiang 2019), SIFT partial feature descriptor
(Dong et al. 2019), and SURF feature descriptor (Bay, Ess, and Tuytelaars et al. 2008); and
finally, judge the categories with classifiers, such as support vector machine (Zou and Shi
2016), Adaboost (Shi et al. 2013), and so on, and refine the position of the effective
proposal boxes. However, the remote-sensing ship has a complicated background and many interfering factors; owing to single feature extraction and poor detection robustness, the model of the traditional ship detection method cannot provide a comprehensive and correct description of ship features, and traditional ship detection is not an end-to-end process, which greatly reduces the accuracy and efficiency of ship detection.
With the rapid development of deep learning in the ship detection field, more and
more scholars start to apply the deep learning-based ship detection technique in the
study of remote-sensing ship detection (Girshick, Donahue, and Darrell et al. 2014; Zhang,
Yao, and Zhang et al. 2016; Zou and Shi 2016; Wu, Zhou, and Wang et al. 2018; You, Cao,
and Zhang et al. 2019; Sun, Liu, and Yan et al. 2020; Xian, Pwab, and Cheng et al. 2020).
The deep learning-based ship detection method is generally improved from the following
two aspects:
Firstly, increase the precision of the extracted remote-sensing ship features by modify­
ing the feature extraction network, thus improving the detection difficulties of multiple
scales, small objects, complicated backgrounds, and so on. For example, He et al. (Girshick,
Donahue, and Darrell et al. 2014) resolved the problems of low ship detection precision and high missed-detection and false-detection rates in satellite images by enhancing the texture feature extraction of satellite images. Zhang et al. (Zhang, Yao, and Zhang et al. 2016)
proposed a new ship detection method called S-CNN, which combines specially designed proposal regions extracted from the ship model to significantly improve detection performance. Zou et al. (Zou and Shi 2016) proposed SVDNet, which performs self-adaptive learning of features from remote-sensing images through a convolutional neural network and a singular value compensation algorithm.
Secondly, utilize prior information on ships. Some scholars utilized the known prior
information of ship dimensions to apply the rotation proposal region to remote-sensing
ship detection, thus improving the detection performance (Wu, Zhou, and Wang et al.
2018; You, Cao, and Zhang et al. 2019; Sun, Liu, and Yan et al. 2020; Xian, Pwab, and Cheng
et al. 2020; Ma, Guo, and Wu et al. 2019; Wu, Ma, and Gong et al. 2020; Ma, Zhou, and
Wang et al. 2019; Tian, Pan, and Tan et al. 2020; Xiao, Zhou, and Wang et al. 2019). Some
scholars improved the detection by utilizing the prior information of remote-sensing ship
positions. For example, Wu et al. (Wu, Zhou, and Wang et al. 2018) proposed a new offshore ship detection method, which implements a universal search over different parts of the ship head through a classification network to obtain the possible bow position and rough ship direction, which is very helpful for producing smaller and more precise ship proposal regions. You et al. (You, Cao, and Zhang et al. 2019) proposed an end-to-end scene Mask
R-CNN based on the accuracy and feasibility of the DCNN detection framework, to reduce
the false detection of ship-shaped objects on the land. Sun et al. (Xian, Pwab, and Cheng
et al. 2020) proposed a shape robust anchor-free network (SRAF-Net) and a unified part-
based convolutional neural network (PBNet). The SRAF-Net consists of feature extraction,
multitask detection, and postprocessing. PBNet treats a composite object as a group of parts and incorporates part information into context information to improve composite object detection; the correct part information can guide the prediction of a composite object, thus solving the problems caused by various shapes and sizes. Ma
et al. (Ma, Guo, and Wu et al. 2019) considered contextual information and multi-region
features and proposed a novel multi-model decision fusion framework to solve the
problem of the diversity and complexity of geospatial object appearance and the insuffi­
cient understanding of geospatial object spatial structure information. Sun et al. (Wu, Ma,
and Gong et al. 2020) proposed a shape robust anchor-free network (SRAF-Net) to solve
the problem of blurred boundaries in garbage dumps, which includes feature extraction,
multitask detection, and postprocessing. Wang et al. (Ma, Zhou, and Wang et al. 2019)
proposed a two-stage ship detection based on ship centre and orientation prediction,
which constructed the central region prediction network and ship orientation classifica­
tion network to produce rotating region proposals and predict the rotation bounding box
from the rotating region proposals. Tian et al. (Tian, Pan, and Tan et al. 2020) improved
the non-maximum suppression and made pooling improvements to the context features.
The above methods improve the accuracy of ship detection by modifying the feature extraction network, modifying the size of the proposal region, and using prior information of the ship, but some problems remain: optical remote-sensing ship datasets contain only single ship scenes, so the models cannot verify their effectiveness in detecting different ship characteristics. This is a problem we need to consider when designing a detection method for ships with multi-scale, small-object, dense and complex-background characteristics in multiple scenarios.
Concerning the above problems and considering the real multiple scenes of the
remote-sensing ships, we have collected multiple remote-sensing ship images in compli­
cated scenes from Google Earth and labelled the ship. We have designed an end-to-end
multi-scene ship detection network framework, which has achieved good performances
in multiple scenes. The main contributions of this paper are as follows:

(1) Infuse the prior information of the position. Considering the prior condition that
the remote-sensing ships have angle consistency in the region, that is, in most
scenes, the angle values between the ground-truth bounding boxes of remote-
sensing ships are close, we add the bidirectional recurrent neural network fusing
self-attention mechanism into the network framework to obtain more precise
proposal regions according to the confidence of the proposal region.
(2) Infuse the prior information of dimension. To obtain more precise ship body
positioning, we have designed anchors conforming to the feature proportion,
ratio, and angle of ships and added Pooling Layer size conforming to the ship
shape in the ROI Pooling layer, considering the bounding box of the ship shape. In
this way, the redundant region of the bounding box is reduced to obtain good
performance in crowded scenes.
(3) Fuse cross-level features. For the multi-scale and diversity characteristics of the
remote-sensing ship, we have designed a network fusing the cross-level feature
layer based on the feature pyramid and increased the feature precision of the ship
with different scales to improve the performance in the multi-scale and small
object problems of remote-sensing ships.
(4) Design dataset. To resolve the problem of single scenes in current ship detection, we
have collected multiple frequent sea areas, port, shore side, and riverway scenes of
remote-sensing ships in practical application engineering, as well as ships of
different classes and sizes. These images contain the features of a remote-sensing
ship, such as multiple scales, small objects, crowded arrangements, complicated
background, etc.

Finally, we have verified the effectiveness of our proposed method through experi­
ments. Our network framework not only verifies its effectiveness on our own dataset but also presents excellent performance on ships in the public HRSC2016 dataset and the large-scale aerial dataset DOTA (Xia, Bai, and Ding et al. 2018) compared with other advanced methods.
The rest of this paper is organized as follows: Section 2 introduces the general framework of the proposed model and the detailed method. Section 3 briefly introduces the overall algorithm procedure of our model as well as the training and testing procedures. Section 4 introduces the three datasets as well as the experiments of our model framework and the comparison models in each scene of the datasets. Section 5 summarizes the full text and discusses future work.

2. Proposed Method
We will introduce our model framework in this section. The general framework is shown in Figure 1, including the feature extraction network module, the rotation proposal region module, and the angle re-scoring module. The feature extraction network module is used to fuse the multi-level feature layers of the top-down network to obtain multi-scale features. The rotation proposal region module produces anchors that conform to remote-sensing ship characteristics and contain angle parameters in the RPN stage, and adds a pooling size matching the ship length–width ratio in the ROI pooling module. The angle re-scoring module re-calculates the confidence of the original proposal regions through the bidirectional recurrent network and the re-scoring network with self-attention mechanism, according to the parameters of the proposal regions, for example, the angle. Finally, in the Fast R-CNN stage, we make position regression correction and class prediction of the proposal regions.

Figure 1. Overall Framework of Our Proposed Network.

2.1. Cross-level feature extraction network


The cross-level feature extraction proposed in this paper is improved based on FPN
(Lin, Dollar, and Girshick et al. 2017). As shown in Figure 2, the FPN network generates
new feature layers P2, P3 and P4 through horizontal connections and adjacent upper feature layers, and such connections produce richer semantic features. However,
compared with traditional images, optical remote-sensing ship images have high
complexity, low contrast, small targets, and multi-scale features. The semantic features
of ships extracted by FPN are not sufficient, and it is difficult to distinguish many ship-shaped interference objects, for example, container stacks and island reef scenes. Besides, there are also missed detections of ships in dense scenes such as ports and
shores. Therefore, based on the FPN structure, a cross-level feature extraction structure
is designed to adapt to the characteristics of remote-sensing ship images and the
targets of various scales, which combines multiple levels of high- and low-level feature
maps. Through such a feature extraction network, multi-scale features are suitable for
remote-sensing ships, and high-precision ship features can be obtained from remote-
sensing ship images.
In the cross-level feature extraction, we select the feature maps of the last four residual
modules to construct a bottom-up network, and obtain higher-resolution feature maps
through horizontal connections and multi-level cross-layer connections, wherein the
output of the down-sampling of the adjacent upper layer feature map and the output
of the cross-level multi-layer feature map down-sampling are combined in series, as the
input of each layer of the feature map in the multi-level cross-layer connection. By adding
multi-level cross-layer connections on the basis of the FPN structure to smooth feature propagation and feature reuse, the network can not only integrate the multi-scale features of remote-sensing ships but also increase the accuracy of feature extraction for ship targets in complex backgrounds.

Figure 2. Cross-level Cross-layer Feature Extraction Network.
We firstly select ResNets (He, Zhang, and Ren et al. 2016) pre-trained through
ImageNet dataset (Simonyan and Zisserman 2015) as the baseline network, that is,
to extract features from the input remote-sensing ship image in the feature extraction
module of Figure 1. We select the feature map of the last layers of the last four
residual modules to build a bottom-up network; then obtain a feature map with
higher resolution through lateral connection and multi-level cross-layer connection,
thus building a top-down network, wherein the output upsampled from the adjacent upper-layer feature map and the outputs upsampled from the cross-level multi-layer feature maps are connected in series as the input of each feature-map layer in the multi-level cross-layer connection.
Specifically, we select feature images with four different resolutions of the last outputs
of conv 2, conv 3, conv 4 and conv 5 modules of ResNets as the bottom-up network. As
shown in Figure 2, the size of the image of the input feature extraction network is H × W
pixels, and the resolutions of the feature maps of the four convolutional layers are, respectively, H/4 × W/4, H/8 × W/8, H/16 × W/16 and H/32 × W/32. Concerning the construction of the top-down network, the input of the feature map of each layer includes the output of the corresponding feature layer of the bottom-up network after a 1 × 1 convolution and the outputs of the upper layers of the network after upsampling to the corresponding size. With the lowest three layers of the top-down network as the output of the feature extraction network, the construction of the three feature layers is expressed as:

$$P_k = f_{3\times3}\left( \sum_{i=1}^{5-k} \mathrm{Up}_{2^i}(P_{k+i}) + f_{1\times1}(C_k) \right) \qquad (1)$$

where $C_k$ is the feature map of the k-th layer of the bottom-up network, $P_k$ is the fused feature map of the k-th layer (k decreases from top to bottom), $f_{1\times1}$ and $f_{3\times3}$ denote 1 × 1 and 3 × 3 convolutional layers, and $\mathrm{Up}_{2^i}$ denotes upsampling by a factor of $2^i$; i is 1 for the adjacent upper layer and increases by one for each additional layer crossed. After this process, we get the pyramid network of multi-level cross-layer feature fusion. After concatenation, these inputs are passed through a 3 × 3 convolution to reduce the aliasing effects and are finally fused into the feature map used as the input of the subsequent prediction. The outputs of the subsequent prediction do not share classification and regression parameters, the output of each layer is independent, and the feature maps produced in this way contain more high-level features based on the fusion of multi-level semantic features, producing more accurate remote-sensing ship feature information.
We find that the addition of a multi-level cross-layer connection based on the feature
pyramid structure not only can better fuse the multi-scale features of the remote-sensing
ship but also increase the feature extraction precision of ships in the complicated
background.
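To make the fusion in Equation (1) concrete, the following is a minimal NumPy sketch of the cross-level, cross-layer feature fusion, assuming nearest-neighbour upsampling and using random channel projections as stand-ins for the 1 × 1 and 3 × 3 convolutions; the layer names C2-C5 and P2-P5 follow the paper, while the input size, channel counts and every helper function are illustrative assumptions rather than the authors' implementation.

```python
# A minimal NumPy sketch of the cross-level fusion in Equation (1).
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv1x1(x, out_ch):
    """Stand-in for a 1x1 convolution: random channel projection."""
    w = np.random.randn(x.shape[-1], out_ch) * 0.01
    return x @ w

def conv3x3(x):
    """Stand-in for the 3x3 smoothing convolution (identity here)."""
    return x

def cross_level_fuse(C, out_ch=256):
    """C is a dict {2: C2, 3: C3, 4: C4, 5: C5}; returns fused P2..P4."""
    P = {5: conv1x1(C[5], out_ch)}
    for k in (4, 3, 2):                      # top-down pass
        fused = conv1x1(C[k], out_ch)
        for i in range(1, 5 - k + 1):        # cross-level connections: every P_{k+i}
            fused = fused + upsample(P[k + i], 2 ** i)
        P[k] = conv3x3(fused)
    return {k: P[k] for k in (2, 3, 4)}      # lowest three layers are the outputs

# Example: a 256x256 input gives C2..C5 at strides 4, 8, 16, 32.
H = W = 256
C = {k: np.random.randn(H // 2 ** k, W // 2 ** k, 64) for k in (2, 3, 4, 5)}
P = cross_level_fuse(C)
print({k: v.shape for k, v in P.items()})
```

Running the sketch shows the three fused outputs P2-P4 at strides 4, 8 and 16, each receiving upsampled contributions from every higher level rather than only from the adjacent one.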

2.2. Rotation proposal region


The function of the rotation proposal region module is to produce a rotation bounding
box in an arbitrary direction. The traditional horizontal detection box is limited for an

object with a big length–width ratio. As can be seen from Figure 3, due to the feature of
the big length–width ratio of the remote-sensing ship, the horizontal bounding box can
bring up many redundant pixels not belonging to the ship, causing a deviation in the
positioning result. In addition, in crowded scenes, the large overlap among the ground-truth horizontal bounding boxes of multiple ships produces big IoU values, which causes correct boxes to be filtered out in NMS, discarding correct proposal regions and resulting in missed detections.
Using a rotation bounding box requires modifying the representation form and structural parameters of the anchors; the rotation angle and the ROI Pooling layer all need to change. To detect remote-sensing ships with their large length–width ratios, we have modified the RPN and Fast R-CNN stages of the ship detection. In the RPN stage, we have re-defined the size and length–width ratio of the anchors according to the characteristics of remote-sensing ships, added angle parameters, and obtained rotation bounding boxes conforming to the shape of the remote-sensing ship. In the ROI pooling stage, we have added a new pooling size, mitigated the missed-detection problem caused by NMS, and obtained fixed-size feature maps that better conform to the shape of the remote-sensing ship.

2.2.1. Rotation anchors


The traditional horizontal rectangle bounding box can be expressed by the two coordinate points of the upper left and lower right corners, that is, (x_min, y_min, x_max, y_max). However, the rotation bounding box needs new rotation parameters, and (x, y, w, h, θ) is used to express it. As shown in Figure 4, x and y represent the coordinates of the centre point of the bounding box, w and h represent the width and the height of the bounding box, and θ represents the angle between the wide side and the horizontal direction, with the range [0, π).

Figure 3. The marking conditions of the horizontal bounding box and the rotation bounding box on the image.

Figure 4. Representation of a Rotation Bounding Box.

In RPN stage, to produce the rotation box, we have carried out a large number of
tests and statistical analyses of the datasets used in the experiments, and considering
the bearing capacity of the network model, we have added relevant parameters
conforming to the remote-sensing ship to increase the number of matched anchors.
The K-Means clustering algorithm determines that the aspect ratios of the rotation anchors are 3, 7, and 9, and the aspect ratios of the rotation anchors are set to 1:3, 3:1, 1:5, 5:1, 1:7 and 7:1, respectively. Considering the multiple length–width ratios and the load capacity of the network model, the scale size of each feature layer is set to 1. The angle of the rotation anchor follows a uniform distribution within [0, π), so the angle parameters are set to π/6, π/3, π/2, 2π/3, 5π/6 and π, as shown in Figure 5. Each pixel of each feature map corresponds to 36 rotation anchors (1 × 6 × 6), and the outputs of the classification layer and the regression layer are, respectively, 72 (2 × 36) and 108 (3 × 36).
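The following is a minimal sketch of how rotation anchors with the ratio and angle settings listed above could be laid out on a feature map; the base size, stride and area-preserving width/height computation are illustrative assumptions, not the paper's exact parameterization.

```python
# A minimal sketch of the rotation-anchor layout described above.
import numpy as np

RATIOS = [1/3, 3, 1/5, 5, 1/7, 7]                      # w:h ratios 1:3 ... 7:1
ANGLES = [np.pi/6, np.pi/3, np.pi/2, 2*np.pi/3, 5*np.pi/6, np.pi]

def rotation_anchors(feat_h, feat_w, stride, base_size=64):
    """Return (feat_h*feat_w*36, 5) anchors as (x, y, w, h, theta)."""
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            cx, cy = (ix + 0.5) * stride, (iy + 0.5) * stride
            for r in RATIOS:
                w = base_size * np.sqrt(r)              # keep area ~ base_size^2
                h = base_size / np.sqrt(r)
                for theta in ANGLES:
                    anchors.append((cx, cy, w, h, theta))
    return np.array(anchors)

a = rotation_anchors(4, 4, stride=16)
print(a.shape)        # (4*4*36, 5) = (576, 5): 36 anchors per feature-map pixel
```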

2.2.2. Multi-scale ROI pooling


For the proposal regions, the ROI Pooling layer in the Fast R-CNN stage extracts a feature map of fixed dimension for each region by reusing the convolutional feature map; each region finally gets a 7 × 7 × ConvDepth feature map, a size based on objects in natural scenes, where ConvDepth means the depth of the feature map.
However, the feature map with the same length and width does not apply to the remote-
sensing ship. Considering the big length–width ratio of the ship, we have added two
different sizes of pooling to capture more remote-sensing ship features and help to detect
proposal regions with a width much bigger than the height and proposal regions with
a height much bigger than the width. The pooled features are connected for further
detection, and such feature maps can more accurately contain the features of remote-
sensing ships.
For the selection of length-width ratios of proposal regions output in the last step, we
have made a lot of tests and statistical analysis of the used dataset and selected the
length–width ratio of 10/3 as the value of the added pooling layer size after the ship
length–width ratio analysis and weighted average calculation. Two ROI Pooling layers
with pooling size as 10 × 3 and 3 × 10 are added, wherein, the pooling size of 3 × 10 can
capture more horizontal features and contribute to the detection of the ship with a width
much bigger than the height; the pooling size of 10 × 3 can capture more vertical features,
and is very useful for the vertical ship when the height is bigger than the width. Each
proposal region separately passes through three ROI Pooling layers and finally outputs feature maps with fixed sizes of 7 × 7 × ConvDepth, 10 × 3 × ConvDepth, and 3 × 10 × ConvDepth. In this way, the feature maps can better match the shape
of the rotation bounding box of the remote-sensing ship.
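As an illustration of the three-branch pooling described above, the sketch below max-pools a single region feature into 7 × 7, 10 × 3 and 3 × 10 grids and concatenates the results; the grid-pooling helper is a simplified stand-in for a real (rotated) ROI Pooling operator, and the region size and channel depth are made-up values.

```python
# A minimal NumPy sketch of the three-branch ROI pooling.
import numpy as np

def pool_to_grid(roi_feat, out_h, out_w):
    """Max-pool an (H, W, C) region feature into an (out_h, out_w, C) grid."""
    H, W, C = roi_feat.shape
    ys = np.linspace(0, H, out_h + 1).astype(int)
    xs = np.linspace(0, W, out_w + 1).astype(int)
    out = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            cell = roi_feat[ys[i]:max(ys[i + 1], ys[i] + 1),
                            xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.reshape(-1, C).max(axis=0)
    return out

roi = np.random.randn(40, 14, 256)                  # a tall, narrow ship region
feats = [pool_to_grid(roi, 7, 7),                   # square pooling
         pool_to_grid(roi, 10, 3),                  # vertical ships
         pool_to_grid(roi, 3, 10)]                  # horizontal ships
flat = np.concatenate([f.ravel() for f in feats])   # features fed to the detection head
print([f.shape for f in feats], flat.shape)
```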

Figure 5. Setting of Anchors.


2.3. Angle Rescoring Module (ARM)


As can be seen from observation and statistics, there is a certain rule among the
angles of the ground-truth rotation bounding boxes in the remote-sensing ship
image: the position orientation of ship arrangement is consistent in most scenes of the
port, wharf, shore side, river, etc. That is to say, the angles of the rotation bounding
boxes of the ship's ground truth are close, as shown in Figure 6. In the practically
applied port and wharf scenes, the ships are uniformly stopped against the shore with
the same ship arrangement angle. In the sea route scenes of ships, the ships all sail
according to the uniform sea route, and the angle orientation is also consistent.
Therefore, utilizing the prior condition of locally consistent angles of remote-sensing ships and the angle parameters of the rotation bounding box can help us better predict the proposal regions of ships. During prediction, within a certain region, if the angle of a proposal region is closer to the mean angle of the other bounding boxes, the proposal region is more likely to be a correct prediction; on the contrary, if its difference from the mean of the other proposal regions is larger, the proposal region is less likely to be a correct prediction. In the network, we re-rank the proposal regions by changing their confidences.
The ship detection is a single category detection and the category of each proposal
region is identical, so, we extract the confidence, position coordinates, and angle of each
proposal region as the input vectors to build a stacked bidirectional RNN angle rescoring
network module fusing Self-Attention network. The overall detailed procedures of the
rescoring module are as shown in Figure 7, and the algorithm of the angle rescoring
module (ARM) is as follows:

(1) Step 1: the confidence and the coordinate parameters x, y, h, w, θ of the i-th ship candidate region in the image are recorded as the vector $x_i$.
(2) Step 2: the vectors $x_1, x_2, \dots, x_i$ of the candidate regions are used as the input of the bidirectional RNN with two hidden layers, where the vector sequence at position p of the second layer is represented as $h_p^{(2)} = f\big(U^{(2)} h_{p-1}^{(2)} + W^{(2)} h_p^{(1)} + b^{(2)}\big)$, in which $U^{(2)}$ and $W^{(2)}$ are weight matrices, $b^{(2)}$ is the bias, $h_{p-1}^{(2)}$ is the vector sequence of the previous position p-1 in the second layer as shown in Equation (2), and $h_p^{(1)}$ is the vector sequence at position p in the first layer.
(3) Step 3: the vector sequence about the angle, $c_i = \sum_{j=1}^{L} \alpha_{ij} h_j$, is obtained from the vectors $x_1, x_2, \dots, x_i$ through the self-attention mechanism, where L is the length of the input sequence, $h_j$ is the hidden vector of the j-th element, and $\alpha_{ij}$ is the alignment weight between the i-th and j-th elements.
(4) Step 4: the stacked bidirectional RNN output $h_p$ of Step 2 and the self-attention output $c_i$ of Step 3 are connected in series and input to a regressor composed of three MLP layers. Finally, the activation function $f(x) = \mathrm{Sigmoid}\big(\mathrm{ReLU}\big(b^{(2)} + W^{(2)}(b^{(1)} + W^{(1)} x)\big)\big)$ generates a new confidence between 0 and 1 for the ship candidate region, where $W^{(1)}$, $W^{(2)}$, $b^{(1)}$ and $b^{(2)}$ are the connection weights and biases between the layers of the MLP.
(5) Step 5: output the new confidences of all the ship candidate regions.

Figure 6. The marking condition of the ship bounding box in different scenes: (a) the wharf scene, (b) the port scene, (c) the river scene.

Figure 7. Overview of the Angle Rescoring Approach.

The stacked bi-directional RNN network can make better use of the prior conditions of
the physical structural feature of partial consistent angles of the ship, and make use of all
angle information of the proposal region in the image during prediction.
The training process of the RNN network is as follows:
The two hidden layers $\overrightarrow{h}_p$ and $\overleftarrow{h}_p$ of size $n_h$ are used to represent the forward and backward sequences of the bidirectional RNN. Define $h_p^{(l)}$ as the hidden state of the l-th layer at position p, which is determined by the hidden state of the l-th layer at the previous position p-1 and the hidden state of the (l-1)-th layer at position p. The hidden layer $h_p^{(l)}$ can be expressed as

$$h_p^{(l)} = f\big(U^{(l)} h_{p-1}^{(l)} + W^{(l)} h_p^{(l-1)} + b^{(l)}\big) \qquad (2)$$

where $U^{(l)}$ and $W^{(l)}$ are the weight matrices, $b^{(l)}$ is the bias vector, and $h_t^{(0)} = x_t$. The stacked network layers can better exploit the dependencies among candidate regions and gradually abstract, according to the context, the high-level sequence characteristics between confidence and position. The hidden states inside the bidirectional recurrent neural network layers contain different levels of structural representation (Singh, Marks, and Jones et al. 2016).
The attention mechanism is widely applied in natural language processing tasks and can quickly extract the important features of sparse data (Bahdanau, Cho, and Bengio 2014). The self-
attention mechanism is the improvement of the attention mechanism. It reduces the
dependency on external information and is better at capturing the internal correlations of
data or features (Vaswani, Shazeer, and Parmar et al. 2017), and can better highlight the
divisibility features of positions and angles between the proposal regions of the ship and
suppress the un-relevant features.
In this paper, self-attention is used to calculate the relationship between every two candidate regions according to the angle parameters in the position coordinates, and the results are normalized to obtain the weights. The confidences and position vectors of all candidate regions in the entire sequence are then combined, the weighted average is used as a new vector, and traversing the sequence completes the update of every vector. In this process, the interaction between two candidate regions has nothing to do with their distance, only with the vectors themselves, so there is no long-term dependency problem.
For the confidence, position coordinate and angle sequence features after the RNN, we enhance the weights of the ship proposal regions whose angles tend towards the mean of all proposal-region angles in the image, and reduce the weights of the ship proposal regions whose angle parameters differ greatly from the mean. The self-attention mechanism is used to handle the long-distance dependency relationships between ship detections, which are difficult to capture through the RNN network alone. For each ship proposal region element i, the self-attention mechanism expresses the whole sequence as an angle context vector $c_i$, obtained as the alignment-weighted mean of all hidden vectors of the sequence:

$$c_i = \sum_{j=1}^{L} \alpha_{ij} h_j \qquad (3)$$

where L is the length of the input sequence, $h_j$ is the hidden vector of element j, and $\alpha_{ij}$ is the alignment weight between element i and element j. The weight $\alpha_{ij}$ is calculated by Softmax as:

$$\alpha_{ij} = \exp\big(\mathrm{score}(h_i, h_j)\big) \Big/ \sum_{k=1}^{L} \exp\big(\mathrm{score}(h_i, h_k)\big) \qquad (4)$$

where $\mathrm{score}(h_i, h_j)$ measures the alignment degree between the vectors $h_i$ and $h_j$, and specifically uses the scaled vector dot-product function shown in Equation (5):

$$\mathrm{score}(h_i, h_j) = (h_i^{T} h_j) \big/ \sqrt{L} \qquad (5)$$

We use a multi-layer perceptron to predict and re-calculate the confidence value. The input of the regressor is the serial connection of the hidden vector h of the stacked bidirectional RNN network and the context vector c of the self-attention. The activation functions are ReLU and sigmoid, and the output is the new confidence of the ship proposal region, between 0 and 1. We select a three-layer multilayer perceptron, whose expression is:

$$f_2(x) = \mathrm{Sigmoid}\big(\mathrm{ReLU}\big(b^{(2)} + W^{(2)} f_1(x)\big)\big) \qquad (6)$$

$$f_1(x) = b^{(1)} + W^{(1)} x \qquad (7)$$


All the parameters of the MLP are the connection weights and biases between layers, that is, $W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}$. We give higher confidence to proposal regions whose angles have higher correlation with the angles of the other ship bounding boxes and reduce the confidence of proposal regions with lower angle correlation, obtaining new confidences for all proposal regions and ranking them in the NMS algorithm according to confidence.
Our angle rescoring module readjusts the confidences of the proposal regions through the stacked bidirectional RNN network and self-attention mechanism, according to the prior condition of angle consistency among the bounding boxes of remote-sensing ships; it re-ranks the proposal regions in the subsequent NMS algorithm and screens out the proposal regions that better conform to ship features, improving the prediction precision of the ship bounding box.
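A compact NumPy sketch of the angle rescoring module is given below: each proposal is a 6-dimensional vector (confidence, x, y, w, h, θ), a two-layer stacked bidirectional RNN and a scaled dot-product self-attention (Equations (2)-(5)) are applied over the proposal sequence, and a small MLP (Equations (6)-(7)) outputs a new confidence per proposal. All weights are random stand-ins, so this only illustrates the data flow, not a trained ARM.

```python
# A minimal NumPy sketch of the angle rescoring module (ARM) forward pass.
import numpy as np

def rnn_layer(X, nh):
    """One bidirectional RNN layer over an (L, d) sequence; returns (L, 2*nh)."""
    L, d = X.shape
    U, W, b = (np.random.randn(nh, nh) * 0.1,
               np.random.randn(nh, d) * 0.1,
               np.zeros(nh))
    def run(seq):
        h, out = np.zeros(nh), []
        for x in seq:
            h = np.tanh(U @ h + W @ x + b)     # h_p = f(U h_{p-1} + W x_p + b)
            out.append(h)
        return np.array(out)
    fwd, bwd = run(X), run(X[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)

def self_attention(H):
    """Context vectors c_i = sum_j alpha_ij h_j, alpha from scaled dot products."""
    L = H.shape[0]
    scores = H @ H.T / np.sqrt(L)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return alpha @ H

def rescore(proposals, nh=32):
    H1 = rnn_layer(proposals, nh)              # first hidden layer
    H2 = rnn_layer(H1, nh)                     # stacked second hidden layer
    C = self_attention(H2)                     # angle-aware context vectors
    Z = np.concatenate([H2, C], axis=1)        # serial connection of h_p and c_i
    W1, b1 = np.random.randn(64, Z.shape[1]) * 0.1, np.zeros(64)
    W2, b2 = np.random.randn(1, 64) * 0.1, np.zeros(1)
    hidden = np.maximum(0, Z @ W1.T + b1)      # ReLU hidden layer
    return 1 / (1 + np.exp(-(hidden @ W2.T + b2)))   # sigmoid -> new confidences

props = np.random.rand(5, 6)                   # 5 proposals: (conf, x, y, w, h, theta)
print(rescore(props).ravel())
```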

2.4. Loss functions


The loss functions of our ship detection network are divided into the loss function of the
ship classification network and the loss function of the ship position regression network.
The loss function of the classification network is used to produce the ship class probability
according to the previous output ship multi-dimensional features and calculated accord­
ing to the final ship class probability and the corresponding tag. The loss function of the
regression network is used to evaluate the loss cost of regression and compare the
difference between the predicted translation and scaling parameters corresponding to
the true classification tag and the true translation and scaling parameters. In combination
with the classification loss and regression loss, the expression of the loss function in the
multi-task form of the ship detection network is:
$$L(p_i, p_i^{*}, v_i^{*}, v_i) = \sum_i L_{cls}(p_i, p_i^{*}) + \lambda \sum_i p_i^{*} L_{reg}(v_i^{*}, v_i) \qquad (8)$$

where i is the index of anchors in the classification network and regression network, $p_i$ is the predicted probability of anchor i being of the ship class, and $p_i^{*}$ is the ground-truth tag of the anchor. $v_i$ represents the vector of the five coordinate parameters x, y, w, h and θ of the predicted bounding box, and $v_i^{*}$ represents the coordinates of the ground-truth detection box in the tag corresponding to the predicted ship area. $L(p_i, p_i^{*}, v_i, v_i^{*})$ is the sum of the classification loss functions and regression loss functions of all anchors in the remote-sensing ship image, the hyper-parameter λ weights and controls the balance between the ship classification loss and the regression loss, and the experiment uses λ = 1. In addition, the function $L_{cls}$ uses the cross-entropy loss function, defined as:

$$L_{cls}(p, p_i^{*}) = -\log p_{p_i^{*}} \qquad (9)$$
The function $L_{reg}$ of the regression network is defined as

$$L_{reg}(v_i^{*}, v_i) = \mathrm{smooth}_{L1}(v_i^{*} - v_i) \qquad (10)$$

We select the smooth L1 loss function with strong robustness, and the function $\mathrm{smooth}_{L1}(x)$ is defined as:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5 x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (11)$$

To maintain invariance of scale and position, $\mathrm{smooth}_{L1}(x)$ is computed on the distance vector $\Delta = (\delta_x, \delta_y, \delta_w, \delta_h, \delta_\theta)$ defined in the formula below:

$$\delta_x = (g_x - b_x)/b_w, \quad \delta_y = (g_y - b_y)/b_h, \quad \delta_w = \log(g_w/b_w), \quad \delta_h = \log(g_h/b_h), \quad \delta_\theta = \theta - \theta^{*} \qquad (12)$$

Besides, $\Delta = (\delta_x, \delta_y, \delta_w, \delta_h, \delta_\theta)$ has to be normalized through the mean value and variance.
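The sketch below evaluates Equations (8)-(12) for a single anchor: the ground-truth box is encoded into the normalized deltas of Equation (12), the smooth L1 term of Equation (11) is applied to the regression error, and the classification and regression terms are combined with λ = 1. The anchor and ground-truth values are made-up numbers for illustration.

```python
# A minimal NumPy sketch of the multi-task detection loss of Equations (8)-(12).
import numpy as np

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def encode_deltas(gt, anchor):
    """gt, anchor: (x, y, w, h, theta) -> normalized regression targets (Eq. 12)."""
    gx, gy, gw, gh, g_theta = gt
    bx, by, bw, bh, b_theta = anchor
    return np.array([(gx - bx) / bw, (gy - by) / bh,
                     np.log(gw / bw), np.log(gh / bh),
                     g_theta - b_theta])

def detection_loss(p, p_star, v, v_star, lam=1.0):
    """p: predicted ship probability, p_star: 1 for positive anchors, v/v_star: deltas."""
    cls = -np.log(p if p_star == 1 else 1 - p)          # cross-entropy (Eq. 9)
    reg = p_star * smooth_l1(v_star - v).sum()          # only positives regress (Eq. 10)
    return cls + lam * reg

anchor = np.array([100.0, 100.0, 60.0, 20.0, np.pi / 3])
gt     = np.array([104.0, 98.0, 70.0, 22.0, np.pi / 3 + 0.05])
v_star = encode_deltas(gt, anchor)
v_pred = v_star + np.random.randn(5) * 0.05             # a slightly noisy prediction
print(detection_loss(p=0.9, p_star=1, v=v_pred, v_star=v_star))
```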

3. Training and Testing Procedures


The training and testing procedures are shown in Figure 8.
The green solid line represents the training process, and the yellow solid line repre­
sents the test process. In the training process, the input remote-sensing ship pictures are
pre-processed, and then passed through the Cross-level Feature Extraction, RPN, Angle
Rescoring Module and Fast R-CNN stages. Finally, the loss function feedback updates the
parameters in the model, and saves the parameters of the model. In the test process, the
model saved by the training is used for ship detection on remote-sensing ship images,
and the output is used to predict the categories and locations of the ships. The details are
as follows:

Figure 8. The training and testing procedures.


Algorithm train
1: Input: network parameters loaded by the training model, images of the training set and the corresponding tags.
2: Output: parameters of the RPN network and the Fast-RCNN network.
3: Train the RPN network separately.
4: Train the Fast-RCNN network separately, using the proposal regions output by the RPN network trained in the previous step as the input of the Fast-RCNN network. So far, the two networks have no shared parameters and are trained separately.
5: Train the RPN network again; at this time, fix the parameters of the common part of the network and only update the parameters of the exclusive part of the RPN network.
6: Fine-tune the Fast-RCNN network again with the RPN results; fix the parameters of the common part of the network and only update the parameters of the exclusive part of the Fast-RCNN network.

3.1. Training stage


Our model is an end-to-end two-stage model divided for the RPN network and the Fast
R-CNN network.
During the training stage, the classification network and regression network are trained
simultaneously. Each anchor in each branch is assigned a binary tag. For the classification network, anchors are first given positive and negative tags: the IoU overlap threshold for positive anchors is increased from 0.5 to 0.7, and the IoU overlap threshold for negative anchors is increased from 0.2 to 0.3; a bigger positive threshold means the selected positive anchors are much closer to the ground-truth box, which makes classification more precise. The quantity of negative samples in a remote-sensing image is much larger than that of positive samples, so we randomly select the negative samples to ensure that the ratio between positive and negative samples in each mini-batch is 1:3.
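A minimal sketch of this anchor sampling step is shown below, assuming the per-anchor IoU values are already computed; the 0.7/0.3 thresholds and the 1:3 positive-to-negative ratio follow the text, while the cap on positive samples per batch is an assumed value.

```python
# A minimal sketch of positive/negative anchor sampling with a 1:3 ratio.
import numpy as np

def sample_anchors(ious, n_pos_max=64, pos_thr=0.7, neg_thr=0.3, neg_ratio=3):
    """ious: (num_anchors,) max IoU of each anchor with any ground-truth box."""
    pos = np.where(ious >= pos_thr)[0]
    neg = np.where(ious < neg_thr)[0]
    pos = np.random.permutation(pos)[:n_pos_max]
    neg = np.random.permutation(neg)[:len(pos) * neg_ratio]   # keep 1:3 pos:neg
    return pos, neg

ious = np.random.rand(1000)
pos, neg = sample_anchors(ious)
print(len(pos), len(neg))
```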

3.2. Test stage


(1) Input: the network parameters of the model, image of test sets and corresponding
tags after training.
(2) Output: the predicted classes and position coordinates of the test sets.
(3) Input the images and corresponding tags of the test set to the feature extraction
module of the model.
(4) As the model is an end-to-end network, the results are directly output by the Fast-
RCNN network.

In the testing stage, firstly, we output the coordinate offsets of each anchor on the feature map through the bounding box regression network; then, we adjust the position of each anchor through the bounding box regression strategy to obtain the final predicted bounding box. The two outputs of the classification network are confidence scores s1 and s2 corresponding to each bounding box, and the confidence score
shows the probability of the ship appearing in the bounding box. If the score s1 of the
bounding box is less than 0.2, delete the corresponding bounding box, determine the
confidence corresponding to the rest bounding box as the product of s1 and s2, and
select the bounding box with the confidence bigger than 0.35. Finally, use the non-
maximum suppression (NMS) to obtain the final detection result.
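The test-time filtering just described can be sketched as follows; the 0.2 and 0.35 thresholds follow the text, while the NMS threshold is an assumed value and the rotated-IoU computation is replaced by a simple axis-aligned stand-in for brevity.

```python
# A minimal sketch of the test-time score filtering and NMS.
import numpy as np

def iou(a, b):
    """Axis-aligned IoU stand-in for the rotated IoU used by the real detector."""
    ax1, ay1, ax2, ay2 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx1, by1, bx2, by2 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / (a[2]*a[3] + b[2]*b[3] - inter + 1e-9)

def postprocess(boxes, s1, s2, nms_thr=0.5):
    keep = s1 >= 0.2                                   # drop low first-stage scores
    conf = s1[keep] * s2[keep]                         # combined confidence
    boxes, conf = boxes[keep][conf > 0.35], conf[conf > 0.35]
    order, final = conf.argsort()[::-1], []
    while len(order):                                  # greedy NMS on the stand-in IoU
        i, order = order[0], order[1:]
        final.append(i)
        order = np.array([j for j in order if iou(boxes[i], boxes[j]) < nms_thr], dtype=int)
    return boxes[final], conf[final]

boxes = np.random.rand(20, 5) * 100                    # (x, y, w, h, theta)
s1, s2 = np.random.rand(20), np.random.rand(20)
print(postprocess(boxes, s1, s2)[1])
```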
4. Experiments and Results


In this section, we first introduce the three datasets used; then we introduce the evaluation indicators for the experimental results and the detailed parameters of the experiments; finally, we analyse the experimental results for ships in various scenes and select other related advanced models for multi-group experimental comparisons with our model.

4.1. Dataset
We have carried out experiments on three datasets in total: the SHIP dataset is our private dataset, while HRSC2016 and DOTA are public datasets.

4.1.1. (1) SHIP


Due to the lack of datasets of the optical remote-sensing ship image in multiple scenes,
we have collected images of ships in multiple typical scenes all over the world from
Google Earth, for example, large port, wharf, river, lake, ocean, etc. The quantity of ships in
each image is different from dozens to hundreds, and we have marked the position of
each ship in the image, including the coordinates of the centre point, the length, width
and angle of the ship.
Therefore, this paper selects five scene attributes as scene tags: whether the ship is carrying cargo, whether it is docked, whether the water area is a river, lake or sea, the density between ships, and whether the ship is large-, medium- or small-scale. An image often contains ships in multiple scenes, so we flatten the data of each ship, that is, in the tag name, numerical bits are used to sub-divide ships of different classifications and scenes. Different figures in each digit correspond to different classes of classification scenes, and the tag bits of the ship are described in Table 1.
The name of each ship is a tag composed of a five-digit figure; each digit of the tag represents one classification or scene attribute, which is divided into two or three situations and expressed by the digits 1, 0 or 2, 1, 0. In the later experimental results, we report the experimental results analysed statistically for the different classifications and scenes according to the tag.

Table 1. Description of Tag Bits of SHIP Dataset.

Tag: *****
Values:   1/0            1/0             1/0               2/1/0             2/1/0
Meaning:  Cargo or not   Docked or not   Crowded/sparse    River/lake/sea    Big/medium/small

Table 2. The number of samples and instances of the training set and test set of each scenario in the SHIP dataset.

Scene        Cargo ship      Docked          Ship spacing      Local water area       Ship scale
             Yes     No      Yes     No      Crowd   Sparse    River   Lake   Sea     Big    Medium   Small
Train sam.   521     536     507     516     511     538       245     239    246     243    244      243
Train ins.   2108    2123    2096    2135    2092    2139      1427    1396   1408    1410   1411     1410
Test sam.    174     192     178     187     171     196       102     91     89      165    175      159
Test ins.    887     914     880     921     877     924       619     589    593     598    600      603

The size
of images collected by us from Google Earth is about 3000 × 3000 pixels. To prove the
effectiveness of the model for the detection of a small ship and multi-scale features of
ships, we did not ignore small ships. The size of ships included in the data set ranges from
10 × 10 pixels to 400 × 400 pixels. To expand the remote-sensing image samples, and
considering the influences of the object in the remote-sensing image in actual scenes (e.g.
changes of view, changes of direction, etc.) as well as multiple test comparisons, we
amplify the datasets fourfold by a horizontal flip and rotation of 5°, 180°and 355°. The
rotation of the image in a small degree can reduce the redundant area of the image after
rotation. The amplified data are divided into a training set and a testing set at a ratio of 7:3, and the images are then cropped with an overlapping ratio of 20%; the cropped images are 800 × 800 pixels in size. The cropped dataset contains 1012 images and 6032 instances in total, of which the training set has 730 images and 4231 instances, and the testing set has 282 images and 1801 instances. The details are shown in Table 2.
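The 800 × 800 cropping with a 20% overlap can be sketched as below; the flip and rotation augmentation is omitted, edge handling is simplified, and the 3000 × 3000 input size mirrors the typical tile size mentioned above.

```python
# A minimal sketch of the 800x800 sliding-window cropping with 20% overlap.
import numpy as np

def sliding_crops(image, crop=800, overlap=0.2):
    """Yield (y, x, patch) crops covering the image with the given overlap."""
    step = int(crop * (1 - overlap))                      # 640-pixel stride
    H, W = image.shape[:2]
    for y in range(0, max(H - crop, 0) + 1, step):
        for x in range(0, max(W - crop, 0) + 1, step):
            yield y, x, image[y:y + crop, x:x + crop]

img = np.zeros((3000, 3000, 3), dtype=np.uint8)          # a typical Google Earth tile
print(sum(1 for _ in sliding_crops(img)))                 # number of crops per image
```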

4.1.2. (2) HRSC2016


HRSC2016 also consists of optical images collected from Google Earth; the officially provided training and testing sets with rotation boxes have 1070 images with an image size of 1200 × 800 pixels, containing 2976 ship instances. Some images do not contain ships, so we remove these negative samples and finally get 585 images in the training set, containing 1702 instances, and 438 images in the testing set, containing 1274 instances.

4.1.3. (3) DOTA-Ship


The DOTA dataset contains 2806 optical images of about 4000 × 4000 pixels, containing 15 classes and 188,282 instances in total. We take the official images that are annotated with rotation boxes and contain ship tags as the dataset for this experiment, and rename it the DOTA-Ship dataset. The images in the officially provided training and testing sets vary in size, are larger than 1000 × 1000 pixels, and have widely varying length–width ratios. We have cropped the images to 800 × 800 pixels with an overlapping ratio of 20%. The cropped training set contains 1033 images with 82,534 instances, and the cropped testing set contains 282 images with 22,443 instances.

4.2. Evaluation indicators


To quantitatively evaluate the performance and robustness of the proposed framework,
the average precision (AP) and comprehensive evaluation index (F-Measure) are com­
monly used indexes for the overall performance evaluation of the network model in the
ship detection.

4.2.1. (1) Average precision


The average precision is the area below the precision-recall curve, and the better the
detector is, the higher AP value is. mAP is the mean value of multiple classes of AP. In our
dataset, there is only one class, and in this experiment, the value of mAP is the value of AP.
The average precision index is the most important one in the ship detection algorithm.
The recall is calculated as:
$$\mathrm{recall} = \frac{TP}{TP + FN} \qquad (14)$$

Precision is calculated as:

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad (15)$$
where TP (True Positives) means positive samples correctly identified as positive, representing the number of ships correctly detected; FN (False Negatives) means positive samples falsely identified as negative, representing the number of undetected or missed ships; and FP (False Positives) represents the number of falsely detected ships.
The average precisions under various IoU thresholds (0.5, 0.55, ..., 0.95) are calculated separately for each class. Firstly, the results are ranked according to confidence, and the area under the Precision-Recall curve is estimated through the interpolated precision averaged over 11 equally spaced recall points, so the average precision can be calculated as:

$$AP_c^t = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1\}} \max_{\tilde{r} \ge r} p(\tilde{r}) \qquad (16)$$

where r is the recall, c is the index of the given class (for single-class ship detection, c is 1), and t is the IoU threshold. The curve is made monotonically decreasing by re-assigning the precision at each recall to the maximum precision at any higher recall.
For ship detection, the bigger the mAP value is, the better the ship detection performance is.
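For reference, the 11-point interpolated AP of Equation (16) can be computed as in the sketch below, assuming per-detection (confidence, true-positive) pairs at a fixed IoU threshold are already available; the sample values are made up.

```python
# A minimal sketch of the 11-point interpolated average precision (Eq. 16).
import numpy as np

def average_precision_11pt(confs, tps, num_gt):
    order = np.argsort(confs)[::-1]                     # rank detections by confidence
    tps = np.array(tps)[order]
    tp_cum = np.cumsum(tps)
    fp_cum = np.cumsum(1 - tps)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    ap = 0.0
    for r in np.linspace(0, 1, 11):                     # 11 equally spaced recall points
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 11
    return ap

confs = [0.9, 0.8, 0.7, 0.6, 0.5]
tps   = [1,   1,   0,   1,   0]                         # whether each detection matched a GT
print(average_precision_11pt(confs, tps, num_gt=4))
```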

4.3. (2) F1 score


Sometimes the Precision and Recall indexes may be contradictory, so these two indexes shall be considered comprehensively, and the most common method is the comprehensive evaluation index (also called F-Score). F-Measure is the weighted harmonic mean of Precision and Recall; we use the most common F1 value, in which the weights of Precision and Recall are equal, and F1 is calculated as:

$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (17)$$

F1 synthesizes the results of P and R; a higher F1 value means the experimental method performs better.

4.4. SHIP dataset and experimental parameters in each scene


In order to evaluate the effectiveness of the proposed method, we make experimental comparisons with other models, including comparisons in various scenarios. The R2CNN (Jiang, Zhu, and
Wang et al. 2017), RRPN (Ma, Shao, and Ye et al. 2018), R-DFPN (Xue, Hao, and Kun et al.
2018) and SCRDet (Yang, Yang, and Yan et al. 2019) models are selected as our compar­
ison methods. In which, the R2CNN algorithm adds the pooling size of the length–width
ratio of the ship at the ROI Pooling layer (Jiang, Zhu, and Wang et al. 2017). RRPN proposes
R-anchors containing angle parameters in the RPN stage and resets the proportion and
ratio of the anchors according to the ship characteristics (Ma, Shao, and Ye et al. 2018).
R-DFPN network proposes the feature extraction network of DFPN and the rotation region
module of RDN (Xue, Hao, and Kun et al. 2018); SCRDet network implements more stable
detection of small objects, cluttered background, and rotating objects (Yang, Yang, and
Yan et al. 2019).
Our model is implemented on an NVIDIA 2080Ti GPU display card based on the
TensorFlow framework. The backbone network selects Resnet-101 network pre-trained
on ImageNet. The entire model is trained end to end with an initial learning rate of 0.001 and 100k training steps in total; the learning rate remains unchanged for the first 80,000 steps and is decreased to 0.0003 for the last 20,000 steps, the weight decay is 0.0001 and the momentum is 0.9.
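The step-wise learning rate schedule just described can be expressed as a simple function, assuming the total number of training steps is 100k with the decay applied after 80,000 steps; optimizer and framework-specific setup are omitted.

```python
# A piecewise-constant learning rate matching the schedule above (assumed 100k
# total steps, decayed after 80k); framework-specific optimizer setup is omitted.
def learning_rate(step, base_lr=0.001, decayed_lr=0.0003, decay_step=80_000):
    return base_lr if step < decay_step else decayed_lr

print(learning_rate(10_000), learning_rate(90_000))   # 0.001 0.0003
```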
The experimental comparison results using the three data sets are shown in Table 3.
The Recall, Precision, mAP and F1 of our proposed on the SHIP, HRSC2016 and DOTA-Ship
data sets are better than other models. In SHIP, compared with R2CNN, RRPN, R-DFPN and
SCRDet, mAP increased by 16.3%, 19.8%, 12.7%, 2.2%, and F1 increased by 16.5%, 17.3%,
5.2%, 4.2%; In HRSC2016, compared with R2CNN, RRPN, R-DFPN and SCRDet, mAP
increased by 5.6%, 1.6%, 9.1%, 0.7%, and F1 increased by 5.7%, 15.1%, 4.6%, 5.5%; In
DOTA-Ship, compared with R2CNN, RRPN, R-DFPN and SCRDet, mAP increased by 42.9%,
28.8%, 27.8%, 3.0%, and F1 increased by 33.1%, 29.5%, 27.3%, 1.9%.
The experimental results show that the proposed method performs best in multiple scenarios across the different datasets, especially for multi-scale, small, densely arranged ships and complex backgrounds, and it greatly improves Recall, Precision, mAP and F1.
Below we will compare and analyse specific experimental results for the five scenarios
in the SHIP data set to illustrate the effectiveness of our proposed model in detecting
ships with different characteristics in different scenarios.

4.4.1. A. Scene whether the ship is loaded with cargo


The experimental comparison results are shown in Table 4. In the cargo ship scene, compared with R2CNN, RRPN, R-DFPN, and SCRDet, our proposed method improves mAP by 17.6%, 21.1%, 13.8%, 2.8%, and F1 by 16.6%, 17.6%, 4.7%, 4.2%. This shows that in the multi-category ship scene, the proposed method effectively improves the various indicators of ship detection compared with the others.
The comparison effect of the detection of the cargo ship scene using SHIP is shown in
Figure 9. It can be seen that Figure 9(a) and (b) contain false detections of different degrees: there are multiple detection boxes for a single ship in Figure 9(a) and (b).

Table 3. The experimental results under different datasets by different methods.

           SHIP                        HRSC2016                    DOTA-Ship
Method     R     P     mAP   F1        R     P     mAP   F1        R     P     mAP   F1
R2CNN      63.5  72.7  69.3  67.8      89.0  84.7  81.6  86.8      41.8  55.2  26.3  47.6
RRPN       66.2  67.8  65.8  67.0      88.3  68.9  85.6  77.4      48.8  53.9  40.4  51.2
R-DFPN     75.3  83.4  72.9  79.1      87.2  88.7  78.1  87.9      50.3  56.8  41.4  53.4
SCRDet     77.8  82.7  83.4  80.2      91.4  82.7  86.5  87.0      69.2  91.3  66.2  78.8
Proposed   82.1  86.8  85.6  84.4      92.1  92.9  87.2  92.5      71.7  92.2  69.2  80.7
Table 4. Detection results of whether the ship is a cargo ship.


Model Scene Recall Precision mAP F1
R2CNN Cargo ship 0.615 0.713 0.658 0.660
Non cargo ship 0.633 0.728 0.724 0.677
RRPN Cargo ship 0.644 0.657 0.623 0.650
Non cargo ship 0.657 0.683 0.697 0.670
R-DFPN Cargo ship 0.739 0.823 0.696 0.779
Non cargo ship 0.746 0.837 0.769 0.789
SCRDet Cargo ship 0.764 0.806 0.806 0.784
Non cargo ship 0.778 0.832 0.856 0.804
Proposed Cargo ship 0.809 0.843 0.834 0.826
Non cargo ship 0.818 0.874 0.876 0.845

Figure 9. The detection effect in the cargo ship scene of the SHIP.

There are also situations where other similar objects are mistaken for ships in Figure 9(a). Figure 9(c) and (d) also have missed detections; for example, some ships are not detected, and the bounding boxes of the detected ships deviate considerably from their true values. In Figure 9(e), the above problems are avoided: there are no missed or false detections, and the bounding box offset from the true value is small.

4.4.2. B. Scene whether the ship is docked


The experimental comparison results are shown in Table 5. Compared with R2CNN, RRPN, R-DFPN, and SCRDet, our proposed model in the ship docking scene improves mAP by 18.9%, 21%, 14.7%, 2.2%, and F1 by 18.4%, 19.3%, 3.8%, 4.8%.
Table 5. Detection results of whether the ship is docked.


Model Scene Recall Precision mAP F1
R2CNN Docked 0.584 0.704 0.645 0.638
Not docked 0.657 0.737 0.741 0.695
RRPN Docked 0.635 0.623 0.624 0.629
Not docked 0.674 0.702 0.692 0.688
R-DFPN Docked 0.745 0.827 0.687 0.784
Not docked 0.768 0.845 0.756 0.805
SCRDet Docked 0.754 0.795 0.812 0.774
Not docked 0.788 0.841 0.866 0.814
Proposed Docked 0.793 0.853 0.834 0.822
Not docked 0.828 0.875 0.878 0.851

shows that, in the docking scene with densely arranged ships and complex backgrounds, the
proposed model effectively improves all ship-detection indicators compared with the other
methods.
The detection comparison for the docking scene of the SHIP dataset is shown in Figure 10. For
docked, densely arranged ships, a single ship is covered by multiple detection boxes in
Figure 10(a), and the detection boxes in Figure 10(b)–(d) show different degrees of redundancy
relative to the ground truth. The detection boxes in Figure 10(e) are the most accurate, and
their offset from the ground truth is small.

Figure 10. The detection effect in the docking scene of the SHIP.

Table 6. Detection results of the water scene where the ship is located.
Model Scene Recall Precision mAP F1
R2CNN River 0.636 0.706 0.687 0.669
Lake 0.647 0.785 0.684 0.709
Ocean 0.614 0.693 0.729 0.651
RRPN River 0.658 0.689 0.623 0.673
Lake 0.684 0.697 0.648 0.69
Ocean 0.639 0.623 0.687 0.631
R-DFPN River 0.743 0.812 0.697 0.776
Lake 0.756 0.842 0.716 0.800
Ocean 0.754 0.839 0.754 0.794
SCRDet River 0.765 0.834 0.804 0.798
Lake 0.814 0.839 0.826 0.826
Ocean 0.756 0.792 0.856 0.774
Proposed River 0.802 0.846 0.827 0.823
Lake 0.822 0.864 0.834 0.842
Ocean 0.823 0.873 0.879 0.847

4.4.3. Scene C: whether the ships are in a river, lake or ocean


The experimental results are shown in Table 6. In the river scene, compared with R2CNN,
RRPN, R-DFPN and SCRDet, the proposed method increases mAP by 14.0%, 20.4%, 13.0% and 2.3%,
and F1 by 15.4%, 15.0%, 4.7% and 2.5%; in the lake scene, it increases mAP by 15.0%, 18.6%,
11.8% and 0.8%, and F1 by 13.3%, 15.2%, 4.2% and 1.6%; in the ocean scene, it increases mAP
by 15.0%, 19.2%, 12.5% and 2.3%, and F1 by 19.6%, 21.6%, 5.3% and 7.3%. This shows that the
proposed method effectively improves all ship-detection indicators compared with the other
methods in river, lake and ocean scenes with complex backgrounds.
The detection comparison for the river scene of the SHIP dataset is shown in Figure 11. In
this complex river scene, the images contain jungle, houses, fields and objects that cannot
be distinguished accurately even by the naked eye. All of the results in Figure 11(a)–(d)
contain missed detections to varying degrees. Figure 11(b) is affected most severely, with
five ships missed, while Figure 11(a), (c) and (d) each miss one or two. In Figure 11(e),
there are no missed or false detections, and the difference between the predicted boxes and
the ground truth is the smallest.

4.4.4. Scene D: whether the ships are crowded


The experimental comparison results are shown in Table 7. Compared with R2CNN, RRPN, R-DFPN
and SCRDet, the proposed method improves mAP in the crowded-ship scene by 17.8%, 20.3%,
12.5% and 3.0%, and F1 by 16.4%, 20.3%, 4.6% and 3.7%. This shows that, in scenes with
densely arranged ships and complex backgrounds, the proposed method effectively improves
all ship-detection indicators compared with the other methods.
The detection comparison for the dense scenes of the SHIP dataset is shown in Figure 12. The
scene combines a port, dense ship arrangement and cargo ships: there are cluttered containers
on the dock, and cars on the overpass whose shapes resemble ship hulls. In Figure 12(a) and
(b), containers and other objects on land are falsely detected to different degrees, there are
redundant detection boxes, and some predicted bounding boxes contain large redundant areas.
In Figure 12(c), missed detections occur. In Figure 12(d), some detection boxes have large
redundant areas and there are also missed detections.

Figure 11. The detection effect in the river scene of the SHIP.

Table 7. Detection and comparison results of ship density scenes.


Model Scene Recall Precision mAP F1
R2CNN Crowded 0.613 0.686 0.638 0.647
Sparse 0.65 0.754 0.754 0.698
RRPN Crowded 0.593 0.624 0.613 0.608
Sparse 0.708 0.714 0.689 0.711
R-DFPN Crowded 0.727 0.807 0.691 0.765
Sparse 0.774 0.86 0.769 0.815
SCRDet Crowded 0.758 0.791 0.786 0.774
Sparse 0.791 0.851 0.876 0.82
Proposed Crowded 0.798 0.825 0.816 0.811
Sparse 0.836 0.883 0.886 0.859

None of the above errors occurs in Figure 12(e), where the difference between the predicted
boxes and the ground truth is the smallest.

4.4.5. Scene E: the ground-truth bounding-box area (scale) of the ships


The experimental comparison results are shown in Table 8. For large ships, compared with
R2CNN, RRPN, R-DFPN and SCRDet, the proposed method increases mAP by 15.8%, 19.7%, 14.5%
and 3.4%, and F1 by 15.2%, 24.4%, 5.0% and 2.9%. For medium ships, it increases mAP by
14.8%, 17.3%, 13.0% and 0.5%, and F1 by 14.8%, 14.9%, 5.0% and 2.9%. For small ships, it
increases mAP by 19.7%, 22.6%, 13.1% and 2.1%, and F1 by 19.3%, 21.3%, 3.5% and 6.2%. This
shows that, in multi-scale and

Figure 12. Detection effect in dense scene of SHIP.

Table 8. Detection and comparison results of ship multi-scale scenes.


Model Scene Recall Precision mAP F1
R2CNN Big ship 0.662 0.743 0.723 0.7
Medium ship 0.657 0.756 0.716 0.703
Small ship 0.579 0.672 0.645 0.622
RRPN Big ship 0.665 0.723 0.684 0.608
Medium ship 0.701 0.704 0.691 0.702
Small ship 0.607 0.598 0.616 0.602
R-DFPN Big ship 0.768 0.84 0.736 0.802
Medium ship 0.766 0.839 0.734 0.801
Small ship 0.724 0.822 0.711 0.78
SCRDet Big ship 0.8 0.848 0.847 0.823
Medium ship 0.803 0.842 0.859 0.822
Small ship 0.723 0.786 0.821 0.753
Proposed Big ship 0.831 0.875 0.881 0.852
Medium ship 0.829 0.874 0.864 0.851
Small ship 0.797 0.834 0.842 0.815

small-ship scenes, the proposed method effectively improves all ship-detection indicators
compared with the other methods.
The detection effect in the multi-scale scene of the SHIP dataset is shown in Figure 13.
These images show ships of multiple sizes in a river, with natural farmland, houses and
indistinct objects on land. In Figure 13(a), (b) and (d), the small ship in the

Figure 13. Detection effect in SHIP dataset multi-scale scene.

upper left corner is missed. Figure 13(a) also contains redundant bounding boxes with false
detections. In Figure 13(c), there are missed detections, and the predicted bounding boxes
deviate considerably from the ground truth. In Figure 13(d), the middle bounding box also
shows a certain degree of redundancy. In Figure 13(e), none of these errors occurs, and the
predicted detection boxes are the most accurate.

4.5. Ablation experiment


The model we propose is composed of a feature pyramid that fuses multi-level features across
levels, anchors with an added angle parameter, and a ship-matched pooling size together with
an NMS re-scoring mechanism in the ROI Pooling stage. To illustrate the effectiveness of each
structure, ablation experiments are performed in this section on the SHIP, HRSC2016 and
DOTA-Ship datasets.
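To make the ablated components concrete, the sketch below shows one way to enumerate multi-angle anchors, i.e. the usual (x, y, w, h) anchor extended with an angle parameter. The specific scales, aspect ratios and angle set are illustrative assumptions and not necessarily the values used in our implementation.

```python
# Illustrative generation of rotated (multi-angle) anchors (cx, cy, w, h, theta).
# Scales, ratios and the angle set are assumptions chosen only for exposition.
import itertools
import math

def rotated_anchors(cx, cy, scales=(32, 64, 128), ratios=(1/3, 1/5, 1/7),
                    angles_deg=(-90, -60, -30, 0, 30, 60)):
    """Enumerate rotated anchors centred at (cx, cy).

    Ships are elongated, so width/height ratios well below 1 are used; theta is
    the box rotation in radians."""
    anchors = []
    for s, r, a in itertools.product(scales, ratios, angles_deg):
        w = s * math.sqrt(r)   # short side
        h = s / math.sqrt(r)   # long side
        anchors.append((cx, cy, w, h, math.radians(a)))
    return anchors

# Anchors at one feature-map location: 3 scales x 3 ratios x 6 angles = 54 boxes.
print(len(rotated_anchors(256.0, 256.0)))
```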
As shown in Table 9, Models 1, 2, 3 and 4 lack the improved ROI pooling, the NMS re-scoring,
the multi-angle anchors and the cross-level feature extraction, respectively; Model 5
contains all four structures, and Model 6 gives the results after data augmentation. In the
table, Au denotes data augmentation, E cross-level feature extraction, A the multi-angle
anchors, NMS the NMS re-scoring network and ROI the improved ROI pooling layer. The
comparison shows that all of these structures help to improve the recall, precision and F1
of ship detection. Among them, adding a pooling size matched to the ship

Table 9. Recall, Precision and F1 indexes (%) of the ablation experiments of our model.

Dataset                                SHIP                  HRSC2016              DOTA-Ship
M   Au.  E.   A.   NMS   ROI Pool      R     P     F1        R     P     F1        R     P     F1
1        √    √    √                   71.4  69.5  70.4      86.5  79.6  82.9      52.3  55.9  54.0
2        √    √          √             74.6  81.9  78.1      84.6  81.9  83.2      50.7  61.3  55.5
3        √         √     √             71.6  73.5  72.5      86.4  85.7  86.0      49.6  65.5  55.7
4             √    √     √             72.3  79.2  75.6      85.6  86.8  86.2      57.4  67.8  62.2
5        √    √    √     √             78.8  82.6  80.7      89.4  89.7  89.5      71.7  92.2  80.7
6   √    √    √    √     √             81.9  86.8  84.3      92.1  92.9  92.5      –     –     –

dimensions to the ROI pooling layer has the greatest effect on the overall model: F1
increases by 6.6% on HRSC2016, 26.7% on DOTA-Ship and 10.3% on SHIP.
Because the DOTA-Ship images contain many densely arranged ships, data augmentation reduces
accuracy on that dataset, so only the HRSC2016 and SHIP datasets were augmented. The
experimental results show that this dataset expansion has a clear positive impact on the
results.
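The sketch below illustrates the kind of label-preserving transforms (horizontal flips and 90° rotations, with the rotated boxes remapped accordingly) that can be used for such augmentation; it does not reproduce the exact operations applied to SHIP and HRSC2016, so treat it as an assumed example.

```python
# Illustrative augmentation for images with rotated box labels (cx, cy, w, h, theta).
# The actual operations used to expand SHIP and HRSC2016 may differ; this is an
# assumed example only.
import numpy as np

def hflip(image, boxes):
    """Mirror the image left-right; box centres are mirrored and angles negated."""
    w = image.shape[1]
    flipped = image[:, ::-1].copy()
    out = [(w - cx, cy, bw, bh, -theta) for cx, cy, bw, bh, theta in boxes]
    return flipped, out

def rot90_ccw(image, boxes):
    """Rotate the image 90 degrees counter-clockwise and remap the boxes."""
    w = image.shape[1]
    rotated = np.rot90(image).copy()
    out = [(cy, w - cx, bw, bh, theta + np.pi / 2) for cx, cy, bw, bh, theta in boxes]
    return rotated, out

# Example: augment one 512x512 image carrying a single rotated box label.
img = np.zeros((512, 512, 3), dtype=np.uint8)
boxes = [(100.0, 200.0, 30.0, 90.0, 0.3)]
aug_img, aug_boxes = rot90_ccw(*hflip(img, boxes))
print(aug_img.shape, aug_boxes)
```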

5. Conclusions
In this paper, we have proposed a network model for remote-sensing ship detection in
multiple scenes, consisting of a feature extraction network module, a rotation proposal
region module and an angle re-scoring module. To handle the multiple scales and complicated
backgrounds of remote-sensing ships, the feature extraction network module improves the
feature pyramid structure to combine cross-level, cross-layer features and obtain the
optimal fused ship features. To address the detection difficulty caused by crowded ships
and redundant bounding boxes, rotated bounding boxes with angle parameters are added in the
RPN stage of the rotation proposal region module and in the ROI Pooling layer. In the angle
re-scoring module, we propose a stacked bidirectional recurrent neural network fused with a
self-attention network, which refines the confidence of proposal regions through the angle
parameters of their coordinates and yields more precisely positioned proposals. As the
experimental results show, all of these modules benefit the final detection results, and,
especially in the diverse scenes of remote-sensing ships, our model achieves state-of-the-art
performance.
Our model has achieved promising results in remote-sensing ship detection, and in future
work we will focus on the following two aspects. (1) Exploiting prior knowledge of
remote-sensing ships: parts of a ship (e.g. the bow and stern) have specific characteristics
that can be used to improve the prediction of the bounding-box angles. (2) Increasing the
detection speed and model robustness: the parameters (e.g. the angle) that we add to the
basic detection model to describe ship features increase the detection time, so we have to
reduce the detection time without affecting the model accuracy.

Disclosure of potential conflicts of interest


No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the National Key Research and Development Project of China
[2016YFC1400302]; the National Natural Science Foundation of China [61501155, 61871164]; and
the National Defense Science and Technology Key Laboratory Fund [6142401200201].

ORCID
Shuaishuai Lv http://orcid.org/0000-0002-9822-9959

References
Bahdanau, D., K. Cho, and Y. Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align
and Translate.” arXiv 1409: 0473.
Bay, H., A. Ess, T. Tuytelaars, and L. Van Gool. 2008. "Speeded-Up Robust Features (SURF)[J]." Computer
Vision and Image Understanding 110 (3): 346–359. DOI:10.1016/j.cviu.2007.09.014.
Chang, H. H., G. L. Wu, and M. H. Chiang. 2019. “Remote-sensing Image Registration Based on
Modified SIFT and Feature Slope Grouping[J].” IEEE Geoscience and Remote Sensing Letters 16:
1363–1367. doi:10.1109/LGRS.2019.2899123.
Chen, W., X. Li, H. He, and L. A. Wang. 2018. “Review of Fine-Scale Land Use and Land Cover
Classification in Open-Pit Mining Areas by Remote Sensing Techniques[J].” Remote Sens 10: 15.
doi:10.3390/rs10010015.
Dong, C., J. Liu, F. Xu, and C. Liu. 2019. “Ship Detection from Optical Remote Sensing Images Using
Multi-Scale Analysis and Fourier HOG Descriptor[J].” Remote Sens 11: 1529. doi:10.3390/
rs11131529.
Girshick, R., J. Donahue, T. Darrell, J. Malik. “Rich Feature Hierarchies for Accurate Object Detection
and Semantic Segmentation[C].” In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition,Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
He, K., X. Zhang, S. Ren, J. Sun. “Deep Residual Learning for Image Recognition[C]”. 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp.
770–778, doi: 10.1109/CVPR.2016.90.
Jiang, Y., X. Zhu, X. Wang, S. Yang, W. Li, H.Wang, P. Fu, et al. 2017. “R2CNN: Rotational Region CNN
for Orientation Robust Scenetext Detection.” CoRR abs/1706:09579.
Jiao, J., Y. Zhang, H. Sun, X. Yang, X. Gao, W. Hong, K. Fu. 2018. “A Densely Connected End-to-End
Neural Network for Multiscale and Multiscene SAR Ship Detection[J].” IEEE Access 6 :20881–20892.
doi:10.1109/ACCESS.2018.2825376.
Kang, M., X. Leng, Z. Lin, K. F. Ji. “A Modified Faster R-CNN Based on CFAR Algorithm for SAR Ship
detection[C]”. 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP),
Shanghai, 2017, pp. 1–4, doi: 10.1109/RSIP.2017.7958815.
Kim, T., S. Oh, T. B. Chun, and M. Lee “Impact of Atmospheric Correction on the Ship Detection Using
Airborne Hyperspectral Image[C]”. IGARSS 2019-2019 IEEE International Geoscience and Remote
Sensing Symposium, Yokohama, Japan, 2019, pp. 2190–2192, doi: 10.1109/IGARSS.2019.8898766.
Li, J., C. Qu, and J. Shao. “Ship Detection in SAR Images Based on an Improved Faster R-CNN[C]” 2017
SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), Beijing, 2017, pp. 1–6, doi:
10.1109/BIGSARDATA.2017.8124934.
Li, K., G. Cheng, S. Bu, X. You. 2017. “Rotation-Insensitive and Context-Augmented Object Detection
in Remote Sensing Images[J].” IEEE Transactions on Geoence and Remote Sensing 56 (4):
2337–2348.
Li, X., Z. Tang, W. Chen, L. Wang. 2019. “Multimodal and Multi-Model Deep Fusion for Fine
Classification of Regional Complex Landscape Areas Using ZiYuan-3 Imagery[J].” Remote
Sensing 11 (22): 2716. DOI:10.3390/rs11222716.

Lin, T. Y., P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object
Detection[C]". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI,
USA. IEEE Computer Society, 2017. DOI: 10.1109/CVPR.2017.106
Ma, J., J. Jiang, H. Zhou, J. Zhao. 2018. “Guided Locality Preserving Feature Matching for Remote
Sensing Image registration[J].” IEEE Transactions on Geoscience and Remote Sensing 56
:4435–4447. doi:10.1109/TGRS.2018.2820040.
Ma, J., W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, et al. 2018. “Arbitrary-Oriented Scene Text
Detection via Rotation Proposals[J].” IEEE Transactions on Multimedia 20 (11): 3111–3122.
doi:10.1109/TMM.2018.2818020.
Ma, J., Z. Zhou, B. Wang, H. Zong, F. Wu. 2019. “Ship Detection in Optical Satellite Images via
Directional Bounding Boxes Based on Ship Center and Orientation Prediction[J].” Remote Sensing
11 (18): 2173. doi:10.3390/rs11182173.
Ma, W., Q. Guo, Y. Wu, W. Zhao, X. Zhang, L. Jiao. 2019. “Novel Multi-Model Decision Fusion Network
for Object Detection in Remote Sensing Images.” Remote Sens 11 :737. doi:10.3390/rs11070737.
Shi, Z., X. Yu, Z. Jiang, and B. Li. 2013. “Ship Detection in High-resolution Optical Imagery Based on
Anomaly Detector and Local Shape Feature[J].” IEEE Transactions on Geoscience and Remote
Sensing 52: 4511–4523.
Simonyan, K., and A. Zisserman. “Very Deep Convolutional Networks for Large-scale Image
Recognition”. In ICLR, 2015.
Singh, B., T. K. Marks, M. Jones, O. Tuzei, M. Shao. “A Multi-stream Bi-directional Recurrent Neural
Network for Fine-Grained Action Detection[C]”. 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, USA. IEEE, 2016.
Sun, X., Y. Liu, Z. Yan, P. Wang, W. Diao, K. Fu, et al. 2020. “SRAF-Net: Shape Robust Anchor-Free
Network for Garbage Dumps in Remote Sensing Imagery.” IEEE Transactions on Geoscience and
Remote Sensing. doi:10.1109/TGRS.2020.3023928
Tang, J., C. Deng, G. Huang, B. Zhao. 2014. “Compressed-Domain Ship Detection on Spaceborne
Optical Image Using Deep Neural Network and Extreme Learning Machine[J].” IEEE Transactions
on Geoscience and Remote Sensing 53 (3): 1174–1185. DOI:10.1109/TGRS.2014.2335751.
Tian, T., Z. Pan, X. Tan, Z. Chu. 2020. “Arbitrary-Oriented Inshore Ship Detection Based on Multi-Scale
Feature Fusion and Contextual Pooling on Rotation Region Proposals[J].” Remote Sensing 12 (2).
339. doi:10.3390/rs12020339.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, et al. 2017. “Attention
Is All You need[C].” Advances in Neural Information Processing Systems, 5998–6008.
Wu, F., Z. Zhou, B. Wang, J. Ma. 2018. “Inshore Ship Detection Based on Convolutional Neural
Network in Optical Satellite Images[J].” IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 11 (11): 1–11. doi: 10.1109/JSTARS.2018.2873190.
Wu, Y., W. Ma, M. Gong, Z. Bai, W. Zhao, Q. Guo, X. Chen, et al. 2020. “A Coarse-to-Fine Network for
Ship Detection in Optical Remote Sensing Images.” Remote Sens 12 :246. doi:10.3390/rs12020246.
Xia, G., X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, et al. “DOTA: A Large-scale Dataset for
Object Detection in Aerial Images[C]”.2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA. IEEE, 2018.
Xian, S., P. Wang, C. Wang, Y. Liu, and K. Fu. 2020. "PBNet: Part-based Convolutional Neural Network for
Complex Composite Object Detection in Remote Sensing Imagery." ISPRS Journal of Photogrammetry and
Remote Sensing 173: 50–65. DOI:10.1016/j.isprsjprs.2020.12.015.
Xiao, X., Z. Zhou, B. Wang, L. Li, L. Miao. 2019. “Ship Detection under Complex Backgrounds Based on
Accurate Rotated Anchor Boxes from Paired Semantic Segmentation[J].” Remote Sensing 11 (21):
2506. DOI:10.3390/rs11212506.
Xue, Y., S. Hao, F. Kun, J. Yang, X. Sun, M. Yan, Z. Guo. 2018. “Automatic Ship Detection in Remote
Sensing Images from Google Earth of Complex Scenes Based on Multiscale Rotation Dense
Feature Pyramid Networks[J].” Remote Sensing 10 (1): 132. DOI:10.3390/rs10010132.
Yang, X., J. Yang, J. Yan, Y. Zhang. “SCRDet: Towards More Robust Detection for Small, Cluttered and
Rotated Objects[C]”. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul,
Korea (South), 2019, 8231–8240, doi: 10.1109/ICCV.2019.00832.

You, Y., J. Cao, Y. Zhang, F. Liu, W. Zhou. 2019. “Nearshore Ship Detection on High-Resolution
Remote Sensing Image via Scene-Mask R-CNN[J].”IEEE Access. PP(99). 1. doi: 10.1109/
ACCESS.2019.2940102.
Zhang, R., J. Yao, K. Zhang, C. Feng, J. Zhang. 2016. “S-CNN-Based Ship Detection from
High-resolution Remote-sensing Image[J].” International Archives of the Photogrammetry,
Remote Sensing and Spatial Information Sciences 41:917–921.
Zhu, C., H. Zhou, R. Wang, J. Guo, and A. Novel Hierarchical. 2010. “Method of Ship Detection from
Spaceborne Optical Image Based on Shape and Texture Features[J].” IEEE Transactions on
Geoscience and Remote Sensing 48 (9): 3446–3456. doi:10.1109/TGRS.2010.2046330.
Zou, Z., and Z. Shi. 2016. “Ship Detection in Spaceborne Optical Image with SVD Networks[J].” IEEE
Transactions on Geoscience and Remote Sensing 54: 5832–5845. doi:10.1109/TGRS.2016.2572736.
