Ship Detection of Optical Remote Sensing Image in Multiple Scenes
Xungen Li, Zixuan Li, Shuaishuai Lv, Jing Cao, Mian Pan, Qi Ma & Haibin Yu
International Journal of Remote Sensing, 2021. DOI: 10.1080/01431161.2021.1931544
1. Introduction
Ship identification and detection are of great significance for national defence construction, port management, cargo transport, maritime rescue, combating illegal ships, and so on (Tang, Deng, and Huang et al. 2014; Zhu et al. 2010; Chen et al. 2018; Li, Tang, and Chen et al. 2019; Ma et al. 2018). However, the existing ship monitoring systems cannot meet the needs of ship monitoring and emergency management over wide sea areas (Li, Cheng, and Bu et al. 2017). Compared with SAR images (Kang, Leng, and Lin et al. 2017; Jiao, Zhang, and Sun et al. 2018; Li, Qu, and Shao 2017) and hyperspectral images (Kim et al. 2019), optical remote-sensing images have high resolution and low noise. The structure and texture information of the ship object is clearer, which is more conducive to extracting and analysing ship features during detection, and the scenes in which ships appear are richer and more extensive. In the field of remote sensing, ship detection based on optical remote-sensing images has therefore received extensive attention (Chang, Wu, and Chiang 2019; Dong et al. 2019; Bay, Ess, and Tuytelaars et al. 2008; Zou and Shi 2016; Shi et al. 2013). However, with changes in light intensity, climatic conditions, sea surface topography, and so on, the appearance of ships in complex backgrounds varies greatly. Traditional feature-based detection struggles to extract effective ship features, resulting in low detection accuracy. Therefore, ship detection in multiple scenarios remains a major challenge.
Ships in remote-sensing images are characterized by multiple scales, small objects, dense arrangement, and complex backgrounds. These characteristics appear frequently in scenes such as ports and shipping routes, so solving these problems has strong engineering significance. The existing remote-sensing ship datasets are small, and their scene types are single. Current deep-neural-network ship detection methods mostly use single ship scenes, which cannot cover the ship characteristics mentioned above (Girshick, Donahue, and Darrell et al. 2014; Zhang, Yao, and Zhang et al. 2016; Zou and Shi 2016; Wu, Zhou, and Wang et al. 2018; You, Cao, and Zhang et al. 2019; Xian, Pwab, and Cheng et al. 2020) and cannot prove their effectiveness in detecting ships in multiple scenarios. Remote-sensing ship detection should, firstly, be strongly adaptive and able to handle identification and detection against complicated backgrounds; secondly, it should be designed for the characteristics of remote-sensing ships, for example, crowded arrangement and multiple scales; finally, it should make better use of prior conditions such as the mutual position correlation of ships. Designing a ship identification and detection model that adapts to different illumination, climate conditions and ocean scenes is therefore the key problem in current remote-sensing ship detection.
The traditional ship detection algorithm based on optical remote-sensing images is mainly divided into three steps. Firstly, extract candidate ship regions using sliding windows, selective search, and so on; then extract the shape and texture features of ships using the HOG feature descriptor (Chang, Wu, and Chiang 2019), the SIFT local feature descriptor (Dong et al. 2019), or the SURF feature descriptor (Bay, Ess, and Tuytelaars et al. 2008); finally, judge the categories with classifiers such as the support vector machine (Zou and Shi 2016) or Adaboost (Shi et al. 2013), and refine the positions of the effective proposal boxes. However, remote-sensing ships appear against complicated backgrounds with many interfering factors. Because the hand-crafted features are limited and the detection robustness is poor, traditional ship detection models cannot provide a comprehensive and correct description of ship features; moreover, traditional ship detection is not an end-to-end process, which greatly reduces the accuracy and efficiency of ship detection.
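To make the traditional three-step pipeline concrete, the following is a minimal sketch rather than the exact procedure of the cited works: a sliding window scans the image, each patch is described with HOG features, and a pre-trained classifier decides whether the patch contains a ship. The window size, the step, and the fitted classifier `clf` (e.g. a scikit-learn SVM) are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog

def sliding_window_ship_detector(image, clf, win=64, step=32):
    """Toy version of the three-step pipeline: propose windows, extract HOG
    features, classify each window as ship (1) or background (0)."""
    detections = []
    for y in range(0, image.shape[0] - win + 1, step):
        for x in range(0, image.shape[1] - win + 1, step):
            patch = image[y:y + win, x:x + win]            # grayscale patch
            feat = hog(patch, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            if clf.predict(feat.reshape(1, -1))[0] == 1:   # 1 = ship
                detections.append((x, y, win, win))        # (x, y, w, h) box
    return detections
```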
With the rapid development of deep learning in the ship detection field, more and more scholars have begun to apply deep-learning-based ship detection techniques to the study of remote-sensing ship detection (Girshick, Donahue, and Darrell et al. 2014; Zhang, Yao, and Zhang et al. 2016; Zou and Shi 2016; Wu, Zhou, and Wang et al. 2018; You, Cao, and Zhang et al. 2019; Sun, Liu, and Yan et al. 2020; Xian, Pwab, and Cheng et al. 2020).
The deep learning-based ship detection method is generally improved from the following
two aspects:
Firstly, increase the precision of the extracted remote-sensing ship features by modifying the feature extraction network, thus addressing detection difficulties such as multiple scales, small objects, and complicated backgrounds. For example, He et al. (Girshick, Donahue, and Darrell et al. 2014) resolved the problems of low ship detection precision and high missed-detection and false-detection rates in satellite images by enhancing the extraction of texture features from satellite images. Zhang et al. (Zhang, Yao, and Zhang et al. 2016) proposed a new ship detection method called S-CNN, which combined specially designed proposal regions extracted from the ship model to significantly improve detection performance. Zou et al. (Zou and Shi 2016) proposed SVDNet, which performs adaptive feature learning from remote-sensing images with a convolutional neural network and a singular value compensation algorithm.
Secondly, utilize prior information about ships. Some scholars used the known prior information of ship dimensions to apply rotated proposal regions to remote-sensing ship detection, thus improving detection performance (Wu, Zhou, and Wang et al. 2018; You, Cao, and Zhang et al. 2019; Sun, Liu, and Yan et al. 2020; Xian, Pwab, and Cheng et al. 2020; Ma, Guo, and Wu et al. 2019; Wu, Ma, and Gong et al. 2020; Ma, Zhou, and Wang et al. 2019; Tian, Pan, and Tan et al. 2020; Xiao, Zhou, and Wang et al. 2019). Other scholars improved detection by using prior information about remote-sensing ship positions. For example, Wu et al. (Wu, Zhou, and Wang et al. 2018) proposed a new inshore ship detection method, which searches different parts of the ship head through a classification network to obtain possible bow positions and rough ship directions, which is very helpful for producing smaller and more precise ship proposal regions.
You et al. (You, Cao, and Zhang et al. 2019) proposed an end-to-end Scene-Mask R-CNN based on the accuracy and feasibility of the DCNN detection framework, to reduce the false detection of ship-shaped objects on land. Sun et al. (Xian, Pwab, and Cheng et al. 2020) proposed a unified part-based convolutional neural network (PBNet), which treats a composite object as a group of parts and incorporates part information into context information to improve composite object detection; the correct part information can guide the prediction of a composite object, thus alleviating the problems caused by various shapes and sizes. Ma et al. (Ma, Guo, and Wu et al. 2019) considered contextual information and multi-region features and proposed a novel multi-model decision fusion framework to address the diversity and complexity of geospatial object appearance and the insufficient understanding of geospatial object spatial structure. Sun et al. (Sun, Liu, and Yan et al. 2020) proposed a shape robust anchor-free network (SRAF-Net), consisting of feature extraction, multitask detection, and postprocessing, to solve the problem of blurred boundaries in garbage dumps. Ma et al. (Ma, Zhou, and Wang et al. 2019) proposed a two-stage ship detection method based on ship centre and orientation prediction, which constructs a central region prediction network and a ship orientation classification network to produce rotated region proposals and then predicts the rotated bounding box from these proposals. Tian et al. (Tian, Pan, and Tan et al. 2020) improved the non-maximum suppression for rotated proposals and improved the pooling of contextual features.
The above methods improve the accuracy of ship detection by modifying the feature extraction network, modifying the size of the proposal regions, and using the ships' prior information, but some problems remain: the available optical remote-sensing ship datasets contain a single type of ship scene, so a model cannot verify its effectiveness in detecting ships with different characteristics. This is a problem we must consider when designing a detection method for ships with multi-scale, small-object, dense and complex-background characteristics in multiple scenarios.
Concerning the above problems and considering the real multiple scenes of remote-sensing ships, we have collected remote-sensing ship images of complicated scenes from Google Earth and labelled the ships. We have designed an end-to-end multi-scene ship detection network framework, which achieves good performance in multiple scenes. The main contributions of this paper are as follows:
(1) Fuse prior position information. Considering the prior condition that remote-sensing ships have angle consistency within a region, that is, in most scenes the angles of the ground-truth bounding boxes of nearby remote-sensing ships are close, we add a bidirectional recurrent neural network fused with a self-attention mechanism into the network framework to obtain more precise proposal regions by re-scoring the confidence of the proposal regions.
(2) Fuse prior dimension information. To obtain more precise ship positioning, we design anchors conforming to the characteristic proportion, ratio, and angle of ships and, considering the shape of ship bounding boxes, add pooling sizes conforming to the ship shape in the ROI Pooling layer. In this way, the redundant region of the bounding box is reduced, yielding good performance in crowded scenes.
(3) Fuse cross-level features. For the multi-scale and diverse characteristics of remote-sensing ships, we design a network that fuses cross-level feature layers based on the feature pyramid and increases the feature precision for ships of different scales, improving performance on the multi-scale and small-object problems of remote-sensing ships.
(4) Design a dataset. To resolve the problem that current ship detection scenes are single, we have collected remote-sensing ship images of frequently encountered sea areas, ports, shores, and riverways in practical engineering applications, containing ships of different classes and sizes. These images cover the features of remote-sensing ships, such as multiple scales, small objects, crowded arrangements, and complicated backgrounds.
Finally, we have verified the effectiveness of the proposed method through experiments. Our network framework not only proves effective on our dataset but also performs excellently on the public ship dataset HRSC2016 and the large-scale aerial dataset DOTA (Xia, Bai, and Ding et al. 2018) compared with other advanced methods.
The rest of this paper is organized as follows: Part 2 introduces the general framework of the proposed model and the detailed method. Part 3 briefly introduces the overall algorithm procedure of our model as well as the training and testing procedures. Part 4 introduces the three datasets as well as the experiments with our model framework and the contrast models in each scene of the datasets. Part 5 summarizes the full text and discusses future work.
2. Proposed Method
We introduce our model framework in this section. The general framework, shown in Figure 1, includes the feature extraction network module, the rotation proposal region module, and the angle re-scoring module. The feature extraction network module fuses the multi-level feature layers of the top-down network to obtain multi-scale features. The rotation proposal region module produces anchors that conform to remote-sensing ship characteristics and contain angle parameters in the RPN stage, and adds pooling sizes matching the ship length-width ratio in the ROI pooling module. The angle re-scoring module recalculates the confidence of the original proposal regions through a bidirectional recurrent network and a re-scoring network with a self-attention mechanism, according to the parameters of the proposal regions (for example, the angle). Finally, in the Fast R-CNN stage, we perform position regression correction and class prediction for the proposal regions.
We first select ResNets (He, Zhang, and Ren et al. 2016) pre-trained on the ImageNet dataset (Simonyan and Zisserman 2015) as the baseline network to extract features from the input remote-sensing ship image in the feature extraction module of Figure 1. We select the feature maps of the last layers of the last four residual modules to build the bottom-up network; we then obtain feature maps with higher resolution through lateral connections and multi-level cross-layer connections, thus building the top-down network. In the multi-level cross-layer connection, the input of the feature map of each layer is the concatenation of the output upsampled from the feature map of the adjacent upper layer and the outputs upsampled from the feature maps of the cross-level higher layers.
Specifically, we select the feature maps with four different resolutions output by the conv2, conv3, conv4 and conv5 modules of ResNets as the bottom-up network. As shown in Figure 2, the size of the image input to the feature extraction network is H × W pixels, and the resolutions of the feature maps of the convolutional layers are, respectively, H/4 × W/4, H/8 × W/8, H/16 × W/16 and H/32 × W/32. In the top-down network, the input of the feature map of each layer includes the output of the corresponding feature layer of the bottom-up network after a 1 × 1 convolution and the output of the upper layers of the network after upsampling to the corresponding size. Taking the lowest three layers of the top-down network as the output of the feature extraction network, the computation of each fused feature layer can be expressed as:
$P_k = f_{3\times 3}\left(\sum_{i=1}^{k+i\le 5} \mathrm{Up}_{2^i}\left(P_{k+i}\right) + f_{1\times 1}\left(C_k\right)\right)$  (1)
where $C_k$ is the feature map of the k-th layer of the bottom-up network, $P_k$ is the fused feature map of the k-th layer (k decreases from top to bottom), $f_{1\times 1}$ and $f_{3\times 3}$ denote 1 × 1 and 3 × 3 convolutional layers, Up denotes upsampling, and $2^i$ is the upsampling factor: i is 1 for the adjacent upper layer and increases by one for each additional layer. After this process, we obtain a pyramid network with multi-level cross-layer feature fusion. The connected inputs pass through a 3 × 3 convolution to reduce the aliasing effects caused by upsampling, and the fused feature maps serve as the input of the subsequent prediction. The subsequent predictions do not share classification and regression parameters; the output of each layer is independent. The feature maps produced in this way contain more high-level features on top of the fusion of multi-level semantic features and yield more accurate remote-sensing ship feature information.
We find that adding multi-level cross-layer connections to the feature pyramid structure not only better fuses the multi-scale features of remote-sensing ships but also increases the precision of feature extraction for ships in complicated backgrounds.
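As a concrete illustration, the following Keras sketch builds the cross-level fusion of equation (1) from given bottom-up feature maps C2-C5; the channel width, layer choices, and function name are our assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cross_level_fusion(c2, c3, c4, c5, channels=256):
    """Fuse bottom-up maps C2-C5 into top-down maps P2-P4 following Eq. (1):
    each P_k adds a 1x1-projected C_k to every higher-level P_{k+i}
    upsampled by 2**i, then applies a 3x3 conv to reduce aliasing."""
    lateral = {k: layers.Conv2D(channels, 1, padding='same')(c)
               for k, c in zip((2, 3, 4, 5), (c2, c3, c4, c5))}
    p = {5: layers.Conv2D(channels, 3, padding='same')(lateral[5])}
    for k in (4, 3, 2):
        fused = lateral[k]
        for i in range(1, 6 - k):                      # all levels with k + i <= 5
            up = layers.UpSampling2D(size=2 ** i)(p[k + i])
            fused = layers.Add()([fused, up])
        p[k] = layers.Conv2D(channels, 3, padding='same')(fused)
    return p[2], p[3], p[4]                            # lowest three fused layers
```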
Figure 1. The overall network framework: the ship image passes through the cross-level feature extraction module (C2-C5 fused into P2-P5), the rotation proposal module (RPN anchors with classification scores and regression of (x, y, w, h, θ)), the angle re-scoring module (RNN re-scoring), and multi-scale ROI pooling (7×7, 3×10, 10×3), followed by the Fast R-CNN classification and regression heads.
The remote-sensing ship is an object with a large length-width ratio. As can be seen from Figure 3, because of this large length-width ratio, the horizontal bounding box brings in many redundant pixels that do not belong to the ship, causing deviations in the positioning result. In addition, in crowded scenes, the large overlap among the horizontal ground-truth bounding boxes of multiple ships produces large IoU values, which causes ground-truth boxes to be filtered out in NMS, correct proposal regions to be discarded, and detections to be missed.
Using rotated bounding boxes requires modifying the representation and structural parameters of the anchors; the rotation angle and the ROI Pooling layer must also change. To detect remote-sensing ships with large length-width ratios, we modify the RPN and Fast R-CNN stages of ship detection. In the RPN stage, we redefine the size and length-width ratio of the anchors according to the characteristics of remote-sensing ships and add angle parameters, obtaining rotated bounding boxes conforming to the shape of remote-sensing ships. In the ROI pooling stage, we add a new pooling size, which restrains the missed-detection problem caused by NMS and yields fixed-size feature maps that better conform to the shape of remote-sensing ships.
Figure 2. The cross-level feature pyramid structure: the bottom-up layers C2-C5 of the input image are fused into the top-down layers P2-P5, on which the predictions are made.
Figure 3. The marking conditions of the horizontal bounding box (a) and the rotation bounding box (b) on the image.
In the RPN stage, to produce the rotated boxes, we carried out a large number of tests and statistical analyses on the datasets used in the experiments and, considering the capacity of the network model, added parameters conforming to remote-sensing ships to increase the number of matched anchors. K-Means clustering of the datasets indicates dominant length-width ratios of about 3, 7 and 9 for the rotated anchors, and the aspect ratios of the rotated anchors are accordingly set to 1:3, 3:1, 1:5, 5:1, 1:7 and 7:1. Considering the multiple length-width ratios and the capacity of the network model, the scale of each feature layer is set to 1. The angle of a rotated anchor follows a uniform distribution within [0, π), so the angle parameters are set to π/6, π/3, π/2, 2π/3, 5π/6 and π, as shown in Figure 5. Each pixel of each feature map corresponds to 36 rotation anchors (1 × 6 × 6), and the outputs of the classification layer and the regression layer are, respectively, 72 (2 × 36) and 108 (3 × 36).
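For illustration, the numpy sketch below enumerates the 36 rotated anchor shapes per feature-map position (one scale, six aspect ratios, six angles); the base size and the constant-area convention are our assumptions rather than the paper's exact anchor generation.

```python
import numpy as np

def rotated_anchor_shapes(base_size=16,
                          ratios=(1/3, 3, 1/5, 5, 1/7, 7),
                          angles=(np.pi/6, np.pi/3, np.pi/2,
                                  2*np.pi/3, 5*np.pi/6, np.pi)):
    """Enumerate the 36 rotated anchors (1 scale x 6 ratios x 6 angles) for
    one position, each as (cx, cy, w, h, theta) centred at the origin."""
    anchors = []
    area = float(base_size * base_size)       # single scale per feature layer
    for r in ratios:
        h = np.sqrt(area * r)                 # h / w = r while keeping the area fixed
        w = area / h
        for theta in angles:
            anchors.append((0.0, 0.0, w, h, theta))
    return np.array(anchors)

print(rotated_anchor_shapes().shape)          # (36, 5)
```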
(1) Step 1: the confidence and coordinate parameters x, y, h, w, θ of the i-th ship candidate region in the image are recorded as the vector $x_i$;
(2) Step 2: the vectors $x_1, x_2, \cdots, x_i$ of the candidate regions are used as the input of a two-layer bidirectional RNN, where the vector sequence at position p of the second layer is $h_p^{(2)} = f\left(U^{(2)} h_{p-1}^{(2)} + W^{(2)} h_p^{(1)} + b^{(2)}\right)$, as shown in equation (2), with $U^{(2)}$ and $W^{(2)}$ the weight matrices, $b^{(2)}$ the bias, $h_{p-1}^{(2)}$ the vector sequence at the previous position p-1 of the second layer, and $h_p^{(1)}$ the vector sequence at position p of the first layer;
Figure 6. The marking condition of the ship bounding box in different scenes: (a) the wharf scene, (b) the port scene, (c) the river scene.
Figure 7. The angle re-scoring module: the confidence and coordinates [score; x, y, w, h, θ] of each proposal pass through the stacked bidirectional RNN and the self-attention regressor to produce an updated confidence.
(3) Step 3: the vector sequence $c_i = \sum_{j=1}^{L} \alpha_{ij} h_j$ about the angle is obtained from the vectors $x_1, x_2, \cdots, x_i$ through the self-attention mechanism, where L is the length of the input sequence, $h_j$ is the hidden vector of the j-th element, and $\alpha_{ij}$ is the alignment weight between the i-th and j-th elements;
(4) Step 4: the stacked bidirectional RNN output $h_p$ from Step 2 and the self-attention output $c_i$ from Step 3 are concatenated and fed into a regressor composed of a three-layer MLP. Finally, the activation function $f(x) = \mathrm{Sigmoid}\left(\mathrm{ReLU}\left(b^{(2)} + W^{(2)}\left(b^{(1)} + W^{(1)} x\right)\right)\right)$ generates a confidence level between 0 and 1 for the new ship candidate region, where $W^{(1)}$, $W^{(2)}$, $b^{(1)}$ and $b^{(2)}$ are the connection weights and biases between the layers of the MLP;
(5) Step 5: output new confidence of all the ship candidate areas.
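A compact Keras sketch of Steps 1-5 is given below; the choice of GRU cells, the hidden size, and the use of `layers.Attention` for the scaled dot-product step are our assumptions rather than the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def angle_rescoring_model(feat_dim=6, hidden=64):
    """Re-score proposals: each sequence element is a proposal encoded as
    [confidence, x, y, w, h, theta] (Step 1); a two-layer bidirectional RNN
    (Step 2) and self-attention (Step 3) feed an MLP regressor (Step 4) that
    outputs a new confidence per proposal (Step 5)."""
    x = layers.Input(shape=(None, feat_dim))                  # proposal sequence
    h = layers.Bidirectional(layers.GRU(hidden, return_sequences=True))(x)
    h = layers.Bidirectional(layers.GRU(hidden, return_sequences=True))(h)
    c = layers.Attention(use_scale=True)([h, h])              # self-attention context
    z = layers.Concatenate()([h, c])
    z = layers.Dense(hidden, activation='relu')(z)
    score = layers.Dense(1, activation='sigmoid')(z)          # new confidence in (0, 1)
    return tf.keras.Model(x, score)
```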
The stacked bidirectional RNN network makes better use of the prior condition that the physical structure of ship arrangements leads to locally consistent angles, and uses the angle information of all proposal regions in the image during prediction.
The training process of the RNN network is as follows:
Two hidden layers $\vec{h}_p$ and $\overleftarrow{h}_p$ of size $n_h$ represent the forward and backward sequences of the bidirectional RNN. Define $h_p^{(1)}$ as the hidden state of the first layer at position p; in general, the hidden state of the l-th layer at position p is determined by the hidden state of the same layer at the previous position p-1 and the hidden state of the (l-1)-th layer at p, and the hidden layer $h_p^{(l)}$ can be expressed as

$h_p^{(l)} = f\left(U^{(l)} h_{p-1}^{(l)} + W^{(l)} h_p^{(l-1)} + b^{(l)}\right)$  (2)

where $U^{(l)}$ and $W^{(l)}$ are the weight matrices, $b^{(l)}$ is the bias vector, and $h_t^{(0)} = x_t$.
The stacked network layers can better exploit the dependencies among candidate regions and gradually abstract, according to the context, the higher-level sequence characteristics relating confidence and position. The hidden states inside the bidirectional recurrent neural network layers contain structural representations at different levels (Singh, Marks, and Jones et al. 2016).
Attention has been applied successfully to natural language processing tasks and can quickly extract the important features of sparse data (Bahdanau, Cho, and Bengio 2014). The self-attention mechanism is an improvement of the attention mechanism: it reduces the dependency on external information, is better at capturing the internal correlations of data or features (Vaswani, Shazeer, and Parmar et al. 2017), and can better highlight the distinguishing position and angle features among the ship proposal regions while suppressing irrelevant features.
In this paper, self-attention is used to calculate the relationship between each pair of candidate regions according to the angle parameters in the position coordinates, and the relationships are normalized to obtain weights. The confidences and position vectors of all candidate regions in the sequence are then combined: the weighted average is used as a new vector, and traversing the sequence completes the update of every vector. In this process, the interaction between two candidate regions is independent of their distance and depends only on the vectors themselves, so there is no long-term dependency problem.
For the confidence, position coordinate and angle sequence features output by the RNN, we aim to increase the weight of ship proposal regions whose angle is close to the mean angle of all proposal regions in the image, and to reduce the weight of ship proposal regions whose angle differs greatly from that mean. The self-attention mechanism handles the long-distance dependency relationships between ship detections, which are difficult to capture with the RNN alone. For each ship proposal region element i, the self-attention mechanism expresses the whole sequence as the angle vector $c_i$, obtained by weighting and aligning all hidden vectors of the sequence:
$c_i = \sum_{j=1}^{L} \alpha_{ij} h_j$  (3)
where L is the length of the input sequence, $h_j$ is the hidden vector of element j, and $\alpha_{ij}$ is the alignment weight between element i and element j. The weight $\alpha_{ij}$ is calculated by Softmax as:
$\alpha_{ij} = \exp\left(\mathrm{score}\left(h_i, h_j\right)\right) \Big/ \sum_{k=1}^{L} \exp\left(\mathrm{score}\left(h_i, h_k\right)\right)$  (4)
where $\mathrm{score}(h_i, h_j)$ measures the degree of alignment between the vectors $h_i$ and $h_j$, and is specifically the scaled dot product given in equation (5):
$\mathrm{score}\left(h_i, h_j\right) = \left(h_i^{T} h_j\right) \big/ \sqrt{L}$  (5)
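Equations (3)-(5) can be computed directly in a few lines; the sketch below assumes the hidden vectors are stacked in a matrix of shape (L, d) and follows the paper's scaling by the square root of the sequence length L.

```python
import numpy as np

def self_attention_context(h):
    """Compute the context vectors c_i of equation (3) from hidden vectors h
    (shape (L, d)) using the scaled dot-product scores of equations (4)-(5)."""
    L = h.shape[0]
    scores = (h @ h.T) / np.sqrt(L)                       # equation (5)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)             # softmax, equation (4)
    return alpha @ h                                      # weighted sum, equation (3)
```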
We use the multi-layer perceptron to re-calculate the confidence value. The input of the regressor is the concatenation of the hidden vector h from the stacked bidirectional RNN and the context vector c from the self-attention. The activation function is composed of ReLU and Sigmoid and finally produces the new confidence.
where i is the index of an anchor in the classification and regression networks, $p_i$ is the predicted probability of anchor i being the ship class, and $p_i^{*}$ is the ground-truth label of the anchor. $v_i$ represents the vector of the five coordinate parameters x, y, w, h and θ of the predicted bounding box, and $v_i^{*}$ represents the coordinates of the ground-truth box corresponding to the predicted ship region. $L\left(p_i, p_i^{*}, v_i, v_i^{*}\right)$ is the sum of the classification and regression loss functions over all anchors in the remote-sensing ship image; the hyper-parameter λ weights the ship classification loss and controls the balance between the two losses, and the experiments use λ = 1. In addition, the function $L_{cls}$ is the cross-entropy loss, defined as:

$L_{cls}\left(p, p_i^{*}\right) = -\log p_{p_i^{*}}$  (9)
We select the smooth L1 loss function for its strong robustness; $\mathrm{smooth}_{L1}(x)$ is defined as:

$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$  (11)
where $\mathrm{smooth}_{L1}(x)$ is the smooth L1 loss function. To maintain invariance to scale and position, $\mathrm{smooth}_{L1}(x)$ is computed on the distance vector $\Delta = \left(\delta_x, \delta_y, \delta_w, \delta_h, \delta_\theta\right)$ defined below:

$\delta_x = \left(g_x - b_x\right)/b_w, \quad \delta_y = \left(g_y - b_y\right)/b_h, \quad \delta_w = \log\left(g_w/b_w\right), \quad \delta_h = \log\left(g_h/b_h\right), \quad \delta_\theta = \theta - \theta^{*}$  (12)

Besides, $\Delta = \left(\delta_x, \delta_y, \delta_w, \delta_h, \delta_\theta\right)$ has to be normalized by its mean value and variance.
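A small numpy sketch of equations (11) and (12) follows; the (cx, cy, w, h, θ) box convention and the example values are our assumptions, and the mean/variance normalization mentioned above is omitted.

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth L1 loss of equation (11)."""
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def box_deltas(gt, box):
    """Distance vector of equation (12) between a ground-truth box and an
    anchor/predicted box, both given as (cx, cy, w, h, theta)."""
    gx, gy, gw, gh, g_theta = gt
    bx, by, bw, bh, b_theta = box
    return np.array([(gx - bx) / bw,
                     (gy - by) / bh,
                     np.log(gw / bw),
                     np.log(gh / bh),
                     g_theta - b_theta])

# Regression loss for one box: sum of smooth L1 over the five deltas.
delta = box_deltas((50.0, 60.0, 80.0, 20.0, 0.6), (48.0, 58.0, 64.0, 16.0, 0.5))
loss = smooth_l1(delta).sum()
```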
Algorithm: training procedure
1: Input: the network parameters loaded for the training model, the images of the training set and the corresponding tags.
2: Output: the parameters of the RPN network and the Fast R-CNN network.
3: Train the RPN network separately.
4: Train the Fast R-CNN network separately, using the proposal regions output by the RPN network in the previous step as the input of the Fast R-CNN network. So far, the two networks have no shared parameters and are trained separately.
5: Train the RPN network again; this time, fix the parameters of the common part of the network and only update the parameters exclusive to the RPN network.
6: Fine-tune the Fast R-CNN network again with the RPN results, fixing the parameters of the common part of the network and only updating the parameters exclusive to the Fast R-CNN network.
In the testing stage, firstly, the bounding box regression network outputs the coordinate offsets for each anchor at each pixel of the feature map; the position of each anchor is then adjusted through the bounding box regression strategy to obtain the final predicted bounding boxes. The two outputs of the classification networks are the confidence scores s1 and s2 corresponding to each bounding box, and the confidence score indicates the probability of a ship appearing in the bounding box. If the score s1 of a bounding box is less than 0.2, the corresponding bounding box is deleted; the confidence of each remaining bounding box is set to the product of s1 and s2, and the bounding boxes with confidence greater than 0.35 are selected. Finally, non-maximum suppression (NMS) is used to obtain the final detection result.
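The test-time filtering described above can be sketched as follows; `rotated_nms` stands for a rotated non-maximum suppression routine returning the indices of the kept boxes and is an assumption, not a function defined in the paper.

```python
import numpy as np

def postprocess(boxes, s1, s2, rotated_nms, s1_thresh=0.2, conf_thresh=0.35):
    """Apply the test-time rules: drop boxes with s1 < 0.2, score the rest as
    s1 * s2, keep scores above 0.35, then run (rotated) NMS."""
    keep = s1 >= s1_thresh
    boxes, scores = boxes[keep], s1[keep] * s2[keep]
    keep = scores > conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = rotated_nms(boxes, scores)        # indices surviving suppression
    return boxes[kept], scores[kept]
```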
4.1. Dataset
We carried out experiments on three datasets in total. The SHIP dataset is our private dataset, while HRSC2016 and DOTA are public datasets.
Table 2. The number of samples and instances of the training set and test set of each scenario in the SHIP dataset.

              Cargo ship or not   Docked or not   Ship spacing     Local water area      Ship scale
Scene         Yes     No          Yes     No      Crowd   Sparse   River   Lake   Sea    Big    Medium  Small
Train sam.    521     536         507     516     511     538      245     239    246    243    244     243
Train ins.    2108    2123        2096    2135    2092    2139     1427    1396   1408   1410   1411    1410
Test sam.     174     192         178     187     171     196      102     91     89     165    175     159
Test ins.     887     914         880     921     877     924      619     589    593    598    600     603
The images were analysed and counted for the different classifications and scenes according to their tags. The images we collected from Google Earth are about 3000 × 3000 pixels in size. To demonstrate the effectiveness of the model for detecting small ships and the multi-scale characteristics of ships, we did not ignore small ships; the ships in the dataset range from 10 × 10 pixels to 400 × 400 pixels. To expand the remote-sensing image samples, and considering the appearance changes of objects in remote-sensing images in actual scenes (e.g. changes of view and direction) as well as the need for multiple test comparisons, we amplified the dataset fourfold by horizontal flipping and rotations of 5°, 180° and 355°; rotating the image by a small angle reduces the redundant area introduced by rotation. The amplified data were divided into training and testing sets at a ratio of 7:3, and the images were then cropped with an overlapping ratio of 20% to a size of 800 × 800 pixels. The cropped dataset contains 1012 images and 6032 instances in total, of which 730 images with 4231 instances form the training set and 282 images with 1801 instances form the testing set. The details are shown in Table 2.
Recall is calculated as:

$\mathrm{recall} = \dfrac{TP}{TP + FN}$  (14)

Precision is calculated as:

$\mathrm{precision} = \dfrac{TP}{TP + FP}$  (15)
where TP (True Positives) is the number of positive samples correctly identified as positive, that is, the number of ships correctly detected; FN (False Negatives) is the number of positive samples falsely identified as negative, representing the quantity of undetected or missed ships; and FP (False Positives) represents the quantity of falsely detected ships.
The average precision under each IoU threshold (0.5, 0.55, . . ., 0.95) is calculated separately for each class. Firstly, the results are ranked according to confidence, and the area under the Precision-Recall curve is estimated through the average interpolated precision at 11 equally spaced recall levels; the average precision is calculated as:

$AP_c^{t} = \dfrac{1}{11} \sum_{r \in \{0, 0.1, \cdots, 1\}} \max_{\tilde{r} \ge r} p\left(\tilde{r}\right)$  (16)

where r is the recall, c is the given class (for single-class ship detection, c is 1), and t is the IoU threshold. The curve is made monotonically decreasing by reassigning the precision at each recall level to the maximum precision achieved at any higher recall.
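A short numpy sketch of the 11-point interpolated AP of equation (16) follows; it assumes `recalls` and `precisions` are matched arrays obtained from the confidence-ranked detections.

```python
import numpy as np

def eleven_point_ap(recalls, precisions):
    """11-point interpolated AP: for each recall level r in {0, 0.1, ..., 1},
    take the maximum precision achieved at any recall >= r, then average."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0
```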
For ship detection, the bigger the mAP value, the better the detection performance.
The contrast model R2CNN adds pooling sizes matching the length-width ratio of the ship at the ROI Pooling layer (Jiang, Zhu, and Wang et al. 2017). RRPN proposes R-anchors containing angle parameters in the RPN stage and resets the scale and aspect ratio of the anchors according to the ship characteristics (Ma, Shao, and Ye et al. 2018). The R-DFPN network proposes the DFPN feature extraction network and the RDN rotation region module (Xue, Hao, and Kun et al. 2018); the SCRDet network implements more stable detection of small, cluttered and rotated objects (Yang, Yang, and Yan et al. 2019).
Our model is implemented with the TensorFlow framework on an NVIDIA 2080 Ti GPU. The backbone is a ResNet-101 network pre-trained on ImageNet. The entire model is trained end to end for 100k steps with an initial learning rate of 0.001: the learning rate remains unchanged for the first 80,000 steps and is decreased to 0.0003 for the last 20,000 steps; the weight decay is 0.0001 and the momentum is 0.9.
The experimental comparison results on the three datasets are shown in Table 3. The Recall, Precision, mAP and F1 of our proposed model on the SHIP, HRSC2016 and DOTA-Ship datasets are better than those of the other models. On SHIP, compared with R2CNN, RRPN, R-DFPN and SCRDet, mAP increased by 16.3%, 19.8%, 12.7% and 2.2%, and F1 increased by 16.5%, 17.3%, 5.2% and 4.2%; on HRSC2016, mAP increased by 5.6%, 1.6%, 9.1% and 0.7%, and F1 increased by 5.7%, 15.1%, 4.6% and 5.5%; on DOTA-Ship, mAP increased by 42.9%, 28.8%, 27.8% and 3.0%, and F1 increased by 33.1%, 29.5%, 27.3% and 1.9%.
The experimental results show that the proposed model performs best across the multiple scenarios of the different datasets, especially for ships with multi-scale, small, dense and complex-background characteristics, and greatly improves Recall, Precision, mAP and F1.
Below we will compare and analyse specific experimental results for the five scenarios
in the SHIP data set to illustrate the effectiveness of our proposed model in detecting
ships with different characteristics in different scenarios.
Figure 9. The detection effect in the cargo ship scene of the SHIP.
There are situations where other similar objects are mistaken for ships in Figure 9(a). Figure 9(c) and (d) also show missed detections: some ships are not detected, and the bounding boxes of the detected ships deviate considerably from their true values. In Figure 9(e), these problems are avoided: there are no missed or false detections, and the offset of the bounding boxes from the true values is small.
shows that, in the docking scene with dense ships and complex backgrounds, the proposed model effectively improves the various indicators of ship detection compared with the others.
The comparison for the SHIP docking scene is shown in Figure 10. For docked and dense ships, a single ship is erroneously covered by multiple detection boxes in Figure 10(a), and the detection boxes show different degrees of redundancy relative to the true values in Figure 10(b)-(d). The detection boxes in Figure 10(e) are the most accurate, with small offsets from the true values.
Figure 10. The detection effect in the docking scene of the SHIP.
Table 6. Detection results of the water scene where the ship is located.
Model      Water area   Recall   Precision   mAP     F1
R2CNN River 0.636 0.706 0.687 0.669
Lake 0.647 0.785 0.684 0.709
Ocean 0.614 0.693 0.729 0.651
RRPN River 0.658 0.689 0.623 0.673
Lake 0.684 0.697 0.648 0.69
Ocean 0.639 0.623 0.687 0.631
R-DFPN River 0.743 0.812 0.697 0.776
Lake 0.756 0.842 0.716 0.800
Ocean 0.754 0.839 0.754 0.794
SCRDet River 0.765 0.834 0.804 0.798
Lake 0.814 0.839 0.826 0.826
Ocean 0.756 0.792 0.856 0.774
Proposed River 0.802 0.846 0.827 0.823
Lake 0.822 0.864 0.834 0.842
Ocean 0.823 0.873 0.879 0.847
Figure 11. The detection effect in the river scene of the SHIP.
The above-mentioned error conditions do not occur, and the difference between the predicted box and the true value is the smallest, in Figure 12(e). For these ships, the proposed model effectively improves the various indicators of ship detection compared with the others.
The detection results for the multi-scale scene of the SHIP dataset are shown in Figure 13. These images show ships of multiple sizes in a river, with farmland, houses and hard-to-distinguish objects on land. In Figure 13(a), (b) and (d), the small ship in the upper left corner is missed. There are also redundant, falsely detected bounding boxes in Figure 13(a). In Figure 13(c) there are missed detections and a large deviation between the predicted bounding boxes and the true values, and the middle bounding box in Figure 13(d) also has a certain degree of redundancy. In Figure 13(e), no such erroneous detections occur, and the predicted boxes are the most accurate.
Table 9. Recall, Precision and F1 indexes of the ablation experiments of our model.

                                       SHIP               HRSC2016           DOTA-Ship
M   Au.  E.  A.  NMS  ROI Pool    R     P     F1     R     P     F1     R     P     F1
1   √ √ √                         71.4  69.5  70.4   86.5  79.6  82.9   52.3  55.9  54.0
2   √ √ √                         74.6  81.9  78.1   84.6  81.9  83.2   50.7  61.3  55.5
3   √ √ √                         71.6  73.5  72.5   86.4  85.7  86.0   49.6  65.5  55.7
4   √ √ √                         72.3  79.2  75.6   85.6  86.8  86.2   57.4  67.8  62.2
5   √ √ √ √                       78.8  82.6  80.7   89.4  89.7  89.5   71.7  92.2  80.7
6   √ √ √ √ √                     81.9  86.8  84.3   92.1  92.9  92.5   –     –     –
Adding the new pooling size to the ROI pooling layer structure has the greatest effect on the overall model: the F1 increased by 6.6% on HRSC2016, 26.7% on DOTA-Ship, and 10.3% on SHIP.
Because the ships in the DOTA-Ship images are numerous and dense, data amplification would reduce the accuracy, so only the HRSC2016 and SHIP datasets were augmented. The experimental results show that dataset expansion has a certain positive effect on the results.
5. Conclusions
In this paper, we have proposed a network model for remote-sensing ship detection in multiple scenes, which is divided into the feature extraction network module, the rotation proposal region module and the angle re-scoring module. For the multiple scales and complicated backgrounds of remote-sensing ships, we combine cross-level, cross-layer features by improving the feature pyramid structure in the feature extraction network module and obtain better fused ship features. To resolve the problem that the crowded arrangement of ships and the redundancy of bounding boxes increase the detection difficulty, we add rotated bounding boxes with angle parameters in the RPN stage of the rotation proposal region module and new pooling sizes in the ROI Pooling layer. In the angle re-scoring module, we propose a stacked bidirectional recurrent neural network fused with a self-attention network and improve the proposal confidence through the angle parameters of the proposal region coordinates to obtain more precisely positioned proposal regions. As shown by the analysis of several experimental results, the above modules all benefit the final results, and in the diverse scenes of remote-sensing ships our model achieves the best performance among the compared methods.
Our model has made certain achievements in the detection of remote-sensing ships. In future work, we will focus on the following two aspects: (1) Take further advantage of the prior conditions of remote-sensing ships. The parts of remote-sensing ships (e.g. bow and stern) have specific characteristics, which can be utilized to improve the prediction of the bounding box angles. (2) Increase the detection speed and the model robustness. The parameters (e.g. the angle) added to the basic ship detection model for ship features increase the detection time; we have to reduce the detection time without affecting the model accuracy.
Funding
This work was supported by the National Key Research and Development Project of China
[2016YFC1400302]; National Natural Science Foundation of China [61501155,61871164]; National
Defense Science and Technology Key Laboratory Fund [6142401200201].
ORCID
Shuaishuai Lv http://orcid.org/0000-0002-9822-9959
References
Bahdanau, D., K. Cho, and Y. Bengio. 2014. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint arXiv:1409.0473.
Bay, H., A. Ess, T. Tuytelaars, and L. Van Gool. 2008. "Speeded-Up Robust Features (SURF)." Computer Vision and Image Understanding 110 (3): 346–359. doi:10.1016/j.cviu.2007.09.014.
Chang, H. H., G. L. Wu, and M. H. Chiang. 2019. “Remote-sensing Image Registration Based on
Modified SIFT and Feature Slope Grouping[J].” IEEE Geoscience and Remote Sensing Letters 16:
1363–1367. doi:10.1109/LGRS.2019.2899123.
Chen, W., X. Li, H. He, and L. A. Wang. 2018. “Review of Fine-Scale Land Use and Land Cover
Classification in Open-Pit Mining Areas by Remote Sensing Techniques[J].” Remote Sens 10: 15.
doi:10.3390/rs10010015.
Dong, C., J. Liu, F. Xu, and C. Liu. 2019. “Ship Detection from Optical Remote Sensing Images Using
Multi-Scale Analysis and Fourier HOG Descriptor[J].” Remote Sens 11: 1529. doi:10.3390/
rs11131529.
Girshick, R., J. Donahue, T. Darrell, J. Malik. “Rich Feature Hierarchies for Accurate Object Detection
and Semantic Segmentation[C].” In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition,Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
He, K., X. Zhang, S. Ren, J. Sun. “Deep Residual Learning for Image Recognition[C]”. 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp.
770–778, doi: 10.1109/CVPR.2016.90.
Jiang, Y., X. Zhu, X. Wang, S. Yang, W. Li, H.Wang, P. Fu, et al. 2017. “R2CNN: Rotational Region CNN
for Orientation Robust Scenetext Detection.” CoRR abs/1706:09579.
Jiao, J., Y. Zhang, H. Sun, X. Yang, X. Gao, W. Hong, K. Fu. 2018. “A Densely Connected End-to-End
Neural Network for Multiscale and Multiscene SAR Ship Detection[J].” IEEE Access 6 :20881–20892.
doi:10.1109/ACCESS.2018.2825376.
Kang, M., X. Leng, Z. Lin, K. F. Ji. “A Modified Faster R-CNN Based on CFAR Algorithm for SAR Ship
detection[C]”. 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP),
Shanghai, 2017, pp. 1–4, doi: 10.1109/RSIP.2017.7958815.
Kim, T., S. Oh, T. B. Chun, and M. Lee “Impact of Atmospheric Correction on the Ship Detection Using
Airborne Hyperspectral Image[C]”. IGARSS 2019-2019 IEEE International Geoscience and Remote
Sensing Symposium, Yokohama, Japan, 2019, pp. 2190–2192, doi: 10.1109/IGARSS.2019.8898766.
Li, J., C. Qu, and J. Shao. “Ship Detection in SAR Images Based on an Improved Faster R-CNN[C]” 2017
SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), Beijing, 2017, pp. 1–6, doi:
10.1109/BIGSARDATA.2017.8124934.
Li, K., G. Cheng, S. Bu, and X. You. 2017. "Rotation-Insensitive and Context-Augmented Object Detection in Remote Sensing Images." IEEE Transactions on Geoscience and Remote Sensing 56 (4): 2337–2348.
Li, X., Z. Tang, W. Chen, L. Wang. 2019. “Multimodal and Multi-Model Deep Fusion for Fine
Classification of Regional Complex Landscape Areas Using ZiYuan-3 Imagery[J].” Remote
Sensing 11 (22): 2716. DOI:10.3390/rs11222716.
Lin, T. Y., P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017. doi:10.1109/CVPR.2017.106.
Ma, J., J. Jiang, H. Zhou, J. Zhao. 2018. “Guided Locality Preserving Feature Matching for Remote
Sensing Image registration[J].” IEEE Transactions on Geoscience and Remote Sensing 56
:4435–4447. doi:10.1109/TGRS.2018.2820040.
Ma, J., W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, et al. 2018. “Arbitrary-Oriented Scene Text
Detection via Rotation Proposals[J].” IEEE Transactions on Multimedia 20 (11): 3111–3122.
doi:10.1109/TMM.2018.2818020.
Ma, J., Z. Zhou, B. Wang, H. Zong, F. Wu. 2019. “Ship Detection in Optical Satellite Images via
Directional Bounding Boxes Based on Ship Center and Orientation Prediction[J].” Remote Sensing
11 (18): 2173. doi:10.3390/rs11182173.
Ma, W., Q. Guo, Y. Wu, W. Zhao, X. Zhang, L. Jiao. 2019. “Novel Multi-Model Decision Fusion Network
for Object Detection in Remote Sensing Images.” Remote Sens 11 :737. doi:10.3390/rs11070737.
Shi, Z., X. Yu, Z. Jiang, and B. Li. 2013. “Ship Detection in High-resolution Optical Imagery Based on
Anomaly Detector and Local Shape Feature[J].” IEEE Transactions on Geoscience and Remote
Sensing 52: 4511–4523.
Simonyan, K., and A. Zisserman. “Very Deep Convolutional Networks for Large-scale Image
Recognition”. In ICLR, 2015.
Singh, B., T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A Multi-Stream Bi-Directional Recurrent Neural Network for Fine-Grained Action Detection." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
Sun, X., Y. Liu, Z. Yan, P. Wang, W. Diao, K. Fu, et al. 2020. “SRAF-Net: Shape Robust Anchor-Free
Network for Garbage Dumps in Remote Sensing Imagery.” IEEE Transactions on Geoscience and
Remote Sensing. doi:10.1109/TGRS.2020.3023928
Tang, J., C. Deng, G. Huang, B. Zhao. 2014. “Compressed-Domain Ship Detection on Spaceborne
Optical Image Using Deep Neural Network and Extreme Learning Machine[J].” IEEE Transactions
on Geoscience and Remote Sensing 53 (3): 1174–1185. DOI:10.1109/TGRS.2014.2335751.
Tian, T., Z. Pan, X. Tan, Z. Chu. 2020. “Arbitrary-Oriented Inshore Ship Detection Based on Multi-Scale
Feature Fusion and Contextual Pooling on Rotation Region Proposals[J].” Remote Sensing 12 (2).
339. doi:10.3390/rs12020339.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, et al. 2017. “Attention
Is All You need[C].” Advances in Neural Information Processing Systems, 5998–6008.
Wu, F., Z. Zhou, B. Wang, J. Ma. 2018. “Inshore Ship Detection Based on Convolutional Neural
Network in Optical Satellite Images[J].” IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 11 (11): 1–11. doi: 10.1109/JSTARS.2018.2873190.
Wu, Y., W. Ma, M. Gong, Z. Bai, W. Zhao, Q. Guo, X. Chen, et al. 2020. “A Coarse-to-Fine Network for
Ship Detection in Optical Remote Sensing Images.” Remote Sens 12 :246. doi:10.3390/rs12020246.
Xia, G., X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, et al. “DOTA: A Large-scale Dataset for
Object Detection in Aerial Images[C]”.2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA. IEEE, 2018.
Xian, S., P. Wang, C. Wang, Y. Liu, and K. Fu. 2020. "PBNet: Part-Based Convolutional Neural Network for Complex Composite Object Detection in Remote Sensing Imagery." ISPRS Journal of Photogrammetry and Remote Sensing 173: 50–65. doi:10.1016/j.isprsjprs.2020.12.015.
Xiao, X., Z. Zhou, B. Wang, L. Li, L. Miao. 2019. “Ship Detection under Complex Backgrounds Based on
Accurate Rotated Anchor Boxes from Paired Semantic Segmentation[J].” Remote Sensing 11 (21):
2506. DOI:10.3390/rs11212506.
Xue, Y., S. Hao, F. Kun, J. Yang, X. Sun, M. Yan, Z. Guo. 2018. “Automatic Ship Detection in Remote
Sensing Images from Google Earth of Complex Scenes Based on Multiscale Rotation Dense
Feature Pyramid Networks[J].” Remote Sensing 10 (1): 132. DOI:10.3390/rs10010132.
Yang, X., J. Yang, J. Yan, Y. Zhang. “SCRDet: Towards More Robust Detection for Small, Cluttered and
Rotated Objects[C]”. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul,
Korea (South), 2019, 8231–8240, doi: 10.1109/ICCV.2019.00832.
You, Y., J. Cao, Y. Zhang, F. Liu, and W. Zhou. 2019. "Nearshore Ship Detection on High-Resolution Remote Sensing Image via Scene-Mask R-CNN." IEEE Access. doi:10.1109/ACCESS.2019.2940102.
Zhang, R., J. Yao, K. Zhang, C. Feng, J. Zhang. 2016. “S-CNN-Based Ship Detection from
High-resolution Remote-sensing Image[J].” International Archives of the Photogrammetry,
Remote Sensing and Spatial Information Sciences 41:917–921.
Zhu, C., H. Zhou, R. Wang, and J. Guo. 2010. "A Novel Hierarchical Method of Ship Detection from Spaceborne Optical Image Based on Shape and Texture Features." IEEE Transactions on Geoscience and Remote Sensing 48 (9): 3446–3456. doi:10.1109/TGRS.2010.2046330.
Zou, Z., and Z. Shi. 2016. “Ship Detection in Spaceborne Optical Image with SVD Networks[J].” IEEE
Transactions on Geoscience and Remote Sensing 54: 5832–5845. doi:10.1109/TGRS.2016.2572736.