
Received: 30 May 2021    Revised: 2 January 2022    Accepted: 9 January 2022

IET Image Processing
DOI: 10.1049/ipr2.12432

ORIGINAL RESEARCH PAPER

Fast ship detection based on lightweight YOLOv5 network

Jia-Chun Zheng    Shi-Dan Sun    Shi-Jia Zhao

School of Ocean Information Engineering, Jimei University, Xiamen, People's Republic of China

Correspondence: Jia-Chun Zheng, School of Ocean Information Engineering, Jimei University, Xiamen 361021, P. R. China. Email: [email protected]

Funding information: Xiamen Municipal Ocean and Fishery Development Special Fund, Grant/Award Number: 21CZB013HJ15; Key Project of Fujian Science and Technology Plan, Grant/Award Number: 2017h0028; Fund Project of Jimei University, Grant/Award Number: zp2020042; Xiamen Key Laboratory of Marine Intelligent Terminal R&D and Application, Grant/Award Number: B18208

Abstract

Aiming at problems such as detection accuracy, computational blocking and display delay in ship detection from surveillance video, an improved YOLOv5 algorithm is proposed in this paper. First, to improve detection performance, the anchor box algorithm of the YOLOv5 network is optimized according to the characteristics of ship targets. The t-SNE algorithm is used to reduce the dimensionality of, and visualize, the data set label information, and a weighted analysis is performed on the resulting low-dimensional features. A mapped kernel k-means clustering algorithm then adaptively selects more appropriate anchor boxes, balancing the detection performance on large and small ship targets. Second, to alleviate computational blocking and delay, the BN scaling factor γ is used to compress the YOLOv5 network, so that the model can be reduced without degrading detection performance. The optimized YOLOv5 framework is trained on the self-integrated data set. The accuracy of the algorithm is increased by 2.34%, and the ship detection speed reaches 98 fps in the server environment and 20 fps on the low computing power device (Jetson nano), respectively.

1 INTRODUCTION

In recent years, marine ship monitoring has received more and more attention. Fast detection of marine targets in dynamic video is one of the key technologies for realizing intelligent monitoring of sea areas. The commonly used detection methods for sea surface targets can be divided into three categories. The first type is detection based on edge and texture features. Zhang et al. [1] used the DCT domain energy features of image sub-blocks to achieve fast extraction of the sea level, established a sea surface hybrid texture model, and achieved fast segmentation of the sea surface background and ship targets.

The second type imitates the visual attention selection mechanism of human eyes: a saliency map of the target of interest is found, and ship targets are detected according to the established visual attention model. Shi et al. [2] first extracted the low-frequency and high-frequency features of the image in the wavelet domain, then used a modified Gabor filter to extract directional features and extracted the colour and moment features in HIS space. Finally, they fused the above features to obtain the saliency map and detected ship targets. Shao et al. [3] proposed a convolutional neural network (CNN) based ship target detection algorithm, which uses a CNN to predict the type and location of the target and corrects the target localization with the aid of the saliency map; their experimental results show that the method has high detection accuracy and speed.

The third type is the deep learning-based target detection algorithm for surface ships. Zhang et al. [4] proposed an integrated target segmentation method based on an interferer discriminator and a ship target extractor, first using a SqueezeNet network as the interferer discriminator to determine what type of interference is contained in the input image, and then using an improved DeepLabv3+ network to segment the ship target. Experimental results show that the method has high segmentation accuracy and good fog resistance. Wang et al. [5] achieved fast end-to-end ship target detection with an improved YOLOv3 (You Only Look Once), reaching 74.8% detection accuracy and 29.8 frames per second on a GPU 1080Ti.

However, current algorithms still have the following two problems: first, the accuracy of small target detection is low; second, most existing algorithms still cannot meet the needs of practical applications. To overcome these problems, this paper compares different network frameworks [6–12] and finally selects the YOLOv5 network as the basic framework.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided
the original work is properly cited and is not used for commercial purposes.
© 2022 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology




FIGURE 1 Structure of YOLOv5

By improving the adaptive anchor box algorithm to select a more appropriate set of anchor boxes, and by optimizing the network through pruning, the performance of the algorithm is improved.

The contribution of this work can be summarized as follows: (1) To improve detection accuracy, a t-SNE weighted clustering algorithm is applied in the data processing stage: the data are mapped to a high-dimensional space, where accurate classification yields more accurate prediction boxes. (2) To reduce the computational complexity of the algorithm, the BN scaling factor is used to prune the network channels, yielding a lightweight algorithm. (3) The improved lightweight model can be deployed on the edge embedded equipment of an offshore target detection platform to realize real-time monitoring at sea. The test results show that the detection accuracy of the improved algorithm is increased by 2.34%, and the detection speed reaches 20 fps on edge embedded devices.

2 INTRODUCTION TO THE BASIC PRINCIPLE OF YOLO NETWORK

YOLO [13–16] is a single-stage target detection algorithm. With the development of YOLO, its detection accuracy and speed have gradually improved. YOLOv2 proposed a joint training algorithm, which improves the accuracy and speed of prediction. YOLOv3 added the multi-scale architecture of the FPN network [17]; it deepens the backbone and improves detection accuracy on multi-scale targets. YOLOv4 uses a large number of training tricks to improve the overall detection accuracy. YOLOv5 slices the input picture and adds CSPNet (Cross Stage Partial Networks) to the backbone network, significantly slimming the skeleton of the network. With its lightweight model size, its object recognition speed can be as high as 140 fps when running on a server.

The network structure of YOLOv5 is shown in Figure 1. YOLOv5 is divided into four main parts: network input, feature extraction backbone, feature fusion (neck) network and prediction module. The input uses three methods to enhance features: mosaic data augmentation, adaptive anchor box calculation and adaptive image scaling. The purpose of mosaic data augmentation is to help the model detect small objects in the image. The input data are sliced by the focus module before entering the backbone; the focus structure expands the original three-channel image to 12 channels. The CSP structure reduces the parameters and size of the model from the perspective of network structure design. The neck enhances feature fusion with an FPN + PAN structure and uses a larger feature map to compensate for the loss of feature information at the top of the feature pyramid. Prediction uses the GIoU loss function to estimate the regression loss of the detected target rectangle. Four networks of different depths and widths are provided in YOLOv5: YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. YOLOv5s is the smallest network framework of YOLOv5; running on a Titan RTX 2080 GPU it can process 78 frames per second, but deployed on the Jetson nano embedded device its speed is only 4 fps, which cannot achieve real-time detection. To overcome this problem, this paper proposes an optimized kernel clustering and quantization-compressed YOLOv5 network structure. The network structure is reduced by compression pruning to cut the model size and runtime memory usage, so that ships can be detected in real time.
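For illustration, the focus slicing step described above can be written in a few lines of PyTorch (a minimal sketch of the idea, not the authors' implementation; the function name is ours):

import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Focus-style slicing: (B, 3, H, W) -> (B, 12, H/2, W/2).

    Four interleaved sub-images are stacked on the channel axis,
    so no pixel information is lost despite the downsampling.
    """
    return torch.cat(
        [x[..., ::2, ::2],     # top-left pixels
         x[..., 1::2, ::2],    # bottom-left pixels
         x[..., ::2, 1::2],    # top-right pixels
         x[..., 1::2, 1::2]],  # bottom-right pixels
        dim=1)

# Example: a 640 x 640 RGB batch becomes (1, 12, 320, 320)
print(focus_slice(torch.rand(1, 3, 640, 640)).shape)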

FIGURE 2 Optimized framework

3 OPTIMIZATION OF YOLOV5 ALGORITHM

3.1 Optimization of network structure

As shown in Figure 2, YOLOv5 is optimized and improved on top of data augmentation to raise detection accuracy and speed. t-SNE is used to extract low-dimensional feature information so that the relevant data can be handled more easily. The obtained low-dimensional data are fed into weighted kernel function clustering to get a more accurate prediction target box and hence a better prediction effect. The resulting data set is fed into network training, and the trained model is pruned using the BN scaling factor γ. Memory consumption at runtime and the number of computational operations are reduced without affecting accuracy, which facilitates deploying the model on mobile devices.

3.2 Optimization of clustering algorithm

YOLOv5 optimizes the preprocessing of the data set. Auto-learning bounding box anchors aim to obtain preset anchor boxes suited to predicting the object bounding boxes in a custom data set. Adaptive anchor box calculation updates the target box from the predicted box area of each iteration. The accuracy of target detection is closely related to the setting of the prediction box: the more accurate the prediction box, the higher the detection accuracy. The t-SNE (t-distributed stochastic neighbour embedding) algorithm [18] is used to reduce the dimension of the anchor box prediction, and is then combined with the weighted kernel clustering algorithm to predict the box sizes. A more accurate prediction target box is obtained, achieving a better prediction effect.

t-SNE reduces high-dimensional data to a two- or three-dimensional space. It obtains the joint probability of high-dimensional data and low-dimensional mapped points through the symmetrization of conditional probabilities, and minimizes the KL divergence to reduce the difference between the conditional probability distributions. t-SNE is developed from SNE (stochastic neighbour embedding). SNE maps data points to a probability distribution by an affine transformation and uses Euclidean distance to express the similarity between points. Given N high-dimensional data points, first calculate the probability $p_{j|i}$, proportional to the similarity between data points $x_i$ and $x_j$. Equation (1) expresses this conditional probability through high-dimensional Euclidean distance; the parameter $\sigma_i$ is the standard deviation of the Gaussian centred on data point $x_i$, here set to $1/\sqrt{2}$. For the low-dimensional points, the similarity between a pair $y_i$, $y_j$ can be expressed as Equation (2):

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \tag{1}$$

$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)} \tag{2}$$

The self-similarity probabilities of $x_i$ and $y_i$ are set to 0. The distance between the two probability distributions is the KL divergence (Kullback–Leibler divergence), giving the cost function in Equation (3), where $P_i$ represents the conditional probability distribution of all other data points given point $x_i$. When the dimensionality reduction effect is good, $p_{j|i} = q_{j|i}$.

$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \tag{3}$$

By analogy with the gradient of the softmax objective, the gradient contribution of the conditional probability of $i$ given $j$ in the SNE objective is derived as $2(p_{i|j} - q_{i|j})(y_i - y_j)$, and that of $j$ given $i$ is $2(p_{j|i} - q_{j|i})(y_i - y_j)$. Finally, the complete gradient equation is as follows:

$$\frac{\partial C}{\partial y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j) \tag{4}$$
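For illustration, Equations (1)–(3) can be evaluated directly; the following NumPy sketch (our own, with every $\sigma_i$ fixed to $1/\sqrt{2}$ as in the text, so that $2\sigma^2 = 1$ also reproduces Equation (2)) computes the conditional similarities and the KL cost:

import numpy as np

def conditional_probs(X, sigma=1.0 / np.sqrt(2)):
    """Equation (1): p_{j|i} from pairwise squared Euclidean distances."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)               # self-similarity p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    """Equation (3): sum_i KL(P_i || Q_i)."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

X = np.random.rand(50, 4)   # stand-in high-dimensional label features
Y = np.random.rand(50, 2)   # their low-dimensional embedding
print(kl_cost(conditional_probs(X), conditional_probs(Y)))  # Eq. (2) case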
t-SNE alleviates the crowding problem that exists in SNE. A Gaussian distribution is used to convert distances into probabilities, and the pairwise similarities are treated as joint probability distributions: one joint distribution over pairs in the high-dimensional space and one over pairs in the low-dimensional space.

In the low-dimensional space, the t-distribution is used instead of the Gaussian to express the similarity of two points; it reduces the impact of outliers and better captures the overall characteristics of the data. The objective function can be rewritten as Equation (5):

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{5}$$

Equation (6) shows the change of $q$ after using the t-distribution. The t-distribution can be seen as a superposition of infinitely many Gaussian distributions, which reduces the amount of calculation. The optimized gradient is shown in Equation (7):

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}} \tag{6}$$

$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1} \tag{7}$$

For points with greater similarity, the t-distribution distance in the low-dimensional space needs to be slightly smaller; for points with low similarity, it needs to be farther. This meets the requirements on the target boxes of the clustering algorithm after t-SNE dimensionality reduction: cluster points of different classes are separated from each other for classification.
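Equation (6) and one gradient-descent step of Equation (7) can be sketched as follows (illustrative NumPy code; the symmetric joint distribution P is a random stand-in for the one obtained from Equation (1)):

import numpy as np

def t_sne_step(P, Y, lr=100.0):
    """One gradient step on the embedding Y using Eqs. (6) and (7)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + d2)                # (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                   # Equation (6)
    diff = Y[:, None, :] - Y[None, :, :]
    # Equation (7): grad_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j) * inv_ij
    grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)
    return Y - lr * grad, Q

rng = np.random.default_rng(0)
A = rng.random((50, 50))
np.fill_diagonal(A, 0.0)
P = (A + A.T) / (A + A.T).sum()           # any symmetric joint distribution
Y = rng.standard_normal((50, 2)) * 1e-2   # small random initial embedding
for _ in range(100):
    Y, Q = t_sne_step(P, Y)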
The clustering algorithm [19] processes the input data set to optimize the selection of the initial target boxes within the network. The data set is classified by target box size, and the sizes of 9 a priori boxes are obtained. The size of a prior box is related to scale: the smaller the prior box, the more detailed the target edge information that can be obtained at larger scales. This approach can cope with most data sets with a single data source and simple features; however, k-means clustering is biased in complex situations such as multiple data sources and differing effects of attributes on different classes of features. In this regard, this paper adopts a weighted kernel clustering approach to cluster the data set. Kernel methods [20] map two-dimensional inputs to a high-dimensional feature space to classify categories: the sample features in the plane are mapped to the high-dimensional feature space to accommodate the linear inseparability between different samples, and after classification is completed the mapping is returned to the two-dimensional plane to obtain the clustering results. Figure 3 shows the three-dimensional classification map after mapping. The weighted kernel k-means algorithm matches the importance of different features for different attributes, which balances the feature ratios and results in a more accurate prediction box.

This kind of mapping is a non-linear transformation, which is conducive to the detection of unknown target boxes. Equation (8) is the objective function of weighted kernel k-means, where $a_i$ represents a clustered point in the input space, $w(i)$ is the weight of the corresponding point, $\pi_k$ is the kth cluster, $c_k$ is the cluster centre of each subcategory, and $\phi(a_i)$ is the non-linear feature mapping:

$$J(v) = \sum_{k=1}^{K} \sum_{a_i \in \pi_k} w(i)\, \|\phi(a_i) - c_k\|^2 \tag{8}$$

Computing the non-linear mapping directly is relatively difficult. From a mathematical point of view, there is a function $K(x, x')$ in the low-dimensional space such that $K(x, x') = \langle \phi(x) \cdot \phi(x') \rangle$, which exactly equals the inner product in the high-dimensional space. The solution can therefore be obtained by evaluating inner products through $K(x, x')$ instead of explicitly projecting sample points into the high-dimensional space. In this way the distance between a sample point and a cluster centre is obtained, which greatly reduces the difficulty of calculation. Equation (9) is this distance after simplification, in which every inner product $\phi(a)\cdot\phi(b)$ is evaluated as a kernel value $K(a, b)$; Equation (10) is the cluster centre:

$$\|\phi(a_i) - c_k\|^2 = \phi(a_i)\cdot\phi(a_i) - \frac{2\sum_{a_j \in \pi_k} \phi(a_i)\cdot\phi(a_j)}{|\pi_k|} + \frac{\sum_{a_j, a_l \in \pi_k} \phi(a_j)\cdot\phi(a_l)}{|\pi_k|^2} \tag{9}$$

$$c_k = \frac{\sum_{a_i \in \pi_k} w(i)\,\phi(a_i)}{\sum_{a_i \in \pi_k} w(i)} \tag{10}$$

The t-SNE method is used to reduce the dimensionality of the image feature matrix, which is then whitened; finally, a matrix of N images with shape (N, 256) is obtained. Weighted kernel clustering is then used to assign each image to its corresponding cluster.
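The kernel trick of Equation (9) can be illustrated with a small kernel k-means sketch (our own simplification: an RBF kernel is assumed and the weights w(i) are omitted, so this shows the unweighted case of Equation (8)):

import numpy as np

def rbf_kernel(A, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - a_j||^2)."""
    d2 = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def kernel_kmeans(A, k, iters=20, seed=0):
    """Kernel k-means via Equation (9): feature-space distances are
    computed from the Gram matrix only, never from phi() itself."""
    K = rbf_kernel(A)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(A))
    for _ in range(iters):
        dist = np.zeros((len(A), k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                dist[:, c] = np.inf
                continue
            # Eq. (9): K(a_i,a_i) - 2*mean_j K(a_i,a_j) + mean_{j,l} K(a_j,a_l)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        labels = dist.argmin(axis=1)
    return labels

# Cluster (width, height) pairs of ship bounding boxes into 9 anchor groups
boxes = np.abs(np.random.randn(200, 2))   # stand-in for data set box sizes
print(np.bincount(kernel_kmeans(boxes, k=9), minlength=9))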

FIGURE 3 Visualization results display

TABLE 1 Comparison of anchor boxes before and after weighted kernel clustering

Anchor box                        52 × 52                       26 × 26                        13 × 13
Adaptive anchor box               (10, 13) (16, 30) (33, 23)    (30, 61) (62, 45) (59, 119)    (116, 90) (156, 198) (373, 326)
Anchor box after weighted
kernel clustering                 (8, 7) (10, 21) (15, 18)      (27, 23) (77, 50) (61, 114)    (115, 81) (224, 127) (354, 201)

The clustering results are shown in Figure 3; they act as pseudo labels on which the model is trained. After analysing the self-made ship data set with the improved clustering algorithm, the sizes of the 9 sets of prior boxes are obtained. The comparison is shown in Table 1. The experimental results show that the optimized k-means algorithm improves the detection effect.

3.3 Network layer model compression optimization

To compress the YOLOv5 model, the trained model must first be trained sparsely. The purpose of sparse training is to identify the less important channels during model training so that they can be cut. Sparsity is introduced into the dense connections of the deep neural network, and weights with small magnitudes are eliminated to reduce the network structure. After initializing the network, a channel sparsity penalty is added during training; after deleting channels, the network is fine-tuned. This channel sparsification method can reduce the size of the model, the memory consumption at runtime and the number of calculation operations without affecting accuracy. As shown in Figure 4, the BN scaling factor γ is used to prune the channels of the network.

The scale factor of the batch normalization layer [21] is used as the index of channel importance: the L1 norm of the scale factors is added to the loss function during training to obtain an importance score for each channel [22]. The update process of the batch normalization layer is as follows. Equations (11) and (12) are the mean and variance of the output data of the previous layer, where m is the training batch size:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} z_i \tag{11}$$

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu_B)^2 \tag{12}$$

$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \tag{13}$$

$$z_i' = \gamma \hat{z}_i + \beta \tag{14}$$

Equation (13) is the result of normalization; ε is a value close to 0 added to avoid a zero denominator. Equation (14) reconstructs the normalized data, where γ and β are learnable parameters used to restore the normalized data distribution. The scaling factor γ of batch normalization is used to evaluate channel importance: the smaller γ is, the less important the channel information, and the channel can be deleted. To constrain the size of γ, a regularization term on γ is added to the objective so that channels can be pruned automatically during training, something previous model compression lacked. The L1 norm is computed over the γ value of each channel:

$$L_{BN} = \lambda \|\gamma\|_1 = \lambda \sum |\gamma| \tag{15}$$

Finally, the training loss adds this regularization term to the original loss function:

$$L = L_{YOLO} + L_{BN} \tag{16}$$

λ is the penalty coefficient. After sparse training, a global threshold on γ is introduced to decide whether to cut a feature channel: channels whose scale factor γ is less than the global threshold are pruned.
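In a PyTorch setting, the penalty of Equations (15) and (16) amounts to summing |γ| over all batch-normalization layers; the sketch below is illustrative (λ = 0.001 follows the paper, while the toy model and loss are stand-ins):

import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 0.001) -> torch.Tensor:
    """Equation (15): L_BN = lambda * sum of |gamma| over all BN channels."""
    penalty = torch.tensor(0.0)
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # m.weight is gamma
    return lam * penalty

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
detection_loss = model(torch.rand(2, 3, 64, 64)).mean()  # stand-in for L_YOLO
loss = detection_loss + bn_l1_penalty(model)              # Equation (16)
loss.backward()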

FIGURE 4 Network layer model compression

TABLE 2 Comparison of different algorithms

Model             Model volume (MB)   AP (%)   FPS (frame/s)
Fast R-CNN        495                 86.02    6
SSD               91                  89.63    18.43
eYOLOv3 [23]      79                  90.61    30.51
s-CNN [3]         65                  91.70    40.27
YOLOv5x-pruned    47                  96.31    45.97

FIGURE 5 Total distribution of γ values per epoch

At the same time, a local safety threshold δ is introduced to prevent over-pruning of a convolution layer and to maintain the integrity of the network connections; δ is the retention ratio of the channel scale factors within a specific layer. After channel pruning, the model is fine-tuned to restore the detection accuracy of the pruned model. Figure 5 shows the total distribution of γ values at the 150th epoch of sparse training. Experiments showed that λ = 0.001 is suitable for sparse training. When BN sparse regularization training is carried out after normal training converges, the value of the loss function first increases, and then the γ values decrease as the number of epochs increases.

The final network model is obtained by choosing to prune off 45% of the parameters. The model size of YOLOv5x before compression is 92 MB and after compression 47 MB, with an accuracy loss of 0.5%, which can be recovered by fine-tuning the network. See the experimental results section for the speed comparison.
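The threshold logic can be sketched as follows (illustrative code: the 45% global pruning ratio is taken from the paper, while the per-layer retention rule is our simplified reading of the local safety threshold δ):

import torch
import torch.nn as nn

def channel_masks(model: nn.Module, prune_ratio: float = 0.45,
                  local_keep: float = 0.1):
    """Keep a channel if |gamma| exceeds the global threshold; always
    retain at least a local_keep fraction per layer (safety threshold)."""
    gammas = torch.cat([m.weight.abs().detach().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    global_thr = torch.quantile(gammas, prune_ratio)
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            g = m.weight.abs().detach()
            keep = g > global_thr
            min_keep = max(1, int(local_keep * g.numel()))
            if keep.sum() < min_keep:            # local safety threshold
                keep[g.topk(min_keep).indices] = True
            masks[name] = keep
    return masks

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32))
for name, keep in channel_masks(model).items():
    print(name, int(keep.sum()), "of", keep.numel(), "channels kept")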
4 ALGORITHMIC MODEL TRAINING

4.1 Experimental environment and plan

The experimental environment is an ubuntu18.04 operating system with a TITAN RTX 2080 GPU, cuda10.1 and CUDNN on a 128 GB memory platform; the pytorch1.2.0 framework is used for training. The pretraining model integrates 18,306 images from public data sets and data built by the research team as the training data for target detection. The training batch size is 64 pictures and the verification batch size is 2 pictures. The initial learning rate is 0.0032; a cosine annealing strategy reduces the learning rate gradually, but not below 0.000001. The IOU threshold is 0.5 and the NMS threshold is 0.5. Back propagation is used to fine-tune the network parameters. The data set is divided into a training set and a verification set for separate evaluation. The first 100 epochs optimize the network parameters of the last layer, and the following 200 epochs adjust the parameters of the whole network.
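This learning-rate schedule can be reproduced with PyTorch's built-in cosine annealing (a sketch using the stated values 0.0032, 0.000001 and 300 total epochs; the SGD optimizer is our assumption):

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

params = [torch.nn.Parameter(torch.zeros(1))]    # stand-in for model params
optimizer = torch.optim.SGD(params, lr=0.0032)   # initial learning rate
# Decay from 0.0032 towards the floor of 1e-6 over 300 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=300, eta_min=0.000001)

for epoch in range(300):
    # ... forward/backward training pass would go here ...
    optimizer.step()
    scheduler.step()
print(optimizer.param_groups[0]["lr"])           # approx. 1e-6 at the end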

TABLE 3 Performance comparison of four models before and after improvement

           Recall (before / after)   Accuracy (before / after)   FPS, frame/s (before / after)
YOLOv5s    0.6723 / 0.7038           0.8439 / 0.8645              58.05 / 88.50
YOLOv5m    0.7820 / 0.7904           0.8773 / 0.8904              45.54 / 73.31
YOLOv5l    0.8546 / 0.8826           0.9090 / 0.9240              33.21 / 65.25
YOLOv5x    0.8735 / 0.9026           0.9397 / 0.9631              28.48 / 45.97

TABLE 4 Inference time (ms) of YOLOv5x-pruned at different resolutions

Size        Precision type   CPU     TITAN RTX 2080 GPU   Jetson nano   Jetson Xavier nx
204 × 204   Float16          384.7   4.18                 498.5         25.3
480 × 480   Float16          375.6   4.26                 497.4         24.9
640 × 640   Float16          386.9   3.96                 490.1         23.3

4.2 Analysis of experimental results

To evaluate the detection performance, we compared several representative methods, Fast R-CNN, SSD, eYOLOv3 and s-CNN, with our improved YOLOv5x. Table 2 summarizes the detection results in terms of model volume, accuracy and speed. Model volume (MB) is the parametric size of the neural network, which reflects the complexity of the deep neural network to a certain extent. The AP index is related to the area under the precision-recall curve and measures detection accuracy. FPS evaluates the detection frame rate per second of the model. As can be seen from the table, the improved YOLOv5 reaches a frame rate of 45.97 in the server environment, and the size of the improved lightweight model is 47 MB, which is 32 MB smaller than the eYOLOv3 model and 18 MB smaller than the s-CNN model. In frame rate, the improved model is 15 fps faster than eYOLOv3 and 5 fps faster than s-CNN, better than the other algorithms. The accuracy of the improved model is 96.31%, which is 6.68% higher than the SSD algorithm and 4.61% higher than the more accurate s-CNN. This indicates that our method improves detection accuracy while also enhancing detection speed, which is beneficial for practical deployment.

Table 3 compares recall, accuracy and GPU performance before and after the improvement for the four versions of YOLOv5, each trained separately on the ship training set integrated in this paper. The improved v5 networks gain in both speed and accuracy: detection accuracy improves over the pre-improvement models, and runtime speed improves significantly. Taking v5x as an example, its GPU detection speed rose from 28.48 frames per second before the improvement to 45.97 frames per second after. Applied to actual video detection, processing a one-minute dynamic video took four minutes before the improvement but only two minutes after.

Table 4 compares the inference time for float16 images of different resolutions in different hardware environments. Deep neural networks with full floating-point bit width require substantial computing resources, so a 16-bit floating-point type is used here to reduce the computational complexity of the network. With an input size of 204 × 204, the inference time is 25.3 ms on Jetson Xavier nx and 4.18 ms on the TITAN RTX GPU, but 498.5 ms on Jetson nano; further optimization is needed to improve the speed of the algorithm on low-power devices.
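Half-precision inference of the kind timed in Table 4 can be sketched as follows (illustrative PyTorch code with a toy model and timing loop; not the deployment pipeline used in the paper):

import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 16, 3, padding=1)).eval()
x = torch.rand(1, 3, 640, 640)

if torch.cuda.is_available():            # float16 is practical on GPU
    model, x = model.half().cuda(), x.half().cuda()

with torch.no_grad():
    model(x)                             # warm-up pass
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # wait for queued GPU work
    print((time.perf_counter() - start) / 100 * 1000, "ms per frame")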

FIGURE 6 Comparison of test results before and after improvement. (a) Test results before improvement. (b) Test results after improvement

TABLE 5 Frame rate comparison before and after improvement of four models

FPS, frame/s (before / after)
Model      Jetson nano     GPU              Jetson Xavier nx
YOLOv5s    4.59 / 20.3     58.05 / 88.50    29.12 / 48.91
YOLOv5m    2.95 / 13.1     45.54 / 73.31    22.69 / 37.33
YOLOv5l    1.04 / 7.8      33.21 / 65.25    14.46 / 26.87
YOLOv5x    0.32 / 5.23     28.48 / 45.97    6.38 / 19.09

5 TEST EXPERIMENT AND RESULT ANALYSIS

5.1 Testing data

A test set is formed from five hundred images not used in training and three sea surface monitoring videos; the performance of the network is evaluated in two dimensions: the actual detection effect and the actual computing power required.

5.2 Analysis of test experiment results

Figure 6 compares the detection results before and after the improvement on a set of test data. As can be seen from the figure, before the improvement the detection box of the ship could not be precisely positioned due to overlap, which became correct and accurate after the improvement. The improved detection boxes are more accurate and fit the targets better than before, and the detection accuracy improves accordingly with the improvement of detection box accuracy.

On a GPU platform the network model can detect ship targets at several tens of frames per second, but a video processing and recognition system on a mobile sea platform can only use lightweight devices, where it is not convenient to install a high-power-consuming GPU. Therefore, this paper measured the processing speed of the various network models for ship video on embedded development boards. Table 5 shows the frame rate comparison before and after the improvement of the different models on the GPU and on the embedded devices Jetson Xavier nx and Jetson nano. The compressed models improve greatly over the previous ones: the improved YOLOv5s reaches a frame rate of 98.5 in the server environment and 20.3 fps on Jetson nano, which helps the model achieve fast ship detection. The experimental results demonstrate that the improved model can effectively compress the volume and floating-point operations to raise the detection speed of the algorithm while preserving accuracy, and the compressed model volume and prediction speed are better than those of the traditional YOLOv5 model.

6 CONCLUSION

In this paper, we propose a target detection algorithm for marine ships based on an improved YOLOv5 network model. The network reconstructs the adaptive anchor boxes using the t-SNE dimensionality-reducing mapping, which enables the weighted kernel clustering algorithm to achieve more accurate target box positioning, fully extract target features and improve target detection accuracy. The BN scaling factor is used to compress and prune the network layers to eliminate redundant parameters, further reducing the model size and computational effort; the pruned model is then fine-tuned to restore accuracy, which significantly improves the video frame detection rate. The experimental results show that the algorithm can quickly and effectively detect surface ship targets, and it provides a theoretical basis for implementing target detection tasks on small devices with limited storage and on edge mobile devices. In future research, the streamlining of deep convolutional network structures will be investigated further to improve the real-time performance of the algorithm while maintaining detection accuracy, and the detection algorithm will be ported to surface unmanned ships.

ACKNOWLEDGEMENTS
This work was supported by: Xiamen Municipal Ocean and Fishery Development Special Fund (No. 21CZB013HJ15); Key Project of Fujian Science and Technology Plan (No. 2017h0028); Fund Project of Jimei University (No. zp2020042); Xiamen Key Laboratory of Marine Intelligent Terminal R&D and Application (No. B18208).

CONFLICT OF INTEREST
The authors declare that they have no financial or personal relationships with other organizations or individuals that may improperly affect their research work. No institution, company or individual is related, through its products, services or intellectual property rights, to the opinions and conclusions in the paper entitled "Fast Ship Detection based on Lightweight YOLOv5 Network".

DATA AVAILABILITY STATEMENT
Since the data are part of ongoing research, the data set required for the current algorithm research cannot be shared at present.

ORCID
Shi-Dan Sun https://orcid.org/0000-0001-7219-5933

REFERENCES
1. Zhang, Y., Li, Q., Zang, F.: Ship detection for visual maritime surveillance from non-stationary platforms. Ocean Eng. 141(1), 53–63 (2017)
2. Shi, G., Suo, J.: Ship targets detection based on visual attention. In: IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). Qingdao, pp. 1–4 (2018)
3. Shao, Z., Wang, L., Wang, Z., et al.: Saliency-aware convolution neural network for ship detection in surveillance video. IEEE Trans. Circuits Syst. Video Technol. 30(3), 1–15 (2019)
4. Zhang, W., He, X., Li, W., et al.: An integrated ship segmentation method based on discriminator and extractor. Image Vision Comput. 89(1), 1–11 (2019)
5. Wang, Y., Ning, X., Leng, B., et al.: Ship detection based on deep learning. In: IEEE International Conference on Mechatronics and Automation (ICMA). Tianjin, pp. 275–279 (2019)
6. Spyros, G., Nikos, K.: Object detection via a multi-region and semantic segmentation-aware CNN model. In: IEEE International Conference on Computer Vision (ICCV). pp. 1134–1142 (2015)
7. Lin, T., Dollár, P., Girshick, R., et al.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, pp. 936–944 (2017)
8. Kaiming, H., Georgia, G., Piotr, D., Ross, G.: Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV). Venice, Italy, pp. 2980–2988 (2017)
9. Dai, J., Qi, H., Xiong, Y., et al.: Deformable convolutional networks. In: IEEE International Conference on Computer Vision (ICCV). Venice, Italy, pp. 764–773 (2017)
10. Yang, Z., Liu, S., Hu, H., et al.: RepPoints: Point set representation for object detection. In: IEEE International Conference on Computer Vision (ICCV). Coex, pp. 9656–9665 (2019)
11. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully convolutional one-stage object detection. In: IEEE International Conference on Computer Vision (ICCV). Coex, pp. 9627–9636 (2019)
12. Wei, L., Dragomir, A., Dumitru, E., et al.: SSD: Single shot multibox detector. In: European Conference on Computer Vision (ECCV). Amsterdam, pp. 21–37 (2016)
13. Joseph, R., Ali, F.: YOLO9000: Better, faster, stronger. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, pp. 7263–7271 (2017)
14. Joseph, R., Santosh, K., Ross, G., Ali, F.: You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, pp. 779–788 (2016)
15. Joseph, R., Ali, F.: YOLOv3: An incremental improvement. In: IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, pp. 89–95 (2018)
16. Bochkovskiy, A., Wang, C.Y., Liao, H.: YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)
17. Lin, T., Dollár, P., Girshick, R., et al.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, pp. 2117–2125 (2017)
18. Laurens, M., Geoffrey, H.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
19. Sulaiman, S., Isana, M.: Adaptive fuzzy-k-means clustering algorithm for image segmentation. IEEE Trans. Consum. Electron. 56(4), 2661–2668 (2010)
20. Geng, F., Qian, S.: An optimal reproducing kernel method for linear nonlocal boundary value problems. Appl. Math. Lett. 77, 49–56 (2017)
21. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015)
22. Zhuang, L., Li, J., Shen, Z., et al.: Learning efficient convolutional networks through network slimming. In: IEEE International Conference on Computer Vision (ICCV). Venice, Italy, pp. 2736–2744 (2017)
23. Liu, W., Yuan, W., Chen, X., Lu, Y.: An enhanced CNN-enabled learning method for promoting ship detection in maritime surveillance system. Ocean Eng. 235, 109435 (2021)

How to cite this article: Zheng, J.-C., Sun, S.-D., Zhao, S.-J.: Fast ship detection based on lightweight YOLOv5 network. IET Image Process. 16, 1585–1593 (2022). https://doi.org/10.1049/ipr2.12432
