Journal of Marine Science and Engineering
Article
Real-Time Relative Positioning Study of an Underwater Bionic
Manta Ray Vehicle Based on Improved YOLOx
Qiaoqiao Zhao 1,2, Lichuan Zhang 1,2,3,*, Yuchen Zhu 1,2, Lu Liu 1,3, Qiaogao Huang 1,3, Yong Cao 1,2
and Guang Pan 1,2,3
1 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
2 Key Laboratory of Unmanned Underwater Vehicle, Northwestern Polytechnical University,
Xi’an 710072, China
3 Unmanned Vehicle Innovation Center, Ningbo Institute of NPU, Ningbo 315048, China
* Correspondence: [email protected]
Abstract: Compared to traditional vehicles, the underwater bionic manta ray vehicle (UBMRV) is
highly maneuverable, has strong concealment, and represents an emerging research direction in
underwater vehicles. Building on the completed single-vehicle research, it is crucial to study the
swarming of UBMRVs for the implementation of complex tasks, such as large-scale underwater detection.
The relative positioning capability of the UBMRV is the key to realizing a swarm, especially when
underwater acoustic communications are delayed. To solve the real-time relative positioning problem
between individuals in the UBMRV swarm, this study proposes a relative positioning method based
on the combination of the improved object detection algorithm and binocular distance measurement.
To increase the precision of underwater object detection in small samples, this paper improves the
original YOLOx algorithm. It increases the network’s interest in the object area by adding an attention
mechanism module to the network model, thereby improving its detection accuracy. Further, the
output of the object detection result is used as the input of the binocular distance measurement
module. We use the ORB algorithm to extract and match features in the object-bounding box and
obtain the disparity of the features. The relative distance and bearing information of the target are
output and shown on the image. We conducted pool experiments to verify the proposed algorithm
on the UBMRV platform, proved the method’s feasibility, and analyzed the results.
Keywords: underwater bionic manta ray vehicle; underwater object detection; relative positioning

Citation: Zhao, Q.; Zhang, L.; Zhu, Y.; Liu, L.; Huang, Q.; Cao, Y.; Pan, G. Real-Time Relative Positioning Study of an Underwater Bionic Manta Ray Vehicle Based on Improved YOLOx. J. Mar. Sci. Eng. 2023, 11, 314. https://doi.org/10.3390/jmse11020314

Academic Editor: Rafael Morales

Received: 2 January 2023; Revised: 19 January 2023; Accepted: 20 January 2023; Published: 1 February 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Underwater vehicles, as important tools for developing underwater resources, can perform complex tasks in place of humans, such as environmental monitoring, underwater object detection, underwater resource detection, and military strikes [1]. Over hundreds of millions of years of natural selection, fish have acquired superb swimming skills; their high efficiency, low noise, and flexible motion provide new ideas for the development of underwater vehicles. The emergence of bionic underwater vehicles overcomes the shortcomings of traditional propeller-propelled underwater vehicles in terms of efficiency, maneuverability, and noise [2].

The manta ray is a creature that flaps its pectoral fins and propels itself by gliding. Its motion has the advantages of high maneuverability, high adaptability, high efficiency, and energy saving, making it an excellent reference for underwater bionic vehicles. Northwestern Polytechnical University carried out a dynamic analysis of the manta ray and obtained its shape parameters, three-dimensional physical model, and kinematic parameters [3]. In the prototype development, a pectoral fin structure with multi-level distributed fin rays and a caudal fin structure with quantitative deflection were designed; on this basis, the underwater bionic manta ray vehicle (UBMRV) was successfully developed [4]. In terms of intelligent motion control, the researchers used fuzzy CPG control to realize the rhythmic motion control of the pectoral fin structure and the quantitative deflection control of the caudal fin of the bionic manta ray vehicle. Finally, the reliability of the autonomous motion of the UBMRV was verified through a lake experiment [5].
In order to complete complex underwater tasks, such as the dynamic monitoring of large-scale
sea areas and multi-task coordination, it is important to research the UBMRV swarm.
The UBMRV swarm operates with a certain formation structure.
Compared with the single-robot operation, the UBMRV swarm has a wide task execution
range and high efficiency [6]. In the military, it can be used in anti-submarine warfare, mine
stations, reconnaissance, and surveillance. In the civilian sector, it can be used for marine
environment detection, information collection, and underwater scientific research.
Individuals in the swarm update the information interaction rules of the UBMRV
swarm through a comprehensive evaluation of their neighbors' distance, speed, orientation,
and other factors. The premise of applying swarm formation is to use onboard sensors to obtain
the relative position coordinates of the underwater robots [7]. With improvements
in machine vision, visual positioning technology combining computer vision with
artificial intelligence has developed further. In 2015, Joseph Redmon proposed the
single-stage object detection algorithm YOLOv1 [8]. The YOLO algorithm treats detection as
a single regression problem and processes images quickly. The YOLO series of object detection
algorithms have been well applied in engineering [9,10]. Zhang et al. [11] proposed a
formation control method for quadrotor UAVs based on visual positioning, in which the relative
positioning of two UAVs is realized by detecting a two-dimensional code marker and solving its
pose in the world coordinate system. Feng et al. [12] proposed a vision-based
terminal guidance method for underwater vehicles. In this study, optical beacon arrays
and AR markers were installed on neighboring vehicles as guidance objects, and the
positions and orientations of the guidance objects were determined based on the
PnP method.
The relative positioning research among these swarm individuals is mostly realized by
installing fixed beacons on the platform. If the fixed beacon is blocked, positioning cannot
be achieved. This imposes limitations on the swarm positioning. Therefore, this paper uses
the method based on the binocular camera to directly detect and locate the UBMRV, which
is no longer limited to the method of only detecting fixed beacons.
Visual positioning technology is mostly used in land and air robots, and underwater
use has received less attention. However, cameras cost less and provide richer
information than other underwater sensors, such as acoustic sensors. Despite
the limited use of visual sensors underwater, visual data can still play important roles in
underwater real-time localization and navigation. This is especially effective in close object
detection and distance measurement [13]. Xu et al. [14] proposed an underwater object
recognition and tracking method based on the YOLOv3 algorithm. The research realizes
the identification, location, and tracking of underwater objects. The average accuracy of
underwater object recognition is 75.1%, and the detection speed is 15 fps. Zhai et al. [15]
proposed a sea cucumber object recognition algorithm based on the improved YOLOv5
network. The improved model greatly improves the accuracy of small object recognition.
Compared with YOLOv5s, the precision and recall of the improved YOLOv5s model are
improved by 9% and 11.5%, respectively.
After the binocular camera collects the images, performing feature extraction and matching
over the full image would undoubtedly involve a large number of calculations, resulting in
poor real-time performance of the positioning system. To reduce this computational load,
this paper first obtains the target's position information on the image through the improved
object detection algorithm and then applies feature extraction and matching only within the
object-bounding box to obtain the positioning information, thereby reducing the amount of
the calculation.
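As a rough illustration of this design choice, the sketch below shows the intended flow; the helper names (detect_ubmrv, match_orb_in_boxes, estimate_range_bearing) are hypothetical placeholders rather than the authors' code.

```python
def relative_position(frame_left, frame_right):
    # 1. Detect the neighboring UBMRV on each rectified image with the improved YOLOx model.
    box_l = detect_ubmrv(frame_left)      # hypothetical wrapper around the trained detector
    box_r = detect_ubmrv(frame_right)
    if box_l is None or box_r is None:
        return None                       # neighbor not visible in one of the views
    # 2. Extract and match ORB features only inside the two bounding boxes.
    pts_l, pts_r = match_orb_in_boxes(frame_left, box_l, frame_right, box_r)
    # 3. Convert the per-feature disparities into a relative distance and bearing.
    return estimate_range_bearing(pts_l, pts_r)
```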
In order to improve the object detection precision of the UBMRV in the underwater
environment, the YOLOx object detection algorithm [16] was improved in this study.
A coordinate attention module was added to the YOLOx backbone network to improve
the model’s attention to features, thereby enhancing the performance of the underwater
object detection model. We compared the improved YOLOx model with the original model.
Experimental results show that the improved model has a higher mAP at an IoU threshold of 0.5;
mAP is the average of the AP values of all classes, and the AP value of each class is the area
enclosed by its precision-recall (PR) curve. IoU is the overlap between the candidate bounding
box predicted by the network and the ground truth box, i.e., the ratio of their
intersection and union.
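For concreteness, a minimal IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2); this is a generic sketch, not the authors' evaluation code.

```python
def iou(box_a, box_b):
    # Intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two box areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction is typically counted as correct when iou(pred_box, gt_box) >= 0.5.
```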
In this paper, an object detection algorithm based on a deep convolutional neural network
is combined with the ORB binocular estimation method to design a visual positioning
system for the UBMRV, which realizes the perception of neighbor robot information and
effectively improves the precision of distance estimation between objects. Finally, we verify
the effectiveness of the proposed method on the UBMRV platform.
The visual positioning system of the UBMRV was mainly composed of an image
acquisition module and an image processing module. The image acquisition module
consisted of two camera rigs arranged on the head of the UBMRV, with a camera baseline of
172 mm. The binocular camera module installed on the prototype had a field of view of 85°
and a visible distance of 5 m in the pool environment without light assistance. The image
processing module adopted the Jetson NX industrial computer, which deployed the deep
convolutional neural network model proposed in this paper. The Jetson NX measures only
70 mm × 45 mm, which greatly improved the utilization of the prototype cabin. The Jetson
NX can provide 14 TOPS of computing power at 10 W and 21 TOPS at 15 W, so it can
achieve good computational performance on a low-power vehicle. Figure 2
shows the physical layout of the binocular cameras and the Jetson NX processor on
the UBMRV prototype.
In the UBMRV visual positioning system, the image information collected by the two
camera modules was transmitted to the Jetson NX processor inside the cabin through two
USB cables to receive real-time images. The relative distance and bearing information
between the vehicle and its neighbors were obtained using the algorithms deployed on the
processor. The Jetson NX processor sends the calculated relative distance and bearing
information to the main control chip in the cabin through the serial port, realizing the whole
chain from the binocular perception system to the motion control of the main
control chip.
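For illustration, a minimal sketch of pushing the result to the main control chip over the serial port with pySerial; the device path, baud rate, and message format are assumptions, since the paper does not specify the protocol.

```python
import serial

ser = serial.Serial("/dev/ttyTHS0", 115200, timeout=0.1)  # assumed port and baud rate

def send_relative_state(distance_m, alpha_deg, beta_deg):
    # Hypothetical ASCII frame carrying the relative distance and the two bearing angles.
    frame = f"$REL,{distance_m:.2f},{alpha_deg:.1f},{beta_deg:.1f}\r\n"
    ser.write(frame.encode("ascii"))
```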
The prototype head was equipped with two camera modules. A Doppler velocity
log (DVL) was placed on the abdomen. The interior of the prototype cabin contained
modules such as an attitude and heading reference system (AHRS), router, depth sensor,
main control board, and battery. The distribution of each module on the vehicle prototype
is shown in Figure 3. The sensor data were sent to the main control processor through the
serial port to realize the autonomous motion estimation of the UBMRV.
An AHRS usually consists of a multi-axis gyroscope, a multi-axis accelerometer, and a
magnetic compass. The data collected by these sensors were fused through a Kalman filter to
provide the vehicle's three rotational degrees of freedom (yaw, pitch, and roll).
The DVL uses the Doppler shift between the transmitted acoustic wave and the received
bottom-reflected wave to measure the vehicle's speed and accumulated range relative to
the bottom. From the Doppler effect, the DVL can calculate the velocities along the forward,
right, and up axes.
The depth sensor was used to measure the depth of the UBMRV, and the vehicle
implemented closed-loop depth control based on this measured value.
The two camera modules were used to measure the relative distance and bearing information
of neighbors, which the vehicle used to achieve formation control.
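The loss function itself (Equation (1) in the original layout) did not survive extraction; a form consistent with the symbol definitions below and with the YOLOx formulation [16] is

L = \frac{1}{N_{pos}} \left( L_{cls} + \lambda L_{reg} + L_{obj} \right) \qquad (1)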
where Lcls and Lobj use the binary cross-entropy loss and Lreg uses the IoU loss. λ is the
balancing coefficient of the regression loss, and Npos denotes the number of anchor points
classified as positive samples. Lcls and Lreg are calculated only over positive samples, while
Lobj is calculated over both positive and negative samples.
The disparity of the matched features is obtained from the left and right images, and the
actual relative distance between the UBMRVs is obtained by combining it with the camera's
intrinsic matrix.
The purpose of stereo rectification is to align the two imaging planes and align the
pixels on each row, thereby reducing the search range of the feature-matching algorithm
and reducing the complexity of the binocular distance estimation. The object detection
algorithm based on YOLOx can obtain the position coordinates of the target on the image,
which are relative to the image coordinate system. To achieve relative positioning between
robots, it is necessary to match the features extracted from the left and right images and
use the principle of stereo ranging to obtain the position coordinates of the target in the
actual coordinate system.
The world coordinate system contains two cameras, C1 and C2. The pixel points of a
three-dimensional point P projected onto the two cameras are p1 and p2. We used triangulation
to obtain the coordinates of point P in three-dimensional space. First, we completed the stereo
rectification operation to align the epipolar lines of the left and right images in the horizontal
direction. After stereo rectification, the two images only had disparity in the horizontal
direction. Thus, the stereo-matching problem was reduced from two dimensions to one,
which improved the matching speed.
The stereo rectification of the binocular camera mainly solves the homography
transformation matrices corresponding to the left and right cameras. According to the
principle of camera projection, consider a point P in space. Its coordinates in the world
coordinate system are [ Xw , Yw , Zw ] T , and its coordinates in the camera coordinate system
are [ Xc , Yc , Zc ] T . After the projection transformation, the coordinates in the imaging plane
are [ x, y] T , and the coordinates in the pixel coordinate system are [u, v] T . Figure 4 shows
the camera projection transformation relationship.
The rotation matrix R and displacement vector T represent the mapping relationship
of points in the world coordinate system and camera coordinate system.
Pw = R × Pc + T (2)
where T = C, and C denotes the coordinates of the origin of the camera coordinate system
expressed in the world coordinate system. For the left camera, the projection relationship is
obtained from the camera projection model,
\lambda_L \begin{bmatrix} u_L \\ v_L \\ 1 \end{bmatrix} = K_L R_L^{-1} \left( \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} - C_L \right) \qquad (3)
where λ is the scaling factor. K is the intrinsic matrix of the cameras. R and C are the rotation
matrices and translation vectors of the camera, respectively. The projection relationship
expression of the right camera can be obtained in the same way.
Since the two virtual cameras are identical, they have the same extrinsic rotation R̂ and
intrinsic matrix K̂. The projection relationship of the space point P on the virtual imaging
plane is
\hat{\lambda}_L \begin{bmatrix} \hat{u}_L \\ \hat{v}_L \\ 1 \end{bmatrix} = \hat{K} \hat{R}^{-1} \left( \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} - C_L \right) \qquad (4)
The projection relationship between the original camera and the virtual camera is
compared to obtain the homography matrix from the original image of the left camera to
the new virtual image.
\begin{bmatrix} \hat{u}_L \\ \hat{v}_L \\ 1 \end{bmatrix} = \frac{\lambda_L}{\hat{\lambda}_L} \hat{K} \hat{R}^{-1} R_L K_L^{-1} \begin{bmatrix} u_L \\ v_L \\ 1 \end{bmatrix} = H_L \begin{bmatrix} u_L \\ v_L \\ 1 \end{bmatrix} \qquad (5)
Similarly, the homography matrix of the right camera can be obtained. The image after
stereo correction is shown in Figure 5.
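As a practical note, this rectification step can be carried out with OpenCV; the snippet below is a generic sketch in which the intrinsics, distortion coefficients, and image size are placeholder values (only the 172 mm baseline comes from the paper), not the authors' calibration.

```python
import cv2
import numpy as np

# Placeholder calibration values for illustration only; real values come from stereo
# calibration of the prototype's 172 mm baseline camera pair.
image_size = (1280, 720)
f = 700.0
K_l = np.array([[f, 0.0, 640.0], [0.0, f, 360.0], [0.0, 0.0, 1.0]])
K_r = K_l.copy()
d_l = np.zeros(5)
d_r = np.zeros(5)
R = np.eye(3)                              # rotation from the left to the right camera
T = np.array([[-0.172], [0.0], [0.0]])     # 172 mm baseline along the x-axis

# Rectification transforms, projection matrices, and remapping tables.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K_l, d_l, K_r, d_r, image_size, R, T, alpha=0)
map_lx, map_ly = cv2.initUndistortRectifyMap(K_l, d_l, R1, P1, image_size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(K_r, d_r, R2, P2, image_size, cv2.CV_32FC1)

# Applied per frame:
# rect_l = cv2.remap(raw_left_image, map_lx, map_ly, cv2.INTER_LINEAR)
# rect_r = cv2.remap(raw_right_image, map_rx, map_ry, cv2.INTER_LINEAR)
```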
The coordinate attention (CA) module [18] encodes the feature map along each spatial
direction, assigning different weights to the input feature maps to improve the
representation of the region of interest.
For input X, each channel is first encoded along the horizontal and vertical directions using
pooling kernels with dimensions ( H, 1) and (1, W ). The outputs of the c-th channel at height h
and at width w can be expressed as

z_c^h(h) = \frac{1}{W} \sum_{1 \le i \le W} x_c(h, i) \qquad (6)

z_c^w(w) = \frac{1}{H} \sum_{1 \le j \le H} x_c(j, w) \qquad (7)
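The intermediate step between Equations (7) and (9) (Equation (8) in the original layout) is missing from the extracted text; in the standard coordinate attention formulation [18], it concatenates the two pooled maps and applies a shared 1×1 convolution F_1 followed by a nonlinear activation δ:

f = \delta\left( F_1\left( \left[ z^h, z^w \right] \right) \right) \qquad (8)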
Here, σ represents the sigmoid function. The intermediate feature f is divided into two separate
tensors f^h and f^w along the spatial dimension. Two 1×1 convolutions, F_h and F_w, transform
f^h and f^w to the same number of channels as the input X:

g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right) \qquad (9)
Using g^h and g^w as attention weights, the output of the CA module reweights the input
feature along both spatial directions.
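The explicit output expression (Equation (10) in the original layout) is likewise missing from the extracted text; the standard form from [18] is

y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \qquad (10)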
In this study, the CA module is added to the CSPlayer module. The network structure
after adding the CA module is shown in Figure 7. The overall network structure is divided
into CSPDarknet, FPN, and head.
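To make the added block concrete, below is a minimal PyTorch sketch of a coordinate attention module in the spirit of [18] and Equations (6)–(10); the reduction ratio and activation choice are assumptions, and the authors' exact CSPlayer integration may differ.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention sketch: directional pooling, shared transform, two gates."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (B, C, H, 1), Eq. (6)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (B, C, 1, W), Eq. (7)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared F_1, Eq. (8)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h, Eq. (9)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w, Eq. (9)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                              # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)          # (B, C, W, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * g_h * g_w                              # attention-weighted output, Eq. (10)
```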
The CA module is mainly added to the main feature extraction part of YOLOx to
improve the precision of the feature extraction of the input image by the network. The input
images first enter the focus structure, where the width and height information of the images
is concentrated into the channel dimension to complete the channel expansion process.
Feature extraction is then conducted using the convolutional layers and the CSPlayer layers.
The extracted features are called the feature sets of the input images. After the main feature
extraction of the input images is complete, three feature layers with scales of 80 × 80,
40 × 40, and 20 × 20 are output. These three feature layers are called effective feature
layers and serve as the input for the next step of network construction.
The feature fusion of the three effective feature layers is performed through the FPN
structure. The purpose of feature fusion is to combine feature information of different
scales to achieve further feature extraction.
After the input image passes through the backbone feature extraction and the FPN structure,
three effective feature layers with different widths, heights, and numbers of channels are
output. Each feature layer yields three prediction results: regression parameters, positive
and negative sample prediction, and category prediction.
We used the binocular camera on the prototype to acquire a full range of images of the
UBMRV in different lighting environments and orientations. More than 10,000 images were
collected. We selected 8700 images for category annotation and bounding box annotation to
create the dataset for network training. We divided the produced dataset into the training
set, testing set, and validation set according to the ratio of 8:1:1.
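A minimal sketch of the 8:1:1 split described above; the file naming and random seed are illustrative only.

```python
import random

random.seed(0)
image_ids = [f"img_{i:05d}.jpg" for i in range(8700)]  # placeholder names for the annotated images
random.shuffle(image_ids)

n_train = int(0.8 * len(image_ids))
n_test = int(0.1 * len(image_ids))
train_set = image_ids[:n_train]
test_set = image_ids[n_train:n_train + n_test]
val_set = image_ids[n_train + n_test:]
```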
Moreover, the original YOLOx was compared with the improved YOLOx. The training was
performed on the same platform, and the number of training epochs was set to 200;
the final training results are shown in Table 1.
As can be seen from the training results in the table, adding the CA module to
the original YOLOx model improves the precision of object detection. However, the training
duration becomes slightly longer due to the increased complexity of the model. Adding the
CA module also increases the number of weights generated by network training; therefore,
the object detection speed is slightly reduced. Since this study only recognizes one class,
the UBMRV, the detection precision is high.
Figure 8. Variation of parameters during network training: (a) loss changes; (b) object detection
accuracy.
Figure 9 plots the prediction results of the network on the test dataset. Figure 9a
shows the precision-recall (PR) curve when the IoU threshold was 0.5. Figure 9b shows the F1
value curve for different IoU thresholds. Figure 9c shows the precision change
curve for different IoU thresholds. Figure 9d shows the recall change curve for different
IoU thresholds. AP measures how well the trained model can detect the category of interest.
The area enclosed by the PR curve is the AP value of the network object detection. From the
PR curve in the figure, we see that the network achieves 99.0% AP for detecting the UBMRV
category on this dataset.
Figure 9. Plots of the prediction results of the network on the test set: (a) precision versus recall curve;
(b) F1 value change; (c) precision change; (d) recall change.
Precision represents the proportion of samples predicted as positive that are actually positive;
it can be regarded as the ability of the model to return only correct detections. Recall
represents the proportion of actual positive samples that are correctly detected. The F1 score
is the harmonic mean of Precision and Recall and serves as a composite evaluation metric of the
two, avoiding reliance on a single extreme value of Precision or Recall [19]. The calculation
formula of each index is shown in Equation (11):
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (11)
TP denotes cases where the prediction is positive and the ground truth is positive (a correct
prediction). FP denotes cases where the prediction is positive but the ground truth is negative
(a wrong prediction). FN denotes cases where the prediction is negative but the ground truth is
positive (a missed detection). From the prediction results, it can be seen that when the IoU
threshold is 0.5, the predicted precision is 99.32% and the recall is 98.98% on the dataset of
this study. Since both precision and recall are high, this study calculated the F1 value at an
IoU threshold of 0.5 and obtained a value of 0.99. The prediction results show that the network
model has a good object detection performance on the test dataset.
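Equation (11) translates directly into code; a minimal sketch (guarding against empty denominators) is shown below.

```python
def precision_recall_f1(tp, fp, fn):
    # Direct implementation of Equation (11).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Illustrative call with arbitrary counts:
# precision_recall_f1(tp=98, fp=2, fn=1)
```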
First, the position information of the UBMRV on the image was obtained from the
image captured by the binocular camera after the object detection module. The coordinates
of the predicted bounding box on the left image are defined as ( x1l , y1l ) for the upper left
corner and ( x2l , y2l ) for the lower right corner. Similarly, the coordinates of the upper left
corner ( x1r , y1r ) and lower right corner ( x2r , y2r ) of the bounding box on the right image
can be obtained. The matching points corresponding to the left image are extracted from
the bounding box on the right image. First, the overlap degree of the bounding boxes in the
left and right images was calculated, as in Equation (12).
p = \frac{S\left[\max(x_{1l}, x_{1r}), \max(y_{1l}, y_{1r}), \min(x_{2l}, x_{2r}), \min(y_{2l}, y_{2r})\right]}{\max\left\{ S\left[(x_{2l} - x_{1l}) \times (y_{2l} - y_{1l})\right], S\left[(x_{2r} - x_{1r}) \times (y_{2r} - y_{1r})\right] \right\}} \qquad (12)
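A direct sketch of Equation (12), with boxes given as (x1, y1, x2, y2) in rectified image coordinates; S[...] denotes the rectangle area.

```python
def overlap_degree(box_l, box_r):
    # Area of the intersection rectangle defined by the max/min corner coordinates.
    ix1, iy1 = max(box_l[0], box_r[0]), max(box_l[1], box_r[1])
    ix2, iy2 = min(box_l[2], box_r[2]), min(box_l[3], box_r[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Areas of the two predicted bounding boxes.
    area_l = (box_l[2] - box_l[0]) * (box_l[3] - box_l[1])
    area_r = (box_r[2] - box_r[0]) * (box_r[3] - box_r[1])
    return inter / max(area_l, area_r)
```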
Here, λ1 and λ2 are the eigenvalues of the matrix M used for Harris corner detection, (x, y) is
the coordinate of the corresponding pixel in the window, w( x, y) is the window function, and
Ix and Iy are the image gradients at each pixel.
The Harris response value is calculated from the eigenvalues of the matrix M.
Finally, the N feature points with the largest response values are selected as the final feature
point set.
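The defining equations for M and the Harris response are not reproduced in the extracted text; the standard Harris formulation consistent with these symbols is

M = \sum_{(x, y)} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \qquad R = \det(M) - k \, (\operatorname{trace}(M))^2 = \lambda_1 \lambda_2 - k (\lambda_1 + \lambda_2)^2

where k is an empirical constant (typically 0.04–0.06).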
ORB determines the orientation of a feature using the grayscale centroid, i.e., the center of
mass of an image block weighted by its grayscale values. For a small image block D, this is
computed from the image moments.
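The moment definition itself is missing from the extracted text; the standard form used by ORB [21] over an image block D with intensity I(x, y) is

m_{pq} = \sum_{(x, y) \in D} x^p y^q I(x, y), \qquad p, q \in \{0, 1\}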
From the moments of the image block, the center of mass of the image block is

C = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right) \qquad (15)
Connecting the geometric center O and the center of mass C of the image block yields
the vector \overrightarrow{OC}, and the direction of the feature point is

\theta = \arctan\left( \frac{m_{01}}{m_{10}} \right) \qquad (16)
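A minimal OpenCV sketch of the ORB extraction and matching step restricted to the detected bounding boxes; the feature count and matching settings are assumptions, not the authors' parameters.

```python
import cv2

def match_orb_in_boxes(img_l, box_l, img_r, box_r, n_features=500):
    # Crop the detected bounding boxes from the rectified left and right images.
    xl1, yl1, xl2, yl2 = map(int, box_l)
    xr1, yr1, xr2, yr2 = map(int, box_r)
    roi_l = img_l[yl1:yl2, xl1:xl2]
    roi_r = img_r[yr1:yr2, xr1:xr2]

    orb = cv2.ORB_create(nfeatures=n_features)
    kp_l, des_l = orb.detectAndCompute(roi_l, None)
    kp_r, des_r = orb.detectAndCompute(roi_r, None)
    if des_l is None or des_r is None:
        return []

    # Brute-force Hamming matching with cross-checking to suppress mismatches.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(bf.match(des_l, des_r), key=lambda m: m.distance)

    # Return matched pixel coordinates expressed in full-image coordinates.
    pairs = []
    for m in matches:
        ul, vl = kp_l[m.queryIdx].pt
        ur, vr = kp_r[m.trainIdx].pt
        pairs.append(((ul + xl1, vl + yl1), (ur + xr1, vr + yr1)))
    return pairs
```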
Figure 10. Schematic of the binocular camera distance measurement principle.
The distance between the optical centers of the two cameras is called the baseline
(denoted b), and b is known once the camera mounting positions are determined.
For a point P in space, its images on the left and right cameras are denoted P_L and
P_R. After distortion correction and stereo rectification, the two projections differ only by a
displacement along the x-axis, corresponding to the u-axis of the pixel coordinates. Let the
x-coordinate of P_L on the left image be x_l and that of P_R on the right image be x_r.
The geometric relationship is shown in Figure 10. According to the triangle similarity,
we have
\frac{z}{f} = \frac{x}{x_l} = \frac{x - b}{x_r} = \frac{y}{y_l} = \frac{y}{y_r} \qquad (17)

Then, we have

x = \frac{b \times x_l}{x_l - x_r}; \qquad z = \frac{b \times f}{x_l - x_r}; \qquad y = \frac{b \times y_l}{x_l - x_r} \qquad (18)

and

z = \frac{b \times f}{d}; \qquad x = \frac{z \times x_l}{f}; \qquad y = \frac{z \times y_l}{f} \qquad (19)
where d = x_l − x_r is the difference between the left and right camera pixel coordinates,
called the disparity; f is the camera's focal length; and b is the baseline of the binocular
camera. This gives the position coordinates of the feature points on the image in the
camera coordinate system as ( x, y, z), and the relative distance is dis = \sqrt{x^2 + y^2 + z^2}.
The camera coordinate system is established with the left camera optical center of the
binocular camera as the origin. Then the bearing of the feature points on the image for the
camera can be expressed as
\alpha = \arctan\left(\frac{x}{z}\right); \qquad \beta = \arctan\left(\frac{y}{z}\right) \qquad (20)
We sorted the distance values calculated within the bounding box and took the median
value as the final relative distance and bearing of the UBMRV.
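Combining Equations (19) and (20) with the median selection described above, a minimal sketch is given below; the function name continues the hypothetical naming used earlier, and f, b, cx, cy are placeholder calibration values (only the 172 mm baseline comes from the paper).

```python
import math
from statistics import median

def estimate_range_bearing(matched_pairs, f=700.0, b=0.172, cx=640.0, cy=360.0):
    # f: focal length in pixels and (cx, cy): principal point -- placeholder calibration values;
    # b: baseline in meters (172 mm on the prototype).
    distances, alphas, betas = [], [], []
    for (ul, vl), (ur, vr) in matched_pairs:
        d = ul - ur                       # disparity along the u-axis of the rectified images
        if d <= 0:
            continue                      # skip degenerate or mismatched points
        xl, yl = ul - cx, vl - cy         # pixel coordinates relative to the principal point
        z = b * f / d                     # Equation (19)
        x = z * xl / f
        y = z * yl / f
        distances.append(math.sqrt(x * x + y * y + z * z))
        alphas.append(math.degrees(math.atan2(x, z)))   # Equation (20)
        betas.append(math.degrees(math.atan2(y, z)))
    if not distances:
        return None
    # Median of the per-feature estimates inside the bounding box gives the final result.
    return median(distances), median(alphas), median(betas)
```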
Figure 11. Examples of the object detection results in a small pool (a–c) and large pool (d–f) at
different relative distances: (a) 1 m; (b) 2 m; (c) 3.5 m; (d) 1 m; (e) 2 m; (f) 3.5 m.
We used the camera mounted on the head of the prototype to detect the neighboring
UBMRV. From the underwater experiments, the visible distance of the camera was 5 m.
When the distance between neighboring UBMRVs exceeds 5 m, the rear UBMRV cannot
see the object ahead. We performed 20 detections of neighboring UBMRVs at 1 m, 1.5 m,
2 m, 2.5 m, 3 m, 3.5 m, 4 m, and 5 m, respectively. The proportion of successful detections
of the neighboring object was counted and defined as the effective detection rate.
The detection results are shown in Table 2. When the relative distance between UBMRVs
was close, at 1–3 m, the neighbors were detected using the algorithm proposed in this
paper, and the effective detection rate reached more than 90%. When the relative distance
between the UBMRV and its neighbor was larger (3 m or more), the effective detection rate of
the object was 70–80%, and the probability of successfully detecting the object was lower. Finally,
we calculated the average probability of successfully detecting a neighboring object
within the visible distance of the camera as 85.6%. The detection speed of the proposed
algorithm deployed on the NVIDIA Jetson NX reaches 25 fps, which meets the real-time
requirements needed for subsequent control.
Figure 12. Results of relative distance and bearing estimation based on object detection: (a) 1 m;
(b) 1.5 m; (c) 2 m; (d) 2.5 m, where the red dot denotes the center point of predicted bounding box.
The estimation error was small relative to the length of the UBMRV platform. The experimental
results of the relative distance estimation are shown in Table 3 and Figure 13.
From the above distance and bearing error graphs, it can be seen that when the relative
distance between targets was small, both object detection and distance and bearing estimation
achieved high accuracy. When the distance between objects was 1–3 m, the estimated distance
error was around 0.3 m, and the estimated angle error was around 4°. When the distance between
objects was larger, the error fluctuated more. From the experimental results, the binocular
localization system proposed in this paper performs well at close range, while the experimental
error is relatively large at 3–5 m. The operation speed of the algorithm deployed on the
industrial computer is 5 fps, which can meet the system's real-time requirements.
5. Conclusions
In order to solve the relative positioning problem of the UBMRV cluster, this paper
proposes a relative positioning method based on an object detection algorithm and binocu-
lar vision. The engineering realization of the UBMRV cluster can promote the completion
of complex tasks, such as large-scale sea area monitoring and multi-task coordination,
and relative positioning technology is one of the key technologies for realizing cluster
formation. Therefore, this paper used a binocular camera to obtain the neighbor’s distance
and bearing information to realize the relative positioning. We adopted an improved
object detection algorithm to detect the neighboring UBMRV directly. From the experimental
results, the improved object detection algorithm has higher precision. Then, this study
used the ORB algorithm to extract and match features in the bounding box, which reduces
the computational burden of binocular matching. It can be concluded from the
pool experiment that when the actual distance between neighbors is 1–3 m, both the object
detection and relative distance estimation have high accuracy. When the actual distance is
larger, the positioning accuracy is poorer, and the neighbor may even become "invisible".
This paper completes
the design of the UBMRV cluster relative positioning system and deploys the proposed
algorithm on the prototype for pool experiments. The algorithm’s running time is 0.2–0.25 s,
which can meet the real-time requirements of the system.
To further improve the current research work—since the motion of UBMRV is divided
into different poses—the motion poses of neighbors can be estimated through visual
perception, thereby improving the formation positioning error. The ultimate goal is to
realize the engineering application of the UBMRV formation.
Author Contributions: Conceptualization, Q.Z., L.Z., Q.H. and G.P.; methodology, Q.Z.; software,
Q.Z. and Y.Z.; validation, L.Z. and Y.C.; formal analysis, Q.Z., Y.Z. and L.L.; investigation, Q.Z.
and Y.Z.; resources, L.Z. and G.P. (Guang Pan); data curation, Q.Z. and Y.Z.; writing—original
draft preparation, Q.Z.; writing—review and editing, L.Z. and L.L.; visualization, Q.Z. and Y.Z.;
supervision, L.Z., Q.H. and G.P.; project administration, L.Z. and Q.H.; funding acquisition, L.Z. All
authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the National Key Research and Development Program of
China (grant no. 2020YFB1313200, 2020YFB1313202, 2020YFB1313204) and the National Natural
Science Foundation of China (grant no. 51979229).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data can be obtained from the corresponding author upon reason-
able request.
Conflicts of Interest: The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported in this paper.
References
1. Yuh, J. Design and control of autonomous underwater robots: A survey. Auton. Robot. 2000, 8, 7–24. [CrossRef]
2. Alam, K.; Ray, T.; Anavatti, S.G. Design optimization of an unmanned underwater vehicle using low-and high-fidelity models.
IEEE Trans. Syst. Man, Cybern. Syst. 2015, 47, 2794–2808. [CrossRef]
3. Huang, Q.; Zhang, D.; Pan, G. Computational model construction and analysis of the hydrodynamics of a Rhinoptera Javanica.
IEEE Access 2020, 8, 30410–30420. [CrossRef]
4. He, J.; Cao, Y.; Huang, Q.; Cao, Y.; Tu, C.; Pan, G. A New Type of Bionic Manta Ray Robot. In Proceedings of the IEEE Global
Oceans 2020: Singapore–US Gulf Coast, Biloxi, MS, USA, 5–30 October 2020; pp. 1–6.
5. Cao, Y.; Ma, S.; Xie, Y.; Hao, Y.; Zhang, D.; He, Y.; Cao, Y. Parameter Optimization of CPG Network Based on PSO for
Manta Ray Robot. In Proceedings of the International Conference on Autonomous Unmanned Systems, Changsha, China,
24–26 September 2021; pp. 3062–3072.
6. Ryuh, Y.S.; Yang, G.H.; Liu, J.; Hu, H. A school of robotic fish for mariculture monitoring in the sea coast. J. Bionic Eng. 2015,
12, 37–46. [CrossRef]
7. Chen, Y.L.; Ma, X.W.; Bai, G.Q.; Sha, Y.; Liu, J. Multi-autonomous underwater vehicle formation control and cluster search using a
fusion control strategy at complex underwater environment. Ocean Eng. 2020, 216, 108048. [CrossRef]
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
9. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
10. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
[CrossRef]
11. Zhang, S.; Li, J.; Yang, C.; Yang, Y.; Hu, X. Vision-based UAV Positioning Method Assisted by Relative Attitude Classifica-
tion. In Proceedings of the 2020 5th International Conference on Mathematics and Artificial Intelligence, Chengdu, China,
10–13 April 2020; pp. 154–160.
12. Feng, J.; Yao, Y.; Wang, H.; Jin, H. Multi-AUV terminal guidance method based on underwater visual positioning. In Proceedings
of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA), Beijing, China, 13–16 October 2020;
pp. 314–319.
13. Chi, W.; Zhang, W.; Gu, J.; Ren, H. A vision-based mobile robot localization method. In Proceedings of the 2013 IEEE International
Conference on Robotics and Biomimetics (ROBIO), Shenzhen, China, 12–14 December 2013; pp. 2703–2708.
14. Xu, J.; Dou, Y.; Zheng, Y. Underwater target recognition and tracking method based on YOLO-V3 algorithm. J. Chin. Intertial
Technol. 2020, 28, 129–133.
15. Zhai, X.; Wei, H.; He, Y.; Shang, Y.; Liu, C. Underwater Sea Cucumber Identification Based on Improved YOLOv5. Appl. Sci. 2022,
12, 9105. [CrossRef]
16. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, preprint, arXiv:2107.08430.
17. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile,
7–13 December 2015; pp. 1440–1448.
18. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
19. Cha, Y.J.; Choi, W.; Suh, G.; Mahmoudkhani, S.; Büyüköztürk, O. Autonomous structural visual inspection using region-based
deep learning for detecting multiple damage types. Comput.-Aided Civ. Infrastruct. Eng. 2018, 33, 731–747. [CrossRef]
20. Karami, E.; Prasad, S.; Shehata, M. Image matching using SIFT, SURF, BRIEF and ORB: Performance comparison for distorted
images. arXiv 2017, preprint, arXiv:1710.02726.
21. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
22. Shu, C.W.; Xiao, X.Z. ORB-oriented mismatching feature points elimination. In Proceedings of the 2018 IEEE International
Conference on Progress in Informatics and Computing (PIC), Suzhou, China, 14–16 December 2018; pp. 246–249.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.