Sensors 22 07603 v2
Sensors 22 07603 v2
Article
A Two-Mode Underwater Smart Sensor Object for Precision
Aquaculture Based on AIoT Technology
Chin-Chun Chang 1 , Naomi A. Ubina 1,2 , Shyi-Chyi Cheng 1, * , Hsun-Yu Lan 3 , Kuan-Chu Chen 1
and Chin-Chao Huang 1
Abstract: Monitoring the status of culture fish is an essential task for precision aquaculture using a
smart underwater imaging device as a non-intrusive way of sensing to monitor freely swimming
fish even in turbid or low-ambient-light waters. This paper developed a two-mode underwater
surveillance camera system consisting of a sonar imaging device and a stereo camera. The sonar
imaging device has two cloud-based Artificial Intelligence (AI) functions that estimate the quantity
and the distribution of the length and weight of fish in a crowded fish school. Because sonar images
can be noisy and fish instances of an overcrowded fish school are often overlapped, machine learning
technologies, such as Mask R-CNN, Gaussian mixture models, convolutional neural networks, and
semantic segmentation networks were employed to address the difficulty in the analysis of fish
in sonar images. Furthermore, the sonar and stereo RGB images were aligned in the 3D space,
Citation: Chang, C.-C.; Ubina, N.A.; offering an additional AI function for fish annotation based on RGB images. The proposed two-mode
Cheng, S.-C.; Lan, H.-Y.; Chen, K.-C.; surveillance camera was tested to collect data from aquaculture tanks and off-shore net cages using a
Huang, C.-C. A Two-Mode cloud-based AIoT system. The accuracy of the proposed AI functions based on human-annotated
Underwater Smart Sensor Object for fish metric data sets were tested to verify the feasibility and suitability of the smart camera for the
Precision Aquaculture Based on AIoT estimation of remote underwater fish metrics.
Technology. Sensors 2022, 22, 7603.
https://doi.org/10.3390/s22197603 Keywords: sonar images; stereo RGB images; Mask R-CNN; gaussian mixture models; convolutional
Academic Editors: Nunzio neural networks; semantic segmentation networks; object detection CNN
Alberto Borghese and
Matteo Luperto
the environmental and fish conditions in the cage. Its goal is to enable farmers to make
intelligent decisions by providing objective information to improve their capability to mon-
itor and control factors that involve fish production; thus, farming decisions are adjusted
to improve fish health and maximize farm production. Large and modern aquaculture
farms must incorporate technological innovation to automate their processes, minimize
workforce requirements, and maximize their fish feeding process. It enables farmers to
integrate technology and data-driven decisions making, enabling efficient aquaculture
farm management and remote monitoring, especially for farms situated in the open sea.
The use of machine learning and computer vision in Artificial Intelligence (AI), together
with sensors and Internet of Things (IoT) technologies, have been widely used to monitor
fish feeding behavior, disease, and growth as a non-invasive method, thereby enabling
objective observation of the fish farm. Such a mechanism also allows data collection and
real-time image acquisition using reliable wireless communication channels [4] without
relying so much on human intervention [5].
Various sensors such as temperature, position, humidity, flow, and photo optic or
camera sensors have changed how the world accesses data from remote locations. These
devices have bridged the gap in collecting data from the physical environment and trans-
mitting wirelessly to a platform with a network of remote servers for storage, management,
and data processing [6]. Data collection from physical environments can be carried out
by other means such as underwater vehicles [7]. The advancement of cloud computing
and IoT has brought tremendous innovation and improvement to aquaculture farming.
Cloud computing services enable the collection and storage of big data for processing using
AI methodologies capable of predictive analysis to provide informed decision-making
mechanisms for precise aquaculture. It enables a brand-new farming approach [8] that
eases the burden of the farming industry in terms of monitoring.
For aquaculture farms, it is vital to monitor the fish growth and population as an
essential parameter to approximate fish food and assess the overall wellness of the fish
species. To achieve the goal of smart aquaculture, fish counting and body length estimation
using underwater images are essential to estimate the fish growth curve [9,10]. Cameras
as sensors can now be used to capture underwater fish images in an off-shore cage in a
non-intrusive manner that reduces the manual handling of the fish, thus reducing direct
contact that can cause stress, injury, and growth disturbance to the fish species in the cage.
In addition, sonar and stereo cameras for data collection and computer vision can estimate
the fish’s biological information. Sonar and RGB cameras, such as stereo systems, are just
one of the most widely used and studied systems for underwater environment monitoring.
In the underwater environment where the lighting condition is poor or low, RGB
cameras are limited. In contrast, sonar cameras are more robust concerning the issue
of light attenuation and water turbidity that severely affects optical sensors. In terms
of area to cover in capturing the underwater environment, sonar cameras have more
scope and a higher range than stereo cameras, as shown in Figure 1. In addition, a sonar
camera provides a depth reference value for 3D images, further improving the length
or size estimation accuracy. Various studies also dealt with using sonar systems and
their applicability to fish length estimation [11–13]. A 3D sonar camera allows direct
representation of a scene’s 3D information, drastically reducing or no longer requiring
the 3D reconstruction process from 2D views, making it more viable for real-time data
processing [14]. However, the cost of a high-resolution 3D sonar camera is expensive. To
meet the cost concerns of sensors for aquaculture management, in this study, we proposed a
fusion of using a low-cost sonar imaging device and a stereo camera system for aquaculture
fish monitoring.
Figure 2 shows the framework of our AIoT system where the two camera sensors
(sonar and stereo camera) were deployed and installed in the aquaculture farm to collect
images/videos from the site. In addition, these sensors were equipped with wireless
transmission capabilities to send the collected data to the AI cloud services, where each
Sensors 2022, 22, x FOR PEER REVIEW 3 of 29
Figure 2 shows the framework of our AIoT system where the two camera sensors
(sonar and stereo camera) were deployed and installed in the aquaculture farm to collect
images/videos from the site. In addition, these sensors were equipped with wireless trans-
(a) Sonar imagemission capabilities to send the collected data(b) Stereo
to theimage pair services, where each of the
AI cloud
trained1.deep
Figure learningfrom
and the
machine learning ismodels performed the for
necessary AI function
Figure 1.Data
Datacaptured
captured from sensor
the devices
sensor devices transmitted to the
is transmitted tocloud
the cloud storage and big
for storage data
and big
for a specific application.
analytics.
data analytics.
Figure 2 shows the framework of our AIoT system where the two camera sensors
(sonar and stereo camera) were deployed and installed in the aquaculture farm to collect
images/videos from the site. In addition, these sensors were equipped with wireless trans-
mission capabilities to send the collected data to the AI cloud services, where each of the
trained deep learning and machine learning models performed the necessary AI function
for a specific application.
The framework
Figure 2. The
Figure framework of
of our
our proposed
proposed AIoT
AIoT technology
technology for
for smart Underwater
Underwater surveillance
surveillance for
for
precision aquaculture.
precision
Our proposed sensor fusion comprises four steps: data collection from the aquacul-
ture sites using our sonar and stereo camera system sensors, 3D-point cloud estimation,
overlapping detection, and object detection to integrate AI functions; the details of these
are discussed in the subsequent section. In this work, we used the sonar camera system as
the primary
Figure sensor device
2. The framework forproposed
of our collecting depth
AIoT information
technology from
for smart the targetsurveillance
Underwater fish objects
forto
precision aquaculture.
Sensors 2022, 22, 7603 4 of 29
perform fish metric estimation, specifically for fish length and fish count. It uses its beam to
collect fish information by sending multiple sound waves to the scene. Thus, it can capture
the real environment’s depth information. Sonars use depth information to form an image
object much different from an optical image.
Although sonar devices, as mentioned earlier, provide bigger coverage, they do not
have texture and color information since they just provide depth information. Due to
refraction, the shape of the captured fish sonar images is affected. Thus, they can only map
macro-features due to their limited resolution [15]. The stereo camera system addresses this
concern or limitations of the sonar camera system. These two devices can work together to
provide a clearer picture of the underwater fish object in the aquaculture cages or ponds.
Since sonar images lack color information, we used the RGB images captured from the
low-cost sonar camera to provide additional functions for fish-type annotation.
One of the challenges of sensor fusion is to detect the common area of each sensor
or their corresponding images, and considering the environment is underwater, more
problems arise. Additionally, the target objects in the sonar and stereo camera systems have
different positions, so a mechanism should be devised to project the same target image
into the same plane. Incorporating transformation in their corresponding rotation and
translation vectors will map the features from sonar to optical coordinate system using
the extrinsic sensor calibration method. Since each sensor’s range is different, the target
object and its corresponding shape should be recognized by both [14]. In this work, we
proposed a method to determine the sensor’s overlapping areas using their corresponding
3D point cloud information so that the sonar images are projected to the stereo images. We
used four markers (four pixels) or points with their corresponding 3D point sets. Both the
sonar and stereo camera systems should be able to detect or distinguish these markers. We
had two phases to integrate plane projection conversion. First, the learning phase obtained
the transformation matrix in rotation and translation based on the marker’s information.
Second, each pixel in the sonar image was transformed into its corresponding pixel in the
stereo image (left image). Then, the result of the transformation was projected into the 2D
space to identify the pixel correspondence for both camera systems. In the testing phase,
for each pixel in the frame of the sonar camera, the transformation matrices were then used
to locate the corresponding pixel in the synchronized frame of the optical camera. Thus,
the common area covered by the sonar and RGB cameras was detected.
The contributions of our paper are the following:
• We proposed an AIoT system that provides sonar and stereo camera fusion that
supports automatic data collection from aquaculture farms and performs artificial
intelligence functions such as fish type detection, fish count, and fish length estimation.
To our knowledge, combining a low-cost sonar and stereo camera system tested in
various aquaculture environments with different AI monitoring functions is a novel
work.
• We designed a methodology to perform sonar and stereo camera system fusion. How-
ever, deploying IoT can be expensive, and to limit the cost of its implementation, we
employed low-cost sensors that do not entail high additional expenses for aquaculture
farmers.
• Using a sonar camera system, we developed our mechanism to estimate the fish’s
length and weight. Additional plugin AI functions can also be deployed in the cloud
to meet the emerging requirements of decision-making for aquaculture management
based on the collected big data sets. Agile development realizes the design of learnable
digital agents to achieve the goal of precision aquaculture.
The paper is structured as follows: Section 2 provides the related works, and Section 3
contains the materials and methods, which detail our approach to addressing the issue
discussed earlier. Sections 4 and 5 include the experimental results and discussion, re-
spectively. Finally, the last section outlines our conclusions and recommendations for
future works.
Sensors 2022, 22, 7603 5 of 29
2. Related Works
IoT and AI have gained popularity in the past few years due to their efficiency and
promising results in various fields. For example, in aquaculture production, they have
been widely used to improve the accuracy and precision of farming operations, facilitate
autonomous and continuous monitoring, provide a reliable decision support system, and
limit manual labor demand and subjective assessment of fish conditions [16]. In addition,
vision cameras such as stereo and sonar systems are popular for computer vision-based
problems with the capability of image processing, object detection and classification, and
image segmentation.
Imaging sonar systems have been applied to many aquaculture applications, such
as fish counting [17,18], estimation of fish length [11], analysis of fish population [19–21],
fish tracking [22], fish detection [23], monitoring of fish behavior [24], and control of fish
feeding [25]. Hightower et al. [12] used multibeam sonar to determine the reliability of
length estimates of some fish species. The sonar was positioned approximately 0.5 m above
the bottom, and the beam was aimed slightly off the bottom. Additionally, using sonar
image analysis and statistical methodologies, a non-invasive procedure using multibeam
sonar was used to count and size the fish in a pond. Simulation software was developed to
calculate the abundance correction factor, which depends on the transducer beam size based
on the pond size [26]. DIDSON acoustic system with an ultra-high-resolution lens was
used to evaluate the accuracy and precision of estimating length from images of tethered
fish insonified at the side aspect. The device used has a good potential for discriminating
sizes among different species [27]. Lagarde et al. [28] integrated an ARIS acoustic camera
to perform counts and size estimates for European eels. Count estimates were performed
using 58 videos. The acoustic camera was installed in a channel that links a lagoon to
the Mediterranean Sea. It was positioned in the narrowest part of the channel and came
53 m wide and 3.5 m deep. A real-time system for scientific fishery biomass estimator was
proposed by Sthapit et al. [29] using a compact single beam advanced echosounder. The
device, composed of a transducer, processing unit, a keypad, and a display unit, analyzes
ping data continuously. In real-time, it calculates various parameters and simultaneously
displays the echogram results on the screen.
Convolutional Neural Networks, or CNNs, have been applied to process sonar images
for many applications, such as detecting objects on the sea floor [30] and counting fish [31].
As illustrated in Figure 3, some characteristics of the fish schools in sonar images are as
follows:
• Fish schools swim in the three-dimensional space;
• In the sonar image, fish close to the sonar system can often be incomplete, and fish
away from the sonar system become blurrier;
• In sonar images, fish are often overlapped, and the location difference of fish in the
direction perpendicular to the sonar beam is indistinguishable [32];
• Annotators are often required to examine successive sonar images to identify fish
in sonar images because they find fish by the change of the pattern and strength
of echoes.
The stereo camera system has also been extensively used in computer vision. For
example, a DeepVision stereo camera system was used by Rosen et al. [33] for continuous
data collection of fish color images passing inside the extension of the trawl. Out of 1729 fish
captured while trawling, 98% were identified in terms of species. Such a mechanism
increases the scope of the information collected specifically on documenting the fine-scale
distribution of individual fish and species overlap. The information that can be drawn
from this can help interpret acoustic data. The underwater stereo video was also used to
determine population counts and spatial and temporal frequencies, incorporating detection
and identification [34]. Stereo vision is also integrated for video-based tracking [35], fish
volume monitoring [36] or abundance [37], and 3D tracking of free-swimming fish [38].
Sensors 2022,
Sensors 22, x7603
2022, 22, FOR PEER REVIEW 6 6ofof29
29
(a) (b)
Figure
Figure3.3.An
Anillustration
illustrationof
ofthe
theregion
regioncovered
coveredby
byan
animaging
imagingsonar
sonarsystem
systemwhere
where(a)(a)shows
showsthat
thatthe
the
imaging sonar system can partly cover the fish, and (b) shows that the fish at different locations in
imaging sonar system can partly cover the fish, and (b) shows that the fish at different locations in
the direction perpendicular to the sonar beam can be overlapped in the sonar image.
the direction perpendicular to the sonar beam can be overlapped in the sonar image.
The stereo
Stereo camera
camera system
systems has
also alsobeen
have beenwidely
extensively
used in used
fishinlength
computer vision. For
estimations ex-
[39–42],
ample, a DeepVision stereo camera system was used by Rosen
using disparity information to provide 3D information about an object [43–45]. In aquacul- et al. [33] for continuous
data
ture,collection
many are now of fish color images
putting passing
their interest andinside
effortsthe extension
into integrating of the
stereotrawl. Out of
cameras for1729
fish
fish captured while trawling, 98% were identified in terms of
length and biomass estimations [40,41,46]. In our previous work, we integrated a low-cost species. Such a mechanism
increases
stereo camerathe scope
system of the information
to perform fish collected specificallyWe
metrics estimation. on used
documenting
a reliablethe fine-scale
object-based
distribution of individual fish and species overlap. The information
matching using sub-pixel disparity computation with video interpolation CNN and tracked that can be drawn
from this can help interpret acoustic
and computed the fish length in each video frame [10].data. The underwater stereo video was also used to
determine population counts and spatial and temporal frequencies,
Through the years, interest in combining various sensors to achieve higher accuracy incorporating detec-
tion
and and identification
efficiency has been [34]. Stereo vision
widespread. Manyis also integrated
studies regardingfor video-based
sensor fusions tracking
have [35],
been
fish volume monitoring
successfully integrated and [36]applied
or abundance
in multiple [37],fields,
and 3D such tracking of free-swimming
as camera-lidar integrationfish for
[38].
semantic mapping [47], driver aid systems for intelligent vehicles [48,49], target tracking
Stereo fish
for robotic camera[50],systems
activity also have been
detection of soundwidely used[51]
sources in fish
andlength
avianestimations
monitoring [39–42],
[52]. An
using
underwater acoustic-optic image matching was proposed by Zhou et al. [53]. Theiraqua-
disparity information to provide 3D information about an object [43–45]. In work
culture,
combined many are now putting
the advantages of CNN their interest
depth andextraction
features efforts into to integrating
determine the stereo
image cameras
visual
for fish length
attribute and biomass
conversion; estimations
the difference [40,41,46].
between In our previous
the acousto-optic images work,
was we integrated
discarded. Theira
low-cost
matchingstereo camera
technique usedsystem to perform
current advanced fish metricsdescriptions
learned estimation. in Wethe used a reliable
generated ob-
target
ject-based matching
image (acoustic) andusing sub-pixel
the original imagedisparity
(optical).computation with videomethod
The data aggregation interpolation CNN
was utilized
and tracked and
in displaying thecomputed
calibrated the fish length
matching in each video
correspondence frame [10].
between the two types of images.
Through the years, interest in combining various sensors to achieve higher accuracy
3. Materials
and efficiencyand hasMethods
been widespread. Many studies regarding sensor fusions have been
3.1. Devices Used
successfully integrated and Experimental
and appliedEnvironments
in multiple fields, such as camera-lidar integration for
semanticFiguremapping
4 shows [47],
thedriver
sonar aid systems we
equipment for used
intelligent
for thevehicles [48,49], target
image capture tracking
with GARMIN
Panoptix
for robotic LiveScope System
fish [50], activity (Garmin
detection of Ltd.,
soundTaiwan),
sources [51]which andincludes a sonar screen,
avian monitoring [52]. Ana
processor, and a sonar transducer probe. The sonar system
underwater acoustic-optic image matching was proposed by Zhou et al. [53]. Their work uses an Intel NUC minicomputer
(Intel Corporation,
combined the advantagesSanta Clara,
of CNN CA, USA)features
depth to collect and analyze
extraction the sonarthe
to determine images
image enclosed
visual
with the conversion;
attribute sonar block the box.difference
Meanwhile, we used
between theaacousto-optic
low-cost camera imagesto set up discarded.
was our stereo
camera
Their system using
matching technique two Go used Pro Hero 8advanced
current devices (GoPro,
learnedSan Mateo, CA,
descriptions in USA). The two
the generated
cameras
target were
image mountedand
(acoustic) in athefixed relative
original imageposition, as shown
(optical). The data in Figure
aggregation5a, with a baseline
method was
or camera distance of 11 cm. A waterproof case was used
utilized in displaying the calibrated matching correspondence between the two types ofto cover the two Go Pro cameras
to protect them from water damage since they would be submerged in the water during the
images.
data capturing. Next, we calibrated the two stereo cameras using the popular checkboard-
based
3. method,
Materials and asMethods
shown in Figure 5b, since the patterns are distinct and easy to detect. The
calibration checkboard has an A4 paper size with a 2.5 cm grid size. The first step of the
3.1. Devices Used and Experimental Environments
sensor fusion relies on the stereo image rectification process of the left and right images
of theFigure
low-cost4 shows
stereothe sonar equipment
cameras. One potential we usedproblemfor the
of aimage
low-costcapture
stereowithimageGARMIN
camera
Panoptix
system is that it is incomplete or incorrectly synchronized, causing the object’s posea to
LiveScope System (Garmin Ltd., Taiwan), which includes a sonar screen, pro-
be
cessor, and a sonar transducer probe. The sonar system uses an
different in the left and right images. As seen in Figure 5c, the checkboard corners serve as Intel NUC minicomputer
(Intel Corporation, points
the corresponding Santa of Clara, CA,and
the left USA)right toimages.
collect and analyze the sonar images en-
closed with the sonar block box. Meanwhile, we used a low-cost camera to set up our
checkboard-based method, as shown in Figure 5b, since the patterns are distinct and easy
checkboard-based
to detect. method,
The calibration as shown
checkboard in an
has Figure 5b, since
A4 paper thewith
size patterns
a 2.5are
cmdistinct andThe
grid size. easy
firsttostep
detect. The
of the calibration
sensor fusion checkboard
relies on thehas an A4
stereo paper
image size with process
rectification a 2.5 cmofgrid size.and
the left The
first step of the sensor fusion relies on the stereo image rectification process
right images of the low-cost stereo cameras. One potential problem of a low-cost stereo of the left and
right images of the low-cost stereo cameras. One potential problem of a low-cost stereo
image camera system is that it is incomplete or incorrectly synchronized, causing the ob-
Sensors 2022, 22, 7603 image camera system is that it is incomplete or incorrectly synchronized, causing the 7 ofob-
29
ject’s pose to be different in the left and right images. As seen in Figure 5c, the checkboard
ject’s pose to be different in the left and right images. As seen in Figure 5c, the checkboard
corners serve as the corresponding points of the left and right images.
corners serve as the corresponding points of the left and right images.
Figure 4. Sonar
Figure
Figure camera
4.4.Sonar
Sonar device
camera
camera used
device
device for for
used
used data
for gathering.
data
data gathering.
gathering.
(c)
(c)
Figure 5. Set-up of the low-cost stereo camera system: (a) the stereo camera; (b) the correction checkboard
for calibrating; (c) the stereo camera calibration based on the warping of the check-board map.
The experimental site has three locations representing indoor and outdoor environ-
ments and less dense and highly dense fish populations. Figure 6 shows the set-up of
the environment and locations with its corresponding fish species. Fish tank A is 4 m in
length, 1 m in width, and 0.8 m in depth, and the fish instances are small. Fish tank B is an
off-shore cage with 50 m in circumference and 25 m in depth. Lastly, tank C with crowded
fish instances is 5.3 m in length, 4 m in width, and 0.8 m in depth.
The experimental site has three locations representing indoor and outdoor environ-
ments and less dense and highly dense fish populations. Figure 6 shows the set-up of the
environment and locations with its corresponding fish species. Fish tank A is 4 m in
Sensors 2022, 22, 7603
length, 1 m in width, and 0.8 m in depth, and the fish instances are small. Fish tank8 of
B 29
is
an off-shore cage with 50 m in circumference and 25 m in depth. Lastly, tank C with
crowded fish instances is 5.3 m in length, 4 m in width, and 0.8 m in depth.
(a) A: AAC-A13, Keelung (b) B: Offshore Cage, Penghu (c) C: LongDann-C10, Pingtung
Species: Oplegnathus punctatus Species: Trachinotus blochii Species: Cephalopholis sonnerati
Figure 6. Experimental
Figure 6. Experimental environments
environments utilized
utilized for
for training
training various
various deep
deep learning
learning models
models for
for fish
fish
length estimation, fish count estimation, and fish type annotation.
length estimation, fish count estimation, and fish type annotation.
For
For simultaneous
simultaneous datadata capture using the
capture using the two
two sensors
sensors for
for calibration,
calibration, we
we used
used aa laptop
laptop
computer as the sonar recording. In contrast, the data captured by
computer as the sonar recording. In contrast, the data captured by the stereothe stereo camera were
camera
savedsaved
were in their
in respective storage
their respective devices.
storage The distance
devices. between
The distance the sonar
between and stereo
the sonar cam-
and stereo
era was was
camera 50 cm.50 The
cm. Go
TheProGocamera, whichwhich
Pro camera, acts asacts
theas
stereo camera
the stereo system,
camera used the
system, usedrg174
the
signal line to confirm that the target object was captured. In contrast, the sonar
rg174 signal line to confirm that the target object was captured. In contrast, the sonar waswas con-
firmed by the
confirmed by human operators.
the human The The
operators. computer device
computer that that
device trains the neural
trains networks
the neural had
networks
an Intel i7-107000k 3.8. GHz CPU, NVIDIA GeForce RTX 3090 GPU, and
had an Intel i7-107000k 3.8. GHz CPU, NVIDIA GeForce RTX 3090 GPU, and 48 GB memory. 48 GB memory.
Figure 7. The sensor fusion consists of a stereo camera, and a sonar imaging device captures the fish
Figure 7. The sensor fusion consists of a stereo camera, and a sonar imaging device captures the fish
images from a pond or a net cage. The 3D point clouds of the common object in the left image and
images from a pond or a net cage. The 3D point clouds of the common object in the left image and
the sonar image can be used to calculate the transformation matrix using a 3D affine transformation
the sonar image can be used to calculate the transformation matrix using a 3D affine transformation
algorithm[54].
algorithm [54].
Figure 7. The sensor fusion consists of a stereo camera, and a sonar imaging device captures the fish
images from a pond or a net cage. The 3D point clouds of the common object in the left image and
Sensors 2022, 22, 7603 9 of 29
the sonar image can be used to calculate the transformation matrix using a 3D affine transformation
algorithm [54].
Four markers
Four markers (A,(A, B,
B, C,
C, D)
D) exist
exist in
in both
both sonar
sonar and
and stereo
stereo images.
images. Each
Each marked
marked point point
has an image coordinate of (u, v). To combine sonar and stereoscopic
has an image coordinate of (u, v). To combine sonar and stereoscopic images using camera images using camera
calibration, we
calibration, we used
used two
two bricks
bricks asas the
the target
target object
object and
and integrated
integrated YOLOv4
YOLOv4 to to mark
mark or or
capture the point coordinates provided by the bounding box of the
capture the point coordinates provided by the bounding box of the disparity conversion. disparity conversion.
For the
For the stereo
stereo image,
image, the
the left
left and
and the
the right
right images
images were
were captured
captured and and underwent
underwent an an image
image
rectification process to to obtain
obtain the
the correct
correct intrinsic
intrinsic and
and extrinsic
extrinsic parameters
parameters using camera
calibration. To
calibration. To find
find the
the corresponding
corresponding pointspoints between
between aa stereo
stereo pair
pair and
and plot
plot them into the
3D space, given a point in the left image, its corresponding
corresponding pointpoint in in the right image
image lies
lies on
its epipolar line. Using a stereo-image-based disparity matching algorithm in finding the
correspondenceofofthe
correspondence theleft
left image
image to to
thethe
rightright image,
image, taketake a pixel
a pixel in thein the
left left image
image and
and search
search
on on the epipolar
the epipolar line forline
thatfor thatinpixel
pixel in theimage.
the right right image.
The pixelThewithpixel with
the the minimum
minimum cost is
cost is selected,
selected, and theand the disparity
disparity can nowcan now be computed.
be computed. The point The point ison
is located located on the line,
the epipolar epi-
which would
polar line, whichonly require
would only a one-dimensional search where
require a one-dimensional cameras
search where need
cameras to be aligned
need to be
along
aligned thealong
samethe axis. To obtain
same axis. Tothe depththe
obtain of depth
a stereoofimage
a stereopair, the disparity
image information
pair, the disparity in-
is the difference
formation in the image
is the difference location
in the imageoflocation
the same of3D
thepoint
same projected using twousing
3D point projected different
two
→
different The
cameras. cameras. Theof
disparity disparity x, y) in⃗ the
a pixel xof=a (pixel = ( left
, )image
in thecanleftbeimage can be
computed bycomputed
obtaining
by obtaining
the differencethe difference between
between
d = x − x0 (1)
= − (1)
where x 0 is the x-coordinate of the corresponding pixel in the right image. Once the
→
disparity value of the pixel x in the left image has been computed, the depth value from
disparity d and its 3D coordinates XO = ( xO , yO , zO ) can be determined using triangulation:
zO = f ∗ b/d
xO = ( x − c x ) ∗ zO / f (2)
yO = y − cy ∗ zO / f
where f is the focal length of the camera, b is the baseline, defined as the distance between
the centers of the left and right cameras, and (c x , cy ) is the center of the projected 2D plane.
Similarly, the 3D point P3D (r, θ, ϕ) of the spherical coordinates where θ is the azimuth
direction, and ϕ is the spread in the elevation direction and can be expressed in Cartesian
coordinates as follows:
x r cos θ cos ∅
X = y = r sin θ cos ∅ (3)
z r sin ∅
→
The 2D point x = (u, v) projected on the sonar image plane is expressed as follows:
→ u 1 x r cos θ
x = = = (4)
v cos ∅ y r sin θ
Thus, in the 2D sonar images, the information of the elevation angle and, therefore,
the height of information of the target fish objects cannot be identified. For the target object
Sensors 2022, 22, 7603 10 of 29
in the underwater environment, critical points refer to the shortest distance points, where
the sonar’s acoustic beams reflect on the object. The critical position in the jth acoustic
beam is expressed as rcp ( j) and θcp ( j) using the sonar system’s local coordinate, which can
be calculated using:
q
Local ( j )
xcp s
rcp ( j) 1 − sin2 t + 2 − sin2 θcp ( j)
Local
ycp ( j) =
(5)
rcp ( j) sin θcp ( j)
Local
zcp ( j) s
rcp ( j) sin t + 2
where t and s are the imaging sonar’s tilt angle and spreading angle, respectively. Since
the imaging sonar is tilted by an angle t, azimuth angle θ cp ( j), it differs from the azimuth
angle of the spherical coordinate. Hence, xcp Local ( j ) can be calculated using y Local ( j ) and
cp
Local ( j ). The critical point’s position in the global coordinates can be expressed using the
zcp
rotation matrix R = Rz Ry R x and the position of the imaging sonar ( xS,W , yS,W , zS,W ) in
terms of the world coordinate system is represented using:
Local ( j )
xcp
xW ( j ) xS,W
Local
yW ( j) = yS,W + R ycp ( j) (6)
zW ( j ) zS,W Local ( j )
zcp
where R represents a 3D rotation transformation matrix to determine the roll angle, pitch
angle, and yaw angle of the imaging sonar. The 3D point cloud of the sonar scene can be
generated by accumulating the calculated coordinates while scanning [56].
Once the corresponding pixel pairs between the sonar image and the stereo images are
detected, the 3D coordinates of the matched points between sonar and stereo images are com-
puted based on the above 3D coordinates computing scheme. Let XO,A = ( xO,A , yO,A , zO,A )
and XS,A = ( xS,A , yS,A , zS,A ) be the 3D coordinates of the common point A generated
from the stereo images and the sonar image, respectively. Obviously, XO,A 6= XS,A since
they locate point A in different 3D coordinate systems. Let, XW,A = ( x A , y A , z A ) be the
3D coordinates of the point A in the world coordinate system. Then we can apply the
following 3D transformation to transform XS,A or XO,A into XW,A :
where RO→W (RS→W ) is the 3D rotation matrix that aligns the z-axis of the optical camera
(sonar camera) coordinate system with the z-axis of the world coordinate system; TO (TS )
is the translation vector that locates the center of the optical camera (sonar camera) in the
world coordinate system. Equation (7) can be rewritten as
where xS and yS are the x-coordinate and the y-coordinate of the position of the sonar
device in the world coordinate system. The value of yS equals to 0 when we set the origin
630 cm × 600 cm fish pond. Thus, each pixel in the sonar image would occupy an area of
0.583 cm × 0.3125 cm in the fish pond. If a sonar pixel occupies × ℎ cm in the fish
pond, the 3D coordinates of the pixel ⃗ = ( , ) can be calculated as
,⃗ = − ( ∗ ), , ∗ℎ (9)
Sensors 2022, 22, 7603 11 of 29
where and are the x-coordinate and the y-coordinate of the position of the sonar
device in the world coordinate system. The value of equals to 0 when we set the origin
of the world coordinate system to be [0, , 0] where is the depth of the sonar device.
of
Inthe worldwe
practice, coordinate system
can obtain to beof[0, d,by0]putting
the value where da is the depth
depth sensorofinthe
thesonar
centerdevice.
of the
In practice,
sonar camera.we can obtain the value of d by putting a depth sensor in the center of the
sonar camera.
Figure9.9.An
Figure Anexample
example
of of
thethe 1080
1080 × 1092
× 1092 sonarsonar images
images captured
captured from afrom a 630
630 cm cmcm
× 600 × 600
fishcm fish
pond.
pond.
Given the pixel coordinates of A, B, C, D in the sonar image, Equation (9) thus
Given
generates the
the 3Dpixel
point coordinates
set [ XS,A , X ofS,BA,
, XB,S,CC,, XDS,Din].the sonar image,
Similarly, Equation (9) 3D
the corresponding thuspoint
gen-
erates
set the
[ XO,A , X3D
O,B , point
XO,C , set
XO,D[] can
, , be, ,
computed
, , , ]. Similarly,
based on thethe corresponding
four pixels in the 3D
stereo point set
images
[ , , the, ,disparity
using , , , ]computing
can be computed
algorithm based on the four
mentioned pixels Equation
above. in the stereo
(8) images
can nowusing
be
the disparity
written as computing algorithm
mentioned above. Equation (8) can now be written as
XO,A XS,A
XO,B
= RS→O XS,B + TS→O
XO,C XS,C (10)
XO,D XS,D
To solve the unknown parameters θ = ( RS→O , TS→O ), we can apply any optimization
scheme to minimize the following loss function:
1
Lθ = ∑ ||XO,i − RS→O XS,i + TS→O ||2
4 i∈[ A,B,C,D
(11)
]
where || X ||2 is the L2 norm of the vector X. Although the object detection CNN for the
common object detection has been proved to be accurate, the 2D coordinates of the detected
matched points in both the sonar image and the stereo images still contain some errors.
This error implies that the resulting 3D point pairs contain noises that reduce the reliability
of the parameters θ. To further improve the quality of the learned parameters, the loss
function for minimization can be rewritten as
N
1
Lθ =
4N ∑ ∑ XO,ij − RS→O XS,ij + TS→O 2
(12)
i =1 j∈[ Ai ,Bi ,Ci ,Di ]
3.3. Sonar and Stereo Camera Fusion for Fish Metrics Estimation
Figure 10 shows the block diagram of the proposed fish metrics estimation using the
sonar and stereo camera fusion and the cloud-based AI functions. First, as mentioned
above, we can compute the 3D point clouds PS and PO for each captured sonar image IS
and its synchronized stereo image pair IO , respectively. Next, the overlapping detection
module is applied to detect the area of the monitored fish pond or cage both cameras
3.3. Sonar and Stereo Camera Fusion for Fish Metrics Estimation
Figure 10 shows the block diagram of the proposed fish metrics estimation using the
sonar and stereo camera fusion and the cloud-based AI functions. First, as mentioned
above, we can compute the 3D point clouds and for each captured sonar image
Sensors 2022, 22, 7603 12 of 29
and its synchronized stereo image pair , respectively. Next, the overlapping detection
module is applied to detect the area of the monitored fish pond or cage both cameras
watch. Finally, the two 3D point clouds are inputted simultaneously into the overlapping
watch. Finally,
detection modulethe to
two 3D point
identify clouds
their are inputtedFigure
correspondence. simultaneously
11 shows into the overlapping
the overlapping area
detection
of each camera system that was converted into the 3D point cloud discussed in thearea
module to identify their correspondence. Figure 11 shows the overlapping of
previ-
each camera system that was converted into the 3D point cloud discussed
ous subsection. The two-mode fish count estimation could be an added feature to supportin the previous
subsection. The two-mode
the sonar camera fish count
for type-specific fishestimation could beif an
count estimation theadded feature
monitored topond
fish support the
or cage
sonar camera for type-specific
contains multiple types of fish. fish count estimation if the monitored fish pond or cage
contains multiple types of fish.
(a) (b)
Figure 11. Schematic
Figure 11.diagram of overlapping
Schematic area detection.
diagram of overlapping area detection.
The area
The area covered bycovered by camera
the optical the optical camera isbycontained
is contained the sonar by the sonar
device. Once device.
the Once
the transformation parameters
transformation parameters = ( → , → ) are obtained,
θ = ( R S →O , T the
S→ ) are obtained, the first
O first step of our overlap- step of our
overlapping area detection is to compute the
ping area detection is to compute the transformed point cloud: transformed point cloud:
...
= →P S =+RS→ →O PS + TS→O (13) (13)
Next, we compute the overlapped point cloud:
Next, we compute the overlapped point cloud:
= ∧ ... (14)
P̂S = PS ∧ P S (14)
For each 3D point , ⃗ = ( , , ) in , we can then estimate the 2D coordinates of
the pixel ⃗ = ( For
, ) each
as: 3D point X̂ → = ( x, y, z) in P̂S , we can then estimate the 2D coordinates of
S, x
→
the pixel x = (u, v) as: ⃗ = [( − )/ , /ℎ] (15)
→
x = [( x − x )/w, z/h] (15)
where w and h are the width and the height of a pixel inSthe sonar image, respectively;
is the x-coordinate of the position of the sonar device in the world coordinate system.
Finally, the bounding box to crop the sonar image is defined by the two corner pixels:
⃗ = min , min
⃗∈ ⃗∈
(16)
Sensors 2022, 22, 7603 13 of 29
where w and h are the width and the height of a pixel in the sonar image, respectively; xS is
the x-coordinate of the position of the sonar device in the world coordinate system. Finally,
the bounding box BS to crop the sonar image is defined by the two corner pixels:
" #
→
x = →min ui , →min vi
lu
x i ∈ P̂S x i ∈ P̂S
" # (16)
→
x rb = →max ui , →max vi
x i ∈ P̂S x i ∈ P̂S
3.3.1. Estimation of Fish Standard Length and Weight Using Sonar Image
The distributions of the standard length and weight of fish are essential to assessing
the health and growth of the fish culture. As Figure 12 shows, there are four main steps for
estimating those two distributions:
• Apply Mask R-CNN to identify fish instances in each frame of the input sonar video.
The standard length of an identified fish instance is estimated by the distance between
the two farthest points on this instance.
• Apply the EM algorithm [57] to learn a GMM for the distribution of the length of the
identified fish instance. The GMM for the distribution of the length x can be expressed
as follows:
c
p (x) = ∑ wi N x; µi , σi2 (17)
i =1
where c denotes the weight of the ith Gaussian components, wi denotes the weight of the
2
ith Gaussian component, and N x; µi , σi denotes the probability density of the Gaussian
distribution with the mean of µi and variance σi2 . The probability of sample x from the ith
Gaussian components, denoted by p( Gi | x ), can be estimated by:
wi N x; µi , σi2
p( Gi | x ) = (18)
∑ic=1 wi N x; µi , σi2
The sample x belongs to the ith Gaussian component Gi if p( Gi | x ) is the largest among
p( Gi | x ), i = 1, . . . , c . Given the number of Gaussian components c, the EM algorithm
can find the parameters wi, µi and σi2 for each of the c components through maximum-
likelihood estimation. In this paper, a non-Gaussianity criterion Φ (c) was defined in terms
of the standardized skewness, and kurtosis [58] was adopted to determine the number c of
Gaussian components:
1 3 1 4
1 c ∑ x ∈ Gi ( x − µ i ) ∑
| Gi | x ∈ Gi ( x − µ i )
|Gi |
c i∑
Φ (c) = 3
+ −3 (19)
=1 σ i σi4
The EM algorithm was applied with the number of Gaussian components ranging
from one to five. Then, the GMM with the least non-Gaussianity criterion Φ (c) was selected
for the subsequent analysis.
• Select the Gaussian component Gi∗ with the largest component weight as the compo-
nent comprising a single fish instance. Then, output the statistics of the fish length
in Gi∗ .
• Apply K-nearest neighbor regression with the training set, where the length and
weight of the fish are measured manually to estimate the weight using the fish length
in Gi∗ . This paper set parameter K for the K-nearest neighbor regression to 5.
lected for the subsequent analysis.
Select the Gaussian component ∗ with the largest component weight as the com-
ponent comprising a single fish instance. Then, output the statistics of the fish length
in ∗ .
Sensors 2022, 22, 7603
Apply K-nearest neighbor regression with the training set, where the length and
14 of 29
weight of the fish are measured manually to estimate the weight using the fish length
in ∗ . This paper set parameter K for the K-nearest neighbor regression to 5.
Figure12.
Figure 12.The
Theflowchart
flowcharttotoestimate
estimatethe
thefish
fishlength
lengthand
andweight
weightdistribution.
distribution.
3.3.2.
3.3.2.Estimation
Estimationofofthe
theQuantity
QuantityofofFish
Fishininan
anOff-Shore
Off-ShoreNetNetCage
CageUsing
UsingSonar
SonarImage
Image
The n
The quantity f ish of fish in an off-shore net cage is estimated using the volumeofofthe
quantity of fish in an off-shore net cage is estimated using the volume the
fish
fish schoolthat
school thatisisswimming
swimmingon onthe
thewater
watersurface
surfaceand
andgrabbing
grabbingfood
foodpellets
pelletsasasfollows:
follows:
V×δ
n f ish = (20)
Vf ish
where δ denotes the average fish density of the fish school, and V and Vf ish represent
the volume of the fish school and the volume of the space occupied by a fish instance,
l l
respectively. In this paper, Vf ish is roughly estimated by Vf ish = l f ish × f2ish × f2ish , which
is the volume of the cuboid covering the space of a fish instance, where l f ish denotes the
average length of the fish instance and is measured beforehand. The volume V and the
average length fish density d of the fish school are estimated by the average normalized
pixel value of the fish region F in the sonar image as follows:
1
δ=
gmax × |F | ∑ g( x ) (21)
x∈F
where gmax is the maximum pixel value in the fish region and |F | denotes the number of
pixels in F .
There are several ways to scan the fish school to estimate their volume using the sonar
system. For example, it can rotate and sideway scan the fish school. In this paper, for
simplicity purposes, the fish school was analyzed without rotating the sonar beam. The
sonar beam in Figure 13a passes through the fish school in a slantwise position, where the
angle between the sonar beam and the seaplane is θ. Meanwhile, in Figure 13b, the space
of the fish school when the fish is grabbing pellets is enclosed by an irregular prism, and
the volume of the fish school is estimated by the volume of its irregular prism and can be
expressed as:
V = A×d (22)
simplicity purposes, the fish school was analyzed without rotating the sonar beam. The
sonar beam in Figure 13a passes through the fish school in a slantwise position, where the
angle between the sonar beam and the seaplane is . Meanwhile, in Figure 13b, the space
of the fish school when the fish is grabbing pellets is enclosed by an irregular prism, and
Sensors 2022, 22, 7603 the volume of the fish school is estimated by the volume of its irregular prism and can 15
beof 29
expressed as:
= × (22)
where A is the area of the fish regions in the sonar image projected onto the sea plane and
where
d is theisdepth
the area of the fishof
information regions inschool.
the fish the sonar image projected onto the sea plane and
is the depth information of the fish school.
(a) (b)
Figure 13. Illustration
Figure of the
13. Illustration net net
of the cage andand
cage thethe
fishfish
school with
school thethe
with imaging
imaging sonar
sonarsystem, where
system, where(a)(a) is
is the angle between the water and the plane of the sonar beam is ; and (b) is the feeding
the angle between the water and the plane of the sonar beam is θ; and (b) is the feeding fishfish school
school
showing is grabbing pellets during feeding which is enclosed by an irregular prism.
showing is grabbing pellets during feeding which is enclosed by an irregular prism.
TheThe
pattern of the
pattern fishfish
of the school in the
school sonar
in the image
sonar imagewhen the the
when fishfish
gathers andand
gathers swims
swims
toward the fish surface to grab the pellets is different compared with when the
toward the fish surface to grab the pellets is different compared with when the fish disperses, fish dis-
perses, as shown
as shown in Figure
in Figure 14.this
14. In In this
work,work, a CNN
a CNN is first
is first applied
applied to find
to find thethe frame
frame in the
in the sonar
sonar video where the fish school gathers and grabs the pellets. Then,
video where the fish school gathers and grabs the pellets. Then, the fish region F in the fish region ℱ the
in the said
said frame
frame is identified
is identified using
using a semantic
a semantic segmentation
segmentation network;the
network; thedetails
detailsofofthese
thesetwo
twoneural
neuralnetworks
networkswill willbebepresented
presentedlater.
later. The
The area
area A of ofthe
thebottom
bottomof ofthe
theprism
prismandand the
the depth
depth of
of the
the fish
fish school
schoolcancanbe
beestimated
estimatedby: by:
=A =| Fℱ|××∆ ∆ × × ),
x ∆y∆× cos
× cos(
( θ ), (23)(23)
= d = ymax
× ×∆ ∆×y×sin(
sin(),
θ)
Sensors 2022, 22, x FOR PEER REVIEW 16 of 29
where
where ymax denotes
denotes thethe
bottom
bottomrow
rowofofthis
thisregion and ∆
regionand ∆ denote
and ∆
∆ x and y
denotethe
thewidth
widthand
andheight
heightofofaapixel
pixelinincentimeters,
centimeters,respectively.
respectively.
(a) (b)
(c) (d)
Figure
Figure14.
14. Sonar
Sonarimages
imagesofofaafish
fishschool
schoolin
inan
anoff-shore
off-shorenet
netcage,
cage,where
where(a)
(a)and
and(b)
(b)show
showthe
thefish
fish
dispersing;
dispersing;and
and(c)
(c)and
and(d)
(d)show
showfish
fishswimming
swimming toward
toward the
the water
water surface
surface to
to grab
grab feed
feed pellets.
pellets.
Figure
Figure15 15shows
showshow
howto toestimate
estimatethethequantity
quantityof offish
fishin
inan
anoff-shore
off-shorenetnetcage.
cage.For
Forthe
the
first
firststep,
step,ititconstructs
constructsthe
theinput
inputforforthe
thesubsequent
subsequenttwo twoCNNs
CNNsby bystacking
stackingfive
fivesuccessive
successive
frames
frames toto form
form a five-channel image.
image. These
These five
fiveframes
framesare arethe
thetarget
targetframe,
frame,twotwo preced-
preceding,
ing,
andandtwotwo succeeding
succeeding frames
frames of target
of the the target frame.
frame. Next, Next, the CNN
the CNN presented
presented in Section
in Section 3.3.3
is applied
3.3.3 is appliedto determine if the
to determine given
if the frame
given frameis is
a afish-gathering
fish-gatheringframe.
frame. If target frame
If the target frame
is classified as a fish-gathering frame, the CNN in Section 3.3.4 is applied to segment the
fish region in the target frame. Equations (20), (21), and (22) are then used to estimate the
fish quantity. The neural network architectures of the two CNNs are described in the suc-
ceeding subsections.
dispersing; and (c) and (d) show fish swimming toward the water surface to grab feed pellets.
Figure 15 shows how to estimate the quantity of fish in an off-shore net cage. For the
first step, it constructs the input for the subsequent two CNNs by stacking five successive
frames to form a five-channel image. These five frames are the target frame, two preced-
Sensors 2022, 22, 7603 16 of 29
ing, and two succeeding frames of the target frame. Next, the CNN presented in Section
3.3.3 is applied to determine if the given frame is a fish-gathering frame. If the target frame
is classified as a fish-gathering frame, the CNN in Section 3.3.4 is applied to segment the
is classified
fish region inasthea fish-gathering frame, the(20),
target frame. Equations CNN in Section
(21), and (22)3.3.4 is applied
are then toestimate
used to segmentthethe
fish region in the target frame. Equations (20), (21), and (22) are then used to
fish quantity. The neural network architectures of the two CNNs are described in the suc- estimate
the fish subsections.
ceeding quantity. The neural network architectures of the two CNNs are described in the
succeeding subsections.
Figure 15. The flowchart of estimating the quantity of fish in an off-shore net cage.
Figure 15. The flowchart of estimating the quantity of fish in an off-shore net cage.
3.3.3. CNN for Detecting the Fish-Gathering Frame
3.3.3. CNN for Detecting the Fish-Gathering Frame
Figure 16 shows the neural network architecture of the CNN for detecting the fish-
Figure 16 shows the neural network architecture of the CNN for detecting the fish-
gathering frames. The input for the CNN is a five-channel image comprising five success-
gathering frames. The input for the CNN is a five-channel image comprising five successful
ful sonar image frames. The kernel size of the first ten convolutional layers of the CNN is
sonar image frames. The kernel size of the first ten convolutional layers of the CNN is all
all of size 3 × 3 with a corresponding activation function ReLu. Meanwhile, the last three
of size 3 × 3 with a corresponding activation function ReLu. Meanwhile, the last three
layers also have a ReLu activation function, and a sigmoid was incorporated into the last
layers also have a ReLu activation function, and a sigmoid was incorporated into the last
1 × 1 convolution layer.
1 × 1 convolution layer.
90 90
1
90 90 G lobalM ax
Conv2D 1x1 Conv2D 1x1 Conv2D 1x1
90 90 M axPooling2D Conv 2D C onv2D Pooling2D
M ax Pooling2D Conv2D Conv2D
90 90
M ax Pooling2D Conv2D Conv2D
90 90
M axPooling2D Conv2D Conv2D
5 45 45
I nput Conv2DConv2D
Figure 16. The neural network architecture for detecting fish-gathering frame.
Figure 16. The neural network architecture for detecting fish-gathering frame.
3.3.4. Semantic Segmentation Network for Segmenting Fish Regions
3.3.4. Semantic Segmentation Network for Segmenting Fish Regions
The semantic segmentation networks’ neural network architecture to segment fish
The semantic
regions segmentation
in the sonar networks’
image is shown neural
in Figure network
17. The neuralarchitecture
network is basedto segment fish
on the U-Net
regions in the sonar image is shown in Figure 17. The neural network is based
architecture [59]. The input of this CNN is a five-channel image comprising five successive on the U-
Net architecture [59]. The input of this CNN is a five-channel image comprising
sonar image frames. In the semantic segmentation network, the transposed convolutional five suc-
cessive
layer sonar image (2,2)
with strides frames. In the semantic
is adopted segmentation
for up-sampling. network,
The kernel sizethe
of transposed con-
the convolutional
volutional
layer andlayer with stridesconvolutional
the transposed (2,2) is adopted for up-sampling.
layer is 3 × 3. The The kernel of
activation size
theoflast
the later
con- is
volutional
softmax, layer andactivation
and the the transposed
functionconvolutional layer isReLu.
of the other layers 3 × 3. The activation of the last
later is softmax, and the activation function of the other layers ReLu.
regions in the sonar image is shown in Figure 17. The neural network is based on the U-
Net architecture [59]. The input of this CNN is a five-channel image comprising five suc-
cessive sonar image frames. In the semantic segmentation network, the transposed con-
volutional layer with strides (2,2) is adopted for up-sampling. The kernel size of the con-
Sensors 2022, 22, 7603 volutional layer and the transposed convolutional layer is 3 × 3. The activation of the last
17 of 29
later is softmax, and the activation function of the other layers ReLu.
3232
512 512 512 32
512 512 512 32
256256256 32
128128 32
5 6464 3232
2
I nput
Figure 17. The neural network architecture for segmenting fish regions in the sonar image.
Figure 17. The neural network architecture for segmenting fish regions in the sonar image.
3.4. Object Detection for Fish Type Identification and Two-Mode Fish Counting

The fish in the left image of the input stereo image pair are detected by any object detection CNN, e.g., YOLOv4. The object detection results can be used to annotate the fish type, since the types of fish are given in the training dataset used to train the object detection CNN. Let c_i and c_total be the fish count of the i-th type and the total count of fish detected in the RGB image, respectively. As mentioned above, the sonar image estimates the number of fish without information on fish types. To deal with this difficulty, our two-mode fish counting algorithm estimates the count of the i-th type of fish as:

C_i = C_sonar × c_i / c_total    (24)

where C_sonar is the total fish count estimated from the sonar image.
In this study, we focused on the design of the two-mode smart sensor, which consists of a sonar scanning device and a stereo optical camera. The captured images are sent to the cloud using a wireless communication network. Although the object detection CNN itself is not new, a new CNN architecture for underwater object detection could be designed to improve the accuracy of the fish-type distributions obtained with (24). Note that the functionality of the smart sensor is incremental: a new AI function can be added to the cloud to provide a new sensor-fusion service.
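To make Equation (24) concrete, the following sketch distributes a sonar-based total count over the fish types seen by the RGB detector. The function and the example labels are illustrative placeholders, not part of the paper's implementation.

```python
from collections import Counter

def two_mode_fish_counts(sonar_total, rgb_detected_types):
    """Distribute the sonar-based total over detected types, following Eq. (24):
    C_i = C_sonar * c_i / c_total. `rgb_detected_types` is one label per RGB detection."""
    type_counts = Counter(rgb_detected_types)            # c_i for each type i
    c_total = sum(type_counts.values())                  # total RGB detections
    return {t: sonar_total * c / c_total for t, c in type_counts.items()}

# Illustrative example: 2500 fish counted by sonar; the RGB detector sees
# 40 detections of one species and 15 of another.
print(two_mode_fish_counts(2500, ["species_a"] * 40 + ["species_b"] * 15))
```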
4. Experimental Results
4.1. Sonar and Stereo Camera Fusion Results
The fish objects detected in the sonar images are mapped into the stereo camera image as an area of interest. In Figure 18, we conducted an experiment using two bricks as target objects representing fish to determine the object detection capability of our proposed method. Sonar images cover a wider range than stereo images, so the positions of the target objects in the two frames are entirely different. Nevertheless, based on the detection results for both camera systems, our approach identified the target objects across the different sonar and stereo image frames.
On the other hand, the fish detection in Figure 19a identified the same number of fish objects in the sonar image as in the stereo image, and all of them were correctly mapped, detected, and annotated in the stereo image in Figure 19b.
To ensure that our mechanism detects the correct object in both sonar and stereo images, we overlaid a bounding box that shows the area covered by the stereo image on the sonar image in Figure 20, where (a) shows the sonar image with the corresponding area covered by the stereo image, while (b) shows the entire stereo image area that appears in part of the sonar image. The images were taken from our various aquaculture locations.
Figure 18. Detection results (marker) of the target object from (a) sonar and (b) stereo images.
Figure 19. Fish classification results using (a) sonar and (b) stereo images.
4.2. Estimation of Fish Standard Length and Weight Using Sonar Images
Figure 21 shows the fish instances of a sonar image detected by Mask R-CNN. Table 1 shows that the true positive rates of Mask R-CNN for the three experimental environments were approximately 85, 90, and 75%, respectively. Environment C incurred the lowest true positive rate, which was affected by the crowded environment of the fish cage. The numerical values in the image represent the estimated fish lengths.
Table 1. The true positive rate of Mask R-CNN for different experimental environments.

Environment    True Positive Rate
A              85%
B              90%
C              75%

Figure 21. Fish detection using Mask R-CNN and length estimation results using the sonar camera system.
Table 2, on the other hand, shows that the relative errors of the estimated average length and weight can be reduced by applying GMMs. The length and weight of each fish in the tank were measured in all environments. We compared the distributions of all estimated data, the estimated data incorporating the GMM, and the ground truth (Figure 22), all presented in Table 2. The t-test and the Bartlett test were used to determine whether the distributions of two independent samples were significantly different in terms of means and variances, respectively. The comparison showed that the length distributions of the three data sets differed in their means: the p-values for the ground truth vs. the GMM and for the ground truth vs. the distribution of fish lengths identified by Mask R-CNN were 1.11 × 10^-41 and 2.57 × 10^-8, respectively. However, the variance of the fish lengths processed by the GMM was similar to that of the ground truth (the p-value is 0.61, whereas the p-value for the other pair is 1.12 × 10^-24). In manually measuring the fish length, we used the fork length.

Table 2. The relative error of the estimated average standard length and weight of fish, where the ground truth for the average standard length and weight of fish is measured manually, and ε, N, NG, and c denote the relative error, the number of fish instances identified by Mask R-CNN, the number of instances in the largest Gaussian component, and the number of Gaussian components, respectively.
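As a minimal illustration of the statistical comparison described above, the following SciPy sketch runs a t-test on the means and a Bartlett test on the variances of two length samples. The arrays are synthetic placeholders, not the measured data from the experiments.

```python
import numpy as np
from scipy import stats

# Illustrative comparison of two length distributions in the spirit of Table 2.
# The samples below are synthetic placeholders, not the paper's measurements.
rng = np.random.default_rng(0)
ground_truth_lengths = rng.normal(20.0, 2.0, 300)   # manually measured lengths (cm)
gmm_filtered_lengths = rng.normal(20.5, 2.1, 280)   # lengths kept after the GMM step

t_stat, p_mean = stats.ttest_ind(ground_truth_lengths, gmm_filtered_lengths, equal_var=False)
b_stat, p_var = stats.bartlett(ground_truth_lengths, gmm_filtered_lengths)
print(f"t-test p-value (means): {p_mean:.3g}")
print(f"Bartlett p-value (variances): {p_var:.3g}")
```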
4.3. Estimation of Fish Quantity in a Net Cage Using Sonar Images
Figure 23 shows the off-shore net cage environment in Penghu, Taiwan, used for the fish quantity estimation, with Trachinotus blochii as the fish species. The net cage is 15 m in diameter and 5 m in depth, with approximately 2200 fish instances during the experiment. The average standard length of the fish was 20 cm, and the sonar beam was positioned at a slant angle θ of 20°.
Figure 23. Off-shore cages in Penghu, Taiwan, where (a) shows the landscape of the off-shore net cages; and (b) is the environment of the net cage used for the experiment.
The dataset used to train the CNN for detecting fish-gathering frames consisted of 58 fish-gathering images and 116 fish-dispersing images. The CNN was evaluated using 10-fold cross-validation and obtained an accuracy of 0.98. Table 3 shows the confusion matrix results. The intersection over union (IoU) was adopted as the performance index for the semantic segmentation network that segments the fish region. This network was also evaluated using 10-fold cross-validation, and the average IoU was 0.77 ± 0.66.
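The evaluation protocol above can be sketched as a standard 10-fold cross-validation loop that accumulates a confusion matrix. The data arrays, the model builder, and the training settings below are placeholders; the paper does not specify the optimizer, number of epochs, or augmentation used.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

# Illustrative 10-fold cross-validation for the gathering/dispersing classifier.
# `images`, `labels`, and `build_model_fn` are placeholders for the real setup.
def cross_validate(images, labels, build_model_fn, n_splits=10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    total_cm = np.zeros((2, 2), dtype=int)
    for train_idx, test_idx in skf.split(images, labels):
        model = build_model_fn()
        model.fit(images[train_idx], labels[train_idx], epochs=10, verbose=0)
        pred = np.argmax(model.predict(images[test_idx], verbose=0), axis=1)
        total_cm += confusion_matrix(labels[test_idx], pred, labels=[0, 1])
    accuracy = np.trace(total_cm) / total_cm.sum()
    return total_cm, accuracy
```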
Table 3. The confusion matrix for the CNN's detection of fish-gathering frames.

Actual \ Predicted    Gathering    Dispersing
Gathering             58           0
Dispersing            3            113

Figure 24 shows the results of applying the procedures in Figure 15 for fish quantity estimation. The quantity of fish was estimated using 105 fish-gathering frames. Figure 25 shows the distribution of the estimated fish quantity, with a mean and standard deviation of 2578.72 and 569.099, respectively. The manually estimated quantity of fish in the net cage was 2200, which lies within the estimate's 68% and 95% confidence intervals of [2112.32, 3045.21] and [1659.33, 3498.11], respectively.
Figure 24. Experimental results of estimating fish quantity, where (a) shows the non-gathering characteristics of the fish school; and (b–d) show the gathering feature of the fish school and the estimated fish quantity based on the fish region identified by the semantic segmentation network.
Figure 25. The distribution of the estimated fish quantity.
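One simple way to obtain intervals such as those reported above is to treat the per-frame estimates as approximately normal and use the mean plus or minus one and two standard deviations. The sketch below is illustrative only; the values it produces from synthetic data will not exactly reproduce the intervals reported in the paper.

```python
import numpy as np

# Illustrative 68%/95% intervals from per-frame quantity estimates, assuming an
# approximately normal distribution. `estimates` stands in for the 105 real values.
estimates = np.random.default_rng(1).normal(2578.72, 569.099, 105)
mean, std = estimates.mean(), estimates.std(ddof=1)
ci68 = (mean - std, mean + std)
ci95 = (mean - 2 * std, mean + 2 * std)
print(f"mean={mean:.1f}, 68% interval={ci68}, 95% interval={ci95}")
```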
4.4. Object Detection for Fish Type Identification and Two-Mode Fish Count

The object detection model used YOLOv4 pre-trained on the COCO dataset [60]. The object detection results representing three different aquaculture environments (Keelung, Penghu, and Pingtung locations) are shown in Figure 26. The experimental result for the fake fish experiment is shown in Figure 27, where three fish species were detected.
Figure 26. Fish target object detection results using YOLOv4 in the different aquaculture sites: (a) AAC-A13, Keelung; (b) Offshore Cage, Penghu; (c) LongDann-C10, Pingtung.

Figure 28 shows the result of the fish count estimation, with the actual Trachinotus blochii species detected in the images taken from the Penghu off-shore cage. Since the range of the low-cost stereo camera is short, it cannot see fish objects beyond its reliable coverage area; thus, only 55 fish objects were detected and counted.
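For reference, one common way to run a pre-trained YOLOv4 detector is through OpenCV's DNN module, as sketched below. The paper does not state which runtime it used; the file names are the standard Darknet release names and are assumed to be available locally, and the input image path is a placeholder.

```python
import cv2

# Hedged sketch: running a pre-trained YOLOv4 model with OpenCV's DNN module.
# "yolov4.cfg"/"yolov4.weights" are the standard Darknet files (assumed present).
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("left_stereo_frame.png")          # left image of the stereo pair
class_ids, scores, boxes = model.detect(frame, confThreshold=0.25, nmsThreshold=0.45)
for cid, score, box in zip(class_ids, scores, boxes):
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```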
5. Discussion
Our sensor-based fusion mechanism was applied to aquaculture monitoring using optical (stereo camera) and sonar images. Each sensor system has its strengths and limitations, and we took advantage of the capabilities of each to address the issues of the other. For example, it would be difficult for an RGB camera to accurately estimate fish metrics under poor underwater conditions, which the sonar camera system addresses; conversely, we took advantage of the texture information of the optical images to provide fish-species annotation for the sonar images. In addition, detecting the common area of each sensor poses a significant challenge considering the quality of images in the underwater environment.
Additionally, sonar cameras have a larger area covered when compared with stereo
cameras. Thus, the target objects will be in different positions or locations. We also must
consider that sensors vary in terms of errors, the origin of the coordinate axis, and the
types of data received. Two essential issues need to be addressed for sensor fusion, namely,
opti-acoustic extrinsic calibration and opto-acoustic feature matching [14]. To deal with
this and improve the quality of our data fusion, we performed camera calibration to enable
both sensors to be in a common world frame or coordinates. Our approach integrated
the 3D point cloud information of both sensors to identify the overlapping areas by using
markers (4 pixels) to project sonar images to stereo images as part of the learning phase. The
integration of the transformation matrix made it possible to locate the corresponding pixels
in both camera systems. Camera calibration performed a significant function in our sensor
fusion by transforming their corresponding rotation matrix and translation vectors to match
the features from sonar to optical coordinate system, thus taking advantage of the epipolar
geometry for the multi-modal feature association [14]. For opti-acoustic feature matching,
the 3D information was utilized to identify the same features of both sensor modalities.
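The extrinsic step described above can be summarized with a small sketch: 3D points in the sonar frame are transformed into the optical frame with a rotation matrix R and translation vector t, then projected with the pinhole model. The R, t, and intrinsic matrix K below are placeholders for calibration results, not values from the paper.

```python
import numpy as np

# Minimal sketch of the sonar-to-optical mapping: rigid transform (R, t) followed
# by a pinhole projection with intrinsics K. All numeric values are placeholders.
def sonar_to_stereo_pixels(points_sonar, R, t, K):
    points_cam = (R @ points_sonar.T).T + t          # sonar frame -> camera frame
    uvw = (K @ points_cam.T).T                        # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]                   # normalize to pixel coordinates

R = np.eye(3)                                         # placeholder extrinsics
t = np.array([0.10, 0.0, 0.0])
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = sonar_to_stereo_pixels(np.array([[0.5, 0.1, 3.0]]), R, t, K)
```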
One of the AI functions is to estimate the length and weight distribution of the fish in an indoor aquaculture tank. The main challenge of the first method is that the fish in the aquaculture tank are often crowded and overlap in sonar images. Besides, sonar images of an aquaculture tank are usually noisy due to the echo from the air pump and from the bottom and walls of the tank. The Mask R-CNN [61] identified single fish instances in sonar images; therefore, in crowded aquaculture tanks it may locate overlapped and incomplete fish as single fish instances. Because Mask R-CNN returns any region that looks like a single fish instance, we assume that the length distribution of the identified instances is a mixture of Gaussians and that the length distribution of the valid single fish instances is the largest Gaussian component of this mixture. Based on that assumption, the first method employs Gaussian mixture models (GMMs) to model the length distribution of the fish instances identified by Mask R-CNN. Then, the proposed method regards a fish instance whose length falls in the Gaussian component with the largest mixture weight as a single fish instance and estimates the weight of the instance by k-nearest-neighbor regression. Since we manually measured the fish fork length as the basis for the length estimation, our proposed estimation method is biased, because the fork length is usually greater than the standard length.
Furthermore, the response of the caudal fish fin in the sonar image is usually weak,
which makes the fish length measured by our proposed method close to the standard length
of the fish. On the other hand, the estimated weight distribution and the ground truth were
significantly different. This result could be attributed to the fish’s weight being affected
by other factors, such as the thickness of the fish. In practice, it is difficult to observe both
the thickness and length of the fish instance from the view of the imaging sonar system.
Overall, the relative error of the estimated average fish standard length was approximately
15%, while the relative error for the estimated average weight was less than 50%.
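A compact sketch of the length-and-weight pipeline just described is given below: fit a GMM to the lengths of the Mask R-CNN instances, keep instances assigned to the component with the largest mixture weight, and estimate weight from length with k-NN regression. The data, the number of components, and the toy length-weight relation are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsRegressor

# GMM over detected lengths; keep only the largest mixture component.
lengths = np.random.default_rng(2).normal(20.0, 2.0, (400, 1))   # placeholder lengths (cm)
gmm = GaussianMixture(n_components=3, random_state=0).fit(lengths)
main_component = int(np.argmax(gmm.weights_))
is_single_fish = gmm.predict(lengths) == main_component
single_lengths = lengths[is_single_fish]

# k-NN regression from length to weight, trained on manually measured pairs
# (the training pairs and the cubic relation below are illustrative only).
train_lengths = np.linspace(15, 25, 50).reshape(-1, 1)
train_weights = 0.02 * train_lengths.ravel() ** 3
knn = KNeighborsRegressor(n_neighbors=5).fit(train_lengths, train_weights)
estimated_weights = knn.predict(single_lengths)
```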
The second AI function for sonar images estimates fish quantity in an off-shore net cage, and we identified two main challenges. First, the fish considered in this paper, Trachinotus blochii, are often widely distributed in the net cage. Second, the view of an imaging sonar system only covers a small portion of the net cage. Since fish of the target species gather and swim close to the water surface to grab food pellets, the proposed method only estimates the quantity of fish in the net cage during feeding. The number of fish is estimated from the average fish volume and the estimated volume and density of the fish school. In this paper, a convolutional neural network was developed to determine whether the fish had gathered and were grabbing food pellets. If so, a semantic segmentation network was applied to segment the fish school in the sonar image, and the volume of the fish school was estimated from the segmentation result. Visible imaging has a short imaging distance underwater due to the light attenuation caused by water absorption and scattering; the image therefore becomes more blurred, and its quality decreases, as the shooting distance increases. In contrast, sound waves travel far through water with much less attenuation. Consequently, counting based on acoustics can still work when visual counting is inappropriate [62].
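Read literally, the counting rule above amounts to dividing the occupied volume of the fish school by the volume of a single fish. The sketch below illustrates that relationship only; the exact form of the density factor and all numeric values are assumptions, not the paper's implementation.

```python
# Illustrative counting rule: segmented school volume times an occupancy (density)
# factor, divided by the average volume of a single fish. All values are placeholders.
def estimate_fish_quantity(school_volume_m3, packing_density, avg_fish_volume_m3):
    return school_volume_m3 * packing_density / avg_fish_volume_m3

print(estimate_fish_quantity(school_volume_m3=4.0,
                             packing_density=0.35,
                             avg_fish_volume_m3=6.0e-4))
```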
For object detection, this work used the two images from the two sensors to capture a common target. YOLOv4 [55], an efficient and powerful object detection model capable of real-time detection, was used to detect the target objects in the sonar and stereo images. In the two-mode fish counting estimation, we detected the types of fish found in the cage and provided an assessment of the population of each species. Since we used a low-cost camera, its range is minimal, so it cannot detect fish out of range even with robust deep-learning object detectors such as YOLOv4. Therefore, we only counted detected fish within the reliable range, which is why we obtained a lower fish count than the estimated number of fish in the cage, unlike the sonar camera system, which covers a larger and broader range. Thus, we still rely on the sonar camera system for the final fish population count. The two-mode fish counting estimation currently serves as a sampling device that supports the sonar camera system by providing information about the fish species distribution as an added sonar image analytic. At the moment, however, the available dataset contains only one species per cage/pond. We nevertheless tested our mechanism's annotation function using a well-known object detection CNN to check whether the proposed method can detect various fish types. We can replace YOLOv4 with a state-of-the-art object detection CNN, because YOLOv4 does not work well in the underwater environment; addressing this shortcoming will be one of our future works to improve the performance of our smart sensor fusion.
The data collection procedure for this study was a difficult task. First, the assistance of the aquaculture operators is essential, and they must be present in every data collection activity. Second, the data can only be obtained when the aquaculture operators feed the fish, usually once a day. Third, the weather is another major factor, since the net cages are in the open sea. Finally, it is essential to consider the sea current to make sure that the transducer of the imaging sonar system is steady, since this greatly affects the quality of the data. After several attempts to collect data, only a few recordings were obtained due to these difficulties.
6. Conclusions
The 3D point clouds for each camera system were separately obtained, extracted, and matched to find their correspondences. Our sensor fusion approach detected the corresponding pixels, i.e., the bounding box of the common area; thus, it could also detect the fish objects in the images of both sensors to be utilized for fish type annotation. In this paper, two methods were developed to estimate the quantity and the distribution of the standard length and weight of fish using a sonar imaging system. The first method estimates the distribution of the standard length and weight of fish. Using GMMs to find the distribution of the standard length of single fish instances and employing k-nearest-neighbor regression to estimate the weight of fish from the length, the relative errors of the estimated average fish standard length and weight were less than approximately 15% and 50%, respectively. Those errors can be reduced if the manually measured fish length is based on the standard length instead of the fork length. Therefore, the proposed method can be applied to monitoring the growth of culture fish. The second method estimates the fish quantity in an off-shore net cage. The preliminary experimental results showed that the quantity of fish lay within the estimate's 68% and 95% confidence intervals, whose widths were approximately 900 and 1800, respectively, indicating that the proposed method is feasible. Lastly, the fish target object detection provides an additional function to annotate fish species and offers additional information to the sonar system. For future work, we plan to incorporate Generative Adversarial Networks to convert the optical image of a target object into a sonar image. Additionally, we will integrate sonar and stereo camera fusion for fish length and weight estimation.
Abbreviations
AIoT Artificial Intelligence-based Internet of Things
CNNs Convolutional Neural Networks
EM Expectation Maximization
GMM Gaussian Mixture Model
IoT Internet of Things
K-NN K-Nearest Neighbors
ReLU Rectified Linear Unit
RGB Red, green, blue
YOLOv4 You only look once version 4
References
1. Food and Agriculture Organizations of the United Nations. State of the World and Aquaculture; FAO: Rome, Italy, 2020.
2. O’Donncha, F.; Grant, J. Precision Aquaculture. IEEE Internet Things Mag. 2019, 2, 26–30. [CrossRef]
3. O’Donncha, F.; Stockwell, C.; Planellas, S.; Micallef, G.; Palmes, P.; Webb, C.; Filgueira, R.; Grant, J. Data Driven Insight Into Fish
Behaviour and Their Use for Precision Aquaculture. Front. Anim. Sci. 2021, 2, 695054. [CrossRef]
4. Antonucci, F.; Costa, C. Precision aquaculture: A short review on engineering innovations. Aquac. Int. 2019, 28, 41–57. [CrossRef]
5. Gupta, S.; Gupta, A.; Hasija, Y. Transforming IoT in aquaculture: A cloud solution in AI. In Edge and IoT-based Smart Agriculture A
Volume in Intelligent Data-Centric Systems; Academic Press: Cambridge, MA, USA, 2022; pp. 517–531.
6. Mustapha, U.F.; Alhassan, A.-W.; Jiang, D.-N.; Li, G.-L. Sustainable aquaculture development: A review on the roles of cloud
computing, internet of things and artificial intelligence (CIA). Rev. Aquac. 2021, 3, 2076–2091. [CrossRef]
7. Petritoli, E.; Leccese, F. Albacore: A Sub Drone for Shallow Waters A Preliminary Study. In Proceedings of the MetroSea 2020–TC19
International Workshop on Metrology for the Sea, Naples, Italy, 5–7 October 2020.
8. Acar, U.; Kane, F.; Vlacheas, P.; Foteinos, V.; Demestichas, P.; Yuceturk, G.; Drigkopoulou, I.; Vargün, A. Designing An IoT Cloud
Solution for Aquaculture. In Proceedings of the 2019 Global IoT Summit (GIoTS), Aarhus, Denmark, 17–21 June 2019.
9. Chang, C.-C.; Wang, Y.-P.; Cheng, S.-C. Fish Segmentation in Sonar Images by Mask R-CNN on Feature Maps of Conditional
Random Fields. Sensors 2021, 21, 7625. [CrossRef] [PubMed]
10. Ubina, N.A.; Cheng, S.-C.; Chang, C.-C.; Cai, S.-Y.; Lan, H.-Y.; Lu, H.-Y. Intelligent Underwater Stereo Camera Design for Fish
Metric Estimation Using Reliable Object Matching. IEEE Access 2022, 10, 74605–74619. [CrossRef]
11. Cook, D.; Middlemiss, K.; Jaksons, P.; Davison, W.; Jerrett, A. Validation of fish length estimations from a high frequency
multi-beam sonar (ARIS) and its utilisation as a field-based measurement technique. Fish. Res. 2019, 218, 56–98. [CrossRef]
12. Hightower, J.; Magowan, K.; Brown, L.; Fox, D. Reliability of Fish Size Estimates Obtained From Multibeam Imaging Sonar. J. Fish
Wildl. Manag. 2013, 4, 86–96. [CrossRef]
13. Puig-Pons, V.; Muñoz-Benavent, P.; Espinosa, V.; Andreu-García, G.; Valiente-González, J.; Estruch, V.; Ordóñez, P.;
Pérez-Arjona, I.; Atienza, V.; Mèlich, B.; et al. Automatic Bluefin Tuna (Thunnus thynnus) biomass estimation during transfers
using acoustic and computer vision techniques. Aquac. Eng. 2019, 85, 22–31. [CrossRef]
14. Ferreira, F.; Machado, D.; Ferri, G.; Dugelay, S.; Potter, J. Underwater optical and acoustic imaging: A time for fusion? A brief
overview of the state-of-the-art. In Proceedings of the OCEANS 2016 MTS/IEEE, Monterey, CA, USA, 19–23 September 2016.
15. Servos, J.; Smart, M.; Waslander, S.L. Underwater stereo SLAM with refraction correction. In Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013.
16. Føre, M.; Frank, K.; Norton, T.; Svendsen, E.; Alfredsen, J.; Dempster, T.; Eguiraun, H.; Watson, W.; Stahl, A.; Sunde, L.; et al.
Precision fish farming: A new framework to improve production in aquaculture. Biosyst. Eng. 2018, 173, 176–193. [CrossRef]
17. Hughes, J.B.; Hightower, J.E. Combining split-beam and dual-frequency identification sonars to estimate abundance of anadro-
mous fishes in the roanoke river, North Carolina. N. Am. J. Fish. Manag. 2015, 35, 229–240. [CrossRef]
18. Jing, D.; Han, J.; Wang, X.; Wang, G.; Tong, J.; Shen, W.; Zhang, J. A method to estimate the abundance of fish based on
dual-frequency identification sonar (DIDSON) imaging. Fish. Sci. 2017, 35, 229–240. [CrossRef]
19. Martignac, F.; Daroux, A.; Bagliniere, J.-L.; Ombredane, D.; Guillard, J. The use of acoustic cameras in shallow waters: New
hydroacoustic tools for monitoring migratory fish population. a review of DIDSON technology. Fish Fish. 2015, 16, 486–510.
[CrossRef]
20. Baumann, J.R.; Oakley, N.C.; McRae, B.J. Evaluating the effectiveness of artificial fish habitat designs in turbid reservoirs using
sonar imagery. N. Am. J. Fish. Manag. 2016, 36, 1437–1444. [CrossRef]
21. Shahrestani, S.; Bi, H.; Lyubchich, V.; Boswell, K.M. Detecting a nearshore fish parade using the adaptive resolution imaging
sonar (ARIS): An automated procedure for data analysis. Fish. Res. 2017, 191, 190–199. [CrossRef]
22. Jing, D.; Han, J.; Wang, G.; Wang, X.; Wu, J.; Chen, G. Dense multiple-target tracking based on dual frequency identification sonar
(DIDSON) image. In Proceedings of the OCEANS 2016, Shanghai, China, 10–13 April 2016.
23. Wolff, L.M.; Badri-Hoeher, S. Imaging sonar- based fish detection in shallow waters. In Proceedings of the 2014 Oceans, St. John’s,
NL, Canada, 14–19 September 2014.
24. Handegard, N.O. An overview of underwater acoustics applied to observe fish behaviour at the institute of marine research. In
Proceedings of the 2013 MTS/IEEE OCEANS, Bergen, Norway, 23–26 September 2013.
25. Llorens, S.; Pérez-Arjona, I.; Soliveres, E.; Espinosa, V. Detection and target strength measurements of uneaten feed pellets with a
single beam echosounder. Aquac. Eng. 2017, 78, 216–220. [CrossRef]
26. Estrada, J.; Pulido-Calvo, I.; Castro-Gutiérrez, J.; Peregrín, A.; López, S.; Gómez-Bravo, F.; Garrocho-Cruz, A.; De La Rosa, I. Fish
abundance estimation with imaging sonar in semi-intensive aquaculture ponds. Aquac. Eng. 2022, 97, 102235. [CrossRef]
27. Burwen, D.; Fleischman, S.; Miller, J. Accuracy and Precision of Salmon Length Estimates Taken from DIDSON Sonar Images.
Trans. Am. Fish. Soc. 2010, 139, 1306–1314. [CrossRef]
28. Lagarde, R.; Peyre, J.; Amilhat, E.; Mercader, M.; Prellwitz, F.; Gael, S.; Elisabeth, F. In situ evaluation of European eel counts and
length estimates accuracy from an acoustic camera (ARIS). Knowl. Manag. Aquat. Ecosyst. 2020, 421, 44. [CrossRef]
29. Sthapit, P.; Kim, M.; Kang, D.; Kim, K. Development of Scientific Fishery Biomass Estimator: System Design and Prototyping.
Sensors 2020, 20, 6095. [CrossRef]
30. Valdenegro-Toro, M. End-to-end object detection and recognition in forward-looking sonar images with convolutional neural
networks. In Proceedings of the 2016 IEEE/ OES Autonomous Underwater Vehicles (AUV), Tokyo, Japan, 6–9 November 2016.
31. Liu, L.; Lu, H.; Cao, Z.; Xiao, Y. Counting fish in sonar images. In Proceedings of the 25th IEEE International Conference on Image
Processing (ICIP), Athens, Greece, 7–10 October 2018.
32. Christ, R.D.; Wernli, R.L. Chapter 15-Sonar. In The ROV Manual; Butterworth-Heinemann: Oxford, UK, 2014; pp. 387–424.
33. Rosen, S.; Jørgensen, T.; Hammersland-White, D.; Holst, J.; Grant, J. DeepVision: A stereo camera system provides highly accurate
counts and lengths of fish passing inside a trawl. Can. J. Fish. Aquat. Sci. 2013, 70, 1456–1467. [CrossRef]
34. Shortis, M.; Ravanbakskh, M.; Shaifat, F.; Harvey, E.; Mian, A.; Seager, J.; Culverhouse, P.; Cline, D.; Edgington, D. A review of
techniques for the identification and measurement of fish in underwater stereo-video image sequences. In Proceedings of the
Videometrics, Range Imaging, and Applications XII; and Automated Visual Inspection, Munich, Germany, 14–16 May 2013.
35. Huang, T.-W.; Hwang, J.-N.; Romain, S.; Wallace, F. Fish Tracking and Segmentation From Stereo Videos on the Wild Sea Surface
for Electronic Monitoring of Rail Fishing. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 3146–3158. [CrossRef]
36. Vale, R.; Ueda, E.; Takimoto, R.; Martins, T. Fish Volume Monitoring Using Stereo Vision for Fish Farms. IFAC-PapersOnLine 2020,
53, 15824–15828. [CrossRef]
37. Williams, K.; Rooper, C.; Towler, R. Use of stereo camera systems for assessment of rockfish abundance in untrawlable areas and
for recording pollock behavior during midwater trawls. Fish. Bull.-Natl. Ocean. Atmos. Adm. 2010, 108, 352–365.
38. Torisawa, S.; Kadota, M.; Komeyama, K.; Suzuki, K.; Takagi, T. A digital stereo-video camera system for three-dimensional
monitoring of free-swimming Pacific bluefin tuna, Thunnus orientalis, cultured in a net cage. Aquat. Living Resour. 2011, 24,
107–112. [CrossRef]
39. Cheng, R.; Zhang, C.; Xu, Q.; Liu, G.; Song, Y.; Yuan, X.; Sun, J. Underwater Fish Body Length Estimation Based on Binocular
Image Processing. Information 2020, 11, 476. [CrossRef]
40. Voskakis, D.; Makris, A.; Papandroulakis, N. Deep learning based fish length estimation. An application for the Mediterranean
aquaculture. In Proceedings of the OCEANS 2021, San Diego, CA, USA, 20–23 September 2021.
41. Shi, C.; Wang, Q.; He, X.; Xiaoshuan, Z.; Li, D. An automatic method of fish length estimation using underwater stereo system
based on LabVIEW. Comput. Electron. Agric. 2020, 173, 105419. [CrossRef]
42. Garner, S.B.; Olsen, A.M.; Caillouet, R.; Campbell, M.D.; Patterson, W.F. Estimating reef fish size distributions with a mini
remotely operated vehicle-integrated stereo camera system. PLoS ONE 2021, 16, e0247985. [CrossRef]
43. Kadambi, A.; Bhandari, A.; Raskar, R. 3D Depth Cameras in Vision: Benefits and Limitations of the Hardware. In Computer Vision
and Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2014; pp. 1–26.
44. Harvey, E.; Shortis, M.; Stadler, M. A Comparison of the Accuracy and Precision of Measurements from Single and Stereo-Video
Systems. Mar. Technol. Soc. J. 2002, 36, 38–49. [CrossRef]
45. Bertels, M.; Jutzi, B.; Ulrich, M. Automatic Real-Time Pose Estimation of Machinery from Images. Sensors 2022, 22, 2627. [CrossRef]
46. Boldt, J.; Williams, K.; Rooper, C.; Towler, R.; Gauthier, S. Development of stereo camera methodologies to improve pelagic fish
biomass estimates and inform ecosystem management in marine waters. Fish. Res. 2017, 198, 66–77. [CrossRef]
47. Berrio, J.S.; Shan, M.; Worrall, S.; Nebot, E. Camera-LIDAR Integration: Probabilistic Sensor Fusion for Semantic Mapping. IEEE
Trans. Intell. Transp. Syst. 2022, 7, 7637–7652. [CrossRef]
48. John, V.; Long, Q.; Liu, Z.; Mita, S. Automatic calibration and registration of lidar and stereo camera without calibration
objects. In Proceedings of the 2015 IEEE International Conference on Vehicular Electronics and Safety (ICVES), Yokohama, Japan,
5–7 November 2015.
49. Roche, V.D.-S.J.; Kondoz, A. A Multi-modal Perception-Driven Self Evolving Autonomous Ground Vehicle. IEEE Trans. Cybern.
2021, 1–11. [CrossRef]
50. Zhong, Y.; Chen, Y.; Wang, C.; Wang, Q.; Yang, J. Research on Target Tracking for Robotic Fish Based on Low-Cost Scarce Sensing
Information Fusion. IEEE Robot. Autom. Lett. 2022, 7, 6044–6051. [CrossRef]
51. Dov, D.; Talmon, R.; Cohen, I. Multimodal Kernel Method for Activity Detection of Sound Sources. IEEE/ACM Trans. Audio Speech
Lang. Processing 2017, 25, 1322–1334. [CrossRef]
52. Mirzaei, G.; Jamali, M.M.; Ross, J.; Gorsevski, P.V.; Bingman, V.P. Data Fusion of Acoustics, Infrared, and Marine Radar for Avian
Study. IEEE Sens. J. 2015, 15, 6625–6632. [CrossRef]
53. Zhou, X.; Yu, C.; Yuan, X.; Luo, C. A Matching Algorithm for Underwater Acoustic and Optical Images Based on Image Attribute
Transfer and Local Features. Sensors 2021, 21, 7043. [CrossRef]
54. Andrei, C.-O. 3D Affine Coordinate Transformations. Master’s Thesis, School of Architecture and the Built Environment Royal
Institute of Technology (KTH), Stockholm, Sweden, 2006.
55. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
56. Kim, B.; Joe, H.; Yu, S.-C. High-precision Underwater 3D Mapping Using Imaging Sonar for Navigation of Autonomous
Underwater Vehicle. Int. J. Control. Autom. Syst. 2021, 19, 3199–3208. [CrossRef]
57. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 1977,
39, 1–38.
58. Gkalelis, N.; Mezaris, V.; Kompatsiaris, I. Mixture subclass discriminant analysis. IEEE Signal Processing Lett. 2011, 18, 319–322.
[CrossRef]
59. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the
Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015.
60. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
61. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [CrossRef]
62. Li, D.; Miao, Z.; Peng, F.; Wang, L.; Hao, Y.; Wang, Z.; Chen, T.; Li, H.; Zheng, Y. Automatic counting methods in aquaculture: A
review. J. World Aquac. Soc. 2020, 52, 269–283. [CrossRef]