Underwater Object Detection in Marine Ranching Based on
Improved YOLOv8
Rong Jia, Bin Lv *, Jie Chen, Hailin Liu, Lin Cao and Min Liu
Abstract: In marine ranching, statistically grasping the types and density of living marine resources is of great significance for scientific aquaculture. However, underwater environments are complex, and marine organisms often appear as small and overlapping targets, which seriously degrades detector performance. To overcome these issues, we improved the YOLOv8 detector. The InceptionNeXt block was used in the backbone to enhance the feature extraction capabilities of the network. Subsequently, a separated and enhancement attention module (SEAM) was added to the neck to enhance the detection of overlapping targets. Moreover, the normalized Wasserstein distance (NWD) loss was proportionally added to the original CIoU loss to improve the detection of small targets. Data augmentation methods were used to improve the dataset during training and enhance the robustness of the network. The experimental results showed that the improved YOLOv8 achieved a mAP of 84.5%, an improvement of approximately 6.2% over the original YOLOv8, while the number of parameters and the computational cost did not increase significantly. This detector can be deployed on seafloor observation platforms in marine ranching to perform real-time detection of marine organisms.
Keywords: underwater vision; seafloor observation; object detection; deep learning; YOLO
which is inefficient and error prone. To improve the efficiency of observing organisms in marine ranching, the use of computer vision technology for target detection and for assisting the observation of marine organisms has important research value and application prospects.
Underwater images are different from normal land images. The quality of underwa-
ter images is significantly degraded due to uncertainties in the water environment, the
absorption and scattering of light by water, and the various media contained in water [3].
With the rise and development of artificial intelligence technology, the introduction of deep
learning-based target detection and classification techniques has enabled accurate and fast
target detection in complex underwater environments.
Current deep learning-based target detection methods fall into two main categories: two-stage detection models and one-stage detection models. The most prominent two-stage models are R-CNN [4], Fast-RCNN [5], and Faster-RCNN [6]. Two-stage models first generate region proposals with a specialized module (a region proposal network) and then classify each proposal as foreground or background while refining its bounding box. These models are more complex, larger, and slower, but they have advantages in detection accuracy. Lin et al. [7] designed the RoIMix data enhancement method based on Faster-RCNN to address the many overlapping, occluded, and blurred targets in underwater images. Their experiments showed that this method could significantly improve the detection performance of Faster-RCNN.
The main representative of the one-stage models is YOLO [8–11]. Its main feature
is the direct input of images into the detection model and the output of results. The
main advantages are its simple structure, small size, and high speed. It is more suitable
for underwater target detection in marine ranching. For underwater target detection
in the field of one-stage modeling, Han et al. [12] first used a combination of max-RGB
and shades of gray methods to enhance underwater images. Then, a deep convolutional
neural network method was used for underwater target recognition, and good results were
achieved. Chen et al. [13] improved YOLOv4 by replacing the upsampling module with a deconvolution module, adding depthwise-separable convolution, and augmenting the data with an improved Mosaic method. Zhao et al. [14] reduced the parameters of the model by replacing the backbone network with MobileNetV3 and added deformable convolution to improve detection accuracy. Sun et al. [15] proposed LK-YOLO, which improved model performance by introducing large kernel convolution into the backbone network and by improving the detection head and the sample matching strategy.
Currently, the application of transformers in computer vision tasks is also a popular
research direction. Zhang et al. [16] proposed the YOLO-SWFormer, which introduced the
Swin-Transformer into the backbone of the model. This effectively improved the detection
accuracy of the model. However, at the same time, the detection speed was slow, and the
model structure was relatively bloated.
In recent years, attention mechanisms have become an integral part of target detection
tasks. Attention mechanisms stem from the study of human vision and its ability to sift
through large amounts of information to find important data. Shen et al. [17] proposed the
crisscross global interaction strategy (CGIS) for YOLO detectors in order to minimize the interference of the underwater background with the detected target. Yu et al. [18]
proposed YOLOv7-net, which added bi-level routing attention (BRA) and a new coordi-
nated attention module (RFCAConv) to YOLOv7. This improved the detection of broken
nets in complex marine environments. Lv et al. [19] improved YOLOv5 by adding ASPP
structures and a CBAM module combined with the FOCAL loss function. A small target
detection head was also added. Li et al. [20] used the RGHS algorithm to improve the image
quality. The performance was then improved by adding a triplet attention mechanism and
an additional small target detection head. Li et al. [21] improved YOLOv5 by using a
Res2Net residual structure with a coordinate attention mechanism and applied it to fish
detection. These target detection models had high recognition accuracy, fast detection
speed, good robustness, and high practicality.
Based on the above analysis, to accurately and quickly recognize target marine organ-
isms in complex underwater environments, this study proposes an improved YOLOv8
detector. Firstly, the InceptionNeXt block was used in the backbone network. This improved
the feature extraction abilities of the model without increasing its number of parameters.
Secondly, the SEAM attention mechanism was added to the neck. This enhanced the
detection of overlapping targets by increasing the focus on the region of the detected object
in the image and weakening the background region. Finally, NWD loss was added to
the original CIoU loss to improve the ability to detect small targets. Data augmentation
methods were used to improve the dataset during the training process, thus enhancing
the robustness of the network. The main research and innovations of this study can be
summarized as follows:
(1) To address the weak, indistinct features of underwater images, the backbone network of YOLOv8 was improved with the InceptionNeXt block, which enhanced its ability to extract image features while maintaining its lightweight advantage.
(2) To address the large number of overlapping targets in underwater images, the SEAM attention mechanism was added to the neck, and experimental comparisons with two other classical attention mechanisms showed that the SEAM was the most effective.
(3) To address the large number of small targets in underwater images, NWD loss was added on the basis of the original CIoU loss, and the most suitable ratio of the two functions was found through experiments, which improved the accuracy of small-target detection without degrading the detection accuracy for medium and large targets.
(4) To address the insufficient number of underwater datasets, data from three sources were combined to form the final dataset. During training, the training set was augmented with a combination of Mosaic and MixUp (a minimal configuration sketch follows this list), which improved the generalization ability of the model and avoided overfitting.
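To make the training-time augmentation concrete, the following is a minimal sketch of how Mosaic, MixUp, and several random image transformations could be enabled when training a YOLOv8 model with the Ultralytics API. The dataset file name and all hyperparameter values are illustrative assumptions, not the exact settings used in this study.

```python
from ultralytics import YOLO

# Minimal training sketch (assumed hyperparameter values, for illustration only).
model = YOLO("yolov8n.yaml")  # the improved architecture would be defined in a custom YAML
model.train(
    data="marine_ranching.yaml",  # hypothetical dataset description file
    epochs=300,
    imgsz=640,
    mosaic=1.0,    # probability of Mosaic augmentation
    mixup=0.15,    # probability of MixUp augmentation
    hsv_h=0.015,   # random hue jitter
    fliplr=0.5,    # random horizontal flip
    degrees=10.0,  # random rotation
)
```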
The remainder of this study is organized as follows: In Section 2, we focus on the
main structure of the video monitoring system in the seafloor observation network and
on the analysis of the specific improvement strategies for the detection model. Section 3
focuses on a general analysis of the dataset and training strategy, as well as the design of
the experiments and a discussion of the results. Section 4 provides final conclusions and
directions for future work.
2.2.1. Backbone
The pre-processed images were fed into the backbone for feature extraction. A major
improvement point in the whole feature extraction network was the introduction of the In-
ceptionNeXt block [26] to replace the original C2F block [27], thus enhancing the extraction
of the input image features. The InceptionNeXt block is mainly based on ConvNeXt [28]
and the idea of Inception [29–33]. The large kernel depthwise convolution in ConvNeXt was decomposed into four parallel branches along the channel dimension: one third of the channels used a 3 × 3 kernel, one third used a 1 × 11 kernel, the remaining third used an 11 × 1 kernel, and an identity mapping was added as the fourth branch. This decomposition not only reduced the number of parameters and the computational effort, but it also retained the advantages of large kernel depthwise convolution, i.e., it expanded the receptive field and improved the model performance.
After that, the image features were extracted by the MLP block. The main difference
from the previous version was the replacement of the original two layers of the fully
connected network with two 1 × 1 convolution layers. Finally, the present structure
decomposed large convolutional kernels in a simple and quick manner while maintaining
comparable performance, achieving a better balance among accuracy, speed, and the
number of parameters. The main structure of the InceptionNeXt block is shown in Figure 4.
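To make the structure concrete, the following is a minimal PyTorch sketch of the Inception-style depthwise token mixer and the 1 × 1-convolution MLP described above. It is an illustration under stated assumptions rather than the exact implementation: the channel split ratio is configurable (the original InceptionNeXt allocates a small fraction of the channels to each convolution branch), and normalization layers are omitted.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Inception-style depthwise token mixer (sketch of the InceptionNeXt idea).

    Channels are split into four groups: a 3x3 depthwise branch, a 1x11 branch,
    an 11x1 branch, and an identity branch. `branch_ratio` controls how many
    channels each convolution branch receives (an illustrative assumption).
    """
    def __init__(self, dim, band_kernel=11, branch_ratio=1 / 8):
        super().__init__()
        gc = int(dim * branch_ratio)  # channels per convolution branch
        self.dw_square = nn.Conv2d(gc, gc, 3, padding=1, groups=gc)
        self.dw_band_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                   padding=(0, band_kernel // 2), groups=gc)
        self.dw_band_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                   padding=(band_kernel // 2, 0), groups=gc)
        self.splits = (dim - 3 * gc, gc, gc, gc)

    def forward(self, x):
        x_id, x_sq, x_w, x_h = torch.split(x, self.splits, dim=1)
        # the identity branch passes through unchanged (the identity mapping)
        return torch.cat((x_id, self.dw_square(x_sq),
                          self.dw_band_w(x_w), self.dw_band_h(x_h)), dim=1)

class ConvMlp(nn.Module):
    """MLP realized with two 1x1 convolutions, as described in the text."""
    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, dim * hidden_ratio, 1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(dim * hidden_ratio, dim, 1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```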
2.2.2. Neck
The neck of YOLOv8 uses the PANet [34] structure, which iteratively fuses features and outputs feature maps at three scales. Occlusion is a particular problem for marine organisms: occlusion between different organisms can lead to misalignment, local blending, and missing features. To address these issues, the separated and enhancement attention module (SEAM) [35] was added to emphasize the object detection regions in the image and weaken the background regions,
thus enhancing the detection of marine organisms in the presence of occlusion. The SEAM
was first used for the detection of occluded faces, and a diagram of its structure is shown in
Figure 5.
First, the input feature maps were passed through the channel and spatial mixing module (CSMM) to learn the correlations between spatial dimensions and channels. In the CSMM, the input feature map was first sliced into a number of image patches using the patch embedding operation, and each patch was linearly mapped and flattened into a one-dimensional vector. This was followed by a 3 × 3 depthwise convolution with residual connections. The depthwise convolution was applied depth by depth, i.e., each channel was convolved separately. Thus, although the depthwise convolution could learn the importance of different channels and reduce the number of parameters, it ignored the relationships between channels.
To compensate for this loss, the outputs of the different depthwise convolutions were subsequently combined through a 1 × 1 pointwise convolution. A two-layer fully connected network was then used to fuse the information from each channel. In this way, the network could strengthen the connections between all channels. The logits output by the fully connected layers lie in the range [0, 1]. The exponential function y = e^x was then used to expand this range to [1, e]. This exponential normalization provided a monotonic mapping relationship that made the results more tolerant of positional errors. Finally, the output of the SEAM was multiplied by the original features as attention weights so that the model could handle the occlusion of detected targets more effectively.
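As an illustration of this attention path, the following is a simplified PyTorch sketch assuming a single CSMM branch, global average pooling before the two fully connected layers, and a sigmoid that keeps the logits in [0, 1] before the exponential re-scaling; the original SEAM combines several CSMMs with different patch sizes, so this shows the mechanism rather than the exact module.

```python
import torch
import torch.nn as nn

class CSMM(nn.Module):
    """Channel and spatial mixing module (sketch): patch embedding, residual
    depthwise convolution, then pointwise convolution to re-mix channels."""
    def __init__(self, ch, patch=4):
        super().__init__()
        self.patch_embed = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=patch, stride=patch),
            nn.GELU(), nn.BatchNorm2d(ch))
        self.dw = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise, channel by channel
            nn.GELU(), nn.BatchNorm2d(ch))
        self.pw = nn.Sequential(
            nn.Conv2d(ch, ch, 1),  # pointwise, recombines channel information
            nn.GELU(), nn.BatchNorm2d(ch))

    def forward(self, x):
        x = self.patch_embed(x)
        x = x + self.dw(x)  # residual depthwise convolution
        return self.pw(x)

class SEAM(nn.Module):
    """Separated and enhancement attention module (simplified sketch)."""
    def __init__(self, ch, patch=4):
        super().__init__()
        self.csmm = CSMM(ch, patch)
        self.fc = nn.Sequential(  # two-layer fully connected fusion of channel info
            nn.Linear(ch, ch // 4), nn.ReLU(inplace=True),
            nn.Linear(ch // 4, ch), nn.Sigmoid())  # logits squashed to [0, 1]

    def forward(self, x):
        w = self.csmm(x).mean(dim=(2, 3))  # global average pooling over space
        w = torch.exp(self.fc(w))          # y = e^x expands [0, 1] to [1, e]
        return x * w[:, :, None, None]     # re-weight the original features
```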
The classification loss still used the BCE loss. The regression loss used the distribution focal loss (DFL) and the CIoU loss, since box regression in YOLOv8 is bound to the integral-form representation proposed in the DFL. The three loss functions were weighted with a ratio of 0.5:1.5:7.5. CIoU is an upgraded version of DIoU that adds an aspect-ratio term for the predicted box, which improves regression accuracy.
The CIoU formula is expressed using Equation (1):

$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v \quad (1)$$
The three terms in the formula correspond to the IoU, the center-point distance, and the aspect ratio, respectively. ρ(b, b^gt) denotes the Euclidean distance between the center points of the predicted box and the ground-truth box, and c denotes the diagonal length of the smallest enclosing region that contains both boxes. The parameters α and v are given by Equations (2) and (3):
$$\alpha = \frac{v}{1 - \mathrm{IoU} + v} \quad (2)$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (3)$$
Here, w, h and w^gt, h^gt represent the width and height of the predicted box and the ground-truth box, respectively. The final loss is expressed using Equation (4):

$$\mathrm{LOSS}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \quad (4)$$
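As a concrete reference, the following is a minimal sketch of Equations (1)–(4) for axis-aligned boxes in (x1, y1, x2, y2) format; it is a direct transcription of the formulas above, not the exact YOLOv8 implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss (Eq. 4) for boxes in (x1, y1, x2, y2) format, shape [N, 4]."""
    # intersection and union for the IoU term
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 and squared diagonal c^2 of the enclosing box
    c_p = (pred[:, :2] + pred[:, 2:]) / 2
    c_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((c_p - c_t) ** 2).sum(dim=1)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps

    # aspect-ratio consistency term v and trade-off weight alpha (Eqs. 2 and 3)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v  # LOSS_CIoU per box
```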
However, CIoU also has limitations. In particular, it is very sensitive to deviations in the positions of small targets, which seriously reduces its detection performance for such targets. To improve the detection performance for small targets while retaining the CIoU, the normalized Wasserstein distance (NWD) loss [36] was added. The NWD first models each bounding box as a 2D Gaussian distribution and then uses the Wasserstein distance to calculate the similarity between the corresponding Gaussian distributions. Compared with the traditional IoU, the advantages of the NWD are, firstly, that it can measure the distribution similarity regardless of whether small targets overlap and, secondly, that it is insensitive to targets of different scales and therefore more suitable for measuring the similarity between small targets. The NWD is expressed using Equation (5):
$$\mathrm{NWD}(N_a, N_b) = \exp\left(-\frac{\sqrt{W_2^2(N_a, N_b)}}{C}\right) \quad (5)$$
Here, W₂²(Na, Nb) is the squared Wasserstein distance between the two distributions, Na is the Gaussian distribution of the predicted box, and Nb is the Gaussian distribution of the ground-truth box. C is a constant related to the dataset; in this study, we set C to the average absolute size of the targets in the dataset. The NWD loss is expressed using Equation (6):
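A minimal sketch of Equation (5) follows, modeling each box (cx, cy, w, h) as a 2D Gaussian with its center as the mean and (w/2, h/2) as the standard deviations, as in the cited NWD work [36]; the 1 − NWD form of the final loss and the default value of C are assumptions for illustration, since Equation (6) is not reproduced in this excerpt.

```python
import torch

def nwd_loss(pred, target, C=32.0, eps=1e-7):
    """NWD loss sketch for boxes in (cx, cy, w, h) format, shape [N, 4].

    Each box is modeled as a 2D Gaussian N([cx, cy], diag([w/2, h/2])^2); for
    such Gaussians the squared 2-Wasserstein distance reduces to the squared
    Euclidean distance between the vectors [cx, cy, w/2, h/2].
    """
    p = torch.cat([pred[:, :2], pred[:, 2:] / 2], dim=1)
    t = torch.cat([target[:, :2], target[:, 2:] / 2], dim=1)
    w2 = ((p - t) ** 2).sum(dim=1)              # W_2^2(N_a, N_b)
    nwd = torch.exp(-torch.sqrt(w2 + eps) / C)  # Equation (5); C is dataset-dependent
    return 1 - nwd                              # assumed form of the NWD loss (Eq. 6)
```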
the position of the target’s center point relative to the whole figure. The lower-right panel
shows the height-to-width ratio of the target relative to the whole image. By synthesizing
the four charts, it was found that the dataset was relatively challenging: the numbers of instances per detection category were uneven, the targets varied greatly in size, and there were many small targets.
The comparison of the complexity showed that the two-stage detectors were much
more complex than the one-stage detectors. Among the one-stage detectors, YOLOv5n
had the lowest complexity. YOLOv8n had a significant increase in complexity over that of
YOLOv5n. There was not much difference between the improved YOLOv8 and YOLOv8n.
The highest complexity was that of YOLOv7-tiny. The number of parameters, the amount
of computation, and the sizes of the weights were much higher than those of the other
two detectors.
smallest targets and the most severe occlusion. Holothurians in marine ranches mostly inhabit the fine sand in the shallows; others live in the crevices of reefs. Therefore, the improvement strategies used here were most effective for holothurians.
Finally, video detection was tested with the improved YOLOv8, and the results are shown in Figure 9. Detection was excellent throughout the video, and the FPS was stable at around 40. This indicates that the system is capable of detecting marine organisms in seafloor videos in real time.
Overall, although the two-stage detector Faster-RCNN had a slight advantage in terms of mAP, its structure was complex and its real-time performance was poor, which made it unsuitable for video monitoring systems in seafloor observation networks. The improved YOLOv8, a one-stage model, had the best overall performance indicators: its performance was improved without essentially changing its complexity. It is therefore the most suitable for detecting marine organisms in the video monitoring systems of seafloor observation networks.
Figure 10. Comparison of detections 1: (a) detection by YOLOv8; (b) detection by improved YOLOv8.
A comparison of the detection results of the latter two groups is shown in Figure 11. We used red circles to mark the misdetections and omissions. The third and fourth groups of images mainly contained dense schools of fish. In the third group, the improved YOLOv8 detected sixteen fish, two more heavily occluded fish than the ordinary detector found. In the fourth set of images, the improved YOLOv8 detected a total of 54 fish, six more than the normal version, with one target still missed. This experiment showed that, faced with the small- and medium-sized targets and the greater degree of overlap in the marine ranch
environment, the improved YOLOv8 performed better than the ordinary YOLOv8 and was
more suitable for the detection of marine organisms under such conditions.
Figure 11. Comparison of detections 2: (a) detection by YOLOv8; (b) detection by improved YOLOv8.
the classical attention mechanisms SE [40] and CBAM [41] for a comparison. The SEAM
was the most effective: the corresponding losses from occluded marine organisms could be compensated by enhancing the response of non-occluded marine organisms, and the mAP increased by about 1.7%. A comparison of the different attention mechanisms is shown in Table 4.
4. Conclusions
This study aimed to improve the efficiency of monitoring target marine organisms in marine ranches, reduce the workload of staff, and provide a new approach to the efficient management of aquatic organisms in marine ranches. To achieve this goal, we improved the YOLOv8 detector. Firstly, the InceptionNeXt block was used in the backbone to replace
the original C2F block, which improved the feature extraction capabilities of the network
while keeping the number of parameters basically unchanged. Secondly, the SEAM was
incorporated into the neck to enhance the detection of overlapping targets by increasing
the attention to the detected object regions in images and weakening the background
regions. Finally, the NWD loss was added to the original CIoU loss, and the proportion
of the two functions was adjusted through experimentation. This resulted in improved
detection of small targets without compromising the detection performance for medium
and large targets. The traditional enhancement method of performing several types of
random transformations on images and a data enhancement method combining mosaic
and MixUp were used to improve the dataset during the training process, which enhanced
the robustness of the network in an attempt to obtain good results with limited resources.
Overall, the improved YOLOv8 in this study achieved a mAP of 84.5%, an increase of 6.2% over the original. Meanwhile, there was no significant increase in the number of parameters or computations, so a balance between detection performance and model size was
achieved. With this performance, it can be applied on seafloor observation platforms in marine ranches for the real-time detection of marine organisms.
However, there are still some areas that can be improved, including the following.
First, the dataset can be further improved. The dataset used here still suffered from a
small and uneven sample size. In the future, we can consider acquiring more videos and
images for model training and testing. There was also a serious imbalance in the number
of individual detection categories in the dataset, so we can consider balancing the number
of samples. Second, underwater images differ from land images in their low contrast, non-uniform illumination, blurring, bright spots, and high noise caused by a variety of complicating factors. The images in the dataset can be enhanced with image enhancement algorithms to improve their clarity and facilitate subsequent work. Finally, further improvements and model testing can be considered to make the system more lightweight and faster so that it can be better adapted to embedded devices on experimental platforms, opening it up to a wider range of applications.
Author Contributions: Conceptualization, R.J.; Methodology, R.J. and L.C.; Formal analysis and
investigation, M.L. and J.C.; Writing—original draft preparation, R.J.; Writing—review and editing,
B.L. and H.L. All authors have read and agreed to the published version of the manuscript.
Funding: This study is supported in part by Qilu University of Technology (Shandong Academy of
Sciences) Pilot Project of Science, Education, and Industry Integration Major Innovation Special Project
“Project of Unveiling System”, Pivotal Technologies for Ocean Intelligent Sensing and Information
Processing Based on End-to-End Cloud Architecture [No.2023JBZ02], the National Natural Science
Foundation of China (42106172), Project Plan of Pilot Project of Integration of Science, Education and
Industry of Qilu University of Technology (Shandong Academy of Sciences) (2022GH004, 2022PY041).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare that they have no conflicts of interest or competing financial
and/or nonfinancial interests in relation to this work.
References
1. Agardy, T. Effects of fisheries on marine ecosystems: A conservationist’s perspective. ICES J. Mar. Sci. 2000, 57, 761–765.
[CrossRef]
2. Greenville, J.; MacAulay, T. Protected areas in fisheries: A two-patch, two-species model. Aust. J. Agric. Resour. Econ. 2006, 50,
207–226. [CrossRef]
3. Hu, K.; Weng, C.; Zhang, Y.; Jin, J.; Xia, Q. An Overview of Underwater Vision Enhancement: From Traditional Methods to Recent
Deep Learning. J. Mar. Sci. Eng. 2022, 10, 241. [CrossRef]
4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
5. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile,
7–13 December 2015; pp. 1440–1448. [CrossRef]
6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
7. Lin, W.; Zhong, J.; Liu, S.; Li, T.; Li, G. ROIMIX: Proposal-Fusion Among Multiple Images for Underwater Object Detection.
In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 May 2020.
[CrossRef]
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [CrossRef]
9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [CrossRef]
10. Bochkovskiy, A.; Wang, C.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
[CrossRef]
11. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.
arXiv 2022, arXiv:2207.02696. [CrossRef]
12. Han, F.; Yao, J.; Zhu, H.; Wang, C. Underwater Image Processing and Object Detection Based on Deep CNN Method. J. Sens. 2020,
2020, 6707328. [CrossRef]
13. Chen, L.; Zheng, M.; Duan, S.; Luo, W.; Yao, L. Underwater Target Recognition Based on Improved YOLOv4 Neural Network.
Electronics 2021, 10, 1634. [CrossRef]
14. Zhao, S.; Zhang, S.; Lu, J.; Wang, H.; Feng, Y.; Shi, C.; Li, D.; Zhao, R. A lightweight dead fish detection method based on
deformable convolution and YOLOV4. Comput. Electron. Agric. 2022, 198, 107098. [CrossRef]
15. Sun, S.; Xu, Z. Large kernel convolution YOLO for ship detection in surveillance video. Math. Biosci. Eng. MBE 2023, 20,
15018–15043. [CrossRef] [PubMed]
16. Zhang, Q.; Li, Y.; Zhang, Z.; Yin, S.; Ma, L. Marine target detection for PPI images based on YOLO-SWFormer. Alex. Eng. J. 2023,
82, 396–403. [CrossRef]
17. Shen, X.; Wang, H.; Li, Y.; Gao, T.; Fu, X. Criss-cross global interaction-based selective attention in YOLO for underwater object
detection. Multimed. Tools Appl. 2023. [CrossRef]
18. Yu, G.; Su, J.; Luo, Y.; Chen, Z.; Chen, Q.; Chen, S. Efficient detection method of deep-sea netting breakage based on attention and
focusing on receptive-field spatial feature. Signal Image Video Process. 2023. [CrossRef]
19. Lv, C.; Cao, S.; Zhang, Y.; Xu, G.; Zhao, B. Methods studies for attached marine organisms detecting based on convolutional
neural network. Energy Rep. 2022, 8, 1192–1201. [CrossRef]
20. Li, Y.; Bai, X.; Xia, C. An Improved YOLOV5 Based on Triplet Attention and Prediction Head Optimization for Marine Organism
Detection on Underwater Mobile Platforms. J. Mar. Sci. Eng. 2022, 10, 1230. [CrossRef]
21. Li, L.; Shi, G.; Jiang, T. Fish detection method based on improved YOLOv5. Aquac. Int. 2023, 31, 2513–2530. [CrossRef]
22. Favali, P.; Beranzoli, L. Seafloor observatory science: A review. Ann. Geophys. 2006, 49, 515–567. [CrossRef]
23. Matabos, M.; Best, M.; Blandin, J.; Hoeberechts, M.; Juniper, K.; Pirenne, B.; Robert, K.; Ruhl, H.; Sarrazin, J.; Vardaro, M. Seafloor Observatories. In Biological Sampling in the Deep Sea; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2016.
24. Chen, J.; Liu, H.; Lv, B.; Liu, C.; Zhang, X.; Li, H.; Cao, L.; Wan, J. Research on an Extensible Monitoring System of a Seafloor
Observatory Network in Laizhou Bay. J. Mar. Sci. Eng. 2022, 10, 1051. [CrossRef]
25. Lv, B.; Chen, J.; Liu, H.; Chao, L.; Zhang, Z.; Zhang, X.; Gao, H.; Cai, Y. Design of deep-sea chemical data collector for the seafloor
observatory network. Mar. Georesour. Geotechnol. 2022, 40, 1359–1369. [CrossRef]
26. Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNeXt: When Inception Meets ConvNeXt. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [CrossRef]
27. Wang, C.; Liao, H.; Wu, Y.; Chen, P.; Hsieh, J.; Yeh, I. CSPNet: A new backbone that can enhance learning capability of CNN. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA,
14–19 June 2020. [CrossRef]
28. Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE Computer
Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [CrossRef]
29. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [CrossRef]
30. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In
Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015. [CrossRef]
31. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
[CrossRef]
32. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on
Learning. arXiv 2016, arXiv:1602.07261. [CrossRef]
33. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [CrossRef]
34. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [CrossRef]
35. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. YOLO-FaceV2: A Scale and Occlusion Aware Face Detector. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [CrossRef]
36. Wang, J.; Xu, C.; Yang, W.; Lei, Y. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [CrossRef]
37. Cutter, G.; Stierhoff, K.; Zeng, J. Automated Detection of Rockfish in Unconstrained Underwater Videos Using Haar Cascades.
In Proceedings of the Applications and Computer Vision Workshops (WACVW), Waikoloa Beach, HI, USA, 5–9 January 2015.
[CrossRef]
38. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [CrossRef]
39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [CrossRef]
40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [CrossRef]
41. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.