Ecological Informatics
journal homepage: www.elsevier.com/locate/ecolinf
Computer Vision and Machine Intelligence Group, Department of Computer Science, College of Engineering, University of the Philippines, Philippines
Keywords: Fish detection; Deep learning applications to the environment

Abstract: We propose a fish detection system based on deep network architectures to robustly detect and count fish objects in the wild under a variety of benthic background and illumination conditions. The algorithm consists of an ensemble of Region-based Convolutional Neural Networks that are linked in a cascade structure by Long Short-Term Memory networks. The proposed network is efficiently trained as all components are jointly trained by backpropagation. We train and test our system on a dataset of 18 videos taken in the wild. In our dataset, there are around 20 to 100 fish objects per frame, with many fish objects having small pixel areas (less than 900 square pixels). From a series of experiments and ablation tests, the proposed system preserves detection accuracy despite multi-scale distortions, cropping and varying background environments. We present analysis that shows how object localization accuracy is increased by an automatic correction mechanism in the deep network's cascaded ensemble structure. The correction mechanism rectifies any errors in the predictions as information progresses through the network cascade. Our findings in this experiment regarding ensemble system architectures can be generalized to other object detection applications.
1. Introduction

Fish detection and counting are crucial tasks in marine science for temporal tracking of species, understanding of fish behaviour (Spampinato et al., 2014), and aquaculture (Zion, 2012), among others. For fisheries management and policy formulation, keeping track of fish stocks and population is crucial to effectively control fish harvesting, promote breeding and prevent stock depletion (Walsh et al., 2004). For these reasons, the size of fish populations has to be accurately determined through surveys (Costa et al., 2006).

Traditionally, fish surveys are carried out by recording information of fish captured in traps, in nets by trawling, with lines, or through the use of piscicides. Capture-tag-recapture methods are also used for determining age, growth, movement and behaviour in reef fish populations. Non-capture techniques include underwater visual census by divers and hydroacoustic methods, which are more accurate and non-destructive (Spampinato et al., 2014). However, diver observation of fishes may suffer from observational bias as many fish species instinctively evade human divers, swimming away from the survey area (Spampinato et al., 2010).

To address these drawbacks, the use of cameras, which are non-invasive and less conspicuous to fishes, has been suggested (Katsanevakis et al., 2012). Camera-based monitoring also offers rapid and continual observation of fish species to keep up with rapid shifts in population distribution (Hollowed et al., 2013; Mieszkowska et al., 2014). Some works proposed in situ monitoring programs using ROV vessels that are cost-effective (Siddiqui et al., 2017). However, these methods require manual offline annotation of collected video frames by fish experts. Manual annotation is very inefficient since classifying and annotating a minute of footage may take up to 15 min of a marine biologist's time (Spampinato et al., 2008). Given the number of frames that have to be processed, manual annotation requires statistical sampling techniques to gather confident estimates of the fish population, which could lead to possible sampling errors by novice annotators.

An attractive alternative is to use computer vision techniques to detect fish from videos or image stills and automate the counting process. This allows the use of camera set-ups for monitoring, as well as automated and efficient fish counting. However, this approach presents non-trivial difficulties. Automatic detection of fish objects in underwater videos needs to deal with several challenges (Garcia et al., 2002; Labao and Naval, 2017; Negahdaripour and Yu, 1995). Underwater media produce light scattering effects, wavelength-dependent absorption, and lens/air/water interface image distortions. Suspended particles in water deflect photons from their straight-line trajectories and introduce backscatter, termed "marine snow" (Horgan and Toal, 2009). Longer wavelengths of visible light are strongly absorbed by water,
⁎ Corresponding author. E-mail address: [email protected] (P.C. Naval).
https://doi.org/10.1016/j.ecoinf.2019.05.004
Received 2 December 2018; Received in revised form 4 May 2019; Accepted 6 May 2019; Available online 9 May 2019
1574-9541/ © 2019 Elsevier B.V. All rights reserved.
A.B. Labao and P.C. Naval Ecological Informatics 52 (2019) 103–121
resulting in varying fish colors relative to camera distance and depth. These factors confuse classical detection algorithms that are not designed to handle such difficulties. This is compounded by the fact that large numbers of fish, from 20 to 100 individuals per frame, have to be detected.

Some early methods that attempted to perform automatic fish detection often relied on background subtraction (Garcia et al., 2002). This approach gathers motion information using pixel-wise subtraction of consecutive image frames to segment and localize fish objects from static backgrounds. However, these approaches are limited by their dependence on fixed camera setups, static backgrounds, non-varying illumination conditions, and on the assumption that the fishes are in motion. The latter condition may not be true for some fish species. Furthermore, the underwater media problems mentioned above pose serious difficulties to background subtraction algorithms.

Recent advances in computer vision and machine learning provide methods that can potentially address the challenges presented by underwater media. Most of these techniques are based on deep learning algorithms that address the limitations of previous algorithms which rely on motion information or manually crafted features. Deep learning methods automatically generate features using convolution and other operations (Krizhevsky et al., 2012; LeCun et al., 1988). The most popular deep learning method for computer vision tasks is the Convolutional Neural Network (CNN), whose variants have been successfully applied to numerous image classification tasks (Karpathy et al., 2014; Krizhevsky et al., 2012; Simonyan and Zisserman, 2014).

Deep learning methods in computer vision have also progressed to deal with localization tasks. Several recent localization techniques utilize variations of the base CNN to predict bounding box coordinates of objects. One of the first deep learning localization networks is the Region-based Convolutional Neural Network (R-CNN), which uses a selective search procedure to generate object proposals (Girshick et al., 2016). A further improvement in localization networks is the Faster R-CNN, which automates the proposal generation process itself using a Region Proposal Network (RPN) (Ren et al., 2017). Faster R-CNN served as a base network architecture for several other localization models (Dai et al., 2015; Li et al., 2016), and the G-RMI network (Fathi et al., 2019). The G-RMI network is notable since it uses an ensemble architecture whereby predictions of several networks are combined to increase accuracy. However, we note that several of these detection works use standard datasets (Everingham et al., 2010), and have not been applied to image data taken in the wild, which is the objective of this paper.

For the fish detection task, some prior works have implemented deep learning based systems (Labao and Naval, 2017; Li et al., 2015; Villon et al., 2016; Zhuang et al., 2017). In particular, Villon et al. (2016) found that deep learning models outperform classical machine learning techniques that rely on manually crafted features. Their experiments were performed on the SEACLEF database, which consists of 20 to 30 fish objects per frame. To differentiate these experiments from our paper, we note that the dataset which we use to train our models is more challenging and reflects more closely the actual number of fish objects at benthic depths. The dataset consists of 18 underwater videos, separated into 10 training videos and 8 test videos. In addition, fish objects are more dense, numbering 20 to 100 fish objects per frame, with small sizes of less than 900 square pixels in area. This adds up to a total of close to 10,000 fish objects that have to be detected by the algorithm, the majority of which are small. We also explicitly set the number of fish objects in the training data set to be less than the number of fish objects in the test data set. This is to test the capacity of algorithms to generalize well over harder environments.

Given the challenges presented by the dataset, standard deep learning localization models may not be able to perform well, and some enhancements to the base network architecture are needed. Hence, this paper proposes a deep learning architecture that adopts an ensemble structure whose components are detector networks (Ren et al., 2017). G-RMI implements the traditional type of ensemble by combining the outputs of several independent networks. Our proposed ensemble uses a special structure where the ensemble components are arranged in a cascade. The cascade components are not independent since they have connections in the form of Long Short-Term Memory (LSTM) links (Hochreiter and Schmidhuber, 1997). Moreover, the flow of information from one cascade to the next provides an automatic correction mechanism that increases accuracy. The proposed model has other benefits, such as (1) cascade components that are jointly trained in a single backpropagation pass and (2) LSTM links that process information with an attention mechanism which confers robustness against image distortions.

To assess the performance of our model, we compare our approach to a second ensemble system similar to traditional ensembles (Fathi et al., 2019) and to a strong baseline model composed of a single Faster R-CNN network as applied to the SEACLEF database (Zhuang et al., 2017). Experiments consist of a series of tests: (test 1) prediction on unseen fish objects found in the training frames, (test 2) prediction on new frames, (test 3) prediction on fish objects with multi-scale distortions and cropping, and (test 4) ablation tests. Experimental results show that for test 1, all networks performed similarly. For tests 2 and 3, the proposed cascaded ensemble outperformed the other systems. We conjecture that the cascaded structure of our proposed system benefits from the automatic correction mechanism where cascades repeatedly refine initial proposals. In addition, the attention mechanism in System 1's LSTM links makes it more robust against scale distortions. This is verified in test 4, where LSTM links with an attention mechanism significantly improve multi-scale inference.

For future work, our proposed cascade ensemble structure could be generalized to include other components aside from Faster R-CNN, as well as to detect objects other than fish. In summary, our paper has these contributions:

• a localization network that adopts a cascaded ensemble structure, where components are linked by an LSTM network. For efficiency, all network components are trained in a single backpropagation pass
• an automatic correction mechanism under the cascade structure to lower prediction errors, along with an attention mechanism in recurrent network links for more robust predictions against image distortions
• a new dataset of 18 underwater video sequences with varying illumination conditions and backgrounds. Close to 88% of the fish objects in the test set have small object sizes of less than 900 square pixels, and training videos have fewer fish objects than test videos to test the generalization capacities of models
• performance comparisons of the cascade ensemble with traditional ensemble systems and a strong baseline single object detector, under 4 tests with multi-scale distortions, cropping, and ablation
• experiments that show better performance for the cascade ensemble. The benefits of the automatic correction mechanism and attention are demonstrated along with analysis.

2. Deep learning neural networks

This section presents some concepts on deep learning-based detection networks using Faster R-CNN as the base architecture. Briefly, a deep network is simply a neural network with several layers, thereby providing it with depth (Goodfellow et al., 2016). Such a neural network can automatically extract informative features that are appropriate for its given task (Goodfellow et al., 2016). Early neural networks were unable to increase their depth significantly due to vanishing gradients, which are circumvented by deep networks through the use of a non-squashing activation function such as the Rectified Linear Unit (ReLU) (LeCun et al., 2015). Furthermore, progress
in loss functions (Goodfellow et al., 2016) (i.e., the function that measures the amount of error in a network's prediction relative to ground truth) enabled better-behaved supervised training of deep networks.

Early deep networks were classification networks (Krizhevsky et al., 2012), but localization networks were proposed shortly after (Ren et al., 2015). Since this paper concentrates on localization networks, we show in Fig. 1 the general flow of Faster R-CNN (Ren et al., 2015), which is considered the standard deep learning localization network. The Faster R-CNN model has three main parts: (1) a CNN network trunk, (2) a proposal-generator RPN network, and (3) the R-CNN region classification network.

The CNN network trunk is the first component in Faster R-CNN. It receives an input RGB image of arbitrary size and generates a feature map containing highly informative features describing the input image. These features will be used to predict possible locations of objects (a detection task), as well as to predict their objectness probability (i.e. whether they correctly represent an object or not, which is a classification task). Given the feature map, detection is done using a proposal generation process carried out by the Region Proposal Network (RPN). The classification task is handled by the R-CNN, which uses the same feature map to predict objectness probabilities of the proposals. The CNN, RPN and R-CNN are jointly trained during backpropagation, thereby increasing training efficiency significantly.

2.1. Convolutional neural networks (CNN)

The Convolutional Neural Network (CNN) provides the feature generation component used by most deep networks designed for image analysis tasks (Krizhevsky et al., 2012). The trunk of a CNN is formed by a series of convolutional filters that are convolved over the input image to generate a stack of feature maps, with each feature map containing a set of special features characterizing the image. At the front end of the CNN trunk are usually found filters that detect low-level features which represent edges and color patterns. These are followed by increasingly sophisticated features, i.e. shapes and contours, in later parts of the trunk. The final output of the CNN trunk is a highly informative feature map that serves as input for the Region Proposal Network and R-CNN components of our detection network. For our proposed network, the trunk adopts a residual network structure from He et al. (2016).

In the case of underwater fish localization, the CNN's capacity to automatically generate features from arbitrary inputs is useful. It circumvents the need for features that rely solely on motion or brightness levels, i.e. by relying also on features that depend on edges, shapes, or contours that are brightness-independent. This renders CNN models robust to illumination changes as a result of changing depth and movement in surface waters. In addition, since the input consists of an RGB image, the CNN is also able to automatically generate features that use color information to differentiate fish objects from their background. This is shown in Fig. 1, where the features in the network trunk (from low level to high level) combine edge information along with color information.

2.2. Region proposal network for detection (RPN)

Early deep learning classification networks classify a single object found within the input image. This changes in the case of detection tasks, since an input image can contain several objects. To handle detection tasks, one of the methods used is proposal generation. Proposals are data structures that contain information on the locations of possible objects in the input image. For Faster R-CNN, a proposal is a 4-element tuple consisting of coordinates that represent corners of bounding boxes.

Faster R-CNN automates the production of proposals using a Region Proposal Network (RPN). We can view the RPN as another CNN with its own set of convolutional filters. The filters of the RPN operate over the input feature map from the trunk and predict a tuple consisting of objectness probabilities and proposal coordinates of boxes fixed at certain locations dispersed across the image. These locations, called "anchors", form the points of a grid that spans the entire input image. The anchors are spaced at intervals of 9 or 12 pixels, and each anchor is assigned a set of 'anchor boxes' as shown in Fig. 2. Using its filters, the RPN predicts for each anchor box its objectness probability and regressed bounding box coordinates. Anchor boxes with high objectness probabilities are stored as proposal candidates and serve as inputs for the R-CNN since they are more likely to contain objects (Ren et al., 2015).

We note that the standard Faster R-CNN implemented a single RPN since one RPN is sufficient to detect the relatively larger objects in the PASCAL VOC dataset. However, for our proposed network, we increase
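As a concrete illustration of the anchor-grid idea described above, the sketch below generates anchor boxes on a regular pixel grid. The 9-pixel spacing follows the text, while the scales, aspect ratios, and the function name `make_anchor_boxes` are illustrative assumptions, not values reported by the paper.

```python
import numpy as np

def make_anchor_boxes(img_h, img_w, stride=9,
                      scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes [x1, y1, x2, y2] centered on a regular grid.

    stride: spacing between anchor centers in pixels (the paper uses
    9- or 12-pixel intervals); scales/ratios are illustrative only.
    """
    boxes = []
    for cy in range(0, img_h, stride):       # anchor centers, row-wise
        for cx in range(0, img_w, stride):   # anchor centers, column-wise
            for s in scales:
                for r in ratios:
                    # width/height chosen so that w * h = s * s and w / h = r
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    boxes.append([cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2])
    return np.array(boxes)

anchors = make_anchor_boxes(108, 192)
```

Each grid point receives one box per scale/ratio combination, so the number of candidates grows quickly even for a small image.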
2.3. Region classification network (R-CNN)

The R-CNN is a sub-network CNN that operates on 'smaller' feature map inputs, as shown in Fig. 1. It predicts the objectness probability and actual coordinates of each object captured by a proposal from the RPN. The R-CNN is actually a refinement network that corrects bounding box coordinates and objectness probabilities of each proposal. For this paper, our main contribution lies in modifying the R-CNN to a cascaded ensemble for improved localization accuracy.

2.4. Ensembles

3. Methodology

The three systems receive a single RGB video frame as input. The input frame's dimensions (H × W × 3) can vary in height (H) and width (W) but are fixed at the 3 RGB channels. For output, the three systems provide a set of box coordinates for each detected fish object. For System 1, we enumerate its features as follows:

3.1. System 1 architecture features

• single 50-layer Residual Network trunk
• two RPN systems, to generate proposals that accommodate both small and large fish objects
Fig. 3. System 1 Network Structure - a single 50-layer residual trunk provides shareable features for 2 Region Proposal Networks and an R-CNN composed of 7 cascade components. The feature map from the 50-layer residual network trunk is the input for RPN-1 and RPN-2 as well as for each of the cascades. The vertical connections over the cascades represent the sharing of the feature map with each of the cascade components. Hence, each cascade component re-uses the feature map from the main CNN trunk.

[The trunk] is composed of a series of 1 × 1 - 3 × 3 - 1 × 1 convolutions with skip addition and batch normalization layers (He et al., 2016). The trunk receives an RGB image of arbitrary size as input and outputs a shareable feature map of H/16 × W/16 × 1024 dimensions. For System 1, we use a single 50-layer residual network trunk, where its shareable feature map output is used for both the RPN and R-CNN.

3.3. Region proposal network (RPN) for proposal generation

[…] coordinates [x1, y1, x2, y2]. During training, 256 proposals are processed, where 50% have foreground labels and the other 50% have background labels. During inference, the top 4000 proposals with the highest predicted objectness probabilities are fed to the R-CNN.

3.4. Multi-cascade R-CNN with an ensemble of 7 components and LSTM links
Fig. 4. The structure of two interconnected cascade components in the R-CNN. In System 1, cascade components are extended to 7 cascades, while for Systems 2 and 3, the R-CNN has only two cascades.

[…] end of the network trunk using RoI-cropping. Sub-feature maps are resized to a uniform 6 × 6 × 1024;

Step 3: Pass the 6 × 6 × 1024 sub-feature map to a series of convolutional layers (three residual blocks). The sub-feature map is downsized to 3 × 3 at a depth dimension D = 2048;

Step 4: Pass the 3 × 3 sub-feature map to an LSTM cell, where the sub-feature map is resized to sequential form (1 × 9 × 2048) and the LSTM sequentially processes each block. In this model, the LSTM sequence has 9 blocks as shown in Fig. 4. In each cascade, the LSTM re-uses the hidden states Sj−1 computed from the previous cascade (except for the first cascade j = 1). The 2048-dimensional hidden units are passed to the fully connected layer;

Step 5: Using the previous step's inputs, the fully connected layer predicts objectness probabilities p(j, i) and bounding box coordinate adjustments t(j, i). Coordinate adjustments are of the form [dx, dy, dw, dh], which refer to adjustments with respect to the reference box center x and y coordinates and the box height h and width w;

Step 6: Predicted bounding box coordinate adjustments are processed by a state-bridge layer (following Dai et al., 2015) which transforms the reference RoI Bj−1 to a new set of RoIs Bj of the form [x1, y1, x2, y2] using the predicted [dx, dy, dw, dh];

Step 7: The current hidden state Sj and the new set of RoIs Bj are passed to the next cascade j + 1, which begins again at Step 1.

The R-CNN in System 1 is trained using a compound loss function for each proposal RoI i and for each cascade j = 1, …, 7 following Eq. (2). Loss is averaged over RoIs i.

Lj,i = lcls(pj,i) + lreg(tj,i)    (2)

where lcls is a softmax loss function while lreg is the smooth-L1 loss function. The quantity pj,i refers to cascade j's prediction of the objectness probability for proposal RoI i, while tj,i refers to cascade j's prediction of coordinate adjustments for RoI i. The end-output of the R-CNN is pj=7,i and Bj=7, referring to the final predicted probability of an RoI i and the final box coordinates.

3.5. Total loss and training details

The total loss of the network combines the compound losses of the RPN and of each cascade in the ensemble R-CNN:
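Step 6's state-bridge transform can be sketched as follows, assuming the conventional R-CNN box parameterization (center shifts scaled by box size, log-scale width/height changes); the function name `apply_deltas` is ours, not the paper's.

```python
import numpy as np

def apply_deltas(rois, deltas):
    """Transform reference RoIs [x1, y1, x2, y2] with predicted
    adjustments [dx, dy, dw, dh]: dx, dy shift the box center
    (relative to box size); dw, dh rescale width and height."""
    x1, y1, x2, y2 = rois.T
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    dx, dy, dw, dh = deltas.T
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * np.exp(dw), h * np.exp(dh)

    # back to corner form [x1, y1, x2, y2]
    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)
```

With zero deltas the transform is the identity, so a cascade that predicts small adjustments only nudges the previous cascade's boxes.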
Fig. 5. System 1 R-CNN with Sequential LSTM Structure. The cascaded network in System 1 has 7 ensembled CNN components that are linked together through a sequential LSTM unit. The LSTM unit performs an attention mechanism over each CNN cascade output by reshaping the 3 × 3 CNN output tensor to 1 × 9 and treating each element of the 9-dimensional flattened tensor as a part of a sequence.
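The 3 × 3 → 1 × 9 reshaping that feeds the sequential LSTM can be sketched in a few lines; `lstm_sequence_input` is a hypothetical name, and the sketch covers only the reshaping step, not the LSTM itself.

```python
import numpy as np

def lstm_sequence_input(roi_feature):
    """Reshape an h x w x D RoI feature map (3 x 3 x 2048 in System 1)
    into a (1, h*w, D) sequence so an LSTM can process the spatial
    blocks one at a time, as described for the attention mechanism."""
    h, w, d = roi_feature.shape
    return roi_feature.reshape(1, h * w, d)

roi = np.random.rand(3, 3, 2048)
seq = lstm_sequence_input(roi)
```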
Ltotal = Lrpn + (1/R) Σ_{j=1..7} Σ_i Lj,i    (3)

where R is the number of proposal RoIs.

For all three systems, network training is performed end-to-end for 200 epochs (with 300 training frames for each epoch). We use a weight decay parameter of 0.0001 and a Nesterov momentum value of 0.9. The learning rate is set at 0.001 and is divided by 10 after 150 epochs. No image pre-processing is performed other than subtraction of the mean image. The network is initialized with ImageNet weights from He et al. (2016), as done elsewhere (Dai et al., 2015; Ren et al., 2017). In a forward pass, 4000 proposals per RPN are processed, and the top proposals for objectness are retained after Non-Maximum Suppression (NMS) with a 0.3 threshold (pre R-CNN). Final predicted proposals (post R-CNN) are NMS-suppressed with a 0.1 threshold.

4. Training and test data description

In this section, we present (1) statistics of our data, (2) performance metrics and four test schemes, and (3) experimental results for each of the 4 test schemes. In (3), we insert some qualitative analysis of the experimental results, and we provide a mathematical treatment of the analysis in the Appendix for reference.

4.1. Statistics on training and test set

Our training data consist of ten (10) underwater video sequences for a total of 300 training frames, with more than 10,000 fish objects. The videos were obtained at depths ranging from 7 to 24 m, taken from a custom-made stereo rig composed of three (3) GoPro cameras. The video frames have a wide variety of backgrounds and most contain large numbers of fish objects of different species. In general, the training and test data in this experiment are harder than the benchmark PASCAL VOC dataset (Everingham et al., 2010), for the following reasons:

• uneven and changing illumination conditions, water backscattering effects, presence of marine snow, fish motion, and similar appearance of fish objects with the coral background
• around half of the objects are small objects having areas smaller than 900 square pixels and with deformable shapes
• a much larger quantity of objects per frame to be detected in the test set compared to the training set, ranging from 20 to more than 100

For the test data, we gathered eight (8) videos with some differences in background environments from the training data. For each video, we randomly sample 3 frames for manual labeling, amounting to 27 frames with more than 2000 fish objects to be detected in total. We describe the datasets in Table 1, where for index notation, we append each train and test video with a 'J' at the beginning. The fish objects in the training data were manually annotated by a marine science researcher for expert verification.

In general, the number of fish objects in the 8 test videos is larger than in the training videos. This presents a unique challenge for localization systems since they have to generalize over a more difficult test set. However, this suits the purpose of this experiment, which is to test the capacity of models to generalize over new environments. Table 2 shows the different fish object sizes found among the 8 test videos. As can be seen, roughly 88% have object sizes that fall between 100 and 2500 square pixels. In the COCO dataset (Lin et al., 2014) these sizes fall under the 'small' object size, and are among the harder-to-localize objects.

In terms of background, we include in Table 1 the rough proportion of water column areas against seabed and coral areas. We also include additional information on the illumination conditions of the video and on background objects, i.e. rocks/corals/particles.
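The NMS filtering used in the training details above (greedy suppression at a 0.3 IoU threshold pre R-CNN, 0.1 post R-CNN) can be sketched with a generic reference implementation; this is not the paper's code, only the standard greedy algorithm.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy Non-Maximum Suppression: keep the highest-scoring box,
    drop any remaining box overlapping it by more than iou_threshold,
    and repeat until no boxes are left."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box against the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # suppress high overlaps
    return keep

boxes = np.array([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_threshold=0.3)
```

In the example, the second box overlaps the top-scoring box heavily and is suppressed, while the distant third box survives.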
Table 1
Training Data Statistics for the 18 videos (10 for training, 8 for test).

Training Video | Average Number of Fish Objects | % Water Column | Specs | Illumination
Test Video | Average Number of Fish Objects | % Water Column | Specs | Illumination
Table 2
Fish Size Statistics for the test set, summed across all 8 test videos.

Fish Size x (in square pixels)    Number of Objects
x < 100                           5
100 ≤ x < 900                     929
900 ≤ x < 2,500                   478
2,500 ≤ x < 10,000                168
x ≥ 10,000                        6

[…] 8 independent test videos with entirely new environments and illumination conditions. In a way, test 2 is more crucial for assessing a system's performance than test 1 since its frames are independent from the training set. Test Set 3 builds upon Test Set 2 and uses multi-crop and multi-scale inference to check if the system can still localize despite additional distortions and removal of global information. Test Set 4 is an ablation test to check if the LSTM sequential links effectively propagate information across ensemble components.
Fig. 6. Sample fish object detections taken from the test set using System 1. Green boxes are localization outputs from the algorithm, while red boxes depict some of
the missed fish objects. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
[…] tested in new environments. This is despite System 1 having only a single network trunk, compared to the network structure of System 2 with three separate trunks.

Precision and recall measures in Test 2, however, are lower compared to single-crop inference on the training set in Test 1. But these could be brought further up by increasing the number of proposals of the network in each forward pass during inference. This will be performed in subsection 7.3, where the three systems are subjected to multi-crop inference testing with multi-scale distortions.

7.2.1. System 1 cascade correction mechanism

A cascade ensemble architecture can potentially perform correction mechanisms. The correction occurs as the CNN sub-networks pass information from one component in the cascade to the next, such that prior errors in early cascades are rectified in later cascades. To show the need for correction mechanisms, Fig. 8 shows a proposal instance received by the first component of the R-CNN. The proposal has an IoU (Intersection over Union) in excess of 50%. Hence, it is a valid candidate for bounding box regression. However, as seen in Fig. 8, there is inherent ambiguity in the information provided by the initial proposal, i.e. the actual fish object can acquire several possible
Fig. 7. Sample fish object detections taken from the test set using System 1. Green boxes are localization outputs from the algorithm, while red boxes depict some of
the missed fish objects. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
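The IoU criterion used when validating proposals (in excess of 50% for bounding box regression candidates) can be computed directly from corner coordinates; a minimal sketch:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, two 10 × 10 boxes overlapping by half their width have an IoU of 1/3, below the 50% criterion, even though they share half their area.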
Table 3
Test Set 1: Performance Results on 60 frames, where frames are taken from training video sequences.

System          Precision  Recall  F-Score
System 1        67.21      64.56   65.86
System 2        67.28      68.25   67.76
R-CNN Baseline  69.81      62.72   66.07

Table 4
Test Set 2: Performance Results on the 8-Video Test Set with New Backgrounds.

System          Precision  Recall  F-Score
System 1        53.29      37.77   44.21
System 2        47.37      33.54   39.28
R-CNN Baseline  43.99      21.00   28.43

The bold figures indicate the highest score for each performance measurement.

orientations for its tail. These orientations cannot be inferred from the initial proposal alone. If the R-CNN is limited to a single cascade, it is forced to select one of the many likely orientations of the fish object. If ever it incurs an error in bounding-box regression under a single-component R-CNN (i.e. chooses a wrong orientation of the tail), it does not have a chance to rectify its error.

Fig. 9 shows the correction process that occurs in the ensemble cascade. The arrows in Fig. 9 indicate the diverse range of possible paths that the ensemble cascade can take from the initial proposal of the RPN (leftmost) to the final cascade (rightmost). The initial proposal of the RPN (denoted as x) is usually not very precise and has bounding box coordinates that do not properly enclose the fish object (where errors are represented by the term ε). The ensemble cascade relies on the assumption that as cascades progress from 1 to K, errors gradually decrease by a factor of β, where β ∈ (0, 1). This assumption is reasonable given that each cascade minimizes a convex loss function under SGD
Fig. 8. Inherent ambiguity of information in initial proposals relative to ground truth (rightmost figure). The initial proposal (leftmost figure) shows a portion of a
fish object. However, the actual fish object is larger than the initial proposal and can have several orientations of the tail. The correct orientation cannot be inferred
from the initial proposal alone.
Fig. 9. Illustration of correction mechanisms. The initial proposal (leftmost figure) has an inaccurate bounding box with error ε relative to ground truth (rightmost
figure). The 1st cascade uses bounding box regression (i.e. function f) to reduce error to βε, where β ∈ (0, 1) is a correction parameter. Each RoI beginning at the
proposal has two arrows from left to right (for illustration purposes), where these arrows express the diverse possibilities that the R-CNN in a cascade can predict. It
follows that there are several possible paths from the initial proposal to the final prediction due to information ambiguity. But as long as β ∈ (0, 1), each prediction of
a cascade serves to reduce the initial error ε. Eventually, after several cascades, the predictions converge closely to the ground truth (rightmost figure), with β^K being a very small number.
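The claimed reduction is geometric: if each cascade shrinks the remaining localization error by a factor β ∈ (0, 1), then K cascades leave an error of β^K ε. A minimal numeric sketch (β and ε here are illustrative values, not figures from the paper):

```python
beta = 0.5      # illustrative correction parameter, beta in (0, 1)
eps0 = 40.0     # illustrative initial localization error (pixels)

err = eps0
for k in range(1, 8):      # 7 cascades, as in System 1
    err *= beta            # each cascade reduces the error by factor beta
    print(f"cascade {k}: error = {err:.4f}")

# After 7 cascades the residual error is beta**7 * eps0 = 0.3125
```

Even a modest per-cascade correction factor drives the residual error close to zero within a handful of cascades, which is the intuition behind the convergence shown in the rightmost panel of Fig. 9.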
Fig. 10. Cascaded information gathering across steps. Neighboring contextual information is extracted as cascades progress: cascade 1 (1st row, 1st col), cascade 2 (1st row, 2nd col), cascade 3 (1st row, 3rd col), cascade 4 (1st row, 4th col), cascade 5 (2nd row, 1st col), cascade 6 (2nd row, 2nd col), cascade 7 (2nd row, 3rd col).
optimization, i.e. it is optimized to reduce bounding box errors from malformed proposals. In addition, each bounding box regression f results in a rectified proposal f(x) that lessens the initial ambiguity (as shown in the rectified proposals of Fig. 9 from cascade 1 to cascade 2). Hence, f(x) contains more information than x, and the next cascade f(f(x)) contains even more information. This process can be modeled as a recursion. Eventually, repeated application of f results in an error of β^K ε in the final cascade K, which is a small number. The bounding boxes in this case, f^K(x), converge closely to ground truth.

The mathematical expressions for the recursion are shown in the Appendix. Our mathematical analysis relies on the assumption that later cascades refine prior cascades under SGD minimization of a convex loss function, i.e. with a correction parameter β ∈ (0, 1). Under this assumption, the cascade ensemble is able to reduce both bias and variance, whereas traditional ensemble averaging is limited to reducing variance only.

We can see the correction mechanism in the behaviour of the bounding box predictions shown in Fig. 10. The first cascade has
predicted boxes that are poorly aligned, and a coral object is wrongly identified as fish. As the network progresses to cascade 2, several of the initial predicted boxes for fish objects are re-aligned, and the false coral detection box is expanded to include a larger area of the initial proposal. In cascade 3, the network managed to detect that the coral detection is a non-fish object and rectified its prediction. From cascade 4 to 7, only correct fish objects are identified as valid detections. This pattern across cascades shows a step-wise gathering of contextual information by the ensemble CNN units around the neighboring area of an initial Region of Interest. This information gathering process occurs whenever a cascade j predicts bounding box coordinate adjustments to refine the initial reference box from cascade j − 1. The network uses the newly predicted box coordinates from cascade j to re-extract features from the shareable feature map upon the start of cascade j + 1. In this example, a total of 7 boxes are predicted within the neighboring area of an initial proposal from cascade 1 to cascade 7. With multiple box predictions and repeated RoI-crop feature re-extractions for each cascade, there is a greater likelihood that the ensemble R-CNN eventually gathers key contextual information such as the locations of a fish's snout and tail or the actual boundaries of a non-fish object (e.g., coral). This contextual information gathering allows the network to automatically correct predictions for both proposal coordinates and objectness probabilities, thereby increasing precision.

Aside from increasing precision, a cascade structure also improves recall. With repeated feature re-extractions and bounding box predictions, poorly formed initial RoIs in prior cascades are eventually corrected in future cascades. As corrected RoIs tend to include more contextual information, missed objects in prior cascades due to false objectness probability predictions are stochastically rectified in future cascades, leading to more true positive detections. This mechanism is shown in Fig. 11, where Cascade 1 has several false detections with poor alignments, and Cascade 2 misclassified some proposals as non-objects, missing 2 fishes. Cascade 3 up to Cascade 7 re-detected the 2 missing fishes after re-prediction of object boundaries and re-extraction of better RoI feature information. This increased recall performance.

In Fig. 11, having only two cascades may not be optimal for increasing object recall as some detections may still be missed. This explains why the baseline system, which is a single non-ensemble network with only two cascades, produced a low recall of 21%.

7.3. Test set 3 systems performance: robustness testing through multi-crop inference with scale distortions

This section describes another test experiment which subjects the three systems to robustness tests using eight test videos. Here, the test images are cropped according to nine different sections, and predicted detections for all 9 sections are combined for final inference. (See Fig. 12.) Each cropped section is tested according to a scale multiplier of 0.75, 1.0, and 2.0. The rationale behind this method is to test the generalization capacity of the three systems across different image scales while removing portions of the global context. If the system performs well despite the removal of global context and scale distortions, it means that the system can generalize and is robust to overfitting. Among the different scales, the system is expected to perform worst for a scale resize of 0.75, since information is lost upon down-sampling of the image by 25%.

We note that multi-crop inference could actually increase system performance in some instances (Fathi et al., 2019) since it allows networks to focus on a sub-region. Given nine sub-sections, the total number of proposals could increase up to nine times. However, improvements in localization depend on the network's capacity to predict well despite the removal of global context. In this test, each system is allowed 2000 proposals per section along with an objectness probability threshold per detected fish object of 70%.

From Tables 5, 6 and 7, all three systems performed worse for the scale resize of 0.75. This type of performance degradation is expected given that downsampling of the image removes key information defining fish objects. However, even with downsampling at this rate, System 1 performs best with the highest F-score of 33.64. The separate network ensemble of System 2 performed poorly with an F-score of 24.65, indicating that the system is not robust to distortions of smaller image scales.

Given a scale multiplier of 1.0, both System 2 and the baseline system displayed better performance compared to non multi-crop inference, in accordance with the findings in (Fathi et al., 2019) for ensembles with separated networks. While System 2 has the highest recall given a scale multiplier of 2.0, its precision suffered, with a score of only 40.01. This indicates that System 2, despite its ensemble mechanism, is not very consistent across different scales. The most consistent model for multi-crop inference across all scale resize ratios is System 1. In fact, System 1's F-score increased from 44.21% in non multi-crop inference to 48.84% given a scale multiplier of 1.0, and to a larger value of 56.15 given a scale multiplier of 2.0. The large increase indicates that System 1 can utilize the image's expanded resolution to improve detection. Among all the tests conducted in this experiment, System 1 with a 2.0 scale increase and multi-crop inference reported the best performance.

7.4. Test set 4 systems performance: ablation tests for system 1

We implement Test Set 4 as an ablation test for System 1 where we determine the effect of the recurrent LSTM links. Instead of LSTM links, we implement vector links in the form of flattened feature maps. More specifically, for Step 5 in the algorithm shown in Sec. 3.3, instead of the 2048-dimensional hidden units Sj−1 that are passed to the fully connected layer, we implement average pooling over the 3 × 3 sub-feature map produced by the CNN component, resulting in a 2048-d vector. To serve as a link, we concatenate the 2048-d averaged vector in cascade component j with the respective 2048-d averaged vector in the previous cascade component j − 1. The result is a 4096-d concatenated vector that serves as input to the fully connected layer for bounding box prediction.

We implement the same multi-crop tests as in Test Set 3 on the modified System 1 with vector links. From Table 8 to Table 10, we show the results of the original System 1 with LSTM links, the modified System 1 with vector links, and the baseline System 3. We choose to include System 3 among the results in Test Set 4 since it is equivalent to a 2-component cascaded ensemble variant of System 1.

From Table 8, having LSTM links at a scale distortion of 0.75× does not indicate any performance improvement, as performance from the modified System 1 is comparable with the original system. But from Table 9 to Table 10, it can be seen that System 1's performance with LSTM links improves, while the modified System 1 with vector links reports poor performance at a scale distortion of 2.0×. In fact, the baseline non-ensemble system performs even better than the modified System 1 at a scale distortion of 2.0×. This indicates that LSTM links with attention mechanisms provide more robustness, since they are able to maintain good performance despite multiple scale distortions.

7.4.1. Insights on attention mechanisms in system 1 LSTM unit
The robustness of System 1's performance can be attributed to the attention mechanism in System 1's LSTM, which focuses on sub-regions and links their features in a sequential fashion. It does not rely on features taken from the entire object RoI image, compared to Systems 2 and 3, which convolve on the entire 6 × 6 RoI feature map after RoI-cropping. This means that System 1 has to detect the key features of an
Fig. 11. Increased recall with 7-cascade R-CNN: Cascade 1 (1st row, left), cascade 2 (1st row, right), cascade 3 (2nd row, left), cascade 4 (2nd row, right), cascade 5
(3rd row, left), cascade 6 (3rd row, right), cascade 7 (4th row).
object, and re-orientations due to scale distortions do not affect inference. A sample of the attention mechanism procedure is shown in Fig. 13, where 9 key features of the fish object are gathered and arranged in a sequential manner. The LSTM unit uses the sequence of fish object portions to construct the overall object. Since it does not depend on global information, it is rendered more robust to scale distortions, similar to a 'bag-of-words' scheme.

8. Future work

For future research, System 1 can be modified to include additional sub-networks for various tasks, e.g. species classification or semantic segmentation. The network can likewise be modified to incorporate temporal information captured in frame sequences, e.g. fish movements. This allows prediction not only of static fish locations but also of their behavioural (swimming) patterns. In terms of cascade
Fig. 12. Multi-Crop Inference with Nine (9) Subsections. During inference, the network systems process each subsection independently. This leaves out global information, but allows the networks to focus more proposals on a single cropped section during inference, leading to more detections for increased recall.
Table 5
Test Set 3: Multi-Crop Inference Performance Statistics (Scale Multiplier: 0.75).

System    Precision  Recall  F-Score
System 1  39.56      29.56   33.64
System 2  28.89      21.19   24.45
Baseline  33.74      17.34   22.91

The bold figures indicate the highest score for each performance measurement.

Table 9
Test Set 4: Multi-Crop Inference Ablation Tests (Scale Multiplier: 1.0).

System                             Precision  Recall  F-Score
System 1 (w/ LSTM)                 48.51      49.18   48.84
System 1 (w/ vector links)         54.72      34.68   42.45
Baseline (System 1 w/ 2 cascades)  55.00      21.50   30.92

The bold figures indicate the highest score for each performance measurement.
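The F-scores throughout Tables 3 to 10 are the standard harmonic mean of precision and recall; a quick check in Python against the System 1 (w/ LSTM) row of Table 9 (a sketch, not code from the paper):

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall, here in percent."""
    return 2 * precision * recall / (precision + recall)

# System 1 (w/ LSTM), Table 9: precision 48.51, recall 49.18
print(round(f_score(48.51, 49.18), 2))  # -> 48.84
```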
Table 6
Test Set 3: Multi-Crop Inference Performance Statistics (Scale Multiplier: 1.0).

System    Precision  Recall  F-Score
System 1  48.51      49.18   48.84
System 2  48.25      27.81   35.28
Baseline  55.00      21.50   30.92

The bold figures indicate the highest score for each performance measurement.

Table 10
Test Set 4: Multi-Crop Inference Ablation Tests (Scale Multiplier: 2.0).

System                             Precision  Recall  F-Score
System 1 (w/ LSTM)                 60.32      52.52   56.15
System 1 (w/ vector links)         30.29      38.27   33.82
Baseline (System 1 w/ 2 cascades)  47.11      44.20   45.61

The bold figures indicate the highest score for each performance measurement.
structure, further research can be done to analyse the effects of cascade lengths on overall accuracy. In addition, we constructed System 1 to work well in offline server systems, i.e. computers with GPUs. For ROV platforms with lower computational resources, System 1 can be simplified to accommodate computational hardware with less memory without compromising much of its accuracy.

Table 7
Test Set 3: Multi-Crop Inference Performance Statistics (Scale Multiplier: 2.0).

System    Precision  Recall  F-Score
System 1  60.32      52.52   56.15
System 2  40.01      61.16   48.42
Baseline  47.11      44.20   45.61

The bold figures indicate the highest score for each performance measurement.

9. Conclusion
Fig. 13. Sample sub-image sequence in the LSTM unit of System 1. The LSTM unit focuses on key features from each sub-section in the image given a 3 × 3 RoI
feature map produced by a CNN unit.
Appendix A
DL Deep Learning
CNN Convolutional Neural Network
R-CNN Region Convolutional Neural Network
Faster R-CNN Faster Region Convolutional Neural Network (the basic localization network)
G-RMI Google Research and Machine Intelligence (a type of ensemble localization network)
MNC Multi Network Cascade (a type of cascade localization network)
LSTM Long Short-Term Memory Unit (a type of recurrent neural network)
SEACLEF dataset for fish object localization
PASCAL VOC dataset for object localization (with larger and fewer objects than COCO)
COCO dataset for object localization (harder dataset for localization than PASCAL VOC)
Fig. 14. System 2 Network Structure - three networks are trained independently, each with its own trunk, Region Proposal Network, and R-CNN. Each R-CNN has two cascade components for proposal refinement. The final bounding box and objectness probabilities come from a combination of the three separate networks.
The baseline system follows a Faster R-CNN network structure fitted with two cascade components similar to MNC, and serves as the baseline for comparison (Zhuang et al., 2017). It has a single 50-layer residual network trunk that branches out to one RPN and a 2-cascade R-CNN. It closely mirrors the network in (Dai et al., 2015) and in (Zhuang et al., 2017), where Faster R-CNN is applied to fish detection. Due to the lack of ensemble mechanisms in this system, however, it is expected to be outperformed by Systems 1 and 2, as verified in (Fathi et al., 2019).
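As a structural sketch only, the baseline's data flow — shared trunk, then RPN proposals, then two R-CNN cascade refinements — can be outlined as below. The stand-in functions, the /16 feature stride, the proposal count, and the nudge magnitudes are all placeholders, not the paper's configuration.

```python
import random

random.seed(0)

def trunk(image_hw):
    """Stand-in for the ResNet-50 trunk: records the /16 feature-map size."""
    h, w = image_hw
    return (h // 16, w // 16)

def rpn(feat_hw, n_proposals=5):
    """Stand-in RPN: emit n_proposals boxes [x0, y0, x1, y1] on the feature grid."""
    fh, fw = feat_hw
    boxes = []
    for _ in range(n_proposals):
        xs = sorted(random.uniform(0, fw) for _ in range(2))
        ys = sorted(random.uniform(0, fh) for _ in range(2))
        boxes.append([xs[0], ys[0], xs[1], ys[1]])
    return boxes

def rcnn_cascade(boxes):
    """Stand-in R-CNN stage: small bounding-box regression nudge per box."""
    return [[c + random.gauss(0, 0.1) for c in b] for b in boxes]

feat = trunk((480, 640))
boxes = rpn(feat)
for _ in range(2):          # two cascade components, as in the baseline
    boxes = rcnn_cascade(boxes)
print(len(boxes), len(boxes[0]))  # -> 5 4
```

In the real baseline each `rcnn_cascade` step would also RoI-crop the shared feature map with the refined boxes before regressing again; the sketch only shows the control flow.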
To express our ideas regarding the benefit of cascade ensembles, we present a simple mathematical model of the cascade correction mechanism. Let there be K components in the cascade. Without loss of generality, let f_i(x) denote a computable function whose parameters are the optimal network weights of each component in the cascade. These components are indexed by i ∈ K, s.t. f_i(x) = f_j(x) ∀ i ≠ j ∈ K given input x, i.e. all cascade components are equal and are at their optimal values (under SGD minimization with convex loss). Without loss of generality, let f(x_i) = x_{i+1}, i.e. f(x_i) : ℝ^d → ℝ^d, where d is the dimension of the feature map, i.e. f maps sections of feature maps to sections of feature maps through the RoI-cropping process. From Section 3.3.1, the cascades result in a recursion using f, where:
x_0 = input for cascade 1
x_1 = f_1(x_0)
x_2 = f_2(x_1) = f_2(f_1(x_0))
…
x_i = f_i(f_{i−1}(… f_1(x_0) …))    (4)
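Eq. (4) is plain function composition; a sketch in Python, where the toy f is a contraction toward a fixed point standing in for a trained box regressor (the pull factor 0.5 and ground-truth value 1.0 are illustrative assumptions):

```python
def compose_cascade(f, x0, K):
    """Apply f K times: x_i = f(x_{i-1}); return the sequence [x_0, x_1, ..., x_K]."""
    xs = [x0]
    for _ in range(K):
        xs.append(f(xs[-1]))
    return xs

# toy f: pulls x halfway toward a 'ground truth' of 1.0 at each cascade
f = lambda x: x + 0.5 * (1.0 - x)
print(compose_cascade(f, 0.0, 4))  # -> [0.0, 0.5, 0.75, 0.875, 0.9375]
```

Each element of the sequence halves the remaining distance to the fixed point, mirroring the β-factor error reduction assumed in the text.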
The recursion results in a sequence of inputs [x_0, x_1, x_2, …, x_K], where each input x represents a section of the feature map (enclosed in a regressed bounding box). The cascade ensemble supposes that as cascades progress, the coordinates of the section's bounding box approach their ground-truth coordinates more closely. More formally, let b(x) be a function that computes the coordinates of x, i.e. b(x) → [x1, y1, x2, y2]. Let the ground-truth coordinates be b*. The cascade ensemble assumes that |b(x_i) − b*| ≥ |b(x_j) − b*| ∀ j > i ∈ K, s.t. |b(x_i) − b*| → 0 and b(x_i) → b* as i → ∞. The result of this assumption is that inputs for higher-indexed cascades have less error and possess more information on the actual object. This assumption is reasonable given Eq. 2, where each cascade tries to minimize a convex loss function using SGD for accurate bounding box regression. Hence, an application of f(x) results in coordinates b(x) that are closer to b*. This process is actually the same as what is done by the R-CNN in the Faster R-CNN
ε_{b,i+1} = β ε_{b,i} = β [b*(f*) − b(f_i(x_{i−1}))]    (5)

β ∈ (0, 1) is a 'correction parameter' that adjusts the errors of prior cascades. Using Eqs. 4 and 5, we form a recursion of b(f) relative to the ground truth b*(f*):
b*(f*) = b[f_1(x_0)] + ε_{b,0}        (start of cascade with input x_0)
b*(f*) = b[f_2(x_1)] + ε_{b,1} = b[f^2(x_0)] + β ε_{b,0}
…
b*(f*) = b[f_i(x_{i−1})] + ε_{b,i−1} = b[f^i(x_0)] + β^{i−1} ε_{b,0}
…
b*(f*) = b[f_K(x_{K−1})] + ε_{b,K−1} = b[f^K(x_0)] + β^{K−1} ε_{b,0}
where |b*[f*] − b[f^K(x_0)]| → 0 as cascades progress i → K, given the assumption on the correction parameter β ∈ (0, 1). This implies b[f^K(x_0)] → b*[f*], assuming that f* represents optimal weights, as K → ∞. This expresses the notion that as cascades progress from 1 to K, they refine the original error ε_{b,0} and provide more accurate bounding boxes over a refined feature map b(f^i(x)). The same equations apply for predictions on objectness probabilities p:
p*(f*) = p[f_1(x_0)] + ε_{p,0}        (start of cascade with input x_0)
p*(f*) = p[f_2(x_1)] + ε_{p,1} = p[f^2(x_0)] + β ε_{p,0}
…
p*(f*) = p[f_i(x_{i−1})] + ε_{p,i−1} = p[f^i(x_0)] + β^{i−1} ε_{p,0}
…
p*(f*) = p[f_K(x_{K−1})] + ε_{p,K−1} = p[f^K(x_0)] + β^{K−1} ε_{p,0}
Hence, comparing a cascade ensemble f with a traditional ensemble g, suppose that all cascade and traditional ensemble components f and g have equal variance σ^2 with no correlation, E[ε_i ε_j] = 0 for i ≠ j. Then the variances of both ensemble types are equal. However, for K components in both types we have:

|b*(f*) − b[f_K(x_{K−1})]| ≤ |b*(f*) − (1/K) Σ_{i=1}^{K} g_i(x_0)|

i.e. the bias in the cascade estimate f_K(x_{K−1}) at the end of the K-th cascade is less than that of a traditional ensemble g, which merely averages the outputs of the K components.
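The bias claim can be checked numerically: averaging K identically biased estimators leaves the bias unchanged, while a cascade with correction parameter β shrinks it to β^K of the original. A sketch with illustrative numbers (not the paper's data):

```python
K = 5
beta = 0.6
b_star = 100.0          # ground-truth coordinate
x0 = 70.0               # biased initial estimate, error eps = 30

# traditional ensemble: average K copies of the same biased estimate
avg_bias = abs(b_star - sum([x0] * K) / K)

# cascade: each stage removes a (1 - beta) fraction of the remaining error
x = x0
for _ in range(K):
    x = x + (1 - beta) * (b_star - x)
cascade_bias = abs(b_star - x)

print(avg_bias, round(cascade_bias, 4))  # cascade bias = beta**K * 30
assert cascade_bias < avg_bias
```

With correlated, identically biased components, averaging cannot touch the bias term; the cascade's sequential corrections can, which is the inequality stated above.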
References

Spampinato, C., Giordano, D., Di Salvo, R., Chen-Burger, Y.-H.J., Fisher, R.B., Nadarajan, G., 2010. Automatic fish classification for underwater species behavior understanding. In: Proceedings of the First ACM International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams. ACM, pp. 45–50.
LeCun, Y., Touresky, D., Hinton, G., Sejnowski, T., 1988. A theoretical framework for back-propagation. In: Proceedings of the 1988 Connectionist Models Summer School. vol. 1. CMU, Pittsburgh, Pa: Morgan Kaufmann, pp. 21–28.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-scale Image Recognition. CoRR abs/1409.1556. URL http://arxiv.org/abs/1409.1556.
Li, J., Liang, X., Li, J., Xu, T., Feng, J., Yan, S., 2016. Multi-stage Object Detection With Group Recursive Learning. CoRR abs/1608.05159. URL http://arxiv.org/abs/1608.05159.
Dai, J., He, K., Sun, J., 2015. Instance-aware Semantic Segmentation Via Multi-task Network Cascades. CoRR abs/1512.04412. URL http://arxiv.org/abs/1512.04412.
Li, X., Shang, M., Qin, H., Chen, L., 2015. Fast accurate fish detection and recognition of underwater images with fast r-cnn. In: OCEANS 2015 - MTS/IEEE Washington, pp. 1–5. https://doi.org/10.23919/OCEANS.2015.7404464.
Villon, S., Chaumont, M., Subsol, G., Villéger, S., Claverie, T., Mouillot, D., 2016. Coral reef fish detection and recognition in underwater videos by supervised machine learning: Comparison between deep learning and hog+svm methods. In: International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, pp. 160–171.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Krogh, A., Vedelsby, J., 1995. Neural network ensembles, cross validation, and active learning. In: Advances in Neural Information Processing Systems, pp. 231–238.
Hastie, T., Tibshirani, R., Friedman, J., 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York. URL https://books.google.com.ph/books?id=yPfZBwAAQBAJ.
Hara, K., Liu, M., Tuzel, O., Farahmand, A., 2017. Attentional Network for Visual Object Detection. CoRR abs/1702.01478. URL http://arxiv.org/abs/1702.01478.
Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312. URL http://arxiv.org/abs/1405.0312.
Costa, C., Loy, A., Cataudella, S., Davis, D., Scardi, M., 2006. Extracting fish size using dual underwater cameras. Aquac. Eng. 35 (3), 218–227.
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88 (2), 303–338.
Fathi, A., Korattikara, A., Sun, C., Fischer, I., Huang, J., Murphy, K., Zhu, M., Guadarrama, S., Rathod, V., Song, Y., et al., 2019. G-rmi Object Detection. URL.