Ecological Informatics
journal homepage: www.elsevier.com/locate/ecolinf
Computer Vision and Machine Intelligence Group, Department of Computer Science, College of Engineering, University of the Philippines, Philippines
Keywords: Fish detection; Deep learning applications to the environment

Abstract: We propose a fish detection system based on deep network architectures to robustly detect and count fish objects in the wild under a variety of benthic background and illumination conditions. The algorithm consists of an ensemble of Region-based Convolutional Neural Networks that are linked in a cascade structure by Long Short-Term Memory networks. The proposed network is efficiently trained as all components are jointly trained by backpropagation. We train and test our system on a dataset of 18 videos taken in the wild. In our dataset, there are around 20 to 100 fish objects per frame, with many fish objects having small pixel areas (less than 900 square pixels). From a series of experiments and ablation tests, the proposed system preserves detection accuracy despite multi-scale distortions, cropping and varying background environments. We present analysis that shows how object localization accuracy is increased by an automatic correction mechanism in the deep network's cascaded ensemble structure. The correction mechanism rectifies any errors in the predictions as information progresses through the network cascade. Our findings in this experiment regarding ensemble system architectures can be generalized to other object detection applications.
1. Introduction

Fish detection and counting are crucial tasks in marine science for temporal tracking of species, understanding of fish behaviour (Spampinato et al., 2014), and aquaculture (Zion, 2012), among others. For fisheries management and policy formulation, keeping track of fish stocks and population is crucial to effectively control fish harvesting, promote breeding and prevent stock depletion (Walsh et al., 2004). For these reasons, the size of fish populations has to be accurately determined through surveys (Costa et al., 2006).

Traditionally, fish surveys are carried out by recording information of fish captured in traps, in nets by trawling, with lines, or through the use of piscicides. Capture-tag-recapture methods are also used for determining age, growth, movement and behaviour in reef fish populations. Non-capture techniques include underwater visual census by divers and hydroacoustic methods, which are more accurate and non-destructive (Spampinato et al., 2014). However, diver observation of fishes may suffer from observational bias as many fish species instinctively evade human divers, swimming away from the survey area (Spampinato et al., 2010).

To address these drawbacks, the use of cameras, which are non-invasive and less conspicuous to fishes, has been suggested (Katsanevakis et al., 2012). Camera-based monitoring also offers rapid and continual observation of fish species to keep up with rapid shifts in population distribution (Hollowed et al., 2013; Mieszkowska et al., 2014). Some works proposed in situ monitoring programs using ROV vessels that are cost-effective (Siddiqui et al., 2017). However, these methods require manual offline annotation of collected video frames by fish experts. Manual annotation is very inefficient since classifying and annotating a minute of footage may take up to 15 min of a marine biologist's time (Spampinato et al., 2008). Given the number of frames that have to be processed, manual annotation requires statistical sampling techniques to gather confident estimates of the fish population, which could lead to possible sampling errors by novice annotators.

An attractive alternative is to use computer vision techniques to detect fish from videos or image stills and automate the counting process. This allows the use of camera set-ups for monitoring, as well as automated and efficient fish counting. However, this approach presents non-trivial difficulties. Automatic detection of fish objects in underwater videos needs to deal with several challenges (Garcia et al., 2002; Labao and Naval, 2017; Negahdaripour and Yu, 1995). Underwater media produce light scattering effects, wavelength-dependent absorption, and lens/air/water interface image distortions. Suspended particles in water deflect photons from their straight-line trajectories and introduce backscatter, termed "marine snow" (Horgan and Toal, 2009). Longer wavelengths of visible light are strongly absorbed by water,
⁎ Corresponding author. E-mail address: [email protected] (P.C. Naval).
https://doi.org/10.1016/j.ecoinf.2019.05.004
Received 2 December 2018; Received in revised form 4 May 2019; Accepted 6 May 2019; Available online 9 May 2019
1574-9541/ © 2019 Elsevier B.V. All rights reserved.
A.B. Labao and P.C. Naval Ecological Informatics 52 (2019) 103–121
resulting in varying fish colors relative to camera distance and depth. These factors confuse classical detection algorithms that are not designed to handle such difficulties. This is compounded by the fact that large numbers of fish, from 20 to 100 individuals per frame, have to be detected.

Some early methods that attempted to perform automatic fish detection often relied on background subtraction (Garcia et al., 2002). This approach gathers motion information using pixel-wise subtraction of consecutive image frames to segment and localize fish objects from static backgrounds. However, these approaches are limited by their dependence on fixed camera setups, static backgrounds, non-varying illumination conditions, and on the assumption that the fishes are in motion. The latter condition may not be true for some fish species. Furthermore, the underwater media problems mentioned above pose serious difficulties to background subtraction algorithms.

Recent advances in computer vision and machine learning provide methods that can potentially address the challenges presented by underwater media. Most of these techniques are based on deep learning algorithms that address the limitations of previous algorithms which rely on motion information or manually crafted features. Deep learning methods automatically generate features using convolution and other operations (Krizhevsky et al., 2012; LeCun et al., 1988). The most popular deep learning method for computer vision tasks is the Convolutional Neural Network (CNN), whose variants have been successfully applied to numerous image classification tasks (Karpathy et al., 2014; Krizhevsky et al., 2012; Simonyan and Zisserman, 2014).

Deep learning methods in computer vision have also progressed to deal with localization tasks. Several recent localization techniques utilize variations of the base CNN to predict bounding box coordinates of objects. One of the first deep learning localization networks is the Region-based Convolutional Neural Network (R-CNN), which uses a selective search procedure to generate object proposals (Girshick et al., 2016). A further improvement in localization networks is the Faster R-CNN, which automates the proposal generation process itself using a Region Proposal Network (RPN) (Ren et al., 2017). Faster R-CNN served as a base network architecture for several other localization models (Dai et al., 2015; Li et al., 2016), and the G-RMI network (Fathi et al., 2019). The G-RMI network is notable since it uses an ensemble architecture whereby predictions of several networks are combined to increase accuracy. However, we note that several of these detection works use standard datasets (Everingham et al., 2010), and have not been applied to image data taken in the wild, which is the objective of this paper.

For the fish detection task, some prior works have implemented deep learning based systems (Labao and Naval, 2017; Li et al., 2015; Villon et al., 2016; Zhuang et al., 2017). In particular, Villon et al. (2016) found that deep learning models outperform classical machine learning techniques that rely on manually crafted features. Their experiments were performed on the SEACLEF database, which consists of 20 to 30 fish objects per frame. To differentiate these experiments from our paper, we note that the dataset which we use to train our models is more challenging and reflects more closely the actual number of fish objects at benthic depths. The dataset consists of 18 underwater videos, separated into 10 training videos and 8 test videos. In addition, fish objects are more dense, numbering 20 to 100 fish objects per frame, with small sizes of less than 900 square pixels in area. This adds up to a total of close to 10,000 fish objects that have to be detected by the algorithm, the majority of which are small. We also explicitly set the number of fish objects in the training data set to be less than the number of fish objects in the test data set. This is to test the capacity of algorithms to generalize well over harder environments.

Given the challenges presented by the dataset, standard deep learning localization models may not be able to perform well, and some enhancements to the base network architecture are needed. Hence, this paper proposes a deep learning architecture that adopts an ensemble structure whose components are detector networks (Ren et al., 2017). G-RMI implements the traditional type of ensemble by combining the outputs of several independent networks. Our proposed ensemble uses a special structure where the ensemble components are arranged in a cascade. The cascade components are not independent since they have connections in the form of Long Short-Term Memory (LSTM) links (Hochreiter and Schmidhuber, 1997). Moreover, the flow of information from one cascade to the next provides an automatic correction mechanism that increases accuracy. The proposed model has other benefits, such as (1) cascade components that are jointly trained in a single backpropagation pass and (2) LSTM links that process information with an attention mechanism which confers robustness against image distortions.

To assess the performance of our model, we compare our approach to a second ensemble system similar to traditional ensembles (Fathi et al., 2019) and to a strong baseline model composed of a single Faster R-CNN network as applied to the SEACLEF database (Zhuang et al., 2017). Experiments consist of a series of tests: (test 1) prediction on unseen fish objects found in the training frames, (test 2) prediction on new frames, (test 3) prediction on fish objects with multi-scale distortions and cropping, and (test 4) ablation tests. Experimental results show that for test 1, all networks performed similarly. For tests 2 and 3, the proposed cascaded ensemble outperformed the other systems. We conjecture that the cascaded structure of our proposed system benefits from the automatic correction mechanism where cascades repeatedly refine initial proposals. In addition, the attention mechanism in System 1's LSTM links makes it more robust against scale distortions. This is verified in test 4, where LSTM links with an attention mechanism significantly improve multi-scale inference.

For future work, our proposed cascade ensemble structure could be generalized to include other components aside from Faster R-CNN, as well as to detect objects other than fish. In summary, our paper has these contributions:

• a localization network that adopts a cascaded ensemble structure, where components are linked by an LSTM network. For efficiency, all network components are trained in a single backpropagation pass
• an automatic correction mechanism under the cascade structure to lower prediction errors, along with an attention mechanism in recurrent network links for more robust predictions against image distortions
• a new dataset of 18 underwater video sequences with varying illumination conditions and backgrounds. Close to 88% of the fish objects in the test set have small object sizes of less than 900 square pixels, and training videos have fewer fish objects than test videos to test the generalization capacities of models
• performance comparisons of the cascade ensemble with traditional ensemble systems and a strong baseline single object detector, under 4 tests with multi-scale distortions, cropping, and ablation
• experiments that show better performance for the cascade ensemble. The benefits of the automatic correction mechanism and attention are demonstrated along with analysis.

2. Deep learning neural networks

This section presents some concepts on deep learning-based detection networks using Faster R-CNN as the base architecture. Briefly, a deep network is simply a neural network with several layers, thereby providing it with depth (Goodfellow et al., 2016). Such a neural network can automatically extract informative features that are appropriate for its given task (Goodfellow et al., 2016). Early neural networks were unable to increase their depth significantly due to vanishing gradients, which are circumvented by deep networks through the use of a non-squashing activation function such as the Rectified Linear Unit (ReLU) (LeCun et al., 2015). Furthermore, progress
in loss functions (Goodfellow et al., 2016) (i.e., the function that measures the amount of error in a network's prediction relative to ground truth) enabled better-behaved supervised training of deep networks.

Early deep networks were classification networks (Krizhevsky et al., 2012), but localization networks were proposed shortly after (Ren et al., 2015). Since this paper concentrates on localization networks, we show in Fig. 1 the general flow of Faster R-CNN (Ren et al., 2015), which is considered the standard deep learning localization network. The Faster R-CNN model has three main parts: (1) a CNN network trunk, (2) a proposal-generator RPN network, and (3) the R-CNN region classification network.

The CNN network trunk is the first component in Faster R-CNN. It receives an input RGB image of arbitrary size and generates a feature map containing highly informative features describing the input image. These features will be used to predict possible locations of objects (a detection task), as well as to predict their objectness probability (i.e. whether they correctly represent an object or not, which is a classification task). Given the feature map, detection is done using a proposal generation process carried out by the Region Proposal Network (RPN). The classification task is handled by the R-CNN, which uses the same feature map to predict objectness probabilities of the proposals. The CNN, RPN and R-CNN are jointly trained during backpropagation, thereby increasing training efficiency significantly.

2.1. Convolutional neural networks (CNN)

The Convolutional Neural Network (CNN) provides the feature generation component used by most deep networks designed for image analysis tasks (Krizhevsky et al., 2012). The trunk of a CNN is formed by a series of convolutional filters that are convolved over the input image to generate a stack of feature maps, with each feature map containing a set of special features characterizing the image. At the front end of the CNN trunk are usually found filters that detect low-level features which represent edges and color patterns. These are followed by increasingly sophisticated features, i.e. shapes and contours, in later parts of the trunk. The final output of the CNN trunk is a highly informative feature map that serves as input for the Region Proposal Network and R-CNN components of our detection network. For our proposed network, the trunk adopts a residual network structure from He et al. (2016).

In the case of underwater fish localization, the CNN's capacity to automatically generate features from arbitrary inputs is useful. It circumvents the need for features that rely solely on motion or brightness levels, i.e. by relying also on features that depend on edges, shapes, or contours that are brightness-independent. This renders CNN models robust to illumination changes as a result of changing depth and movement in surface waters. In addition, since the input consists of an RGB image, the CNN is also able to automatically generate features that use color information to differentiate fish objects from their background. This is shown in Fig. 1, where the features in the network trunk (from low level to high level) combine edge information along with color information.

2.2. Region proposal network for detection (RPN)

Early deep learning classification networks classify a single object found within the input image. This changes in the case of detection tasks, since an input image can contain several objects. To handle detection tasks, one of the methods used is proposal generation. Proposals are data structures that contain information on the locations of possible objects in the input image. For Faster R-CNN, a proposal is a 4-element tuple consisting of coordinates that represent corners of bounding boxes.

Faster R-CNN automates the production of proposals using a Region Proposal Network (RPN). We can view the RPN as another CNN with its own set of convolutional filters. The filters of the RPN operate over the input feature map from the trunk and predict a tuple consisting of objectness probabilities and proposal coordinates of boxes fixed at certain locations dispersed across the image. These locations, called "anchors", form the points of a grid that spans the entire input image. The anchors are spaced at intervals of 9 or 12 pixels, and each anchor is assigned a set of 'anchor boxes' as shown in Fig. 2. Using its filters, the RPN predicts for each anchor box its objectness probability and regressed bounding box coordinates. Anchor boxes with high objectness probabilities are stored as proposal candidates and serve as inputs for the R-CNN since they are more likely to contain objects (Ren et al., 2015).

We note that the standard Faster R-CNN implemented a single RPN since one RPN is sufficient to detect the relatively larger objects in the PASCAL VOC dataset. However, for our proposed network, we increase
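As a concrete illustration of the anchor-grid idea described above, the sketch below generates anchor boxes on a regular pixel grid. The 9-pixel spacing follows the text, while the scales, aspect ratios, and the function name `make_anchor_boxes` are illustrative assumptions, not values reported by the paper.

```python
import numpy as np

def make_anchor_boxes(img_h, img_w, stride=9,
                      scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes [x1, y1, x2, y2] centered on a regular grid.

    stride: spacing between anchor centers in pixels (the paper uses
    9- or 12-pixel intervals); scales/ratios are illustrative only.
    """
    boxes = []
    for cy in range(0, img_h, stride):       # anchor centers, row-wise
        for cx in range(0, img_w, stride):   # anchor centers, column-wise
            for s in scales:
                for r in ratios:
                    # width/height chosen so that w * h = s * s and w / h = r
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    boxes.append([cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2])
    return np.array(boxes)

anchors = make_anchor_boxes(108, 192)
```

Each grid point receives one box per scale/ratio combination, so the number of candidates grows quickly even for a small image.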
2.3. Region classification network (R-CNN)

The R-CNN is a sub-network CNN that operates on 'smaller' feature map inputs, as shown in Fig. 1. It predicts the objectness probability and actual coordinates of each object captured by a proposal from the RPN. The R-CNN is actually a refinement network that corrects bounding box coordinates and objectness probabilities of each proposal. For this paper, our main contribution lies in modifying the R-CNN to a cascaded ensemble for improved localization accuracy.

2.4. Ensembles

3. Methodology

The three systems receive a single RGB video frame as input. The input frame's dimensions (H × W × 3) can vary in height (H) and width (W) but are fixed at the 3 RGB channels. For output, the three systems provide a set of box coordinates for each detected fish object. For System 1, we enumerate its features as follows:

3.1. System 1 architecture features

• single 50-layer Residual Network trunk
• two RPN systems, to generate proposals that accommodate both small and large fish objects
Fig. 3. System 1 Network Structure - a single 50-layer residual trunk provides shareable features for 2 Region Proposal Networks and an R-CNN composed of 7 cascade components. The feature map from the 50-layer residual network trunk is the input for RPN-1 and RPN-2 as well as for each of the cascades. The vertical connections over the cascades represent the sharing of the feature map with each of the cascade components. Hence, each cascade component re-uses the feature map from the main CNN trunk.

[The trunk] is composed of a series of 1 × 1 - 3 × 3 - 1 × 1 convolutions with skip addition and batch normalization layers (He et al., 2016). The trunk receives an RGB image of arbitrary size as input and outputs a shareable feature map of H/16 × W/16 × 1024 dimensions. For System 1, we use a single 50-layer residual network trunk, where its shareable feature map output is used for both the RPN and R-CNN.

3.3. Region proposal network (RPN) for proposal generation

[…] coordinates [x1, y1, x2, y2]. During training, 256 proposals are processed, where 50% have foreground labels and the other 50% have background labels. During inference, the top 4000 proposals with the highest predicted objectness probabilities are fed to the R-CNN.

3.4. Multi-cascade R-CNN with an ensemble of 7 components and LSTM links
Fig. 4. The structure of two interconnected cascade components in the R-CNN. In System 1, cascade components are extended to 7 cascades, while for Systems 2 and 3, the R-CNN has only two cascades.

[…] end of the network trunk using RoI-cropping. Sub-feature maps are resized to a uniform 6 × 6 × 1024;

Step 3: Pass the 6 × 6 × 1024 sub-feature map to a series of convolutional layers (three residual blocks). The sub-feature map is downsized to 3 × 3 at a depth dimension D = 2048;

Step 4: Pass the 3 × 3 sub-feature map to an LSTM cell, where the sub-feature map is resized to sequential form (1 × 9 × 2048) and the LSTM sequentially processes each block. In this model, the LSTM sequence has 9 blocks as shown in Fig. 4. In each cascade, the LSTM re-uses the hidden states Sj−1 computed from the previous cascade (except for the first cascade j = 1). The 2048-dimensional hidden units are passed to the fully connected layer;

Step 5: Using the previous step's inputs, the fully connected layer predicts objectness probabilities p(j, i) and bounding box coordinate adjustments t(j, i). Coordinate adjustments are of the form [dx, dy, dw, dh], which refer to adjustments with respect to the reference box center x and y coordinates and the box height h and width w;

Step 6: Predicted bounding box coordinate adjustments are processed by a state-bridge layer (following Dai et al., 2015) which transforms the reference RoI Bj−1 to a new set of RoIs Bj of the form [x1, y1, x2, y2] using the predicted [dx, dy, dw, dh];

Step 7: The current hidden state Sj and the new set of RoIs Bj are passed to the next cascade j + 1, which begins again at Step 1.

The R-CNN in System 1 is trained using a compound loss function for each proposal RoI i and for each cascade j = 1, …, 7 following Eq. (2). Loss is averaged over RoIs i.

Lj,i = lcls(pj,i) + lreg(tj,i)    (2)

where lcls is a softmax loss function while lreg is the smooth-L1 loss function. The quantity pj,i refers to cascade j's prediction of the objectness probability for proposal RoI i, while tj,i refers to cascade j's prediction of coordinate adjustments for RoI i. The end-output of the R-CNN is pj=7,i and Bj=7, referring to the final predicted probability of an RoI i and the final box coordinates.

3.5. Total loss and training details

The total loss of the network combines the compound losses of the RPN and of each cascade in the ensemble R-CNN:
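Step 6's state-bridge transform can be sketched as follows, assuming the conventional R-CNN box parameterization (center shifts scaled by box size, log-scale width/height changes); the function name `apply_deltas` is ours, not the paper's.

```python
import numpy as np

def apply_deltas(rois, deltas):
    """Transform reference RoIs [x1, y1, x2, y2] with predicted
    adjustments [dx, dy, dw, dh]: dx, dy shift the box center
    (relative to box size); dw, dh rescale width and height."""
    x1, y1, x2, y2 = rois.T
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    dx, dy, dw, dh = deltas.T
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * np.exp(dw), h * np.exp(dh)

    # back to corner form [x1, y1, x2, y2]
    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)
```

With zero deltas the transform is the identity, so a cascade that predicts small adjustments only nudges the previous cascade's boxes.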
Fig. 5. System 1 R-CNN with Sequential LSTM Structure. The cascaded network in System 1 has 7 ensembled CNN components that are linked together through a sequential LSTM unit. The LSTM unit performs an attention mechanism over each CNN cascade output by reshaping the 3 × 3 CNN output tensor to 1 × 9 and treating each element of the 9-dimensional flattened tensor as a part of a sequence.
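The 3 × 3 → 1 × 9 reshaping that feeds the sequential LSTM can be sketched in a few lines; `lstm_sequence_input` is a hypothetical name, and the sketch covers only the reshaping step, not the LSTM itself.

```python
import numpy as np

def lstm_sequence_input(roi_feature):
    """Reshape an h x w x D RoI feature map (3 x 3 x 2048 in System 1)
    into a (1, h*w, D) sequence so an LSTM can process the spatial
    blocks one at a time, as described for the attention mechanism."""
    h, w, d = roi_feature.shape
    return roi_feature.reshape(1, h * w, d)

roi = np.random.rand(3, 3, 2048)
seq = lstm_sequence_input(roi)
```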
Ltotal = Lrpn + (1/R) Σ_{j=1..7} Σ_i Lj,i    (3)

where R is the number of proposal RoIs.

For all three systems, network training is performed end-to-end for 200 epochs (with 300 training frames for each epoch). We use a weight decay parameter of 0.0001 and a Nesterov momentum value of 0.9. The learning rate is set at 0.001 and is divided by 10 after 150 epochs. No image pre-processing is performed other than subtraction of the mean image. The network is initialized with ImageNet weights from He et al. (2016), as done elsewhere (Dai et al., 2015; Ren et al., 2017). In a forward pass, 4000 proposals per RPN are processed, and the top proposals for objectness are retained after Non-Maximum Suppression (NMS) with a 0.3 threshold (pre R-CNN). Final predicted proposals (post R-CNN) are NMS-suppressed with a 0.1 threshold.

4. Training and test data description

In this section, we present (1) statistics of our data, (2) performance metrics and four test schemes, and (3) experimental results for each of the 4 test schemes. In (3), we insert some qualitative analysis of the experimental results, and we provide a mathematical treatment of the analysis in the Appendix for reference.

4.1. Statistics on training and test set

Our training data consist of ten (10) underwater video sequences for a total of 300 training frames, with more than 10,000 fish objects. The videos were obtained at depths ranging from 7 to 24 m, taken from a custom-made stereo rig composed of three (3) GoPro cameras. The video frames have a wide variety of backgrounds and most contain large numbers of fish objects of different species. In general, the training and test data in this experiment are harder than the benchmark PASCAL VOC dataset (Everingham et al., 2010), for the following reasons:

• uneven and changing illumination conditions, water backscattering effects, presence of marine snow, fish motion, and similar appearance of fish objects with the coral background
• around half of the objects are small objects having areas smaller than 900 square pixels and with deformable shapes
• a much larger quantity of objects per frame to be detected in the test set compared to the training set, ranging from 20 to more than 100

For the test data, we gathered eight (8) videos with some differences in background environments from the training data. For each video, we randomly sample 3 frames for manual labeling, amounting to 27 frames with more than 2000 fish objects to be detected in total. We describe the datasets in Table 1, where for index notation, we append each train and test video with a 'J' at the beginning. The fish objects in the training data were manually annotated by a marine science researcher for expert verification.

In general, the number of fish objects in the 8 test videos is larger than in the training videos. This presents a unique challenge for localization systems since they have to generalize over a more difficult test set. However, this suits the purpose of this experiment, which is to test the capacity of models to generalize over new environments. Table 2 shows the different fish object sizes found among the 8 test videos. As can be seen, roughly 88% have object sizes that fall between 100 and 2500 square pixels. In the COCO dataset (Lin et al., 2014) these sizes fall under the 'small' object size, and are among the harder-to-localize objects.

In terms of background, we include in Table 1 the rough proportion of water column areas against seabed and coral areas. We also include additional information on the illumination conditions of the video and on background objects, i.e. rocks/corals/particles.
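The NMS filtering used in the training details above (greedy suppression at a 0.3 IoU threshold pre R-CNN, 0.1 post R-CNN) can be sketched with a generic reference implementation; this is not the paper's code, only the standard greedy algorithm.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy Non-Maximum Suppression: keep the highest-scoring box,
    drop any remaining box overlapping it by more than iou_threshold,
    and repeat until no boxes are left."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box against the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # suppress high overlaps
    return keep

boxes = np.array([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_threshold=0.3)
```

In the example, the second box overlaps the top-scoring box heavily and is suppressed, while the distant third box survives.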
Table 1
Training Data Statistics for the 18 videos (10 for training, 8 for test).

Training Video | Average Number of Fish Objects | % Water Column | Specs | Illumination
Test Video | Average Number of Fish Objects | % Water Column | Specs | Illumination
Table 2
Fish Size Statistics for the test set, summed across all 8 test videos.

Fish Size x (in square pixels)    Number of Objects
x < 100                           5
100 ≤ x < 900                     929
900 ≤ x < 2,500                   478
2,500 ≤ x < 10,000                168
x ≥ 10,000                        6

[…] 8 independent test videos with entirely new environments and illumination conditions. In a way, test 2 is more crucial for assessing a system's performance than test 1 since its frames are independent from the training set. Test Set 3 builds upon Test Set 2 and uses multi-crop and multi-scale inference to check if the system can still localize despite additional distortions and removal of global information. Test Set 4 is an ablation test to check if the LSTM sequential links effectively propagate information across ensemble components.
Fig. 6. Sample fish object detections taken from the test set using System 1. Green boxes are localization outputs from the algorithm, while red boxes depict some of
the missed fish objects. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
[…] tested in new environments. This is despite System 1 having only a single network trunk, compared to the network structure of System 2 with three separate trunks.

Precision and recall measures in Test 2, however, are lower compared to single-crop inference on the training set in Test 1. But these could be brought further up by increasing the number of proposals of the network in each forward pass during inference. This will be performed in subsection 7.3, where the three systems are subjected to multi-crop inference testing with multi-scale distortions.

7.2.1. System 1 cascade correction mechanism

A cascade ensemble architecture can potentially perform correction mechanisms. The correction occurs as the CNN sub-networks pass information from one component in the cascade to the next, such that prior errors in early cascades are rectified in later cascades. To show the need for correction mechanisms, Fig. 8 shows a proposal instance received by the first component of the R-CNN. The proposal has an IoU (Intersection over Union) in excess of 50%. Hence, it is a valid candidate for bounding box regression. However, as seen in Fig. 8, there is inherent ambiguity in the information provided by the initial proposal, i.e. the actual fish object can acquire several possible
Fig. 7. Sample fish object detections taken from the test set using System 1. Green boxes are localization outputs from the algorithm, while red boxes depict some of
the missed fish objects. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
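The IoU criterion used when validating proposals (in excess of 50% for bounding box regression candidates) can be computed directly from corner coordinates; a minimal sketch:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, two 10 × 10 boxes overlapping by half their width have an IoU of 1/3, below the 50% criterion, even though they share half their area.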
Table 3
Test Set 1: Performance Results on 60 frames, where frames are taken from training video sequences.

System          Precision  Recall  F-Score
System 1        67.21      64.56   65.86
System 2        67.28      68.25   67.76
R-CNN Baseline  69.81      62.72   66.07

Table 4
Test Set 2: Performance Results on the 8-Video Test Set with New Backgrounds.

System          Precision  Recall  F-Score
System 1        53.29      37.77   44.21
System 2        47.37      33.54   39.28
R-CNN Baseline  43.99      21.00   28.43

The bold figures indicate the highest score for each performance measurement.

orientations for its tail. These orientations cannot be inferred from the initial proposal alone. If the R-CNN is limited to a single cascade, it is forced to select one of the many likely orientations of the fish object. If ever it incurs an error in bounding-box regression under a single-component R-CNN (i.e. chooses a wrong orientation of the tail), it does not have a chance to rectify its error.

Fig. 9 shows the correction process that occurs in the ensemble cascade. The arrows in Fig. 9 indicate the diverse range of possible paths that the ensemble cascade can take from the initial proposal of the RPN (leftmost) to the final cascade (rightmost). The initial proposal of the RPN (denoted as x) is usually not very precise and has bounding box coordinates that do not properly enclose the fish object (where errors are represented by the term ε). The ensemble cascade relies on the assumption that as cascades progress from 1 to K, errors gradually decrease by a factor of β, where β ∈ (0, 1). This assumption is reasonable given that each cascade minimizes a convex loss function under SGD
Fig. 8. Inherent ambiguity of information in initial proposals relative to ground truth (rightmost figure). The initial proposal (leftmost figure) shows a portion of a
fish object. However, the actual fish object is larger than the initial proposal and can have several orientations of the tail. The correct orientation cannot be inferred
from the initial proposal alone.
Fig. 9. Illustration of correction mechanisms. The initial proposal (leftmost figure) has an inaccurate bounding box with error ε relative to ground truth (rightmost
figure). The 1st cascade uses bounding box regression (i.e. function f) to reduce error to βε, where β ∈ (0, 1) is a correction parameter. Each RoI beginning at the
proposal has two arrows from left to right (for illustration purposes), where these arrows express the diverse possibilities that the R-CNN in a cascade can predict. It
follows that there are several possible paths from the initial proposal to the final prediction due to information ambiguity. But as long as β ∈ (0, 1), each prediction of
a cascade serves to reduce the initial error ε. Eventually, after several cascades, the predictions converge closely to the ground truth (rightmost figure), with β^K being a very small number.
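The claimed reduction is geometric: if each cascade shrinks the remaining localization error by a factor β ∈ (0, 1), then K cascades leave an error of β^K ε. A minimal numeric sketch (β and ε here are illustrative values, not figures from the paper):

```python
beta = 0.5      # illustrative correction parameter, beta in (0, 1)
eps0 = 40.0     # illustrative initial localization error (pixels)

err = eps0
for k in range(1, 8):      # 7 cascades, as in System 1
    err *= beta            # each cascade reduces the error by factor beta
    print(f"cascade {k}: error = {err:.4f}")

# After 7 cascades the residual error is beta**7 * eps0 = 0.3125
```

Even a modest per-cascade correction factor drives the residual error close to zero within a handful of cascades, which is the intuition behind the convergence shown in the rightmost panel of Fig. 9.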
Fig. 10. Cascaded information gathering across steps. Neighboring contextual information is extracted as cascades progress: cascade 1 (1st row, 1st col), cascade 2 (1st row, 2nd col), cascade 3 (1st row, 3rd col), cascade 4 (1st row, 4th col), cascade 5 (2nd row, 1st col), cascade 6 (2nd row, 2nd col), cascade 7 (2nd row, 3rd col).
optimization, i.e. it is optimized to reduce bounding box errors from malformed proposals. In addition, each bounding box regression f results in a rectified proposal f(x) that lessens the initial ambiguity (as shown in the rectified proposals of Fig. 9 from cascade 1 to cascade 2). Hence, f(x) contains more information than x, and the next cascade f(f(x)) contains even more information. This process can be modeled as a recursion. Eventually, repeated application of f results in an error of β^K ε in the final cascade K, which is a small number. The bounding boxes in this case, f^K(x), converge closely to ground truth.

The mathematical expressions for the recursion are shown in the Appendix. Our mathematical analysis relies on the assumption that later cascades refine prior cascades under SGD minimization of a convex loss function, i.e. with a correction parameter β ∈ (0, 1). Under this assumption, the cascade ensemble is able to reduce both bias and variance, whereas traditional ensemble averaging is limited to reducing variance only.

We can see the correction mechanism in the behaviour of the bounding box predictions shown in Fig. 10. The first cascade has
predicted boxes that are poorly aligned, and a coral object is wrongly identified as fish. As the network progresses to cascade 2, several of the initial predicted boxes for fish objects are re-aligned, and the false coral detection box is expanded to include a larger area of the initial proposal. In cascade 3, the network managed to detect that the coral detection is a non-fish object and rectified its prediction. From cascade 4 to 7, only correct fish objects are identified as valid detections. This pattern across cascades shows a step-wise gathering of contextual information by the ensemble CNN units around the neighboring area of an initial Region of Interest. This information gathering process occurs whenever a cascade j predicts bounding box coordinate adjustments to refine the initial reference box from cascade j − 1. The network uses the newly predicted box coordinates from cascade j to re-extract features from the shareable feature map upon the start of cascade j + 1. In this example, a total of 7 boxes are predicted within the neighboring area of an initial proposal from cascade 1 to cascade 7. With multiple box predictions and repeated RoI-crop feature re-extractions for each cascade, there is a greater likelihood that the ensemble R-CNN eventually gathers key contextual information such as the locations of a fish's snout and tail or the actual boundaries of a non-fish object (e.g., coral). This contextual information gathering allows the network to automatically correct predictions for both proposal coordinates and objectness probabilities, thereby increasing precision.

Aside from increasing precision, a cascade structure also improves recall. With repeated feature re-extractions and bounding box predictions, poorly formed initial RoIs in prior cascades are eventually corrected in future cascades. As corrected RoIs tend to include more contextual information, missed objects in prior cascades due to false objectness probability predictions are stochastically rectified in future cascades, leading to more true positive detections. This mechanism is shown in Fig. 11, where Cascade 1 has several false detections with poor alignments, and Cascade 2 misclassified some proposals as non-objects, missing 2 fishes. Cascade 3 up to Cascade 7 re-detected the 2 missing fishes after re-prediction of object boundaries and re-extraction of better RoI feature information. This increased recall performance.

In Fig. 11, having only two cascades may not be optimal for increasing object recall as some detections may still be missed. This explains why the baseline system, which is a single non-ensemble network with only two cascades, produced a low recall of 21%.

7.3. Test set 3 systems performance: robustness testing through multi-crop inference with scale distortions

This section describes another test experiment which subjects the three systems to robustness tests using eight test videos. Here, the test images are cropped according to nine different sections, and predicted detections for all 9 sections are combined for final inference. (See Fig. 12.) Each cropped section is tested according to a scale multiplier of 0.75, 1.0, and 2.0. The rationale behind this method is to test the generalization capacity of the three systems across different image scales while removing portions of the global context. If the system performs well despite the removal of global context and scale distortions, it means that the system can generalize and is robust to overfitting. Among the different scales, the system is expected to perform worst for a scale resize of 0.75, since information is lost upon down-sampling of the image by 25%.

We note that multi-crop inference could actually increase system performance in some instances (Fathi et al., 2019) since it allows networks to focus on a sub-region. Given nine sub-sections, the total number of proposals could increase up to nine times. However, improvements in localization depend on the network's capacity to predict well despite the removal of global context. In this test, each system is allowed 2000 proposals per section along with an objectness probability threshold per detected fish object of 70%.

From Tables 5, 6 and 7, all three systems performed worse for the scale resize of 0.75. This type of performance degradation is expected given that downsampling of the image removes key information defining fish objects. However, even with downsampling at this rate, System 1 performs best with the highest F-score of 33.64. The separate network ensemble of System 2 performed poorly with an F-score of 24.65, indicating that the system is not robust to distortions of smaller image scales.

Given a scale multiplier of 1.0, both System 2 and the baseline system displayed better performance compared to non multi-crop inference, in accordance with the findings in (Fathi et al., 2019) for ensembles with separated networks. While System 2 has the highest recall given a scale multiplier of 2.0, its precision suffered, with a score of only 40.01. This indicates that System 2, despite its ensemble mechanism, is not very consistent across different scales. The most consistent model for multi-crop inference across all scale resize ratios is System 1. In fact, System 1's F-score increased from 44.21% in non multi-crop inference to 48.84% given a scale multiplier of 1.0, and to a larger value of 56.15 given a scale multiplier of 2.0. The large increase indicates that System 1 can utilize the image's expanded resolution to improve detection. Among all the tests conducted in this experiment, System 1 with a 2.0 scale increase and multi-crop inference reported the best performance.

7.4. Test set 4 systems performance: ablation tests for system 1

We implement Test Set 4 as an ablation test for System 1 where we determine the effect of the recurrent LSTM links. Instead of LSTM links, we implement vector links in the form of flattened feature maps. More specifically, for Step 5 in the algorithm shown in Sec. 3.3, instead of the 2048-dimensional hidden units Sj−1 that are passed to the fully connected layer, we implement average pooling over the 3 × 3 sub-feature map produced by the CNN component, resulting in a 2048-d vector. To serve as a link, we concatenate the 2048-d averaged vector in cascade component j with the respective 2048-d averaged vector in the previous cascade component j − 1. The result is a 4096-d concatenated vector that serves as input to the fully connected layer for bounding box prediction.

We implement the same multi-crop tests as in Test Set 3 on the modified System 1 with vector links. From Table 8 to Table 10, we show the results of the original System 1 with LSTM links, the modified System 1 with vector links, and the baseline System 3. We choose to include System 3 among the results in Test Set 4 since it is equivalent to a 2-component cascaded ensemble variant of System 1.

From Table 8, having LSTM links at a scale distortion of 0.75× does not indicate any performance improvement, as performance from the modified System 1 is comparable with the original system. But from Table 9 to Table 10, it can be seen that System 1's performance with LSTM links improves, while the modified System 1 with vector links reports poor performance at a scale distortion of 2.0×. In fact, the baseline non-ensemble system performs even better than the modified System 1 at a scale distortion of 2.0×. This indicates that LSTM links with attention mechanisms provide more robustness, since they are able to maintain good performance despite multiple scale distortions.

7.4.1. Insights on attention mechanisms in system 1 LSTM unit
The robustness of System 1's performance can be attributed to the attention mechanism in System 1's LSTM, which focuses on sub-regions and links their features in a sequential fashion. It does not rely on features taken from the entire object RoI image, compared to Systems 2 and 3, which convolve on the entire 6 × 6 RoI feature map after RoI-cropping. This means that System 1 has to detect the key features of an
Fig. 11. Increased recall with 7-cascade R-CNN: Cascade 1 (1st row, left), cascade 2 (1st row, right), cascade 3 (2nd row, left), cascade 4 (2nd row, right), cascade 5
(3rd row, left), cascade 6 (3rd row, right), cascade 7 (4th row).
object, and re-orientations due to scale distortions do not affect inference. A sample of the attention mechanism procedure is shown in Fig. 13, where 9 key features of the fish object are gathered and arranged in a sequential manner. The LSTM unit uses the sequence of fish object portions to construct the overall object. Since it does not depend on global information, it is rendered more robust to scale distortions, similar to a 'bag-of-words' scheme.

8. Future work

For future research, System 1 can be modified to include additional sub-networks for various tasks, e.g. species classification or semantic segmentation. The network can likewise be modified to incorporate temporal information captured in frame sequences, e.g. fish movements. This allows prediction not only of static fish locations but also of their behavioural (swimming) patterns. In terms of cascade
Fig. 12. Multi-Crop Inference with Nine (9) Subsections. During inference, the network systems process each subsection independently. This leaves out global information, but allows the networks to focus more proposals on a single cropped section during inference, leading to more detections for increased recall.
Table 5
Test Set 3: Multi-Crop Inference Performance Statistics (Scale Multiplier: 0.75).

System    Precision  Recall  F-Score
System 1  39.56      29.56   33.64
System 2  28.89      21.19   24.45
Baseline  33.74      17.34   22.91

The bold figures indicate the highest score for each performance measurement.

Table 9
Test Set 4: Multi-Crop Inference Ablation Tests (Scale Multiplier: 1.0).

System                             Precision  Recall  F-Score
System 1 (w/ LSTM)                 48.51      49.18   48.84
System 1 (w/ vector links)         54.72      34.68   42.45
Baseline (System 1 w/ 2 cascades)  55.00      21.50   30.92

The bold figures indicate the highest score for each performance measurement.
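The F-scores throughout Tables 3 to 10 are the standard harmonic mean of precision and recall; a quick check in Python against the System 1 (w/ LSTM) row of Table 9 (a sketch, not code from the paper):

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall, here in percent."""
    return 2 * precision * recall / (precision + recall)

# System 1 (w/ LSTM), Table 9: precision 48.51, recall 49.18
print(round(f_score(48.51, 49.18), 2))  # -> 48.84
```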
Table 6
Test Set 3: Multi-Crop Inference Performance Statistics (Scale Multiplier: 1.0).

System    Precision  Recall  F-Score
System 1  48.51      49.18   48.84
System 2  48.25      27.81   35.28
Baseline  55.00      21.50   30.92

The bold figures indicate the highest score for each performance measurement.

Table 10
Test Set 4: Multi-Crop Inference Ablation Tests (Scale Multiplier: 2.0).

System                             Precision  Recall  F-Score
System 1 (w/ LSTM)                 60.32      52.52   56.15
System 1 (w/ vector links)         30.29      38.27   33.82
Baseline (System 1 w/ 2 cascades)  47.11      44.20   45.61

The bold figures indicate the highest score for each performance measurement.
structure, further research can be done to analyse the effects of cascade lengths on overall accuracy. In addition, we constructed System 1 to work well in offline server systems, i.e. computers with GPUs. For ROV platforms with lower computational resources, System 1 can be simplified to accommodate computational hardware with less memory without compromising much of its accuracy.

Table 7
Test Set 3: Multi-Crop Inference Performance Statistics (Scale Multiplier: 2.0).

System    Precision  Recall  F-Score
System 1  60.32      52.52   56.15
System 2  40.01      61.16   48.42
Baseline  47.11      44.20   45.61

The bold figures indicate the highest score for each performance measurement.

9. Conclusion
Fig. 13. Sample sub-image sequence in the LSTM unit of System 1. The LSTM unit focuses on key features from each sub-section in the image given a 3 × 3 RoI
feature map produced by a CNN unit.
Appendix A
DL Deep Learning
CNN Convolutional Neural Network
R-CNN Region Convolutional Neural Network
Faster R-CNN Faster Region Convolutional Neural Network (the basic localization network)
G-RMI Google Research and Machine Intelligence (a type of ensemble localization network)
MNC Multi Network Cascade (a type of cascade localization network)
LSTM Long Short-Term Memory Unit (a type of recurrent neural network)
SEACLEF dataset for fish object localization
PASCAL VOC dataset for object localization (with larger and fewer objects than COCO)
COCO dataset for object localization (harder dataset for localization than PASCAL VOC)
Fig. 14. System 2 Network Structure - three networks are trained independently, each with its own trunk, Region Proposal Network, and R-CNN. Each R-CNN has two cascade components for proposal refinement. The final bounding box and objectness probabilities come from a combination of the three separate networks.
The baseline system follows a Faster R-CNN network structure fitted with two cascade components similar to MNC, and serves as the baseline for comparison (Zhuang et al., 2017). It has a single 50-layer residual network trunk that branches out to one RPN and a 2-cascade R-CNN. It closely mirrors the network in (Dai et al., 2015) and in (Zhuang et al., 2017), where Faster R-CNN is applied to fish detection. Due to the lack of ensemble mechanisms in this system, however, it is expected to be outperformed by Systems 1 and 2, as verified in (Fathi et al., 2019).
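As a structural sketch only, the baseline's data flow — shared trunk, then RPN proposals, then two R-CNN cascade refinements — can be outlined as below. The stand-in functions, the /16 feature stride, the proposal count, and the nudge magnitudes are all placeholders, not the paper's configuration.

```python
import random

random.seed(0)

def trunk(image_hw):
    """Stand-in for the ResNet-50 trunk: records the /16 feature-map size."""
    h, w = image_hw
    return (h // 16, w // 16)

def rpn(feat_hw, n_proposals=5):
    """Stand-in RPN: emit n_proposals boxes [x0, y0, x1, y1] on the feature grid."""
    fh, fw = feat_hw
    boxes = []
    for _ in range(n_proposals):
        xs = sorted(random.uniform(0, fw) for _ in range(2))
        ys = sorted(random.uniform(0, fh) for _ in range(2))
        boxes.append([xs[0], ys[0], xs[1], ys[1]])
    return boxes

def rcnn_cascade(boxes):
    """Stand-in R-CNN stage: small bounding-box regression nudge per box."""
    return [[c + random.gauss(0, 0.1) for c in b] for b in boxes]

feat = trunk((480, 640))
boxes = rpn(feat)
for _ in range(2):          # two cascade components, as in the baseline
    boxes = rcnn_cascade(boxes)
print(len(boxes), len(boxes[0]))  # -> 5 4
```

In the real baseline each `rcnn_cascade` step would also RoI-crop the shared feature map with the refined boxes before regressing again; the sketch only shows the control flow.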
To express our ideas regarding the benefit of cascade ensembles, we present a simple mathematical model of the cascade correction mechanism. Let there be K components in the cascade. Without loss of generality, let f_i(x) denote a computable function whose parameters are the optimal network weights of each component in the cascade. These components are indexed by i ∈ K, s.t. f_i(x) = f_j(x) ∀ i ≠ j ∈ K given input x, i.e. all cascade components are equal and are at their optimal values (under SGD minimization with convex loss). Without loss of generality, let f(x_i) = x_{i+1}, i.e. f(x_i) : ℝ^d → ℝ^d, where d is the dimension of the feature map, i.e. f maps sections of feature maps to sections of feature maps through the RoI-cropping process. From Section 3.3.1, the cascades result in a recursion using f, where:
x_0 = input for cascade 1
x_1 = f_1(x_0)
x_2 = f_2(x_1) = f_2(f_1(x_0))
…
x_i = f_i(f_{i−1}(… f_1(x_0) …))    (4)
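Eq. (4) is plain function composition; a sketch in Python, where the toy f is a contraction toward a fixed point standing in for a trained box regressor (the pull factor 0.5 and ground-truth value 1.0 are illustrative assumptions):

```python
def compose_cascade(f, x0, K):
    """Apply f K times: x_i = f(x_{i-1}); return the sequence [x_0, x_1, ..., x_K]."""
    xs = [x0]
    for _ in range(K):
        xs.append(f(xs[-1]))
    return xs

# toy f: pulls x halfway toward a 'ground truth' of 1.0 at each cascade
f = lambda x: x + 0.5 * (1.0 - x)
print(compose_cascade(f, 0.0, 4))  # -> [0.0, 0.5, 0.75, 0.875, 0.9375]
```

Each element of the sequence halves the remaining distance to the fixed point, mirroring the β-factor error reduction assumed in the text.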
The recursion results in a sequence of inputs [x_0, x_1, x_2, …, x_K], where each input x represents a section of the feature map (enclosed in a regressed bounding box). The cascade ensemble supposes that as cascades progress, the coordinates of the section's bounding box approach their ground-truth coordinates more closely. More formally, let b(x) be a function that computes the coordinates of x, i.e. b(x) → [x1, y1, x2, y2]. Let the ground-truth coordinates be b*. The cascade ensemble assumes that |b(x_i) − b*| ≥ |b(x_j) − b*| ∀ j > i ∈ K, s.t. |b(x_i) − b*| → 0 and b(x_i) → b* as i → ∞. The result of this assumption is that inputs for higher-indexed cascades have less error and possess more information on the actual object. This assumption is reasonable given Eq. 2, where each cascade tries to minimize a convex loss function using SGD for accurate bounding box regression. Hence, an application of f(x) results in coordinates b(x) that are closer to b*. This process is actually the same as what is done by the R-CNN in the Faster R-CNN
ε_{b,i+1} = β ε_{b,i} = β [b*(f*) − b(f_i(x_{i−1}))]    (5)

β ∈ (0, 1) is a 'correction parameter' that adjusts the errors of prior cascades. Using Eqs. 4 and 5, we form a recursion of b(f) relative to the ground truth b*(f*):
b*(f*) = b[f_1(x_0)] + ε_{b,0}        (start of cascade with input x_0)
b*(f*) = b[f_2(x_1)] + ε_{b,1} = b[f^2(x_0)] + β ε_{b,0}
…
b*(f*) = b[f_i(x_{i−1})] + ε_{b,i−1} = b[f^i(x_0)] + β^{i−1} ε_{b,0}
…
b*(f*) = b[f_K(x_{K−1})] + ε_{b,K−1} = b[f^K(x_0)] + β^{K−1} ε_{b,0}
where |b*[f*] − b[f^K(x_0)]| → 0 as cascades progress i → K, given the assumption on the correction parameter β ∈ (0, 1). This implies b[f^K(x_0)] → b*[f*], assuming that f* represents optimal weights, as K → ∞. This expresses the notion that as cascades progress from 1 to K, they refine the original error ε_{b,0} and provide more accurate bounding boxes over a refined feature map b(f^i(x)). The same equations apply for predictions on objectness probabilities p:
p*(f*) = p[f_1(x_0)] + ε_{p,0}        (start of cascade with input x_0)
p*(f*) = p[f_2(x_1)] + ε_{p,1} = p[f^2(x_0)] + β ε_{p,0}
…
p*(f*) = p[f_i(x_{i−1})] + ε_{p,i−1} = p[f^i(x_0)] + β^{i−1} ε_{p,0}
…
p*(f*) = p[f_K(x_{K−1})] + ε_{p,K−1} = p[f^K(x_0)] + β^{K−1} ε_{p,0}
Hence, comparing a cascade ensemble f with a traditional ensemble g, suppose that all cascade and traditional ensemble components f and g have equal variance σ^2 with no correlation, E[ε_i ε_j] = 0 for i ≠ j. Then the variances of both ensemble types are equal. However, for K components in both types we have:

|b*(f*) − b[f_K(x_{K−1})]| ≤ |b*(f*) − (1/K) Σ_{i=1}^{K} g_i(x_0)|

i.e. the bias in the cascade estimate f_K(x_{K−1}) at the end of the K-th cascade is less than that of a traditional ensemble g, which merely averages the outputs of the K components.
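The bias claim can be checked numerically: averaging K identically biased estimators leaves the bias unchanged, while a cascade with correction parameter β shrinks it to β^K of the original. A sketch with illustrative numbers (not the paper's data):

```python
K = 5
beta = 0.6
b_star = 100.0          # ground-truth coordinate
x0 = 70.0               # biased initial estimate, error eps = 30

# traditional ensemble: average K copies of the same biased estimate
avg_bias = abs(b_star - sum([x0] * K) / K)

# cascade: each stage removes a (1 - beta) fraction of the remaining error
x = x0
for _ in range(K):
    x = x + (1 - beta) * (b_star - x)
cascade_bias = abs(b_star - x)

print(avg_bias, round(cascade_bias, 4))  # cascade bias = beta**K * 30
assert cascade_bias < avg_bias
```

With correlated, identically biased components, averaging cannot touch the bias term; the cascade's sequential corrections can, which is the inequality stated above.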
References

Spampinato, C., Giordano, D., Di Salvo, R., Chen-Burger, Y.-H.J., Fisher, R.B., Nadarajan, G., 2010. Automatic fish classification for underwater species behavior understanding. In: Proceedings of the First ACM International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams. ACM, pp. 45–50.
LeCun, Y., Touresky, D., Hinton, G., Sejnowski, T., 1988. A theoretical framework for back-propagation. In: Proceedings of the 1988 Connectionist Models Summer School. vol. 1. CMU, Pittsburgh, Pa: Morgan Kaufmann, pp. 21–28.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-scale Image Recognition. CoRR abs/1409.1556. URL http://arxiv.org/abs/1409.1556.
Li, J., Liang, X., Li, J., Xu, T., Feng, J., Yan, S., 2016. Multi-stage Object Detection With Group Recursive Learning. CoRR abs/1608.05159. URL http://arxiv.org/abs/1608.05159.
Dai, J., He, K., Sun, J., 2015. Instance-aware Semantic Segmentation Via Multi-task Network Cascades. CoRR abs/1512.04412. URL http://arxiv.org/abs/1512.04412.
Li, X., Shang, M., Qin, H., Chen, L., 2015. Fast accurate fish detection and recognition of underwater images with fast r-cnn. In: OCEANS 2015 - MTS/IEEE Washington, pp. 1–5. https://doi.org/10.23919/OCEANS.2015.7404464.
Villon, S., Chaumont, M., Subsol, G., Villéger, S., Claverie, T., Mouillot, D., 2016. Coral reef fish detection and recognition in underwater videos by supervised machine learning: Comparison between deep learning and hog+svm methods. In: International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, pp. 160–171.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Krogh, A., Vedelsby, J., 1995. Neural network ensembles, cross validation, and active learning. In: Advances in Neural Information Processing Systems, pp. 231–238.
Hastie, T., Tibshirani, R., Friedman, J., 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York. URL https://books.google.com.ph/books?id=yPfZBwAAQBAJ.
Hara, K., Liu, M., Tuzel, O., Farahmand, A., 2017. Attentional Network for Visual Object Detection. CoRR abs/1702.01478. URL http://arxiv.org/abs/1702.01478.
Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312. URL http://arxiv.org/abs/1405.0312.
Costa, C., Loy, A., Cataudella, S., Davis, D., Scardi, M., 2006. Extracting fish size using dual underwater cameras. Aquac. Eng. 35 (3), 218–227.
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88 (2), 303–338.
Fathi, A., Korattikara, A., Sun, C., Fischer, I., Huang, J., Murphy, K., Zhu, M., Guadarrama, S., Rathod, V., Song, Y., et al., 2019. G-rmi Object Detection. URL.