
Neurocomputing 423 (2021) 490–505


CentroidNetV2: A hybrid deep neural network for small-object segmentation and counting

Klaas Dijkstra (a,c,*), Jaap van de Loosdrecht (a), Waatze A. Atsma (b), Lambert R.B. Schomaker (c), Marco A. Wiering (c)

(a) Professorship Computer Vision and Data Science, NHL Stenden University of Applied Sciences, P.O. Box 1080, 8900 CB Leeuwarden, The Netherlands
(b) Vitens N.V., Snekertrekweg 61, 8912 AA Leeuwarden, The Netherlands
(c) Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, P.O. Box 407, 9700 AK Groningen, The Netherlands

ARTICLE INFO

Article history: Received 22 December 2019; Revised 26 July 2020; Accepted 11 October 2020; Available online 5 November 2020. Communicated by Lei Zhang.
2000 MSC: 68T10, 68T45
Keywords: Deep Learning, Computer Vision, Convolutional Neural Networks, Object Detection, Instance Segmentation

ABSTRACT

This paper presents CentroidNetV2, a novel hybrid Convolutional Neural Network (CNN) that has been specifically designed to segment and count many small and connected object instances. This complete redesign of the original CentroidNet uses a CNN backbone to regress a field of centroid-voting vectors and border-voting vectors. The segmentation masks of the individual object instances are produced by decoding centroid votes and border votes. A loss function that combines cross-entropy loss and Euclidean-distance loss achieves high quality centroids and borders of object instances. Several backbones and loss functions are tested on three different datasets ranging from precision agriculture to microbiology and pathology. CentroidNetV2 is compared to the state-of-the-art networks You Only Look Once Version 3 (YOLOv3) and Mask Recurrent Convolutional Neural Network (MRCNN). On two out of three datasets CentroidNetV2 achieves the highest F1 score and on all three datasets CentroidNetV2 achieves the highest recall. CentroidNetV2 demonstrates the best ability to detect small objects, although the best segmentation masks for larger objects are produced by MRCNN.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Counting many small and connected objects is an important and challenging image analysis task [1,2]. Many applications for counting objects exist, ranging from microbiology [3] to precision agriculture [4]. For example, to test the quality of drinking water a sample of water is inoculated on a Petri dish containing an agar. This dish is then placed inside an incubator to promote bacterial growth. The number of bacterial-colony clusters growing inside the dish is an important indicator for the quality of the water. This type of microbiological procedure usually involves counting many small and connected circular colonies with a specific morphology [5]. Automated approaches use either traditional computer vision [6] or are based on deep learning [3]. Another example is in the field of precision agriculture. The state of crops needs to be monitored continuously, and an important indicator for predicting crop yield is the number of crops and the size of the crops.

In previous research the use of deep learning was investigated for counting and localizing crops in images produced by a camera mounted on an Unmanned Aerial Vehicle (UAV). This paper shows several improvements over the original CentroidNet algorithm [4] and discusses additional results on other datasets as well as ablation studies on the backbones, the loss functions and on pretraining.

Deep neural networks have consistently been shown to produce state-of-the-art results for many complex image analysis tasks for which enough data is available. Due to the large variety in counting tasks, this data-driven approach is promising for getting good results. In deep learning a large set of annotated data is used to train a specific model. Nowadays mainly Convolutional Neural Networks (CNNs) are used for a multitude of image analysis tasks like classification, segmentation, object detection, instance segmentation, image data synthesis and resolution enhancement in hyperspectral images [7–12].

⇑ Corresponding author. E-mail address: [email protected] (K. Dijkstra).
https://doi.org/10.1016/j.neucom.2020.10.075

A typical method to count objects with a CNN is to train an object detection model and subsequently count the number of detected objects [13–15]. Because most object-detection neural networks are designed to detect typical everyday objects, they might provide inferior results on counting tasks where small and connected objects are involved. An alternative method to count objects is to regard counting as a regression task. In this case the number of counted objects is directly estimated from images or crops of images [16–19]. This is mostly used in congested scenes when it is difficult to individually detect objects. An example of this approach is estimating the number of people in crowds [20–22]. Recent approaches combine object localization and object detection with counting [4,23].

When counting objects in an image without regarding their location there is a risk of unwanted count compensation. When this happens an underestimate of the count in one part of the image compensates the overestimation in another part of the image. To correctly validate counting results the location of the objects should also be taken into account. A suitable metric for detection and counting is the F1 score, which is the harmonic mean between precision and recall and represents the optimal equilibrium between overestimating and underestimating the number of objects. This paper will focus on models for object detection and instance segmentation because these models can estimate the location and dimensions of the counted objects simultaneously. In this paper a new hybrid deep learning architecture called CentroidNetV2 is introduced.

1.1. Contributions and research questions

The original version of CentroidNet [4] is a Fully Convolutional Network (FCN) that is trained using a standard Mean Squared Error (MSE) loss function. A U-net backbone is used to regress a field of 2-d vectors. These vectors are trained to predict the location of the centroid of the nearest object. CentroidNet is independent of image size during training and during inference, because vectors only encode relative positions and are not scaled by the image size. A voting algorithm is used to produce the actual centroids of the objects. Although demonstrating state-of-the-art results, the original CentroidNet has some limitations: the sizes of the objects are not estimated and the standard MSE loss does not specifically penalize the segmentation and the voting mechanism incorporated in the algorithm. Finally, CentroidNet was only evaluated using the U-net backbone.

In CentroidNetV2 several improvements are proposed, while still maintaining the deep-learning and computer-vision hybrid design and the majority voting mechanisms. Firstly, for each pixel, an additional 2-d vector is predicted which represents the relative location of the nearest border of the object with the nearest centroid. This border information is used to predict the delineation of objects. Therefore, in a computer vision context, CentroidNetV2 is regarded as a form of contour fitting [24] with properties similar to the generalized Hough transform [25]. CentroidNetV2 produces instance-segmentation masks by fitting a predefined geometric shape through the border points. In a deep learning context CentroidNetV2 is considered an instance segmentation model.

We compare CentroidNetV2 to other well-known state-of-the-art networks that have comparable complexity and goals. The ResNet backbones for Mask Recurrent Convolutional Neural Network (MRCNN) and CentroidNetV2 give them comparable complexity. A specific shared goal of CentroidNetV2 and You Only Look Once Version 3 (YOLOv3) is small-object detection. In this paper we aim to focus on the general applicability of the CentroidNetV2 architecture for detecting and segmenting small objects. Therefore we have chosen three datasets to cover a broader range of applications. This also means that we do not focus on solving any one specific application (for example, colony counting or crop detection). Furthermore, and because of this broader scope, we compare to well-known and general methods for object detection and segmentation. We also provide a comparison between the properties of the original CentroidNet and CentroidNetV2.

In addition to the architectural changes several ablation studies are performed. The loss function is redesigned to contain two Euclidean loss terms and a cross-entropy term. The loss terms are compared to the original MSE loss function. We aim to investigate the effect of several architectural choices. In principle any semantic segmentation network can serve as a backbone for CentroidNetV2. In the experiments U-net and DeepLabV3 with three backbones, ResNet50, ResNet101 and Xception, are evaluated as representative backbones. Finally, we also investigate if transfer learning improves the performance of CentroidNetV2.

This leads to the following research questions:

1. What is the performance of CentroidNetV2 for detecting and counting many small objects?
2. How does the performance of CentroidNetV2 compare to well-known state-of-the-art models for object detection and instance segmentation?
3. What backbones and loss functions are most suitable for CentroidNetV2?
4. What is the effect of transfer learning on the performance of CentroidNetV2?

In this paper we generally refer to a 1-d structure as a vector, a 2-d structure as a matrix and a 3-d structure as a tensor. A matrix that contains vectors is referred to as a tensor where the name indicates the type of vectors. For example: the target-centroid-vectors tensor is a matrix containing target-centroid vectors.

The remainder of this paper is structured as follows. In Section 3 the formal design of CentroidNetV2 is discussed. Section 4 explains the contents of the aerial-crops, cell-nuclei and bacterial-colonies datasets that are used for this research. The method of training and validation is discussed in Section 5 and the results are presented in Section 6. Finally, in Section 7 the conclusion and future work are discussed.

2. Related work

CNNs [26–28] are applied to an increasing number of complex image analysis tasks. One of the first break-through applications of CNNs was the classification of images from the ImageNet challenge [7]. Classification models take an image as an input and produce a single prediction for the whole image. Image segmentation using a CNN is performed by classifying each pixel into a one-hot vector representing the class of that pixel. This results in a dense segmentation mask of the entire image. Impressive performance was achieved by U-net on biomedical image data [8] and by DeepLabV3+ on segmenting everyday scenes [29]. A sparser detection is achieved by object detection CNNs like YOLOv3 [30] and RetinaNet [31]. These architectures directly estimate the bounding box and class of individual object instances in images with everyday objects. YOLOv3 focuses specifically on small-object detection.

Instance segmentation can be regarded as a combination of object detection and segmentation. MRCNN is a widely used state-of-the-art instance segmentation architecture that uses the detected boxes, called region proposals, to predict a dense segmentation mask of individual object instances [10], and this requires a two-stage training process. A Recurrent Neural Network (RNN) for instance segmentation is proposed [32], where recurrent attention is used to alternate between producing bounding boxes and producing segmentation masks for the objects within these boxes.

A proposal-free instance segmentation network is proposed [33], where segments and boxes are directly regressed and a traditional off-the-shelf clustering method is used to create individual instances. This approach, where deep learning and traditional deterministic algorithms are combined, belongs to an emerging class of hybrid algorithms. In DCAN individual dense-object instances are produced by post processing the segmentation result using the probability map to estimate segment boundaries [34]. In a similar fashion a deterministic temporal consistency algorithm is combined with a CNN to segment RGB + depth videos [35]. InstanceCut produces object instances by deterministically combining two output modalities of the CNN: a semantic segmentation mask and an instance boundary. An alternative method to estimate boundaries is proposed by the deep watershed transform, which is a deep-learning based instance segmentation method inspired by a traditional watershed transform [36].

Other instance segmentation methods directly estimate decodable shape representations. In the straight-to-shapes approach the embeddings produced by a CNN are decoded into shapes with various methods to produce delineations of instances [37]. The star-convex polygon method uses radial distances to encode object instances with a CNN [38].

Related to this are methods that predict the centroids of individual object instances. These are proposed in [39] and in CentroidNet [4]. Both of these methods use a traditional Hough-transform inspired algorithm for determining centroids after model inference. The former method uses the bounding boxes to predict dense segments and the latter uses a fixed-size bounding box and uses binning to produce sharper centroids. CentroidNet has been shown to produce state-of-the-art performance on a dataset for counting potato crops. In that approach dense spatial-voting vectors are predicted using a CNN, and a majority voting algorithm combined with non-max suppression is subsequently used to produce centroid locations.

Conceptually, the integration of machine vision and deep learning can be viewed as embedding and exploiting prior knowledge in the algorithm. For example, in CentroidNet, partially occluded and connected objects still produce votes because patches of the objects are assumed, by the algorithm, to have information about the location of the centroid. For example, the leaves of a plant and the grain of these leaves naturally point outward. This means that implicit information about the location of the centroid of a plant is contained in a small patch of the image. This way of prior-knowledge embedding has been demonstrated to outperform non-hybrid approaches [4].

3. The CentroidNetV2 architecture

The main architecture of CentroidNetV2 is shown in Fig. 1. The top part of the graph shows the inference pipeline to predict instances and their corresponding class from input images. The bottom part shows the pipeline for converting the annotations to a suitable CentroidNet format for training. An image tensor X serves as an input to the model indicated by f(·), which in turn predicts an output tensor Y containing the centroid vectors, border vectors and class logits (the score for each class). This tensor is then decoded into instance ids, class ids and class probabilities by the decoding function g(·). The ground-truth tensor Z contains class ids and instance ids and is encoded into centroid vectors, border vectors and class logits. This is done by the inverse transform of g(·), indicated by g'(·).[1] Additionally the loss function l(·,·) calculates a loss between the output tensor Y and the target tensor T.

[1] The function g'(·) is the inverse transform of g(·) if the class probability is disregarded.

For convenience and without loss of generality the functions in this section are defined using 3-d image-like tensors. However, the actual implementation uses mini batches of 3-d tensors. The three main functions f(·), g(·) and l(·,·) are explained in sub-Sections 3.1, 3.2 and 3.3 respectively.

3.1. Backbones

Function f(·) in Fig. 1 is the backbone of CentroidNetV2 and represents the trainable part. A multi-channel image serves as an input. In our experiments this is a Red Green Blue (RGB) image. The output tensor contains three separate types of predictions: each spatial position of the first two planes contains the y and x components of a relative vector that points to the nearest centroid of an object. Each spatial position of the next two planes contains the y and x components of the vectors pointing to the nearest border of the object with the nearest centroid. The final planes of the output tensor contain the logits for the semantic segmentation of all pixels. In this paper we only test binary classification, which means that this logits output consists of two planes (foreground/background). The spatial dimensions of X and Y should be identical and any semantic segmentation network can serve as backbone f(·). In our experiments the depth of the input X is 3 (RGB) and the depth of the output Y is 6 (a 2-d centroid vector, a 2-d border vector and 2 logits). This is mathematically expressed by

$Y = f(X) \qquad (1)$

$Y = [Yc \mid Yb \mid Yl], \qquad (2)$

where X is the input tensor of the model and Y is the output tensor with stacked tensors containing the centroid-vectors tensor Yc, the border-vectors tensor Yb and the logits tensor Yl.

Additionally the probabilities per logit are determined by dividing each logit by the sum of all logits for that pixel:

$Yp_{z,y,x} = \frac{Yl_{z,y,x}}{\sum_{z \in [c]} Yl_{z,y,x}}, \qquad (3)$

where Yp contains the class probabilities and c is the number of classes.

Some example centroid vectors and border vectors in Yc and Yb are geometrically shown in Fig. 2, where $p_i$, $c_i$ and $b_i$ represent the pixel coordinate, the vector to the nearest centroid and its nearest border, with three example pixels: $i \in \{1, 2, 3\}$. An important detail about border vectors is that for some coordinates, like $p_1$, the nearest border coordinate of the object with the nearest centroid is different from the nearest border coordinate. The nearest centroid to $p_1$ is of object B, but the nearest border coordinate of $p_1$ is of object A. In this case $b_1$ is the correct border vector (which is not equal to $b'_1$).

3.2. Loss functions

Function l(·,·) in Fig. 1 calculates the loss between the output tensor Y and the target tensor T. Depending on the loss function we use the logits output Yl or the probability output Yp. The target tensors are defined in a similar way as Eq. (2):

$T = [Tc \mid Tb \mid Tp], \qquad (4)$

where T consists of the stacked tensors with the target-centroid-vectors tensor Tc, the target-border-vectors tensor Tb and the target-probability tensor Tp containing n planes. Note that the target probability for a certain class is always 0 or 1.

Fig. 1. The CentroidNetV2 architecture. The top part shows the inference pipeline and the bottom part shows the pipeline for encoding the ground-truth annotations. The encoder function g'(·) is the inverse transform of the decoder function g(·).
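To make Eqs. (1)–(3) concrete, the sketch below wraps an off-the-shelf semantic segmentation network so that it predicts the six output planes and splits them into the centroid-vectors, border-vectors and logits tensors. It is an illustrative sketch rather than the published OpenCentroidNet code: the torchvision DeepLabV3 backbone, the class name and the literal logit-sum normalisation of Eq. (3) are assumptions of this example.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

class CentroidBackboneSketch(nn.Module):
    """f(.) from Fig. 1: any semantic segmentation network with 6 output planes.

    Planes 0-1: centroid-voting vectors (y, x); planes 2-3: border-voting
    vectors (y, x); planes 4-5: class logits (background/foreground)."""

    def __init__(self, num_logits: int = 2):
        super().__init__()
        # 2 centroid-vector planes + 2 border-vector planes + class logits.
        self.net = deeplabv3_resnet50(num_classes=4 + num_logits)

    def forward(self, x: torch.Tensor):
        y = self.net(x)["out"]                       # Eq. (1): Y = f(X), shape (B, 6, H, W)
        yc, yb, yl = y[:, 0:2], y[:, 2:4], y[:, 4:]  # Eq. (2): Y = [Yc | Yb | Yl]
        # Eq. (3) taken literally: each logit divided by the per-pixel logit sum.
        # In practice a softmax over the logit planes could be used instead.
        yp = yl / yl.sum(dim=1, keepdim=True)
        return yc, yb, yl, yp

# Example: a 3-channel (RGB) input produces six spatially identical output planes.
yc, yb, yl, yp = CentroidBackboneSketch()(torch.randn(1, 3, 256, 256))
```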

3.2.1. MSE loss

The original CentroidNet used the mean squared error (MSE) loss defined as:

$l_{mse}(Y, T) = \frac{1}{c \cdot h \cdot w} \sum_{z \in [c]} \sum_{y \in [h]} \sum_{x \in [w]} \left( Y_{z,y,x} - T_{z,y,x} \right)^2, \qquad (5)$

where Y and T are the output and target tensor with a size of c × h × w. In our experiments the output tensor consists of five planes and consequently z runs over 1 through 5.

Fig. 2. Examples of centroid vectors (c1, c2 and c3) pointing from the pixel coordinates (p1, p2 and p3) to the nearest centroids of objects A and B. The border vectors (b1, b2 and b3) point to the nearest border of the objects with the nearest centroid.

A limitation of using the MSE loss is the fact that it ignores the meaning of the specific planes in the output tensor Y and target tensor T. For example, the first two planes contain the y and x components of the centroid-voting vectors. For these two planes it makes more sense to use a Euclidean distance as the loss function, while the cross-entropy loss is more useful for the planes that contain the classification logits per pixel. Therefore, in CentroidNetV2, the loss function is decomposed into two different terms: vector loss and segmentation loss. These are discussed in the remaining part of this sub-section.

3.2.2. Vector loss

The Euclidean distance loss between the output-centroid vectors and target-centroid vectors, or the output-border vectors and target-border vectors, can be calculated by:

$D^2_{y,x} = \sum_{z \in [c]} \left( Yv_{z,y,x} - Tv_{z,y,x} \right)^2$
$l_d(Yv, Tv) = \frac{1}{h \cdot w} \sum_{y \in [h]} \sum_{x \in [w]} D_{y,x}, \qquad (6)$

where Yv and Tv have size c × h × w and represent the output- and target-vectors tensors. Because both the centroid vectors and border vectors are two dimensional, each vector has two components (c = 2). The sizes of the spatial dimensions h and w are the same as the dimensions of the input image.

The vector loss is calculated separately for the centroid vectors and the border vectors using Eq. (6) and then the sum is calculated:

$l_{vl}(Yc, Yb, Tc, Tb) = l_d(Yc, Tc) + l_d(Yb, Tb), \qquad (7)$

where Yc, Yb, Tc and Tb contain the output-centroid vectors, output-border vectors, target-centroid vectors and target-border vectors respectively.
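As an illustration of Eqs. (6) and (7), a minimal PyTorch version of the vector loss could look like the sketch below. The tensor layout (batch first, two vector components per field) and the small epsilon added for a stable square root are assumptions of the example, not part of the published implementation.

```python
import torch

def distance_loss(yv: torch.Tensor, tv: torch.Tensor) -> torch.Tensor:
    """l_d of Eq. (6): mean Euclidean distance between output and target
    voting vectors; yv and tv have shape (B, 2, H, W)."""
    d = torch.sqrt(((yv - tv) ** 2).sum(dim=1) + 1e-12)  # D_{y,x} per pixel
    return d.mean()                                      # average over all pixels

def centroidnet_vector_loss(yc, yb, tc, tb) -> torch.Tensor:
    """l_vl of Eq. (7): sum of the centroid-vector and border-vector losses."""
    return distance_loss(yc, tc) + distance_loss(yb, tb)
```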

3.2.3. Segmentation loss

The second term, the per-pixel classification loss or segmentation loss, can be calculated in two ways. The cross-entropy loss is defined as:

$CE_{y,x} = -\sum_{z \in [c]} Tp_{z,y,x} \log\left( Yp_{z,y,x} \right)$
$l_{ce}(Yp, Tp) = \frac{1}{h \cdot w} \sum_{y \in [h]} \sum_{x \in [w]} CE_{y,x}, \qquad (8)$

where Yp and Tp are the output-probability tensor and the target-probability tensor (with values of either one or zero), c is the number of classes and h and w are the spatial dimensions of the respective tensors.

The Intersection over Union (IoU) loss is defined as 1 minus the intersection divided by the union of the class probabilities. The IoU loss has been shown to outperform the cross-entropy loss in [40,41] and is defined by:

$I_z = \sum_{y \in [h]} \sum_{x \in [w]} Yp_{z,y,x} \cdot Tp_{z,y,x}$
$U_z = \sum_{y \in [h]} \sum_{x \in [w]} \left( Yp_{z,y,x} + Tp_{z,y,x} - Yp_{z,y,x} \cdot Tp_{z,y,x} \right) \qquad (9)$
$l_{iou}(Yp, Tp) = 1 - \frac{1}{c} \sum_{z \in [c]} \frac{I_z}{U_z}$

3.2.4. CentroidNetV2 loss

The individual terms of the loss functions are tested and their performance is reported in the results section of this paper. Eq. (10) combines the vector loss and the cross-entropy loss and Eq. (11) combines the vector loss and the IoU loss:

$l_{vl.ce}(Y, T) = l_{vl}(Yc, Yb, Tc, Tb) + l_{ce}(Yp, Tp) \qquad (10)$

$l_{vl.iou}(Y, T) = l_{vl}(Yc, Yb, Tc, Tb) + l_{iou}(Yp, Tp), \qquad (11)$

where Y is the output tensor containing the output-centroid-vectors tensor Yc, the output-border-vectors tensor Yb and the output-probabilities tensor Yp; similarly, T is the target tensor containing the target-centroid-vectors tensor Tc, the target-border-vectors tensor Tb and the target-probabilities tensor Tp.

3.3. Coders

This sub-section discusses the decoder function g(·) and the encoder function g'(·) of Fig. 1. These functions represent the deterministic parts of CentroidNetV2. During inference the decoder calculates the output tensor R from the output Y of the model. The decoder is responsible for decoding centroid vectors, border vectors and logits into instance ids, class ids and their probabilities. The encoder generates the target tensor T given the annotations. This can be regarded as preprocessing the ground truth. The encoder is responsible for encoding instance ids and class ids into centroid vectors, border vectors and class logits.

3.3.1. Decoder

Individual object instances are calculated from the output of the model using the centroid-vectors tensor Yc, the border-vectors tensor Yb and the class-probabilities tensor Yp, defined in Eqs. (1)–(3).

Algorithm 1. Calculate the voting matrix given the output-voting-vectors tensor.

1:  function VOTE(Yv)
2:      h, w ← height, width of Yv
3:      V ← zero-filled matrix of size (h, w)
4:      for y ← 1 to h do
5:          for x ← 1 to w do
6:              y' ← y + Yv[1, y, x]            ▷ Get the absolute y component
7:              x' ← x + Yv[2, y, x]            ▷ Get the absolute x component
8:              V[y', x'] ← V[y', x'] + 1       ▷ Aggregate votes
9:          end for
10:     end for
11:     return V
12: end function

Algorithm 2. Calculate the border coordinates of an instance with respect to a given centroid.

1:  function BORDER((yc, xc), Yb, Yc)
2:      B ← {}
3:      h, w ← height, width of Yc              ▷ Get spatial dimensions of the input
4:      for y ← 1 to h do
5:          for x ← 1 to w do
6:              y'c ← y + Yc[1, y, x]           ▷ Get absolute centroid vector y
7:              x'c ← x + Yc[2, y, x]           ▷ Get absolute centroid vector x
8:              if (y'c, x'c) == (yc, xc) then  ▷ Contributed to centroid (yc, xc)
9:                  y'b ← y + Yb[1, y, x]       ▷ Get absolute border vector y
10:                 x'b ← x + Yb[2, y, x]       ▷ Get absolute border vector x
11:                 B ← B ∪ {(y'b, x'b)}        ▷ Add border coordinate
12:             end if
13:         end for
14:     end for
15:     return B                                ▷ Border coordinates of object with centroid (yc, xc)
16: end function

Initially the VOTE(·) function in Algorithm 1 calculates the voting matrix. An output-voting-vectors tensor Yv serves as an input (this can either be Yc or Yb). This tensor contains the relative 2-d centroid vectors for every spatial location of the corresponding input image. The absolute coordinates y' and x' are calculated by adding the image coordinates y and x to each vector component. In the voting map the votes, represented by these absolute[2] voting vectors, are summed.

[2] In this context the term 'absolute' refers to the fact that all vectors are recalculated so that they have a common origin at the top-left of the image. It does not refer to the absolute value of the vector elements.
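A vectorised NumPy/SciPy variant of Algorithm 1, together with the peak selection of steps 2 and 3 below (the non-max suppression ψ(·) and the voting threshold θ of Eq. (12)), could be sketched as follows. The window size, the threshold value, the rounding of vote coordinates and the clipping of votes to the image border are illustrative choices, and plateau handling is simplified compared to the description in the text.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def vote(yv: np.ndarray) -> np.ndarray:
    """Algorithm 1: accumulate votes. yv has shape (2, H, W) and holds
    relative (dy, dx) voting vectors for every pixel."""
    h, w = yv.shape[1:]
    ys, xs = np.mgrid[0:h, 0:w]
    vy = np.clip(np.rint(ys + yv[0]), 0, h - 1).astype(int)  # absolute y of each vote
    vx = np.clip(np.rint(xs + yv[1]), 0, w - 1).astype(int)  # absolute x of each vote
    votes = np.zeros((h, w), dtype=np.int64)
    np.add.at(votes, (vy.ravel(), vx.ravel()), 1)            # aggregate votes
    return votes

def select_centroids(votes: np.ndarray, window: int = 15, theta: int = 20) -> np.ndarray:
    """Steps 2-3 of the decoder: non-max suppression followed by the
    voting threshold of Eq. (12); returns (N, 2) peak coordinates (y, x)."""
    suppressed = np.where(votes == maximum_filter(votes, size=window), votes, 0)
    ys, xs = np.nonzero(suppressed >= theta)
    return np.stack([ys, xs], axis=1)
```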

The decoder then selects centroid locations which received a high number of votes. The key idea of CentroidNetV2 is that the image locations which provided the vectors for these selected centroids might be in high-information areas in the image. The hypothesis is that these high-information locations also provide a good estimate for the border location.

In Algorithm 2 these border coordinates are calculated. A centroid coordinate (yc, xc), the border-vectors tensor and the centroid-vectors tensor serve as inputs to the algorithm. Using nested for-loops the image locations which contributed to centroid (yc, xc) are determined (Line 8) and subsequently the absolute border coordinate is added to a set of border coordinates B for that centroid (Line 11).

These border coordinates can be quite noisy, therefore a geometric shape is fitted through this set of border coordinates. This allows additional prior knowledge about the shape of the objects to be embedded in the algorithm. For example: if the goal is to look for elliptical objects, an ellipse is fitted through the border coordinates. By fitting a convex hull, arbitrary convex shapes can be accommodated by CentroidNetV2.

Finally, the class ids and probabilities of each spatial coordinate are calculated from the logits layers of the model output. The class of an object instance is determined by taking the highest class probability at the location of the centroid of that object.

The decoder is more formally defined in the following steps. The intermediate images that support the explanation of the decoder are shown in Fig. 3.

Fig. 3. An example of the data of the decoder represented by images of potato plants annotated with circles. From left to right and top to bottom: input image X, magnitudes of the centroid vectors Yc, magnitudes of the border vectors Yb, accumulated centroid votes V, set of centroid coordinates C, set of border coordinates B, color-coded instances I and per-pixel class ids C.

1. Calculate the centroid-vote matrix V = VOTE(Yc), where Yc is a tensor containing the centroid vectors predicted by the model and VOTE(·) is the voting function defined in Algorithm 1.

2. Calculate the suppressed-voting matrix V̂ = ψ(V). Function ψ(·) only keeps maximum values in a local window and is given by:

$\psi(V)_{y,x} = V_{y,x} \cdot \begin{cases} 1, & \text{if } V_{y,x} == \max_{(v,u) \in [0..n) \times [0..m)} V_{y-v,\,x-u} \\ 0, & \text{otherwise,} \end{cases}$

where y and x are spatial coordinates, and v and u are coordinates inside an n × m window of the non-max suppression. In our case plateaus of equal maxima are reduced to single points.

3. Select voting peaks by applying a threshold θ to the suppressed-voting matrix V̂ to generate the set of selected votes C given by:

$C = \left\{ (y_c, x_c) \in [h] \times [w] \mid \hat{V}_{y_c,x_c} \geq \theta \right\}, \qquad (12)$

where yc, xc are the peak coordinates and h and w are the dimensions of matrix V̂.

4. Select the set of border coordinates corresponding to a centroid. The set of border coordinates for a centroid at coordinate (yc, xc) ∈ C is given by:

$B = \mathrm{BORDER}\left( (y_c, x_c) \in C,\, Yb,\, Yc \right),$

where the function BORDER(·,·,·) calculates border coordinates for a given centroid at (yc, xc) and is given by Algorithm 2; Yb and Yc are tensors containing the border and centroid vectors respectively.

5. Fit a geometric shape (e.g. circle, ellipse, convex hull, etc.) through the set of border coordinates B for a given centroid and draw the geometric shape with a unique value in the instance-ids matrix I.

6. Calculate the class-ids matrix C and the probabilities matrix P by taking the arg max(·) and max(·) over the class probabilities:

$C_{y,x} = \arg\max_{z \in [c]} \left( Yp_{z,y,x} \right)$
$P_{y,x} = \max_{z \in [c]} \left( Yp_{z,y,x} \right),$

where c is the number of logits in the output-probabilities tensor Yp. When the probability of an element in matrix P is above φ, it is accepted in the corresponding class-id matrix C, otherwise the element is assigned to the background. In our experiments a probability threshold of 0.2 gave the best results. The class of an instance with centroid (y, x) is defined by the value of C_{y,x}.

7. Guarantee that for each element in the instance-ids matrix I and the class-ids matrix C, both the instance id and the class id are known. This means that if either the instance id or the class id is background for a certain element, both the instance id and class id for that element are set to background. Because masking is performed per pixel, the final shape of object instances can be different from the fitted shape.

The instance-ids matrix I, the class-ids matrix C and the class-probabilities matrix P are the final outputs of CentroidNetV2.

3.3.2. Encoder

Encoding is a preprocessing step needed to convert the ground-truth annotations to a format that can be used to train the model. An annotation of an input image X consists of the target-class-ids matrix C' in which each element represents the class of a pixel in the input image, and the target-instance-ids matrix I' in which each element represents the id of an individual object instance in the input image. The encoder can be regarded as the inverse of the decoder and therefore the input matrices are named the same as the output matrices of the decoder, but are denoted by an additional apostrophe ('). The output of the encoder is the target-centroid-vectors tensor Tc, the target-border-vectors tensor Tb and the target-probabilities matrix Tp defined in Eq. (4).

The encoding process is defined in the remaining part of this section and the intermediate images to support the explanation are shown in Fig. 4. The black border around the objects in Tc and Tb is caused by the clipping of voting vectors. This is set to roughly twice the average radius of the target objects.

All unique instance ids in matrix I' are represented by the set I. The set of coordinates of an instance with id i is given by:

$O'_i = \left\{ (y_o, x_o) \mid I'_{y_o,x_o} == i \in \mathcal{I} \right\},$

where yo and xo represent coordinates within the instance-ids matrix I'.
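To illustrate the encoding step described here, the sketch below builds the target-centroid-vectors tensor Tc from a target-instance-ids matrix, following the definitions given in the next few equations. The brute-force nearest-centroid search and the omission of the vector clipping mentioned above (the black border visible in Fig. 4) are simplifications of this example.

```python
import numpy as np

def encode_centroid_vectors(instance_ids: np.ndarray) -> np.ndarray:
    """Build Tc (shape (2, H, W)) from a target-instance-ids matrix I'
    (0 = background); assumes the image contains at least one instance."""
    h, w = instance_ids.shape
    ids = [i for i in np.unique(instance_ids) if i != 0]
    # Centroid of each instance: mean of its pixel coordinates (the sets O'_i).
    centroids = np.array([np.argwhere(instance_ids == i).mean(axis=0) for i in ids])
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)
    # For every pixel, store the relative vector to the nearest instance centroid.
    dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    nearest = centroids[dists.argmin(axis=1)]
    tc = (nearest - pixels).reshape(h, w, 2).transpose(2, 0, 1)
    return tc
```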

The set of centroids for all objects is calculated by taking the average y and x coordinate of each set of coordinates:

$C' = \left\{ \overline{O'_1}, \overline{O'_2}, \ldots, \overline{O'_n} \right\},$

where C' is the set of target centroids of the object instances and $\overline{O'_i}$ is the centroid (mean) of the spatial coordinates that belong to the instance with id i.

Subsequently the tensor with target-centroid vectors Tc is calculated by taking the difference between a spatial coordinate of Tc and the coordinate of the nearest centroid:

$Tc_{y,x} = \left( \arg\min_{(y_c, x_c) \in C'} \lVert (y_c, x_c) - (y, x) \rVert \right) - (y, x),$

where (y, x) is a spatial coordinate of the target-centroid-vectors tensor Tc and (yc, xc) are the centroid coordinates from the set of centroids C'. Note that Tc is a 3-d tensor where the third dimension has size two and contains the relative vectors ((yc, xc) ∈ C') − (y, x) pointing to the nearest centroid. Also note that the arg min function returns a vector (yc, xc) ∈ C'.

The set of border coordinates for a certain instance i is given by B'_i. The target-border-vectors tensor is then calculated as follows:

$Tb_{y,x} = \left( \arg\min_{(y_b, x_b) \in B'_i} \lVert (y_b, x_b) - (y, x) \rVert \right) - (y, x),$

where y, x are the spatial coordinates of the target-border-vectors tensor Tb and (yb, xb) are border coordinates. Tb contains the relative vectors ((yb, xb) ∈ B'_i) − (y, x) pointing to the nearest border of the object instance with the nearest centroid.

Finally, the target-probabilities matrix is given by:

$Tp_{c,y,x} = \mathbb{1}\left( C'_{y,x} == c \right),$

where Tp contains target logits, C' is the target-class-ids matrix, y and x are the spatial coordinates and c is the target-class id. The indicator function 1(·) returns one if the condition is true and zero otherwise.

The target-centroid-vectors tensor Tc, the target-border-vectors tensor Tb and the target-probabilities matrix Tp are the outputs of the encoder and are used as targets to train the model.

Fig. 4. Data of the encoder represented by images. From left to right: input image X, color-coded target-instance ids I', target-class ids C', magnitudes of the target-centroid-vectors Tc and target-border-vectors Tb. Voting vectors with high magnitudes are bright white and voting vectors with low magnitudes appear darker.

4. Datasets

In this research three datasets are used to test CentroidNetV2 and compare it to the other well-known models. These datasets are discussed in this section.

4.1. Aerial crops

The aerial-crops dataset contains images of potato crops taken with a low-cost drone which navigated over a potato field [4]. It consists of 10 frames randomly sampled from a 24 fps video shot at 10 meters altitude. The dataset contains a mix of small, connected and distinct potato plants as well as background soil. The resolution of each image is 1500 × 1800 pixels. The set contains over 3000 annotated plants using circles to indicate the location of the plants. See Fig. 5 for some examples. A 50%/50% training/validation split of the dataset is used for validation.

This set is used to compare the individual models on a relatively small amount of images, but a large amount of small objects per image. This has proven to be a good dataset for investigating how well the various networks handle a mix of small and large objects as well as high connectedness between objects. For CentroidNetV2 circles are fitted through the border coordinates to produce instances. For YOLOv3 and MRCNN the circles are calculated from the predicted bounding boxes.

4.2. Cell nuclei

The cell-nuclei dataset was used for the Kaggle Data Science Bowl 2018.[3] It contains annotated samples of cell-nuclei images taken with a microscope. This dataset consists of 673 images and has a total of 29,461 annotated nuclei. The images vary in resolution, cell type, magnification and imaging modality. The annotations are per-pixel masks indicating the individual instances of each cell nucleus. See Fig. 6 for some examples. An 80%/20% training/validation split of the dataset is used for validation.

This dataset is used to investigate how the models perform on complex data with much variation. Also the dataset is ideal for investigating how varying image resolutions are handled. For CentroidNetV2 rotated ellipses are fitted through the predicted border coordinates to produce instances. MRCNN predicts instances directly as masks. YOLOv3 has not been tested on this set because it is not able to produce instances of arbitrary shapes or rotated ellipses.

4.3. Bacterial colonies

The bacterial-colonies dataset contains images of Petri dishes with bacterial growth from water samples. In this study Legionella colonies which were cultivated on Buffered Charcoal Yeast Agar were used.[4] The dataset has been created by a water company in the Netherlands. A domain expert annotated colonies which have a typical morphology for Legionella. Additional tests were used to confirm that the colonies are Legionella species. The dataset consists of 79 images with a total of 2541 annotated bacterial colonies. The images have a resolution of 1024 × 1024 pixels. See Fig. 7 for some examples. An 80%/20% training/validation split of the dataset is used for validation.

This set is used to test the ability of the models to detect multiple connected objects with various sizes and to not detect colonies which are not Legionella suspected (the yellow colonies). An image of a dish typically contains many colonies, which makes this a good dataset for testing approaches that count many small objects. The most important reason to test various approaches on this set is that colony counting is a real practical example of a counting task which has not been sufficiently solved and, to date, still requires manual labor.

4.4. Tiling

All of the datasets described in this section contain images which are either too large or of various sizes. This means they cannot be used directly for training because a mini batch should consist of multiple equally sized images. A common approach to handle this problem is to resize all images to some predefined size. However, this would not achieve the desired result because small objects could be removed by this action. We employ two strategies to handle this problem. For CentroidNetV2 and MRCNN we randomly crop the images during training to 256 × 256 pixels.

[3] https://www.kaggle.com/c/data-science-bowl-2018.
[4] ISO 11731:2017, Water quality – Enumeration of Legionella.
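Both the decoder (step 5 in Section 3.3.1) and the dataset-specific post-processing described above fit a simple geometric shape through the noisy border coordinates of an instance. One possible realisation for circular objects such as the crops and colonies is an algebraic least-squares circle fit; the paper does not prescribe a particular fitting method, so the sketch below is only illustrative.

```python
import numpy as np

def fit_circle(border: np.ndarray):
    """Least-squares circle fit through border coordinates (N, 2) as (y, x),
    e.g. the set B returned by Algorithm 2; returns (cy, cx, radius).

    Solves y^2 + x^2 = 2*cy*y + 2*cx*x + c for the circle parameters."""
    y, x = border[:, 0].astype(float), border[:, 1].astype(float)
    a_mat = np.column_stack([2 * y, 2 * x, np.ones(len(border))])
    rhs = y ** 2 + x ** 2
    (cy, cx, c), *_ = np.linalg.lstsq(a_mat, rhs, rcond=None)
    radius = np.sqrt(c + cy ** 2 + cx ** 2)
    return cy, cx, radius
```

For the elliptical cell nuclei an analogous rotated-ellipse fit (for example OpenCV's cv2.fitEllipse) could be substituted.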

Fig. 5. Example images from the aerial-crops dataset. The images show variations in the size of the crops and high connectedness between individual crops.

Fig. 6. Example images from the cell-nuclei dataset. The images show variations in resolution, cell type, magnification and imaging modality.

Fig. 7. Example images from the bacterial-colonies dataset. The images show variations in size, color and number of colonies per Petri dish.

The best performing YOLOv3 should be trained with images of 608 × 608 pixels, as described in the original paper [30]. To be able to generate a dataset that can be used to train the original YOLOv3 in DarkNet, images are tiled with 50% overlap in both directions. This overlap is used to prevent clipped objects at the edges of the tiles. When recombining the results to get object locations in the original images, only objects at the center of each tile are kept. We observed that this approach works remarkably well for YOLOv3 because the tiles are much larger than the objects in the images. In Fig. 8 this tiling process is explained.

5. Training and validation

In this section the methods for training and validation of the various models are discussed. Each model is trained using a training set and validated using a disjoint validation set. The split is randomly determined.

5.1. Training

For CentroidNetV2 the input data is normalized using the theoretical range of the image data: subtracting 128 and dividing each pixel by 255. Typically the data is normalized using the statistics of the dataset or the statistics of the dataset that was used to pretrain the model. However, in practice we did not observe any significant loss in performance when using fixed normalization coefficients for all datasets. Furthermore, Adam is used to train CentroidNetV2 with a learning rate of 0.001 and a momentum of 0.9. To avoid overfitting and to select the best model during training, early stopping was applied [28]. In each experiment it was observed that the trained model did not improve significantly after 500 epochs.
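A minimal training loop matching the settings above (fixed normalisation, Adam with a learning rate of 0.001, early stopping based on the validation score) is sketched below. The data loaders, the evaluate() helper and the mapping of the stated momentum of 0.9 onto Adam's first beta are assumptions of this example, and the loss reuses the vector-loss and backbone sketches from Section 3.

```python
import torch
import torch.nn.functional as F

def normalize(batch: torch.Tensor) -> torch.Tensor:
    # Fixed normalization with the theoretical image range: (x - 128) / 255.
    return (batch - 128.0) / 255.0

def train(model, train_loader, val_loader, evaluate, epochs=500):
    """Adam with lr 0.001; the stated momentum of 0.9 is mapped to beta1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    best_f1 = 0.0
    for epoch in range(epochs):                      # training saturated after ~500 epochs
        model.train()
        for images, tc, tb, tp in train_loader:      # targets produced by the encoder
            yc, yb, yl, yp = model(normalize(images))
            vec = centroidnet_vector_loss(yc, yb, tc, tb)   # Eq. (7), sketched earlier
            seg = F.cross_entropy(yl, tp.argmax(dim=1))     # logit-based stand-in for Eq. (8)
            loss = vec + seg                                # VL-CE loss of Eq. (10)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        f1 = evaluate(model, val_loader)             # validation F1 used for early stopping
        if f1 > best_f1:
            best_f1 = f1
            torch.save(model.state_dict(), "best_model.pt")
```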

MRCNN and YOLOv3 apply various methods to optimize performance (augmentation, optimizers, normalization, etc.). The maximum amount of instances that MRCNN can produce has been increased to 2048 to accommodate the many objects found in the aerial-crops dataset. Random resizing of input images has been disabled for all networks because it does not seem appropriate for counting many small objects, as it might result in the removal of object details or remove small objects altogether.

Fig. 8. Tiling process. The large rectangles (orange) represent four examples of the actual tiles used for training. The smaller dashed tiles (blue) at the center of each large tile represent the areas in which objects are kept during the recombination of the instances that have been predicted within the tiles.

5.2. Validation

Validation is done using a number of metrics for instance segmentation and counting. Most important is the F1 score, which gives the equilibrium between overestimating and underestimating instance counts. For further analysis the precision and recall are used. The true positives, false positives, false negatives and counting results give an indication of the number of objects that have been either correctly or incorrectly detected.

The validation of each method is based on the ability of the model to provide instances at the correct locations. The output-instances matrix I of a model and the target-instances matrix I' are compared. Each element of these matrices contains a value indicating the instance id that pixel belongs to. The apostrophe (') indicates that the symbol contains data from the ground truth. If a model gives a perfect output the symbols with and without an apostrophe are identical. The two sets of image coordinates representing the object instances are defined by:

$O_a = \left\{ (y, x) \in [h] \times [w] \mid I_{y,x} == a \right\}$
$O'_b = \left\{ (y', x') \in [h] \times [w] \mid I'_{y',x'} == b \right\},$

where O_a is the set of output-object coordinates for object instance a, O'_b is the set of target-object coordinates for instance b, the height and width are indicated by h and w, and the spatial coordinates are indicated by y, x, y' and x'.

Result instances are matched against target instances based on their respective overlap. The overlap between two objects is defined by the IoU, which is used to calculate a normalized output between zero and one, where one means a perfect match and zero means no match. IoU is defined by:

$\mathrm{IoU}(O, O') = \frac{|O \cap O'|}{|O \cup O'|}.$

Matching of object instances is based on a certain minimum IoU threshold. The set of output-instance ids is given by A = [m] and the set of target-instance ids is given by B = [n], where m and n are the number of output instances and target instances respectively. A match between a target-instance id b and all output-instance ids in A is given by:

$\mathrm{is\_match}(b \in B) = \left( \max_{a \in A} \mathrm{IoU}\left( O_a, O'_b \right) \right) > \tau,$

where τ is the minimum IoU threshold and is_match(·) returns true when a match is found. For counting tasks the IoU threshold can be set to a low value because the goal is to know if an object is roughly found in the correct location; therefore in our experiments we set τ = 0.1.

If there is a match between an output-instance id a and a target-instance id b, the matching ids are removed from both the set of output ids A and from the set of target ids B. The matching ids are then added to the set of matches by M = M ∪ {(a, b)}. This process of matching and removing is repeated for all target-instance ids in B. If all objects have a match, both A and B will be empty and M will contain all matching instance-id pairs, but in practice this is almost never the case. From the number of items in these sets the performance metrics are calculated:

$TP = |M| \qquad (13)$
$FP = |A| \qquad (14)$
$FN = |B| \qquad (15)$
$P = \frac{TP}{TP + FP} \qquad (16)$
$R = \frac{TP}{TP + FN} \qquad (17)$
$F1 = 2 \cdot \frac{P \cdot R}{P + R} \qquad (18)$
$Count = TP + FP, \qquad (19)$

where TP, FP, FN, P, R, F1 and Count are the true positives, false positives, false negatives, precision, recall, F1 score and object count respectively. Theoretically these metrics can be calculated per individual object class and, in that case, the metrics usually have the prefix mA, for 'mean Average', indicating the mean over classes and the average over all images. In the experiments discussed in this paper only two classes are used (background and foreground).
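The greedy matching and the metrics of Eqs. (13)–(19) can be sketched as follows, with instances represented as boolean masks; the mask representation and the function name are assumptions of this illustrative example.

```python
import numpy as np

def instance_metrics(output_masks, target_masks, tau=0.1):
    """Greedy IoU matching at threshold tau and the counting metrics of
    Eqs. (13)-(19); output_masks/target_masks are lists of boolean arrays."""
    unmatched = list(range(len(target_masks)))        # remaining target ids (set B)
    matches = 0                                       # |M|
    for out in output_masks:                          # output ids (set A)
        best_iou, best_j = 0.0, None
        for j in unmatched:
            inter = np.logical_and(out, target_masks[j]).sum()
            union = np.logical_or(out, target_masks[j]).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > tau:                            # is_match: remove both ids
            matches += 1
            unmatched.remove(best_j)
    tp, fp, fn = matches, len(output_masks) - matches, len(unmatched)  # Eqs. (13)-(15)
    p = tp / (tp + fp) if tp + fp else 0.0                             # Eq. (16)
    r = tp / (tp + fn) if tp + fn else 0.0                             # Eq. (17)
    f1 = 2 * p * r / (p + r) if p + r else 0.0                         # Eq. (18)
    return {"TP": tp, "FP": fp, "FN": fn, "P": p, "R": r, "F1": f1, "Count": tp + fp}
```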

6. Experiments and results

In this section the results of the experiments are discussed. Each sub-section shows the performance of the various models, loss functions and backbones on each of the three datasets. The final part of this section discusses common failure cases of all approaches and also gives an analysis of the difference in performance between CentroidNetV1 and CentroidNetV2.

In summary, CentroidNetV2 achieves the best experimental performance based on F1 score on the aerial-crops dataset (94.7%) and on the bacterial-colonies dataset (92.6%). MRCNN achieves the best F1 score on the cell-nuclei dataset (92.3%). In general, better or on-par results for the various metrics are obtained by our proposed algorithm. The remainder of this section gives a more thorough analysis of the experimental results using the proposed metrics.

Each table with results has the same basic structure. The model name, backbone name and loss function used are shown in the first three columns of the tables. The metrics given by Eqs. (13)–(19) are reported in the remaining columns. The cursive text in the rows of each table indicates the category of the experiment and is used to group experiments in a logical manner.

Because the highest F1 score represents the best equilibrium between overestimating and underestimating the number of objects, the network threshold hyperparameter that determines the trade-off between precision and recall is optimized on the training set by an exhaustive search. For CentroidNetV2 the integer voting threshold θ discussed in Eq. (12) is optimized; for MRCNN and YOLOv3 the confidence threshold is optimized. After the thresholds have been optimized the metrics are calculated on the validation set and reported in the respective tables.

The naming of the loss functions in this section follows the naming scheme introduced in Section 3. MSE loss is the standard loss defined by Eq. (5). The Vector Loss (VL) is computed by the Euclidean distance between the target-voting vectors and the output-voting vectors and is defined by Eq. (7). The Cross Entropy (CE) loss and IoU loss, defined in Eqs. (8) and (9), are calculated using the output logits and the target logits. Finally, the combined losses used for the analysis in this section are MSE, VL-CE and VL-IoU, defined in Eqs. (5), (10) and (11) respectively.

An open-source reference implementation of OpenCentroidNet written in Python, using PyTorch 1.0 [42], is published with this paper. A fully annotated dataset containing images of potato crops and a dataset containing annotated Legionella bacterial colonies are also published together with this paper.

6.1. Results on aerial crops

The results of the performance on the aerial-crops dataset for the different models are shown in Table 1.

The first part of Table 1 (Comparing to the state-of-the-art) shows the comparison between CentroidNetV2 and the other models. The overall best F1 score is achieved by CentroidNetV2 (94.7%). YOLOv3 achieves an F1 score of 94.3%. This shows that the tiling scheme used for YOLOv3 is quite optimal. MRCNN achieves an F1 score of 92.4%. Further analysis shows that MRCNN fails to detect small crops. This automatically results in the highest precision for MRCNN (97.7%), caused by the low amount of false positives (34 crops). When using MSE loss and a U-net backbone, a configuration similar to the original CentroidNet, a lower F1 score of 93.5% is achieved.

The visual differences between the individual models are shown in Fig. 9. CentroidNetV2 shows the most correctly detected crops in Fig. 9a. YOLOv3 seems to not find the right balance between the false positives and false negatives, indicated by the false-positive crop found in the left bottom of Fig. 9b and the two missed small crops. Fig. 9c shows that MRCNN failed to detect two small potato-plant crops and also misses a crop closely connected to a bigger crop (shown in the left bottom of Fig. 9c).

The second part of Table 1 (Comparing loss functions) shows that the MSE loss achieves the lowest F1 score (93.5%) compared to the other loss functions using the same backbone.

The third part of Table 1 (Comparing backbones) shows the performance of the alternative backbones for CentroidNetV2. The extra 51 layers of the ResNet101 backbone only achieve a 0.1% higher F1 score compared to the ResNet50 backbone for CentroidNetV2. The Xception backbone achieves a 4.4% lower F1 score. Also the U-net backbone shows a lower F1 score (0.7% lower). From this it can be concluded that the overall best backbone for CentroidNetV2 on the aerial-crops dataset is DeepLabV3+_ResNet101.

6.2. Results on cell nuclei

The results of the performance on the cell-nuclei dataset for the different models are shown in Table 2. The first part of the table (Usage of pretraining with ResNet101 backbone) shows the performance when using the ResNet101 backbone with and without pretraining (indicated by the PT column). Also an experiment with the alternative VL-IoU loss function has been included here. The MRCNN model with a ResNet101 backbone pretrained on ImageNet achieves the highest F1 score (92.3%). The runner-up is a pretrained CentroidNetV2 with a DeepLabV3+_ResNet101 backbone (91.9% F1 score). Furthermore, CentroidNetV2 shows the highest recall (89.9%), which indicates that CentroidNetV2 tends to detect more objects, and it achieves the lowest amount of false negatives (583 nuclei) at its highest F1 score.

In Fig. 10 an example of the instances produced by MRCNN and CentroidNetV2 is shown on a challenging image. It can be seen that MRCNN gives more accurate instance-segmentation masks, which explains the higher F1 score. The higher recall of CentroidNetV2 is explained by the fact that more small and low-contrast cell nuclei are predicted.

From the literature it is well known that pretraining improves the performance of models [43] and this is confirmed by the measured increase in F1 score for MRCNN. An interesting observation is that this also holds for CentroidNetV2, which achieves a 1.3% higher F1 score when using pretrained weights. This confirms that the regression of centroid- and border-voting vectors also benefits from a ResNet101 backbone pretrained on ImageNet and that pretrained convolutional filter weights are quite general in that they can be repurposed for predicting voting vectors. The only case where the pretrained backbone has a lower F1 score compared to the non-pretrained model is when a ResNet50 backbone is used with MRCNN. However, the pretrained version still achieves the highest precision (96.3%) at its highest F1 score. Interestingly, the use of the VL-IoU loss with pretraining achieves the lowest F1 score (90.3%).

The third part of Table 2 (Comparison to U-net backbone) shows the performance of CentroidNetV2 using the original U-net backbone on the cell-nuclei dataset. That configuration is similar to the original version of CentroidNet, which used MSE loss and a U-net backbone, and has among the lowest F1 scores (90.6%). Using the VL-CE loss function in conjunction with the U-net backbone yields better results (91.1%). But still the conclusion holds that the best CentroidNetV2 configuration uses a ResNet101 backbone and the VL-CE loss function. CentroidNetV2 seems to have no obvious advantage when using the U-net backbone because the precision for CentroidNetV2 (93.3%) is lower compared to the original CentroidNet (94.3%). This means that the improvements of both the loss function and the backbone together yield a higher performance on all metrics (91.7% F1 score, 94.3% precision and 89.3% recall).

6.3. Results on bacterial colonies

On the bacterial-colonies dataset CentroidNetV2 achieves the overall highest F1 score of 92.6%, shown in Table 3. YOLOv3 is the runner-up with an F1 score of 92.3%. The U-net backbone of CentroidNetV2 struggles to get good results and achieves only an F1 score of 87.1%. This confirms the added value of the ResNet101 backbone on this dataset. Also in this case CentroidNetV2 achieves the highest recall (91.0%). MRCNN seems to miss objects and achieves the highest precision of 95.4% at the cost of a lower recall (89.1%).

In Table 3 it is shown that the number of predicted objects in the image (indicated by the 'Count' column) is not representative of the actual number of correctly detected colonies. It seems that YOLOv3 only counts one colony less than CentroidNetV2 (885 and 886). However, when looking at the difference in the number of true positives (indicating colonies found at the right location) it can be seen that YOLOv3 actually misses three colonies (832 and 835). The two extra colonies in the 'Count' column are

Table 1
Results for counting crops with 1660 annotated validation samples. Performance of several configurations of CentroidNetV2 and comparison to YOLOv3 and MRCNN (in
percentages). The best precision, recall and F1 score are boldface.

Model Backbone Loss F1 P R TP FP FN Count


Comparing to the state-of-the-art
CentroidNetV2 DLV3-RN101 VL-CE 94.7 94.4 95.1 1578 94 82 1672
CentroidNet U-net MSE 93.5 92.2 94.8 1573 133 87 1706
YOLOv3 Default Default 94.3 93.7 94.9 1575 106 85 1681
MRCNN RN101 Default 92.4 97.7 87.7 1456 34 204 1490
Comparing loss functions
CentroidNetV2 DLV3-RN101 MSE 93.5 92.5 94.6 1570 127 90 1697
CentroidNetV2 DLV3-RN101 VL-IoU 94.3 93.9 94.6 1571 102 89 1673
Comparing backbones
CentroidNetV2 U-net VL-CE 94.0 92.3 95.7 1588 132 72 1720
CentroidNetV2 DLV3-XC VL-CE 90.3 86.6 94.3 1566 242 94 1808
CentroidNetV2 DLV3-RN50 VL-CE 94.6 94.7 94.5 1569 87 91 1656
MRCNN RN50 Default 93.4 97.3 89.8 1491 41 169 1532

Fig. 9. Red circles show the prediction of the three models and the annotations are shown in green. CentroidNetV2 detected most crops (one false negative), MRCNN has three
false negatives and YOLOv3 produced a false positive and two false negatives.

Table 2
Results for counting nuclei with 5755 annotated validation samples. Performance of several configurations of CentroidNetV2 and MRCNN (in percentages). PT indicates if a model
is pretrained. The best precision, recall and F1 score are boldface.

Model Backbone Loss PT F1 P R TP FP FN Count


Usage of pretraining with ResNet101 backbone
CentroidNetV2 DLV3-RN101 VL-CE Yes 91.9 94.1 89.9 5172 323 583 5495
CentroidNetV2 DLV3-RN101 VL-CE No 90.6 93.8 87.7 5048 335 707 5383
CentroidNetV2 DLV3-RN101 VL-IoU Yes 90.3 94.0 86.8 4993 314 762 5307
MRCNN RN101 Default Yes 92.3 96.1 88.9 5116 210 639 5326
MRCNN RN101 Default No 91.5 95.3 87.9 5061 248 694 5309
Usage of pretraining with ResNet50 backbone
CentroidNetV2 DLV3-RN50 VL-CE Yes 91.7 94.3 89.3 5138 309 617 5447
CentroidNetV2 DLV3-RN50 VL-CE No 91.4 94.1 88.8 5112 318 643 5430
MRCNN RN50 Default Yes 91.0 96.3 86.3 4966 193 789 5159
MRCNN RN50 Default No 91.5 95.1 88.1 5072 260 683 5332
Comparison to U-net backbone
CentroidNetV2 U-net VL-CE No 91.1 93.3 88.9 5116 365 639 5481
CentroidNet U-net MSE No 90.6 94.3 87.2 5021 304 734 5325

6.3. Results on bacterial colonies

In Table 3 it is shown that the number of predicted objects in the image (indicated by the ‘Count’ column) is not representative for the actual number of correctly detected colonies. It seems that YOLOv3 only counts one less colony compared to CentroidNetV2 (885 and 886). However, when looking at the difference in the number of true positives (indicating colonies found at the right location) it can be seen that YOLOv3 actually misses three colonies (832 and 835). The two extra colonies in the ‘Count’ column are caused by the two extra false positives found elsewhere in the image. This is why we argue that for counting tasks the validation should be based on the F1 score rather than the raw object-detection count, because the F1 score takes the location of the objects into account.


Fig. 10. Instance segmentation results on an image of the cell-nuclei dataset. The input image and ground truth are shown on the left and the predicted output of the models
is shown on the right. MRCNN predicts more accurate segments. CentroidNetV2 detects small and low contrast objects that MRCNN fails to detect.

Table 3
Results for counting bacterial colonies with 918 annotated validation samples. Performance of CentroidNetV2 compared to YOLOv3 and MRCNN (in percentages). The best
precision, recall and F1 score are boldface.

Model Backbone Loss F1 (%) P (%) R (%) TP FP FN Count


CentroidNetV2 DLV3-RN101 VL-CE 92.6 94.2 91.0 835 51 83 886
YOLOv3 Default Default 92.3 94.0 90.6 832 53 86 885
MRCNN RN101 Default 92.2 95.4 89.1 818 39 100 857
CentroidNetV2 U-net VL-CE 87.1 90.5 84.0 771 81 147 852

The visual differences in performance between the models are shown in Fig. 11. The thick red circles indicate the predictions and the thin green circles indicate the annotations. In the top row a cropped part of an image with bacterial colonies is shown. Each model correctly ignores the yellow colony which is not Legionella. In Fig. 11b YOLOv3 incorrectly detects the large colony that has not been annotated as Legionella suspected. MRCNN fails to detect the small colony near the right bottom of Fig. 11c. The bottom row of Fig. 11 gives another interesting insight into the differences between the models. The large black-ish structure at the left of each image is an air bubble adjacent to a colony. Air bubble formation is a common problem for certain types of culturing media. However, this exact visual appearance is rare in the training set. In Fig. 11e it is shown that YOLOv3 fails to detect the colony, probably because it has not seen something similar before. Both CentroidNetV2 and MRCNN detect this colony correctly. For CentroidNetV2 this is probably because the partial bacterial colony still produces part of the votes (similar to when two colonies are overlapping).

Fig. 11. Object detection results on an image of the bacteria-colonies dataset. The thick red circles indicate the predicted colonies and thin green circles represent the annotations. In this example CentroidNetV2 detects all colonies correctly, MRCNN fails to detect a small colony and YOLOv3 produces a false positive in the top image and a false negative in the bottom image.

In this section our focus has been to compare the F1 score, Precision, Recall and Count metrics of the various approaches and we therefore did not include inference-time metrics. For an analysis of the inference time of MRCNN and YOLOv3 we would like to refer the reader to [44]. In that paper the authors provide an extensive comparison between the approaches (including the ResNet backbones that have been used by CentroidNetV2). The authors report an inference time of 27 ms, 100 ms and 130 ms for YOLOv3, MRCNN RN50 and MRCNN RN101 on the PASCAL VOC dataset. Because of the additional decoding algorithm on top of the backbone the run-time performance of CentroidNetV2 will most likely be worse compared to the other methods. We did not focus on optimizing the run-time efficiency of the decoding algorithm. The current version that is implemented in Python is not representative for the potential inference time (the decoding process currently takes multiple seconds).

6.4. Common failure cases

In previous subsections the quantitative performance differences between the models have been discussed. This subsection will provide a more elaborate qualitative analysis of the common failure cases of the three approaches, MRCNN, YOLOv3 and CentroidNetV2. The failure cases are divided into three categories: detection of small objects, detection of low-contrast objects and detection of connected objects. By analyzing details of the results on individual images, interesting insights can be gained into the properties of the algorithms, details that are not always apparent from the reported quantitative metrics in the previous sections.

In Fig. 12, detailed parts of images are shown where the first and the third row contain images from the bacterial-colony dataset and the images in the middle row are from the aerial-crops dataset. The red circles denote detections and the green circles show the ground-truth. In Figs. 12c, f and i it can be seen that MRCNN fails to detect the smallest objects. Furthermore, Figs. 12b and e show that YOLOv3 detects all objects but misses one colony in Fig. 12h. CentroidNetV2 detects all objects in these images but the position is slightly misaligned with the ground-truth.

In Fig. 13 the first two rows contain parts from the cell-nuclei and the bacterial-colonies datasets. In those images almost no objects are visible due to the very low contrast in parts of the original images. These images have deliberately not been enhanced to show the real contrast. The red circles, which indicate detections, show that all approaches have difficulty detecting all objects, but in Fig. 13a and c CentroidNetV2 is able to detect more of the objects. Fig. 13g shows that MRCNN did not detect the faint purple nucleus and a small nucleus at the bottom edge; however, these are detected by CentroidNetV2. But because these specific cases are relatively rare in the dataset their effect on the F1 score is minimal. As explained earlier, for the cell-nuclei dataset YOLOv3 was not a suitable approach.

In Fig. 14, parts of some challenging images from the cell-nuclei dataset are shown that contain densely connected objects. When comparing Fig. 14a and b it can be seen that CentroidNetV2 is able to distinguish more of the individual objects where MRCNN wrongly detects the cluster of multiple objects as one (indicated by the largest red circle in Fig. 14b). Furthermore, CentroidNetV2 detects the closely connected object in Fig. 14c, but has difficulty determining the correct size and shape.

Fig. 13. Detection of low-contrast objects. Images (c), (d) and (e) contain bacterial colonies, the other images contain cell nuclei. The red circles show detections and the green circles represent the ground truth. Images (a) through (e) seem to contain no image information; however, this is the true contrast in the image. CentroidNetV2 is able to detect more low-contrast objects.

Fig. 12. Detection of small objects. The red circles show detections and the green circles represent the ground truth. This shows that MRCNN detects fewer of the small bacterial colonies and potato-plant crops.

6.5. Comparison of CentroidNetV1 and CentroidNetV2

In this final subsection we reflect on the differences in performance between the original CentroidNet and CentroidNetV2. The original CentroidNet is designed as an object-localization algorithm that only detects centroids of objects. CentroidNetV2 is an object-detection or instance-segmentation approach that is designed to also detect borders of objects. Therefore, it is difficult to make a direct comparison. However, both approaches have similarities that can be used to compare them. Both approaches utilize a segmentation backbone and an accompanying loss function. The original CentroidNet utilizes a U-net backbone and an MSE loss function. By choosing a comparable configuration for CentroidNetV2 both approaches can be compared.

In Tables 1 and 2 the model denoted CentroidNet with a U-net backbone and an MSE loss function is as close as possible to the original CentroidNet model while still being comparable to CentroidNetV2. Therefore, that model will be referred to as CentroidNetV1. The results on the potato-crops dataset in Table 1 show that CentroidNetV1 achieves an F1 score of 93.5% and that CentroidNetV2 achieves a better F1 score of 94.7%. In Table 2 a similar observation is made: CentroidNetV1 achieves an F1 score of 90.6% and CentroidNetV2 shows a better F1 score of 91.9%.

Fig. 14. Detection of connected objects. The red circles show detections and the green circles represent the ground truth. Images (a) through (d) show that CentroidNetV2 detects more of the densely connected nuclei as individuals. In image (f) YOLOv3 is the only approach that detects all bacterial colonies.

Some voting images of the CentroidNets are shown in Fig. 15. Overall, the votes appear brighter for CentroidNetV2, which indicates that more votes appear on the same locations which, in turn, results in more robust detections. Furthermore, the two voting maxima in the top image of Fig. 15c are farther apart. Generally it is better for a counting model to detect an actual object at a slightly wrong location than to not detect it at all.

Fig. 15. Voting matrices for CentroidNetV1 and CentroidNetV2. In this example the ground-truth centroids are detected with both approaches. The improvements made to CentroidNetV2 are shown to produce sharper votes.
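The voting matrices of Fig. 15 can be read as accumulator arrays: every pixel adds one vote at the location its predicted centroid vector points to, so brighter maxima correspond to more pixels agreeing on the same centroid. The sketch below illustrates such an accumulation; the array names and the rounding of the vectors are illustrative assumptions and this is not the decoding algorithm of CentroidNetV2 itself.

import numpy as np

def accumulate_votes(votes_y, votes_x):
    # votes_y, votes_x: per-pixel relative vectors (H x W) pointing towards a centroid.
    h, w = votes_y.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Absolute target coordinates of every vote, clipped to the image border.
    ty = np.clip(ys + np.rint(votes_y).astype(int), 0, h - 1)
    tx = np.clip(xs + np.rint(votes_x).astype(int), 0, w - 1)
    voting = np.zeros((h, w), dtype=np.int32)
    # Every pixel increments the accumulator cell it votes for.
    np.add.at(voting, (ty.ravel(), tx.ravel()), 1)
    return voting  # local maxima of this map indicate candidate centroids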


7. Discussion and conclusion

Experiments have been performed on three datasets with three different models. The datasets and models can be divided in two categories: object detection and instance segmentation. The models for instance segmentation, CentroidNetV2 and MRCNN, have been tested on all datasets. The object-detection model YOLOv3 has only been tested on the object-detection datasets: aerial-crops and bacterial-colonies. This is because an instance-segmentation model can be used for object detection but not vice versa. The F1 score has been the main metric by which to evaluate performance, because it indicates the best trade-off between overestimation and underestimation of the number of counted object instances. Precision and recall have been calculated at the point of the highest F1 score, determined by an exhaustive search on the training set. All reported metrics are calculated using a disjoint validation set.

CentroidNetV2 shows the best F1 score for the aerial-crops dataset (94.7%) and the bacterial-colonies dataset (92.6%). The best F1 score on the cell-nuclei dataset is achieved by MRCNN (92.3%). For all datasets CentroidNetV2 consistently shows the highest recall: 95.7%, 89.9% and 91.0% on the aerial-crops, cell-nuclei and bacterial-colonies datasets respectively. MRCNN shows the highest precision: 97.7%, 96.3% and 95.4% on the aerial-crops, cell-nuclei and bacterial-colonies datasets respectively. MRCNN has the tendency to miss small objects, which results in a high precision at the cost of recall. YOLOv3 generally achieves a high precision, recall and F1 score but is always outperformed by either CentroidNetV2 or MRCNN.
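The operating point mentioned above can be found with a simple sweep over candidate thresholds on the training set, keeping the threshold that maximizes the F1 score. The sketch below assumes a hypothetical evaluate(threshold) helper that returns (TP, FP, FN) for one threshold; it only illustrates the exhaustive search and is not the exact implementation used for the experiments.

def f1_score(tp, fp, fn):
    # Equivalent to 2PR / (P + R), written directly in terms of the counts.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(evaluate, thresholds):
    # evaluate(t) is a hypothetical helper returning (tp, fp, fn) on the training set.
    scores = {t: f1_score(*evaluate(t)) for t in thresholds}
    return max(scores, key=scores.get)

# Example: sweep 99 evenly spaced thresholds between 0 and 1.
# best = best_threshold(evaluate, [i / 100 for i in range(1, 100)])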
The measured differences among the best-performing models are mostly small, but these differences are consistent over the various datasets. Each model has its own unique properties and the choice ultimately depends on the application. If accurate counting of objects is needed for a large number of small and connected objects, CentroidNetV2 is preferable. When accurate masks of objects should be determined with high recall then MRCNN is preferable. YOLOv3 does a good job at detecting small objects but it is only able to detect bounding boxes, whereas CentroidNetV2 produces a complete circumference of objects.

For CentroidNetV2 and MRCNN, images of various sizes are handled in a similar fashion and this has been made completely transparent by using random image crops during training. However, CentroidNetV2 truly does not take image dimensions into account because all voting vectors are relative. The original YOLOv3 implementation is defined for fixed-size images and therefore requires tiling of the images prior to training and recombination of tiles after inference to avoid scaling. The overlapped tiling method did not seem to adversely affect the performance of YOLOv3.
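A minimal sketch of such an overlapped tiling step is given below; the tile size, the overlap and the way duplicate detections in the overlap regions are merged are illustrative assumptions and not the exact procedure used in the experiments. The detect_fn argument stands for a hypothetical fixed-size detector.

def tile_origins(length, tile, overlap):
    # Start coordinates of tiles along one axis, shifted by (tile - overlap) pixels.
    step = tile - overlap
    origins = list(range(0, max(length - tile, 0) + 1, step))
    if origins[-1] + tile < length:          # make sure the image border is covered
        origins.append(length - tile)
    return origins

def detect_tiled(image, detect_fn, tile=416, overlap=64):
    h, w = image.shape[:2]
    detections = []
    for oy in tile_origins(h, tile, overlap):
        for ox in tile_origins(w, tile, overlap):
            for x, y, score in detect_fn(image[oy:oy + tile, ox:ox + tile]):
                detections.append((x + ox, y + oy, score))   # back to image coordinates
    # Detections inside overlap regions can appear twice and still need to be merged,
    # for example by keeping only the highest-scoring detection within a small radius.
    return detections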

MRCNN needs to be trained in two stages while CentroidNetV2 and YOLOv3 can be trained in only one stage. YOLOv3 has the benefit of being fully end-to-end trainable, but the decoding of voting vectors and the choice of geometric output shape gives the ability to configure CentroidNetV2 for a specific application. In this hybrid approach, where deep learning is integrated with traditional computer vision, the black-box nature of CNNs is mitigated and, at the same time, performance is improved on certain tasks like counting many small and connected objects.

The remainder of this section will reflect specifically on the research questions.

1. What is the performance of CentroidNetV2 for detecting and counting many small objects?
CentroidNetV2 is considered to be the preferable approach for counting many small objects because the results show that it either achieves the highest F1 score or achieves the best recall and tends to detect more small objects.

2. How does the performance of CentroidNetV2 compare to well-known state-of-the-art neural networks for object detection and instance segmentation?
On two datasets CentroidNetV2 outperforms the well-known state-of-the-art networks on object detection and instance segmentation. Only on the cell-nuclei dataset does MRCNN produce a higher F1 score.

3. What backbone and loss function is best suitable for CentroidNetV2?
The loss function combining vector loss and cross-entropy loss gives sharper voting peaks and consistently achieves the best F1 score compared to the original MSE loss function (a sketch of such a combined loss is shown after this list). The DeepLabV3+_ResNet101 backbone generally obtains the best performance.

4. What is the effect of transfer learning on the performance of CentroidNetV2?
The results show that the vector-voting method of CentroidNetV2 also benefits from a pretrained backbone of the model. This means that pretrained feature maps of the CNNs are general enough to have a beneficial impact on the F1 score.
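A minimal PyTorch sketch of a loss in this spirit is given below: cross-entropy on the per-pixel class logits combined with a Euclidean-distance term on the voting vectors. The channel layout, the weighting factor alpha and the tensor names are illustrative assumptions and do not reproduce the exact loss formulation of CentroidNetV2.

import torch
import torch.nn.functional as F

def combined_loss(pred_vectors, pred_logits, target_vectors, target_classes, alpha=1.0):
    # pred_vectors / target_vectors: (N, 4, H, W) centroid- and border-vote vectors.
    # pred_logits: (N, C, H, W) class scores; target_classes: (N, H, W) integer labels.
    vector_loss = torch.mean(
        torch.sqrt(torch.sum((pred_vectors - target_vectors) ** 2, dim=1) + 1e-8))
    class_loss = F.cross_entropy(pred_logits, target_classes)
    return class_loss + alpha * vector_loss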
7.1. Future work

CentroidNetV2 was compared to the popular and general architectures MRCNN and YOLOv3. Newer and more advanced CNN architectures are introduced regularly. In the future CentroidNetV2 can be compared to recent advances in object detection and segmentation.

The run-time performance of the decoding algorithm of CentroidNetV2 can probably be further optimized by making use of the GPU or by implementing the algorithm in a language that allows for lower-level access to the CPU (for example C++).

Many applications exist for counting that are closely related to the research discussed in this paper. Many different types of vegetation exist that need to be counted. This does not necessarily have to be crops, but can also be trees or other types of large vegetation. Also in the field of microbiology, many applications for colony counting exist. CentroidNetV2 can be tested on other types of bacterial colonies and research into colony counting can be extended to other microbiological fields like medical pathology. Other fields unrelated to counting and more related to object detection and instance segmentation can be investigated, for example segmentation of everyday objects like persons, cars, etc. CentroidNetV2 might be able to detect smaller everyday objects.

This paper has shown that the results of CentroidNetV2 improved by changing the backbone and the loss function. In the future new segmentation backbones can be integrated with CentroidNetV2. Further investigation of other loss functions might also improve the results.

In this research only classification between background and foreground has been investigated. Future work might focus on counting objects of multiple classes separately. Furthermore, multichannel images can serve as an input to CentroidNetV2. Therefore future work might include using hyperspectral imaging to count objects. For this, additional image channels like fluorescent images can be recorded. Even data outside of the visible spectrum can be used, like thermal or short-wave infrared images. A minimal sketch of such an input-layer adaptation is shown at the end of this subsection.

Architectural changes could be made to CentroidNetV2 to reduce the number of hyperparameters and this should make the decoding process more straightforward. Such new architectures could be compared to other novel architectures of promising object-detection and instance-segmentation networks.
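As an illustration of the multichannel input mentioned above, the sketch below widens the first convolution of a ResNet-based backbone so that it accepts more than three input channels; the attribute name of the stem convolution and the weight-initialization strategy for the extra channels are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision

def widen_input(model, in_channels):
    old = model.backbone.conv1           # 3-channel stem convolution of the ResNet encoder
    new = nn.Conv2d(in_channels, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding, bias=False)
    with torch.no_grad():
        # Copy the existing RGB filters into the first three channels and
        # initialize the remaining channels with their mean.
        new.weight[:, :3] = old.weight
        new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)
    model.backbone.conv1 = new
    return model

model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=False)
model = widen_input(model, in_channels=8)   # e.g. an 8-band multispectral recording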

CRediT authorship contribution statement

Klaas Dijkstra: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Jaap de Loosdrecht: Conceptualization, Writing - review & editing, Supervision. Waatze A. Atsma: Writing - review & editing, Resources, Data curation. Lambert R.B. Schomaker: Conceptualization, Writing - review & editing, Supervision. Marco A. Wiering: Conceptualization, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References

[1] J. Paul Cohen, G. Boucher, C.A. Glastonbury, H.Z. Lo, Y. Bengio, Count-ception: counting by fully convolutional redundant counting, International Conference on Computer Vision (2017) 18–26.
[2] M. Baygin, M. Karakose, A. Sarimaden, E. Akin, An image processing based object counting approach for machine vision application, in: Conference on Advances and Innovations in Engineering, 2018, pp. 966–970.
[3] A. Ferrari, S. Lombardi, A. Signoroni, Bacterial colony counting with Convolutional Neural Networks, Conference of the IEEE Engineering in Medicine and Biology Society (2015) 7458–7461.
[4] K. Dijkstra, J. van de Loosdrecht, L.R. Schomaker, M.A. Wiering, Centroidnet: a deep neural network for joint object localization and counting, in: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2018, pp. 585–601.
[5] A. Croxatto, K. Dijkstra, G. Prod’hom, G. Greub, Comparison of inoculation with the InoqulA and WASP automated systems with manual inoculation, J. Clin. Microbiol. 53 (7) (2015) 2298–2307.
[6] A.U.M. Khan, A. Torelli, I. Wolf, N. Gretz, AutoCellSeg: Robust automatic colony forming unit (CFU)/cell analysis using adaptive image segmentation and easy-to-use post-editing techniques, Nat. Sci. Rep. 8 (1) (2018) 2045–2322.
[7] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1097–1105.
[8] O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation, Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241.
[9] A.R. Pathak, M. Pandey, S. Rautaray, Application of deep learning for object detection, Procedia Comput. Sci. 132 (2018) 1706–1717.
[10] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, Conference on Computer Vision and Pattern Recognition (2017) 2980–2988.
[11] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, Conference on Computer Vision and Pattern Recognition (2019) 4401–4410.
[12] K. Dijkstra, J. van de Loosdrecht, L.R. Schomaker, M.A. Wiering, Hyperspectral demosaicking and crosstalk correction using deep learning, Mach. Vision Appl. 30 (1) (2018) 1–21.
[13] P. Ren, W. Fang, S. Djahel, A novel yolo-based real-time people counting approach, in: 2017 International Smart Cities Conference (ISC2), IEEE, 2017, pp. 1–2.
[14] A. Özlü, TensorFlow Object Counting API (2018), https://github.com/ahmetozlu/tensorflow_object_counting_api.
[15] B. Chen, X. Miao, Distribution line pole detection and counting based on yolo using uav inspection line video, J. Electr. Eng. Technol. (2019) 1–8.
[16] W. Xie, J.A. Noble, A. Zisserman, Microscopy cell counting and detection with fully convolutional regression networks, Comput. Methods Biomech. Biomed. Eng. Imag. Visualiz. 6 (3) (2018) 283–292.
[17] T. Stahl, S.L. Pintea, J.C. Van Gemert, Divide and count: generic object counting by image divisions, IEEE Trans. Image Process. 28 (2019) 1035–1044.
[18] J. Wan, W. Luo, B. Wu, A.B. Chan, W. Liu, Residual regression with semantic prior for crowd counting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4036–4045.
[19] F. Dai, H. Liu, Y. Ma, J. Cao, Q. Zhao, Y. Zhang, Dense scale network for crowd counting, arXiv preprint arXiv:1906.09707.
[20] Y. Li, X. Zhang, D. Chen, CSRNet: dilated convolutional neural networks for understanding the highly congested scenes, Conference on Computer Vision and Pattern Recognition (2018) 1092–1100.
[21] J. Gao, Q. Wang, X. Li, PCC net: perspective crowd counting via spatial convolutional network, IEEE Trans. Circ. Syst. Video Technol.
[22] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE Trans. Pattern Anal. Mach. Intell. 42 (1) (2020) 46–58.
[23] M.R. Hsieh, Y.L. Lin, W.H. Hsu, Drone-based object counting by spatially regularized regional proposal network, Conference on Computer Vision and Pattern Recognition (2017) 4165–4173.
[24] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Int. J. Comput. Vision 1 (4) (1988) 321–331.
[25] D.H. Ballard, Generalizing the hough transform to detect arbitrary shapes, Pattern Recogn. 13 (2) (1981) 111–122.
[26] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[27] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks 61 (2015) 85–117.
[28] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[29] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, European Conference on Computer Vision (2018) 801–818.
[30] J. Redmon, A. Farhadi, Yolov3: an incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[32] M. Ren, R.S. Zemel, End-to-end instance segmentation with recurrent attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6656–6664.
[33] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, S. Yan, Proposal-free network for instance-level object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (12) (2017) 2978–2991.
[34] H. Chen, X. Qi, L. Yu, P.-A. Heng, Dcan: deep contour-aware networks for accurate gland segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2487–2496.
[35] C. Couprie, C. Farabet, L. Najman, Y. Lecun, Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos, J. Mach. Learn. Res. 15 (1) (2014) 3489–3511.
[36] M. Bai, R. Urtasun, Deep watershed transform for instance segmentation, Conference on Computer Vision and Pattern Recognition (2017) 2858–2866.
[37] S. Jetley, M. Sapienza, S. Golodetz, P.H. Torr, Straight to shapes: real-time detection of encoded shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6550–6559.
[38] U. Schmidt, M. Weigert, C. Broaddus, G. Myers, Cell detection with star-convex polygons, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 265–273.
[39] Z. Wu, C. Shen, A. v. d. Hengel, Bridging category-level and instance-level semantic image segmentation, arXiv preprint arXiv:1605.06885.
[40] M.A. Rahman, Y. Wang, Optimizing intersection-over-union in deep neural networks for image segmentation, in: International Symposium on Visual Computing, 2016, pp. 234–244.
[41] F. van Beers, A. Lindstrom, E. Okafor, M.A. Wiering, Deep neural networks with intersection over union loss for binary image segmentation, Conference on Pattern Recognition Applications and Methods (2019) 438–445.
[42] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, Z. Devito, Automatic differentiation in PyTorch, Neural Inf. Process. Syst.
[43] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, S. Bengio, Why does unsupervised pre-training help deep learning?, J. Mach. Learn. Res. 11 (2010) 625–660.
[44] N.-D. Nguyen, T. Do, T.D. Ngo, D.-D. Le, An evaluation of deep learning methods for small object detection, J. Electr. Comput. Eng. (2020).


Dr. Klaas Dijkstra is an associate professor in computer vision and data science at NHL Stenden University of Applied Sciences. His main research interests are in computer vision, machine learning and hyperspectral imaging. After completing his B.Eng. degree in technical information science in 2005, he has been active in the field of computer vision by doing applied research projects in several domains. In 2013 he obtained his M.Sc. degree from the Limerick Institute of Technology in Ireland, on the application of evolutionary algorithms and computer vision to the domain of microbiological analysis. He obtained his Ph.D. degree in 2020 from the University of Groningen on the topic of deep learning and hyperspectral imaging for unmanned aerial vehicles.

Jaap van de Loosdrecht is a professor in computer vision and data science at the NHL Stenden University of Applied Sciences. His main research interests are in computer vision, deep learning and hyperspectral imaging. In 1996 he founded the professorship Computer Vision & Data Science. His staff, researchers, teachers and students have carried out more than 350 research projects for the business community in the field of Computer Vision & Data Science, including the Raak-Award 2016 project ‘Smart Vision for UAV’s’. He is Comenius Senior Fellow at KNAW (Royal Dutch Academy of Sciences).

Waatze A. Atsma received his engineering degree in Biotechnology at the Noordelijke Hogeschool in Leeuwarden in 2002 and has been working at the drinking water laboratory Vitens N.V. as a principal analyst and project leader since 2003. His specialty is mainly in the field of drinking water diagnostics, in particular the development and implementation of new (molecular based) microbiological methods within the Dutch drinking water laboratories. In addition, Atsma is a member and chairman of various national working groups in the field of implementation of rapid microbiological methods in the drinking water laboratories and is closely involved in drawing up guidelines for drinking water-related microbiological methods, including for Legionella diagnostics. On behalf of the Netherlands, Atsma is one of the delegation members for the ISO/TC147/SC4 microbiological parameters with the aim of drawing up or changing international standards for water microbiology tests.

Prof. dr. Lambert Schomaker is professor in artificial intelligence at the University of Groningen since 2001. He is known for research in simulation and recognition of handwriting, writer identification, style-based document dating and other studies in pattern recognition, machine learning and robotics. He has (co)authored over 200 publications and was involved in the organization of many conferences in handwriting recognition and document analysis. In recent years he and his team have worked on the Monk system: an interactively trainable search engine and e-Science service for historical manuscripts. The availability of up to thousands of training images for single classes of complex patterns has brought pattern recognition and machine learning into the ballpark of big data. Other recent work is in the area of robotics and industrial maintenance, in the EU ECSEL project Mantis. In 2015, he became co-chair of the Data Science and Systems Complexity center at the Faculty of Science and Engineering at the University of Groningen. In 2017, he joined the CogniGron center for cognitive systems and materials in a large-scale seven-year project in neuromorphic computing. He is a member of the IAPR and senior member of IEEE.

Dr. Marco Wiering is an assistant professor in the department of artificial intelligence at the University of Groningen, the Netherlands. He performed his PhD research at the research institute IDSIA in Switzerland and graduated in 1999 on the topic of reinforcement learning. Before going to the University of Groningen, he worked as an assistant professor at Utrecht University. Dr. Wiering has co-authored more than 170 conference or journal papers and has supervised or is supervising 12 PhD students and more than 110 Master student graduation projects. His main research topics are reinforcement learning, deep learning, neural networks, robotics, computer vision, game playing, time-series prediction and optimization.