CentroidNetV2: A Hybrid Deep Neural Network for Small-Object Segmentation and Counting

Neurocomputing 423 (2021) 490–505
journal homepage: www.elsevier.com/locate/neucom
https://doi.org/10.1016/j.neucom.2020.10.075
0925-2312/© 2020 Elsevier B.V. All rights reserved.

Article history: Received 22 December 2019; Revised 26 July 2020; Accepted 11 October 2020; Available online 5 November 2020. Communicated by Lei Zhang.

2000 MSC: 68T10; 68T45

Keywords: Deep Learning; Computer Vision; Convolutional Neural Networks; Object Detection; Instance Segmentation

Abstract: This paper presents CentroidNetV2, a novel hybrid Convolutional Neural Network (CNN) that has been specifically designed to segment and count many small and connected object instances. This complete redesign of the original CentroidNet uses a CNN backbone to regress a field of centroid-voting vectors and border-voting vectors. The segmentation masks of the individual object instances are produced by decoding centroid votes and border votes. A loss function that combines cross-entropy loss and Euclidean-distance loss achieves high-quality centroids and borders of object instances. Several backbones and loss functions are tested on three different datasets ranging from precision agriculture to microbiology and pathology. CentroidNetV2 is compared to the state-of-the-art networks You Only Look Once Version 3 (YOLOv3) and Mask Region-based Convolutional Neural Network (MRCNN). On two out of three datasets CentroidNetV2 achieves the highest F1 score and on all three datasets CentroidNetV2 achieves the highest recall. CentroidNetV2 demonstrates the best ability to detect small objects, although the best segmentation masks for larger objects are produced by MRCNN.
networks are designed to detect typical everyday objects, they might provide inferior results on counting tasks where small and connected objects are involved. An alternative method to count objects is to regard counting as a regression task. In this case the number of counted objects is directly estimated from images or crops of images [16–19]. This is mostly used in congested scenes where it is difficult to detect objects individually. An example of this approach is estimating the number of people in crowds [20–22]. Recent approaches combine object localization and object detection with counting [4,23].

When counting objects in an image without regarding their location there is a risk of unwanted count compensation: an underestimate of the count in one part of the image compensates for an overestimate in another part of the image. To correctly validate counting results the location of the objects should therefore also be taken into account. A suitable metric for detection and counting is the F1 score, the harmonic mean of precision and recall, which represents the optimal equilibrium between overestimating and underestimating the number of objects. This paper focuses on models for object detection and instance segmentation because these models can estimate the location and dimensions of the counted objects simultaneously. In this paper a new hybrid deep learning architecture called CentroidNetV2 is introduced.
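For reference, the F1 score discussed above is the harmonic mean of precision and recall:

$F_1 = \dfrac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$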
1.1. Contributions and research questions

The original version of CentroidNet [4] is a Fully Convolutional Network (FCN) that is trained using a standard Mean Squared Error (MSE) loss function. A U-net backbone is used to regress a field of 2-d vectors. These vectors are trained to predict the location of the centroid of the nearest object. CentroidNet is independent of image size during training and during inference, because the vectors only encode relative positions and are not scaled by the image size. A voting algorithm is used to produce the actual centroids of the objects. Although demonstrating state-of-the-art results, the original CentroidNet has some limitations: the sizes of the objects are not estimated and the standard MSE loss does not specifically penalize the segmentation and the voting mechanism incorporated in the algorithm. Finally, CentroidNet was only evaluated using the U-net backbone.

In CentroidNetV2 several improvements are proposed, while still maintaining the deep-learning and computer-vision hybrid design and the majority-voting mechanism. Firstly, for each pixel an additional 2-d vector is predicted which represents the relative location of the nearest border of the object with the nearest centroid. This border information is used to predict the delineation of objects. Therefore, in a computer vision context, CentroidNetV2 can be regarded as a form of contour fitting [24] with properties similar to the generalized Hough transform [25]. CentroidNetV2 produces instance-segmentation masks by fitting a predefined geometric shape through the border points. In a deep learning context CentroidNetV2 is considered an instance segmentation model.

We compare CentroidNetV2 to other well-known state-of-the-art networks that have comparable complexity and goals. The ResNet backbones for Mask Region-based Convolutional Neural Network (MRCNN) and CentroidNetV2 give them comparable complexity. A specific shared goal of CentroidNetV2 and You Only Look Once Version 3 (YOLOv3) is small-object detection. In this paper we aim to focus on the general applicability of the CentroidNetV2 architecture for detecting and segmenting small objects. Therefore we have chosen three datasets to cover a broader range of applications. This also means that we do not focus on solving any one specific application (for example, colony counting or crop detection). Because of this broader scope, we compare to well-known and general methods for object detection and segmentation. Furthermore, we provide a comparison between the properties of the original CentroidNet and CentroidNetV2.

In addition to the architectural changes, several ablation studies are performed. The loss function is redesigned to contain two Euclidean loss terms and a cross-entropy term. These loss terms are compared to the original MSE loss function. We aim to investigate the effect of several architectural choices. In principle any semantic segmentation network can serve as a backbone for CentroidNetV2. In the experiments U-net and DeepLabV3 with three backbones, ResNet50, ResNet101 and Xception, are evaluated as representative backbones. Finally, we also investigate if transfer learning improves the performance of CentroidNetV2.

This leads to the following research questions:

1. What is the performance of CentroidNetV2 for detecting and counting many small objects?
2. How does the performance of CentroidNetV2 compare to well-known state-of-the-art models for object detection and instance segmentation?
3. What backbones and loss functions are most suitable for CentroidNetV2?
4. What is the effect of transfer learning on the performance of CentroidNetV2?

In this paper we generally refer to a 1-d structure as a vector, a 2-d structure as a matrix and a 3-d structure as a tensor. A matrix that contains vectors is referred to as a tensor, where the name indicates the type of vectors. For example, the target-centroid-vectors tensor is a matrix containing target-centroid vectors.

The remainder of this paper is structured as follows. In Section 3 the formal design of CentroidNetV2 is discussed. Section 4 explains the contents of the aerial-crops, cell-nuclei and bacterial-colonies datasets that are used for this research. The method of training and validation is discussed in Section 5 and the results are presented in Section 6. Finally, in Section 7 the conclusion and future work are discussed.

2. Related work

CNNs [26–28] are applied to an increasing number of complex image analysis tasks. One of the first break-through applications of CNNs was the classification of images from the ImageNet challenge [7]. Classification models take an image as an input and produce a single prediction for the whole image. Image segmentation using a CNN is performed by classifying each pixel into a one-hot vector representing the class of that pixel. This results in a dense segmentation mask of the entire image. Impressive performance was achieved by U-net on biomedical image data [8] and by DeepLabV3+ on segmenting everyday scenes [29]. A sparser detection is achieved by object-detection CNNs like YOLOv3 [30] and RetinaNet [31]. These architectures directly estimate the bounding box and class of individual object instances in images with everyday objects. YOLOv3 focuses specifically on small-object detection.

Instance segmentation can be regarded as a combination of object detection and segmentation. MRCNN is a widely used state-of-the-art instance segmentation architecture that uses the detected boxes, called region proposals, to predict a dense segmentation mask of individual object instances [10], and this requires a two-stage training process. A Recurrent Neural Network (RNN) for instance segmentation is proposed in [32], where recurrent attention is used to alternate between producing bounding boxes and producing segmentation masks for the objects within these boxes.

A proposal-free instance segmentation network is proposed in [33], where segments and boxes are directly regressed and a
traditional off-the-shelf clustering method is used to create individual instances. This approach, where deep learning and traditional deterministic algorithms are combined, belongs to an emerging class of hybrid algorithms. In DCAN individual dense-object instances are produced by post-processing the segmentation result using the probability map to estimate segment boundaries [34]. In a similar fashion a deterministic temporal-consistency algorithm is combined with a CNN to segment RGB + depth videos [35]. InstanceCut produces object instances by deterministically combining two output modalities of the CNN: a semantic segmentation mask and an instance boundary. An alternative method to estimate boundaries is proposed by the deep watershed transform, which is a deep-learning based instance segmentation method inspired by a traditional watershed transform [36].

Other instance segmentation methods directly estimate decodable shape representations. In the straight-to-shapes approach the embeddings produced by a CNN are decoded into shapes with various methods to produce delineations of instances [37]. The star-convex polygon method uses radial distances to encode object instances with a CNN [38].

Related to this are methods that predict the centroids of individual object instances. These are proposed in [39] and in CentroidNet [4]. Both of these methods use a traditional Hough-transform inspired algorithm for determining centroids after model inference. The former method uses the bounding boxes to predict dense segments and the latter uses a fixed-size bounding box and uses binning to produce sharper centroids. CentroidNet has been shown to produce state-of-the-art performance on a dataset for counting potato crops. In that approach dense spatial-voting vectors are predicted using a CNN and a majority-voting algorithm combined with non-max suppression is subsequently used to produce centroid locations.

Conceptually, the integration of machine vision and deep learning can be viewed as embedding and exploiting prior knowledge in the algorithm. For example, in CentroidNet, partially occluded and connected objects still produce votes because patches of the objects are assumed, by the algorithm, to carry information about the location of the centroid. For example, the leaves of a plant and the grain of these leaves naturally point outward. This means that implicit information about the location of the centroid of a plant is contained in a small patch of the image. This way of prior-knowledge embedding has been demonstrated to outperform non-hybrid approaches [4].

3. The CentroidNetV2 architecture

The main architecture of CentroidNetV2 is shown in Fig. 1. The top part of the graph shows the inference pipeline to predict instances and their corresponding class from input images. The bottom part shows the pipeline for converting the annotations to a suitable CentroidNet format for training. An image tensor X serves as an input to the model indicated by f(·), which in turn predicts an output tensor Y containing the centroid vectors, border vectors and class logits (the score for each class). This tensor is then decoded into instance ids, class ids and class probabilities by the decoding function g(·). The ground-truth tensor Z contains class ids and instance ids and is encoded into centroid vectors, border vectors and class logits. This is done by the inverse transform of g(·), indicated by g′(·).¹ Additionally, the loss function l(·,·) calculates a loss between the output tensor Y and the target tensor T.

¹ The function g′(·) is the inverse transform of g(·) if the class probability is disregarded.

For convenience and without loss of generality the functions in this section are defined using 3-d image-like tensors. However, the actual implementation uses mini-batches of 3-d tensors. The three main functions f(·), g(·) and l(·,·) are explained in sub-sections 3.1, 3.2 and 3.3 respectively.

3.1. Backbones

Function f(·) in Fig. 1 is the backbone of CentroidNetV2 and represents the trainable part. A multi-channel image serves as an input. In our experiments this is a Red Green Blue (RGB) image. The output tensor contains three separate types of predictions: each spatial position of the first two planes contains the y and x components of a relative vector that points to the nearest centroid of an object. Each spatial position of the next two planes contains the y and x components of the vector pointing to the nearest border of the object with the nearest centroid. The final planes of the output tensor contain the logits for the semantic segmentation of all pixels. In this paper we only test binary classification, which means that this logits output consists of two planes (foreground/background). The spatial dimensions of X and Y should be identical and any semantic segmentation network can serve as backbone f(·). In our experiments the depth of the input X is 3 (RGB) and the depth of the output Y is 6 (a 2-d centroid vector, a 2-d border vector and 2 logits). This is mathematically expressed by

$Y = f(X)$,   (1)

$Y = [Yc \mid Yb \mid Yl]$,   (2)

where X is the input tensor of the model and Y is the output tensor with stacked tensors containing the centroid-vectors tensor Yc, the border-vectors tensor Yb and the logits tensor Yl.

Additionally, the probabilities per logit are determined by dividing each logit by the sum of all logits for that pixel:

$Yp_{z,y,x} = \dfrac{Yl_{z,y,x}}{\sum_{z \in [c]} Yl_{z,y,x}}$,   (3)

where Yp contains the class probabilities and c is the number of classes.

Some example centroid vectors and border vectors in Yc and Yb are geometrically shown in Fig. 2, where $p_i$, $c_i$ and $b_i$ represent the pixel coordinate, the vector to the nearest centroid and its nearest border, with three example pixels: $i \in \{1, 2, 3\}$. An important detail about border vectors is that for some coordinates, like $p_1$, the nearest border coordinate of the object with the nearest centroid is different from the nearest border coordinate. The nearest centroid to $p_1$ is of object B, but the nearest border coordinate of $p_1$ is of object A. In this case $b_1$ is the correct border vector (which is not equal to $b'_1$).

3.2. Loss functions

Function l(·,·) in Fig. 1 calculates the loss between the output tensor Y and the target tensor T. Depending on the loss function we use the logits output Yl or the probability output Yp. The target tensors are defined in a similar way as Eq. (2):

$T = [Tc \mid Tb \mid Tp]$,   (4)

where T consists of the stacked tensors with target-centroid-vectors tensor Tc, target-border-vectors tensor Tb and the target-probability tensor Tp containing n planes. Note that the target probability for a certain class is always 0 or 1.
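To make the tensor layout of Eqs. (1)–(4) concrete, the following is a minimal PyTorch-style sketch of a backbone with six output planes and of the decomposition of its output. The use of torchvision's DeepLabV3-ResNet101 and the function names are our own illustration, not the published OpenCentroidNet code.

```python
from torchvision.models.segmentation import deeplabv3_resnet101

# Backbone f(.): any semantic segmentation network with 6 output planes
# (2 centroid-vector planes, 2 border-vector planes, 2 class logits).
backbone = deeplabv3_resnet101(num_classes=6)

def split_output(X):
    """Run the backbone (Eq. 1) and split Y = [Yc | Yb | Yl] (Eq. 2)."""
    Y = backbone(X)["out"]                        # shape (batch, 6, h, w)
    Yc, Yb, Yl = Y[:, 0:2], Y[:, 2:4], Y[:, 4:6]
    # Eq. (3) taken literally: divide each logit by the per-pixel sum of logits
    # (a softmax over Yl would be the numerically safer variant).
    Yp = Yl / Yl.sum(dim=1, keepdim=True)
    return Yc, Yb, Yp
```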
Fig. 1. The CentroidNetV2 architecture. The top part shows the inference pipeline and the bottom part shows the pipeline for encoding the ground-truth annotations. The encoder function g′(·) is the inverse transform of decoder function g(·).

Fig. 2. Examples of centroid vectors (c1, c2 and c3) pointing from the pixel coordinates (p1, p2 and p3) to the nearest centroids of objects A and B. The border vectors (b1, b2 and b3) point to the nearest border of the object with the nearest centroid.

3.2.1. MSE loss

The original CentroidNet used the mean squared error (MSE) loss defined as:

$l_{mse}(Y, T) = \dfrac{1}{c\,h\,w} \sum_{z \in [c]} \sum_{y \in [h]} \sum_{x \in [w]} \left(Y_{z,y,x} - T_{z,y,x}\right)^2$,   (5)

where Y and T are the output and target tensors with a size of c × h × w. In our experiments the output tensor consists of five planes and consequently z runs over 1 through 5.

A limitation of using the MSE loss is the fact that it ignores the meaning of the specific planes in the output tensor Y and target tensor T. For example, the first two planes contain the y and x components of the centroid-voting vectors. For these two planes it makes more sense to use a Euclidean distance as the loss function, while the cross-entropy loss is more useful for the planes that contain the classification logits per pixel. Therefore, in CentroidNetV2, the loss function is decomposed into two different terms: vector loss and segmentation loss. These are discussed in the remaining part of this sub-section.

where Yv and Tv have size c × h × w and represent the output- and target-vectors tensors. Because both the centroid vectors and border vectors are two dimensional, each vector has two components (c = 2). The sizes of the spatial dimensions h and w are the same as the dimensions of the input image.

The vector loss is calculated separately for the centroid vectors and the border vectors using Eq. (6) and then the sum is calculated:

$l_{vl}(Yc, Yb, Tc, Tb) = l_d(Yc, Tc) + l_d(Yb, Tb)$,   (7)

where Yc, Yb, Tc and Tb contain the output-centroid vectors, output-border vectors, target-centroid vectors and target-border vectors respectively.
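The combined VL-CE loss used later in the experiments pairs this vector loss with a per-pixel cross-entropy term. The following is a hedged sketch: the exact form of Eq. (6) and the weighting of the terms in Eqs. (8)–(11) are defined elsewhere in the paper, and $l_d$ is assumed here to be a mean Euclidean distance.

```python
import torch
import torch.nn.functional as F

def l_d(Yv, Tv, eps=1e-8):
    # Assumed form of Eq. (6): mean Euclidean distance between predicted and
    # target 2-d vectors; Yv and Tv have shape (2, h, w).
    return torch.sqrt(((Yv - Tv) ** 2).sum(dim=0) + eps).mean()

def vl_ce_loss(Yc, Yb, Yl, Tc, Tb, Tclass):
    # Eq. (7): vector loss for centroid and border vectors, plus a per-pixel
    # cross-entropy term on the logits Yl (shape (2, h, w)); Tclass holds the
    # target class id per pixel. Equal weighting of the terms is an assumption.
    l_vl = l_d(Yc, Tc) + l_d(Yb, Tb)
    l_ce = F.cross_entropy(Yl.unsqueeze(0), Tclass.unsqueeze(0))
    return l_vl + l_ce
```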
where y and x are spatial coordinates, and v and u are coordinates inside an n × m window of the non-max suppression. In our case plateaus of equal maxima are reduced to single points.

3. Select voting peaks by applying a threshold θ to the suppressed voting matrix $\hat{V}$ to generate the set of selected votes C given by:

$C = \{ (y_c, x_c) \in [h] \times [w] \mid \hat{V}_{y_c, x_c} \geq \theta \}$,   (12)

where $y_c, x_c$ are the peak coordinates and h and w are the dimensions of matrix $\hat{V}$.

4. Select the set of border coordinates corresponding to a centroid. The set of border coordinates for a centroid at coordinate $(y_c, x_c) \in C$ is given by:

$B = \mathrm{border}((y_c, x_c) \in C, Yb, Yc)$,

where the function border(·,·,·) calculates border coordinates for a given centroid at $(y_c, x_c)$ and is given by Algorithm 2; Yb and Yc are tensors containing the border and centroid vectors respectively.

5. Fit a geometric shape (e.g. circle, ellipse, convex hull, etc.) through the set of border coordinates B for a given centroid and draw the geometric shape with a unique value in the instance-ids matrix I.

6. Calculate the class-ids matrix C and the probabilities matrix P by taking the arg max(·) and max(·) over the class probabilities respectively:

$C_{y,x} = \arg\max_{z \in [c]} Yp_{z,y,x}$
$P_{y,x} = \max_{z \in [c]} Yp_{z,y,x}$,

where c is the number of logits in the output-probabilities tensor Yp. When the probability of an element in matrix P is above φ, it

The instance-ids matrix I, class-ids matrix C and class-probabilities matrix P are the final outputs of CentroidNetV2.

3.3.2. Encoder

Encoding is a preprocessing step needed to convert the ground-truth annotations to a format that can be used to train the model. An annotation of an input image X consists of the target-class-ids matrix C′, in which each element represents the class of a pixel in the input image, and the target-instance-ids matrix I′, in which each element represents the id of an individual object instance in the input image. The encoder can be regarded as the inverse of the decoder and therefore the input matrices are named the same as the output matrices of the decoder, but are denoted by an additional apostrophe (′). The output of the encoder is the target-centroid-vectors tensor Tc, the target-border-vectors tensor Tb and the target-probabilities matrix Tp defined in Eq. (4).

The encoding process is defined in the remaining part of this section and the intermediate images to support the explanation are shown in Fig. 4. The black border around the objects in Tc and Tb is caused by the clipping of voting vectors. This is set to roughly twice the average radius of the target objects.

All unique instance ids in matrix I′ are represented by the set $\mathcal{I}$. The set of coordinates of an instance with id i is given by:

$O'_i = \{ (y_o, x_o) \mid I'_{y_o, x_o} = i \in \mathcal{I} \}$,

where $y_o$ and $x_o$ represent the coordinates within the instance-ids matrix I′.

The set of centroids for all objects is calculated by taking the average y and x coordinate of each set of coordinates:

$C' = \{ \overline{O'_1}, \overline{O'_2}, \ldots, \overline{O'_n} \}$,
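Steps 2 and 3 of the decoder above can be sketched as follows. This is our own illustration: the window size, the threshold value and the handling of plateaus are assumptions, and the preceding vote-accumulation step that builds the voting matrix is not shown.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def select_voting_peaks(votes, window=(11, 11), theta=10):
    """Non-max suppression of the centroid-voting matrix V followed by
    thresholding (Eq. 12). `votes` is an (h, w) array of accumulated vote
    counts; values below the local maximum in an n x m window are suppressed,
    and peaks with at least `theta` votes are returned as (y, x) coordinates.
    Reducing plateaus of equal maxima to single points is not shown here."""
    local_max = maximum_filter(votes, size=window, mode="constant")
    suppressed = np.where(votes == local_max, votes, 0)      # V-hat
    ys, xs = np.nonzero(suppressed >= theta)
    return list(zip(ys.tolist(), xs.tolist()))
```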
4.1. Aerial crops

The aerial-crops dataset contains images of potato crops taken with a low-cost drone which navigated over a potato field [4]. It consists of 10 frames randomly sampled from a 24 fps video shot at 10 meters altitude. The dataset contains a mix of small, connected and distinct potato plants as well as background soil. The resolution of each image is 1500 × 1800 pixels. The set contains over 3000 annotated plants, using circles to indicate the location of the plants. See Fig. 5 for some examples. A 50%/50% training/validation split of the dataset is used for validation.

This set is used to compare the individual models on a relatively small amount of images, but a large amount of small objects per image.

All of the datasets described in this section contain images which are either too large or of various sizes. This means they cannot be used directly for training because a mini-batch should consist of multiple equally sized images. A common approach to handle this problem is to resize all images to some predefined size. However, this would not achieve the desired result because small objects could be removed by this action. We employ two strategies to handle this problem. For CentroidNetV2 and MRCNN we randomly crop the image during training with 256 × 256 image crops.

³ https://www.kaggle.com/c/data-science-bowl-2018.
⁴ ISO 11731:2017, Water quality – Enumeration of Legionella.
Fig. 5. Example images from the aerial-crops dataset. The images show variations in the size of the crops and high connectedness between individual crops.
Fig. 6. Example images from the cell-nuclei dataset. The images show variations in resolution, cell type, magnification and imaging modality.
Fig. 7. Example images from the bacterial-colonies dataset. The images show variations in size, color and number of colonies per Petri dish.
The best performing YOLOv3 should be trained with images of 608 × 608 pixels, as described in the original paper [30]. To be able to generate a dataset that can be used to train the original YOLOv3 in DarkNet, images are tiled with 50% overlap in both directions. This overlap is used to prevent clipped objects at the edges of the tiles. When recombining the results to get object locations in the original images, only objects at the center of each tile are kept. We observed that this approach works remarkably well for YOLOv3 because the tiles are much larger than the objects in the images. In Fig. 8 this tiling process is explained.
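The tiling itself can be sketched as follows. This is a simplified illustration of the overlap scheme described above; border handling for images that do not divide evenly into tiles, and the centre-based filtering of detections during recombination, are omitted.

```python
def tile_with_overlap(image, tile=608, stride=304):
    """Cut a NumPy image (h, w, channels) into 608x608 tiles with 50% overlap
    (stride = tile // 2) and return each tile with its (y, x) offset so that
    detections can later be mapped back to the original image."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            tiles.append(((y, x), image[y:y + tile, x:x + tile]))
    return tiles
```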
5. Training and validation

In this section the methods for training and validation of the various models are discussed. Each model is trained using a training set and validated using a disjoint validation set. The split is randomly determined.

5.1. Training

For CentroidNetV2 the input data is normalized using the theoretical range of the image data: subtracting 128 and dividing each pixel by 255. Typically the data is normalized using the statistics of the dataset or the statistics of the dataset that was used to pretrain the model. However, in practice we did not observe any significant loss in performance when using fixed normalization coefficients for all datasets. Furthermore, Adam is used to train CentroidNetV2 with a learning rate of 0.001 and a momentum of 0.9. To avoid overfitting and to select the best model during training, early stopping was applied [28]. In each experiment it was observed that the trained model did not improve significantly after 500 epochs.
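In code, this training setup amounts to little more than the following sketch; the stand-in model and the reading of "momentum of 0.9" as Adam's first-moment coefficient are our assumptions.

```python
import torch

def normalize(image_uint8: torch.Tensor) -> torch.Tensor:
    """Fixed normalization used for CentroidNetV2: subtract 128, divide by 255."""
    return (image_uint8.float() - 128.0) / 255.0

model = torch.nn.Conv2d(3, 6, kernel_size=1)  # stand-in for the real backbone f(.)
# Adam with learning rate 0.001; "momentum of 0.9" is read here as beta1 = 0.9,
# which is also Adam's default first-moment coefficient.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```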
MRCNN and YOLOv3 apply various methods to optimize performance (augmentation, optimizers, normalization, etc.). The maximum amount of instances that MRCNN can produce has been
$\mathrm{IoU}(O, O') = \dfrac{|O \cap O'|}{|O \cup O'|}$,

Matching of object instances is based on a certain minimum IoU threshold. The set of output-instance ids is given by A = [m] and the set of target-instance ids is given by B = [n], where m and n are the number of output instances and target instances respectively. A match between target-instance id b and the output-instance ids in A is given by:

$\mathrm{is\_match}(b \in B) = \max_{a \in A} \mathrm{IoU}(O_a, O'_b) > \tau$
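A minimal sketch of this matching criterion, with instances represented as sets of (y, x) pixel coordinates (the value of the threshold τ is an assumption):

```python
def iou(a, b):
    """Intersection over union of two instances given as sets of (y, x) pixels."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def is_match(target, outputs, tau=0.5):
    """A target instance counts as matched when its best IoU with any output
    instance exceeds the threshold tau."""
    return max((iou(o, target) for o in outputs), default=0.0) > tau
```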
of each table indicate the category of the experiment and are used to group experiments in a logical manner.

Because the highest F1 score represents the best equilibrium between overestimating and underestimating the number of objects, the network threshold hyperparameter that determines the trade-off between precision and recall is optimized on the training set by an exhaustive search. For CentroidNetV2 the integer voting threshold θ discussed in Eq. (12) is optimized; for MRCNN and YOLOv3 the confidence threshold is optimized. After the thresholds have been optimized the metrics are calculated on the validation set and reported in the respective tables.

The naming of the loss functions in this section follows the naming scheme introduced in Section 3. MSE loss is the standard loss defined by Eq. (5). The Vector Loss (VL) is computed by the Euclidean distance between the target-voting vectors and the output-voting vectors and is defined by Eq. (7). The Cross Entropy (CE) loss and IoU loss, defined in Eqs. (8) and (9), are calculated using the output logits and the target logits. Finally, the combined losses used for the analysis in this section are MSE, VL-CE and VL-IoU, defined in Eqs. (5), (10) and (11) respectively.

An open-source reference implementation of OpenCentroidNet written in Python, using PyTorch 1.0 [42], is published with this paper. A fully annotated dataset containing images of potato crops and a dataset containing annotated Legionella bacterial colonies are also published together with this paper.
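The exhaustive threshold search described above reduces to picking the threshold with the highest training-set F1 score. A sketch is given below; the helper `counts_at` is hypothetical and would run the detector and match detections to annotations at a given threshold.

```python
def best_threshold(thresholds, counts_at):
    """Return the threshold that maximizes the F1 score on the training set.
    `counts_at(t)` is assumed to yield (true_positives, false_positives,
    false_negatives) for threshold t."""
    def f1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return max(thresholds, key=lambda t: f1(*counts_at(t)))
```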
6.1. Results on aerial crops

The results of the performance on the aerial-crops dataset for the different models are shown in Table 1.

The first part of Table 1 (Comparing to the state-of-the-art) shows the comparison between CentroidNetV2 and the other models. The overall best F1 score is achieved by CentroidNetV2 (94.7%). YOLOv3 achieves an F1 score of 94.3%. This shows that the tiling scheme used for YOLOv3 is quite optimal. MRCNN achieves an F1 score of 92.4%. Further analysis shows that MRCNN fails to detect small crops. This automatically results in the highest precision for MRCNN (97.7%), caused by the low amount of false positives (34 crops). When using MSE loss and a U-net backbone, a configuration similar to the original CentroidNet, a lower F1 score of 93.5% is achieved.

The visual differences between the individual models are shown in Fig. 9. CentroidNetV2 shows the most correctly detected crops in Fig. 9a. YOLOv3 seems to not find the right balance between the false positives and false negatives, indicated by the false-positive crop found in the left-bottom of Fig. 9b and the two missed small crops. Fig. 9c shows that MRCNN failed to detect two small potato-plant crops and also misses a crop closely connected to a bigger crop (shown in the left bottom of Fig. 9c).

The second part of Table 1 (Comparing loss functions) shows that the MSE loss achieves the lowest F1 score (93.5%) compared to the other loss functions using the same backbone.

The third part of Table 1 (Comparing backbones) shows the performance of the alternative backbones for CentroidNetV2. The extra 51 layers of the ResNet101 backbone only achieve a 0.1% higher F1 score compared to the ResNet50 backbone for CentroidNetV2. The Xception backbone achieves a 4.4% lower F1 score. Also the U-net backbone shows a lower F1 score (0.7% lower). From this it can be concluded that the overall best backbone for CentroidNetV2 on the aerial-crops dataset is DeepLabV3+_ResNet101.

6.2. Results on cell nuclei

The results of the performance on the cell-nuclei dataset for the different models are shown in Table 2. The first part of the table (Usage of pretraining with ResNet101 backbone) shows the performance when using the ResNet101 backbone with and without pretraining (indicated by the PT column). Also an experiment with the alternative VL-IoU loss function has been included here. The MRCNN model with a ResNet101 backbone pretrained on ImageNet achieves the highest F1 score (92.3%). The runner-up is a pretrained CentroidNetV2 with a DeepLabV3+_ResNet101 backbone (91.9% F1 score). Furthermore CentroidNetV2 shows the highest recall (89.9%), which indicates that CentroidNetV2 tends to detect more objects and achieves the lowest amount of false negatives (583 nuclei) at its highest F1 score.

In Fig. 10 an example of the instances produced by MRCNN and CentroidNetV2 is shown on a challenging image. It can be seen that MRCNN gives more accurate instance segmentation masks, which explains the higher F1 score. The higher recall of CentroidNetV2 is explained by the fact that more small and low-contrast cell nuclei are predicted.

From the literature it is well known that pretraining improves the performance of models [43] and this is confirmed by the measured increase in F1 score for MRCNN. An interesting observation is that this also holds for CentroidNetV2, which achieves a 1.3% higher F1 score when using pretrained weights. This confirms that the regression of centroid- and border-voting vectors also benefits from a ResNet101 backbone pretrained on ImageNet and that pretrained convolutional filter weights are quite general in that they can be repurposed for predicting voting vectors. The only case where the pretrained backbone has a lower F1 score compared to the non-pretrained model is when a ResNet50 backbone is used with MRCNN. However, the pretrained version still achieves the highest precision (96.3%) at its highest F1 score. Interestingly, the use of the VL-IoU loss with pretraining achieves the lowest F1 score (90.3%).

The third part of Table 2 (Comparison to U-net backbone) shows the performance of CentroidNetV2 using the original U-net backbone on the cell-nuclei dataset. That configuration is similar to the original version of CentroidNet, which used MSE loss and a U-net backbone, and has among the lowest F1 scores (90.6%). Using the VL-CE loss function in conjunction with the U-net backbone yields better results (91.1%). But still the conclusion holds that the best CentroidNetV2 configuration uses a ResNet101 backbone and the VL-CE loss function. CentroidNetV2 seems to have no obvious advantage when using the U-net backbone because the precision for CentroidNetV2 (93.3%) is lower compared to the original CentroidNet (94.3%). This means that the improvements of both the loss function and the backbone together yield a higher performance on all metrics (91.7% F1 score, 94.3% precision and 89.3% recall).

6.3. Results on bacterial colonies

On the bacterial-colonies dataset CentroidNetV2 achieves the overall highest F1 score of 92.6%, shown in Table 3. YOLOv3 is the runner-up with an F1 score of 92.3%. The U-net backbone of CentroidNetV2 struggles to get good results and achieves only an F1 score of 87.1%. This confirms the added value of the ResNet101 backbone on this dataset. Also in this case CentroidNetV2 achieves the highest recall (91.0%). MRCNN seems to miss objects and achieves the highest precision of 95.4% at the cost of lower recall (89.1%).

In Table 3 it is shown that the number of predicted objects in the image (indicated by the ‘Count’ column) is not representative of the actual number of correctly detected colonies. It seems that YOLOv3 only counts one less colony compared to CentroidNetV2 (885 and 886). However, when looking at the difference in the number of true positives (indicating colonies found at the right location) it can be seen that YOLOv3 actually misses three colonies (832 and 835).
Table 1
Results for counting crops with 1660 annotated validation samples. Performance of several configurations of CentroidNetV2 and comparison to YOLOv3 and MRCNN (in
percentages). The best precision, recall and F1 score are boldface.
Fig. 9. Red circles show the prediction of the three models and the annotations are shown in green. CentroidNetV2 detected most crops (one false negative), MRCNN has three
false negatives and YOLOv3 produced a false positive and two false negatives.
Table 2
Results for counting nuclei with 5755 annotated validation samples. Performance of several configurations of CentroidNetV2 and MRCNN (in percentages). PT indicates if a model
is pretrained. The best precision, recall and F1 score are boldface.
The two extra colonies in the ‘Count’ column are caused by the two extra false positives found elsewhere in the image. This is why we argue that for counting tasks the validation should be based on the F1 score rather than the raw object-detection count, because it takes the location of the objects into account.

The visual differences in performance between the models are shown in Fig. 11. The thick red circles indicate the predictions and the thin green circles indicate the annotations. In the top row a cropped part of an image with bacterial colonies is shown. Each model correctly ignores the yellow colony, which is not Legionella. In Fig. 11b YOLOv3 incorrectly detects the large colony that has not been annotated as Legionella-suspected. MRCNN fails to detect the small colony near the right bottom of Fig. 11c. The bottom row of Fig. 11 gives another interesting insight into the differences between the models. The large black-ish structure at the left of each image is an air bubble adjacent to a colony. Air-bubble formation is a common problem for certain types of culturing media. However, this exact visual appearance is rare in the training set. In Fig. 11e it is shown that YOLOv3 fails to detect the colony, probably because it has not seen something similar before. Both CentroidNetV2 and MRCNN detect this colony correctly. For CentroidNetV2 this is probably because the partial bacterial colony still produces part of the votes (similar to when two colonies are overlapping).

In this section our focus has been to compare the F1 score, Precision, Recall and Count metrics of the various approaches, and we therefore did not include inference-time metrics. For an analysis of the inference time of MRCNN and YOLOv3 we refer the reader to [44]. In that paper the authors provide an extensive
Fig. 10. Instance segmentation results on an image of the cell-nuclei dataset. The input image and ground truth are shown on the left and the predicted output of the models
is shown on the right. MRCNN predicts more accurate segments. CentroidNetV2 detects small and low contrast objects that MRCNN fails to detect.
Table 3
Results for counting bacterial colonies with 918 annotated validation samples. Performance of CentroidNetV2 compared to YOLOv3 and MRCNN (in percentages). The best
precision, recall and F1 score are boldface.
Fig. 13. Detection of low-contrast objects. Images (c), (d) and (e) contain bacterial
colonies, the other images contain cell nuclei. The red circles show detections and
the green circles represent the ground truth. Images (a) through (e) seem to contain
no image information, however this is the true contrast in the image. Cen-
troidNetV2 is able to detect more low contrast objects.
Fig. 12. Detection of small objects. The red circles show detections and the green circles represent the ground truth. This shows that MRCNN detects fewer of the small bacterial colonies and potato-plant crops.
Fig. 15. Voting matrices for CentroidNetV1 and CentroidNetV2. In this example the ground-truth centroids are detected with both approaches. The improvements made to CentroidNetV2 are shown to produce sharper votes.

7. Discussion and conclusion

Experiments have been performed on three datasets with three different models. The datasets and models can be divided into two categories: object detection and instance segmentation. The models for instance segmentation, CentroidNetV2 and MRCNN, have been tested on all datasets. The object-detection model YOLOv3 has only been tested on the object-detection datasets: aerial-crops and bacterial-colonies. This is because an instance-segmentation model can be used for object detection but not vice versa. The F1 score has been the main metric by which to evaluate performance, because it indicates the best trade-off between overestimation and underestimation of the number of counted object instances. Precision and recall have been calculated at the point of the highest F1 score determined by an exhaustive search on the training set. All reported metrics are calculated using a disjoint validation set.

CentroidNetV2 shows the best F1 score for the aerial-crops dataset (94.7%) and the bacterial-colonies dataset (92.6%). The best F1 score on the cell-nuclei dataset is achieved by MRCNN (92.3%). For all datasets CentroidNetV2 consistently shows the highest recall: 95.7%, 89.9% and 91.0% on the aerial-crops, cell-nuclei and bacterial-colonies datasets respectively. MRCNN shows the highest precision: 97.7%, 96.3% and 95.4% on the aerial-crops, cell-nuclei and bacterial-colonies datasets respectively. MRCNN has the tendency to miss small objects, which results in a high precision at the cost of recall. YOLOv3 generally achieves a high precision, recall and F1 score but is always outperformed by either CentroidNetV2 or MRCNN.

The measured differences among the best-performing models are mostly small, but these differences are consistent over the various datasets. Each model has its own unique properties and the choice ultimately depends on the application. If accurate counting of objects is needed for a large number of small and connected objects, CentroidNetV2 is preferable. When accurate masks of objects should be determined with high recall then MRCNN is preferable. YOLOv3 does a good job at detecting small objects but it is only able to detect bounding boxes, whereas CentroidNetV2 produces a complete circumference of objects.

For CentroidNetV2 and MRCNN, images of various sizes are handled in a similar fashion and this has been made completely transparent by using random image crops during training. However, CentroidNetV2 truly does not take the image dimensions into account because all voting vectors are relative. The original YOLOv3 implementation is defined for fixed-size images and therefore requires tiling of the images prior to training and recombination

1. What is the performance of CentroidNetV2 for detecting and counting many small objects?
   CentroidNetV2 is considered to be the preferable approach for counting many small objects because the results show that it either achieves the highest F1 score or achieves the best recall and tends to detect more small objects.
2. How does the performance of CentroidNetV2 compare to well-known state-of-the-art neural networks for object detection and instance segmentation?
   On two datasets CentroidNetV2 outperforms the well-known state-of-the-art networks on object detection and instance segmentation. Only on the cell-nuclei dataset does MRCNN produce a higher F1 score.
3. What backbone and loss function are best suited for CentroidNetV2?
   The loss function combining vector loss and cross-entropy loss gives sharper voting peaks and consistently achieves the best F1 score compared to the original MSE loss function. The DeepLabV3+_ResNet101 backbone generally obtains the best performance.
4. What is the effect of transfer learning on the performance of CentroidNetV2?
   The results show that the vector-voting method of CentroidNetV2 also benefits from a pretrained backbone of the model. This means that pretrained feature maps of the CNNs are general enough to have a beneficial impact on the F1 score.

7.1. Future work

CentroidNetV2 was compared to the popular and general architectures MRCNN and YOLOv3. Newer and more advanced CNN architectures are introduced regularly. In the future CentroidNetV2 can be compared to recent advances in object detection and segmentation.

The run-time performance of the decoding algorithm of CentroidNetV2 can probably be further optimized by making use of the GPU or by implementing the algorithm in a language that allows for lower-level access to the CPU (for example C++).

Many applications exist for counting that are closely related to the research discussed in this paper. Many different types of vegetation exist that need to be counted. This does not necessarily have to be crops, but can also be trees or other types of large vegetation. Also in the field of microbiology, many applications for colony counting exist. CentroidNetV2 can be tested on other types of bacterial colonies and research into colony counting can be extended to other microbiological fields like medical pathology. Other fields unrelated to counting and more related to object detection and instance segmentation can be investigated, for example the segmentation of everyday objects like persons, cars, etc. CentroidNetV2 might be able to detect smaller everyday objects.
This paper has shown that the results of CentroidNetV2 improved by changing the backbone and the loss function. In the future new segmentation backbones can be integrated with CentroidNetV2. Further investigation of other loss functions might also improve the results.

In this research only classification between background and foreground has been investigated. Future work might focus on counting objects of multiple classes separately. Furthermore, multichannel images can serve as an input to CentroidNetV2. Therefore future work might include using hyperspectral imaging to count objects. For this, additional image channels like fluorescent images can be recorded. Even data outside of the visible spectrum can be used, like thermal or short-wave infrared images.

Architectural changes could be made to CentroidNetV2 to reduce the number of hyperparameters and this should make the decoding process more straightforward. Such new architectures could be compared to other novel architectures of promising object-detection and instance-segmentation networks.

CRediT authorship contribution statement

Klaas Dijkstra: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Jaap van de Loosdrecht: Conceptualization, Writing - review & editing, Supervision. Waatze A. Atsma: Writing - review & editing, Resources, Data curation. Lambert R.B. Schomaker: Conceptualization, Writing - review & editing, Supervision. Marco A. Wiering: Conceptualization, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References

[1] J. Paul Cohen, G. Boucher, C.A. Glastonbury, H.Z. Lo, Y. Bengio, Count-ception: counting by fully convolutional redundant counting, International Conference on Computer Vision (2017) 18–26.
[2] M. Baygin, M. Karakose, A. Sarimaden, E. Akin, An image processing based object counting approach for machine vision application, in: Conference on Advances and Innovations in Engineering, 2018, pp. 966–970.
[3] A. Ferrari, S. Lombardi, A. Signoroni, Bacterial colony counting with Convolutional Neural Networks, Conference of the IEEE Engineering in Medicine and Biology Society (2015) 7458–7461.
[4] K. Dijkstra, J. van de Loosdrecht, L.R. Schomaker, M.A. Wiering, Centroidnet: a deep neural network for joint object localization and counting, in: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2018, pp. 585–601.
[5] A. Croxatto, K. Dijkstra, G. Prod’hom, G. Greub, Comparison of inoculation with the InoqulA and WASP automated systems with manual inoculation, J. Clin. Microbiol. 53 (7) (2015) 2298–2307.
[6] A.U.M. Khan, A. Torelli, I. Wolf, N. Gretz, AutoCellSeg: Robust automatic colony forming unit (CFU)/cell analysis using adaptive image segmentation and easy-to-use post-editing techniques, Nat. Sci. Rep. 8 (1) (2018) 2045–2322.
[7] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1097–1105.
[8] O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation, Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241.
[9] A.R. Pathak, M. Pandey, S. Rautaray, Application of deep learning for object detection, Procedia Comput. Sci. 132 (2018) 1706–1717.
[10] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, Conference on Computer Vision and Pattern Recognition (2017) 2980–2988.
[11] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, Conference on Computer Vision and Pattern Recognition (2019) 4401–4410.
[12] K. Dijkstra, J. van de Loosdrecht, L.R. Schomaker, M.A. Wiering, Hyperspectral demosaicking and crosstalk correction using deep learning, Mach. Vision Appl. 30 (1) (2018) 1–21.
[13] P. Ren, W. Fang, S. Djahel, A novel yolo-based real-time people counting approach, in: 2017 International Smart Cities Conference (ISC2), IEEE, 2017, pp. 1–2.
[14] A. Özlü, TensorFlow Object Counting API (2018), https://github.com/ahmetozlu/tensorflow_object_counting_api.
[15] B. Chen, X. Miao, Distribution line pole detection and counting based on yolo using uav inspection line video, J. Electr. Eng. Technol. (2019) 1–8.
[16] W. Xie, J.A. Noble, A. Zisserman, Microscopy cell counting and detection with fully convolutional regression networks, Comput. Methods Biomech. Biomed. Eng. Imag. Visualiz. 6 (3) (2018) 283–292.
[17] T. Stahl, S.L. Pintea, J.C. Van Gemert, Divide and count: generic object counting by image divisions, IEEE Trans. Image Process. 28 (2019) 1035–1044.
[18] J. Wan, W. Luo, B. Wu, A.B. Chan, W. Liu, Residual regression with semantic prior for crowd counting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4036–4045.
[19] F. Dai, H. Liu, Y. Ma, J. Cao, Q. Zhao, Y. Zhang, Dense scale network for crowd counting, arXiv preprint arXiv:1906.09707.
[20] Y. Li, X. Zhang, D. Chen, CSRNet: dilated convolutional neural networks for understanding the highly congested scenes, Conference on Computer Vision and Pattern Recognition (2018) 1092–1100.
[21] J. Gao, Q. Wang, X. Li, PCC net: perspective crowd counting via spatial convolutional network, IEEE Trans. Circ. Syst. Video Technol.
[22] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE Trans. Pattern Anal. Mach. Intell. 42 (1) (2020) 46–58.
[23] M.R. Hsieh, Y.L. Lin, W.H. Hsu, Drone-based object counting by spatially regularized regional proposal network, Conference on Computer Vision and Pattern Recognition (2017) 4165–4173.
[24] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Int. J. Comput. Vision 1 (4) (1988) 321–331.
[25] D.H. Ballard, Generalizing the hough transform to detect arbitrary shapes, Pattern Recogn. 13 (2) (1981) 111–122.
[26] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[27] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks 61 (2015) 85–117.
[28] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[29] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, European Conference on Computer Vision (2018) 801–818.
[30] J. Redmon, A. Farhadi, Yolov3: an incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[32] M. Ren, R.S. Zemel, End-to-end instance segmentation with recurrent attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6656–6664.
[33] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, S. Yan, Proposal-free network for instance-level object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (12) (2017) 2978–2991.
[34] H. Chen, X. Qi, L. Yu, P.-A. Heng, Dcan: deep contour-aware networks for accurate gland segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2487–2496.
[35] C. Couprie, C. Farabet, L. Najman, Y. Lecun, Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos, J. Mach. Learn. Res. 15 (1) (2014) 3489–3511.
[36] M. Bai, R. Urtasun, Deep watershed transform for instance segmentation, Conference on Computer Vision and Pattern Recognition (2017) 2858–2866.
[37] S. Jetley, M. Sapienza, S. Golodetz, P.H. Torr, Straight to shapes: real-time detection of encoded shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6550–6559.
[38] U. Schmidt, M. Weigert, C. Broaddus, G. Myers, Cell detection with star-convex polygons, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 265–273.
[39] Z. Wu, C. Shen, A. v. d. Hengel, Bridging category-level and instance-level semantic image segmentation, arXiv preprint arXiv:1605.06885.
[40] M.A. Rahman, Y. Wang, Optimizing intersection-over-union in deep neural networks for image segmentation, in: International Symposium on Visual Computing, 2016, pp. 234–244.
[41] F. van Beers, A. Lindstrom, E. Okafor, M.A. Wiering, Deep neural networks with intersection over union loss for binary image segmentation, Conference on Pattern Recognition Applications and Methods (2019) 438–445.
[42] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, Z. Devito, Automatic differentiation in PyTorch, Neural Inf. Process. Syst.
[43] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, S. Bengio, Why does unsupervised pre-training help deep learning?, J. Mach. Learn. Res. 11 (2010) 625–660.
[44] N.-D. Nguyen, T. Do, T.D. Ngo, D.-D. Le, An evaluation of deep learning methods for small object detection, J. Electr. Comput. Eng. (2020).
Dr. Klaas Dijkstra is an associate professor in computer vision and data science at NHL Stenden University of Applied Sciences. His main research interests are in computer vision, machine learning and hyperspectral imaging. After completing his B.Eng. degree in technical information science in 2005, he has been active in the field of computer vision by doing applied research projects in several domains. In 2013 he obtained his M.Sc. degree from the Limerick Institute of Technology in Ireland, on the application of evolutionary algorithms and computer vision to the domain of microbiological analysis. He obtained his Ph.D. degree in 2020 from the University of Groningen on the topic of deep learning and hyperspectral imaging for unmanned aerial vehicles.

Jaap van de Loosdrecht is a professor in computer vision and data science at the NHL Stenden University of Applied Sciences. His main research interests are in computer vision, deep learning and hyperspectral imaging. In 1996 he founded the professorship Computer Vision & Data Science. His staff, researchers, teachers and students have carried out more than 350 research projects for the business community in the field of Computer Vision & Data Science, including the Raak-Award 2016 project ‘Smart Vision for UAV’s’. He is Comenius Senior Fellow at KNAW (Royal Dutch Academy of Sciences).

Waatze A. Atsma received his engineering degree in Biotechnology at the Noordelijke Hogeschool in Leeuwarden in 2002 and has been working at the drinking water laboratory Vitens N.V. as a principal analyst and project leader since 2003. His specialty is mainly in the field of drinking water diagnostics, in particular the development and implementation of new (molecular based) microbiological methods within the Dutch drinking water laboratories. In addition, Atsma is a member and chairman of various national working groups in the field of implementation of rapid microbiological methods in the drinking water laboratories and is closely involved in drawing up guidelines for drinking water-related microbiological methods, including for Legionella diagnostics. On behalf of the Netherlands, Atsma is one of the delegation members for the ISO/TC147/SC4 microbiological parameters with the aim of drawing up or changing international standards for water microbiology tests.

Prof. dr. Lambert Schomaker has been a professor in artificial intelligence at the University of Groningen since 2001. He is known for research in simulation and recognition of handwriting, writer identification, style-based document dating and other studies in pattern recognition, machine learning and robotics. He has (co)authored over 200 publications and was involved in the organization of many conferences in handwriting recognition and document analysis. In recent years he and his team have worked on the Monk system: an interactively trainable search engine and e-Science service for historical manuscripts. The availability of up to thousands of training images for single classes of complex patterns has brought pattern recognition and machine learning into the ballpark of big data. Other recent work is in the area of robotics and industrial maintenance, in the EU ECSEL project Mantis. In 2015, he became co-chair of the Data Science and Systems Complexity center at the Faculty of Science and Engineering at the University of Groningen. In 2017, he joined the CogniGron center for cognitive systems and materials in a large-scale seven-year project in neuromorphic computing. He is a member of the IAPR and senior member of IEEE.

Dr. Marco Wiering is an assistant professor at the department of artificial intelligence of the University of Groningen, the Netherlands. He performed his PhD research at the research institute IDSIA in Switzerland and graduated in 1999 on the topic of reinforcement learning. Before going to the University of Groningen, he worked as an assistant professor at Utrecht University. Dr. Wiering has co-authored more than 170 conference or journal papers and has supervised or is supervising 12 PhD students and more than 110 Master student graduation projects. His main research topics are reinforcement learning, deep learning, neural networks, robotics, computer vision, game playing, time-series prediction and optimization.