
Semantic Instance Segmentation with a Discriminative Loss Function

Bert De Brabandere∗ Davy Neven∗ Luc Van Gool


ESAT-PSI, KU Leuven
[email protected]
arXiv:1708.02551v1 [cs.CV] 8 Aug 2017

Abstract

Semantic instance segmentation remains a challenging task. In this work we propose to tackle the problem with a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin. Our approach of combining an off-the-shelf network with a principled loss function inspired by a metric learning objective is conceptually simple and distinct from recent efforts in instance segmentation. In contrast to previous works, our method does not rely on object proposals or recurrent mechanisms. A key contribution of our work is to demonstrate that such a simple setup without bells and whistles is effective and can perform on par with more complex methods. Moreover, we show that it does not suffer from some of the limitations of the popular detect-and-segment approaches. We achieve competitive performance on the Cityscapes and CVPPP leaf segmentation benchmarks.

Figure 1. The network maps each pixel to a point in feature space so that pixels belonging to the same instance are close to each other, and can easily be clustered with a fast post-processing step. From top to bottom, left to right: input image, output of the network, pixel embeddings in 2-dimensional feature space, clustered image.
1. Introduction

Semantic instance segmentation has recently gained in popularity. As an extension of regular semantic segmentation, the task is to generate a binary segmentation mask for each individual object along with a semantic label. It is considered a fundamentally harder problem than semantic segmentation - where overlapping objects of the same class are segmented as one - and is closely related to the tasks of object counting and object detection. One could also say instance segmentation is a generalization of object detection, with the goal of producing a segmentation mask rather than a bounding box for each object. Pinheiro et al. [28] obtain bounding boxes from instance segmentations by simply drawing the tightest bounding box around each segmentation mask, and show that their system reaches state-of-the-art performance on an object detection benchmark.

The relation between instance segmentation and semantic segmentation is less clear. Intuitively, the two tasks feel very closely related, but it turns out not to be obvious how to apply the network architectures and loss functions that are successful in semantic segmentation to this related instance task. One key factor that complicates the naive application of the popular softmax cross-entropy loss function to instance segmentation is the fact that an image can contain an arbitrary number of instances and that the labeling is permutation-invariant: it does not matter which specific label an instance gets, as long as it is different from all other instance labels.

∗ Authors contributed equally
One possible solution is to set an upper limit to the number of detectable instances and to impose extra constraints on the labeling, but this may unnecessarily limit the representational power of the network and introduce unwanted biases, leading to unsatisfying results.

Most recent works on instance segmentation with deep networks go a different route. Two popular approaches introduce a multi-stage pipeline with object proposals [17, 9, 28, 29, 12, 13, 42, 4], or train a recurrent network end-to-end with a custom loss function that outputs instances sequentially [26, 31, 30]. Another line of research is to train a network to transform the image into a representation that is clustered into individual instances with a post-processing step [34, 46, 37, 22]. Our method belongs to this last category, but takes a more principled (less ad hoc) approach than previous works and reduces the post-processing step to a minimum.

Inspired by the success of siamese networks [8, 10] and the triplet loss [39, 33] in image classification, we introduce a discriminative loss function to replace the pixel-wise softmax loss that is commonly used in semantic segmentation. Our loss function encourages the network to map each pixel in the image to an n-dimensional vector in feature space, such that feature vectors of pixels that belong to the same instance lie close together while feature vectors of pixels that belong to different instances lie far apart. The output of the network can easily be clustered with a fast and simple post-processing operation. With this mechanism, we optimize an objective that avoids the aforementioned problems related to a variable number of instances and permutation invariance.

Our work mainly focuses on the loss function, as we aim to be able to re-use network architectures that were designed for semantic segmentation: we plug in an off-the-shelf architecture and retrain the system with our discriminative loss function. In our framework, the tasks of semantic and instance segmentation can be treated in a consistent and similar manner and do not require changes on the architecture side.

The rest of this paper is structured as follows. First we give an extensive overview of the related work in section 2. In section 3 we discuss our proposed method in detail. In section 4 we set up experiments on two instance segmentation benchmarks and show that we get a performance that is competitive with the state of the art.

2. Related Work

In the last few years, deep networks have achieved impressive results in semantic and instance segmentation. All top-performing methods across different benchmarks use a deep network in their pipeline. Here we discuss these prior works and situate our model among them.

Proposal-based Many instance segmentation approaches build a multi-stage pipeline with a separate object proposal and classification step. Hariharan et al. [17] and Chen et al. [9] use MCG [3] to generate category-independent region proposals, followed by a classification step. Pinheiro et al. [28, 29] use the same general approach, but their work focuses on generating segmentation proposals with a deep network. Dai et al. [12, 13] won the 2015 MS-COCO instance segmentation challenge with a cascade of networks (MNC) to merge bounding boxes, segmentation masks and category information. Many works were inspired by this approach and also combine an object detector with a semantic segmentation network to produce instances [42, 4, 18]. In contrast to these works, our method does not rely on object proposals or bounding boxes but treats the image holistically, which we show to be beneficial for handling certain tasks with complex occlusions, as discussed in section 3.3.

Recurrent methods Other recent works [26, 31, 30] employ recurrent networks to generate the individual instances sequentially. Stewart et al. [35] train a network for end-to-end object detection using an LSTM [19]. Their loss function is permutation-invariant as it incorporates the Hungarian algorithm to match candidate hypotheses to ground-truth instances. Inspired by their work, Romera-Paredes et al. [31] propose an end-to-end recurrent network with convolutional LSTMs that sequentially outputs binary segmentation maps for each instance. Ren et al. [30] improve upon [31] by adding a box network to confine segmentations within a local window, and skip connections instead of graphical models to restore the resolution at the output. Their final framework consists of four major components: an external memory and networks for box proposal, segmentation and scoring. We argue that our proposed method is conceptually simpler and easier to implement than these methods. Our method does not involve recurrent mechanisms and can work with any off-the-shelf segmentation architecture. Moreover, our loss function is permutation-invariant by design, without the need to resort to a Hungarian algorithm.

Clustering Another approach is to transform the image into a representation that is subsequently clustered into discrete instances. Silberman et al. [34] produce a segmentation tree and use a coverage loss to cut it into non-overlapping regions. Zhang et al. [46] impose an ordering on the individual instances based on their depth, and use an MRF to merge overlapping predicted patches into a coherent segmentation. Two earlier works [43, 36] also use depth information to segment instances. Uhrig et al. [37] train a network to predict each pixel's direction towards its instance center, along with monocular depth and semantic labels. They use template matching and proposal fusion techniques to extract the individual instances from this representation. Liang et al. [22] predict pixel-wise feature vectors representing the ground truth bounding box of the instance each pixel belongs to. With the help of a sub-network that predicts an object count, they cluster the output of the network into individual instances. Our work is similar to these works in that we have a separate clustering step, but our loss does not constrain the output of the network to a specific representation like instance center coordinates or depth ordering; it is less ad hoc in that sense.

Other Bai et al. [7] use deep networks to directly learn the energy of the watershed transform. A drawback of this bottom-up approach is that it cannot handle occlusions where instances are separated into multiple pieces. Kirillov et al. [20] use a CRF, but with a novel MultiCut formulation to combine semantic segmentations with edge maps, to extract instances as connected regions. A shortcoming of this method is that, although it reasons globally about instances, it also cannot handle occlusions. Arnab et al. [5] combine an object detector with a semantic segmentation module using a CRF model. By considering the image holistically it can handle occlusions and produce more precise segmentations.

Figure 2. The intra-cluster pulling force pulls embeddings towards the cluster center, i.e. the mean embedding of that cluster. The inter-cluster repelling force pushes cluster centers away from each other. Both forces are hinged: they are only active up to a certain distance determined by the margins δv and δd, denoted by the dotted circles. This diagram is inspired by a similar one in [39].
Loss function Our loss function is inspired by earlier works on distance metric learning, discriminative loss functions and siamese networks [8, 10, 16, 39, 21]. Most similar to our loss function, Weinberger et al. [39] propose to learn a distance metric for large margin nearest neighbor classification. Koestinger et al. [21] further explore a similar LDA-based objective. More recently Schroff et al. [33], building on Sun et al. [38], introduced the triplet loss for face recognition. The triplet loss enforces a margin between each pair of faces from one person, to all other faces. Xie et al. [41] propose a clustering objective for unsupervised learning. Whereas these works employ a discriminative loss function to optimize distances between images in a dataset, our method operates at the pixel level, optimizing distances between individual pixels in an image. To our knowledge, we are the first to successfully use a discriminative loss based on distance metric learning principles for the task of instance segmentation with deep networks.

3. Method

3.1. Discriminative loss function

Consider a differentiable function that maps each pixel in an input image to a point in n-dimensional feature space, referred to as the pixel embedding. The intuition behind our loss function is that embeddings with the same label (same instance) should end up close together, while embeddings with a different label (different instance) should end up far apart.

Weinberger et al. [39] propose a loss function with two competing terms to achieve this objective: a term to penalize large distances between embeddings with the same label, and a term to penalize small distances between embeddings with a different label.

In our loss function we keep the first term, but replace the second term with a more tractable one: instead of directly penalizing small distances between every pair of differently-labeled embeddings, we only penalize small distances between the mean embeddings of different labels. If the number of different labels is smaller than the number of inputs, this is computationally much cheaper than calculating the distances between every pair of embeddings. This is a valid assumption for instance segmentation, where there are orders of magnitude fewer instances than pixels in an image.

We now formulate our discriminative loss in terms of push (i.e. repelling) and pull forces between and within clusters. A cluster is defined as a group of pixel embeddings sharing the same label, e.g. pixels belonging to the same instance. Our loss consists of three terms:

1. variance term: an intra-cluster pull-force that draws embeddings towards the mean embedding, i.e. the cluster center.

2. distance term: an inter-cluster push-force that pushes clusters away from each other, increasing the distance between the cluster centers.

3. regularization term: a small pull-force that draws all clusters towards the origin, to keep the activations bounded.

The variance and distance terms are hinged: their forces are only active up to a certain distance. Embeddings within a distance of δv from their cluster center are no longer attracted to it, which means that they can exist on a local manifold in feature space rather than having to converge to a single point. Analogously, cluster centers further apart than 2δd are no longer repulsed and can move freely in feature space. Hinging the forces relaxes the constraints on the network, giving it more representational power to achieve its goal. The interacting forces in feature space are illustrated in figure 2.
The loss function can also be written down exactly. We use the following definitions: C is the number of clusters in the ground truth, N_c is the number of elements in cluster c, x_i is an embedding, \mu_c is the mean embedding of cluster c (the cluster center), \|\cdot\| is the L1 or L2 distance, and [x]_+ = \max(0, x) denotes the hinge. \delta_v and \delta_d are respectively the margins for the variance and distance loss. The loss can then be written as follows:

L_{var} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \big[ \|\mu_c - x_i\| - \delta_v \big]_+^2 \qquad (1)

L_{dist} = \frac{1}{C(C-1)} \sum_{c_A=1}^{C} \sum_{\substack{c_B=1 \\ c_B \neq c_A}}^{C} \big[ 2\delta_d - \|\mu_{c_A} - \mu_{c_B}\| \big]_+^2 \qquad (2)

L_{reg} = \frac{1}{C} \sum_{c=1}^{C} \|\mu_c\| \qquad (3)

L = \alpha \cdot L_{var} + \beta \cdot L_{dist} + \gamma \cdot L_{reg} \qquad (4)

In our experiments we set α = β = 1 and γ = 0.001. The loss is minimized by stochastic gradient descent.
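The paper does not include reference code, so the following is a minimal single-image PyTorch sketch of equations (1)-(4). The tensor layout, the convention that instance id 0 marks background, and the choice of the L2 norm are our own assumptions, not details from the paper.

```python
import torch

def discriminative_loss(embedding, labels, delta_v=0.5, delta_d=1.5,
                        alpha=1.0, beta=1.0, gamma=0.001):
    """Single-image sketch of Eqs. (1)-(4).

    embedding: (D, H, W) float tensor of pixel embeddings.
    labels:    (H, W) long tensor of instance ids; 0 = background (assumed).
    """
    D = embedding.shape[0]
    emb = embedding.reshape(D, -1).t()                 # (H*W, D)
    lab = labels.reshape(-1)
    ids = torch.unique(lab)
    ids = ids[ids != 0]                                # drop background pixels
    C = ids.numel()
    if C == 0:                                         # no instances in image
        return embedding.sum() * 0.0

    # Cluster centers mu_c: mean embedding of each instance.
    centers = torch.stack([emb[lab == c].mean(dim=0) for c in ids])  # (C, D)

    # Variance term, Eq. (1): hinged pull of embeddings to their center.
    l_var = embedding.sum() * 0.0
    for k, c in enumerate(ids):
        dist = torch.norm(emb[lab == c] - centers[k], dim=1)
        l_var = l_var + torch.clamp(dist - delta_v, min=0).pow(2).mean()
    l_var = l_var / C

    # Distance term, Eq. (2): hinged push between pairs of cluster centers.
    if C > 1:
        dmat = torch.cdist(centers, centers)                  # (C, C)
        hinge = torch.clamp(2.0 * delta_d - dmat, min=0).pow(2)
        hinge = hinge - torch.diag(torch.diagonal(hinge))     # drop c_A == c_B
        l_dist = hinge.sum() / (C * (C - 1))
    else:
        l_dist = embedding.sum() * 0.0

    # Regularization term, Eq. (3): small pull of all centers to the origin.
    l_reg = torch.norm(centers, dim=1).mean()

    return alpha * l_var + beta * l_dist + gamma * l_reg      # Eq. (4)
```

In a training loop this scalar would be backpropagated like any other loss; for a mini-batch one would average the per-image losses.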
Comparison with softmax loss We discuss the relation of our loss function with the popular pixel-wise multi-class cross-entropy loss, often referred to as the softmax loss. In the case of a softmax loss with n classes, each pixel embedding is driven to a one-hot vector, i.e. the unit vector on one of the axes of an n-dimensional feature space. Because the softmax function has the normalizing property that its outputs are positive and sum to one, the embeddings are restricted to lie on the unit simplex. When the loss reaches zero, all embeddings lie on one of the unit vectors. By design, the dimensions of the output feature space (which correspond to the number of feature maps in the last layer of the network) must be equal to the number of classes. To add a class after training, the architecture needs to be updated too.

In comparison, our loss function does not drive the embeddings to a specific point in feature space. The network could for example place similar clusters (e.g. two small objects) closer together than dissimilar ones (e.g. a small and a large object). When the loss reaches zero, the system of push and pull forces has minimal energy and the clusters have organized themselves in n-dimensional space. Most importantly, the dimensionality of the feature space is independent of the number of instances that needs to be segmented. Figure 3 depicts the convergence of our loss function when overfitting on a single image with 15 instances, in a 2-dimensional feature space.

Figure 3. Convergence of our method on a single image in a 2-dimensional feature space. Left: input and ground truth label. The middle row shows the raw output of the network (as the R- and G-channels of an RGB image), masked with the foreground mask. The upper row shows each of the pixel embeddings x_i in 2-d feature space, colored corresponding to their ground truth label. The cluster center µ_c and margins δv and δd are also drawn. The last row shows the result of clustering the embeddings by thresholding around their cluster center, as explained in section 3.2. We display the images after 0, 2, 4, 8, 16, 32 and 64 gradient update steps.

3.2. Post-processing

When the variance and distance terms of the loss are zero, the following is true:

• all embeddings are within a distance of δv from their cluster center

• all cluster centers are at least 2δd apart

If δd > δv, then each embedding is closer to its own cluster center than to any other cluster center. It follows that during inference, we can threshold with a bandwidth b = δv around a cluster center to select all embeddings belonging to that cluster. Thresholding in this case means selecting all embeddings x_i that lie within a hypersphere with radius b around the cluster center x_c:

x_i \in C \Leftrightarrow \|x_i - x_c\| < b \qquad (5)

For the tasks of classification and semantic segmentation, with a fixed set of classes, this leads to a simple strategy for post-processing the output of the network into discrete classes: after training, calculate the cluster centers of each class over the entire training set. During inference, threshold around each of the cluster centers to select all pixels belonging to the corresponding semantic class. This requires that the cluster centers of a specific class are the same in each image, which can be accomplished by coupling the cluster centers across a mini-batch.

For the task of instance segmentation things are more complicated. As the labeling is permutation-invariant, we cannot simply record cluster centers and threshold around them during inference. We could follow a different strategy: if we set δd > 2δv, then each embedding is closer to all embeddings of its own cluster than to any embedding of a different cluster. It follows that we can threshold around any embedding to select all embeddings belonging to the same cluster. The procedure during inference is to select an unlabeled pixel, threshold around its embedding to find all pixels belonging to the same instance, and assign them all the same label. Then select another pixel that does not yet belong to an instance and repeat until all pixels are labeled.
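To make the procedure concrete, here is a NumPy sketch of this sequential thresholding; the function name, array layout and foreground-mask argument are our assumptions. Note that with δd > 2δv, same-cluster embeddings are at most 2δv apart while embeddings of different clusters are more than 2(δd − δv) apart, so any bandwidth b between those two bounds selects exactly the seed's cluster.

```python
import numpy as np

def cluster_by_thresholding(embedding, fg_mask, bandwidth):
    """Greedy labeling: pick an unlabeled foreground pixel, select all
    pixels whose embedding lies within `bandwidth` of it (cf. Eq. 5),
    give them a fresh instance id, and repeat until done.

    embedding: (H, W, D) array of pixel embeddings.
    fg_mask:   (H, W) boolean foreground mask.
    """
    H, W, D = embedding.shape
    emb = embedding.reshape(-1, D)
    instances = np.zeros(H * W, dtype=np.int32)   # 0 = background/unassigned
    unlabeled = fg_mask.reshape(-1).copy()
    next_id = 1
    while unlabeled.any():
        seed = emb[np.argmax(unlabeled)]          # first unlabeled pixel
        members = (np.linalg.norm(emb - seed, axis=1) < bandwidth) & unlabeled
        instances[members] = next_id
        unlabeled &= ~members
        next_id += 1
    return instances.reshape(H, W)
```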
Increasing robustness In a real-world problem the loss on the test set will not be zero, potentially causing our clustering algorithm for instance segmentation to make mistakes. If a cluster is not compact and we accidentally select an outlier to threshold around, it could happen that a real cluster gets predicted as two sub-clusters. To avoid this issue, we make the clustering more robust against outliers by applying a fast variant of the mean-shift algorithm [14]. As before, we select a random unlabeled pixel and threshold around its embedding. Next however, we calculate the mean of the selected group of embeddings and use the mean to threshold again. We repeat this process until mean convergence. This has the effect of moving to a high-density area in feature space, likely corresponding to a true cluster center. In the experiments section, we investigate the effect of this clustering algorithm by comparing against ground truth clustering, where the thresholding targets are calculated as an average embedding over the ground truth instance labels.
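A sketch of this mean-shift-like refinement, under the same assumptions as the thresholding sketch above; the iteration cap and the degenerate-seed guard are our own additions:

```python
import numpy as np

def cluster_mean_shift(embedding, fg_mask, bandwidth, max_iter=100):
    """Like cluster_by_thresholding, but after each selection the
    thresholding target is replaced by the mean of the selected
    embeddings, iterating until the mean converges."""
    H, W, D = embedding.shape
    emb = embedding.reshape(-1, D)
    instances = np.zeros(H * W, dtype=np.int32)
    unlabeled = fg_mask.reshape(-1).copy()
    next_id = 1
    while unlabeled.any():
        seed_idx = np.argmax(unlabeled)           # an unlabeled pixel as seed
        mean = emb[seed_idx]
        close = None
        for _ in range(max_iter):
            new_close = (np.linalg.norm(emb - mean, axis=1) < bandwidth) \
                        & unlabeled
            if not new_close.any():               # drifted away: keep last set
                break
            close = new_close
            new_mean = emb[close].mean(axis=0)
            if np.allclose(new_mean, mean):       # mean has converged
                break
            mean = new_mean
        if close is None:                         # degenerate: label seed alone
            close = np.zeros_like(unlabeled)
            close[seed_idx] = True
        instances[close] = next_id
        unlabeled &= ~close
        next_id += 1
    return instances.reshape(H, W)
```

Each outer iteration labels at least one pixel, so the loop terminates; re-thresholding around the running mean moves the selection towards a high-density region, as described above.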
3.3. Pros and cons

Our proposed method has some distinctive advantages and disadvantages compared to other methods, which we now discuss.

One big limitation of detect-and-segment approaches that is not immediately apparent from their excellent results on popular benchmarks is that they rely on the assumption that an object's segmentation mask can be unambiguously extracted from its bounding box. This is an implicit prior that is very effective for datasets like MS COCO and Pascal VOC, which contain relatively blobby objects that do not occlude each other in complex ways. However, the assumption is problematic for tasks where an object's bounding box conveys insufficient information to recover the object's segmentation mask. Consider the synthetic scattered sticks dataset shown in figure 4 as an example to illustrate the issue. When two sticks overlap like two crossed swords, their bounding boxes are highly overlapping. Given only a detection in the form of a bounding box, it is exceedingly hard to unambiguously extract a segmentation mask of the indicated object. Methods that rely on bounding boxes in their pipeline [17, 28, 29, 12, 13, 42, 4, 26, 30, 22, 5, 18] all suffer from this issue. In contrast, our method can handle such complex occlusions without problems, as it treats the image holistically and learns to reason about occlusions, but does not employ a computationally expensive CRF like [5]. Many real-world industrial or medical applications (conveyor-belt sorting systems, overlapping cell and chromosome segmentation, etc.) exhibit this kind of occlusions. To the best of our knowledge no sufficiently large datasets for such tasks are publicly available, which unfortunately prevents us from showcasing this particular strength of our method to the full.

Figure 4. Results on the synthetic scattered sticks dataset to illustrate that our approach is a good fit for problems with complex occlusions.

On the other hand, our method also has some drawbacks. Due to the holistic treatment of the image, our method performs well on datasets with a lot of similarity across the images (traffic scenes in Cityscapes or leaf configurations in CVPPP), but underperforms on datasets where objects can appear in random constellations and diverse settings, like Pascal VOC and MS COCO. A sliding-window detection-based approach with non-max suppression is more suited for such datasets. For example, if our method were trained on images with only one object, it would perform badly on an image that unexpectedly contained many of these objects. A detection-based approach has no trouble with this.
4. Experiments

We test our loss function on two instance segmentation datasets: the CVPPP Leaf Segmentation dataset and the Cityscapes Instance-Level Semantic Labeling Task. These datasets contain a median number of more than 15 instances per image. We also study the influence of the different components of our method and point out where there is room for improvement.

4.1. Datasets

CVPPP leaf segmentation The LSC competition of the CVPPP workshop [2] is a small but challenging benchmark. The task is to individually segment each leaf of a plant. The dataset [23] was developed to encourage the use of computer vision methods to aid in the study of plant phenotyping [24]. We use the A1 subset, which consists of 128 labeled images and 33 test images. [32] gives an overview of results on this dataset. We compare our performance with some of these works and two other recent approaches [31, 30]. We report two metrics defined in [32]: Symmetric Best Dice (SBD), which denotes the accuracy of the instance segmentation, and Absolute Difference in Count (|DiC|), which is the absolute value of the mean of the difference between the predicted number of leaves and the ground truth over all images.

Cityscapes The large-scale Cityscapes dataset [11] focuses on semantic understanding of urban street scenes. It has a benchmark for pixel-level and instance-level semantic segmentation. We test our method on the latter, using only the fine-grained annotations. The dataset is split up into 2975 training images, 500 validation images, and 1525 test images. We tune hyperparameters using the validation set and only use the train set to train our final model. We compare our results with the published works in the official leaderboard [1]. We report accuracy using 4 metrics defined in [11]: mean Average Precision (AP), mean Average Precision with overlap of 50% (AP0.5), AP50m and AP100m, where evaluation is restricted to objects within 50 m and 100 m distance, respectively.

4.2. Setup

Model architecture and general setup Since we want to stress the fact that our loss can be used with an off-the-shelf network, we use the ResNet-38 network [40], designed for semantic segmentation. We finetune from a model that was pre-trained on Cityscapes semantic segmentation.

In the following experiments, all models are trained using Adam, with a learning rate of 1e-4, on an NVidia Titan X GPU.

Leaf segmentation Since this dataset only consists of 128 images, we use online data augmentation to prevent the model from overfitting and to increase the overall robustness. We apply random left-right flips, random rotation with θ ∈ [0, 2π] and random scale deformation with s ∈ [1.0, 1.5]. All images are rescaled to 512x512 and concatenated with an x- and y-coordinate map with values between -1 and 1. We train the network with margins δv = 0.5, δd = 1.5, and 16 output dimensions. Foreground masks are included with the test set, since this challenge only focuses on instance segmentation.
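As an illustration of the input preparation described above, here is a NumPy sketch of the coordinate-map concatenation; the (H, W, C) array layout and the function name are our assumptions:

```python
import numpy as np

def add_coord_maps(image):
    """Concatenate x- and y-coordinate maps with values in [-1, 1]
    to an (H, W, C) image, as in the CVPPP setup described above."""
    H, W = image.shape[:2]
    xs = np.tile(np.linspace(-1.0, 1.0, W), (H, 1))           # (H, W)
    ys = np.tile(np.linspace(-1.0, 1.0, H)[:, None], (1, W))  # (H, W)
    return np.concatenate([image, xs[..., None], ys[..., None]], axis=-1)
```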
Cityscapes Our final model is trained on the training images, downsampled to 768 × 384. Because of the size and variability of the dataset, there is no need for extra data augmentation. We train the network with margins δv = 0.5, δd = 1.5, and 8 output dimensions. In contrast to the CVPPP dataset, Cityscapes is a multi-class instance segmentation challenge. Therefore, we run our loss function independently on every semantic class, so that instances belonging to the same class are far apart in feature space, whereas instances from different classes can occupy the same space. For example, the cluster centers of a pedestrian and a car that appear in the same image are not pushed away from each other. We use a pretrained ResNet-38 network [40] to generate segmentation masks for the semantic classes.
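A hypothetical sketch of this per-class application of the loss, reusing the discriminative_loss sketch from section 3.1; averaging over the classes present in the image is our own convention, not a detail from the paper:

```python
import torch

def multiclass_discriminative_loss(embedding, sem_labels, inst_labels,
                                   num_classes, **loss_kwargs):
    """Apply the discriminative loss independently per semantic class:
    instances of the same class repel each other in feature space,
    while instances of different classes may overlap.

    embedding:   (D, H, W) pixel embeddings.
    sem_labels:  (H, W) semantic class ids; 0 = background (assumed).
    inst_labels: (H, W) instance ids; 0 = unlabeled (assumed).
    """
    total = embedding.sum() * 0.0
    n_present = 0
    for cls in range(1, num_classes):
        mask = sem_labels == cls
        if not mask.any():
            continue
        # Keep instance ids only inside this class; elsewhere 0 (ignored).
        cls_inst = torch.where(mask, inst_labels,
                               torch.zeros_like(inst_labels))
        total = total + discriminative_loss(embedding, cls_inst, **loss_kwargs)
        n_present += 1
    return total / max(n_present, 1)
```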
4.3. Analysis of the individual components

The final result of the semantic instance segmentation is influenced by three main components: the performance of the network architecture with our loss function, the quality of the semantic labels, and the post-processing step. To disentangle the effects of the semantic segmentation and the post-processing and to point out potential for improvement, we run two extra experiments on the Cityscapes validation set:

• Semantic segmentation vs ground truth For the Cityscapes challenge, we rely on semantic segmentation masks to make a distinction between the different classes. Since our instance segmentation will discard regions not indicated in the semantic segmentation labels, the results will be influenced by the quality of the semantic segmentation masks. To measure the size of this influence, we also report performance with the ground truth semantic segmentation masks.

• Mean shift clustering vs ground truth clustering Since the output of our network needs to be clustered into discrete instances, the clustering method can potentially influence the accuracy of the overall instance segmentation. In this experiment, we measure this influence by clustering with our adapted mean shift clustering algorithm versus thresholding around the mean embeddings over the ground truth instance masks, as explained in section 3.2.

4.4. Results and discussion

Figure 5 shows some results of our method on the validation set of the CVPPP dataset. The network makes very few mistakes: only the segmentation of the smallest leaves and the leaf stalks sometimes shows a small error. Table 1 contains the numerical results. We achieve competitive results (SBD score of 84.2) that are on par with the state of the art (SBD score of 84.9). We outperform all non-deep-learning methods and also the recurrent instance segmentation of [31], with a method that is arguably less complex.

Table 1. Segmentation and counting performance on the test set of the CVPPP leaf segmentation challenge.

Method | SBD | |DiC|
RIS+CRF [31] | 66.6 | 1.1
MSU [32] | 66.7 | 2.3
Nottingham [32] | 68.3 | 3.8
Wageningen [44] | 71.1 | 2.2
IPK [25] | 74.4 | 2.6
PRIAn [15] | - | 1.3
End-to-end [30] | 84.9 | 0.8
Ours | 84.2 | 1.0

Figure 6 shows some visual results on the Cityscapes validation set. We see that even in difficult scenarios, with street scenes containing a lot of cars or pedestrians, our method often manages to identify the individual objects. Failure cases mostly involve the splitting up of a single object into multiple instances or incorrect merging of neighboring instances. This happens in the lower left example, where the two rightmost cars are merged. Another failure mode is incorrect semantic segmentation: in the lower right example, the semantic segmentation network accidentally mistakes an empty bicycle storage for actual bikes. The instance segmentation network is left no choice but to give it a shot, and tries to split up the imaginary bikes into individual objects. Nevertheless, we achieve competitive results on the Cityscapes leaderboard [1], outperforming all but one unpublished work [5]. Note that we perform on par with the MNC-based method SAIS [13, 18] on this dataset. See table 2 for a complete overview. A video of the results is available at https://youtu.be/FB_SZIKyX50.

Table 2. Segmentation performance on the test set of the Cityscapes instance segmentation benchmark.

Method | AP | AP0.5 | AP100m | AP50m
R-CNN+MCG | 4.6 | 12.9 | 7.7 | 10.3
FCN+Depth | 8.9 | 21.1 | 15.3 | 16.7
JGD | 9.8 | 23.2 | 16.8 | 20.3
InstanceCut | 13.0 | 27.9 | 22.1 | 26.1
Boundary-aware | 17.4 | 36.7 | 29.3 | 34.0
DWT | 19.4 | 35.3 | 31.4 | 36.8
Pixelwise DIN | 20.0 | 38.8 | 32.6 | 37.6
Mask R-CNN | 26.2 | 49.9 | 37.6 | 40.1
Ours | 17.5 | 35.9 | 27.8 | 31.0

As discussed in section 4.3, we are interested to know the influence of the quality of the semantic segmentations and the clustering algorithm on the overall performance. The results of these experiments can be found in table 3. As expected, the performance increases when we switch out a component with its ground truth counterpart. The effect of the semantic segmentation is the largest: comparing the first row (our method) to the third row, we see a large performance increase when replacing the ResNet-38 [40] semantic segmentation masks with the ground truth masks. This can be explained by the fact that the average precision metric is an average over the semantic classes. Some classes like tram, train and bus almost never have more than one instance per image, causing the semantic segmentation to have a big influence on this metric. It is clear that the overall performance can be increased by having better semantic segmentations.

Table 3. Effect of the semantic segmentation and clustering components on the performance of our method on the Cityscapes validation set. We study this by gradually replacing each component with its ground truth counterpart. Row 1 vs row 3: the quality of the semantic segmentation has a big influence on the overall performance. Row 1 vs 2 and row 3 vs 4: the effect of the clustering method is less pronounced but also shows room for improvement.

semantic segm. | clustering | AP | AP0.5
resnet38 [40] | mean-shift | 21.4 | 40.2
resnet38 [40] | center threshold | 22.9 | 44.1
ground truth | mean-shift | 37.5 | 58.5
ground truth | center threshold | 47.8 | 77.8

Figure 5. Some visual examples on the CVPPP leaf dataset (columns: input, ground truth, ours).

Figure 6. Some examples for different semantic classes on the Cityscapes instance segmentation validation set. Note some typical failure cases in the last row: incorrect merging of true instances and wrong semantic segmentation masks.


The last two entries of the table show the difference between ground truth clustering and mean shift clustering, both using ground truth segmentation masks. Here also, there is a performance gap. The main reason is that the loss on the validation set is not zero, which means the constraints imposed by the loss function are not met. Clustering using mean-shift will therefore not lead to perfect results. The effect is more pronounced for small instances, as also noticeable in the shown examples.

4.5. Speed-accuracy trade-off

To investigate the trade-off between speed, accuracy and memory requirements, we train four different network models at different resolutions and evaluate them on the car class of the Cityscapes validation set. This also illustrates the benefit that our method can be used with any off-the-shelf network designed for semantic segmentation.

Table 4 shows the results. We can conclude that ResNet-38 is best for accuracy, but requires some more memory. If speed is important, ENet is preferable to SegNet since it is much faster with almost the same accuracy. The results also show that running at a higher resolution than 768x384 does not increase accuracy much for the tested networks. Note that the post-processing step can be implemented efficiently, causing only a negligible overhead.
Table 4. Average Precision (AP), AP using ground truth segmentation labels (APgt), speed of the forward pass (fps), number of parameters (×10^6) and memory usage (GB) for different models evaluated on the car class of the Cityscapes validation set.

Model | Dim | AP | APgt | fps | #p (×10^6) | mem (GB)
ENet [27] | 512 x 256 | 0.19 | 0.21 | 145 | 0.36 | 1.00
ENet [27] | 768 x 384 | 0.21 | 0.25 | 94 | 0.36 | 1.03
ENet [27] | 1024 x 512 | 0.20 | 0.26 | 61 | 0.36 | 1.12
Segnet [6] | 512 x 256 | 0.20 | 0.22 | 27 | 29.4 | 1.22
Segnet [6] | 768 x 384 | 0.22 | 0.26 | 14 | 29.4 | 1.29
Segnet [6] | 1024 x 512 | 0.18 | 0.24 | 8 | 29.4 | 1.47
Dilation [45] | 512 x 256 | 0.21 | 0.24 | 15 | 134.3 | 2.20
Dilation [45] | 768 x 384 | 0.24 | 0.29 | 6 | 134.3 | 2.64
Dilation [45] | 1024 x 512 | 0.23 | 0.30 | 4 | 134.3 | 3.27
Resnet38 [40] | 512 x 256 | 0.24 | 0.27 | 12 | 124 | 4.45
Resnet38 [40] | 768 x 384 | 0.29 | 0.34 | 5 | 124 | 8.83

5. Conclusion

In this work we have proposed a discriminative loss function for the task of instance segmentation. After training, the output of the network can be clustered into discrete instances with a simple post-processing thresholding operation that is tailored to the loss function. Furthermore, we showed that our method can handle complex occlusions, as opposed to popular detect-and-segment approaches. Our method achieves competitive performance on two benchmarks. In this paper we still used a pretrained network to produce the semantic segmentation masks. We will investigate the joint training of instance and semantic segmentation with our loss function in future work.

Acknowledgement: This work was supported by Toyota, and was carried out at the TRACE Lab at KU Leuven (Toyota Research on Automated Cars in Europe - Leuven).

References

[1] Cityscapes dataset. https://www.cityscapes-dataset.com/. Accessed: November 2016.
[2] CVPPP dataset. http://www.plant-phenotyping.org/datasets-home. Accessed: November 2016.
[3] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[4] A. Arnab and P. H. Torr. Bottom-up instance segmentation using deep higher-order CRFs. In BMVC, 2016.
[5] A. Arnab and P. H. S. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
[6] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[7] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. arXiv preprint arXiv:1611.08303, 2016.
[8] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
[9] Y.-T. Chen, X. Liu, and M.-H. Yang. Multi-instance object segmentation with occlusion handling. In CVPR, 2015.
[10] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[12] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[13] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[14] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.
[15] M. V. Giuffrida, M. Minervini, and S. A. Tsaftaris. Learning to count leaves in rosette plants. 2016.
[16] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[17] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[18] Z. Hayder, X. He, and M. Salzmann. Shape-aware instance segmentation. arXiv preprint arXiv:1612.03129, 2016.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[20] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: from edges to instances with multicut. arXiv preprint arXiv:1611.08272, 2016.
[21] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288–2295. IEEE, 2012.
[22] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636, 2015.
[23] M. Minervini, A. Fischbach, H. Scharr, and S. A. Tsaftaris. Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognition Letters, 2015.
[24] M. Minervini, H. Scharr, and S. A. Tsaftaris. Image analysis: The new bottleneck in plant phenotyping [applications corner]. IEEE Signal Processing Magazine, 32(4):126–131, 2015.
[25] J.-M. Pape and C. Klukas. 3-D histogram-based segmentation and leaf detection for rosette plants. In ECCV, 2014.
[26] E. Park and A. C. Berg. Learning to decompose for object detection and instance segmentation. arXiv preprint arXiv:1511.06449, 2015.
[27] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
[28] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
[29] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
[30] M. Ren and R. S. Zemel. End-to-end instance segmentation and counting with recurrent attention. arXiv preprint arXiv:1605.09410, 2016.
[31] B. Romera-Paredes and P. H. Torr. Recurrent instance segmentation. In ECCV, 2016.
[32] H. Scharr, M. Minervini, A. P. French, C. Klukas, D. M. Kramer, X. Liu, I. Luengo, J.-M. Pape, G. Polder, D. Vukadinovic, et al. Leaf segmentation in plant phenotyping: a collation study. Machine Vision and Applications, 27(4):585–606, 2016.
[33] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[34] N. Silberman, D. Sontag, and R. Fergus. Instance segmentation of indoor scenes using a coverage loss. In ECCV, 2014.
[35] R. Stewart and M. Andriluka. End-to-end people detection in crowded scenes. In CVPR, 2016.
[36] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, 2014.
[37] J. Uhrig, M. Cordts, U. Franke, and T. Brox. Pixel-level encoding and depth layering for instance-level semantic labeling. In GCPR, 2016.
[38] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[39] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
[40] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
[41] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
[42] Y. Xu, Y. Li, M. Liu, Y. Wang, Y. Fan, M. Lai, E. I. Chang, et al. Gland instance segmentation by deep multichannel neural networks. arXiv preprint arXiv:1607.04889, 2016.
[43] Y. Yang, S. Hallman, D. Ramanan, and C. C. Fowlkes. Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1731–1743, 2012.
[44] X. Yin, X. Liu, J. Chen, and D. M. Kramer. Multi-leaf tracking from fluorescence plant videos. In ICIP, pages 408–412. IEEE, 2014.
[45] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[46] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with CNNs. In ICCV, 2015.
