Semantic Instance Segmentation With A Discriminative Loss Function
1. Introduction
…label an instance gets, as long as it is different from all other instance labels. One possible solution is to set an upper limit to the number of detectable instances and to impose extra constraints on the labeling, but this may unnecessarily limit the representational power of the network and introduce unwanted biases, leading to unsatisfying results.

Most recent works on instance segmentation with deep networks go a different route. Two popular approaches introduce a multi-stage pipeline with object proposals [17, 9, 28, 29, 12, 13, 42, 4], or train a recurrent network end-to-end with a custom loss function that outputs instances sequentially [26, 31, 30]. Another line of research is to train a network to transform the image into a representation that is clustered into individual instances with a post-processing step [34, 46, 37, 22]. Our method belongs to this last category, but takes a more principled (less ad-hoc) approach than previous works and reduces the post-processing step to a minimum.

Inspired by the success of siamese networks [8, 10] and the triplet loss [39, 33] in image classification, we introduce a discriminative loss function to replace the pixel-wise softmax loss that is commonly used in semantic segmentation. Our loss function forces the network to map each pixel in the image to an n-dimensional vector in feature space, such that feature vectors of pixels that belong to the same instance lie close together while feature vectors of pixels that belong to different instances lie far apart. The output of the network can easily be clustered with a fast and simple post-processing operation. With this mechanism, we optimize an objective that avoids the aforementioned problems related to a variable number of instances and permutation-invariance.

Our work mainly focuses on the loss function, as we aim to re-use network architectures that were designed for semantic segmentation: we plug in an off-the-shelf architecture and retrain the system with our discriminative loss function, as the sketch below illustrates. In our framework, the tasks of semantic and instance segmentation can be treated in a consistent and similar manner and do not require changes on the architecture side.
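As a concrete illustration of this plug-and-play setup, consider the following minimal PyTorch sketch. The backbone choice (torchvision's FCN-ResNet50) and the embedding dimensionality n = 8 are assumptions for the example, not the configuration used in this paper; any segmentation network would do.

```python
# Reuse an off-the-shelf semantic segmentation architecture unchanged, but set
# its output channels to the embedding dimension n instead of a class count.
import torch
import torchvision

n = 8  # dimensionality of the pixel embedding space (a free choice)

model = torchvision.models.segmentation.fcn_resnet50(num_classes=n)

images = torch.randn(4, 3, 256, 256)   # dummy batch
embeddings = model(images)["out"]      # (4, n, 256, 256): one n-d vector per pixel
# Training then replaces the usual pixel-wise softmax loss with the
# discriminative loss of section 3.1, leaving the architecture untouched.
```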
The rest of this paper is structured as follows. First we give an extensive overview of the related work in section 2. In section 3 we discuss our proposed method in detail. In section 4 we set up experiments on two instance segmentation benchmarks and show that our performance is competitive with the state-of-the-art.

2. Related Work

In the last few years, deep networks have achieved impressive results in semantic and instance segmentation. All top-performing methods across different benchmarks use a deep network in their pipeline. Here we discuss these prior works and situate our model between them.

Proposal-based Many instance segmentation approaches build a multi-stage pipeline with a separate object proposal and classification step. Hariharan et al. [17] and Chen et al. [9] use MCG [3] to generate category-independent region proposals, followed by a classification step. Pinheiro et al. [28, 29] use the same general approach, but their work focuses on generating segmentation proposals with a deep network. Dai et al. [12, 13] won the 2015 MS-COCO instance segmentation challenge with a cascade of networks (MNC) to merge bounding boxes, segmentation masks and category information. Many works were inspired by this approach and also combine an object detector with a semantic segmentation network to produce instances [42, 4, 18]. In contrast to these works, our method does not rely on object proposals or bounding boxes but treats the image holistically, which we show to be beneficial for handling certain tasks with complex occlusions, as discussed in section 3.3.

Recurrent methods Other recent works [26, 31, 30] employ recurrent networks to generate the individual instances sequentially. Stewart et al. [35] train a network for end-to-end object detection using an LSTM [19]. Their loss function is permutation-invariant, as it incorporates the Hungarian algorithm to match candidate hypotheses to ground-truth instances. Inspired by their work, Romera et al. [31] propose an end-to-end recurrent network with convolutional LSTMs that sequentially outputs a binary segmentation map for each instance. Ren et al. [30] improve upon [31] by adding a box network to confine segmentations within a local window, and skip connections instead of graphical models to restore the resolution at the output. Their final framework consists of four major components: an external memory and networks for box proposal, segmentation and scoring. We argue that our proposed method is conceptually simpler and easier to implement than these methods. Our method does not involve recurrent mechanisms and can work with any off-the-shelf segmentation architecture. Moreover, our loss function is permutation-invariant by design, without the need to resort to a Hungarian algorithm.

Clustering Another approach is to transform the image into a representation that is subsequently clustered into discrete instances. Silberman et al. [34] produce a segmentation tree and use a coverage loss to cut it into non-overlapping regions. Zhang et al. [46] impose an ordering on the individual instances based on their depth, and use an MRF to merge overlapping predicted patches into a coherent segmentation. Two earlier works [43, 36] also use depth information to segment instances. Uhrig et al. [37] train a network to predict each pixel's direction towards its instance center, along with monocular depth and semantic labels. They use template matching and proposal fusion techniques to extract the individual instances from this representation. Liang et al. [22] predict pixel-wise feature vectors representing the ground truth bounding box of the instance each pixel belongs to. With the help of a sub-network that predicts an object count, they cluster the output of the network into individual instances. Our work is similar to these works in that we have a separate clustering step, but our loss does not constrain the output of the network to a specific representation like instance center coordinates or depth ordering; it is less ad-hoc in that sense.
Other Bai et al. [7] use deep networks to directly learn the energy of the watershed transform. A drawback of this bottom-up approach is that it cannot handle occlusions where instances are separated into multiple pieces. Kirillov et al. [20] use a CRF, but with a novel MultiCut formulation, to combine semantic segmentations with edge maps and extract instances as connected regions. A shortcoming of this method is that, although it reasons globally about instances, it also cannot handle occlusions. Arnab et al. [5] combine an object detector with a semantic segmentation module using a CRF model. By considering the image holistically, their method can handle occlusions and produce more precise segmentations.

Loss function Our loss function is inspired by earlier works on distance metric learning, discriminative loss functions and siamese networks [8, 10, 16, 39, 21]. Most similar to our loss function, Weinberger et al. [39] propose to learn a distance metric for large margin nearest neighbor classification. Köstinger et al. [21] further explore a similar LDA-based objective. More recently, Schroff et al. [33], building on Sun et al. [38], introduced the triplet loss for face recognition. The triplet loss enforces a margin between each pair of faces from one person and all other faces. Xie et al. [41] propose a clustering objective for unsupervised learning. Whereas these works employ a discriminative loss function to optimize distances between images in a dataset, our method operates at the pixel level, optimizing distances between individual pixels in an image. To our knowledge, we are the first to successfully use a discriminative loss based on distance metric learning principles for the task of instance segmentation with deep networks.

3. Method

3.1. Discriminative loss function

Consider a differentiable function that maps each pixel in an input image to a point in n-dimensional feature space, referred to as the pixel embedding. The intuition behind our loss function is that embeddings with the same label (same instance) should end up close together, while embeddings with a different label (different instance) should end up far apart.

Weinberger et al. [39] propose a loss function with two competing terms to achieve this objective: a term to penalize large distances between embeddings with the same label, and a term to penalize small distances between embeddings with a different label.

In our loss function we keep the first term, but replace the second term with a more tractable one: instead of directly penalizing small distances between every pair of differently-labeled embeddings, we only penalize small distances between the mean embeddings of different labels. If the number of different labels is smaller than the number of inputs, this is computationally much cheaper than calculating the distances between every pair of embeddings. This is a valid assumption for instance segmentation, where there are orders of magnitude fewer instances than pixels in an image.
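To put numbers on this (an illustrative calculation, not from the paper): a 512 × 512 image contains N ≈ 2.6 × 10⁵ pixels, and a typical scene has on the order of C = 10 instances. Penalizing every pair of differently-labeled embeddings involves on the order of N² ≈ 7 × 10¹⁰ pairwise distances, whereas penalizing mean embeddings requires only the N distances of pixels to their cluster centers plus C(C − 1)/2 = 45 distances between center pairs.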
We now formulate our discriminative loss in terms of push (i.e. repelling) and pull forces between and within clusters. A cluster is defined as a group of pixel embeddings sharing the same label, e.g. pixels belonging to the same instance. Our loss consists of three terms:

1. variance term: an intra-cluster pull-force that draws embeddings towards the mean embedding, i.e. the cluster center.

2. distance term: an inter-cluster push-force that pushes clusters away from each other, increasing the distance between the cluster centers.

3. regularization term: a small pull-force that draws all clusters towards the origin, to keep the activations bounded.

The variance and distance terms are hinged: their forces are only active up to a certain distance. Embeddings within a distance of δv from their cluster center are no longer attracted to it, which means that they can exist on a local manifold in feature space rather than having to converge to a single point. Analogously, cluster centers further apart than 2δd are no longer repulsed and can move freely in feature space. Hinging the forces relaxes the constraints on the network, giving it more representational power to achieve its goal. The interacting forces in feature space are illustrated in figure 2.

[Figure 2. The intra-cluster pulling force pulls embeddings towards the cluster center, i.e. the mean embedding of that cluster. The inter-cluster repelling force pushes cluster centers away from each other. Both forces are hinged: they are only active up to a certain distance determined by the margins δv and δd, denoted by the dotted circles. This diagram is inspired by a similar one in [39].]
The loss function can also be written down exactly. We use the following definitions: C is the number of clusters in the ground truth, Nc is the number of elements in cluster c, xi is an embedding, µc is the mean embedding of cluster c (the cluster center), ‖·‖ is the L1 or L2 distance, and [x]+ = max(0, x) denotes the hinge. δv and δd are respectively the margins for the variance and distance loss. The loss can then be written as follows:

$$L_{\mathrm{var}} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\big[\lVert \mu_c - x_i \rVert - \delta_v\big]_+^2$$

$$L_{\mathrm{dist}} = \frac{1}{C(C-1)}\sum_{c_A=1}^{C}\;\sum_{\substack{c_B=1 \\ c_B \neq c_A}}^{C}\big[2\delta_d - \lVert \mu_{c_A} - \mu_{c_B} \rVert\big]_+^2$$

$$L_{\mathrm{reg}} = \frac{1}{C}\sum_{c=1}^{C}\lVert \mu_c \rVert$$

$$L = \alpha \cdot L_{\mathrm{var}} + \beta \cdot L_{\mathrm{dist}} + \gamma \cdot L_{\mathrm{reg}}$$

where we set α = β = 1 and γ = 0.001 in our experiments. Importantly, the dimensionality of the feature space is independent of the number of instances that need to be segmented. Figure 3 depicts the convergence of our loss function when overfitting on a single image with 15 instances, in a 2-dimensional feature space.
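For concreteness, here is a minimal single-image PyTorch sketch of these three terms. The flattened (N, D) / (N,) input layout, the margin defaults and the explicit per-cluster loop are our illustrative choices, not a reference implementation.

```python
import torch

def discriminative_loss(embeddings, labels, delta_v=0.5, delta_d=1.5,
                        alpha=1.0, beta=1.0, gamma=0.001):
    """Single-image discriminative loss.

    embeddings: (N, D) pixel embeddings; labels: (N,) ground-truth instance ids.
    """
    ids = labels.unique()
    C = len(ids)

    # Cluster centers mu_c: mean embedding of each ground-truth instance.
    centers = torch.stack([embeddings[labels == c].mean(dim=0) for c in ids])

    # Variance term: hinged pull of each embedding towards its cluster center.
    l_var = embeddings.new_zeros(())
    for k, c in enumerate(ids):
        d = (embeddings[labels == c] - centers[k]).norm(dim=1)
        l_var = l_var + (d - delta_v).clamp(min=0).pow(2).mean()
    l_var = l_var / C

    # Distance term: hinged push between every pair of distinct cluster centers.
    l_dist = embeddings.new_zeros(())
    if C > 1:
        pair = torch.cdist(centers, centers)           # (C, C), L2 by default
        hinge = (2 * delta_d - pair).clamp(min=0).pow(2)
        hinge = hinge * (1 - torch.eye(C, device=embeddings.device))  # drop c_A == c_B
        l_dist = hinge.sum() / (C * (C - 1))

    # Regularization term: small pull of all centers towards the origin.
    l_reg = centers.norm(dim=1).mean()

    return alpha * l_var + beta * l_dist + gamma * l_reg
```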
3.2. Post-processing

When the variance and distance terms of the loss are zero, the following is true:

• all embeddings are within a distance of δv from their cluster center;

• all cluster centers are at least 2δd apart.

A simple thresholding procedure, sketched below, then suffices to recover the clusters.
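In particular, if δd > δv, every embedding of another cluster lies at least 2δd − δv > δv away from a given cluster center, so thresholding around a center with bandwidth δv selects exactly one instance. The sketch below is our illustration of such a post-processing step, not the paper's exact algorithm: it seeds at an unlabeled pixel, refines the seed with a few mean-shift steps, and thresholds around the result.

```python
import torch

def cluster_embeddings(embeddings, delta_v=0.5, shift_steps=10):
    """Greedily cluster (N, D) pixel embeddings into (N,) instance labels."""
    N = embeddings.shape[0]
    labels = torch.full((N,), -1, dtype=torch.long)
    next_id = 0
    while (labels == -1).any():
        seed = int((labels == -1).nonzero()[0])
        center = embeddings[seed]
        # A few mean-shift steps move the seed towards its cluster center.
        for _ in range(shift_steps):
            within = (embeddings - center).norm(dim=1) < delta_v
            if not within.any():
                break
            center = embeddings[within].mean(dim=0)
        # Threshold around the (approximate) center with bandwidth delta_v.
        members = ((embeddings - center).norm(dim=1) < delta_v) & (labels == -1)
        if not members.any():
            members = torch.zeros(N, dtype=torch.bool)
            members[seed] = True      # fallback guarantees termination
        labels[members] = next_id
        next_id += 1
    return labels
```

When the loss is not exactly zero, the same loop still acts as a fast approximate decoder, in line with the goal of keeping the post-processing step to a minimum.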
[Figure 6. Some examples for different semantic classes on the Cityscapes instance segmentation validation set. Note some typical failure cases in the last row: incorrect merging of true instances and wrong semantic segmentation masks.]