
Segmentation-Aware Convolutional Networks Using Local Attention Masks

Adam W. Harley Konstantinos G. Derpanis Iasonas Kokkinos


Carnegie Mellon University Ryerson University Facebook AI Research
arXiv:1708.04607v1 [cs.CV] 15 Aug 2017

Abstract

We introduce an approach to integrate segmentation information within a convolutional neural network (CNN). This counter-acts the tendency of CNNs to smooth information across regions and increases their spatial precision. To obtain segmentation information, we set up a CNN to provide an embedding space where region co-membership can be estimated based on Euclidean distance. We use these embeddings to compute a local attention mask relative to every neuron position. We incorporate such masks in CNNs and replace the convolution operation with a "segmentation-aware" variant that allows a neuron to selectively attend to inputs coming from its own region. We call the resulting network a segmentation-aware CNN because it adapts its filters at each image point according to local segmentation cues. We demonstrate the merit of our method on two widely different dense prediction tasks, that involve classification (semantic segmentation) and regression (optical flow). Our results show that in semantic segmentation we can match the performance of DenseCRFs while being faster and simpler, and in optical flow we obtain clearly sharper responses than networks that do not use local attention masks. In both cases, segmentation-aware convolution yields systematic improvements over strong baselines. Source code for this work is available online at http://cs.cmu.edu/~aharley/segaware.

Figure 1: Segmentation-aware convolution filters are invariant to backgrounds. We achieve this in three steps: (i) compute segmentation cues for each pixel (i.e., "embeddings"), (ii) create a foreground mask for each patch, and (iii) combine the masks with convolution, so that the filters only process the local foreground in each image patch.

1. Introduction

Convolutional neural networks (CNNs) have recently made rapid progress in pixel-wise prediction tasks, including depth prediction [15], optical flow estimation [14], and semantic segmentation [47, 9, 34]. This progress has been built on the remarkable success of CNNs in image classification tasks [29, 50] – indeed, most dense prediction models are based closely on architectures that were successful in object recognition. While this strategy facilitates transfer learning, it also brings design elements that are incompatible with dense prediction.

By design CNNs typically produce feature maps and predictions that are smooth and low-resolution, resulting from the repeated pooling and subsampling stages in the network architecture, respectively. These stages play an important role in the hierarchical consolidation of features, and widen the higher layer effective receptive fields. The low-resolution issue has received substantial attention: for instance methods have been proposed for replacing the subsampling layers with resolution-preserving alternatives such as atrous convolution [9, 58, 43], or restoring the lost resolution via upsampling stages [39, 34]. However, the issue of smoothness has remained relatively unexplored. Smooth neuron outputs result from the spatial pooling (i.e., abstraction) of information across different regions. This can be useful in high-level tasks, but can degrade accuracy on per-pixel prediction tasks where rapid changes in activation may be required, e.g., around region boundaries or motion discontinuities.

To address the issue of smoothness, we propose segmentation-aware convolutional networks, which operate as illustrated in Figure 1. These networks adjust their behavior on a per-pixel basis according to segmentation cues, so that the filters can selectively "attend" to information coming from the region containing the neuron, and treat it differently from background signals. To achieve this, we complement each image patch with a local foreground-background segmentation mask that acts like a gating mechanism for the information feeding into the neuron. This avoids feature blurring, by reducing the extent to which foreground and contextual information is mixed, and allows neuron activation levels to change rapidly, by dynamically adapting the neuron's behavior to the image content. This goes beyond sharpening the network outputs post-hoc, as is currently common practice; it fixes the blurring problem "before the damage is done", since it can be integrated at both early and later stages of a CNN.

The general idea of combining filtering with segmentation to enhance sharpness dates back to nonlinear image processing [42, 53] and segmentation-aware feature extraction [54, 55]. Apart from showing that this technique successfully carries over to CNNs, another contribution of our work consists in using the network itself to obtain segmentation information, rather than relying on hand-crafted pipelines. In particular, as in an earlier version of this work [23], we use a contrastive side loss to train the "segmentation embedding" branch of our network, so that we can then construct segmentation masks using embedding distances.

There are three steps to creating segmentation-aware convolutional nets, described in Sections 3.1-3.4: (i) learn segmentation cues, (ii) use the cues to create local foreground masks, and (iii) use the masks together with convolution, to create foreground-focused convolution. Our approach realizes each of these steps in a unified manner that is at once general (i.e., applicable to both discrete and continuous prediction tasks), differentiable (i.e., end-to-end trainable as a neural network), and fast (i.e., implemented as GPU-optimized variants of convolution).

Experiments show that minimally modifying existing CNN architectures to use segmentation-aware convolution yields substantial gains in two widely different task settings: dense discrete labelling (i.e., semantic segmentation), and dense regression (i.e., optical flow estimation). Source code for this work is available online at http://cs.cmu.edu/~aharley/segaware.

2. Related work

This work builds on a wide range of research topics. The first is metric learning. The goal of metric learning is to produce features from which one can estimate the similarity between pixels or regions in the input [18]. Bromley et al. [5] influentially proposed learning these descriptors in a convolutional network, for signature verification. Subsequent related work has yielded compelling results for tasks such as wide-baseline stereo correspondence [20, 59, 60], and face verification [11]. Recently, the topic of metric learning has been studied extensively in conjunction with image descriptors, such as SIFT and SID [54, 49, 3], improving the applicability of those descriptors to patch-matching problems. Most prior work in metric learning has been concerned with the task of finding one-to-one correspondences between pixels seen from different viewpoints. In contrast, the focus of our work is (as in our prior work [23]) to bring a given point close to all of the other points that lie in the same object. This requires a higher degree of invariance than before – not only to rotation, scale, and partial occlusion, but also to the interior appearance details of objects. Concurrent work has targeted a similar goal, for body joints [38] and instance segmentation [17]. We refer to the features that produce these invariances as embeddings, as they embed pixels into a space where the quality of correspondences can be measured as a distance.

The embeddings in our work are used to generate local attention masks to obtain segmentation-aware feature maps. The resulting features are meant to capture the appearance of the foreground (relative to a given point), while being invariant to changes in the background or occlusions. To date, related work has focused on developing handcrafted descriptors that have this property. For instance, soft segmentation masks [41, 32] and boundary cues [36, 48] have been used to develop segmentation-aware variants of handcrafted features, like SIFT and HOG, effectively suppressing contributions from pixels likely to come from the background [54, 55]. More in line with the current paper are recent works that incorporate segmentation cues into CNNs, by sharpening or masking intermediate feature maps with the help of superpixels [12, 19]. This technique adds spatial structure to multiple stages of the pipeline. In all of these works, the affinities are defined in a handcrafted manner, and are typically pre-computed in a separate process. In contrast, we learn the cues directly from image data, and compute the affinities densely and "on the fly" within a CNN. Additionally, we combine the masking filters with arbitrary convolutional filters, allowing any layer (or even all layers) to perform segmentation-aware convolution.

Concurrent work in language modelling [13] and image generation [40] has also emphasized the importance of locally masked (or "gated") convolutions. Unlike these works, our approach uniquely makes use of embeddings to measure context relevance, which lends interpretability to the masks, and allows for task-agnostic pre-training. Similar attention mechanisms are being used in visual [35] and non-visual [52] question answering tasks. These works use a question to construct a single or a limited sequence of globally-supported attention signals. Instead, we use convolutional embeddings, and efficiently construct local attention masks in "batch mode" around the region of any given neuron.
Another relevant thread of works relates to efforts on mitigating the low-resolution and spatially-imprecise predictions of CNNs. Approaches to counter the spatial imprecision weakness can be grouped into preventions (i.e., methods integrated early in the CNN), and cures (i.e., post-processes). A popular preventative method is atrous convolution (also known as "dilated" convolution) [9, 58], which allows neurons to cover a wider field of view with the same number of parameters. Our approach also adjusts neurons' field of view, but focuses it toward the local foreground, rather than widening it in general. The "cures" aim to restore resolution or sharpness after it has been lost. For example, one effective approach is to add trainable upsampling stages to the network, via "deconvolution" layers [39, 34]. A complementary approach is to stack features from multiple resolutions near the end of the network, so that the final stages have access to both high-resolution (shallow) features and low-resolution (deep) features [22, 37, 14]. Sharpening can be done outside of the CNN, e.g., using edges found in the image [8, 4], or using a dense conditional random field (CRF) [28, 9, 58]. Recently, the CRF approach has been integrated more closely with the CNN, by framing the CRF as a recurrent network, and chaining it to the backpropagation of the underlying CNN [61]. We make connections and extensions to CRFs in Section 3.3 and provide comparisons in Section 5.1.

3. Technical approach

The following subsections describe the main components of our approach. We begin by learning segmentation cues (Sec. 3.1). We formulate this as a task of finding "segmentation embeddings" for the pixels. This step yields features that allow region similarity to be measured as a distance in feature-space. That is, if two pixels have nearby embeddings, then they likely come from the same region. We next create soft segmentation masks from the embeddings (Sec. 3.2). Our approach generalizes the bilateral filter [31, 2, 51, 53], which is a technique for creating adaptive smoothing filters that preserve object boundaries. Noting that CRFs make heavy use of bilateral filters to sharpen posterior estimates, we next describe how to simplify and improve CRFs using our segmentation-aware masks (Sec. 3.3). Finally, in Sec. 3.4 we introduce segmentation-aware convolution, where we merge segmentation-aware masks with intermediate convolution operations, giving rise to segmentation-aware networks.

3.1. Learning segmentation cues

The first goal of our work is to obtain segmentation cues. In particular, we desire features that can be used to infer – for each pixel – what other pixels belong to the same object (or scene segment).

Given an RGB image, I, made up of pixels, p ∈ R³ (i.e., 3D vectors encoding color), we learn an embedding function that maps (i.e., embeds) the pixels into a feature space where semantic similarity between pixels can be measured as a distance [11]. Choosing the dimensionality of that feature space to be D = 64, we can write the embedding function as f : R³ → R^D, or more specifically, f(p) = e, where e is the embedding for pixel p.

Figure 2: Visualization of the goal for pixel embeddings. For any two pixels sampled from the same object, the embeddings should have a small relative distance. For any two pixels sampled from different objects, the embeddings should have a large distance. The embeddings are illustrated in 2D; in principle, they can have any dimensionality.

Pixel pairs that lie on the same object should produce similar embeddings (i.e., a short distance in feature-space), and pairs from different objects should produce dissimilar embeddings (i.e., a large distance in feature-space). Figure 2 illustrates this goal with 2D embeddings. Given semantic category labels for the pixels as training data, we can represent the embedding goal as a loss function over pixel pairs. For any two pixel indices i and j, and corresponding embeddings e_i, e_j and object class labels l_i, l_j, we can optimize the same-label pairs to have "near" embeddings, and the different-label pairs to have "far" embeddings. Using α and β to denote the "near" and "far" thresholds, respectively, we can define the pairwise loss as

    ℓ_{i,j} = max(‖e_i − e_j‖ − α, 0)   if l_i = l_j,
    ℓ_{i,j} = max(β − ‖e_i − e_j‖, 0)   if l_i ≠ l_j,        (1)

where ‖·‖ denotes a vector norm. We find that embeddings learned from L1 and L2 norms are similar, but L1-based embeddings are less vulnerable to exploding gradients. For thresholds, we use α = 0.5, and β = 2. In practice, the specific values of α and β are unimportant, so long as α ≤ β and the remainder of the network can learn to compensate for the scale of the resulting embeddings, e.g., through λ in upcoming Eq. 3.

To quantify the overall quality of the embedding function, we simply sum the pairwise losses (Eq. 1) across the image. Although for an image with N pixels there are N² pairs to evaluate, we find it is effective to simply sample pairs from a neighborhood around each pixel, as in

    L = Σ_{i ∈ N} Σ_{j ∈ N_i} ℓ_{i,j},        (2)

where j ∈ N_i iterates over the spatial neighbors of index i. In practice, we use three overlapping 3 × 3 neighborhoods, with atrous factors [9] of 1, 2, and 5. We train a fully-convolutional CNN to minimize this loss through stochastic gradient descent. The network design is detailed in Sec. 4.
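The sampled loss is straightforward to express in code. The following is a minimal NumPy sketch of Eqs. 1-2, not the Caffe layers used in this work: it assumes an embedding map emb of shape (D, H, W), an integer label map labels of shape (H, W), the L1 norm, and the three dilated 3 × 3 neighborhoods described above.

```python
import numpy as np

def pairwise_embedding_loss(emb, labels, rates=(1, 2, 5), alpha=0.5, beta=2.0):
    """Sampled version of Eqs. 1-2: hinge losses on L1 embedding distances
    between each pixel and its neighbors in dilated 3x3 neighborhoods."""
    D, H, W = emb.shape
    total = 0.0
    for r in rates:                         # three overlapping 3x3 neighborhoods
        for dy in (-r, 0, r):
            for dx in (-r, 0, r):
                if dy == 0 and dx == 0:
                    continue
                # overlapping crops pair each pixel i with its neighbor at offset (dy, dx)
                ys, xs = slice(max(dy, 0), H + min(dy, 0)), slice(max(dx, 0), W + min(dx, 0))
                yn, xn = slice(max(-dy, 0), H + min(-dy, 0)), slice(max(-dx, 0), W + min(-dx, 0))
                dist = np.abs(emb[:, ys, xs] - emb[:, yn, xn]).sum(axis=0)   # L1 distance
                same = labels[ys, xs] == labels[yn, xn]
                loss = np.where(same,
                                np.maximum(dist - alpha, 0.0),   # pull same-label pairs together
                                np.maximum(beta - dist, 0.0))    # push different-label pairs apart
                total += loss.sum()
    return total
```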

3.2. Segmentation-aware bilateral filtering

The distance between the embedding at one index, e_i, and any other embedding, e_j, provides a magnitude indicating whether or not i and j fall on the same object. We can convert these magnitudes into (unnormalized) probabilities, using the exponential distribution:

    m_{i,j} = exp(−λ‖e_i − e_j‖),        (3)

where λ is a learnable parameter specifying the hardness of this decision, and the notation m_{i,j} denotes that i is the reference pixel, and j is the neighbor being considered. In other words, considering all indices j ∈ N_i, m_i represents a foreground-background segmentation mask, where the central pixel i is defined as the foreground, i.e., m_{i,i} = 1. Figure 3 shows examples of the learned segmentation masks (and the intermediate embeddings), and compares them with masks computed from color distances. In general, the learned semantic embeddings successfully generate accurate foreground-background masks, whereas the color-based embeddings are not as reliable.

Figure 3: Embeddings and local masks are computed densely for input images. For four locations in the image shown on the left, the figure shows (left-to-right) the extracted patch, the embeddings (compressed to three dimensions by PCA for visualization), the embedding-based mask, and the mask generated by color distance.

A first application of these masks is to perform a segmentation-aware smoothing (of pixels, features, or predictions). Given an input feature x_i, we can compute a segmentation-aware smoothed result, y_i, as follows:

    y_i = ( Σ_k x_{i−k} m_{i,i−k} ) / ( Σ_k m_{i,i−k} ),        (4)

where k is a spatial displacement from index i. Equation 4 has some interesting special cases, which depend on the underlying indexed embeddings e_j:

• if e_j = 0, the equation yields the average filter;
• if e_j = i, the equation yields Gaussian smoothing;
• if e_j = (i, p_i), where p_i denotes the color vector at i, the equation yields bilateral filtering [31, 2, 51, 53].

Since the embeddings are learned in a CNN, Eq. 4 represents a generalization of all these cases. For comparison, Jampani et al. [25] propose to learn the kernel used in the bilateral filter, but keep the arguments to the similarity measure (i.e., e_i) fixed. In our work, by training the network to provide convolutional embeddings, we additionally learn the arguments of the bilateral distance function.

When the embeddings are integrated into a larger network that uses them for filtering, the embedding loss function (Eq. 2) is no longer necessary. Since all of the terms in the filter function (Eq. 4) are differentiable, the global objective (e.g., classification accuracy) can be used to tune not only the input terms, x_i, but also the mask terms, m_{i,j}, and their arguments, e_j. Therefore, the embeddings can be learned end-to-end in the network when used to create masks. In our work, we first train the embeddings with a dedicated loss, then fine-tune them in the larger pipeline in which they are used for masks.

Figure 4 shows an example of how segmentation-aware bilateral filtering sharpens FC8 predictions in practice.

Figure 4: Segmentation-aware bilateral filtering. Given an input image (left), a CNN typically produces a smooth prediction map (middle top). Using learned per-pixel embeddings (middle bottom), we adaptively smooth the FC8 feature map with our segmentation-aware bilateral filter (right).
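As a concrete, CPU-only illustration of Eqs. 3-4, the sketch below computes the local masks and the mask-normalized average in plain NumPy; the array shapes, the L1 distance, the fixed window radius, and the fixed λ are assumptions of the example rather than the learned, GPU-based layer described in Sec. 4.2.

```python
import numpy as np

def segaware_bilateral_filter(x, emb, radius=4, lam=1.0):
    """Eqs. 3-4: per-pixel soft masks from embedding distances, then a
    mask-normalized average of the input map x.
    x: (C, H, W) features to smooth; emb: (D, H, W) pixel embeddings."""
    C, H, W = x.shape
    num = np.zeros_like(x, dtype=float)
    den = np.zeros((H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # align each reference pixel i with its neighbor i - k, where k = (dy, dx)
            ys, xs = slice(max(dy, 0), H + min(dy, 0)), slice(max(dx, 0), W + min(dx, 0))
            yn, xn = slice(max(-dy, 0), H + min(-dy, 0)), slice(max(-dx, 0), W + min(-dx, 0))
            d = np.abs(emb[:, ys, xs] - emb[:, yn, xn]).sum(axis=0)   # L1 embedding distance
            m = np.exp(-lam * d)                                      # Eq. 3: soft mask
            num[:, ys, xs] += m * x[:, yn, xn]                        # numerator of Eq. 4
            den[ys, xs] += m                                          # denominator of Eq. 4
    return num / den          # out-of-image neighbors are simply skipped; the center term keeps den > 0
```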
3.3. Segmentation-aware CRFs

Segmentation-aware bilateral filtering can be used to improve CRFs. As discussed earlier, dense CRFs [28] are effective at sharpening the prediction maps produced by CNNs [9, 61]. These models optimize a Gibbs energy given by

    E(x) = Σ_i ψ_u(x_i) + Σ_i Σ_{j≤i} ψ_p(x_i, x_j),        (5)
where i ranges over all pixel indices in the image. In semantic segmentation, the unary term ψ_u is typically chosen to be the negative log probability provided by a CNN trained for per-pixel classification. The pairwise potentials take the form ψ_p(x_i, x_j) = µ(x_i, x_j) k(f_i, f_j), where µ is a label compatibility function (e.g., the Potts model), and k(f_i, f_j) is a feature compatibility function. The feature compatibility is composed of an appearance term (a bilateral filter), and a smoothness term (an averaging filter), in the form

    k(f_i, f_j) = w^1 exp( −‖i − j‖²/(2θ_α²) − ‖p_i − p_j‖²/(2θ_β²) ) + w^2 exp( −‖i − j‖²/(2θ_γ²) ),        (6)

where the w^k are weights on the two terms. Combined with the label compatibility function, the appearance term adds a penalty if a pair of pixels are assigned the same label but have dissimilar colors. To be effective, these filtering operations are carried out with extremely wide filters (e.g., the size of the image), which necessitates using a data structure called a permutohedral lattice [1].

Motivated by our earlier observation that learned embeddings are a stronger semantic similarity signal than color (see Fig. 3), we replace the color vector p_i in Eq. 6 with the learned embedding vector e_i. The permutohedral lattice would be inefficient for such a high-dimensional filter, but we find that the signal provided by the embeddings is rich enough that we can use small filters (e.g., 13 × 13), and achieve the same (or better) performance. This allows us to implement the entire CRF with standard convolution operators, reduce computation time by half, and backpropagate through the CRF into the embeddings.
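For reference, a small NumPy sketch of the feature compatibility in Eq. 6 with the learned embedding substituted for the color vector is shown below; the weights and bandwidths are placeholder values, not the cross-validated settings used in the experiments, and the positions and embeddings are assumed to be 1-D arrays.

```python
import numpy as np

def pairwise_kernel(pos_i, pos_j, emb_i, emb_j,
                    w1=1.0, w2=1.0, theta_alpha=8.0, theta_beta=0.5, theta_gamma=3.0):
    """Feature compatibility k(f_i, f_j) of Eq. 6, with the color vector p_i
    replaced by the learned embedding e_i. pos_* are 2-vectors, emb_* are D-vectors."""
    d_pos = np.sum((pos_i - pos_j) ** 2)      # ||i - j||^2
    d_emb = np.sum((emb_i - emb_j) ** 2)      # ||e_i - e_j||^2 (in place of ||p_i - p_j||^2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2) - d_emb / (2 * theta_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness
```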
3.4. Segmentation-aware convolution

The bilateral filter in Eq. 4 is similar in form to convolution, but with a non-linear sharpening mask instead of a learned task-specific filter. In this case, we can have the benefits of both, by inserting the learned convolution filter, t, into the equation:

    y_i = ( Σ_k x_{i−k} m_{i,i−k} t_k ) / ( Σ_k m_{i,i−k} ).        (7)

This is a non-linear convolution: the input signal is multiplied pointwise by the normalized local mask before forming the inner product with the learned filter. If the learned filter t_i is all ones, we have the same bilateral filter as in Eq. 4; if the embedding-based segmentation mask m_i is all ones, we have standard convolution. Since the masks in this context encode segmentation cues, we refer to Eq. 7 as segmentation-aware convolution.

The mask acts as an applicability function for the filter, which makes segmentation-aware convolution a special case of normalized convolution [27]. The idea of normalized convolution is to "focus" the convolution operator on the part of the input that truly describes the input signal, avoiding the interpolation of noise or missing information. In this case, "noise" corresponds to information coming from regions other than the one to which index i belongs.

Any convolution filter can be made segmentation-aware. The advantage of segmentation awareness depends on the filter. For instance, a center-surround filter might be rendered useless by the effect of the mask (since it would block the input from the "surround"), whereas a filter selective to a particular shape might benefit from invariance to context. The basic intuition is that the information masked out needs to be distracting rather than helping; realizing this in practice requires learning the masking functions. In our work, we use backpropagation to learn both the arguments and the softness of each layer's masking operation, i.e., both e_i and λ in Eq. 3. Note that the network can always fall back to a standard CNN by simply learning a setting of λ = 0.
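A minimal sketch of Eq. 7 at a single position is given below (NumPy); the patch and filter shapes and the fixed λ are illustrative assumptions rather than trained quantities.

```python
import numpy as np

def segaware_response(patch, center_emb, patch_emb, filt, lam=1.0):
    """Eq. 7 at one position: the patch is reweighted by its normalized
    foreground mask before the inner product with the filter.
    patch, filt: (C, k, k); center_emb: (D,); patch_emb: (D, k, k)."""
    d = np.abs(patch_emb - center_emb[:, None, None]).sum(axis=0)   # distances to the center pixel
    m = np.exp(-lam * d)                                            # Eq. 3: local attention mask
    m = m / m.sum()                                                 # normalization term of Eq. 7
    return np.sum(patch * m[None] * filt)                           # masked inner product

# With lam = 0 the mask is uniform and the response is a plain convolution response
# scaled by 1/(k*k), matching the "fall back to a standard CNN" case noted above.
```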

4. Implementation details

This section first describes how the basic ideas of the technical approach are integrated in a CNN architecture, and then provides details on how the individual components are implemented efficiently as convolution-like layers.

4.1. Network architecture

Any convolutional network can be made segmentation-aware. In our work, the technique for achieving this modification involves generating embeddings with a dedicated "embedding network", then using masks computed from those embeddings to modify the convolutions of a given task-specific network. This implementation strategy is illustrated in Figure 5.

Figure 5: General schematic for our segmentation-aware CNN. The first part is an embedding network, which is guided to compute embedding-like representations at multiple scales, and constructs a final embedding as a weighted sum of the intermediate embeddings. The loss on these layers operates on pairwise distances computed from the embeddings. These same distances are then used to construct local attention masks, that intercept the convolutions in a task-specific network. The final objective backpropagates through both networks, fine-tuning the embeddings for the task.

The embedding network has the following architecture. The first seven layers share the design of the earliest convolution layers in VGG-16 [7], and are initialized with that network's (object recognition-trained) weights. There is a subsampling layer after the second convolution layer and also after the fourth convolution layer, so the network captures information at three different scales. The final output from each scale is sent to a pairwise distance computation (detailed in Sec. 4.2) followed by a loss (as in Eq. 1), so that each scale develops embedding-like representations. The outputs from the intermediate embedding layers are then upsampled to a common resolution, concatenated, and sent to a convolution layer with 1 × 1 filters. This layer learns a weighted average of the intermediate embeddings, and creates the final embedding for each pixel.

The idea of using a loss at intermediate layers is inspired by Xie and Tu [57], who used this strategy to learn boundary cues in a CNN. The motivation behind this strategy is to provide early layers a stronger signal of the network's end goal, reducing the burden on backpropagation to carry the signal through multiple layers [30].

The final embeddings are used to create masks in the task-specific network. The lightest usage of these masks involves performing segmentation-aware bilateral filtering on the network's final layer outputs; this achieves the sharpening effect illustrated in Figure 4. The most intrusive usage of the masks involves converting all convolutions into segmentation-aware convolutions. Even in this case, however, the masks can be inserted with no detrimental effect (i.e., by initializing with λ = 0 in Eq. 3), allowing the network to learn whether or not (and at what layer) to activate the masks. Additionally, if the target task has discrete output labels, as in the case of semantic segmentation, a segmentation-aware CRF can be attached to the end of the network to sharpen the final output predictions.

4.2. Efficient convolutional implementation details

We reduce all steps of the pipeline to matrix multiplications, making the approach very efficient on GPUs. We achieve this by casting the mask creation (i.e., pairwise embedding distance computation) as a convolution-like operation, and implementing it in exactly the way Caffe [26] realizes convolution: via an image-to-column transformation, followed by matrix multiplication.

More precisely, the distance computation works as follows. For every position i in the feature-map provided by the layer below, a patch of features is extracted from the neighborhood j ∈ N_i, and distances are computed between the central feature and its neighbors. These distances are arranged into a row vector of length K, where K is the spatial dimensionality of the patch. This process turns an H × W feature-map into an H · W × K matrix, where each element in the K dimension holds a distance relating that pixel to the central pixel at that spatial index.

To convert the distances into masks, the H · W × K matrix is passed through an exponential function with a specified hardness, λ. This operation realizes the mask term (Eq. 3). In our work, the hardness of the exponential is learned as a parameter of the CNN.

To perform the actual masking, the input to be masked is simply processed by an image-to-column transformation (producing another H · W × K matrix), then multiplied pointwise with the normalized mask matrix. From that product, segmentation-aware bilateral filtering is merely a matter of summing across the K dimension, producing an H · W × 1 matrix that can be reshaped into dimensions H × W. Segmentation-aware convolution (Eq. 7) simply requires multiplying the H · W × K masked values with a K × F matrix of weights, where F is the number of convolution filters. The result of this multiplication can be reshaped into F different H × W feature maps.
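The sketch below mirrors this matrix-multiplication view in NumPy for a single input: an image-to-column unroll of the input, an image-to-distance unroll of the embeddings, a pointwise product, and a matrix multiplication with the filter weights. The shapes, the L1 distance, and the (K·E) × F weight layout are assumptions of the example, not the exact Caffe kernels used in this work.

```python
import numpy as np

def im2col(x, k):
    """Unroll k x k patches of x (E, H, W) into an (H*W, k*k*E) matrix (zero padding)."""
    E, H, W = x.shape
    r = k // 2
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))
    cols = [xp[:, dy:dy + H, dx:dx + W].reshape(E, -1)
            for dy in range(k) for dx in range(k)]
    return np.concatenate(cols, axis=0).T                            # (H*W, k*k*E)

def im2dist(emb, k, lam=1.0):
    """Per-pixel masks exp(-lam * L1 distance) between each pixel and its k x k neighbors."""
    D, H, W = emb.shape
    r = k // 2
    ep = np.pad(emb, ((0, 0), (r, r), (r, r)), constant_values=1e6)  # padded positions get ~zero mask
    dists = [np.abs(ep[:, dy:dy + H, dx:dx + W] - emb).sum(0).reshape(-1)
             for dy in range(k) for dx in range(k)]
    return np.exp(-lam * np.stack(dists, axis=1))                    # (H*W, k*k)

def segaware_conv_matmul(x, emb, weights, k, lam=1.0):
    """Segmentation-aware convolution (Eq. 7) as unroll + pointwise mask + matmul.
    x: (E, H, W); emb: (D, H, W); weights: (k*k*E, F), offset-major / channel-minor."""
    E, H, W = x.shape
    cols = im2col(x, k)                                              # (H*W, K*E)
    masks = im2dist(emb, k, lam)                                     # (H*W, K)
    masks = masks / masks.sum(axis=1, keepdims=True)                 # normalize each local mask
    masked = cols * np.repeat(masks, E, axis=1)                      # broadcast the mask over channels
    out = masked @ weights                                           # (H*W, F)
    return out.T.reshape(-1, H, W)                                   # (F, H, W)
```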
5. Evaluation

We evaluate on two different dense prediction tasks: semantic segmentation, and optical flow estimation. The goal of the experiments is to minimally modify strong baseline networks, and examine the effects of instilling various levels of "segmentation awareness".

5.1. Semantic segmentation

Semantic segmentation is evaluated on the PASCAL VOC 2012 challenge [16], augmented with additional images from Hariharan et al. [21]. Experiments are carried out with two different baseline networks, "DeepLab" [9] and "DeepLabV2" [10]. DeepLab is a fully-convolutional version of VGG-16 [7], using atrous convolution in some layers to reduce downsampling. DeepLabV2 is a fully-convolutional version of a 101-layer residual network (ResNet) [24], modified with atrous spatial pyramid pooling and multi-scale input processing. Both networks are initialized with weights learned on ImageNet [46], then trained on the Microsoft COCO training and validation sets [33], and finally fine-tuned on the PASCAL images [16, 21].
Table 1: PASCAL VOC 2012 validation results for the various considered approaches, compared against the baseline. All methods use DeepLab as the base network; "BF" means bilateral filter; "SegAware" means segmentation-aware.

Method                                          IOU (%)
DeepLab                                         66.33
. . . + CRF                                     67.60
. . . + 9 × 9 SegAware BF                       66.98
. . . + 9 × 9 SegAware BF ×2                    67.36
. . . + 9 × 9 SegAware BF ×4                    67.68
. . . with FC6 SegAware                         67.40
. . . with all layers SegAware                  67.94
. . . with all layers SegAware + 9 × 9 BF       68.00
. . . with all layers SegAware + 7 × 7 BF ×2    68.57
. . . with all layers SegAware + 5 × 5 BF ×4    68.52
. . . with all layers and CRF SegAware          69.01

To replace the densely connected CRF used in the original works [9, 10], we attach a very sparse segmentation-aware CRF. We select the hyperparameters of the segmentation-aware CRF via cross validation on a small subset of the validation set, arriving at a 13 × 13 bilateral filter with an atrous factor of 9, a 5 × 5 spatial filter, and 2 meanfield iterations for both training and testing.

We carry out the main set of experiments with DeepLab on the VOC validation set, investigating the piecewise addition of various segmentation-aware components. A summary of the results is presented in Table 1. The first result is that using learned embeddings to mask the output of DeepLab approximately provides a 0.6% improvement in mean intersection-over-union (IOU) accuracy. This is achieved with a single application of a 9 × 9 bilateral-like filter on the FC8 outputs produced by DeepLab.

Once the embeddings and masks are computed, it is straightforward to run the masking process repeatedly. Applying the process multiple times improves performance by strengthening the contribution from similar neighbors in the radius, and also by allowing information from a wider radius to contribute to each prediction. Applying the bilateral filter four times increases the gain in IOU accuracy to 1.3%. This is at the cost of approximately 500 ms of additional computation time. A dense CRF yields slightly worse performance, at approximately half the speed (1 second).

Segmentation-aware convolution provides similar improvements, at less computational cost. Simply making the FC6 layer segmentation-aware produces an improvement of approximately 1% to IOU accuracy, at a cost of +100 ms, while making all layers segmentation-aware improves accuracy by 1.6%, at a cost of just +200 ms.

Figure 6: Visualizations of semantic segmentations produced by DeepLab and its segmentation-aware variant on the PASCAL VOC 2012 validation set (columns: input, labels, baseline, proposed).

To examine where the gains are taking place, we compute each method's accuracy within "trimaps" that extend from the objects' boundaries. A trimap is a narrow band (of a specified half-width) that surrounds a boundary on either side; measuring accuracy exclusively within this band can help separate within-object accuracy from on-boundary accuracy [9]. Figure 7 (left) shows examples of trimaps, and (right) plots accuracies as a function of trimap width. The results show that segmentation-aware convolution offers its main improvement slightly away from the boundaries (i.e., beyond 10 pixels), while bilateral filtering offers its largest improvement very near the boundary (i.e., within 5 pixels).

Figure 7: Performance near object boundaries ("trimaps"). Example trimaps are visualized (in white) for the image in the top left; the trimap of half-width three is shown in the middle left, and the trimap of half-width ten is shown on the bottom left. Mean IOU (%) of the baseline and two segmentation-aware variants ("+ seg.-aware" and "+ seg.-aware + bilateral") is plotted (right) against trimap half-width (pixels), for half-widths 1 to 40.
Segmentation-aware convolution provides similar im- improvement very near the boundary (i.e., within 5 pixels).
provements, at less computational cost. Simply making the Combining segmentation-aware convolution with bilat-
FC6 layer segmentation-aware produces an improvement of eral filtering pushes the gains to 2.2%. Finally, adding a
approximately 1% to IOU accuracy, at a cost of +100 ms, segmentation-aware CRF to the pipeline increases IOU ac-

Combining segmentation-aware convolution with bilateral filtering pushes the gains to 2.2%. Finally, adding a segmentation-aware CRF to the pipeline increases IOU accuracy by an additional 0.5%, bringing the overall gain to approximately 2.7% over the DeepLab baseline.

We evaluate the "all components" approach on the VOC test server, with both DeepLab and DeepLabV2. Results are summarized in Table 2. The improvement over DeepLab is 2%, which is noticeable in visualizations of the results, as shown in Figure 6. DeepLabV2 performs approximately 10 points higher than DeepLab; we exceed this improvement by approximately 0.8%. The segmentation-aware modifications perform equally well (0.1% superior) to dense CRF post-processing, despite being simpler (using only a sparse CRF, and replacing the permutohedral lattice with basic convolution), and twice as fast (0.5s rather than 1s).

Table 2: PASCAL VOC 2012 test results.

Method                  IOU (%)
DeepLab                 67.0
DeepLab+CRF             68.2
SegAware DeepLab        69.0
DeepLabV2               79.0
DeepLabV2+CRF           79.7
SegAware DeepLabV2      79.8

5.2. Optical flow

We evaluate optical flow on the recently introduced FlyingChairs [14] dataset. The baseline network for this experiment is the "FlowNetSimple" model from Dosovitskiy et al. [14]. This is a fully-convolutional network, with a contractive part that reduces the resolution of the input by a factor of 64, and an expansionary part (with skip connections) that restores the resolution to quarter-size.

In this context, we find that relatively minor segmentation-aware modifications yield substantial gains in accuracy. Using embeddings pre-trained on PASCAL VOC, we make the final prediction layer segmentation-aware, and add 9 × 9 bilateral filtering to the end of the network. This reduces the average end-point error (aEPE) from 2.78 to 2.26 (an 18% reduction in error), and reduces average angular error by approximately 6 degrees, from 15.58 to 9.54. We achieve these gains without the aggressive data augmentation techniques pursued by Dosovitskiy et al. [14]. Table 3 lists these results in the context of some related work in this domain, demonstrating that the gain is fairly substantial. FlowNetCorr [14] achieves a better error, but it effectively doubles the network size and runtime, whereas our method only adds a shallow set of embedding layers. As shown in Figure 8, a qualitative improvement to the flow fields is easily discernable, especially near object boundaries. Note that the performance of prior FlowNet architectures diminishes with the application of variational refinement [14], likely because this step was not integrated in the training process. The filtering methods of this work, however, are easily integrated into backpropagation.

Table 3: FlyingChairs test results.

Method                                aEPE    aAE
SPyNet [44]                           2.63    -
EpicFlow [45]                         2.94    -
DeepFlow [56]                         3.53    -
LDOF [6]                              3.47    -
FlowNetSimple [14]                    2.78    15.58
FlowNetSimple + variational [14]      2.86    -
FlowNetCorr [14]                      2.19    -
FlowNetCorr + variational [14]        2.61    -
SegAware FlowNetSimple                2.36    9.54

Figure 8: Visualizations of optical flow produced by FlowNet and its segmentation-aware variant on the FlyingChairs test set (columns: input, labels, baseline, proposed): segmentation-awareness yields much sharper results than the baseline.
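For reference, the aEPE and aAE metrics reported in Table 3 can be computed as in the following sketch, which uses the standard definitions; the benchmark's official evaluation code may differ in minor details.

```python
import numpy as np

def flow_errors(pred, gt):
    """Average end-point error (aEPE) and average angular error (aAE, in degrees)
    between predicted and ground-truth flow fields of shape (2, H, W)."""
    epe = np.sqrt(((pred - gt) ** 2).sum(axis=0)).mean()
    u, v = pred
    gu, gv = gt
    # angular error between the space-time vectors (u, v, 1) and (gu, gv, 1)
    dot = u * gu + v * gv + 1.0
    norm = np.sqrt(u ** 2 + v ** 2 + 1.0) * np.sqrt(gu ** 2 + gv ** 2 + 1.0)
    aae = np.degrees(np.arccos(np.clip(dot / norm, -1.0, 1.0))).mean()
    return epe, aae
```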
6. Conclusion

This work introduces Segmentation-Aware Convolutional Networks, a direct generalization of standard CNNs that allows us to seamlessly accommodate segmentation information throughout a deep architecture. Our approach avoids feature blurring before it happens, rather than fixing it post-hoc. The full architecture can be trained end-to-end. We have shown that this allows us to directly compete with segmentation-specific structured prediction algorithms, while easily extending to continuous prediction tasks, such as optical flow estimation, that currently have no remedy for blurred responses.
References

[1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
[2] V. Aurich and J. Weule. Non-linear gaussian filters performing edge preserving diffusion. In Proceedings of the DAGM Symposium, pages 538–545, 1995.
[3] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv:1601.05030, 2016.
[4] G. Bertasius, J. Shi, and L. Torresani. Semantic segmentation with boundary neural fields. In CVPR, 2016.
[5] J. Bromley, I. Guyon, Y. Lecun, E. Sackinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, 1994.
[6] T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. PAMI, 33(3):500–513, 2011.
[7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[8] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In CVPR, 2016.
[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2016.
[11] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[12] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[13] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. arXiv:1612.08083, 2016.
[14] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
[15] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[16] M. Everingham, L. Van-Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
[17] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv:1703.10277, 2017.
[18] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In ICCV, 2007.
[19] R. Gadde, V. Jampani, M. Kiefel, and P. V. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.
[20] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.
[21] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[22] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[23] A. W. Harley, K. G. Derpanis, and I. Kokkinos. Learning dense convolutional embeddings for semantic segmentation. In ICLR, 2016.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In CVPR, 2016.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM-MM, pages 675–678, 2014.
[27] H. Knutsson and C.-F. Westin. Normalized and differential convolution. In CVPR, 1993.
[28] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[30] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. AISTATS, 2(3):6, 2015.
[31] J.-S. Lee. Digital image smoothing and the sigma filter. CVGIP, 24(2):255–269, 1983.
[32] M. Leordeanu, R. Sukthankar, and C. Sminchisescu. Efficient closed-form solution to generalized boundary detection. In ECCV, pages 516–529, 2012.
[33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[34] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2014.
[35] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, pages 289–297, 2016.
[36] M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In CVPR, 2008.
[37] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[38] A. Newell and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv:1611.05424, 2016.
[39] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[40] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv:1606.05328, 2016.
[41] P. Ott and M. Everingham. Implicit color segmentation features for pedestrian and object detection. In ICCV, 2009.
[42] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. PAMI, 1990.
[43] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2014.
[44] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. CVPR, 2017.
[45] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[47] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[48] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, 2000.
[49] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
[50] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[51] S. M. Smith and J. M. Brady. SUSAN – A new approach to low level image processing. IJCV, 23(1):45–78, 1997.
[52] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, pages 2440–2448, 2015.
[53] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.
[54] E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Dense segmentation-aware descriptors. In CVPR, 2013.
[55] E. Trulls, S. Tsogkas, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Segmentation-aware deformable part models. In CVPR, 2014.
[56] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, pages 1385–1392, 2013.
[57] S. Xie and Z. Tu. Holistically-nested edge detection. In CVPR, 2015.
[58] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[59] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.
[60] J. Žbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, 2014.
[61] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
Segmentation-Aware Convolutional Networks Using Local Attention Masks
Supplementary Material

Adam W. Harley Konstantinos G. Derpanis Iasonas Kokkinos


Carnegie Mellon University Ryerson University Facebook AI Research

[Diagram comparing the implementation of convolution (im2col on the multi-dimensional image input, followed by a matrix multiplication with the weights matrix) with the implementation of segmentation-aware convolution (im2col on the input, im2dist on the embeddings, a pointwise product producing the masked, unrolled input, then a matrix multiplication with the weights matrix); see the caption below.]
Figure 1: Implementation of convolution in Caffe, compared with the implementation of segmentation-aware convo-
lution. Convolution involves re-organizing the elements of each (potentially overlapping) patch into a column (i.e.,
im2col), followed by a matrix multiplication with weights. Segmentation-aware convolution works similarly, with an
image-to-column transformation on the input, an image-to-distance transformation on the embeddings (i.e., im2dist), a
pointwise multiplication of those two matrices, and then a matrix multiplication with weights. The variables H, W denote
the height and width of the input, respectively; E denotes the number of channels in the input; K denotes the dimensionality
of a patch (e.g., K = 9 in convolution with a 3 × 3 filter); F denotes the number of filters (and the dimensionality of the
output). In both cases, an H × W × E input is transformed into an H × W × F output.
