Segmentation-Aware Convolutional Networks Using Local Attention Masks
Abstract
We validate our approach on two widely different dense prediction tasks that involve classification (semantic segmentation) and regression (optical flow). Our results show that in semantic segmentation we can match the performance of DenseCRFs while being faster and simpler, and in optical flow we obtain clearly sharper responses than networks that do not use local attention masks. In both cases, segmentation-aware convolution yields systematic improvements over strong baselines. Source code for this work is available online at http://cs.cmu.edu/~aharley/segaware.

Figure 1: Segmentation-aware convolution filters are invariant to backgrounds. We achieve this in three steps: (i) compute segmentation cues for each pixel (i.e., "embeddings"), (ii) create a foreground mask for each patch, and (iii) combine the masks with convolution, so that the filters only process the local foreground in each image patch.

1. Introduction

Convolutional neural networks (CNNs) have recently made rapid progress in pixel-wise prediction tasks, including depth prediction [15], optical flow estimation [14], and semantic segmentation [47, 9, 34]. This progress has been built on the remarkable success of CNNs in image classification tasks [29, 50] – indeed, most dense prediction models are based closely on architectures that were successful in object recognition. While this strategy facilitates transfer learning, it also brings design elements that are incompatible with dense prediction.

By design, CNNs typically produce feature maps and predictions that are smooth and low-resolution, resulting from the repeated pooling and subsampling stages in the network architecture, respectively. These stages play an important role in the hierarchical consolidation of features, and widen the effective receptive fields of higher layers. The low-resolution issue has received substantial attention: for instance, methods have been proposed for replacing the subsampling layers with resolution-preserving alternatives such as atrous convolution [9, 58, 43], or for restoring the lost resolution via upsampling stages [39, 34]. However, the issue of smoothness has remained relatively unexplored. Smooth neuron outputs result from the spatial pooling (i.e., abstraction) of information across different regions. This can be useful in high-level tasks, but can degrade accuracy on per-pixel prediction tasks where rapid changes in activation may be required, e.g., around region boundaries or motion discontinuities.
To address the issue of smoothness, we propose segmentation-aware convolutional networks, which operate as illustrated in Figure 1. These networks adjust their behavior on a per-pixel basis according to segmentation cues, so that the filters can selectively "attend" to information coming from the region containing the neuron, and treat it differently from background signals. To achieve this, we complement each image patch with a local foreground-background segmentation mask that acts like a gating mechanism for the information feeding into the neuron. This avoids feature blurring, by reducing the extent to which foreground and contextual information is mixed, and allows neuron activation levels to change rapidly, by dynamically adapting the neuron's behavior to the image content. This goes beyond sharpening the network outputs post-hoc, as is currently common practice; it fixes the blurring problem "before the damage is done", since it can be integrated at both early and later stages of a CNN.

The general idea of combining filtering with segmentation to enhance sharpness dates back to nonlinear image processing [42, 53] and segmentation-aware feature extraction [54, 55]. Apart from showing that this technique successfully carries over to CNNs, another contribution of our work consists in using the network itself to obtain segmentation information, rather than relying on hand-crafted pipelines. In particular, as in an earlier version of this work [23], we use a contrastive side loss to train the "segmentation embedding" branch of our network, so that we can then construct segmentation masks using embedding distances.

There are three steps to creating segmentation-aware convolutional nets, described in Sections 3.1-3.4: (i) learn segmentation cues, (ii) use the cues to create local foreground masks, and (iii) use the masks together with convolution, to create foreground-focused convolution. Our approach realizes each of these steps in a unified manner that is at once general (i.e., applicable to both discrete and continuous prediction tasks), differentiable (i.e., end-to-end trainable as a neural network), and fast (i.e., implemented as GPU-optimized variants of convolution).

Experiments show that minimally modifying existing CNN architectures to use segmentation-aware convolution yields substantial gains in two widely different task settings: dense discrete labelling (i.e., semantic segmentation), and dense regression (i.e., optical flow estimation). Source code for this work is available online at http://cs.cmu.edu/~aharley/segaware.

2. Related work

This work builds on a wide range of research topics. The first is metric learning. The goal of metric learning is to produce features from which one can estimate the similarity between pixels or regions in the input [18]. Bromley et al. [5] influentially proposed learning these descriptors in a convolutional network, for signature verification. Subsequent related work has yielded compelling results for tasks such as wide-baseline stereo correspondence [20, 59, 60], and face verification [11]. Recently, the topic of metric learning has been studied extensively in conjunction with image descriptors, such as SIFT and SID [54, 49, 3], improving the applicability of those descriptors to patch-matching problems. Most prior work in metric learning has been concerned with the task of finding one-to-one correspondences between pixels seen from different viewpoints. In contrast, the focus of our work is (as in our prior work [23]) to bring a given point close to all of the other points that lie in the same object. This requires a higher degree of invariance than before – not only to rotation, scale, and partial occlusion, but also to the interior appearance details of objects. Concurrent work has targeted a similar goal, for body joints [38] and instance segmentation [17]. We refer to the features that produce these invariances as embeddings, as they embed pixels into a space where the quality of correspondences can be measured as a distance.

The embeddings in our work are used to generate local attention masks to obtain segmentation-aware feature maps. The resulting features are meant to capture the appearance of the foreground (relative to a given point), while being invariant to changes in the background or occlusions. To date, related work has focused on developing handcrafted descriptors that have this property. For instance, soft segmentation masks [41, 32] and boundary cues [36, 48] have been used to develop segmentation-aware variants of hand-crafted features, like SIFT and HOG, effectively suppressing contributions from pixels likely to come from the background [54, 55]. More in line with the current paper are recent works that incorporate segmentation cues into CNNs, by sharpening or masking intermediate feature maps with the help of superpixels [12, 19]. This technique adds spatial structure to multiple stages of the pipeline. In all of these works, the affinities are defined in a handcrafted manner, and are typically pre-computed in a separate process. In contrast, we learn the cues directly from image data, and compute the affinities densely and "on the fly" within a CNN. Additionally, we combine the masking filters with arbitrary convolutional filters, allowing any layer (or even all layers) to perform segmentation-aware convolution.

Concurrent work in language modelling [13] and image generation [40] has also emphasized the importance of locally masked (or "gated") convolutions. Unlike these works, our approach uniquely makes use of embeddings to measure context relevance, which lends interpretability to the masks, and allows for task-agnostic pre-training. Similar attention mechanisms are being used in visual [35] and non-visual [52] question answering tasks. These works use a question to construct a single or a limited sequence of globally-supported attention signals. Instead, we use convolutional embeddings, and efficiently construct local attention masks in "batch mode" around the region of any given neuron.
Another relevant thread of works relates to efforts on mitigating the low-resolution and spatially-imprecise predictions of CNNs. Approaches to counter the spatial imprecision weakness can be grouped into preventions (i.e., methods integrated early in the CNN), and cures (i.e., post-processes). A popular preventative method is atrous convolution (also known as "dilated" convolution) [9, 58], which allows neurons to cover a wider field of view with the same number of parameters. Our approach also adjusts neurons' field of view, but focuses it toward the local foreground, rather than widening it in general. The "cures" aim to restore resolution or sharpness after it has been lost. For example, one effective approach is to add trainable upsampling stages to the network, via "deconvolution" layers [39, 34]. A complementary approach is to stack features from multiple resolutions near the end of the network, so that the final stages have access to both high-resolution (shallow) features and low-resolution (deep) features [22, 37, 14]. Sharpening can be done outside of the CNN, e.g., using edges found in the image [8, 4], or using a dense conditional random field (CRF) [28, 9, 58]. Recently, the CRF approach has been integrated more closely with the CNN, by framing the CRF as a recurrent network, and chaining it to the backpropagation of the underlying CNN [61]. We make connections and extensions to CRFs in Section 3.3 and provide comparisons in Section 5.1.

Figure 2: Visualization of the goal for pixel embeddings. For any two pixels sampled from the same object, the embeddings should have a small relative distance. For any two pixels sampled from different objects, the embeddings should have a large distance. The embeddings are illustrated in 2D; in principle, they can have any dimensionality.

3. Technical approach

The following subsections describe the main components of our approach. We begin by learning segmentation cues (Sec. 3.1). We formulate this as a task of finding "segmentation embeddings" for the pixels. This step yields features that allow region similarity to be measured as a distance in feature-space. That is, if two pixels have nearby embeddings, then they likely come from the same region. We next create soft segmentation masks from the embeddings (Sec. 3.2). Our approach generalizes the bilateral filter [31, 2, 51, 53], which is a technique for creating adaptive smoothing filters that preserve object boundaries. Noting that CRFs make heavy use of bilateral filters to sharpen posterior estimates, we next describe how to simplify and improve CRFs using our segmentation-aware masks (Sec. 3.3). Finally, in Sec. 3.4 we introduce segmentation-aware convolution, where we merge segmentation-aware masks with intermediate convolution operations, giving rise to segmentation-aware networks.

3.1. Learning segmentation cues

The first goal of our work is to obtain segmentation cues. In particular, we desire features that can be used to infer – for each pixel – what other pixels belong to the same object (or scene segment).

Given an RGB image, I, made up of pixels p ∈ R^3 (i.e., 3D vectors encoding color), we learn an embedding function that maps (i.e., embeds) the pixels into a feature space where semantic similarity between pixels can be measured as a distance [11]. Choosing the dimensionality of that feature space to be D = 64, we can write the embedding function as f : R^3 → R^D, or more specifically, f(p) = e, where e is the embedding for pixel p.

Pixel pairs that lie on the same object should produce similar embeddings (i.e., a short distance in feature-space), and pairs from different objects should produce dissimilar embeddings (i.e., a large distance in feature-space). Figure 2 illustrates this goal with 2D embeddings. Given semantic category labels for the pixels as training data, we can represent the embedding goal as a loss function over pixel pairs. For any two pixel indices i and j, and corresponding embeddings e_i, e_j and object class labels l_i, l_j, we can optimize the same-label pairs to have "near" embeddings, and the different-label pairs to have "far" embeddings. Using α and β to denote the "near" and "far" thresholds, respectively, we can define the pairwise loss as

\ell_{i,j} = \begin{cases} \max(\|e_i - e_j\| - \alpha,\, 0) & \text{if } l_i = l_j \\ \max(\beta - \|e_i - e_j\|,\, 0) & \text{if } l_i \neq l_j \end{cases} \quad (1)

where ‖·‖ denotes a vector norm. We find that embeddings learned from the L1 and L2 norms are similar, but L1-based embeddings are less vulnerable to exploding gradients. For thresholds, we use α = 0.5 and β = 2. In practice, the specific values of α and β are unimportant, so long as α ≤ β and the remainder of the network can learn to compensate for the scale of the resulting embeddings, e.g., through λ in upcoming Eq. 3.

To quantify the overall quality of the embedding function, we simply sum the pairwise losses (Eq. 1) across the image. Although for an image with N pixels there are N^2 pairs to evaluate, we find it is effective to simply sample
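As a concrete illustration of Eq. 1, the following NumPy sketch evaluates the pairwise loss on a random sample of pixel pairs rather than all N^2 of them. The uniform sampling scheme, the sample count, and the default use of the L1 norm are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def pairwise_embedding_loss(emb, labels, alpha=0.5, beta=2.0,
                            num_pairs=1000, rng=None):
    """Pairwise loss of Eq. 1, averaged over sampled pixel pairs.

    emb:    (N, D) array of per-pixel embeddings e_i.
    labels: (N,) array of object class labels l_i.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    i = rng.integers(0, len(emb), num_pairs)
    j = rng.integers(0, len(emb), num_pairs)
    # L1 distances; the paper notes L1-based embeddings are less
    # vulnerable to exploding gradients than L2-based ones.
    dist = np.abs(emb[i] - emb[j]).sum(axis=1)
    same = labels[i] == labels[j]
    # Pull same-label pairs within alpha; push different-label pairs past beta.
    losses = np.where(same,
                      np.maximum(dist - alpha, 0.0),
                      np.maximum(beta - dist, 0.0))
    return losses.mean()
```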
Figure 3: (panel labels: Input, Patch, Embed mask, RGB mask).

Figure 4: (panel labels: Input, FC8, Embed, Sharpened FC8).
where i ranges over all pixel indices in the image. In semantic segmentation, the unary term ψ_u is typically chosen to be the negative log probability provided by a CNN trained for per-pixel classification. The pairwise potentials take the form ψ_p(x_i, x_j) = µ(x_i, x_j) k(f_i, f_j), where µ is a label compatibility function (e.g., the Potts model), and k(f_i, f_j) is a feature compatibility function. The feature compatibility is composed of an appearance term (a bilateral filter), and a smoothness term (an averaging filter), in the form

k(f_i, f_j) = w^{(1)} \exp\left( -\frac{\|i - j\|^2}{2\theta_\alpha^2} - \frac{\|p_i - p_j\|^2}{2\theta_\beta^2} \right) + w^{(2)} \exp\left( -\frac{\|i - j\|^2}{2\theta_\gamma^2} \right), \quad (6)

where the w^{(k)} are weights on the two terms. Combined with the label compatibility function, the appearance term adds a penalty if a pair of pixels are assigned the same label but have dissimilar colors. To be effective, these filtering operations are carried out with extremely wide filters (e.g., the size of the image), which necessitates using a data structure called a permutohedral lattice [1].

Motivated by our earlier observation that learned embeddings are a stronger semantic similarity signal than color (see Fig. 3), we replace the color vector p_i in Eq. 6 with the learned embedding vector e_i. The permutohedral lattice would be inefficient for such a high-dimensional filter, but we find that the signal provided by the embeddings is rich enough that we can use small filters (e.g., 13 × 13), and achieve the same (or better) performance. This allows us to implement the entire CRF with standard convolution operators, reduce computation time by half, and backpropagate through the CRF into the embeddings.
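For concreteness, here is a minimal sketch of the feature compatibility of Eq. 6 with the embedding vector e substituted for the color vector p, evaluated for a single pixel pair. The weight and θ bandwidth values are placeholders for illustration, not settings from the paper.

```python
import numpy as np

def feature_compatibility(pos_i, pos_j, emb_i, emb_j,
                          w1=1.0, w2=1.0,
                          theta_alpha=10.0, theta_beta=1.0, theta_gamma=3.0):
    """k(f_i, f_j) of Eq. 6, with embeddings e in place of colors p.

    pos_i, pos_j: 2-D pixel coordinates (float arrays).
    emb_i, emb_j: D-dimensional learned embeddings.
    """
    sq = lambda v: float(np.dot(v, v))  # squared L2 norm
    # Appearance term: nearby pixels with similar embeddings couple strongly.
    appearance = w1 * np.exp(-sq(pos_i - pos_j) / (2 * theta_alpha ** 2)
                             - sq(emb_i - emb_j) / (2 * theta_beta ** 2))
    # Smoothness term: a plain spatial (averaging) kernel.
    smoothness = w2 * np.exp(-sq(pos_i - pos_j) / (2 * theta_gamma ** 2))
    return appearance + smoothness
```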
3.4. Segmentation-aware convolution

The bilateral filter in Eq. 4 is similar in form to convolution, but with a non-linear sharpening mask instead of a learned task-specific filter. In this case, we can have the benefits of both, by inserting the learned convolution filter, t, into the equation:

y_i = \frac{\sum_k x_{i-k}\, m_{i,i-k}\, t_k}{\sum_k m_{i,i-k}}. \quad (7)

This is a non-linear convolution: the input signal is multiplied pointwise by the normalized local mask before forming the inner product with the learned filter. If the learned filter t is all ones, we have the same bilateral filter as in Eq. 4; if the embedding-based segmentation mask m_i is all ones, we have standard convolution. Since the masks in this context encode segmentation cues, we refer to Eq. 7 as segmentation-aware convolution.

The mask acts as an applicability function for the filter, which makes segmentation-aware convolution a special case of normalized convolution [27]. The idea of normalized convolution is to "focus" the convolution operator on the part of the input that truly describes the input signal, avoiding the interpolation of noise or missing information. In this case, "noise" corresponds to information coming from regions other than the one to which index i belongs.

Any convolution filter can be made segmentation-aware. The advantage of segmentation awareness depends on the filter. For instance, a center-surround filter might be rendered useless by the effect of the mask (since it would block the input from the "surround"), whereas a filter selective to a particular shape might benefit from invariance to context. The basic intuition is that the information masked out needs to be distracting rather than helping; realizing this in practice requires learning the masking functions. In our work, we use backpropagation to learn both the arguments and the softness of each layer's masking operation, i.e., both e_i and λ in Eq. 3. Note that the network can always fall back to a standard CNN by simply learning a setting of λ = 0.
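A direct (unvectorized) reading of Eq. 7 can be written in a few lines. The sketch below operates on a 1-D signal with scalar embeddings purely for clarity; the exponential mask m_{i,j} = exp(−λ|e_i − e_j|) follows the masking with hardness λ described in Sec. 4.2, and the 1-D setting is an illustrative simplification of the 2-D case.

```python
import numpy as np

def segaware_conv1d(x, emb, t, lam=1.0):
    """Direct evaluation of Eq. 7 on a 1-D signal.

    x:   (N,) input signal
    emb: (N,) per-position embeddings (scalar for clarity)
    t:   (K,) learned filter taps, K odd
    """
    N, K = len(x), len(t)
    r = K // 2
    y = np.zeros(N)
    for i in range(N):
        num, den = 0.0, 0.0
        for k in range(-r, r + 1):
            j = i - k
            if 0 <= j < N:
                m = np.exp(-lam * abs(emb[i] - emb[j]))  # local attention mask
                num += x[j] * m * t[k + r]               # masked inner product
                den += m                                 # normalizer of Eq. 7
        y[i] = num / den
    return y
```

With t all ones, this reduces to the bilateral filter of Eq. 4; with identical embeddings everywhere (mask all ones), it reduces to standard convolution.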
4. Implementation details

This section first describes how the basic ideas of the technical approach are integrated in a CNN architecture, and then provides details on how the individual components are implemented efficiently as convolution-like layers.

4.1. Network architecture

Any convolutional network can be made segmentation-aware. In our work, the technique for achieving this modification involves generating embeddings with a dedicated "embedding network", then using masks computed from those embeddings to modify the convolutions of a given task-specific network. This implementation strategy is illustrated in Figure 5.

Figure 5: General schematic for our segmentation-aware CNN. The first part is an embedding network, which is guided to compute embedding-like representations at multiple scales, and constructs a final embedding as a weighted sum of the intermediate embeddings. The loss on these layers operates on pairwise distances computed from the embeddings. These same distances are then used to construct local attention masks that intercept the convolutions in a task-specific network. The final objective backpropagates through both networks, fine-tuning the embeddings for the task.

The embedding network has the following architecture. The first seven layers share the design of the earliest convolution layers in VGG-16 [7], and are initialized with that network's (object recognition-trained) weights. There is a subsampling layer after the second convolution layer and also after the fourth convolution layer, so the network captures information at three different scales. The final output from each scale is sent to a pairwise distance computation (detailed in Sec. 4.2) followed by a loss (as in Eq. 1), so that each scale develops embedding-like representations. The outputs from the intermediate embedding layers are then upsampled to a common resolution, concatenated, and sent to a convolution layer with 1 × 1 filters. This layer learns a weighted average of the intermediate embeddings, and creates the final embedding for each pixel.

The idea of using a loss at intermediate layers is inspired by Xie and Tu [57], who used this strategy to learn boundary cues in a CNN. The motivation behind this strategy is to provide early layers a stronger signal of the network's end goal, reducing the burden on backpropagation to carry the signal through multiple layers [30].
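As a rough sketch of this architecture, the PyTorch-style module below wires up seven 3 × 3 convolution layers with VGG-16 channel widths, subsamples after the second and fourth layers, and fuses the upsampled per-scale outputs with a 1 × 1 convolution. The exact channel counts, ReLU placement, and bilinear upsampling mode are assumptions for illustration; the per-scale distance losses and the ImageNet weight initialization are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Sketch of the multi-scale embedding network of Sec. 4.1."""
    def __init__(self, dim=64):
        super().__init__()
        chans = [3, 64, 64, 128, 128, 256, 256, 256]  # VGG-16-style widths
        self.convs = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, padding=1) for i in range(7))
        self.pool = nn.MaxPool2d(2, 2)
        self.fuse = nn.Conv2d(64 + 128 + 256, dim, 1)  # 1x1 weighted average

    def forward(self, x):
        scales = []
        for i, conv in enumerate(self.convs):
            x = F.relu(conv(x))
            if i in (1, 3):           # scale outputs before each subsampling
                scales.append(x)
                x = self.pool(x)
        scales.append(x)              # third scale, at 1/4 resolution
        size = scales[0].shape[-2:]
        up = [F.interpolate(s, size=size, mode='bilinear', align_corners=False)
              for s in scales]
        return self.fuse(torch.cat(up, dim=1))  # final per-pixel embedding
```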
The final embeddings are used to create masks in the task-specific network. The lightest usage of these masks involves performing segmentation-aware bilateral filtering on the network's final layer outputs; this achieves the sharpening effect illustrated in Figure 4. The most intrusive usage of the masks involves converting all convolutions into segmentation-aware convolutions. Even in this case, however, the masks can be inserted with no detrimental effect (i.e., by initializing with λ = 0 in Eq. 3), allowing the network to learn whether or not (and at what layer) to activate the masks. Additionally, if the target task has discrete output labels, as in the case of semantic segmentation, a segmentation-aware CRF can be attached to the end of the network to sharpen the final output predictions.
4.2. Efficient convolutional implementation details

We reduce all steps of the pipeline to matrix multiplications, making the approach very efficient on GPUs. We achieve this by casting the mask creation (i.e., pairwise embedding distance computation) as a convolution-like operation, and implementing it in exactly the way Caffe [26] realizes convolution: via an image-to-column transformation, followed by matrix multiplication.

More precisely, the distance computation works as follows. For every position i in the feature-map provided by the layer below, a patch of features is extracted from the neighborhood j ∈ N_i, and distances are computed between the central feature and its neighbors. These distances are arranged into a row vector of length K, where K is the spatial dimensionality of the patch. This process turns an H × W feature-map into an H · W × K matrix, where each element in the K dimension holds a distance relating that pixel to the central pixel at that spatial index.

To convert the distances into masks, the H · W × K matrix is passed through an exponential function with a specified hardness, λ. This operation realizes the mask term (Eq. 3). In our work, the hardness of the exponential is learned as a parameter of the CNN.

To perform the actual masking, the input to be masked is simply processed by an image-to-column transformation (producing another H · W × K matrix), then multiplied pointwise with the normalized mask matrix. From that product, segmentation-aware bilateral filtering is merely a matter of summing across the K dimension, producing an H · W × 1 matrix that can be reshaped into dimensions H × W. Segmentation-aware convolution (Eq. 7) simply requires multiplying the H · W × K masked values with a K × F matrix of weights, where F is the number of convolution filters. The result of this multiplication can be reshaped into F different H × W feature maps.
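The matrix formulation above can be sketched directly in NumPy. Below, a hypothetical im2dist mirrors Caffe's im2col but emits L1 distances to each patch center; exponentiating and normalizing the distances gives the masks, and a single matrix multiplication then yields segmentation-aware convolution (Eq. 7). Shapes follow the H · W × K convention of this section, with the E input channels kept as an extra axis; the L1 distance and zero padding are assumptions here, and the function names are illustrative.

```python
import numpy as np

def im2col(x, k):
    """(H, W, E) input -> (H*W, K, E) patches, K = k*k, zero-padded borders."""
    H, W, E = x.shape
    r = k // 2
    xp = np.pad(x, ((r, r), (r, r), (0, 0)))
    patches = np.stack([xp[i:i + H, j:j + W]
                        for i in range(k) for j in range(k)], axis=2)
    return patches.reshape(H * W, k * k, E)

def im2dist(emb, k):
    """(H, W, D) embeddings -> (H*W, K) L1 distances to each patch center."""
    cols = im2col(emb, k)                          # (H*W, K, D)
    center = emb.reshape(-1, 1, emb.shape[-1])     # (H*W, 1, D)
    return np.abs(cols - center).sum(axis=-1)      # (H*W, K)

def segaware_conv(x, emb, weights, lam=1.0, k=3):
    """Segmentation-aware convolution (Eq. 7) as two matrix operations.

    x:       (H, W, E) input feature map
    emb:     (H, W, D) embeddings
    weights: (K*E, F)  learned filters, K = k*k
    """
    H, W, E = x.shape
    masks = np.exp(-lam * im2dist(emb, k))         # exponential masking
    masks /= masks.sum(axis=1, keepdims=True)      # normalize over the patch
    cols = im2col(x, k)                            # (H*W, K, E)
    masked = (cols * masks[:, :, None]).reshape(H * W, k * k * E)
    return (masked @ weights).reshape(H, W, -1)    # (H, W, F)
```

Normalizing the masks before the matrix multiplication is equivalent to dividing by the mask sum as written in Eq. 7.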
5. Evaluation

We evaluate on two different dense prediction tasks: semantic segmentation, and optical flow estimation. The goal of the experiments is to minimally modify strong baseline networks, and examine the effects of instilling various levels of "segmentation awareness".

5.1. Semantic segmentation

Semantic segmentation is evaluated on the PASCAL VOC 2012 challenge [16], augmented with additional images from Hariharan et al. [21]. Experiments are carried out with two different baseline networks, "DeepLab" [9] and "DeepLabV2" [10]. DeepLab is a fully-convolutional version of VGG-16 [7], using atrous convolution in some layers to reduce downsampling. DeepLabV2 is a fully-convolutional version of a 101-layer residual network (ResNet) [24], modified with atrous spatial pyramid pooling and multi-scale input processing. Both networks are initialized with weights learned on ImageNet [46], then trained on

Table 1: PASCAL VOC 2012 validation results for the various considered approaches, compared against the baseline. All methods use DeepLab as the base network; "BF" means bilateral filter; "SegAware" means segmentation-aware.

(Qualitative results figure, columns: Input, Labels, Baseline, Proposed.)

Table 2: PASCAL VOC 2012 test results.

(Qualitative results figure, columns: Input, Labels, Baseline, Proposed.)
References

[1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
[2] V. Aurich and J. Weule. Non-linear gaussian filters performing edge preserving diffusion. In Proceedings of the DAGM Symposium, pages 538–545, 1995.
[3] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv:1601.05030, 2016.
[4] G. Bertasius, J. Shi, and L. Torresani. Semantic segmentation with boundary neural fields. In CVPR, 2016.
[5] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, 1994.
[6] T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. PAMI, 33(3):500–513, 2011.
[7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[8] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In CVPR, 2016.
[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2016.
[11] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[12] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[13] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. arXiv:1612.08083, 2016.
[14] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
[15] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
[17] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv:1703.10277, 2017.
[18] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In ICCV, 2007.
[19] R. Gadde, V. Jampani, M. Kiefel, and P. V. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.
[20] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.
[21] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[22] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[23] A. W. Harley, K. G. Derpanis, and I. Kokkinos. Learning dense convolutional embeddings for semantic segmentation. In ICLR, 2016.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In CVPR, 2016.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM-MM, pages 675–678, 2014.
[27] H. Knutsson and C.-F. Westin. Normalized and differential convolution. In CVPR, 1993.
[28] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[30] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. AISTATS, 2(3):6, 2015.
[31] J.-S. Lee. Digital image smoothing and the sigma filter. CVGIP, 24(2):255–269, 1983.
[32] M. Leordeanu, R. Sukthankar, and C. Sminchisescu. Efficient closed-form solution to generalized boundary detection. In ECCV, pages 516–529, 2012.
[33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[34] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[35] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, pages 289–297, 2016.
[36] M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In CVPR, 2008.
[37] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[38] A. Newell and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv:1611.05424, 2016.
[39] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[40] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv:1606.05328, 2016.
[41] P. Ott and M. Everingham. Implicit color segmentation features for pedestrian and object detection. In ICCV, 2009.
[42] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. PAMI, 1990.
[43] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
[44] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In CVPR, 2017.
[45] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[47] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[48] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, 2000.
[49] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
[50] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[51] S. M. Smith and J. M. Brady. SUSAN – A new approach to low level image processing. IJCV, 23(1):45–78, 1997.
[52] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, pages 2440–2448, 2015.
[53] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.
[54] E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Dense segmentation-aware descriptors. In CVPR, 2013.
[55] E. Trulls, S. Tsogkas, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Segmentation-aware deformable part models. In CVPR, 2014.
[56] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, pages 1385–1392, 2013.
[57] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[58] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[59] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.
[60] J. Žbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, 2015.
[61] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
Segmentation-Aware Convolutional Networks Using Local Attention Masks
Supplementary Material
Figure 1: Implementation of convolution in Caffe, compared with the implementation of segmentation-aware convolution. Convolution involves re-organizing the elements of each (potentially overlapping) patch into a column (i.e., im2col), followed by a matrix multiplication with weights. Segmentation-aware convolution works similarly, with an image-to-column transformation on the input, an image-to-distance transformation on the embeddings (i.e., im2dist), a pointwise multiplication of those two matrices, and then a matrix multiplication with weights. The variables H, W denote the height and width of the input, respectively; E denotes the number of channels in the input; K denotes the dimensionality of a patch (e.g., K = 9 in convolution with a 3 × 3 filter); F denotes the number of filters (and the dimensionality of the output). In both cases, an H × W × E input is transformed into an H × W × F output.