Semantic Image Segmentation via Deep Parsing Network∗

Ziwei Liu† Xiaoxiao Li† Ping Luo Chen Change Loy Xiaoou Tang
Department of Information Engineering, The Chinese University of Hong Kong
{lz013,lx015,pluo,ccloy,xtang}@ie.cuhk.edu.hk
arXiv:1509.02634v2 [cs.CV] 24 Sep 2015

∗This work has been accepted to appear in ICCV 2015. This is the pre-printed version. Content may slightly change prior to the final publication.
†indicates shared first authorship.

Abstract

This paper addresses semantic image segmentation by incorporating rich information into Markov Random Fields (MRFs), including high-order relations and a mixture of label contexts. Unlike previous works that optimized MRFs using iterative algorithms, we solve the MRF by proposing a Convolutional Neural Network (CNN), namely the Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN architecture to model the unary terms, and additional layers are carefully devised to approximate the mean field algorithm (MF) for the pairwise terms. It has several appealing properties. First, different from recent works that combined CNN and MRF, where many iterations of MF were required for each training image during back-propagation, DPN achieves high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing works its special cases. Third, DPN makes MF easier to parallelize and speed up on a Graphics Processing Unit (GPU). DPN is thoroughly evaluated on the PASCAL VOC 2012 dataset, where a single DPN model yields a new state-of-the-art segmentation accuracy of 77.5%.

1. Introduction

Markov Random Field (MRF) and Conditional Random Field (CRF) have achieved great success in semantic image segmentation, which is one of the most challenging problems in computer vision. Existing works such as [31, 29, 9, 34, 11, 2, 8, 25, 22] can generally be categorized into two groups based on their definitions of the unary and pairwise terms of the MRF.

In the first group, researchers improved labeling accuracy by exploring rich information to define the pairwise functions, including long-range dependencies [16, 17], high-order potentials [37, 36], and semantic label contexts [21, 26, 38]. For example, Krähenbühl et al. [16] attained accurate segmentation boundaries by inferring on a fully-connected graph. Vineet et al. [37] extended [16] by defining both high-order and long-range terms between pixels. Global and local semantic contexts between labels were also investigated by [38]. Although these methods accomplished promising results, they modeled the unary terms with SVM or Adaboost, whose learning capacity becomes a bottleneck, and the learning and inference of their complex pairwise terms are often expensive.

In the second group, people learned a strong unary classifier by leveraging recent advances in deep learning, such as the Convolutional Neural Network (CNN). With deep models, these works [23, 24, 25, 22, 3, 28, 39, 30, 19] demonstrated encouraging results using a simple definition of the pairwise function, or even ignoring it. For instance, Long et al. [22] transformed the fully-connected layers of a CNN into convolutional layers, making accurate per-pixel classification possible with contemporary CNN architectures pre-trained on ImageNet [6]. Chen et al. [3] improved [22] by feeding the outputs of the CNN into an MRF with simple pairwise potentials, but treated CNN and MRF as separate components. A recent advance was obtained by [30], which jointly trained CNN and MRF by passing the error of MRF inference backward into the CNN, but iterative inference of the MRF, such as the mean field algorithm (MF) [27], is required for each training image during back-propagation (BP). Zheng et al. [39] further showed that the procedure of MF inference can be represented as a Recurrent Neural Network (RNN), but the computational costs are similar. We found that directly combining CNN and MRF as above is inefficient, because a CNN typically has millions of parameters while an MRF infers thousands of latent variables; even worse, incorporating complex pairwise terms into the MRF becomes impractical, limiting the performance of the entire system.
This work proposes a novel Deep Parsing Network (DPN), which is able to jointly train a CNN and complex pairwise terms. DPN has several appealing properties. (1) DPN solves the MRF with a single feed-forward pass, reducing computational cost while maintaining high performance. Specifically, DPN models the unary terms by extending the VGG-16 network (VGG16) [32] pre-trained on ImageNet, while additional layers are carefully designed to model complex pairwise terms. Learning these terms is transformed into deterministic end-to-end computation by BP, instead of embedding MF into BP as [30, 19] did. Although MF can be represented by an RNN [39], the forward pass must be computed recurrently to achieve good performance and is thus time-consuming, e.g. each forward pass contains hundreds of thousands of weights. DPN approximates MF using only one iteration, which is made possible by jointly learning strong unary terms and rich pairwise information. (2) Pairwise terms determine the graphical structure. In previous works, if the former is changed, so is the latter, as well as its inference procedure. But with DPN, modifying the complexity of the pairwise terms, e.g. the range of pixels and contexts, is as simple as modifying the receptive fields of convolutions, without varying BP. DPN is able to represent multiple types of pairwise terms, making many previous works [3, 39, 30] its special cases. (3) DPN approximates MF with convolutional and pooling operations, which can be speeded up by low-rank approximation [14] and easily parallelized [4] on a Graphics Processing Unit (GPU).

Our contributions are summarized below. (1) A novel DPN is proposed to jointly train VGG16 and rich pairwise information, i.e. a mixture of label contexts and high-order relations. Compared to existing deep models, DPN can approximate MF with only one iteration, reducing computational cost while maintaining high performance. (2) We disclose that DPN represents multiple types of MRFs, making many previous works such as RNN [39] and DeepLab [3] its special cases. (3) Extensive experiments investigate which components of DPN are crucial to achieve high performance. A single DPN model achieves a new state-of-the-art accuracy of 77.5% on the PASCAL VOC 2012 [7] test set. (4) We analyze the time complexity of DPN on GPU.
2. Our Approach

DPN learns the MRF by extending VGG16 to model the unary terms, and additional layers are carefully designed for the pairwise terms.

Overview  An MRF [10] is an undirected graph where each node represents a pixel in an image I, and each edge represents a relation between pixels. Each node is associated with a binary latent variable, y_i^u ∈ {0, 1}, indicating whether pixel i has label u. We have ∀u ∈ L = {1, 2, ..., l}, representing a set of l labels. The energy function of the MRF is written as

    E(\mathbf{y}) = \sum_{\forall i \in \mathcal{V}} \Phi(y_i^u) + \sum_{\forall i,j \in \mathcal{E}} \Psi(y_i^u, y_j^v),    (1)

where y, V, and E denote the sets of latent variables, nodes, and edges, respectively. Φ(y_i^u) is the unary term, measuring the cost of assigning label u to the i-th pixel. For instance, if pixel i belongs to the first category rather than the second one, we should have Φ(y_i^1) < Φ(y_i^2). Moreover, Ψ(y_i^u, y_j^v) is the pairwise term that measures the penalty of assigning labels u, v to pixels i, j respectively.

Intuitively, the unary terms represent per-pixel classifications, while the pairwise terms represent a set of smoothness constraints. The unary term in Eqn.(1) is typically defined as

    \Phi(y_i^u) = -\ln p(y_i^u = 1 \mid I),    (2)

where p(y_i^u = 1|I) indicates the probability of the presence of label u at pixel i, modeled by VGG16. To simplify discussions, we abbreviate it as p_i^u. The smoothness term can be formulated as

    \Psi(y_i^u, y_j^v) = \mu(u, v)\, d(i, j),    (3)

where the first term learns the penalty of global co-occurrence between any pair of labels, e.g. the output of µ(u, v) is large if u and v should not coexist, while the second term calculates the distance between pixels, e.g. d(i, j) = \omega_1 \|I_i - I_j\|^2 + \omega_2 \|[x_i\ y_i] - [x_j\ y_j]\|^2. Here, I_i indicates a feature vector such as the RGB values of the i-th pixel, x, y denote the coordinates of the pixels' positions, and ω_1, ω_2 are constant weights. Eqn.(3) implies that if two pixels are close and look similar, they are encouraged to have compatible labels. It has been adopted by most of the recent deep models [3, 39, 30] for semantic image segmentation.

However, Eqn.(3) has two main drawbacks. First, its first term captures the co-occurrence frequency of two labels in the training data, but neglects the spatial context between objects. For example, 'person' may appear beside 'table', but not at its bottom. This spatial context is a mixture of patterns, as different object configurations may appear in different images. Second, it defines only pairwise relations between pixels, missing their high-order interactions.

To resolve these issues, we define the smoothness term by leveraging rich information between pixels, which is one of the advantages of DPN over existing deep models. We have

    \Psi(y_i^u, y_j^v) = \sum_{k=1}^{K} \lambda_k\, \mu_k(i, u, j, v) \sum_{\forall z \in \mathcal{N}_j} d(j, z)\, p_z^v.    (4)

The first term in Eqn.(4) learns a mixture of local label contexts, penalizing label assignments in a local region, where K is the number of components in the mixture and λ_k is an indicator determining which component is activated. We define \lambda_k \in \{0, 1\} and \sum_{k=1}^{K} \lambda_k = 1. An intuitive illustration is given in Fig.1 (b), where the dots in red and blue represent a center pixel i and its neighboring pixels j, i.e. j ∈ N_i, and (i, u) indicates assigning label u to pixel i. Here, µ_k(i, u, j, v) outputs the labeling cost between (i, u) and (j, v) with respect to their relative positions. For instance, if u, v represent 'person' and 'table', the learned penalties of positions j at the bottom of center i should be large. The second term basically models a triple penalty, which involves pixels i, j, and j's neighbors, implying that if (i, u) and (j, v) are compatible, then (i, u) should also be compatible with j's nearby pixels (z, v), ∀z ∈ N_j, as shown in Fig.1 (a).

Figure 1: (a) Illustration of the pairwise terms in DPN. (b) explains the label contexts. (c) and (d) show that the mean field update of DPN corresponds to convolutions.
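To make these definitions concrete, here is a minimal NumPy sketch of the unary cost of Eqn.(2) and the distance d(i, j) of Eqn.(3); the weight values and the clipping constant are illustrative placeholders, not values from the paper.

```python
import numpy as np

def unary_term(prob):
    """Eqn.(2): unary cost from per-pixel label probabilities.

    prob: (H, W, L) array of p_i^u, e.g. softmax outputs of the unary net.
    Returns the negative log-probability cost map of the same shape.
    """
    return -np.log(np.clip(prob, 1e-12, 1.0))  # clip is a numerical guard

def pixel_distance(img, i, j, w1=1.0, w2=1.0):
    """d(i, j) = w1*||I_i - I_j||^2 + w2*||pos_i - pos_j||^2 (Eqn.(3)).

    img: (H, W, 3) RGB image; i, j: (row, col) tuples.
    w1, w2 are the constant weights; the values here are placeholders.
    """
    color = np.sum((img[i].astype(float) - img[j].astype(float)) ** 2)
    pos = np.sum((np.array(i, dtype=float) - np.array(j, dtype=float)) ** 2)
    return w1 * color + w2 * pos
```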
Learning the parameters (i.e. the weights of VGG16 and the costs of the label contexts) in Eqn.(1) amounts to minimizing the distance between the ground-truth label map and y, which needs to be inferred subject to the smoothness constraints.

Inference Overview  Inference of Eqn.(1) can be obtained by the mean field (MF) algorithm [27], which estimates the joint distribution of the MRF, P(\mathbf{y}) = \frac{1}{Z}\exp\{-E(\mathbf{y})\}, using a fully-factorized proposal distribution, Q(\mathbf{y}) = \prod_{\forall i \in \mathcal{V}} \prod_{\forall u \in \mathcal{L}} q_i^u, where each q_i^u is a variable we need to estimate, indicating the predicted probability of assigning label u to pixel i. To simplify the discussion, we denote Φ(y_i^u) and Ψ(y_i^u, y_j^v) as Φ_i^u and Ψ_ij^uv, respectively. Q(y) is typically optimized by minimizing a free energy function [15] of the MRF,

    F(Q) = \sum_{\forall i \in \mathcal{V}} \sum_{\forall u \in \mathcal{L}} q_i^u \Phi_i^u + \sum_{\forall i,j \in \mathcal{E}} \sum_{\forall u \in \mathcal{L}} \sum_{\forall v \in \mathcal{L}} q_i^u q_j^v \Psi_{ij}^{uv} + \sum_{\forall i \in \mathcal{V}} \sum_{\forall u \in \mathcal{L}} q_i^u \ln q_i^u.    (5)

Specifically, the first term in Eqn.(5) characterizes the cost of each pixel's predictions, while the second term characterizes the consistency of predictions between pixels. The last term is the entropy, measuring the confidence of the predictions. To estimate q_i^u, we differentiate Eqn.(5) with respect to it and equate the resulting expression to zero. We then have a closed-form expression,

    q_i^u \propto \exp\Big\{-\Big(\Phi_i^u + \sum_{\forall j \in \mathcal{N}_i} \sum_{\forall v \in \mathcal{L}} q_j^v \Psi_{ij}^{uv}\Big)\Big\},    (6)

such that the prediction for each pixel is independently attained by repeating Eqn.(6), which implies that whether pixel i has label u is proportional to the estimated probabilities of all its neighboring pixels, weighted by their corresponding smoothness penalties. Substituting Eqn.(4) into (6), we have

    q_i^u \propto \exp\Big\{-\Phi_i^u - \sum_{k=1}^{K} \lambda_k \sum_{\forall v \in \mathcal{L}} \sum_{\forall j \in \mathcal{N}_i} \mu_k(i, u, j, v) \sum_{\forall z \in \mathcal{N}_j} d(j, z)\, q_j^v q_z^v\Big\},    (7)

where each q_i^u is initialized by the corresponding p_i^u in Eqn.(2), which is the unary prediction of VGG16. Eqn.(7) satisfies the smoothness constraints.

In the following, DPN approximates one iteration of Eqn.(7) by decomposing it into two steps. Let Q^v be the predicted label map of the v-th category. In the first step, as shown in Fig.1 (c), we calculate the triple penalty term in (7) by applying an m×m filter on each position j, where each element of this filter equals d(j, z)q_j^v, resulting in Q^v′. Apparently, this step smooths the prediction at pixel j with respect to the distances between it and its neighborhood. In the second step, as illustrated in Fig.1 (d), the labeling contexts can be obtained by convolving Q^v′ with an n×n filter, each element of which equals µ_k(i, u, j, v), penalizing the triple relations as shown in Fig.1 (a).
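This two-step approximation can be sketched as follows, under simplifying assumptions: a single mixture component (K = 1) and one shared, position-independent distance kernel, whereas DPN's b12 holds a different fixed kernel at every position. The function names are ours.

```python
import numpy as np
from scipy.ndimage import convolve

def one_mf_iteration(unary, q, dist_kernel, context_kernel):
    """One MF iteration of Eqn.(7) as two filtering steps.

    unary:          (L, H, W) unary costs Phi_i^u.
    q:              (L, H, W) current label probabilities q_i^u.
    dist_kernel:    (m, m) kernel standing in for d(j, z).
    context_kernel: (L, L, n, n) label-context filters mu(u, v, offset).
    """
    L = q.shape[0]
    # Step 1 (Fig.1 (c)): triple penalty, smoothing each Q^v into Q^v':
    # sum_z d(j, z) q_j^v q_z^v = q_j^v * (dist_kernel (*) q^v)(j).
    q_prime = np.stack([q[v] * convolve(q[v], dist_kernel, mode='nearest')
                        for v in range(L)])
    # Step 2 (Fig.1 (d)): local label contexts via n-by-n convolution.
    penalty = np.zeros_like(q)
    for u in range(L):
        for v in range(L):
            penalty[u] += convolve(q_prime[v], context_kernel[u, v],
                                   mode='nearest')
    # Combine with the unary term and renormalize per pixel (softmax).
    logits = -(unary + penalty)
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)
```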
3. Deep Parsing Network

This section describes the implementation of Eqn.(7) in the Deep Parsing Network (DPN). DPN extends VGG16 as the unary term, and additional layers are designed to approximate one iteration of MF inference as the pairwise term. The hyper-parameters of VGG16 and DPN are compared in Table 1.

VGG16  As listed in Table 1 (a), the first row gives the name of each layer, and 'x-y' in the second row gives the size of the receptive field and the stride of the convolution, respectively. For instance, '3-1' in a convolutional layer implies that the receptive field of each filter is 3×3 and it is applied on every single pixel of the input feature map, while '2-2' in a max-pooling layer indicates that each feature map is pooled over every other pixel within a 2×2 local region. The last three rows show the number of output feature maps, the activation functions, and the size of the output feature maps, respectively.
(a) VGG16 : 224×224×3 input image; 1×1000 output labels
1 2 3 4 5 6 7 8 9 10 11 12
layer 2×conv max 2×conv max 3×conv max 3×conv max 3×conv max 2×fc fc
filter–stride 3-1 2-2 3-1 2-2 3-1 2-2 3-1 2-2 3-1 2-2 - -
#channel 64 64 128 128 256 256 512 512 512 512 1 1
activation relu idn relu idn relu idn relu idn relu idn relu soft
size 224 112 112 56 56 28 28 14 14 7 4096 1000
(b) DPN: 512×512×3 input image; 512×512×21 output label maps
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
layer 2×conv max 2×conv max 3×conv max 3×conv 3×conv conv conv conv lconv conv bmin sum
filter–stride 3-1 2-2 3-1 2-2 3-1 2-2 3-1 5-1 25-1 1-1 1-1 50-1 9-1 1-1 1-1
#channel 64 64 128 128 256 256 512 512 4096 4096 21 21 105 21 21
activation relu idn relu idn relu idn relu relu relu relu sigm lin lin idn soft
size 512 256 256 128 128 64 64 64 64 64 512 512 512 512 512

Table 1: The comparisons between the network architectures of VGG16 and DPN, as shown in (a) and (b) respectively. Each table contains five rows,
representing the ‘name of layer’, ‘receptive field of filter’−‘stride’, ‘number of output feature maps’, ‘activation function’ and ‘size of output feature
maps’, respectively. Furthermore, ‘conv’, ‘lconv’,‘max’, ‘bmin’, ‘fc’, and ‘sum’ represent the convolution, local convolution, max pooling, block min
pooling, fully connection, and summation, respectively. Moreover, ‘relu’, ‘idn’, ‘soft’, ‘sigm’, and ‘lin’ represent the activation functions, including rectified
linear unit [18], identity, softmax, sigmoid, and linear, respectively.
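For reference, Table 1 (b) can be transcribed into a configuration list; this is purely a restatement of the table, not code released by the authors.

```python
# DPN groups per Table 1 (b): layer type, filter size, stride,
# number of output channels, and activation function.
DPN_GROUPS = [
    dict(name='b1',  layer='2xconv', filt=3,  stride=1, ch=64,   act='relu'),
    dict(name='b2',  layer='max',    filt=2,  stride=2, ch=64,   act='idn'),
    dict(name='b3',  layer='2xconv', filt=3,  stride=1, ch=128,  act='relu'),
    dict(name='b4',  layer='max',    filt=2,  stride=2, ch=128,  act='idn'),
    dict(name='b5',  layer='3xconv', filt=3,  stride=1, ch=256,  act='relu'),
    dict(name='b6',  layer='max',    filt=2,  stride=2, ch=256,  act='idn'),
    dict(name='b7',  layer='3xconv', filt=3,  stride=1, ch=512,  act='relu'),
    dict(name='b8',  layer='3xconv', filt=5,  stride=1, ch=512,  act='relu'),
    dict(name='b9',  layer='conv',   filt=25, stride=1, ch=4096, act='relu'),
    dict(name='b10', layer='conv',   filt=1,  stride=1, ch=4096, act='relu'),
    dict(name='b11', layer='conv',   filt=1,  stride=1, ch=21,   act='sigm'),
    dict(name='b12', layer='lconv',  filt=50, stride=1, ch=21,   act='lin'),
    dict(name='b13', layer='conv',   filt=9,  stride=1, ch=105,  act='lin'),
    dict(name='b14', layer='bmin',   filt=1,  stride=1, ch=21,   act='idn'),
    dict(name='b15', layer='sum',    filt=1,  stride=1, ch=21,   act='soft'),
]
```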

As summarized in Table 1 (a), VGG16 contains thirteen convolutional layers, five max-pooling layers, and three fully-connected layers. These layers can be partitioned into twelve groups, each of which covers one or more homogeneous layers. For example, the first group comprises two convolutional layers with 3×3 receptive fields and 64 output feature maps, each of which is 224×224.

Figure 2: (a) and (b) show the padding of the filters. (c) illustrates the local convolution of b12.

3.1. Modeling Unary Terms

To make full use of VGG16, which is pre-trained on ImageNet, we adopt all its parameters to initialize the filters of the first ten groups of DPN. To simplify the discussion, we take PASCAL VOC 2012 (VOC12) [7] as an example. Note that DPN can easily be adapted to any other semantic image segmentation dataset by modifying its hyper-parameters. VOC12 contains 21 categories, and each image is rescaled to 512×512 in training. Therefore, DPN needs to predict 512×512×21 labels in total, i.e. one label for each pixel. To this end, we extend VGG16 in two aspects.

In particular, let ai and bi denote the i-th group in Table 1 (a) and (b), respectively. First, we increase the resolution of VGG16 by removing its max-pooling layers at a8 and a10, because most of the information is lost after pooling, e.g. a10 reduces the input size by 32 times, i.e. from 224×224 to 7×7. As a result, the smallest feature map in DPN is 64×64, keeping much more information compared with VGG16. Note that the filters of b8 are initialized as the filters of a9, but the 3×3 receptive field is padded into 5×5, as shown in Fig.2 (a), where the cells in white are the original values of the a9 filter and the cells in gray are zeros. This is done because a8 is not present in DPN, such that each filter in a9 should be convolved on every other pixel of a7. To maintain a convolution with stride one, we pad the filters with zeros. Furthermore, the feature maps in b11 are up-sampled to 512×512 by bilinear interpolation. Since DPN is trained with label maps of the entire images, the missing information in the layers preceding b11 can be recovered by BP.

Second, the two fully-connected layers at a11 are transformed into two convolutional layers at b9 and b10, respectively. As shown in Table 1 (a), the first 'fc' layer learns 7×7×512×4096 parameters, which can be altered to 4096 filters in b9, each of which is 25×25×512. Since a8 and a10 have been removed, the 7×7 receptive field is padded into 25×25, similar to the above and shown in Fig.2 (b). The second 'fc' layer learns a 4096×4096 weight matrix, corresponding to 4096 filters in b10. Each filter is 1×1×4096.

Overall, b11 generates the unary labeling results, producing twenty-one 512×512 feature maps, each of which represents the probabilistic label map of one category.
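The zero-padding of filters described above (3×3 → 5×5 for b8, 7×7 → 25×25 for b9) inserts zeros between the original taps, which is equivalent to what is now commonly called dilated convolution. A small sketch, with the function name ours:

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Pad a conv filter with zeros so it skips (rate-1) pixels between
    taps: a 3x3 filter becomes 5x5 for rate=2 (Fig.2 (a)), and a 7x7
    filter becomes 25x25 for rate=4 (Fig.2 (b), two removed poolings).

    kernel: (kh, kw, ...) filter weights; trailing dims are kept.
    """
    kh, kw = kernel.shape[:2]
    out = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1)
                   + kernel.shape[2:], dtype=kernel.dtype)
    out[::rate, ::rate] = kernel  # white cells in Fig.2; the rest stay zero
    return out

# e.g. dilate_kernel(np.ones((3, 3)), 2).shape == (5, 5)
# and  dilate_kernel(np.ones((7, 7)), 4).shape == (25, 25)
```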
3.2. Modeling Smoothness Terms

The last four layers of DPN, i.e. b12 to b15, are carefully designed to smooth the unary labeling results.

• b12  As listed in Table 1 (b), 'lconv' in b12 indicates a locally convolutional layer, which is widely used in face recognition [33, 35] to capture different information from different facial positions. Similarly, distinct spatial positions of b12 have different filters, and each filter is shared across the 21 input channels, as shown in Fig.2 (c). It can be formulated as

    o^{12}_{(j,v)} = \mathrm{lin}(k_{(j,v)} \ast o^{11}_{(j,v)}),    (8)

where lin(x) = ax + b is the linear activation function, '∗' is the convolution operator, and k_{(j,v)} is a 50×50×1 filter at position j of channel v. We have k_{(j,1)} = k_{(j,2)} = ... = k_{(j,21)}, shared across the 21 channels. o^{11}_{(j,v)} indicates a local patch in b11, while o^{12}_{(j,v)} is the corresponding output of b12. Since b12 has stride one, the result of k_{(j,v)} ∗ o^{11}_{(j,v)} is a scalar. In summary, b12 has 512×512 different filters and produces 21 output feature maps.

Eqn.(8) implements the triple penalty of Eqn.(7). Recall that each output feature map of b11 indicates the probabilistic label map of a specific object appearing in the image. As a result, Eqn.(8) suggests that the probability of object v being present at position j is updated by weighted averaging over the probabilities at its nearby positions. Thus, as shown in Fig.1 (c), o^{11}_{(j,v)} corresponds to a patch of Q^v centered at j, which has values p_z^v, ∀z ∈ N_j^{50×50}. Similarly, k_{(j,v)} is initialized by d(j, z)p_j^v, implying that each filter captures dissimilarities between positions. These filters remain fixed during BP, rather than being learned as in a conventional CNN.¹

¹Each filter in b12 actually represents a distance metric between pixels in a specific region. In VOC12, the patterns of all the training images in a specific region are heterogeneous, because of various object shapes. Therefore, we initialize each filter with the Euclidean distance. Nevertheless, Eqn.(8) is a more general form than the triple penalty in Eqn.(7), i.e. the filters in (8) can be automatically learned from data if the patterns in a specific region are homogeneous, such as face or human images, which have more regular shapes than images in VOC12.
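A minimal NumPy sketch of this position-dependent filtering is given below; the small radius, the weights w1, w2, and the linear coefficients a, b are placeholders (the actual layer uses 50×50 kernels and pre-computes the XY part), and all names are ours.

```python
import numpy as np

def triple_penalty_layer(q, img, radius=2, w1=1.0, w2=0.01, a=1.0, b=0.0):
    """Sketch of b12 (Eqn.(8)): at every position j, filter each of the
    L probability maps with a kernel built from d(j, z) * p_j^v.

    q:   (L, H, W) probabilistic label maps from b11.
    img: (H, W, 3) RGB image used to build the distance kernels.
    """
    L, H, W = q.shape
    out = np.zeros_like(q)
    dy, dx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    pos_cost = w2 * (dy ** 2 + dx ** 2)          # spatial part of d(j, z)
    img_p = np.pad(img, ((radius,) * 2, (radius,) * 2, (0, 0)), mode='edge')
    q_p = np.pad(q, ((0, 0), (radius,) * 2, (radius,) * 2), mode='edge')
    for y in range(H):
        for x in range(W):
            patch = img_p[y:y + 2 * radius + 1,
                          x:x + 2 * radius + 1].astype(float)
            d = w1 * ((patch - img[y, x]) ** 2).sum(-1) + pos_cost
            k = d * q[:, y, x][:, None, None]    # kernel d(j, z) * p_j^v
            win = q_p[:, y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            out[:, y, x] = a * (k * win).sum(axis=(1, 2)) + b
    return out
```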
Figure 3: (a) and (b) illustrate the convolutions of b13 and the poolings in b14.

• b13  As shown in Table 1 (b) and Fig.3 (a), b13 is a convolutional layer that generates 105 feature maps by using 105 filters of size 9×9×21. For example, the value at (i, u = 1) is attained by applying a 9×9×21 filter at positions {(j, v = 1, ..., 21)}. In other words, b13 learns a filter for each category to penalize the probabilistic label maps of b12, corresponding to the local label contexts in Eqn.(7), assuming K = 5 and n = 9, as shown in Fig.1 (d).

• b14  As illustrated in Table 1 and Fig.3 (b), b14 is a block min-pooling layer that pools over every 1×1 region with stride one across every 5 input channels, leading to 21 output channels, i.e. 105÷5=21. b14 activates the contextual pattern with the smallest penalty.
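The block min pooling of b14 reduces to a reshape and a min over the channel axis; a sketch:

```python
import numpy as np

def block_min_pool(x, block=5):
    """Sketch of b14: min over each group of `block` consecutive channels
    at every pixel, e.g. (105, H, W) -> (21, H, W), picking the mixture
    component (out of K=5) with the smallest penalty.
    """
    c, h, w = x.shape
    assert c % block == 0
    return x.reshape(c // block, block, h, w).min(axis=1)
```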
• b15  This layer combines both the unary and smoothness terms by summing the outputs of b11 and b14 in an element-wise manner, similar to Eqn.(7),

    o^{15}_{(i,u)} = \frac{\exp\{\ln(o^{11}_{(i,u)}) - o^{14}_{(i,u)}\}}{\sum_{u=1}^{21} \exp\{\ln(o^{11}_{(i,u)}) - o^{14}_{(i,u)}\}},    (9)

where the probability of assigning label u to pixel i is normalized over all the labels.

Relation to Previous Deep Models  Many existing deep models such as [39, 3, 30] employed Eqn.(3) as the pairwise terms, which are special cases of Eqn.(7). To see this, let K = 1 and j = i; the right-hand side of (7) reduces to

    \exp\Big\{-\Phi_i^u - \lambda_1 \sum_{v \in \mathcal{L}} \mu_1(i, u, i, v) \sum_{z \in \mathcal{N}_i} d(i, z)\, p_i^v p_z^v\Big\} = \exp\Big\{-\Phi_i^u - \sum_{v \in \mathcal{L}} \mu(u, v) \sum_{z \in \mathcal{N}_i, z \neq i} d(i, z)\, p_z^v\Big\},    (10)

where µ(u, v) and d(i, z) represent the global label co-occurrence and the pairwise pixel similarity of Eqn.(3), respectively. This holds because λ_1 is a constant, d(i, i) = 0, and µ(i, u, i, v) = µ(u, v). Eqn.(10) is the corresponding MF update equation of (3).

3.3. Learning Algorithms

Learning  The first ten groups of DPN are initialized by VGG16,² while the last four groups can be initialized randomly. DPN is then fine-tuned in an incremental manner with four stages. During fine-tuning, all the stages solve the pixelwise softmax loss [22], but update different sets of parameters.

First, we add a loss function to b11 and fine-tune the weights from b1 to b11 without the last four groups, in order to learn the unary terms. Second, to learn the triple relations, we stack b12 on top of b11 and update its parameters (i.e. ω_1, ω_2 in the distance measure), while the weights of the preceding groups (i.e. b1∼b11) are fixed. Third, b13 and b14 are stacked onto b12 and, similarly, their weights are updated with all the preceding parameters fixed, so as to learn the local label contexts. Finally, all the parameters are jointly fine-tuned.

²We use the released VGG16 model, which is publicly available at http://www.robots.ox.ac.uk/~vgg/research/very_deep/
Implementation  DPN transforms Eqn.(7) into convolutions and poolings in groups b12 to b15, such that the filtering at each pixel can be performed in parallel. Assume we have f input and f′ output feature maps, N×N pixels, filters with s×s receptive fields, and a mini-batch of M samples. b12 takes a total of f · N² · s² · M operations, b13 takes f · f′ · N² · s² · M operations, while both b14 and b15 require f · N² · M operations. For example, when M = 10 as in our experiments, we have 21×512²×50²×10 = 1.3×10¹¹ operations in b12, which has the highest complexity in DPN. We parallelize these operations using matrix multiplication on the GPU as [4] did; b12 can be computed within 30ms. The total runtime of the last four layers of DPN is 75ms. Note that the convolutions in DPN can be further speeded up by low-rank decompositions [14] of the filters and by model compression [13].

In contrast, direct calculation of Eqn.(7) is accelerated by fast Gaussian filtering [1]. For a mini-batch of ten 512×512 images, a recently optimized implementation [16] takes 12 seconds on CPU to compute one iteration of (7). Therefore, DPN makes (7) easier to parallelize and speed up.
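Running these filters as matrix multiplication relies on the standard im2col unrolling; a NumPy sketch, assuming odd filter sizes and zero padding:

```python
import numpy as np

def im2col(x, s):
    """Unroll s-by-s patches of a (C, H, W) map into columns so that
    filtering becomes a single matrix multiplication, the usual trick for
    running convolutions as GEMM on the GPU (cf. [4]); zero padding keeps
    the output the same size as the input.
    """
    c, h, w = x.shape
    r = s // 2
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))
    cols = np.empty((c * s * s, h * w), dtype=x.dtype)
    idx = 0
    for ch in range(c):
        for dy in range(s):
            for dx in range(s):
                cols[idx] = xp[ch, dy:dy + h, dx:dx + w].ravel()
                idx += 1
    return cols

# With filters reshaped to (f_out, C*s*s), one GEMM yields all outputs:
# out = (filters @ im2col(x, s)).reshape(f_out, H, W)
```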

4. Experiments

Dataset  We evaluate the proposed approach on the PASCAL VOC 2012 (VOC12) [7] dataset, which contains 20 object categories and one background category. Following previous works such as [12, 22, 3], we employ 10,582 images for training, 1,449 images for validation, and 1,456 images for testing.

Evaluation Metrics  All existing works employed mean pixelwise intersection-over-union (denoted as mIoU) [22] to evaluate their performance. To fully examine the effectiveness of DPN, we introduce another three metrics: tagging accuracy (TA), localization accuracy (LA), and boundary accuracy (BA). (1) TA compares the predicted image-level tags with the ground-truth tags, measuring the accuracy of multi-class image classification. (2) LA evaluates the IoU between the predicted object bounding boxes³ and the ground-truth bounding boxes (denoted as bIoU), measuring the precision of object localization. (3) For those objects that have been correctly localized, we compare the predicted object boundary with the ground-truth boundary, measuring the precision of the semantic boundary similar to [12].

³They are the bounding boxes of the predicted segmentation regions.
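For completeness, mIoU is conventionally accumulated over the whole evaluation set with a confusion matrix; a sketch, assuming VOC's convention of label 255 for void pixels:

```python
import numpy as np

def update_confusion(conf, pred, gt, num_classes=21):
    """Accumulate an (L, L) confusion matrix over a pair of label maps,
    ignoring void pixels (255 in VOC ground truth)."""
    mask = gt != 255
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    conf += np.bincount(idx, minlength=num_classes ** 2) \
              .reshape(num_classes, num_classes)
    return conf

def mean_iou(conf):
    """mIoU from the accumulated confusion matrix; classes never seen in
    either prediction or ground truth contribute zero here."""
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    return float(np.mean(inter / np.maximum(union, 1)))
```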
Comparisons  DPN is compared with the best-performing methods on VOC12, including FCN [22], Zoom-out [25], DeepLab [3], WSSL [28], BoxSup [5], Piecewise [19], and RNN [39]. All these methods are based on CNNs and MRFs, and are trained on VOC12 data following [22]. They can be grouped according to different aspects: (1) joint-train: Piecewise and RNN; (2) w/o joint-train: DeepLab, WSSL, FCN, and BoxSup; (3) pre-train on COCO: RNN, WSSL, and BoxSup. The first and second groups are the methods with and without joint training of CNNs and MRFs, respectively. Methods in the last group also employed MS-COCO [20] to pre-train their deep models. To conduct a comprehensive comparison, the performance of DPN is reported in both settings, i.e. with and without pre-training on COCO.

In the following, Sec.4.1 investigates the effectiveness of different components of DPN on the VOC12 validation set, and Sec.4.2 compares DPN with the state-of-the-art methods on the VOC12 test set.

Table 2: Ablation study of hyper-parameters.
(a) Comparisons between different receptive fields of b12:
Receptive Field   baseline   10×10   50×50   100×100
mIoU (%)          63.4       63.8    64.7    64.3
(b) Comparisons between different receptive fields of b13:
Receptive Field   1×1    5×5    9×9    9×9 mixtures
mIoU (%)          64.8   66.0   66.3   66.5
(c) Comparing the pairwise terms of different methods:
Pairwise Terms    DSN [30]   DeepLab [3]   DPN
improvement (%)   2.6        3.3           5.4

4.1. Effectiveness of DPN

All the models evaluated in this section are trained and tested on VOC12.

Triple Penalty  The receptive field of b12 indicates the range of triple relations for each pixel. We examine different settings of the receptive field, including '10×10', '50×50', and '100×100', as shown in Table 2 (a), where '50×50' achieves the best mIoU, slightly better than '100×100'. For a 512×512 image, this result implies that a 50×50 neighborhood is sufficient to capture relations between pixels, while smaller or larger regions tend to under-fit or over-fit the training data. Moreover, all models with triple relations outperform the 'baseline' method that models dense pairwise relations, i.e. VGG16+denseCRF [16].

Label Contexts  The receptive field of b13 indicates the range of the local label context. To evaluate its effectiveness, we fix the receptive field of b12 to 50×50. As summarized in Table 2 (b), '9×9 mixtures' improves the preceding settings by 1.7, 0.5, and 0.2 percent, respectively. We observe a large gap between '1×1' and '5×5'. Note that the 1×1 receptive field of b13 corresponds to learning a global label co-occurrence without considering local spatial contexts. Table 2 (c) shows that the pairwise terms of DPN are more effective than those of DSN and DeepLab.⁴

More importantly, the mIoU of all categories can be improved by increasing the size of the receptive field and learning a mixture. Specifically, for each category, the improvements of the last three settings in Table 2 (b) over the first one are 1.2±0.2, 1.5±0.2, and 1.7±0.3, respectively. We also visualize the learned label compatibilities and contexts in Fig.4 (a) and (b), respectively. Fig.4 (a) is obtained by summing each filter in b13 over its 9×9 region, indicating how likely a column object is to be present when a row object is present; blue represents high possibility.

⁴The other deep models such as RNN and Piecewise did not report the exact improvements after combining unary and pairwise terms.
Figure 4: Visualization of (a) the learned label compatibilities and (b) the learned contextual information, with panels such as bottle:bottle, train:bkg, person:mbike, and chair:person. (Best viewed in color)

The matrix in Fig.4 (a) is not symmetric. For example, when 'horse' is present, 'person' is more likely to be present than the other objects. Also, 'chair' is compatible with 'table', and 'bkg' is compatible with all the objects. Fig.4 (b) visualizes some contextual patterns, where 'A:B' indicates where 'B' is more likely to be present when 'A' is present. For example, 'bkg' is around 'train', 'motor bike' is below 'person', and 'person' is sitting on 'chair'.

Incremental Learning  As discussed in Sec.3.3, DPN is trained in an incremental manner. The right-hand side of Table 3 (a) demonstrates that each stage leads to a performance gain over the previous stage. For instance, 'triple penalty' improves 'unary term' by 2.3 percent, while 'label contexts' improves 'triple penalty' by 1.8 percent. More importantly, jointly fine-tuning all the components (i.e. the unary and pairwise terms) of DPN achieves a further gain of 1.3 percent. A step-by-step visualization is provided in Fig.5.

Figure 5: Step-by-step visualization of DPN: (a) original image, (b) ground truth, (c) unary term, (d) +triple penalty, (e) +label contexts, (f) +joint tuning. (Best viewed in color)

We also compare 'incremental learning' with 'joint learning', which fine-tunes all the components of DPN at the same time. Their training curves are plotted in Fig.6 (a), showing that the former leads to higher and more stable accuracies across iterations, while the latter may get stuck in local minima. This difference is easy to understand, because incremental learning only introduces new parameters once all existing parameters have been fine-tuned.

Figure 6: Ablation study of (a) the training strategy (mIoU vs. number of training iterations, incremental vs. joint learning) and (b) the required number of MF iterations (mIoU of the DPN pairwise terms vs. denseCRF [16]). (Best viewed in color)

One-iteration MF  DPN approximates one iteration of MF. Fig.6 (b) illustrates that DPN reaches a good accuracy with one MF iteration. A CRF [16] with dense pairwise edges needs more than 5 iterations to converge and still shows a large gap compared to DPN. Note that existing deep models such as [3, 39, 30] required 5∼10 iterations to converge as well.

Figure 7: Stage-wise analysis of (a) mean tagging accuracy, (b) mean localization accuracy (mean bounding-box IoU), and (c) mean boundary accuracy, across the stages 'unary term', 'triple penalty', 'label contexts', and 'joint tuning'.

Different Components Modeling Different Information  We further evaluate DPN using the three metrics. The results are given in Fig.7. For example, Fig.7 (a) illustrates that the tagging accuracy is improved in the third stage, as it captures label co-occurrence with a mixture of contextual patterns. However, TA decreases a little after the final stage: since joint tuning maximizes segmentation accuracy by optimizing all components together, extremely small objects, which rarely occur in the VOC training set, are discarded. As shown in Fig.7 (b), the accuracy of object localization is significantly improved in the second and final stages. This is intuitive, because the unary prediction can be refined by long-range and high-order pixel relations, and joint training further improves the results. Fig.7 (c) discloses that the second stage also captures object boundaries, since it measures dissimilarities between pixels.

Per-class Analysis  Table 3 (a) reports the per-class accuracies of the four evaluation metrics, where the first four rows represent the mIoU of the four stages, while the last three rows represent TA, LA, and BA, respectively. We have several valuable observations, which motivate future research.
areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Avg.
Unary Term (mIoU) 77.5 34.1 76.2 58.3 63.3 78.1 72.5 76.5 26.6 59.9 40.8 70.0 62.9 69.3 76.3 39.2 70.4 37.6 72.5 57.3 62.4
+ Triple Penalty 82.3 35.9 80.6 60.1 64.8 79.5 74.1 80.9 27.9 63.5 40.4 73.8 66.7 70.8 79.0 42.0 74.1 39.1 73.2 58.5 64.7
+ Label Contexts 83.2 35.6 82.6 61.6 65.5 80.5 74.3 82.6 29.9 67.9 47.5 75.2 70.3 71.4 79.6 42.7 77.8 40.6 75.3 59.1 66.5
+ Joint Tuning 84.8 37.5 80.7 66.3 67.5 84.2 76.4 81.5 33.8 65.8 50.4 76.8 67.1 74.9 81.1 48.3 75.9 41.8 76.6 60.4 67.8
TA (tagging Acc.) 98.8 97.9 98.4 97.7 96.1 98.6 95.2 96.8 90.1 97.5 95.7 96.7 96.3 98.1 93.3 96.1 98.7 92.2 97.4 96.3 96.4
LA (bIoU) 81.7 76.3 75.5 70.3 54.4 86.4 70.6 85.6 51.8 79.6 57.1 83.3 79.2 80.0 74.1 53.1 79.1 68.4 76.3 58.8 72.1
BA (boundary Acc.) 95.9 83.9 96.9 92.6 93.8 94.0 95.7 95.6 89.5 93.3 91.4 95.2 94.2 92.7 94.5 90.4 94.8 90.5 93.7 96.6 93.3
(a) Per-class results on VOC12 val.
areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
FCN [22] 76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4 62.5 46.8 71.8 63.9 76.5 73.9 45.2 72.4 37.4 70.9 55.1 62.2
Zoom-out [25] 85.6 37.3 83.2 62.5 66.0 85.1 80.7 84.9 27.2 73.2 57.5 78.1 79.2 81.1 77.1 53.6 74.0 49.2 71.7 63.3 69.6
Piecewise [19] 87.5 37.7 75.8 57.4 72.3 88.4 82.6 80.0 33.4 71.5 55.0 79.3 78.4 81.3 82.7 56.1 79.8 48.6 77.1 66.3 70.7
DeepLab [3] 84.4 54.5 81.5 63.6 65.9 85.1 79.1 83.4 30.7 74.1 59.8 79.0 76.1 83.2 80.8 59.7 82.2 50.4 73.1 63.7 71.6
RNN [39] 87.5 39.0 79.7 64.2 68.3 87.6 80.8 84.4 30.4 78.2 60.4 80.5 77.8 83.1 80.6 59.5 82.8 47.8 78.3 67.1 72.0
WSSL† [28] 89.2 46.7 88.5 63.5 68.4 87.0 81.2 86.3 32.6 80.7 62.4 81.0 81.3 84.3 82.1 56.2 84.6 58.3 76.2 67.2 73.9
RNN† [39] 90.4 55.3 88.7 68.4 69.8 88.3 82.4 85.1 32.6 78.5 64.4 79.6 81.9 86.4 81.8 58.6 82.4 53.5 77.4 70.1 74.7
BoxSup† [5] 89.8 38.0 89.2 68.9 68.0 89.6 83.0 87.7 34.4 83.6 67.1 81.5 83.7 85.2 83.5 58.6 84.9 55.8 81.2 70.7 75.2
DPN 87.7 59.4 78.4 64.9 70.3 89.3 83.5 86.1 31.7 79.9 62.6 81.9 80.0 83.5 82.3 60.5 83.2 53.4 77.9 65.0 74.1
DPN† 89.0 61.6 87.7 66.8 74.7 91.2 84.3 87.6 36.5 86.3 66.1 84.4 87.8 85.6 85.4 63.6 87.3 61.3 79.4 66.4 77.5
(b) Per-class results on VOC12 test. The approaches pre-trained on COCO [20] are marked with † .

Table 3: Per-class results on VOC12.

(1) Joint training benefits most of the categories, except animals such as 'bird', 'cat', and 'cow'. Some instances of these categories are extremely small, so joint training discards them for smoother results. (2) Training DPN with pixelwise label maps implicitly models image-level tags, since it achieves a high average TA of 96.4%. (3) Object localization always helps. However, for objects with complex boundaries such as 'bike', the mIoU is low even when they can be localized, e.g. 'bike' has high LA but low BA and mIoU. (4) Failures of different categories have different causes, and with these three metrics they can be easily identified. For example, the failures on 'chair', 'table', and 'plant' are caused by the difficulty of accurately capturing their bounding boxes and boundaries. Although 'bottle' and 'tv' are also difficult to localize, they achieve moderate mIoU because of their regular shapes. In other words, the mIoU of 'bottle' and 'tv' could be significantly improved if they were accurately localized.

4.2. Overall Performance

As shown in Table 3 (b), we compare DPN with the best-performing methods⁵ on the VOC12 test set under two settings, i.e. with and without pre-training on COCO. The approaches pre-trained on COCO are marked with '†'. We evaluate DPN on several scales of the images and then average the results, following [3, 19].

DPN outperforms all the existing methods that were trained on VOC12, while needing only one MF iteration to solve the MRF, rather than the 10 iterations of RNN, DeepLab, and Piecewise. By averaging the results of two DPNs, we achieve 74.1% accuracy on VOC12 without outside training data. As discussed in Sec.3.3, the MF iteration is the most complex step even when implemented as convolutions. Therefore, DPN reduces the runtime by at least 10× compared to previous works.

Following [39, 5], we pre-train DPN on COCO, where the 20 object categories that are also present in VOC12 are selected for training. A single DPN† achieves 77.5% mIoU on the VOC12 test set. As shown in Table 3 (b), we observe that DPN† achieves the best performance on more than half of the object classes. Please refer to the appendices for visual quality comparisons.

5. Conclusion

We proposed the Deep Parsing Network (DPN) to address semantic image segmentation. DPN has several appealing properties. First, DPN unifies the inference and learning of the unary and pairwise terms in a single convolutional network; no iterative inference is required during back-propagation. Second, high-order relations and mixtures of label contexts are incorporated into its pairwise-term modeling, making existing works its special cases. Third, DPN is built upon conventional operations of a CNN and is thus easy to parallelize and speed up.

DPN achieves state-of-the-art performance on VOC12, and multiple valuable facts about semantic image segmentation are revealed through extensive experiments. Future directions include investigating the generalizability of DPN to more challenging scenarios, e.g. a large number of object classes and substantial appearance/scale variations.

⁵The results of these methods were presented in either the published papers or arXiv pre-prints.

References
[1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, volume 29, pages 753–762, 2010.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, 2011.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[4] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. In NIPS Deep Learning Workshop, 2014.
[5] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. arXiv:1503.01640v2, 18 May 2015.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, 2013.
[9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. IJCV, 70(1):41–54, 2006.
[10] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. IJCV, 40(1):25–47, 2000.
[11] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670–677, 2009.
[12] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, pages 991–998, 2011.
[13] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
[14] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[15] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[16] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[17] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In ICML, pages 513–521, 2013.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[19] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013v2, 23 Apr 2015.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
[21] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. PAMI, 33(12):2368–2382, 2011.
[22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[23] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, pages 2480–2487, 2012.
[24] P. Luo, X. Wang, and X. Tang. Pedestrian parsing via deep decompositional network. In ICCV, pages 2648–2655, 2013.
[25] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, pages 3376–3385, 2015.
[26] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, pages 891–898, 2014.
[27] M. Opper, O. Winther, et al. From naive mean field theory to the tap equations. 2001.
[28] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. arXiv:1502.02734v2, 8 May 2015.
[29] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, pages 10–17, 2003.
[30] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351v1, 9 Mar 2015.
[31] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, 2000.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[34] M. Szummer, P. Kohli, and D. Hoiem. Learning crfs using graph cuts. In ECCV, pages 582–595, 2008.
[35] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[36] V. Vineet, G. Sheasby, J. Warrell, and P. H. Torr. Posefield: An efficient mean-field based method for joint estimation of human pose, segmentation, and depth. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 180–194. Springer, 2013.
[37] V. Vineet, J. Warrell, and P. H. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. In ECCV, pages 31–44, 2012.
[38] J. Yang, B. Price, S. Cohen, and M.-H. Yang. Context driven scene parsing with attention to rare classes. In CVPR, pages 3294–3301, 2014.
[39] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. arXiv:1502.03240v2, 30 Apr 2015.
Appendices

A. Fast Implementation of Local Convolution

b12 in DPN is a locally convolutional layer. As mentioned in Eqn.(3), the local filters in b12 are computed from the distances between the RGB values of the pixels. The XY coordinates are omitted here because they can be pre-computed. To accelerate the local convolution, a lookup-table-based filtering approach is employed. We first construct a lookup table storing the distances between any two pixel intensities (ranging from 0 to 255), which results in a 256×256 matrix. Then, when we perform the local convolution, the kernels' coefficients can be obtained efficiently by simply looking up the table.
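A sketch of the lookup-table trick for a single intensity channel (an RGB kernel would sum three such lookups, and the pre-computed XY cost would be added on top); the names and the weight w1 are ours:

```python
import numpy as np

def build_intensity_table(w1=1.0):
    """256x256 lookup table of w1 * (a - b)^2 for all intensity pairs."""
    a = np.arange(256, dtype=np.float64)
    return w1 * (a[:, None] - a[None, :]) ** 2

def kernel_from_table(table, patch, center):
    """Kernel coefficients for one channel: look up the color part of
    d(j, z) for an integer-valued patch around the center intensity.

    table:  output of build_intensity_table.
    patch:  integer array of neighborhood intensities (0..255).
    center: integer intensity of the center pixel j.
    """
    return table[center, patch]  # fancy indexing, no arithmetic per pixel
```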

B. Visual Quality Comparisons

In the following, we inspect the visual quality of the obtained label maps. Fig.8 demonstrates comparisons of DPN with FCN [22] and DeepLab [3]. We use the publicly released model⁶ to re-generate the label maps of FCN, while the results of DeepLab are extracted from their published paper. DPN generally makes more accurate predictions at both the image level and the instance level.

We also include more examples of DPN label maps in Fig.9. We observe that learning local label contexts helps differentiate confusing objects, and learning the triple penalty facilitates the capturing of intrinsic object boundaries.

⁶http://dl.caffe.berkeleyvision.org/fcn-8s-pascal.caffemodel
Figure 8: Visual quality comparison of different semantic image segmentation methods: (a) input image, (b) ground truth, (c) FCN [22], (d) DeepLab [3], and (e) DPN.

Figure 9: Visual quality of DPN label maps: (a) input image, (b) ground truth (white labels indicating ambiguous regions), and (c) DPN.
