Semantic Image Segmentation Via Deep Parsing Network
Ziwei Liu† Xiaoxiao Li† Ping Luo Chen Change Loy Xiaoou Tang
Department of Information Engineering, The Chinese University of Hong Kong
{lz013,lx015,pluo,ccloy,xtang}@ie.cuhk.edu.hk
Table 1: Comparison between the network architectures of VGG16 and DPN, shown in (a) and (b) respectively. Each table contains five rows, representing the ‘name of layer’, ‘receptive field of filter’−‘stride’, ‘number of output feature maps’, ‘activation function’, and ‘size of output feature maps’, respectively. Furthermore, ‘conv’, ‘lconv’, ‘max’, ‘bmin’, ‘fc’, and ‘sum’ represent convolution, local convolution, max pooling, block min pooling, fully-connected, and summation layers, respectively. Moreover, ‘relu’, ‘idn’, ‘soft’, ‘sigm’, and ‘lin’ represent the activation functions, including rectified linear unit [18], identity, softmax, sigmoid, and linear, respectively.
4. Experiments
Dataset  We evaluate the proposed approach on the PASCAL VOC 2012 (VOC12) [7] dataset, which contains 20 object categories and one background category. Following previous works such as [12, 22, 3], we employ 10,582 images for training, 1,449 images for validation, and 1,456 images for testing.

Evaluation Metrics  All existing works employed mean pixelwise intersection-over-union (denoted as mIoU) [22] to evaluate their performance. To fully examine the effectiveness of DPN, we introduce three additional metrics: tagging accuracy (TA), localization accuracy (LA), and boundary accuracy (BA). (1) TA compares the predicted image-level tags with the ground-truth tags, measuring the accuracy of multi-class image classification. (2) LA evaluates the IoU between the predicted object bounding boxes3 and the ground-truth bounding boxes (denoted as bIoU), measuring the precision of object localization. (3) For objects that have been correctly localized, BA compares the predicted object boundary with the ground-truth boundary, measuring the precision of the semantic boundary, similar to [12].

3 They are the bounding boxes of the predicted segmentation regions.
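For concreteness, the mIoU metric above can be computed from a confusion matrix accumulated over all valid pixels. The following is a minimal sketch following the standard VOC12 convention (21 classes, void pixels ignored); it is a generic illustration, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(preds, gts, num_classes=21, ignore_label=255):
    """Mean pixelwise intersection-over-union over a set of images.

    `preds` and `gts` are iterables of HxW integer label maps; pixels
    equal to `ignore_label` (the void label in VOC12) are excluded.
    Generic sketch only, not the authors' evaluation code.
    """
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        valid = g != ignore_label
        # Accumulate a confusion matrix: rows are ground-truth classes,
        # columns are predicted classes.
        idx = g[valid].astype(np.int64) * num_classes + p[valid].astype(np.int64)
        conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)
    return iou[union > 0].mean()  # average over classes that actually appear
```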
Comparisons  DPN is compared with the best-performing methods on VOC12, including FCN [22], Zoom-out [25], DeepLab [3], WSSL [28], BoxSup [5], Piecewise [19], and RNN [39]. All these methods are based on CNNs and MRFs, and are trained on VOC12 data following [22]. They can be grouped according to different aspects: (1) joint-train: Piecewise and RNN; (2) w/o joint-train: DeepLab, WSSL, FCN, and BoxSup; (3) pre-train on COCO: RNN, WSSL, and BoxSup. The first and second groups are the methods with and without joint training of CNNs and MRFs, respectively. Methods in the last group also employed MS-COCO [20] to pre-train deep models. To conduct a comprehensive comparison, the performance of DPN is reported in both settings, i.e., with and without pre-training on COCO.

In the following, Sec. 4.1 investigates the effectiveness of different components of DPN on the VOC12 validation set, and Sec. 4.2 compares DPN with the state-of-the-art methods on the VOC12 test set.

4.1. Effectiveness of DPN

All the models evaluated in this section are trained and tested on VOC12.

Triple Penalty  The receptive field of b12 indicates the range of triple relations for each pixel. We examine different settings of the receptive field, including ‘10×10’, ‘50×50’, and ‘100×100’, as shown in Table 2 (a), where ‘50×50’ achieves the best mIoU, which is slightly better than ‘100×100’. For a 512×512 image, this result implies that a 50×50 neighborhood is sufficient to capture relations between pixels, while smaller or larger regions tend to under-fit or over-fit the training data. Moreover, all models with triple relations outperform the ‘baseline’ method that models dense pairwise relations, i.e. VGG16+denseCRF [16].

Label Contexts  The receptive field of b13 indicates the range of the local label context. To evaluate its effectiveness, we fix the receptive field of b12 as 50×50. As summarized in Table 2 (b), ‘9×9 mixtures’ improves the preceding settings by 1.7, 0.5, and 0.2 percent, respectively. We observe that a large gap exists between ‘1×1’ and ‘5×5’. Note that the 1×1 receptive field of b13 corresponds to learning a global label co-occurrence without considering local spatial contexts. Table 2 (c) shows that the pairwise terms of DPN are more effective than those of DSN and DeepLab4.

More importantly, the mIoU of all categories can be improved by increasing the size of the receptive field and learning a mixture. Specifically, for each category, the improvements of the last three settings in Table 2 (b) over the first one are 1.2±0.2, 1.5±0.2, and 1.7±0.3, respectively.

We also visualize the learned label compatibilities and contexts in Fig. 4 (a) and (b), respectively. (a) is obtained by summing each filter in b13 over the 9×9 region, indicating how likely a column object is to appear when a row object is present. Blue represents high possibility.

4 The other deep models such as RNN and Piecewise did not report the exact improvements after combining unary and pairwise terms.
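As a concrete illustration of the reduction described above, the compatibility matrix of Fig. 4 (a) can be obtained by collapsing each learned filter over its spatial extent. The sketch below assumes the b13 weights have been exported as a NumPy array of shape (num_classes, num_classes, 9, 9); the storage layout of the actual model may differ.

```python
import numpy as np

def label_compatibility(b13_filters):
    """Collapse learned label-context filters into a class-by-class
    compatibility matrix, as visualized in Fig. 4 (a).

    `b13_filters` is assumed to have shape (num_classes, num_classes, 9, 9):
    one 9x9 spatial filter for every (row class, column class) pair.
    This is only an illustrative reduction of exported weights.
    """
    # Sum each 9x9 filter over its spatial extent, leaving one scalar
    # per (row, column) class pair.
    return b13_filters.sum(axis=(2, 3))
```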
(Fig. 4 panels: ‘bottle : bottle’, ‘train : bkg’, ‘person : mbike’, ‘chair : person’; rows and columns span the 21 VOC12 categories.)

(a) Original Image (b) Ground Truth (c) Unary Term

Figure 6: Ablation study of (a) training strategy (b) required MF iterations. (Best viewed in color) (Curves compare ‘Incremental Learning’ with ‘Joint Learning’ and ‘DPN pairwise terms’ with ‘denseCRF [16]’; the y-axis is mIoU, the x-axes are Number of Training Iterations and Number of MF Iterations.)

Figure 7: Stage-wise analysis of (a) mean tagging accuracy, (b) mean localization accuracy, and (c) mean boundary accuracy.
joint training discards them for smoother results. (2) Training DPN with pixelwise label maps implicitly models image-level tags, since it achieves a high averaged TA of 96.4%. (3) Object localization always helps. However, for an object with a complex boundary such as ‘bike’, its mIoU is low even if it can be localized, e.g. ‘bike’ has high LA but low BA and mIoU. (4) Failures of different categories have different factors. With these three metrics, they can be easily identified. For example, the failures of ‘chair’, ‘table’, and ‘plant’ are caused by the difficulty of accurately capturing their bounding boxes and boundaries. Although ‘bottle’ and ‘tv’ are also difficult to localize, they achieve moderate mIoU because of their regular shapes. In other words, the mIoU of ‘bottle’ and ‘tv’ can be significantly improved if they can be accurately localized.
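As footnote 3 states, LA operates on the bounding boxes of the predicted segmentation regions. A minimal sketch of how such boxes can be derived from a predicted label map is given below; taking one tight box per class is a simplification introduced here, since the handling of multiple instances per class is not specified in this section.

```python
import numpy as np

def boxes_from_label_map(label_map, num_classes=21, background=0):
    """Derive one bounding box per predicted class from an HxW label map.

    Simplified sketch: the tight box around all pixels of each
    non-background class, ignoring multiple instances of the same class.
    """
    boxes = {}
    for c in range(num_classes):
        if c == background:
            continue
        ys, xs = np.nonzero(label_map == c)
        if ys.size == 0:
            continue  # class not predicted in this image
        boxes[c] = (xs.min(), ys.min(), xs.max(), ys.max())  # (x1, y1, x2, y2)
    return boxes
```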
4.2. Overall Performance

As shown in Table 3 (b), we compare DPN with the best-performing methods5 on the VOC12 test set under two settings, i.e., with and without pre-training on COCO. The approaches pre-trained on COCO are marked with ‘†’. We evaluate DPN on several scales of the images and then average the results, following [3, 19].

5 The results of these methods were presented in either the published papers or arXiv pre-prints.
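The multi-scale testing scheme mentioned above amounts to fusing per-pixel class scores across rescaled copies of the input. The sketch below illustrates this; the `run_dpn` forward function and the scale set are placeholders, since the exact scales are not listed in this section.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_predict(image, run_dpn, scales=(0.7, 1.0, 1.3)):
    """Average per-pixel class scores over several image scales.

    `run_dpn(image)` is a placeholder for a forward pass that returns
    scores of shape (num_classes, H, W); the scale set is illustrative.
    `image` is an HxWx3 array.
    """
    h, w = image.shape[:2]
    fused = None
    for s in scales:
        resized = zoom(image, (s, s, 1), order=1)            # rescale the input image
        scores = run_dpn(resized)                            # (C, s*H, s*W)
        _, sh, sw = scores.shape
        scores = zoom(scores, (1, h / sh, w / sw), order=1)  # resize scores back
        fused = scores if fused is None else fused + scores
    return (fused / len(scales)).argmax(axis=0)              # per-pixel label map
```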
DPN outperforms all the existing methods that were trained on VOC12, yet DPN needs only one MF iteration to solve the MRF, rather than the 10 iterations of RNN, DeepLab, and Piecewise. By averaging the results of two DPNs, we achieve 74.1% accuracy on VOC12 without outside training data. As discussed in Sec. 3.3, the MF iteration is the most complex step even when it is implemented as convolutions. Therefore, DPN reduces the runtime by at least 10× compared to previous works.

Following [39, 5], we pre-train DPN with COCO, where the 20 object categories that are also present in VOC12 are selected for training. A single DPN† achieves 77.5% mIoU on the VOC12 test set. As shown in Table 3 (b), we observe that DPN† achieves the best performance on more than half of the object classes. Please refer to the appendices for visual quality comparisons.
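The claim that a single MF iteration can be carried out with convolutions can be illustrated with a simplified sketch. The update below is an assumption made for illustration only: it uses one set of label-compatibility kernels and a plain subtractive update, and omits DPN's triple penalty and mixture of label contexts.

```python
import numpy as np
from scipy.ndimage import convolve

def one_mf_iteration(unary, context_filters):
    """One mean-field update with message passing written as convolutions.

    Simplified sketch, not DPN's exact formulation. `unary` holds per-pixel
    class scores of shape (C, H, W); `context_filters` holds
    label-compatibility kernels of shape (C, C, k, k).
    """
    num_classes = unary.shape[0]
    # Softmax over classes to obtain the current marginal estimates Q.
    q = np.exp(unary - unary.max(axis=0, keepdims=True))
    q /= q.sum(axis=0, keepdims=True)
    messages = np.zeros(unary.shape)
    for i in range(num_classes):
        for j in range(num_classes):
            # Spatially smooth Q_j with the (i, j) context kernel.
            messages[i] += convolve(q[j], context_filters[i, j], mode='nearest')
    return unary - messages  # refined scores after a single update
```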
5. Conclusion

We proposed the Deep Parsing Network (DPN) to address semantic image segmentation, which has several appealing properties. First, DPN unifies the inference and learning of the unary term and pairwise terms in a single convolutional network; no iterative inference is required during back-propagation. Second, high-order relations and mixtures of label contexts are incorporated into its pairwise-term modeling, making existing works special cases. Third, DPN is built upon conventional operations of CNN and is thus easy to parallelize and speed up.

DPN achieves state-of-the-art performance on VOC12, and multiple valuable facts about semantic image segmentation are revealed through extensive experiments. Future directions include investigating the generalizability of DPN to more challenging scenarios, e.g. a large number of object classes and substantial appearance/scale variations.

References
[1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, volume 29, pages 753–762, 2010.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, 2011.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[4] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. In NIPS Deep Learning Workshop, 2014.
[5] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. arXiv:1503.01640v2, 18 May 2015.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, 2013.
[9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. IJCV, 70(1):41–54, 2006.
[10] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. IJCV, 40(1):25–47, 2000.
[11] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670–677, 2009.
[12] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, pages 991–998, 2011.
[13] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
[14] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[15] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[16] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[17] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In ICML, pages 513–521, 2013.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[19] G. Lin, C. Shen, I. Reid, and A. Hengel. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013v2, 23 Apr 2015.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
[21] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. PAMI, 33(12):2368–2382, 2011.
[22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[23] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, pages 2480–2487, 2012.
[24] P. Luo, X. Wang, and X. Tang. Pedestrian parsing via deep decompositional network. In ICCV, pages 2648–2655, 2013.
[25] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, pages 3376–3385, 2015.
[26] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, pages 891–898, 2014.
[27] M. Opper, O. Winther, et al. From naive mean field theory to the tap equations. 2001.
[28] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. arXiv:1502.02734v2, 8 May 2015.
[29] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, pages 10–17, 2003.
[30] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351v1, 9 Mar 2015.
[31] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, 2000.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[34] M. Szummer, P. Kohli, and D. Hoiem. Learning crfs using graph cuts. In ECCV, pages 582–595, 2008.
[35] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[36] V. Vineet, G. Sheasby, J. Warrell, and P. H. Torr. Posefield: An efficient mean-field based method for joint estimation of human pose, segmentation, and depth. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 180–194. Springer, 2013.
[37] V. Vineet, J. Warrell, and P. H. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. In ECCV, pages 31–44, 2012.
[38] J. Yang, B. Price, S. Cohen, and M.-H. Yang. Context driven scene parsing with attention to rare classes. In CVPR, pages 3294–3301, 2014.
[39] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. arXiv:1502.03240v2, 30 Apr 2015.
Appendices
6 http://dl.caffe.berkeleyvision.org/fcn-8s-pascal.caffemodel
Figure 8: Visual quality comparison of different semantic image segmentation methods: (a) input image (b) ground truth (c)
FCN [22] (d) DeepLab [3] and (e) DPN.
Figure 9: Visual quality of DPN label maps: (a) input image (b) ground truth (white labels indicating ambiguous regions)
and (c) DPN.