Deeplab: Semantic Image Segmentation With Deep Convolutional Nets, Atrous Convolution, and Fully Connected Crfs
arXiv:1606.00915v2 [cs.CV] 12 May 2017
Abstract—In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions
that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or
‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at
which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of
view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we
propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional
feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at
multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical
models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on
localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional
Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed
“DeepLab” system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in
the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code
is made publicly available online.
Index Terms—Convolutional Neural Networks, Semantic Segmentation, Atrous Convolution, Conditional Random Fields.
• L.-C. Chen, G. Papandreou, and K. Murphy are with Google Inc. I. Kokkinos is with University College London. A. Yuille is with the Departments of Cognitive Science and Computer Science, Johns Hopkins University. The first two authors contributed equally to this work.
1 INTRODUCTION
Deep Convolutional Neural Networks (DCNNs) [1] have pushed the performance of computer vision systems to soaring heights on a broad array of high-level problems, including image classification [2], [3], [4], [5], [6] and object detection [7], [8], [9], [10], [11], [12], where DCNNs trained in an end-to-end manner have delivered strikingly better results than systems relying on hand-crafted features. Essential to this success is the built-in invariance of DCNNs to local image transformations, which allows them to learn increasingly abstract data representations [13]. This invariance is clearly desirable for classification tasks, but can hamper dense prediction tasks such as semantic segmentation, where abstraction of spatial information is undesired.
In particular we consider three challenges in the application of DCNNs to semantic image segmentation: (1) reduced feature resolution, (2) existence of objects at multiple scales, and (3) reduced localization accuracy due to DCNN invariance. Next, we discuss these challenges and our approach to overcome them in our proposed DeepLab system.
The first challenge is caused by the repeated combination of max-pooling and downsampling (‘striding’) performed at consecutive layers of DCNNs originally designed for image classification [2], [4], [5]. This results in feature maps with significantly reduced spatial resolution when the DCNN is employed in a fully convolutional fashion [14]. In order to overcome this hurdle and efficiently produce denser feature maps, we remove the downsampling operator from the last few max pooling layers of DCNNs and instead upsample the filters in subsequent convolutional layers, resulting in feature maps computed at a higher sampling rate. Filter upsampling amounts to inserting holes (‘trous’ in French) between nonzero filter taps. This technique has a long history in signal processing, originally developed for the efficient computation of the undecimated wavelet transform in a scheme also known as “algorithme à trous” [15]. We use the term atrous convolution as a shorthand for convolution with upsampled filters. Various flavors of this idea have been used before in the context of DCNNs by [3], [6], [16].
In practice, we recover full resolution feature maps by a combination of atrous convolution, which computes feature maps more densely, followed by simple bilinear interpolation of the feature responses to the original image size. This scheme offers a simple yet powerful alternative to using deconvolutional layers [13], [14] in dense prediction tasks. Compared to regular convolution with larger filters, atrous convolution allows us to effectively enlarge the field of view of filters without increasing the number of parameters or the amount of computation.
The second challenge is caused by the existence of objects at multiple scales. A standard way to deal with this is to present to the DCNN rescaled versions of the same image and then aggregate the feature or score maps [6], [17], [18]. We show that this approach indeed increases the performance of our system, but comes at the cost of computing feature responses at all DCNN layers for multiple scaled versions of the input image. Instead, motivated by spatial pyramid pooling [19], [20], we propose a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution. This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, we efficiently implement this mapping using multiple parallel atrous convolutional layers with different sampling rates; we call the proposed technique “atrous spatial pyramid pooling” (ASPP).
The third challenge relates to the fact that an object-centric classifier requires invariance to spatial transformations, inherently limiting the spatial accuracy of a DCNN. One way to mitigate this problem is to use skip-layers to extract “hyper-column” features from multiple network layers when computing the final segmentation result [14], [21]. Our work explores an alternative approach which we show to be highly effective. In particular, we boost our model’s ability to capture fine details by employing a fully-connected Conditional Random Field (CRF) [22]. CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or high-order dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation, and ability to capture fine edge details while also catering for long range dependencies. That model was shown in [22] to improve the performance of a boosting-based pixel-level classifier. In this work, we demonstrate that it leads to state-of-the-art results when coupled with a DCNN-based pixel-level classifier.
A high-level illustration of the proposed DeepLab model is shown in Fig. 1. A deep convolutional neural network (VGG-16 [4] or ResNet-101 [11] in this work) trained in the task of image classification is re-purposed to the task of semantic segmentation by (1) transforming all the fully connected layers to convolutional layers (i.e., fully convolutional network [14]) and (2) increasing feature resolution through atrous convolutional layers, allowing us to compute feature responses every 8 pixels instead of every 32 pixels in the original network. We then employ bi-linear interpolation to upsample by a factor of 8 the score map to reach the original image resolution, yielding the input to a fully-connected CRF [22] that refines the segmentation results.
From a practical standpoint, the three main advantages of our DeepLab system are: (1) Speed: by virtue of atrous convolution, our dense DCNN operates at 8 FPS on an NVidia Titan X GPU, while Mean Field Inference for the fully-connected CRF requires 0.5 secs on a CPU. (2) Accuracy: we obtain state-of-art results on several challenging datasets, including the PASCAL VOC 2012 semantic segmentation benchmark [34], PASCAL-Context [35], PASCAL-Person-Part [36], and Cityscapes [37]. (3) Simplicity: our system is composed of a cascade of two very well-established modules, DCNNs and CRFs.
The updated DeepLab system we present in this paper features several improvements compared to its first version reported in our original conference publication [38]. Our new version can better segment objects at multiple scales, via either multi-scale input processing [17], [39], [40] or the proposed ASPP. We have built a residual net variant of DeepLab by adapting the state-of-art ResNet [11] image classification DCNN, achieving better semantic segmentation performance compared to our original model based on VGG-16 [4]. Finally, we present a more comprehensive experimental evaluation of multiple model variants and report state-of-art results not only on the PASCAL VOC 2012 benchmark but also on other challenging tasks. We have implemented the proposed methods by extending the Caffe framework [41]. We share our code and models at a companion web site http://liangchiehchen.com/projects/DeepLab.html.
2 RELATED WORK
Most of the successful semantic segmentation systems developed in the previous decade relied on hand-crafted features combined with flat classifiers, such as Boosting [24], [42], Random Forests [43], or Support Vector Machines [44]. Substantial improvements have been achieved by incorporating richer information from context [45] and structured prediction techniques [22], [26], [27], [46], but the performance of these systems has always been compromised by the limited expressive power of the features. Over the past few years the breakthroughs of Deep Learning in image classification were quickly transferred to the semantic segmentation task. Since this task involves both segmentation and classification, a central question is how to combine the two tasks.
The first family of DCNN-based systems for semantic segmentation typically employs a cascade of bottom-up image segmentation, followed by DCNN-based region classification. For instance the bounding box proposals and masked regions delivered by [47], [48] are used in [7] and [49] as inputs to a DCNN to incorporate shape information into the classification process. Similarly, the authors of [50] rely on a superpixel representation. Even though these approaches can benefit from the sharp boundaries delivered by a good segmentation, they also cannot recover from any of its errors.
The second family of works relies on using convolutionally computed DCNN features for dense image labeling, and couples them with segmentations that are obtained independently. Among the first have been [39] who apply DCNNs at multiple image resolutions and then employ a segmentation tree to smooth the prediction results. More recently, [21] propose to use skip layers and concatenate the computed intermediate feature maps within the DCNNs for pixel classification. Further, [51] propose to pool the intermediate feature maps by region proposals. These works still employ segmentation algorithms that are decoupled from the DCNN classifier’s results, thus risking commitment to premature decisions.
The third family of works uses DCNNs to directly provide dense category-level pixel labels, which makes it possible to even discard segmentation altogether. The
Fig. 1: Model Illustration. A Deep Convolutional Neural Network such as VGG-16 or ResNet-101 is employed in a fully
convolutional fashion, using atrous convolution to reduce the degree of signal downsampling (from 32x down to 8x). A
bilinear interpolation stage enlarges the feature maps to the original image resolution. A fully connected CRF is then
applied to refine the segmentation result and better capture the object boundaries.
segmentation-free approaches of [14], [52] directly apply DCNNs to the whole image in a fully convolutional fashion, transforming the last fully connected layers of the DCNN into convolutional layers. In order to deal with the spatial localization issues outlined in the introduction, [14] upsample and concatenate the scores from intermediate feature maps, while [52] refine the prediction result from coarse to fine by propagating the coarse results to another DCNN. Our work builds on these works, and as described in the introduction extends them by exerting control on the feature resolution, introducing multi-scale pooling techniques and integrating the densely connected CRF of [22] on top of the DCNN. We show that this leads to significantly better segmentation results, especially along object boundaries. The combination of DCNN and CRF is of course not new but previous works only tried locally connected CRF models. Specifically, [53] use CRFs as a proposal mechanism for a DCNN-based reranking system, while [39] treat superpixels as nodes for a local pairwise CRF and use graph-cuts for discrete inference. As such their models were limited by errors in superpixel computations or ignored long-range dependencies. Our approach instead treats every pixel as a CRF node receiving unary potentials by the DCNN. Crucially, the Gaussian CRF potentials in the fully connected CRF model of [22] that we adopt can capture long-range dependencies and at the same time the model is amenable to fast mean field inference. We note that mean field inference had been extensively studied for traditional image segmentation tasks [54], [55], [56], but these older models were typically limited to short-range connections. In independent work, [57] use a very similar densely connected CRF model to refine the results of DCNN for the problem of material classification. However, the DCNN module of [57] was only trained by sparse point supervision instead of dense supervision at every pixel.
Since the first version of this work was made publicly available [38], the area of semantic segmentation has progressed drastically. Multiple groups have made important advances, significantly raising the bar on the PASCAL VOC 2012 semantic segmentation benchmark, as reflected in the high level of activity in the benchmark’s leaderboard¹ [17], [40], [58], [59], [60], [61], [62], [63]. Interestingly, most top-performing methods have adopted one or both of the key ingredients of our DeepLab system: Atrous convolution for efficient dense feature extraction and refinement of the raw DCNN scores by means of a fully connected CRF. We outline below some of the most important and interesting advances.
¹ http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6
End-to-end training for structured prediction has more recently been explored in several related works. While we employ the CRF as a post-processing method, [40], [59], [62], [64], [65] have successfully pursued joint learning of the DCNN and CRF. In particular, [59], [65] unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network, while [62] approximates one iteration of the dense CRF mean field inference [22] by convolutional layers with learnable filters. Another fruitful direction pursued by [40], [66] is to learn the pairwise terms of a CRF via a DCNN, significantly improving performance at the cost of heavier computation. In a different direction, [63] replace the bilateral filtering module used in mean field inference with a faster domain transform module [67], improving the speed and lowering the memory requirements of the overall system, while [18], [68] combine semantic segmentation with edge detection.
Weaker supervision has been pursued in a number of papers, relaxing the assumption that pixel-level semantic annotations are available for the whole training set [58], [69], [70], [71], achieving significantly better results than weakly-supervised pre-DCNN systems such as [72]. In another line of research, [49], [73] pursue instance segmentation, jointly tackling object detection and semantic segmentation.
What we call here atrous convolution was originally developed for the efficient computation of the undecimated wavelet transform in the “algorithme à trous” scheme of [15]. We refer the interested reader to [74] for early references from the wavelet literature. Atrous convolution is also intimately related to the “noble identities” in multi-rate signal processing, which builds on the same interplay of input
signal and filter sampling rates [75]. Atrous convolution is a term we first used in [6]. The same operation was later called dilated convolution by [76], a term they coined motivated by the fact that the operation corresponds to regular convolution with upsampled (or dilated in the terminology of [15]) filters. Various authors have used the same operation before for denser feature extraction in DCNNs [3], [6], [16]. Beyond mere resolution enhancement, atrous convolution allows us to enlarge the field of view of filters to incorporate larger context, which we have shown in [38] to be beneficial. This approach has been pursued further by [76], who employ a series of atrous convolutional layers with increasing rates to aggregate multiscale context. The atrous spatial pyramid pooling scheme proposed here to capture multiscale objects and context also employs multiple atrous convolutional layers with different sampling rates, which we however lay out in parallel instead of in serial. Interestingly, the atrous convolution technique has also been adopted for a broader set of tasks, such as object detection [12], [77], instance-level segmentation [78], visual question answering [79], and optical flow [80].
Fig. 2: Illustration of atrous convolution in 1-D. (a) Sparse feature extraction with standard convolution on a low resolution input feature map. (b) Dense feature extraction with atrous convolution with rate r = 2, applied on a high resolution input feature map. [Diagram labels, panel (a): convolution with kernel = 3, stride = 1, pad = 1. Panel (b): convolution with kernel = 3, stride = 1, pad = 2, rate = 2 (insert 1 zero).]
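To make the operation in Fig. 2 concrete, here is a minimal 1-D sketch in plain Python. It is an illustration and not the authors' Caffe or TensorFlow code: the function names `atrous_conv1d` and `upsample_filter`, the centered zero-padded indexing, and the toy signal are all assumptions chosen for this example. It checks that sampling the input with step `rate` under the original taps gives the same result as a regular convolution with the zero-filled (upsampled) filter.

```python
def atrous_conv1d(x, w, rate=1):
    """1-D atrous convolution (correlation form), zero padded to keep length.

    y[i] = sum_k x[i + rate * (k - K // 2)] * w[k]; samples outside x count as 0.
    Assumes an odd number of taps so the filter can be centered.
    """
    K = len(w)
    half = K // 2
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(K):
            j = i + rate * (k - half)  # input index probed by tap k
            if 0 <= j < len(x):
                acc += x[j] * w[k]
        y.append(acc)
    return y


def upsample_filter(w, rate):
    """Insert rate - 1 zeros ('holes') between consecutive filter taps."""
    out = []
    for k, tap in enumerate(w):
        out.append(tap)
        if k < len(w) - 1:
            out.extend([0.0] * (rate - 1))
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # toy input feature map
w = [1.0, 0.0, -1.0]                            # toy 3-tap filter

# Route 1: keep the 3 taps and sample the input with rate 2.
dense = atrous_conv1d(x, w, rate=2)
# Route 2: regular convolution with the 5-tap zero-filled filter.
same = atrous_conv1d(x, upsample_filter(w, 2), rate=1)
```

Both routes use only the three original filter values per output sample, which is why the span covered by the filter grows from 3 to 5 samples in this toy case without adding parameters.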
We also show that, as expected, integrating into DeepLab
more advanced image classification DCNNs such as the
residual net of [11] leads to better results. This has also been
observed independently by [81].
final DCNN network responses at an arbitrarily high resolution. For example, in order to double the spatial density of computed feature responses in the VGG-16 or ResNet-101 networks, we find the last pooling or convolutional layer that decreases resolution (‘pool5’ or ‘conv5_1’ respectively), set its stride to 1 to avoid signal decimation, and replace all subsequent convolutional layers with atrous convolutional layers having rate r = 2. Pushing this approach all the way through the network could allow us to compute feature responses at the original image resolution, but this ends up being too costly. We have adopted instead a hybrid approach that strikes a good efficiency/accuracy trade-off, using atrous convolution to increase by a factor of 4 the density of computed feature maps, followed by fast bilinear interpolation by an additional factor of 8 to recover feature maps at the original image resolution. Bilinear interpolation is sufficient in this setting because the class score maps (corresponding to log-probabilities) are quite smooth, as illustrated in Fig. 5. Unlike the deconvolutional approach adopted by [14], the proposed approach converts image classification networks into dense feature extractors without requiring learning any extra parameters, leading to faster DCNN training in practice.
Atrous convolution also allows us to arbitrarily enlarge the field-of-view of filters at any DCNN layer. State-of-the-art DCNNs typically employ spatially small convolution kernels (typically 3×3) in order to keep both computation and number of parameters contained. Atrous convolution with rate r introduces r − 1 zeros between consecutive filter values, effectively enlarging the kernel size of a k×k filter to k_e = k + (k − 1)(r − 1) without increasing the number of parameters or the amount of computation. It thus offers an efficient mechanism to control the field-of-view and finds the best trade-off between accurate localization (small field-of-view) and context assimilation (large field-of-view). We have successfully experimented with this technique: Our DeepLab-LargeFOV model variant [38] employs atrous convolution with rate r = 12 in VGG-16 ‘fc6’ layer with significant performance gains, as detailed in Section 4.
Turning to implementation aspects, there are two efficient ways to perform atrous convolution. The first is to implicitly upsample the filters by inserting holes (zeros), or equivalently sparsely sample the input feature maps [15]. We implemented this in our earlier work [6], [38], followed by [76], within the Caffe framework [41] by adding to the im2col function (it extracts vectorized patches from multi-channel feature maps) the option to sparsely sample the underlying feature maps. The second method, originally proposed by [82] and used in [3], [16] is to subsample the input feature map by a factor equal to the atrous convolution rate r, deinterlacing it to produce r² reduced resolution maps, one for each of the r×r possible shifts. This is followed by applying standard convolution to these intermediate feature maps and reinterlacing them to the original image resolution. By reducing atrous convolution into regular convolution, it allows us to use off-the-shelf highly optimized convolution routines. We have implemented the second approach into the TensorFlow framework [83].
3.2 Multiscale Image Representations using Atrous Spatial Pyramid Pooling
DCNNs have shown a remarkable ability to implicitly represent scale, simply by being trained on datasets that contain objects of varying size. Still, explicitly accounting for object scale can improve the DCNN’s ability to successfully handle both large and small objects [6].
We have experimented with two approaches to handling scale variability in semantic segmentation. The first approach amounts to standard multiscale processing [17], [18]. We extract DCNN score maps from multiple (three in our experiments) rescaled versions of the original image using parallel DCNN branches that share the same parameters. To produce the final result, we bilinearly interpolate the feature maps from the parallel DCNN branches to the original image resolution and fuse them, by taking at each position the maximum response across the different scales. We do this both during training and testing. Multiscale processing significantly improves performance, but at the cost of computing feature responses at all DCNN layers for multiple scales of input.
The second approach is inspired by the success of the R-CNN spatial pyramid pooling method of [20], which showed that regions of an arbitrary scale can be accurately and efficiently classified by resampling convolutional features extracted at a single scale. We have implemented a variant of their scheme which uses multiple parallel atrous convolutional layers with different sampling rates. The features extracted for each sampling rate are further processed in separate branches and fused to generate the final result. The proposed “atrous spatial pyramid pooling” (DeepLab-ASPP) approach generalizes our DeepLab-LargeFOV variant and is illustrated in Fig. 4.
Fig. 4: Atrous Spatial Pyramid Pooling (ASPP). To classify the center pixel (orange), ASPP exploits multi-scale features by employing multiple parallel filters with different rates. The effective Field-Of-Views are shown in different colors. [Diagram labels: four parallel convolutions on the input feature map, each with kernel 3x3, with rates 6, 12, 18 and 24.]
3.3 Structured Prediction with Fully-Connected Conditional Random Fields for Accurate Boundary Recovery
A trade-off between localization accuracy and classification performance seems to be inherent in DCNNs: deeper models with multiple max-pooling layers have proven most successful in classification tasks, however the increased invariance and the large receptive fields of top-level nodes can only yield smooth responses. As illustrated in Fig. 5, DCNN
Kernel  Rate  FOV  Params  Speed (img/sec)  mean IOU bef/aft CRF (%)
7×7     4     224  134.3M  1.44             64.38 / 67.64
4×4     4     128  65.1M   2.90             59.80 / 63.74
4×4     8     224  65.1M   2.90             63.41 / 67.14
3×3     12    224  20.5M   4.84             62.25 / 67.64
TABLE 1: Effect of Field-Of-View by adjusting the kernel size and atrous sampling rate r at ‘fc6’ layer. We show number of model parameters, training speed (img/sec), and val set mean IOU before and after CRF. DeepLab-LargeFOV (kernel size 3×3, r = 12) strikes the best balance.
Learning policy  Batch size  Iteration  mean IOU
step             30          6K         62.25
poly             30          6K         63.42
poly             30          10K        64.90
poly             10          10K        64.71
poly             10          20K        65.88
TABLE 2: PASCAL VOC 2012 val set results (%) (before CRF) as different learning hyper parameters vary. Employing “poly” learning policy is more effective than “step” when training DeepLab-LargeFOV.
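As a quick check of Table 1 against the effective kernel size k_e = k + (k − 1)(r − 1) of a k×k filter with rate r, the sketch below evaluates the formula for each row. The helper name `effective_kernel_size` is hypothetical, and the mapping from k_e at ‘fc6’ to the FOV column (measured in input pixels) is not derived here. The sketch only shows that rows sharing a FOV also share k_e, while the number of filter taps k² differs.

```python
def effective_kernel_size(k, rate):
    """Span of a k-tap filter after inserting rate - 1 zeros between taps."""
    return k + (k - 1) * (rate - 1)


# (kernel size k, rate r, FOV) copied from the rows of Table 1.
table1_rows = [(7, 4, 224), (4, 4, 128), (4, 8, 224), (3, 12, 224)]

summary = []
for k, rate, fov in table1_rows:
    summary.append({
        "kernel": k,
        "rate": rate,
        "fov": fov,
        "k_e": effective_kernel_size(k, rate),  # effective kernel size
        "taps": k * k,                          # nonzero weights per 2-D filter
    })

for row in summary:
    print(row)
```

The three rows with FOV 224 all give k_e = 25 and the FOV 128 row gives k_e = 13, while the taps per filter fall from 49 (7×7) to 9 (3×3), in line with the drop in the Params column.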
Fig. 6: PASCAL VOC 2012 val results. Input image and our DeepLab results before/after CRF.
Fig. 7: DeepLab-ASPP employs multiple filters with different rates to capture objects and context at multiple scales. [Diagram, panel (a) DeepLab-LargeFOV: Pool5, then Fc6 (3x3, rate = 12), Fc7 (1x1), Fc8 (1x1). Panel (b) DeepLab-ASPP: Pool5 feeding four parallel branches, each with Fc6 (3x3, rate = 6, 12, 18 or 24), Fc7 (1x1) and Fc8 (1x1), merged by Sum-Fusion.]
Fig. 8: Qualitative segmentation results with ASPP compared to the baseline LargeFOV model. The ASPP-L model, employing multiple large FOVs, can successfully capture objects as well as image context at multiple scales. [Panels: (a) Image, (b) LargeFOV, (c) ASPP-S, (d) ASPP-L.]
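The branch layout of Fig. 7(b) can be sketched on a single-channel toy input in plain Python. This is only a structural illustration under stated assumptions: the real layers are multi-channel and learned, whereas here each 1x1 layer is reduced to a scalar weight, a ReLU is assumed after Fc6 and Fc7, and the names `atrous_conv2d`, `scale`, `relu` and `aspp` are hypothetical. The four rates 6, 12, 18 and 24 and the sum fusion follow the figure.

```python
def atrous_conv2d(x, w, rate):
    """3x3 atrous convolution (correlation form) with zero padding, same size."""
    H, W = len(x), len(x[0])
    y = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for a in range(3):
                for b in range(3):
                    ii = i + rate * (a - 1)  # rows probed: i - rate, i, i + rate
                    jj = j + rate * (b - 1)
                    if 0 <= ii < H and 0 <= jj < W:
                        acc += x[ii][jj] * w[a][b]
            y[i][j] = acc
    return y


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def scale(m, s):
    """Stand-in for a 1x1 convolution on a single channel (scalar weight)."""
    return [[s * v for v in row] for row in m]


def aspp(x, branches):
    """Sum-fuse parallel branches given as (rate, fc6_kernel, fc7_w, fc8_w)."""
    H, W = len(x), len(x[0])
    fused = [[0.0] * W for _ in range(H)]
    for rate, w6, w7, w8 in branches:
        h = relu(atrous_conv2d(x, w6, rate))  # Fc6: 3x3 atrous convolution
        h = relu(scale(h, w7))                # Fc7: 1x1
        h = scale(h, w8)                      # Fc8: 1x1
        for i in range(H):
            for j in range(W):
                fused[i][j] += h[i][j]        # Sum-Fusion
    return fused


# Toy input: a single impulse in the middle of a 49x49 map.
size = 49
x = [[0.0] * size for _ in range(size)]
x[24][24] = 1.0
box = [[1.0] * 3 for _ in range(3)]           # hypothetical all-ones Fc6 kernel
branches = [(rate, box, 1.0, 1.0) for rate in (6, 12, 18, 24)]
fused = aspp(x, branches)
```

With the impulse input, each branch responds at nine positions spaced by its own rate, so the fused map shows four nested 3x3 grids around the center. This is one way to see how the parallel branches probe the same feature map with different fields of view.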
performance of 77.69%.
Qualitative results: We provide qualitative visual comparisons of DeepLab’s results (our best model variant) before and after CRF in Fig. 6. The results obtained by DeepLab before CRF are already excellent, while employing the CRF further improves the performance by removing false positives and refining object boundaries.
Test set results: We have submitted the result of our final best model to the official server, obtaining test set performance of 79.7%, as shown in Tab. 5. The model substantially outperforms previous DeepLab variants (e.g., DeepLab-LargeFOV with VGG-16 net) and is currently the top performing method on the PASCAL VOC 2012 segmentation leaderboard.
Fig. 9: DeepLab results based on VGG-16 net or ResNet-101 before and after CRF. The CRF is critical for accurate prediction along object boundaries with VGG-16, whereas ResNet-101 has acceptable performance even before CRF. [Panels: Image, VGG-16 Bef., VGG-16 Aft., ResNet Bef., ResNet Aft.]
Method                                mIOU
DeepLab-CRF-LargeFOV-COCO [58]        72.7
MERL DEEP GCRF [88]                   73.2
CRF-RNN [59]                          74.7
POSTECH DeconvNet CRF VOC [61]        74.8
BoxSup [60]                           75.2
Context + CRF-RNN [76]                75.3
QO4mres [66]                          75.5
DeepLab-CRF-Attention [17]            75.7
CentraleSuperBoundaries++ [18]        76.0
DeepLab-CRF-Attention-DT [63]         76.3
H-ReNet + DenseCRF [89]               76.8
LRR 4x COCO [90]                      76.8
DPN [62]                              77.5
Adelaide Context [40]                 77.8
Oxford TVG HO CRF [91]                77.9
Context CRF + Guidance CRF [92]       78.1
Adelaide VeryDeep FCN VOC [93]        79.1
DeepLab-CRF (ResNet-101)              79.7
TABLE 5: Performance on PASCAL VOC 2012 test set. We have added some results from recent arXiv papers on top of the official leaderboard results.
Method         MSC  COCO  Aug  LargeFOV  ASPP  CRF  mIOU
VGG-16
DeepLab [38]   -    -     -    X         -     -    37.6
DeepLab [38]   -    -     -    X         -     X    39.6
ResNet-101
DeepLab        -    -     -    -         -     -    39.6
DeepLab        X    -     X    -         -     -    41.4
DeepLab        X    X     X    -         -     -    42.9
DeepLab        X    X     X    X         -     -    43.5
DeepLab        X    X     X    -         X     -    44.7
DeepLab        X    X     X    -         X     X    45.7
O2P [45]                                            18.1
CFM [51]                                            34.4
FCN-8s [14]                                         37.8
CRF-RNN [59]                                        39.3
ParseNet [86]                                       40.4
BoxSup [60]                                         40.5
HO CRF [91]                                         41.3
Context [40]                                        43.3
VeryDeep [93]                                       44.5
TABLE 6: Comparison with other state-of-art methods on PASCAL-Context dataset.
VGG-16 vs. ResNet-101: We have observed that DeepLab based on ResNet-101 [11] delivers better segmentation results along object boundaries than employing VGG-16 [4], as visualized in Fig. 9. We think the identity mapping [94] of ResNet-101 has a similar effect to hyper-column features [21], which exploit the features from the intermediate layers to better localize boundaries. We further quantify this effect in Fig. 10 within the “trimap” [22], [31] (a narrow band along object boundaries). As shown in the figure, employing ResNet-101 before CRF has almost the same accuracy along object boundaries as employing VGG-16 in conjunction with a CRF. Post-processing the ResNet-101 result with a CRF further improves the segmentation result.
Fig. 10: (a) Trimap examples (top-left: image. top-right: ground-truth. bottom-left: trimap of 2 pixels. bottom-right: trimap of 10 pixels). (b) Pixel mean IOU as a function of the band width around the object boundaries when employing VGG-16 or ResNet-101 before and after CRF. [Plot (b): x-axis Trimap Width (pixels), 0 to 40; curves: ResNet aft, VGG-16 aft, ResNet bef, VGG-16 bef.]
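The trimap evaluation referred to in Fig. 10 restricts the mean IOU to a narrow band around object boundaries. The sketch below is one plausible way to compute such a band in plain Python and is not the authors' evaluation code: it assumes boundaries are found as label changes between 4-connected neighbors in the ground truth, grows the band with a square (Chebyshev) neighborhood, and uses the hypothetical names `trimap_band` and `mean_iou`. The 6x6 label maps are toy data.

```python
def trimap_band(gt, width):
    """Mask of pixels within `width` pixels of a ground-truth label change."""
    H, W = len(gt), len(gt[0])
    band = [[False] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            on_boundary = False
            for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ii, jj = i + di, j + dj
                if 0 <= ii < H and 0 <= jj < W and gt[ii][jj] != gt[i][j]:
                    on_boundary = True
            if on_boundary:
                # Mark the square neighborhood of this boundary pixel.
                for ii in range(max(0, i - width), min(H, i + width + 1)):
                    for jj in range(max(0, j - width), min(W, j + width + 1)):
                        band[ii][jj] = True
    return band


def mean_iou(gt, pred, mask, num_classes):
    """Mean intersection-over-union over classes, counting masked pixels only."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for g_row, p_row, m_row in zip(gt, pred, mask):
            for g, p, m in zip(g_row, p_row, m_row):
                if m:
                    inter += int(g == c and p == c)
                    union += int(g == c or p == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy ground truth: left half class 0, right half class 1.
gt = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Toy prediction: the boundary is misplaced by one column.
pred = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

everywhere = [[True] * 6 for _ in range(6)]
iou_full = mean_iou(gt, pred, everywhere, 2)
iou_band0 = mean_iou(gt, pred, trimap_band(gt, 0), 2)
iou_band1 = mean_iou(gt, pred, trimap_band(gt, 1), 2)
```

In this toy case the score rises as the band widens (0.25, then 7/12, then 17/24 over the full image), which matches the intuition behind the plot: errors concentrate near boundaries, so narrow bands are the hardest regime.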
4.2 PASCAL-Context
Dataset: The PASCAL-Context dataset [35] provides detailed semantic labels for the whole scene, including both object (e.g., person) and stuff (e.g., sky). Following [35], the proposed models are evaluated on the most frequent 59 classes along with one background category. The training set and validation set contain 4998 and 5105 images, respectively.
Evaluation: We report the evaluation results in Tab. 6. Our VGG-16 based LargeFOV variant yields 37.6% before and 39.6% after CRF. Repurposing the ResNet-101 [11] for DeepLab improves 2% over the VGG-16 LargeFOV. Similar to [17], employing multi-scale inputs and max-pooling to merge the results improves the performance to 41.4%. Pretraining the model on MS-COCO brings an extra 1.5% improvement. Employing atrous spatial pyramid pooling is more effective than LargeFOV. After further employing dense CRF as post processing, our final model yields 45.7%, outperforming the current state-of-art method [40] by 2.4% without using their non-linear pairwise term. Our final model is slightly better than the concurrent work [93] by 1.2%, which also employs atrous convolution to repurpose
Fig. 11: PASCAL-Context results. Input image, ground-truth, and our DeepLab results before/after CRF.
Method            MSC  COCO  Aug  LFOV  ASPP  CRF  mIOU
ResNet-101
DeepLab           -    -     -    -     -     -    58.90
DeepLab           X    -     X    -     -     -    63.10
DeepLab           X    X     X    -     -     -    64.40
DeepLab           X    X     X    -     -     X    64.94
DeepLab           X    X     X    X     -     -    62.18
DeepLab           X    X     X    -     X     -    62.76
Attention [17]                                     56.39
HAZN [95]                                          57.54
LG-LSTM [96]                                       57.97
Graph LSTM [97]                                    60.16
TABLE 7: Comparison with other state-of-art methods on PASCAL-Person-Part dataset.
Method                                     mIOU
pre-release version of dataset
Adelaide Context [40]                      66.4
FCN-8s [14]                                65.3
DeepLab-CRF-LargeFOV-StrongWeak [58]       64.8
DeepLab-CRF-LargeFOV [38]                  63.1
CRF-RNN [59]                               62.5
DPN [62]                                   59.1
Segnet basic [100]                         57.0
Segnet extended [100]                      56.1
official version
Adelaide Context [40]                      71.6
Dilation10 [76]                            67.1
DPN [62]                                   66.8
Pixel-level Encoding [101]                 64.3
DeepLab-CRF (ResNet-101)                   70.4
TABLE 8: Test set results on the Cityscapes dataset, comparing our DeepLab system with other state-of-art methods.
the residual net of [11] for semantic segmentation.
Qualitative results: We visualize the segmentation results of our best model with and without CRF as post processing in Fig. 11. DeepLab before CRF can already predict
most of the object/stuff with high accuracy. Employing CRF, DeepLab alone yields 58.9%, significantly outperforming
our model is able to further remove isolated false positives DeepLab-LargeFOV (VGG-16 net) and DeepLab-Attention
and improve the prediction along object/stuff boundaries. (VGG-16 net) by about 7% and 2.5%, respectively. Incorpo-
rating multi-scale inputs and fusion by max-pooling further
improves performance to 63.1%. Additionally pretraining
4.3 PASCAL-Person-Part
the model on MS-COCO yields another 1.3% improvement.
Dataset: We further perform experiments on semantic part However, we do not observe any improvement when adopt-
segmentation [98], [99], using the extra PASCAL VOC 2010 ing either LargeFOV or ASPP on this dataset. Employing
annotations by [36]. We focus on the person part for the the dense CRF to post process our final output substantially
dataset, which contains more training data and large varia- outperforms the concurrent work [97] by 4.78%.
tion in object scale and human pose. Specifically, the dataset Qualitative results: We visualize the results in Fig. 12.
contains detailed part annotations for every person, e.g.
eyes, nose. We merge the annotations to be Head, Torso,
Upper/Lower Arms and Upper/Lower Legs, resulting in 4.4 Cityscapes
six person part classes and one background class. We only Dataset: Cityscapes [37] is a recently released large-scale
use those images containing persons for training (1716 im- dataset, which contains high quality pixel-level annotations
ages) and validation (1817 images). of 5000 images collected in street scenes from 50 different
Evaluation: The human part segmentation results on cities. Following the evaluation protocol [37], 19 semantic
PASCAL-Person-Part is reported in Tab. 7. [17] has already labels (belonging to 7 super categories: ground, construc-
conducted experiments on this dataset with re-purposed tion, object, nature, sky, human, and vehicle) are used for
VGG-16 net for DeepLab, attaining 56.39% (with multi-scale evaluation (the void label is not considered for evaluation).
inputs). Therefore, in this part, we mainly focus on the effect The training, validation, and test sets contain 2975, 500, and
of repurposing ResNet-101 for DeepLab. With ResNet-101, 1525 images respectively.
11
Fig. 12: PASCAL-Person-Part results. Input image, ground-truth, and our DeepLab results before/after CRF.
Fig. 13: Cityscapes results. Input image, ground-truth, and our DeepLab results before/after CRF.
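The results in this section are reported as mean intersection-over-union (mIOU), and the Cityscapes protocol leaves void pixels out of the score. A minimal sketch of that computation for a single prediction/ground-truth pair is given below in plain Python. The function name `mean_iou` and the void id `255` are assumptions made for illustration; they are not taken from the paper or from the official benchmark scripts, which accumulate statistics over the whole dataset and may differ in detail.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean intersection-over-union over classes, skipping ignored pixels.

    pred, target: flat sequences of integer class ids of equal length.
    pred values are assumed to lie in [0, num_classes).
    ignore_label: hypothetical id marking 'void' ground-truth pixels.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # void pixels do not contribute to any class
        target_count[t] += 1
        pred_count[p] += 1
        if p == t:
            intersection[t] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # classes absent from both maps are skipped
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: three classes, one void pixel in the ground truth.
score = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 255, 2], num_classes=3)
```

Skipping classes whose union is empty is one possible convention; a dataset-level evaluation would normally sum the per-class counts over all images before dividing.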
Full  Aug  LargeFOV  ASPP  CRF  mIOU
VGG-16
-     -    X         -     -    62.97
-     -    X         -     X    64.18
X     -    X         -     -    64.89
X     -    X         -     X    65.94
ResNet-101
X     -    -         -     -    66.6
X     -    X         -     -    69.2
X     -    -         X     -    70.4
X     X    -         X     -    71.0
X     X    -         X     X    71.4
TABLE 9: Val set results on Cityscapes dataset. Full: model trained with full resolution images.
Test set results of pre-release: We have participated in benchmarking the Cityscapes dataset pre-release. As shown in the top of Tab. 8, our model attained third place, with performance of 63.1% and 64.8% (with training on additional coarsely annotated images).
Val set results: After the initial release, we further explored the validation set in Tab. 9. The images of Cityscapes have resolution 2048×1024, making it a challenging problem to train deeper networks with limited GPU memory. When benchmarking the pre-release of the dataset, we downsampled the images by 2. However, we have found that it is beneficial to process the images in their original resolution. With the same training protocol, using images of original resolution brings significant improvements of 1.9% and 1.8% before and after CRF, respectively. In order to perform inference on this dataset with high resolution images, we split each image into overlapped regions, similar to [37]. We have also replaced the VGG-16 net with ResNet-101. We do not exploit multi-scale inputs due to the limited GPU memory at hand. Instead, we only explore (1) deeper networks (i.e., ResNet-101), (2) data augmentation, (3) LargeFOV or ASPP, and (4) CRF as post processing on this dataset. We first find that employing ResNet-101 alone is better than using VGG-16 net. Employing LargeFOV brings 2.6% improvement and using ASPP further improves results by 1.2%. Adopting data augmentation and CRF as post processing brings another 0.6% and 0.4%, respectively.
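Inference on overlapped regions of a high-resolution image can be sketched roughly as follows. The paper does not state the region size, the amount of overlap, or how predictions are merged where regions overlap, so this sketch assumes simple averaging and a single score per pixel. The names `tile_starts`, `tiled_inference`, and `predict` are hypothetical, and `predict` is only a stand-in for the network.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping windows that together cover [0, length)."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last window is flush with the border
    return starts


def tiled_inference(image, predict, tile, overlap):
    """Run `predict` on overlapping tiles and average scores where tiles overlap.

    image: 2-D list (rows of pixel values).
    predict: callable mapping a 2-D tile to a 2-D list of scores of the same
             shape (an assumed stand-in for the segmentation network).
    """
    height, width = len(image), len(image[0])
    score_sum = [[0.0] * width for _ in range(height)]
    hits = [[0] * width for _ in range(height)]
    for top in tile_starts(height, tile, overlap):
        for left in tile_starts(width, tile, overlap):
            patch = [row[left:left + tile] for row in image[top:top + tile]]
            scores = predict(patch)
            for dy, score_row in enumerate(scores):
                for dx, s in enumerate(score_row):
                    score_sum[top + dy][left + dx] += s
                    hits[top + dy][left + dx] += 1
    # Every pixel is covered by at least one tile, so hits is never zero.
    return [[score_sum[y][x] / hits[y][x] for x in range(width)]
            for y in range(height)]


# Usage with an identity "network": the stitched output reproduces the input.
toy_image = [[y * 10 + x for x in range(7)] for y in range(5)]
stitched = tiled_inference(toy_image, lambda patch: patch, tile=4, overlap=2)
```

Averaging is only one way to merge overlapping predictions; keeping the central part of each tile or taking a per-pixel maximum would be equally plausible readings of the text.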
[33] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, “Towards unified depth and semantic prediction from a single image,” in CVPR, 2015.
[34] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge a retrospective,” IJCV, 2014.
[35] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in CVPR, 2014.
[36] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, “Detect what you can: Detecting and representing objects using holistic models and body parts,” in CVPR, 2014.
[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
[38] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in ICLR, 2015.
[39] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” PAMI, 2013.
[40] G. Lin, C. Shen, I. Reid et al., “Efficient piecewise training of deep structured models for semantic segmentation,” arXiv:1504.01013, 2015.
[41] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[42] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3d brain image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 10, pp. 1744–1757, 2010.
[43] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in CVPR, 2008.
[44] B. Fulkerson, A. Vedaldi, and S. Soatto, “Class segmentation and object localization with superpixel neighborhoods,” in ICCV, 2009.
[45] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, “Semantic segmentation with second-order pooling,” in ECCV, 2012.
[46] J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” PAMI, vol. 34, no. 7, pp. 1312–1328, 2012.
[47] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in CVPR, 2014.
[48] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” IJCV, 2013.
[49] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in ECCV, 2014.
[50] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” in CVPR, 2015.
[51] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” arXiv:1412.1283, 2014.
[52] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” arXiv:1411.4734, 2014.
[53] M. Cogswell, X. Lin, S. Purushwalkam, and D. Batra, “Combining the best of graphical models and convnets for semantic segmentation,” arXiv:1412.4313, 2014.
[54] D. Geiger and F. Girosi, “Parallel and deterministic algorithms from mrfs: Surface reconstruction,” PAMI, vol. 13, no. 5, pp. 401–412, 1991.
[55] D. Geiger and A. Yuille, “A common framework for image segmentation,” IJCV, vol. 6, no. 3, pp. 227–243, 1991.
[56] I. Kokkinos, R. Deriche, O. Faugeras, and P. Maragos, “Computational analysis and learning for a biologically motivated model of boundary detection,” Neurocomputing, vol. 71, no. 10, pp. 1798–1812, 2008.
[57] S. Bell, P. Upchurch, N. Snavely, and K. Bala, “Material recognition in the wild with the materials in context database,” arXiv:1412.0623, 2014.
[58] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille, “Weakly- and semi-supervised learning of a dcnn for semantic image segmentation,” in ICCV, 2015.
[59] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, “Conditional random fields as recurrent neural networks,” in ICCV, 2015.
[60] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in ICCV, 2015.
[61] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in ICCV, 2015.
[62] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Semantic image segmentation via deep parsing network,” in ICCV, 2015.
[63] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform,” in CVPR, 2016.
[64] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun, “Learning deep structured models,” in ICML, 2015.
[65] A. G. Schwing and R. Urtasun, “Fully connected deep structured networks,” arXiv:1503.02351, 2015.
[66] S. Chandra and I. Kokkinos, “Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs,” arXiv:1603.08358, 2016.
[67] E. S. L. Gastal and M. M. Oliveira, “Domain transform for edge-aware image and video processing,” in SIGGRAPH, 2011.
[68] G. Bertasius, J. Shi, and L. Torresani, “High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision,” in ICCV, 2015.
[69] P. O. Pinheiro and R. Collobert, “Weakly supervised semantic segmentation with convolutional networks,” arXiv:1411.6228, 2014.
[70] D. Pathak, P. Krähenbühl, and T. Darrell, “Constrained convolutional neural networks for weakly supervised segmentation,” 2015.
[71] S. Hong, H. Noh, and B. Han, “Decoupled deep neural network for semi-supervised semantic segmentation,” in NIPS, 2015.
[72] A. Vezhnevets, V. Ferrari, and J. M. Buhmann, “Weakly supervised semantic segmentation with a multi-image model,” in ICCV, 2011.
[73] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan, “Proposal-free network for instance-level object segmentation,” arXiv preprint arXiv:1509.02636, 2015.
[74] J. E. Fowler, “The redundant discrete wavelet transform and additive noise,” IEEE Signal Processing Letters, vol. 12, no. 9, pp. 629–632, 2005.
[75] P. P. Vaidyanathan, “Multirate digital filters, filter banks, polyphase networks, and applications: a tutorial,” Proceedings of the IEEE, vol. 78, no. 1, pp. 56–93, 1990.
[76] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
[77] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” arXiv:1605.06409, 2016.
[78] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, “Instance-sensitive fully convolutional networks,” arXiv:1603.08678, 2016.
[79] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, “Abc-cnn: An attention based convolutional neural network for visual question answering,” arXiv:1511.05960, 2015.
[80] L. Sevilla-Lara, D. Sun, V. Jampani, and M. J. Black, “Optical flow with semantic segmentation and localized layers,” arXiv:1603.03911, 2016.
[81] Z. Wu, C. Shen, and A. van den Hengel, “High-performance semantic segmentation using very deep fully convolutional networks,” arXiv:1604.04339, 2016.
[82] M. J. Shensa, “The discrete wavelet transform: wedding the a trous and mallat algorithms,” Signal Processing, IEEE Transactions on, vol. 40, no. 10, pp. 2464–2482, 1992.
[83] M. Abadi, A. Agarwal et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv:1603.04467, 2016.
[84] A. Adams, J. Baek, and M. A. Davis, “Fast high-dimensional filtering using the permutohedral lattice,” in Eurographics, 2010.
[85] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in ICCV, 2011.
[86] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv:1506.04579, 2015.
[87] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in ECCV, 2014.
[88] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellappa, “Gaussian conditional random field network for semantic segmentation,” in CVPR, 2016.
[89] Z. Yan, H. Zhang, Y. Jia, T. Breuel, and Y. Yu, “Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation,” arXiv:1603.04871, 2016.
[90] G. Ghiasi and C. C. Fowlkes, “Laplacian reconstruction and refinement for semantic segmentation,” arXiv:1605.02264, 2016.
[91] A. Arnab, S. Jayasumana, S. Zheng, and P. Torr, “Higher order potentials in end-to-end trainable conditional random fields,” arXiv:1511.08119, 2015.
[92] F. Shen and G. Zeng, “Fast semantic image segmentation with high order context and guided filtering,” arXiv:1605.04068, 2016.
[93] Z. Wu, C. Shen, and A. van den Hengel, “Bridging category-level and instance-level semantic image segmentation,” arXiv:1605.06885, 2016.
[94] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” arXiv:1603.05027, 2016.
[95] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille, “Zoom better to see clearer: Human part segmentation with auto zoom net,” arXiv:1511.06881, 2015.
[96] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan, “Semantic object parsing with local-global long short-term memory,” arXiv:1511.04510, 2015.
[97] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object parsing with graph lstm,” arXiv:1603.07063, 2016.
[98] J. Wang and A. Yuille, “Semantic part segmentation using compositional model combining shape and appearance,” in CVPR, 2015.
[99] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, “Joint object and part segmentation using deep learned potentials,” in ICCV, 2015.
[100] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2015.
[101] J. Uhrig, M. Cordts, U. Franke, and T. Brox, “Pixel-level encoding and depth layering for instance-level semantic labeling,” arXiv:1604.05096, 2016.
[102] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.

Liang-Chieh Chen received his B.Sc. from National Chiao Tung University, Taiwan, his M.S. from the University of Michigan, Ann Arbor, and his Ph.D. from the University of California, Los Angeles. He is currently working at Google. His research interests include semantic image segmentation, probabilistic graphical models, and machine learning.

Iasonas Kokkinos (S’02–M’06) obtained the Diploma of Engineering in 2001 and the Ph.D. Degree in 2006 from the School of Electrical and Computer Engineering of the National Technical University of Athens in Greece, and the Habilitation Degree in 2013 from Université Paris-Est. In 2006 he joined the University of California at Los Angeles as a postdoctoral scholar, and in 2008 joined as faculty the Department of Applied Mathematics of Ecole Centrale Paris (CentraleSupelec), working as an associate professor in the Center for Visual Computing of CentraleSupelec and affiliate researcher at INRIA-Saclay. In 2016 he joined University College London and Facebook Artificial Intelligence Research. His current research activity is on deep learning for computer vision, focusing in particular on structured prediction for deep learning, shape modeling, and multi-task learning architectures. He has been awarded a young researcher grant by the French National Research Agency, has served as associate editor for the Image and Vision Computing and Computer Vision and Image Understanding Journals, and serves regularly as a reviewer and area chair for all major computer vision conferences and journals.

Kevin Murphy was born in Ireland, grew up in England, went to graduate school in the USA (MEng from U. Penn, PhD from UC Berkeley, Postdoc at MIT), and then became a professor at the Computer Science and Statistics Departments at the University of British Columbia in Vancouver, Canada in 2004. After getting tenure, Kevin went to Google in Mountain View, California for his sabbatical. In 2011, he converted to a full-time research scientist at Google. Kevin has published over 50 papers in refereed conferences and journals related to machine learning and graphical models. He has recently published an 1100-page textbook called “Machine Learning: a Probabilistic Perspective” (MIT Press, 2012).