Learning Hierarchical Features For Scene Labeling
Abstract—Scene labeling consists of labeling each pixel in an image with the category of the object it belongs to. We propose a
method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of
multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation
that captures texture, shape, and contextual information. We report results using multiple postprocessing methods to produce the final
labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of
components that best explain the scene; these components are arbitrary, for example, they can be taken from a segmentation tree or
from any family of oversegmentations. The system yields record accuracies on the SIFT Flow dataset (33 classes) and the Barcelona
dataset (170 classes), and near-record accuracy on the Stanford Background dataset (eight classes), while being an order of magnitude
faster than competing approaches, producing a 320×240 image labeling in less than a second, including feature extraction.
Index Terms—Convolutional networks, deep learning, image segmentation, image classification, scene parsing
1 INTRODUCTION
Fig. 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a three-stage
ConvNet, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser scale maps being upsampled to match
the size of the finest scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a single segmentation
(i.e., superpixels) or a family of segmentations (e.g., a segmentation tree) are computed to exploit the natural contours of the image. The final
labeling is produced from the feature vectors and the segmentation(s) using different methods, as presented in Section 4.
every pixel in the image, covering a large context. The multiscale convolutional net contains multiple copies of a single network (all sharing the same weights) that are applied to different scales of a Laplacian pyramid version of the input image. For each pixel, the networks collectively encode the information present in a large contextual window around the given pixel (184×184 pixels in the system described here). The ConvNet is fed with raw pixels and trained end to end, thereby alleviating the need for hand-engineered features. When properly trained, these features produce a representation that captures texture, shape, and contextual information. While using a multiscale representation seems natural for FSL, it has rarely been used in the context of feature learning systems. The multiscale representation that is learned is sufficiently complete to allow the detection and recognition of all the objects and regions in the scene. However, it does not accurately pinpoint the boundaries of the regions and requires some postprocessing to yield cleanly delineated predictions.

1.2 Graph-Based Classification
An oversegmentation is constructed from the image and is used to group the feature descriptors. Several oversegmentations are considered, and three techniques are proposed to produce the final image labeling.

1.2.1 Superpixels
The image is segmented into disjoint components, widely oversegmenting the scene. In this scenario, a pixelwise classifier is trained on the convolutional feature vectors, and a simple vote is done for each component to assign a single class per component. This method is simple and effective, but imposes a fixed level of segmentation, which can be suboptimal.

1.2.2 CRF over Superpixels
A CRF is defined over a set of superpixels. Compared to the previous, simpler method, this postprocessing models joint probabilities at the level of the scene, and is useful to avoid local aberrations (e.g., a person in the sky). That kind of approach is widely used in the computer vision community, and we show that our learned multiscale feature representation essentially makes the use of a global random field much less useful: Most scene-level relationships seem to be already captured by it.

1.2.3 Multilevel Cut with Class Purity Criterion
A family of segmentations is constructed over the image to analyze the scene at multiple levels. In the simplest case, this family might be a segmentation tree; in the most general case, it can be any set of segmentations, for example, a collection of superpixels either produced using the same algorithm with different parameter tunings or produced by different algorithms. Each segmentation component is represented by the set of feature vectors that fall into it: The component is encoded by a spatial grid of aggregated feature vectors. The aggregated feature vector of each grid cell is computed by a component-wise max pooling of the feature vectors centered on all the pixels that fall into the grid cell. This produces a scale-invariant representation of the segment and its surroundings. A classifier is then applied to the aggregated feature grid of each node. This classifier is trained to estimate the histogram of all object categories present in the component. A subset of the components is then selected such that they cover the entire image. These components are selected so as to minimize the average "impurity" of the class distribution in a procedure that we name "optimal cover." The class "impurity" is defined as the entropy of the class distribution. The choice of the cover thus attempts to find a consistent overall segmentation in which each segment contains pixels belonging to only one of the learned categories. This simple method allows us to consider full families of segmentation components, rather than a unique, predetermined segmentation (e.g., a single set of superpixels).

All the steps in the process have a complexity linear (or almost linear) in the number of pixels. The bulk of the computation resides in the ConvNet feature extractor. The resulting system is very fast, producing a full parse of a 320×240 image in less than a second on a conventional CPU, and in less than 100 ms using dedicated hardware, opening the door to real-time applications.
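As a concrete illustration of the two simplest ingredients above, the sketch below shows, in plain NumPy, the per-superpixel majority vote of Section 1.2.1 and the entropy-based class "impurity" that the optimal cover of Section 1.2.3 minimizes. This is a minimal sketch, not the paper's implementation; the assumed data layout (an integer superpixel map, a per-component class histogram) is chosen only for illustration.

```python
import numpy as np

def superpixel_majority_vote(pixel_labels, superpixels):
    """Assign to every superpixel the most frequent pixelwise prediction (Section 1.2.1)."""
    out = np.empty_like(pixel_labels)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        out[mask] = np.bincount(pixel_labels[mask]).argmax()
    return out

def class_impurity(class_histogram, eps=1e-12):
    """Entropy of a component's class distribution: the "impurity" the optimal cover minimizes."""
    p = class_histogram / (class_histogram.sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))
```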
Once trained, the system is parameter free and requires no adjustment of thresholds or other knobs.

An early version of this work was first published in [7]. This journal version reports more complete experiments, comparisons, and higher results.

2 RELATED WORK
The scene parsing problem has been approached with a wide variety of methods in recent years. Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and to account for context [19], [39], [15], [25], [32], [44], [30]. Most methods rely on a presegmentation into superpixels or other segment candidates, and extract features and categories from individual segments and from various combinations of neighboring segments. The graphical model inference pulls out the most consistent set of segments which cover the image.

Socher et al. [43] proposed a method to aggregate segments in a greedy fashion using a trained scoring function. The originality of the approach is that the feature vector of the combination of two segments is computed from the feature vectors of the individual segments through a trainable function. Like us, they use "deep learning" methods to train their feature extractor. But unlike us, their feature extractor operates on hand-engineered features.

One of the main questions in scene parsing is how to take a wide context into account to make a local decision. Munoz et al. [32] proposed using the histogram of labels extracted from a coarse scale as input to the labeler that looks at finer scales. Our approach is somewhat simpler: Our feature extractor is applied densely to an image pyramid. The coarse feature maps thereby generated are upsampled to match that of the finest scale. Hence, with three scales, each feature vector has multiple fields that encode multiple regions of increasing sizes and decreasing resolutions, centered on the same pixel location.

Like us, a number of authors have used families of segmentations or trees to generate candidate segments by aggregating elementary segments. The approaches of [39], [30] rely on inference algorithms based on graph cuts to label images using trees of segmentation. Other strategies using families of segmentations appeared in [36] and [5]. None of the previous strategies for scene labeling used a purity criterion on the class distributions. Combined with the optimal cover strategy, this purity criterion is general, efficient, and could be applied to solve different problems.

Contrary to the previously cited approaches using engineered features, our system extracts features densely from a multiscale pyramid of images using a ConvNet [27]. These networks can be fed with raw pixels and can automatically learn low-level and mid-level features, alleviating the need for hand-engineered features. One of their advantages is the ability to compute dense features efficiently over large images. They are best known for their applications to detection and recognition [47], [14], [35], [21], but they have also been used for image segmentation, particularly for biological image segmentation [34], [20], [46].

The only previously published work on using ConvNets for scene parsing is that of [17]. While somewhat preliminary, this work showed that ConvNets fed with raw pixels could be trained to perform scene parsing with decent accuracy. Unlike [17], however, our system uses a boundary-based hierarchy of segmentations to align the labels produced by the network to the boundaries in the image, and thus produces representations that are independent of the size of the segments through feature pooling. Slightly after [8], Schulz and Behnke proposed a similar architecture of a multiscale ConvNet for scene parsing [40]. Unlike us, they use pairwise class location filters to predict the final segmentation, instead of using the image gradient that we found to be more accurate.

3 MULTISCALE FEATURE EXTRACTION FOR SCENE PARSING
The model proposed in this paper, depicted in Fig. 1, relies on two complementary image representations. In the first representation, an image patch is seen as a point in $\mathbb{R}^P$, and we seek to find a transform $f: \mathbb{R}^P \rightarrow \mathbb{R}^Q$ that maps each patch into $\mathbb{R}^Q$, a space where it can be classified linearly. This first representation typically suffers from two main problems when using a classical ConvNet where the image is divided following a grid pattern: 1) The window considered rarely contains an object that is properly centered and scaled, and therefore offers a poor observation basis to predict the class of the underlying object; 2) integrating a large context involves increasing the grid size and therefore the dimensionality $P$ of the input; given a finite amount of training data, it is then necessary to enforce some invariance in the function $f$ itself. This is usually achieved by using pooling/subsampling layers, which in turn degrades the ability of the model to precisely locate and delineate objects. In this paper, $f$ is implemented by a multiscale ConvNet, which allows integrating large contexts (as large as the complete scene) into local decisions, while still remaining manageable in terms of parameters/dimensionality. This multiscale model, in which weights are shared across scales, allows the model to capture long-range interactions without the penalty of extra parameters to train. This model is described in Section 3.1.

In the second representation, the image is seen as an edge-weighted graph on which one or several oversegmentations can be constructed. The components are spatially accurate, and naturally delineate the underlying objects, as this representation conserves pixel-level precision. Section 4 describes multiple strategies to combine both representations. In particular, we describe in Section 4.3 a method for analyzing a family of segmentations (at multiple levels). It can be used as a solution to the first problem exposed above: Assuming the capability of assessing the quality of all the components in this family of segmentations, a system can automatically choose its components so as to produce the best set of predictions.

3.1 Scale-Invariant, Scene-Level Feature Extraction
Good internal representations are hierarchical. In vision, pixels are assembled into edglets, edglets into motifs, motifs into parts, parts into objects, and objects into scenes. This suggests that recognition architectures for vision (and for other modalities such as audio and natural language)
should have multiple trainable stages stacked on top of each other, one for each level in the feature hierarchy. ConvNets provide a simple framework to learn such hierarchies of features.

ConvNets [26], [27] are trainable architectures composed of multiple stages. The input and output of each stage are sets of arrays called feature maps. For example, if the input is a color image, each feature map would be a two-dimensional array containing a color channel of the input image (for an audio input, each feature map would be a one-dimensional array, and for a video or volumetric image, it would be a three-dimensional array). At the output, each feature map represents a particular feature extracted at all locations on the input. Each stage is composed of three layers: a filter bank layer, a nonlinearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two, or three such three-layer stages, followed by a classification module. Because they are trainable, arbitrary input modalities can be modeled beyond natural images.

Our feature extractor is a three-stage ConvNet. The first two stages contain a bank of filters producing multiple feature maps, a point-wise nonlinear mapping, and a spatial pooling, followed by subsampling of each feature map. The last layer only contains a bank of filters. The filters (convolution kernels) are subject to training. Each filter is applied to the input feature maps through a two-dimensional convolution operation, which detects local features at all locations on the input. Each filter bank of a ConvNet produces features that are equivariant under shifts, i.e., if the input is shifted, the output is also shifted but otherwise unchanged.

While ConvNets have been used successfully for a number of image labeling problems, image-level tasks such as full-scene understanding (pixelwise labeling or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. To view a large contextual window at full resolution, a ConvNet would have to be unmanageably large.

The solution is to use a multiscale approach. Our multiscale ConvNet overcomes these limitations by extending the concept of spatial weight replication to the scale space. Given an input image $I$, a multiscale pyramid of images $X_s$, $\forall s \in \{1, \ldots, N\}$, is constructed, where $X_1$ has the size of $I$. The multiscale pyramid can be a Laplacian pyramid and is typically preprocessed so that local neighborhoods have zero mean and unit standard deviation. Given a classical ConvNet $f_s$ with parameters $\theta_s$, the multiscale network is obtained by instantiating one network per scale $s$, and sharing all parameters across scales: $\theta_s = \theta_0$, $\forall s \in \{1, \ldots, N\}$.

We introduce the following convention: Banks of images will be seen as three-dimensional arrays in which the first dimension is the number of independent feature maps or images, the second is the height of the maps, and the third is the width. The output state of the $L$th stage is denoted $H_L$. The maps in the pyramid are computed using a scaling/normalizing function $g_s$ as $X_s = g_s(I)$, for all $s \in \{1, \ldots, N\}$.

For each scale $s$, the ConvNet $f_s$ can be described as a sequence of linear transforms, interspersed with nonlinear symmetric squashing units (typically the tanh function [28]), and pooling/subsampling operators. For a network $f_s$ with $L$ layers, we have

$$f_s(X_s; \theta_s) = W_L H_{L-1}, \quad (1)$$

where the vector of hidden units at layer $l$ is

$$H_l = \mathrm{pool}(\tanh(W_l H_{l-1} + b_l)), \quad (2)$$

for all $l \in \{1, \ldots, L-1\}$, with $b_l$ a vector of bias parameters, and $H_0 = X_s$. The matrices $W_l$ are Toeplitz matrices; therefore each hidden unit vector $H_l$ can be expressed as a regular convolution between kernels from $W_l$ and the previous hidden unit vector $H_{l-1}$, squashed through a tanh, and pooled spatially. More specifically,

$$H_{lp} = \mathrm{pool}\left(\tanh\left(b_{lp} + \sum_{q \in \mathrm{parents}(p)} w_{lpq} * H_{l-1,q}\right)\right). \quad (3)$$

The filters $W_l$ and the biases $b_l$ constitute the trainable parameters of our model, and are collectively denoted $\theta_s$. The function tanh is a point-wise nonlinearity, while pool is a function that considers a neighborhood of activations and produces one activation per neighborhood. In all our experiments, we use a max-pooling operator, which takes the maximum activation within the neighborhood. Pooling over a small neighborhood provides built-in invariance to small translations.

Finally, the outputs of the $N$ networks are upsampled and concatenated so as to produce $F$, a map of feature vectors of size $N$ times the size of $f_1$, which can be seen as local patch descriptors and scene-level descriptors:

$$F = [f_1, u(f_2), \ldots, u(f_N)], \quad (4)$$

where $u$ is an upsampling function.

As mentioned above, weights are shared between networks $f_s$. Intuitively, imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features, and at the same time reduce the chances of overfitting. The more scales used to jointly train the models $f_s(\theta_s)$, the better the representation becomes for all scales. Because image content is, in principle, scale invariant, using the same function to extract features at each scale is justified.

3.2 Learning Discriminative Scale-Invariant Features
As described in Section 3.1, feature vectors in $F$ are obtained by concatenating the outputs of multiple networks $f_s$, each taking as input a different image in a multiscale pyramid. Ideally a linear classifier should produce the correct categorization for all pixel locations $i$ from the feature vectors $F_i$. We train the parameters $\theta_s$ to achieve this goal, using the multiclass cross entropy loss function. Let $\hat{c}_i$ be the normalized prediction vector from the linear classifier for pixel $i$. We compute normalized predicted probability distributions over classes $\hat{c}_{i,a}$ using the softmax function, i.e.,

$$\hat{c}_{i,a} = \frac{e^{w_a^T F_i}}{\sum_{b \in \mathrm{classes}} e^{w_b^T F_i}}, \quad (5)$$
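The sketch below illustrates, in PyTorch, how Eqs. (1)-(5) fit together: a single ConvNet is reused on every pyramid scale (weight sharing across scales), its outputs are upsampled to the finest scale and concatenated into $F$, and a linear classifier produces per-pixel class scores. This is a minimal sketch under assumed layer sizes, not the exact architecture of Section 5.1; the names ScaleNet and MultiscaleFeatures are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class ScaleNet(nn.Module):
    """One copy of f_s: filter bank -> tanh -> max pooling, as in Eqs. (2)-(3)."""
    def __init__(self, in_channels=3, width=16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, width, 7, padding=3)
        self.conv2 = nn.Conv2d(width, 4 * width, 7, padding=3)

    def forward(self, x):
        h = nnf.max_pool2d(torch.tanh(self.conv1(x)), 2)  # H_1 = pool(tanh(W_1 H_0 + b_1))
        h = nnf.max_pool2d(torch.tanh(self.conv2(h)), 2)  # H_2
        return h

class MultiscaleFeatures(nn.Module):
    """Apply the same network to every scale, upsample, concatenate (Eq. (4)), classify (Eq. (5))."""
    def __init__(self, n_scales=3, n_classes=33):
        super().__init__()
        self.net = ScaleNet()  # a single instance: weights are shared across all scales
        self.classifier = nn.Conv2d(n_scales * 64, n_classes, 1)  # linear classifier w_a

    def forward(self, pyramid):
        # pyramid: list of tensors [X_1, ..., X_N], X_1 being the finest scale
        outs = [self.net(x) for x in pyramid]
        target_size = outs[0].shape[-2:]
        feats = [nnf.interpolate(o, size=target_size, mode='bilinear', align_corners=False)
                 for o in outs]
        f_map = torch.cat(feats, dim=1)  # F = [f_1, u(f_2), ..., u(f_N)]
        return nnf.log_softmax(self.classifier(f_map), dim=1)  # per-pixel log c_hat_{i,a}
```

Training with the multiclass cross entropy of Section 3.2 then amounts to applying a negative log-likelihood loss (e.g., nn.NLLLoss) to these per-pixel log-probabilities against the hard target vectors $c_i$.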
The true target probability $c_{i,a}$ of class $a$ being present at location $i$ can either be a distribution of classes at location $i$, in a given neighborhood, or a hard target vector: $c_{i,a} = 1$ if pixel $i$ is labeled $a$, and 0 otherwise. For training maximally discriminative features, we use hard target vectors in this first stage.

Once the parameters $\theta_s$ are trained, the classifier in (5) is discarded, and the feature vectors $F_i$ are used with different strategies, as described in Section 4.

Fig. 2. First labeling strategy from the features: Using superpixels as described in Section 4.1.

neural network, as opposed to the simple linear classifier used in Section 3.2, allows the system to capture nonlinear relationships between the features at different scales. In this case, the final labeling for each component $k$ is given by

$$l_k = \arg\max_{a \in \mathrm{classes}} \hat{d}_{k,a}. \quad (12)$$
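In NumPy terms, Eq. (12) is a single argmax over the predicted class distributions of the components; a minimal sketch, assuming d_hat is an (n_components, n_classes) array produced by the classifier described above:

```python
import numpy as np

def label_components(d_hat):
    """Eq. (12): give component k the class with the largest predicted probability d_hat[k, a]."""
    return d_hat.argmax(axis=1)
```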
Fig. 3. Second labeling strategy from the features: Using a CRF, described in Section 4.2.
4.3 Parameter-Free Multilevel Parsing
One problem subsists with the two methods presented above: the observation level problem. An object, or object part, can be easily classified once it is segmented at the right level. The two methods above are based on an arbitrary segmentation of the image, which typically decomposes it into segments that are too small or, more rarely, too large.

In this section, we propose a method to analyze a family of segmentations and automatically discover the best observation level for each pixel in the image, as illustrated in Fig. 4. One special case of such families is the segmentation tree, in which components are hierarchically organized. Our method is not restricted to such trees and can be used for arbitrary sets of neighborhoods.

In Section 4.3.1, we formulate the search for the most adapted neighborhood of a pixel as an optimization problem. The construction of the cost function that is minimized is then described in Section 4.3.2.

different merging thresholds. In Section 5, we use such an approach by computing multiple levels of the Felzenszwalb algorithm [11]. The Felzenszwalb algorithm is not strictly monotonic, so the structure obtained cannot be cast into a tree: Rather, it has a general graph form in which each pixel belongs to as many superpixels as levels explored. Solving (16) in this case consists of the following procedure: For each pixel $i$, the optimal component $C_{k(i)}$ is the one among all the segmentations with minimal cost $S_{k(i)}$. Thus, the complexity
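The per-pixel selection just described (keep, for every pixel, the component of minimal cost among all explored Felzenszwalb levels) can be sketched as follows. This is an illustration under assumed data layouts (one integer superpixel map and one per-component cost array per level), not the paper's code:

```python
import numpy as np

def pick_optimal_components(level_superpixels, level_costs):
    """For each pixel, keep the component with the lowest cost S_k among all levels
    of the segmentation family (no tree structure is required)."""
    best_cost = np.full(level_superpixels[0].shape, np.inf)
    best_level = np.zeros(level_superpixels[0].shape, dtype=int)
    best_comp = np.zeros(level_superpixels[0].shape, dtype=int)
    for lvl, (sp, costs) in enumerate(zip(level_superpixels, level_costs)):
        cost_map = costs[sp]              # broadcast each component's cost to its pixels
        better = cost_map < best_cost
        best_cost[better] = cost_map[better]
        best_level[better] = lvl
        best_comp[better] = sp[better]
    return best_level, best_comp
```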
TABLE 1
Performance of Our System on the Stanford Background Dataset [15]: Per-Pixel/Average Per-Class Accuracy

TABLE 2
Performance of Our System on the SIFT Flow Dataset [31]: Per-Pixel/Average Per-Class Accuracy
Fig. 7. Example of results on the Stanford background dataset. (b), (d), and (f) Results with different labeling strategies, overlaid with superpixels (see Section 4.1), segments resulting from a threshold in the gPb hierarchy [1], and segments recovered by the maximum purity approach with an optimal cover (see Section 4.3). The result (c) is obtained with a CRF on the superpixels shown in (d), as described in Section 4.2.
5.1 Multiscale Feature Extraction
For all experiments, we use a three-stage ConvNet. The first two layers of the network are composed of a bank of filters of size 7×7 followed by tanh units and 2×2 max-pooling operations. The last layer is a simple filter bank. The filters and pooling dimensions were chosen by a grid search. The input image is transformed into YUV space, and a Laplacian pyramid is constructed from it. The Y, U, and V channels of each scale in the pyramid are then independently locally normalized such that each local 15×15 patch has zero-mean and unit variance. For these experiments, the pyramid consists of three rescaled versions of the input (N = 3), in octaves: 320×240, 160×120, 80×60.

The network is then applied to each three-dimension input map $X_s$. This input is transformed into a 16-dimension feature map, using a bank of 16 filters, 10 connected to the Y channel, the six others connected to the U and V channels. The second layer transforms this 16-dimension feature map into a 64-dimension feature map, each map being produced by a combination of eight randomly selected feature maps from the previous layer. Finally, the 64-dimension feature map is transformed into a 256-dimension feature map, each map being produced by a combination of 32 randomly selected feature maps from the previous layer.

The outputs of each of the three networks are then upsampled and concatenated so as to produce a 256×3 = 768-dimension feature vector map $F$. Given the filter sizes, the network has a field of view of 46×46 at each scale, which means that a feature vector in $F$ is influenced by a 46×46 neighborhood at full resolution, a 92×92 neighborhood at half resolution, and a 184×184 neighborhood at quarter resolution. These neighborhoods are shown in Fig. 1.

The network is trained on all three scales in parallel, using stochastic gradient descent with no second-order information, and minibatches of size 1. Simple grid-search was performed to find the best learning rate ($10^{-3}$) and
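The preprocessing described above can be sketched as follows: three rescaled versions of the YUV input in octaves, each channel locally normalized over 15×15 neighborhoods. This is a rough sketch, assuming SciPy is available; it uses a box filter for the local statistics and simple rescaling rather than the exact Laplacian pyramid of the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, zoom

def local_contrast_normalize(channel, size=15, eps=1e-8):
    """Make every local size x size patch roughly zero-mean and unit-variance."""
    mean = uniform_filter(channel, size)
    centered = channel - mean
    std = np.sqrt(uniform_filter(centered ** 2, size))
    return centered / (std + eps)

def three_scale_pyramid(yuv_image):
    """Three rescaled versions of an (H, W, 3) YUV image in octaves, each channel normalized."""
    scales = []
    for factor in (1.0, 0.5, 0.25):
        img = zoom(yuv_image, (factor, factor, 1), order=1)
        img = np.stack([local_contrast_normalize(img[..., c]) for c in range(3)], axis=-1)
        scales.append(img)
    return scales
```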
Fig. 8. More results using our multiscale ConvNet and a flat CRF on the Stanford background dataset.
the gPb hierarchy on the Stanford Background dataset. A similar test is performed in [30], where the authors also use a CRF on the same superpixels (at the threshold 20 in the gPb hierarchy), but employ different features: histograms of densely sampled SIFT words, colors, locations, and contour shape descriptors. They report a ratio of correctly classified pixels of 81.1 percent on the Stanford Background dataset. We recall that this accuracy is the best one achieved to date on this dataset with a flat CRF.

In our CRF energy, we performed a grid search to set the parameters of (13) (set to 20, 0.1, and 200), and used a gray level gradient. The accuracy of the resulting system is 81.4, as reported in Table 1. Our features are thus outperforming the best publicly available combination of handcrafted features.

5.5 Some Comments on the Learned Features
With recent advances in unsupervised (deep) learning, learned features have become easier to analyze and understand. In this work, the entire stack of features is learned in a purely supervised manner, and yet we found that the features obtained are rather meaningful. We believe that the reason for this is the type of loss function we use, which enforces a large invariance: The system is forced to produce an invariant representation for all the locations of a given object. This type of invariance is very similar to what can be achieved using semi-supervised techniques such as Dr-LIM [18], where the loss forces pairs of similar patches to yield the same encoding. Fig. 10 shows an example of the features learned on the SIFT Flow dataset.

generalization of a recognition system learned on specific, publicly available datasets. We used our multiscale features combined with classification, using superpixels as described in Section 4.1, trained on the SIFT Flow dataset (2,688 images, most of them taken in nonurban environments, see Table 2 and Fig. 9). We collected a 360 degree movie in our workplace environment, including a street and a park, introducing difficulties such as lighting conditions and image distortions (see Fig. 11).

The movie was built from four videos that were stitched to form a 360 degree video stream of 1,280×256 images, thus creating artifacts not seen during training. We processed each frame independently, without using any temporal consistency or smoothing.

Despite all these constraints and the rather small size of the training dataset, we observe rather convincing generalization of our models on these previously unseen scenes. The two video sequences are available at http://www.clement.farabet.net/. Two snapshots are included in Fig. 11.

Our scene parsing system constitutes, to the best of our knowledge, the first approach achieving real-time performance, one frame being processed in less than a second on a 4-core Intel i7. Feature extraction, which represents around
500 ms on the i7, can be reduced to 60 ms using dedicated FPGA hardware [9], [10].

Fig. 11. Real-time scene parsing in natural conditions. Training on the SIFT Flow dataset. We display one label per component in the final prediction.

6 DISCUSSION
The main lessons from the experiments presented in this paper are as follows:

• Using a high-capacity feature-learning system fed with raw pixels yields excellent results when compared with systems that use engineered features. The accuracy is similar to or better than competing systems, even when the segmentation hypothesis generation and the postprocessing module are absent or very simple.
• Feeding the system with a wide contextual window is critical to the quality of the results. The numbers in Table 1 show a dramatic improvement in the performance of the multiscale ConvNet over the single scale version.
• When a wide context is taken into account to produce each pixel label, the role of the postprocessing is greatly reduced. In fact, a simple majority vote of the categories within a superpixel yields state-of-the-art accuracy. This seems to suggest that contextual information can be taken into account by a feed-forward trainable system with a wide contextual window, perhaps as well as an inference mechanism that propagates label constraints over a
graphical model, but with a considerably lower computational cost.
• Highly sophisticated postprocessing schemes, which seem so crucial to the success of other models, do not seem to improve the results significantly over simple schemes. This seems to suggest that the performance is limited by the quality of the labeling or the quality of the segmentation hypotheses, rather than by the quality of the contextual consistency system or the inference algorithm.
• Relying heavily on a highly accurate feed-forward pixel labeling system, while simplifying the postprocessing module to its bare minimum, cuts down the inference times considerably. The resulting system is dramatically faster than those that rely heavily on graphical model inference. Moreover, the bulk of the computation takes place in the ConvNet. This computation is algorithmically simple and easily parallelizable. Implementation on multicore machines, general-purpose GPUs, digital signal processors, or specialized architectures implemented on FPGAs is straightforward. This is demonstrated by the FPGA implementation [9], [10] of the feature extraction scheme presented in this paper that runs in 60 ms for an image resolution of 320×240.

7 CONCLUSION AND FUTURE WORK
This paper demonstrates that a feed-forward ConvNet, trained end-to-end in a supervised manner and fed with raw pixels from large patches over multiple scales, can produce state-of-the-art performance on standard scene parsing datasets. The model does not rely on engineered features and uses purely supervised training from fully labeled images to learn appropriate low-level and mid-level features.

Perhaps the most surprising result is that even in the absence of any postprocessing, by simply labeling each pixel with the highest scoring category produced by the convolutional net for that location, the system yields near state-of-the-art pixel-wise accuracy and better per-class accuracy than all previously published results. Feeding the features of the convolutional net to various sophisticated schemes that generate segmentation hypotheses and that find consistent segmentations and labeling by taking local constraints into account improves the results slightly, but not considerably.

While the results on datasets with few categories are good, the accuracy of the best existing scene parsing systems, including ours, is still quite low when the number of categories is large. The problem of scene parsing is far from being solved. While the system presented here has a number of advantages and shortcomings, the framing of the scene parsing task itself is in need of refinement.

First of all, the pixel-wise accuracy is a somewhat inaccurate measure of the visual and practical quality of the result. Spotting rare objects is often more important than accurately labeling every boundary pixel of the sky (which are often in greater number). The average per-class accuracy is a step in the right direction, but not the ultimate solution: One would prefer a system that correctly spots every object or region while giving an approximate boundary, to a system that produces accurate boundaries for large regions (sky, road, grass), but fails to spot small objects. A reflection is needed on the best ways to measure the accuracy of scene labeling systems.

Scene parsing datasets also need better labels. One could imagine using scene parsing datasets with hierarchical labels so that a window within a building would be labeled as "building" and "window." Using this kind of labeling in conjunction with graph structures on sets of labels that contain is-part-of relationships would likely produce more consistent interpretations of the whole scene.

The framework presented in this paper trains the convolutional net as a pixel labeling system in isolation from the postprocessing module that ensures the consistency of the labeling and its proper registration with the image regions. This requires that the convolutional net be trained with images that are fully labeled at the pixel level. One would hope that jointly fine-tuning the convolutional net and the postprocessor produces better overall interpretations. Gradients can be back-propagated through the postprocessor to the convolutional nets. This is reminiscent of the graph transformer network model, a kind of nonlinear CRF in which an unnormalized graphical model-based postprocessing module was trained jointly with a ConvNet for handwriting recognition [27]. Unfortunately, preliminary experiments with such joint training yielded lower test-set accuracies, probably due to overfitting.

A more important advantage of joint training would be to allow the use of weakly labeled images in which only a list of objects present in the image would be given, perhaps tagged with approximate positions. This would be similar in spirit to sentence-level discriminative training methods used in speech recognition and handwriting recognition [27].

Another possible direction for improvement includes the use of objective functions that directly operate on the edge costs of neighborhood graphs in such a way that graph-cut segmentation and similar methods produce the best answer. One such objective function is Turaga's maximin learning [46], which pushes up the lowest edge cost along the shortest path between two points in different segments and pushes down the highest edge cost along a path between two points in the same segment.

Our system so far has been trained using purely supervised learning applied to a fairly classical ConvNet architecture. However, a number of recent works have shown the advantage of architectural elements such as rectifying nonlinearities and local contrast normalization [21]. More importantly, several works have shown the advantage of using unsupervised pretraining to prime the convolutional net into a good starting point before supervised refinement [37], [22], [23], [29], [24]. These methods improve the performance in the low training set size regime and would probably improve the performance of the present system.

Finally, code and data are available online at http://www.clement.farabet.net/.

ACKNOWLEDGMENTS
The authors would like to thank Marco Scoffier for fruitful discussions and the 360 degree video collection. They are also grateful to Victor Lempitsky who kindly provided them with his results on the Stanford database for comparison. This
work was funded in part by DARPA contract "Integrated deep learning for large scale multimodal data representation," ONR MURI "Provably stable vision-based control of high-speed flight," ONR grant "Learning Hierarchical Models for Information Integration."

REFERENCES
[1] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, "Contour Detection and Hierarchical Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898-916, May 2011.
[2] Y. Boykov and M.P. Jolly, "Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in n-d Images," Proc. IEEE Int'l Conf. Computer Vision, vol. 1, pp. 105-112, 2001.
[3] Y. Boykov and V. Kolmogorov, "An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1124-1137, Sept. 2004.
[4] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001.
[5] J. Carreira and C. Sminchisescu, "CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1312-1328, July 2012.
[6] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, "A Committee of Neural Networks for Traffic Sign Classification," Proc. Int'l Joint Conf. Neural Networks, pp. 1918-1921, 2011.
[7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers," Proc. Int'l Conf. Machine Learning, June 2012.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers," CoRR, Feb. 2012.
[9] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello, "Hardware Accelerated Convolutional Neural Networks for Synthetic Vision Systems," Proc. Int'l Symp. Circuits and Systems, May 2010.
[10] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "Neuflow: A Runtime Reconfigurable Dataflow Processor for Vision," Proc. Fifth IEEE Workshop Embedded Computer Vision, 2011.
[11] P. Felzenszwalb and D. Huttenlocher, "Efficient Graph-Based Image Segmentation," Int'l J. Computer Vision, vol. 59, pp. 167-181, 2004.
[12] L.R. Ford and D.R. Fulkerson, "A Simple Algorithm for Finding Maximal Network Flows and an Application to the Hitchcock Problem," technical report, RAND Corp., 1955.
[13] B. Fulkerson, A. Vedaldi, and S. Soatto, "Class Segmentation and Object Localization with Superpixel Neighborhoods," Proc. 12th IEEE Int'l Conf. Computer Vision, pp. 670-677, 2009.
[14] C. Garcia and M. Delakis, "Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1428, Nov. 2004.
[15] S. Gould, R. Fulton, and D. Koller, "Decomposing a Scene into Geometric and Semantically Consistent Regions," Proc. IEEE Int'l Conf. Computer Vision, pp. 1-8, Sept. 2009.
[16] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller, "Multi-Class Segmentation with Relative Location Prior," Int'l J. Computer Vision, vol. 80, no. 3, pp. 300-316, Dec. 2008.
[17] D. Grangier, L. Bottou, and R. Collobert, "Deep Convolutional Networks for Scene Parsing," Proc. Int'l Conf. Machine Learning, 2009.
[18] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality Reduction by Learning an Invariant Mapping," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[19] X. He and R. Zemel, "Learning Hybrid Models for Image Annotation with Partially Labeled Data," Proc. Advances in Neural Information Processing Systems Conf., 2008.
[20] V. Jain, J.F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. Briggman, M. Helmstaedter, W. Denk, and S.H. Seung, "Supervised Learning of Image Restoration with Convolutional Networks," Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[21] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What Is the Best Multi-Stage Architecture for Object Recognition?" Proc. IEEE Int'l Conf. Computer Vision, 2009.
[22] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun, "Learning Invariant Features Through Topographic Filter Maps," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[23] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition," Technical Report CBLL-TR-2008-12-01, Courant Inst. of Math. Sciences, New York Univ., 2008.
[24] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning Convolutional Feature Hierachies for Visual Recognition," Proc. Advances in Neural Information Processing Systems Conf., vol. 23, 2010.
[25] M. Kumar and D. Koller, "Efficiently Selecting Regions for Scene Understanding," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3217-3224, 2010.
[26] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network," Proc. Advances in Neural Information Processing Systems Conf., 1990.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[28] Y. LeCun, L. Bottou, G. Orr, and K. Muller, "Efficient Backprop," Neural Networks: Tricks of the Trade, Springer, 1998.
[29] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations," Proc. Int'l Conf. Machine Learning, 2009.
[30] V. Lempitsky, A. Vedaldi, and A. Zisserman, "A Pylon Model for Semantic Segmentation," Proc. Advances in Neural Information Processing Systems Conf., 2011.
[31] C. Liu, J. Yuen, and A. Torralba, "Nonparametric Scene Parsing: Label Transfer via Dense Scene Alignment," Artificial Intelligence, 2009.
[32] D. Munoz, J. Bagnell, and M. Hebert, "Stacked Hierarchical Labeling," Proc. 11th European Conf. Computer Vision, Jan. 2010.
[33] L. Najman and M. Schmitt, "Geodesic Saliency of Watershed Contours and Hierarchical Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1163-1173, Dec. 1996.
[34] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano, "Toward Automatic Phenotyping of Developing Embryos from Videos," IEEE Trans. Image Processing, vol. 14, no. 9, pp. 1360-1371, Sept. 2005.
[35] M. Osadchy, Y. LeCun, and M. Miller, "Synergistic Face Detection and Pose Estimation with Energy-Based Models," J. Machine Learning Research, vol. 8, pp. 1197-1215, 2007.
[36] C. Pantofaru, C. Schmid, and M. Hebert, "Object Recognition by Integrating Multiple Image Segmentations," Proc. 10th European Conf. Computer Vision, pp. 481-494, 2008.
[37] M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun, "Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[38] B. Russell, A. Torralba, C. Liu, R. Fergus, and W. Freeman, "Object Recognition by Scene Alignment," Proc. Advances in Neural Information Processing Systems Conf., 2007.
[39] C. Russell, P.H.S. Torr, and P. Kohli, "Associative Hierarchical CRFs for Object Class Image Segmentation," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[40] H. Schulz and S. Behnke, "Learning Object-Class Segmentation with Convolutional Neural Networks," Proc. 11th European Symp. Artificial Neural Networks, 2012.
[41] J. Shotton, J.M. Winn, C. Rother, and A. Criminisi, "TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation," Proc. European Conf. Computer Vision, pp. 1-15, 2006.
[42] P. Simard, D. Steinkraus, and J. Platt, "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," Proc. Seventh Int'l Conf. Document Analysis and Recognition, vol. 2, pp. 958-962, 2003.
[43] R. Socher, C.C. Lin, A.Y. Ng, and C.D. Manning, "Parsing Natural Scenes and Natural Language with Recursive Neural Networks," Proc. 26th Int'l Conf. Machine Learning, 2011.
[44] J. Tighe and S. Lazebnik, "Superparsing: Scalable Nonparametric Image Parsing with Superpixels," Proc. European Conf. Computer Vision, pp. 352-365, 2010.
[45] A. Torralba and A.A. Efros, "Unbiased Look at Data Set Bias," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1521-1528, 2011.
[46] S. Turaga, K. Briggman, M. Helmstaedter, W. Denk, and H. Seung, "Maximin Affinity Learning of Image Segmentation," Proc. Advances in Neural Information Processing Systems Conf., Jan. 2009.
[47] R. Vaillant, C. Monrocq, and Y. LeCun, "Original Approach for the Localisation of Objects in Images," IEE Proc. Vision, Image, and Signal Processing, vol. 141, no. 4, pp. 245-250, Aug. 1994.

Clément Farabet received the master's degree in electrical engineering with honors from the Institut National des Sciences Appliquées de Lyon, France, in 2008. Since 2010, he has been working toward the PhD degree at the Université Paris-Est, with Professor Laurent Najman, in parallel with his research work at Yale University and New York University (NYU). His master's thesis work was developed at the Courant Institute of Mathematical Sciences of NYU under the guidance of Professor Yann LeCun. He then joined Professor Yann LeCun's laboratory in 2008 as a research scientist. In 2009, he started collaborating with Yale University's e-Lab, led by Professor Eugenio Culurciello. His research interests include intelligent hardware, embedded supercomputers, computer vision, machine learning, embedded robotics, and, more broadly, artificial intelligence. His current work aims at developing a massively parallel yet low-power processor for general-purpose vision. Algorithmically, most of this work is based on Professor Yann LeCun's Convolutional Networks, while the hardware has its roots in dataflow computers and architectures as they first appeared in the 1960s.

Camille Couprie received the engineer degree from the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, with the highest honors in 2008, and the PhD degree in computer science from the Université Paris-Est, France, in 2011. Her PhD, advised by Professor Laurent Najman, Professor Hugues Talbot, and Leo Grady, was supported by the French Direction Générale de l'Armement MRIS program and the Centre National de la Recherche Scientifique. Since Autumn 2011, she has been a postdoctoral researcher in the Courant Institute of Mathematical Sciences at New York University in the Computer Science Department with Professor Yann LeCun. Her research interests include image segmentation, optimization techniques, graph theory, PDE, mathematical morphology, and machine learning. In 2013 she received the best interdisciplinary thesis award from the EADS Foundation and won second place in the Gilles Kahn award, which acknowledges the best annual French PhD thesis in computer science.

Laurent Najman received the "Ingénieur" degree from the Ecole des Mines de Paris in 1991, the PhD degree in applied mathematics from Paris-Dauphine University with the highest honor (Félicitations du Jury) in 1994, and the Habilitation à Diriger les Recherches from the University of Marne-la-Vallée in 2006. After receiving the engineering degree, he was with the Central Research Laboratories of Thomson-CSF for three years, working on some problems of infrared image segmentation using mathematical morphology. He then joined a start-up company named Animation Science in 1995 as a director of research and development. In 1998, he joined OC Print Logic Technologies as a senior scientist. He worked there on various problems of image analysis dedicated to scanning and printing. In 2002, he joined the Informatics Department of the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, where he is a professor and member of the Gaspard-Monge Computer Science Research Laboratory, Université Paris-Est Marne-la-Vallée. His current research interests include discrete mathematical morphology. The technology of particle systems for computer graphics and scientific visualisation, developed by the company under his technical leadership, received several awards, including the "European Information Technology Prize 1997" awarded by the European Commission (Esprit programme) and by the European Council for Applied Science and Engineering, and the "Hottest Products of the Year 1996" awarded by Computer Graphics World.

Yann LeCun received the electrical engineer diploma from the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, in 1983, and the PhD degree in computer science from the Université Pierre et Marie Curie, Paris, in 1987. He is a silver professor of computer science and neural science at the Courant Institute of Mathematical Sciences and the Center for Neural Science of New York University (NYU). After doing postdoctoral research with Geoffrey Hinton at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, New Jersey, in 1988, and became the head of the Image Processing Research Department at AT&T Labs-Research in 1996. He joined NYU as a professor in 2003 after a brief period as a fellow at the NEC Research Institute in Princeton, New Jersey. His current interests include machine learning, computer perception and vision, mobile robotics, and computational neuroscience. He has published more than 140 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and VLSI design. His handwriting recognition technology is used by several banks around the world to read checks. His image compression technology, called DjVu, is used by hundreds of web sites and publishers and millions of users to access scanned documents on the Web, and his image recognition methods are used in deployed systems by companies such as Google, Microsoft, NEC, France Telecom, and several startup companies for document recognition, human-computer interaction, image indexing, and video analytics. He has been on the editorial board of the International Journal of Computer Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, and IEEE Transactions on Neural Networks, was program chair of CVPR '06, and is a chair of the annual Learning Workshop. He is on the science advisory board of the Institute for Pure and Applied Mathematics, and is the cofounder of MuseAmi, a music technology company.