Learning Hierarchical Features For Scene Labeling
Abstract—Scene labeling consists of labeling each pixel in an image with the category of the object it belongs to. We propose a
method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of
multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation
that captures texture, shape, and contextual information. We report results using multiple postprocessing methods to produce the final
labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of
components that best explain the scene; these components are arbitrary, for example, they can be taken from a segmentation tree or
from any family of oversegmentations. The system yields record accuracies on the SIFT Flow dataset (33 classes) and the Barcelona
dataset (170 classes), and near-record accuracy on the Stanford Background dataset (eight classes), while being an order of magnitude
faster than competing approaches, producing a 320×240 image labeling in less than a second, including feature extraction.
Index Terms—Convolutional networks, deep learning, image segmentation, image classification, scene parsing
1 INTRODUCTION
Fig. 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a three-stage
ConvNet, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser scale maps being upsampled to match
the size of the finest scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a single segmentation
(i.e., superpixels) or a family of segmentations (e.g., a segmentation tree) are computed to exploit the natural contours of the image. The final
labeling is produced from the feature vectors and the segmentation(s) using different methods, as presented in Section 4.
every pixel in the image, covering a large context. The multiscale convolutional net contains multiple copies of a single network (all sharing the same weights) that are applied to different scales of a Laplacian pyramid version of the input image. For each pixel, the networks collectively encode the information present in a large contextual window around the given pixel (184×184 pixels in the system described here). The ConvNet is fed with raw pixels and trained end to end, thereby alleviating the need for hand-engineered features. When properly trained, these features produce a representation that captures texture, shape, and contextual information. While using a multiscale representation seems natural for FSL, it has rarely been used in the context of feature learning systems. The multiscale representation that is learned is sufficiently complete to allow the detection and recognition of all the objects and regions in the scene. However, it does not accurately pinpoint the boundaries of the regions and requires some postprocessing to yield cleanly delineated predictions.

1.2 Graph-Based Classification
An oversegmentation is constructed from the image and is used to group the feature descriptors. Several oversegmentations are considered, and three techniques are proposed to produce the final image labeling.

1.2.1 Superpixels
The image is segmented into disjoint components, widely oversegmenting the scene. In this scenario, a pixelwise classifier is trained on the convolutional feature vectors, and a simple vote is done for each component to assign a single class per component. This method is simple and effective, but imposes a fixed level of segmentation, which can be suboptimal.

1.2.2 CRF over Superpixels
A CRF is defined over a set of superpixels. Compared to the previous, simpler method, this postprocessing models joint probabilities at the level of the scene, and is useful to avoid local aberrations (e.g., a person in the sky). That kind of approach is widely used in the computer vision community, and we show that our learned multiscale feature representation essentially makes the use of a global random field much less useful: Most scene-level relationships seem to be already captured by it.

1.2.3 Multilevel Cut with Class Purity Criterion
A family of segmentations is constructed over the image to analyze the scene at multiple levels. In the simplest case, this family might be a segmentation tree; in the most general case, it can be any set of segmentations, for example, a collection of superpixels either produced using the same algorithm with different parameter tunings or produced by different algorithms. Each segmentation component is represented by the set of feature vectors that fall into it: The component is encoded by a spatial grid of aggregated feature vectors. The aggregated feature vector of each grid cell is computed by a component-wise max pooling of the feature vectors centered on all the pixels that fall into the grid cell. This produces a scale-invariant representation of the segment and its surroundings. A classifier is then applied to the aggregated feature grid of each node. This classifier is trained to estimate the histogram of all object categories present in the component. A subset of the components is then selected such that they cover the entire image. These components are selected so as to minimize the average "impurity" of the class distribution in a procedure that we name "optimal cover." The class "impurity" is defined as the entropy of the class distribution. The choice of the cover thus attempts to find a consistent overall segmentation in which each segment contains pixels belonging to only one of the learned categories. This simple method allows us to consider full families of segmentation components, rather than a unique, predetermined segmentation (e.g., a single set of superpixels).

All the steps in the process have a complexity linear (or almost linear) in the number of pixels. The bulk of the computation resides in the ConvNet feature extractor. The resulting system is very fast, producing a full parse of a 320×240 image in less than a second on a conventional CPU, and in less than 100 ms using dedicated hardware, opening the door to real-time applications.
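As a concrete illustration of the two simplest ingredients above, the sketch below shows, in plain NumPy, the per-superpixel majority vote of Section 1.2.1 and the entropy-based class "impurity" that the optimal cover of Section 1.2.3 minimizes. This is a minimal sketch, not the paper's implementation; the assumed data layout (an integer superpixel map, a per-component class histogram) is chosen only for illustration.

```python
import numpy as np

def superpixel_majority_vote(pixel_labels, superpixels):
    """Assign to every superpixel the most frequent pixelwise prediction (Section 1.2.1)."""
    out = np.empty_like(pixel_labels)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        out[mask] = np.bincount(pixel_labels[mask]).argmax()
    return out

def class_impurity(class_histogram, eps=1e-12):
    """Entropy of a component's class distribution: the "impurity" the optimal cover minimizes."""
    p = class_histogram / (class_histogram.sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))
```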
Once trained, the system is parameter free and requires no adjustment of thresholds or other knobs.

An early version of this work was first published in [7]. This journal version reports more complete experiments, comparisons, and higher results.

2 RELATED WORK
The scene parsing problem has been approached with a wide variety of methods in recent years. Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and to account for context [19], [39], [15], [25], [32], [44], [30]. Most methods rely on a presegmentation into superpixels or other segment candidates, and extract features and categories from individual segments and from various combinations of neighboring segments. The graphical model inference pulls out the most consistent set of segments which cover the image.

Socher et al. [43] proposed a method to aggregate segments in a greedy fashion using a trained scoring function. The originality of the approach is that the feature vector of the combination of two segments is computed from the feature vectors of the individual segments through a trainable function. Like us, they use "deep learning" methods to train their feature extractor. But unlike us, their feature extractor operates on hand-engineered features.

One of the main questions in scene parsing is how to take a wide context into account to make a local decision. Munoz et al. [32] proposed using the histogram of labels extracted from a coarse scale as input to the labeler that looks at finer scales. Our approach is somewhat simpler: Our feature extractor is applied densely to an image pyramid. The coarse feature maps thereby generated are upsampled to match that of the finest scale. Hence, with three scales, each feature vector has multiple fields that encode multiple regions of increasing sizes and decreasing resolutions, centered on the same pixel location.

Like us, a number of authors have used families of segmentations or trees to generate candidate segments by aggregating elementary segments. The approaches of [39], [30] rely on inference algorithms based on graph cuts to label images using trees of segmentation. Other strategies using families of segmentations appeared in [36] and [5]. None of the previous strategies for scene labeling used a purity criterion on the class distributions. Combined with the optimal cover strategy, this purity criterion is general, efficient, and could be applied to solve different problems.

Contrary to the previously cited approaches using engineered features, our system extracts features densely from a multiscale pyramid of images using a ConvNet [27]. These networks can be fed with raw pixels and can automatically learn low-level and mid-level features, alleviating the need for hand-engineered features. One of their advantages is the ability to compute dense features efficiently over large images. They are best known for their applications to detection and recognition [47], [14], [35], [21], but they have also been used for image segmentation, particularly for biological image segmentation [34], [20], [46].

The only previously published work on using ConvNets for scene parsing is that of [17]. While somewhat preliminary, this work showed that ConvNets fed with raw pixels could be trained to perform scene parsing with decent accuracy. Unlike [17], however, our system uses a boundary-based hierarchy of segmentations to align the labels produced by the network to the boundaries in the image, and thus produces representations that are independent of the size of the segments through feature pooling. Slightly after [8], Schulz and Behnke proposed a similar architecture of a multiscale ConvNet for scene parsing [40]. Unlike us, they use pairwise class location filters to predict the final segmentation, instead of using the image gradient that we found to be more accurate.

3 MULTISCALE FEATURE EXTRACTION FOR SCENE PARSING
The model proposed in this paper, depicted in Fig. 1, relies on two complementary image representations. In the first representation, an image patch is seen as a point in $\mathbb{R}^P$, and we seek to find a transform $f: \mathbb{R}^P \rightarrow \mathbb{R}^Q$ that maps each patch into $\mathbb{R}^Q$, a space where it can be classified linearly. This first representation typically suffers from two main problems when using a classical ConvNet where the image is divided following a grid pattern: 1) The window considered rarely contains an object that is properly centered and scaled, and therefore offers a poor observation basis to predict the class of the underlying object; 2) integrating a large context involves increasing the grid size and therefore the dimensionality $P$ of the input; given a finite amount of training data, it is then necessary to enforce some invariance in the function $f$ itself. This is usually achieved by using pooling/subsampling layers, which in turn degrades the ability of the model to precisely locate and delineate objects. In this paper, $f$ is implemented by a multiscale ConvNet, which allows integrating large contexts (as large as the complete scene) into local decisions, while still remaining manageable in terms of parameters/dimensionality. This multiscale model, in which weights are shared across scales, allows the model to capture long-range interactions without the penalty of extra parameters to train. This model is described in Section 3.1.

In the second representation, the image is seen as an edge-weighted graph on which one or several oversegmentations can be constructed. The components are spatially accurate, and naturally delineate the underlying objects, as this representation conserves pixel-level precision. Section 4 describes multiple strategies to combine both representations. In particular, we describe in Section 4.3 a method for analyzing a family of segmentations (at multiple levels). It can be used as a solution to the first problem exposed above: Assuming the capability of assessing the quality of all the components in this family of segmentations, a system can automatically choose its components so as to produce the best set of predictions.

3.1 Scale-Invariant, Scene-Level Feature Extraction
Good internal representations are hierarchical. In vision, pixels are assembled into edglets, edglets into motifs, motifs into parts, parts into objects, and objects into scenes. This suggests that recognition architectures for vision (and for other modalities such as audio and natural language)
should have multiple trainable stages stacked on top of each other, one for each level in the feature hierarchy. ConvNets provide a simple framework to learn such hierarchies of features.

ConvNets [26], [27] are trainable architectures composed of multiple stages. The input and output of each stage are sets of arrays called feature maps. For example, if the input is a color image, each feature map would be a two-dimensional array containing a color channel of the input image (for an audio input, each feature map would be a one-dimensional array, and for a video or volumetric image, it would be a three-dimensional array). At the output, each feature map represents a particular feature extracted at all locations on the input. Each stage is composed of three layers: a filter bank layer, a nonlinearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two, or three such three-layer stages, followed by a classification module. Because they are trainable, arbitrary input modalities can be modeled beyond natural images.

Our feature extractor is a three-stage ConvNet. The first two stages contain a bank of filters producing multiple feature maps, a point-wise nonlinear mapping, and a spatial pooling, followed by subsampling of each feature map. The last layer only contains a bank of filters. The filters (convolution kernels) are subject to training. Each filter is applied to the input feature maps through a two-dimensional convolution operation, which detects local features at all locations on the input. Each filter bank of a ConvNet produces features that are equivariant under shifts, i.e., if the input is shifted, the output is also shifted but otherwise unchanged.

While ConvNets have been used successfully for a number of image labeling problems, image-level tasks such as full-scene understanding (pixelwise labeling or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. To view a large contextual window at full resolution, a ConvNet would have to be unmanageably large.

The solution is to use a multiscale approach. Our multiscale ConvNet overcomes these limitations by extending the concept of spatial weight replication to the scale space. Given an input image $I$, a multiscale pyramid of images $X_s$, $\forall s \in \{1, \ldots, N\}$, is constructed, where $X_1$ has the size of $I$. The multiscale pyramid can be a Laplacian pyramid and is typically preprocessed so that local neighborhoods have zero mean and unit standard deviation. Given a classical ConvNet $f_s$ with parameters $\theta_s$, the multiscale network is obtained by instantiating one network per scale $s$, and sharing all parameters across scales: $\theta_s = \theta_0$, $\forall s \in \{1, \ldots, N\}$.

We introduce the following convention: Banks of images will be seen as three-dimensional arrays in which the first dimension is the number of independent feature maps or images, the second is the height of the maps, and the third is the width. The output state of the $L$th stage is denoted $H_L$. The maps in the pyramid are computed using a scaling/normalizing function $g_s$ as $X_s = g_s(I)$, for all $s \in \{1, \ldots, N\}$.

For each scale $s$, the ConvNet $f_s$ can be described as a sequence of linear transforms, interspersed with nonlinear symmetric squashing units (typically the tanh function [28]), and pooling/subsampling operators. For a network $f_s$ with $L$ layers, we have

$$f_s(X_s; \theta_s) = W_L H_{L-1}, \quad (1)$$

where the vector of hidden units at layer $l$ is

$$H_l = \mathrm{pool}(\tanh(W_l H_{l-1} + b_l)), \quad (2)$$

for all $l \in \{1, \ldots, L-1\}$, with $b_l$ a vector of bias parameters, and $H_0 = X_s$. The matrices $W_l$ are Toeplitz matrices; therefore each hidden unit vector $H_l$ can be expressed as a regular convolution between kernels from $W_l$ and the previous hidden unit vector $H_{l-1}$, squashed through a tanh, and pooled spatially. More specifically,

$$H_{lp} = \mathrm{pool}\left(\tanh\left(b_{lp} + \sum_{q \in \mathrm{parents}(p)} w_{lpq} * H_{l-1,q}\right)\right). \quad (3)$$

The filters $W_l$ and the biases $b_l$ constitute the trainable parameters of our model, and are collectively denoted $\theta_s$. The function tanh is a point-wise nonlinearity, while pool is a function that considers a neighborhood of activations and produces one activation per neighborhood. In all our experiments, we use a max-pooling operator, which takes the maximum activation within the neighborhood. Pooling over a small neighborhood provides built-in invariance to small translations.

Finally, the outputs of the $N$ networks are upsampled and concatenated so as to produce $F$, a map of feature vectors of size $N$ times the size of $f_1$, which can be seen as local patch descriptors and scene-level descriptors:

$$F = [f_1, u(f_2), \ldots, u(f_N)], \quad (4)$$

where $u$ is an upsampling function.

As mentioned above, weights are shared between networks $f_s$. Intuitively, imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features, and at the same time reduce the chances of overfitting. The more scales used to jointly train the models $f_s(\theta_s)$, the better the representation becomes for all scales. Because image content is, in principle, scale invariant, using the same function to extract features at each scale is justified.

3.2 Learning Discriminative Scale-Invariant Features
As described in Section 3.1, feature vectors in $F$ are obtained by concatenating the outputs of multiple networks $f_s$, each taking as input a different image in a multiscale pyramid. Ideally a linear classifier should produce the correct categorization for all pixel locations $i$ from the feature vectors $F_i$. We train the parameters $\theta_s$ to achieve this goal, using the multiclass cross entropy loss function. Let $\hat{c}_i$ be the normalized prediction vector from the linear classifier for pixel $i$. We compute normalized predicted probability distributions over classes $\hat{c}_{i,a}$ using the softmax function, i.e.,

$$\hat{c}_{i,a} = \frac{e^{w_a^T F_i}}{\sum_{b \in \mathrm{classes}} e^{w_b^T F_i}}, \quad (5)$$
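The sketch below illustrates, in PyTorch, how Eqs. (1)-(5) fit together: a single ConvNet is reused on every pyramid scale (weight sharing across scales), its outputs are upsampled to the finest scale and concatenated into $F$, and a linear classifier produces per-pixel class scores. This is a minimal sketch under assumed layer sizes, not the exact architecture of Section 5.1; the names ScaleNet and MultiscaleFeatures are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class ScaleNet(nn.Module):
    """One copy of f_s: filter bank -> tanh -> max pooling, as in Eqs. (2)-(3)."""
    def __init__(self, in_channels=3, width=16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, width, 7, padding=3)
        self.conv2 = nn.Conv2d(width, 4 * width, 7, padding=3)

    def forward(self, x):
        h = nnf.max_pool2d(torch.tanh(self.conv1(x)), 2)  # H_1 = pool(tanh(W_1 H_0 + b_1))
        h = nnf.max_pool2d(torch.tanh(self.conv2(h)), 2)  # H_2
        return h

class MultiscaleFeatures(nn.Module):
    """Apply the same network to every scale, upsample, concatenate (Eq. (4)), classify (Eq. (5))."""
    def __init__(self, n_scales=3, n_classes=33):
        super().__init__()
        self.net = ScaleNet()  # a single instance: weights are shared across all scales
        self.classifier = nn.Conv2d(n_scales * 64, n_classes, 1)  # linear classifier w_a

    def forward(self, pyramid):
        # pyramid: list of tensors [X_1, ..., X_N], X_1 being the finest scale
        outs = [self.net(x) for x in pyramid]
        target_size = outs[0].shape[-2:]
        feats = [nnf.interpolate(o, size=target_size, mode='bilinear', align_corners=False)
                 for o in outs]
        f_map = torch.cat(feats, dim=1)  # F = [f_1, u(f_2), ..., u(f_N)]
        return nnf.log_softmax(self.classifier(f_map), dim=1)  # per-pixel log c_hat_{i,a}
```

Training with the multiclass cross entropy of Section 3.2 then amounts to applying a negative log-likelihood loss (e.g., nn.NLLLoss) to these per-pixel log-probabilities against the hard target vectors $c_i$.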
The true target probability $c_{i,a}$ of class $a$ being present at location $i$ can either be a distribution of classes at location $i$, in a given neighborhood, or a hard target vector: $c_{i,a} = 1$ if pixel $i$ is labeled $a$, and 0 otherwise. For training maximally discriminative features, we use hard target vectors in this first stage.

Once the parameters $\theta_s$ are trained, the classifier in (5) is discarded, and the feature vectors $F_i$ are used with different strategies, as described in Section 4.

Fig. 2. First labeling strategy from the features: Using superpixels as described in Section 4.1.

neural network, as opposed to the simple linear classifier used in Section 3.2, allows the system to capture nonlinear relationships between the features at different scales. In this case, the final labeling for each component $k$ is given by

$$l_k = \arg\max_{a \in \mathrm{classes}} \hat{d}_{k,a}. \quad (12)$$
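In NumPy terms, Eq. (12) is a single argmax over the predicted class distributions of the components; a minimal sketch, assuming d_hat is an (n_components, n_classes) array produced by the classifier described above:

```python
import numpy as np

def label_components(d_hat):
    """Eq. (12): give component k the class with the largest predicted probability d_hat[k, a]."""
    return d_hat.argmax(axis=1)
```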
Fig. 3. Second labeling strategy from the features: Using a CRF, described in Section 4.2.
4.3 Parameter-Free Multilevel Parsing
One problem subsists with the two methods presented above: the observation level problem. An object, or object part, can be easily classified once it is segmented at the right level. The two methods above are based on an arbitrary segmentation of the image, which typically decomposes it into segments that are too small or, more rarely, too large.

In this section, we propose a method to analyze a family of segmentations and automatically discover the best observation level for each pixel in the image, as illustrated in Fig. 4. One special case of such families is the segmentation tree, in which components are hierarchically organized. Our method is not restricted to such trees and can be used for arbitrary sets of neighborhoods.

In Section 4.3.1, we formulate the search for the most adapted neighborhood of a pixel as an optimization problem. The construction of the cost function that is minimized is then described in Section 4.3.2.

different merging thresholds. In Section 5, we use such an approach by computing multiple levels of the Felzenszwalb algorithm [11]. The Felzenszwalb algorithm is not strictly monotonic, so the structure obtained cannot be cast into a tree: Rather, it has a general graph form in which each pixel belongs to as many superpixels as levels explored. Solving (16) in this case consists of the following procedure: For each pixel $i$, the optimal component $C_{k(i)}$ is the one among all the segmentations with minimal cost $S_{k(i)}$. Thus, the complexity
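The per-pixel selection just described (keep, for every pixel, the component of minimal cost among all explored Felzenszwalb levels) can be sketched as follows. This is an illustration under assumed data layouts (one integer superpixel map and one per-component cost array per level), not the paper's code:

```python
import numpy as np

def pick_optimal_components(level_superpixels, level_costs):
    """For each pixel, keep the component with the lowest cost S_k among all levels
    of the segmentation family (no tree structure is required)."""
    best_cost = np.full(level_superpixels[0].shape, np.inf)
    best_level = np.zeros(level_superpixels[0].shape, dtype=int)
    best_comp = np.zeros(level_superpixels[0].shape, dtype=int)
    for lvl, (sp, costs) in enumerate(zip(level_superpixels, level_costs)):
        cost_map = costs[sp]              # broadcast each component's cost to its pixels
        better = cost_map < best_cost
        best_cost[better] = cost_map[better]
        best_level[better] = lvl
        best_comp[better] = sp[better]
    return best_level, best_comp
```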
TABLE 1
Performance of Our System on the Stanford Background Dataset [15]: Per-Pixel/Average Per-Class Accuracy

TABLE 2
Performance of Our System on the SIFT Flow Dataset [31]: Per-Pixel/Average Per-Class Accuracy
Fig. 7. Example of results on the Stanford background dataset. (b), (d), and (f) Results with different labeling strategies, overlaid with superpixels (see Section 4.1), segments resulting from a threshold in the gPb hierarchy [1], and segments recovered by the maximum purity approach with an optimal cover (see Section 4.3). The result (c) is obtained with a CRF on the superpixels shown in (d), as described in Section 4.2.
5.1 Multiscale Feature Extraction
For all experiments, we use a three-stage ConvNet. The first two layers of the network are composed of a bank of filters of size 7×7 followed by tanh units and 2×2 max-pooling operations. The last layer is a simple filter bank. The filters and pooling dimensions were chosen by a grid search. The input image is transformed into YUV space, and a Laplacian pyramid is constructed from it. The Y, U, and V channels of each scale in the pyramid are then independently locally normalized such that each local 15×15 patch has zero-mean and unit variance. For these experiments, the pyramid consists of three rescaled versions of the input (N = 3), in octaves: 320×240, 160×120, 80×60.

The network is then applied to each three-dimension input map $X_s$. This input is transformed into a 16-dimension feature map, using a bank of 16 filters, 10 connected to the Y channel, the six others connected to the U and V channels. The second layer transforms this 16-dimension feature map into a 64-dimension feature map, each map being produced by a combination of eight randomly selected feature maps from the previous layer. Finally, the 64-dimension feature map is transformed into a 256-dimension feature map, each map being produced by a combination of 32 randomly selected feature maps from the previous layer.

The outputs of each of the three networks are then upsampled and concatenated so as to produce a 256×3 = 768-dimension feature vector map $F$. Given the filter sizes, the network has a field of view of 46×46 at each scale, which means that a feature vector in $F$ is influenced by a 46×46 neighborhood at full resolution, a 92×92 neighborhood at half resolution, and a 184×184 neighborhood at quarter resolution. These neighborhoods are shown in Fig. 1.

The network is trained on all three scales in parallel, using stochastic gradient descent with no second-order information, and minibatches of size 1. Simple grid-search was performed to find the best learning rate ($10^{-3}$) and
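The preprocessing described above can be sketched as follows: three rescaled versions of the YUV input in octaves, each channel locally normalized over 15×15 neighborhoods. This is a rough sketch, assuming SciPy is available; it uses a box filter for the local statistics and simple rescaling rather than the exact Laplacian pyramid of the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, zoom

def local_contrast_normalize(channel, size=15, eps=1e-8):
    """Make every local size x size patch roughly zero-mean and unit-variance."""
    mean = uniform_filter(channel, size)
    centered = channel - mean
    std = np.sqrt(uniform_filter(centered ** 2, size))
    return centered / (std + eps)

def three_scale_pyramid(yuv_image):
    """Three rescaled versions of an (H, W, 3) YUV image in octaves, each channel normalized."""
    scales = []
    for factor in (1.0, 0.5, 0.25):
        img = zoom(yuv_image, (factor, factor, 1), order=1)
        img = np.stack([local_contrast_normalize(img[..., c]) for c in range(3)], axis=-1)
        scales.append(img)
    return scales
```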
Fig. 8. More results using our multiscale ConvNet and a flat CRF on the Stanford background dataset.
the gPb hierarchy on the Stanford Background dataset. A similar test is performed in [30], where the authors also use a CRF on the same superpixels (at the threshold 20 in the gPb hierarchy), but employ different features: histograms of densely sampled SIFT words, colors, locations, and contour shape descriptors. They report a ratio of correctly classified pixels of 81.1 percent on the Stanford Background dataset. We recall that this accuracy is the best one achieved to date on this dataset with a flat CRF.

In our CRF energy, we performed a grid search to set the parameters of (13) (set to 20, 0.1, and 200), and used a gray level gradient. The accuracy of the resulting system is 81.4, as reported in Table 1. Our features are thus outperforming the best publicly available combination of handcrafted features.

5.5 Some Comments on the Learned Features
With recent advances in unsupervised (deep) learning, learned features have become easier to analyze and understand. In this work, the entire stack of features is learned in a purely supervised manner, and yet we found that the features obtained are rather meaningful. We believe that the reason for this is the type of loss function we use, which enforces a large invariance: The system is forced to produce an invariant representation for all the locations of a given object. This type of invariance is very similar to what can be achieved using semi-supervised techniques such as Dr-LIM [18], where the loss forces pairs of similar patches to yield the same encoding. Fig. 10 shows an example of the features learned on the SIFT Flow dataset.

generalization of a recognition system learned on specific, publicly available datasets. We used our multiscale features combined with classification, using superpixels as described in Section 4.1, trained on the SIFT Flow dataset (2,688 images, most of them taken in nonurban environments, see Table 2 and Fig. 9). We collected a 360 degree movie in our workplace environment, including a street and a park, introducing difficulties such as lighting conditions and image distortions (see Fig. 11).

The movie was built from four videos that were stitched to form a 360 degree video stream of 1,280×256 images, thus creating artifacts not seen during training. We processed each frame independently, without using any temporal consistency or smoothing.

Despite all these constraints and the rather small size of the training dataset, we observe rather convincing generalization of our models on these previously unseen scenes. The two video sequences are available at http://www.clement.farabet.net/. Two snapshots are included in Fig. 11.

Our scene parsing system constitutes, to the best of our knowledge, the first approach achieving real-time performance, one frame being processed in less than a second on a 4-core Intel i7. Feature extraction, which represents around
500 ms on the i7, can be reduced to 60 ms using dedicated FPGA hardware [9], [10].

Fig. 11. Real-time scene parsing in natural conditions. Training on the SIFT Flow dataset. We display one label per component in the final prediction.

6 DISCUSSION
The main lessons from the experiments presented in this paper are as follows:

• Using a high-capacity feature-learning system fed with raw pixels yields excellent results when compared with systems that use engineered features. The accuracy is similar to or better than competing systems, even when the segmentation hypothesis generation and the postprocessing module are absent or very simple.
• Feeding the system with a wide contextual window is critical to the quality of the results. The numbers in Table 1 show a dramatic improvement in the performance of the multiscale ConvNet over the single scale version.
• When a wide context is taken into account to produce each pixel label, the role of the postprocessing is greatly reduced. In fact, a simple majority vote of the categories within a superpixel yields state-of-the-art accuracy. This seems to suggest that contextual information can be taken into account by a feed-forward trainable system with a wide contextual window, perhaps as well as an inference mechanism that propagates label constraints over a
graphical model, but with a considerably lower computational cost.
• Highly sophisticated postprocessing schemes, which seem so crucial to the success of other models, do not seem to improve the results significantly over simple schemes. This seems to suggest that the performance is limited by the quality of the labeling or the quality of the segmentation hypotheses, rather than by the quality of the contextual consistency system or the inference algorithm.
• Relying heavily on a highly accurate feed-forward pixel labeling system, while simplifying the postprocessing module to its bare minimum, cuts down the inference times considerably. The resulting system is dramatically faster than those that rely heavily on graphical model inference. Moreover, the bulk of the computation takes place in the ConvNet. This computation is algorithmically simple and easily parallelizable. Implementation on multicore machines, general-purpose GPUs, digital signal processors, or specialized architectures implemented on FPGAs is straightforward. This is demonstrated by the FPGA implementation [9], [10] of the feature extraction scheme presented in this paper that runs in 60 ms for an image resolution of 320×240.

7 CONCLUSION AND FUTURE WORK
This paper demonstrates that a feed-forward ConvNet, trained end-to-end in a supervised manner and fed with raw pixels from large patches over multiple scales, can produce state-of-the-art performance on standard scene parsing datasets. The model does not rely on engineered features and uses purely supervised training from fully labeled images to learn appropriate low-level and mid-level features.

Perhaps the most surprising result is that even in the absence of any postprocessing, by simply labeling each pixel with the highest scoring category produced by the convolutional net for that location, the system yields near state-of-the-art pixel-wise accuracy and better per-class accuracy than all previously published results. Feeding the features of the convolutional net to various sophisticated schemes that generate segmentation hypotheses and that find consistent segmentations and labeling by taking local constraints into account improves the results slightly, but not considerably.

While the results on datasets with few categories are good, the accuracy of the best existing scene parsing systems, including ours, is still quite low when the number of categories is large. The problem of scene parsing is far from being solved. While the system presented here has a number of advantages and shortcomings, the framing of the scene parsing task itself is in need of refinement.

First of all, the pixel-wise accuracy is a somewhat inaccurate measure of the visual and practical quality of the result. Spotting rare objects is often more important than accurately labeling every boundary pixel of the sky (which are often in greater number). The average per-class accuracy is a step in the right direction, but not the ultimate solution: One would prefer a system that correctly spots every object or region while giving an approximate boundary, to a system that produces accurate boundaries for large regions (sky, road, grass), but fails to spot small objects. A reflection is needed on the best ways to measure the accuracy of scene labeling systems.

Scene parsing datasets also need better labels. One could imagine using scene parsing datasets with hierarchical labels so that a window within a building would be labeled as "building" and "window." Using this kind of labeling in conjunction with graph structures on sets of labels that contain is-part-of relationships would likely produce more consistent interpretations of the whole scene.

The framework presented in this paper trains the convolutional net as a pixel labeling system in isolation from the postprocessing module that ensures the consistency of the labeling and its proper registration with the image regions. This requires that the convolutional net be trained with images that are fully labeled at the pixel level. One would hope that jointly fine-tuning the convolutional net and the postprocessor produces better overall interpretations. Gradients can be back-propagated through the postprocessor to the convolutional nets. This is reminiscent of the graph transformer network model, a kind of nonlinear CRF in which an unnormalized graphical model-based postprocessing module was trained jointly with a ConvNet for handwriting recognition [27]. Unfortunately, preliminary experiments with such joint training yielded lower test-set accuracies, probably due to overfitting.

A more important advantage of joint training would be to allow the use of weakly labeled images in which only a list of objects present in the image would be given, perhaps tagged with approximate positions. This would be similar in spirit to sentence-level discriminative training methods used in speech recognition and handwriting recognition [27].

Another possible direction for improvement includes the use of objective functions that directly operate on the edge costs of neighborhood graphs in such a way that graph-cut segmentation and similar methods produce the best answer. One such objective function is Turaga's maximin learning [46], which pushes up the lowest edge cost along the shortest path between two points in different segments and pushes down the highest edge cost along a path between two points in the same segment.

Our system so far has been trained using purely supervised learning applied to a fairly classical ConvNet architecture. However, a number of recent works have shown the advantage of architectural elements such as rectifying nonlinearities and local contrast normalization [21]. More importantly, several works have shown the advantage of using unsupervised pretraining to prime the convolutional net into a good starting point before supervised refinement [37], [22], [23], [29], [24]. These methods improve the performance in the low training set size regime and would probably improve the performance of the present system.

Finally, code and data are available online at http://www.clement.farabet.net/.

ACKNOWLEDGMENTS
The authors would like to thank Marco Scoffier for fruitful discussions and the 360 degree video collection. They are also grateful to Victor Lempitsky who kindly provided them with his results on the Stanford database for comparison. This
work was funded in part by DARPA contract "Integrated deep learning for large scale multimodal data representation," ONR MURI "Provably stable vision-based control of high-speed flight," ONR grant "Learning Hierarchical Models for Information Integration."

REFERENCES
[1] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, "Contour Detection and Hierarchical Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898-916, May 2011.
[2] Y. Boykov and M.P. Jolly, "Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in n-d Images," Proc. IEEE Int'l Conf. Computer Vision, vol. 1, pp. 105-112, 2001.
[3] Y. Boykov and V. Kolmogorov, "An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1124-1137, Sept. 2004.
[4] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001.
[5] J. Carreira and C. Sminchisescu, "CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1312-1328, July 2012.
[6] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, "A Committee of Neural Networks for Traffic Sign Classification," Proc. Int'l Joint Conf. Neural Networks, pp. 1918-1921, 2011.
[7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers," Proc. Int'l Conf. Machine Learning, June 2012.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers," CoRR, Feb. 2012.
[9] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello, "Hardware Accelerated Convolutional Neural Networks for Synthetic Vision Systems," Proc. Int'l Symp. Circuits and Systems, May 2010.
[10] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "Neuflow: A Runtime Reconfigurable Dataflow Processor for Vision," Proc. Fifth IEEE Workshop Embedded Computer Vision, 2011.
[11] P. Felzenszwalb and D. Huttenlocher, "Efficient Graph-Based Image Segmentation," Int'l J. Computer Vision, vol. 59, pp. 167-181, 2004.
[12] L.R. Ford and D.R. Fulkerson, "A Simple Algorithm for Finding Maximal Network Flows and an Application to the Hitchcock Problem," technical report, RAND Corp., 1955.
[13] B. Fulkerson, A. Vedaldi, and S. Soatto, "Class Segmentation and Object Localization with Superpixel Neighborhoods," Proc. 12th IEEE Int'l Conf. Computer Vision, pp. 670-677, 2009.
[14] C. Garcia and M. Delakis, "Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1428, Nov. 2004.
[15] S. Gould, R. Fulton, and D. Koller, "Decomposing a Scene into Geometric and Semantically Consistent Regions," Proc. IEEE Int'l Conf. Computer Vision, pp. 1-8, Sept. 2009.
[16] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller, "Multi-Class Segmentation with Relative Location Prior," Int'l J. Computer Vision, vol. 80, no. 3, pp. 300-316, Dec. 2008.
[17] D. Grangier, L. Bottou, and R. Collobert, "Deep Convolutional Networks for Scene Parsing," Proc. Int'l Conf. Machine Learning, 2009.
[18] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality Reduction by Learning an Invariant Mapping," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[19] X. He and R. Zemel, "Learning Hybrid Models for Image Annotation with Partially Labeled Data," Proc. Advances in Neural Information Processing Systems Conf., 2008.
[20] V. Jain, J.F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. Briggman, M. Helmstaedter, W. Denk, and S.H. Seung, "Supervised Learning of Image Restoration with Convolutional Networks," Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[21] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What Is the Best Multi-Stage Architecture for Object Recognition?" Proc. IEEE Int'l Conf. Computer Vision, 2009.
[22] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun, "Learning Invariant Features Through Topographic Filter Maps," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[23] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition," Technical Report CBLL-TR-2008-12-01, Courant Inst. of Math. Sciences, New York Univ., 2008.
[24] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning Convolutional Feature Hierachies for Visual Recognition," Proc. Advances in Neural Information Processing Systems Conf., vol. 23, 2010.
[25] M. Kumar and D. Koller, "Efficiently Selecting Regions for Scene Understanding," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3217-3224, 2010.
[26] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network," Proc. Advances in Neural Information Processing Systems Conf., 1990.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[28] Y. LeCun, L. Bottou, G. Orr, and K. Muller, "Efficient Backprop," Neural Networks: Tricks of the Trade, Springer, 1998.
[29] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations," Proc. Int'l Conf. Machine Learning, 2009.
[30] V. Lempitsky, A. Vedaldi, and A. Zisserman, "A Pylon Model for Semantic Segmentation," Proc. Advances in Neural Information Processing Systems Conf., 2011.
[31] C. Liu, J. Yuen, and A. Torralba, "Nonparametric Scene Parsing: Label Transfer via Dense Scene Alignment," Artificial Intelligence, 2009.
[32] D. Munoz, J. Bagnell, and M. Hebert, "Stacked Hierarchical Labeling," Proc. 11th European Conf. Computer Vision, Jan. 2010.
[33] L. Najman and M. Schmitt, "Geodesic Saliency of Watershed Contours and Hierarchical Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1163-1173, Dec. 1996.
[34] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano, "Toward Automatic Phenotyping of Developing Embryos from Videos," IEEE Trans. Image Processing, vol. 14, no. 9, pp. 1360-1371, Sept. 2005.
[35] M. Osadchy, Y. LeCun, and M. Miller, "Synergistic Face Detection and Pose Estimation with Energy-Based Models," J. Machine Learning Research, vol. 8, pp. 1197-1215, 2007.
[36] C. Pantofaru, C. Schmid, and M. Hebert, "Object Recognition by Integrating Multiple Image Segmentations," Proc. 10th European Conf. Computer Vision, pp. 481-494, 2008.
[37] M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun, "Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[38] B. Russell, A. Torralba, C. Liu, R. Fergus, and W. Freeman, "Object Recognition by Scene Alignment," Proc. Advances in Neural Information Processing Systems Conf., 2007.
[39] C. Russell, P.H.S. Torr, and P. Kohli, "Associative Hierarchical CRFs for Object Class Image Segmentation," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[40] H. Schulz and S. Behnke, "Learning Object-Class Segmentation with Convolutional Neural Networks," Proc. 11th European Symp. Artificial Neural Networks, 2012.
[41] J. Shotton, J.M. Winn, C. Rother, and A. Criminisi, "TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation," Proc. European Conf. Computer Vision, pp. 1-15, 2006.
[42] P. Simard, D. Steinkraus, and J. Platt, "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," Proc. Seventh Int'l Conf. Document Analysis and Recognition, vol. 2, pp. 958-962, 2003.
[43] R. Socher, C.C. Lin, A.Y. Ng, and C.D. Manning, "Parsing Natural Scenes and Natural Language with Recursive Neural Networks," Proc. 26th Int'l Conf. Machine Learning, 2011.
[44] J. Tighe and S. Lazebnik, "Superparsing: Scalable Nonparametric Image Parsing with Superpixels," Proc. European Conf. Computer Vision, pp. 352-365, 2010.
[45] A. Torralba and A.A. Efros, "Unbiased Look at Data Set Bias," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1521-1528, 2011.
[46] S. Turaga, K. Briggman, M. Helmstaedter, W. Denk, and H. Seung, "Maximin Affinity Learning of Image Segmentation," Proc. Advances in Neural Information Processing Systems Conf., Jan. 2009.
[47] R. Vaillant, C. Monrocq, and Y. LeCun, "Original Approach for the Localisation of Objects in Images," IEE Proc. Vision, Image, and Signal Processing, vol. 141, no. 4, pp. 245-250, Aug. 1994.

Clément Farabet received the master's degree in electrical engineering with honors from the Institut National des Sciences Appliquées de Lyon, France, in 2008. Since 2010, he has been working toward the PhD degree at the Université Paris-Est, with Professor Laurent Najman, in parallel with his research work at Yale University and New York University (NYU). His master's thesis work was developed at the Courant Institute of Mathematical Sciences of NYU under the guidance of Professor Yann LeCun. He then joined Professor Yann LeCun's laboratory in 2008 as a research scientist. In 2009, he started collaborating with Yale University's e-Lab, led by Professor Eugenio Culurciello. His research interests include intelligent hardware, embedded supercomputers, computer vision, machine learning, embedded robotics, and, more broadly, artificial intelligence. His current work aims at developing a massively parallel yet low-power processor for general-purpose vision. Algorithmically, most of this work is based on Professor Yann LeCun's Convolutional Networks, while the hardware has its roots in dataflow computers and architectures as they first appeared in the 1960s.

Camille Couprie received the engineer degree from the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, with the highest honors in 2008, and the PhD degree in computer science from the Université Paris-Est, France, in 2011. Her PhD, advised by Professor Laurent Najman, Professor Hugues Talbot, and Leo Grady, was supported by the French Direction Générale de l'Armement MRIS program and the Centre National de la Recherche Scientifique. Since Autumn 2011, she has been a postdoctoral researcher in the Courant Institute of Mathematical Sciences at New York University in the Computer Science Department with Professor Yann LeCun. Her research interests include image segmentation, optimization techniques, graph theory, PDE, mathematical morphology, and machine learning. In 2013 she received the best interdisciplinary thesis award from the EADS Foundation and won second place in the Gilles Kahn award, which acknowledges the best annual French PhD thesis in computer science.

Laurent Najman received the "Ingénieur" degree from the Ecole des Mines de Paris in 1991, the PhD degree in applied mathematics from Paris-Dauphine University with the highest honor (Félicitations du Jury) in 1994, and the Habilitation à Diriger les Recherches from the University of Marne-la-Vallée in 2006. After receiving the engineering degree, he was with the Central Research Laboratories of Thomson-CSF for three years, working on some problems of infrared image segmentation using mathematical morphology. He then joined a start-up company named Animation Science in 1995 as a director of research and development. In 1998, he joined OC Print Logic Technologies as a senior scientist. He worked there on various problems of image analysis dedicated to scanning and printing. In 2002, he joined the Informatics Department of the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, where he is a professor and member of the Gaspard-Monge Computer Science Research Laboratory, Université Paris-Est Marne-la-Vallée. His current research interests include discrete mathematical morphology. The technology of particle systems for computer graphics and scientific visualisation, developed by the company under his technical leadership, received several awards, including the "European Information Technology Prize 1997" awarded by the European Commission (Esprit programme) and by the European Council for Applied Science and Engineering, and the "Hottest Products of the Year 1996" awarded by Computer Graphics World.

Yann LeCun received the electrical engineer diploma from the Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, in 1983, and the PhD degree in computer science from the Université Pierre et Marie Curie, Paris, in 1987. He is a silver professor of computer science and neural science at the Courant Institute of Mathematical Sciences and the Center for Neural Science of New York University (NYU). After doing postdoctoral research with Geoffrey Hinton at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, New Jersey, in 1988, and became the head of the Image Processing Research Department at AT&T Labs-Research in 1996. He joined NYU as a professor in 2003 after a brief period as a fellow at the NEC Research Institute in Princeton, New Jersey. His current interests include machine learning, computer perception and vision, mobile robotics, and computational neuroscience. He has published more than 140 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and VLSI design. His handwriting recognition technology is used by several banks around the world to read checks. His image compression technology, called DjVu, is used by hundreds of web sites and publishers and millions of users to access scanned documents on the Web, and his image recognition methods are used in deployed systems by companies such as Google, Microsoft, NEC, France Telecom, and several startup companies for document recognition, human-computer interaction, image indexing, and video analytics. He has been on the editorial board of the International Journal of Computer Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, and IEEE Transactions on Neural Networks, was program chair of CVPR '06, and is a chair of the annual Learning Workshop. He is on the science advisory board of the Institute for Pure and Applied Mathematics, and is the cofounder of MuseAmi, a music technology company.