
Class Segmentation and Object Localization with Superpixel Neighborhoods

Brian Fulkerson¹    Andrea Vedaldi²    Stefano Soatto¹

¹Department of Computer Science, University of California, Los Angeles, CA 90095
²Department of Engineering Science, University of Oxford, UK

{bfulkers,soatto}@cs.ucla.edu    [email protected]
Abstract
We propose a method to identify and localize object
classes in images. Instead of operating at the pixel level,
we advocate the use of superpixels as the basic unit of a
class segmentation or pixel localization scheme. To this
end, we construct a classifier on the histogram of local fea-
tures found in each superpixel. We regularize this clas-
sifier by aggregating histograms in the neighborhood of
each superpixel and then refine our results further by us-
ing the classifier in a conditional random field operating
on the superpixel graph. Our proposed method exceeds
the previously published state-of-the-art on two challeng-
ing datasets: Graz-02 and the PASCAL VOC 2007 Segmen-
tation Challenge.
1. Introduction
Recent success in image-level object categorization has
led to significant interest in the related fronts of localization
and pixel-level categorization. Both areas have seen consid-
erable progress, driven by object detection challenges such
as PASCAL VOC [9]. So far, the most promising techniques
seem to be those that consider each pixel of an image.
For localization, sliding window classifiers [8, 3, 21, 35]
consider a window (or all possible windows) around each
pixel of an image and attempt to find the classification
which best fits the model. Lately, this model often includes
some form of spatial consistency (e.g. [22]). In this way, we
can view sliding window classification as a “top-down” lo-
calization technique which tries to fit a coarse global object
model to each possible location.
In object class segmentation, the goal is to produce a
pixel-level segmentation of the input image. Most ap-
proaches are built from the bottom up on learned local rep-
resentations (e.g. TextonBoost [32]) and can be seen as an
evolution of texture detectors. Because of their rather lo-
cal nature, a conditional random field [20] or some other
model is often introduced to enforce spatial consistency.
For computational reasons, this usually operates on a re-
duced grid of the image, abandoning pixel accuracy in favor
of speed. The current state-of-the-art for the PASCAL VOC
2007 Segmentation Challenge [31] is a scheme which falls
into this category.
Rather than using the pixel grid, we advocate a repre-
sentation adapted to the local structure of the image. We
consider small regions obtained from a conservative over-
segmentation, or “superpixels,” [29, 10, 25] to be the ele-
mentary unit of any detection, categorization or localization
scheme.
On the surface, using superpixels as the elementary units
seems counter-productive: aggregating pixels into groups
entails a decision unrelated to the final task. However,
superpixels capture the local redundancy in the data, and
the grouping can be made conservatively to minimize the
risk of merging unrelated pixels [33]. At the same time, moving
to superpixels allows us to measure feature statistics (in this
case: histograms of visual words) on a naturally adaptive
domain rather than on a fixed window. Since superpixels
tend to preserve boundaries, we also have the opportunity
to create a very accurate segmentation by simply finding
the superpixels which are part of the object.
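The neighborhood-aggregation idea above can be sketched in a few lines. The function and variable names here are illustrative, not the authors' implementation: each superpixel accumulates a histogram over quantized local features ("visual words"), and histograms of adjacent superpixels are then summed to regularize the per-region descriptor.

```python
def superpixel_histograms(words, labels, n_words):
    """Visual-word histogram for each superpixel.

    words  -- quantized feature index per local feature
    labels -- superpixel label per local feature
    """
    hists = {}
    for w, l in zip(words, labels):
        hists.setdefault(l, [0] * n_words)
        hists[l][w] += 1
    return hists

def aggregate_neighborhood(hists, adjacency):
    """Sum each superpixel's histogram with its neighbors'
    (a one-ring neighborhood on the superpixel graph)."""
    agg = {}
    for l, h in hists.items():
        total = list(h)
        for n in adjacency.get(l, ()):
            total = [a + b for a, b in zip(total, hists[n])]
        agg[l] = total
    return agg
```

For instance, with two adjacent superpixels, `aggregate_neighborhood` gives both regions the pooled histogram of the pair, which is what lends the region classifier its robustness.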
We show that by aggregating neighborhoods of superpix-
els we can create a robust region classifier which exceeds
the state-of-the-art on Graz-02 pixel-localization and on the
PASCAL VOC 2007 Segmentation Challenge. Our results
can be further refined by a simple conditional random field
(CRF), proposed in Section 3.4, that operates on superpix-
els.
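To make the role of such a CRF concrete, a typical energy for a superpixel-level model takes the generic form below; this form is illustrative only, and the specific unary and pairwise potentials used are defined in Section 3.4:

```latex
E(c) = \sum_{i \in \mathcal{S}} \psi_i(c_i)
     + \lambda \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(c_i, c_j)
```

Here $\mathcal{S}$ is the set of superpixels, $\mathcal{E}$ the edges of the superpixel adjacency graph, $\psi_i$ a unary term derived from the region classifier, and $\psi_{ij}$ a pairwise smoothness term weighted by $\lambda$.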
2. Related Work
Sliding window classifiers have been well explored for
the task of detecting the location of an object in an image [3,
21, 8, 9]. Most recently, Blaschko et al. [3] have shown
that it is feasible to search all possible sub-windows of an
image for an object using branch and bound and a structured
classifier whose output is a bounding box. However, for our
purposes a bounding box is not an acceptable final output,
even for the task of localization.