Spatial Pyramid Matching for Scene Category Recognition
Abstract
In this work we present a method for scene category recognition based on matching approximate global correspondences, as proposed by Lazebnik et al. [1]. The method partitions an image into increasingly fine sub-regions and computes histograms of local features over those sub-regions. The resulting spatial pyramid is an extension of the orderless bag-of-features representation of an image. The approach shows a significant improvement on scene categorization tasks and is one of the best-performing methods for object recognition on the Caltech-101 database. Our work focuses on applying this approach to the newer Caltech-256 [8] dataset, which is generally considered more challenging than the Caltech-101 [3].
1. Introduction
Our task in this project is to find the category of an image. One of the dominant approaches to this task is the bag-of-features method, which represents an image as an orderless collection of local features. However, because this method and its relatives build a histogram of features computed at the dominant interest points, they throw away all information about the spatial layout of those features. As a result they cannot capture the shape of an object or segment it from the background. It would therefore be better if we could use the spatial information to build a structural object description. This is not a simple task in the presence of occlusion, clutter, or viewpoint changes. There has been considerable work towards building robust structural object descriptors. Some of it involves generative part models [3], [4], which impose relationships between the positions of the detected parts. Another approach considered suitable for this task finds pairwise relations between neighboring local features. However, these methods are either computationally expensive or have yielded inconclusive results.
2. Previous work
In computer vision, histograms are widely used for image description. Koenderink and Van Doorn [?] replaced local image structure with local histograms, essentially discarding the precise location of individual image elements; in this sense histogram images are locally orderless images. For each region of interest (ROI), defined as a Gaussian aperture with a given location and scale, they compute histograms of features over that ROI. The spatial pyramid approach can be considered an alternative way of creating histogram images in which, instead of Gaussian apertures, a fixed hierarchy of rectangular windows is used.
3. Spatial Pyramid Matching
Let H_X^l and H_Y^l denote the histograms of two feature sets X and Y at resolution level l, so that H_X^l(i) is the number of points from X that fall into the i-th of the D cells of the grid at that level. The number of matches at level l is given by the histogram intersection function

I(H_X^l, H_Y^l) = \sum_{i=1}^{D} \min(H_X^l(i), H_Y^l(i))    (1)

Abbreviating I(H_X^l, H_Y^l) as I^l, the pyramid match kernel weights the matches found at each level so that matches found at finer resolutions count more:

\kappa^L(X, Y) = I^L + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} (I^l - I^{l+1}) = \frac{1}{2^L} I^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^l    (2)
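For example, with L = 2 the weights of levels 0, 1, and 2 are 1/4, 1/4, and 1/2 respectively, so matches localized in the finest 4×4 grid contribute the most.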
Both the histogram intersection and the pyramid match kernel are Mercer kernels [5]. This means that evaluating the kernel corresponds to taking an inner product in an implicitly defined feature space, so the kernel can be plugged directly into an SVM for classification.
The kernel is computed separately for each of the M channels (one channel per visual word type), and the channel kernels are summed to give the final kernel

K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m)    (3)

where X_m and Y_m denote the features of channel m in the two images.
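To make the classification step concrete, the following sketch trains an SVM on a precomputed histogram-intersection kernel using scikit-learn. It is only a minimal illustration, not the authors' implementation: the input arrays are placeholders, and each row is assumed to already hold the concatenated, weighted pyramid histogram of one image (the construction of that vector is sketched below).

import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    # Histogram intersection between every row of A and every row of B:
    # K[i, j] = sum_k min(A[i, k], B[j, k]).
    K = np.zeros((A.shape[0], B.shape[0]))
    for i, a in enumerate(A):
        K[i] = np.minimum(a, B).sum(axis=1)
    return K

# Placeholder data: one pyramid histogram vector per image and a class label each.
X_train = np.random.rand(40, 4200)
y_train = np.random.randint(0, 5, 40)
X_test = np.random.rand(10, 4200)

svm = SVC(kernel="precomputed", C=1.0)
svm.fit(intersection_kernel(X_train, X_train), y_train)
predictions = svm.predict(intersection_kernel(X_test, X_train))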
Since we are clustering the features into a visual vocabulary, this method reduces to the standard bag-of-features approach when L = 0. Another important point is that the final kernel can be implemented as a single histogram intersection of long vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions (Fig. 1). This is possible because the pyramid match kernel is a weighted sum of histogram intersections and c·min(a, b) = min(ca, cb) for positive numbers. For L levels and M channels, the resulting vector has dimensionality M \sum_{l=0}^{L} 4^l = \frac{M}{3}(4^{L+1} - 1). With M = 400 channels and L = 3 levels this amounts to a 34,000-dimensional histogram intersection, but since these histograms are sparse, the computation remains efficient.
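The sketch below illustrates this "long vector" construction; it is our own illustration rather than code from [1], and the inputs (pixel coordinates xs and ys, visual-word labels words, vocabulary size M) are hypothetical names for quantities produced by the feature-extraction stage described in the next section.

import numpy as np

def pyramid_histogram(xs, ys, words, img_w, img_h, M, L=2):
    # Concatenated, weighted spatial pyramid histogram of a single image.
    parts = []
    for level in range(L + 1):
        cells = 2 ** level                      # the grid is cells x cells at this level
        # Level weights from eq. (2): 1/2^L for level 0, 1/2^(L - l + 1) otherwise.
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        hist = np.zeros((cells, cells, M))
        col = np.minimum((np.asarray(xs) * cells // img_w).astype(int), cells - 1)
        row = np.minimum((np.asarray(ys) * cells // img_h).astype(int), cells - 1)
        for r, c, m in zip(row, col, words):
            hist[r, c, m] += 1                  # count features per cell and channel
        parts.append(weight * hist.ravel())
    return np.concatenate(parts)                # length M * (4^(L+1) - 1) / 3

Because c·min(a, b) = min(ca, cb), a single histogram intersection of two such vectors equals the weighted sum of per-level intersections in eq. (2).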
Figure 1. Example of constructing a three-level pyramid. The image has three feature types, indicated by circles, diamonds, and
crosses. At the top, the image is subdivided into three different
levels of resolution. Next, for each level of resolution and each
channel, features that fall in each spatial bin are counted. Finally,
each spatial histogram is weighted according to eq. (2)
4. Feature Extraction
The authors use two kinds of features in their experiments. The first kind, dubbed weak features, are oriented edge points, i.e. points whose gradient magnitude in a given direction exceeds a minimum threshold. To create features similar to Torralba's "gist" features [7], the authors extract edge points at two scales and eight orientations, for a total of M = 16 channels.
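A minimal sketch of how such oriented-edge channels could be computed is given below; it is our own illustration, and the smoothing scales and threshold value are assumptions rather than values from [1].

import cv2
import numpy as np

def weak_feature_channels(gray, scales=(1.0, 2.0), n_orient=8, thresh=20.0):
    # For each of two scales and eight orientations, mark the pixels whose
    # directional gradient magnitude exceeds a threshold: M = 16 binary channels.
    channels = []
    for sigma in scales:
        blurred = cv2.GaussianBlur(gray.astype(np.float32), (0, 0), sigma)
        gx = cv2.Sobel(blurred, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(blurred, cv2.CV_32F, 0, 1)
        for k in range(n_orient):
            theta = np.pi * k / n_orient
            directional = gx * np.cos(theta) + gy * np.sin(theta)
            channels.append((np.abs(directional) > thresh).astype(np.uint8))
    return channels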
For better discriminative power, they also use what they dub strong features, which are SIFT descriptors of 16×16 pixel patches computed over a grid with a spacing of 8 pixels. The authors propose using a dense regular grid instead of only interest points, since the comparative evaluation of Fei-Fei and Perona [3] shows that dense features work better for scene classification: they also capture uniform regions such as the sky or calm water.
After this, the authors perform k-means clustering of a random subset of patches from the training set to form a visual vocabulary; typical vocabulary sizes in their experiments are M = 200 and M = 400. Since the results reported for the weak features were not very good, in our work we decided to use only the strong features, i.e. the SIFT descriptors.
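The sketch below shows one possible implementation of this stage (dense SIFT on a regular grid followed by k-means over a random subset of descriptors). It is our own illustration using OpenCV's SIFT and scikit-learn's KMeans rather than the authors' code; only the patch size, grid spacing, and vocabulary size follow the numbers quoted above.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, size=16):
    # SIFT descriptors on a dense grid: one 16x16 patch every 8 pixels.
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(size // 2, gray.shape[0] - size // 2, step)
                 for x in range(size // 2, gray.shape[1] - size // 2, step)]
    keypoints, descriptors = sift.compute(gray, keypoints)
    positions = np.array([kp.pt for kp in keypoints])
    return positions, descriptors

def build_vocabulary(descriptors_per_image, M=200, sample_size=100000, seed=0):
    # k-means over a random subset of all training descriptors.
    stacked = np.vstack(descriptors_per_image)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(stacked), size=min(sample_size, len(stacked)), replace=False)
    return KMeans(n_clusters=M, n_init=4, random_state=seed).fit(stacked[idx])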
5. Authors' Experiments
The authors report results for pyramid levels L = 0 (1×1 grid), 1 (2×2), 2 (4×4), and 3 (8×8). With the weak features, the scene classification rates (%) they report are:

L    Single-level    Pyramid
0    15.5 ± 0.9      -
1    31.4 ± 1.2      32.8 ± 1.3
2    47.2 ± 1.1      49.3 ± 1.4
3    52.2 ± 0.8      54.0 ± 1.1

For the Bikes and People classes they report:

Class     L = 0          L = 2
Bikes     82.4 ± 2.0     86.3 ± 2.5
People    79.5 ± 2.3     82.3 ± 3.1
6. Our Experiments
We tested our implementation on the Caltech-256 [8], which is considered to be harder than the Caltech-101 on which the authors had already tested. This dataset is much bigger (30,608 images) and has many more images per category than the Caltech-101. The Caltech-256 also has more clutter and occlusion, and its images are not left-right aligned, so it does not exhibit the rotation artifacts present in the Caltech-101 (artifacts which inflated recognition rates because they provide stable cues).
For our experiments we use only the strong features described above. We take SIFT descriptors densely over the image, one 16×16 pixel patch at a time, moving the patch by 8 pixels. Like the authors, we take the descriptors over a dense grid and not just at points of interest.
We also resize all the images to 320×320 to eliminate the need for histogram normalization (normalization becomes necessary when histograms are collected over different numbers of patches for images of different sizes). After generating the descriptors, we create a feature space by randomly selecting descriptors from patches of the images; random selection is used because clustering all the descriptors would be prohibitively costly. We then run a clustering algorithm (k-means) on this feature space to obtain a set of visual words, which forms our visual vocabulary. For our experiments we used vocabularies of M = 300 and M = 500 visual words, the larger one in order to check the benefit of having more visual words. After creating the visual words, we assign each feature to the visual word it is closest to, i.e. it receives the label of the cluster it belongs to.
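A minimal sketch of this encoding step is shown below; it is our own illustration, reusing the hypothetical dense_sift and pyramid_histogram helpers from the earlier sketches and a fitted k-means model vocabulary.

import cv2

def encode_image(gray, vocabulary, L=2):
    # Resize, quantize the dense SIFT descriptors, and build the pyramid vector.
    gray = cv2.resize(gray, (320, 320))            # fixed image size, as described above
    positions, descriptors = dense_sift(gray)      # dense 16x16 patches, 8-pixel step
    words = vocabulary.predict(descriptors)        # nearest visual word for each feature
    return pyramid_histogram(positions[:, 0], positions[:, 1], words,
                             img_w=320, img_h=320, M=vocabulary.n_clusters, L=L)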
Our results for the different setups we tried are summarized below.

                                   Level 1        Level 2        Level 3
Images classified correctly        21 / 150       26 / 150       27 / 150
Percentage classified correctly    14%            17.33%         18%
Classes detected                   17 / 50        17 / 50        18 / 50

                                   Level 1        Level 2        Level 3
Images classified correctly        9 / 50         9 / 50         8 / 50
Percentage classified correctly    18%            18%            16%
Classes detected                   9 / 50         9 / 50         8 / 50

                                   Level 1        Level 2        Level 3
Images classified correctly        8 / 15         8 / 15         7 / 15
Percentage classified correctly    53%            53%            46.67%
Classes detected                   4 / 5          3 / 5          3 / 5
100% detection                     Class 3        Classes 4,5    Class 4
No detection                       Classes 2,3    Class 2        Classes 4,5

Table 7. Results for the following setup: Number of Classes: 5, Training Images per Class: 20, Testing Images per Class: 3, Number of Visual Words: 500.

                                   Level 1        Level 2        Level 3
Images classified correctly        10 / 15        7 / 15         6 / 15
Percentage classified correctly    66.67%         46.67%         40%
Classes detected                   4 / 5          4 / 5          3 / 5
100% detection                     Classes 1,2,3  Class 3        None
No detection                       Class 4        Class 4        Classes 4,5
7. Conclusion
The method discussed above is a holistic approach to image categorization. Despite its simplicity it has shown good results compared to methods that construct a structural model for object recognition, and it even outperforms orderless image representation schemes, which is not a trivial accomplishment. It highlights the power of global scene statistics and provides useful discriminative information for categorization.
Figure 2. Training set: the training set used to train the SVM for the baseball bat, baseball glove, and backpack categories.
Figure 3. Test set: the backpack class was always recognized well, while the most difficult test images were the baseball bats (no recognition) and baseball gloves (some low recognition).
References
[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.
[2] G. Griffin, A. Holub, and P. Perona. The Caltech-256. Caltech Technical Report.
[3] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE CVPR Workshop on Generative-Model Based Vision, 2004.
[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. CVPR, volume 2, pages 264-271, 2003.
[5] K. Grauman and T. Darrell. Pyramid match kernels: Discriminative classification with sets of image features. In Proc. ICCV, 2005.