CVPR 2004
Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting
Yann LeCun, Fu Jie Huang, and Leon Bottou
The Courant Institute, New York University
715 Broadway, New York, NY 10003, USA
NEC Labs America, 4 Independence Way, Princeton, NJ 08540
http://yann.lecun.com, http://leon.bottou.org
1 Introduction
The recognition of generic object categories with invariance
to pose, lighting, diverse backgrounds, and the presence of
clutter is one of the major challenges of Computer Vision.
While there have been attempts to detect and recognize objects in natural scenes using a variety of clues, such as color,
texture, the detection of distinctive local features, and the
use of separately acquired 3D models, very few authors
have attacked the problem of detecting and recognizing 3D
objects in images primarily from the shape information.
Even fewer authors have attacked the problem of recognizing generic categories, such as cars, trucks, airplanes,
human figures, or four-legged animals purely from shape
information. The dearth of work in this area is due in part
to the difficulty of the problem, and in large part to the
unavailability of a dataset with sufficient size and diversity to
carry out meaningful experiments.
The first part of this paper describes the NORB dataset,
a large image dataset comprising 97,200 stereo image pairs
of 50 objects belonging to 5 generic categories. 1,944
stereo pairs were collected for each object instance: 9 elevations (30, 35, 40, 45, 50, 55, 60, 65, and 70 degrees from
the horizontal), 36 azimuths (from 0 to 350 degrees, every 10 degrees), and 6
lighting conditions (various on-off combinations of the four
lights). A total of 194,400 RGB images at 640x480 resolution were collected (5 categories, 10 instances, 9 elevations,
36 azimuths, 6 lightings, 2 cameras) for a total of 179GB of
raw data. Note that each object instance was placed in a
different initial pose; a 0-degree azimuth may therefore mean
facing left for one instance of an animal, and facing 30
degrees to the right for another instance.
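To make these combinatorics concrete, the short sketch below enumerates the capture conditions and checks the totals quoted above; the variable names are purely illustrative and do not reflect the dataset's actual file layout:

    from itertools import product

    elevations = range(30, 75, 5)    # 9 elevations, 30..70 degrees
    azimuths   = range(0, 360, 10)   # 36 azimuths, 0..350 degrees
    lightings  = range(6)            # 6 lighting conditions

    conditions = list(product(elevations, azimuths, lightings))
    assert len(conditions) == 1944   # stereo pairs per object instance

    instances = 5 * 10               # 5 categories x 10 instances
    assert len(conditions) * instances == 97200       # stereo pairs
    assert len(conditions) * instances * 2 == 194400  # individual images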
Conversely, learning-based methods operating on raw pixels or low-level local features have been quite successful for
such applications as face detection [24, 18, 12, 7, 25, 21],
but they have yet to be applied successfully to shape-based,
pose-invariant object recognition. One of the central questions addressed in this paper is how methods based on
global templates and methods based on local features compare on invariant shape classification tasks.
In the NORB dataset, the only useful and reliable clue
is the shape of the object, while all the other parameters
that affect the appearance are subject to variation, or are designed to contain no useful clue. Parameters that are subject
to variation are: viewing angles (pose), lighting condition,
position in the image plane, scale, image-plane rotation,
surrounding objects, background texture, contrast, luminance, and camera settings (gain and white balance). Potential clues whose impact was eliminated include: color (all
images were grayscale), and object texture (objects were
painted with a uniform color). For specific object recognition tasks, color and texture information may be helpful,
but for generic shape recognition tasks such information
is a distraction rather than a useful clue. The
image acquisition setup was deliberately designed to reflect
real imaging situations. By preserving natural variabilities
and eliminating irrelevant clues and systematic biases, our
aim was to produce a benchmark in which no hidden regularity that would unfairly advantage some
methods over others can be exploited.
While several datasets of object images have been made
available in the past [11, 22, 19], NORB is considerably
larger than those datasets, and offers more variability, stereo
pairs, and the ability to composite the objects and their cast
shadows onto diverse backgrounds.
Ultimately, practical object recognition systems will
have to be trained on natural images. The value of the
present approach is to allow systematic and objective comparisons of shape classification methods, as well as a way of assessing their invariance properties, and the number of examples required to train them.
2.2 Processing
Training and testing samples were generated so as to carefully remove (or avoid) any potential bias in the data that
might make the task easier than it would be in realistic situations. The object masks and their cast shadows were extracted from the raw images. A scaling factor was determined for each of the 50 object instances by computing the
bounding box of the union of all the object masks for all the
images of that instance. The scaling factor was chosen such
that the largest dimension of the bounding box was 80 pixels. This removed the most obvious systematic bias caused
by the variety of sizes of the objects (e.g. most airplanes
were larger than most human figures in absolute terms).
The segmented and normalized objects were then composited (with their cast shadows) in the center of various 96x96
pixel background images. In some experiments, the locations, scales, image-plane angle, brightness, and contrast
were randomly perturbed during the compositing process.
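A minimal sketch of this normalization and compositing step is given below, assuming the masks and images are NumPy arrays and that the scaled object fits inside the 96x96 crop; it is an illustration of the procedure just described, not the authors' actual code:

    import numpy as np
    from scipy.ndimage import zoom

    def instance_scale_factor(masks, target=80):
        # Bounding box of the union of all object masks of one instance;
        # scale so that the box's largest dimension becomes `target` pixels.
        union = np.any(np.stack(masks), axis=0)
        rows, cols = np.nonzero(union)
        extent = max(rows.max() - rows.min() + 1,
                     cols.max() - cols.min() + 1)
        return target / extent

    def composite(obj, mask, scale, background, size=96):
        # Scale the segmented object, then paste it (where its mask is on)
        # at the center of a size x size background image.
        obj = zoom(obj, scale, order=1)
        mask = zoom(mask.astype(float), scale, order=1) > 0.5
        canvas = background[:size, :size].astype(float).copy()
        r0, c0 = (size - obj.shape[0]) // 2, (size - obj.shape[1]) // 2
        region = canvas[r0:r0 + obj.shape[0], c0:c0 + obj.shape[1]]
        region[mask] = obj[mask]
        return canvas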
2.3 Datasets
Experiments were conducted with four datasets generated
from the normalized object images. The first two datasets
were for pure categorization experiments (a somewhat unrealistic task), while the last two were for simultaneous detection/segmentation/recognition experiments.
All datasets used 5 instances of each category for training and the 5 remaining instances for testing. In the normalized dataset, 972 images of each instance were used: 9
elevations, 18 azimuths (0 to 340 degrees, every 20 degrees), and 6 illuminations, for a total of 24,300 training samples and 24,300
test samples. In the various jittered datasets, each of the
972 images of each instance was used to generate additional examples by randomly perturbing the position ([-3,
+3] pixels), scale (ratio in [0.8, 1.1]), image-plane angle ([-5, +5] degrees), brightness ([-20, +20] shifts of gray levels), and
contrast ([0.8, 1.3] gain) of the objects during the compositing process. Ten drawings of these random parameters were
used to generate training sets, and one or two drawings to
generate test sets; a sampler for these perturbations is sketched below.
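A minimal sketch of such a jitter sampler, assuming the ranges quoted above:

    import random

    def sample_jitter(rng=random):
        # One random drawing of the compositing perturbations.
        return {
            "dx": rng.uniform(-3, 3),            # horizontal shift, pixels
            "dy": rng.uniform(-3, 3),            # vertical shift, pixels
            "scale": rng.uniform(0.8, 1.1),      # scaling ratio
            "angle": rng.uniform(-5, 5),         # image-plane rotation, degrees
            "brightness": rng.uniform(-20, 20),  # gray-level shift
            "contrast": rng.uniform(0.8, 1.3),   # gain
        }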
In the textured and cluttered datasets, the objects were
placed on randomly picked background images. In those
experiments, a 6-th category was added: background images with no objects (results are reported for this 6-way
classification). In the textured set, the backgrounds were
placed at a fixed disparity, akin to a back wall orthogonal to
the camera axis at a fixed distance. In the cluttered datasets, the backgrounds were highly cluttered scenes at random disparities, with randomly placed distractor objects around the periphery.
Figure 1: The 50 object instances in the NORB dataset. The left side contains the training instances and the right side the
testing instances for each of the 5 categories.
3 Experiments
The leading direction can be obtained as the vector u that minimizes the energy
E(u) = sum_i min( ||x_i - u||^2, ||x_i + u||^2 ),
where the x_i are the (centered) input vectors. A quick solution is obtained with online (stochastic) algorithms as discussed in
[2] in the context of the K-Means algorithm. Repeated applications of this method, with projections on the complementary space spanned by the previously obtained directions, yield the first 100 principal components in a few CPU
hours. The first 29 components thus obtained (the left camera portion) are shown in figure 4. The first 95 principal
components were used to produce the 95-dimensional, PCA-derived feature vectors used in some of the experiments below.
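The following NumPy sketch illustrates this kind of stochastic scheme (it is not the authors' exact implementation): each sample is assigned to the nearer of the two tied centroids +u and -u, exactly as in online K-Means, and subsequent directions are obtained by deflation.

    import numpy as np

    def first_direction(X, n_epochs=5, lr=0.01, seed=0):
        # Online two-centroid (+u / -u) update in the style of stochastic
        # K-Means: assign each sample to the nearer centroid, then move u
        # toward the sign-corrected sample.
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        for _ in range(n_epochs):
            for i in rng.permutation(len(X)):
                x = X[i]
                s = 1.0 if u @ x >= 0 else -1.0   # nearer of +u / -u
                u += lr * (s * x - u)
        return u / np.linalg.norm(u)

    def leading_components(X, k=95):
        # Repeated application with projection onto the complement of
        # the directions found so far (deflation).
        X = X - X.mean(axis=0)
        comps = []
        for _ in range(k):
            u = first_direction(X)
            comps.append(u)
            X = X - np.outer(X @ u, u)   # remove the recovered direction
        return np.array(comps)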
normalized-uniform set: 5 classes, centered, unperturbed objects on uniform backgrounds. 24,300 training samples, 24,300 testing samples. See figure 1.

jittered-uniform set: 5 classes, random perturbations, uniform backgrounds. 243,000 training samples (10 drawings) and 24,300 test samples (1 drawing).

jittered-textured set: 6 classes (including one background class), random perturbations, natural background textures at fixed disparity. 291,600 training samples (10 drawings), 58,320 testing samples (2 drawings). See figure 2.

jittered-cluttered set: 6 classes (including one background class), random perturbations, highly cluttered background images at random disparities, and randomly placed distractor objects around the periphery. 291,600 training samples (10 drawings), 58,320 testing samples (2 drawings). See figure 3.
Occlusions of the central object by the distractor occur occasionally, as can be seen in figure 3. Most experiments
were performed in binocular mode (using left and right images), but some were performed in monocular mode. In
monocular experiments, the training set and test set were
composed of all left and right images used in the corresponding binocular experiment. Therefore, while the number of training samples was twice as large, the total amount
of training data was identical. Examples from the jitteredtextured and jittered-cluttered training set are shown in figures 2 and 3.
Figure 2: Some of the 291,600 examples from the jittered-textured training set (left camera images).
Figure 3: Some of the 291,600 examples from the jittered-cluttered training set (left camera images).
Training SVMs directly on the raw binocular images was impractical because of the overwhelming dimension, the number of training samples, and the task complexity. We resorted to using the 95-dimensional, PCA-derived feature vectors, as well as subsampled, monocular versions of the images at 48x48
and 32x32 pixel resolutions.
Ten SVMs were independently trained to classify one
class versus one other class (pairwise classifiers). This
greatly reduces the number of samples that must be examined by each SVM over the more traditional approach of
classifying one class versus all others. During testing, the
sample is sent to all 10 classifiers. Each classifier votes
for one of its attributed categories. The category with the
largest number of votes wins. The number of support vectors per classifier was between 800 and 2000 on PCA-derived inputs (roughly 2x10^6 flops to classify one sample), and between 2000 and 3000 on 32x32 raw images
(roughly 30x10^6 flops to classify one sample). SVMs
could not be trained on the jittered datasets because of the
prohibitive size of the training set.
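A minimal scikit-learn sketch of this pairwise (one-vs-one) scheme follows; the Gaussian-kernel hyperparameters are placeholders, not the values used in the paper:

    from itertools import combinations
    from sklearn.svm import SVC

    def train_pairwise_svms(X, y, classes):
        # One Gaussian-kernel SVM per unordered pair of classes:
        # C(5, 2) = 10 classifiers for the 5 NORB categories.
        models = {}
        for a, b in combinations(classes, 2):
            mask = (y == a) | (y == b)
            models[(a, b)] = SVC(kernel="rbf", gamma="scale").fit(X[mask], y[mask])
        return models

    def predict_by_vote(models, x):
        # Each pairwise classifier votes for one of its two categories;
        # the category with the most votes wins.
        votes = {}
        for clf in models.values():
            c = clf.predict(x.reshape(1, -1))[0]
            votes[c] = votes.get(c, 0) + 1
        return max(votes, key=votes.get)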
Convolutional Networks [7] have been used with great success in various image recognition applications, such as
handwriting recognition and face detection. The reader is referred to [7] for a general discussion of convolutional networks.
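Since the paper's exact architecture is not reproduced here, the following PyTorch sketch is only a structural illustration of a convolutional network over the 2x96x96 binocular inputs; all layer sizes, including the 100-unit penultimate layer echoing the "Conv Net 100" label, are assumptions:

    import torch.nn as nn

    # Illustrative convolutional network for 2x96x96 stereo pairs.
    # Layer sizes are assumptions, not the architecture of the paper.
    net = nn.Sequential(
        nn.Conv2d(2, 8, kernel_size=5), nn.Tanh(),   # -> 8 x 92 x 92
        nn.AvgPool2d(4),                             # -> 8 x 23 x 23
        nn.Conv2d(8, 24, kernel_size=6), nn.Tanh(),  # -> 24 x 18 x 18
        nn.AvgPool2d(3),                             # -> 24 x 6 x 6
        nn.Flatten(),
        nn.Linear(24 * 6 * 6, 100), nn.Tanh(),       # assumed 100-unit layer
        nn.Linear(100, 6),                           # 6-way output (with junk)
    )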
Table 1: Results of the experiments: classification on the uniform-background sets (top), and detection/segmentation/recognition on the jittered sets (bottom).

Classification
exp#  Classifier     Input         Dataset     Test Error
1.0   Linear         raw 2x96x96   norm-unif   30.2%
1.1   K-NN (K=1)     raw 2x96x96   norm-unif   18.4%
1.2   K-NN (K=1)     PCA 95        norm-unif   16.6%
1.3   SVM Gauss      raw 2x96x96   norm-unif   N.C.
1.4   SVM Gauss      raw 1x48x48   norm-unif   13.9%
1.5   SVM Gauss      raw 1x32x32   norm-unif   12.6%
1.6   SVM Gauss      PCA 95        norm-unif   13.3%
1.7   Conv Net 80    raw 2x96x96   norm-unif   6.6%
1.8   Conv Net 100   raw 2x96x96   norm-unif   6.8%
2.0   Linear         raw 2x96x96   jitt-unif   30.6%
2.1   Conv Net 100   raw 2x96x96   jitt-unif   7.1%

Detection/Segmentation/Recognition
exp#  Classifier     Input         Dataset     Test Error
5.1   Conv Net 100   raw 2x96x96   jitt-text   10.6%
6.0   Conv Net 100   raw 2x96x96   jitt-clutt  16.7%
6.2   Conv Net 100   raw 1x96x96   jitt-clutt  39.9%
class     animal  human  plane  truck   car   junk
animal     0.85   0.02   0.01   0.00   0.00   0.11
human      0.01   0.89   0.00   0.00   0.00   0.10
plane      0.01   0.00   0.77   0.02   0.06   0.14
truck      0.03   0.00   0.00   0.84   0.05   0.07
car        0.00   0.00   0.01   0.20   0.69   0.09
junk       0.01   0.02   0.00   0.00   0.00   0.96
Table 2: Confusion matrix on the test set for the binocular convolutional net on the jittered-cluttered database (line
6.0 in the results table). Each row indicates the probability
that the system will classify an object of the given category
into each of the 6 categories. Most errors are false negatives (objects classified as junk), or cars being classified as
trucks.
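As a small illustration, row-normalized confusion entries of this kind can be computed as follows (a generic sketch, not the authors' evaluation code):

    import numpy as np

    def row_normalized_confusion(y_true, y_pred, n_classes=6):
        # m[i, j] = fraction of true-class-i test samples that the
        # classifier assigned to class j; each row sums to 1.
        m = np.zeros((n_classes, n_classes))
        for t, p in zip(y_true, y_pred):
            m[t, p] += 1
        return m / m.sum(axis=1, keepdims=True)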
Acknowledgments
[withheld for anonymity]
Figure 5: Examples of results on natural images. The list of objects found by the monocular convolutional net is displayed
above each sample.
References
[15] M. Pontil and A. Verri. Support vector machines for 3-D object recognition. IEEE Trans. Patt. Anal. Mach. Intell., 20:637-646, 1998.
[16] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In ECCV, May 2002.
[17] S. Roweis. Personal communication, 2003.
[18] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Trans. Patt. Anal. Mach. Intell., 20(1):23-38, January 1998.
[19] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In CVPR, IEEE, 2003.
[20] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Trans. Patt. Anal. Mach. Intell., 19(5):530-535, May 1997.
[21] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In CVPR, IEEE, 2000.
[22] A. Selinger and R. Nelson. Appearance-based object recognition using multiple views. In CVPR, IEEE, 2001.
[23] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7), 2002.
[24] [withheld for anonymity]
[25] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, IEEE, 2001.
[26] M. Weber, M. Welling, and P. Perona. Towards automatic discovery of object categories. In CVPR, IEEE, 2000.