Object Recognition From Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
[email protected]
2. Related research

Object recognition is widely used in the machine vision industry for the purposes of inspection, registration, and manipulation. However, current commercial systems for object recognition depend almost exclusively on correlation-based template matching. While very effective for certain engineered environments, where object pose and illumination are tightly controlled, template matching becomes computationally infeasible when object rotation, scale, illumination, and 3D pose are allowed to vary, and even more so when dealing with partial visibility and large model databases.

An alternative to searching all image locations for matches is to extract features from the image that are at least partially invariant to the image formation process and to match only those features. Many candidate feature types have been proposed and explored, including line segments [6], groupings of edges [11, 14], and regions [2], among many other proposals. While these features have worked well for certain object classes, they are often not detected frequently enough or with sufficient stability to form a basis for reliable recognition.

There has been recent work on developing much denser collections of image features. One approach has been to use a corner detector (more accurately, a detector of peaks in local image variation) to identify repeatable image locations, around which local image properties can be measured. Zhang et al. [23] used the Harris corner detector to identify feature locations for epipolar alignment of images taken from differing viewpoints. Rather than attempting to correlate regions from one image against all possible regions in a second image, large savings in computation time were achieved by only matching regions centered at corner points in each image.

For the object recognition problem, Schmid & Mohr [19] also used the Harris corner detector to identify interest points, and then created a local image descriptor at each interest point from an orientation-invariant vector of derivative-of-Gaussian image measurements. These image descriptors were used for robust object recognition by looking for multiple matching descriptors that satisfied object-based orientation and location constraints. This work was impressive both for the speed of recognition in a large database and the ability to handle cluttered images.

The corner detectors used in these previous approaches have a major failing, which is that they examine an image at only a single scale. As the change in scale becomes significant, these detectors respond to different image points. Also, since the detector does not provide an indication of the object scale, it is necessary to create image descriptors and attempt matching at a large number of scales. This paper describes an efficient method to identify stable key locations in scale space. This means that different scalings of an image will have no effect on the set of key locations selected. Furthermore, an explicit scale is determined for each point, which allows the image description vector for that point to be sampled at an equivalent scale in each image. A canonical orientation is determined at each location, so that matching can be performed relative to a consistent local 2D coordinate frame. This allows for the use of more distinctive image descriptors than the rotation-invariant ones used by Schmid and Mohr, and the descriptor is further modified to improve its stability to changes in affine projection and illumination.

Other approaches to appearance-based recognition include eigenspace matching [13], color histograms [20], and receptive field histograms [18]. These approaches have all been demonstrated successfully on isolated objects or presegmented images, but due to their more global features it has been difficult to extend them to cluttered and partially occluded images. Ohba & Ikeuchi [15] successfully apply the eigenspace approach to cluttered images by using many small local eigen-windows, but this then requires expensive search for each window in a new image, as with template matching.

3. Key localization

We wish to identify locations in image scale space that are invariant with respect to image translation, scaling, and rotation, and are minimally affected by noise and small distortions. Lindeberg [8] has shown that under some rather general assumptions on scale invariance, the Gaussian kernel and its derivatives are the only possible smoothing kernels for scale space analysis.

To achieve rotation invariance and a high level of efficiency, we have chosen to select key locations at maxima and minima of a difference-of-Gaussian function applied in scale space. This can be computed very efficiently by building an image pyramid with resampling between each level. Furthermore, it locates key points at regions and scales of high variation, making these locations particularly stable for characterizing the image. Crowley & Parker [4] and Lindeberg [9] have previously used the difference-of-Gaussian in scale space for other purposes. In the following, we describe a particularly efficient and stable method to detect and characterize the maxima and minima of this function.

As the 2D Gaussian function is separable, its convolution with the input image can be efficiently computed by applying two passes of the 1D Gaussian function in the horizontal and vertical directions:

g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / 2\sigma^2}

For key localization, all smoothing operations are done using σ = √2, which can be approximated with sufficient accuracy using a 1D kernel with 7 sample points.
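As an illustrative sketch (not from the original paper), the sampled kernel and two-pass separable smoothing can be written in Python/NumPy as follows. The function names and the choice to renormalize the truncated 7-tap kernel are ours:

```python
import numpy as np
from scipy.ndimage import convolve1d

def gaussian_kernel_1d(sigma=np.sqrt(2), taps=7):
    """Sample g(x) at `taps` integer offsets and renormalize to sum to 1."""
    x = np.arange(taps) - taps // 2
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return g / g.sum()

def smooth(image, sigma=np.sqrt(2)):
    """Two-pass separable smoothing: 1D Gaussian horizontally, then vertically."""
    k = gaussian_kernel_1d(sigma)
    return convolve1d(convolve1d(image, k, axis=1), k, axis=0)
```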
The input image is first convolved with the Gaussian function using σ = √2 to give an image A. This is then repeated a second time with a further incremental smoothing of σ = √2 to give a new image, B, which now has an effective smoothing of σ = 2. The difference of Gaussian function is obtained by subtracting image B from A, resulting in a ratio of 2/√2 = √2 between the two Gaussians.

To generate the next pyramid level, we resample the already smoothed image B using bilinear interpolation with a pixel spacing of 1.5 in each direction. While it may seem more natural to resample with a relative scale of √2, the only constraint is that sampling be frequent enough to detect peaks. The 1.5 spacing means that each new sample will be a constant linear combination of 4 adjacent pixels. This is efficient to compute and minimizes aliasing artifacts that would arise from changing the resampling coefficients.
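A minimal sketch of one level of this construction, assuming the `smooth` helper above; the bilinear-resampling indexing is our rendering of the 1.5-pixel spacing described in the text:

```python
def dog_level(image):
    """Return the DoG image for this level and the 1.5x-resampled image B
    that seeds the next pyramid level."""
    A = smooth(image)   # effective sigma = sqrt(2)
    B = smooth(A)       # effective sigma = 2
    dog = A - B         # difference of Gaussian (ratio sqrt(2))
    # Bilinear resampling at a pixel spacing of 1.5: each new sample is a
    # constant linear combination of 4 adjacent pixels of B.
    ys = np.arange(0, B.shape[0] - 1, 1.5)
    xs = np.arange(0, B.shape[1] - 1, 1.5)
    y0 = ys.astype(int)[:, None]; fy = (ys % 1.0)[:, None]
    x0 = xs.astype(int)[None, :]; fx = (xs % 1.0)[None, :]
    nxt = ((1 - fy) * (1 - fx) * B[y0, x0] + (1 - fy) * fx * B[y0, x0 + 1] +
           fy * (1 - fx) * B[y0 + 1, x0] + fy * fx * B[y0 + 1, x0 + 1])
    return dog, nxt
```

Applying `dog_level` repeatedly to the returned resampled image builds the full pyramid.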
Maxima and minima of this scale-space function are determined by comparing each pixel in the pyramid to its neighbours. First, a pixel is compared to its 8 neighbours at the same level of the pyramid. If it is a maximum or minimum at this level, then the closest pixel location is calculated at the next lowest level of the pyramid, taking account of the 1.5 times resampling. If the pixel remains higher (or lower) than this closest pixel and its 8 neighbours, then the test is repeated for the level above. Since most pixels will be eliminated within a few comparisons, the cost of this detection is small and much lower than that of building the pyramid.

Figure 1: The second image was generated from the first by rotation, scaling, stretching, change of brightness and contrast, and addition of pixel noise. In spite of these changes, 78% of the keys from the first image have a closely matching key in the second image. These examples show only a subset of the keys to reduce clutter.
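The neighbour-comparison test described above can be sketched as follows; the pyramid ordering (index 0 finest) and the rounding of the closest-pixel coordinates are our assumptions:

```python
def is_extremum(pyr, level, i, j):
    """Test interior pixel (i, j) of DoG level `level` against its 8 neighbours,
    then against the closest pixel and its 8 neighbours at the adjacent levels,
    accounting for the 1.5x resampling between levels."""
    v = pyr[level][i, j]
    nbrs = pyr[level][i - 1:i + 2, j - 1:j + 2]
    hi, lo = v >= nbrs.max(), v <= nbrs.min()
    if not (hi or lo):
        return False
    # level-1 is finer (coordinates grow by 1.5), level+1 coarser (shrink by 1.5)
    for lvl, s in ((level - 1, 1.5), (level + 1, 1 / 1.5)):
        if not 0 <= lvl < len(pyr):
            continue
        ci, cj = int(round(i * s)), int(round(j * s))
        if not (1 <= ci < pyr[lvl].shape[0] - 1 and 1 <= cj < pyr[lvl].shape[1] - 1):
            return False
        n = pyr[lvl][ci - 1:ci + 2, cj - 1:cj + 2]
        if (hi and v <= n.max()) or (lo and v >= n.min()):
            return False   # must remain strictly higher (or lower)
    return True
```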
If the first level of the pyramid is sampled at the same rate as the input image, the highest spatial frequencies will be ignored. This is due to the initial smoothing, which is needed to provide separation of peaks for robust detection. Therefore, we expand the input image by a factor of 2, using bilinear interpolation, prior to building the pyramid. This gives on the order of 1000 key points for a typical 512×512 pixel image, compared to only a quarter as many without the initial expansion.

3.1. SIFT key stability

To characterize the image at each key location, the smoothed image A at each level of the pyramid is processed to extract image gradients and orientations. At each pixel, A_{ij}, the image gradient magnitude, M_{ij}, and orientation, R_{ij}, are computed using pixel differences:

M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}

R_{ij} = \mathrm{atan2}\left(A_{ij} - A_{i+1,j},\; A_{i,j+1} - A_{ij}\right)

The pixel differences are efficient to compute and provide sufficient accuracy due to the substantial level of previous smoothing. The effective half-pixel shift in position is compensated for when determining key location.

Robustness to illumination change is enhanced by thresholding the gradient magnitudes at a value of 0.1 times the maximum possible gradient value. This reduces the effect of a change in illumination direction for a surface with 3D relief, as an illumination change may result in large changes to gradient magnitude but is likely to have less influence on gradient orientation.

Each key location is assigned a canonical orientation so that the image descriptors are invariant to rotation. In order to make this as stable as possible against lighting or contrast changes, the orientation is determined by the peak in a histogram of local image gradient orientations. The orientation histogram is created using a Gaussian-weighted window with σ of 3 times that of the current smoothing scale. These weights are multiplied by the thresholded gradient values and accumulated in the histogram at locations corresponding to the orientation, R_{ij}. The histogram has 36 bins covering the 360 degree range of rotations, and is smoothed prior to peak selection.

The stability of the resulting keys can be tested by subjecting natural images to affine projection, contrast and brightness changes, and addition of noise. The location of each key detected in the first image can be predicted in the transformed image from knowledge of the transform parameters. This framework was used to select the various sampling and smoothing parameters given above, so that maximum efficiency could be obtained while retaining stability to changes.
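A compact rendering of the gradient and canonical-orientation computation described above. The window truncation radius, the circular histogram-smoothing kernel, and the use of the observed maximum as a stand-in for the "maximum possible gradient value" are our simplifications:

```python
def gradients(A):
    """Pixel-difference magnitude M and orientation R for smoothed image A."""
    dv = A[:-1, :-1] - A[1:, :-1]    # A_ij - A_{i+1,j}
    dh = A[:-1, 1:] - A[:-1, :-1]    # A_{i,j+1} - A_ij
    return np.hypot(dv, dh), np.arctan2(dv, dh)

def canonical_orientation(M, R, ci, cj, sigma, nbins=36, thresh=0.1):
    """Peak of a Gaussian-weighted 36-bin orientation histogram around (ci, cj)."""
    sw = 3 * sigma                 # window sigma: 3x current smoothing scale
    r = int(2 * sw)                # truncation radius (our choice)
    cap = thresh * M.max()         # stand-in for 0.1 * max possible gradient
    hist = np.zeros(nbins)
    for i in range(max(ci - r, 0), min(ci + r + 1, M.shape[0])):
        for j in range(max(cj - r, 0), min(cj + r + 1, M.shape[1])):
            w = np.exp(-((i - ci)**2 + (j - cj)**2) / (2 * sw**2))
            b = int(nbins * (R[i, j] % (2 * np.pi)) / (2 * np.pi)) % nbins
            hist[b] += w * min(M[i, j], cap)   # thresholded gradient value
    hist = np.convolve(hist, [0.25, 0.5, 0.25], mode='same')  # smooth before peak
    return 2 * np.pi * np.argmax(hist) / nbins
```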
Figure 1 shows a relatively small number of keys detected over a 2 octave range of only the larger scales (to avoid excessive clutter). Each key is shown as a square, with a line from the center to one side of the square indicating orientation. In the second half of this figure, the image is rotated by 15 degrees, scaled by a factor of 0.9, and stretched by a factor of 1.1 in the horizontal direction. The pixel intensities, in the range of 0 to 1, have 0.1 subtracted from their brightness values and the contrast reduced by multiplication by 0.9. Random pixel noise is then added to give less than 5 bits/pixel of signal. In spite of these transformations, 78% of the keys in the first image had closely matching keys in the second image at the predicted locations, scales, and orientations.

The overall stability of the keys to image transformations can be judged from Figure 2. Each entry in this table is generated by combining the results of 20 diverse test images and summarizes the matching of about 15,000 keys. Each line of the table shows a particular image transformation. The first column gives the percent of keys that have a matching key in the transformed image within σ in location (relative to scale for that key) and a factor of 1.5 in scale. The second column gives the percent that match these criteria as well as having an orientation within 20 degrees of the prediction.

Image transformation              Match %   Ori %
A. Increase contrast by 1.2         89.0    86.6
B. Decrease intensity by 0.2        88.5    85.9
C. Rotate by 20 degrees             85.4    81.0
D. Scale by 0.7                     85.1    80.3
E. Stretch by 1.2                   83.5    76.1
F. Stretch by 1.5                   77.7    65.0
G. Add 10% pixel noise              90.3    88.4
H. All of A, B, C, D, E, G          78.6    71.8

Figure 2: For various image transformations applied to a sample of 20 images, this table gives the percent of keys that are found at matching locations and scales (Match %) and that also match in orientation (Ori %).

4. Local image description

Given a stable location, scale, and orientation for each key, it is now possible to describe the local image region in a manner invariant to these transformations. In addition, it is desirable to make this representation robust against small shifts in local geometry, such as arise from affine or 3D projection. One approach to this is suggested by the response properties of complex neurons in the visual cortex, in which a feature position is allowed to vary over a small region while orientation and spatial frequency specificity are maintained. Edelman, Intrator & Poggio [5] have performed experiments that simulated the responses of complex neurons to different 3D views of computer graphic models, and found that the complex cell outputs provided much better discrimination than simple correlation-based matching. This can be seen, for example, if an affine projection stretches an image in one direction relative to another, which changes the relative locations of gradient features while having a smaller effect on their orientations and spatial frequencies.

This robustness to local geometric distortion can be obtained by representing the local image region with multiple images representing each of a number of orientations (referred to as orientation planes). Each orientation plane contains only the gradients corresponding to that orientation, with linear interpolation used for intermediate orientations. Each orientation plane is blurred and resampled to allow for larger shifts in positions of the gradients.

This approach can be efficiently implemented by using the same precomputed gradients and orientations for each level of the pyramid that were used for orientation selection. For each keypoint, we use the pixel sampling from the pyramid level at which the key was detected. The pixels that fall in a circle of radius 8 pixels around the key location are inserted into the orientation planes. The orientation is measured relative to that of the key by subtracting the key's orientation. For our experiments we used 8 orientation planes, each sampled over a 4×4 grid of locations, with a sample spacing 4 times that of the pixel spacing used for gradient detection. The blurring is achieved by allocating the gradient of each pixel among its 8 closest neighbors in the sample grid, using linear interpolation in orientation and the two spatial dimensions. This implementation is much more efficient than performing explicit blurring and resampling, yet gives almost equivalent results.

In order to sample the image at a larger scale, the same process is repeated for a second level of the pyramid one octave higher. However, this time a 2×2 rather than a 4×4 sample region is used. This means that approximately the same image region will be examined at both scales, so that any nearby occlusions will not affect one scale more than the other. Therefore, the total number of samples in the SIFT key vector, from both scales, is 8×4×4 + 8×2×2, or 160 elements, giving enough measurements for high specificity.
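A sketch of the descriptor accumulation at the detection scale (the 8×4×4 = 128 samples; the octave-higher 2×2 pass is analogous). The mapping from pixel offsets to sample-grid coordinates is our reading of the radius-8 / spacing-4 geometry:

```python
def sift_samples(M, R, ki, kj, key_ori, planes=8, grid=4, spacing=4, radius=8):
    """Insert thresholded gradients around key (ki, kj) into orientation planes
    sampled over a grid, with linear interpolation in orientation and space."""
    desc = np.zeros((planes, grid, grid))
    for i in range(max(ki - radius, 0), min(ki + radius + 1, M.shape[0])):
        for j in range(max(kj - radius, 0), min(kj + radius + 1, M.shape[1])):
            if (i - ki)**2 + (j - kj)**2 > radius**2:
                continue                                # circle of radius 8
            # orientation relative to the key, in plane coordinates
            o = ((R[i, j] - key_ori) % (2 * np.pi)) * planes / (2 * np.pi)
            y = (i - ki) / spacing + (grid - 1) / 2     # sample-grid coordinates
            x = (j - kj) / spacing + (grid - 1) / 2
            o0, y0, x0 = int(o), int(np.floor(y)), int(np.floor(x))
            fo, fy, fx = o - o0, y - y0, x - x0
            # allocate among the 8 closest sample-grid neighbours
            for do, wo in ((0, 1 - fo), (1, fo)):
                for dy, wy in ((0, 1 - fy), (1, fy)):
                    for dx, wx in ((0, 1 - fx), (1, fx)):
                        yy, xx = y0 + dy, x0 + dx
                        if 0 <= yy < grid and 0 <= xx < grid:
                            desc[(o0 + do) % planes, yy, xx] += wo * wy * wx * M[i, j]
    return desc.ravel()   # 128 elements at this scale; both scales give 160
```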
5. Indexing and matching

For indexing, we need to store the SIFT keys for sample images and then identify matching keys from new images. The problem of identifying the most similar keys for high dimen-
sional vectors is known to have high complexity if an ex-
act solution is required. However, a modification of the k-d
tree algorithm called the best-bin-first search method (Beis
& Lowe [3]) can identify the nearest neighbors with high
probability using only a limited amount of computation. To
further improve the efficiency of the best-bin-first algorithm,
the SIFT key samples generated at the larger scale are given
twice the weight of those at the smaller scale. This means
that the larger scale is in effect able to filter the most likely
neighbours for checking at the smaller scale. This also im-
proves recognition performance by giving more weight to
the least-noisy scale. In our experiments, it is possible to
have a cut-off for examining at most 200 neighbors in a
probabilistic best-bin-first search of 30,000 key vectors with
almost no loss of performance compared to finding an exact
solution.
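The two-scale weighting can be folded directly into the key vectors before search. Below is a brute-force nearest-neighbour stand-in (the best-bin-first k-d tree of Beis & Lowe [3] is the efficient version); the assumption that the first 128 elements come from the detection scale and the last 32 from the octave above is ours:

```python
def match_key(query, database):
    """Weighted nearest neighbour over 160-element SIFT key vectors, with the
    32 larger-scale samples given twice the weight of the smaller-scale ones."""
    w = np.concatenate([np.ones(128), np.full(32, 2.0)])   # layout assumption
    d2 = (((database - query) * w)**2).sum(axis=1)          # database: (n, 160)
    return int(np.argmin(d2))
```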
An efficient way to cluster reliable model hypotheses
is to use the Hough transform [1] to search for keys that
agree upon a particular model pose. Each model key in the
database contains a record of the key’s parameters relative
to the model coordinate system. Therefore, we can create
an entry in a hash table predicting the model location, ori-
entation, and scale from the match hypothesis. We use a
bin size of 30 degrees for orientation, a factor of 2 for scale,
and 0.25 times the maximum model dimension for location.
These rather broad bin sizes allow for clustering even in the
presence of substantial geometric distortion, such as due to a
change in 3D viewpoint. To avoid the problem of boundary
effects in hashing, each hypothesis is hashed into the 2 clos-
est bins in each dimension, giving a total of 16 hash table
entries for each hypothesis.

Figure 3: Model images of planar objects are shown in the top row. Recognition results below show model outlines and image keys used for matching.
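A sketch of the pose hashing described above. The bin widths follow the text (30 degrees in orientation, a factor of 2 in scale, 0.25 times the maximum model dimension in location); the tuple-keyed dictionary standing in for the hash table and the handling of orientation wrap-around are our simplifications:

```python
import itertools
import math
from collections import defaultdict

def two_closest_bins(c):
    """The two bins nearest a continuous bin coordinate c."""
    b = math.floor(c)
    return (b, b + 1) if c - b > 0.5 else (b - 1, b)

def pose_entries(ori_deg, log2_scale, x, y, max_dim):
    """16 hash entries for one match hypothesis: 2 closest bins per dimension."""
    coords = (ori_deg / 30.0,            # 30 degree orientation bins
              log2_scale,                # factor-of-2 scale bins
              x / (0.25 * max_dim),      # location bins: 0.25 * max model dimension
              y / (0.25 * max_dim))
    return itertools.product(*(two_closest_bins(c) for c in coords))

def cluster_hypotheses(hypotheses, model_max_dim):
    """Vote each (ori_deg, log2_scale, x, y) hypothesis into its 16 bins; bins
    that collect >= 3 entries go on to least-squares verification."""
    votes = defaultdict(list)
    for hyp in hypotheses:
        for entry in pose_entries(*hyp, model_max_dim):
            votes[entry].append(hyp)
    return sorted(votes.items(), key=lambda kv: -len(kv[1]))
```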
6. Solution for affine parameters

The hash table is searched to identify all clusters of at least 3 entries in a bin, and the bins are sorted into decreasing order of size. Each such cluster is then subject to a verification procedure in which a least-squares solution is performed for the affine projection parameters relating the model to the image.

The affine transformation of a model point [x y]^T to an image point [u v]^T can be written as

\begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} +
\begin{bmatrix} t_x \\ t_y \end{bmatrix}

where the model translation is [t_x t_y]^T and the affine rotation, scale, and stretch are represented by the m_i parameters.

We wish to solve for the transformation parameters, so the equation above can be rewritten as

\begin{bmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ & & \cdots & & & \end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix} =
\begin{bmatrix} u \\ v \\ \vdots \end{bmatrix}

This equation shows a single match, but any number of further matches can be added, with each match contributing two more rows to the first and last matrix. At least 3 matches are needed to provide a solution.

We can write this linear system as Ax = b. The least-squares solution for the parameters x can be determined by solving the corresponding normal equations,

x = [A^T A]^{-1} A^T b

which minimizes the sum of the squares of the distances from the projected model locations to the corresponding image locations.
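In code, the normal-equations solve is a few lines; the helper below is our sketch (np.linalg.lstsq would be the numerically safer route, but the normal equations mirror the formula above). The outlier re-solve described next can simply call this again on the surviving matches:

```python
def solve_affine(model_pts, image_pts):
    """Least-squares affine parameters [m1, m2, m3, m4, tx, ty] from >= 3
    point matches, via the normal equations x = (A^T A)^(-1) A^T b."""
    A, b = [], []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        A.append([x, y, 0, 0, 1, 0])   # u = m1*x + m2*y + tx
        A.append([0, 0, x, y, 0, 1])   # v = m3*x + m4*y + ty
        b.extend([u, v])
    A, b = np.array(A, float), np.array(b, float)
    return np.linalg.solve(A.T @ A, A.T @ b)
```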
This least-squares approach could readily be extended to solving for 3D pose and internal parameters of articulated and flexible objects [12].

Outliers can now be removed by checking for agreement between each image feature and the model, given the parameter solution. Each match must agree within 15 degrees orientation, √2 change in scale, and 0.2 times maximum model size in terms of location. If fewer than 3 points remain after discarding outliers, then the match is rejected. If any outliers are discarded, the least-squares solution is re-solved with the remaining points.

Figure 4: Top row shows model images for 3D objects with outlines found by background segmentation. Bottom image shows recognition results for 3D objects with model outlines and image keys used for matching.

Figure 5: Examples of 3D object recognition with occlusion.

7. Experiments

The affine solution provides a good approximation to perspective projection of planar objects, so planar models provide a good initial test of the approach. The top row of Figure 3 shows three model images of rectangular planar faces of objects. The figure also shows a cluttered image containing the planar objects, and the same image is shown overlayed with the models following recognition. The model keys that are displayed are the ones used for recognition and final least-squares solution. Since only 3 keys are needed for robust recognition, it can be seen that the solutions are highly redundant and would survive substantial occlusion. Also shown are the rectangular borders of the model images, projected using the affine transform from the least-squares solution. These closely agree with the true borders of the planar regions in the image, except for small errors introduced by the perspective projection. Similar experiments have been performed for many images of planar objects, and the recognition has proven to be robust to at least a 60 degree rotation of the object in any direction away from the camera.

Although the model images and affine parameters do not account for rotation in depth of 3D objects, they are still sufficient to perform robust recognition of 3D objects over about a 20 degree range of rotation in depth away from each model view. An example of three model images is shown in the top row of Figure 4.
est match to the correct corresponding key in the second im-
age. Any 3 of these keys would be sufficient for recognition.
While matching keys are not found in some regions where highlights or shadows change (for example, on the shiny top of the camera), in general the keys show good invariance to illumination change.
affine distortions. The large number of features in a typical image allows for robust recognition under partial occlusion in cluttered images. A final stage that solves for affine model parameters allows for more accurate verification and pose determination than in approaches that rely only on indexing.

An important area for further research is to build models from multiple views that represent the 3D structure of objects. This would have the further advantage that keys from multiple viewing conditions could be combined into a single model, thereby increasing the probability of finding matches in new views. The models could be true 3D representations based on structure-from-motion solutions, or could represent the space of appearance in terms of automated clustering and interpolation (Pope & Lowe [17]). An advantage of the latter approach is that it could also model non-rigid deformations.

The recognition performance could be further improved by adding new SIFT feature types to incorporate color, texture, and edge groupings, as well as varying feature sizes and offsets. Scale-invariant edge groupings that make local figure-ground discriminations would be particularly useful at object boundaries where background clutter can interfere with other features. The indexing and verification framework allows for all types of scale and rotation invariant features to be incorporated into a single model representation. Maximum robustness would be achieved by detecting many different feature types and relying on the indexing and clustering to select those that are most useful in a particular image.

References

[1] Ballard, D.H., "Generalizing the Hough transform to detect arbitrary patterns," Pattern Recognition, 13, 2 (1981), pp. 111–122.
[2] Basri, Ronen, and David W. Jacobs, "Recognition using region correspondences," International Journal of Computer Vision, 25, 2 (1996), pp. 141–162.
[3] Beis, Jeff, and David G. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," Conference on Computer Vision and Pattern Recognition, Puerto Rico (1997), pp. 1000–1006.
[4] Crowley, James L., and Alice C. Parker, "A representation for shape based on peaks and ridges in the difference of low-pass transform," IEEE Trans. on Pattern Analysis and Machine Intelligence, 6, 2 (1984), pp. 156–170.
[5] Edelman, Shimon, Nathan Intrator, and Tomaso Poggio, "Complex cells and object recognition," unpublished manuscript, preprint at http://www.ai.mit.edu/~edelman/mirror/nips97.ps.Z
[6] Grimson, Eric, and Tomás Lozano-Pérez, "Localizing overlapping parts by searching the interpretation tree," IEEE Trans. on Pattern Analysis and Machine Intelligence, 9 (1987), pp. 469–482.
[7] Ito, Minami, Hiroshi Tamura, Ichiro Fujita, and Keiji Tanaka, "Size and position invariance of neuronal responses in monkey inferotemporal cortex," Journal of Neurophysiology, 73, 1 (1995), pp. 218–226.
[8] Lindeberg, Tony, "Scale-space theory: A basic tool for analysing structures at different scales," Journal of Applied Statistics, 21, 2 (1994), pp. 224–270.
[9] Lindeberg, Tony, "Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention," International Journal of Computer Vision, 11, 3 (1993), pp. 283–318.
[10] Logothetis, Nikos K., Jon Pauls, and Tomaso Poggio, "Shape representation in the inferior temporal cortex of monkeys," Current Biology, 5, 5 (1995), pp. 552–563.
[11] Lowe, David G., "Three-dimensional object recognition from single two-dimensional images," Artificial Intelligence, 31, 3 (1987), pp. 355–395.
[12] Lowe, David G., "Fitting parameterized three-dimensional models to images," IEEE Trans. on Pattern Analysis and Machine Intelligence, 13, 5 (1991), pp. 441–450.
[13] Murase, Hiroshi, and Shree K. Nayar, "Visual learning and recognition of 3-D objects from appearance," International Journal of Computer Vision, 14, 1 (1995), pp. 5–24.
[14] Nelson, Randal C., and Andrea Selinger, "Large-scale tests of a keyed, appearance-based 3-D object recognition system," Vision Research, 38, 15 (1998), pp. 2469–2488.
[15] Ohba, Kohtaro, and Katsushi Ikeuchi, "Detectability, uniqueness, and reliability of eigen windows for stable verification of partially occluded objects," IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 9 (1997), pp. 1043–1048.
[16] Perrett, David I., and Mike W. Oram, "Visual recognition based on temporal cortex cells: viewer-centered processing of pattern configuration," Zeitschrift für Naturforschung C, 53c (1998), pp. 518–541.
[17] Pope, Arthur R., and David G. Lowe, "Learning probabilistic appearance models for object recognition," in Early Visual Learning, eds. Shree Nayar and Tomaso Poggio (Oxford University Press, 1996), pp. 67–97.
[18] Schiele, Bernt, and James L. Crowley, "Object recognition using multidimensional receptive field histograms," Fourth European Conference on Computer Vision, Cambridge, UK (1996), pp. 610–619.
[19] Schmid, C., and R. Mohr, "Local grayvalue invariants for image retrieval," IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 5 (1997), pp. 530–534.
[20] Swain, M., and D. Ballard, "Color indexing," International Journal of Computer Vision, 7, 1 (1991), pp. 11–32.
[21] Tanaka, Keiji, "Mechanisms of visual object recognition: monkey and human studies," Current Opinion in Neurobiology, 7 (1997), pp. 523–529.
[22] Treisman, Anne M., and Nancy G. Kanwisher, "Perceiving visually presented objects: recognition, awareness, and modularity," Current Opinion in Neurobiology, 8 (1998), pp. 218–226.
[23] Zhang, Z., R. Deriche, O. Faugeras, and Q.T. Luong, "A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry," Artificial Intelligence, 78 (1995), pp. 87–119.