Searching in High-Dimensional Spaces—Index Structures
for Improving the Performance of Multimedia Databases

CHRISTIAN BÖHM
University of Munich, Germany

STEFAN BERCHTOLD
stb ag, Germany

AND

DANIEL A. KEIM
AT&T Research Labs and University of Constance, Germany

During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, and molecular biology. An important research issue in the field of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos. However, in contrast to searching data in a relational database, a content-based retrieval requires the search of similar objects as a basic functionality of the database system. Most of the approaches addressing similarity search use a so-called feature transformation that transforms important properties of the multimedia objects into high-dimensional points (feature vectors). Thus, the similarity search is transformed into a search of points in the feature space that are close to a given query point in the high-dimensional feature space. Query processing in high-dimensional spaces has therefore been a very active research area over the last few years. A number of new index structures and algorithms have been proposed. It has been shown that the new index structures considerably improve the performance in querying large multimedia databases. Based on recent tutorials [Berchtold and Keim 1998], in this survey we provide an overview of the current state of the art in querying multimedia databases, describing the index structures and algorithms for an efficient query processing in high-dimensional spaces. We identify the problems of processing queries in high-dimensional space, and we provide an overview of the proposed approaches to overcome these problems.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; E.1 [Data]: Data Structures; F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; G.1 [Mathematics of Computing]: Numerical Analysis; G.2 [Mathematics of Computing]: Discrete Mathematics; H.2 [Information Systems]: Database Management; H.3 [Information Systems]: Information Storage and Retrieval; H.4 [Information Systems]: Information Systems Applications

General Terms: Algorithms, Design, Measurement, Performance, Theory

Additional Key Words and Phrases: Index structures, indexing high-dimensional data, multimedia databases, similarity search

Authors' addresses: C. Böhm, University of Munich, Institute for Computer Science, Oettingenstr. 67, 80538 München, Germany; email: boehm@informatik.uni-muenchen.de; S. Berchtold, stb ag, Moritzplatz 6, 86150 Augsburg, Germany; email: Stefan.Berchtold@stb-ag.com; D. A. Keim, University of Constance, Department of Computer & Information Science, Box: D 78, Universitätsstr. 10, 78457 Konstanz, Germany; email: keim@informatik.uni-konstanz.de.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 2001 ACM 0360-0300/01/0900-0322 $5.00

ACM Computing Surveys, Vol. 33, No. 3, September 2001, pp. 322–373.

1. INDEXING MULTIMEDIA DATABASES

Multimedia databases are of high importance in many application areas such as geography, CAD, medicine, and molecular biology. Depending on the application, multimedia databases need to have different properties and need to support different types of queries. In contrast to traditional database applications, where point, range, and partial match queries are very important, multimedia databases require a search for all objects in the database that are similar (or complementary) to a given search object. In the following, we describe the notion of similarity queries and the feature-based approach to processing those queries in multimedia databases in more detail.
1.1. Feature-Based Processing of Similarity Queries

An important aspect of similarity queries is the similarity measure. There is no general definition of the similarity measure since it depends on the needs of the application; it is therefore highly application-dependent. Any similarity measure, however, takes two objects as input parameters and determines a positive real number denoting the similarity of the two objects. A similarity measure is therefore a function of the form

δ: Obj × Obj → ℝ₀⁺.

In defining similarity queries, we have to distinguish between two different tasks, which are both important in multimedia database applications: ε-similarity means that we are interested in all objects whose similarity to a given search object is below a given threshold ε, and NN-similarity (nearest neighbor) means that we are only interested in the objects that are the most similar ones with respect to the search object.

Definition 1. (ε-Similarity, Identity). Two objects obj1 and obj2 are called ε-similar if and only if δ(obj2, obj1) < ε. (For ε = 0, the objects are called identical.)

Note that this definition is independent of database applications and just describes a way to measure the similarity of two objects.
Definition 2. (NN-Similarity). Two objects obj1 and obj2 are called NN-similar with respect to a database of objects DB if and only if ∀obj ∈ DB, obj ≠ obj1: δ(obj2, obj1) ≤ δ(obj2, obj).

We are now able to formally define the ε-similarity query and the NN-similarity query.

Definition 3. (ε-Similarity Query, NN-Similarity Query). Given a query object objs, find all objects obj from the database of objects DB that are ε-similar (identical for ε = 0) to objs; that is, determine

{obj ∈ DB | δ(objs, obj) < ε}.

Given a query object objs, find the object(s) obj from the database of objects DB that are NN-similar to objs; that is, determine

{obj ∈ DB | ∀obj′ ∈ DB, obj′ ≠ obj: δ(objs, obj) ≤ δ(objs, obj′)}.
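To make these two query types concrete, both can be evaluated by a simple linear scan over the database. The following C sketch is ours and purely illustrative; the object type and the similarity measure delta() are placeholders for whatever the application provides.

#include <stddef.h>

/* Hypothetical object type and application-supplied similarity measure. */
typedef struct Obj Obj;
extern double delta(const Obj *a, const Obj *b);

/* epsilon-similarity query: collect all objects with delta(q, obj) < eps;
   returns the number of matches written to result[]. */
size_t eps_query(const Obj *db[], size_t n, const Obj *q,
                 double eps, const Obj *result[]) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (delta(q, db[i]) < eps)
            result[m++] = db[i];
    return m;
}

/* NN-similarity query: return one object minimizing delta(q, obj);
   ties are resolved arbitrarily. */
const Obj *nn_query(const Obj *db[], size_t n, const Obj *q) {
    const Obj *best = NULL;
    double best_dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = delta(q, db[i]);
        if (best == NULL || d < best_dist) {
            best = db[i];
            best_dist = d;
        }
    }
    return best;
}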
The solutions currently used to solve similarity search problems are mostly feature-based solutions. The basic idea of feature-based similarity search is to extract important features from the multimedia objects, map the features into high-dimensional feature vectors, and search the database of feature vectors for objects with similar feature vectors (cf. Figure 1).

Fig. 1. Basic idea of feature-based similarity search.

The feature transformation F is defined as the mapping of the multimedia object (obj) into a d-dimensional feature vector

F: Obj → ℝ^d.

The similarity of two objects obj1 and obj2 can now be determined as

δ(obj1, obj2) = δEuclid(F(obj1), F(obj2)).
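To illustrate this composition, the following sketch (ours) computes the object similarity as the Euclidean distance between the feature vectors of two objects; the feature transformation F and the fixed dimensionality D are assumptions standing in for an application-specific extractor such as a color histogram.

#include <math.h>

#define D 16   /* assumed dimensionality of the feature space */

typedef struct Obj Obj;

/* Application-supplied feature transformation F: Obj -> R^d (hypothetical). */
extern void F(const Obj *obj, double feature[D]);

/* delta(obj1, obj2) = deltaEuclid(F(obj1), F(obj2)) */
double delta(const Obj *obj1, const Obj *obj2) {
    double f1[D], f2[D], sum = 0.0;
    F(obj1, f1);
    F(obj2, f2);
    for (int i = 0; i < D; i++)
        sum += (f2[i] - f1[i]) * (f2[i] - f1[i]);
    return sqrt(sum);
}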
Feature-based approaches are used
in many application areas including
molecular biology (for molecule docking)
[Shoichet et al. 1992], information retrieval (for text matching) [Altschul et al.
1990], multimedia databases (for image
retrieval) [Faloutsos et al. 1994; Seidl and
Kriegel 1997], sequence databases (for
subsequence matching) [Agrawal et al.
1993, 1995; Faloutsos et al. 1994], geometric databases (for shape matching)
[Mehrotra and Gary 1993, 1995; Korn
et al. 1996], and so on. Examples of feature
vectors are color histograms [Shawney
and Hafner 1994], shape descriptors
[Mumford 1987; Jagadish 1991; Mehrotra and Gary 1995], Fourier vectors [Wallace and Wintz 1980], text descriptors [Kukich 1992], and so on. The result of the feature transformation is a set of high-dimensional feature vectors. The similarity search now becomes an ε-query or a nearest-neighbor query on the feature vectors in the high-dimensional feature space, which can be handled much more efficiently on large amounts of data than the time-consuming comparison of the search object to all complex multimedia objects in the database. Since the
databases are very large and consist of
millions of data objects with several tens
to a few hundreds of dimensions, it is
essential to use appropriate multidimensional indexing techniques to achieve an
efficient search of the data. Note that the
feature transformation often also involves
complex transformations of the multimedia objects such as feature extraction,
normalization, or Fourier transformation.
Depending on the application, these operations may be necessary to achieve, for example, invariance with respect to a scaling or rotation of the objects. The details
of the feature transformations are beyond
the scope of this survey. For further reading on feature transformations, the interested reader is referred to the literature
[Wallace and Wintz 1980; Mumford 1987;
Jagadish 1991; Kukich 1992; Shawney
and Hafner 1994; Mehrotra and Gary
1995].
For an efficient similarity search it is necessary to store the feature vectors in a high-dimensional index structure and use the index structure to efficiently evaluate the distance metric. The high-dimensional index structure used must efficiently support

—point queries for processing identity queries on the multimedia objects;
—range queries for processing ε-similarity queries; and
—nearest-neighbor queries for processing NN-similarity queries.
Note that instead of using a feature transformation into a vector space, the data can also be directly processed using a metric space index structure. In this case, the user has to provide a metric that corresponds to the properties of the similarity measure. The basic idea of metric indexes is to use the given metric properties to build a tree that can then be used to prune branches in processing the queries. The basic idea of metric index structures is discussed in Section 5. A problem of metric indexes is that they use less information about the data than vector space index structures, which results in poorer pruning and also poorer performance. A nice possibility to improve this situation is the FASTMAP algorithm [Faloutsos and Lin 1995], which maps the metric data into a lower-dimensional vector space and uses a vector space index structure for efficient access to the transformed data.
Due to their practical importance, in this survey we restrict ourselves to vector space index structures. We assume we have some given application-dependent feature transformation that provides a mapping of the multimedia objects into some high-dimensional space. There is quite a large number of index structures that have been developed for efficient query processing in some multidimensional space. In general, the index structures can be classified into two groups: data organizing structures such as R-trees [Guttman 1984; Beckmann et al. 1990] and space organizing structures such as multidimensional hashing [Otoo 1984; Kriegel and Seeger 1986, 1987, 1988; Seeger and Kriegel 1990], GRID-Files [Nievergelt et al. 1984; Hinrichs 1985; Krishnamurthy and Whang 1985; Ouksel 1985; Freeston 1987; Hutflesz et al. 1988b; Kriegel and Seeger 1988], and kd-tree-based methods (the kd-B-tree [Robinson 1981], hB-tree [Lomet and Salzberg 1989, 1990; Evangelidis 1994], and LSDh-tree [Henrich 1998]).
For a comprehensive description of most multidimensional access methods, primarily concentrating on low-dimensional indexing problems, the interested reader is referred to a recent survey presented in Gaede and Günther [1998]. That survey, however, does not tackle the problem of indexing multimedia databases, which requires an efficient processing of nearest-neighbor queries in high-dimensional feature spaces; therefore, that survey does not deal with nearest-neighbor queries and the problems of indexing high-dimensional spaces. In our survey, we focus on the index structures that have been specifically designed to cope with the effects occurring in high-dimensional space. Since hashing- and GRID-File-based methods do not play an important role in high-dimensional indexing, we do not cover them in the survey.¹ The reason why hashing techniques are not used in high-dimensional spaces is the problems that arise in such spaces. To be able to understand these problems in more detail, in the following we discuss some general effects that occur in high-dimensional spaces.
1.2. Effects in High-Dimensional Space

A broad variety of mathematical effects can be observed when one increases the dimensionality of the data space. Interestingly, some of these effects are not of a quantitative but of a qualitative nature. In other words, one cannot grasp these effects by simply extending two- or three-dimensional experience. Rather, one has to think, for example, at least 10-dimensional to even see the effect occurring. Furthermore, some are quite nonintuitive. A few of the effects are of pure mathematical interest, whereas others have severe implications for the performance of multidimensional index structures. Therefore, in the database world, these effects are subsumed under the term "curse of dimensionality." Generally speaking, the problem is that important parameters such as volume and area depend exponentially on the number of dimensions of the data space. Therefore,
¹ The only exception to this is a technique for searching approximate nearest neighbors in high-dimensional spaces that has been proposed in Gionis et al. [1999] and Ouksel et al. [1992].

most index structures proposed so far operate efficiently only if the number of dimensions is fairly small. The effects are nonintuitive because we are used to dealing with three-dimensional spaces in the real world, but these effects do not occur in low-dimensional spaces. Many people even have trouble understanding spatial relations in three-dimensional spaces; certainly, no one can "imagine" an eight-dimensional space. Rather, we always try to find a low-dimensional analogy when dealing with such spaces. Note that there actually is no formal notion of a "high"-dimensional space. Nevertheless, when people speak about high-dimensional, they usually mean a dimensionality of about 10 to 16, or at least 5 or 6.
Next, we list the most relevant effects and try to classify them:

—pure geometric effects concerning the surface and volume of (hyper)cubes and (hyper)spheres:
  —the volume of a cube grows exponentially with increasing dimension (and constant edge length),
  —the volume of a sphere grows exponentially with increasing dimension, and
  —most of the volume of a cube is very close to the (d − 1)-dimensional surface of the cube;
—effects concerning the shape and location of index partitions:
  —a typical index partition in high-dimensional spaces will span the majority of the data space in most dimensions and only be split in a few dimensions,
  —a typical index partition will not be cubic; rather it will "look" like a rectangle,
  —a typical index partition touches the boundary of the data space in most dimensions, and
  —the partitioning of space gets coarser the higher the dimension;
—effects arising in a database environment (e.g., selectivity of queries):
  —assuming uniformity, a reasonably selective range query corresponds to a hypercube having a huge extension in each dimension, and
  —assuming uniformity, a reasonably selective nearest-neighbor query corresponds to a hypersphere having a huge radius in each dimension; usually this radius is even larger than the extension of the data space in each dimension.

Fig. 2. Spheres in high-dimensional spaces.
To be more precise, we present some of the listed effects in more depth and detail in the rest of the section.

To demonstrate how much we stick to our understanding of low-dimensional spaces, consider the following lemma. Consider a cubic-shaped d-dimensional data space of extension [0, 1]^d. We define the centerpoint c of the data space as the point (0.5, . . . , 0.5). The lemma, "Every d-dimensional sphere touching (or intersecting) the (d − 1)-dimensional boundaries of the data space also contains c," is obviously true for d = 2, as one can take from Figure 2. Spending some more effort and thinking, we are able to also prove the lemma for d = 3. However, the lemma is definitely false for d = 16, as the following counterexample shows. Define a sphere around the point p = (0.3, . . . , 0.3). This point p has a Euclidean distance of √(d · 0.2²) = 0.8 from the centerpoint. If we define the sphere around p with a radius of 0.7, the sphere will touch (or intersect) all 15-dimensional surfaces of the space. However, the centerpoint is not included in the sphere. We have to be aware of the fact that effects like this are not only nice mathematical properties but also lead to severe conclusions for the performance of index structures.
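The counterexample is easy to check numerically. The following small program (ours, for illustration only) verifies for d = 16 that the sphere around p = (0.3, . . . , 0.3) with radius 0.7 reaches every boundary of the data space while the centerpoint lies outside the sphere.

#include <math.h>
#include <stdio.h>

int main(void) {
    const int d = 16;
    const double p = 0.3, r = 0.7;

    /* The distance from p to the boundary hyperplanes x_i = 0 and x_i = 1
       is 0.3 and 0.7, respectively; both are <= r, so the sphere touches
       (or intersects) all (d-1)-dimensional surfaces. */
    int touches_all = (p <= r) && (1.0 - p <= r);

    /* Euclidean distance from p to the centerpoint c = (0.5, ..., 0.5). */
    double dist_center = sqrt(d * (0.5 - p) * (0.5 - p));

    printf("touches all boundaries: %s\n", touches_all ? "yes" : "no");
    printf("distance to center: %.3f (> r = %.1f, so c lies outside)\n",
           dist_center, r);
    return 0;
}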
Fig. 3. Space partitioning in high-dimensional spaces.

The most basic effect is the exponential growth of volume. The volume of a cube in a d-dimensional space is given by the formula vol = e^d, where d is the dimension of the data space and e is the edge length of the cube. Now if the edge length is a number between 0 and 1, the volume of the cube will decrease exponentially when increasing the dimension. Viewing the problem from the opposite side, if we want to define a cube of constant volume for increasing dimensions, the appropriate edge length will quickly approach 1. For example, in a 2-dimensional space of extension [0, 1]^d, a cube of volume 0.25 has an edge length of 0.5, whereas in a 16-dimensional space the edge length has to be 0.25^(1/16) ≈ 0.917.

The exponential growth of the volume has a serious impact on conventional index structures. Space-organizing index structures, for example, suffer from the "dead space" indexing problem. Since space-organizing techniques index the whole domain space, a query window may overlap part of the space belonging to a page that actually contains no points at all.
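The following short computation (ours) tabulates the edge length e = vol^(1/d) that a cube of fixed volume 0.25 must have for growing dimensionality, showing how quickly e approaches 1.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double vol = 0.25;
    /* Edge length of a d-dimensional cube of volume vol: e = vol^(1/d). */
    for (int d = 2; d <= 32; d *= 2)
        printf("d = %2d: e = %.3f\n", d, pow(vol, 1.0 / d));
    return 0;
}
/* Prints e = 0.500, 0.707, 0.841, 0.917, 0.958 for d = 2, 4, 8, 16, 32. */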
Another important issue is the space partitioning one can expect in high-dimensional spaces. Usually, index structures split the data space using (d − 1)-dimensional hyperplanes; for example, in order to perform a split, the index structure selects a dimension (the split dimension) and a value in this dimension (the split value). All data items having a value in the split dimension smaller than the split value are assigned to the first partition, whereas the other data items form the second partition. This process of splitting the data space continues recursively until the number of data items in a partition is below a certain threshold and the data items of this partition are stored in a data page. Thus, the whole process can be described by a binary tree, the split tree. As the tree is a binary tree, the height h of the split tree usually depends logarithmically on the number of leaf nodes, that is, data pages. On the other hand, the number d′ of splits for a single data page is on average

d′ = log₂(N / Ceff(d)),

where N is the number of data items and Ceff(d) is the capacity of a single data page.² Thus, we can conclude that if all dimensions are equally used as split dimensions, a data page has been split at most once or twice in each dimension and therefore spans a range between 0.25 and 0.5 in each of the dimensions (for uniformly distributed data). From that, we may conclude that the majority of the data pages are located at the surface of the data space rather than in the interior. In addition, this obviously leads to a coarse data space partitioning in single dimensions.
However, from our understanding of index structures such as the R*-tree, which had been designed for geographic applications, we are used to very fine partitions where the majority of the data pages are in the interior of the space, and we have to be careful not to apply this understanding to high-dimensional spaces. Figure 3 depicts the different configurations. Note that this effect applies to almost any index structure proposed so far because we only made assumptions about the split algorithm.

² For most index structures, the capacity of a single data page depends on the dimensionality since the number of entries decreases with increasing dimension due to the larger size of the entries.
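Plugging concrete numbers into the formula for d′ above illustrates the effect; the values for N, Ceff, and d in this sketch (ours) are arbitrary assumptions.

#include <math.h>
#include <stdio.h>

int main(void) {
    double N = 1e8;        /* assumed number of data items */
    double C_eff = 30.0;   /* assumed effective data page capacity */
    int d = 16;            /* assumed dimensionality */

    /* d' = log2(N / Ceff(d)): average number of splits for a data page. */
    double splits = log2(N / C_eff);
    printf("d' = %.1f splits in total, %.2f splits per dimension\n",
           splits, splits / d);
    return 0;
}
/* Roughly 21.7 splits in total, i.e., only about 1.4 splits per dimension. */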
Additionally, not only do index structures show a strange behavior in high-dimensional spaces, but the expected distribution of the queries is also affected by the dimensionality of the data space. If we assume a uniform data distribution, the selectivity of a query (the fraction of data items contained in the query) directly depends on the volume of the query. In the case of nearest-neighbor queries, the query affects a sphere around the query point that contains exactly one data item, the NN-sphere. According to Berchtold et al. [1997b], the radius of the NN-sphere increases rapidly with increasing dimension. In a data space of extension [0, 1]^d, it quickly reaches a value larger than 1 when increasing d. This is a consequence of the above-mentioned exponential relation of extension and volume in high-dimensional spaces.
Considering all these effects, we can conclude that if one builds an index structure using a state-of-the-art split algorithm, the performance will deteriorate rapidly when increasing the dimensionality of the data space. This has been realized not only in the context of multimedia systems, where nearest-neighbor queries are most relevant, but also in the context of data warehouses, where range queries are the most frequent type of query [Berchtold et al. 1998a, b]. Theoretical results based on cost models for index-based nearest-neighbor and range queries also confirm the degeneration of the query performance [Yao and Yao 1985; Berchtold et al. 1997b, 2000b; Beyer et al. 1999]. Other relevant cost models proposed before include Friedman et al. [1997], Cleary [1979], Eastman [1981], Sproull [1991], Pagel et al. [1993], Arya et al. [1995], Arya [1995], Theodoridis and Sellis [1996], and Papadopoulos and Manolopoulos [1997b].
1.3. Basic Definitions

Before we proceed, we need to introduce some notions and formalize our problem description. In this section we define our notion of the database and develop a twofold orthogonal classification for various neighborhood queries. Neighborhood queries can either be classified according to the metric that is applied to determine distances between points or according to the query type. Any combination of metrics and query types is possible.
1.3.1. Database. We assume that in our similarity search application, objects are feature-transformed into points of a vector space with fixed finite dimension d. Therefore, a database DB is a set of points in a d-dimensional data space DS. The data space DS is a subset of ℝ^d. Usually, analytical considerations are simplified if the data space is restricted to the unit hypercube DS = [0..1]^d.

Our database is completely dynamic. That means insertions of new points and deletions of points are possible and should be handled efficiently. The number of point objects currently stored in our database is abbreviated as n. We note that the notion of a point is ambiguous. Sometimes, we mean a point object (i.e., a point stored in the database). In other cases, we mean a point in the data space (i.e., a position), which is not necessarily stored in DB. The most common example of the latter is the query point. From the context, the intended meaning of the notion point is always obvious.

Definition 4 (Database). A database DB is a set of n points in a d-dimensional data space DS,

DB = {P0, . . . , Pn−1},  Pi ∈ DS, i = 0..n − 1, DS ⊆ ℝ^d.

1.3.2. Vector Space Metrics. All neighborhood queries are based on the notion of the distance between two points P and Q in the data space. Depending on the application to be supported, several metrics to define distances are applied. Most common is the Euclidean metric L2 defining the usual Euclidean distance function:

δEuclid(P, Q) = √( Σ i=0..d−1 (Qi − Pi)² ).
Fig. 4. Metrics for data spaces.

But other Lp metrics such as the Manhattan metric (L1) or the maximum metric (L∞) are also widely applied:

δManhattan(P, Q) = Σ i=0..d−1 |Qi − Pi|
δMax(P, Q) = max i=0..d−1 {|Qi − Pi|}.
Queries using the L2 metric are (hyper)sphere shaped. Queries using the maximum metric or Manhattan metric are hypercubes and rhomboids, respectively (cf. Figure 4). If additional weights w0, . . . , wd−1 are assigned to the dimensions, then we define weighted Euclidean or weighted maximum metrics that correspond to axis-parallel ellipsoids and axis-parallel hyperrectangles:

δW.Euclid(P, Q) = √( Σ i=0..d−1 wi · (Qi − Pi)² )
δW.Max(P, Q) = max i=0..d−1 {wi · |Qi − Pi|}.

Arbitrarily rotated ellipsoids can be defined using a positive definite similarity matrix W. This concept is used for adaptable similarity search [Seidl 1997]:

δ²ellipsoid(P, Q) = (P − Q)ᵀ · W · (P − Q).

1.3.3. Query Types. The first classification of queries is according to the vector space metric defined on the feature space. An orthogonal classification is based on the question of whether the user defines a region of the data space or an intended size of the result set.

Point Query. The most simple query type is the point query. It specifies a point in the data space and retrieves all point objects in the database with identical coordinates:

PointQuery(DB, Q) = {P ∈ DB | P = Q}.

A simplified version of the point query determines only the Boolean answer, whether the database contains an identical point or not.
Range Query. In a range query, a query point Q, a distance r, and a metric M are specified. The result set comprises all points P from the database that have a distance smaller than or equal to r from Q according to metric M:

RangeQuery(DB, Q, r, M) = {P ∈ DB | δM(P, Q) ≤ r}.

Point queries can also be considered as range queries with a radius r = 0 and an arbitrary metric M. If M is the Euclidean metric, then the range query defines a hypersphere in the data space, from which all points in the database are retrieved. Analogously, the maximum metric defines a hypercube.
Window Query. A window query specifies a rectangular region in data space, from which all points in the database are selected. The specified hyperrectangle is always parallel to the axes ("window"). We regard the window query as a region query around the centerpoint of the window using a weighted maximum metric, where the weights wi represent the inverses of the side lengths of the window.
Nearest-Neighbor Query. The range query and its special cases (point query and window query) have the disadvantage that the size of the result set is previously unknown. A user specifying the radius r may have no idea how many results the query may produce. Therefore, it is likely that the user falls into one of two extremes: either getting no answers at all, or getting almost all database objects as answers. To overcome this drawback, it is common to define similarity queries with a defined result set size, the nearest-neighbor queries.

The classical nearest-neighbor query returns exactly one point object as result, the object with the lowest distance to the query point among all points stored in the database.³ The only exception from this one-answer rule is due to tie effects. If several points in the database have the same (minimal) distance, then our first definition allows more than one answer:

NNQueryDeterm(DB, Q, M) = {P ∈ DB | ∀P′ ∈ DB: δM(P, Q) ≤ δM(P′, Q)}.

A common solution avoiding the exception to the one-answer rule uses nondeterminism. If several points in the database have minimal distance from the query point Q, an arbitrary point from the result set is chosen and reported as the answer. We follow this approach:

NNQuery(DB, Q, M) = SOME{P ∈ DB | ∀P′ ∈ DB: δM(P, Q) ≤ δM(P′, Q)}.
K-Nearest-Neighbor Query. If a user does not want only one closest point as the answer to her query, but rather a natural number k of closest points, she will perform a k-nearest-neighbor query. Analogously to the nearest-neighbor query, the k-nearest-neighbor query selects k points from the database such that no point among the remaining points in the database is closer to the query point than any of the selected points. Again, we have the problem of ties, which can be solved either by nondeterminism or by allowing more than k answers in this special case:

kNNQuery(DB, Q, k, M) = {P0 . . . Pk−1 ∈ DB | ¬∃P′ ∈ DB \ {P0 . . . Pk−1} ∧ ¬∃i, 0 ≤ i < k: δM(Pi, Q) > δM(P′, Q)}.

³ A recent extension of nearest-neighbor queries are closest pair queries, which are also called distance joins [Hjaltason and Samet 1998; Corral et al. 2000]. This query type is mainly important in the area of spatial databases; therefore, closest pair queries are beyond the scope of this survey.
A variant of k-nearest-neighbor queries is ranking queries, which do not require the user to specify a range in the data space or a result set size. The first answer of a ranking query is always the nearest neighbor. Then, the user has the possibility of asking for further answers. Upon this request, the second nearest neighbor is reported, then the third, and so on. The user decides after examining an answer whether further answers are needed. Ranking queries can be especially useful in the filter step of a multistep query processing environment. Here, the refinement step usually takes the decision whether the filter step has to produce further answers.

Approximate Nearest-Neighbor Query. In approximate nearest-neighbor queries and approximate k-nearest-neighbor queries, the user also specifies a query point and a number k of answers to be reported. In contrast to exact nearest-neighbor queries, the user is not interested exactly in the closest points, but wants only points that are not much farther away from the query point than the exact nearest neighbor. The degree of inexactness can be specified by an upper bound on how much farther away the reported answers may be compared to the exact nearest neighbors. The inexactness can be used for efficiency improvement of query processing.
1.4. Query Evaluation Without Index

All query types introduced in the previous section can be evaluated by a single scan of the database. As we assume that our database is densely stored in a contiguous block on secondary storage, all queries can be evaluated using a so-called sequential scan, which is faster than the access of small blocks spread over wide parts of secondary storage.
The sequential scan works as follows: the database is read in very large blocks, determined by the amount of main memory available to query processing. After reading a block from disk, the CPU processes it and extracts the required information. After a block is processed, the next block is read in. Note that we assume that there is no parallelism between CPU and disk I/O for any query processing technique presented in this article.

Furthermore, we do not assume any additional information to be stored in the database. Therefore, the database has the size in bytes:

sizeof(DB) = d · n · sizeof(float).

The cost of query processing based on the sequential scan is proportional to the size of the database in bytes.
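A range query evaluated by the sequential scan then reduces to one pass over a flat array of floats. The sketch below is ours; it assumes the database already resides in one contiguous memory block after the large block reads described above, and that a metric like those in Section 1.3.2 (adapted to float input) is supplied.

#include <stddef.h>

typedef double (*Metric)(const float *p, const float *q, int d);

/* Scan n contiguous d-dimensional float vectors and report the indices of
   all points with dist(p, q) <= r under the given metric. */
size_t scan_range_query(const float *db, size_t n, int d,
                        const float *q, double r, Metric dist,
                        size_t *result_idx) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (dist(db + i * d, q, d) <= r)
            result_idx[m++] = i;
    return m;
}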
1.5. Overview

The rest of the survey is organized as follows. We start with describing the common principles of multidimensional index structures and the algorithms used to build the indexes and process the different query types. Then we provide a systematic overview of the querying and indexing techniques that have been proposed for high-dimensional data spaces, describing them in a uniform way and discussing their advantages and drawbacks. Rather than describing the details of all the different approaches, we try to focus on the basic concepts and algorithms used. We also cover a number of recently proposed techniques dealing with optimization and parallelization issues. In concluding the survey, we try to stir up further research activities by presenting a number of interesting research problems.
2. COMMON PRINCIPLES OF HIGH-DIMENSIONAL INDEXING METHODS

2.1. Structure

High-dimensional indexing methods are based on the principle of hierarchical clustering of the data space. Structurally, they resemble the B+-tree [Bayer and McCreight 1977; Comer 1979]: the data vectors are stored in data nodes such that spatially adjacent vectors are likely to reside in the same node. Each data vector is stored in exactly one data node; that is, there is no object duplication among data nodes. The data nodes are organized in a hierarchically structured directory. Each directory node points to a set of subtrees. Usually, the structure of the information stored in data nodes is completely different from that of the directory nodes. In contrast, the directory nodes are uniformly structured among all levels of the index and consist of (key, pointer)-tuples. The key information is different for different index structures. For B-trees, for example, the keys are ranges of numbers, and for an R-tree the keys are bounding boxes. There is a single directory node, which is called the root node. It serves as an entry point for query and update processing. The index structures are height-balanced. That means the lengths of the paths between the root and all data pages are identical, but may change after insert or delete operations. The length of a path from the root to a data page is called the height of the index structure. The length of the path from a random node to a data page is called the level of the node. Data pages are on level zero. See Figure 5.
Fig. 5. Hierarchical index structures.

The uniform (key, pointer)-structure of the directory nodes also allows an implementation of a wide variety of index structures as extensions of a generic index structure, as done in the generalized search tree [Hellerstein et al. 1995]. The generalized search tree (GiST) provides a nice framework for a fast and reliable implementation of search trees. The main requirement for defining a new index structure in GiST is to define the keys and provide an implementation of four basic methods needed for building and searching the tree (cf. Section 3). Additional methods may be defined to enhance the performance of the index, which is especially relevant for similarity or nearest-neighbor queries [Aoki 1998]. An advantage of GiST is that the basic data structures and algorithms as well as main portions of the concurrency and recovery code can be reused. It is also useful as a basis for theoretical analysis of indexing schemes [Hellerstein et al. 1997]. A recent implementation in a commercial object-relational system shows that GiST-based implementations of index structures can provide a competitive performance while considerably reducing the implementation efforts [Kornacker 1999].
2.2. Management

The high-dimensional access methods are designed primarily for secondary storage. Data pages have a data page capacity Cmax,data, defining how many data vectors can be stored in a data page at most. Analogously, the directory page capacity Cmax,dir gives an upper limit to the number of subnodes in each directory node. The original idea was to choose Cmax,data and Cmax,dir such that data and directory nodes fit exactly into the pages of secondary storage. However, in modern operating systems, the page size of a disk drive is considered a hardware detail hidden from programmers and users. Despite that, consecutive reading of contiguous data on disk is by orders of magnitude less expensive than reading at random positions. It is a good compromise to read data contiguously from disk in portions between a few kilobytes and a few hundred kilobytes. This is a kind of artificial paging with a user-defined logical page size. How to properly choose this logical page size is investigated in Sections 3 and 4. The logical page sizes for data and directory nodes are constant for most of the index structures presented in this section. The only exceptions are the X-tree and the DABS-tree. The X-tree defines a basic page size and allows directory pages to extend over multiples of the basic page size. This concept is called supernode (cf. Section 6.2). The DABS-tree is an indexing structure giving up the requirement of a constant blocksize. Instead, an optimal blocksize is determined individually for each page during creation of the index. This dynamic adaptation of the block size gives the DABS-tree [Böhm 1998] its name.
All index structures presented here are dynamic: they allow insert and delete operations in O(log n) time. To cope with dynamic insertions, updates, and deletes, the index structures allow data and directory nodes to be filled under their capacity Cmax. In most index structures the rule is applied that all nodes up to the root node must be filled to about 40% at least. This threshold is called the minimum storage utilization sumin. For obvious reasons, the root is generally allowed to violate this rule.

For B-trees, it is possible to analytically derive an average storage utilization, referred to in the following as the effective storage utilization sueff. In contrast, for high-dimensional index structures, the effective storage utilization is influenced by the specific heuristics applied in insert and delete processing. Since these indexing methods are not amenable to an analytical derivation of the effective storage utilization, it usually has to be determined experimentally.⁴

For comfort, we denote the product of the capacity and the effective storage utilization as the effective capacity Ceff of a page:

Ceff,data = sueff,data · Cmax,data,
Ceff,dir = sueff,dir · Cmax,dir.

⁴ For the hB-tree, it has been shown in Lomet and Salzberg [1990] that under certain assumptions the average storage utilization is 67%.

Fig. 6. Corresponding page regions of an indexing structure.
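As a back-of-the-envelope example (ours, with assumed numbers), the effective capacity of a data page can be computed as follows.

#include <stdio.h>

int main(void) {
    int page_size = 32 * 1024;            /* assumed logical page size (bytes) */
    int d = 16;                           /* assumed dimensionality */
    int entry = d * (int)sizeof(float);   /* bytes per data vector */
    int c_max = page_size / entry;        /* Cmax,data */
    double su_eff = 0.7;                  /* assumed effective storage utilization */

    printf("Cmax,data = %d, Ceff,data = %.0f\n", c_max, su_eff * c_max);
    return 0;
}
/* Cmax,data = 512, Ceff,data = 358 for these assumed values. */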
2.3. Regions

For efficient query processing it is important that the data are well clustered into the pages; that is, that data objects which are close to each other are likely to be stored in the same data page. Assigned to each page is a so-called page region, which is a subset of the data space (see Figure 6). The page region can be a hypersphere, a hypercube, a multidimensional cuboid, a multidimensional cylinder, or a set-theoretical combination (union, intersection) of several of the above. For most, but not all, high-dimensional index structures, the page region is a contiguous, solid, convex subset of the data space without holes. For most index structures, regions of pages in different branches of the tree may overlap, although overlaps lead to bad performance behavior and are avoided if possible or at least minimized.

The regions of hierarchically organized pages must always be completely contained in the region of their parent. Analogously, all data objects stored in a subtree are always contained in the page region of the root page of the subtree. The page region is always a conservative approximation for the data objects and the other page regions stored in a subtree.

In query processing, the page region is used to cut branches of the tree from
further processing. For example, in the case of range queries, if a page region does not intersect with the query range, it is impossible for any region of a hierarchically subordered page to intersect with the query range. Neither is it possible for any data object stored in this subtree to intersect with the query range. Only pages whose corresponding page region intersects with the query have to be investigated further. Therefore, a suitable algorithm for range query processing can guarantee that no false drops occur.

For nearest-neighbor queries a related but slightly different property of conservative approximations is important. Here, distances to a query point have to be determined or estimated. It is important that distances to approximations of point sets are never greater than the distances to the regions of subordered pages and never greater than the distances to the points stored in the corresponding subtree. This is commonly referred to as the lower bounding property.

Page regions always have a representation that is an invertible mapping between the geometry of the region and a set of values storable in the index. For example, spherical regions can be represented as centerpoint and radius using d + 1 floating point values, if d is the dimension of the data space. For efficient query processing it is necessary that the test for intersection with a query region and the distance computation to the query point in the case of nearest-neighbor queries can be performed efficiently.
Both the geometry and the representation of the page regions must be optimized. If the geometry of the page region is suboptimal, the probability increases that the corresponding page has to be accessed more frequently. If the representation of the region is unnecessarily large, the index itself gets larger, yielding worse efficiency in query processing, as we show later.
3. BASIC ALGORITHMS

In this section, we present some basic algorithms on high-dimensional index structures for index construction and maintenance in a dynamic environment, as well as for query processing. Although some of the algorithms are published using a specific indexing structure, they are presented here in a more general way.
3.1. Insert, Delete, and Update

Insert, delete, and update are the operations that are most specific to the corresponding index structures. Despite that, there are basic algorithms capturing all actions common to all index structures. In the GiST framework [Hellerstein et al. 1995], the buildup of the tree via the insert operation is handled using three basic operations: Union, Penalty, and PickSplit. The Union operation consolidates information in the tree and returns a new key that is true for all data items in the considered subtree. The Penalty operation is used to find the best path for inserting a new data item into the tree by providing a number representing how bad an insertion into that path would be. The PickSplit operation is used to split a data page in case of an overflow.
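In C, the GiST extension interface can be pictured roughly as follows. This is our sketch with invented type names, not the actual GiST API; Consistent, the fourth basic method, is used for searching and is discussed in Section 3.2.

/* An index is specialized by supplying a key type and these methods. */
typedef struct Key Key;     /* e.g., a bounding box for R-tree-like trees */
typedef struct Entry { Key key; void *child; } Entry;

typedef struct GistMethods {
    /* Key that is true for (covers) all entries of a subtree. */
    Key    (*Union)(const Entry entries[], int n);
    /* How bad would it be to insert new_key below this entry? */
    double (*Penalty)(const Key *existing, const Key *new_key);
    /* Distribute the entries of an overflowing node onto two nodes. */
    void   (*PickSplit)(Entry all[], int n,
                        Entry left[], int *nl, Entry right[], int *nr);
    /* May the subtree under key contain answers to the query? */
    int    (*Consistent)(const Key *key, const void *query);
} GistMethods;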
The insertion and delete operations of tree structures are usually the most critical operations, heavily determining the structure of the resulting index and the achievable performance. Some index structures require for a simple insert the propagation of changes towards the root or down the children, as, for example, in the cases of the R-tree and kd-B-tree, and some do not, as, for example, the hB-tree. In the latter case, the insert/delete operations are called local operations, whereas in the first case, they are called nonlocal operations. Inserts are generally handled as follows (a generic sketch is given after this list).

—Search a suitable data page dp for the data object do.
—Insert do into dp.
—If the number of objects stored in dp exceeds Cmax,data, then split dp into two data pages.
—Replace the old description (the representation of the region and the background storage address) of dp in the parent node of dp by the descriptions of the new pages.
—If the number of subtrees stored in the parent exceeds Cmax,dir, split the parent and proceed similarly with the parent. It is possible that all pages on the path from dp to the root have to be split.
—If the root node has to be split, let the height of the tree grow by one. In this case, a new root node is created pointing to two subtrees resulting from the split of the original root.
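The generic sketch below (ours) renders these steps in the pseudocode style of Algorithms 1–3; PickBranch, PickSplit, and the parent bookkeeping are placeholders for the structure-specific heuristics discussed next.

/* Generic insert; all helper routines are placeholders for
   structure-specific heuristics. */
void Insert(Index *idx, Object obj) {
    Page *node = PickBranch(idx->root, obj);  /* search a suitable data page */
    AddObject(node, obj);
    while (NumEntries(node) > Cmax(node)) {   /* overflow: split the node */
        Page *right = PickSplit(node);
        if (node == idx->root) {              /* root split: tree grows by one */
            idx->root = NewRoot(node, right);
            break;
        }
        ReplaceDescription(Parent(node), node, right); /* update region info */
        node = Parent(node);                  /* overflow may propagate upward */
    }
}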
Heuristics individual to the specific indexing structure are applied for the following subtasks.

—The search for a suitable data page (commonly referred to as the PickBranch procedure): due to overlap between regions, and as the data space is not necessarily completely covered by page regions, there are generally multiple alternatives for the choice of a data page in most multidimensional index structures.
—The choice of the split (i.e., which of the data objects/subtrees are aggregated into which of the newly created nodes).

Some index structures try to avoid splits by a concept named forced reinsert. Some data objects are deleted from a node having an overflow condition and reinserted into the index. The details are presented later.

The choice of heuristics in insert processing may affect the effective storage utilization. For example, if a volume-minimizing algorithm allows unbalanced splitting in a 30:70 proportion, then the storage utilization of the index is decreased and the search performance is usually negatively affected.⁵ On the other hand, the presence of forced reinsert operations increases the storage utilization and the search performance.

⁵ For the hB-tree, it has been shown in Lomet and Salzberg [1990] that under certain assumptions even a 33:67 splitting proportion yields an average storage utilization of 64%.
ALGORITHM 1. (Algorithm for Exact Match Queries)

bool ExactMatchQuery(Point q, PageAdr pa) {
  int i;
  Page p = LoadPage(pa);
  if (IsDatapage(p))
    for (i = 0; i < p.num_objects; i++)
      if (q == p.object[i])
        return true;
  if (IsDirectoryPage(p))
    for (i = 0; i < p.num_objects; i++)
      if (IsPointInRegion(q, p.region[i]))
        if (ExactMatchQuery(q, p.sonpage[i]))
          return true;
  return false;
}

Some work has been undertaken on handling deletions from multidimensional index structures. Underflow conditions can generally be handled by three different actions:

—balancing pages by moving objects from one page to another,
—merging pages, and
—deleting the page and reinserting all objects into the index.

For most index structures it is a difficult task to find a suitable mate for balancing or merging actions. The only exceptions are the LSDh-tree [Henrich 1998] and the space filling curves [Morton 1966; Finkel and Bentley 1974; Abel and Smith 1983; Orenstein and Merret 1984; Faloutsos 1985, 1988; Faloutsos and Roseman 1989; Jagadish 1990] (cf. Sections 6.3 and 6.7). All other authors either suggest reinserting or do not provide a deletion algorithm at all. An alternative approach might be to permit underfilled pages and to maintain them until they are completely empty. The presence of delete operations and the choice of underflow treatment can affect sueff,data and sueff,dir positively as well as negatively.

An update operation is viewed as a sequence of a delete operation followed by an insert operation. No special procedure has been suggested for it yet.

3.2. Exact Match Query

Exact match queries are defined as follows: given a query point q, determine whether q is contained in the database. Query processing starts with the root node, which is loaded into main memory. For all regions containing point q, the function ExactMatchQuery() is called recursively. As overlap between page regions is allowed in most index structures presented in this survey, it is possible that several branches of the indexing structure have to be examined for processing an exact match query. In the GiST framework [Hellerstein et al. 1995], this situation is handled using the Consistent operation, which is the generic operation that needs to be reimplemented for different instantiations of the generalized search tree. The result of ExactMatchQuery is true if any of the recursive calls returns true. For data pages, the result is true if one of the points stored on the data page fits. If no point fits, the result is false. Algorithm 1 contains the pseudocode for processing exact match queries.
3.3. Range Query

The algorithm for range query processing returns a set of points contained in the query range as the result to the calling function. The size of the result set is previously unknown and may reach the size of the entire database. The algorithm is formulated independently of the applied metric. Any Lp metric, including metrics with weighted dimensions (ellipsoid queries [Seidl 1997; Seidl and Kriegel 1997]), can be applied, if there exists an effective and efficient test for the predicates IsPointInRange and RangeIntersectRegion. Also partial range queries (i.e., range queries where only a subset of the attributes is specified) can be considered as regular range queries with weights (the unspecified attributes are weighted with zero). Window queries can be transformed into range queries using a weighted Lmax metric.

ALGORITHM 2. (Algorithm for Range Queries)

PointSet RangeQuery(Point q, float r, PageAdr pa) {
  int i;
  PointSet result = EmptyPointSet;
  Page p = LoadPage(pa);
  if (IsDatapage(p))
    for (i = 0; i < p.num_objects; i++)
      if (IsPointInRange(q, p.object[i], r))
        AddToPointSet(result, p.object[i]);
  if (IsDirectoryPage(p))
    for (i = 0; i < p.num_objects; i++)
      if (RangeIntersectRegion(q, p.region[i], r))
        PointSetUnion(result, RangeQuery(q, r, p.childpage[i]));
  return result;
}
The algorithm (cf. Algorithm 2) performs a recursive self-call for all child pages whose corresponding page regions intersect with the query. The union of the results of all recursive calls is built and passed to the caller.
3.4. Nearest-Neighbor Query

There are two different approaches to processing nearest-neighbor queries on multidimensional index structures. One was published by Roussopoulos et al. [1995] and is referred to in the following as the RKV algorithm. The other, called the HS algorithm, was published in Henrich [1994] and Hjaltason and Samet [1995]. Due to their importance for our further presentation, these algorithms are presented in detail and their strengths and weaknesses are discussed.

We start with the description of the RKV algorithm because it is more similar to the algorithm for range query processing, in the sense that a depth-first traversal through the indexing structure is performed. RKV is an algorithm of the "branch and bound" type. In contrast, the HS algorithm loads pages from different branches and different levels of the index in an order induced by their closeness to the query point.
Unlike range query processing, there is no fixed criterion, known a priori, to exclude branches of the indexing structure from processing in nearest-neighbor algorithms. Actually, the criterion is the nearest neighbor distance, but the nearest neighbor distance is not known until the algorithm has terminated. To cut branches, nearest-neighbor algorithms have to use pessimistic (conservative) estimations of the nearest neighbor distance, which change during the run of the algorithm and approach the nearest neighbor distance. A suitable pessimistic estimation is the distance to the closest point among all points visited at the current state of execution (the so-called closest point candidate cpc). If no point has been visited yet, it is also possible to derive pessimistic estimations from the page regions visited so far.

Fig. 7. MINDIST and MAXDIST.
3.4.1. The RKV Algorithm. The authors of the RKV algorithm define two important distance functions, MINDIST and MINMAXDIST. MINDIST is the actual distance between the query point and a page region in the geometrical sense, that is, the nearest possible distance of any point inside the region to the query point. The definition in the original proposal [Roussopoulos et al. 1995] is limited to R-tree-like structures, where regions are provided as multidimensional intervals (i.e., minimum bounding rectangles, MBRs) I with

I = [lb0, ub0] × · · · × [lbd−1, ubd−1].

Then, MINDIST is defined as follows.

Definition 5 (MINDIST). The distance of a point q to region I, denoted MINDIST(q, I), is:

MINDIST²(q, I) = Σ i=0..d−1 xi², where
    xi = lbi − qi   if qi < lbi,
    xi = qi − ubi   if ubi < qi,
    xi = 0          otherwise.

An example of MINDIST is presented on the left side of Figure 7. In page regions pr1 and pr3, the edges of the rectangles define the MINDIST. In page region pr4 the corner defines MINDIST. As the query point lies in pr2, the corresponding MINDIST is 0. A similar definition can also be provided for differently shaped page regions, such as spheres (subtract the radius from the distance between center and q) or combinations. A similar definition can be given for the L1 and Lmax metric, respectively. For a pessimistic estimation, some specific knowledge about the underlying indexing structure is required. One assumption, which is true for all known index structures, is that every page must contain at least one point. Therefore, we can define the following MAXDIST function determining the distance to the farthest possible point inside a region:
MAXDIST²(q, I) = Σ i=0..d−1 xi², where
    xi = |lbi − qi|   if |lbi − qi| > |qi − ubi|,
    xi = |qi − ubi|   otherwise.

MAXDIST is not defined in the original paper, as it is not needed in R-tree-like structures. An example is shown on the right side of Figure 7. Being the greatest possible distance from the query point to a point in a page region, the MAXDIST is not equal to 0, even if the query point is located inside the page region pr2.

In R-trees, the page regions are minimum bounding rectangles (MBRs), that is, rectangular regions where each surface hyperplane contains at least one data point. The following MINMAXDIST function provides a better (i.e., lower) but still conservative estimation of the nearest neighbor distance:

MINMAXDIST²(q, I) = min 0≤k<d ( |qk − rmk|² + Σ i≠k, 0≤i<d |qi − rMi|² ),

where

    rmk = lbk   if qk ≤ (lbk + ubk)/2,   ubk otherwise, and
    rMi = lbi   if qi ≥ (lbi + ubi)/2,   ubi otherwise.

Fig. 8. MINMAXDIST.

The general idea is that every surface hyperarea must contain a point. The farthest point on every surface is determined and among those the minimum is taken. For each pair of opposite surfaces, only the nearer surface can contain the minimum. Thus, it is guaranteed that a data object can be found in the region having a distance less than or equal to MINMAXDIST(q, I). MINMAXDIST(q, I) is the smallest distance providing this guarantee. The example in Figure 8 shows on the left side the considered edges. Among each pair of opposite edges of an MBR, only the edge closer to the query point is considered. The point yielding the maximum distance on each considered edge is marked with a circle. The minimum among all marked points of each page region defines the MINMAXDIST, as shown on the right side of Figure 8.
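The three distance bounds translate directly into code. The following sketch (ours) computes their squared values for an MBR given by lower and upper bound arrays lb[] and ub[].

#include <math.h>

/* Squared MINDIST: nearest possible squared distance from q to the MBR. */
double mindist2(const double *q, const double *lb, const double *ub, int d) {
    double sum = 0.0;
    for (int i = 0; i < d; i++) {
        double x = 0.0;
        if (q[i] < lb[i]) x = lb[i] - q[i];
        else if (q[i] > ub[i]) x = q[i] - ub[i];
        sum += x * x;
    }
    return sum;
}

/* Squared MAXDIST: squared distance from q to the farthest MBR corner. */
double maxdist2(const double *q, const double *lb, const double *ub, int d) {
    double sum = 0.0;
    for (int i = 0; i < d; i++) {
        double x = fmax(fabs(lb[i] - q[i]), fabs(q[i] - ub[i]));
        sum += x * x;
    }
    return sum;
}

/* Squared MINMAXDIST: for each dimension k take the nearer face in k and
   the farther face in every other dimension; minimize over k. Valid only
   under the MBR property (every face contains at least one point). */
double minmaxdist2(const double *q, const double *lb, const double *ub, int d) {
    double far_sum = 0.0, best = HUGE_VAL;
    for (int i = 0; i < d; i++) {
        double rM = (q[i] >= (lb[i] + ub[i]) / 2) ? lb[i] : ub[i];
        far_sum += (q[i] - rM) * (q[i] - rM);
    }
    for (int k = 0; k < d; k++) {
        double rm = (q[k] <= (lb[k] + ub[k]) / 2) ? lb[k] : ub[k];
        double rM = (q[k] >= (lb[k] + ub[k]) / 2) ? lb[k] : ub[k];
        double cand = far_sum - (q[k] - rM) * (q[k] - rM)
                              + (q[k] - rm) * (q[k] - rm);
        if (cand < best) best = cand;
    }
    return best;
}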
This pessimistic estimation cannot be used for spherical or combined regions, as these in general do not fulfill a property similar to the MBR property. In this case, MAXDIST(q, I), which is a worse estimation than MINMAXDIST, has to be used. All definitions presented using the L2 metric in the original paper [Roussopoulos et al. 1995] can easily be adapted to the L1 or Lmax metrics, as well as to weighted metrics.
The algorithm (cf. Algorithm 3) performs accesses to the pages of an index in a depth-first order ("branch and bound"). A branch of the index is always completely processed before the next branch is begun. Before child nodes are loaded and recursively processed, they are heuristically sorted according to their probability of containing the nearest neighbor. For the sorting order, the optimistic or pessimistic estimation or a combination thereof may be chosen. The quality of sorting is critical for the efficiency of the algorithm because for different sequences of processing, the estimation of the nearest neighbor distance may approach the actual nearest neighbor distance more or less quickly. Roussopoulos et al. [1995] report advantages for the optimistic estimation. The list of child nodes is pruned whenever the pessimistic estimation of the nearest neighbor distance changes. Pruning means the discarding of all child nodes having a MINDIST larger than the
Searching in High-Dimensional Spaces

339

ALGORITHM 3. (The RKV Algorithm for Finding the Nearest Neighbor)

float pruning_dist    /* The current distance for pruning branches */
    = INFINITE;       /* Initialization before the start of RKV algorithm */
Point cpc;            /* The closest point candidate. This variable will contain
                         the nearest neighbor after RKV algorithm has completed */
void RKV_algorithm(Point q, PageAdr pa) {
    int i; float h;
    Page p = LoadPage(pa);
    if (IsDatapage(p))
        for (i = 0; i < p.num_objects; i++) {
            h = PointToPointDist(q, p.object[i]);
            if (pruning_dist >= h) {
                pruning_dist = h;
                cpc = p.object[i];
            }
        }
    if (IsDirectoryPage(p)) {
        sort(p, CRITERION);   /* CRITERION is MINDIST or MINMAXDIST */
        for (i = 0; i < p.num_objects; i++) {
            if (MINDIST(q, p.region[i]) <= pruning_dist)
                RKV_algorithm(q, p.childpage[i]);
            h = MINMAXDIST(q, p.region[i]);
            if (pruning_dist >= h)
                pruning_dist = h;
        }
    }
}

These pages are guaranteed not to contain the nearest neighbor because even the closest point in these pages is farther away than an already found point (lower bounding property). The pessimistic estimation is the lowest among all distances to points processed thus far and all results of the MINMAXDIST(q, I) function for all page regions processed thus far.

In Cheung and Fu [1998], several heuristics for the RKV algorithm with and without the MINMAXDIST function are discussed. The authors prove that any page that can be pruned by exploiting MINMAXDIST can also be pruned without that concept. Their conclusion is that the determination of MINMAXDIST should be avoided, as it causes additional computational overhead.
Extending the algorithm to k-nearest neighbor processing is a difficult task. Unfortunately, the authors make it easy by discarding the MINMAXDIST from pruning, sacrificing the performance gains obtainable from MINMAXDIST pruning. The kth lowest among all distances to points found thus far must be used. Additionally required is a buffer for k points (the k closest point candidate list, cpcl) which allows an efficient deletion of the point with the highest distance and an efficient insertion of a random point. A suitable data structure for the closest point candidate list is a priority queue (also known as a semisorted heap [Knuth 1975]).
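A minimal sketch of such a buffer (our illustration; a binary max-heap of at most k entries whose root serves as the pruning distance; the Point type is taken from Algorithm 3):

#include <math.h>

#define K_MAX 64                 /* compile-time bound on k, illustrative */

typedef struct { double dist; Point p; } Cand;
typedef struct { Cand a[K_MAX]; int size, k; } CpcList;

/* distance used for pruning: infinite until k candidates are known */
double cpcl_pruning_dist(const CpcList *l) {
    return (l->size < l->k) ? HUGE_VAL : l->a[0].dist;
}

void cpcl_insert(CpcList *l, Cand c) {
    if (l->size < l->k) {                         /* heap not yet full: sift up */
        int i = l->size++;
        l->a[i] = c;
        while (i > 0 && l->a[(i - 1) / 2].dist < l->a[i].dist) {
            Cand t = l->a[i]; l->a[i] = l->a[(i - 1) / 2]; l->a[(i - 1) / 2] = t;
            i = (i - 1) / 2;
        }
    } else if (c.dist < l->a[0].dist) {           /* replace the worst entry: sift down */
        l->a[0] = c;
        int i = 0;
        for (;;) {
            int ch = 2 * i + 1;
            if (ch >= l->size) break;
            if (ch + 1 < l->size && l->a[ch + 1].dist > l->a[ch].dist) ch++;
            if (l->a[i].dist >= l->a[ch].dist) break;
            Cand t = l->a[i]; l->a[i] = l->a[ch]; l->a[ch] = t;
            i = ch;
        }
    }
}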
Considering the MINMAXDIST imposes some difficulties, as the algorithm has to assure that k points are closer to the query than a given region. For each region, we know that at least one point must have a distance less than or equal to MINMAXDIST. If the k-nearest neighbor algorithm pruned a branch according to MINMAXDIST, it would assume that k points are positioned on the nearest surface hyperplane of the page region; the MBR property, however, guarantees only one such point. We further know that m points must have
a distance less than or equal to MAXDIST, where m is the number of points stored in the corresponding subtree. The number m could, for example, be stored in the directory nodes, or could be estimated pessimistically by assuming minimal storage utilization if the indexing structure provides storage utilization guarantees. A suitable extension of the RKV algorithm could use a semisorted heap with k entries. Each entry is a cpc, a MAXDIST estimation, or a MINMAXDIST estimation. The heap entry with the greatest distance to the query point q is used for branch pruning and is called the pruning element. Whenever new points or estimations are encountered, they are inserted into the heap if they are closer to the query point than the pruning element. Whenever a new page is processed, all estimations based on the corresponding page region have to be deleted from the heap. They are replaced by the estimations based on the regions of the child pages (or the contained points, if it is a data page). This additional deletion implies additional complexity because a priority queue does not efficiently support the deletion of elements other than the pruning element. All these difficulties are neglected in the original paper [Roussopoulos et al. 1995].
3.4.2. The HS Algorithm. The problems arising from the need to estimate the nearest neighbor distance are elegantly avoided in the HS algorithm [Hjaltason and Samet 1995]. The HS algorithm does not access the pages in an order induced by the hierarchy of the indexing structure, such as depth-first or breadth-first. Rather, all pages of the index are accessed in the order of increasing distance to the query point. The algorithm is allowed to jump between branches and levels when processing pages. See Figure 9.

Fig. 9. The HS algorithm for finding the nearest neighbor.
The algorithm manages an active page list (APL). A page is called active if its parent has been processed but not the page itself. Since the parent of an active page has been loaded, the corresponding regions of all active pages are known and the distance between a region and the query point can be determined. The APL stores the background storage address of each page, as well as its distance to the query point. The representation of the page region is not needed in the APL. A processing step of the HS algorithm comprises the following actions.
—Select the page p with the lowest distance to the query point from the APL.
—Load p into main memory.
—Delete p from the APL.
—If p is a data page, determine if one of the points contained in this page is closer to the query point than the closest point found so far (called the closest point candidate cpc).
—Otherwise: Determine the distances to the query point for the regions of all child pages of p and insert all child pages and the corresponding distances into the APL.
The processing step is repeated until the closest point candidate is closer to the query point than the nearest active page. In this case, no active page can contain a point closer to q than cpc, due to the lower bounding property. Also, no subtree of any active page may contain such a point. As all other pages have already been examined, processing can stop. Again, the priority queue is the suitable data structure for the APL.
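A compact sketch of this loop, in the style of Algorithm 3 (our illustration; apl_insert, apl_empty, apl_min_dist, and apl_delete_min stand for an assumed priority queue implementation over (distance, page address) pairs and are not part of the original proposal):

float cpc_dist = INFINITE;  /* distance of the closest point candidate */
Point cpc;                  /* the closest point candidate itself */

void HS_algorithm(Point q, PageAdr root) {
    int i; float h;
    apl_insert(0.0, root);                    /* the root page is always active */
    while (!apl_empty() && apl_min_dist() < cpc_dist) {
        Page p = LoadPage(apl_delete_min());  /* nearest active page */
        if (IsDatapage(p))
            for (i = 0; i < p.num_objects; i++) {
                h = PointToPointDist(q, p.object[i]);
                if (h < cpc_dist) { cpc_dist = h; cpc = p.object[i]; }
            }
        else
            for (i = 0; i < p.num_objects; i++)  /* child pages become active */
                apl_insert(MINDIST(q, p.region[i]), p.childpage[i]);
    }
    /* on termination, cpc is the nearest neighbor of q */
}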
For k-nearest neighbor processing, a second priority queue with fixed length k is required for the closest point candidate list.
3.4.3. Discussion. Now we compare the
two algorithms in terms of their space
and time complexity. In the context of
space complexity, we regard the available
main memory as the most important system limitation. We assume that the stack
for recursion management and all priority queues are held in main memory
although one could also provide an implementation of the priority queue data
structure suitable for secondary storage
usage.

LEMMA 1 (Worst Case Space Complexity of the RKV Algorithm). The RKV algorithm has a worst case space complexity of O(log n).

For the proof see Appendix A.

As the RKV algorithm performs a depth-first pass through the indexing structure, and no additional dynamic memory is required, the space complexity is O(log n). Lemma 1 is also valid for the k-nearest neighbor search if allowance is made for the additional space requirement of the closest point candidate list, with a space complexity of O(k).
LEMMA 2 (Worst Case Space Complexity of the HS Algorithm). The HS algorithm has a space complexity of O(n) in the worst case.

For the proof see Appendix B.
In spite of the order O(n), the size of the APL is only a very small fraction of the size of the data set, because the APL contains only the page address and the distance between the page region and the query point q. If the size of the data set in bytes is DSS, then we have a number DP of data pages with

$$DP = \frac{DSS}{su_{\mathrm{eff,data}} \cdot \mathrm{sizeof(DataPage)}}.$$

Then the size of the APL is f times the data set size:

$$\mathrm{sizeof(APL)} = f \cdot DSS = \frac{\mathrm{sizeof(float)} + \mathrm{sizeof(address)}}{su_{\mathrm{eff,data}} \cdot \mathrm{sizeof(DataPage)}} \cdot DSS,$$

where a typical factor for a page size of 4 Kbytes is f = 0.3%, shrinking further with a growing data page size. Thus, it should be no practical problem to hold 0.3% of a database in main memory, although this is theoretically unattractive.
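As a worked instance (with illustrative values of ours): assuming a 4-byte float, a 4-byte page address, a 4-Kbyte data page, and an effective data page storage utilization of about 70%, f = (4 + 4)/(0.7 · 4096) ≈ 0.0028, that is, roughly 0.3%.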
For the objective of comparing the two algorithms, we prove the optimality of the HS algorithm in the sense that it accesses as few pages as theoretically possible for a given index. We further show, using counterexamples, that the RKV algorithm does not generally reach this optimum.

LEMMA 3 (Page Regions Intersecting the Nearest Neighbor Sphere). Let nndist be the distance between the query point and its nearest neighbor. All pages that intersect a sphere around the query point having a radius equal to nndist (the so-called nearest neighbor sphere) must be accessed for query processing. This condition is necessary and sufficient.
For the proof see Appendix C.
LEMMA 4 (Schedule of the HS Algorithm). The HS algorithm accesses pages in the order of increasing distance to the query point.

For the proof see Appendix D.

Fig. 10. Schedules of the RKV and HS algorithms.
LEMMA 5 (Optimality of HS Algorithm). The HS algorithm is optimal in
terms of the number of page accesses.
For the proof see Appendix E.
Now we demonstrate by an example that the RKV algorithm does not always yield an optimal number of page accesses. The main reason is that once a branch of the index has been selected, it has to be completely processed before a new branch can be begun. In the example of Figure 10, both algorithms choose pr1 to load first. Some important MINDISTs and MINMAXDISTs are marked in the figure with solid and dotted arrows, respectively. Whereas the HS algorithm then loads pr2 and pr21, the RKV algorithm first has to load pr11 and pr12, because no MINMAXDIST estimate can prune the corresponding branches. If pr11 and pr12 are not data pages but represent further subtrees of larger height, many of the pages in these subtrees have to be accessed.

In summary, the HS algorithm for nearest neighbor search is superior to the RKV algorithm when counting page accesses. On the other hand, it has the disadvantage of dynamically allocating main memory of the order O(n), although with a very small factor of less than 1% of the database size. In addition, the extension of the RKV algorithm to k-nearest neighbor search is difficult to implement.
An open question is whether minimizing the number of page accesses also minimizes the time needed for these accesses. We show later that statically constructed indexes yield an interpage clustering, meaning that all pages in a branch of the index are laid out contiguously on background storage. Therefore, the depth-first search of the RKV algorithm could yield fewer disk head movements than the distance-driven search of the HS algorithm. A new challenge would be to develop an algorithm for nearest neighbor search that directly optimizes the processing time rather than the number of page accesses.
3.5. Ranking Query

Ranking queries can be seen as generalized k-nearest-neighbor queries with a previously unknown result set size k. A typical application of a ranking query requests the nearest neighbor first, then the second closest point, the third, and so on. The requests stop according to a criterion that is external to the index-based query processing. Therefore, neither a limited query range nor a limited result set size can be assumed before the application terminates the ranking query.

In contrast to the k-nearest neighbor algorithm, a ranking query algorithm needs an unlimited priority queue for the candidate list of closest points (cpcl). A further difference is that each request of the next closest point is regarded as a phase that ends with reporting the next resulting point. The phases are optimized independently. In contrast, the k-nearest neighbor algorithm searches all k points in a single phase and reports the complete set.
In each phase of a ranking query algorithm, all points encountered during data page accesses are stored in the cpcl. The phase ends when it is guaranteed that unprocessed index pages cannot contain a point closer than the first point in the cpcl (the corresponding criterion of the k-nearest neighbor algorithm is based on the last element of the cpcl). Before the next phase begins, the leading element is deleted from the cpcl.
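Under the same assumptions as the HS sketch above (apl_* plus an assumed cpcl_* priority queue over points, both min-heaps keyed on distance), one phase can be sketched as follows:

Point ranking_next(Point q) {   /* one phase: determine and report the next point */
    int i;
    while (!apl_empty() &&
           (cpcl_empty() || apl_min_dist() < cpcl_min_dist())) {
        Page p = LoadPage(apl_delete_min());
        if (IsDatapage(p))
            for (i = 0; i < p.num_objects; i++)
                cpcl_insert(PointToPointDist(q, p.object[i]), p.object[i]);
        else
            for (i = 0; i < p.num_objects; i++)
                apl_insert(MINDIST(q, p.region[i]), p.childpage[i]);
    }
    return cpcl_delete_min();   /* leading element is reported and deleted
                                   before the next phase begins */
}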
It does not appear very attractive to extend the RKV algorithm to process ranking queries, because effective branch pruning can be performed neither based on MINMAXDIST or MAXDIST estimates nor based on the points encountered during data page accesses.
In contrast, the HS algorithm for nearest neighbor processing needs only the modifications described above to be applied as a ranking query algorithm. The original proposal [Hjaltason and Samet 1995] contains these extensions.

The major limitation of the HS algorithm for ranking queries is the cpcl. It can be shown, similarly to Lemma 2, that the length of the cpcl is of the order O(n). In contrast to the APL, the cpcl contains the full information of possibly all data objects stored in the index. Thus its size is bounded only by the database size, questioning the applicability not only theoretically but also practically. From our point of view, a priority queue implementation suitable for background storage is required for this purpose.
3.6. Reverse Nearest-Neighbor Queries

In Korn and Muthukrishnan [2000], the authors introduce the operation of reverse nearest-neighbor queries. Given an arbitrary query point q, this operation retrieves all points of the database to which q is the nearest neighbor, that is, the set of reverse nearest neighbors. Note that the nearest-neighbor relation is not symmetric: if some point p1 is the nearest neighbor of p2, then p2 is not necessarily the nearest neighbor of p1. Therefore the result set of the rnn-operation can be empty or may contain an arbitrary number of points.

Fig. 11. Indexing for the reverse nearest neighbor search.
A database point p is in the result set of the rnn-operation for query point q unless another database point p′ is closer to p than q is. Therefore, p is in the result set if q is enclosed by the sphere centered at p and touching the nearest neighbor of p (the nearest neighbor sphere of p). Therefore, in Korn and Muthukrishnan [2000] the problem is solved by a specialized index structure for sphere objects that stores the nearest neighbor spheres rather than the database points. An rnn-query corresponds to a point query in that index structure. For an insert operation, the set of reverse nearest neighbors of the new point must be determined, and the corresponding nearest neighbor spheres of all result points must be reinserted into the index.
The two most important drawbacks of this solution are the high cost of the insert operation and the use of a highly specialized index. For instance, if the rnn has to be determined for only a subset of the dimensions, a completely new index must be constructed. Therefore, in Stanoi et al. [2000] the authors propose a solution for point index structures. This solution, however, is limited to the two-dimensional case. See Figure 11.
4. COST MODELS FOR HIGH-DIMENSIONAL INDEX STRUCTURES

Due to the high practical relevance of multidimensional indexing, cost models for estimating the number of necessary page accesses were proposed several years ago. The first approach is the well-known cost model proposed by Friedman et al. [1977] for nearest-neighbor query processing using the maximum metric. The original model estimates leaf accesses in a kd-tree, but can easily be extended to estimate data page accesses of R-trees and related index structures. This extension was presented in Faloutsos et al. [1987] and, with slightly different aspects, in Aref and Samet [1991], Pagel et al. [1993], and Theodoridis and Sellis [1996]. The expected number of data page accesses in an R-tree is
$$A_{nn,mm,\mathrm{FBF}} = \left( \sqrt[d]{\frac{1}{C_{\mathrm{eff}}}} + 1 \right)^{d}.$$
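Numerically, the estimate is trivial to evaluate; a one-line sketch (our illustration):

#include <math.h>

/* FBF estimate of nearest-neighbor data page accesses, maximum metric */
double fbf_accesses(int d, double c_eff) {
    return pow(pow(1.0 / c_eff, 1.0 / d) + 1.0, (double)d);
}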

This formula is motivated as follows. The query evaluation algorithm is assumed to access an area of the data space which is a hypercube of volume V1 = 1/N, where N is the number of objects stored in the database. Analogously, the page region is approximated by a hypercube with the volume V2 = Ceff/N. In each dimension, the probability that the projections of V1 and V2 intersect each other corresponds to $\sqrt[d]{V_1} + \sqrt[d]{V_2}$ for $N \to \infty$. To obtain the probability that V1 and V2 intersect in all dimensions, this term must be taken to the power of d. Multiplying this result by the number of data pages N/Ceff yields the expected number of page accesses A_nn,mm,FBF. The assumptions of the model, however, are unrealistic for nearest-neighbor queries on high-dimensional data for several reasons.
First, the number N of objects in the database is assumed to approach infinity. Second, effects of high-dimensional data spaces and correlations are not considered by the model. In Cleary [1979] the model presented in Friedman et al. [1977] is extended by allowing nonrectangular page regions, but boundary effects and correlations are still not considered. In Eastman [1981] the existing models are used for optimizing the bucket size of the kd-tree. In Sproull [1991] the author shows that the number of datapoints must be exponential in the number of dimensions for the models to provide accurate estimations. According to Sproull, boundary effects significantly contribute to the costs unless the following condition holds:
$$N \gg C_{\mathrm{eff}} \cdot \left( \sqrt[d]{\frac{1}{C_{\mathrm{eff}} \cdot V_S\!\left(\frac{1}{2}\right)}} + 1 \right)^{d},$$

where $V_S(r)$ is the volume of a hypersphere with radius $r$, which can be computed as

$$V_S(r) = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d$$

with the gamma function $\Gamma(x)$, which is the extension of the factorial operator $x! = \Gamma(x+1)$ into the domain of real numbers: $\Gamma(x+1) = x \cdot \Gamma(x)$, $\Gamma(1) = 1$, and $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.
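This volume is directly computable with the C99 gamma function; a small sketch (ours):

#include <math.h>

/* V_S(r) = sqrt(pi^d) / Gamma(d/2 + 1) * r^d; tgamma() is C99 and does
   not overflow for moderate d (e.g., d = 20, where Gamma(11) = 3628800) */
double sphere_volume(int d, double r) {
    const double pi = 3.14159265358979323846;
    return pow(pi, d / 2.0) / tgamma(d / 2.0 + 1.0) * pow(r, d);
}

For d = 20, sphere_volume(20, 0.5) evaluates to about 2.5 · 10^−8; its reciprocal is the factor 4.1 · 10^7 quoted later in this section.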
For example, in a 20-dimensional data space with Ceff = 20, Sproull's formula evaluates to N ≫ 1.1 · 10^11. We show later (cf. Figure 12) how bad the cost estimations of the FBF model are if substantially fewer than a hundred billion points are stored in the database. Unfortunately, Sproull still assumes for his analysis uniformity and independence in the distribution of datapoints and queries; that is, both the datapoints and the center points of the queries are chosen from a uniform data distribution, whereas the selectivity of the queries (1/N) is considered fixed. The above formulas are also generalized to k-nearest-neighbor queries, where k is a user-given parameter.

Fig. 12. Evaluation of the model of Friedman et al. [1977].
The assumptions made in the existing models do not hold in the high-dimensional case. The main reason for the problems of the existing models is that they do not consider boundary effects. "Boundary effects" stands for an exceptional performance behavior when the query reaches the boundary of the data space. Boundary effects occur frequently in high-dimensional data spaces and lead to the pruning of major amounts of empty search space, which is not considered by the existing models. To examine these effects, we performed experiments comparing the necessary page accesses with the model estimations. Figure 12 shows the actual page accesses for uniformly distributed point data versus the estimations of the Friedman et al. model. For high-dimensional data, the model completely fails to estimate the number of page accesses.
The basic model of Friedman et al. [1977] has been extended in different directions. The first is to take correlation effects into account by using the concept of the fractal dimension [Mandelbrot 1977; Schröder 1991]. There are various definitions of the fractal dimension, which all capture the relevant aspect (the correlation) but differ in the details of how the correlation is measured.
In Faloutsos and Kamel [1994] the authors used the box-counting fractal dimension (also known as the Hausdorff fractal dimension) for modeling the performance of R-trees when processing range queries using the maximum metric. In their model they assume a correlation in the points stored in the database. For the queries, they still assume a uniform and independent distribution. The analysis does not take into account effects of high-dimensional spaces, and the evaluation is limited to data spaces with dimensions less than or equal to three.

In Belussi and Faloutsos [1995] the authors used the fractal dimension with a different definition (the correlation fractal dimension) for the selectivity estimation of spatial queries. In this paper, range queries in low-dimensional data spaces using the Manhattan, Euclidean, and maximum metrics were modeled. Unfortunately, the model only allows the estimation of selectivities. It is not possible to extend the model in a straightforward way to determine expectations of page accesses.
Papadopoulos and Manolopoulos [1997b] used the results of Faloutsos and Kamel and of Belussi and Faloutsos for a new model published in a recent paper. Their model is capable of estimating data page accesses of R-trees when processing nearest-neighbor queries in a Euclidean space. They estimate the distance of the nearest neighbor by using the selectivity estimation presented in Belussi and Faloutsos [1995] in the reverse way. As it is difficult to determine accesses to pages with rectangular regions for spherical queries, they approximate query spheres by minimum bounding and maximum enclosed cubes and determine upper and lower bounds of the number of page accesses in this way. This approach makes the model inoperative for high-dimensional data spaces, because the approximation error grows exponentially with increasing dimension. Note that in a 20-dimensional data space, the volume of the minimum bounding cube of a sphere is by a factor of 1/V_S(1/2) = 4.1 · 10^7 larger than the volume of the sphere. The sphere volume, in turn, is by V_S(√d/2) = 27,000 times larger than the greatest enclosed cube. An asset of Papadopoulos and Manolopoulos' model is that queries are no longer assumed to be taken from a uniform and independent distribution. Instead, the authors assume that the query distribution follows the data distribution.
The concept of the fractal dimension is also widely used in the domain of spatial databases, where the complexity of stored polygons is modeled [Gaede 1995; Faloutsos and Gaede 1996]. These approaches are of minor importance for point databases.

The second direction in which the basic model of Friedman et al. [1977] needs extension concerns the boundary effects occurring when indexing data spaces of higher dimensionality.
Arya [1995] and Arya et al. [1995] presented a new cost model for processing nearest-neighbor queries in the context of the application domain of vector quantization. Arya et al. restricted their model to the maximum metric and neglected correlation effects. Unfortunately, they still assumed that the number of points is exponential with the dimension of the data space. This assumption is justified in their application domain, but it is unrealistic for database applications.

Fig. 13. The Minkowski sum.
Berchtold et al. [1997b] presented a cost model for query processing in high-dimensional data spaces, the so-called BBKK model. The basic concept of the BBKK model is the Minkowski sum (cf. Figure 13), a concept from robot motion planning that the BBKK model introduced to cost estimation for the first time. The general idea is to transform a query having a spatial extension (such as a range query or nearest-neighbor query) equivalently into a point query by enlarging the page region. In Figure 13, the page region has been enlarged such that a point query lies in the enlarged region if (and only if) the original query intersects the original region. Together with concepts to estimate the size of page regions and query regions, the model provides accurate estimations for nearest neighbor and range queries using the Euclidean metric and considers boundary effects. To cope with correlation, the authors propose using the fractal dimension, without presenting the details.
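The Minkowski-sum idea is easy to make concrete in its simplest special case. The following sketch (our illustration, not the BBKK model itself) computes the access probability of a rectangular page region for a range query of radius r in the maximum metric, assuming query points uniformly distributed in the unit hypercube:

/* The page is accessed iff the query point falls into the region
   enlarged by r on every side (the Minkowski sum), so the access
   probability is the product of the enlarged side lengths, clipped
   to the extent of the data space. Boundary effects are thereby
   handled only crudely. */
double access_probability_mm(const double *lb, const double *ub, int d, double r) {
    double p = 1.0;
    for (int i = 0; i < d; i++) {
        double side = ub[i] - lb[i] + 2.0 * r;   /* Minkowski enlargement */
        p *= (side < 1.0) ? side : 1.0;          /* clip to the unit cube */
    }
    return p;
}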
The main limitations of the model are (1) that no estimation for the maximum metric is presented, (2) that the number of data pages is assumed to be a power of two, and (3) that a complete, overlap-free coverage of the data space with data pages is assumed. Weber et al. [1998] use the cost model by Berchtold et al. without the extension for correlated data to show the superiority of the sequential scan in sufficiently high dimensions. They present the VA-file, an improvement of the sequential scan. Ciaccia et al. [1998] adapt the cost model [Berchtold et al. 1997b] to estimate the page accesses of the M-tree, an index structure for data spaces that are metric spaces but not vector spaces (i.e., only the distances between the objects are known, but no explicit positions). In Papadopoulos and Manolopoulos [1998] the authors apply the cost model to the declustering of data in a disk array. Two papers by Agrawal et al. [1998] and Riedel et al. [1998] present applications in the data mining domain.
A recent paper [Böhm 2000] is based on the BBKK cost model, which is presented in a comprehensive way and extended in many aspects. The extensions not yet covered by the BBKK model include all estimations for the maximum metric, which are developed throughout the whole paper. The restriction of the BBKK model to numbers of data pages that are a power of two is overcome. A further extension of the model regards k-nearest-neighbor queries (the BBKK model is restricted to one-nearest-neighbor queries). The numerical methods for integral approximation and for the estimation of the boundary effects were to a large extent beyond the scope of Berchtold et al. [1997b]. Finally, the concept of the fractal dimension, which was also used in the BBKK model in a simplified way (the data space dimension is simply replaced by the fractal dimension), is in this paper well established by the consequent application of the fractal power laws.
5. INDEXING IN METRIC SPACES

In some applications, objects cannot be
mapped into feature vectors. However,
there still exists some notion of similarity
between objects, which can be expressed
as a metric distance between the objects;
that is, the objects are embedded in a metric space. The object distances can be used
directly for query evaluation.
Fig. 14. Example Burkhard–Keller tree (D: data points, v: values of the discrete distance function).

Several index structures for pure metric spaces have been proposed in the literature. Probably the oldest reference is the so-called Burkhard–Keller [1973] tree. Burkhard–Keller trees use a distance function that returns a small number (i) of discrete values. An arbitrary object is chosen as the root of the tree and the distance function is used to partition the remaining data objects into i subsets, which form the i branches of the tree. The same procedure is repeated for each nonempty subset to build up the tree (cf. Figure 14). More recently, a number of variants of the Burkhard–Keller tree have been proposed [Baeza-Yates et al. 1994]. In the fixed queries tree, for example, the data objects used as pivots are confined to be the same on the same level of the tree [Baeza-Yates et al. 1994].
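A minimal sketch of a Burkhard–Keller tree (our illustration; dist() is an assumed application-supplied discrete metric returning values 1..MAX_D, and duplicate objects, i.e., distance 0, are assumed absent for brevity):

#include <stdlib.h>

#define MAX_D 16

typedef struct BKNode {
    int object;                       /* id of the pivot object */
    struct BKNode *child[MAX_D + 1];  /* child[v]: objects at distance v */
} BKNode;

int dist(int a, int b);               /* assumed discrete metric */

BKNode *bk_insert(BKNode *root, int obj) {
    if (!root) {
        root = calloc(1, sizeof *root);
        root->object = obj;
        return root;
    }
    int v = dist(obj, root->object);
    root->child[v] = bk_insert(root->child[v], obj);
    return root;
}

/* Range search: report all objects within distance rad of q. By the
   triangle inequality, only branches with |v - dist(q, pivot)| <= rad
   can contain results. */
void bk_range(const BKNode *root, int q, int rad, void (*report)(int)) {
    if (!root) return;
    int dq = dist(q, root->object);
    if (dq <= rad) report(root->object);
    for (int v = (dq - rad > 1 ? dq - rad : 1); v <= dq + rad && v <= MAX_D; v++)
        bk_range(root->child[v], q, rad, report);
}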
In most applications, a continuous distance function is used. Examples of index structures based on a continuous distance function are the vantage-point tree (VPT), the generalized hyperplane tree (GHT), and the M-tree. The VPT [Uhlmann 1991; Yianilos 1993] is a binary tree that uses some pivot element as the root and partitions the remaining data elements into two subsets based on their distance with respect to the pivot element. The same is repeated recursively for the subsets (cf. Figure 15). Variants of the VPT are the optimized VP-tree [Chiueh 1994], the Multiple VP-tree [Bozkaya and Ozsoyoglu 1997], and the VP-Forest [Yianilos 1999].

Fig. 15. Example vantage-point tree.
The GHT [Uhlmann 1991] is also a binary tree that uses two pivot elements on each level of the tree. All data elements that are closer to the first pivot element are assigned to the left subtree, and all elements closer to the other pivot element are assigned to the right subtree (cf. Figure 16). A variant of the GHT is the geometric near-neighbor access tree (GNAT) [Brin 1995]. The main difference is that the GNAT is an m-ary tree that uses m pivots on each level of the tree.

Fig. 16. Example generalized hyperplane tree.
The basic structure of the M-tree [Ciaccia et al. 1997] is similar to the VP-tree. The main difference is that the M-tree is designed for secondary memory and allows overlap in the covered areas to facilitate updates. Note that among all metric index structures the M-tree is the only one that is optimized for large, secondary-memory-based data sets. All others are main memory index structures supporting rather small data sets.

Note that metric indexes are only used in applications where the distance in vector space is not meaningful. This is true since vector spaces contain more information and therefore allow a better structuring of the data than general metric spaces.
6. APPROACHES TO HIGH-DIMENSIONAL INDEXING

In this section, we introduce and briefly discuss the most important index structures for high-dimensional data spaces. We first describe index structures using minimum bounding rectangles as page regions, such as the R-tree, the R*-tree, and the X-tree. We continue with the structures using bounding spheres, such as the SS-tree and the TV-tree, and conclude with two structures using combined regions. The SR-tree uses the intersection solid of MBR and bounding sphere as the page region. The page region of a space-filling curve is the union of not necessarily connected hypercubes.

Multidimensional access methods that have not been investigated for query processing in high-dimensional data spaces, such as hashing-based methods [Nievergelt et al. 1984; Otoo 1984; Hinrichs 1985; Krishnamurthy and Whang 1985; Ouksel 1985; Kriegel and Seeger 1986, 1987, 1988; Freeston 1987; Hutflesz et al. 1988a, b; Henrich et al. 1989], are excluded from the discussion here. In the VAMSplit R-tree [Jain and White 1996] and in the Hilbert R-tree [Kamel and Faloutsos 1994], methods for statically constructing R-trees are presented. Since the VAMSplit R-tree and the Hilbert R-tree are more of a construction method than an indexing structure of their own, they are also not presented in detail here.

6.1. R-tree, R*-tree, and R+-tree

The R-tree [Guttman 1984] family of index structures uses solid minimum bounding rectangles (MBRs) as page regions. An MBR is a multidimensional interval of the data space (i.e., an axis-parallel multidimensional rectangle). MBRs are minimal approximations of the enclosed point set; there exists no smaller axis-parallel rectangle also enclosing the complete point set. Therefore, every (d − 1)-dimensional surface area must contain at least one datapoint. Space partitioning is neither complete nor disjoint: parts of the data space may not be covered at all by data page regions, and overlap between regions in different branches is allowed, although overlap deteriorates the search performance, especially in high-dimensional data spaces [Berchtold et al. 1996]. The region description of an MBR comprises a lower and an upper bound for each dimension; thus, 2d floating point values are required. This description allows an efficient determination of MINDIST, MINMAXDIST, and MAXDIST using any Lp metric.
R-trees were originally designed for spatial databases, that is, for the management of two-dimensional objects with a spatial extension (e.g., polygons). In the index, these objects are represented by the corresponding MBR. In contrast to point objects, it is possible that no overlap-free partition for a set of such objects exists at all. The same problem also occurs when R-trees are used to index datapoints, but only in the directory part of the index. Page regions are treated as spatially extended, atomic objects in their parent nodes (no forced split). Therefore, it is possible that a directory page cannot be split without creating overlap among the newly created pages [Berchtold et al. 1996].
According to our framework of high-dimensional index structures, two heuristics have to be defined to handle the insert operation: the choice of a suitable page to insert the point into, and the management of page overflow. When searching for a suitable page, one of three cases may occur.

—The point is contained in exactly one page region. In this case, the corresponding page is used.
—The point is contained in several different page regions. In this case, the page region with the smallest volume is used.
—No region contains the point. In this case, the region that yields the smallest volume enlargement is chosen. If several regions yield minimum enlargement, the region with the smallest volume among them is chosen.
The insert algorithm starts with the root and chooses in each step a child node by applying the above rules.
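All three rules collapse into one comparison if the volume enlargement is computed first; a sketch of this choice in the style of Algorithm 3 (our illustration; volume() and enlarged() are assumed helpers returning the volume of a region and the region enlarged to contain point q, respectively):

/* returns the index of the child to follow when inserting point q;
   enl is 0 whenever q is already contained in the region, so containing
   regions win, and ties are broken by the smaller volume */
int choose_child(Page p, Point q) {
    int i, best = -1;
    float best_enl = INFINITE, best_vol = INFINITE;
    for (i = 0; i < p.num_objects; i++) {
        float vol = volume(p.region[i]);
        float enl = volume(enlarged(p.region[i], q)) - vol;
        if (enl < best_enl || (enl == best_enl && vol < best_vol)) {
            best = i; best_enl = enl; best_vol = vol;
        }
    }
    return best;
}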
Page overflows are generally handled by splitting the page. Four different algorithms have been published for the purpose of finding the right split dimension (also called the split axis) and the split hyperplane. They are distinguished according to their time complexity with varying page capacity C. Details are provided in Gaede and Günther [1998]:
—an exponential algorithm,
—a quadratic algorithm,
—a linear algorithm, and
—Greene's [1989] algorithm.

Fig. 17. Misled insert operations.
Guttman [1984] reports only slight differences between the linear and the quadratic algorithm; however, an evaluation study performed by Beckmann et al. [1990] reveals disadvantages for the linear algorithm. The quadratic algorithm and Greene's algorithm are reported to yield similar search performance.
In the insert algorithm, the suitable data page for the object is found in O(log n) time by examining a single path of the index. It seems to be an advantage that only a single path is examined for the determination of the data page into which a point is inserted. An uncontrolled number of paths, in contrast, would violate the demand of an O(n log n) time complexity for the index construction. Figure 17 shows, however, that inserts are often misled in such tie situations. It is intuitively clear that the point must be inserted into page p2,1, because p2,1 is the only page on the second index level that contains the point. But the insert algorithm faces a tie situation at the first index level because both pages, p1 as well as p2, cover the point. According to the heuristics, the smaller page p1 is chosen. The page p2,1, as a child of p2, will never be under consideration. The result of this misled insert is that the page p1,2 becomes unnecessarily enlarged by a large factor, and an additional overlap situation between the pages p1,2 and p2,1 arises. Therefore, overlap at or near the data level is mostly a consequence of some initial overlap in the directory levels near the root (which would, by itself, eventually be tolerable).

The initial overlap usually stems from the inability to split a higher-level page without overlap, because all child pages have independently grown extended page regions.
For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point. It has been shown in Berchtold et al. [1996] that the existence of such a point becomes less likely as the dimension of the data space increases. The reason simply is that the projection of each child page to an arbitrary dimension is not much smaller than the corresponding projection of the parent page. If we assume all page regions to be hypercubes of side length A (parent page) and a (child page), respectively, we get $a = A \cdot \sqrt[d]{1/C_{\mathrm{eff}}}$, which is substantially below A if d is small, but actually in the same order of magnitude as A if d is sufficiently high.
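To give a feeling for the magnitudes (our arithmetic, with an illustrative fanout): for C_eff = 32 child pages, a = A · 32^{−1/2} ≈ 0.18 · A in a two-dimensional space, but a = A · 32^{−1/20} ≈ 0.84 · A in a 20-dimensional space.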
The R*-tree [Beckmann et al. 1990] is an extension of the R-tree based on a careful study of the R-tree algorithms under various data distributions. In contrast to Guttman, who optimizes only for a small volume of the created page regions, Beckmann et al. identify the following optimization objectives:

—minimize overlap between page regions,
—minimize the surface of page regions,
—minimize the volume covered by internal nodes, and
—maximize the storage utilization.
The heuristic for the choice of a suitable page to insert a point is modified in the third alternative (no page region contains the point): here, a distinction is made according to whether the child page is a data page or a directory page. If it is a data page, then the region is taken that yields the smallest enlargement of the overlap. In the case of a tie, further criteria are the volume enlargement and the volume. If the child node is a directory page, the region with the smallest volume enlargement is taken. In case of doubt, the volume decides.
As in Greene's algorithm, the split heuristic proceeds in phases. In the first phase, the split dimension is determined:

—for each dimension, the objects are sorted according to their lower bound and according to their upper bound;
—a number of partitionings with a controlled degree of asymmetry are encountered; and
—for each dimension, the surface areas of the MBRs of all partitionings are summed up and the least sum determines the split dimension.

In the second phase, the split plane is determined, minimizing these criteria:

—overlap between the page regions, and
—when in doubt, least coverage of dead space.
Splits can often be avoided by the concept of forced reinsert. If a node overflow occurs, a defined percentage of the objects with the highest distances from the center of the region are deleted from the node and inserted into the index again, after the region has been adapted. By this means, the storage utilization grows to a factor between 71% and 76%. Additionally, the quality of partitioning improves, because unfavorable decisions made in the beginning of index construction can be corrected this way.
Performance studies report improvements between 10% and 75% over the R-tree. In higher-dimensional data spaces, the split algorithm proposed in Beckmann et al. [1990] leads to a deteriorated directory. Therefore, the R*-tree is not adequate for these data spaces; rather, it has to load the entire index in order to process most queries. A detailed explanation of this effect is given in Berchtold et al. [1996]. The basic problem of the R-tree, overlap arising at high index levels and then propagating down by misled insert operations, is alleviated by more appropriate heuristics but not solved.
The heuristic of the R*-tree split to optimize for page regions with a small surface (i.e., for square/cubelike page regions) is beneficial, in particular with respect to range queries and nearest-neighbor queries. As pointed out in Section 4 (cost models), the access probability corresponds to the Minkowski sum of the page region and the query sphere. The Minkowski sum primarily consists of the page region, which is enlarged at each surface segment. If the page regions are optimized for a small surface, they directly optimize the Minkowski sum. Figure 18 shows an extreme, nonetheless typical, example of volume-equivalent pages and their Minkowski sums. The square (1 × 1 unit) yields, with 3.78 units, a substantially lower Minkowski sum than the volume-equivalent rectangle (3 × 1/3 units) with 5.11 units. Note again that the effect becomes stronger with an increasing number of dimensions, as every dimension is a potential source of imbalance. For spherical queries, however, spherical page regions yield the lowest Minkowski sum (3.55 units). Spherical page regions are discussed later.

Fig. 18. Shapes of page regions and their suitability for similarity queries.
The R+-tree [Stonebraker et al. 1986; Sellis et al. 1987] is an overlap-free variant of the R-tree. To guarantee no overlap, the split algorithm is modified by a forced-split strategy: child pages that are an obstacle in the overlap-free splitting of some page are simply cut into two pieces at a suitable position. It is possible, however, that these forced splits must be propagated down until the data page level is reached. The number of pages can even increase exponentially from level to level. As we have pointed out before, the extension of the child pages is not much smaller than the extension of the parent if the dimension is sufficiently high. Therefore, high dimensionality leads to many forced split operations. Pages that are subject to a forced split are split although no overflow has occurred, and the resulting pages are utilized by less than 50%. The more forced splits are raised, the more the storage utilization of the complete index deteriorates.
A further problem, which more or less concerns all of the data organizing techniques described in this survey, is the decreasing fanout of the directory nodes with increasing dimension. For the R-tree family, for example, the internal nodes have to store 2d high and low bounds in order to describe a minimum bounding rectangle in d-dimensional space.

6.2. X-Tree
The R-tree and the R*-tree have primarily been designed for the management of spatially extended, two-dimensional objects, but have also been used for high-dimensional point data. Empirical studies [Berchtold et al. 1996; White and Jain 1996], however, showed a deteriorated performance of R*-trees for high-dimensional data. The major problem of R-tree-based index structures in high-dimensional data spaces is overlap. In contrast to low-dimensional spaces, there exist only a few degrees of freedom for splits in the directory. In fact, in most situations there exists only a single "good" split axis. An index structure that does not use this split axis will produce highly overlapping MBRs in the directory and thus show a deteriorated performance in high-dimensional spaces. Unfortunately, this specific split axis might lead to unbalanced partitions. In this case, a split should be avoided in order to avoid underfilled nodes.
The X-tree [Berchtold et al. 1996] is an extension of the R*-tree that is directly designed for the management of high-dimensional objects and based on the analysis of problems arising in high-dimensional data spaces. It extends the R*-tree by two concepts:

—overlap-free splits according to a split history, and
—supernodes with an enlarged page capacity.
If one records the history of data page splits in an R-tree-based index structure, this results in a binary tree. The index starts with a single data page A covering almost the whole data space and inserts data items. If the page overflows, the index splits the page into two new pages A and B. Later on, each of these pages might be split again into new pages. Thus the history of all splits may be described as a binary tree, having split dimensions (and positions) as nodes and having the current data pages as leaf nodes. Figure 19 shows an example of such a process.

Fig. 19. Example for the split history.

In the lower half of the figure, the corresponding directory node is depicted. If the directory node overflows, we have to divide the set of data pages (the MBRs A, B, C, D, E) into two partitions. Therefore, we have to choose a split axis first. What are potential candidates for split axes in our example? Say we chose dimension 5 as the split axis. Then we would have to put A and E into one of the partitions. However, A and E have never been split according to dimension 5; thus they span almost the whole data space in this dimension. If we put A and E into one of the partitions, the MBR of this partition in turn will span the whole data space, which obviously leads to high overlap with the other partition, regardless of that partition's shape. If one looks at the example in Figure 19, it becomes clear that only dimension 2 may be used as a split dimension. The X-tree generalizes this observation and always uses the split dimension with which the root node of the particular split tree is labeled. This guarantees an overlap-free directory. However, the split tree might be unbalanced. In this case it is advantageous not to split at all, because splitting would create one underfilled node and another almost overflowing node. Thus the storage utilization in the directory would decrease dramatically and the directory would degenerate. In this case the X-tree does not split and instead creates an enlarged directory node, a supernode. The higher the dimensionality, the more supernodes will be created and the larger the supernodes become. To also operate efficiently on lower-dimensional spaces, the X-tree split algorithm also includes a geometric split algorithm. The whole split algorithm works as follows. In the case of a data page split, the X-tree uses the R*-tree split algorithm or any other topological split algorithm. In the case of directory nodes, the X-tree first tries to split the node using a topological split algorithm. If this split leads to highly overlapping MBRs, the X-tree applies the overlap-free split algorithm based on the split history as described above. If this leads to an unbalanced directory, the X-tree simply creates a supernode.
The X-tree shows a high performance gain compared to R*-trees for all query types in medium-dimensional spaces. For small dimensions, the X-tree shows a behavior almost identical to R-trees; for higher dimensions, the X-tree also has to visit such a large number of nodes that a linear scan is less expensive. It is impossible to provide exact values here, because many factors, such as the number of data items, the dimensionality, the distribution, and the query type, have a high influence on the performance of an index structure.
6.3. Structures with a kd-Tree Directory

Like the R-tree and its variants, the kd-B-tree [Robinson 1981] uses hyperrectangle-shaped page regions. An adaptive kd-tree [Bentley 1975, 1979] is used for space partitioning (cf. Figure 20).

Fig. 20. The kd-tree.

Fig. 21. Incomplete versus complete decomposition for clustered and correlated data.

Therefore,
complete and disjoint space partitioning is guaranteed. Obviously, the page regions are (hyper)rectangles, but not minimum bounding rectangles. The general advantage of kd-tree-based partitioning is that the decision of which subtree to use is always unambiguous. The deletion operation is also supported in a better way than in R-tree variants, because leaf nodes with a common parent exactly comprise a hyperrectangle of the data space. Thus they can be merged without violating the conditions of complete, disjoint space partitioning.
Complete partitioning has the disadvantage that page regions are generally larger than necessary. Particularly in high-dimensional data spaces, large parts of the data space are often not occupied by data points at all. Real data are often clustered or correlated. If the data distribution is cluster shaped, it is intuitively clear that large parts of the space are empty. But the presence of correlations (i.e., one dimension is more or less dependent on the values of one or more other dimensions) also leads to empty parts of the data space, as depicted in Figure 21. Index structures that do not use complete partitioning are superior, because larger page regions yield a higher access probability and are therefore accessed more often during query processing than minimum bounding page regions. The second problem is that kd-trees are in principle unbalanced; therefore, it is not directly possible to pack contiguous subtrees into directory pages. The kd-B-tree approaches this problem by a concept involving forced splits:
If some page has an overflow condition, it is split by an appropriately chosen hyperplane. The entries are distributed among the two pages and the split is propagated up the tree. Unfortunately, regions on lower levels of the tree may also be intersected by the split plane; these regions must be split as well (forced split). As every region in the subtree can be affected, the time complexity of the insert operation is O(n) in the worst case. A minimum storage utilization guarantee cannot be provided; therefore, theoretical considerations about the index size are difficult.
The hB-tree (holey brick tree) [Lomet and Salzberg 1989, 1990; Evangelidis 1994] also uses a kd-tree directory to define the page regions of the index. In this approach, the splitting of a node is based on multiple attributes. This means that page regions do not correspond to solid rectangles but to rectangles from which other rectangles have been removed (holey bricks). With this technique, the forced splits of the kd-B-tree and the R+-tree are avoided.

Fig. 22. The kd-B-tree.

Fig. 23. The Minkowski sum of a holey brick.
For similarity search in high-dimensional spaces, we can state the same benefits and shortcomings of a complete space decomposition as for the kd-B-tree, depicted in Figure 22. In addition, we can state that the cavities of the page regions decrease the volume of the page region, but hardly decrease the Minkowski sum (and thus the access probability of a page). This is illustrated in Figure 23, where two large cavities are removed from a rectangle, reducing its volume by more than 30%. The Minkowski sum, however, is not reduced by the left cavity, because the cavity is not as wide as the perimeter of the query. In the second cavity, there is only a very small area in which the page region is not touched. Thus the cavities reduce the access probability of the page by less than 1%.
The directory of the LSDh-tree [Henrich 1998] is also an adaptive kd-tree [Bentley 1975, 1979] (see Figure 24). In contrast to R-tree variants and kd-B-trees, the region description is coded in a sophisticated way, leading to reduced space requirements for the region description. A specialized paging strategy collects parts of the kd-tree into directory pages. Some levels at the top of the kd-tree are assumed to be fixed in main memory; they are called the internal directory, in contrast to the external directory, which is subject to paging. In each node, only the split axis (e.g., 8 bits for up to 256-dimensional data spaces) and the position where the split plane intersects the split axis (e.g., 32 bits for a float number) have to be stored. Two pointers to child nodes require 32 bits each. To describe k regions, (k − 1) nodes are required, leading to a total amount of 104 · (k − 1) bits for the complete directory. R-tree-like structures require for each region description two float values for each dimension plus the child node pointer. Therefore, the lowest level of the directory alone needs (32 + 64 · d) · k bits for the region description. While the space requirement of the R-tree directory grows linearly with increasing dimension, it is constant (theoretically logarithmic, for very large dimensionality) for the LSDh-tree. Note that this argument also holds for the hBπ-tree; see Evangelidis et al. [1997] for a more detailed discussion of the issue. For 16-dimensional data spaces, R-tree directories are more than 10 times larger than the corresponding LSDh-tree directory.
The rectangle representing the region of a data page can be determined from the split planes in the directory. It is called the potential data region and is not explicitly stored in the index.
One disadvantage of the kd-tree directory is that the data space is completely covered with potential data regions. In cases where major parts of the data space are empty, this results in performance degeneration. To overcome this drawback, a concept called coded actual data regions (cadr) is introduced. The cadr is a multidimensional interval conservatively approximating the MBR of the points stored in a data page. To save space in the description of the cadr, the potential data region is quantized into a grid of 2^{z·d} cells; therefore, only 2 · z · d additional bits are required for each cadr. The parameter z can be chosen by the user; good results are achieved using z = 5. See Figure 25.

Fig. 24. The LSDh-tree.

Fig. 25. Region approximation using the LSDh-tree.
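To give a feeling for the savings (our arithmetic): with z = 5 and d = 16, a cadr costs 2 · 5 · 16 = 160 bits (20 bytes), whereas an exact MBR stored as 2 · 16 float values of 32 bits each would require 1024 bits (128 bytes).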
The most important advantage of the complete partitioning using potential data regions is that it allows a maintenance guaranteeing no overlap. It has been pointed out in the discussion of the R-tree variants and of the X-tree that overlap is a particular problem in high-dimensional data spaces. By the complete partitioning of the kd-tree directory, tie situations that lead to overlap do not arise. On the other hand, the regions of the index pages cannot adapt as well to changes in the actual data distribution as page regions that are not forced into the kd-tree directory. The description of the page regions in terms of splitting planes forces the regions to be overlap-free anyway. When a point has to be inserted into an LSDh-tree, there always exists a unique potential data region into which the point has to be inserted. In contrast, the MBR of an R-tree may have to be enlarged for an insert operation, which in some cases causes overlap between data pages. A situation where no overlap-free enlargement is possible is depicted in Figure 26. The coded actual data regions may have to be enlarged during an insert operation. As they are completely contained in a potential page region, overlap cannot arise either.

Fig. 26. No overlap-free insert is possible.
The split strategy for LSDh-trees is rather simple: the split dimension is increased by one compared to the parent node in the kd-tree directory. The only exception to this rule is that a dimension having too few distinct values for splitting is left out.
As reported in Henrich [1998], the LSDh-tree shows a performance very similar to that of the X-tree, except that inserts are done much faster in the LSDh-tree, because no complex computation takes place. When a bulk-loading technique is used to construct the index, both index structures are equal in performance. From an implementation point of view, both structures are also of similar complexity: the LSDh-tree has a rather complex directory structure and simple algorithms, whereas the X-tree has a rather straightforward directory and complex algorithms.
6.4. SS-Tree

In contrast to all previously introduced index structures, the SS-tree [White and Jain 1996] uses spheres as page regions. For maintenance efficiency, the spheres are not minimum bounding spheres. Rather, the centroid point (i.e., the average value in each dimension) is used as the center of the sphere, and the minimum radius is chosen such that all objects are included in the sphere. The region description therefore comprises the centroid point and the radius. This allows an efficient determination of MINDIST and MAXDIST, but not of MINMAXDIST. The authors suggest using the RKV algorithm, but they do not provide any hints on how to prune the branches of the index efficiently.
For insert processing, the tree is descended choosing the child node whose centroid is closest to the point, regardless of volume or overlap enlargement. Meanwhile, the new centroid point and the new radius are determined. When an overflow condition occurs, a forced reinsert operation is raised, as in the R*-tree: 30% of the objects with the highest distances from the centroid are deleted from the node, all region descriptions are updated, and the objects are reinserted into the index.
The split determination is merely based on the criterion of variance. First, the split axis is determined as the dimension yielding the highest variance. Then, the split plane is determined by evaluating all possible split positions that fulfill space utilization guarantees; among these, the sum of the variances on each side of the split plane is minimized.
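The following minimal sketch illustrates this split strategy (our own rendering of the description above, assuming a minimum utilization of 40%; the actual SS-tree parameters may differ):

import statistics

def sstree_split(points, min_util=0.4):
    """Split axis: dimension of highest variance; split position: minimal variance sum."""
    d = len(points[0])
    axis = max(range(d), key=lambda j: statistics.pvariance(p[j] for p in points))
    pts = sorted(points, key=lambda p: p[axis])
    lo = max(2, int(min_util * len(pts)))          # space utilization guarantee
    best = None
    for i in range(lo, len(pts) - lo + 1):         # assumes enough entries in the node
        cost = (statistics.pvariance(p[axis] for p in pts[:i])
                + statistics.pvariance(p[axis] for p in pts[i:]))
        if best is None or cost < best[0]:
            best = (cost, i)
    return pts[:best[1]], pts[best[1]:]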
It was pointed out already in Section 6.1 (cf. Figure 18 in particular) that spheres are theoretically superior to volume-equivalent MBRs because the Minkowski sum is smaller. The general problem of spheres is that they are not amenable to an easy overlap-free split, as depicted in Figure 27. MBRs have in general a smaller volume, and, therefore, the advantage in the Minkowski sum is more than compensated. The SS-tree outperforms the R*-tree by a factor of two; however, it does not reach the performance of the LSDh-tree and the X-tree.
6.5. TV-Tree

The TV-tree [Lin et al. 1995] is designed especially for real data that have been subjected to the Karhunen-Loève transform (also known as principal component analysis), a mapping that preserves distances and eliminates linear correlations. Such data yield a high variance, and therefore a good selectivity, in the first few dimensions, whereas the last few dimensions are of minor importance for query processing. Indexes storing KL-transformed data tend to have the following properties.
—The last few attributes are never used for cutting branches in query processing. Therefore, it is not useful to split the data space in the corresponding dimensions.
—Branching according to the first few attributes should be performed as early as possible, that is, in the topmost levels of the index. Then, the extension of the regions of lower levels (especially of data pages) is often zero in these dimensions.
Regions of the TV-tree are described using so-called telescope vectors (TV), that is, vectors that may be dynamically shortened. A region has k inactive dimensions and α active dimensions. The inactive dimensions form the greatest common prefix of the vectors stored in the subtree; therefore, the extension of the region is zero in these dimensions. In the α active dimensions, the region has the form of an L_p-sphere, where p may be 1, 2, or ∞. The region has an infinite extension in the remaining dimensions, which are supposed either to be active in the lower levels of the index or to be of minor importance for query processing. Figure 28 depicts the extension of a telescope vector in space.

Fig. 28. Telescope vectors.

The region description comprises α floating point values for the coordinates of the centerpoint in the active dimensions and one float value for the radius. The coordinates of the inactive dimensions are stored in higher levels of the index (exactly in the level where a dimension turns from active into inactive). To achieve a uniform capacity of directory nodes, the number α of active dimensions is constant in all pages. The concept of telescope vectors increases the capacity of the directory pages. It was experimentally determined that a low number of active dimensions (α = 2) yields the best search performance.
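For the L2 metric, the distance of a query point to such a region can be computed as in the following minimal sketch (our own derivation from the description above, not code from Lin et al. [1995]): inactive dimensions contribute their full squared distance, active dimensions contribute the distance to the sphere, and the remaining dimensions contribute nothing.

import math

def tv_mindist(q, inactive_prefix, active_center, radius):
    k, alpha = len(inactive_prefix), len(active_center)
    # Inactive dimensions: the region has extension zero, so they contribute fully.
    s = sum((q[i] - inactive_prefix[i]) ** 2 for i in range(k))
    # Active dimensions: distance to an L2-sphere of the given radius.
    da = math.sqrt(sum((q[k + j] - active_center[j]) ** 2 for j in range(alpha)))
    s += max(0.0, da - radius) ** 2
    # Remaining dimensions: infinite extension, contribution zero.
    return math.sqrt(s)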
The insert algorithm of the TV-tree chooses the branch in which to insert a point according to these criteria (with decreasing priority):
—minimum increase of the number of overlapping regions,
—minimum decrease of the number of inactive dimensions,
—minimum increase of the radius, and
—minimum distance to the center.
To cope with page overflows, the authors propose performing a reinsert operation, as in the R*-tree. The split algorithm determines the two seed-points (seed-regions in the case of a directory page) having the least common prefix or (in case of doubt) having maximum distance. The objects are then inserted into one of the new subtrees using the above criteria for the subtree choice in insert processing, while storage utilization guarantees are considered.
The authors report a good speedup in comparison to the R*-tree when applying the TV-tree to data that fulfill the precondition stated at the beginning of this section. Other experiments [Berchtold et al. 1996], however, show that the X-tree and the LSDh-tree outperform the TV-tree on uniform or other real data (not amenable to the KL transform).
6.6. SR-Tree

The SR-tree [Katayama and Satoh 1997] can be regarded as the combination of the R*-tree and the SS-tree. It uses the intersection solid between a rectangle and a sphere as the page region. The rectangular part is, as in R-tree variants, the minimum bounding rectangle of all points stored in the corresponding subtree. The spherical part is, as in the SS-tree, the minimum sphere around the centroid point of the stored objects. Figure 29 depicts the resulting geometric object. Regions of SR-trees have the most complex description among all index structures presented in this section: they comprise 2d floating point values for the MBR and d + 1 floating point values for the sphere.
The motivation for using a combination of sphere and rectangle, presented by the authors, is that according to an analysis presented in White and Jain [1996], spheres are basically better suited for processing nearest-neighbor and range queries using the L2 metric. On the other hand, spheres are difficult to maintain and tend to produce much overlap in splitting, as depicted previously in Figure 27. The authors therefore believe that a combination of R-tree and SS-tree will overcome both disadvantages.
The authors define the following function as the distance between a query point q and a region R:

MINDIST(q, R) = max(MINDIST(q, R.MBR), MINDIST(q, R.Sphere)).
This is not the correct minimum distance to the intersection solid, as depicted in Figure 30: both the distance to the MBR and the distance to the sphere (meeting the corresponding solids at the points M_MBR and M_Sphere, resp.) are smaller than the distance to the intersection solid, which is met in the point M_R where the sphere intersects the rectangle. However, it can be shown that the above function MINDIST(q, R) is a lower bound of the correct distance function. Therefore, it is guaranteed that processing of range and nearest-neighbor queries produces no false dismissals, but the efficiency can be worsened by the incorrect distance function. The MAXDIST function can be defined as the minimum of the MAXDIST functions applied to the MBR and to the sphere, although a similar error is made as in the definition of MINDIST. Since no MINMAXDIST definition exists for spheres, the MINMAXDIST function of the MBR must be applied. This is also correct in the sense that no false dismissals occur, but in this case no knowledge about the sphere is exploited at all, and some potential for performance increase is wasted. Using the definitions above, range and nearest-neighbor query processing using both the RKV algorithm and the HS algorithm are possible. A minimal sketch of the lower-bounding MINDIST is given below.
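This sketch is our own illustration (the helper functions and names are ours); the L2 metric is assumed.

import math

def mindist_mbr(q, lo, hi):
    """L2 distance from q to the closest point of the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def mindist_sphere(q, center, radius):
    """L2 distance from q to the closest point of the sphere."""
    d = math.sqrt(sum((x - c) ** 2 for x, c in zip(q, center)))
    return max(0.0, d - radius)

def mindist_sr(q, lo, hi, center, radius):
    # Lower bound of the distance to the intersection solid (see Figure 30):
    # never larger than the true distance, so no false dismissals occur.
    return max(mindist_mbr(q, lo, hi), mindist_sphere(q, center, radius))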
Fig. 29. Page regions of an SR-tree.

Fig. 30. Incorrect MINDIST in the SR-tree.

Insert processing and the split algorithm are taken from the SS-tree and modified only in a few details of minor importance. In addition to the algorithms for the SS-tree, the MBRs have to be updated and determined after inserts and node splits. Information about the MBRs is considered neither in the choice of branches nor in the determination of the split.
The reported performance results, compared to the SS-tree and the R*-tree, suggest that the SR-tree outperforms both index structures. It is, however, open whether the SR-tree outperforms the X-tree or the LSDh-tree; no experimental comparison has been done yet to the best of the authors' knowledge. Comparing the index structures indirectly, by comparing both to the performance of the R*-tree, we could draw the conclusion that the SR-tree does not reach the performance of the LSDh-tree or the X-tree.
6.7. Space Filling Curves

Space ๏ฌlling curves (for an overview see
Sagan [1994]) like Z-ordering [Morton
1966; Finkel and Bentley 1974; Abel and
Smith 1983; Orenstein and Merret 1984;
Orenstein 1990], Gray Codes [Faloutsos
1985, 1988], or the Hilbert curve [Faloutsos and Roseman 1989; Jagadish 1990;
Kamel and Faloutsos 1993] are mappings from a d -dimensional data space
(original space) into a one-dimensional
data space (embedded space). Using
space ๏ฌlling curves, distances are not
exactly preserved but points that are
close to each other in the original space
are likely to be close to each other in
the embedded space. Therefore, these
mappings are called distance-preserving
mappings.
Z-ordering is defined as follows. The data space is first partitioned into two halves of identical volume, perpendicular to the d0-axis. The volume on the side of the lower d0-values gets the name "0" (as a bit string); the other volume gets the name "1". Then each of the volumes is partitioned perpendicular to the d1-axis, and the resulting subpartitions of "0" get the names "00" and "01", and the subpartitions of "1" get the names "10" and "11", respectively. When all axes have been used for splitting, d0 is used for a second split, and so on. The process stops when a user-defined basic resolution br is reached. Then, we have a total number of 2^br grid cells, each with an individually numbered

Fig. 31. Examples of space ๏ฌlling curves.

bit string. If only grid cells with the basic resolution br are considered, all bit strings have the same length and can therefore be interpreted as binary representations of integer numbers. The other space filling curves are defined similarly, but the numbering scheme is slightly more sophisticated, in order to achieve that more neighboring cells get subsequent integer numbers. Some two-dimensional examples of space filling curves are depicted in Figure 31.
Datapoints are transformed by assigning them the number of the grid cell in which they are located. Without presenting the details, we let SFC(p) be the function that assigns p to the corresponding grid cell number. Vice versa, SFC^-1(c) returns the corresponding grid cell as a hyperrectangle. Then any one-dimensional indexing structure capable of processing range queries can be applied for storing SFC(p) for every point p in the database. We assume in the following that a B+-tree [Comer 1979] is used.
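For Z-ordering, SFC(p) is a simple bit interleaving, as in the following minimal sketch (our own illustration; a unit data space [0,1)^d is assumed):

def z_value(point, br, d):
    """Map a point in [0,1)^d to its Z-order grid cell number at basic resolution br.

    br is the total number of bits; the splits cycle through the dimensions
    d0, d1, ..., d_{d-1}, d0, ... as described above.
    """
    bits_per_dim = -(-br // d)                       # ceil(br / d)
    coords = [int(x * (1 << bits_per_dim)) for x in point]
    c = 0
    for k in range(br):                              # k-th split of the data space
        dim, level = k % d, k // d
        bit = (coords[dim] >> (bits_per_dim - 1 - level)) & 1
        c = (c << 1) | bit
    return c

For example, with br = 4 and d = 2, the point (0.25, 0.75) is mapped to cell number 7.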
Processing of insert and delete operations and exact match queries is very simple because the points inserted or sought merely have to be transformed using the SFC function.
In contrast, range and nearest-neighbor queries are based on distance calculations of page regions, which have to be determined accordingly. In B-trees, before a page is accessed, only the interval I = [lb..ub] of values in this page is known. Therefore, the page region is the union of all grid cells having a cell number between lb and ub; the region of an index based on a space filling curve is thus a combination of rectangles. Based on this observation, we can define a corresponding MINDIST and analogously a MAXDIST function:
MINDIST(q, I) = min_{lb ≤ c ≤ ub} {MINDIST(q, SFC^-1(c))},
MAXDIST(q, I) = max_{lb ≤ c ≤ ub} {MAXDIST(q, SFC^-1(c))}.

Again, no MINMAXDIST function can be provided because there is no minimum bounding property to exploit. The question is how these functions can be evaluated efficiently, without enumerating all grid cells in the interval [lb..ub]. This is possible by splitting the interval recursively into two parts [lb..s[ and [s..ub], where s has the form p100...00. Here, p stands for the longest common prefix of lb and ub. Then we determine the MINDIST and the MAXDIST to the rectangular blocks numbered with the bit strings "p0" and "p1". Any interval having a MINDIST greater than the MAXDIST of any other interval, or greater than the MINDIST of any terminating interval (see later), can be excluded from further consideration. The decomposition of an interval stops when the interval covers exactly one rectangle; such an interval is called a terminal interval. MINDIST(q, I) is then the minimum among the MINDISTs of all terminal intervals. An example is presented in Figure 32. The shaded area is the page region, a set of contiguous grid cell values I. In the first step, the interval is split into two parts I1 and I2, determining the MINDIST and MAXDIST (not depicted) of the surrounding rectangles. I1 is terminal, as it comprises a rectangle. In the second step, I2 is split into I21
Fig. 32. MINDIST determination using space ๏ฌlling curves.

and I22, where I21 is terminal. Since the MINDIST to I21 is smaller than the other two MINDIST values, I1 and I22 are discarded. Therefore MINDIST(q, I21) equals MINDIST(q, I). A similar algorithm to determine MAXDIST(q, I) would exchange the roles of MINDIST and MAXDIST.
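The following minimal sketch renders this decomposition for Z-ordering on the unit data space (our own illustration; for brevity it recurses into both sub-intervals instead of pruning them by MINDIST/MAXDIST comparisons as described above). mindist_mbr is repeated here for self-containment.

def mindist_mbr(q, lo, hi):
    return sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)) ** 0.5

def rect_from_prefix(bits, d):
    """Hyperrectangle (in [0,1]^d) of the block named by a bit-string prefix."""
    lo, hi = [0.0] * d, [1.0] * d
    for k, b in enumerate(bits):
        dim = k % d                          # splits cycle through the dimensions
        mid = (lo[dim] + hi[dim]) / 2.0
        if b == '0':
            hi[dim] = mid
        else:
            lo[dim] = mid
    return lo, hi

def mindist_interval(q, lb, ub, br, d):
    """Minimum distance from q to the union of the grid cells numbered lb..ub."""
    # Longest common prefix p of lb and ub (viewed as br-bit strings).
    plen = 0
    while plen < br and ((lb >> (br - plen - 1)) & 1) == ((ub >> (br - plen - 1)) & 1):
        plen += 1
    suffix = br - plen
    if lb % (1 << suffix) == 0 and (ub + 1) % (1 << suffix) == 0:
        # Terminal interval: it covers exactly the block named by the prefix.
        prefix = format(lb >> suffix, f'0{plen}b') if plen else ''
        return mindist_mbr(q, *rect_from_prefix(prefix, d))
    # Split at s = p100...00 and recurse into [lb..s-1] and [s..ub].
    s = (ub >> (suffix - 1)) << (suffix - 1)
    return min(mindist_interval(q, lb, s - 1, br, d),
               mindist_interval(q, s, ub, br, d))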
6.8. Pyramid-Tree

The Pyramid-tree [Berchtold et al. 1998b] is an index structure that, similar to the Hilbert technique, maps a d-dimensional point into a one-dimensional space and uses a B+-tree to index the one-dimensional space. Obviously, queries have to be translated in the same way. In the data pages of the B+-tree, the Pyramid-tree stores both the d-dimensional points and the one-dimensional key. Thus, no inverse transformation is required and the refinement step can be done without lookups to another file. The specific mapping used by the Pyramid-tree is called Pyramid-mapping. It is based on a special partitioning strategy that is optimized for range queries on high-dimensional data. The basic idea is to divide the data space such that the resulting partitions are shaped like the peels of an onion. Such partitions cannot be stored efficiently by R-tree-like or kd-tree-like index structures. The Pyramid-tree achieves the partitioning by first dividing the d-dimensional space into 2d pyramids having the centerpoint of the space as their top. In a second step, the single pyramids are cut into slices parallel to the basis of the pyramid, forming the data pages. Figure 33 depicts this partitioning technique.

Fig. 33. Partitioning the data space into pyramids.

Fig. 34. Properties of pyramids: (a) numbering of
pyramids; (b) point in pyramid.

This technique can be used to compute a mapping as follows. In the first step, we number the pyramids as shown in Figure 34(a). Given a point, it is easy to determine in which pyramid it is located. Then we determine the so-called height of the point within its pyramid, that is, the orthogonal distance of the point to the centerpoint of the data space, as shown in Figure 34(b). In order to map a d-dimensional point to a one-dimensional value, we simply add the two numbers: the number of the pyramid in which the point is located and the height of the point within this pyramid.
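A minimal sketch of this mapping (our own rendering of the description above; the precise definition, including the treatment of ties and the translation of queries, is given in Berchtold et al. [1998b]):

def pyramid_value(v):
    """Map a point v in [0,1]^d to its one-dimensional Pyramid key."""
    d = len(v)
    # The pyramid of v is determined by the dimension deviating most
    # from the center point (0.5, ..., 0.5).
    jmax = max(range(d), key=lambda j: abs(0.5 - v[j]))
    i = jmax if v[jmax] < 0.5 else jmax + d        # 2d pyramids, numbered 0..2d-1
    height = abs(0.5 - v[jmax])                    # orthogonal distance to the center
    return i + height                              # height < 0.5, so keys cannot collide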
Query processing is a nontrivial task on a Pyramid-tree because for a given query range, we have to determine the affected pyramids and the affected heights within these pyramids. The details of how this can be done are explained in Berchtold et al. [1998b]. Although the details of the algorithm are hard to understand, it is not computationally hard; rather, it consists of a variety of cases that have to be distinguished and of simple computations.
Table I. High-Dimensional Index Structures and Their Properties

Name | Region | Disjoint | Complete | Criteria for Insert | Criteria for Split | Reinsert
R-tree | MBR | No | No | Volume enlargement, volume | (Various algorithms) | No
R*-tree | MBR | No | No | Overlap enlargement, volume enlargement, volume | Surface area, overlap, dead space coverage | Yes
X-tree | MBR | No | No | Overlap enlargement, volume enlargement, volume | Split history, surface/overlap, dead space coverage | No
LSDh-tree | kd-tree region | Yes | No/Yes | (Unique due to complete, disjoint part.) | Cyclic change of dim., distinct values | No
SS-tree | Sphere | No | No | Proximity to centroid | Variance | Yes
TV-tree | Sphere with reduced dim. | No | No | Overlap regions, inactive dim., radius of region, distance to center | Seeds with least common prefix, maximum distance | Yes
SR-tree | Intersect. sphere/MBR | No | No | Proximity to centroid | Variance | Yes
Space filling curves | Union of rectangles | Yes | Yes | (Unique due to complete, disjoint part.) | According to space filling curve | No
Pyramid-tree | Trunks of pyramids | Yes | Yes | (Unique due to complete, disjoint part.) | According to Pyramid-mapping | No

The Pyramid-tree is the only index structure known thus far that is not affected by the so-called curse of dimensionality. This means that for uniform data and range queries, the performance of the Pyramid-tree even improves if one increases the dimensionality of the data space. An analytical explanation of this phenomenon is given in Berchtold et al. [1998b].
6.9. Summary

Table I shows the index structures described above and their most important properties. The first column contains the name of the index structure, the second shows which geometrical region is represented by a page, and the third and fourth columns show whether the index structure provides a disjoint and complete partitioning of the data space. The last three columns describe the algorithms used: what strategy is used to insert new data items (column 5), what criteria are used to determine the division of objects into subpartitions in case of an overflow (column 6), and whether the insert algorithm uses the concept of forced reinserts (column 7).
Since, so far, no extensive and objective comparison between the different index structures has been published, only structural arguments may be used to compare the different approaches. Experimental comparisons tend to depend highly on the data used in the experiments; even higher is the influence of seemingly minor parameters such as the size and location of queries, or their statistical distribution. The higher the dimensionality of the data, the more these influences lead to different results. Thus we provide a comparison among the indexes listing only properties, not trying to say anything about the "overall" performance of a single index. In fact, most probably, there is no overall performance; rather, one index will
Table II. Qualitative Comparison of High-Dimensional Index Structures

Name | Problems in High-D | Supported Query Types | Locality of Node Splits | Storage Utilization | Fanout / Size of Index Entries
R-tree | Poor split algorithm leads to deteriorated directories | NN, region, range | Yes | Poor | Poor, linearly dimension dependent
R*-tree | Dto. | NN, region, range | Yes | Medium | Poor, linearly dimension dependent
X-tree | High probability of queries overlapping MBRs leads to poor performance | NN, region, range | Yes | Medium | Poor, linearly dimension dependent
LSDh-tree | Changing data distribution deteriorates directory | NN, region, range | No | Medium | Very good, dimension independent
SS-tree | High overlap in directory | NN | Yes | Medium | Very good, dimension independent
TV-tree | Only useful for specific data | NN | Yes | Medium | Poor, somewhat dimension dependent
SR-tree | Very large directory sizes | NN | Yes | Medium | Very poor, linearly dimension dependent
Space filling curves | Poor space partitioning | NN, region, range | Yes | Medium | As good as B-tree, dimension independent
Pyramid-tree | Problems with asymmetric queries | Region, range | Yes | Medium | As good as B-tree, dimension independent

outperform other indexes in one special situation, whereas the same index may be quite useless for other configurations of the database. Table II shows such a comparison. The first column lists the name of the index; the second column explains the biggest problem of this index when the dimension increases. The third column lists the supported types of queries. In the fourth column, we show whether a split in the directory causes "forced splits" on lower levels of the directory. The fifth column shows the storage utilization of the index, which is only a statistical value depending on the type of data and, sometimes, even on the order of insertion. The last column concerns the fanout in the directory, which in turn depends on the size of a single entry in a directory node.
7. IMPROVEMENTS, OPTIMIZATIONS, AND
FUTURE RESEARCH ISSUES

During the past years, a significant amount of work has been invested not to develop new index structures but to improve the performance of existing index structures. As a result, a variety of techniques has been proposed for using or tuning index structures. In this section, we present a selection of those techniques. Furthermore, we point out a selection of problems that have not yet been addressed in the context of high-dimensional indexing, or whose solutions cannot be considered sufficient.
Tree-Striping

From the variety of cost models that have been developed, one might conclude that if the data space has a sufficiently high dimensionality, no index structure can succeed. This has been contradicted by the development of index structures that are not severely affected by the dimensionality of the data space. On the other hand, one has to be very careful in judging the implications of a specific cost model. A lesson all researchers in the area of high-dimensional index structures have learned is that things are very sensitive to changes of parameters: a model of nearest-neighbor queries cannot directly be used to make claims about the behavior in the case of range queries. Still, the research community agrees that in the case of nearest-neighbor queries, there exists a dimension above which a sequential scan will be faster than any indexing technique for most relevant data distributions.
Tree-striping is a technique that tackles the problem from a different perspective: if it is hard to solve the d-dimensional problem of query processing, why not try to solve k l-dimensional problems, where k · l = d? The specific work presented in Berchtold et al. [2000c] focuses on the processing of range queries in a high-dimensional space. It generalizes the well-known inverted lists and multidimensional indexing approaches. A theoretical analysis of the generalized technique shows that both inverted lists and multidimensional indexing approaches are far from being optimal. A consequence of the analysis is that the use of a set of multidimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree-striping is to use the optimal number k of lower-dimensional indexes, determined by a theoretical analysis, for efficient query processing. A given query is split into k lower-dimensional queries and processed independently; in a final step, the single results are merged. As the merging step also involves I/O costs, and these costs increase with decreasing dimensionality of a single index, there exists an optimal dimensionality for the single indexes that can be determined analytically. Note that tree-striping has serious limitations, especially for nearest-neighbor queries and skewed data where, in many cases, the d-dimensional index performs better than any lower-dimensional index.
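A minimal sketch of the query decomposition (our own illustration of the idea; range_query is an assumed API of the lower-dimensional indexes, and the optimal k is determined by the cost analysis in Berchtold et al. [2000c]):

def stripe_dimensions(d, k):
    """Partition the dimensions 0..d-1 into k stripes."""
    return [list(range(j, d, k)) for j in range(k)]

def striped_range_query(indexes, stripes, query_lo, query_hi):
    """Process a d-dimensional range query as k lower-dimensional queries."""
    candidates = None
    for index, dims in zip(indexes, stripes):
        lo = [query_lo[j] for j in dims]
        hi = [query_hi[j] for j in dims]
        ids = set(index.range_query(lo, hi))   # assumed index API
        candidates = ids if candidates is None else candidates & ids
    return candidates                          # merged (intersected) result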
Voronoi Approximations

In another approach [Berchtold et al. 1998c, 2000d] to overcome the curse of dimensionality for nearest-neighbor search, the results of any nearest-neighbor search are precomputed. This corresponds to a computation of the Voronoi cell of each datapoint, where the Voronoi cell of a point p contains all points that have p as their nearest neighbor. In high-dimensional spaces, the exact computation of a Voronoi cell is computationally very hard. Thus, rather than computing exact Voronoi cells, the algorithm stores conservative approximations of the Voronoi cells in an index structure that is efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although the technique is based on a precomputation of the solution space, it is dynamic; that is, it supports insertions of new datapoints. Furthermore, an extension of the technique to a k-nearest-neighbor search is given in Berchtold et al. [2000d].
Parallel Nearest-Neighbor Search

Most similarity search techniques map the data objects into some high-dimensional feature space; the similarity search then corresponds to a nearest-neighbor search in the feature space, which is computationally very intensive. In Berchtold et al. [1997a], the authors present a parallel method for fast nearest-neighbor search in high-dimensional feature spaces. The core problem of designing a parallel nearest-neighbor algorithm is to find an adequate distribution of the data onto the disks. Unfortunately, the known declustering methods do not perform well for high-dimensional nearest-neighbor search. In contrast, the proposed method has been optimized based on the special properties of high-dimensional spaces and therefore provides a near-optimal distribution of the data items among the disks. The basic idea of this data declustering technique is to assign the buckets corresponding to different quadrants of the data space to different disks. The authors show that their technique (in contrast to other declustering methods) guarantees that all buckets corresponding to neighboring quadrants are assigned to different disks. The specific mapping of points to disks is done by the following formula:
col(c) = XOR_{i=0}^{d-1} ( (i + 1) if c_i = 1, and 0 otherwise ).

The input is a bit string defining the quadrant in which the point to be declustered is located. Not every number of disks may be used for this declustering technique; in fact, the number required is linear in the number of dimensions. Therefore, the authors present an extension of their technique adapted to an arbitrary number of disks. A further extension is a recursive declustering technique that allows an improved adaptation to skewed and correlated data distributions.
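A minimal sketch of the basic disk assignment (our own rendering of the formula above):

from functools import reduce

def disk_color(c):
    """XOR of (i+1) over all set quadrant bits c_i."""
    return reduce(lambda acc, i: acc ^ (i + 1),
                  (i for i, bit in enumerate(c) if bit == 1), 0)

# Example: in d = 3 dimensions, quadrant (1, 0, 1) is assigned disk 1 ^ 3 = 2.

Neighboring quadrants differ in exactly one bit i, so their colors differ by XOR with the nonzero value i + 1; this is why neighboring quadrants always end up on different disks.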
An approach for similarity query processing using disk arrays is presented in Papadopoulos and Manolopoulos [1998]. The authors propose two new algorithms for nearest-neighbor search on a single processor and multiple disks. Their solution relies on a well-known page distribution technique for low-dimensional data spaces [Kamel and Faloutsos 1992] called a proximity index. Upon a split, the MBR of a newly created node is compared with the MBRs stored in its father node (i.e., its siblings). The new node is assigned to the disk that stores the "least proximal" pages with respect to the new page region; thus the selected disk contains sibling nodes that are far from the new node. The first algorithm, called full parallel similarity search (FPSS), determines the threshold sphere (cf. Figure 35), an upper bound of the nearest-neighbor distance according to the maximum distance between the query point and the nearest page region. Then, all pages that are not pruned by the threshold sphere are called in by a parallel request to all disks. The second algorithm, candidate reduction similarity search (CRSS), applies a heuristic that leads to an intermediate form between depth-first and breadth-first search of the index. Pages that are completely contained in the threshold sphere are processed with a higher priority

Fig. 35. The threshold sphere for FPSS and CRSS.

ity than pages that are merely intersected
by it. The authors compare FPSS and
CRSS with a (not existing) optimal parallel algorithm that knows the distance
of the nearest neighbor in advance and
report up to 100% more page accesses of
CRSS compared to the optimal algorithm.
The same authors also propose a solution
for shared-nothing parallel architectures
[Papadopoulos and Manolopoulos 1997a].
Their architecture distributes the data
pages of the index over the secondary
servers while the complete directory is
held in the primary server. Their static
page distribution strategy is based on
a fractal curve (sorting according to
the Hilbert value of the centroid of the
MBR). The k-nn algorithm ๏ฌrst performs
a depth-๏ฌrst search in the directory.
When the bottom level of the directory is
reached, a sphere around the query point
is determined that encloses as many data
pages as required to guarantee that k
points are stored in them (i.e., assuming
that the page capacity is โ‰ฅ k, the sphere
is chosen such that one page is completely
contained). A parallel range query is
performed, ๏ฌrst accessing a smaller
number of data pages obtained by a
cost model.
Compression Techniques

Recently, the VA-file [Weber et al. 1998] was developed, an index structure that is actually not an index structure. Based on the cost model proposed in Berchtold et al. [1997b], the authors prove that under certain assumptions, above a certain dimensionality no index structure can process a nearest-neighbor query efficiently. Therefore, they suggest accelerating the sequential scan by the use of data compression.
The basic idea of the VA-file is to keep two files: a bit-compressed, quantized version of the points and their exact representation. Both files are unsorted; however, the positions of the points in the two files agree.
The quantization of the points is determined by an irregular grid laid over the data space. The resolution of the grid in each dimension corresponds to 2^b, where b is the number of bits per dimension that are used to approximate the coordinates. The grid lines correspond to the quantiles of the projections of the points onto the corresponding axes. These quantiles are assumed to change rarely; changing the quantiles requires a reconstruction of the compressed file. The k-nearest-neighbor queries are processed by the multistep paradigm: the quantized points are loaded into main memory by a sequential scan (filter step), and candidates that cannot be pruned are refined, that is, their exact coordinates are called in from the second file. Several access strategies for timing filter and refinement steps have been proposed. Basically, the speedup of the VA-file compared to a simple sequential scan corresponds to the compression rate, because reading large files sequentially from disk yields a linear time complexity with respect to the file length. The computational effort of determining distances between the query point and the quantized datapoints is also improved compared to the sequential scan, by precomputing the squared distances between the query point and the grid lines. CPU speedups, however, do not yield large factors and are independent of the compression rate. The most important overhead in query processing is the refinements, each of which requires an expensive random disk access. With decreasing resolution, the number of points to be refined increases, thus limiting the compression ratio. The authors report a number of five to six bits per dimension to be optimal.
There are some major drawbacks of the VA-file. First, the deterioration of index structures is much more prevalent in artificial data than in data sets from real-world applications. For such data,

Fig. 36. Structure of the IQ-tree.
index structures are efficiently applicable for much higher dimensions. The second drawback is the number of bits per dimension, which is a system parameter; unfortunately, the authors do not provide any model or guideline for the selection of a suitable bit rate. To overcome these drawbacks, the IQ-tree has recently been proposed by Berchtold et al. [2000a]; it is a three-level tree index structure exploiting quantization (cf. Figure 36). The first level is a flat directory consisting of MBRs and the corresponding pointers to pages on the second level. The pages on the second level contain the quantized versions of the datapoints. In contrast to the VA-file, the quantization is not based on quantiles but is a regular decomposition of the page regions. The authors claim that regular quantization based on the page regions adapts equally well to skewed and correlated data distributions as quantiles do. The suitable compression rate is determined for each page independently, according to a cost model proposed in Berchtold et al. [2000b]. Finally, the bottom level of the IQ-tree contains the exact representation of the datapoints. For processing of nearest-neighbor queries, the authors propose a fast index scan that essentially subsumes the advantages of indexes and scan-based methods. The algorithm collects accesses to neighboring pages and performs chained I/O requests, where the length of such chains is determined according to a cost model. In situations where a sequential scan is clearly indicated, the algorithm degenerates automatically to
the sequential scan. In other situations, where the search can be directly designated, the algorithm performs the priority search of the HS algorithm. In intermediate situations, the algorithm accesses chains of intermediate length, thus clearly outperforming both the sequential scan and the HS algorithm. The bottom level of the IQ-tree is accessed according to the usual multistep strategy.
Bottom-Up Construction

Usually, the performance of dynamically inserting a new datapoint into a multidimensional index structure is poor. The reason is that most structures have to consider multiple paths in the tree where the point could be inserted. Furthermore, split algorithms are complex and computationally intensive; for example, a single split in an X-tree might take up to the order of a second to perform. Therefore, a number of bulk-load algorithms for multidimensional index structures have been proposed. Bulk-loading an index means building an index on an entire database in a single process, which can be done much more efficiently than inserting the points one at a time. Most bulk-load algorithms, such as the one proposed in van den Bercken et al. [1997], are not especially adapted to the case of high-dimensional data spaces. In Berchtold et al. [1998a], however, the authors proposed a bulk-loading technique for high-dimensional indexes that exploits a priori knowledge of the complete data set to improve both construction time and query performance. The algorithm operates in a manner similar to the Quicksort algorithm and has an average run-time complexity of O(n log n). In contrast to other bulk-loading techniques, the query performance is additionally improved by optimizing the shape of the bounding boxes, by completely avoiding overlap, and by clustering the pages on disk. A sophisticated unbalanced split strategy is used, leading to a better space partitioning.
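The following minimal sketch conveys the top-down, recursive idea (our own illustration; the actual algorithm in Berchtold et al. [1998a] avoids fully sorting each partition, achieving O(n log n) on average, and chooses unbalanced split positions, which we do not model here):

def bulk_load(points, capacity, dim=0):
    """Recursively partition the data set until each partition fits one data page."""
    if len(points) <= capacity:
        return [points]                     # one data page
    points.sort(key=lambda p: p[dim])       # split along one dimension
    mid = len(points) // 2                  # balanced split, for simplicity
    next_dim = (dim + 1) % len(points[0])
    return (bulk_load(points[:mid], capacity, next_dim)
            + bulk_load(points[mid:], capacity, next_dim))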
Another important issue is to apply the knowledge aggregated in this field to other areas such as data reduction, data mining (e.g., clustering), or visualization, where people have to deal with tens to hundreds of attributes and therefore face a high-dimensional data space. Most of the lessons learned also apply to these areas. Examples of successful approaches making use of these side-effects are Agrawal et al. [1998] and Berchtold et al. [1998d].
Future Research Issues

Although significant progress has been made in understanding the nature of high-dimensional spaces and in developing techniques that can operate in these spaces, there still are many open questions.
A first problem is that most of the understanding that the research community developed during the last years is restricted to the case of uniform and independent data. Not only are all proposed indexing techniques optimized for this case; almost all theoretical considerations such as cost models are also restricted to this simple case. The interesting observation is that index structures do not suffer from "real" data. Rather, they nicely take advantage of nonuniform distributions. In fact, a uniform distribution seems to be the worst thing that can happen to an index structure. One reason for this effect is that often the data are located only in a subspace of the data space, and if the index adapts to this situation, it actually behaves as if the data were lower-dimensional. A promising approach to understanding and explaining this effect theoretically has been followed in Faloutsos and Kamel [1994] and Böhm [1998], where the concept of the fractal dimension is applied. However, even this approach cannot cover "real" effects such as local skewness.
A second interesting research issue concerns the partitioning strategies that perform well in high-dimensional spaces. As previous research (e.g., the Pyramid-tree) has shown, the partitioning does not have to be balanced to be optimal for certain queries. The open question is what an optimal partitioning schema for nearest-neighbor queries would be. Does it need to be balanced or rather unbalanced? Is it based upon bounding boxes or on pyramids? How does the optimum change when the data set grows in size or dimensionality? There are many open questions that need to be answered.
A third open research issue is the approximate processing of nearest-neighbor queries. The first question is what a useful definition of approximate nearest-neighbor search in high-dimensional spaces is, and how the fuzziness introduced by the definition may be exploited for an efficient query processing. A first approach to approximate nearest-neighbor search has been proposed in Gionis et al. [1999].
Other interesting research issues include the parallel processing of nearest-neighbor queries in high-dimensional space and the data mining and visualization of high-dimensional spaces. The parallel processing aims at finding appropriate declustering and query processing strategies to overcome the difficulties in high-dimensional spaces; a first approach in this direction has been presented in Berchtold et al. [1997a]. The efforts in the area of data mining and visualization of high-dimensional feature spaces (for an example see Hinneburg and Keim [1998]) try to understand and explore high-dimensional feature spaces. Also, the application of compression techniques to improve the query performance is an interesting and promising research area; a first approach, the VA-file, has recently been proposed in Weber et al. [1998].
8. CONCLUSIONS

Research in high-dimensional index structures has been very active and productive over the past few years, resulting in a multitude of interesting new approaches for indexing high-dimensional data. Since it is very difficult to follow up on this discussion, in this survey we tried to provide insight into the effects occurring in indexing high-dimensional spaces and to give an overview of the principal ideas of the index structures that have been proposed to overcome these problems. There are still a number of interesting open research problems, and we expect the field to remain a fruitful research area over the next years. Due to the increasing importance of multimedia databases in various application areas, and due to the remarkable results of the research, we also expect the research on high-dimensional indexing to have a major impact on many practical applications and commercial multimedia database systems.
APPENDIX
A. LEMMA 1

The RKV algorithm has a worst-case space complexity of O(log n).

PROOF. The only source of dynamic memory assignment in the RKV algorithm is its recursive calls. The recursion depth is at most equal to the height of the indexing structure. The height of all high-dimensional index structures presented in this section is of complexity O(log n). Since a constant amount of memory (one data or directory page) is allocated in each call, Lemma 1 follows.

B. LEMMA 2

The HS algorithm has a space complexity of O(n) in the worst case.

PROOF. The following scenario describes the worst case. Query processing starts with the root in the APL. The root is replaced by its child nodes, which are on level h − 1 if h is the height of the index. All nodes on level h − 1 are replaced by their child nodes, and so on, until all data nodes are in the APL. At this state it is possible that no data page has been excluded from the APL because no datapoint has been encountered yet. This situation occurs, for example, if all data objects are located on a sphere around the query point. Thus, all data pages are in the APL, and the APL is maximal because the APL grows only by replacing a page by its descendants. If all data pages are in the APL, it has a length of O(n).
C. LEMMA 3

Let nndist be the distance between the query point and its nearest neighbor. All pages that intersect a sphere around the query point having a radius equal to nndist (the so-called nearest-neighbor sphere) must be accessed for query processing. This condition is both necessary and sufficient.

PROOF.
1. Sufficiency: If all data pages intersecting the nn-sphere are accessed, then all points in the database with a distance less than or equal to nndist are known to the query processor. No point closer than the nearest known point can exist in the database.
2. Necessity: If a page region intersects the nearest-neighbor sphere but is not accessed during query processing, the corresponding subtree could include a point that is closer to the query point than the nearest-neighbor candidate. Therefore, accessing all intersecting pages is necessary.
D. LEMMA 4

The HS algorithm accesses pages in the order of increasing distance to the query point.

PROOF. Due to the lower-bounding property of page regions, the distance between the query point and a page region is always greater than or equal to the distance between the query point and the region of the parent of the page. Therefore, the minimum distance between the query point and any page in the APL can only increase or remain unchanged, never decrease, in the processing step of loading a page and replacing the corresponding APL entry. Since the active page with minimum distance is always accessed, the pages are accessed in the order of increasing distance to the query point.
E. LEMMA 5

The HS algorithm is optimal in terms of
the number of page accesses.

PROOF. According to Lemma 4, the HS algorithm accesses pages in the order of increasing distance to the query point q. Let m be the lowest MINDIST in the APL. Processing stops if the distance of q to the closest point candidate (cpc) is less than m. Due to the lower-bounding property, processing of any page in the APL cannot encounter any point with a distance to q less than m; the distance between the cpc and q cannot fall below m during processing. Therefore, exactly the pages with a MINDIST less than or equal to the nearest-neighbor distance are processed by the HS algorithm. According to Lemma 3, these pages must be loaded by any correct nearest-neighbor algorithm. Thus, the HS algorithm yields an optimal number of page accesses.

REFERENCES
ABEL, D. AND SMITH, J. 1983. A data structure and
algorithm based on a linear key for a rectangle
retrieval problem. Comput. Vis. 24, 1โ€“13.
AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms, LNCS 730, 69–84.
AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN,
P. 1998. Automatic subspace clustering of
high-dimensional data for data mining applications. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Seattle), 94โ€“105.
AGRAWAL, R., LIN, K., SAWHNEY, H., AND SHIM, K.
1995. Fast similarity search in the presence
of noise, scaling, and translation in time-series
databases. In Proc. 21st Int. Conf. on Very Large
Databases, 490โ€“501.
ALTSCHUL, S., GISH, W., MILLER, W., MYERS, E.,
AND LIPMAN, D. 1990. A basic local alignment
search tool. J. Molecular Biol. 215, 3, 403โ€“410.
AOKI, P. 1998. Generalizing "search" in generalized search trees. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL), 380–389.
AREF, W. AND SAMET, H. 1991. Optimization strategies for spatial query processing. In Proc. 17th
Int. Conf. on Very Large Databases (Barcelona),
81โ€“90.
ARYA, S. 1995. Nearest neighbor searching and applications. PhD thesis, University of Maryland,
College Park, MD.
ARYA, S., MOUNT, D., AND NARAYAN, O. 1995. Accounting for boundary effects in nearest neighbor searching. In Proc. 11th Symp. on Computational Geometry (Vancouver, Canada), 336โ€“344.
BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proc. Combinatorial Pattern Matching, LNCS 807, 198–212.
BAYER, R. AND MCCREIGHT, E. 1977. Organization
and maintenance of large ordered indices. Acta
Inf. 1, 3, 173โ€“189.
BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Atlantic City, NJ), 322–331.
BELUSSI, A. AND FALOUTSOS, C. 1995. Estimating
the selectivity of spatial queries using the correlation fractal dimension. In Proc. 21st Int.
Conf. on Very Large Databases (Zurich), 299โ€“
310.
BENTLEY, J. 1975. Multidimensional search trees
used for associative searching. Commun.
ACM 18, 9, 509โ€“517.
BENTLEY, J. 1979. Multidimensional binary search
in database applications. IEEE Trans. Softw.
Eng. 4, 5, 397โ€“409.
BERCHTOLD, S. AND KEIM, D. 1998. High-dimensional index structures—Database support for next decade's applications. Tutorial, ACM SIGMOD Int. Conf. on Management of Data (Seattle, WA).
BERCHTOLD, S., BÖHM, C., BRAUNMÜLLER, B., KEIM, D., AND KRIEGEL, H.-P. 1997a. Fast parallel similarity search in multimedia databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
BERCHTOLD, S., BÖHM, C., JAGADISH, H., KRIEGEL, H.-P., AND SANDER, J. 2000a. Independent quantization: An index compression technique for high-dimensional data spaces. In Proc. 16th Int. Conf. on Data Engineering.
BERCHTOLD, S., BÖHM, C., KEIM, D., AND KRIEGEL, H.-P. 1997b. A cost model for nearest neighbor search in high-dimensional data space. In Proc. ACM PODS Symp. on Principles of Database Systems (Tucson, AZ).
BERCHTOLD, S., BÖHM, C., KEIM, D., AND KRIEGEL, H.-P. 2001. On optimizing processing of nearest neighbor queries in high-dimensional data space. In Proc. Conf. on Database Theory, 435–449.
BERCHTOLD, S., BÖHM, C., KEIM, D., KRIEGEL, H.-P., AND XU, X. 2000c. Optimal multidimensional query processing using tree striping. In Proc. DaWaK, 244–257.
BERCHTOLD, S., BÖHM, C., AND KRIEGEL, H.-P. 1998a. Improving the query performance of high-dimensional index structures using bulk-load operations. In Proc. 6th Int. Conf. on Extending Database Technology (Valencia, Spain).
BERCHTOLD, S., BÖHM, C., AND KRIEGEL, H.-P. 1998b. The Pyramid-technique: Towards indexing beyond the curse of dimensionality. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Seattle, WA), 142–153.
BERCHTOLD, S., ERTL, B., KEIM, D., KRIEGEL, H.-P., AND SEIDL, T. 1998c. Fast nearest neighbor search in high-dimensional spaces. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL).
BERCHTOLD, S., JAGADISH, H., AND ROSS, K. 1998d.
Independence diagrams: A technique for visual
data mining. In Proc. 4th Int. Conf. on Knowledge
Discovery and Data Mining (New York), 139โ€“
143.
BERCHTOLD, S., KEIM, D., AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proc. 22nd Int. Conf. on Very Large Databases (Bombay), 28–39.
BERCHTOLD, S., KEIM, D., KRIEGEL, H.-P., AND SEIDL, T. 2000d. Indexing the solution space: A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. Knowl. Data Eng., 45–57.
BEYER, K., GOLDSTEIN, J., RAMAKRISHNAN, R., AND SHAFT, U. 1999. When is "nearest neighbor" meaningful? In Proc. Int. Conf. on Database Theory, 217–235.
BÖHM, C. 1998. Efficiently indexing high-dimensional databases. PhD thesis, University of Munich, Germany.
BÖHM, C. 2000. A cost model for query processing in high-dimensional data spaces. To appear in ACM Trans. Database Syst.
BOZKAYA, T. AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. SIGMOD Rec. 26, 2, 357–368.
BRIN, S. 1995. Near neighbor search in large metric spaces. In Proc. 21st Int. Conf. on Very Large
Databases (Switzerland), 574โ€“584.
BURKHARD, W. AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230–236.
CHEUNG, K. AND FU, A. 1998. Enhanced nearest neighbour search on the R-tree. SIGMOD Rec. 27, 3, 16–21.
CHIUEH, T. 1994. Content-based image indexing.
In Proc. 20th Int. Conf. on Very Large Databases
(Chile), 582โ€“593.
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proc. 23rd Int. Conf. on Very Large Databases (Greece), 426–435.
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost
model for similarity queries in metric spaces. In
Proc. 17th ACM Symp. on Principles of Database
Systems (Seattle), 59โ€“67.
CLEARY, J. 1979. Analysis of an algorithm for finding nearest neighbors in Euclidean space. ACM Trans. Math. Softw. 5, 2, 183–192.
COMER, D. 1979. The ubiquitous B-tree. ACM Comput. Surv. 11, 2, 121–138.
CORRAL, A., MANOLOPOULOS, Y., THEODORIDIS, Y.,
AND VASSILAKOPOULOS, M. 2000. Closest pair
queries in spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 189โ€“
200.
EASTMAN, C. 1981. Optimal bucket size for nearest neighbor searching in kd-trees. Inf. Proc. Lett. 12, 4.
EVANGELIDIS, G. 1994. The hBπ-tree: A concurrent and recoverable multi-attribute index structure. PhD thesis, Northeastern University, Boston, MA.
EVANGELIDIS, G., LOMET, D., AND SALZBERG, B. 1997. The hBπ-tree: A multiattribute index supporting concurrency, recovery and node consolidation. VLDB J. 6, 1, 1–25.
FALOUTSOS, C. 1985. Multiattribute hashing using
gray codes. In Proc. ACM SIGMOD Int. Conf. on
Management of Data, 227โ€“238.
FALOUTSOS, C. 1988. Gray codes for partial match
and range queries. IEEE Trans. Softw. Eng. 14,
1381โ€“1393.
FALOUTSOS, C. AND GAEDE, V. 1996. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proc. 22nd Int. Conf. on Very Large Databases (Mumbai, India), 40–50.
FALOUTSOS, C. AND KAMEL, I. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Minneapolis, MN), 4–13.
FALOUTSOS, C. AND LIN, K.-I. 1995. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia data. In Proc. ACM SIGMOD Int. Conf. on Management of Data (San Jose, CA), 163–174.
FALOUTSOS, C. AND ROSEMAN, S. 1989. Fractals
for secondary key retrieval. In Proc. 8th
ACM SIGACT-SIGMOD Symp. on Principles of
Database Systems, 247โ€“252.
FALOUTSOS, C., BARBER, R., FLICKNER, M., AND HAFNER, J. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 231–262.
FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS,
Y. 1994b. Fast subsequence matching in
time-series databases. In Proc. ACM SIGMOD
Int. Conf. on Management of Data, 419โ€“429.
FALOUTSOS, C., SELLIS, T., AND ROUSSOPOULOS, N. 1987.
Analysis of object-oriented spatial access methods. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
FINKEL, R. AND BENTLEY, J. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Inf. 4, 1, 1–9.
FREESTON, M. 1987. The BANG file: A new kind of grid file. In Proc. ACM SIGMOD Int. Conf. on Management of Data (San Francisco), 260–269.
FRIEDMAN, J., BENTLEY, J., AND FINKEL, R. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3, 3, 209–226.
GAEDE, V. 1995. Optimal redundancy in spatial
database systems. In Proc. 4th Int. Symp. on Advances in Spatial Databases (Portland, ME), 96โ€“
116.

GAEDE, V. AND GÜNTHER, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170–231.
GIONIS, A., INDYK, P., AND MOTWANI, R. 1999. Similarity search in high dimensions via hashing.
In Proc. 25th Int. Conf. on Very Large Databases
(Edinburgh), 518โ€“529.
GREENE, D. 1989. An implementation and performance analysis of spatial data access methods. In Proc. 5th IEEE Int. Conf. on Data
Engineering.
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Boston),
47โ€“57.
HELLERSTEIN, J., KOUTSOUPIAS, E., AND PAPADIMITRIOU,
C. 1997. On the analysis of indexing schemes.
In Proc. 16th SIGACT-SIGMOD-SIGART Symp.
on Principles of Database Systems (Tucson, AZ),
249โ€“256.
HELLERSTEIN, J., NAUGHTON, J., AND PFEFFER, A. 1995.
Generalized search trees for database systems.
In Proc. 21st Int. Conf. on Very Large Databases
(Zurich), 562โ€“573.
HENRICH, A. 1994. A distance-scan algorithm for spatial access structures. In Proc. 2nd ACM Workshop on Advances in Geographic Information Systems (Gaithersburg, MD), 136–143.
HENRICH, A. 1998. The LSDh-tree: An access structure for feature vectors. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL).
HENRICH, A., SIX, H.-W., AND WIDMAYER, P. 1989. The LSD-tree: Spatial access to multidimensional point and non-point objects. In Proc. 15th Int. Conf. on Very Large Databases (Amsterdam, The Netherlands), 45–53.
HINNEBURG, A. AND KEIM, D. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proc. Int. Conf. on Knowledge Discovery in Databases (New York).
HINRICHS, K. 1985. Implementation of the grid file: Design concepts and experience. BIT 25, 569–592.
HJALTASON, G. AND SAMET, H. 1995. Ranking in
spatial databases. In Proc. 4th Int. Symp.
on Large Spatial Databases (Portland, ME),
83โ€“95.
HJALTASON, G. AND SAMET, H. 1998. Incremental
distance join algorithms for spatial databases.
In Proc. ACM SIGMOD Int. Conf. on Management of Data, 237โ€“248.
HUTFLESZ, A., SIX, H.-W., AND WIDMAYER, P. 1988a.
Globally order preserving multidimensional linear hashing. In Proc. 4th IEEE Int. Conf. on Data
Engineering, 572โ€“579.
HUTFLESZ, A., SIX, H.-W., AND WIDMAYER, P. 1988b. Twin grid files: Space optimizing access schemes. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
JAGADISH, H. 1990. Linear clustering of objects with multiple attributes. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Atlantic City, NJ), 332–342.
JAGADISH, H. 1991. A retrieval technique for similar shapes. In Proc. ACM SIGMOD Int. Conf. on
Management of Data, 208โ€“217.
JAIN, R. AND WHITE, D. 1996. Similarity indexing:
Algorithms and performance. In Proc. SPIE Storage and Retrieval for Image and Video Databases
IV (San Jose, CA), 62โ€“75.
KAMEL, I. AND FALOUTSOS, C. 1992. Parallel R-trees. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 195–204.
KAMEL, I. AND FALOUTSOS, C. 1993. On packing r-trees. In Proc. 2nd Int. Conf. on Information and Knowledge Management (CIKM), 490–499.
KAMEL, I. AND FALOUTSOS, C. 1994. Hilbert r-tree:
An improved r-tree using fractals. In Proc.
20th Int. Conf. on Very Large Databases, 500โ€“
509.
KATAYAMA, N. AND SATOH, S. 1997. The sr-tree: An
index structure for high-dimensional nearest
neighbor queries. In Proc. ACM SIGMOD Int.
Conf. on Management of Data, 369โ€“380.
KNUTH, D. 1975. The Art of Computer Programming—Volume 3: Sorting and Searching. Addison-Wesley, Reading, Mass.
KORN, F. AND MUTHUKRISHNAN, S. 2000. Influence
sets based on reverse nearest neighbor queries.
In Proc. ACM SIGMOD Int. Conf. on Management of Data, 201โ€“212.
KORN, F., SIDIROPOULOS, N., FALOUTSOS, C., SIEGEL, E.,
AND PROTOPAPAS, Z. 1996. Fast nearest neighbor search in medical image databases. In
Proc. 22nd Int. Conf. on Very Large Databases
(Mumbai, India), 215โ€“226.
KORNACKER, M. 1999. High-performance generalized search trees. In Proc. 25th Int. Conf. on Very
Large Databases (Edinburgh).
KRIEGEL, H.-P. AND SEEGER, B. 1986. Multidimensional order preserving linear hashing with partial expansions. In Proc. Int. Conf. on Database
Theory, Lecture Notes in Computer Science, vol.
243, Springer-Verlag, New York.
KRIEGEL, H.-P. AND SEEGER, B. 1987. Multidimensional dynamic quantile hashing is very ef๏ฌcient
for non-uniform record distributions. In Proc.
3rd Int. Conf. on Data Engineering, 10โ€“17.
KRIEGEL, H.-P. AND SEEGER, B. 1988. Plop-hashing:
A grid file without directory. In Proc. 4th Int.
Conf. on Data Engineering, 369โ€“376.
KRISHNAMURTHY, R. AND WHANG, K.-Y. 1985. Multilevel Grid Files. IBM Research Center Report,
Yorktown Heights, NY.
KUKICH, K. 1992. Techniques for automatically correcting words in text. ACM Comput.
Surv. 24, 4, 377โ€“440.
LIN, K., JAGADISH, H., AND FALOUTSOS, C. 1995. The
tv-tree: An index structure for high-dimensional
data. VLDB J. 3, 517โ€“542.
LOMET, D. AND SALZBERG, B. 1989. The hb-tree: A
robust multiattribute search structure. In Proc.
5th IEEE Int. Conf. on Data Engineering, 296–
304.
LOMET, D. AND SALZBERG, B. 1990. The hb-tree:
A multiattribute indexing method with good
guaranteed performance. ACM Trans. Database
Syst. 15, 4, 625โ€“658.
MANDELBROT, B. 1977. Fractal Geometry of Nature.
W.H. Freeman, New York.
MEHROTRA, R. AND GARY, J. 1993. Feature-based retrieval of similar shapes. In Proc. 9th Int. Conf.
on Data Engineering.
MEHROTRA, R. AND GARY, J. 1995. Feature-index-based similar shape retrieval. In Proc. 3rd Working Conf. on Visual Database Systems.
MORTON, G. 1966. A Computer Oriented Geodetic
Data Base and a New Technique in File Sequencing. IBM Ltd., Ottawa, Canada.
MUMFORD, D. 1987. The problem of robust shape
descriptors. In Proc. 1st IEEE Int. Conf. on Computer Vision.
NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K.
1984. The grid file: An adaptable, symmetric
multikey ๏ฌle structure. ACM Trans. Database
Syst. 9, 1, 38โ€“71.
ORENSTEIN, J. 1990. A comparison of spatial query
processing techniques for native and parameter spaces. In Proc. ACM SIGMOD Int. Conf. on
Management of Data, 326โ€“336.
ORENSTEIN, J. AND MERRET, T. 1984. A class of data
structures for associative searching. In Proc. 3rd
ACM SIGACT-SIGMOD Symp. on Principles of
Database Systems, 181โ€“190.
OTOO, E. 1984. A mapping function for the directory of a multidimensional extendible hashing.
In Proc. 10th Int. Conf. on Very Large Databases,
493โ€“506.
OUKSEL, M. 1985. The interpolation based grid
file. In Proc. 4th ACM SIGACT-SIGMOD
Symp. on Principles of Database Systems, 20โ€“
27.
OUKSEL, M. AND MAYER, O. 1992. The nested
interpolation-based grid file. Acta Informatica
29, 335–373.
PAGEL, B.-U., SIX, H.-W., TOBEN, H., AND WIDMAYER,
P. 1993. Towards an analysis of range query
performance in spatial data structures. In Proc.
12th ACM SIGACT-SIGMOD-SIGART Symp. on
Principles of Database Systems (Washington,
DC), 214โ€“221.
PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997a.
Nearest neighbor queries in shared-nothing environments. Geoinf. 1, 1, 1โ€“26.
PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997b.
Performance of nearest neighbor queries in
r-trees. In Proc. 6th Int. Conf. on Database
Theory, Lecture Notes in Computer Science, vol.
1186, Springer-Verlag, New York, 394โ€“408.
PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1998. Similarity query processing using disk arrays. In
Proc. ACM SIGMOD Int. Conf. on Management
of Data.
RIEDEL, E., GIBSON, G., AND FALOUTSOS, C. 1998. Active storage for large-scale data mining and
multimedia. In Proc. 24th Int. Conf. on Very
Large Databases, 62โ€“73.
ROBINSON, J. 1981. The k-d-b-tree: A search structure for large multidimensional dynamic indexes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 10–18.
ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995.
Nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 71โ€“79.
SAGAN, H. 1994. Space-Filling Curves. Springer-Verlag, New York.
ยจ
SCHRODER, M. 1991. Fractals, Chaos, Power Laws:
Minutes from an In๏ฌnite Paradise. W.H.
Freeman, New York.
SEEGER, B. AND KRIEGEL, H.-P. 1990. The buddy
tree: An efficient and robust access method for
spatial data base systems. In Proc. 16th Int.
Conf. on Very Large Databases (Brisbane), 590โ€“
601.
SEIDL, T. 1997. Adaptable similarity search in 3-d
spatial database systems. PhD thesis, University of Munich, Germany.
SEIDL, T. AND KRIEGEL, H.-P. 1997. Efficient user-adaptable similarity search in large multimedia
databases. In Proc. 23rd Int. Conf. on Very Large
Databases (Athens).
SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C.
1987. The r+-tree: A dynamic index for multidimensional objects. In Proc. 13th Int. Conf. on
Very Large Databases (Brighton, GB), 507โ€“518.
SHAWNEY, H. AND HAFNER, J. 1994. Efficient color
histogram indexing. In Proc. Int. Conf. on Image
Processing, 66โ€“70.
SHOICHET, B., BODIAN, D., AND KUNTZ, I. 1992.
Molecular docking using shape descriptors. J.
Comput. Chem. 13, 3, 380โ€“397.
SPROULL, R. 1991. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6, 579–589.
STANOI, I., AGRAWAL, D., AND ABBADI, A. 2000. Reverse nearest neighbor queries for dynamic
databases. In Proc. ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge
Discovery, 44โ€“53.
STONEBRAKER, M., SELLIS, T., AND HANSON, E. 1986.
An analysis of rule indexing implementations in
data base systems. In Proc. Int. Conf. on Expert
Database Systems.
THEODORIDIS, Y. AND SELLIS, T. 1996. A model for
the prediction of r-tree performance. In Proc.
15th ACM SIGACT-SIGMOD-SIGART Symp. on
Principles of Database Systems (Montreal), 161โ€“
171.
UHLMANN, J. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Proc.
Lett. 145โ€“157.
VAN DEN BERCKEN, J., SEEGER, B., AND WIDMAYER,
P. 1997. A general approach to bulk loading multidimensional index structures. In Proc.
23rd Int. Conf. on Very Large Databases
(Athens).
WALLACE, T. AND WINTZ, P. 1980. An efficient three-dimensional aircraft recognition algorithm using normalized Fourier descriptors. Comput.
Graph. Image Proc. 13, 99โ€“126.
WEBER, R., SCHEK, H.-J., AND BLOTT, S. 1998. A
quantitative analysis and performance study for
similarity-search methods in high-dimensional
spaces. In Proc. Int. Conf. on Very Large
Databases (New York).
WHITE, D. AND JAIN, R. 1996. Similarity indexing
with the ss-tree. In Proc. 12th Int. Conf. on Data
Engineering (New Orleans).
YAO, A. AND YAO, F. 1985. A general approach to
d-dimensional geometric queries. In Proc. ACM
Symp. on Theory of Computing.
YIANILOS, P. 1993. Data structures and algorithms
for nearest neighbor search in general metric
spaces. In Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, 311โ€“321.
YIANILOS, P. 1999. Excluded middle vantage point
forests for nearest neighbor search. In Proc.
DIMACS Implementation Challenge (Baltimore,
MD).

Received August 1998; revised March 2000; accepted November 2000

Searching in high dimensional spaces index structures for improving the performance of multimedia databases

  • 1. Searching in High-Dimensional Spacesโ€”Index Structures for Improving the Performance of Multimedia Databases ยจ CHRISTIAN BOHM University of Munich, Germany STEFAN BERCHTOLD stb ag, Germany AND DANIEL A. KEIM AT&T Research Labs and University of Constance, Germany During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, and molecular biology. An important research issue in the ๏ฌeld of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos. However, in contrast to searching data in a relational database, a content-based retrieval requires the search of similar objects as a basic functionality of the database system. Most of the approaches addressing similarity search use a so-called feature transformation that transforms important properties of the multimedia objects into high-dimensional points (feature vectors). Thus, the similarity search is transformed into a search of points in the feature space that are close to a given query point in the high-dimensional feature space. Query processing in high-dimensional spaces has therefore been a very active research area over the last few years. A number of new index structures and algorithms have been proposed. It has been shown that the new index structures considerably Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; E.1 [Data]: Data Structures; F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; G.1 [Mathematics of Computing]: Numerical Analysis; G.2 [Mathematics of Computing]: Discrete Mathematics; H.2 [Information Systems]: Database Management; H.3 [Information Systems]: Information Storage and Retrieval; H.4 [Information Systems]: Information Systems Applications General Terms: Algorithms, Design, Measurement, Performance, Theory Additional Key Words and Phrases: Index structures, indexing high-dimensional data, multimedia databases, similarity search Authorsโ€™ addresses: C. Bยจ hm, University of Munich, Institute for Computer Science, Oettingenstr. 67, o ยจ 80538 Munchen, Germany; email: [email protected]; S. Berchtold, stb ag, Moritzplatz 6, 86150 Augsburg, Germany; email: [email protected]; D. A. Keim, University of Constance, ยจ Department of Computer & Information Science, Box: D 78, Universitatsstr. 10, 78457 Konstanz, Germany; email: [email protected]. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for pro๏ฌt or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior speci๏ฌc permission and/or a fee. c 2001 ACM 0360-0300/01/0900-0322 $5.00 ACM Computing Surveys, Vol. 33, No. 3, September 2001, pp. 322โ€“373.
  • 2. Searching in High-Dimensional Spaces 323 improve the performance in querying large multimedia databases. Based on recent tutorials [Berchtold and Keim 1998], in this survey we provide an overview of the current state of the art in querying multimedia databases, describing the index structures and algorithms for an ef๏ฌcient query processing in high-dimensional spaces. We identify the problems of processing queries in high-dimensional space, and we provide an overview of the proposed approaches to overcome these problems. 1. INDEXING MULTIMEDIA DATABASES Multimedia databases are of high importance in many application areas such as geography, CAD, medicine, and molecular biology. Depending on the application, the multimedia databases need to have different properties and need to support different types of queries. In contrast to traditional database applications, where point, range, and partial match queries are very important, multimedia databases require a search for all objects in the database that are similar (or complementary) to a given search object. In the following, we describe the notion of similarity queries and the feature-based approach to process those queries in multimedia databases in more detail. 1.1. Feature-Based Processing of Similarity Queries An important aspect of similarity queries is the similarity measure. There is no general de๏ฌnition of the similarity measure since it depends on the needs of the application and is therefore highly application-dependent. Any similarity measure, however, takes two objects as input parameters and determines a positive real number, denoting the similarity of the two objects. A similarity measure is therefore a function of the form ฮด: Obj ร— Obj โ†’ + 0. In de๏ฌning similarity queries, we have to distinguish between two different tasks, which are both important in multimedia database applications: ฮต-similarity means that we are interested in all objects of which the similarity to a given search object is below a given threshold ฮต, and NNsimilarity (nearest neighbor) means that we are only interested in the objects which are the most similar ones with respect to the search object. De๏ฌnition 1. (ฮต-Similarity, Identity). Two objects obj1 and obj2 are called ฮตsimilar if and only if ฮด(obj2 , obj1 ) < ฮต. (For ฮต = 0, the objects are called identical.) Note that this de๏ฌnition is independent of database applications and just describes a way to measure the similarity of two objects. De๏ฌnition 2. (NN-Similarity). Two objects obj1 and obj2 are called NN-similar with respect to a database of objects DB if and only if โˆ€obj โˆˆ DB, obj = obj1 : ฮด(obj2 , obj1 ) โ‰ค ฮด(obj2 , obj). We are now able to formally de๏ฌne the ฮต-similarity query and the NN-similarity query. De๏ฌnition 3. (ฮต-Similarity Query, NNSimilarity Query). Given a query object objs , ๏ฌnd all objects obj from the database of objects DB that are ฮต-similar (identical for ฮต = 0) to objs ; that is, determine {obj โˆˆ DB | ฮด(objs , obj) < ฮต}. Given a query object objs , ๏ฌnd the object(s) obj from the database of objects DB that are NN-similar to objs ; that is, determine {obj โˆˆ DB | โˆ€obj โˆˆ DB, obj = obj : ฮด(objs , obj) โ‰ค ฮด(objs , obj)}. The solutions currently used to solve similarity search problems are mostly feature-based solutions. The basic idea of feature-based similarity search is to extract important features from the multimedia objects, map the features into ACM Computing Surveys, Vol. 33, No. 3, September 2001.
  • 3. 324 C. Bยจ hm et al. o Fig. 1. Basic idea of feature-based similarity search. high-dimensional feature vectors, and search the database of feature vectors for objects with similar feature vectors (cf. Figure 1). The feature transformation F is de๏ฌned as the mapping of the multimedia object (obj) into a d -dimensional feature vector F : Obj โ†’ d . The similarity of two objects obj1 and obj2 can now be determined, ฮด(obj1 , obj2 ) = ฮดEuclid (F (obj1 ), F (obj2 )). Feature-based approaches are used in many application areas including molecular biology (for molecule docking) [Shoichet et al. 1992], information retrieval (for text matching) [Altschul et al. 1990], multimedia databases (for image retrieval) [Faloutsos et al. 1994; Seidl and Kriegel 1997], sequence databases (for subsequence matching) [Agrawal et al. 1993, 1995; Faloutsos et al. 1994], geometric databases (for shape matching) [Mehrotra and Gary 1993, 1995; Korn et al. 1996], and so on. Examples of feature vectors are color histograms [Shawney and Hafner 1994], shape descriptors [Mumford 1987; Jagadish 1991; Mehrotra and Gary 1995], Fourier vectors [Wallace and Wintz 1980], text descriptors [Kukich 1992], and so on. The result of the feature transformation are sets of highdimensional feature vectors. The similarity search now becomes an ฮต-query or a nearest-neighbor query on the feature vectors in the high-dimensional feature space, which can be handled much ACM Computing Surveys, Vol. 33, No. 3, September 2001. more ef๏ฌciently on large amounts of data than the time-consuming comparison of the search object to all complex multimedia objects in the database. Since the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions, it is essential to use appropriate multidimensional indexing techniques to achieve an ef๏ฌcient search of the data. Note that the feature transformation often also involves complex transformations of the multimedia objects such as feature extraction, normalization, or Fourier transformation. Depending on the application, these operations may be necessary to achieve, for example, invariance with respect to a scaling or rotation of the objects. The details of the feature transformations are beyond the scope of this survey. For further reading on feature transformations, the interested reader is referred to the literature [Wallace and Wintz 1980; Mumford 1987; Jagadish 1991; Kukich 1992; Shawney and Hafner 1994; Mehrotra and Gary 1995]. For an ef๏ฌcient similarity search it is necessary to store the feature vectors in a high-dimensional index structure and use the index structure to ef๏ฌciently evaluate the distance metric. The high-dimensional index structure used must ef๏ฌciently support โ€” point queries for processing identity queries on the multimedia objects; โ€” range queries for processing ฮตsimilarity queries; and โ€” nearest-neighbor queries for processing NN-similarity queries.
  • 4. Searching in High-Dimensional Spaces Note that instead of using a feature transformation into a vector space, the data can also be directly processed using a metric space index structure. In this case, the user has to provide a metric that corresponds to the properties of the similarity measure. The basic idea of metric indexes is to use the given metric properties to build a tree that then can be used to prune branches in processing the queries. The basic idea of metric index structures is discussed in Section 5. A problem of metric indexes is that they use less information about the data than vector space index structures which results in poorer pruning and also a poorer performance. A nice possibility to improve this situation is the FASTMAP algorithm [Faloutsos and Lin 1995] which maps the metric data into a lower-dimensional vector space and uses a vector space index structure for ef๏ฌcient access to the transformed data. Due to their practical importance, in this survey we restrict ourselves to vector space index structures. We assume we have some given applicationdependent feature transformation that provides a mapping of the multimedia objects into some high-dimensional space. There are a quite large number of index structures that have been developed for ef๏ฌcient query processing in some multidimensional space. In general, the index structures can be classi๏ฌed in two groups: data organizing structures such as R-trees [Guttman 1984; Beckmann et al. 1990] and space organizing structures such as multidimensional hashing [Otoo 1984; Kriegel and Seeger 1986, 1987, 1988; Seeger and Kriegel 1990], GRIDFiles [Nievergelt et al. 1984; Hinrichs 1985; Krishnamurthy and Whang 1985; Ouksel 1985; Freeston 1987; Hut๏ฌ‚esz et al. 1988b; Kriegel and Seeger 1988], and kd-tree-based methods (kd-B-tree [Robinson 1981], hB-tree [Lomet and Salzberg 1989, 1990; Evangelidis 1994], and LSDh -tree [Henrich 1998]). For a comprehensive description of most multidimensional access methods, primarily concentrating on low-dimensional indexing problems, the interested reader is referred to a recent survey presented 325 ยจ in Gaede and Gunther [1998]. That survey, however, does not tackle the problem of indexing multimedia databases which requires an ef๏ฌcient processing of nearestneighbor queries in high-dimensional feature spaces; and therefore, the survey does not deal with nearest-neighbor queries and the problems of indexing highdimensional spaces. In our survey, we focus on the index structures that have been speci๏ฌcally designed to cope with the effects occurring in high-dimensional space. Since hashing- and GRID-Filebased methods do not play an important role in high-dimensional indexing, we do not cover them in the survey.1 The reason why hashing techniques are not used in high-dimensional spaces are the problems that arise in such space. To be able to understand these problems in more detail, in the following we discuss some general effects that occur in highdimensional spaces. 1.2. Effects in High-Dimensional Space A broad variety of mathematical effects can be observed when one increases the dimensionality of the data space. Interestingly, some of these effects are not of quantitative but of qualitative nature. In other words, one cannot think about these effects, by simply extending twoor three-dimensional experiences. Rather, one has to think for example, at least 10-dimensional to even see the effect occurring. Furthermore, some are pretty nonintuitive. 
Few of the effects are of pure mathematical interest whereas some others have severe implications for the performance of multidimensional index structures. Therefore, in the database world, these effects are subsumed by the term, โ€œcurse of dimensionality.โ€ Generally speaking, the problem is that important parameters such as volume and area depend exponentially on the number of dimensions of the data space. Therefore, 1 The only exceptions to this is a technique for searching approximate nearest neighbors in highdimensional spaces that has been proposed in Gionis et al. [1999] and Ouksel et al. [1992]. ACM Computing Surveys, Vol. 33, No. 3, September 2001.
  • 5. 326 most index structures proposed so far operate ef๏ฌciently only if the number of dimensions is fairly small. The effects are nonintuitive because we are used to dealing with three-dimensional spaces in the real world but these effects do not occur in low-dimensional spaces. Many people even have trouble understanding spatial relations in three-dimensional spaces, however, no one can โ€œimagineโ€ an eightdimensional space. Rather, we always try to ๏ฌnd a low-dimensional analogy when dealing with such spaces. Note that there actually is no notion of a โ€œhighโ€dimensional space. Nevertheless, if people speak about high-dimensional, they usually mean a dimension of about 10 to 16, or at least 5 or 6. Next, we list the most relevant effects and try to classify them: โ€”pure geometric effects concerning the surface and volume of (hyper) cubes and (hyper) spheres: โ€”the volume of a cube grows exponentially with increasing dimension (and constant edge length), โ€”the volume of a sphere grows exponentially with increasing dimension, and โ€”most of the volume of a cube is very close to the (d โˆ’ 1)-dimensional surface of the cube; โ€”effects concerning the shape and location of index partitions: โ€”a typical index partition in highdimensional spaces will span the majority of the data space in most dimensions and only be split in a few dimensions, โ€”a typical index partition will not be cubic, rather it will โ€œlookโ€ like a rectangle, โ€”a typical index partition touches the boundary of the data space in most dimensions, and โ€”the partitioning of space gets coarser the higher the dimension; โ€”effects arising in a database environment (e.g., selectivity of queries): โ€”assuming uniformity, a reasonably selective range query corresponds to ACM Computing Surveys, Vol. 33, No. 3, September 2001. C. Bยจ hm et al. o Fig. 2. Spheres in high-dimensional spaces. a hypercube having a huge extension in each dimension, and โ€”assuming uniformity, a reasonably selective nearest-neighbor query corresponds to a hypersphere having a huge radius in each dimension; usually this radius is even larger than the extension of the data space in each dimension. To be more precise, we present some of the listed effects in more depth and detail in the rest of the section. To demonstrate how much we stick to our understanding of low-dimensional spaces, consider the following lemma. Consider a cubic-shaped d -dimensional data space of extension [0, 1]d . We de๏ฌne the centerpoint c of the data space as the point (0.5, . . . , 0.5). The lemma, โ€œEvery d -dimensional sphere touching (or intersecting) the (d โˆ’ 1)-dimensional boundaries of the data space also contains c,โ€ is obviously true for d = 2, as one can take from Figure 2. Spending some more effort and thinking, we are able to also prove the lemma for d = 3. However, the lemma is de๏ฌnitely false for d = 16, as the following counterexample shows. De๏ฌne a sphere around the point p = (0.3, . . . , 0.3). This point p has a Euclidean distance of โˆš d ยท 0.22 = 0.8 from the centerpoint. If we de๏ฌne the sphere around p with a radius of 0.7, the sphere will touch (or intersect) all 15-dimensional surfaces of the space. However, the centerpoint is not included in the sphere. We have to be aware of the fact that effects like this are not only nice mathematical properties but also lead to severe conclusions for the performance of index structures.
  • 6. Searching in High-Dimensional Spaces 327 Fig. 3. Space partitioning in high-dimensional spaces. The most basic effect is the exponential growth of volume. The volume of a cube in a d -dimensional space is of the formula vol = ed , where d is the dimension of the data space and e is the edge length of the cube. Now if the edge length is a number between 0 and 1, the volume of the cube will exponentially decrease when increasing the dimension. Viewing the problem from the opposite side, if we want to de๏ฌne a cube of constant volume for increasing dimensions, the appropriate edge length will quickly approach 1. For example, in a 2-dimensional space of extension [0, 1]d , a cube of volume 0.25 has an edge length of 0.5 whereas in a 16-dimensional space, โˆš 16 the edge length has to be 0.25 โ‰ˆ 0.917. The exponential growth of the volume has a serious impact on conventional index structures. Space-organizing index structures, for example, suffer from the โ€œdead spaceโ€ indexing problem. Since space organizing techniques index the whole domain space, a query window may overlap part of the space belonging to a page that actually contains no points at all. Another important issue is the space partitioning one can expect in highdimensional spaces. Usually, index structures split the data space using (d โˆ’ 1)dimensional hyperplanes; for example, in order to perform a split, the index structure selects a dimension (the split dimension) and a value in this dimension (the split value). All data items having a value in the split dimension smaller than the split value are assigned to the ๏ฌrst partition whereas the other data items form the second partition. This process of splitting the data space continues recursively until the number of data items in a partition is below a certain threshold and the data items of this partition are stored in a data page. Thus, the whole process can be described by a binary tree, the split tree. As the tree is a binary tree, the height h of the split tree usually depends logarithmically on the number of leaf nodes, that is, data pages. On the other hand, the number d of splits for a single data page is on average d = log2 N , Ceff (d ) where N is the number of data items and Ceff (d ) is the capacity of a single data page.2 Thus, we can conclude that if all dimensions are equally used as split dimensions, a data page has been split at most once or twice in each dimension and therefore, spans a range between 0.25 and 0.5 in each of the dimensions (for uniformly distributed data). From that, we may conclude that the majority of the data pages are located at the surface of the data space rather than in the interior. In addition, this obviously leads to a coarse data space partitioning in single dimensions. However, from our understanding of index structures such as the R -tree that had been designed for geographic applications, we are used to very ๏ฌne partitions where the majority of the data pages are in the interior of the space and we have to be careful not to apply this understanding to high-dimensional spaces. Figure 3 depicts the different con๏ฌgurations. Note that this effect applies to almost any index 2 For most index structures, the capacity of a single data page depends on the dimensionality since the number of entries decreases with increasing dimension due to the larger size of the entries. ACM Computing Surveys, Vol. 33, No. 3, September 2001.
Additionally, not only do index structures show strange behavior in high-dimensional spaces; the expected distribution of the queries is also affected by the dimensionality of the data space. If we assume a uniform data distribution, the selectivity of a query (the fraction of data items contained in the query) directly depends on the volume of the query. In the case of nearest-neighbor queries, the query affects a sphere around the query point that contains exactly one data item, the NN-sphere. According to Berchtold et al. [1997b], the radius of the NN-sphere increases rapidly with increasing dimension. In a data space of extension [0, 1]^d, it quickly reaches a value larger than 1 when d is increased. This is a consequence of the above-mentioned exponential relation of extension and volume in high-dimensional spaces.

Considering all these effects, we can conclude that if one builds an index structure using a state-of-the-art split algorithm, the performance will deteriorate rapidly when the dimensionality of the data space increases. This has been realized not only in the context of multimedia systems, where nearest-neighbor queries are most relevant, but also in the context of data warehouses, where range queries are the most frequent type of query [Berchtold et al. 1998a, b]. Theoretical results based on cost models for index-based nearest-neighbor and range queries also confirm the degeneration of the query performance [Yao and Yao 1985; Berchtold et al. 1997b, 2000b; Beyer et al. 1999]. Other relevant cost models proposed before include Friedman et al. [1977], Cleary [1979], Eastman [1981], Sproull [1991], Pagel et al. [1993], Arya et al. [1995], Arya [1995], Theodoridis and Sellis [1996], and Papadopoulos and Manolopoulos [1997b].

1.3. Basic Definitions

Before we proceed, we need to introduce some notions and formalize our problem description. In this section we define our notion of the database and develop a twofold, orthogonal classification of the various neighborhood queries. Neighborhood queries can be classified either according to the metric that is applied to determine distances between points, or according to the query type. Any combination of metric and query type is possible.

1.3.1. Database. We assume that in our similarity search application, objects are feature-transformed into points of a vector space with fixed, finite dimension d. Therefore, a database DB is a set of points in a d-dimensional data space DS. The data space DS is a subset of $\mathbb{R}^d$. Usually, analytical considerations are simplified if the data space is restricted to the unit hypercube DS = [0, 1]^d. Our database is completely dynamic; that is, insertions of new points and deletions of points are possible and should be handled efficiently. The number of point objects currently stored in our database is abbreviated as n. We note that the notion of a point is ambiguous. Sometimes we mean a point object (i.e., a point stored in the database); in other cases we mean a point in the data space (i.e., a position), which is not necessarily stored in DB. The most common example of the latter is the query point. From the context, the intended meaning of the notion point is always obvious.

Definition 4 (Database). A database DB is a set of n points in a d-dimensional data space DS,

$DB = \{P_0, \ldots, P_{n-1}\}$, $P_i \in DS$, $i = 0..n-1$, $DS \subseteq \mathbb{R}^d$.

1.3.2. Vector Space Metrics. All neighborhood queries are based on the notion of the distance between two points P and Q in the data space. Depending on the application to be supported, several metrics for defining distances are applied. Most common is the Euclidean metric L2, defining the usual Euclidean distance function:

$\delta_{\text{Euclid}}(P, Q) = \sqrt{\sum_{i=0}^{d-1} (Q_i - P_i)^2}$.
But other L_p metrics, such as the Manhattan metric (L1) or the maximum metric (L∞), are also widely applied:

$\delta_{\text{Manhattan}}(P, Q) = \sum_{i=0}^{d-1} |Q_i - P_i|$,
$\delta_{\text{Max}}(P, Q) = \max_{0 \le i < d} \{|Q_i - P_i|\}$.

Queries using the L2 metric are (hyper)sphere shaped. Queries using the maximum metric or the Manhattan metric are hypercubes and rhomboids, respectively (cf. Figure 4).

Fig. 4. Metrics for data spaces.

If additional weights w_0, ..., w_{d−1} are assigned to the dimensions, we obtain weighted Euclidean or weighted maximum metrics, which correspond to axis-parallel ellipsoids and axis-parallel hyperrectangles:

$\delta_{\text{W.Euclid}}(P, Q) = \sqrt{\sum_{i=0}^{d-1} w_i \cdot (Q_i - P_i)^2}$,
$\delta_{\text{W.Max}}(P, Q) = \max_{0 \le i < d} \{w_i \cdot |Q_i - P_i|\}$.

Arbitrarily rotated ellipsoids can be defined using a positive definite similarity matrix W. This concept is used for adaptable similarity search [Seidl 1997]:

$\delta_{\text{ellipsoid}}^2(P, Q) = (P - Q)^T \cdot W \cdot (P - Q)$.
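The metrics translate directly into code. The following Python sketch (ours, for illustration; the survey itself gives only the formulas) implements the distance functions defined above:

import math

# A point is a sequence of d floats.
def euclid(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    return sum(abs(qi - pi) for pi, qi in zip(p, q))

def maximum(p, q):
    return max(abs(qi - pi) for pi, qi in zip(p, q))

def weighted_euclid(p, q, w):
    return math.sqrt(sum(wi * (qi - pi) ** 2 for pi, qi, wi in zip(p, q, w)))

def weighted_max(p, q, w):
    return max(wi * abs(qi - pi) for pi, qi, wi in zip(p, q, w))

def ellipsoid_sq(p, q, W):
    # (P - Q)^T * W * (P - Q) for a positive definite matrix W given as a
    # nested list; returns the *squared* ellipsoid distance.
    v = [pi - qi for pi, qi in zip(p, q)]
    n = len(v)
    return sum(v[i] * W[i][j] * v[j] for i in range(n) for j in range(n))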
1.3.3. Query Types. The first classification of queries is according to the vector space metric defined on the feature space. An orthogonal classification is based on the question of whether the user defines a region of the data space or an intended size of the result set.

Point Query. The simplest query type is the point query. It specifies a point in the data space and retrieves all point objects in the database with identical coordinates:

PointQuery(DB, Q) = {P ∈ DB | P = Q}.

A simplified version of the point query determines only the Boolean answer of whether the database contains an identical point or not.

Range Query. In a range query, a query point Q, a distance r, and a metric M are specified. The result set comprises all points P from the database that have a distance smaller than or equal to r from Q according to metric M:

RangeQuery(DB, Q, r, M) = {P ∈ DB | δ_M(P, Q) ≤ r}.

Point queries can also be considered as range queries with a radius r = 0 and an arbitrary metric M. If M is the Euclidean metric, then the range query defines a hypersphere in the data space from which all points in the database are retrieved. Analogously, the maximum metric defines a hypercube.

Window Query. A window query specifies a rectangular region in the data space from which all points in the database are selected. The specified hyperrectangle is always parallel to the axes (the "window"). We regard the window query as a range query around the centerpoint of the window using a weighted maximum metric, where the weights w_i represent the inverses of the side lengths of the window.

Nearest-Neighbor Query. The range query and its special cases (point query and window query) have the disadvantage that the size of the result set is previously unknown. A user specifying the radius r may have no idea how many results the query may produce. Therefore, it is likely that the user falls into one of two extremes: either getting no answers at all, or getting almost all database objects as answers. To overcome this drawback, it is common to define similarity queries with a defined result set size: the nearest-neighbor queries. The classical nearest-neighbor query returns exactly one point object as the result, the object with the lowest distance to the query point among all points stored in the database. [Footnote 3: A recent extension of nearest-neighbor queries are closest-pair queries, which are also called distance joins [Hjaltason and Samet 1998; Corral et al. 2000]. This query type is mainly important in the area of spatial databases; therefore, closest-pair queries are beyond the scope of this survey.] The only exception to this one-answer rule is due to tie effects. If several points in the database have the same (minimal) distance, our first definition allows more than one answer:

NNQueryDeterm(DB, Q, M) = {P ∈ DB | ∀P′ ∈ DB : δ_M(P, Q) ≤ δ_M(P′, Q)}.

A common solution that avoids the exception to the one-answer rule uses nondeterminism: if several points in the database have minimal distance from the query point Q, an arbitrary point from the result set is chosen and reported as the answer. We follow this approach:

NNQuery(DB, Q, M) = SOME{P ∈ DB | ∀P′ ∈ DB : δ_M(P, Q) ≤ δ_M(P′, Q)}.

K-Nearest-Neighbor Query. If a user wants not only the one closest point as the answer to her query, but a natural number k of closest points, she will perform a k-nearest-neighbor query. Analogously to the nearest-neighbor query, the k-nearest-neighbor query selects k points from the database such that no point among the remaining points in the database is closer to the query point than any of the selected points. Again, we have the problem of ties, which can be solved either by nondeterminism or by allowing more than k answers in this special case:

kNNQuery(DB, Q, k, M) = {P_0, ..., P_{k−1} ∈ DB | ¬∃P′ ∈ DB \ {P_0, ..., P_{k−1}} ∧ ¬∃i, 0 ≤ i < k : δ_M(P_i, Q) > δ_M(P′, Q)}.

A variant of k-nearest-neighbor queries is ranking queries, which require the user to specify neither a range in the data space nor a result set size. The first answer of a ranking query is always the nearest neighbor. Then the user has the possibility of asking for further answers. Upon this request, the second nearest neighbor is reported, then the third, and so on. The user decides after examining an answer whether further answers are needed. Ranking queries can be especially useful in the filter step of a multistep query processing environment. Here, the refinement step usually decides whether the filter step has to produce further answers.

Approximate Nearest-Neighbor Query. In approximate nearest-neighbor queries and approximate k-nearest-neighbor queries, the user also specifies a query point and a number k of answers to be reported. In contrast to exact nearest-neighbor queries, the user is not interested in exactly the closest points, but accepts points that are not much farther away from the query point than the exact nearest neighbor. The degree of inexactness can be specified by an upper bound on how much farther away the reported answers may be, compared to the exact nearest neighbors.
The inexactness can be used to improve the efficiency of query processing.

1.4. Query Evaluation Without Index

All query types introduced in the previous section can be evaluated by a single scan of the database. As we assume that our database is densely stored on a contiguous block on secondary storage, all queries can be evaluated using a so-called sequential scan, which is faster than the access of small blocks spread over wide parts of secondary storage. A minimal scan-based evaluation of range and k-nearest-neighbor queries is sketched below.
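The following Python sketch (ours; the block-wise disk reading described next is abstracted into a simple in-memory list) shows how two of the query types are answered by one pass over the data:

import heapq
import math

def euclid(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def range_query_scan(db, q, r, dist=euclid):
    # One pass over the database; keep every point within distance r of q.
    return [p for p in db if dist(p, q) <= r]

def knn_query_scan(db, q, k, dist=euclid):
    # heapq.nsmallest performs the same single pass, tracking the k
    # closest points seen so far.
    return heapq.nsmallest(k, db, key=lambda p: dist(p, q))

# Example; in the scenario described next, the blocks read from disk
# would be fed through the same loops.
db = [(0.1, 0.2), (0.8, 0.9), (0.4, 0.4)]
print(range_query_scan(db, (0.0, 0.0), 0.5))  # [(0.1, 0.2)]
print(knn_query_scan(db, (0.0, 0.0), 2))      # [(0.1, 0.2), (0.4, 0.4)]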
The sequential scan works as follows. The database is read in very large blocks, whose size is determined by the amount of main memory available to query processing. After reading a block from disk, the CPU processes it and extracts the required information. After a block is processed, the next block is read in. Note that we assume that there is no parallelism between CPU and disk I/O for any query processing technique presented in this article. Furthermore, we do not assume any additional information to be stored in the database. Therefore, the database has the size (in bytes)

$\text{sizeof}(DB) = d \cdot n \cdot \text{sizeof(float)}$.

The cost of query processing based on the sequential scan is proportional to the size of the database in bytes.

1.5. Overview

The rest of the survey is organized as follows. We start by describing the common principles of multidimensional index structures and the algorithms used to build the indexes and process the different query types. Then we provide a systematic overview of the querying and indexing techniques that have been proposed for high-dimensional data spaces, describing them in a uniform way and discussing their advantages and drawbacks. Rather than describing the details of all the different approaches, we try to focus on the basic concepts and algorithms used. We also cover a number of recently proposed techniques dealing with optimization and parallelization issues. In concluding the survey, we try to stir up further research activities by presenting a number of interesting research problems.

2. COMMON PRINCIPLES OF HIGH-DIMENSIONAL INDEXING METHODS

2.1. Structure

High-dimensional indexing methods are based on the principle of hierarchical clustering of the data space. Structurally, they resemble the B+-tree [Bayer and McCreight 1977; Comer 1979]: the data vectors are stored in data nodes such that spatially adjacent vectors are likely to reside in the same node, and each data vector is stored in exactly one data node; that is, there is no object duplication among data nodes. The data nodes are organized in a hierarchically structured directory. Each directory node points to a set of subtrees. Usually, the structure of the information stored in data nodes is completely different from that of the directory nodes. In contrast, the directory nodes are uniformly structured among all levels of the index and consist of (key, pointer) tuples. The key information differs among index structures: for B-trees, for example, the keys are ranges of numbers, and for an R-tree the keys are bounding boxes. There is a single directory node, called the root node, which serves as an entry point for query and update processing. The index structures are height-balanced; that is, the lengths of the paths between the root and all data pages are identical, but may change after insert or delete operations. The length of a path from the root to a data page is called the height of the index structure. The length of the path from a random node to a data page is called the level of the node. Data pages are on level zero (see Figure 5).

Fig. 5. Hierarchical index structures.

The uniform (key, pointer) structure of the directory nodes also allows a wide variety of index structures to be implemented as extensions of a generic index structure, as done in the generalized search tree [Hellerstein et al. 1995]. The generalized search tree (GiST) provides a nice framework for a fast and reliable implementation of search trees.
The main requirement for defining a new index structure in GiST is to define the keys and provide an implementation of four basic methods needed for building and searching the tree (cf. Section 3). Additional methods may be defined to enhance the performance of the index, which is especially relevant for similarity or nearest-neighbor queries [Aoki 1998]. An advantage of GiST is that the basic data structures and algorithms, as well as main
portions of the concurrency and recovery code, can be reused. It is also useful as a basis for theoretical analysis of indexing schemes [Hellerstein et al. 1997]. A recent implementation in a commercial object-relational system shows that GiST-based implementations of index structures can provide competitive performance while considerably reducing the implementation effort [Kornacker 1999].

2.2. Management

The high-dimensional access methods are designed primarily for secondary storage. Data pages have a data page capacity C_max,data, defining how many data vectors can be stored in a data page at most. Analogously, the directory page capacity C_max,dir gives an upper limit to the number of subnodes in each directory node. The original idea was to choose C_max,data and C_max,dir such that data and directory nodes fit exactly into the pages of secondary storage. However, in modern operating systems the page size of a disk drive is considered a hardware detail hidden from programmers and users. Despite that, consecutive reading of contiguous data on disk is by orders of magnitude less expensive than reading at random positions. It is a good compromise to read data contiguously from disk in portions between a few kilobytes and a few hundred kilobytes. This is a kind of artificial paging with a user-defined logical page size. How to choose this logical page size properly is investigated in Sections 3 and 4. The logical page sizes for data and directory nodes are constant for most of the index structures presented in this section. The only exceptions are the X-tree and the DABS-tree. The X-tree defines a basic page size and allows directory pages to extend over multiples of the basic page size; this concept is called a supernode (cf. Section 6.2). The DABS-tree is an indexing structure that gives up the requirement of a constant block size. Instead, an optimal block size is determined individually for each page during creation of the index. This dynamic adaptation of the block size gives the DABS-tree [Böhm 1998] its name.

All index structures presented here are dynamic: they allow insert and delete operations in O(log n) time. To cope with dynamic insertions, updates, and deletions, the index structures allow data and directory nodes to be filled to less than their capacity C_max. In most index structures, the rule is applied that all nodes up to the root node must be filled to at least about 40%. This threshold is called the minimum storage utilization su_min. For obvious reasons, the root is generally exempted from this rule. For B-trees, it is possible to derive an average storage utilization analytically, referred to in the following as the effective storage utilization su_eff. In contrast, for high-dimensional index structures, the effective storage utilization is influenced by the specific heuristics applied in insert and delete processing. Since these indexing methods are not amenable to an analytical derivation of the effective storage utilization, it usually has to be determined experimentally. [Footnote 4: For the hB-tree, it has been shown in Lomet and Salzberg [1990] that under certain assumptions the average storage utilization is 67%.]

For convenience, we denote the product of the capacity and the effective storage
utilization as the effective capacity C_eff of a page:

$C_{\text{eff,data}} = su_{\text{eff,data}} \cdot C_{\text{max,data}}$,   $C_{\text{eff,dir}} = su_{\text{eff,dir}} \cdot C_{\text{max,dir}}$.

2.3. Regions

For efficient query processing it is important that the data are well clustered into the pages, that is, that data objects which are close to each other are likely to be stored in the same data page. Assigned to each page is a so-called page region, which is a subset of the data space (see Figure 6).

Fig. 6. Corresponding page regions of an indexing structure.

The page region can be a hypersphere, a hypercube, a multidimensional cuboid, a multidimensional cylinder, or a set-theoretical combination (union, intersection) of several of the above. For most, but not all, high-dimensional index structures, the page region is a contiguous, solid, convex subset of the data space without holes. For most index structures, regions of pages in different branches of the tree may overlap, although overlaps lead to bad performance behavior and are avoided if possible, or at least minimized. The regions of hierarchically organized pages must always be completely contained in the region of their parent. Analogously, all data objects stored in a subtree are always contained in the page region of the root page of the subtree. The page region is always a conservative approximation of the data objects and the other page regions stored in a subtree.

In query processing, the page region is used to cut branches of the tree from further processing. For example, in the case of range queries, if a page region does not intersect with the query range, it is impossible for any region of a hierarchically subordinate page to intersect with the query range. Neither is it possible for any data object stored in this subtree to intersect with the query range. Only pages whose corresponding page region intersects with the query have to be investigated further. Therefore, a suitable algorithm for range query processing can guarantee that no false drops occur.

For nearest-neighbor queries, a related but slightly different property of conservative approximations is important. Here, distances to a query point have to be determined or estimated. It is important that distances to approximations of point sets are never greater than the distances to the regions of subordinate pages and never greater than the distances to the points stored in the corresponding subtree. This is commonly referred to as the lower bounding property.

Page regions always have a representation that is an invertible mapping between the geometry of the region and a set of values storable in the index. For example, spherical regions can be represented as centerpoint and radius using d + 1 floating point values, if d is the dimension of the data space. For efficient query processing it is necessary that the test for intersection with a query region and the distance computation to the query point in the case of nearest-neighbor queries can be performed efficiently. Both the geometry and the representation of the page regions must be optimized. If the geometry of the page region is suboptimal, the probability increases that the corresponding page has to be accessed more frequently. If the representation of the region is unnecessarily large, the index itself gets larger, yielding worse efficiency in query processing, as we show later.

3. BASIC ALGORITHMS

In this section, we present some basic algorithms on high-dimensional index
structures for index construction and maintenance in a dynamic environment, as well as for query processing. Although some of the algorithms were published in the context of a specific indexing structure, they are presented here in a more general way.

3.1. Insert, Delete, and Update

Insert, delete, and update are the operations that are most specific to the corresponding index structures. Despite that, there are basic algorithms capturing all actions common to all index structures. In the GiST framework [Hellerstein et al. 1995], the buildup of the tree via the insert operation is handled using three basic operations: Union, Penalty, and PickSplit. The Union operation consolidates information in the tree and returns a new key that is true for all data items in the considered subtree. The Penalty operation is used to find the best path for inserting a new data item into the tree by providing a number representing how bad an insertion into that path would be. The PickSplit operation is used to split a data page in case of an overflow.

The insert and delete operations of tree structures are usually the most critical operations, heavily determining the structure of the resulting index and the achievable performance. Some index structures require, for a simple insert, the propagation of changes towards the root or down the children, as, for example, in the cases of the R-tree and kd-B-tree, and some do not, as, for example, the hB-tree. In the latter case, the insert/delete operations are called local operations, whereas in the first case they are called nonlocal operations. Inserts are generally handled as follows:

—Search a suitable data page dp for the data object do.
—Insert do into dp.
—If the number of objects stored in dp exceeds C_max,data, then split dp into two data pages.
—Replace the old description (the representation of the region and the background storage address) of dp in the parent node of dp by the descriptions of the new pages.
—If the number of subtrees stored in the parent exceeds C_max,dir, split the parent and proceed similarly with the parent. It is possible that all pages on the path from dp to the root have to be split.
—If the root node has to be split, let the height of the tree grow by one. In this case, a new root node is created pointing to the two subtrees resulting from the split of the original root.

Heuristics individual to the specific indexing structure are applied for the following subtasks:

—The search for a suitable data page (commonly referred to as the PickBranch procedure): due to overlap between regions, and as the data space is not necessarily completely covered by page regions, there are generally multiple alternatives for the choice of a data page in most multidimensional index structures.
—The choice of the split (i.e., which of the data objects/subtrees are aggregated into which of the newly created nodes).

Some index structures try to avoid splits by a concept named forced reinsert: some data objects are deleted from a node having an overflow condition and reinserted into the index. The details are presented later. The choice of heuristics in insert processing may affect the effective storage utilization.
For example, if a volume-minimizing algorithm allows unbalanced splitting in a 30:70 proportion, then the storage utilization of the index is decreased and the search performance is usually negatively affected. [Footnote 5: For the hB-tree, it has been shown in Lomet and Salzberg [1990] that under certain assumptions even a 33:67 splitting proportion yields an average storage utilization of 64%.] On the other hand, the presence of forced reinsert operations increases the storage utilization and the search performance. The generic insert procedure is sketched in code below.
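The following simplified Python rendering (ours) of the steps above shows the recursive structure; pick_branch and pick_split stand in for the structure-specific heuristics, directory overflow is only indicated, and a root split would be handled by the caller creating a new root over the two halves:

C_MAX_DATA = 4  # illustrative page capacity

class Node:
    def __init__(self, children=None, points=None):
        self.children = children      # list of Nodes, or None for a data page
        self.points = points or []    # data objects stored in a data page

    def is_data_page(self):
        return self.children is None

def pick_branch(node, point):
    # Structure-specific heuristic (e.g., least volume enlargement);
    # here merely a placeholder choosing the first child.
    return node.children[0]

def pick_split(points):
    # Structure-specific split heuristic; here: split in half.
    mid = len(points) // 2
    return points[:mid], points[mid:]

def insert(node, point):
    """Returns None, or a pair of new nodes if `node` was split."""
    if node.is_data_page():
        node.points.append(point)
        if len(node.points) > C_MAX_DATA:
            left, right = pick_split(node.points)
            return Node(points=left), Node(points=right)
        return None
    child = pick_branch(node, point)
    split = insert(child, point)
    if split is not None:
        # Replace the old child description by the two new pages;
        # an overflowing directory node would be split analogously.
        node.children.remove(child)
        node.children.extend(split)
    return None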
Some work has been done on handling deletions from multidimensional index structures. Underflow conditions can generally be handled by three different actions:

—balancing pages by moving objects from one page to another,
—merging pages, and
—deleting the page and reinserting all objects into the index.

For most index structures it is a difficult task to find a suitable mate for balancing or merging actions. The only exceptions are the LSD^h-tree [Henrich 1998] and the space-filling curves [Morton 1966; Finkel and Bentley 1974; Abel and Smith 1983; Orenstein and Merret 1984; Faloutsos 1985, 1988; Faloutsos and Roseman 1989; Jagadish 1990] (cf. Sections 6.3 and 6.7). All other authors either suggest reinserting or do not provide a deletion algorithm at all. An alternative approach might be to permit underfilled pages and to maintain them until they are completely empty. The presence of delete operations and the choice of underflow treatment can affect su_eff,data and su_eff,dir positively as well as negatively. An update operation is viewed as a sequence of a delete operation followed by an insert operation; no special procedure has been suggested for it yet.

3.2. Exact Match Query

Exact match queries are defined as follows: given a query point q, determine whether q is contained in the database. Query processing starts with the root node, which is loaded into main memory. For all regions containing point q, the function ExactMatchQuery() is called recursively. As overlap between page regions is allowed in most index structures presented in this survey, it is possible that several branches of the indexing structure have to be examined when processing an exact match query. In the GiST framework [Hellerstein et al. 1995], this situation is handled using the Consistent operation, which is the generic operation that needs to be reimplemented for different instantiations of the generalized search tree. The result of ExactMatchQuery is true if any of the recursive calls returns true. For data pages, the result is true if one of the points stored on the data page fits. If no point fits, the result is false. Algorithm 1 contains the pseudocode for processing exact match queries.

ALGORITHM 1. (Algorithm for Exact Match Queries)

  bool ExactMatchQuery(Point q, PageAdr pa) {
    int i;
    Page p = LoadPage(pa);
    if (IsDatapage(p))
      for (i = 0; i < p.num_objects; i++)
        if (q == p.object[i])
          return true;
    if (IsDirectoryPage(p))
      for (i = 0; i < p.num_objects; i++)
        if (IsPointInRegion(q, p.region[i]))
          if (ExactMatchQuery(q, p.sonpage[i]))
            return true;
    return false;
  }

3.3. Range Query

The algorithm for range query processing returns a set of points contained in the query range as the result to the calling function. The size of the result set
is previously unknown and may reach the size of the entire database. The algorithm is formulated independently of the applied metric. Any L_p metric, including metrics with weighted dimensions (ellipsoid queries [Seidl 1997; Seidl and Kriegel 1997]), can be applied if there exists an effective and efficient test for the predicates IsPointInRange and RangeIntersectRegion. Partial range queries (i.e., range queries where only a subset of the attributes is specified) can also be considered as regular range queries with weights (the unspecified attributes are weighted with zero). Window queries can be transformed into range queries using a weighted L_max metric.

ALGORITHM 2. (Algorithm for Range Queries)

  PointSet RangeQuery(Point q, float r, PageAdr pa) {
    int i;
    PointSet result = EmptyPointSet;
    Page p = LoadPage(pa);
    if (IsDatapage(p))
      for (i = 0; i < p.num_objects; i++)
        if (IsPointInRange(q, p.object[i], r))
          AddToPointSet(result, p.object[i]);
    if (IsDirectoryPage(p))
      for (i = 0; i < p.num_objects; i++)
        if (RangeIntersectRegion(q, p.region[i], r))
          PointSetUnion(result, RangeQuery(q, r, p.childpage[i]));
    return result;
  }

The algorithm (cf. Algorithm 2) performs a recursive self-call for all child pages whose corresponding page regions intersect with the query. The union of the results of all recursive calls is built and passed to the caller.

3.4. Nearest-Neighbor Query

There are two different approaches to processing nearest-neighbor queries on multidimensional index structures. One was published by Roussopoulos et al. [1995] and is referred to in the following as the RKV algorithm. The other, called the HS algorithm, was published in Henrich [1994] and Hjaltason and Samet [1995]. Due to their importance for our further presentation, these algorithms are presented in detail and their strengths and weaknesses are discussed. We start with the description of the RKV algorithm because it is more similar to the algorithm for range query processing, in the sense that a depth-first traversal through the indexing structure is performed. RKV is an algorithm of the "branch and bound" type. In contrast, the HS algorithm loads pages from different branches and different levels of the index in an order induced by the closeness to the query point.

Unlike range query processing, there is no fixed criterion, known a priori, to exclude branches of the indexing structure from processing in nearest-neighbor algorithms. Rather, the criterion is the nearest-neighbor distance, but the nearest-neighbor distance is not known until the algorithm has terminated. To cut branches, nearest-neighbor algorithms have to use pessimistic (conservative) estimations of the nearest-neighbor distance, which change during the run of the algorithm and approach the nearest-neighbor distance. A suitable pessimistic estimation is the distance to the closest point among all points visited at the current state of execution (the so-called closest point
candidate cpc). If no point has been visited yet, it is also possible to derive pessimistic estimations from the page regions visited so far.

3.4.1. The RKV Algorithm. The authors of the RKV algorithm define two important distance functions, MINDIST and MINMAXDIST. MINDIST is the actual distance between the query point and a page region in the geometrical sense, that is, the nearest possible distance of any point inside the region to the query point. The definition in the original proposal [Roussopoulos et al. 1995] is limited to R-tree-like structures, where regions are provided as multidimensional intervals (i.e., minimum bounding rectangles, MBRs) I with

$I = [lb_0, ub_0] \times \cdots \times [lb_{d-1}, ub_{d-1}]$.

Then, MINDIST is defined as follows.

Definition 5 (MINDIST). The distance of a point q to region I, denoted MINDIST(q, I), is given by

$\text{MINDIST}^2(q, I) = \sum_{i=0}^{d-1} \begin{cases} (lb_i - q_i)^2 & \text{if } q_i < lb_i \\ (q_i - ub_i)^2 & \text{if } ub_i < q_i \\ 0 & \text{otherwise.} \end{cases}$

Fig. 7. MINDIST and MAXDIST.

An example of MINDIST is presented on the left side of Figure 7. For page regions pr1 and pr3, the edges of the rectangles define the MINDIST; for page region pr4, the corner defines MINDIST. As the query point lies in pr2, the corresponding MINDIST is 0. A similar definition can also be provided for differently shaped page regions, such as spheres (subtract the radius from the distance between center and q) or combinations. Similar definitions can be given for the L1 and L_max metrics, respectively.

For a pessimistic estimation, some specific knowledge about the underlying indexing structure is required. One assumption that is true for all known index structures is that every page must contain at least one point. Therefore, we can define the following MAXDIST function, determining the distance to the farthest possible point inside a region:

$\text{MAXDIST}^2(q, I) = \sum_{i=0}^{d-1} \left( \max\{|lb_i - q_i|,\; |q_i - ub_i|\} \right)^2.$

MAXDIST is not defined in the original paper, as it is not needed in R-tree-like structures. An example is shown on the right side of Figure 7. Being the greatest possible distance from the query point to a point in a page region, the MAXDIST is not equal to 0 even if the query point is located inside the page region pr2.

In R-trees, the page regions are minimum bounding rectangles (MBRs), that is, rectangular regions where each surface hyperplane contains at least one data point.
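Both functions are straightforward to implement for MBRs. The following Python sketch (ours) computes the squared MINDIST and MAXDIST for a region given as lists of lower and upper bounds:

def mindist_sq(q, lb, ub):
    # Squared distance from q to the nearest point of the MBR [lb, ub];
    # a dimension contributes 0 where q lies inside the interval.
    s = 0.0
    for qi, l, u in zip(q, lb, ub):
        if qi < l:
            s += (l - qi) ** 2
        elif qi > u:
            s += (qi - u) ** 2
    return s

def maxdist_sq(q, lb, ub):
    # Squared distance from q to the farthest corner of the MBR.
    return sum(max(abs(l - qi), abs(qi - u)) ** 2
               for qi, l, u in zip(q, lb, ub))

# Example: query point inside the region -> MINDIST 0, MAXDIST > 0.
print(mindist_sq((0.5, 0.5), (0.4, 0.4), (0.6, 0.6)))  # 0.0
print(maxdist_sq((0.5, 0.5), (0.4, 0.4), (0.6, 0.6)))  # 0.02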
The following MINMAXDIST function provides a better (i.e., lower) but still conservative estimation of the nearest-neighbor distance:

$\text{MINMAXDIST}^2(q, I) = \min_{0 \le k < d} \left( |q_k - rm_k|^2 + \sum_{\substack{i \ne k \\ 0 \le i < d}} |q_i - rM_i|^2 \right),$

where

$rm_k = \begin{cases} lb_k & \text{if } q_k \le \frac{lb_k + ub_k}{2} \\ ub_k & \text{otherwise} \end{cases}$   and   $rM_i = \begin{cases} lb_i & \text{if } q_i \ge \frac{lb_i + ub_i}{2} \\ ub_i & \text{otherwise.} \end{cases}$

The general idea is that every surface hyperarea must contain a point. The farthest point on every surface is determined and, among those, the minimum is taken. For each pair of opposite surfaces, only the nearer surface can contain the minimum. Thus, it is guaranteed that a data object can be found in the region at a distance less than or equal to MINMAXDIST(q, I); MINMAXDIST(q, I) is the smallest distance providing this guarantee.

Fig. 8. MINMAXDIST.

The example in Figure 8 shows on the left side the considered edges. Among each pair of opposite edges of an MBR, only the edge closer to the query point is considered. The point yielding the maximum distance on each considered edge is marked with a circle. The minimum among all marked points of each page region defines the MINMAXDIST, as shown on the right side of Figure 8.

This pessimistic estimation cannot be used for spherical or combined regions, as these in general do not fulfill a property similar to the MBR property. In this case MAXDIST(q, I), which is a worse estimation than MINMAXDIST, has to be used. All definitions presented using the L2 metric in the original paper [Roussopoulos et al. 1995] can easily be adapted to the L1 or L_max metric, as well as to weighted metrics.
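In code, MINMAXDIST can be computed in O(d) time by precomputing the contribution of the farther interval endpoint in every dimension and then exchanging it, dimension by dimension, for the nearer one. A Python sketch of ours, continuing the MINDIST/MAXDIST example (note that |q_k − rm_k| is simply the distance to the nearer endpoint and |q_i − rM_i| the distance to the farther one):

def minmaxdist_sq(q, lb, ub):
    d = len(q)
    # Squared distance to the nearer (rm) and farther (rM) interval
    # endpoint in each dimension.
    rm_sq = [min((qi - l) ** 2, (qi - u) ** 2) for qi, l, u in zip(q, lb, ub)]
    rM_sq = [max((qi - l) ** 2, (qi - u) ** 2) for qi, l, u in zip(q, lb, ub)]
    total_far = sum(rM_sq)
    # Try each dimension k as the one where the nearer face is taken.
    return min(total_far - rM_sq[k] + rm_sq[k] for k in range(d))

# MINDIST <= MINMAXDIST <= MAXDIST, e.g. for q outside the region:
q, lb, ub = (0.0, 0.0), (0.4, 0.4), (0.6, 0.6)
print(mindist_sq(q, lb, ub), minmaxdist_sq(q, lb, ub), maxdist_sq(q, lb, ub))
# 0.32 0.52 0.72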
The RKV algorithm (cf. Algorithm 3) accesses the pages of an index in a depth-first order ("branch and bound"). A branch of the index is always completely processed before the next branch is begun. Before child nodes are loaded and recursively processed, they are heuristically sorted according to their probability of containing the nearest neighbor. For the sorting order, the optimistic or pessimistic estimation, or a combination thereof, may be chosen. The quality of the sorting is critical for the efficiency of the algorithm, because for different processing sequences the estimation of the nearest-neighbor distance may approach the actual nearest-neighbor distance more or less quickly. Roussopoulos et al. [1995] report advantages for the optimistic estimation. The list of child nodes is pruned whenever the pessimistic estimation of the nearest-neighbor distance changes. Pruning means discarding all child nodes having a MINDIST larger than the pessimistic estimation of the nearest-neighbor distance. These pages are guaranteed not to contain the nearest neighbor, because even the closest point in these pages is farther away than an already found point (lower bounding property). The pessimistic estimation is the lowest among all distances to points processed thus far and all results of the MINMAXDIST(q, I) function for all page regions processed thus far.

ALGORITHM 3. (The RKV Algorithm for Finding the Nearest Neighbor)

  /* The current distance for pruning branches; initialized
     before the start of the RKV algorithm. */
  float pruning_dist = INFINITE;
  /* The closest point candidate. This variable will contain the
     nearest neighbor after the RKV algorithm has completed. */
  Point cpc;

  void RKV_algorithm(Point q, PageAdr pa) {
    int i;
    float h;
    Page p = LoadPage(pa);
    if (IsDatapage(p))
      for (i = 0; i < p.num_objects; i++) {
        h = PointToPointDist(q, p.object[i]);
        if (pruning_dist >= h) {
          pruning_dist = h;
          cpc = p.object[i];
        }
      }
    if (IsDirectoryPage(p)) {
      sort(p, CRITERION); /* CRITERION is MINDIST or MINMAXDIST */
      for (i = 0; i < p.num_objects; i++) {
        if (MINDIST(q, p.region[i]) <= pruning_dist)
          RKV_algorithm(q, p.childpage[i]);
        h = MINMAXDIST(q, p.region[i]);
        if (pruning_dist >= h)
          pruning_dist = h;
      }
    }
  }

In Cheung and Fu [1998], several heuristics for the RKV algorithm with and without the MINMAXDIST function are discussed. The authors prove that any page that can be pruned by exploiting MINMAXDIST can also be pruned without that concept. Their conclusion is that the determination of MINMAXDIST should be avoided, as it causes additional overhead for the computation of MINMAXDIST.

Extending the algorithm to k-nearest-neighbor processing is a difficult task. Unfortunately, the authors make it easy by discarding the MINMAXDIST from pruning, sacrificing the performance gains obtainable from MINMAXDIST-based pruning. The kth lowest among all distances to points found thus far must be used. Additionally required is a buffer for k points (the k closest point candidate list, cpcl), which allows an efficient deletion of the point with the highest distance and an efficient insertion of a random point. A suitable data structure for the closest point candidate list is a priority queue (also known as a semisorted heap [Knuth 1975]).

Considering the MINMAXDIST imposes some difficulties, as the algorithm has to ensure that k points are closer to the query than a given region. For each region, we know that at least one point must have a distance less than or equal to MINMAXDIST. If the k-nearest-neighbor algorithm pruned a branch according to MINMAXDIST, it would assume that k points are positioned on the nearest surface hyperplane of the page region, but the MBR property only guarantees one such point. We further know that m points must have
a distance less than or equal to MAXDIST, where m is the number of points stored in the corresponding subtree. The number m could, for example, be stored in the directory nodes, or could be estimated pessimistically by assuming minimal storage utilization, if the indexing structure provides storage utilization guarantees. A suitable extension of the RKV algorithm could use a semisorted heap with k entries. Each entry is a cpc, a MAXDIST estimation, or a MINMAXDIST estimation. The heap entry with the greatest distance to the query point q is used for branch pruning and is called the pruning element. Whenever new points or estimations are encountered, they are inserted into the heap if they are closer to the query point than the pruning element. Whenever a new page is processed, all estimations based on the corresponding page region have to be deleted from the heap. They are replaced by the estimations based on the regions of the child pages (or by the contained points, if it is a data page). This additional deletion implies additional complexity, because a priority queue does not efficiently support the deletion of elements other than the pruning element. All these difficulties are neglected in the original paper [Roussopoulos et al. 1995].

3.4.2. The HS Algorithm. The problems arising from the need to estimate the nearest-neighbor distance are elegantly avoided in the HS algorithm [Hjaltason and Samet 1995]. The HS algorithm does not access the pages in an order induced by the hierarchy of the indexing structure, such as depth-first or breadth-first. Rather, all pages of the index are accessed in the order of increasing distance to the query point. The algorithm is allowed to jump between branches and levels when processing pages (see Figure 9).

Fig. 9. The HS algorithm for finding the nearest neighbor.

The algorithm manages an active page list (APL). A page is called active if its parent has been processed but not the page itself. Since the parent of an active page has been loaded, the corresponding region of every active page is known and the distance between region and query point can be determined. The APL stores the background storage address of each page, as well as its distance to the query point. The representation of the page region is not needed in the APL. A processing step of the HS algorithm comprises the following actions (a compact code sketch follows below):

—Select the page p with the lowest distance to the query point from the APL.
—Load p into main memory.
—Delete p from the APL.
—If p is a data page, determine whether one of the points contained in this page is closer to the query point than the closest point found so far (called the closest point candidate cpc).
—Otherwise: determine the distances to the query point for the regions of all child pages of p and insert all child
pages and the corresponding distances into the APL.

The processing step is repeated until the closest point candidate is closer to the query point than the nearest active page. In this case, no active page can contain a point closer to q than cpc, due to the lower bounding property. Also, no subtree of any active page may contain such a point. As all other pages have already been looked upon, processing can stop. Again, the priority queue is the suitable data structure for the APL. For k-nearest-neighbor processing, a second priority queue with fixed length k is required for the closest point candidate list.
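The following Python sketch (ours; the tuple-based node layout and the function names are illustrative assumptions, not the original formulation) implements the HS processing loop with a binary heap as the APL:

import heapq, itertools, math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mindist(q, lb, ub):
    # Distance from q to the nearest point of the MBR [lb, ub].
    return math.sqrt(sum((l - x) ** 2 if x < l else (x - u) ** 2 if x > u else 0.0
                         for x, l, u in zip(q, lb, ub)))

# A directory node: ('dir', [(lb, ub, child), ...]); a data page: ('data', [points]).
def hs_nearest_neighbor(root, q):
    counter = itertools.count()          # tie-breaker for equal distances
    apl = [(0.0, next(counter), root)]   # active page list as a min-heap
    cpc, cpc_dist = None, float('inf')
    while apl and apl[0][0] < cpc_dist:  # nearest active page vs. cpc
        _, _, (kind, entries) = heapq.heappop(apl)
        if kind == 'data':
            for p in entries:
                d = euclid(p, q)
                if d < cpc_dist:
                    cpc, cpc_dist = p, d
        else:
            for lb, ub, child in entries:
                heapq.heappush(apl, (mindist(q, lb, ub), next(counter), child))
    return cpc, cpc_dist

# Example with one directory level:
leaf1 = ('data', [(0.1, 0.1), (0.2, 0.3)])
leaf2 = ('data', [(0.8, 0.9)])
root = ('dir', [((0.1, 0.1), (0.2, 0.3), leaf1), ((0.8, 0.8), (0.8, 0.9), leaf2)])
print(hs_nearest_neighbor(root, (0.25, 0.25)))  # ((0.2, 0.3), 0.0707...)

Note how the stopping rule is expressed directly in the loop condition: the search ends as soon as the closest point candidate undercuts the distance of the nearest active page.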
3.4.3. Discussion. We now compare the two algorithms in terms of their space and time complexity. In the context of space complexity, we regard the available main memory as the most important system limitation. We assume that the stack for recursion management and all priority queues are held in main memory, although one could also provide an implementation of the priority queue data structure suitable for secondary storage usage.

LEMMA 1 (Worst-Case Space Complexity of the RKV Algorithm). The RKV algorithm has a worst-case space complexity of O(log n).

For the proof see Appendix A. As the RKV algorithm performs a depth-first pass through the indexing structure and no additional dynamic memory is required, the space complexity is O(log n). Lemma 1 is also valid for the k-nearest-neighbor search, if allowance is made for the additional space requirement of the closest point candidate list, with a space complexity of O(k).

LEMMA 2 (Worst-Case Space Complexity of the HS Algorithm). The HS algorithm has a space complexity of O(n) in the worst case.

For the proof see Appendix B. In spite of the order O(n), the size of the APL is only a very small fraction of the size of the data set, because the APL contains only the page address and the distance between the page region and the query point q. If the size of the data set in bytes is DSS, then we have a number DP of data pages with

$DP = \frac{DSS}{su_{\text{eff,data}} \cdot \text{sizeof(DataPage)}}.$

Then the size of the APL is f times the data set size:

$\text{sizeof(APL)} = f \cdot DSS = \frac{\text{sizeof(float)} + \text{sizeof(address)}}{su_{\text{eff,data}} \cdot \text{sizeof(DataPage)}} \cdot DSS,$

where a typical factor for a page size of 4 Kbytes is f = 0.3%, shrinking further with a growing data page size. Thus, although theoretically unattractive, it should be no practical problem to hold 0.3% of a database in main memory.

For the objective of comparing the two algorithms, we prove the optimality of the HS algorithm in the sense that it accesses as few pages as theoretically possible for a given index. We further show, using counterexamples, that the RKV algorithm does not generally reach this optimum.

LEMMA 3 (Page Regions Intersecting the Nearest-Neighbor Sphere). Let nndist be the distance between the query point and its nearest neighbor. All pages that intersect a sphere around the query point having a radius equal to nndist (the so-called nearest-neighbor sphere) must be accessed for query processing. This condition is necessary and sufficient.

For the proof see Appendix C.

LEMMA 4 (Schedule of the HS Algorithm). The HS algorithm accesses pages in the order of increasing distance to the query point.

For the proof see Appendix D.

LEMMA 5 (Optimality of the HS Algorithm). The HS algorithm is optimal in terms of the number of page accesses.

For the proof see Appendix E.

Fig. 10. Schedules of the RKV and HS algorithms.

Now we demonstrate by an example that the RKV algorithm does not always yield an optimal number of page accesses. The main reason is that once a branch of the index has been selected, it has to be completely processed before a new branch can be begun. In the example of Figure 10, both algorithms choose pr1 to load first. Some important MINDISTs and MINMAXDISTs are marked in the figure with solid and dotted arrows, respectively. Whereas the HS algorithm loads pr2 and pr21 next, the RKV algorithm first has to load pr11 and pr12, because no MINMAXDIST estimate can prune the corresponding branches. If pr11 and pr12 are not data pages but represent further subtrees of larger height, many of the pages in these subtrees have to be accessed.

We can summarize that the HS algorithm for nearest-neighbor search is superior to the RKV algorithm when counting page accesses. On the other hand, it has the disadvantage of dynamically allocating main memory of the order O(n), although with a very small factor of less than 1% of the database size. In addition, the extension of the RKV algorithm to a k-nearest-neighbor search is difficult to implement. An open question is whether minimizing the number of page accesses also minimizes the time needed for the page accesses. We show later that statically constructed indexes yield an interpage clustering, meaning that all pages in a branch of the index are laid out contiguously on background storage. Therefore, the depth-first search of the RKV algorithm could yield fewer disk-head movements than the distance-driven search of the HS algorithm. A new challenge could be to develop an algorithm for the nearest-neighbor search that directly optimizes the processing time rather than the number of page accesses.

3.5. Ranking Query

Ranking queries can be seen as generalized k-nearest-neighbor queries with a previously unknown result set size k. A typical application of a ranking query requests the nearest neighbor first, then the second closest point, the third, and so on. The requests stop according to a criterion that is external to the index-based query processing. Therefore, neither a limited query range nor a limited result set size can be assumed before the application terminates the ranking query. In contrast to the k-nearest-neighbor algorithm, a ranking query algorithm needs an unlimited priority queue for the candidate list of closest points (cpcl). A further difference is that each request for the next closest point is regarded as a phase that ends with reporting the next resulting point. The phases are optimized independently. In contrast, the k-nearest-neighbor algorithm searches all k points in a single phase and reports the complete set.
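Before we turn to the details of the phases, note that the HS traversal extends naturally to ranking: turning the loop into a generator yields one result per phase, with the APL and the candidate list preserved between phases. A sketch of ours (same illustrative node layout as the HS sketch above):

import heapq, itertools, math

def _euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def _mindist(q, lb, ub):
    return math.sqrt(sum((l - x) ** 2 if x < l else (x - u) ** 2 if x > u else 0.0
                         for x, l, u in zip(q, lb, ub)))

def hs_ranking(root, q):
    # Yields database points in order of increasing distance to q; each
    # yield ends one phase, and both queues persist between phases.
    counter = itertools.count()
    apl = [(0.0, next(counter), root)]   # active page list (min-heap)
    cpcl = []                            # closest point candidates (min-heap)
    while apl or cpcl:
        # Report every candidate that no active page can undercut.
        while cpcl and (not apl or cpcl[0][0] <= apl[0][0]):
            dist, _, p = heapq.heappop(cpcl)
            yield p, dist
        if not apl:
            break
        _, _, (kind, entries) = heapq.heappop(apl)
        if kind == 'data':
            for p in entries:
                heapq.heappush(cpcl, (_euclid(p, q), next(counter), p))
        else:
            for lb, ub, child in entries:
                heapq.heappush(apl, (_mindist(q, lb, ub), next(counter), child))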
In each phase of a ranking query algorithm, all points encountered during the data page accesses are stored in the cpcl. A phase ends if it is guaranteed that unprocessed index pages cannot contain a point closer than the first point in the cpcl (the corresponding criterion of the k-nearest-neighbor algorithm is based on the last element of the cpcl). Before the next phase begins, the leading element is deleted from the cpcl.

It does not appear very attractive to extend the RKV algorithm to process ranking queries, because effective branch pruning can be performed neither based on MINMAXDIST or MAXDIST estimates nor based on the points encountered during data page accesses. In contrast, the HS algorithm for nearest-neighbor processing needs only the modifications described above to be applied as a ranking query algorithm; the original proposal [Hjaltason and Samet 1995] contains these extensions. The major limitation of the HS algorithm for ranking queries is the cpcl. It can be proven, similarly to Lemma 2, that the length of the cpcl is of the order O(n). In contrast to the APL, however, the cpcl contains the full information of possibly all data objects stored in the index. Thus its size is bounded only by the database size, questioning the applicability not only in theory, but also in practice. From our point of view, a priority queue implementation suitable for background storage is required for this purpose.

3.6. Reverse Nearest-Neighbor Queries

In Korn and Muthukrishnan [2000], the authors introduce the operation of reverse nearest-neighbor queries. Given an arbitrary query point q, this operation retrieves all points of the database to which q is the nearest neighbor, that is, the set of reverse nearest neighbors. Note that the nearest-neighbor relation is not symmetric: if some point p1 is the nearest neighbor of p2, then p2 is not necessarily the nearest neighbor of p1. Therefore, the result set of the RNN operation can be empty or may contain an arbitrary number of points.

Fig. 11. Indexing for the reverse nearest-neighbor search.

A database point p is in the result set of the RNN operation for query point q unless another database point p′ is closer to p than q is. Consequently, p is in the result set if q is enclosed by the sphere centered at p and touching the nearest neighbor of p (the nearest-neighbor sphere of p). Therefore, in Korn and Muthukrishnan [2000] the problem is solved by a specialized index structure for sphere objects that stores the nearest-neighbor spheres rather than the database points. An RNN query corresponds to a point query in that index structure (see Figure 11). For an insert operation, the set of reverse nearest neighbors of the new point must be determined, and the corresponding nearest-neighbor spheres of all result points must be reinserted into the index. The two most important drawbacks of this solution are the high cost of the insert operation and the use of a highly specialized index. For instance, if the RNN has to be determined for only a subset of the dimensions, a completely new index must be constructed. Therefore, in Stanoi et al. [2000] the authors propose a solution for point index structures; this solution, however, is limited to the two-dimensional case.

4. COST MODELS FOR HIGH-DIMENSIONAL INDEX STRUCTURES

Due to the high practical relevance of multidimensional indexing, cost models for estimating the number of necessary page
accesses were proposed several years ago. The first approach is the well-known cost model proposed by Friedman et al. [1977] for nearest-neighbor query processing using the maximum metric. The original model estimates leaf accesses in a kd-tree, but can easily be extended to estimate data page accesses of R-trees and related index structures. This extension was presented in Faloutsos et al. [1987] and, with slightly different aspects, in Aref and Samet [1991], Pagel et al. [1993], and Theodoridis and Sellis [1996]. The expected number of data page accesses in an R-tree is

$A_{\text{nn,mm,FBF}} = \left( \sqrt[d]{\frac{1}{C_{\text{eff}}}} + 1 \right)^d.$

This formula is motivated as follows. The query evaluation algorithm is assumed to access an area of the data space that is a hypercube of the volume V1 = 1/N, where N is the number of objects stored in the database. Analogously, the page region is approximated by a hypercube with the volume V2 = C_eff/N. In each dimension, the chance that the projections of V1 and V2 intersect each other corresponds to $\sqrt[d]{V_1} + \sqrt[d]{V_2}$ if n → ∞. To obtain the probability that V1 and V2 intersect in all dimensions, this term must be taken to the power of d. Multiplying this result by the number of data pages N/C_eff yields the expected number of page accesses A_nn,mm,FBF. The assumptions of the model, however, are unrealistic for nearest-neighbor queries on high-dimensional data for several reasons. First, the number N of objects in the database is assumed to approach infinity. Second, effects of high-dimensional data spaces and correlations are not considered by the model.

In Cleary [1979] the model of Friedman et al. [1977] is extended by allowing nonrectangular page regions, but boundary effects and correlations are still not considered. In Eastman [1981] the existing models are used for optimizing the bucket size of the kd-tree. In Sproull [1991] the author shows that the number of data points must be exponential in the number of dimensions for the models to provide accurate estimations. According to Sproull, boundary effects significantly contribute to the costs unless the following condition holds:

$N \gg C_{\text{eff}} \cdot \left( \sqrt[d]{\frac{1}{C_{\text{eff}} \cdot V_S\left(\frac{1}{2}\right)}} + 1 \right)^d,$

where V_S(r) is the volume of a hypersphere with radius r, which can be computed as

$V_S(r) = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d,$

with the gamma function Γ(x), the extension of the factorial operator x! = Γ(x + 1) into the domain of real numbers: $\Gamma(x + 1) = x \cdot \Gamma(x)$, $\Gamma(1) = 1$, and $\Gamma(\frac{1}{2}) = \sqrt{\pi}$. For example, in a 20-dimensional data space with C_eff = 20, Sproull's formula evaluates to N ≫ 1.1 · 10^11. We show later (cf. Figure 12) how bad the cost estimations of the FBF model are if substantially fewer than a hundred billion points are stored in the database.

Fig. 12. Evaluation of the model of Friedman et al. [1977].

Unfortunately, Sproull still assumes uniformity and independence in the distribution of data points and queries for his analysis; that is, both the data points and the centerpoints of the queries are chosen from a uniform data distribution, whereas the selectivity of the queries (1/N) is considered fixed. The above formulas are also generalized to k-nearest-neighbor queries, where k is also a user-given parameter.
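These formulas are easy to evaluate numerically. The short Python sketch below (ours) computes the FBF estimate and reproduces Sproull's N ≫ 1.1 · 10^11 example for d = 20 and C_eff = 20:

import math

def sphere_volume(d, r):
    # V_S(r) = sqrt(pi^d) / Gamma(d/2 + 1) * r^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def fbf_accesses(d, c_eff):
    # Expected data page accesses: ((1/C_eff)^(1/d) + 1)^d
    return ((1.0 / c_eff) ** (1.0 / d) + 1.0) ** d

def sproull_threshold(d, c_eff):
    # N must greatly exceed this value for boundary effects
    # to be negligible.
    vs = sphere_volume(d, 0.5)
    return c_eff * ((1.0 / (c_eff * vs)) ** (1.0 / d) + 1.0) ** d

print(fbf_accesses(20, 20))       # grows rapidly with d
print(sproull_threshold(20, 20))  # ~1.1e11, as quoted above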
The assumptions made in the existing models do not hold in the high-dimensional case. The main reason for the problems of the existing models is that they do not consider boundary effects. "Boundary effects" stands for an exceptional performance behavior that occurs when the query reaches the boundary of the data space. Boundary effects occur frequently in high-dimensional data spaces and lead to the pruning of major amounts of empty search space, which is not considered by the existing models. To examine these effects, we performed experiments comparing the necessary page accesses with the model estimations. Figure 12 shows the actual page accesses for uniformly distributed point data versus the estimations of the Friedman et al. model. For high-dimensional data, the model completely fails to estimate the number of page accesses.

The basic model of Friedman et al. [1977] has been extended in two different directions. The first is to take correlation effects into account by using the concept of the fractal dimension [Mandelbrot 1977; Schröder 1991]. There are various definitions of the fractal dimension, all of which capture the relevant aspect (the correlation), but they differ in the details of how the correlation is measured.

In Faloutsos and Kamel [1994] the authors used the box-counting fractal dimension (also known as the Hausdorff fractal dimension) for modeling the performance of R-trees when processing range queries using the maximum metric. In their model they assume a correlation in the points stored in the database; for the queries, they still assume a uniform and independent distribution. The analysis does not take into account effects of high-dimensional spaces, and the evaluation is limited to data spaces with dimensions less than or equal to three. In Belussi and Faloutsos [1995] the authors used the fractal dimension with a different definition (the correlation fractal dimension) for the selectivity estimation of spatial queries. In that paper, range queries in low-dimensional data spaces using the Manhattan, Euclidean, and maximum metrics were modeled. Unfortunately, the model only allows the estimation of selectivities; it is not possible to extend the model in a straightforward way to determine expectations of page accesses.

Papadopoulos and Manolopoulos [1997b] used the results of Faloutsos and Kamel and of Belussi and Faloutsos for a new model published in a recent paper. Their model is capable of estimating data page accesses of R-trees when processing nearest-neighbor queries in a Euclidean space. They estimate the distance of the nearest neighbor by using the selectivity estimation presented in Belussi and Faloutsos [1995] in the reverse way. As it is difficult to determine accesses to pages with rectangular regions for spherical queries, they approximate query spheres by minimum bounding and maximum enclosed cubes and determine upper and lower bounds on the number of page accesses in this way. This approach makes the model inoperative for high-dimensional data spaces, because the approximation error grows exponentially with increasing dimension. Note that in a 20-dimensional data space, the volume of the minimum bounding cube of a sphere is larger than the volume of the sphere by a factor of $1/V_S(1/2) = 4.1 \cdot 10^7$. The sphere volume, in turn, is $V_S(\sqrt{d}/2) = 27{,}000$ times larger than that of the greatest enclosed cube.
An asset of Papadopoulos and Manolopoulos' model is that queries are no longer assumed to be taken from a uniform and independent distribution. Instead, the authors assume that the query distribution follows the data distribution.

The concept of the fractal dimension is also widely used in the domain of spatial databases, where the complexity of stored polygons is modeled [Gaede 1995; Faloutsos and Gaede 1996]. These approaches are of minor importance for point databases.

The second direction in which the basic model of Friedman et al. [1977] needs extension concerns the boundary effects occurring when indexing data spaces of higher dimensionality.

Arya [1995] and Arya et al. [1995] presented a new cost model for processing
nearest-neighbor queries in the context of the application domain of vector quantization. Arya et al. restricted their model to the maximum metric and neglected correlation effects. Unfortunately, they still assumed that the number of points is exponential in the dimension of the data space. This assumption is justified in their application domain, but it is unrealistic for database applications.

Berchtold et al. [1997b] presented a cost model for query processing in high-dimensional data spaces, the so-called BBKK model. The basic concept of the BBKK model is the Minkowski sum (cf. Figure 13), a concept from robot motion planning that the BBKK model introduced to cost estimations for the first time. The general idea is to transform a query having a spatial extension (such as a range query or nearest-neighbor query) equivalently into a point query by enlarging the page region (see the sketch at the end of this section). In Figure 13, the page region has been enlarged such that a point query lies in the enlarged region if (and only if) the original query intersects the original region. Together with concepts to estimate the size of page regions and query regions, the model provides accurate estimations for nearest-neighbor and range queries using the Euclidean metric, and it considers boundary effects. To cope with correlation, the authors propose using the fractal dimension, without presenting the details. The main limitations of the model are (1) that no estimation for the maximum metric is presented, (2) that the number of data pages is assumed to be a power of two, and (3) that a complete, overlap-free coverage of the data space with data pages is assumed.

Fig. 13. The Minkowski sum.

Weber et al. [1998] use the cost model by Berchtold et al., without the extension for correlated data, to show the superiority of the sequential scan in sufficiently high dimensions. They present the VA-file, an improvement of the sequential scan. Ciaccia et al. [1998] adapt the cost model [Berchtold et al. 1997b] to estimate the page accesses of the M-tree, an index structure for data spaces that are metric spaces but not vector spaces (i.e., only the distances between the objects are known, but no explicit positions). In Papadopoulos and Manolopoulos [1998] the authors apply the cost model to the declustering of data in a disk array. Two papers by Agrawal et al. [1998] and Riedel et al. [1998] present applications in the data mining domain.

A recent paper [Böhm 2000] is based on the BBKK cost model, which is presented there in a comprehensive way and extended in many aspects. The extensions not yet covered by the BBKK model include all estimations for the maximum metric, which are developed throughout the whole paper. The restriction of the BBKK model to numbers of data pages that are a power of two is overcome. A further extension of the model regards k-nearest-neighbor queries (the BBKK model is restricted to one-nearest-neighbor queries). The numerical methods for integral approximation and for the estimation of the boundary effects were to a large extent beyond the scope of Berchtold et al. [1997b]. Finally, the concept of the fractal dimension, which was used in the BBKK model in a simplified way (the data space dimension is simply replaced by the fractal dimension), is well established in this paper by the consequent application of the fractal power laws.
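The Minkowski-sum idea is easiest to see under the maximum metric, where the query region is a cube and the enlarged page region is simply the MBR expanded by the query radius on every side (the Euclidean case of Figure 13 additionally rounds the corners). A minimal sketch, assuming a uniform query distribution over the unit data space; the function name is ours:

```python
def minkowski_sum_volume(mbr, r):
    # mbr: list of (low, high) intervals of a page region in [0, 1]^d.
    # For a range query of radius r under the maximum metric, the Minkowski
    # sum is the MBR enlarged by r on every side. Its volume, clipped to the
    # unit data space, approximates the page's access probability for
    # uniformly distributed query points.
    v = 1.0
    for low, high in mbr:
        v *= min(high + r, 1.0) - max(low - r, 0.0)
    return v

# Two volume-equivalent page regions: a cube-like and an elongated one.
print(minkowski_sum_volume([(0.4, 0.6), (0.4, 0.6)], 0.1))    # 0.16
print(minkowski_sum_volume([(0.3, 0.7), (0.45, 0.55)], 0.1))  # 0.18 (worse)
```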
5. INDEXING IN METRIC SPACES

In some applications, objects cannot be mapped into feature vectors. However, there still exists some notion of similarity between objects, which can be expressed as a metric distance between the objects; that is, the objects are embedded in a metric space. The object distances can be used directly for query evaluation.
Several index structures for pure metric spaces have been proposed in the literature. Probably the oldest reference is the so-called Burkhard–Keller tree [Burkhard and Keller 1973]. Burkhard–Keller trees use a distance function that returns a small number (i) of discrete values. An arbitrary object is chosen as the root of the tree, and the distance function is used to partition the remaining data objects into i subsets, which become the i branches of the tree. The same procedure is repeated for each nonempty subset to build up the tree (cf. Figure 14). More recently, a number of variants of the Burkhard–Keller tree have been proposed [Baeza-Yates et al. 1994]. In the fixed queries tree, for example, the data objects used as pivots are confined to be the same on the same level of the tree [Baeza-Yates et al. 1994].

Fig. 14. Example Burkhard–Keller tree (D: data points, v: values of discrete distance function).

In most applications, a continuous distance function is used. Examples of index structures based on a continuous distance function are the vantage-point tree (VPT), the generalized hyperplane tree (GHT), and the M-tree. The VPT [Uhlmann 1991; Yianilos 1993] is a binary tree that uses some pivot element as the root and partitions the remaining data elements into two subsets based on their distance with respect to the pivot element. The same is repeated recursively for the subsets (cf. Figure 15). Variants of the VPT are the optimized VP-tree [Chiueh 1994], the Multiple VP-tree [Bozkaya and Ozsoyoglu 1997], and the VP-Forest [Yianilos 1999].

Fig. 15. Example vantage-point tree.

The GHT [Uhlmann 1991] is also a binary tree that uses two pivot elements on each level of the tree. All data elements that are closer to the first pivot element are assigned to the left subtree, and all elements closer to the other pivot element are assigned to the other subtree (cf. Figure 16). A variant of the GHT is the geometric near-neighbor access tree (GNAT) [Brin 1995]. The main difference is that the GNAT is an m-ary tree that uses m pivots on each level of the tree.

Fig. 16. Example generalized hyperplane tree.

The basic structure of the M-tree [Ciaccia et al. 1997] is similar to the VP-tree. The main difference is that the M-tree is designed for secondary memory and allows overlap in the covered areas to allow easier updates. Note that among all metric index structures the M-tree is the only one that is optimized for large secondary-memory-based data sets; all others are main-memory index structures supporting rather small data sets. Note also that metric indexes are only used in applications where the distance in vector space is not meaningful. This is true since vector spaces contain more information and therefore allow a better structuring of the data than general metric spaces.
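The Burkhard–Keller scheme translates directly into code. A minimal sketch, assuming a discrete metric (here, the absolute difference of integers, chosen purely for illustration); pruning in the range search uses the triangle inequality:

```python
class BKNode:
    def __init__(self, obj):
        self.obj = obj
        self.children = {}            # discrete distance value -> subtree

def bk_insert(node, obj, dist):
    d = dist(obj, node.obj)
    if d in node.children:
        bk_insert(node.children[d], obj, dist)
    else:
        node.children[d] = BKNode(obj)

def bk_range_search(node, query, radius, dist, result):
    d = dist(query, node.obj)
    if d <= radius:
        result.append(node.obj)
    # Triangle inequality: only branches with |v - d| <= radius can match.
    for v, child in node.children.items():
        if abs(v - d) <= radius:
            bk_range_search(child, query, radius, dist, result)

root = BKNode(50)
for x in (17, 63, 42, 80, 55):
    bk_insert(root, x, lambda a, b: abs(a - b))
res = []
bk_range_search(root, 58, 6, lambda a, b: abs(a - b), res)
print(res)   # [63, 55]
```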
6. APPROACHES TO HIGH-DIMENSIONAL INDEXING

In this section, we introduce and briefly discuss the most important index structures for high-dimensional data spaces. We first describe index structures using minimum bounding rectangles as page regions, such as the R-tree, the R*-tree, and the X-tree. We continue with the structures using bounding spheres, such as the SS-tree and the TV-tree, and conclude with two structures using combined regions. The SR-tree uses the intersection solid of MBR and bounding sphere as the page region. The page region of a space filling curve is the union of not necessarily connected hypercubes.

Multidimensional access methods that have not been investigated for query processing in high-dimensional data spaces, such as hashing-based methods [Nievergelt et al. 1984; Otoo 1984; Hinrichs 1985; Krishnamurthy and Whang 1985; Ouksel 1985; Kriegel and Seeger 1986, 1987, 1988; Freeston 1987; Hutflesz et al. 1988a, b; Henrich et al. 1989], are excluded from the discussion here. In the VAMSplit R-tree [Jain and White 1996] and in the Hilbert-R-tree [Kamel and Faloutsos 1994], methods for statically constructing R-trees are presented. Since the VAMSplit R-tree and the Hilbert-R-tree are more a construction method than an indexing structure of their own, they are also not presented in detail here.

6.1. R-tree, R*-tree, and R+-tree

The R-tree [Guttman 1984] family of index structures uses solid minimum bounding rectangles (MBRs) as page regions. An MBR is a multidimensional interval of the data space (i.e., an axis-parallel multidimensional rectangle). MBRs are minimal approximations of the enclosed point set; there exists no smaller axis-parallel rectangle also enclosing the complete point set. Therefore, every (d − 1)-dimensional surface area must contain at least one datapoint. Space partitioning is neither complete nor disjoint: parts of the data space may not be covered at all by data page regions, and overlap between regions in different branches is allowed, although overlap deteriorates the search performance, especially for high-dimensional data spaces [Berchtold et al. 1996]. The region description of an MBR comprises, for each dimension, a lower and an upper bound. Thus, 2d floating point values are required. This description allows an efficient determination of MINDIST, MINMAXDIST, and MAXDIST using any L_p metric.
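MINDIST and MAXDIST follow directly from the 2d bounds. A minimal sketch for the Euclidean (L2) metric, with an MBR given as a list of (low, high) pairs; the function names are ours:

```python
def mindist(q, mbr):
    # Smallest Euclidean distance between query point q and the MBR.
    s = 0.0
    for qi, (low, high) in zip(q, mbr):
        if qi < low:
            s += (low - qi) ** 2
        elif qi > high:
            s += (qi - high) ** 2     # inside the interval contributes 0
    return s ** 0.5

def maxdist(q, mbr):
    # Largest Euclidean distance between q and any point of the MBR:
    # per dimension, the farther of the two interval ends.
    s = 0.0
    for qi, (low, high) in zip(q, mbr):
        s += max(abs(qi - low), abs(high - qi)) ** 2
    return s ** 0.5
```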
R-trees were originally designed for spatial databases, that is, for the management of two-dimensional objects with a spatial extension (e.g., polygons). In the index, these objects are represented by the corresponding MBR. In contrast to point objects, it is possible that no overlap-free partition for a set of such objects exists at all. The same problem also occurs when R-trees are used to index datapoints, but only in the directory part of the index. Page regions are treated as spatially extended, atomic objects in their parent nodes (no forced split). Therefore, it is possible that a directory page cannot be split without creating overlap among the newly created pages [Berchtold et al. 1996].

According to our framework of high-dimensional index structures, two heuristics have to be defined to handle the insert operation: the choice of a suitable page to insert the point into, and the management of page overflow. When searching for a suitable page, one of three cases may occur (a sketch of this heuristic follows below):

- The point is contained in exactly one page region. In this case, the corresponding page is used.
- The point is contained in several different page regions. In this case, the page region with the smallest volume is used.
- No region contains the point. In this case, the region that yields the smallest volume enlargement is chosen. If several such regions yield minimum enlargement, the region with the smallest volume among them is chosen.

The insert algorithm starts with the root and chooses in each step a child node by applying the above rules. Page overflows are generally handled by splitting the page. Four different algorithms have been published for the purpose of finding the right split dimension (also called the split axis) and the split hyperplane. They are distinguished according to their time complexity with varying page capacity C. Details are provided in Gaede and Günther [1998]:

- an exponential algorithm,
- a quadratic algorithm,
- a linear algorithm, and
- Greene's [1989] algorithm.

Guttman [1984] reports only slight differences between the linear and the quadratic algorithm; however, an evaluation study performed by Beckmann et al. [1990] reveals disadvantages of the linear algorithm. The quadratic algorithm and Greene's algorithm are reported to yield similar search performance.

In the insert algorithm, the suitable data page for the object is found in O(log n) time by examining a single path of the index. It seems to be an advantage that only a single path is examined for the determination of the data page into which a point is inserted. An uncontrolled number of paths, in contrast, would violate the demand of an O(n log n) time complexity for the index construction. Figure 17 shows, however, that inserts are often misled in such tie situations. It is intuitively clear that the point must be inserted into page p2,1, because p2,1 is the only page on the second index level that contains the point. But the insert algorithm faces a tie situation at the first index level because both pages, p1 as well as p2, cover the point. According to the heuristics, the smaller page p1 is chosen. The page p2,1, as a child of p2, will never be under consideration. The result of this misled insert is that the page p1,2 unnecessarily becomes enlarged by a large factor, and an additional overlap situation arises between the pages p1,2 and p2,1.

Fig. 17. Misled insert operations.
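A minimal sketch of the three-case page-choice heuristic described above (MBRs as lists of (low, high) pairs; function names are ours):

```python
def volume(mbr):
    v = 1.0
    for low, high in mbr:
        v *= high - low
    return v

def enlarged(mbr, p):
    # Smallest MBR containing both the original MBR and the point p.
    return [(min(low, x), max(high, x)) for (low, high), x in zip(mbr, p)]

def choose_subtree(children, p):
    # children: MBRs of the child nodes; p: point to insert.
    containing = [c for c in children
                  if all(l <= x <= h for (l, h), x in zip(c, p))]
    if containing:
        # Cases 1 and 2: take the containing region of smallest volume.
        return min(containing, key=volume)
    # Case 3: smallest volume enlargement; ties broken by smallest volume.
    return min(children,
               key=lambda c: (volume(enlarged(c, p)) - volume(c), volume(c)))
```

As the misled-insert example of Figure 17 shows, this purely local, single-path rule can make globally poor choices; the sketch makes explicit that a child such as p2,1 is never even inspected once p1 wins the tie.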
Therefore, overlap at or near the data level is mostly a consequence of some initial overlap in the directory levels near the root (which would, in itself, be tolerable). The initial overlap usually stems from the inability to split a higher-level page without overlap, because all child pages have independently grown extended
page regions. For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point. It has been shown in Berchtold et al. [1996] that the existence of such a point becomes less likely as the dimension of the data space increases. The reason simply is that the projection of each child page onto an arbitrary dimension is not much smaller than the corresponding projection of the parent page. If we assume all page regions to be hypercubes of side length A (parent page) and a (child page), respectively, we get

a = A \cdot \sqrt[d]{1/C_{\mathrm{eff}}},

which is substantially below A if d is small, but actually of the same order of magnitude as A if d is sufficiently high.

The R*-tree [Beckmann et al. 1990] is an extension of the R-tree based on a careful study of the R-tree algorithms under various data distributions. In contrast to Guttman, who optimizes only for a small volume of the created page regions, Beckmann et al. identify four optimization objectives:

- minimize overlap between page regions,
- minimize the surface of page regions,
- minimize the volume covered by internal nodes, and
- maximize the storage utilization.

The heuristic for the choice of a suitable page to insert a point is modified in the third alternative, in which no page region contains the point. In this case, a distinction is made according to whether the child page is a data page or a directory page. If it is a data page, then the region is taken that yields the smallest enlargement of the overlap; in the case of a tie, further criteria are the volume enlargement and the volume. If the child node is a directory page, the region with the smallest volume enlargement is taken; in case of doubt, the volume decides.

As in Greene's algorithm, the split heuristic has two phases. In the first phase, the split dimension is determined:

- for each dimension, the objects are sorted according to their lower bound and according to their upper bound;
- a number of partitionings with a controlled degree of asymmetry are generated; and
- for each dimension, the surface areas of the MBRs of all partitionings are summed up, and the least sum determines the split dimension.

In the second phase, the split plane is determined, minimizing these criteria:

- overlap between the page regions, and
- when in doubt, least coverage of dead space.

Splits can often be avoided by the concept of forced reinsert. If a node overflow occurs, a defined percentage of the objects with the highest distances from the center of the region are deleted from the node and reinserted into the index, after the region has been adapted. By this means, the storage utilization grows to a factor between 71% and 76%. Additionally, the quality of the partitioning improves because unfavorable decisions made in the beginning of index construction can be corrected in this way.

Performance studies report improvements of between 10% and 75% over the R-tree. In higher-dimensional data spaces, however, the split algorithm proposed in Beckmann et al. [1990] leads to a deteriorated directory. Therefore, the R*-tree is not adequate for these data spaces; rather, it has to load the entire index in order to process most queries. A detailed explanation of this effect is given in Berchtold et al. [1996]. The basic problem of the R-tree, overlap arising at high index levels and then propagating down through misled insert operations, is alleviated by more appropriate heuristics but not solved.
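A minimal sketch of the first phase of the R*-tree split, the surface-driven choice of the split axis; m is an assumed minimum number of entries per resulting node, and all names are ours:

```python
def margin(mbr):
    # "Surface" of an MBR in the R*-tree sense: the sum of its side lengths.
    return sum(high - low for low, high in mbr)

def combine(mbrs):
    # MBR enclosing a group of MBRs.
    return [(min(l for l, _ in dim), max(h for _, h in dim))
            for dim in zip(*mbrs)]

def choose_split_axis(entries, m):
    d = len(entries[0])
    best_axis, best_sum = None, float("inf")
    for axis in range(d):
        s = 0.0
        # Sort by lower and by upper bound, as described above.
        for key in (lambda e: e[axis][0], lambda e: e[axis][1]):
            srt = sorted(entries, key=key)
            # Candidate partitionings with a controlled degree of asymmetry.
            for k in range(m, len(srt) - m + 1):
                s += margin(combine(srt[:k])) + margin(combine(srt[k:]))
        if s < best_sum:
            best_axis, best_sum = axis, s
    return best_axis
```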
The heuristic of the R*-tree split to optimize for page regions with a small surface (i.e., for square/cubelike page regions) is beneficial, in particular with respect to range queries and nearest-neighbor queries. As pointed out in Section 4 (cost models), the access probability corresponds to the Minkowski sum of the page region and the query sphere. The Minkowski sum primarily consists of the page region, which is enlarged at each surface segment. If the page regions are
optimized for a small surface, they directly optimize the Minkowski sum. Figure 18 shows an extreme, nonetheless typical, example of volume-equivalent pages and their Minkowski sums. The square (1 × 1 unit) yields, with 3.78 units, a substantially lower Minkowski sum than the volume-equivalent rectangle (3 × 1/3), with 5.11 units. Note again that the effect becomes stronger with an increasing number of dimensions, as every dimension is a potential source of imbalance. For spherical queries, however, spherical page regions yield the lowest Minkowski sum (3.55 units). Spherical page regions are discussed later.

Fig. 18. Shapes of page regions and their suitability for similarity queries.
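The three figures can be reproduced by elementary geometry, assuming a circular query region of radius 1/2 (the value consistent with the quoted numbers); a minimal sketch:

```python
import math

r = 0.5  # assumed query radius; this value reproduces the quoted figures

def rect_minkowski(a, b):
    # Rectangle a x b enlarged by a circular query of radius r: original
    # area + swept edges + four quarter-circle corners.
    return a * b + 2 * r * (a + b) + math.pi * r * r

def circle_minkowski(area):
    R = math.sqrt(area / math.pi)      # radius of a circle of unit area
    return math.pi * (R + r) ** 2

print(f"{rect_minkowski(1, 1):.3f}")      # 3.785 (square, quoted as 3.78)
print(f"{rect_minkowski(3, 1 / 3):.3f}")  # 5.119 (rectangle, quoted as 5.11)
print(f"{circle_minkowski(1.0):.3f}")     # 3.558 (sphere, quoted as 3.55)
```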
The R+-tree [Stonebraker et al. 1986; Sellis et al. 1987] is an overlap-free variant of the R-tree. To guarantee the absence of overlap, the split algorithm is modified by a forced-split strategy: child pages that are an obstacle to the overlap-free splitting of some page are simply cut into two pieces at a suitable position. It is possible, however, that these forced splits must be propagated down until the data page level is reached, and the number of pages can even increase exponentially from level to level. As we have pointed out before, the extension of the child pages is not much smaller than the extension of the parent if the dimension is sufficiently high. Therefore, high dimensionality leads to many forced split operations. Pages that are subject to a forced split are split although no overflow has occurred, so the resulting pages are utilized by less than 50%. The more forced splits are raised, the more the storage utilization of the complete index deteriorates.

A further problem, which more or less concerns all of the data-organizing techniques described in this survey, is the decreasing fanout of the directory nodes with increasing dimension. For the R-tree family, for example, the internal nodes have to store 2d high and low bounds in order to describe a minimum bounding rectangle in d-dimensional space.

6.2. X-Tree

The R-tree and the R*-tree have primarily been designed for the management of spatially extended, two-dimensional objects, but have also been used for high-dimensional point data. Empirical studies [Berchtold et al. 1996; White and Jain 1996], however, showed a deteriorated performance of R*-trees for high-dimensional data. The major problem of R-tree-based index structures in high-dimensional data spaces is overlap. In contrast to low-dimensional spaces, there exist only a few degrees of freedom for splits in the directory. In fact, in most situations there exists only a single "good" split axis. An index structure that does not use this split axis will produce highly overlapping MBRs in the directory and thus show a deteriorated performance in high-dimensional spaces. Unfortunately, this specific split axis might lead to unbalanced partitions. In this case, a split should be avoided in order to avoid underfilled nodes.

The X-tree [Berchtold et al. 1996] is an extension of the R*-tree that is directly designed for the management of high-dimensional objects and based on the analysis of problems arising in high-dimensional data spaces. It extends the R*-tree by two concepts:

- overlap-free split according to a split history, and
- supernodes with an enlarged page capacity.

If one records the history of data page splits in an R-tree-based index structure, the result is a binary tree. The index starts with a single data page A covering almost the whole data space and inserts data items. If the page overflows, the index splits the page into two new pages, A and B. Later on, each of these pages might be split again into new pages. Thus the history of all splits may be described as a binary tree, having split dimensions (and positions) as nodes and having the current data pages as leaf nodes.
Figure 19 shows an example of such a process. In the lower half of the figure, the appropriate directory node is depicted.

Fig. 19. Example for the split history.

If the directory node overflows, we have to divide the set of data pages (the MBRs A, B, C, D, E) into two partitions. Therefore, we have to choose a split axis first. Now, what are potential candidates for split axes in our example? Say we chose dimension 5 as a split axis. Then we would have to put A and E into one of the partitions. However, A and E have never been split according to dimension 5; thus they span almost the whole data space in this dimension. If we put A and E into one of the partitions, the MBR of this partition in turn will span the whole data space, which obviously leads to high overlap with the other partition, regardless of the shape of the other partition. If one looks at the example in Figure 19, it becomes clear that only dimension 2 may be used as a split dimension. The X-tree generalizes this observation and always uses the split dimension with which the root node of the particular split tree is labeled. This guarantees an overlap-free directory.

However, the split tree might be unbalanced. In this case it is advantageous not to split at all, because splitting would create one underfilled node and another almost overflowing node. The storage utilization in the directory would thus decrease dramatically and the directory would degenerate. In this case the X-tree does not split but instead creates an enlarged directory node, a supernode. The higher the dimensionality, the more supernodes will be created and the larger the supernodes become. To also operate efficiently on lower-dimensional spaces, the X-tree split algorithm additionally includes a geometric split algorithm. The whole split algorithm works as follows. In the case of a data page split, the X-tree uses the R*-tree split algorithm or any other topological split algorithm. In the case of directory nodes, the X-tree first tries to split the node using a topological split algorithm. If this split leads to highly overlapping MBRs, the X-tree applies the overlap-free split algorithm based on the split history as described above. If this leads to an unbalanced directory, the X-tree simply creates a supernode.

The X-tree shows a high performance gain compared to R*-trees for all query types in medium-dimensional spaces. For small dimensions, the X-tree shows a behavior almost identical to R-trees; for higher dimensions, the X-tree also has to visit such a large number of nodes that a linear scan is less expensive. It is impossible to provide exact values here because many factors, such as the number of data items, the dimensionality, the distribution, and the query type, have a high influence on the performance of an index structure.
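A minimal sketch of the split-history rule: representing the history as a binary tree and partitioning the pages at its root dimension, along which every page has already been split. The tree below imitates the flavor of Figure 19 but uses hypothetical labels, not the figure's exact contents:

```python
class SplitNode:
    # Inner node of the split history: the dimension of a past split;
    # leaves of the tree are the current data pages.
    def __init__(self, dim, left, right):
        self.dim, self.left, self.right = dim, left, right

def pages(node):
    if not isinstance(node, SplitNode):
        return [node]
    return pages(node.left) + pages(node.right)

def overlap_free_partition(history_root):
    # The X-tree splits the directory node according to the dimension in
    # the root of the split tree: every page below has been split in that
    # dimension, so the two sides can be separated without overlap.
    return history_root.dim, pages(history_root.left), pages(history_root.right)

root = SplitNode(2,
                 SplitNode(5, "B", SplitNode(1, "C", "D")),
                 SplitNode(3, "A", "E"))
print(overlap_free_partition(root))   # (2, ['B', 'C', 'D'], ['A', 'E'])
```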
6.3. Structures with a kd-Tree Directory

Like the R-tree and its variants, the kd-B-tree [Robinson 1981] uses hyperrectangle-shaped page regions. An adaptive kd-tree [Bentley 1975, 1979] is used for space partitioning (cf. Figure 20). Therefore, complete and disjoint space partitioning is guaranteed. Obviously, the page regions are (hyper)rectangles, but not minimum bounding rectangles. The general advantage of kd-tree-based partitioning is that the decision of which subtree to use is always unambiguous. The deletion operation is also supported in a better way than in R-tree variants, because leaf nodes with a common parent exactly comprise a hyperrectangle of the data space; thus they can be merged without violating the conditions of complete, disjoint space partitioning.

Fig. 20. The kd-tree.

Complete partitioning has the disadvantage that page regions are generally larger than necessary. Particularly in high-dimensional data spaces, large parts of the data space are often not occupied by data points at all. Real data often are clustered or correlated. If the data distribution is cluster shaped, it is intuitively clear that large parts of the space are empty. But the presence of correlations (i.e., one dimension is more or less dependent on the values of one or more other dimensions) also leads to empty parts of the data space, as depicted in Figure 21. Index structures that do not use complete partitioning are superior, because the larger page regions produced by a complete partitioning yield a higher access probability; such pages are accessed more often during query processing than minimum bounding page regions.

Fig. 21. Incomplete versus complete decomposition for clustered and correlated data.

The second problem is that kd-trees in principle are unbalanced. Therefore, it is not directly possible to pack contiguous subtrees into directory pages. The kd-B-tree approaches this problem with a concept involving forced splits: if some page has an overflow condition, it is split by an appropriately chosen hyperplane. The entries are distributed among the two pages, and the split is propagated up the tree. Unfortunately, regions on lower levels of the tree may also be intersected by the split plane, and these must be split as well (forced split). As every region in the subtree can be affected, the time complexity of the insert operation is O(n) in the worst case. A minimum storage utilization cannot be guaranteed; therefore, theoretical considerations about the index size are difficult.

The hB-tree (holey brick) [Lomet and Salzberg 1989, 1990; Evangelidis 1994] also uses a kd-tree directory to define the page regions of the index. In this approach, the splitting of a node is based on multiple attributes. This means that page regions
do not correspond to solid rectangles but to rectangles from which other rectangles have been removed (holey bricks). With this technique, the forced split of the kd-B-tree and the R+-tree is avoided.

Fig. 22. The kd-B-tree.

For similarity search in high-dimensional spaces, we can state the same benefits and shortcomings of a complete space decomposition as for the kd-B-tree, depicted in Figure 22. In addition, we can state that the cavities of the page regions decrease the volume of the page region but hardly decrease the Minkowski sum (and thus the access probability of a page). This is illustrated in Figure 23, where two large cavities are removed from a rectangle, reducing its volume by more than 30%. The Minkowski sum, however, is not reduced at the left cavity, because it is not as wide as the perimeter of the query. In the second cavity, there is only a very small area where the page region is not touched. Thus the cavities reduce the access probability of the page by less than 1%.

Fig. 23. The Minkowski sum of a holey brick.

The directory of the LSDh-tree [Henrich 1998] is also an adaptive kd-tree [Bentley 1975, 1979] (see Figure 24). In contrast to R-tree variants and kd-B-trees, the region description is coded in a sophisticated way, leading to reduced space requirements. A specialized paging strategy collects parts of the kd-tree into directory pages. Some levels at the top of the kd-tree are assumed to be fixed in main memory; they are called the internal directory, in contrast to the external directory, which is subject to paging. In each node, only the split axis (e.g., 8 bits for up to 256-dimensional data spaces) and the position where the split plane intersects the split axis (e.g., 32 bits for a float number) have to be stored. Two pointers to child nodes require 32 bits each. To describe k regions, (k − 1) nodes are required, leading to a total amount of 104 · (k − 1) bits for the complete directory. R-tree-like structures require for each region description two float values for each dimension plus the child node pointer; therefore, the lowest level of the directory alone needs (32 + 64 · d) · k bits for the region description. While the space requirement of the R-tree directory grows linearly with increasing dimension, it is constant (theoretically logarithmic, for very large dimensionality) for the LSDh-tree. Note that this argument also holds for the hBπ-tree; see Evangelidis et al. [1997] for a more detailed discussion of the issue. For 16-dimensional data spaces, R-tree directories are more than 10 times larger than the corresponding LSDh-tree directory.

The rectangle representing the region of a data page can be determined from the split planes in the directory. It is called the potential data region and is not explicitly stored in the index. One disadvantage of the kd-tree directory is that the data space is completely covered with potential data regions. In cases where major parts of the data space are empty, this results in performance degeneration. To overcome this drawback, a concept called coded actual data regions (cadr) is introduced. The cadr is a multidimensional interval conservatively approximating the MBR of the points stored in a data page. To save space in the description of the cadr, the potential data region is quantized into a grid of 2^{z·d} cells. Therefore, only 2 · z · d additional bits are required for each cadr.
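A minimal sketch of such a conservative quantization, assuming regions given as (low, high) pairs per dimension; the rounding directions guarantee that the cadr never undercuts the actual MBR. All names are ours:

```python
import math

def encode_cadr(pdr, mbr, z):
    # pdr: potential data region; mbr: actual MBR of the stored points;
    # z: grid resolution in bits per bound. Each dimension of the pdr is
    # divided into 2^z cells, so a cadr costs only 2*z*d additional bits.
    cells = 2 ** z
    code = []
    for (plow, phigh), (mlow, mhigh) in zip(pdr, mbr):
        extent = phigh - plow
        lo = math.floor((mlow - plow) / extent * cells)       # round outward
        hi = math.ceil((mhigh - plow) / extent * cells) - 1   # round outward
        code.append((max(lo, 0), min(hi, cells - 1)))         # z bits each
    return code

# A 2-D MBR inside the unit potential data region, with z = 5:
print(encode_cadr([(0.0, 1.0), (0.0, 1.0)], [(0.12, 0.34), (0.5, 0.9)], 5))
```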
The parameter z can be chosen by the user. Good results are achieved with the value z = 5 (see Figure 25).
Fig. 24. The LSDh-tree.

Fig. 25. Region approximation using the LSDh-tree.

The most important advantage of the complete partitioning using potential data regions is that it allows a maintenance guaranteeing no overlap. It has been pointed out in the discussion of the R-tree variants and of the X-tree that overlap is a particular problem in high-dimensional data spaces. Due to the complete partitioning of the kd-tree directory, tie situations that lead to overlap do not arise. On the other hand, the regions of the index pages cannot adapt as well to changes in the actual data distribution as page regions that are not forced into a kd-tree directory can. The description of the page regions in terms of splitting planes forces the regions to be overlap-free anyway. When a point has to be inserted into an LSDh-tree, there always exists a unique potential data region into which the point has to be inserted. In contrast, the MBR of an R-tree may have to be enlarged for an insert operation, which in some cases causes overlap between data pages. A situation where no overlap-free enlargement is possible is depicted in Figure 26.

Fig. 26. No overlap-free insert is possible.

The coded actual data regions may have to be enlarged during an insert operation. As they are completely contained in a potential page region, overlap cannot arise either.

The split strategy for LSDh-trees is rather simple: the split dimension is increased by one compared to the parent node in the kd-tree directory. The only exception to this rule is that a dimension having too few distinct values for splitting is left out. As reported in Henrich [1998], the LSDh-tree shows a performance that is very similar to that of the X-tree, except that inserts are done much faster in an LSDh-tree because no complex computation takes place. When a bulk-loading technique is used to construct the index, both index structures are equal in performance. Also from an implementation point of view, both structures are of similar complexity: the LSDh-tree has a rather complex directory structure and simple algorithms, whereas the X-tree has a rather straightforward directory and complex algorithms.

6.4. SS-Tree

In contrast to all previously introduced index structures, the SS-tree [White and Jain 1996] uses spheres as page regions. For maintenance efficiency, the spheres
are not minimum bounding spheres. Rather, the centroid point (i.e., the average value in each dimension) is used as the center of the sphere, and the minimum radius is chosen such that all objects are included in the sphere. The region description therefore comprises the centroid point and the radius. This allows an efficient determination of MINDIST and of MAXDIST, but not of MINMAXDIST. The authors suggest using the RKV algorithm, but they do not provide any hints on how to prune the branches of the index efficiently.

For insert processing, the tree is descended choosing the child node whose centroid is closest to the point, regardless of volume or overlap enlargement. Meanwhile, the new centroid point and the new radius are determined. When an overflow condition occurs, a forced reinsert operation is raised, as in the R*-tree: 30% of the objects with the highest distances from the centroid are deleted from the node, all region descriptions are updated, and the objects are reinserted into the index.

The split determination is based merely on the criterion of variance. First, the split axis is determined as the dimension yielding the highest variance. Then, the split plane is determined by considering all possible split positions that fulfill space utilization guarantees. The sum of the variances on each side of the split plane is minimized.

It was pointed out already in Section 6.1 (cf. Figure 18 in particular) that spheres are theoretically superior to volume-equivalent MBRs because the Minkowski sum is smaller. The general problem of spheres is that they are not amenable to an easy overlap-free split, as depicted in Figure 27. MBRs have in general a smaller volume, and, therefore, the advantage in the Minkowski sum is more than compensated. The SS-tree outperforms the R*-tree by a factor of two; however, it does not reach the performance of the LSDh-tree and the X-tree.

Fig. 27. No overlap-free split is possible.
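A minimal sketch of the variance-based split described above (points as tuples; min_fill encodes the space utilization guarantee; names are ours):

```python
def sstree_split(points, min_fill):
    d, n = len(points[0]), len(points)

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    # 1. Split axis: the dimension with the highest variance.
    axis = max(range(d), key=lambda i: variance([p[i] for p in points]))

    # 2. Split position: minimize the summed variances of the two sides,
    #    subject to a minimum fill of each resulting node.
    srt = sorted(points, key=lambda p: p[axis])
    best_k = min(range(min_fill, n - min_fill + 1),
                 key=lambda k: variance([p[axis] for p in srt[:k]])
                             + variance([p[axis] for p in srt[k:]]))
    return axis, srt[:best_k], srt[best_k:]
```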
6.5. TV-Tree

The TV-tree [Lin et al. 1995] is designed especially for real data that are subject to the Karhunen–Loève transform (also known as principal component analysis), a mapping that preserves distances and eliminates linear correlations. Such data yield a high variance, and therefore a good selectivity, in the first few dimensions, whereas the last few dimensions are of minor importance for query processing. Indexes storing KL-transformed data tend to have the following properties.

- The last few attributes are never used for cutting branches in query processing. Therefore, it is not useful to split the data space in the corresponding dimensions.
- Branching according to the first few attributes should be performed as early as possible, that is, in the topmost levels of the index. Then the extension of the regions of lower levels (especially of data pages) is often zero in these dimensions.

Regions of the TV-tree are described using so-called telescope vectors (TV), that is, vectors that may be dynamically shortened. A region has k inactive dimensions and α active dimensions. The inactive dimensions form the greatest common prefix of the vectors stored in the subtree; therefore, the extension of the region is zero in these dimensions. In the α active dimensions, the region has the form of an L_p-sphere, where p may be 1, 2, or ∞. The region has an infinite extension in the remaining dimensions, which are supposed either to be active in the lower levels of the index or to be of minor importance for query processing. Figure 28 depicts the extension of a telescope vector in space.

Fig. 28. Telescope vectors.

The region description comprises α floating point values for the coordinates of the centerpoint in the active dimensions and one float value for the radius. The coordinates of the inactive dimensions are stored in higher levels of the index (exactly in the level where a dimension turns from active into inactive). To achieve a uniform capacity of the directory nodes, the number α of active dimensions is constant in all pages. The concept of telescope vectors increases the capacity of the directory pages. It was experimentally determined that a low number of active dimensions (α = 2) yields the best search performance.

The insert algorithm of the TV-tree chooses the branch to insert a point according to the following criteria (with decreasing priority):

- minimum increase of the number of overlapping regions,
- minimum decrease of the number of inactive dimensions,
- minimum increase of the radius, and
- minimum distance to the center.

To cope with page overflows, the authors propose performing a reinsert operation, as in the R*-tree. The split algorithm determines the two seed-points (seed-regions in the case of a directory page) having the least common prefix or, in case of doubt, the maximum distance. The objects are then inserted into one of the new subtrees using the above criteria for the subtree choice in insert processing, while storage utilization guarantees are considered.

The authors report a good speedup in comparison to the R*-tree when applying the TV-tree to data that fulfill the precondition stated at the beginning of this section. Other experiments [Berchtold et al. 1996], however, show that the X-tree and the LSDh-tree outperform the TV-tree on uniform or other real data (not amenable to the KL transformation).

6.6. SR-Tree

The SR-tree [Katayama and Satoh 1997] can be regarded as a combination of the R*-tree and the SS-tree. It uses the intersection solid between a rectangle and a sphere as the page region. The rectangular part is, as in R-tree variants, the minimum bounding rectangle of all points stored in the corresponding subtree. The spherical part is, as in the SS-tree, the minimum sphere around the centroid point of the stored objects. Figure 29 depicts the resulting geometric object. Regions of SR-trees have the most complex description among all index structures presented in this section: they comprise 2d floating point values for the MBR and d + 1 floating point values for the sphere.

The motivation for using a combination of sphere and rectangle, presented by the authors, is that, according to an analysis presented in White and Jain [1996], spheres are basically better suited for processing nearest-neighbor and range queries using the L2 metric. On the other hand, spheres are difficult to maintain and tend to produce much overlap in splitting, as depicted previously in Figure 27. The authors therefore believe that a combination of R-tree and SS-tree will overcome both disadvantages.
The authors define the following function as the distance between a query point q and a region R:

MINDIST(q, R) = max(MINDIST(q, R.MBR), MINDIST(q, R.Sphere)).

This is not the correct minimum distance to the intersection solid, as depicted in Figure 30. Both distances to MBR and
sphere (meeting the corresponding solids at the points M_MBR and M_Sphere, resp.) are smaller than the distance to the intersection solid, which is met at the point M_R where the sphere intersects the rectangle. However, it can be shown that the above function MINDIST(q, R) is a lower bound of the correct distance function. Therefore, it is guaranteed that the processing of range and nearest-neighbor queries produces no false dismissals. Still, the efficiency can be worsened by the incorrect distance function.

Fig. 29. Page regions of an SR-tree.

Fig. 30. Incorrect MINDIST in the SR-tree.
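A minimal sketch of this lower-bounding distance for the Euclidean metric (names are ours; the MBR is a list of (low, high) pairs, the sphere a centroid plus radius):

```python
def mindist_mbr(q, mbr):
    s = 0.0
    for qi, (low, high) in zip(q, mbr):
        if qi < low:
            s += (low - qi) ** 2
        elif qi > high:
            s += (qi - high) ** 2
    return s ** 0.5

def mindist_sphere(q, center, radius):
    d = sum((qi - ci) ** 2 for qi, ci in zip(q, center)) ** 0.5
    return max(d - radius, 0.0)

def sr_mindist(q, mbr, center, radius):
    # Lower bound of the distance to the intersection solid of MBR and
    # sphere: not exact (cf. Figure 30), but safe, i.e., no false dismissals.
    return max(mindist_mbr(q, mbr), mindist_sphere(q, center, radius))
```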
The MAXDIST function can be defined as the minimum of the MAXDIST functions applied to MBR and sphere, although a similar error is made as in the definition of MINDIST. Since no MINMAXDIST definition exists for spheres, the MINMAXDIST function for the MBR must be applied. This is also correct in the sense that no false dismissals are guaranteed, but in this case no knowledge about the sphere is exploited at all; some potential for performance improvement is wasted. Using the definitions above, range and nearest-neighbor query processing with both the RKV algorithm and the HS algorithm are possible.

Insert processing and the split algorithm are taken from the SS-tree and are modified only in a few details of minor importance. In addition to the algorithms for the SS-tree, the MBRs have to be updated and determined after inserts and node splits. Information about the MBRs is considered neither in the choice of branches nor in the determination of the split.

The reported performance results, compared to the SS-tree and the R*-tree, suggest that the SR-tree outperforms both index structures. It is, however, open whether the SR-tree outperforms the X-tree or the LSDh-tree; no experimental comparison has been done yet to the best of the authors' knowledge. Comparing the index structures indirectly, by comparing both to the performance of the R*-tree, we could draw the conclusion that the SR-tree does not reach the performance of the LSDh-tree or the X-tree.

6.7. Space Filling Curves

Space filling curves (for an overview see Sagan [1994]) like Z-ordering [Morton 1966; Finkel and Bentley 1974; Abel and Smith 1983; Orenstein and Merret 1984; Orenstein 1990], Gray codes [Faloutsos 1985, 1988], or the Hilbert curve [Faloutsos and Roseman 1989; Jagadish 1990; Kamel and Faloutsos 1993] are mappings from a d-dimensional data space (original space) into a one-dimensional data space (embedded space). Using space filling curves, distances are not exactly preserved, but points that are close to each other in the original space are likely to be close to each other in the embedded space. Therefore, these mappings are called distance-preserving mappings.

Z-ordering is defined as follows. The data space is first partitioned into two halves of identical volume, perpendicular to the d0-axis. The volume on the side of the lower d0-values gets the name 0 (as a bit string); the other volume gets the name 1. Then each of the volumes is partitioned perpendicular to the d1-axis, and the resulting subpartitions of 0 get the names 00 and 01, and the subpartitions of 1 get the names 10 and 11, respectively. When all axes have been used for splitting, d0 is used for a second split, and so on. The process stops when a user-defined basic resolution br is reached. Then we have a total number of 2^br grid cells, each with an individually numbered bit string.
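This recursive naming is equivalent to interleaving the bits of the grid coordinates, cycling through the dimensions d0, d1, and so on. A minimal sketch (names are ours):

```python
def z_value(cell, bits_per_dim):
    # cell: tuple of d integer grid coordinates, each in [0, 2**bits_per_dim).
    # Interleaves the coordinate bits, most significant bit first.
    z = 0
    for bit in range(bits_per_dim - 1, -1, -1):
        for x in cell:
            z = (z << 1) | ((x >> bit) & 1)
    return z

# The four quadrants of a 2-D space after one split per axis (br = 2):
for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(cell, format(z_value(cell, 1), "02b"))   # 00, 01, 10, 11
```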
If only grid cells with the basic resolution br are considered, all bit strings have the same length and can therefore be interpreted as binary representations of integer numbers. The other space filling curves are defined similarly, but the numbering scheme is slightly more sophisticated; this has been done so that more neighboring cells get subsequent integer numbers. Some two-dimensional examples of space filling curves are depicted in Figure 31.

Fig. 31. Examples of space filling curves.

Datapoints are transformed by assigning the number of the grid cell in which they are located. Without presenting the details, we let SFC(p) be the function that assigns p to the corresponding grid cell number. Vice versa, SFC^{-1}(c) returns the corresponding grid cell as a hyperrectangle. Then any one-dimensional indexing structure capable of processing range queries can be applied for storing SFC(p) for every point p in the database. We assume in the following that a B+-tree [Comer 1979] is used.

Processing of insert and delete operations and of exact match queries is very simple, because the points inserted or sought merely have to be transformed using the SFC function. In contrast, range and nearest-neighbor queries are based on distance calculations of page regions, which have to be determined accordingly. In B-trees, before a page is accessed, only the interval I = [lb .. ub] of values in this page is known. Therefore, the page region is the union of all grid cells having a cell number between lb and ub. The region of an index based on a space filling curve is a combination of rectangles. Based on this observation, we can define a corresponding MINDIST and, analogously, a MAXDIST function:

\mathrm{MINDIST}(q, I) = \min_{lb \le c \le ub} \{\mathrm{MINDIST}(q, \mathrm{SFC}^{-1}(c))\}
\mathrm{MAXDIST}(q, I) = \max_{lb \le c \le ub} \{\mathrm{MAXDIST}(q, \mathrm{SFC}^{-1}(c))\}.

Again, no MINMAXDIST function can be provided because there is no minimum bounding property to exploit. The question is how these functions can be evaluated efficiently, without enumerating all grid cells in the interval [lb .. ub]. This is possible by splitting the interval recursively into two parts [lb .. s[ and [s .. ub], where s has the form p100...00. Here, p stands for the longest common prefix of lb and ub. Then we determine the MINDIST and the MAXDIST to the rectangular blocks numbered with the bit strings p0 and p1. Any interval having a MINDIST greater than the MAXDIST of any other interval, or greater than the MINDIST of any terminating interval (see below), can be excluded from further consideration. The decomposition of an interval stops when the interval covers exactly one rectangle; such an interval is called a terminal interval. MINDIST(q, I) is then the minimum among the MINDISTs of all terminal intervals.

An example is presented in Figure 32. The shaded area is the page region, a set of contiguous grid cell values I. In the first step, the interval is split into two parts I1 and I2, and the MINDIST and MAXDIST (not depicted) of the surrounding rectangles are determined. I1 is terminal, as it comprises a rectangle. In the second step, I2 is split into I21
and I22, where I21 is terminal. Since the MINDIST to I21 is smaller than the other two MINDIST values, I1 and I22 are discarded. Therefore, MINDIST(q, I21) is equal to MINDIST(q, I). A similar algorithm to determine MAXDIST(q, I) would exchange the roles of MINDIST and MAXDIST.

Fig. 32. MINDIST determination using space filling curves.

6.8. Pyramid-Tree

The Pyramid-tree [Berchtold et al. 1998b] is an index structure that, similar to the Hilbert technique, maps a d-dimensional point into a one-dimensional space and uses a B+-tree to index the one-dimensional space. Obviously, queries have to be translated in the same way. In the data pages of the B+-tree, the Pyramid-tree stores both the d-dimensional points and the one-dimensional key. Thus, no inverse transformation is required, and the refinement step can be done without lookups to another file. The specific mapping used by the Pyramid-tree is called Pyramid-mapping. It is based on a special partitioning strategy that is optimized for range queries on high-dimensional data. The basic idea is to divide the data space such that the resulting partitions are shaped like the peels of an onion. Such partitions cannot be stored efficiently by R-tree-like or kd-tree-like index structures. However, the Pyramid-tree achieves the partitioning by first dividing the d-dimensional space into 2d pyramids having the centerpoint of the space as their top. In a second step, the single pyramids are cut into slices parallel to the basis of the pyramid, forming the data pages. Figure 33 depicts this partitioning technique.

Fig. 33. Partitioning the data space into pyramids.

This technique can be used to compute a mapping as follows. In the first step, we number the pyramids as shown in Figure 34(a). Given a point, it is easy to determine in which pyramid it is located. Then we determine the so-called height of the point within its pyramid, that is, the orthogonal distance of the point to the centerpoint of the data space, as shown in Figure 34(b). In order to map a d-dimensional point into a one-dimensional value, we simply add the two numbers: the number of the pyramid in which the point is located and the height of the point within this pyramid.

Fig. 34. Properties of pyramids: (a) numbering of pyramids; (b) point in pyramid.
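A minimal sketch of such a mapping, following the description above: the pyramid is identified by the dimension deviating most from the center of the unit data space, and the height is that deviation. The details of the published Pyramid-mapping may differ; names are ours:

```python
def pyramid_value(point):
    # point: tuple of coordinates in the unit data space [0, 1]^d.
    d = len(point)
    j = max(range(d), key=lambda i: abs(point[i] - 0.5))
    pyramid = j if point[j] < 0.5 else j + d   # numbering as in Fig. 34(a)
    height = abs(point[j] - 0.5)               # orthogonal distance to center
    return pyramid + height                    # one-dimensional key

print(pyramid_value((0.1, 0.6)))   # 2-D example: pyramid 0, height 0.4 -> 0.4
```

Since the height lies in [0, 0.5), the integer pyramid number and the fractional height never interfere, so the sum is unambiguous.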
Query processing is a nontrivial task on a Pyramid-tree, because for a given query range we have to determine the affected pyramids and the affected heights within these pyramids. The details of how this can be done are explained in Berchtold et al. [1998b]. Although the details of the algorithm are hard to understand, it is not computationally hard; rather, it consists of a variety of cases that have to be distinguished and of simple computations. The Pyramid-tree is the only index structure known thus far that is not affected by the so-called curse of dimensionality: for uniform data and range queries, the performance of the Pyramid-tree even improves if one increases the dimensionality of the data space. An analytical explanation of this phenomenon is given in Berchtold et al. [1998b].

6.9. Summary

Table I shows the index structures described above and their most important properties. The first column contains the name of the index structure, the second shows which geometrical region is represented by a page, and the third and fourth columns show whether the index structure provides a disjoint and complete partitioning of the data space. The last three columns describe the algorithms used: what strategy is used to insert new data items (column 5), what criteria are used to determine the division of objects into subpartitions in case of an overflow (column 6), and whether the insert algorithm uses the concept of forced reinserts (column 7).

Table I. High-Dimensional Index Structures and Their Properties

| Name | Region | Disjoint | Complete | Criteria for Insert | Criteria for Split | Reinsert |
|---|---|---|---|---|---|---|
| R-tree | MBR | No | No | Volume enlargement, volume | (Various algorithms) | No |
| R*-tree | MBR | No | No | Overlap enlargement, volume enlargement, volume | Surface area, overlap, dead space coverage | Yes |
| X-tree | MBR | No | No | Overlap enlargement, volume enlargement, volume | Split history, surface/overlap, dead space coverage | No |
| LSDh-tree | kd-tree region | Yes | No/Yes | (Unique due to complete, disjoint part.) | Cyclic change of dim., distinct values | No |
| SS-tree | Sphere | No | No | Proximity to centroid | Variance | Yes |
| TV-tree | Sphere with reduced dim. | No | No | Overlapping regions, inactive dim., radius of region, distance to center | Seeds with least common prefix, maximum distance | Yes |
| SR-tree | Intersect. sphere/MBR | No | No | Proximity to centroid | Variance | Yes |
| Space filling curves | Union of rectangles | Yes | Yes | (Unique due to complete, disjoint part.) | According to space filling curve | No |
| Pyramid-tree | Trunks of pyramids | Yes | Yes | (Unique due to complete, disjoint part.) | According to pyramid-mapping | No |

Since, so far, no extensive and objective comparison between the different index structures has been published, only structural arguments may be used to compare the different approaches. The experimental comparisons tend to depend highly on the data used in the experiments. Even higher is the influence of seemingly minor parameters such as the size and location of queries or their statistical distribution. The higher the dimensionality of the data, the more these influences lead to different results. Thus we provide a comparison among the indexes listing only properties, without trying to say anything about the "overall" performance of a single index. In fact, most probably, there is no overall performance; rather, one index will
outperform other indexes in a special situation, whereas this index is quite useless for other configurations of the database. Table II shows such a comparison. The first column lists the name of the index; the second column explains the biggest problem of this index when the dimension increases. The third column lists the supported types of queries. In the fourth column, we show whether a split in the directory causes "forced splits" on lower levels of the directory. The fifth column shows the storage utilization of the index, which is only a statistical value depending on the type of data and, sometimes, even on the order of insertion. The last column is about the fanout in the directory, which in turn depends on the size of a single entry in a directory node.

Table II. Qualitative Comparison of High-Dimensional Index Structures

| Name | Problems in High-D | Supported Query Types | Locality of Node Splits | Storage Utilization | Fanout / Size of Index Entries |
|---|---|---|---|---|---|
| R-tree | Poor split algorithm leads to deteriorated directories | NN, region, range | Yes | Poor | Poor, linearly dimension dependent |
| R*-tree | Ditto | NN, region, range | Yes | Medium | Poor, linearly dimension dependent |
| X-tree | High probability of queries overlapping MBRs leads to poor performance | NN, region, range | Yes | Medium | Poor, linearly dimension dependent |
| LSDh-tree | Changing data distribution deteriorates directory | NN, region, range | No | Medium | Very good, dimension independent |
| SS-tree | High overlap in directory | NN | Yes | Medium | Very good, dimension independent |
| TV-tree | Only useful for specific data | NN | Yes | Medium | Poor, somewhat dimension dependent |
| SR-tree | Very large directory sizes | NN | Yes | Medium | Very poor, linearly dimension dependent |
| Space filling curves | Poor space partitioning | NN, region, range | Yes | Medium | As good as B-tree, dimension independent |
| Pyramid-tree | Problems with asymmetric queries | Region, range | Yes | Medium | As good as B-tree, dimension independent |

7. IMPROVEMENTS, OPTIMIZATIONS, AND FUTURE RESEARCH ISSUES

During the past years, a significant amount of work has been invested not to develop new index structures but to improve the performance of existing index structures. As a result, a variety of techniques has been proposed for using or tuning index structures. In this section, we present a selection of those techniques. Furthermore, we point out a selection of problems that have not yet been addressed in the context of high-dimensional indexing, or whose solution cannot be considered sufficient.

Tree-Striping

From the variety of cost models that have been developed, one might conclude that if the data space has a sufficiently high dimensionality, no index structure can succeed. This has been contradicted by the development of index structures that are not severely affected by the dimensionality of the data space. On the other hand, one has to be very careful in judging the implications of a specific cost model. A lesson all researchers in the area of high-dimensional index structures learned was
that things are very sensitive to changes of parameters: a model of nearest-neighbor queries cannot directly be used to make any claims about the behavior in the case of range queries. Still, the research community agreed that in the case of nearest-neighbor queries, there exists a dimension above which a sequential scan will be faster than any indexing technique for most relevant data distributions.

Tree-striping is a technique that tries to tackle the problem from a different perspective. If it is hard to solve the d-dimensional problem of query processing, why not try to solve k l-dimensional problems, where k · l = d? The specific work presented in Berchtold et al. [2000c] focuses on the processing of range queries in a high-dimensional space. It generalizes the well-known inverted-lists and multidimensional indexing approaches. A theoretical analysis of the generalized technique shows that both inverted lists and multidimensional indexing approaches are far from optimal. A consequence of the analysis is that the use of a set of multidimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree-striping is to use the optimal number k of lower-dimensional indexes, determined by a theoretical analysis, for efficient query processing. A given query is likewise split into k lower-dimensional queries that are processed independently; in a final step, the single results are merged. As the merging step also involves I/O costs, and these costs increase with a decreasing dimensionality of a single index, there exists an optimal dimensionality for the single indexes that can be determined analytically. Note that tree-striping has serious limitations, especially for nearest-neighbor queries and skewed data, where in many cases the d-dimensional index performs better than any lower-dimensional index.

Voronoi Approximations

In another approach [Berchtold et al. 1998c, 2000d] to overcome the curse of dimensionality for nearest-neighbor search, the results of any nearest-neighbor search are precomputed. This corresponds to a computation of the Voronoi cell of each datapoint; the Voronoi cell of a point p contains all points that have p as a nearest neighbor. In high-dimensional spaces, the exact computation of a Voronoi cell is computationally very hard. Thus, rather than computing exact Voronoi cells, the algorithm stores conservative approximations of the Voronoi cells in an index structure that is efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although the technique is based on a precomputation of the solution space, it is dynamic; that is, it supports insertions of new datapoints. Furthermore, an extension of the technique to a k-nearest-neighbor search is given in Berchtold et al. [2000d].

Parallel Nearest-Neighbor Search

Most similarity search techniques map the data objects into some high-dimensional feature space. The similarity search then corresponds to a nearest-neighbor search in the feature space, which is computationally very intensive. In Berchtold et al. [1997a], the authors present a new parallel method for fast nearest-neighbor search in high-dimensional feature spaces.
Voronoi Approximations

In another approach [Berchtold et al. 1998c, 2000d] to overcome the curse of dimensionality for nearest-neighbor search, the results of any nearest-neighbor search are precomputed. This corresponds to a computation of the Voronoi cell of each datapoint, where the Voronoi cell of a point p contains all points that have p as their nearest neighbor. In high-dimensional spaces, the exact computation of a Voronoi cell is computationally very hard. Thus, rather than computing exact Voronoi cells, the algorithm stores conservative approximations of the Voronoi cells in an index structure that is efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although the technique is based on a precomputation of the solution space, it is dynamic; that is, it supports insertions of new datapoints. Furthermore, an extension of the technique to k-nearest-neighbor search is given in Berchtold et al. [2000d].

Parallel Nearest-Neighbor Search

Most similarity search techniques map the data objects into some high-dimensional feature space, where the similarity search corresponds to a nearest-neighbor search that is computationally very intensive. In Berchtold et al. [1997a], the authors present a new parallel method for fast nearest-neighbor search in high-dimensional feature spaces. The core problem of designing a parallel nearest-neighbor algorithm is to find an adequate distribution of the data onto the disks. Unfortunately, the known declustering methods do not perform well for high-dimensional nearest-neighbor search. In contrast, the proposed method has been optimized based on the special properties of high-dimensional spaces and therefore provides a near-optimal distribution of the data items among the disks. The basic idea of this data declustering technique is to assign the buckets corresponding to different quadrants of the data space to different disks. The authors show that their technique, in contrast to other declustering methods, guarantees that all buckets corresponding to neighboring quadrants are assigned to different disks. The specific mapping of points to disks is done by the following formula:

    \mathrm{col}(c) \;=\; \bigoplus_{i=0}^{d-1} \begin{cases} i+1 & \text{if } c_i = 1 \\ 0 & \text{otherwise} \end{cases}    (10)

where \bigoplus denotes the bitwise XOR. The input c is a bit string defining the quadrant in which the point to be declustered is located, and col(c) is the disk to which the corresponding bucket is assigned.
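As a small illustration (ours, not from the paper), the following sketch transcribes the formula directly:

    from functools import reduce
    from operator import xor

    def disk_of_quadrant(c):
        """Map a quadrant, given as a bit list c (c[i] in {0, 1}),
        to a disk number by XOR-ing (i + 1) for every set bit c_i."""
        return reduce(xor, (i + 1 for i, bit in enumerate(c) if bit == 1), 0)

    # Example: in d = 3 dimensions, print the disk for each of the 8 quadrants.
    for q in range(8):
        bits = [(q >> i) & 1 for i in range(3)]
        print(bits, '->', disk_of_quadrant(bits))

Two neighboring quadrants differ in exactly one bit i, so their disk numbers differ by an XOR with i + 1, which is nonzero; this is why the technique guarantees that neighboring quadrants never share a disk.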
But not any number of disks may be used for this declustering technique; in fact, the number of disks required is linear in the number of dimensions. Therefore, the authors present an extension of their technique adapted to an arbitrary number of disks. A further extension is a recursive declustering technique that allows an improved adaptation to skewed and correlated data distributions.

An approach for similarity query processing using disk arrays is presented in Papadopoulos and Manolopoulos [1998]. The authors propose two new algorithms for nearest-neighbor search using a single processor and multiple disks. Their solution relies on a well-known page distribution technique for low-dimensional data spaces [Kamel and Faloutsos 1992] called a proximity index: upon a split, the MBR of a newly created node is compared with the MBRs stored in its father node (i.e., its siblings), and the new node is assigned to the disk that stores the "least proximal" pages with respect to the new page region. Thus the selected disk contains sibling nodes that are far from the new node. The first algorithm, called full parallel similarity search (FPSS), determines the threshold sphere (cf. Figure 35), an upper bound of the nearest-neighbor distance derived from the maximum distance between the query point and the nearest page region. Then, all pages that are not pruned by the threshold sphere are fetched by a parallel request to all disks. The second algorithm, candidate reduction similarity search (CRSS), applies a heuristic that leads to an intermediate form between depth-first and breadth-first search of the index: pages that are completely contained in the threshold sphere are processed with a higher priority than pages that are merely intersected by it. The authors compare FPSS and CRSS with a (hypothetical) optimal parallel algorithm that knows the distance of the nearest neighbor in advance, and report that CRSS incurs up to 100% more page accesses than the optimal algorithm. The pruning step of FPSS is sketched below.

Fig. 35. The threshold sphere for FPSS and CRSS.
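The pruning logic can be reconstructed as follows (our sketch under simplifying assumptions; the helper names mindist and maxdist and the representation of page regions as boxes are ours):

    import math

    def mindist(q, lo, hi):
        """Minimum distance from query point q to the box [lo, hi] (MBR)."""
        return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                             for x, l, h in zip(q, lo, hi)))

    def maxdist(q, lo, hi):
        """Maximum distance from q to any point of the box [lo, hi]."""
        return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                             for x, l, h in zip(q, lo, hi)))

    def fpss_candidates(q, pages):
        """pages: list of (page_id, lo, hi). Returns the pages surviving the
        threshold-sphere pruning, which FPSS then requests from all disks in
        parallel. The nearest page region must contain at least one point, so
        its MAXDIST is an upper bound of the nearest-neighbor distance."""
        threshold = min(maxdist(q, lo, hi) for _, lo, hi in pages)
        return [pid for pid, lo, hi in pages if mindist(q, lo, hi) <= threshold]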
The same authors also propose a solution for shared-nothing parallel architectures [Papadopoulos and Manolopoulos 1997a]. Their architecture distributes the data pages of the index over the secondary servers, while the complete directory is held in the primary server. Their static page distribution strategy is based on a fractal curve (sorting according to the Hilbert value of the centroid of the MBR). The k-nn algorithm first performs a depth-first search in the directory. When the bottom level of the directory is reached, a sphere around the query point is determined that encloses as many data pages as are required to guarantee that k points are stored in them (i.e., assuming that the page capacity is ≥ k, the sphere is chosen such that one page is completely contained). Then a parallel range query is performed, first accessing a smaller number of data pages determined by a cost model.

Compression Techniques

Recently, the VA-file [Weber et al. 1998] was developed, an index structure that is actually not an index structure. Based on the cost model proposed in Berchtold et al. [1997b], the authors prove that, under certain assumptions, above a certain dimensionality no index structure can process a nearest-neighbor query efficiently. Therefore, they suggest accelerating the sequential scan by the use of data compression. The basic idea of the VA-file is to keep two files: a bit-compressed, quantized version of the points and their exact representation. Both files are unsorted; however, the positions of the points in the two files agree. The quantization of the points is determined by an irregular grid laid over the data space. The resolution of the grid in each dimension corresponds to 2^b, where b is the number of bits per dimension used to approximate the coordinates. The grid lines correspond to the quantiles of the projections of the points onto the corresponding axes. These quantiles are assumed to change rarely; changing them requires a reconstruction of the compressed file. k-nearest-neighbor queries are processed by the multistep paradigm: the quantized points are loaded into main memory by a sequential scan (filter step), and candidates that cannot be pruned are refined, that is, their exact coordinates are fetched from the second file. A sketch of the filter step is given below.
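The following is a minimal sketch of the filter step, assuming for simplicity a regular grid over [0, 1)^d instead of the quantile-based grid, Euclidean distance, and a separate first pass to establish the pruning radius (the actual VA-file interleaves the passes); all helper names are ours:

    import heapq
    import math

    def approximate(p, b):
        """Quantize each coordinate of p (assumed in [0, 1)) to a b-bit cell index."""
        cells = 1 << b
        return tuple(min(int(x * cells), cells - 1) for x in p)

    def cell_bounds(cell, b):
        width = 1.0 / (1 << b)
        return cell * width, (cell + 1) * width

    def lower_bound_dist(q, approx, b):
        """Lower bound of dist(q, p), computable from p's approximation alone."""
        s = 0.0
        for x, cell in zip(q, approx):
            lo, hi = cell_bounds(cell, b)
            s += max(lo - x, 0.0, x - hi) ** 2
        return math.sqrt(s)

    def upper_bound_dist(q, approx, b):
        """Upper bound of dist(q, p): distance to the farthest cell corner."""
        s = 0.0
        for x, cell in zip(q, approx):
            lo, hi = cell_bounds(cell, b)
            s += max(abs(x - lo), abs(x - hi)) ** 2
        return math.sqrt(s)

    def va_filter(q, approximations, k, b):
        """Filter step of a k-NN query: determine the k-th smallest upper bound,
        then prune every point whose lower bound exceeds it. The survivors must
        be refined against the exact file."""
        heap = []  # max-heap (via negation) of the k smallest upper bounds
        for a in approximations:
            u = upper_bound_dist(q, a, b)
            if len(heap) < k:
                heapq.heappush(heap, -u)
            elif u < -heap[0]:
                heapq.heapreplace(heap, -u)
        radius = -heap[0]
        return [i for i, a in enumerate(approximations)
                if lower_bound_dist(q, a, b) <= radius]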
Several access strategies for the timing of filter and refinement steps have been proposed. Basically, the speedup of the VA-file compared to a simple sequential scan corresponds to the compression rate, because reading large files sequentially from disk yields a linear time complexity with respect to the file length. The computational effort of determining distances between the query point and the quantized datapoints is also reduced compared to the sequential scan by precomputing the squared distances between the query point and the grid lines. CPU speedups, however, do not yield large factors and are independent of the compression rate. The most important overhead in query processing is the refinement step, which requires an expensive random disk access for each candidate. With decreasing resolution, the number of points to be refined increases, thus limiting the compression ratio. The authors report five to six bits per dimension to be optimal.

There are some major drawbacks of the VA-file. First, the deterioration of index structures is much more prevalent in artificial data than in data sets from real-world applications; for such data, index structures remain efficiently applicable for much higher dimensions. The second drawback is that the number of bits per dimension is a system parameter for which, unfortunately, the authors do not provide any model or guideline for selecting a suitable value. To overcome these drawbacks, the IQ-tree has recently been proposed by Berchtold et al. [2000a]; it is a three-level tree index structure exploiting quantization (cf. Figure 36). The first level is a flat directory consisting of MBRs and the corresponding pointers to pages on the second level. The pages on the second level contain the quantized versions of the datapoints. In contrast to the VA-file, the quantization is not based on quantiles but is a regular decomposition of the page regions. The authors claim that regular quantization based on the page regions adapts to skewed and correlated data distributions as well as quantiles do. The suitable compression rate is determined for each page independently, according to a cost model proposed in Berchtold et al. [2000b]. Finally, the bottom level of the IQ-tree contains the exact representation of the datapoints.

Fig. 36. Structure of the IQ-tree.
For processing nearest-neighbor queries, the authors propose a fast index scan that essentially subsumes the advantages of indexes and scan-based methods. The algorithm collects accesses to neighboring pages and performs chained I/O requests, where the length of such a chain is determined according to a cost model. In situations where a sequential scan is clearly indicated, the algorithm automatically degenerates to the sequential scan; in situations where an index-based search is clearly preferable, it performs the priority search of the HS algorithm; and in intermediate situations, it accesses chains of intermediate length, thus clearly outperforming both the sequential scan and the HS algorithm (the chaining decision is sketched below). The bottom level of the IQ-tree is accessed according to the usual multistep strategy.
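The chaining trade-off can be illustrated by the following sketch (our simplification, not the cost model of Berchtold et al. [2000a]): reading an unneeded page inside a chain costs only its transfer time, whereas skipping it costs a seek, so small gaps are cheaper to read through than to skip over.

    def build_chains(wanted_pages, seek_cost=8.0, transfer_cost=1.0):
        """Group page numbers (sorted in on-disk order) into chained I/O requests.
        A gap of g unwanted pages is read through iff g * transfer_cost < seek_cost."""
        max_gap = int(seek_cost / transfer_cost)
        chains = []
        current = [wanted_pages[0]]
        for prev, page in zip(wanted_pages, wanted_pages[1:]):
            if page - prev - 1 <= max_gap:
                current.append(page)    # extend the chain through the small gap
            else:
                chains.append(current)  # gap too large: pay the seek, start anew
                current = [page]
        chains.append(current)
        return chains

    # Example: with seeks 8x as expensive as transfers, pages 3, 4, 9, 40
    # form the chains [3, 4, 9] and [40].
    print(build_chains([3, 4, 9, 40]))

With a very cheap seek the sketch degenerates to one chain per page (pure priority search); with a very expensive seek it degenerates to a single chain over the whole file (sequential scan), mirroring the two extreme behaviors described above.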
Bottom-Up Construction

Usually, the performance of dynamically inserting a new datapoint into a multidimensional index structure is poor, because most structures have to consider multiple paths in the tree where the point could be inserted. Furthermore, split algorithms are complex and computationally intensive; for example, a single split in an X-tree might take up to the order of a second to perform. Therefore, a number of bulk-load algorithms for multidimensional index structures have been proposed. Bulk-loading an index means building an index on an entire database in a single process, which can be done much more efficiently than inserting the points one at a time. Most bulk-load algorithms, such as the one proposed in van den Bercken et al. [1997], are not especially adapted to high-dimensional data spaces. In Berchtold et al. [1998a], however, the authors proposed a new bulk-loading technique for high-dimensional indexes that exploits a priori knowledge of the complete data set to improve both construction time and query performance. The algorithm operates in a manner similar to the Quicksort algorithm and has an average runtime complexity of O(n log n). In contrast to other bulk-loading techniques, the query performance is additionally improved by optimizing the shape of the bounding boxes, by completely avoiding overlap, and by clustering the pages on disk. A sophisticated unbalanced split strategy is used, leading to a better space partitioning; the flavor of this top-down construction is sketched below.
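The quicksort-like character of such a bulk load can be sketched as follows (our own simplification, not the exact algorithm of Berchtold et al. [1998a]; a full sort stands in for the cheaper quicksort-style partitioning, and the unbalanced split rule is one illustrative choice among several):

    def bulk_load(points, page_capacity, depth=0):
        """Top-down, quicksort-style bulk load: returns a list of data pages
        (lists of points). Splits need not be balanced; here the left side is
        cut to a multiple of the page capacity so pages end up completely full."""
        if len(points) <= page_capacity:
            return [points]
        dim = depth % len(points[0])                    # cycle split dimensions
        points = sorted(points, key=lambda p: p[dim])   # partitioning step
        left = (len(points) // 2 // page_capacity) * page_capacity or page_capacity
        return (bulk_load(points[:left], page_capacity, depth + 1) +
                bulk_load(points[left:], page_capacity, depth + 1))

Because the whole data set is known in advance, the recursion can choose split dimensions and split values freely, which is exactly what permits overlap-free bounding boxes and well-clustered pages in the real algorithm.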
Another important issue would be to apply the knowledge that has been accumulated to other areas such as data reduction, data mining (e.g., clustering), or visualization, where people have to deal with tens to hundreds of attributes and therefore face a high-dimensional data space. Most of the lessons learned also apply to these areas; examples of successful approaches making use of these side effects are Agrawal et al. [1998] and Berchtold et al. [1998d].

Future Research Issues

Although significant progress has been made in understanding the nature of high-dimensional spaces and in developing techniques that can operate in these spaces, there still are many open questions. A first problem is that most of the understanding the research community has developed during the last years is restricted to the case of uniform and independent data. Not only are all proposed indexing techniques optimized for this case; almost all theoretical considerations, such as cost models, are also restricted to this simple case. The interesting observation is that index structures do not suffer from "real" data; rather, they nicely take advantage of nonuniform distributions. In fact, a uniform distribution seems to be the worst thing that can happen to an index structure. One reason for this effect is that the data are often located only in a subspace of the data space, and if the index adapts to this situation, it actually behaves as if the data were lower-dimensional.

A promising approach to understanding and explaining this effect theoretically has been followed in Faloutsos and Kamel [1994] and Böhm [1998], where the concept of the fractal dimension is applied. However, even this approach cannot cover "real" effects such as local skewness.

A second interesting research issue concerns the partitioning strategies that perform well in high-dimensional spaces. As previous research (e.g., on the Pyramid-tree) has shown, the partitioning does not have to be balanced to be optimal for certain queries. The open question is what an optimal partitioning schema for nearest-neighbor queries would be. Does it need to be balanced or rather unbalanced? Is it based upon bounding boxes or on pyramids?
How does the optimum change when the data set grows in size or dimensionality? There are many open questions that need to be answered.

A third open research issue is the approximate processing of nearest-neighbor queries. The first question is what a useful definition of approximate nearest-neighbor search in high-dimensional spaces is, and how the fuzziness introduced by that definition may be exploited for efficient query processing. A first approach to approximate nearest-neighbor search has been proposed in Gionis et al. [1999].

Other interesting research issues include the parallel processing of nearest-neighbor queries in high-dimensional space and the data mining and visualization of high-dimensional spaces. The parallel processing aims at finding appropriate declustering and query processing strategies to overcome the difficulties in high-dimensional spaces; a first approach in this direction has already been presented in Berchtold et al. [1997a]. The efforts in the area of data mining and visualization of high-dimensional feature spaces (for an example, see Hinneburg and Keim [1998]) try to understand and explore high-dimensional feature spaces. Also, the application of compression techniques to improve the query performance is an interesting and promising research area; a first approach, the VA-file, has recently been proposed in Weber et al. [1998].

8. CONCLUSIONS

Research in high-dimensional index structures has been very active and productive over the past few years, resulting in a multitude of interesting new approaches for indexing high-dimensional data. Since it is very difficult to keep track of this multitude of approaches, in this survey we have tried to provide insight into the effects occurring when indexing high-dimensional spaces and to give an overview of the principal ideas of the index structures that have been proposed to overcome these problems. There are still a number of interesting open research problems, and we expect the field to remain a fruitful research area over the next years. Due to the increasing importance of multimedia databases in various application areas and due to the remarkable results of the research, we also expect the research on high-dimensional indexing to have a major impact on many practical applications and commercial multimedia database systems.

APPENDIX

A. LEMMA 1

The RKV algorithm has a worst-case space complexity of O(log n).

PROOF. The only source of dynamic memory assignment in the RKV algorithm is the recursive invocation of the algorithm itself. The recursion depth is at most equal to the height of the index structure, and the height of all high-dimensional index structures presented in this survey is of complexity O(log n). Since a constant amount of memory (one data or directory page) is allocated in each call, Lemma 1 follows.

B. LEMMA 2

The HS algorithm has a space complexity of O(n) in the worst case.

PROOF. The following scenario describes the worst case. Query processing starts with the root in the APL. The root is replaced by its child nodes, which are on level h - 1, where h is the height of the index. All nodes on level h - 1 are replaced by their child nodes, and so on, until all data nodes are in the APL. At this state it is possible that no data page has been excluded from the APL, because no datapoint has been encountered yet.
The situation described above occurs, for example, if all data objects are located on a sphere around the query point. Then all data pages are in the APL, and the APL is maximal because the APL grows only by replacing a page with its descendants. If all data pages are in the APL, it has a length of O(n).
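For reference, the HS priority search that Lemmas 2, 4, and 5 analyze can be sketched as follows (a standard best-first formulation; the node layout with .points and .children attributes is our assumption, not a prescribed interface):

    import heapq
    import math

    def dist(q, p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))

    def hs_nearest_neighbor(root, q, mindist):
        """Best-first nearest-neighbor search (HS). The heap plays the role of
        the APL, ordered by MINDIST to q. A node carries either .points (data
        page, .children is None) or .children (directory page, .points is None)."""
        apl = [(0.0, 0, root)]
        tiebreak = 1
        cpc, cpc_dist = None, float('inf')   # closest point candidate
        while apl and apl[0][0] < cpc_dist:  # stop: lowest MINDIST >= cpc distance
            _, _, node = heapq.heappop(apl)
            if node.points is not None:      # data page: inspect its points
                for p in node.points:
                    d = dist(q, p)
                    if d < cpc_dist:
                        cpc, cpc_dist = p, d
            else:                            # directory page: replace by children
                for child in node.children:
                    heapq.heappush(apl, (mindist(q, child.region), tiebreak, child))
                    tiebreak += 1
        return cpc, cpc_dist

Lemma 4 is visible directly in the loop: each iteration pops the APL entry with minimum MINDIST, and the children pushed in its place can, by the lower-bounding property, only have equal or larger MINDIST.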
C. LEMMA 3

Let nndist be the distance between the query point and its nearest neighbor. All pages that intersect a sphere around the query point with radius nndist (the so-called nearest-neighbor sphere) must be accessed for query processing. This condition is necessary and sufficient.

PROOF.
1. Sufficiency: If all data pages intersecting the nn-sphere are accessed, then all points in the database with a distance less than or equal to nndist are known to the query processor. No point closer than the nearest known point can exist in the database.
2. Necessity: If a page region intersects the nearest-neighbor sphere but is not accessed during query processing, the corresponding subtree could include a point that is closer to the query point than the nearest-neighbor candidate. Therefore, accessing all intersecting pages is necessary.

D. LEMMA 4

The HS algorithm accesses pages in the order of increasing distance to the query point.

PROOF. Due to the lower-bounding property of page regions, the distance between the query point and a page region is always greater than or equal to the distance between the query point and the region of the parent of the page. Therefore, the minimum distance between the query point and any page in the APL can only increase or remain unchanged, never decrease, through the processing step of loading a page and replacing the corresponding APL entry. Since the active page with minimum distance is always accessed, the pages are accessed in the order of increasing distance to the query point.

E. LEMMA 5

The HS algorithm is optimal in terms of the number of page accesses.

PROOF. According to Lemma 4, the HS algorithm accesses pages in the order of increasing distance to the query point q. Let m be the lowest MINDIST in the APL. Processing stops if the distance of q to the cpc (closest point candidate) is less than m. Due to the lower-bounding property, processing of any page in the APL cannot encounter any point with a distance to q less than m, so the distance between the cpc and q cannot fall below m during processing. Therefore, exactly the pages with a MINDIST less than or equal to the nearest-neighbor distance are processed by the HS algorithm. According to Lemma 3, these pages must be loaded by any correct nearest-neighbor algorithm. Thus, the HS algorithm yields an optimal number of page accesses.

REFERENCES

ABEL, D. AND SMITH, J. 1983. A data structure and algorithm based on a linear key for a rectangle retrieval problem. Comput. Vis. 24, 1–13.
AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms, LNCS 730, 69–84.
AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering of high-dimensional data for data mining applications. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Seattle), 94–105.
AGRAWAL, R., LIN, K., SAWHNEY, H., AND SHIM, K. 1995. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proc. 21st Int. Conf. on Very Large Databases, 490–501.
ALTSCHUL, S., GISH, W., MILLER, W., MYERS, E., AND LIPMAN, D. 1990. A basic local alignment search tool. J. Molecular Biol. 215, 3, 403–410.
AOKI, P. 1998. Generalizing "search" in generalized search trees. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL), 380–389.
AREF, W. AND SAMET, H. 1991. Optimization strategies for spatial query processing. In Proc. 17th Int. Conf. on Very Large Databases (Barcelona), 81–90.
ARYA, S. 1995. Nearest neighbor searching and applications. PhD thesis, University of Maryland, College Park, MD.
ARYA, S., MOUNT, D., AND NARAYAN, O. 1995. Accounting for boundary effects in nearest neighbor searching. In Proc. 11th Symp. on Computational Geometry (Vancouver, Canada), 336–344.
BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proc. Combinatorial Pattern Matching, LNCS 807, 198–212.
BAYER, R. AND MCCREIGHT, E. 1977. Organization and maintenance of large ordered indices. Acta Inf. 1, 3, 173–189.
BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Atlantic City, NJ), 322–331.
BELUSSI, A. AND FALOUTSOS, C. 1995. Estimating the selectivity of spatial queries using the correlation fractal dimension. In Proc. 21st Int. Conf. on Very Large Databases (Zurich), 299–310.
BENTLEY, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509–517.
BENTLEY, J. 1979. Multidimensional binary search in database applications. IEEE Trans. Softw. Eng. 4, 5, 397–409.
BERCHTOLD, S. AND KEIM, D. 1998. High-dimensional index structures: Database support for next decade's applications. Tutorial, ACM SIGMOD Int. Conf. on Management of Data (Seattle, WA).
BERCHTOLD, S., BÖHM, C., BRAUNMÜLLER, B., KEIM, D., AND KRIEGEL, H.-P. 1997a. Fast parallel similarity search in multimedia databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
BERCHTOLD, S., BÖHM, C., JAGADISH, H., KRIEGEL, H.-P., AND SANDER, J. 2000a. Independent quantization: An index compression technique for high-dimensional data spaces. In Proc. 16th Int. Conf. on Data Engineering.
BERCHTOLD, S., BÖHM, C., KEIM, D., AND KRIEGEL, H.-P. 1997b. A cost model for nearest neighbor search in high-dimensional data space. In Proc. ACM PODS Symp. on Principles of Database Systems (Tucson, AZ).
BERCHTOLD, S., BÖHM, C., KEIM, D., AND KRIEGEL, H.-P. 2001. On optimizing processing of nearest neighbor queries in high-dimensional data space. In Proc. Conf. on Database Theory, 435–449.
BERCHTOLD, S., BÖHM, C., KEIM, D., KRIEGEL, H.-P., AND XU, X. 2000c. Optimal multidimensional query processing using tree striping. In Proc. DaWaK, 244–257.
BERCHTOLD, S., BÖHM, C., AND KRIEGEL, H.-P. 1998a. Improving the query performance of high-dimensional index structures using bulk-load operations. In Proc. 6th Int. Conf. on Extending Database Technology (Valencia, Spain).
BERCHTOLD, S., BÖHM, C., AND KRIEGEL, H.-P. 1998b. The pyramid-technique: Towards breaking the curse of dimensionality. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Seattle, WA), 142–153.
BERCHTOLD, S., ERTL, B., KEIM, D., KRIEGEL, H.-P., AND SEIDL, T. 1998c. Fast nearest neighbor search in high-dimensional spaces. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL).
BERCHTOLD, S., JAGADISH, H., AND ROSS, K. 1998d. Independence diagrams: A technique for visual data mining. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (New York), 139–143.
BERCHTOLD, S., KEIM, D., AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proc. 22nd Int. Conf. on Very Large Databases (Bombay), 28–39.
BERCHTOLD, S., KEIM, D., KRIEGEL, H.-P., AND SEIDL, T. 2000d. Indexing the solution space: A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. Knowl. Data Eng., 45–57.
BEYER, K., GOLDSTEIN, J., RAMAKRISHNAN, R., AND SHAFT, U. 1999. When is "nearest neighbor" meaningful? In Proc. Int. Conf. on Database Theory, 217–235.
BÖHM, C. 1998. Efficiently indexing high-dimensional databases. PhD thesis, University of Munich, Germany.
BÖHM, C. 2000. A cost model for query processing in high-dimensional data spaces. To appear in ACM Trans. Database Syst.
BOZKAYA, T. AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. SIGMOD Rec. 26, 2, 357–368.
BRIN, S. 1995. Near neighbor search in large metric spaces. In Proc. 21st Int. Conf. on Very Large Databases (Zurich), 574–584.
BURKHARD, W. AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230–236.
CHEUNG, K. AND FU, A. 1998. Enhanced nearest neighbour search on the R-tree. SIGMOD Rec. 27, 3, 16–21.
CHIUEH, T. 1994. Content-based image indexing. In Proc. 20th Int. Conf. on Very Large Databases (Chile), 582–593.
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proc. 23rd Int. Conf. on Very Large Databases (Athens), 426–435.
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost model for similarity queries in metric spaces. In Proc. 17th ACM Symp. on Principles of Database Systems (Seattle), 59–67.
CLEARY, J. 1979. Analysis of an algorithm for finding nearest neighbors in Euclidean space. ACM Trans. Math. Softw. 5, 2, 183–192.
COMER, D. 1979. The ubiquitous B-tree. ACM Comput. Surv. 11, 2, 121–138.
CORRAL, A., MANOLOPOULOS, Y., THEODORIDIS, Y., AND VASSILAKOPOULOS, M. 2000. Closest pair queries in spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 189–200.
EASTMAN, C. 1981. Optimal bucket size for nearest neighbor searching in kd-trees. Inf. Proc. Lett. 12, 4.
EVANGELIDIS, G. 1994. The hBπ-tree: A concurrent and recoverable multi-attribute index structure. PhD thesis, Northeastern University, Boston, MA.
EVANGELIDIS, G., LOMET, D., AND SALZBERG, B. 1997. The hBπ-tree: A multiattribute index supporting concurrency, recovery and node consolidation. VLDB J. 6, 1, 1–25.
FALOUTSOS, C. 1985. Multiattribute hashing using gray codes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 227–238.
FALOUTSOS, C. 1988. Gray codes for partial match and range queries. IEEE Trans. Softw. Eng. 14, 1381–1393.
FALOUTSOS, C. AND GAEDE, V. 1996. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proc. 22nd Int. Conf. on Very Large Databases (Mumbai, India), 40–50.
FALOUTSOS, C. AND KAMEL, I. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Minneapolis, MN), 4–13.
FALOUTSOS, C. AND LIN, K.-I. 1995. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proc. ACM SIGMOD Int. Conf. on Management of Data (San Jose, CA), 163–174.
FALOUTSOS, C. AND ROSEMAN, S. 1989. Fractals for secondary key retrieval. In Proc. 8th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 247–252.
FALOUTSOS, C., BARBER, R., FLICKNER, M., AND HAFNER, J. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 231–262.
FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994b. Fast subsequence matching in time-series databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 419–429.
FALOUTSOS, C., SELLIS, T., AND ROUSSOPOULOS, N. 1987. Analysis of object-oriented spatial access methods. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
FINKEL, R. AND BENTLEY, J. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Inf. 4, 1, 1–9.
FREESTON, M. 1987. The BANG file: A new kind of grid file. In Proc. ACM SIGMOD Int. Conf. on Management of Data (San Francisco), 260–269.
FRIEDMAN, J., BENTLEY, J., AND FINKEL, R. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3, 3, 209–226.
GAEDE, V. 1995. Optimal redundancy in spatial database systems. In Proc. 4th Int. Symp. on Advances in Spatial Databases (Portland, ME), 96–116.
GAEDE, V. AND GÜNTHER, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170–231.
GIONIS, A., INDYK, P., AND MOTWANI, R. 1999. Similarity search in high dimensions via hashing. In Proc. 25th Int. Conf. on Very Large Databases (Edinburgh), 518–529.
GREENE, D. 1989. An implementation and performance analysis of spatial data access methods. In Proc. 5th IEEE Int. Conf. on Data Engineering.
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Boston), 47–57.
HELLERSTEIN, J., KOUTSOUPIAS, E., AND PAPADIMITRIOU, C. 1997. On the analysis of indexing schemes. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Tucson, AZ), 249–256.
HELLERSTEIN, J., NAUGHTON, J., AND PFEFFER, A. 1995. Generalized search trees for database systems. In Proc. 21st Int. Conf. on Very Large Databases (Zurich), 562–573.
HENRICH, A. 1994. A distance-scan algorithm for spatial access structures. In Proc. 2nd ACM Workshop on Advances in Geographic Information Systems (Gaithersburg, MD), 136–143.
HENRICH, A. 1998. The LSDh-tree: An access structure for feature vectors. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL).
HENRICH, A., SIX, H.-W., AND WIDMAYER, P. 1989. The LSD-tree: Spatial access to multidimensional point and non-point objects. In Proc. 15th Int. Conf. on Very Large Databases (Amsterdam, The Netherlands), 45–53.
HINNEBURG, A. AND KEIM, D. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proc. Int. Conf. on Knowledge Discovery in Databases (New York).
HINRICHS, K. 1985. Implementation of the grid file: Design concepts and experience. BIT 25, 569–592.
HJALTASON, G. AND SAMET, H. 1995. Ranking in spatial databases. In Proc. 4th Int. Symp. on Large Spatial Databases (Portland, ME), 83–95.
HJALTASON, G. AND SAMET, H. 1998. Incremental distance join algorithms for spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 237–248.
HUTFLESZ, A., SIX, H.-W., AND WIDMAYER, P. 1988a. Globally order preserving multidimensional linear hashing. In Proc. 4th IEEE Int. Conf. on Data Engineering, 572–579.
HUTFLESZ, A., SIX, H.-W., AND WIDMAYER, P. 1988b. Twin grid files: Space optimizing access schemes. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
JAGADISH, H. 1990. Linear clustering of objects with multiple attributes. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Atlantic City, NJ), 332–342.
JAGADISH, H. 1991. A retrieval technique for similar shapes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 208–217.
JAIN, R. AND WHITE, D. 1996. Similarity indexing: Algorithms and performance. In Proc. SPIE Storage and Retrieval for Image and Video Databases IV (San Jose, CA), 62–75.
KAMEL, I. AND FALOUTSOS, C. 1992. Parallel R-trees. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 195–204.
KAMEL, I. AND FALOUTSOS, C. 1993. On packing R-trees. In Proc. CIKM, 490–499.
KAMEL, I. AND FALOUTSOS, C. 1994. Hilbert R-tree: An improved R-tree using fractals. In Proc. 20th Int. Conf. on Very Large Databases, 500–509.
KATAYAMA, N. AND SATOH, S. 1997. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 369–380.
KNUTH, D. 1975. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA.
KORN, F. AND MUTHUKRISHNAN, S. 2000. Influence sets based on reverse nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 201–212.
KORN, F., SIDIROPOULOS, N., FALOUTSOS, C., SIEGEL, E., AND PROTOPAPAS, Z. 1996. Fast nearest neighbor search in medical image databases. In Proc. 22nd Int. Conf. on Very Large Databases (Mumbai, India), 215–226.
KORNACKER, M. 1999. High-performance generalized search trees. In Proc. 25th Int. Conf. on Very Large Databases (Edinburgh).
KRIEGEL, H.-P. AND SEEGER, B. 1986. Multidimensional order preserving linear hashing with partial expansions. In Proc. Int. Conf. on Database Theory, LNCS 243, Springer-Verlag, New York.
KRIEGEL, H.-P. AND SEEGER, B. 1987. Multidimensional dynamic quantile hashing is very efficient for non-uniform record distributions. In Proc. 3rd Int. Conf. on Data Engineering, 10–17.
KRIEGEL, H.-P. AND SEEGER, B. 1988. PLOP-hashing: A grid file without directory. In Proc. 4th Int. Conf. on Data Engineering, 369–376.
KRISHNAMURTHY, R. AND WHANG, K.-Y. 1985. Multilevel grid files. IBM Research Center Report, Yorktown Heights, NY.
KUKICH, K. 1992. Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 4, 377–440.
LIN, K., JAGADISH, H., AND FALOUTSOS, C. 1995. The TV-tree: An index structure for high-dimensional data. VLDB J. 3, 517–542.
LOMET, D. AND SALZBERG, B. 1989. The hB-tree: A robust multiattribute search structure. In Proc. 5th IEEE Int. Conf. on Data Engineering, 296–304.
LOMET, D. AND SALZBERG, B. 1990. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst. 15, 4, 625–658.
MANDELBROT, B. 1977. Fractal Geometry of Nature. W.H. Freeman, New York.
MEHROTRA, R. AND GARY, J. 1993. Feature-based retrieval of similar shapes. In Proc. 9th Int. Conf. on Data Engineering.
MEHROTRA, R. AND GARY, J. 1995. Feature-index-based similar shape retrieval. In Proc. 3rd Working Conf. on Visual Database Systems.
MORTON, G. 1966. A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. IBM Ltd., Ottawa, Canada.
MUMFORD, D. 1987. The problem of robust shape descriptors. In Proc. 1st IEEE Int. Conf. on Computer Vision.
NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38–71.
ORENSTEIN, J. 1990. A comparison of spatial query processing techniques for native and parameter spaces. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 326–336.
ORENSTEIN, J. AND MERRET, T. 1984. A class of data structures for associative searching. In Proc. 3rd ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 181–190.
OTOO, E. 1984. A mapping function for the directory of a multidimensional extendible hashing. In Proc. 10th Int. Conf. on Very Large Databases, 493–506.
OUKSEL, M. 1985. The interpolation based grid file. In Proc. 4th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 20–27.
OUKSEL, M. AND MAYES, O. 1992. The nested interpolation-based grid file. Acta Informatica 29, 335–373.
PAGEL, B.-U., SIX, H.-W., TOBEN, H., AND WIDMAYER, P. 1993. Towards an analysis of range query performance in spatial data structures. In Proc. 12th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Washington, DC), 214–221.
PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997a. Nearest neighbor queries in shared-nothing environments. Geoinf. 1, 1, 1–26.
PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997b. Performance of nearest neighbor queries in R-trees. In Proc. 6th Int. Conf. on Database Theory, LNCS 1186, Springer-Verlag, New York, 394–408.
PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1998. Similarity query processing using disk arrays. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
RIEDEL, E., GIBSON, G., AND FALOUTSOS, C. 1998. Active storage for large-scale data mining and multimedia. In Proc. 24th Int. Conf. on Very Large Databases, 62–73.
ROBINSON, J. 1981. The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 10–18.
ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 71–79.
SAGAN, H. 1994. Space Filling Curves. Springer-Verlag, New York.
SCHRÖDER, M. 1991. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman, New York.
SEEGER, B. AND KRIEGEL, H.-P. 1990. The buddy tree: An efficient and robust access method for spatial data base systems. In Proc. 16th Int. Conf. on Very Large Databases (Brisbane), 590–601.
SEIDL, T. 1997. Adaptable similarity search in 3-D spatial database systems. PhD thesis, University of Munich, Germany.
SEIDL, T. AND KRIEGEL, H.-P. 1997. Efficient user-adaptable similarity search in large multimedia databases. In Proc. 23rd Int. Conf. on Very Large Databases (Athens).
SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R+-tree: A dynamic index for multidimensional objects. In Proc. 13th Int. Conf. on Very Large Databases (Brighton, GB), 507–518.
SAWHNEY, H. AND HAFNER, J. 1994. Efficient color histogram indexing. In Proc. Int. Conf. on Image Processing, 66–70.
SHOICHET, B., BODIAN, D., AND KUNTZ, I. 1992. Molecular docking using shape descriptors. J. Comput. Chem. 13, 3, 380–397.
SPROULL, R. 1991. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 579–589.
STANOI, I., AGRAWAL, D., AND ABBADI, A. 2000. Reverse nearest neighbor queries for dynamic databases. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 44–53.
STONEBRAKER, M., SELLIS, T., AND HANSON, E. 1986. An analysis of rule indexing implementations in data base systems. In Proc. Int. Conf. on Expert Database Systems.
THEODORIDIS, Y. AND SELLIS, T. 1996. A model for the prediction of R-tree performance. In Proc. 15th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Montreal), 161–171.
UHLMANN, J. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Proc. Lett., 145–157.
VAN DEN BERCKEN, J., SEEGER, B., AND WIDMAYER, P. 1997. A general approach to bulk loading multidimensional index structures. In Proc. 23rd Int. Conf. on Very Large Databases (Athens).
WALLACE, T. AND WINTZ, P. 1980. An efficient three-dimensional aircraft recognition algorithm using normalized Fourier descriptors. Comput. Graph. Image Proc. 13, 99–126.
WEBER, R., SCHEK, H.-J., AND BLOTT, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th Int. Conf. on Very Large Databases (New York).
WHITE, D. AND JAIN, R. 1996. Similarity indexing with the SS-tree. In Proc. 12th Int. Conf. on Data Engineering (New Orleans).
YAO, A. AND YAO, F. 1985. A general approach to d-dimensional geometric queries. In Proc. ACM Symp. on Theory of Computing.
YIANILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, 311–321.
YIANILOS, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In Proc. DIMACS Implementation Challenge (Baltimore, MD).
Received August 1998; revised March 2000; accepted November 2000